In [1]:
# Comparison with SQL
# Found here:
# https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html

import pandas as pd
import numpy as np

In [2]:
url = (
    "https://raw.githubusercontent.com/pandas-dev"
    "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)

In [3]:
tips = pd.read_csv(url)

In [4]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


## SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
```
SELECT total_bill, tip, smoker, time
FROM tips;
```

In [6]:
tips[["total_bill", "tip", "smoker", "time"]]

Unnamed: 0,total_bill,tip,smoker,time
0,16.99,1.01,No,Dinner
1,10.34,1.66,No,Dinner
2,21.01,3.50,No,Dinner
3,23.68,3.31,No,Dinner
4,24.59,3.61,No,Dinner
...,...,...,...,...
239,29.03,5.92,No,Dinner
240,27.18,2.00,Yes,Dinner
241,22.67,2.00,Yes,Dinner
242,17.82,1.75,No,Dinner
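
To make the selection rules concrete, here is a minimal sketch on a small hypothetical frame (`demo` and its values are illustrative, not the real tips data). It shows that a list of column names returns a DataFrame, a single name returns a Series, and the unindexed frame plays the role of `SELECT *`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the tips table (illustrative values).
demo = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "tip": [1.01, 1.66, 3.50],
        "smoker": ["No", "No", "Yes"],
    }
)

subset = demo[["total_bill", "tip"]]  # list of names -> DataFrame (SELECT total_bill, tip)
single = demo["tip"]                  # single name   -> Series
everything = demo                     # no selection  -> all columns (SELECT *)

print(type(subset).__name__, type(single).__name__, everything.shape)
```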


In SQL, you can add a calculated column:
```
SELECT *, tip/total_bill as tip_rate
FROM tips;
```
With pandas, you can use the DataFrame.assign() method to append a new column:

In [7]:
tips.assign(tip_rate=tips["tip"] / tips["total_bill"])

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_rate
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.50,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.139780
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.203927
240,27.18,2.00,Female,Yes,Sat,Dinner,2,0.073584
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.088222
242,17.82,1.75,Male,No,Sat,Dinner,2,0.098204
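
Note that assign() returns a new DataFrame and leaves the original untouched, much as a SELECT with a computed column does not alter the table. A minimal sketch, assuming a small hypothetical frame named `demo`:

```python
import pandas as pd

# Hypothetical two-row frame; the values are illustrative only.
demo = pd.DataFrame({"total_bill": [20.0, 50.0], "tip": [3.0, 10.0]})

# assign() builds a copy with the extra column.
with_rate = demo.assign(tip_rate=demo["tip"] / demo["total_bill"])

print(list(demo.columns))       # the original frame has no tip_rate column
print(list(with_rate.columns))  # the returned frame does
```

To keep the column, bind the result to a name, as `with_rate` does here.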


## WHERE
Filtering in SQL is done via a WHERE clause.
```
SELECT *
FROM tips
WHERE time = 'Dinner';
```
DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.

In [8]:
tips[tips["total_bill"] > 10]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [10]:
is_dinner = tips["time"] == "Dinner"
is_dinner

0      True
1      True
2      True
3      True
4      True
       ... 
239    True
240    True
241    True
242    True
243    True
Name: time, Length: 244, dtype: bool

In [11]:
is_dinner.value_counts()

True     176
False     68
Name: time, dtype: int64

In [12]:
tips[is_dinner]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).

Tips of more than $5 at Dinner meals:
```
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
```

In [13]:
tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
23,39.42,7.58,Male,No,Sat,Dinner,4
44,30.4,5.6,Male,No,Sun,Dinner,4
47,32.4,6.0,Male,No,Sun,Dinner,4
52,34.81,5.2,Female,No,Sun,Dinner,4
59,48.27,6.73,Male,No,Sat,Dinner,4
116,29.93,5.07,Male,No,Sun,Dinner,4
155,29.85,5.14,Female,No,Sun,Dinner,5
170,50.81,10.0,Male,Yes,Sat,Dinner,3
172,7.25,5.15,Male,Yes,Sun,Dinner,2
181,23.33,5.65,Male,Yes,Sun,Dinner,2


Tips by parties of at least 5 diners OR where the bill total was more than $45:
```
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
```

In [14]:
tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
59,48.27,6.73,Male,No,Sat,Dinner,4
125,29.8,4.2,Female,No,Thur,Lunch,6
141,34.3,6.7,Male,No,Thur,Lunch,6
142,41.19,5.0,Male,No,Thur,Lunch,5
143,27.05,5.0,Female,No,Thur,Lunch,6
155,29.85,5.14,Female,No,Sun,Dinner,5
156,48.17,5.0,Male,No,Sun,Dinner,6
170,50.81,10.0,Male,Yes,Sat,Dinner,3
182,45.35,3.5,Male,Yes,Sun,Dinner,3
185,20.69,5.0,Male,No,Sun,Dinner,5


NULL checking is done using the notna() and isna() methods.

In [15]:
frame = pd.DataFrame(
    {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
)

frame

Unnamed: 0,col1,col2
0,A,F
1,B,
2,,G
3,C,H
4,D,I


Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
```
SELECT *
FROM frame
WHERE col2 IS NULL;
```

In [16]:
frame[frame["col2"].isna()]

Unnamed: 0,col1,col2
1,B,


Getting items where col1 IS NOT NULL can be done with notna().
```
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
```

In [17]:
frame[frame["col1"].notna()]

Unnamed: 0,col1,col2
0,A,F
1,B,
3,C,H
4,D,I
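
Just as `col2 = NULL` never matches in SQL, comparing a column to NaN with == never matches in pandas, which is why isna() and notna() exist. A minimal sketch that rebuilds the same `frame` as above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
)

# NaN is not equal to anything, including itself, so == finds no rows.
eq_matches = int((frame["col2"] == np.nan).sum())

# isna() is the reliable test for missing values.
isna_matches = int(frame["col2"].isna().sum())

print(eq_matches, isna_matches)  # 0 1
```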


## GROUP BY
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.

A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
```
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/
```
The pandas equivalent would be:
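
One way to express this count is groupby() followed by size(). The sketch below uses a small hypothetical frame named `demo` rather than the real tips data, so the counts differ from the SQL output above. It also contrasts size(), which counts rows like `count(*)`, with count(), which counts only non-null values in a column.

```python
import pandas as pd

# Hypothetical five-row frame; one tip is missing on purpose.
demo = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female", "Male"],
        "tip": [1.01, 1.66, 3.50, None, 2.00],
    }
)

# Rows per group, the analogue of SELECT sex, count(*) ... GROUP BY sex.
per_sex = demo.groupby("sex").size()

# Non-null tips per group, the analogue of count(tip).
non_null_tips = demo.groupby("sex")["tip"].count()

print(per_sex)
print(non_null_tips)
```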
