In [20]:
import pandas as pd

df = pd.read_csv("ads.csv")

## Subsetting by group

- It is very common to select only parts of a dataset for analysis. 
- Examples?

## Subsetting by group in pandas

- Pandas gives you a few ways to do this. 
- The one we will focus on is called "boolean indexing"
- For those coming from SQL, check out [query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html), which works in a similar way
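
As a rough sketch of how the two approaches compare, the cell below filters the same rows once with a Boolean mask and once with `DataFrame.query`. The `toy` frame is a made-up stand-in for `ads.csv`, not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for ads.csv
toy = pd.DataFrame({
    "ad": ["ingredients", "cost", "cost", "ingredients"],
    "click": [True, True, False, True],
})

# Boolean indexing: build a True/False mask, then pass it to the frame
by_mask = toy[toy["ad"] == "cost"]

# DataFrame.query: write the same filter as a string expression
by_query = toy.query("ad == 'cost'")

# Both approaches keep the same rows
print(by_mask.equals(by_query))
```

Which one to use is mostly a matter of taste; the mask version is plain Python, while the string version may read more naturally to SQL users.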

## Selecting a column 

- Recall that square brackets with a column name, e.g. `df["ad"]`, select a single column (aka a "Series"). The full DataFrame looks like this:

In [15]:
df

Unnamed: 0,ad,click
0,ingredients,True
1,ingredients,True
2,cost,True
3,ingredients,True
4,ingredients,True
...,...,...
9995,cost,True
9996,cost,True
9997,cost,False
9998,ingredients,True
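
A small sketch of column selection, using a made-up `toy` frame rather than the real ads data: a single column name in brackets gives back a Series, while a list of names gives back a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for ads.csv
toy = pd.DataFrame({
    "ad": ["ingredients", "cost", "cost"],
    "click": [True, True, False],
})

ad_column = toy["ad"]    # one column name -> pandas Series
ad_frame = toy[["ad"]]   # list of column names -> DataFrame with one column

print(type(ad_column).__name__)
print(type(ad_frame).__name__)
```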


## From selection to Boolean indexing

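- Comparing a column to a value returns a Series of True/False values, one per row
- That True/False Series is what Boolean indexing uses to decide which rows to keep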

In [7]:
df["ad"] == "cost"

0       False
1       False
2        True
3       False
4       False
        ...  
9995     True
9996     True
9997     True
9998    False
9999     True
Name: ad, Length: 10000, dtype: bool

What is that? 

In [24]:
ix = df["ad"] == "cost"
type(ix)

pandas.core.series.Series
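
Because the mask is an ordinary Series with `bool` dtype, it can be summarized like any other Series. The sketch below, on a made-up `toy` frame, counts the `True` values and computes their share.

```python
import pandas as pd

# Hypothetical toy data standing in for ads.csv
toy = pd.DataFrame({"ad": ["ingredients", "cost", "cost", "ingredients"]})

ix = toy["ad"] == "cost"

print(ix.dtype)                 # the mask holds True/False values
n_cost = int(ix.sum())          # True counts as 1, so sum() counts matching rows
share_cost = float(ix.mean())   # mean() gives the fraction of matching rows
print(n_cost, share_cost)
```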

What is this doing?

In [27]:
df[ix]

Unnamed: 0.1,Unnamed: 0,ad,click,age
4,4,cost,True,
7,7,cost,False,
9,9,cost,True,57.0
12,12,cost,False,49.0
14,14,cost,True,
...,...,...,...,...
995,995,cost,False,33.0
996,996,cost,False,
997,997,cost,True,
998,998,cost,False,64.0


In [17]:
## another way to write it 

df[df["ad"] == "cost"]

Unnamed: 0,ad,click
2,cost,True
6,cost,False
7,cost,False
10,cost,False
11,cost,False
...,...,...
9994,cost,False
9995,cost,True
9996,cost,True
9997,cost,False


### Where is the for loop?
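
One way to think about it: the comparison is applied to every row for you. The sketch below, on a made-up `toy` frame, writes out the loop by hand and checks that it produces the same True/False values as the one-line pandas comparison.

```python
import pandas as pd

# Hypothetical toy data standing in for ads.csv
toy = pd.DataFrame({"ad": ["ingredients", "cost", "cost", "ingredients"]})

# The explicit loop: test each value one at a time
loop_mask = []
for value in toy["ad"]:
    loop_mask.append(value == "cost")

# The pandas version: the comparison runs over the whole column at once
vectorized_mask = toy["ad"] == "cost"

print(loop_mask == vectorized_mask.tolist())
```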

## Combining Boolean indexes

In [21]:
# What do you notice about this syntax?

df[(df["ad"] == "cost") & (df["age"] < 30)]

Unnamed: 0.1,Unnamed: 0,ad,click,age
35,35,cost,False,18.0
50,50,cost,False,19.0
68,68,cost,True,26.0
91,91,cost,False,22.0
97,97,cost,False,21.0
...,...,...,...,...
913,913,cost,True,24.0
935,935,cost,True,27.0
957,957,cost,False,20.0
971,971,cost,False,17.0
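
A short sketch of the combining rules, on a made-up `toy` frame: masks are combined with `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses when written inline. Python's plain `and` does not work on a Series and raises a `ValueError`.

```python
import pandas as pd

# Hypothetical toy data; None becomes a missing value (NaN) in the age column
toy = pd.DataFrame({
    "ad": ["cost", "cost", "ingredients", "cost"],
    "age": [25.0, 41.0, 19.0, None],
})

is_cost = toy["ad"] == "cost"
is_young = toy["age"] < 30      # a missing age compares as False

both = toy[is_cost & is_young]      # rows where both conditions hold
either = toy[is_cost | is_young]    # rows where at least one holds
not_cost = toy[~is_cost]            # rows where the condition does not hold

# Plain `and` asks for a single True/False, which a whole Series cannot give
and_error = None
try:
    toy[is_cost and is_young]
except ValueError as err:
    and_error = str(err)
print(and_error)
```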


### Compute the mean age for those who clicked on the cost ad

- Do you notice any problems? Welcome to data analysis 😀

In [76]:
# Keep rows for the cost ad that were clicked and have a recorded age
dfOne = df[(df["ad"] == "cost") &
           (df["click"] == True) & (df["age"].notnull())]
dfOne["age"].mean()

45.52054794520548

## Possible fixes

- [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
- [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
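
To make the trade-off concrete, here is a minimal sketch on a made-up `toy` frame with one missing age. Dropping the row and filling it with a placeholder can give different means, so the choice of fix matters; the fill value of 0 is only an illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"age": [20.0, None, 40.0]})

# dropna: remove rows where age is missing
dropped = toy.dropna(subset=["age"])
mean_dropped = dropped["age"].mean()

# fillna: replace the missing age with a chosen value (0 here, for illustration)
filled_zero = toy.fillna({"age": 0})
mean_filled_zero = filled_zero["age"].mean()

# mean() skips missing values by default, so it matches the dropna result
mean_default = toy["age"].mean()

print(mean_dropped, mean_filled_zero, mean_default)
```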

In [91]:
import pandas as pd
df = pd.read_csv("ClassData.csv")

dfOne = df.dropna()
dfTwo = dfOne.astype({"Shoe size" : "double"})
dfTwo["Shoe size"].mean()

10.633333333333333