# UBC
## Programming in Python for DS
### Week 2
Instructor: Socorro Dominguez-Vidana

### Module 2

Overview:

- [ ] Show how to use `.loc` and `.iloc`.
- [ ] Demonstrate how to rename columns of a dataframe using `.rename()`.
- [ ] Create new columns or overwrite existing columns in a dataframe using `.assign()`.
- [ ] Drop columns in a dataframe using `.drop()`.
- [ ] Use `df[]` to filter rows of a dataframe.
- [ ] Calculate summary statistics on grouped objects using `.groupby()` and `.agg()`.


In [1]:
import pandas as pd

In [2]:
data = {'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
        'Max Speed': [380., 370., 24., 26.],
        'Min Speed': [120., 110., 5., 6.]}
data

{'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
 'Max Speed': [380.0, 370.0, 24.0, 26.0],
 'Min Speed': [120.0, 110.0, 5.0, 6.0]}

In [3]:
type(data)

dict

In [4]:
df = pd.DataFrame(data)

In [5]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [6]:
type(df)

pandas.core.frame.DataFrame

In [7]:
pd.DataFrame(df)

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


`Tip!` Whenever you see a *new* function, look at its documentation. The easiest way (for me) is to search in the browser: type something like "pandas DataFrame documentation" and the first result will usually be your best friend.

#### Methods and Attributes (The only thing you might need to know about OOP for now)

In [8]:
df.describe()

Unnamed: 0,Max Speed,Min Speed
count,4.0,4.0
mean,200.0,60.25
std,202.115479,63.352848
min,24.0,5.0
25%,25.5,5.75
50%,198.0,58.0
75%,372.5,112.5
max,380.0,120.0


In [9]:
df.describe(include='all')

Unnamed: 0,Animal,Max Speed,Min Speed
count,4,4.0,4.0
unique,2,,
top,Falcon,,
freq,2,,
mean,,200.0,60.25
std,,202.115479,63.352848
min,,24.0,5.0
25%,,25.5,5.75
50%,,198.0,58.0
75%,,372.5,112.5
max,,380.0,120.0


In [10]:
df_desc = df.describe()
df_desc.loc['count', 'Max Speed']

4.0

In [11]:
df_desc.shape

(8, 2)

Any `pandas.DataFrame` has:
- Attributes:
    - `.shape`
    - `.columns`
    - `.dtypes`
    - `.index`
    - `.size`
    - etc...
- Methods:
    - `.groupby()`
    - `.sort_values()`
    - `.rename()`
    - etc...
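
As a small sketch of the difference, using the same toy data as earlier: attributes are read without parentheses, while methods are called with parentheses. The names `n_rows`, `n_cols`, `sorted_df` and `not_called` are just illustrative.

```python
import pandas as pd

# Same toy data as earlier in the notebook
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.],
                   'Min Speed': [120., 110., 5., 6.]})

# Attribute: stored information, read without parentheses
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Method: an action, called with parentheses (and optional arguments)
sorted_df = df.sort_values(by='Max Speed')
print(sorted_df)

# Forgetting the parentheses gives you the method itself, not its result
not_called = df.describe
print(callable(not_called))
```

If a cell prints something like `<bound method ...>` instead of a table, missing parentheses are a likely cause.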

#### `.loc`

Select rows and columns by label.

The rule:

`df.loc[row, start:end:step]` or `df.loc[row, [labels]]`

Label slices include both `start` and `end`. For all rows, use `:`

In [12]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [13]:
df.loc[1, 'Animal':'Min Speed':2]

Animal       Falcon
Min Speed     110.0
Name: 1, dtype: object

In [14]:
df.loc[:, ['Animal', 'Min Speed']]

Unnamed: 0,Animal,Min Speed
0,Falcon,120.0
1,Falcon,110.0
2,Parrot,5.0
3,Parrot,6.0


In [15]:
df.loc[2, 'Animal']

'Parrot'

In [16]:
df.loc[[0,2], ['Animal', 'Min Speed']]

Unnamed: 0,Animal,Min Speed
0,Falcon,120.0
2,Parrot,5.0


#### `.iloc`

Same as `.loc`, but it selects by integer position rather than by label. Position slices exclude the end, as in regular Python slicing.
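
A minimal sketch of how slicing differs between the two. The string index labels `'a'` to `'d'` are an assumption made only for this example, so that labels and positions are easy to tell apart.

```python
import pandas as pd

# Toy data with hypothetical string labels as the index
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.],
                   'Min Speed': [120., 110., 5., 6.]},
                  index=['a', 'b', 'c', 'd'])

# .loc slices by label and includes BOTH endpoints
by_label = df.loc['a':'c', 'Animal':'Max Speed']
print(by_label.shape)

# .iloc slices by position and excludes the end, like Python lists
by_position = df.iloc[0:2, 0:2]
print(by_position.shape)
```

With the default index `0, 1, 2, ...` the labels and the positions happen to coincide, which is why the two accessors can look interchangeable in this notebook.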

In [17]:
df.iloc[[1,2], [1,2]]

Unnamed: 0,Max Speed,Min Speed
1,370.0,110.0
2,24.0,5.0


In [18]:
df.iloc[:, 1]

0    380.0
1    370.0
2     24.0
3     26.0
Name: Max Speed, dtype: float64

#### Creating new columns or Overwriting columns with `.assign()`

with `.loc[]`

In [19]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [20]:
df.assign(speed_diff = df.loc[:, 'Max Speed']-df.loc[:, 'Min Speed'])

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


Keep in mind you are not saving anything here - yet.

In [21]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [22]:
df = df.assign(speed_diff = df.loc[:, 'Max Speed']-df.loc[:, 'Min Speed'])

In [23]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


without `.loc[]`

In [24]:
df = df.assign(speed_diff = df['Max Speed']-df['Min Speed'])
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


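Cases where `.assign()` will not work:

```python
# SyntaxError: a keyword argument name must be a valid Python identifier
df = df.assign('Diff Speed' = df['Max Speed']-df['Min Speed'])
```
White space in the column name: in these cases, `.assign()` cannot be called with the column name as a keyword.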
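
If you still prefer `.assign()` for such a name, one possible workaround (a sketch, not something the course requires) is to pass the column through dictionary unpacking, since dictionary keys can be any string.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.],
                   'Min Speed': [120., 110., 5., 6.]})

# ** unpacks the dictionary into keyword arguments,
# so the key may contain spaces
df = df.assign(**{'Diff Speed': df['Max Speed'] - df['Min Speed']})
print(df)
```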

without `.assign()`

In [25]:
df['Speed Diff'] = df['Max Speed'] - df['Min Speed']

```python
df['newCol'] = df['col1'] - df['col2']
```

In [26]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff
0,Falcon,380.0,120.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0


In [27]:
df_incorrect = df['Max Speed'] - df['Min Speed']

In [28]:
df_incorrect

0    260.0
1    260.0
2     19.0
3     20.0
dtype: float64

In [29]:
df['Speed Diff2'] = df['Max Speed'] - df['Min Speed']
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff,Speed Diff2
0,Falcon,380.0,120.0,260.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0,20.0


**Be careful,** it is not the same to do:
```python
df = df['Max Speed']-df['Min Speed']
```
vs.
```python
df['Speed Diff'] = df['Max Speed']-df['Min Speed']
```

In the first one, you overwrite the **whole** data frame with a single series. In the second one, you add or modify only that **column**.
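
A quick way to check which of the two you ended up with is to look at the type of the result. This is only a sketch, and the variable name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Max Speed': [380., 370., 24., 26.],
                   'Min Speed': [120., 110., 5., 6.]})

# The subtraction on its own produces a Series
result = df['Max Speed'] - df['Min Speed']
print(type(result).__name__)

# Assigning it to a column keeps df as a DataFrame with one more column
df['Speed Diff'] = result
print(type(df).__name__, list(df.columns))
```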

#### `.drop()`

In [30]:
# Option 1
#df.drop(columns = 'Max Speed', inplace = True)
# Option 2
#df = df.drop(columns = 'Max Speed')

In [31]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff,Speed Diff2
0,Falcon,380.0,120.0,260.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0,20.0


In [32]:
df.drop(columns = ['Speed Diff2'])

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff
0,Falcon,380.0,120.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0


In [33]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff,Speed Diff2
0,Falcon,380.0,120.0,260.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0,20.0


In [34]:
df.drop(columns = 'Speed Diff2', inplace=True)

In [35]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speed_diff,Speed Diff
0,Falcon,380.0,120.0,260.0,260.0
1,Falcon,370.0,110.0,260.0,260.0
2,Parrot,24.0,5.0,19.0,19.0
3,Parrot,26.0,6.0,20.0,20.0


**Option 2**
You don't need to use `inplace = True`.
You can also reassign the data frame, i.e.:

```python
df = df.drop(columns = 'speed_diff')
```

In [36]:
df = df.drop(columns = 'speed_diff')
df

Unnamed: 0,Animal,Max Speed,Min Speed,Speed Diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


#### Renaming with `.rename()`

In [37]:
df = df.rename(columns = {'Max Speed':'max_speed', 
                          'Min Speed':'min_speed'})

In [38]:
df

Unnamed: 0,Animal,max_speed,min_speed,Speed Diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


#### Renaming using the attribute `.columns`

In [39]:
df.columns

Index(['Animal', 'max_speed', 'min_speed', 'Speed Diff'], dtype='object')

In [40]:
df.columns = ['_animal', '_max_speed', '_min_speed', '_speed_diff']

In [41]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0
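
One thing to keep in mind with `.columns`: you have to supply a name for every column, in order. A small sketch of what happens otherwise, compared with `.rename()`, which only touches the columns you list (the variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Parrot'],
                   'Max Speed': [380., 24.]})

# Assigning to .columns needs exactly one name per column
try:
    df.columns = ['animal']          # one name for two columns
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print(mismatch_error)

# .rename() changes only the columns named in the mapping
renamed = df.rename(columns={'Max Speed': 'max_speed'})
print(list(renamed.columns))
```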


#### Filtering rows with `df[]`

In [42]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


In [43]:
df['_max_speed'] > 300

0     True
1     True
2    False
3    False
Name: _max_speed, dtype: bool

In [44]:
df[df['_max_speed'] > 300]

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0


```python
df[(_____) & (_____)]
```

In [45]:
df2 = df[(df['_max_speed'] > 320) & (df['_max_speed'] < 380)]

In [46]:
df2

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
1,Falcon,370.0,110.0,260.0
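
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses, and the Python keywords `and` / `or` raise a `ValueError` on a Series. A short sketch with the same columns; the names `fast`, `small_gap`, `either` and `not_fast` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   '_max_speed': [380., 370., 24., 26.],
                   '_min_speed': [120., 110., 5., 6.],
                   '_speed_diff': [260., 260., 19., 20.]})

# Boolean Series can be stored and reused
fast = df['_max_speed'] > 300
small_gap = df['_speed_diff'] < 20

# | keeps rows where at least one condition is True
either = df[fast | small_gap]
print(either)

# ~ flips True and False
not_fast = df[~fast]
print(not_fast)
```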


#### `.groupby()`

In [47]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
2,Parrot,24.0,5.0,19.0
3,Parrot,26.0,6.0,20.0


In [48]:
df.groupby('_animal')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11690a290>

**What does that even mean??**


![img](https://miro.medium.com/max/1206/0*5Zzcwe-rlxz-EQ_N)

In [49]:
df[df['_animal'] == 'Falcon']

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,380.0,120.0,260.0
1,Falcon,370.0,110.0,260.0
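
One way to peek inside a grouped object is to loop over it. As a rough sketch, each iteration yields the group's key and the sub-dataframe holding that group's rows, much like the filter on `'Falcon'` does for a single group. `group_sizes` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   '_max_speed': [380., 370., 24., 26.],
                   '_min_speed': [120., 110., 5., 6.],
                   '_speed_diff': [260., 260., 19., 20.]})

group_sizes = {}
for name, group in df.groupby('_animal'):
    # name is the group key, group is a regular DataFrame
    print(name)
    print(group)
    group_sizes[name] = len(group)

print(group_sizes)
```

Nothing is computed until you apply something to the groups, for example `.count()`, `.sum()` or `.agg()`.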


In [50]:
df2 = df.groupby('_animal', as_index=False).count()
df2

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,2,2,2
1,Parrot,2,2,2


In [51]:
df2.columns

Index(['_animal', '_max_speed', '_min_speed', '_speed_diff'], dtype='object')

`.reset_index()`

In [52]:
df.groupby('_animal').sum()

_animal,_max_speed,_min_speed,_speed_diff
Falcon,750.0,230.0,520.0
Parrot,50.0,11.0,39.0


In [53]:
df.groupby('_animal').sum().reset_index()

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,750.0,230.0,520.0
1,Parrot,50.0,11.0,39.0


In [54]:
df.groupby('_animal', as_index=False).sum()

Unnamed: 0,_animal,_max_speed,_min_speed,_speed_diff
0,Falcon,750.0,230.0,520.0
1,Parrot,50.0,11.0,39.0


In [55]:
df.groupby('_animal').agg({'_max_speed': 'mean', 
                           '_min_speed': 'count'})

_animal,_max_speed,_min_speed
Falcon,375.0,2
Parrot,25.0,2


In [56]:
df3 = df.groupby('_animal').agg({'_max_speed':['sum', 'mean'], 
                                 '_min_speed': ['max', 'min']})
df3

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,sum,mean,max,min
Falcon,750.0,375.0,120.0,110.0
Parrot,50.0,25.0,6.0,5.0


In [57]:
df3.loc[:,'_max_speed'].reset_index()

Unnamed: 0,_animal,sum,mean
0,Falcon,750.0,375.0
1,Parrot,50.0,25.0


In [58]:
df3

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,sum,mean,max,min
Falcon,750.0,375.0,120.0,110.0
Parrot,50.0,25.0,6.0,5.0


In [59]:
df3[df3['_max_speed', 'mean'] > 300]

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,sum,mean,max,min
Falcon,750.0,375.0,120.0,110.0


In [60]:
df.groupby('_animal').agg(['mean', 'sum']).loc[:,'_max_speed'].reset_index()

Unnamed: 0,_animal,mean,sum
0,Falcon,375.0,750.0
1,Parrot,25.0,50.0


In [61]:
df.groupby('_animal').agg({'_max_speed':['mean', 'sum']}).reset_index()

Unnamed: 0,_animal,_max_speed,_max_speed
,,mean,sum
0,Falcon,375.0,750.0
1,Parrot,25.0,50.0
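
Passing lists of functions to `.agg()` produces two levels of column names. If you would rather have flat names, pandas also supports named aggregation, where each keyword is the new column name and the value is a `(column, function)` pair. A small sketch; the output names `max_speed_mean` and `max_speed_sum` are just examples:

```python
import pandas as pd

df = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   '_max_speed': [380., 370., 24., 26.],
                   '_min_speed': [120., 110., 5., 6.]})

# Each keyword becomes a column: new_name=(source_column, function)
summary = df.groupby('_animal', as_index=False).agg(
    max_speed_mean=('_max_speed', 'mean'),
    max_speed_sum=('_max_speed', 'sum'),
)
print(summary)
```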


There are multiple ways of doing the same thing; be creative and share as much as you wish on Piazza. Discuss interesting ways you found to get to a result.

**The tests are *mostly* a guide.**

We cannot make the tests very strict:
- That would inhibit your creativity.

We cannot make the tests very lenient either:
- Then they could not really check your output.

Tests are based on:
- An ideal of best practices
- Tidy data (more on this next week)
- A bit of hand holding to make your life in future questions "easier"
- Beginner friendly skills (if you are a pro, you might find them quite limiting) 



**Disclaimer:**
If you want the full mark, you need to pass the test.

How?

Read the bottom part of the error messages; you might find it helpful.


#### Summary

##### What did we do?

- [x] Show how to use `.loc` and `.iloc`
- [x] Demonstrate how to rename columns of a dataframe using `.rename()`.
- [x] Create new columns or overwrite existing columns in a dataframe using `.assign()`.
- [x] Drop columns in a dataframe using `.drop()`.
- [x] Use `df[]` to filter rows of a dataframe.
- [x] Calculate summary statistics on grouped objects using `.groupby()` and `.agg()`.

##### Extra (check at your own pace)

How to change labels based on multiple conditions for different columns (Age in Titanic)

In [62]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

Unnamed: 0,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0


In [63]:
import numpy as np

conditions = [
    ((df['Animal'] == 'Falcon') & (df['Max Speed'] < 350)),
    ((df['Animal'] == 'Falcon') & (df['Max Speed'] >= 350)),
    ((df['Animal'] == 'Parrot') & (df['Max Speed'] <= 300)),
    ((df['Animal'] == 'Parrot') & (df['Max Speed'] > 300))
    ]

# create a list of the values we want to assign for each condition
values = ['slow_falcon', 'fast_falcon', 'avg_parrot', 'vfast_parrot']

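# create a new column and use np.select to assign values to it using our lists as arguments
# (an explicit string default is assumed here: the implicit default of 0
# cannot be combined with string values in recent NumPy versions)
df['class'] = np.select(conditions, values, default='unknown')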

df

Unnamed: 0,Animal,Max Speed,class
0,Falcon,380.0,fast_falcon
1,Falcon,370.0,fast_falcon
2,Parrot,24.0,avg_parrot
3,Parrot,26.0,avg_parrot
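
When conditions overlap, `np.select` uses the first one that is `True` for each row, so the order of the list matters. A minimal sketch with made-up thresholds and labels (`speeds`, `'fast'`, `'very_fast'` and `'slow'` are all illustrative):

```python
import numpy as np
import pandas as pd

speeds = pd.Series([380., 370., 24., 26.])
labels = ['fast', 'very_fast']

# Overlapping conditions: every value above 300 is also above 100,
# so the broad condition listed first always wins
broad_first = np.select([speeds > 100, speeds > 300], labels,
                        default='slow')
print(broad_first)

# Listing the narrower condition first gives it priority
narrow_first = np.select([speeds > 300, speeds > 100],
                         ['very_fast', 'fast'], default='slow')
print(narrow_first)
```

Rows that match none of the conditions get the `default` value.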


In [64]:
df.loc[(df['Animal']=="Falcon")&(df['Max Speed']>365), "class"] = "super falcon"

In [65]:
df

Unnamed: 0,Animal,Max Speed,class
0,Falcon,380.0,super falcon
1,Falcon,370.0,super falcon
2,Parrot,24.0,avg_parrot
3,Parrot,26.0,avg_parrot
