# UBC
## Programming in Python for DS
### Week 2
Instructor: Socorro Dominguez-Vidana

In [1]:
import pandas as pd

### Module 2

Overview:

- [ ] Show how to use `.loc` and `.iloc`.
- [ ] Demonstrate how to rename columns of a dataframe using `.rename()`.
- [ ] Create new columns or overwrite existing ones in a dataframe using `.assign()`.
- [ ] Drop columns in a dataframe using `.drop()`.
- [ ] Use `df[]` notation to filter rows of a dataframe.
- [ ] Calculate summary statistics on grouped objects using `.groupby()` and `.agg()`.


In [2]:
data = {'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.],
                   'Min Speed': [120., 110., 5., 6.]}
data

{'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
 'Max Speed': [380.0, 370.0, 24.0, 26.0],
 'Min Speed': [120.0, 110.0, 5.0, 6.0]}

In [3]:
type(data)

dict

In [4]:
df = pd.DataFrame(data)

In [5]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [6]:
pd.DataFrame(df)

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


`Tip!` Remember that whenever you see a *new* function, look at the documentation. The easiest way (for me) is to search in the browser: type something like "pandas DataFrame documentation" and the first result will usually be your best friend.
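
If you would rather stay inside the notebook, the same documentation text is also available from Python itself. This is only a small sketch of that alternative; `describe` is used here purely as an example method.

```python
import pandas as pd

# help() prints the docstring of any function or method.
help(pd.DataFrame.describe)

# The same text is stored on the method as a plain string.
doc = pd.DataFrame.describe.__doc__
print(len(doc))
```

In Jupyter, typing `pd.DataFrame.describe?` in a cell shows similar information.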

#### Methods and Attributes (The only thing you might need to know about OOP for now)

In [7]:
df.describe()

Unnamed: 0,Max Speed,Min Speed
count,4.0,4.0
mean,200.0,60.25
std,202.115479,63.352848
min,24.0,5.0
25%,25.5,5.75
50%,198.0,58.0
75%,372.5,112.5
max,380.0,120.0


In [8]:
df.shape

(4, 3)
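
The two cells above show the difference in practice: `describe()` is a method (you call it with parentheses) and `shape` is an attribute (no parentheses). A minimal sketch of that distinction, using a small made-up frame called `df_demo` so it runs on its own:

```python
import pandas as pd

# A small frame just for this illustration.
df_demo = pd.DataFrame({'Animal': ['Falcon', 'Parrot'],
                        'Max Speed': [380., 24.]})

# An attribute is stored information: no parentheses.
n_rows, n_cols = df_demo.shape

# A method is a function attached to the object: call it with parentheses.
summary = df_demo.describe()

print(callable(df_demo.shape))     # False: it is just a tuple
print(callable(df_demo.describe))  # True: it has to be called to do any work
```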

#### `.loc`

Select rows and columns based on `label`

The rule:

`df.loc[*row*, start:end:step]` or `df.loc[*row*, [labels]]`

For all rows, use `:`
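
One detail of the rule that is easy to miss: with `.loc`, a slice of labels includes both endpoints. The sketch below rebuilds the same animal data under the name `birds` (a name chosen only for this example) so it can be run in isolation.

```python
import pandas as pd

birds = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                      'Max Speed': [380., 370., 24., 26.],
                      'Min Speed': [120., 110., 5., 6.]})

# Label slices include BOTH endpoints, unlike ordinary Python slices.
rows_0_to_2 = birds.loc[0:2, 'Animal':'Max Speed']
print(rows_0_to_2.shape)  # (3, 2): rows 0, 1, 2 and two columns

# The step applies to the columns between the two labels.
every_other = birds.loc[:, 'Animal':'Min Speed':2]
print(list(every_other.columns))  # ['Animal', 'Min Speed']
```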

In [9]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [10]:
df.loc[1, 'Animal':'Min Speed':1]

Animal       Falcon
Max Speed     370.0
Min Speed     110.0
Name: 1, dtype: object

In [11]:
df.loc[:, ['Animal', 'Min Speed']]

Unnamed: 0,Animal,Min Speed
0,Falcon,120.0
1,Falcon,110.0
2,Parrot,5.0
3,Parrot,6.0


In [12]:
df.loc[2, 'Animal']

'Parrot'

In [13]:
df.loc[[0,2], ['Animal', 'Min Speed']]

Unnamed: 0,Animal,Min Speed
0,Falcon,120.0
2,Parrot,5.0


#### `.iloc`

Same as `.loc`, but it selects by integer position (0, 1, 2, ...) rather than by label.
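
In our data frame the row labels happen to be 0, 1, 2, 3, so labels and positions look identical. The sketch below uses made-up row labels (10, 20, 30, 40, an assumption for illustration only) to make the difference visible.

```python
import pandas as pd

birds = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                      'Max Speed': [380., 370., 24., 26.]},
                     index=[10, 20, 30, 40])  # labels chosen for illustration

# .loc uses the labels, and the end label is included.
by_label = birds.loc[10:20, 'Max Speed']

# .iloc uses positions, and the end position is excluded (like a Python list).
by_position = birds.iloc[0:2, 1]

print(by_label.equals(by_position))  # True: both pick the first two rows

# birds.loc[0] would raise a KeyError here, because no row is labelled 0.
first_animal = birds.iloc[0, 0]
print(first_animal)  # Falcon
```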

In [14]:
df.iloc[[1,2], [1,2]]

Unnamed: 0,Max Speed,Min Speed
1,370.0,110.0
2,24.0,5.0


In [15]:
df.iloc[:, 1]

0    380.0
1    370.0
2     24.0
3     26.0
Name: Max Speed, dtype: float64

#### Creating new columns or overwriting columns with `.assign()`

with `.loc[]`

In [16]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [17]:
df.assign(speedx2 = df.loc[:, 'Max Speed']*2)

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


Keep in mind that nothing has been saved yet: `.assign()` returns a new data frame and leaves `df` unchanged.

In [18]:
df

Unnamed: 0,Animal,Max Speed,Min Speed
0,Falcon,380.0,120.0
1,Falcon,370.0,110.0
2,Parrot,24.0,5.0
3,Parrot,26.0,6.0


In [19]:
df  = df.assign(speedx2 = df.loc[:, 'Max Speed']*2)

In [20]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


without `.loc[]`

In [21]:
df.loc[:, 'Min Speed']

0    120.0
1    110.0
2      5.0
3      6.0
Name: Min Speed, dtype: float64

In [22]:
df['Min Speed']

0    120.0
1    110.0
2      5.0
3      6.0
Name: Min Speed, dtype: float64

In [23]:
df = df.assign(speedx2 = df['Max Speed']+2)
#df.assign(speedx2 = df.loc[:, 'Max Speed']*2)

In [24]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,382.0
1,Falcon,370.0,110.0,372.0
2,Parrot,24.0,5.0,26.0
3,Parrot,26.0,6.0,28.0


without `.assign()`

In [25]:
df['speedx2']

0    382.0
1    372.0
2     26.0
3     28.0
Name: speedx2, dtype: float64

In [26]:
df['speedx2'] = df['Max Speed']*2

In [27]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


**Be careful:** these two statements do not do the same thing:
```python
df = df['Max Speed']*2
```
vs.
```python
df['Max Speed'] = df['Max Speed']*2
```

In the first one, you overwrite the **whole** data frame with a single series. In the second one, you modify only that series (column).
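
A quick way to see the difference is to check the type of each result. This is a minimal sketch on a small stand-in frame (`df_demo` is a name made up for the example), so the real `df` is not touched.

```python
import pandas as pd

df_demo = pd.DataFrame({'Animal': ['Falcon', 'Parrot'],
                        'Max Speed': [380., 24.]})

# The right-hand side on its own is a Series, not a data frame.
doubled = df_demo['Max Speed'] * 2
print(type(doubled).__name__)  # Series

# Assigning it to a column keeps the data frame and changes one column.
df_demo['Max Speed'] = doubled
print(type(df_demo).__name__)  # DataFrame
print(list(df_demo.columns))   # ['Animal', 'Max Speed']
```

Had we written `df_demo = doubled` instead, the `Animal` column would be gone.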

#### `.drop()`

In [29]:
# df.drop(columns = 'Max Speed', inplace = True)
# df = df.drop(columns = 'Max Speed')

In [30]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


In [31]:
df.drop(columns = ['Animal', 'Max Speed'])

Unnamed: 0,Min Speed,speedx2
0,120.0,760.0
1,110.0,740.0
2,5.0,48.0
3,6.0,52.0


In [32]:
df

Unnamed: 0,Animal,Max Speed,Min Speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


I will not demonstrate it here, but if you really wanted to remove the columns, you would have to reassign the data frame, i.e.,
```python
df = df.drop(columns = ['Animal', 'Max Speed'])
```
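
For anyone who wants to try it safely, here is a small sketch on a throwaway frame (`df_demo`, a name invented for this example) showing that `.drop()` returns a new data frame and that only the reassigned name loses the columns.

```python
import pandas as pd

df_demo = pd.DataFrame({'Animal': ['Falcon', 'Parrot'],
                        'Max Speed': [380., 24.],
                        'Min Speed': [120., 5.]})

# .drop() hands back a new frame; the original is left as it was.
trimmed = df_demo.drop(columns=['Animal', 'Max Speed'])

print(list(trimmed.columns))  # ['Min Speed']
print(list(df_demo.columns))  # ['Animal', 'Max Speed', 'Min Speed']
```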

#### Renaming with `.rename()`

In [33]:
df = df.rename(columns = {'Max Speed':'max_speed', 
                     'Min Speed':'min_speed'})

In [34]:
df

Unnamed: 0,Animal,max_speed,min_speed,speedx2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


#### Renaming using the attribute `.columns`

In [35]:
df.shape

(4, 4)

In [36]:
df.columns = ['_animal', '_max_speed', '_min_speed', '_speed2']

In [37]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


#### Filtering rows with `df[]`

In [38]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


In [39]:
df['_max_speed'] > 300

0     True
1     True
2    False
3    False
Name: _max_speed, dtype: bool

In [40]:
df[df['_max_speed'] > 300]

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0


```python
df[(_____) & (_____)]
```
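
The template above combines two conditions with `&` ("and"). As a sketch of the same idea, the example below also shows `|` ("or") and `~` ("not"). Each condition needs its own parentheses when written inline; storing conditions in variables, as done here, is one way to keep the line readable. The names `df_demo`, `is_falcon` and `is_fast` are made up for the illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                        '_max_speed': [380., 370., 24., 26.]})

# Each condition is a Series of True/False values, one per row.
is_falcon = df_demo['_animal'] == 'Falcon'
is_fast = df_demo['_max_speed'] > 375

both = df_demo[is_falcon & is_fast]    # AND: rows where both are True
either = df_demo[is_falcon | is_fast]  # OR: rows where at least one is True
not_falcon = df_demo[~is_falcon]       # NOT: rows where the condition is False

print(len(both), len(either), len(not_falcon))  # 1 2 2
```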

In [41]:
df2 = df[(df['_max_speed'] > 320) & (df['_max_speed'] < 380)]

In [42]:
df2

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
1,Falcon,370.0,110.0,740.0


#### `.groupby()`

In [43]:
df

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0
2,Parrot,24.0,5.0,48.0
3,Parrot,26.0,6.0,52.0


In [44]:
df.groupby('_animal')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x119a5f190>

**What does that even mean??**
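
One way to read that output: `.groupby()` has only split the rows into groups and is waiting to be told what to compute. The sketch below peeks inside the groups by looping over them, using a small stand-in frame (`df_demo`, a name invented for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                        '_max_speed': [380., 370., 24., 26.]})

grouped = df_demo.groupby('_animal')

# Looping over the grouped object yields (group name, sub data frame) pairs.
for name, group in grouped:
    print(name, len(group))

# Nothing is calculated until we apply a function to the groups.
means = grouped['_max_speed'].mean()
print(means['Falcon'])  # 375.0
```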


![img](https://miro.medium.com/max/1206/0*5Zzcwe-rlxz-EQ_N)

In [45]:
df[df['_animal'] == 'Falcon']

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,380.0,120.0,760.0
1,Falcon,370.0,110.0,740.0


In [46]:
df2 = df.groupby('_animal').mean()
df2

_animal,_max_speed,_min_speed,_speed2
Falcon,375.0,115.0,750.0
Parrot,25.0,5.5,50.0


In [47]:
df2.columns

Index(['_max_speed', '_min_speed', '_speed2'], dtype='object')

`reset_index()`

In [48]:
df.groupby('_animal').sum()

_animal,_max_speed,_min_speed,_speed2
Falcon,750.0,230.0,1500.0
Parrot,50.0,11.0,100.0


In [49]:
df.groupby('_animal').sum().reset_index()

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,750.0,230.0,1500.0
1,Parrot,50.0,11.0,100.0


In [50]:
df.groupby('_animal', as_index=False).sum()

Unnamed: 0,_animal,_max_speed,_min_speed,_speed2
0,Falcon,750.0,230.0,1500.0
1,Parrot,50.0,11.0,100.0


In [51]:
df3 = df.groupby('_animal').agg({'_max_speed':['mean', 'sum'], '_min_speed': ['max', 'min']})
df3

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,mean,sum,max,min
Falcon,375.0,750.0,120.0,110.0
Parrot,25.0,50.0,6.0,5.0


In [52]:
df3.loc[:,'_max_speed'].reset_index()

Unnamed: 0,_animal,mean,sum
0,Falcon,375.0,750.0
1,Parrot,25.0,50.0


In [53]:
df3

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,mean,sum,max,min
Falcon,375.0,750.0,120.0,110.0
Parrot,25.0,50.0,6.0,5.0


In [54]:
df3[df3['_max_speed', 'mean'] > 300]

,_max_speed,_max_speed,_min_speed,_min_speed
_animal,mean,sum,max,min
Falcon,375.0,750.0,120.0,110.0


In [55]:
df.groupby('_animal').agg(['mean', 'sum']).loc[:,'_max_speed'].reset_index()

Unnamed: 0,_animal,mean,sum
0,Falcon,375.0,750.0
1,Parrot,25.0,50.0
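
Another route to the same flat table, assuming a reasonably recent pandas (named aggregation arrived in version 0.25), is to name each output column directly inside `.agg()`. The column names `mean_speed` and `total_speed` and the frame `df_demo` are made up for this sketch.

```python
import pandas as pd

df_demo = pd.DataFrame({'_animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                        '_max_speed': [380., 370., 24., 26.]})

# Each keyword is a new column name; each tuple is (source column, function).
summary = df_demo.groupby('_animal', as_index=False).agg(
    mean_speed=('_max_speed', 'mean'),
    total_speed=('_max_speed', 'sum'),
)

print(list(summary.columns))  # ['_animal', 'mean_speed', 'total_speed']
```

This avoids the two-level column header, so no `.loc[:, '_max_speed']` step is needed afterwards.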


There are multiple ways of doing the same thing; be creative and share as much as you wish on Piazza. Discuss interesting ways you found to get to a result.

**The tests are *mostly* a guide.**

We cannot make the tests very strict:
- That would inhibit your creativity.

We cannot make the tests very lenient either:
- Then they could not check your output.

Tests are based on:
- An ideal of best practices
- Tidy data (more on this next week)
- A bit of hand holding to make your life in future questions "easier"
- Beginner friendly skills (if you are a pro, you might find them quite limiting) 



**Disclaimer:**
If you want the full mark, you need to pass the test.

How?

Read the bottom part of the error messages; you might find it helpful.
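
As a rough illustration of why the bottom matters: in a Python traceback, the last line carries the error type and its message. The sketch below triggers an ordinary pandas error on purpose (it is not one of the course tests) and prints just that final piece.

```python
import pandas as pd

df_demo = pd.DataFrame({'Animal': ['Falcon', 'Parrot']})

error_name = None
error_text = None
try:
    df_demo['animal']  # wrong capitalisation, on purpose
except KeyError as err:
    # This is the information shown on the last line of the traceback.
    error_name = type(err).__name__
    error_text = str(err)

print(error_name, error_text)
```

Here the message points straight at the column name that could not be found.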


#### Summary

##### What did we do?

- [x] Show how to use `.loc` and `.iloc`.
- [x] Demonstrate how to rename columns of a dataframe using `.rename()`.
- [x] Create new columns or overwrite existing ones in a dataframe using `.assign()`.
- [x] Drop columns in a dataframe using `.drop()`.
- [x] Use `df[]` notation to filter rows of a dataframe.
- [x] Calculate summary statistics on grouped objects using `.groupby()` and `.agg()`.

##### Extra (check at your own pace)

How to change labels based on multiple conditions for different columns (Age in Titanic)

In [56]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

Unnamed: 0,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0


In [57]:
import numpy as np

conditions = [
    ((df['Animal'] == 'Falcon') & (df['Max Speed'] < 350)),
    ((df['Animal'] == 'Falcon') & (df['Max Speed'] >= 350)),
    ((df['Animal'] == 'Parrot') & (df['Max Speed'] <= 300)),
    ((df['Animal'] == 'Parrot') & (df['Max Speed'] > 300))
    ]

# create a list of the values we want to assign for each condition
values = ['slow_falcon', 'fast_falcon', 'avg_parrot', 'vfast_parrot']

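# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default is passed because some newer NumPy versions reject mixing the implicit default of 0 with string values)
df['class'] = np.select(conditions, values, default='unclassified')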

df

Unnamed: 0,Animal,Max Speed,class
0,Falcon,380.0,fast_falcon
1,Falcon,370.0,fast_falcon
2,Parrot,24.0,avg_parrot
3,Parrot,26.0,avg_parrot


In [58]:
df.loc[(df['Animal']=="Falcon")&(df['Max Speed']>365), "class"] = "super falcon"

In [59]:
df

Unnamed: 0,Animal,Max Speed,class
0,Falcon,380.0,super falcon
1,Falcon,370.0,super falcon
2,Parrot,24.0,avg_parrot
3,Parrot,26.0,avg_parrot
