# Pandas

https://tomaugspurger.github.io/modern-1-intro.html
https://tomaugspurger.github.io/method-chaining



*   Like lists, you can index by location.
*   Like dictionaries, you can index by label.
*   Like NumPy arrays, you can index by boolean masks.
*   Any of these indexers could be scalar indexes, or they could be arrays, or they could be slices.
*   Any of these should work on the index (row labels) or columns of a DataFrame.
*   And any of these should work on hierarchical indexes.
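
As a rough illustration of the list above, the sketch below applies each indexing style to a small Series. The objects `s` and `h` and their labels are invented for this example and do not come from the linked articles.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0]            # like a list: first element
by_label = s.loc['b']              # like a dict: value stored under label 'b'
by_mask = s[s > 25]                # like NumPy: keep entries where the mask is True

# Each indexer can also be a list (array-like) or a slice.
several_labels = s.loc[['a', 'c']]
label_slice = s.loc['b':'c']       # label slices include both endpoints
position_slice = s.iloc[1:3]       # positional slices exclude the stop position

# The same .loc syntax works on a hierarchical (MultiIndex) index.
mi = pd.MultiIndex.from_tuples([('AA', 1), ('AA', 2), ('DL', 1)],
                               names=['carrier', 'flight'])
h = pd.Series([5, 6, 7], index=mi)
outer_level = h.loc['AA']          # all entries under the outer label 'AA'
one_value = h.loc[('AA', 2)]       # a full key tuple selects a single value
```

Note that the label slice and the positional slice select the same two entries here even though their stop values differ, because only the label slice is inclusive.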





#   .loc for label-based indexing

```
first.loc[['AA', 'AS', 'DL'], ['fl_date', 'tail_num']]
```


#   .iloc for positional indexing
```
first.iloc[[0, 1, 3], [0, 1]]
```
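
The two snippets above depend on a frame called `first` that is not defined in these notes. A minimal self-contained sketch, assuming a made-up frame `toy` whose row labels are carrier codes, shows that `.loc` and `.iloc` can reach the same cells by label and by position respectively.

```python
import pandas as pd

# Hypothetical stand-in for `first`; every value here is invented.
toy = pd.DataFrame(
    {'fl_date': ['2014-01-01', '2014-01-01', '2014-01-02'],
     'tail_num': ['N001', 'N002', 'N003'],
     'dep_delay': [2.0, -3.0, 15.0]},
    index=['AA', 'AS', 'DL'],
)

# .loc: rows and columns selected by label.
by_label = toy.loc[['AA', 'DL'], ['fl_date', 'tail_num']]

# .iloc: the same cells selected by integer position.
by_position = toy.iloc[[0, 2], [0, 1]]
```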

```
import pandas as pd

f = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
f
```

```
   a   b
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50
```


```
# ignore the context manager for now
with pd.option_context('mode.chained_assignment', None):
    f[f['a'] <= 3]['b'] = f[f['a'] <= 3]['b'] / 10
f
```

```
   a   b
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50
```


```
f.loc[f['a'] <= 3, 'b'] = f.loc[f['a'] <= 3, 'b'] / 10
f
```

```
   a     b
0  1   1.0
1  2   2.0
2  3   3.0
3  4  40.0
4  5  50.0
```


#   Replace chained ][ indexing with a single .loc[..., ...]

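The "failure" to update f comes down to what's called chained indexing, a practice to be avoided. The "chained" comes from indexing multiple times, one after another, rather than performing one single indexing operation. Above we had two operations on the left-hand side, one `__getitem__` and one `__setitem__` (in Python, square brackets are syntactic sugar for `__getitem__`, or for `__setitem__` when they appear on the left of an assignment). So `f[f['a'] <= 3]['b'] = ...` becomes

1.   getitem: `f[f['a'] <= 3]`
2.   setitem: `_['b'] = ...` (using `_` to represent the result of step 1)

Because the boolean mask in step 1 returns a new object rather than a view of f, step 2 assigns into that temporary object and f is left unchanged. A single `.loc[rows, columns]` call does the whole selection and assignment in one `__setitem__` on f itself.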
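
The two steps can be made visible by giving the intermediate result a name. This is a minimal sketch: the name `tmp` is introduced here purely for illustration, and the `warnings` filter is only there because some pandas versions emit a SettingWithCopyWarning on the second step.

```python
import warnings

import pandas as pd

f = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

with warnings.catch_warnings():
    # Silence the chained-assignment warning some pandas versions raise here.
    warnings.simplefilter('ignore')
    tmp = f[f['a'] <= 3]        # step 1, __getitem__: masking returns a new object
    tmp['b'] = tmp['b'] / 10    # step 2, __setitem__: modifies tmp only

f_b_after = f['b'].tolist()     # the original frame keeps its values
tmp_b_after = tmp['b'].tolist() # only the temporary object was updated
```

In the chained form the temporary object is never bound to a name, so the update is discarded as soon as the statement finishes.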

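```
# Illustrative only: went_up, fetch, fell_down, broke and tumble_after are
# placeholder functions that are never defined, so running this raises NameError.
jack_jill = pd.DataFrame()
(jack_jill.pipe(went_up, 'hill')
    .pipe(fetch, 'water')
    .pipe(fell_down, 'jack')
    .pipe(broke, 'crown')
    .pipe(tumble_after, 'jill')
)
```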
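
For a version of the same pattern that actually runs, here is a small sketch. The functions `add_total` and `keep_above` and the frame `toy` are hypothetical, written only to show how `.pipe` threads a DataFrame through user-defined steps.

```python
import pandas as pd

# Hypothetical step functions: each takes a DataFrame first and returns a DataFrame.
def add_total(df, left, right):
    return df.assign(total=df[left] + df[right])

def keep_above(df, column, threshold):
    return df[df[column] > threshold]

toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# df.pipe(func, *args) calls func(df, *args), so the steps read top to bottom.
chained = (toy.pipe(add_total, 'x', 'y')
              .pipe(keep_above, 'total', 15))

# The equivalent nested call has to be read inside out.
nested = keep_above(add_total(toy, 'x', 'y'), 'total', 15)
```

Both forms produce the same frame; the chained one keeps the order of operations aligned with the order of the text.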

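```
import pandas as pd
import seaborn as sns

# Assumes df is the flights DataFrame from the linked articles,
# with dep_time already parsed as datetimes.
(df.dropna(subset=['dep_time', 'unique_carrier'])
   # keep only the five carriers with the most flights
   .loc[df['unique_carrier']
       .isin(df['unique_carrier'].value_counts().index[:5])]
   .set_index('dep_time')
   # pd.Grouper(freq=...) bins the datetime index by hour inside the groupby
   # (it replaces pd.TimeGrouper, which has been removed from pandas)
   .groupby(['unique_carrier', pd.Grouper(freq='h')])
   .fl_num.count()
   .unstack(0)
   .fillna(0)
   # 24-hour rolling sum of the hourly counts
   .rolling(24)
   .sum()
   .rename_axis("Flights per Day", axis=1)
   .plot()
)
sns.despine()
```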
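
Since `df` is not defined in these notes, the grouping step can be tried on a tiny synthetic frame. This is a sketch under stated assumptions: the frame `toy` and its four rows are invented, and it uses `pd.Grouper` with a `freq` argument, which is the supported way to group by time bins in current pandas.

```python
import pandas as pd

# Synthetic data, invented for illustration only.
toy = pd.DataFrame({
    'dep_time': pd.to_datetime(['2014-01-01 00:10', '2014-01-01 00:40',
                                '2014-01-01 01:05', '2014-01-01 01:30']),
    'unique_carrier': ['AA', 'DL', 'AA', 'AA'],
    'fl_num': [1, 2, 3, 4],
})

hourly = (toy.set_index('dep_time')
             # group by carrier and by hourly bins of the datetime index
             .groupby(['unique_carrier', pd.Grouper(freq='h')])
             .fl_num.count()
             .unstack(0)     # carriers become columns, hours stay as rows
             .fillna(0))     # hours with no flights for a carrier count as 0

aa_first_hour = hourly.loc[pd.Timestamp('2014-01-01 00:00'), 'AA']
dl_second_hour = hourly.loc[pd.Timestamp('2014-01-01 01:00'), 'DL']
```

The result has one row per hour and one column per carrier, which is the shape the rolling sum and the plot in the longer chain operate on.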

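```
# Assumes df is the flights DataFrame from the linked articles.
flights = (df[['fl_date', 'tail_num', 'dep_time', 'dep_delay']]
           .dropna()
           .sort_values('dep_time')
           .loc[lambda x: x.dep_delay < 500]
           # turn: order of each departure within a plane's day (1 = first)
           .assign(turn=lambda x:
                x.groupby(['fl_date', 'tail_num'])
                 .dep_time
                 .transform('rank').astype(int)))
```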
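
The `assign` plus `groupby(...).transform('rank')` step is easier to follow on a handful of rows. In the sketch below the frame `toy` and its values are invented; only the chain itself mirrors the code above.

```python
import pandas as pd

# Synthetic data, invented for illustration only.
toy = pd.DataFrame({
    'fl_date': ['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02'],
    'tail_num': ['N1', 'N1', 'N2', 'N1'],
    'dep_time': pd.to_datetime(['2014-01-01 09:00', '2014-01-01 06:00',
                                '2014-01-01 07:00', '2014-01-02 08:00']),
})

ranked = (toy.sort_values('dep_time')
             # the lambda receives the sorted frame, not the original `toy`
             .assign(turn=lambda x:
                  x.groupby(['fl_date', 'tail_num'])
                   .dep_time
                   .transform('rank').astype(int)))

# Row 0 is plane N1's second departure on 2014-01-01, row 1 its first.
turn_of_row_0 = ranked.loc[0, 'turn']
turn_of_row_1 = ranked.loc[1, 'turn']
```

Passing a callable to `assign` (and to `.loc`) is what lets the chain refer to the intermediate frame without binding it to a name.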

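```
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df is the flights DataFrame from the linked articles,
# with dep_time already parsed as datetimes.
plt.figure(figsize=(15, 5))
(df[['fl_date', 'tail_num', 'dep_time', 'dep_delay']]
    .dropna()
    .assign(hour=lambda x: x.dep_time.dt.hour)
    .query('5 < dep_delay < 600')
    # pipe((func, 'data'), ...) passes the frame as the data= keyword;
    # x and y are named because recent seaborn releases require keywords here
    .pipe((sns.boxplot, 'data'), x='hour', y='dep_delay'))
sns.despine()
```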
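
The tuple form of `.pipe` used in the last step is worth isolating. A minimal sketch, assuming a hypothetical function `column_mean` that expects its DataFrame under the keyword `data` rather than as the first positional argument:

```python
import pandas as pd

# Hypothetical function: the frame arrives via the `data` keyword.
def column_mean(column, data=None):
    return data[column].mean()

# Synthetic data, invented for illustration only.
toy = pd.DataFrame({'dep_delay': [10.0, 20.0, 60.0]})

# pipe((func, 'data'), *args) calls func(*args, data=toy).
via_pipe = toy.pipe((column_mean, 'data'), 'dep_delay')
direct = column_mean('dep_delay', data=toy)
```

This is why plotting functions that take `data=` can sit at the end of a method chain without breaking it.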