In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

# for display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  #default 'last_expr'

# <a id='0'>Advanced Pandas

- <a href='#c'>Categorical Type in Pandas
- <a href='#g'>Group Transform: Unwrapped GroupBy
- <a href='#m'>Method Chaining
- <a href='#ci'>Avoid Chained Indexing

## <a id='c'>Categorical Type in Pandas
Why should I convert my string-type column to Pandas categorical type?
#### Pros:
For the 40,000 rows of data below, switching to the Categorical data type results in <br> 
- about 30% less memory usage (0.92 MB down to 0.65 MB)
- `value_counts()` running about 5 times faster (3.48 ms down to 699 µs).

The cost is a one-time conversion to the categorical type, which is usually negligible.

In [5]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 10000

N = len(fruits)
df = pd.DataFrame({'Fruit'    : fruits,
                   'Basket_id': np.arange(N),
                   'Count'    : np.random.randint(3, 15, size=N)
                  },
                  columns=['Basket_id', 'Fruit', 'Count'])
df.head(3)

Unnamed: 0,Basket_id,Fruit,Count
0,0,apple,12
1,1,orange,4
2,2,apple,8


In [6]:
print('Without using Categorical data type:')
print('Memory usage:{:.2f} MB'.format(df.memory_usage().sum()/1024/1024))
print('Run time for value counts:')
%timeit df['Fruit'].value_counts()

Without using Categorical data type:
Memory usage:0.92 MB
Run time for value counts:
3.48 ms ± 39.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
# conversion is not free, but just a one-time cost
%time df['Fruit'] = df['Fruit'].astype('category')

CPU times: user 2.86 ms, sys: 767 µs, total: 3.63 ms
Wall time: 3.07 ms


In [8]:
print('After using Categorical dtype:')
print('Memory usage:{:.2f} MB'.format(df.memory_usage().sum()/1024/1024))
print('Access time for value counts:')
%timeit df['Fruit'].value_counts()

After using Categorical dtype:
Memory usage:0.65 MB
Access time for value counts:
699 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
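
A caveat on the memory figures above: `memory_usage()` without `deep=True` counts only the pointers stored in an object column, not the Python strings they refer to. With `deep=True` pandas also adds the size of every referenced string. The sketch below prints both measurements for the same fruit list. The variable names are illustrative, and the exact byte counts depend on the platform and pandas version.

```python
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 10000

# Force object dtype so the comparison matches the string column used above
as_object = pd.Series(fruits, dtype=object)
as_category = as_object.astype('category')

# Shallow: counts only the array of pointers to the Python strings
shallow_obj = as_object.memory_usage(index=False)
# Deep: also adds the size of each referenced string object
deep_obj = as_object.memory_usage(index=False, deep=True)
# Categorical: one small integer code per row plus the category labels
deep_cat = as_category.memory_usage(index=False, deep=True)

for label, nbytes in [('object (shallow)', shallow_obj),
                      ('object (deep)', deep_obj),
                      ('category (deep)', deep_cat)]:
    print('{:<18}{:>8.2f} MB'.format(label, nbytes / 1024 / 1024))
```

Either way the categorical column comes out smaller, because it stores one small integer per row instead of one pointer per row.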


#### Access the content of a categorical column

In [37]:
c = df['Fruit'].values
type(c), c.categories
c.codes

(pandas.core.arrays.categorical.Categorical,
 Index(['apple', 'orange'], dtype='object'))

array([0, 1, 0, ..., 1, 0, 0], dtype=int8)
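
The two attributes printed above are what a categorical column stores: `categories` holds each distinct label once, and `codes` holds one small integer per row pointing into it. As a minimal sketch on a four-row Series (hypothetical data, using the `.cat` accessor instead of `.values`), the original labels can be rebuilt by indexing the categories with the codes:

```python
import numpy as np
import pandas as pd

s = pd.Series(['apple', 'orange', 'apple', 'apple'], dtype='category')

codes = s.cat.codes.to_numpy()          # one integer code per row
labels = np.asarray(s.cat.categories)   # each distinct label once

# Index the labels with the codes to recover the original values
rebuilt = labels[codes]

print(codes)
print(rebuilt)
```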

## <a id='g'>Group Transforms: Unwrapped GroupBys

A group transform is very similar to a group-and-apply operation (see 4_Grouping_and_Aggregation.ipynb in the current folder), except that it produces an object of the same shape as the input DF. That is why it is sometimes called an "unwrapped" GroupBy.<br>

As I will demonstrate below, you don't strictly need group transform if you are familiar with the groupby object, since you can manually "unwrap" the aggregated result back to the input shape. Still, group transform is worth knowing because its name reminds us of a very common and useful data transformation, so I decided to include it here.

<a href='#0'> Back to TOC

### Example: A sales transaction dataset

In [17]:
df = pd.read_csv('examples/sales_transactions.csv')
df.head()

Unnamed: 0,Order,SKU,Quantity,Sale_price
0,10001,B1-20000,7,236
1,10001,S1-27722,11,232
2,10001,B1-86481,3,108
3,10005,S1-06532,48,2679
4,10005,S1-82801,21,286


### Task: what percentage of the total sales of an order does each SKU represent?

#### Warm up: Unwrap a groupby object

In [18]:
df['Sales'] = df['Quantity'] * df['Sale_price']
df.head()

grb = df.groupby('Order')

# groupby object
grb['Sales'].sum()

# unwrap it !
grb['Sales'].sum()[df['Order']]

Unnamed: 0,Order,SKU,Quantity,Sale_price,Sales
0,10001,B1-20000,7,236,1652
1,10001,S1-27722,11,232,2552
2,10001,B1-86481,3,108,324
3,10005,S1-06532,48,2679,128592
4,10005,S1-82801,21,286,6006


Order
10001      4528
10005    327803
10006    110612
Name: Sales, dtype: int64

Order
10001      4528
10001      4528
10001      4528
10005    327803
10005    327803
10005    327803
10005    327803
10005    327803
10006    110612
10006    110612
10006    110612
10006    110612
Name: Sales, dtype: int64
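
To check that manual unwrapping and `transform('sum')` give the same per-row totals without the CSV file, here is a self-contained sketch. The first three rows are copied from the table above; the two rows for order 20000 and their SKUs are made up for illustration. The lookup is written with `.loc` to make the label-based selection explicit.

```python
import pandas as pd

df = pd.DataFrame({
    'Order': [10001, 10001, 10001, 20000, 20000],
    'SKU': ['B1-20000', 'S1-27722', 'B1-86481', 'X1', 'X2'],  # X1, X2 are hypothetical
    'Quantity': [7, 11, 3, 2, 6],
    'Sale_price': [236, 232, 108, 50, 50],
})
df['Sales'] = df['Quantity'] * df['Sale_price']

# One total per order (the "wrapped" groupby result)
per_order = df.groupby('Order')['Sales'].sum()

# Manual unwrapping: look up each row's order in the per-order totals
unwrapped = per_order.loc[df['Order']].values

# Group transform: the same totals, already aligned with df's index
transformed = df.groupby('Order')['Sales'].transform('sum')

pct = df['Sales'] / transformed
print(transformed.tolist())
print(pct.round(6).tolist())
```

The practical difference is alignment: the unwrapped result is indexed by order number, so it needs `.values` before being assigned back, while the transformed result shares the index of `df`.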

#### Solution

In [29]:
df['Sales'] = df['Quantity'] * df['Sale_price']

# Solution 1: using unwrapping, without group transform
df['Total_sales'] = df.groupby('Order')['Sales'].sum()[df['Order']].values

# Solution 2: using group transform.
df['Total_sales'] = df.groupby('Order')['Sales'].transform('sum')

df.head()

pd.DataFrame((df['Sales']/ df['Total_sales']).values, 
             index=pd.MultiIndex.from_arrays([df['Order'],df['SKU']]),
             columns=['pct_sales']
            )

Unnamed: 0,Order,SKU,Quantity,Sale_price,Sales,Total_sales
0,10001,B1-20000,7,236,1652,4528
1,10001,S1-27722,11,232,2552,4528
2,10001,B1-86481,3,108,324,4528
3,10005,S1-06532,48,2679,128592,327803
4,10005,S1-82801,21,286,6006,327803


Order,SKU,pct_sales
10001,B1-20000,0.364841
10001,S1-27722,0.563604
10001,B1-86481,0.071555
10005,S1-06532,0.392284
10005,S1-82801,0.018322
10005,S1-06532,0.02287
10005,S1-47412,0.466036
10005,S1-27722,0.100487
10006,S1-27722,0.885546
10006,B1-33087,0.107918


## <a id='m'> Method Chaining
ft. assign( ) and pipe( )
    
Method chaining Pros and Cons:
- Pros: faster implementation, fewer temporary variables, better readability if done right
- Cons: debugging can be tricky, and may take more CPU time sometimes
    
<a href='#0'> Back to TOC    

Some may complain that a long chain is hard to read, but let's watch a great example (adapted from Jeff Allen, RStudio) that demonstrates that a well-CHAINED story-telling style is much easier to understand than a NESTED function-calling style.

Chaining: 

(jack_jill.pipe(went_up, "hill")
          .pipe(fetch, "water")
          .pipe(fell_down, "jack")
          .pipe(broke, "crown")
          .pipe(tumble_after, "jill"))

Nested Function:

tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            "jack"),
        "crown"),
    "jill"
)
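
The nursery-rhyme functions above are not defined anywhere, so here is a runnable sketch of the same comparison. `scale` and `add_offset` are hypothetical helpers invented for this illustration; the point is only that `df.pipe(func, *args)` calls `func(df, *args)`, so the chained and nested forms compute the same thing.

```python
import pandas as pd

def scale(df, col, factor):
    """Return a copy of df with df[col] multiplied by factor."""
    return df.assign(**{col: df[col] * factor})

def add_offset(df, col, offset):
    """Return a copy of df with offset added to df[col]."""
    return df.assign(**{col: df[col] + offset})

data = pd.DataFrame({'x': [1, 2, 3]})

# Chained: reads top to bottom in execution order
chained = (data.pipe(scale, 'x', 10)
               .pipe(add_offset, 'x', 1))

# Nested: the same calls, read from the inside out
nested = add_offset(scale(data, 'x', 10), 'x', 1)

print(chained['x'].tolist())
print(chained.equals(nested))
```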

The main con of chaining is debugging: intermediate results are not directly visible, so each step should be checked in sequence before deployment. Chaining is therefore best suited to stable, well-understood pipelines.
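
One way to soften the debugging problem is a pass-through function dropped into the chain with `pipe()`. The sketch below is only an illustration: `peek` is a hypothetical helper and the three-row frame is made up. It prints the shape at a given step and returns the DataFrame untouched, so it can be added or removed without changing the result.

```python
import pandas as pd

def peek(df, label):
    """Print the shape at this point of the chain and pass df through unchanged."""
    print('{}: {} rows x {} cols'.format(label, *df.shape))
    return df

df = pd.DataFrame({'Order': [1, 1, 2],
                   'Quantity': [2, 3, 4],
                   'Sale_price': [10, 10, 5]})

result = (df.assign(Sales=lambda d: d['Quantity'] * d['Sale_price'])
            .pipe(peek, 'after assign')
            .loc[lambda d: d['Sales'] > 20]   # keep rows with Sales above 20
            .pipe(peek, 'after filter'))
```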

Let's use the sales transaction data set above in the Group Transform section as a start.

### Task: what percentage of the total sales of an order does each SKU represent?

#### Without Chaining

In [31]:
df['Sales'] = df['Quantity'] * df['Sale_price']
df['Sales']/ (df.groupby('Order')['Sales'].transform('sum'))

0     0.364841
1     0.563604
2     0.071555
3     0.392284
4     0.018322
5     0.022870
6     0.466036
7     0.100487
8     0.885546
9     0.107918
10    0.005885
11    0.000651
Name: Sales, dtype: float64

#### Chaining using assign( ): no need to create df['Sales']

In [32]:
df['Quantity']*df['Sale_price'] / (df.assign(Sales=df['Quantity']*df['Sale_price'])
                                   .groupby('Order')['Sales'].transform('sum'))

0     0.364841
1     0.563604
2     0.071555
3     0.392284
4     0.018322
5     0.022870
6     0.466036
7     0.100487
8     0.885546
9     0.107918
10    0.005885
11    0.000651
dtype: float64
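
The cell above still computes the numerator outside the chain. As a sketch of a variant that keeps every step inside one chain, `assign()` can take a lambda that receives the intermediate DataFrame, so later steps can use columns created by earlier ones. The three rows are copied from order 10001 in the table above; the layout of the chain is my own choice, not the only way to write it.

```python
import pandas as pd

df = pd.DataFrame({
    'Order': [10001, 10001, 10001],
    'SKU': ['B1-20000', 'S1-27722', 'B1-86481'],
    'Quantity': [7, 11, 3],
    'Sale_price': [236, 232, 108],
})

pct_sales = (
    df.assign(Sales=lambda d: d['Quantity'] * d['Sale_price'])
      .assign(pct_sales=lambda d: d['Sales']
              / d.groupby('Order')['Sales'].transform('sum'))
      .set_index(['Order', 'SKU'])['pct_sales']
)

print(pct_sales.round(6))
```

Because `assign()` returns a new DataFrame, `df` itself never gains a `Sales` column.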

Method chaining really SHINEs when we define our own versatile functions for real data sets. We will explore some examples below to really drive this point home.

### Using pipe( )
`pipe()` enables method chaining when user-defined functions are involved, which is very powerful!<br>
I will use a subset of flight data extracted from the Bureau of Transportation Statistics (BTS). Details of extracting the data from BTS can be found in Augspurger's notebook [here](https://github.com/TomAugspurger/effective-pandas/blob/master/modern_1_intro.ipynb). More detailed analysis can be found in his other notebook [here](https://github.com/TomAugspurger/effective-pandas/blob/master/modern_2_method_chaining.ipynb).

In [33]:
def my_read(fp):
    '''
    Method Chaining
    Warning: only a subset of operations are shown here for simplicity     
    '''
    df = (pd.read_csv(fp)
            .rename(columns=str.lower)
            .drop('unnamed: 32', axis=1)
            .pipe(extract_city_name)            
            .assign(fl_date =lambda x: pd.to_datetime(x['fl_date']),
                    origin  =lambda x: pd.Categorical(x['origin']),
                    dest    =lambda x: pd.Categorical(x['dest'])))
    return df

def extract_city_name(df):
    '''
    Chicago, IL -> Chicago
    '''
    cols = ['origin_city_name', 'dest_city_name']
    # raw string so that \w is passed to the regex engine unchanged
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[cols] = city
    return df
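
Since the flight CSV is not included here, the following sketch exercises the city-name step on its own. The two-row `sample` frame is hypothetical and stands in for the real data; the function body mirrors the one above, with the pattern written as a raw string so that `\w` reaches the regex engine unchanged.

```python
import pandas as pd

def extract_city_name(df):
    """Strip the trailing ', ST' state code: 'Chicago, IL' -> 'Chicago'."""
    cols = ['origin_city_name', 'dest_city_name']
    # One capture group with expand=False returns a Series per column
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()   # leave the caller's DataFrame untouched
    df[cols] = city
    return df

# Hypothetical sample standing in for the flight data
sample = pd.DataFrame({
    'origin_city_name': ['Chicago, IL', 'Dallas/Fort Worth, TX'],
    'dest_city_name': ['New York, NY', 'Chicago, IL'],
})

cleaned = sample.pipe(extract_city_name)
print(cleaned)
```

Taking a DataFrame in and returning a new DataFrame is what makes the function usable with `pipe()`.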

In [35]:
import os
output = 'examples/flights.h5'

if not os.path.exists(output):
    df = my_read('examples/flights_short.csv')
    df.to_hdf(output, key='flights', format='table')
else:
    df = pd.read_hdf(output, key='flights')
    
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4500 entries, 0 to 4499
Data columns (total 32 columns):
fl_date                  4500 non-null datetime64[ns]
unique_carrier           4500 non-null object
airline_id               4500 non-null int64
tail_num                 4493 non-null object
fl_num                   4500 non-null int64
origin_airport_id        4500 non-null int64
origin_airport_seq_id    4500 non-null int64
origin_city_market_id    4500 non-null int64
origin                   4500 non-null object
origin_city_name         4500 non-null object
dest_airport_id          4500 non-null int64
dest_airport_seq_id      4500 non-null int64
dest_city_market_id      4500 non-null int64
dest                     4500 non-null category
dest_city_name           4500 non-null object
crs_dep_time             4500 non-null int64
dep_time                 4418 non-null float64
dep_delay                4418 non-null float64
taxi_out                 4417 non-null float64
wheels_off     

## <a id='ci'>Avoid Chained Indexing
    
You tried to assign new values to part of a DF, but the values just won't change, and you are frustrated and confused. It's very likely that you've fallen victim to chained indexing. Tom Augspurger has a very detailed explanation [here](https://github.com/TomAugspurger/effective-pandas/blob/master/modern_1_intro.ipynb). I am just going to use his example here for demonstration.
    
<a href='#0'> Back to TOC

In [43]:
f = pd.DataFrame({'a':[1,2,3,4,], 'b':[10,20,30,40,]})
f

Unnamed: 0,a,b
0,1,10
1,2,20
2,3,30
3,4,40


### Task: divide the first two values of column 'b' by 10.

In [44]:
f[f.index < 2]['b'] = f[f.index < 2 ]['b'] / 10
f

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,a,b
0,1,10
1,2,20
2,3,30
3,4,40


Not only is there a warning, but the values also didn't change. Let's do it the right way first and then I will explain.

In [45]:
f.loc[ f.index < 2, 'b'] = f.loc[ f.index < 2, 'b'] / 10
f

Unnamed: 0,a,b
0,1,1.0
1,2,2.0
2,3,30.0
3,4,40.0


The right way avoids chained indexing of the form f[ i ][ j ] and instead uses a single indexing call of the form f.loc[ i, j ]. This seemingly small change makes a big difference when you assign values to f. With chained indexing, f[ i ] may return a temporary copy, so the assignment modifies that copy and the original f is left unchanged. See Augspurger's notebook for more information.
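
The two behaviours can be checked side by side in one script. This is a sketch under two assumptions of mine: column 'b' is created as floats so the division does not change its dtype, and the warning is silenced with the standard `warnings` module so the script runs quietly. The variable names `after_chained` and `after_loc` are illustrative.

```python
import warnings
import pandas as pd

# 'b' holds floats here (assumption) so dividing does not change the dtype
f = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10.0, 20.0, 30.0, 40.0]})
mask = f.index < 2

# Chained indexing: f[mask] builds a new DataFrame and the assignment lands on it
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the chained-assignment warning
    f[mask]['b'] = f[mask]['b'] / 10
after_chained = f['b'].tolist()       # f still holds the original values

# Single .loc call: rows and column are resolved together on f itself
f.loc[mask, 'b'] = f.loc[mask, 'b'] / 10
after_loc = f['b'].tolist()

print(after_chained)
print(after_loc)
```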