# Method chaining

Documentation sources:

* https://tomaugspurger.github.io/method-chaining
* https://www.tutorialspoint.com/python_pandas/

More advanced topics are discussed in the following sources:

* http://benalexkeen.com/resampling-time-series-data-with-pandas/
* https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
* http://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#resampling

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import read_csv

## I. Philosophy

* Method chaining is a way to do data processing without intermediate variables.
* The code is organised as `new_df = df.method_1(...).method_2(...).....method_n(...)`.
* To make it readable, the code is formatted as

```python
new_df = (df
           .method_1()
           .method_2()
           ...
           .method_n())
```

* As you do not control the methods of `DataFrame`, several special methods are used:

  * `pipe(func)`            - for applying custom functions to the DataFrame
  * `assign(col = ...)`     - for adding new columns to the DataFrame or redefining existing ones
  * `rename(mapper, axis)`  - for renaming row or column labels
  * `where(cond, value)`    - for replacing cells where `cond` is false
  * `mask(cond, value)`     - for replacing cells where `cond` is true
  * window functions provided by the `df.rolling(...)` object
  * resampling operations provided by the `df.resample(...)` object

* Method chaining makes it easy to write and to debug the code:
  * The flow of execution is from top to bottom.
  * Function parameters are always near the function itself.
  * A chain behaves like a pure function without internal state, provided that every step returns a new object.
  * Method chaining **may alter** the initial data frame if a step modifies its input in place.

* On the flip side, a lot of copying takes place:
  * Some of the copying operations are optimised out.
  * The copying overhead is rarely large enough to justify using the `inplace=True` flag or avoiding method chaining.
 
* Method chaining is just one way of programming. If you do not like it, forget about it.
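
The pattern described above can be sketched on a toy data frame. This is only an illustration: the data, the column names and the helper `drop_short` are made up for this example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
raw = pd.DataFrame({'name': ['a', 'b', 'c'], 'height_cm': [150, 180, 165]})

def drop_short(df, min_height):
    """Custom step for pipe: returns a filtered frame and leaves df untouched."""
    return df[df['height_m'] >= min_height]

tidy = (raw
        .assign(height_m=lambda df: df['height_cm'] / 100)  # new column
        .rename(columns={'name': 'id'})                      # relabel a column
        .pipe(drop_short, 1.6))                              # custom function

print(tidy)
print(raw)  # the input still has its original two columns
```

Each step returns a new object, so `raw` can be reused for another chain afterwards.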

## II. Apply custom functions through piping

* Piping is a way to add new processing steps to data frames without redefining the `DataFrame` class.
* You have to define a function `fun` that processes the DataFrame and takes the arguments `df, x1,...,xn`.
* After that you can apply it in a chain with `df.pipe(fun, x1,...,xn)`.
* **Important:** The original data frame `df` **is not** preserved if function `fun` modifies the first argument `df`.

In [2]:
df = DataFrame({'x': [1,2,3], 'y': [4, 5, 6], 'z': [7, 8, 9]})
display(df)

# A deliberately bad function that demonstrates why piping can be unsafe.
# It takes some effort to change the first argument:
# - Python never rebinds the caller's variable.
# - You can only modify the object that the argument refers to.
def select_and_add(df, c):
    df.drop(0, axis = 0, inplace = True)
    df.drop('x', axis = 1, inplace = True)
    df.iloc[:, :] = df.iloc[:, :] + c
    return df

display(df.pipe(select_and_add, 1))
display(df)

Unnamed: 0,x,y,z
0,1,4,7
1,2,5,8
2,3,6,9


Unnamed: 0,y,z
1,6,9
2,7,10


Unnamed: 0,y,z
1,6,9
2,7,10
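
The side effect shown above disappears when the piped function is built only from calls that return new objects. The sketch below is one possible rewrite; the name `select_and_add_safe` is hypothetical.

```python
import pandas as pd

def select_and_add_safe(df, c):
    """Drop row 0 and column 'x', then add c, without touching the input."""
    # drop() without inplace=True returns a new DataFrame each time.
    return df.drop(index=0).drop(columns='x') + c

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})
result = df.pipe(select_and_add_safe, 1)

print(result)  # same values as the output of the bad function
print(df)      # still the full 3 x 3 data frame
```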


## III. Defining new and redefining old columns through assign

* You can define new columns and redefine old columns but you cannot define the same column more than once.

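  * **Important:** You can define columns in terms of other columns created in the same `assign` call if you use Python 3.6 or later (and pandas 0.23 or later).
  * **Important:** Do not define new columns directly in terms of columns redefined in the same call. An expression such as `df['x'] + 3` still sees the old values; only a lambda sees the redefined column.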

* You can define columns by standard selection operations and universal functions:

  * Universal functions are functions that can operate on scalar, vector and matrix inputs.
  * Most of them are defined in [`numpy ufunc`](https://docs.scipy.org/doc/numpy-1.15.1/reference/ufuncs.html).
  * You can define your own universal functions using [`numpy.vectorize`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.vectorize.html).

* You can also use any function that is defined over the entire DataFrame:

  * Lambda expressions are particularly popular for this purpose.
  
* You can also use any expression that evaluates to a vector of appropriate size.

In [3]:
df = DataFrame({'x': [1,2,3]})
df = df.assign(y = df['x'] + 3, z = df['x'] + 6)
display(df)

# Same assignment. A lambda expression is needed because the new DataFrame is not bound to a name yet
df = DataFrame({'x': [1,2,3]}).assign(y = lambda df: df['x'] + 3, z = lambda df: df['y'] + 3)
display(df)

# The same assignment without lambda expressions gives a wrong result because df refers to the previous (three-row) data frame
df = DataFrame({'x': [1,2,3,4]}).assign(y = df['x'] + 3, z = df['x'] + 6)
display(df)

df = DataFrame({'x': [1,2,3]})
# Here y and z are computed in terms of old x column as df has not changed yet 
display(df.assign(x = [0, 0, 0], y = df['x'] + 3, z = np.sin(df['x'])))
# Here y and z are computed in terms of new x column as lambda binds to the latest version of data frame
display(df.assign(x = [0, 0, 0], y = lambda df:  df['x'] + 3, z =  lambda df: np.sin(df['x'])))

Unnamed: 0,x,y,z
0,1,4,7
1,2,5,8
2,3,6,9


Unnamed: 0,x,y,z
0,1,4,7
1,2,5,8
2,3,6,9


Unnamed: 0,x,y,z
0,1,4.0,7.0
1,2,5.0,8.0
2,3,6.0,9.0
3,4,,


Unnamed: 0,x,y,z
0,0,4,0.841471
1,0,5,0.909297
2,0,6,0.14112


Unnamed: 0,x,y,z
0,0,3,0.0
1,0,3,0.0
2,0,3,0.0
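
The remark about `numpy.vectorize` can be illustrated with a small sketch. The scalar function `relu` and the column names are made up for this example; note that `numpy.vectorize` is a convenience wrapper around a Python loop, not a performance optimisation.

```python
import numpy as np
import pandas as pd

def relu(v):
    """Scalar function: works on a single number only."""
    return v if v > 0 else 0

# Wrap the scalar function so that it accepts whole columns.
# otypes fixes the output type instead of guessing it from the first value.
relu_vec = np.vectorize(relu, otypes=[float])

df = (pd.DataFrame({'x': [-1, 0, 2]})
      .assign(relu=lambda df: relu_vec(df['x']),     # custom universal function
              double=lambda df: 2 * df['relu']))     # depends on the new column

print(df)
```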


## IV. Renaming column and row indices

* There are two ways to specify what you want to do:
  * you can use the `index` and `columns` arguments directly
  * you can use the `mapper` and `axis` arguments
* The simplest way to rename labels is to specify the mapping as a dictionary.
* This is handy if you want to rename only a few labels, as you need to list only the changes.
* It is also possible to do it programmatically by passing a function that transforms each label.

In [4]:
df = DataFrame({'x': [0], 'y': [1], 'z': [2]})
display(df.rename(index = lambda x: x + 3, columns = lambda x: x.upper()))
display(df.rename(axis = 1, mapper = lambda x: x.upper()))
display(df.rename(columns = {'x': 'x_1', 'y': 'x_2'}))

Unnamed: 0,X,Y,Z
3,0,1,2


Unnamed: 0,X,Y,Z
0,0,1,2


Unnamed: 0,x_1,x_2,z
0,0,1,2


Alternatively, you can use `df.set_axis` if you do not know the current row or column labels but know what the new ones should be.
This may seem absurd, but it is useful for taming the output of a `df.groupby(...).aggregate(...)` construction.

In [5]:
df = DataFrame({'index': [0, 0, 1], 'x': [1,2,3], 'y':[1, 1, 1]})
display(df)
display(
    df
    .groupby('index')
    .aggregate({'x': ['min', 'max']})
)
display(
    df
    .groupby('index')
    .aggregate({'x': ['min', 'max']})
    .set_axis(['min', 'max'], axis=1)
)

Unnamed: 0,index,x,y
0,0,1,1
1,0,2,1
2,1,3,1


,x,x
index,min,max
0,1,2
1,3,3


index,min,max
0,1,2
1,3,3
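
As a possible alternative to `set_axis`, recent pandas versions (0.25 and later) also support named aggregation, which produces flat column labels directly. The sketch below assumes such a version; the output names `x_min` and `x_max` are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 0, 1], 'x': [1, 2, 3], 'y': [1, 1, 1]})

# Each keyword is an output column name; each value is (input column, function).
flat = (df
        .groupby('index')
        .agg(x_min=('x', 'min'), x_max=('x', 'max')))

print(flat)
```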


## V. Updating cells based on selection

* Methods `where` and `mask` let you specify which cells to modify and how to change them:
  * `where` keeps the original values where the selector is true and replaces the rest
  * `mask` replaces the values where the selector is true and keeps the rest
* To get predictable results, the selector must have the same shape as the DataFrame.
  * A boolean Series aligned with the row index also appears to work and selects whole rows.
  * The same logic does not work for columns.
* The replacement can be specified as a scalar, a matrix of the same shape or a callable.

In [6]:
# Understanding how to select
df = DataFrame({'x': [1,2,3], 'y': [4, 5, 6], 'z': [7, 8, 9]})
display(df.where(df > 4, 'Updates'))
display(df.mask (df > 4, 'Updates'))

# Selecting rows for updates works
display(df.where(df['x'] > 2, 'Updates'))

# Selecting columns for updates does not work
display(df.where(df.loc[1,:] <= 0, 'Updates'))
display(df.where(df.loc[1,:] <= 0, 'Updates'))

# Non-constant updates
display(df.where(df > 4, -df))
display(df.where(df > 4, lambda df: np.sin(df)))

Unnamed: 0,x,y,z
0,Updates,Updates,7
1,Updates,5,8
2,Updates,6,9


Unnamed: 0,x,y,z
0,1,4,Updates
1,2,Updates,Updates
2,3,Updates,Updates


Unnamed: 0,x,y,z
0,Updates,Updates,Updates
1,Updates,Updates,Updates
2,3,6,9


Unnamed: 0,x,y,z
0,Updates,Updates,Updates
1,Updates,Updates,Updates
2,Updates,Updates,Updates


Unnamed: 0,x,y,z
0,Updates,Updates,Updates
1,Updates,Updates,Updates
2,Updates,Updates,Updates


Unnamed: 0,x,y,z
0,-1,-4,7
1,-2,5,8
2,-3,6,9


Unnamed: 0,x,y,z
0,0.841471,-0.756802,7
1,0.909297,5.0,8
2,0.14112,6.0,9
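
The relation between `where` and `mask` can be checked on a small example: the two calls are complementary, so negating the selector turns one into the other. The selector may also be a callable, which is convenient inside a chain where the intermediate data frame has no name. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

kept = df.where(df > 2, 0)     # keep cells > 2, replace the rest with 0
masked = df.mask(df <= 2, 0)   # replace cells <= 2 with 0, keep the rest
print(kept.equals(masked))

# A callable selector is evaluated on the data frame it is applied to,
# so it sees the column added in the previous step.
chained = (df
           .assign(total=lambda d: d['x'] + d['y'])
           .where(lambda d: d > 4, 0))
print(chained)
```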


## VI. Resampling timeseries data

* Resampling is a convenience feature that works only with timeseries.
* The index must be of type `DatetimeIndex`, `PeriodIndex` or `TimedeltaIndex`.
* It is possible to aggregate (downsample) the data by specifying the period and the aggregation method:
  * the period is given by a number and a period unit such as `S`, `D`, `W`, `M`, `Q`, `Y`
  * the aggregation method is one of the resampler methods, e.g. `mean`, `sum`
* It is possible to interpolate (upsample) the data by specifying the period and the imputation method:
  * the period is given by a number and a period unit such as `S`, `D`, `W`, `M`, `Q`, `Y`
  * the imputation method is one of the resampler methods: `pad`, `ffill`, `bfill`, `interpolate`
* It is possible to define your own aggregation strategies by using the `apply` method.
* You are doing **something wrong** if you want to define a custom imputation method with resampling.
* The following sources provide further details:
  * http://benalexkeen.com/resampling-time-series-data-with-pandas/
  * https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
  * http://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#resampling


In [7]:
df = read_csv('stock_data.csv', index_col=0)
df.index = pd.to_datetime(df.index)

# Downsampling
display(df.resample('M').mean().head())
display(df.resample('Y').min().head())
display(df.resample('Y').apply(lambda x: list(x)).head())

# Upsampling
display(df.resample('Y').min().resample('Q').ffill().head())
display(df.resample('Y').min().resample('Q').apply(lambda x: list(x)).head())
display(df.resample('Y').min().resample('Q').interpolate(method='spline', order=2).head())

date,AAPL,GOOGL,MSFT
2015-01-31,103.673675,516.48,41.578615
2015-02-28,116.990626,539.176842,39.306026
2015-03-31,118.410691,566.354773,38.664114
2015-04-30,119.360429,549.896667,39.519143
2015-05-31,120.912695,548.0425,43.77455


date,AAPL,GOOGL,MSFT
2015-12-31,89.6018,499.24,36.9788
2016-12-31,86.3519,682.49,45.5564
2017-12-31,112.2815,800.62,59.7966
2018-12-31,148.15,984.32,84.5906


date,AAPL,GOOGL,MSFT
2015-12-31,"[103.8847, 100.9936, 99.3615, 99.977, 101.8702...","[532.6, 527.15, 520.5, 510.95, 501.51, 508.18,...","[42.5088, 42.2446, 42.2537, 41.8893, 42.5908, ..."
2016-12-31,"[97.3268, 100.3051, 95.3823, 93.5991, 93.4758,...","[762.2, 764.1, 750.37, 746.49, 747.8, 731.95, ...","[50.8345, 51.4054, 50.8345, 49.3185, 49.0096, ..."
2017-12-31,"[112.2815, 112.3299, 112.3978, 113.2317, 114.3...","[800.62, 809.89, 807.5, 814.99, 826.37, 827.07...","[60.3735, 60.0754, 59.7966, 59.9023, 60.3446, ..."
2018-12-31,"[167.6431, 169.978, 169.9879, 170.8746, 171.77...","[1053.02, 1073.93, 1097.09, 1103.45, 1111.0, 1...","[84.6595, 84.5906, 85.1165, 86.1683, 86.6991, ..."


date,AAPL,GOOGL,MSFT
2015-12-31,89.6018,499.24,36.9788
2016-03-31,89.6018,499.24,36.9788
2016-06-30,89.6018,499.24,36.9788
2016-09-30,89.6018,499.24,36.9788
2016-12-31,86.3519,682.49,45.5564


date,AAPL,GOOGL,MSFT
2015-12-31,[89.6018],[499.24],[36.9788]
2016-03-31,[],[],[]
2016-06-30,[],[],[]
2016-09-30,[],[],[]
2016-12-31,[86.3519],[682.49],[45.5564]


date,AAPL,GOOGL,MSFT
2015-12-31,89.6018,499.24,36.9788
2016-03-31,86.314631,552.339424,38.350515
2016-06-30,85.113024,600.005734,39.983081
2016-09-30,85.564964,643.126152,42.146157
2016-12-31,86.3519,682.49,45.5564
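
The example above depends on the file `stock_data.csv`. For readers without that file, the same ideas can be sketched on synthetic data. The data frame below is made up for illustration, and only the `D` and `W` period units are used.

```python
import numpy as np
import pandas as pd

# Two full weeks of daily data, Monday 2021-01-04 to Sunday 2021-01-17.
idx = pd.date_range('2021-01-04', periods=14, freq='D')
daily = pd.DataFrame({'value': np.arange(14.0)}, index=idx)

# Downsampling: one row per week, labelled by the Sunday that ends the week.
weekly = daily.resample('W').mean()

# Upsampling back to days: the new rows must be filled in somehow.
filled = weekly.resample('D').ffill()        # repeat the last known value
smooth = weekly.resample('D').interpolate()  # linear interpolation by default

print(weekly)
print(filled)
print(smooth)
```

Forward filling produces a step function, while interpolation produces values that change gradually between the two weekly means.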


## VII. Using sliding windows for data aggregation

* Sliding window computations are defined in two phases.
* First, you have to specify the window together with weights:
  * `window` – size of the window in samples
  * `win_type` – the weights assigned to each sample in the window
  * `center` – whether the aggregated value is placed at the centre of the window instead of its right edge
* Second, you have to specify how to aggregate the data over the window:
  * There are standard aggregation functions like `mean` and `var`.
  * Custom aggregations can be implemented with the `apply` method.
* Window types are described in [`scipy.signal window functions`](https://docs.scipy.org/doc/scipy/reference/signal.windows.html#module-scipy.signal.windows).

In [8]:
display(df.rolling(3).mean().head())
display(df.rolling(3, center = True).mean().head())
display(df.rolling(3, center = True, win_type ='triang').mean().head())
display(df.rolling(3, center = True).apply(lambda x: x.iloc[0] <= x.iloc[2], raw=False).head())

date,AAPL,GOOGL,MSFT
2015-01-02,,,
2015-01-05,,,
2015-01-06,101.413267,526.75,42.3357
2015-01-07,100.1107,519.533333,42.1292
2015-01-08,100.4029,510.986667,42.2446


date,AAPL,GOOGL,MSFT
2015-01-02,,,
2015-01-05,101.413267,526.75,42.3357
2015-01-06,100.1107,519.533333,42.1292
2015-01-07,100.4029,510.986667,42.2446
2015-01-08,102.308533,506.88,42.618133


date,AAPL,GOOGL,MSFT
2015-01-02,,,
2015-01-05,101.30835,526.85,42.312925
2015-01-06,99.9234,519.775,42.160325
2015-01-07,100.296425,510.9775,42.155775
2015-01-08,102.19895,505.5375,42.6113


date,AAPL,GOOGL,MSFT
2015-01-02,,,
2015-01-05,0.0,0.0,0.0
2015-01-06,0.0,0.0,0.0
2015-01-07,1.0,0.0,1.0
2015-01-08,1.0,0.0,1.0
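
The two phases can also be tried on a small synthetic data frame that does not need the stock data. The values and the column name `v` are made up for illustration. The sketch suggests that, for an unweighted window of three samples, `center=True` only moves the result one position up compared to the default right-edge placement.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame({'v': [1.0, 3.0, 6.0, 10.0, 15.0]})

# Phase 1: rolling(...) defines the window. Phase 2: mean()/apply() aggregates.
trailing = s.rolling(3).mean()               # value placed at the right edge
centered = s.rolling(3, center=True).mean()  # value placed at the middle sample

# Custom aggregation: range of the window. raw=True passes a NumPy array.
spread = s.rolling(3).apply(lambda w: w.max() - w.min(), raw=True)

print(trailing)
print(centered)
print(spread)
```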
