<center> <h1> Coding principles for rapid machine learning experimentation </h1> </center>

<br />

According to François Chollet, the creator of Keras, who at his peak placed 17th on Kaggle's cumulative leaderboard, what allows people to win competitions and publish papers is not having the best ideas from the start, but iterating over ideas often and quickly.

<br />

<center>

<figure>

<img src="https://i.ibb.co/17ghyM3/Screenshot-6.png" width='400px'>
    
<br />
    
<figcaption>Fig. 1 François Chollet's Twitter Feed  <a href="https://twitter.com/fchollet/status/1113478477116608512">Source</a></figcaption>

</figure>

</center>

<br />

It is therefore important to optimize your workflow and tools so that you can experiment and iterate over ideas quickly.

<br />

<center>

<figure>

<img src="https://i.ibb.co/Jm6DM4s/experimentation-loop.jpg" width='400px'>
    
<br />

<figcaption>Fig. 2 François Chollet's Cycle of Experimentation  <a href="https://twitter.com/fchollet/status/1113479155847270400">Source</a></figcaption>

</figure>

</center>

<br />

The goal of this series of Kaggle kernels is to explore coding patterns and design principles that speed up the machine learning workflow and enable rapid experimentation.

A regular machine learning workflow has several key components, some of which are listed below:

* Exploratory Data Analysis
* Feature Transformation / Engineering
* Validation Strategy
* Model Building
* Experiment Tracking
* Model Tuning / Ensembling
    
In this first post, we will look at writing code for feature transformations, creating validation strategies and data versioning.


## Feature Engineering

Functions are the most suitable unit of code for feature transformations and reusability. When written with the goal of creating reusable, portable, repeatable, and chainable sets of data transformations, they work very well for machine learning workflows.


Let us understand how to tackle these design goals:

### Naming and Documenting functions

First, it is important to name your functions appropriately and document them following a standard. I generally use a custom version of the sklearn docstring template, which can be found at the following [link](https://gist.github.com/jakevdp/3808292).

The must-have categories in a docstring are:

* A function description
* Parameters (with expected types)
* Returns (with the object type)

Based on the complexity of the function, it can either have a single line description or a multi-line description.

I like adding another section to the documentation: __Suggested Imports__. It makes the function easier to copy across notebooks and scripts in a project, without having to move scripts around just to import a couple of functions, and without missing any important import statements. It also tells me which libraries the function depends on. I would still recommend putting all imports at the top of a script, as PEP 8 recommends.

Also, it may be beneficial to note down __Example Usage__ patterns of the function, in the rare cases where the function has complex use cases.

Let us now look at how to document a function:


In [None]:
def example_fn(param_one, param_two):

    ''' 
    This is an example function showing how to write
    docstrings

    Parameters
    ----------
    param_one: pd.DataFrame
               Description of the param_one argument
    
    param_two: int
               Description of the param_two argument

    Returns
    -------
    return_data: pd.DataFrame
                 An example description of the returned      
                 object

    Suggested Imports
    ----------------
    import numpy as np
    import pandas as pd

    Example Usage
    -------------
    transformed_data = example_fn(param_one, param_two)

    '''
    
    pass

* This lets us access the docstring from anywhere we import, define, or call `example_fn`:

In [None]:
help(example_fn)

### Standardizing the UX

One of the key ingredients in making a function familiar and easy to use is standardizing its inputs and outputs across the board for a specific use case. This ensures that the user or developer experience is seamless.

The scikit-learn API has standardized `.fit()`, `.transform()`, and `.predict()` methods across its library, and it is a fantastic example of how a standardized UX can lead to high developer productivity and an extremely low barrier to entry.

For functions that perform feature engineering, it is important to always take a dataframe and some auxiliary arguments as input and return a dataframe, as shown in the figure below.

<br />

<center>

<figure>

<img src="https://i.ibb.co/QC1J5WQ/feature-transformation-functions.png" width='400px'/>
    
<br />
    
<figcaption>Fig. 3 Every transformation function should take a dataframe as input and return a dataframe</figcaption>

</figure>

</center>

This design is independent of pandas and can be ported to other data processing frameworks such as Spark. Logic written using pandas can then be ported to run on a Spark-based parallel processing system using the `Koalas` API, while following a similar design pattern.
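As a minimal sketch of the dataframe-in, dataframe-out convention, the two functions below each take a dataframe plus a few auxiliary arguments and return a dataframe. The function names and the toy columns are hypothetical and only serve as illustration. Because the inputs and outputs are standardized, the functions can be chained with pandas' `DataFrame.pipe`:

```python
import pandas as pd


def add_sales_per_customer(input_data: pd.DataFrame,
                           sales_col: str,
                           customers_col: str) -> pd.DataFrame:
    # Hypothetical transformation: dataframe in, dataframe out
    data_frame = input_data.copy()
    data_frame['sales_per_customer'] = (
        data_frame[sales_col] / data_frame[customers_col]
    )
    return data_frame


def add_high_sales_flag(input_data: pd.DataFrame,
                        sales_col: str,
                        threshold: float) -> pd.DataFrame:
    # Hypothetical transformation with the same input/output contract
    data_frame = input_data.copy()
    data_frame['high_sales'] = data_frame[sales_col] > threshold
    return data_frame


# Made-up toy data, not taken from any real dataset
toy_data = pd.DataFrame({'sales': [100, 250, 40], 'customers': [10, 25, 8]})

# Each step receives the dataframe returned by the previous step
transformed = (
    toy_data
    .pipe(add_sales_per_customer, 'sales', 'customers')
    .pipe(add_high_sales_flag, 'sales', threshold=90)
)
print(transformed)
```

Since every step has the same contract, steps can be reordered, removed, or reused in another notebook without changing the surrounding code.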

### Pure Functions

Pure functions have two important properties:
* If given the same arguments, the function must return the same value

* When the function is evaluated, there are no side effects (no I/O streams, no mutation of static and non-local variables)

<center>

<figure>

<img src="https://i.ibb.co/pPBzYKs/pure-functions.png" width='400px'>
    
<br />

<figcaption>Fig. 4 A visual representation of pure functions  <a href="https://livebook.manning.com/book/get-programming-with-scala/chapter-21/v-4/68">Source</a></figcaption>

</figure>

</center>

<br />

When writing functions for data transformations, we cannot always write pure functions, especially given the limits of system memory: avoiding mutation of the input requires creating an entirely new copy of the dataframe.

Therefore, it is useful to have a boolean argument `inplace`, which lets the developer decide whether or not to mutate the dataframe, depending on the situation.
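The trade-off can be sketched with a small, hypothetical helper (`add_doubled_column` is not part of any library). With `inplace=False` the function works on a copy and leaves its input untouched, at the cost of extra memory. With `inplace=True` it mutates and returns the very object it was given:

```python
import pandas as pd


def add_doubled_column(input_data: pd.DataFrame,
                       col: str,
                       inplace: bool = False) -> pd.DataFrame:
    # Hypothetical helper used only for illustration
    data_frame = input_data if inplace else input_data.copy()
    data_frame[f'{col}_doubled'] = data_frame[col] * 2
    return data_frame


original = pd.DataFrame({'x': [1, 2, 3]})

# inplace=False: the input is left untouched (no side effects)
copied = add_doubled_column(original, 'x')
print('x_doubled' in original.columns)  # prints False

# inplace=True: the input itself is mutated and returned
mutated = add_doubled_column(original, 'x', inplace=True)
print(mutated is original)  # prints True
```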


### Type Hinting

Type hinting was introduced in `PEP 484` with `Python 3.5`. I therefore do not recommend relying on it completely for now, unless you are sure that all of the libraries in your workflow are compatible with `Python 3.5` and above.

The basic structure of type hinting in python is as follows:

    def fn_name(arg_name: arg_type) -> return_type:
        
        pass
        
* Every argument in the function is followed by a `:` and the data type of the argument.

* After the argument list, use the `->` symbol to indicate the return type, which could be `int`, `dict`, or any other Python data type.

* You can also represent more complex cases, such as nested or optional data types, using the `typing` module in Python.

Below is an example that uses the `typing` module for type hinting:

In [None]:
from typing import List

def list_squared(input_list: List[int]) -> List[int]:
    return [element**2 for element in input_list]

list_squared([2, 4, 6, 8])
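Nested and optional types follow the same pattern. The sketch below uses a hypothetical function, `group_by_length`, to show `Optional` for an argument that may be `None` and a nested `Dict[int, List[str]]` as the return type:

```python
from typing import Dict, List, Optional


def group_by_length(words: List[str],
                    min_length: Optional[int] = None) -> Dict[int, List[str]]:
    # Map each word length to the list of words of that length,
    # skipping words shorter than min_length when it is given
    groups: Dict[int, List[str]] = {}
    for word in words:
        if min_length is not None and len(word) < min_length:
            continue
        groups.setdefault(len(word), []).append(word)
    return groups


grouped = group_by_length(['a', 'to', 'be', 'cat'], min_length=2)
print(grouped)
```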

Not all of the principles stated above are necessary, but they are important to consider when designing functions for feature transformation / engineering.

Let us now use these principles to design a function that engineers date-based features. We will be using the `Rossmann Store Sales` dataset.

## Reading in the dataset

In [None]:
import numpy as np
import pandas as pd

In [None]:
train_data = pd.read_csv('../input/rossmann-store-sales/train.csv', low_memory=False)

In [None]:
train_data.head()

## Putting the principles in practice

We will now write a function that engineers date-based features, which can be used in downstream machine learning training tasks and are especially suited to tree-based models.

In [None]:
def generate_date_features(input_data: pd.DataFrame,
                           date_col: str,
                           use_col_name: bool=True,
                           inplace: bool=False) -> pd.DataFrame:
    
    ''' 
    generates date features from a date column, such as,
    year, month, day, etc. which can be used to train
    Machine Learning models. 

    Parameters
    ----------
    input_data : pd.DataFrame
                 The input data frame
          
    date_col : str
               The column name of the date column for which
               the features have to be generated
               
               Ensure that the column is a pandas datetime
               object. And also has year, month and day in
               it.
               
    use_col_name : bool, default True
                   If True, the column name is used as a prefix
                   for the name of each feature that is created
    
    inplace : bool, default False
              If False, a new data frame object is returned.
              Else, the same data frame passed as input is
              modified

    Returns
    -------
    return_data: pd.DataFrame
                 The returned dataframe which has the appended
                 date based features

    Suggested Imports
    ----------------
    import numpy as np
    import pandas as pd

    Example Usage
    -------------
    data_with_date_features = generate_date_features(data, 'date')

    '''
    
    # Inplace or Copy
    if inplace:
        data_frame = input_data
    else:
        data_frame = input_data.copy()
        
    # Use column name
    if use_col_name:
        new_col_name = f'{date_col}_'
    else:
        new_col_name = ''
        
    # ensure that the column is converted to a 
    # pandas datetime object 
    data_frame[date_col] = pd.to_datetime(data_frame[date_col])
    
    # Generate date features
    data_frame[f'{new_col_name}year'] = data_frame[date_col].dt.year
    data_frame[f'{new_col_name}month'] = data_frame[date_col].dt.month
    data_frame[f'{new_col_name}day'] = data_frame[date_col].dt.day
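    # Series.dt.weekofyear was removed in pandas 2.0;
    # dt.isocalendar().week (pandas >= 1.1) returns the ISO week number
    data_frame[f'{new_col_name}weeknum'] = data_frame[date_col].dt.isocalendar().week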
    data_frame[f'{new_col_name}dayofweek'] = data_frame[date_col].dt.dayofweek
    data_frame[f'{new_col_name}quarter'] = data_frame[date_col].dt.quarter
    
    return data_frame

As can be seen above, the `generate_date_features` function is portable, reusable, and flexible, and can work across various data transformation pipelines.

You can design `directed acyclic graphs` (DAGs) that execute specific Python functions with upstream and downstream dependencies to generate your final transformed dataset.

Such a DAG can also regenerate the output of the feature engineering pipeline whenever new raw data arrives. I would definitely recommend checking out the `Airflow` package, which lets us write flexible DAGs to manage ETL workloads easily.
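To make the idea of running functions in dependency order concrete, here is a minimal sketch that uses only the standard library's `graphlib.TopologicalSorter` (Python 3.9+). It does not use the Airflow API. The step names, the `run_pipeline` helper, and the toy data are all hypothetical, and each step simply receives the dataframe returned by the previous one:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

import pandas as pd


def parse_dates(data_frame: pd.DataFrame) -> pd.DataFrame:
    data_frame = data_frame.copy()
    data_frame['Date'] = pd.to_datetime(data_frame['Date'])
    return data_frame


def add_year(data_frame: pd.DataFrame) -> pd.DataFrame:
    data_frame = data_frame.copy()
    data_frame['year'] = data_frame['Date'].dt.year
    return data_frame


def add_month(data_frame: pd.DataFrame) -> pd.DataFrame:
    data_frame = data_frame.copy()
    data_frame['month'] = data_frame['Date'].dt.month
    return data_frame


# Hypothetical step registry and dependency graph:
# each key maps to the set of steps that must run before it
steps = {'parse_dates': parse_dates, 'add_year': add_year, 'add_month': add_month}
dependencies = {
    'parse_dates': set(),
    'add_year': {'parse_dates'},
    'add_month': {'parse_dates'},
}


def run_pipeline(data_frame: pd.DataFrame, steps: dict, dependencies: dict) -> pd.DataFrame:
    # static_order() yields every step after all of its dependencies
    for step_name in TopologicalSorter(dependencies).static_order():
        data_frame = steps[step_name](data_frame)
    return data_frame


raw_data = pd.DataFrame({'Date': ['2015-07-31', '2014-01-02']})
final_data = run_pipeline(raw_data, steps, dependencies)
print(final_data)
```

A workflow manager adds scheduling, retries, and monitoring on top of this basic ordering, which is where a dedicated tool becomes worthwhile.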

In [None]:
new_features = generate_date_features(train_data, date_col='Date', use_col_name=False)

In [None]:
new_features.head()

## Data Versioning

It is important to keep track of data that is generated from raw sources, so that it becomes easier to reproduce results, machine learning models, bugs, or any anomalies found in the machine learning pipeline.

There are several ways to keep track of data. Two such ways are:

* Saving copies of the modified datasets
* Creating new columns with a standardized naming scheme to track validation sets, modified and engineered features


In this particular kernel, I will discuss the latter, which I believe is a strategy better suited to a data scientist than to a machine learning engineer.
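One possible naming scheme is sketched below. The `fe_` prefix for engineered features and the `kfold` column name are assumptions made for this example, not an established standard, and the toy values are made up. The point is only that a consistent convention lets you separate raw columns, engineered features, and validation bookkeeping with simple string checks:

```python
import pandas as pd

# Hypothetical convention: engineered features start with 'fe_',
# and the validation fold is stored in a column named 'kfold'
FEATURE_PREFIX = 'fe_'
FOLD_COL = 'kfold'

data = pd.DataFrame({'Sales': [500, 600, 900], 'Customers': [50, 60, 100]})

# Engineered feature, tagged through its name
data[f'{FEATURE_PREFIX}sales_per_customer'] = data['Sales'] / data['Customers']

# Validation bookkeeping, kept alongside the data
data[FOLD_COL] = [0, 0, 1]

engineered_cols = [col for col in data.columns if col.startswith(FEATURE_PREFIX)]
raw_cols = [col for col in data.columns
            if not col.startswith(FEATURE_PREFIX) and col != FOLD_COL]

print(raw_cols)
print(engineered_cols)
```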

## Validation Strategy

If you go to scikit-learn's documentation for the `KFold` class, you will see a pattern that most data scientists and ML engineers use when performing validation. This pattern is shown below:

    kf = KFold(n_splits=2)
    
    for train_index, test_index in kf.split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
* When you start using target-based features such as target encoding in your model, that code must go inside this loop.

* Every time you build your model on a fold, that code must go inside this loop.

This restricts the freedom of a data scientist and makes the results and code harder to keep track of.

This is where a solution comes in that I first came across in the book [Approaching (Almost) Any Machine Learning Problem](https://www.amazon.com/Approaching-Almost-Machine-Learning-Problem-ebook/dp/B089P13QHT) by the 4x Kaggle Grandmaster [Abhishek Thakur](https://www.kaggle.com/abhishek).

I recommend a solution based on his approach, where we create a new column that tracks the fold number to which each record belongs. This makes it possible to train models on different folds in parallel, even on machines that have no network connection between them. Every team member could build a model for one fold. This approach to validation unshackles the data scientist and increases their productivity.
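Before moving to the time-based version, the basic idea can be sketched with a plain random split. The helper below is hypothetical and uses only NumPy and pandas. It assigns fold numbers round-robin, shuffles them with a fixed seed, and stores them in a `kfold` column, so that anyone holding the dataframe can select a fold without re-running the split:

```python
import numpy as np
import pandas as pd


def create_random_kfolds(input_data: pd.DataFrame,
                         num_folds: int = 5,
                         seed: int = 42,
                         inplace: bool = False) -> pd.DataFrame:
    # Hypothetical helper: store the fold number of every row in a column
    data_frame = input_data if inplace else input_data.copy()
    rng = np.random.default_rng(seed)
    # Round-robin fold ids (0, 1, ..., num_folds - 1, 0, 1, ...), then shuffled
    fold_ids = np.arange(len(data_frame)) % num_folds
    data_frame['kfold'] = rng.permutation(fold_ids)
    return data_frame


toy_data = pd.DataFrame({'x': range(10)})
toy_folds = create_random_kfolds(toy_data, num_folds=5)

# Once the column exists, each fold can be handled independently
for fold in range(5):
    train = toy_folds[toy_folds['kfold'] != fold]
    val = toy_folds[toy_folds['kfold'] == fold]
    print(fold, len(train), len(val))
```

Saving the dataframe with its `kfold` column is what makes the split shareable: the fold assignment travels with the data rather than with the code that produced it.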

In [None]:
from typing import Optional, List

def create_time_kfolds_days(input_data: pd.DataFrame,
                            date_col: str, num_days: int,
                            group_cols: Optional[List[str]] = None,
                            num_folds: int = 5, inplace: bool = False
                           ) -> pd.DataFrame:
    ''' 
    Creates a kfold column based on time, which allows us to validate our
    machine learning models. This works for when the forecasts are at
    the day level.

    Parameters
    ----------
    input_data : pd.DataFrame
                 The input data frame
           
    date_col : str
               The column name of the date column for which
               the features have to be generated
               
               Ensure that the column is a pandas datetime
               object. And also has year, month and day in
               it.
          
    num_days: int
              The number of days that have to be included
              in each validation fold
               
    group_cols : List[str] or None, default None
                 If a list of strings is passed, the folds
                 are created for the last num_days by grouping
                 the columns passed.
                 
                 If None, the folds are created by just using
                 the date_col
                 
    num_folds : int, default 5
               The number of folds to create
    
    inplace : bool, default False
              If False, a new data frame object is returned.
              Else, the same data frame passed as input is
              modified

    Returns
    -------
    return_data: pd.DataFrame
                 The returned dataframe which has the kfold
                 column appended to it.

    Suggested Imports
    ----------------
    import numpy as np
    import pandas as pd
    from typing import Optional, List

    Example Usage
    -------------
    1) Accessing training and validation datasets for a particular fold number
    
    data_with_kfold = create_time_kfolds_days(data, 'date', 28, group_cols=['store', 'item'])
    train_for_fold_1 = data_with_kfold[data_with_kfold['kfold'] < 1]
    val_for_fold_1 = data_with_kfold[data_with_kfold['kfold'] == 1]

    '''
    
    # Inplace or Copy
    if inplace:
        data_frame = input_data
    else:
        data_frame = input_data.copy()
        
    # ensure that the column is converted to a 
    # pandas datetime object 
    data_frame[date_col] = pd.to_datetime(data_frame[date_col])
    
    # Sort all observations by the group columns (if any)
    # and then by date; this only affects the row order
    sort_cols = list(group_cols or []) + [date_col]
    data_frame.sort_values(by=sort_cols, inplace=True)
    data_frame.reset_index(drop=True, inplace=True)
    
    max_date = data_frame[date_col].max()

    # Build one window of num_days consecutive days per fold,
    # starting from the most recent date and moving backwards
    date_ranges = []
    
    for fold_num in range(1, num_folds + 1):
        window_end = max_date - pd.Timedelta(days=(fold_num - 1) * num_days)
        window_start = window_end - pd.Timedelta(days=num_days - 1)
        date_ranges.append(pd.date_range(window_start, window_end))
    
    # Rows older than every window keep the value -1
    data_frame['kfold'] = -1
    
    # The oldest window becomes fold 0 and the most recent
    # window becomes fold num_folds - 1
    date_folds = zip(reversed(date_ranges), range(num_folds))
    for date_range, fold_num in date_folds:
        data_frame.loc[data_frame[date_col].isin(date_range), 'kfold'] = fold_num
            
    return data_frame

In [None]:
kfold_data = create_time_kfolds_days(train_data, 'Date', 48, ['Store'])

For this particular problem, we can just use a single validation period: the most recent 48 days (fold 4), with all earlier data used for training.

In [None]:
train_kfold = kfold_data[kfold_data.kfold < 4]

val_kfold = kfold_data[kfold_data.kfold == 4]

In [None]:
train_kfold.Date.min()

In [None]:
train_kfold.Date.max()

In [None]:
val_kfold.Date.min()

In [None]:
val_kfold.Date.max()

To validate a model on fold number `k`, you can extract your train and validation sets using the code below:

    train_kfold = kfold_data[kfold_data.kfold < k]
 
    val_kfold = kfold_data[kfold_data.kfold == k]

This now allows us to build and validate models on different, disconnected systems, without worrying about differences in how each machine generates random numbers from a seed.

We can fully reproduce our results on each of the validation sets.
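The fold-`k` split shown above can be wrapped in a small helper and applied to every fold in turn. This is a minimal sketch on made-up toy data: `split_for_fold` is a hypothetical name, and the `kfold` values are written by hand, with `-1` marking rows older than every validation window. Because training uses only folds earlier than `k`, the training window expands as `k` grows:

```python
import pandas as pd


def split_for_fold(kfold_data: pd.DataFrame, k: int, fold_col: str = 'kfold'):
    # Train on everything before fold k, validate on fold k
    train = kfold_data[kfold_data[fold_col] < k]
    val = kfold_data[kfold_data[fold_col] == k]
    return train, val


# Toy data with hand-written fold numbers, for illustration only
toy_kfold_data = pd.DataFrame({
    'Date': pd.date_range('2015-01-01', periods=8),
    'kfold': [-1, -1, 0, 0, 1, 1, 2, 2],
})

split_sizes = []
for k in range(3):
    train, val = split_for_fold(toy_kfold_data, k)
    split_sizes.append((k, len(train), len(val)))
    # Every training date precedes every validation date
    assert train['Date'].max() < val['Date'].min()

print(split_sizes)
```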