%md #MA707 Report - Class Demonstrations (spring 2019, DataHeroes)

## Introduction

This notebook demonstrates the classes that the feature union and pipeline classes use during pre-processing of the dataset. Each class is fitted with the relevant variables from the cleaned dataset and returns a transformed dataframe, which is then passed to the feature union.

The classes to be used during the pre-processing of the dataset are:
  - `CreateTargetVarDF`: Assigns the target variable from the merged datasets.
  - `CreateDatetimeVarsDF`: Creates new variables from the existing `datetime` variable by splitting it into components such as day, week, year and time.
  - `CreateLagVarsDF`: Creates lagged versions of all the feature variables, which are used to train the model against the un-lagged target variable.
  - `DropNaRowsDF`: Drops every row that contains any missing value (`NaN`).
  - `CountVectColDF`: Splits the content of a feature variable into individual tokens and counts how often each token occurs in each observation.
  - `TfidfVectColDF`: Splits the content of a feature variable into individual tokens, counts how often each token occurs in each observation, and multiplies each count by the inverse of the token's frequency across all observations of the variable.
  
***Note: These classes have been coded and explained in Notebook 0.2 Feature Creation.***

## Contents
1. Setup
2. Class Demonstrations
3. Summary

## 1. Setup

In [6]:
%run "./0.1 Raw dataset (inc)"

In [7]:
%run "./0.2 Feature creation (inc)"

In [8]:
%run "./0.3 Feature selection (inc)"

In [9]:
%run "./0.4 Estimators (inc)"

In [10]:
%run "./0.5 Pipeline functions (inc)"

## 2. Class demonstrations

The following subsections demonstrate the classes used by the `FeatureUnion` and `Pipeline` classes to create a feature-target dataframe.

### 2.1 `CreateTargetVarDF`

This code fits the class `CreateTargetVarDF`, defined in the notebook `./0.2 Feature creation (inc)`, to the dataset `bci_dual_pdf`. The class takes the variable `bci_5tc` as its parameter and assigns it as the target variable.

In [15]:
%python
CreateTargetVarDF(var='bci_5tc') \
  .fit_transform(bci_dual_pdf) \
  .head()

Using method chaining, the class `CreateTargetVarDF` is fitted to the dataset `bci_dual_pdf` and, through the `fit_transform` method, returns the `bci_5tc` column as the target variable.
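
The behaviour described above can be approximated with a small stand-in transformer. This is only a sketch: the class name `TargetColumnSketch`, the output column name `target` and the toy values are hypothetical, and the actual `CreateTargetVarDF` in notebook 0.2 may differ in its details.

```python
import pandas as pd

class TargetColumnSketch:
    """Hypothetical stand-in for CreateTargetVarDF (not the notebook's class)."""

    def __init__(self, var):
        self.var = var

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit exists for pipeline compatibility.
        return self

    def transform(self, X):
        # Return the chosen column as a one-column dataframe named 'target'.
        return X[[self.var]].rename(columns={self.var: 'target'})

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

# Toy data; the values are made up for illustration.
toy_pdf = pd.DataFrame({'bci_5tc': [10.0, 11.5, 9.8], 'bci': [100, 105, 98]})
target_pdf = TargetColumnSketch(var='bci_5tc').fit_transform(toy_pdf)
```

Returning a dataframe rather than a series keeps the result easy to concatenate with the feature columns later on.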

### 2.2 `CreateDatetimeVarsDF`

The class `CreateDatetimeVarsDF`, defined in the notebook `./0.2 Feature creation (inc)`, is fitted to the `datetime` variable `date` and uses the `.dt` accessor to derive new variables for the year, month, day, day of the year, week of the year and weekday.

In [19]:
bci_dual_pdf.info()

In [20]:
%python
CreateDatetimeVarsDF(var='date',
                     var_list=['year','month','day',
                               'dayofyear','weekofyear','weekday']) \
  .fit_transform(bci_dual_pdf) \
  .head()

The output shows the six new variables created from the `datetime` column `date`.
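
As a rough illustration of how the `.dt` accessor produces such variables, the sketch below builds the same six columns on a toy dataframe. The dates and the name `date_vars_pdf` are made up for illustration. It assumes pandas 1.1 or later, where the week of the year is read through `isocalendar()`; the notebook's class may obtain it differently.

```python
import pandas as pd

# Toy data; the dates are made up for illustration.
toy_pdf = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-03-15'])})

# Each attribute of the .dt accessor yields one new integer column.
parts = ['year', 'month', 'day', 'dayofyear', 'weekday']
date_vars_pdf = pd.DataFrame(
    {part: getattr(toy_pdf['date'].dt, part) for part in parts}
)

# Week of the year, taken from the ISO calendar (pandas 1.1 or later).
date_vars_pdf['weekofyear'] = toy_pdf['date'].dt.isocalendar().week.astype(int)
```

Here `weekday` counts from Monday as 0, so 1 January 2019 (a Tuesday) gets the value 1.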

### 2.3 `CreateLagVarsDF`

The section below creates lagged versions of all the variables in the dataset `bci_coal_pdf` other than the target variable `bci_5tc`.
The class `CreateLagVarsDF` takes two parameters: `var_list`, the variables to lag, and `lag_list`, the numbers of rows to lag by, given here as `range(0,2)` (lags of 0 and 1). It returns the dataframe `bci_coal_pdf` with the lagged versions of the variables concatenated to the existing dataframe.

In [24]:
%python
CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7',
                          'cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb',
                          'ice_sb3','p3a_iv','ice_kc3','c5',
                          'p2a_03','cme_lc2','content','cme_sm3',
                          'ice_tib4','bci','tags','cme_ln1','cme_s2'],
                lag_list=range(0,2)) \
  .fit_transform(bci_coal_pdf) \
  .loc[:5,['bci_lag0','bci_lag1']] \
  .head()

The output shows the first 5 rows of the two lagged versions of the variable `bci`: `bci_lag0` and `bci_lag1`, which are lagged by zero and one row respectively.
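
The lagging itself can be sketched with `Series.shift`. This assumes `CreateLagVarsDF` works along these lines, which the notebook 0.2 code would need to confirm. The toy values and the name `lagged_pdf` are made up for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy_pdf = pd.DataFrame({'bci': [1.0, 2.0, 3.0, 4.0]})

# One shifted copy per lag, named <var>_lag<k> to match the column names above.
lagged_pdf = pd.concat(
    [toy_pdf['bci'].shift(lag).rename(f'bci_lag{lag}') for lag in range(0, 2)],
    axis=1,
)
```

A lag of 0 leaves the column unchanged, while a lag of 1 moves every value down one row and leaves `NaN` in the first row.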

In [26]:
bci_dual_pdf \
  .loc[:,['date','bci']] \
  .head()

### 2.3 `DropNaRowsDF`

Create a pipeline `xfm_pipe` with a single step, `lag`, which is an instance of the class `CreateLagVarsDF`. It takes all the variables in `var_list` and concatenates three lagged versions of each (lags 0, 1 and 2) back into the dataframe.

Fit the pipeline to the dataframe `bci_pdf` and return the transformed dataframe with the lagged variables concatenated to it.

In [29]:
%python 
from sklearn.pipeline import Pipeline
xfm_pipe = Pipeline(
  steps=[('lag',CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7','cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb','ice_sb3','p3a_iv','ice_kc3','c5',
                                          'p2a_03','cme_lc2','content','cme_sm3','ice_tib4','bci','tags','cme_ln1','cme_s2'],
                                lag_list=range(0,3)))
        ])
xfm_pipe \
  .fit_transform(bci_pdf) \
  .loc[:,['bci_lag0',
          'bci_lag1',
          'bci_lag2']] \
  .head(3)

The output shows the `bci` variable lagged by 0, 1 and 2 rows, with each column labelled accordingly. Each lag moves the entries of `bci` down by one row: the first value, `3390.0`, appears at the second index in `bci_lag1` and at the third in `bci_lag2`, and the vacated leading entries are filled with `NaN`. These lagged predictor values will be used to train the model against the non-lagged target variable.

As in the code above, a new step `row`, an instance of the class `DropNaRowsDF`, is added to the existing pipeline `xfm_pipe`. With `how='any'`, `DropNaRowsDF` deletes every row in which any entry is `NaN` or missing.

Fit the pipeline to the dataframe `bci_pdf` and print the first 2 rows of the lagged versions of `bci`.

In [32]:
%python 
from sklearn.pipeline import Pipeline
xfm_pipe = Pipeline(
  steps=[('lag',CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7','cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb','ice_sb3','p3a_iv','ice_kc3','c5',
                                          'p2a_03','cme_lc2','content','cme_sm3','ice_tib4','bci','tags','cme_ln1','cme_s2'],
                                lag_list=range(0,3))),
         ('row',DropNaRowsDF(how='any'))
        ])
xfm_pipe \
  .fit_transform(bci_pdf) \
  .loc[:,['bci_lag0',
          'bci_lag1',
          'bci_lag2']] \
  .head(2)

Because `xfm_pipe` is a pipeline, fitting it first fits the class `CreateLagVarsDF`, which transforms the data and returns a dataframe with all the lagged variables concatenated to the existing ones. This transformed dataframe is then passed to the second class, `DropNaRowsDF`, which drops the rows with any missing or `NaN` values in the columns.

The output shows the three lagged versions of the variable `bci`.
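
The order in which the two steps run can be reproduced on toy data with scikit-learn's `FunctionTransformer`. This is a sketch only: the helpers `add_lags` and `drop_na_rows` are hypothetical stand-ins for `CreateLagVarsDF` and `DropNaRowsDF`, and the values are made up for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_lags(pdf):
    # Hypothetical helper: lags 0, 1 and 2 of every column via shift().
    return pd.concat(
        [pdf[col].shift(lag).rename(f'{col}_lag{lag}')
         for col in pdf.columns for lag in range(0, 3)],
        axis=1,
    )

def drop_na_rows(pdf):
    # Hypothetical helper: drop a row if any of its entries is NaN.
    return pdf.dropna(how='any')

toy_pipe = Pipeline(steps=[
    ('lag', FunctionTransformer(add_lags)),
    ('row', FunctionTransformer(drop_na_rows)),
])

# Toy data; the values are made up for illustration.
toy_pdf = pd.DataFrame({'bci': [1.0, 2.0, 3.0, 4.0, 5.0]})
result_pdf = toy_pipe.fit_transform(toy_pdf)
```

With a maximum lag of 2, the first two rows contain `NaN` after the `lag` step, so the `row` step leaves three of the five rows.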

### 2.4 `CountVectColDF`

Define a class `CountVectColDF` with the base class `CountVectorizer`. It takes the parameter `col_name`, the name of the column the class is fitted to. Its parameter `stop_words` defaults to the list of `ENGLISH_STOP_WORDS`, and `add_stop_words` takes additional stop words to append to that list. `super()` is used to call the `__init__` method of the base class `CountVectorizer`, which splits the text into tokens and counts the occurrences of each token in each document, that is, in each row of that column.

The `fit` method fits the base class to the named column. The `transform` method then transforms that column and returns a dataframe, built with `pd.DataFrame`, whose columns are named after the features created by the fitted `CountVectorizer`.

In [36]:
%python 
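import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
class CountVectColDF(CountVectorizer):
  def __init__(self,col_name,prefix='cnt_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=[]
              ):
    # Combine the default stop words with any extra ones supplied by the caller
    stop_words_list = stop_words+add_stop_words
    self.col_name = col_name
    self.prefix   = prefix
    super().__init__(stop_words=stop_words_list)
    return
  
  def fit(self,X,y=None):
    # Learn the vocabulary from the selected text column only
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # Dense token counts, one prefixed column per vocabulary entry.
    # Note: scikit-learn 1.2 and later replace get_feature_names() with get_feature_names_out()
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        columns=[self.prefix+feature_name for feature_name in super().get_feature_names()]
                       )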

In this code, the `tags_coal` column of the dataframe `bci_dual_pdf` is passed as the parameter `col_name` to the class `CountVectColDF` defined above. The word `2012` is passed through `add_stop_words`, so it is appended to the list of `ENGLISH_STOP_WORDS`. The class is fitted to `bci_dual_pdf` and transforms it to create new features with the prefix `cnt_`; the code then prints the column names of the resulting dataframe.

In [38]:
%python
CountVectColDF(col_name='tags_coal',
               prefix='cnt_',
               add_stop_words=['2012']) \
  .fit(bci_dual_pdf) \
  .transform(bci_dual_pdf) \
  .head() \
  .columns

The output is the list of all the feature names, or tokens, created by the `CountVectorizer` base class, each with the prefix `cnt_`.
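
To make the idea of token counting concrete, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the function `count_tokens` and the toy documents are hypothetical, and the tokenisation (lowercasing, then keeping runs of two or more word characters) only approximates the default behaviour of `CountVectorizer`.

```python
import re
from collections import Counter

def count_tokens(documents, stop_words=(), prefix='cnt_'):
    """Simplified token counting; not scikit-learn's implementation."""
    stop_set = {word.lower() for word in stop_words}
    # Lowercase, keep runs of two or more word characters, drop stop words.
    tokenised = [
        [tok for tok in re.findall(r'\b\w\w+\b', doc.lower())
         if tok not in stop_set]
        for doc in documents
    ]
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    columns = [prefix + tok for tok in vocabulary]
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocabulary])
    return columns, rows

# Toy documents; the text is made up for illustration.
columns, rows = count_tokens(['coal price 2012', 'coal coal export'],
                             stop_words=['2012'])
```

Because `2012` is treated as a stop word, no `cnt_2012` column is created, which mirrors the effect of `add_stop_words=['2012']` in the cell above.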

### 2.5 `TfidfVectColDF`

Similar to `CountVectColDF`, the class `TfidfVectColDF` has the base class `TfidfVectorizer`, which counts the occurrences of each token in each document and multiplies each count by a weight that decreases the more common the token is across the documents, that is, across the different texts in the column.

In [42]:
%python 
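import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
class TfidfVectColDF(TfidfVectorizer):
  def __init__(self,col_name,prefix='tfidf_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=[]
              ):
    # Combine the default stop words with any extra ones supplied by the caller
    stop_words_list = stop_words+add_stop_words
    self.col_name = col_name
    self.prefix   = prefix
    super().__init__(stop_words=stop_words_list)
    return
  
  def fit(self,X,y=None):
    # Learn the vocabulary and idf weights from the selected text column only
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # Dense TF-IDF values, one prefixed column per vocabulary entry.
    # Note: scikit-learn 1.2 and later replace get_feature_names() with get_feature_names_out()
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        columns=[self.prefix+feature_name for feature_name in super().get_feature_names()]
                       )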

Fit the class `TfidfVectColDF` to the dataframe `bci_dual_pdf`, with `tags_coal` as the column to be fitted and transformed. It returns a dataframe whose columns are the feature names created by the `TfidfVectorizer` base class and whose entries are the TF-IDF values; the code prints the first 5 rows of this dataframe.
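
The weighting can be worked through by hand on a small count matrix. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The function `tfidf_rows` and the toy counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF weights from a count matrix, using the smoothed formula
    idf = ln((1 + n) / (1 + df)) + 1 and L2 row normalisation."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents in which each term occurs.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0)
                for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted = [[row[j] * idf[j] for j in range(n_terms)]
                for row in count_rows]
    # Scale each row to unit Euclidean length (all-zero rows stay at zero).
    result = []
    for row in weighted:
        norm = math.sqrt(sum(value * value for value in row))
        result.append([value / norm if norm else 0.0 for value in row])
    return idf, result

# Toy counts for the terms ['coal', 'export', 'price'] in two documents.
idf, weights = tfidf_rows([[1, 0, 1], [2, 1, 0]])
```

The term that occurs in both documents receives the smallest idf (exactly 1.0 here), so after weighting it contributes less than a term that appears in only one document.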

In [44]:
%python
TfidfVectColDF(col_name='tags_coal',
               prefix='tfidf_',
               add_stop_words=['2012']) \
  .fit(bci_dual_pdf) \
  .transform(bci_dual_pdf) \
  .head() 

## 3. Summary

All the classes demonstrated above will be used to build the feature union and the pipeline when working with the three mining datasets and running the grid search over various models to predict the target variable.