In [1]:
%sh ls /dbfs/mnt/group-ma755/data/ais*

### Setup Libraries

Import `numpy`, `pandas` and `sklearn`, and note the corresponding versions.

In [4]:
import numpy             as np
import pandas            as pd
import sklearn           

import datetime as dt
from dateutil import parser

from pandas import Series
from pandas import DataFrame
from sklearn import pipeline
from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import gen_features
import sklearn.preprocessing, sklearn.decomposition, \
       sklearn.linear_model,  sklearn.pipeline, \
       sklearn.metrics

np.__version__, pd.__version__, sklearn.__version__

### Setup Dataframe

Read in the training and test data, forcing `vessel_id` to be read as a string. The dataframes `ais_train_df` and `ais_test_df` will be used throughout this notebook.

In [7]:
ais_train_df = pd.read_csv('/dbfs/mnt/group-ma755/data/ais-train.csv', dtype={'vessel_id': str})
ais_test_df = pd.read_csv('/dbfs/mnt/group-ma755/data/ais-test.csv',dtype={'vessel_id':str}) 

Read in the external dataframes. We will use them to add features to our training dataset. Each external dataframe is described below:
- djia_df - The closing value of the Dow Jones index for each day.
- CrudeOil_df - The closing crude oil price for each day.
- DollarIndex_df - The daily foreign exchange value of the US dollar against the currencies of a broad group of US trading partners.
- China_Recession_Binary_df - An indicator of the current economic state of China, which is a major importer of iron. The CHNRECDM column is a binary variable (values 1 and 0), where 1 indicates a recessionary period and 0 an expansionary period.

In [9]:
djia_df = pd.read_csv('/dbfs/FileStore/tables/DJIA.csv')
CrudeOil_df = pd.read_csv('/dbfs/FileStore/tables/Crude_Oil_Prices___BRENT_EUROPE-e3249.csv')
DollarIndex_df = pd.read_csv('/dbfs/FileStore/tables/Trade_Weighted_US_Dollar_index-81fa3.csv')
China_Recession_Binary_df = pd.read_csv('/dbfs/FileStore/tables/China_Recession_Indicator-0907c.csv')

Here we make the date format and the column names consistent across the external dataframes. The dataframes `djia_df`, `CrudeOil_df`, `DollarIndex_df` and `China_Recession_Binary_df` supply the external features for our analysis.

In [11]:
djia_df['DATE']=pd.to_datetime(djia_df.DATE, format='%Y-%m-%d')
CrudeOil_df['DATE']=pd.to_datetime(CrudeOil_df.DATE, format='%Y-%m-%d')
DollarIndex_df['DATE']=pd.to_datetime(DollarIndex_df.DATE, format='%Y-%m-%d')
China_Recession_Binary_df['DATE']=pd.to_datetime(China_Recession_Binary_df.DATE, format='%Y-%m-%d')
djia_df.columns=['Date','DJIA']
CrudeOil_df.columns=['Date','Crude Oil Price']
DollarIndex_df.columns=['Date','Dollar Index']
China_Recession_Binary_df.columns=['Date','Indicator']
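
As a minimal sketch of the two steps above, the toy frame below stands in for one external CSV file. The name `raw_df` and its values are hypothetical and are not taken from the real files.

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the external CSV files.
raw_df = pd.DataFrame({'DATE': ['2017-01-03', '2017-01-04'],
                       'VALUE': ['100.0', '101.5']})

# Parse the ISO-formatted strings into datetime values.
raw_df['DATE'] = pd.to_datetime(raw_df['DATE'], format='%Y-%m-%d')

# Rename positionally so every external frame exposes the same 'Date' key.
raw_df.columns = ['Date', 'DJIA']
```

Assigning to `columns` renames by position, so it relies on each file having exactly two columns in the order date, value.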

### Setup for Data Preprocessing Classes `DateTransformer`, `UniqueTransformer`, `AddDatesTransformer`, `MergeFeaturesTransformer` and `TypeTransformer`

Create a class `DateTransformer` that sorts a dataframe by its `Date` variable and puts the sequential `Date` and one other column (the default is `Average`) in a new dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method sorts the dataframe X by `Date` and returns a new dataframe with the sequential `Date` and the selected column (the default is `Average`).

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin
class DateTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,col_name='Average'):
    self.col_name = col_name
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    # Parse the dates before sorting so the order is chronological, not lexicographic.
    a=X.copy()
    a['Date']=pd.to_datetime(a['Date'])
    a=a.sort_values(by=['Date'])
    # Keep only the date and the selected column.
    return a[['Date', self.col_name]]
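
The order of parsing and sorting matters when dates are stored as strings. The sketch below assumes `%m/%d/%Y` strings purely to make the effect visible; the actual date format of the AIS file is not shown in this notebook, and `toy_df` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical frame: October 1st is listed before September 1st.
toy_df = pd.DataFrame({'Date': ['10/1/2017', '9/1/2017'],
                       'Average': [2.0, 1.0]})

# Sorting the raw strings compares characters, so '10/...' sorts before '9/...'.
string_sorted = toy_df.sort_values(by='Date')['Average'].tolist()

# Parsing first gives a chronological sort.
parsed_df = toy_df.assign(Date=pd.to_datetime(toy_df['Date'], format='%m/%d/%Y'))
date_sorted = parsed_df.sort_values(by='Date')['Average'].tolist()
```

With ISO strings such as `2017-01-03` both orders agree, so the difference only shows up for other formats.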

Create a class `UniqueTransformer` to drop duplicate rows in a dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method drops duplicate rows from the dataframe X and returns the result. `Date` stays a regular column, because the later steps read it by name.

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin
class UniqueTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,my_var=False):
    self.my_var = my_var
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    unique_t_df=X.drop_duplicates()
    # `Date` stays a regular column; the later pipeline steps read X['Date'].
    return unique_t_df
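
A short sketch of the two pandas calls involved, on a hypothetical `toy_df`. It illustrates that `drop_duplicates` removes fully identical rows, and that `set_index` without `inplace=True` returns a new dataframe and leaves its input unchanged.

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row.
toy_df = pd.DataFrame({'Date': ['2017-01-03', '2017-01-03', '2017-01-04'],
                       'Average': [10.0, 10.0, 12.0]})

# Keeps the first occurrence of each identical row.
deduped_df = toy_df.drop_duplicates()

# Returns a NEW frame indexed by Date; deduped_df itself is not modified.
indexed_df = deduped_df.set_index('Date')
```

A call such as `deduped_df.set_index('Date')` whose result is not assigned therefore has no effect on `deduped_df`.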

Create a class `AddDatesTransformer` to add the date components to a dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method extracts the `Year`, `Month` and `Day` values from the `Date` column of the dataframe X and returns X with these three columns added alongside `Date`. X is modified in place.

In [18]:
from sklearn.base import BaseEstimator, TransformerMixin
class AddDatesTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,my_var=False):
    self.my_var = my_var
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    X['Year']=pd.DatetimeIndex(X['Date']).year
    X['Month']=pd.DatetimeIndex(X['Date']).month
    X['Day']=pd.DatetimeIndex(X['Date']).day  
    return  X
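
An equivalent way to extract the components is the `.dt` accessor. The sketch below is an alternative under the assumption that `Date` is already a datetime column, and it works on a copy so that the input frame is left untouched; `toy_df` and `out_df` are hypothetical names.

```python
import pandas as pd

toy_df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-03', '2017-12-25'])})

# Work on a copy so the caller's frame is not modified.
out_df = toy_df.copy()
out_df['Year'] = out_df['Date'].dt.year
out_df['Month'] = out_df['Date'].dt.month
out_df['Day'] = out_df['Date'].dt.day
```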

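Create a class `MergeFeaturesTransformer` to merge one external dataframe into a dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method merges the external dataframe passed as `df_name` (one of `djia_df`, `CrudeOil_df`, `DollarIndex_df` and `China_Recession_Binary_df`; the default is `djia_df`) into the dataframe X on `Date` and returns the result. `pd.merge` performs an inner join by default, so rows of X whose `Date` has no match in the external dataframe are dropped.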

In [20]:
from sklearn.base import BaseEstimator, TransformerMixin
class MergeFeaturesTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,df_name=djia_df):
    self.df_name = df_name
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    X=pd.merge(X, self.df_name, on='Date')  
    return  X
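
The join type decides what happens to dates that are missing from the external data, for example days on which a market is closed. The sketch below uses hypothetical frames and made-up values to compare the default inner join with a left join.

```python
import pandas as pd

left_df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-06', '2017-01-07', '2017-01-09']),
                        'Average': [1.0, 2.0, 3.0]})

# Hypothetical market series with no row for Saturday 2017-01-07.
right_df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-06', '2017-01-09']),
                         'DJIA': [100.0, 101.0]})

# Default how='inner': the unmatched Saturday row disappears.
inner_df = pd.merge(left_df, right_df, on='Date')

# how='left': every row of left_df is kept, with NaN where there is no match.
left_join_df = pd.merge(left_df, right_df, on='Date', how='left')
```

Whether dropping or keeping the unmatched rows is preferable depends on the analysis; the left join keeps the series complete at the cost of missing values that must be filled later.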

Create a class `TypeTransformer` to handle missing values and to change the datatype of an added column from `object` to `float`.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method converts the column named by `col_name` (`DJIA`, `Crude Oil Price` or `Dollar Index`) to numeric values, turns entries that cannot be parsed into NaN, replaces NaN with 0, and returns the dataframe X with that column stored as `float`.

In [22]:
from sklearn.base import BaseEstimator, TransformerMixin
class TypeTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,col_name = 'DJIA'):
    self.col_name = col_name
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    # Unparseable entries become NaN, then 0; the column ends up as float.
    X[self.col_name] = pd.to_numeric(X[self.col_name], errors='coerce').fillna(0).astype(float)
    return  X
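
The conversion chain can be followed step by step on a small series. The `'.'` entry below is a hypothetical placeholder for a missing value and the numbers are made up.

```python
import pandas as pd

# Hypothetical raw column read as strings, with '.' marking a missing value.
raw = pd.Series(['100.5', '.', '102.0'])

# errors='coerce' turns anything that is not a number into NaN.
coerced = pd.to_numeric(raw, errors='coerce')

# Replace the NaN with 0 and make the float dtype explicit.
filled = coerced.fillna(0).astype(float)
```

Filling with 0 keeps the row count intact, but a price of 0 is not a realistic value, so a forward fill could be considered instead if the zeros distort the model.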

Now chain the five transformers above in a pipeline `dataset_pipeline`. The dataframe `ais_t_df` is the preprocessed training dataframe, and the dataframe `ais_testnew_df` is the preprocessed test dataframe.

In [24]:
from sklearn import pipeline
from sklearn import preprocessing

dataset_pipeline=pipeline.Pipeline([('Date', DateTransformer()),
                               ('Unique', UniqueTransformer()),
                               ('AppendYearMonthDay', AddDatesTransformer()),
                               ('MergeFeatures', MergeFeaturesTransformer(df_name=djia_df)),
                               ('MergeFeatures2', MergeFeaturesTransformer(df_name=CrudeOil_df)),
                               ('MergeFeatures3', MergeFeaturesTransformer(df_name=DollarIndex_df)), 
                               ('MergeFeatures4', MergeFeaturesTransformer(df_name=China_Recession_Binary_df)),   
                               ('Type', TypeTransformer(col_name='DJIA')),
                               ('Type2', TypeTransformer(col_name='Crude Oil Price')),
                               ('Type3', TypeTransformer(col_name='Dollar Index'))
                               
                               ])
ais_t_df =dataset_pipeline.fit_transform(ais_train_df)
# The pipeline is already fitted on the training data, so the test data is only transformed.
ais_testnew_df =dataset_pipeline.transform(ais_test_df)

### Setup for Feature Generating Classes `LagsTransformer` and `WindowTransformer`

Create a class `LagsTransformer` to generate the lagged values in a dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method returns an array with `n_lags` columns, where column `i` holds the `col_name` column (the default is `Average`) shifted down by `i` rows. Column 0 is therefore the unshifted value, and the first `i` rows of column `i` are NaN.

In [27]:
class LagsTransformer(BaseEstimator, TransformerMixin): 
  def __init__(self, n_lags=1, col_name = 'Average'):
    self.n_lags = n_lags
    self.col_name = col_name
  def fit(self, X,y=None):
    return self
  def transform(self, X):
    lag=np.zeros((len(X),self.n_lags))
    for i in range(self.n_lags):
      lag[:,i]=X[self.col_name].shift(i)  
    return  lag
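
To see what `shift` produces, the sketch below builds the lag matrix for a four-value series in two ways. `current_first` follows the loop above and starts at `shift(0)`; `past_only` is a hypothetical variant that starts at `shift(1)` and so contains only earlier values.

```python
import numpy as np
import pandas as pd

avg = pd.Series([10.0, 11.0, 12.0, 13.0])
n_lags = 2

# Column i is the series shifted by i rows; column 0 is the series itself.
current_first = np.column_stack([avg.shift(i).to_numpy() for i in range(n_lags)])

# Hypothetical variant: column i is shifted by i + 1 rows, so no column
# contains the current value.
past_only = np.column_stack([avg.shift(i + 1).to_numpy() for i in range(n_lags)])
```

If `Average` is also the prediction target, the unshifted column 0 of `current_first` equals the target, which is worth keeping in mind when the features are passed to a model.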

Create a class `WindowTransformer` to generate the expanding window statistics in a dataframe.
- The fit method learns nothing from the data and simply returns the transformer itself.
- The transform method returns an array with three columns holding the expanding-window minimum, mean and maximum of the `col_name` column (the default is `Average`).

In [29]:
from sklearn.base import BaseEstimator, TransformerMixin
class WindowTransformer(BaseEstimator, TransformerMixin): 
  def __init__(self, col_name = 'Average'):
    self.col_name = col_name
  def fit(self, X,y=None):
    return self
  def transform(self, X):
    window=np.zeros((len(X),3))
    window[:,0]=X[self.col_name].expanding().min()
    window[:,1]=X[self.col_name].expanding().mean()
    window[:,2]=X[self.col_name].expanding().max()
    return  window
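
An expanding window at row `k` covers all rows from the start of the series up to and including row `k`. The sketch below shows the three statistics on a hypothetical three-value series.

```python
import pandas as pd

avg = pd.Series([3.0, 1.0, 2.0])

# Each statistic at row k is computed over rows 0..k.
stats_df = pd.DataFrame({'min': avg.expanding().min(),
                         'mean': avg.expanding().mean(),
                         'max': avg.expanding().max()})
```

Here the running minimum is 3, 1, 1, the running mean is 3, 2, 2 and the running maximum stays at 3.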

In [30]:
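# NaNTransformer replaces NaN values (such as those left by the lag features) with 0.
class NaNTransformer(BaseEstimator, TransformerMixin): 
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    # X is a numeric array; the NaN entries are overwritten in place.
    where_are_nans=np.isnan(X)
    X[where_are_nans]=0
    return  X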
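
The NaN replacement can be sketched on a small array. The boolean-mask assignment mirrors the class above, and `np.nan_to_num` is shown as an alternative that returns a new array; the array `features` is hypothetical.

```python
import numpy as np

features = np.array([[1.0, np.nan],
                     [np.nan, 4.0]])

# Mask-based replacement on a copy, so `features` keeps its NaN values.
cleaned = features.copy()
cleaned[np.isnan(cleaned)] = 0.0

# Alternative: nan_to_num returns a new array with NaN replaced by 0.
alt = np.nan_to_num(features, nan=0.0)
```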

Create the `mapper_lags` function, which builds a `DataFrameMapper` that transforms the `Average` variable of the imported dataframe into lagged `Average` variables.

In [32]:
def mapper_lags(n_lags=1, col_name='Average'):
  return DataFrameMapper([([col_name], LagsTransformer(n_lags=n_lags,col_name=col_name))],input_df=True)

The `mapper_binarized_date` mapper transforms the `Month` and `Day` variables of the imported dataframe into binary variables. It has no cell of its own; it is built inline as the `class` step of the feature union below.

Create the `mapper_window` function, which builds a `DataFrameMapper` that transforms the `Average` variable of the imported dataframe into expanding window statistics of the `Average` variable.

In [35]:
def mapper_window(col_name='Average'):
  return DataFrameMapper([([col_name], WindowTransformer(col_name=col_name))],input_df=True)

The `mapper_newdata` mapper passes the `DJIA`, `Crude Oil Price`, `Dollar Index` and `Indicator` variables of the imported dataframe through as features. It is built inline as the `new_data` step of the feature union below.

Add the `feature` step, a `FeatureUnion` of all mappers, to the pipeline and create an array with all sets of transformed variables.

In [38]:
from sklearn import pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model  import LogisticRegression, LinearRegression
binarizer_list=[['Month'],['Day']]
external_data_list=[['DJIA'],['Crude Oil Price'],['Dollar Index'],['Indicator']]

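# Note: this assignment rebinds the name `pipeline` from the sklearn module to the Pipeline object.
pipeline=pipeline.Pipeline([('Date', DateTransformer()),
                               ('Unique', UniqueTransformer()),
                               ('AppendYearMonthDay', AddDatesTransformer()),
                               ('MergeFeatures', MergeFeaturesTransformer(df_name=djia_df)),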
                               ('MergeFeatures2', MergeFeaturesTransformer(df_name=CrudeOil_df)),
                               ('MergeFeatures3', MergeFeaturesTransformer(df_name=DollarIndex_df)), 
                               ('MergeFeatures4', MergeFeaturesTransformer(df_name=China_Recession_Binary_df)),   
                               ('Type', TypeTransformer(col_name='DJIA')),
                               ('Type2', TypeTransformer(col_name='Crude Oil Price')),
                               ('Type3', TypeTransformer(col_name='Dollar Index')),
                               
                               ('feature', FeatureUnion([('lags',mapper_lags(n_lags=20,col_name='Average')),
                                                         ('class', DataFrameMapper( 
                                                                   gen_features(columns=binarizer_list,
                                                                   classes=[sklearn.preprocessing.LabelBinarizer]),
                                                                   input_df=True)),
                                                         ('new_data', DataFrameMapper(
                                                                     gen_features(columns=external_data_list,
                                                                     classes=None),input_df=True)),
                                                         ('windows',mapper_window(col_name='Average')),
                                                         ('windows2',mapper_window(col_name='DJIA'))
                                                                             ])),
                               ('align',  NaNTransformer())
                           ])
tt_x=pipeline.fit_transform(ais_train_df)
tt_x,tt_x.shape
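
The shape of `tt_x` follows from how `FeatureUnion` works: each branch transforms the same input, and the outputs are concatenated column by column. The sketch below shows this on a toy array with two hypothetical branches built from `FunctionTransformer`; it is an illustration of the mechanism, not of the mappers used above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [3.0]])

# Two hypothetical branches: one returns the input unchanged (1 column),
# the other returns the square and the cube (2 columns).
union = FeatureUnion([
    ('identity', FunctionTransformer(lambda a: a)),
    ('square_and_cube', FunctionTransformer(lambda a: np.hstack([a ** 2, a ** 3]))),
])

# The result has 1 + 2 = 3 columns and the same number of rows as X.
combined = union.fit_transform(X)
```

In the same way, the number of columns in `tt_x` is the sum of the columns produced by the lag, binarizer, external data and window branches.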