# ML-2: Trees, Model Interrogation and Bayesian Workflow
# Homework 2: Rossmann Kaggle: Forecasting Sales
# Part 2: Modelling without embeddings!
**ML-2 Cohort 1** <br>
**Instructor: Dr. Rahul Dave**<br>
**Max Score: 100** <br>

In [17]:
#importing libraries
import numpy as np
import scipy.stats
import scipy.special
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union, Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid, train_test_split
from keras.models import Sequential
from keras.models import Model as KerasModel
from keras.layers import Input, Dense, Activation, Reshape
from keras.layers import Concatenate
from keras.layers.embeddings import Embedding
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
import pickle
import csv
from datetime import datetime
from sklearn import preprocessing
from keras.callbacks import ModelCheckpoint
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

## Part 2: Modelling without Entity Embeddings

Remember the parameters we need to use

![Parameters.jpeg](https://drive.google.com/uc?export=view&id=1ROfqM3F5hWwJyrvQr_J1ATovNIW5niOs)

Let's import the feature_train_data.pickle file and set the X, y values from it.

In [18]:
# use a context manager so the file is closed after loading
with open('feature_train_data.pickle', 'rb') as f:
    (X, y) = pickle.load(f)

In [19]:
# we will split the data 90% / 10% into training and validation and set train_size accordingly
train_ratio = 0.9
num_records = len(X)
train_size = int(train_ratio * num_records)

In [20]:
# let's look at one record of our data
X[1], y[1]

(array([   0, 1058,    1,    0,    0,    0,    0,    1]), 4491)

In [21]:
np.unique(X[:, 7])  # columns 2, 4, 5 and 7 are the ones we will one-hot encode

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

The next set of inputs is the following:

1. Do you want to one-hot encode the data?
2. Do you want to provide embeddings as input? This will be set to True for models with entity embeddings.
3. Do you want to save the embeddings? Again, set this to True if you want to use entity embeddings.
4. If 3 is set to True, we want to save them to embeddings.pickle.


In [22]:
one_hot_as_input = True #one_hot is set to True
embeddings_as_input = False #embeddings later on needs to set to true for Part 3
save_embeddings = False
saved_embeddings_fname = "embeddings.pickle"  # set save_embeddings to True to create this file

Define a function to one-hot encode the data and, after the split, use it to transform your training and validation sets.

In [23]:
def Ohe(data):
  """One-hot encode columns 2, 4, 5 and 7 and append the remaining columns."""
  enc = OneHotEncoder(handle_unknown='ignore')
  ohe_data = data[:, [2, 4, 5, 7]]  # only these columns need to be one-hot encoded
  # note: the encoder is fit on whatever array is passed in
  transformed_data = enc.fit_transform(ohe_data).toarray().astype(int)
  # stack the encoded columns with the untouched columns 0, 1, 3 and 6
  return np.hstack([transformed_data, data[:, [0, 1, 3, 6]]])


    

Split the data into X_train, X_val, y_train, y_val based on the train_size

In [24]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=train_size, random_state=42)

In [25]:
X_train_processed = Ohe(X_train)

In [26]:
X_val_processed = Ohe(X_val)
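
Because `Ohe` fits a fresh encoder on whatever array it receives, the training and validation matrices only line up if both splits happen to contain the same categories. A minimal sketch of an alternative, assuming you fit once on the training split and reuse that encoder for validation. `fit_ohe` and `apply_ohe` are hypothetical names, not part of the assignment, and the toy arrays are made up:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

CAT_COLS = [2, 4, 5, 7]   # categorical columns, as in Ohe
NUM_COLS = [0, 1, 3, 6]   # columns passed through unchanged

def fit_ohe(train):
    """Fit a one-hot encoder on the categorical columns of the training split."""
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(train[:, CAT_COLS])
    return enc

def apply_ohe(enc, data):
    """Encode with an already fitted encoder and append the remaining columns."""
    encoded = enc.transform(data[:, CAT_COLS]).toarray().astype(int)
    return np.hstack([encoded, data[:, NUM_COLS]])

# Toy data (made up): 8 columns, like the rows of X
toy_train = np.array([[0, 10, 1, 0, 2, 0, 1, 3],
                      [1, 20, 2, 1, 0, 1, 0, 5],
                      [0, 30, 3, 0, 1, 2, 1, 7]])
toy_val = np.array([[1, 40, 2, 1, 2, 0, 0, 3]])

enc = fit_ohe(toy_train)
toy_train_processed = apply_ohe(enc, toy_train)
toy_val_processed = apply_ohe(enc, toy_val)
print(toy_train_processed.shape, toy_val_processed.shape)
```

With this layout both matrices are guaranteed to have the same number of columns, and a category that appears only in the validation split is encoded as all zeros (that is what `handle_unknown='ignore'` does).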

Let's also sample the data.

**Why do we do this?**

A sample is used to make inferences about a population. Working with a sample is cheaper than working with the entire dataset, as long as we shuffle the population and pick the rows at random. It lets us build and run models more quickly on a small, manageable amount of data while still producing reasonably accurate findings.

In [27]:
def sample(X, y, n):
    '''Draw n random rows from X and y (indices are drawn with replacement).'''
    num_row = X.shape[0]
    indices = np.random.randint(num_row, size=n)
    return X[indices, :], y[indices]

In [28]:
X_train, y_train = sample(X_train_processed, y_train, 200000)  # Simulate data sparsity
print("Number of samples used for training: " + str(y_train.shape[0]))

Number of samples used for training: 200000
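
Note that `np.random.randint` draws indices with replacement, so the sampled rows can contain duplicates. If distinct rows are preferred, a sketch along these lines would work. `sample_without_replacement` is a hypothetical helper, the seed is an arbitrary choice, and the toy data is made up:

```python
import numpy as np

def sample_without_replacement(X, y, n, seed=0):
    """Pick n distinct rows of X and the matching entries of y."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(X.shape[0], size=n, replace=False)
    return X[indices, :], y[indices]

# Toy data (made up): 10 rows, 2 features; row i is [2i, 2i + 1] and y[i] = i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_s, y_s = sample_without_replacement(X_toy, y_toy, 5)
print(X_s.shape, y_s.shape)
```

Fixing the seed also makes the subsample reproducible between runs, which helps when comparing models.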


## Now let's work with models without embeddings!

**Let's define MAPE first**

In [29]:
#defining mape
def MAPE(Y_actual,Y_Predicted):
  return np.mean(np.abs((Y_actual - Y_Predicted) / Y_actual)) * 100
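
As a quick sanity check of the formula, here is a hand-computable example with made-up numbers. Both predictions are off by 10% of the true value, so the result should be about 10. Note that the formula divides by `Y_actual`, so it is undefined whenever an actual value is zero.

```python
import numpy as np

def MAPE(Y_actual, Y_Predicted):
    # same definition as in the cell above
    return np.mean(np.abs((Y_actual - Y_Predicted) / Y_actual)) * 100

actual = np.array([100.0, 200.0])       # made-up true values
predicted = np.array([110.0, 180.0])    # each one is off by 10%

toy_mape = MAPE(actual, predicted)
print(toy_mape)  # about 10
```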

**We will be log-transforming the dependent variable (y) because it is long-tailed.** Keep this in mind for each model, or do the conversion right after you split the data.

In [30]:
y_train = np.log(y_train)

In [31]:
y_val = np.log(y_val)
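
Because the targets are now `np.log(y)`, model predictions are in log units, and a MAPE computed on them is a percentage error in log-sales rather than in sales. A minimal sketch of mapping predictions back with `np.exp` before scoring, using made-up numbers and a pretend prediction error:

```python
import numpy as np

sales = np.array([4000.0, 5000.0, 8000.0])   # made-up sales figures
log_sales = np.log(sales)                     # what the models are trained on

# Pretend predictions in log space: each one is off by +0.05
log_pred = log_sales + 0.05

# Percentage error measured in log space
mape_log = np.mean(np.abs((log_sales - log_pred) / log_sales)) * 100

# Undo the log transform and measure on the original scale
pred_sales = np.exp(log_pred)
mape_sales = np.mean(np.abs((sales - pred_sales) / sales)) * 100

print(mape_log, mape_sales)
```

The same predictions give a much smaller percentage in log space than in sales space, which is worth remembering when comparing against numbers reported on the original scale.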

### 2.1: Random Forests

1. Define a RandomForestRegressor model with n_estimators = 200, max_depth = 35, min_samples_split = 2, min_samples_leaf = 1
2. Fit on the training data
3. Predict on the validation and training data
4. Evaluate the model: calculate the MAPE for validation and training data

**These parameters are from the paper** 

In [32]:
rf_reg = RandomForestRegressor(n_estimators = 200, max_depth = 35, min_samples_split = 2, min_samples_leaf = 1)
rf_reg.fit(X_train, y_train)

rf_train_pred = rf_reg.predict(X_train)
rf_val_pred = rf_reg.predict(X_val_processed)

In [33]:
print("The mean absolute percentage error on training data is ", np.round(MAPE(y_train, rf_train_pred), 3))

The mean absolute percentage error on training data is  0.644


In [35]:
print("The mean absolute percentage error on validation data is ", np.round(MAPE(y_val, rf_val_pred), 3))

The mean absolute percentage error on validation data is  1.762


### 2.2 Boosted Trees

For boosting, we will use XGBoost for regression
1. We will create a DMatrix from XGB for this - because we want to define a param_grid here. 
  * Again look at the parameters from the paper
2. The DMatrix should be provided with X_train and label as y_train
3. Parameters will be as follows:
  * 'nthread': -1,
  * 'max_depth': 7,
  * 'eta': 0.02,
  * 'silent': 1,
  * 'objective': 'reg:linear',
  * 'colsample_bytree': 0.7,
  * 'subsample': 0.7
  * num_round = 3000

4. Train the model

5. Note that xgb.DMatrix needs the features from the training set and the processed validation set when predicting. Hence define:
```
feature_Xtr = xgb.DMatrix(X_train)
feature_Xval = xgb.DMatrix(X_val_processed)
```
6. Predict on feature_Xtr and feature_Xval
7. Calculate MAPE for both



Look at XGBoost [documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html) for each parameter information
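
The parameter names above follow the assignment. Newer XGBoost releases deprecate two of them: `reg:linear` has been renamed `reg:squarederror`, and `silent` has been replaced by `verbosity`. If your installed version warns about or rejects the old names, a sketch like the following translates the dictionary. `modernize_params` is a hypothetical helper, and mapping `silent=0` to `verbosity=1` is an assumption about the closest equivalent:

```python
# Parameters as listed above
param = {'nthread': -1, 'max_depth': 7, 'eta': 0.02, 'silent': 1,
         'objective': 'reg:linear', 'colsample_bytree': 0.7, 'subsample': 0.7}

def modernize_params(old):
    """Hypothetical helper: swap deprecated names for their current equivalents."""
    new = dict(old)  # copy so the original dictionary is left untouched
    if new.get('objective') == 'reg:linear':
        new['objective'] = 'reg:squarederror'   # same squared-error loss, newer name
    if 'silent' in new:
        # silent=1 meant "no messages"; verbosity=0 is the silent level
        new['verbosity'] = 0 if new.pop('silent') else 1
    return new

new_param = modernize_params(param)
print(new_param)
```

The translated dictionary can be passed to `xgb.train` in place of `param`; the tree settings themselves are unchanged.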

In [36]:
#your code here
DM_train = xgb.DMatrix(data = X_train, 
                       label = y_train)

In [37]:
param  = {'nthread' : -1, 
           'max_depth': 7, 
           'eta': 0.02,
           'silent': 1,
           'objective': 'reg:linear',
           'colsample_bytree': 0.7,
           'subsample': 0.7}

num_round = 3000

In [38]:
bst = xgb.train(param, DM_train, num_round)

In [39]:
feature_Xtr = xgb.DMatrix(X_train)
feature_Xval = xgb.DMatrix(X_val_processed)

In [40]:
xgb_y_pred_train = bst.predict(feature_Xtr)
xgb_y_pred_val = bst.predict(feature_Xval)

In [41]:
xgb_mape_train= MAPE(y_train, xgb_y_pred_train)
xgb_mape_val= MAPE(y_val, xgb_y_pred_val)

In [42]:
print("The mean absolute percentage error on training data is ", np.round(xgb_mape_train, 3))

The mean absolute percentage error on training data is  1.183


In [43]:
print("The mean absolute percentage error on validation data is ", np.round(xgb_mape_val, 3))

The mean absolute percentage error on validation data is  1.257


### 2.3 Multi Layer Perceptron

Define a Sequential model with the following:
(Read Part VI, Part A: Neural Networks)

1. Dense layer with 1000 neurons, kernel_initializer set to uniform and relu activation
2. Dense layer with 500 neurons, kernel_initializer set to uniform and relu activation
3. Final dense layer with 1 neuron and sigmoid activation
4. Compile the model with mean absolute error as the loss and adam as the optimizer
5. Fit the model for 10 epochs with a batch size of 128, then find the MAPE

In [44]:
#Build the model
#your code here
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

mlp_model = models.Sequential(name='MLP')

mlp_model.add(layers.Dense(1000,  kernel_initializer = 'uniform', activation = 'relu', input_shape = X_train[1].shape ))
mlp_model.add(layers.Dense(500,  kernel_initializer = 'uniform', activation = 'relu'))

mlp_model.add(layers.Dense(1, activation = 'sigmoid',name='output' ))

# View the model summary
mlp_model.summary()


Model: "MLP"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1000)              39000     
_________________________________________________________________
dense_1 (Dense)              (None, 500)               500500    
_________________________________________________________________
output (Dense)               (None, 1)                 501       
Total params: 540,001
Trainable params: 540,001
Non-trainable params: 0
_________________________________________________________________
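
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The 39,000 parameters of the first layer therefore imply 38 input features. A small sketch (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [38, 1000, 500, 1]   # 38 inputs inferred from the summary
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The three counts match the `Param #` column, and their sum matches `Total params`.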


In [45]:
# we will rescale our y_train for the model
# the reason for this is mentioned in the paper in the same section
# to see this change you can plot the log(Sale) vs log(Sale_max) and see how it varies
# note: y_train and y_val were already log-transformed above, so we must not take the log again
max_log_y = max(np.max(y_train), np.max(y_val))
fitting_y = y_train / max_log_y

In [46]:
mlp_model.compile(optimizers.Adam(), loss = 'mae')


In [47]:
#fit your model 
#your code here
# the validation target is scaled the same way as the training target
history = mlp_model.fit(X_train, fitting_y, epochs = 10, batch_size = 128, validation_data=(X_val_processed, y_val / max_log_y))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [48]:
#predict and mape calculation
#your code here
# predictions come back with shape (n, 1) in the scaled space:
# flatten them and multiply by max_log_y to return to log-sales
mlp_y_pred_train = mlp_model.predict(X_train).flatten() * max_log_y
mlp_y_pred_val = mlp_model.predict(X_val_processed).flatten() * max_log_y

In [None]:
mlp_mape_train = MAPE(y_train, mlp_y_pred_train)
mlp_mape_val = MAPE(y_val, mlp_y_pred_val)
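
The sigmoid output layer can only produce values in (0, 1), which is why the target is divided by its maximum before fitting; predictions then need to be multiplied by the same constant to return to log-sales. A minimal numpy sketch of that round trip, with made-up values standing in for the already log-transformed targets:

```python
import numpy as np

# Made-up log-sales, standing in for the log-transformed y_train and y_val
log_y_train = np.log(np.array([4000.0, 5000.0, 8000.0]))
log_y_val = np.log(np.array([6000.0, 9000.0]))

# One shared constant so both splits are on the same scale
max_log_y = max(np.max(log_y_train), np.max(log_y_val))

scaled_train = log_y_train / max_log_y   # now inside the sigmoid's range
scaled_val = log_y_val / max_log_y

# A model would predict in the scaled space; multiplying undoes the scaling
recovered = scaled_train * max_log_y
print(scaled_train, scaled_val)
```

The scaling assumes the log-sales are positive, which holds whenever every sales figure is above 1.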

# You are done with Part 2!!
Print out the MAPE values for all models; you will need them on hand in Part 3 for comparison!
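
One way to collect the results, as a sketch: the random forest and XGBoost numbers are the ones printed earlier in this notebook, while the MLP entries are left as `None` because they depend on your own run. `report` is a hypothetical helper.

```python
results = {
    'Random Forest': {'train': 0.644, 'val': 1.762},
    'XGBoost':       {'train': 1.183, 'val': 1.257},
    'MLP':           {'train': None,  'val': None},   # fill in from your own run
}

def report(results):
    """Return one formatted line per model."""
    lines = []
    for name, scores in results.items():
        train = 'n/a' if scores['train'] is None else f"{scores['train']:.3f}"
        val = 'n/a' if scores['val'] is None else f"{scores['val']:.3f}"
        lines.append(f"{name:<15} train MAPE: {train:>6}  val MAPE: {val:>6}")
    return lines

for line in report(results):
    print(line)
```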