
# ML-2: Trees, Model Interrogation and Bayesian Workflow
# Homework 2: Rossmann Kaggle: Forecasting Sales
# Part 1: Preprocessing
**ML-2 Cohort 1** <br>
**Instructor: Dr. Rahul Dave**<br>
**Max Score: 100** <br>

#### **Names of people who have worked on this homework:**

## Table of Contents 

* **HW-2: Rossmann Kaggle: Forecasting Sales**
  * Instructions
  * Learning Goals
  * Loading the DataFrame
  * Q1: Data Preprocessing and Understanding the Data **(10 marks)** (HW1_Part1)
  * Q2: Modelling without Entity Embeddings **(30 marks)** (HW1_Part2) 
    * 2.1 Random Forest 
    * 2.2 XGBoost 
    * 2.3 Multi Layer Perceptron 
  * Q3: Modelling MLP with Entity Embeddings **(10 marks)** (HW1_Part3)
  * Q4: Modelling other models with Entity Embeddings **(40 marks)** (HW1_Part4)
    * 4.1 Random Forest 
    * 4.2 XGBoost
  * Q5: Final Comments **(10 marks)** (HW1_Part4)

## Instructions

- This homework should be submitted in pairs.

- Ensure you and your partner together have submitted the homework only once. Multiple submissions of the same work will be penalised and will cost you 2 points.

- Running cells out of order is a common pitfall in notebooks. To make sure your code works, restart the kernel and run the entire notebook again before you submit. 

- To work on the homework, you will first need to fork the repository into your GitHub account and clone it to your local computer. To submit your homework, push it to the same GitHub repository and upload the link on edStem.

- Submit the homework well before the given deadline. Submissions after the deadline will not be graded.

- We have tried to include all the libraries you may need to do the assignment in the imports statement at the top of this notebook. We strongly suggest that you use those and not others as we may not be familiar with them.

- Comment your code well. This would help the graders in case there is any issue with the notebook while running. It is important to remember that the graders will not troubleshoot your code. 

- Please use .head() when viewing data. Do not submit a notebook that is **excessively long**. 

- In questions that require code to answer, such as "calculate the $R^2$", do not just output the value from a cell. Write a `print()` function that includes a reference to the calculated value, **not hardcoded**. For example: 
```
print(f'The R^2 is {R:.4f}')
```
- Your plots should include clear labels for the $x$ and $y$ axes as well as a descriptive title ("MSE plot" is not a descriptive title; "95 % confidence interval of coefficients of polynomial degree 5" is).

- **Make appropriate plots for every question where they apply, whether or not they are explicitly asked for.**

<hr style="height:2pt">

## Learning Goals

**We will look here into the practicalities of Trees, MLPs and Entity Embedding.**

The homework is divided into four main parts:
1. Data-preprocessing
2. Developing different models and evaluating them, without Entity Embeddings
3. Passing the entity embeddings from the neural network model to the other models and evaluating those models
4. Compare the models

## Read this first!

The homework is divided into **4 notebooks**
1. Preprocessing and Storing Data
2. Modelling without Entity Embeddings
3. MultiLayer Perceptron with Entity Embeddings 
4. Modelling with Entity Embeddings and Comparing the results


This Homework is based on the **paper attached in the data folder**

Let's talk about the paper first:

A very simple explanation of what the paper is trying to achieve: it shows how the accuracy of a model changes when Entity Embeddings are used. You will first pre-process the data, pass it through the tree models and an MLP, and check the MAPE. After that you will build another MLP model with embeddings. These embedding features will be extracted, merged with the train set, and passed as input to the same tree models to check their MAPE. 

**Things to note:**

1. We want the results to be **almost the same** as the results shown in the paper (your results will not be exactly the same):

![Results.jpeg](https://drive.google.com/uc?export=view&id=1KqzimhXso6aojPYwcBNj5EnDNZoY_Hqb)


**We will not be implementing KNNs**

2. The paper specifically mentions the parameters it uses to achieve these results, and we will be using the **same as well**. 
![Parameters.jpeg](https://drive.google.com/uc?export=view&id=1ROfqM3F5hWwJyrvQr_J1ATovNIW5niOs)

**Again remember we will not be implementing KNNs**


3. The last point we want you to note is the following: we will be using MAPE (mean absolute percentage error) as the evaluation metric:


![Mape.jpeg](https://drive.google.com/uc?export=view&id=1UFi9yWzmSWePNm2qpRqGKylR5Q5-Q7ms)
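
As a reference point for the metric, here is a minimal pure-Python sketch of MAPE. It assumes the usual definition, the mean of $|y - \hat{y}| / |y|$, which the formula image above is taken to show. The function name `mape` and the sample numbers are hypothetical and are not taken from the paper's code. True values of zero are rejected because the ratio is undefined there, which is one reason zero-sales records are dropped during preprocessing.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 == 10 %)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0:
            raise ValueError("MAPE is undefined when a true value is zero")
        total += abs((actual - predicted) / actual)
    return total / len(y_true)


# Illustrative values only, not Rossmann data.
sales = [100, 200, 400]
forecast = [110, 180, 400]
print(f"The MAPE is {mape(sales, forecast):.4f}")
```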


4. Read the paper first! Specifically, read **B. Comparison of different methods**. **This will give you more clarity on what goes as an input to each model, so that your results are as close as possible to the paper's.** 

#### So let's get started! Please note: this notebook covers only data preprocessing and saving the data files. Modelling without Entity Embeddings, the MLP with Entity Embeddings, and the other models with Entity Embeddings are covered in Part 2, Part 3 and Part 4 respectively. 

**Why are we doing this?** 

Each of these processing steps requires a lot of RAM, which you may or may not have access to. Hence we split the work into four parts and load the output of each part into the next one. This also keeps the work modular.



In [None]:
#importing libraries
import numpy as np
import scipy.stats
import scipy.special
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union, Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid
from keras.models import Sequential
from keras.models import Model as KerasModel
from keras.layers import Input, Dense, Activation, Reshape
from keras.layers import Concatenate
from keras.layers import Embedding
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
import pickle
import csv
from datetime import datetime
from sklearn import preprocessing
from keras.callbacks import ModelCheckpoint
import xgboost as xgb
%matplotlib inline

## Q1. Data Pre-Processing and Saving the data

### 1.1 Loading and understanding the data

#### About the data

Most of the fields are self-explanatory. The following are descriptions for those that aren't. 

1. **Id** - an Id that represents a (Store, Date) duple within the test set
2. **Store** - a unique Id for each store
3. **Sales** - the turnover for any given day (this is what you are predicting)
4. **Customers** - the number of customers on a given day
5. **Open** - an indicator for whether the store was open: 
    * 0 = closed
    * 1 = open
6. **StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. 
    * a = public holiday 
    * b = Easter holiday
    * c = Christmas
    * 0 = None
7. **SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools
8. **StoreType** - differentiates between 4 different store models: a, b, c, d
9. **Assortment** - describes an assortment level: 
    * a = basic
    * b = extra
    * c = extended
10. **CompetitionDistance** - distance in meters to the nearest competitor store
11. **CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened
12. **Promo** - indicates whether a store is running a promo on that day
13. **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 
    * 0 = store is not participating
    * 1 = store is participating
14. **Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2
15. **PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

**Note: since this data is large, we do not convert it into DataFrames; we store it as a list of dictionaries and pass that to the models.**
**We also recommend using Google Colab for completing this homework.** 


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# paths to the raw CSV files, stored as strings (nothing is read yet)
#your code here 
train_set = '/content/drive/MyDrive/Colab files/Homework 2 - ML2/train.csv' #train.csv in rossmann kaggle folder(zip file)
stores_set = '/content/drive/MyDrive/Colab files/Homework 2 - ML2/store.csv' #store.csv in rossmann kaggle folder(zip file)
store_states = '/content/drive/MyDrive/Colab files/Homework 2 - ML2/store_states.csv' #store_states.csv in data folder

We will now define functions:
1. To convert our csv files into dictionaries
2. To replace nan values
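
For comparison, the same two steps can be sketched with the standard library alone. This is an alternative illustration, not the notebook's own helpers: `csv.DictReader` builds one dictionary per row from the header, and a dictionary comprehension replaces empty strings. The in-memory sample and the names `sample_csv`, `records` and `cleaned` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample mimicking a few store.csv columns; not real data.
sample_csv = io.StringIO(
    "Store,StoreType,Promo2SinceWeek\n"
    "1,c,\n"
    "2,a,13\n"
)

# DictReader uses the first row as the keys of every following row.
records = [dict(row) for row in csv.DictReader(sample_csv)]

# Replace empty strings (missing values) with the placeholder string '0'.
cleaned = [
    {key: (value if value != "" else "0") for key, value in record.items()}
    for record in records
]
print(cleaned)
```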

In [None]:
def csv2dicts(csvfile):
    data = []
    keys = []
    for row_index, row in enumerate(csvfile):
        if row_index == 0:
            keys = row
            print(row)
            continue
        data.append({key: value for key, value in zip(keys, row)})
    return data

In [None]:
def set_nan_as_string(data, replace_str='0'):
    for i, x in enumerate(data):
        for key, value in x.items():
            if value == '':
                x[key] = replace_str
        data[i] = x

Save the train_set as a dictionary using csv2dicts function defined above. 

Further save this as a pickle file - call it **train_set.pickle**
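
A minimal sketch of the pickle round trip, assuming a temporary directory instead of the Drive folder and a hypothetical two-record list. Passing a negative protocol to `pickle.dump` selects the highest protocol available, so `-1` and `pickle.HIGHEST_PROTOCOL` are equivalent here.

```python
import os
import pickle
import tempfile

# Illustrative records only; the real list has one dictionary per CSV row.
records = [{"Store": "1", "Sales": "5263"}, {"Store": "2", "Sales": "6064"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_set.pickle")
    # Pickle files must be opened in binary mode for writing and reading.
    with open(path, "wb") as f:
        pickle.dump(records, f, pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)
```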

In [None]:
# save the train_set as a dictionary using csv2dicts function defined above. 
# Save this as a pickle file - call it train_set.pickle
#your code here
with open(train_set) as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/train_set.pickle', 'wb') as f:
        data = csv2dicts(data)
        data = data[::-1]
        pickle.dump(data, f, -1)
        print(data[:3])

['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
[{'Store': '1115', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}, {'Store': '1114', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}, {'Store': '1113', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}]


The store_states file records which state each store is located in, so we will add this information to the stores_set itself.
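
One way to attach the state to each store record is sketched here, under the assumption that both files share the 'Store' key. The miniature inputs and the name `state_by_store` are hypothetical; the real lists come from the CSV files.

```python
# Hypothetical miniature inputs with the same keys as the real files.
store_data = [
    {"Store": "1", "StoreType": "c"},
    {"Store": "2", "StoreType": "a"},
]
store_states = [
    {"Store": "2", "State": "TH"},
    {"Store": "1", "State": "HE"},
]

# Index the states by store id so the row order of the two files does not matter.
state_by_store = {row["Store"]: row["State"] for row in store_states}

# Add a 'State' entry to every store record in place.
for store in store_data:
    store["State"] = state_by_store[store["Store"]]

print(store_data)
```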

In [None]:
#let's do the same as above for the store_set and store_states - call this pickle store_set.pickle
#your code here
with open(stores_set) as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/store_set.pickle', 'wb') as f:
        data = csv2dicts(data)
        data = data[::-1]
        pickle.dump(data, f, -1)
        print(data[:3])

['Store', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
[{'Store': '1115', 'StoreType': 'd', 'Assortment': 'c', 'CompetitionDistance': '5350', 'CompetitionOpenSinceMonth': '', 'CompetitionOpenSinceYear': '', 'Promo2': '1', 'Promo2SinceWeek': '22', 'Promo2SinceYear': '2012', 'PromoInterval': 'Mar,Jun,Sept,Dec'}, {'Store': '1114', 'StoreType': 'a', 'Assortment': 'c', 'CompetitionDistance': '870', 'CompetitionOpenSinceMonth': '', 'CompetitionOpenSinceYear': '', 'Promo2': '0', 'Promo2SinceWeek': '', 'Promo2SinceYear': '', 'PromoInterval': ''}, {'Store': '1113', 'StoreType': 'a', 'Assortment': 'c', 'CompetitionDistance': '9260', 'CompetitionOpenSinceMonth': '', 'CompetitionOpenSinceYear': '', 'Promo2': '0', 'Promo2SinceWeek': '', 'Promo2SinceYear': '', 'PromoInterval': ''}]


In [None]:
with open(store_states) as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/store_states.pickle', 'wb') as f:
        data = csv2dicts(data)
        data = data[::-1]
        pickle.dump(data, f, -1)
        print(data[:3])

['Store', 'State']
[{'Store': '1115', 'State': 'HE'}, {'Store': '1114', 'State': 'HH'}, {'Store': '1113', 'State': 'SH'}]


Next we want to store the length of the train data. Load the data back from the saved pickle files and assign the length of the train data to num_records.

In [None]:
with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/train_set.pickle', 'rb') as f:
    train_data = pickle.load(f)
    num_records = len(train_data)
with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/store_set.pickle', 'rb') as f:
    store_data = pickle.load(f)
with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/store_states.pickle', 'rb') as f:
    store_states = pickle.load(f)

If you have saved and loaded the files correctly then **train_data[1]** and **store_data[1]** should be as follows:

![Mape.jpeg](https://drive.google.com/uc?export=view&id=1D7IMgfjbRvWNuJV_v5nx5H7TfzGjP811)


Check that the column names are the same. If not, recheck the previous code.


In [None]:
#check the same
train_data[1], store_data[3]

({'Customers': '0',
  'Date': '2013-01-01',
  'DayOfWeek': '2',
  'Open': '0',
  'Promo': '0',
  'Sales': '0',
  'SchoolHoliday': '1',
  'StateHoliday': 'a',
  'Store': '1114'},
 {'Assortment': 'c',
  'CompetitionDistance': '1880',
  'CompetitionOpenSinceMonth': '4',
  'CompetitionOpenSinceYear': '2006',
  'Promo2': '0',
  'Promo2SinceWeek': '',
  'Promo2SinceYear': '',
  'PromoInterval': '',
  'Store': '1112',
  'StoreType': 'c'})

### 1.2 Feature list

We will define a function to extract features from the data; its output will be the final data passed to all the models. 
Why are we doing this? We don't need all the features from the train_set or stores_set to predict sales, so we will pick only a few selected ones.

The function should return the following parameters:
* **store index** = this should come from train_set 'Store'
* **year** = this should come from train_set 'Date'
* **month** = this should come from train_set 'Date'
* **day** = this should come from train_set 'Date'
* **day_of_week** = this should come from train_set 'DayOfWeek'
* check if the **store is open** 
    * if yes - save that 
    * else it should save 1
* **promo** = this should come from train_set 'Promo'
* **store_state** = this should come from store_state 'State'


Note that the year, month and day come from 'Date', which is a string and has to be split into its individual values. You might want to use **[datetime.strptime](https://www.programiz.com/python-programming/datetime/strptime)** for this. 
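
A short sketch of that parsing step, assuming the `YYYY-MM-DD` format seen in the 'Date' column. The variable names are illustrative.

```python
from datetime import datetime

date_string = "2013-01-25"  # same format as the 'Date' column

# Parse once, then read the components as integers from the datetime object.
parsed = datetime.strptime(date_string, "%Y-%m-%d")
year, month, day = parsed.year, parsed.month, parsed.day

print(year, month, day)  # 2013 1 25
```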

In [None]:
def feature_list(record):
    #your code here
    # Closed stores are flagged with 1 instead of a feature list
    if record['Open'] == '0':
      return 1
    else:
      # Parse the date string once and read year, month and day from it
      date = datetime.strptime(record['Date'], '%Y-%m-%d')
      year, month, day = date.year, date.month, date.day
      Store = int(record['Store'])
      DoW = int(record['DayOfWeek'])
      Promo = int(record['Promo'])
      Open = int(record['Open'])
      # Take the state of the first store_states record matching this store
      State = next(s['State'] for s in store_states if s['Store'] == str(Store))

      return [year, month, day, Store, DoW, Promo, Open, State]

Now let's create two lists: train_data_X and train_data_y 

* Run through the train_set and check that 'Sales' is not equal to 0 and 'Open' is not equal to 0 (we do not want to store features for records where sales are zero or the store is closed)
* If so (the store is open and sales are not 0), store the features (from the **feature_list function**) in a variable named f1
* Append f1 to train_data_X
* Append the corresponding non-zero **Sales** value to train_data_y
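
The steps above can be sketched end to end on a few hypothetical records. Because the state is needed once per kept record, this sketch builds a store-to-state dictionary once instead of scanning the state list for every record. That is a design choice of the sketch, not something the homework requires. The names `state_by_store`, `X` and `y` are illustrative.

```python
from datetime import datetime

# Hypothetical miniature inputs; the real lists come from the pickle files.
train_data = [
    {"Store": "1", "DayOfWeek": "2", "Date": "2013-01-01", "Sales": "0", "Open": "0", "Promo": "0"},
    {"Store": "2", "DayOfWeek": "3", "Date": "2013-01-02", "Sales": "4422", "Open": "1", "Promo": "0"},
    {"Store": "1", "DayOfWeek": "3", "Date": "2013-01-02", "Sales": "5530", "Open": "1", "Promo": "1"},
]
store_states = [{"Store": "1", "State": "HE"}, {"Store": "2", "State": "TH"}]

# Build the lookup once instead of scanning store_states for every record.
state_by_store = {row["Store"]: row["State"] for row in store_states}

X, y = [], []
for record in train_data:
    sales = int(record["Sales"])
    if record["Open"] == "0" or sales == 0:
        continue  # skip closed stores and zero-sales days
    date = datetime.strptime(record["Date"], "%Y-%m-%d")
    X.append([date.year, date.month, date.day, int(record["Store"]),
              int(record["DayOfWeek"]), int(record["Promo"]),
              int(record["Open"]), state_by_store[record["Store"]]])
    y.append(sales)

print(X)
print(y)
```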

In [None]:
feature_list(train_data[26789])

[2013, 1, 25, 1085, 5, 1, 1, 'BE']

In [None]:
train_data_X = []
train_data_y = []

for record in train_data:
  #your code here
  if record['Open'] == '0' or int(record['Sales']) == 0:
    continue
  else:
    f1 = feature_list(record)
    train_data_X.append(f1)
    train_data_y.append(int(record['Sales']))


In [None]:
#again check how your train_data_X looks
train_data_X[0]

# at this point your data should have 8 values - something like this [1, 948, 2, 0, 2013, 1, 1, 'BW']

[2013, 1, 1, 1097, 2, 0, 1, 'RP']

The next step is label encoding train_data_X. The idea is really to encode it ordinally, but LabelEncoder works here as well (look at the sklearn documentation for both). We do this using LabelEncoder from sklearn.

We will run this for the **complete train_data_X**
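
To make the encoding concrete, here is a pure-Python illustration of the mapping LabelEncoder produces: each distinct value gets its rank among the sorted distinct values. The helper `fit_label_mapping` is hypothetical and is not part of sklearn. It also shows a pitfall that applies here: a feature list mixing integers with a state string turns into an all-string NumPy array, so the values are ranked as strings, not as numbers.

```python
def fit_label_mapping(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}


# Integers are ranked numerically ...
int_mapping = fit_label_mapping([2013, 2015, 2014, 2013])
print(int_mapping)  # {2013: 0, 2014: 1, 2015: 2}

# ... but the same kind of values stored as strings are ranked lexicographically.
str_mapping = fit_label_mapping(["2", "10", "1"])
print(str_mapping)  # {'1': 0, '10': 1, '2': 2}
```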

In [None]:
check_X = np.array(train_data_X)  # unencoded copy kept for reference
train_data_X = np.array(train_data_X)
les = []
for i in range(train_data_X.shape[1]):
    #your code here
    # Fit a separate encoder per column and keep it so the mapping can be reused later
    le = LabelEncoder()
    le.fit(train_data_X[:,i])
    train_data_X[:,i] = le.transform(train_data_X[:,i])
    les.append(le)

In [None]:
len(les)

8

In [None]:
#again check how your train_data_X looks 
train_data_X[1]

array(['0', '0', '0', '1058', '1', '0', '0', '1'], dtype='<U21')

We will dump the les list (defined in the previous step) into les.pickle

Then we convert train_data_X to the **int** datatype and save train_data_y as a **numpy array**

In [None]:
with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/les.pickle', 'wb') as f:
    pickle.dump(les, f, -1)
train_data_X = train_data_X.astype(int)
train_data_y = np.array(train_data_y)

Finally we will store our train_data_X, train_data_y in a pickle file - **feature_train_data.pickle**

In [None]:
with open('/content/drive/MyDrive/Colab files/Homework 2 - ML2/feature_train_data.pickle', 'wb') as f:
    pickle.dump((train_data_X, train_data_y), f, -1)
    print(train_data_X[0], train_data_y[0])

[  0   0   0 109   1   0   0   7] 5961


## You are done with Part 1 of the Homework!


Save all the pickle files locally in your system/drive - these will be used in the next parts!