## File 2: Machine Learning and Prediction

# Notebook Structure

1. Prerequisites for File2
2. Model Building

## 1. Prerequisites for File2

#### Prerequisite libraries to install to set up the environment:
`pip install seaborn`<br>
`pip install pandas`<br>
`pip install numpy`<br>
`pip install category_encoders`<br>
`pip install matplotlib`<br>
`pip install scikit-learn`<br>
`pip install statsmodels`<br>
`pip install scipy`<br>
`pip install flask`<br>
`pip install flask_restful`<br>

### `File1 (Group7_File1_DataValidation_and_Preprocessing.ipynb) must be executed before File2 can run successfully.`

#### Import libraries

In [None]:
import pandas as pd                                       # dataframes
import numpy as np                                        # numerical arrays
import datetime as dt
from sklearn.model_selection import train_test_split      # train/test split
from sklearn.linear_model import LinearRegression         # linear regression

import statsmodels.api as sm                              # OLS summary statistics
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns


## 2. Model Building

#### Read the output dataset from data preprocessing

In [None]:
# reading the dataset
df = pd.read_csv('data/merged_data.csv') 

Create a function that returns the discount value, if a discount is available

In [None]:
def f(row):
    if row['PRICE'] < row['BASE_PRICE']:
        val = abs(row['PRICE'] - row['BASE_PRICE'])
    else:
        val = 0
    return val

Apply the function to the data set

In [None]:
# create a new column  
df['DISCOUNTVALUE'] = df.apply(f, axis=1)
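
The row-wise `apply` above can be slow on large frames. As a minimal sketch, assuming only that `PRICE` and `BASE_PRICE` are numeric, the same discount can be computed column-wise. The toy values below are made up for illustration and are not taken from `merged_data.csv`.

```python
import pandas as pd

# Toy rows: the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "PRICE":      [1.50, 2.00, 3.25],
    "BASE_PRICE": [2.00, 2.00, 3.00],
})

# Row-wise version, same logic as f() in the notebook.
def discount(row):
    if row["PRICE"] < row["BASE_PRICE"]:
        return row["BASE_PRICE"] - row["PRICE"]
    return 0

row_wise = toy.apply(discount, axis=1)

# Column-wise version: negative differences (price above base price) are clipped to 0.
vectorized = (toy["BASE_PRICE"] - toy["PRICE"]).clip(lower=0)

print(vectorized.tolist())
```

Both versions give the same column; the vectorized form avoids calling a Python function once per row.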


In [None]:
df.head(10)

#### Drop features that do not help prediction
Redundant variables are removed because they won't add value to the model, and large feature subsets may actually reduce the performance of some machine learning models.

In [None]:
df = df.drop(['WEEK_END_DATE', 'SUB_CATEGORY', 'PRICE',
              'PRODUCT_WEIGHT_LB', 'SEG_VALUE_NAME',
              'MSA_CODE', 'ADDRESS_STATE_PROV_CODE'],
             axis=1)


In [None]:
df.head(10)

#### Create numerical variables for the categorical variables

In [None]:
df.MANUFACTURER = pd.Categorical(df.MANUFACTURER)
df.MANUFACTURER = df.MANUFACTURER.cat.codes

df.CATEGORY = pd.Categorical(df.CATEGORY)
df.CATEGORY = df.CATEGORY.cat.codes
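
Once a column is replaced by its codes, the original labels are lost unless the mapping is saved first. The sketch below shows one way to keep that mapping, which is useful later when a prediction input such as `"MANUFACTURER": 0` has to be interpreted. The brand names and the `MANUFACTURER_CODE` column are placeholders, not values from the data set.

```python
import pandas as pd

# Placeholder labels: the real manufacturer names are not shown in this notebook.
toy = pd.DataFrame({"MANUFACTURER": ["Brand B", "Brand A", "Brand B", "Brand C"]})

cat = pd.Categorical(toy["MANUFACTURER"])

# Categories are sorted, so code i corresponds to cat.categories[i].
code_to_label = dict(enumerate(cat.categories))

toy["MANUFACTURER_CODE"] = cat.codes

print(code_to_label)
print(toy)
```

Note that integer codes impose an arbitrary order on the categories, which a linear model treats as a numeric scale.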

In [None]:
df.dtypes

`LinearRegression`, like most scikit-learn estimators, raises an error if the data contains null values, so verify that the data set has none.

In [None]:
df.isna().sum().sum()

#### Defining features and target variables.

In [None]:
features = ['PRODUCT_ID', 'FEATURE', 'WEEKOFYEAR',
            'DISPLAY', 'MANUFACTURER', 'CATEGORY',
            'BASE_PRICE',
            'DISCOUNTVALUE',
            'STORE_ID',
            'AVG_WEEKLY_ORDERS',
            'SALES_AREA_SIZE_NUM',
           ]
target = 'UNITS'

features, target

#### Create train and test data with 75-25 ratio.

In [None]:
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
X_test

#### Create the linear regression model

In [None]:
lr=LinearRegression()
lr.fit(X_train, y_train)

#### Check Scores on train and test data set

In [None]:
# for linear regression, score() returns the R^2 of the predictions
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

#### Checking Stats

In [None]:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

print(lr.score(X_test, y_test))       # R^2 on the test set

preds = lr.predict(X_test)

score = explained_variance_score(y_test, preds)
mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)

print("explained variance = {:.5f} | MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}"
      .format(score, mae, rmse, r2))
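
To make the four metrics concrete, the sketch below recomputes them by hand with NumPy on a made-up set of four predictions. The numbers are illustrative only and are unrelated to the model's actual results.

```python
import numpy as np

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

residuals = y_true - y_pred                          # [1, 0, -2, 1]

mae = np.mean(np.abs(residuals))                     # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))              # root mean squared error

ss_res = np.sum(residuals ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
r2 = 1 - ss_res / ss_tot                             # coefficient of determination

# Explained variance differs from R^2 only when the residuals have a non-zero mean.
explained_variance = 1 - np.var(residuals) / np.var(y_true)

print(mae, rmse, r2, explained_variance)
```

Here the residuals average to zero, so the explained variance and R^2 coincide; a gap between the two on real data indicates biased predictions.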

In [None]:
print(lr.intercept_)
print(lr.coef_)

In [None]:
X2 = sm.add_constant(X_test)
est = sm.OLS(y_test, X2)
est2 = est.fit()
print(est2.summary())

In [None]:
corr = df.corr(method ='pearson') 
plt.figure(figsize=(15, 10))
sns.heatmap(corr)
plt.show()

In [None]:
# linear regression feature importance (raw coefficients)
importance = lr.coef_

# summarize feature importance, one line per feature name
for i, (name, v) in enumerate(zip(features, importance)):
    print('%-22s Feature: %0d, Score: %.5f' % (name, i, v))

# plot feature importance with the feature names on the x axis
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), features, rotation=90)
plt.show()

The model statistics show no p-values for the set of features used.
The feature importance value of each column is analysed to find which features affect sales.
The result of the feature importance check is below:

<pre>
PRODUCT_ID             Feature: 0,  Score: -0.00000
FEATURE                Feature: 1,  Score: 19.50686
WEEKOFYEAR             Feature: 2,  Score: -0.00667
DISPLAY                Feature: 3,  Score: 24.92311
MANUFACTURER           Feature: 4,  Score: 2.38433
CATEGORY               Feature: 5,  Score: 10.63296
BASE_PRICE             Feature: 6,  Score: -0.09663
DISCOUNTVALUE          Feature: 7,  Score: 0.09853
STORE_ID               Feature: 8,  Score: -0.00010
AVG_WEEKLY_ORDERS      Feature: 9,  Score: 0.00531
SALES_AREA_SIZE_NUM    Feature: 10, Score: 0.00041
</pre>

**Inference from the feature importance analysis** <br>
- DISPLAY has the largest coefficient (24.92): an item on display at the dealer store is more likely to sell.
- FEATURE has the second-largest coefficient (19.51): featured products are more likely to sell.
- CATEGORY is the third most important feature (10.63).
- MANUFACTURER, which is also the brand of the product, is the fourth most important feature (2.38).
- The remaining features have coefficients close to zero, so there is no evidence that they have a significant impact on sales.
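
Raw coefficients are expressed per unit of each feature, so their magnitudes depend on each feature's scale. One common way to compare features on an equal footing, which this notebook does not do, is to standardize them before fitting. The sketch below uses synthetic data (not `merged_data.csv`) and scikit-learn's `StandardScaler` to show how the ranking can change; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales.
small_scale = rng.normal(0.0, 1.0, n)       # standard deviation about 1
large_scale = rng.normal(0.0, 1000.0, n)    # standard deviation about 1000
X = np.column_stack([small_scale, large_scale])
y = 2.0 * small_scale + 0.01 * large_scale + rng.normal(0.0, 0.1, n)

# Raw coefficients: change in y per one unit of each feature.
raw_coef = LinearRegression().fit(X, y).coef_

# Standardized coefficients: change in y per one standard deviation of each feature.
X_std = StandardScaler().fit_transform(X)
std_coef = LinearRegression().fit(X_std, y).coef_

print("raw:         ", raw_coef)
print("standardized:", std_coef)
```

In this synthetic case the large-scale feature has the smaller raw coefficient but the larger standardized one, so a near-zero raw coefficient on a wide-ranging column such as `SALES_AREA_SIZE_NUM` does not by itself rule out an effect.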

#### Verifying prediction
Predict the units sold for one product and store combination in week 16 of 2020 (`WEEKOFYEAR` = 202016), with the other features set as shown.

In [None]:
product1 = {
    "PRODUCT_ID": 1111009497,
    "FEATURE": 0,
    "WEEKOFYEAR": 202016,
    "DISPLAY": 0,
    "MANUFACTURER": 0,
    "CATEGORY": 1,
    "BASE_PRICE": 122,
    "DISCOUNTVALUE": 100,
    "STORE_ID": 367,
    "AVG_WEEKLY_ORDERS": 1155,
    "SALES_AREA_SIZE_NUM": 24721,
}

# X_new holds the new observations, with columns in the same order as the training features
X_new = pd.DataFrame([product1])[features]

lr.predict(X_new)

#### Create pickle file

In [None]:
import joblib

# the models/ directory must already exist
with open("models/group7regressionmodel.pkl", "wb") as fwb:
    joblib.dump(lr, fwb)
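
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows that round trip with `joblib.load` on a toy model written to a temporary directory; the file name `toy_model.pkl` and the data are placeholders, not the notebook's model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model on a perfectly linear relationship: y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")   # placeholder file name
    joblib.dump(model, path)                    # save the fitted estimator
    restored = joblib.load(path)                # load it back

# The restored model predicts exactly like the original one.
prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```

The loading side should run a scikit-learn version compatible with the one used for saving, since pickled estimators are not guaranteed to load across versions.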

## Model Results and Report

Our goal was to forecast the demand for various products for the next week for different dealers. The data sets received were validated and carefully analysed to arrive at some important inferences in the `Data Validation`, `Exploration`, and `Preprocessing` steps.

The data received contains columns that do not add value to the model, such as dealer name, product description, dealer address (city, state, MSA code), and parking capacity. They are only additional information, so these features were removed from the data set.

To predict for a given week, we need a single year-and-week variable for ease of input and handling. The `WEEK_END_DATE` column has been transformed to `WEEKOFYEAR`, which holds only the year and week number (e.g. 202001).
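
The transformation itself lives in File1 and is not shown here. As a minimal sketch, assuming ISO week numbering and a `year * 100 + week` encoding, a date can be turned into that format as follows; `year_week` is a hypothetical helper name.

```python
from datetime import date

def year_week(d: date) -> int:
    """Encode a date as YYYYWW using the ISO year and ISO week number (assumed convention)."""
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week

print(year_week(date(2020, 1, 1)))    # first ISO week of 2020
print(year_week(date(2020, 4, 15)))   # a date in ISO week 16 of 2020
```

If File1 uses a different week convention (for example weeks starting on Sunday), values near year boundaries may differ from this sketch.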

Similarly, a new variable `DISCOUNTVALUE` is created from the `BASE_PRICE` and `PRICE` variables.

A linear regression model is built on the data set, with `UNITS` as the target variable.

#### The variables below are used as features in the linear regression prediction
- PRODUCT_ID
- FEATURE
- WEEKOFYEAR
- DISPLAY
- MANUFACTURER
- CATEGORY
- BASE_PRICE
- DISCOUNTVALUE
- STORE_ID
- AVG_WEEKLY_ORDERS
- SALES_AREA_SIZE_NUM

#### The Flask app needs the inputs below, in this order
PRODUCT_ID<br>FEATURE<br>WEEKOFYEAR<br>DISPLAY<br>MANUFACTURER<br>CATEGORY<br>ADDRESS_STATE_PROV_CODE<br>BASE_PRICE<br>DISCOUNTAVAILABLE<br>STORE_ID<br>AVG_WEEKLY_ORDERS<br>SALES_AREA_SIZE_NUM