<p style="font-size:36px;text-align:center"> <b>Walmart Recruiting</b> </p>

<b>Data Description</b>

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
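
The holiday weighting described above can be written as a weighted mean absolute error. The sketch below assumes a weight of 5 for holiday weeks and 1 for all other weeks; the function name `wmae` and the toy numbers are illustrative and do not come from the dataset.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count `holiday_weight` times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight is holiday_weight for holiday weeks and 1 otherwise (assumed scheme).
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example: the second week is a holiday week.
score = wmae([100.0, 200.0, 300.0], [110.0, 180.0, 300.0], [False, True, False])
print(score)
```

An error made in a holiday week moves this score five times as much as the same error in an ordinary week, which is why the holiday weeks deserve extra attention when modeling.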

<b>stores.csv</b>

This file contains anonymized information about the 45 stores, indicating the type and size of store.

<b>train.csv</b>

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

    Store - the store number
    Dept - the department number
    Date - the week
    Weekly_Sales -  sales for the given department in the given store
    IsHoliday - whether the week is a special holiday week

<b>test.csv</b>

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

<b>features.csv</b>

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

    Store - the store number
    Date - the week
    Temperature - average temperature in the region
    Fuel_Price - cost of fuel in the region
    MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
    CPI - the consumer price index
    Unemployment - the unemployment rate
    IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
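
The holiday weeks listed above can be turned into a lookup that labels each week with its holiday. This is only a sketch: the names `HOLIDAY_WEEKS`, `DATE_TO_HOLIDAY`, and `holiday_name` are hypothetical helpers, and the dates are copied from the list above in ISO format.

```python
import pandas as pd

# Week dates copied from the list above.
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}
# Invert to a date-string -> holiday-name lookup.
DATE_TO_HOLIDAY = {d: name for name, dates in HOLIDAY_WEEKS.items() for d in dates}

def holiday_name(dates):
    """Map a Series of dates to a holiday name, or 'None' for ordinary weeks."""
    as_text = pd.to_datetime(dates).dt.strftime("%Y-%m-%d")
    return as_text.map(DATE_TO_HOLIDAY).fillna("None")

demo = pd.Series(["2010-02-12", "2010-02-19", "2011-11-25"])
labels = holiday_name(demo)
print(labels.tolist())
```

A label like this could serve as a categorical feature that distinguishes the four holidays, which the boolean IsHoliday field does not.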

In [None]:
ls

References 
https://www.kaggle.com/yepp2411/walmart-prediction-1-eda-with-time-and-space


https://www.kaggle.com/jevonlee001/walmat-rfr

# Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns

# Reading Stores Data

In [None]:
df=pd.read_csv("stores.csv")
df.head()

In [None]:
df.shape

In [None]:
df.describe()

# Reading Features Data

In [None]:
df_fts=pd.read_csv("features.csv")
df_fts.head()

Note: MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.

In [None]:
df_fts.describe()

Observation: Some MarkDown columns contain negative values, which should not occur for a markdown amount, so we set every negative value to 0.

In [None]:
# Use .loc with a boolean mask; chained indexing may fail to write back to df_fts
for col in ["MarkDown1","MarkDown2","MarkDown3","MarkDown5"]:
    df_fts.loc[df_fts[col]<0,col]=0

In [None]:
df_fts.describe()
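
An alternative way to zero out negative markdowns is `DataFrame.clip`, sketched here on a made-up frame (the values and the reduced column list are assumptions for illustration). Unlike a blanket fill, it leaves missing values untouched, so the NA entries can still be handled separately.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for two of the MarkDown columns.
toy = pd.DataFrame({
    "MarkDown1": [120.5, -3.0, np.nan],
    "MarkDown2": [np.nan, 40.0, -0.5],
})
markdown_cols = ["MarkDown1", "MarkDown2"]

# clip(lower=0) raises negative entries to 0 and leaves NaN as NaN.
toy[markdown_cols] = toy[markdown_cols].clip(lower=0)

negatives_left = int((toy[markdown_cols] < 0).sum().sum())
missing_left = int(toy[markdown_cols].isna().sum().sum())
print(negatives_left, missing_left)
```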

# Reading Train Data

In [None]:
df_train=pd.read_csv("train.csv")
df_train.head()

In [None]:
df_train.describe()

Observation: Weekly_Sales should likewise not be negative, so we set negative values to 0.

In [None]:
# Use .loc with a boolean mask instead of chained indexing
df_train.loc[df_train["Weekly_Sales"]<0,"Weekly_Sales"]=0

In [None]:
df_train.describe()

# Reading Test Data

In [None]:
df_test=pd.read_csv("test.csv")
df_test.head()

# Merging Train+Store+Features

In [None]:
df_full=pd.merge(df_train,df,how='left',on='Store').merge(df_fts,how='inner',on=['Store','IsHoliday','Date'])
df_full.head()
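
Because an inner merge drops rows that have no matching key, it can be worth checking how many rows would be lost. The sketch below uses small made-up frames (not the real files) with a left merge, `validate`, and `indicator` to surface unmatched rows.

```python
import pandas as pd

# Made-up stand-ins for the sales and features tables.
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "IsHoliday": [False, True, False],
    "Weekly_Sales": [10.0, 20.0, 30.0],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "IsHoliday": [False, True],
    "CPI": [211.1, 211.2],
})
keys = ["Store", "Date", "IsHoliday"]

# validate="many_to_one" raises if the feature keys are not unique;
# indicator=True adds a _merge column showing where each row came from.
merged = sales.merge(features, how="left", on=keys,
                     validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(sales), len(merged), len(unmatched))
```

Rows flagged `left_only` are the ones an inner merge would silently discard.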

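Observation: CPI, Unemployment, and Temperature may contain missing values; we fill them with the column mean (mean imputation). All remaining missing values, such as the MarkDown columns, are then filled with 0.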

In [None]:
df_full['CPI'] = df_full['CPI'].fillna(df_full['CPI'].mean())
df_full['Temperature'] = df_full['Temperature'].fillna(df_full['Temperature'].mean())
df_full['Unemployment'] = df_full['Unemployment'].fillna(df_full['Unemployment'].mean())

In [None]:
df_full.fillna(0,inplace=True)

In [None]:
df_full.head()

Store Type may be a useful predictor of weekly sales, so we first look at how Type relates to Size and to Weekly_Sales.

In [None]:
sample_data=pd.concat([df_full['Type'],df_full['Size']],axis=1)
fig=sns.boxplot(x='Type',y='Size',data=sample_data)

Observations
1. Type A stores are the largest and Type C stores are the smallest
2. The size ranges shown for types A, B, and C do not overlap


In [None]:
sample_data=pd.concat([df_full['Type'],df_full['Weekly_Sales']],axis=1)
fig=sns.boxplot(x='Type',y='Weekly_Sales',data=sample_data,showfliers=False)

Observations

1. The median weekly sales of Type A is the highest and that of Type C is the lowest
2. Larger stores tend to record higher sales
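
The boxplot observations can also be checked numerically with a grouped summary. The frame below is invented for illustration (it is not the competition data); the same `groupby` call could be applied to `df_full`.

```python
import pandas as pd

# Invented numbers, chosen only to show the shape of the summary.
toy = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Size": [200000, 180000, 120000, 100000, 40000, 42000],
    "Weekly_Sales": [25000.0, 21000.0, 12000.0, 14000.0, 8000.0, 9000.0],
})

# One row per store type: size range and median weekly sales.
summary = toy.groupby("Type").agg(
    min_size=("Size", "min"),
    max_size=("Size", "max"),
    median_sales=("Weekly_Sales", "median"),
)
print(summary)
```

Comparing `max_size` of one type with `min_size` of the next shows directly whether the size ranges overlap.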

In [None]:
data = pd.concat([df_full['Store'], df_full['Weekly_Sales'], df_full['IsHoliday']], axis=1)
f, ax = plt.subplots(figsize=(25, 8))
fig = sns.boxplot(x='Store', y='Weekly_Sales', data=data, showfliers=False, hue="IsHoliday")

Observations

1. For most stores, weekly sales are slightly higher in holiday weeks than in non-holiday weeks

# Feature Engineering

In [None]:
#encoding for IsHoliday
df_full.IsHoliday=df_full.IsHoliday.astype(int)

In [None]:
#encode Type Feature
le=preprocessing.LabelEncoder().fit(df_full['Type'])

le.classes_

df_full.Type=le.transform(df_full['Type'])

In [None]:
df_full.head()

In [None]:
# Split Date into day of week, month, and year
# Note: the data is weekly, so "day" (day of week) is constant if every date falls on the same weekday
dates=pd.to_datetime(df_full["Date"])
df_full["day"]=dates.dt.dayofweek
df_full["month"]=dates.dt.month
df_full["year"]=dates.dt.year
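
For weekly data, the ISO week number is often a more informative calendar feature than the day of week. The sketch below uses three dates from the holiday list and training range above; the frame name `calendar` and the extra `week` column are assumptions, not part of the original feature set.

```python
import pandas as pd

# Three weekly dates taken from the ranges described earlier.
dates = pd.to_datetime(pd.Series(["2010-02-05", "2010-02-12", "2010-11-26"]))

calendar = pd.DataFrame({
    "dayofweek": dates.dt.dayofweek,                  # Monday=0 ... Sunday=6
    "week": dates.dt.isocalendar().week.astype(int),  # ISO week of the year
    "month": dates.dt.month,
    "year": dates.dt.year,
})
print(calendar)
```

All three dates fall on a Friday, so `dayofweek` carries no information here, while `week` separates early February from Thanksgiving week.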


In [None]:
df_full.drop("Date",axis=1,inplace=True)

In [None]:
df_full.head()

In [None]:
# Plotting correlation between all important features
corr = df_full.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True)
plt.show()

Observations
1. Few features are strongly correlated; the exceptions are MarkDown1 with MarkDown4 and Fuel_Price with year
2. Size and Type are negatively correlated
3. One feature from each highly correlated pair should be dropped
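
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. This is a sketch on synthetic data: the column names are borrowed from the dataset for readability, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: MarkDown4 is built to track MarkDown1 closely.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "MarkDown1": base,
    "MarkDown4": base * 0.9 + rng.normal(scale=0.1, size=200),
    "Temperature": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]  # assumed threshold
print(high_pairs)
```

Each entry of `high_pairs` names two columns, and one of the two is a candidate for removal.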

In [None]:
#removing one of the highly correlated features 
df_full=df_full.drop(["MarkDown4","year","Size"],axis=1)

# Machine Learning Models

In [None]:
y=np.array(df_full['Weekly_Sales'])

X=np.array(df_full.drop(['Weekly_Sales'],axis=1))

In [None]:
#train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
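
A random split mixes past and future weeks in both sets. Since the task is to predict later weeks, a chronological holdout is one alternative worth considering. The sketch below assumes a Date column is still available (the notebook drops it earlier) and uses made-up sales values.

```python
import pandas as pd

# Made-up weekly series, sorted by date.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19",
                            "2010-02-26", "2010-03-05"]),
    "Weekly_Sales": [10.0, 12.0, 11.0, 13.0, 15.0],
}).sort_values("Date")

# Hold out the most recent 20% of rows instead of a random 20%.
n_train = int(len(toy) * 0.8)
train_part = toy.iloc[:n_train]
test_part = toy.iloc[n_train:]
print(len(train_part), len(test_part))
```

With this split every test week lies after every training week, which is closer to how the competition's test.csv relates to train.csv.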

# Random Model

In [None]:
# Baseline: predict the mean of the training target for every sample
y_mean=np.mean(y_train)
y_pred=np.full(np.size(y_train),y_mean)

In [None]:
from sklearn.metrics import median_absolute_error,r2_score
# Both metrics take (y_true, y_pred); swapping the arguments changes the R2 value
print("MAD Score",median_absolute_error(y_train,y_pred))
print("R2 Score",r2_score(y_train,y_pred))

https://www.researchgate.net/post/What_is_the_acceptable_R-squared_in_the_information_system_research_Can_you_provide_some_references
R-squared (also called the coefficient of determination) is the proportion of variance in the dependent variable that is explained by the independent variables. As a rule of thumb for interpreting the strength of a relationship from its R-squared value (using the absolute value so that all values are positive):
- R-squared < 0.3 is generally considered no effect or a very weak effect size,
- 0.3 < R-squared < 0.5 is generally considered a weak or low effect size,
- 0.5 < R-squared < 0.7 is generally considered a moderate effect size,
- R-squared > 0.7 is generally considered a strong effect size.

Source: Moore, D. S., Notz, W. I., & Fligner, M. A. (2013). The basic practice of statistics (6th ed.). New York, NY: W. H. Freeman and Company. (Page 138)

Source: Zikmund, William G. (2000). Business research methods (6th ed.). Fort Worth: Harcourt College Publishers. (Page 513)
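
As computed by scikit-learn, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, so it is 0 for a constant mean prediction and can be negative when a model does worse than that. The toy arrays below are invented to show the three cases.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_mean = np.full_like(y_true, y_true.mean())

# r2_score expects (y_true, y_pred) in that order.
r2_mean = r2_score(y_true, y_mean)        # constant mean prediction
r2_perfect = r2_score(y_true, y_true)     # perfect prediction
r2_bad = r2_score(y_true, y_true[::-1])   # worse than predicting the mean
print(r2_mean, r2_perfect, r2_bad)
```

A negative score therefore signals a model that fits worse than the mean baseline, not a strong relationship, so taking the absolute value should be done with care.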

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Standardize features: fit the scaler on the training set only, then apply it to the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# The `normalize` option was removed from LinearRegression in scikit-learn 1.2;
# the features are already standardized above, so only fit_intercept is searched
param_grid={'fit_intercept':[True,False]}
lr=LinearRegression(n_jobs=-1)
model=GridSearchCV(lr,param_grid,scoring='neg_median_absolute_error',n_jobs=-1,pre_dispatch='2*n_jobs').fit(X_train,y_train)

In [None]:
print("Best Hyperparam Values",model.best_params_)
print("Best cross-validated score (negative median absolute error) ",model.best_score_)
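
One way to keep scaling inside each cross-validation fold is to wrap the scaler and the model in a `Pipeline` and search over that. The sketch below runs on synthetic data; the step names `scale` and `lr` and the toy coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a large offset in the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(120, 3))
y_toy = (3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 100.0
         + rng.normal(scale=1.0, size=120))

# The scaler is refit on the training part of every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
search = GridSearchCV(pipe, {"lr__fit_intercept": [True, False]},
                      scoring="neg_median_absolute_error", cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

On this toy problem the search prefers `fit_intercept=True`: once the features are centered, a model without an intercept cannot reproduce the offset in the target.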

In [None]:
# `normalize` had no effect when fit_intercept=False and was removed in scikit-learn 1.2
Lr_model=LinearRegression(fit_intercept=False,n_jobs=-1).fit(X_train,y_train)

In [None]:
import matplotlib.pyplot as plt
y_pred=Lr_model.predict(X_test)
plt.scatter(y_test,y_pred)
plt.title("Linear Regression Plot")
plt.xlabel("Y_Test")
plt.ylabel("Y_Predict")
plt.show()

In [None]:
from sklearn.metrics import median_absolute_error
print("MAD score : ",median_absolute_error(y_test,y_pred))
print("R2 Score : ",Lr_model.score(X_test,y_test))

# SGD Regressor

In [None]:
from sklearn.linear_model import SGDRegressor
# X_train and X_test were already standardized in the Linear Regression section
# 'squared_loss' was renamed to 'squared_error' in scikit-learn 1.0
param_grid={'loss':['squared_error','huber','epsilon_insensitive','squared_epsilon_insensitive'],
'penalty':['l1','l2','elasticnet'],'alpha':[0.000001,0.00001,0.0001,0.001,0.01,0.1,1,10],
'learning_rate':['constant','optimal','invscaling','adaptive'],
'early_stopping':[True,False]}
svm_model=SGDRegressor()
model_svm=GridSearchCV(svm_model,param_grid,scoring='neg_median_absolute_error',n_jobs=-1,pre_dispatch='2*n_jobs').fit(X_train,y_train)

In [None]:
print("Best Hyperparam Values",model_svm.best_params_)
print("Best cross-validated score (negative median absolute error) ",model_svm.best_score_)

In [None]:
svm_model=SGDRegressor(alpha=0.0001,early_stopping=False,learning_rate='optimal', loss='huber', penalty= 'l2')
svm_model.fit(X_train,y_train)

In [None]:
y_pred=svm_model.predict(X_test)
plt.scatter(y_test,y_pred)
plt.title("SGD Regression Plot")
plt.xlabel("Y_Test")
plt.ylabel("Y_Predict")
plt.show()

In [None]:
from sklearn.metrics import median_absolute_error
print("MAD score : ",median_absolute_error(y_test,y_pred))
print("R2 Score : ",svm_model.score(X_test,y_test))

# Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
# X_train and X_test were already standardized in the Linear Regression section
# max_features='auto' (use all features) was removed in recent scikit-learn releases; None is the equivalent
param_grid={'max_depth':[1,5,10,15,20,25,30],
 'max_features': [None, 'sqrt','log2'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10], }
dt=DecisionTreeRegressor()
model_dt=GridSearchCV(dt,param_grid,scoring='neg_median_absolute_error',n_jobs=-1,pre_dispatch='2*n_jobs').fit(X_train,y_train)

In [None]:
print("Best Hyperparam Values",model_dt.best_params_)
print("Best cross-validated score (negative median absolute error) ",model_dt.best_score_)

In [None]:
# max_features=None uses all features, matching the removed 'auto' option
DTR_model=DecisionTreeRegressor(max_depth=25,max_features=None,min_samples_leaf=4,min_samples_split=10).fit(X_train,y_train)

In [None]:
y_pred=DTR_model.predict(X_test)
plt.scatter(y_test,y_pred)
plt.title("Decision Tree Regression Plot")
plt.xlabel("Y_Test")
plt.ylabel("Y_Predict")
plt.show()

In [None]:
from sklearn.metrics import median_absolute_error
print("MAD score : ",median_absolute_error(y_test,y_pred))
print("R2 Score : ",DTR_model.score(X_test,y_test))

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
# X_train and X_test were already standardized in the Linear Regression section
param_grid={'max_depth':[1,5,10,15,20,25,30],'n_estimators':[20,50,100],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10], }
RFR=RandomForestRegressor()
model_rf=RandomizedSearchCV(RFR,param_grid,scoring='neg_median_absolute_error',n_jobs=-1,pre_dispatch='2*n_jobs').fit(X_train,y_train)

In [None]:
print("Best Hyperparam Values",model_rf.best_params_)
print("Best cross-validated score (negative median absolute error) ",model_rf.best_score_)

In [None]:
random_frgr=RandomForestRegressor(n_estimators=100,min_samples_split=5,min_samples_leaf=2,max_depth=30,n_jobs=-1).fit(X_train,y_train)

In [None]:
y_pred=random_frgr.predict(X_test)
plt.scatter(y_test,y_pred)
plt.title("Random Forest Regression Plot")
plt.xlabel("Y_Test")
plt.ylabel("Y_Predict")
plt.show()

In [None]:
from sklearn.metrics import median_absolute_error
print("MAD score : ",median_absolute_error(y_test,y_pred))
print("R2 Score : ",random_frgr.score(X_test,y_test))

# Gradient Boosted Regressor

In [None]:
from xgboost import XGBRegressor
# X_train and X_test were already standardized in the Linear Regression section
param_grid={'max_depth':[1,5,10,15,20,25,30],'n_estimators':[20,50,100],'learning_rate':[0.001,0.01,0.1,1]}
xgb=XGBRegressor(n_jobs=-1)
# Store the search under its own name so the random forest search result is not overwritten
model_xgb=RandomizedSearchCV(xgb,param_grid,scoring='neg_median_absolute_error',n_jobs=-1,pre_dispatch='2*n_jobs').fit(X_train,y_train)

In [None]:
print("Best Hyperparam Values",model_xgb.best_params_)
print("Best cross-validated score (negative median absolute error) ",model_xgb.best_score_)

In [None]:
xgbr=XGBRegressor(n_estimators=100,max_depth=25,learning_rate=0.1,n_jobs=-1).fit(X_train,y_train)

In [None]:
y_pred=xgbr.predict(X_test)
plt.scatter(y_test,y_pred)
plt.title("Gradient Boosted Tree Regression Plot")
plt.xlabel("Y_Test")
plt.ylabel("Y_Predict")
plt.show()

In [None]:
from sklearn.metrics import median_absolute_error
print("MAD score : ",median_absolute_error(y_test,y_pred))
print("R2 Score : ",xgbr.score(X_test,y_test))

# Conclusion

In [None]:
#http://zetcode.com/python/prettytable/
from prettytable import PrettyTable
x=PrettyTable()
print("Machine Learning Models")
x.field_names=['Model','MAD Score','R2 Score']
# The Random Model R2 was recorded with the prediction passed as y_true to r2_score,
# which yields the extreme negative value; a constant mean prediction scores 0 with (y_true, y_pred)
x.add_row(['Random Model',12576.21,-3.166053952252629e+30])
x.add_row(['Linear Regression',9475.04,-0.4290])
x.add_row(['SGD Regressor',3958.41,-0.2867])
x.add_row(['Decision Tree Regressor',648.028,0.9483])
x.add_row(['Random Forest Regressor',573.85,0.9619])
x.add_row(['Gradient Boosted Regressor',554.25,0.9664])
print(x)
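
Rather than typing scores into the table by hand, the rows could be produced by a small helper that computes both metrics the same way for every model. The function name `evaluate` and the toy predictions below are hypothetical and only illustrate the pattern.

```python
import numpy as np
from sklearn.metrics import median_absolute_error, r2_score

def evaluate(name, y_true, y_pred):
    """Return one results row with both metrics computed as (y_true, y_pred)."""
    return {
        "Model": name,
        "MAD Score": round(float(median_absolute_error(y_true, y_pred)), 2),
        "R2 Score": round(float(r2_score(y_true, y_pred)), 4),
    }

# Invented targets and predictions, not results from the models above.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
results = [
    evaluate("Mean baseline", y_true, np.full_like(y_true, y_true.mean())),
    evaluate("Toy model", y_true, y_true + np.array([5.0, -5.0, 10.0, -10.0])),
]
for row in results:
    print(row)
```

Each dictionary can then be passed to `x.add_row(list(row.values()))`, which keeps the table in sync with the code that produced the scores.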