# Walmart Recruiting - Store Sales Forecasting

- Here we are provided with historical sales data for 45 Walmart stores located in different regions. 
- Each store contains many departments, and participants must project the sales for each department in each store. 
- Selected holiday markdown events are included in the dataset.
- These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

#### Description of the CSV files:

1. stores.csv

    - This file contains anonymized information about the 45 stores, indicating the type and size of store.

2. train.csv

    - This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

        <b>i.   Store</b> - the store number <br>
        <b>ii.  Dept</b> - the department number <br>
        <b>iii. Date</b> - the week <br>
        <b>iv.  Weekly_Sales</b> -  sales for the given department in the given store<br>
        <b>v.   IsHoliday</b> - whether the week is a special holiday week<br>

3. test.csv

    - This file is identical to train.csv, except we have withheld the weekly sales. 
    - You must predict the sales for each triplet of store, department, and date in this file.

4. features.csv

    - This file contains additional data related to the store, department, and regional activity for the given dates. 
    - It contains the following fields:

        <b>i.    Store        </b> - the store number<br>
        <b>ii.   Date         </b> - the week<br>
        <b>iii.  Temperature  </b> - average temperature in the region<br>
        <b>iv.   Fuel_Price   </b> - cost of fuel in the region<br>
        <b>v.    MarkDown1-5  </b> - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.<br>
        <b>vi.   CPI          </b> - the consumer price index<br>
        <b>vii.  Unemployment </b> - the unemployment rate<br>
        <b>viii. IsHoliday    </b> - whether the week is a special holiday week<br>

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

- <b> Super Bowl    : </b>  12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13 <br>
- <b> Labor Day     : </b>  10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13 <br>
- <b> Thanksgiving  : </b>  26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13 <br>
- <b> Christmas     : </b>  31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13 <br>
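
The holiday weeks listed above can be turned into a small lookup, which is handy if you want a feature that says which holiday a week belongs to rather than only whether it is a holiday. This is a minimal sketch: the names `HOLIDAY_WEEKS` and `holiday_name` are illustrative and do not appear anywhere else in this notebook.

```python
from datetime import date

# Week dates copied from the holiday list above.
HOLIDAY_WEEKS = {
    "Super Bowl": [date(2010, 2, 12), date(2011, 2, 11), date(2012, 2, 10), date(2013, 2, 8)],
    "Labor Day": [date(2010, 9, 10), date(2011, 9, 9), date(2012, 9, 7), date(2013, 9, 6)],
    "Thanksgiving": [date(2010, 11, 26), date(2011, 11, 25), date(2012, 11, 23), date(2013, 11, 29)],
    "Christmas": [date(2010, 12, 31), date(2011, 12, 30), date(2012, 12, 28), date(2013, 12, 27)],
}


def holiday_name(week):
    """Return the holiday name for a week date, or None for a regular week."""
    for name, weeks in HOLIDAY_WEEKS.items():
        if week in weeks:
            return name
    return None


print(holiday_name(date(2011, 11, 25)))  # a Thanksgiving week from the list
print(holiday_name(date(2011, 11, 18)))  # a regular week
```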

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Let's import some necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor         # Decision tree regression model
from sklearn.ensemble import RandomForestRegressor 
from sklearn.svm import SVR 
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
pd.set_option('display.max_colwidth',None)


import os
print(os.listdir("../input/walmart/"))

In [None]:
# Loading dataset
df_features = pd.read_csv('../input/walmart/features.csv')
df_train = pd.read_csv('../input/walmart/train.csv')
df_test = pd.read_csv('../input/walmart/test.csv')
df_store = pd.read_csv('../input/walmart/stores.csv')

In [None]:
# Looking at the features dataset
df_features.head()

In [None]:
# Looking at the store dataset
df_store.head()

In [None]:
# Looking at the test dataset
df_test.head()

In [None]:
# Looking at the train dataset
df_train.head()

In [None]:
# checking for info
df_train.info()

In [None]:
# checking for descriptive statistic
df_train.describe()

In [None]:
# Checking the info of the feature columns
df_features.info()

In [None]:
# checking for IsHoliday columns value
df_features['IsHoliday'].value_counts(dropna=True)

In [None]:
# Label Encoding 

# Converting date column to datetime datatype
df_features['Date'] = pd.to_datetime(df_features['Date'])
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_test['Date'] = pd.to_datetime(df_test['Date'])

# Encoding IsHoliday as 0/1, and the store Type and Size as integer labels
df_features['IsHoliday'] = LabelEncoder().fit_transform(df_features['IsHoliday'])
df_train['IsHoliday'] = LabelEncoder().fit_transform(df_train['IsHoliday'])
df_test['IsHoliday'] = LabelEncoder().fit_transform(df_test['IsHoliday'])
df_store['Size']  = LabelEncoder().fit_transform(df_store['Size'] )
df_store['Type'] = LabelEncoder().fit_transform(df_store['Type'])


In [None]:
df_test.head()

### Data Understanding & Preparation :

#### Merging the Dataset 

In [None]:
# Merging the train and test data with the features dataframe
df_store_feture = pd.merge(df_train,df_features,how='inner',on=['Store','Date','IsHoliday'])

df_store_feture_test = pd.merge(df_test,df_features,how='inner',on=['Store','Date','IsHoliday'])

print("Shape of dataframe after merging Train & Feature df : ",df_store_feture.shape[0])

In [None]:
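# Merging in the store attributes (Type and Size); joining on df_train/df_test again would add nothing
df_final = pd.merge(df_store_feture,df_store,how='inner',on='Store')

df_final_test = pd.merge(df_store_feture_test,df_store,how='inner',on='Store')

print("Shape of dataframe after merging Store,Train & Feature df : ",df_final.shape[0])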
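
When store attributes are joined onto sales rows, each sales row should match exactly one store, so the row count should not change. The sketch below shows one way to check that on toy data; the frames `sales` and `stores` are made up for illustration, and `validate="many_to_one"` is a standard `DataFrame.merge` option that raises an error if a store appears more than once on the right-hand side.

```python
import pandas as pd

# Toy stand-ins for the sales and store tables (values are invented).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Weekly_Sales": [10.0, 20.0, 30.0],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "B"],
    "Size": [100, 200],
})

# Many sales rows map to one store row; pandas raises MergeError otherwise.
merged = sales.merge(stores, on="Store", how="left", validate="many_to_one")

# A left join against unique store keys keeps the number of sales rows unchanged.
print(len(sales), len(merged))
print(merged.columns.tolist())
```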

In [None]:
df_final_test.shape

In [None]:
df_final_test.isnull().sum()

In [None]:
ts_df = df_final.copy()

In [None]:
df_final.info()

In [None]:
df_final.isnull().sum()

In [None]:
# Let's replace 'NaN' in the MarkDown columns with the column mean (SimpleImputer's default strategy)
# Filling with 0.0 instead would be: df_final.fillna(0.0,inplace= True)

markdown = pd.DataFrame(SimpleImputer().fit_transform(df_final[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']]),columns=['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'])
markdown_test = pd.DataFrame(SimpleImputer().fit_transform(df_final_test[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']]),columns=['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'])

df = df_final.drop(['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'],axis=1)
df_test_1 = df_final_test.drop(['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'],axis=1)

df_final = pd.concat([df,markdown],axis=1)
df_final_test = pd.concat([df_test_1,markdown_test],axis=1)
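
An alternative worth considering is to learn the fill values from the training data only and reuse them on the test data, so both sets are imputed with the same numbers. The sketch below illustrates this on toy frames; `train` and `test` here are invented examples, not the notebook's dataframes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

markdown_cols = ["MarkDown1", "MarkDown2"]

# Toy data with gaps (values are invented).
train = pd.DataFrame({"MarkDown1": [10.0, np.nan, 30.0], "MarkDown2": [np.nan, 4.0, 6.0]})
test = pd.DataFrame({"MarkDown1": [np.nan, 50.0], "MarkDown2": [8.0, np.nan]})

imputer = SimpleImputer(strategy="mean")

# Learn the column means on the training rows only...
train[markdown_cols] = imputer.fit_transform(train[markdown_cols])
# ...and reuse those same means to fill the test rows.
test[markdown_cols] = imputer.transform(test[markdown_cols])

print(test)
```

Assigning the imputed array back into the same columns also avoids the drop-and-concat step and keeps the column order unchanged.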

In [None]:
df_final.shape

In [None]:
df_final_test.isnull().sum()

In [None]:
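# Assign the result back; calling fillna(inplace=True) on a selected column may not update the dataframe in recent pandas
df_final_test['CPI'] = df_final_test['CPI'].fillna(df_final_test['CPI'].mean())
df_final_test['Unemployment'] = df_final_test['Unemployment'].fillna(df_final_test['Unemployment'].mean())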

In [None]:
# Extracting year, month and ISO week number from the date
# (Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it)
df_final['Year'] = df_final['Date'].dt.year
df_final['Month'] = df_final['Date'].dt.month
df_final['Week_of_Year'] = df_final['Date'].dt.isocalendar().week.astype(int)

# For test
df_final_test['Year'] = df_final_test['Date'].dt.year
df_final_test['Month'] = df_final_test['Date'].dt.month
df_final_test['Week_of_Year'] = df_final_test['Date'].dt.isocalendar().week.astype(int)


In [None]:
df_final_test.head()

In [None]:
df_grp = df_final[['Year','Dept','Weekly_Sales']].groupby(['Year','Dept']).mean().reset_index()
df_grp.head()

#### Visualising the Data

In [None]:
plt.figure(figsize=(18,5))
sns.barplot(data=df_grp, x='Year', y='Weekly_Sales')
plt.show()

In [None]:
def scatter(dataset, column):
    plt.figure()
    plt.scatter(dataset[column] , dataset['Weekly_Sales'])
    plt.ylabel('Weekly_Sales')
    plt.xlabel(column)

In [None]:
df_final.columns

In [None]:
scatter(df_final, 'Fuel_Price')
scatter(df_final, 'CPI')
scatter(df_final, 'IsHoliday')
scatter(df_final, 'Unemployment')
scatter(df_final, 'Temperature')
scatter(df_final, 'Store')
scatter(df_final, 'Dept')

In [None]:
weekly_sales = df_final['Weekly_Sales'].groupby(df_final['Dept']).mean()
plt.figure(figsize=(25,8))
sns.barplot(x=weekly_sales.index, y=weekly_sales.values, palette='dark')
plt.grid()
plt.title('Average Sales - per Dept', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Dept', fontsize=16)
plt.show()

In [None]:
weekly_sales = df_final['Weekly_Sales'].groupby(df_final['Year']).mean()
plt.figure(figsize=(25,8))
sns.barplot(x=weekly_sales.index, y=weekly_sales.values, palette='dark')
plt.grid()
plt.title('Average Sales - per Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Year', fontsize=16)
plt.show()

In [None]:
weekly_sales = df_final['Weekly_Sales'].groupby(df_final['Month']).mean()
plt.figure(figsize=(25,8))
sns.barplot(x=weekly_sales.index, y=weekly_sales.values, palette='dark')
plt.grid()
plt.title('Average Sales - per Month', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Month', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(18,7))
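# Correlate numeric columns only; the datetime 'Date' column is excluded
sns.heatmap(df_final.select_dtypes(include='number').corr(),annot=True)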
plt.show()

In [None]:
df_final.drop(columns=['Month','Date'],inplace = True)
column_date = df_final_test['Date']
df_final_test.drop(columns=['Month','Date'],inplace = True)

In [None]:
df_final_test.columns

### Splitting data into X & Y

In [None]:
X = df_final.drop('Weekly_Sales',axis=1)
y = df_final['Weekly_Sales']

### Splitting dataset into train and validation

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.3,random_state=34)
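
Because the data is a weekly time series, a random split lets the model see weeks that come after some of the validation weeks. A chronological split is a common alternative. The sketch below is an assumption-laden illustration on toy data: it presumes a `Date` column is still available at split time, and the names `toy`, `train_part` and `val_part` are hypothetical.

```python
import pandas as pd

# Toy weekly series: 10 consecutive weeks (values are invented).
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Weekly_Sales": [float(i) for i in range(10)],
})

# Sort the distinct weeks and keep the first 70% of them for training.
weeks = toy["Date"].drop_duplicates().sort_values().reset_index(drop=True)
n_train_weeks = len(weeks) * 7 // 10
cutoff = weeks.iloc[n_train_weeks]

train_part = toy[toy["Date"] < cutoff]    # earlier weeks
val_part = toy[toy["Date"] >= cutoff]     # later weeks, never seen in training

print(len(train_part), len(val_part))
```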

In [None]:
print("X Train Shape :",X_train.shape)
print("X Val Shape   :",X_val.shape)
print("Y Train Shape :",y_train.shape)
print("Y Val Shape   :",y_val.shape)


In [None]:
X.columns

In [None]:
# Building a model using Linear Regression:


lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_val)
lr_rmse_score = np.sqrt(mean_squared_error(y_val,y_pred))
lr_r2_score = r2_score(y_val,y_pred)   # r2_score expects (y_true, y_pred)
print("Root Mean Squared Error :",lr_rmse_score)
print("R2Score                 :",lr_r2_score)

### As we can see, linear regression performs very poorly.
#### Let's try some other algorithms
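
A side note on the R² values used throughout this notebook: `r2_score` takes its arguments as `(y_true, y_pred)`, and unlike mean squared error it is not symmetric. The tiny example below, using made-up numbers, shows how much the result can change when the two are swapped.

```python
from sklearn.metrics import r2_score

# Made-up targets and predictions, just to show the asymmetry.
y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [2.0, 2.0, 3.0, 3.0]

r2_correct = r2_score(y_true, y_hat)   # intended order: truth first
r2_swapped = r2_score(y_hat, y_true)   # arguments swapped by mistake

print("correct order:", r2_correct)
print("swapped order:", r2_swapped)
```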

## Decision Tree Regression

In [None]:
dt = DecisionTreeRegressor()
dt_model=dt.fit(X_train,y_train)         
y_pred_dtone=dt_model.predict(X_val) 

In [None]:
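# Calculate RMSE, R-squared and adjusted R-squared on the validation set
rms_dt = np.sqrt(mean_squared_error(y_val,y_pred_dtone))
r2_dt = r2_score(y_val, y_pred_dtone)
print('RMSE of Decision Tree Regression:',rms_dt)
print('R-Squared value:',r2_dt)
R2 = r2_score(y_val, y_pred_dtone)   # use the decision tree predictions, not the linear regression ones
n = X_val.shape[0]                   # number of validation rows, since R2 is computed on the validation set
p = len(X_train.columns)
Adj_r2 = 1-(1-R2)*(n-1)/(n-p-1)
print('Adjusted R-Square is : ',Adj_r2)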
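
The adjusted R-squared formula, 1 - (1 - R2)(n - 1)/(n - p - 1), is repeated for several models in this notebook. A small helper keeps it in one place. This is only a sketch; the function name `adjusted_r2` is hypothetical and not used elsewhere here.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared for a model with n_features predictors scored on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Example with round numbers: R2 = 0.90 on 101 rows with 10 features.
example = adjusted_r2(0.90, n_samples=101, n_features=10)
print(example)  # roughly 0.889, slightly below the plain R2
```

The penalty shrinks as the number of rows grows, so on a dataset of this size the adjusted value stays very close to the plain R-squared.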

## Random Forest Regression

In [None]:
# Creating the random forest regressor
rf_reg = RandomForestRegressor()

In [None]:
rf_model = rf_reg.fit(X_train,y_train)          
y_pred_rf = rf_model.predict(X_val)


In [None]:
rmse_rf = np.sqrt(mean_squared_error(y_val,y_pred_rf))
r2_rf = r2_score(y_val,y_pred_rf)   # r2_score expects (y_true, y_pred)

print('RMSE of predicted in RF model:',rmse_rf)
print('R Squared in RF model        :',r2_rf)

In [None]:
# Hyperparameter tuning with grid search
rf_params = {'n_estimators':[10,20],'max_depth':[8,10],'max_leaf_nodes':[70,90]}

rf_grid = GridSearchCV(rf_reg,rf_params,cv=10)
rf_model_two = rf_grid.fit(X_train,y_train)
y_pred_rf_two = rf_model_two.predict(X_val)
rmse_rf_2 = np.sqrt(mean_squared_error(y_val,y_pred_rf_two))
r2_rf_2 = r2_score(y_val,y_pred_rf_two)   # r2_score expects (y_true, y_pred)
print('RMSE using RF grid search method:',rmse_rf_2)
print('R Squared in RF model           :',r2_rf_2)

In [None]:
rf_model_two.best_params_

In [None]:
# Refitting with the best hyperparameters found above
rf_params = {'n_estimators':[10],'max_depth':[10],'max_leaf_nodes':[90]}

rf_grid = GridSearchCV(rf_reg,rf_params,cv=10)
rf_model_three = rf_grid.fit(X_train,y_train)
y_pred_rf_three = rf_model_three.predict(X_val)
rmse_rf = np.sqrt(mean_squared_error(y_val,y_pred_rf_three))
r2_rf = r2_score(y_val,y_pred_rf_three)   # r2_score expects (y_true, y_pred)
print('RMSE using RF grid search method :',rmse_rf)
print('R Squared in RF model            :',r2_rf)
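
Running a second grid search over a single parameter combination is not strictly needed. With the default `refit=True`, `GridSearchCV` already retrains the best combination on the full training data and exposes it as `best_estimator_`. The sketch below shows this on small synthetic data; the arrays `X_toy` and `y_toy` and the parameter grid are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (not the Walmart data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=3,
)
grid.fit(X_toy, y_toy)

# The winning model, already refitted on all of X_toy / y_toy.
best_rf = grid.best_estimator_
preds = best_rf.predict(X_toy[:5])

print(grid.best_params_)
print(preds.shape)
```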

## Support Vector Machine : 

In [None]:
# Creating the support vector regressor

sv_reg=SVR()

In [None]:
sv_model=sv_reg.fit(X_train,y_train)

In [None]:
# predict
y_pred_sv=sv_model.predict(X_val)        

In [None]:
# Calculate RMSE of SVR
rmse_svm = np.sqrt(mean_squared_error(y_val,y_pred_sv))
r2_svm = (r2_score(y_val,y_pred_sv))
print('RMSE of SVR model:',rmse_svm)
print("R2Score          :",r2_svm)

## XG Boost

In [None]:
# Implementing XG Boost library

regressor = xgb.XGBRegressor( 
                                n_estimators=100,
                                reg_lambda=1,
                                gamma=0,
                                max_depth=3
                            )

regressor.fit(X_train, y_train)

In [None]:
y_pred_xgb = regressor.predict(X_val)
rmse_xgb = np.sqrt(mean_squared_error(y_val, y_pred_xgb))
r2_xgb = r2_score(y_val, y_pred_xgb)
print('RMSE value without hyperparameter Tuning:',rmse_xgb)
print('R-Squared value:',r2_xgb)
R2 = r2_score(y_val, y_pred_xgb)
n = X_val.shape[0]   # number of validation rows, since R2 is computed on the validation set
p = len(X_train.columns)
Adj_r2 = 1-(1-R2)*(n-1)/(n-p-1)
print('Adjusted R-Square is : ',Adj_r2)

#### Applying hyperparameter tuning

In [None]:

regressor = xgb.XGBRegressor()
parameters = {'nthread':[3,4], #when use hyperthread, xgboost may become slower
              'objective':['reg:squarederror'],
              'learning_rate': [0.1,0.2,0.05], #so called `eta` value
              'max_depth': [4,5,6],
              'min_child_weight': [1,2,3],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [100,150]}

xgb_grid = GridSearchCV(regressor, parameters, cv = 5, n_jobs = -1, verbose=True)
xgb_grid.fit(X_train, y_train)
print(xgb_grid.best_score_)
print(xgb_grid.best_params_)

In [None]:
parameters = {
              'colsample_bytree': [0.7], 
              'learning_rate': [0.1], 
              'max_depth': [6], 
              'min_child_weight': [2], 
              'n_estimators': [150], 
              'nthread': [3], 
              'objective': ['reg:squarederror'],
              'subsample': [0.7]
             }

xgb_grid = GridSearchCV(regressor, parameters, cv = 5, n_jobs = -1, verbose=True)
xgb_grid.fit(X_train, y_train)
y_pred_xgb_1 = xgb_grid.predict(X_val)

In [None]:
rmse_xgb_gscv = np.sqrt(mean_squared_error(y_val, y_pred_xgb_1))
r2_xgb_gscv = r2_score(y_val, y_pred_xgb_1)
print("RMSE using XG Boost after applying hyperparameter tuning:",round(rmse_xgb_gscv,3))
print('R-Squared value:',r2_xgb_gscv)
R2 = r2_score(y_val, y_pred_xgb_1)
n = X_val.shape[0]   # number of validation rows, since R2 is computed on the validation set
p = len(X_train.columns)
Adj_r2 = 1-(1-R2)*(n-1)/(n-p-1)
print('Adjusted R-Square is : ',Adj_r2)

In [None]:
# Comparing all the models :
n = ['Linear Regression','Decision Tree','Random Forest','XG Boost']
val = [lr_rmse_score,rms_dt,rmse_rf,rmse_xgb]
val_1 = [lr_r2_score,r2_dt,r2_rf,r2_xgb]
compare_df = pd.DataFrame(data=[n,val,val_1]).T

In [None]:
compare_df.columns = ['Models','RMSE Score','R_Squared Value']
compare_df.sort_values('R_Squared Value',inplace = True,ascending=False)
compare_df.reset_index(drop=True)
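
Building the table from a list of lists and transposing it leaves every column with `object` dtype, because each column briefly holds mixed types. Passing a dict of columns keeps the scores numeric. The sketch below uses placeholder model names and placeholder scores, not results from this notebook.

```python
import pandas as pd

# Placeholder names and scores for illustration only.
results = pd.DataFrame({
    "Models": ["model_a", "model_b", "model_c"],
    "RMSE Score": [3.0, 1.0, 2.0],
    "R_Squared Value": [0.1, 0.9, 0.5],
})

# Best model first; reset_index returns a new frame, so keep the result.
ranked = results.sort_values("R_Squared Value", ascending=False).reset_index(drop=True)

print(ranked.dtypes)
print(ranked)
```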

In [None]:
X.columns

In [None]:
df_final_test.columns

In [None]:
df_final_test.head()

In [None]:
df_final_test.isnull().sum()

In [None]:
df_final_test.columns

In [None]:
predicted_test = rf_model.predict(df_final_test)

In [None]:
df_final_test['WeeklySales'] = predicted_test
df_final_test['Date'] = column_date
df_final_test['id'] = df_final_test['Store'].astype(str) + '_' +  df_final_test['Dept'].astype(str) + '_' +  df_final_test['Date'].astype(str)
df_final_test = df_final_test[['id', 'WeeklySales']]
df_final_test = df_final_test.rename(columns={'id': 'Id', 'WeeklySales': 'Weekly_Sales'})

In [None]:
df_final_test.to_csv('output.csv', index=False)
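
Before uploading, it can help to check that the `Id` column has the expected `Store_Dept_Date` shape and no duplicates. The sketch below rebuilds the id on a two-row toy frame; the frame `sub` and its values are invented, and formatting the date explicitly with `strftime` is one way to make the date part independent of how pandas prints datetimes.

```python
import pandas as pd

# Two invented rows standing in for the prediction frame.
sub = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": pd.to_datetime(["2012-11-02", "2012-11-02"]),
    "Weekly_Sales": [100.0, 200.0],
})

# Build the id as Store_Dept_Date with an explicit date format.
sub["Id"] = (
    sub["Store"].astype(str)
    + "_" + sub["Dept"].astype(str)
    + "_" + sub["Date"].dt.strftime("%Y-%m-%d")
)

# Each store/department/week combination should appear exactly once.
ids_are_unique = bool(sub["Id"].is_unique)
print(sub[["Id", "Weekly_Sales"]])
print(ids_are_unique)
```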

# That is all from my side :) I will be updating the kernel with "Time Series" for forecasting Weekly Sales.

# I hope you like it. If you do, please upvote so that it motivates me to create more such kernels :)

# Thanks