# walmart-recruiting-store-sales-forecasting predictions

## 1. Introduction
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and the task is to predict the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
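The five-fold holiday weighting described above corresponds to a weighted mean absolute error (WMAE). The following is a minimal sketch of that metric; the function name `wmae` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: holiday weeks get weight 5, other weeks weight 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), 5.0, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example with made-up numbers: one holiday week out of three.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 180.0, 330.0]
holiday = [False, True, False]
score = wmae(actual, predicted, holiday)
print(score)
```

Because the holiday week carries five times the weight, its error of 20 dominates the score, which is why errors on holiday weeks matter more than errors elsewhere.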

### Store Sales Forecasting & Discount Strategy:
Goal:

1. Exploratory Data Analysis to describe and clean the data, and to understand attributes
2. Feature selection to keep only important attributes
3. Developing a framework to evaluate and spot-check algorithms
4. Predicting and explaining future sales
5. Identifying the right time for discount strategies

## 2. Data Loading, Preparation & Cleaning

In [None]:
# Importing all the libraries
import pandas as pd
import numpy as np
import warnings
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Silence library warnings and widen the pandas display limits
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)


### 2.1 Explore the Data

In [None]:
# Reading the data using pandas dataframe
features = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/features.csv.zip')
train = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/train.csv.zip')
stores = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/stores.csv')
test = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/test.csv.zip')
sample_submission = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')

In [None]:
print(features.head())
print("------------------------------------------------------------\n")
print(stores.head())
print("------------------------------------------------------------\n")
print(train.head())
print("------------------------------------------------------------\n")
print(test.head())
print("------------------------------------------------------------\n")
print(sample_submission.head())

We can see that the test dataset does not contain the target column of the train dataset. The additional features (Temperature, Fuel_Price, MarkDowns, CPI and Unemployment) depend strongly on the date, so they may be of limited use at prediction time, and dropping them is worth considering. Before that, we will check whether these features carry any information about the target 'Weekly_Sales'.
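One simple way to check whether a feature carries information about the target is to look at its linear correlation with 'Weekly_Sales'. The sketch below runs on a small synthetic frame (the `demo` data is made up for illustration, and only 'Size' is constructed to drive sales); on the real data the same call would be applied to the merged train frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged training data (values are made up).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Temperature": rng.normal(60, 15, n),
    "Fuel_Price": rng.normal(3.3, 0.4, n),
    "Size": rng.integers(40_000, 220_000, n),
})
# By construction, only Size influences the synthetic target.
demo["Weekly_Sales"] = 0.1 * demo["Size"] + rng.normal(0, 2_000, n)

# Pearson correlation of every feature with the target, strongest first.
corr = demo.corr()["Weekly_Sales"].drop("Weekly_Sales")
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```

A correlation near zero only rules out a linear relationship, so this is a first screen rather than proof that a feature is useless.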

In [None]:
# Finding the number of rows and columns in each dataframe
features.shape, train.shape, stores.shape, test.shape

In [None]:
# Basic information about the column data types of each dataframe
print(features.dtypes)
print("------------------------------------------------------------\n")
print(train.dtypes)
print("------------------------------------------------------------\n")
print(stores.dtypes)
print("------------------------------------------------------------\n")
print(test.dtypes)

# Prepare the Dataset for Training

### 2.2 Data Cleaning
Let's start by cleaning the data in both datasets. We will check whether they contain missing values or duplicates, and remove them if so.

Keep in mind that the datasets are going to be merged, so they must share a key column with the same values. Hence, we will also check that the key values are consistent across the datasets.
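The key-consistency check mentioned above can be sketched with the `validate` and `indicator` options of `DataFrame.merge`. The small frames below are made-up stand-ins for the features, stores and sales tables; the column names follow the competition files, but the values are illustrative only.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
features_demo = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "IsHoliday": [False, True, False, True],
    "Temperature": [42.0, 38.0, 40.0, 39.0],
})
stores_demo = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [150000, 200000]})
sales_demo = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "IsHoliday": [False, True, False],
    "Weekly_Sales": [1000.0, 2000.0, 1500.0],
})

# validate= raises an error if the right-hand keys are not unique.
feature_store_demo = features_demo.merge(
    stores_demo, on="Store", how="inner", validate="many_to_one"
)

# A left merge keeps every sales row; indicator= records which rows found a match.
merged = sales_demo.merge(
    feature_store_demo,
    on=["Store", "Date", "IsHoliday"],
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = int((merged["_merge"] != "both").sum())
print(f"rows: {len(merged)}, unmatched keys: {unmatched}")
```

If `unmatched` is zero and the row count equals that of the sales table, an inner merge on the same keys would not silently drop any rows.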

In [None]:
feature_store = features.merge(stores, how='inner', on = "Store")

In [None]:
train = train.merge(feature_store, how='inner', on=['Store','Date','IsHoliday'])

In [None]:
test = test.merge(feature_store, how='inner', on=['Store','Date','IsHoliday'])

In [None]:
# Another useful step is to facilitate access to the 'Date' attribute by splitting it into its components (i.e. Year, Month, Week and Day).
train = train.copy()
test = test.copy()

train['Date'] = pd.to_datetime(train['Date'])
train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week number
train['Week'] = train['Date'].dt.isocalendar().week.astype(int)
train['Day'] = train['Date'].dt.day
# Encode the store type as an integer (only the 'Type' column holds these labels)
train['Type'] = train['Type'].map({'A': 1, 'B': 2, 'C': 3})

test['Date'] = pd.to_datetime(test['Date'])
test['Year'] = test['Date'].dt.year
test['Month'] = test['Date'].dt.month
test['Week'] = test['Date'].dt.isocalendar().week.astype(int)
test['Day'] = test['Date'].dt.day
test['Type'] = test['Type'].map({'A': 1, 'B': 2, 'C': 3})


In [None]:
print(train.head())
print("------------------------------------------------------------\n")
print(test.head())

##  Descriptive statistics & data visualizations:
### Weekly_Sales
The distribution of Weekly_Sales is clearly right-skewed: most weekly values lie close to the median, with a long tail of larger values.
The Weekly_Sales attribute also has a large kurtosis, which indicates the presence of extreme values; in other words, some weeks have unusually high sales. It would be a good idea to find out where these extreme values come from.
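Skewness and kurtosis can be quantified directly with pandas. The sketch below uses a synthetic log-normal series as a stand-in for 'Weekly_Sales' (an assumption made only so the example is self-contained); on the real data the same two methods would be called on `train['Weekly_Sales']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed series standing in for Weekly_Sales (made-up data).
rng = np.random.default_rng(42)
sales_demo = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5_000), name="Weekly_Sales")

skewness = sales_demo.skew()         # > 0 means a longer right tail
excess_kurtosis = sales_demo.kurt()  # > 0 means heavier tails than a normal distribution

print(f"mean={sales_demo.mean():.1f}, median={sales_demo.median():.1f}")
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

Note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores about 0 rather than 3.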

In [None]:
weekly_sales = train.groupby(['Year','Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2010 = train.loc[train['Year']==2010].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2011 = train.loc[train['Year']==2011].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2012 = train.loc[train['Year']==2012].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})
plt.figure(figsize=(20, 7))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=weekly_sales2010['Weekly_Sales']['mean'].index, y=weekly_sales2010['Weekly_Sales']['mean'].values)
sns.lineplot(x=weekly_sales2011['Weekly_Sales']['mean'].index, y=weekly_sales2011['Weekly_Sales']['mean'].values)
sns.lineplot(x=weekly_sales2012['Weekly_Sales']['mean'].index, y=weekly_sales2012['Weekly_Sales']['mean'].values)

plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.legend(['2010', '2011', '2012'])
plt.show()

In [None]:
Y_train = train['Weekly_Sales']

In [None]:
targets = Y_train.copy()

In [None]:
train= train.drop(['Weekly_Sales'],axis=1)


In [None]:
# Let's also identify the numeric and categorical columns.
numeric_cols = train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train.select_dtypes('object').columns.tolist()

In [None]:
print(numeric_cols)
print("------------------------------------------------------------\n")
print(categorical_cols)

In [None]:
# Check if there is any null value in train dataframe
train.isnull().sum()

In [None]:
# Check if there is any null value in test dataframe
test.isnull().sum()

# Impute Numerical Data

In [None]:
# Create the imputer (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
# Fit the imputer to the numeric columns
imputer.fit(train[numeric_cols])

In [None]:
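# Replace all the null values in train
train[numeric_cols] = imputer.transform(train[numeric_cols])
# Apply the same fitted imputer to test so both sets are treated identically
test[numeric_cols] = imputer.transform(test[numeric_cols])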

In [None]:
# Check if there is any null value
train.isnull().sum()

# Evaluate Algorithms
After analysing, cleaning and preparing the data, the next step is to select the best algorithm with the optimal parameters to obtain the best results.
This step requires manually selecting the type of data normalization, manually selecting the algorithms, and tuning their hyperparameters.

Many algorithms are sensitive to the scale of their inputs, especially when features have very different ranges as in our case, so it is necessary to include a scaling step in our pipeline.

#### For data normalization, we use the following scaler:

1. MinMaxScaler

#### Algorithms used for spot-checking :

1. LinearRegression
2. RandomForestRegressor
3. XGBRegressor (gradient boosting)
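A point worth making explicit about the normalization step: the scaler should be fitted on the training data only and then applied unchanged to the test data, so that both sets live on the same scale. A minimal sketch with made-up values (the `train_demo` and `test_demo` frames are illustrative, not the competition data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up miniature train and test frames.
train_demo = pd.DataFrame({"Size": [40000, 120000, 200000], "Week": [1, 26, 52]})
test_demo = pd.DataFrame({"Size": [80000, 220000], "Week": [5, 30]})
cols = ["Size", "Week"]

# Fit on the training data only, then reuse the fitted scaler for both sets.
scaler = MinMaxScaler().fit(train_demo[cols])
train_scaled = pd.DataFrame(scaler.transform(train_demo[cols]), columns=cols)
test_scaled = pd.DataFrame(scaler.transform(test_demo[cols]), columns=cols)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range map outside [0, 1] (here the store of size 220000), which is expected and harmless for tree-based models.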

In [None]:
# importing MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Create the scaler
scaler = MinMaxScaler()

In [None]:
# Fit the scaler to the numeric columns
scaler.fit(train[numeric_cols])

In [None]:
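# Transform and replace the numeric columns
train[numeric_cols] = scaler.transform(train[numeric_cols])
# Scale test with the same fitted scaler; otherwise the model would see unscaled inputs at prediction time
test[numeric_cols] = scaler.transform(test[numeric_cols])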

In [None]:
train[numeric_cols].describe().loc[['min', 'max']]

In [None]:
# The date components have already been extracted, so drop the 'Date' column.
train= train.drop(['Date'],axis=1)
test = test.drop(['Date'], axis=1)

In [None]:
# Preparing the dataset:
X_train =train[['Store','Dept','IsHoliday','Size','Week','Type','Year']]
X_test = test[['Store', 'Dept','IsHoliday', 'Size', 'Week', 'Type', 'Year']]

In [None]:
print(X_train.columns)
print(X_test.columns)

# Training and Validation Set

In [None]:
# Splitting and training
train_inputs, val_inputs, train_targets, val_targets = train_test_split(X_train, Y_train, test_size=0.25, random_state=42)

# Make Predictions and Evaluate Your Model

## XGBRegressor

In [None]:
# importing XGBRegressor
from xgboost import XGBRegressor

In [None]:
# fitting the model
model = XGBRegressor(random_state=42, n_jobs=-1, n_estimators=20, max_depth=4)

In [None]:
model.fit(train_inputs,train_targets)

## Feature Importance
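Tree-based models assign an "importance" value to each feature, based on how much the splits on that feature improve the training objective (for regression trees this is an error or gain reduction rather than the Gini index, which applies to classification). These values can be used to interpret the results given by the model.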


In [None]:
#Let's turn this into a dataframe and visualize the most important features.
importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance':model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
import seaborn as sns
plt.figure(figsize=(10,6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

In [None]:
# Make and evaluate predictions:
x_pred = model.predict(train_inputs)
x_pred

### Evaluation

In [None]:
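# calculating the root mean squared error (RMSE)
def rmse(a, b):
    # np.sqrt of the MSE works across scikit-learn versions (the squared=False argument has been deprecated)
    return np.sqrt(mean_squared_error(a, b))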

In [None]:
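# Training RMSE followed by validation RMSE (the validation score is the better guide to test performance)
rmse(train_targets, x_pred), rmse(val_targets, model.predict(val_inputs))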

## Making Predictions

In [None]:
x_preds=model.predict(X_test)
x_preds

In [None]:
Final = X_test[['Store', 'Dept', 'Week']]
test['Weekly_Sales']= x_preds

In [None]:
sample_submission['Weekly_Sales'] = test['Weekly_Sales']
sample_submission.to_csv('submission_2.csv',index=False)

In [None]:
preds1=pd.read_csv('submission_2.csv')
preds1

In [None]:
# plotting the first predictions
plt.figure(figsize=(10,6))
sns.barplot(data=preds1.head(10), x='Id', y='Weekly_Sales');

## RandomForestRegressor

## Hyperparameter Tuning
For hyperparameter tuning, we define a helper function that trains a RandomForestRegressor with the given hyperparameters and returns the training and validation RMSE, so that different settings can be compared.

In [None]:
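def test_params(**params):
    # Train a random forest with the given hyperparameters and return (training RMSE, validation RMSE)
    model = RandomForestRegressor(random_state=42, n_jobs=-1, **params).fit(train_inputs, train_targets)
    # np.sqrt of the MSE is used because the squared=False argument has been deprecated in scikit-learn
    train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(train_inputs)))
    val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(val_inputs)))
    return train_rmse, val_rmse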

In [None]:
test_params(n_estimators=20, max_depth=20)

In [None]:
test_params(n_estimators=50, max_depth=10,min_samples_split=3, min_samples_leaf=4, max_features=0.4)

#### Plotting the training error against the validation error

In [None]:
def test_param_and_plot(param_name, param_values):
    train_errors, val_errors = [], [] 
    for value in param_values:
        params = {param_name: value}
        train_rmse, val_rmse = test_params(**params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.figure(figsize=(10,6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend(['Training', 'Validation'])

In [None]:
test_param_and_plot('max_depth', [5, 10, 15, 20, 25, 30, 35])

In [None]:
test_param_and_plot('n_estimators', [5, 10, 15, 20, 25, 30, 35])

# Training the Best Model

In [None]:
# fitting the model with the tuned hyperparameters (random_state fixed for reproducibility)
RF = RandomForestRegressor(n_estimators=58, max_depth=27, max_features=6, min_samples_split=3, min_samples_leaf=1, random_state=42, n_jobs=-1)
RF.fit(train_inputs,train_targets)

##### We can compute the R² score (coefficient of determination) of the model on the training and validation sets using RF.score
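For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. The sketch below checks this on synthetic data (the `X_demo`, `y_demo` and `demo_model` names are illustrative assumptions) by comparing `score` with `r2_score` and with the formula 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression problem (made-up data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

demo_model = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)
preds = demo_model.predict(X_demo)

score_from_model = demo_model.score(X_demo, y_demo)
score_from_metric = r2_score(y_demo, preds)

# Manual computation: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_demo - preds) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
score_manual = 1.0 - ss_res / ss_tot

print(score_from_model, score_from_metric, score_manual)
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible on held-out data.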

In [None]:
RF.score(train_inputs, train_targets)

In [None]:
RF.score(val_inputs, val_targets)

In [None]:
# Make and evaluate predictions:
train_preds = RF.predict(train_inputs)
train_preds

### Evaluation

In [None]:
# Training RMSE followed by validation RMSE
rmse(train_targets, train_preds), rmse(val_targets, RF.predict(val_inputs))

## Feature Importance
Based on impurity computations (for a regressor, the reduction in squared error rather than the Gini index, which applies to classification), a random forest assigns an "importance" value to each feature. These values can be used to interpret the results given by the model.
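To illustrate how these importance values behave, here is a small sketch on synthetic data in which the first feature is constructed to dominate the target. The data and the variable names are assumptions made for the example; the only API used is the `feature_importances_` attribute of scikit-learn's RandomForestRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 drives the target, feature 1 contributes a little, feature 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 10.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# The importances are normalized, so they sum to 1 across features.
for name, value in zip(["feature_0", "feature_1", "feature_2"], importances):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances are computed on the training data and can favour features with many distinct values, so they are best read as a rough ranking.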

In [None]:
# Let's turn this into a dataframe and visualize the most important features.
importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance': RF.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
import seaborn as sns
plt.figure(figsize=(10,6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

## Making Predictions

In [None]:
predict = RF.predict(X_test)
predict

### To convert df to csv file

In [None]:
Final = X_test[['Store', 'Dept', 'Week']]
test['Weekly_Sales']= predict

In [None]:
sample_submission['Weekly_Sales'] = test['Weekly_Sales']
sample_submission.to_csv('submission.csv',index=False)
predicts=pd.read_csv('submission.csv')
predicts

In [None]:
# plotting the first predictions
plt.figure(figsize=(10,6))
sns.barplot(data=predicts.head(10), x='Id', y='Weekly_Sales');

## LinearRegression

In [None]:
# importing the LinearRegression algorithm
from sklearn.linear_model import LinearRegression

In [None]:
# fitting the model
lr=LinearRegression()
lr.fit(train_inputs,train_targets)

In [None]:
Y_pred=lr.predict(train_inputs)
Y_pred

### Evaluation

In [None]:
# Training RMSE followed by validation RMSE
rmse(train_targets, Y_pred), rmse(val_targets, lr.predict(val_inputs))

## Making Predictions

In [None]:
y_pred=lr.predict(X_test)
y_pred

### To convert df to csv file

In [None]:
Final = X_test[['Store', 'Dept', 'Week']]
test['Weekly_Sales']= y_pred

In [None]:
sample_submission['Weekly_Sales'] = test['Weekly_Sales']
sample_submission.to_csv('submission_1.csv',index=False)

In [None]:
preds=pd.read_csv('submission_1.csv')

In [None]:
# plotting the first predictions
import seaborn as sns
plt.figure(figsize=(10,6))
sns.barplot(data=preds.head(10),x='Id', y='Weekly_Sales');