![](https://cdn.corporate.walmart.com/dims4/WMT/0b04aa6/2147483647/strip/true/crop/2400x1260+0+0/resize/1200x630!/quality/90/?url=https%3A%2F%2Fcdn.corporate.walmart.com%2F6f%2Fd3%2Ff3f5a16f44a88d88b8059defd0a9%2Foption-signage.jpg)

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import plotly.figure_factory as ff
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (15, 10)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFECV

In [None]:
features_df  = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/features.csv.zip')
train_df  = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/train.csv.zip')
stores_df  = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/stores.csv')
test_df  = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/test.csv.zip')
sample_submission = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')

## Merging the data
We merge the training and test data with the stores and features tables so that the model can use store-level and feature-level information when predicting sales, which should make it more robust.

In [None]:
merged_train_df = train_df.merge(stores_df, how='left').merge(features_df, how='left')
merged_test_df = test_df.merge(stores_df, how='left').merge(features_df, how='left')
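
The merges above let pandas pick the join keys from whichever columns the frames share. As a minimal sketch on toy frames (all values are made up), the same left joins can be written with the keys spelled out. The key lists are an assumption about the real tables: that `Store` is the only column shared with the stores table, and that `Store`, `Date` and `IsHoliday` are shared with the features table.

```python
import pandas as pd

# Toy stand-ins for train.csv, stores.csv and features.csv (values are made up)
toy_train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05"],
    "Weekly_Sales": [100.0, 50.0, 75.0],
    "IsHoliday": [False, False, False],
})
toy_stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 500]})
toy_features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "Temperature": [42.3, 40.2],
    "IsHoliday": [False, False],
})

# Explicit keys; `validate` raises if the right-hand table has duplicate keys
toy_merged = (
    toy_train
    .merge(toy_stores, how="left", on="Store", validate="many_to_one")
    .merge(toy_features, how="left", on=["Store", "Date", "IsHoliday"], validate="many_to_one")
)
print(toy_merged.shape)
```

Spelling out the keys guards against a silently different join if a new shared column ever appears in one of the tables.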

In [None]:
# Non-null counts and data types of our columns
merged_train_df.info()

In [None]:
merged_test_df.info()

In [None]:
merged_train_df.describe()

In [None]:
merged_test_df.describe()

## Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

In [None]:
sales_df = merged_train_df.sample(frac=0.1)
hist_data = [sales_df.Weekly_Sales]
group_labels = ['Weekly Sales']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
fig.update_layout(title_text='Weekly Sales Distplot')
fig.show()

In [None]:
sales_df

In [None]:
plt.title("Dept Wise Sales")
plt.xlabel('Dept')
sns.histplot(x=sales_df.Dept, y= sales_df.Weekly_Sales);

Stores 10 and 35 have the highest sales.

Let's use a heatmap to look at the correlations between the columns of `merged_train_df`, computed on the 10% sample `sales_df`.

In [None]:
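# Correlation is only defined for numeric/boolean columns; `Date` and `Type` are still strings here
corr = sales_df.select_dtypes(include=['number', 'bool']).corr()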
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(25,20))
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,square=True, linewidths=.5, cbar_kws={'shrink': .5})
plt.show()

In [None]:
fig = px.histogram(sales_df, x='Temperature', y ='Weekly_Sales', color='IsHoliday', marginal='box', title ='Effect of Store Temperature on Sales')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})

In [None]:
plt.figure(figsize=(15,10))
plt.title('Relation between Store size and sales')
sns.lineplot ( data = sales_df, x = 'Size', y =  'Weekly_Sales', hue = 'IsHoliday');

From the graph, sales tend to increase with store size up to a point, but beyond that point size most likely has little further impact on a store's sales.

## Outlier Detection
*Wikipedia defines an outlier as follows:*

"In statistics, an outlier is an observation point that is distant from other observations."

In other words, an outlier is a value that lies far away from the bulk of the data.

Spotting outliers by eye can be easy on a small dataset, but it becomes challenging with thousands or even millions of data points. We use the IQR (interquartile range) rule to flag outliers and then remove them so that they do not degrade the model's performance. In the code below, only values above the upper fence are removed.
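
As a small illustration of the IQR rule described above, here is a sketch on a made-up array. The helper name `iqr_fences` and the toy values are assumptions for illustration only. Unlike the notebook's filter, this sketch applies both the lower and the upper fence.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier (200.0)
toy_sales = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 200.0])

lower, upper = iqr_fences(toy_sales)
kept = toy_sales[(toy_sales >= lower) & (toy_sales <= upper)]
print(lower, upper)  # 2.0 34.0
print(kept)          # every value except 200.0
```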

In [None]:
# Outlier detection with the IQR rule (np.percentile does not need sorted input)
q1, q3 = np.percentile(merged_train_df.Weekly_Sales, [25, 75])
iqr = q3 - q1
lower_fence = q1 - (1.5 * iqr)  # computed for reference; not applied in the filter below
upper_fence = q3 + (1.5 * iqr)

In [None]:
# Keep only rows below the upper fence; .copy() avoids modifying a view later on
merged_train_df = merged_train_df[merged_train_df.Weekly_Sales < upper_fence].copy()
merged_train_df

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x=merged_train_df["Weekly_Sales"], y=merged_train_df["IsHoliday"], palette="Set2", orient="h");

## Data Pre-processing
In this section, we clean up our data and prepare it for the model. We deal with missing values and categorical data, and drop any unnecessary columns from the dataset.

In [None]:
# Splitting Date Column
def split_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['WeekOfYear'] = df.Date.dt.isocalendar().week

In [None]:
split_date(merged_train_df)
split_date(merged_test_df)
split_date(sales_df)

In [None]:
merged_train_df = merged_train_df.drop(['Date'], axis=1)
merged_test_df = merged_test_df.drop(['Date'], axis=1)
sales_df = sales_df.drop(['Date'], axis=1)

In [None]:
merged_train_df.isna().sum()

In [None]:
merged_test_df.isna().sum()

In [None]:
sales_df.isna().sum()

### Encoding and Imputing the values
Since most ML algorithms work only with numerical data, it is imperative for us to

- Encode - turn categorical values into numerical values
- Impute - fill in the `NaN` values in our data, since dropping the rows that contain them might not be a good idea.

In [None]:
inputs_df = sales_df.copy()
categorical_cols = inputs_df.select_dtypes(include=['object']).columns.tolist()
categorical_cols

### OneHotEncoding

For categorical variables with no ordinal relationship between the categories, integer encoding is not enough. In fact, an integer encoding lets the model assume a natural ordering between the categories, which may result in poor performance or unexpected results (such as predictions halfway between categories).

In this case, one-hot encoding can be applied instead: the original categorical variable is removed and a new binary variable is added for each unique category.

![](https://i.imgur.com/n8GuiOO.png)

Read more on encoding data: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

In [None]:
inputs_df

In [None]:
# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(inputs_df[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])

In [None]:
Test_df = merged_test_df.copy()
categorical_cols1 = Test_df.select_dtypes(include=['object']).columns.tolist()
encoded_cols1 = list(encoder.get_feature_names_out(categorical_cols1))
Test_df[encoded_cols1] = encoder.transform(Test_df[categorical_cols1])

We also need to encode `IsHoliday` as a numerical value, since this boolean column is categorical as well. We use `LabelEncoder()` for this purpose.

In [None]:
encoder1 = LabelEncoder()
encoder1.fit(inputs_df['IsHoliday'])
inputs_df['IsHoliday'] = encoder1.transform(inputs_df['IsHoliday'])
Test_df['IsHoliday'] = encoder1.transform(Test_df['IsHoliday'])

### Dropping Unnecessary columns


In [None]:
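target_col = 'Weekly_Sales'
targets_df = sales_df[target_col]
# Drop the raw `Type` column (now one-hot encoded) and the target itself,
# so that the target does not leak into the model inputs
inputs_df.drop(['Type', target_col], axis=1, inplace=True)
Test_df.drop(['Type'], axis=1, inplace=True)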

### Imputation

Imputation is the process of replacing missing data with substituted values. Some common imputation techniques are shown below; in this notebook we use mean imputation.
![](https://vitalflux.com/wp-content/uploads/2018/10/Missing-Data-Imputation-Techniques.png)

In [None]:
numeric_cols = inputs_df.columns[0:17].tolist()
numeric_cols1 = Test_df.columns[0:17].tolist()
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])
Test_df[numeric_cols1] = imputer.transform(Test_df[numeric_cols1])
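
As a small illustration of mean imputation, here is a sketch on made-up frames (the column values are invented; only the column names echo the dataset). The imputer learns the column means from the training frame and reuses them to fill the gaps in the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({"MarkDown1": [10.0, np.nan, 30.0], "CPI": [200.0, 210.0, np.nan]})
toy_test = pd.DataFrame({"MarkDown1": [np.nan, 5.0], "CPI": [np.nan, 220.0]})

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(toy_train)  # learns the training means: MarkDown1 -> 20.0, CPI -> 205.0

# Missing values in both frames are filled with the means learned from training
train_filled = pd.DataFrame(toy_imputer.transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(toy_imputer.transform(toy_test), columns=toy_test.columns)
print(test_filled)
```

Fitting on the training data only means the test set never influences the fill values.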

### Feature Selection

RFE is a transformer estimator, which means it follows the familiar fit/transform pattern of scikit-learn. It is a popular algorithm because it is easy to configure and performs robustly. As the name suggests, it removes features step by step (one at a time with `step=1`), based on the weights given in each iteration by a model of our choice.

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through any specific attribute or callable. Then, the least important features are pruned from current set of features.

That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

More here :- https://www.kaggle.com/code/bhatnagardaksh/how-to-feature-selection-a-tutorial-updated
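
The procedure can be illustrated with a minimal sketch on synthetic data (the data, seed and coefficients are made up). Only the first two of four features drive the target, so RFE with a linear model should keep exactly those two.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # four synthetic features on the same scale
noise = rng.normal(scale=0.01, size=200)
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + noise   # only the first two features matter

# Drop one feature per iteration until two remain
toy_selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
toy_selector.fit(X, y)

print(toy_selector.support_)   # boolean mask of the kept features
print(toy_selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

With a linear model, RFE ranks features by coefficient magnitude, which depends on each feature's scale. The synthetic features here share one scale; on real data with mixed units, scaling the inputs first may be worth considering.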

In [None]:
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=10, step=1)
selector = selector.fit(inputs_df, targets_df)
Ranking = pd.DataFrame(data= selector.feature_names_in_, columns=['Features'])
Ranking['Feature Selected'] = selector.support_
Ranking[Ranking['Feature Selected'].eq(True)]

In [None]:
inputs_df1 = inputs_df[Ranking[Ranking['Feature Selected'].eq(True)]['Features'].values.tolist()]

## Splitting Data and Training model

So far we have been working with a 10% sample of the data. We first train on this sample to get an initial impression of which models look promising, and later retrain on the full dataset.

Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df1, targets_df, test_size=0.25, random_state=42)

In [None]:
names = ['Linear Regression', 'KNN', 'SVR (RBF kernel)', 'Random Forest', 'Ridge', 'Lasso']
regressors = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=3),
    SVR(kernel="rbf", C=1.0),
    RandomForestRegressor(max_depth=5, n_estimators=100),
    Ridge(alpha=1.0),
    Lasso(alpha=1.0)]

In [None]:
scores = []
for name, clf in zip(names, regressors):
    clf.fit(train_inputs, train_targets)
    score = clf.score(val_inputs, val_targets)
    scores.append(score)
scores_df = pd.DataFrame()
scores_df['name'] = names
scores_df['score'] = scores
scores_df.sort_values('score', ascending= False)
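
For scikit-learn regressors, `.score` returns the coefficient of determination (R²), where higher is better, rather than an error such as RMSE. The sketch below computes both on made-up numbers; the helper names `r2` and `rmse` are hypothetical and not part of the notebook.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

print(round(r2(y_true, y_pred), 4))    # 0.9
print(round(rmse(y_true, y_pred), 4))  # 0.7071
```

R² is unitless, so it is convenient for comparing models, while RMSE is expressed in the units of `Weekly_Sales` and is easier to relate to the size of a typical error.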

Next, we define a function that applies the same imputation and encoding steps to the whole dataset.

In [None]:
def transformer(df):
    # Mean-impute the numeric columns
    numeric_cols = df.select_dtypes(include=['float64', 'int32', 'UInt32']).columns.tolist()
    imputer = SimpleImputer(strategy='mean')
    imputer.fit(df[numeric_cols])
    df[numeric_cols] = imputer.transform(df[numeric_cols])
    # One-hot encode the categorical and boolean columns
    categorical_cols = df.select_dtypes(include=['object', 'bool']).columns.tolist()
    # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoder.fit(df[categorical_cols])
    encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
    df[encoded_cols] = encoder.transform(df[categorical_cols])
    # Drop the original columns now that they are encoded
    df.drop(['Type', 'IsHoliday'], axis=1, inplace=True)

In [None]:
transformer(merged_train_df)

In [None]:
merged_train_df

Checking if the values have been imputed/filled or not.

In [None]:
merged_train_df.isna().sum()

In [None]:
targets_df

In [None]:
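targets_df = merged_train_df['Weekly_Sales']
# Exclude the target from the candidate features to avoid leakage
candidate_inputs = merged_train_df.drop(columns=['Weekly_Sales'])
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, min_features_to_select=10)
selector = selector.fit(candidate_inputs, targets_df)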
Ranking = pd.DataFrame(data= selector.feature_names_in_, columns=['Features'])
Ranking['Feature Selected'] = selector.support_
Ranking[Ranking['Feature Selected'].eq(True)]

In [None]:
inputs_df = merged_train_df[Ranking[Ranking['Feature Selected'].eq(True)]['Features'].values.tolist()]

In [None]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df, 
                                                                        targets_df, 
                                                                        test_size=0.25, 
                                                                        random_state=42)

In [None]:
#names = ['Linear Regression', "KNN", "Linear SVM", "Random Forest",'Ridge', 'Lasso']
#regressors = [LinearRegression(), KNeighborsRegressor(n_neighbors=3),SVR(kernel="rbf", C=1.0),RandomForestRegressor(max_depth=5, n_estimators=100),Ridge(alpha=1.0),Lasso(alpha=1.0)]

In [None]:
model = LinearRegression()
model.fit(train_inputs, train_targets)
print("The Validation Score of Lin Reg Model is %0.2f" % (model.score(val_inputs, val_targets)))

In [None]:
model = KNeighborsRegressor(n_neighbors=3)
model.fit(train_inputs, train_targets)
print("The Validation Score of KNN Model is %0.2f" % (model.score(val_inputs, val_targets)))

In [None]:
model = RandomForestRegressor(n_jobs=-1, random_state=42)
model.fit(train_inputs, train_targets)
print("The Validation Score of Random Forest Model is %0.2f" % (model.score(val_inputs, val_targets)))

### Hyperparameter Tuning 

Hyperparameter optimization, or tuning, is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. 

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.

![](https://i.imgur.com/EJCrSZw.png)
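
As a minimal sketch of what such a search could look like, the example below runs a small cross-validated grid search over a Random Forest on synthetic data. The grid values, the synthetic dataset and the choice of `GridSearchCV` are assumptions for illustration; this is not the tuning that was run for this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Illustrative grid only; real ranges would depend on the data and time budget
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                   # 3-fold cross-validation per combination
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

Each combination in the grid is scored by cross-validation, and the combination with the lowest average RMSE is refit on all of the data as `search.best_estimator_`.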


In [None]:
sample_submission

In [None]:
#Test_df = Test_df.drop(['MarkDown2','MarkDown5','IsHoliday','Type_C', 'Year'], axis=1)

In [None]:
#sample_submission['Weekly_Sales'] = rf_test_preds
#sample_submission.to_csv('submission.csv',index=False)

### Future Work and references

As the validation scores above show, the Random Forest model performs best of the models we tried, so it is our preferred model in this case. Note that the scores are R² values as returned by `.score`, so higher is better.

References:-
- https://www.kaggle.com/maxdiazbattan/wallmart-sales-eda-feat-eng-future-update
- https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Future work may include trying out more algorithms, such as Support Vector Machines (SVM), Lasso Regression, Ridge Regression and Gaussian Regression, to see which one can further reduce the RMSE and give us even better results.