# __Walmart Dataset__ <br>
_Walmart Store Sales Prediction - Regression Problem_

## __Description:__ <br>
Walmart, one of the leading retail chains in the US, would like to predict sales and demand accurately. Certain events and holidays affect sales on any given day. Sales data are available for 45 Walmart stores. The business faces unforeseen demand and sometimes runs out of stock because its current machine learning algorithm is inadequate. An ideal ML algorithm would predict demand accurately and take into account economic factors such as the CPI and the unemployment index. <br>

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data are available for 45 Walmart stores located in different regions.<br>

__Acknowledgements__<br>
The dataset is taken from Kaggle. <br>

__Objective:__<br>
Understand the dataset and clean it up (if required). <br>
Build regression models to predict sales from single and multiple features. <br>
Evaluate the models and compare their scores, such as R2 and RMSE. <br>

### About this file
This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file WalmartStoresales.
Within this file you will find the following fields:<br>


| **Field**   |     **Description** |  
|-------------|---------------------|
| Store       |  the store number | 
| Date        |  the week of sales |
| Weekly_Sales| sales for the given store   |
| Holiday_Flag| whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week) | 
| Temperature | Temperature on the day of sale |
| Fuel_Price  |  Cost of fuel in the region |
| CPI         |  Prevailing consumer price index | 
| Unemployment|  Prevailing unemployment rate |
| Holiday Events| Super Bowl;Labour Day;Thanksgiving;Christmas|

## __Data Exploration__

**Step 1:** The very first step is to have a deeper look into the data:
1. Using pandas, load the file *walmart.csv* into a dataframe called *df*
2. Print the attribute ```name_dataframe.dtypes``` to see the data type associated with each field in the table
3. Run the method ```name_dataframe.head(N)``` to look at the first N rows of the dataframe
4. Use the method ```name_dataframe.describe()``` to generate descriptive statistics that summarize each numeric field of the dataframe

### Import the libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

### Read the dataset

In [None]:
df = pd.read_csv('/kaggle/input/walmart-dataset/Walmart.csv')
df.head()

### Check the shape of the dataset

In [None]:
df.shape

### Check the datatypes of each column

In [None]:
df.dtypes

### Check for Descriptive Statistics

In [None]:
df.describe().T

### Check for columns

In [None]:
print(df.columns.to_list())

### Convert the date column from object to datetime

In [None]:
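# Assumption: the CSV stores dates day-first (DD-MM-YYYY), so parse them accordingly
df.Date = pd.to_datetime(df.Date, dayfirst=True)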
df.Date.dtype
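Date strings such as `05-02-2010` are ambiguous between day-first and month-first readings. The sketch below uses hypothetical sample values and assumes the file stores dates as DD-MM-YYYY; it shows that an explicit format yields weekly-spaced dates.

```python
import pandas as pd

# Hypothetical sample values, assuming the file stores dates as DD-MM-YYYY
raw = pd.Series(["05-02-2010", "12-02-2010", "19-02-2010"])

# An explicit format removes any ambiguity between day-first and month-first
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# Consecutive rows should be exactly one week apart when parsed correctly
print(parsed.diff().dropna().dt.days.tolist())
```

If the gaps between consecutive weeks of one store are not 7 days after parsing, the format assumption is probably wrong.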

In [None]:
df['year'], df['month'] = df['Date'].dt.year, df['Date'].dt.month
df.sample(5)

### Checking for top 10 largest Sales

In [None]:
df[['Store', 'Date', 'Weekly_Sales', 'Holiday_Flag', 'Temperature', 'Fuel_Price']].nlargest(10,'Weekly_Sales')

**Store 14 has the highest weekly sales, while stores 20, 10, 13, and 4 each appear twice in the top 10**

### Convert the numerical columns to categorical

In [None]:
# Convert the Store column to categorical, since each value is a store identifier rather than a quantity
df.Store = pd.Categorical(df.Store)
df.Store.dtype

In [None]:
df['Holiday_Flag'] = pd.Categorical(df.Holiday_Flag)
df.Holiday_Flag.dtype

## __Exploratory Data Analysis__

In [None]:
# Print the maximum of the date column
print(df.Date.max())

# Print the minimum of the date column
print(df.Date.min())

In [None]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# String names for the built-in aggregations avoid pandas warnings about NumPy callables
print(df[["Temperature", "Fuel_Price", "Unemployment"]].agg([iqr, "mean", "median"]))
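As a quick check of what the custom IQR function returns, here is a minimal sketch on a hypothetical five-value column (the readings are made up, not taken from the dataset).

```python
import pandas as pd

def iqr(column):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical temperature readings, not taken from the dataset
toy = pd.DataFrame({"Temperature": [40.0, 50.0, 60.0, 70.0, 80.0]})

# Custom function and built-in aggregations side by side
summary = toy.agg([iqr, "mean", "median"])
print(summary)
```

For this toy column the quartiles fall on 50 and 70, so the IQR is 20, while the mean and median are both 60.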

In [None]:
# Sort df by date and store the result in sales_1_1
sales_1_1 = df.sort_values('Date')

# Get the cumulative sum of Weekly_Sales (across all stores), add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1.Weekly_Sales.cumsum()

# Get the overall max of Weekly_Sales, add as max_sales col (same value on every row)
sales_1_1['max_sales'] = sales_1_1.Weekly_Sales.max()

# See the columns we calculated
sales_1_1[["Date", "Weekly_Sales", "cum_weekly_sales", "max_sales"]]
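The cumulative sum above runs over all stores at once. If a running total or running maximum per store is wanted instead, one possible approach is a grouped cumulative operation. The sketch below uses a hypothetical two-store sample, and the column name `cum_max_sales` is an illustrative choice.

```python
import pandas as pd

# Hypothetical two-store sample; values are illustrative only
sales = pd.DataFrame({
    "Store": [1, 2, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-12"]),
    "Weekly_Sales": [100.0, 300.0, 150.0, 250.0],
}).sort_values(["Store", "Date"])

# Running total and running maximum computed separately for each store
grouped = sales.groupby("Store")["Weekly_Sales"]
sales["cum_weekly_sales"] = grouped.cumsum()
sales["cum_max_sales"] = grouped.cummax()
print(sales)
```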

In [None]:
# Subset the rows where Holiday_Flag is 1 and drop duplicate dates
holiday_dates = df[df['Holiday_Flag'] == 1].drop_duplicates(subset = 'Date')

# Print date col of holiday_dates
holiday_dates.Date

### Analysis by store

In [None]:
Store = df.groupby(['Store']).agg({'Weekly_Sales':['mean','max','sum']})
Store[:5]

### Set Style 

In [None]:
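# The 'seaborn-darkgrid' style name was removed in Matplotlib 3.8; seaborn's own setter works across versions
sns.set_style('darkgrid')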

### Line chart to view trend across stores for Total weekly_sales column

In [None]:
plt.figure(figsize = (15,8))
Store[('Weekly_Sales',  'sum')].plot()
plt.show()

### Bar chart for better analysis

In [None]:
plt.figure(figsize = (15,8))
Store[('Weekly_Sales',  'sum')].plot(kind = 'bar',color = 'blue')
plt.xticks(rotation = 0)
plt.title('Total Sum of Sales')
plt.axhline(y=200000000,color = 'orange')
plt.axhline(y=100000000,color = 'red')
plt.axhline(y=300000000,color = 'green')
plt.show()

__The bar plot is easier to read than the line chart and supports the following observations:__
* Stores with total sales below the red line are underperforming<br>
* Stores between the red and orange lines are average<br>
* Stores between the orange and green lines are performing above average<br>
* Stores touching the green line are performing very well<br>
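The same banding can be expressed in code rather than read off the chart. This is a minimal sketch using `pd.cut` with the chart's cut-offs (100M, 200M, 300M); the store totals and the band labels are hypothetical, and treating 300M as a hard boundary for the top band is an assumption.

```python
import pandas as pd

# Hypothetical store totals in dollars; not taken from the dataset
totals = pd.Series(
    [40_000_000, 150_000_000, 250_000_000, 301_000_000],
    index=pd.Index([5, 8, 2, 20], name="Store"),
)

# Same cut-offs as the red, orange, and green lines in the bar chart
bins = [0, 100_000_000, 200_000_000, 300_000_000, float("inf")]
labels = ["underperforming", "average", "above average", "top"]
bands = pd.cut(totals, bins=bins, labels=labels)
print(bands)
```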

**We can do the same for max Weekly sales for each store**

In [None]:
plt.figure(figsize = (15,8))
Store[('Weekly_Sales',  'max')].plot(kind = 'bar',color = 'violet')
plt.xticks(rotation = 0)
plt.title('Max Weekly_Sales')
plt.axhline(y=2000000,color = 'orange')
plt.axhline(y=1000000,color = 'red')
plt.axhline(y=3000000,color = 'green')
plt.show()

__From the charts we can say that stores 4 and 20 are the best-performing stores, while stores 5, 33, and 44 are the worst performing__

### Using the Holiday_Flag column to check Weekly_Sales

The holiday weeks are:
* Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
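The holiday dates listed above can be turned into a lookup that labels each week with its holiday name. The sketch below is one possible way to do it; it keeps only the 2010 to 2012 dates, and the `weeks` series is a hypothetical stand-in for the notebook's `Date` column.

```python
import pandas as pd

# Holiday week dates from the list above, restricted to 2010-2012
holidays = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10"],
    "Labour Day": ["2010-09-10", "2011-09-09", "2012-09-07"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28"],
}

# Invert to a date -> holiday-name lookup
date_to_holiday = {
    pd.Timestamp(d): name for name, dates in holidays.items() for d in dates
}

# Hypothetical weekly dates; non-holiday weeks map to NaN
weeks = pd.Series(pd.to_datetime(["2010-02-05", "2010-02-12", "2010-11-26"]))
named = weeks.map(date_to_holiday)
print(named)
```

A column built this way would allow sales to be compared per holiday rather than only holiday versus non-holiday.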

In [None]:
# check for total values in Holiday_Flag column
df.Holiday_Flag.value_counts()

### Aggregating Weekly sales based on store and holiday flag

In [None]:
Store_new = df.groupby(['Store','Holiday_Flag']).agg({'Weekly_Sales':['mean','max','sum']})
Store_new = Store_new.reset_index()
Store_new

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(x = 'Store',y = ('Weekly_Sales',  'mean'),hue= 'Holiday_Flag',data=Store_new)
plt.show()

__Inference__: <br>
* People generally tend to spend more during holiday weeks<br>
* Average weekly sales are higher in holiday weeks than in non-holiday weeks
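To put a number on this inference, the holiday uplift can be computed as a percentage. The sketch below uses made-up sales figures purely to show the calculation.

```python
import pandas as pd

# Hypothetical weekly sales; values are illustrative only
sample = pd.DataFrame({
    "Holiday_Flag": [0, 0, 0, 1, 1],
    "Weekly_Sales": [1000.0, 1100.0, 900.0, 1200.0, 1300.0],
})

# Mean weekly sales for non-holiday (0) and holiday (1) weeks
means = sample.groupby("Holiday_Flag")["Weekly_Sales"].mean()

# Relative uplift of holiday weeks over non-holiday weeks, in percent
uplift_pct = (means.loc[1] / means.loc[0] - 1) * 100
print(means)
print(f"Holiday uplift: {uplift_pct:.1f}%")
```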
    

### Features vs Weekly sales

In [None]:
# Temperature vs Weekly Sales
plt.figure(figsize=(15,8))
sns.scatterplot(x = 'Temperature',y = 'Weekly_Sales',hue = 'Store',data = df,legend = False)
plt.show()

In [None]:
# We can say that temperature doesn't have much impact

In [None]:
# CPI vs Weekly Sales
plt.figure(figsize=(15,8))
sns.scatterplot(x = 'CPI',y = 'Weekly_Sales',data = df,legend = False)
plt.show()

In [None]:
# Again no significant pattern can be observed

In [None]:
# Unemployment vs Weekly Sales
plt.figure(figsize=(15,8))
sns.scatterplot(x = 'Unemployment',y = 'Weekly_Sales',data = df,legend = False)
plt.show()

In [None]:
# Again no significant pattern can be observed
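A visual impression of "no significant pattern" can be backed by correlation coefficients. The sketch below runs on synthetic data in which sales are generated independently of the features, so it only illustrates the calculation; applied to the notebook's `df`, the same `corr()` call would give the actual values.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: sales are drawn independently of the features
rng = np.random.default_rng(0)
n = 200
synthetic = pd.DataFrame({
    "Temperature": rng.normal(60, 15, n),
    "CPI": rng.normal(170, 40, n),
    "Unemployment": rng.normal(8, 2, n),
    "Weekly_Sales": rng.normal(1_000_000, 500_000, n),
})

# Pearson correlation of each feature with Weekly_Sales
corr = synthetic.corr()["Weekly_Sales"].drop("Weekly_Sales")
print(corr.round(3))
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out a non-linear effect.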

### Plotting for all features

In [None]:
sns.pairplot(df)
plt.show()

### Check for distribution of numerical features

In [None]:
# Temperature
fig, axs = plt.subplots(nrows=3, figsize=(15, 15))
sns.boxplot(x = df['Temperature'], ax=axs[0])
sns.violinplot(x = df['Temperature'], ax=axs[1])
sns.boxenplot(x = df['Temperature'], ax=axs[2])
plt.show()

In [None]:
# CPI
fig, axs = plt.subplots(nrows=3, figsize=(15, 15))
sns.boxplot(x = df['CPI'], ax=axs[0],color='lightblue')
sns.violinplot(x = df['CPI'], ax=axs[1],color='lightblue')
sns.boxenplot(x = df['CPI'], ax=axs[2],color='lightblue')
plt.show()

In [None]:
# Unemployment
plt.figure(figsize=(15,8))
sns.histplot(x = 'Unemployment',data = df)
plt.title("Histogram of Unemployment")
plt.show()

## __Data Preprocessing__

In [None]:
df.sample(5)

In [None]:
df.info()

### Adding dummy variables for Categorical features

In [None]:
df_dummies = pd.get_dummies(df,columns=['Store','Holiday_Flag'])
print(df_dummies.columns.to_list())
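To see what the encoding produces, here is a small sketch on a hypothetical three-row sample. It also shows the optional `drop_first=True` argument, which is not used in the notebook but is a common way to avoid perfectly collinear dummy columns in a linear model with an intercept.

```python
import pandas as pd

# Hypothetical three-row sample with the two categorical columns
sample = pd.DataFrame({
    "Store": pd.Categorical([1, 2, 3]),
    "Holiday_Flag": pd.Categorical([0, 1, 0]),
    "Weekly_Sales": [100.0, 200.0, 150.0],
})

# drop_first=True removes one level per feature (Store_1 and Holiday_Flag_0 here)
encoded = pd.get_dummies(sample, columns=["Store", "Holiday_Flag"], drop_first=True)
print(encoded.columns.to_list())
```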

## __Machine Learning__

### Using sklearn

**Step 2 - Prepare the data:**
We split our data into two sets: one data set for training and another one that we will use at the end to test our model.

1. Import the function ```train_test_split``` from ```sklearn.model_selection```
2. Split *df* into **X**, made of all features except *Date* and *Weekly_Sales*, and **y**, made of the feature *Weekly_Sales*
3. Use ```train_test_split``` with *test_size*=0.20 (20% of the rows become the test set) to obtain a train set and a test set:

    ```X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)```

In [None]:
X = df_dummies.drop(['Date','Weekly_Sales'],axis=1)
y = df_dummies.Weekly_Sales

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)


In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

#### Fit the model and predict on the test set

In [None]:
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)

#### Create a DataFrame of Actual vs Predicted values

In [None]:
df_new = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_new.head()

#### Check for actual vs predicted mean

In [None]:
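# Compare the mean of the actual test values with the mean of the predictions
print(f"Actual mean: {y_test.mean()}")
print(f"Predicted mean: {y_pred.mean()}")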

#### Check for error

In [None]:
from sklearn import metrics
print('Mean Absolute error: ', metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
# R² (coefficient of determination): 1 means perfect prediction
print(f"The R² score is: {lr.score(X_test,y_test)}")

#### RMSE, R-squared and adjusted R-squared

In [None]:
# RMSE on the test set
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred))

# R-squared on the test set
r2 = r2_score(y_test, y_pred)

# Adjusted R-squared: n is the number of test rows, p the number of predictors
n, p = len(y_test), X_test.shape[1]
adj_r2 = 1 - (n - 1) / (n - p - 1) * (1 - r2)

rmse, r2, adj_r2
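For reference, the three metrics can be computed by hand from the residual and total sums of squares. The sketch below uses hypothetical actual and predicted values and an assumed predictor count `p`, so the numbers carry no meaning for the Walmart model.

```python
import numpy as np

# Hypothetical actual and predicted values for five test rows
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0, 12.0])
p = 2  # assumed number of predictors in the model

n = len(y_true)
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

rmse = np.sqrt(ss_res / n)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(rmse, r2, adj_r2)
```

Here the residual sum of squares is 2.5 and the total sum of squares is 40, giving an R-squared of 0.9375 and an adjusted R-squared of 0.875. The adjustment penalizes the score for each additional predictor.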

### Using Polynomial Features
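The notebook does not include code for this step. As a starting point, here is a minimal sketch of polynomial regression with scikit-learn's `PolynomialFeatures` in a pipeline. It runs on synthetic quadratic data, not on the Walmart features; in the notebook the expansion would presumably be applied to the numeric columns only, not to the dummy variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration: the target depends quadratically on x
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(200, 1))
y_quad = 2.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Degree-2 feature expansion followed by ordinary least squares
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y_quad)

# Plain linear fit on the same data for comparison
linear_model = LinearRegression().fit(x, y_quad)
print("linear R2:", linear_model.score(x, y_quad))
print("poly R2:  ", poly_model.score(x, y_quad))
```

The scores here are in-sample; on real data the comparison should be made on a held-out test set, since higher degrees overfit easily.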

__What can be improved__:<br>
* Use the date feature (for example year, month, or week of year) as a model input
* Do more exploratory data analysis on the other features, such as CPI and temperature
* Run ridge regression, or linear regression with TensorFlow or any other model, to compare results
* Use PCA to reduce the dimensions and check the effect on performance
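As an illustration of the first suggestion, calendar features can be derived from a datetime column with the `.dt` accessor. The dates below are hypothetical stand-ins for the notebook's `Date` column, and the choice of features is only one option.

```python
import pandas as pd

# Hypothetical weekly dates; in the notebook this would be df['Date']
dates = pd.Series(pd.to_datetime(["2010-02-12", "2010-11-26", "2010-12-31"]))

date_features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    # ISO week number captures seasonality at weekly resolution
    "week": dates.dt.isocalendar().week.astype(int),
})
print(date_features)
```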
