# Data Description

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
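The five-fold holiday weighting can be made concrete with a small scoring helper. The sketch below assumes the evaluation is a weighted mean absolute error; the text above only states the weighting, so the error type, the function name `weighted_mae`, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Mean absolute error in which holiday weeks count `holiday_weight` times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # weight 5 for holiday weeks, 1 for all other weeks
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# toy example: the second week is a holiday week, so its error dominates the score
score = weighted_mae([100, 200, 300], [110, 180, 300], [False, True, False])
print(score)
```

A model that is slightly worse on ordinary weeks but better on holiday weeks can therefore still score better overall.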

##### stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

##### train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week

##### test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

##### features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

- Store - the store number
- Date - the week
- Temperature - average temperature in the region
- Fuel_Price - cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index (The Consumer Price Index measures the average change in prices over time that consumers pay for a basket of goods and services.)
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
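Since `IsHoliday` only says that a week is a holiday week, not which one, the dates listed above can be turned into a lookup. This is a minimal sketch; `HOLIDAY_WEEKS`, `label_holiday` and the "No holiday" label are hypothetical names chosen for illustration.

```python
import pandas as pd

# Holiday weeks copied from the list above, written as ISO dates.
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

# Invert the mapping: one Timestamp per holiday week -> holiday name.
date_to_holiday = {
    pd.Timestamp(day): name
    for name, days in HOLIDAY_WEEKS.items()
    for day in days
}

def label_holiday(dates):
    """Return the holiday name for each date, or "No holiday" for ordinary weeks."""
    parsed = pd.to_datetime(pd.Series(dates))
    return parsed.map(date_to_holiday).fillna("No holiday")

labels = label_holiday(["2010-02-12", "2010-02-19", "2011-11-25"])
print(labels.tolist())
```

A column built this way would let a model treat Thanksgiving and the Super Bowl differently instead of sharing one binary flag.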

# Import libraries and load data

In [None]:
# import libraries

import numpy as np 
import pandas as pd 

# data viz libraries
import seaborn as sns
sns.set(style="whitegrid") # to make charts look better
import matplotlib.pyplot as plt
%matplotlib inline

# progress bars
from tqdm import tqdm

# general utilities
import datetime
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
pd.set_option('display.max_columns', None)

# for ML
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, Normalizer, StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# import plotly modules
import chart_studio.plotly as py
import cufflinks as cf
import plotly.express as px
import plotly.figure_factory as ff

# make it work on jupyter notebook
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Use Plotly locally
cf.go_offline()

In [None]:
# load datasets
dataFeatures = pd.read_csv("C:/Users/digit/Desktop/Ironhack/project-week-9-final-project/data/features.csv")
dataStores = pd.read_csv("C:/Users/digit/Desktop/Ironhack/project-week-9-final-project/data/stores.csv")
dataTest = pd.read_csv("C:/Users/digit/Desktop/Ironhack/project-week-9-final-project/data/test.csv")
dataTrain = pd.read_csv("C:/Users/digit/Desktop/Ironhack/project-week-9-final-project/data/train.csv")

# EDA and Data Cleaning

- Here we will explore the data to search for patterns and relationships and to understand them better.
- Perform data cleaning and data wrangling where necessary.
   

In [None]:
dataStores.head(5)

In [None]:
dataStores.shape 

In [None]:
dataFeatures.tail(5)

In [None]:
dataFeatures.shape

In [None]:
# we will start by merging dataStores and dataFeatures since Features is the extension of Stores
FeatSto = dataFeatures.merge(dataStores, how="inner", on="Store")

# check the head of the new df
FeatSto.head(5)
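Because every row in the features table should match exactly one store, the merge can be asked to verify that. A minimal sketch on toy frames (the values are made up; `validate` is a standard `DataFrame.merge` parameter):

```python
import pandas as pd

# Toy stand-ins for features.csv and stores.csv; the numbers are invented.
features = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Temperature": [42.3, 38.5, 40.2],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [151315, 202307]})

# validate="many_to_one" raises MergeError if a store number appears twice in `stores`,
# which would otherwise silently duplicate feature rows.
merged = features.merge(stores, how="inner", on="Store", validate="many_to_one")
print(merged.shape)
```

If the row count after the merge equals the row count of the features table, no rows were lost or duplicated.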

In [None]:
FeatSto.shape

In [None]:
# check the dtypes in FeatSto
FeatSto.dtypes

# Type is of categorical nature
# IsHoliday of binary categorical nature 

# the rest are numerical
# some of the features might contain numerical values but still behave as categorical
# Date is string and we will convert it into datetime later or drop it

In [None]:
# check for missing values
FeatSto.isnull().sum()

## Inspect the train and test data (dataTrain and dataTest)

In [None]:
dataTest.head(5)

In [None]:
dataTest.shape

In [None]:
# as we can see, dataTrain additionally includes Weekly_Sales
dataTrain.tail(5)

In [None]:
dataTrain.shape

In [None]:
dataTest.dtypes

In [None]:
dataTrain.dtypes

In [None]:
# we will disregard dataTest (its Weekly_Sales are withheld) and use only dataTrain
# (we will train-test split the data later)
# merge FeatSto with dataTrain -> dfwTrain
# now we have one dataframe containing the store information,
# the features and the weekly sales

dfwTrain = pd.merge(FeatSto, dataTrain, how="inner", on=["Store", "Date", "IsHoliday"])

dfwTrain.head(5)

In [None]:
dfwTrain.shape

In [None]:
dfwTrain.tail(5)

In [None]:
# continue under the name df_total (this is the same object as dfwTrain, not a copy)

df_total = dfwTrain

# show the head of the total dataframe

df_total.head()

In [None]:
# show the tail

df_total.tail()

# first impression:
# Date is the week
# Markdowns 1 - 5 contain a lot of missing values
# Weekly_Sales is numerical continuous data

In [None]:
# check the shape of df_total

df_total.shape

In [None]:
# check the dtypes of df_total

df_total.dtypes

# most features have numerical values
# Date and Type are strings
# Type and IsHoliday are categorical, so we will encode them later
# some features with numerical values (e.g. Store, Dept) might also behave as categoricals

In [None]:
# now we can check for missing values

df_total.isnull().sum()

In [None]:
# calculate the percentage of missing values in each column

df_total.isnull().sum() / len(df_total)

# if a column contains 85% or more missing values, it should normally be dropped
# MarkDown1-5 are anonymized and have lots of missing values, but they still contain important data
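The 85% rule of thumb mentioned above can be checked programmatically. The sketch below uses a toy frame with invented values; `DROP_THRESHOLD` and `drop_candidates` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame: MarkDown1 is 90% missing, CPI is 20% missing, Weekly_Sales is complete.
toy = pd.DataFrame({
    "Weekly_Sales": np.arange(10, dtype=float),
    "MarkDown1": [np.nan] * 9 + [5.0],
    "CPI": [211.0] * 8 + [np.nan] * 2,
})

# isnull().mean() gives the share of missing values per column directly.
missing_share = toy.isnull().mean().sort_values(ascending=False)

DROP_THRESHOLD = 0.85  # the rule of thumb from the comment above
drop_candidates = missing_share[missing_share >= DROP_THRESHOLD].index.tolist()
print(missing_share)
print(drop_candidates)
```

On the real data the same two lines show which MarkDown columns, if any, cross the threshold.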

In [None]:
# instead of dropping, we will fill the NaN values with zeros
# re-running this cell is harmless: once the NaNs are filled, fillna has nothing left to change
df_total.fillna(0, inplace=True)

df_total.head()
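Filling with zero makes "no markdown recorded" indistinguishable from "a markdown of 0". One possible refinement, sketched here on toy data with hypothetical `*_missing` column names, is to keep an indicator column before filling:

```python
import numpy as np
import pandas as pd

# Toy markdown columns with gaps; the values are invented.
toy = pd.DataFrame({
    "MarkDown1": [np.nan, 250.0, np.nan],
    "MarkDown2": [10.0, np.nan, 0.0],
})
markdown_cols = ["MarkDown1", "MarkDown2"]

# Record where values were missing, then fill the gaps with 0.
for col in markdown_cols:
    toy[f"{col}_missing"] = toy[col].isnull().astype(int)
toy[markdown_cols] = toy[markdown_cols].fillna(0)
print(toy)
```

A linear model can then learn a separate offset for the weeks without markdown data.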

In [None]:
# check for missing values again

df_total.isnull().sum()

In [None]:
# check for duplicated values

df_total.duplicated().sum()

# no duplicates

In [None]:
# add a Month column

df_total["Month"] = pd.to_datetime(df_total['Date']).dt.month
df_total.sample(5) 

In [None]:
# add a Week column (ISO week number; Series.dt.week was removed in pandas 2.0)
df_total["Week"] = pd.to_datetime(df_total["Date"]).dt.isocalendar().week.astype(int)
df_total.sample(5)

In [None]:
# add Year column
df_total["Year"] = pd.to_datetime(df_total["Date"]).dt.year 
df_total.sample(5)

In [None]:
# convert "Date" column to datetime format
df_total["Date"] = pd.to_datetime(df_total["Date"])
df_total.dtypes
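The Month, Week and Year columns can also be derived from a single parse of the date strings. A minimal sketch on a toy frame, assuming ISO-formatted dates as in the CSV files:

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["2010-02-05", "2011-12-30", "2012-10-26"]})

# Parse once, then read the calendar fields from the datetime accessor.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Month"] = toy["Date"].dt.month
toy["Week"] = toy["Date"].dt.isocalendar().week.astype(int)  # ISO week number
toy["Year"] = toy["Date"].dt.year
print(toy)
```

Parsing once avoids repeating `pd.to_datetime` over the full column for every derived feature.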

In [None]:
df_total.head()

In [None]:
# plot Average Monthly Sales - Per Year

weekly_sales_2010 = df_total[df_total.Year==2010]['Weekly_Sales'].groupby(df_total['Month']).mean()
weekly_sales_2011 = df_total[df_total.Year==2011]['Weekly_Sales'].groupby(df_total['Month']).mean()
weekly_sales_2012 = df_total[df_total.Year==2012]['Weekly_Sales'].groupby(df_total['Month']).mean()
plt.figure(figsize=(20,8))
sns.lineplot(x=weekly_sales_2010.index, y=weekly_sales_2010.values)
sns.lineplot(x=weekly_sales_2011.index, y=weekly_sales_2011.values)
sns.lineplot(x=weekly_sales_2012.index, y=weekly_sales_2012.values)
plt.grid()
plt.xticks(np.arange(1, 13, step=1))
plt.legend(['2010', '2011', '2012'], loc='best', fontsize=16)
plt.title('Average Monthly Sales - Per Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Month', fontsize=16)
plt.show()

# 2012 was not doing so well compared to the other years
# (the training data stops at the start of November 2012, so 2012 has no Thanksgiving or Christmas weeks)

# there is a sharp rise in Sales between January and February, which coincides with the Super Bowl
# and as we can see, the Monthly Sales usually spike in November and December
# when Thanksgiving and Christmas take place
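The three per-year series plotted above can also be computed in one step as a Month by Year table. A sketch with invented numbers:

```python
import pandas as pd

# Toy rows standing in for df_total; only the three columns used here.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011, 2011],
    "Month": [2, 2, 2, 3, 3],
    "Weekly_Sales": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# Rows are months, columns are years, cells are the mean weekly sales.
monthly = toy.pivot_table(index="Month", columns="Year",
                          values="Weekly_Sales", aggfunc="mean")
print(monthly)
```

Months without data for a given year appear as NaN, which makes gaps such as the missing end of 2012 easy to spot.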



In [None]:
# use Plotly to plot TimeSeries to see whether Date affects Weekly_Sales
# make one plot

px.line(df_total, x="Date", y="Weekly_Sales", labels={"x":"Date", "y":"Weekly_Sales"},
       title="Weekly Sales across Feb 2010 - Oct 2012")

# in more detail, we can see how Date affects Weekly_Sales
# the highest spikes are in the weeks of Thanksgiving and Christmas
# as we have seen in the previous plot, 2012 was not so good in terms of sales for Walmart

In [None]:
# now we can drop the Date column

df_total.drop(["Date"], inplace=True, axis=1)
df_total.sample(5)

For Linear Regression, we need numerical values. Thus we will encode the categorical features of the dataset as numbers.

Since Type is categorical text data, we use Label Encoder to convert it into numerical data that the model can work with.

In [None]:
# encode Type

from sklearn.preprocessing import LabelEncoder
  
le = LabelEncoder()
df_total['Type']= le.fit_transform(df_total['Type'])

In [None]:
# encode IsHoliday

df_total['IsHoliday'] = le.fit_transform(df_total['IsHoliday'])
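Label encoding turns the store types into the integers 0, 1, 2, which a linear model reads as an ordered scale. A possible alternative, shown here only as a sketch on toy data, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-in for the Type column; the real values come from stores.csv.
toy = pd.DataFrame({"Type": ["A", "B", "C", "A"]})

# Integer codes: alphabetically sorted labels map to 0, 1, 2, as label encoding does.
integer_codes = toy["Type"].astype("category").cat.codes

# One-hot alternative: one 0/1 column per level, dropping the first level as baseline.
one_hot = pd.get_dummies(toy["Type"], prefix="Type", drop_first=True, dtype=int)
print(integer_codes.tolist())
print(one_hot)
```

With one-hot columns the regression fits a separate coefficient per store type instead of assuming that type C is "twice" type B.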

In [None]:
df_total.shape

In [None]:
# took a sample with the size of 5000, which should be enough to better understand the relationship between the columns

# had problems with loading the plot, that's why I saved an image of it

# sns.pairplot(df_total.sample(5000), height=5)

#from IPython.display import Image
#Image("sns_pairplot_df_total.png")


In [None]:
df_total.columns

In [None]:
# plot a scatter matrix in plotly (because sns.pairplot was too heavy)

fig = px.scatter_matrix(df_total.sample(1000), dimensions=['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday', 'Type', 'Size', 'Dept', 'Weekly_Sales', 'Month', 'Week',
       'Year'], height=5000, width=5000, title="Scatter Matrix", size_max=20)
fig.show()

# no correlation between features mostly
# we can drop Type later
# we can drop Size later

In [None]:
# check for correlations with correlation matrix
corr_matrix = df_total.corr(method="pearson") # we chose 'pearson'
corr_matrix

In [None]:
# plot a heatmap for better overview
fig, ax = plt.subplots(figsize=(14,12))
ax = sns.heatmap(corr_matrix, annot=True)
plt.show()

# values range over [-1, 1]; the strength is judged by the absolute value
# 0: no correlation at all
# 0 - 0.3: weak correlation
# 0.3 - 0.7: moderate correlation
# 0.7 - 1: strong correlation

# strong correlation between MarkDown1 and MarkDown4, drop MarkDown4 later
# Year and Fuel_Price show high correlation
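Reading strong pairs off a large heatmap is error-prone, so they can also be listed directly. The helper below is a sketch: `strong_pairs` is a hypothetical name, the 0.7 cut-off follows the scale in the comments above, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    # keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack()
    strong = stacked[stacked.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Toy data: MarkDown1 and MarkDown4 move together, Temperature does not.
toy = pd.DataFrame({
    "MarkDown1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MarkDown4": [1.1, 1.9, 3.2, 3.9, 5.1],
    "Temperature": [3.0, -1.0, 4.0, -2.0, 1.0],
})
pairs = strong_pairs(toy.corr(method="pearson"))
print(pairs)
```

Calling `strong_pairs(corr_matrix)` on the real matrix would give the candidates for dropping one feature of each pair.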

Here we will further inspect the relationship between the features and our target variable ("Weekly_Sales"), as well as features that are highly correlated with each other, in order to prevent multicollinearity.

In [None]:
# visualise a scatter plot in plotly
# to see whether there is a correlation between Unemployment and Weekly_Sales
# Walmart is a huge retail company that is competitive thanks to cheap prices, so unemployment might matter
# but I think Walmart is also a default choice for a lot of people who do not know exactly what they want
fig = px.scatter(df_total, x="Unemployment", y="Weekly_Sales")
fig.show()

Is there any correlation between MarkDown1-5 and Weekly_Sales?

In [None]:
fig = px.scatter(df_total, x="MarkDown1", y="Weekly_Sales")
fig.show()

In [None]:
fig = px.scatter(df_total, x="MarkDown2", y="Weekly_Sales")
fig.show()

In [None]:
fig = px.scatter(df_total, x="MarkDown3", y="Weekly_Sales")
fig.show()

In [None]:
fig = px.scatter(df_total, x="MarkDown4", y="Weekly_Sales")
fig.show()

In [None]:
fig = px.scatter(df_total, x="MarkDown5", y="Weekly_Sales")
fig.show()

In [None]:
fig = px.scatter(df_total, x="MarkDown1", y="MarkDown4")
fig.show()

# as we can see, there is a positive correlation between MarkDown1 and MarkDown4

In [None]:
df_total.tail(5)

In [None]:
# plot a histogram to check frequency distribution

df_total.hist(figsize=(20,30), xrot=45, bins=50)
plt.show()

# Temperature is left-skewed (negative skewness)
# MarkDown1-5 are heavily imbalanced - perform imputation and apply a logarithmic transformation in the 2nd iteration (if there is time)
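The logarithmic transformation mentioned above could look like the following sketch. It uses `np.log1p` so that the zeros introduced by the NaN filling stay at zero, and it assumes non-negative values; any negative markdowns would have to be clipped or shifted first. The numbers are invented.

```python
import numpy as np
import pandas as pd

# Toy markdown values: mostly zeros plus a few very large amounts.
markdown = pd.Series([0.0, 0.0, 12.5, 480.0, 25000.0], name="MarkDown1")

# log1p(x) = log(1 + x): maps 0 to 0 and compresses the long right tail.
markdown_log = np.log1p(markdown)

skew_before, skew_after = markdown.skew(), markdown_log.skew()
print(skew_before, skew_after)
```

The transformation is reversible with `np.expm1`, so predictions or plots can be mapped back to the original scale.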

In [None]:
# plot histogram with plotly
#x1 = df_total["Store"]
#x2 = df_total["Temperature"]
#x3 = df_total["Fuel_Price"]

#x4 = df_total["MarkDown1"]
#x5 = df_total["MarkDown2"]
#x6 = df_total["MarkDown3"]
#x7 = df_total["MarkDown4"]
#x8 = df_total["MarkDown5"]

#x9 = df_total["CPI"]
#x10 = df_total["Unemployment"]
#x11 = df_total["IsHoliday"]
#x12 = df_total["Type"]
#x13 = df_total["Size"]
# x14 = df_total["Dept"]

# x15 = df_total["Weekly_Sales"]
# x16 = df_total["Month"]
# x17 = df_total["Week"]
# x18 = df_total["Year"]

# hist_data = [x1, x2, x3,x4, x5, x6, x7, x8, x9,x10, x11, x12,x13, x14, x15,x16, x17, x18]

# group_labels = ['Store', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
#        'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
#        'IsHoliday', 'Type', 'Size', 'Dept', 'Weekly_Sales', 'Month', 'Week',
#        'Year']

# colors = ['#333F44', '#37AA9C', '#94F3E4', '#660000','#663300','#666600','#333300','#000000','#FF0000','#800000','#FFFF00',
#           '#808000','#00FF00', '#00FFFF','#008080','#0000FF', '#000080', '#FF00FF']


# Create distplot with curve_type set to 'normal'
#fig = ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors)

# Add title
#fig.update_layout(title_text='Curve and Rug Plot')
#fig.show()

In [None]:
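# sns.distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df_total["Temperature"], kde=True, stat="density")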
plt.show()

In [None]:
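# sns.distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df_total["Unemployment"], kde=True, stat="density")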
plt.show()

In [None]:
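# sns.distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df_total["Fuel_Price"], kde=True, stat="density")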
plt.show()

In [None]:
sns.boxplot(x=df_total["Temperature"])
plt.show()

In [None]:
sns.boxplot(x=df_total["Unemployment"])
plt.show()

In [None]:
sns.boxplot(x=df_total["Fuel_Price"])
plt.show()

In [None]:
sns.boxplot(x=df_total["MarkDown5"])
plt.show()
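The points drawn beyond the whiskers of the boxplots above can also be counted. The sketch below applies the usual 1.5 x IQR rule, which is also the default whisker length in these boxplots; `iqr_outlier_mask` is a hypothetical helper and the values are invented.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy unemployment rates with one unusually high value.
unemployment = pd.Series([7.8, 8.1, 7.9, 8.0, 8.2, 14.3], name="Unemployment")
mask = iqr_outlier_mask(unemployment)
print(unemployment[mask].tolist())
```

Applied to a column such as `df_total["Unemployment"]`, `mask.mean()` would give the share of rows flagged as outliers.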