This is part 1 of predicting Rossmann sales.
It covers only data cleaning, feature engineering and EDA.

For part 2, which looks for the best models for predicting sales,
check out https://www.kaggle.com/amithanayak/predict-rossmann-sales

# Getting Started

In [None]:
#import required libraries
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#reading data files
store_df=pd.read_csv("../input/rossmann-store-sales/store.csv")
train_df=pd.read_csv("../input/rossmann-store-sales/train.csv")

# Getting to Know your Data

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

    Id - an Id that represents a (Store, Date) duple within the test set
    Store - a unique Id for each store
    Sales - the turnover for any given day (this is what you are predicting)
    Customers - the number of customers on a given day
    Open - an indicator for whether the store was open: 0 = closed, 1 = open
    StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
    SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
    StoreType - differentiates between 4 different store models: a, b, c, d
    Assortment - describes an assortment level: a = basic, b = extra, c = extended
    CompetitionDistance - distance in meters to the nearest competitor store
    CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
    Promo - indicates whether a store is running a promo on that day
    Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
    Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
    PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store


In [None]:
store_df.head()

In [None]:
store_df.describe()

In [None]:
#Checking the no. of NaN values
store_df.isna().sum()

In [None]:
train_df.head()

In [None]:
train_df.describe()

In [None]:
#Checking the no. of NaN values
train_df.isna().sum()

# Data Cleaning

In [None]:
#Merging both the Dataframes into one based on the "Store" ID
df=store_df.merge(train_df,on=["Store"],how="inner")
df.head()

In [None]:
#(rowsxcolumns) of the merged DataFrame
df.shape

In [None]:
#Checking the no. of NaN values
df.isna().sum()

The columns CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear and PromoInterval have many NaN values (roughly 30% or more). The three Promo2-related columns are dropped below, and the two CompetitionOpenSince columns are filled with default values.
The column CompetitionDistance, on the other hand, has very few missing values, and these can be replaced by the mode of that column.
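
A quick way to check the share of missing values per column, and to pick the mode as a fill value, is sketched below. The toy frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "CompetitionDistance": [570.0, np.nan, 620.0, 570.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
})

# Percentage of NaN values per column
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Most frequent non-missing value, used here as the fill value
fill_value = toy["CompetitionDistance"].mode()[0]
filled = toy["CompetitionDistance"].fillna(fill_value)
print(filled)
```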

In [None]:
#Dropping columns
df=df.drop(columns=["PromoInterval","Promo2SinceWeek","Promo2SinceYear"])

In [None]:
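#Handling NaN
#Assign the result back: inplace=True on a single column may not update df in recent pandas versions
df["CompetitionDistance"]=df["CompetitionDistance"].fillna(df["CompetitionDistance"].mode()[0])
df["CompetitionOpenSinceMonth"]=df["CompetitionOpenSinceMonth"].fillna(1)
df["CompetitionOpenSinceYear"]=df["CompetitionOpenSinceYear"].fillna(df["CompetitionOpenSinceYear"].mode()[0])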
df.CompetitionOpenSinceMonth=df.CompetitionOpenSinceMonth.astype(int)
df.CompetitionOpenSinceYear=df.CompetitionOpenSinceYear.astype(int)

# Handling Outliers

In [None]:
#Find the range of data
plt.figure(figsize=(5,10))
sns.set(style="whitegrid")
sns.distplot(df["Sales"])

In [None]:
#Find the range of the data
plt.figure(figsize=(5,10))
sns.set(style="whitegrid")
sns.distplot(df["Customers"])

In [None]:
plt.figure(figsize=(10,10))
sns.set(style="whitegrid")
sns.boxenplot(data=df,scale="linear",x="DayOfWeek",y="Sales",color="orange")

In [None]:
plt.figure(figsize=(10,10))
sns.set(style="whitegrid")
sns.boxenplot(y="Customers", x="DayOfWeek",data=df, scale="linear",color="orange")

This data contains many outliers, but these might have been caused by a surge of customers during a festival or holiday, or by an effective promo.
However, I will cap Customers at 3,000 and Sales at 20,000.

In [None]:
df["Sales"]=df["Sales"].apply(lambda x: 20000 if x>20000 else x)
df["Customers"]=df["Customers"].apply(lambda y: 3000 if y>3000 else y)
print(max(df["Sales"]))
print(max(df["Customers"]))
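
The same capping can also be written with `Series.clip`, which avoids calling a Python lambda for every row. A minimal sketch on made-up sales values:

```python
import pandas as pd

# Made-up sales values, one of them above the cap
sales = pd.Series([5000, 20000, 41551])

# Values above 20000 are replaced by 20000; the rest are unchanged
capped = sales.clip(upper=20000)
print(capped.max())
```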

# Working With 'TIME'

In [None]:
df["Date"]=pd.to_datetime(df["Date"])
df["Year"]=df["Date"].dt.year
df["Month"]=df["Date"].dt.month
df["Day"]=df["Date"].dt.day
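#Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df["Week"]=df["Date"].dt.isocalendar().week.astype(int)%4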
df["Season"] = np.where(df["Month"].isin([3,4]),"Spring",np.where(df["Month"].isin([5,6,7,8]), "Summer",np.where(df["Month"].isin ([9,10,11]),"Fall",np.where(df["Month"].isin ([12,1,2]),"Winter","None"))))
df
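
Note that the `Week` feature is the ISO calendar week modulo 4, so it is a bucket from 0 to 3 rather than the week of the month. A small sketch on two made-up dates, using `Series.dt.isocalendar()`:

```python
import pandas as pd

# Two made-up dates: ISO week 2 and ISO week 31 of 2015
dates = pd.Series(pd.to_datetime(["2015-01-05", "2015-07-31"]))

iso_week = dates.dt.isocalendar().week.astype(int)
week_bucket = iso_week % 4
print(pd.DataFrame({"Date": dates, "IsoWeek": iso_week, "Week": week_bucket}))
```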

Add a feature that records the number of school holidays per week.

In [None]:
Holiday_Year_Month_Week_df=pd.DataFrame({"Holiday per week":df["SchoolHoliday"],"Week":df["Week"],"Month":df["Month"],"Year":df["Year"],"Date":df["Date"]})
Holiday_Year_Month_Week_df=Holiday_Year_Month_Week_df.drop_duplicates(subset=['Date'])
Holiday_Year_Month_Week_df=Holiday_Year_Month_Week_df.groupby(["Year","Month","Week"]).sum()
Holiday_Year_Month_Week_df

In [None]:
df=df.merge(Holiday_Year_Month_Week_df, on=["Year","Month","Week"],how="inner")

Add features that record the average number of customers per month and the average number of customers per week.

In [None]:
customer_time_df=pd.DataFrame({"Avg CustomersPerMonth":df["Customers"],"Month":df["Month"]})
AvgCustomerperMonth=customer_time_df.groupby("Month").mean()
AvgCustomerperMonth

In [None]:
customer_time_df=pd.DataFrame({"Avg CustomersPerWeek":df["Customers"],"Week":df["Week"],"Year":df["Year"],"Month":df["Month"]})
AvgCustomerperWeek=customer_time_df.groupby(["Year","Month","Week"]).mean()
AvgCustomerperWeek

In [None]:
df=df.merge(AvgCustomerperMonth,on="Month",how="inner")
df=df.merge(AvgCustomerperWeek,on=["Year","Month","Week"],how="inner")
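
The group-then-merge pattern above can also be expressed with `groupby(...).transform`, which returns the group mean already aligned with the original rows. This is only a sketch on made-up values, not the code used in this notebook:

```python
import pandas as pd

# Made-up customer counts for two months
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Customers": [100, 300, 50, 150],
})

# Each row receives the mean of its own Month group
toy["Avg CustomersPerMonth"] = toy.groupby("Month")["Customers"].transform("mean")
print(toy)
```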

Add a feature that records the number of promos per week.

In [None]:
promo_time_df=pd.DataFrame({"PromoCountperWeek":df["Promo"],"Year":df["Year"],"Month":df["Month"],"Week":df["Week"],"Date":df["Date"]})
promo_time_df=promo_time_df.drop_duplicates(subset=['Date'])
promo_time_df=promo_time_df.groupby(["Year","Month","Week"]).sum()
promo_time_df

In [None]:
df=df.merge(promo_time_df,on=["Year","Month","Week"], how="inner")

Combine 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear' into 'CompetitionOpenSince'.

In [None]:
df=df.rename(columns={'CompetitionOpenSinceYear': 'year','CompetitionOpenSinceMonth':'month'})
df['CompetitionOpenSince'] = pd.to_datetime(df[['year', 'month']].assign(DAY=1))
df=df.rename(columns={ 'year':'CompetitionOpenSinceYear','month':'CompetitionOpenSinceMonth'})

# Handling Categorical Data

The columns StoreType, Assortment and Season hold character or string values; all of these need to be converted to numerical values.

In [None]:
numerical_data_col=["Store","CompetitionDistance","Promo2","DayOfWeek","Sales","Customers","Open","SchoolHoliday","Year","Month","Day","Week"]
categorical_data_col=["StoreType","Assortment","Season"]

In [None]:
for i in categorical_data_col:
    p=0
    for j in df[i].unique():
        df[i]=np.where(df[i]==j,p,df[i])
        p=p+1

    df[i]=df[i].astype(int)
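
The loop above assigns integer codes in order of first appearance. `pd.factorize` produces the same kind of encoding in one call and also returns the original labels, which helps when decoding later. A sketch on a made-up column:

```python
import pandas as pd

# Made-up store types, in arbitrary order
store_type = pd.Series(["c", "a", "c", "d", "a"])

# codes follow the order in which each label first appears
codes, uniques = pd.factorize(store_type)
print(codes)
print(list(uniques))
```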

In [None]:
#The column StateHoliday contains 0, '0', 'a', 'b' and 'c'. This needs to be converted to a purely numerical column
df["StateHoliday"].unique()

In [None]:
#Compare as strings so that both the integer 0 and the string '0' map to 0
df["StateHoliday"]=np.where(df["StateHoliday"].astype(str) == '0' ,0,1)
df["StateHoliday"]=df["StateHoliday"].astype(int)

# EDA

## Are the promos effective?

In [None]:
plt.figure(figsize=(10,10))
sns.set(style="whitegrid",palette="pastel",color_codes=True)
sns.violinplot(x="DayOfWeek",y="Sales",hue="Promo",split=True, data=df)

Days with a promo do show a slight improvement in Sales.
The plot above also shows that no promo was offered on the 6th and 7th days of the week (Saturday and Sunday). The stores did not suffer for it either: the number of customers on weekends was higher than on weekdays.

In [None]:
plt.figure(figsize=(10,10))
sns.set(style="whitegrid",palette="pastel",color_codes=True)
sns.violinplot(x="DayOfWeek",y="Customers",hue="Promo",split=True, data=df)

## Does competition distance matter?

In [None]:
plt.figure(figsize=(15,15))
sns.set(style="whitegrid")
df["CompetitionDistanceLOG"]=np.log(df["CompetitionDistance"])
sns.lineplot(x="CompetitionDistanceLOG", y="Sales", data=df)

CompetitionDistance does seem to affect Sales. Stores with a smaller CompetitionDistance, that is, with a nearer competitor, did not make more sales.

## Is there a surge of customers during SchoolHolidays?

In [None]:
sns.set(style="whitegrid")
g=sns.relplot(y="Avg CustomersPerWeek", x="Week", hue="Holiday per week", data=df)
g.fig.set_size_inches(10,10)

There does not appear to be a big difference in the number of customers, even in weeks with 4 school holidays.

## Is there an increase in promo if it is a School Holiday?

In [None]:
sns.set(style="whitegrid")
g=sns.relplot(y="Holiday per week", x="Week", hue="PromoCountperWeek", data=df)
g.fig.set_size_inches(10,10)

The school holidays do not seem to have had any effect on promos or customers.

# Feature Engineering

## Finding location of stores

In [None]:
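#using public state holidays data from https://www.timeanddate.com/holidays/germany/2013
#StateHoliday in df was converted to 0/1 above, so the 'a' flag is read from the raw train_df
holid=train_df.loc[train_df.StateHoliday=='a']
#number of stores celebrating a holiday on each date
bydate=holid.groupby('Date')['Store'].count()
bydate.head()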

In [None]:
#Figuring out store locations based on state holidays
SN = holid.loc[holid.Date == '2013-11-20','Store'].values
print('{} stores located in Saxony.'.format(SN.shape[0]))
BW_BY_ST = holid.loc[holid.Date == '2013-01-06','Store'].values
print('{} stores located in BW, BY, ST.'.format(BW_BY_ST.shape[0]))
BW_BY_HE_NW_RP_SL = holid.loc[holid.Date == '2013-05-30','Store'].values
print('{} stores located in BW, BY, HE, NW, RP, SL.'.format(BW_BY_HE_NW_RP_SL.shape[0]))
BY_SL = holid.loc[holid.Date =='2013-08-15','Store'].values
print('{} stores located in BY,SL.'.format(BY_SL.shape[0]))
BB_MV_SN_ST_TH = holid.loc[holid.Date =='2013-10-31','Store'].values
print('{} stores located in BB, MV, SN, ST, TH.'.format(BB_MV_SN_ST_TH.shape[0]))
BW_BY_NW_RP_SL = holid.loc[holid.Date =='2013-11-01','Store'].values
print('{} stores located in BW, BY, NW, RP, SL.'.format(BW_BY_NW_RP_SL.shape[0]))
BW_BY = np.intersect1d(BW_BY_ST, BW_BY_HE_NW_RP_SL)
print('{} stores located in BW, BY.'.format(BW_BY.shape[0]))

In [None]:
ST = np.setxor1d(BW_BY_ST, BW_BY)
print('{} stores located in ST.'.format(ST.shape[0]))
BY = np.intersect1d(BW_BY, BY_SL)
print('{} stores located in BY.'.format(BY.shape[0]))
SL = np.setxor1d(BY, BY_SL)
print('{} stores located in SL.'.format(SL.shape[0]))
BW = np.setxor1d(BW_BY, BY)
print('{} stores located in BW.'.format(BW.shape[0]))
HE = np.setxor1d(BW_BY_HE_NW_RP_SL,BW_BY_NW_RP_SL)
print('{} stores located in HE.'.format(HE.shape[0]))
BB_MV_TH = np.setxor1d(np.setxor1d(BB_MV_SN_ST_TH,SN),ST)
print('{} stores located in BB, MV, TH.'.format(BB_MV_TH.shape[0]))
NW_RP = np.setxor1d(BW_BY_NW_RP_SL,BW_BY) # SL has 0 stores
print('{} stores located in NW, RP.'.format(NW_RP.shape[0]))
allstores = np.unique(df.Store.values)
BE_HB_HH_NI_SH = np.setxor1d(np.setxor1d(allstores,BW_BY_HE_NW_RP_SL),BB_MV_SN_ST_TH)
print('{} stores located in BE, HB, HH, NI, SH.'.format(BE_HB_HH_NI_SH.shape[0]))
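
The cells above use `np.setxor1d` to remove one group of stores from another. The symmetric difference equals the plain set difference only when the second set is fully contained in the first; `np.setdiff1d` does not rely on that condition. A sketch with hypothetical store ids:

```python
import numpy as np

# Hypothetical store ids, for illustration only
group = np.array([1, 2, 3, 4])
subset = np.array([1, 2])        # fully contained in group
overlap = np.array([2, 9])       # only partly contained in group

# Subset case: both functions agree
xor_subset = np.setxor1d(group, subset)
diff_subset = np.setdiff1d(group, subset)

# Partial overlap: setxor1d also keeps the id that is only in the second array
xor_overlap = np.setxor1d(group, overlap)
diff_overlap = np.setdiff1d(group, overlap)

print(xor_subset, diff_subset)
print(xor_overlap, diff_overlap)
```

If a holiday date is missing for some stores, the subset condition may not hold, so checking group sizes after each step is a cheap safeguard.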

In [None]:
#using public school holidays data from http://www.holidays-info.com/School-Holidays-Germany/2015/school-holidays_2015.html.
#further division based on school holidays
df.loc[df.Store.isin(NW_RP)].groupby('Date')['SchoolHoliday'].sum().value_counts()
RP = df.loc[df.Date=='2015-03-26'].loc[df.Store.isin(NW_RP)].loc[df.SchoolHoliday==1,'Store'].values
NW = np.setxor1d(NW_RP,RP)
print('{} stores located in RP.'.format(RP.shape[0]))
print('{} stores located in NW.'.format(NW.shape[0]))
df.loc[df.Store.isin(BB_MV_TH)].groupby('Date')['SchoolHoliday'].sum().value_counts()
TH = BB_MV_TH
print('{} stores located in TH.'.format(TH.shape[0]))
HH = df.loc[df.Date=='2015-03-02'].loc[df.Store.isin(BE_HB_HH_NI_SH)].loc[df.SchoolHoliday==1,'Store'].values
print('{} stores located in HH.'.format(HH.shape[0]))
BE_HB_NI_SH = np.setxor1d(BE_HB_HH_NI_SH,HH)
SH = df.loc[df.Date=='2015-04-17'].loc[df.Store.isin(BE_HB_NI_SH)].loc[df.SchoolHoliday==1,'Store'].values
print('{} stores located in SH.'.format(SH.shape[0]))
BE_HB_NI = np.setxor1d(BE_HB_NI_SH,SH)
BE = df.loc[df.Date=='2015-03-25'].loc[df.Store.isin(BE_HB_NI)].loc[df.SchoolHoliday==0,'Store'].values
print('{} stores located in BE.'.format(BE.shape[0]))
HB_NI = np.setxor1d(BE_HB_NI,BE)

In [None]:
states = pd.Series('',index = allstores,name='State')
states.loc[BW] = 'BW'
states.loc[BY] = 'BY'
states.loc[BE] = 'BE'
states.loc[HB_NI] = 'HB,NI'
states.loc[HH] = 'HH'
states.loc[HE] = 'HE'
states.loc[NW] = 'NW'
states.loc[RP] = 'RP'
states.loc[SN] = 'SN'
states.loc[ST] = 'ST'
states.loc[SH] = 'SH'
states.loc[TH] = 'TH'
states[states!=''].value_counts().sum()
states.to_csv('location.csv', header=True, index_label='Store')

In [None]:
location_df=pd.read_csv("./location.csv",index_col="Store")
location_df.head()

In [None]:
df=df.merge(location_df,on='Store',how="inner")

## Adding weather data

In [None]:
weather_df=pd.read_csv("../input/rossmann-stores-weather-dataset/weather.csv")

In [None]:
weather_df.head()

In [None]:
weather_df.describe()

In [None]:
weather_df.isna().sum()

In [None]:
weather_df=weather_df.bfill()
weather_df=weather_df.ffill()

In [None]:
weather_df.Date=pd.to_datetime(weather_df.Date)

In [None]:
#encoding the values
weather={'Fog':'1','Rain':'2','Hail':'3','Thunderstorm':'4','Snow':'5'}
#map each distinct event string, e.g. 'Rain-Snow', to its digit code, e.g. '25'
encoding=dict()
for t in weather_df['Events'].unique():
    encoding[t]=''.join(weather[i] for i in t.split('-'))
#print(encoding)

weather_df["Events"]=weather_df["Events"].map(encoding)

weather_df["Events"].unique()

In [None]:
#save the cleaned weather data (weather_df, not the sales DataFrame df)
weather_df.to_csv('cleaned_weather.csv')

In [None]:
df= df.merge(weather_df, how='inner', left_on=["Date", "State"], right_on=["Date","State"])

In [None]:
df.to_csv('final_RossmannSales.csv') 

# Final Check

In [None]:
df.head()

In [None]:
#Find Correlation between the data columns
plt.figure(figsize=(15,15))
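#numeric_only=True skips the date and string columns, which would otherwise raise an error in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True))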

The heatmap is consistent with the observations above: there is very little correlation between SchoolHoliday, Customers and Promo, but there is a strong correlation between Promo and Sales.