# Exploratory data analysis

This notebook is part of a series of notebooks analyzing the Rossmann store data set:

 1. [Deep learning with fast.ai v1 simple version](https://www.kaggle.com/omgrodas/rossmann-deep-learning-with-fast-ai-v1-simplen)
 2. [Exploratory data analysis](https://www.kaggle.com/omgrodas/rossmann-exploratory-data-analysis) (this one)
 3. [Data engineering](https://www.kaggle.com/omgrodas/rossmann-data-engineering)
 4. [Deep Learning with fast.ai](https://www.kaggle.com/omgrodas/rossmann-deep-learning-with-fast-ai-v1)
 5. Hyperparameter search with hyperopt
 
These notebooks are based on the notebook used in lesson 3 of the fast.ai deep learning for coders course.

https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb

Ideas for extra features are taken from:

https://www.kaggle.com/c/rossmann-store-sales/discussion/17896

# Setup environment

In [None]:
from pathlib import Path

import pandas as pd
import plotly
import plotly.graph_objs as go
import cufflinks as cf

pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 500)

In [None]:
plotly.offline.init_notebook_mode()
cf.go_offline()

%matplotlib inline
%reload_ext autoreload
%autoreload 2

# Load data




In [None]:
path=Path("../input/rossmann-data-engineering/")
testdf=pd.read_feather(path/"test.feather")
traindf=pd.read_feather(path/"train.feather")
traindf.set_index("Date",inplace=True,drop=False)
testdf.set_index("Date",inplace=True,drop=False)

In [None]:
!ls ../input/

# Looking into missing data

There is some missing data in the middle of 2014.
The test data has only 856 stores, compared to 1115 in the training data.
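
The gap can also be checked numerically by counting how many distinct stores report on each date. The following is a minimal sketch on a tiny synthetic frame; the `toy` frame and its values are invented for illustration, and only the `Date` and `Store` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: store 2 has no record on 2014-07-02
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-07-01", "2014-07-01", "2014-07-02", "2014-07-03", "2014-07-03"]
    ),
    "Store": [1, 2, 1, 1, 2],
})

# Number of distinct stores reporting on each date
stores_per_day = toy.groupby("Date")["Store"].nunique()

# Dates on which fewer stores report than on the most complete day
incomplete_days = stores_per_day[stores_per_day < stores_per_day.max()]
print(incomplete_days)
```

Applied to `traindf`, the same pattern should list the dates in 2014 on which some stores have no record.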

In [None]:
#DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
data=pd.concat([traindf,testdf],sort=False)
data.groupby(data.index).size().iplot(kind="bar")

In [None]:
#Size of training data
traindf.shape

In [None]:
#Size of test data
testdf.shape

In [None]:
#Start and end date training data
traindf.index.min(),traindf.index.max()

In [None]:
#number of days with training data
traindf.index.max()-traindf.index.min()

In [None]:
#Number of stores
traindf.Store.nunique()

In [None]:
#Start and end date test data
testdf.Date.min(),testdf.Date.max()

In [None]:
#number of days with test data
testdf.Date.max()-testdf.Date.min()

In [None]:
#Number of stores in test dataset
testdf.Store.nunique()

In [None]:
#Distribution of records per store in training dataset
# 934 stores have 942 records
# 180 stores have 758 records
# 1 store has 941 records
traindf.groupby("Store").size().value_counts()
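
The output of `groupby("Store").size().value_counts()` is easy to read the wrong way round: the index holds the number of records per store, and the value holds the number of stores with that many records. A small sketch on invented data (the `toy` frame is hypothetical) makes the orientation explicit.

```python
import pandas as pd

# Hypothetical toy data: stores 1 and 2 have three records, store 3 has two
toy = pd.DataFrame({"Store": [1, 1, 1, 2, 2, 2, 3, 3]})

# Number of records for each store
records_per_store = toy.groupby("Store").size()

# Index: records per store. Value: number of stores with that many records.
stores_per_count = records_per_store.value_counts()
print(stores_per_count)
```

Here two stores have three records and one store has two records.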

In [None]:
#Histogram of above
traindf.groupby("Store").size().iplot(kind="histogram")

In [None]:
#Distribution of records per store in test dataset
# 856 stores with 48 records each
testdf.groupby("Store").size().value_counts()

In [None]:
#Number of stores in test set but not in training data
trains=pd.Series(traindf.Store.unique(),name="train")
tests=pd.Series(testdf.Store.unique(),name="test")
len(tests[~tests.isin(trains.values)])

In [None]:
#List the ten stores with the fewest records (candidates for missing data)
traindf.groupby("Store").size().sort_values()[:10]

In [None]:
#Plot one of the stores with missing data
#No spike when the data returns, so this seems to be missing data rather than a closed store reopening.
#When a store opens after being closed there is normally a big spike in sales.
traindf[traindf.Store==710]["Sales"].iplot(kind="bar",rangeslider=True)
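
The missing period for a single store can also be located without a plot by comparing the store's dates against a complete daily calendar. This is a minimal sketch on an invented series; the dates and sales values are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical daily sales for one store, with a three-day hole
dates = pd.to_datetime(["2014-06-29", "2014-06-30", "2014-07-04", "2014-07-05"])
sales = pd.Series([5200, 4800, 5100, 5300], index=dates, name="Sales")

# Every calendar day between the first and the last record
full_range = pd.date_range(sales.index.min(), sales.index.max(), freq="D")

# Days in the calendar that have no record at all
missing_days = full_range.difference(sales.index)
print(missing_days)
```

For a real store the series would be something like `traindf[traindf.Store==710]["Sales"]`, assuming the `Date` index set earlier in this notebook.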

# Explore sales

## Store 1

Plotting the sales of store 1 over time to get a feel for the dataset. (Any of the 1115 stores could have been selected.)

Notice:
* Christmas in December each year
* Some kind of repeating pattern

This first plot analyzes the sales of one store over time. From it we might see some extra sales in the period before Christmas. The zoom and rangeslider in plotly are very helpful.



In [None]:
traindf[traindf.Store==1]["Sales"].iplot(kind="bar",rangeslider=True)

## Average daily sales

Notice:
* The repeating pattern is clearer in the averaged data. There is some kind of bi-weekly cycle.

In [None]:
traindf["Sales"].groupby("Date").agg(["mean"]).iplot(kind="bar",rangeslider=True)

Plotting the daily average for each year side by side.

Notice:
* The bi-weekly pattern does not align on day of year (`Dayofyear`).

In [None]:
traindf.groupby(["Year","Dayofyear"])["Sales"].agg("mean").to_frame().reset_index().pivot_table(values="Sales",index="Dayofyear",columns="Year").iplot(kind="bar",rangeslider=True)
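
One likely reason the pattern does not line up on day of year is that a non-leap year is 52 weeks plus one day, so the same day of year falls one weekday later in the following year. The sketch below checks this for day 7 of the year, assuming the training data covers 2013 to 2015 as in the Rossmann competition data (the notebook does not state the years explicitly).

```python
import pandas as pd

# Weekday on which day 7 of the year falls in each assumed training year
weekdays = {
    year: pd.Timestamp(year=year, month=1, day=7).day_name()
    for year in (2013, 2014, 2015)
}
print(weekdays)
# → {2013: 'Monday', 2014: 'Tuesday', 2015: 'Wednesday'}
```

Aligning years on week number and day of week, rather than on day of year, keeps weekdays in the same position across years.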

## Average weekly sales

Notice:
* Bi-weekly pattern
* Clearly extra sales in December around Christmas.    

In [None]:
traindf["Sales"].resample("W").agg(["mean"]).iplot(kind="bar")

Plotting the weekly average for each year side by side.

In [None]:
traindf.groupby(["Year","Week"])["Sales"].agg("mean").to_frame().reset_index().pivot_table(values="Sales",index="Week",columns="Year").iplot(kind="bar",rangeslider=True)

## Average monthly sales

Notice:
* Maybe a small linear increase in sales
* December extra high

In [None]:
traindf["Sales"].resample("M").agg(["mean"]).iplot(kind="bar")

## Average sales by day of week

Notice:
* Linear trend: more sales on Mondays, fewer on Fridays
* Almost no sales on Sundays

In [None]:
#Almost no sales on Sundays. Important feature
traindf.groupby("Dayofweek")["Sales"].agg(["mean"]).iplot(kind="bar")

# Closed stores

In [None]:
#List the stores with the largest number of closed days
traindf[traindf.Open==False].groupby("Store").size().sort_values(ascending=False)[:20]

In [None]:
#Plotting one of the stores with lots of closed days.
#When a closed store opens there is normally a spike in sales
#There is also sometimes a spike before a store closes
traindf[traindf.Store==103]["Sales"].iplot(kind="bar",rangeslider=True)
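
To inspect reopening spikes systematically, the days on which a store opens after a run of closed days can be flagged. This is only a sketch on invented data: the `toy` frame, its values and the `min_closed` threshold are hypothetical, and only the `Open` and `Sales` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data for one store: closed for three days in the middle
toy = pd.DataFrame(
    {"Open": [1, 1, 0, 0, 0, 1, 1], "Sales": [5000, 5200, 0, 0, 0, 9000, 5100]},
    index=pd.date_range("2014-03-01", periods=7, freq="D"),
)

is_open = toy["Open"].astype(bool)

# Running length of the current closed streak (resets on every open day)
closed_run = (~is_open).astype(int).groupby(is_open.cumsum()).cumsum()

# Length of the closed streak that ended on the previous day
closed_before = closed_run.shift(1, fill_value=0)

# Reopening day: open today after at least `min_closed` closed days
min_closed = 2  # illustrative threshold, not taken from the notebook
reopened = is_open & (closed_before >= min_closed)
print(toy[reopened])
```

On the toy data only 2014-03-06 is flagged, the first open day after the three-day closure.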

# Autocorrelation

In [None]:
from pandas.plotting import autocorrelation_plot

In [None]:
data=traindf.groupby("Day")["Sales"].agg(["mean"])
autocorrelation_plot(data)
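
Besides the plot, the autocorrelation at a specific lag can be read off with `Series.autocorr`, which is handy for checking a suspected bi-weekly cycle. The sketch below uses an invented series with an exact 14-day period, not the Rossmann data, so the values only demonstrate the method.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with an exact 14-day cycle
days = pd.date_range("2014-01-01", periods=140, freq="D")
daily_mean = pd.Series(np.sin(2 * np.pi * np.arange(140) / 14), index=days)

# Autocorrelation at half the period and at the full period
acf = {lag: daily_mean.autocorr(lag=lag) for lag in (7, 14)}
print(acf)
```

A value close to 1 at lag 14 together with a clearly lower value at lag 7 is what a bi-weekly cycle would look like. On real data the series could be the daily mean of `Sales` indexed by date.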


# Correlation Heatmap

In [None]:
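#Work on a copy so the datetime columns in traindf keep their dtype
corrdf=traindf.copy()
datecols=corrdf.select_dtypes(include="datetime").columns.tolist()
corrdf[datecols]=corrdf[datecols].astype("int64")
#Restrict to numeric and boolean columns; recent pandas versions raise an error on non-numeric columns in corr()
corr=corrdf.select_dtypes(include=["number","bool"]).corr()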

In [None]:
corr.iplot(kind="heatmap",colorscale='spectral')

In [None]:
gcorr=corr.abs()["Sales"].sort_values(ascending=False)
gcorr

I used this list of features, sorted by absolute correlation with `Sales`, to select features for machine learning.

In [None]:
gcorr.index.tolist()
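
One simple way to turn such a sorted list into a feature selection is to drop the target and keep everything above a correlation cut-off. This is a sketch under stated assumptions: the feature names, correlation values and `threshold` below are invented placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical absolute correlations with the target (placeholder values)
gcorr_toy = pd.Series(
    {"Sales": 1.0, "feature_a": 0.8, "feature_b": 0.3, "feature_c": 0.02}
)

threshold = 0.1  # illustrative cut-off, not taken from the notebook

# Drop the target itself, then keep features at or above the cut-off
candidates = gcorr_toy.drop("Sales")
selected = candidates[candidates >= threshold].index.tolist()
print(selected)
```

Correlation only captures linear relationships, so a feature with a low value here can still be useful to a non-linear model.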