# Class 15 - Starter Code

Exploring Rossmann Drug Store Sales Data

In [None]:
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", font_scale=1.5)
%matplotlib inline

### Load Dataset and Pre-Process

We will use the [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales) dataset for this exercise.  

Data Dictionary from Kaggle  

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

* Id - an Id that represents a (Store, Date) duple within the test set
* **Store** - a unique Id for each store
* **Sales** - the turnover for any given day (this is what you are predicting)
* **Customers** - the number of customers on a given day
* **Open** - an indicator for whether the store was open: 0 = closed, 1 = open
* **StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* **SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools
* StoreType - differentiates between 4 different store models: a, b, c, d
* Assortment - describes an assortment level: a = basic, b = extra, c = extended
* CompetitionDistance - distance in meters to the nearest competitor store
* CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* **Promo** - indicates whether a store is running a promo on that day
* Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store


We will use a subset of these features for this exercise.

In [None]:
# Load data
data = pd.read_csv('../../assets/dataset/rossmann.csv', skipinitialspace=True, low_memory=False)

# Check info
data.info()

Because we are most interested in the `Date` column, which contains the date of sales for each store, we will parse it as a datetime type and make it the index of our dataframe.

In [None]:
# Convert to datetime object
data['Date'] = pd.to_datetime(data['Date'])

# Use `Date` as index
data.set_index('Date', inplace=True)

# Add extra columns to break out month and year
data['Year'] = data.index.year
data['Month'] = data.index.month

Using the date as the index lets us filter rows by year, month, or exact date with a partial date string.

In [None]:
# Filter for a particular year (use .loc to select rows by date string)
data.loc['2014'].head()

# Or month
data.loc['2014-05'].head()

# Or date
data.loc['2014-06-22'].head()
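
As a minimal sketch of the same date-string filtering, the cell below builds a small synthetic frame (the `toy` name, the two stores, and the sales values are made up for illustration and are not part of the Rossmann data) and selects rows by year, month, and day. The index is sorted first, which is an assumption made here to keep the lookups simple.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Rossmann data: two stores, daily rows for 2013-2014
dates = pd.date_range('2013-01-01', '2014-12-31', freq='D')
toy = pd.DataFrame({
    'Date': dates.append(dates).strftime('%Y-%m-%d'),  # dates stored as strings, as in a CSV
    'Store': np.repeat([1, 2], len(dates)),
    'Sales': np.arange(2 * len(dates)),
})

# Same pre-processing as above: parse the strings, then index by date
toy['Date'] = pd.to_datetime(toy['Date'])
toy = toy.set_index('Date').sort_index()

year_rows = toy.loc['2014']        # every row dated in 2014
month_rows = toy.loc['2014-05']    # every row dated in May 2014
day_rows = toy.loc['2014-06-22']   # every row dated 22 June 2014

print(len(year_rows), len(month_rows), len(day_rows))
```

Because several stores share each date, a single day still returns one row per store.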

There are over a million sales data points in this dataset, so for some analysis we will focus on just one store.

In [None]:
# Filter for a single store
store1_data = data[data['Store'] == 1]

# Check sample
store1_data.sample(5)

# Part 1: Data Exploration and Mining

To compare sales on holidays against all other days, we can use box plots, which show the full distribution of sales for each group. On state holidays the store is closed (and, as a nice sanity check, there are 0 sales), while on school holidays sales are similar to those on other days.

In [None]:
# Plot sales vs school holiday
# (catplot is the current name for factorplot, renamed in seaborn 0.9)
sns.catplot(
    x='SchoolHoliday',
    y='Sales',
    data=store1_data,
    kind='box')

**Check**: See if there is a difference affecting sales on promotion days.

In [None]:
# Plot sales vs promotion days
### FILL IN ###

Next, compare sales across days of the week, split by whether the store was open.

In [None]:
# Plot sales vs day of week
# (catplot is the current name for factorplot, renamed in seaborn 0.9)
sns.catplot(
    col='Open',
    x='DayOfWeek',
    y='Sales',
    data=store1_data,
    kind='box')

Lastly, we want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?

In [None]:
# Filter to days store 1 was open
store1_open_data = store1_data[store1_data.Open==1]

# Plot sales over time
store1_open_data['Sales'].plot()

In [None]:
# Plot customer visits over time
store1_open_data['Customers'].plot()

**Check**: Use the index filtering to filter to just 2014.  Zoom in on changes over time. Is it easier to identify the holiday sales bump?

In [None]:
# Plot sales over time for 2014
### FILL IN ###

# Part 2: Data Refining Using Time Series Statistics

**Warning**: Prior to version 0.18.0, `pd.rolling_*`, `pd.expanding_*`, and `pd.ewm*` were module-level functions; they are now deprecated. They are replaced by the `Rolling`, `Expanding`, and `EWM` objects (created with `.rolling()`, `.expanding()`, and `.ewm()`) and a corresponding method call.
The deprecation warning shows the new syntax; see an example [here](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0180-window-deprecations).

### Autocorrelation

To measure how strongly sales on one day are correlated with sales on earlier days, we compute the autocorrelation of the 'Sales' column. In pandas, we do this with the `autocorr` method.  
`autocorr` takes one argument, `lag`, which is how many rows the series is shifted back before it is correlated with itself. If we set the lag to 1, we compute the correlation between every point and the point directly preceding it; setting the lag to 10 computes the correlation between every point and the point 10 days earlier (assuming one row per day).

In [None]:
# Daily correlation
print(store1_data['Sales'].resample('D').mean().autocorr(lag=1))

# Weekly correlation
print(store1_data['Sales'].resample('D').mean().autocorr(lag=7))
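
To make the role of `lag` concrete, here is a small sketch on a synthetic daily series with a built-in weekly cycle. The `weekly_pattern` values and the noise level are assumptions chosen for illustration, not figures from the Rossmann data. It also checks that `autocorr(lag=k)` matches the correlation between the series and a copy of itself shifted by `k` rows.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a strong weekly cycle (not the Rossmann data)
idx = pd.date_range('2014-01-06', periods=140, freq='D')
weekly_pattern = np.array([5.0, 4.0, 4.0, 4.0, 6.0, 8.0, 0.0])  # assumed levels for 7 weekdays
rng = np.random.default_rng(0)
sales = pd.Series(np.tile(weekly_pattern, 20) + rng.normal(0, 0.1, size=140), index=idx)

lag1 = sales.autocorr(lag=1)   # each day vs the day before
lag7 = sales.autocorr(lag=7)   # each day vs the same weekday one week earlier

# autocorr(lag=k) is the Pearson correlation of the series with itself shifted by k rows
manual_lag7 = sales.corr(sales.shift(7))

print(lag1, lag7, manual_lag7)
```

With a weekly cycle like this one, the lag-7 correlation is much higher than the lag-1 correlation, which is the pattern to look for in the store data.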

### Rolling Averages

If we want to investigate trends over time in sales, as always, we will start by computing simple aggregates.  We want to know what the mean and median sales were for each month and year.

In Pandas, this is performed using the `resample` command, which is very similar to the `groupby` command. It allows us to group over different time intervals.

We can call `resample` on the data and provide:
- The level to roll up to: 'D' for day, 'W' for week, 'M' for month, 'A' for year
- The aggregation to perform, as a method call: `.mean()`, `.median()`, `.sum()`, etc.
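
The similarity between `resample` and `groupby` can be sketched on a tiny synthetic frame. Everything here (the `toy` frame, its two rows per day, and the values) is hypothetical; it only illustrates that a daily `resample` produces the same means as grouping on the date index, and that changing the level to `'W'` rolls the same data up by week.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two observations per day for 14 days (e.g. two stores)
idx = pd.date_range('2014-01-01', periods=14, freq='D')
toy = pd.DataFrame({'Sales': np.arange(28, dtype=float)}, index=idx.append(idx)).sort_index()

# Roll up to one mean per day with resample ...
daily_mean = toy['Sales'].resample('D').mean()

# ... which gives the same values as grouping on the date index
daily_mean_groupby = toy['Sales'].groupby(level=0).mean()

# Changing the level to 'W' rolls up to weeks (weeks end on Sunday by default)
weekly_sum = toy['Sales'].resample('W').sum()

print(daily_mean.head())
print(weekly_sum)
```

Unlike `groupby`, `resample` also fills in any missing periods in the date range, so gaps in the data show up as `NaN` means rather than disappearing.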

In [None]:
store1_data.loc['2014'][['Sales']].resample('M').mean().add_suffix('_Mean').head()

In [None]:
store1_data.loc['2014'][['Sales']].resample('M').median().add_suffix('_Median').head()

In [None]:
store1_data.loc['2014']['Sales'].resample('D').mean().plot()

While identifying the monthly averages is useful, we often want to compare the sales on a given date to a smaller window. To understand holiday sales, we don't want to compare late December with the entire month, but perhaps with the few days surrounding it. We can do this using rolling averages.

In pandas, we can compute rolling averages with the `rolling` method followed by an aggregation such as `.mean()` or `.median()` (these replace the older `pd.rolling_mean` and `pd.rolling_median` functions).

`rolling` takes these important parameters:
- `window` is the number of observations (here, days) to include in the average
- `center` is whether the window should be centered on the date or use only data up to that date
- `freq` is the level to roll the data up to before averaging (as used in `resample`): `D` for day, `M` for month, `A` for year, etc. It has been deprecated since pandas 0.18 and is removed in later releases, so resampling the series first is the safer choice.

Instead of plotting the full time series, we can plot the rolling mean, which smooths random changes in sales and dampens the effect of outliers, helping us identify larger trends.

In [None]:
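# Resample to daily values first, then take a centered 3-day rolling mean
store1_data.loc['2014']['Sales'].resample('D').mean().rolling(window=3, center=True).mean().plot()

In [None]:
# Roll over the full series, then filter to 2014 so the window can use neighbouring days
store1_data['Sales'].resample('D').mean().rolling(window=3, center=True).mean().loc['2014'].plot()

In [None]:
store1_data['Sales'].resample('D').mean().rolling(window=7, center=True).mean().loc['2014'].plot()

In [None]:
store1_data['Sales'].resample('D').mean().rolling(window=30, center=True).mean().loc['2014'].plot()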
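
The effect of `center` is easiest to see on a handful of numbers. The sketch below uses a made-up week of sales with a single spike (the values are hypothetical) and compares a trailing 3-day window with a centered one.

```python
import pandas as pd

# Hypothetical week of daily sales with one spike on the fourth day
idx = pd.date_range('2014-12-22', periods=7, freq='D')
sales = pd.Series([10.0, 10.0, 10.0, 40.0, 10.0, 10.0, 10.0], index=idx)

# Trailing window: each day plus the two days before it
trailing = sales.rolling(window=3).mean()

# Centered window: each day plus one day on either side
centered = sales.rolling(window=3, center=True).mean()

print(pd.DataFrame({'sales': sales, 'trailing': trailing, 'centered': centered}))
```

Both versions flatten the spike to the same height, but the trailing window spreads it over the spike day and the two days after it, while the centered window spreads it symmetrically around the spike. Each leaves `NaN` where the window is incomplete: at the start for the trailing window, and at both ends for the centered one.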

As we discussed earlier, this averages all values in the window evenly.  However we may want to weight closer values more.

For example, for a centered weighted average of 10 days, we want to put emphasis on +/- 1 day versus +/- 5 days.

One option is the `ewm` method, which computes an exponentially weighted moving average: the most recent observations get the largest weights, and the weights decay exponentially for older ones.

In [None]:
# Resample to daily values (oldest first), then take the exponentially weighted mean
store1_data['Sales'].resample('D').mean().ewm(span=30).mean().loc['2014'].plot()
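
To show how the exponential weighting works, this sketch reproduces `ewm(...).mean()` by hand on four made-up values. It passes `adjust=False`, which is an assumption made here because it gives the simple textbook recursion; the pandas default (`adjust=True`, as used in the cell above) normalizes the weights differently for the earliest points.

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0])  # hypothetical values, oldest first
span = 3
alpha = 2.0 / (span + 1)  # smoothing factor implied by `span`

# adjust=False uses the recursion y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
ewm_mean = sales.ewm(span=span, adjust=False).mean()

# The same recursion written out by hand
manual = [sales.iloc[0]]
for x in sales.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(ewm_mean.tolist())
print(manual)
```

A larger `span` means a smaller `alpha`, so each new observation moves the average less and the curve is smoother.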

### Diff
Pandas `rolling().mean()` and `rolling().median()` are only two examples of Pandas window function capabilities. Window functions operate on a set of N consecutive rows (a window) and produce an output.

In addition to `rolling().mean()` and `rolling().median()`, there are `rolling().sum()`, `rolling().min()`, `rolling().max()`... and many more.

Another common operation is `diff`, which takes the difference over time. The `diff` method takes one argument, `periods`, which is how many rows back to look when computing the difference.


In [None]:
# Difference in sales, day by day
store1_data[['Sales']].resample('D').mean().diff(periods=1).head(10)

In [None]:
# Difference in sales, each day to the same day in the previous week
store1_data[['Sales']].resample('D').mean().diff(periods=7).head(10)
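
As a small illustration of what `periods` does, the sketch below applies `diff` to ten made-up daily values (hypothetical, not taken from the dataset) and confirms that `diff(periods=k)` is the same as subtracting the series shifted by `k` rows.

```python
import pandas as pd

# Hypothetical daily sales for ten days
idx = pd.date_range('2014-01-01', periods=10, freq='D')
sales = pd.Series([5.0, 7.0, 6.0, 9.0, 4.0, 0.0, 8.0, 6.0, 7.0, 3.0], index=idx)

day_over_day = sales.diff(periods=1)     # today minus yesterday
week_over_week = sales.diff(periods=7)   # today minus the same weekday last week

# diff(periods=k) is equivalent to subtracting the series shifted by k rows
same_as_shift = sales - sales.shift(7)

# Date of the largest day-over-day drop
largest_drop_date = day_over_day.idxmin()

print(week_over_week.dropna())
print(largest_drop_date)
```

The first `periods` rows are always `NaN`, because there is no earlier row to subtract.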

### Expanding functions

In addition to the set of `rolling()` functions, Pandas also provides a similar collection of `expanding()` functions, which, instead of using a window of N values, use all values up until that time.

In [None]:
# Mean of all previous values at each point
store1_data['Sales'].resample('D').mean().expanding(min_periods=1).mean().plot()
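
The "all values up until that time" behaviour can be checked on four made-up numbers. In this sketch (values are hypothetical), the expanding sum matches a cumulative sum, and the expanding mean is that running total divided by the number of rows seen so far.

```python
import pandas as pd

# Hypothetical daily sales for four days
sales = pd.Series([4.0, 8.0, 6.0, 2.0],
                  index=pd.date_range('2014-01-01', periods=4, freq='D'))

running_mean = sales.expanding(min_periods=1).mean()   # mean of all rows so far
running_total = sales.expanding(min_periods=1).sum()   # sum of all rows so far

print(pd.DataFrame({'sales': sales, 'running_mean': running_mean, 'running_total': running_total}))
```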

# Part 3: Exercises

### 3.1 Plot the distribution of sales by month and compare the effect of promotions

In [None]:
# Plot sales vs promo
### FILL IN ###

### 3.2 Are sales more correlated with the prior date, a similar date last year, or a similar date last month?

In [None]:
# Get mean daily sales
average_daily_sales = ### FILL IN ###

print('Correlation with last day: {}'.format(average_daily_sales['Sales'].### FILL IN ###))
print('Correlation with last month: {}'.format(average_daily_sales['Sales'].### FILL IN ###))
print('Correlation with last year: {}'.format(average_daily_sales['Sales'].### FILL IN ###))

### 3.3 Plot the 15 day rolling mean of customers

In [None]:
# Get mean daily customers across all stores
average_daily_sales = ### FILL IN ###

# Get 15 day rolling mean
average_daily_sales### FILL IN ###

### 3.4 Identify the date with largest drop in sales from the same date in the previous week

In [None]:
# Get average daily sales difference with previous week
average_daily_sales = ### FILL IN ###
average_daily_sales['DiffVsLastWeek'] = ### FILL IN ###

# Get top 5 sorted rows
print(average_daily_sales.sort_values(by='DiffVsLastWeek').head())

# Get top 5 sorted rows for open days
print(average_daily_sales[average_daily_sales.Open == 1].sort_values(by='DiffVsLastWeek').head())

### 3.5 Compute the total sales up until Dec. 2014

In [None]:
# Get total daily sales across all stores
total_daily_sales = data[['Sales']].resample('D').sum()

# Get total sales up until Dec 2014
total_daily_sales.expanding(min_periods=1).sum()['2014-12'].head()

### 3.6 When were the largest differences between 15-day moving/rolling averages?

Hint: Use `rolling().mean()` and `diff()`

In [None]:
# Get mean daily sales across all stores
average_daily_sales = ### FILL IN ###

# Get largest 15-day rolling average difference
average_daily_sales.### FILL IN ###