In [None]:
import pandas as pd
import cufflinks as cf

# Go offline
cf.go_offline()

In [None]:
data = pd.read_csv('../assets/dataset/rossmann.csv', skipinitialspace=True)

data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

data['Year'] = data.index.year
data['Month'] = data.index.month

store1_data = data[data.Store == 1]

data.head()

In [None]:
data.index.year

In [None]:
import seaborn as sb
%matplotlib inline

In [None]:
data[['Sales']].resample('M').mean().sort_values(by='Sales')

In [None]:
store1_data.index

We want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?

In [None]:
# Filter to days store 1 was open
store1_open_data = store1_data[store1_data.Open==1]
store1_open_data[['Sales']].iplot()

In [None]:
# To plot the customer visits over time:
store1_open_data[['Customers']].iplot()

We can see large spikes in sales and customers towards the end of 2013 and 2014, leading into the first quarters of 2014 and 2015.


In [None]:
# Select all rows from 2015; recent pandas versions require .loc for date-string row selection
store1_data_2015 = store1_data.loc['2015']
store1_data_2015[
    store1_data_2015.Open==1
][['Sales']].iplot()

In [None]:
x='Month'
y='Sales'
store1_data[[x, y]].set_index(x, append=True)[y].unstack().iplot(kind='box', boxpoints="suspectedoutliers")

In [None]:
store1_data[[x, y]].set_index(x, append=True)[y].unstack().head()

In [None]:
df1 = store1_data[store1_data['Open']==1]
promo = df1[df1['Promo']==1]
no_promo = df1[df1['Promo']==0]
x='Month'
y='Sales'
promo[[x, y]].set_index(x, append=True)[y].unstack().iplot(kind='box', boxpoints="suspectedoutliers")
no_promo[[x, y]].set_index(x, append=True)[y].unstack().iplot(kind='box', boxpoints="suspectedoutliers")


In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sb.catplot(
    col='Open',
    hue='Promo',
    x='Month',
    y='Sales',
    data=store1_data, 
    kind='box'
)

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sb.catplot(
    col='Open',
    x='DayOfWeek',
    y='Customers',
    data=store1_data,
    kind='box')

### Slide 46:

## Computing Autocorrelation

To measure how much the sales are correlated with each other, we want to compute the _autocorrelation_ of the 'Sales' column. In pandas, we'll do this with the `autocorr` method.

`autocorr` takes one argument, `lag`, which is how many data points back to look when computing the correlation. With `lag` set to 1, we compute the correlation between every point and the point directly preceding it; with `lag` set to 10, we compute the correlation between every point and the point 10 days earlier:


In [None]:
data['Sales'].resample('D').mean().autocorr(lag=1)

Just like correlation between different variables, the closer this number is to 1, the more strongly each value is correlated with the value `lag` steps before it.
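
As a rough illustration of how `lag` changes the result, here is a minimal sketch on a made-up series with a repeating 7-day pattern (the series and variable names are hypothetical, not the Rossmann data). It also checks that `autocorr(lag=k)` matches the correlation with a copy of the series shifted by `k` rows.

```python
import pandas as pd

# Hypothetical daily series with a repeating 7-day pattern (not the Rossmann data).
index = pd.date_range("2015-01-01", periods=56, freq="D")
weekly_pattern = pd.Series([0, 1, 2, 3, 4, 5, 6] * 8, index=index, dtype=float)

# Correlation with the previous day versus the same weekday one week earlier.
lag1 = weekly_pattern.autocorr(lag=1)
lag7 = weekly_pattern.autocorr(lag=7)

# The same lag-7 value computed by hand: correlate the series with a shifted copy.
manual_lag7 = weekly_pattern.corr(weekly_pattern.shift(7))

print(f"lag=1: {lag1:.3f}, lag=7: {lag7:.3f}")
```

Because the pattern repeats exactly every 7 days, the lag-7 autocorrelation is much higher than the lag-1 value. Real sales data with a weekly cycle would be expected to show a weaker version of the same effect.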

**Check:** What do the autocorrelation values of Sales and Customers imply about our data?


### Slide 48:

## Aggregates of sales over time

If we want to investigate trends over time in sales, as always, we will start by computing simple aggregates. We want to know: what were the mean and median sales in each month and year?

In Pandas, this is performed using the `resample` method, which is very similar to `groupby`. It allows us to group over different time intervals.

We call `data.resample` and then apply an aggregation, so we need to provide:
    - The level to roll up to: 'D' for day, 'W' for week, 'M' for month, 'A' for year
    - The aggregation to perform: 'mean', 'median', 'sum', etc.
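
To make the roll-up concrete, here is a small sketch on two weeks of made-up daily values (the data and names are hypothetical). It resamples to weekly buckets, computes two aggregates at once, and shows the equivalent `groupby` form using `pd.Grouper`.

```python
import pandas as pd

# Hypothetical two weeks of daily sales, starting on a Monday (not the Rossmann data).
index = pd.date_range("2015-01-05", periods=14, freq="D")
sales = pd.Series(range(1, 15), index=index, name="Sales", dtype=float)

# Roll daily values up to weekly buckets and compute two aggregates at once.
weekly = sales.resample("W").agg(["mean", "median"])

# The same grouping expressed with groupby and a time-based Grouper.
weekly_groupby = sales.groupby(pd.Grouper(freq="W")).agg(["mean", "median"])

print(weekly)
```

Each output row is labeled with the last day of its week, and both forms produce the same values.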

In [None]:
data[['Sales']].resample('A').apply(['median', 'mean']).head()

In [None]:
data[['Sales']].resample('M').apply(['median', 'mean']).head()

While identifying monthly averages is useful, we often want to compare the sales data of a date to a smaller window. To understand holiday sales, we don't want to compare sales data in late December with the entire month, but instead with the few days immediately surrounding it. We can do this using rolling averages.

In pandas, we can compute rolling averages by calling `.rolling()` on a DataFrame or Series, followed by `.mean()` or `.median()`.

In [None]:
data[['Sales']].resample('D').mean().rolling(window=3, center=True).mean().head()

This computes a rolling mean of sales using the sales on each day, the day preceding and the day following (window = 3, center=True).


`rolling` takes two important parameters:
    - `window` is the number of rows (here, days) to include in the average
    - `center` is whether the window should be centered on the date or use data prior to that date

Older pandas versions also accepted a `freq` argument for the roll-up level. Current versions no longer do, so we roll up with `resample` first (`D` for day, `M` for month, `A` for year, etc.) and then call `rolling`.
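
The effect of `window` and `center` is easiest to see on a tiny made-up series. The sketch below uses hypothetical values and compares a centered window with a trailing one.

```python
import pandas as pd

# Hypothetical values, one per day (not the Rossmann data).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered: each result averages the previous, current and next value.
centered = values.rolling(window=3, center=True).mean()

# Trailing: each result averages the current value and the two before it.
trailing = values.rolling(window=3, center=False).mean()

print(pd.DataFrame({"values": values, "centered": centered, "trailing": trailing}))
```

Both versions produce the same averages, but the centered result is aligned with the middle of each window while the trailing result is aligned with its last row. Rows without a full window are `NaN`.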

In [None]:
data[['Sales']].resample('D').mean().rolling(window=15, center=False).mean().diff(1).sort_values(by='Sales')

In [None]:
data[['Sales']].resample('D').mean().rolling(window=15, center=False).mean().iplot()

As we discussed earlier, this averages all values in the window evenly, but we might want to weight closer values more. For example, with a weighted average over 10 days, we want to put additional emphasis on the nearest day versus days farther away. One option to do that is an exponentially weighted moving average. Older pandas versions exposed this as `pd.ewma`; current versions use the `ewm` method followed by `mean`.


In [None]:
data[['Sales']].resample('D').mean().ewm(span=10).mean().iplot()
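
To see how the exponential weighting behaves, here is a minimal sketch on three made-up values, using the `.ewm(...).mean()` form available in current pandas. It sets `adjust=False`, which is an assumption made here so the recursion is easy to follow by hand; the default is `adjust=True`.

```python
import pandas as pd

# Hypothetical values (not the Rossmann data).
values = pd.Series([1.0, 2.0, 3.0])

# span=3 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.5.
# With adjust=False the recursion is y[t] = alpha * x[t] + (1 - alpha) * y[t-1].
smoothed = values.ewm(span=3, adjust=False).mean()

print(smoothed.tolist())  # [1.0, 1.5, 2.25]
```

Each new value contributes half of the result, and the influence of older values halves at every step.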

Pandas `rolling().mean` and `rolling().median` are only two examples of Pandas window function capabilities. Window functions operate on a set of N consecutive rows (i.e., a window) and produce an output.

In addition to `rolling().mean` and `rolling().median`, there are `rolling().sum`, `rolling().min`, `rolling().max`... and many more.

Another common one is `diff`, which takes the difference over time. `df.diff` takes an argument, `periods`, which sets how many rows back to look when computing the difference.

For example, if we want to compute the difference in sales, day by day, we could compute:


In [None]:
data[['Sales']].resample('D').mean().diff(periods=1).iplot()
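
On a small made-up series the behavior of `diff` is easy to verify by hand. The sketch below uses hypothetical values and also shows that `diff(periods=1)` is the same as subtracting a copy shifted by one row.

```python
import pandas as pd

# Hypothetical daily values (not the Rossmann data).
daily = pd.Series(
    [10.0, 12.0, 15.0, 11.0],
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# Difference between each day and the day before; the first row has no predecessor.
day_over_day = daily.diff(periods=1)

# Equivalent manual computation using shift.
manual = daily - daily.shift(1)

print(day_over_day)
```

Setting `periods=7` on daily data would instead compare each day with the same weekday one week earlier.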

## Pandas Expanding Functions

In addition to the set of `rolling` functions, Pandas also provides a similar collection of `expanding` functions, which, instead of using a window of N values, uses all values up until that time.

For example, the expanding mean of total daily sales:

In [None]:
data[['Sales']].resample('D').sum().expanding().mean().iplot()
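
For comparison with the rolling functions, here is a minimal sketch of `expanding` on a few made-up values (hypothetical data). Each output row uses every value up to and including that row.

```python
import pandas as pd

# Hypothetical values (not the Rossmann data).
values = pd.Series([2.0, 4.0, 6.0, 8.0])

# Running mean of everything seen so far.
running_mean = values.expanding().mean()

# Running total; this matches the cumulative sum.
running_total = values.expanding().sum()

print(pd.DataFrame({"values": values, "mean": running_mean, "total": running_total}))
```

The expanding sum is the same as `cumsum`, which can be a handy cross-check.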

In [None]:
average_daily_customers = data[['Customers']].resample('D').mean()
average_daily_sales = data[['Sales', 'Open']].resample('D').mean()

In [None]:
# Difference versus the same weekday one week earlier
average_daily_customers['DiffVsLastWeek'] = average_daily_customers['Customers'].diff(periods=7)
average_daily_sales['DiffVsLastWeek'] = average_daily_sales['Sales'].diff(periods=7)

In [None]:
average_daily_sales[average_daily_sales.Open == 1].sort_values(by='DiffVsLastWeek')

In [None]:
average_daily_sales = data[['Sales', 'Open']].resample('D').mean()

In [None]:
average_daily_sales['Sales'].autocorr(lag=1)

In [None]:
average_daily_sales['Sales'].autocorr(lag=30)

In [None]:
data['Sales'].resample('D').mean().expanding().mean().iplot()

In [None]:
# Total sales across all stores on a single day (use .loc for date-string row selection)
data.loc['2013-01-01', 'Sales'].sum()

In [None]:
total_daily_sales = data[['Sales']].resample('D').sum()

# Running total of sales, restricted to December 2014
total_daily_sales.expanding().sum().loc['2014-12']

In [None]:
total_daily_sales.index