# Time Series Data Analysis

- Date - Date that data were measured
- Store - a unique Id for each store
- DayOfWeek - Day of week (from 1-7)
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- Promo - indicates whether a store is running a promo on that day

In [None]:
import pandas as pd
import numpy as np
# np.str was removed from NumPy; the built-in str is the equivalent dtype
data = pd.read_csv('rossmann.txt.bz2', skipinitialspace=True, compression='bz2',
                   dtype={'Date': str,
                          'Store': np.int64,
                          'DayOfWeek': np.int64,
                          'Sales': np.float64,
                          'Customers': np.int64,
                          'Open': np.int64,
                          'Promo': np.int64,
                          'StateHoliday': str,
                          'SchoolHoliday': np.int64
                         })

Because we are most interested in the `Date` column, which contains the date of sales for each store, we will parse it as a datetime type and make it the index of our dataframe.

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

data['Year'] = data.index.year
data['Month'] = data.index.month

Let's create a separate data frame to hold just the data from Store 1.

In [None]:
store1_data = data[data.Store == 1]

In [None]:
store1_data.head()

### Data Exploration and Mining

To compare sales on holidays, we can use box plots, which show the distribution of sales on holidays against all other days. On state holidays the store is closed (and as a nice sanity check there are 0 sales), and on school holidays the sales are relatively similar.

Check: See if there is a difference affecting sales on promotion days.

In [None]:
import seaborn as sns
%matplotlib inline

# factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(
    x='SchoolHoliday',
    y='Sales',
    data=store1_data,
    kind='box'
)

We can also see how being open or closed affects sales for different days of the week.

In [None]:
sns.catplot(
    col='Open',
    x='DayOfWeek',
    y='Sales',
    data=store1_data,
    kind='box'
)

Lastly, we want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?

Let's first drop the days when store 1 was closed and look only at days when it was open.

In [None]:
# Filter to days store 1 was open
store1_open_data = store1_data[store1_data.Open==1]
store1_open_data[['Sales']].plot()

Let's look at the trend for customers.

In [None]:
store1_open_data[['Customers']].plot()

### Data Refining Using Time Series Statistics

### Rolling Averages

If we want to investigate trends over time in sales, as always, we will start by computing simple aggregates.  We want to know what the mean and median sales were for each month and year.

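In pandas, this is done with the `resample` method, which works much like `groupby` but groups rows into time intervals based on the datetime index.

We call `data.resample` with the interval to roll up to, then chain the aggregation to perform:
- The roll-up level is a frequency string: 'D' for day, 'W' for week, 'M' for month, 'A' for year. (pandas 2.2 and later prefer 'ME' and 'YE' for month end and year end.)
- The aggregation is a method call on the result, such as `.mean()`, `.median()`, `.sum()`, or `.agg([...])` for several at once.

In [None]:
# Older pandas accepted resample('M', how=[...]); the `how` argument has since been removed
data[['Sales']].resample('M').agg(['median', 'mean']).head()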
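
To make the similarity between `resample` and `groupby` concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not the Rossmann data). It passes a `pd.offsets.MonthEnd()` offset object instead of a frequency string so that it does not depend on which month-end alias your pandas version accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 90 consecutive days with sales 0, 1, ..., 89
idx = pd.date_range("2015-01-01", periods=90, freq="D")
toy = pd.DataFrame({"Sales": np.arange(90, dtype=float)}, index=idx)

# Monthly roll-up with resample; each bin is labelled by its month end
monthly = toy["Sales"].resample(pd.offsets.MonthEnd()).agg(["median", "mean"])

# The same aggregation written as a groupby on the calendar month
by_month = toy["Sales"].groupby(toy.index.to_period("M")).agg(["median", "mean"])

print(monthly)
print(by_month)
```

Both results hold the same numbers; only the index differs (month-end timestamps for `resample`, monthly periods for the `groupby`).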

While identifying the monthly averages is useful, we often want to compare the sales on a given date to a smaller window. To understand holiday sales, we don't want to compare late December with the entire month, but perhaps with a few days surrounding it. We can do this using rolling averages.

In pandas, we compute rolling statistics by calling `.rolling()` on a Series or DataFrame and chaining an aggregation such as `.mean()` or `.median()`. (The older `pd.rolling_mean` and `pd.rolling_median` functions have been removed.)

In [None]:
# Average sales per day across stores, then take a centered 3-day rolling mean
data[['Sales']].resample('D').mean().rolling(window=3, center=True).mean().head()

The important pieces of this call are:
- `resample('D').mean()` first rolls the data up to one row per day; this replaces the `freq` argument of the old `rolling_mean`. As with `resample` in general, the level can be 'D' for day, 'M' for month, 'A' for year, etc.
- `window` is the number of rows (here, days) to include in each average
- `center` is whether the window should be centered on the date or use only that date and the data prior to it
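
The effect of `center` is easiest to see on a tiny series. The following is a small sketch with made-up values (the series `s` is hypothetical), comparing a trailing window against a centered one.

```python
import pandas as pd

# Hypothetical five-day series
idx = pd.date_range("2015-01-01", periods=5, freq="D")
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=idx)

# Trailing window: each value averages that day and the two days before it
trailing = s.rolling(window=3).mean()

# Centered window: each value averages the day before, that day, and the day after
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({"value": s, "trailing": trailing, "centered": centered}))
```

The trailing version has no result for the first two days, while the centered version has none for the first and last day, because a full window is not available there.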

Instead of plotting the full time series, we can plot the rolling mean, which smooths out random changes in sales and dampens outliers, helping us identify larger trends.

In [None]:
# Daily average sales smoothed with a centered 10-day rolling mean
data[['Sales']].resample('D').mean().rolling(window=10, center=True).mean().plot()

### Pandas Window functions
The rolling mean and rolling median are only two examples of pandas' window function capabilities. Window functions operate on a set of N consecutive rows (a window) and produce an output.

In addition to `.mean()` and `.median()`, a rolling window supports `.sum()`, `.min()`, `.max()`, and many more.
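
As a brief illustration, the same rolling window object can be reused for several aggregations. This sketch uses a made-up series (`s` is hypothetical) and a window of 3 rows.

```python
import pandas as pd

# Hypothetical values
s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# One window definition, several aggregations
window = s.rolling(window=3)
summary = pd.DataFrame({
    "sum": window.sum(),
    "min": window.min(),
    "max": window.max(),
})

print(summary)
```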

Another common one is `diff`, which takes the difference over time. The `diff` method takes one argument, `periods`, which is how many rows back to look when computing the difference.


In [None]:
data['Sales'].diff(periods=1).head()
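
To show what `periods` changes, here is a small sketch on made-up daily values (the series `s` is hypothetical). It also shows that `diff` is the same as subtracting a shifted copy of the series.

```python
import pandas as pd

# Hypothetical five-day series
idx = pd.date_range("2015-01-01", periods=5, freq="D")
s = pd.Series([100.0, 120.0, 90.0, 90.0, 150.0], index=idx)

# Change versus the previous row
d1 = s.diff(periods=1)

# Change versus the row two positions earlier; equivalent to s - s.shift(2)
d2 = s.diff(periods=2)
d2_manual = s - s.shift(2)

# Date of the largest day-over-day drop
biggest_drop = d1.idxmin()

print(pd.DataFrame({"value": s, "diff_1": d1, "diff_2": d2}))
print(biggest_drop)
```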

### Pandas expanding functions

In addition to rolling windows, pandas also provides expanding windows through `.expanding()` (replacing the removed `expanding_*` functions), which, instead of using a window of N values, use all values up until that time.

In [None]:
# Computes the average sales from the first date _until_ the date specified
data['Sales'].resample('D').mean().expanding().mean().head()
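
A minimal sketch on made-up numbers (the series `s` is hypothetical) shows how an expanding window grows with each row, and that an expanding sum is simply a cumulative sum.

```python
import pandas as pd

# Hypothetical values
s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Mean of all values seen so far: mean(2), mean(2, 4), mean(2, 4, 6), ...
exp_mean = s.expanding().mean()

# Sum of all values seen so far, which matches cumsum()
exp_sum = s.expanding().sum()

print(pd.DataFrame({"value": s, "expanding_mean": exp_mean, "expanding_sum": exp_sum}))
```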

### Autocorrelation

To measure how strongly sales are correlated with their own past values, we compute the _autocorrelation_ of the 'Sales' column. In pandas, we do this with the `autocorr` method.

`autocorr` takes one argument, `lag`, which is how many rows back the comparison series is shifted. With `lag=1`, we compute the correlation between every point and the point directly preceding it; with `lag=10`, between every point and the point 10 rows earlier (10 days, for daily data).

In [None]:
# Average sales per day across stores, then correlate each day with the previous day
data['Sales'].resample('D').mean().autocorr(lag=1)
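
To build intuition for `lag`, the sketch below uses a made-up series with a perfectly repeating weekly pattern (the values are hypothetical). It also checks that `autocorr(lag=k)` matches correlating the series with a copy of itself shifted by `k` rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten identical weeks, with zero sales on the last day of each week
idx = pd.date_range("2015-01-01", periods=70, freq="D")
weekly = pd.Series(np.tile([5.0, 6.0, 7.0, 6.0, 8.0, 9.0, 0.0], 10), index=idx)

# Correlation with the previous day
lag1 = weekly.autocorr(lag=1)

# Correlation with the same weekday one week earlier
lag7 = weekly.autocorr(lag=7)

# The same lag-1 quantity computed by hand
manual_lag1 = weekly.corr(weekly.shift(1))

print(lag1, lag7, manual_lag1)
```

Because the pattern repeats exactly every 7 days, the lag-7 autocorrelation is essentially 1, while the lag-1 value is lower.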

## Exercises

Plot the distribution of sales by month and compare the effect of promotions:

Hint: Use the `catplot` function in seaborn (called `factorplot` in older versions)

In [None]:
sns.catplot(
    col='Promo',
    x='Month',
    y='Sales',
    data=store1_data,
    kind='box'
)

Are sales more correlated with the prior day, a similar date last year, or a similar date last month?

Hint: What function allows you to check for a correlation with a prior time?

What time periods do 1 day before, last year, and last month correspond to?

In [None]:
average_daily_sales = data[['Sales']].resample('D').mean()

print('Correlation with last day: {}'.format(average_daily_sales['Sales'].autocorr(lag=1)))
print('Correlation with last month: {}'.format(average_daily_sales['Sales'].autocorr(lag=30)))
print('Correlation with last year: {}'.format(average_daily_sales['Sales'].autocorr(lag=365)))

Plot the 15-day rolling mean of customers in the stores.

In [None]:
# Average customers per day across stores, smoothed with a 15-day rolling mean
data[['Customers']].resample('D').mean().rolling(window=15).mean().plot()

Identify the date with largest drop in sales from the previous 2 days.

Allow for days where the store was closed.

What day is this?

In [None]:
# Write a line of code that gives you average daily sales
# Hint: Use resample

# Fill in a value for n that represents the number of days you want to go back to compare
n=
average_daily_sales['DiffVsLastWeek'] = average_daily_sales['Sales'].diff(periods=n)


# Use the .sort_values(by=) method on your dataframe to find the dates with the biggest drop


Compute the total sales up until Dec. 2014

In [None]:
# First get the total daily sales
total_daily_sales = data[['Sales']].resample('D').sum()

# Then use the expanding sum up to the date you want to compute it to
total_daily_sales.expanding().sum().loc['2014-12'].head()


When were the largest differences between 15-day moving/rolling averages?

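Hint: Use `rolling(...).mean()` and `diff`

In [None]:
# Daily average sales, 15-day rolling mean, then the day-over-day change sorted from largest drop
data[['Sales']].resample('D').mean().rolling(window=15).mean().diff(1).sort_values(by='Sales').head()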