# Introduction


You're probably well aware of the value of accurate forecasts, but producing them isn't easy. In this document I'll outline several basic univariate time series forecasting methods in plain language, assuming you have a basic knowledge of statistics and Python.

**Time series data definition**: Data collected on the same metrics or the same objects at regular time intervals, such as stock market prices or sales records.

**Univariate Time Series Forecasting**: Using only the previous values of a time series to predict its future values (no outside variables).

# Data Handling

### Importing Packages

In [None]:
import numpy as np, pandas as pd, seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from pandas import Series
import datetime

### Reading in Data

In [None]:
item_cats = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test_df = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

### Inspecting the data

In [None]:
print('item cats')
print(item_cats.head())
print('items')
print(items.head())
print('sales train')
print(sales_train.head())
print('shops')
print(shops.head())
print('test df')
print(test_df.head())

We need to convert the date column from a string into a datetime variable:

In [None]:
sales_train.dtypes

In [None]:
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y') # Dates are stored as day.month.year
sales_train.dtypes

Let's take a deeper look at our sales dataframe:

In [None]:
from IPython.display import display
display(sales_train.head())
display(sales_train.shape)
display(sales_train.isnull().any())
display(sales_train.describe())

## Data Exploration

In [None]:
"""
In this cell we are having a look at the total sales for the company 1C.
It appears as though there is a downward trend and seasonality.
"""
ts = sales_train.groupby(['date_block_num'])['item_cnt_day'].sum()
ts = ts.astype(float) # astype returns a new series, so assign it back

rolling_mean = ts.rolling(window = 12).mean() # rolling average of 12 months
rolling_std = ts.rolling(window = 12).std() # rolling std of 12 months

plt.figure(figsize=(16,8))
plt.title('Total Sales of 1C')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ts, color = 'blue', label = 'Sales')
plt.plot(rolling_mean, color = 'red', label = 'Rolling Mean')
plt.plot(rolling_std, color = 'black', label = 'Rolling Std')
plt.legend(loc = 'best')
plt.show()

In [None]:
display(ts.head())

# Time series Analysis

## Stationarity

**definition**: The statistical properties of a stationary time series do not change over time, i.e. the relationship between 2 points in the series depends only on how far apart they are (the lag) and not on where in the series they occur.

Essentially, the mean, variance, and autocovariance should remain constant over time. If the data has a trend, it isn't stationary.

The reason it's important, without going into the math, is that many models assume the data is stationary and rely on that assumption to give valid results.
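
To make the idea concrete, here is a minimal sketch using two made-up series (neither comes from the sales data): pure noise, which is stationary, and the same noise with an upward trend added. Comparing the mean of the first half of each series with the mean of the second half is only a rough check and not a formal test, but it shows what "the mean changes over time" looks like.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
n = 200

# Hypothetical series: pure noise (stationary) and noise plus an upward trend
noise = rng.normal(loc=0.0, scale=1.0, size=n)
trended = noise + 0.5 * np.arange(n)

def half_means(x):
    """Return the mean of the first half and of the second half of a series."""
    half = len(x) // 2
    return x[:half].mean(), x[half:].mean()

noise_first, noise_second = half_means(noise)
trend_first, trend_second = half_means(trended)

print('Noise halves:  ', round(noise_first, 2), round(noise_second, 2))
print('Trended halves:', round(trend_first, 2), round(trend_second, 2))
```

The two halves of the noise series have nearly the same mean, while the halves of the trended series are far apart, which is the kind of behaviour a stationarity test is designed to detect.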

You can test for stationarity with the following tests:
* Augmented Dickey-Fuller (ADF)
* KPSS
* Phillips-Perron (PP)

For our data I will be performing an ADF test.

In [None]:
"""
In this cell we perform the ADF test to check for stationarity. The
ADF tests the null hypothesis that a unit root is present in the
time series. i.e. if the p-value is less than 5%, you can reject the
null hypothesis and assume that the data is stationary.
"""

def adf_test(ts):
    print('ADF test results:')
    adf = adfuller(ts, autolag  = 'AIC')
    adf_out = pd.Series(adf[0:4], index=['Test Statistic',
                                        'p-value','#Lags Used',
                                        'Number of Observations Used'])
    for key, val in adf[4].items():
        adf_out['Critical Value (%s)' %key] = val
    print(adf_out)
    
adf_test(ts)

The p-value is 14.3%, so we fail to reject the null hypothesis and can't assume stationarity.

## Differencing

**definition**: Differencing is a transformation of a time series, taking the difference between consecutive terms in a series. It can be used to remove time dependency and stabilise the mean, reducing trends and seasonality.



In [None]:
def difference(df, interval=1):
    diff = [] # Create empty list
    for i in range(interval, len(df)): # Start at the first position that has a value `interval` steps before it
        val = df.iloc[i] - df.iloc[i - interval] # Take the difference between terms `interval` steps apart
        diff.append(val) # Add the new value to the end of the list
    return Series(diff) # Return the differenced values as a time series
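
For reference, pandas has a built-in that does the same job as the helper above: Series.diff(periods=...). Below is a small sketch on a made-up series (the numbers are illustrative only). The first `periods` values come back as NaN because they have nothing to be compared against, so I drop them.

```python
import pandas as pd

# Hypothetical toy series, not the sales data
toy = pd.Series([10, 12, 15, 11, 14, 18])

# Difference between consecutive terms (interval of 1)
lag1 = toy.diff(periods=1).dropna().reset_index(drop=True)

# Difference between terms 3 steps apart (like a seasonal difference with a season of 3)
lag3 = toy.diff(periods=3).dropna().reset_index(drop=True)

print(lag1.tolist())
print(lag3.tolist())
```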

In [None]:
"""
Below the original time series is plotted, the same as the plot above.
"""
ts = ts.astype(float) # astype returns a new series, so assign it back
plt.figure(figsize=(16,16))
plt.subplot(311)
plt.title('Original')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ts) # Plot the original time series

"""
Below the new differenced time series is plotted.
"""
new_ts = difference(ts) # difference the time series
plt.subplot(312)
plt.title('Post-differencing')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(new_ts)

"""
Below the time series is de-seasonalised (assuming the seasonality
12 months long)
"""
ds_ts = difference(ts, interval = 12)
plt.subplot(313)
plt.title('After De-seasonalising')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ds_ts)
plt.show()

Let's test the differenced and deseasonalised series:

In [None]:
print('Differenced')
adf_test(new_ts)

print('\nDeseasonalised')
adf_test(ds_ts)

The p-value of the ADF test on the deseasonalised data is below 5%, so we can reject the null hypothesis and assume the deseasonalised series is stationary.

### Considerations
You have to be careful not to over-difference the time series. An over-differenced series may still be stationary, but the unnecessary differencing will distort the model parameters (settings).

You should aim to use the minimum necessary differences to achieve stationarity.

**How do you know if a time series is over-differenced?** Optimally, the Autocorrelation Function (ACF) plot should reach 0 quickly, as seen below. If the first lag (the second spike on the ACF plot) is too far in the negative, then the series is probably over-differenced.

OK, so let's have a look at an over-differenced series:

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(difference(difference(ts)));
plt.title('2nd Order Differencing ACF')
plt.show()

As previously described, the first lag goes far into the negative, suggesting that it is over-differenced.
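
To put a number on this, here is a small sketch with simulated white noise (made-up data, not the sales series). White noise is already stationary, so differencing it is by construction over-differencing, and in theory the lag-1 autocorrelation of the differenced noise is -0.5. I use the pandas method Series.autocorr to measure it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical series that needs no differencing at all
white_noise = pd.Series(rng.normal(size=2000))

# Differencing it anyway over-differences the series
over_differenced = white_noise.diff().dropna()

lag1_before = white_noise.autocorr(lag=1)
lag1_after = over_differenced.autocorr(lag=1)

print('Lag-1 autocorrelation before differencing:', round(lag1_before, 3))
print('Lag-1 autocorrelation after differencing: ', round(lag1_after, 3))
```

A strongly negative first lag like the second value is the pattern to look for on the ACF plot.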

The deseasonalised series is a much better series to work on:

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(ds_ts);
plt.title('Deseasonalised ACF')
plt.show()

**Autocorrelation:** autocorrelation summarises the strength of the relationship between an observation in a time series and the observations at previous time steps.

In simpler terms: correlation is the strength of a relationship between 2 variables (from -1 to 1). Because the correlation is calculated between the time series observations and values of the same series at prior time steps, it is called a serial correlation or *autocorrelation*.

**How to read the above graph:** The ACF plot shows the lag value along the x-axis & the correlation on the y-axis (between -1 and 1). By default the plot_acf function draws a 95% confidence interval cone in light blue, suggesting that values outside of this cone are likely a real correlation and not a statistical fluke.
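
If you want to see what sits behind the plot, the autocorrelation can be computed by hand. The sketch below is my own minimal version (the function name acf_by_hand is made up) applied to an artificial wave with a 12-step season. The significance bound of 1.96 / sqrt(n) is a common large-sample approximation for white noise; the cone drawn by plot_acf may be calculated differently, so treat it as a rough guide only.

```python
import numpy as np

def acf_by_hand(x, max_lag):
    """Autocorrelation of x at lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # centre the series
    denom = np.dot(x, x)                  # lag-0 sum of squares
    vals = [1.0]
    for k in range(1, max_lag + 1):
        vals.append(np.dot(x[:-k], x[k:]) / denom)
    return np.array(vals)

# Hypothetical series: a clean wave that repeats every 12 steps
t = np.arange(120)
wave = np.sin(2 * np.pi * t / 12)

acf_vals = acf_by_hand(wave, max_lag=12)
bound = 1.96 / np.sqrt(len(wave))         # approximate 95% bound

print('Lag 6: ', round(acf_vals[6], 2))
print('Lag 12:', round(acf_vals[12], 2))
print('Approximate 95% bound: +/-', round(bound, 3))
```

The correlation is strongly negative half a season away (lag 6) and strongly positive a full season away (lag 12), which is how seasonality shows up on an ACF plot.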

# SARIMA Modeling

Now that the time series is differenced, we can move on to building our models.

Seasonal AutoRegressive Integrated Moving Average (SARIMA) modeling is a long-established statistical approach that combines a moving average (MA) model, an autoregressive (AR) model and a seasonal component.
* MA: Assumes that the next value in the series is a function of the forecast errors of the previous n values.
* AR: Assumes that the next value in the series is a function of the previous n values.

Pros:
* Very effective; remains close to cutting edge performance
* Simple to implement and not computationally intensive

Cons:
* Not very intuitive
* No way to build in our understanding about how our data works:
    * random walk element
    * external regressors

## How does the SARIMA model work?
There are 3 important terms in ARIMA models (p, d & q), and SARIMA adds a seasonal component (s):
* **p** is the order of the AR term
* **q** is the order of the MA term
* **d** is the number of times differencing is required to make the time series stationary
* **s**, the seasonal component, is comprised of:
    * P - The seasonal autoregressive order
    * D - The seasonal difference order
    * Q - The Seasonal moving average order
    * m - The number of time steps in a single seasonal period

**What do these terms mean?**
The AR part in ARIMA is a linear regression model that uses its own lags (previous time steps) as predictors. For a linear regression model to be effective you need the predictors to be independent of each other (not correlated), i.e. the time series needs to be stationary.

A common and effective way to make a time series stationary is to difference it (subtract the previous value from the current value). Depending on how complex the series is, you may need to difference it more than once. **d** is the minimum number of differences needed for the data to be stationary, so if the series is already stationary, d = 0.

**p** is the order of the AR term and refers to the number of lags (time steps) of Y (the dependent variable, i.e. the one you're trying to forecast) to be used as predictors.

**q** is the order of the MA term and refers to the number of lagged forecast errors that should go into the ARIMA model.

An ARIMA model is a model that is differenced at least once and combines the MA and AR terms.

predicted Yt = Constant + linear combination of lags of Y (up to p lags) + linear combination of lagged forecast errors (up to q lags)

[source: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ ]
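
The equation above can be sketched in a few lines of numpy. This is only an illustration for p = 1 and q = 1, with a function name (simulate_arma11) and coefficients that I made up: each new value is a constant, plus a multiple of the previous value (the AR part), plus a multiple of the previous error (the MA part), plus a fresh random error.

```python
import numpy as np

def simulate_arma11(n, const, phi, theta, seed=0):
    """Simulate n points of an ARMA(1,1) process (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    errors = rng.normal(size=n)           # random shocks
    y = np.zeros(n)
    y[0] = const + errors[0]
    for t in range(1, n):
        ar_part = phi * y[t - 1]          # lag of Y
        ma_part = theta * errors[t - 1]   # lagged error
        y[t] = const + ar_part + ma_part + errors[t]
    return y

# Made-up coefficients, for illustration only
y = simulate_arma11(n=5000, const=2.0, phi=0.5, theta=0.4)

# For a stationary AR(1) part the long-run mean is const / (1 - phi)
print('Sample mean:', round(y.mean(), 2), '| expected about', 2.0 / (1 - 0.5))
```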

### Estimating the differencing term (d)

It is possible to use packages to estimate the number of differences required. We can use the function "ndiffs()" from the pmdarima package to run a stationarity test (ADF, KPSS or PP) at different levels of d and estimate the number of differences required to make the time series stationary. As the results below show, it doesn't always work: the three tests disagree with each other (returning 1, 2 and 0), whereas our tests above pointed to a seasonal difference.

In [None]:
'''
from pmdarima.arima.utils import ndiffs, nsdiffs

# Normal Differencing:

# ADF test
d_adf = ndiffs(ts, test='adf') # = 1

# KPSS test
d_kpss = ndiffs(ts, test='kpss') # = 2

# PP test
d_pp = ndiffs(ts, test='pp') # = 0

print('Difference Estimations:\nADF:%s KPSS:%s PP:%s' % (d_adf,d_kpss,d_pp))
'''

### Finding the AR term (p)
We find p by analysing the Partial AutoCorrelation Function (PACF) plot.

**PACF explanation:** Autocorrelation for an observation & another observation at a prior time step is comprised of both the direct correlation & indirect correlations. The indirect correlations are a linear function of the correlation of the observation with observations at intervening time steps.

It is these indirect correlations that the PACF seeks to remove. The correlation between points Y0 and Y1 will have some inertia and affect points later on.

In short, the PACF conveys the pure correlation between an observation and a given lag of the series. That way you will know whether that lag is needed in the AR term or not.

**How do we find p?:** Any autocorrelation in a stationary time series can be fixed by adding enough AR terms. So we initially take the order of the AR term to be equal to the number of lags that cross the significance limit in the PACF plot.

Time series analysis is a bit of an art and there isn't a set methodology that you have to follow. Many people analyse the ACF and PACF plots to find patterns that may give away the right order, but it is also possible to find the order systematically by trying each candidate, although this is rather computationally intensive.

In [None]:
'''
Looping over possible values of p and q and measuring their AIC.

AIC (Akaike Information Criterion) scores how well a model fits the
data while penalising extra parameters. Lower is better.
'''
import statsmodels.api as sm
import warnings

rng = range(5)
best_aic = np.inf
best_model = None
best_order = None

# Silence fitting warnings only while the grid search runs;
# the previous warning settings are restored when the block ends
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    for i in rng:
        for j in rng:
            temp_model = sm.tsa.statespace.SARIMAX(ds_ts, order = (i, 0, j))
            results = temp_model.fit(disp = False) # disp = False hides the optimiser output
            temp_aic = results.aic
            if temp_aic < best_aic:
                best_aic = temp_aic
                best_order = (i, 0, j)
                best_model = temp_model

print('Best AIC: %s | Best order: %s' % (best_aic, best_order))
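
For anyone curious about what the AIC number actually is, the formula is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood of the model. The sketch below uses made-up log-likelihoods (they are not results from the models above) to show how a slightly better fit can still lose once the penalty for extra parameters is applied.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods, for illustration only
simple_model = aic(log_likelihood=-250.0, n_params=2)
complex_model = aic(log_likelihood=-249.5, n_params=5)

print('Simple model AIC: ', simple_model)
print('Complex model AIC:', complex_model)
```

The complex model fits marginally better but ends up with the higher (worse) AIC, so the grid search would prefer the simple one.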

In [None]:
"""
So in the above code cell we determined that p & q were best set at 1.
Earlier on with the ADF test we found that we needed to perform a seasonal difference with the interval set to 12.

We supplied the SARIMAX function with 3 parameters here; order, trend and seasonal order.
* The order parameter is just a copy of the results above.
* I chose the trend through trial and error, setting it to 't' gave me the best results.
* The seasonal order is (P,D,Q,m) where m is the number of time steps in a season, 12 in our case. We set D to 1 because we only need 1
  seasonal difference, and p & q are already used in the order parameter. We could supply seasonal P & Q but it's important
  not to make the model too complex and cause overfitting.
"""
sarima_model = sm.tsa.statespace.SARIMAX(ts, order = (1,0,1),trend = 't', seasonal_order=(0,1,0,12))
results = sarima_model.fit()
print(results.aic)

The best practice is to split the data into a training and testing set prior to fitting the model so that you can validate its accuracy; however, I want to keep this brief.
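
For completeness, here is a rough sketch of what such a split could look like. It isn't used anywhere else in this document, and the helper names (split_series, rmse) and the toy series are my own. The key point is that a time series must be split in time order, with the most recent points held back, rather than shuffled at random.

```python
import numpy as np
import pandas as pd

def split_series(series, n_test):
    """Keep the time order: the last n_test points form the test set."""
    return series.iloc[:-n_test], series.iloc[-n_test:]

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical monthly series with 36 points
toy = pd.Series(np.arange(36, dtype=float))
train, test = split_series(toy, n_test=6)

# Naive baseline: repeat the last training value for every test month
naive_forecast = np.repeat(train.iloc[-1], len(test))
baseline_rmse = rmse(test, naive_forecast)

print(len(train), len(test), round(baseline_rmse, 3))
```

A fitted model would be trained on train only and its forecasts scored against test in the same way, ideally beating the naive baseline.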

## Forecasting Sales for 1C

In [None]:
'''
We'll predict from month 33 (the last observed month) to month 46,
just over a year into the future.
'''
preds = results.predict(start = 33, end = 46)


ax = ts.plot(label = 'Observed')
preds.plot(ax = ax, label = 'SARIMA forecast')
plt.legend()
plt.title('1C Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.show()

# Prophet Forecasting

In February 2017 Facebook's Data Science team open sourced their forecasting library "Prophet". It's a highly optimised package to quickly perform forecasting on non-stationary data.

In [None]:
'''
Before forecasting we need to add the dates back into the time-series
'''
ts.index = pd.date_range(start = '2013-01-01', 
                         end = '2015-10-01', 
                         freq = 'MS')
ts = ts.reset_index()
ts.head()

In [None]:
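# Import the package (it was renamed from fbprophet to prophet in version 1.0)
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet

# Prophet requires you to name your columns the following:
ts.columns = ['ds','y']
prophet_model = Prophet(yearly_seasonality = True) # Yearly seasonality, as suggested by the seasonal differencing above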
prophet_model.fit(ts)

# We'll predict 12 months into the future
# 'MS' = month start
future = prophet_model.make_future_dataframe(periods = 12, freq = 'MS')
forecast = prophet_model.predict(future)
forecast.head()

In [None]:
prophet_model.plot(forecast);
plt.title('1C Sales - Prophet Forecast')
plt.xlabel('Date')
plt.ylabel('Units Sold')
plt.show()

In [None]:
prophet_model.plot_components(forecast)

In [None]:
ts = sales_train.groupby(['date_block_num'])['item_cnt_day'].sum()
ax = ts.plot(label = 'Observed')
preds.plot(ax = ax, label = 'SARIMA forecast', alpha = 0.9, linestyle = '-')
forecast.yhat[33:46].plot(ax = ax, label = 'Prophet forecast', alpha = 0.9, linestyle = '--')

plt.legend()
plt.title('1C Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.show()

It seems as though SARIMA does a better job of generalising and appears to be the simpler model, although Prophet is much easier to implement.

# More Complex Forecasting (Competition Entry)


We have to preprocess and transform the sales_train data so that we can train a model to make predictions for the test_df (below).


In [None]:
display(test_df.head())
display(sales_train.tail())

## Cleaning the Data

### Removing outliers

In [None]:
# We can see below that there are significant outliers that must
# be removed

# Plotting
plt.boxplot(sales_train.item_price)

# Removing Outlier
sales_train = sales_train[(sales_train.item_price < 300000)]

In [None]:
# Plotting
plt.boxplot(sales_train.item_cnt_day)

# Removing outlier
sales_train = sales_train[(sales_train.item_cnt_day < 1000)]

### Checking for Duplicates

We have 6 duplicated rows that we might have to address; however, 6 rows in a dataset this big are unlikely to make a material difference.

In [None]:
len(sales_train[sales_train.duplicated()])
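
If you did want to drop them, pandas can do it in one call. Here is a small sketch on a made-up dataframe (the rows are invented for illustration); duplicated() flags every repeat of an earlier row, and drop_duplicates() keeps the first copy of each.

```python
import pandas as pd

# Hypothetical sales rows: the second row repeats the first exactly
toy = pd.DataFrame({
    'shop_id':      [1, 1, 2, 2],
    'item_id':      [10, 10, 20, 21],
    'item_cnt_day': [1.0, 1.0, 3.0, 3.0],
})

n_dupes = int(toy.duplicated().sum())                  # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```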

### Downcasting the dataset

We can significantly reduce the size of the dataset by changing the datatypes of variables down from 64 bits to 16 & 32. This will make training our model much faster.

[Source for below: [kyakovlev](https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data)]

In [None]:
def downcast(df):
    # Identifies which columns are floats and which are ints
    float_cols = [x for x in df if df[x].dtype == 'float64']
    int_cols = [x for x in df if df[x].dtype in ['int64', 'int32']]
    
    # Downsizes them to 32 bit floats & 16 bit ints (assumes every integer fits in the int16 range)
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols] = df[int_cols].astype(np.int16)
    return df

sales_train = downcast(sales_train)
print(sales_train.info())
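
One thing to watch with a fixed cast to int16 is that it only holds values up to 32,767. As a more cautious alternative (a sketch on made-up columns, not what the borrowed function does), pandas can pick the smallest type that safely fits the data through pd.to_numeric with the downcast argument.

```python
import pandas as pd

# Hypothetical columns, for illustration only
small = pd.Series([1, 2, 3], dtype='int64')
large = pd.Series([1, 200, 40000], dtype='int64')    # 40000 does not fit in int16
prices = pd.Series([9.99, 199.0], dtype='float64')

small_dc = pd.to_numeric(small, downcast='integer')  # smallest signed integer type that fits
large_dc = pd.to_numeric(large, downcast='integer')
prices_dc = pd.to_numeric(prices, downcast='float')  # float32 is the smallest float it will use

print(small_dc.dtype, large_dc.dtype, prices_dc.dtype)
```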

### Inspecting the shops

It's somewhat difficult to see, given that it's in Russian, but some of the shop names are duplicated (e.g. shop 10 & 11). Maybe they've changed location or there's been an error, but it's probably a good idea to change it so it resembles the testing set.

In [None]:
'''
Duplicated shops:
0 = 57
1 = 58
10 = 11
40 = 39
'''
shops

In [None]:
def replace_shops(df):
    # Replace 0 with 57
    df.loc[df.shop_id == 0, 'shop_id'] = 57
    # Replace 1 with 58
    df.loc[df.shop_id == 1, 'shop_id'] = 58
    # Replace 10 with 11
    df.loc[df.shop_id == 10, 'shop_id'] = 11
    # Replace 40 with 39
    df.loc[df.shop_id == 40, 'shop_id'] = 39
    return df

# Perform the same changes to training & testing set
replace_shops(sales_train)
replace_shops(test_df)

In [None]:
# Inspecting changes (no 0, 1, 10, or 40)
sales_train['shop_id'].unique()
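
The same replacement can also be written with a dictionary, which keeps all of the duplicate pairs in one place. This is just an alternative sketch on a made-up dataframe, using the same four pairs listed above.

```python
import pandas as pd

# Old shop id -> id of the shop it duplicates
shop_map = {0: 57, 1: 58, 10: 11, 40: 39}

# Hypothetical rows, for illustration only
toy = pd.DataFrame({'shop_id': [0, 1, 5, 10, 40, 57]})
toy['shop_id'] = toy['shop_id'].replace(shop_map)

print(toy['shop_id'].tolist())
```

Ids that aren't in the dictionary (5 and 57 here) are left untouched.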

**Adding in City name and Category of Shop**

Reading over the #1 notebook, we can add the city and shop type to the shops dataframe. I had to borrow this, given that I don't read Russian.

[Source: [kyakovlev](https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data)]



In [None]:
# Lowercases the names, then removes special characters, digits and surrounding spaces
shops['shop_name'] = shops['shop_name'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'\d+', '', regex=True).str.strip()

# Adds in the city name
shops['shop_city'] = shops['shop_name'].str.partition(' ')[0]

# Adds the shop category
shops['shop_type'] = shops['shop_name'].apply(lambda x: 'мтрц' if 'мтрц' in x else 'трц' if 'трц' in x else 'трк' if 'трк' in x else 'тц' if 'тц' in x else 'тк' if 'тк' in x else 'NO_DATA')
shops.head()

In [None]:
"""
ENCODING

Here we're going to encode the shop_city & shop_type variables.
In short, the model doesn't understand what "тц" means, so we 
assign each category of shop type a number. So all "тц" shops could
be assigned the number 4, and when that number comes up the model knows
that it's in the same group as the other observations with the number 4.
"""
from sklearn.preprocessing import LabelEncoder
shops['shop_city'] = LabelEncoder().fit_transform(shops.shop_city)
shops['shop_type'] = LabelEncoder().fit_transform(shops.shop_type)

"""
We don't need the shop_name, so we'll just remove it
"""
shops = shops[['shop_id','shop_city','shop_type']]
shops.head()
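
To see what the encoding does to the values, here is a tiny sketch on a made-up list of shop types. I use the pandas category codes here as a stand-in, since they follow the same idea as LabelEncoder: every distinct label gets one integer, and equal labels always get the same integer.

```python
import pandas as pd

# Hypothetical shop types, for illustration only
shop_types = pd.Series(['тц', 'трк', 'тц', 'NO_DATA'])

codes = shop_types.astype('category').cat.codes

# Lookup table from the original label to its integer code
lookup = dict(zip(shop_types, codes))

print(codes.tolist())
print(lookup)
```

Keep in mind that the integers are only identifiers; their order and size carry no meaning.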

### Inspecting the items

I had to refer back to the #1 notebook on this; it's much simpler than using Google Translate on dozens of Russian words. In this section we extract features from the item names (e.g. what they have in their brackets).

[Source for below: [kyakovlev](https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data)]

In [None]:
display(items.head())

In [None]:
import re # importing regex to identify text with certain patterns

def rename(text):
    text = text.lower() # convert to lower case
    text = text.partition('[')[0] # Split at square bracket
    text = text.partition('(')[0] # Split at regular bracket
    text = re.sub('[^A-Za-z0-9А-Яа-я]+', '  ', text) # remove special characters (e.g. !)
    text = text.replace('  ',' ') # replace double space with single
    text = text.strip()
    return text
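
To show what this cleaning does, here is a self-contained sketch that repeats the same steps under a different name (clean_name, so it doesn't clash with the function above) and runs them on two item names that I made up.

```python
import re

def clean_name(text):
    """Same steps as the rename function above, repeated so this cell runs on its own."""
    text = text.lower()                                  # convert to lower case
    text = text.partition('[')[0]                        # keep the part before a square bracket
    text = text.partition('(')[0]                        # keep the part before a regular bracket
    text = re.sub('[^A-Za-z0-9А-Яа-я]+', '  ', text)     # swap special characters for spaces
    text = text.replace('  ', ' ')                       # replace double space with single
    return text.strip()

# Hypothetical item names, for illustration only
cleaned_game = clean_name('Call of Duty: Ghosts [PS3, русская версия]')
cleaned_item = clean_name('Игра (Регион)')

print(cleaned_game)
print(cleaned_item)
```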

In [None]:
# Split the item name at the first square bracket and at the first regular bracket
items[['name1', 'name2']] = items.item_name.str.split('[', n=1, expand=True)
items[['name1', 'name3']] = items.item_name.str.split('(', n=1, expand=True)

# Convert text to lowercase & remove special characters
items["name2"] = items.name2.str.replace('[^A-Za-z0-9А-Яа-я]+', " ", regex=True).str.lower()
items["name3"] = items.name3.str.replace('[^A-Za-z0-9А-Яа-я]+', " ", regex=True).str.lower()

# impute empty cells with '0'
items = items.fillna('0')

# Correct the item names,
# See if needed #items["item_name"] = items["item_name"].apply(lambda x: name_correction(x))

# Cuts off the last character of name2 unless it's 0
items.name2 = items.name2.apply( lambda x: x[:-1] if x !="0" else "0")

[Source: [gordotron85](https://www.kaggle.com/gordotron85/future-sales-xgboost-top-3)]
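
Here is a small sketch of how splitting at the first bracket behaves, using three made-up item names. With expand=True the pieces come back as dataframe columns, and names that contain no bracket get a missing value in the second column, which is why the empty cells need filling afterwards.

```python
import pandas as pd

# Hypothetical item names, for illustration only
names = pd.Series(['Game A [PC, Digital]', 'Movie B (BD)', 'Plain title'])

# Split once at the first square bracket
parts = names.str.split('[', n=1, expand=True)
parts.columns = ['before_bracket', 'inside_bracket']

print(parts)
```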

In [None]:
# Pulls the item type from name (in square brackets)
items["type"] = items.name2.apply(lambda x: x[0:8] if x.split(" ")[0] == "xbox" else x.split(" ")[0] )

# Tidies up the type labels for xbox, mac, pc and playstation items
items.loc[(items.type == "x360") | (items.type == "xbox360") | (items.type == "xbox 360") ,"type"] = "xbox 360"
items.loc[ items.type == "", "type"] = "mac"
items.type = items.type.apply( lambda x: x.replace(" ", "") )# Removes spaces
items.loc[ items.type == 'pc', "type" ] = "pc"
items.loc[ items.type == 'рs3' , "type"] = "ps3"

In [None]:
# Group the dataset by type & count the number of each item id
cat_counts = items.groupby(['type']).agg({'item_id':'count'})
cat_counts = cat_counts.reset_index()

bad_cats = []

# Collects every type with fewer than 40 items; matching name2 values are then relabelled as 'other'
for cat in cat_counts.type.unique():
    if cat_counts.loc[(cat_counts.type == cat), 'item_id'].values[0] < 40:
        bad_cats.append(cat)

items.name2 = items.name2.apply(lambda x: 'other' if (x in bad_cats) else x)
items = items.drop(['type'], axis = 1)
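
The general idea of lumping rare categories together can also be written without a loop. This is an alternative sketch on made-up labels with a made-up threshold of 3 (the code above uses 40), using value_counts to find the rare labels.

```python
import pandas as pd

# Hypothetical type labels, for illustration only
types = pd.Series(['pc'] * 5 + ['ps3'] * 4 + ['mac'] * 1 + ['xbox360'] * 2)
min_count = 3

counts = types.value_counts()                    # how many items each label has
rare = counts[counts < min_count].index          # labels below the threshold

# Keep common labels as they are and replace rare ones with 'other'
grouped = types.where(~types.isin(rare), 'other')

print(grouped.value_counts())
```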

In [None]:
items.name2 = LabelEncoder().fit_transform(items.name2)
items.name3 = LabelEncoder().fit_transform(items.name3)

items.drop(['item_name','name1'], axis = 1, inplace = True)
items.head()