In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, plot_acf, plot_pacf, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose

%matplotlib inline

In [None]:
# Loading data
df_hld = pd.read_csv('../input/store-sales-time-series-forecasting/holidays_events.csv')
df_oil = pd.read_csv('../input/store-sales-time-series-forecasting/oil.csv')
df_str = pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')
df_trns = pd.read_csv('../input/store-sales-time-series-forecasting/transactions.csv')
train = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv')
test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv')
sample = pd.read_csv('../input/store-sales-time-series-forecasting/sample_submission.csv')

<h2 style='color:white; background:#000080; border:0'><center>Checking the data</center></h2>

In [None]:
# Check what data is available
train.head()

[The Data Description](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) describes the train.csv as follows.
* The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
* store_nbr identifies the store at which the products are sold.
* family identifies the type of product sold.
* sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
* onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

In [None]:
# Check the number of rows and columns
train.shape

In [None]:
# What columns are there?
train.columns

In [None]:
# Display data types
train.info()

In [None]:
train.describe()

In [None]:
# Converting data types
train['date'] = pd.to_datetime(train['date'])

In [None]:
train_df = pd.read_csv(
    "../input/store-sales-time-series-forecasting/train.csv",
    index_col='date',
    parse_dates=['date'],
).drop(['store_nbr', 'family', 'onpromotion'], axis=1)

In [None]:
train_df['Time'] = np.arange(len(train_df.index))

In [None]:
train_df.head()

In [None]:
# Check what data is available
test.head()

[The Data Description](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) describes the test.csv as follows.
* The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
* The dates in the test data are for the 15 days after the last date in the training data.

In [None]:
# Display data types
test.info()

In [None]:
test.describe()

In [None]:
# Converting data types
test['date'] = pd.to_datetime(test['date'])

In [None]:
# Check the number of rows and columns
test.shape

In [None]:
# Check what data is available
df_hld.head()

[The Data Description](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) describes the holidays_events.csv as follows.
* Holidays and Events, with metadata
* NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
* Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
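
The rule about transferred holidays and Work Days can be sketched in code. The example below uses a tiny hand-made table that only mimics the columns of holidays_events.csv (the rows are illustrative, and the names `hld`, `is_day_off` and `days_off` are made up for this sketch); it keeps the days that were actually celebrated.

```python
import pandas as pd

# Tiny hand-made sample mimicking the holidays_events.csv columns (illustrative data only).
hld = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"]),
    "type": ["Holiday", "Transfer", "Additional", "Work Day"],
    "description": [
        "Independencia de Guayaquil",
        "Traslado Independencia de Guayaquil",
        "Navidad-1",
        "Recupero puente Navidad",
    ],
    "transferred": [True, False, False, False],
})

# A transferred holiday behaves like a normal day, and a Work Day is a working day,
# so neither should count as a day off.
is_day_off = (~hld["transferred"]) & (hld["type"] != "Work Day")
days_off = hld.loc[is_day_off, ["date", "type", "description"]]
print(days_off)
```

The same mask could be applied to `df_hld` once its `date` column has been converted with `pd.to_datetime`.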

In [None]:
# Check what data is available
df_oil.head()

[The Data Description](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) describes the oil.csv as follows.
* Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

In [None]:
# Check what data is available
df_str.head()

[The Data Description](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) describes the stores.csv as follows.
* Store metadata, including city, state, type, and cluster.
* cluster is a grouping of similar stores.

In [None]:
df_str.describe()

In [None]:
# Check the number of city types in stores.csv
df_str['city'].value_counts()

In [None]:
df_str['city'].describe()

In [None]:
# Check the number of state types in stores.csv
df_str['state'].value_counts()

In [None]:
df_str['state'].describe()

In [None]:
# Check the number of rows and columns
df_hld.shape

In [None]:
df_hld.describe()

In [None]:
# Check the number of rows and columns
df_oil.shape

In [None]:
df_oil.describe()

In [None]:
# Check the number of rows and columns
df_str.shape

In [None]:
# Check the number of rows and columns
sample.shape

In [None]:
# Check the submission format
sample.head()

<h2 style='color:white; background:#000080; border:0'><center>EDA and Data Visualization</center></h2>

**Personal Notes**
* Draw sales data by city and state
* Create weekly average data and monthly average data
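
As a minimal sketch of the weekly and monthly averages mentioned in the notes, the example below resamples a synthetic daily series. The values are made up and the names `daily`, `weekly_mean` and `monthly_mean` are only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales indexed by date (made-up values, not the competition data).
dates = pd.date_range("2017-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60, dtype=float), index=dates, name="sales")

weekly_mean = daily.resample("W").mean()    # one value per week (weeks end on Sunday)
monthly_mean = daily.resample("MS").mean()  # one value per month, labelled by month start
print(monthly_mean)
```

Because train.csv has many rows per date (one per store and product family), the real data would first need to be aggregated to one value per day, for example with `train.groupby('date')['sales'].sum()`, before resampling.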

In [None]:
# Time series plot of data
plt.figure(figsize=(10,6))
sns.lineplot(x=train.index, y="sales", data=train)
plt.show()

Time series data, like other tabular data, can contain outliers and anomalous values that need to be handled.
In a time series, a few extreme points can be enough to distort the estimated overall trend.

Let's start with a simple plot of the values to see whether any outliers stand out.

In [None]:
# Compute the exponentially weighted moving average (EWMA)
ewm_mean = train['sales'].ewm(span=90).mean()  

# Display exponentially weighted moving average
print(ewm_mean)

# Visualization
fig, ax = plt.subplots()
ax.plot(train['sales'], label='original')
ax.plot(ewm_mean, label='ewma')
ax.legend()
plt.show()

* The graph shows two conspicuous points, one of them with a very large value. We could read their dates off the graph and correct them by hand, but that approach does not scale to many outliers, so we want a rule that detects them automatically.

* We assume that outliers deviate strongly from the local trend, as measured by a moving mean and a moving standard deviation, and we use those statistics to flag and then remove them.

In [None]:
def plot_outlier(ts, ewm_span=90, threshold=3.0):
    """Plot a series with its EWMA band and return the figure and the flagged outliers."""
    # Set the figure size when the figure is created, before anything is drawn
    fig, ax = plt.subplots(figsize=(10, 6))
    # Exponentially weighted moving average
    ewm_mean = ts.ewm(span=ewm_span).mean()
    # Exponentially weighted moving standard deviation
    ewm_std = ts.ewm(span=ewm_span).std()
    ax.plot(ts, label='original')
    ax.plot(ewm_mean, label='ewma')

    # Shade the band mean ± threshold * std; points outside it are treated as outliers
    ax.fill_between(ts.index,
                    ewm_mean - ewm_std * threshold,
                    ewm_mean + ewm_std * threshold,
                    alpha=0.2)
    outlier = ts[(ts - ewm_mean).abs() > ewm_std * threshold]
    ax.scatter(outlier.index, outlier, label='outlier')
    ax.legend()
    plt.show()
    return fig, outlier

fig, out_fil = plot_outlier(train['sales'], ewm_span=90, threshold=3.0)

So far I have used the exponentially weighted moving average to look for outliers. The function above also computes the exponentially weighted moving standard deviation and flags points that lie more than three standard deviations from the moving average as outliers. Next, I will try to remove them.

In [None]:
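# Extract records that are not outliers.
# out_fil is indexed by row position in train, while train_df is indexed by date,
# so build the mask from train's index (both frames have the same row order).
train_df_cln = train_df[~train.index.isin(out_fil.index)]
train_df_cln.head()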

**Personal Notes**
I'm going to comment out some of the code that follows because it's too time consuming to finish.

In [None]:
# Time series plot of data with outliers removed
# sns.lineplot(data=train_df_cln, x="date", y="sales")

In [None]:
# Let's compute a 7-day moving average of sales.
# train_df_cln['ma7'] = train_df_cln['sales'].rolling(7).mean()
# print(train_df_cln)

In [None]:
# Linear Regression with Time Series
# plt.style.use("seaborn-whitegrid")
# plt.rc(
#     "figure",
#     autolayout=True,
#     figsize=(11, 4),
#     titlesize=18,
#     titleweight='bold',
# )
# plt.rc(
#     "axes",
#     labelweight="bold",
#     labelsize="large",
#     titleweight="bold",
#     titlesize=16,
#     titlepad=10,
# )
# %config InlineBackend.figure_format = 'retina'

# fig, ax = plt.subplots()
# ax.plot('Time', 'sales', data=train_df_cln, color='0.75')
ax = sns.regplot(x='Time', y='sales', data=train_df_cln, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Sales');

In [None]:
# Lag features with Time Series
# train_df_cln['Lag_1'] = train_df_cln['sales'].shift(1)
# train_df_cln = train_df_cln.reindex(columns=['sales', 'Lag_1'])

In [None]:
train_df_cln.head()

In [None]:
# fig, ax = plt.subplots()
# ax = sns.regplot(x='Lag_1', y='sales', data=train_df_cln, ci=None, scatter_kws=dict(color='0.25'))
# ax.set_aspect('equal')
# ax.set_title('Lag Plot of Sales');

<h2 style='color:white; background:#000080; border:0'><center>Processing data</center></h2>

In [None]:
# Attach store metadata (city, state, ...) to each sales row via store_nbr
train_plus = train.merge(df_str, on='store_nbr', how='left')

In [None]:
grouped_mean = train_plus.groupby(['city','state'])['sales'].mean()
grouped_mean

In [None]:
# Check for missing value
train[train['sales'].isnull()]

In [None]:
df_str[df_str['city'].isnull()]

In [None]:
df_str[df_str['state'].isnull()]

No missing values were found in these columns.
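
The cells above check one column at a time. As a minimal sketch, all columns of a table can be checked at once by counting missing entries per column; the toy table and the name `missing_per_column` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with one gap (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "price": [52.3, np.nan, 53.1],
})

missing_per_column = toy.isna().sum()   # number of NaN/NaT values in each column
print(missing_per_column)
```

The same call could be run on `train`, `df_str`, `df_oil`, `df_hld` and `df_trns` to see whether any of them needs imputation.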

<h2 style='color:white; background:#000080; border:0'><center>Autocorrelation Coefficient</center></h2>

The autocorrelation coefficient measures how strongly a series is correlated with a time-shifted copy of itself, that is, how much past values are reflected in the current values.

For daily data, shifting the series by one step and computing the autocorrelation shows how closely today's sales track the sales of one day earlier. The number of steps by which the data is shifted is called the lag.

As an illustration, we compute the autocorrelation of a sine wave for 20 lags (0 through 19).
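
As a minimal sketch of this definition, the example below shifts a made-up series by a chosen lag and computes the Pearson correlation with the original, then compares the result with pandas' `Series.autocorr`. The weekly pattern and noise level are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily series with a weekly pattern plus noise.
rng = np.random.default_rng(0)
s = pd.Series(np.tile([10, 12, 11, 13, 15, 30, 28], 8) + rng.normal(0, 1, 56))

lag = 7
shifted = s.shift(lag)          # value from `lag` steps earlier, NaN for the first rows
manual = s.corr(shifted)        # Pearson correlation on the overlapping rows
builtin = s.autocorr(lag=lag)   # pandas computes the same quantity
print(manual, builtin)
```

A value close to 1 at lag 7 means that each day resembles the same weekday one week earlier.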

In [None]:
# Calculation of sinusoidal waves
x = np.linspace(-6 * np.pi, 6 * np.pi, 100)
sin = pd.Series(np.sin(x))

plt.figure(figsize=(10,6))
plt.plot(sin.index, sin)
plt.show()

# Calculation of autocorrelation coefficient
lags = 20
autocorrs = [sin.autocorr(lag=lag) for lag in range(lags)]
print(autocorrs)

Let's use the calculated data to create a correlogram.

A correlogram is a graph with the autocorrelation coefficient or cross-correlation coefficient calculated for different lags, with the lag on the horizontal axis and the correlation coefficient on the vertical axis.
By using the correlogram, it is possible to visualize the periodicity of the data.

We will use the coefficients we have just calculated to draw the correlogram.

In [None]:
# Confirmation of periodicity using a correlogram
plt.figure(figsize=(10,6))
plt.bar(range(lags), autocorrs)
plt.show()

The bars rise and fall in a regular pattern, which shows that the data repeats the same kind of movement, that is, the sine curve is periodic.
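
The period can also be estimated numerically rather than read off the chart. The sketch below rebuilds the same sine wave and picks the lag with the highest autocorrelation; starting the search after the first negative autocorrelation is a simple heuristic assumed here, and the names `first_negative` and `period` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same sine wave as in the cell above: 6 full cycles over 100 samples.
x = np.linspace(-6 * np.pi, 6 * np.pi, 100)
sin = pd.Series(np.sin(x))

# Autocorrelation for lags 1..20, indexed by lag.
autocorrs = pd.Series([sin.autocorr(lag=lag) for lag in range(1, 21)], index=range(1, 21))

# Skip the small lags, where neighbouring points are trivially similar, by starting
# the search after the first lag with negative autocorrelation.
first_negative = autocorrs[autocorrs < 0].index[0]
period = autocorrs.loc[first_negative:].idxmax()
print(period)
```

With six cycles spread over 99 steps, one cycle spans about 16.5 samples, so the estimated period should land on a neighbouring integer lag.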

<h2 style='color:white; background:#000080; border:0'><center>Creating a sales volume forecasting model</center></h2>

We need to know the start and end period of the training data first.

In [None]:
train = train.sort_values('date')
train.head()

In [None]:
# See when the data is available in a time series
train.tail()

* In other words, the training data covers about 55.5 months, from January 1, 2013 to August 15, 2017.
* Since we want to split the data in a ratio of roughly 8:2, we split at September 1, 2016: rows before that date are used for training and rows from that date onward are held out for testing.
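
A cutoff for a chronological split can also be derived from the dates themselves instead of being chosen by hand. The following is a minimal sketch on a toy frame; the data, the 0.8 fraction and the names `toy`, `cutoff`, `fit_part` and `holdout_part` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with several rows per date, like train.csv (values are made up).
dates = pd.date_range("2020-01-01", periods=10, freq="D")
toy = pd.DataFrame({"date": dates.repeat(3), "sales": range(30)})

train_fraction = 0.8
unique_dates = toy["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * train_fraction)]  # first hold-out date

fit_part = toy[toy["date"] < cutoff]       # earlier dates, used for fitting
holdout_part = toy[toy["date"] >= cutoff]  # later dates, held out
print(cutoff.date(), len(fit_part), len(holdout_part))
```

Splitting on unique dates rather than on row counts keeps all rows of a given day on the same side of the split, so no future information leaks into the fitting part.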

In [None]:
# Training data
train_splt = train[train['date']<'2016-09-01']

# Test data
test_splt = train[train['date']>='2016-09-01']