# Introduction
Time series forecasting is the task of predicting future values, here future sales, from a sequence of data points measured over time. Three broad families of approaches are statistical methods, classical machine learning, and neural networks. The best method depends on the data set; this kernel explores neural networks. The challenge is to predict the monthly sales of an enterprise for each item_id and shop_id pair.

## Data Exploration
Load and explore data

In [None]:
#load libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

In [None]:
# load the data sets
sales_train_df = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
test_df = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
item_categories = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
sample_submission = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')

print('Number of Training Samples = {}'.format(sales_train_df.shape[0]))
print('Number of Test Samples = {}\n'.format(test_df.shape[0]))

print('Training X Shape = {}'.format(sales_train_df.shape))
print('Test X Shape = {}'.format(test_df.shape))
print('Test y Shape = {}\n'.format(test_df.shape[0]))

print('Index of Train set:\n', sales_train_df.columns)
print(sales_train_df.info())
print('\nIndex of Test set:\n', test_df.columns)


# isnull() and isna() are aliases, so a single count of missing values is enough
print('\nMissing values of Train set:\n', sales_train_df.isnull().sum())

The sales_train and test samples are organized differently: sales_train records daily sales per shop_id and item_id, while each test row has an ID identifying one shop_id and item_id pair. The daily records therefore have to be aggregated into monthly sales per pair, as the challenge requires.

In [None]:
# sample train set
sales_train_df.sample(10)

In [None]:
# sample test set
test_df.head()

Downcast the data types to reduce memory consumption

In [None]:
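def downcast_dtypes(df):
    # float64 -> float32 and int64/int32 -> int16 to cut memory use
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    # int16 holds -32768..32767, so only use this on frames whose integer columns fit that range
    df[int_cols] = df[int_cols].astype(np.int16)
    return df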

sales_train_df = downcast_dtypes(sales_train_df)
print(sales_train_df.info())
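
The fixed int16 cast is only safe while every integer column fits the int16 range. As a hedged alternative sketch, `pd.to_numeric` with its `downcast` argument lets pandas choose the smallest dtype that still holds the values. The helper name `downcast_safely` and the toy frame are assumptions for illustration, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def downcast_safely(df):
    """Hypothetical variant of downcast_dtypes: pandas picks the smallest dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# toy data (made up): one column fits a small integer type, one does not
toy = pd.DataFrame({
    "small_id": np.array([1, 2, 3], dtype="int64"),
    "big_id": np.array([10, 40_000, 200_000], dtype="int64"),
    "price": np.array([9.5, 120.0, 0.25], dtype="float64"),
})
compact = downcast_safely(toy)
print(compact.dtypes)
```

The values of `big_id` exceed the int16 range, so this variant keeps them in a wider integer type instead of forcing int16.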

In [None]:
# converting date object to datetime format
sales_train_df['date'] = pd.to_datetime(sales_train_df['date'],format = '%d.%m.%Y')
print('Min date from train set: %s' % sales_train_df['date'].min().date())
print('Max date from train set: %s' % sales_train_df['date'].max().date())

In [None]:
# print min and max num assigned to the months
print('Min date_block_num from train set: %s' % sales_train_df['date_block_num'].min())
print('Max date_block_num from train set: %s' % sales_train_df['date_block_num'].max())

The train data runs from January 2013 (date_block_num 0) to October 2015 (date_block_num 33). We use the sales records of these 34 months to predict the 35th month, November 2015 (date_block_num 34).
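
The month numbering can be checked with a small worked example. This sketch assumes date_block_num simply counts months elapsed since January 2013, which is consistent with the ranges stated above; `to_date_block_num` is a hypothetical helper, not a column transformation used in this kernel.

```python
import pandas as pd

def to_date_block_num(date, start_year=2013):
    """Months elapsed since January of start_year (assumed numbering rule)."""
    ts = pd.Timestamp(date)
    return (ts.year - start_year) * 12 + (ts.month - 1)

first_block = to_date_block_num("2013-01-15")   # first training month
last_block = to_date_block_num("2015-10-31")    # last training month
target_block = to_date_block_num("2015-11-01")  # month to predict
print(first_block, last_block, target_block)
```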

Plotting the company's total sales per month indicates seasonal peaks with a decreasing trend

In [None]:
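# total items sold per month across all shops and items
ts = sales_train_df.groupby(["date_block_num"])["item_cnt_day"].sum()
# astype returns a new Series, so assign the result back
ts = ts.astype('float')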
plt.figure(figsize=(15,8))
plt.title('Total Sales')
plt.xlabel('Time')
plt.ylabel('Sales')
plt.plot(ts);
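
One way to separate the trend from the seasonal peaks is a 12-month rolling mean, since every 12-month window contains exactly one yearly peak. The sketch below uses a synthetic series (a made-up linear decline with a spike every twelfth month), not the competition data, so the numbers only illustrate the technique.

```python
import numpy as np
import pandas as pd

months = np.arange(36)
# synthetic monthly totals: steady decline plus a spike in every 12th month
toy_sales = pd.Series(100.0 - months + 50.0 * (months % 12 == 11), index=months)

# each 12-month window averages over one full season, leaving the trend
trend = toy_sales.rolling(window=12).mean()
print(trend.dropna().head())
```

On the real series the same call would be `ts.rolling(window=12).mean()`, plotted alongside `ts`.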

#### Outliers
Identify outliers by item_cnt_day and item_price

In [None]:
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sb.boxplot(x=sales_train_df['item_cnt_day'])
print('Sale volume outliers:',sales_train_df['item_id'][sales_train_df['item_cnt_day']>900].unique())

plt.figure(figsize=(10,4))
plt.xlim(sales_train_df['item_price'].min(), sales_train_df['item_price'].max())
sb.boxplot(x=sales_train_df['item_price'])
print('Item price outliers:',sales_train_df['item_id'][sales_train_df['item_price']>300000].unique())

In [None]:
# remove outliers
sales_train_df = sales_train_df[(sales_train_df.item_price < 300000 ) & (sales_train_df.item_cnt_day < 900)]

#### Aggregation
Pivot sales_train_df so that each row is a shop_id and item_id pair and each column holds the total sales for one date_block_num, a layout that can be merged with test_df

In [None]:
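# create a pivot table from sales_train_df: one row per (shop_id, item_id), one column per date_block_num
# a scalar `values` keeps the column index single-level, so the later merge with test_df does not mix column levels
train_data = sales_train_df.pivot_table(index = ['shop_id','item_id'],values = 'item_cnt_day',columns = 'date_block_num',fill_value = 0,aggfunc='sum')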
# reset indices for easy manipulation
train_data.reset_index(inplace = True)
train_data.head()

In [None]:
# left-merge the monthly sales onto test_df so every test pair keeps a row; pairs absent from training get NaN
all_data = pd.merge(test_df,train_data,on = ['item_id','shop_id'],how = 'left')

In [None]:
# fill all NaN values with 0
all_data.fillna(0,inplace = True)
# display
all_data.head()
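
The aggregation, merge, and fill steps can be traced on a toy example. All values below are made up; the point is that a test pair never seen in training ends up as a row of zeros after the left merge and `fillna(0)`.

```python
import pandas as pd

# toy daily sales (made-up values)
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id":        [5, 5, 5, 6],
    "item_id":        [10, 10, 10, 11],
    "item_cnt_day":   [1, 2, 4, 3],
})
# toy test set: pair (7, 12) never appears in toy_sales
toy_test = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 7], "item_id": [10, 12]})

# daily rows -> one row per pair, one column per month
monthly = toy_sales.pivot_table(
    index=["shop_id", "item_id"],
    columns="date_block_num",
    values="item_cnt_day",
    aggfunc="sum",
    fill_value=0,
).reset_index()

# keep every test pair; unseen pairs become NaN, then 0
merged = toy_test.merge(monthly, on=["shop_id", "item_id"], how="left").fillna(0)
print(merged)
```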

Drop the ID, shop_id, and item_id columns, which are not observations over time. Retain only the monthly sales columns, which can be used to create time series lag features, i.e. values from previous time steps that help predict the outcome at a future time.

In [None]:
all_data.drop(['ID','shop_id','item_id'],inplace = True, axis = 1)
all_data.head()

##  Data Preparation
Transform the time series data to fit a supervised learning model: the model must learn from a series of past observations in order to predict the next value in the sequence. The challenge is treated as a multivariate time series with multiple parallel input series, where the output depends on the input series. The sequence of observations is transformed into multiple subsamples, with rows as time steps and one series per column, to form the input/output pairs suitable for processing.

Split function adapted from the book [deep-learning-for-time-series-forecasting](https://machinelearningmastery.com/deep-learning-for-time-series-forecasting/)

In [None]:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
    # use the np alias imported at the top of the notebook
    return np.array(X), np.array(y)
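
A small worked example may make the windowing concrete. The function is repeated here so the snippet runs on its own, and the toy array is made up: its first two columns act as inputs and its last column as the output.

```python
import numpy as np

def split_sequences(sequences, n_steps):
    X, y = [], []
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        # inputs: n_steps rows, all columns but the last
        X.append(sequences[i:end_ix, :-1])
        # output: last column of the final row in the window
        y.append(sequences[end_ix - 1, -1])
    return np.array(X), np.array(y)

toy = np.array([
    [10, 15, 25],
    [20, 25, 45],
    [30, 35, 65],
    [40, 45, 85],
    [50, 55, 105],
])
X_toy, y_toy = split_sequences(toy, n_steps=3)
print(X_toy.shape, y_toy)
```

Five rows with 3 time steps give 3 windows, and each target comes from the last row of its window.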

Creating subsequences with 3 time steps to predict 1 output loses the first 2 samples, because there are no previous samples from which to build lag features for them. To keep these 2 samples for prediction, 2 dummy samples filled with zeros are added on top of the dataset.

In [None]:
# create dummy samples
dummy_samples = [[0] * 34] * 2
# rename all_data columns to 0..33 to match the dummy_samples dataframe
all_data.columns = list(range(34))
# create dummy_samples dataframe and concatenate with all_data
all_data = pd.concat([pd.DataFrame(dummy_samples), all_data], ignore_index=True)
# print shape of all_data
all_data.shape
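
The effect of the padding on the number of windows can be checked with simple arithmetic. This is a sketch on a made-up 5-row frame standing in for all_data; a sliding window of n_steps rows yields len - n_steps + 1 windows.

```python
import numpy as np
import pandas as pd

n_steps = 3
# toy stand-in for all_data: 5 shop/item rows, 4 monthly columns
toy = pd.DataFrame(np.arange(1, 21).reshape(5, 4))

# n_steps - 1 zero rows on top, mirroring the 2 dummy samples in the notebook
padding = pd.DataFrame(0, index=range(n_steps - 1), columns=toy.columns)
padded = pd.concat([padding, toy], ignore_index=True)

# a sliding window of n_steps rows yields len - n_steps + 1 windows
windows_without_padding = len(toy) - n_steps + 1
windows_with_padding = len(padded) - n_steps + 1
print(windows_without_padding, windows_with_padding)
```

With the padding, window i ends on original row i, so there is exactly one window per original row.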

#### Train, Validate and Test 

Preserving the temporal structure of a time series forecasting problem is very important when splitting the data into train, validation and test sets. date_block_num 0-31 will be used for training, 1-32 for validation and 2-33 for testing; within each split, the last month is the target and the earlier months are the inputs.

In [None]:
# keep all columns except the last two
train_data = np.expand_dims(all_data.values[:,:-2],axis = 2)
# keep all columns except the first and last
validation_data = np.expand_dims(all_data.values[:,1:-1],axis = 2)
# keep all columns except the first two
test_data = np.expand_dims(all_data.values[:,2:],axis = 2)
# print the shapes 
print(train_data.shape, validation_data.shape, test_data.shape)
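
The three slices are the same window shifted by one month each time. A toy array with 6 monthly columns (made-up values, standing in for the 34 columns of all_data) shows which months land in each split.

```python
import numpy as np

# 2 series, months 0..5; row 0 holds the month index itself for readability
toy = np.arange(12).reshape(2, 6)

toy_train = toy[:, :-2]   # months 0..3
toy_val = toy[:, 1:-1]    # months 1..4
toy_test = toy[:, 2:]     # months 2..5

print(toy_train[0], toy_val[0], toy_test[0])
```

All three splits have the same width, and the last column of each split is one month later than in the previous split.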

Creating subsamples for training and testing

In [None]:
from numpy import array
# choose a number of time steps
n_steps = 3
# convert into input/output
X_train, y = split_sequences(train_data, n_steps)
X_val, y_val = split_sequences(validation_data, n_steps)
X_test, y_test = split_sequences(test_data, n_steps)
print(X_train.shape, y.shape)

Reshape input to the 3-dimensional structure [samples, timesteps, features]

In [None]:
# drop the trailing singleton axis so each input is [samples, timesteps, features]
X = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2]))
X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], X_val.shape[2]))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], X_test.shape[2]))
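
It can help to trace the array shapes through the whole preparation step. The sketch below repeats split_sequences so it runs on its own and uses a made-up 7 x 6 array in place of all_data; only the shapes matter here.

```python
import numpy as np

def split_sequences(sequences, n_steps):
    X, y = [], []
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        X.append(sequences[i:end_ix, :-1])
        y.append(sequences[end_ix - 1, -1])
    return np.array(X), np.array(y)

toy_all = np.arange(42, dtype=float).reshape(7, 6)      # 7 rows, 6 monthly columns
toy_train = np.expand_dims(toy_all[:, :-2], axis=2)     # (7, 4, 1)
X4, y2 = split_sequences(toy_train, n_steps=3)          # (5, 3, 3, 1) and (5, 1)
X3 = X4.reshape((X4.shape[0], X4.shape[1], X4.shape[2]))  # (5, 3, 3)

print(toy_train.shape, X4.shape, y2.shape, X3.shape)
```

The reshape only removes the trailing axis of length 1 added by `np.expand_dims`, so it is equivalent to `X4.squeeze(axis=-1)`.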

## LSTM
Define and run LSTM model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
import tensorflow.keras.optimizers as optimizers
# number of features
n_features = X.shape[2]
# define model
model = Sequential()
model.add(LSTM(64, activation= 'relu', dropout=0.2, recurrent_dropout=0.2, return_sequences=True,  input_shape=(n_steps, n_features)))
model.add(LSTM(64, activation= 'relu', dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1))
# `lr` is a deprecated alias; `learning_rate` is the supported argument name
model.compile(optimizer=optimizers.Adam(learning_rate=.0001), loss= 'mse', metrics = ['mean_squared_error'])
model.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
callbacks = [
    EarlyStopping(patience=5, verbose=1),
    ReduceLROnPlateau(factor=0.25, patience=2, min_lr=0.000001, verbose=1),
    ModelCheckpoint('model.h5', verbose=1, save_best_only=True, save_weights_only=True)
]
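
The ReduceLROnPlateau settings bound how far the learning rate can fall. A minimal sketch of the arithmetic, assuming the optimizer starts at 0.0001 and that each reduction applies new_lr = max(lr * factor, min_lr); `simulate_lr_reductions` is a hypothetical helper for illustration, not a Keras function.

```python
def simulate_lr_reductions(initial_lr, factor, min_lr, n_reductions):
    """Learning rate after each successive plateau-triggered reduction."""
    lrs = [initial_lr]
    for _ in range(n_reductions):
        # shrink by `factor`, but never go below the floor
        lrs.append(max(lrs[-1] * factor, min_lr))
    return lrs

schedule = simulate_lr_reductions(initial_lr=1e-4, factor=0.25, min_lr=1e-6, n_reductions=4)
print(schedule)
```

Under these assumptions the rate reaches the floor on the fourth reduction. Since a reduction needs 2 epochs without improvement and early stopping triggers after 5, a 15-epoch run is unlikely to see more than a few reductions.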

In [None]:
# fit model
model.fit(X, y, epochs=15, callbacks=callbacks, validation_data=(X_val, y_val))

In [None]:
# create submission file 
submit = model.predict(X_test)
# clip between 0 and 20
submit = submit.clip(0,20)
# creating dataframe with required columns 
submission = pd.DataFrame({'ID':test_df['ID'],'item_cnt_month':submit.ravel()})
# creating csv file from dataframe
submission.to_csv('submit2a.csv',index = False)
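
Clipping matters when the true targets themselves lie in [0, 20], which the clip range suggests but the notebook does not state. The sketch below uses made-up predictions and targets, and uses root mean squared error as the score, which is an assumption consistent with the mse training loss; `rmse` is a hypothetical helper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up raw model outputs and targets already inside [0, 20]
raw_pred = np.array([-1.5, 3.0, 25.0, 0.5])
truth = np.array([0.0, 2.0, 20.0, 1.0])

clipped = raw_pred.clip(0, 20)
score_raw = rmse(truth, raw_pred)
score_clipped = rmse(truth, clipped)
print(score_raw, score_clipped)
```

Out-of-range predictions can only move closer to in-range targets when clipped, so the clipped score is never worse under this assumption.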


In [None]:
submission.head()