# **Store Sales Forecasting with RNNs** 📈📉
# Part 2 - Building Our ML Model

## Introduction ✏️

Time series forecasting is one of the most important tasks in business. It is also a hard one: the future can never be predicted with certainty, but ML models can still give useful estimates. Recurrent neural networks (RNNs) are well suited to time series data because they carry a hidden state from one time step to the next, which gives them a memory of earlier observations.
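
To make the idea of a memory state concrete, here is a minimal sketch of a recurrent cell with a single scalar hidden state. The weights `w_x`, `w_h` and `b` are hypothetical values picked for illustration, not weights learned from the sales data, and the Keras layer used later works with vectors rather than scalars.

```python
import math

def simple_rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent cell: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, h0=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = h0
    states = []
    for x_t in sequence:
        h = simple_rnn_step(x_t, h)
        states.append(h)
    return states

# Two sequences that end with the same input but have different histories
states_a = run_rnn([1.0, 1.0, 0.0])
states_b = run_rnn([-1.0, -1.0, 0.0])
print(states_a[-1], states_b[-1])
```

Both sequences end with the input `0.0`, yet the final hidden states differ in sign because each state depends on everything that came before it.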

To apply this concept, we will use the [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) dataset to predict the sales of a store over the next two weeks. We will read, manipulate and visualize the data, and then build a model to predict the sales.

In the first notebook, we analyzed the data and engineered the features. In this one, we will build the model and apply it. Let's get started!

## Dependencies 👪

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import os

## Reading Data 📖

In [13]:
dtypes = {
    'store_nbr': 'uint8',
    'family': 'object',
    'sales': 'float16',
    'date': 'object',
    'onpromotion': 'uint8',
    'city': 'object',
    'type_of_store': 'object',
    'cluster': 'uint8',
    'dcoilwtico': 'float32',
    'transactions': 'uint32',
    'n_holidays': 'float16',
}
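# `dtypes` is kept for reference and is not passed to read_csv: its integer
# dtypes cannot represent missing values, which columns such as
# `transactions` still contain at this point.
train_data = pd.read_csv('../data/train_data_cleaned.csv')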
train_data.head()

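| Unnamed: 0 | date | store_nbr | family | sales | onpromotion | city | type_of_store | cluster | dcoilwtico | transactions | n_holidays |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | Others | 0.0 | 0 | Quito | D | 13 | 93.14 | NaN | 1.0 |
| 1 | 2013-01-01 | 1 | Others | 0.0 | 0 | Quito | D | 13 | 93.14 | NaN | 1.0 |
| 2 | 2013-01-01 | 1 | Others | 0.0 | 0 | Quito | D | 13 | 93.14 | NaN | 1.0 |
| 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 | Quito | D | 13 | 93.14 | NaN | 1.0 |
| 4 | 2013-01-01 | 1 | Others | 0.0 | 0 | Quito | D | 13 | 93.14 | NaN | 1.0 |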


## Data Manipulation 📝
### Replacing Missing Values

In [5]:
train_data.isnull().sum()

date             0
store_nbr        0
family           0
sales            0
onpromotion      0
city             0
type_of_store    0
cluster          0
dcoilwtico       0
transactions     0
n_holidays       0
dtype: int64

In [6]:
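# Forward-fill missing oil prices with the last known value
train_data['dcoilwtico'] = train_data['dcoilwtico'].ffill()
# Treat missing transaction counts as zero
train_data['transactions'] = train_data['transactions'].fillna(0)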

### Correlation Matrix

In [7]:
# Restrict to numeric columns; recent pandas versions raise on object columns
corr = train_data.corr(numeric_only=True)
fig = px.imshow(corr)
fig.update_layout(title='Correlation Matrix')

## Preprocessing Data 📊

In [8]:
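TRAIN_SIZE = 2400710  # rows [0, TRAIN_SIZE) train the model; the rest validate it
X_train, y_train = train_data.drop(columns="sales").iloc[:TRAIN_SIZE].copy(), train_data.sales.iloc[:TRAIN_SIZE]
X_valid, y_valid = train_data.drop(columns="sales").iloc[TRAIN_SIZE:].copy(), train_data.sales.iloc[TRAIN_SIZE:]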

In [9]:
# The raw date string is not fed to the model
X_train = X_train.drop(columns=["date"])
X_valid = X_valid.drop(columns=["date"])

### Scaling and One-Hot Encoding

In [8]:
obj_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
num_cols = [col for col in X_train.columns if X_train[col].dtype != 'object']

In [9]:
import gc

# The correlation matrix is no longer needed; free its memory
del corr
gc.collect()

64

In [57]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

cat_pipe = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

prep_pipe = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, obj_cols),
])

X_train_prep = prep_pipe.fit_transform(X_train)
X_valid_prep = prep_pipe.transform(X_valid)
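
The pipeline is fitted on the training split only and then reused on the validation split, so the validation rows are scaled with the training statistics. Here is a minimal sketch of that idea for a single numeric column, using only the standard library. The oil prices are made-up numbers, and `fit_standardizer` and `standardize` are hypothetical helpers, not part of scikit-learn.

```python
import statistics

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Fall back to 1.0 for a constant column to avoid dividing by zero
    return mean, std or 1.0

def standardize(values, mean, std):
    """Scale values with statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]

train_oil = [90.0, 100.0, 110.0]   # made-up training values
valid_oil = [120.0]                # made-up validation value

mean, std = fit_standardizer(train_oil)              # statistics come from train only
train_scaled = standardize(train_oil, mean, std)
valid_scaled = standardize(valid_oil, mean, std)     # reuses the train statistics
print(train_scaled, valid_scaled)
```

The scaled training column has zero mean and unit variance, while the validation value can fall outside the training range because it played no part in the fit.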

In [62]:
X_train_prep.shape[0]

2400710

In [24]:
# Average number of rows per date, rounded up: one window spans about one day of rows
WINDOW_SIZE = np.ceil(len(train_data) / train_data.date.nunique()).astype(int)

In [61]:
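from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# prep_pipe returns a SciPy sparse matrix, and TimeseriesGenerator calls len()
# on its input, which sparse matrices do not support. Passing getnnz() would
# hand over a single integer instead of the data, so slice out the scaled
# numeric block (the first len(num_cols) columns) as a dense array. This
# matches the input_shape the model expects.
X_train_num = X_train_prep[:, :len(num_cols)].toarray()

data_gen = TimeseriesGenerator(X_train_num, y_train.to_numpy(),
                               length=WINDOW_SIZE,
                               batch_size=32)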

In [51]:
print("Shape of each timeseries")
for i in range(10):
    X, y = data_gen[i]
    print(X.shape, y.shape)

Shape of each timeseries
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
(32, 1782, 6) (32,)
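
Each sample printed above is a window of `WINDOW_SIZE` consecutive rows paired with the sales value of the row that follows it. Here is a minimal pure-Python sketch of that pairing, assuming the default sampling rate and stride of 1. `make_windows` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def make_windows(features, targets, length):
    """Pair each window of `length` consecutive rows with the target that follows it."""
    samples = []
    for row in range(length, len(features)):
        window = features[row - length:row]      # rows row-length .. row-1
        samples.append((window, targets[row]))   # target taken at row `row`
    return samples

# Toy data: 5 rows with one feature each
features = [[0], [1], [2], [3], [4]]
targets = [0, 10, 20, 30, 40]

windows = make_windows(features, targets, length=2)
for window, target in windows:
    print(window, target)
```

With 5 rows and a window length of 2 there are 3 samples, since the first 2 rows can only serve as history.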


In [52]:
len(data_gen)

74967
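
The number of batches follows from the number of rows, the window length and the batch size. The sketch below reproduces the count, assuming it mirrors how `TimeseriesGenerator` computes its length with the default stride of 1. `n_batches` is a hypothetical helper.

```python
def n_batches(n_rows, length, batch_size):
    """Batches per epoch for a sliding window with stride 1."""
    start_index = length       # the first target needs `length` rows of history
    end_index = n_rows - 1     # the last row that can serve as a target
    # The final batch may be smaller than batch_size but still counts
    return (end_index - start_index + batch_size) // batch_size

print(n_batches(2_400_710, 1782, 32))  # 74967
```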

In [54]:
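model = keras.models.Sequential([
    keras.layers.SimpleRNN(32, input_shape=[WINDOW_SIZE, len(num_cols)]),
    # Single linear output unit so the network predicts one sales value per
    # window; the tanh outputs of the RNN layer alone are limited to [-1, 1]
    keras.layers.Dense(1),
])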
optimizer = keras.optimizers.Adam(learning_rate=0.005)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(data_gen, epochs=1)

   19/74967 [..............................] - ETA: 3:03:08 - loss: 1413769.0000

KeyboardInterrupt: 

## Predicting Sales 🤔

In [None]:
test_data = pd.read_csv('../data/test_data_cleaned.csv')
# Apply the same missing-value handling as for the training data
test_data['dcoilwtico'] = test_data['dcoilwtico'].ffill()
test_data['transactions'] = test_data['transactions'].fillna(0)