### Introduction

The goal of this notebook was to use LSTMs for time series forecasting, as they are good at modelling long-range dependencies. I wasn't able to use the technique properly because of the categorical variables.

I have one-hot encoded the categorical variables. The result is closer to regression on sparse data than to true sequence modelling, and I wasn't able to fully train the model due to lack of time.
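
To make the sparsity point concrete, here is a minimal sketch on a made-up frame. The column names mirror the notebook, but the values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Manufacturer": ["M1", "M2", "M3", "M1"],
    "AreaCode": ["A1", "A2", "A3", "A4"],
})

# Same encoding style as the notebook: one column per level, first level dropped
encoded = pd.get_dummies(toy, columns=["Manufacturer", "AreaCode"], drop_first=True)

# Fraction of entries that are zero after encoding
sparsity = float((encoded.to_numpy(dtype=float) == 0).mean())
print(encoded.shape, sparsity)
```

Each row has at most one non-zero entry per encoded variable, so the share of zeros grows with the number of levels.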

With more time, the goal would have been to try different architectures and tune the model. Another idea, which I read about online, is to feed the categorical variables to the model as an auxiliary input. Given more time, I am sure we could have done better.
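
The auxiliary-input idea could look roughly like the sketch below, which was not part of this notebook's experiments. It assumes the Keras functional API, a single categorical variable encoded as integer ids, and an `Embedding` layer for that variable (my choice, one of several options). All sizes are hypothetical placeholders.

```python
from keras.layers import Input, LSTM, Dense, Embedding, Flatten, Concatenate
from keras.models import Model

# All sizes below are hypothetical placeholders, not values from the dataset.
n_input = 32        # time steps per window
n_numeric = 2       # numeric features per time step, e.g. Year and Month
n_categories = 10   # distinct values of one categorical variable

# Main input: a window of numeric features processed by the LSTM
seq_in = Input(shape=(n_input, n_numeric), name="sequence")
seq_out = LSTM(32)(seq_in)

# Auxiliary input: one integer category id per window, mapped to a dense vector
cat_in = Input(shape=(1,), dtype="int32", name="category")
cat_vec = Flatten()(Embedding(input_dim=n_categories, output_dim=4)(cat_in))

# Merge both branches and regress the target
merged = Concatenate()([seq_out, cat_vec])
out = Dense(1)(merged)

model = Model(inputs=[seq_in, cat_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```

With this layout the LSTM only sees the time-varying features, and the category enters once per window instead of being repeated as sparse one-hot columns at every time step.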

In [None]:
import pandas as pd
import numpy as np

from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

### Load and Prepare Data

In [2]:
dataset_path = '../data/DS_ML Coding Challenge Dataset.xlsx'
train_dataset = pd.read_excel(dataset_path, sheet_name='Training Dataset')
test_dataset = pd.read_excel(dataset_path, sheet_name='Test Dataset')

### Feature Processing

In [3]:
def preprocess_data(dataset):
    '''
    Returns X and y after creating time features and one-hot encoding the categorical variables.
    Note: the input DataFrame is modified in place (columns are renamed and time features are added).
    '''
    
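    # Rename 'ProductType' to 'ProductName' first so that stripping spaces below does not
    # create a duplicate 'ProductType' column (assumes the sheet also has a 'Product Type' column)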
    dataset.rename(columns={'ProductType':'ProductName'}, inplace=True)
    dataset.columns = [column_name.replace(' ','') for column_name in dataset.columns]
    
    # Creating time features
    dataset['Year'] = pd.DatetimeIndex(dataset['MonthofSourcing']).year
    dataset['Month'] = pd.DatetimeIndex(dataset['MonthofSourcing']).month
    
    # Creating one-hot-encoding for categorical variables (the prefix defaults to the column name)
    categorical_columns = ['ProductName', 'Manufacturer', 'AreaCode',
                           'SourcingChannel', 'ProductSize', 'ProductType']
    dataset = pd.get_dummies(dataset, columns=categorical_columns, drop_first=True)
    
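    # Creating X and y; cast to float32 because get_dummies can return boolean columns,
    # which would otherwise give an object-typed array that Keras cannot consume
    X = dataset.drop(['MonthofSourcing','SourcingCost'], axis=1).values.astype(np.float32)
    y = dataset['SourcingCost'].values.astype(np.float32)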
    
    return X, y

In [4]:
X_train, y_train = preprocess_data(train_dataset)
X_test, y_test = preprocess_data(test_dataset)
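
Encoding the two sheets separately may give `X_train` and `X_test` different columns if a category appears in only one of them. One way to guard against this, sketched here on hypothetical toy frames, is to fix the category levels from the training data before calling `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frames; the test split is missing one category seen in training.
train = pd.DataFrame({"AreaCode": ["A1", "A2", "A3"], "SourcingCost": [10.0, 12.0, 11.0]})
test = pd.DataFrame({"AreaCode": ["A1", "A2"], "SourcingCost": [10.5, 12.5]})

# Fix the levels from the training data so both frames encode identically
categories = sorted(train["AreaCode"].unique())
for frame in (train, test):
    frame["AreaCode"] = pd.Categorical(frame["AreaCode"], categories=categories)

train_enc = pd.get_dummies(train, columns=["AreaCode"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["AreaCode"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Because the levels are shared, `drop_first=True` drops the same level in both frames, and a category missing from the test split becomes an all-zero column. A value seen only in the test split would become missing and encode as all zeros, so that case needs separate handling.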

### Model Training

In [None]:
# Generator for model training
n_input = 32  # number of past time steps in each input window
n_features = X_train.shape[1]  # 56 in the original run
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=1)

# Model
model = Sequential()
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
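
`TimeseriesGenerator` slides its window over consecutive rows of `X_train`. If those rows hold several series (one per product, manufacturer, area and so on), a window can span unrelated series. The sketch below builds the windows per group instead. The helper `make_windows` and the toy data are hypothetical, and it assumes the rows of each group are already sorted by time.

```python
import numpy as np
import pandas as pd

def make_windows(frame, key_cols, feature_cols, target_col, length):
    """Build (window, next-step target) pairs separately for each group."""
    windows, targets = [], []
    for _, group in frame.groupby(key_cols, sort=False):
        feats = group[feature_cols].to_numpy(dtype=np.float32)
        target = group[target_col].to_numpy(dtype=np.float32)
        # Same pairing as TimeseriesGenerator: rows i..i+length-1 predict row i+length
        for i in range(len(group) - length):
            windows.append(feats[i:i + length])
            targets.append(target[i + length])
    if not windows:
        # No group is long enough to yield a window
        empty_x = np.empty((0, length, len(feature_cols)), dtype=np.float32)
        return empty_x, np.empty((0,), dtype=np.float32)
    return np.stack(windows), np.array(targets, dtype=np.float32)

# Hypothetical toy data: two series of five monthly observations each
toy = pd.DataFrame({
    "ProductName": ["P1"] * 5 + ["P2"] * 5,
    "Month": list(range(1, 6)) * 2,
    "SourcingCost": [10, 11, 12, 13, 14, 50, 51, 52, 53, 54],
})

X_win, y_win = make_windows(toy, ["ProductName"], ["Month", "SourcingCost"],
                            "SourcingCost", length=3)
print(X_win.shape, y_win)
```

The resulting arrays have shape `(samples, length, features)` and `(samples,)`, so they could be passed directly to `model.fit(X_win, y_win)` in place of the generator.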

In [14]:
model.summary()

In [None]:
model.fit(generator, epochs=30)