# Predict future sales - EDA and LSTM prediction

**This analysis includes an exploratory analysis of the data, an LSTM model, a plot of training RMSE performance, and a look at the top 10 predictions of the model.**

Table of contents:
------------------
1. Read csv files
2. Add Russian to English translations
3. EDA - Explore data
4. Descriptive stats for item_cnt_day
5. Transform Dates
6. EDA - Time series trends
7. EDA - Explore data for items, categories, and shops
8. Drop outliers
9. Prepare the data for LSTM model
10. Model training plots
11. LSTM Model
12. Output predictions to csv
13. Top 10 predicted sales increases


## Read csv files

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  

In [None]:
# Read the csv files

sales = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")
items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")



## Add Russian to English translations

Three of the files in the dataset contain Russian text (shops, items, item_categories). English translations are added here for English speakers.
Kaggle user Orhan kindly made the Russian-to-English translated files available as a dataset:
https://www.kaggle.com/orhankaramancode/filestranslated

In [None]:
shops_t = pd.read_csv("../input/filestranslated/shops-translated.csv")
items_t = pd.read_csv("../input/filestranslated/items-translated.csv")
item_categories_t = pd.read_csv("../input/filestranslated/categories_translated.csv")

In [None]:
# items (items-translated.csv was missing the category ids - join to the original items.csv)

del items['item_name']
items = items.merge(items_t, how='left', on='item_id')
items.rename(columns={'item_name_translated':'item_name'}, inplace=True)

In [None]:
# item categories

del item_categories_t['Unnamed: 0']
item_categories = item_categories_t

In [None]:
#shops
shops_t.rename(columns={'shop_name_translated':'shop_name'}, inplace=True)
shops = shops_t

## EDA - Explore data

Preliminary EDA with descriptive statistics

In [None]:
# Sales:

print('rows and columns:', sales.shape, '\n')
print(sales.info(), '\n')
print(sales.count(), '\n')
print('missing values:\n', sales.isna().sum())

In [None]:
sales.sort_values(by=['date','shop_id','item_id']).head(5)

### Descriptive stats for item_cnt_day

Since the task is to predict the monthly count for each item-shop combination, it is worth looking at the statistics for the item_cnt_day column.

* The max count is much higher than the mean, indicating large outliers
* The min count is -22, indicating item returns
* The 25th and 75th percentiles of item_cnt_day are both 1.0, indicating that most records are sales of a single item

In [None]:
print('\nitem_cnt_day descriptive statistics')
print( sales.item_cnt_day.describe().apply(lambda x: format(x, '10.1f')) )

In [None]:
# The item_count series shows some purchases with huge counts

sales.item_cnt_day.plot(figsize=(15, 6))
plt.show()

In [None]:
# there are lots of outliers with large quantities


sns.boxplot(y="item_cnt_day", data=sales)
plt.title('Item Count')
plt.show()

In [None]:
# Generate item_cnt_day outliers plot

plt.figure(figsize=(16,5))
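# distplot is deprecated in recent seaborn releases; histplot + rugplot draw the same histogram and rug
sns.histplot(sales['item_cnt_day'], kde=False)
sns.rugplot(sales['item_cnt_day'])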
plt.xlim(-20, 500)
plt.show()

In [None]:
# how many zero or negative item_cnt_days?

print( 'zero item_cnt_days:', len ( sales.loc[sales['item_cnt_day']==0] ) )
print( 'negative item_cnt_days:', len ( sales.loc[sales['item_cnt_day']<0] ) )

## Transform Dates
* convert date field to datetime
* add some additional date features 

In [None]:
# convert the date field to pandas datetime

# (assign the whole column so its dtype becomes datetime64 and the .dt accessor works below)
sales['date'] = pd.to_datetime(sales['date'], format='%d.%m.%Y')

In [None]:
# add year, month, and day columns

sales['year'] = sales['date'].dt.year
sales['month'] = sales['date'].dt.month
sales['day'] = sales['date'].dt.day

In [None]:
# create a year-month field

sales['year_month'] = sales['date'].map(lambda x: 100*x.year + x.month)

In [None]:
# create a weekday field (0=Mon, 6=Sun)

sales['day_of_week'] = sales['date'].map(lambda x: x.weekday())
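
The year-month and weekday features above are built with row-wise lambdas. As a minimal sketch, the same values can be derived with the vectorized `.dt` accessor, which is usually faster on a table of this size. The `toy` frame and its dates are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical rows) with dates in the same 'dd.mm.YYYY' text format
toy = pd.DataFrame({'date': ['02.01.2013', '05.01.2013', '15.10.2015']})
toy['date'] = pd.to_datetime(toy['date'], format='%d.%m.%Y')

# Vectorized equivalents of the lambda-based features
toy['year_month'] = toy['date'].dt.year * 100 + toy['date'].dt.month
toy['day_of_week'] = toy['date'].dt.dayofweek  # 0=Mon, 6=Sun

print(toy)
```

Both approaches should produce the same columns; the lambda version is kept in the notebook as written.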

In [None]:
# Sort values in dataframe, order the columns, reset the index

sales = sales.sort_values(by=['date','shop_id','item_id'], ascending=[True,True,True])

sales_dates = ['date','date_block_num','year_month','year','month','day','day_of_week']
sales_data = ['shop_id','item_id','item_price','item_cnt_day']
sales = sales[sales_dates + sales_data]

sales.reset_index(drop=True,inplace=True)

In [None]:
sales.head()

## EDA - Time series trends

In [None]:
daily_count_sum = sales.groupby('date')['item_cnt_day'].sum()

In [None]:
# Overall time series trend - downward sales volume 

daily_count_sum.plot(figsize=(15, 6))
plt.title('Daily sales counts (sum of item count per day)')
plt.show()

In [None]:
# Yearly trend shows certain months like January and December have higher volumes

# Monthly sums

count_by_month = pd.DataFrame(sales.groupby('month')['item_cnt_day'].sum() )
count_by_month.reset_index(inplace=True)


# (January is 1)

count_by_month_jd = count_by_month.loc[ count_by_month['month'].isin([1,12]) ]
count_by_month_other = count_by_month.loc[ ~ count_by_month['month'].isin([1,12]) ]



# Graph the data

objects = ('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
objects_other = ('Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov')
objects_jd = ('Jan', 'Dec')

x_pos = np.arange(len(objects))
x_pos_other = np.array([1,2,3,4,5,6,7,8,9,10])
x_pos_jd = np.array([0,11])

plt.figure(figsize=(5,5))
width=0.5

rects_other = plt.bar(x_pos_other, count_by_month_other.item_cnt_day, width, color='b')
rects_jd = plt.bar(x_pos_jd, count_by_month_jd.item_cnt_day, width, color='r')

plt.xticks(x_pos, objects, rotation=90)
plt.title('More Items are Sold in January and December')
plt.ylabel('Count of Items Sold')
plt.xlabel(None)
plt.show() 

In [None]:
# Weekly time series trend - peaks and troughs can be observed for every week

daily_count_sum['2013-01-01':'2013-07-01'].plot(figsize=(15, 6)) # weekly peaks/troughs can be observed
plt.title('Daily sales counts (sum of item count per day) 1/2013-6/2013')
plt.show()

In [None]:
# Weekly cycles occur because weekend sales are higher

count_by_day = pd.DataFrame( sales.groupby('day_of_week')['item_cnt_day'].sum() )
count_by_day.reset_index(inplace=True)

# weekdays, weekends

count_by_day_we = count_by_day.loc[ count_by_day['day_of_week']>=5 ]
count_by_day_wd = count_by_day.loc[ count_by_day['day_of_week']< 5 ]

# Graph

objects = ('Mon','Tues','Wed','Thu','Fri','Sat','Sun')
objects_wd = ('Mon','Tues','Wed','Thu','Fri')
objects_we = ('Sat','Sun')

x_pos = np.arange(len(objects))
x_pos_wd = np.arange(len(objects_wd))
x_pos_we =(5,6)

plt.figure(figsize=(5,5))
width=0.5

rects_wd = plt.bar(x_pos_wd, count_by_day_wd.item_cnt_day, width, color='b')
rects_we = plt.bar(x_pos_we, count_by_day_we.item_cnt_day, width, color='r')

plt.xticks(x_pos, objects, rotation=90)
plt.title('More Items are Sold on Saturday and Sunday')
plt.ylabel('Count of Items Sold')
plt.xlabel(None)
plt.show() 
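
The bars above are sums over the whole period. The data may not contain exactly the same number of each weekday, so a per-day average is a possible alternative view. The sketch below assumes a small made-up `toy` frame; it first totals the counts per calendar date and then averages those totals by weekday.

```python
import pandas as pd

# Hypothetical rows: two sales on one Saturday, one on a Monday, one on the next Saturday
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-05', '2013-01-05', '2013-01-07', '2013-01-12']),
    'item_cnt_day': [3, 1, 2, 6],
})

# Total units sold on each calendar date
daily_total = toy.groupby('date')['item_cnt_day'].sum()

# Average of the daily totals for each weekday (0=Mon, 6=Sun)
avg_by_weekday = daily_total.groupby(daily_total.index.dayofweek).mean()

print(avg_by_weekday)
```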

## EDA - Explore data for items, categories, and shops

In [None]:
# merge sales data with descriptive data for items, shops and categories

sales = pd.merge(sales,items,how='left',on='item_id', copy=False)
sales = sales.merge(item_categories,how='left',on='item_category_id', copy=False)
sales = sales.merge(shops,how='left',on='shop_id', copy=False)

In [None]:
sales.head()

In [None]:
#The item which sold the most units:
sales.loc[sales['item_cnt_day']==sales['item_cnt_day'].max()]

In [None]:
item_category_count_sums = pd.DataFrame( sales.groupby(['item_category_name'])['item_cnt_day'].sum() )
item_category_count_sums = item_category_count_sums.reset_index() 

In [None]:
iccs_sort = item_category_count_sums.sort_values(by='item_cnt_day', ascending=False)
iccs_sort.reset_index(inplace=True, drop=True)

In [None]:
plt.figure(figsize=(16,10))
sns.barplot(x='item_category_name', y='item_cnt_day', data=item_category_count_sums, order=iccs_sort.item_category_name )
plt.xticks(rotation=90)
plt.xlabel(None)
plt.ylabel('total units sold')
plt.title('Units sold by item category')
plt.show()

In [None]:
shop_count_sums = pd.DataFrame( sales.groupby(['shop_name'])['item_cnt_day'].sum() )
shop_count_sums = shop_count_sums.reset_index() 

In [None]:
scs_sort = shop_count_sums.sort_values(by='item_cnt_day', ascending=False)
scs_sort.reset_index(inplace=True,drop=True)

In [None]:
plt.figure(figsize=(16,10))
sns.barplot(x='shop_name', y='item_cnt_day', data=shop_count_sums, order=scs_sort.shop_name)
plt.xticks(rotation=90)
plt.xlabel(None)
plt.ylabel('Sales count')
plt.title('Total sales count for each shop')
plt.show()

## Drop outliers 

In [None]:
# Drop any records with item_cnt_day above a threshold (25 here); the threshold can be changed before fitting the model.

print(sales.shape)
sales = sales.loc[ sales['item_cnt_day'] <= 25 ]
print(sales.shape)
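
To make the threshold easier to vary between runs, the filter can be wrapped in a small helper. This is only a sketch: `drop_count_outliers` and `max_count` are hypothetical names, and the toy values are made up. Note that a `<=` filter keeps negative counts (returns).

```python
import pandas as pd

def drop_count_outliers(df, max_count=25):
    """Keep rows whose item_cnt_day is at most max_count (negative counts are kept)."""
    return df.loc[df['item_cnt_day'] <= max_count]

# Hypothetical counts: a normal sale, a return, a borderline row, and two outliers
toy = pd.DataFrame({'item_cnt_day': [1, -1, 25, 26, 1000]})
kept = drop_count_outliers(toy)

print(len(toy), '->', len(kept))
```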

## Prepare the data for LSTM model

In [None]:
# Pivot the table to wide format
# rows = shop_id + item_id
# columns = date_block_num
# values = sum(item_cnt_day)
# (values is passed as a string so the columns stay single-level for the merge with test)

sales_monthly = sales.pivot_table(index = ['shop_id','item_id']
                                  ,values = 'item_cnt_day'
                                  ,columns = ['date_block_num']
                                  ,fill_value = 0
                                  ,aggfunc='sum')

In [None]:
sales_monthly.reset_index(inplace = True)
sales_monthly.head()

In [None]:
# Left join merge the test data with the training data on item_id and shop_id
# This keeps all shop+item combinations that are required by the test set 
# and drops those from the training set that are not.

sales_monthly = pd.merge(test,sales_monthly,on = ['shop_id', 'item_id'],how = 'left')

In [None]:
#filling NaN with zeroes
sales_monthly.fillna(0,inplace = True)
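
The pivot, left join, and zero-fill steps above can be checked on a tiny example. The frames below are made up for illustration: one shop-item pair appears in both tables, one appears only in the training data, and one appears only in the test set.

```python
import pandas as pd

# Hypothetical daily sales: pair (0, 10) sells in months 0 and 1, pair (1, 11) only in month 1
toy_sales = pd.DataFrame({
    'shop_id':        [0, 0, 0, 1],
    'item_id':        [10, 10, 10, 11],
    'date_block_num': [0, 0, 1, 1],
    'item_cnt_day':   [1, 2, 4, 5],
})

# Hypothetical test set: pair (0, 10) has history, pair (2, 10) has none
toy_test = pd.DataFrame({'ID': [0, 1], 'shop_id': [0, 2], 'item_id': [10, 10]})

# Wide table: one row per shop-item pair, one column per month
wide = toy_sales.pivot_table(index=['shop_id', 'item_id'], columns='date_block_num',
                             values='item_cnt_day', aggfunc='sum', fill_value=0).reset_index()

# Left join keeps every test row; pairs with no history become NaN, then 0
merged = toy_test.merge(wide, on=['shop_id', 'item_id'], how='left').fillna(0)

print(merged)
```

Pair (1, 11) is dropped because the test set does not ask for it, and pair (2, 10) gets an all-zero history.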

In [None]:
# Drop ID, shop_id, and item_id: the rows are in test order, so the index already identifies each pair

sales_monthly.drop(['ID', 'shop_id', 'item_id'], inplace = True, axis = 1)
sales_monthly.head()

In [None]:
sales_monthly.shape

In [None]:
# select all the columns except for the last one for the training set
# expand to an array of 3 dimensions with shape (214200, 33, 1)

X_train = np.expand_dims(sales_monthly.values[:,:-1],axis=2)
X_train.shape

In [None]:
# The last column is our training labels (or truth values)
# creates a 2d array with shape (214200, 1) 

y_train = sales_monthly.values[:,-1:]
y_train.shape

In [None]:
# select all the columns except for the first one for the 'test' set
# expand to an array of 3 dimensions with shape (214200, 33, 1)

# note that it must include the last column, unlike the training set,
# but it leaves out the first column so both arrays have the same shape.

X_test = np.expand_dims(sales_monthly.values[:,1:],axis=2)
X_test.shape
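
The slicing above shifts the input window by one month between training and prediction. A small sketch with a made-up 2 x 5 array (two shop-item pairs, five months) shows which columns end up in each array; the names `monthly`, `X_tr`, `y_tr`, and `X_te` are hypothetical.

```python
import numpy as np

# Hypothetical monthly counts: 2 shop-item pairs, 5 months
monthly = np.array([[0, 1, 2, 3, 4],
                    [5, 6, 7, 8, 9]])

# Training inputs: months 0..3, with a trailing feature axis for the LSTM
X_tr = np.expand_dims(monthly[:, :-1], axis=2)

# Training labels: the last month (month 4)
y_tr = monthly[:, -1:]

# Prediction inputs: months 1..4, the same window length shifted by one month
X_te = np.expand_dims(monthly[:, 1:], axis=2)

print(X_tr.shape, y_tr.shape, X_te.shape)
```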

## Model training plots 

In [None]:
import math
import matplotlib.pyplot as plt

In [None]:
def rmse(mse_values):
    # convert a sequence of per-epoch MSE values to RMSE values
    return [math.sqrt(v) for v in mse_values]

In [None]:
def plot_train_curve(history):

    colors = ['#e66101','#fdb863']
    # Keras logs the MSE metric once per epoch; convert it to RMSE for plotting
    train_rmse = rmse(history.history['mean_squared_error'])
    epochs = range(len(train_rmse))
    with plt.style.context("ggplot"):
        plt.figure(figsize=(8, 8/1.618))
        plt.ticklabel_format(useOffset=False)
        plt.plot(epochs, train_rmse, marker='o', c=colors[0], label='Training RMSE')

        plt.title('Training RMSE')
        plt.xlabel('Epoch')
        plt.legend()
        plt.show()

## LSTM Model

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.models import load_model, Model

lstm_model_1 = Sequential()
lstm_model_1.add(LSTM(units = 64 , input_shape = (33,1), activation='relu'))
lstm_model_1.add(Dropout(0.5))
lstm_model_1.add(Dense(1))

lstm_model_1.compile(loss='mse',optimizer = 'adam',metrics=['mean_squared_error'])
lstm_model_1.summary()
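
As a sanity check on the summary, the trainable parameter counts can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights, and one bias vector; the helper names are hypothetical. Dropout adds no parameters.

```python
def lstm_param_count(units, input_dim):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # one weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=64, input_dim=1)
dense_params = dense_param_count(inputs=64, outputs=1)
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)  # 16896 65 16961
```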

In [None]:
lstm_model_1_history = lstm_model_1.fit(X_train,y_train,batch_size=4096,epochs=20)

In [None]:
plot_train_curve(lstm_model_1_history)

## Output predictions to csv

In [None]:
output = lstm_model_1.predict(X_test)
submission = pd.DataFrame({'ID':test['ID'],'item_cnt_month':output.ravel()})
submission.head()

In [None]:
submission.to_csv('sample_submission_lstm_model_1.csv',index = False)

## Top 10 predicted sales increases

In [None]:
# Get sales for the last month and the predicted sales for the next month

# Flatten both arrays to 1-D so the DataFrame columns hold plain floats
lstm_predictions = output.ravel()
last_month = y_train.ravel()
predicted_changes = pd.DataFrame( {'last':last_month, 'pred':lstm_predictions})

In [None]:
# Get the percent difference for (predicted sales next month - sales previous month)

predicted_changes['pct_diff'] = (predicted_changes['pred'] - predicted_changes['last'])/predicted_changes['last']

In [None]:
# Remove any non-numeric results, sort the store-items by percent change, and get the top 10

predicted_changes = predicted_changes[ ~ predicted_changes['pct_diff'].isin([np.nan, np.inf, -np.inf]) ]
predicted_changes = predicted_changes.sort_values(by='pct_diff', ascending=False)
predicted_changes = predicted_changes[0:10]
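
Many shop-item pairs sold nothing in the last month, so the percent change divides by zero. The sketch below uses made-up numbers to show what pandas produces in that case and one way to filter it; `np.isfinite` and `nlargest` are used here as an alternative to the `isin` and sort steps, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical last-month sales and predictions for four shop-item pairs
toy = pd.DataFrame({'last': [2.0, 0.0, 4.0, 0.0], 'pred': [3.0, 1.0, 1.0, 0.0]})

# Division by zero gives inf (1/0) or NaN (0/0) rather than raising an error
toy['pct_diff'] = (toy['pred'] - toy['last']) / toy['last']

# Keep only rows with a finite percent change, then take the largest
finite = toy[np.isfinite(toy['pct_diff'])]
top = finite.nlargest(1, 'pct_diff')

print(toy)
print(top)
```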

In [None]:
# Reset the index and name the column ID

predicted_changes.reset_index(inplace=True)
predicted_changes.rename(columns={'index': 'ID'}, inplace=True)

In [None]:
# Join the top 10 changes with the descriptions for shop, item, and item category

predicted_changes = pd.merge(predicted_changes, test, on = ['ID'], how = 'left')
predicted_changes = pd.merge(predicted_changes, shops, on = ['shop_id'], how = 'left')
predicted_changes = pd.merge(predicted_changes, items, on = ['item_id'], how = 'left')
predicted_changes = pd.merge(predicted_changes, item_categories, on = ['item_category_id'], how = 'left')

In [None]:
# Only keep the descriptive columns and format the percentages

# (copy the selection so the pct_diff assignment below does not act on a view)
predicted_changes = predicted_changes[['item_name','item_category_name','shop_name','pct_diff']].copy()

def format_pcts(x):
  x = x.astype(float)
  x = x * 100
  x = round(x,2)
  return x
  
predicted_changes['pct_diff'] = format_pcts(predicted_changes['pct_diff'])
predicted_changes.head(10)