# <center>Store Sales EDA and Linear Drift Prediction</center>
<center>If you liked this kernel and/or found it helpful, please upvote it so others can see it too!</center>

![](https://s.hdnux.com/photos/76/55/06/16437658/5/rawImage.jpg)

"You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores."

Let's explore the provided dataset.

# EDA (Exploratory Data Analysis)

**Imports**

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

**Dataset**

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
submission = pd.read_csv('../input/sample_submission.csv')

In [None]:
submission.info()

In [None]:
submission.head()

In [None]:
test.info()

In [None]:
test.head()

The submission format is 45,000 rows, each expecting one numerical sales forecast. This corresponds to one forecast for each of 50 products at 10 different stores over a three-month period (~90 days). The id in each submission row corresponds to a date/product/store combination in the test set.
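As a quick check of that arithmetic, here is a small sketch. It assumes the test window runs from 2018-01-01 through 2018-03-31, which is my assumption; the actual range is whatever `test['date']` contains.

```python
import pandas as pd

# Assumed test window: the first three months of 2018 (confirm against test['date'])
test_days = pd.date_range('2018-01-01', '2018-03-31', freq='D')

n_stores, n_items = 10, 50

# one forecast per day, per store, per item
expected_rows = len(test_days) * n_stores * n_items
print(len(test_days), expected_rows)
```

If the printed row count matches `len(submission)`, the test set is a full date/store/item grid with no gaps.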

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.describe()

The training data consists of daily sales data for the same 50 products across the same 10 stores, over the last 5 years.

**Data Exploration**

The competition overview mentioned seasonality as a significant factor. Let's aggregate sales (of all products) by month and plot them.

In [None]:
# calculate monthly sales per store
dates = train['date'].apply(lambda x: x[:-3]).unique() # chop off the "day" part of the date
all_storesales = {}
for id in range(1,11):
    print('calculating store',id)
    all_storesales[id] = []
    for date in dates:
        # extract the sales data for that store, for that date
        storedata = train[(train['store'] == id) & (train['date'].apply(lambda x: x[:-3]) == date)]
        storesales = storedata['sales'].sum()
        all_storesales[id].append(storesales)
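The loop above rescans the full frame for every store/month pair, which is slow on a table this size. Below is a minimal sketch of the same aggregation with `groupby`, shown on a tiny made-up frame (the values are illustrative, not from the competition data). It assumes the same `date`, `store`, and `sales` columns used above.

```python
import pandas as pd

# Tiny illustrative frame with the same columns as train (values are made up)
toy = pd.DataFrame({
    'date':  ['2013-01-01', '2013-01-02', '2013-02-01', '2013-01-01', '2013-02-01'],
    'store': [1, 1, 1, 2, 2],
    'item':  [1, 1, 1, 1, 1],
    'sales': [10, 12, 9, 20, 25],
})

# 'YYYY-MM' key, equivalent to chopping the day off the date string
toy['month'] = toy['date'].str[:7]

# rows = months, columns = stores, values = total sales
monthly = toy.groupby(['month', 'store'])['sales'].sum().unstack('store')
print(monthly)

# same layout as all_storesales: {store id: [monthly totals in date order]}
storesales = {store: monthly[store].tolist() for store in monthly.columns}
```

A single pass over the data replaces one filter per store/month combination, and the resulting dictionary can be plotted the same way as `all_storesales`.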

In [None]:
# plot the results for each store
for id in range(1,11):
    plt.figure(figsize=(20,10))
    plt.plot(dates, all_storesales[id])
    plt.title('Store '+str(id)+' Sales', fontsize=30)
    plt.xticks(rotation=90)
    plt.show()

We see very consistent trends in the monthly sales data: for all 10 stores, sales follow the same seasonal pattern from month to month.
However, because 50 products contribute to these totals, we should also explore sales by product to see whether the product mix is similarly consistent. We'll do essentially the same thing as before, this time grouping by product rather than by date.

In [None]:
# collect total sales per item, per store
items = range(1,51)
all_itemsales = {}
for id in range(1,11):
    print('calculating store',id)
    all_itemsales[id] = []
    for item in items:
        itemdata = train[(train['store'] == id) & (train['item'] == item)]
        itemsales = itemdata['sales'].sum()
        all_itemsales[id].append(itemsales)

We're going to plot total sales by product, and add a red line (the same across all graphs) representing the mean sales across all 10 stores. This will help us establish whether there is a relationship (a common shape/trend) between the various products, even if total sales are different between stores.

In [None]:
# calculate the mean sales per product, across all 10 stores
mean_item_sales = np.array(all_itemsales[1])
for id in range(2,11):
    mean_item_sales += np.array(all_itemsales[id])
mean_item_sales = np.divide(mean_item_sales, 10)

# plot them
for id in range(1,11):
    plt.figure(figsize=(20,10))
    plt.bar(items, all_itemsales[id])
    plt.plot(items, mean_item_sales, color='red')
    plt.title('Store '+str(id)+' Sales By Item', fontsize=30)
    plt.xticks(items)
    plt.show()
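One way to put a number on how similar the product mix is across stores is to convert each store's item totals into shares and look at how far each item's share varies between stores. A rough sketch on made-up numbers (three hypothetical items, two hypothetical stores):

```python
import numpy as np

# Made-up total sales per item for two stores (same layout as all_itemsales)
toy_itemsales = {
    1: [100, 300, 600],    # store 1
    2: [210, 590, 1200],   # store 2: roughly double the volume, similar mix
}

# each store's item totals as a fraction of that store's total sales
shares = np.array([np.array(v) / sum(v) for v in toy_itemsales.values()])

# spread of each item's share across stores; small values mean a consistent mix
share_spread = shares.max(axis=0) - shares.min(axis=0)
print(shares.round(3))
print(share_spread.round(3))
```

Applied to the real `all_itemsales`, a small spread relative to the shares themselves would back up what the bar charts suggest visually.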

It looks like each product makes up a fraction of the sales that is relatively consistent across stores. With this information (monthly sales consistent, product sales consistent), we will do the following:

# Basic Linear Projection

Treating each calendar month as its own series, we fit a simple linear trend across years (for each store), which gives a predicted sales total for that month in 2018. We then split that total between products using each product's share of the store's historical sales, and divide by the number of days in the month to get a daily prediction. Every day in a given month therefore gets the same value; this does not address day-to-day sales variation, which we have not yet explored.
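A compact way to write this projection, in my own notation (the symbols are not from the original notebook): let $M_{s,m,y}$ be the total sales of store $s$ in month $m$ of year $y$, $p_{s,i}$ the share of store $s$'s historical sales that comes from item $i$, and $D_m$ the number of days in month $m$. Assuming the drift is estimated as the mean year-over-year change over 2013 to 2017:

$$
\hat{M}_{s,m,2018} = M_{s,m,2017} + \frac{1}{4}\sum_{y=2014}^{2017}\left(M_{s,m,y} - M_{s,m,y-1}\right)
$$

$$
\hat{y}_{s,i,d} = \frac{\hat{M}_{s,m,2018}\; p_{s,i}}{D_m} \quad \text{for every day } d \text{ in month } m
$$

The sum telescopes to $(M_{s,m,2017} - M_{s,m,2013})/4$, so only the first and last years determine the drift estimate.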

In [None]:
dates_projected = [ '2018-'+str(mo) if mo >= 10 else '2018-0'+str(mo) for mo in range(1,13) ] # some effort needed to account for leading '0' in single-digit months e.g. '2018-07'

all_storesales_projected = {}
for id in range(1,11):
    print('calculating store',id)
    
    all_storesales_projected[id] = []
    # iterate over months, and THEN years, to collect the trend of a single month's sales over multiple years, and then repeat for all months
    for month in range(1,13):
        month_pts = []
        for year in range(2013,2018):
            # get num sales for same month from past years
            date = str(year)+'-'+str(month) if month >= 10 else str(year)+'-0'+str(month)
            storedata = train[(train['store'] == id) & (train['date'].apply(lambda x: x[:-3]) == date)]
            storesales = storedata['sales'].sum()
            month_pts.append(storesales)
            
        # predict next value by taking the average of the diffs between consecutive years, and append it to projected sales
        # this could be improved with a true linear regression, as long as there aren't any extreme outliers
        total_diff = 0
        for idx,mp in enumerate(month_pts[1:]):
            # idx starts at 0 while mp starts at month_pts[1], so month_pts[idx] is the previous year
            total_diff += mp - month_pts[idx]
        mean_diff = total_diff/(len(month_pts)-1)
        next_pt = month_pts[-1] + mean_diff
        all_storesales_projected[id].append(next_pt)
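The comment in the cell above mentions a true linear regression as a possible improvement. Here is a minimal sketch of both estimates on made-up yearly totals for one store and one calendar month, using `np.polyfit` for the least-squares line (the numbers are illustrative only):

```python
import numpy as np

# Made-up sales for the same calendar month across five years (2013..2017)
month_pts = np.array([100.0, 112.0, 118.0, 131.0, 140.0])
x = np.arange(len(month_pts))  # 0..4, one step per year

# Drift estimate: mean of consecutive year-over-year differences, added to the last point
mean_diff = np.diff(month_pts).mean()
next_pt_drift = month_pts[-1] + mean_diff

# Least-squares straight line through all five points, evaluated one step ahead
slope, intercept = np.polyfit(x, month_pts, deg=1)
next_pt_regression = slope * len(month_pts) + intercept

print(next_pt_drift, next_pt_regression)
```

The two estimates are close when the yearly points lie near a straight line. The regression uses every point, so it should be less sensitive to an unusual first or last year.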

In [None]:
# plot the predicted monthly sales
for id in range(1,11):
    plt.figure(figsize=(20,10))
    plt.plot(dates, all_storesales[id])
    plt.plot(dates_projected, all_storesales_projected[id], color='red')
    plt.title('Store '+str(id)+' Sales', fontsize=30)
    plt.xticks(rotation=90)
    plt.show()

Sanity check passed: the predictions look like they belong.

In [None]:
days_in_month = { 1: 31, 2: 28.25, 3: 31, 4: 30, 5: 31, 6: 30, 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31 }

predicted_sales = []
for idx,row in test.iterrows():
    month = int(row['date'].split('-')[1])
    id = row['store']
    item = row['item']
    
    # get the predicted sales for the month
    total_month_sales_projected = all_storesales_projected[id][month-1]
    
    # get product's fraction of sales
    item_sales_fraction = float(all_itemsales[id][item-1]) / sum(all_itemsales[id])
    
    # get predicted monthly sales for that product, and divide by # days in that month to get daily sales of that product
    item_sales_projected = total_month_sales_projected*item_sales_fraction / days_in_month[month]
    predicted_sales.append(item_sales_projected)

# add predictions to submission file
submission['sales'] = predicted_sales
submission.to_csv('submission_basic.csv', index=False)

# Random Forest

As a bonus, let's see what it would look like to pop the whole dataset into a Random Forest regressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
train_rf = train.copy()

# separate month/day/year into separate columns, as they are independent variables for RF
train_rf['year'] = train_rf['date'].apply(lambda x: int(x.split('-')[0]))
train_rf['month'] = train_rf['date'].apply(lambda x: int(x.split('-')[1]))
train_rf['day'] = train_rf['date'].apply(lambda x: int(x.split('-')[2]))

train_rf = train_rf.drop('date', axis=1)

In [None]:
# train the model
model = RandomForestRegressor(n_estimators=100)
model.fit(train_rf.drop('sales',axis=1), train_rf['sales'])

In [None]:
test_rf = test.copy()

In [None]:
# apply same data transformation to the test set
test_rf['year'] = test_rf['date'].apply(lambda x: int(x.split('-')[0]))
test_rf['month'] = test_rf['date'].apply(lambda x: int(x.split('-')[1]))
test_rf['day'] = test_rf['date'].apply(lambda x: int(x.split('-')[2]))

test_rf = test_rf.drop(['date','id'], axis=1)

In [None]:
# make predictions
pred = model.predict(test_rf)
print(pred)

# set on test dataframes
test_rf['sales'] = pred
test['sales'] = pred

In [None]:
test.info()

In [None]:
test.head()

Predictions are on the right order of magnitude! Let's plot the predictions for fun.

In [None]:
# plot monthly sales per store
dates = test['date'].apply(lambda x: x[:-3]).unique()
all_storesales_projected = {}
for id in range(1,11):
    print('calculating store',id)
    all_storesales_projected[id] = []
    for date in dates:
        storedata = test[(test['store'] == id) & (test['date'].apply(lambda x: x[:-3]) == date)]
        storesales = storedata['sales'].sum()
        all_storesales_projected[id].append(storesales)
        
prev_dates = train['date'].apply(lambda x: x[:-3]).unique()

for id in range(1,11):
    plt.figure(figsize=(20,10))
    plt.plot(prev_dates, all_storesales[id])
    plt.plot(dates, all_storesales_projected[id], color='red')
    plt.title('Store '+str(id)+' Sales (RF)', fontsize=30)
    plt.xticks(rotation=90)
    plt.show()

Not bad: the model predicts values that make sense and roughly follow the expected seasonal shape. However, Random Forest doesn't account well for drift (which our linear model was specifically designed to capture), so these predictions are most likely too low unless a recession hit in 2018 (it's 2019 as of writing this, so I guess not).
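The drift problem follows from how tree ensembles predict: each tree returns an average of training targets, so the forest cannot output a value above the largest target it saw, even when a feature such as `year` moves past the training range. A small sketch on synthetic data (a made-up upward trend, not the competition data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic series with a steady upward trend: y = 10 * x for x = 0..49
X_train = np.arange(50).reshape(-1, 1)
y_train = 10.0 * np.arange(50)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Points beyond the training range; the underlying trend would give 600 and 1000
X_future = np.array([[60], [100]])
rf_pred = rf.predict(X_future)

print(rf_pred, y_train.max())
```

Both future points land in the same outermost leaf of every tree, so they receive the same prediction, capped by the training maximum.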

Next steps would be to understand the dataset better and engineer more features, such as day-level features (e.g. "is weekend", "temperature", "is holiday") or inter-month relationships (does the previous month's performance affect the current month?). Thanks for reading, and feel free to ask questions or request clarification!
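As a starting point for the day-level features suggested in the next steps, here is a sketch that derives day-of-week and weekend flags from the date strings with pandas. The column names `dayofweek` and `is_weekend` are my own choices, and holiday or temperature features would need external data that is not part of this dataset.

```python
import pandas as pd

# A few sample dates in the same 'YYYY-MM-DD' string format as the dataset
toy = pd.DataFrame({'date': ['2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']})

parsed = pd.to_datetime(toy['date'], format='%Y-%m-%d')
toy['dayofweek'] = parsed.dt.dayofweek                    # Monday=0 ... Sunday=6
toy['is_weekend'] = (toy['dayofweek'] >= 5).astype(int)   # 1 for Saturday/Sunday

print(toy)
```

The same two lines could be applied to `train_rf` and `test_rf` before dropping the `date` column, so the model sees weekly patterns that `day` of month alone cannot express.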