### Kaggle Data Submission

The "model" section showed that general time series models may not be very accurate for this project: there is little historical data, and the fluctuations make the series non-stationary. We will still submit predictions to Kaggle to measure their accuracy, and that score will serve as the baseline for continuous improvement.

Among the models considered, SARIMAX is expected to be the most accurate. Its output, however, is total sales. To meet Kaggle's requirements, we need to break that total down to the item and shop level, which is what this workbook does.

In [1]:
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
import seaborn as sns
import pandas.testing as tm
sns.set()
%matplotlib inline 


In [2]:
raw_df = pd.read_csv('data/train_df.csv', parse_dates=['date'])

In [3]:
raw_df.tail()

Unnamed: 0,date,date_block_num,shop_id,shop_name,item_id,item_name,item_category_id,item_category_name,item_price,item_cnt_day,sales
2935844,2015-10-22,33,55,Цифровой склад 1С-Онлайн,13093,Карта оплаты Windows: 250 рублей [Цифровая вер...,36,Карты оплаты - Windows (Цифра),250.0,1.0,250.0
2935845,2015-09-21,32,55,Цифровой склад 1С-Онлайн,13091,Карта оплаты Windows: 1000 рублей [Цифровая ве...,36,Карты оплаты - Windows (Цифра),1000.0,1.0,1000.0
2935846,2015-09-16,32,55,Цифровой склад 1С-Онлайн,13094,Карта оплаты Windows: 2500 рублей [Цифровая ве...,36,Карты оплаты - Windows (Цифра),2500.0,1.0,2500.0
2935847,2015-09-22,32,55,Цифровой склад 1С-Онлайн,13094,Карта оплаты Windows: 2500 рублей [Цифровая ве...,36,Карты оплаты - Windows (Цифра),2500.0,2.0,5000.0
2935848,2015-10-26,33,55,Цифровой склад 1С-Онлайн,13092,Карта оплаты Windows: 2000 рублей [Цифровая ве...,36,Карты оплаты - Windows (Цифра),2000.0,1.0,2000.0


We will use all of the training data to predict November 2015 sales.

In [4]:
ts_data = raw_df[['date','sales']]
ts_daily = ts_data.groupby(['date'])['sales'].sum()

In [5]:
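# SARIMAX(3,1,3)x(1,1,2,12): the seasonal period of 12 is counted in
# observations, i.e. 12 days on this daily series
mod = SARIMAX(ts_daily, order=(3,1,3), seasonal_order=(1,1,2,12), freq='D')
res = mod.fit()
# daily sales forecast for each day of November 2015
sarimax = res.predict(start='2015-11-01', end='2015-11-30')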



In [6]:
# total sales predicted for Nov 2015
nov_2015 = sarimax.sum(axis=0)

To split the total sales back to the item/shop level, we assume that each item/shop pair's share of total sales closely resembles its share in October 2015.
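The proportional split described above can be checked by hand on a toy example. The sketch below uses plain Python and made-up numbers; `allocate_by_share`, the item/shop labels, and the forecast total are hypothetical names and values chosen for illustration, not part of the notebook or the dataset.

```python
# Toy numbers, invented for illustration only (not taken from the dataset).
october_sales = {
    ("item_a", "shop_1"): 600.0,
    ("item_a", "shop_2"): 300.0,
    ("item_b", "shop_1"): 100.0,
}
november_forecast_total = 2000.0  # hypothetical forecast of total sales


def allocate_by_share(reference_sales, forecast_total):
    """Split forecast_total across keys in proportion to reference_sales."""
    reference_total = sum(reference_sales.values())
    return {
        key: forecast_total * value / reference_total
        for key, value in reference_sales.items()
    }


allocated = allocate_by_share(october_sales, november_forecast_total)
for pair, amount in allocated.items():
    print(pair, amount)
```

Because the shares sum to 1, the allocated amounts add back up to the forecast total. Any pair with no October sales receives no allocation under this assumption.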

In [7]:
oct_2015 = raw_df.query('date_block_num == 33')
oct_2015 = oct_2015.groupby(['item_id', 'shop_id'])['sales'].sum().reset_index()

In [8]:
oct_2015_ttl = oct_2015['sales'].sum()

In [9]:
# share of total October sales for each item/shop pair (a fraction, not a percentage)
oct_2015['sales_%'] = oct_2015['sales']/oct_2015_ttl

In [10]:
oct_2015.sort_values(by='sales_%', ascending=False)

Unnamed: 0,item_id,shop_id,sales,sales_%
10861,7224,25,390738.0,0.004633
1419,1578,42,368959.0,0.004375
10871,7224,42,323928.0,0.003841
1409,1578,25,305966.0,0.003628
10820,7223,25,275931.0,0.003272
...,...,...,...,...
24319,16146,21,-1799.0,-0.000021
3361,2860,48,-1799.0,-0.000021
8127,5488,12,-2799.0,-0.000033
30330,21363,26,-3199.0,-0.000038


In [11]:
# import test data
test = pd.read_csv('data/test.csv')

In [12]:
test = pd.merge(test, oct_2015, on=['shop_id','item_id'], how='left')

The next step is to get the price per item_id, which will be used to convert sales into a sales count.
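As a rough illustration of that conversion, the sketch below divides an allocated sales amount by a unit price to get a unit count. All names and numbers are hypothetical. Returning 0 when a value is missing mirrors the notebook's `fillna(0)`; the guard against non-positive prices is an extra assumption that the notebook itself does not make.

```python
def sales_to_count(allocated_sales, unit_price):
    """Convert a sales amount into a unit count.

    Returns 0.0 when either input is missing. The non-positive price
    guard is an assumption added for this sketch.
    """
    if allocated_sales is None or unit_price is None or unit_price <= 0:
        return 0.0
    return allocated_sales / unit_price


# Hypothetical values, not taken from the dataset.
allocated_sales = {("item_a", "shop_1"): 1200.0, ("item_b", "shop_1"): 200.0}
unit_prices = {"item_a": 400.0, "item_b": 50.0, "item_c": 250.0}
test_pairs = [("item_a", "shop_1"), ("item_b", "shop_1"), ("item_c", "shop_1")]

# Pairs without an allocation (item_c here) fall back to a count of 0.
counts = {
    pair: sales_to_count(allocated_sales.get(pair), unit_prices.get(pair[0]))
    for pair in test_pairs
}
for pair, count in counts.items():
    print(pair, count)
```

A single price per item is a simplification: if an item's price varied over time, the resulting count depends on which price is chosen.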

In [13]:
items = pd.read_csv('./data/items.csv')

In [14]:
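# attach every recorded price to each item; items never sold get NaN
items_price = pd.merge(items, raw_df, on=['item_id'], how='left')
items_price = items_price[['item_id', 'item_price']]
# keep one price per item: the first row that appears after the merge
items_price = items_price.drop_duplicates(subset='item_id')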

In [15]:
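test = pd.merge(test, items_price, on=['item_id'], how='left')
# allocated November sales per pair, divided by unit price, gives a unit count
test['item_cnt_month'] = nov_2015*test['sales_%']/test['item_price']
# pairs with no October sales or no known price are NaN; predict 0 for them
test.fillna(0,inplace=True)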

In [16]:
# drop the helper columns so only the submission columns remain
test = test.drop(columns=['shop_id', 'item_id', 'sales', 'sales_%', 'item_price'])
test.to_csv('data/submission.csv', index=False)