# Predict Future Sales

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

#### File descriptions
* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
* test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
* sample_submission.csv - a sample submission file in the correct format.
* items.csv - supplemental information about the items/products.
* item_categories.csv - supplemental information about the item categories.
* shops.csv - supplemental information about the shops.

#### Data fields
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd.mm.yyyy
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category

In [None]:
import collections
import copy
import datetime
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
from dateutil.parser import parse
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    train_test_split,
    cross_val_score
)
from xgboost import XGBRegressor
# both libraries export plot_importance; alias the XGBoost one so it is not shadowed
from xgboost import plot_importance as xgb_plot_importance
from lightgbm import LGBMRegressor
from lightgbm import plot_importance

## Data preparation

Load the data:

In [None]:
train = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")
sample_submission = pd.read_csv("../input/competitive-data-science-predict-future-sales/sample_submission.csv")
items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")
item_categories = pd.read_csv("../input/competitive-data-science-predict-future-sales/item_categories.csv")
shops = pd.read_csv("../input/competitive-data-science-predict-future-sales/shops.csv")

In [None]:
# parse the date column (dd.mm.yyyy) into datetime
train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')

In [None]:
print('train:',train.shape,'test:',test.shape,'items:',items.shape,'item_categories:',item_categories.shape,'shop:',shops.shape)

In [None]:
#add information about category
train = train.join(items, on='item_id', rsuffix='_').drop(['item_id_', 'item_name'], axis=1)

In [None]:
train.head()

Check the data for negative (or zero) prices:

In [None]:
train[train['item_price']<=0]

In [None]:
train[(train.shop_id==32)&(train.item_id==2973)&(train.date_block_num==4)]

Replace the negative price with the mean price for this shop-item pair in the same month:

In [None]:
# exclude the invalid row itself so that it does not skew the mean
train.loc[train.item_price<0,'item_price'] = train[(train.shop_id==32)&(train.item_id==2973)&(train.date_block_num==4)&(train.item_price>0)].item_price.mean()
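
The same idea can be written without hard-coding the shop, item and month. This is only a sketch on made-up toy data (the prices and the `valid_price` / `item_price_fixed` column names are my own, not part of the dataset): invalid prices are masked out first, then filled from the mean of the valid prices in the same shop-item-month group.

```python
import pandas as pd

# Toy data with hypothetical values: one row has an invalid price of -1.
toy = pd.DataFrame({
    "shop_id":        [32, 32, 32, 5],
    "item_id":        [2973, 2973, 2973, 10],
    "date_block_num": [4, 4, 4, 4],
    "item_price":     [2499.0, 1249.0, -1.0, 300.0],
})
keys = ["shop_id", "item_id", "date_block_num"]

# Mask invalid prices as NaN so they do not take part in the group mean.
toy["valid_price"] = toy["item_price"].where(toy["item_price"] > 0)

# Mean of the valid prices within each shop-item-month group.
group_mean = toy.groupby(keys)["valid_price"].transform("mean")

# Keep valid prices as they are, fill the invalid ones from the group mean.
toy["item_price_fixed"] = toy["valid_price"].fillna(group_mean)
```

If a group has no valid price at all, the result stays NaN and would need a separate fallback.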

**Build the monthly sales table:**

In [None]:
train_monthly = train.sort_values('date').groupby(['date_block_num', 'shop_id','item_category_id', 'item_id'], as_index=False)

In [None]:
train_monthly = train_monthly.agg({'item_price':['median', 'mean'], 'item_cnt_day':['sum', 'count']})

In [None]:
train_monthly.head(3)

In [None]:
train_monthly.columns = ['date_block_num', 'shop_id', 'item_category_id','item_id', 'item_price_median', 'item_price_mean', 'item_cnt', 'transactions']

In [None]:
train_monthly.head(3)
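
As a side note, the aggregation and the manual column renaming above could probably be combined using pandas named aggregation. The sketch below uses a made-up `daily` table with the same column names as `train`; the values are hypothetical.

```python
import pandas as pd

# Toy daily sales (hypothetical values), same columns as the training table.
daily = pd.DataFrame({
    "date_block_num":   [0, 0, 0, 1],
    "shop_id":          [1, 1, 2, 1],
    "item_category_id": [7, 7, 7, 7],
    "item_id":          [100, 100, 100, 100],
    "item_price":       [10.0, 20.0, 15.0, 12.0],
    "item_cnt_day":     [1, 3, 2, -1],
})

# Named aggregation: each output column is given as name=(source column, function),
# so the result has flat column names and needs no renaming afterwards.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_category_id", "item_id"],
                  as_index=False)
         .agg(
             item_price_median=("item_price", "median"),
             item_price_mean=("item_price", "mean"),
             item_cnt=("item_cnt_day", "sum"),
             transactions=("item_cnt_day", "count"),
         )
)
```

The resulting columns have the same names and order as the ones assigned manually to `train_monthly`.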

**Build the sales table by category (for each shop):**

In [None]:
train_monthly_by_category = train_monthly.groupby(['date_block_num','shop_id', 'item_category_id'], as_index=False)

In [None]:
train_monthly_by_category = train_monthly_by_category.agg({'item_price_median':['mean'], 'item_price_mean':['mean'],'item_cnt':['sum', 'mean'], 'transactions': ['mean']})

In [None]:
train_monthly_by_category.head()

**Build the sales table by category (ignoring the shop):**

In [None]:
train_cat_no_shop = train_monthly.groupby(['date_block_num', 'item_category_id'], as_index=False)

In [None]:
train_cat_no_shop = train_cat_no_shop.agg({'item_price_median':['mean'], 'item_price_mean':['mean'],'item_cnt':['sum', 'mean'], 'transactions': ['mean']})

In [None]:
train_cat_no_shop.head(3)

Add columns with the **year and month (ordinal number):**

In [None]:
train_monthly['year'] = train_monthly['date_block_num'].apply(lambda x: ((x//12+2013)))
train_monthly['month'] = train_monthly['date_block_num'].apply(lambda x: (x%12+1))

In [None]:
train_monthly_by_category['year'] = train_monthly_by_category['date_block_num'].apply(lambda x: ((x//12+2013)))
train_monthly_by_category['month'] = train_monthly_by_category['date_block_num'].apply(lambda x: (x%12+1))

In [None]:
train_cat_no_shop['year'] = train_cat_no_shop['date_block_num'].apply(lambda x: ((x//12+2013)))
train_cat_no_shop['month'] = train_cat_no_shop['date_block_num'].apply(lambda x: (x%12+1))
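
The year and month can also be derived with vectorized integer arithmetic instead of `apply`. A small check on a toy frame (the `blocks` name is only for illustration), relying on the fact that `date_block_num` 0 is January 2013:

```python
import pandas as pd

# A few month numbers: Jan 2013, Dec 2013, Jan 2014, Oct 2015.
blocks = pd.DataFrame({"date_block_num": [0, 11, 12, 33]})

# Integer division gives the number of full years since 2013,
# the remainder gives the zero-based month within the year.
blocks["year"] = blocks["date_block_num"] // 12 + 2013
blocks["month"] = blocks["date_block_num"] % 12 + 1
```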

In [None]:
train_monthly_by_category.columns = ['date_block_num', 'shop_id','item_category_id','item_price_median', 'item_price_mean', 'item_cnt_sum', 'item_cnt', 'transactions', 'year', 'month']
train_cat_no_shop.columns = ['date_block_num', 'item_category_id','item_price_median', 'item_price_mean', 'item_cnt_sum', 'item_cnt', 'transactions', 'year', 'month']

In [None]:
train_monthly.head(3)

In [None]:
train_cat_no_shop.head(3)

In [None]:
train_monthly_by_category.head(3)

Let's see which shop-item pairs sold more than 20 units in a month (or had a negative monthly total):

In [None]:
# train_monthly.query('item_cnt >= 0 and item_cnt <= 20')
# train_monthly.query('item_cnt <= 0 or item_cnt >= 20')

In [None]:
train_monthly.shape

In [None]:
train_monthly[train_monthly['item_cnt']>1000]

### Expanding the training table

The data table contains shop-item pairs only for the months in which they had sales. This means the table can be expanded in one of the following ways.

1) If a shop-item pair had sales in at least one month, set its sales to 0 in all the other months.

2) Alternatively, expand the table to every possible shop-item combination, even if the shop never sold the item. It remains an open question how valid it is to zero out sales in that case: perhaps these pairs simply did not make it into the training set even though sales did occur.

I chose option 1.
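
Option 1 can be sketched on toy data as a cross join of the observed pairs with all month numbers, followed by a left merge. This is an illustration only: it assumes pandas 1.2 or newer (for `how="cross"`), and the toy values and the 3-month horizon are made up (the notebook uses 34 months).

```python
import pandas as pd

# Toy monthly table (hypothetical values): only months with sales are present.
monthly = pd.DataFrame({
    "date_block_num": [0, 2, 1],
    "shop_id":        [1, 1, 2],
    "item_id":        [100, 100, 200],
    "item_cnt":       [3.0, 1.0, 5.0],
})
n_months = 3  # hypothetical horizon for the example

# Every shop-item pair that had sales at least once.
pairs = monthly[["shop_id", "item_id"]].drop_duplicates()
months = pd.DataFrame({"date_block_num": range(n_months)})

# Cross join: each observed pair is repeated for every month.
grid = months.merge(pairs, how="cross")

# Attach the known sales and set the months without sales to zero.
full = grid.merge(monthly, on=["date_block_num", "shop_id", "item_id"], how="left")
full["item_cnt"] = full["item_cnt"].fillna(0)
```

Here 2 pairs times 3 months give 6 rows, of which 3 are the added zero rows.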

In [None]:
uniq_pairs = train_monthly.groupby(['shop_id','item_id']).size().reset_index()
uniq_pairs.shape

In [None]:
empty_df = pd.DataFrame(index = uniq_pairs.index, columns = ['date_block_num','shop_id','item_id'])
empty_df[['shop_id', 'item_id']] = uniq_pairs[['shop_id','item_id']]
empty_df_2 = pd.DataFrame(columns = ['date_block_num','shop_id','item_id'])
for i in range(34):
    empty_df_1 = empty_df.copy()
    empty_df_1['date_block_num'] = i
    empty_df_2 = pd.concat([empty_df_2, empty_df_1])

In [None]:
empty_df_2 = empty_df_2.reset_index()

In [None]:
empty_df_2.head()

In [None]:
full_train_monthly = empty_df_2.merge(train_monthly,on=['date_block_num','shop_id', 'item_id'], how='left').fillna(0).drop(['index'], axis=1)

**Update the year and month columns:**

In [None]:
full_train_monthly['year'] = full_train_monthly['date_block_num'].apply(lambda x: ((x//12+2013)))
full_train_monthly['month'] = full_train_monthly['date_block_num'].apply(lambda x: (x%12+1))

Add **item_category_id** for the newly added zero rows:

In [None]:
full_train_monthly = full_train_monthly.join(items, on='item_id', rsuffix='_').drop(['item_id_', 'item_name', 'item_category_id'], axis=1)
full_train_monthly = full_train_monthly.rename(columns={'item_category_id_':'item_category_id'})

In [None]:
full_train_monthly.shape

In [None]:
print('Table by months shape:', train_monthly.shape, 'Full table by months shape:', full_train_monthly.shape)

Now add zero rows to the category tables (same logic):

In [None]:
uniq_pairs_cat = train_monthly_by_category.groupby(['shop_id','item_category_id']).size().reset_index()

In [None]:
empty_df = pd.DataFrame(index = uniq_pairs_cat.index, columns = ['date_block_num','shop_id','item_category_id'])
empty_df[['shop_id', 'item_category_id']] = uniq_pairs_cat[['shop_id','item_category_id']]
empty_df_2 = pd.DataFrame(columns = ['date_block_num','shop_id','item_category_id'])
for i in range(34):
    empty_df_1 = empty_df.copy()
    empty_df_1['date_block_num'] = i
    empty_df_2 = pd.concat([empty_df_2, empty_df_1])

In [None]:
empty_df_2 = empty_df_2.reset_index()

In [None]:
# print('Table by categories shape:', train_monthly_by_category.shape, 'Full table by categories shape:', empty_df_2.shape)

In [None]:
full_train_monthly_by_category = empty_df_2.merge(train_monthly_by_category, 
                                      on=['date_block_num','shop_id', 'item_category_id'],how='left').fillna(0).drop(['index'], axis=1)

In [None]:
full_train_monthly_by_category.head(3)

In [None]:
uniq_cat = train_cat_no_shop['item_category_id'].unique()

In [None]:
empty_df = pd.DataFrame(index = uniq_cat, columns = ['date_block_num','item_category_id'])
empty_df['item_category_id'] = uniq_cat
empty_df_2 = pd.DataFrame(columns = ['date_block_num','item_category_id'])
for i in range(34):
    empty_df_1 = empty_df.copy()
    empty_df_1['date_block_num'] = i
    empty_df_2 = pd.concat([empty_df_2, empty_df_1])

In [None]:
empty_df_2 = empty_df_2.reset_index()

In [None]:
full_train_cat_no_shop = empty_df_2.merge(train_cat_no_shop, 
                                      on=['date_block_num', 'item_category_id'],how='left').fillna(0).drop(['index'], axis=1)

In [None]:
full_train_cat_no_shop.head()

In [None]:
full_train_monthly_by_category['year'] = full_train_monthly_by_category['date_block_num'].apply(lambda x: ((x//12+2013)))
full_train_monthly_by_category['month'] = full_train_monthly_by_category['date_block_num'].apply(lambda x: (x%12+1))
full_train_cat_no_shop['year'] = full_train_cat_no_shop['date_block_num'].apply(lambda x: ((x//12+2013)))
full_train_cat_no_shop['month'] = full_train_cat_no_shop['date_block_num'].apply(lambda x: (x%12+1))

In [None]:
full_train_monthly_by_category.head(2)

### Adding information on future and past sales

For the training set, add to each row the sales **in the following month:**

In [None]:
full_train_monthly['item_cnt_next_month'] = full_train_monthly.sort_values('date_block_num').groupby(['shop_id','item_id'])['item_cnt'].shift(-1)

In [None]:
full_train_monthly.head(3)

In [None]:
# full_train_monthly[(full_train_monthly.shop_id==0)&(full_train_monthly.item_id==5572)]

For the category tables, likewise add the sales **in the following month:**

In [None]:
full_train_monthly_by_category['item_cnt_next_month'] = full_train_monthly_by_category.sort_values('date_block_num').groupby(['shop_id', 'item_category_id'])['item_cnt'].shift(-1)

In [None]:
full_train_cat_no_shop['item_cnt_next_month'] = full_train_cat_no_shop.sort_values('date_block_num').groupby(['item_category_id'])['item_cnt'].shift(-1)

Now add sales information from **past months**. It is interesting to look at the months that fall exactly one and two years before the month being predicted. Since we predict the next month, we therefore also add the values from 11 and 23 months ago (one year and two years before the next month):

In [None]:
lag_list = [1,2,3,4,5,6,11,23]

for lag in lag_list:
    ft_name = ('item_cnt_shifted%s' % lag)
    full_train_monthly[ft_name] = full_train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_id'])['item_cnt'].shift(lag)

In [None]:
full_train_monthly.head()
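
To make the direction of the shifts concrete, here is a small toy example (values are made up). Within each shop-item pair, `shift(-1)` pulls the value of the following month (the target), while a positive `shift(lag)` pulls the value from `lag` months earlier (a feature). It only works as intended because every pair has a row for every month after the expansion step.

```python
import pandas as pd

# Two shop-item pairs with three consecutive months each (hypothetical values).
toy = pd.DataFrame({
    "shop_id":        [1, 1, 1, 2, 2, 2],
    "item_id":        [100] * 6,
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "item_cnt":       [3.0, 0.0, 4.0, 1.0, 2.0, 0.0],
})

toy = toy.sort_values(["shop_id", "item_id", "date_block_num"])
by_pair = toy.groupby(["shop_id", "item_id"])["item_cnt"]

next_month = by_pair.shift(-1)  # sales one month ahead (target)
prev_month = by_pair.shift(1)   # sales one month back (lag feature)

toy["item_cnt_next_month"] = next_month
toy["item_cnt_shifted1"] = prev_month
```

The last month of each pair has no target and the first month has no lag, so both columns contain NaN at the edges.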

In [None]:
lag_list = [1,2,3,4,5,6,11,23]

for lag in lag_list:
    ft_name = ('item_cnt_shifted%s' % lag)
    full_train_monthly_by_category[ft_name] = full_train_monthly_by_category.sort_values('date_block_num').groupby(['shop_id', 'item_category_id'])['item_cnt'].shift(lag)

In [None]:
full_train_monthly_by_category.head()

In [None]:
for lag in lag_list:
    ft_name = ('item_cnt_shifted%s' % lag)
    full_train_cat_no_shop[ft_name] = full_train_cat_no_shop.sort_values('date_block_num').groupby(['item_category_id'])['item_cnt'].shift(lag)

In [None]:
full_train_cat_no_shop.head()

Do the same for **item_price_mean:**

In [None]:
for lag in lag_list:
    ft_name = ('item_price_mean_shifted%s' % lag)
    full_train_monthly[ft_name] = full_train_monthly.sort_values('date_block_num').groupby(['shop_id','item_id'])['item_price_mean'].shift(lag)

In [None]:
full_train_monthly.head()

In [None]:
for lag in lag_list:
    ft_name = ('item_price_mean_shifted%s' % lag)
    full_train_monthly_by_category[ft_name] = full_train_monthly_by_category.sort_values('date_block_num').groupby(['shop_id','item_category_id'])['item_price_mean'].shift(lag)

In [None]:
full_train_monthly_by_category.head(3)

In [None]:
for lag in lag_list:
    ft_name = ('item_price_mean_shifted%s' % lag)
    full_train_cat_no_shop[ft_name] = full_train_cat_no_shop.sort_values('date_block_num').groupby(['item_category_id'])['item_price_mean'].shift(lag)

In [None]:
full_train_cat_no_shop.head(3)

Add dummy variables for the categorical features year and month:

In [None]:
full_train_monthly  = pd.concat(
          [full_train_monthly, pd.get_dummies(full_train_monthly['year'], prefix='year')],axis=1
        )
full_train_monthly  = pd.concat(
          [full_train_monthly, pd.get_dummies(full_train_monthly['month'], prefix='month')],axis=1
        )

In [None]:
full_train_monthly.shape

In [None]:
full_train_monthly_by_category = pd.concat(
          [full_train_monthly_by_category, pd.get_dummies(full_train_monthly_by_category['year'], prefix='year')],axis=1
        )
full_train_monthly_by_category = pd.concat(
          [full_train_monthly_by_category, pd.get_dummies(full_train_monthly_by_category['month'], prefix='month')],axis=1
        )

In [None]:
full_train_monthly_by_category.shape

In [None]:
full_train_cat_no_shop = pd.concat(
          [full_train_cat_no_shop, pd.get_dummies(full_train_cat_no_shop['year'], prefix='year')],axis=1
        )
full_train_cat_no_shop = pd.concat(
          [full_train_cat_no_shop, pd.get_dummies(full_train_cat_no_shop['month'], prefix='month')],axis=1
        )

In [None]:
full_train_cat_no_shop.shape
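
One thing to keep in mind with `pd.get_dummies`: it only creates columns for the values it actually sees, so a table that holds a single month would get a single dummy column. A possible way around this, shown here as a sketch on toy data (the variable names are illustrative), is to declare the full set of categories first:

```python
import pandas as pd

# A table that only contains October rows, e.g. the last training month.
months = pd.Series([10, 10, 10], name="month")

# Declaring all 12 months as categories makes get_dummies emit all 12 columns,
# even though only one value is present in the data.
as_cat = pd.Series(pd.Categorical(months, categories=list(range(1, 13))))
dummies = pd.get_dummies(as_cat, prefix="month")
```

This keeps the column set identical between tables that cover different months.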

### Reducing the training set

Let's try to shrink the training set by keeping only the items and shops that **appear in the test set.**

In [None]:
shop_id_test = test['shop_id'].unique()
item_id_test = test['item_id'].unique()

In [None]:
t_train = full_train_monthly[full_train_monthly['shop_id'].isin(shop_id_test)]
t_train.shape

In [None]:
t_train = t_train[t_train['item_id'].isin(item_id_test)]
t_train.shape

In [None]:
t = test.copy()
t = t.join(items, on='item_id', rsuffix='_').drop(['item_id_', 'item_name'], axis=1)

In [None]:
print('Train before reduction:',full_train_monthly.shape, 'Train after reduction:', t_train.shape)

In [None]:
t_train_cat = full_train_monthly_by_category.copy()

### Preparing the test set

**Adding features to the test set:**

In [None]:
%%time
m_test = pd.merge(t, t_train[t_train.date_block_num==33], how = 'left', on=['shop_id', 'item_id'])
m_test = m_test.rename(columns={'item_category_id_x':'item_category_id'})
m_test = m_test.drop('item_category_id_y', axis=1)

In [None]:
m_test.head(3)

In [None]:
%%time
# for pairs that are missing from the training table t_train, take the sales information
# from t_train_cat, i.e. the mean sales of the category in that shop:

null_test = pd.merge(m_test[m_test['item_cnt'].isnull()][['ID', 'shop_id', 'item_id','item_category_id']],t_train_cat[t_train_cat.date_block_num==33],how = 'left',on = ['shop_id', 'item_category_id'])
null_test.index = null_test['ID']
for i in m_test.columns:
    m_test.loc[m_test.ID.isin(null_test.ID),i] = null_test[i]

In [None]:
m_test[m_test['item_cnt'].isnull()].shape

Even after filling the missing values with the shop's category averages, some pairs still have no information in the training set. That is, the shop had no sales in that item category over the whole period. I tried filling these pairs with the category's average sales across all shops, but the final results were worse than when sales for these pairs were set to zero.
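
The fallback chain described above (pair-level features, then shop-category averages, then zero) can be illustrated on toy data with two left merges and `fillna`. This is a simplified sketch with a single feature and made-up values, not the exact code of this notebook:

```python
import pandas as pd

# Last-month feature per shop-item pair (hypothetical values).
pair_feats = pd.DataFrame({"shop_id": [1], "item_id": [100], "item_cnt": [4.0]})
# Last-month average per shop-category (hypothetical values).
cat_feats = pd.DataFrame({"shop_id": [1], "item_category_id": [7], "item_cnt": [1.5]})

test_rows = pd.DataFrame({
    "ID":               [0, 1, 2],
    "shop_id":          [1, 1, 2],
    "item_id":          [100, 101, 102],
    "item_category_id": [7, 7, 7],
})

# 1) Pair-level features; rows with an unseen pair get NaN.
m = test_rows.merge(pair_feats, on=["shop_id", "item_id"], how="left")
# 2) Shop-category averages as a separate column.
m = m.merge(cat_feats, on=["shop_id", "item_category_id"], how="left",
            suffixes=("", "_cat"))
# 3) Prefer the pair value, then the category value, then zero.
m["item_cnt"] = m["item_cnt"].fillna(m["item_cnt_cat"]).fillna(0)
m = m.drop(columns="item_cnt_cat")
```

ID 0 keeps its own value, ID 1 falls back to the shop-category average, and ID 2 (a shop with no sales in the category) ends up with zero.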

In [None]:
# # for pairs found neither in the training table t_train nor in the shop-category sales table t_train_cat,
# # take the category's mean sales across all shops:
# null_test_2 = pd.merge(m_test[m_test['item_cnt'].isnull()][['ID', 'shop_id', 'item_id','item_category_id']],
#                      full_train_cat_no_shop[full_train_cat_no_shop.date_block_num==33], 
#                                          how = 'left',
#                                          on = ['item_category_id']
#                      )
# null_test_2.index = null_test_2['ID']
# for i in m_test.columns:
#     m_test.loc[m_test.ID.isin(null_test_2.ID),
#                                       i] = null_test_2[i]

In [None]:
m_test.loc[m_test['item_cnt'].isnull(),'year'] = 2015
m_test.loc[m_test['item_cnt'].isnull(),'month'] = 10
m_test.loc[m_test['item_cnt'].isnull(),'month_10'] = 1
m_test.loc[m_test['item_cnt'].isnull(),'date_block_num'] = 33

In [None]:
m_test = m_test.fillna(0)

In [None]:
m_test.head()

In [None]:
drop_cols = ['date_block_num',
             'ID',
             'shop_id',
             'item_id',
             'item_price_median',
             'year',
             'month',
             'item_cnt_next_month',
             'item_category_id']
X_test = m_test.drop(drop_cols, axis=1)

In [None]:
X_test.to_csv('X_test.csv', index=False)

In [None]:
# t_train.to_csv('t_train.csv', index=False)

In [None]:
# full_train_cat_no_shop.to_csv('full_train_cat_no_shop.csv', index=False)

In [None]:
# full_train_monthly.to_csv('full_train_monthly.csv', index=False)

In [None]:
# full_train_monthly_by_category.to_csv('full_train_monthly_by_category.csv', index=False)

## Modeling

## Model 1: LightGBM

* train: 80% of the shop-item pairs from t_train and t_train_cat, covering 07.2013 to 09.2015 and restricted to shops and items that appear in the test set
* validation: the remaining 20% of the shop-item pairs from t_train and t_train_cat, covering 07.2013 to 09.2015.

In [None]:
# X_test = pd.read_csv('X_test.csv')
# test = pd.read_csv('test.csv')
# t_train = pd.read_csv('t_train.csv')
# t_train_cat = pd.read_csv('t_train_cat.csv')
# full_train_cat_no_shop = pd.read_csv('full_train_cat_no_shop.csv')

In [None]:
# full_train_monthly = pd.read_csv('full_train_monthly.csv')
# full_train_monthly_by_category = pd.read_csv('full_train_monthly_by_category.csv')

In [None]:
%%time
train_set = t_train.query('date_block_num>=6 and date_block_num <33').copy()
train_set_cat = t_train_cat.query('date_block_num>=6 and date_block_num <33').copy()

From the training data, pick a random 20% of the shop-item pairs (the same pairs in every month) for the validation set:

In [None]:
rand_n = train_set.query('date_block_num==6').shape[0]
print(rand_n)
for_val = np.random.choice(rand_n, size=int(0.2*rand_n), replace = False)
print(for_val, for_val.shape)

The training set then keeps the pairs at the remaining positions:

In [None]:
# np.setxor1d: Find the set exclusive-or of two arrays.
# Return the sorted, unique values that are in only one (not both) of the input arrays.

for_fit = np.setxor1d(np.arange(rand_n), for_val)
print(for_fit, for_fit.shape)

The pairs for the train and validation sets have to be selected in every month, so that each set contains the same shop-item pairs throughout:

In [None]:
# DataFrame.append was removed in pandas 2.0, so the monthly slices are collected with pd.concat;
# the same positional rows (shop-item pairs) are taken from every month
val_data = pd.concat(
    [train_set[train_set['date_block_num']==i].reset_index().iloc[for_val] for i in range(6, 33)]
)

In [None]:
# the remaining positional rows of every month form the training part
fit_data = pd.concat(
    [train_set[train_set['date_block_num']==i].reset_index().iloc[for_fit] for i in range(6, 33)]
)
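
The positional selection above relies on every month listing the pairs in the same order. An alternative that does not depend on row order is to sample the pairs themselves and mark the rows by key. The sketch below uses toy data; the names `val_part` and `fit_part` are only for illustration.

```python
import numpy as np
import pandas as pd

# Five shop-item pairs, each present in three months (toy data).
toy = pd.DataFrame({
    "shop_id":        np.repeat([1, 2, 3, 4, 5], 3),
    "item_id":        np.repeat([10, 20, 30, 40, 50], 3),
    "date_block_num": np.tile([6, 7, 8], 5),
})

# Sample 20% of the unique pairs for validation.
pairs = toy[["shop_id", "item_id"]].drop_duplicates().reset_index(drop=True)
val_pairs = pairs.sample(frac=0.2, random_state=42)

# A left merge with indicator=True marks the rows whose pair was sampled;
# val_pairs has unique keys, so the row count and order of toy are preserved.
marked = toy.merge(val_pairs, on=["shop_id", "item_id"], how="left", indicator=True)
is_val = (marked["_merge"] == "both").to_numpy()

val_part = toy[is_val]
fit_part = toy[~is_val]
```

Every sampled pair lands in the validation part with all of its months, and no pair appears in both parts.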

In [None]:
%%time
drop_cols = ['date_block_num', 'shop_id', 'item_id', 'item_price_median', 'year', 'month', 'item_cnt_next_month','item_category_id', 'index']
X_train = fit_data.drop(drop_cols, axis=1)
Y_train = fit_data['item_cnt_next_month']
X_val = val_data.drop(drop_cols, axis=1)
Y_val = val_data['item_cnt_next_month']

In [None]:
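from lightgbm import early_stopping, log_evaluation

def LGBReg(x_train, y_train, x_val, y_val):
    # tree_method is an XGBoost parameter that LightGBM does not use, so it is dropped;
    # random_state replaces its alias seed
    lgb_reg = LGBMRegressor(
        n_jobs=-1,
        learning_rate=0.02,
        max_depth=8,
        n_estimators=1000,
        colsample_bytree=0.8,
        subsample=0.8,
        random_state=42)

    # LightGBM 4.0 removed the verbose and early_stopping_rounds arguments of fit();
    # the equivalent callbacks (available since LightGBM 3.3) are used instead
    lgb_reg.fit(
        x_train,
        y_train,
        eval_metric="rmse",
        eval_set=[(x_train, y_train), (x_val, y_val)],
        callbacks=[early_stopping(stopping_rounds=10), log_evaluation(period=10)])
    return lgb_reg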

L2 loss function: $L_2 = \sum_{i} \left(y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right)^2$
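
The RMSE reported during training is the square root of this L2 loss divided by the number of samples. A tiny worked example with made-up numbers:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 2.0, 5.0])
y_pred = np.array([0.0, 2.0, 2.0, 3.0])

# Sum of squared errors: 0 + 1 + 0 + 4
l2_loss = np.sum((y_true - y_pred) ** 2)

# RMSE: square root of the mean squared error
rmse = np.sqrt(l2_loss / len(y_true))
```

Since the square root is monotonic, minimizing the L2 loss and minimizing the RMSE lead to the same model.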

In [None]:
%%time
lgb_reg_1 = LGBReg(X_train, Y_train, X_val, Y_val)

In [None]:
plot_importance(lgb_reg_1, figsize=(20, 20))

In [None]:
# importance_features = pd.DataFrame(lgb_reg_1.feature_importances_, X_train.columns).sort_values(by=[0], ascending=False)
# importance_features

In [None]:
lgb_test_pred = lgb_reg_1.predict(X_test).clip(0, 20)

In [None]:
submission37 = pd.DataFrame(test['ID'])
submission37['item_cnt_month'] = lgb_test_pred
submission37.to_csv('submission37.csv', index=False)

In [None]:
submission37.head()

### Score Model 1:
This submission scored **1.01092** on Kaggle.

In [None]:
# if we round submission:
submission38 = pd.DataFrame(test['ID'])
submission38['item_cnt_month'] = lgb_test_pred.round()
submission38.to_csv('submission38.csv', index=False)

In [None]:
submission38.head()

With the predictions rounded to integers, the submission scored **1.03365**.
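
One possible explanation, which I have not verified on this dataset: under a squared-error metric the best constant prediction is the mean, and rounding moves predictions away from it. A toy illustration with made-up numbers (the `rmse` helper is defined here only for the example):

```python
import numpy as np

def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Three months without sales and one month with a single sale.
y_true = np.array([0.0, 0.0, 0.0, 1.0])

# Predicting the mean (0.25) everywhere versus the rounded value (0).
pred = np.full(4, 0.25)

rmse_raw = rmse(y_true, pred)
rmse_rounded = rmse(y_true, pred.round())
```

In this example the fractional prediction gives an RMSE of about 0.433, while the rounded one gives exactly 0.5.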

## Model 2: linear regression 

In [None]:
from statsmodels.regression.linear_model import OLS
import statsmodels.api as sm

In [None]:
x = sm.add_constant(X_train.fillna(0))

In [None]:
model2 = OLS(
    Y_train,
    x
).fit()
print(model2.summary())

In [None]:
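# without a formula, statsmodels matches exog columns by position, not by name,
# so the test matrix needs the same column order as x (const first);
# a separate frame also keeps X_test unchanged for the LightGBM model
X_test_ols = sm.add_constant(X_test, has_constant='add')[x.columns]

In [None]:
X_test_ols.head()

In [None]:
y_predict_OLS = model2.predict(X_test_ols).clip(0, 20)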

In [None]:
submission39 = pd.DataFrame(test['ID'])
submission39['item_cnt_month'] = y_predict_OLS
submission39.to_csv('submission39.csv', index=False)

In [None]:
submission39.head()

In [None]:
rmse_train_1 = sqrt(mean_squared_error(Y_train, model2.predict(x)))
rmse_train_1

In [None]:
rmse_val_1 = sqrt(mean_squared_error(Y_val,model2.predict(sm.add_constant(X_val.fillna(0)))))
rmse_val_1