# Previous value benchmark

In this notebook I check that submitting each shop and item's sales from the previous month as the prediction scores 1.16777 on the public leaderboard, as reported in other people's kernels.

In [1]:
import numpy as np
import pandas as pd

In [2]:
sales_train = pd.read_csv('data/sales_train_v2.csv')
items = pd.read_csv('data/items.csv')
shops = pd.read_csv('data/shops.csv')
item_categories = pd.read_csv('data/item_categories.csv')
test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')

After reading the data, I extract the month and year from the date column.

In [3]:
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year
sales_train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,month,year
0,2013-01-02,0,59,22154,999.0,1.0,1,2013
1,2013-01-03,0,25,2552,899.0,1.0,1,2013
2,2013-01-05,0,25,2552,899.0,-1.0,1,2013
3,2013-01-06,0,25,2554,1709.05,1.0,1,2013
4,2013-01-15,0,25,2555,1099.0,1.0,1,2013
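The explicit format string matters because the dates are written day-first. A minimal sketch on two made-up date strings (illustrative values, not read from the dataset) shows how they are parsed and how the `.dt` accessor yields the month and year:

```python
import pandas as pd

# Hypothetical date strings in the same dd.mm.yyyy layout as the 'date' column
raw = pd.Series(['02.01.2013', '15.10.2015'])

# With the explicit format, '02.01.2013' is read as 2 January, not 1 February
parsed = pd.to_datetime(raw, format='%d.%m.%Y')

# Extract calendar month and year from the parsed timestamps
months = parsed.dt.month.tolist()
years = parsed.dt.year.tolist()
print(months, years)
```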


Select the rows from the previous month (October 2015), clip the daily counts to the range [0, 20], and sum them per shop and item.

In [4]:
# Boolean mask for October 2015, the month used as the "previous value"
prev_month_selector = (sales_train.month == 10) & (sales_train.year == 2015)
# .copy() gives an independent frame, so the assignment below does not raise SettingWithCopyWarning
train_subset = sales_train.loc[prev_month_selector].copy()
# Clip daily counts to [0, 20]; negative counts become 0
train_subset['item_cnt_day'] = train_subset['item_cnt_day'].clip(0, 20)
# Sum the daily counts per (shop_id, item_id) pair to get a monthly total
groups = train_subset[['shop_id', 'item_id', 'item_cnt_day']].groupby(by=['shop_id', 'item_id'])
train_subset = groups.agg({'item_cnt_day': 'sum'}).reset_index()
train_subset = train_subset.rename(columns={'item_cnt_day': 'item_cnt_month'})
train_subset.head(3)


Unnamed: 0,shop_id,item_id,item_cnt_month
0,2,31,1.0
1,2,486,3.0
2,2,787,1.0
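To see what the clipping and aggregation do, here is a small sketch on made-up rows (the values are illustrative, not from the dataset). Note the order of operations: daily counts are clipped first, so a day recorded as -1 contributes 0 to the monthly sum rather than cancelling a sale, and a single day above 20 is capped at 20.

```python
import pandas as pd

# Made-up daily sales rows for illustration only
toy = pd.DataFrame({
    'shop_id':      [2, 2, 2, 3],
    'item_id':      [31, 31, 486, 31],
    'item_cnt_day': [1.0, -1.0, 25.0, 2.0],
})

# Clip each daily count to [0, 20] before summing, as in the cell above
toy['item_cnt_day'] = toy['item_cnt_day'].clip(0, 20)

# Sum per (shop_id, item_id) and rename to the monthly column name
monthly = (
    toy.groupby(['shop_id', 'item_id'], as_index=False)['item_cnt_day']
       .sum()
       .rename(columns={'item_cnt_day': 'item_cnt_month'})
)
print(monthly)
```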


To match the aggregations with the IDs provided in the sample submission, I merge test with the sample submission.

In [5]:
test = test.merge(sample_submission, on=["ID"], how="left")
test.head()

Unnamed: 0,ID,shop_id,item_id,item_cnt_month
0,0,5,5037,0.5
1,1,5,5320,0.5
2,2,5,5233,0.5
3,3,5,5232,0.5
4,4,5,5268,0.5


We don't need the placeholder item_cnt_month column from the sample submission, so I drop it.

In [8]:
test.drop(columns=['item_cnt_month'], inplace=True)
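Since the sample column is dropped straight away, the merge mainly confirms that test and the sample submission share the same ID values. A hedged sketch of that check on made-up frames follows; the frame names `toy_test` and `toy_sample` are hypothetical, and the column names are assumed to match the output shown above.

```python
import pandas as pd

# Made-up stand-ins for test.csv and sample_submission.csv
toy_test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 5], 'item_id': [5037, 5320, 5233]})
toy_sample = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [0.5, 0.5, 0.5]})

# validate='one_to_one' makes merge raise if an ID is duplicated on either side
checked = toy_test.merge(toy_sample, on='ID', how='left', validate='one_to_one')

# No NaN in the merged column means every test ID was found in the sample submission
ids_match = bool(checked['item_cnt_month'].notna().all())

# Dropping the placeholder column restores the original test columns
restored = checked.drop(columns=['item_cnt_month'])
print(ids_match, list(restored.columns))
```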

Merge the aggregations stored in train_subset with the test set, and check how many missing values we get.

In [9]:
merged = test.merge(train_subset, on=["shop_id", "item_id"], how="left")[["ID", "item_cnt_month"]]
merged.isna().sum()

ID                     0
item_cnt_month    185520
dtype: int64
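A left merge keeps every test row and leaves item_cnt_month as NaN wherever the (shop, item) pair had no rows in the previous month. The sketch below, on made-up data, uses the `indicator=True` option of `merge` to count those unmatched rows explicitly and compares the result with the NaN count:

```python
import pandas as pd

# Made-up test rows and previous-month aggregates for illustration only
toy_test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 6], 'item_id': [10, 11, 10]})
toy_agg = pd.DataFrame({'shop_id': [5], 'item_id': [10], 'item_cnt_month': [3.0]})

# indicator=True adds a '_merge' column marking where each row came from
merged_toy = toy_test.merge(toy_agg, on=['shop_id', 'item_id'], how='left', indicator=True)

# 'left_only' rows are test pairs with no match in the aggregates
n_unmatched = int((merged_toy['_merge'] == 'left_only').sum())
n_missing = int(merged_toy['item_cnt_month'].isna().sum())
print(n_unmatched, n_missing)
```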

Fill the missing values with zero and clip the result to the range [0, 20].

In [10]:
merged['item_cnt_month'] = merged.item_cnt_month.fillna(0).clip(0,20)
submission = merged.set_index('ID')
submission.to_csv('benchmark.csv')
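Before uploading, it can help to read the file back and confirm its layout. This is a minimal sketch on a made-up frame that writes to an in-memory buffer instead of benchmark.csv; the names `toy_merged` and `toy_submission` are hypothetical.

```python
import io
import pandas as pd

# Made-up merged frame with one missing value and one value above the cap
toy_merged = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [3.0, None, 27.0]})

# Same post-processing as the cell above: fill NaN with 0, then clip to [0, 20]
toy_merged['item_cnt_month'] = toy_merged['item_cnt_month'].fillna(0).clip(0, 20)
toy_submission = toy_merged.set_index('ID')

# Write to an in-memory buffer and read it back to inspect the CSV layout
buffer = io.StringIO()
toy_submission.to_csv(buffer)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

columns = list(round_trip.columns)
values = round_trip['item_cnt_month'].tolist()
print(columns, values)
```

Setting ID as the index means `to_csv` writes it as the first column, so the file has exactly the two columns ID and item_cnt_month.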

After submitting the CSV file produced above, I get the expected score.