In this notebook, I present my approach to solving the future sales prediction challenge from the Coursera course "How to Win a Data Science Competition".

This notebook is organized as follows:
1. Exploratory Data Analysis
2. Data Preprocessing including feature engineering
3. Modeling including feature selection and model hyperparameters optimization
4. Ensembling
5. Submitting the predictions

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
import random

In [None]:
print(f'Numpy: {np.__version__}')
print(f'Pandas: {pd.__version__}')
print(f'Matplotlib: {matplotlib.__version__}')
print(f'Seaborn: {sns.__version__}')

In the submitted version of my notebook I used the following versions of the libraries:
* Numpy 1.18.5
* Pandas 1.1.5
* Matplotlib 3.2.1
* Seaborn 0.10.0

# 1. Exploratory Data Analysis (EDA)

For this challenge, six CSV files are provided:
* sales_train.csv contains the daily historical sales data from January 2013 to October 2015.
* test.csv contains the entries we are supposed to predict for November 2015.
* sample_submission.csv is an example of a submission file.
* items.csv contains supplemental information about the items/products.
* item_categories.csv contains supplemental information about the item categories.
* shops.csv contains supplemental information about the shops.

Let's load the data and investigate it. 
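
The same three checks (data types, missing values, unique counts) are repeated for every file below. As a minimal sketch, they could be gathered into one table per file; `overview` and the toy frame are hypothetical names used only for illustration, not part of the notebook:

```python
import pandas as pd

def overview(df):
    """One row per column: dtype, number of missing values, number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })

# Toy frame standing in for one of the competition files.
toy = pd.DataFrame({"shop_id": [0, 1, 1], "item_price": [10.0, None, 5.0]})
summary = overview(toy)
print(summary)
```

Calling `overview(train)`, `overview(items)` and so on would give the same information as the separate cells below.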

## a. Loading Data and Overall View

### i. Train Data

In [None]:
train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
print('Train Sales Data')
train.head(2)

* Checking for data types

In [None]:
train.dtypes

* Checking for missing data

In [None]:
train.isna().sum()

* Statistical Summary

In [None]:
train.describe()

### ii. Items Extra Data

In [None]:
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
print('Items Data')
items.head(2)

* Data Types

In [None]:
items.dtypes

* Missing Values

In [None]:
items.isna().sum()

* Statistical Summary

Since the data is categorical, I will just present the number of unique values.

In [None]:
items.nunique()

In [None]:
items.shape[0]

### iii. Item Categories

In [None]:
it_cat = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
print('Item Categories Data')
it_cat.head(2)

There is no need to investigate the data types here, so let's go directly to the missing data.
* Missing Data

In [None]:
it_cat.isna().sum()

* Statistical Summary 

In [None]:
it_cat.nunique()

In [None]:
it_cat.shape[0]

This number of categories equals the number of unique categories found in the items data.

### iv. Shops Data

In [None]:
shop = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
print("Shops Data")
shop.head(2)

In [None]:
shop.isna().sum()

In [None]:
shop.nunique()

In [None]:
shop.shape[0]

### v. Test Data

In [None]:
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
print('Test Data')
test.head(2)

* Missing Values?

In [None]:
test.isna().sum()

* Statistical Summary

In [None]:
test.nunique()

So for the test set we only have to predict sales for 42 shops and 5,100 items, even though there are 60 shops and 22,170 items in total. Let's check how many items and shops appear in the training set.

In [None]:
train.nunique()

Apparently, the training data contains shops that we won't have to predict for in the test data. **Maybe we should get rid of these shops.**
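
If we wanted to try that idea, a minimal sketch could look like the following. The toy frames are made up for illustration, and whether this filtering actually helps is an open question; the rest of this notebook keeps all shops.

```python
import pandas as pd

# Toy stand-ins for the train and test frames.
train_toy = pd.DataFrame({"shop_id": [1, 2, 3, 3], "item_cnt_day": [1.0, 2.0, 1.0, 4.0]})
test_toy = pd.DataFrame({"shop_id": [2, 3]})

# Keep only training rows whose shop also appears in the test set.
keep = train_toy["shop_id"].isin(test_toy["shop_id"].unique())
train_filtered = train_toy[keep].reset_index(drop=True)
print(train_filtered)
```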

### vi. Submission Sample

In [None]:
sub = pd.read_csv('../input/competitive-data-science-predict-future-sales/sample_submission.csv')
sub.head(2)

So our objective is to predict the monthly count of items sold for each (shop, item) pair.
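
Since the training file is daily, the target has to be built by summing the daily counts per month, shop and item. A small sketch on made-up numbers, including the clipping to the range [0, 20] that this notebook applies later on (the toy data and variable names are illustrative only):

```python
import pandas as pd

daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [7, 7, 8, 7],
    "item_cnt_day":   [2.0, 30.0, -1.0, 1.0],
})

# Sum the daily counts into one row per (month, shop, item).
monthly = (daily
           .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
           .sum()
           .rename(columns={"item_cnt_day": "item_cnt_month"}))

# Clip the monthly totals to [0, 20].
monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
print(monthly)
```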

## b. Basic Data Investigation

### i. Train Data

In [None]:
train.head(2)

Let's investigate the range of the number of items sold per day (the quantity the target is built from).

In [None]:
train.item_cnt_day.min(), train.item_cnt_day.max()

So apparently some records have negative counts, which correspond to items being returned to the shop, while some items are sold by the thousands in a single day.

* Let's see how many entries represent return transactions

In [None]:
train[train.item_cnt_day<0]

So we have 7,356 return transactions.

At first glance, it seems like the shop with id 25 is responsible for all these returns. **Let's check that out and compute the total number of returned items per shop id.**

In [None]:
return_per_shop = train[train.item_cnt_day<0].groupby('shop_id')['item_cnt_day'].sum()
return_per_shop

Well, no: many shops are involved in return transactions. **Which ones have the highest number of returns?**

In [None]:
return_per_shop.sort_values(ascending=True).iloc[:10]

So here is the list of the 10 shops with the highest number of returns (the sums are negative, so the most negative values come first). **What about those with few returns?**

In [None]:
return_per_shop.sort_values(ascending=False).iloc[:10]

And those are the shops with the fewest returns.

**Does the number of returns depend on the item id?** Let's get the list of the 10 most returned items and the 10 least returned ones.

In [None]:
return_per_item = train[train.item_cnt_day<0].groupby('item_id')['item_cnt_day'].sum()
return_per_item

In [None]:
return_per_item.sort_values(ascending=True).iloc[:10]

In [None]:
return_per_item.sort_values(ascending=False).iloc[:10]

Let's see how returns evolve over time (monthly returns).

In [None]:
returns_monthly = train[train.item_cnt_day<0].groupby('date_block_num')['item_cnt_day'].sum()
returns_monthly

In [None]:
returns_monthly.plot()
plt.xlabel('date_block_num')
plt.ylabel('Number of returned Items')
plt.show()

It seems that the number of returns is quite seasonal and that it evolves over time (it is not stationary).

**Q: Do we have entries with no sold items?**

In [None]:
no_sold = train[train.item_cnt_day==0.0]
no_sold

No, and this is quite logical: a row only exists when a sale or a return actually took place.


**Let's check how the total number of items sold per month evolves over time!**

In [None]:
sold_per_month = train.groupby('date_block_num')['item_cnt_day'].sum()

In [None]:
sold_per_month.plot()
plt.xlabel('date_block_num')
plt.ylabel('Number of sold Items')
plt.show()

Here are some observations we can derive from this graph:
* First, the number of items sold seems to decrease over time.
* We also see two peaks in the number of items sold around months 11/12 and 23/24 (date_block_num 11 and 23 are December 2013 and December 2014).
* There seems to be seasonality in the data, probably with a yearly period.
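
The yearly seasonality could also be checked numerically rather than by eye. One simple option, shown here as a sketch on synthetic data, is to remove the trend with first differences and then measure the correlation of the series with itself 12 months earlier. `yearly_autocorr` is a hypothetical helper name, and the toy series (a linear decline plus a spike every December) is made up:

```python
import numpy as np
import pandas as pd

def yearly_autocorr(monthly_totals, period=12):
    # First differences remove a linear trend; what remains is then
    # correlated with its own values one period earlier.
    return monthly_totals.diff().autocorr(lag=period)

months = np.arange(36)
toy = pd.Series(100.0 - months + 20.0 * (months % 12 == 11))
r = yearly_autocorr(toy)
print(r)
```

A value close to 1 supports a yearly pattern. On the real data this would be `yearly_autocorr(sold_per_month)`, keeping in mind that 34 months leave few pairs to correlate, so the number is only indicative.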

**Let's check the evolution in time of the total count of sold items per month and per shop**

In [None]:
plt.figure()
tmp_data = train.groupby(['date_block_num', 'shop_id'], as_index=False)['item_cnt_day'].sum()
sns.lineplot(x='date_block_num', y='item_cnt_day', hue='shop_id', data = tmp_data)
plt.xlabel('Month block number (date_block_num)')
plt.ylabel('Number of Sold Items')
plt.show()

Although the figure is not very clear, we can get a sense that most shops follow the same sales pattern as the overall total. We can also notice that some shops do not have data for all dates, which means that some shops opened after the start of the period. Let's check the first month of sales for each shop!

In [None]:
tmp_data = train.groupby('shop_id')['date_block_num'].min()
tmp_data[tmp_data>0]

**Do these shops exist in the test set?**

Yes, and here is the list of those shops that are also in the test set:

In [None]:
test[test['shop_id'].isin(tmp_data[tmp_data>0].index)]['shop_id'].unique()

One of these shops only started selling in the very last months of the period! This will affect our model.

Does this also apply to items? Do we have items that are not present in the training data, or for which we have very little data?

In [None]:
test_items = test['item_id'].unique()
test_items

In [None]:
tmp_data = train.groupby('item_id', as_index=False)['date_block_num'].min()
tmp_data

In [None]:
tmp_data[tmp_data.item_id.isin(test_items)].sort_values(by='date_block_num', ascending=False)

Yes, we also have new items.

How should we use this in our model?

In [None]:
test_items.shape[0]

Comparing this count with the number of rows in the table above shows that some test items do not exist in the training set at all.
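
The unseen items can also be listed directly with a set difference. A minimal sketch on toy data (the frames and ids are made up):

```python
import pandas as pd

train_toy = pd.DataFrame({"item_id": [10, 11, 11, 12]})
test_toy = pd.DataFrame({"item_id": [11, 12, 13, 14]})

# Items that appear in the test set but never in the training set.
unseen = sorted(set(test_toy["item_id"]) - set(train_toy["item_id"]))
print(len(unseen), unseen)
```

On the real data the same expression with `test` and `train` would give the number of test items that have no sales history.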

**Do we have outliers?**

In [None]:
sns.boxplot(x='item_cnt_day', data=train)

Let's check which items are sold at 1,000 pieces or more in a single day.

In [None]:
train[train['item_cnt_day']>=1000]

In [None]:
items[items['item_id'].isin([20949, 11373])]['item_name'].values

I think these are outliers and should be dropped. 

In [None]:
sns.boxplot(x='item_price', data=train)

We have an item priced over 300,000; let's check this one.

In [None]:
train[train['item_price']>=100000]

In [None]:
items[items['item_id']==6066]

This is also an outlier and should either be fixed or dropped. **Do we have other sales records for this item?**

In [None]:
train[train['item_id']==6066]

Actually, it is a single record, so let's drop it.

# 2. Data Preprocessing

## a. Drop Outliers

In [None]:
train = train[train['item_cnt_day']<1000]
train = train[train['item_price']<100000]
train.reset_index(drop=True, inplace=True)

## b. Feature engineering

Here are my main feature engineering ideas:
* From the date, find out whether the day is a holiday or a weekend
* Compute the number of holidays per month
* Get the month id
* Encode the item category name
* Encode the item name
* Get the sales of previous months
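
The date-based ideas in this list can be sketched on a toy frame as follows. This is only an illustration under simple assumptions: dates are in the `DD.MM.YYYY` format used by sales_train.csv, the toy rows are made up, and the holiday lookup is left out to keep the sketch free of extra dependencies.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["01.01.2013", "05.01.2013", "06.01.2013", "05.01.2013", "02.02.2013"],
    "date_block_num": [0, 0, 0, 0, 1],
})

# Parse the day-first strings explicitly instead of rearranging them by hand.
toy["parsed"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")
toy["month"] = toy["parsed"].dt.month
toy["weekend"] = toy["parsed"].dt.dayofweek >= 5  # 5 = Saturday, 6 = Sunday

# Number of distinct weekend days with sales in each month block.
n_weekend_days = (toy[toy["weekend"]]
                  .groupby("date_block_num")["parsed"]
                  .nunique())
print(n_weekend_days)
```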

In [None]:
import holidays

In [None]:
ru_holidays = holidays.Russia()

Before checking whether a date is a holiday, we should change the date format.

In [None]:
train['new_date'] = train['date'].apply(lambda x: x.split('.')[1]+'/'+x.split('.')[0]+'/'+x.split('.')[2])

In [None]:
train

In [None]:
train['is_holiday'] = train['new_date'].apply(lambda x: x in ru_holidays)

In [None]:
train

In [None]:
train[train['date_block_num']==0]['new_date'].nunique()

In [None]:
tmp_data = train.groupby(['date_block_num', 'new_date'], as_index=False)['is_holiday'].sum()
tmp_data['is_holiday'] = tmp_data['is_holiday']>0
n_holidays = tmp_data.groupby(['date_block_num'])['is_holiday'].sum()
n_holidays

In [None]:
train['new_date'] = pd.to_datetime(train['new_date'])

In [None]:
train['weekend'] = train['new_date'].apply(lambda x : x.weekday() in [5, 6])

In [None]:
tmp_data = train.groupby(['date_block_num', 'date'], as_index=False)['weekend'].sum()
tmp_data['weekend'] = tmp_data['weekend']>0
n_weekends = tmp_data.groupby(['date_block_num'])['weekend'].sum()
n_weekends

In [None]:
train['money'] = train['item_price'] * train['item_cnt_day']

In [None]:
train.groupby(['date_block_num', 'shop_id'], as_index=False)['money'].sum()

In [None]:
train['month'] = train['date'].apply(lambda x: x.split('.')[1])
train

In [None]:
# Some shops appear under two ids; map each pair to a single id in both train and test.
train.loc[train.shop_id == 0, "shop_id"] = 57
test.loc[test.shop_id == 0 , "shop_id"] = 57
train.loc[train.shop_id == 1, "shop_id"] = 58
test.loc[test.shop_id == 1 , "shop_id"] = 58
train.loc[train.shop_id == 11, "shop_id"] = 10
test.loc[test.shop_id == 11, "shop_id"] = 10
train.loc[train.shop_id == 40, "shop_id"] = 39
test.loc[test.shop_id == 40, "shop_id"] = 39

In [None]:
X_train = train.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day'].sum()
X_train

In [None]:
X_train['Weekends'] = X_train['date_block_num'].map(n_weekends)
X_train['Holidays'] = X_train['date_block_num'].map(n_holidays)
X_train

In [None]:
X_train['month'] = X_train['date_block_num'].map(train.groupby(['date_block_num'])['month'].unique())
X_train['month'] = X_train['month'].apply(lambda x : int(x[0]))

In [None]:
X_train

**Let's get features from the shops dataframe**

In [None]:
shop

I googled some of the shop names and took a glance at some of the EDA notebooks. It seems that the shop name contains more information than just the name: the first word is the city and the second is the type of shop. So let's extract these features!

In [None]:
shop.loc[ shop.shop_name == 'Сергиев Посад ТЦ "7Я"',"shop_name" ] = 'СергиевПосад ТЦ "7Я"'
shop["city"] = shop.shop_name.str.split(" ").map( lambda x: x[0] )
shop["type"] = shop.shop_name.str.split(" ").map( lambda x: x[1] )
shop.loc[shop.city == "!Якутск", "city"] = "Якутск"

In [None]:
shop

It seems to me that the main shop types are ТЦ, ТРЦ, ТК, ТРК and MTPЦ. I'm not sure the remaining values are real categories, so I will merge them into a single category called other.

In [None]:
shop.loc[~shop.type.isin(['ТЦ', 'ТРЦ', 'ТК', 'ТРК', 'MTPЦ']),"type"] = 'other'
shop

In [None]:
shops = shop[['shop_id', 'city', 'type']].copy()  # copy so the encoding below does not write into a view of shop

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
shops["city"] = LabelEncoder().fit_transform( shops.city )
shops["type"] = LabelEncoder().fit_transform( shops.type )

In [None]:
shops

In [None]:
import re
def clean_name(x):
    x = x.lower()
    x = x.partition('[')[0]
    x = x.partition('(')[0]
    x = re.sub('[^A-Za-z0-9А-Яа-я]+', ' ', x)
    x = x.replace('  ', ' ')
    x = x.strip()
    return x

In [None]:
# Split off the text inside square brackets (name2) and round brackets (name3).
items[["name1", "name2"]] = items.item_name.str.split("[", n=1, expand=True)
items[["name1", "name3"]] = items.item_name.str.split("(", n=1, expand=True)

items["name2"] = items.name2.str.replace('[^A-Za-z0-9А-Яа-я]+', " ", regex=True).str.lower()
items["name3"] = items.name3.str.replace('[^A-Za-z0-9А-Яа-я]+', " ", regex=True).str.lower()
items = items.fillna('0')

items["item_name"] = items["item_name"].apply(lambda x: clean_name(x))
items.name2 = items.name2.apply( lambda x: x[:-1] if x !="0" else "0")

In [None]:
items["type"] = items.name2.apply(lambda x: x[0:8] if x.split(" ")[0] == "xbox" else x.split(" ")[0] )
items.loc[(items.type == "x360") | (items.type == "xbox360") | (items.type == "xbox 360") ,"type"] = "xbox 360"
items.loc[ items.type == "", "type"] = "mac"
items.type = items.type.apply( lambda x: x.replace(" ", "") )
items.loc[ (items.type == 'pc' )| (items.type == 'pс') | (items.type == "pc"), "type" ] = "pc"
items.loc[ items.type == 'рs3' , "type"] = "ps3"

In [None]:
group_sum = items.groupby(["type"], as_index=False).agg({"item_id": "count"})
group_sum

In [None]:
drop_cols = group_sum.loc[group_sum['item_id']<40,'type'].values.tolist()

In [None]:
items.name2 = items.name2.apply( lambda x: "etc" if (x in drop_cols) else x )
items = items.drop(["type"], axis = 1)

In [None]:
items.name2 = LabelEncoder().fit_transform(items.name2)
items.name3 = LabelEncoder().fit_transform(items.name3)

items.drop(["item_name", "name1"],axis = 1, inplace= True)
items.head()

In [None]:
from itertools import product
import time
ts = time.time()
matrix = []
cols  = ["date_block_num", "shop_id", "item_id"]
for i in range(34):
    sales = train[train.date_block_num == i]
    matrix.append( np.array(list( product( [i], sales.shop_id.unique(), sales.item_id.unique() ) ), dtype = np.int16) )

matrix = pd.DataFrame( np.vstack(matrix), columns = cols )
matrix["date_block_num"] = matrix["date_block_num"].astype(np.int8)
matrix["shop_id"] = matrix["shop_id"].astype(np.int8)
matrix["item_id"] = matrix["item_id"].astype(np.int16)
matrix.sort_values( cols, inplace = True )
time.time()- ts
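
The cell above builds, for each month, every combination of the shops and items that had at least one sale in that month, so that pairs with no sales get an explicit row (filled with 0 in a later cell). A small sketch of the same idea on made-up data; the toy frame and variable names are illustrative only:

```python
from itertools import product
import pandas as pd

sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id":        [1, 2, 1],
    "item_id":        [10, 11, 10],
})

rows = []
for block, month_sales in sales.groupby("date_block_num"):
    # Cartesian product of the shops and items seen in this month.
    rows.extend(product([block],
                        month_sales["shop_id"].unique(),
                        month_sales["item_id"].unique()))

grid = pd.DataFrame(rows, columns=["date_block_num", "shop_id", "item_id"])
print(grid)
```

Month 0 has two shops and two items, so it yields four rows even though only two sales were recorded; month 1 yields a single row.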

In [None]:
train["revenue"] = train["item_cnt_day"] * train["item_price"]

In [None]:
ts = time.time()
group = train.groupby( ["date_block_num", "shop_id", "item_id"] ).agg( {"item_cnt_day": ["sum"]} )
group.columns = ["item_cnt_month"]
group.reset_index( inplace = True)
matrix = pd.merge( matrix, group, on = cols, how = "left" )
matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0).clip(0,20).astype(np.float16)
time.time() - ts

In [None]:
test["date_block_num"] = 34
test["date_block_num"] = test["date_block_num"].astype(np.int8)
test["shop_id"] = test.shop_id.astype(np.int8)
test["item_id"] = test.item_id.astype(np.int16)

In [None]:
ts = time.time()

matrix = pd.concat([matrix, test.drop(["ID"],axis = 1)], ignore_index=True, sort=False, keys=cols)
matrix.fillna( 0, inplace = True )
time.time() - ts

In [None]:
matrix

In [None]:
ts = time.time()
matrix = pd.merge( matrix, shops, on = ["shop_id"], how = "left" )
matrix = pd.merge(matrix, items, on = ["item_id"], how = "left")
matrix["city"] = matrix["city"].astype(np.int8)
matrix["type"] = matrix["type"].astype(np.int8)
matrix["item_category_id"] = matrix["item_category_id"].astype(np.int8)
matrix["name2"] = matrix["name2"].astype(np.int8)
matrix["name3"] = matrix["name3"].astype(np.int16)
time.time() - ts

In [None]:
def lag_feature( df,lags, cols ):
    for col in cols:
        print(col)
        tmp = df[["date_block_num", "shop_id","item_id",col ]]
        for i in lags:
            shifted = tmp.copy()
            shifted.columns = ["date_block_num", "shop_id", "item_id", col + "_lag_"+str(i)]
            shifted.date_block_num = shifted.date_block_num + i
            df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
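
`lag_feature` above works by relabelling month t as month t + lag and merging the shifted copy back, so each row receives the value observed lag months earlier. A stripped-down sketch for a single column and a single lag; the helper `add_lag` and the toy data are hypothetical, not part of the notebook:

```python
import pandas as pd

def add_lag(df, lag, col):
    keys = ["date_block_num", "shop_id", "item_id"]
    shifted = df[keys + [col]].copy()
    shifted["date_block_num"] += lag  # month t's value is now labelled t + lag
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    return df.merge(shifted, on=keys, how="left")

toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [7, 7, 7],
    "item_cnt_month": [3.0, 4.0, 6.0],
})
lagged = add_lag(toy, 1, "item_cnt_month")
print(lagged)
```

The first month has no predecessor, so its lag value is missing; this is why the earliest months are dropped before training.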

In [None]:
ts = time.time()

matrix = lag_feature( matrix, [1,2,3], ["item_cnt_month"] )
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby( ["date_block_num"] ).agg({"item_cnt_month" : ["mean"]})
group.columns = ["date_avg_item_cnt"]
group.reset_index(inplace = True)

matrix = pd.merge(matrix, group, on = ["date_block_num"], how = "left")
matrix.date_avg_item_cnt = matrix["date_avg_item_cnt"].astype(np.float16)
matrix = lag_feature( matrix, [1], ["date_avg_item_cnt"] )
matrix.drop( ["date_avg_item_cnt"], axis = 1, inplace = True )
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix.date_item_avg_item_cnt = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3], ['date_item_avg_item_cnt'])
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby( ["date_block_num","shop_id"] ).agg({"item_cnt_month" : ["mean"]})
group.columns = ["date_shop_avg_item_cnt"]
group.reset_index(inplace = True)

matrix = pd.merge(matrix, group, on = ["date_block_num","shop_id"], how = "left")
matrix["date_shop_avg_item_cnt"] = matrix["date_shop_avg_item_cnt"].astype(np.float16)
matrix = lag_feature( matrix, [1,2,3], ["date_shop_avg_item_cnt"] )
matrix.drop( ["date_shop_avg_item_cnt"], axis = 1, inplace = True )
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby( ["date_block_num","shop_id","item_id"] ).agg({"item_cnt_month" : ["mean"]})
group.columns = ["date_shop_item_avg_item_cnt"]
group.reset_index(inplace = True)

matrix = pd.merge(matrix, group, on = ["date_block_num","shop_id","item_id"], how = "left")
matrix["date_shop_item_avg_item_cnt"] = matrix["date_shop_item_avg_item_cnt"].astype(np.float16)
matrix = lag_feature( matrix, [1,2,3], ["date_shop_item_avg_item_cnt"] )
matrix.drop( ["date_shop_item_avg_item_cnt"], axis = 1, inplace = True )
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id'], how='left')
matrix.date_shop_subtype_avg_item_cnt = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_shop_subtype_avg_item_cnt'])
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

In [None]:
ts = time.time()
group = matrix.groupby(['date_block_num', 'city']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_city_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', "city"], how='left')
matrix.date_city_avg_item_cnt = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_city_avg_item_cnt'])
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts


In [None]:
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id', 'city']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_city_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'city'], how='left')
matrix.date_item_city_avg_item_cnt = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_item_city_avg_item_cnt'])
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

In [None]:
ts = time.time()
group = train.groupby( ["item_id"] ).agg({"item_price": ["mean"]})
group.columns = ["item_avg_item_price"]
group.reset_index(inplace = True)

matrix = matrix.merge( group, on = ["item_id"], how = "left" )
matrix["item_avg_item_price"] = matrix.item_avg_item_price.astype(np.float16)


group = train.groupby( ["date_block_num","item_id"] ).agg( {"item_price": ["mean"]} )
group.columns = ["date_item_avg_item_price"]
group.reset_index(inplace = True)

matrix = matrix.merge(group, on = ["date_block_num","item_id"], how = "left")
matrix["date_item_avg_item_price"] = matrix.date_item_avg_item_price.astype(np.float16)
lags = [1, 2, 3]
matrix = lag_feature( matrix, lags, ["date_item_avg_item_price"] )
for i in lags:
    matrix["delta_price_lag_" + str(i) ] = (matrix["date_item_avg_item_price_lag_" + str(i)]- matrix["item_avg_item_price"] )/ matrix["item_avg_item_price"]

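def select_trends(row):
    # Return the most recent price change that is neither missing nor zero.
    # NaN is truthy in Python, so missing values have to be tested explicitly.
    for i in lags:
        value = row["delta_price_lag_" + str(i)]
        if pd.notna(value) and value != 0:
            return value
    return 0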

matrix["delta_price_lag"] = matrix.apply(select_trends, axis = 1)
matrix["delta_price_lag"] = matrix.delta_price_lag.astype( np.float16 )
matrix["delta_price_lag"] = matrix["delta_price_lag"].fillna(0)

features_to_drop = ["item_avg_item_price", "date_item_avg_item_price"]
for i in lags:
    features_to_drop.append("date_item_avg_item_price_lag_" + str(i) )
    features_to_drop.append("delta_price_lag_" + str(i) )
matrix.drop(features_to_drop, axis = 1, inplace = True)
time.time() - ts

In [None]:
ts = time.time()
group = train.groupby( ["date_block_num","shop_id"] ).agg({"revenue": ["sum"] })
group.columns = ["date_shop_revenue"]
group.reset_index(inplace = True)

matrix = matrix.merge( group , on = ["date_block_num", "shop_id"], how = "left" )
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

group = group.groupby(["shop_id"]).agg({ "date_shop_revenue":["mean"] })
group.columns = ["shop_avg_revenue"]
group.reset_index(inplace = True )

matrix = matrix.merge( group, on = ["shop_id"], how = "left" )
matrix["shop_avg_revenue"] = matrix.shop_avg_revenue.astype(np.float32)
matrix["delta_revenue"] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix["delta_revenue"] = matrix["delta_revenue"]. astype(np.float32)

matrix = lag_feature(matrix, [1], ["delta_revenue"])
matrix["delta_revenue_lag_1"] = matrix["delta_revenue_lag_1"].astype(np.float32)
matrix.drop( ["date_shop_revenue", "shop_avg_revenue", "delta_revenue"] ,axis = 1, inplace = True)
time.time() - ts

In [None]:
matrix["month"] = matrix["date_block_num"] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix["days"] = matrix["month"].map(days).astype(np.int8)

In [None]:
ts = time.time()
matrix["item_shop_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id","shop_id"])["date_block_num"].transform('min')
matrix["item_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id"])["date_block_num"].transform('min')
time.time() - ts

In [None]:
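# Month 34 (November 2015) is not in train, so fill it by hand:
# it has 9 weekend days and 1 public holiday (4 November).
matrix['Weekends'] = matrix['date_block_num'].map(n_weekends).fillna(9)
matrix['Holidays'] = matrix['date_block_num'].map(n_holidays).fillna(1)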

In [None]:
matrix

In [None]:
ts = time.time()
data = matrix[matrix["date_block_num"] > 3]
time.time() - ts

In [None]:
from xgboost import XGBRegressor
import pickle

In [None]:
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

In [None]:
model = XGBRegressor(
    max_depth=9,
    n_estimators=1000,
    min_child_weight=1.5, 
    colsample_bytree=0.6, 
    subsample=0.7, 
    eta=0.01,
#     tree_method='gpu_hist',
    seed=42)

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 20)

In [None]:
Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# save predictions for an ensemble
with open('xgb_train.pickle', 'wb') as f:
    pickle.dump(Y_pred, f)
with open('xgb_test.pickle', 'wb') as f:
    pickle.dump(Y_test, f)
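
The validation predictions saved above are never scored explicitly. Assuming RMSE is the metric of interest (it is the `eval_metric` passed to XGBoost above), a minimal sketch of the check could look like this; `rmse` is a hypothetical helper and the numbers are made up:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Cast to float64 first: low-precision inputs would lose accuracy in the squares.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [0.0, 2.0, 20.0]
y_pred = np.array([0.5, 1.0, 25.0]).clip(0, 20)  # same clipping as the predictions above
score = rmse(y_true, y_pred)
print(score)
```

In the notebook this would be called as `rmse(Y_valid, Y_pred)`; the cast matters there because `item_cnt_month` is stored as float16.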

In [None]:
from xgboost import plot_importance
fig, ax = plt.subplots(1,1,figsize=(10,15))
plot_importance(booster=model, ax=ax)