Hello everyone! This notebook covers the Predict Future Sales task. It consists of EDA, feature engineering, leaderboard probing and, finally, model training. I tried to apply all the concepts learned in the course (https://www.coursera.org/learn/competitive-data-science) here. Please note that some of the ideas were borrowed from other competitors and from the forum; links to them are given throughout the notebook.

# Part 1 : EDA

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

In [None]:
! ls ../input/competitive-data-science-predict-future-sales/

In [None]:
SEED = 5

First of all, we will read and explore the data. We will start with the item_categories dataframe.

In [None]:
item_categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
item_categories.head()

In [None]:
item_categories['item_category_id'].nunique()

We should also check whether the item_category_name field can yield additional features.

In [None]:
item_categories['item_category_name'].values

Clearly, we can already derive some new features from this field.

Now we can move on to items.csv.

In [None]:
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
items.head()

In [None]:
items.item_id.nunique()

In [None]:
items.item_category_id.nunique()

Let's see how many items belong to each category.

In [None]:
plt.figure(figsize=(14,14))
items.groupby('item_category_id')['item_id'].size().plot.barh(rot=0)
plt.title('Number of items related to different categories')
plt.xlabel('Number of items')  # barh draws the counts along the x-axis
plt.ylabel('Categories');

The chart above shows that a few categories contain far more items than the others.

Let's find the names of the categories with the largest and smallest numbers of items.

In [None]:
items.groupby('item_category_id')['item_id'].size().mean(), items.groupby('item_category_id')['item_id'].size().max(),items.groupby('item_category_id')['item_id'].size().min()

In [None]:
item_categories[item_categories['item_category_id'].isin(items.groupby('item_category_id')['item_id'].size().nlargest(5).index)]

Clearly, many items are related to movies.

In [None]:
item_categories[item_categories['item_category_id']\
                .isin((items.groupby('item_category_id')['item_id'].size()[items.groupby('item_category_id')['item_id'].size()==1])\
                      .index)]

We see that a few categories have only one related item.

It's useful to check whether any category has no items, or whether any item belongs to more than one category.

In [None]:
(items.groupby('item_category_id')['item_id'].size()==0).astype(int).sum()

We see that no category is empty.

In [None]:
(items.groupby('item_id')['item_category_id'].size()>=2).sum()

And no item belongs to more than one category. Now let's merge the two dataframes.

In [None]:
items_categories_merged = items.merge(item_categories,left_on='item_category_id',right_on='item_category_id',how='inner')

In [None]:
items_categories_merged.head()

In [None]:
from collections import Counter
counter = Counter([i for i in np.hstack(items_categories_merged['item_name'].str.split(' ').values) if i])
sorted(counter.items(),key=lambda x: x[1])[::-1][:30]

We see that some words appear much more frequently than others. Maybe we should build a feature based on this.

In [None]:
len(items_categories_merged), len(items)

As the length of the data is the same before and after the merge, it seems that we haven't lost any rows. Now let's move on to the shops data.

In [None]:
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')

In [None]:
shops.head()

In [None]:
len(shops)

We see that we can already retrieve two new features from shop_name: the city of the shop and the type of the shop.

In [None]:
shops.shop_name

Now let's move to train_sales data.

In [None]:
train_sales = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')

Let's check for NaN's.

In [None]:
train_sales.isnull().sum(axis=1).sum()

There are no NaNs, unless they were imputed with other values.

In [None]:
train_sales.describe()

We see that the item_price field has a minimum of -1 and a maximum of about 3e05. These values could be placeholders for NaNs or outliers.

In [None]:
train_sales.head()

Let's check if our dataset is shuffled.

In [None]:
to_plot = train_sales['item_cnt_day'].rolling(5).sum()
plt.figure(figsize=(14,8))
plt.plot(range(len(to_plot.index)),to_plot.values)

The plot above suggests that the data is shuffled.
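
A complementary, non-visual check is whether the rows are already sorted by date. The snippet below is a minimal sketch on a hypothetical toy frame (the `toy` data and the variable names are illustrative and not part of the competition files).

```python
import pandas as pd

# Hypothetical toy frame: the dates are deliberately out of order.
toy = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "01.01.2013"],
    "item_cnt_day": [1.0, 2.0, 1.0],
})
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# True only if every row's date is >= the date of the previous row.
is_sorted_by_date = parsed.is_monotonic_increasing
print(is_sorted_by_date)
```

If the same check on the real `date` column returns `False`, the rows are not in chronological order.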

For simplicity, let's create three new features: day, month and year.

In [None]:
train_sales['day'] = train_sales['date'].apply(lambda x: x.split('.')[0])
train_sales['month'] = train_sales['date'].apply(lambda x: x.split('.')[1])
train_sales['year'] = train_sales['date'].apply(lambda x: x.split('.')[2])

In [None]:
train_sales.head()

It's important to know which date range we are working with, so let's find the minimum and maximum dates.

In [None]:
train_sales['date'] = pd.to_datetime(train_sales['date'],format='%d.%m.%Y')

In [None]:
train_sales['date'].min(),train_sales['date'].max()

In [None]:
train_sales.head(10)

My hypothesis is that the data is grouped by date_block_num, shop_id and item_id, as the dates are somewhat unordered.

Let's analyze the time series. First of all, we will plot the sum of item_cnt_day grouped by different date-related columns.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1,figsize=(16,20))
fig.tight_layout(pad=3.0)

to_plot = train_sales.groupby('date',as_index=False)['item_cnt_day'].sum().reset_index()
z = np.polyfit(y=to_plot['item_cnt_day'],x=to_plot['index'], deg=1)
p = np.poly1d(z)
ax1.plot(to_plot['date'],to_plot['item_cnt_day'],'-')
ax1.plot(to_plot['date'],p(to_plot['index'].values),'--r')
ax1.legend(['Sum of sold items','Trend line'])
ax1.title.set_text('Sum of sold items by date')

to_plot = train_sales.groupby('day')['item_cnt_day'].sum()
ax2.plot(to_plot.values,'-o')
ax2.title.set_text('Sum of sold items by day')
ax2.set_xticks(range(len(to_plot)))

to_plot = train_sales.groupby('month')['item_cnt_day'].sum()
ax3.plot(to_plot.values,'-o')
ax3.title.set_text('Sum of sold items by month')
ax3.set_xticks(range(len(to_plot)))

to_plot = train_sales.groupby('year',as_index=False)['item_cnt_day'].sum()
ax4.plot(to_plot['item_cnt_day'].values,'-o')
ax4.title.set_text('Sum of sold items by year')
ax4.set_xticks(range(len(to_plot)))
ax4.set_xticklabels(list(to_plot['year'].values));


The number of items sold clearly decreases over time. Let's now plot the same for the price.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1,figsize=(16,20))
fig.tight_layout(pad=3.0)

to_plot = train_sales.groupby('date',as_index=False)['item_price'].mean().reset_index()
z = np.polyfit(y=to_plot['item_price'],x=to_plot['index'], deg=1)
p = np.poly1d(z)
ax1.plot(to_plot['date'],to_plot['item_price'],'-')
ax1.plot(to_plot['date'],p(to_plot['index'].values),'--r')
ax1.legend(['Mean price of items','Trend line'])
ax1.title.set_text('Mean price of items by date')

to_plot = train_sales.groupby('day')['item_price'].mean()
ax2.plot(to_plot.values,'-o')
ax2.title.set_text('Mean price of items by day')
ax2.set_xticks(range(len(to_plot)))

to_plot = train_sales.groupby('month')['item_price'].mean()
ax3.plot(to_plot.values,'-o')
ax3.title.set_text('Mean price of items by month')
ax3.set_xticks(range(len(to_plot)))

to_plot = train_sales.groupby('year',as_index=False)['item_price'].mean()
ax4.plot(to_plot['item_price'].values,'-o')
ax4.title.set_text('Mean price of items by year')
ax4.set_xticks(range(len(to_plot)))
ax4.set_xticklabels(list(to_plot['year'].values));


We see that as the mean price goes up, the sum of sold items goes down, which suggests a dependency between price and item_cnt. We should probably use this as a feature.

Let's plot the average, maximum and minimum revenue per month over the whole period. But first, let's see how the months are distributed across the dataset.

In [None]:
dict_monthes = dict(train_sales['month'].value_counts())
# month labels are zero-padded strings ('01'..'12'), so int() restores the calendar order
monthes, frequencies = zip(*sorted(dict_monthes.items(),key=lambda x: int(x[0])))
plt.figure(figsize=(15,12))
plt.bar(range(len(monthes)),frequencies)
plt.title('Distribution of months in dataset')
plt.xticks(range(len(monthes)),monthes);

As we see, the months are in general evenly distributed across the dataset; only January (01) appears more frequently than the others.

In [None]:
train_sales['revenue'] = train_sales['item_price']*train_sales['item_cnt_day']
plt.figure(figsize=(14,8))
train_sales.groupby('month')['revenue'].mean().plot.bar(rot=0)
plt.title('Averaged revenue per month')
plt.xlabel('Months')
plt.ylabel('Revenue');

The average revenue is clearly highest in the 12th month, as those dates are close to the New Year holiday. Let's now visualize the min and max values.

In [None]:
plt.figure(figsize=(14,8))
train_sales.groupby('month')['revenue'].max().plot.bar(rot=0)
plt.title('Maximum revenue per month')
plt.xlabel('Months')
plt.ylabel('Revenue');

Hm, the maximum falls in the 11th month, which is interesting; it could be because of "Black Friday". As the revenue column can contain negative values (when goods are returned), we will visualize only the rows with revenue > 0.

In [None]:
plt.figure(figsize=(14,8))
train_sales[train_sales['revenue']>0].groupby('month')['revenue'].min().plot.bar()
plt.title('Minimum revenue per month')
plt.xlabel('Months')
plt.ylabel('Revenue');

In [None]:
plt.figure(figsize=(14,8))
train_sales.groupby('date_block_num')['revenue'].mean().plot.bar(rot=0)
plt.title('Averaged revenue per month (count)')
plt.xlabel('Relative number of months')
plt.ylabel('Revenue');

We also want to do the same for the day column.

In [None]:
plt.figure(figsize=(14,8))
train_sales.groupby('day')['revenue'].mean().plot.bar(rot=0)
plt.title('Averaged revenue per day')
plt.xlabel('Days')
plt.ylabel('Revenue');

In [None]:
mean_revenue_day_month = train_sales.groupby(['month','day'])['revenue'].mean()

mean_revenue_day_month[mean_revenue_day_month.isin(mean_revenue_day_month.nlargest(5))]

If we look up the date 2013/11/29, we find that it was Black Friday in Russia, so it is reasonable to have the maximum revenue on this day.

Now let's visualize the revenue for each year and for each week day and make some additional charts.

In [None]:
train_sales['year'].value_counts()

In [None]:
plt.figure(figsize=(14,8))
train_sales.groupby('year')['revenue'].mean().plot.bar(rot=0)
plt.title('Averaged revenue per year')
plt.xlabel('Year')
plt.ylabel('Revenue');

In [None]:
train_sales['dayname'] = train_sales['date'].dt.day_name()
train_sales.groupby('dayname')['revenue'].mean().plot.bar(rot=90)
plt.title('Averaged revenue per week day')
plt.xlabel('Week day')
plt.ylabel('Revenue');

In [None]:
train_sales['dayname'].value_counts().plot.bar()
plt.title('Distribution of week days in dataframe');

We will also visualize the distribution of the average item_price.

In [None]:
plt.figure(figsize=(14,8))
plt.title('Distribution of mean item_price')
mean_price = train_sales.groupby(['shop_id','item_id','date_block_num'])['item_price'].mean().values
plt.hist(mean_price,bins=30)
plt.xlabel('Values')
plt.ylabel('Frequency');

In [None]:
plt.figure(figsize=(14,8))
plt.title('Distribution of mean item_price on log scale')
plt.hist(np.log1p(train_sales.groupby(['shop_id','item_id','date_block_num'])['item_price'].mean().values),bins=30)
plt.xlabel('Values')
plt.ylabel('Frequency');

We see that, on the log scale, item prices are roughly normally distributed with some outliers, so we will need to clip them or use a log scale.

In [None]:
plt.scatter(train_sales['month'],train_sales['item_price']);

In [None]:
train_sales[train_sales['item_price']==train_sales['item_price'].max()]

In [None]:
train_sales[train_sales['item_price']==train_sales['item_price'].min()]

Now let's derive some deeper insights from our data.

In [None]:
shops_per_item = (train_sales.groupby('item_id')['shop_id'].nunique()>=2).astype(int).sum()
print('There are {0} items that relate to more than one shop'.format(shops_per_item))

Clearly, one item can be sold by multiple shops. Let's now understand the difference between the training and testing datasets. We already know that the data in the testing dataset is aggregated monthly.

In [None]:
(train_sales['item_id'].value_counts()==1).astype(int).sum()

There are certainly some items that were sold only once and may be out of date. Let's compare our dataframe with the test data.

In [None]:
test_sales = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
test_sales.info()

In [None]:
test_sales.head()

In [None]:
test_sales['shop_id'].value_counts().unique()

In [None]:
test_sales['item_id'].value_counts().unique()

It's strange that every shop and every item appears the same number of times. Multiplying the number of shops by the number of items gives exactly the number of rows in the dataset, so the test set seems to be just a catalog of shop/item pairs for which we need to predict sales. We can also notice that the data is somewhat ordered, namely by the shop_id and item_id columns.
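
The catalog observation can be checked programmatically: a full shop × item grid has exactly `n_shops * n_items` rows and no repeated pair. Below is a minimal sketch on a hypothetical toy frame (`toy_test`, `shops_toy` and `items_toy` are illustrative names, not competition data); the same two conditions could be evaluated on `test_sales`.

```python
import pandas as pd
from itertools import product

# Hypothetical toy "test set": every shop paired with every item.
shops_toy = [2, 5]
items_toy = [10, 11, 12]
toy_test = pd.DataFrame(list(product(shops_toy, items_toy)),
                        columns=["shop_id", "item_id"])

n_shops = toy_test["shop_id"].nunique()
n_items = toy_test["item_id"].nunique()

# A full Cartesian grid has n_shops * n_items rows and no duplicated pair.
is_full_grid = (
    len(toy_test) == n_shops * n_items
    and not toy_test.duplicated(["shop_id", "item_id"]).any()
)
print(n_shops, n_items, len(toy_test), is_full_grid)
```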

In [None]:
diff_test_items = set(train_sales.item_id.unique()).difference(test_sales.item_id.unique())
print('Number of items that are in train set, but are not in test one : {0}'.format(len(diff_test_items))) 
diff_train_items = set(test_sales.item_id.unique()).difference(train_sales.item_id.unique())
print('Number of items that are in test set, but are not in train one : {0}'.format(len(diff_train_items))) 
diff_test_shops = set(train_sales.shop_id.unique()).difference(test_sales.shop_id.unique())
print('Number of shops that are in train set, but are not in test one : {0}'.format(len(diff_test_shops))) 
diff_train_shops = set(test_sales.shop_id.unique()).difference(train_sales.shop_id.unique())
print('Number of shops that are in test set, but are not in train one : {0}'.format(len(diff_train_shops))) 

We see that there are items in the train set that don't appear in the test set at all, and vice versa. We can also see that some shops are not included in the test set. To deal with the items, we will later build a dataframe containing all possible combinations of item_id, shop_id and date_block_num and merge it with our data. For now, let's work with the item_cnt_day column and make some useful plots.

In [None]:
plt.figure(figsize=(15,12))
dict_returned = dict(train_sales[train_sales['item_cnt_day']<0].month.value_counts())
# month labels are zero-padded strings ('01'..'12'), so int() restores the calendar order
dict_returned = dict(sorted(dict_returned.items(), key=lambda x: int(x[0])))
plt.bar(range(len(dict_returned.values())),dict_returned.values())
plt.xticks(range(len(dict_returned.values())),dict_returned.keys())
plt.title('Number of times the goods were returned during different months')
plt.xlabel('Months')
plt.ylabel('Cases of returning the goods');


In [None]:
plt.figure(figsize=(15,12))
dict_returned = dict(train_sales[train_sales['item_cnt_day']<0].date_block_num.value_counts())
dict_returned = dict(sorted(dict_returned.items(), key=lambda x: int(x[0])))
plt.bar(range(len(dict_returned.values())),dict_returned.values())
plt.xticks(range(len(dict_returned.values())),dict_returned.keys())
plt.title('Number of times the goods were returned during different date_block_num')
plt.xlabel('date_block_num')
plt.ylabel('Cases of returning the goods');



From the plots above we can say that people tend to return items right after the New Year. Maybe it's because their presents weren't so good. Our hypothesis could be wrong, as January rows appear more often than others in the dataset, but I don't think that is the case: December (12) rows appear more often than February (02) rows, yet more items are returned in February. Now let's see which items are returned most often.

In [None]:
(train_sales[train_sales['item_cnt_day']<0]['item_id'].value_counts()).nlargest(50)

In [None]:
sales = train_sales[train_sales['item_cnt_day']<0]['item_id'].value_counts()
idx = list(sales[sales>=10].index)

In [None]:
items_categories_merged[items_categories_merged['item_id'].isin(idx)].item_category_name.unique()

We see that games are returned most often. The analysis above didn't directly produce new features, but it showed that we should probably build features based on the categories and types of items. So far we have the following takeaways:

1. We can make additional features from the item_categories df, such as the type of category.
1. We can make additional features from the shop_name field in the shops df, such as the city name and the shop type.
1. We can make additional features based on the item name, maybe using tf-idf or a count vectorizer.
1. We should probably extend our data with all combinations of shops, date_block_nums and item_ids; a missing combination simply means the item wasn't sold. It's also beneficial to bring our data into the same format as the test set.
1. There is a dependency between revenue and month number: people tend to buy more products during months that have holidays, so we can add a feature indicating whether the month has a holiday, plus the month number.
1. There is a dependency between price and the number of sold items. We can build some time-series features based on it, but we should also remember to deal with outliers in the item price. New features can also be constructed via mean encoding (as we have lots of categorical features).



# Part 2: Leaderboard probing

In [None]:
submission = pd.read_csv('../input/competitive-data-science-predict-future-sales/sample_submission.csv')
submission["item_cnt_month"] = 1
submission.to_csv('lb_probing1.csv',index=False)
submission["item_cnt_month"] = 0
submission.to_csv('lb_probing2.csv',index=False)

We can now submit the two predictions and calculate the mean of the leaderboard target. We can then use it to improve our score and to align the cross-validation set with the test one. We will use the following calculations (the full discussion of LB probing is available at this [link](https://www.kaggle.com/c/competitive-data-science-predict-future-sales/discussion/79142)).

$MSE_0 = \frac{\sum_{i=1}^{N} (y_i - 0)^2}{N} = \frac{\sum_{i=1}^{N} y_i^2}{N}$

$MSE_1 = \frac{\sum_{i=1}^{N} (y_i - 1)^2}{N} = \frac{\sum_{i=1}^{N} (y_i^2 - 2y_i + 1)}{N} = \frac{\sum_{i=1}^{N} y_i^2 - 2\sum_{i=1}^{N} y_i + N}{N} = MSE_0 - \frac{2}{N}\sum_{i=1}^{N} y_i + 1$

$\frac{\sum_{i=1}^{N} y_i}{N} = \frac{MSE_1 - MSE_0 - 1}{-2}$

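After the submissions we see that the all-zeros prediction gives an MSE of $1.25011^2$ and the all-ones prediction gives an MSE of $1.41241^2$ (the leaderboard reports RMSE, hence the squares). Let's now calculate the target mean of the public leaderboard.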

In [None]:
y_hat_mean = (1.41241**2-1.25011**2-1)/-2
print('Mean of target values in public leaderboard is : {0}'.format(y_hat_mean))
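
The identity used above can be sanity-checked on synthetic data. The snippet below is a minimal sketch: the targets `y` are randomly generated stand-ins (an assumption made for illustration, not the real hidden labels), and the mean recovered from the two constant "submissions" is compared with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical hidden targets, clipped to [0, 20] like the competition target.
y = np.clip(rng.poisson(0.3, size=10_000), 0, 20).astype(float)

mse0 = np.mean((y - 0.0) ** 2)  # squared score of the all-zeros submission
mse1 = np.mean((y - 1.0) ** 2)  # squared score of the all-ones submission

# Mean of the targets recovered from the two scores alone.
recovered_mean = (mse1 - mse0 - 1.0) / -2.0
print(recovered_mean, y.mean())
```

The two printed numbers agree up to floating-point error, which is why two constant submissions are enough to learn the mean of the leaderboard target.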

# Part 3: Feature engineering

## (1,3). Additional features based on items_categories_merged df.

In [None]:
items_categories_merged.head()

In the code below we will create a new feature named type_of_category, clean up item_name (remove noise) and construct tf-idf features based on it.

In [None]:
def exclude_preprositions(x):
    x = x.split(' ')
    x = ' '.join(i for i in x if not i in prepositions_to_exclude).strip()
    return x

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
items_categories_merged['type_of_category']=items_categories_merged['item_category_name'].apply(lambda x: x.split(' ')[0].strip())
dict_types = dict(items_categories_merged['type_of_category'].value_counts())
cat, _ = zip(*sorted(dict_types.items(),key=lambda x: x[1])[::-1][:5])
print('Most frequent types of categories : {0}'.format(cat))
num_features = 10
symbols_to_exclude = ['[',']','!','.',',','*','(',')','"',':']
prepositions_to_exclude = ['в','на','у','the','a','an','of','для']
for symbol in symbols_to_exclude:
    # regex=False: treat characters such as '[' or '(' literally
    items_categories_merged['item_name'] = items_categories_merged['item_name'].str.replace(symbol,'',regex=False)
items_categories_merged['item_name'] = items_categories_merged['item_name'].str.lower()
items_categories_merged['item_name'] = items_categories_merged['item_name'].str.replace('-',' ',regex=False)
items_categories_merged['item_name'] = items_categories_merged['item_name'].str.replace('/',' ',regex=False)
items_categories_merged['item_name'] = items_categories_merged['item_name'].str.strip()
items_categories_merged['item_name'] = items_categories_merged['item_name'].apply(exclude_preprositions)
vectorizer = TfidfVectorizer(max_features=num_features)
res = vectorizer.fit_transform(items_categories_merged['item_name'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
feature_names = list(vectorizer.get_feature_names_out())
print('Top {0} features of tfidf : {1}'.format(num_features,feature_names))
count_vect_df = pd.DataFrame(res.toarray(), columns=feature_names)
items_categories_merged = pd.concat([items_categories_merged,count_vect_df],axis=1)

In [None]:
items_categories_merged['type_of_category'].unique()

In [None]:
items_categories_merged.drop(columns=['item_name','item_category_name'],inplace=True)

In [None]:
import gc
del vectorizer, res, count_vect_df
gc.collect();

## 2. Additional features based on shop_name.

In [None]:
import re
def create_city_name(x):
    for i in not_city:
        if i in x:
            return 'unk_city'
    return x.split(' ')[0].strip()
def create_shop_type(x):
    to_return = 'unk_type'
    for i in type_of_shops:
        regex = re.compile(i)
        if re.search(regex,x):
                to_return = i 
    return to_return
not_city = ['Выездная Торговля','Интернет-магазин','Цифровой склад 1С-Онлайн']
type_of_shops = ['ТРЦ', 'ТЦ','ТРК','ТК','МТРЦ']+not_city
shops['city_name'] = shops['shop_name'].apply(create_city_name)
shops['shop_type'] = shops['shop_name'].apply(create_shop_type)

In [None]:
shops.head()

In [None]:
shops.drop(columns='shop_name',inplace=True)

## 4. Aggregating data.

We need our training data to be as similar to the test data as possible. The test data contains many items that were not sold, since we need to predict the number of sales for a whole catalog. To make train and test similar, we will create a product dataframe that contains every shop/item pair for each month; this gives us a target distribution close to that of the test set (the idea is taken from this [notebook](https://www.kaggle.com/dlarionov/feature-engineering-xgboost)).

In [None]:
mean = train_sales.groupby(['date_block_num','shop_id','item_id'])['item_cnt_day'].sum().mean()
print('Mean of target value in train data : {0}'.format(mean))
if np.abs(mean-y_hat_mean)<0.2:
    print('The mean of train and test targets is aligned!')
else:
    print('The mean of train and test targets is not aligned!')

In [None]:
from itertools import product
matrix = []
cols = ['shop_id','item_id','date_block_num']
for i in range(34):
    sales = train_sales[train_sales.date_block_num==i]
    matrix.append(np.array(list(product(sales.shop_id.unique(), sales.item_id.unique(),[i])), dtype='int16'))

matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols,inplace=True)

In [None]:
def fn(x):
    return list(x)[0]

In [None]:
# string aggregator names avoid the FutureWarning that recent pandas raises for np.sum / np.mean
train_sales = train_sales.groupby(['shop_id','item_id','date_block_num'],as_index=False).agg({'item_cnt_day': 'sum','item_price' : 'mean',
                                               'month' : fn})
train_sales = matrix.merge(train_sales,on=['shop_id','item_id','date_block_num'],how='left')
train_sales['item_cnt_month'] = train_sales['item_cnt_day'].fillna(0).clip(0,20)
train_sales.drop(columns='item_cnt_day',inplace=True)
print('Mean of target value in train_sales column : {0}'.format(train_sales['item_cnt_month'].mean()))
if np.abs(train_sales['item_cnt_month'].mean()-y_hat_mean)<0.2:  # same tolerance as in the check before aggregation
    print('The mean of train and test targets is aligned!')
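
To see why the grid changes the target mean, here is a minimal sketch on a hypothetical one-month toy dataset (the `sold`, `grid` and `full` frames are illustrative and not part of the notebook's data). Pairs that never sold appear only after the left merge, and filling them with zero pulls the mean down.

```python
import pandas as pd
from itertools import product

# Hypothetical monthly sales: only pairs with at least one sale are recorded.
sold = pd.DataFrame({
    "shop_id": [0, 0, 1],
    "item_id": [10, 11, 10],
    "date_block_num": [0, 0, 0],
    "item_cnt_month": [3.0, 1.0, 2.0],
})

# All shop/item pairs seen in that month, whether or not they sold.
grid = pd.DataFrame(
    list(product(sold["shop_id"].unique(), sold["item_id"].unique(), [0])),
    columns=["shop_id", "item_id", "date_block_num"],
)

# Left merge keeps every grid row; unsold pairs become NaN, then 0.
full = grid.merge(sold, on=["shop_id", "item_id", "date_block_num"], how="left")
full["item_cnt_month"] = full["item_cnt_month"].fillna(0).clip(0, 20)

print(sold["item_cnt_month"].mean(), full["item_cnt_month"].mean())
```

In this toy case the pair (shop 1, item 11) is added with a count of zero, so the mean drops from 2.0 to 1.5.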

Let's also concatenate everything with the test data so that lagged features can be computed for the test month as well.

In [None]:
test_sales['date_block_num'] = 34
test_sales.drop(columns='ID',inplace=True)
# `keys` labels the concatenated objects and has no effect with ignore_index=True, so it is dropped
data = pd.concat([train_sales,test_sales],ignore_index=True, sort=False)
data.head()

In [None]:
month_mapping = data[['month','date_block_num']].dropna().drop_duplicates().sort_values(by=['date_block_num'])\
.set_index('date_block_num').to_dict()['month']
month_mapping.update({34:'11'})

In [None]:
data = data.sort_values(by=['date_block_num','shop_id','item_id'])
data['item_price'] = data['item_price'].fillna(0)
data['month'] = data['date_block_num'].map(month_mapping)

In [None]:
data.head()

We will also delete data that is no longer needed.

In [None]:
del test_sales, train_sales, matrix
gc.collect();

## 5. Additional features based on months

We will add a feature that indicates whether the month contains a holiday.

In [None]:
holiday_monthes = ['01','02','03','05','06','11']
data['is_holiday']=data['month'].apply(lambda x: 1 if x in holiday_monthes else 0)

## 6. Features based on time-series, merging everything together

Before all of this, we should deal with outliers in the item_price column. For this step we will use a technique called winsorization, which clips extreme values to chosen percentiles instead of dropping rows.

In [None]:
data['item_price'] = data['item_price'].fillna(0)
lower, upper = np.percentile(data['item_price'].values,[1,99])
data['item_price'] = data['item_price'].clip(lower,upper)
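
As an illustration of what winsorization does, the sketch below applies the same percentile-and-clip recipe to a hypothetical price array (the `prices` values are made up for the example). No observation is removed; the extremes are only pulled in to the 1st and 99th percentiles.

```python
import numpy as np

# Hypothetical prices with one extreme value on each side.
prices = np.array([-1.0] + [100.0] * 98 + [300_000.0])

lower, upper = np.percentile(prices, [1, 99])
winsorized = np.clip(prices, lower, upper)

# Same number of observations; extremes are replaced by the percentile bounds.
print(len(winsorized), winsorized.min(), winsorized.max())
```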

In [None]:
plt.hist(np.log1p(data['item_price'].values));

In [None]:
data['revenue'] = data['item_price']*data['item_cnt_month']

We see that winsorization has taken care of the outliers.

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['shop_id','item_id','date_block_num',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['shop_id','item_id','date_block_num',col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
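
To make the shift-and-merge idea concrete, here is a minimal sketch that runs the same logic on a hypothetical three-month history of a single shop/item pair (the `toy` frame is illustrative; the function is repeated so that the snippet runs on its own).

```python
import pandas as pd

def lag_feature(df, lags, col):
    # Same shift-and-merge idea as the function above.
    tmp = df[["shop_id", "item_id", "date_block_num", col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ["shop_id", "item_id", "date_block_num", f"{col}_lag_{i}"]
        # Moving month t to t + i makes the value visible as "i months ago".
        shifted["date_block_num"] += i
        df = pd.merge(df, shifted, on=["date_block_num", "shop_id", "item_id"], how="left")
    return df

# Hypothetical history of one shop/item pair over three months.
toy = pd.DataFrame({
    "shop_id": [0, 0, 0],
    "item_id": [10, 10, 10],
    "date_block_num": [0, 1, 2],
    "item_cnt_month": [5.0, 7.0, 4.0],
})

lagged = lag_feature(toy, [1], "item_cnt_month")
print(lagged[["date_block_num", "item_cnt_month", "item_cnt_month_lag_1"]])
```

The first month has no predecessor, so its lag is NaN; every later month carries the previous month's count.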

In [None]:
def aggregated_previous(data,column,target_col,lags=(1,),type_='mean'):
    # For every lag, adds the target aggregated by `column` over the month `lag` steps back.
    if isinstance(column,list):
        to_group = ['date_block_num']+column
        name = '_'.join(column)
    else:
        to_group = ['date_block_num']+[column]
        name = column
    for i in lags:
        tmp_data = data[to_group+[target_col]].copy()
        # shift by the lag so that month t receives the aggregate of month t - i
        tmp_data['date_block_num'] += i
        tmp_data = tmp_data.groupby(to_group).agg({target_col:type_})
        tmp_data.rename(columns={target_col:target_col+'_previous_{0}_by_'.format(type_)+name+'_lag_'+str(i)},inplace=True)
        data = data.merge(tmp_data,how='left',right_index=True,left_on=to_group)
    return data
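
The group-level lag works the same way as the row-level one, except that the target is aggregated before being merged back. Below is a minimal sketch on a hypothetical toy frame (the column name `shop_mean_lag_1` and the data are illustrative assumptions) that attaches the previous month's mean sales per shop.

```python
import pandas as pd

# Hypothetical sales of two shops over two months.
toy = pd.DataFrame({
    "shop_id": [0, 0, 1, 0, 1],
    "date_block_num": [0, 0, 0, 1, 1],
    "item_cnt_month": [2.0, 4.0, 1.0, 3.0, 5.0],
})

# Mean sales per shop and month, re-labelled as belonging to the following month.
prev = (
    toy.assign(date_block_num=toy["date_block_num"] + 1)
       .groupby(["date_block_num", "shop_id"])["item_cnt_month"]
       .mean()
       .rename("shop_mean_lag_1")
       .reset_index()
)

out = toy.merge(prev, on=["date_block_num", "shop_id"], how="left")
print(out)
```

Rows from month 0 get NaN, while the month 1 rows receive the month 0 means of their shop (3.0 for shop 0 and 1.0 for shop 1).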

Let's add lag features based on price and item_cnt_month (it can take some time).

In [None]:
data = lag_feature(data, [1,3,6,12], 'item_cnt_month')
data = lag_feature(data, [1,3,6,12], 'item_price')
data = aggregated_previous(data,'shop_id','item_cnt_month',[1,3])
data = aggregated_previous(data,'item_id','item_cnt_month',[1,3])
data = aggregated_previous(data,'shop_id','revenue',[1])
data = aggregated_previous(data,'item_id','revenue',[1])

In [None]:
data.drop(columns='revenue',inplace=True)

In [None]:
data.head()

In [None]:
data.fillna(0,inplace=True)

Let's now merge our data with other dataframes.

In [None]:
data = data.merge(shops,on='shop_id',how='left')
data = data.merge(items_categories_merged,on='item_id',how='left')

In [None]:
# data = aggregated_previous(data,['shop_id','item_category_id'],'item_cnt_month',[1])
# data = aggregated_previous(data,['shop_id','item_category_id'],'item_cnt_month',[1],'sum')
data.fillna(0,inplace=True)

In [None]:
data.head(5)

In [None]:
data.columns

In [None]:
del items, shops, items_categories_merged
gc.collect();

In [None]:
data.drop(columns='item_price',inplace=True)

# Part 4: Feature processing

For now we will only use gradient-boosted trees as our main model (there was also an idea to use an LSTM or to build an ensemble, but those ideas will be explored later), so we only need to factorize our categorical columns.

In [None]:
to_encode = ['month','city_name','shop_type','type_of_category']
nunique_cat = {}
for i in to_encode:
    data[i] = data[i].factorize()[0]
    nunique_cat.update({i:data[i].nunique()})
nunique_cat.update({'shop_id':data['shop_id'].nunique()})
nunique_cat.update({'item_id':data['item_id'].nunique()})
nunique_cat.update({'item_category_id':data['item_category_id'].nunique()})
print('Factorized all the columns!')
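
For reference, this is what factorization does to a column; the sketch uses a hypothetical city series (the values are illustrative). Each distinct value gets an integer code in order of first appearance, so the codes carry no ordinal meaning.

```python
import pandas as pd

# Hypothetical city column.
cities = pd.Series(["Moscow", "Kazan", "Moscow", "Omsk"])

# codes: integer label per row; uniques: the value behind each label.
codes, uniques = cities.factorize()
print(list(codes), list(uniques))
```

Tree-based models can split on such codes directly, whereas linear models or neural networks usually need one-hot encoding or embeddings on top of them.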

In [None]:
data.head()

# Part 5: Machine learning part

## Casting data to related dtypes and basic preparation.

In [None]:
data.columns

First, we will reduce memory usage by casting columns to appropriate dtypes, and then split the data by month.

In [None]:
def cast_categorical(data):
    data['is_holiday'] = data['is_holiday'].astype('uint8')
    data['shop_id'] = data['shop_id'].astype('uint8')
    data['month'] = data['month'].astype('uint8')
    data['shop_type'] = data['shop_type'].astype('uint8')
    data['city_name'] = data['city_name'].astype('uint8')
    data['item_category_id'] = data['item_category_id'].astype('uint8')
    data['date_block_num'] = data['date_block_num'].astype('uint8')
    data['item_id'] = data['item_id'].astype('uint16')
    data['type_of_category'] = data['type_of_category'].astype('uint8')

In [None]:
def cast_numerical(data):
    for i in data.columns:
        if 'float' in str(data[i].dtype):
            data[i] = data[i].astype('float16')

In [None]:
cast_categorical(data)
cast_numerical(data)

In [None]:
data.info()

In [None]:
np.isfinite(data).sum()
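
The finiteness check is worth keeping because float16 trades range and precision for memory. A minimal sketch with hypothetical values (chosen only to show the effect):

```python
import numpy as np

# float16 keeps about 3 significant decimal digits and cannot exceed 65504.
big = np.array([4097.0, 70000.0], dtype="float32")
with np.errstate(over="ignore"):  # silence the overflow warning for the demo
    small = big.astype("float16")

print(small)               # 4097 is stored as 4096, 70000 becomes inf
print(np.isfinite(small))
```

If a feature can exceed the float16 range, keeping that column as float32 is the safer choice.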

In [None]:
train, test = data[data.date_block_num<34],data[data.date_block_num==34]
del data
gc.collect();

In [None]:
partA = train[train.date_block_num<32]
partB = train[train.date_block_num == 32]
partC = train[train.date_block_num == 33]

In [None]:
part_A_x = partA.drop(columns=['item_cnt_month','date_block_num'])
part_A_y = partA['item_cnt_month']
part_B_x = partB.drop(columns=['item_cnt_month','date_block_num'])
part_B_y = partB['item_cnt_month']
part_C_x = partC.drop(columns=['item_cnt_month','date_block_num'])
part_C_y = partC['item_cnt_month']
test = test.drop(columns=['item_cnt_month','date_block_num'])

In [None]:
del train, partA,partB, partC
gc.collect();

# First level models : LGBM, NN, Lasso, Ridge.

## LGBM training

In [None]:
to_rename = {'версия':'version','регион':'region','русская':'rus','цифровая':'numeric','фигурка':'figure',
            'фирм':'firm','коллекция':'collection'}
part_A_x.rename(columns=to_rename,inplace=True)
part_B_x.rename(columns=to_rename,inplace=True)
part_C_x.rename(columns=to_rename,inplace=True)
test.rename(columns=to_rename,inplace=True)


In [None]:
eval_set = [(part_A_x,part_A_y),(part_B_x,part_B_y),(part_C_x,part_C_y)]

In [None]:
import lightgbm as lgb
from lightgbm import plot_importance

In [None]:
lgb_model = lgb.LGBMRegressor(feature_fraction= 0.75,
               metric = 'rmse',
               max_depth = 8, 
               min_data_in_leaf = 2**7, 
               bagging_fraction = 0.75, 
               learning_rate = 0.03, 
               objective = 'mse', 
               bagging_seed = 2**7, 
               num_leaves = 100,
               bagging_freq =1,
               verbose = 1,
            random_state=5,
                             n_estimators=300)
lgb_model.fit(part_A_x,part_A_y,eval_metric="rmse", 
    eval_set=eval_set, 
    # the early_stopping_rounds / verbose arguments of fit() were removed in LightGBM 4.0
    callbacks=[lgb.early_stopping(stopping_rounds=10)])

In [None]:
plot_importance(lgb_model,ax=plt.subplots(1,1,figsize=(15,12))[1])

In [None]:
lgb_B = lgb_model.predict(part_B_x)
lgb_C = lgb_model.predict(part_C_x)
lgb_test = lgb_model.predict(test)

In [None]:
lgb_model._Booster.__del__()

In [None]:
gc.collect();

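## NN training

The `make_arch` function builds the neural network architecture. Each categorical column gets its own embedding layer (except `is_holiday`, which goes through a small dense layer), and the numerical columns go through a dense layer. Spatial dropout on the embeddings, together with ordinary dropout on the numerical branch, is used to reduce overfitting.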

In [None]:
def make_arch(numerical_cols,categorical_cols):
    tf.keras.backend.clear_session()
    tf.random.set_seed(5)
    inputs = []
    embeddings = []
    for cat_col in categorical_cols:
        if not cat_col=='is_holiday':
            no_of_unique_cat = nunique_cat[cat_col]
            embedding_size = int(min(np.ceil((no_of_unique_cat)/2), 50))
            input = tf.keras.layers.Input(shape = (1,),name='input_for_{0}'.format(cat_col))
            embs = tf.keras.layers.Embedding(no_of_unique_cat+1, embedding_size, name = 'embeddings_for_{0}'.format(cat_col))(input)
            drop = tf.keras.layers.SpatialDropout1D(0.4)(embs)
            reshape = tf.keras.layers.Reshape(target_shape = (embedding_size,),name='reshape_for_{0}'.format(cat_col))(drop)
            embeddings.append(reshape)
            inputs.append(input)
        else:
            input = tf.keras.layers.Input(shape = (1,),name='input_for_{0}'.format(cat_col))
            embs = tf.keras.layers.Dense(4,activation='relu')(input)
            embeddings.append(embs)  # is_holiday uses the dense output, not an embedding
            inputs.append(input)
    numeric_input = tf.keras.layers.Input(shape=(len(numerical_cols),), name='input_for_numerical')
    numeric_embs = tf.keras.layers.Dense(32)(numeric_input)
    leaky_relu = tf.keras.layers.LeakyReLU(0.1)(numeric_embs)
    drop_concat = tf.keras.layers.Dropout(0.2)(leaky_relu)
    inputs.append(numeric_input)
    embeddings.append(drop_concat)
    concat = tf.keras.layers.Concatenate()(embeddings)
    concat_dense = tf.keras.layers.Dense(8)(concat)
    leaky_relu2 = tf.keras.layers.LeakyReLU(0.1)(concat_dense)
    last_dense = tf.keras.layers.Dense(1,activation='relu')(leaky_relu2)
    model = tf.keras.Model(outputs=last_dense,inputs=inputs)
    return model

def root_mean_squared_error(y_true, y_pred):
    return tf.keras.backend.sqrt(tf.keras.backend.mean(tf.keras.backend.square(y_pred - y_true)))
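
As a quick illustration of the sizing rule used inside `make_arch`, the sketch below recomputes the embedding width for a few cardinalities. The helper name `embedding_size` and the sample cardinalities are hypothetical; only the formula `min(ceil(n / 2), 50)` and the `n + 1` vocabulary size come from the function above.

```python
import math

def embedding_size(n_unique, cap=50):
    """Embedding width as in make_arch: half the cardinality, rounded up, capped at `cap`."""
    return int(min(math.ceil(n_unique / 2), cap))

# Hypothetical cardinalities, chosen only to show small, medium and capped cases.
for n_unique in (3, 60, 22170):
    vocab = n_unique + 1  # make_arch gives the Embedding layer one extra row
    print(n_unique, vocab, embedding_size(n_unique))
```

Small columns get a narrow embedding, while high-cardinality columns are capped at 50 dimensions, which keeps the concatenated layer from growing too wide.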

The columns are divided into numerical and categorical. Numerical columns are scaled.

In [None]:
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
categorical_cols = [i for i in part_A_x.columns if 'int' in str(part_A_x[i].dtype)]
numerical_cols = [i for i in part_A_x.columns if 'float' in str(part_A_x[i].dtype)]

categorical_input_A = [part_A_x[i].values for i in categorical_cols]
scaler = StandardScaler()
numerical_input_A = scaler.fit_transform(part_A_x[numerical_cols].astype('float32'))
categorical_input_A.append(numerical_input_A)

categorical_input_B = [part_B_x[i].values for i in categorical_cols]
numerical_input_B = scaler.transform(part_B_x[numerical_cols].astype('float32'))
categorical_input_B.append(numerical_input_B)

categorical_input_C = [part_C_x[i].values for i in categorical_cols]
numerical_input_C = scaler.transform(part_C_x[numerical_cols].astype('float32'))
categorical_input_C.append(numerical_input_C)

categorical_input_test = [test[i].values for i in categorical_cols]
numerical_input_test = scaler.transform(test[numerical_cols].astype('float32'))
categorical_input_test.append(numerical_input_test)
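
The scaling step above fits the scaler on part A only and reuses those statistics for parts B, C and the test set, so later months do not influence the training features. A minimal NumPy sketch of that idea follows; the arrays and the `fit_standardizer` / `apply_standardizer` helpers are made up for illustration and are not part of the notebook.

```python
import numpy as np

def fit_standardizer(x):
    """Return per-column mean and standard deviation of the training block."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return mean, std

def apply_standardizer(x, mean, std):
    """Scale any block with statistics learned on the training block."""
    return (x - mean) / std

# Toy data: three training rows and one later validation row.
train_block = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
valid_block = np.array([[7.0, 12.0]])

mean, std = fit_standardizer(train_block)
train_scaled = apply_standardizer(train_block, mean, std)
valid_scaled = apply_standardizer(valid_block, mean, std)
print(train_scaled.mean(axis=0))
print(valid_scaled)
```

The validation row is not centred at zero after scaling, which is expected: it is measured against the training distribution, as parts B, C and the test set are in the notebook.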



In [None]:
model = make_arch(numerical_cols,categorical_cols)

model.compile(loss=root_mean_squared_error,
              optimizer=tf.keras.optimizers.SGD(momentum=0.1, learning_rate=0.009))

# validation_data is passed as an (inputs, targets) tuple, as Keras expects
history = model.fit(x=categorical_input_A, y=part_A_y.values,
                    validation_data=(categorical_input_B, part_B_y.values),
                    batch_size=512, epochs=4,
                    callbacks=[tf.keras.callbacks.EarlyStopping(patience=1)])

Now we plot the training and validation loss of the network.

In [None]:
loss, val_loss = history.history['loss'],history.history['val_loss']
plt.figure(figsize=(13,8))
plt.title('NN training loss versus validation')
plt.plot(range(len(loss)),loss,'b',label='training loss')
plt.plot(range(len(val_loss)),val_loss,'r',label='validation loss')
plt.legend()
plt.xticks(range(len(val_loss)));
plt.yticks(np.arange(min(val_loss),max(loss),0.01));

In [None]:
model.evaluate(categorical_input_C,part_C_y.values)

Create the NN predictions for parts B and C and for the test set.

In [None]:
# model.predict returns arrays of shape (n, 1); flatten them to 1-D
nn_B = model.predict(categorical_input_B).ravel()
nn_C = model.predict(categorical_input_C).ravel()
nn_test = model.predict(categorical_input_test).ravel()

## Lasso Regression training

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
lasso = Lasso(random_state=SEED,alpha=0.04)
lasso.fit(numerical_input_A,part_A_y)

In [None]:
pred_B = lasso.predict(numerical_input_B)
r2_B = r2_score(y_true=part_B_y,y_pred=pred_B)
rmse_B = np.sqrt(mean_squared_error(y_true=part_B_y,y_pred=pred_B))
print('RMSE on B part: {0}'.format(rmse_B))
print('r2_score on B part: {0}'.format(r2_B))

In [None]:
pred_C = lasso.predict(numerical_input_C)
r2_C = r2_score(y_true=part_C_y,y_pred=pred_C)
rmse_C = np.sqrt(mean_squared_error(y_true=part_C_y,y_pred=pred_C))
print('RMSE on C part: {0}'.format(rmse_C))
print('r2_score on C part: {0}'.format(r2_C))

In [None]:
lasso_B = lasso.predict(numerical_input_B)
lasso_C = lasso.predict(numerical_input_C)
lasso_test = lasso.predict(numerical_input_test)

## Ridge regression training

In [None]:
from sklearn.linear_model import Ridge


In [None]:
ridge = Ridge(random_state=SEED,alpha=0.04)
ridge.fit(numerical_input_A,part_A_y)

In [None]:
pred_B = ridge.predict(numerical_input_B)
r2_B = r2_score(y_true=part_B_y,y_pred=pred_B)
rmse_B = np.sqrt(mean_squared_error(y_true=part_B_y,y_pred=pred_B))
print('RMSE on B part: {0}'.format(rmse_B))
print('r2_score on B part: {0}'.format(r2_B))

In [None]:
pred_C = ridge.predict(numerical_input_C)
r2_C = r2_score(y_true=part_C_y,y_pred=pred_C)
rmse_C = np.sqrt(mean_squared_error(y_true=part_C_y,y_pred=pred_C))
print('RMSE on C part: {0}'.format(rmse_C))
print('r2_score on C part: {0}'.format(r2_C))

In [None]:
ridge_B = ridge.predict(numerical_input_B)
ridge_C = ridge.predict(numerical_input_C)
ridge_test = ridge.predict(numerical_input_test)

## Second-level model training

We now gather all the first-level predictions and train a second-level model on them. Part B is used to fit the second-level model and part C to validate it.

In [None]:
part_B_2 = pd.DataFrame(index=range(len(nn_B)))
part_B_2['lasso'] = lasso_B
part_B_2['ridge'] = ridge_B
part_B_2['lgb'] = lgb_B
part_B_2['nn'] = nn_B
cols = part_B_2.columns
for i in cols:
    for j in cols:
        if i!=j:
            part_B_2['{0}_{1}_distance'.format(i,j)] = part_B_2[i]-part_B_2[j]
part_B_2['target'] = part_B_y.values

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(part_B_2.corr(), 
        xticklabels=part_B_2.corr().columns,
        yticklabels=part_B_2.corr().columns)

In [None]:
part_C_2 = pd.DataFrame(index=range(len(nn_C)))
part_C_2['lasso'] = lasso_C
part_C_2['ridge'] = ridge_C
part_C_2['lgb'] = lgb_C
part_C_2['nn'] = nn_C
cols = part_C_2.columns
for i in cols:
    for j in cols:
        if i!=j:
            part_C_2['{0}_{1}_distance'.format(i,j)] = part_C_2[i]-part_C_2[j]
part_C_2['target'] = part_C_y.values

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(part_C_2.corr(), 
        xticklabels=part_C_2.corr().columns,
        yticklabels=part_C_2.corr().columns)

In [None]:
test_2 = pd.DataFrame(index=range(len(nn_test)))
test_2['lasso'] =lasso_test
test_2['ridge'] =ridge_test
test_2['lgb'] = lgb_test
test_2['nn'] = nn_test
cols = test_2.columns
for i in cols:
    for j in cols:
        if i!=j:
            test_2['{0}_{1}_distance'.format(i,j)] = test_2[i]-test_2[j]

In [None]:
test_2.head()

In [None]:
test_2.corr()
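
The three cells that build `part_B_2`, `part_C_2` and `test_2` repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that; `build_second_level_frame` is a hypothetical name and the toy predictions are invented. It keeps the notebook's `{i}_{j}_distance` naming and emits both orderings of each pair, as the original loops do.

```python
from itertools import permutations

import numpy as np
import pandas as pd

def build_second_level_frame(preds, target=None):
    """Stack first-level predictions and their pairwise differences.

    preds  : dict mapping model name -> array-like of predictions
    target : optional array-like of true values, stored in a 'target' column
    """
    # ravel() flattens (n, 1) outputs, such as Keras predictions, into 1-D columns
    frame = pd.DataFrame({name: np.asarray(p).ravel() for name, p in preds.items()})
    base_cols = list(frame.columns)
    # Ordered pairs (i, j) with i != j, same order as the nested loops
    for i, j in permutations(base_cols, 2):
        frame['{0}_{1}_distance'.format(i, j)] = frame[i] - frame[j]
    if target is not None:
        frame['target'] = np.asarray(target)
    return frame

# Toy predictions for two rows; the 'nn' entry mimics a (n, 1) Keras output.
toy = build_second_level_frame(
    {'lasso': [0.1, 0.4], 'lgb': [0.2, 0.3], 'nn': np.array([[0.0], [0.5]])},
    target=[0, 1],
)
print(toy.columns.tolist())
```

Because both orderings of each pair are kept, every distance column has a mirror column with the opposite sign, so half of these features are redundant for a linear second-level model.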

In [None]:
features = part_B_2.columns.tolist()
target = features.pop(features.index('target'))
X_B , Y_B = part_B_2[features], part_B_2[target]
X_C , Y_C = part_C_2[features], part_C_2[target]

In [None]:
features

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
lr = SGDRegressor(alpha=0.001,random_state=SEED)

In [None]:
lr.fit(X_B,Y_B)

In [None]:
pred_B = lr.predict(X_B)
r2_B = r2_score(y_true=Y_B,y_pred=pred_B)
rmse_B = np.sqrt(mean_squared_error(y_true=Y_B,y_pred=pred_B))
print('RMSE on B part: {0}'.format(rmse_B))
print('r2_score on B part: {0}'.format(r2_B))

In [None]:
pred_C = lr.predict(X_C)
r2_C = r2_score(y_true=Y_C,y_pred=pred_C)
rmse_C = np.sqrt(mean_squared_error(y_true=Y_C,y_pred=pred_C))
print('RMSE on C part: {0}'.format(rmse_C))
print('r2_score on C part: {0}'.format(r2_C))
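
For reference, the two metrics printed in this section can be computed by hand. The sketch below uses NumPy only; `rmse` and `r2` are illustrative helpers with invented inputs, not functions from the notebook. R² can be negative when a model does worse than always predicting the mean of the targets.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented targets with one close and one reversed set of predictions.
y_true = [0.0, 1.0, 2.0, 3.0]
close_preds = [0.0, 1.0, 2.0, 2.0]
reversed_preds = [3.0, 2.0, 1.0, 0.0]

print(rmse(y_true, close_preds), r2(y_true, close_preds))
print(rmse(y_true, reversed_preds), r2(y_true, reversed_preds))
```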

## Final predictions

In [None]:
final_preds = np.clip(lr.predict(test_2),0,20)

In [None]:
submission = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
to_merge = test
Y_test = final_preds
to_merge['item_cnt_month'] = Y_test
sub_to_merge = to_merge[['shop_id','item_id','item_cnt_month']].copy()
submission = submission.merge(sub_to_merge,how='left',on=['shop_id','item_id'])
submission = submission[['ID','item_cnt_month']]
submission.to_csv('submission_stacking.csv',index=False)