# Predict Future Sales (EDA)
Please note that most of this EDA was inspired by Sarthak Batra's post (https://www.kaggle.com/sarthakbatra/predicting-sales-tutorial). Kudos to him.

In [None]:
! pip install googletrans

In [None]:
! pip install strsim

In [None]:
import numpy as np 
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from googletrans import Translator
import re
from tqdm import tqdm_notebook
import gc
from itertools import product

In [None]:
def downcast_dtypes(df):
    '''
        Downcasts numeric columns in the dataframe to save memory:
                
                `float64` columns to the smallest float type that fits (at least `float32`)
                `int64`   columns to the smallest signed integer type that fits
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast (assign the column directly so the new dtype is kept)
    for f in float_cols:
        df[f] = pd.to_numeric(df[f], downcast='float')
    
    for i in int_cols:
        df[i] = pd.to_numeric(df[i], downcast='integer')
    
    return df

## Features to add
- use frequency encoding (item_id, shop_id, item_category_id, ...). This should be useful to identify outliers.

## Read data

In [None]:
train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
submission = pd.read_csv('../input/competitive-data-science-predict-future-sales/sample_submission.csv')

In [None]:
train.head()

In [None]:
test.head()

As we can see, we have to predict, for `date_block_num` 34 (i.e. November 2015), the volume of items sold for each item_id and shop_id combination present in the test set.

## Value counts
Let's simply look at how many entries (in the train dataset) we have for some relevant fields. In particular, I will plot the value counts of shop_id and date_block_num, together with histograms of item_id, item_price and item_cnt_day.

In [None]:
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
counts = train['shop_id'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x = counts.index, y=counts, order=counts.index)
plt.title("Number of transactions by shop ID (normalized)")

plt.subplot2grid((3,3), (1,0))
sns.distplot(train.item_id)
plt.title("Item ID histogram")

plt.subplot2grid((3,3), (1,1))
sns.distplot(train.item_price)
plt.title("Item price histogram")

plt.subplot2grid((3,3), (1,2))
sns.distplot(train.item_cnt_day)
plt.title("Item count day histogram")

plt.subplot2grid((3,3), (2,0), colspan=3)
counts = train['date_block_num'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x=counts.index, y=counts, order=counts.index)
plt.title("Number of transactions per date block num");

Let me highlight the key points here:

1. Clearly a few shops are responsible for a large share of the transactions (namely shops 31, 25, 54 and 28);
2. Some item IDs seem to have a higher than average number of transactions. In addition, the histogram is quite smooth, possibly suggesting that close item_ids are related to similar products.
3. Item price and item count clearly show some outlier values (the x-axis of the respective histograms is very wide, while most of the values are concentrated in a tiny interval).
4. A larger number of transactions was recorded in date block numbers 11 and 23, i.e. December 2013 and December 2014. This may suggest a seasonality effect in the data.

## Outliers
### Item ID

In [None]:
train['item_id'].value_counts(ascending=False)[:5]

It seems that product 20949 has been transacted 31,340 times in the entire dataset. This seems a bit weird. Let's find out why.

In [None]:
items.loc[items['item_id']==20949]

In [None]:
translator = Translator()
translator.translate(items.loc[items['item_id']==20949].item_name.values[0]).text

The translation is wrong here, but this item turns out to be a plastic bag... (the only item in its category). In addition, this tricky item is indeed present in the test set.

In [None]:
test[test.item_id==20949].head()

In other words, signalling somehow that this item usually sells a lot would be a good thing. I will use frequency encoding to capture such patterns.
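
As a minimal sketch of the idea, frequency encoding replaces each ID with the share of training rows in which it appears. The toy frames below are made up for illustration; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy transactions: item 20949 appears in 3 of 5 rows (made-up data).
toy_train = pd.DataFrame({"item_id": [20949, 20949, 20949, 5822, 17717]})

# Share of training rows per item_id.
item_freq = toy_train["item_id"].value_counts(normalize=True)

# Map the shares onto an item table; items never seen in train get 0.
toy_items = pd.DataFrame({"item_id": [20949, 5822, 17717, 99999]})
toy_items["item_id_freq_encod"] = toy_items["item_id"].map(item_freq).fillna(0.0)
print(toy_items)
```

A frequently sold item such as the plastic bag gets a large value, while rarely sold or unseen items get values close to zero.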

### Item cnt day

In [None]:
train.item_cnt_day.sort_values(ascending=False)[:10]

Here, we also have a few transactions with a very high number of items sold. As you can see above, for example, there was one transaction in which more than 2,000 items were sold. Let's check this out.

In [None]:
train[train.item_cnt_day>2000]

In [None]:
items[items.item_id==11373]

In [None]:
translator.translate(items[items.item_id==11373].item_name.values[0]).text

The translation does not help here, but let's check out the median number of items sold for the same product in the dataframe (excluding this anomalous transaction).

In [None]:
train[(train.item_id==11373)&(train.item_cnt_day<2000)]['item_cnt_day'].median()

As you can see, the median daily count for this item is 4, so a single transaction of more than 2,000 units is clearly anomalous. We can then remove this entry.

In [None]:
train = train[train.item_cnt_day < 2000]

### Duplicated
Because the field is called `item_cnt_day`, I expect there to be no duplicated entries for any date, shop_id and item_id combination. In other words, for each product, shop and day we should have no more than one row. Let's check this out.

In [None]:
train[train.duplicated(subset=['date', 'shop_id', 'item_id'], keep=False)]

This is weird! Not only do we have duplicated transactions, but you can see that the price also changes. It also seems that this issue affects a relatively small set of items. I should probably look for duplicates across all columns instead (i.e. two transactions for the same product, shop and date but with a different price, for whatever reason, should not be considered duplicates).

In [None]:
train[train.duplicated(keep=False)]

Ok, this is a considerably lower number. I will keep only one row from each set of duplicated transactions then.

In [None]:
print(train.shape)
train = train[~train.duplicated()]
print(train.shape)
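
The two notions of duplication discussed above can be compared on a toy frame. This is only a sketch with made-up rows; the column names mirror `sales_train.csv`.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["01.01.2013", "01.01.2013", "02.01.2013", "02.01.2013"],
    "shop_id": [5, 5, 7, 7],
    "item_id": [100, 100, 200, 200],
    "item_price": [99.0, 149.0, 50.0, 50.0],
    "item_cnt_day": [1.0, 1.0, 2.0, 2.0],
})

key_cols = ["date", "shop_id", "item_id"]

# Same date/shop/item, regardless of price: flags all four rows.
same_key = toy[toy.duplicated(subset=key_cols, keep=False)]

# Identical in every column: flags only the last two rows.
exact = toy[toy.duplicated(keep=False)]

# Keep the first row of each exact-duplicate group.
deduplicated = toy.drop_duplicates()
print(len(same_key), len(exact), len(deduplicated))
```

Only the exact duplicates are dropped, so the two rows that share a key but differ in price both survive.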

### Prices

In [None]:
train.item_price.sort_values(ascending=False)[:10]

Ok, there seems to be one transaction with an insanely high price. Let's check it out.

In [None]:
train[train.item_price>300000]

In [None]:
items[items.item_id==6066]

In [None]:
translator.translate(items[items.item_id==6066].item_name.values[0]).text

A-ha! It seems this product has been sold to 522 people, and possibly the price is not the price of a single product but the value of the entire transaction. Let me check if there are other transactions.

In [None]:
train[train.item_id==6066]

Not really. It then makes sense to remove this row.

In [None]:
print(train.shape)
train = train[train.item_price<300000]
print(train.shape)

Let's now check if we have some transactions where the price was negative (in theory we should assume only positive prices).

In [None]:
train[train.item_price <= 0]

Only one transaction has a negative price. Let's check if we have other transactions for the same product.

In [None]:
train[train.item_id==2973].head()

It seems that this item actually has a normal price. Let's replace the negative price with the median price (same product, same shop, same date block num).

In [None]:
median_price_item_2973 = train[(train.item_id==2973)&(train.date_block_num==4)&(train.shop_id==32)&(train.item_price>0)]['item_price'].median()
train.loc[484683,'item_price'] = median_price_item_2973
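
The cell above relies on a hard-coded row label. A more general variant, sketched here on made-up prices, selects invalid rows with a boolean mask and fills them with the median of the valid prices in the same shop, item and month. This is an illustration under those assumptions, not the notebook's own method.

```python
import pandas as pd

# Made-up prices for one item in one month; the first row is invalid.
toy = pd.DataFrame({
    "shop_id": [32, 32, 32, 32, 12],
    "item_id": [2973, 2973, 2973, 2973, 2973],
    "date_block_num": [4, 4, 4, 4, 4],
    "item_price": [-1.0, 2499.0, 1249.0, 2499.0, 1999.0],
})

bad = toy["item_price"] <= 0
group_cols = ["shop_id", "item_id", "date_block_num"]

# Hide invalid prices, then compute the per-group median aligned to every row.
valid_price = toy["item_price"].where(~bad)
group_median = valid_price.groupby([toy[c] for c in group_cols]).transform("median")

# Overwrite only the invalid rows.
toy.loc[bad, "item_price"] = group_median[bad]
print(toy)
```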

Let's now reprint the distribution of the train set.

In [None]:
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
counts = train['shop_id'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x = counts.index, y=counts, order=counts.index)
plt.title("Number of transactions by shop ID (normalized)")

plt.subplot2grid((3,3), (1,0))
sns.distplot(train.item_id)
plt.title("Item ID histogram")

plt.subplot2grid((3,3), (1,1))
sns.distplot(train.item_price)
plt.title("Item price histogram")

plt.subplot2grid((3,3), (1,2))
sns.distplot(train.item_cnt_day)
plt.title("Item count day histogram")

plt.subplot2grid((3,3), (2,0), colspan=3)
counts = train['date_block_num'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x=counts.index, y=counts, order=counts.index)
plt.title("Number of transactions per date block num");

## Test set
Let's now have a look at the test set.

In [None]:
fig = plt.figure(figsize=(18,9))
plt.subplots_adjust(hspace=.5)

plt.subplot2grid((3,3), (0,0), colspan = 3)
counts = test['shop_id'].value_counts(normalize=True).sort_values(ascending=False)
sns.barplot(x = counts.index, y=counts, order=counts.index)
plt.title("Number of transactions by shop ID (normalized)")

plt.subplot2grid((3,3), (1,0))
sns.distplot(test.item_id)
plt.title("Item ID histogram");

The first thing we can see is that we have a much more uniform distribution of shops for which we are requested to come up with predictions. We already know that some of the shops have a much lower number of transactions (and thus, possibly, item counts). Encoding shops using frequency encoding (or mean encoding) should help us signal these differences.  
  
Similarly, while we know that items with item_id around 5000 are the ones transacted the most (in the train set), here we have a relatively more uniform representation of all possible item ids. Again, by encoding item id using frequency or mean encoding we should be able to signal this.  
  
Let me check whether the test set contains items which do not have any transaction in the train set.

In [None]:
print("Number of item in test set: {}".format(len(test.item_id.unique())))
item_not_in_train_set = test[~test.item_id.isin(train.item_id.unique())].item_id.sort_values().unique()
print("Number of item in test set with no transaction in train: {}".format(len(item_not_in_train_set)))

Interesting. There are 363 products in the test set which do not have any transaction associated with them. Let's check them out. In particular, in a previous notebook I worked on, I found out that item_ids seem to have been assigned at a later stage, when creating the dataset for the competition.  
Take FIFA video games as an example. Consecutive editions of FIFA have increasing item_ids (e.g. FIFA 2013 may be item_id 10000 and FIFA 2014 may be item_id 10005). Let's check the neighbouring item_ids of the test set products with no transaction in the train set.

In [None]:
item_not_in_train_set[:10]

In [None]:
items[items.item_id.isin(range(198, 210))]

In [None]:
items.loc[204,'item_name']

This is very interesting indeed! The product with item_id 204 is a Cleopatra audiobook. As you can see, the items around it, in terms of item_id, are also audiobooks (for which we have recorded transactions). For an unseen product, we may let our model behave in two ways:

* predict something close to zero, i.e. the product has never sold anything and that is what we should assume;
* assuming that the product is new, we can predict the average number of items sold by products of the same type during their first launch.  
  
The second method is quite tricky. Either we create a more complicated validation set where we remove some products for which we have data, but we assume they are new (although that could be biased) or, for products which are new, we actually create some sort of "manual predictions" (like the one explained above). This could be an interesting strategy indeed.
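
One possible form of the "manual prediction" for new products is sketched below: take the average sales that items of the same category achieved in their first month, and use it as a guess for unseen items. The frames, the `launch_guess` column and the choice of grouping by category are all assumptions made for illustration.

```python
import pandas as pd

# Made-up monthly sales per item.
monthly = pd.DataFrame({
    "item_id":          [1, 1, 2, 2, 3],
    "item_category_id": [40, 40, 40, 40, 55],
    "date_block_num":   [10, 11, 20, 21, 5],
    "item_cnt_month":   [8.0, 3.0, 4.0, 1.0, 2.0],
})

# Keep only the rows from the first month in which each item was sold.
first_block = monthly.groupby("item_id")["date_block_num"].transform("min")
launch_rows = monthly[monthly["date_block_num"] == first_block]

# Average launch-month sales per category.
launch_mean = launch_rows.groupby("item_category_id")["item_cnt_month"].mean()

# Hypothetical unseen test items; unknown categories fall back to 0.
new_items = pd.DataFrame({"item_id": [900, 901], "item_category_id": [40, 77]})
new_items["launch_guess"] = new_items["item_category_id"].map(launch_mean).fillna(0.0)
print(new_items)
```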

In [None]:
del counts
del item_not_in_train_set
gc.collect()

## Shops
First of all, let me see if we are requested to predict sales for shops in the test set which have never recorded a transaction in the train set.

In [None]:
shops_not_in_train = test[~test.shop_id.isin(train.shop_id.unique())].shop_id.unique()
print("Number of shops in test with no transaction in train: {}".format(len(shops_not_in_train)))

Ok, not a problem. Let's focus on the shop names now.

In [None]:
shops.shop_name[:5]

The first thing I have noticed (this is also quite visible in most notebooks and discussions in the competition) is that the names of the shops contain quite a lot of information. In particular, the first word in the shop name is the city where the shop is located. Let's extract the shop city with the regular expression below.

In [None]:
shop_splitter = re.compile(r'(\w+)\s(.*)')
shop_names = shops.shop_name.apply(lambda x: shop_splitter.search(x).groups())
shop_names_df = pd.DataFrame(shop_names.values.tolist(), columns=['city', 'extracted_name'])
shop_names_df.head()

In [None]:
shops = pd.concat([shops, shop_names_df], axis=1)
shops.head()
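
To see what the regular expression actually captures, here is a small sketch on three names written in the style of the dataset (treat the exact strings as illustrative). Because `search` looks for the first word that is directly followed by whitespace, a leading `!` is skipped and a hyphenated first word yields its second half.

```python
import re

shop_splitter = re.compile(r'(\w+)\s(.*)')

examples = [
    'Москва ТЦ "Семеновский"',
    '!Якутск Орджоникидзе, 56 фран',
    'Интернет-магазин ЧС',
]

# Each result is a (city, extracted_name) tuple.
parts = [shop_splitter.search(name).groups() for name in examples]
for name, (city, rest) in zip(examples, parts):
    print(repr(name), '->', repr(city), '|', repr(rest))
```

The last example shows how a value like 'магазин' can end up in the `city` column even though it is not a city.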

Let's examine the city in more detail. Most values are indeed cities, but three of them are actually other categories, namely 'Выездная', 'магазин' and 'Цифровой'. It would then be good to signal this with a boolean variable.

In [None]:
shops.loc[:,'is_city'] = shops.city.apply(lambda x :0 if x in ['Выездная', 'магазин', 'Цифровой'] else 1)

Let's now look at the shop names.

In [None]:
shops.shop_name.unique()

Although I do not speak Russian, you can see that there are some words that repeat very often, in particular:

* ТЦ;
* ТРЦ;
* ТРК;
* Орджоникидзе;
* other.  
  
Let's add this information. Later analysis may show that it is not useful, but let's add it for now.

In [None]:
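def shop_sub_type(x):
    # x is a row holding the 'is_city' flag and the 'extracted_name' string
    if x['is_city'] == 0:
        return 'non_city'
    else:
        if 'ТЦ' in x['extracted_name']:
            return 'ТЦ'
        elif 'ТРЦ' in x['extracted_name']:
            return 'ТРЦ'
        elif 'ТРК' in x['extracted_name']:
            return 'ТРК'
        elif 'Орджоникидзе' in x['extracted_name']:
            return 'Орджоникидзе'
        else:
            return 'other'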

In [None]:
shops.loc[:,'shop_sub_type'] = shops[['is_city', 'extracted_name']].apply(shop_sub_type, axis=1)
shops.head()

Let's now see if we have shops with the same shop_name but different shop_id.

In [None]:
shops[shops.shop_name.duplicated(keep=False)]

In [None]:
del shop_names_df, shops_not_in_train
gc.collect()

Let's use a slightly better approach. Let's evaluate the similarity between shop names and plot it as a grid.

In [None]:
from similarity.normalized_levenshtein import NormalizedLevenshtein

In [None]:
unique_shop_id = shops.shop_id.unique()
similarity_grid = np.zeros(shape=(len(unique_shop_id), len(unique_shop_id)))

In [None]:
norm_lev = NormalizedLevenshtein()

for i in unique_shop_id:
    for j in unique_shop_id:
        # Note: this is a similarity (1 = identical names), not a distance
        similarity = norm_lev.similarity(shops[shops.shop_id==i].shop_name.values[0], shops[shops.shop_id==j].shop_name.values[0])
        similarity_grid[i,j] = similarity
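
For readers without the `strsim` package, the metric can be sketched in pure Python. This assumes the normalized similarity is one minus the edit distance divided by the length of the longer string, which is my understanding of `NormalizedLevenshtein` but is not verified against the library here. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(normalized_similarity("kitten", "sitting"))
```

Two shop names that differ by a single character therefore score very close to 1.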

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
mask = similarity_grid < 0.6
sns.heatmap(similarity_grid, ax=ax, mask=mask, cmap = sns.color_palette('Blues'))
ax.set_facecolor("grey")

This seems quite interesting. We have a few pairs of shops with very similar names. Let's check them out.

In [None]:
indices = zip(*np.triu_indices_from(similarity_grid))

In [None]:
similar_stores = []

for c in indices:
    i, j = c[0], c[1]
    if i != j and similarity_grid[i,j]>0.6:
        similar_stores.append([i,j, similarity_grid[i,j]])
similar_stores = pd.DataFrame(similar_stores, columns=['i','j','similarity'])
similar_stores.sort_values(by='similarity',ascending=False, inplace=True)
similar_stores

These are very similar stores by name. Let's check the names.

In [None]:
shops[shops.shop_id.isin([10,11])].shop_name

Bingo, these two shops are indeed the same: there is just one character of difference. It then makes sense to treat any transaction from shop number 10 as if it was registered in shop number 11 (and similarly in the test set).

In [None]:
train.loc[train.shop_id==10, 'shop_id'] = 11
test.loc[test.shop_id==10, 'shop_id'] = 11

In [None]:
shops[shops.shop_id.isin([23,24])].shop_name

After some translation, these are two equally-named shops, but positioned in different areas of the same shopping center. I don't really know whether one specialises in some specific products and the other doesn't, but I will keep them as separate. I am not really sure whether the sales of one of the shops can influence the other though.

In [None]:
shops[shops.shop_id.isin([30,31])].shop_name

These are not really similar shops (probably the name of the area is close).

In [None]:
shops[shops.shop_id.isin([0,57])].shop_name

Well, this one looks pretty much the same. Let's replace it.

In [None]:
train.loc[train.shop_id==57, 'shop_id'] = 0
test.loc[test.shop_id==57, 'shop_id'] = 0

In [None]:
shops[shops.shop_id.isin([1,58])].shop_name

Again, very similar. Let's replace them.

In [None]:
train.loc[train.shop_id==58, 'shop_id'] = 1
test.loc[test.shop_id==58, 'shop_id'] = 1

In [None]:
shops[shops.shop_id.isin([39,40])].shop_name

A Google search shows that these shops are indeed different. Let's check the last one.

In [None]:
shops[shops.shop_id.isin([38,54])].shop_name

Ok, this one is a different city. Let's stop here.

In [None]:
del similar_stores, similarity_grid
gc.collect()

## Categories analysis
I have noticed that categories include some hierarchical information (separated by a hyphen).

In [None]:
categories.item_category_name.head()

In [None]:
split_names = categories.item_category_name.apply(lambda name: [part.strip() for part in name.split(' - ')])

In [None]:
new_categories = np.chararray((len(categories), 2), itemsize=33, unicode=True)
new_categories[:] = 'None'

# Add categories with a for loop
for i, c_list in enumerate(split_names):
    for j, c_value in enumerate(c_list):
        new_categories[i,j] = c_value

In [None]:
new_categories_df = pd.DataFrame(new_categories, columns=['category', 'sub_category'])
categories = categories.join(new_categories_df)

# If sub_category is None replace it with category
categories.loc[:,'sub_category'] = categories[['category', 'sub_category']].apply(lambda x: x['category'] if x['sub_category']=='None' else x['sub_category'], axis=1)

categories.head()
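
The same split can also be sketched with pandas string methods instead of a character array. This is an alternative under the assumption that a name contains at most one ` - ` separator (with `n=1`, anything after the first separator stays in the sub-category); the two example names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_category_name": ["PC - Гарнитуры/Наушники", "Доставка товара"],
})

# Split on the first ' - ' only; names without a separator get a missing second part.
parts = toy["item_category_name"].str.split(" - ", n=1, expand=True)

toy["category"] = parts[0].str.strip()
# Fall back to the category when there is no sub-category.
toy["sub_category"] = parts[1].str.strip().fillna(toy["category"])
print(toy)
```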

In [None]:
del split_names, new_categories, new_categories_df
gc.collect()

## Item_id analysis
Let's now analyse item_ids. Let me take an example I have analysed in another notebook, i.e. the FIFA 13 videogame.

In [None]:
items[items.item_name.str.contains('FIFA 13')]

As you can see, this video game was released in various versions and for various gaming platforms (e.g. PC, PS3, Xbox, etc.). What is interesting is that the item_ids of these products are consecutive. This is precious information, because if FIFA 13 is selling well on Xbox, we have good reason to believe that it will also sell well on other consoles. Mean encoding may be misleading, because some consoles are more popular than others, but other simple variables (e.g. min item count on other consoles greater than 0, average item count on other consoles greater than 5, etc.) could be a good approach. Please note that in this case I would use a leave-one-out approach.  
  
Calculating edit-distance similarity as before would take too much time. Instead, I will transform product names using TF-IDF vectorisation (fitted on the entire set of names) and then use the built-in cosine similarity function from sklearn to calculate a similarity metric.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Please notice I had to change the default token_pattern to also tokenize 1-character word
# (e.g. Far Cry 3 and Far Cry 2)
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w*\\b')
vectorized_names = vectorizer.fit_transform(items.item_name.values)
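
The effect of the custom `token_pattern` can be illustrated with the standard `re` module alone. The sketch below mimics the tokenisation step only (lowercasing followed by `findall`), which is how I understand `TfidfVectorizer` to behave with its default settings; the helper `tokenize` is hypothetical.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn default: tokens of 2+ characters
CUSTOM_PATTERN = r"(?u)\b\w\w*\b"   # pattern used above: tokens of 1+ characters


def tokenize(name: str, pattern: str) -> list:
    """Lowercase the name and extract the tokens matched by the pattern."""
    return re.findall(pattern, name.lower())


# With the default pattern, both titles collapse to the same tokens.
print(tokenize("Far Cry 3", DEFAULT_PATTERN), tokenize("Far Cry 2", DEFAULT_PATTERN))

# With the custom pattern, the edition number is kept.
print(tokenize("Far Cry 3", CUSTOM_PATTERN), tokenize("Far Cry 2", CUSTOM_PATTERN))
```

Because names are lowercased before tokenisation, two names that differ only in letter case produce identical tokens and therefore a cosine similarity of 1.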

In [None]:
# Calculate cosine similarity grid
cosine_similarity_grid = cosine_similarity(vectorized_names)

In [None]:
# Let's print out the most similar names (excluding same names)
indices = zip(*np.triu_indices_from(cosine_similarity_grid))
similar_items = []

for c in tqdm_notebook(indices):
    i, j = c[0], c[1]
    if i != j and cosine_similarity_grid[i,j]>0.9:
        similar_items.append([i,j, cosine_similarity_grid[i,j]])
similar_items = pd.DataFrame(similar_items, columns=['i','j','similarity'])
similar_items.sort_values(by='similarity',ascending=False, inplace=True)
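
The Python loop above visits every pair one by one, which is slow for roughly 22,000 items. A vectorised alternative is sketched here on a toy 3x3 matrix; on the full grid the index arrays themselves are large, so processing in row blocks may still be needed.

```python
import numpy as np
import pandas as pd

# Toy symmetric similarity matrix (made-up values).
sim = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.92],
    [0.10, 0.92, 1.00],
])

# Index pairs strictly above the diagonal, so i != j and each pair appears once.
rows, cols = np.triu_indices(sim.shape[0], k=1)
pair_values = sim[rows, cols]
keep = pair_values > 0.9

similar_pairs = pd.DataFrame({
    "i": rows[keep],
    "j": cols[keep],
    "similarity": pair_values[keep],
})
similar_pairs = similar_pairs.sort_values("similarity", ascending=False).reset_index(drop=True)
print(similar_pairs)
```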

In [None]:
similar_items[similar_items.similarity==1].shape

There are 64 pairs for which the cosine similarity is 1. This is quite weird. Let me inspect a couple of them.

In [None]:
similar_items[similar_items.similarity==1].tail()

Let's have a look at some of them.

In [None]:
items[items.item_id.isin([8642, 8643, 8632, 8633])]

Wow! It seems that some products are treated as different only because of lower/upper case differences. These items should definitely be categorised as the same. Let's check some others.

In [None]:
items[items.item_id.isin([9048, 9049, 18126, 18127])]

Same stuff! I would use a brute-force approach here and replace each duplicated item with the first id appearing in the dataframe above.

In [None]:
duplicated_items = similar_items[similar_items.similarity==1].copy()
for c in duplicated_items.columns:
    duplicated_items.loc[:,c] = pd.to_numeric(duplicated_items[c], downcast='integer')

In [None]:
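for _, r in tqdm_notebook(duplicated_items.iterrows()):
    # i and j are row positions in `items`; the notebook assumes they coincide with item_id
    keep_id, drop_id = int(r['i']), int(r['j'])
    train.loc[train.item_id==drop_id, 'item_id'] = keep_id
    test.loc[test.item_id==drop_id, 'item_id'] = keep_id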

In [None]:
train[train.item_id==9049]

In [None]:
del duplicated_items
gc.collect()

Let's now check items with a very high similarity (e.g. > 0.99), but lower than 1.

In [None]:
similar_items[(similar_items.similarity>0.99)&(similar_items.similarity<1)]

In [None]:
items[items.item_id.isin([4199, 4200, 10479, 10480, 14431, 14432])]

As you can see, many of these items are very similar. It may make sense to start thinking about clustering item_ids according to their name similarity. To do that, I will use an algorithm that does not require fixing the number of clusters.

Let's take a sample of products in the similarity grid, and calculate t-SNE values. Let's plot this into a scatter plot.

In [None]:
# Take a sample
sample_idx = np.random.choice(np.arange(cosine_similarity_grid.shape[0]), size=250, replace=False)
sample = cosine_similarity_grid[sample_idx].copy()

In [None]:
from sklearn.manifold import TSNE
sim_embed = TSNE().fit_transform(sample)
x, y = zip(*sim_embed)
plt.scatter(x,y);

Very interesting! There seems to be a hidden categorisation of items according to their names. I have tried a couple of clustering methods (namely DBSCAN and Affinity Propagation), but they took a lot of time and memory to train. I will come back to them in case I am not satisfied with my solution.  
Another way to categorise each product is to look for the items with the closest similarity levels. Let me find a smart way to do this.

In [None]:
del sim_embed, sample
gc.collect()

In [None]:
from torch import topk
import torch

Here I want to find the top 3 most similar products (by product name) for a given item. I apply a cosine similarity threshold (0.6) and only retain a neighbour if its similarity with the original name is higher than the threshold. If a neighbour falls below the threshold, the code below falls back to the previous neighbour (or to the item itself for the first one), so that features computed for it (e.g. min, max and mean) are simply those of the original product.
To do this, I will leverage the PyTorch `topk` function.

In [None]:
cosine_similarity_grid_torch = torch.from_numpy(cosine_similarity_grid)

In [None]:
topk_values, topk_indices = topk(cosine_similarity_grid_torch, 4)
topk_values, topk_indices = topk_values.numpy(), topk_indices.numpy()

In [None]:
def add_index(i, n, topk_values, topk_indices, threshold=0.6):
    
    val, ind = topk_values[i], topk_indices[i]
    
    if val[n] > threshold:
        return ind[n]
    else:
        return i

In [None]:
# Create similar 1 column
items.loc[:,'similar_1'] = items.item_id.apply(add_index, n=1, topk_values=topk_values, topk_indices=topk_indices)

In [None]:
def add_index_mul(i, n, topk_values, topk_indices, threshold=0.6):
    
    item_id, similar_prev = i[0], i[1]
    
    val, ind = topk_values[item_id], topk_indices[item_id]
    
    if val[n] > threshold:
        return ind[n]
    else:
        return similar_prev

In [None]:
# Create similar 2 columns
items.loc[:,'similar_2'] = items[['item_id', 'similar_1']].apply(add_index_mul, n=2, topk_values=topk_values, topk_indices=topk_indices, axis=1)

# Create similar 3 columns
items.loc[:,'similar_3'] = items[['item_id', 'similar_2']].apply(add_index_mul, n=3, topk_values=topk_values, topk_indices=topk_indices, axis=1)
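
The same neighbour selection can be sketched with NumPy alone, which may help if PyTorch is not available. The toy matrix is made up, and excluding the diagonal explicitly is a design choice of this sketch (the cells above instead take the top 4 and skip position 0).

```python
import numpy as np

# Toy similarity matrix for four items (made-up values).
sim = np.array([
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.7, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
])
threshold = 0.6
k = 3

# Exclude each item from its own neighbour list.
no_self = sim.copy()
np.fill_diagonal(no_self, -np.inf)

# Indices and values of the k most similar items per row, best first.
order = np.argsort(-no_self, axis=1)[:, :k]
values = np.take_along_axis(no_self, order, axis=1)

# Below the threshold, fall back to the previous neighbour (or to the item itself).
similar = np.empty_like(order)
previous = np.arange(sim.shape[0])
for col in range(k):
    similar[:, col] = np.where(values[:, col] > threshold, order[:, col], previous)
    previous = similar[:, col]

print(similar)
```

Item 0 keeps two real neighbours (1 and 2) and repeats the second one, while item 3 has no neighbour above the threshold and simply points to itself.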

In [None]:
# Let's check out the FIFA example again
items[items.item_name.str.contains('FIFA 14')]

In [None]:
del similar_items, cosine_similarity_grid, topk_values, topk_indices, cosine_similarity_grid_torch
del vectorized_names
gc.collect()

In [None]:
items = items.drop('item_name', axis=1)

## Frequency encoding
Frequency encoding of:
* item_id,
* shop_id,
* city,
* category_id,
* category,
* sub category and
* shop-item combinations.

In [None]:
def frequency_encode(series):
    return series.value_counts(normalize=True)

In [None]:
# Add shop_id and item_id combination
train.loc[:,'shop_and_item'] = train.shop_id.astype(str) + '-' + train.item_id.astype(str)
train.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Create all possible shop and item combinations
shop_and_item = pd.Series(list(product(shops.shop_id.values, items.item_id.values)))
shop_and_item = pd.DataFrame(shop_and_item.apply(lambda x: str(x[0]) + '-' + str(x[1])), columns=['shop_and_item'])

# Label-encode them
shop_and_item_encoder = LabelEncoder()
shop_and_item_encoder.fit(shop_and_item.shop_and_item)

# Transform
shop_and_item.loc[:,'shop_and_item'] = shop_and_item_encoder.transform(shop_and_item.shop_and_item)
train.loc[:,'shop_and_item'] = shop_and_item_encoder.transform(train.shop_and_item)

In [None]:
# Create frequency encodings in items dataframe
items.loc[:,'item_id_freq_encod'] = items.item_id.map(frequency_encode(train.item_id))

In [None]:
# Frequency encode shop_id
shops.loc[:,'shop_id_freq_encod'] = shops.shop_id.map(frequency_encode(train.shop_id))

I now have to add missing information to the train set, including city, shop_sub_type, category_id, category, sub_category. This will require some ad-hoc merging.

In [None]:
# Add shops details
train = train.merge(shops[['shop_id', 'city', 'shop_sub_type']], how='left', on=['shop_id'])

# Add category id
train = train.merge(items[['item_id', 'item_category_id']], how='left', on=['item_id'])

# Add category information
train = train.merge(categories[['item_category_id', 'category', 'sub_category']], how='left', on=['item_category_id'])

In [None]:
# Add city freq encoding
shops.loc[:,'city_freq_encod'] = shops.city.map(frequency_encode(shops.city))

# Add category_id freq encoding
categories.loc[:,'item_category_id_freq_encod'] = categories.item_category_id.map(frequency_encode(train.item_category_id))

# Add category freq encoding
categories.loc[:,'category_freq_encod'] = categories.category.map(frequency_encode(train.category))

# Add sub_category freq encoding
categories.loc[:,'sub_category_freq_encod'] = categories.sub_category.map(frequency_encode(train.sub_category))

# Add shop_item freq encoding
shop_and_item.loc[:,'shop_and_item_freq_encod'] = shop_and_item.shop_and_item.map(frequency_encode(train.shop_and_item))

In [None]:
# Fill na
items = items.fillna(0)
categories = categories.fillna(0)
shops = shops.fillna(0)
shop_and_item = shop_and_item.fillna(0)

In [None]:
# Add oldest transaction (don't have to fill NA here yet)
items.loc[:,'oldest_date_block_num'] = items.item_id.map(train.groupby('item_id')['date_block_num'].min())

In [None]:
# Dump modified files
items.to_hdf('processed_items.hdf5', key='df')
categories.to_hdf('processed_categories.hdf5', key='df')
shops.to_hdf('processed_shops.hdf5', key='df')
shop_and_item.to_hdf('shop_and_item.hdf5', key='df')
train.to_hdf('processed_train.hdf5', key='df')
test.to_hdf('processed_test.hdf5', key='df')

# Target and lagged (count and mean)
## Create Grid

In [None]:
grid = []
for block_num in tqdm_notebook(train.date_block_num.unique()):
    cur_shops = train[train['date_block_num']==block_num]['shop_id'].unique()
    cur_items = train[train['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(cur_shops, cur_items, [block_num])), dtype='int32'))

In [None]:
# Create dataframe from grid
index_cols = ['shop_id', 'item_id', 'date_block_num']
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)
grid.sort_values(by=['date_block_num', 'shop_id', 'item_id'], inplace=True)
grid.reset_index(inplace=True, drop=True)

In [None]:
grid.head()

In [None]:
# Add item_cnt_month (not target)
item_cnt_df = train.groupby(['date_block_num', 'shop_id', 'item_id'])['item_cnt_day'].sum().rename('item_cnt_month').reset_index()
item_cnt_df.head()

In [None]:
# Clip values
item_cnt_df.loc[:,'item_cnt_month'] = item_cnt_df.item_cnt_month.clip(0,20)

In [None]:
# Merge item_cnt_month into grid (NaN values fill them with 0)
grid = grid.merge(item_cnt_df, how='left', on=['date_block_num', 'shop_id', 'item_id']).fillna(0)
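
The grid construction and the monthly aggregation can be followed end to end on a tiny example. The data below is made up; the steps mirror the cells above (all shop-item pairs active in a month, monthly sum clipped to [0, 20], missing pairs filled with 0).

```python
from itertools import product

import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [1, 1, 2, 1],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_day":   [15.0, 10.0, -1.0, 2.0],
})
index_cols = ["shop_id", "item_id", "date_block_num"]

# Every shop x item combination among those active in each month.
rows = []
for block, month in toy.groupby("date_block_num"):
    rows.extend(product(month["shop_id"].unique(), month["item_id"].unique(), [block]))
toy_grid = pd.DataFrame(rows, columns=index_cols)

# Monthly totals, clipped to the [0, 20] range.
monthly = (
    toy.groupby(index_cols)["item_cnt_day"].sum()
    .clip(0, 20)
    .rename("item_cnt_month")
    .reset_index()
)

# Pairs with no sales in a month get 0.
toy_grid = toy_grid.merge(monthly, how="left", on=index_cols).fillna(0)
print(toy_grid)
```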

In [None]:
del item_cnt_df
gc.collect()

In [None]:
grid.head()

### Append test set

In [None]:
test.loc[:,'date_block_num'] = 34
test.loc[:,'item_cnt_month'] = 0

In [None]:
# Stack the test rows under the training grid
grid = pd.concat([grid, test.drop('ID', axis=1)], ignore_index=True, sort=False)
grid.loc[:,'item_cnt_month'] = grid.item_cnt_month.astype(int)

In [None]:
# Add shop and item
grid.loc[:,'shop_and_item'] = grid.shop_id.astype(str) + '-' + grid.item_id.astype(str)
grid.loc[:,'shop_and_item'] = shop_and_item_encoder.transform(grid.shop_and_item)

In [None]:
grid.head()

In [None]:
del train
del test
gc.collect()

## Lag sales
Here I am using the same formula as in https://www.kaggle.com/sarthakbatra/predicting-sales-tutorial

In [None]:
def generate_lag(grid, months, lag_column):
    for month in months:
        # Speed up by grabbing only the useful bits
        
        grid_shift = grid[['date_block_num', 'shop_id', 'item_id', lag_column]].copy()
        grid_shift.columns = ['date_block_num', 'shop_id', 'item_id', lag_column+'_lag_'+ str(month)]
        grid_shift['date_block_num'] += month
        grid = pd.merge(grid, grid_shift, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return grid
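
To make the shifting trick concrete, here is a condensed single-lag version of the function applied to a toy frame. The helper name `add_lag` and the data are assumptions for illustration; the logic follows `generate_lag` above.

```python
import pandas as pd


def add_lag(frame, month, column):
    """Attach `column` as observed `month` months earlier for the same shop and item."""
    keys = ["date_block_num", "shop_id", "item_id"]
    shifted = frame[keys + [column]].copy()
    shifted = shifted.rename(columns={column: f"{column}_lag_{month}"})
    # Moving the block number forward makes month t-1 line up with month t in the merge.
    shifted["date_block_num"] += month
    return frame.merge(shifted, on=keys, how="left")


toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [1, 1, 1],
    "item_id": [10, 10, 10],
    "item_cnt_month": [5, 7, 9],
})

lagged = add_lag(toy, 1, "item_cnt_month")
print(lagged)
```

The first month has no predecessor, so its lag is missing; this is why the notebook later fills missing lags with zero.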

In [None]:
grid = downcast_dtypes(grid)

In [None]:
%%time
# Lag item counts
grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'item_cnt_month')

In [None]:
# Fill na with zero (later remember to select only date_block_num greater or equal to 12)
grid = grid.fillna(0)

In [None]:
for c in grid.columns[5:]:
    grid.loc[:,c] = pd.to_numeric(grid[c], downcast='integer')

In [None]:
grid = downcast_dtypes(grid)

## Mean encodings
### Label Encode categorical features

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Fix shops
encoder = LabelEncoder()
shops.loc[:,'city'] = encoder.fit_transform(shops.city)
shops.loc[:,'shop_sub_type'] = encoder.fit_transform(shops.shop_sub_type)

In [None]:
# Fix categories
encoder = LabelEncoder()
categories.loc[:,'category'] = encoder.fit_transform(categories.category)
categories.loc[:,'sub_category'] = encoder.fit_transform(categories.sub_category)

In [None]:
# Drop some columns
categories.drop(columns=['item_category_name'], inplace=True)
shops.drop(columns=['shop_name', 'extracted_name'], inplace=True)

In [None]:
# Add all to grid
grid = grid.merge(items, how='left', on=['item_id'])
del items
gc.collect()
grid = downcast_dtypes(grid)

In [None]:
grid = grid.merge(categories, how='left', on=['item_category_id'])
del categories
gc.collect()
grid = downcast_dtypes(grid)

In [None]:
grid = grid.merge(shops, how='left', on=['shop_id'])
del shops
gc.collect()
grid = downcast_dtypes(grid)

In [None]:
grid = grid.merge(shop_and_item, how='left', on=['shop_and_item'])
del shop_and_item
gc.collect()
grid = grid.drop(columns=['shop_and_item'])
grid = downcast_dtypes(grid)

In [None]:
grid = downcast_dtypes(grid)

In [None]:
del shop_names, shop_splitter, x, y
gc.collect()

In [None]:
grid.to_hdf('grid.hdf5', key='df')

### Mean encodings
Here I will create mean encodings by:
* item_id,
* shop_id,
* city,
* category_id,
* category and
* sub category.
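
A leakage-free mean encoding can be sketched as follows: compute the monthly mean of the target per item, then shift it by one month before merging, so that each row only sees the previous month's average. The toy data and the column name `item_month_mean_lag_1` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [1, 2, 1, 2],
    "item_id": [10, 10, 10, 10],
    "item_cnt_month": [4, 2, 6, 0],
})

# Average target per item and month.
item_mean = (
    toy.groupby(["date_block_num", "item_id"])["item_cnt_month"].mean()
    .rename("item_month_mean_lag_1")
    .reset_index()
)

# Shift forward by one month so the same-month mean never enters the features.
item_mean["date_block_num"] += 1

encoded = toy.merge(item_mean, how="left", on=["date_block_num", "item_id"])
print(encoded)
```

Rows in month 0 have no earlier data and get a missing value, while rows in month 1 receive the month 0 average of 3.0.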

In [None]:
# # Mean item_id
# mean_id = grid.groupby(['date_block_num', 'item_id'])['item_cnt_month'].mean().rename('item_month_mean').reset_index()
# grid = grid.merge(mean_id, how='left', on=['date_block_num', 'item_id'])

# # Delete mean_id
# del mean_id
# gc.collect()

# # Create lags
# grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'item_month_mean')

# # We need to drop item_month_mean otherwise that would be a massive leakage for our model
# # Item month mean is basically the average target value for the product.
# grid.drop(columns=['item_month_mean'], inplace=True)

In [None]:
# grid = downcast_dtypes(grid)

In [None]:
# # Mean shop_id (should capture the activity of a shop)
# mean_shop_id = grid.groupby(['date_block_num', 'shop_id'])['item_cnt_month'].mean().rename('shop_month_mean').reset_index()
# grid = grid.merge(mean_shop_id, how='left', on=['date_block_num', 'shop_id'])

# # Delete mean_id
# del mean_shop_id
# gc.collect()

# # Create lags
# grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'shop_month_mean')

# # We need to drop shop_month_mean otherwise that would be a massive leakage for our model
# grid.drop(columns=['shop_month_mean'], inplace=True)

In [None]:
# grid = downcast_dtypes(grid)

In [None]:
# # Mean city
# mean_city_id = grid.groupby(['date_block_num', 'city'])['item_cnt_month'].mean().rename('city_month_mean').reset_index()
# grid = grid.merge(mean_city_id, how='left', on=['date_block_num', 'city'])

# # Delete mean_id
# del mean_city_id
# gc.collect()

# # Create lags
# grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'city_month_mean')

# # We need to drop city_month_mean otherwise that would be a massive leakage for our model
# grid.drop(columns=['city_month_mean'], inplace=True)

In [None]:
# grid = downcast_dtypes(grid)

In [None]:
# # Mean category_id
# mean_category_id = grid.groupby(['date_block_num', 'item_category_id'])['item_cnt_month'].mean().rename('category_id_month_mean').reset_index()
# grid = grid.merge(mean_category_id, how='left', on=['date_block_num', 'item_category_id'])

# # Delete mean_id
# del mean_category_id
# gc.collect()

# # Create lags
# grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'category_id_month_mean')

# # We need to drop category_id_month_mean otherwise that would be a massive leakage for our model
# grid.drop(columns=['category_id_month_mean'], inplace=True)

In [None]:
# grid = downcast_dtypes(grid)

In [None]:
# # Mean category
# mean_category = grid.groupby(['date_block_num', 'category'])['item_cnt_month'].mean().rename('category_month_mean').reset_index()
# grid = grid.merge(mean_category, how='left', on=['date_block_num', 'category'])

# # Delete mean_id
# del mean_category
# gc.collect()

# # Create lags
# grid = generate_lag(grid, [1,2,3,4,5,6,11,12], 'category_month_mean')

# # We need to drop category_month_mean otherwise that would be a massive leakage for our model
# grid.drop(columns=['category_month_mean'], inplace=True)

In [None]:
# grid = downcast_dtypes(grid)

In [None]:
# Mean sub-category

# Percentage change and positive item count for similar products (across various months)

# Product age