# Intro

The Predict Future Sales competition challenges Kagglers to (somewhat unsurprisingly) predict the sales of a Russian software firm: 1C Company.

Competitors are provided with transaction data, information about the items being sold, and information about the stores in which the transactions took place. They are to use this information to predict the sales of these (and other) items in the month following the period covered by the data.

In this notebook you will find an exploration of the competition data, a few illuminating insights, and a spot of feature engineering that will hopefully set any prediction models off to a good start. Let's take a look.

![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fvortini.com%2Fwp-content%2Fuploads%2F2017%2F05%2FSalesForecast.png&f=1&nofb=1)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_palette('husl'); sns.set_style('ticks');
import pickle
from tqdm import tqdm

%matplotlib inline

# display all figures to 2 decimal places
pd.options.display.float_format = '{:.2f}'.format

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Basic Data Exploration and Feature Engineering

The competition data is broken down as follows:

<ul>
    <li><code><b>sales_train.csv</b></code>: training data with transaction information showing the date, month index, shop, item, price and quantity of each transaction
    <li><code><b>items.csv</b></code>: names of each item and the category they belong to
    <li><code><b>item_categories.csv</b></code>: a short description of each category
    <li><code><b>shops.csv</b></code>: name details for each shop
    <li><code><b>test.csv</b></code>: item / shop pairings for which we are required to predict sales
    <li><code><b>sample_submission.csv</b></code>: an example of the competition submission format
</ul>

To start, we'll go through each of the dataframes, familiarise ourselves with the information they contain, and have a crack at creating some simple features if there's any obvious low-hanging fruit to be had:

In [None]:
# start by loading the competition data into pandas dataframes
os.chdir('../input/competitive-data-science-predict-future-sales')

sales_train    = pd.read_csv('sales_train.csv')
items           = pd.read_csv('items.csv')
item_categories = pd.read_csv('item_categories.csv')
shops           = pd.read_csv('shops.csv')
sample_submission = pd.read_csv('sample_submission.csv')
test = pd.read_csv('test.csv')
         
os.chdir('../../working')

# - sales_train

The sales data comprises the primary transaction data we'll be using to make sales predictions. We have ~2.9M transaction records over a period of 34 months. Individual records include a <code><b>date</b></code> (recorded as individual days - this is <b>very</b> important to remember for later), <code><b>date_block_num</b></code>, <code><b>shop_id</b></code>, <code><b>item_id</b></code>, <code><b>item_price</b></code>, and the target variable <code><b>item_cnt_day</b></code>, which are all pretty self-explanatory:

In [None]:
sales_train.info()

In [None]:
sales_train.head(5)

In [None]:
sales_train.tail(5)

To start with, let's visualise the overall annual sales figures to see what we are working with:

In [None]:
# splits the date feature into month and year
annual_trends = sales_train[['date','item_cnt_day']].copy()
annual_trends['month'] = annual_trends.date.apply(
    lambda x: int(x.split('.')[1])
)
annual_trends['year'] = annual_trends.date.apply(
    lambda x: int(x.split('.')[2])
)

# total the units sold in each month of each year
annual_trends = annual_trends[['year','month','item_cnt_day']].\
                    groupby(['year','month'], as_index=False).sum()

# visualise the sales patterns
fig, ax = plt.subplots(figsize=(15,8))
sns.lineplot(x='month', 
             y='item_cnt_day', 
             hue='year', 
             data=annual_trends, 
             ax=ax, 
             palette=['yellow','orange','red'], 
             linewidth=3)
sns.despine()

Immediately, we can see some pretty strong seasonal trends at work here. There are distinct months in which sales are consistently higher than in others.
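As an aside, the same month / year split can be done by parsing the date strings rather than splitting them. The sketch below assumes the dates are formatted as `DD.MM.YYYY` (which is what the string splitting above implies) and uses a made-up three-row frame, `toy_sales`, rather than the competition data:

```python
import pandas as pd

# toy stand-in for sales_train; the values are made up for illustration
toy_sales = pd.DataFrame({
    'date': ['02.01.2013', '15.01.2013', '03.02.2014'],
    'item_cnt_day': [1.0, 2.0, 5.0],
})

# parse the day.month.year strings into proper timestamps
parsed = pd.to_datetime(toy_sales['date'], format='%d.%m.%Y')
toy_sales['month'] = parsed.dt.month
toy_sales['year'] = parsed.dt.year

# total units sold per calendar month of each year
toy_trends = (
    toy_sales.groupby(['year', 'month'], as_index=False)['item_cnt_day'].sum()
)
print(toy_trends)
```

Parsing the dates once also makes other calendar features (day of week, quarter, and so on) available through the same `.dt` accessor.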

We can take a more detailed look at some of the individual items, to see if there are any trends at the item level:

In [None]:
random_items = [21904, 20553, 15228, 18221,  3327,  7882, 18847,  6225, 16592, 7218]

# ### uncomment the below to see a random selection of items ###
# random_items = np.random.choice(sales_train.item_id.unique(), 10)

item_mask = (sales_train.item_id.isin(random_items))
features = ['date_block_num', 'item_id', 'item_cnt_day']
item_trends = sales_train[item_mask][features].copy()

item_trends = item_trends.groupby(['date_block_num', 'item_id'], 
                                  as_index=False).sum()

fig, ax = plt.subplots(figsize=(15,8))

for item in random_items:
    mask = item_trends.item_id == item
    plot_data = item_trends[mask]
    sns.lineplot(x='date_block_num', 
                 y='item_cnt_day',
                 label=item,
                 data=plot_data,
                 ax=ax,
                 linewidth=2)

sns.despine()

<small><i>Disclaimer: though a fixed group of items have been used for the above, this is primarily to avoid random selection resulting in an unattractive or unreadable graph, where one high selling item dominates and the other items clump near the bottom. The items above were originally chosen at random by numpy - forking the kernel, running the above with the commented section, and analysing different items is encouraged.</i></small>

The data is a lot noisier at this level (i.e. it's a mess!), but there are still some clear trends that appear: sales tend to peak within the first two months of the item's introduction and tail off as the item gets older. Also, note that none of the above items feature sales in every month of the data - many are introduced much later on in the timeline and many stop selling fairly early. It is clear that there are strong temporal trends right the way through the data, and so we should think carefully about this when building our training dataset and when training our model. 

If we move on to look at the distribution of examples within each feature, we see some hefty tails on <code><b>item_price</b></code> and <code><b>item_cnt_day</b></code>, and some negative values for each:

In [None]:
sales_train.describe()

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,5))
# pass the data by keyword: recent seaborn versions no longer accept it positionally as x
sns.violinplot(x=sales_train.item_cnt_day, ax=ax1, color='green')

sns.violinplot(x=sales_train.item_price, ax=ax2, color='blue')

Though certainly worth keeping in mind, the tails shouldn't be hugely surprising. Given the scope of the dataset and the huge number of stores and items, we would expect to see individual sales in far greater numbers than multiple sales, and the odd hugely popular item. Similarly, considering the sales prices of the stock in a shop we would expect to see a minority of very high-value items. The negative values for <code><b>item_cnt_day</b></code> most likely represent the return of unwanted or faulty items. 

We must carefully consider how to treat outlying examples. It's possible that the very small minority of data points that exhibit very high sales / prices or negative values may end up hiding some of the useful variance between the items within a more normal range of values.

Starting with <code><b>item_cnt_day</b></code>, it can be seen that a tiny fraction of the data features in the long tail of the distribution:

In [None]:
sales_over_10 = len(sales_train[sales_train.item_cnt_day >= 10])
print(f"{sales_over_10} transactions or {sales_over_10 * 100 / len(sales_train)}% "
      + "of all the transactions feature the sale of 10 or more items")

similarly, a very small minority of transactions feature a negative <code><b>item_cnt_day</b></code>:

In [None]:
negative_sales = len(sales_train[sales_train.item_cnt_day < 0])
print(f"{negative_sales} transactions or {negative_sales * 100 / len(sales_train)}% "
      + "of all the transactions represent returns or a negative item_cnt_day")

Further exploration of the <code><b>item_cnt_day</b></code> target feature doesn't seem to reveal anything obvious that points to examples that should / shouldn't be removed or changed. From experience the most effective approach has been to keep only the transactions with an <code><b>item_cnt_day</b></code> in the range [0, 500), i.e. to drop returns and anything at or above 500. There is no satisfyingly scientific reasoning behind this; just that trial and error with a range of different approaches has shown this to produce the best results:

In [None]:
drop_mask = (sales_train.item_cnt_day < 500) & (sales_train.item_cnt_day >= 0)
sales_train = sales_train[drop_mask]
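A quick check on toy values (made up for illustration) shows which counts survive a mask of this shape. Note that the upper bound is exclusive, so a count of exactly 500 is dropped along with the negative values:

```python
import pandas as pd

# made-up counts covering both edges of the range
toy_counts = pd.Series([-1, 0, 1, 499, 500, 1000], name='item_cnt_day')

# same condition as the drop_mask above: keep 0 <= count < 500
keep = (toy_counts >= 0) & (toy_counts < 500)
kept = toy_counts[keep].tolist()
print(kept)  # [0, 1, 499]
```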

With <code><b>item_price</b></code> there are some interesting features of the outliers that merit a closer look. Isolating the higher valued items, ordering them from lowest to highest median value, and then plotting their min, median, and max price we get the following:

In [None]:
# creating a plot to show the items with high values
def plot_hv_items():
    # create list of items valued over 5000
    high_value_mask = sales_train.item_price > 5000
    high_value_items = sales_train[high_value_mask].item_id.unique()

    # collect the transactions relating to the above items, 
    mask = sales_train.item_id.isin(high_value_items)
    high_value_prices = (
        # group them by item
        sales_train[mask].groupby('item_id')
        # capture the min, median, and max prices
        .agg({'item_price': ['min','median','max']})
    )

    # remove the multi-level column structure in the grouped dataframe
    high_value_prices.columns = high_value_prices.columns.droplevel()
    # order the values by median item_price
    high_value_prices = high_value_prices.sort_values('median').reset_index()

    # plot each of min / median / max by the ordered index
    fig, ax = plt.subplots(figsize=(15,10))
    
    for agg_method in ['min','median','max']:
        sns.lineplot(x=high_value_prices.index, 
                     y=agg_method,
                     label=agg_method,
                     data=high_value_prices, 
                     ax=ax, 
                     linewidth=2)

    plt.xlabel('items in price order low-high')
    plt.ylabel('price')
    sns.despine()
    
plot_hv_items()

For the most part, the <code><b>item_price</b></code> seems to do what we would expect, i.e. it varies pretty uniformly around the median <code><b>item_price</b></code> for each item. There are, however, a couple of pretty blatant outliers:
<ol>
<li>At the very start of the graph, there is an odd spike in the max value of one of the items. A quick look at the data and a bit of Google Translate shows this example to be a sale of Доставка (EMS) or 'Shipping'. The prices of the other sales of shipping all fall within a much smaller range than the current max, so this appears to be an outlier.</li>

<li>At the end of the graph there is a sudden explosion of the gradient - this turns out to be only one transaction 'Radmin 3 - 522 persons' valued at 307980. Apparently Radmin 3 is a piece of software used by IT professionals, and it seems this is some sort of bulk transaction (probably some sort of enterprise /  corporate arrangement or suchlike that found its way onto the books)</li>
</ol>

A couple of outlying examples are unlikely to affect our predictions dramatically, but we can remove them anyway:

In [None]:
sales_train = sales_train[sales_train.item_price < 50000]

Re-running the above graph cell will show the prices now fall within a more reasonable min / median / max distribution.

In [None]:
plot_hv_items()

Having dealt with the distribution of the existing features, let's see if there are any we can quickly add. Firstly, given that for each sale we have a price (<code><b>item_price</b></code>) and a sales volume (<code><b>item_cnt_day</b></code>), we can quickly create a revenue feature (i.e. number of items sold multiplied by the sales price):

In [None]:
sales_train['revenue'] = sales_train.item_cnt_day * sales_train.item_price

A feature that identifies each month of the year may also be useful, as this may help a predictive model identify some of the obvious trends in the above graph of sales:

In [None]:
sales_train['month'] = (sales_train.date_block_num % 12)

Finally, it may also be useful for our model to know how long each month is, as more days mean more potential sales:

In [None]:
# create quick dataframe containing months and their lengths
month_lengths = pd.DataFrame({
    'month': range(0,12),
    'month_length': [31,28,31,30,31,30,31,31,30,31,30,31]
})

# merge the month_lengths into the sales_train data using the month feature
sales_train = sales_train.merge(month_lengths, on='month', how='left')
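A fixed list of lengths treats every February as 28 days, which would be wrong for any leap year in the data. A slightly more general sketch derives the lengths from the calendar instead. It assumes that `date_block_num` 0 corresponds to January 2013 (the start date is an assumption here, not something stated above) and builds a hypothetical `block_lengths` lookup keyed on `date_block_num`:

```python
import pandas as pd

# assumption: date_block_num 0 is January 2013; 35 blocks cover months 0..34
block_starts = pd.date_range('2013-01-01', periods=35, freq='MS')

# days_in_month accounts for leap years automatically
block_lengths = pd.DataFrame({
    'date_block_num': range(35),
    'month_length': block_starts.days_in_month,
})
print(block_lengths.head(3))
```

This lookup could then be merged on `date_block_num` rather than on `month`.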

Easy! There's obviously potential here for a huge number of engineered features. For the moment though, it's better to explore the other data we've been provided and return when we are ready to draw all the information together in one training dataset:

# - shops

The shops data is a small table of the 60 shops in the dataset, each with only a <code><b>shop_name</b></code> and corresponding <code><b>shop_id</b></code>:

In [None]:
shops.info()

At first glance this information appears pretty mundane. As it stands, each of the <code><b>shop_name</b></code> strings is unique and so the feature provides no more information than the <code><b>shop_id</b></code> that already features in <code><b>sales_train</b></code>. Closer inspection, though, shows repetition of several of the 'words' in each of the name strings:

In [None]:
shops.head(10)

For those unfamiliar with the Cyrillic alphabet and / or the geography of Russia, some quick googling will show you that the first token (or word) in the <code><b>shop_name</b></code> string is a city, e.g. <b>'Якутск'</b> is <b>Yakutsk</b>. Thus, we can conclude that we have several different shops in each city. There's also a lot of repetition in the second token / word, with the likes of <b>'ТЦ'</b> and <b>'ТРЦ'</b> appearing numerous times - these turn out to be things like <b>'shopping centre'</b> and <b>'shopping and entertainment centre'</b>.

We can easily extract this information as new features:

In [None]:
# define new features
shops['city'] = ""
shops['shop_cat'] = ""

# iterate through entries in shops 
for idx in shops.index:
    # split name into individual tokens
    full_name = shops.loc[idx].shop_name.split(' ')
    # and define city as first token, shop as second
    shops.loc[idx, 'city'] = full_name[0]
    shops.loc[idx, 'shop_cat'] = full_name[1]

In [None]:
# selection of shops chosen to exemplify the new features 
# and highlight the required cleaning mentioned below
shop_idxs = [0,1,57,58,52,54,46]
shops.iloc[shop_idxs]

A check of the new features shows a little bit of manual cleaning is required:

<ul>
<li>two of the 'Якутск' examples are prefixed with a <b>'!'</b> - this needs removing as the model will otherwise assume they represent a different city</li>
<li>example <b>id 46</b> is <b>'Сергиев Посад ТЦ "7Я"'</b> - the subcategory has been set to <b>'Посад'</b> when it should be <b>'ТЦ'</b></li>
</ul>
Of course, we could make the above code more sophisticated to account for these inconsistencies. Given their small number however, we can be lazy and just manually change them:

In [None]:
shops.loc[0:1, 'city'] = 'Якутск'
shops.loc[46, 'shop_cat'] = 'ТЦ'
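For reference, a less lazy version might look something like the sketch below. It is only an illustration: the helper `split_shop_name` and the `two_word_cities` list are hypothetical, and two of the three shop names are made up to mimic the patterns described above.

```python
import pandas as pd

# toy shop names: the second is quoted from above, the others are made up
toy_shops = pd.DataFrame({'shop_name': [
    '!Якутск ТЦ "Example"',
    'Сергиев Посад ТЦ "7Я"',
    'Якутск ТЦ "Example"',
]})

# hypothetical list of cities whose names span two tokens
two_word_cities = ['Сергиев Посад']

def split_shop_name(name):
    """Return (city, shop_cat) for a shop name string."""
    name = name.lstrip('!')  # drop the stray '!' prefix
    for city in two_word_cities:
        if name.startswith(city + ' '):
            # the category is the first token after the two-word city
            return city, name[len(city):].split()[0]
    tokens = name.split()
    return tokens[0], tokens[1]

parsed = [split_shop_name(name) for name in toy_shops.shop_name]
toy_shops['city'] = [city for city, _ in parsed]
toy_shops['shop_cat'] = [cat for _, cat in parsed]
print(toy_shops)
```

With only a handful of exceptions in 60 rows, the manual fix is arguably just as clear; the helper mainly pays off if the shop list changes.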

Then we can create an ordinal encoding for these features. It should be noted that the features themselves aren't ordinal - they don't have a natural order (e.g. small -> medium -> large). Creating an ordinal encoding, however, makes it easier when fitting the data to different model types, and allows you to reduce the amount of space these features take up in memory (more on this later):

In [None]:
# use pandas 'categorical' datatype to encode features
shops['city_id'] = shops.city.astype('category').cat.codes
shops['shop_cat_id'] = shops.shop_cat.astype('category').cat.codes
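To make it concrete what `cat.codes` produces, here is a small sketch on made-up city names. The codes follow the sorted order of the distinct values, and pandas stores them in the smallest integer type that fits:

```python
import pandas as pd

# made-up values, purely for illustration
toy_cities = pd.Series(['Moscow', 'Kazan', 'Moscow', 'Omsk'])

# categories are sorted (Kazan, Moscow, Omsk) and numbered from 0
codes = toy_cities.astype('category').cat.codes
print(codes.tolist())  # [1, 0, 1, 2]
print(codes.dtype)     # int8
```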

In [None]:
shops.head(10)

Further inspection of the data also shows that some of the shops appear to be duplicates:

In [None]:
duplicate_shops = [0, 57, 1, 58, 11, 10, 40, 39]
shops.loc[duplicate_shops]

With a peek at the test data, it is clear that only one out of the two shops in these pairs makes an appearance:

In [None]:
print("Duplicate shops in test data: "
      f"{[shop for shop in duplicate_shops if shop in test.shop_id.unique()]}")

We can assume, therefore, that these shops are indeed duplicates and must have changed names at some point in the data. As such, we can go through the training data and amend the offending <code><b>shop_id</b></code>s accordingly:

In [None]:
# create dict of obsolete shops and replacement shop id
duplicate_shops = {0:57, 1:58, 11:10, 40:39}

 
# amend the obsolete shop ids in sales_train, leaving all other ids unchanged
# (note: making this change in the shops dataset alone would not 
# alter the shop_id values in the training data)
sales_train['shop_id'] = sales_train.shop_id.apply(
    lambda x: duplicate_shops.get(x, x)
)
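The same remapping can also be written with `Series.replace`, which accepts a dict of old to new values. A small sketch on made-up shop ids, using the mapping defined above:

```python
import pandas as pd

duplicate_shops = {0: 57, 1: 58, 11: 10, 40: 39}

# made-up ids: a mix of obsolete, current, and unaffected shops
toy_shop_ids = pd.Series([0, 57, 1, 5, 11, 40])

# ids found in the dict are swapped, everything else is left alone
remapped = toy_shop_ids.replace(duplicate_shops)
print(remapped.tolist())  # [57, 57, 58, 5, 10, 39]
```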

That seems to be about all we can do with the <code><b>shops</b></code> dataset for the moment. Moving on we have:

# - item_categories

As with <code><b>shops</b></code>, an initial look at the <code><b>item_categories</b></code> data shows a relatively uninteresting table with an <code><b>item_category_name</b></code> and an <code><b>item_category_id</b></code> for each example:

In [None]:
item_categories.head(10)

For those of us that don't read Russian, the names can be translated to English with a bit of Google Translate and some formatting. This can make it a tad easier to parse out useful information, as well as making this explanation a fair bit easier and more interesting! To save time and space, a csv file with the translations is included in the notebook data:

In [None]:
filepath = '../input/english-categories-for-predict-future-sales/english_categories.csv'
english_categories = pd.read_csv(filepath)
item_categories['item_category_name'] = english_categories.item_category_name

In [None]:
item_categories.head(10)

As with shops, there are common patterns in the item_category_name strings that can be used to engineer new features. They are (hopefully) pretty obvious after a quick look at the data, and most names are cleanly split by a ' - ' separator. We'll use the sections of the string either side of this ' - ' as new features:

In [None]:
# declare new subcat features
item_categories['subcat_a'] = ""
item_categories['subcat_b'] = ""

# iterate through each item category
for idx in item_categories.index:
    # split category name into two strings either side of ' - '
    cat_name = item_categories.loc[idx].item_category_name.split(' - ')
    # avoid creating new categories for hyphenated names
    if len(cat_name) == 2:
        # use indexes of cat_name variable to define new features
        item_categories.loc[idx, 'subcat_a'] = cat_name[0]
        item_categories.loc[idx, 'subcat_b'] = cat_name[1]

Checking the new features, we can see a few examples where the <code><b>item_category_name</b></code> is not separated by the characters <code><b>" - "</b></code>:

In [None]:
mask = (item_categories.subcat_a == "") | (item_categories.subcat_b == "")

item_categories[mask]

Some of these examples with missing <code><b>subcat_a</b></code> / <code><b>subcat_b</b></code> info obviously fit well with existing categories and so can be manually amended as such:

In [None]:
item_categories.loc[81:82, 'subcat_a'] = 'Blank media'
item_categories.loc[81, 'subcat_b'] = 'spire'
item_categories.loc[82, 'subcat_b'] = 'piece'
item_categories.loc[32, 'subcat_a'] = 'Payment Cards'
item_categories.loc[32, 'subcat_b'] = 'Cinema, Music, Games'

The remaining examples don't fit with any of the other <code><b>subcat_a</b></code> / <code><b>subcat_b</b></code> categories, and so are filled with their <code><b>item_category_name</b></code> in both subcategory features to signify that they are unique:

In [None]:
item_categories[item_categories.subcat_a == ""]

In [None]:
for idx in item_categories[item_categories.subcat_a == ""].index:
    item_categories.loc[idx, ['subcat_a', 'subcat_b']] = (
        item_categories.loc[idx, 'item_category_name']
    )

As with <code><b>shops</b></code>, ordinal categories are then created:

In [None]:
for feature in ['subcat_a', 'subcat_b']:
    item_categories[feature + '_id'] = (
        item_categories[feature].astype('category').cat.codes
    )

Given that we have a relatively small number of categories, these can all be examined by eye to see if there are any other features common to them that have not been captured in the data. One obvious feature of the <code><b>item_categories</b></code> is that several relate to gaming:

In [None]:
gaming_categories = [0,1,7,8,9,11,14]
item_categories[item_categories.subcat_a_id.isin(gaming_categories)]

We can manually create a feature to reflect this using a simple binary flag (in effect a one-hot encoding of 'gaming' vs 'not gaming') - that is, we set the feature to 1 if the category relates to gaming and 0 otherwise:

In [None]:
item_categories['gaming'] = 0

for idx in item_categories.index:
    if item_categories.loc[idx, 'subcat_a_id'] in gaming_categories:
        item_categories.loc[idx, 'gaming'] = 1
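The loop above can also be expressed as a single vectorised step with `isin`. A sketch on a made-up frame, reusing the `gaming_categories` ids from above:

```python
import pandas as pd

gaming_categories = [0, 1, 7, 8, 9, 11, 14]

# made-up subcat_a_id values, purely for illustration
toy_categories = pd.DataFrame({'subcat_a_id': [0, 3, 7, 12, 14]})

# True / False membership test, converted to 1 / 0
toy_categories['gaming'] = (
    toy_categories.subcat_a_id.isin(gaming_categories).astype(int)
)
print(toy_categories.gaming.tolist())  # [1, 0, 1, 0, 1]
```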

In [None]:
item_categories.head(10)

There are even more granular features you can create using the <code><b>item_category_name</b></code> strings should you wish. The above are those that have proved most successful when training subsequent models.

Let's move on to the next of our datasets:

# - items

Firstly, credit for this section goes to the brilliant <a href="https://www.kaggle.com/kyakovlev">Konstantin Yakovlev</a>, with his excellent notebook <a href="https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data"> 1st place solution - Part 1 - "Hands on Data"</a> which comes highly recommended. The <code><b>name_correction</b></code> function below is taken directly from his work.

As with <code><b>shops</b></code> and <code><b>item_categories</b></code>, the <code><b>items</b></code> table is a relatively uninteresting list of each item with its <code><b>item_name</b></code>, <code><b>item_id</b></code> and <code><b>item_category_id</b></code>:

In [None]:
items.head()

The table is far too large to do much in the way of manual inspection, but with a bit of digging around it can be seen that some of the items share relatively similar names:

In [None]:
items.loc[1411:1421]

We can use Python's string and regex methods to standardize each name to lowercase characters only and remove the bracketed text and special characters, which saves us the daunting task of wading through all of the data ourselves:

In [None]:
import re
def name_correction(x):
    x = x.lower()
    x = x.partition('[')[0]
    x = x.partition('(')[0]
    x = re.sub('[^A-Za-z0-9А-Яа-я]+', ' ', x)
    x = x.replace('  ', ' ')
    x = x.strip()
    return x

items['item_name_corrected'] = items['item_name'].apply(
    lambda x: name_correction(x)
)
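To see what the function does to individual names, the sketch below applies it to three made-up item names (they are not taken from the dataset). The function body is restated from the cell above only so that the block runs on its own:

```python
import re

def name_correction(x):
    # restated from the cell above so this block is self-contained
    x = x.lower()
    x = x.partition('[')[0]
    x = x.partition('(')[0]
    x = re.sub('[^A-Za-z0-9А-Яа-я]+', ' ', x)
    x = x.replace('  ', ' ')
    x = x.strip()
    return x

# made-up names: two platform variants of one title, plus one with punctuation
examples = ['Example Game [PC, Digital]', 'EXAMPLE GAME (PS3)', 'Other-Game: Deluxe!']
corrected = [name_correction(name) for name in examples]
print(corrected)  # ['example game', 'example game', 'other game deluxe']
```

The first two variants collapse to the same string, which is exactly the behaviour the new feature relies on.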

Having done this, it's clear that a lot of the simplified names match:

In [None]:
unique_item_names = len(items.item_name.unique())
unique_corrected_item_names = len(items.item_name_corrected.unique())

print(f"{unique_item_names} unique item names, " 
      + f"{unique_corrected_item_names} unique corrected item_names")

In [None]:
items.loc[1411:1421]

This <code><b>item_name_corrected</b></code> feature could prove useful in highlighting trends between similar items, much in the same way as the features in <code><b>shops</b></code> and <code><b>item_categories</b></code> - as with these datasets, we'll create an ordinal encoding:

In [None]:
items['item_name_id'] = items.item_name_corrected.astype('category').cat.codes

And with that, we've explored all of the training data we have been provided. Before we start to draw it all together into one big training dataset, let's take a look at the test data to see what we have been tasked with:

# - Test

With the test data, we are provided only <code><b>ID</b></code>, <code><b>shop_id</b></code>, and <code><b>item_id</b></code> features:

In [None]:
test.info()

In [None]:
test.head()

Digging further into the test data, we can explore the shops and items featured in <code><b>test</b></code> and compare them with those featured in <code><b>sales_train</b></code>:

In [None]:
def compare_unique_features(feature):
    unique_train = sales_train[feature].unique()
    unique_test = test[feature].unique()
    common = test[test[feature].isin(unique_train)][feature].unique()
    print(f"{feature.upper()}: Training data contains {len(unique_train)}, "
          f"Test data {len(unique_test)}, "
          f"{len(common)} common to both.")
    
compare_unique_features('shop_id')
compare_unique_features('item_id')

We can see that the training data contains considerably more shops and items than are required in the test set. We should certainly consider keeping hold of this data, as (in general) the more data we have the better inferences our model will be able to draw. Worryingly though, some of the items in the test set never appear in the training data, and so we'll have to make some predictions without prior sales info.

If we look at the numbers involved, it's clear that we are required to make a prediction for each unique shop and item combination in the test set:

   <b>42 unique shops * 5100 unique items = 214,200 (the number of rows in the test dataset)</b>
    
Let's compare this to the shop / item combinations available in the training data:

In [None]:
shop_item = ['shop_id','item_id']
train_shop_item_combos = sales_train[shop_item + ['item_cnt_day']].\
                            groupby(shop_item, as_index=False).mean()
train_shop_item_combos['featured'] = 1
test = test.merge(train_shop_item_combos[shop_item + ['featured']],
                  on=shop_item,
                  how='left')
# assign the result back rather than filling in place on a column accessor
test['featured'] = test.featured.fillna(0)
not_featured = (test.featured == 0)
item_info = (test.item_id.isin(sales_train.item_id.unique()))
print("We have shop and item sales_train information on " 
      + f"{len(test[~not_featured])} examples in the test set.")
print("Of the remaining test examples:")
print("   * we have item only sales_train information for " 
      + f"{len(test[not_featured & item_info])}")
print("   * and no sales_train information for "
      + f"{len(test[not_featured & ~item_info])}")
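An alternative way to get the same three-way breakdown, sketched here on made-up frames, is to let `merge` report which rows found a match via its `indicator` argument. The names `toy_train`, `toy_test` and `summary` are illustrative only:

```python
import pandas as pd

# made-up shop / item pairs standing in for sales_train and test
toy_train = pd.DataFrame({'shop_id': [1, 1, 2], 'item_id': [10, 11, 10]})
toy_test = pd.DataFrame({'shop_id': [1, 2, 2, 3], 'item_id': [10, 10, 11, 99]})

# indicator=True adds a '_merge' column: 'both' means the pair exists in train
seen_pairs = toy_train.drop_duplicates()
merged = toy_test.merge(seen_pairs, on=['shop_id', 'item_id'],
                        how='left', indicator=True)

pair_seen = merged['_merge'] == 'both'
item_seen = merged.item_id.isin(toy_train.item_id)

summary = {
    'shop and item seen': int(pair_seen.sum()),
    'item only seen': int((~pair_seen & item_seen).sum()),
    'item never seen': int((~pair_seen & ~item_seen).sum()),
}
print(summary)
```

This avoids the helper `featured` column and the `fillna` step, at the cost of an extra `_merge` column to drop afterwards.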

So we can consider different approaches to making predictions for the three different groups of examples above.

Speaking of predictions, we are also provided with a sample submission file:

In [None]:
sample_submission.head()

We are told in the competition description that each entry represents sales for the month following the training data, month 34. This can be seen in the submission format, as it requires the predictions to be labeled <code><b>item_cnt_month</b></code>. 

The observant among us will immediately see a few challenges posed by this submission format: 

<ul>
<li> The format of the data in <code><b>sales_train</b></code> is <b>daily</b> not <b>monthly</b>. We need to train our model on a dataset that reflects the distribution of the test data. As such, we can't simply join all of our training dataframes to form a training dataset -  we will have to significantly reformat <code><b>sales_train</b></code> into a dataset of monthly sales.</li>

<li>As things stand, we only have information on completed transactions. Presumably we will also be required to predict cases where items <b>don't</b> sell, and these may occur as often as (or perhaps more often than) cases where they do. At some point, therefore, we will have to decide how we are going to add training examples for the items that haven't sold.</li>

<li>We are also being asked to make predictions on data <b>outside</b> of the time-frame we are given in the training data. This means that:<ol>
<li>Our training and test distributions will not match, which will make model evaluation more challenging: Our evaluation metrics may produce dramatically different scores between train, test and validation sets, even if our model is making reliable predictions.</li>
<li>We cannot use standard techniques to partition our data into training and validation sets: standard cross-validation / K-fold techniques would result in train / validation sets being taken from within the time-frame of the training data, which does not match that of the test data - we require our model to make predictions on examples taken from outside the time-frame on which it is trained.</li>
<li>We must be <b>very</b> careful when engineering features, so that we only include information that is available to us for month 34 (the testing month), and so that our model avoids 'snooping' on future information it shouldn't have access to (e.g. making a prediction on an example from month 33 using mean sales for month 33)</li></ol></li></ul>
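One common way to respect these constraints is to hold out the last labelled month for validation and to build lagged features with a shift, so that each row only sees earlier months. The sketch below shows the idea on a made-up frame; it is not necessarily the split used elsewhere in this notebook, and the column name `item_cnt_lag_1` is hypothetical:

```python
import pandas as pd

# made-up monthly sales for one item in two shops
toy = pd.DataFrame({
    'date_block_num': [30, 31, 32, 33, 30, 31, 32, 33],
    'shop_id':        [1, 1, 1, 1, 2, 2, 2, 2],
    'item_id':        [5, 5, 5, 5, 5, 5, 5, 5],
    'item_cnt_month': [2, 0, 1, 3, 4, 4, 0, 1],
})

# previous month's sales for the same shop / item pair
# (a row shift only equals a one-month lag if every month is present)
toy = toy.sort_values(['shop_id', 'item_id', 'date_block_num'])
toy['item_cnt_lag_1'] = (
    toy.groupby(['shop_id', 'item_id'])['item_cnt_month'].shift(1)
)

# time-based split: validate on the final month, train on everything before it
train_part = toy[toy.date_block_num < 33]
valid_part = toy[toy.date_block_num == 33]
print(valid_part)
```

Because the validation month sits strictly after the training months, the split mimics the relationship between the training data and month 34.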

With that, let's crack on and get our training data up to spec:

# Match Training Data to Submission Format

Before we can continue with our data preparation, we need to reformat the sales_train information so that it matches the submission format, i.e. so that it represents monthly rather than daily transactions. This can be easily done by grouping the transactions by month, shop, and item. The far bigger challenge is to add transactions for those items that <b>haven't</b> sold in a given month, which was something of an epic journey for the author! I'll go ahead and spoil the ending now for the sake of brevity, but a more detailed explanation and (attempted) justification is included in an appendix (at the end of the notebook) for those interested.

The most effective way of adding examples for unsold items is to find the items and shops that feature in each individual month of the data, and add zero sales for any combination thereof that isn't present. For example if we had the following transactions:

In [None]:
example_sales = pd.DataFrame({
    'shop':['a','a','a', 'b', 'b', 'c'],
    'item':['x','y','z','x','z','y'],
    'sales':[1,2,1, 3,1,2]
})
example_sales

Then we would add examples representing zero sales resulting in the following:

In [None]:
example_with_zero_sales = pd.DataFrame([[shop, item] 
                           for shop in example_sales.shop.unique()
                           for item in example_sales.item.unique()],
                                       columns = ['shop','item'])
example_with_zero_sales.merge(example_sales,
                             on = ['shop','item'],
                             how='left').fillna(0)

Most importantly, <b>we shouldn't be adding any zero transactions for items or shops that don't appear in a given month</b>.
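The per-month rule can be illustrated by extending the toy example with a month column. In the sketch below (all values made up), shop 'b' and item 'y' only appear in month 0, so they get no zero-sales rows in month 1:

```python
import pandas as pd
from itertools import product

# made-up sales across two months
toy = pd.DataFrame({
    'month': [0, 0, 1],
    'shop':  ['a', 'b', 'a'],
    'item':  ['x', 'y', 'x'],
    'sales': [1, 2, 3],
})

# build the shop x item grid separately for each month
rows = []
for month, month_sales in toy.groupby('month'):
    rows += [(month, shop, item)
             for shop, item in product(month_sales.shop.unique(),
                                       month_sales.item.unique())]

grid = pd.DataFrame(rows, columns=['month', 'shop', 'item'])

# attach the real sales and fill the missing combinations with zero
grid = grid.merge(toy, on=['month', 'shop', 'item'], how='left')
grid = grid.fillna({'sales': 0})
print(grid)
```

Month 0 contributes four rows (two shops by two items) and month 1 contributes one, rather than four rows for every month.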

To achieve this, we first make a DataFrame with only the <code><b>date_block_num</b></code>, <code><b>shop_id</b></code> and <code><b>item_id</b></code> for each example:

In [None]:
train_data = []

for month in range(34):
    month_data = sales_train[sales_train.date_block_num == month]
    train_data += ([[month, shop, item] 
                      for shop in month_data.shop_id.unique() 
                      for item in month_data.item_id.unique()])
    
date_shop_item = ['date_block_num', 'shop_id', 'item_id']
train_data = pd.DataFrame(train_data, columns=date_shop_item)

train_data.head()

We then concatenate the <code><b>test</b></code> examples to the bottom of this dataframe, adding the relevant <code><b>date_block_num</b></code> along the way. This is a shortcut that means we don't have to add any engineered features to the test set separately - this is particularly useful when we start adding lagged data later:

In [None]:
test_copy = test.copy()
test_copy['date_block_num'] = 34
test_copy.drop(['ID','featured'], axis=1, inplace=True)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
train_test_data = pd.concat([train_data, test_copy]).reset_index(drop=True)

Then we group the <code><b>sales_train</b></code> data by <code><b>date_block_num</b></code>, <code><b>shop_id</b></code> and <code><b>item_id</b></code> and merge the summed <code><b>item_cnt_day</b></code> (renamed <code><b>item_cnt_month</b></code>) and <code><b>revenue</b></code> into our data:

In [None]:
date_shop_item = ['date_block_num', 'shop_id', 'item_id']
sales_features = ['item_cnt_day', 'revenue']

monthly_sales_train = sales_train[date_shop_item + sales_features].\
                            groupby(date_shop_item, as_index=False).sum()
monthly_sales_train.rename(columns={'item_cnt_day':'item_cnt_month'}, 
                           inplace=True)

train_test_data = train_test_data.merge(monthly_sales_train, 
                                        on=date_shop_item, 
                                        how='left').fillna(0)

We can also add a monthly average sale price for each item. <b>Note:</b> the mean of <code><b>item_price</b></code> should not be used for this! The price varies from transaction to transaction, so an average item price for the month is certainly useful, but a plain mean of the <code><b>item_price</b></code> feature doesn't take the sales volume of each transaction into account:

i.e. 4 items sold at \$100 in one transaction and one item sold at \$150 in another would be counted as an average price of \$125, not the correct value of \$110.

Instead, the total revenue and the total number of items sold should be used to calculate the average item price for each month:

In [None]:
date_item = ['date_block_num','item_id']
features = ['item_cnt_day','revenue']
avg_prices = sales_train[date_item + features].\
                    groupby(date_item, as_index=False).sum()

avg_prices['month_avg_price'] =  avg_prices.revenue / avg_prices.item_cnt_day
train_test_data = train_test_data.merge(avg_prices[date_item + ['month_avg_price']], 
                                        on=date_item, 
                                        how='left').fillna(0)
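
The arithmetic in the example above can be checked on a toy frame. This is only a sketch: the `toy` data is hypothetical, and it assumes that `revenue` is the product of `item_price` and `item_cnt_day` for each transaction:

```python
import pandas as pd

# hypothetical transactions: 4 units at $100 in one sale, 1 unit at $150 in another
toy = pd.DataFrame({
    'item_price': [100.0, 150.0],
    'item_cnt_day': [4, 1],
})
# assumption: revenue is price multiplied by units sold
toy['revenue'] = toy.item_price * toy.item_cnt_day

# plain mean of the price column ignores how many units each sale covered
naive_avg = toy.item_price.mean()
# total revenue divided by total units gives the price paid per unit
weighted_avg = toy.revenue.sum() / toy.item_cnt_day.sum()

print(naive_avg, weighted_avg)  # 125.0 110.0
```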

And let's not forget all of the other features we've already engineered:

In [None]:
# easier to add the month / month_length features fresh than merge from sales_train
train_test_data['month'] = (train_test_data.date_block_num % 12)

train_test_data = train_test_data.merge(month_lengths, 
                                        on='month', 
                                        how='left')

item_features = ['item_id', 'item_category_id', 'item_name_id']
train_test_data = train_test_data.merge(items[item_features], 
                                        on='item_id', 
                                        how='left')

category_features = ['item_category_id','subcat_a_id', 'subcat_b_id','gaming']
train_test_data = train_test_data.merge(item_categories[category_features], 
                                        on='item_category_id', 
                                        how='left')

shop_features = ['city_id', 'shop_cat_id','shop_id']
train_test_data = train_test_data.merge(shops[shop_features], 
                                        on='shop_id', 
                                        how='left')

In [None]:
train_test_data.head()

# Managing Memory

Now that we have begun to add more features to our data, it's worth considering the size of our training dataset. Managing the amount of space our data takes up in memory is one of the main challenges in this competition. Bearing in mind that the training data will have to be loaded into RAM, usually multiple times, to train our prediction model, we must remain mindful of the size of the dataset we are creating:

In [None]:
train_test_data.info()

Our dataset takes up nearly 1GB in memory. Given the number of examples in our training data we are already fast approaching an unmanageably large dataset, even with the few features we have already engineered. Thankfully, there are ways to reduce the footprint of our dataset without compromising the amount of information it contains. 

If you look, each feature in the info table above is accompanied by a datatype i.e. <code><b>int64</b></code> or <code><b>float64</b></code>: this signifies the type of data stored by that feature and the amount of space in memory reserved for each individual value. So, for example, an <code><b>int64</b></code> is an integer value with 64 bits of memory reserved for its storage. The amount of memory reserved for a data point dictates the number of possible values it can take - so, continuing with our <code><b>int64</b></code> example, this can take the following values:

In [None]:
np.iinfo(np.int64)

In the case of our <code><b>date_block_num</b></code> feature, this is obviously overkill as we only need months 0 through 34! Indeed, none of our current features needs values anywhere near 9 x 10^18.

To avoid wasting huge amounts of memory describing very simple features, we can 'downcast' the datatype to a more appropriate size. You can use the above code block to check the ranges of the different dtypes. We can save a ton of memory by choosing the minimum amount of reserved space we need to ensure all of the values in a feature fit into that datatype. Pandas has already helpfully done this for us with the ordinal categories we have created:  <code><b>subcat_a_id</b></code>, <code><b>subcat_b_id</b></code>, <code><b>city_id</b></code>, <code><b>shop_cat_id</b></code> were all created as <code><b>int8</b></code>s and <code><b>item_name_id</b></code> was created as an  <code><b>int16</b></code>.

Let's downcast the rest of our features and see how much memory we can save:

In [None]:
dtypes = {'date_block_num': 'int8', 
          'item_id': 'int16',
          'shop_id': 'int8', 
          # int16 rather than int8, in case monthly totals exceed 127
          'item_cnt_month': 'int16', 
          'revenue': 'float32', 
          'month_avg_price': 'float32',
          'month': 'int8', 
          'month_length': 'int8', 
          'item_category_id': 'int8', 
          'gaming': 'int8'}
          
for feature, dtype in dtypes.items():
    train_test_data[feature] = train_test_data[feature].astype(dtype)
    
train_test_data.info()

Boom! We've reduced the memory usage by over 2 / 3 and didn't lose any information in the process. We'll have to keep an eye on this as we add more features, but that's just a case of downcasting them as and when they are added to the training data.
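
Rather than choosing each integer dtype by hand, the range check can be automated. The sketch below uses a hypothetical helper, `smallest_int_dtype`, and a hypothetical `toy` frame; it relies only on `np.iinfo` to compare a column's minimum and maximum against the limits of each dtype:

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds every value."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError('values do not fit in a 64-bit signed integer')


# hypothetical columns with different value ranges
toy = pd.DataFrame({
    'date_block_num': [0, 17, 34],
    'item_id': [0, 11000, 22000],
    'big_count': [-5, 127, 2000],
})
for column in toy.columns:
    toy[column] = toy[column].astype(smallest_int_dtype(toy[column]))

print(toy.dtypes)
```

Note that a value such as 2000 does not fit into an `int8` (maximum 127), so checking the range first avoids silent overflow. pandas also offers `pd.to_numeric(series, downcast='integer')`, which performs a similar check.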

# More Advanced Features

If we recall from the initial exploration of the <code><b>sales_train</b></code> dataframe, there appeared to be strong temporal trends within the data. That is to say, more general information about the sales of an item category, a shop, or a given month may be very useful in making predictions about future sales. Being mindful of this, we can engineer features that capture some of these overall trends in each example, which will help any model to make more accurate predictions.

As mentioned before, we should be very careful when using details from other examples in our data to engineer new features. The task is to make predictions of the sales in the month following the last in the training data, so we must be mindful that we should not introduce information in our training data that isn't available to us in the test data. We should also be very careful not to accidentally introduce information about an example's label into any of these features. This is perhaps best explained with an example:

Imagine we thought it useful for our model to know the average sales for each month, and added the average sales for every month to every example. When training, the model could then make a prediction on an example from month 25 using the sales average from the preceding months, but also month 25 itself (which includes details of this example's label) and the months in the future. When making predictions for the test data, however, it would not have this information; only sales averages for months past - if it had learned when training to make predictions by looking into the future, it's unlikely to perform well without this information.

Bearing this in mind, we shall tread carefully.
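
The leak described above can be demonstrated on a toy frame. This is a sketch with hypothetical data and a hypothetical helper, `month_means`: it changes the label of a single example and shows that the mean for the same month moves with it, while the mean for the previous month does not:

```python
import pandas as pd


def month_means(df):
    """Mean item_cnt_month per date_block_num."""
    return df.groupby('date_block_num')['item_cnt_month'].mean()


# hypothetical monthly sales: two examples in each of two months
toy = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1],
    'item_cnt_month': [4, 0, 2, 2],
})

# change the label of one month-1 example and recompute the means
perturbed = toy.copy()
perturbed.loc[3, 'item_cnt_month'] = 10

before, after = month_means(toy), month_means(perturbed)

# the same-month mean moves with the label, so it leaks the answer
print(before.loc[1], after.loc[1])  # 2.0 6.0
# the previous month's mean is unaffected, so it is safe to use as a feature
print(before.loc[0], after.loc[0])  # 2.0 2.0
```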

# Item first / last sold

A good place to start is recalling the trends we saw in the sales of individual items. It appeared clear that items sold best in the month they were introduced or the month after (or, to be pedantic, the month in which we first saw recorded sales). As such, we can create a feature that signposts this to our model:

In [None]:
# group sales_train by item_id and take minimum date_block_num as feature
date_first_sold = (
    sales_train[date_item].
    groupby('item_id', as_index=False).min()
    .rename(columns={'date_block_num': 'date_block_first_sold'})
)
# merge date_block_first_sold with train_test_data
train_test_data = train_test_data.merge(
    date_first_sold, 
    on='item_id', 
    how='left'
)
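# downcast; items with no recorded sales (new items in the test month) are
# treated as first sold in the current month rather than in month 0
train_test_data['date_block_first_sold'] = (
    train_test_data.date_block_first_sold
    .fillna(train_test_data.date_block_num).astype('int8')
)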
# create months_since_item_first_sale feature by subtracting
# date_block_first_sold from the current date_block_num
train_test_data['months_since_item_first_sale'] = (
    train_test_data.date_block_num 
    - train_test_data.date_block_first_sold
)
# drop redundant feature
train_test_data.drop('date_block_first_sold', axis=1, inplace=True)

We also know that as items get older sales slow down. As sales slow, certain of the later months in a product's lifetime feature no sales at all. It may be useful for our model to know this so it can pick up on when items may have gone 'out of fashion':

In [None]:
# create dict for data
months_sold = {'date_block_num':[],
               'item_id':[],
               'months_since_item_last_sale':[]}
# create list of item_ids in training data
train_items = (
    sales_train[sales_train.date_block_num < 34]
    .item_id.unique()
)
# create dataframe of transactions by date_block_num and item_id
last_sold_data = (
    sales_train[date_item + ['item_cnt_day']]
    .groupby(date_item, as_index=False).min()
)
for item in tqdm(train_items):
    # find data relating to item
    item_data = last_sold_data[last_sold_data.item_id == item]
    # find first month item appears in data
    item_first_sold = int(
        date_first_sold[date_first_sold.item_id == item]
        .date_block_first_sold.iloc[0]
    )
    # loop through every month from the item's first sale onwards
    for month in range(item_first_sold, 34):
        # add date_block_num and item_id details to data
        months_sold['date_block_num'].append(month + 1)
        months_sold['item_id'].append(item)
        # find entry in item_data for month
        item_sales_month = item_data[item_data.date_block_num == month]
        # if entry exists add zero
        if len(item_sales_month):
            months_sold['months_since_item_last_sale'].append(0)
        # else accumulate months item hasn't sold
        else:
            months_sold['months_since_item_last_sale'].append(
                months_sold['months_since_item_last_sale'][-1] + 1
            )
                
months_sold = pd.DataFrame(months_sold)
train_test_data = train_test_data.merge(
    months_sold, 
    on=['item_id','date_block_num'], 
    how='left'
)
train_test_data['months_since_item_last_sale'] = (
    train_test_data.months_since_item_last_sale
    .fillna(0).astype('int8')
)

# Lag Features

Next, we can feed our model details of past trends in the form of a 'lagged' feature. This can be a clever way of providing our model with really helpful information - effectively we're using the labels from information earlier in the dataset to help our model make predictions on data later on. For example, we could provide the model with information about the mean sales in each shop for the same time the previous year - this would be a <b>12 month 'lagged' sales by shop</b> feature.

To create a lagged feature you can follow a pretty simple recipe:
<ol><li>Choose a feature, or group of features that you want to group your new feature by; this will always include <code><b>date_block_num</b></code> as we are lagging the features by a specified number of months - in our example we would add <code><b>shop_id</b></code></li>
<li>Choose your target feature - in our example this would be <code><b>item_cnt_month</b></code></li> 
<li>Decide how you want to aggregate the grouped data - so our example would be a <b>mean</b> of the <code><b>item_cnt_month</b></code> for each shop, but this could just as easily be a <b>median</b> or <b>sum</b></li>
<li>Add the number of months you are lagging to the <code><b>date_block_num</b></code> feature - so our example would be <b>12</b></li>
<li>Merge this data with your training dataset</li></ol>
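
The recipe above can be sketched on a toy frame before wrapping it in a function. The `toy` data and the feature name `lag_12m_mean_sales_by_shop` are hypothetical and chosen only to mirror the 12 month example:

```python
import pandas as pd

# hypothetical monthly data: two shops observed in month 0 and again in month 12
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 12, 12, 12],
    'shop_id':        [1, 1, 2, 1, 2, 2],
    'item_cnt_month': [2, 4, 1, 5, 0, 3],
})

group_features = ['date_block_num', 'shop_id']   # step 1
target_feature = 'item_cnt_month'                # step 2
lag = 12                                         # number of months to lag by

# step 3: aggregate the target within each group
lagged = toy.groupby(group_features, as_index=False)[target_feature].mean()
# step 4: add the lag to date_block_num so month 0 lines up with month 12
lagged['date_block_num'] += lag
lagged = lagged.rename(columns={target_feature: 'lag_12m_mean_sales_by_shop'})
# step 5: merge back onto the original rows
toy = toy.merge(lagged, on=group_features, how='left')

print(toy)
```

Rows in month 12 pick up the shop means from month 0, while rows in month 0 have nothing to look back on and are left as NaN.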

easy.....

So easy, in fact, that we can define a function to do this for us:

In [None]:
def add_lag_features(df, lag_months, group_features, feature_name, target_feature, agg_method, cumulative=True, dtype='float32'):
    """
    Adds lagged features to monthly sales dataset
    
    Parameters:
    df (pandas DataFrame): DataFrame containing monthly transaction data to which lag data is added
    lag_months (list): list containing the lag(s) in months required
    group_features (list): list containing the feature names the lag feature should be grouped on
    feature_name (string): Name of new feature column when added to df
    target_feature (string): Name of the feature in df to be aggregated in lag data
    agg_method (Pandas.DataFrame.agg function): Name of aggregation method used by Pandas.groupby().agg()
    cumulative (bool): If True feature is calculated cumulatively over specified months
    dtype (data type): Option to specify datatype to downcast new features to
    
    Returns:
    DataFrame: df with lagged feature data added
    """
    # collect required data in grouped dataframe
    # (use the df argument rather than the global train_test_data)
    feature_data = (
        df[group_features + [target_feature]].
        groupby(group_features, as_index=False).agg(agg_method)
    )
    # create list to store names of new features added
    features_created = []
    for lag_month in tqdm(range(1, max(lag_months) + 1)):
        # skip this month if we aren't accumulating the data for every month
        # up to the max lag_month
        if not cumulative and lag_month not in lag_months:
            continue
        # create a copy of feature data that can be manipulated
        temp_data = feature_data[group_features + [target_feature]].copy()
        temp_data['date_block_num'] += lag_month
        lag_feature_name = 'lag_' + str(lag_month) + 'm_' + feature_name
        # add the feature to features_created if it is one of the months requested
        if lag_month in lag_months:
            features_created.append(lag_feature_name)
        temp_data.rename(
            columns={target_feature: lag_feature_name}, 
            inplace=True
        )
        # merge the lag feature with the overall feature data
        feature_data = feature_data.merge(
            temp_data, 
            on=group_features, 
            how='left'
        ).fillna(0)
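        # downcast the new feature (assign by name so the dtype actually changes)
        feature_data[lag_feature_name] = (
            feature_data[lag_feature_name].astype(dtype)
        )
        if lag_month > 1 and cumulative:
            # sum the new feature with the last one created to calculate a cumulative figure
            prev_feature_name = 'lag_' + str(lag_month - 1) + 'm_' + feature_name
            feature_data[lag_feature_name] = (
                feature_data[[prev_feature_name, lag_feature_name]]
                .sum(axis=1).astype(dtype)
            )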

    # if agg_method is mean, the columns will currently contain a sum of the
    # mean for each month
    if cumulative and agg_method == 'mean':
        for lag_month in lag_months:
            # dividing by lag_month will result in the overall mean
            feature_data.iloc[:,len(group_features) + lag_month] /= lag_month

    features_required = features_created + group_features
    return df.merge(feature_data[features_required], 
                    on=group_features, 
                    how='left')

And then we can try this out by creating the <b>12 month 'lagged' sales by shop</b> feature we discussed above. Given the above function allows you to specify several "month lags", let's get a little more for our money and ask for the sales by shop for the previous <b>1, 6 and 12 months</b>:

In [None]:
train_test_data = add_lag_features(
    df=train_test_data,
    lag_months=[1,6,12],
    group_features=['date_block_num', 'shop_id'],
    feature_name='sales_by_shop',
    target_feature='item_cnt_month',
    agg_method='sum',
    cumulative=False,
    dtype='int32'
)

From here we have a whole host of different potential features at our fingertips. For example, we can create a feature that tries to capture the trends we saw earlier in overall sales for each month:

In [None]:
train_test_data = add_lag_features(
    df=train_test_data,
    lag_months=[1,6,12],
    group_features=['date_block_num'],
    feature_name='mean_sales',
    target_feature='item_cnt_month',
    agg_method='mean',
    cumulative=False,
    dtype='float32'
)

And a feature that highlights how well an item has sold in a particular shop in the past will certainly be useful. We can make this feature cumulative, to tell our model the total sales over the last 1, 6 and 12 months:

In [None]:
train_test_data = add_lag_features(
    df=train_test_data,
    lag_months=[1,6,12],
    group_features=['date_block_num', 'shop_id', 'item_id'],
    feature_name='cum_sales_by_item_shop',
    target_feature='item_cnt_month',
    agg_method='sum',
    cumulative=True,
    dtype='int16'
)

In [None]:
train_test_data.info()

There are a whole host of different potential features that can be added by taking different groupings of <code><b>shop_id</b></code> / <code><b>item_category_id</b></code> / <code><b>subcat_a_id</b></code> / <code><b>city_id</b></code> etc. The only limit is the number of combinations of the different features available (which is an impractically large number!). Forking the notebook and having a go at making some features is positively encouraged. 

# Finishing Touches

Now that we've added some good features, we'll put the finishing touches to our training data / test data and output both of them.

First, we've added a whole bunch of lag features that look back over the previous 12 months of data. This means the first 12 months of data isn't all that useful to us anymore, as it doesn't include the full complement of these lag features. We have plenty of data besides so we can comfortably drop these examples:

In [None]:
drop_lag_months_mask = train_test_data.date_block_num > 11
train_test_data = train_test_data[drop_lag_months_mask]

Next, we can separate the training data. Since the competition requires us to make predictions in the range 0-20, we can clip the target feature <code><b>item_cnt_month</b></code> so that our model learns to make predictions in this range. We should also drop the <code><b>revenue</b></code> feature, as our model will have a pretty easy time of making predictions if we keep it! (have a think about how it's prepared if you're unsure why):

In [None]:
train_data_mask = train_test_data.date_block_num < 34
train_data = (
    train_test_data[train_data_mask]
    .drop('revenue', axis=1).reset_index(drop=True)
)
train_data['item_cnt_month'] = train_data.item_cnt_month.clip(0,20)
train_data = (
    train_data.reindex(np.random.permutation(train_data.index))
    .reset_index(drop=True)
)

Then we separate the test data. This is easily done by selecting only month 34 (i.e. the rows not selected by <code><b>train_data_mask</b></code>). Don't forget to drop both <code><b>revenue</b></code> and <code><b>item_cnt_month</b></code> as we don't have any labels for our test data - at any rate, they're all zeros anyway! We'll merge in the ID feature from the <code><b>test</b></code> dataframe as this will be required when we submit our predictions:

In [None]:
drop_features = ['item_cnt_month', 'revenue']
test_data = train_test_data[~train_data_mask].drop(drop_features, axis=1)
test_data = test_data.merge(test, on=['shop_id','item_id'], how='left')
test_data = test_data.set_index('ID')

and all that remains is to pickle our <code><b>train_data</b></code> and <code><b>test_data</b></code> datasets and pat ourselves on the back for a job well done!

In [None]:
train_data.to_pickle('train_data.pkl')
test_data.to_pickle('test_data.pkl')

# Final Thoughts / TLDR

That about does it for building a functioning training dataset. We won't be fitting a model here, as this has already been a pretty epic journey! At any rate, it's recommended to prep your data and fit your model separately for this competition. For one thing, Jupyter notebooks (on which Kaggle notebooks are based) and pandas both have some quirks when it comes to memory: as you prep data, a large amount of RAM gets eaten up that is no longer required but isn't easy to free. It's much easier and quicker to pickle your train and test datasets and then load a fresh notebook to build and test models, so you don't run into memory issues. This has the added benefit of keeping your data prep and model stages separate, so they can be versioned much more clearly when you commit, and you can organise several separate models should you wish to ensemble at some point. 

This competition really is about the data. Better data cleaning and feature engineering can improve scores dramatically; by comparison, model selection and parameter tuning are far less important. The training data prepared in this notebook, with a few decent lagged features added (that aren't provided above - that would be too easy!), will provide competitors with data that can easily score in the very low <b>0.9 RMSE</b> area when fed to any of your bog-standard XGBoost / LightGBM / RandomForestRegressor models, more or less out of the box.

To summarise, we have:

<ul>
    <li>Cleaned the data - eliminating outliers and repeat data items etc.</li>
    <li>Looked more closely at the categorical features to engineer some useful sub-categories</li>
    <li>Grouped the sales data so it represents monthly transactions</li>
    <li>Added zero transactions for items and shops that feature in a month, but represent no transactions in combination</li>
    <li>Managed memory footprint very carefully and downcast features where possible</li>
    <li>Created a feature that tells us how long it is since items first / last sold</li>
    <li>Created lag features to point your model to past sales performance in different categories</li>
</ul>


All that remains is a warning from the author. As mentioned above, the temporal element of this particular data presents a number of challenges, none more so than model evaluation. The usual validation methods, your random CV splits and your K-folds, are not welcome here. Using a random sample of the training data from the whole timeline to evaluate a model will really encourage it to dig deep and remember the performance of each item in each month to make the best predictions on the validation data. When it comes to the test data, however, it won't ever have seen data from month 34, and so all this memorising will have been for nought. Put more simply and scientifically, a model will massively overfit if we validate this way. 

Instead, we should set up our train / validation split so that a group of the earlier months is used for training and the month immediately following is used for validation. This will ensure that our model is optimised to make predictions on data it hasn't seen before, and that it can't cheat by using data from the same month it's being tested on. Doing this, we should also be mindful that our train / validation / test distributions will differ and so we may well end up with quite substantial differences in RMSE scoring. As such, validating the model several times using different final months is encouraged, i.e. 'walking forward' using month 30, then 31, then 32, and so on as validation data, and the prior 12 months to train.
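
A minimal sketch of this 'walking forward' split is given below. The `walk_forward_splits` generator and the `toy` frame are hypothetical; the sketch assumes only that the data has a `date_block_num` column, and leaves model fitting and scoring to the reader:

```python
import pandas as pd


def walk_forward_splits(df, val_months, train_window=12):
    """Yield (train, validation) frames, one pair per validation month."""
    for val_month in val_months:
        # train on the train_window months immediately before val_month
        train_mask = (
            (df.date_block_num >= val_month - train_window)
            & (df.date_block_num < val_month)
        )
        val_mask = df.date_block_num == val_month
        yield df[train_mask], df[val_mask]


# hypothetical stand-in for the prepared training data: one row per month
toy = pd.DataFrame({'date_block_num': range(12, 34)})
toy['item_cnt_month'] = 0

splits = list(walk_forward_splits(toy, val_months=[30, 31, 32, 33]))
for train, val in splits:
    print(val.date_block_num.iloc[0],
          train.date_block_num.min(), train.date_block_num.max())
```

Each validation month is scored by a model that has only seen the 12 months before it, and averaging the resulting RMSEs gives a steadier estimate than any single split.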

With that, I bid you farewell. Sincere thanks to those that made it to the end - I would really welcome any feedback or comments. Anybody interested in the next stage, model fitting, may wish to take a look at the following:

<a href='https://www.kaggle.com/dustyturner/dense-nn-with-categorical-embeddings'>Dense NN With Categorical Embeddings</a>

# Appendix - adding zero transactions

The most significant breakthroughs I experienced in this competition have come from different approaches when adding entries for items that haven't sold. 

We should start by asking <b>if</b> it's even necessary to add zero transactions. We can quickly answer this question by making a submission in which we predict <b>1.0</b> for every entry, and another where we predict <b>0.5</b>. You will find the score for the <b>0.5</b> predictions is considerably better (i.e. a lower RMSE), leading us to believe that either some of the targets should be zero, or that the competition organisers have lied to us and some of the targets are fractions. Let's assume the former! Therefore, we can assume we'll need to feed our model some zeros, or it's not going to do a good job of making predictions itself. 
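
The reasoning behind this test can be made a little more precise. As a sketch, write $y_i$ for the $n$ scored targets and $\bar{y}$ for their mean (notation introduced here, not taken from the competition). For a constant prediction $c$, the mean squared error splits into a part that doesn't depend on $c$ and a part that does:

$$
\frac{1}{n}\sum_{i=1}^{n}(y_i - c)^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 + (\bar{y} - c)^2
$$

So predicting 0.5 beats predicting 1.0 exactly when

$$
(\bar{y} - 0.5)^2 < (\bar{y} - 1)^2 \iff \bar{y} < 0.75
$$

Assuming the targets are non-negative whole numbers, a mean below 0.75 is only possible if more than a quarter of them are zero.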

With the <b>if</b> answered we can move onto the <b>how</b>. We are making some big assumptions when we choose any approach that adds examples to our training data. When we add zero transactions we are effectively telling our model <b>"this item was on the shelves in this shop, and nobody picked it up and bought it"</b>.  With this we can make a huge difference to the effectiveness of our model, positive or negative:

Initially, I began by adding a zero entry for every month / shop / item in the training data that didn't already feature. That's a lot of entries,  ~ 41.5M to be exact! This was just unmanageably large even without engineering any further features. To overcome this, I removed any of the items that don't feature in the test set and went about engineering new features from there. The model predictions from this data were always overly cautious - very few predictions rose above the 0.4-0.6 range. The resulting RMSEs weren't fantastic (around the 1.0 mark). 

It is clear from the above results that far too many zero-sales transactions had been added. Indeed, after some more exploration of the data I found that certain shops and the majority of items don't make their initial appearance in <code><b>date_block_num</b></code> zero and often appear for the first time much later. By removing any zero transactions for shops or items that have yet to feature in the <code><b>sales_train</b></code> in a given month, the results improved dramatically - dropping to an RMSE of approximately 0.92 in the best case. Whilst this was a much better result, it wasn't particularly close to the top end of the leaderboard which was pretty frustrating.

I stagnated at the previous 'stage' of my results for a long time. The final breakthrough came in two stages. Firstly I noted that the majority of my predictions were still very tentative - that is to say, closer to 0-1 than above, and that my best results were achieved when my models tended to make more 'daring' predictions above 1. I then went back to the data, looking at the pattern of sales for different items. Several items exhibit gaps in their sales patterns: some miss the odd month here or there, others don't feature for several months at a time and then reappear. Several of the items drop off entirely and then make a sudden and significant resurgence (if you'd like to see this for yourself, have a play around with the graph of item sales in the exploration section above, as it is a similar graph that helped me make this discovery). 

It became clear to me that my assumption <b>"once an item featured in the data it was on sale from then on in every shop"</b> may have been a problem. Items may have been removed from shelves, meaning they had no opportunity to sell, before being added again. Alternatively, there may have been gaps in the training data and I was superimposing zero sales of items that did in fact sell. At any rate, removing any zero examples for which either the shop or the item was totally absent from a given month of <code><b>sales_train</b></code> eliminated a huge number of zero transactions. This allowed me to re-introduce the dropped examples, those that didn't feature in the test data, back into the training data. The result was an immediate improvement in my score, and this approach has continued to produce my best scores (just over 0.90 at the time of writing).