[NB 03](https://github.com/radekosmulski/personalized_fashion_recs/blob/messing_around/03_Basic_lgbm_with_idxs_restart.ipynb) didn't quite work.

The two likely reasons for that are:
 - issues with how I am generating train data
 - false assumption that you can toss whatever at a ranking model and it will do the rest

In this notebook, I want to lay the groundwork needed for growing a good solution. First of all, that will require a robust and fast local validation scheme. We know that using the last week of train data for validation works and tracks the leaderboard nicely.
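
A minimal sketch of what holding out the last week could look like, assuming a transactions frame with a `t_dat` datetime column. The helper name `split_last_week` and the toy rows are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def split_last_week(df, date_col="t_dat"):
    """Hold out the final 7 days of `df` for validation (hypothetical helper)."""
    cutoff = df[date_col].max() - pd.Timedelta(days=7)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

# Made-up transactions, just enough to exercise the split.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-01", "2020-09-10", "2020-09-16", "2020-09-22"]),
    "customer_id": [1, 2, 1, 3],
    "article_id": [10, 11, 12, 13],
})

train, valid = split_last_week(toy)
print(len(train), len(valid))
```

Everything the model sees during training then comes strictly before the validation window, which mirrors how the leaderboard week follows the train data.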

Secondly, we want to start from a kernel of a solution that we can extend. This [notebook](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2) on Kaggle seems to me like a great starting point.

The goal is to develop the functionality needed for a nice setup of a solution that we will reuse in NB 05. Along the way I hope to learn a bit more about the data and about some of the trends that I might want to model through the features I will engineer.

The plan is:
* implement a quick training pipeline leading to good validation
* train a ranking model on candidates we know to be good
* only generate new training data / candidates while validating whether we are moving in the right direction using local CV
* start with building sensible features and see whether they move the needle on the score

The truth is I do not know what will work. These RecSys models are a completely new breed of models to me. But I can set the problem up in a way that helps me learn. And that is what I am going to do :).

Once I get this working I will breathe a sigh of relief and will jump into reading papers and drawing inspiration from there.

Let's get started.

NOTE: You are welcome to check out the earlier code that I wrote, which can be found on [this branch](https://github.com/radekosmulski/personalized_fashion_recs/tree/messing_around). I learned a lot about RecSys models and this particular problem through it. But it has quite a few bugs and a couple of issues I now know of with regard to the approach. **You should not need any of the earlier code to run the notebooks in the main branch of this repo.**

In [2]:
!wget https://raw.githubusercontent.com/benhamner/Metrics/master/Python/ml_metrics/average_precision.py

In [1]:
# helper functions
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from average_precision import apk

# https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
def hex_id_to_int(hex_str):
    # keep only the last 16 hex digits (64 bits) of the id
    return int(hex_str[-16:], 16)

def customer_hex_id_to_int(series):
    return series.str[-16:].apply(hex_id_to_int)

def article_id_str_to_int(series):
    return series.astype('int32')

def article_id_int_to_str(series):
    # restore the leading zero dropped by the int conversion
    return '0' + series.astype('str')

class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []

    def fit(self, X, y=None):
        # reset so that refitting does not append to categories from an earlier fit
        self.categories = []
        for i in range(X.shape[1]):
            vc = X.iloc[:, i].value_counts()
            # keep values seen more than min_examples times, most frequent first
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self

    def transform(self, X):
        # values outside the kept categories (including NaN) get the code -1
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data)
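
To make the id helpers above concrete, here is a small sketch on made-up ids that have the same shape as the competition data (64 hex characters for customers, a 10-character zero-padded string for articles). The logic is inlined so the snippet runs on its own; the ids themselves are invented.

```python
import pandas as pd

# Two made-up customer ids: 48 filler characters followed by 16 hex digits.
customer_ids = pd.Series(["00" * 24 + "00000000000000ff",
                          "ab" * 24 + "0000000000000100"])
# Same steps as customer_hex_id_to_int: keep the last 16 hex digits, parse base 16.
customer_int = customer_ids.str[-16:].apply(lambda s: int(s, 16))
print(customer_int.tolist())  # [255, 256]

# Made-up article ids with a leading zero, as stored in the CSV files.
article_ids = pd.Series(["0108775015", "0706016001"])
article_int = article_ids.astype("int32")       # leading zero is dropped here
article_back = "0" + article_int.astype("str")  # and restored here
print(article_back.tolist() == article_ids.tolist())  # True
```

Only the last 16 hex digits of a customer id survive the conversion, so in principle two different ids could collide. That is why the notebook compares `nunique()` before and after converting.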

We want this to be fast. I can get as much RAM as I will ever need through VMs on GCP, but that is not the point. I want to see how far I can push my local hardware, but this goes even beyond that.

I need the speed to make good use of my time as I continue to build my understanding of what RecSys models are about. And the path to that leads through making the data I work with smaller.

In [3]:
%%time
import pandas as pd

transactions = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv', dtype={"article_id": "str"})
customers = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/customers.csv')
articles = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv', dtype={"article_id": "str"})

In [4]:
transactions.memory_usage(deep=True)

In [5]:
transactions.info(memory_usage='deep')

In [6]:
%%time
transactions['customer_id'].nunique()

In [7]:
%%time
transactions['customer_id'] = customer_hex_id_to_int(transactions['customer_id'])
transactions['customer_id'].nunique()

In [8]:
transactions.memory_usage(deep=True)

In [9]:
transactions.info(memory_usage='deep')

Nice!
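
A rough way to see where the saving comes from, sketched on made-up ids rather than the real file: an integer column costs 8 bytes per row, while every 64-character string carries its characters plus per-value overhead.

```python
import pandas as pd

n = 10_000
# Made-up 64-character hex ids (zero-padded), standing in for customer_id.
hex_ids = pd.Series([format(i, "064x") for i in range(n)])
int_ids = hex_ids.str[-16:].apply(lambda s: int(s, 16))

bytes_str = hex_ids.memory_usage(index=False, deep=True)
bytes_int = int_ids.memory_usage(index=False, deep=True)
print(f"strings: {bytes_str / n:.0f} bytes/row, ints: {bytes_int / n:.0f} bytes/row")
```

The exact figure for the string column depends on the pandas version and string backend, so treat the printed ratio as indicative only.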

Initially, I wanted to get rid of the `t_dat` column, but on second thought I am not a fan of the idea.

I am all for speed and reducing weight, but the main purpose of this activity is to increase developer productivity.

If I fall back to ints representing year, week and day, I will certainly be trading developer productivity for fewer CPU cycles, and I want to go in the exact opposite direction! Developer productivity > (nearly) anything else.

In [10]:
%%time

transactions.t_dat = pd.to_datetime(transactions.t_dat, format='%Y-%m-%d')

In [11]:
transactions['week'] = 104 - (transactions.t_dat.max() - transactions.t_dat).dt.days // 7
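
To see what this numbering does, here is the same expression applied to a handful of made-up dates. The most recent seven days get week 104, the seven days before that get 103, and so on backwards.

```python
import pandas as pd

# Made-up dates; the latest one defines the end of week 104.
t_dat = pd.to_datetime(pd.Series([
    "2020-09-22",  # 0 days before the max
    "2020-09-16",  # 6 days before
    "2020-09-15",  # 7 days before
    "2020-09-09",  # 13 days before
    "2018-09-20",  # 733 days before
]))

week = 104 - (t_dat.max() - t_dat).dt.days // 7
print(week.tolist())  # [104, 104, 103, 103, 0]
```

Because `//` binds tighter than `-`, the expression is `104 - (days // 7)`: whole weeks counted back from the last date, subtracted from 104.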

In [12]:
transactions.info(memory_usage='deep')

Let's do something about the `article_id` (both here and on `articles`) and let's take a closer look at `price`, `sales_channel_id` and `week`.

In [13]:
transactions.article_id = article_id_str_to_int(transactions.article_id)
articles.article_id = article_id_str_to_int(articles.article_id)

transactions.week = transactions.week.astype('int8')
transactions.sales_channel_id = transactions.sales_channel_id.astype('int8')
transactions.price = transactions.price.astype('float32')
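
Before downcasting integer columns like this it can be worth checking that the values actually fit, since an out-of-range value may wrap around silently instead of raising. A minimal sketch, where `fits` is a hypothetical helper and the numbers are made up:

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """True if every value of an integer `series` fits in integer `dtype` (hypothetical helper)."""
    info = np.iinfo(np.dtype(dtype))
    return bool(info.min <= series.min() and series.max() <= info.max)

# Made-up values in the same ballpark as the real columns.
toy = pd.DataFrame({
    "week": [0, 52, 104],
    "article_id": [108775015, 706016001, 959461001],
})

print(fits(toy["week"], "int8"))         # True
print(fits(toy["article_id"], "int32"))  # True
print(fits(toy["article_id"], "int16"))  # False
```

The `float32` cast on `price` is a different trade-off: nothing overflows, but some precision is given up.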

In [14]:
transactions.info(memory_usage='deep')

In [15]:
transactions.drop(columns='t_dat').info(memory_usage='deep')

Well, this is interesting. Despite being a scary `datetime64`, `t_dat` costs a fixed 8 bytes per row (it is stored as a 64-bit integer under the hood), which is a small fraction of what the original string column took up!

Keeping it for convenience is definitely the way to go.
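
A quick check of the cost of a date column, sketched on made-up data: `datetime64` values are stored as 8-byte integers, so the column costs 8 bytes per row regardless of how many distinct dates it holds, whereas the same dates kept as strings cost noticeably more.

```python
import pandas as pd

n = 10_000
# 100 distinct made-up dates, repeated to fill n rows, first as strings.
dates = pd.date_range("2020-01-01", periods=100).strftime("%Y-%m-%d").tolist()
as_str = pd.Series(dates * (n // 100))
as_dt = pd.to_datetime(as_str, format="%Y-%m-%d")

bytes_str = as_str.memory_usage(index=False, deep=True)
bytes_dt = as_dt.memory_usage(index=False, deep=True)
print(f"strings: {bytes_str / n:.0f} bytes/row, datetime64: {bytes_dt / n:.0f} bytes/row")
```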

Let's take a brief look at the `customers` and `articles` dfs.

In [16]:
customers.info(memory_usage='deep')

In [17]:
articles.info(memory_usage='deep')

Well, this stuff will be getting merged with our transactions df at some point, so I guess we can also make this smaller and easier to work with down the road.

In [18]:
customers['club_member_status'].unique()

In [19]:
customers.customer_id = customer_hex_id_to_int(customers.customer_id)
for col in ['FN', 'Active', 'age']:
    # assign back instead of calling fillna(inplace=True) on a column selection
    customers[col] = customers[col].fillna(-1).astype('int8')

In [20]:
customers.club_member_status = Categorize().fit_transform(customers[['club_member_status']]).club_member_status
customers.postal_code = Categorize().fit_transform(customers[['postal_code']]).postal_code
customers.fashion_news_frequency = Categorize().fit_transform(customers[['fashion_news_frequency']]).fashion_news_frequency
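
What `Categorize` produces is easiest to see on a tiny example. The sketch below inlines the same two steps (frequency-ordered categories from `value_counts`, then `pd.Categorical(...).codes`) in a hypothetical `encode` function so it runs without scikit-learn; the status values are made up.

```python
import pandas as pd

def encode(series, min_examples=0):
    """Frequency-ordered integer codes, mirroring Categorize on one column (illustrative)."""
    vc = series.value_counts()  # most frequent first, NaN excluded
    categories = vc[vc > min_examples].index.tolist()
    return pd.Categorical(series, categories=categories).codes

status = pd.Series(["ACTIVE", "ACTIVE", "ACTIVE",
                    "PRE-CREATE", "PRE-CREATE",
                    "LEFT CLUB", None])

print(encode(status).tolist())                  # [0, 0, 0, 1, 1, 2, -1]
print(encode(status, min_examples=1).tolist())  # [0, 0, 0, 1, 1, -1, -1]
```

Missing values and values below the `min_examples` threshold both end up as -1, so after encoding the two cases can no longer be told apart.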

In [21]:
customers.info(memory_usage='deep')

In [22]:
for col in articles.columns:
    if articles[col].dtype == 'object':
        articles[col] = Categorize().fit_transform(articles[[col]])[col]

In [23]:
articles.info(memory_usage='deep')

In [24]:
for col in articles.columns:
    if articles[col].dtype == 'int64':
        articles[col] = articles[col].astype('int32')

And this concludes our raw data preparation step! Let's now write everything back to disk.

In [25]:
transactions.sort_values(['t_dat', 'customer_id'], inplace=True)

In [26]:
%%time

transactions.to_parquet('transactions_train.parquet')
customers.to_parquet('customers.parquet')
articles.to_parquet('articles.parquet')