# Overview

This notebook builds on the notebook created by LichtLab.
Here, I attempt to speed up the calculation process.

See also Byfone's and Chris's notebooks for the underlying algorithm.

Thanks to 
- LichtLab : https://www.kaggle.com/lichtlab/0-0226-byfone-chris-combination-approach
- Byfone: https://www.kaggle.com/byfone/h-m-trending-products-weekly
- Chris : https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021  


# About Improvements

- Speed up the date conversion step for the transaction data set by working on whole columns instead of using `apply`.
- Candidate generation became about 5 times faster by replacing the per-customer `pandas.Series` sorting with Python's built-in `sorted`.
- To improve processing speed, only the necessary columns are selected from the data frame and converted to lists.  
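
As a rough illustration of the second point, the sketch below ranks one customer's scores both ways. The article IDs and scores are made up for the example, and `N` is set to 2 only to keep the output short.

```python
import pandas as pd

# Hypothetical scores for a single customer (article_id -> score).
scores = {'0111': 0.5, '0222': 3.0, '0333': 0.0, '0444': 1.5}
N = 2

# pandas-based selection: build a Series, drop non-positive scores, take the top N.
series = pd.Series(scores)
top_pandas = series[series > 0].nlargest(N).index.tolist()

# Native selection: sort the dict items by score, keep positive scores, cut at N.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
top_native = [art_id for art_id, score in ranked if score > 0][:N]

print(top_pandas, top_native)
```

Both approaches select the same articles here. The native version avoids constructing a `Series` for every customer, which is presumably where the time is saved.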


In [None]:
import gc
import numpy as np
import os
import pandas as pd
from math import sqrt
from pathlib import Path
from tqdm import tqdm

tqdm.pandas()

# size of prediction candidates
N = 12

df_trans = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',
                       dtype={'article_id': str})
df_trans['t_dat'] = pd.to_datetime(df_trans['t_dat'], format='%Y-%m-%d')

# Calculate distance from latest date

Calculate each transaction's distance from the latest date, rounded down to a whole number of weeks.  
Doing this row by row with `apply` takes a long time, so we speed it up by operating on whole columns.

For rounding `datetime` values, see the following entry.  
https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.floor.html
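
To make the rounding direction explicit, here is a small sketch on made-up dates. It expresses the 7-day bucketing with integer arithmetic rather than `floor`, so it is an equivalent formulation for illustration, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical transaction dates; the latest one plays the role of last_ts.
dates = pd.Series(pd.to_datetime(
    ['2020-09-22', '2020-09-16', '2020-09-15', '2020-09-08']))
last_ts = dates.max()

# Distance from the latest date in days, rounded down to a multiple of 7.
days_back = (last_ts - dates).dt.days
offset_days = days_back // 7 * 7

# Last day of the 7-day bucket that each transaction falls into.
ldbw = last_ts - pd.to_timedelta(offset_days, unit='D')
print(ldbw.dt.strftime('%Y-%m-%d').tolist())
```

Transactions 0 to 6 days before the latest date share the latest date as their bucket label, transactions 7 to 13 days before it share the date one week earlier, and so on.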


In [None]:
# Step1
df = df_trans[['t_dat', 'customer_id', 'article_id']].copy()
last_ts = df['t_dat'].max()

# df['ldbw'] = df['t_dat'].progress_apply(lambda d: last_ts - (last_ts - d).floor('7D'))
df['offset_dat'] = (last_ts - df['t_dat']).dt.floor('7D')
df['ldbw'] = last_ts - df['offset_dat']

In [None]:
weekly_sales = df.drop(['customer_id', 'offset_dat'], axis=1).groupby(['ldbw', 'article_id']).count()
weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})
df = df.join(weekly_sales, on=['ldbw', 'article_id'])
weekly_sales = weekly_sales.reset_index().set_index('article_id')
last_day = last_ts.strftime('%Y-%m-%d')

In [None]:
df = df.join(
    weekly_sales.loc[weekly_sales['ldbw']==last_day, ['count']],
    on='article_id', rsuffix="_targ")

# Assign the result back; in-place fillna on a selected column may not update df
df['count_targ'] = df['count_targ'].fillna(0)
del weekly_sales, df_trans
gc.collect()
df['quotient'] = df['count_targ'] / df['count']
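
The `quotient` column compares an article's sales in the latest week with its sales in the week of each transaction. The following sketch reproduces the idea on a tiny made-up table; it uses `transform` and `map` instead of the joins above, so treat it as an alternative formulation under that assumption.

```python
import pandas as pd

# Hypothetical data: article A sold twice in the latest week and four times
# the week before; article B sold once, only in the earlier week.
demo = pd.DataFrame({
    'ldbw': pd.to_datetime(['2020-09-22'] * 2 + ['2020-09-15'] * 5),
    'article_id': ['A', 'A', 'A', 'A', 'A', 'A', 'B'],
    'customer_id': ['c1', 'c2', 'c1', 'c2', 'c3', 'c4', 'c5'],
})

# Sales of each article within the week of each transaction.
demo['count'] = demo.groupby(['ldbw', 'article_id'])['customer_id'].transform('count')

# Sales of each article in the latest week (0 if it was not sold then).
latest = demo.loc[demo['ldbw'] == demo['ldbw'].max()].groupby('article_id').size()
demo['count_targ'] = demo['article_id'].map(latest).fillna(0)

demo['quotient'] = demo['count_targ'] / demo['count']
print(demo[['ldbw', 'article_id', 'quotient']])
```

A quotient below 1 means the article sold less in the latest week than in the week of that purchase, and 0 means it did not sell at all in the latest week.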


# Generate purchase history dictionary

To improve processing speed, we select the necessary columns from the data frame and convert them to lists.  
We then store the list of values for each column in the dictionary `calc_buffer`.  
Reading from these lists inside the loop, instead of from the data frame, should improve processing speed considerably.
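
A minimal sketch of the pattern, using a made-up three-row frame (`df_demo`, `buffer`, and `totals` are names introduced only for this example):

```python
import pandas as pd

# Hypothetical frame standing in for the transaction data.
df_demo = pd.DataFrame({
    'customer_id': ['a', 'b', 'a'],
    'quotient': [0.5, 2.0, 1.5],
})

# Convert only the columns the loop needs into plain Python lists.
cols = ['customer_id', 'quotient']
buffer = {c: df_demo[c].to_list() for c in cols}

# Iterate over the lists; no pandas indexing happens inside the loop.
totals = {}
for cust_id, quotient in zip(buffer['customer_id'], buffer['quotient']):
    totals[cust_id] = totals.get(cust_id, 0.0) + quotient

print(totals)
```

Plain list access avoids the per-element overhead of indexing into a data frame, which matters when the loop runs once per transaction.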

In [None]:
purchase_dict = {}

# Processed by list to achieve faster speeds
cols_to_list = ['customer_id', 'article_id', 't_dat', 'quotient']
calc_buffer = {_c: df[_c].to_list() for _c in cols_to_list}

In [None]:
for i in tqdm(range(0, len(calc_buffer['customer_id']))):
    cust_id = calc_buffer['customer_id'][i]
    art_id = calc_buffer['article_id'][i]
    t_dat = calc_buffer['t_dat'][i]

    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}

    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    x = max(1, (last_ts - t_dat).days)

    a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
    y = a / np.sqrt(x) + b * np.exp(-c*x) - d

    value = calc_buffer['quotient'][i] * max(0, y)
    purchase_dict[cust_id][art_id] += value

del calc_buffer
gc.collect()
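
The loop above weights each purchase by how many days ago it happened. The sketch below isolates that weight as a function so its shape can be inspected. `decay_weight` is a name introduced here for illustration, and the NumPy vectorization is an assumption of this example; the constants are the ones used in the loop.

```python
import numpy as np

def decay_weight(days, a=2.5e4, b=1.5e5, c=2e-1, d=1e3):
    """Time-decay weight for purchases made `days` days before the latest date."""
    # Treat anything younger than one day as one day old, as the loop does.
    x = np.maximum(1, np.asarray(days, dtype=float))
    y = a / np.sqrt(x) + b * np.exp(-c * x) - d
    # Negative values are clipped to zero, so very old purchases contribute nothing.
    return np.maximum(0.0, y)

weights = decay_weight([0, 1, 7, 30, 365, 1000])
print(weights)
```

With these constants the exponential term dominates for the first few weeks, after which the weight follows the slower `a / sqrt(x)` curve and reaches zero after roughly 625 days, where `a / sqrt(x)` equals `d`.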

In [None]:
target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
general_pred = target_sales.nlargest(N).index.tolist()

In [None]:
general_pred

In [None]:
# Step2 & Step3
pairs = np.load('../input/hmitempairs/pairs_cudf.npy',allow_pickle=True).item()
sub = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')

In [None]:
pred_list = []
for cust_id in tqdm(sub['customer_id']):
    # in case of the customer who has purchase history
    if cust_id in purchase_dict:
        # get purchase history
        #series = pd.Series(purchase_dict[cust_id])
        #series = series[series > 0]
        purchased = sorted(purchase_dict[cust_id].items(), key=lambda x: x[1], reverse=True)
        # Get up to 12 cases in order of likelihood
        #l = series.nlargest(N).index.tolist()
        # Keep positive scores only and cut at N, matching the nlargest(N) version above
        l = [art_id for art_id, score in purchased if score > 0][:N]
        tmp_l = l.copy()
        for elm in tmp_l:
            # If the number of recommendation candidates is less than 12, 
            # add products to the recommendation candidates for possible simultaneous purchase.
            if len(l) < N and int(elm) in pairs.keys():
                itm = pairs[int(elm)]
                l.append('0' + str(itm))
        if len(l) < N:
            # If the 12 recommended candidate slots are not filled, 
            # pick up from the general forecast candidates to fill the slots
            l = l + general_pred[:(N-len(l))]
    else:
        # If no purchase history is available, apply general prediction candidates
        l = general_pred
    pred_list.append(' '.join(l))

sub['prediction'] = pred_list
sub.to_csv('submission.csv', index=False)
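
The fill-up logic of the loop above can also be read as a small function. This is a sketch under assumptions: `fill_candidates` and all the IDs below are made up for illustration, and the pair lookup mirrors the loop by converting the article ID to `int` and prepending `'0'` to the paired ID.

```python
def fill_candidates(ranked, pairs, general_pred, n=12):
    """Return up to n candidates: own purchases, then paired items, then general picks."""
    cands = list(ranked[:n])
    # Add the item commonly bought together with each purchased article.
    for art_id in list(cands):
        if len(cands) >= n:
            break
        key = int(art_id)
        if key in pairs:
            cands.append('0' + str(pairs[key]))
    # Fill any remaining slots from the general prediction list.
    if len(cands) < n:
        cands += general_pred[:n - len(cands)]
    return cands

# Hypothetical inputs: one purchased article, one known pair, three general picks.
pairs_demo = {111: 222}
result = fill_candidates(['0111'], pairs_demo, ['0900', '0901', '0902'], n=3)
print(result)
```

Like the loop above, this sketch does not remove duplicates, so an article can appear twice if it is both a paired item and a general pick.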