Radek posted about this [here](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/309220), and linked to a GitHub repo with the code.

I just transferred that code here to Kaggle notebooks, that's all.

## Introduction
This notebook is a code analysis of "[Radek's LGBMRanker starter-pack](https://www.kaggle.com/code/marcogorelli/radek-s-lgbmranker-starter-pack)".

Each processing step is summarized in the code comments.

As a first-time user of LGBMRanker, I found the preprocessing for this dataset very difficult.
This notebook therefore focuses on the data preprocessing that happens before learning to rank.

For more on how to use LGBMRanker and how to assemble a submission file from the predictions, the following notebook may help.

https://www.kaggle.com/code/kimurayut/gbm-ranking

The following discussion should also help you understand the data that is fed into training.

https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/307288#1728274


In [None]:
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.

    This function computes the average precision at k between two lists of
    items.

    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The average precision at k over the input lists

    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.

    This function computes the mean average precision at k between two lists
    of lists of items.

    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The mean average precision at k over the input lists

    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
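
As a quick sanity check of the metric, here is a small worked example. It is only a sketch: `average_precision_at_k` is a hypothetical, compact restatement of `apk` so that the snippet runs on its own, and the toy lists are made up for illustration.

```python
def average_precision_at_k(actual, predicted, k=10):
    """Compact restatement of apk, for illustration only."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, num_hits = 0.0, 0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant item appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1)
    return score / min(len(actual), k)


# Relevant items are found at positions 1 and 3,
# so the score is (1/1 + 2/3) / 3
ap = average_precision_at_k([1, 2, 3], [1, 4, 2], k=3)
print(ap)
```

Note that a repeated prediction is never counted twice, so duplicated items in a prediction list only waste slots.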

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd  # used by Categorize and eval_sub below

# https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
def customer_hex_id_to_int(series):
    return series.str[-16:].apply(hex_id_to_int)

def hex_id_to_int(hex_str):
    # Parse the last 16 hex characters as an integer (avoids shadowing the built-in str)
    return int(hex_str[-16:], 16)

def article_id_str_to_int(series):
    return series.astype('int32')

def article_id_int_to_str(series):
    return '0' + series.astype('str')

class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []
        
    def fit(self, X):
        for i in range(X.shape[1]):
            vc = X.iloc[:, i].value_counts()
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self

    def transform(self, X):
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data)


def calculate_apk(list_of_preds, list_of_gts):
    # for fast validation this can be changed to operate on dicts of {'cust_id_int': [art_id_int, ...]}
    # using 'data/val_week_purchases_by_cust.pkl'
    apks = []
    for preds, gt in zip(list_of_preds, list_of_gts):
        apks.append(apk(gt, preds, k=12))
    return np.mean(apks)

def eval_sub(sub_csv, skip_cust_with_no_purchases=True):
    sub=pd.read_csv(sub_csv)
    validation_set=pd.read_parquet('data/validation_ground_truth.parquet')

    apks = []

    no_purchases_pattern = []
    for pred, gt in zip(sub.prediction.str.split(), validation_set.prediction.str.split()):
        if skip_cust_with_no_purchases and (gt == no_purchases_pattern): continue
        apks.append(apk(gt, pred, k=12))
    return np.mean(apks)
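
To make the ID helpers and the category encoding above more concrete, here is a minimal sketch on made-up values. `hex_tail_to_int` is a hypothetical stand-in for `hex_id_to_int`, and the customer ID and the article ID below are invented for illustration.

```python
import pandas as pd


def hex_tail_to_int(hex_str):
    # Same idea as hex_id_to_int: parse the last 16 hex characters
    return int(hex_str[-16:], 16)


# Made-up 64-character customer ID (illustration only)
fake_customer_id = "0" * 48 + "00000000000000ff"
customer_int = hex_tail_to_int(fake_customer_id)

# Article IDs lose their leading zero as integers
# and get it back when converted to strings again
article_ints = pd.Series(["0108775015"]).astype("int32")
article_strs = "0" + article_ints.astype("str")

# Values outside the known categories are encoded as -1,
# which is how Categorize treats unseen or too-rare values
codes = pd.Categorical(["a", "b", "z"], categories=["a", "b"]).codes

print(customer_int, article_ints.iloc[0], article_strs.iloc[0], list(codes))
```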

In [None]:
import pandas as pd

In [None]:
%%time

transactions = pd.read_parquet('../input/warmup/transactions_train.parquet')
customers = pd.read_parquet('../input/warmup/customers.parquet')
articles = pd.read_parquet('../input/warmup/articles.parquet')

# sample = 0.05
# transactions = pd.read_parquet(f'data/transactions_train_sample_{sample}.parquet')
# customers = pd.read_parquet(f'data/customers_sample_{sample}.parquet')
# articles = pd.read_parquet(f'data/articles_train_sample_{sample}.parquet')

In [None]:
transactions.week.max() - 10

In [None]:
# Used later to build the test data
test_week = transactions.week.max() + 1  # 105

# Keep only the last 10 weeks of transactions (week > 94)
transactions = transactions[transactions.week > transactions.week.max() - 10]

# Generating candidates

### Last purchase candidates

In [None]:
%%time
# Extract the weeks in which each customer made a purchase
c2weeks = transactions.groupby('customer_id')['week'].unique()

In [None]:
# Check the first and last dates of each week
transactions.groupby('week')['t_dat'].agg(['min', 'max'])

In [None]:
c2weeks

In [None]:
%%time

# For each customer, map every purchase week to that customer's next purchase week
c2weeks2shifted_weeks = {}

for c_id, weeks in c2weeks.items():
    # c_id: customer ID
    # weeks: array of the weeks in which the customer made a purchase
    c2weeks2shifted_weeks[c_id] = {}
    for i in range(weeks.shape[0]-1):
        # Map each purchase week to the following purchase week
        c2weeks2shifted_weeks[c_id][weeks[i]] = weeks[i+1]
    # The last purchase week always maps to test_week (105 in this case)
    c2weeks2shifted_weeks[c_id][weeks[-1]] = test_week
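
The mapping built above can be checked on a single made-up customer. The sketch below uses a hypothetical helper `build_shift_map` and assumes the purchase weeks arrive in chronological order, as they do when the transactions are sorted by date.

```python
import numpy as np


def build_shift_map(purchase_weeks, test_week):
    """Map each purchase week to the next one; the last maps to test_week."""
    shifted = {}
    for current, following in zip(purchase_weeks[:-1], purchase_weeks[1:]):
        shifted[int(current)] = int(following)
    shifted[int(purchase_weeks[-1])] = test_week
    return shifted


# A customer who bought in weeks 95, 97 and 102
shift_map = build_shift_map(np.array([95, 97, 102]), test_week=105)
print(shift_map)
```

In other words, what a customer bought in week 95 becomes a candidate for week 97, and the most recent basket becomes a candidate for the test week.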

In [None]:
c2weeks2shifted_weeks[28847241659200][95]

In [None]:
candidates_last_purchase = transactions.copy()

In [None]:
%%time

weeks = []
for i, (c_id, week) in enumerate(zip(transactions['customer_id'], transactions['week'])):
    # Look up the customer's next purchase week (the last purchase week always maps to 105)
    weeks.append(c2weeks2shifted_weeks[c_id][week])

# Overwrite the week column with the shifted weeks
candidates_last_purchase.week=weeks 
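
The Python loop above can also be expressed with pandas operations. The following is a sketch of one possible vectorized formulation, not the original author's code: `shift_to_next_purchase_week` is a hypothetical name, and it sorts the weeks explicitly instead of relying on the row order of `transactions`.

```python
import pandas as pd


def shift_to_next_purchase_week(transactions, test_week):
    # One row per (customer, purchase week), in chronological order
    weeks = (
        transactions[["customer_id", "week"]]
        .drop_duplicates()
        .sort_values(["customer_id", "week"])
    )
    # Next purchase week per customer; the last one falls back to test_week
    weeks["next_week"] = (
        weeks.groupby("customer_id")["week"].shift(-1).fillna(test_week).astype("int64")
    )
    # A left merge keeps the original row order of the transactions
    out = transactions.merge(weeks, on=["customer_id", "week"], how="left")
    out["week"] = out.pop("next_week")
    return out


toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 1],
    "week": [95, 97, 96, 95],
    "article_id": [10, 11, 12, 13],
})
shifted = shift_to_next_purchase_week(toy, test_week=105)
print(shifted)
```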

In [None]:
# Last-purchase candidates (shifted weeks) for customer 272412481300040
candidates_last_purchase[candidates_last_purchase['customer_id']==272412481300040]

In [None]:
# Original transactions for customer 272412481300040, for comparison
transactions[transactions['customer_id']==272412481300040]

★ Every customer's last purchase week is mapped to week 105. Is this used later as the condition for extracting each customer's most recent purchases?

### Bestsellers candidates

In [None]:
# Mean price per week and article_id
mean_price = transactions \
    .groupby(['week', 'article_id'])['price'].mean()

In [None]:
mean_price

In [None]:
transactions.groupby('week')['article_id'].value_counts()

In [None]:
transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').rank(method='dense', ascending=False) \
    .groupby('week').head(12).rename('bestseller_rank').astype('int8')

In [None]:
# Rank articles by weekly purchase count and keep the top 12 per week
sales = transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').rank(method='dense', ascending=False) \
    .groupby('week').head(12).rename('bestseller_rank').astype('int8') 

In [None]:
sales

In [None]:
# Top 12 articles of week 95
sales.loc[95]

In [None]:
# Join the top-12 bestseller table with the mean price table
bestsellers_previous_week = pd.merge(sales, mean_price, on=['week', 'article_id']).reset_index()

In [None]:
bestsellers_previous_week

In [None]:
# Shift by one week so that each week is paired with the previous week's bestsellers
bestsellers_previous_week.week += 1
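
The bestseller ranking and the one-week shift can be reproduced on toy data. This is a sketch under assumptions: `previous_week_bestsellers` is a hypothetical helper, it counts sales with `groupby(...).size()` rather than `value_counts()`, and the toy transactions are invented.

```python
import pandas as pd


def previous_week_bestsellers(transactions, top_n=12):
    # Number of sales per week and article
    counts = (
        transactions.groupby(["week", "article_id"]).size()
        .rename("n_sold").reset_index()
    )
    # Dense rank within each week: ties share the same rank
    counts["bestseller_rank"] = (
        counts.groupby("week")["n_sold"]
        .rank(method="dense", ascending=False)
        .astype("int8")
    )
    # Keep the top_n rows per week, most sold first
    top = (
        counts.sort_values(["week", "n_sold"], ascending=[True, False])
        .groupby("week").head(top_n)
        .copy()
    )
    # Week w's bestsellers become candidates for week w + 1
    top["week"] += 1
    return top.reset_index(drop=True)


toy = pd.DataFrame({
    "week": [95, 95, 95, 95, 96, 96, 96],
    "article_id": [1, 1, 2, 3, 2, 2, 1],
})
best = previous_week_bestsellers(toy, top_n=2)
print(best)
```

Because the rank is dense, tied articles share a rank, so the number of rows kept per week is fixed by `head`, not by the rank value.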

In [None]:
# Check the data for week 96
bestsellers_previous_week.pipe(lambda df: df[df['week']==96])

In [None]:
# Keep the first transaction of each customer in each week
unique_transactions = transactions \
    .groupby(['week', 'customer_id']) \
    .head(1) \
    .drop(columns=['article_id', 'price']) \
    .copy()

In [None]:
unique_transactions

In [None]:
# Drop duplicate rows so that each customer appears only once per week
transactions.drop_duplicates(['week', 'customer_id'])

In [None]:

candidates_bestsellers = pd.merge(
    # First transaction of each customer per week
    unique_transactions,
    # Top-12 bestsellers joined with their mean prices
    bestsellers_previous_week,
    on='week',
)
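
Merging on `week` alone pairs every customer active in a week with every bestseller assigned to that week. A minimal sketch with invented frames (the names `customers_per_week` and `bestsellers` are placeholders for `unique_transactions` and `bestsellers_previous_week`):

```python
import pandas as pd

# Two customers are active in week 96, one in week 97
customers_per_week = pd.DataFrame({
    "week": [96, 96, 97],
    "customer_id": [1, 2, 1],
})
# Two bestsellers for week 96, one for week 97
bestsellers = pd.DataFrame({
    "week": [96, 96, 97],
    "article_id": [10, 11, 12],
    "bestseller_rank": [1, 2, 1],
})

# Within each week this behaves like a cross join: 2 x 2 + 1 x 1 = 5 rows
candidates = pd.merge(customers_per_week, bestsellers, on="week")
print(candidates)
```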

In [None]:
candidates_bestsellers

In [None]:
# Keep one row per customer across all weeks
test_set_transactions = unique_transactions.drop_duplicates('customer_id').reset_index(drop=True)

# Set week to test_week (105) for every row
test_set_transactions.week = test_week


In [None]:
test_set_transactions

In [None]:
candidates_bestsellers_test_week = pd.merge(
    # One row per customer, with week set to test_week (105)
    test_set_transactions,
    # Top-12 bestsellers joined with their mean prices
    bestsellers_previous_week,
    on='week'
)

In [None]:
candidates_bestsellers_test_week

In [None]:
# Stack the bestseller candidates for the training weeks on top of
# the bestseller candidates for the test week (105)
candidates_bestsellers = pd.concat([candidates_bestsellers, candidates_bestsellers_test_week])
candidates_bestsellers.drop(columns='bestseller_rank', inplace=True)

In [None]:
candidates_bestsellers

# Combining transactions and candidates / negative examples

In [None]:
# Flag every real transaction as purchased (1)
transactions['purchased'] = 1

In [None]:
# transactions: all real transactions (purchased = 1)
# candidates_last_purchase: transactions with the week column shifted to the next purchase week
# candidates_bestsellers: previous-week bestsellers for each customer and week, including the test week (105)
data = pd.concat([transactions, candidates_last_purchase, candidates_bestsellers])

# Rows that came from candidates_last_purchase or candidates_bestsellers have no purchased value,
# so fill it with 0 to mark them as negative examples
data['purchased'] = data['purchased'].fillna(0)

In [None]:
data

In [None]:
# Drop rows duplicated on customer ID, article ID, and week
data.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)
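
The order of the frames in `pd.concat` matters here: `drop_duplicates` keeps the first occurrence by default, so a real purchase wins over a candidate row for the same customer, article, and week. A small sketch with invented rows:

```python
import pandas as pd

# One real purchase, flagged as purchased
purchases = pd.DataFrame({
    "customer_id": [1], "article_id": [10], "week": [96], "purchased": [1],
})
# Two candidates; the first one duplicates the real purchase
candidates = pd.DataFrame({
    "customer_id": [1, 1], "article_id": [10, 11], "week": [96, 96],
})

data = pd.concat([purchases, candidates], ignore_index=True)
# Candidate rows have no purchased value, so they become negatives (0)
data["purchased"] = data["purchased"].fillna(0)
# keep='first' is the default: the positive row comes first and survives
data = data.drop_duplicates(["customer_id", "article_id", "week"])
print(data)
```

The remaining candidate that was not actually bought stays in the data as a negative example.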

In [None]:
data.purchased.mean()

### Add bestseller information

In [None]:
# Attach the bestseller rank
data = pd.merge(
    data,
    bestsellers_previous_week[['week', 'article_id', 'bestseller_rank']],
    on=['week', 'article_id'],
    how='left'
)

In [None]:
data

In [None]:
# Exclude the oldest week (it has no previous-week bestseller information)
data = data[data.week != data.week.min()]

In [None]:
data

In [None]:
# Fill missing ranks with 999
data['bestseller_rank'] = data['bestseller_rank'].fillna(999)

In [None]:
# Add article and customer features
data = pd.merge(data, articles, on='article_id', how='left')
data = pd.merge(data, customers, on='customer_id', how='left')

In [None]:
data

In [None]:
# Sort by week and customer ID
data.sort_values(['week', 'customer_id'], inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
# Use every week except week 105 as training data
train = data[data.week != test_week]

# Use week 105 as test data
test = data[data.week==test_week].drop_duplicates(['customer_id', 'article_id', 'sales_channel_id']).copy()

In [None]:
train.groupby(['week', 'customer_id'])['article_id'].count()

In [None]:
# Group sizes for LGBMRanker's group argument: the number of rows per (week, customer_id) query
train_baskets = train.groupby(['week', 'customer_id'])['article_id'].count().values
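
The group sizes only line up with the training rows because the data was sorted by `week` and `customer_id` beforehand, and `groupby` returns its keys in that same sorted order. A toy sketch with invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "week": [97, 96, 96, 97, 96],
    "customer_id": [1, 1, 2, 1, 1],
    "article_id": [12, 10, 10, 13, 11],
})

# Sort first so that the rows of each (week, customer) query are contiguous
toy = toy.sort_values(["week", "customer_id"]).reset_index(drop=True)

# One entry per query, in the same order as the sorted rows
group_sizes = toy.groupby(["week", "customer_id"])["article_id"].count().values
print(group_sizes)
```

The sizes must sum to the number of training rows; if they do not, the rows and the groups are out of sync.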

In [None]:
columns_to_use = ['article_id', 'product_type_no', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id',
'perceived_colour_master_id', 'department_no', 'index_code',
'index_group_no', 'section_no', 'garment_group_no', 'FN', 'Active',
'club_member_status', 'fashion_news_frequency', 'age', 'postal_code', 'bestseller_rank']

In [None]:
%%time

train_X = train[columns_to_use]
train_y = train['purchased']

test_X = test[columns_to_use]

# Model training

In [None]:
from lightgbm.sklearn import LGBMRanker

In [None]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=1,
    importance_type='gain',
    verbose=10
)

In [None]:
%%time

ranker = ranker.fit(
    train_X,
    train_y,
    group=train_baskets,
)

In [None]:
for i in ranker.feature_importances_.argsort()[::-1]:
    print(columns_to_use[i], ranker.feature_importances_[i]/ranker.feature_importances_.sum())

# Calculate predictions

In [None]:
%%time

test['preds'] = ranker.predict(test_X)

c_id2predicted_article_ids = test \
    .sort_values(['customer_id', 'preds'], ascending=False) \
    .groupby('customer_id')['article_id'].apply(list).to_dict()

bestsellers_last_week = \
    bestsellers_previous_week[bestsellers_previous_week.week == bestsellers_previous_week.week.max()]['article_id'].tolist()
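
The sort-then-group step that turns scores into a ranked list per customer can be checked on a few invented rows (the scores below are made up, not model output):

```python
import pandas as pd

scored = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "article_id": [10, 11, 12, 10, 13],
    "preds": [0.2, 0.9, 0.5, 0.1, 0.7],
})

# Sorting by score (descending) before grouping keeps each list in ranked order
ranked = (
    scored.sort_values(["customer_id", "preds"], ascending=False)
    .groupby("customer_id")["article_id"].apply(list).to_dict()
)
print(ranked)
```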

# Create submission

In [None]:
sub = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')

In [None]:
%%time
preds = []
for c_id in customer_hex_id_to_int(sub.customer_id):
    pred = c_id2predicted_article_ids.get(c_id, [])
    pred = pred + bestsellers_last_week
    preds.append(pred[:12])

In [None]:
preds = [' '.join(['0' + str(p) for p in ps]) for ps in preds]
sub.prediction = preds
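
The cells above append last week's bestsellers to each prediction list without removing duplicates. As an optional variation (not part of the original notebook), the sketch below drops repeated articles before truncating; `fill_to_k` and `to_submission_string` are hypothetical helper names and the article IDs are illustrative.

```python
def fill_to_k(predicted, fallback, k=12):
    # dict.fromkeys keeps the first occurrence of each item, in order
    merged = list(dict.fromkeys(predicted + fallback))
    return merged[:k]


def to_submission_string(article_ids):
    # Restore the leading zero and join with spaces
    return " ".join("0" + str(a) for a in article_ids)


filled = fill_to_k([3, 1], [1, 2, 3, 4], k=3)
line = to_submission_string([108775015, 111565001])
print(filled, line)
```

Since the metric never counts a repeated article twice, removing duplicates frees slots for other bestsellers.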

In [None]:
sub_name = 'basic_model_submission'
sub.to_csv(f'{sub_name}.csv.gz', index=False)