# Overview

In this competition, preprocessing and training take a long time because the data includes too many items.

So I try to find an efficient set of items for making predictions.

*Reference Notebook*
- Byfone: https://www.kaggle.com/byfone/h-m-trending-products-weekly

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns

# Load Dataset

In [None]:
articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
transactions = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv', usecols=['t_dat', 'customer_id', 'article_id'])

In [None]:
article2idx = dict(zip(articles["article_id"], articles.index))
idx2article = dict(zip(articles.index, articles["article_id"]))
del articles

In [None]:
transactions["article_id"] = transactions["article_id"].map(lambda x: article2idx[x])

In [None]:
import gc 
gc.collect()

# Split Train/Test

Before predicting, split the transactions into train and test sets by date.

In [None]:
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])

print(transactions['t_dat'].max())

In [None]:
train = transactions.query("t_dat<='2020-09-15'").reset_index(drop=True)
test = transactions.query("t_dat>'2020-09-15'").reset_index(drop=True)

In [None]:
print(train.shape)
print(test.shape)

# Method

## 1. Prediction method using geometric distribution

In [None]:
summary = train.groupby('article_id')['t_dat'].agg(['min', 'max', 'count', 'nunique']).reset_index()
summary = summary.rename(columns={'min': 'min_dat', 'max': 'max_dat', 'count':'total_sales', 'nunique':'unique_dat'})

In [None]:
# Length of the article's sales window in days, counting both the first and last day
summary['diff_dat'] = (summary['max_dat'] - summary['min_dat']).dt.days + 1

last_tdat = train['t_dat'].max()
summary['last_diff_dat'] = (last_tdat - summary['max_dat']).dt.days

In [None]:
summary['avg_sales'] = summary['total_sales'] / summary['unique_dat']
summary['tdat_ratio'] = summary['unique_dat'] / summary['diff_dat']

summary['daily_sales'] = summary['avg_sales'] * summary['tdat_ratio']

In [None]:
sns.displot(summary['daily_sales'], kind='kde')

In [None]:
summary.head()

We calculated the average sales per active day (`avg_sales`) and the fraction of days on which a transaction occurred (`tdat_ratio`).

Multiplying the two gives the daily sales.

**`daily_sales` = `avg_sales` * `tdat_ratio`**

`daily_sales` is the average number of units of an article sold per day between its first and last sale, since `unique_dat` cancels and the product equals `total_sales / diff_dat`.

However, an adjustment is needed for articles that have not been sold recently.
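As a quick sanity check of the formula, here is a minimal sketch with made-up numbers (the values are hypothetical and not taken from the dataset). It only shows that the product of the two ratios collapses to total sales divided by the length of the sales window.

```python
# Hypothetical article: 12 sales on 4 distinct days,
# with 8 days between first and last sale (inclusive).
total_sales = 12
unique_dat = 4
diff_dat = 8

avg_sales = total_sales / unique_dat   # sales per day on which the article sold
tdat_ratio = unique_dat / diff_dat     # share of days in the window with a sale
daily_sales = avg_sales * tdat_ratio   # same quantity as in the summary table

# unique_dat cancels out, so this equals total_sales / diff_dat
print(avg_sales, tdat_ratio, daily_sales, total_sales / diff_dat)
```

An article that sells in bursts and one that sells steadily get the same `daily_sales` if their totals and windows match, which is one reason the recency adjustment matters.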

In [None]:
from scipy.stats import geom

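def geom_func(p, n):
    # p: fraction of days with a sale; n: days since the article's last sale
    n //= 7  # convert days to whole weeks
    if n == 0:
        # sold within the last week: use p itself
        return p
    # geometric pmf (1 - p)**(n - 1) * p, which shrinks as the gap grows
    return geom(p).pmf(n)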

In [None]:
summary['alpha'] = summary.apply(lambda x: geom_func(x['tdat_ratio'], x['last_diff_dat']), axis=1)

In [None]:
summary['pred_sales'] = summary['alpha'] * summary['daily_sales']

*To avoid confusion: the term "sales" here is a score and does not refer to actual sales figures.*
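To get a feel for how `alpha` behaves, the sketch below mirrors `geom_func` without SciPy by using the closed form of the geometric pmf, `(1 - p)**(k - 1) * p` for `k >= 1`. The helper names `geom_pmf` and `alpha_for` and the value `p = 0.5` are assumptions made for illustration only.

```python
def geom_pmf(p, k):
    """Closed form of the geometric pmf for k >= 1 trials."""
    return (1 - p) ** (k - 1) * p


def alpha_for(p, days_since_last_sale):
    """Same logic as geom_func: whole weeks since the last sale drive the decay."""
    weeks = days_since_last_sale // 7
    if weeks == 0:
        return p
    return geom_pmf(p, weeks)


p = 0.5  # hypothetical tdat_ratio
alphas = [alpha_for(p, d) for d in (0, 7, 14, 21)]
print(alphas)
```

Note that a gap of zero weeks and a gap of one week give the same factor, and the factor then halves with each extra week for this choice of `p`.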

In [None]:
sns.displot(summary['pred_sales'], kind='kde')

In [None]:
summary = summary.sort_values(by='pred_sales', ascending=False).reset_index(drop=True)
summary.head(10)

In [None]:
pred_sales = summary.query('pred_sales>=1')['article_id'].values
daily_sales = summary.query('daily_sales>=1')['article_id'].values

## 2. Weekly Sales

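In Weekly Sales, we compute a `quotient` for each transaction.

- `quotient` : last_week_sales / ldbw_sales, where `ldbw` is the end date of the 7-day window (counted back from the last training date) that the transaction falls into

How the variable is built is detailed in [Byfone's Notebook](https://www.kaggle.com/byfone/h-m-trending-products-weekly)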
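The weekly bucketing and the quotient can be sketched on a few hand-made rows using only the standard library. The dates, the article labels `"A"` and `"B"`, and the helper name `ldbw_of` are hypothetical; the notebook itself does the same thing with pandas.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical toy transactions: (t_dat, article_id)
rows = [
    (date(2020, 9, 15), "A"),
    (date(2020, 9, 10), "A"),
    (date(2020, 9, 8), "A"),
    (date(2020, 9, 3), "B"),
    (date(2020, 9, 2), "B"),
    (date(2020, 9, 1), "B"),
]
last_ts = max(d for d, _ in rows)


def ldbw_of(d):
    # Floor the age of the transaction to whole weeks, then step back from last_ts
    offset_days = (last_ts - d).days // 7 * 7
    return last_ts - timedelta(days=offset_days)


# Sales per (week, article)
weekly = Counter((ldbw_of(d), a) for d, a in rows)

# Sales in the most recent week, per article
last_week = {a: c for (w, a), c in weekly.items() if w == last_ts}

# quotient = last-week sales / sales in the transaction's own week
quotients = {(w, a): last_week.get(a, 0) / c for (w, a), c in weekly.items()}
for key, q in sorted(quotients.items()):
    print(key, q)
```

Article `"A"` sold more in the last week than in the week before, so its older transactions get a quotient above 1, while `"B"` has no sales in the last week and gets 0.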

In [None]:
last_ts = train['t_dat'].max()

# df['ldbw'] = df['t_dat'].progress_apply(lambda d: last_ts - (last_ts - d).floor('7D'))
train['offset_dat'] = (last_ts - train['t_dat']).dt.floor('7D')
train['ldbw'] = last_ts - train['offset_dat']
train.head()

In [None]:
weekly_sales = train.groupby(['ldbw', 'article_id']).t_dat.count().reset_index(name='count')
weekly_sales.tail()

In [None]:
train = train.merge(weekly_sales, on=['ldbw', 'article_id'], how='left')
train.head()

In [None]:
weekly_sales = weekly_sales.set_index('article_id')

train = train.merge(weekly_sales.loc[weekly_sales['ldbw']==last_ts, ['count']],
                    how='left',
                    on='article_id', 
                    suffixes=("", "_targ"))

# last week sales (0 for articles with no sales in the last week)
train['count_targ'] = train['count_targ'].fillna(0)
train.head()

In [None]:
# last_week_sales / ldbw_sales
train['quotient'] = train['count_targ'] / train['count']
train = train.sort_values(by='quotient', ascending=False).reset_index(drop=True)
train.head()

In [None]:
lw_sales = set(train.query('quotient>=1')['article_id'].values)

In [None]:
print(f'Total Article Size: {train.article_id.nunique()}')
print('-' * 80)
print(f'Pred Sales Candidate Size: {len(pred_sales)}')
print(f'Daily Sales Candidate Size: {len(daily_sales)}')
print(f'Last Weekly Sales(Quotient) Candidate Size: {len(lw_sales)}')

# Evaluation

In [None]:
from collections import Counter

test_sales = Counter(test['article_id'].values)

In [None]:
test_size = len(test_sales.keys())
print(f'Test Articles Count: {test_size}')

In [None]:
print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(pred_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(pred_sales))) / test_size}')

In [None]:
print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(daily_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(daily_sales))) / test_size}')

In [None]:
print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(lw_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(lw_sales))) / test_size}')

The matched-articles ratio for `lw_sales` is quite high (0.77).

Considering the size of each candidate set, `pred_sales` matches a lot of articles even though it is relatively small.

Increasing its size a little may raise the matching ratio further.
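The three evaluation cells repeat the same set intersection, so it can be wrapped in a small helper. This is a minimal sketch; the function name `coverage` and the toy inputs are assumptions, not part of the original notebook.

```python
from collections import Counter


def coverage(candidates, test_counts):
    """Return (matched count, matched ratio) of test articles found in candidates."""
    test_articles = set(test_counts)
    matched = test_articles & set(candidates)
    return len(matched), len(matched) / len(test_articles)


# Hypothetical toy data: article ids sold in the test period
test_counts = Counter([1, 1, 2, 3, 3, 3, 4])
matched, ratio = coverage([1, 3, 9], test_counts)
print(f'Matched Articles Count: {matched}')
print(f'Matched Articles Ratio: {ratio}')
```

In the notebook this would be called as `coverage(pred_sales, test_sales)`, `coverage(daily_sales, test_sales)` and `coverage(lw_sales, test_sales)`.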

In [None]:
pred_sales = summary.query('pred_sales>=0.1')['article_id'].values
print(f'Pred Sales Candidate Size: {len(pred_sales)}')

print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(pred_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(pred_sales))) / test_size}')

**With the `pred_sales` threshold set to 0.1, about 20,000 candidates are enough to match 80% of the articles in the test data.**

This can increase the accuracy of the model by effectively reducing the number of articles to be considered.

In [None]:
pred_sales = summary.query('pred_sales>=0.01')['article_id'].values
print(f'Pred Sales Candidate Size: {len(pred_sales)}')

print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(pred_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(pred_sales))) / test_size}')

**With the `pred_sales` threshold set to 0.01, about 30,000 candidates are enough to match 92% of the articles in the test data.**

In [None]:
lw_sales = set(train.query('quotient>=0.1')['article_id'].values)
print(f'Last Weekly Sales(Quotient) Candidate Size: {len(lw_sales)}')

print(f'Matched Articles Count: {len(set(test_sales.keys()).intersection(set(lw_sales)))}')
print(f'Matched Articles Ratio: {len(set(test_sales.keys()).intersection(set(lw_sales))) / test_size}')

Additionally, adjusting the `quotient` threshold made no difference for `lw_sales`.
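The threshold experiments above can also be run as a single sweep that reports candidate size against coverage. The sketch below uses a hypothetical score table and test set; `sweep` is an illustrative helper name, and the numbers have nothing to do with the real results.

```python
# Hypothetical article_id -> pred_sales scores
scores = {10: 2.5, 11: 0.4, 12: 0.05, 13: 0.002}
# Hypothetical articles sold in the test period (99 never appears in train)
test_articles = {10, 11, 12, 99}


def sweep(scores, test_articles, thresholds):
    """For each threshold, return (threshold, candidate size, matched ratio)."""
    results = []
    for t in thresholds:
        candidates = {a for a, s in scores.items() if s >= t}
        ratio = len(candidates & test_articles) / len(test_articles)
        results.append((t, len(candidates), ratio))
    return results


results = sweep(scores, test_articles, [1, 0.1, 0.01])
for t, size, ratio in results:
    print(f'threshold={t}: candidates={size}, matched ratio={ratio}')
```

Articles that only appear in the test period can never be matched, so the ratio has a ceiling below 1 no matter how low the threshold goes.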

## Top100 Articles

A common evaluation metric that accounts for ranking order is [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain).

However, since the competition is scored with MAP (mean average precision), we instead check whether the best-selling test articles actually appear in our top 100.
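For reference, mean average precision at a cutoff `k` can be sketched as below. This follows the usual textbook definition and is not the official scoring code; the function names `apk` and `mapk`, the default `k=12`, and the toy inputs are assumptions. It also assumes every customer passed in has at least one purchase.

```python
def apk(actual, predicted, k=12):
    """Average precision at k for one customer."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    seen = set()
    for i, p in enumerate(predicted[:k], start=1):
        # Count each relevant article once, weighted by precision at its position
        if p in actual and p not in seen:
            hits += 1
            score += hits / i
        seen.add(p)
    return score / min(len(actual), k)


def mapk(actual_lists, predicted_lists, k=12):
    """Mean of apk over all customers."""
    total = sum(apk(a, p, k) for a, p in zip(actual_lists, predicted_lists))
    return total / len(actual_lists)


# Hypothetical purchases and predictions for two customers
score = mapk([[1, 2], [5]], [[1, 3, 2], [4, 5]])
print(score)
```

Because hits near the top of the list count more, the order of the candidate articles matters, not just whether they are present.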

In [None]:
top100 = sorted(test_sales.items(), key=lambda x: -x[1])[:100]

df = pd.DataFrame(top100, columns=['article_id', 'sales'])

In [None]:
summary['rank'] = summary.index

df = df.merge(summary[['article_id', 'rank']], how='left', on='article_id')

In [None]:
df.head(20)

Rows 18 and 19 have a missing rank (NaN).

The missing values are probably articles that first appear during the test period.
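One way to check that guess directly is to look for test best-sellers that never occur in the training period. A minimal sketch with hypothetical ids:

```python
# Hypothetical article ids seen on or before the train cutoff
train_articles = {101, 102, 103}
# Hypothetical test best-sellers, in order of sales
test_top = [102, 999, 101, 888]

# Articles with no training history cannot receive a rank from the summary table
cold_start = [a for a in test_top if a not in train_articles]
print(cold_start)
```

In the notebook the equivalent inputs would be `set(train['article_id'])` and the `article_id` column of `df`.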

In [None]:
df.loc[df['rank'].isnull()]

In [None]:
transactions.query('article_id==105222')['t_dat'].min()

We will check only 95 items, excluding missing values.

In [None]:
df = df.loc[df['rank'].notna()].reset_index(drop=True)

# rank is the 0-based position after sorting by pred_sales, so the top 100 are ranks 0-99
df['correct'] = (df['rank'] < 100).astype(int)

In [None]:
df['correct'].sum() / len(df)

Of the 95 top-selling test articles that have a rank, 45% fall within our top 100.
This ranking could be used as `general_pred`.
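A rough sketch of turning the ranking into such a fallback list is shown below. Everything here is an assumption for illustration: the helper name `build_general_pred`, the cutoff of 12 items, the example ids, and the idea that the original ids are 10-digit strings whose leading zero was dropped when the CSV was read as integers.

```python
# Hypothetical article indices, already sorted by pred_sales (best first)
ranked_idx = [2, 0, 1]
# Hypothetical index -> original article_id mapping, like idx2article above
idx2article = {0: 108775015, 1: 108775044, 2: 110065001}


def build_general_pred(ranked_idx, idx2article, k=12):
    """Map the top-k ranked indices back to ids and join them into one string."""
    ids = [idx2article[i] for i in ranked_idx[:k]]
    # Assumption: ids are zero-padded to 10 digits in the expected output
    return " ".join(str(a).zfill(10) for a in ids)


general_pred = build_general_pred(ranked_idx, idx2article)
print(general_pred)
```

Such a list would serve customers with no purchase history of their own.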

# **If you find this notebook helpful, please upvote!!**