# Baseline

Use the 12 most-purchased items as the prediction baseline, this time evaluated on our own train and dev splits.

- [X] Write function for target metric
- [X] Compute metric on validation set
- [X] Check that computed metric is similar to performance on hidden competition set

In [1]:
import os

import pandas as pd
import numpy as np

In [2]:
os.chdir('..')

In [15]:
from fashion_recommendations.metrics.average_precision import mapk

### Evaluation metric

__Competition information__

Submissions are evaluated according to the Mean Average Precision @ 12 (MAP@12)

Notes:
- You will be making purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data.
- Customers that did not make any purchase during the test period are excluded from the scoring.
- There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.

Submission File:
- For each customer_id observed in the training data, you may predict up to 12 labels for the article_id, which is the predicted items a customer will buy in the next 7-day period after the training time period. The file should contain a header and have the following format:

__Own notes__

- Competition definition of MAP is wrong: omits the 'average' part
- _Qn_: What happens if customer makes more than 12 purchases?
- _Qn_: What happens if customer makes multiple purchases of the same item?

- In the discussion forum, a member of the Kaggle staff cites a GitHub repo from the Founder of Kaggle which has code for MAP@k:
    - https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/306007#1680513
    - https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
- We use this for our own metric computation
- According to this code, the number of purchases can exceed k: the score only counts which of your top-k predictions appear among the purchases, and it is normalised by min(number of purchases, k)
- Only distinct predictions count: repeating an item that already appears earlier in the prediction list adds nothing to the score -> make 12 distinct predictions (do not predict item X y times)
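
The behaviour described in these notes can be illustrated with a minimal sketch that mirrors the logic of the linked `ml_metrics` code. The names `apk_sketch` and `mapk_sketch` are hypothetical and chosen so they do not clash with the project's own `mapk`, whose implementation may differ in detail.

```python
def apk_sketch(actual, predicted, k=12):
    """Average precision at k for a single customer (illustrative only)."""
    if not actual:
        return 0.0
    predicted = predicted[:k]  # only the first k predictions are scored
    hits = 0
    score = 0.0
    for i, item in enumerate(predicted):
        # A hit counts only the first time a correct item is predicted.
        if item in actual and item not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    # Normalise by the best achievable number of hits within k.
    return score / min(len(actual), k)


def mapk_sketch(actuals, predictions, k=12):
    """Mean of the per-customer average precision values."""
    scores = [apk_sketch(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)


# Two purchases, hits at ranks 1 and 3: (1/1 + 2/3) / 2
print(apk_sketch(['a', 'b'], ['a', 'x', 'b']))
# Repeating a correct prediction adds nothing to the score.
print(apk_sketch(['a'], ['a', 'a']))
```

The mean over customers in `mapk_sketch` is the 'average' step that the note above says the competition definition leaves out.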

### Make predictions

In [3]:
train_df = pd.read_parquet('data/splits/train.parquet')
print(train_df.shape)

dev_df = pd.read_parquet('data/splits/dev.parquet')
print(dev_df.shape)

(30143457, 5)
(277388, 5)


In [4]:
train_df.head()

|   | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2.0 |
| 1 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 541518023 | 0.030492 | 2.0 |
| 2 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 505221004 | 0.015237 | 2.0 |
| 3 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687003 | 0.016932 | 2.0 |
| 4 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687004 | 0.016932 | 2.0 |


In [5]:
# Share of all training transactions for the 12 most-purchased articles (value_counts already sorts descending)
train_df['article_id'].value_counts().head(12) / train_df.shape[0]

0706016001    0.001563
0706016002    0.001123
0372860001    0.001004
0610776002    0.000935
0759871002    0.000853
0464297007    0.000795
0372860002    0.000752
0399223001    0.000728
0610776001    0.000704
0720125001    0.000682
0562245001    0.000676
0351484002    0.000673
Name: article_id, dtype: float64

In [6]:
# The 12 most-purchased article ids in the training split (value_counts already sorts descending)
top_12_purchases = train_df['article_id'].value_counts().head(12).index.tolist()
top_12_purchases

['0706016001',
 '0706016002',
 '0372860001',
 '0610776002',
 '0759871002',
 '0464297007',
 '0372860002',
 '0399223001',
 '0610776001',
 '0720125001',
 '0562245001',
 '0351484002']

In [7]:
dev_df.head()

|   | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2020-08-10 | 00045027219e894b683fb4687211e2d0c904c268e9f28d... | 832481001 | 0.016932 | 1.0 |
| 1 | 2020-08-10 | 00045027219e894b683fb4687211e2d0c904c268e9f28d... | 907696001 | 0.016932 | 1.0 |
| 2 | 2020-08-10 | 00058592fc65afabbb00b1bb7d33c6b221d00c6a98c621... | 829152002 | 0.030492 | 2.0 |
| 3 | 2020-08-10 | 00058592fc65afabbb00b1bb7d33c6b221d00c6a98c621... | 812668001 | 0.050831 | 2.0 |
| 4 | 2020-08-10 | 00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec... | 887770002 | 0.008458 | 2.0 |


In [8]:
# Collect each customer's dev-period purchases into one list per customer
dev_df_by_customer = dev_df.groupby('customer_id')['article_id'].agg(list).reset_index()
dev_df_by_customer.head()

|   | customer_id | article_id |
|---|---|---|
| 0 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | [0896152002, 0730683050, 0927530004, 0791587015] |
| 1 | 00009d946eec3ea54add5ba56d5210ea898def4b46c685... | [0884319008, 0921226001, 0706016001, 0881244001] |
| 2 | 0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37... | [0900157002, 0900157002, 0850244001, 085024400... |
| 3 | 00025f8226be50dcab09402a2cacd520a99e112fe01fdd... | [0781613016, 0781613006, 0751471001] |
| 4 | 0002db27a1651998a3de4463437b580b45dfa7d8107afa... | [0926502001] |


In [9]:
dev_df_by_customer['number_of_purchases'] = dev_df_by_customer['article_id'].apply(len)

In [10]:
dev_df_by_customer['number_of_purchases'].max()

100

In [11]:
dev_df_by_customer['number_of_distinct_purchases'] = dev_df_by_customer['article_id'].apply(lambda x: len(set(x)))

In [12]:
dev_df_by_customer['number_of_distinct_purchases'].max()

73
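
The difference between 100 purchases and 73 distinct purchases shows that some customers buy the same article more than once in the dev period. If `mapk` follows the linked reference implementation, which divides by `min(len(actual), k)`, repeated article ids in the purchase list can enlarge that denominator. Whether the competition scorer deduplicates purchases is not established here, so the following is only a sketch of how the lists could be deduplicated if that is wanted; `unique_in_order` is a hypothetical helper.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first-seen order."""
    # dict preserves insertion order, so its keys are the distinct items in order.
    return list(dict.fromkeys(items))


purchases = ['0900157002', '0900157002', '0850244001']
print(unique_in_order(purchases))  # ['0900157002', '0850244001']
```

It could be applied per customer with `dev_df_by_customer['article_id'].apply(unique_in_order)`.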

In [13]:
dev_df_by_customer.head()

|   | customer_id | article_id | number_of_purchases | number_of_distinct_purchases |
|---|---|---|---|---|
| 0 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | [0896152002, 0730683050, 0927530004, 0791587015] | 4 | 4 |
| 1 | 00009d946eec3ea54add5ba56d5210ea898def4b46c685... | [0884319008, 0921226001, 0706016001, 0881244001] | 4 | 4 |
| 2 | 0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37... | [0900157002, 0900157002, 0850244001, 085024400... | 8 | 7 |
| 3 | 00025f8226be50dcab09402a2cacd520a99e112fe01fdd... | [0781613016, 0781613006, 0751471001] | 3 | 3 |
| 4 | 0002db27a1651998a3de4463437b580b45dfa7d8107afa... | [0926502001] | 1 | 1 |


In [22]:
actuals = dev_df_by_customer['article_id'].to_list()

In [18]:
predictions = [top_12_purchases for _ in range(dev_df_by_customer.shape[0])]

In [24]:
mapk(actuals, predictions, k=12)

0.0022334424392915896

Official submission score with same strategy: 0.0027

The two scores are close (0.0022 on our dev split vs 0.0027 on the hidden competition set), which suggests the validation set is reasonably representative.
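
To put a number on the comparison, the gap between the two scores can be computed directly. This is a rough check only: the official score is quoted to four decimal places, so the result is approximate, and the variable names are introduced here for illustration.

```python
dev_score = 0.0022334424392915896  # MAP@12 on our dev split (computed above)
official_score = 0.0027            # official submission score, same strategy

absolute_gap = official_score - dev_score
relative_gap = absolute_gap / official_score

print(f"absolute gap: {absolute_gap:.5f}")
print(f"relative gap: {relative_gap:.1%}")
```

The dev score comes out roughly 17% below the official score in relative terms, while the absolute difference is under 0.0005.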