
## **<span style="color:#3b51e3;font-size:150%"><center>Welcome and Enjoy</center></span>**

In this notebook we calculate MAP@12 for several public notebooks, using the formula published by Kaggle staff in [average_precision.py](https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py).

The evaluated approaches come from these public Kaggle notebooks:
1. julian3833: https://www.kaggle.com/julian3833/h-m-content-based-12-most-popular-items-0-007?scriptVersionId=87501760
2. gpreda: https://www.kaggle.com/gpreda/h-m-eda-and-prediction?scriptVersionId=87584685
3. hengzheng: https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2?scriptVersionId=87521274
4. julian3833: https://www.kaggle.com/julian3833/h-m-collaborative-filtering-user-user?scriptVersionId=88178355
5. cdeotte: https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021?scriptVersionId=88348330

Train-valid split strategy:
* The validation split contains transactions from '2020-09-16' onward; the training split contains everything before that date.

Some Tricks used: 
* Memory reduction by Chris Deotte : https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
* How to calculate MAP@12 by kaerururu : https://www.kaggle.com/kaerunantoka/h-m-how-to-calculate-map-12 
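
For reference, the metric computed by the `apk` and `mapk` functions in this notebook can be written as below. This is a sketch of the usual MAP@12 definition; the symbols $U$, $m_u$, $n_u$, $P_u(k)$ and $\mathrm{rel}_u(k)$ are notation introduced here for illustration and do not come from the source notebooks.

$$
\mathrm{MAP@12} = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{\min(m_u, 12)} \sum_{k=1}^{\min(n_u, 12)} P_u(k)\,\mathrm{rel}_u(k)
$$

Here $U$ is the number of customers with at least one purchase in the validation period, $m_u$ is the length of customer $u$'s ground-truth list, $n_u$ is the number of predictions for that customer, $P_u(k)$ is the precision over the first $k$ predictions, and $\mathrm{rel}_u(k)$ is 1 when the $k$-th prediction is a correct article that has not already been counted, and 0 otherwise.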


# **<a id="Content" style="color:#3b51e3;">Table of Contents</a>**
* [**<span style="color:#3b51e3;">Results</span>**](#Results)  
* [**<span style="color:#3b51e3;">Data Preparation</span>**](#DataPrep)  
* [**<span style="color:#3b51e3;">julian3833</span>**](#julian3833)  
* [**<span style="color:#3b51e3;">gpreda</span>**](#gpreda)  
* [**<span style="color:#3b51e3;">hengzheng</span>**](#hengzheng)  
* [**<span style="color:#3b51e3;">julian3833 CF user-user</span>**](#julian3833UUCF)  
* [**<span style="color:#3b51e3;">cdeotte</span>**](#cdeotte)  





## **<span id="Results" style="color:#3b51e3;">Results</span>**


|Notebook Owner|Notebook Link|Valid Score|Public Score|
|---|---|---|---|
|julian3833|[link](https://www.kaggle.com/julian3833/h-m-content-based-12-most-popular-items-0-007?scriptVersionId=87501760)|0.0067|0.0071|
|gpreda|[link](https://www.kaggle.com/gpreda/h-m-eda-and-prediction?scriptVersionId=87584685)|0.0211|0.0195|
|hengzheng|[link](https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2?scriptVersionId=87521274)|0.0237|0.0204|
|julian3833|[link](https://www.kaggle.com/julian3833/h-m-collaborative-filtering-user-user?scriptVersionId=88178355)|0.0224|0.0193|
|cdeotte|[link](https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021)|0.0245|0.0214|


# **<span id="DataPrep" style="color:#3b51e3;">Data Preparation</span>**

In [None]:
import numpy as np
import pandas as pd 
from datetime import datetime, timedelta
import gc

In [None]:
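# Read the datasets.
# Memory reduction: load article_id as int32. This drops the leading zero of each id,
# which is restored later by prefixing '0' when building prediction strings.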
transactions = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',
                          usecols= ['t_dat', 'customer_id', 'article_id'], dtype={'article_id': 'int32'})

customers = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv',
                        usecols=['customer_id'])

In [None]:
# Memory reduction: parse the dates and keep only the last 16 hex digits of customer_id as int64
transactions['t_dat'] = transactions['t_dat'].map(lambda x: datetime.strptime(x, '%Y-%m-%d'))
transactions['customer_id'] = transactions['customer_id'].apply(lambda x: int(x[-16:],16) ).astype('int64')
customers['customer_id'] = customers['customer_id'].apply(lambda x: int(x[-16:],16) ).astype('int64')

# Splitting
val_start_date = '2020-09-16'
train_df = transactions.query(f"t_dat < '{val_start_date}'").reset_index(drop=True)
valid_df = transactions.query(f"t_dat >= '{val_start_date}'").reset_index(drop=True)

# Sorting
train_df = train_df.sort_values(["customer_id", "t_dat"], ascending=False)
valid_df = valid_df.sort_values(["customer_id", "t_dat"], ascending=False)

_ = gc.collect()
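
The `customer_id` conversion in the cell above keeps only the last 16 hexadecimal digits of each id, which fit in 64 bits, so the column can be stored as integers instead of long strings. The sketch below illustrates the idea on made-up ids. The helper name `hex_tail_to_int64` and both ids are hypothetical, and the explicit wrap into the signed range is an assumption about how large values should be handled, not something the notebook states.

```python
def hex_tail_to_int64(customer_id: str) -> int:
    """Map the last 16 hex digits of an id string to a signed 64-bit integer."""
    value = int(customer_id[-16:], 16)  # unsigned value in [0, 2**64)
    if value >= 2**63:                  # wrap into the signed int64 range
        value -= 2**64
    return value

# Hypothetical 64-character ids, made up for this example.
small_id = "0" * 48 + "00000000000000ff"
large_id = "0" * 48 + "ffffffffffffffff"

print(hex_tail_to_int64(small_id))  # 255
print(hex_tail_to_int64(large_id))  # -1
```

Because only a suffix of the id is kept, distinct customers could in principle collide. The notebook does not check for this, so it is worth verifying if the mapping is reused elsewhere.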

In [None]:
train_df.head()

In [None]:
valid_df.head()

In [None]:
valid_df = valid_df.sort_values(['customer_id', 't_dat'], ascending = [True, True]) 
valid_cust = valid_df.groupby('customer_id')['article_id'].apply(list).reset_index()
valid_cust['valid_true'] = valid_cust['article_id'].map(lambda x: '0'+' 0'.join(str(x)[1:-1].split(', ')))
del valid_df, valid_cust['article_id']
_ = gc.collect()

In [None]:
valid_cust.tail()
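
The cell that builds `valid_true` turns each customer's validation purchases into one space-separated string of zero-prefixed article ids, the same format used for predictions. A minimal sketch of that step on toy data follows. The ids are made up, `to_article_string` is a hypothetical helper, and it uses `zfill(10)` rather than the notebook's `'0' +` prefix, which gives the same result for 9-digit ids.

```python
import pandas as pd

# Toy validation transactions; the ids are made up for this example.
toy_valid = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "article_id": [108775015, 706016001, 108775015],  # integer ids, leading zero lost
})

def to_article_string(article_ids) -> str:
    """Join ids as 10-character, zero-padded strings separated by spaces."""
    return " ".join(str(a).zfill(10) for a in article_ids)

toy_truth = (
    toy_valid.groupby("customer_id")["article_id"]
    .apply(to_article_string)
    .reset_index(name="valid_true")
)
print(toy_truth)
```

Repeated purchases of the same article stay in the string, so the ground-truth list can contain duplicates.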

## Util functions

In [None]:
# Source: https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    # Change from the original source: pairs whose `actual` list is empty are skipped
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted) if a])
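
A small worked example may help when checking the metric by hand. The sketch below restates the `apk` logic under a different, hypothetical name (`average_precision_at_k`) so that it runs on its own. The item labels are made up.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Compact restatement of `apk` so this example is self-contained."""
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    for i, p in enumerate(predicted[:k]):
        # A prediction counts only if it is correct and was not predicted earlier.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)

actual = ["a", "b", "c"]

# Hits at ranks 1 and 3: (1/1 + 2/3) / 3
ap = average_precision_at_k(actual, ["a", "x", "b"])
print(round(ap, 4))  # 0.5556

# A repeated prediction occupies a rank but earns nothing.
ap_dup = average_precision_at_k(actual, ["a", "a", "b"])
print(round(ap_dup, 4))  # 0.5556
```

The second call shows why duplicate predictions are wasteful: the repeated `"a"` takes up rank 2 without adding to the score.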

# **<span id="julian3833" style="color:#3b51e3;">julian3833</span>**

https://www.kaggle.com/julian3833/h-m-content-based-12-most-popular-items-0-007?scriptVersionId=87501760 Version 4

In [None]:
eval_julian3833 = customers.copy() 

top_12_items = train_df[train_df['t_dat'] > '2020-09-01'].groupby('article_id')['customer_id'].nunique().sort_values(ascending=False).head(12).index.tolist()
top_12_items = ['0' + str(item) for item in top_12_items]
eval_julian3833['prediction'] =  ' '.join(top_12_items)
eval_julian3833 = valid_cust.merge(eval_julian3833, on ='customer_id', how ='left')


In [None]:
eval_julian3833.tail()

In [None]:
mapk(
    eval_julian3833['valid_true'].map(lambda x: x.split()), 
    eval_julian3833['prediction'].map(lambda x: x.split()), 
    k=12
)

In [None]:
del eval_julian3833, top_12_items
_ = gc.collect()
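
The popularity baseline above ranks articles by the number of distinct customers who bought them (`nunique`), whereas the gpreda and hengzheng sections rank by the number of transaction rows (`value_counts`). The toy sketch below, with made-up ids, shows how the two rankings can differ when one customer buys the same article many times.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 1, 2, 3, 3, 3, 3],
    "article_id":  [10, 10, 10, 20, 20, 30, 30, 30, 30],
})

# Rank by number of distinct buyers: 10 (3 buyers), 20 (2), 30 (1).
by_customers = (
    toy.groupby("article_id")["customer_id"]
    .nunique()
    .sort_values(ascending=False)
    .index.tolist()
)

# Rank by number of rows: 30 (4 rows), 10 (3), 20 (2).
by_rows = toy["article_id"].value_counts().index.tolist()

print(by_customers)  # [10, 20, 30]
print(by_rows)       # [30, 10, 20]
```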

# **<span id="gpreda" style="color:#3b51e3;">gpreda</span>**

https://www.kaggle.com/gpreda/h-m-eda-and-prediction?scriptVersionId=87584685 version 12

In [None]:
eval_gpreda = customers.copy() 

most_frequent_articles = list(train_df.loc[train_df.t_dat==train_df.t_dat.max()].article_id.value_counts()[0:12].index)
art_list = []
for art in most_frequent_articles:
    art = "0"+str(art)
    art_list.append(art)
art_str = " ".join(art_list)

def padding_articles(x):
    if x:
        xl = x.split()
        x = []
        for xi in xl:
            x.append("0"+xi)
        dimm_x = len(x)
        if dimm_x < 12:
            x.extend(art_list[:12-dimm_x])
        return(" ".join(x))
    

eval_gpreda = train_df.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()
eval_gpreda["prediction"] = eval_gpreda["article_id"].apply(lambda x: padding_articles(x))
eval_gpreda = valid_cust.merge(eval_gpreda, on ='customer_id', how ='left')

eval_gpreda['prediction'] = eval_gpreda['prediction'].astype(str)

In [None]:
eval_gpreda.tail()

In [None]:
mapk(
    eval_gpreda['valid_true'].map(lambda x: x.split()), 
    eval_gpreda['prediction'].map(lambda x: x.split()), 
    k=12
)

In [None]:
del eval_gpreda, most_frequent_articles, art_list
_ = gc.collect()

# **<span id="hengzheng" style="color:#3b51e3;">hengzheng</span>**

https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2?scriptVersionId=87521274 version 5

In [None]:
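# Copy so that adding the 'prediction' column below does not write to a slice of `customers`
eval_hengzheng = customers[['customer_id']].copy()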

transactions_3w = train_df[train_df['t_dat'] >= pd.to_datetime('2020-08-24')].copy()
transactions_2w = train_df[train_df['t_dat'] >= pd.to_datetime('2020-08-31')].copy()
transactions_1w = train_df[train_df['t_dat'] >= pd.to_datetime('2020-09-07')].copy()

purchase_dict_3w = {}

for i,x in enumerate(zip(transactions_3w['customer_id'], transactions_3w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_3w:
        purchase_dict_3w[cust_id] = {}
    
    if art_id not in purchase_dict_3w[cust_id]:
        purchase_dict_3w[cust_id][art_id] = 0
    
    purchase_dict_3w[cust_id][art_id] += 1
    
dummy_list_3w = list((transactions_3w['article_id'].value_counts()).index)[:12]
dummy_list_3w = ['0' + str(item) for item in dummy_list_3w]


purchase_dict_2w = {}

for i,x in enumerate(zip(transactions_2w['customer_id'], transactions_2w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_2w:
        purchase_dict_2w[cust_id] = {}
    
    if art_id not in purchase_dict_2w[cust_id]:
        purchase_dict_2w[cust_id][art_id] = 0
    
    purchase_dict_2w[cust_id][art_id] += 1
    

dummy_list_2w = list((transactions_2w['article_id'].value_counts()).index)[:12]
dummy_list_2w = ['0' + str(item) for item in dummy_list_2w]


purchase_dict_1w = {}

for i,x in enumerate(zip(transactions_1w['customer_id'], transactions_1w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_1w:
        purchase_dict_1w[cust_id] = {}
    
    if art_id not in purchase_dict_1w[cust_id]:
        purchase_dict_1w[cust_id][art_id] = 0
    
    purchase_dict_1w[cust_id][art_id] += 1
    
dummy_list_1w = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_list_1w = ['0' + str(item) for item in dummy_list_1w]


prediction_list = []

dummy_list = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_list = ['0' + str(item) for item in dummy_list]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(eval_hengzheng['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_1w:
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = ['0' + str(y[0]) for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_1w[:(12-len(l))])
    elif cust_id in purchase_dict_2w:
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = ['0' + str(y[0]) for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_2w[:(12-len(l))])
    elif cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = ['0' + str(y[0]) for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_3w[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

eval_hengzheng['prediction'] = prediction_list
eval_hengzheng = valid_cust.merge(eval_hengzheng, on ='customer_id', how ='left')
eval_hengzheng['prediction'] = eval_hengzheng['prediction'].astype(str)

In [None]:
eval_hengzheng.tail()

In [None]:
mapk(
    eval_hengzheng['valid_true'].map(lambda x: x.split()), 
    eval_hengzheng['prediction'].map(lambda x: x.split()), 
    k=12
)

In [None]:
del prediction_list, transactions_3w, transactions_2w, transactions_1w
del purchase_dict_3w, purchase_dict_2w, purchase_dict_1w, dummy_list_3w, dummy_list_2w,dummy_list_1w
del dummy_list, dummy_pred

_ = gc.collect()
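
The three window loops above (1, 2 and 3 weeks) share the same structure: count purchases per customer, sort by count, then pad with a popular-items fallback. A minimal sketch of that pattern as reusable helpers is shown below. `count_purchases` and `recommend` are hypothetical names, the data is made up, and this is one possible refactoring rather than the original author's code.

```python
from collections import Counter, defaultdict

def count_purchases(customer_ids, article_ids):
    """Return {customer_id: Counter({article_id: n_purchases})}."""
    counts = defaultdict(Counter)
    for cust_id, art_id in zip(customer_ids, article_ids):
        counts[cust_id][art_id] += 1
    return counts

def recommend(cust_id, counts, fallback, k=12):
    """Customer's articles by purchase count, padded with `fallback` up to k items."""
    own = [art for art, _ in counts[cust_id].most_common()] if cust_id in counts else []
    return (own + fallback[: max(k - len(own), 0)])[:k]

toy_counts = count_purchases([1, 1, 1, 2], [10, 20, 20, 30])

print(recommend(1, toy_counts, fallback=[30, 40, 50], k=3))  # [20, 10, 30]
print(recommend(9, toy_counts, fallback=[30, 40, 50], k=3))  # [30, 40, 50]
```

As in the cell above, padding does not remove items that the customer already has in their own list, so a recommendation may repeat an article.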

# **<span id="julian3833UUCF" style="color:#3b51e3;">julian3833 CF user-user</span>**

https://www.kaggle.com/julian3833/h-m-collaborative-filtering-user-user?scriptVersionId=88178355 version 19 

In [None]:
import time
import multiprocessing as mp
from multiprocessing import Pool
from functools import partial

dfu = customers.copy()
dfi = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv',
                        usecols=['article_id'], dtype={'article_id': 'int32'})

ALL_USERS = dfu['customer_id'].unique().tolist()
ALL_ITEMS = dfi['article_id'].unique().tolist()

user_to_customer_map = {user_id: customer_id for user_id, customer_id in enumerate(ALL_USERS)}
customer_to_user_map = {customer_id: user_id for user_id, customer_id in enumerate(ALL_USERS)}

item_to_article_map = {item_id: article_id for item_id, article_id in enumerate(ALL_ITEMS)}
article_to_item_map = {article_id: item_id for item_id, article_id in enumerate(ALL_ITEMS)}

del dfu, dfi

df = train_df.copy() 

df['user_id'] = df['customer_id'].map(customer_to_user_map)
df['item_id'] = df['article_id'].map(article_to_item_map)

#Configuration parameters
N_SIMILAR_USERS = 20
MINIMUM_PURCHASES = 10
START_DATE = '2020-08-31' # '2020-09-07'
DROP_PURCHASED_ITEMS = False
DROP_USER_FROM_HIS_NEIGHBORHOOD = False
TEST_RUN = False
TEST_SIZE = 1000

def flatten(l):
    """ Flatten a list of lists"""
    return [item for sublist in l for item in sublist]

def compare_vectors(v1, v2):
    """Compare lists of purchased product for two given users
    v1 stands for the "vector representation for user 1", which is a list of the purchases of u1
    
    Returns:
        A value between 0 and 1 (similarity)
    """
    intersection = len(set(v1) & set(v2))
    denominator = np.sqrt(len(v1) * len(v2))
    return intersection / denominator

def get_similar_users(u, v, dfh):
    """
    Get the N_SIMILAR_USERS most similar users to the given one with their similarity score
    Arguments:
        u: the user_id, 
        v:  the "vector" representation of the user (list of item_id)
        dfh : the "history of transactions" dataframe
        
    Returns:
        tuple of lists ([similar user_id], [similarity scores])
    """
    similar_users = dfh.apply(lambda v_other: compare_vectors(v, v_other)).sort_values(ascending=False).head(N_SIMILAR_USERS + 1)
    
    if DROP_USER_FROM_HIS_NEIGHBORHOOD:
        similar_users = similar_users[similar_users.index != u]
        
    return similar_users.index.tolist(), similar_users.tolist()

def get_items(u, v, dfh):
    """ Get the recommended items for a given user
    
    It will:
        1) Get similar users for the given user
        2) Obtain all the items those users purchased
        3) Rank them using the similarity scores of the user that purchased them
        4) Return the 12 best ranked
    
    Arguments:
        u: the user_id, 
        v:  the "vector" representation of the user (list of item_id)
        dfh : the "history of transactions" dataframe
        
    Returns:
        list of item_id of length at most 12
    """
    global i, n
    
    users, scores = get_similar_users(u, v, dfh)
    df_nn = pd.DataFrame({'user': users, 'score': scores})
    df_nn['items'] = df_nn.apply(lambda row: dfh.loc[row.user], axis=1)
    df_nn['weighted_items'] = df_nn.apply(lambda row: [(item, row.score) for item in row['items']], axis=1)

    recs = pd.DataFrame(flatten(df_nn['weighted_items'].tolist()), columns=['item', 'score']).groupby('item')['score'].sum().sort_values(ascending=False)
    if DROP_PURCHASED_ITEMS:
        recs = recs[~recs.index.isin(v)]
    # Keep the first 12 and get the item_ids
    i +=1
    if i % 200 == 0:
        pid = mp.current_process().pid
        print(f"[PID {pid:>2d}] Finished {i:3d} / {n:5d} - {i/n*100:3.0f}%")
    return recs.head(12).index.tolist()

def get_items_chunk(user_ids: np.array, dfh: pd.DataFrame):
    """ Call get_items for a list of user_ids
    
    Arguments:
        user_ids: list of user_id, 
        dfh: the "history of transactions" dataframe
        
    Returns:
        pd.Series with index user_id and list of item_id (recommendations) as value
    """
    global i, n
    i = 0
    
    n = len(user_ids)
    pid = mp.current_process().pid
    print(f"[PID {pid:>2d}] Started working with {n:5d} users")
    
    df_user_vectors = pd.DataFrame(dfh.loc[user_ids]).reset_index()
    df_user_vectors['recs'] = df_user_vectors.apply(lambda row: get_items(row.user_id, row.item_id, dfh), axis=1)
    return df_user_vectors.set_index('user_id')['recs']

def get_recommendations(users: list, dfh: pd.DataFrame):
    """
    Obtain recommendations for the users from the transaction history dfh, in parallel
    
    Call get_items_chunk in a "smart" multiprocessing fashion
    
    Arguments:
        users: list of user_id
        dfh: the "history of transactions" dataframe
    
    Returns:
        pd.DataFrame with index user_id and list of item_id (recommendations) as value
    
    """
    time_start = time.time()
    
    # Split into approximately evenly sized chunks
    # We will send just one batch to each CPU 
    user_chunks = np.array_split(users, mp.cpu_count())
    
    f = partial(get_items_chunk, dfh=dfh)
    with Pool(mp.cpu_count()) as p:
        res = p.map(f, user_chunks)
    
    df_rec = pd.DataFrame(pd.concat(res))

    elapsed = (time.time() - time_start) / 60
    print(f"Finished get_recommendations({len(users)}). It took {elapsed:5.2f} mins")
    return df_rec


def uucf(df, start_date=START_DATE):
    """ Entry point for the UUCF model. 
    
    Receive the original transactions_train.csv and a start_date and gets UUCF recommendations
    
    The model will not cover the full list of users, but just a subset of them.
    
    It will provide recommendations for users with at least MINIMUM_PURCHASES after start_date.
    It might return less than 12 recs per user.
    
    An ad-hoc function for filling these gaps should be used downstream.
    (See fill functionality right below)
    
    
    Arguments:
        df: The raw dataframe from transactions_train.csv
        start_date: a date
        
    Returns:
        a submission-like pd.DataFrame with columns [customer_id, prediction]
        'prediction' is a list and not a string though
    
    """
    df_small = df[df['t_dat'] > start_date]
    print(f"Kept data from {start_date} on. Total rows: {len(df_small)}")
    
    # H stands for "Transaction history"
    # dfh is a series of user_id => list of unique item_id (built from a set, so purchase order is not preserved)
    dfh = df_small.groupby("user_id")['item_id'].apply(lambda items: list(set(items)))
    dfh = dfh[dfh.str.len() >= MINIMUM_PURCHASES]
    if TEST_RUN:
        print("WARNING: TEST_RUN is True. It will be a toy execution.")
        dfh = dfh.head(TEST_SIZE)
    
    users = dfh.index.tolist()
    n_users = len(users)
    print(f"Total users in the time frame with at least {MINIMUM_PURCHASES} purchases: {n_users}")
    
    df_rec = get_recommendations(users, dfh)
    df_rec['customer_id'] = df_rec.index.map(user_to_customer_map)
    df_rec['prediction'] = df_rec['recs'].map(lambda l: [item_to_article_map[i] for i in l])
    
    # df_rec is returned with all of its columns ('recs' included); fill() drops the extras downstream
    return df_rec 

df_recs = uucf(df)

df_fill = eval_hengzheng.copy()[['customer_id', 'prediction']]

def drop_duplicates(seq):
    """ Remove duplicates of a given sequence keeping order"""
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def format_articles(article_ids):
    """ Format integer article ids as '0'-prefixed strings, matching `valid_true`"""
    return ['0' + str(article_id) for article_id in article_ids]

def fill_row(row):
    # prediction_uucf holds integer article ids; prediction_fill is a space-separated string
    uucf = format_articles(row['prediction_uucf'])
    fill = row['prediction_fill'].split()
    new_list = drop_duplicates(uucf + fill)[:12]
    return ' '.join(new_list)


def fill(df_recs, df_fill):
    df_recs['len'] = df_recs['prediction'].str.len()
    df_recs = pd.merge(df_fill, df_recs, how='left', on='customer_id', suffixes=('_fill', '_uucf'))
    
    
    # No recs from UUCF at all: use the fallback model 
    df_recs.loc[df_recs['prediction_uucf'].isnull(), 'prediction'] = df_recs['prediction_fill']


    # Full UUCF recommendation, joined into the same string format as the fallback
    mask = df_recs['prediction_uucf'].notnull() & (df_recs['len'] == 12)
    df_recs.loc[mask, 'prediction'] = df_recs.loc[mask, 'prediction_uucf'].map(lambda l: ' '.join(format_articles(l)))


    # Fill with another model. Not enough recs from UUCF
    fill_mask = df_recs['prediction_uucf'].notnull() & (df_recs['len'] < 12)
    df_recs.loc[fill_mask, 'prediction'] = df_recs[fill_mask].apply(fill_row, axis=1)
    return df_recs.drop(['prediction_uucf', 'prediction_fill', 'len', 'recs'], axis=1)

eval_julian3833uucf = fill(df_recs, df_fill)
eval_julian3833uucf = valid_cust.merge(eval_julian3833uucf, on ='customer_id', how ='left')
eval_julian3833uucf['prediction'] = eval_julian3833uucf['prediction'].astype(str)
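
The core of the user-user approach above is a similarity between two purchase lists, followed by a similarity-weighted vote over the neighbours' items. The sketch below illustrates both steps on toy data. `similarity` mirrors `compare_vectors`, `rank_items` is a hypothetical simplification of `get_items` that uses every neighbour instead of only the `N_SIMILAR_USERS` closest ones, and the users and items are made up.

```python
import math
from collections import defaultdict

def similarity(v1, v2):
    """Shared items divided by the geometric mean of the two list lengths."""
    return len(set(v1) & set(v2)) / math.sqrt(len(v1) * len(v2))

def rank_items(target_items, neighbours, k=12):
    """Sum each neighbour's similarity over the items they bought; return the top k items."""
    scores = defaultdict(float)
    for items in neighbours.values():
        s = similarity(target_items, items)
        for item in items:
            scores[item] += s
    return sorted(scores, key=scores.get, reverse=True)[:k]

target = [1, 2, 3]
neighbours = {"u1": [1, 2, 3], "u2": [2, 3, 4, 5], "u3": [3]}

# Two shared items, list lengths 3 and 4: 2 / sqrt(12)
print(round(similarity(target, neighbours["u2"]), 4))  # 0.5774

# Item 3 is bought by all three neighbours, item 2 by two of them.
print(rank_items(target, neighbours, k=2))  # [3, 2]
```

When both lists contain unique items, as they do in `dfh`, this score equals the cosine similarity of the corresponding binary purchase vectors.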

In [None]:
eval_julian3833uucf.tail() 

In [None]:
mapk(
    eval_julian3833uucf['valid_true'].map(lambda x: x.split()), 
    eval_julian3833uucf['prediction'].map(lambda x: x.split()), 
    k=12
)

In [None]:
del eval_julian3833uucf, df_recs, df_fill

_ = gc.collect()

# **<span id="cdeotte" style="color:#3b51e3;">cdeotte</span>**

https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021

Note: I recalculated the item pairs to avoid data leakage. The notebook that generates them is https://www.kaggle.com/hervind/h-m-generate-item-pairs

In [None]:
import cudf
print('RAPIDS version',cudf.__version__)

In [None]:
train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
train['customer_id'] = train['customer_id'].str[-16:].str.hex_to_int().astype('int64')
train['article_id'] = train.article_id.astype('int32')

val_start_date = '2020-09-16'
train = train.loc[train['t_dat'] < val_start_date].reset_index(drop=True)

train.t_dat = cudf.to_datetime(train.t_dat)
train = train[['t_dat','customer_id','article_id']]
train.to_parquet('train.pqt',index=False)

tmp = train.groupby('customer_id').t_dat.max().reset_index()
tmp.columns = ['customer_id','max_dat']
train = train.merge(tmp,on=['customer_id'],how='left')
train['diff_dat'] = (train.max_dat - train.t_dat).dt.days
train = train.loc[train['diff_dat']<=6]

tmp = train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()
tmp.columns = ['customer_id','article_id','ct']
train = train.merge(tmp,on=['customer_id','article_id'],how='left')
train = train.sort_values(['ct','t_dat'],ascending=False)
train = train.drop_duplicates(['customer_id','article_id'])
train = train.sort_values(['ct','t_dat'],ascending=False)



In [None]:
train = train.to_pandas()
pairs = np.load('../input/h-m-generate-item-pairs/item_pair_3_months.npy',allow_pickle=True).item()
train['article_id2'] = train.article_id.map(pairs)
train2 = train[['customer_id','article_id2']].copy()
train2 = train2.loc[train2.article_id2.notnull()]
train2 = train2.drop_duplicates(['customer_id','article_id2'])
train2 = train2.rename({'article_id2':'article_id'},axis=1)

train = train[['customer_id','article_id']]
train = pd.concat([train,train2],axis=0,ignore_index=True)
train.article_id = train.article_id.astype('int32')
train = train.drop_duplicates(['customer_id','article_id'])

train.article_id = ' 0' + train.article_id.astype('str')
preds = cudf.DataFrame( train.groupby('customer_id').article_id.sum().reset_index() )
preds.columns = ['customer_id','prediction']
preds.head()

In [None]:
train = cudf.read_parquet('train.pqt')
train.t_dat = cudf.to_datetime(train.t_dat)
# train = train.loc[train.t_dat >= cudf.to_datetime('2020-09-16')]
train = train.loc[train.t_dat >= cudf.to_datetime('2020-09-09')]
top12 = ' 0' + ' 0'.join(train.article_id.value_counts().to_pandas().index.astype('str')[:12])
print("Last week's top 12 popular items:")
print( top12 )

In [None]:
sub = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
sub = sub[['customer_id']]
sub['customer_id_2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')
sub = sub.merge(preds.rename({'customer_id':'customer_id_2'},axis=1),\
    on='customer_id_2', how='left').fillna('')
# del sub['customer_id_2']
sub.prediction = sub.prediction + top12
sub.prediction = sub.prediction.str.strip()
sub.prediction = sub.prediction.str[:131]

sub = sub.to_pandas()
eval_cdeotte = valid_cust.merge(sub[['customer_id_2', 'prediction']], left_on ='customer_id', right_on = 'customer_id_2', how ='left')
eval_cdeotte['prediction'] = eval_cdeotte['prediction'].astype(str)
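
The slice `str[:131]` in the cell above keeps exactly 12 predictions because each formatted article id is 10 characters long: 12 ids take 120 characters and the 11 separating spaces bring the total to 131. The sketch below checks that arithmetic on made-up ids. It assumes the string has no leading space, which is what the preceding `str.strip()` ensures.

```python
# 14 hypothetical 10-character ids, made up for this example.
ids = [f"{i:010d}" for i in range(1, 15)]
joined = " ".join(ids)

truncated = joined[:131]  # 12 * 10 characters + 11 spaces
kept = truncated.split()

print(len(kept), len(truncated))  # 12 131
```

The slice only limits the length. It does not remove an article that appears both in the customer's own list and in the appended top-12 list.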

In [None]:
eval_cdeotte.tail()

In [None]:
mapk(
    eval_cdeotte['valid_true'].map(lambda x: x.split()), 
    eval_cdeotte['prediction'].map(lambda x: x.split()), 
    k=12
)

# Thank you for reading so far
## Please Upvote if you find this notebook helpful 
## Please also upvote the source notebooks


# Sources
* https://www.kaggle.com/kaerunantoka/h-m-how-to-calculate-map-12 
* https://www.kaggle.com/julian3833/h-m-content-based-12-most-popular-items-0-007
* https://www.kaggle.com/gpreda/h-m-eda-and-prediction
* https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2
* https://www.kaggle.com/julian3833/h-m-collaborative-filtering-user-user
* https://www.kaggle.com/cdeotte/recommend-items-purchased-together-0-021