# Intro

This notebook is based on Chris Deotte's notebook [Recommend Items Purchased Together], which scores 0.0214.

I made one change in Part 2: I use "Purchased After" instead of "Purchased Together". Let's see whether the score improves.

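With the original step 2: 0.0214  <br>
Without step 2: 0.0212  <br>
With the new step 2 (before optimization, all paired recommendations forced to miss): 0.0205  <br>
With the new step 2 (before optimization): 0.0211  <br>
With the new step 2 (after optimization, all paired recommendations forced to miss): 0.0205  <br>
With the new step 2 (after optimization): 0.0217  <br>

Optimization: use relative dates instead of absolute dates, compute the purchase count and average purchase date first, and raise the threshold for making a recommendation.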
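
As a rough illustration of the relative-date idea, the sketch below computes a purchase count and an average purchase day per (article, customer) pair on made-up data. It simplifies by using integer day offsets, whereas the notebook itself works with timedeltas; the column names `rel_days` and `avg_day` are hypothetical.

```python
import pandas as pd

# Toy transactions; the values are made up for illustration only.
df = pd.DataFrame({
    "t_dat": ["2018-09-21", "2018-09-25", "2018-09-22"],
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 1, 1],
})

# Relative date: days elapsed since a reference date.
dat_ref = pd.Timestamp("2018-09-20")
df["rel_days"] = (pd.to_datetime(df["t_dat"]) - dat_ref).dt.days

# Purchase count and average purchase day per (article, customer) pair.
agg = (
    df.groupby(["article_id", "customer_id"])["rel_days"]
    .agg(solo_count="count", avg_day="mean")
    .reset_index()
)
print(agg)
```

Working with offsets from a reference date makes it possible to sum and average dates, which is not meaningful for absolute timestamps.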

# RAPIDS cuDF
We will use RAPIDS cuDF for fast dataframe operations.

In [None]:
import cudf
print('RAPIDS version',cudf.__version__)

# Load Transactions, Reduce Memory
Discussion about reducing memory is [here][1]

[1]: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635

In [None]:
train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
train['customer_id'] = train['customer_id'].str[-16:].str.hex_to_int().astype('int64')
train['article_id'] = train.article_id.astype('int64')
train.t_dat = cudf.to_datetime(train.t_dat)
train = train[['t_dat','customer_id','article_id']]
train.to_parquet('train.pqt',index=False)
print( train.shape )
train.head()

# Find Each Customer's Last Week of Purchases
Our final predictions will follow the row order of our dataframe. Each row of our dataframe will be a prediction. We will create the prediction string later with `train.groupby('customer_id').article_id.sum()`. Once `article_id` has been converted to a string, the groupby sum concatenates all of a customer's predictions into a single string, in the order of the dataframe rows. So as we proceed in this notebook, we will order the dataframe the way we want our predictions ordered.
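
A minimal sketch of that concatenation behavior, using plain pandas and made-up article strings (the notebook applies the same idea to the real `train` dataframe):

```python
import pandas as pd

# Made-up predictions; each article string already carries its leading space.
preds = pd.DataFrame({
    "customer_id": [1, 2, 1, 2],
    "article_id": [" 0111", " 0222", " 0333", " 0444"],
})

# Summing a string column within each group concatenates the strings,
# keeping the row order of the dataframe inside each group.
joined = preds.groupby("customer_id")["article_id"].sum()
print(joined)
```

Because the row order is preserved within each customer, sorting the dataframe before this step decides the order of the recommendations.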

In [None]:
tmp = train.groupby('customer_id').t_dat.max().reset_index()
tmp.columns = ['customer_id','max_dat']
train = train.merge(tmp,on=['customer_id'],how='left')
train['diff_dat'] = (train.max_dat - train.t_dat).dt.days
train = train.loc[train['diff_dat']<=6]
print('Train shape:',train.shape)

# (1) Recommend Most Often Previously Purchased Items
Note that many operations in cuDF shuffle the order of the dataframe rows. We therefore need to sort afterward, because the row order will be the order of our predictions and we want the most often previously purchased items first. Since we sort by `ct` and then `t_dat`, we recommend items that were purchased more frequently first, and among equally frequent items, those purchased more recently first.
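
The ordering can be checked on a toy example. This is a sketch in plain pandas with made-up rows, assuming the same column names as the notebook:

```python
import pandas as pd

# One customer, four made-up purchases; article 10 is bought twice.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "article_id": [10, 20, 10, 30],
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]
    ),
})

# Count purchases per (customer, article) and attach the count to every row.
ct = (
    df.groupby(["customer_id", "article_id"])["t_dat"]
    .count()
    .reset_index(name="ct")
)
df = df.merge(ct, on=["customer_id", "article_id"], how="left")

# Most frequent first, then most recent; keep one row per (customer, article).
df = df.sort_values(["ct", "t_dat"], ascending=False)
df = df.drop_duplicates(["customer_id", "article_id"])
ranked = df["article_id"].tolist()
print(ranked)
```

Article 10 comes first because it was bought twice; articles 30 and 20 follow in order of recency.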

In [None]:
tmp = train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()
tmp.columns = ['customer_id','article_id','ct']
train = train.merge(tmp,on=['customer_id','article_id'],how='left')
train = train.sort_values(['ct','t_dat'],ascending=False)
train = train.drop_duplicates(['customer_id','article_id'])
train = train.sort_values(['ct','t_dat'],ascending=False)
train.head()

# (2) Recommend Items Purchased After
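
The idea of this step, sketched on made-up data: join each customer's purchases with themselves, keep the pairs where the second article was bought later than the first, and count how many customers show each pair. The sketch is an assumption-laden simplification: it uses integer day offsets in a hypothetical `avg_day` column and omits the minimum thresholds that the cells below apply.

```python
import pandas as pd

# avg_day: average purchase day per (customer, article), relative to a reference date.
df = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c"],
    "article_id": [1, 2, 1, 2, 2],
    "avg_day": [1, 5, 3, 4, 7],
})

# Self-join on customer: every pair of articles bought by the same customer.
pairs = df.merge(df, on="customer_id", how="inner")

# Keep only pairs where article y was bought after article x.
after = pairs.loc[pairs["avg_day_y"] > pairs["avg_day_x"]]

# Count how many customers bought y after x.
counts = (
    after.groupby(["article_id_x", "article_id_y"])
    .size()
    .reset_index(name="article_count")
)

# For each article x, keep the follow-up article with the highest count.
best = counts.sort_values("article_count", ascending=False).drop_duplicates("article_id_x")
lookup = dict(zip(best["article_id_x"], best["article_id_y"]))
print(lookup)
```

On the full dataset the self-join is very large, which is why the cells below process the transactions in windows of a few months.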

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime
from math import sqrt
from pathlib import Path
from tqdm import tqdm
tqdm.pandas()

data_path = Path('../input/h-and-m-personalized-fashion-recommendations/')
data_output_path = Path('/kaggle/working/')
N = 12

print(datetime.now())

In [None]:
#Read transactions
#t_dat range: '2018-09-20' to '2020-09-22'
df_tran = pd.read_csv(data_path / 'transactions_train.csv',
                 usecols = ['t_dat', 'customer_id', 'article_id'],
                 dtype={'article_id': int})
print(df_tran['t_dat'].min(),'  ',df_tran['t_dat'].max())
print(datetime.now(),'  ',df_tran.shape)
display(df_tran.head())

In [None]:
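#Aggregate first to get the average purchase date and the purchase count, so that we can later tell whether a pairing only shows up because one side has a very large number of transactions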
dat_ref = "2018-09-20"
df_tran['t_dat'] = pd.to_datetime(df_tran['t_dat'])-datetime.strptime(dat_ref, "%Y-%m-%d")
df_tran['solo_count'] = 1
df_tran = pd.DataFrame(df_tran.groupby(['article_id','customer_id'])
                       .agg({'t_dat': 'sum', 'solo_count': 'sum'}))
df_tran['t_dat'] = df_tran['t_dat']/df_tran['solo_count']
df_tran = df_tran.reset_index()
print(datetime.now(),'  ',df_tran.shape)
display(df_tran.head())

In [None]:
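#Recommend items that are always purchased after
#Each pass only considers articles bought together within a window of about 3 months, keeps the buy-after associations that occur often enough (see the thresholds below), and takes about 13 minutes per year of data
#16 GB memory limit: the half year from 2020-03-22 to 2020-09-22 has 7365109 records, and the inner self-join on customer_id yields 189641831 rows; further operations after the join also run out of memory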
com_year = [('2018-09-22','2018-12-22'),('2018-12-22','2019-03-22')
            ,('2019-03-22','2019-05-22'),('2019-05-22','2019-07-22'),('2019-07-22','2019-09-22')
            ,('2019-09-22','2019-12-22'),('2019-12-22','2020-03-22')
            ,('2020-03-22','2020-06-22'),('2020-06-22','2020-09-22')]
for dates in com_year:
    dat_start = datetime.strptime(dates[0], "%Y-%m-%d")-datetime.strptime(dat_ref, "%Y-%m-%d")
    dat_end = datetime.strptime(dates[1], "%Y-%m-%d")-datetime.strptime(dat_ref, "%Y-%m-%d")
    print('Processing',dat_start,'to',dat_end)
    #Filter to the current window (about 3 months) and build the buy-together pairs
    print('step1:',datetime.now())
    df_ac = df_tran.copy()
    df_ac = df_ac.loc[(df_ac['t_dat']>dat_start)&(df_ac['t_dat']<dat_end)]
    df_ca = df_ac
    df_aa = pd.merge(df_ac,df_ca,on = ["customer_id"],how="inner")
    #display(df_aa.head())
    
    #Keep only the buy-after pairs
    print('step2:',datetime.now(),'  ',df_ac.shape,df_aa.shape) #(3981458, 3) (74325888, 5)
    df_aa = df_aa.drop(columns=['customer_id']).loc[df_aa['t_dat_y']>df_aa['t_dat_x']]
    #display(df_aa.head())
    
    #Prepare the measures to aggregate
    print('step3:',datetime.now(),'  ',df_aa.shape)
    df_aa['t_dat_diff'] = df_aa['t_dat_y']-df_aa['t_dat_x']
    df_aa['article_count'] = 1
    df_aa = df_aa.drop(columns=['t_dat_y','t_dat_x'])
    #display(df_aa.head())
    
    #Aggregate how many times each pair was bought and how long after on average (about 1 minute + 3 minutes)
    print('step4:',datetime.now(),'  ',df_aa.shape)
    df_aa = pd.DataFrame(df_aa.groupby(['article_id_x','article_id_y'])
                         .agg({'t_dat_diff': 'sum', 'article_count': 'sum'
                               , 'solo_count_x': 'sum', 'solo_count_y': 'sum'}))
    #Set the minimum thresholds for a recommendation
    df_aa = df_aa.loc[(df_aa['article_count']>10)
                      &(df_aa['article_count']/df_aa['solo_count_x']>0.2)
                      &(df_aa['article_count']/df_aa['solo_count_y']>0.2)]
    #Average delay in whole days; pd.isna is a more robust missing-value check than `is np.nan`
    df_aa['avg_dat_diff'] = df_aa.apply(
        lambda x: 0 if pd.isna(x['t_dat_diff']) else int(x['t_dat_diff'].days/x['article_count'])
        ,axis=1)
    df_aa = df_aa.drop(columns=['t_dat_diff'])
    #display(df_aa.head())
    
    print('step5:',datetime.now(),'  ',df_aa.shape)
    file_tmp = "df_aa"+dates[0]+"_"+dates[1]+".csv"
    df_aa.to_csv(file_tmp, index=True)
    print('-----------------')


In [None]:
#Read the per-window files and concatenate them
features = ['article_id_x', 'article_id_y','article_count']
df_f = pd.DataFrame(columns=features)
for dates in com_year:
    dat_start = dates[0]
    dat_end = dates[1]
    file_tmp = "df_aa"+dat_start+"_"+dat_end+".csv"
    df_temp = pd.read_csv(data_output_path / file_tmp,
                       usecols = features,
                       dtype={'article_id_x': int,'article_id_y': int})
    print('df_temp:',df_temp.shape)
    #Take the pairing with the highest co-purchase count
    #df_temp = df_temp.groupby('article_id_x').apply(lambda t: t[t.article_count==t.article_count.max()])
    #print('df_temp:',df_temp.shape)
    if(len(df_f)<1):
        df_f = df_temp.copy()
    else:
        df_f = pd.concat([df_f,df_temp],ignore_index=True)
print('df_f:',df_f.shape)
display(df_f.head())

In [None]:
#Take the pairing with the highest co-purchase count (about 18 minutes)
df_f = df_f.groupby('article_id_x').apply(lambda t: t[t.article_count==t.article_count.max()])
df_f.drop_duplicates(subset=['article_id_x'], keep='first', inplace=True)
df_f = df_f.drop(columns=['article_count'])
print(datetime.now(),'   ',df_f.shape)
display(df_f.head())

In [None]:
df_f.info()

In [None]:
#Convert the pandas DataFrame to a dict
pairs = {}
for i in tqdm(df_f.index):
    article_id_x = df_f.at[i, 'article_id_x']
    article_id_y = df_f.at[i, 'article_id_y']
    #article_id_y = 0   #test: force every paired recommendation to miss
    pairs[article_id_x] = article_id_y
print(type(pairs))
#display(pairs)

The logic below is the same as in [Recommend Items Purchased Together].

In [None]:
# USE PANDAS TO MAP COLUMN WITH DICTIONARY
import pandas as pd, numpy as np
train = train.to_pandas()
train['article_id2'] = train.article_id.map(pairs)

# RECOMMENDATION OF PAIRED ITEMS
train2 = train[['customer_id','article_id2']].copy()
train2 = train2.loc[train2.article_id2.notnull()]
train2 = train2.drop_duplicates(['customer_id','article_id2'])
train2 = train2.rename({'article_id2':'article_id'},axis=1)

# CONCATENATE PAIRED ITEM RECOMMENDATION AFTER PREVIOUS PURCHASED RECOMMENDATIONS
train = train[['customer_id','article_id']]
train = pd.concat([train,train2],axis=0,ignore_index=True)
train.article_id = train.article_id.astype('int64')
train = train.drop_duplicates(['customer_id','article_id'])

# CONVERT RECOMMENDATIONS INTO SINGLE STRING
train.article_id = ' 0' + train.article_id.astype('str')
preds = cudf.DataFrame( train.groupby('customer_id').article_id.sum().reset_index() )
preds.columns = ['customer_id','prediction']
preds.head()

# (3) Recommend Last Week's Most Popular Items
After recommending previous purchases and paired items, we recommend last week's 12 most popular items. If the earlier recommendations did not fill a customer's 12 slots, the remaining slots are filled by popular items.
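
A small sketch of how the fill-up works at the string level, assuming every article id is exactly 10 characters long once the leading `0` is restored. The ids below are made up; the notebook builds the real `top12` string from last week's transactions.

```python
# Two made-up personal recommendations, each prefixed with a space.
personal = " 0111111111 0222222222"

# Twelve made-up "popular" ids in the same format (hypothetical values).
top12 = "".join(f" 0{900000000 + i}" for i in range(12))

# Personal items first, popular items after; then cut to 12 ids.
# 12 ids * 10 characters + 11 separating spaces = 131 characters.
prediction = (personal + top12).strip()[:131]
ids = prediction.split()
print(len(prediction), len(ids))
```

The truncation keeps the two personal items and the first ten popular ones. Note that this string-based approach does not remove a popular item that already appears among the personal recommendations.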

In [None]:
train = cudf.read_parquet('train.pqt')
train.t_dat = cudf.to_datetime(train.t_dat)
train = train.loc[train.t_dat >= cudf.to_datetime('2020-09-16')]
top12 = ' 0' + ' 0'.join(train.article_id.value_counts().to_pandas().index.astype('str')[:12])
print("Last week's top 12 popular items:")
print( top12 )

# Write Submission CSV
We will merge our predictions onto `sample_submission.csv` and submit to Kaggle.

In [None]:
sub = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
sub = sub[['customer_id']]
sub['customer_id_2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')
sub = sub.merge(preds.rename({'customer_id':'customer_id_2'},axis=1),\
    on='customer_id_2', how='left').fillna('')
del sub['customer_id_2']
sub.prediction = sub.prediction + top12
sub.prediction = sub.prediction.str.strip()
sub.prediction = sub.prediction.str[:131]
sub.to_csv('submission.csv',index=False)
sub.head()