## H&M Recommender system - Capstone Part 2

The focus of this notebook is to use Kaggle submission scores to learn about customers' underlying purchasing habits, in order to build a better recommender. The base model uses a rule-based approach with no machine learning, and later models will add further features.

This base model was created before I learnt about tools such as higher RAM and GPU use on Google Colab, RAPIDS cuDF, parquet files and other memory saving techniques.

For the base model, the focus is:

- 1) Dealing with the common problem of cold start, by recommending popular items.
- 2) Dealing with customers who are returning, by recommending items previously purchased.

As a preliminary conclusion, I found that the last 5 weeks seem to be the most relevant for predicting the week ahead.

Based on my research, H&M is a fast-fashion company, and the items on sale can change every season. Using only the most recent transactions forgoes customers' older purchase history, but using older transactions has its own cost: the products in them may be out of season and no longer for sale. Over an iterative process of 9 submissions, with all of my workings left below, I try to find the number of weeks that best balances this trade-off.
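The trade-off above comes down to a single parameter: how many trailing weeks of transactions to keep. Below is a minimal sketch of that selection, assuming a hypothetical helper `recent_window` (not part of the original workings) that counts exact 7-day weeks back from the latest date in the data. The cut-off dates picked by hand later in this notebook differ slightly from this, so treat it as an illustration rather than a reproduction.

```python
import pandas as pd

def recent_window(transactions, weeks, date_col="t_dat"):
    """Keep rows from the trailing `weeks` * 7 days, ending on the latest date present."""
    last_day = transactions[date_col].max()
    start = last_day - pd.Timedelta(days=7 * weeks - 1)
    return transactions.loc[transactions[date_col] >= start].reset_index(drop=True)

# Toy data standing in for transactions_train.csv (hypothetical values)
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-01", "2020-09-15", "2020-09-16", "2020-09-22"]),
    "article_id": ["a", "b", "c", "d"],
})
one_week = recent_window(toy, weeks=1)     # keeps 16 to 22 September
three_weeks = recent_window(toy, weeks=3)  # keeps 2 to 22 September
```

Deriving the start date from the data's last day keeps every window aligned to the same end point, which makes the scores of different windows easier to compare.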

In [1]:
import numpy as np
import pandas as pd

In [2]:
transactions = pd.read_csv('../datasets/transactions_train.csv', dtype={'article_id': str})
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])
transactions['month'] = transactions['t_dat'].dt.month
transactions['year'] = transactions['t_dat'].dt.year

submission = pd.read_csv('../datasets/sample_submission.csv')

### Baseline model - using June to November as in season months

Based on my best guess of the season from my EDA notebook, I zoom in on June to November, since the most popular clothing sold is likely to come from the summer and autumn collections. To deal with cold-start users, I recommend the 12 most popular items in the season.

In [8]:
# Keep only transactions from June to November (all years)
transactions_season = transactions.loc[transactions['month'].isin([6,7,8,9,10,11])].reset_index(drop=True)

In [9]:
transactions_season.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,month,year
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,9,2018
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,9,2018
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,9,2018
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2,9,2018
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2,9,2018


In [10]:
# Create purchase dictionary for customer and article belonging to June to November period
purchase_dict = {}

for i,x in enumerate(zip(transactions_season['customer_id'], transactions_season['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}
    
    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    purchase_dict[cust_id][art_id] += 1
    
print(len(purchase_dict))

1074208


In [13]:
# Base model: customers with no purchases in the season (cold start) get the 12 most popular in-season items
# Customers with purchases get their most frequently bought in-season items (up to 12), padded with popular items if they have fewer than 12

top12inseason_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_season['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict:
        l = sorted((purchase_dict[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12inseason_benchmark['prediction'] = prediction_list
print(top12inseason_benchmark.shape)
top12inseason_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0797065001 0607642008 0745232001 0656719005 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0689898002 0583558001 0666448006 0599580024 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0663713001 0541518023 0794321007 0706016001 03...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0742079001 0732413001 0706016001 0372860001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0399061015 0634249005 0677049001 0589440005 08...


In [14]:
top12inseason_benchmark.to_csv('../datasets/submission1.csv', index=False)

The Kaggle submission above returned a Mean Average Precision of 0.0069 on the submission data, which is a relatively poor score. This suggests that the items picked were not accurate, and that many of them may no longer be on sale.
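For reference, here is a minimal sketch of how Mean Average Precision at 12 (MAP@12) can be computed, based on my understanding of the competition metric. The function names are hypothetical, and skipping customers with no purchases in the target week is an assumption about how the scorer behaves.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Average precision for one customer: precision at each rank where a hit occurs."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each correct item once, even if it is predicted twice
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual_set), k)

def map_at_k(actual_lists, predicted_lists, k=12):
    """Mean of the per-customer scores, skipping customers with no actual purchases."""
    scores = [
        average_precision_at_k(actual, predicted, k)
        for actual, predicted in zip(actual_lists, predicted_lists)
        if actual
    ]
    return sum(scores) / len(scores) if scores else 0.0

# One customer bought "a" and "b"; the prediction hits at ranks 1 and 3
example_ap = average_precision_at_k(["a", "b"], ["a", "x", "b"])
```

Because hits near the top of the list count for more, the order of the 12 recommended items matters as well as which items are included.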

### Baseline model - using recent 3 weeks

Based on my observation, items go in and out of season, and since there is no publicly available information on when clothing styles go out of season, the latest month might be the best guess for what is in season. I will make a submission to test this hypothesis.

In [11]:
# Keep only transactions from September 2020 (roughly the last 3 weeks of data)
transactions_3w = transactions.loc[(transactions['month']==9) & (transactions['year']==2020)]
transactions_3w.reset_index(drop=True, inplace=True)

In [12]:
transactions_3w.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,month,year
0,2020-09-01,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,777148006,0.013542,1,9,2020
1,2020-09-01,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,835801001,0.018627,1,9,2020
2,2020-09-01,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,923134005,0.012695,1,9,2020
3,2020-09-01,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,865929003,0.016932,1,9,2020
4,2020-09-01,0005ed68483efa39644c45185550a82dd09acb07622acb...,863646004,0.033881,1,9,2020


In [13]:
# Create purchase dictionary for customer and article belonging to September 2020 period
purchase_dict_3w = {}

for i,x in enumerate(zip(transactions_3w['customer_id'], transactions_3w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_3w:
        purchase_dict_3w[cust_id] = {}
    
    if art_id not in purchase_dict_3w[cust_id]:
        purchase_dict_3w[cust_id][art_id] = 0
    
    purchase_dict_3w[cust_id][art_id] += 1
    
print(len(purchase_dict_3w))

189510


In [14]:
# Variant base model: customers with no purchases in September 2020 get the 12 most popular items of that month
# Customers with purchases get their most frequently bought items in that month (up to 12), padded with popular items if they have fewer than 12

top12inmonth_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_3w['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12inmonth_benchmark['prediction'] = prediction_list
print(top12inmonth_benchmark.shape)
top12inmonth_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0751471001 0909370001 0918522001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0751471001 0909370001 0918522001 0924243001 09...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0751471001 0909370001 0918522001 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0751471001 0909370001 0918522001 0924243001 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0751471001 0909370001 0918522001 0924243001 09...


In [17]:
top12inmonth_benchmark.to_csv('../datasets/submission2.csv', index=False)

Kaggle submission above returned 0.0191 Mean Average Precision on submission data, which was a massive improvement. This shows that recent data is crucial in predictions, possibly due to trends as H&M is in fast fashion, and trends move very quickly.
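Since the same counting loop is re-run for every time window tested in this notebook, it can be factored into one helper. This is a sketch only: `build_purchase_dict` is a hypothetical name, and it uses `collections.Counter` in place of the manual zero-initialisation in the cells above.

```python
from collections import Counter, defaultdict

def build_purchase_dict(customer_ids, article_ids):
    """Map each customer to a Counter of how many times they bought each article."""
    purchase_dict = defaultdict(Counter)
    for cust_id, art_id in zip(customer_ids, article_ids):
        purchase_dict[cust_id][art_id] += 1
    return purchase_dict

# Toy input; with the real data the two arguments would be DataFrame columns
purchase_dict_toy = build_purchase_dict(
    ["c1", "c1", "c2", "c1"],
    ["a", "a", "b", "c"],
)
# Articles for one customer, most frequently bought first
top_for_c1 = [art for art, _ in purchase_dict_toy["c1"].most_common()]
```

The helper accepts any two iterables of equal length, so columns such as `transactions_3w['customer_id']` and `transactions_3w['article_id']` could be passed in directly.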

### Baseline model - using recent 2 months

In [29]:
# zoom in transaction from August, September 2020 alone 
transactions_ = transactions.loc[transactions['month'].isin([8,9])]
transactions_2m = transactions_.loc[transactions_['year']==2020]
transactions_2m.reset_index(drop=True, inplace=True)

In [32]:
# Create purchase dictionary for customer and article belonging to August and September 2020 period
purchase_dict = {}

for i,x in enumerate(zip(transactions_2m['customer_id'], transactions_2m['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}
    
    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    purchase_dict[cust_id][art_id] += 1
    
print(len(purchase_dict))

363798


In [33]:
# Variant base model: customers with no purchases in August and September 2020 get the 12 most popular items of those 2 months
# Customers with purchases get their most frequently bought items in those 2 months (up to 12), padded with popular items if they have fewer than 12

top12in2month_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_2m['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict:
        l = sorted((purchase_dict[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12in2month_benchmark['prediction'] = prediction_list
print(top12in2month_benchmark.shape)
top12in2month_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0751471001 0918292001 0706016001 04...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0751471001 0918292001 0706016001 0448509014 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0751471001 0918292001 0706016001 04...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0751471001 0918292001 0706016001 0448509014 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0896152002 0730683050 0927530004 0791587015 07...


In [34]:
top12in2month_benchmark.to_csv('../datasets/submission3.csv', index=False)

The Kaggle submission above returned a Mean Average Precision of 0.0155 on the submission data, which is lower than the 0.0191 from September alone. This is crucial information, as it suggests that the last month is a better predictor of what will be bought than the last two months combined.
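The recommendation step shared by all the variants so far can be sketched as a single function. The name `recommend` is hypothetical. Unlike the cells above, this sketch skips popular items the customer already has in their list, so that no slot is spent on a duplicate; that is my own change and not something the submitted models did.

```python
def recommend(cust_id, purchase_dict, popular_items, k=12):
    """Return up to k article ids: the customer's own purchases first, then popular items."""
    counts = purchase_dict.get(cust_id, {})
    # Most frequently bought first
    own = [art for art, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
    recs = own[:k]
    for art in popular_items:
        if len(recs) >= k:
            break
        if art not in recs:  # avoid recommending the same article twice
            recs.append(art)
    return " ".join(recs)

toy_purchases = {"c1": {"a": 2, "b": 5}}
toy_popular = ["b", "x", "y"]
returning = recommend("c1", toy_purchases, toy_popular, k=3)
cold_start = recommend("c2", toy_purchases, toy_popular, k=2)
```

A customer missing from `purchase_dict` falls straight through to the popular items, which is the cold-start case.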

### Baseline model - using recent 1 week

In [30]:
# Keep only the last week of transactions (15 to 22 September 2020)
transactions_1w = transactions_3w.loc[(transactions_3w['t_dat']>='2020-09-15')]
transactions_1w.reset_index(drop=True, inplace=True)

In [17]:
# Create purchase dictionary for customer and article belonging to the 15 to 22 September 2020 period
purchase_dict_1w = {}

for i,x in enumerate(zip(transactions_1w['customer_id'], transactions_1w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_1w:
        purchase_dict_1w[cust_id] = {}
    
    if art_id not in purchase_dict_1w[cust_id]:
        purchase_dict_1w[cust_id][art_id] = 0
    
    purchase_dict_1w[cust_id][art_id] += 1
    
print(len(purchase_dict_1w))

82588


In [18]:
# Variant base model: customers with no purchases in the last week get the 12 most popular items of that week
# Customers with purchases get their most frequently bought items in that week (up to 12), padded with popular items if they have fewer than 12

top12in1week_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_1w:
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12in1week_benchmark['prediction'] = prediction_list
print(top12in1week_benchmark.shape)
top12in1week_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0924243001 0923758001 0924243002 0918522001 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243001 0923758001 0924243002 0918522001 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0924243001 0923758001 0924243002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243001 0923758001 0924243002 0918522001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0924243001 0923758001 0924243002 0918522001 07...


In [42]:
top12in1week_benchmark.to_csv('../datasets/submission4.csv', index=False)

The Kaggle score was 0.0186, slightly lower than the 0.0191 from the 3-week window.

### Baseline model - using recent 1 day

In [5]:
# zoom in transaction from 22 September 2020 alone 
transactions_1d = transactions.loc[(transactions['t_dat']=='2020-09-22')]
transactions_1d.reset_index(drop=True, inplace=True)

In [7]:
# Create purchase dictionary for customer and article belonging to 22 September 2020 
purchase_dict = {}

for i,x in enumerate(zip(transactions_1d['customer_id'], transactions_1d['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}
    
    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    purchase_dict[cust_id][art_id] += 1
    
print(len(purchase_dict))

10528


In [8]:
# Variant base model: customers with no purchases on 22 September 2020 get the 12 most popular items of that day
# Customers with purchases get their most frequently bought items on that day (up to 12), padded with popular items if they have fewer than 12

top12in1day_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_1d['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict:
        l = sorted((purchase_dict[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12in1day_benchmark['prediction'] = prediction_list
print(top12in1day_benchmark.shape)
top12in1day_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0924243002 0751471001 0448509014 0918522001 08...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243002 0751471001 0448509014 0918522001 08...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0924243002 0751471001 0448509014 0918522001 08...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243002 0751471001 0448509014 0918522001 08...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0924243002 0751471001 0448509014 0918522001 08...


In [9]:
top12in1day_benchmark.to_csv('../datasets/submission5.csv', index=False)

The Kaggle score was 0.00929 MAP@12. This is a massive decrease, showing that more recent does not always mean better, and that there is a sweet spot I have yet to uncover.
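Each window tested so far costs one Kaggle submission. As a possible complement, the search for the sweet spot could also be run offline by holding out the final week of the training data and treating it as the target. This is a sketch under that assumption, and it is not what was done for the scores reported here; `split_last_week` is a hypothetical helper.

```python
import pandas as pd

def split_last_week(transactions, date_col="t_dat"):
    """Split into (history, actual), where actual maps customer -> articles bought in the final 7 days."""
    cutoff = transactions[date_col].max() - pd.Timedelta(days=7)
    history = transactions.loc[transactions[date_col] <= cutoff]
    holdout = transactions.loc[transactions[date_col] > cutoff]
    actual = holdout.groupby("customer_id")["article_id"].agg(list).to_dict()
    return history, actual

# Toy data standing in for the real transactions (hypothetical values)
toy_tx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-10", "2020-09-15", "2020-09-20", "2020-09-22"]),
    "customer_id": ["c1", "c1", "c1", "c2"],
    "article_id": ["a", "b", "c", "d"],
})
history, actual = split_last_week(toy_tx)
```

Candidate windows would then be built from `history` only and scored against `actual`, with submissions reserved for the most promising settings. A local score is only a proxy for the leaderboard score.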

### Baseline model - using recent 2 weeks

In [19]:
# Keep only the last 2 weeks of transactions (7 to 22 September 2020)
transactions_2w = transactions_3w.loc[(transactions_3w['t_dat']>='2020-09-07')]
transactions_2w.reset_index(drop=True, inplace=True)

In [21]:
# Create purchase dictionary for customer and article belonging to the 7 to 22 September 2020 period
purchase_dict_2w = {}

for i,x in enumerate(zip(transactions_2w['customer_id'], transactions_2w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_2w:
        purchase_dict_2w[cust_id] = {}
    
    if art_id not in purchase_dict_2w[cust_id]:
        purchase_dict_2w[cust_id][art_id] = 0
    
    purchase_dict_2w[cust_id][art_id] += 1
    
print(len(purchase_dict_2w))

143455


In [22]:
# Variant base model: customers with no purchases in the last 2 weeks get the 12 most popular items of those 2 weeks
# Customers with purchases get their most frequently bought items in those 2 weeks (up to 12), padded with popular items if they have fewer than 12

top12in2week_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []
dummy_list = list((transactions_2w['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_2w:
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

top12in2week_benchmark['prediction'] = prediction_list
print(top12in2week_benchmark.shape)
top12in2week_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0909370001 0924243001 0918522001 0448509014 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0909370001 0924243001 0918522001 0448509014 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0909370001 0924243001 0918522001 04...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0909370001 0924243001 0918522001 0448509014 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0909370001 0924243001 0918522001 0448509014 07...


In [23]:
top12in2week_benchmark.to_csv('../datasets/submission6.csv', index=False)

The Kaggle score was 0.0186, again slightly lower than the 0.0191 from the 3-week window.

#### Combined 1w, 2w, 3w model

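Not all customers missing from the most recent week are truly cold start. Instead of using a single window, I look back across the 3 weeks: each customer is matched to the most recent window in which they made a purchase, their purchases from that window are recommended first, and the remaining slots are filled with that window's top sellers.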
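The fallback order described above can be written as a loop over windows rather than one `elif` branch per window. This is a sketch of the same logic; the name `cascade_recommend` and the `(purchase_dict, popular_items)` pairing are my own illustration.

```python
def cascade_recommend(cust_id, windows, fallback, k=12):
    """windows: list of (purchase_dict, popular_items) pairs, most recent window first."""
    for purchase_dict, popular_items in windows:
        if cust_id in purchase_dict:
            counts = purchase_dict[cust_id]
            # Most frequently bought first
            own = [art for art, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
            # Pad with this window's best sellers, then cut to k
            padded = own + popular_items[: max(0, k - len(own))]
            return " ".join(padded[:k])
    # Customer not seen in any window: pure cold start
    return " ".join(fallback[:k])

toy_windows = [
    ({"c1": {"a": 1}}, ["p1", "p2"]),          # most recent window
    ({"c2": {"b": 2, "c": 1}}, ["q1", "q2"]),  # older window
]
toy_fallback = ["p1", "p2", "p3"]
recent_buyer = cascade_recommend("c1", toy_windows, toy_fallback, k=3)
older_buyer = cascade_recommend("c2", toy_windows, toy_fallback, k=3)
new_customer = cascade_recommend("c3", toy_windows, toy_fallback, k=3)
```

With this shape, extending the look-back by another week means appending one more pair to `windows` instead of copying another branch.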

In [25]:
dummy_list_1w = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_list_2w = list((transactions_2w['article_id'].value_counts()).index)[:12]
dummy_list_3w = list((transactions_3w['article_id'].value_counts()).index)[:12]

In [26]:
combined_benchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []

dummy_list = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_1w:
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_1w[:(12-len(l))])
    elif cust_id in purchase_dict_2w:
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_2w[:(12-len(l))])
    elif cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_3w[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

combined_benchmark['prediction'] = prediction_list
print(combined_benchmark.shape)
combined_benchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0751471001 0909370001 0918522001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243001 0923758001 0924243002 0918522001 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0924243001 0923758001 0924243002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243001 0923758001 0924243002 0918522001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0924243001 0923758001 0924243002 0918522001 07...


In [27]:
combined_benchmark.to_csv('../datasets/submission7.csv', index=False)

The Kaggle score improved to 0.0203, the best so far. This supports the method of starting from the most recent purchases, then looking further back in time, and finally filling with the top 12 best sellers. I will test 4 and 5 weeks as well, and continue until no further improvement is found.

In [32]:
transactions_4w = transactions[transactions['t_dat'] >= pd.to_datetime('2020-08-22')].copy()
transactions_5w = transactions[transactions['t_dat'] >= pd.to_datetime('2020-08-15')].copy()

In [33]:
purchase_dict_4w = {}

for i,x in enumerate(zip(transactions_4w['customer_id'], transactions_4w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_4w:
        purchase_dict_4w[cust_id] = {}
    
    if art_id not in purchase_dict_4w[cust_id]:
        purchase_dict_4w[cust_id][art_id] = 0
    
    purchase_dict_4w[cust_id][art_id] += 1
    
print(len(purchase_dict_4w))

dummy_list_4w = list((transactions_4w['article_id'].value_counts()).index)[:12]

256355


In [34]:
purchase_dict_5w = {}

for i,x in enumerate(zip(transactions_5w['customer_id'], transactions_5w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_5w:
        purchase_dict_5w[cust_id] = {}
    
    if art_id not in purchase_dict_5w[cust_id]:
        purchase_dict_5w[cust_id][art_id] = 0
    
    purchase_dict_5w[cust_id][art_id] += 1
    
print(len(purchase_dict_5w))

dummy_list_5w = list((transactions_5w['article_id'].value_counts()).index)[:12]

290751


In [35]:
# Recommend purchases based on each customer's most recent transactions
combined_to5wbenchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []

dummy_list = list((transactions_1w['article_id'].value_counts()).index)[:12]  # top 12 best sellers in the most recent week (week 104)
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):  
    if cust_id in purchase_dict_1w:                # if customer made a purchase 1 week ago, recommend same items until 12   
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])                  # if customer bought more than 12, cut list to 12
        else:
            s = ' '.join(l+dummy_list_1w[:(12-len(l))])   # if list is still less than 12, recommend best sellers in week 104
    elif cust_id in purchase_dict_2w:             
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])                 
        else:
            s = ' '.join(l+dummy_list_2w[:(12-len(l))])
    elif cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_3w[:(12-len(l))])
    elif cust_id in purchase_dict_4w:
        l = sorted((purchase_dict_4w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_4w[:(12-len(l))])
    elif cust_id in purchase_dict_5w:
        l = sorted((purchase_dict_5w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_5w[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

combined_to5wbenchmark['prediction'] = prediction_list
print(combined_to5wbenchmark.shape)
combined_to5wbenchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0751471001 0909370001 0918522001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243001 0924243002 0923758001 0918522001 09...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0924243001 0923758001 0924243002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243001 0924243002 0923758001 0918522001 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0924243001 0924243002 0923758001 0918522001 09...


In [36]:
combined_to5wbenchmark.to_csv('../datasets/submission8.csv', index=False)

The returned MAP@12 of 0.0206 is the best score so far.

In [37]:
transactions_6w = transactions[transactions['t_dat'] >= pd.to_datetime('2020-08-08')].copy()
transactions_7w = transactions[transactions['t_dat'] >= pd.to_datetime('2020-08-01')].copy()

In [38]:
purchase_dict_6w = {}

for i,x in enumerate(zip(transactions_6w['customer_id'], transactions_6w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_6w:
        purchase_dict_6w[cust_id] = {}
    
    if art_id not in purchase_dict_6w[cust_id]:
        purchase_dict_6w[cust_id][art_id] = 0
    
    purchase_dict_6w[cust_id][art_id] += 1
    
print(len(purchase_dict_6w))

dummy_list_6w = list((transactions_6w['article_id'].value_counts()).index)[:12]

325906


In [39]:
purchase_dict_7w = {}

for i,x in enumerate(zip(transactions_7w['customer_id'], transactions_7w['article_id'])):
    cust_id, art_id = x
    if cust_id not in purchase_dict_7w:
        purchase_dict_7w[cust_id] = {}
    
    if art_id not in purchase_dict_7w[cust_id]:
        purchase_dict_7w[cust_id][art_id] = 0
    
    purchase_dict_7w[cust_id][art_id] += 1
    
print(len(purchase_dict_7w))

dummy_list_7w = list((transactions_7w['article_id'].value_counts()).index)[:12]

363798


In [40]:
combined_to7wbenchmark = submission[['customer_id']].copy()  # copy so that adding a column does not write to a slice of submission
prediction_list = []

dummy_list = list((transactions_1w['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for i, cust_id in enumerate(submission['customer_id'].values.reshape((-1,))):
    if cust_id in purchase_dict_1w:
        l = sorted((purchase_dict_1w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_1w[:(12-len(l))])
    elif cust_id in purchase_dict_2w:
        l = sorted((purchase_dict_2w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_2w[:(12-len(l))])
    elif cust_id in purchase_dict_3w:
        l = sorted((purchase_dict_3w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_3w[:(12-len(l))])
    elif cust_id in purchase_dict_4w:
        l = sorted((purchase_dict_4w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_4w[:(12-len(l))])
    elif cust_id in purchase_dict_5w:
        l = sorted((purchase_dict_5w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_5w[:(12-len(l))])
    elif cust_id in purchase_dict_6w:
        l = sorted((purchase_dict_6w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_6w[:(12-len(l))])
    elif cust_id in purchase_dict_7w:
        l = sorted((purchase_dict_7w[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_7w[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

combined_to7wbenchmark['prediction'] = prediction_list
print(combined_to7wbenchmark.shape)
combined_to7wbenchmark.head()

(1371980, 2)


Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0751471001 0909370001 0918522001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0924243001 0924243002 0923758001 0918522001 09...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0924243001 0923758001 0924243002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0924243001 0924243002 0923758001 0918522001 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0896152002 0730683050 0927530004 0791587015 07...


In [41]:
combined_to7wbenchmark.to_csv('../datasets/submission9.csv', index=False)

The Kaggle score stopped improving after `dummy_list_5w`: extending the look-back to 6 and 7 weeks gave no further gain. This suggests that the most relevant information lies in the most recent 5 weeks.

## Conclusion from base model

I found the sweet spot for the base model to be around the most recent 5 weeks of transactions. Taking only the most recent information forgoes customers' past purchasing information, while taking older information brings in products that are no longer in season or for sale.

The MAP@12 score for the baseline model currently stands at 0.0206, and future models will try to improve on it.