## Capstone Part 5 - Trending weekly model

The objective of this notebook is to build an alternative to the LGBM and base models from parts 2 and 4.

This notebook was run on a Google Colab Pro+ account, using RAPIDS and cuDF to shorten the run time. The commands below install what is needed before cuDF can be imported.

As the notebook was optimised for Google Colab, the Colab file paths are kept so that it runs on Colab without changes.

## Steps for installing cuDF on Colab

Below are the steps to install cuDF on Colab. As Colab runs on a virtual machine, I need to repeat the same steps each time I start a new Colab session.

In [None]:
#!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
#!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 300, done.
remote: Counting objects: 100% (129/129), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 300 (delta 74), reused 99 (delta 55), pack-reused 171
Receiving objects: 100% (300/300), 87.58 KiB | 7.30 MiB/s, done.
Resolving deltas: 100% (136/136), done.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla P100-PCIE-16GB!
***********************************************************************



In [None]:
#!bash rapidsai-csp-utils/colab/update_gcc.sh
#import os
#os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Hit:1 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,575 B]
Hit:4 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:6 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:7 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:8 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:9 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:10 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic InRelease [20.8 kB]
Get:11 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:12 https://developer.download.nvidia.com/compute/machine-learning/

In [None]:
#import condacolab
#condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:23
🔁 Restarting kernel...


In [None]:
# you can now run the rest of the cells as normal
#import condacolab
#condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
#!python rapidsai-csp-utils/colab/install_rapids.py stable
#import os
#os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
#os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
#os.environ['CONDA_PREFIX'] = '/usr/local'

Found existing installation: cffi 1.14.5
Uninstalling cffi-1.14.5:
  Successfully uninstalled cffi-1.14.5
Found existing installation: cryptography 3.4.5
Uninstalling cryptography-3.4.5:
  Successfully uninstalled cryptography-3.4.5
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (427 kB)
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 21.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible solve.
failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - cudatoolkit=11.2
    - dask-sql
    - gcsfs
    - llvmlite
    - openssl
    - python=3.7
    - 

In [None]:
# import libraries
import numpy as np
import pandas as pd 
from datetime import datetime, timedelta
import gc

import cudf

### Read the transactions data

In [None]:
# set number of predictions to be made as 12
N = 12

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# read the csv and reduce memory by converting customer_id (a hex string) to int64,
# using only its last 16 hex characters
df = cudf.read_csv('/content/drive/MyDrive/datasets/transactions_train.csv',
                   usecols=['t_dat', 'customer_id', 'article_id'],
                   dtype={'article_id': 'int32', 't_dat': 'string', 'customer_id': 'string'})
df['customer_id'] = df['customer_id'].str[-16:].str.hex_to_int().astype('int64')

# change t_dat to datetime format
df['t_dat'] = cudf.to_datetime(df['t_dat'])
last_ts = df['t_dat'].max()
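
The `customer_id` conversion above can be illustrated on the CPU. This is a minimal pandas/NumPy sketch, not the cuDF code path: `hex_tail_to_int64` and the example ids are hypothetical, and it assumes the cuDF conversion wraps large values into the signed range in the same way (the negative `customer_id` values in later outputs suggest it does).

```python
import numpy as np
import pandas as pd

def hex_tail_to_int64(ids: pd.Series) -> pd.Series:
    """Map the last 16 hex characters of each id to a signed 64-bit integer."""
    # 16 hex digits are exactly 64 bits, so parse them as unsigned first ...
    unsigned = np.array([int(s[-16:], 16) for s in ids], dtype=np.uint64)
    # ... then reinterpret the same bits as signed, which is why some ids become negative
    return pd.Series(unsigned.view(np.int64), index=ids.index, name=ids.name)

# hypothetical ids, shortened to 20 characters for readability
example_ids = pd.Series(["abcd000000000000000a", "abcdffffffffffffffff"], name="customer_id")
compact_ids = hex_tail_to_int64(example_ids)
print(compact_ids)
```

Each id then takes 8 bytes instead of a full string. Two ids that share the same last 16 characters would collide, so this relies on the tails being unique enough for the dataset.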

### Add the last day of billing week

In [None]:
%%time
# create dataframe containing t_dat, day of week and last day of billing week ('ldbw')
tmp = df[['t_dat']].copy().to_pandas()
tmp['dow'] = tmp['t_dat'].dt.dayofweek
tmp['ldbw'] = tmp['t_dat'] - pd.TimedeltaIndex(tmp['dow'] - 1, unit='D')
tmp.loc[tmp['dow'] >=2 , 'ldbw'] = tmp.loc[tmp['dow'] >=2 , 'ldbw'] + pd.TimedeltaIndex(np.ones(len(tmp.loc[tmp['dow'] >=2])) * 7, unit='D')

df['ldbw'] = tmp['ldbw'].values

CPU times: user 6.36 s, sys: 556 ms, total: 6.92 s
Wall time: 6.66 s


In [None]:
tmp.head()

Unnamed: 0,t_dat,dow,ldbw
0,2018-09-20,3,2018-09-25
1,2018-09-20,3,2018-09-25
2,2018-09-20,3,2018-09-25
3,2018-09-20,3,2018-09-25
4,2018-09-20,3,2018-09-25
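
The billing week above ends on a Tuesday: every date is mapped to the first Tuesday on or after it. As a cross-check, the same mapping can be sketched with modular arithmetic instead of the two-step adjustment. `billing_week_end` is a hypothetical helper name, and the sample dates are picked for illustration.

```python
import pandas as pd

def billing_week_end(dates: pd.Series) -> pd.Series:
    """Return the first Tuesday on or after each date."""
    dow = dates.dt.dayofweek          # Monday=0, Tuesday=1, ..., Sunday=6
    days_ahead = (1 - dow) % 7        # 0 for Tuesday, 6 for Wednesday, 1 for Monday
    return dates + pd.to_timedelta(days_ahead, unit="D")

sample = pd.Series(pd.to_datetime(["2018-09-20", "2020-09-22", "2020-09-23"]))
week_ends = billing_week_end(sample)
print(week_ends.dt.strftime("%Y-%m-%d").tolist())
```

The first date (a Thursday) maps to 2018-09-25, matching the `ldbw` column shown above.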


### Count the number of transactions per week 

In [None]:
# create dataframe with last day of billing week, article_id and count for number of sales
weekly_sales = df.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count().reset_index()
weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})

In [None]:
weekly_sales.head()

Unnamed: 0,ldbw,article_id,count
0,2018-12-18,568652020,1
1,2019-05-14,560222012,5
2,2019-08-20,746260001,4
3,2020-04-28,831644001,28
4,2019-02-12,693614004,3


In [None]:
df = df.merge(weekly_sales, on=['ldbw', 'article_id'], how = 'left')

In [None]:
df.head()

Unnamed: 0,t_dat,customer_id,article_id,ldbw,count
0,2018-09-20,-5912610896107984360,467302100,2018-09-25,33
1,2018-09-20,-5912610896107984360,467302100,2018-09-25,33
2,2018-09-20,-5912610896107984360,467302100,2018-09-25,33
3,2018-09-20,-5912610896107984360,467302100,2018-09-25,33
4,2018-09-20,-5912610896107984360,467302100,2018-09-25,33
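
The count-then-merge step attaches to every transaction the number of times its article was sold in the same billing week. A small pandas sketch of the same idea on made-up data follows; it uses `groupby(...).transform("size")` as a one-step alternative, which is a substitution for illustration and not the cuDF code used in this notebook.

```python
import pandas as pd

# made-up transactions: two billing weeks, two articles
toy = pd.DataFrame({
    "ldbw": pd.to_datetime(["2020-09-15", "2020-09-15", "2020-09-22",
                            "2020-09-22", "2020-09-22"]),
    "article_id": [111, 111, 111, 222, 222],
})

# number of transactions sharing the same (billing week, article) pair
toy["count"] = toy.groupby(["ldbw", "article_id"])["article_id"].transform("size")
print(toy)
```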


### Assume prediction week sales will be similar to the last week of the training data

In [None]:
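# create 'count_targ' column in main dataframe: each article's sales count in the last week of the training data
# merge() defaults to an inner join, so articles with no sales in the last week are dropped from df here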
weekly_sales = weekly_sales.reset_index().set_index('article_id')

df = df.merge(
    weekly_sales.loc[weekly_sales['ldbw']==last_ts, ['count']],
    on='article_id', suffixes=("", "_targ"))

df['count_targ'].fillna(0, inplace=True)
del weekly_sales

In [None]:
df

Unnamed: 0,t_dat,customer_id,article_id,ldbw,count,count_targ
0,2018-09-20,5768831383057974898,297067002,2018-09-25,102,55
1,2018-09-20,-7545749809139154060,148033001,2018-09-25,43,56
2,2018-09-20,-1363792404675816646,400285006,2018-09-25,74,20
3,2018-09-20,-5363196209725736780,399136009,2018-09-25,247,8
4,2018-09-20,-5363196209725736780,663396001,2018-09-25,115,20
...,...,...,...,...,...,...
12541131,2020-09-22,4013153654210679014,851094008,2020-09-22,24,24
12541132,2020-09-22,-2450040194960081695,857778011,2020-09-22,91,91
12541133,2020-09-22,-8179904361405576348,751471001,2020-09-22,526,526
12541134,2020-09-22,5675320231868812037,767032001,2020-09-22,125,125


### Calculate sales rate adjusted for changes in product popularity 

In [None]:
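# 'quotient' is the ratio of an article's last-week sales to its sales in the week of the transaction;
# values above 1 mean the article sold more in the last week than it did back then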
df['quotient'] = df['count_targ'] / df['count']

In [None]:
df

Unnamed: 0,t_dat,customer_id,article_id,ldbw,count,count_targ,quotient
0,2018-09-20,5768831383057974898,297067002,2018-09-25,102,55,0.539216
1,2018-09-20,-7545749809139154060,148033001,2018-09-25,43,56,1.302326
2,2018-09-20,-1363792404675816646,400285006,2018-09-25,74,20,0.270270
3,2018-09-20,-5363196209725736780,399136009,2018-09-25,247,8,0.032389
4,2018-09-20,-5363196209725736780,663396001,2018-09-25,115,20,0.173913
...,...,...,...,...,...,...,...
12541131,2020-09-22,4013153654210679014,851094008,2020-09-22,24,24,1.000000
12541132,2020-09-22,-2450040194960081695,857778011,2020-09-22,91,91,1.000000
12541133,2020-09-22,-8179904361405576348,751471001,2020-09-22,526,526,1.000000
12541134,2020-09-22,5675320231868812037,767032001,2020-09-22,125,125,1.000000
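
To make `count_targ` and `quotient` concrete, here is a small pandas sketch on made-up weekly counts. The column names follow the notebook; the data and the variable names `weekly`, `target` and `scored` are illustrative assumptions.

```python
import pandas as pd

# made-up weekly sales counts per article
weekly = pd.DataFrame({
    "ldbw": pd.to_datetime(["2020-09-15", "2020-09-22", "2020-09-15"]),
    "article_id": [111, 111, 222],
    "count": [50, 100, 30],
})

# sales in the last billing week become the target count
last_week = weekly["ldbw"].max()
target = (weekly.loc[weekly["ldbw"] == last_week, ["article_id", "count"]]
                .rename(columns={"count": "count_targ"}))

# inner join: article 222 has no sales in the last week, so it drops out
scored = weekly.merge(target, on="article_id")
scored["quotient"] = scored["count_targ"] / scored["count"]
print(scored)
```

Article 111 sold 50 units a week earlier and 100 in the last week, so its older rows get a quotient of 2.0 and its last-week rows get 1.0.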


### Take supposedly popular products

In [None]:
# create the general prediction list: the N (12) article_ids with the highest summed quotient,
# i.e. articles whose popularity has persisted into the last week
target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
general_pred = target_sales.nlargest(N).index.to_pandas().tolist()
general_pred = ['0' + str(article_id) for article_id in general_pred]
general_pred_str =  ' '.join(general_pred)



In [None]:
target_sales

article_id
897077002       3.0
820549002      32.0
807748001      59.0
897077001     110.0
817484004      35.0
              ...  
504667003      48.0
687704030      44.0
685814022    5292.0
912334001      70.0
811719005      96.0
Name: quotient, Length: 17986, dtype: float64

In [None]:
general_pred 

['0448509014',
 '0573085028',
 '0751471001',
 '0706016001',
 '0673677002',
 '0715624001',
 '0706016003',
 '0158340001',
 '0579541001',
 '0372860001',
 '0372860002',
 '0706016002']

In [None]:
# these are default predictions that will be filled if there are no other predictions made
general_pred_str

'0448509014 0573085028 0751471001 0706016001 0673677002 0715624001 0706016003 0158340001 0579541001 0372860001 0372860002 0706016002'
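
The general prediction list is a top-N selection over the summed quotient. A minimal pandas sketch with made-up quotients is shown below; `top_articles` is a hypothetical helper, and padding with `zfill(10)` assumes article ids are written as 10-character strings, as in the output above.

```python
import pandas as pd

def top_articles(transactions: pd.DataFrame, n: int) -> list:
    """Return the n article ids with the largest summed quotient, zero-padded to 10 characters."""
    totals = transactions.groupby("article_id")["quotient"].sum()
    return [str(int(article_id)).zfill(10) for article_id in totals.nlargest(n).index]

# made-up per-transaction quotients
toy = pd.DataFrame({
    "article_id": [706016001, 706016001, 448509014, 448509014, 448509014, 372860001],
    "quotient":   [1.0,       0.5,       2.0,       2.0,       1.0,       0.2],
})
top_two = top_articles(toy, 2)
print(" ".join(top_two))
```

Because the quotient is summed over transactions, an article ranks highly either by having many transactions or by having a high ratio of last-week sales to earlier sales.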

### Fill in purchase dictionary

In [None]:
%%time
# create empty purchase dictionary
purchase_dict = {}

# create temp dataframe from main df
tmp = df.copy().to_pandas()

# 'x' is the number of days between the transaction and the last day of the transaction dataset,
# floored at 1 so that the square root below never divides by zero
tmp['x'] = ((last_ts - tmp['t_dat']) / np.timedelta64(1, 'D')).astype(int).clip(lower=1)

# time-decay weight 'y': the more recent the purchase, the more likely customer A buys product B again
# reference: https://www.kaggle.com/code/lichtlab/0-0226-byfone-chris-combination-approach/notebook
a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
tmp['y'] = a / np.sqrt(tmp['x']) + b * np.exp(-c*tmp['x']) - d

# floor 'y' at 0 so that old transactions contribute nothing instead of a negative weight
tmp['y'] = tmp['y'].clip(lower=0)

# weight each transaction by how well the article's popularity has held up
tmp['value'] = tmp['quotient'] * tmp['y'] 

# sum 'value' over all transactions of each (customer_id, article_id) pair
tmp = tmp.groupby(['customer_id', 'article_id']).agg({'value': 'sum'})
tmp = tmp.reset_index()

# keep only pairs with a high 'value'; a high 'value' means the customer is more likely to buy the article in the prediction week
tmp = tmp.loc[tmp['value'] > 100]

# rank each customer's articles by 'value' (1 = highest) and keep ranks 1 to 12
tmp['rank'] = tmp.groupby("customer_id")["value"].rank("dense", ascending=False)
tmp = tmp.loc[tmp['rank'] <= 12]

# sort by descending order for customer and value
purchase_df = tmp.sort_values(['customer_id', 'value'], ascending = False).reset_index(drop = True)   

# add back '0' to article_id for prediction purpose
purchase_df['prediction'] = '0' + purchase_df['article_id'].astype(str) + ' '

# compile prediction for customer based on value rank
purchase_df = purchase_df.groupby('customer_id').agg({'prediction': sum}).reset_index()
purchase_df['prediction'] = purchase_df['prediction'].str.strip()
purchase_df = cudf.DataFrame(purchase_df)

CPU times: user 15.1 s, sys: 751 ms, total: 15.8 s
Wall time: 15.4 s


In [None]:
purchase_df.head()

Unnamed: 0,customer_id,prediction
0,-9223352921020755230,0673396002 0812167004 0849493006 0706016001 05...
1,-9223343869995384291,0908292002 0910601003 0903926002 0865929007 08...
2,-9223295149301589789,0826620001
3,-9223293121067732640,0715624001 0456163060 0835008005 0557599022 07...
4,-9223290575350349271,0852584006 0905518001 0912204009 0757904007 07...
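
The time-decay weight used above is y = a / sqrt(x) + b * exp(-c * x) - d, floored at 0, where x is the age of the transaction in days (at least 1). The sketch below evaluates it for a few ages using the same constants; `decay_weight` is a hypothetical helper written for illustration.

```python
import numpy as np

A, B, C, D = 2.5e4, 1.5e5, 2e-1, 1e3  # same constants as a, b, c, d in the notebook

def decay_weight(days_ago):
    """Weight of a past purchase as a function of its age in days."""
    x = np.maximum(np.asarray(days_ago, dtype=float), 1.0)  # same-day purchases count as 1 day old
    y = A / np.sqrt(x) + B * np.exp(-C * x) - D
    return np.maximum(y, 0.0)                               # never negative

weights = decay_weight([0, 1, 7, 30, 625, 626])
print(weights.round(1))
```

The exponential term dominates in the first weeks and fades quickly, after which the slower `a / sqrt(x)` term takes over. The weight reaches zero at roughly 625 days, where `a / sqrt(x)` equals `d`, so purchases older than that do not contribute to `value`.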


### Submission

In [None]:
%%time

# read csv
sub  = cudf.read_csv('/content/drive/MyDrive/datasets/sample_submission.csv',
                            usecols= ['customer_id'], 
                            dtype={'customer_id': 'string'})

# create customer_id2 with same format as purchase_df customer_id
sub['customer_id2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')

# merge with purchase_df
sub = sub.merge(purchase_df, left_on = 'customer_id2', right_on = 'customer_id', how = 'left',
               suffixes = ('', '_ignored'))

sub = sub.to_pandas()

# customers without a personalised prediction get the 12 general predictions
sub['prediction'] = sub['prediction'].fillna(general_pred_str)

# append the general predictions so that every customer has at least 12
sub['prediction'] = sub['prediction'] + ' ' +  general_pred_str

# strip leading and trailing whitespace, if any
sub['prediction'] = sub['prediction'].str.strip()

# keep only the first 12 predictions: 12 article_ids * 10 characters + 11 separating spaces = 131 characters
sub['prediction'] = sub['prediction'].str[:131]

# change into format for kaggle requirement 
sub = sub[['customer_id', 'prediction']]

# save as csv
sub.to_csv('/content/drive/MyDrive/datasets/trending.csv', index=False)


CPU times: user 8.73 s, sys: 1.04 s, total: 9.77 s
Wall time: 10.4 s
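
The padding and truncation logic in the submission cell can be checked on a tiny example. This sketch uses placeholder article ids for the general list and derives the 131-character limit from N instead of hard-coding it; the names `ID_LEN`, `MAX_CHARS`, `general`, `preds` and `padded` are illustrative.

```python
import pandas as pd

N = 12
ID_LEN = 10                              # article ids are 10 characters long
MAX_CHARS = N * ID_LEN + (N - 1)         # 12 ids plus 11 separating spaces = 131

# placeholder general predictions: '0000000001' ... '0000000012'
general = " ".join(f"{i:010d}" for i in range(1, N + 1))

# one customer with a single personalised prediction, one with none
preds = pd.Series(["0826620001", None])

padded = (preds.fillna(general) + " " + general).str.strip().str[:MAX_CHARS]
print(padded.str.split().str.len().tolist())
```

Both customers end up with exactly 12 ids. One thing this approach does not do is remove duplicates, so an article that appears in both the personalised and the general list can be predicted twice for the same customer.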


The submission scored a MAP@12 of 0.0226, a significant improvement over the base model.
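
For reference, MAP@12 can be computed locally along the following lines. This is a sketch of the usual mean average precision at k definition, not the official scoring code: the function names are hypothetical, repeated predictions are counted only once, and customers with no purchases in the evaluation week are assumed to have been filtered out beforehand.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Average precision of one customer's top-k predictions."""
    if not actual:
        return 0.0                      # guard only; such customers are assumed to be filtered out
    actual_set = set(actual)
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / rank        # precision at this rank, counted only on a hit
        seen.add(item)
    return score / min(len(actual_set), k)

def mean_average_precision_at_k(actuals, predictions, k=12):
    """Mean of the per-customer average precision."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores) if scores else 0.0

# two made-up customers
example_score = mean_average_precision_at_k([["a"], ["b"]], [["a"], ["x", "b"]])
print(example_score)
```

Because precision is only accumulated at ranks where a hit occurs, placing the most likely articles first matters as much as including them at all.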