[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shashist/recsys-course/blob/master/week_05_multi-stage/rs_seminar_2-level.ipynb)

# 2-level recommender system

### Motivation

- We want to handle all available features of different types and nature
- Some features depend on user-item pair and thus should be calculated online
- It is natural to combine recommendations from different sources

Despite all of the advantages above, it might be hard to make 2-level model perform better than 1-level.

Today we will fit two-level model and discuss corresponding details.



<img src=https://raw.githubusercontent.com/xei/recommender-system-tutorial/main/assets/retrieval_ranking.png width=800>


<small>
[(image source)](https://github.com/xei/recommender-system-tutorial)
</small>


### Validation

### Splitting pipeline

<img src=https://gist.githubusercontent.com/shashist/8e9094d4d975e6bda8f0556159ef324e/raw/fd002d6f76c90ab654b8454d452103d7fdefc08a/2_level_split.png width=1000>

1. Train candidate generation model on I & II_seed & III_seed & test, validate on II_holdout
2. Generate candidates for II, III, test
3. Generate features for candidates
4. Train ranking model on II_candidates, validate on III. Compute final metrics on III_holdout

### Models

1 level: [LightFM](https://github.com/lyst/lightfm) with and w/o features

2 level: XGBoost with binary classsification

### Spotify dataset

Dataset adopted from RecSys challenge 2018. It containts 100k playlists for training and 10k for test.
Description is available [here](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge).

Following resources were used:

- https://dl.acm.org/doi/10.1145/3267471.3267488

- https://github.com/VasiliyRubtsov/recsys2018/tree/master

In [1]:
from google.colab import drive
drive.mount('/content/drive')
# drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
PATH_TO_DATA_FOLDER = '/content/drive/MyDrive/data/RecSys/Spotify/'

%cd $PATH_TO_DATA_FOLDER

/content/drive/MyDrive/data/RecSys/Spotify


In [3]:
!pip install -q lightfm

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/316.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━[0m [32m143.4/316.4 kB[0m [31m4.1 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m316.4/316.4 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for lightfm (setup.py) ... [?25l[?25hdone


In [4]:
import glob
import joblib
import json
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
import pickle
import tqdm

import numpy as np
np.random.seed(0)
import pandas as pd
import scipy.sparse as sp
from lightfm import LightFM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_sample_weight
import xgboost

In [5]:
HDFS_PATH = 'data/hdfs'
SPLITTED_PATH = 'data/splitted'
CANDIDATES_PATH = 'data/candidates'

In [None]:
# !mkdir $SPLITTED_PATH

## 0. Data preparation

#### 0.1 reading hdfs

In [6]:
df_tracks = pd.read_hdf(f'{HDFS_PATH}/df_tracks.hdf')
df_playlists = pd.read_hdf(f'{HDFS_PATH}/df_playlists.hdf')
df_playlists_info = pd.read_hdf(f'{HDFS_PATH}/df_playlists_info.hdf')
df_playlists_test = pd.read_hdf(f'{HDFS_PATH}/df_playlists_test.hdf')
df_playlists_test_info = pd.read_hdf(f'{HDFS_PATH}/df_playlists_test_info.hdf')

In [7]:
print(df_tracks['tid'].nunique(), df_tracks.shape)
df_tracks.head(2)

693339 (693339, 8)


Unnamed: 0,album_name,album_uri,artist_name,artist_uri,duration_ms,track_name,track_uri,tid
0,Culture,spotify:album:2AvupjUeMnSffKEV05x222,Migos,spotify:artist:6oMuImdp5ZcFhWP0ESe6mG,304041,Slippery (feat. Gucci Mane),spotify:track:6p8NuHm8uCGnn2Dtbtf7zE,0
1,TBA,spotify:album:2apbRBAafIKmcWwESmLHJi,A Boogie Wit da Hoodie,spotify:artist:31W5EY0aAly4Qieq6OFu6I,184000,Timeless (DJ SPINKING),spotify:track:0q5DrtpnnGpOvBy5nnPMbe,1


In [8]:
print(df_playlists['pid'].nunique(), df_playlists['tid'].nunique(), df_playlists.shape)
df_playlists.head(2)

100000 689942 (6650217, 3)


Unnamed: 0,pid,tid,pos
0,0,0,0
1,0,1,1


In [9]:
print(df_playlists_info['pid'].nunique(), df_playlists_info.shape)
df_playlists_info.head(2)

100000 (100000, 10)


Unnamed: 0,collaborative,duration_ms,modified_at,name,num_albums,num_artists,num_edits,num_followers,num_tracks,pid
0,False,8074036,1489536000,as,28,21,14,1,36,0
1,False,12043803,1483747200,Sappy,53,47,29,2,56,1


In [10]:
print(df_playlists_info['pid'].nunique(), df_playlists_info.shape)
df_playlists_info.head(2)

100000 (100000, 10)


Unnamed: 0,collaborative,duration_ms,modified_at,name,num_albums,num_artists,num_edits,num_followers,num_tracks,pid
0,False,8074036,1489536000,as,28,21,14,1,36,0
1,False,12043803,1483747200,Sappy,53,47,29,2,56,1


In [11]:
print(df_playlists_test['pid'].nunique(), df_playlists_test['tid'].nunique(), df_playlists_test.shape)
df_playlists_test.head(2)

9000 66243 (281000, 3)


Unnamed: 0,pid,tid,pos
0,100000,59256,0
1,100000,12450,1


In [12]:
print(df_playlists_test_info['pid'].nunique(), df_playlists_test_info.shape)
df_playlists_test_info.head(2)

10000 (10000, 5)


Unnamed: 0,name,num_holdouts,num_samples,num_tracks,pid
0,spanish playlist,11,0,11,100002
1,Groovin,48,0,48,100003


In [14]:
config = {
    'num_playlists': df_playlists_test_info.pid.max() + 1,
    'num_tracks': df_tracks.tid.max() + 1,
}

# with open('data/config.config', 'wb') as f:
#     pickle.dump(config, f)

#### 0.2 splitting

In [15]:
num_tracks = df_playlists_info.groupby('num_tracks', group_keys=False).pid.apply(np.array)

In [16]:
num_tracks

num_tracks
5      [86, 210, 255, 385, 403, 479, 527, 622, 782, 1...
6      [87, 630, 638, 653, 693, 822, 1078, 1121, 1699...
7      [146, 160, 167, 219, 396, 397, 431, 491, 603, ...
8      [77, 365, 481, 577, 850, 886, 1137, 1163, 1366...
9      [84, 281, 315, 340, 1106, 1278, 1304, 1535, 15...
                             ...                        
246    [5382, 6297, 6511, 6616, 8321, 12704, 12957, 1...
247    [1736, 3610, 4360, 5188, 6448, 9069, 9401, 113...
248    [1800, 7643, 8593, 9348, 16740, 17317, 20065, ...
249    [1756, 7979, 12489, 18199, 18549, 20107, 20712...
250    [2186, 4670, 10929, 14266, 14287, 19576, 22303...
Name: pid, Length: 246, dtype: object

In [48]:
df_playlists_test_info.num_tracks.value_counts().reset_index().head()

Unnamed: 0,index,num_tracks
0,40,139
1,42,122
2,48,114
3,45,112
4,44,106


In [None]:
validation_playlists = {}
for i, j in df_playlists_test_info.num_tracks.value_counts().reset_index().values:
    validation_playlists[i] = np.random.choice(num_tracks.loc[i], 2 * j, replace=False)

In [19]:
num_tracks[249]

array([ 1756,  7979, 12489, 18199, 18549, 20107, 20712, 20713, 22357,
       24018, 24474, 25130, 29204, 29559, 34184, 37202, 37702, 39883,
       48332, 51489, 52807, 52916, 57791, 58487, 58796, 59878, 60222,
       61467, 64339, 65802, 68685, 73993, 80144, 84496, 85616, 86031,
       87461, 91688, 92537, 94796, 99592])

In [20]:
df_playlists_test_info['num_samples'].value_counts()

5      2000
10     2000
25     2000
100    2000
0      1000
1      1000
Name: num_samples, dtype: int64

In [21]:
val1_playlist = {}
val2_playlist = {}
for i in [0, 1, 5, 10, 25, 100]:

    val1_playlist[i] = []
    val2_playlist[i] = []

    value_counts = df_playlists_test_info.query('num_samples==@i').num_tracks.value_counts()
    for j, k in value_counts.reset_index().values:

        val1_playlist[i] += list(validation_playlists[j][:k])
        validation_playlists[j] = validation_playlists[j][k:]

        val2_playlist[i] += list(validation_playlists[j][:k])
        validation_playlists[j] = validation_playlists[j][k:]

In [22]:
val1_index = df_playlists.pid.isin(val1_playlist[0])
val2_index = df_playlists.pid.isin(val2_playlist[0])

In [23]:
for i in [1, 5, 10, 25, 100]:
    val1_index = val1_index | (df_playlists.pid.isin(val1_playlist[i]) & (df_playlists.pos >= i))
    val2_index = val2_index | (df_playlists.pid.isin(val2_playlist[i]) & (df_playlists.pos >= i))

In [24]:
train = df_playlists[~(val1_index | val2_index)]

val1 = df_playlists[val1_index]
val2 = df_playlists[val2_index]

val1_pids = np.hstack([val1_playlist[i] for i in val1_playlist])
val2_pids = np.hstack([val2_playlist[i] for i in val2_playlist])

In [25]:
train.tail(2)

Unnamed: 0,pid,tid,pos
6650215,99999,240387,108
6650216,99999,42033,109


In [28]:
train.shape, val1.shape, val2.shape

((5251885, 3), (699166, 3), (699166, 3))

In [31]:
train = df_playlists[~(val1_index | val2_index)]

In [None]:
train.to_hdf(f'{SPLITTED_PATH}/train.hdf', key='abc')

val1.to_hdf(f'{SPLITTED_PATH}/val1.hdf', key='abc')
val2.to_hdf(f'{SPLITTED_PATH}/val2.hdf', key='abc')

joblib.dump(val1_pids, f'{SPLITTED_PATH}/val1_pids.pkl')
joblib.dump(val2_pids, f'{SPLITTED_PATH}/val2_pids.pkl')

['data/splitted/val2_pids.pkl']

In [None]:
val1_pids.shape

(10000,)

## 1. Train first level model

#### 1.1 LightFM without features

In [32]:
X_train = sp.coo_matrix(
    (np.ones(len(train)), (train.pid, train.tid)),
    shape=(config['num_playlists'], config['num_tracks'])
)
X_train.shape

(149361, 693339)

In [33]:
model = LightFM(no_components=200, loss='warp', learning_rate=0.02, max_sampled=400, random_state=1, user_alpha=1e-05)

model.fit_partial(X_train, epochs=1, num_threads=2, verbose=1)

Epoch: 100%|██████████| 1/1 [05:55<00:00, 355.49s/it]


In [None]:
config['model_path'] = 'models/lightfm_model.pkl'

with open(config['model_path'], 'wb') as f:
    joblib.dump(model, f)

#### 1.2 LightFM with user (playlist) features

In [34]:
playlist_name1 = df_playlists_test_info.set_index('pid').name
playlist_name2 = df_playlists_info.set_index('pid').name
playlist_name = pd.concat([playlist_name1, playlist_name2]).sort_index()
playlist_name = playlist_name.reindex(np.arange(config['num_playlists'])).fillna('')

vectorizer = CountVectorizer(max_features=20000)
user_features = vectorizer.fit_transform(playlist_name)

user_features = sp.hstack([user_features, sp.eye(config['num_playlists'])])

In [35]:
model_text = LightFM(
    no_components=200,
    loss='warp',
    learning_rate=0.03,
    max_sampled=400,
    random_state=1,
    user_alpha=1e-05,
)

model_text.fit_partial(X_train, epochs=1, num_threads=2, user_features=user_features, verbose=1)

Epoch: 100%|██████████| 1/1 [09:02<00:00, 542.85s/it]


<lightfm.lightfm.LightFM at 0x7d5171187610>

In [None]:
config['model_text_path'] = 'models/lightfm_model_text.pkl'

with open(config['model_text_path'], 'rb') as f:
    model_text = joblib.load(f)

## 2. Generate candidates

In [None]:
model = joblib.load(open('models/lightfm_model.pkl', 'rb'))
model_text = joblib.load(open('models/lightfm_model_text.pkl', 'rb'))

In [None]:
train = pd.read_hdf('data/splitted/train.hdf')
val1 = pd.read_hdf('data/splitted/val1.hdf')
val1_pids = joblib.load('data/splitted/val1_pids.pkl')
val2 = pd.read_hdf('data/splitted/val2.hdf')
val2_pids = joblib.load('data/splitted/val2_pids.pkl')

In [None]:
import pickle

with open('data/config.config', 'rb') as f:
    config = pickle.load(f)

In [36]:
user_seen = set(zip(train.pid, train.tid))

In [37]:
def save_candidates(model, model_text, target_pids, file_name, df=None, K=1000):
    target_pids_text = list(set(target_pids).difference(train.pid))
    target_pids_no_text = list(set(target_pids).difference(target_pids_text))

    if df is not None:
        val_tracks = df.groupby('pid').tid.apply(set).to_dict()

    pids = []
    tids = []
    targets = []

    for pid in tqdm.tqdm(target_pids):
        if pid in target_pids_text:
            scores = model_text.predict(
                [pid] * config['num_tracks'],
                list(range(config['num_tracks'])),
                user_features=user_features,
                num_threads=2,
            )
        else:
            scores = model.predict(
                [pid] * config['num_tracks'],
                list(range(config['num_tracks'])),
                num_threads=2,
            )

        candidate_tids = list(np.argpartition(scores, -K)[-K:])
        rel = scores[tids]
        pids += [pid] * K
        tids += candidate_tids

        if df is not None:
            tracks_t = val_tracks[pid]
            targets += [i in tracks_t for i in candidate_tids]

    candidates = pd.DataFrame()
    candidates['pid'] = np.array(pids)
    candidates['tid'] = np.array(tids)

    if df is not None:
        candidates['target'] = np.array(targets).astype(int)

    index = []
    for pid, tid in candidates[['pid', 'tid']].values:
        index.append((pid, tid) not in user_seen)

    candidates = candidates[index]
    candidates.to_hdf(file_name, key='abc')

In [None]:
save_candidates(
    model,
    model_text,
    val1_pids,
    'data/candidates/ii_candidate.hdf',
    val1
)

In [None]:
save_candidates(
    model,
    model_text,
    val2_pids,
    val2.pid.value_counts(),
    'data/candidates/iii_candidate.hdf',
    val2
)

In [None]:
save_candidates(
    model,
    model_text,
    df_playlists_test_info.pid.values,
    df_playlists_test_info.set_index('pid').num_holdouts,
    'data/candidates/test_candidate.hdf'
)

In [38]:
!ls data/candidates/

ii_candidate.hdf	      iii_co_occurence_features.hdf  test_candidate.hdf
ii_co_occurence_features.hdf  iii_lightfm_features.hdf	     test_co_occurence_features.hdf
iii_candidate.hdf	      ii_lightfm_features.hdf	     test_lightfm_features.hdf


In [None]:
!ls data/candidates/

ii_candidate.hdf	  ii_lightfm_features.hdf	  test_lightfm_features.hdf
iii_candidate.hdf	  test_candidate.hdf
iii_lightfm_features.hdf  test_co_occurence_features.hdf


## 3. Generate features

1. Rank and score from LightFM, LightFM_text
2. Dot product and biases from LightFM, LightFM_text
3. Co-occurence statistics features


#### 3.1 lightfm features

In [None]:
model = joblib.load(open('models/lightfm_model.pkl', 'rb'))
model_text = joblib.load(open('models/lightfm_model_text.pkl', 'rb'))

In [39]:
def create_lightfm_features(model, model_text, df):
    user_biases_text, user_embeddings_text = model_text.get_user_representations()

    df['pid_bias'] = model.user_biases[df.pid]
    df['tid_bias'] = model.item_biases[df.tid]

    pid_embeddings = model.user_embeddings[df.pid]
    tid_embeddings = model.item_embeddings[df.tid]

    df['lightfm_dot_product'] = (pid_embeddings * tid_embeddings).sum(axis=1)
    df['lightfm_prediction'] = df['lightfm_dot_product'] + df['pid_bias'] + df['tid_bias']

    df['lightfm_rank'] = df.groupby('pid')['lightfm_prediction'].rank(ascending=False)

    df['pid_bias_text'] = user_biases_text[df.pid]
    df['tid_bias_text'] = model_text.item_biases[df.tid]

    pid_embeddings = user_embeddings_text[df.pid]
    tid_embeddings = model_text.item_embeddings[df.tid]

    df['lightfm_dot_product_text'] = (pid_embeddings * tid_embeddings).sum(axis=1)
    df['lightfm_prediction_text'] = df['lightfm_dot_product_text'] + df['pid_bias_text'] + df['tid_bias_text']

    df['lightfm_rank_text'] = df.groupby('pid')['lightfm_prediction_text'].rank(ascending=False)

In [None]:
train = pd.read_hdf(f'{CANDIDATES_PATH}/ii_candidate.hdf')
val = pd.read_hdf(f'{CANDIDATES_PATH}/iii_candidate.hdf')

In [None]:
train.head(2)

Unnamed: 0,pid,tid,target
0,20612,6509,0
1,20612,168,0


In [None]:
create_lightfm_features(model, model_text, train)
create_lightfm_features(model, model_text, val)

In [None]:
train.head(2)

Unnamed: 0,pid,tid,target,pid_bias,tid_bias,lightfm_dot_product,lightfm_prediction,lightfm_rank,pid_bias_text,tid_bias_text,lightfm_dot_product_text,lightfm_prediction_text,lightfm_rank_text
0,20612,6509,0,0.0,0.581104,-0.003705,0.577398,885.0,0.0,0.47012,0.001971,0.472092,859.0
1,20612,168,0,0.0,0.540023,-0.004621,0.535403,945.0,0.0,0.521509,0.001236,0.522744,599.0


In [None]:
train.to_hdf(f'{CANDIDATES_PATH}/ii_lightfm_features.hdf', key='abc')
val.to_hdf(f'{CANDIDATES_PATH}/iii_lightfm_features.hdf', key='abc')

#### 3.2. Co-occurence features

In [None]:
!ls $HDFS_PATH

df_playlists.hdf       df_playlists_test.hdf	   df_tracks.hdf
df_playlists_info.hdf  df_playlists_test_info.hdf


In [None]:
data = pd.read_hdf(f'data/splitted/train.hdf')
data = data.drop_duplicates(['pid', 'tid'])

In [None]:
num_items = data.tid.max() + 1
num_users =  data.pid.max() + 1

In [None]:
from collections import defaultdict

co_occurence = [defaultdict(int) for i in range(num_items)]
occurence = [0 for i in range(num_items)]
for q, (_, df) in enumerate(tqdm.tqdm(data.groupby('pid'))):
    if q % 100000 == 0:
        print(q / 10000)
    tids = list(df.tid)
    for i in tids:
        occurence[i] += 1
    for k, i in enumerate(tids):
        for j in tids[k + 1:]:
            co_occurence[i][j] += 1
            co_occurence[j][i] += 1

  0%|          | 43/107000 [00:00<12:24, 143.76it/s]

0.0


 10%|█         | 10931/107000 [00:44<06:32, 244.93it/s]


KeyboardInterrupt: ignored

In [None]:
def get_f(i, f):
    if len(i) == 0:
        return -1
    else:
        return f(i)

def create_co_occurence_features(df):
    pids = df.pid.unique()
    seed = train[data.pid.isin(pids)]
    tid_seed = seed.groupby('pid', group_keys=False).tid.apply(list)

    co_occurence_seq = []
    for pid, tid in df[['pid', 'tid']].values:
        tracks = tid_seed.get(pid, [])
        co_occurence_seq.append(np.array([co_occurence[tid][i] for i in tracks]))

    df['co_occurence_max'] = [get_f(i, np.max) for i in co_occurence_seq]
    df['co_occurence_min'] = [get_f(i, np.min) for i in co_occurence_seq]
    df['co_occurence_mean'] = [get_f(i, np.mean) for i in co_occurence_seq]
    df['co_occurence_median'] = [get_f(i, np.median) for i in co_occurence_seq]

    co_occurence_seq = []
    for pid, tid in df[['pid', 'tid']].values:
        tracks = tid_seed.get(pid, [])
        co_occurence_seq.append(np.array([co_occurence[tid][i] / occurence[i] for i in tracks]))

    df['co_occurence_norm_max'] = [get_f(i, np.max) for i in co_occurence_seq]
    df['co_occurence_norm_min'] = [get_f(i, np.min) for i in co_occurence_seq]
    df['co_occurence_norm_mean'] = [get_f(i, np.mean) for i in co_occurence_seq]
    df['co_occurence_norm_median'] = [get_f(i, np.median) for i in co_occurence_seq]

In [None]:
train_cooc = pd.read_hdf('data/candidates/ii_candidate.hdf')
val_cooc = pd.read_hdf('data/candidates/iii_candidate.hdf')
# test_cooc = pd.read_hdf('candidates9/test_candidate.hdf')

In [None]:
%%time
create_co_occurence_features(train_cooc)
create_co_occurence_features(val_cooc)
create_co_occurence_features(test_cooc)

In [None]:
train_cooc.head(2)

## 4. Second level model

In [52]:
train = pd.read_hdf(f'{CANDIDATES_PATH}/ii_candidate.hdf')
val = pd.read_hdf(f'{CANDIDATES_PATH}/iii_candidate.hdf')

In [53]:
train.head(2)

Unnamed: 0,pid,tid,target
0,20612,6509,0
1,20612,168,0


In [54]:
train_lightfm = pd.read_hdf(f'{CANDIDATES_PATH}/ii_lightfm_features.hdf').drop('target', axis=1)
val_lightfm = pd.read_hdf(f'{CANDIDATES_PATH}/iii_lightfm_features.hdf').drop('target', axis=1)

train = train.merge(train_lightfm, on=['pid', 'tid'])
val = val.merge(val_lightfm, on=['pid', 'tid'])

In [None]:
train.head(2)

In [None]:
text_cols = [col for col in train.columns if '_text' in col]
text_cols

['pid_bias_text',
 'tid_bias_text',
 'lightfm_dot_product_text',
 'lightfm_prediction_text',
 'lightfm_rank_text']

In [None]:
train = train.drop(text_cols, axis=1)
val = val.drop(text_cols, axis=1)

In [None]:
train.head(2)

Unnamed: 0,pid,tid,target,pid_bias,tid_bias,lightfm_dot_product,lightfm_prediction,lightfm_rank
0,20612,6509,0,0.0,0.581104,-0.003705,0.577398,885.0
1,20612,168,0,0.0,0.540023,-0.004621,0.535403,945.0


In [None]:
data = pd.read_hdf('data/splitted/train.hdf')
train_holdouts = pd.read_hdf('data/splitted/val1.hdf')
val_holdouts = pd.read_hdf('data/splitted/val2.hdf')

val_length = val_holdouts.groupby('pid').tid.nunique()

In [None]:
train_lightfm = pd.read_hdf(f'{CANDIDATES_PATH}/ii_lightfm_features.hdf').drop('target', axis=1)
val_lightfm = pd.read_hdf(f'{CANDIDATES_PATH}/iii_lightfm_features.hdf').drop('target', axis=1)

train = train.merge(train_lightfm, on=['pid', 'tid'])
val = val.merge(val_lightfm, on=['pid', 'tid'])

In [None]:
train_co = pd.read_hdf(f'{CANDIDATES_PATH}/ii_co_occurence_features.hdf').drop('target', axis=1)
val_co = pd.read_hdf(f'{CANDIDATES_PATH}/iii_co_occurence_features.hdf').drop('target', axis=1)

train = train.merge(train_co, on=['pid', 'tid'])
val = val.merge(val_co, on=['pid', 'tid'])

In [None]:
cols = ['pid', 'tid', 'target']
xgtrain = xgboost.DMatrix(
    train.drop(cols, axis=1),
    train.target,
    weight=compute_sample_weight(
        class_weight='balanced', y=train['target']
    )
)

In [None]:
xgval = xgboost.DMatrix(
    val.drop(cols, axis=1),
    val.target,
    weight=compute_sample_weight(
        class_weight='balanced', y=val['target']
    )
)

With LightFM features

In [None]:
%%time
params = {
    'objective':'binary:logistic',
    'eta':0.1,
    'booster':'gbtree',
    'max_depth':7,
    'nthread':2,
    'seed':1,
    'eval_metric':'auc',
}

model_level2 = xgboost.train(
    params=list(params.items()),
    early_stopping_rounds=30,
    verbose_eval=10,
    dtrain=xgtrain,
    evals=[(xgtrain, 'train'), (xgval, 'test')],
    num_boost_round=50,
)

[0]	train-auc:0.81358	test-auc:0.80946
[10]	train-auc:0.82212	test-auc:0.81782
[20]	train-auc:0.82447	test-auc:0.81977
[30]	train-auc:0.82594	test-auc:0.82082
[40]	train-auc:0.82686	test-auc:0.82128
[49]	train-auc:0.82753	test-auc:0.82149
CPU times: user 6min 2s, sys: 12.5 s, total: 6min 14s
Wall time: 5min 23s


In [None]:
%%time
p = model_level2.predict(xgval)
val['p'] = p

CPU times: user 181 ms, sys: 23 ms, total: 204 ms
Wall time: 123 ms


In [None]:
val_length = val_holdouts.groupby('pid').tid.nunique()

scores = []
for pid, df, in val.sort_values('p', ascending=False).groupby('pid'):
    n = val_length[pid]
    scores.append(df[:n].target.sum() / n)
np.mean(scores)

0.10201097201438063

Without LightFM features

In [None]:
%%time
params = {
    'objective':'binary:logistic',
    'eta':0.1,
    'booster':'gbtree',
    'max_depth':7,
    'nthread':2,
    'seed':1,
    'eval_metric':'auc',
}

model_level2 = xgboost.train(
    params=list(params.items()),
    early_stopping_rounds=30,
    verbose_eval=10,
    dtrain=xgtrain,
    evals=[(xgtrain, 'train'), (xgval, 'test')],
    num_boost_round=50,
)

In [None]:
%%time
p = model_level2.predict(xgval)
val['p'] = p

CPU times: user 12.2 s, sys: 88.7 ms, total: 12.2 s
Wall time: 6.3 s


In [None]:
val_length = val_holdouts.groupby('pid').tid.nunique()

scores = []
for pid, df, in val.sort_values('p', ascending=False).groupby('pid'):
    n = val_length[pid]
    scores.append(df[:n].target.sum() / n)
np.mean(scores)

0.099752874607483