# Introduction

In this assignment you will continue working with the data from the seminar: [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

In [1]:
# !pip install lightfm
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer
import numpy as np
import math
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import string 



## Loading and preprocessing the data

We load the data and preprocess it the same way as in the seminar.

In [2]:
!curl -L -o articles.zip 'https://www.dropbox.com/s/z8syrl5trawxs0n/articles.zip?dl=0'
# !wget -q -N https://www.dropbox.com/s/z8syrl5trawxs0n/articles.zip?dl=0 -O articles.zip
!unzip -o -q articles.zip


In [3]:
articles_df = pd.read_csv('articles/shared_articles.csv')
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [4]:
interactions_df = pd.read_csv('articles/users_interactions.csv')
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [5]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [6]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

interactions_df['eventStrength'] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

We keep only the users who interacted with at least five articles.

In [7]:
users_interactions_count_df = (
    interactions_df
    .groupby(['personId', 'contentId'])
    .first()
    .reset_index()
    .groupby('personId').size())
print('# users:', len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions:',len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


We keep only the interactions that belong to the selected users.

In [8]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)]

In [9]:
print('# interactions before:', interactions_df.shape)
print('# interactions after:', interactions_from_selected_users_df.shape)

# interactions before: (72312, 9)
# interactions after: (69868, 9)


We sum the strengths of all of a user's interactions with each article and smooth the total by taking its logarithm, `log2(1 + x)`.

In [10]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId']).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(['personId', 'contentId'])
)
interactions_full_df['last_timestamp'] = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId'])['timestamp'].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
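
To make the effect of the smoothing concrete, here is a small standalone sketch that applies the same `log2(1 + x)` transform to summed event strengths. The event histories are made up for illustration, and `smoothed_strength` is a hypothetical helper, not part of the notebook.

```python
import math

# Same weights as event_type_strength in the notebook
event_type_strength = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0,
}

def smooth_user_preference(x):
    # log2(1 + x) grows slowly, so many repeated events do not dominate
    return math.log(1 + x, 2)

def smoothed_strength(events):
    """Sum the strengths of one user's events on one article, then smooth."""
    total = sum(event_type_strength[e] for e in events)
    return smooth_user_preference(total)

# Hypothetical event histories for a single (user, article) pair
single_view = smoothed_strength(['VIEW'])            # sum = 1.0
two_views = smoothed_strength(['VIEW', 'VIEW'])      # sum = 2.0
view_and_like = smoothed_strength(['VIEW', 'LIKE'])  # sum = 3.0
```

A single view maps to 1.0, and a summed strength of 2.0 maps to about 1.585, the same value that appears in the `eventStrength` column above.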


We split the data into training and test sets by time.

In [11]:
split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675


To make quality evaluation easier, we store the data in a format where each row corresponds to a user and the columns hold the ground-truth labels and the predictions as lists.

In [12]:
interactions = (
    interactions_train_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={'contentId': 'true_train'})
    .set_index('personId')
)

interactions['true_test'] = (
    interactions_test_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
)

# fill missing values with empty lists
interactions.loc[pd.isnull(interactions.true_test), 'true_test'] = [
    list() for x in range(len(interactions.loc[pd.isnull(interactions.true_test), 'true_test']))]

interactions.head(1)

personId,true_train,true_test
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."


## The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. As in the seminar, we evaluate recommendation quality with the *precision@10* metric.
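
As a reminder of what the metric measures, here is a minimal pure-Python sketch of precision@k. The helper names `precision_for_user` and `mean_precision` and the toy data are assumptions made for illustration; the notebook itself relies on `precision_at_k` from `lightfm.evaluation`.

```python
def precision_for_user(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def mean_precision(recommendations, ground_truth, k=10):
    """Average precision@k over users with at least one test item."""
    scores = [
        precision_for_user(recommendations[user], items, k)
        for user, items in ground_truth.items()
        if items  # users without test interactions are skipped
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: u1 has two hits in the top 4, u2 has no test items
recommendations = {'u1': ['a', 'b', 'c', 'd'], 'u2': ['d', 'c', 'b', 'a']}
ground_truth = {'u1': ['b', 'd', 'z'], 'u2': []}
score = mean_precision(recommendations, ground_truth, k=4)
```

Skipping users without test interactions mirrors what LightFM's `precision_at_k` is documented to do by default (`preserve_rows=False`).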

## Task 1. (2 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [13]:
def to_sparse_matrix(df, full_df):
    dataset = Dataset()
    dataset.fit(full_df.personId.unique(), full_df.contentId.unique())
    interactions_matrix, weights_matrix = dataset.build_interactions([tuple(i) for i in df.drop(['last_timestamp'], axis = 1).values])
    return weights_matrix

In [14]:
# Your code here
data_train = to_sparse_matrix(interactions_train_df, interactions_full_df)
data_test = to_sparse_matrix(interactions_test_df, interactions_full_df)

A few sanity checks on the resulting matrix:

In [15]:
dataset = Dataset()
dataset.fit(interactions_full_df.personId.unique(), interactions_full_df.contentId.unique())
interactions_matrix, weights_matrix = dataset.build_interactions([tuple(i) for i in interactions_train_df.drop(['last_timestamp'], axis = 1).values])
mapping = dataset.mapping()

In [16]:
assert(interactions_train_df.loc[(interactions_train_df['personId'] == '-1007001694607905623') & (interactions_train_df['contentId'] == '6881796783400625893')].empty)
assert(weights_matrix.toarray()[mapping[0]['-1007001694607905623'], mapping[2]['6881796783400625893']] == .0)

In [17]:
assert(not interactions_train_df.loc[(interactions_train_df['personId'] == '-1032019229384696495') & (interactions_train_df['contentId'] == '-1081723567492738167')].empty)
assert(weights_matrix.toarray()[mapping[0]['-1032019229384696495'], mapping[2]['-1081723567492738167']] == 2.0)
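
For comparison, the same kind of user-by-item weight matrix can be assembled directly with `scipy.sparse`, without the LightFM `Dataset` helper. This is only a sketch on made-up data; `build_weight_matrix` is a hypothetical function, and the index dictionaries play the role of `dataset.mapping()`.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_weight_matrix(triples, user_ids, item_ids):
    """Build a users x items sparse matrix from (user, item, weight) triples."""
    user_index = {u: i for i, u in enumerate(user_ids)}
    item_index = {it: j for j, it in enumerate(item_ids)}
    rows = [user_index[u] for u, _, _ in triples]
    cols = [item_index[it] for _, it, _ in triples]
    vals = [w for _, _, w in triples]
    shape = (len(user_ids), len(item_ids))
    matrix = coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.float32)
    return matrix, user_index, item_index

# Toy data, made up for illustration
users = ['u1', 'u2', 'u3']
items = ['a', 'b', 'c', 'd']
triples = [('u1', 'a', 1.0), ('u1', 'c', 2.0), ('u3', 'd', 1.584963)]

matrix, user_index, item_index = build_weight_matrix(triples, users, items)
dense = matrix.toarray()
```

Fitting the index dictionaries on the full set of users and items, as `to_sparse_matrix` does with `full_df`, keeps the train and test matrices the same shape with matching row and column meanings.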

## Task 2. (1 point)

Train a LightFM model with `loss='warp'` and compute *precision@10* on the test set.

In [106]:
# Your code here
model = LightFM(loss='warp', random_state=42)
model.fit(data_train)
print(f'precision@10 = {precision_at_k(model, data_test).mean()}')

precision@10 = 0.003564154729247093


## Task 3. (3 points)

When calling `fit`, LightFM lets you pass a feature description of the items through `item_features`. Let's make use of this. We will derive the features from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by feature dimensionality, train LightFM with `loss='warp'`, and compute precision@10 on the test set.

In [45]:
def build_item_features(full_df, articles_df):
    dataset = Dataset()

    vectorizer = TfidfVectorizer()
    tf_idfs = vectorizer.fit_transform(articles_df.text) 
    contentIds = articles_df.contentId.to_numpy()
    
    dataset.fit_partial(items=articles_df.contentId)
    dataset.fit_partial(item_features=tf_idfs.toarray().flatten())

    return dataset.build_item_features(list(zip(contentIds, tf_idfs.toarray())))

In [107]:
# Your code here
feat = build_item_features(interactions_full_df, articles_df)
model = LightFM(loss='warp', random_state=42)
model.fit(data_train, item_features=feat)
print(f'precision@10 = {precision_at_k(model, data_test, item_features=feat).mean()}')

precision@10 = 0.0016293278895318508
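
One possible reason for the drop in quality is how `feat` is built. In `build_item_features` the flattened TF-IDF values themselves are registered as feature names, and the items are indexed by a separate `Dataset` fitted on `articles_df`, so the rows of `feat` need not line up with the columns of `data_train`. The sketch below shows one alternative, assuming the item index comes from the same mapping as the interaction matrix: it reorders the TF-IDF rows by that index and prepends a per-item identity block. `build_aligned_item_features` is a hypothetical helper, the toy data is made up, and unique content IDs are assumed.

```python
import numpy as np
from scipy.sparse import csr_matrix, identity, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def build_aligned_item_features(item_index, content_ids, texts):
    """Return a CSR matrix of shape (n_items, n_items + vocab_size) whose
    row j describes the item stored in column j of the interaction matrix."""
    tfidf = TfidfVectorizer().fit_transform(texts).tocsr()
    n_items = len(item_index)
    # Keep only articles that are known items and remember their target row
    src_rows, dst_rows = [], []
    for src, cid in enumerate(content_ids):
        if cid in item_index:
            src_rows.append(src)
            dst_rows.append(item_index[cid])
    # Selection matrix: copies TF-IDF row `src` into output row `dst`
    selector = csr_matrix(
        (np.ones(len(src_rows)), (dst_rows, src_rows)),
        shape=(n_items, tfidf.shape[0]),
    )
    text_features = selector @ tfidf
    # Identity block so that items without text still get their own feature
    combined = hstack([identity(n_items, format='csr'), text_features])
    return combined.tocsr().astype(np.float32)

# Toy data: item 'c' has no article text, article 'x' is not a known item
item_index = {'b': 0, 'a': 1, 'c': 2}
content_ids = ['a', 'b', 'x']
texts = ['bitcoin price rises', 'python pandas tutorial', 'unrelated article']
feat_aligned = build_aligned_item_features(item_index, content_ids, texts)
```

In the notebook, `item_index` would correspond to `dataset.mapping()[2]` of the `Dataset` used for the interactions, and the resulting matrix could be passed as `item_features` to both `fit` and `precision_at_k`. Whether this actually improves precision@10 on this data has not been verified here.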


## Task 4. (2 points)

In Task 3 we used the raw article text. In this task you first need to preprocess the text (convert it to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

In [47]:
articles_df.lang.unique()

array(['en', 'pt', 'es', 'la', 'ja'], dtype=object)

In [48]:
tokenizer = WordPunctTokenizer()
nltk.download('stopwords')
stop_words = set(stopwords.words('portuguese'))
stop_words.update(set(stopwords.words('english')))
stop_words.update(set(stopwords.words('spanish')))
# extra punctuation and symbols which are not deleted by string.punctuation
stop_words.update(set([',"', '—', '\n', '."', '...', ',\'', '°', '!)', ')"','//', '),', '->', '✔', '")', ').']))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mmanakhova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [49]:
def build_item_features_with_preprocessing(full_df, articles_df):
    dataset = Dataset()

    vectorizer = TfidfVectorizer()
    tf_idfs = vectorizer.fit_transform(pd.Series([lemmatize(remove_extra_parts(preprocess(text))) for text in articles_df.text])) 
    contentIds = articles_df.contentId.to_numpy()
    
    dataset.fit_partial(items=articles_df.contentId)
    dataset.fit_partial(item_features=tf_idfs.toarray().flatten())

    return dataset.build_item_features(list(zip(contentIds, tf_idfs.toarray())))

def preprocess(text):
    return ' '.join(tokenizer.tokenize(text.lower()))

def remove_extra_parts(sentence):
    words = sentence.split(' ')
    new_sentence = []
    for word in words:
        if word in stop_words or word.isdigit() or word in string.punctuation:
            continue
        new_sentence.append(word)
        
    return ' '.join(new_sentence) 

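def lemmatize(text, lang='en'):
    # Note: this is stemming (SnowballStemmer), not true lemmatization.
    # The caller above relies on the default lang='en', so every article is stemmed as English.
    stemmer = SnowballStemmer('portuguese') if lang == 'pt' else SnowballStemmer('english')
    return ' '.join(stemmer.stem(token) for token in text.split())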

In [108]:
feat = build_item_features_with_preprocessing(interactions_full_df, articles_df)
model = LightFM(loss='warp', random_state=42)
model.fit(data_train, item_features=feat)
print(f'precision@10 = {precision_at_k(model, data_test, item_features=feat).mean()}')

precision@10 = 0.0018329938175156713
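
The preprocessing above treats every article as English, although `articles_df.lang` contains several languages. A possible refinement is to choose the language-specific resources per article. The sketch below illustrates the idea with the standard library only; the tiny stop-word lists and the `clean_text` helper are assumptions for illustration, and stemming is left out.

```python
import re

# Tiny illustrative stop-word lists; the notebook uses NLTK's full lists
STOP_WORDS = {
    'en': {'the', 'a', 'of', 'is', 'and', 'at', 'with'},
    'pt': {'o', 'a', 'de', 'e', 'que', 'com'},
}

TOKEN_RE = re.compile(r'\w+')

def clean_text(text, lang='en'):
    """Lowercase, tokenize, and drop stop words and purely numeric tokens."""
    stop_words = STOP_WORDS.get(lang, set())
    tokens = TOKEN_RE.findall(text.lower())  # punctuation is not matched
    kept = [t for t in tokens if t not in stop_words and not t.isdigit()]
    return ' '.join(kept)

english = clean_text('The alarm clock wakes me at 8:00 with a stream of news!', 'en')
portuguese = clean_text('O preço de Bitcoin e Ethereum', 'pt')
```

Applied to the notebook's data, this would look like `[clean_text(t, l) for t, l in zip(articles_df.text, articles_df.lang)]`, with a language-specific stemmer applied to the kept tokens afterwards.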


Did the prediction quality improve?

The prediction quality improved only slightly: precision@10 rose from about 0.0016 to about 0.0018 ¯\\\_(ツ)\_/¯.

## Task 5. (2 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve its quality.

In [110]:
# Your code here
best_nc = 0
best_precision = 0
nc_range = np.arange(1, 100, 9)
for parameter_value in nc_range:
    model = LightFM(loss='warp', no_components=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'no_components = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_nc = parameter_value
        best_precision = precision        

no_components = 1 	| precision = 0.0019348268397152424
no_components = 10 	| precision = 0.0018329938175156713
no_components = 19 	| precision = 0.001731160911731422
no_components = 28 	| precision = 0.001731160911731422
no_components = 37 	| precision = 0.001731160911731422
no_components = 46 	| precision = 0.0018329938175156713
no_components = 55 	| precision = 0.001731160911731422
no_components = 64 	| precision = 0.0016293278895318508
no_components = 73 	| precision = 0.0016293278895318508
no_components = 82 	| precision = 0.0016293278895318508
no_components = 91 	| precision = 0.0016293278895318508


In [111]:
best_lr = None
lr_range = [.001, .005, .0001, .00005, .00001, .000005, .000001]
for parameter_value in lr_range:
    model = LightFM(loss='warp', learning_rate=parameter_value, no_components=best_nc, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'learning rate = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_lr = parameter_value
        best_precision = precision

learning rate = 0.001 	| precision = 0.00295315682888031
learning rate = 0.005 	| precision = 0.0024439918342977762
learning rate = 0.0001 	| precision = 0.0031568228732794523
learning rate = 5e-05 	| precision = 0.0031568228732794523
learning rate = 1e-05 	| precision = 0.0031568228732794523
learning rate = 5e-06 	| precision = 0.0031568228732794523
learning rate = 1e-06 	| precision = 0.0031568228732794523


In [112]:
best_lr = 0.05 if best_lr == None else best_lr
best_ms = None
ms_range = np.arange(1, 100, 9)
for parameter_value in ms_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'max_sampled = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ms = parameter_value
        best_precision = precision

max_sampled = 1 	| precision = 0.0031568228732794523
max_sampled = 10 	| precision = 0.0031568228732794523
max_sampled = 19 	| precision = 0.0031568228732794523
max_sampled = 28 	| precision = 0.0031568228732794523
max_sampled = 37 	| precision = 0.0031568228732794523
max_sampled = 46 	| precision = 0.0031568228732794523
max_sampled = 55 	| precision = 0.0031568228732794523
max_sampled = 64 	| precision = 0.0031568228732794523
max_sampled = 73 	| precision = 0.0031568228732794523
max_sampled = 82 	| precision = 0.0031568228732794523
max_sampled = 91 	| precision = 0.0031568228732794523


In [113]:
best_ms = 10 if best_ms == None else best_ms
best_ia = 0
ia_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ia_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'item_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ia = parameter_value
        best_precision = precision

item_alpha = 0.1 	| precision = 0.0031568228732794523
item_alpha = 0.2 	| precision = 0.0031568228732794523
item_alpha = 0.3 	| precision = 0.0031568228732794523
item_alpha = 0.4 	| precision = 0.0031568228732794523
item_alpha = 0.5 	| precision = 0.0031568228732794523
item_alpha = 0.6 	| precision = 0.0031568228732794523
item_alpha = 0.7 	| precision = 0.0031568228732794523
item_alpha = 0.8 	| precision = 0.0031568228732794523
item_alpha = 0.9 	| precision = 0.0031568228732794523


In [114]:
best_ia = .0 if best_ia == None else best_ia
best_ua = 0
ua_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ua_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=best_ia, user_alpha=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'user_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ua = parameter_value
        best_precision = precision

user_alpha = 0.1 	| precision = 0.0031568228732794523
user_alpha = 0.2 	| precision = 0.0031568228732794523
user_alpha = 0.3 	| precision = 0.0031568228732794523
user_alpha = 0.4 	| precision = 0.0031568228732794523
user_alpha = 0.5 	| precision = 0.0031568228732794523
user_alpha = 0.6 	| precision = 0.0032586557790637016
user_alpha = 0.7 	| precision = 0.0032586557790637016
user_alpha = 0.8 	| precision = 0.002851323690265417
user_alpha = 0.9 	| precision = 0.002749490784481168


In [115]:
print('learning_schedule=\'adagrad\'')
print(f'best no components = {best_nc}')
print(f'best learning rate = {best_lr}')
print(f'best max sampled = {best_ms}')
print(f'best item alpha = {best_ia}')
print(f'best user alpha = {best_ua}\n')
print(f'best precision = {best_precision:3f}')

learning_schedule='adagrad'
best no components = 1
best learning rate = 0.0001
best max sampled = 10
best item alpha = 0
best user alpha = 0.6

best precision = 0.003259


These results are poor. Let's tune the original model without `item_features`.

In [117]:
best_nc = 0
best_precision = 0
nc_range = np.arange(1, 100, 9)
for parameter_value in nc_range:
    model = LightFM(loss='warp', no_components=parameter_value, random_state=42)
    model.fit(data_train)
    precision = precision_at_k(model, data_test).mean()
    print(f'no_components = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_nc = parameter_value
        best_precision = precision  

no_components = 1 	| precision = 0.0022403260227292776
no_components = 10 	| precision = 0.003564154729247093
no_components = 19 	| precision = 0.002749490784481168
no_components = 28 	| precision = 0.003564154729247093
no_components = 37 	| precision = 0.002749490784481168
no_components = 46 	| precision = 0.004582484718412161
no_components = 55 	| precision = 0.003665987867861986
no_components = 64 	| precision = 0.0032586557790637016
no_components = 73 	| precision = 0.004073319956660271
no_components = 82 	| precision = 0.0031568228732794523
no_components = 91 	| precision = 0.003971486818045378


In [119]:
best_lr = None
lr_range = [.05, .001, .005, .0001, .00005, .00001, .000005, .000001]
for parameter_value in lr_range:
    model = LightFM(loss='warp', learning_rate=parameter_value, no_components=best_nc, random_state=42)
    model.fit(data_train)
    precision = precision_at_k(model, data_test).mean()
    print(f'learning rate = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_lr = parameter_value
        best_precision = precision

learning rate = 0.05 	| precision = 0.004582484718412161
learning rate = 0.001 	| precision = 0.0037678207736462355
learning rate = 0.005 	| precision = 0.0038696539122611284
learning rate = 0.0001 	| precision = 0.00295315682888031
learning rate = 5e-05 	| precision = 0.0024439918342977762
learning rate = 1e-05 	| precision = 0.0012219959171488881
learning rate = 5e-06 	| precision = 0.0016293278895318508
learning rate = 1e-06 	| precision = 0.002851323690265417


In [120]:
best_lr = 0.05 if best_lr == None else best_lr
best_ms = None
ms_range = np.arange(1, 100, 9)
for parameter_value in ms_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=parameter_value, random_state=42)
    model.fit(data_train)
    precision = precision_at_k(model, data_test).mean()
    print(f'max_sampled = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ms = parameter_value
        best_precision = precision

max_sampled = 1 	| precision = 0.0030549897346645594
max_sampled = 10 	| precision = 0.004582484718412161
max_sampled = 19 	| precision = 0.004276985768228769
max_sampled = 28 	| precision = 0.0036659876350313425
max_sampled = 37 	| precision = 0.004276985768228769
max_sampled = 46 	| precision = 0.0038696536794304848
max_sampled = 55 	| precision = 0.00295315682888031
max_sampled = 64 	| precision = 0.003462321823462844
max_sampled = 73 	| precision = 0.002749490784481168
max_sampled = 82 	| precision = 0.002749490784481168
max_sampled = 91 	| precision = 0.0031568226404488087


In [121]:
best_ms = 10 if best_ms == None else best_ms
best_ia = 0
ia_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ia_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=parameter_value, random_state=42)
    model.fit(data_train)
    precision = precision_at_k(model, data_test).mean()
    print(f'item_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ia = parameter_value
        best_precision = precision

item_alpha = 0.1 	| precision = 0.018024440854787827
item_alpha = 0.2 	| precision = 0.004480652045458555
item_alpha = 0.3 	| precision = 0.007433808874338865
item_alpha = 0.4 	| precision = 0.007535641547292471
item_alpha = 0.5 	| precision = 0.0023421586956828833
item_alpha = 0.6 	| precision = 0.0030549897346645594
item_alpha = 0.7 	| precision = 0.008452138863503933
item_alpha = 0.8 	| precision = 0.0022403260227292776
item_alpha = 0.9 	| precision = 0.00539714889600873


In [122]:
best_ia = .0 if best_ia == None else best_ia
best_ua = 0
ua_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ua_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=best_ia, user_alpha=parameter_value, random_state=42)
    model.fit(data_train)
    precision = precision_at_k(model, data_test).mean()
    print(f'user_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ua = parameter_value
        best_precision = precision

user_alpha = 0.1 	| precision = 0.007128309458494186
user_alpha = 0.2 	| precision = 0.004480652045458555
user_alpha = 0.3 	| precision = 0.013849287293851376
user_alpha = 0.4 	| precision = 0.01771894283592701
user_alpha = 0.5 	| precision = 0.0023421591613441706
user_alpha = 0.6 	| precision = 0.002647657645866275
user_alpha = 0.7 	| precision = 0.00590631365776062
user_alpha = 0.8 	| precision = 0.0030549897346645594
user_alpha = 0.9 	| precision = 0.00590631365776062


In [123]:
print('learning_schedule=\'adagrad\'')
print(f'best no components = {best_nc}')
print(f'best learning rate = {best_lr}')
print(f'best max sampled = {best_ms}')
print(f'best item alpha = {best_ia}')
print(f'best user alpha = {best_ua}\n')
print(f'best precision = {best_precision:3f}')

learning_schedule='adagrad'
best no components = 46
best learning rate = 0.05
best max sampled = 10
best item alpha = 0.1
best user alpha = 0

best precision = 0.018024


These results are much better than with the preprocessed TF-IDF features. Finally, let's try TF-IDF without preprocessing.

In [124]:
feat = build_item_features(interactions_full_df, articles_df)

In [125]:
best_nc = 0
best_precision = 0
nc_range = np.arange(1, 100, 9)
for parameter_value in nc_range:
    model = LightFM(loss='warp', no_components=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'no_components = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_nc = parameter_value
        best_precision = precision   

no_components = 1 	| precision = 0.0016293278895318508
no_components = 10 	| precision = 0.0016293278895318508
no_components = 19 	| precision = 0.001731160911731422
no_components = 28 	| precision = 0.0016293278895318508
no_components = 37 	| precision = 0.0014256618451327085
no_components = 46 	| precision = 0.0016293278895318508
no_components = 55 	| precision = 0.001731160911731422
no_components = 64 	| precision = 0.0013238289393484592
no_components = 73 	| precision = 0.0015274948673322797
no_components = 82 	| precision = 0.0015274948673322797
no_components = 91 	| precision = 0.001731160911731422


In [126]:
best_lr = None
lr_range = [.001, .005, .0001, .00005, .00001, .000005, .000001]
for parameter_value in lr_range:
    model = LightFM(loss='warp', learning_rate=parameter_value, no_components=best_nc, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'learning rate = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_lr = parameter_value
        best_precision = precision

learning rate = 0.001 	| precision = 0.003360488684847951
learning rate = 0.005 	| precision = 0.001832993933930993
learning rate = 0.0001 	| precision = 0.0036659876350313425
learning rate = 5e-05 	| precision = 0.0036659876350313425
learning rate = 1e-05 	| precision = 0.003665987867861986
learning rate = 5e-06 	| precision = 0.003462321823462844
learning rate = 1e-06 	| precision = 0.00295315682888031


In [127]:
best_lr = 0.05 if best_lr == None else best_lr
best_ms = None
ms_range = np.arange(1, 100, 9)
for parameter_value in ms_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'max_sampled = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ms = parameter_value
        best_precision = precision

max_sampled = 1 	| precision = 0.003665987867861986
max_sampled = 10 	| precision = 0.003665987867861986
max_sampled = 19 	| precision = 0.003665987867861986
max_sampled = 28 	| precision = 0.003665987867861986
max_sampled = 37 	| precision = 0.003665987867861986
max_sampled = 46 	| precision = 0.003665987867861986
max_sampled = 55 	| precision = 0.003665987867861986
max_sampled = 64 	| precision = 0.003665987867861986
max_sampled = 73 	| precision = 0.003665987867861986
max_sampled = 82 	| precision = 0.003665987867861986
max_sampled = 91 	| precision = 0.003665987867861986


In [128]:
best_ms = 10 if best_ms == None else best_ms
best_ia = 0
ia_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ia_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'item_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ia = parameter_value
        best_precision = precision

item_alpha = 0.1 	| precision = 0.0037678207736462355
item_alpha = 0.2 	| precision = 0.0037678207736462355
item_alpha = 0.3 	| precision = 0.0037678207736462355
item_alpha = 0.4 	| precision = 0.0037678207736462355
item_alpha = 0.5 	| precision = 0.003665987867861986
item_alpha = 0.6 	| precision = 0.003665987867861986
item_alpha = 0.7 	| precision = 0.003665987867861986
item_alpha = 0.8 	| precision = 0.003665987867861986
item_alpha = 0.9 	| precision = 0.003665987867861986


In [129]:
best_ia = .0 if best_ia == None else best_ia
best_ua = 0
ua_range = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
for parameter_value in ua_range:
    model = LightFM(loss='warp', learning_rate=best_lr, no_components=best_nc, max_sampled=best_ms, item_alpha=best_ia, user_alpha=parameter_value, random_state=42)
    model.fit(data_train, item_features=feat)
    precision = precision_at_k(model, data_test, item_features=feat).mean()
    print(f'user_alpha = {parameter_value} \t| precision = {precision}')
    if precision > best_precision:
        best_ua = parameter_value
        best_precision = precision

user_alpha = 0.1 	| precision = 0.003665987867861986
user_alpha = 0.2 	| precision = 0.003665987867861986
user_alpha = 0.3 	| precision = 0.003665987867861986
user_alpha = 0.4 	| precision = 0.003665987867861986
user_alpha = 0.5 	| precision = 0.003665987867861986
user_alpha = 0.6 	| precision = 0.003665987867861986
user_alpha = 0.7 	| precision = 0.003665987867861986
user_alpha = 0.8 	| precision = 0.003665987867861986
user_alpha = 0.9 	| precision = 0.003665987867861986


In [130]:
print('learning_schedule=\'adagrad\'')
print(f'best no components = {best_nc}')
print(f'best learning rate = {best_lr}')
print(f'best max sampled = {best_ms}')
print(f'best item alpha = {best_ia}')
print(f'best user alpha = {best_ua}\n')
print(f'best precision = {best_precision:3f}')

learning_schedule='adagrad'
best no components = 19
best learning rate = 1e-05
best max sampled = 10
best item alpha = 0.1
best user alpha = 0

best precision = 0.003768


This is slightly better than with the preprocessed TF-IDF features (0.0038 versus 0.0033), but much worse than the model without `item_features` (0.0180).
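
The sweeps above tune one parameter at a time and pick the winner by its score on the test set, so the reported best precision is likely optimistic. A common alternative is to search over parameter combinations jointly and to score them on a separate validation split. The sketch below shows the search loop only; `random_search` is a hypothetical helper and `toy_evaluate` is a stand-in objective, not a real model score.

```python
import itertools
import random

def random_search(evaluate, grid, n_trials=20, seed=42):
    """Try up to n_trials random combinations from `grid`, keep the best."""
    rng = random.Random(seed)
    combos = list(itertools.product(*grid.values()))
    rng.shuffle(combos)
    best_params, best_score = None, float('-inf')
    for values in combos[:n_trials]:
        params = dict(zip(grid.keys(), values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative grid; the values are assumptions, not recommendations
grid = {
    'no_components': [10, 30, 50],
    'learning_rate': [0.005, 0.05],
    'item_alpha': [0.0, 1e-5, 1e-3],
}

def toy_evaluate(params):
    # Stand-in objective with a known optimum, used only to run the loop
    return (-abs(params['no_components'] - 30)
            - abs(params['learning_rate'] - 0.05)
            - params['item_alpha'])

best_params, best_score = random_search(toy_evaluate, grid, n_trials=18)
```

In the notebook, `evaluate` would train `LightFM(loss='warp', random_state=42, **params)` on the training matrix and return the mean `precision_at_k` on a validation matrix carved out of the training period, leaving `data_test` for the final comparison.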

## Bonus task. (3 points)

Above we used a fairly simple representation of the article text, TF-IDF. In this task you need to represent the article text (optionally together with the title) as an embedding obtained from a recurrent network or a transformer (you may use any pretrained model you like). Train the model using these embeddings and compare the results with the previous ones.

In [None]:
# Your code here