# Introduction

In this assignment you will continue working with the dataset from the seminar: [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

# Loading and preprocessing the data

In [18]:
import math
import numpy as np
import pandas as pd
from scipy import sparse
from lightfm import LightFM
from lightfm.evaluation import precision_at_k
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
import optuna

Let's load the data and preprocess it the same way as in the seminar.

In [2]:
!kaggle datasets download -d gspmoreira/articles-sharing-reading-from-cit-deskdrop
!unzip articles-sharing-reading-from-cit-deskdrop.zip -d articles

Traceback (most recent call last):
  File "/home/dinara/anaconda3/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/dinara/anaconda3/lib/python3.9/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/dinara/anaconda3/lib/python3.9/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /home/dinara/.kaggle. Or use the environment method.
unzip:  cannot find or open articles-sharing-reading-from-cit-deskdrop.zip, articles-sharing-reading-from-cit-deskdrop.zip.zip or articles-sharing-reading-from-cit-deskdrop.zip.ZIP.


In [2]:
articles_df = pd.read_csv("articles/shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [3]:
interactions_df = pd.read_csv("articles/users_interactions.csv")
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [4]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [5]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(
    lambda x: event_type_strength[x]
)

We keep only the users who interacted with at least five articles.

In [6]:
users_interactions_count_df = (
    interactions_df.groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId")
    .size()
)
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = users_interactions_count_df[
    users_interactions_count_df >= 5
].reset_index()[["personId"]]
print("# users with at least 5 interactions:", len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


We keep only the interactions that belong to the selected users.

In [7]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)
]

In [8]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

# interactions before: (72312, 9)
# interactions after: (69868, 9)


We sum all of a user's interactions with each article and smooth the result by taking the base-2 logarithm of one plus the sum.

In [9]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)


interactions_full_df = (
    interactions_from_selected_users_df.groupby(["personId", "contentId"])
    .eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index()
    .set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = interactions_from_selected_users_df.groupby(
    ["personId", "contentId"]
)["timestamp"].last()

interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
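As a small worked example of the smoothing step, the sketch below applies the same formula to a few hypothetical event sequences for a single (user, article) pair. The event sequences are made up for illustration; the weights and the `smooth_user_preference` function repeat the definitions used earlier in the notebook. A value of 1.584963, as in the last row of the table, corresponds to a summed strength of 2 (for example, two views or a single like).

```python
import math

# Interaction weights, repeating the notebook's event_type_strength dictionary.
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}


def smooth_user_preference(x):
    return math.log(1 + x, 2)


# Hypothetical event sequences for one (user, article) pair.
examples = {
    "single view": ["VIEW"],
    "two views": ["VIEW", "VIEW"],
    "view + like + comment": ["VIEW", "LIKE", "COMMENT CREATED"],
}

smoothed = {
    name: smooth_user_preference(sum(event_type_strength[e] for e in events))
    for name, events in examples.items()
}

for name, value in smoothed.items():
    print(f"{name}: {value:.6f}")
```

The logarithm compresses heavy users of a single article: seven units of raw strength map to a value only three times larger than a single view.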


We split the data into train and test sets by time.

In [10]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp < split_ts
].copy()
interactions_test_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp >= split_ts
].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675


To make quality evaluation easier, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [11]:
interactions = (
    interactions_train_df.groupby("personId")["contentId"]
    .agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = interactions_test_df.groupby("personId")["contentId"].agg(
    lambda x: list(x)
)

# fill missing values with empty lists
interactions["true_test"] = interactions["true_test"].apply(
    lambda x: x if isinstance(x, list) else []
)

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."


# The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. To evaluate recommendation quality we will use the *precision@10* metric, as in the seminar.

In [13]:
#!pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
     |████████████████████████████████| 316 kB 684 kB/s eta 0:00:01
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp39-cp39-linux_x86_64.whl size=449828 sha256=0bc310f10d91778f52c9759d0564b7c45bb8743f5cd1e82859b67d502e665fe2
  Stored in directory: /home/dinara/.cache/pip/wheels/d8/65/93/6ac8180274dc2e8f86ff326be62da1dfa55dc158fd45faba7d
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


## Task 1 (1.5 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [12]:
# Your code here
data_train = pd.pivot_table(
    interactions_full_df,
    values='eventStrength',
    index='personId',
    columns='contentId').fillna(0)
data_test = pd.pivot_table(
    interactions_full_df,
    values='eventStrength',
    index='personId',
    columns='contentId').fillna(0)

In [13]:
data_train.sample(5)

contentId,-1006791494035379303,-1021685224930603833,-1022885988494278200,-1024046541613287684,-1033806831489252007,-1038011342017850,-1039912738963181810,-1046621686880462790,-1051830303851697653,-1055630159212837930,...,9222265156747237864,943818026930898372,957332268361319692,962287586799267519,966067567430037498,967143806332397325,972258375127367383,980458131533897249,98528655405030624,991271693336573226
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5867547777922915203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-3815514359891076081,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7984451585080990517,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2032181124951647555,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5200962297848391505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


In [14]:
for person in tqdm(list(interactions_full_df.personId.unique())):
    if person in list(interactions_train_df.personId.unique()):
        contents_df = interactions_train_df[interactions_train_df.personId == person]
        for paper in list(data_train.loc[person].index):
            if paper not in list(contents_df.contentId):
                # zero out interactions that fall outside the train period;
                # .loc avoids chained assignment, which may write to a copy
                data_train.loc[person, paper] = 0.0
    else:
        # the user has no train interactions at all
        for paper in list(data_train.loc[person].index):
            data_train.loc[person, paper] = 0.0

100%|████████████████████████████████████████████| 1140/1140 [05:44<00:00,  3.31it/s]


In [15]:
for person in tqdm(list(interactions_full_df.personId.unique())):
    if person in list(interactions_test_df.personId.unique()):
        contents_df = interactions_test_df[interactions_test_df.personId == person]
        for paper in list(data_test.loc[person].index):
            if paper not in list(contents_df.contentId):
                # zero out interactions that fall outside the test period;
                # .loc avoids chained assignment, which may write to a copy
                data_test.loc[person, paper] = 0.0
    else:
        # the user has no test interactions at all
        for paper in list(data_test.loc[person].index):
            data_test.loc[person, paper] = 0.0

100%|████████████████████████████████████████████| 1140/1140 [05:32<00:00,  3.43it/s]


In [16]:
data_train_sparse = sparse.csr_matrix(data_train.values)
data_train_sparse

<1140x2984 sparse matrix of type '<class 'numpy.float64'>'
	with 29329 stored elements in Compressed Sparse Row format>

In [17]:
data_test_sparse = sparse.csr_matrix(data_test.values)
data_test_sparse

<1140x2984 sparse matrix of type '<class 'numpy.float64'>'
	with 9777 stored elements in Compressed Sparse Row format>
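The two loops above take several minutes each because they visit every (user, article) cell of a dense table. A possible alternative, sketched here on toy data, is to build the sparse matrices directly from the long-format frames. The helper name `build_interaction_matrix` and the toy frames are illustrative assumptions, not part of the original solution; the column names match `interactions_train_df` and `interactions_test_df`. Sorting the unique string ids is meant to mirror the row and column order that `pd.pivot_table` produces by default.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def build_interaction_matrix(df, user_index, item_index):
    """Build a CSR users x items matrix holding eventStrength values."""
    rows = user_index.get_indexer(df["personId"])
    cols = item_index.get_indexer(df["contentId"])
    values = df["eventStrength"].to_numpy(dtype=np.float32)
    return sparse.csr_matrix(
        (values, (rows, cols)),
        shape=(len(user_index), len(item_index)),
    )


# Toy stand-ins for interactions_train_df / interactions_test_df (made-up values).
toy_train = pd.DataFrame(
    {
        "personId": ["u1", "u1", "u2"],
        "contentId": ["a", "b", "a"],
        "eventStrength": [1.0, 2.0, 1.584963],
    }
)
toy_test = pd.DataFrame(
    {
        "personId": ["u2", "u3"],
        "contentId": ["c", "b"],
        "eventStrength": [1.0, 3.0],
    }
)
toy_full = pd.concat([toy_train, toy_test])

# Shared row/column order for both matrices, so that indices line up.
user_index = pd.Index(sorted(toy_full["personId"].unique()))
item_index = pd.Index(sorted(toy_full["contentId"].unique()))

toy_train_sparse = build_interaction_matrix(toy_train, user_index, item_index)
toy_test_sparse = build_interaction_matrix(toy_test, user_index, item_index)
print(toy_train_sparse.toarray())
print(toy_test_sparse.toarray())
```

Both matrices share the same shape because the indices are built from the full set of users and articles, which is what LightFM expects when `train_interactions` is passed to the evaluation functions.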

## Task 2 (0.5 points)

Train a LightFM model with `loss="warp"` and compute *precision@10* on the test set.

In [19]:
model = LightFM(loss='warp', random_state=123)
model.fit(data_train_sparse, epochs=10, num_threads=16)

<lightfm.lightfm.LightFM at 0x7f9e1f1c9e50>

In [20]:
train_precision = precision_at_k(model, data_train_sparse, k=10).mean()
test_precision = precision_at_k(model, data_test_sparse, k=10, train_interactions=data_train_sparse).mean()
print(train_precision)
print(test_precision)

0.2057554
0.0068228105


## Task 3 (2 points)

When calling the `fit` method, LightFM lets you pass item feature descriptions via `item_features`. Let's use this. We will derive the features from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by number of features, train LightFM with `loss="warp"`, and compute precision@10 on the test set.

In [21]:
vectorizer = TfidfVectorizer()
papers = list(interactions_full_df.contentId.unique())
feat = vectorizer.fit_transform([np.array(articles_df.loc[articles_df['contentId'] == i].text)[0] if i in list(articles_df.contentId) else '' for i in papers])
vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., '食べ物', '食事中', '건배'], dtype=object)

In [22]:
feat.shape

(2984, 71875)

In [23]:
model = LightFM(loss='warp', random_state=42)
model.fit(data_train_sparse, epochs=10, item_features = feat)

<lightfm.lightfm.LightFM at 0x7f9e1e2acfa0>

In [24]:
train_precision = precision_at_k(model, data_train_sparse, k=10, item_features = feat).mean()
test_precision = precision_at_k(model, data_test_sparse, k=10, train_interactions=data_train_sparse, item_features = feat).mean()
print(train_precision)
print(test_precision)

0.21465829
0.0061099795
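One thing that may be worth trying here: in the cell that builds `feat`, articles missing from `articles_df` receive an empty text and therefore an all-zero TF-IDF row. As far as we understand LightFM, a raw `item_features` matrix replaces the default per-item indicator features rather than extending them, so such articles end up with no features at all. The sketch below, on a toy matrix, prepends an identity block so that every item keeps its own indicator feature next to the text features. The helper name `add_identity_features` and the toy values are illustrative assumptions.

```python
import numpy as np
from scipy import sparse


def add_identity_features(item_features):
    """Prepend one indicator column per item to an items x features matrix."""
    n_items = item_features.shape[0]
    identity = sparse.identity(n_items, dtype=np.float32, format="csr")
    return sparse.hstack(
        [identity, item_features.astype(np.float32)], format="csr"
    )


# Toy TF-IDF-like matrix: 3 items, 4 text features (made-up values).
# The last item has an empty text, so its row is all zeros.
toy_feat = sparse.csr_matrix(
    np.array(
        [
            [0.5, 0.0, 0.0, 0.5],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
        ],
        dtype=np.float32,
    )
)

toy_feat_with_id = add_identity_features(toy_feat)
print(toy_feat_with_id.toarray())
```

The combined matrix would be passed as `item_features` both to `fit` and to `precision_at_k`. Whether it changes precision@10 on this dataset has not been checked here.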


## Task 4 (1.5 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert it to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

In [25]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('snowball_data')
nltk.download('perluniprops')
nltk.download('universal_tagset')
nltk.download('nonbreaking_prefixes')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/dinara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/dinara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package snowball_data to
[nltk_data]     /home/dinara/nltk_data...
[nltk_data]   Package snowball_data is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/dinara/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/dinara/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/dinara/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dinara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [26]:
articles_df['text_lower'] = articles_df.text.apply(lambda x: x.lower())
articles_df.sample(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang,text_lower
1126,1464356041,CONTENT SHARED,8038154459609789205,-108842214936804958,-5072897400043657671,,,,HTML,http://airbnb.design/building-a-visual-language/,Building a Visual Language - Behind the scenes...,This article is part of a series on our new De...,en,this article is part of a series on our new de...
1561,1467136120,CONTENT SHARED,-2549933363319068481,801895594717772308,8570706322370780593,,,,HTML,https://www.washingtonpost.com/news/wonk/wp/20...,The real reasons you procrastinate - and how t...,Have you ever sat down to complete an importan...,en,have you ever sat down to complete an importan...


In [27]:
from pymorphy2 import MorphAnalyzer
#from functools import lru_cache
from nltk.corpus import stopwords
import re

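m = MorphAnalyzer()
# NOTE: the range A-z also matches the characters [ \ ] ^ _ ` that sit between
# 'Z' and 'a' in ASCII, so runs of underscores are kept as tokens, while
# accented letters (e.g. in Portuguese) are not matched and split words apart.
regex = re.compile("[А-Яа-яA-z]+")

def words_only(text, regex=regex):
    try:
        return regex.findall(text.lower())
    except AttributeError:  # non-string input such as NaN
        return []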

In [28]:
def lemmatize_word(token, pymorphy=m):
    return pymorphy.parse(token)[0].normal_form

def lemmatize_text(text):
    return [lemmatize_word(w) for w in text]


mystopwords = stopwords.words('english') 
def remove_stopwords(lemmas, stopwords = mystopwords):
    return [w for w in lemmas if not w in stopwords and len(w) > 3]

def clean_text(text):
    tokens = words_only(text)
    lemmas = lemmatize_text(tokens)
    return ' '.join(remove_stopwords(lemmas))
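The sample output further down shows words such as "resoluções" cut to "resolu", and the TF-IDF vocabulary contains tokens made of underscores. Both effects come from the character class used in `words_only`. Below is a minimal sketch of a letters-only variant that also keeps accented Latin letters. The names `LETTERS_RE`, `TOY_STOPWORDS` and `clean_text_letters_only` are hypothetical, the stop-word set is a tiny placeholder rather than a real list, and lemmatization is left out on purpose.

```python
import re

# Letters only: basic Latin, accented Latin-1 letters (used in Portuguese),
# and Cyrillic. Digits, underscores and punctuation are excluded.
LETTERS_RE = re.compile(r"[A-Za-zÀ-ÖØ-öø-ÿА-Яа-яЁё]+")

# A tiny illustrative stop-word set; a real run would use full per-language lists.
TOY_STOPWORDS = {"this", "that", "with", "para", "como"}


def clean_text_letters_only(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, tokenize on letters, drop stop words and short tokens."""
    tokens = LETTERS_RE.findall(str(text).lower())
    return " ".join(
        t for t in tokens if t not in stopwords and len(t) >= min_len
    )


cleaned = clean_text_letters_only(
    "Suas resoluções de ano novo _____ ainda estão de pé?"
)
print(cleaned)
```

With this pattern the Portuguese words stay whole and the underscore run disappears. Since the corpus mixes English and Portuguese articles, language-specific stop words and stemming would probably matter more than the choice of tokenizer.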

In [29]:
from multiprocessing import Pool


with Pool(4) as p:
    lemmas = list(tqdm(p.imap(clean_text, articles_df['text_lower']), total=len(articles_df)))
    
articles_df['lemmas'] = lemmas
articles_df.sample(5)

100%|███████████████████████████████████████████| 3047/3047 [00:16<00:00, 184.86it/s]


Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang,text_lower,lemmas
2921,1483362590,CONTENT SHARED,7697593937932606048,-7410485589492665094,-6773017960785592027,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2...,MG,BR,HTML,http://sensedia.com/blog/apis/top-5-conteudos-...,Top 5 conteúdos 2016 - Sensedia,"Com o ano acabando, você precisa saber do que ...",pt,"com o ano acabando, você precisa saber do que ...",acabando precisa saber aconteceu melhor neste ...
240,1459961256,CONTENT SHARED,-8388400806941996172,-4243635261966794110,-2955857083043076023,,,,HTML,http://blog.mgechev.com/2016/03/28/ambient-typ...,Managing ambient type definitions and dealing ...,"Maintaining the angular2-seed , I found out th...",en,"maintaining the angular2-seed , i found out th...",maintaining angular seed found common problem ...
2943,1484063971,CONTENT SHARED,-1668558678125613329,3609194402293569455,-851069973704807471,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,SP,BR,HTML,http://www.b9.com.br/69568/advertising/bot-da-...,Bot da Chevrolet te ajuda a alcançar as resolu...,Suas resoluções de ano novo ainda estão de pé?...,pt,suas resoluções de ano novo ainda estão de pé?...,suas resolu novo ainda como maioria pessoas ac...
1069,1464005298,CONTENT SHARED,-6174871522512566947,4918484843075254252,-5131349532990822420,,,,HTML,http://www.psfk.com/2016/05/program-your-groce...,Program your way to your next grocery delivery,Tesco created IFTTT channel where shoppers can...,en,tesco created ifttt channel where shoppers can...,tesco created ifttt channel shoppers pick food...
1443,1466287803,CONTENT SHARED,6829640091575814990,-1130272294246983140,-782450246769532337,,,,HTML,http://www.updateordie.com/2016/06/09/branding...,Branding é problema seu. E meu.,"B randing é um assunto injustiçado, coitado. T...",pt,"b randing é um assunto injustiçado, coitado. t...",randing assunto injusti coitado tachado comple...


In [None]:
🐂 Making predictions with LightFM after lemmatization

In [30]:
vectorizer = TfidfVectorizer()
papers = list(interactions_full_df.contentId.unique())
feat_lemmas = vectorizer.fit_transform([np.array(articles_df.loc[articles_df['contentId'] == i].lemmas)[0] if i in list(articles_df.contentId) else '' for i in papers])
vectorizer.get_feature_names_out()

array(['_____', '______', '_______', ..., 'соnvеr', 'соuрl', 'уеаr'],
      dtype=object)

In [31]:
feat_lemmas.shape

(2984, 63066)

In [32]:
model = LightFM(loss='warp', random_state=42)
model.fit(data_train_sparse, epochs=10, item_features=feat_lemmas)

<lightfm.lightfm.LightFM at 0x7f9e00507580>

In [33]:
train_precision = precision_at_k(model, data_train_sparse, k=10, item_features = feat_lemmas).mean()
test_precision = precision_at_k(model, data_test_sparse, k=10, train_interactions=data_train_sparse, item_features = feat_lemmas).mean()
print(train_precision)
print(test_precision)

0.22077338
0.006211813


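Did the prediction quality improve?

In [None]:
🦛 Slightly: test precision@10 rose from 0.0061 with the raw text in Task 3 to 0.0062, which is still below the 0.0068 of the model without item features in Task 2.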

## Task 5 (1.5 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve the model's quality.

In [68]:
import itertools

import numpy as np

from lightfm.evaluation import auc_score


def sample_hyperparameters():
    """
    Yield possible hyperparameter choices.
    """

    while True:
        yield {
            "no_components": np.random.randint(16, 64),
            "learning_schedule": np.random.choice(["adagrad", "adadelta"]),
            "loss": np.random.choice(["bpr", "warp", "warp-kos"]),
            "learning_rate": np.random.exponential(0.05),
            "item_alpha": np.random.exponential(1e-8),
            "user_alpha": np.random.exponential(1e-8),
            "max_sampled": np.random.randint(5, 15),
            "num_epochs": np.random.randint(5, 50),
        }


def random_search(train, test, num_samples=10, num_threads=1):
    """
    Sample random hyperparameters, fit a LightFM model, and evaluate it
    on the test set.

    Parameters
    ----------

    train: np.float32 coo_matrix of shape [n_users, n_items]
        Training data.
    test: np.float32 coo_matrix of shape [n_users, n_items]
        Test data.
    num_samples: int, optional
        Number of hyperparameter choices to evaluate.


    Returns
    -------

    generator of (auc_score, hyperparameter dict, fitted model)

    """

    for hyperparams in itertools.islice(sample_hyperparameters(), num_samples):
        num_epochs = hyperparams.pop("num_epochs")

        model = LightFM(**hyperparams)
        model.fit(train, epochs=num_epochs, num_threads=num_threads)

        score = auc_score(model, test, train_interactions=train, num_threads=num_threads).mean()
        
        #score = precision_at_k(model, train, k=10).mean()
        hyperparams["num_epochs"] = num_epochs

        yield (score, hyperparams, model)

In [73]:
(score, hyperparams, model) = max(random_search(data_train_sparse, data_test_sparse, num_threads=2), key=lambda x: x[0])

print("Best score {} at {}".format(score, hyperparams))

Best score 0.5752180218696594 at {'no_components': 35, 'learning_schedule': 'adagrad', 'loss': 'bpr', 'learning_rate': 0.016334129798121093, 'item_alpha': 3.880712290696974e-09, 'user_alpha': 2.0892267361228756e-08, 'max_sampled': 8, 'num_epochs': 32}


In [77]:
model = LightFM(no_components= 41, learning_schedule= 'adagrad', loss= 'warp-kos', 
                learning_rate= 0.026501846286832772, item_alpha=1.5111749871814318e-08, 
                user_alpha= 4.394857807746486e-09, 
                max_sampled= 14,
               random_state=42)
model.fit(data_train_sparse, epochs=42, item_features=feat_lemmas)
train_precision = precision_at_k(model, data_train_sparse, k=10, item_features = feat_lemmas).mean()
test_precision = precision_at_k(model, data_test_sparse, k=10, train_interactions=data_train_sparse, item_features = feat_lemmas).mean()
print(train_precision)
print(test_precision)

0.5613309
0.0065173116


## Task 6 (1 point)

Implement functions that compute the following metrics:
* precision@k
* recall@k
* NDCG@k



In [None]:
# Your code here

## Task 7 (1 point)

Compute the metrics you implemented for $k=10$ for the best model obtained in the previous steps.

Find existing implementations of these metrics in the lightfm and sklearn libraries. Compare the values you obtained with the results of the libraries' built-in metrics.

In [None]:
# Your code here

## Task 8 (1 point)

Implement the ALS algorithm and apply it to the problem solved in this notebook.

**ALS**

We are given the problem of building a latent factor model for collaborative filtering:

$$ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \to \min_{P,Q}$$

The sum runs over all pairs $(u, i)$ for which the rating $r_{ui}$ is known (and only over them), $p_u$ and $q_i$ are the latent representations of user $u$ and item $i$ respectively, and the matrices $P$ and $Q$ are formed by stacking the vectors $p_u$ and $q_i$ as columns.

The ALS (Alternating Least Squares) approach solves the problem by alternately fixing the matrices $P$ and $Q$: it turns out that once one of the matrices is fixed, the problem for the other one has a closed-form solution.

$$\nabla_{p_u} \bigg[ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \bigg] = -\sum_{i} 2(r_{ui} - \langle p_u, q_i \rangle)q_i = 0$$

Using the fact that $(a^T b)\,c = c\,(b^T a)$, so that $\langle p_u, q_i \rangle q_i = q_i q_i^T p_u$, we get
$$\sum_{i} r_{ui}q_i - \sum_i q_i q_i^T p_u = 0.$$

Each column of the matrix $P$ can then be found by the formula
$$p_u = \bigg( \sum_i q_i q_i^T\bigg)^{-1}\sum_ir_{ui}q_i \;\; \forall u,$$

and similarly for the columns of the matrix $Q$
$$q_i = \bigg( \sum_u p_u p_u^T\bigg)^{-1}\sum_ur_{ui}p_u \;\; \forall i.$$

In this way we can solve the optimization problem by alternately fixing one of the matrices $P$ or $Q$ and optimizing over the other.
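The update formulas above can be sketched in NumPy as follows. This is a minimal illustration on a toy matrix, not a reference solution: the function names, the toy data and the small ridge term `reg` are assumptions. The ridge term is not part of the formulas above; it is added only so that the matrices stay invertible when a user or an item has fewer known ratings than latent factors.

```python
import numpy as np


def als(ratings, mask, n_factors=2, n_iters=20, reg=1e-3, seed=0):
    """Alternating least squares over the observed entries only.

    ratings: (n_users, n_items) array; mask: boolean array, True where r_ui is known.
    Returns P (n_factors, n_users) and Q (n_factors, n_items), vectors as columns.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_factors, n_users))
    Q = rng.normal(scale=0.1, size=(n_factors, n_items))
    ridge = reg * np.eye(n_factors)  # assumption: small ridge for invertibility

    for _ in range(n_iters):
        # Fix Q, solve for every p_u using only the items rated by u.
        for u in range(n_users):
            idx = mask[u]
            Qu = Q[:, idx]
            P[:, u] = np.linalg.solve(Qu @ Qu.T + ridge, Qu @ ratings[u, idx])
        # Fix P, solve for every q_i using only the users who rated i.
        for i in range(n_items):
            idx = mask[:, i]
            Pi = P[:, idx]
            Q[:, i] = np.linalg.solve(Pi @ Pi.T + ridge, Pi @ ratings[idx, i])
    return P, Q


def observed_mse(ratings, mask, P, Q):
    """Mean squared error over the known ratings."""
    pred = P.T @ Q
    return float(np.mean((ratings[mask] - pred[mask]) ** 2))


# Toy data: an exactly rank-2 matrix with roughly 70% of entries observed.
rng = np.random.default_rng(1)
toy_ratings = rng.normal(size=(2, 8)).T @ rng.normal(size=(2, 6))
toy_mask = rng.random(toy_ratings.shape) < 0.7

P0, Q0 = als(toy_ratings, toy_mask, n_iters=0)  # random initialization only
P, Q = als(toy_ratings, toy_mask, n_iters=20)
mse_before = observed_mse(toy_ratings, toy_mask, P0, Q0)
mse_after = observed_mse(toy_ratings, toy_mask, P, Q)
print(mse_before, mse_after)
```

For the notebook's data, one option would be to use the dense train matrix as `ratings` and `ratings > 0` as `mask`, then rank the articles for each user by the corresponding row of `P.T @ Q`. With interaction strengths instead of explicit ratings, the implicit-feedback formulation cited below may be a better fit.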

**The original paper that formulates ALS for explicit feedback:**

* Bell, R.M. and Koren, Y., 2007, October. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Seventh IEEE international conference on data mining (ICDM 2007) (pp. 43-52). IEEE.

**The original paper on ALS for implicit data, which became better known:**

* Hu, Y., Koren, Y. and Volinsky, C., 2008, December. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE international conference on data mining (pp. 263-272). IEEE.


In [None]:
# Your code here