# Introduction

In this assignment you will continue working with the dataset from the seminar, [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

In [1]:
import pandas as pd
import numpy as np
import math

## Loading and preprocessing the data

We load the data and preprocess it the same way as in the seminar.

In [2]:
!wget -q -N https://www.dropbox.com/s/z8syrl5trawxs0n/articles.zip?dl=0 -O articles.zip
!unzip -o -q articles.zip

In [2]:
articles_df = pd.read_csv('articles/shared_articles.csv')
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [3]:
interactions_df = pd.read_csv('articles/users_interactions.csv')
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [4]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [5]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

interactions_df['eventStrength'] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

Keep only the users who interacted with at least five articles.

In [6]:
users_interactions_count_df = (
    interactions_df
    .groupby(['personId', 'contentId'])
    .first()
    .reset_index()
    .groupby('personId').size())
print('# users:', len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions:',len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140
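
The same filter can be written more directly by counting distinct articles per user. A minimal sketch on a toy frame, assuming the column names `personId` and `contentId` used above; the helper name `filter_active_users` and the toy data are illustrative only:

```python
import pandas as pd


def filter_active_users(df, min_articles=5):
    """Keep rows of users who interacted with at least `min_articles` distinct articles."""
    # number of distinct articles per user
    articles_per_user = df.groupby('personId')['contentId'].nunique()
    active_users = articles_per_user[articles_per_user >= min_articles].index
    return df[df['personId'].isin(active_users)]


# toy data: user "a" touches 5 distinct articles, user "b" only 2 (one of them twice)
toy = pd.DataFrame({
    'personId': ['a'] * 5 + ['b'] * 3,
    'contentId': ['1', '2', '3', '4', '5', '1', '1', '2'],
})
filtered = filter_active_users(toy)
print(filtered['personId'].unique())
```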


Keep only the interactions that belong to the selected users.

In [7]:
interactions_from_selected_users_df = interactions_df.loc[np.in1d(interactions_df.personId,
            users_with_enough_interactions_df)]

In [8]:
print('# interactions before:', interactions_df.shape)
print('# interactions after:', interactions_from_selected_users_df.shape)

# interactions before: (72312, 9)
# interactions after: (69868, 9)


Aggregate all of a user's interactions with each article and smooth the result by taking its logarithm.

In [9]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId']).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(['personId', 'contentId'])
)
interactions_full_df['last_timestamp'] = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId'])['timestamp'].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
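
To see what the smoothing does, the sketch below applies the same `log2(1 + x)` transform to a few summed strengths (the input values are illustrative). A single view maps to 1.0, while repeated or stronger interactions grow only logarithmically.

```python
import math


def smooth_user_preference(x):
    # log2(1 + x): compresses large sums of event strengths
    return math.log(1 + x, 2)


# 1.0 = one VIEW, 2.0 = one LIKE (or two VIEWs), 7.0 = several events combined
for total in (1.0, 2.0, 7.0):
    print(total, round(smooth_user_preference(total), 6))
```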


Split the data into train and test sets by timestamp.

In [10]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675
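
The notebook takes `split_ts` as a given constant. One way such a threshold could be chosen, shown purely as an assumption and not as the way the value above was obtained, is a quantile of the timestamps. A minimal NumPy sketch on toy timestamps:

```python
import numpy as np

# toy timestamps standing in for the last_timestamp column
timestamps = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# put roughly the earliest 75% of interactions into the train set
split_ts = np.quantile(timestamps, 0.75)
train_mask = timestamps < split_ts
test_mask = ~train_mask

print(split_ts, train_mask.sum(), test_mask.sum())
```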


To make quality evaluation easier, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [11]:
interactions = (
    interactions_train_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={'contentId': 'true_train'})
    .set_index('personId')
)

interactions['true_test'] = (
    interactions_test_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
)

# fill missing values with empty lists
interactions.loc[pd.isnull(interactions.true_test), 'true_test'] = [
    list() for x in range(len(interactions.loc[pd.isnull(interactions.true_test), 'true_test']))]

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."
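
With the data in this shape, precision@k for one user is the share of the top-k recommended articles that appear in `true_test`. A minimal sketch, assuming predictions are stored as ranked lists of unique `contentId` values; the function name `precision_at_k_lists` and the toy ids are hypothetical, and this is not the LightFM implementation:

```python
def precision_at_k_lists(true_items, predicted_items, k=10):
    """Fraction of the first k predicted items that are in true_items."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = predicted_items[:k]
    hits = len(set(top_k) & set(true_items))
    # divide by k (not by len(top_k)), as in the usual precision@k definition
    return hits / k


true_test = ['a1', 'a2', 'a3']
predicted = ['a2', 'b7', 'a3', 'c9', 'd4']
score = precision_at_k_lists(true_test, predicted, k=5)
print(score)
```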


## The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements several popular algorithms. As in the seminar, recommendation quality is measured with *precision@10*.

In [12]:
!pip install lightfm
from lightfm import LightFM
from lightfm.evaluation import precision_at_k



In [13]:
from lightfm.data import Dataset

## Task 1. (2 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

The problem (as far as I can tell) is that by default the LightFM `Dataset` assigns a weight of 1.0 to every `(user_id, item_id)` pair that comes without an explicit weight, so in `_unpack_datum` the default `weight = 1.0` has to be changed to `weight = 0.0` for the case `len(datum) == 2`.

In [13]:
import array

import numpy as np

import scipy.sparse as sp

import sklearn.preprocessing


class _IncrementalCOOMatrix(object):
    def __init__(self, shape, dtype):

        if dtype is np.int32:
            type_flag = "i"
        elif dtype is np.int64:
            type_flag = "l"
        elif dtype is np.float32:
            type_flag = "f"
        elif dtype is np.float64:
            type_flag = "d"
        else:
            raise Exception("Dtype not supported.")

        self.shape = shape
        self.dtype = dtype

        self.rows = array.array("i")
        self.cols = array.array("i")
        self.data = array.array(type_flag)

    def append(self, i, j, v):

        m, n = self.shape

        if i >= m or j >= n:
            raise Exception("Index out of bounds")

        self.rows.append(i)
        self.cols.append(j)
        self.data.append(v)

    def tocoo(self):

        rows = np.frombuffer(self.rows, dtype=np.int32)
        cols = np.frombuffer(self.cols, dtype=np.int32)
        data = np.frombuffer(self.data, dtype=self.dtype)

        return sp.coo_matrix((data, (rows, cols)), shape=self.shape)

    def __len__(self):

        return len(self.data)


class _FeatureBuilder(object):
    def __init__(
        self, id_mapping, feature_mapping, identity_features, normalize, entity_type
    ):

        self._id_mapping = id_mapping
        self._feature_mapping = feature_mapping
        self._identity_features = identity_features
        self._normalize = normalize
        self._entity_type = entity_type

    def features_shape(self):

        return len(self._id_mapping), len(self._feature_mapping)

    def _iter_features(self, features):

        if isinstance(features, dict):
            for entry in features.items():
                yield entry

        else:
            for feature_name in features:
                yield (feature_name, 1.0)

    def _process_features(self, datum):

        if len(datum) != 2:
            raise ValueError(
                "Expected tuples of ({}_id, features), "
                "got {}.".format(self._entity_type, datum)
            )

        entity_id, features = datum

        if entity_id not in self._id_mapping:
            raise ValueError(
                "{entity_type} id {entity_id} not in {entity_type} id mappings.".format(
                    entity_type=self._entity_type, entity_id=entity_id
                )
            )

        idx = self._id_mapping[entity_id]

        for (feature, weight) in self._iter_features(features):
            if feature not in self._feature_mapping:
                raise ValueError(
                    "Feature {} not in feature mapping. "
                    "Call fit first.".format(feature)
                )

            feature_idx = self._feature_mapping[feature]

            yield (idx, feature_idx, weight)

    def build(self, data):

        features = _IncrementalCOOMatrix(self.features_shape(), np.float32)

        if self._identity_features:
            for (_id, idx) in self._id_mapping.items():
                features.append(idx, self._feature_mapping[_id], 1.0)

        for datum in data:
            for (entity_idx, feature_idx, weight) in self._process_features(datum):
                features.append(entity_idx, feature_idx, weight)

        features = features.tocoo().tocsr()

        if self._normalize:
            if np.any(features.getnnz(1) == 0):
                raise ValueError(
                    "Cannot normalize feature matrix: some rows have zero norm. "
                    "Ensure that features were provided for all entries."
                )

            sklearn.preprocessing.normalize(features, norm="l1", copy=False)

        return features


class Dataset(object):
    """
    Tool for building interaction and feature matrices, taking care of the
    mapping between user/item ids and feature names and internal feature indices.
    To create a dataset:
    - Create an instance of the `Dataset` class.
    - Call `fit` (or `fit_partial`), supplying user/item ids and feature names
      that you want to use in your model. This will create internal mappings that
      translate the ids and feature names to internal indices used by the LightFM
      model.
    - Call `build_interactions` with an iterable of (user id, item id) or (user id,
      item id, weight) to build an interactions and weights matrix.
    - Call `build_user/item_features` with iterables of (user/item id, [features])
      or (user/item id, {feature: feature weight}) to build feature matrices.
    - To add new user/item ids or features, call `fit_partial` again. You will need
      to resize your LightFM model to be able to use the new features.
    Parameters
    ----------
    user_identity_features: bool, optional
        Create a unique feature for every user in addition to other features.
        If true (default), a latent vector will be allocated for every user. This
        is a reasonable default for most applications, but should be set to false
        if there is very little data for every user. For more details see the Notes
        in :doc:`LightFM<lightfm>`.
    item_identity_features: bool, optional
        Create a unique feature for every item in addition to other features.
        If true (default), a latent vector will be allocated for every item. This
        is a reasonable default for most applications, but should be set to false
        if there is very little data for every item. For more details see the Notes
        in :doc:`LightFM<lightfm>`.
    """

    def __init__(self, user_identity_features=True, item_identity_features=True):

        self._user_identity_features = user_identity_features
        self._item_identity_features = item_identity_features

        self._user_id_mapping = {}
        self._item_id_mapping = {}
        self._user_feature_mapping = {}
        self._item_feature_mapping = {}

    def _check_fitted(self):

        if not self._user_id_mapping or not self._item_id_mapping:
            raise ValueError(
                "You must call fit first to build the item and user " "id mappings."
            )

    def fit(self, users, items, user_features=None, item_features=None):
        """
        Fit the user/item id and feature name mappings.
        Calling fit the second time will reset existing mappings.
        Parameters
        ----------
        users: iterable of user ids
        items: iterable of item ids
        user_features: iterable of user features, optional
        item_features: iterable of item features, optional
        """

        self._user_id_mapping = {}
        self._item_id_mapping = {}
        self._user_feature_mapping = {}
        self._item_feature_mapping = {}

        return self.fit_partial(users, items, user_features, item_features)

    def fit_partial(
        self, users=None, items=None, user_features=None, item_features=None
    ):
        """
        Fit the user/item id and feature name mappings.
        Calling fit the second time will add new entries to existing mappings.
        Parameters
        ----------
        users: iterable of user ids, optional
        items: iterable of item ids, optional
        user_features: iterable of user features, optional
        item_features: iterable of item features, optional
        """

        if users is not None:
            for user_id in users:
                self._user_id_mapping.setdefault(user_id, len(self._user_id_mapping))

                if self._user_identity_features:
                    self._user_feature_mapping.setdefault(
                        user_id, len(self._user_feature_mapping)
                    )

        if items is not None:
            for item_id in items:
                self._item_id_mapping.setdefault(item_id, len(self._item_id_mapping))

                if self._item_identity_features:
                    self._item_feature_mapping.setdefault(
                        item_id, len(self._item_feature_mapping)
                    )

        if user_features is not None:
            for user_feature in user_features:
                self._user_feature_mapping.setdefault(
                    user_feature, len(self._user_feature_mapping)
                )

        if item_features is not None:
            for item_feature in item_features:
                self._item_feature_mapping.setdefault(
                    item_feature, len(self._item_feature_mapping)
                )

    def _unpack_datum(self, datum):

        if len(datum) == 3:
            (user_id, item_id, weight) = datum
        elif len(datum) == 2:
            (user_id, item_id) = datum
            weight = 0.0
        else:
            raise ValueError(
                "Expecting tuples of (user_id, item_id, weight) "
                "or (user_id, item_id). Got {}".format(datum)
            )

        user_idx = self._user_id_mapping.get(user_id)
        item_idx = self._item_id_mapping.get(item_id)

        if user_idx is None:
            raise ValueError(
                "User id {} not in user id mapping. Make sure "
                "you call the fit method.".format(user_id)
            )

        if item_idx is None:
            raise ValueError(
                "Item id {} not in item id mapping. Make sure "
                "you call the fit method.".format(item_id)
            )

        return (user_idx, item_idx, weight)

    def interactions_shape(self):
        """
        Return a tuple of (num users, num items).
        """

        return (len(self._user_id_mapping), len(self._item_id_mapping))

    def build_interactions(self, data):
        """
        Build an interaction matrix.
        Two matrices will be returned: a (num_users, num_items)
        COO matrix with interactions, and a (num_users, num_items)
        matrix with the corresponding interaction weights.
        Parameters
        ----------
        data: iterable of (user_id, item_id) or (user_id, item_id, weight)
            An iterable of interactions. The user and item ids will be
            translated to internal model indices using the mappings
            constructed during the fit call. If weights are not provided
            they will be assumed to be 0.0 (patched; the library default is 1.0).
        Returns
        -------
        (interactions, weights): COO matrix, COO matrix
            Two COO matrices: the interactions matrix
            and the corresponding weights matrix.
        """

        interactions = _IncrementalCOOMatrix(self.interactions_shape(), np.int32)
        weights = _IncrementalCOOMatrix(self.interactions_shape(), np.float32)

        for datum in data:
            user_idx, item_idx, weight = self._unpack_datum(datum)

            interactions.append(user_idx, item_idx, 1)
            weights.append(user_idx, item_idx, weight)

        return (interactions.tocoo(), weights.tocoo())

    def user_features_shape(self):
        """
        Return the shape of the user features matrix.
        Returns
        -------
        (num user ids, num user features): tuple of ints
            The shape.
        """

        return (len(self._user_id_mapping), len(self._user_feature_mapping))

    def build_user_features(self, data, normalize=True):
        """
        Build a user features matrix out of an iterable of the form
        (user id, [list of feature names]) or (user id, {feature name: feature weight}).
        Parameters
        ----------
        data: iterable of the form
            (user id, [list of feature names]) or (user id,
            {feature name: feature weight}).
            User and feature ids will be translated to internal indices
            constructed during the fit call.
        normalize: bool, optional
            If true, will ensure that feature weights sum to 1 in every row.
        Returns
        -------
        feature matrix: CSR matrix (num users, num features)
            Matrix of user features.
        """

        builder = _FeatureBuilder(
            self._user_id_mapping,
            self._user_feature_mapping,
            self._user_identity_features,
            normalize,
            "user",
        )

        return builder.build(data)

    def item_features_shape(self):
        """
        Return the shape of the item features matrix.
        Returns
        -------
        (num item ids, num item features): tuple of ints
            The shape.
        """

        return (len(self._item_id_mapping), len(self._item_feature_mapping))

    def build_item_features(self, data, normalize=True):
        """
        Build a item features matrix out of an iterable of the form
        (item id, [list of feature names]) or (item id, {feature name: feature weight}).
        Parameters
        ----------
        data: iterable of the form
            (item id, [list of feature names]) or (item id,
            {feature name: feature weight}).
            Item and feature ids will be translated to internal indices
            constructed during the fit call.
        normalize: bool, optional
            If true, will ensure that feature weights sum to 1 in every row.
        Returns
        -------
        feature matrix: CSR matrix (num items, num features)
            Matrix of item features.
        """

        builder = _FeatureBuilder(
            self._item_id_mapping,
            self._item_feature_mapping,
            self._item_identity_features,
            normalize,
            "item",
        )

        return builder.build(data)

    def model_dimensions(self):
        """
        Returns a tuple that characterizes the number of user/item feature
        embeddings in a LightFM model for this dataset.
        """

        return (len(self._user_feature_mapping), len(self._item_feature_mapping))

    def mapping(self):
        """
        Return the constructed mappings.
        Invert these to map internal indices to external ids.
        Returns
        -------
        (user id map, user feature map, item id map, item feature map): tuple of dictionaries
        """

        return (
            self._user_id_mapping,
            self._user_feature_mapping,
            self._item_id_mapping,
            self._item_feature_mapping,
        )

In [14]:
# Your code here

data = Dataset()

data.fit(interactions_full_df.personId.unique(),
         interactions_full_df.contentId.unique())

interactions_train, weights_train = data.build_interactions([tuple(i) for i in interactions_train_df[['personId', 'contentId', 'eventStrength']].values])
interactions_test, weights_test = data.build_interactions([tuple(i) for i in interactions_test_df[['personId', 'contentId', 'eventStrength']].values])
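
An alternative that avoids patching `Dataset` is to build the weight matrices directly with `scipy.sparse`. A minimal sketch, assuming user and item index mappings shared between train and test; the helper `build_weight_matrix` and the toy triples are illustrative:

```python
import numpy as np
import scipy.sparse as sp


def build_weight_matrix(triples, user_index, item_index):
    """Build a (num_users, num_items) CSR matrix of interaction strengths."""
    rows = np.array([user_index[u] for u, _, _ in triples], dtype=np.int32)
    cols = np.array([item_index[i] for _, i, _ in triples], dtype=np.int32)
    vals = np.array([w for _, _, w in triples], dtype=np.float32)
    shape = (len(user_index), len(item_index))
    # entries that are never listed stay implicit zeros
    return sp.coo_matrix((vals, (rows, cols)), shape=shape).tocsr()


train = [('u1', 'i1', 1.0), ('u1', 'i2', 2.0), ('u2', 'i3', 1.584963)]
test = [('u2', 'i1', 1.0)]

# the mappings must cover train and test so that both matrices have the same shape
users = sorted({u for u, _, _ in train + test})
items = sorted({i for _, i, _ in train + test})
user_index = {u: n for n, u in enumerate(users)}
item_index = {i: n for n, i in enumerate(items)}

data_train = build_weight_matrix(train, user_index, item_index)
data_test = build_weight_matrix(test, user_index, item_index)
print(data_train.shape, data_train.nnz, data_test.nnz)
```

Pairs without an interaction are simply absent from the triples, so no default weight is needed at all.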

## Task 2. (1 point)

Train a LightFM model with `loss='warp'` and compute *precision@10* on the test set.

In [15]:
# Your code here
model = LightFM(loss='warp')
model.fit(weights_train, epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x7fd8c6b83210>

In [16]:
precision_at_k(model, weights_train, k=10).mean()

0.2409173

In [17]:
precision_at_k(model, weights_test, k=10).mean()

0.0037678208
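
When ranking for evaluation it is common to exclude the articles a user already saw in training, since under this split a (user, article) pair lands in either train or test, never both. A minimal NumPy sketch of that masking step, with a hypothetical score matrix standing in for model predictions; the function name `top_k_unseen` is illustrative:

```python
import numpy as np


def top_k_unseen(scores, seen_mask, k):
    """Indices of the k highest-scoring items per user, skipping items marked as seen."""
    masked = np.where(seen_mask, -np.inf, scores)
    # argsort is ascending, so take the last k columns and reverse them
    return np.argsort(masked, axis=1)[:, -k:][:, ::-1]


scores = np.array([[0.9, 0.1, 0.5, 0.7],
                   [0.2, 0.8, 0.6, 0.4]])
seen = np.array([[True, False, False, False],
                 [False, True, False, False]])
recommended = top_k_unseen(scores, seen, k=2)
print(recommended)
```

In LightFM, `precision_at_k` accepts a `train_interactions` argument that serves this purpose.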

## Task 3. (3 points)

The `fit` method of LightFM accepts an `item_features` argument with a feature description of the items. Let us use it. We will build the feature description from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by number of features, train LightFM with `loss='warp'`, and compute precision@10 on the test set.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
from nltk.tokenize import word_tokenize

In [20]:
articles_df = articles_df[articles_df.contentId.isin(interactions_full_df.contentId.unique())]

In [21]:
articles = pd.merge(left=interactions_full_df,
                     right=articles_df[['contentId', 'text']],
                     left_on='contentId', right_on='contentId',
                    how='outer')

In [22]:
split_ts = 1475519530
interactions_train_df = articles.loc[articles.last_timestamp < split_ts].copy()
interactions_test_df = articles.loc[articles.last_timestamp >= split_ts].copy()

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))


# interactions on Train set: 29329
# interactions on Test set: 9777


In [23]:
articles_short = articles.drop_duplicates(subset=['contentId'])

In [24]:
tfidf = TfidfVectorizer(min_df=50, max_features=50000).fit(list(articles_short['text'].astype('U')))

In [25]:
tfidfs = tfidf.transform(list(articles_short['text'].astype('U')))
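
For intuition, the textbook TF-IDF weight of a term in a document is its relative frequency in that document times the logarithm of the inverse document frequency. A minimal pure-Python sketch of that textbook variant follows. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The function name `tfidf_weights` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf * idf} dict per tokenized document (textbook variant)."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "bitcoin price rises".split(),
    "bitcoin wallet hacked".split(),
    "cloud price drops".split(),
]
weights = tfidf_weights(corpus)
print(weights[0])
```

Terms that occur in fewer documents ("rises") receive a larger weight than terms shared across documents ("bitcoin", "price").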

In [26]:
data = Dataset()

data.fit(articles.personId.unique(),
         articles.contentId.unique(),
         item_features = pd.DataFrame(tfidfs)
         )

In [27]:

interactions_train, weights_train = data.build_interactions([tuple(i) for i in interactions_train_df[['personId', 'contentId', 'eventStrength']].values])
interactions_test, weights_test = data.build_interactions([tuple(i) for i in interactions_test_df[['personId', 'contentId', 'eventStrength']].values])

In [28]:
model = LightFM(loss='warp')
model.fit(weights_train, item_features = tfidfs, epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x7fd8c0a73c90>

In [29]:
precision_at_k(model, weights_train, item_features=tfidfs, k=10).mean()

0.2195144

In [30]:
precision_at_k(model, weights_test, item_features=tfidfs, k=10).mean()

0.004378819

The quality barely changed, but I did not tune the model's hyperparameters, so it could probably be improved further.
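
One thing that may be worth checking, stated as an assumption rather than a diagnosis: if I read the LightFM documentation correctly, when an `item_features` matrix is passed explicitly, per-item identity features are not added automatically, so the model has no per-article embeddings unless an identity block is included. A minimal sketch of stacking the two with `scipy.sparse`; the random matrix `feat` is a stand-in for the TF-IDF matrix, and its rows are assumed to be in the same order as the item index used for the interaction matrix:

```python
import numpy as np
import scipy.sparse as sp

n_items, n_text_features = 4, 6

# stand-in for the (n_items, n_text_features) TF-IDF matrix
dense = np.random.RandomState(0).rand(n_items, n_text_features)
feat = sp.csr_matrix(dense.astype(np.float32))

# one indicator column per item, followed by the text features
identity = sp.identity(n_items, dtype=np.float32, format='csr')
item_features = sp.hstack([identity, feat], format='csr')
print(item_features.shape)
```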



In [30]:
# Your code here
# feat = 

## Task 4. (2 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

In [31]:
import string

import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords

exclude = set(string.punctuation)
exclude.update(stopwords.words('english'))
exclude.update(stopwords.words('portuguese'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
def process_text(text):
    return ' '.join([word for word in word_tokenize(text.lower()) if word not in exclude])

In [33]:
articles_short['processed_text'] = articles_short['text'].astype('U').apply(process_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [34]:
tfidf = TfidfVectorizer(min_df=25, max_features=50000).fit(list(articles_short['processed_text'].astype('U')))

In [35]:
tfidffs = tfidf.transform(list(articles_short['processed_text'].astype('U')))

In [36]:
data = Dataset()

data.fit(articles.personId.unique(),
         articles.contentId.unique(),
         item_features = pd.DataFrame(tfidffs)
         )

In [37]:
interactions_train, weights_train = data.build_interactions([tuple(i) for i in interactions_train_df[['personId', 'contentId', 'eventStrength']].values])
interactions_test, weights_test = data.build_interactions([tuple(i) for i in interactions_test_df[['personId', 'contentId', 'eventStrength']].values])


In [38]:
model = LightFM(loss='warp')
model.fit(weights_train, item_features = tfidffs, epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x7fd8bfeb8210>

In [39]:
precision_at_k(model, weights_train, item_features=tfidffs, k=10).mean()

0.23705037

In [40]:
precision_at_k(model, weights_test, item_features=tfidffs, k=10).mean()

0.004276986

At this point I started to suspect that either the model is overfitting or (more likely) I am doing something wrong.

In [41]:
# Your code here

Did the prediction quality improve?

## Task 5. (2 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve its quality.

In [41]:
model = LightFM(loss='warp', no_components=10, learning_rate=0.01, max_sampled=10)
model.fit(weights_train, item_features = tfidffs, epochs=50, num_threads=2, verbose=True)

Epoch: 100%|██████████| 50/50 [01:04<00:00,  1.29s/it]


<lightfm.lightfm.LightFM at 0x7fd8c0a975d0>

In [42]:
precision_at_k(model, weights_train, item_features=tfidffs, k=10).mean()

0.17715827

In [43]:
precision_at_k(model, weights_test, item_features=tfidffs, k=10).mean()

0.0045824847

Well, technically it grew a little relative to Task 4 (from 0.004277 to 0.004582), by a factor of 1.071429.
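
A more systematic search could loop over a small grid and keep the best score. A minimal sketch of such a loop; `evaluate` is a hypothetical callable that would train LightFM with the given parameters and return precision@10, and it is replaced here by a dummy function so the example runs on its own:

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def dummy_evaluate(no_components, learning_rate):
    # placeholder objective with a known maximum at (30, 0.05); not a real model
    return -abs(no_components - 30) - abs(learning_rate - 0.05)


grid = {'no_components': [10, 30, 50], 'learning_rate': [0.01, 0.05]}
best_params, best_score = grid_search(grid, dummy_evaluate)
print(best_params, best_score)
```

Selecting hyperparameters on the test set makes the final test score optimistic; a separate validation split by time would avoid this.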

In [45]:
# Your code here

## Bonus task. (3 points)

Above we used a fairly simple TF-IDF representation of the article text. In this task you need to represent the article text (optionally together with the title) as an embedding obtained from a recurrent network or a transformer (you may use any pretrained model you like). Train the model using these embeddings and compare the results with the previous ones.

In [52]:
from gensim.models import FastText

In [44]:
import gensim
from gensim.models import Word2Vec
import numpy as np

In [45]:
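# NOTE: processed_text holds plain strings, so Word2Vec iterates over characters here, not words;
# passing lists of tokens (e.g. text.split()) would train word-level vectors instead.
model = Word2Vec(list(articles_short['processed_text']), size=1500, min_count = 25)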

In [46]:
def mean_vector(word2vec_model, string):
    # keep only in-vocabulary tokens
    string = [word for word in string if word in word2vec_model.vocab]
    if string:
        return np.mean(word2vec_model[string], axis=0)
    else:
        # fall back to a zero vector with the same dimensionality as the embeddings
        return np.zeros(word2vec_model.vector_size)

trainx=[]
i = 0
for string in articles_short['processed_text']:
    trainx.append(mean_vector(model.wv, string))
    if i % 500 == 0:
      print(i)
    i+=1

train_X = np.array(trainx)

0
500
1000
1500
2000
2500


In [47]:
data = Dataset()

data.fit(articles.personId.unique(),
         articles.contentId.unique(),
         item_features = pd.DataFrame(train_X)
         )

interactions_train, weights_train = data.build_interactions([tuple(i) for i in interactions_train_df[['personId', 'contentId', 'eventStrength']].values])
interactions_test, weights_test = data.build_interactions([tuple(i) for i in interactions_test_df[['personId', 'contentId', 'eventStrength']].values])


In [48]:
model = LightFM(loss='warp')
model.fit(weights_train, item_features = sp.csr_matrix(train_X), epochs=30, num_threads=2, verbose=True)

Epoch: 100%|██████████| 30/30 [03:43<00:00,  7.44s/it]


<lightfm.lightfm.LightFM at 0x7fd8bd697490>

In [49]:
precision_at_k(model, weights_train, item_features=sp.csr_matrix(train_X), k=10).mean()

0.029766189

In [50]:
precision_at_k(model, weights_test, item_features=sp.csr_matrix(train_X), k=10).mean()

0.0035641547

I must be doing something very wrong after all, or else recommender systems are the devil's toy.
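
Averaging is meant to run over word tokens, whereas iterating over a Python string yields characters. A minimal sketch of word-level averaging, with a hypothetical dictionary `toy_vectors` standing in for a trained embedding model; the function name `mean_word_vector` is illustrative:

```python
import numpy as np


def mean_word_vector(text, vectors, dim):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    tokens = text.split()  # split into words first
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


dim = 3
toy_vectors = {
    'bitcoin': np.array([1.0, 0.0, 0.0], dtype=np.float32),
    'wallet': np.array([0.0, 1.0, 0.0], dtype=np.float32),
}
doc_vector = mean_word_vector('bitcoin wallet stolen', toy_vectors, dim)
empty_vector = mean_word_vector('unknown words only', toy_vectors, dim)
print(doc_vector, empty_vector)
```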

In [54]:
# Your code here