<center><img src="./images/logo_fmkn.png" width=300 style="display: inline-block;"></center> 

## Machine Learning 2
### Seminar 10. Recommender Systems

<br />
<br />
April 21, 2022

### Recommender Systems

In [1]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn

from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/avalur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Loading data: CI&T Deskdrop dataset

The [Deskdrop dataset](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop) contains logs of user interactions with CI&T's internal communication platform (DeskDrop), collected from Mar. 2016 to Feb. 2017. It holds about 73 thousand user-article interactions, and the total number of articles is over 3 thousand.

Files:
- **shared_articles.csv**
- **users_interactions.csv**

## shared_articles.csv

In [3]:
articles_df = pd.read_csv('shared_articles.csv.zip')
articles_df.head(5)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
0,1459192779,CONTENT REMOVED,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en


In [4]:
articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3122 entries, 0 to 3121
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   timestamp        3122 non-null   int64 
 1   eventType        3122 non-null   object
 2   contentId        3122 non-null   int64 
 3   authorPersonId   3122 non-null   int64 
 4   authorSessionId  3122 non-null   int64 
 5   authorUserAgent  680 non-null    object
 6   authorRegion     680 non-null    object
 7   authorCountry    680 non-null    object
 8   contentType      3122 non-null   object
 9   url              3122 non-null   object
 10  title            3122 non-null   object
 11  text             3122 non-null   object
 12  lang             3122 non-null   object
dtypes: int64(4), object(9)
memory usage: 317.2+ KB


In [5]:
articles_df['eventType'].value_counts()  # unique()

CONTENT SHARED     3047
CONTENT REMOVED      75
Name: eventType, dtype: int64

Keep only the "CONTENT SHARED" records.

In [6]:
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']

## users_interactions.csv

In [7]:
interactions_df = pd.read_csv('users_interactions.csv.zip')
interactions_df.head(10)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,
5,1465413742,VIEW,310515487419366995,-8763398617720485024,1395789369402380392,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,MG,BR
6,1465415950,VIEW,-8864073373672512525,3609194402293569455,1143207167886864524,,,
7,1465415066,VIEW,-1492913151930215984,4254153380739593270,8743229464706506141,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,SP,BR
8,1465413762,VIEW,310515487419366995,344280948527967603,-3167637573980064150,,,
9,1465413771,VIEW,3064370296170038610,3609194402293569455,1143207167886864524,,,


In [8]:
interactions_df['eventType'].value_counts()

VIEW               61086
LIKE                5745
BOOKMARK            2463
COMMENT CREATED     1611
FOLLOW              1407
Name: eventType, dtype: int64

## Data preparation

Assign a weight to each type of user-article interaction.

In [9]:
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: event_type_strength[x])

Recommender systems suffer from the cold-start problem. To keep things simple and make validation easier, we keep only users who have interacted with at least 15 distinct articles.

In [10]:
INTERACTIONS_THRESHOLD = 15

users_interactions_count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()
print('# users: %d' % len(users_interactions_count_df))
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= INTERACTIONS_THRESHOLD].reset_index()[['personId']]
print(f'# users with at least {INTERACTIONS_THRESHOLD} interactions: {len(users_with_enough_interactions_df)}')

# users: 1895
# users with at least 15 interactions: 579


In [11]:
print('# of interactions: %d' % len(interactions_df))
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
               how = 'right',
               left_on = 'personId',
               right_on = 'personId')
print('# of interactions from users with at least 15 interactions: %d' % len(interactions_from_selected_users_df))

# of interactions: 72312
# of interactions from users with at least 15 interactions: 62717


A user may interact with the same article several times, so we aggregate the interactions per user-article pair.

In [12]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)
    
interactions_full_df = interactions_from_selected_users_df \
                    .groupby(['personId', 'contentId'])['eventStrength'].sum() \
                    .apply(smooth_user_preference).reset_index()
print('# of unique user/item interactions: %d' % len(interactions_full_df))
interactions_full_df.head(10)

# of unique user/item interactions: 34417


Unnamed: 0,personId,contentId,eventStrength
0,-9223121837663643404,-8949113594875411859,1.0
1,-9223121837663643404,-8377626164558006982,1.0
2,-9223121837663643404,-8208801367848627943,1.0
3,-9223121837663643404,-8187220755213888616,1.0
4,-9223121837663643404,-7423191370472335463,3.169925
5,-9223121837663643404,-7331393944609614247,1.0
6,-9223121837663643404,-6872546942144599345,1.0
7,-9223121837663643404,-6728844082024523434,1.0
8,-9223121837663643404,-6590819806697898649,1.0
9,-9223121837663643404,-6558712014192834002,1.584963
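To make the aggregation concrete, here is a minimal sketch of how summed event weights turn into the smoothed `eventStrength` values. The event histories are hypothetical and chosen only for illustration; the weights and the smoothing function are copied from the cells above.

```python
import math

# Weights copied from the event_type_strength cell above.
event_type_strength = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0,
}

def smooth_user_preference(x):
    # log2(1 + x) dampens the effect of many repeated interactions
    return math.log(1 + x, 2)

# Hypothetical event histories for a single user-article pair.
histories = {
    'one view': ['VIEW'],
    'two views': ['VIEW', 'VIEW'],
    'view + follow + comment': ['VIEW', 'FOLLOW', 'COMMENT CREATED'],
}

smoothed = {}
for name, events in histories.items():
    total = sum(event_type_strength[e] for e in events)
    smoothed[name] = smooth_user_preference(total)
    print(f'{name}: total={total}, smoothed={smoothed[name]:.6f}')
```

A total weight of 2 gives log2(3) ≈ 1.584963 and a total of 8 gives log2(9) ≈ 3.169925, which are the two non-unit values visible in the table above.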


Check that there are no duplicate pairs.

In [13]:
interactions_full_df.duplicated(subset=['personId', 'contentId']).sum()

0

# Evaluation

We validate on a holdout set. The split is stratified by user, so every user appears in both the training and the test set.

In [14]:
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 27533
# interactions on Test set: 6884


In [15]:
interactions_full_df['personId'].nunique(), \
interactions_train_df['personId'].nunique(), interactions_test_df['personId'].nunique()

(579, 579, 579)

In [16]:
interaction_per_person_count = pd.concat([interactions_train_df.groupby('personId')['contentId'].count(), interactions_test_df.groupby('personId')['contentId'].count()], axis=1)
interaction_per_person_count.columns = ['train', 'test']
interaction_per_person_count

Unnamed: 0_level_0,train,test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-9223121837663643404,34,9
-9172914609055320039,19,5
-9120685872592674274,19,5
-9109785559521267180,34,9
-9047547311469006438,19,5
...,...,...
9105269044962898535,14,3
9109075639526981934,58,15
9148269800512008413,41,10
9187866633451383747,14,3


The Top-N accuracy metric chosen is **Recall@N**, which checks whether an interacted item from the test set lands among the top N items (a hit) when it is ranked together with 100 randomly sampled items the user has not interacted with.  
P.S. Other popular ranking metrics are **NDCG@N** and **MAP@N**, whose scores take into account the position of the relevant item in the ranked list (the value is maximal when the relevant item is in the first position).
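The Recall@N protocol with sampled negatives can be sketched on toy data as follows. This is only an illustration under stated assumptions: the `scores` dictionary stands in for a model's predictions, and `hit_at_n` is a hypothetical helper, not part of the evaluator defined later.

```python
import random

def hit_at_n(scores, test_item, negatives, n):
    """Return 1 if test_item ranks in the top n among itself and the sampled negatives."""
    candidates = [test_item] + list(negatives)
    ranked = sorted(candidates, key=lambda item: scores[item], reverse=True)
    return int(test_item in ranked[:n])

# Hypothetical model scores for 200 items; a higher score means a stronger recommendation.
rng = random.Random(0)
scores = {item: rng.random() for item in range(200)}
scores[7] = 2.0    # above every random score, so item 7 always ranks first
scores[42] = -1.0  # below every random score, so item 42 always ranks last

test_items = [7, 42]  # items the user interacted with in the test set
pool = [i for i in scores if i not in test_items]

hits = 0
for item in test_items:
    negatives = rng.sample(pool, 100)  # 100 items assumed to be non-relevant
    hits += hit_at_n(scores, item, negatives, n=10)

recall_at_10 = hits / len(test_items)
print(recall_at_10)
```

Each test item competes against its own sample of 100 negatives, so the metric measures how often a relevant item beats roughly 90% of random items rather than how good the absolute top of the list is.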

In [17]:
#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

## Evaluation

* We evaluate two metrics: Precision and Recall.
* They are computed in two different ways:
  * Precision is computed directly on the top of the recommendation list.
  * Recall mixes each test item with additional items the user is known not to have interacted with.
* The metrics are computed for each user separately.  

Contract: the model implements the method


```
def recommend_items(self, user_id, topn=10, items_to_ignore=[], verbose=False) -> pd.DataFrame
```
* user_id - the user's id
* topn - the number of recommendations
* items_to_ignore - a blacklist of items that must not be recommended (usually those from the train dataset)
* verbose - controls the format of the result.
The result always contains contentId and recStrength.
If True, the article's title, url, and lang are added.
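As an illustration of this contract, below is a minimal sketch of a model that satisfies it. `PopularityRecommender` and the toy data are assumptions made for this example and are not defined elsewhere in the seminar. Besides `recommend_items`, the sketch also provides `get_model_name`, which the evaluator below calls.

```python
import pandas as pd

class PopularityRecommender:
    """Toy model: recommends the same globally popular items to every user."""

    MODEL_NAME = 'Popularity'

    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df  # columns: contentId, recStrength
        self.items_df = items_df            # only needed in verbose mode

    def get_model_name(self):
        return self.MODEL_NAME

    def recommend_items(self, user_id, items_to_ignore=(), topn=10, verbose=False):
        # Drop blacklisted items, then take the topn strongest ones
        mask = ~self.popularity_df['contentId'].isin(list(items_to_ignore))
        recs_df = self.popularity_df[mask] \
            .sort_values('recStrength', ascending=False) \
            .head(topn)
        if verbose:
            if self.items_df is None:
                raise ValueError('"items_df" is required in verbose mode')
            recs_df = recs_df.merge(self.items_df, how='left', on='contentId')[
                ['recStrength', 'contentId', 'title', 'url', 'lang']]
        return recs_df.reset_index(drop=True)

# Hypothetical toy data standing in for the aggregated interactions.
toy_interactions = pd.DataFrame({
    'personId': [1, 1, 2, 2, 3],
    'contentId': [10, 20, 10, 30, 10],
    'eventStrength': [1.0, 2.0, 1.0, 3.0, 1.0],
})
popularity_df = toy_interactions.groupby('contentId')['eventStrength'].sum() \
    .reset_index().rename(columns={'eventStrength': 'recStrength'})

model = PopularityRecommender(popularity_df)
recs = model.recommend_items(user_id=1, items_to_ignore={10, 20}, topn=2)
print(recs)
```

User 1 has already seen items 10 and 20, so only item 30 remains in the result.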



In [18]:
def get_items_interacted(person_id, interactions_df):
    interacted_items = interactions_df.loc[person_id]['contentId']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [45]:
# Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=43):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        all_items = set(articles_df['contentId'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(tuple(non_interacted_items), sample_size)
        return set(non_interacted_items_sample)

    def _verify_hit_top_n(self, item_id, recommended_items, topn):
        # Position of item_id in the ranked list, or -1 if it is absent
        try:
            index = next(i for i, c in enumerate(recommended_items) if c == item_id)
        except StopIteration:
            index = -1
        hit = int(0 <= index < topn)
        return hit, index

    def _calculate_hits(self, person_interacted_items_testset, person_id, person_recs_df): 
        hits_at_5_count = 0
        hits_at_10_count = 0
        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            # Getting a random sample (100) items the user has not interacted 
            # (to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]
            valid_recs = valid_recs_df['contentId'].values
            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

      
        return hits_at_5_count, hits_at_10_count

    def evaluate_model_for_user(self, model, person_id):
        # Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[[person_id]]
        person_interacted_items_testset = set(interacted_values_testset['contentId'])
        interacted_items_count_testset = len(person_interacted_items_testset) 

        # Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, 
                                                                                    interactions_train_indexed_df), 
                                               topn=10**7)
        
        hits_at_5_count, hits_at_10_count = self._calculate_hits(person_interacted_items_testset, person_id, person_recs_df)
        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        # when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)
        
        
        # print(interacted_items_count_testset)
        precision_at_3 = person_recs_df.iloc[:3]['contentId'].isin(person_interacted_items_testset).mean()

       

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10,
                          'precision@3': precision_at_3}
        return person_metrics

    def evaluate_model(self, model):
        # print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            if idx % 100 == 0 and idx > 0:
                print(f'{idx} users processed')
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_precision_at_3 = detailed_results_df['precision@3'].mean()
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10,
                          'precision@3': global_precision_at_3}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()

## Task 1

Add the computation of Average Precision and Mean Average Precision.
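As a possible starting point, here is a sketch of one common definition of AP@k and MAP@k on plain Python lists. The function names and the normalization by `min(len(relevant), k)` are assumptions (other variants normalize by the number of hits), and integrating this into `ModelEvaluator` is left open.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k: mean of precision@i over the ranks i where a relevant item appears.

    Assumes `recommended` is a ranked list without duplicates.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)

def mean_average_precision(recs_by_user, relevant_by_user, k=10):
    """MAP@k: AP@k averaged over users."""
    aps = [average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
           for u in relevant_by_user]
    return sum(aps) / len(aps)

# Hypothetical recommendations and test-set items for two users.
recs_by_user = {'u1': ['a', 'b', 'c', 'd'], 'u2': ['x', 'y']}
relevant_by_user = {'u1': {'a', 'c'}, 'u2': {'y'}}

ap_u1 = average_precision_at_k(recs_by_user['u1'], relevant_by_user['u1'])
map_score = mean_average_precision(recs_by_user, relevant_by_user)
print(ap_u1, map_score)
```

For user `u1` the hits are at ranks 1 and 3, so AP = (1/1 + 2/3) / 2; for `u2` the single hit at rank 2 gives AP = 1/2.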

# Content-Based Filtering model

Convert the articles into vector representations.

In [46]:
# Ignoring stopwords (words with no semantics) from English and Portuguese (as we have a corpus with mixed languages)
stopwords_list = stopwords.words('english') + stopwords.words('portuguese')


# Trains a model whose vectors size is 5000, composed by the main unigrams and bigrams found in the corpus, ignoring stopwords
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=5000,
                     stop_words=stopwords_list)

item_ids = articles_df['contentId'].tolist()
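# NOTE: the "" separator glues the last word of the title to the first word of the text;
# a " " separator would keep them as separate tokens
tfidf_matrix = vectorizer.fit_transform(articles_df['title'] + "" + articles_df['text'])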
tfidf_feature_names = vectorizer.get_feature_names_out()
tfidf_matrix

<3047x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 638928 stored elements in Compressed Sparse Row format>

In [47]:
x = tfidf_matrix[0].toarray()

In [48]:
x.shape, np.sum(x)

((1, 5000), 9.50309706256499)

In [49]:
np.linalg.norm(x)

1.0
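The unit norm is expected: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the entries of a row do not sum to 1, but its Euclidean length is 1. One consequence, sketched below on two hypothetical vectors, is that the cosine similarity of normalized vectors is simply their dot product.

```python
import numpy as np

# Two hypothetical unnormalized term-weight vectors.
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# L2 normalization: divide each vector by its Euclidean norm.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity from the definition vs. dot product of the unit vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = a_unit @ b_unit

print(np.linalg.norm(a_unit), a_unit.sum())
print(cosine, dot_of_units)
```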

A user profile is built from all the articles the user has interacted with.

In [50]:
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_users_profile(person_id, interactions_indexed_df):
    # .loc[[person_id]] always returns a DataFrame, even for a single interaction
    interactions_person_df = interactions_indexed_df.loc[[person_id]]
    user_item_profiles = get_item_profiles(interactions_person_df['contentId'])
    
    user_item_strengths = np.array(interactions_person_df['eventStrength']).reshape(-1,1)
    
    # Weighted average of item profiles by the interactions strength. Why multiply? 
    # np.asarray converts the np.matrix returned by the sparse sum into a plain ndarray
    user_item_strengths_weighted_avg = np.asarray(np.sum(user_item_profiles.multiply(user_item_strengths), axis=0)) \
                                              / np.sum(user_item_strengths)
    user_profile_norm = sklearn.preprocessing.normalize(user_item_strengths_weighted_avg)
    return user_profile_norm

def build_users_profiles(): 
    interactions_indexed_df = interactions_train_df[interactions_train_df['contentId'] \
                                                   .isin(articles_df['contentId'])].set_index('personId')
    user_profiles = {}
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles
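The profile computation can be followed on a tiny dense example. This is a sketch with hypothetical numbers: two item vectors over a three-term vocabulary and two interaction strengths, using plain NumPy instead of sparse matrices.

```python
import numpy as np

# Hypothetical item profiles: item 0 is about term 0, item 1 is about term 1.
item_profiles = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

# Hypothetical interaction strengths: the user engaged more with item 0.
strengths = np.array([3.0, 1.0]).reshape(-1, 1)

# Each item row is scaled by its strength, summed, and divided by the total strength.
weighted_avg = (item_profiles * strengths).sum(axis=0) / strengths.sum()

# L2 normalization, the same operation sklearn.preprocessing.normalize applies per row.
user_profile = weighted_avg / np.linalg.norm(weighted_avg)

print(weighted_avg)
print(user_profile)
```

The resulting profile points mostly in the direction of term 0, because the item dominated by that term carries three times the weight.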

In [51]:
import warnings
# for np.matrix FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning) 
# warnings.filterwarnings("ignore", category=DeprecationWarning) 

user_profiles = build_users_profiles()
len(user_profiles)

579

This is the profile of the author of the original Jupyter notebook.

In [52]:
author_id = -1479311724257856983
author_profile = user_profiles[author_id]
print(author_profile.shape)
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        user_profiles[-1479311724257856983].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

(1, 5000)


Unnamed: 0,token,relevance
0,learning,0.301096
1,machine learning,0.254526
2,machine,0.247158
3,google,0.196238
4,data,0.180081
5,graph,0.118791
6,algorithms,0.112597
7,ai,0.103908
8,like,0.096942
9,human,0.078915


In [53]:
cosine_sim_predictions = cosine_similarity(author_profile, tfidf_matrix)[0]
cosine_sim_predictions

array([0.14170787, 0.1312642 , 0.21366766, ..., 0.16456611, 0.31695328,
       0.03900567])

In [54]:
articles_df.iloc[cosine_sim_predictions.argmax()]

timestamp                                                 1472776911
eventType                                             CONTENT SHARED
contentId                                        5250363310227021277
authorPersonId                                  -1443636648652872475
authorSessionId                                  6245842811012374102
authorUserAgent                                                  NaN
authorRegion                                                     NaN
authorCountry                                                    NaN
contentType                                                     HTML
url                https://backchannel.com/how-google-is-remaking...
title              How Google is Remaking Itself as a "Machine Le...
text               If you want to build artificial intelligence i...
lang                                                              en
Name: 2252, dtype: object

In [55]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, items_df=None):
        self.item_ids = item_ids
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_items_to_user_profile(self, person_id, topn=1000):
        # Computes the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)[0]
        # Gets the top similar items
        similar_indices = cosine_similarities.argsort()[-topn:]
        # Sort the similar items by similarity
        similar_items = sorted([(item_ids[i], cosine_similarities[i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self._get_similar_items_to_user_profile(user_id)
        # Ignores items the user has already interacted
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['contentId', 'recStrength']) \
                                    .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
content_based_recommender_model = ContentBasedRecommender(articles_df)

In [56]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('')
pd.Series(cb_global_metrics)
#cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...
100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
578 users processed



modelName      Content-Based
recall@5            0.167199
recall@10           0.263655
precision@3         0.035694
dtype: object

# Collaborative Filtering model

## Matrix Factorization

Latent factor models compress the user-item matrix into a low-dimensional representation in terms of latent factors. One advantage of this approach is that, instead of a high-dimensional matrix with a large number of missing values, we deal with a much smaller matrix in a lower-dimensional space.  
The reduced representation can be used with either user-based or item-based neighborhood algorithms. This paradigm has several advantages: it handles the sparsity of the original matrix better than memory-based methods, and comparing similarities on the resulting matrix scales much better, especially for large sparse datasets.  

Here we use a popular latent factor model, [Singular Value Decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition). There are other matrix factorization frameworks more specific to CF that you might try, such as [surprise](https://github.com/NicolasHug/Surprise), [mrec](https://github.com/Mendeley/mrec) or [python-recsys](https://github.com/ocelma/python-recsys). We chose the [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html) implementation of SVD because it is available on Kaggle kernels. 

P.S. See an example of SVD on a movies dataset in this [blog post](https://beckernick.github.io/matrix-factorization-recommender/).

An important decision is the number of factors used to factorize the user-item matrix. The higher the number of factors, the more precisely the factorization reconstructs the original matrix. However, if the model is allowed to memorize too many details of the original matrix, it may not generalize well to data it was not trained on. Reducing the number of factors improves generalization.
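The effect of the number of factors on reconstruction can be checked on a small random matrix. This is a sketch under stated assumptions: it uses the dense `np.linalg.svd` on a hypothetical 20x30 matrix instead of `svds` on the real user-item matrix, and it measures only the error on the matrix being factorized, not generalization.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((20, 30))  # hypothetical dense "user-item" matrix

# Full (thin) SVD; singular values come sorted in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

errors = {}
for k in (1, 5, 10, 20):
    # Keep only the k largest singular values and their vectors.
    R_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    errors[k] = np.linalg.norm(R - R_k)  # Frobenius norm of the residual
    print(f'k={k:2d}  reconstruction error={errors[k]:.4f}')
```

The error shrinks as k grows and vanishes at full rank, which is exactly the memorization the paragraph above warns about: a perfect reconstruction of the training matrix predicts 0 for every unseen interaction.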

In [57]:
#Creating a dense pivot table with users in rows and items in columns (missing interactions are filled with 0)
users_items_pivot_matrix_df = interactions_train_df.pivot(index='personId', 
                                                          columns='contentId', 
                                                          values='eventStrength').fillna(0)

users_items_pivot_matrix_df.head(10)

contentId,-9222795471790223670,-9216926795620865886,-9194572880052200111,-9192549002213406534,-9190737901804729417,-9189659052158407108,-9184137057748005562,-9176143510534135851,-9172673334835262304,-9171475473795142532,...,9191014301634017491,9207286802575546269,9208127165664287660,9209629151177723638,9209886322932807692,9213260650272029784,9215261273565326920,9217155070834564627,9220445660318725468,9222265156747237864
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-9223121837663643404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9172914609055320039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9120685872592674274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9109785559521267180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9047547311469006438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9016528795238256703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.584963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9009798162809551896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-9001583565812478106,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-8994220765455693336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-8891033171626175843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
users_items_pivot_matrix = users_items_pivot_matrix_df.values
users_items_pivot_matrix[:10]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [59]:
users_items_pivot_matrix.shape

(579, 2911)

In [60]:
users_ids = list(users_items_pivot_matrix_df.index)
users_ids[:10]

[-9223121837663643404,
 -9172914609055320039,
 -9120685872592674274,
 -9109785559521267180,
 -9047547311469006438,
 -9016528795238256703,
 -9009798162809551896,
 -9001583565812478106,
 -8994220765455693336,
 -8891033171626175843]

In [61]:
users_items_pivot_sparse_matrix = csr_matrix(users_items_pivot_matrix)
users_items_pivot_sparse_matrix

<579x2911 sparse matrix of type '<class 'numpy.float64'>'
	with 27533 stored elements in Compressed Sparse Row format>

In [62]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
#U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
U, sigma, Vt = svds(users_items_pivot_sparse_matrix, k = NUMBER_OF_FACTORS_MF)

In [63]:
U.shape

(579, 15)

In [64]:
Vt.shape

(15, 2911)

In [65]:
sigma = np.diag(sigma)
sigma.shape

(15, 15)

Reconstruct (approximately) the original matrix from the factors.

In [66]:
user_factors = np.dot(U, sigma)
item_factors = Vt
print(user_factors.shape)
print(item_factors.shape)
all_user_predicted_ratings = np.dot(user_factors, item_factors) 
all_user_predicted_ratings

(579, 15)
(15, 2911)


array([[ 0.03884689,  0.00197415, -0.00641533, ..., -0.01781361,
         0.01269934,  0.00483313],
       [ 0.00633741,  0.00061755,  0.01266008, ...,  0.01764341,
         0.01334771, -0.00054403],
       [-0.00854964,  0.00158704, -0.01909456, ..., -0.00934293,
         0.0227213 ,  0.00102611],
       ...,
       [-0.03380878,  0.01353069,  0.12074347, ..., -0.00900013,
         0.09684967, -0.00107693],
       [-0.01922095,  0.00066254,  0.01352422, ..., -0.00361609,
         0.00790632, -0.00509767],
       [-0.00593284,  0.00412424,  0.11886455, ..., -0.01055101,
         0.05437229,  0.00106111]])

In [67]:
user_factors.shape

(579, 15)

In [68]:
all_user_predicted_ratings_norm = (all_user_predicted_ratings - all_user_predicted_ratings.min()) \
                            / (all_user_predicted_ratings.max() - all_user_predicted_ratings.min())

In [69]:
all_user_predicted_ratings_norm

array([[0.19729339, 0.19330345, 0.19239564, ..., 0.19116225, 0.19446401,
        0.19361282],
       [0.19377559, 0.19315666, 0.19445976, ..., 0.194999  , 0.19453417,
        0.19303097],
       [0.19216469, 0.19326156, 0.19102364, ..., 0.19207885, 0.19554847,
        0.19320087],
       ...,
       [0.18943144, 0.19455397, 0.20615528, ..., 0.19211594, 0.20356977,
        0.1929733 ],
       [0.19100997, 0.19316153, 0.19455327, ..., 0.19269854, 0.19394536,
        0.19253822],
       [0.19244785, 0.19353611, 0.20595196, ..., 0.19194813, 0.19897336,
        0.19320465]])
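The normalization above uses a single global minimum and maximum, so it is a monotonic rescaling of every entry. A small sketch with hypothetical predictions shows that it maps the scores into [0, 1] without changing the per-user ranking of items.

```python
import numpy as np

# Hypothetical predicted ratings: 2 users x 3 items.
preds = np.array([[0.04, -0.01, 0.12],
                  [-0.03, 0.10, 0.01]])

# Global min-max scaling, as in the cell above.
preds_norm = (preds - preds.min()) / (preds.max() - preds.min())

# Item order per user (best first) before and after scaling.
order_before = np.argsort(-preds, axis=1)
order_after = np.argsort(-preds_norm, axis=1)

print(preds_norm)
print(order_before)
print(order_after)
```

Because the ranking is preserved, the scaling affects only the reported `recStrength` values, which matters mainly when scores from different models are combined.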

In [70]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings_norm, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()
cf_preds_df.head(10)

Unnamed: 0_level_0,-9223121837663643404,-9172914609055320039,-9120685872592674274,-9109785559521267180,-9047547311469006438,-9016528795238256703,-9009798162809551896,-9001583565812478106,-8994220765455693336,-8891033171626175843,...,8920667914865172372,8941502917401491878,8968131284214320024,8992729171160464416,9013651444868609421,9105269044962898535,9109075639526981934,9148269800512008413,9187866633451383747,9210530975708218054
contentId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-9222795471790223670,0.197293,0.193776,0.192165,0.196263,0.194465,0.172878,0.197465,0.196408,0.190187,0.19043,...,0.194699,0.19219,0.195809,0.192604,0.19292,0.195463,0.201805,0.189431,0.19101,0.192448
-9216926795620865886,0.193303,0.193157,0.193262,0.193029,0.19315,0.196378,0.193247,0.193374,0.193211,0.19299,...,0.193388,0.193221,0.192738,0.193461,0.19299,0.193079,0.194128,0.194554,0.193162,0.193536
-9194572880052200111,0.192396,0.19446,0.191024,0.191033,0.194158,0.200822,0.208383,0.197681,0.193213,0.193296,...,0.191739,0.193762,0.200898,0.195563,0.191956,0.193291,0.202385,0.206155,0.194553,0.205952
-9192549002213406534,0.195248,0.195555,0.192293,0.18752,0.191386,0.245042,0.272697,0.201302,0.190222,0.183381,...,0.187816,0.195538,0.192972,0.193967,0.196563,0.192914,0.20474,0.200002,0.193223,0.2063
-9190737901804729417,0.196517,0.193665,0.190318,0.195696,0.193077,0.201625,0.19961,0.193918,0.195111,0.196067,...,0.195942,0.19308,0.201863,0.193195,0.19188,0.194846,0.192252,0.190439,0.193561,0.191563
-9189659052158407108,0.195506,0.193325,0.19665,0.193675,0.194151,0.248327,0.21533,0.198478,0.193998,0.191868,...,0.194165,0.195528,0.192692,0.195283,0.194976,0.193723,0.203666,0.202546,0.193675,0.202741
-9184137057748005562,0.193465,0.193236,0.193092,0.193489,0.193313,0.194694,0.193265,0.193201,0.193441,0.193719,...,0.193527,0.193138,0.191417,0.193207,0.192952,0.193266,0.193513,0.193398,0.193209,0.193109
-9176143510534135851,0.195782,0.193003,0.200202,0.198451,0.193589,0.200844,0.20176,0.193026,0.19418,0.196328,...,0.197234,0.194198,0.19422,0.193165,0.196341,0.193831,0.197615,0.19763,0.193437,0.192803
-9172673334835262304,0.193578,0.193297,0.192434,0.19297,0.192827,0.200074,0.196479,0.193665,0.193127,0.193319,...,0.193255,0.193241,0.191784,0.193112,0.193197,0.193625,0.19332,0.193609,0.193407,0.193398
-9171475473795142532,0.195352,0.194024,0.19155,0.194862,0.192493,0.196167,0.196901,0.193767,0.194467,0.195353,...,0.195095,0.193333,0.200173,0.193251,0.193033,0.193907,0.193507,0.192193,0.193972,0.192422


In [71]:
len(cf_preds_df.columns)

579

In [72]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=(), topn=10, verbose=False):
        # Get the user's predicted scores, sorted in descending order
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the highest-scoring articles that are not in items_to_ignore
        # (for example, articles the user has already interacted with).
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['contentId'].isin(items_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
cf_recommender_model = CFRecommender(cf_preds_df, articles_df)

In [73]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('')
pd.Series(cf_global_metrics)
#cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...
100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
578 users processed



modelName      Collaborative Filtering
recall@5                      0.325973
recall@10                     0.460198
precision@3                   0.083477
dtype: object

## Task 2

Implement matrix factorization based on AlternatingLeastSquares and compute the metrics.
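As a starting point for this task, the alternating idea itself can be sketched in plain NumPy. This is not the `AlternatingLeastSquares` class from the `implicit` package: the helper name `als_factorize`, its parameters, and the toy matrix are all assumptions made for illustration. The sketch treats every entry of a dense users x items matrix, zeros included, as an observed value, whereas implicit-feedback ALS additionally weights entries by confidence.

```python
import numpy as np

def als_factorize(ratings, n_factors=2, n_iters=20, reg=0.1, seed=0):
    """Regularized ALS on a dense users x items matrix (hypothetical helper).

    Every entry, including zeros, is treated as an observed value.
    """
    rng = np.random.default_rng(seed)
    n_items = ratings.shape[1]
    # Only the item factors need a random start: the user factors are
    # computed from them in the first half-step.
    item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
    reg_eye = reg * np.eye(n_factors)

    for _ in range(n_iters):
        # Item factors fixed: closed-form ridge solution for all users at once.
        gram = item_factors.T @ item_factors + reg_eye
        user_factors = np.linalg.solve(gram, item_factors.T @ ratings.T).T
        # User factors fixed: the same kind of solve for all items.
        gram = user_factors.T @ user_factors + reg_eye
        item_factors = np.linalg.solve(gram, user_factors.T @ ratings).T
    return user_factors, item_factors

# Toy matrix: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 5.0, 4.0, 0.0],
    [0.0, 1.0, 4.0, 5.0, 0.0],
])
toy_user_factors, toy_item_factors = als_factorize(toy_ratings)
toy_predicted = toy_user_factors @ toy_item_factors.T
print(np.round(toy_predicted, 2))
```

Each half-step is an ordinary ridge regression, which is why ALS needs no learning rate. To plug such factors into the evaluation above, the predicted matrix would still have to be wrapped in a recommender object exposing `recommend_items`, as `CFRecommender` does.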


In [74]:
!pip install implicit



## Task 3

Implement an ensemble of recommender systems (the CF and CB models) and compute the metrics. The predictions of these models should be combined with weights, and the weights should be passed as parameters.
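One possible way to combine two models is a weighted sum of their scores per article. The sketch below assumes that both models return a DataFrame with `contentId` and `recStrength` columns, as `CFRecommender.recommend_items` does. The function name `blend_recommendations`, the output column `recStrengthHybrid`, and the toy scores are hypothetical.

```python
import pandas as pd

def blend_recommendations(cb_recs_df, cf_recs_df, cb_weight=1.0, cf_weight=1.0, topn=10):
    """Weighted sum of two recommendation lists (hypothetical helper)."""
    # Outer merge keeps articles proposed by only one of the models;
    # a missing score from the other model is treated as 0.
    merged = cb_recs_df.merge(cf_recs_df, how='outer', on='contentId',
                              suffixes=('CB', 'CF')).fillna(0.0)
    merged['recStrengthHybrid'] = (cb_weight * merged['recStrengthCB']
                                   + cf_weight * merged['recStrengthCF'])
    return (merged.sort_values('recStrengthHybrid', ascending=False)
                  .head(topn)[['contentId', 'recStrengthHybrid']]
                  .reset_index(drop=True))

# Toy scores standing in for the outputs of a CB and a CF model.
toy_cb_recs = pd.DataFrame({'contentId': [1, 2, 3], 'recStrength': [0.9, 0.5, 0.1]})
toy_cf_recs = pd.DataFrame({'contentId': [2, 3, 4], 'recStrength': [0.3, 0.8, 0.6]})
toy_hybrid = blend_recommendations(toy_cb_recs, toy_cf_recs, cb_weight=1.0, cf_weight=2.0)
print(toy_hybrid)
```

Because the two models may produce scores on different scales, the weights also act as a rescaling. Request more than `topn` items from each model before blending, so that an article ranked low by one model can still reach the final list.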

# Testing

In [75]:
def inspect_interactions(person_id, test_set=True):
    if test_set:
        interactions_df = interactions_test_indexed_df
    else:
        interactions_df = interactions_train_indexed_df
    return interactions_df.loc[person_id].merge(articles_df, how = 'left', 
                                                      left_on = 'contentId', 
                                                      right_on = 'contentId') \
                          .sort_values('eventStrength', ascending = False)[['eventStrength', 
                                                                          'contentId',
                                                                          'title', 'url', 'lang']]

The author's interests are **machine learning**, **deep learning**, **artificial intelligence**, and **google cloud platform**.

In [76]:
inspect_interactions(author_id, test_set=False).head(10)

Unnamed: 0,eventStrength,contentId,title,url,lang
48,4.285402,7342707578347442862,"At eBay, Machine Learning is Driving Innovativ...",https://www.ebayinc.com/stories/news/at-ebay-m...,en
86,4.044394,-4460374799273064357,"Deep Learning for Chatbots, Part 1 - Introduction",http://www.wildml.com/2016/04/deep-learning-fo...,en
107,3.906891,2589533162305407436,6 reasons why I like KeystoneML,http://radar.oreilly.com/2015/07/6-reasons-why...,en
51,3.70044,5258604889412591249,Machine Learning Is No Longer Just for Experts,https://hbr.org/2016/10/machine-learning-is-no...,en
97,3.523562,-4944551138301474550,Algorithms and architecture for job recommenda...,https://www.oreilly.com/ideas/algorithms-and-a...,en
20,3.459432,-532999578436827210,IBM Seeks to Simplify Graph with New Titan Ser...,https://www.datanami.com/2016/07/27/ibm-seeks-...,en
90,3.459432,-8377626164558006982,Bad Writing Is Destroying Your Company's Produ...,https://hbr.org/2016/09/bad-writing-is-destroy...,en
60,3.459432,444378495316508239,How to choose algorithms for Microsoft Azure M...,https://azure.microsoft.com/en-us/documentatio...,en
55,3.459432,-5658245291907121574,Machine Learning and the VP Debate,https://medium.com/@srobtweets/machine-learnin...,en
79,3.321928,-8085935119790093311,Graph Capabilities with the Elastic Stack,https://www.elastic.co/webinars/sneak-peek-of-...,en


And the recommendations:

In [77]:
cf_recommender_model.recommend_items(author_id, topn=10, verbose=True)

Unnamed: 0,recStrength,contentId,title,url,lang
0,0.29363,-8085935119790093311,Graph Capabilities with the Elastic Stack,https://www.elastic.co/webinars/sneak-peek-of-...,en
1,0.291452,-6467708104873171151,5 reasons your employees aren't sharing their ...,http://justcuriousblog.com/2016/04/5-reasons-y...,en
2,0.286092,-6727357771678896471,This Super Accurate Portrait Selection Tech Us...,http://petapixel.com/2016/06/29/super-accurate...,en
3,0.283065,5258604889412591249,Machine Learning Is No Longer Just for Experts,https://hbr.org/2016/10/machine-learning-is-no...,en
4,0.279078,-7126520323752764957,"How Google is Remaking Itself as a ""Machine Le...",https://backchannel.com/how-google-is-remaking...,en
5,0.278863,3320201327008235211,How Mark Zuckerberg Led Facebook's War to Crus...,http://www.vanityfair.com/news/2016/06/how-mar...,en
6,0.27828,4118743389464105405,Why Google App Engine rocks: A Google engineer...,https://cloudplatform.googleblog.com/2016/04/w...,en
7,0.274187,1549650080907932816,Spark comparison: AWS vs. GCP,https://www.oreilly.com/ideas/spark-comparison...,en
8,0.273819,4241385837165861843,Understanding the Twitter topic Tracker,https://unionmetrics.zendesk.com/hc/en-us/arti...,en
9,0.272217,2468005329717107277,How Netflix does A/B Testing - uxdesign.cc - U...,https://uxdesign.cc/how-netflix-does-a-b-testi...,en


# Conclusion

## The recommendation task
- a regression problem
- a classification problem (binary or multiclass)
- a ranking problem






## Models

- Matrix Factorization (SVD, ALS, SLIM ...)
- Factorization Machines
- Random Walk
- Linear Models
- Deep Neural Networks
- Variational Auto Encoder
- Graph Convolution Network
- ...


## Metrics


https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision

Given the true positive, false positive, and false negative counts ($TP_k$, $FP_k$, $FN_k$) for each class $k \in C$, the micro-averaged scores are computed as follows:

$$Precision_{micro} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FP_k)}$$

$$Recall_{micro} = \frac{\sum_{k \in C} TP_k}{\sum_{k \in C} (TP_k + FN_k)}$$

The F1-score is the harmonic mean of precision and recall:

$$MeanFScore = F1_{micro} = \frac{2 \, Precision_{micro} \, Recall_{micro}}{Precision_{micro} + Recall_{micro}}$$

The macro-averaged F1-score is the arithmetic mean of the per-class F1-scores, each of which is the harmonic mean of that class's precision $P_k$ and recall $R_k$:

$$F1_{macro} = \frac{1}{|C|} \sum_{k \in C} F1score_k = \frac{1}{|C|} \sum_{k \in C} \frac{2 P_k R_k}{P_k + R_k}$$
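The difference between the two averages is easier to see on numbers. The sketch below computes both scores from per-class counts; the function name `micro_macro_f1`, the `(TP, FP, FN)` tuple layout, and the toy counts are assumptions made for illustration.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def micro_macro_f1(counts):
    """counts maps each class k to a tuple (TP_k, FP_k, FN_k)."""
    # Micro: pool the counts over all classes, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision_micro = safe_div(tp, tp + fp)
    recall_micro = safe_div(tp, tp + fn)
    f1_micro = safe_div(2 * precision_micro * recall_micro,
                        precision_micro + recall_micro)

    # Macro: compute F1 per class, then take the plain average.
    per_class_f1 = []
    for tp_k, fp_k, fn_k in counts.values():
        p_k = safe_div(tp_k, tp_k + fp_k)
        r_k = safe_div(tp_k, tp_k + fn_k)
        per_class_f1.append(safe_div(2 * p_k * r_k, p_k + r_k))
    f1_macro = safe_div(sum(per_class_f1), len(per_class_f1))
    return f1_micro, f1_macro

# Toy counts: a frequent class predicted well and a rare class predicted poorly.
toy_counts = {'frequent': (8, 2, 2), 'rare': (1, 1, 3)}
f1_micro, f1_macro = micro_macro_f1(toy_counts)
print(round(f1_micro, 4), round(f1_macro, 4))  # 0.6923 0.5667
```

The micro average is dominated by the frequent class, while the macro average gives the rare class the same weight and is therefore lower here.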