# Recommender Systems in Python 101
This document will demonstrate how to implement **Collaborative Filtering, Content-Based Filtering and Hybrid mothods** for the task of providing personalized recommendations in Python.

In [1]:
import numpy as np
import pandas as pd
import scipy, math, random, sklearn
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt


### Loading data : CI&T Deskdrop dataset
The dataset contains **shared_articles.csv** and **users_interactions.csv**.   

##### shared_articles.csv
It contains information about the articles shared in the platform.
- sharing date, original url, content in plain text, language and author    

Event types at a given timestamp are **CONTENT SHARED** and **CONTENT REMOVED** but only consider here the "CONTENT SHARED" type for the sake of simplicity assuming that all articles were available during the period.

In [2]:
articles_df = pd.read_csv('./archive/shared_articles.csv')
articles_df = articles_df[articles_df['eventType']=='CONTENT SHARED']
articles_df.head()

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,CONTENT SHARED,-2826566343807132236,4340306774493623681,8940341205206233829,,,,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


##### users-interactions.csv
It contains logs of user interactions on shared articles and can be joined to **articles_shared.csv** by **contentId** column.
- VIEW, LIKE, COMMENT CREATED, FOLLOW and BOOKMARK


In [3]:
interactions_df = pd.read_csv('./archive/users_interactions.csv')
interactions_df.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,


### Data munging
We associate different interactions types with a weight or strength.   
EX) Assuming that a comment in an article indicates a higher interest of the user than a like or a simple view.

In [4]:
# 각 이벤트에 서로 다른 가중치 배정
event_strength = {'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,}

interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: event_strength[x])

----
### `pd.DataFrame.apply()`

In [9]:
testdf = pd.DataFrame([[4,'view'],[1,'like'],[2,'comment']],columns=['A','B'])
testdf

Unnamed: 0,A,B
0,4,view
1,1,like
2,2,comment


In [10]:
teststr = {'view':1, 'like':22, 'comment':44}

In [11]:
testdf['C'] = testdf['B'].apply(lambda x: teststr[x])

In [12]:
testdf

Unnamed: 0,A,B,C
0,4,view,1
1,1,like,22
2,2,comment,44


---

To avoid the problem of ***user cold-start***, we are keeping in the dataset only users with at least 5 interactions.

In [5]:
users_inter_cnt_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()

In [6]:
print('# users: %d' % len(users_inter_cnt_df))

# users: 1895


In [7]:
users_with_5inter_df = users_inter_cnt_df[users_inter_cnt_df >= 5].reset_index()[['personId']]

In [8]:
print('# users with enough interactions: %d' % len(users_with_5inter_df))

# users with enough interactions: 1140


---
### `pd.DataFrame.groupby().size()`

In [51]:
testsize = testdf.groupby(['A','B']).size().groupby(['A']).size()

In [53]:
testsize[testsize>=3]

A
c    3
dtype: int64

In [50]:
testsize[testsize>=3].reset_index()

Unnamed: 0,A,0
0,c,3


In [54]:
testsize[testsize>=3].reset_index()['A']

0    c
Name: A, dtype: object

In [55]:
testsize[testsize>=3].reset_index()[['A']]

Unnamed: 0,A
0,c


----

In [9]:
print('# of interactions: %d' % len(interactions_df))
interact_from_selected_users_df = interactions_df.merge(users_with_5inter_df,
                                                       how='right',
                                                       left_on='personId',
                                                       right_on='personId')
print('# of interactions from users with enough interactions: %d' %len(interact_from_selected_users_df))

# of interactions: 72312
# of interactions from users with enough interactions: 69868


To model the user interset on a given article, we aggregate all the interactions the user has performed in an item by weighted sum of interaction type strength and apply a log trans to smooth the distribution.

In [10]:
def smooth_user_preference(x):
    return math.log(1+x, 2)

interactions_full_df = interact_from_selected_users_df.groupby([
    'personId','contentId'])['eventStrength'].sum().apply(smooth_user_preference).reset_index()
print('# of uniq user/item interactions: %d' % len(interactions_full_df))
interactions_full_df.head(10)

# of uniq user/item interactions: 39106


Unnamed: 0,personId,contentId,eventStrength
0,-9223121837663643404,-8949113594875411859,1.0
1,-9223121837663643404,-8377626164558006982,1.0
2,-9223121837663643404,-8208801367848627943,1.0
3,-9223121837663643404,-8187220755213888616,1.0
4,-9223121837663643404,-7423191370472335463,3.169925
5,-9223121837663643404,-7331393944609614247,1.0
6,-9223121837663643404,-6872546942144599345,1.0
7,-9223121837663643404,-6728844082024523434,1.0
8,-9223121837663643404,-6590819806697898649,1.0
9,-9223121837663643404,-6558712014192834002,1.584963


### Evaluation
One key aspect of evaluation is to ensure that the trained model generalizes for data it was not trained on, using **Cross-validation** techniques. We are using a simple cross-validation approach named **holdout** in which a random data sample are kept aside in the training process and exclusively used for evaluation.   
All evaluation metrics reported here are computed using the **test set**.      

P.s. train set과 test set을 날짜를 기준으로 구분하여 더 탄탄한 evaluation을 수행할 수 있습니다.   

Evaluation을 위해 사용되는 set metrics 중 **Top-N accuracy metrics**를 사용한다. Test set에서 유저가 실제로 상호작용한 아이템들과 비교하여 유저에게 제안된 상위 추천들의 정확도를 계산한다.   
- For each user
 - For each item the user has interacted in test set
  - 유저가 상호작용하지 않은 100개의 다른 아이템을 샘플링한다.   
  이 아이템들이 유저와 관계가 없다는 단순한 가정을 바탕으로 한다.
  - 하나의 상호작용한 아이템과 100개의 상호작용하지 않은 아이템에서 추천되는 아이템들의 랭크 리스트를 recommender model로 생성한다.
  - 랭크 리스트에서 Top-N accuracy metrics를 계산한다.
- Global Top-N accuracy metrics를 총합한다.   

*Recall@N, NDCG@N, MAP@N*

In [11]:
inter_train_df, inter_test_df = train_test_split(interactions_full_df,
                                                stratify=interactions_full_df['personId'],
                                                test_size=0.2,
                                                random_state=27)

print('# interactions on Train: %d' % len(inter_train_df))
print('# interactions on Test: %d' % len(inter_test_df))

# interactions on Train: 31284
# interactions on Test: 7822


In [12]:
# Indexing by personId to speed up the seaches during evaluaion
interactions_full_df = interactions_full_df.set_index('personId')
inter_train_df = inter_train_df.set_index('personId')
inter_test_df = inter_test_df.set_index('personId')

In [13]:
# 유저와 관련된 아이템 뽑아내기
def get_items_interacted(personid, inter_df):
    interacted_items = inter_df.loc[personid]['contentId']
    return set(interacted_items if type(interacted_items)==pd.Series else [interacted_items])

In [70]:
# Top-N accuracy metrics consts
Eval_Random_Sample_Non_Interacted_Items = 100

class ModelEvaluator:
    # 유저와 관련되지 않은 아이템 중 샘플 뽑아내기
    def get_not_interacted_items_sample(self, personid, sample_size, seed=27):
        interacted_items = get_items_interacted(personid, interactions_full_df)
        all_items = set(articles_df['contentId'])
        non_interacted_items = all_items-interacted_items
        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)
        
    # recommended_items와 item_id는 같은 format    
    def verify_hit_top_n(self, item_id, recommended_items, topn):
        try:
            index = next(i for i,c in enumerate(recommended_items) if c==item_id)
        except:
            index = -1
        # hit = True / False
        hit = int(index in range(0, topn))
        return hit, index
    
    def evaluate_model_for_user(self, model, personid):
        # Getting the items in test set
        interacted_values_testset = inter_test_df.loc[personid]
        if type(interacted_values_testset['contentId'])==pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['contentId'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])
        interacted_items_cnt_testset = len(person_interacted_items_testset)
    
        # Getting a ranked recommendation list from a model for a user
        person_recs_df = model.recommend_items(personid, items_to_ignore=get_items_interacted(personid,inter_train_df),
                                               topn=10000000000)
        hits_at_5_cnt = 0
        hits_at_10_cnt = 0
        
        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            # Getting a random sample items the user has not interacted
            # represnet items that are assumed to be no relevant to the user
            non_interacted_items_sample = self.get_not_interacted_items_sample(personid,
                                                                               sample_size=Eval_Random_Sample_Non_Interacted_Items,
                                                                               seed=item_id%(2**32))
            
            # Combining the current interactec item with the random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))
            
            # Filtering only recommendations that are either -, -
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]
            valid_recs = valid_recs_df['contentId'].values
            
            # Verifying if the current inter item is among the Top-N
            hits_at_5, index_at_5 = self.verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_cnt += hits_at_5
            
            hits_at_10, index_at_10 = self.verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_cnt += hits_at_10
            
            # Recall is the rate of the interacted items that are ranked among the Top-N
            # when mixed with a set of non-rel items
        recall_at_5 = hits_at_5_cnt / float(interacted_items_cnt_testset)
        recall_at_10 = hits_at_10_cnt / float(interacted_items_cnt_testset)

        person_metrics = {'hits@5_count':hits_at_5_cnt, 
                      'hits@10_count':hits_at_10_cnt, 
                      'interacted_count': interacted_items_cnt_testset,
                      'recall@5': recall_at_5,
                      'recall@10': recall_at_10}
        return person_metrics
            
    def evaluate_model(self, model):
        print('Running evaluation for users')
        people_metrics = []
        
        for idx, personid in enumerate(list(inter_test_df.index.unique().values)):
            if idx % 100 == 0 and idx > 0:
                print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, personid)
            person_metrics['_personid'] = personid
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)
        
        detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count',
                                                                       ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}
        
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()
        

### Popularity model
일반적인 기준 접근법은 Popularity model이다. 이 모델은 개인화되지 않고, 유저가 기존에 사용한 적 없는 가장 인기있는 아이템들을 추천해준다. 일반적으로 모든 사람에게 흥미를 유발하는 좋은 추천을 제공한다.    
P.s. 추천 시스템의 주 목적은 특정한 흥미를 가진 유저들을 대상으로 long-tail 아이템들을 활용하는 것임을 기억하라.

In [54]:
# Compute the most popular items
item_popular_df = interactions_full_df.groupby('contentId')['eventStrength'].sum().sort_values(ascending=False).reset_index()
item_popular_df.head(10)


Unnamed: 0,contentId,eventStrength
0,-4029704725707465084,307.733799
1,-6783772548752091658,233.762157
2,-133139342397538859,228.024567
3,-8208801367848627943,197.107608
4,-6843047699859121724,193.825208
5,8224860111193157980,189.04468
6,-2358756719610361882,183.110951
7,2581138407738454418,180.282876
8,7507067965574797372,179.094002
9,1469580151036142903,170.548969


In [66]:
class PopularRecommender:
    Model_Name = 'Popularity'
    def __init__(self, popular_df, items_df=None):
        self.popular_df = popular_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.Model_Name
    
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # isin -> T/F
        # ~ -> T/F 변환
        recommendations_df = self.popular_df[~self.popular_df['contentId'].isin(items_to_ignore)]\
        .sort_values('eventStrength', ascending=False).head(topn)
        
        if verbose:
            if self.items_df is None:
                raise Exception('"item_df" is required in verbose mode')
                
            recommendations_df = recommendations_df.merge(self.items_df,
                                                          how='left',
                                                         left_on='contentId',
                                                         right_on='contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]
        return recommendations_df
    
popular_model = PopularRecommender(item_popular_df, articles_df)
        
        

---

In [48]:
aa = pd.DataFrame([1,2,3])
aa[pd.Series([True,False,True])]

Unnamed: 0,0
0,1
2,3


---

In [71]:
print('Evaluating Popularity recommendation model ...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popular_model)
print('\nGlobal metrics : \n%s' % pop_global_metrics)
pop_detailed_results_df.head()

Evaluating Popularity recommendation model ...
Running evaluation for users


since Python 3.9 and will be removed in a subsequent version.
  non_interacted_items_sample = random.sample(non_interacted_items, sample_size)


100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
600 users processed
700 users processed
800 users processed
900 users processed
1000 users processed
1100 users processed
1139 users processed

Global metrics : 
{'modelName': 'Popularity', 'recall@5': 0.2476348759907952, 'recall@10': 0.3726668371260547}


Unnamed: 0,hits@5_count,hits@10_count,interacted_count,recall@5,recall@10,_personid
69,26,50,192,0.135417,0.260417,3609194402293569455
209,13,28,134,0.097015,0.208955,-2626634673110551643
46,21,30,130,0.161538,0.230769,-1032019229384696495
175,8,13,117,0.068376,0.111111,-1443636648652872475
155,19,33,88,0.215909,0.375,-2979881261169775358


---

### Content-Based Filtering model
Content-based filtering approaches leverage description or attributes from items the user has interacted to recommend similar items.   
For textual items, it is simple to use the raw text to build item profiles and user profiles.   
In this document, a very popular technique in IR named **TF-IDF** is used.

In [73]:
# Ignoring stopword from English and Portuguese
stop_list = stopwords.words('english')+stopwords.words('portuguese')

# Trains a model : composed by the main unigrams and bigrams found in corpus
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                            min_df=0.003, max_df=0.5,
                            max_features=5000, stop_words=stop_list)

item_ids = articles_df['contentId'].tolist()
tfidf_matrix = vectorizer.fit_transform(articles_df['title']+""+articles_df['text'])
tfidf_feature_names = vectorizer.get_feature_names()
tfidf_matrix





<3047x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 638928 stored elements in Compressed Sparse Row format>

To model the user profile, take all the item profiles the user has interacted and average them. The average is weighted by the interaction strength.

In [81]:
result = []
for item_id in item_ids:
    idx = item_ids.index(item_id)
    if tfidf_matrix[idx:idx+1] == [tfidf_matrix[idx]]:
        result.append(True)
    else:
        print(idx, ':',tfidf_matrix[idx:idx+1],[tfidf_matrix[idx]])
        result.append(False)

# print(result.count(False))

0 :   (0, 6)	0.023607241622865427
  (0, 929)	0.023535483028300084
  (0, 346)	0.02098202516042195
  (0, 4207)	0.02251040707046885
  (0, 4188)	0.029874684383953284
  (0, 4977)	0.0238290255469059
  (0, 2516)	0.04707096605660017
  (0, 2713)	0.023395034146893666
  (0, 2384)	0.02262744537181538
  (0, 3716)	0.020575564249489486
  (0, 2517)	0.017490920272521353
  (0, 311)	0.014681047167908063
  (0, 129)	0.018135390080053063
  (0, 1294)	0.017014866920704988
  (0, 1236)	0.023753980174264947
  (0, 4978)	0.011880184631898414
  (0, 4)	0.012132660567661872
  (0, 2600)	0.014351337666457506
  (0, 1876)	0.022687026421662575
  (0, 716)	0.019401320594457417
  (0, 3528)	0.015280617052263298
  (0, 4517)	0.013636227666283123
  (0, 4400)	0.013712882263501855
  (0, 691)	0.02430507889872227
  (0, 3724)	0.012587161968256504
  :	:
  (0, 4017)	0.016214135039471844
  (0, 3967)	0.04010389675404811
  (0, 2739)	0.058808728090076985
  (0, 3532)	0.04414935657965368
  (0, 2559)	0.02111081968406946
  (0, 4468)	0.01661561

460 :   (0, 4066)	0.07827025070678187
  (0, 273)	0.07714260236798784
  (0, 1092)	0.15576903373978943
  (0, 2788)	0.06418144032250574
  (0, 2171)	0.07127788949831296
  (0, 1071)	0.07176824702215019
  (0, 3231)	0.07307220706257055
  (0, 3320)	0.07176824702215019
  (0, 1048)	0.06748301036421296
  (0, 3677)	0.07643689354620264
  (0, 4936)	0.06824445783553847
  (0, 875)	0.04427820625646526
  (0, 2138)	0.0757640205198049
  (0, 1575)	0.062121449500993124
  (0, 1488)	0.20135042849023332
  (0, 3886)	0.06805030509953827
  (0, 2135)	0.06478420093989841
  (0, 2105)	0.05831672745283085
  (0, 694)	0.06824445783553847
  (0, 4454)	0.07643689354620264
  (0, 3422)	0.07420723319374913
  (0, 306)	0.05812126283422523
  (0, 3271)	0.04394652941798515
  (0, 4540)	0.07609654462780792
  (0, 2879)	0.06623992516643809
  :	:
  (0, 2710)	0.13261152066885443
  (0, 3297)	0.1214756431500677
  (0, 1096)	0.04072623358658885
  (0, 705)	0.11013895459517016
  (0, 1299)	0.03846328400964532
  (0, 4028)	0.04457955385738984
  

876 :   (0, 2090)	0.2098591069427435
  (0, 894)	0.1924418484126368
  (0, 3808)	0.14697612818509595
  (0, 3710)	0.14872092233043752
  (0, 551)	0.46832836846133014
  (0, 3060)	0.14749194295798387
  (0, 2012)	0.11656547873189464
  (0, 4553)	0.20024753924957223
  (0, 1407)	0.14497440800207995
  (0, 3407)	0.14290911555702726
  (0, 285)	0.1924418484126368
  (0, 1810)	0.14925892328790782
  (0, 2140)	0.0931150319687307
  (0, 3899)	0.2911541884563756
  (0, 2032)	0.3043674552887299
  (0, 4545)	0.07195751345307115
  (0, 4970)	0.14980395713663236
  (0, 1192)	0.15363182660436625
  (0, 2254)	0.10168585761008549
  (0, 2002)	0.07983726424536323
  (0, 4301)	0.10250113908834606
  (0, 887)	0.19126201667859055
  (0, 4962)	0.08713286630842634
  (0, 3297)	0.3501665565932643
  (0, 613)	0.10015717127129058
  (0, 4741)	0.074312446816393
  (0, 2712)	0.07869210918310789
  (0, 1849)	0.16316119821486652 [<1x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 28 stored elements in Compressed Sparse Row forma

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [None]:
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    


