# Modeling: content-based filtering
Here we use the dot product as the similarity measure (the alternatives considered were cosine similarity and Euclidean distance).
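
To make the choice concrete, here is a minimal sketch comparing the three measures on two toy vectors. The vectors `a` and `b` and the helper function names are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def dot_similarity(u, v):
    """Dot product: grows with both alignment and vector norms."""
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    """Dot product of the normalised vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(u - v))

# Toy vectors (hypothetical values): b has the same direction as a but twice the norm
a = np.array([1.0, 2.0, 2.0])
b = 2 * a

print(dot_similarity(a, b))      # 18.0
print(cosine_similarity(a, b))   # 1.0 (up to floating-point rounding)
print(euclidean_distance(a, b))  # 3.0
```

Unlike cosine similarity, the dot product depends on the norms of the vectors, so the two measures rank candidates identically only when all vectors have the same norm.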

For the content, we use the article category encoded as a one-hot vector, concatenated with the word count.

For a user, we use the most clicked category together with the mean length of the clicked articles.
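
The two feature definitions above can be sketched on toy data as follows. Everything here is an assumption for illustration: the `toy_articles` values, the `words_count` column name, the list of clicked articles, and the choice to scale the word count to [0, 1] so that it does not dominate the dot product.

```python
import numpy as np
import pandas as pd

# Toy article metadata (hypothetical values; `words_count` is an assumed column name)
toy_articles = pd.DataFrame({
    "article_id": [0, 1, 2, 3],
    "category_id": [281, 281, 186, 375],
    "words_count": [150, 250, 200, 400],
})
max_words = toy_articles["words_count"].max()

# Item vectors: one-hot category concatenated with the scaled word count
one_hot = pd.get_dummies(toy_articles["category_id"]).astype(float)
scaled_words = (toy_articles["words_count"] / max_words).to_numpy()
item_vectors = np.column_stack([one_hot.to_numpy(), scaled_words])

# User profile: most clicked category and mean length of the clicked articles
clicked_ids = [0, 1, 2]
clicked = toy_articles[toy_articles["article_id"].isin(clicked_ids)]
top_category = clicked["category_id"].value_counts().idxmax()
mean_length = clicked["words_count"].mean() / max_words
user_vector = np.append((one_hot.columns == top_category).astype(float), mean_length)

# Dot-product score of every article against the user profile
scores = item_vectors @ user_vector
print(top_category)
print(scores.round(3))
```

With a dot product, the length feature rewards longer articles rather than articles whose length is close to the user's mean, which is one reason to compare it against cosine similarity or Euclidean distance.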

In [1]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

## Reading the articles

In [13]:
articles_df = pd.read_csv('data/articles_metadata.csv')
# Load the precomputed article embeddings; the context manager closes the file
with open("data/articles_embeddings.pickle", "rb") as f:
    articles_emb = pickle.load(f)


## Finding the categories of clicked articles

In [3]:
# Read the 385 hourly click files and concatenate them once at the end
# (DataFrame.append was removed in pandas 2.0)
hourly_frames = []
for i in range(385):
    index = str(i).zfill(3)
    if i % 10 == 0:
        print("Reading file index :", index)
    hourly_frames.append(pd.read_csv('data/clicks/clicks_hour_' + index + '.csv'))
clicks_by_hour_df = pd.concat(hourly_frames)

Reading file index : 000
Reading file index : 010
Reading file index : 020
Reading file index : 030
Reading file index : 040
Reading file index : 050
Reading file index : 060
Reading file index : 070
Reading file index : 080
Reading file index : 090
Reading file index : 100
Reading file index : 110
Reading file index : 120
Reading file index : 130
Reading file index : 140
Reading file index : 150
Reading file index : 160
Reading file index : 170
Reading file index : 180
Reading file index : 190
Reading file index : 200
Reading file index : 210
Reading file index : 220
Reading file index : 230
Reading file index : 240
Reading file index : 250
Reading file index : 260
Reading file index : 270
Reading file index : 280
Reading file index : 290
Reading file index : 300
Reading file index : 310
Reading file index : 320
Reading file index : 330
Reading file index : 340
Reading file index : 350
Reading file index : 360
Reading file index : 370
Reading file index : 380


In [4]:
clicks_by_hour_df

Unnamed: 0,user_id,session_id,session_start,session_size,click_article_id,click_timestamp,click_environment,click_deviceGroup,click_os,click_country,click_region,click_referrer_type
0,0,1506825423271737,1506825423000,2,157541,1506826828020,4,3,20,1,20,2
1,0,1506825423271737,1506825423000,2,68866,1506826858020,4,3,20,1,20,2
2,1,1506825426267738,1506825426000,2,235840,1506827017951,4,1,17,1,16,2
3,1,1506825426267738,1506825426000,2,96663,1506827047951,4,1,17,1,16,2
4,2,1506825435299739,1506825435000,2,119592,1506827090575,4,1,17,1,24,2
...,...,...,...,...,...,...,...,...,...,...,...,...
2564,10051,1508211372158328,1508211372000,2,84911,1508211557302,4,3,2,1,25,1
2565,322896,1508211376302329,1508211376000,2,30760,1508211672520,4,1,17,1,25,2
2566,322896,1508211376302329,1508211376000,2,157507,1508211702520,4,1,17,1,25,2
2567,123718,1508211379189330,1508211379000,2,234481,1508211513583,4,3,2,1,25,2


We retrieve the articles clicked by the user.

In [5]:
user_id = 0

In [6]:
article_interest_df = clicks_by_hour_df[clicks_by_hour_df.user_id == user_id]['click_article_id']

In [7]:
article_interest_df

0       157541
1        68866
7412     96755
7413    313996
4881    160158
4882    233470
1811     87224
1812     87205
Name: click_article_id, dtype: object

Let's look up the categories of these articles.

In [11]:
articles_categories = articles_df[articles_df.article_id.isin(article_interest_df)].category_id
articles_categories

68866     136
87205     186
87224     186
96755     209
157541    281
160158    281
233470    375
313996    431
Name: category_id, dtype: int64

In [17]:
category_freqs = articles_df[articles_df.article_id.isin(article_interest_df)].category_id.value_counts()
category_freqs

281    2
186    2
375    1
431    1
209    1
136    1
Name: category_id, dtype: int64
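
The counts above show a tie between categories 281 and 186. The following sketch rebuilds those counts as a standalone Series and lists every category sharing the maximum count; taking only the first index entry keeps just one of them. The name `top_categories` is illustrative.

```python
import pandas as pd

# Category counts copied from the output above
category_freqs = pd.Series(
    [2, 2, 1, 1, 1, 1],
    index=[281, 186, 375, 431, 209, 136],
    name="category_id",
)

# All categories that share the highest click count
top_categories = category_freqs[category_freqs == category_freqs.max()].index.tolist()
print(top_categories)  # [281, 186]
```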

In [85]:
def get_recommendations(user_id, articles_df, clicks_by_hour_df, articles_emb, top_k=5):
    # Articles clicked by the user and their categories
    article_interest_df = clicks_by_hour_df[clicks_by_hour_df.user_id == user_id]['click_article_id']
    articles_categories = articles_df[articles_df.article_id.isin(article_interest_df)].category_id
    category_freqs = articles_categories.value_counts()

    # Most clicked category (value_counts sorts by descending frequency)
    cat = category_freqs.index[0]
    print("Selected category :", cat)
    # Assumption: the index of articles_df equals article_id, which is also
    # the row position of the article in articles_emb
    exclude_list = articles_categories[articles_categories == cat].index.to_numpy()
    selected_article = exclude_list[0]
    print("Exclude list :", exclude_list)
    print("Selected article :", selected_article)
    current_emb = articles_emb[selected_article]
    print("Article embedding :", current_emb)

    # Dot-product similarity between the reference article and every article
    similarities = np.dot(current_emb, np.transpose(articles_emb))
    print("Similarities shape", similarities.shape)
    # Take enough candidates, best first, so that top_k remain after exclusion
    to_retrieve = top_k + len(exclude_list)
    selected = similarities.argsort()[::-1][:to_retrieve]
    print("Selected articles :", selected)
    excluded = set(exclude_list.tolist())
    filtered = [int(a) for a in selected if int(a) not in excluded][:top_k]
    print("Filtered :", filtered)

    return filtered

In [86]:
user_id = 0

rec = get_recommendations(user_id,articles_df,clicks_by_hour_df,articles_emb, top_k=5)

print(rec)

Selected category : 281
Exclude list : [157541 160158]
Selected article : 157541
Article embedding : [ 0.04563604 -0.9817215  -0.3488117   0.13537404  0.18759827  0.46154243
 -0.7924494   0.5103028   0.03752192 -0.04350547  0.7562447  -0.3990103
 -0.765851   -0.2971812   0.44985953 -0.31984672 -0.01483636 -0.19543175
 -0.68874836 -0.7412663   0.5747204  -0.18989867  0.27447984  0.44798225
 -0.48008433 -0.715874    0.6232814   0.41901016 -0.8594551   0.6883967
  0.8074732   0.6078273   0.77360547 -0.8022254  -0.82356477 -0.17657441
  0.5343014   0.37477484 -0.669281   -0.72143614 -0.07189102  0.10664697
  0.87900984 -0.9698581  -0.7696766   0.5552143  -0.6911229  -0.17906854
 -0.04259193 -0.58978194 -0.04458507  0.1946762  -0.01661651 -0.6879604
  0.37235388  0.7880901  -0.15220796  0.43662453  0.9619818   0.0172913
  0.9440444   0.84787196 -0.52713937 -0.27290303 -0.2106504   0.41728926
  0.82955945  0.60010344 -0.33098266  0.02484715  0.15784632 -0.32729748
  0.6672566  -0.7081773   0