# Topic Based Recommender

# Topic Based Recommender
1. Represent articles in terms of Topic Vector
2. Represent user in terms of Topic Vector of read articles
3. Calculate cosine similarity between read and unread articles 
4. Get the recommended articles 

**Describing parameters**:

*1. PATH_ARTICLE_TOPIC_DISTRIBUTION: specify the path where 'ARTICLE_TOPIC_DISTRIBUTION.csv' is present.* <br/>
*2. PATH_NEWS_ARTICLES: specify the path where news_article.csv is present*  <br/>
*3. NO_OF_TOPIC: Number of topics specified when training your topic model. This would refer to the dimension of        each vector representing an article*  <br/>
*4. ARTICLES_READ: List of Article_Ids read by the user*  <br/>
*5. NO_RECOMMENDED_ARTICLES: Refers to the number of recommended articles as a result*

In [4]:
PATH_ARTICLE_TOPIC_DISTRIBUTION = "/Users/sourabhrohilla/Downloads/Final/python/Topic Model/model/Article_Topic_Distribution.csv"
PATH_NEWS_ARTICLES = "/Users/sourabhrohilla/Downloads/Final/news_articles.csv"
NO_OF_TOPICS=150
ARTICLES_READ=[2,7]
NUM_RECOMMENDED_ARTICLES=5

In [5]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## 1. Represent Read Article in terms of Topic Vector

In [6]:
article_topic_distribution = pd.read_csv(PATH_ARTICLE_TOPIC_DISTRIBUTION)
article_topic_distribution.shape

(22186, 3)

In [8]:
article_topic_distribution.head()

Unnamed: 0,Article_Id,Topic_Id,Topic_Weight
0,0,25,0.324485
1,0,27,0.131476
2,0,127,0.53594
3,1,5,0.306691
4,1,47,0.277037


***Generate Article-Topic Distribution matrix ***

In [19]:
#Pivot the dataframe
article_topic_pivot = article_topic_distribution.pivot(index='Article_Id', columns='Topic_Id', values='Topic_Weight')
#Fill NaN with 0
article_topic_pivot.fillna(value=0, inplace=True)
#Get the values in dataframe as matrix
articles_topic_matrix = article_topic_pivot.values
articles_topic_matrix.shape

(4831, 150)

In [20]:
article_topic_pivot.head()

Topic_Id,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
Article_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.306691,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.015589,0.0,0.077002,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.396528,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2. Represent user in terms of Topic Vector of read articles


***A user vector is represented in terms of average of read articles topic vector***

In [21]:
#Select user in terms of read article topic distribution
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix=articles_topic_matrix[row_idx[:, None]]
#Calculate the average of read articles topic vector 
user_vector = np.mean(read_articles_topic_matrix, axis=0)
user_vector.shape

(1, 150)

## 3. Calculate cosine similarity between read and unread articles 

In [22]:
def calculate_cosine_similarity(articles_topic_matrix, user_vector):
    articles_similarity_score=cosine_similarity(articles_topic_matrix, user_vector)
    recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]
    #Remove read articles from recommendations
    final_recommended_articles_id = [article_id for article_id in recommended_articles_id 
                                     if article_id not in ARTICLES_READ ][:NUM_RECOMMENDED_ARTICLES]
    return final_recommended_articles_id

In [24]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id

[2843, 3419, 2760, 3123, 3307]

# 4. Recommendation Using Topic Model:-

In [26]:
#Recommended Articles and their title
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print 'Articles Read'
print news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'].values
print '\n'
print 'Recommender '
print news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'].values

Articles Read
[ 'US  South Korea begin joint military drill amid nuclear threat from North Korea'
 'Dialogue crucial in finding permanent solution to Kashmir s crisis  PM Modi']


Recommender 
[ 'Rajnath Singh s security is Pak s responsibility during SAARC visit  says Rijiju after JuD  Hizbul threats'
 'Siachen avalanche  Indian Army says missing soldiers presumed dead'
 'Military Plane Crashes Outside Seville Airport in Spain'
 'Europe survives  year of hell   but worse expected to come in 2016'
 'Jammu   Kashmir  Army Indicts 9 Soldiers for Killing 2 Kashmiri Youths in Budgam']


# Topics + NER Recommender

# Topic + NER Based Recommender

1. Represent user in terms of - <br/>
        (Alpha) <Topic Vector> + (1-Alpha) <NER Vector> <br/>
   where <br/>
   Alpha => [0,1] <br/>
   [Topic Vector] => Topic vector representation of concatenated read articles <br/>
   [NER Vector]   => Topic vector representation of NERs associated with concatenated read articles <br/>
2. Calculate cosine similarity between user vector and articles Topic matrix
3. Get the recommended articles 

In [27]:
ALPHA = 0.01
DICTIONARY_PATH = "/Users/sourabhrohilla/Downloads/Final/python/Topic Model/model/dictionary_of_words.p"
LDA_MODEL_PATH = "/Users/sourabhrohilla/Downloads/Final/python/Topic Model/model/lda.model"

In [28]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer
import pickle
import gensim
from gensim import corpora, models

ImportError: No module named gensim

# 1. Represent User in terms of Topic Distribution and NER

1. Represent user in terms of read article topic distribution
2. Represent user in terms of NERs associated with read articles
        2.1 Get NERs of read articles
        2.2 Load LDA model
        2.3 Get topic distribution for the concated NERs
3. Generate user vector

## 1.1. Represent user in terms of read article topic distribution

In [131]:
row_idx = np.array(ARTICLES_READ)
read_articles_topic_matrix=articles_topic_matrix[row_idx[:, None]]
#Calculate the average of read articles topic vector 
user_topic_vector = np.mean(read_articles_topic_matrix, axis=0)
user_topic_vector.shape

(1, 150)

## 1.2. Represent user in terms of NERs associated with read articles

In [132]:
# Get NERs of read articles
def get_ner(article):
    ne_tree = ne_chunk(pos_tag(word_tokenize(article)))
    iob_tagged = tree2conlltags(ne_tree)
    ner_token = ' '.join([token for token,pos,ner_tag in iob_tagged if not ner_tag==u'O']) #Discarding tokens with 'Other' tag
    return ner_token

In [133]:
articles = news_articles['Content'].tolist()
user_articles_ner = ' '.join([get_ner(articles[i]) for i in ARTICLES_READ])
print "NERs of Read Article =>", user_articles_ner

NERs of Read Article => United States South Korea North United Nations Security Council North Korea UN North Korea South Korea Ulchi Freedom Guardian Command North Korean Korean People Army Ulji Freedom Guardian KPA South Korea North London Seoul Kim Jong Un North Korean Narendra Modi Kashmir Modi Jammu Kashmir Modi Burhan Wani Omar Abdullah Abdullah National Conference Congress PCC CPI Tarigami Valley Modi Kashmir Jammu Kashmir


In [134]:
stop_words = set(stopwords.words('english'))
tknzr = TweetTokenizer()
stemmer = SnowballStemmer("english")

In [135]:
def clean_text(text):
    cleaned_text=re.sub('[^\w_\s-]', ' ', text)                                            #remove punctuation marks 
    return cleaned_text                                                                    #and other symbols 

def tokenize(text):
    word = tknzr.tokenize(text)                                                             #tokenization
    filtered_sentence = [w for w in word if not w.lower() in stop_words]                    #removing stop words
    stemmed_filtered_tokens = [stemmer.stem(plural) for plural in filtered_sentence]        #stemming
    tokens = [i for i in stemmed_filtered_tokens if i.isalpha() and len(i) not in [0, 1]]
    return tokens

In [136]:
#Cleaning the article
cleaned_text = clean_text(user_articles_ner)
article_vocabulary = tokenize(cleaned_text)

In [137]:
#Load model dictionary
model_dictionary = pickle.load(open(DICTIONARY_PATH,"rb"))
#Generate article maping using IDs associated with vocab
corpus = [model_dictionary.doc2bow(text) for text in [article_vocabulary]]

In [138]:
#Load LDA Model
lda =  models.LdaModel.load(LDA_MODEL_PATH)

In [139]:
# Get topic distribution for the concated NERs
article_topic_distribution=lda.get_document_topics(corpus[0])
article_topic_distribution

[(29, 0.12263313269087657),
 (62, 0.050353370951179081),
 (84, 0.15588838753218867),
 (127, 0.36080067044623093),
 (135, 0.29498052303560879)]

In [140]:
user_ner_vector =[0]*NO_OF_TOPICS
for topic_id, topic_weight in article_topic_distribution:
    user_ner_vector[topic_id]=topic_weight
len(user_ner_vector)

150

## 1.3. Generate user vector

In [141]:
# User_Vector =>  (Alpha) [Topic Vector] + (1-Alpha) [NER Vector] 
alpha_topic_vector = [topic_weight*ALPHA for topic_weight in user_topic_vector]
alpha_ner_vector = [ner*(1-ALPHA) for ner in user_ner_vector]

user_vector = np.sum(zip(alpha_topic_vector,alpha_ner_vector))
user_vector.shape

(150,)

# 2. Calculate cosine similarity between user vector and articles Topic matrix

In [142]:
recommended_articles_id = calculate_cosine_similarity(articles_topic_matrix, user_vector)
recommended_articles_id



[2843, 3419, 2760, 3123, 3307]

# 3. Get recommended articles

In [143]:
#Recommended Articles and their title
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
print 'Articles Read'
print news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title']
print '\n'
print 'Recommender '
print news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title']

Articles Read
2    US  South Korea begin joint military drill ami...
7    Dialogue crucial in finding permanent solution...
Name: Title, dtype: object


Recommender 
2760    Rajnath Singh s security is Pak s responsibili...
2843    Siachen avalanche  Indian Army says missing so...
3123    Military Plane Crashes Outside Seville Airport...
3307    Europe survives  year of hell   but worse expe...
3419    Jammu   Kashmir  Army Indicts 9 Soldiers for K...
Name: Title, dtype: object
