# Named Entity based Recommender System

### Recommender System based on Named Entities as representation of documents

# Named Entity Based Recommender
1. Represent each article as a bag of words
2. Represent the user by the named-entity (NER) words of the articles they have read
3. Generate a TF-IDF matrix covering the articles and the user profile
4. Calculate the cosine similarity between the user profile and the unread articles
5. Return the most similar unread articles as recommendations

**Describing parameters**:

*1. PATH_NEWS_ARTICLES: path to the news_articles.csv file*  <br/>
*2. ARTICLES_READ: list of Article_Ids the user has read*  <br/>
*3. NUM_RECOMMENDED_ARTICLES: number of articles to return as recommendations*

In [1]:
PATH_NEWS_ARTICLES="/home/phoenix/Documents/HandsOn/news_articles.csv"
ARTICLES_READ=[4,5,7,8]
NUM_RECOMMENDED_ARTICLES=5

In [2]:
import pandas as pd
import pickle as pk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.stem.snowball import SnowballStemmer
import nltk
import numpy
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
stemmer = SnowballStemmer("english")

## 1. Represent articles in terms of bag of words


In [3]:
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
news_articles.head()

| Unnamed: 0 | Article_Id | Title | Author | Date | Content | URL |
|---|---|---|---|---|---|---|
| 0 | 0 | 14 dead after bus falls into canal in Telangan... | Devyani Sultania | August 22, 2016 12:34 IST | At least 14 people died and 17 others were inj... | http://www.ibtimes.co.in/14-dead-after-bus-fal... |
| 1 | 1 | Pratibha Tiwari molested on busy road Saath ... | Suparno Sarkar | August 22, 2016 19:47 IST | TV actress Pratibha Tiwari who is best known ... | |
| 2 | 2 | US South Korea begin joint military drill ami... | Namrata Tripathi | August 22, 2016 18:10 IST | The United States and South Korea began a join... | http://www.ibtimes.co.in/us-south-korea-begin-... |
| 3 | 3 | Illegal construction in Bengaluru Will my hou... | S V Krishnamachari | August 22, 2016 17:39 IST | The relentless drive by Bengaluru s Bangalore... | http://www.ibtimes.co.in/illegal-construction-... |
| 4 | 4 | Punjab Gau Rakshak Dal chief held for assaulti... | Pranshu Rathee | August 22, 2016 17:34 IST | Punjab Gau Raksha Dal chief Satish Kumar and h... | http://www.ibtimes.co.in/punjab-gau-rakshak-da... |


In [4]:
#Select relevant columns and remove rows with missing values
news_articles = news_articles[['Article_Id','Title','Content']].dropna()
#articles is a list of all articles
articles = news_articles['Content'].tolist()
articles[0] #an uncleaned article

'At least 14 people died and 17 others were injured after a bus travelling from Hyderabad to Kakinada plunged into a canal from a bridge on the accident-prone stretch of the Hyderabad-Khammam highway in Telangana early Monday morning \nThe injured were admitted to the Government General Hospital for treatment \n\n\nSeven people died on the spot and the others succumbed to injuries while undergoing treatment at the hospital  The passengers belonged to the East and West Godavari districts of Andhra Pradesh \nThe bus  owned by private operator Yatra Genie  commenced its journey from Hyderabad at 11 30 p m  on Sunday  Khammam Superintendent of Police Shah Nawaz Khan was quoted by the Hindustan Times as saying \nThe accident happened around 2 30 a m  when the driver slammed the brakes to avoid a collision with another vehicle coming from the opposite direction on a bridge over Nagarjunsagar project left canal at Nayankangudem village in Khammam district  the daily reported  The bus hit the 

In [6]:
def clean_tokenize(document):
    #Replace punctuation and other symbols with spaces (word characters, whitespace and hyphens are kept)
    document = re.sub(r'[^\w\s-]', ' ', document)
    tokens = nltk.word_tokenize(document)              #Split the text into word tokens
    #Stem each token (the Snowball stemmer also lowercases) and join back into one string
    cleaned_article = ' '.join([stemmer.stem(item) for item in tokens])
    return cleaned_article
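
To make the symbol-stripping step in `clean_tokenize` concrete, here is a minimal sketch that applies only the regular expression, without NLTK tokenizing or stemming. The helper name `strip_symbols` and the sample sentence are illustrative assumptions and are not part of the notebook.

```python
import re

def strip_symbols(text):
    # Hypothetical helper: replace every character that is not a word
    # character, whitespace or a hyphen with a space
    no_symbols = re.sub(r'[^\w\s-]', ' ', text)
    # Collapse the runs of whitespace left behind by the substitution
    return ' '.join(no_symbols.split())

sample = "Hyderabad-Khammam highway, at 11:30 p.m.!"
print(strip_symbols(sample))  # Hyderabad-Khammam highway at 11 30 p m
```

Hyphenated names survive intact, while times such as `11:30 p.m.` break into separate tokens. This matches how `11 30 p m` appears in the article text shown above.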

In [7]:
#Build a list (not a lazy map object) so that the result can be indexed
cleaned_articles = [clean_tokenize(article) for article in articles]
cleaned_articles[0]  #a cleaned, tokenized and stemmed article 

u'at least 14 peopl die and 17 other were injur after a bus travel from hyderabad to kakinada plung into a canal from a bridg on the accident-pron stretch of the hyderabad-khammam highway in telangana earli monday morn the injur were admit to the govern general hospit for treatment seven peopl die on the spot and the other succumb to injuri while undergo treatment at the hospit the passeng belong to the east and west godavari district of andhra pradesh the bus own by privat oper yatra geni commenc it journey from hyderabad at 11 30 p m on sunday khammam superintend of polic shah nawaz khan was quot by the hindustan time as say the accid happen around 2 30 a m when the driver slam the brake to avoid a collis with anoth vehicl come from the opposit direct on a bridg over nagarjunsagar project left canal at nayankangudem villag in khammam district the daili report the bus hit the parapet wall of the bridg and nose-div into the canal the driver of the bus was appar drive at high speed due 

## 2. Represent user in terms of NER words associated with read articles 


In [None]:
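def NERwords(art):
    #Tokenize, POS-tag and chunk the article into a named-entity tree
    ne_tree = ne_chunk(pos_tag(word_tokenize(art)))
    #Flatten the tree into (word, POS tag, IOB tag) triples
    iob_tagged = tree2conlltags(ne_tree)
    #Keep only the words that belong to a named entity (IOB tag other than 'O')
    NERwrds = ' '.join([str(n[0]) for n in iob_tagged if n[2] != 'O'])
    return NERwrds

def user_artNER(article_ids,doc):
    #Build the user profile from the NER words of every read article
    sen = ' '.join([NERwords(doc[int(i)]) for i in article_ids])
    #Return a new list with the profile as the last element, so that re-running
    #the cell does not keep appending profiles to the original list
    return doc + [sen]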
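
The filter inside `NERwords` can be illustrated without downloading the NLTK models. The sketch below uses hand-written `(word, POS tag, IOB tag)` triples in the shape that `tree2conlltags` returns. The tags are illustrative assumptions, not real NLTK output, and `entity_words` is a hypothetical helper name.

```python
# Hand-written triples in the (word, POS tag, IOB tag) shape; not real NLTK output
iob_tagged = [
    ('Shah', 'NNP', 'B-PERSON'),
    ('Nawaz', 'NNP', 'I-PERSON'),
    ('Khan', 'NNP', 'I-PERSON'),
    ('was', 'VBD', 'O'),
    ('quoted', 'VBN', 'O'),
    ('in', 'IN', 'O'),
    ('Telangana', 'NNP', 'B-GPE'),
]

def entity_words(tagged):
    # Keep only tokens inside a named-entity chunk (IOB tag other than 'O')
    return ' '.join(word for word, pos, iob in tagged if iob != 'O')

print(entity_words(iob_tagged))  # Shah Nawaz Khan Telangana
```

Joining these strings across all read articles gives the single pseudo-document that stands in for the user.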

In [14]:
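def tokenizer_func(document):
    #Assumption: tokenizer_func is not defined in the cells shown here, so this
    #stand-in reuses clean_tokenize and splits the cleaned text on whitespace
    return clean_tokenize(document).split()

def vectorizing(documents):
    #Build the TF-IDF matrix; min_df=2 drops terms that occur in only one document
    vectorizer = TfidfVectorizer(stop_words='english', min_df=2,
                                 tokenizer=tokenizer_func)
    X_data = vectorizer.fit_transform(documents)
    return X_data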
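
To show what the TF-IDF weights look like, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed idf, L2-normalised rows), combined with the `min_df=2` vocabulary filter. The three toy documents and all names in the block are assumptions for illustration. Stop-word removal and custom tokenization are left out.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "bus canal hyderabad",
    "bus highway police",
    "hyderabad police police",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# min_df=2 keeps only terms that appear in at least two documents
vocab = sorted(t for t, count in df.items() if count >= 2)

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Smoothed idf, ln((1 + n) / (1 + df)) + 1, weighted by the raw term count
    weights = [counts[t] * (math.log((1.0 + n_docs) / (1 + df[t])) + 1)
               for t in vocab]
    # L2-normalise the row; an all-zero row is left as zeros
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)  # ['bus', 'hyderabad', 'police']
```

`canal` and `highway` each occur in a single document, so `min_df=2` removes them from the vocabulary. On a small corpus this filter can discard rare entity names, which is worth keeping in mind when the user profile consists of named entities.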

## Similarity match

Calculate the cosine similarity between the user profile (the NER words of the read articles) and every article, then rank the unread articles by that score.
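
For reference, the score can be written as follows. The symbols are notation introduced here: $u$ is the TF-IDF vector of the user profile and $d_i$ is the TF-IDF vector of article $i$.

$$
\cos(u, d_i) = \frac{u \cdot d_i}{\lVert u \rVert \, \lVert d_i \rVert}
$$

`TfidfVectorizer` L2-normalises each row by default, so with the settings used in this notebook the score reduces to the dot product $u \cdot d_i$.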

In [16]:
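def similarity(articles_id,tf_idf):
    #Function to calculate cosine similarity
    #The last row of tf_idf is the user profile; compare it with the article rows only,
    #otherwise the profile matches itself with a score of 1 and gets recommended
    cs_matrix=cosine_similarity(tf_idf[-1], tf_idf[:-1])
    #Flatten the 1 x n result and sort the row indices by decreasing similarity
    recommended_articles_id = cs_matrix.ravel().argsort()[::-1]
    #Remove read articles from recommendations
    #(row indices are treated as Article_Ids, as in user_artNER)
    final_recommended_articles_id = [art_id for art_id in recommended_articles_id 
                                     if art_id not in articles_id ][:NUM_RECOMMENDED_ARTICLES]
    return final_recommended_articles_id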
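
The ranking logic can be sketched on plain Python lists: score every article against the profile, sort by decreasing score, drop the read ids, and keep the top `k`. The vectors, the function names `cosine` and `recommend`, and the parameter `k` are assumptions for illustration. The notebook itself relies on `cosine_similarity` from scikit-learn.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(profile, article_vectors, read_ids, k):
    scores = [cosine(profile, vec) for vec in article_vectors]
    # Article indices sorted from most to least similar
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Skip articles the user has already read and keep the top k
    return [i for i in ranked if i not in read_ids][:k]

# Toy vectors (made up); the list index plays the role of the article id
article_vectors = [
    [1.0, 0.0, 0.0],   # article 0
    [0.8, 0.6, 0.0],   # article 1
    [0.6, 0.8, 0.0],   # article 2
    [0.0, 0.0, 1.0],   # article 3
]
profile = [1.0, 0.0, 0.0]
print(recommend(profile, article_vectors, read_ids={0}, k=2))  # [1, 2]
```

Article 0 scores highest but is excluded because it has been read, so the two next-best articles are returned.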

In [18]:
#Reading the documents (read-only binary mode; the file is closed automatically)
with open('artlist.pkl', 'rb') as f:
    documents = pk.load(f)
articles_ids = [2,3]

In [19]:
#For combining the articles
new_art = user_artNER(articles_ids,documents)
#For calculating the tf-idf vector
tf_idf = vectorizing(new_art)
#Recommendations
recommendations = similarity(articles_ids,tf_idf)

## Results

In [None]:
#Recommended Articles and their title
df_news = pd.read_csv(PATH_NEWS_ARTICLES)
print('Articles Read')
print(df_news.loc[df_news['Article_Id'].isin(articles_ids)]['Title'])
print('\n')
print('Recommended Articles')
print(df_news.loc[df_news['Article_Id'].isin(recommendations)]['Title'])
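
One detail of the lookup above: a membership filter such as `isin` returns rows in table order, not in ranking order. The sketch below shows the difference on a toy mapping. The titles and the ranked ids are placeholders, not rows from news_articles.csv or real recommender output.

```python
# Toy id -> title mapping; the titles are placeholders
titles = {0: 'Title A', 1: 'Title B', 2: 'Title C', 3: 'Title D'}
recommendations = [3, 1]   # hypothetical ranked output, best match first

# A membership filter walks the table in its own order (id 1, then id 3)
table_order = [titles[i] for i in sorted(titles) if i in recommendations]

# Looking up each id in turn keeps the ranking (id 3, then id 1)
ranked_titles = [titles[i] for i in recommendations]

print(table_order)    # ['Title B', 'Title D']
print(ranked_titles)  # ['Title D', 'Title B']
```

With pandas, one way to keep the ranking is to index the frame by `Article_Id` and select the ids in order with `.loc`.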