# TF-IDF based Recommender System

### Recommender System based on tf-idf as vector representation of documents

# TF-IDF Based Recommender
1. Represent articles in terms of bag of words
2. Represent user in terms of read articles associated words
3. Generate TF-IDF matrix for user read articles and unread articles
4. Calculate cosine similarity between user read articles and unread articles 
5. Get the recommended articles 

**Describing parameters**:

*1. PATH_NEWS_ARTICLES: specify the path where news_article.csv is present*  <br/>
*2. ARTICLES_READ: List of Article_Ids read by the user*  <br/>
*3. NO_RECOMMENDED_ARTICLES: Refers to the number of recommended articles as a result*

In [74]:
PATH_NEWS_ARTICLES="/Users/sourabhrohilla/Downloads/Final/news_articles.csv"
ARTICLES_READ=[7,6,76,61,761]
NUM_RECOMMENDED_ARTICLES=5

In [61]:
try:
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
except ImportError:
    print('You are missing some packages! ' \
          'We will try installing them before continuing!')
    !pip install "numpy" "pandas" "sklearn" "nltk"
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
    print('Done!')

## 1. Represent articles in terms of bag of words

1. Reading the csv file to get the Article id, Title and News Content
2. Remove punctuation marks and other symbols from each article
3. Tokenize each article
4. Stem token of every article

In [62]:
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
news_articles.head()

Unnamed: 0,Article_Id,Title,Author,Date,Content,URL
0,0,14 dead after bus falls into canal in Telangan...,Devyani Sultania,"August 22, 2016 12:34 IST",At least 14 people died and 17 others were inj...,http://www.ibtimes.co.in/14-dead-after-bus-fal...
1,1,Pratibha Tiwari molested on busy road Saath ...,Suparno Sarkar,"August 22, 2016 19:47 IST",TV actress Pratibha Tiwari who is best known ...,
2,2,US South Korea begin joint military drill ami...,Namrata Tripathi,"August 22, 2016 18:10 IST",The United States and South Korea began a join...,http://www.ibtimes.co.in/us-south-korea-begin-...
3,3,Illegal construction in Bengaluru Will my hou...,S V Krishnamachari,"August 22, 2016 17:39 IST",The relentless drive by Bengaluru s Bangalore...,http://www.ibtimes.co.in/illegal-construction-...
4,4,Punjab Gau Rakshak Dal chief held for assaulti...,Pranshu Rathee,"August 22, 2016 17:34 IST",Punjab Gau Raksha Dal chief Satish Kumar and h...,http://www.ibtimes.co.in/punjab-gau-rakshak-da...


In [63]:
#Select relevant columns and remove rows with missing values
news_articles = news_articles[['Article_Id','Title','Content']].dropna()
#articles is a list of all articles
articles = news_articles['Content'].tolist()
articles[0] #an uncleaned article

'At least 14 people died and 17 others were injured after a bus travelling from Hyderabad to Kakinada plunged into a canal from a bridge on the accident-prone stretch of the Hyderabad-Khammam highway in Telangana early Monday morning \nThe injured were admitted to the Government General Hospital for treatment \n\n\nSeven people died on the spot and the others succumbed to injuries while undergoing treatment at the hospital  The passengers belonged to the East and West Godavari districts of Andhra Pradesh \nThe bus  owned by private operator Yatra Genie  commenced its journey from Hyderabad at 11 30 p m  on Sunday  Khammam Superintendent of Police Shah Nawaz Khan was quoted by the Hindustan Times as saying \nThe accident happened around 2 30 a m  when the driver slammed the brakes to avoid a collision with another vehicle coming from the opposite direction on a bridge over Nagarjunsagar project left canal at Nayankangudem village in Khammam district  the daily reported  The bus hit the 

In [19]:
def clean_tokenize(document):
    document = re.sub('[^\w_\s-]', ' ',document)       #remove punctuation marks and other symbols
    tokens = nltk.word_tokenize(document)              #Tokenize sentences
    cleaned_article = ' '.join([stemmer.stem(item) for item in tokens])    #Stemming each token
    return cleaned_article

In [20]:
cleaned_articles = map(clean_tokenize, articles)
cleaned_articles[0]  #a cleaned, tokenized and stemmed article 

u'at least 14 peopl die and 17 other were injur after a bus travel from hyderabad to kakinada plung into a canal from a bridg on the accident-pron stretch of the hyderabad-khammam highway in telangana earli monday morn the injur were admit to the govern general hospit for treatment seven peopl die on the spot and the other succumb to injuri while undergo treatment at the hospit the passeng belong to the east and west godavari district of andhra pradesh the bus own by privat oper yatra geni commenc it journey from hyderabad at 11 30 p m on sunday khammam superintend of polic shah nawaz khan was quot by the hindustan time as say the accid happen around 2 30 a m when the driver slam the brake to avoid a collis with anoth vehicl come from the opposit direct on a bridg over nagarjunsagar project left canal at nayankangudem villag in khammam district the daili report the bus hit the parapet wall of the bridg and nose-div into the canal the driver of the bus was appar drive at high speed due 

# 2. Represent user in terms of read articles associated words


In [84]:
#Get user representation in terms of words associated with read articles
user_articles = ' '.join(cleaned_articles[i] for i in ARTICLES_READ)

In [85]:
user_articles

u'prime minist narendra modi has express deep concern and pain at the unrest and unab violenc in kashmir modi has urg all polit parti to unanim support a perman and last solut within the framework of the constitut to the problem of jammu and kashmir prime minist modi highlight the need for dialogu for restor of normalci in the valley as the unrest that began sinc the kill of hizb-ul-mujahideen leader burhan wani on juli 8 enter the 45th day so far 68 peopl have been kill a 75-minute-long meet with a joint 20-member opposit deleg that was led by former j k chief minist omar abdullah and addit compris seven of abdullah s nation confer mlas along with congress legisl led by pcc chief g a mir and cpi m mla m y tarigami present a memorandum to prime minist modi they collect made an appeal for a polit approach to resolv the crisi in the valley and to ensur that the mistak of the past are not repeat modi appreci the construct suggest and reiter his govern s commit to the welfar and develop of

# 3. Generate TF-IDF matrix for user read articles and unread articles


In [75]:
#Generate tfidf matrix model for entire corpus
tfidf_matrix = TfidfVectorizer(stop_words='english', min_df=2)
article_tfidf_matrix = tfidf_matrix.fit_transform(cleaned_articles)
article_tfidf_matrix #tfidf vector of an article

<4831x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 468648 stored elements in Compressed Sparse Row format>

In [76]:
#Generate tfidf matrix model for read articles
user_article_tfidf_vector = tfidf_matrix.transform([user_articles])
user_article_tfidf_vector

<1x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 469 stored elements in Compressed Sparse Row format>

# 4. Calculate cosine similarity between user read articles and unread articles 



In [77]:
articles_similarity_score=cosine_similarity(article_tfidf_matrix, user_article_tfidf_vector)

In [78]:
recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]

In [79]:
recommended_articles_id

array([   6,   61,    1, ...,  439, 3643,  395])

In [80]:
#Remove read articles from recommendations
final_recommended_articles_id = [article_id for article_id in recommended_articles_id 
                                 if article_id not in ARTICLES_READ ][:NUM_RECOMMENDED_ARTICLES]

# 5. Get the recommended articles 

In [83]:
final_recommended_articles_id

[1, 72, 762, 323, 3883]

In [82]:
#Recommended Articles and their title
print 'Articles Read'
print news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'].values
print '\n'
print 'Recommender '
print news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'].values

Articles Read
[ 'Infosys shares likely to fall on Tuesday after company s client RBS scraps Williams   Glyn project'
 'Dialogue crucial in finding permanent solution to Kashmir s crisis  PM Modi'
 'Revathy to direct Queen s Tamil  Telugu remakes  Suhasini to pen dialogues in both languages'
 'When cricketer R Ashwin started fans club for Trisha Krishnan'
 ' Baahubali  to have world television premiere in Malayalam channel  VIDEO ']


Recommender 
[ '14 dead after bus falls into canal in Telangana s Khammam district  Andhra CM promises Rs 3 lakh compensation'
 'Pratibha Tiwari molested on busy road   Saath Nibhana Saathiya  actress drags accused to police station'
 'US  South Korea begin joint military drill amid nuclear threat from North Korea'
 ...,
 'Samsung Galaxy S6 Active  Water Resistant Variant Leaks On Samsung s Website  Release Date'
 'Google Rolls Out Android 5 1 Lollipop OTA  Improvements  New Features'
 'Apple Decorates Homepage with Beautiful Images Taken Using iPhones  A 