# TF-IDF based Recommender System

### Recommender System based on tf-idf as vector representation of documents

1. Represent articles in terms of bag of words
2. Represent the user in terms of the words associated with the articles they have read
3. Generate the TF-IDF matrix for the user's read articles and the unread articles
4. Calculate the cosine similarity between the user's read articles and the unread articles
5. Get the recommended articles

**Parameters**:

*1. PATH_NEWS_ARTICLES: path to the news articles CSV file*  <br/>
*2. ARTICLES_READ: list of article ids read by the user*  <br/>
*3. NUM_RECOMMENDED_ARTICLES: maximum number of articles to recommend*

In [1]:
PATH_NEWS_ARTICLES="/Users/Dell/Music/articles1.csv"
ARTICLES_READ=[2,7,8,17,18,34]
NUM_RECOMMENDED_ARTICLES=50

In [2]:
try:
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
except ImportError:
    print('You are missing some packages! '
          'We will try installing them before continuing!')
    #The PyPI package that provides the sklearn module is named scikit-learn
    !pip install "numpy" "pandas" "scikit-learn" "nltk"
    import numpy
    import pandas as pd
    import pickle as pk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import re
    from nltk.stem.snowball import SnowballStemmer
    import nltk
    stemmer = SnowballStemmer("english")
    print('Done!')

## 1. Represent articles in terms of bag of words

1. Read the CSV file to get the article id, title and news content
2. Remove punctuation marks and other symbols from each article
3. Tokenize each article
4. Stem the tokens of every article

In [3]:
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
news_articles.head()

Unnamed: 0,id,article_no,user rating,login,title,publication,author,date,year,month,url,content,Unnamed: 12
0,0,17283,1,886,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016,12,,WASHINGTON — Congressional Republicans have...,
1,1,17284,2,777,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017,6,,"After the bullet shells get counted, the blood...",
2,2,17285,0,915,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017,1,,"When Walt Disney’s “Bambi” opened in 1942, cri...",
3,3,17286,3,793,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017,4,,"Death may be the great equalizer, but it isn’t...",
4,4,17287,0,335,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017,1,,"SEOUL, South Korea — North Korea’s leader, ...",


In [4]:
#Select relevant columns and remove rows with missing values
news_articles = news_articles[['id','title','content']].dropna()
#articles is a list of all articles
articles = news_articles['content'].tolist()
articles[0] #an uncleaned article

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 

In [5]:
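def clean_tokenize(document):
    #Replace punctuation marks and other symbols with spaces (raw string avoids invalid-escape warnings)
    document = re.sub(r'[^\w\s-]', ' ', document)
    #Tokenize the text; nltk.word_tokenize needs the NLTK Punkt tokenizer data to be downloaded
    tokens = nltk.word_tokenize(document)
    #Stem each token (the Snowball stemmer also lowercases) and join back into one string
    cleaned_article = ' '.join([stemmer.stem(item) for item in tokens])
    return cleaned_article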
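
The symbol-stripping step can be tried in isolation. The following is a minimal sketch using only the standard library; it skips NLTK tokenization and stemming, and the helper name `strip_symbols` and the sample sentence are made up for illustration.

```python
import re

def strip_symbols(document):
    # Replace every character that is not a word character,
    # whitespace or a hyphen with a single space
    return re.sub(r'[^\w\s-]', ' ', document)

# Hypothetical sample text, not taken from the dataset
sample = "The administration’s authority, they fear: 'win'."

# Lowercasing and whitespace splitting stand in for the NLTK steps here
tokens = strip_symbols(sample).lower().split()
print(tokens)
```

Because the apostrophe is replaced by a space, a possessive such as "administration’s" is split into two tokens. That is why stray `s` tokens show up in the cleaned articles (for example `administr s author`).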

In [6]:
#Only the first 50 articles are cleaned and used in the rest of the notebook
cleaned_articles = list(map(clean_tokenize, articles[0:50]))
cleaned_articles  #list of cleaned, tokenized and stemmed articles

['washington congression republican have a new fear when it come to their health care lawsuit against the obama administr they might win the incom trump administr could choos to no longer defend the execut branch against the suit which challeng the administr s author to spend billion of dollar on health insur subsidi for and american hand hous republican a big victori on issu but a sudden loss of the disput subsidi could conceiv caus the health care program to implod leav million of peopl without access to health insur befor republican have prepar a replac that could lead to chao in the insur market and spur a polit backlash just as republican gain full control of the govern to stave off that outcom republican could find themselv in the awkward posit of appropri huge sum to temporarili prop up the obama health care law anger conserv voter who have been demand an end to the law for year in anoth twist donald j trump s administr worri about preserv execut branch prerog could choos to fig

## 2. Represent the user in terms of the words associated with the articles they have read


In [9]:
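#Get user representation in terms of words associated with read articles
#Note: ARTICLES_READ is used as list positions here, which assumes each article's 'id' equals its position in cleaned_articles
user_articles = ' '.join(cleaned_articles[i] for i in ARTICLES_READ)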

In [10]:
user_articles

'when walt disney s bambi open in 1942 critic prais it spare haunt visual style vast differ from anyth disney had done befor but what they did not know was that the film s strike appear had been creat by a chines immigr artist who took as his inspir the landscap paint of the song dynasti the extent of his contribut to bambi which remain a mark for film anim would not be wide known for decad like the film s titl charact the artist tyrus wong weather irrevoc separ from his mother and in the hope of make a life in america incarcer isol and rigor interrog all when he was still a child in the year that follow he endur poverti discrimin and chronic lack of recognit not onli for his work at disney but also for his fine art befor find acclaim in his 90s mr wong die on friday at 106 a hollywood studio artist painter printmak calligraph illustr and in later year maker of fantast kite he was one of the most celebr artist of the 20th centuri but becaus of the margin to which were long subject he p

## 3. Generate the TF-IDF matrix for the user's read articles and the unread articles


In [11]:
#Fit a TF-IDF vectorizer on the cleaned corpus (despite its name, tfidf_matrix is the vectorizer object)
tfidf_matrix = TfidfVectorizer(stop_words='english', min_df=2)
article_tfidf_matrix = tfidf_matrix.fit_transform(cleaned_articles)
article_tfidf_matrix #sparse TF-IDF matrix with one row per article

<50x2926 sparse matrix of type '<class 'numpy.float64'>'
	with 15553 stored elements in Compressed Sparse Row format>

In [12]:
#Transform the user's concatenated read articles into a single TF-IDF vector over the fitted vocabulary
user_article_tfidf_vector = tfidf_matrix.transform([user_articles])
user_article_tfidf_vector

<1x2926 sparse matrix of type '<class 'numpy.float64'>'
	with 1423 stored elements in Compressed Sparse Row format>

In [13]:
user_article_tfidf_vector.toarray()

array([[ 0.04243785,  0.02178228,  0.02468441, ...,  0.00773386,
         0.        ,  0.01546771]])
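
To make the numbers in these vectors less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper `tfidf_vectors` and the three toy documents are hypothetical. The sketch leaves out stop-word removal, `min_df` filtering and scikit-learn's own tokenization, so its values will not match the notebook output.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, idf, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, idf, vectors

# Toy corpus of already-stemmed text (made up for illustration)
docs = ["health care law", "health insur market", "film artist"]
vocab, idf, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

A term that appears in more documents (here `health`) gets a lower IDF than a term that appears in only one (here `film`), which is what pushes distinctive words to the front of each vector.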

## 4. Calculate the cosine similarity between the user's read articles and the unread articles



In [14]:
articles_similarity_score=cosine_similarity(article_tfidf_matrix, user_article_tfidf_vector)
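
For intuition, the score computed for each article is the cosine of the angle between its vector and the user vector. The sketch below works this out by hand in plain Python; the function `cosine` and the small dense vectors are hypothetical stand-ins for the sparse TF-IDF rows.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

# Three toy "article" vectors and one toy "user" vector
article_vectors = [[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0]]
user_vector = [1.0, 0.0, 0.0]

scores = [cosine(row, user_vector) for row in article_vectors]
print(scores)
```

When both vectors are already L2-normalized, as TF-IDF rows are by default, the denominator equals 1 and the cosine reduces to a plain dot product.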

In [15]:
#Article positions sorted from most to least similar to the user vector
recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]

In [16]:
recommended_articles_id

array([ 7,  2, 17,  8, 18, 26,  9, 38, 34, 37, 27,  3, 43, 12, 40, 29, 22,
        1, 23, 21, 30, 10, 31, 46, 44, 24,  6, 45, 42, 15, 16, 32, 48, 13,
       41, 36, 28, 39, 49, 33,  4, 11, 19, 20, 25,  0, 14, 35, 47,  5], dtype=int64)

In [17]:
#Remove read articles from recommendations
final_recommended_articles_id = [article_id for article_id in recommended_articles_id 
                                 if article_id not in ARTICLES_READ ][:NUM_RECOMMENDED_ARTICLES]
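
The rank-then-filter step can be sketched without NumPy as follows. The function name `top_unread` and the toy scores are illustrative assumptions; a set is used for the read positions so that each membership check is constant time.

```python
def top_unread(scores, read_positions, k):
    """Return up to k positions, best score first, skipping read ones."""
    read = set(read_positions)
    # Positions ordered by descending similarity score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in read][:k]

# Toy similarity scores for five articles; article 1 has been read
scores = [0.1, 0.9, 0.4, 0.8, 0.3]
print(top_unread(scores, read_positions=[1], k=2))  # [3, 2]
```

The slice only caps the result: with 50 candidate articles and 6 already read, at most 44 positions can come back even though `NUM_RECOMMENDED_ARTICLES` is 50.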

## 5. Get the recommended articles

In [18]:
final_recommended_articles_id

[26,
 9,
 38,
 37,
 27,
 3,
 43,
 12,
 40,
 29,
 22,
 1,
 23,
 21,
 30,
 10,
 31,
 46,
 44,
 24,
 6,
 45,
 42,
 15,
 16,
 32,
 48,
 13,
 41,
 36,
 28,
 39,
 49,
 33,
 4,
 11,
 19,
 20,
 25,
 0,
 14,
 35,
 47,
 5]
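
The list above is ordered by similarity, so a title lookup should keep that order if the ranking is meant to be visible. A minimal sketch with a toy frame (the titles and the `ranked_ids` list are hypothetical) compares a boolean `isin` filter, which returns rows in DataFrame order, with a label-based `.loc` lookup, which follows the order of the list it is given.

```python
import pandas as pd

# Toy stand-in for news_articles with made-up titles
toy_articles = pd.DataFrame({"id": [0, 1, 2, 3],
                             "title": ["A", "B", "C", "D"]})
ranked_ids = [3, 0, 2]  # hypothetical ranking, best match first

# Boolean mask: rows come back in DataFrame order, so the ranking is lost
by_mask = toy_articles.loc[toy_articles["id"].isin(ranked_ids), "title"]

# Label lookup on an id index: rows come back in ranked order
by_rank = toy_articles.set_index("id").loc[ranked_ids, "title"]

print(by_mask.tolist())
print(by_rank.tolist())
```

The label lookup assumes every id in the list exists in the frame; otherwise `.loc` raises a `KeyError`.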

In [19]:
#Read and recommended articles with their titles
#Note: isin() returns rows in DataFrame order, so the titles are listed by id rather than by similarity rank
print ('Articles Read')
print (news_articles.loc[news_articles['id'].isin(ARTICLES_READ)]['title'])
print ('\n')
print ('Recommender ')
print (news_articles.loc[news_articles['id'].isin(final_recommended_articles_id)]['title'])

Articles Read
2     Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...
7     After ‘The Biggest Loser,’ Their Bodies Fought...
8     First, a Mixtape. Then a Romance. - The New Yo...
17    Modi’s Cash Ban Brings Pain, but Corruption-We...
18    Suicide Bombing in Baghdad Kills at Least 36 -...
34    Riot by Drug Gangs in Brazil Prison Leaves at ...
Name: title, dtype: object


Recommender 
0     House Republicans Fret About Winning Their Hea...
1     Rift Between Officers and Residents as Killing...
3     Among Deaths in 2016, a Heavy Toll in Pop Musi...
4     Kim Jong-un Says North Korea Is Preparing to T...
5     Sick With a Cold, Queen Elizabeth Misses New Y...
6     Taiwan’s President Accuses China of Renewed In...
9     Calling on Angels While Enduring the Trials of...
10    Weak Federal Powers Could Limit Trump’s Climat...
11    Can Carbon Capture Technology Prosper Under Tr...
12    Mar-a-Lago, the Future Winter White House and ...
13    How to form healthy habits in your 20s - T