## ref : https://www.kaggle.com/vikashrajluhaniwal/recommending-news-articles-based-on-read-articles
<br><br>
## Table of Contents
### 1. Importing necessary Libraries<br>

### 2. Loading Data<br>

### 3. Data Preprocessing
#### 3.a Fetching only the articles from 2018
#### 3.b Removing all the short headline articles
#### 3.c Checking and removing all the duplicates
#### 3.d Checking for missing values<br>

### 4. Text Preprocessing
#### 4.a Stopwords removal
#### 4.b Lemmatization<br>

### 5. Headline based similarity on new articles
#### 5.a Using Bag of Words method
#### 5.b Using TF-IDF method
#### 5.c Using Word2Vec embedding
#### 5.d Weighted similarity based on headline and category
#### 5.e Weighted similarity based on headline, category and author

## 1. Importing necessary Libraries

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = None

import os
import math
import time

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

# For text processing (NLTK)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import sklearn

# For feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# For similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances

# For Word2vec
import gensim


# original - np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1  |  gensim : 4.0.0
print(f'np : {np.__version__}  |  pd : {pd.__version__}  |  nltk : {nltk.__version__}  |  sklearn : {sklearn.__version__}  |  gensim : {gensim.__version__}')

np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1  |  gensim : 4.0.0


## 2. Loading Data

In [2]:
news_articles = pd.read_json("/kaggle/input/news-category-dataset/News_Category_Dataset_v2.json", lines = True)
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


In [3]:
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,"There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV",Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89,She left her husband. He killed their children. Just another day in America.,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song,Andy McDonald,https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c,The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carrey-adam-schiff-democrats_us_5b0950e8e4b0fdb2aa53e675,The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump.,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-margulies-trump-poop-bag_us_5b093ec2e4b0fdb2aa53df70,"The ""Dietland"" actress said using the bags is a ""really cathartic, therapeutic moment.""",2018-05-26


## 3. Data Preprocessing
### 3.a Fetching only the articles from 2018
- We keep only the most recent articles, those dated 2018 or later: 8583 of the 200853 articles in the dataset.

In [4]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp('2018-01-01')]
news_articles.shape

(8583, 6)

### 3.b Removing all the short headline articles

In [5]:
news_articles = news_articles[news_articles.headline.apply(lambda x : len(x.split())>5)]
news_articles.shape

(8530, 6)

### 3.c Checking and removing all the duplicates

In [6]:
news_articles = news_articles.sort_values('headline', ascending=False)
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]

news_articles.shape

(8485, 6)
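
A small sketch of what `keep = False` does in the cell above, using a hypothetical four-row frame (the toy headlines are assumptions, not rows from the dataset). With `keep=False`, every copy of a repeated headline is flagged, so no representative survives; `drop_duplicates(keep="first")` would instead retain one copy.

```python
import pandas as pd

# Hypothetical toy data: headline "A" appears twice
toy = pd.DataFrame({"headline": ["A", "B", "A", "C"]})

# keep=False marks ALL occurrences of a duplicated headline as True
all_copies_flagged = toy.duplicated("headline", keep=False)
dropped_all = toy[~all_copies_flagged]

# Alternative: keep the first occurrence of each headline
kept_first = toy.drop_duplicates("headline", keep="first")

print(dropped_all["headline"].tolist())
print(kept_first["headline"].tolist())
```

Which behavior is preferable depends on whether repeated headlines are considered noise (drop them all) or genuine articles published more than once (keep one).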

### 3.d Checking for missing values

In [7]:
news_articles.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

## 4. Text Preprocessing
- The data preprocessing in Step 3 leaves a subset of the original dataset whose index labels are no longer contiguous. <br>
  So, let's reset the index so that it runs from 0 to the number of articles minus one.
- Text preprocessing modifies the original headlines. <br>
  However, it doesn't make sense to display modified headlines when recommending articles. <br>
  Therefore, let's copy the dataset and perform text preprocessing on the copy.

In [8]:
# reset index
news_articles = news_articles.reset_index(drop=True)

# copy original dataset for preprocessing
news_articles_temp = news_articles.copy()

news_articles_temp.head(3)

Unnamed: 0,category,headline,authors,link,short_description,date
0,QUEER VOICES,‘Will & Grace’ Creator To Donate Gay Bunny Book To Every Grade School In Indiana,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grace-creator-donate-john-olivers-gay-bunny-book-to-every-elementary-school-in-indiana_us_5ac28265e4b00fa46f854225,It's about to be a lot easier for kids in Mike Pence's home state to read “A Day in the Life of Marlon Bundo.”,2018-04-02
1,QUEER VOICES,‘The Voice’ Blind Auditions Make History With First Trans Contestant,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice-blind-auditions-make-history-with-first-trans-contestant_us_5a9ece6ee4b002df2c5e39c2,"Austin Giorgio, 21: “How Sweet It Is (To Be Loved by You)” Young crooners have appeared on singing competitions since “American",2018-03-06
2,QUEER VOICES,‘The Penumbra’ Is The Queer Audio Drama You Didn’t Know You Needed,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penumbra-is-the-queer-audio-drama-you-didnt_us_5a48f900e4b0df0de8b06b29,"Young, fun, fantastical and, most notably, inclusive, the show is a must-listen for young queer people.",2018-01-05


### 4.a Stopwords removal

In [9]:
stop_words = set(stopwords.words('english'))  # stopwords from nltk
len(stop_words), list(stop_words)[:10]

(179,
 ['during',
  'him',
  'yourselves',
  'i',
  'below',
  'by',
  'doing',
  'didn',
  'am',
  'this'])

In [10]:
%%time

for i in range(len(news_articles_temp["headline"])):
    string = ""
    for word in news_articles_temp["headline"][i].split():
        word = ("".join(e for e in word if e.isalnum()))
        word = word.lower()
        if word not in stop_words:
            string += word + " "
    news_articles_temp.at[i,"headline"] = string.strip()

news_articles_temp.head(3)

CPU times: user 386 ms, sys: 128 µs, total: 386 ms
Wall time: 388 ms


Unnamed: 0,category,headline,authors,link,short_description,date
0,QUEER VOICES,grace creator donate gay bunny book every grade school indiana,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grace-creator-donate-john-olivers-gay-bunny-book-to-every-elementary-school-in-indiana_us_5ac28265e4b00fa46f854225,It's about to be a lot easier for kids in Mike Pence's home state to read “A Day in the Life of Marlon Bundo.”,2018-04-02
1,QUEER VOICES,voice blind auditions make history first trans contestant,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice-blind-auditions-make-history-with-first-trans-contestant_us_5a9ece6ee4b002df2c5e39c2,"Austin Giorgio, 21: “How Sweet It Is (To Be Loved by You)” Young crooners have appeared on singing competitions since “American",2018-03-06
2,QUEER VOICES,penumbra queer audio drama didnt know needed,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penumbra-is-the-queer-audio-drama-you-didnt_us_5a48f900e4b0df0de8b06b29,"Young, fun, fantastical and, most notably, inclusive, the show is a must-listen for young queer people.",2018-01-05
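
The cleaning loop above can be sketched as a standalone function. This is only an illustration: `STOP_WORDS` is a tiny hand-written set standing in for NLTK's English list, and `clean_headline` is a hypothetical helper name. Unlike the loop above, it also skips tokens that become empty after stripping punctuation (for example `&`).

```python
# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's list)
STOP_WORDS = {"the", "is", "you", "to", "in", "will", "didn't", "didn"}

def clean_headline(headline, stop_words=STOP_WORDS):
    """Strip non-alphanumeric characters, lowercase, and drop stopwords."""
    kept = []
    for word in headline.split():
        token = "".join(ch for ch in word if ch.isalnum()).lower()
        # Skip tokens that were pure punctuation, and skip stopwords
        if token and token not in stop_words:
            kept.append(token)
    return " ".join(kept)

cleaned = clean_headline(
    "‘The Penumbra’ Is The Queer Audio Drama You Didn’t Know You Needed"
)
print(cleaned)
```

Note the order of operations: punctuation is stripped before the stopword lookup, so "Didn’t" becomes "didnt", which matches neither "didn't" nor "didn" and therefore survives, as in the third row of the output above.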


### 4.b Lemmatization

In [11]:
%%time

lemmatizer = WordNetLemmatizer()  # lemmatizer from nltk.stem
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for w in word_tokenize(news_articles_temp["headline"][i]):
        string += lemmatizer.lemmatize(w, pos = "v") + " "
    news_articles_temp.at[i, "headline"] = string.strip()

news_articles_temp.head(3)

CPU times: user 4.26 s, sys: 47.4 ms, total: 4.31 s
Wall time: 4.33 s


Unnamed: 0,category,headline,authors,link,short_description,date
0,QUEER VOICES,grace creator donate gay bunny book every grade school indiana,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grace-creator-donate-john-olivers-gay-bunny-book-to-every-elementary-school-in-indiana_us_5ac28265e4b00fa46f854225,It's about to be a lot easier for kids in Mike Pence's home state to read “A Day in the Life of Marlon Bundo.”,2018-04-02
1,QUEER VOICES,voice blind audition make history first trans contestant,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice-blind-auditions-make-history-with-first-trans-contestant_us_5a9ece6ee4b002df2c5e39c2,"Austin Giorgio, 21: “How Sweet It Is (To Be Loved by You)” Young crooners have appeared on singing competitions since “American",2018-03-06
2,QUEER VOICES,penumbra queer audio drama didnt know need,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penumbra-is-the-queer-audio-drama-you-didnt_us_5a48f900e4b0df0de8b06b29,"Young, fun, fantastical and, most notably, inclusive, the show is a must-listen for young queer people.",2018-01-05


## 5. Headline based similarity on new articles
Let's try various feature representations of the headlines.

In [12]:
# function for making recommendations based on content similarity
def recommend_similar_txt(doc_vectors, row_index, num_recommend, dist_metric='euclidean', print_result=False):
#     # original code
#     couple_dist = pairwise_distances(doc_vectors, doc_vectors[row_index])
#     indices = np.argsort(couple_dist.ravel())[0:num_recommend]
#     df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
#                        'headline':news_articles['headline'][indices].values,
#                        'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
#     print("="*30,"Queried article details","="*30)
#     print('headline : ',news_articles['headline'][indices[0]])
#     print("\n","="*25,"Recommended articles : ","="*23)
#     return df.iloc[1:,]

    dist_arr = pairwise_distances(doc_vectors, doc_vectors[row_index].reshape(1, -1), metric=dist_metric).ravel()
    df = pd.DataFrame({'publish_date': news_articles['date'], 
                       'headline': news_articles['headline'],
                       f'{dist_metric}_similarity': dist_arr})
    df = df.sort_values(f'{dist_metric}_similarity').head(num_recommend+1)
    df = df.iloc[1:,].reset_index(drop=True)  # remove input itself (most similar item)
    
    if print_result:
        print("="*30, "Queried article details", "="*30)
        print('headline : ', news_articles.loc[row_index, 'headline'])
        print("\n", "="*25, "Recommended articles : ", "="*23)
        display(df)
    
    return df


# sample news to get recommendations
query_idx = 133
news_articles.loc[[query_idx]]

Unnamed: 0,category,headline,authors,link,short_description,date
133,POLITICS,Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer,Jennifer Bendery,https://www.huffingtonpost.com/entry/juli-briskman-flipping-off-trump-lawsuit_us_5ac4dfbae4b093a1eb212528,Juli Briskman claims Akima LLC broke state law when it fired her over a viral photo of her presidential insult.,2018-04-04


### 5.a Using Bag of Words method
Using the **BoW** approach, each document is represented by a d-dimensional vector of word counts, where d is the total number of unique words in the corpus.
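
To make the representation concrete, here is a minimal plain-Python sketch on three hypothetical, already-preprocessed headlines (the toy corpus and the helper `bow_vector` are assumptions for illustration). `CountVectorizer` builds the same kind of count matrix, but stores it as a sparse matrix and applies its own tokenization rules.

```python
from collections import Counter

# Hypothetical toy corpus of preprocessed headlines
docs = [
    "trump sue california",
    "texas sue trump administration",
    "woman sue former employer",
]

# Vocabulary: every unique word in the corpus, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]

print(len(vocab))   # d, the dimensionality of every vector
print(vocab)
print(vectors[0])
```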

In [13]:
cnt_vectorizer = CountVectorizer()
bow_features = cnt_vectorizer.fit_transform(news_articles_temp['headline'])

print('>>> input features :', type(bow_features), bow_features.shape, '\n')
df = recommend_similar_txt(bow_features, row_index=query_idx, num_recommend=10, print_result=True)

# using cosine metric
# df = recommend_similar_txt(bow_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'scipy.sparse.csr.csr_matrix'> (8485, 11122) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,euclidean_similarity
0,2018-04-02,The Trump Administration Is Suing California Again,2.828427
1,2018-05-01,Texas Sues Trump Administration To End DACA,3.162278
2,2018-03-07,Stormy Daniels Suing Trump Over Nondisclosure Agreement,3.162278
3,2018-04-28,Trump: Mueller Should Never Have Been Appointed,3.162278
4,2018-04-24,Spanish Woman Looks More Like Trump Than The Donald Himself,3.162278
5,2018-02-12,What You Should Know About Trump's Nihilist Budget,3.162278
6,2018-05-09,The Caliphate Of Trump And A Planet In Ruins,3.162278
7,2018-03-26,Trump Ally Sues Qatar For Hacking His Email,3.162278
8,2018-02-21,All They Will Call You Will Be Deportees,3.162278
9,2018-04-11,Pursuing Desegregation In The Trump Era,3.162278
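
Many rows above share the distance 3.162278, which is the square root of 10. On count vectors, the squared Euclidean distance is roughly the number of words that the two headlines do not share, so a short headline with no words in common can tie with a longer one that shares two words. The sketch below illustrates this with token lists that are my approximation of the preprocessed headlines (an assumption, not output taken from the notebook), and contrasts it with cosine distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two word-count Counters."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def cosine_distance(a, b):
    """1 - cosine similarity between two word-count Counters."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Approximate preprocessed token lists (assumed for illustration)
query = Counter("woman fire flip trump motorcade sue former employer".split())
unrelated = Counter("call deportees".split())              # shares no words
related = Counter("trump ally sue qatar hack email".split())  # shares two words

print(euclidean(query, unrelated), euclidean(query, related))
print(cosine_distance(query, unrelated), cosine_distance(query, related))
```

Under these assumptions the two Euclidean distances are equal, while cosine distance ranks the headline that shares words with the query as closer. This is one reason to try `dist_metric='cosine'` with the BoW features.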


### 5.b Using TF-IDF method
The **TF-IDF** method is a weighted measure that gives more importance to words that are less frequent in the corpus. <br>
It assigns a weight to each term (word) in a document based on term frequency (TF) and inverse document frequency (IDF). <br><br>

**TF(i,j)** = (# times word i appears in document j) / (# words in document j) <br>
**IDF(i,D)** = log_e( (# documents in the corpus D) / (# documents containing word i) ) <br>
weight(i,j) = **TF(i,j) x IDF(i,D)** <br><br>

So if a word occurs many times in a document but rarely in the other documents, its TF-IDF value will be high.<br>
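
As a worked example of the definitions above, the following plain-Python sketch computes TF, IDF, and their product on a hypothetical three-document corpus (the corpus and the helper names `tf`, `idf`, `tfidf` are assumptions for illustration). It follows the textbook formula with the natural log of the document-count ratio.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
docs = [
    ["trump", "sue", "california"],
    ["texas", "sue", "trump"],
    ["woman", "sue", "employer"],
]

def tf(word, doc):
    """Fraction of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """log_e(number of documents / number of documents containing `word`)."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

for word in docs[0]:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

Here "sue" appears in every document, so its IDF and weight are 0, while "california" appears in only one document and gets the largest weight. `TfidfVectorizer` will not reproduce these exact numbers: with its default settings it uses raw counts for TF, a smoothed IDF, and L2-normalizes each row. Because each row then has unit length, Euclidean and cosine distance produce the same ranking on those features.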

In [14]:
tfidf_vectorizer = TfidfVectorizer()  # default min_df=1 keeps every term; an integer min_df of 0 may be rejected by newer scikit-learn releases
tfidf_features = tfidf_vectorizer.fit_transform(news_articles_temp['headline'])

print('>>> input features :', type(tfidf_features), tfidf_features.shape, '\n')
df = recommend_similar_txt(tfidf_features, row_index=query_idx, num_recommend=10, print_result=True)

# using cosine metric
# df = recommend_similar_txt(tfidf_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'scipy.sparse.csr.csr_matrix'> (8485, 11122) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,euclidean_similarity
0,2018-05-21,The Supreme Court Just Made It A Lot Harder For You To Sue Your Employer,1.164067
1,2018-04-02,The Trump Administration Is Suing California Again,1.253867
2,2018-04-10,"Lou Dobbs Flips Out On Live TV, Urges Trump To 'Fire The SOB' Robert Mueller",1.25881
3,2018-04-26,Cardi B's Former Manager Sues Her For $10 Million,1.268704
4,2018-04-03,A Third Woman Is Suing To Break A Trump-Related Nondisclosure Agreement,1.274264
5,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.274847
6,2018-01-16,State Employer Side Payroll Taxes And Loser Liberalism,1.276696
7,2018-02-21,Democrats Flip Kentucky State House Seat Where Trump Won Overwhelmingly,1.282008
8,2018-01-09,Big Tax Game Hunting: Employer Side Payroll Taxes,1.285147
9,2018-02-28,Democrats Flip 2 More GOP-Held State House Seats,1.287403


### 5.c Using Word2Vec embedding
**Word2Vec** is a technique for capturing semantic similarity between words, introduced by researchers at Google in 2013. <br>
Given a corpus, it learns from the contexts in which words occur and represents each word by a d-dimensional vector. <br>
Training good vectors requires a fairly large corpus, so we use a pretrained model here.
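
A headline can be turned into a single vector by averaging the vectors of its words. The sketch below shows the idea with made-up 3-dimensional "embeddings" (hypothetical values, not taken from the GoogleNews model) and a hypothetical helper `mean_vector`. It also shows one design choice: whether words missing from the model count toward the denominator.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values for illustration)
toy_vectors = {
    "trump": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "sue": np.array([0.0, 1.0, 0.0], dtype="float32"),
    "employer": np.array([0.0, 0.0, 1.0], dtype="float32"),
}

def mean_vector(tokens, vectors, dim, skip_unknown=True):
    """Average the vectors of the tokens found in `vectors`.

    skip_unknown=True divides by the number of known tokens;
    skip_unknown=False divides by the total number of tokens.
    """
    total = np.zeros(dim, dtype="float32")
    known = 0
    for tk in tokens:
        if tk in vectors:
            total += vectors[tk]
            known += 1
    denom = known if skip_unknown else len(tokens)
    # Return the zero vector when there is nothing to average
    return total / denom if denom else total

tokens = ["trump", "sue", "motorcade"]  # "motorcade" is not in the toy model
true_mean = mean_vector(tokens, toy_vectors, dim=3)
scaled_mean = mean_vector(tokens, toy_vectors, dim=3, skip_unknown=False)
print(true_mean, scaled_mean)
```

Dividing by the total token count shrinks the vector when some words are unknown. That changes Euclidean distances but not cosine distances, since cosine ignores vector length.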

In [15]:
%%time

# download w2v model
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!ls

--2021-04-09 14:46:58--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.204.221
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.204.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-04-09 14:47:46 (33.1 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]

GoogleNews-vectors-negative300.bin.gz  __notebook__.ipynb
CPU times: user 1.41 s, sys: 385 ms, total: 1.8 s
Wall time: 49.5 s


In [16]:
%%time

# load w2v model
model_dir = 'GoogleNews-vectors-negative300.bin.gz'
word2vec_model = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(
    fname = model_dir, binary=True
)
word2vec_model

CPU times: user 1min 16s, sys: 11.4 s, total: 1min 27s
Wall time: 1min 28s


<gensim.models.keyedvectors.KeyedVectors at 0x7f12981b6a90>

In [17]:
%%time

# calculate mean vector of each text
vector_size = word2vec_model.vector_size  # vector_size=300
w2v_features = []
for txt in news_articles_temp['headline']:
    w2Vec_word = np.zeros(vector_size, dtype="float32")
    tokens = word_tokenize(txt)
    for tk in tokens:
        if tk in word2vec_model:
            w2Vec_word += word2vec_model[tk]
    w2Vec_word = w2Vec_word / max(len(tokens), 1)  # guard against division by zero for an empty headline
    w2v_features.append(w2Vec_word)
w2v_features = np.array(w2v_features)


# make recommendations
print('>>> input features :', type(w2v_features), w2v_features.shape, '\n')
df = recommend_similar_txt(w2v_features, row_index=query_idx, num_recommend=10, print_result=True)

# make recommendations - using cosine metric
# df = recommend_similar_txt(w2v_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'numpy.ndarray'> (8485, 300) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,euclidean_similarity
0,2018-03-19,White House Lawyer Insists Trump Isn't Considering Firing Mueller,1.029262
1,2018-03-17,Trump’s Legal Team Says It Can Sue Stormy Daniels For $20 Million,1.086385
2,2018-03-20,Ex-Playboy Model Who Claims Affair With Trump Sues To Break Silence,1.09593
3,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.097608
4,2018-01-23,Former Trump Aide's Fiancee Warns White House: ‘A Lot To Come’,1.101049
5,2018-01-25,Trump HUD Official Lynne Patton Under Fire After Calling Journalist 'Miss Piggy',1.106954
6,2018-03-07,"Former Trump Attorney Stuns 'Fox & Friends,' Says Stormy Daniels NDA Is Likely Invalid",1.110137
7,2018-01-28,Another White House Official Disputes Report That Trump Wanted To Axe Mueller,1.112997
8,2018-01-08,'Fire And Fury' Legal Team Hits Back At Trump In New Statement,1.1138
9,2018-02-20,Trump Claims He 'Never Met' Woman Accusing Him Of Sexually Assaulting Her In Trump Tower,1.12082


CPU times: user 2.48 s, sys: 266 ms, total: 2.74 s
Wall time: 2.29 s
