## ref : https://www.kaggle.com/vikashrajluhaniwal/recommending-news-articles-based-on-read-articles
<br><br>
## Table of Contents
### 1. Importing necessary Libraries<br>

### 2. Loading Data<br>

### 3. Data Preprocessing
#### 3.a Fetching only the articles from 2018
#### 3.b Removing all the short headline articles
#### 3.c Checking and removing all the duplicates
#### 3.d Checking for missing values<br>

### 4. Text Preprocessing
#### 4.a Stopwords removal
#### 4.b Lemmatization<br>

### 5. Headline based similarity on new articles
#### 5.a Using Bag of Words method
#### 5.b Using TF-IDF method
#### 5.c Using Word2Vec embedding
#### 5.d Weighted similarity based on headline and category
#### 5.e Weighted similarity based on headline, category and author

## 1. Importing necessary Libraries

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = None

# For text processing (NLTK)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import sklearn

# For feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances

# For Word2vec
import gensim


# original - np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1  |  gensim : 4.0.0
print(f'np : {np.__version__}  |  pd : {pd.__version__}  |  nltk : {nltk.__version__}  |  sklearn : {sklearn.__version__}  |  gensim : {gensim.__version__}')

np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1  |  gensim : 4.0.0


## 2. Loading Data

In [2]:
news_articles = pd.read_json("/kaggle/input/news-category-dataset/News_Category_Dataset_v2.json", lines = True)

# check df
news_articles.info()
print('\n')
news_articles.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB




Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,"There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV",Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89,She left her husband. He killed their children. Just another day in America.,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song,Andy McDonald,https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c,The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.,2018-05-26


## 3. Data Preprocessing

In [3]:
%%time
print(news_articles.shape)

# fetching only recent articles
news_articles = news_articles.sort_values('date').tail(8000)  # recent 8000 records
print('-> ', news_articles.shape)

# checking and removing all the duplicates
news_articles = news_articles.drop_duplicates('headline')
print('-> ', news_articles.shape)

# checking for missing values
print('>>> No of NA :', news_articles.isna().sum().sum())

news_articles.head(3)

(200853, 6)
->  (8000, 6)
->  (7966, 6)
>>> No of NA : 0
CPU times: user 72.2 ms, sys: 5.24 ms, total: 77.4 ms
Wall time: 75 ms


Unnamed: 0,category,headline,authors,link,short_description,date
8036,BLACK VOICES,Black Artists Creatively Reimagine Racist H&M Ad,Princess-India Alexander,https://www.huffingtonpost.com/entry/hm-racist-ad-reimagined_us_5a54e148e4b01e1a4b19f7af,Black boy joy continues to overcome.,2018-01-09
8047,ENTERTAINMENT,"For The Record, Meryl Streep Is Totally Here For An Oprah Presidency",Cole Delbyck,https://www.huffingtonpost.com/entry/for-the-record-meryl-streep-is-totally-here-for-an-oprah-presidency_us_5a54d816e4b003133ecc8eee,"""Wow, where do I send the check.""",2018-01-09
8049,ENTERTAINMENT,'Aladdin' Star Navid Negahban Addresses Fans' Whitewashing Concerns,Bill Bradley,https://www.huffingtonpost.com/entry/aladdin-star-navid-negahban-addresses-whitewashing-concerns_us_5a536c76e4b003133eca3834,"The ""Homeland"" icon also talks his new movie ""12 Strong"" and his role on ""Legion"" Season 2.",2018-01-09


## 4. Text Preprocessing
- The data preprocessing in Step 3 leaves a subset of the original dataset whose index labels are no longer contiguous. <br>
  So, let's reset the index so that it runs from 0 to the number of articles minus one.
- Text preprocessing modifies the original headlines. <br>
  However, it doesn't make sense to recommend articles by displaying the modified headlines. <br>
  Therefore, let's keep the original headline and store the preprocessed text in a separate column, `headline_processed`.

In [4]:
%%time

# function to preprocess text
def preprocess_newstxt(txt, stopwords, lemmatizer, min_len=5):
    tokens = word_tokenize(txt)
    r_tokens = []
    for tk in tokens:
        tk = tk.lower()
        
        # skip non-alphanumeric tokens (punctuation etc.) and stopwords
        if not tk.isalnum() or tk in stopwords:
            continue
        
        # lemmatize tokens
        tk = lemmatizer.lemmatize(tk, pos = "v")
        r_tokens.append(tk)
    
    # remove too short text after preprocessing
    if len(r_tokens) < min_len:
        r_string = ''
    else:
        r_string = ' '.join(r_tokens).strip()
    
    return r_string


# preprocess text
stop_words = set(stopwords.words('english'))  # stopwords from nltk
lemmatizer = WordNetLemmatizer()  # lemmatizer from nltk
news_articles['headline_processed'] = news_articles['headline']. \
                                            apply(lambda x: preprocess_newstxt(x, stop_words, lemmatizer))

# remove rows with empty string
news_articles = news_articles[news_articles.headline_processed.str.len()>0] \
                    .reset_index(drop=True)


news_articles.head(3)

CPU times: user 4.04 s, sys: 62.8 ms, total: 4.1 s
Wall time: 4.12 s


Unnamed: 0,category,headline,authors,link,short_description,date,headline_processed
0,BLACK VOICES,Black Artists Creatively Reimagine Racist H&M Ad,Princess-India Alexander,https://www.huffingtonpost.com/entry/hm-racist-ad-reimagined_us_5a54e148e4b01e1a4b19f7af,Black boy joy continues to overcome.,2018-01-09,black artists creatively reimagine racist h ad
1,ENTERTAINMENT,"For The Record, Meryl Streep Is Totally Here For An Oprah Presidency",Cole Delbyck,https://www.huffingtonpost.com/entry/for-the-record-meryl-streep-is-totally-here-for-an-oprah-presidency_us_5a54d816e4b003133ecc8eee,"""Wow, where do I send the check.""",2018-01-09,record meryl streep totally oprah presidency
2,ENTERTAINMENT,'Aladdin' Star Navid Negahban Addresses Fans' Whitewashing Concerns,Bill Bradley,https://www.huffingtonpost.com/entry/aladdin-star-navid-negahban-addresses-whitewashing-concerns_us_5a536c76e4b003133eca3834,"The ""Homeland"" icon also talks his new movie ""12 Strong"" and his role on ""Legion"" Season 2.",2018-01-09,star navid negahban address fan whitewash concern


## 5. Headline based similarity on new articles
Let's try several feature representations of the headline.

In [5]:
# function for making recommendations based on content similarity
def recommend_similar_txt(doc_vectors, row_index, num_recommend, dist_metric='euclidean', print_result=False):
    dist_arr = pairwise_distances(doc_vectors, doc_vectors[row_index].reshape(1, -1), metric=dist_metric).ravel()
    df = pd.DataFrame({'category':news_articles['category'], 'publish_date': news_articles['date'],
                       'headline': news_articles['headline'], f'{dist_metric}_similarity': dist_arr})
    df = df.sort_values(f'{dist_metric}_similarity').head(num_recommend+1)
    df = df.iloc[1:,].reset_index(drop=True)  # remove input itself (most similar item)
    
    if print_result:
        print("="*30, "Queried article details", "="*30)
        print('headline : ', news_articles.loc[row_index, 'headline'])
        print("\n", "="*25, "Recommended articles : ", "="*23)
        display(df)
    
    return df


# sample news to get recommendations
query_idx = 4990
news_articles.loc[[query_idx]]

Unnamed: 0,category,headline,authors,link,short_description,date,headline_processed
4990,POLITICS,Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer,Jennifer Bendery,https://www.huffingtonpost.com/entry/juli-briskman-flipping-off-trump-lawsuit_us_5ac4dfbae4b093a1eb212528,Juli Briskman claims Akima LLC broke state law when it fired her over a viral photo of her presidential insult.,2018-04-04,woman fire flip trump motorcade sue former employer


### 5.a Using Bag of Words method
Using the **BoW** approach, each document is represented by a d-dimensional vector of word counts, where d is the total number of unique words in the corpus.
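
To make the representation concrete, here is a minimal, dependency-free sketch of the idea. The three toy headlines and the helper names `bow_vector` and `euclidean` are assumptions made for illustration; the notebook itself relies on `CountVectorizer`, which additionally handles tokenization and sparse storage.

```python
from collections import Counter
import math

# Toy corpus of already-preprocessed headlines (hypothetical examples)
docs = [
    "trump sue former employer",
    "woman sue trump",
    "senate pass budget bill",
]

# Vocabulary: one dimension per unique word in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Return the d-dimensional count vector of a document."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = [bow_vector(doc) for doc in docs]
d = len(vocab)                                # number of unique words
dist_01 = euclidean(vectors[0], vectors[1])   # headlines sharing two words
dist_02 = euclidean(vectors[0], vectors[2])   # headlines sharing no words
print(d, dist_01, dist_02)
```

Headlines that share more words end up closer together, which is the property the recommender relies on.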

In [6]:
# generate BoW features
cnt_vectorizer = CountVectorizer()
bow_features = cnt_vectorizer.fit_transform(news_articles['headline_processed'])

# make recommendations
print('>>> input features :', type(bow_features), bow_features.shape, '\n')
df = recommend_similar_txt(bow_features, row_index=query_idx, num_recommend=10, print_result=True)

# make recommendations - using cosine metric
# df = recommend_similar_txt(bow_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'scipy.sparse.csr.csr_matrix'> (7695, 9182) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,category,publish_date,headline,euclidean_similarity
0,POLITICS,2018-04-03,A Third Woman Is Suing To Break A Trump-Related Nondisclosure Agreement,3.162278
1,POLITICS,2018-03-21,Trump Administration Sued On Elephant Trophy Ban Flip-Flop,3.162278
2,WEIRD NEWS,2018-04-24,Spanish Woman Looks More Like Trump Than The Donald Himself,3.162278
3,POLITICS,2018-04-10,GOP Senator: It Would Be 'Suicide' For Trump If He Fires Mueller,3.162278
4,POLITICS,2018-03-26,Trump Ally Sues Qatar For Hacking His Email,3.162278
5,POLITICS,2018-03-07,Stormy Daniels Suing Trump Over Nondisclosure Agreement,3.162278
6,POLITICS,2018-05-01,Texas Sues Trump Administration To End DACA,3.162278
7,BLACK VOICES,2018-04-26,Cardi B's Former Manager Sues Her For $10 Million,3.162278
8,POLITICS,2018-01-26,Trump Has A History Of Being Cocky And Unprepared Under Oath,3.316625
9,POLITICS,2018-01-31,The Hidden Extremism Of Trump's State Of The Union,3.316625
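
Many rows in the table above tie at 3.162278, which is the square root of 10. A plausible reason is that Euclidean distance on count vectors only counts mismatched words, so headlines of different lengths and overlaps can land on the same value. The sketch below illustrates this with hypothetical binary vectors (not taken from the dataset) and shows how cosine distance, the alternative left commented out in the cell above, separates them.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical binary BoW vectors over a 12-word vocabulary
query     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 8 words
short_doc = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # 4 words, 1 shared with query
long_doc  = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 6 words, 2 shared with query

eu_short = euclidean_distance(query, short_doc)
eu_long = euclidean_distance(query, long_doc)
cos_short = cosine_distance(query, short_doc)
cos_long = cosine_distance(query, long_doc)

print(eu_short, eu_long)    # both have 10 mismatched positions, so they tie
print(cos_short, cos_long)  # cosine ranks the headline with more overlap closer
```

Under these assumptions, cosine distance breaks the tie in favor of the headline that shares more words with the query.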


### 5.b Using TF-IDF method
**TF-IDF** is a weighting scheme that gives more importance to words that are rare across the corpus. <br>
It assigns a weight to each term (word) in a document based on term frequency (TF) and inverse document frequency (IDF). <br><br>

**TF(i,j)** = (# times word i appears in document j) / (# words in document j) <br>
**IDF(i,D)** = log_e( (# documents in the corpus D) / (# documents containing word i) ) <br>
weight(i,j) = **TF(i,j) x IDF(i,D)** <br><br>

So if a word occurs many times in a document but appears in few other documents, its TF-IDF value will be high.<br>
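
The weighting can be worked through by hand on a tiny corpus. The following is a minimal sketch that takes IDF as the natural log of the ratio (documents in the corpus) / (documents containing the word). The toy headlines and the helper names `tf`, `idf` and `tfidf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed headlines (hypothetical examples)
docs = [
    "trump sue former employer",
    "woman sue trump",
    "trump sign budget bill",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf(word, tokens):
    """Share of the document's words that are `word`."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Natural log of (number of documents / documents containing `word`)."""
    return math.log(n_docs / doc_freq[word])

def tfidf(word, tokens):
    return tf(word, tokens) * idf(word)

# Weights for every word of the first headline
weights = {word: tfidf(word, tokenized[0]) for word in tokenized[0]}
print(weights)
```

Here "trump" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "former", which appears in only one document, gets the largest weight.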

In [7]:
# generate TF-IDF features
tfidf_vectorizer = TfidfVectorizer(min_df=1)  # default; keeps every term (newer scikit-learn versions may reject the integer 0)
tfidf_features = tfidf_vectorizer.fit_transform(news_articles['headline_processed'])

# make recommendations
print('>>> input features :', type(tfidf_features), tfidf_features.shape, '\n')
df = recommend_similar_txt(tfidf_features, row_index=query_idx, num_recommend=10, print_result=True)

# make recommendations - using cosine metric
# df = recommend_similar_txt(tfidf_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'scipy.sparse.csr.csr_matrix'> (7695, 9182) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,category,publish_date,headline,euclidean_similarity
0,POLITICS,2018-05-21,The Supreme Court Just Made It A Lot Harder For You To Sue Your Employer,1.158152
1,MEDIA,2018-01-24,Garrison Keillor's Former Station Reports He Was Fired For More Than Touching A Woman's Back,1.234945
2,BLACK VOICES,2018-04-26,Cardi B's Former Manager Sues Her For $10 Million,1.24863
3,POLITICS,2018-04-03,A Third Woman Is Suing To Break A Trump-Related Nondisclosure Agreement,1.251581
4,POLITICS,2018-02-28,Democrats Flip 2 More GOP-Held State House Seats,1.257818
5,POLITICS,2018-01-16,State Employer Side Payroll Taxes And Loser Liberalism,1.27251
6,POLITICS,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.275462
7,COMEDY,2018-04-24,Seth Meyers Taunts Donald Trump: 'There's A Good Chance He'll Flip On Himself',1.278543
8,POLITICS,2018-02-21,Democrats Flip Kentucky State House Seat Where Trump Won Overwhelmingly,1.282204
9,POLITICS,2018-03-22,Terrified Tweeters Flip Out Over Donald Trump's 'Coming Arms Race' Prediction,1.287066


### 5.c Using Word2Vec embedding
**Word2Vec** is a word-embedding technique, introduced by researchers at Google in 2013, that captures semantic similarity between words. <br>
It learns from the contexts in which words appear in a corpus and represents each word by a d-dimensional vector. <br>
Good embeddings require a fairly large training corpus, so we use a pretrained model here.
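
One common way to turn per-word vectors into a single headline vector is to average them. The sketch below shows the idea with a hypothetical 3-dimensional lookup table standing in for a trained model. The names `toy_embeddings` and `mean_vector` are assumptions, and dividing by the number of words actually found (rather than by all tokens) is just one possible design choice.

```python
# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model
toy_embeddings = {
    "trump": [0.9, 0.1, 0.0],
    "sue":   [0.1, 0.8, 0.1],
    "woman": [0.0, 0.2, 0.9],
}

def mean_vector(text, embeddings, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are found."""
    found = [embeddings[tok] for tok in text.split() if tok in embeddings]
    if not found:
        return [0.0] * dim
    # zip(*found) iterates over the vector components column by column
    return [sum(col) / len(found) for col in zip(*found)]

# "motorcade" is not in the toy vocabulary, so it is ignored
vec = mean_vector("woman sue trump motorcade", toy_embeddings)
empty = mean_vector("motorcade", toy_embeddings)
print(vec, empty)
```

The resulting vectors all have the same length regardless of headline length, so they can be compared with the same distance metrics as the BoW and TF-IDF features.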

In [8]:
%%time

# download w2v model - about 40s
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

# load w2v model - about 1min 20s
model_dir = 'GoogleNews-vectors-negative300.bin.gz'
word2vec_model = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(
    fname = model_dir, binary=True
)

print(word2vec_model)

--2021-04-16 14:44:45--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.131.61
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.131.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-04-16 14:45:25 (39.4 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]

<gensim.models.keyedvectors.KeyedVectors object at 0x7f806e3e08d0>
CPU times: user 1min 13s, sys: 4.42 s, total: 1min 18s
Wall time: 1min 57s


In [9]:
%%time

# function to calculate mean vector of each text
def get_mean_vector(txt, model=word2vec_model):
    vector_size = model.vector_size  # vector_size=300
    r_vector = np.zeros(vector_size, dtype="float32")
    
    tokens = word_tokenize(txt)
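    for tk in tokens:
        # out-of-vocabulary tokens are skipped here ...
        if tk in model:
            r_vector += model[tk]
    # ... but they still count in the denominator below
    r_vector = r_vector / len(tokens)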
    return r_vector


# generate w2v features
w2v_features = np.array(news_articles['headline_processed'].apply(get_mean_vector).tolist())

# make recommendations
print('>>> input features :', type(w2v_features), w2v_features.shape, '\n')
df = recommend_similar_txt(w2v_features, row_index=query_idx, num_recommend=10, print_result=True)

# make recommendations - using cosine metric
# df = recommend_similar_txt(w2v_features, row_index=query_idx, num_recommend=10, dist_metric='cosine', print_result=True)

>>> input features : <class 'numpy.ndarray'> (7695, 300) 

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,category,publish_date,headline,euclidean_similarity
0,POLITICS,2018-03-19,White House Lawyer Insists Trump Isn't Considering Firing Mueller,1.045618
1,POLITICS,2018-01-23,Former Trump Aide's Fiancee Warns White House: ‘A Lot To Come’,1.076581
2,POLITICS,2018-03-17,Trump’s Legal Team Says It Can Sue Stormy Daniels For $20 Million,1.086385
3,POLITICS,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.097608
4,POLITICS,2018-03-27,Attorney Who Declined A Job On Trump's Legal Team Rips 'Turmoil' In White House,1.109826
5,POLITICS,2018-01-28,Another White House Official Disputes Report That Trump Wanted To Axe Mueller,1.112997
6,POLITICS,2018-01-25,Trump HUD Official Lynne Patton Under Fire After Calling Journalist 'Miss Piggy',1.117826
7,POLITICS,2018-05-04,Giuliani Tells Mueller To Back Off ‘Fine Woman’ Ivanka Trump But Calls Kushner 'Disposable',1.125783
8,POLITICS,2018-03-07,"Former Trump Attorney Stuns 'Fox & Friends,' Says Stormy Daniels NDA Is Likely Invalid",1.133482
9,POLITICS,2018-01-26,White House Spent Months Denying That Trump Considered Firing Mueller,1.134542


CPU times: user 1.77 s, sys: 130 ms, total: 1.9 s
Wall time: 1.74 s
