## ref : https://www.kaggle.com/vikashrajluhaniwal/recommending-news-articles-based-on-read-articles
<br><br>
## Table of Contents
### 1. Importing necessary Libraries<br>

### 2. Loading Data<br>

### 3. Data Preprocessing
#### 3.a Fetching only the articles from 2018
#### 3.b Removing all the short headline articles
#### 3.c Checking and removing all the duplicates
#### 3.d Checking for missing values<br>

### 4. Text Preprocessing
#### 4.a Stopwords removal
#### 4.b Lemmatization<br>

### 5. Headline based similarity on new articles
#### 5.a Using Bag of Words method
#### 5.b Using TF-IDF method
#### 5.c Using Word2Vec embedding
#### 5.d Weighted similarity based on headline and category
#### 5.e Weighted similarity based on headline, category and author
#### 5.f Weighted similarity based on headline, category, author and publishing day

## 1. Importing necessary Libraries

In [1]:
import numpy as np
import pandas as pd

import os
import math
import time

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

# For text processing (NLTK)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import sklearn

# For feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# For similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances


# original - np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1
print(f'np : {np.__version__}  |  pd : {pd.__version__}  |  nltk : {nltk.__version__}  |  sklearn : {sklearn.__version__}')

np : 1.19.5  |  pd : 1.2.2  |  nltk : 3.2.4  |  sklearn : 0.24.1


## 2. Loading Data

In [2]:
news_articles = pd.read_json("/kaggle/input/news-category-dataset/News_Category_Dataset_v2.json", lines = True)
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


In [3]:
news_articles.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |
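
The `lines = True` argument tells `pd.read_json` to treat each line of the file as a separate JSON record. A minimal sketch of that behaviour, using an in-memory buffer with made-up records instead of the Kaggle file:

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line
raw = io.StringIO(
    '{"category": "CRIME", "headline": "Example one", "authors": "A"}\n'
    '{"category": "ENTERTAINMENT", "headline": "Example two", "authors": "B"}\n'
)

# lines=True parses each line as one row of the resulting DataFrame
toy = pd.read_json(raw, lines=True)
print(toy.shape)
```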


## 3. Data Preprocessing
### 3.a Fetching only the articles from 2018
- Only the most recent articles, those published in 2018, are kept. That leaves 8,583 of the 200,853 articles.

In [4]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp('2018-01-01')]
news_articles.shape

(8583, 6)
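
As a small illustration of the date filter, the sketch below applies the same comparison to a toy DataFrame. The rows are invented for the example; only the `date >= pd.Timestamp(...)` pattern comes from the cell above.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the real dataset
toy = pd.DataFrame({
    "headline": ["old story", "new year story", "spring story"],
    "date": pd.to_datetime(["2017-12-31", "2018-01-01", "2018-05-26"]),
})

# Keep rows dated on or after 1 January 2018; the boundary date itself is included
recent = toy[toy["date"] >= pd.Timestamp("2018-01-01")]
print(recent.shape)
```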

### 3.b Removing all the short headline articles
- Only headlines with more than five words are kept.

In [5]:
news_articles = news_articles[news_articles.headline.apply(lambda x : len(x.split())>5)]
news_articles.shape

(8530, 6)
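
The condition `len(x.split()) > 5` keeps a headline only when it has at least six whitespace-separated words. A minimal sketch on invented headlines, using the vectorised `str.split().str.len()` form as an alternative to `apply` (an assumption of this example, not the notebook's code):

```python
import pandas as pd

toy = pd.DataFrame({"headline": [
    "Of course it has a song",    # 6 words -> kept
    "Breaking news today",        # 3 words -> dropped
    "One two three four five",    # 5 words -> dropped
]})

# Count whitespace-separated words per headline
word_counts = toy["headline"].str.split().str.len()

# Same threshold as the cell above: strictly more than five words
long_enough = toy[word_counts > 5]
print(long_enough)
```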

### 3.c Checking and removing all the duplicates

In [6]:
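# Sort by headline so identical headlines sit next to each other
news_articles = news_articles.sort_values('headline', ascending=False)
# keep=False flags every copy of a repeated headline, so no copy survives the filter
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]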

news_articles.shape

(8485, 6)
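
Note that `keep = False` marks every occurrence of a repeated headline, so the filter removes all copies rather than leaving one behind. The sketch below contrasts this with `keep="first"` on a toy frame (the data is invented; `keep="first"` is shown only for comparison and is not what the notebook uses):

```python
import pandas as pd

toy = pd.DataFrame({"headline": ["A", "B", "A", "C"]})

# keep=False: both rows with headline "A" are flagged
all_copies = toy.duplicated("headline", keep=False)

# keep="first": only the second "A" is flagged
later_copies = toy.duplicated("headline", keep="first")

dropped_all = toy[~all_copies]    # no "A" row remains
kept_first = toy[~later_copies]   # the first "A" row remains
print(len(dropped_all), len(kept_first))
```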

### 3.d Checking for missing values

In [7]:
news_articles.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

## 4. Text Preprocessing
- The filtering in Step 3 leaves a subset of the original dataset, so the index labels are no longer contiguous. <br>
  So, let's reset the index so that it runs from 0 to the number of articles minus one.
- Text preprocessing modifies the headlines. <br>
  However, it doesn't make sense to recommend articles by displaying the modified headlines. <br>
  Therefore, let's copy the dataset and perform text preprocessing on the copy.

In [8]:
# reset index
news_articles = news_articles.reset_index(drop=True)

# copy original dataset for preprocessing
news_articles_temp = news_articles.copy()

news_articles_temp.head(3)

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | QUEER VOICES | ‘Will & Grace’ Creator To Donate Gay Bunny Boo... | Elyse Wanshel | https://www.huffingtonpost.com/entry/will-grac... | It's about to be a lot easier for kids in Mike... | 2018-04-02 |
| 1 | QUEER VOICES | ‘The Voice’ Blind Auditions Make History With ... | Lyndsey Parker, Yahoo Entertainment | https://www.huffingtonpost.com/entry/the-voice... | Austin Giorgio, 21: “How Sweet It Is (To Be Lo... | 2018-03-06 |
| 2 | QUEER VOICES | ‘The Penumbra’ Is The Queer Audio Drama You Di... | Sarah Emily Baum, ContributorFreelance Writer | https://www.huffingtonpost.com/entry/the-penum... | Young, fun, fantastical and, most notably, inc... | 2018-01-05 |
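
To see why both steps matter, here is a minimal sketch on a toy frame (invented data): filtering leaves gaps in the index labels, `reset_index(drop=True)` renumbers them, and `copy()` gives an independent frame whose edits do not touch the original.

```python
import pandas as pd

toy = pd.DataFrame({"headline": ["a", "b", "c", "d"]})

# Filtering keeps the old labels, so the index becomes 0, 2, 3
subset = toy[toy["headline"] != "b"]
labels_before = list(subset.index)

# drop=True discards the old labels instead of adding them as a column
subset = subset.reset_index(drop=True)

# Edits to the copy leave `subset` unchanged
work = subset.copy()
work.at[0, "headline"] = "changed"
print(labels_before, list(subset.index))
```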


### 4.a Stopwords removal

In [9]:
stop_words = set(stopwords.words('english'))  # stopwords from nltk
len(stop_words), list(stop_words)[:10]

(179,
 ['any',
  'weren',
  "isn't",
  'who',
  "needn't",
  'more',
  "shan't",
  "you'd",
  'so',
  "should've"])

In [10]:
%%time

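for i in range(len(news_articles_temp)):
    kept_words = []
    for word in news_articles_temp.at[i, "headline"].split():
        # strip non-alphanumeric characters, then lowercase before the stopword lookup
        word = "".join(e for e in word if e.isalnum()).lower()
        # skip stopwords and tokens that were pure punctuation (e.g. "&")
        if word and word not in stop_words:
            kept_words.append(word)
    news_articles_temp.at[i, "headline"] = " ".join(kept_words)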

news_articles_temp.head(3)

CPU times: user 630 ms, sys: 0 ns, total: 630 ms
Wall time: 629 ms


| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | QUEER VOICES | grace creator donate gay bunny book every grad... | Elyse Wanshel | https://www.huffingtonpost.com/entry/will-grac... | It's about to be a lot easier for kids in Mike... | 2018-04-02 |
| 1 | QUEER VOICES | voice blind auditions make history first trans... | Lyndsey Parker, Yahoo Entertainment | https://www.huffingtonpost.com/entry/the-voice... | Austin Giorgio, 21: “How Sweet It Is (To Be Lo... | 2018-03-06 |
| 2 | QUEER VOICES | penumbra queer audio drama didnt know needed | Sarah Emily Baum, ContributorFreelance Writer | https://www.huffingtonpost.com/entry/the-penum... | Young, fun, fantastical and, most notably, inc... | 2018-01-05 |
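
The cleaning step can also be read as a small standalone function. The sketch below uses a hypothetical function name and a tiny hand-written stopword set so that it runs without the NLTK corpus. It also shows why "didnt" survives in the output above: the apostrophe is stripped before the lookup, so the token no longer matches the stopword "didn't".

```python
def clean_headline(text, stop_words):
    """Lowercase, strip non-alphanumeric characters, and drop stopwords."""
    kept = []
    for word in text.split():
        word = "".join(ch for ch in word if ch.isalnum()).lower()
        # Skip empty tokens (pure punctuation) and stopwords
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)

# Hypothetical stopword subset, for illustration only
toy_stop_words = {"will", "is", "you", "didn't"}

first = clean_headline("‘Will & Grace’ Creator Is Here", toy_stop_words)
second = clean_headline("You Didn't Know", toy_stop_words)
print(first)
print(second)
```

With the real data, such a function could be applied row by row, for example with `Series.apply`.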


### 4.b Lemmatization

In [11]:
%%time

lemmatizer = WordNetLemmatizer()  # lemmatizer from nltk.stem
for i in range(len(news_articles_temp)):
    # pos="v" lemmatizes every token as a verb, e.g. "needed" -> "need"
    tokens = word_tokenize(news_articles_temp.at[i, "headline"])
    news_articles_temp.at[i, "headline"] = " ".join(lemmatizer.lemmatize(w, pos = "v") for w in tokens)

news_articles_temp.head(3)

CPU times: user 6.63 s, sys: 40.2 ms, total: 6.67 s
Wall time: 6.71 s


| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | QUEER VOICES | grace creator donate gay bunny book every grad... | Elyse Wanshel | https://www.huffingtonpost.com/entry/will-grac... | It's about to be a lot easier for kids in Mike... | 2018-04-02 |
| 1 | QUEER VOICES | voice blind audition make history first trans ... | Lyndsey Parker, Yahoo Entertainment | https://www.huffingtonpost.com/entry/the-voice... | Austin Giorgio, 21: “How Sweet It Is (To Be Lo... | 2018-03-06 |
| 2 | QUEER VOICES | penumbra queer audio drama didnt know need | Sarah Emily Baum, ContributorFreelance Writer | https://www.huffingtonpost.com/entry/the-penum... | Young, fun, fantastical and, most notably, inc... | 2018-01-05 |
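
The row loop can also be sketched as a function applied to the whole column. To keep the example runnable without the WordNet and tokenizer data, it makes two assumptions that differ from the notebook: the per-word lemmatizer is passed in as a callable, and tokens come from `str.split` rather than `word_tokenize`. The lookup table is a stand-in built from the two changes visible in the output above ("auditions" to "audition", "needed" to "need"), not a real lemmatizer.

```python
import pandas as pd

def lemmatize_headline(text, lemmatize_word):
    """Apply a per-word lemmatizer to each whitespace-separated token."""
    return " ".join(lemmatize_word(tok) for tok in text.split())

# Stand-in for lemmatizer.lemmatize(w, pos="v"); hypothetical lookup table
toy_lemmas = {"auditions": "audition", "needed": "need"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

headlines = pd.Series([
    "voice blind auditions make history",
    "penumbra queer audio drama didnt know needed",
])
lemmatized = headlines.apply(lambda t: lemmatize_headline(t, toy_lemmatize))
print(lemmatized.tolist())
```

In the notebook's setting, the callable would presumably be `lambda w: lemmatizer.lemmatize(w, pos="v")`.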
