####  Imagine you have a dataset where you have different categories of data, Nowyou need to find the most similar data to the given data by using any 4 differentsimilarity algorithms. Now you have to build a model which can find the most similar data to the given data.

**Recommender systems** are most extensively used to suggest "Similar items", "Relevant jobs", "preferred foods", "Movies of interest" etc to their users. 

**Recommender system** with appropiate item suggestions helps in boosting sales, increasing revenue, retaining customers and also adds competitive advantage. 
There are basically two kind of **recommendation** methods.
1. **Content based recommendation**
2. **Collaborative filtering**

**Content based recommendation ** is based on similarity among users/items obtained through their **attributes**. It uses the additional information(meta data) about the **users** or **items** i.e. it relies on what kind of **content** is already available. This meta data could be **user's demograpic information** like *age*, *gender*, *job*, *location*, *skillsets* etc. Similarly for **items** it can be *item name*, *specifications*, *category*, *registration date* etc.

So the core idea is to recommend items by finding similar items/users to the concerned **item/user** based on their **attributes**. 

here  **Content based recommendation** using **News category** dataset has been analysed. The goal is to recommend **news articles** which are similar to the already read article by using attributes like article *headline*, *category*, *author* and *publishing date*.



## 1. Importing necessary Libraries

In [5]:
pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
     - -------------------------------------- 0.0/1.5 MB 653.6 kB/s eta 0:00:03
     - -------------------------------------- 0.0/1.5 MB 653.6 kB/s eta 0:00:03
     - -------------------------------------- 0.0/1.5 MB 653.6 kB/s eta 0:00:03
     - -------------------------------------- 0.0/1.5 MB 653.6 kB/s eta 0:00:03
     - -------------------------------------- 0.1/1.5 MB 233.8 kB/s eta 0:00:07
     -- ------------------------------------- 0.1/1.5 MB 291.5 kB/s eta 0:00:05
     --- ------------------------------------ 0.1/1.5 MB 343.4 kB/s eta 0:00:05
     --- ------------------------------------ 0.1/1.5 MB 355.0 kB/s eta 0:00:04
     ---- ----------------------------------- 0.2/1.5 MB 374.1 kB/s eta 0:00:04
     ----- ---------------------------------- 0.2/1.5 MB 430.1 kB/s eta 0:00:03
     ------- -------------------------------- 0.3/1.5 MB 505.4


[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [6]:
pip install plotly

Collecting plotlyNote: you may need to restart the kernel to use updated packages.

  Downloading plotly-5.14.1-py2.py3-none-any.whl (15.3 MB)
     ---------------------------------------- 0.0/15.3 MB ? eta -:--:--
     --------------------------------------- 0.0/15.3 MB 991.0 kB/s eta 0:00:16
     --------------------------------------- 0.0/15.3 MB 991.0 kB/s eta 0:00:16
     --------------------------------------- 0.0/15.3 MB 991.0 kB/s eta 0:00:16
     --------------------------------------- 0.0/15.3 MB 991.0 kB/s eta 0:00:16
     --------------------------------------- 0.0/15.3 MB 991.0 kB/s eta 0:00:16
     --------------------------------------- 0.1/15.3 MB 193.2 kB/s eta 0:01:20
     --------------------------------------- 0.1/15.3 MB 196.9 kB/s eta 0:01:18
     --------------------------------------- 0.1/15.3 MB 238.1 kB/s eta 0:01:05
     --------------------------------------- 0.1/15.3 MB 262.6 kB/s eta 0:00:59
     --------------------------------------- 0.2/15.3 MB 349.3 kB


[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [74]:

import os
import math
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

# Below libraries are for text processing using NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Below libraries are for feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Below libraries are for similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances

from scipy.spatial import distance   ##minkowski
from sklearn.metrics import jaccard_score

## 2. Loading Data

In [19]:
news_articles = pd.read_json(r"D:\ineuron\Placement related assignment\sample code\Q3\News_Category_Dataset_v3.json", lines = True)

In [20]:
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   link               209527 non-null  object        
 1   headline           209527 non-null  object        
 2   category           209527 non-null  object        
 3   short_description  209527 non-null  object        
 4   authors            209527 non-null  object        
 5   date               209527 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.6+ MB


The dataset contains about two million records of six different features. 

In [21]:
news_articles.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


## 3. Data Preprocessing

### 3.a Fetching only the articles from 2022 

Since the dataset size is quite large so processing through entire dataset may consume too much time. To refrain from this, we are only considering the latest articles from the year 2022. 

In [22]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp(2022,1,1)]

In [23]:
news_articles.shape

(1398, 6)

In [24]:
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1398 entries, 0 to 1397
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   link               1398 non-null   object        
 1   headline           1398 non-null   object        
 2   category           1398 non-null   object        
 3   short_description  1398 non-null   object        
 4   authors            1398 non-null   object        
 5   date               1398 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 76.5+ KB


Now, the number of news articles comes down to 8583.

### 3.b Removing all the short headline articles 

After stop words removal from headline, the articles with very short headline may become blank headline articles. So let's remove all the articles with less words(<5) in the headline.   

In [26]:
#news_articles = news_articles[news_articles['headline'].apply(lambda x: len(x.split())>5)]
news_articles =news_articles[news_articles['headline'].apply(lambda x : len(x.split())>7)]
print("Total number of articles after removal of headlines with short title:", news_articles.shape[0])

Total number of articles after removal of headlines with short title: 1347


### 3.c Checking and removing all the duplicates

Since some articles are exactly same in headlines, so let's remove all such articles having duplicate headline appearance.

In [28]:
news_articles.sort_values('headline',inplace=True, ascending=False)
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]
print("Total number of articles after removing duplicates:", news_articles.shape[0])

Total number of articles after removing duplicates: 1347


### 3.d Checking for missing values

In [29]:
news_articles.isna().sum()

link                 0
headline             0
category             0
short_description    0
authors              0
date                 0
dtype: int64

## 4. Basic Data Exploration 

### 4.a Basic statistics - Number of articles,authors,categories

In [30]:
print("Total number of articles : ", news_articles.shape[0])
print("Total number of authors : ", news_articles["authors"].nunique())
print("Total number of unqiue categories : ", news_articles["category"].nunique())

Total number of articles :  1347
Total number of authors :  404
Total number of unqiue categories :  24


### 4.b Distribution of articles category-wise

In [32]:
fig = go.Figure([go.Bar(x=news_articles["category"].value_counts().index, y=news_articles["category"].value_counts().values)])

fig['layout'].update(title={"text" : 'Distribution of articles category-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Category name",yaxis_title="Number of articles")
fig.update_layout(width=800,height=700)
fig

From the bar chart, we can observe that **politics** category has **highest** number of articles then **US news** ,**entertainment** and so on.  

### 4.c Number of articles per month

Let's first group the data on monthly basis using **resample()** function. 

In [33]:
news_articles_per_month = news_articles.resample('m',on = 'date')['headline'].count()

news_articles_per_month

date
2022-01-31    143
2022-02-28    146
2022-03-31    169
2022-04-30    156
2022-05-31    163
2022-06-30    144
2022-07-31    138
2022-08-31    167
2022-09-30    121
Freq: M, Name: headline, dtype: int64

In [34]:
fig = go.Figure([go.Bar(x=news_articles_per_month.index.strftime("%b"), y=news_articles_per_month)])
fig['layout'].update(title={"text" : 'Distribution of articles month-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Month",yaxis_title="Number of articles")
fig.update_layout(width=500,height=500)
fig

From the bar chart, we can observe that **March** month has **highest** number of articles then **August** and so on.  

By Data processing in Step 2, we get a subset of original dataset which has different index labels so let's make the indices uniform ranging from 0 to total number of articles. 

In [35]:
news_articles.index = range(news_articles.shape[0])

In [36]:
# Adding a new column containing both day of the week and month, it will be required later while recommending based on day of the week and month
news_articles["day and month"] = news_articles["date"].dt.strftime("%a") + "_" + news_articles["date"].dt.strftime("%b")

Since after text preprocessing the original headlines will be modified and it doesn't make sense to recommend articles by displaying modified headlines so let's copy the dataset into some other dataset and perform text preprocessing on the later.

In [37]:
news_articles_temp = news_articles.copy()

## 5. Text Preprocessing

### 5.a Stopwords removal

Stop words are not much helpful in analyis and also their inclusion consumes much time during processing so let's remove these. 

In [43]:
stop_words = set(stopwords.words('english'))

In [45]:
for j in range(len(news_articles_temp["headline"])):
    string = ""
    for word in news_articles_temp["headline"][j].split():
        word = ("".join(i for i in word if i.isalnum()))
        word = word.lower()
        if not word in stop_words:
          string += word + " "  
    if(j%1000==0):
      print(j)           # To track number of records processed
    news_articles_temp.at[j,"headline"] = string.strip()

0
1000


### 5.b Lemmatization

Let's find the base form(lemma) of words to consider different inflections of a word same as lemma.

In [49]:
wn = WordNetLemmatizer()

In [50]:
for j in range(len(news_articles_temp["headline"])):
    string = ""
    for w in word_tokenize(news_articles_temp["headline"][j]):
        string += wn.lemmatize(w,pos = "v") + " "
    news_articles_temp.at[j, "headline"] = string.strip()
       

## 6. Headline based similarity on new articles

Generally, we assess **similarity** based on **distance**. If the **distance** is minimum then high **similarity** and if it is maximum then low **similarity**.
To calculate the **distance**, we need to represent the headline as a **d-dimensional** vector. Then we can find out the **similarity** based on the **distance** between vectors.

There are multiple methods to represent a **text** as **d-dimensional** vector like **Bag of words**, **TF-IDF method**, **Word2Vec embedding** etc. Each method has its own advantages and disadvantages. 

Let's see the feature representation of headline through all the methods one by one.

### 6.a Using Bag of Words method

A **Bag of Words(BoW)** method represents the occurence of words within a **document**. Here, each headline can be considered as a **document** and set of all headlines form a **corpus**.

Using **BoW** approach, each **document** is represented by a **d-dimensional** vector, where **d** is total number of **unique words** in the corpus. The set of such unique words forms the **Vocabulary**.

In [52]:
headline_vectorizer = CountVectorizer()
headline_features   = headline_vectorizer.fit_transform(news_articles_temp['headline'])

In [53]:
headline_features.get_shape()

(1347, 4299)

The output **BoW matrix**(headline_features) is a sparse matrix.

In [54]:
pd.set_option('display.max_colwidth', -1)  # To display a very long headline completely


Passing a negative integer is deprecated in version 1.0 and will not be supported in future version. Instead, use None to not limit the column width.



In [60]:
def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(headline_features,headline_features[row_index],metric='euclidean')
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    #return df.iloc[1:,1]
    return df.iloc[1:,]

bag_of_words_based_model(125, 11) # Change the row index for any other queried article

headline :  UN To Scale Up Humanitarian Operations In Ukraine Following Russia's Invasion



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2022-03-07,Netflix Suspends Service In Russia Over Ukraine Invasion,3.162278
2,2022-02-28,"UN General Assembly, Security Council To Hold Meetings About Russia's Invasion Of Ukraine",3.162278
3,2022-07-27,Should I Take My Cat Out In This Heat?,3.316625
4,2022-07-18,Republicans Begin To Sour On Aid To Ukraine,3.316625
5,2022-01-30,7 Happiness Hacks You Can Do On A Commute,3.316625
6,2022-03-15,"UN Says Women Pay Highest Price In Conflict, Now In Ukraine",3.464102
7,2022-02-03,How To Salvage Your Vacation If It Rains Most Of The Time,3.464102
8,2022-03-15,Biden To Travel To Europe For NATO Summit On Russia’s War On Ukraine,3.464102
9,2022-01-30,U.S. And Russia Square Off At UN Security Council,3.464102
10,2022-04-18,Zelenskyy: Russian Offensive In Eastern Ukraine Has Begun,3.464102


In [63]:
def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = cosine_similarity(headline_features,headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Cosine similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    #return df.iloc[1:,1]
    return df.iloc[1:,]

bag_of_words_based_model(125, 11) # Change the row index for any other queried article

headline :  ‘The Batman’ Gives Movie Theaters A New Hope With Big Launch



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2022-03-07,"Former NY Gov. Andrew Cuomo Rips 'Cancel Culture,' Hints At Political Comeback",0.0
2,2022-06-28,"Former Nazi Guard, 101, Jailed In Germany For Aiding Murder",0.0
3,2022-09-02,Former White House Counsel Pat Cipollone To Testify Before Grand Jury: Reports,0.0
4,2022-01-10,Fox News Host Has Some Blunt Talk For Trump: 'You Have To Learn To Lose',0.0
5,2022-04-08,Fox News Reporter Feels ‘Damn Lucky’ After Losing Limbs In Ukraine Attack,0.0
6,2022-03-24,Fox News' Ketanji Brown Jackson Coverage Melts 'Daily Show' Comedian's Mind,0.0
7,2022-08-05,French Scientist Trolls Twitter By Claiming Chorizo Slice Is Actually A Distant Star,0.0
8,2022-08-23,Former Louisville Cop Pleads Guilty In Breonna Taylor Case,0.0
9,2022-09-03,Fuel Leak Ruins NASA's 2nd Shot At Launching New Moon Rocket,0.0
10,2022-05-08,G7 Pledges To Phase Out Import Of Russian Oil,0.0


Above function recommends **10 similar** articles to the **queried**(read) article based on the headline. It accepts two arguments - index of already read artile and the total number of articles to be recommended.

Based on the **Euclidean distance** it finds out 10 nearest neighbors and recommends. 

**Disadvantages**
1. It gives very low **importance** to less frequently observed words in the corpus. Few words from the queried article like "employer", "flip", "fire" appear less frequently in the entire corpus so **BoW** method does not recommend any article whose headline contains these words. Since **trump** is commonly observed word in the corpus so it is recommending the articles with headline containing "trump".   
2. **BoW** method doesn't preserve the order of words.

To overcome the first disadvantage we use **TF-IDF** method for feature representation. 


In [67]:
def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = manhattan_distances(headline_features,headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Manhattan with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    #return df.iloc[1:,1]
    return df.iloc[1:,]

bag_of_words_based_model(125, 11) # Change the row index for any other queried article

headline :  UN To Scale Up Humanitarian Operations In Ukraine Following Russia's Invasion



Unnamed: 0,publish_date,headline,Manhattan with the queried article
1,2022-03-07,Netflix Suspends Service In Russia Over Ukraine Invasion,10.0
2,2022-02-28,"UN General Assembly, Security Council To Hold Meetings About Russia's Invasion Of Ukraine",10.0
3,2022-07-18,Republicans Begin To Sour On Aid To Ukraine,11.0
4,2022-01-30,7 Happiness Hacks You Can Do On A Commute,11.0
5,2022-07-27,Should I Take My Cat Out In This Heat?,11.0
6,2022-01-20,5 Hacks To Try If You Only Have A Minute To Destress,12.0
7,2022-03-15,Biden To Travel To Europe For NATO Summit On Russia’s War On Ukraine,12.0
8,2022-02-28,"More Than 500,000 Have Fled Ukraine Since Russia Invaded, UN Reports",12.0
9,2022-05-20,Monkeypox: What You Need To Know About The Virus,12.0
10,2022-04-18,Zelenskyy: Russian Offensive In Eastern Ukraine Has Begun,12.0


In [None]:
###ML algo

In [101]:
news_articles1=news_articles
display(news_articles1['category'].value_counts())
#selected_cat=['POLITICS', 'ENTERTAINMENT', 'U.S. NEWS', 'WORLD NEWS']
#data=news_articles1[['category','headline']][news_articles1['category'].isin(selected_cat)].reset_index(drop=True)


12    388
19    234
6     230
23    188
15    71 
3     37 
7     26 
21    23 
4     22 
8     17 
11    15 
16    15 
2     15 
20    15 
1     12 
17    7  
14    7  
10    7  
22    5  
9     4  
0     4  
13    2  
5     2  
18    1  
Name: category, dtype: int64

In [106]:
news_articles1.shape

(1347, 7)

In [107]:
# map label to int

class_names0=news_articles1['category'].unique().tolist()
class_names=sorted(class_names0)
N=list(range(len(class_names)))
normal_mapping=dict(zip(class_names,N)) 
reverse_mapping=dict(zip(N,class_names))
news_articles1['category']=news_articles1['category'].map(normal_mapping)

In [109]:
news_articles1.shape

(1347, 7)

In [110]:
news_articles1.head()

Unnamed: 0,link,headline,category,short_description,authors,date,day and month
0,https://www.huffpost.com/entry/the-batman-gives-movie-theaters-a-new-hope-with-big-launch_n_6224f08be4b042f866ef757c,‘The Batman’ Gives Movie Theaters A New Hope With Big Launch,6,"“The Batman,” starring Robert Pattinson, took in $128.5 million in its box office debut over the weekend.","Lindsey Bahr, AP",2022-03-06,Sun_Mar
1,https://www.huffpost.com/entry/the-batman-still-no-1-crosses-300-million_n_62375be8e4b019fd812f26b3,"‘The Batman,’ Still No. 1, Crosses $300 Million",6,"“The Batman” is still going strong three weeks into its theatrical run, with a tight grip on the top spot at the box office.","Lindsey Bahr, AP",2022-03-20,Sun_Mar
2,https://www.huffpost.com/entry/stranger-things-surfer-boy-pizza-number-surprise_n_62cc65dae4b02e0ac917b846,‘Stranger Things’ Fans Who Call The Surfer Boy Pizza Number Will Get Gnarly Surprise,6,People who call the number featured on the side of Argyle’s truck are getting something way sweeter than a pineapple pizza topping.,Elyse Wanshel,2022-07-11,Mon_Jul
3,https://www.huffpost.com/entry/stranger-things-creators-admit-they-made-huge-continuity-error-in-season-4_n_629f8ba3e4b05fe694f94b72,‘Stranger Things’ Creators Admit They Made Huge Continuity Error In Season 4,6,The Duffer Brothers said they had no intention of making one character’s birthday so sad.,Elyse Wanshel,2022-06-07,Tue_Jun
4,https://www.huffpost.com/entry/quick-reaction-forces-qrf-oath-keepers-capitol-attack_n_61f2d986e4b02de5f51634bd,‘Quick Reaction Forces’ And The Lingering Mysteries Of The Plot Against The Capitol,12,"The Oath Keeper “QRFs” show how things could have been a lot worse, and how much more there is to learn.",Ryan J. Reilly,2022-01-28,Fri_Jan


In [111]:
# split dataset in features and target variable

# Features
X = news_articles1.drop(columns=["category"])

# Target variable
y = news_articles1['category'] 


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [112]:
vectorizer = TfidfVectorizer()
#fit_transform for train data
X_train = vectorizer.fit_transform(X_train['headline'])
#transform for test data
X_test = vectorizer.transform(X_test['headline'])


In [113]:
X_train

<942x3921 sparse matrix of type '<class 'numpy.float64'>'
	with 10285 stored elements in Compressed Sparse Row format>

In [123]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report 
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [124]:
model = LogisticRegression(random_state = 42)
model.fit(X_train, y_train)

In [126]:
y_pred = model.predict(X_test)
accuracy = round(metrics.accuracy_score(y_test, y_pred),5)

print(f'accuracy:{accuracy}')

accuracy:0.51605


In [129]:
#print(metrics.classification_report(y_test, y_pred))