## Exploring Unsupervised Models in HuffPost Dataset

This is the second Jupyter notebook exploring the HuffPost dataset. In this notebook we want to answer the following question: is it possible to find clusters of authors according to the categories of their articles? For example, I expect that not all authors write articles in the same categories, and that there could be groups of authors (clusters) that share a common pattern.

Here is the outline of the notebook, so you can jump between sections.

## Outline

1. [Loading Libraries and Data](#Loading-libraries)
2. [EDA and ETL](#ETL)
3. [Models](#Models)
    1. [Spectral Clustering](#Spectral-Clustering)
4. [Conclusions](#Conclusions)

<a name="Loading-libraries"></a>
# Loading Libraries and Data

## Third Party Libraries

In [1]:
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt

# SkLearn
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.metrics import calinski_harabasz_score
from sklearn.cluster import SpectralClustering

## Utility Functions

In [2]:
from utils import plot_freq_x_context

## Load Data

In [3]:
df_org = pd.read_json('Data/News_Category_Dataset_v2.json',lines=True)
# Rename the index to 'id' so each article keeps an identifier after the rows are exploded
df_org.index.rename('id',inplace= True)
df_org = df_org.reset_index()
df_org

Unnamed: 0,id,category,headline,authors,link,short_description,date
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...,...
200848,200848,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,200849,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,200850,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...",,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,200851,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


<a name="ETL"></a>
# EDA and ETL

Let's explore the authors with the largest number of articles.

In [4]:
counts_by_author = (df_org
     .filter(['id','authors'])
     .groupby('authors')['id']
     .size()
     .reset_index()
     .rename(columns={'id':'counts'})
     .sort_values('counts', ascending = False))
counts_by_author.head(10)

Unnamed: 0,authors,counts
0,,36620
16031,Lee Moran,2423
23063,Ron Dicker,1913
22335,"Reuters, Reuters",1562
7959,Ed Mazza,1322
5353,Cole Delbyck,1140
1696,Andy McDonald,1068
13769,Julia Brucculieri,1059
4149,Carly Ledbetter,1054
5634,Curtis M. Wong,1020


We have more than 36k articles with an empty authors column. Let's filter out these articles, since they are not useful for answering our question.

Now let's look at authors with low counts

In [5]:
df = df_org.query("authors != ''").copy()
counts_by_author.tail(10)

Unnamed: 0,authors,counts
16268,"Lilly Workneh, Michael McLaughlin, and Meredit...",1
16269,"Lillyanne Daigle, ContributorNetwork Campaigne...",1
16272,"Lily Chen, ContributorSoftware Engineer, Write...",1
16274,"Lily Golightly, ContributorOwner, Golightly Media",1
16275,"Lily Hua, Contributor\nCertified Financial Pla...",1
7385,"Dr. Amy Nunn, Contributor\nAssistant Professor...",1
16277,Lily Karlin and Bill Bradley,1
16278,Lily Karlin and Sam Levine,1
16279,"Lily Kuo, Quartz Africa",1
2341,"Ashford Evans, ContributorSales rep, small bus...",1


Some articles have more than one author: the names are separated by a comma or by "and". In other cases, the author's profession follows the name after a comma. This needs cleaning up.

In [6]:
import re
def clean_authors(x):
    authors = x['authors'].lower()
    clean = re.split(',| and ',authors)
    return [word.strip() for word in clean if word!='']

# Do lower case, splitting and strip. Then explode
df['author'] = df.apply(lambda x: clean_authors(x), axis = 1)
df_exp = df.explode('author')
df_exp.head()

Unnamed: 0,id,category,headline,authors,link,short_description,date,author
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26,melissa jeltsen
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26,andy mcdonald
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26,ron dicker
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26,ron dicker
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26,ron dicker
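
To make the splitting step more concrete, here is a minimal standalone sketch of the same logic applied to a few raw `authors` strings taken from the tables above. The function name `split_authors` is hypothetical, and it differs slightly from `clean_authors` in that it also drops pieces that contain only whitespace.

```python
import re

def split_authors(raw):
    """Lower-case a raw authors string and split it on commas or the word 'and'."""
    parts = re.split(r',| and ', raw.lower())
    # Strip surrounding whitespace and drop pieces that end up empty
    return [part.strip() for part in parts if part.strip()]

for raw in ["Ron Dicker", "Lily Karlin and Sam Levine", "Lily Kuo, Quartz Africa"]:
    print(raw, "->", split_authors(raw))
```

Note that the last string yields an organization ("quartz africa") as if it were an author, which is the kind of entry that still has to be filtered out afterwards.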


Now each article with several authors appears once per author produced by the preprocessing of the authors column.

Now let's check the counts to see whether they make sense.

In [7]:
(df_exp
     .filter(['id','author'])
     .groupby('author')['id']
     .size()
     .reset_index()
     .rename(columns={'id':'counts'})
     .sort_values('counts', ascending = False)
     .head(10))

Unnamed: 0,author,counts
41776,reuters,6519
33454,lee moran,2434
3829,author,2425
8944,contributor\nauthor,2348
13752,contributorauthor,1917
42441,ron dicker,1916
20479,contributorwriter,1712
12931,contributor\nwriter,1592
49139,writer,1572
9750,contributor\ncontributor,1529


On the one hand, the article count for "Lee Moran" has increased from 2423 to 2434, which means the count now also includes articles co-written with other authors. On the other hand, many of the "author" values are really professions or roles rather than names. Let's clean them up.

In [8]:
list_to_exclude = ['author','writer','reuters','contributor','blogger','m.d','ph.d.','ap',
                   'journalist','.com',"'",'"','/']

def authors_to_exclude(row):
    return (not any([word in row['author'] for word in list_to_exclude]))

df_authors = df_exp[df_exp.apply(authors_to_exclude, axis=1)]

In [9]:
(df_authors
     .filter(['id','author'])
     .groupby('author')['id']
     .size()
     .reset_index()
     .rename(columns={'id':'counts'})
     .sort_values('counts', ascending = False)
     .head(10))

Unnamed: 0,author,counts
18905,lee moran,2434
27292,ron dicker,1916
9719,ed mazza,1328
6590,cole delbyck,1146
2074,andy mcdonald,1082
4890,carly ledbetter,1066
16838,julia brucculieri,1063
7404,curtis m. wong,1022
3533,bill bradley,985
13895,igor bobic,975
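
One caveat of the exclusion filter above: `word in row['author']` is a substring test, so a short entry such as `'ap'` also removes any name that merely contains those letters, and the quote entry removes names with an apostrophe. The following is a minimal sketch of a stricter variant, not the filter used in this notebook. The names `SUBSTRINGS`, `WHOLE_TOKENS` and `keep_author` are hypothetical, the word lists are only a subset adapted from `list_to_exclude`, and "sarah kaplan" is a made-up example name.

```python
import re

# Long role words are matched as substrings, since they may be fused ("contributorauthor")
SUBSTRINGS = ('author', 'writer', 'reuters', 'contributor', 'blogger', 'journalist', '.com')
# Short abbreviations are matched only as whole tokens, to avoid hitting real names
WHOLE_TOKENS = {'ap', 'md', 'phd'}

def keep_author(author):
    """Return True if the string looks like a person's name rather than a role."""
    name = author.lower()
    if any(sub in name for sub in SUBSTRINGS):
        return False
    # Remove dots so that "m.d" and "ph.d." become the tokens "md" and "phd"
    tokens = re.findall(r'[a-z]+', name.replace('.', ''))
    return not any(token in WHOLE_TOKENS for token in tokens)

for candidate in ['lee moran', 'contributorauthor', 'ap', 'sarah kaplan']:
    print(candidate, '->', keep_author(candidate))
```

With the substring test from the notebook, "sarah kaplan" would be dropped because it contains "ap"; here it is kept, while the role strings are still removed.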


Now the author names are cleaner. Let's now keep only the authors with more than 30 articles in order to reduce noise.

In [111]:
# Work on an explicit copy so the new column is not assigned to a slice of df_exp
df_authors = df_authors.copy()
df_authors['count_by_author'] = df_authors.groupby('author')['id'].transform('count')
# Keep only the authors with more than 30 articles
df_authors_final = df_authors[df_authors['count_by_author'] > 30]


Let's see how many authors we have left

In [11]:
authors_grouped = df_authors_final[['id','author']].groupby('author')['id'].size().reset_index().sort_values('id', ascending = False)
len(authors_grouped)

872

The number of authors with more than 30 articles is 872.

<a name="Models"></a>
# Models

Now that we have the authors to consider, together with their articles and categories, let's build a representation that allows us to compare authors by the categories they write in. For this purpose, let's build a table that counts the number of articles in each category for each author.

In [12]:
authors_grouped = df_authors_final[['id','category','author']].groupby(['category','author'])['id'].size().reset_index().sort_values('id', ascending = False)
authors_mat = authors_grouped.pivot_table (index='author',
                             columns = 'category',
                             values='id',
                             fill_value = 0)
authors_mat

author,ARTS,ARTS & CULTURE,BLACK VOICES,BUSINESS,COLLEGE,COMEDY,CRIME,CULTURE & ARTS,DIVORCE,EDUCATION,...,TASTE,TECH,THE WORLDPOST,TRAVEL,WEDDINGS,WEIRD NEWS,WELLNESS,WOMEN,WORLD NEWS,WORLDPOST
...,9,0,18,21,3,10,5,6,33,11,...,3,2,2,14,6,1,86,13,7,9
a...,0,0,1,1,0,0,0,0,0,0,...,1,0,0,16,0,0,13,2,0,0
aaron barksdale,0,0,38,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron nemo,0,0,0,0,0,33,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abigail williams,0,1,0,1,4,1,0,0,0,0,...,30,0,0,16,0,0,0,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zach carter,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zahara hill,0,0,75,0,0,2,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
zaki hasan,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zeba blay,0,2,171,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,47,0,0


The above representation captures the categories in which each author writes. Nevertheless, it is very sensitive to the total number of articles written by each author. Let's normalize the counts so that each row becomes the empirical distribution of categories for that author.

In [13]:
np_mat = authors_mat.to_numpy()
sum_of_rows = np_mat.sum(axis=1)
normalized_array = np_mat / sum_of_rows[:, np.newaxis]
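
As a small worked example of this normalization, the sketch below uses plain Python lists so the arithmetic is visible. The counts are made up, the helper name `normalize_rows` is hypothetical, and it assumes every author has at least one article, which holds here because of the earlier filter.

```python
def normalize_rows(counts):
    """Divide each row by its total so that every row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)  # total number of articles for this author
        normalized.append([count / total for count in row])
    return normalized

# Two made-up authors with the same category mix but very different volumes
counts = [
    [30, 10, 0],  # prolific author
    [3, 1, 0],    # occasional author
]
for row in normalize_rows(counts):
    print(row)
```

Both rows become the same distribution, so the two authors are treated as identical in terms of the categories they write in, regardless of how many articles each has published.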

<a name="Spectral-Clustering"></a>
## Spectral Clustering

The model that we are going to use to cluster authors is spectral clustering. It needs a measure of how similar one author's distribution is to another's. For this measure, we are going to use the chi-squared kernel, which is defined as follows:

$$k(x, y) = \exp\left(-\gamma \sum_i \frac{(x_i - y_i)^2}{x_i + y_i}\right)$$

where $\gamma$ is a parameter that controls how quickly the similarity decays as the chi-squared distance between two distributions grows.
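
To illustrate the kernel, here is a minimal sketch that evaluates it for a single pair of distributions in plain Python. The function name `chi2_kernel_pair` and the three example distributions are hypothetical, and the sketch assumes that categories that are empty for both authors (where the denominator is zero) contribute nothing to the sum.

```python
import math

def chi2_kernel_pair(x, y, gamma=1.0):
    """Chi-squared kernel between two non-negative vectors of equal length."""
    total = 0.0
    for xi, yi in zip(x, y):
        denom = xi + yi
        if denom != 0:  # assumption: terms with a zero denominator are skipped
            total += (xi - yi) ** 2 / denom
    return math.exp(-gamma * total)

# Made-up category distributions over three categories
politics_a = [0.75, 0.25, 0.0]
politics_b = [0.5, 0.5, 0.0]
travel = [0.0, 0.0, 1.0]

print(chi2_kernel_pair(politics_a, politics_a))  # identical distributions
print(chi2_kernel_pair(politics_a, politics_b))  # overlapping categories
print(chi2_kernel_pair(politics_a, travel))      # no category in common
```

Identical distributions give a similarity of 1, and the value shrinks toward 0 as the distributions share fewer categories. Increasing `gamma` makes that drop sharper.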

Now let's iterate over some values of the number of clusters (n_clusters) and gamma. For each pair of parameters we evaluate calinski_harabasz_score, which is higher when members of the same cluster are close together and the clusters themselves are far apart.

In [17]:
best_n_clusters = 0
best_gamma = 0
best_score = 0
for gamma in (0.01, 0.1, 1, 10):
    # The affinity matrix depends only on gamma, so compute it once per gamma
    X = chi2_kernel(normalized_array, gamma=gamma)
    for k in (4, 6, 8, 10, 12, 14, 16, 18, 20):
        y_pred = SpectralClustering(n_clusters=k, affinity='precomputed').fit_predict(X)
        score = calinski_harabasz_score(X, y_pred)
        # Keep the parameter pair with the highest score seen so far
        if score > best_score:
            best_score = score
            best_n_clusters = k
            best_gamma = gamma
        print(f"Calinski-Harabasz Score with gamma={gamma}, n_clusters={k}, score:{score}")

Calinski-Harabasz Score with gamma=0.01, n_clusters=4, score:496.2277045078305
Calinski-Harabasz Score with gamma=0.01, n_clusters=6, score:448.5850278853257
Calinski-Harabasz Score with gamma=0.01, n_clusters=8, score:406.04576119981175
Calinski-Harabasz Score with gamma=0.01, n_clusters=10, score:340.9890370367538
Calinski-Harabasz Score with gamma=0.01, n_clusters=12, score:292.8400189817797
Calinski-Harabasz Score with gamma=0.01, n_clusters=14, score:245.90938219300844
Calinski-Harabasz Score with gamma=0.01, n_clusters=16, score:226.39000654870463
Calinski-Harabasz Score with gamma=0.01, n_clusters=18, score:206.66559987777507
Calinski-Harabasz Score with gamma=0.01, n_clusters=20, score:184.33113788887707
Calinski-Harabasz Score with gamma=0.1, n_clusters=4, score:505.10650437964415
Calinski-Harabasz Score with gamma=0.1, n_clusters=6, score:409.4775837121525
Calinski-Harabasz Score with gamma=0.1, n_clusters=8, score:411.5533013576192
Calinski-Harabasz Score with gamma=0.1, n_c
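
To make the Calinski-Harabasz score used above more concrete, the following sketch computes it by hand on a toy dataset. It assumes the standard definition of the score, the ratio of between-cluster dispersion to within-cluster dispersion with each term divided by its degrees of freedom. The function name `calinski_harabasz` and the four points are made up for illustration.

```python
def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)

    def centroid(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dims)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for cluster in clusters:
        members = [p for p, label in zip(points, labels) if label == cluster]
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Two tight groups of points that lie far apart from each other
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # labels follow the two groups
print(calinski_harabasz(points, [0, 1, 0, 1]))  # labels cut across the groups
```

The labeling that follows the two groups scores far higher than the one that cuts across them, which is why the grid search keeps the parameter pair with the largest score.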

Let's create a model with the parameters that correspond to the highest calinski_harabasz_score.

In [30]:
X = chi2_kernel(normalized_array, gamma=best_gamma)
X.shape
model = SpectralClustering(random_state=0, n_clusters=best_n_clusters,
                           affinity='precomputed'
                           ).fit(X)
labels = model.labels_
clusters = pd.concat([authors_mat.reset_index(),pd.DataFrame({'label':labels})], axis = 1)
clusters_tall = clusters.melt(id_vars=['author','label'])
clusters_tall.index.rename('id', inplace=True)
clusters_tall.rename(columns={'variable':'category'}, inplace=True)
clusters_tall

id,author,label,category,value
0,...,3,ARTS,9
1,a...,3,ARTS,0
2,aaron barksdale,3,ARTS,0
3,aaron nemo,3,ARTS,0
4,abigail williams,3,ARTS,0
...,...,...,...,...
35747,zach carter,0,WORLDPOST,0
35748,zahara hill,3,WORLDPOST,0
35749,zaki hasan,3,WORLDPOST,0
35750,zeba blay,3,WORLDPOST,0


Plot the number of articles by category in each of the clusters that were found

In [31]:
plot_freq_x_context(clusters_tall.query('value!=0'), class_col_name='label', tok_col_name='category', 
                    n=20, n_cols=2, width=250, height=200)

Plot the number of authors by cluster

In [99]:
final_clusters = clusters_tall.groupby(['author','label'])['category'].apply(','.join).reset_index()
final_clusters = final_clusters.groupby('label').agg({'label':'count'})
final_clusters = final_clusters.rename(columns={'label':'count'})
alt.Chart(final_clusters.reset_index()).mark_bar().encode(
    y=alt.Y('count', title='Count'),
    x=alt.X('label:N', sort='-y', title='Cluster Label'),
    color='label:N'
).properties(
    width=250,
    height=250,
    title='[Interactive] Distribution of Authors by Cluster'
)

<a name="Conclusions"></a>
# Conclusions

The main conclusion is that it is possible to find patterns that allow us to cluster authors according to the categories they write in. In particular, we found 4 clusters, labeled 0 to 3, with the following characteristics:

- Cluster 0: cluster with authors writing mainly about politics.
- Cluster 1: cluster with authors writing about wellness, healthy living and parenting.
- Cluster 2: cluster with authors writing almost only about travelling. This is the smallest cluster.
- Cluster 3: this is definitely the largest cluster of all, with authors who do not seem to fall into any of the other clusters.