# Clustering News Headlines

In this notebook we import a collection of news articles, examine its contents, and explore which clustering approach best identifies the articles most closely related to each other.

We start by installing the required libraries and importing dependencies.

In [0]:
%%capture
# Install the latest Tensorflow version.
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn

In [1]:
import os

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

import heapq
import operator

from math import sqrt
from joblib import Parallel, delayed


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/erusson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Importing Data from Local File System

The datasets are available on Kaggle: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Download all three CSV files and store them in a `data` directory at the location of your choice, then adjust the paths in the cell below to match.


In [0]:
articles_3 = pd.read_csv('w266/finalproj/data/articles3.csv')
articles_2 = pd.read_csv('w266/finalproj/data/articles2.csv')
articles_1 = pd.read_csv('w266/finalproj/data/articles1.csv')

In [0]:
articles_3.head()

In [0]:
articles_2.head()

In [0]:
articles_1.head()

In [0]:
# publications in articles_3.csv
set(articles_3.publication)

In [0]:
# publications in articles_2.csv
set(articles_2.publication)

In [0]:
# publications in articles_1.csv
set(articles_1.publication)

In [3]:
# join all datasets into one
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

full_data.shape

(142570, 10)

In [4]:
full_data.head()

| | Unnamed: 0 | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | | WASHINGTON — Congressional Republicans have... |
| 1 | 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | | After the bullet shells get counted, the blood... |
| 2 | 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | | Death may be the great equalizer, but it isn’t... |
| 4 | 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | | SEOUL, South Korea — North Korea’s leader, ... |


In [0]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [0]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [7]:
# sample 13k observations from full_data (set seed to 5)
small_data = full_data.sample(n=13000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [8]:
small_data.shape

(13000, 11)

In [9]:
small_data.head()

| | index | Unnamed: 0 | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74496 | 77946 | 118473 | Chaos in the Family, Chaos in the State: The W... | National Review | Kevin D. Williamson | 2016-03-17 | 2016.0 | 3.0 | http://www.nationalreview.com/article/432876/d... | Michael Brendan Dougherty is bitter. I think t... |
| 1 | 71184 | 74592 | 113594 | US Civil Rights Commission Will Observe Stand... | Buzzfeed News | Nidhi Subbaraman | 2016-12-08 | 2016.0 | 12.0 | https://web.archive.org/web/20161208153906/htt... | WASHINGTON — The US Commission on Civil Ri... |
| 2 | 120205 | 123668 | 184574 | Venezuela hunts rogue helicopter attackers, Ma... | Reuters | Andrew Cawthorne and Victoria Ramirez | 2017-06-29 | 2017.0 | 6.0 | http://www.reuters.com/article/us-venezuela-po... | The Venezuelan government hunted on Wednesday... |
| 3 | 128977 | 132440 | 199665 | Fruit juice isn’t much better for you than sod... | Vox | Julia Belluz | 2016/3/25 | 2016.0 | 3.0 | http://www.vox.com/2016/3/25/11305614/soda-jui... | One of the biggest public health wins of rece... |
| 4 | 134837 | 138300 | 208223 | Sessions won’t testify at congressional budget... | Washington Post | Sari Horwitz | 2017-06-10 | 2017.0 | 6.0 | https://web.archive.org/web/20170611000758/htt... | Attorney General Jeff Sessions, who had agree... |


In [10]:
# remove publisher tags from article titles 

def clean_title(title):
  title = re.sub(r"(- Breitbart)(?!.*\1)", '', title)
  title = re.sub(r'(- The New York Times)(?!.*\1)', '', title)
  return title
  
just_titles = Parallel(n_jobs=16)(delayed(clean_title)(title) for title in small_data.title)

just_titles[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
 ' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
 'Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
 'Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.',
 'Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead']
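
The pattern `(- Breitbart)(?!.*\1)` matches the tag only when no later copy of it follows, so only the last occurrence is removed. The sketch below illustrates that behavior on made-up titles (not taken from the dataset). The helper `strip_publisher_tag` is hypothetical, and it adds a `.strip()` that the original `clean_title` does not apply.

```python
import re

def strip_publisher_tag(title, tag):
    # Hypothetical helper: remove only the last occurrence of "- <tag>".
    # The negative lookahead rejects a match if the same text appears again later.
    pattern = r"(- " + re.escape(tag) + r")(?!.*\1)"
    return re.sub(pattern, "", title).strip()

# illustrative titles, not from the dataset
examples = [
    "Senate Votes on Budget - Breitbart",
    "Review: A New Novel - The New York Times",
    "Markets rally after jobs report",
]

cleaned = [
    strip_publisher_tag(strip_publisher_tag(t, "Breitbart"), "The New York Times")
    for t in examples
]
print(cleaned)

# a tag that appears twice loses only its final copy
twice = strip_publisher_tag("A - Breitbart B - Breitbart", "Breitbart")
print(twice)
```

Stripping whitespace here would also remove the leading space visible in some of the cleaned titles above.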

In [11]:
len(just_titles)

13000

In [12]:
# store ids of titles 
id_list = small_data.copy().id

len(id_list)

13000

In [13]:
# join with content of article

small_text = [title + " " + content for title, content in zip(just_titles, small_data.content)]

small_text[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction Michael Brendan Dougherty is bitter. I think that I can write that in both truth and charity. (I think you might even say that he and I are friends.) Dougherty is a conservative of the sort sometimes advertised as “paleo” and served as national correspondent for The American Conservative. Like many conservative writers with those associations, Dougherty spends a great deal of time lambasting the conservative movement and its organs, from which he feels, for whatever reason, estranged  —   an alienation that carries with it more than a little to suggest that it is somewhat personal. You know: Them. Donald Trump is the headline, and explaining the benighted white working class to Them is the main matter. Sanctimony is the literary mode, for Dougherty and for many others doing the same work with less literary facility. Never mind the petty sneering (as though the conservative movement were populated by septua

In [14]:
len(small_text)

13000

## Clustering with tf-idf, k-means, and k-nn


*   create tf-idf features
*   cluster with k-means
*   predict with k-nn
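
The three steps can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's configuration: the documents are made up, the vectorizer uses default settings, and `n_neighbors=1` is chosen only because the corpus has four rows.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# toy corpus (made up for illustration): two topics with disjoint vocabulary
docs = [
    "senate votes on budget bill",
    "senate budget vote delayed",
    "team wins championship game",
    "championship game goes to overtime",
]

# 1. tf-idf features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# 2. k-means on the tf-idf matrix
#    (n_init is set explicitly because its default differs across scikit-learn versions)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# 3. k-nn trained on the cluster labels, then applied to unseen text
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
pred = knn.predict(vectorizer.transform(["budget vote in the senate"]))

print(labels, pred)
```

The k-nn step is what lets a new headline be assigned to an existing cluster without refitting k-means.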




### Create Matrix of TF-IDF Features

In [0]:
# build the stop-word set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = SnowballStemmer("english")

def preprocess_text(text):

  # replace punctuation and underscores with spaces, lowercase, and tokenize
  tokens = re.sub(r'[\W_]', ' ', text.lower()).split()

  # remove stop words and stem the remaining tokens
  # (a list comprehension avoids dispatching joblib workers for every token)
  return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]


In [0]:
# instantiate tfidf vectorizer 
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', use_idf=True, tokenizer=preprocess_text, ngram_range=(1,3))

In [0]:
# fit small_text to vectorizer (fit and transform)
tfidf_matrix = tfidf_vectorizer.fit_transform(small_text)
print(tfidf_matrix.shape)
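
To make the weighting concrete, here is a small pure-Python sketch of tf-idf with the smoothed idf and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default. The function name `tfidf` and the three token lists are illustrative assumptions; it ignores n-grams and `max_features`.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# made-up, already tokenized documents
docs = [["clinton", "email"], ["clinton", "benghazi"], ["march", "women"]]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "email" receives a higher weight than "clinton" because "clinton" appears in two of the three documents.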

In [0]:
# reduce memory 
tfidf_matrix = tfidf_matrix.astype(np.float32)

### Run K-Means Clustering 

In [0]:
# set 25 clusters
num_clusters = 25

# instantiate k-means (set seed to 10)
km = KMeans(n_clusters=num_clusters, random_state=10)

# fit tf-idf features to k-means
km.fit(tfidf_matrix)

# list of cluster assignments 
clusters = km.labels_.tolist()

In [0]:
# cosine similarities for each row with respective cluster center
def cluster_sim(i, label):
 return cosine_similarity(tfidf_matrix[i].toarray(), km.cluster_centers_[label].reshape(1, -1))[0][0]

cos_sim = Parallel(n_jobs=16)(delayed(cluster_sim)(i, label) for i,label in enumerate(km.labels_))
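
The same per-row similarity can be computed without a loop. The sketch below assumes dense NumPy arrays (the sparse `tfidf_matrix` would first need converting, for example with `.toarray()`), and the helper name and toy values are made up for illustration.

```python
import numpy as np

def similarity_to_assigned_center(X, centers, labels):
    """Cosine similarity between each row of X and the center of its cluster."""
    assigned = centers[labels]  # one center per row, selected by label
    dots = np.sum(X * assigned, axis=1)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(assigned, axis=1)
    return dots / norms

# toy data: three samples, two cluster centers
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1, 0])

sims = similarity_to_assigned_center(X, centers, labels)
print(sims)
```

A value near 1 means the sample points in almost the same direction as its cluster center, which is why the highest values are used to pick representative titles.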

In [0]:
# sum of squared distances of samples to their closest cluster center.
km.inertia_
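
As a reference for what this number measures, the following sketch recomputes inertia by hand on toy data with NumPy. The points and centers are made up, and the function is an illustration rather than scikit-learn's implementation.

```python
import numpy as np

def inertia(X, centers):
    # squared Euclidean distance from every sample to every center
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # keep the distance to the closest center and sum over samples
    return sq_dists.min(axis=1).sum()

# four points, each exactly 1 unit from its nearest center
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(X, centers))
```

Lower inertia means tighter clusters, but it always decreases as the number of clusters grows, so it is most useful for comparing runs with the same `num_clusters`.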

In [0]:
# dataframe with title, cluster, and cosine similarity 
title_data = pd.DataFrame({'title':small_data.title, 'cluster':clusters, 'cos_sim':cos_sim})

title_data.shape

In [0]:
title_data.head()

In [0]:
# inspect titles with highest cosine similarity in clusters 
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i,ind in top_5.index:
  print("cluster", i)
  print(just_titles[ind])
  print('-------------')
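
The loop above works because `groupby(...).nlargest(...)` returns a Series whose index holds (group key, original row label) pairs. A small sketch on made-up values shows the shape of that index:

```python
import pandas as pd

# toy frame standing in for title_data
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "cos_sim": [0.9, 0.5, 0.7, 0.2, 0.8],
})

top_2 = toy.groupby("cluster")["cos_sim"].nlargest(2)

# each index entry is a (cluster label, original row label) pair
for cluster, row in top_2.index:
    print(cluster, row, top_2.loc[(cluster, row)])
```

The second element of each pair is the row label in the original frame, which is what makes `just_titles[ind]` a valid lookup when the frame has a default integer index.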

### Predict Cluster of Input

In [0]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

In [0]:
# run tfidf vectorizer on input (transform)
tfidf_input = tfidf_vectorizer.transform(input_topic)

# function to make tfidf features into list for k-nn 
def make_list(features, i): 
  return features.getrow(i).toarray()[0].tolist()

# make list of tfidf_input
tfidf_input = Parallel(n_jobs=16)(delayed(make_list)(tfidf_input, i) for i in range(tfidf_input.shape[0]))

In [0]:
# make list of tfidf_matrix
tfidf_list = Parallel(n_jobs=16)(delayed(make_list)(tfidf_matrix, i) for i in range(tfidf_matrix.shape[0]))


In [0]:
np.random.seed(5)
X_train = tfidf_list
y_train = title_data.cluster
X_test  = tfidf_input

# create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train) 

res = knn.predict(X_test)

In [0]:
print(res)
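
`KNeighborsClassifier()` with default arguments votes among the 5 nearest training points using Euclidean distance. The pure-Python sketch below illustrates that voting rule on toy 2-D points. The function `knn_predict` is hypothetical, and its tie handling is arbitrary and may differ from scikit-learn's.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    # Euclidean distance from the query to every training point
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # labels of the k closest points
    nearest = [label for _, label in sorted(dists, key=lambda pair: pair[0])[:k]]
    # majority vote among those labels
    return Counter(nearest).most_common(1)[0][0]

# toy training set: two well-separated groups with made-up cluster labels
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (0.9, 1.0), (1.0, 0.8)]
train_y = [8, 8, 8, 11, 11, 11]

near_first = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
near_second = knn_predict(train_X, train_y, (0.9, 0.9), k=3)
print(near_first, near_second)
```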

In [0]:
for i,ind in top_5.index:
  if i in (8, 11):
    print("cluster", i)
    print(just_titles[ind])
    print('-------------')

# inputs: ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

## Clustering with the Universal Sentence Encoder 

Run separately from the tf-idf section.

### Create Embeddings on Training Data

In [0]:
# load the Universal Sentence Encoder's TF Hub module

#module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
#model = hub.load(module_url)
#print ("module %s loaded" % module_url)

# download model from https://tfhub.dev/google/universal-sentence-encoder/4 and save locally 
model = hub.load("w266/finalproj/tmp")


In [0]:
# reduce logging output
logging.set_verbosity(logging.ERROR)

# compute a representation for each article
small_embeddings = model(small_text)

In [0]:
# semantic similarity of two sentences can be trivially computed as the inner product of the encodings
corr = np.inner(small_embeddings, small_embeddings)
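
The inner product equals cosine similarity only when the vectors have unit length. Universal Sentence Encoder outputs are reported to be approximately normalized, which is consistent with the 1.0 entries on the diagonal of `corr_df`. The sketch below checks the identity on random stand-in vectors, which are not real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for sentence embeddings: random vectors, NOT real encoder output
vectors = rng.normal(size=(4, 8))

# scale every row to unit length, then take inner products
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
inner = np.inner(unit, unit)

# cosine similarity computed directly from the unnormalized vectors
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

print(np.allclose(inner, cosine))
```

If the embeddings were far from unit length, dividing by the norms as in the `cosine` line would be the safer choice.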

In [18]:
# data frame of titles and semantic similarities
corr_df = pd.DataFrame(corr)
corr_df.columns = just_titles
corr_df.index = just_titles

corr_df.head()

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,"McDonald’s, Chick-fil-A, and Subway are making changes to profit off the shifting definition of ’healthy’",Bernie says the AP twisted his ‘messy’ convention prediction,Nevada’s Joe Heck Aims to Keep the Senate Republican,Donald Trump: John McCain and Lindsey Graham ‘Looking to Start World War III’,"Radiohead Announces New Album, Hear Another New Song",Roger Federer and Rafael Nadal roll back years for unexpected encore,Why voters like She’s Not Trump,Advanced Placement Tests To Hide History of Religion and Islamic Jihad in Europe,Little Simz,Wilhelmina to feature transgender model this fall
"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",1.0,0.291457,0.336936,0.256971,0.181461,0.282036,0.389552,0.316919,0.289691,0.295462,...,0.217885,0.347041,0.459125,0.317545,0.065338,0.159626,0.458244,0.456367,0.228482,0.326126
US Civil Rights Commission Will Observe Standing Rock Standoff,0.291457,1.0,0.543982,0.10907,0.411541,0.422529,0.224046,0.27175,0.141546,0.381025,...,0.059385,0.298766,0.349619,0.275823,0.209703,0.112333,0.26849,0.271677,0.186528,0.228896
"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",0.336936,0.543982,1.0,0.106961,0.405647,0.540752,0.184615,0.287032,0.161188,0.340552,...,0.0847,0.344075,0.400919,0.310918,0.255471,0.267033,0.259199,0.362642,0.26564,0.168313
Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,0.256971,0.10907,0.106961,1.0,0.052951,0.109257,0.242693,0.108912,0.14604,0.155383,...,0.344664,0.227382,0.186247,0.101332,0.064881,0.121565,0.248952,0.207817,0.171043,0.204495
Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,0.181461,0.411541,0.405647,0.052951,1.0,0.384612,0.176325,0.368784,0.109351,0.225344,...,0.033626,0.316896,0.351867,0.394767,0.222523,0.173553,0.287082,0.243409,0.216823,0.191981


In [44]:
# function to return the column index of the top five values in a row of a dataframe
def find_top5ind(df, i):
  return list(list(zip(*heapq.nlargest(5, enumerate(df.iloc[i,:]), key=operator.itemgetter(1))))[0])

# function to return the elements of a list at the given indices
def find_top5(lst, ind):
  return [lst[i] for i in ind]

# find index of 5 most similar titles 
top5_ind = Parallel(n_jobs=16)(delayed(find_top5ind)(corr_df, i) for i in range(len(corr_df)))

# show most similar titles -- sanity check
top5 = Parallel(n_jobs=16)(delayed(find_top5)(just_titles, ind) for ind in top5_ind)

top5[:5]

[['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
  'After Trump, conservatives should stop longing for the past — and learn a little humility',
  'The alt-right is more than warmed-over white supremacy. It’s that, but way way weirder.',
  'The Return of ‘Street Corner Conservatism’',
  'Liberals should get behind marriage (Opinion)'],
 [' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
  ' Cory Booker Calls For Federal Investigation Into Police Tactics At Dakota Access\xa0Pipeline',
  'Army will close Dakota pipeline protesters’ campsite, Sioux leader\xa0says',
  'Dakota Access Pipeline protest site is cleared',
  'Protesters, Police Still Clashing Over Disputed North Dakota Pipeline'],
 ['Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
  'Venezuelan opposition blocks streets to protest Maduro power shakeup',
  'Protester dies, minister sacked after Paraguay re-election vote',
  'Venezuela Erupts In ’Mother Of 
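
Because every article is compared with itself, the first entry of each group above is the article's own title. A minimal sketch of the same `heapq.nlargest` selection, with a hypothetical `skip` argument that drops the self-match, using made-up similarity values:

```python
import heapq
from operator import itemgetter

def top_k_indices(scores, k, skip=None):
    # pair each score with its position, optionally dropping one position
    candidates = ((i, s) for i, s in enumerate(scores) if i != skip)
    # nlargest tracks only the k best items instead of sorting everything
    return [i for i, _ in heapq.nlargest(k, candidates, key=itemgetter(1))]

row = [1.0, 0.29, 0.54, 0.11, 0.41]  # similarities of item 0 to all items

with_self = top_k_indices(row, 2)             # includes item 0 itself
without_self = top_k_indices(row, 2, skip=0)  # excludes the self-match
print(with_self, without_self)
```

For user-supplied inputs that are not part of the corpus, there is no self-match to remove.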

### Assign Cluster to Input

In [0]:
# examples of user given inputs
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

# compute a representation for each input
input_embeddings = model(input_topic)

In [0]:
# semantic similarity between inputs and training articles
inp_corr = np.inner(input_embeddings, small_embeddings)

In [22]:
# data frame of titles and semantic similarities
inp_df = pd.DataFrame(inp_corr)
inp_df.columns = just_titles
inp_df.index = input_topic

inp_df

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,"McDonald’s, Chick-fil-A, and Subway are making changes to profit off the shifting definition of ’healthy’",Bernie says the AP twisted his ‘messy’ convention prediction,Nevada’s Joe Heck Aims to Keep the Senate Republican,Donald Trump: John McCain and Lindsey Graham ‘Looking to Start World War III’,"Radiohead Announces New Album, Hear Another New Song",Roger Federer and Rafael Nadal roll back years for unexpected encore,Why voters like She’s Not Trump,Advanced Placement Tests To Hide History of Religion and Islamic Jihad in Europe,Little Simz,Wilhelmina to feature transgender model this fall
Hillary Clinton defends handling of Benghazi attack,0.019261,0.135737,0.21394,-0.040233,0.270479,0.242178,-0.059264,0.235375,0.039634,0.098044,...,0.077255,0.288595,0.195516,0.280349,0.027464,0.020349,0.281599,0.114124,0.017507,-0.070851
Women's March Highlights,0.045296,0.119294,-0.000519,0.065587,0.049757,-0.028455,-0.019417,0.070615,0.004875,0.16622,...,0.006225,0.1285,0.060778,0.00033,0.051476,0.025595,0.12345,0.030094,0.040689,0.124655
Hillary Clinton emails,0.036642,0.032883,0.058746,0.017771,0.199208,0.109163,-0.04811,0.151369,0.013369,0.053923,...,0.073164,0.322738,0.131154,0.256044,0.063897,0.047353,0.367134,-0.036238,0.014079,0.013729


In [45]:
# find index of 5 most similar titles 
top5_ind = Parallel(n_jobs=16)(delayed(find_top5ind)(inp_df, i) for i in range(len(inp_df)))

# show most similar titles -- final clusters
top5 = Parallel(n_jobs=16)(delayed(find_top5)(just_titles, ind) for ind in top5_ind)

top5[:5]
# topics: ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

[['Benghazi may be over, but #Benghazi has a life of its own',
  'Univision’s Ramos picked the wrong Benghazi question for Hillary Clinton',
  'Donald Trump on Benghazi: ’Hillary Clinton Decided to Go Home and Sleep’ ',
  'The GOP stoops for scandal',
  'Exclusive - Sarah Palin: Administration Lies, Soldiers Die. Yes, Hillary, at Every Point ‘It Matters’ '],
 ['The Women’s March on Washington is becoming a\xa0joke',
  ' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March',
  'Women will march on – What would a feminist do? podcast',
  'The Exhausting Work of Tallying America’s Largest Protest',
  'Women’s March organizers prepare for hundreds of thousands of protesters'],
 ['Donald Trump: ’Good Job, Huma. Thank you, Anthony Weiner’ ',
  'Fox News anchor grills Hillary Clinton on her email scandal',
  'Hillary Clinton’s email problems might be even worse than we thought',
  'Hillary Clinton pal Neera Tanden’s greatest hits from WikiLeaks emails',
  'Hillary Clinton Del

In [47]:
# index of most similar titles 
top5id = Parallel(n_jobs=16)(delayed(find_top5)(id_list, ind) for ind in top5_ind)

top5id[:5]

[[152203, 210482, 47436, 213302, 39123],
 [122748, 103164, 149141, 73686, 147468],
 [40299, 68550, 213322, 86663, 44841]]

In [0]:
# cleaned text for clusters 
top5text = Parallel(n_jobs=16)(delayed(find_top5)(small_text, ind) for ind in top5_ind)

top5text[:5]

### Output Clusters

In [0]:
# first example: "Hillary Clinton defends handling of Benghazi attack"
cluster1 = small_data[small_data.id.isin(top5id[0])]
text1_df = pd.DataFrame({'id':top5id[0], 'text':top5text[0]})
cluster1 = cluster1.merge(text1_df, on='id')

# second example: "Women's March Highlights"
cluster2 = small_data[small_data.id.isin(top5id[1])]
text2_df = pd.DataFrame({'id':top5id[1], 'text':top5text[1]})
cluster2 = cluster2.merge(text2_df, on='id')

# third example: "Hillary Clinton emails"
cluster3 = small_data[small_data.id.isin(top5id[2])]
text3_df = pd.DataFrame({'id':top5id[2], 'text':top5text[2]})
cluster3 = cluster3.merge(text3_df, on='id')

In [0]:
# save each cluster data to csv
cluster1.to_csv("w266/finalproj/clusters/cluster1.csv", index=False)
cluster2.to_csv("w266/finalproj/clusters/cluster2.csv", index=False)
cluster3.to_csv("w266/finalproj/clusters/cluster3.csv", index=False)

#### Sanity Check: Do Individual and Batched Inputs Produce Different Embeddings? (No)

In [0]:
a = model(["hi my name is emma"])
b = model(["nice to meet you"])
c = model(["hi what's your name"])

In [0]:
d = model(["hi my name is emma", "nice to meet you", "hi what's your name"])

In [0]:
e = np.concatenate((a, b, c))

In [0]:
f = np.array(d)

In [0]:
np.inner(e, e)

In [0]:
np.inner(f, f)
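
Rather than comparing the two matrices by eye, the check can be made numeric with `np.allclose`. The sketch below uses a hypothetical `toy_encoder` in place of `model` so that it runs without the TF Hub module. With the real model, a small tolerance allows for floating-point differences between batch sizes.

```python
import numpy as np

def toy_encoder(sentences):
    # Hypothetical stand-in for `model`: a deterministic vector per sentence
    # built from character codes, so each row depends only on its own sentence.
    rows = []
    for s in sentences:
        vec = np.zeros(16)
        for pos, ch in enumerate(s):
            vec[pos % 16] += ord(ch)
        rows.append(vec / np.linalg.norm(vec))
    return np.array(rows)

sentences = ["hi my name is emma", "nice to meet you", "hi what's your name"]

# encode one sentence at a time, then all three in a single call
one_by_one = np.concatenate([toy_encoder([s]) for s in sentences])
batched = toy_encoder(sentences)

# compare the two similarity matrices numerically
same = np.allclose(np.inner(one_by_one, one_by_one), np.inner(batched, batched), atol=1e-6)
print(same)
```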