<a href="https://colab.research.google.com/github/s-miramontes/News_Filter/blob/emma/Pilot/Data%20Cleanse%20and%20Clustering%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering News Headlines

In this notebook we begin by importing the data and inspecting its contents, so that we can choose a clustering algorithm that groups the articles most closely related to each other.

We start by importing some dependencies.

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import re
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/erusson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
%%capture
# Install the latest Tensorflow version.
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn

In [0]:
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

import os

import heapq
import operator

## Importing Data from Local File System

Datasets are located here: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Proceed to download all three 'csv' files, and store them in a 'data' directory at the location of your choice.
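As a minimal sketch of the loading step, assuming the three files keep their `articles*.csv` names inside the chosen directory, the helper below reads and concatenates them. `load_articles` and its `data_dir` argument are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_articles(data_dir):
    """Read every articles*.csv in data_dir into one DataFrame (hypothetical helper)."""
    paths = sorted(Path(data_dir).glob("articles*.csv"))
    if not paths:
        raise FileNotFoundError(f"no articles*.csv files found in {data_dir}")
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

Calling `load_articles("data")` would then stand in for the three separate `pd.read_csv` calls, under the assumption that the directory is named `data`.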


*Note: It might be better to import the data from the cloud (Google Drive) rather than from the user's local file system, mainly for easy reproducibility. Let's leave it like this for now.*

Alternatively, we can mount Google Drive and import the datasets from files already stored in the cloud.

In [0]:
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials


In [0]:
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

For every file available we have a shared link:

CSV File: 'Articles 3':
https://drive.google.com/a/berkeley.edu/file/d/1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10/view?usp=sharing

CSV File: 'Articles 2':
https://drive.google.com/a/berkeley.edu/file/d/1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3/view?usp=sharing

CSV File: 'Articles 1':
https://drive.google.com/a/berkeley.edu/file/d/1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU/view?usp=sharing


For each of the links above, we use the file ID it contains:

- Articles 1 File ID: '1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU'
- Articles 2 File ID: '1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3'
- Articles 3 File ID: '1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10'
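In the share links above, the file ID is the path segment that follows `/file/d/`. A small sketch of pulling it out with a regular expression; `drive_file_id` is a hypothetical helper name:

```python
import re


def drive_file_id(share_url):
    """Return the ID segment that follows /file/d/ in a Drive share link."""
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"no file ID found in {share_url!r}")
    return match.group(1)


url = ("https://drive.google.com/a/berkeley.edu/file/d/"
       "1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10/view?usp=sharing")
print(drive_file_id(url))  # 1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10
```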


In [0]:
# downloaded3 = drive.CreateFile({'id':"1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10"})   # replace the id with id of file you want to access
# downloaded3.GetContentFile('articles3.csv.zip')        # replace the file name with your file

# downloaded2 = drive.CreateFile({'id': "1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3"})
# downloaded2.GetContentFile('articles2.csv.zip')

# downloaded1 = drive.CreateFile({'id': "1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU"})
# downloaded1.GetContentFile('articles1.csv.zip')

In [0]:
articles_3 = pd.read_csv('w266/articles3.csv')
articles_2 = pd.read_csv('w266/articles2.csv')
articles_1 = pd.read_csv('w266/articles1.csv')

In [0]:
articles_3.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


In [0]:
articles_2.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
1,53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
2,53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
3,53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
4,53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [0]:
articles_1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [0]:
set(articles_3.publication)

{'Guardian', 'NPR', 'Reuters', 'Vox', 'Washington Post'}

In [0]:
set(articles_2.publication)

{'Atlantic',
 'Buzzfeed News',
 'Fox News',
 'Guardian',
 'National Review',
 'New York Post',
 'Talking Points Memo'}

In [0]:
set(articles_1.publication)

{'Atlantic', 'Breitbart', 'Business Insider', 'CNN', 'New York Times'}

In [0]:
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

In [5]:
full_data

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."
...,...,...,...,...,...,...,...,...,...,...
142565,146028,218078,An eavesdropping Uber driver saved his 16-year...,Washington Post,Avi Selk,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Uber driver Keith Avila picked up a p...
142566,146029,218079,Plane carrying six people returning from a Cav...,Washington Post,Sarah Larimer,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Crews on Friday continued to search L...
142567,146030,218080,After helping a fraction of homeowners expecte...,Washington Post,Renae Merle,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,When the Obama administration announced a...
142568,146031,218081,"Yes, this is real: Michigan just banned bannin...",Washington Post,Chelsea Harvey,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,This story has been updated. A new law in...


In [0]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [0]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [8]:
# sample from full_data (set seed to 5)
small_data = full_data.sample(n=10000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [9]:
small_data

Unnamed: 0.1,index,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,74496,77946,118473,"Chaos in the Family, Chaos in the State: The W...",National Review,Kevin D. Williamson,2016-03-17,2016.0,3.0,http://www.nationalreview.com/article/432876/d...,Michael Brendan Dougherty is bitter. I think t...
1,71184,74592,113594,US Civil Rights Commission Will Observe Stand...,Buzzfeed News,Nidhi Subbaraman,2016-12-08,2016.0,12.0,https://web.archive.org/web/20161208153906/htt...,WASHINGTON — The US Commission on Civil Ri...
2,120205,123668,184574,"Venezuela hunts rogue helicopter attackers, Ma...",Reuters,Andrew Cawthorne and Victoria Ramirez,2017-06-29,2017.0,6.0,http://www.reuters.com/article/us-venezuela-po...,The Venezuelan government hunted on Wednesday...
3,128977,132440,199665,Fruit juice isn’t much better for you than sod...,Vox,Julia Belluz,2016/3/25,2016.0,3.0,http://www.vox.com/2016/3/25/11305614/soda-jui...,One of the biggest public health wins of rece...
4,134837,138300,208223,Sessions won’t testify at congressional budget...,Washington Post,Sari Horwitz,2017-06-10,2017.0,6.0,https://web.archive.org/web/20170611000758/htt...,"Attorney General Jeff Sessions, who had agree..."
...,...,...,...,...,...,...,...,...,...,...,...
9995,137105,140568,211140,Patient secretly recorded doctors as they oper...,Washington Post,Yanan Wang,2016-04-07,2016.0,4.0,https://web.archive.org/web/20160408000201/htt...,"Last summer, Ethel Easter wanted nothin..."
9996,60293,63612,86308,Fox News Poll: Clinton edges Trump by two poin...,Fox News,Dana Blanton,2016-10-07,2016.0,10.0,https://web.archive.org/web/20161008002456/htt...,Third party candidates Gary Johnson (6 percent...
9997,55079,58389,80070,The Atlantic Politics & Policy Daily: Trump’s...,Atlantic,Elaine Godfrey,2016-08-31,2016.0,8.0,,For us to continue writing great stori...
9998,110248,113711,168509,A Son In Chains. A Depressed Mom. Here’s What ...,NPR,Nurith Aizenman,2016-04-15,2016.0,4.0,http://www.npr.org/sections/goatsandsoda/2016/...,It was a hospital — but to psychologist Ink...


In [10]:
# remove publisher tags from article titles 

just_titles = []

for title in small_data.title:
  title = re.sub(r"(- Breitbart)(?!.*\1)", '', title)
  title = re.sub(r'(- The New York Times)(?!.*\1)', '', title)
  just_titles.append(title)

just_titles[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
 ' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
 'Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
 'Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.',
 'Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead']
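The negative lookahead `(?!.*\1)` in the patterns above rejects a match whenever the same tag text appears again later in the string, so only the last occurrence is removed. The sketch below generalizes this to any tag and also strips surrounding whitespace, which the loop above does not do (note the leading space in the second title). `strip_trailing_tag` and the sample title are made up for illustration.

```python
import re


def strip_trailing_tag(title, tag):
    """Remove the last '- <tag>' occurrence from title (hypothetical helper)."""
    # \1 refers back to the captured '- <tag>' text; the lookahead fails
    # if that text occurs again further right, so earlier copies are kept.
    pattern = r"\s*(- " + re.escape(tag) + r")(?!.*\1)"
    return re.sub(pattern, "", title).strip()


print(strip_trailing_tag("Some headline - Breitbart", "Breitbart"))  # Some headline
```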

In [11]:
len(just_titles)

10000

In [12]:
# join with content of article

small_text = [title + " " + content for title, content in zip(just_titles, small_data.content)]

small_text[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction Michael Brendan Dougherty is bitter. I think that I can write that in both truth and charity. (I think you might even say that he and I are friends.) Dougherty is a conservative of the sort sometimes advertised as “paleo” and served as national correspondent for The American Conservative. Like many conservative writers with those associations, Dougherty spends a great deal of time lambasting the conservative movement and its organs, from which he feels, for whatever reason, estranged  —   an alienation that carries with it more than a little to suggest that it is somewhat personal. You know: Them. Donald Trump is the headline, and explaining the benighted white working class to Them is the main matter. Sanctimony is the literary mode, for Dougherty and for many others doing the same work with less literary facility. Never mind the petty sneering (as though the conservative movement were populated by septua

In [13]:
len(small_text)

10000

## Create Matrix of TF-IDF Features

In [0]:
def preprocess_text(text):

  # function to remove punctuation 
  def Punctuation(string): 
    return re.sub(r'[\W_]', ' ', string)

  # remove punctuation and perform tokenization
  text = Punctuation(text.lower()).split()

  # remove stop words and stem
  stop_words = set(stopwords.words('english'))
  stemmer = SnowballStemmer("english")
  text = [stemmer.stem(t) for t in text if t not in stop_words]

  return text


In [0]:
# tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', use_idf=True, tokenizer=preprocess_text, ngram_range=(1,3))

In [0]:
# fit vectorizer to small_text (title + content)
tfidf_matrix = tfidf_vectorizer.fit_transform(small_text)
print(tfidf_matrix.shape)



(10000, 5000)
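To make the shape above concrete, here is a toy run of `TfidfVectorizer` on three invented sentences, using the default tokenizer rather than `preprocess_text`. It shows one row per document, one column per n-gram, and rows scaled to unit length by the default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
docs = [
    "senate votes on health care bill",
    "house votes to repeal health care law",
    "warriors win nba title",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# One row per document, one column per unigram or bigram in the vocabulary
print(matrix.shape, len(vectorizer.vocabulary_))

# Each row has Euclidean length 1 under the default normalization
row_norms = np.asarray(np.sqrt(matrix.multiply(matrix).sum(axis=1))).ravel()
print(row_norms)
```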


In [0]:
# reduce memory 
tfidf_matrix = tfidf_matrix.astype(np.float32)

In [0]:
tfidf_matrix

<10000x5000 sparse matrix of type '<class 'numpy.float32'>'
	with 2077663 stored elements in Compressed Sparse Row format>
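A quick way to see what the `float32` cast saves, sketched on a small made-up sparse matrix: only the stored values shrink, while the index arrays of the CSR format keep their integer dtype.

```python
import numpy as np
from scipy import sparse

dense = np.zeros((100, 50))
dense[::7, ::3] = 0.5            # a few nonzero entries
m64 = sparse.csr_matrix(dense)   # float64, inherited from the dense array
m32 = m64.astype(np.float32)

# Bytes used by the stored values before and after the cast
print(m64.data.nbytes, m32.data.nbytes)
```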

## Create Matrix of Word2Vec Features

## Universal Sentence Encoder

In [0]:
# load the Universal Sentence Encoder's TF Hub module

# module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
# model = hub.load(module_url)
# print ("module %s loaded" % module_url)

model = hub.load("w266/tmp")


In [0]:
# reduce logging output
logging.set_verbosity(logging.ERROR)

# compute a representation for each message
message_embeddings = model(small_text)

In [0]:
# semantic similarity of two sentences can be trivially computed as the inner product of the encodings
corr = np.inner(message_embeddings, message_embeddings)
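The inner product equals cosine similarity only when the rows have unit length. The diagonal of 1.0 in the outputs of this notebook suggests the encoder's embeddings are approximately normalized, but that is an assumption here rather than something the notebook verifies. A sketch with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8))  # stand-in for embeddings
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

inner = np.inner(unit, unit)
norms = np.linalg.norm(unit, axis=1)
cosine = inner / np.outer(norms, norms)

print(np.allclose(inner, cosine))        # True
print(np.allclose(np.diag(inner), 1.0))  # True
```

Checking `np.linalg.norm(message_embeddings, axis=1)` would confirm whether the same holds for the real embeddings.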

In [17]:
# data frame of titles and semantic similarities
corr_df = pd.DataFrame(corr)
corr_df.columns = just_titles
corr_df.index = just_titles

corr_df.head()

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,Auto CEOs want Trump to order review of 2025 fuel rules,Golden State Warriors NBA champions after downing Cavs,Look Who Now Refuses to ‘Accept the Result of This Election’,"Obama Climate Plan, Now in Court, May Hinge on Error in 1990 Law",E.P.A. Head Stacks Agency With Climate Change Skeptics,Patient secretly recorded doctors as they operated on her. Should she be so distressed by what she heard?,Fox News Poll: Clinton edges Trump by two points one month ahead of election,The Atlantic Politics & Policy Daily: Trump’s Campaign Goes South,A Son In Chains. A Depressed Mom. Here’s What Helped,Trump’s expected VP pick: coal advocate who defied Obama’s climate agenda
"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",1.0,0.291457,0.336936,0.256971,0.181461,0.282036,0.389552,0.316919,0.289691,0.295462,...,0.370325,0.228231,0.486348,0.346315,0.391943,0.235392,0.379992,0.467373,0.384896,0.380097
US Civil Rights Commission Will Observe Standing Rock Standoff,0.291457,1.0,0.543982,0.10907,0.411541,0.422529,0.224046,0.27175,0.141546,0.381025,...,0.372918,0.220481,0.457202,0.428989,0.450195,0.28331,0.232701,0.420117,0.282087,0.371748
"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",0.336936,0.543982,1.0,0.106961,0.405647,0.540752,0.184615,0.287032,0.161188,0.340552,...,0.420927,0.22026,0.45993,0.459599,0.428584,0.283553,0.278399,0.46748,0.296813,0.345345
Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,0.256971,0.10907,0.106961,1.0,0.052951,0.109257,0.242693,0.108912,0.14604,0.155383,...,0.163759,0.123956,0.148778,0.166097,0.169043,0.156256,0.197509,0.187718,0.171532,0.155491
Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,0.181461,0.411541,0.405647,0.052951,1.0,0.384612,0.176325,0.368784,0.109351,0.225344,...,0.321637,0.142903,0.362914,0.371471,0.451796,0.283832,0.320866,0.4503,0.158671,0.378904


In [18]:
# find index of 5 most similar titles 
top5_ind = [] 
for i in range(len(corr_df)):
  top5_ind.append(list(list(zip(*heapq.nlargest(5, enumerate(corr_df.iloc[i,:]), key=operator.itemgetter(1))))[0]))

# show most similar titles -- sanity check
top5 = []
for ind in top5_ind:
  top5.append([just_titles[i] for i in ind])

top5[:5]

[['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
  'After Trump, conservatives should stop longing for the past — and learn a little humility',
  'The alt-right is more than warmed-over white supremacy. It’s that, but way way weirder.',
  'The Return of ‘Street Corner Conservatism’',
  'Liberals should get behind marriage (Opinion)'],
 [' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
  ' Cory Booker Calls For Federal Investigation Into Police Tactics At Dakota Access\xa0Pipeline',
  'Army will close Dakota pipeline protesters’ campsite, Sioux leader\xa0says',
  'Dakota Access Pipeline protest site is cleared',
  'Protesters, Police Still Clashing Over Disputed North Dakota Pipeline'],
 ['Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
  'Protester dies, minister sacked after Paraguay re-election vote',
  'Venezuela Erupts In ’Mother Of All Protests’ As Anti-Maduro Sentiment Seethes',
  'Confusion In Venezuela
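The `heapq` loop above can also be written as one vectorized call. This is an alternative sketch, not the notebook's method, and `top_k_indices` is a hypothetical name. As in the output above, the best match in each row of a self-similarity matrix is the item itself.

```python
import numpy as np


def top_k_indices(sim, k=5):
    """Column indices of the k largest scores in each row, best first."""
    sim = np.asarray(sim)
    # Sorting the negated scores in ascending order yields descending scores
    order = np.argsort(-sim, axis=1, kind="stable")
    return order[:, :k]


sim = np.array([[1.0, 0.2, 0.9],
                [0.1, 1.0, 0.3]])
print(top_k_indices(sim, k=2))
```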

In [0]:
np.array(message_embeddings).shape

(1000, 512)

In [0]:
message_embeddings

<tf.Tensor: shape=(1000, 512), dtype=float32, numpy=
array([[ 0.01333818, -0.05380408,  0.00232677, ...,  0.0533301 ,
        -0.03825486,  0.04241087],
       [-0.01210087, -0.01844358,  0.05187282, ...,  0.05565305,
        -0.05972204,  0.04307982],
       [-0.00471127, -0.04964766, -0.05074216, ...,  0.05359754,
        -0.05339101, -0.05365704],
       ...,
       [ 0.00479301,  0.04639916, -0.04595203, ...,  0.0533572 ,
        -0.05650016,  0.02762916],
       [-0.04229458, -0.06001651,  0.03259774, ...,  0.03309052,
        -0.04628655,  0.05280183],
       [ 0.04829948, -0.05886184, -0.01870255, ...,  0.05115974,
        -0.0572932 ,  0.05578536]], dtype=float32)>

In [0]:
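# km is only defined in the KMeans cell below, so this line raises a
# NameError when the notebook is run top to bottom; run it after fitting km.
# km.cluster_centers_.shape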

## Run KMeans Clustering 

In [0]:
num_clusters = 25

km = KMeans(n_clusters=num_clusters, random_state=10)

#km.fit(tfidf_matrix)
km.fit(message_embeddings)

clusters = km.labels_.tolist()

In [0]:
# cosine similarities for each row with cluster center
#cos_sim = [cosine_similarity(tfidf_matrix[i].toarray(), km.cluster_centers_[label].reshape(1, -1))[0][0] for i,label in enumerate(km.labels_)]
cos_sim = [cosine_similarity(np.array(message_embeddings[i]).reshape(1, -1), km.cluster_centers_[label].reshape(1, -1))[0][0] for i,label in enumerate(km.labels_)]
#sem_sim = [np.inner(np.array(message_embeddings[i]).reshape(1, -1), km.cluster_centers_[label].reshape(1, -1))[0][0] for i,label in enumerate(km.labels_)]
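The per-row `cosine_similarity` call above can be replaced by a single vectorized computation. A sketch under the assumption that the embeddings and centers fit in memory as dense arrays; `cosine_to_assigned_center` is a hypothetical helper and the inputs are toy values.

```python
import numpy as np


def cosine_to_assigned_center(X, centers, labels):
    """Cosine similarity between each row of X and its own cluster center."""
    X = np.asarray(X, dtype=float)
    assigned = np.asarray(centers, dtype=float)[np.asarray(labels)]
    dots = np.einsum("ij,ij->i", X, assigned)  # row-wise dot products
    return dots / (np.linalg.norm(X, axis=1) * np.linalg.norm(assigned, axis=1))


X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1, 0])
print(cosine_to_assigned_center(X, centers, labels))
```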


In [21]:
# sum of squared distances of samples to their closest cluster center.
km.inertia_

5733.306836503798
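To make the inertia value less opaque, the sketch below recomputes it by hand on toy data: the squared distance from every point to its assigned center, summed. The data and the name `km_toy` are made up, and `n_init=10` is set explicitly only to keep the example independent of the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

km_toy = KMeans(n_clusters=2, n_init=10, random_state=10).fit(X)

# Look up each point's center, then sum the squared differences
assigned = km_toy.cluster_centers_[km_toy.labels_]
manual_inertia = np.sum((X - assigned) ** 2)
print(manual_inertia, km_toy.inertia_)
```

Because inertia always decreases as the number of clusters grows, the value is mainly useful for comparing runs with the same `num_clusters`.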

In [0]:
# simple dataframe with title, cluster, and cosine similarity 
title_data = pd.DataFrame({'title':small_data.title, 'cluster':clusters, 'cos_sim':cos_sim})

In [24]:
title_data

Unnamed: 0,title,cluster,cos_sim
0,"Chaos in the Family, Chaos in the State: The W...",3,0.649303
1,US Civil Rights Commission Will Observe Stand...,3,0.629662
2,"Venezuela hunts rogue helicopter attackers, Ma...",10,0.683993
3,Fruit juice isn’t much better for you than sod...,1,0.408172
4,Sessions won’t testify at congressional budget...,5,0.760302
...,...,...,...
9995,Patient secretly recorded doctors as they oper...,11,0.566854
9996,Fox News Poll: Clinton edges Trump by two poin...,19,0.718047
9997,The Atlantic Politics & Policy Daily: Trump’s...,22,0.777570
9998,A Son In Chains. A Depressed Mom. Here’s What ...,11,0.621257


In [25]:
# inspect titles with highest cosine similarity in clusters 
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i,ind in top_5.index:
  print("cluster", i)
  print(just_titles[ind])
  print('-------------')

cluster 0
23 killed in West Virginia floods that swept preschooler away from grandfather’s reach
-------------
cluster 0
At least 26 dead in West Virginia flooding
-------------
cluster 0
As floods threaten N.C., officials urge: ‘Move to higher ground now’
-------------
cluster 0
More than 280 reported dead in Haiti as extent of Hurricane Matthew damage comes into focus
-------------
cluster 0
Lake Oroville dam: emergency staff race to fix spillway before more rain strikes
-------------
cluster 1
Routine DNA Sequencing May Be Helpful And Not As Scary As Feared
-------------
cluster 1
’Minibrains’ Could Help Drug Discovery For Zika And For Alzheimer’s 
-------------
cluster 1
 At Gene Editing Meeting, Scientists Discuss God, Racism, Designer Babies
-------------
cluster 1
Can Web Search Predict Cancer? Promise And Worry Of Big Data And Health
-------------
cluster 1
Debunking the Debunkers on Sugar
-------------
cluster 2
House Republicans repeal Obamacare, hurdles await in U.S. Senate


In [26]:
top_5

cluster      
0        7054    0.782964
         7073    0.779029
         3178    0.766450
         6440    0.757836
         24      0.736117
                   ...   
24       3765    0.753378
         6195    0.740042
         3141    0.730567
         4448    0.719834
         7532    0.718625
Name: cos_sim, Length: 125, dtype: float64

## Predict Input

In [0]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

In [0]:
# run tfidf on input 
#tfidf_input = tfidf_vectorizer.transform(input_topic)

# make list of tfidf_input
#tfidf_input = [tfidf_input.getrow(i).toarray()[0].tolist() for i in range(tfidf_input.shape[0])]

In [0]:
# run encoder on input 
enc_input = model(input_topic)

# make list of enc_input
enc_input = np.array(enc_input).tolist()

In [0]:
# make list of tfidf_matrix
#tfidf_list = [tfidf_matrix.getrow(i).toarray()[0].tolist() for i in range(tfidf_matrix.shape[0])]


In [0]:
# make list of message_embeddings
enc_list = np.array(message_embeddings).tolist()


In [30]:
len(title_data.cluster)

10000

In [0]:
# np.random.seed(5)
# X_train = tfidf_list
# y_train = title_data.cluster
# X_test  = tfidf_input

# # Create and fit a nearest-neighbor classifier
# from sklearn.neighbors import KNeighborsClassifier
# knn = KNeighborsClassifier()
# knn.fit(X_train, y_train) 

# res = knn.predict(X_test)

In [0]:
np.random.seed(5)
X_train = enc_list
y_train = title_data.cluster
X_test  = enc_input

# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train) 

res = knn.predict(X_test)

In [32]:
print(res)

[5 3 5]
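The nearest-neighbor classifier above assigns a cluster by majority vote among the closest training points. A fitted `KMeans` model can also assign new points directly to the nearest centroid with `predict`. The sketch below compares the two on made-up, well-separated data; this is an alternative for illustration, not the notebook's method, and the two approaches can disagree near cluster boundaries.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
km_toy = KMeans(n_clusters=2, n_init=10, random_state=10).fit(X)

new_points = np.array([[0.1, -0.1], [4.9, 5.2]])

# Vote among the 5 nearest labelled points (the classifier's default)
knn_toy = KNeighborsClassifier().fit(X, km_toy.labels_)
by_neighbors = knn_toy.predict(new_points)

# Assign to the nearest cluster center
by_centroid = km_toy.predict(new_points)

print(by_neighbors, by_centroid)
```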


In [34]:
for i,ind in top_5.index:
  if i in (3, 5):  # the clusters predicted in res
    print("cluster", i)
    print(just_titles[ind])
    print('-------------')

# ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

cluster 3
Milo Yiannopoulos is trying to convince colleges that hate speech is cool
-------------
cluster 3
Trump fans’ ’Deploraball’ party shows rift in alt-right movement
-------------
cluster 3
Yes, there is a free speech crisis. But its victims are not white men
-------------
cluster 3
Hate speech seeps into U.S. mainstream amid bitter campaign
-------------
cluster 3
Ciccotta: Guest Speaker’s Call to Violence Reveals Bucknell’s Radicalism 
-------------
cluster 5
Trump knows the feds are closing in on him
-------------
cluster 5
Comey to be pressed on whether Trump interfered with Russia probe
-------------
cluster 5
The FBI probe into Trump and Russia is huge news. Our political system isn’t ready for it.
-------------
cluster 5
Senate intelligence committee to question Trump team on links with Russia
-------------
cluster 5
Trump finds himself exactly where he doesn’t want to be
-------------


Check whether encoding sentences one at a time versus in a single batch produces different embeddings. It does not: the inner products below agree to within float32 rounding.

In [0]:
a = model(["hi my name is emma"])
b = model(["nice to meet you"])
c = model(["hi what's your name"])

In [0]:
d = model(["hi my name is emma", "nice to meet you", "hi what's your name"])

In [0]:
e = np.concatenate((a, b, c))

In [0]:
f = np.array(d)

In [54]:
np.inner(e, e)

array([[1.        , 0.23792592, 0.5253966 ],
       [0.23792592, 1.        , 0.43527788],
       [0.5253966 , 0.43527788, 1.        ]], dtype=float32)

In [55]:
np.inner(f, f)

array([[0.9999999 , 0.23792589, 0.5253966 ],
       [0.23792589, 1.        , 0.43527794],
       [0.5253966 , 0.43527794, 1.0000001 ]], dtype=float32)
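Rather than comparing the two printed matrices by eye, the embeddings themselves can be compared with a tolerance. A minimal sketch with stand-in arrays; `embeddings_match` and the `atol` value are assumptions chosen for illustration.

```python
import numpy as np


def embeddings_match(separate, batched, atol=1e-6):
    """True if both arrays have the same shape and agree within atol."""
    separate = np.asarray(separate)
    batched = np.asarray(batched)
    return separate.shape == batched.shape and bool(
        np.allclose(separate, batched, atol=atol)
    )


# Stand-ins for e and f: identical up to float32 rounding noise
e_toy = np.array([[0.6, 0.8], [1.0, 0.0]], dtype=np.float32)
f_toy = e_toy + np.float32(1e-7)
print(embeddings_match(e_toy, f_toy))  # True
```

With the notebook's variables this would be `embeddings_match(e, f)`.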

In [57]:
message_embeddings

<tf.Tensor: shape=(10000, 512), dtype=float32, numpy=
array([[ 0.01333818, -0.05380408,  0.00232677, ...,  0.0533301 ,
        -0.03825486,  0.04241087],
       [-0.01210087, -0.01844358,  0.05187282, ...,  0.05565305,
        -0.05972204,  0.04307982],
       [-0.00471127, -0.04964766, -0.05074216, ...,  0.05359754,
        -0.05339101, -0.05365704],
       ...,
       [-0.03098017, -0.05402634, -0.04946342, ...,  0.05646259,
        -0.05528343, -0.04850345],
       [ 0.0071629 , -0.05772226,  0.05848116, ...,  0.05804956,
        -0.05633788, -0.05588677],
       [ 0.03915349, -0.0549503 , -0.05372609, ...,  0.05495479,
        -0.05285954,  0.03810295]], dtype=float32)>

In [0]:
input_embeddings = model(input_topic)

In [59]:
input_embeddings

<tf.Tensor: shape=(3, 512), dtype=float32, numpy=
array([[-0.00043203,  0.03228568, -0.04395936, ...,  0.06762028,
        -0.03446475,  0.00829037],
       [-0.04381321,  0.0106896 ,  0.00313417, ...,  0.02813102,
        -0.07244141,  0.02743991],
       [ 0.00777211,  0.04472262,  0.00644468, ...,  0.03325547,
        -0.0335868 ,  0.00524749]], dtype=float32)>

In [0]:
inp_corr = np.inner(input_embeddings, message_embeddings)

In [64]:
# data frame of titles and semantic similarities
inp_df = pd.DataFrame(inp_corr)
inp_df.columns = just_titles
inp_df.index = input_topic

inp_df

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,Auto CEOs want Trump to order review of 2025 fuel rules,Golden State Warriors NBA champions after downing Cavs,Look Who Now Refuses to ‘Accept the Result of This Election’,"Obama Climate Plan, Now in Court, May Hinge on Error in 1990 Law",E.P.A. Head Stacks Agency With Climate Change Skeptics,Patient secretly recorded doctors as they operated on her. Should she be so distressed by what she heard?,Fox News Poll: Clinton edges Trump by two points one month ahead of election,The Atlantic Politics & Policy Daily: Trump’s Campaign Goes South,A Son In Chains. A Depressed Mom. Here’s What Helped,Trump’s expected VP pick: coal advocate who defied Obama’s climate agenda
Hillary Clinton defends handling of Benghazi attack,0.019261,0.135737,0.21394,-0.040233,0.270479,0.242178,-0.059264,0.235375,0.039634,0.098044,...,0.20322,0.059251,0.26468,0.107626,0.163884,0.137996,0.252123,0.317374,0.09281,0.22132
Women's March Highlights,0.045296,0.119294,-0.000519,0.065587,0.049757,-0.028455,-0.019417,0.070615,0.004875,0.16622,...,-0.015162,0.082744,0.106243,0.013035,-0.00255,0.052811,0.09829,0.124025,0.028465,0.090474
Hillary Clinton emails,0.036642,0.032883,0.058746,0.017771,0.199208,0.109163,-0.04811,0.151369,0.013369,0.053923,...,0.11911,0.073826,0.273021,0.035931,0.121179,0.079669,0.269662,0.28104,0.044456,0.205921


In [65]:
# find index of 5 most similar titles 
top5_ind = [] 
for i in range(len(inp_df)):
  top5_ind.append(list(list(zip(*heapq.nlargest(5, enumerate(inp_df.iloc[i,:]), key=operator.itemgetter(1))))[0]))

# show most similar titles -- sanity check
top5 = []
for ind in top5_ind:
  top5.append([just_titles[i] for i in ind])

top5[:5]
# ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

[['Benghazi may be over, but #Benghazi has a life of its own',
  'The GOP stoops for scandal',
  'Exclusive - Sarah Palin: Administration Lies, Soldiers Die. Yes, Hillary, at Every Point ‘It Matters’ ',
  'Read the House Benghazi report',
  'Clinton Bristles at Question on E-mail Indictment: ‘It’s Not Going to Happen’'],
 ['The Women’s March on Washington is becoming a\xa0joke',
  ' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March',
  'Women will march on – What would a feminist do? podcast',
  'The Exhausting Work of Tallying America’s Largest Protest',
  'Women’s March organizers prepare for hundreds of thousands of protesters'],
 ['Donald Trump: ’Good Job, Huma. Thank you, Anthony Weiner’ ',
  'Fox News anchor grills Hillary Clinton on her email scandal',
  'Hillary Clinton’s email problems might be even worse than we thought',
  'Hillary Clinton pal Neera Tanden’s greatest hits from WikiLeaks emails',
  'Hillary Clinton Deleted More Emails Than She Sent to the 