
# Clustering News Headlines

In this notebook we import a collection of news articles, inspect its contents, and try out clustering approaches for finding the articles that are most related to each other.

We start by installing and importing the dependencies.

In [0]:
%%capture
# Install the latest Tensorflow version.
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn

In [1]:
import os

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

import heapq
import operator

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/erusson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Importing Data from Local File System

Datasets are located here: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Download all three 'csv' files and store them in a directory of your choice. The cells below read them from a 'w266/' directory, so adjust the paths to match your location.


In [0]:
articles_3 = pd.read_csv('w266/articles3.csv')
articles_2 = pd.read_csv('w266/articles2.csv')
articles_1 = pd.read_csv('w266/articles1.csv')

In [3]:
articles_3.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


In [4]:
articles_2.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
1,53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
2,53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
3,53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
4,53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [5]:
articles_1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [6]:
# publications in articles_3.csv
set(articles_3.publication)

{'Guardian', 'NPR', 'Reuters', 'Vox', 'Washington Post'}

In [7]:
# publications in articles_2.csv
set(articles_2.publication)

{'Atlantic',
 'Buzzfeed News',
 'Fox News',
 'Guardian',
 'National Review',
 'New York Post',
 'Talking Points Memo'}

In [8]:
# publications in articles_1.csv
set(articles_1.publication)

{'Atlantic', 'Breitbart', 'Business Insider', 'CNN', 'New York Times'}

In [9]:
# join all datasets into one
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

full_data.shape

(142570, 10)

In [10]:
full_data.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [0]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [0]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [13]:
# sample 13k observations from full_data (set seed to 5)
small_data = full_data.sample(n=13000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [14]:
small_data.shape

(13000, 11)

In [16]:
small_data.head()

Unnamed: 0.1,index,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,74496,77946,118473,"Chaos in the Family, Chaos in the State: The W...",National Review,Kevin D. Williamson,2016-03-17,2016.0,3.0,http://www.nationalreview.com/article/432876/d...,Michael Brendan Dougherty is bitter. I think t...
1,71184,74592,113594,US Civil Rights Commission Will Observe Stand...,Buzzfeed News,Nidhi Subbaraman,2016-12-08,2016.0,12.0,https://web.archive.org/web/20161208153906/htt...,WASHINGTON — The US Commission on Civil Ri...
2,120205,123668,184574,"Venezuela hunts rogue helicopter attackers, Ma...",Reuters,Andrew Cawthorne and Victoria Ramirez,2017-06-29,2017.0,6.0,http://www.reuters.com/article/us-venezuela-po...,The Venezuelan government hunted on Wednesday...
3,128977,132440,199665,Fruit juice isn’t much better for you than sod...,Vox,Julia Belluz,2016/3/25,2016.0,3.0,http://www.vox.com/2016/3/25/11305614/soda-jui...,One of the biggest public health wins of rece...
4,134837,138300,208223,Sessions won’t testify at congressional budget...,Washington Post,Sari Horwitz,2017-06-10,2017.0,6.0,https://web.archive.org/web/20170611000758/htt...,"Attorney General Jeff Sessions, who had agree..."


In [17]:
# remove publisher tags from article titles 

just_titles = []

for title in small_data.title:
  title = re.sub(r"(- Breitbart)(?!.*\1)", '', title)
  title = re.sub(r'(- The New York Times)(?!.*\1)', '', title)
  just_titles.append(title)

just_titles[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
 ' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
 'Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
 'Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.',
 'Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead']

In [19]:
len(just_titles)

13000

In [21]:
# join with content of article

small_text = list(map(lambda i,j: i + " " + j, just_titles, small_data.content))

small_text[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction Michael Brendan Dougherty is bitter. I think that I can write that in both truth and charity. (I think you might even say that he and I are friends.) Dougherty is a conservative of the sort sometimes advertised as “paleo” and served as national correspondent for The American Conservative. Like many conservative writers with those associations, Dougherty spends a great deal of time lambasting the conservative movement and its organs, from which he feels, for whatever reason, estranged  —   an alienation that carries with it more than a little to suggest that it is somewhat personal. You know: Them. Donald Trump is the headline, and explaining the benighted white working class to Them is the main matter. Sanctimony is the literary mode, for Dougherty and for many others doing the same work with less literary facility. Never mind the petty sneering (as though the conservative movement were populated by septua

In [23]:
len(small_text)

13000

## Clustering with tf-idf, k-means, and k-nn


*   create tf-idf features
*   cluster with k-means
*   predict with k-nn




### Create Matrix of TF-IDF Features

In [0]:
def preprocess_text(text):
  """Lowercase, strip punctuation, drop English stop words, and stem."""

  # replace every non-alphanumeric character (and underscore) with a space
  def remove_punctuation(string):
    return re.sub(r'[\W_]', ' ', string)

  # remove punctuation and tokenize on whitespace
  tokens = remove_punctuation(text.lower()).split()

  # remove stop words and stem
  stop_words = set(stopwords.words('english'))
  stemmer = SnowballStemmer("english")
  return [stemmer.stem(t) for t in tokens if t not in stop_words]


In [0]:
# instantiate tfidf vectorizer 
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', use_idf=True, tokenizer=preprocess_text, ngram_range=(1,3))

In [26]:
# fit small_text to vectorizer (fit and transform)
tfidf_matrix = tfidf_vectorizer.fit_transform(small_text)
print(tfidf_matrix.shape)

(13000, 5000)
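
For intuition about what the vectorizer computes, here is a minimal pure-Python sketch of tf-idf with smoothed idf and L2 row normalization, which is the default weighting in scikit-learn. The toy token lists and the helper name `tfidf_rows` are illustrative assumptions and not part of the notebook; n-grams and `max_features` are left out.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) using smoothed idf and L2-normalized rows."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

# already-tokenized, already-stemmed toy documents (hypothetical)
toy_docs = [["clinton", "email"], ["clinton", "benghazi"], ["appl", "iphon"]]
vocab, rows = tfidf_rows(toy_docs)
```

In the first toy document, "email" gets a larger weight than "clinton" because "clinton" also occurs in another document.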


In [0]:
# reduce memory 
tfidf_matrix = tfidf_matrix.astype(np.float32)

### Run K-Means Clustering 

In [0]:
# set 25 clusters
num_clusters = 25

# instantiate k-means (set seed to 10)
km = KMeans(n_clusters=num_clusters, random_state=10)

# fit tf-idf features to k-means
km.fit(tfidf_matrix)

# list of cluster assignments 
clusters = km.labels_.tolist()

In [0]:
# cosine similarities for each row with respective cluster center
cos_sim = [cosine_similarity(tfidf_matrix[i].toarray(), km.cluster_centers_[label].reshape(1, -1))[0][0] for i,label in enumerate(km.labels_)]
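
The per-row loop above can also be expressed as one normalized dot product per row. The following is a minimal NumPy sketch on toy data; `X`, `centers`, and `labels` are made-up dense stand-ins for `tfidf_matrix`, `km.cluster_centers_`, and `km.labels_`.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
centers = np.array([[1.0, 0.5], [0.0, 1.0]])
labels = np.array([0, 0, 1])

# look up the center of the cluster each row was assigned to
assigned = centers[labels]

# cosine similarity = dot product divided by the product of the norms
num = np.sum(X * assigned, axis=1)
den = np.linalg.norm(X, axis=1) * np.linalg.norm(assigned, axis=1)
cos_sim_toy = num / den
```

The sketch assumes no row and no center is all zeros; a zero vector would need special handling to avoid dividing by zero.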


In [30]:
# sum of squared distances of samples to their closest cluster center.
km.inertia_

11459.21866215886
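
To make the definition of inertia concrete, this small NumPy sketch recomputes it by hand on made-up points: the squared Euclidean distance from every point to its assigned center, summed over all points. The arrays are illustrative and unrelated to the article data.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
centers = np.array([[0.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])

# squared Euclidean distance from each point to its assigned center
sq_dist = np.sum((points - centers[labels]) ** 2, axis=1)
inertia = sq_dist.sum()
```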

In [33]:
# dataframe with title, cluster, and cosine similarity 
title_data = pd.DataFrame({'title':small_data.title, 'cluster':clusters, 'cos_sim':cos_sim})

title_data.shape

(13000, 3)

In [32]:
title_data.head()

Unnamed: 0,title,cluster,cos_sim
0,"Chaos in the Family, Chaos in the State: The W...",10,0.382897
1,US Civil Rights Commission Will Observe Stand...,11,0.21581
2,"Venezuela hunts rogue helicopter attackers, Ma...",1,0.256718
3,Fruit juice isn’t much better for you than sod...,11,0.158706
4,Sessions won’t testify at congressional budget...,16,0.409593


In [34]:
# inspect titles with highest cosine similarity in clusters 
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i,ind in top_5.index:
  print("cluster", i)
  print(just_titles[ind])
  print('-------------')

cluster 0
IPhone 7 and Wireless Headphones: Analyzing Apple’s Announcements 
-------------
cluster 0
DOJ would allow Apple to keep or destroy software to help FBI hack iPhone
-------------
cluster 0
Apple’s risky bet on protecting a terrorist’s iPhone
-------------
cluster 0
Condolences to Apple for its Big Win
-------------
cluster 0
N.Y. judge backs Apple in encryption fight with government
-------------
cluster 1
It’s Official: America Has Two Presidents at One Time
-------------
cluster 1
For Obama, a bittersweet farewell from the world stage
-------------
cluster 1
Patient Diplomacy And A Reluctance To Act: Obama’s Mark On Foreign Policy
-------------
cluster 1
Obama’s year-end message: I did it right
-------------
cluster 1
Obama hands off legacy on terror to sharp critic 
-------------
cluster 2
Obama’s last trip will be to a divided Europe
-------------
cluster 2
’Game changer’: How the EU may shut Turkish door on migrants
-------------
cluster 2
Brexit: what happens when Brita

### Predict Cluster of Input

In [0]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

In [0]:
# run tfidf vectorizer on input (transform)
tfidf_input = tfidf_vectorizer.transform(input_topic)

# make list of tfidf_input
tfidf_input = [tfidf_input.getrow(i).toarray()[0].tolist() for i in range(tfidf_input.shape[0])]

In [0]:
# make list of tfidf_matrix
tfidf_list = [tfidf_matrix.getrow(i).toarray()[0].tolist() for i in range(tfidf_matrix.shape[0])]


In [0]:
np.random.seed(5)
X_train = tfidf_list
y_train = title_data.cluster
X_test  = tfidf_input

# create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train) 

res = knn.predict(X_test)

In [40]:
print(res)

[ 8 11  8]
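
`KNeighborsClassifier()` with default arguments takes a majority vote among the 5 nearest training points by Euclidean distance. The sketch below illustrates that idea in NumPy on toy data; `knn_predict` is a hypothetical helper, and its tie-breaking may differ from scikit-learn's.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Majority label among the k training rows closest to x (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# two well-separated toy groups, labeled like clusters 8 and 11
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
y_toy = np.array([8, 8, 8, 11, 11, 11])

pred = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
```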


In [42]:
for i,ind in top_5.index:
  if i in (8, 11):
    print("cluster", i)
    print(just_titles[ind])
    print('-------------')

# ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

cluster 8
From Whitewater to Benghazi: A Clinton-Scandal Primer
-------------
cluster 8
Hillary, Not Trump, Forced Us to Revisit Bill Clinton’s Scandals
-------------
cluster 8
Clinton: Trump’s lies ’outlandish’ 
-------------
cluster 8
Here’s a guide to the sex allegations that Donald Trump may raise in the presidential debate
-------------
cluster 8
Can we please be really truly done with Hillary?
-------------
cluster 11

America by Air: Parks Over Pittsburgh

-------------
cluster 11
Researchers Confront an Epidemic of Loneliness 
-------------
cluster 11

America by Air: Coasting Past a Nuclear Reactor

-------------
cluster 11
America by Air: Descending Into a Dust Storm
-------------
cluster 11
Can mythbusters like Snopes.com keep up in a post-truth era?
-------------


## Clustering with the Universal Sentence Encoder 

Run this section separately from the tf-idf section.

In [0]:
# load the Universal Sentence Encoder's TF Hub module

#module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
#model = hub.load(module_url)
#print ("module %s loaded" % module_url)

# download model from https://tfhub.dev/google/universal-sentence-encoder/4 and save locally 
model = hub.load("w266/tmp")


In [44]:
# reduce logging output
logging.set_verbosity(logging.ERROR)

# compute a representation for each article
small_embeddings = model(small_text)

ResourceExhaustedError: ignored
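
The `ResourceExhaustedError` suggests that encoding all 13,000 full articles in a single call did not fit in memory. One common workaround, offered here as an assumption rather than something the notebook did, is to encode the texts in batches and stack the results. In this sketch `embed_in_batches` is a hypothetical helper and `fake_encoder` is a stand-in for `model` so that the example runs without TensorFlow.

```python
import numpy as np

def embed_in_batches(encoder, texts, batch_size=256):
    """Encode texts chunk by chunk and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encoder(batch)))
    return np.vstack(chunks)

def fake_encoder(batch):
    # stand-in encoder: one 2-d vector per text (length, vowel count)
    return [[len(t), sum(ch in "aeiou" for ch in t)] for t in batch]

vectors = embed_in_batches(fake_encoder, ["apple", "fbi", "obama"], batch_size=2)
```

With the real model the call would look like `embed_in_batches(model, small_text)`; the batch size of 256 is a guess and would likely need tuning for long articles.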

In [46]:
# semantic similarity of two sentences can be trivially computed as the inner product of the encodings
corr = np.inner(small_embeddings, small_embeddings)

NameError: ignored
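
The inner product equals cosine similarity only when the embeddings have unit length. The self-similarities of about 1.0 in the check at the end of this notebook suggest that this is approximately true for this model; when in doubt, normalizing first makes it explicit. A sketch with made-up vectors:

```python
import numpy as np

emb = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])

# scale each row to unit length; inner products are then cosine similarities
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cosine = np.inner(unit, unit)
```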

In [0]:
# data frame of titles and semantic similarities
corr_df = pd.DataFrame(corr)
corr_df.columns = just_titles
corr_df.index = just_titles

corr_df.head()

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,Auto CEOs want Trump to order review of 2025 fuel rules,Golden State Warriors NBA champions after downing Cavs,Look Who Now Refuses to ‘Accept the Result of This Election’,"Obama Climate Plan, Now in Court, May Hinge on Error in 1990 Law",E.P.A. Head Stacks Agency With Climate Change Skeptics,Patient secretly recorded doctors as they operated on her. Should she be so distressed by what she heard?,Fox News Poll: Clinton edges Trump by two points one month ahead of election,The Atlantic Politics & Policy Daily: Trump’s Campaign Goes South,A Son In Chains. A Depressed Mom. Here’s What Helped,Trump’s expected VP pick: coal advocate who defied Obama’s climate agenda
"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",1.0,0.291457,0.336936,0.256971,0.181461,0.282036,0.389552,0.316919,0.289691,0.295462,...,0.370325,0.228231,0.486348,0.346315,0.391943,0.235392,0.379992,0.467373,0.384896,0.380097
US Civil Rights Commission Will Observe Standing Rock Standoff,0.291457,1.0,0.543982,0.10907,0.411541,0.422529,0.224046,0.27175,0.141546,0.381025,...,0.372918,0.220481,0.457202,0.428989,0.450195,0.28331,0.232701,0.420117,0.282087,0.371748
"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",0.336936,0.543982,1.0,0.106961,0.405647,0.540752,0.184615,0.287032,0.161188,0.340552,...,0.420927,0.22026,0.45993,0.459599,0.428584,0.283553,0.278399,0.46748,0.296813,0.345345
Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,0.256971,0.10907,0.106961,1.0,0.052951,0.109257,0.242693,0.108912,0.14604,0.155383,...,0.163759,0.123956,0.148778,0.166097,0.169043,0.156256,0.197509,0.187718,0.171532,0.155491
Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,0.181461,0.411541,0.405647,0.052951,1.0,0.384612,0.176325,0.368784,0.109351,0.225344,...,0.321637,0.142903,0.362914,0.371471,0.451796,0.283832,0.320866,0.4503,0.158671,0.378904


In [0]:
# find index of 5 most similar titles 
top5_ind = [] 
for i in range(len(corr_df)):
  top5_ind.append(list(list(zip(*heapq.nlargest(5, enumerate(corr_df.iloc[i,:]), key=operator.itemgetter(1))))[0]))

# show most similar titles -- sanity check
top5 = []
for ind in top5_ind:
  top5.append([just_titles[i] for i in ind])

top5[:5]

[['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
  'After Trump, conservatives should stop longing for the past — and learn a little humility',
  'The alt-right is more than warmed-over white supremacy. It’s that, but way way weirder.',
  'The Return of ‘Street Corner Conservatism’',
  'Liberals should get behind marriage (Opinion)'],
 [' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
  ' Cory Booker Calls For Federal Investigation Into Police Tactics At Dakota Access\xa0Pipeline',
  'Army will close Dakota pipeline protesters’ campsite, Sioux leader\xa0says',
  'Dakota Access Pipeline protest site is cleared',
  'Protesters, Police Still Clashing Over Disputed North Dakota Pipeline'],
 ['Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
  'Protester dies, minister sacked after Paraguay re-election vote',
  'Venezuela Erupts In ’Mother Of All Protests’ As Anti-Maduro Sentiment Seethes',
  'Confusion In Venezuela
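
The `heapq.nlargest` lookup above can also be written with `np.argsort`, which gives a similar result (ties may be ordered differently). The sketch uses a made-up 3×4 similarity matrix, and `top_k_indices` is a hypothetical helper.

```python
import numpy as np

def top_k_indices(sim, k):
    """Column indices of the k largest values in each row, best first."""
    return np.argsort(-sim, axis=1)[:, :k]

sim = np.array([[1.0, 0.2, 0.7, 0.4],
                [0.2, 1.0, 0.1, 0.9],
                [0.7, 0.1, 1.0, 0.3]])
top2 = top_k_indices(sim, 2)
```

When a matrix compares titles with themselves, as `corr_df` does, the best match in each row is usually the title itself, which is why each list in the output above starts with the query title.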

### Assign Cluster to Input

In [0]:
# compute a representation for each input
input_embeddings = model(input_topic)

In [0]:
# semantic similarity between inputs and training articles
inp_corr = np.inner(input_embeddings, small_embeddings)

In [22]:
# data frame of titles and semantic similarities
inp_df = pd.DataFrame(inp_corr)
inp_df.columns = just_titles
inp_df.index = input_topic

inp_df

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,"McDonald’s, Chick-fil-A, and Subway are making changes to profit off the shifting definition of ’healthy’",Bernie says the AP twisted his ‘messy’ convention prediction,Nevada’s Joe Heck Aims to Keep the Senate Republican,Donald Trump: John McCain and Lindsey Graham ‘Looking to Start World War III’,"Radiohead Announces New Album, Hear Another New Song",Roger Federer and Rafael Nadal roll back years for unexpected encore,Why voters like She’s Not Trump,Advanced Placement Tests To Hide History of Religion and Islamic Jihad in Europe,Little Simz,Wilhelmina to feature transgender model this fall
Hillary Clinton defends handling of Benghazi attack,0.019261,0.135737,0.21394,-0.040233,0.270479,0.242178,-0.059264,0.235375,0.039634,0.098044,...,0.077255,0.288595,0.195516,0.280349,0.027464,0.020349,0.281599,0.114124,0.017507,-0.070851
Women's March Highlights,0.045296,0.119294,-0.000519,0.065587,0.049757,-0.028455,-0.019417,0.070615,0.004875,0.16622,...,0.006225,0.1285,0.060778,0.00033,0.051476,0.025595,0.12345,0.030094,0.040689,0.124655
Hillary Clinton emails,0.036642,0.032883,0.058746,0.017771,0.199208,0.109163,-0.04811,0.151369,0.013369,0.053923,...,0.073164,0.322738,0.131154,0.256044,0.063897,0.047353,0.367134,-0.036238,0.014079,0.013729


In [23]:
# find index of 5 most similar titles 
top5_ind = [] 
for i in range(len(inp_df)):
  top5_ind.append(list(list(zip(*heapq.nlargest(5, enumerate(inp_df.iloc[i,:]), key=operator.itemgetter(1))))[0]))

# show most similar titles -- sanity check
top5 = []
for ind in top5_ind:
  top5.append([just_titles[i] for i in ind])

top5[:5]
# topics: ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

[['Benghazi may be over, but #Benghazi has a life of its own',
  'Univision’s Ramos picked the wrong Benghazi question for Hillary Clinton',
  'Donald Trump on Benghazi: ’Hillary Clinton Decided to Go Home and Sleep’ ',
  'The GOP stoops for scandal',
  'Exclusive - Sarah Palin: Administration Lies, Soldiers Die. Yes, Hillary, at Every Point ‘It Matters’ '],
 ['The Women’s March on Washington is becoming a\xa0joke',
  ' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March',
  'Women will march on – What would a feminist do? podcast',
  'The Exhausting Work of Tallying America’s Largest Protest',
  'Women’s March organizers prepare for hundreds of thousands of protesters'],
 ['Donald Trump: ’Good Job, Huma. Thank you, Anthony Weiner’ ',
  'Fox News anchor grills Hillary Clinton on her email scandal',
  'Hillary Clinton’s email problems might be even worse than we thought',
  'Hillary Clinton pal Neera Tanden’s greatest hits from WikiLeaks emails',
  'Hillary Clinton Del

#### Check whether embedding inputs individually vs. in one batch creates different embeddings -- NO (equal up to float32 rounding)

In [0]:
a = model(["hi my name is emma"])
b = model(["nice to meet you"])
c = model(["hi what's your name"])

In [0]:
d = model(["hi my name is emma", "nice to meet you", "hi what's your name"])

In [0]:
e = np.concatenate((a, b, c))

In [0]:
f = np.array(d)

In [0]:
np.inner(e, e)

array([[1.        , 0.23792592, 0.5253966 ],
       [0.23792592, 1.        , 0.43527788],
       [0.5253966 , 0.43527788, 1.        ]], dtype=float32)

In [0]:
np.inner(f, f)

array([[0.9999999 , 0.23792589, 0.5253966 ],
       [0.23792589, 1.        , 0.43527794],
       [0.5253966 , 0.43527794, 1.0000001 ]], dtype=float32)
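
The two matrices differ only in the last printed digit, which is consistent with floating-point rounding. `np.allclose` makes the comparison explicit; the values below are copied from the two outputs above, and the tolerance of 1e-6 is an arbitrary choice.

```python
import numpy as np

sim_individual = np.array([[1.0, 0.23792592, 0.5253966],
                           [0.23792592, 1.0, 0.43527788],
                           [0.5253966, 0.43527788, 1.0]], dtype=np.float32)
sim_batched = np.array([[0.9999999, 0.23792589, 0.5253966],
                        [0.23792589, 1.0, 0.43527794],
                        [0.5253966, 0.43527794, 1.0000001]], dtype=np.float32)

# element-wise comparison within an absolute tolerance
same = np.allclose(sim_individual, sim_batched, atol=1e-6)
max_diff = np.abs(sim_individual - sim_batched).max()
```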