<a href="https://colab.research.google.com/github/s-miramontes/News_Filter/blob/emma/Pilot/Data%20Cleanse%20and%20Clustering%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering News Headlines

In this notebook we begin by importing news article data and inspecting its contents, so that we can choose a clustering algorithm that groups the most closely related articles together.

We start by importing some dependencies.

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import re
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Importing Data from Local File System

Datasets are located here: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Download all three 'csv' files and store them in a 'data' directory at the location of your choice.

*Note: Importing the data from the cloud (Google Drive) rather than from the user's local file system makes the notebook easier to reproduce.*

Alternatively, we can authenticate with Google Drive and download the datasets from existing files in the cloud, which is what the cells below do.

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

For every file available we have a shared link:

CSV File: 'Articles 3':
https://drive.google.com/a/berkeley.edu/file/d/1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10/view?usp=sharing

CSV File: 'Articles 2':
https://drive.google.com/a/berkeley.edu/file/d/1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3/view?usp=sharing

CSV File: 'Articles 1':
https://drive.google.com/a/berkeley.edu/file/d/1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU/view?usp=sharing


For each of the links above, we use the File ID embedded in the URL:

- Articles 3 File ID: '1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10'
- Articles 2 File ID: '1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3'
- Articles 1 File ID: '1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU'
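
The File ID is the path segment that follows `/file/d/` in a share link. As a minimal sketch, a small helper can extract it so the IDs do not have to be copied by hand. The name `file_id_from_share_link` is hypothetical and not part of PyDrive or this notebook.

```python
import re

def file_id_from_share_link(url):
    """Return the Drive File ID found after '/file/d/' in a share link."""
    match = re.search(r"/file/d/([^/]+)", url)
    if match is None:
        raise ValueError("no file ID found in: " + url)
    return match.group(1)

# Share link for 'Articles 3', as listed above
link = ("https://drive.google.com/a/berkeley.edu/file/d/"
        "1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10/view?usp=sharing")
articles_3_id = file_id_from_share_link(link)
print(articles_3_id)
```

The resulting string is what gets passed as the `'id'` value to `drive.CreateFile`.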


In [0]:
downloaded3 = drive.CreateFile({'id':"1BDhi5FuIvJ_6jiIdIFXRjT0w6tgfNk10"})   # replace the id with id of file you want to access
downloaded3.GetContentFile('articles3.csv.zip')        # replace the file name with your file

downloaded2 = drive.CreateFile({'id': "1qpoRkKEOnxYg_12e0RdeRi-JWUGWNFP3"})
downloaded2.GetContentFile('articles2.csv.zip')

downloaded1 = drive.CreateFile({'id': "1c9NbOIx7M6PgyJKxR9E7HgzuJlxGIaXU"})
downloaded1.GetContentFile('articles1.csv.zip')

In [0]:
articles_3 = pd.read_csv('articles3.csv.zip', compression='zip')
articles_2 = pd.read_csv('articles2.csv.zip', compression='zip')
articles_1 = pd.read_csv('articles1.csv.zip', compression='zip')

In [6]:
articles_3.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


In [7]:
articles_2.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
1,53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
2,53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
3,53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
4,53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [8]:
articles_1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [9]:
set(articles_3.publication)

{'Guardian', 'NPR', 'Reuters', 'Vox', 'Washington Post'}

In [10]:
set(articles_2.publication)

{'Atlantic',
 'Buzzfeed News',
 'Fox News',
 'Guardian',
 'National Review',
 'New York Post',
 'Talking Points Memo'}

In [11]:
set(articles_1.publication)

{'Atlantic', 'Breitbart', 'Business Insider', 'CNN', 'New York Times'}

In [0]:
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

In [13]:
full_data

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."
...,...,...,...,...,...,...,...,...,...,...
142565,146028,218078,An eavesdropping Uber driver saved his 16-year...,Washington Post,Avi Selk,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Uber driver Keith Avila picked up a p...
142566,146029,218079,Plane carrying six people returning from a Cav...,Washington Post,Sarah Larimer,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Crews on Friday continued to search L...
142567,146030,218080,After helping a fraction of homeowners expecte...,Washington Post,Renae Merle,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,When the Obama administration announced a...
142568,146031,218081,"Yes, this is real: Michigan just banned bannin...",Washington Post,Chelsea Harvey,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,This story has been updated. A new law in...


In [0]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [0]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [16]:
# sample from full_data (set seed to 5)
small_data = full_data.sample(n=10000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [17]:
small_data

Unnamed: 0.1,index,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,74496,77946,118473,"Chaos in the Family, Chaos in the State: The W...",National Review,Kevin D. Williamson,2016-03-17,2016.0,3.0,http://www.nationalreview.com/article/432876/d...,Michael Brendan Dougherty is bitter. I think t...
1,71184,74592,113594,US Civil Rights Commission Will Observe Stand...,Buzzfeed News,Nidhi Subbaraman,2016-12-08,2016.0,12.0,https://web.archive.org/web/20161208153906/htt...,WASHINGTON — The US Commission on Civil Ri...
2,120205,123668,184574,"Venezuela hunts rogue helicopter attackers, Ma...",Reuters,Andrew Cawthorne and Victoria Ramirez,2017-06-29,2017.0,6.0,http://www.reuters.com/article/us-venezuela-po...,The Venezuelan government hunted on Wednesday...
3,128977,132440,199665,Fruit juice isn’t much better for you than sod...,Vox,Julia Belluz,2016/3/25,2016.0,3.0,http://www.vox.com/2016/3/25/11305614/soda-jui...,One of the biggest public health wins of rece...
4,134837,138300,208223,Sessions won’t testify at congressional budget...,Washington Post,Sari Horwitz,2017-06-10,2017.0,6.0,https://web.archive.org/web/20170611000758/htt...,"Attorney General Jeff Sessions, who had agree..."
...,...,...,...,...,...,...,...,...,...,...,...
9995,137105,140568,211140,Patient secretly recorded doctors as they oper...,Washington Post,Yanan Wang,2016-04-07,2016.0,4.0,https://web.archive.org/web/20160408000201/htt...,"Last summer, Ethel Easter wanted nothin..."
9996,60293,63612,86308,Fox News Poll: Clinton edges Trump by two poin...,Fox News,Dana Blanton,2016-10-07,2016.0,10.0,https://web.archive.org/web/20161008002456/htt...,Third party candidates Gary Johnson (6 percent...
9997,55079,58389,80070,The Atlantic Politics & Policy Daily: Trump’s...,Atlantic,Elaine Godfrey,2016-08-31,2016.0,8.0,,For us to continue writing great stori...
9998,110248,113711,168509,A Son In Chains. A Depressed Mom. Here’s What ...,NPR,Nurith Aizenman,2016-04-15,2016.0,4.0,http://www.npr.org/sections/goatsandsoda/2016/...,It was a hospital — but to psychologist Ink...


In [18]:
# remove publisher tags from article titles 

just_titles = []

for title in small_data.title:
  title = re.sub(r"(- Breitbart)(?!.*\1)", '', title)
  title = re.sub(r'(- The New York Times)(?!.*\1)', '', title)
  just_titles.append(title)

just_titles[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
 ' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
 'Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
 'Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.',
 'Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead']
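
The pattern `(- Breitbart)(?!.*\1)` uses a negative lookahead so that the tag is matched only when no later copy of it follows, which means only the last occurrence is removed. It does not trim the whitespace left behind. The sketch below illustrates this behaviour on invented titles; `strip_publisher_tag` and `PUBLISHER_TAGS` are hypothetical names, and the final `.strip()` is an addition that the cell above does not perform.

```python
import re

# Hypothetical helper: removes the last occurrence of each publisher tag
# and trims leftover whitespace (the trimming is an addition, not in the notebook).
PUBLISHER_TAGS = ["- Breitbart", "- The New York Times"]

def strip_publisher_tag(title):
    for tag in PUBLISHER_TAGS:
        # (tag)(?!.*\1) matches the tag only if no later copy follows it
        pattern = "(" + re.escape(tag) + r")(?!.*\1)"
        title = re.sub(pattern, "", title)
    return title.strip()

# Invented example titles, for illustration only
examples = [
    "Sample headline - Breitbart",
    "Another headline - The New York Times",
    "Headline without a tag",
    "Quote about - Breitbart coverage - Breitbart",
]
cleaned = [strip_publisher_tag(t) for t in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

In the last example only the trailing tag is removed, and the earlier mention inside the headline is kept.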

In [19]:
len(just_titles)

10000

In [20]:
# join each cleaned title with the content of its article
small_text = [title + " " + content for title, content in zip(just_titles, small_data.content)]

small_text[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction Michael Brendan Dougherty is bitter. I think that I can write that in both truth and charity. (I think you might even say that he and I are friends.) Dougherty is a conservative of the sort sometimes advertised as “paleo” and served as national correspondent for The American Conservative. Like many conservative writers with those associations, Dougherty spends a great deal of time lambasting the conservative movement and its organs, from which he feels, for whatever reason, estranged  —   an alienation that carries with it more than a little to suggest that it is somewhat personal. You know: Them. Donald Trump is the headline, and explaining the benighted white working class to Them is the main matter. Sanctimony is the literary mode, for Dougherty and for many others doing the same work with less literary facility. Never mind the petty sneering (as though the conservative movement were populated by septua

In [21]:
len(small_text)

10000

## Create Matrix of TF-IDF Features

In [0]:
def preprocess_text(text):

  # function to remove punctuation 
  def Punctuation(string): 
    return re.sub(r'[\W_]', ' ', string)

  # remove punctuation and perform tokenization
  text = Punctuation(text.lower()).split()

  # remove stop words and stem
  stop_words = set(stopwords.words('english'))
  stemmer = SnowballStemmer("english")
  text = [stemmer.stem(t) for t in text if not t in stop_words]

  return text
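
To see what the punctuation step does on its own, here is a minimal sketch that reproduces only the lowercasing, the `[\W_]` substitution and the split. Stop-word removal and stemming are left out so that the example runs without NLTK data; `tokenize` is a hypothetical name.

```python
import re

def tokenize(text):
    # Same normalization as preprocess_text: lowercase, replace every
    # non-word character and underscore with a space, split on whitespace.
    return re.sub(r"[\W_]", " ", text.lower()).split()

tokens = tokenize("Working Class’s Dysfunction: 4-Hour_Workweek")
print(tokens)
```

Note that the typographic apostrophe counts as a non-word character, so a possessive such as "Class’s" is split into `class` and `s`. The stray `s` is later dropped because it appears in the NLTK English stop-word list.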


In [0]:
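# tfidf vectorizer
# preprocess_text already removes NLTK stop words; stop_words='english' additionally
# filters the stemmed tokens against scikit-learn's own list, which triggers a
# stop-word consistency warning when the vectorizer is fitted.
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    use_idf=True,
    tokenizer=preprocess_text,
    ngram_range=(1, 3),
)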

In [0]:
# fit the vectorizer on the combined title + content text
tfidf_matrix = tfidf_vectorizer.fit_transform(small_text)
print(tfidf_matrix.shape)


(10000, 5000)


In [0]:
# reduce memory 
tfidf_matrix = tfidf_matrix.astype(np.float32)

In [0]:
tfidf_matrix

<10000x5000 sparse matrix of type '<class 'numpy.float32'>'
	with 2077739 stored elements in Compressed Sparse Row format>
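
As a rough illustration of what each entry of this matrix means, the sketch below computes TF-IDF weights by hand on a toy corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which a weight is `tf * (ln((1 + n) / (1 + df)) + 1)` and each row is then scaled to unit length. The corpus and the `tfidf` helper are invented for the example.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (invented for illustration)
docs = [
    ["trump", "clinton", "debate"],
    ["clinton", "sanders", "debate"],
    ["fruit", "juice", "soda"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(doc)
    weights = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(d) for d in docs]
for v in vectors:
    print({t: round(w, 3) for t, w in v.items()})
```

Terms that appear in fewer documents (here `trump`) receive a higher weight than terms shared across documents (here `clinton` and `debate`).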

## Create Matrix of Word2Vec Features

## Universal Sentence Encoder

In [0]:
%%capture
# Install the latest Tensorflow version.
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn

In [26]:
#Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [34]:
#@title Compute a representation for each message, showing various lengths supported.
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = small_text

# Reduce logging output.
logging.set_verbosity(logging.ERROR)

message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
  print("Message: {}".format(messages[i]))
  print("Embedding size: {}".format(len(message_embedding)))
  message_embedding_snippet = ", ".join(
      (str(x) for x in message_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




Message: 20 killed in stabbings at Muslim shrine, Pakistani police say Islamabad, Pakistan  (CNN) A custodian and four other suspects have been arrested after 20 people were killed at a Muslim shrine, authorities in Pakistan said Sunday.   Three more people are hospitalized in critical condition after the attack in Punjab province, said Liaqat Chattah, deputy police commissioner for Sargodha city, where the attack happened.  Custodian Abdul Waheed drugged and stripped the victims before they were killed at the Sufi shrine, which is named the Ali Ahmad Gunnar Shrine, said Chattah.      Knives and clubs were used to attack the devotees before police were alerted by two men and two women who had fled the scene, Chattah said. Waheed was arrested, along with four other suspects. The police official described Waheed as mentally unstable.  An investigation is underway, he said.  Shrines to various saints are scattered all over Pakistan, where they are considered places of meditation and refu
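
Embedding all 10,000 full-text articles in one call and printing every message is what triggers the IOPub rate limit above, and a single large call can also be memory hungry. One possible workaround, sketched here under assumptions, is to embed in batches and print nothing. `embed_in_batches` and `fake_embed` are hypothetical names; in the notebook, `embed_fn` would be the `embed` function defined earlier, and the batch size of 256 is an arbitrary choice.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=256):
    """Embed texts chunk by chunk and stack the results into one array.

    Assumes texts is non-empty and embed_fn returns one row per input text.
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    return np.vstack(chunks)

# Stand-in for the real encoder so the sketch runs without TensorFlow:
# each "embedding" is just (length of text, number of spaces).
def fake_embed(batch):
    return np.array([[len(t), t.count(" ")] for t in batch], dtype=np.float32)

vectors = embed_in_batches(["a b", "cc", "d e f"], fake_embed, batch_size=2)
print(vectors.shape)
```

With the real encoder the call would look like `embed_in_batches(small_text, embed)`, and only a handful of rows would be printed for inspection.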

In [0]:
corr = np.inner(message_embeddings, message_embeddings)

In [0]:
corr_df = pd.DataFrame(corr)

corr_df.columns = just_titles
corr_df.index = just_titles
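
The inner product used for `corr` equals cosine similarity only when every embedding has unit length. Universal Sentence Encoder embeddings are described as approximately normalized, which is why the diagonal of the matrix is close to, but not always exactly, 1. If exact cosine similarity is wanted, a sketch along these lines normalizes the rows first; `cosine_matrix` is a hypothetical helper and the toy vectors are invented.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T

# Toy vectors: rows 0 and 1 point the same way, row 2 is orthogonal to both
toy = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
sim = cosine_matrix(toy)
print(np.round(sim, 3))
```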

In [45]:
corr_df

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,In search of the Chibok girls and the meaning of their kidnapping,Biden Jokingly Thanks Trump for ’Making the American People Look in the Mirror’,"Hillary Clinton’s plan to undo the school-to-prison pipeline, explained","HIV Was Likely Transmitted On A Gay Porn Set, CDC Reports",Man survives knife in eye socket,Viacom in talks with Jim Gianopulos to head Paramount,Emmanuel Macron Declared Next French President,Ivanka Trump cuts off Cosmo interview after tough questioning,Hillary Campaign Denies Report of Campaign Shake-up After New Hampshire,VIDEO: Alabama High School Girl Fires Stun Gun at Teacher,How Sexism Held Back Space Exploration,Trump-Clinton Showdown Most-Watched Debate in History,How Republicans came to embrace anti-environmentalism,"Taliban storm German consulate in Afghan city, four killed",Lake Oroville dam: emergency staff race to fix spillway before more rain strikes,Viewsroom Predictions 2017: Part 2,Trump is finding it easier to tear down old policies than to build his own,Iran’s ex-president writes Obama on seized assets,U.S. 
charges Florida man in case linked to JPMorgan hacking probe,The real reason superrich people hate Obama,Trump Tower Lobby Evacuated by Police after Reports of Suspicious Package,Stormtrooper behind infamous ‘Star Wars’ blooper breaks his silence,Patient Diplomacy And A Reluctance To Act: Obama’s Mark On Foreign Policy,Millennials are totally mixed up about what they believe in,Cologne Police Attacked for ’Racial Profiling’ During NYE,"BIAS ALERT: CNN reporter says Hannity, Limbaugh want Hillary ’dying’",Jill Stein files for recount in Wisconsin,A little-known Pakistani tribe that loves wine and whiskey fears its Muslim neighbors,Valeant to appoint interim CEO as Pearson remains hospitalized: source,Wall Street rises as data points to accelerating economy,...,This week we saw that the Republican Party — not just Trump — is the problem,What the holidays are like for a recovering alcoholic like me,Doping has always been part of the Olympics. Of course Russia got off the hook.,Here’s how to not to get old lady hands,Texas Health Officials Brace for More Zika Virus Cases After Six Confirmed Statewide,Cruz: Trump Attacked My Wife Because He’s ’Scared’,Canada says most border-crossing asylum seekers were in U.S. legally,"LGBT employees protected from workplace discrimination, appeals court rules",Outgoing US ambassador: America is ’not an ethno-state’,We asked legal experts if Greg Gianforte can serve in Congress if he’s convicted of assault in reporter ’body-slam’ case,What next for Maria Sharapova as she prepares to appeal against her drug ban?,Donald Trump promises to deport 3 million “illegal immigrant criminals.” That’s literally impossible.,American democracy is winning... 
so far,USSS chief says ’No friction’ with Trump’s private security,The simple arithmetic that could jump-start America’s economic growth,Trump and Clinton join mourners at 9/11 anniversary ceremony,"Lucille Ball statue needs a makeover, locals say",FBI monitored former Trump campaign adviser Carter Page on Russia,Dr. Jane Orient: ‘Universal Coverage Means Less Care’,"Pro-Trump Art Show Finds New Venue, Despite Cancellation Attempt And Legal Threats",Marvel Comics Cancels Black Lives Matter-Themed ’Black Panther’ Due to Poor Sales,Libya national army recaptures oil ports at Sidra and Ras Lanuf,"What Bernie Sanders needs to learn from Hillary Clinton, and vice versa",Stay angry after Manchester,"TIME: Trump Loss Could Result in ’A New Right-Wing Populist Party, Anchored by a Trump-Breitbart-Ailes Media Empire’","Clinton Powers Through Super Tuesday, Sets Sights On Trump",Warren Buffett: I bought $12 billion of stock after Trump won,GOP Mega-Debate Blog,Volkswagen America Chief Michael Horn Resigns,Teenager Seeks to Honor Veterans of War by Preserving Their Stories,Auto CEOs want Trump to order review of 2025 fuel rules,Golden State Warriors NBA champions after downing Cavs,Look Who Now Refuses to ‘Accept the Result of This Election’,"Obama Climate Plan, Now in Court, May Hinge on Error in 1990 Law",E.P.A. Head Stacks Agency With Climate Change Skeptics,Patient secretly recorded doctors as they operated on her. Should she be so distressed by what she heard?,Fox News Poll: Clinton edges Trump by two points one month ahead of election,The Atlantic Politics & Policy Daily: Trump’s Campaign Goes South,A Son In Chains. A Depressed Mom. Here’s What Helped,Trump’s expected VP pick: coal advocate who defied Obama’s climate agenda
"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",1.000000,0.291457,0.336936,0.256971,0.181461,0.282036,0.389551,0.316919,0.289691,0.295462,0.440229,0.249661,0.423986,0.195106,0.188498,0.195321,0.402051,0.435655,0.366493,0.194521,0.391413,0.230025,0.495496,0.290977,0.247193,0.392017,0.461511,0.259906,0.300757,0.300338,0.293250,0.180506,0.327410,0.595260,0.343602,0.284286,0.334819,0.395611,0.350891,0.332254,...,0.561130,0.288338,0.361339,0.142494,0.250517,0.356758,0.361094,0.315388,0.383622,0.305031,0.190893,0.402139,0.486652,0.357250,0.568888,0.356364,0.307867,0.251892,0.364093,0.377043,0.281445,0.223061,0.358147,0.368368,0.447582,0.444941,0.372099,0.338761,0.331583,0.277880,0.370325,0.228231,0.486348,0.346315,0.391943,0.235392,0.379992,0.467373,0.384896,0.380097
US Civil Rights Commission Will Observe Standing Rock Standoff,0.291457,1.000000,0.543982,0.109070,0.411541,0.422529,0.224046,0.271750,0.141546,0.381025,0.394586,0.254232,0.409052,0.303505,0.289978,0.238294,0.315366,0.248853,0.309557,0.357640,0.250740,0.227431,0.426162,0.446733,0.488827,0.257067,0.376665,0.325969,0.322797,0.119291,0.321066,0.198248,0.300885,0.227979,0.469392,0.302752,0.345799,0.344382,0.373645,0.357101,...,0.355217,0.242404,0.334919,0.098917,0.346023,0.277401,0.540610,0.404285,0.467805,0.479594,0.294901,0.278824,0.378827,0.364401,0.270305,0.365561,0.317745,0.303007,0.261098,0.460971,0.345169,0.525177,0.289962,0.312449,0.249073,0.296501,0.238153,0.218928,0.277692,0.453544,0.372918,0.220481,0.457202,0.428989,0.450195,0.283310,0.232701,0.420117,0.282087,0.371748
"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",0.336936,0.543982,1.000000,0.106961,0.405647,0.540752,0.184615,0.287032,0.161188,0.340552,0.496353,0.262362,0.344230,0.302758,0.436904,0.283682,0.461428,0.242524,0.362011,0.350824,0.269456,0.266465,0.417118,0.556532,0.493539,0.361689,0.366373,0.475206,0.435180,0.242520,0.341805,0.239511,0.445015,0.262305,0.492565,0.324988,0.288255,0.364235,0.471169,0.401487,...,0.394198,0.259664,0.354480,0.102672,0.359986,0.391297,0.436070,0.323494,0.515974,0.440763,0.334845,0.313370,0.483248,0.324152,0.325420,0.372635,0.307279,0.311935,0.288056,0.355572,0.321760,0.608213,0.342764,0.411887,0.359434,0.285848,0.321542,0.272913,0.372750,0.332260,0.420927,0.220260,0.459930,0.459599,0.428584,0.283553,0.278399,0.467480,0.296813,0.345345
Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,0.256971,0.109070,0.106961,0.999999,0.052951,0.109257,0.242693,0.108912,0.146040,0.155383,0.169326,0.133344,0.213598,0.126937,0.060196,0.159555,0.138772,0.192490,0.207176,0.143686,0.137557,0.185294,0.137274,0.147131,0.198811,0.209470,0.215257,0.028350,0.150327,0.308400,0.139774,0.070090,0.156332,0.230201,0.104554,0.180427,0.168984,0.153244,0.168052,0.138021,...,0.221460,0.207075,0.317575,0.197903,0.139864,0.105257,0.175732,0.103050,0.113901,0.173156,0.232171,0.144789,0.197923,0.154879,0.322481,0.087798,0.172873,0.117933,0.257364,0.143244,0.164409,0.132387,0.279697,0.218400,0.146807,0.230288,0.201782,0.146811,0.203473,0.113018,0.163759,0.123956,0.148778,0.166097,0.169043,0.156256,0.197509,0.187718,0.171532,0.155491
Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,0.181461,0.411541,0.405647,0.052951,1.000000,0.384612,0.176325,0.368784,0.109351,0.225344,0.200232,0.323061,0.323962,0.133778,0.251900,0.239904,0.268766,0.330142,0.351093,0.248917,0.240766,0.312375,0.378555,0.293382,0.257763,0.151555,0.462101,0.324921,0.406046,0.240509,0.331039,0.240964,0.315641,0.168998,0.338324,0.366980,0.371093,0.167109,0.339653,0.317318,...,0.441263,0.191015,0.252760,0.121867,0.248112,0.301005,0.251728,0.340361,0.453017,0.501037,0.337584,0.303857,0.479369,0.345872,0.314085,0.267156,0.148301,0.582224,0.281912,0.256051,0.281738,0.379716,0.233785,0.174403,0.423505,0.244356,0.303939,0.328788,0.310806,0.153719,0.321637,0.142903,0.362913,0.371471,0.451796,0.283832,0.320866,0.450300,0.158671,0.378904
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Patient secretly recorded doctors as they operated on her. Should she be so distressed by what she heard?,0.235392,0.283310,0.283553,0.156256,0.283832,0.265033,0.331497,0.230673,0.233065,0.330073,0.323173,0.203243,0.226230,0.254465,0.360594,0.129727,0.201763,0.299933,0.220060,0.304837,0.268871,0.108042,0.152352,0.286417,0.277268,0.123587,0.275929,0.261729,0.191214,0.144783,0.248262,0.284764,0.236085,0.140088,0.284442,0.313350,0.231483,0.225994,0.251377,0.161968,...,0.308695,0.344684,0.237021,0.174929,0.268881,0.250653,0.315168,0.260119,0.303412,0.213703,0.309935,0.250096,0.256569,0.229627,0.142274,0.302842,0.281695,0.257574,0.304777,0.202713,0.210290,0.205835,0.185785,0.280328,0.126881,0.201947,0.142650,0.234229,0.308187,0.258365,0.187319,0.158680,0.255372,0.199795,0.318114,1.000000,0.150293,0.314685,0.437575,0.193149
Fox News Poll: Clinton edges Trump by two points one month ahead of election,0.379992,0.232701,0.278399,0.197509,0.320866,0.297060,0.161304,0.471135,0.229161,0.270680,0.238489,0.417171,0.348348,0.210365,0.166642,0.195138,0.485886,0.482194,0.572810,0.215837,0.175408,0.497523,0.308526,0.207492,0.277163,0.314131,0.435656,0.254012,0.244957,0.382214,0.331447,0.083486,0.347066,0.496868,0.259581,0.418541,0.493949,0.248568,0.300063,0.384553,...,0.538432,0.142051,0.204809,0.061838,0.256259,0.437381,0.331155,0.279466,0.278076,0.331490,0.184817,0.298013,0.513771,0.353507,0.427157,0.228214,0.139810,0.354513,0.352397,0.307426,0.218533,0.255470,0.410803,0.277601,0.551739,0.517861,0.421811,0.396468,0.160160,0.176305,0.366800,0.275099,0.557804,0.247927,0.314423,0.150293,1.000000,0.526430,0.205309,0.468168
The Atlantic Politics & Policy Daily: Trump’s Campaign Goes South,0.467373,0.420117,0.467480,0.187718,0.450300,0.431241,0.315624,0.584903,0.313041,0.328896,0.368427,0.458995,0.445674,0.249556,0.276897,0.269573,0.499870,0.586460,0.582161,0.317671,0.334567,0.438664,0.490244,0.398593,0.421874,0.298072,0.566460,0.345957,0.392945,0.321500,0.463694,0.165709,0.489623,0.391153,0.327938,0.449229,0.563059,0.374230,0.409141,0.322075,...,0.723464,0.249410,0.388076,0.065902,0.459734,0.564706,0.514368,0.367988,0.528829,0.506569,0.288312,0.420448,0.630920,0.491058,0.400619,0.404094,0.324165,0.483406,0.412322,0.478061,0.354819,0.398461,0.480942,0.357621,0.574623,0.582749,0.402321,0.493656,0.307618,0.306022,0.429166,0.218183,0.649551,0.364936,0.483837,0.314685,0.526430,1.000000,0.303780,0.609512
A Son In Chains. A Depressed Mom. Here’s What Helped,0.384896,0.282087,0.296813,0.171532,0.158671,0.283374,0.299188,0.185527,0.216815,0.269272,0.534896,0.189617,0.369980,0.325306,0.310359,0.181462,0.261842,0.321659,0.316531,0.185758,0.310069,0.140542,0.301139,0.328294,0.303167,0.216524,0.333989,0.246509,0.227388,0.201871,0.220440,0.104767,0.298743,0.296246,0.328947,0.306802,0.230921,0.338983,0.342165,0.225082,...,0.309949,0.362990,0.294835,0.112201,0.288857,0.261506,0.373658,0.210055,0.329173,0.190113,0.299553,0.303631,0.259712,0.335440,0.198900,0.305955,0.290795,0.217816,0.324658,0.299187,0.219639,0.257938,0.349319,0.320008,0.177416,0.291968,0.218171,0.267784,0.244561,0.367269,0.203563,0.140798,0.282445,0.242898,0.314041,0.437575,0.205309,0.303780,1.000000,0.242837


In [0]:
def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

In [35]:
# messages = [
#     # Smartphones
#     "I like my phone",
#     "My phone is not good.",
#     "Your cellphone looks great.",

#     # Weather
#     "Will it snow tomorrow?",
#     "Recently a lot of hurricanes have hit the US",
#     "Global warming is real",

#     # Food and health
#     "An apple a day, keeps the doctors away",
#     "Eating strawberries is healthy",
#     "Is paleo better than keto?",

#     # Asking about age
#     "How old are you?",
#     "what is your age?",
# ]

# Plotting all 10,000 full-text messages raised a ValueError here, so plot a
# small subset instead and label the axes with the titles.
plot_similarity(just_titles[:10], embed(messages[:10]), 90)

<Figure size 432x288 with 2 Axes>

In [33]:
import pandas
import scipy
import math
import csv

sts_dataset = tf.keras.utils.get_file(
    fname="Stsbenchmark.tar.gz",
    origin="http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz",
    extract=True)
sts_dev = pandas.read_table(
    os.path.join(os.path.dirname(sts_dataset), "stsbenchmark", "sts-dev.csv"),
    on_bad_lines="skip",  # replaces error_bad_lines=False, which was removed in pandas 2.0
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["sim", "sent_1", "sent_2"])
sts_test = pandas.read_table(
    os.path.join(
        os.path.dirname(sts_dataset), "stsbenchmark", "sts-test.csv"),
    on_bad_lines="skip",  # replaces error_bad_lines=False, which was removed in pandas 2.0
    quoting=csv.QUOTE_NONE,
    skip_blank_lines=True,
    usecols=[4, 5, 6],
    names=["sim", "sent_1", "sent_2"])
# cleanup some NaN values in sts_dev
sts_dev = sts_dev[[isinstance(s, str) for s in sts_dev['sent_2']]]

Downloading data from http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz


## Run KMeans Clustering 

In [0]:
num_clusters = 15

km = KMeans(n_clusters=num_clusters, random_state=10)

km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [0]:
# cosine similarity between each row and the center of its assigned cluster
cos_sim = [
    cosine_similarity(
        tfidf_matrix[i].toarray(),
        km.cluster_centers_[label].reshape(1, -1))[0][0]
    for i, label in enumerate(km.labels_)
]
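
The per-row loop above calls `cosine_similarity` once per document. A vectorized alternative is sketched below, assuming a SciPy sparse TF-IDF matrix and a dense array of centroids. The function name `cos_sim_to_assigned_center` and the toy inputs are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

def cos_sim_to_assigned_center(X, centers, labels):
    """Cosine similarity between each row of X and its assigned centroid."""
    X_unit = normalize(X, norm="l2", axis=1)        # sparse input stays sparse
    C_unit = normalize(centers, norm="l2", axis=1)  # dense (n_clusters, n_features)
    # Similarity of every row to every centroid: (n_samples, n_clusters)
    all_sims = np.asarray(X_unit @ C_unit.T)
    # Keep only the column matching each row's own cluster label
    return all_sims[np.arange(X.shape[0]), labels]

# Toy stand-ins for tfidf_matrix, km.cluster_centers_ and km.labels_
X_demo = sparse.csr_matrix(np.array([[1.0, 0.0, 1.0],
                                     [0.0, 2.0, 0.0],
                                     [3.0, 0.0, 0.0]]))
centers_demo = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
labels_demo = np.array([0, 1, 0])

demo_sims = cos_sim_to_assigned_center(X_demo, centers_demo, labels_demo)
print(demo_sims)
```

In the notebook this would be called as `cos_sim_to_assigned_center(tfidf_matrix, km.cluster_centers_, km.labels_)`. The intermediate array has one column per cluster, so it stays small for 15 clusters.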

In [0]:
# sum of squared distances of samples to their closest cluster center.
km.inertia_

8981.595195485403
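
The inertia value can be reproduced by hand, which makes it clearer what the number measures. The sketch below runs on random toy data (an assumption; it does not touch `tfidf_matrix`) and recomputes the sum of squared Euclidean distances from each sample to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for the notebook's TF-IDF matrix (assumption)
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X_toy)

# Look up the centroid assigned to each sample, then sum squared distances
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float(((X_toy - assigned_centers) ** 2).sum())

print(manual_inertia, km_demo.inertia_)
```

Because inertia always decreases as `n_clusters` grows, a single value such as the one above is mainly useful when compared across several choices of `num_clusters`.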

In [0]:
# simple dataframe with title, cluster, and cosine similarity 
title_data = pd.DataFrame({'title':small_data.title, 'cluster':clusters, 'cos_sim':cos_sim})

In [0]:
title_data

Unnamed: 0,title,cluster,cos_sim
0,"Chaos in the Family, Chaos in the State: The W...",6,0.424168
1,US Civil Rights Commission Will Observe Stand...,9,0.224492
2,"Venezuela hunts rogue helicopter attackers, Ma...",5,0.264832
3,Fruit juice isn’t much better for you than sod...,9,0.138075
4,Sessions won’t testify at congressional budget...,9,0.166797
...,...,...,...
9995,Patient secretly recorded doctors as they oper...,6,0.292499
9996,Fox News Poll: Clinton edges Trump by two poin...,10,0.395914
9997,The Atlantic Politics & Policy Daily: Trump’s...,10,0.505341
9998,A Son In Chains. A Depressed Mom. Here’s What ...,6,0.231095


In [0]:
# inspect titles with highest cosine similarity in clusters 
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i,ind in top_5.index:
  print("cluster", i)
  print(just_titles[ind])
  print('-------------')

cluster 0
Sanders: Clinton is running a ‘desperate’ campaign that lacks excitement
-------------
cluster 0
Has Sanders joined the ‘Bernie or Bust’ movement? Nah.
-------------
cluster 0
Who’s more electable: Bernie Sanders or Hillary Clinton?
-------------
cluster 0
Bernie Sanders knows he’s going to lose. Here’s how you can tell.
-------------
cluster 0
Democrats are bracing for a nasty debate between Hillary Clinton and Bernie Sanders
-------------
cluster 1
US Stock Market Ends Its Worst Week Since 2011
-------------
cluster 1
New U.S. single-family home sales race to 10-month high
-------------
cluster 1
Rising U.S. layoffs hint at ebbing labor market momentum  
-------------
cluster 1
Robust U.S. payrolls brighten economic outlook
-------------
cluster 1
Wall Street ticks up as hawkish Fed fears ebb; Apple weighs
-------------
cluster 2
North Korea missile launch marks a direct challenge to Trump administration
-------------
cluster 2
The North Korean military threat to America an

## Predict Input

In [0]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

In [0]:
# run tfidf on input 
tfidf_input = tfidf_vectorizer.transform(input_topic)

# make list of tfidf_input
tfidf_input = [tfidf_input.getrow(i).toarray()[0].tolist() for i in range(tfidf_input.shape[0])]

In [0]:
# make list of tfidf_matrix
tfidf_list = [tfidf_matrix.getrow(i).toarray()[0].tolist() for i in range(tfidf_matrix.shape[0])]
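
Converting every TF-IDF row to a Python list makes the data dense, which can use a lot of memory for a large vocabulary. scikit-learn's `KNeighborsClassifier` also accepts SciPy sparse matrices directly. The sketch below shows this on made-up toy data; `n_neighbors=1` is chosen only to keep the toy example unambiguous and is not the notebook's setting.

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

# Toy feature rows and labels standing in for tfidf_matrix and the cluster ids
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                       [0.9, 0.1, 0.0],
                                       [0.0, 1.0, 0.0],
                                       [0.0, 0.9, 0.1]]))
y_toy = np.array([0, 0, 1, 1])

# Fit directly on the sparse matrix; no .toarray() or .tolist() needed
knn_sparse = KNeighborsClassifier(n_neighbors=1).fit(X_sparse, y_toy)

# Queries can stay sparse as well
query = sparse.csr_matrix(np.array([[0.8, 0.0, 0.0],
                                    [0.0, 0.7, 0.2]]))
pred_sparse = knn_sparse.predict(query)
print(pred_sparse)
```

Applied to the notebook, this would mean passing `tfidf_matrix` and `tfidf_vectorizer.transform(input_topic)` to `fit` and `predict` without the list conversion.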


In [0]:
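# Create and fit a nearest-neighbor classifier that maps TF-IDF vectors
# to the KMeans cluster labels assigned above
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(5)
X_train = tfidf_list
y_train = title_data.cluster
X_test  = tfidf_input

knn = KNeighborsClassifier()  # scikit-learn default: n_neighbors=5
knn.fit(X_train, y_train)

# predicted cluster id for each input topic
res = knn.predict(X_test)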

In [0]:
print(res)

[10  6 10]
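
The cell above assigns new text to a cluster by majority vote among its nearest training titles. A different technique, shown here only for comparison, is to ask the fitted KMeans model for the nearest centroid with `KMeans.predict`. The sketch below is self-contained and uses a made-up four-title corpus; the names `toy_titles`, `toy_vectorizer` and `toy_km` are illustrative and do not exist in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus with two obvious topics
toy_titles = [
    "Clinton campaign emails released",
    "Clinton defends email server",
    "Stock market ends week lower",
    "Wall Street stock rally continues",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)
toy_km = KMeans(n_clusters=2, n_init=10, random_state=10).fit(toy_matrix)

# New text must go through the SAME fitted vectorizer before prediction
toy_input = ["Clinton emails", "stock market rally"]
toy_pred = toy_km.predict(toy_vectorizer.transform(toy_input))
print(toy_pred)
```

In the notebook the equivalent call would be `km.predict(tfidf_vectorizer.transform(input_topic))`. The two approaches can disagree: KNN follows local neighborhoods of titles, while `predict` only compares against the cluster centers.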


In [0]:
# show the top titles for the clusters predicted above (6 and 10)
for i, ind in top_5.index:
  if i in (6, 10):
    print("cluster", i)
    print(just_titles[ind])
    print('-------------')

cluster 6
50 Wonderful Things From 2016
-------------
cluster 6
We read all 20 National Book Award nominees for 2016. Here’s what we thought.
-------------
cluster 6
In A Genre Crowded With Bros, Cam Lifts Off
-------------
cluster 6
Marie Kondo and the Ruthless War on Stuff 
-------------
cluster 6
Being Aaron Burr
-------------
cluster 10
From Whitewater to Benghazi: A Clinton-Scandal Primer
-------------
cluster 10
Hillary, Not Trump, Forced Us to Revisit Bill Clinton’s Scandals
-------------
cluster 10
Can we please be really truly done with Hillary?
-------------
cluster 10
A new tell-all about the Clinton campaign is a searing indictment of the candidate herself
-------------
cluster 10
Clinton: ’People should and do trust me’
-------------
