<a href="https://colab.research.google.com/github/s-miramontes/News_Filter/blob/silvia/Pilot/Data%20Cleanse%20and%20Clustering%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering News Headlines

In this notebook we import a corpus of news articles, inspect its contents, and compare clustering approaches to find the articles that are most closely related to each other.

We start by installing libraries and importing dependencies.

In [0]:
%%capture
# Install the latest Tensorflow version.
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn

In [1]:
import os

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

import heapq
import operator

from math import sqrt
from joblib import Parallel, delayed


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/silvsilva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Importing Data from Local File System

Datasets are located here: https://www.kaggle.com/snapcrack/all-the-news/version/4#articles3.csv

Download all three CSV files and store them in a `data` directory inside the notebook's working directory, since the cells below read from `data/articles1.csv`, `data/articles2.csv`, and `data/articles3.csv`.


In [0]:
articles_3 = pd.read_csv('data/articles3.csv')
articles_2 = pd.read_csv('data/articles2.csv')
articles_1 = pd.read_csv('data/articles1.csv')

In [0]:
articles_3.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


In [0]:
articles_2.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
1,53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
2,53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
3,53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
4,53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [0]:
articles_1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [0]:
# publications in articles_3.csv
set(articles_3.publication)

{'Guardian', 'NPR', 'Reuters', 'Vox', 'Washington Post'}

In [0]:
# publications in articles_2.csv
set(articles_2.publication)

{'Atlantic',
 'Buzzfeed News',
 'Fox News',
 'Guardian',
 'National Review',
 'New York Post',
 'Talking Points Memo'}

In [0]:
# publications in articles_1.csv
set(articles_1.publication)

{'Atlantic', 'Breitbart', 'Business Insider', 'CNN', 'New York Times'}

In [19]:
# join all datasets into one
full_data = pd.concat([articles_1, articles_2, articles_3], ignore_index=True)

full_data.shape

(142570, 10)

In [20]:
full_data.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [0]:
# remove duplicates
full_data = full_data.drop_duplicates(subset=['title', 'publication', 'author', 'date'])

In [0]:
# remove missing titles 
full_data = full_data.dropna(subset=['title'])

In [23]:
# sample 13k observations from full_data (set seed to 5)
small_data = full_data.sample(n=13000, random_state=5).reset_index()
set(small_data.publication)

{'Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post'}

In [24]:
small_data.shape

(13000, 11)

In [25]:
small_data.head()

Unnamed: 0.1,index,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,74496,77946,118473,"Chaos in the Family, Chaos in the State: The W...",National Review,Kevin D. Williamson,2016-03-17,2016.0,3.0,http://www.nationalreview.com/article/432876/d...,Michael Brendan Dougherty is bitter. I think t...
1,71184,74592,113594,US Civil Rights Commission Will Observe Stand...,Buzzfeed News,Nidhi Subbaraman,2016-12-08,2016.0,12.0,https://web.archive.org/web/20161208153906/htt...,WASHINGTON — The US Commission on Civil Ri...
2,120205,123668,184574,"Venezuela hunts rogue helicopter attackers, Ma...",Reuters,Andrew Cawthorne and Victoria Ramirez,2017-06-29,2017.0,6.0,http://www.reuters.com/article/us-venezuela-po...,The Venezuelan government hunted on Wednesday...
3,128977,132440,199665,Fruit juice isn’t much better for you than sod...,Vox,Julia Belluz,2016/3/25,2016.0,3.0,http://www.vox.com/2016/3/25/11305614/soda-jui...,One of the biggest public health wins of rece...
4,134837,138300,208223,Sessions won’t testify at congressional budget...,Washington Post,Sari Horwitz,2017-06-10,2017.0,6.0,https://web.archive.org/web/20170611000758/htt...,"Attorney General Jeff Sessions, who had agree..."


In [26]:
# remove publisher tags from article titles 

def clean_title(title):
  title = re.sub(r"(- Breitbart)(?!.*\1)", '', title)
  title = re.sub(r'(- The New York Times)(?!.*\1)', '', title)
  return title
  
just_titles = Parallel(n_jobs=16)(delayed(clean_title)(title) for title in small_data.title)

just_titles[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
 ' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
 'Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
 'Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.',
 'Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead']
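The regular expressions in `clean_title` remove the last occurrence of each publisher tag but leave the whitespace that preceded it. As a minimal sketch of an alternative, assuming the tag only ever appears at the end of a title, the hypothetical helper `strip_publisher_suffix` below anchors the match at the end of the string and trims the leftover whitespace. The tag list contains only the two publishers handled in this notebook.

```python
import re

# Hypothetical tag list: only the two suffixes handled by clean_title above.
PUBLISHER_TAGS = ("Breitbart", "The New York Times")

# Match " - <tag>" only at the very end of the title.
_TAG_RE = re.compile(
    r"\s*-\s*(?:" + "|".join(re.escape(tag) for tag in PUBLISHER_TAGS) + r")\s*$"
)

def strip_publisher_suffix(title):
    """Remove a trailing publisher tag and surrounding whitespace."""
    return _TAG_RE.sub("", title).strip()

tagged = "Rick Perry Attacked for Keeping His Word - Breitbart"
untagged = "Women will march on - What would a feminist do? podcast"
print(strip_publisher_suffix(tagged))
print(strip_publisher_suffix(untagged))
```

Unlike the original pattern, this sketch would not touch a tag that appears in the middle of a title.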

In [0]:
small_data.to_csv("small_data.csv")

In [11]:
len(just_titles)

13000

In [12]:
# store ids of titles 
id_list = small_data.copy().id

len(id_list)

13000

In [13]:
# join with content of article
# (stored under a new name so the small_data dataframe stays available for later cells)

small_text = list(map(lambda i,j: i + " " + j, just_titles, small_data.content))

small_text[:5]

['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction Michael Brendan Dougherty is bitter. I think that I can write that in both truth and charity. (I think you might even say that he and I are friends.) Dougherty is a conservative of the sort sometimes advertised as “paleo” and served as national correspondent for The American Conservative. Like many conservative writers with those associations, Dougherty spends a great deal of time lambasting the conservative movement and its organs, from which he feels, for whatever reason, estranged  —   an alienation that carries with it more than a little to suggest that it is somewhat personal. You know: Them. Donald Trump is the headline, and explaining the benighted white working class to Them is the main matter. Sanctimony is the literary mode, for Dougherty and for many others doing the same work with less literary facility. Never mind the petty sneering (as though the conservative movement were populated by septua


## Clustering with tf-idf, k-means, and k-nn


*   create tf-idf features
*   cluster with k-means
*   predict with k-nn
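To make the first step concrete, here is a small pure-Python sketch of tf-idf weighting. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is how scikit-learn's `TfidfVectorizer` behaves with its default settings. The function name `tfidf_rows` and the toy documents are illustrative only and are not part of the notebook's pipeline.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """docs: list of token lists. Returns (vocab, rows) with L2-normalized tf-idf rows."""
    n_docs = len(docs)
    vocab = sorted({token for doc in docs for token in doc})
    # document frequency: number of documents containing each token
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in row])
    return vocab, rows

# Toy, already-tokenized documents (hypothetical).
docs = [
    ["clinton", "email", "server"],
    ["clinton", "benghazi", "hearing"],
    ["women", "march", "washington"],
]
vocab, rows = tfidf_rows(docs)
for token, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(token, round(weight, 3))
```

In the first document, "clinton" gets a lower weight than "email" because it also appears in the second document.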




### Create Matrix of TF-IDF Features

In [0]:
def preprocess_text(text):

  # replace punctuation (any non-word character or underscore) with a space
  def remove_punctuation(string):
    return re.sub(r'[\W_]', ' ', string)

  # lowercase, remove punctuation, and tokenize on whitespace
  tokens = remove_punctuation(text.lower()).split()

  # remove stop words and stem the remaining tokens
  # (a plain list comprehension avoids starting a joblib worker pool for every document)
  stop_words = set(stopwords.words('english'))
  stemmer = SnowballStemmer("english")
  return [stemmer.stem(t) for t in tokens if t not in stop_words]


In [0]:
# instantiate tfidf vectorizer 
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', use_idf=True, tokenizer=preprocess_text, ngram_range=(1,3))

In [0]:
# fit small_text to vectorizer (fit and transform)
tfidf_matrix = tfidf_vectorizer.fit_transform(small_text)
print(tfidf_matrix.shape)

In [0]:
# reduce memory 
tfidf_matrix = tfidf_matrix.astype(np.float32)

### Run K-Means Clustering 

In [0]:
# set 25 clusters
num_clusters = 25

# instantiate k-means (set seed to 10)
km = KMeans(n_clusters=num_clusters, random_state=10)

# fit tf-idf features to k-means
km.fit(tfidf_matrix)

# list of cluster assignments 
clusters = km.labels_.tolist()

In [0]:
# cosine similarities for each row with respective cluster center
def cluster_sim(i, label):
 return cosine_similarity(tfidf_matrix[i].toarray(), km.cluster_centers_[label].reshape(1, -1))[0][0]

cos_sim = Parallel(n_jobs=16)(delayed(cluster_sim)(i, label) for i,label in enumerate(km.labels_))
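The per-row loop above can also be written as a single vectorized computation. The sketch below assumes small dense NumPy arrays for illustration (the notebook's `tfidf_matrix` is sparse), and `similarity_to_assigned_center` is a hypothetical helper name.

```python
import numpy as np

def similarity_to_assigned_center(X, centers, labels):
    """Cosine similarity between each row of X and the center of its assigned cluster."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    assigned = centers[labels]                      # shape (n_samples, n_features)
    dots = np.einsum("ij,ij->i", X, assigned)       # row-wise dot products
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(assigned, axis=1)
    # rows with a zero norm get similarity 0 instead of a division warning
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy data (hypothetical): three samples, two cluster centers.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
labels = np.array([0, 1, 0])
sims = similarity_to_assigned_center(X, centers, labels)
print(sims)
```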

In [0]:
# sum of squared distances of samples to their closest cluster center.
km.inertia_

In [0]:
# dataframe with title, cluster, and cosine similarity 
title_data = pd.DataFrame({'title':small_data.title, 'cluster':clusters, 'cos_sim':cos_sim})

title_data.shape

In [0]:
title_data.head()

In [0]:
# inspect titles with highest cosine similarity in clusters 
top_5 = title_data.groupby('cluster')['cos_sim'].nlargest(5)
for i,ind in top_5.index:
  print("cluster", i)
  print(just_titles[ind])
  print('-------------')

### Predict Cluster of Input

In [0]:
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

In [0]:
# run tfidf vectorizer on input (transform)
tfidf_input = tfidf_vectorizer.transform(input_topic)

# function to make tfidf features into list for k-nn 
def make_list(features, i): 
  return features.getrow(i).toarray()[0].tolist()

# make list of tfidf_input
tfidf_input = Parallel(n_jobs=16)(delayed(make_list)(tfidf_input, i) for i in range(tfidf_input.shape[0]))

In [0]:
# make list of tfidf_matrix
tfidf_list = Parallel(n_jobs=16)(delayed(make_list)(tfidf_matrix, i) for i in range(tfidf_matrix.shape[0]))


In [0]:
np.random.seed(5)
X_train = tfidf_list
y_train = title_data.cluster
X_test  = tfidf_input

# create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train) 

res = knn.predict(X_test)

In [0]:
print(res)
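Converting every tf-idf row to a Python list is memory-hungry. As a hedged sketch, `KNeighborsClassifier` can be fitted on a SciPy sparse matrix directly, in which case scikit-learn uses a brute-force neighbor search. The toy feature values and labels below are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

# Toy sparse "tf-idf" rows with their k-means cluster labels (hypothetical values).
X_train = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]))
y_train = np.array([0, 0, 1, 1])

# New inputs, vectorized with the same feature columns.
X_test = csr_matrix(np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
]))

# n_neighbors=1 keeps the toy example unambiguous; the notebook uses the default.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(pred)
```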

In [0]:
for i,ind in top_5.index:
  if i in (8, 11):
    print("cluster", i)
    print(just_titles[ind])
    print('-------------')

# ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

## Clustering with the Universal Sentence Encoder 

Run this section separately from the tf-idf section.

### Create Embeddings on Training Data

In [0]:
!pwd

/home/silvsilva/w266Final


In [0]:
# load the Universal Sentence Encoder's TF Hub module

#module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
#model = hub.load(module_url)
#print ("module %s loaded" % module_url)

# download model from https://tfhub.dev/google/universal-sentence-encoder/4 and save locally 
model = hub.load("tmp")


In [0]:
# reduce logging output
logging.set_verbosity(logging.ERROR)

# compute a representation for each article
small_embeddings = model(small_text)

In [0]:
# semantic similarity of two sentences can be trivially computed as the inner product of the encodings
corr = np.inner(small_embeddings, small_embeddings)
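The inner product equals the cosine similarity only when the vectors have unit length. The output further down shows self-similarities of 1.0 (or 0.999999), which suggests the encoder's embeddings are approximately normalized. The sketch below checks the identity on random stand-in vectors, not on real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 8))                     # stand-in for unnormalized vectors

# scale each row to unit length
lengths = np.linalg.norm(raw, axis=1, keepdims=True)
unit = raw / lengths

inner = np.inner(unit, unit)                      # inner products of unit vectors
cosine = (raw @ raw.T) / (lengths * lengths.T)    # cosine similarity of the raw vectors

print(np.allclose(inner, cosine))
```

If embeddings from another model were not normalized, dividing by the row norms first would keep the scores comparable across articles.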

In [0]:
# data frame of titles and semantic similarities
corr_df = pd.DataFrame(corr)
corr_df.columns = just_titles
corr_df.index = just_titles

corr_df.head()

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,"McDonald’s, Chick-fil-A, and Subway are making changes to profit off the shifting definition of ’healthy’",Bernie says the AP twisted his ‘messy’ convention prediction,Nevada’s Joe Heck Aims to Keep the Senate Republican,Donald Trump: John McCain and Lindsey Graham ‘Looking to Start World War III’,"Radiohead Announces New Album, Hear Another New Song",Roger Federer and Rafael Nadal roll back years for unexpected encore,Why voters like She’s Not Trump,Advanced Placement Tests To Hide History of Religion and Islamic Jihad in Europe,Little Simz,Wilhelmina to feature transgender model this fall
"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",1.0,0.291457,0.336936,0.256971,0.181461,0.282036,0.389551,0.316919,0.289691,0.295462,...,0.217885,0.347041,0.459125,0.317545,0.065338,0.159626,0.458244,0.456368,0.228482,0.326126
US Civil Rights Commission Will Observe Standing Rock Standoff,0.291457,1.0,0.543982,0.10907,0.411541,0.422529,0.224046,0.27175,0.141546,0.381025,...,0.059385,0.298766,0.349619,0.275823,0.209703,0.112333,0.26849,0.271677,0.186529,0.228896
"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",0.336936,0.543982,1.0,0.106961,0.405647,0.540752,0.184615,0.287032,0.161188,0.340552,...,0.0847,0.344075,0.40092,0.310918,0.255471,0.267033,0.259199,0.362642,0.26564,0.168313
Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,0.256971,0.10907,0.106961,0.999999,0.052951,0.109257,0.242693,0.108912,0.14604,0.155383,...,0.344664,0.227382,0.186247,0.101332,0.064881,0.121565,0.248952,0.207818,0.171043,0.204495
Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,0.181461,0.411541,0.405647,0.052951,1.0,0.384612,0.176325,0.368784,0.109351,0.225344,...,0.033626,0.316896,0.351867,0.394767,0.222523,0.173553,0.287082,0.243409,0.216823,0.191981


In [0]:
# function to return the column index of the top five values in a row of a dataframe
def find_top5ind(df, i):
  return list(list(zip(*heapq.nlargest(5, enumerate(df.iloc[i,:]), key=operator.itemgetter(1))))[0])

# function to return the top 5 values in a list
def find_top5(lst, ind):
  return [lst[i] for i in ind]

# find index of 5 most similar titles 
top5_ind = Parallel(n_jobs=16)(delayed(find_top5ind)(corr_df, i) for i in range(len(corr_df)))

# show most similar titles -- sanity check
top5 = Parallel(n_jobs=16)(delayed(find_top5)(just_titles, ind) for ind in top5_ind)

top5[:5]

[['Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction',
  'After Trump, conservatives should stop longing for the past — and learn a little humility',
  'The alt-right is more than warmed-over white supremacy. It’s that, but way way weirder.',
  'The Return of ‘Street Corner Conservatism’',
  'Liberals should get behind marriage (Opinion)'],
 [' US Civil Rights Commission Will Observe Standing Rock\xa0Standoff',
  ' Cory Booker Calls For Federal Investigation Into Police Tactics At Dakota Access\xa0Pipeline',
  'Army will close Dakota pipeline protesters’ campsite, Sioux leader\xa0says',
  'Dakota Access Pipeline protest site is cleared',
  'Protesters, Police Still Clashing Over Disputed North Dakota Pipeline'],
 ['Venezuela hunts rogue helicopter attackers, Maduro foes suspicious',
  'Venezuelan opposition blocks streets to protest Maduro power shakeup',
  'Protester dies, minister sacked after Paraguay re-election vote',
  'Venezuela Erupts In ’Mother Of 
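The `heapq`-based lookup works row by row through a dataframe. A possible alternative, sketched here under the assumption that the similarity matrix fits in memory as a NumPy array, sorts every row at once. `top_k_indices` is a hypothetical helper name.

```python
import numpy as np

def top_k_indices(sim, k=5):
    """Column indices of the k largest values in each row, in descending order."""
    sim = np.asarray(sim)
    k = min(k, sim.shape[1])
    # sorting the negated scores gives descending order; a stable sort breaks ties by column
    return np.argsort(-sim, axis=1, kind="stable")[:, :k]

# Toy symmetric similarity matrix (hypothetical values).
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
print(top_k_indices(sim, k=2))
```

As in the output above, each row's best match is the article itself when a matrix is compared with itself, so the first index can be dropped if only other articles are wanted.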

### Assign Cluster to Input

In [0]:
# examples of user given inputs
input_topic = ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

# compute a representation for each input
input_embeddings = model(input_topic)

In [0]:
# semantic similarity between inputs and training articles
inp_corr = np.inner(input_embeddings, small_embeddings)

In [0]:
# data frame of titles and semantic similarities
inp_df = pd.DataFrame(inp_corr)
inp_df.columns = just_titles
inp_df.index = input_topic

inp_df

Unnamed: 0,"Chaos in the Family, Chaos in the State: The White Working Class’s Dysfunction",US Civil Rights Commission Will Observe Standing Rock Standoff,"Venezuela hunts rogue helicopter attackers, Maduro foes suspicious",Fruit juice isn’t much better for you than soda. Let’s stop pretending otherwise.,Sessions won’t testify at congressional budget hearings but at Senate intelligence hearing instead,Qatar: UAE and Saudi Arabia step up pressure in diplomatic crisis,’The 4-Hour Workweek’ author says a 3-step process he learned from Tony Robbins drastically improved his life,Rick Perry Attacked for Keeping His Word,Watch how a mathematician explains an astonishing coincidence,Second baby bald eagle begins hatching process at National Arboretum,...,"McDonald’s, Chick-fil-A, and Subway are making changes to profit off the shifting definition of ’healthy’",Bernie says the AP twisted his ‘messy’ convention prediction,Nevada’s Joe Heck Aims to Keep the Senate Republican,Donald Trump: John McCain and Lindsey Graham ‘Looking to Start World War III’,"Radiohead Announces New Album, Hear Another New Song",Roger Federer and Rafael Nadal roll back years for unexpected encore,Why voters like She’s Not Trump,Advanced Placement Tests To Hide History of Religion and Islamic Jihad in Europe,Little Simz,Wilhelmina to feature transgender model this fall
Hillary Clinton defends handling of Benghazi attack,0.019261,0.135737,0.21394,-0.040233,0.270479,0.242178,-0.059264,0.235375,0.039634,0.098044,...,0.077255,0.288595,0.195516,0.280349,0.027464,0.020349,0.281599,0.114124,0.017507,-0.070851
Women's March Highlights,0.045296,0.119294,-0.000519,0.065587,0.049757,-0.028455,-0.019417,0.070615,0.004875,0.16622,...,0.006225,0.1285,0.060778,0.00033,0.051476,0.025595,0.12345,0.030094,0.040689,0.124655
Hillary Clinton emails,0.036642,0.032883,0.058746,0.017771,0.199208,0.109163,-0.04811,0.151369,0.013369,0.053923,...,0.073164,0.322738,0.131154,0.256044,0.063897,0.047352,0.367134,-0.036238,0.014079,0.013729


In [0]:
# find index of 5 most similar titles 
top5_ind = Parallel(n_jobs=16)(delayed(find_top5ind)(inp_df, i) for i in range(len(inp_df)))

# show most similar titles -- final clusters
top5 = Parallel(n_jobs=16)(delayed(find_top5)(just_titles, ind) for ind in top5_ind)

top5[:5]
# topics: ["Hillary Clinton defends handling of Benghazi attack", "Women's March Highlights", "Hillary Clinton emails"]

[['Benghazi may be over, but #Benghazi has a life of its own',
  'Univision’s Ramos picked the wrong Benghazi question for Hillary Clinton',
  'Donald Trump on Benghazi: ’Hillary Clinton Decided to Go Home and Sleep’ ',
  'The GOP stoops for scandal',
  'Exclusive - Sarah Palin: Administration Lies, Soldiers Die. Yes, Hillary, at Every Point ‘It Matters’ '],
 ['The Women’s March on Washington is becoming a\xa0joke',
  ' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March',
  'Women will march on – What would a feminist do? podcast',
  'The Exhausting Work of Tallying America’s Largest Protest',
  'Women’s March organizers prepare for hundreds of thousands of protesters'],
 ['Donald Trump: ’Good Job, Huma. Thank you, Anthony Weiner’ ',
  'Fox News anchor grills Hillary Clinton on her email scandal',
  'Hillary Clinton’s email problems might be even worse than we thought',
  'Hillary Clinton pal Neera Tanden’s greatest hits from WikiLeaks emails',
  'Hillary Clinton Del

In [0]:
# index of most similar titles 
top5id = Parallel(n_jobs=16)(delayed(find_top5)(id_list, ind) for ind in top5_ind)

top5id[:5]

[[152203, 210482, 47436, 213302, 39123],
 [122748, 103164, 149141, 73686, 147468],
 [40299, 68550, 213322, 86663, 44841]]

In [0]:
# cleaned text for clusters 
top5text = Parallel(n_jobs=16)(delayed(find_top5)(small_text, ind) for ind in top5_ind)

top5text[:5]

[['Benghazi may be over, but #Benghazi has a life of its own To understand the story of Benghazi, you first have to understand there are two of them. The first Benghazi is the   city in Libya where in 2012 the US ambassador Chris Stevens and three other Americans were killed in an attack on a diplomatic station. The second is #Benghazi, a dumpster fire of conspiracy theories and nakedly political attacks, fanned by rightwing Twitter trolls and fed by the    conviction that, despite overwhelming evidence to the contrary, there’s a   coverup at work. The conclusive   Benghazi report has finally been issued. It has no major revelations nor any further evidence of wrongdoing by Hillary Clinton, who has already taken “full responsibility” for the tragedy  —   now get ready to hear a whole lot more about #Benghazi. A casual observer would be forgiven for thinking the report was already out there, as this is the eighth congressional panel to explore the attack, and in October Clinton sat for 

### Output Clusters

Don't re-run this because these files have already been created. 
- /News_Filter/Pilot/clusters
- cluster1.csv, cluster2.csv, cluster3.csv

In [0]:
# first example: "Hillary Clinton defends handling of Benghazi attack"
cluster1 = small_data.iloc[[i in top5id[0] for i in small_data.id]]
text1_df = pd.DataFrame({'id':top5id[0], 'text':top5text[0]})
cluster1 = cluster1.merge(text1_df, on='id')

# second example: "Women's March Highlights"
cluster2 = small_data.iloc[[i in top5id[1] for i in small_data.id]]
text2_df = pd.DataFrame({'id':top5id[1], 'text':top5text[1]})
cluster2 = cluster2.merge(text2_df, on='id')

# third example: "Hillary Clinton emails"
cluster3 = small_data.iloc[[i in top5id[2] for i in small_data.id]]
text3_df = pd.DataFrame({'id':top5id[2], 'text':top5text[2]})
cluster3 = cluster3.merge(text3_df, on='id')

In [0]:
# save each cluster data to csv
cluster1.to_csv("w266/finalproj/clusters/cluster1.csv", index=False)
cluster2.to_csv("w266/finalproj/clusters/cluster2.csv", index=False)
cluster3.to_csv("w266/finalproj/clusters/cluster3.csv", index=False)

FileNotFoundError: ignored

#### Sanity Check: Do Individual and Batched Inputs Produce Different Embeddings? No

In [0]:
a = model(["hi my name is emma"])
b = model(["nice to meet you"])
c = model(["hi what's your name"])

In [0]:
d = model(["hi my name is emma", "nice to meet you", "hi what's your name"])

In [0]:
e = np.concatenate((a, b, c))

In [0]:
f = np.array(d)

In [0]:
np.inner(e, e)

In [0]:
np.inner(f, f)
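Rather than comparing the two similarity matrices by eye, the embeddings themselves can be compared numerically. This is a minimal sketch with stand-in arrays instead of real encoder output; `embeddings_match` and its tolerance are assumptions, not part of the notebook.

```python
import numpy as np

def embeddings_match(one_by_one, batched, atol=1e-5):
    """True if both arrays have the same shape and agree within a small tolerance."""
    one_by_one = np.asarray(one_by_one)
    batched = np.asarray(batched)
    return one_by_one.shape == batched.shape and bool(
        np.allclose(one_by_one, batched, atol=atol)
    )

# Stand-ins for model([...]) output: three 2-dimensional "embeddings".
batched = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=np.float32)
one_by_one = np.concatenate([batched[0:1], batched[1:2], batched[2:3]])

print(embeddings_match(one_by_one, batched))
```

With the arrays from the cells above, the call would be `embeddings_match(e, f)`.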

### Summarization via Similarity Matrix

We now use the clusters saved above to summarize each article via a sentence similarity matrix.

In [0]:
# importing the csv files
lst_files = ['News_Filter/Pilot/clusters/cluster1.csv', 
             'News_Filter/Pilot/clusters/cluster2.csv',
             'News_Filter/Pilot/clusters/cluster3.csv']
clusters = Parallel(n_jobs=3)(delayed(pd.read_csv)(csv_file) for csv_file in lst_files)

In [0]:
# importing some dependencies
from nltk.cluster.util import cosine_distance
import networkx as nx

In [0]:
# what do these clusters look like?
clusters[0]

Unnamed: 0.1,index,Unnamed: 0,id,title,publication,author,date,year,month,url,content,text
0,20374,20377,39123,"Exclusive - Sarah Palin: Administration Lies, ...",Breitbart,Sarah Palin,2016-06-28,2016.0,6.0,,Listening to the Benghazi Committee reveal wha...,"Exclusive - Sarah Palin: Administration Lies, ..."
1,100217,103677,152203,"Benghazi may be over, but #Benghazi has a life...",Guardian,Lucia Graves,2016-06-28,2016.0,6.0,https://www.theguardian.com/commentisfree/2016...,"To understand the story of Benghazi, you first...","Benghazi may be over, but #Benghazi has a life..."
2,138814,142277,213302,The GOP stoops for scandal,Washington Post,Eugene Robinson,2016-07-04,2016.0,7.0,https://web.archive.org/web/20160705001849/htt...,The Republican yearning to pin a scandal on H...,The GOP stoops for scandal The Republican yea...
3,136585,140048,210482,Univision’s Ramos picked the wrong Benghazi qu...,Washington Post,Erik Wemple,2016-03-10,2016.0,3.0,https://web.archive.org/web/20160311000724/htt...,Jorge Ramos chose flash over substance. I...,Univision’s Ramos picked the wrong Benghazi qu...
4,28667,28675,47436,Donald Trump on Benghazi: ’Hillary Clinton Dec...,Breitbart,Alex Swoyer,2016-04-27,2016.0,4.0,,"GOP frontrunner Donald Trump, looking toward t...",Donald Trump on Benghazi: ’Hillary Clinton Dec...


Each member of `clusters` is already a data frame, so what we want to summarize is the text of each row within each data frame.

In [0]:
clusters[0].columns

Index(['index', 'Unnamed: 0', 'id', 'title', 'publication', 'author', 'date',
       'year', 'month', 'url', 'content', 'text'],
      dtype='object')

In [0]:
# let's create variables for the text of each cluster
clust_cont1 = clusters[0]['text'] 
clust_cont2 = clusters[1]['text']
clust_cont3 = clusters[2]['text']

Some functions to work with:

In [0]:
def read_content(piece):
  article = piece.split(". ")
  sentences = []

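  for sentence in article:
    #print(sentence)
    # NOTE: str.replace() matches "[^a-zA-Z]" literally rather than as a regex,
    # so this call leaves the sentence unchanged; re.sub() would be needed to strip non-letters
    sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    #sentences.pop()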

  return sentences
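`str.replace` treats its first argument as literal text, so a character class such as `[^a-zA-Z]` only takes effect through `re.sub`. The short comparison below uses a made-up sentence to show the difference.

```python
import re

sentence = "Benghazi report: 8th panel, #Benghazi!"

# str.replace looks for the literal text "[^a-zA-Z]", which never occurs here
literal = sentence.replace("[^a-zA-Z]", " ")

# re.sub interprets the pattern as a regex and replaces every non-letter with a space
cleaned = re.sub(r"[^a-zA-Z]", " ", sentence)

# split() without arguments collapses the runs of spaces left behind
tokens = cleaned.split()

print(literal == sentence)
print(tokens)
```

Switching `read_content` to `re.sub` would change the tokens, and therefore the summaries, so the outputs shown in this notebook would no longer match.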

In [0]:
# measuring similarity
def sentence_similarity(sent1, sent2, stopwords=None):
  if stopwords is None:
    stopwords=[]

  sent1 = [w.lower() for w in sent1]
  sent2 = [w.lower() for w in sent2]

  all_words = list(set(sent1 + sent2))

  vector1 = [0] * len(all_words)
  vector2 = [0] * len(all_words)

  # build vector for the first sentence
  for w in sent1:
    if w in stopwords:
      continue
    vector1[all_words.index(w)] += 1

  # build vector for second sentence
  for w in sent2:
    if w in stopwords:
      continue
    vector2[all_words.index(w)] += 1

  return 1 - cosine_distance(vector1, vector2)
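When every word in a sentence is a stop word, its count vector is all zeros and the cosine computation divides by zero. The sketch below is a standard-library variant of the same bag-of-words similarity that returns 0.0 in that case. `bow_cosine` is a hypothetical name, and the zero fallback is a design choice rather than what the function above does.

```python
import math
from collections import Counter

def bow_cosine(sent1, sent2, stop_words=()):
    """Cosine similarity between two tokenized sentences using word counts."""
    stop = set(stop_words)
    counts1 = Counter(w.lower() for w in sent1 if w.lower() not in stop)
    counts2 = Counter(w.lower() for w in sent2 if w.lower() not in stop)

    # only words present in both sentences contribute to the dot product
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # no content words left in at least one sentence
    return dot / (norm1 * norm2)

print(bow_cosine(["Clinton", "email"], ["clinton", "march"]))
print(bow_cosine(["the"], ["the", "report"], stop_words=["the"]))
```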

In [0]:
# similarity matrix: cos sim to find similarity btw sent
def build_sim_matrix(sentences, stop_words):
  # create empty sim matrix
  sim_matrix = np.zeros((len(sentences), len(sentences)))

  for i in range(len(sentences)):
    for j in range(len(sentences)):
      if i == j: # skip if both are same sent
        continue
      sim_matrix[i][j] = sentence_similarity(sentences[i],
                                             sentences[j],
                                             stop_words)
  return sim_matrix

In [0]:
# generate summary method
def generate_summary(content, top_n=5):
  stop_words = stopwords.words('english')
  summarized_text=[]

  # first read in article and tokenize
  sentences = read_content(content)

  # second generate sim matrix across sents
  sent_sim_mat = build_sim_matrix(sentences, stop_words)

  # third rank sentences in sim matrix
  sent_sim_graph = nx.from_numpy_array(sent_sim_mat)

  scores = nx.pagerank(sent_sim_graph, max_iter=5000)

  # fourth sort the rank and pick top sent
  ranked_sent = sorted(((scores[i],s) for i,s in enumerate(sentences)),
                       reverse=True)

  for i in range(top_n):
    summarized_text.append(" ".join(ranked_sent[i][1]))

  # fifth, output the summarized text
  output = ". ".join(summarized_text)
  #print("Summarized Text: \n", ". ".join(summarized_text))

  return output
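To show what the ranking step does with the similarity matrix, here is a rough power-iteration sketch of PageRank on a weighted graph. It is not networkx's implementation: the function name `pagerank_scores`, the convergence tolerance, and the handling of rows without any similarity are assumptions. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=5000):
    """Score nodes of a weighted graph given its similarity (adjacency) matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # row-normalize; rows with no similarity to anything spread their weight uniformly
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)

    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Toy graph (hypothetical): sentence 0 is similar to both others,
# while sentences 1 and 2 are not similar to each other.
sim = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
scores = pagerank_scores(sim)
print(scores.argmax())
```

The sentence that is similar to the most other sentences receives the highest score, which is why it ends up in the summary.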

In [0]:
def get_summaries(cluster_name):
  '''
  input: list of articles in cluster
  output: dictionary with valid summaries
  '''

  summa = {}
  for j in range(len(cluster_name)):
    try:
      out = generate_summary(cluster_name[j])
      summa[j] = out
    except Exception:
      # skip articles that cannot be summarized (e.g. fewer than top_n sentences)
      continue
  
  return summa


Finally, here is a call per cluster.

In [0]:
first = get_summaries(clust_cont1.values)
second = get_summaries(clust_cont2.values)
third = get_summaries(clust_cont3.values)



In [0]:
clust_cont1

0    Exclusive - Sarah Palin: Administration Lies, ...
1    Benghazi may be over, but #Benghazi has a life...
2    The GOP stoops for scandal  The Republican yea...
3    Univision’s Ramos picked the wrong Benghazi qu...
4    Donald Trump on Benghazi: ’Hillary Clinton Dec...
Name: text, dtype: object

In [0]:
first

{0: 'Yes, Hillary, at Every Point ‘It Matters’  Listening to the Benghazi Committee reveal what we knew  —   the Obama Administration’s devastating lies cost lives, and at any point it just doesn’t matter to them. Much like today’s refusal to believe real threats  —   denying Islam put the “I” in ISIS  —   they are wrapping America’s children in a suicide vest..  Hillary left our men behind. They sat back and allowed our peacekeepers to die during what they witnessed becoming a suicide mission. Exclusive - Sarah Palin: Administration Lies, Soldiers Die',
 1: 'The second biggest upshot is likely the personal misfortune for Kevin McCarthy, who  —   as result of his comment suggesting the entire   Benghazi investigation is a patently political attempt to tar Clinton  —   is not the House speaker. The conclusive   Benghazi report has finally been issued. At the time, the real smoking gun of the hearing from Matt Drudge’s perspective was a coughing fit and slow speech, which he claimed at t

In [0]:
clust_cont2[0]

' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March ’ Earlier this morning, protesters marched down Main Street in Park City, Utah  —   where the Sundance Film Festival is currently taking place  —   in a Women’s March organized by Chelsea Handler.’ ’ BuzzFeed News was on site and asked women why they were participating in the march. Here are their answers:’ ’ “Marching for my grandchildren & 30 years of students in public schools”’ ’ “Trump is not an ally to the colorful, complicated, spirited people of this country! ”’ ’ “We need to advocate for EVERYONE, regardless of political climate”’ ’ “Equality for WOMEN  —   no harassment”’ ’ “I march for THE FUTURE”’ ’ “I’m a school counselor #publicschool”’ ’ “Inequality is unsustainable”’ ’ “I’m here to stand up for the rights of those that are threatened  —   minorities, women, & OUR EARTH”’ ’ “To put Trump on notice”’ ’ “I will no longer be silent”’ ’ “For the radical idea that women are people, too”’ ’ “Future is fema

In [0]:
second

{0: ' Women Share Why They Hit The Streets At The Sundance Women’s\xa0March ’ Earlier this morning, protesters marched down Main Street in Park City, Utah  —   where the Sundance Film Festival is currently taking place  —   in a Women’s March organized by Chelsea Handler.’ ’ BuzzFeed News was on site and asked women why they were participating in the march. ‘Balance’ ‘Spiritual’ Beginning Life Happiness ‘Fire in the Home’  —   LOVE”’ ’ “I marched for Civil Rights. ”’ ’ “So I can be anything when I grow up”’]   . Here are their answers:’ ’ “Marching for my grandchildren & 30 years of students in public schools”’ ’ “Trump is not an ally to the colorful, complicated, spirited people of this country! ”’ ’ “We need to advocate for EVERYONE, regardless of political climate”’ ’ “Equality for WOMEN  —   no harassment”’ ’ “I march for THE FUTURE”’ ’ “I’m a school counselor #publicschool”’ ’ “Inequality is unsustainable”’ ’ “I’m here to stand up for the rights of those that are threatened  —   m

In [0]:
clust_cont3[0]

'Hillary Clinton pal Neera Tanden’s greatest hits from WikiLeaks emails Tanden has a potty mouth, and talks a lot of trash, but she’s also mostly right whenever she complains about something, such as the absurd secrecy of Hillary’s inner circle of friends, who withheld information about her email habits at the State Department because “they wanted to get away with it. ” Here are some of Tanden’s greatest hits: “    insane” That’s how Tanden described the scandal involving Hillary Clinton’s use of a private email server as secretary of state. “Do we actually know who told Hillary she could use a private email?” she wrote in an email to Clinton campaign chairman John Podesta in 2015. “And has that person been drawn and quartered?” “    these   ” See the Fox News 2016 battleground prediction map and make your own election projections. See Predictions Map → Those were Tanden’s choice words for liberals criticizing Hillary Clinton for using the phrase “all lives matter. ” “He’s a    ” Tande

In [0]:
third

{0: '” Here are some of Tanden’s greatest hits: “    insane” That’s how Tanden described the scandal involving Hillary Clinton’s use of a private email server as secretary of state. “And has that person been drawn and quartered?” “    these   ” See the Fox News 2016 battleground prediction map and make your own election projections. CLICK TO READ MORE FROM HEAT STREET,   . ” “He’s a    ” Tanden expressed her displeasure when she learned that Democratic operative Faiz Shakir, who currently works for Senate Minority Leader Harry Reid, had started advising Hillary Clinton’s primary challenger Bernie Sanders. See Predictions Map → Those were Tanden’s choice words for liberals criticizing Hillary Clinton for using the phrase “all lives matter',
 1: 'Comey that during its investigation, the FBI found several thousand   emails that had not been previously turned over by Clinton. Hillary Clinton Deleted More Emails Than She Sent to the State Department  Chris Cillizza of the Washington Post co

### Doing the Summary of the Summaries
Feed the summaries produced above back into the summarizer to get a single summary per cluster.

So:
- First: a dictionary of the summaries made for each text in the first cluster. Each key is the article's position within that cluster; each value is that article's summary.
- Second: follows the same structure as above.
- Third: follows the same structure too.
- Fourth: same for this one.

In [0]:
# so let's just make it work with the first outputted clusters.
first

{0: 'Yes, Hillary, at Every Point ‘It Matters’  Listening to the Benghazi Committee reveal what we knew  —   the Obama Administration’s devastating lies cost lives, and at any point it just doesn’t matter to them. Much like today’s refusal to believe real threats  —   denying Islam put the “I” in ISIS  —   they are wrapping America’s children in a suicide vest..  Hillary left our men behind. They sat back and allowed our peacekeepers to die during what they witnessed becoming a suicide mission. Exclusive - Sarah Palin: Administration Lies, Soldiers Die',
 1: 'The second biggest upshot is likely the personal misfortune for Kevin McCarthy, who  —   as result of his comment suggesting the entire   Benghazi investigation is a patently political attempt to tar Clinton  —   is not the House speaker. The conclusive   Benghazi report has finally been issued. At the time, the real smoking gun of the hearing from Matt Drudge’s perspective was a coughing fit and slow speech, which he claimed at t

In [0]:
frame_first = pd.DataFrame.from_dict(first, orient= 'index')
frame_first

Unnamed: 0,0
0,"Yes, Hillary, at Every Point ‘It Matters’ Lis..."
1,The second biggest upshot is likely the person...
4,Donald Trump on Benghazi: ’Hillary Clinton Dec...


In [0]:
# concatenate all summaries in the cluster into one string
entire_text = ["".join(frame_first[0])]


In [0]:
# first summary of first cluster/topic:
generate_summary(entire_text[0])

'Much like today’s refusal to believe real threats  —   denying Islam put the “I” in ISIS  —   they are wrapping America’s children in a suicide vest.. Exclusive - Sarah Palin: Administration Lies, Soldiers DieThe second biggest upshot is likely the personal misfortune for Kevin McCarthy, who  —   as result of his comment suggesting the entire   Benghazi investigation is a patently political attempt to tar Clinton  —   is not the House speaker. At the time, the real smoking gun of the hearing from Matt Drudge’s perspective was a coughing fit and slow speech, which he claimed at the time was “obvious” evidence she was  —   horror of horrors  —   taking   meds. Never mind that we’ve now had more investigations of Benghazi than every   terror attack in recent history combinedDonald Trump on Benghazi: ’Hillary Clinton Decided to Go Home and Sleep’  GOP frontrunner Donald Trump, looking toward the general election, criticized Democrat frontrunner Hillary Clinton and President Obama’s foreig

In [0]:
def summary_summaries(cluster):
  """
  Summarize a whole cluster from its per-article summaries.

  cluster: dictionary (key: article number, value: summary text)
  returns: one summary of the concatenated summaries
  """
  # join the summaries back to back (no separator, as in the cell above)
  concat_ = "".join(cluster.values())

  return generate_summary(concat_)
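
One side effect of concatenating without a separator is that the last sentence of one summary fuses with the first sentence of the next, which is visible in the output above as "Soldiers DieThe second biggest upshot". A possible variation, sketched below, joins with a separator instead. The helper name `join_summaries` and the toy dictionary are assumptions of ours, and which separator works best depends on how `read_content` splits sentences.

```python
def join_summaries(cluster, sep=" "):
    """Join a cluster's summaries with a separator so sentences stay apart."""
    return sep.join(text.strip() for text in cluster.values())


# toy cluster shaped like the dictionaries above: article number -> summary
toy_cluster = {0: "Soldiers Die", 4: "The second biggest upshot."}

fused = "".join(toy_cluster.values())   # behaviour of plain concatenation
spaced = join_summaries(toy_cluster)    # same texts, kept apart

print(fused)
print(spaced)
```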



**Here is the summary of summaries for each cluster.**

In [0]:
first_summary = summary_summaries(first)
second_summary = summary_summaries(second)
third_summary = summary_summaries(third)

In [0]:
# outputting for the sake of the github repo
first_summary

'Much like today’s refusal to believe real threats  —   denying Islam put the “I” in ISIS  —   they are wrapping America’s children in a suicide vest.. Exclusive - Sarah Palin: Administration Lies, Soldiers DieThe second biggest upshot is likely the personal misfortune for Kevin McCarthy, who  —   as result of his comment suggesting the entire   Benghazi investigation is a patently political attempt to tar Clinton  —   is not the House speaker. At the time, the real smoking gun of the hearing from Matt Drudge’s perspective was a coughing fit and slow speech, which he claimed at the time was “obvious” evidence she was  —   horror of horrors  —   taking   meds. Never mind that we’ve now had more investigations of Benghazi than every   terror attack in recent history combinedDonald Trump on Benghazi: ’Hillary Clinton Decided to Go Home and Sleep’  GOP frontrunner Donald Trump, looking toward the general election, criticized Democrat frontrunner Hillary Clinton and President Obama’s foreig

In [0]:
second_summary

'The Women’s March on Washington is becoming a\xa0joke Is the Women’s March on Washington cursed? The   protest set for JanThe march organizers are still officially only expecting a crowd of 200, 000 to 250, 000 people  —   mostly women, but also many men  —   based on the number who have registered.  Women Share Why They Hit The Streets At The Sundance Women’s\xa0March ’ Earlier this morning, protesters marched down Main Street in Park City, Utah  —   where the Sundance Film Festival is currently taking place  —   in a Women’s March organized by Chelsea Handler.’ ’ BuzzFeed News was on site and asked women why they were participating in the march. As always, we want to hear from you: Please join the conversation in the comments below.” You see, 1997 saw a Million Woman March in DC for    —   a follow to Louis Farrakhan’s 1995 Million Man March. ”’ ’ “So I can be anything when I grow up”’]   . But the roar of the masses will surely reach the president’s ears later, as the speeches and 

In [0]:
third_summary

'             Caption   The former secretary of state clinched the number of delegates needed to be the Democratic nominee for president.  Melina   Washington Post  . Hillary Clinton’s email problems might be even worse than we thought        Here’s the good news for Hillary Clinton: The FBI has recommended that no charges be brought following its investigation of the former secretary of state’s private email server.   …     [W]e know from FBI Director James B. Thank you, Anthony Weiner.”  —   @realDonaldTrump on @HillaryClinton email investigation pic'

**Now we would like to see whether these summaries fall into the same clusters as before.**

We will use the Universal Sentence Encoder to cluster these summaries (one per cluster) and determine whether summarization has lost some of the semantic meaning behind each article.

Recall the topics we had:
1. Hillary Clinton Benghazi
2. Women's March
3. Hillary Clinton emails (judging by the third cluster's articles above)


In [0]:
# This step was already run earlier, so it is left commented out.

# Following Em's code:
# small data is already loaded in this notebook
# model is already loaded from folder 'tmp'

#emb_model = hub.load('tmp')
#logging.set_verbosity(logging.ERROR)

# compute embeddings for each article in 'small_data'
#train_embeddings = emb_model(small_data.content)

In [0]:
# get clusters
clusters = pd.read_csv('clusters.csv')

In [0]:
clusters

Unnamed: 0,id,title,publication,author,date,year,month,url,content,text,cluster_labels
0,45558,Paul Ryan Backs President Obama’s Sanctions Ov...,Breitbart,Adelle Nazarian,2016-12-29,2016.0,12.0,,House Speaker Paul Ryan issued a statement on ...,Paul Ryan Backs President Obama’s Sanctions Ov...,1
1,72300,Putin’s chaos strategy is coming back to bite him,Business Insider,,2016-10-31,2016.0,10.0,,"’ ’ ’ Back in March, when the U. S. electio...",Putin’s chaos strategy is coming back to bite ...,1
2,194362,Russia intervened to help Trump win election: ...,Reuters,John Walcott,2016-12-10,2016.0,12.0,http://www.reuters.com/article/us-usa-election...,U. S. intelligence analysts have concluded th...,Russia intervened to help Trump win election: ...,1
3,59828,Russian hacking activity continues unabated,CNN,Shimon Prokupecz,2016-12-16,2016.0,12.0,,(CNN) Russian cyberhacking activity has conti...,Russian hacking activity continues unabated (...,1
4,92926,Trump: Special Counsel Will Prove My Team Didn...,Talking Points Memo,,,,,https://web.archive.org/web/20170518003300/htt...,President Donald Trump responded Wednesday to ...,Trump: Special Counsel Will Prove My Team Didn...,1
5,205453,Russia’s assault on America’s elections is jus...,Washington Post,David Ignatius,2017-02-23,2017.0,2.0,https://web.archive.org/web/20170224004841/htt...,One of the most startling allegations in a Ja...,Russia’s assault on America’s elections is jus...,1
6,67261,Top-secret NSA report: Russian hackers tried t...,Business Insider,Natasha Bertrand,2017-06-06,2017.0,6.0,,’ ’ ’ Hackers associated with Russia’’s mil...,Top-secret NSA report: Russian hackers tried t...,1
7,72962,Clinton campaign backs effort to brief Elector...,Business Insider,Pamela Engel,2016-12-13,2016.0,12.0,,’ ’ ’ A top adviser to Hillary Clinton said...,Clinton campaign backs effort to brief Elector...,1
8,72358,The US government is launching a massive effor...,Business Insider,Mark Abadi,2016-11-04,2016.0,11.0,,’ ’ ” The US government is launching a mass...,The US government is launching a massive effor...,1
9,31540,Clinton Aide Jennifer Palmieri Blasts Trump fo...,Breitbart,Adam Shaw,2017-05-16,2017.0,5.0,,Former Clinton aide Jennifer Palmieri has blas...,Clinton Aide Jennifer Palmieri Blasts Trump fo...,1


In [0]:
# summarize every article in clusters
cluster_summaries = []
for i in range(1,6):
  # argument below is the list of articles for each cluster
  summaries = get_summaries(clusters[clusters.cluster_labels == i].reset_index().content)
  cluster_summaries.append(summaries)



In [0]:
cluster_summaries

[{3: ' The attempts appear to trace back to some of the same Russian hackers behind the breaches of Democratic Party organizations in the past year, one official said.  Russians ”continue to do all kinds of stuff” against American political organizations, think tanks and thought leaders another US official said.  (CNN) Russian cyberhacking activity has continued largely unabated since the November election, including against US political organizations, US officials briefed on the investigation told CNN on Thursday.  Among the attempted cyber breaches are phishing attempts targeting Democratic presidential nominee Hillary Clinton’s campaign, one official says.   ”It’s not like the one and done deal here',
  6: '’ ’   The NSA document offers the  ’ ’  ’ ’  ’ ’   ’ ’’ ’  ’ ’  ’ ’  ’ ’   ’ ’  ’ ” Kremlin had ordered cyberattacks on the   Democratic National Committee and Hillary Clinton’s campaign   chairman, John Podesta, during the 2016 election. ’  ’ ’   Hackers associated with Russia’’

**Now make a summary of summaries for each cluster**

In [0]:
# summarize the summaries of each cluster output above
# note that each dictionary in the list denotes a cluster
# there are five clusters
len(cluster_summaries)

5

In [0]:

# iterate through the list of clusters
# each element in the list is a dictionary,
# so summary_summaries works on it directly

summary_of_summaries = []
for summaries in cluster_summaries:
  # this is a summary of concatenated text
  summario = summary_summaries(summaries)
  summary_of_summaries.append(summario)


In [0]:
len(summary_of_summaries)

5

### Evaluate Summary of Summaries with the Universal Sentence Encoder
Goal: each summary should be clustered with the original articles used to create it.

In [0]:
# create embeddings for each user summary
summary_embeddings = model(summary_of_summaries)

In [0]:
# data frame of titles and semantic similarities
cos_df = pd.DataFrame(cosine_similarity(summary_embeddings, small_embeddings))
cos_df.columns = small_data.title
cos_df.index = [summary_of_summaries[i][:50] for i in range(len(summary_of_summaries))]

cos_df.shape

(5, 13000)

In [0]:
# func to return the column index of the top n values in a row of a dataframe

def find_topind(df, i ,n):
  return list(list(zip(*heapq.nlargest(n, enumerate(df.iloc[i,:]), key=operator.itemgetter(1))))[0])

# function to return the elements of a list at the given indices
def find_top(lst, ind):
  return [lst[i] for i in ind]

# how many articles per cluster
n = 10

#find index of n most similar articles
top_ind = Parallel(n_jobs=16)(delayed(find_topind)(cos_df, i, n) for i in range(len(cos_df)))
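
The one-liner in `find_topind` is dense, so here is the same idea unpacked on a plain list. `heapq.nlargest` is applied to `(index, value)` pairs and compares them by value, and we then keep only the indices. The name `top_n_indices` and the toy row of similarities are illustrative assumptions.

```python
import heapq
from operator import itemgetter


def top_n_indices(values, n):
    """Indices of the n largest values, from largest to smallest."""
    # enumerate gives (index, value); itemgetter(1) ranks pairs by value
    pairs = heapq.nlargest(n, enumerate(values), key=itemgetter(1))
    return [idx for idx, _ in pairs]


row = [0.12, 0.80, 0.35, 0.79, 0.05]   # toy similarities for one summary
print(top_n_indices(row, 3))           # [1, 3, 2]
```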

In [0]:
# ids of most similar articles 
top_id = Parallel(n_jobs=16)(delayed(find_top)(small_data.id, ind) for ind in top_ind)

top_id

[[67261, 72358, 194362, 58394, 72962, 87784, 73049, 66558, 59828, 45558],
 [70464, 189841, 55757, 42840, 213322, 120746, 38810, 42339, 202858, 156620],
 [103164, 157279, 113231, 37346, 122748, 58922, 183701, 50953, 147468, 28692],
 [199737, 96452, 67766, 39532, 55909, 68874, 49239, 209446, 213842, 211304],
 [34717, 31626, 45658, 29924, 163918, 94510, 203153, 29675, 87437, 120670]]

In [0]:
# ids of original articles  
og_ids = []
for i in range(1,6):
  cluster = clusters[clusters.cluster_labels == i]
  og_ids.append(list(cluster.id))

og_ids

[[45558, 72300, 194362, 59828, 92926, 205453, 67261, 72962, 72358, 31540],
 [70464, 49591, 56439, 85727, 92404, 68550, 40299, 39429, 213302, 86771],
 [34095, 103164, 48701, 73686, 74730, 60705, 38222, 122748, 147468, 161410],
 [96444, 39532, 117693, 68874, 199737, 67766, 97167, 55909, 96452, 49182],
 [28111, 214420, 80967, 45658, 46778, 120670, 163918, 205111, 163453, 34717]]

In [0]:
# proportion of original articles clustered with summaries
np.mean([sum(art_id in top_id[i] for art_id in og_ids[i]) / len(og_ids[i]) for i in range(len(og_ids))])


0.4
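
The single average hides how much the clusters differ. As a small sketch, the per-cluster overlap can be recomputed in plain Python from the two id lists printed above. The variable names `matches` and `per_cluster` are our own.

```python
# ids copied from the `top_id` and `og_ids` outputs above
top_id = [
    [67261, 72358, 194362, 58394, 72962, 87784, 73049, 66558, 59828, 45558],
    [70464, 189841, 55757, 42840, 213322, 120746, 38810, 42339, 202858, 156620],
    [103164, 157279, 113231, 37346, 122748, 58922, 183701, 50953, 147468, 28692],
    [199737, 96452, 67766, 39532, 55909, 68874, 49239, 209446, 213842, 211304],
    [34717, 31626, 45658, 29924, 163918, 94510, 203153, 29675, 87437, 120670],
]
og_ids = [
    [45558, 72300, 194362, 59828, 92926, 205453, 67261, 72962, 72358, 31540],
    [70464, 49591, 56439, 85727, 92404, 68550, 40299, 39429, 213302, 86771],
    [34095, 103164, 48701, 73686, 74730, 60705, 38222, 122748, 147468, 161410],
    [96444, 39532, 117693, 68874, 199737, 67766, 97167, 55909, 96452, 49182],
    [28111, 214420, 80967, 45658, 46778, 120670, 163918, 205111, 163453, 34717],
]

# number of original articles recovered among each summary's nearest articles
matches = [len(set(orig) & set(top)) for orig, top in zip(og_ids, top_id)]
per_cluster = [m / len(orig) for m, orig in zip(matches, og_ids)]
overall = sum(per_cluster) / len(per_cluster)

print(matches)      # [6, 1, 3, 6, 4]
print(per_cluster)
print(round(overall, 2))
```

Read this way, the first and fourth summaries recover six of their ten source articles each, while the second recovers only one.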