
# Evaluation of Summarization with BERT

In [0]:
# install libraries 

#!pip install bert-extractive-summarizer

#!pip install spacy==2.1.3
#!pip install transformers==2.2.2
#!pip install neuralcoref

#!pip install torch

In [0]:
# import statements 

import pandas as pd
import numpy as np

from summarizer import Summarizer

from sklearn.metrics.pairwise import cosine_similarity

from joblib import Parallel, delayed

import heapq
import operator

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

## Load Universal Sentence Encoder and Training Data for Evaluation

In [0]:
# import original training data
small_data = pd.read_csv("news_filter/data/small_data.csv")

In [0]:
# download model from https://tfhub.dev/google/universal-sentence-encoder/4 and save locally 
emb_model = hub.load("news_filter/tmp")

In [0]:
# reduce logging output
logging.set_verbosity(logging.ERROR)

# compute embeddings for each article
train_embeddings = emb_model(small_data.content)

## Create Summaries for Clusters from Training Data

In [0]:
# import cluster data

clusters = pd.read_csv("news_filter/data/clusters.csv")

In [0]:
# instantiate summarizer
model = Summarizer()

# return a dict mapping row position to the summary of each article in a cluster
def make_summaries(cluster):
  result = {}
  for i, text in enumerate(cluster.content):
    # the summarizer returns the summary as a single string
    result[i] = model(text, min_length=50, ratio=0.20)
  return result

In [0]:
# summarize every article in clusters
cluster_summaries = []
for i in range(1,6):
  summaries = make_summaries(clusters[clusters.cluster_labels == i].reset_index())
  cluster_summaries.append(summaries)

#Parallel(n_jobs=16)(delayed(make_summaries)(clusters[clusters.cluster_labels == i].reset_index()) for i in range(1,6)) # pickling error 
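
The commented-out `Parallel` call above is marked as failing with a pickling error. One possible workaround, sketched here under the assumption that the error comes from sending the model to worker processes, is to run the per-cluster work in threads, which share memory and so need no pickling. The sketch uses the standard library's `ThreadPoolExecutor` rather than joblib, and `fake_summarize` and `toy_clusters` are hypothetical stand-ins so the block runs without the BERT model. Whether threads actually speed up the summarizer has not been tested here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for make_summaries: keeps the first sentence of each text.
def fake_summarize(texts):
    return {i: text.split(". ")[0] for i, text in enumerate(texts)}

# Toy "clusters"; in the notebook these would be the per-label data frames.
toy_clusters = [
    ["First story. More detail.", "Second story. Extra."],
    ["Third story. Background."],
]

# Threads share the parent's memory, so nothing has to be pickled.
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the same order as the inputs
    parallel_summaries = list(pool.map(fake_summarize, toy_clusters))

print(parallel_summaries)
```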

In [9]:
cluster_summaries

[{0: 'House Speaker Paul Ryan issued a statement on Thursday backing President Barack Obama’s new sanctions against Russia. [ However, there is no proof that the Russian government was involved to this effect. The Hill reported, [w]hile lawmakers were seemingly united on the need to present a strong bipartisan response, the FBI and CIA gave lawmakers differing accounts on Russia’s motives, according to The Post. ” Follow Adelle Nazarian on Twitter and Periscope @AdelleNaz.',
  1: '’  ’ ’   Back in March, when the U. S. elections still seemed far away  —     back before anyone had heard the name Fancy Bear and before   everyone knew John Podesta’s risotto secrets  —   I was in Moscow   talking to a Russian who had previously worked in the Kremlin. ’ ’ Over the course of a   conversation, it became clear   that we agreed on one key characteristic of Vladimir Putin. This is tame, by the way, in comparison with the      rhetoric of Russian TV host and propagandist Dmitry   Kiselyov, who cl

## Create Summary of Summaries for each Cluster

In [0]:
# summarize summaries of each cluster 
summary_of_summaries = []
for summaries in cluster_summaries:
  summary = ' '.join(list(summaries.values()))
  summary_of_summaries.append(model(summary))

In [11]:
summary_of_summaries

 '’  ’ ’   Donald Trump said in a Tuesday statement that ”our adversaries   almost certainly have a blackmail file” on Hillary Clinton after   FBI Director James Comey announced it was certainly ”possible”   hostile actors gained access to her private email account. ’ ’ The revelation comes just days after the leak of thousands of Democratic National Committee emails    US officials allege Russian hackers    prompted major turmoil within the party, causing the abrupt resignation of its chairwoman, Rep. Debbie Wasserman Schultz. Based on the FBI investigative file, including notes from Clinton’s July interview, Gowdy said it doesn’t appear agents pressed Clinton on why she set up the server. “ She lied under oath when she turned over all of her work related emails. The Republican yearning to pin a scandal on Hillary Clinton knows no bounds. Here are the 5 most serious accusations in the report. ( Know you love her, but this stuff is like her Achilles heal.',
 'Friday on ABC’s “The View,

## Evaluate Summary of Summaries with Universal Sentence Encoder

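Goal: each cluster summary should be most similar to the original articles it was built from, so those articles should appear among the summary's nearest neighbors in embedding space.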
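
As a rough illustration of this idea, the sketch below ranks two articles against a summary by cosine similarity. The 3-dimensional vectors and the article names are made up for the example; the real Universal Sentence Encoder embeddings are much higher-dimensional, and the notebook uses scikit-learn's `cosine_similarity` rather than this hand-written helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" (hypothetical values, not real encoder output)
summary_vec = [1.0, 0.0, 1.0]
article_vecs = {
    "source_article": [0.9, 0.1, 0.8],
    "unrelated_article": [0.0, 1.0, 0.1],
}

# Score every article against the summary and rank from most to least similar
scores = {name: cosine(summary_vec, vec) for name, vec in article_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

If the summary is faithful, an article it was built from should rank above unrelated articles, as `source_article` does in this toy case.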

In [0]:
# create embeddings for each user summary 
summary_embeddings = emb_model(summary_of_summaries)

In [13]:
# data frame of titles and semantic similarities
cos_df = pd.DataFrame(cosine_similarity(summary_embeddings, train_embeddings))
cos_df.columns = small_data.title
cos_df.index = [summary[:50] for summary in summary_of_summaries]

cos_df.shape

(5, 13000)

In [0]:
# function to return the column index of the top n values in a row of a dataframe
def find_topind(df, i, n):
  return list(list(zip(*heapq.nlargest(n, enumerate(df.iloc[i,:]), key=operator.itemgetter(1))))[0])

# function to return the elements of a list at the given indices
def find_top(lst, ind):
  return [lst[i] for i in ind]

# how many articles per cluster
n = 10

# find index of n most similar articles 
top_ind = Parallel(n_jobs=16)(delayed(find_topind)(cos_df, i, n) for i in range(len(cos_df)))
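
The nested `zip`/`list` expression in `find_topind` can be hard to read. The sketch below applies the same logic to a plain list and compares it with a shorter variant that ranks the positions directly. `find_topind_original`, `find_topind_simple`, and `toy_row` are names introduced only for this illustration.

```python
import heapq
import operator

def find_topind_original(row, n):
    # Same logic as find_topind, applied to a plain list instead of a data frame row.
    return list(list(zip(*heapq.nlargest(n, enumerate(row), key=operator.itemgetter(1))))[0])

def find_topind_simple(row, n):
    # Rank the positions directly, using each position's value as the sort key.
    return heapq.nlargest(n, range(len(row)), key=row.__getitem__)

toy_row = [0.12, 0.87, 0.45, 0.91, 0.30]
print(find_topind_original(toy_row, 3))  # [3, 1, 2]
print(find_topind_simple(toy_row, 3))    # [3, 1, 2]
```

Both versions return the positions of the `n` largest values, ordered from most to least similar.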

In [15]:
# ids of most similar articles 
top_id = Parallel(n_jobs=16)(delayed(find_top)(small_data.id, ind) for ind in top_ind)

top_id

[[194362, 31540, 203681, 148535, 205453, 21609, 52955, 38873, 73049, 204147],
 [72262, 86661, 70464, 189841, 40329, 86746, 216655, 85727, 56856, 121493],
 [161410, 157279, 73686, 147468, 149141, 74730, 40421, 30697, 50953, 42962],
 [68874, 55909, 96452, 39532, 199737, 209550, 212779, 97167, 67766, 68517],
 [205111, 163453, 28111, 80967, 46778, 38958, 29675, 87437, 214420, 120670]]

In [19]:
# ids of original articles  
og_ids = []
for i in range(1,6):
  cluster = clusters[clusters.cluster_labels == i]
  og_ids.append(list(cluster.id))

og_ids

[[45558, 72300, 194362, 59828, 92926, 205453, 67261, 72962, 72358, 31540],
 [70464, 49591, 56439, 85727, 92404, 68550, 40299, 39429, 213302, 86771],
 [34095, 103164, 48701, 73686, 74730, 60705, 38222, 122748, 147468, 161410],
 [96444, 39532, 117693, 68874, 199737, 67766, 97167, 55909, 96452, 49182],
 [28111, 214420, 80967, 45658, 46778, 120670, 163918, 205111, 163453, 34717]]

In [36]:
# proportion of original articles found among each summary's top-n most similar articles
np.mean([sum(art_id in top_id[i] for art_id in og_ids[i]) / len(og_ids[i]) for i in range(len(og_ids))])

0.45999999999999996
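
The single averaged figure hides how much the clusters differ. As a small follow-up, the sketch below recomputes the overlap per cluster from the `top_id` and `og_ids` values printed above, using set intersection instead of NumPy; the variable names `hits` and `recall` are introduced here for illustration.

```python
# IDs copied from the outputs shown above
top_id = [
    [194362, 31540, 203681, 148535, 205453, 21609, 52955, 38873, 73049, 204147],
    [72262, 86661, 70464, 189841, 40329, 86746, 216655, 85727, 56856, 121493],
    [161410, 157279, 73686, 147468, 149141, 74730, 40421, 30697, 50953, 42962],
    [68874, 55909, 96452, 39532, 199737, 209550, 212779, 97167, 67766, 68517],
    [205111, 163453, 28111, 80967, 46778, 38958, 29675, 87437, 214420, 120670],
]
og_ids = [
    [45558, 72300, 194362, 59828, 92926, 205453, 67261, 72962, 72358, 31540],
    [70464, 49591, 56439, 85727, 92404, 68550, 40299, 39429, 213302, 86771],
    [34095, 103164, 48701, 73686, 74730, 60705, 38222, 122748, 147468, 161410],
    [96444, 39532, 117693, 68874, 199737, 67766, 97167, 55909, 96452, 49182],
    [28111, 214420, 80967, 45658, 46778, 120670, 163918, 205111, 163453, 34717],
]

# Number of original articles recovered in each summary's top 10
hits = [len(set(og) & set(top)) for og, top in zip(og_ids, top_id)]
# Per-cluster proportion, then the overall mean
recall = [h / len(og) for h, og in zip(hits, og_ids)]
mean_recall = sum(recall) / len(recall)

print(hits)                   # [3, 2, 4, 7, 7]
print(round(mean_recall, 2))  # 0.46
```

The last two clusters recover 7 of their 10 original articles, while the second recovers only 2, so the 0.46 average reflects quite uneven performance across clusters.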