In [1]:
import numpy as np
import pandas as pd
import re
import sys
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

In [2]:
df = pd.read_csv('data/preprocessed.csv', index_col=0)
pd.options.display.max_colwidth = 50
df

Unnamed: 0,speaker,text,cluster tokens
0,Wallace,Good evening from the Health Education Campus ...,good evening health education campus case west...
1,Wallace,This debate is being conducted under health an...,debate conduct health safety protocol design c...
2,Biden,"How you doing, man?",man
3,Trump,How are you doing?,
4,Biden,I'm well.,
...,...,...,...
784,Wallace,"Gentlemen, just say that's the end of it . Th...",gentleman end end debate
785,Trump,I want to see an honest ballot count.,want honest ballot count
786,Wallace,We're going to leave it there-,go leave there
787,Trump,And I think he does too-,think too


In [3]:
trump_df = df[df['speaker'] == 'Trump']
biden_df = df[df['speaker'] == 'Biden']

trump_df = trump_df.dropna()
biden_df = biden_df.dropna()

In [4]:
trump_df

Unnamed: 0,speaker,text,cluster tokens
6,Trump,"Thank you very much, Chris. I will tell you ve...",thank chris tell simply win election election ...
7,Trump,And we won the election and therefore we have ...,win election right choose people knowingly way...
10,Trump,"Thank you, Joe.",thank joe
14,Trump,There aren't a hundred million people with pre...,million people pre existing condition far conc...
16,Trump,"During that period of time, during that period...",period time period time opening elect year ele...
...,...,...,...
777,Trump,You think that's good?,think good
780,Trump,It's already been established. Take a look at ...,establish look carolyn maloney race
783,Trump,I want to see an honest ballot cut-,want honest ballot cut
785,Trump,I want to see an honest ballot count.,want honest ballot count


In [5]:
biden_df

Unnamed: 0,speaker,text,cluster tokens
2,Biden,"How you doing, man?",man
9,Biden,"Well, first of all, thank you for doing this a...",thank look forward mr president
11,Biden,The American people have a right to have a say...,american people right supreme court nominee oc...
12,Biden,"Now, what's at stake here is the President's m...",stake president clear want rid affordable care...
13,Biden,"And that ended when we, in fact, passed the Af...",end fact pass affordable care act million peop...
...,...,...,...
751,Biden,Five states have had mail-in ballots for the l...,state mail ballot decade include republican st...
756,Biden,I am concerned that any court would settle thi...,concern court settle deal ballot fill suppose ...
761,Biden,Mail service delivers 185 million pieces of ...,mail service deliver million piece mail day
779,Biden,Yes. And here's the deal. We count the ballots...,yes deal count ballot point ballot state open ...


## Baseline Summaries

### 1. Trump

In [6]:
trump_utterances = trump_df.text.tolist()
trump_utterances.sort(key=len, reverse=True)

for utterance in trump_utterances[:1]:
    print(utterance+'\n')

I don't think you have any law enforcement. You can't even say the word law enforcement. Because if you say those words, you're going to lose all of your radical left supporters. And why aren't you saying those words, Joe? Why don't you say the words law enforcement? Because you know what? If called us in Portland, we would put out that fire in a half an hour. But they won't do it, because they're run by radical left Democrats. If you look at Chicago, if you look at any place you want to look, Seattle, they heard we were coming in the following day and they put up their hands and we got back Seattle. Minneapolis, we got it back, Joe, because we believe in law and order, but you don't. The top 10 cities and just about the top 40 cities are run by Democrats, and in many cases radical left. And they've got you wrapped around their finger, Joe, to a point where you don't want to say anything about law and order. And I'll tell you what, the people of this country want and demand law and ord

### 2. Biden

In [7]:
biden_utterances = biden_df.text.tolist()
biden_utterances.sort(key=len, reverse=True)

for utterance in biden_utterances[:1]:
    print(utterance+'\n')

Gas and oil because the heat will not be going out. There's so many things that we can do now to create thousands and thousands of jobs. We can get to net zero, in terms of energy production, by 2035. Not only not costing people jobs, creating jobs, creating millions of good-paying jobs. Not 15 bucks an hour, but prevailing wage, by having a new infrastructure that in fact, is green. And the first thing I will do, I will rejoin the Paris Accord. I will join the Paris Accord because with us out of it, look what's happening. It's all falling apart. And talk about someone who has no relationship with foreign policy. The rainforests of Brazil are being torn down, are being ripped down. More carbon is absorbed in that rainforest than every bit of carbon that's emitted in the United States. Instead of doing something about that, I would be gathering up and making sure we had the countries of the world coming up with $20 billion, and say, Here's $20 billion. Stop tearing down the forest. And 

In [8]:
trump_vectorizer = TfidfVectorizer(
#     min_df = 2,
#     max_df = 0.95,
    max_features = 5000,
    stop_words = 'english'
)

tv = trump_vectorizer.fit_transform(trump_df["cluster tokens"])

# # sort features by idf score and get top n features
# sorted_indices = np.argsort(trump_vectorizer.idf_)[::-1]
# features = trump_vectorizer.get_feature_names()
# top_n = 500
# trump_features = [features[i] for i in sorted_indices[:top_n]]
# print(trump_features)

In [9]:
biden_vectorizer = TfidfVectorizer(
#     min_df = 5,
#     max_df = 0.95,
    max_features = 5000,
    stop_words = 'english'
)
bv = biden_vectorizer.fit_transform(biden_df["cluster tokens"])
print(bv.toarray())

# sort features by idf score and get top n features
# sorted_indices = np.argsort(biden_vectorizer.idf_)[::-1]
# features = biden_vectorizer.get_feature_names()
# top_n = 500
# biden_features = [features[i] for i in sorted_indices[:top_n]]
# print(biden_features)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.13099507 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


## K-means Clustering

In [10]:
k = 6
pd.options.display.max_colwidth = 100
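def get_top_keywords(data, clusters, features, n_terms):
    # mean TF-IDF weight of every term within each cluster
    cluster_means = pd.DataFrame(data.toarray()).groupby(clusters).mean()
    keywords = []

    for _, row in cluster_means.iterrows():
        # argsort is ascending, so the strongest term is listed last
        top = np.argsort(row.to_numpy())[-n_terms:]
        keywords.append(','.join(features[t] for t in top))

    return keywords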
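
To make the logic of `get_top_keywords` concrete, here is a minimal dependency-free sketch on toy data. The function name `top_terms_per_cluster`, the four-term vocabulary, and the weights are all hypothetical and chosen only for illustration; the notebook itself works on the sparse TF-IDF matrices and the KMeans labels.

```python
def top_terms_per_cluster(rows, labels, features, n_terms):
    """Average each term's weight within a cluster and keep the n_terms largest.

    rows     -- list of equal-length weight lists (one per utterance)
    labels   -- cluster id for each row
    features -- term names, aligned with the columns of rows
    """
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, weight in enumerate(row):
            acc[j] += weight
        counts[label] = counts.get(label, 0) + 1

    result = {}
    for label in sorted(sums):
        means = [s / counts[label] for s in sums[label]]
        # ascending sort, mirroring np.argsort: the strongest term ends up last
        order = sorted(range(len(features)), key=lambda j: means[j])
        result[label] = [features[j] for j in order[-n_terms:]]
    return result


# hypothetical toy weights, not taken from the debate data
features = ["ballot", "court", "tax", "vote"]
rows = [
    [0.9, 0.0, 0.0, 0.4],
    [0.7, 0.1, 0.0, 0.6],
    [0.0, 0.2, 0.8, 0.0],
]
labels = [0, 0, 1]

toy_keywords = top_terms_per_cluster(rows, labels, features, 2)
print(toy_keywords)
```

Because the sort is ascending, each keyword list reads from weaker to stronger, so the last term in every printed cluster line is the one with the highest mean weight.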

### 1. Trump

In [11]:
trump_clusters = KMeans(n_clusters=k, random_state=0).fit_predict(tv)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
keywords = get_top_keywords(tv, trump_clusters, trump_vectorizer.get_feature_names_out(), 10)

print('Trump clusters')
for i in range(k):
    print('\nCluster {}'.format(i))
    print(keywords[i])

Trump clusters

Cluster 0
talk,excuse,okay,true,left,run,chris,right,want,say

Cluster 1
mayor,talk,yes,pay,son,joe,moscow,half,dollar,million

Cluster 2
right,ballot,mask,come,forest,say,good,group,support,think

Cluster 3
dominate,sarcastically,joe,suburb,fraud,people,say,ballot,happen,know

Cluster 4
fixing,favored,count,vote,chris,love,statement,make,know,wrong

Cluster 5
understand,good,thing,care,big,tell,let,people,year,joe


### 2. Biden

In [12]:
biden_clusters = KMeans(n_clusters=k, random_state=0).fit_predict(bv)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
keywords = get_top_keywords(bv, biden_clusters, biden_vectorizer.get_feature_names_out(), 10)

print('Biden clusters')
for i in range(k):
    print('\nCluster {}'.format(i))
    print(keywords[i])

Biden clusters

Cluster 0
healthcare,fact,economy,deal,talk,tax,say,look,sure,plan

Cluster 1
focus,folk,fool,finish,dishonorably,discharge,deal,absolutely,simply,true

Cluster 2
way,fact,want,ballot,right,court,man,vote,president,people

Cluster 3
panic,senator,fact,people,lie,let,job,deal,lot,know

Cluster 4
scientist,worried,pressure,disagree,remember,ranting,final,ask,answer,question

Cluster 5
true,way,million,fact,report,dollar,matter,discredited,discredit,totally


In [27]:
trump_df['cluster'] = trump_clusters.tolist()
biden_df['cluster'] = biden_clusters.tolist()
biden_df[biden_df['cluster'] == 5]

Unnamed: 0,speaker,text,cluster tokens,cluster
274,Biden,"Yes, I would. He's been totally irresponsible the way in which he has handled the social distanc...",yes totally irresponsible way handle social distancing people wear mask basically encourage fool,5
404,Biden,Is totally-,totally,5
406,Biden,Totally discredited. Totally discredited. And by the way-,totally discredited totally discredited way,5
418,Biden,It's been totally discredited.,totally discredit,5
432,Biden,"By everyone, has discredited. And matter of fact Matter of fact-",discredit matter fact matter fact,5
633,Biden,He wasn't given tens of millions of dollars. It was all discredited.,give ten million dollar discredit,5
636,Biden,That is not true. That report is totally discredited.,true report totally discredit,5


## Intra-Cluster Extractive Summarization

In [14]:
import math
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [15]:
SUMM_PERCENT = .4
MAX_SUMM_LEN = 10
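
The two constants above set the summary length per cluster: a fraction of the cluster's sentences, capped at a maximum. A minimal sketch of that rule follows. The helper name `summary_length` is hypothetical, and the floor of one sentence is an assumption added here, since `round(.4 * 1)` is 0 and a one-sentence cluster would otherwise get an empty summary.

```python
SUMM_PERCENT = .4
MAX_SUMM_LEN = 10


def summary_length(n_sentences, percent=SUMM_PERCENT, cap=MAX_SUMM_LEN):
    """Number of sentences to extract: `percent` of the cluster, at most `cap`.

    The lower bound of 1 is an assumption of this sketch, not part of the
    notebook's original rule.
    """
    return max(1, min(cap, round(percent * n_sentences)))


for n in (1, 10, 100):
    print(n, summary_length(n))
```

The result could be passed as the second argument to a sumy summarizer call, for example `summarizer(parser.document, summary_length(len(parser.document.sentences)))`.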

### 1. Trump Summaries

In [26]:
trump_summaries_tr = []

for i in range(k):
    text = trump_df[trump_df['cluster'] == i].text
    parser = PlaintextParser(' '.join(text.to_list()), Tokenizer("English"))
    summarizer = TextRankSummarizer()
    
    # the extractive summary for each cluster is its 5 top-ranked sentences
    # (SUMM_PERCENT and MAX_SUMM_LEN are not applied in this cell)
    summary = summarizer(parser.document, 5)
    trump_summaries_tr.append(summary)

    
for c in range(k):
    print('Cluster {}:'.format(c))
    
    for sent in trump_summaries_tr[c][:5]:
        print(sent)
        
    print('')
    
# print('[Trump] TextRank output for Cluster 1:\n')
# for sent in trump_summaries_tr[1]:
#     print(sent)

Cluster 0:
If you look at Chicago, if you look at any place you want to look, Seattle, they heard we were coming in the following day and they put up their hands and we got back Seattle.
The places we had trouble were democratic run cities- I think as a party issue, you can bring in a couple of examples but if you look at Chicago, what's going on in Chicago where a 53 people were shot and eight died shot, if you look at New York where it's going up, like nobody's ever seen anything.
The numbers are going up a 100%, 150%, 200% crime, it is crazy what's going on and he doesn't want to say law and order because he can't because he'll lose his radical left supporters and once he does that, it's over with.
But if he ever got to run this country and they ran it the way he would want to run it, we would have by the way our suburbs would be gone.
You go and vote- You either do, Chris, a solicited ballot, where you're sending it in, they're sending it back and you're sending.

Cluster 1:
I paid

In [25]:
trump_summaries_lr = []

for i in range(k):
    text = trump_df[trump_df['cluster'] == i].text
    parser = PlaintextParser(' '.join(text.to_list()), Tokenizer("English"))
    summarizer = LexRankSummarizer()
    
    summary = summarizer(parser.document, 5)
    trump_summaries_lr.append(summary)
    
for c in range(k):
    print('Cluster {}:'.format(c))
    
    for sent in trump_summaries_lr[c][:5]:
        print(sent)
        
    print('')

# print('[Trump] LexRank output for Cluster 1:\n')
# for sent in trump_summaries_lr[1]:
#     print(sent)

Cluster 0:
Well, he wants to shut down this country and I want to keep it open, and we did a great thing by shutting it down- Because it's a political thing.
If you look at Chicago, if you look at any place you want to look, Seattle, they heard we were coming in the following day and they put up their hands and we got back Seattle.
And I'll tell you what, the people of this country want and demand law and order and you're afraid to even say it.
The places we had trouble were democratic run cities- I think as a party issue, you can bring in a couple of examples but if you look at Chicago, what's going on in Chicago where a 53 people were shot and eight died shot, if you look at New York where it's going up, like nobody's ever seen anything.
The numbers are going up a 100%, 150%, 200% crime, it is crazy what's going on and he doesn't want to say law and order because he can't because he'll lose his radical left supporters and once he does that, it's over with.

Cluster 1:
I paid millions

### 2. Biden Summaries

In [18]:
biden_summaries_tr = []

for i in range(k):
    text = biden_df[biden_df['cluster'] == i].text
    parser = PlaintextParser(' '.join(text.to_list()), Tokenizer("English"))
    summarizer = TextRankSummarizer()
    
    summary = summarizer(parser.document, 5)
    biden_summaries_tr.append(summary)
  
for c in range(k):
    print('Cluster {}:'.format(c))
    
    for sent in biden_summaries_tr[c][:5]:
        print(sent)
        
    print('')

# print('[Biden] TextRank output for Cluster 4:\n')
# for sent in biden_summaries_tr[4]:
#     print(sent)

Cluster 0:
You should get out of your bunker and get out of the sand trap in your golf course and go in the Oval Office and bring together the Democrats and Republicans and fund what needs to be done now to save lives.
Every serious company is talking about maybe having a vaccine done by the end of the year, but the distribution of that vaccine will not occur until sometime beginning of the middle of next year to get it out, if we get the vaccine.
And under my proposal, we're going to make sure that every penny of that has to be made by a company- By the way, I'm going to eliminate a significant number of the taxes.
We're going to build a economy that in fact is going to provide for the ability of us to take 4 million buildings and make sure that they in fact are weatherized in a way that in fact will they'll emit significantly less gas and oil because the heat will not be going out.
This is the guy who says that, the fact that- I'm talking about the Biden plan  - That is not- Simply..

In [19]:
biden_summaries_lr = []

for i in range(k):
    text = biden_df[biden_df['cluster'] == i].text
    parser = PlaintextParser(' '.join(text.to_list()), Tokenizer("English"))
    summarizer = LexRankSummarizer()
    
    summary = summarizer(parser.document, min(MAX_SUMM_LEN, round(SUMM_PERCENT*len(parser.document.sentences))))
    biden_summaries_lr.append(summary)

for c in range(k):
    print('Cluster {}:'.format(c))
    
    for sent in biden_summaries_lr[c][:5]:
        print(sent)
        
    print('')
    
# print('[Biden] LexRank output for Cluster 1:\n')
# for sent in biden_summaries_lr[1]:
#     print(sent)

Cluster 0:
You should get out of your bunker and get out of the sand trap in your golf course and go in the Oval Office and bring together the Democrats and Republicans and fund what needs to be done now to save lives.
And there was no one ... We didn't shut down the economy.
And under my proposal, we're going to make sure that every penny of that has to be made by a company- By the way, I'm going to eliminate a significant number of the taxes.
Mr. Vice- It is not a fact.
That's what this is all about.

Cluster 1:
That is not true.
That is not true.
That is not true.
That is not true.
That's not true.

Cluster 2:
The American people have a right to have a say in who the Supreme Court nominee is and that say occurs when they vote for United States Senators and when they vote for the President of United States.
We should wait and see what the outcome of this election is because that's the only way the American people get to express their view is by who they elect as President and who the