<a href="https://colab.research.google.com/github/wenjunsun/personal-machine-learning-projects/blob/master/unsupervised-learning/topic_modeling_with_Trump_and_biden_speeches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we look at what each candidate talks about in their speeches. I expect Biden to talk more about the environment and race, and Trump to talk more about the economy, but we will see whether that is really the case.

There are obviously a lot of ways to do topic modeling. In this notebook I will explore two techniques that fall into the realm of matrix decomposition: SVD and NMF.

I won't go into how NMF or SVD work; interested readers can find more about both methods online.

Rachel Thomas from fast.ai gives this motivation for using matrix decomposition for topic modeling:

> "Consider the most extreme case - reconstructing the matrix using an outer product of two vectors. Clearly, in most cases we won't be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.
Now consider increasing that matrices to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups, each of which has as different a distribution of words as possible to each other, but as similar as possible amongst the documents in the cluster. We will call those two groups "topics". And we would cluster the words into two groups, based on those which most frequently appear in each of the topics."
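
The idea in the quote can be tried on a tiny matrix. The sketch below is only an illustration: the 4x4 matrix is made up, and the helper `rank_k_approximation` is a hypothetical name, not part of this notebook. It rebuilds the matrix from the top one and then the top two singular triplets and prints how far each reconstruction is from the original.

```python
import numpy as np

# Toy document-term counts (rows: documents, columns: words).
# The values are made up for illustration only.
A = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 3.0],
    [0.0, 1.0, 3.0, 4.0],
])

U, s, Vh = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(U, s, Vh, k):
    """Rebuild the matrix from the k largest singular triplets."""
    return (U[:, :k] * s[:k]) @ Vh[:k]

for k in (1, 2):
    error = np.linalg.norm(A - rank_k_approximation(U, s, Vh, k))
    print(f"rank {k}: Frobenius reconstruction error = {error:.3f}")
```

Going from one to two components should lower the error noticeably here, because the toy matrix has two groups of documents that use different words.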

In [1]:
import numpy as np # basic linear algebra and numerical computation package.
import pandas as pd # for reading data.

# 1. Read Data

The data was collected by Elijah Greisz from [this website](https://www.rev.com/blog/transcript-category/2020-election-transcripts). We use speeches by both Biden and Trump from Aug 1 to Sep 10.

In [21]:
ls

drive/  sample_data/


In [22]:
# Google Drive must be connected before running this line.
data = pd.read_csv('drive/My Drive/political_sentiment_analysis/text_data.csv')

In [23]:
data.head()

Unnamed: 0,text,speech,candidate
0,Good afternoon folks. Sorry I’m a little late....,Joe Biden Press Conference Transcript September 4,biden
1,When my son volunteered and joined the United ...,Joe Biden Press Conference Transcript September 4,biden
2,"He stood by failing, failing to take action or...",Joe Biden Press Conference Transcript September 4,biden
3,"I’ve talked to a lot of real working people, a...",Joe Biden Press Conference Transcript September 4,biden
4,This is a special place for the Biden family. ...,Joe Biden Press Conference Transcript September 4,biden


In [24]:
# select speech texts from biden
biden_data = data[data['candidate'] == 'biden']
biden_data.head()

Unnamed: 0,text,speech,candidate
0,Good afternoon folks. Sorry I’m a little late....,Joe Biden Press Conference Transcript September 4,biden
1,When my son volunteered and joined the United ...,Joe Biden Press Conference Transcript September 4,biden
2,"He stood by failing, failing to take action or...",Joe Biden Press Conference Transcript September 4,biden
3,"I’ve talked to a lot of real working people, a...",Joe Biden Press Conference Transcript September 4,biden
4,This is a special place for the Biden family. ...,Joe Biden Press Conference Transcript September 4,biden


In [25]:
# select speech texts from trump
trump_data = data[data['candidate'] == 'trump']
trump_data.head()

Unnamed: 0,text,speech,candidate
362,"We brought you a lot of car plants, Michigan. ...",Donald Trump Freeland MI Campaign Rally Speech...,trump
363,"Long time, it’s been a long time since you had...",Donald Trump Freeland MI Campaign Rally Speech...,trump
364,This is the most important election in the his...,Donald Trump Freeland MI Campaign Rally Speech...,trump
365,"We shouldn’t be smiling about it, but we’ve ch...",Donald Trump Freeland MI Campaign Rally Speech...,trump
366,"If Biden wins, China wins. If Biden wins, the ...",Donald Trump Freeland MI Campaign Rally Speech...,trump


In [26]:
print(f'there are {biden_data.shape[0]} records of Biden speech documents')
print(f'there are {trump_data.shape[0]} records of Trump speech documents')

there are 362 records of Biden speech documents
there are 886 records of Trump speech documents


Now we want to print out some sample texts from both Trump and Biden, just to see what their speeches are like.

In [30]:
from random import randint

In [33]:
# randomly print 5 speeches from Biden.
for i in range(5):
  random_index = randint(0, biden_data.shape[0] - 1)
  print(biden_data.iloc[random_index]['text'])
  print()

And speaking of President Obama, a man I was honored to serve alongside for eight years as vice president. Let me take this moment to say something we don’t say nearly enough. Thank you, Mr. President, you were a great president, a president that our children could and did look up to. No one’s going to say that about the current occupant of the White House. What you know about this president is if he’s given four more years, he’ll be what he’s been for the last four years. The president takes no responsibility, refuses to lead, blames others, cozies up to dictators, and fans the flames of hate and division. He’ll wake up every day believing that job is all about him, never about you. Is that the America you want for you, your family, your children? I see a different America, one that’s generous and strong, selfless, and humble. It’s an America we could rebuild together.

He invited companies to the White House to make what he called the pledge to American workers. He couldn’t even keep

In [34]:
# randomly print 5 speeches from Trump.
for i in range(5):
  random_index = randint(0, trump_data.shape[0] - 1)
  print(trump_data.iloc[random_index]['text'])
  print()

And I saw where these phonies, you know they want to end everything we’ve done. They want to end it. They want to go to wind. They don’t even know if they want to go to wind. I think they want to just basically close up our country, because they’ve taken away our strength, but they want to do something. But, there is no such thing. Solar can’t do it. I love solar. It’s all fine. Very, very heavily expensive. Very expensive. But they want to go to other forms of alternative energy. And I think that’s okay, except we don’t have them. And it’s not going to power these massive factories.

We can’t loose.

That’s what they want to do. They want to take away your Second Amendment. If I weren’t president, you would either have an obliterated Second Amendment or it would be gone entirely. I am standing between them and your Second Amendment. And that’s it. That’s it. They know.

They said manufacturing jobs will never come back. Remember, you need a magic wand. Where’s the magic wand? Well, we

We see that Trump talks about China, manufacturing jobs, and the Second Amendment.

Biden talks about moral leadership, George Floyd, American soldiers, President Obama, etc.

# 2. Applying SVD and NMF to extract what Trump and Biden each talk about in their speeches

In [35]:
biden_speeches = biden_data['text'].tolist()

In [36]:
biden_speeches[0]

'Good afternoon folks. Sorry I’m a little late. I was a mesmerized. I was walking out of the office, that listening to an interview of former General Barry McCaffrey and Bill Cohen, former Secretary of Defense. Before I begin, I want to speak a little bit to what they talked about and the revelations about President Trump’s disregard for our military and our veterans. Quite frankly, if what is written in the Atlantic is true, it’s disgusting. At affirms what most of us believe to be true, that Donald Trump is not fit to do the job of President, to be the Commander in Chief. The President reportedly said, and I emphasize reportedly, said that those who sign up to serve, instead of doing something more lucrative are suckers. Let me be real clear. When my son was an Assistant US Attorney and he volunteered to go to Kosovo, when the war is going on as a civilian, he wasn’t a sucker.'

In [37]:
trump_speeches = trump_data['text'].tolist()

In [38]:
trump_speeches[0]

'We brought you a lot of car plants, Michigan. We brought you a lot of car plants. You know that, right?'

## 2.1 Preprocessing/tokenizing:

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [44]:
# NLTK provides a stemmer class called SnowballStemmer.
from nltk.stem.snowball import SnowballStemmer

In [45]:
import nltk

In [46]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [47]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [48]:
stemmer = SnowballStemmer(language='english')

In [49]:
stopwords = nltk.corpus.stopwords.words('english')

def stemming_tokenizer(text):
  tokens = []
  sentences = nltk.sent_tokenize(text)
  for sentence in sentences:
    for word in nltk.word_tokenize(sentence):
      # note: this check runs before lowercasing, so capitalized stopwords
      # such as 'I' or 'The' are kept when raw text is passed in directly.
      if word not in stopwords:
        tokens.append(word.lower())

  # keep only alphabetic tokens - this drops punctuation and numbers
  onlyWordTokens = [word for word in tokens if word.isalpha()]

  # replace every word by its stem:
  onlyWordTokens = [stemmer.stem(word) for word in onlyWordTokens]

  return onlyWordTokens

In [50]:
# see how the first biden speech will get tokenized.
stemming_tokenizer(biden_speeches[0])

['good',
 'afternoon',
 'folk',
 'sorri',
 'i',
 'littl',
 'late',
 'i',
 'mesmer',
 'i',
 'walk',
 'offic',
 'listen',
 'interview',
 'former',
 'general',
 'barri',
 'mccaffrey',
 'bill',
 'cohen',
 'former',
 'secretari',
 'defens',
 'befor',
 'i',
 'begin',
 'i',
 'want',
 'speak',
 'littl',
 'bit',
 'talk',
 'revel',
 'presid',
 'trump',
 'disregard',
 'militari',
 'veteran',
 'quit',
 'frank',
 'written',
 'atlant',
 'true',
 'disgust',
 'at',
 'affirm',
 'us',
 'believ',
 'true',
 'donald',
 'trump',
 'fit',
 'job',
 'presid',
 'command',
 'chief',
 'the',
 'presid',
 'report',
 'said',
 'i',
 'emphas',
 'report',
 'said',
 'sign',
 'serv',
 'instead',
 'someth',
 'lucrat',
 'sucker',
 'let',
 'real',
 'clear',
 'when',
 'son',
 'assist',
 'us',
 'attorney',
 'volunt',
 'go',
 'kosovo',
 'war',
 'go',
 'civilian',
 'sucker']
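
The output above still contains 'i', 'the', 'when' and 'at'. A likely reason is that the stopword check happens before lowercasing, so 'I' and 'The' do not match the lowercase stopword list. The sketch below isolates that ordering effect with a tiny made-up stopword set and plain `str.split`; both function names are hypothetical, and NLTK and stemming are left out on purpose.

```python
# A tiny, made-up stopword list; the notebook itself uses NLTK's English list.
TOY_STOPWORDS = {"i", "the", "a", "when", "at", "was"}

def filter_case_sensitive(words):
    """Mimics the notebook's order: stopword check first, lowercase second."""
    return [w.lower() for w in words if w not in TOY_STOPWORDS]

def filter_lowercase_first(words):
    """Lowercases first, so capitalized stopwords are removed as well."""
    lowered = [w.lower() for w in words]
    return [w for w in lowered if w not in TOY_STOPWORDS]

words = "The President said I was late".split()
print(filter_case_sensitive(words))   # ['the', 'president', 'said', 'i', 'late']
print(filter_lowercase_first(words))  # ['president', 'said', 'late']
```

As far as I know, `TfidfVectorizer` lowercases the text by default before it calls a custom tokenizer, so this mostly matters when the tokenizer is called directly on raw text, as in the cell above.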

In [68]:
tfidf_trump = TfidfVectorizer(max_df=0.99, max_features= 300,
                        min_df=0.01, stop_words='english',
                        use_idf=True, tokenizer = stemming_tokenizer, # our custom tokenizer that ignores the tense of words.
                        ngram_range=(1,1))

In [69]:
tfidf_matrix_trump = tfidf_trump.fit_transform(trump_speeches)

  'stop_words.' % sorted(inconsistent))


In [84]:
# convert the sparse matrix to a dense one for the SVD below.
tfidf_matrix_trump = tfidf_matrix_trump.todense()

In [70]:
# each of Trump's speech passages is represented
# by a 300-dimensional vector
tfidf_matrix_trump.shape

(886, 300)
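
To make the numbers inside such a matrix less opaque, here is a minimal sketch of TF-IDF weighting done by hand with NumPy. It assumes the scikit-learn defaults as I understand them (smoothed idf, raw term counts, L2-normalized rows); the count matrix is made up and much smaller than the real one.

```python
import numpy as np

# Toy term counts (rows: documents, columns: terms); values are made up.
counts = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts and scale every document vector to unit L2 norm.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print("document frequency:", df)
print("idf:", np.round(idf, 3))
print(np.round(tfidf, 3))
```

The term that appears in every document gets the smallest idf, and the term that appears in only one document gets the largest, which is why rare but distinctive words carry more weight in the rows.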

In [71]:
tfidf_biden = TfidfVectorizer(max_df=0.99, max_features= 250,
                        min_df=0.01, stop_words='english',
                        use_idf=True, tokenizer = stemming_tokenizer, # our custom tokenizer that ignores the tense of words.
                        ngram_range=(1,1))

In [72]:
tfidf_matrix_biden = tfidf_biden.fit_transform(biden_speeches)

  'stop_words.' % sorted(inconsistent))


In [85]:
tfidf_matrix_biden = tfidf_matrix_biden.todense()

In [73]:
tfidf_matrix_biden.shape

(362, 250)

In [76]:
# first 20 terms of Biden's vocabulary (stemmed, in alphabetical order)
tfidf_biden.get_feature_names()[:20]

['abl',
 'act',
 'administr',
 'afford',
 'ago',
 'alreadi',
 'alway',
 'america',
 'american',
 'anyth',
 'ask',
 'away',
 'bad',
 'begin',
 'believ',
 'best',
 'better',
 'biden',
 'big',
 'billion']

In [77]:
# first 20 terms of Trump's vocabulary (stemmed, in alphabetical order)
tfidf_trump.get_feature_names()[:20]

['achiev',
 'actual',
 'administr',
 'agenda',
 'ago',
 'allow',
 'alreadi',
 'alway',
 'amaz',
 'amend',
 'america',
 'american',
 'anoth',
 'anybodi',
 'anyth',
 'ask',
 'away',
 'bad',
 'ballot',
 'ban']

In [78]:
# note: scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out().
trump_vocab = tfidf_trump.get_feature_names()

In [79]:
biden_vocab = tfidf_biden.get_feature_names()

In [80]:
# given a vector of per-word weights, return the words
# with the largest weights as a space-separated string.
# note: np.argsort ranks by signed value, not by magnitude,
# so large negative entries end up last.
def getMostImportantWords(vector, numOfWords, isTrump):
  indices_of_most_important_words = np.argsort(vector)[::-1][:numOfWords]
  if isTrump:
    return ' '.join([trump_vocab[index] for index in indices_of_most_important_words])
  else:
    return ' '.join([biden_vocab[index] for index in indices_of_most_important_words])
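
If the intent is to rank words by magnitude, one option is to sort on absolute values and pass the vocabulary in explicitly. This is a sketch, not the function used in the rest of the notebook: `top_words_by_magnitude`, `toy_vocab` and `toy_vector` are hypothetical names, and the values are made up.

```python
import numpy as np

def top_words_by_magnitude(vector, vocab, num_words):
    """Return the num_words terms whose weights are largest in absolute value."""
    weights = np.asarray(vector, dtype=float).ravel()
    order = np.argsort(np.abs(weights))[::-1][:num_words]
    return " ".join(vocab[index] for index in order)

toy_vocab = ["china", "job", "school", "tax"]
toy_vector = np.array([-0.7, 0.1, 0.5, -0.2])

print(top_words_by_magnitude(toy_vector, toy_vocab, 2))  # china school
```

Sorting the same toy vector by signed value would return "school job" instead, because the large negative weight on "china" would rank last.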

## 2.2 SVD.

In [82]:
from scipy import linalg

In [99]:
%time U, s, Vh = linalg.svd(tfidf_matrix_trump, full_matrices=False)

CPU times: user 99.5 ms, sys: 36.2 ms, total: 136 ms
Wall time: 76.1 ms


In [100]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 10, isTrump= True))

topic 0: russia favor trillion plan post term judg free illeg play
topic 1: biden year china joe happen countri like left citi deal
topic 2: peopl said know say like think good thing look guy
topic 3: want peopl citi polic left democrat america win biden law
topic 4: great peopl love state job nation american good america unit


In [101]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 15, isTrump= True))

topic 0: russia favor trillion plan post term judg free illeg play stock achiev yeah polici oil
topic 1: biden year china joe happen countri like left citi deal mani look thing said talk
topic 2: peopl said know say like think good thing look guy lot big happen right poll
topic 3: want peopl citi polic left democrat america win biden law run everi york american crime
topic 4: great peopl love state job nation american good america unit right open million work make
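
Topic 0 above looks unrelated to the other topics. One possible explanation, stated here as an assumption rather than something verified on this data, is the sign ambiguity of the SVD: each singular-vector pair can be multiplied by -1 without changing the factorization. For a non-negative matrix the first right singular vector has entries of one sign, and if the solver returns it as all-negative, a descending sort on signed values lists the words with the smallest weights first. The sketch below uses a made-up matrix and a hypothetical helper `flip_signs` to fix the signs.

```python
import numpy as np

# Small non-negative matrix standing in for a TF-IDF matrix (made-up values).
A = np.array([
    [2.0, 1.0, 0.0, 0.5],
    [1.0, 2.0, 0.5, 1.0],
    [0.0, 0.5, 3.0, 1.0],
    [0.5, 1.0, 1.0, 2.0],
])

U, s, Vh = np.linalg.svd(A, full_matrices=False)

def flip_signs(U, Vh):
    """Flip each singular-vector pair so the largest-magnitude entry of
    every row of Vh is positive. The product U @ diag(s) @ Vh is unchanged."""
    largest = np.argmax(np.abs(Vh), axis=1)
    signs = np.sign(Vh[np.arange(Vh.shape[0]), largest])
    return U * signs, Vh * signs[:, np.newaxis]

U_flipped, Vh_flipped = flip_signs(U, Vh)
print("first component before:", np.round(Vh[0], 3))
print("first component after: ", np.round(Vh_flipped[0], 3))
```

After the flip, the first component has only positive entries, so sorting by signed value and sorting by magnitude agree for it.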


In [95]:
%time U, s, Vh = linalg.svd(tfidf_matrix_biden, full_matrices=False)

CPU times: user 44.9 ms, sys: 16 ms, total: 60.9 ms
Wall time: 37.3 ms


In [97]:
# print out 5 topics biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 10, isTrump= False))

topic 0: program shot local crosstalk justic afford buy general citi death
topic 1: thank want god think elect realli crosstalk hope got better
topic 2: crosstalk know right got think said talk someth vote thing
topic 3: crosstalk right worker tax thank job trump pay american union
topic 4: school abl educ need home health safe sure children make


In [98]:
# print out 5 topics biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 15, isTrump= False))

topic 0: program shot local crosstalk justic afford buy general citi death learn test cut street idea
topic 1: thank want god think elect realli crosstalk hope got better day number import state lot
topic 2: crosstalk know right got think said talk someth vote thing lot anyth want say happen
topic 3: crosstalk right worker tax thank job trump pay american union compani build treat feder make
topic 4: school abl educ need home health safe sure children make kid open billion parent think


## 2.3 NMF

In [102]:
from sklearn import decomposition

In [103]:
clf = decomposition.NMF(n_components=5, random_state=1)

W1 = clf.fit_transform(tfidf_matrix_trump)
H1 = clf.components_
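
For readers who want a feel for what NMF computes, here is a minimal sketch based on the classic multiplicative update rules for the Frobenius-norm objective. It is not what the cell above runs: as far as I know scikit-learn uses a coordinate-descent solver by default. The function name `nmf_multiplicative` and the toy matrix are assumptions made for this illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V into W @ H with W, H >= 0 using
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components)) + eps
    H = rng.random((n_components, n_terms)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor element-wise, so signs never change.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix with two obvious groups of columns (made-up values).
V = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])

W, H = nmf_multiplicative(V, n_components=2)
print("reconstruction error:", round(float(np.linalg.norm(V - W @ H)), 4))
print(np.round(H, 2))
```

Because both factors stay non-negative, each row of `H` can be read directly as a set of word weights, which is what makes the NMF topics below easier to interpret than the signed SVD components.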

In [108]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 10, isTrump= True))

topic 0: said happen like come know thing say think got right
topic 1: thank great job want love let everybodi man friend michigan
topic 2: china year job american world america countri nation biden histori
topic 3: want biden left citi polic joe law look berni radic
topic 4: peopl great love good win state right know job realli


In [109]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 15, isTrump= True))

topic 0: said happen like come know thing say think got right time let look big year
topic 1: thank great job want love let everybodi man friend michigan respect children god carolina realli
topic 2: china year job american world america countri nation biden histori billion deal unit million state
topic 3: want biden left citi polic joe law look berni radic suburb democrat crime everi run
topic 4: peopl great love good win state right know job realli like vote lot say governor


In [110]:
W1 = clf.fit_transform(tfidf_matrix_biden)
H1 = clf.components_

In [111]:
# print out 5 topics Biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 10, isTrump= False))

topic 0: american trump job presid worker tax america donald pay union
topic 1: thank god want better day covid labor elect hope number
topic 2: peopl think know said say talk got thing want look
topic 3: crosstalk right treat know got someth life love want health
topic 4: school need educ safe abl make latino sure children year


In [112]:
# print out 5 topics Biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 15, isTrump= False))

topic 0: american trump job presid worker tax america donald pay union million build work make compani
topic 1: thank god want better day covid labor elect hope number state promis realli protect union
topic 2: peopl think know said say talk got thing want look countri lot time come presid
topic 3: crosstalk right treat know got someth life love want health deal problem okay happen number
topic 4: school need educ safe abl make latino sure children year home open everi mask health


## 2.4 K-means

In [113]:
from sklearn.cluster import KMeans

In [114]:
km = KMeans(n_clusters = 5)
km.fit(tfidf_matrix_trump)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
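
The printed estimator shows `init='k-means++'` and `random_state=None`, so the clusters (and their numbering) can change from run to run. To show what the algorithm does underneath, here is a minimal sketch of plain Lloyd iterations with fixed starting centers. The function `lloyd_kmeans` and the toy data are assumptions for illustration; scikit-learn's implementation is more elaborate.

```python
import numpy as np

def lloyd_kmeans(X, initial_centers, n_iter=10):
    """Plain Lloyd iterations: assign each row to its nearest center,
    then move each center to the mean of its assigned rows."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        distances = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers

# Six toy "documents" over three terms (made-up weights): two clear groups.
X = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [1.0, 1.0, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 0.0, 1.0],
])

labels, centers = lloyd_kmeans(X, initial_centers=X[[0, 3]])
print(labels)  # [0 0 0 1 1 1]
print(np.round(centers, 2))
```

Each cluster center is an average of document vectors, which is why its largest entries can be read as the cluster's most characteristic words in the cells below.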

In [116]:
# Trump:
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 10, isTrump=True))

topic 0: china year job countri world american biden billion america deal
topic 1: thank great job love want peopl right realli good man
topic 2: peopl great want know say right good state like win
topic 3: said think like thing say right come know happen got
topic 4: biden want left citi know polic look joe american yeah


In [117]:
# Trump:
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 15, isTrump=True))

topic 0: china year job countri world american biden billion america deal nation histori right number million
topic 1: thank great job love want peopl right realli good man friend make america know said
topic 2: peopl great want know say right good state like win lot love vote new got
topic 3: said think like thing say right come know happen got want time year mani let
topic 4: biden want left citi know polic look joe american yeah countri berni america law radic
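
Several of these clusters look close to the NMF topics found earlier. One rough way to quantify that is the Jaccard overlap between two top-word lists. The sketch below uses a hypothetical helper `jaccard` and two top-10 lists copied from the Trump outputs in this notebook (NMF topic 2 and K-means topic 0).

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common."""
    set_a, set_b = set(words_a.split()), set(words_b.split())
    return len(set_a & set_b) / len(set_a | set_b)

# Top-10 word lists copied from the Trump outputs above.
nmf_topic_2 = "china year job american world america countri nation biden histori"
kmeans_topic_0 = "china year job countri world american biden billion america deal"

# 8 shared words out of 12 distinct words.
print(round(jaccard(nmf_topic_2, kmeans_topic_0), 3))  # 0.667
```

A score near 1 means two methods found nearly the same word list, while a score near 0 means the lists barely overlap.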


In [118]:
km = KMeans(n_clusters = 5)
km.fit(tfidf_matrix_biden)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [119]:
# biden
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 10, isTrump=False))

topic 0: crosstalk inaud got tell togeth chanc know campaign want right
topic 1: trump job presid american worker make tax donald pay america
topic 2: peopl think say presid thing talk know said like countri
topic 3: thank god want elect state better day number covid think
topic 4: latino abl make peopl school union know sure okay everi


In [120]:
# biden
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 15, isTrump=False))

topic 0: crosstalk inaud got tell togeth chanc know campaign want right way time generat lot think
topic 1: trump job presid american worker make tax donald pay america work million school countri peopl
topic 2: peopl think say presid thing talk know said like countri look make vote want understand
topic 3: thank god want elect state better day number covid think hope labor good union honor
topic 4: latino abl make peopl school union know sure okay everi program billion way state educ


# 3. Repeat the analysis with slightly different parameters

In [121]:
tfidf_trump = TfidfVectorizer(max_df=0.95, max_features= 500, # use more words.
                        min_df=0.02, stop_words='english',
                        use_idf=True, tokenizer = stemming_tokenizer, # our custom tokenizer that ignores the tense of words.
                        ngram_range=(1,1))

In [122]:
tfidf_matrix_trump = tfidf_trump.fit_transform(trump_speeches)

  'stop_words.' % sorted(inconsistent))


In [123]:
# convert the sparse matrix to a dense one for the SVD below.
tfidf_matrix_trump = tfidf_matrix_trump.todense()

In [124]:
# each of Trump's speech passages is now represented
# by a 392-dimensional vector
tfidf_matrix_trump.shape

(886, 392)
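
The vocabulary has 392 terms even though `max_features` was raised to 500. A plausible reading is that the `min_df` and `max_df` filters leave fewer than 500 terms, so the cap never applies. The sketch below mirrors my understanding of that pruning in plain Python; `prune_vocabulary` is a hypothetical helper and the documents are made up.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df, max_df, max_features):
    """Keep terms whose document-frequency share lies in [min_df, max_df],
    then keep at most max_features of them, preferring frequent terms."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    term_freq = Counter(term for doc in tokenized_docs for term in doc)
    kept = [t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

docs = [
    ["peopl", "china", "job"],
    ["peopl", "job", "job"],
    ["peopl", "polic", "citi"],
    ["peopl", "tariff"],
]

# "peopl" is in every document (share 1.0 > 0.9) and is dropped;
# terms seen in only one of four documents (share 0.25 < 0.5) are dropped too.
print(prune_vocabulary(docs, min_df=0.5, max_df=0.9, max_features=10))  # ['job']
```

Only one term survives here although up to ten were allowed, which is the same effect on a small scale.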

In [125]:
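# Biden's vectorizer settings are the same as in section 2, so the Biden SVD and NMF outputs below repeat the earlier ones.
tfidf_biden = TfidfVectorizer(max_df=0.99, max_features= 250,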
                        min_df=0.01, stop_words='english',
                        use_idf=True, tokenizer = stemming_tokenizer, # our custom tokenizer that ignores the tense of words.
                        ngram_range=(1,1))

In [126]:
tfidf_matrix_biden = tfidf_biden.fit_transform(biden_speeches)

  'stop_words.' % sorted(inconsistent))


In [127]:
tfidf_matrix_biden = tfidf_matrix_biden.todense()

In [128]:
tfidf_matrix_biden.shape

(362, 250)

In [129]:
# first 20 terms of Biden's vocabulary (stemmed, in alphabetical order)
tfidf_biden.get_feature_names()[:20]

['abl',
 'act',
 'administr',
 'afford',
 'ago',
 'alreadi',
 'alway',
 'america',
 'american',
 'anyth',
 'ask',
 'away',
 'bad',
 'begin',
 'believ',
 'best',
 'better',
 'biden',
 'big',
 'billion']

In [130]:
# first 20 terms of Trump's vocabulary (stemmed, in alphabetical order)
tfidf_trump.get_feature_names()[:20]

['abl',
 'achiev',
 'actual',
 'administr',
 'advantag',
 'agenda',
 'ago',
 'agre',
 'alien',
 'allow',
 'alreadi',
 'alway',
 'amaz',
 'amend',
 'america',
 'american',
 'announc',
 'anoth',
 'anybodi',
 'anymor']

In [131]:
trump_vocab = tfidf_trump.get_feature_names()

In [132]:
biden_vocab = tfidf_biden.get_feature_names()

In [133]:
# given a vector of per-word weights, return the words
# with the largest weights as a space-separated string.
# note: np.argsort ranks by signed value, not by magnitude,
# so large negative entries end up last.
def getMostImportantWords(vector, numOfWords, isTrump):
  indices_of_most_important_words = np.argsort(vector)[::-1][:numOfWords]
  if isTrump:
    return ' '.join([trump_vocab[index] for index in indices_of_most_important_words])
  else:
    return ' '.join([biden_vocab[index] for index in indices_of_most_important_words])

## 3.2 SVD.

In [134]:
from scipy import linalg

In [135]:
%time U, s, Vh = linalg.svd(tfidf_matrix_trump, full_matrices=False)

CPU times: user 172 ms, sys: 52.4 ms, total: 224 ms
Wall time: 146 ms


In [136]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 10, isTrump= True))

topic 0: terrorist alien sudden manifesto short kind fair speak histor immedi
topic 1: thank great love job friend good repres john realli fantast
topic 2: china america biden year american world thank job joe nation
topic 3: said year china billion deal farmer good job tariff thing
topic 4: great job love peopl state good open american nation million


In [137]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 15, isTrump= True))

topic 0: terrorist alien sudden manifesto short kind fair speak histor immedi honest anywher coupl sort futur
topic 1: thank great love job friend good repres john realli fantast man everybodi honor carolina god
topic 2: china america biden year american world thank job joe nation countri histori billion unit deal
topic 3: said year china billion deal farmer good job tariff thing took sir come right guy
topic 4: great job love peopl state good open american nation million right win work john governor


In [138]:
%time U, s, Vh = linalg.svd(tfidf_matrix_biden, full_matrices=False)

CPU times: user 39 ms, sys: 19.1 ms, total: 58.1 ms
Wall time: 35.1 ms


In [139]:
# print out 5 topics biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 10, isTrump= False))

topic 0: program shot local crosstalk justic afford buy general citi death
topic 1: thank want god think elect realli crosstalk hope got better
topic 2: crosstalk know right got think said talk someth vote thing
topic 3: crosstalk right worker tax thank job trump pay american union
topic 4: school abl educ need home health safe sure children make


In [140]:
# print out 5 topics biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = Vh[i], numOfWords= 15, isTrump= False))

topic 0: program shot local crosstalk justic afford buy general citi death learn test cut street idea
topic 1: thank want god think elect realli crosstalk hope got better day number import state lot
topic 2: crosstalk know right got think said talk someth vote thing lot anyth want say happen
topic 3: crosstalk right worker tax thank job trump pay american union compani build treat feder make
topic 4: school abl educ need home health safe sure children make kid open billion parent think


## 3.3 NMF

In [141]:
from sklearn import decomposition

In [142]:
clf = decomposition.NMF(n_components=5, random_state=1)

W1 = clf.fit_transform(tfidf_matrix_trump)
H1 = clf.components_

In [143]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 10, isTrump= True))

topic 0: said peopl like say know thing happen come think right
topic 1: thank let want repres great everybodi god make america respect
topic 2: year china billion world job countri biden american deal america
topic 3: want biden left citi polic law joe everi america radic
topic 4: great love job good state win right john peopl guy


In [144]:
# print out 5 topics Trump talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 15, isTrump= True))

topic 0: said peopl like say know thing happen come think right look lot got let big
topic 1: thank let want repres great everybodi god make america respect children proud minnesota job honor
topic 2: year china billion world job countri biden american deal america histori nation took joe economi
topic 3: want biden left citi polic law joe everi america radic look countri democrat new berni
topic 4: great love job good state win right john peopl guy governor realli friend open vote


In [145]:
W1 = clf.fit_transform(tfidf_matrix_biden)
H1 = clf.components_

In [146]:
# print out 5 topics Biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 10, isTrump= False))

topic 0: american trump job presid worker tax america donald pay union
topic 1: thank god want better day covid labor elect hope number
topic 2: peopl think know said say talk got thing want look
topic 3: crosstalk right treat know got someth life love want health
topic 4: school need educ safe abl make latino sure children year


In [147]:
# print out 5 topics Biden talks about
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(vector = H1[i], numOfWords= 15, isTrump= False))

topic 0: american trump job presid worker tax america donald pay union million build work make compani
topic 1: thank god want better day covid labor elect hope number state promis realli protect union
topic 2: peopl think know said say talk got thing want look countri lot time come presid
topic 3: crosstalk right treat know got someth life love want health deal problem okay happen number
topic 4: school need educ safe abl make latino sure children year home open everi mask health


## 3.4 K-means

In [148]:
from sklearn.cluster import KMeans

In [149]:
km = KMeans(n_clusters = 5)
km.fit(tfidf_matrix_trump)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [150]:
# Trump:
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 10, isTrump=True))

topic 0: said come like happen say know yeah thing got want
topic 1: citi polic left want look york biden know law new
topic 2: thank great job love want peopl realli good right know
topic 3: year china job biden american countri america world billion histori
topic 4: peopl know want right great think say like lot good


In [151]:
# Trump:
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 15, isTrump=True))

topic 0: said come like happen say know yeah thing got want right let time countri think
topic 1: citi polic left want look york biden know law new peopl radic run portland crime
topic 2: thank great job love want peopl realli good right know repres said america friend make
topic 3: year china job biden american countri america world billion histori deal nation joe state number
topic 4: peopl know want right great think say like lot good state thing way win time


In [152]:
km = KMeans(n_clusters = 5)
km.fit(tfidf_matrix_biden)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [153]:
# biden
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 10, isTrump=False))

topic 0: talk togeth know got say said anyth thing come lot
topic 1: presid trump american job america worker countri make donald work
topic 2: thank inaud tell want state god way better elect day
topic 3: peopl make think school abl vote need look right sure
topic 4: crosstalk right treat love someth life got know watch matter


In [154]:
# biden
for i in range(5):
  print(f'topic {i}: ' + getMostImportantWords(km.cluster_centers_[i], numOfWords = 15, isTrump=False))

topic 0: talk togeth know got say said anyth thing come lot someth want stand time son
topic 1: presid trump american job america worker countri make donald work peopl nation union famili build
topic 2: thank inaud tell want state god way better elect day come number good covid friend
topic 3: peopl make think school abl vote need look right sure realli know way thing year
topic 4: crosstalk right treat love someth life got know watch matter fact want say thank think


# Conclusion

Across the three methods, it is pretty clear that Trump talks a lot about China, jobs, America, crime, and the radical left. Biden talks more about schools, education, health, safety, workers, and voting. There isn't much surprise here.

- In the future we could try other topic models and different hyperparameters.