# Trump Tweets

This is the data behind the story [The World's Favorite Donald Trump Tweets](https://fivethirtyeight.com/features/the-worlds-favorite-donald-trump-tweets/).

In [1]:
# The usual suspects ...
import logging
import pandas as pd

# And their accomplices ...
from gensim import corpora
from gensim import models
from gensim import similarities
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
from pprint import pprint

# Settings
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
tweets = pd.read_csv('realDonaldTrump_poll_tweets.csv')

In [3]:
tweets.shape

(448, 3)

In [4]:
tweets.head()

| | id | created_at | text |
|---|---|---|---|
| 0 | 7.656299e+17 | 8/16/2016 19:22:57 | It's just a 2-point race, Clinton 38%, Trump 3... |
| 1 | 7.587319e+17 | 7/28/2016 18:32:31 | "@LallyRay: Poll: Donald Trump Sees 17-Point P... |
| 2 | 7.583505e+17 | 7/27/2016 17:16:56 | Great new poll - thank you!\n#MakeAmericaGreat... |
| 3 | 7.575775e+17 | 7/25/2016 14:05:27 | Great POLL numbers are coming out all over. Pe... |
| 4 | 7.536034e+17 | 7/14/2016 14:53:46 | Another new poll. Thank you for your support! ... |


In [5]:
# Text corpus
document = tweets['text'].tolist()

In [6]:
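# Build the stop-word set: English stop words plus punctuation tokens
stop_words = set(stopwords.words('english'))
stop_words.update(['-', '=', '+', '*','.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
# Tokenize each tweet; list_of_words is overwritten on every pass,
# so after the loop it holds only the last tweet's tokens
for doc in document:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
# The last tweet's tokens are then added to the stop-word set as well
stop_words.update(list_of_words)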

In [7]:
# Removing common words
texts = [[word for word in doc.lower().split() if word not in stop_words] for doc in document]

# Removing words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

pprint(texts)

[['clinton', 'trump'],
 ['poll:',
  'donald',
  'trump',
  'two',
  'breitbart',
  '@realdonaldtrump"',
  'great!'],
 ['great', 'new', 'poll', 'thank', 'you!', '#makeamericagreatagain'],
 ['great',
  'poll',
  'numbers',
  'coming',
  'people',
  'want',
  'another',
  'four',
  'years',
  'crooked',
  'hillary',
  'even'],
 ['another', 'new', 'poll.', 'thank', 'support!', '#imwithyou'],
 ['great', 'new', 'poll-', 'thank', 'america!', '#trump2016', '#imwithyou'],
 ['despite', 'spending', 'day', 'ads', 'nationwide', 'zero'],
 ['great', 'poll-', 'thank', 'you!'],
 ['new', 'poll', 'thank', 'you!', '#trump2016'],
 ['new',
  'q',
  'poll',
  'going',
  'win',
  'make',
  'america',
  'great',
  'again!',
  '#trump2016'],
 ['poll',
  'done',
  '@abc',
  '@washingtonpost',
  'even',
  'many',
  'democrats',
  'good.'],
 ['hillary', 'clinton', 'change', 'old', 'spending', 'spending', 'polls!'],
 ['@abc', 'poll', 'dishonest', 'good!'],
 ['many',
  'great',
  'things',
  'new',
  'poll',
  'numb

  'much',
  'higher.',
  'trump',
  'corrects',
  'them!'],
 ['thanks', '@nbcnews', 'showing', 'us', 'use', 'poll!', '#trump2016'],
 ['@todayshow',
  'use',
  'poll',
  'numbers',
  'massive',
  'lead',
  'instead',
  'used',
  '@cnn',
  'numbers',
  'lead'],
 ['best', 'poll', 'numbers,', 'also', 'much', 'media', 'totally', 'sad!'],
 ['leading',
  'big',
  'polls,',
  'two',
  'today,',
  '@nbc',
  'nbc',
  'poll',
  'double',
  '29%.',
  'fiorina'],
 ['@nbc', 'new', 'poll', 'numbers.', 'based', 'debate', 'results,'],
 ['@danscavino:',
  'check',
  'latest',
  'morning',
  'consult',
  'poll,',
  'released',
  '@realdonaldtrump',
  '#1',
  '#trump2016'],
 ['wow,',
  'great',
  'post-debate',
  'poll:',
  '"trump',
  'increases',
  'via',
  'breitbart'],
 ['every',
  'poll',
  'done',
  'debate',
  'last',
  'drudge',
  'newsmax',
  'time',
  'winning',
  'landslide.',
  '#makeamericagreatagain!'],
 ['winning', '@drudge_report', 'poll-'],
 ['really',
  'looking',
  'despite',
  '&amp;',

In [8]:
# Create dictionary of document
bag = corpora.Dictionary(texts)
bag.save('trump.dict')

# Converting document to a vector (bag-of-words)
corpus = [bag.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('trump.mm', corpus)

2018-06-25 15:15:56,178 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-25 15:15:56,185 : INFO : built Dictionary(660 unique tokens: ['clinton', 'trump', '@realdonaldtrump"', 'breitbart', 'donald']...) from 448 documents (total 3742 corpus positions)
2018-06-25 15:15:56,186 : INFO : saving Dictionary object under trump.dict, separately None
2018-06-25 15:15:56,188 : INFO : saved trump.dict
2018-06-25 15:15:56,193 : INFO : storing corpus in Matrix Market format to trump.mm
2018-06-25 15:15:56,194 : INFO : saving sparse matrix to trump.mm
2018-06-25 15:15:56,195 : INFO : PROGRESS: saving document #0
2018-06-25 15:15:56,203 : INFO : saved 448x660 matrix, density=1.239% (3664/295680)
2018-06-25 15:15:56,204 : INFO : saving MmCorpus index to trump.mm.index


We have assigned a unique integer id to every word appearing in the corpus by:

   1. sweeping across the texts
   2. collecting word counts and relevant statistics

Our corpus is a 448 x 660 matrix: 448 tweets by 660 unique tokens, as reported in the log above.
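
The two steps above can be sketched without gensim. This is a minimal illustration on a toy list of tokenized texts; `build_vocabulary` and `to_bow` are hypothetical helper names, not part of gensim's API, and the ids they assign may differ from the ones `corpora.Dictionary` would produce.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Convert a token list to sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Toy data (hypothetical), loosely modelled on the filtered tweets above
toy_texts = [
    ["great", "new", "poll", "thank", "you!"],
    ["new", "poll", "thank", "you!", "#trump2016"],
    ["hillary", "clinton", "spending", "spending"],
]

token2id = build_vocabulary(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]

print(len(token2id), "unique tokens")
print(toy_corpus[2])
```

Each document becomes a sparse list of `(id, count)` pairs, so a corpus of N documents over V unique tokens corresponds to an N x V matrix that is mostly zeros.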

***

### Transformation: _tf-idf_

#### Step 1:

In [9]:
# Initialization
tfidf = models.TfidfModel(corpus)

2018-06-25 15:15:56,210 : INFO : collecting document frequencies
2018-06-25 15:15:56,211 : INFO : PROGRESS: processing document #0
2018-06-25 15:15:56,214 : INFO : calculating IDF weights for 448 documents and 659 features (3664 matrix non-zeros)


We have initialized (trained) a transformation model. Different transformations may require different initialization parameters; in our case, ___tf-idf___, the "training" consists simply of going through the supplied corpus once and computing the document frequencies of all its features. By comparison, training ___Latent Semantic Analysis___ and ___Latent Dirichlet Allocation___ models is more involved and takes more time.
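
As a worked illustration of the weights this single pass makes possible, and assuming gensim's default scheme (raw term frequency, base-2 logarithmic inverse document frequency, unit-length normalization), the weight of term $t$ in document $d$ would be:

$$
w_{t,d} = \frac{\mathrm{tf}_{t,d}\,\log_2\!\left(N / \mathrm{df}_t\right)}{\sqrt{\sum_{s \in d}\left(\mathrm{tf}_{s,d}\,\log_2\!\left(N / \mathrm{df}_s\right)\right)^2}}
$$

Here $N = 448$ is the number of documents and $\mathrm{df}_t$ is the number of documents that contain $t$. Under this assumption, a term present in every tweet gets $\log_2(448/448) = 0$, while a term found in only two tweets gets $\log_2(224) \approx 7.81$ before normalization.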

|Note:|
|---|
|**A note on transformations**<br>Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where ___tf-idf___ vectors are expected, will result in a feature mismatch during transformation calls and consequently in garbage output, runtime exceptions, or both.|

#### Step 2:
From now on, ___tf-idf___ is treated as a read-only object that can be used to convert any vector from the old representation (___bag-of-words___ integer counts) to the new representation (___tf-idf___ real-valued weights).

In [10]:
# Applying the transformation to the whole corpus
corpus_tfidf = tfidf[corpus]

We have transformed our corpus (the one we used for training) into ___tf-idf___ weighted vectors. We can do this for any vector, provided it comes from the same vector space, even if it was not part of the training corpus. For ___LSA___ this is achieved by _folding-in_, and for ___LDA___ by _topic inference_.
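
To show how a trained, read-only transformation can be applied to a vector it has never seen, here is a minimal pure-Python sketch. `ToyTfidf` is a hypothetical stand-in, not gensim's implementation; it assumes base-2 idf with unit-length normalization and mimics the bracket syntax through `__getitem__`.

```python
import math
from collections import Counter

class ToyTfidf:
    """Hypothetical stand-in for a trained tf-idf transformation."""

    def __init__(self, bow_corpus):
        # "Training": one pass to count the documents containing each feature id
        n_docs = len(bow_corpus)
        doc_freq = Counter(tid for bow in bow_corpus for tid, _ in bow)
        self.idf = {tid: math.log2(n_docs / df) for tid, df in doc_freq.items()}

    def __getitem__(self, bow):
        # Feature ids never seen in training have no idf and are dropped
        weighted = [(tid, cnt * self.idf[tid]) for tid, cnt in bow if tid in self.idf]
        # Features with zero weight (present in every document) are dropped too
        weighted = [(tid, w) for tid, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        return [(tid, w / norm) for tid, w in weighted] if norm else []

# Toy bag-of-words training corpus (hypothetical)
training_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
]
toy_tfidf = ToyTfidf(training_corpus)

# A vector that was not part of training; id 7 is unknown to the model
unseen_bow = [(1, 1), (2, 1), (7, 3)]
unseen_weights = toy_tfidf[unseen_bow]
print(unseen_weights)
```

The unseen vector only makes sense because its ids refer to the same features as the training corpus; id 7 has no counterpart in that space and is silently ignored.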

#### Step 3:
We will transform our ___tf-idf___ corpus via [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_indexing) into a latent 10-D space (`num_topics=10`).
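
Latent Semantic Indexing rests on a truncated singular value decomposition. The sketch below is not gensim's algorithm; it extracts a single latent direction from a toy document-term matrix by plain power iteration, only to illustrate what one "topic" is. The function name and the toy matrix are hypothetical.

```python
import math

def top_singular_triplet(matrix, iterations=100):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u, then renormalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Rows are documents, columns are the terms: thank, you!, rubio, carson
toy_matrix = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

sigma, u, v = top_singular_triplet(toy_matrix)
doc_coords = [sigma * x for x in u]  # each document's coordinate on the latent axis

print("singular value:", round(sigma, 3))
print("term loadings:", [round(x, 3) for x in v])
print("document coordinates:", [round(x, 3) for x in doc_coords])
```

The term loadings play the role of the weights in a printed topic, and the document coordinates play the role of one entry of each vector in `corpus_lsi`. With `num_topics=10`, ten such directions are kept instead of one.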

In [11]:
# Initializing an LSI transformation
lsi = models.LsiModel(corpus_tfidf, id2word=bag, num_topics=10)
corpus_lsi = lsi[corpus_tfidf]

2018-06-25 15:15:56,236 : INFO : using serial LSI version on this node
2018-06-25 15:15:56,237 : INFO : updating model with new documents
2018-06-25 15:15:56,267 : INFO : preparing a new chunk of documents
2018-06-25 15:15:56,270 : INFO : using 100 extra samples and 2 power iterations
2018-06-25 15:15:56,271 : INFO : 1st phase: constructing (660, 110) action matrix
2018-06-25 15:15:56,277 : INFO : orthonormalizing (660, 110) action matrix
2018-06-25 15:15:56,322 : INFO : 2nd phase: running dense svd on (110, 448) matrix
2018-06-25 15:15:56,345 : INFO : computing the final decomposition
2018-06-25 15:15:56,347 : INFO : keeping 10 factors (discarding 76.498% of energy spectrum)
2018-06-25 15:15:56,350 : INFO : processed documents up to #448
2018-06-25 15:15:56,352 : INFO : topic #0(3.938): 0.411*"thank" + 0.407*"you!" + 0.364*"#makeamericagreatagain" + 0.346*"#trump2016" + 0.305*"great" + 0.288*"new" + 0.196*"poll" + 0.143*"poll-" + 0.112*"poll." + 0.111*"national"
2018-06-25 15:15:56,35

In [12]:
lsi.print_topics()

2018-06-25 15:15:56,365 : INFO : topic #0(3.938): 0.411*"thank" + 0.407*"you!" + 0.364*"#makeamericagreatagain" + 0.346*"#trump2016" + 0.305*"great" + 0.288*"new" + 0.196*"poll" + 0.143*"poll-" + 0.112*"poll." + 0.111*"national"
2018-06-25 15:15:56,366 : INFO : topic #1(2.986): -0.401*"trump" + 0.242*"you!" + -0.227*"rubio" + -0.227*"carson" + 0.209*"#makeamericagreatagain" + 0.203*"thank" + -0.191*"donald" + -0.183*"leads" + -0.182*"cruz" + -0.164*"@realdonaldtrump"
2018-06-25 15:15:56,368 : INFO : topic #2(2.459): 0.497*"great" + -0.279*"trump" + -0.269*"#makeamericagreatagain" + 0.242*"new" + 0.211*"america" + -0.192*"#trump2016" + 0.188*"make" + 0.180*"again!" + 0.168*"big" + -0.161*"rubio"
2018-06-25 15:15:56,369 : INFO : topic #3(2.292): 0.368*"national" + -0.317*"great" + -0.254*"rubio" + 0.228*"#trump2016" + 0.216*"#makeamericagreatagain" + 0.214*"lead" + -0.206*"poll-" + -0.201*"carson" + -0.198*"trump" + -0.183*"you!"
2018-06-25 15:15:56,370 : INFO : topic #4(2.172): 0.257*"n

[(0,
  '0.411*"thank" + 0.407*"you!" + 0.364*"#makeamericagreatagain" + 0.346*"#trump2016" + 0.305*"great" + 0.288*"new" + 0.196*"poll" + 0.143*"poll-" + 0.112*"poll." + 0.111*"national"'),
 (1,
  '-0.401*"trump" + 0.242*"you!" + -0.227*"rubio" + -0.227*"carson" + 0.209*"#makeamericagreatagain" + 0.203*"thank" + -0.191*"donald" + -0.183*"leads" + -0.182*"cruz" + -0.164*"@realdonaldtrump"'),
 (2,
  '0.497*"great" + -0.279*"trump" + -0.269*"#makeamericagreatagain" + 0.242*"new" + 0.211*"america" + -0.192*"#trump2016" + 0.188*"make" + 0.180*"again!" + 0.168*"big" + -0.161*"rubio"'),
 (3,
  '0.368*"national" + -0.317*"great" + -0.254*"rubio" + 0.228*"#trump2016" + 0.216*"#makeamericagreatagain" + 0.214*"lead" + -0.206*"poll-" + -0.201*"carson" + -0.198*"trump" + -0.183*"you!"'),
 (4,
  '0.257*"national" + 0.252*"leads" + -0.245*"@realdonaldtrump" + 0.220*"donald" + 0.200*"great" + -0.196*"numbers" + -0.190*"@cnn" + -0.167*"you!" + 0.156*"america" + -0.155*"thank"'),
 (5,
  '-0.325*"rubio" 

#### Topics

We've transformed our corpus to have 10 topics according to ___LSI___:
<br>
**Topic 1**
> _"thank", "you!", "#makeamericagreatagain", "#trump2016", "great", "new", "poll", "poll-", "poll.", "national"_

It appears that ___"thank"___, ___"you!"___ and the hash-tags ___"#makeamericagreatagain"___ & ___"#trump2016"___ are all related and contribute the most in the direction of the first topic, while ___"poll"___ (with its variants ___"poll-"___ and ___"poll."___) and ___"national"___ contribute the least among the ten listed terms.

**Topic 2**
> _"trump", "you!", "carson", "rubio", "#makeamericagreatagain", "thank", "donald", "leads", "cruz", "@realdonaldtrump"_

In the second topic, the words ___"trump"___, ___"you!"___, ___"rubio"___, ___"carson"___ & the hash-tag ___"#makeamericagreatagain"___ contribute the most. This topic associates the current American president, [Donald Trump](https://en.wikipedia.org/wiki/Donald_Trump), with his Republican critics and opponents during the primaries, [Ben Carson](https://en.wikipedia.org/wiki/Ben_Carson) and [Marco Rubio](https://en.wikipedia.org/wiki/Marco_Rubio). Mentions of [Ted Cruz](https://en.wikipedia.org/wiki/Ted_Cruz) and of the handle ___"@realdonaldtrump"___ contribute the least among the listed terms.
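
When reading a topic, both the size and the sign of each weight matter. The sketch below ranks the terms of the second topic (index 1 in the `lsi.print_topics()` output) by absolute weight and groups them by sign, using the weights printed above. `split_by_sign` is a hypothetical helper. The overall sign of an SVD direction is arbitrary, so only the contrast between the two groups is meaningful.

```python
# Term weights of the second topic, copied from the lsi.print_topics() output
topic_1 = [
    ("trump", -0.401), ("you!", 0.242), ("rubio", -0.227), ("carson", -0.227),
    ("#makeamericagreatagain", 0.209), ("thank", 0.203), ("donald", -0.191),
    ("leads", -0.183), ("cruz", -0.182), ("@realdonaldtrump", -0.164),
]

def split_by_sign(topic):
    """Separate (term, weight) pairs into positively and negatively weighted groups."""
    positive = [(term, weight) for term, weight in topic if weight > 0]
    negative = [(term, weight) for term, weight in topic if weight < 0]
    return positive, negative

# Rank by magnitude: a large negative weight contributes as much as a large positive one
ranked = sorted(topic_1, key=lambda pair: abs(pair[1]), reverse=True)
positive, negative = split_by_sign(ranked)

print("strongest:", ranked[0], "weakest:", ranked[-1])
print("positive side:", [term for term, _ in positive])
print("negative side:", [term for term, _ in negative])
```

In this topic the candidate names fall on one side and the "thank you" vocabulary on the other, which suggests the direction separates tweets about rivals from tweets thanking supporters.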

**Topic 3**
> _"great", "trump", "#makeamericagreatagain", "new", "america", "#trump2016", "make", "again!", "big", "rubio"_

Here we have the president's name being associated with "greatness", "bigness" and "America". This is likely in association with his campaign slogan ["Make America Great Again"](https://en.wikipedia.org/wiki/Make_America_Great_Again).

**Topic 4**
> _"national", "great", "rubio", "#trump2016", "#makeamericagreatagain", "lead", "poll-", "carson", "trump", "you!"_

Topic number 4 has a theme that contains "national", "great", "Marco Rubio", "leading", "poll", "Ben Carson", "Donald Trump" and the hash-tags ___"#trump2016"___ & ___"#makeamericagreatagain"___. This is likely the topic related to Donald Trump making comments about his Republican opposition and references to him leading in the national polls.

**Topic 5**
> _"national", "leads", "@realdonaldtrump", "donald", "great", "numbers", "@cnn", "you!", "america", "thank"_

In this topic we have [Donald Trump](https://en.wikipedia.org/wiki/Donald_Trump) using his Twitter handle [___"@realdonaldtrump"___](https://twitter.com/realDonaldTrump) and tweeting about how he is leading in the national polls. There is reference to [___"@cnn"___](https://twitter.com/cnn) and giving thanks to the people of America.

**Topic 6**
> _"rubio", "carson", "donald", "cruz", "clinton", "poll:", "leads", "national", "@realdonaldtrump", "bush"_

This topic likely covers tweets in which Donald Trump compares himself in the national polls to his opponents, that is, [Marco Rubio](https://en.wikipedia.org/wiki/Marco_Rubio), [Ben Carson](https://en.wikipedia.org/wiki/Ben_Carson), [Ted Cruz](https://en.wikipedia.org/wiki/Ted_Cruz), [Hillary Clinton](https://en.wikipedia.org/wiki/Hillary_Clinton), and [Jeb Bush](https://en.wikipedia.org/wiki/Jeb_Bush).

**Topic 7**
> _"new", "@realdonaldtrump", "big", "great", "you!", "clinton", "hillary", "#1", "america", "make"_

Topic 7 has [Donald Trump](https://en.wikipedia.org/wiki/Donald_Trump) tweeting about his Democratic opponent [Hillary Clinton](https://en.wikipedia.org/wiki/Hillary_Clinton) with references to "#1", "new-ness", "big-ness", "great-ness", "making" and "America". These are likely tweets where Donald Trump is criticizing Hillary and then referring to his campaign slogan.

**Topic 8**
> _"debate", "poll.", "said", "lead", "iowa", "new", "debate.", "@cnn", "@realdonaldtrump", "every"_

It appears that in this topic ___"debate"___ contributes the most. It is followed by ___"poll."___ (the word "poll" at the end of a sentence), the words ___"said"___, ___"lead"___, ___"Iowa"___, ___"new"___, ___"every"___ and the CNN handle ___"@cnn"___. [Iowa](https://en.wikipedia.org/wiki/Iowa) is likely referenced because of its presidential caucus, the first in the country, which together with the [New Hampshire](https://en.wikipedia.org/wiki/New_Hampshire) primary opens the process by which the two major-party candidates for president are chosen.

**Topic 9**
> _"@realdonaldtrump", "debate", "big", "leads", "national", "lead", "last", "leading", "#1", "said"_

Topic 9 associates Donald Trump's tweets on the debate and leading in the national polls. The official handle [***@realdonaldtrump***](https://twitter.com/realDonaldTrump) contributes the most in this topic, likely because these tweets were made using this account.

**Topic 10**
> _"numbers", "leads", "lead", "new", "debate.", "said", "clinton", "poll.", "great", "big"_

In this topic, Donald Trump is tweeting about polling numbers, the debate and Hillary Clinton. These are likely tweets in which Donald Trump comments on and criticizes what Hillary Clinton has said and how it has affected the polls.

In [13]:
# Executing: bow->tfidf and tfidf->lsi
for doc in corpus_lsi:
    print(doc)

[(0, 0.073231423637567239), (1, -0.20895483266254669), (2, -0.11710527852370793), (3, -0.11512318465152012), (4, 0.045601569748463411), (5, 0.23294072271561675), (6, 0.18864461192586707), (7, -0.095635960466902439), (8, -0.066741419692552595), (9, 0.13168177113064941)]
[(0, 0.062404045327024163), (1, -0.20684914235001917), (2, -0.11686233620297344), (3, -0.029416789707289788), (4, 0.14213299878455601), (5, 0.19134294804849247), (6, 0.033615119585154452), (7, -0.026694164690436169), (8, 0.0064661948257302357), (9, 0.014117068840645435)]
[(0, 0.81022402948214078), (1, 0.26566732919775615), (2, 0.026644319948621396), (3, -0.14316918058017833), (4, 0.021129841252047324), (5, 0.019696095753822057), (6, 0.01588624124794874), (7, 0.030957273703774695), (8, -0.011953678044104164), (9, 0.081184591938764483)]
[(0, 0.12868167660485624), (1, -0.094325163192610126), (2, 0.20090120850906742), (3, -0.034383091939804745), (4, -0.093952566208802213), (5, 0.077196675876744558), (6, 0.16193814845696819),

In [14]:
# Model persistence: save(), load()
lsi.save('trump.lsi')
lsi = models.LsiModel.load('trump.lsi')

2018-06-25 15:15:56,537 : INFO : saving Projection object under trump.lsi.projection, separately None
2018-06-25 15:15:56,542 : INFO : saved trump.lsi.projection
2018-06-25 15:15:56,544 : INFO : saving LsiModel object under trump.lsi, separately None
2018-06-25 15:15:56,547 : INFO : not storing attribute projection
2018-06-25 15:15:56,548 : INFO : not storing attribute dispatcher
2018-06-25 15:15:56,551 : INFO : saved trump.lsi
2018-06-25 15:15:56,552 : INFO : loading LsiModel object from trump.lsi
2018-06-25 15:15:56,614 : INFO : loading id2word recursively from trump.lsi.id2word.* with mmap=None
2018-06-25 15:15:56,615 : INFO : setting ignored attribute projection to None
2018-06-25 15:15:56,616 : INFO : setting ignored attribute dispatcher to None
2018-06-25 15:15:56,617 : INFO : loaded trump.lsi
2018-06-25 15:15:56,618 : INFO : loading LsiModel object from trump.lsi.projection
2018-06-25 15:15:56,619 : INFO : loaded trump.lsi.projection


***
### Similarity

#### Step 1:

In [15]:
# Initializing the query structure: transform corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[corpus])

2018-06-25 15:15:56,636 : INFO : creating matrix with 448 documents and 10 features


In [16]:
# Index persistence
index.save('trump.index')
index = similarities.MatrixSimilarity.load('trump.index')

2018-06-25 15:15:56,659 : INFO : saving MatrixSimilarity object under trump.index, separately None
2018-06-25 15:15:56,665 : INFO : saved trump.index
2018-06-25 15:15:56,666 : INFO : loading MatrixSimilarity object from trump.index
2018-06-25 15:15:56,682 : INFO : loaded trump.index


#### Step 2:

In [17]:
# Performing queries
doc = "Hillary Clinton."
vec_bow = bag.doc2bow(doc.lower().split())

# Convert the query to LSI space
vec_lsi = lsi[vec_bow]

# Perform a similarity query against the corpus
sims = index[vec_lsi]

# Ranking the tweets by their weights of similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Printing the associated Tweets:
for i in range(10):
    print("Tweet Rank #{}:\tWeight: {}\nRaw text: {}\n".format(i+1, sims[i][1], document[sims[i][0]]))

Tweet Rank #1:	Weight: 0.9743563532829285
Raw text: Hillary Clinton is not a change agent, just the same old status quo! She is spending a fortune, I am spending very little. Close in polls!

Tweet Rank #2:	Weight: 0.9544565677642822
Raw text: Don't believe the @FoxNews Polls, they are just another phony hit job on me. I will beat Hillary Clinton easily in the General Election.

Tweet Rank #3:	Weight: 0.88831627368927
Raw text: .@USATODAY Poll and @QuinnipiacPoll say that I beat both Hillary and Bernie, and I havn't even started on them yet!

Tweet Rank #4:	Weight: 0.8672994375228882
Raw text: The Republican establishment, out of self preservation, is concerned w/ my high poll #'s. More concerned are Dems—I beat Hillary heads up!

Tweet Rank #5:	Weight: 0.8623911142349243
Raw text: Kasich only looks O.K. in polls against Hillary because nobody views him as a threat and therefore have placed ZERO negative ads against him

Tweet Rank #6:	Weight: 0.8412905931472778
Raw text: Ted Cruz is

When we query for "Hillary Clinton" to retrieve the top tweets associated with her name, we find that the highest-ranked tweet is a strong criticism of Hillary Clinton and her campaign spending. The remaining nine tweets are associated with the polls, dotted with references to Donald Trump criticizing poll results not in his favor, leading against his opponents, and presenting himself as likely to win.
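
The weights in this ranking are cosine similarities between the query vector and each tweet's vector in the latent space. The pure-Python sketch below reproduces the ranking logic on toy 2-D vectors; `cosine` and `rank_by_similarity` are hypothetical helpers, and the vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two dense vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(query, vectors, top_n=3):
    """Return (index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda item: -item[1])[:top_n]

# Toy document vectors in a 2-D latent space (hypothetical)
toy_vectors = [[0.8, 0.1], [0.1, 0.9], [0.7, 0.3]]
toy_query = [1.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_vectors)
for rank, (doc_id, weight) in enumerate(ranking, start=1):
    print("Rank #{}: document {} (weight {:.3f})".format(rank, doc_id, weight))
```

Because cosine similarity ignores vector length, a short tweet and a long tweet pointing in the same latent direction score equally against the query.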