# Topic Modelling (LDA)

In this notebook, I use Latent Dirichlet Allocation (LDA) to model the topics of tweets. Over the 20-day period, I ran the model to extract the top features, in order to compare the topics and top features of true news and fake news and analyse how they differ. I ran the analysis once every 5 days, because the topics would not change much over a shorter period.

For the analysis I need the nltk package. When initialising the cluster, I already instructed all machines to download the package. Here, I just need to check that it has been installed successfully.

In [1]:
#preparation
import nltk 
nltk.download('all') # note that this should output "already up-to-date!" for every package!
sc.defaultParallelism

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package cess_cat is already up-

[nltk_data]    |   Package sentiwordnet is already up-to-date!
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package sentence_polarity is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package sinica_treebank is already up-to-date!
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package smultron is already up-to-date!
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /usr/local/share/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /usr/local/share/nltk_data.

4

In [2]:
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
import numpy as np
import os

In [3]:
#files are here
file_location_1 = 'gs://good-bucket/st446_project/hillary/'
file_location_2 = 'gs://good-bucket/st446_project/trump/'
files = ['2016-10-12.json','2016-10-17.json','2016-10-22.json','2016-10-27.json','2016-11-01.json']
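
The file names above follow the five-day sampling interval. As a minimal sketch, the same list could be generated from the first and last dates instead of being typed by hand. The helper name `snapshot_files` and its parameters are my own assumptions for illustration, not part of the original notebook.

```python
from datetime import date, timedelta

def snapshot_files(start, end, step_days=5):
    """Return 'YYYY-MM-DD.json' names from start to end (inclusive), one every step_days days."""
    names = []
    day = start
    while day <= end:
        names.append(day.isoformat() + '.json')
        day += timedelta(days=step_days)
    return names

# Hypothetical replacement for the hard-coded list above
files_generated = snapshot_files(date(2016, 10, 12), date(2016, 11, 1))
print(files_generated)
```

This only covers the case where a file exists for every sampled date; a missing file would still have to be handled when reading.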

The first step in LDA is to preprocess the texts. I need to remove the stopwords and lemmatize the words. Twitter has some domain-specific stopwords like rt (retweet) and nt (most likely what is left of the contraction token n't that word_tokenize produces, once punctuation is stripped). I also need to remove words like 'hillary', 'clinton', 'donald' and 'trump': by construction they appear in every tweet and give no useful information. The same goes for the screen names of Hillary Clinton ('hillaryclinton') and Donald Trump ('realdonaldtrump').

In [4]:
#natural language processing

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stopword = stopwords.words('english')
#add Twitter-specific stopwords and the candidates' names and screen names
stopword.extend(['rt', 'http', 'https', 'nt',
                 'hillary', 'clinton', 'donald', 'trump',
                 'hillaryclinton', 'realdonaldtrump'])

stop_words = set(stopword)

table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer()

#define a function to split texts into tokens after preprocessing
#and a user-defined function (UDF) to store the result in a new column

def get_tokens(line): #remove stopwords, lemmatization etc
    # rows without text produce no tokens
    if line is None:
        return []
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # lemmatizing the words, see https://en.wikipedia.org/wiki/Lemmatisation
    words = [lmtzr.lemmatize(w) for w in words]
    return words
tokenized_udf = udf(get_tokens, ArrayType(StringType()))

In [9]:
#topic modelling function
#The hyperparameters k and l were selected by trial and error:
#I kept the values whose output I could interpret

def topic_modelling(data_from_file,k=30,l=15):
    '''Take a DataFrame of tweets and print the top words of each LDA topic'''
    
    #k is the number of topics to use
    #l is the number of top words to show for each topic
    
    print ('Begin text preprocessing')
    #drop rows without text before tokenizing them
    texts = (data_from_file.filter("text is not null")
             .withColumn("dummy", lit(1))
             .withColumn('tokenized_text', tokenized_udf(data_from_file.text))
             .select(['dummy','tokenized_text']))
    
    print ('Constructing spark vector')
    
    #bag-of-words counts; ignore words that appear in fewer than 2 tweets
    cv = CountVectorizer(inputCol="tokenized_text", outputCol="features", minDF=2)
    cv_model = cv.fit(texts)

    texts_df_w_features = cv_model.transform(texts)
    texts_df_w_features.cache()

    print ('Begin LDA')
    lda = LDA(k=k, maxIter=10)
    lda_model = lda.fit(texts_df_w_features)

    #top l terms of every topic, as indices into the vocabulary
    topics = lda_model.describeTopics(l)

    topic_i = topics.select("termIndices").rdd.map(lambda r: r[0]).collect()
    for i in topic_i:
        #translate the term indices back into words
        print(np.array(cv_model.vocabulary)[i])
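
The last step of the function turns each topic's term indices back into words by indexing the CountVectorizer vocabulary. A minimal sketch of that lookup in plain Python, without Spark, is shown here. The helper `topic_words` and the toy vocabulary and indices are hypothetical stand-ins for `cv_model.vocabulary` and the collected `termIndices` column.

```python
def topic_words(vocabulary, term_indices):
    """Translate each topic's term indices into the corresponding vocabulary words."""
    return [[vocabulary[i] for i in indices] for indices in term_indices]

# Toy stand-ins (hypothetical values): in the notebook these would be
# cv_model.vocabulary and the collected "termIndices" column of describeTopics.
toy_vocabulary = ['email', 'wikileaks', 'vote', 'poll', 'fbi', 'comey']
toy_term_indices = [[4, 5, 0], [2, 3, 1]]

for words in topic_words(toy_vocabulary, toy_term_indices):
    print(words)
```

Each inner list plays the role of one printed topic in the outputs further down.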
    

Below I show an example of the output of the tokenized_udf function.

In [6]:
#Sample tweet text after cleaning
data_from_file = spark.read.json("gs://good-bucket/hillary/2016-10-12.json")
texts = data_from_file.withColumn("dummy", lit(1)).withColumn('tokenized_text', tokenized_udf(data_from_file.text)).filter("text is not null").select(['dummy','tokenized_text'])
texts.take(1)

[Row(dummy=1, tokenized_text=['benefit', 'foul', 'play', 'suspected', 'keeping', 'green', 'party', 'nevada', 'ballot', 'prison', 'trumppotus'])]
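
To make the cleaning steps easier to follow, here is a minimal sketch of the lower-casing, punctuation-stripping and filtering stages applied to a hand-written token list. It leaves out `word_tokenize` and the lemmatizer, so it is not the full `get_tokens` function. The tokens, the short stopword set and the helper name `clean_tokens` are assumptions made for illustration. It also suggests why strings such as 'tcohebppubgax' show up in the topics: a shortened t.co link becomes purely alphabetic once its punctuation is removed.

```python
import string

# Same punctuation table as in the notebook
table = str.maketrans('', '', string.punctuation)

def clean_tokens(tokens, stop_words):
    """Lower-case, strip punctuation, keep alphabetic non-stopwords (no lemmatization)."""
    lowered = [w.lower() for w in tokens]
    stripped = [w.translate(table) for w in lowered]
    return [w for w in stripped if w.isalpha() and w not in stop_words]

# Hypothetical token list, written by hand rather than produced by word_tokenize
tokens = ['RT', '@HillaryClinton', ':', 'Breaking', 'news', '//t.co/AbCdEf', '!']
small_stop_words = {'rt', 'hillaryclinton', 'http', 'https'}

print(clean_tokens(tokens, small_stop_words))
```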

Then I ran topic modelling on the tweets mentioning Hillary and Trump, once every five days, to understand the topics and see the top words (for true and fake news).

In [None]:
print (file_location[-15:-5]) #Date collected 
data_from_file = spark.read.json(file_location)

In [10]:
#true tweets mentioning hillary clinton
for i in files:
    file_full_address = file_location_1+i
    print (file_full_address[-15:-5])
    data_from_file = spark.read.json(file_full_address)
    data_from_file = data_from_file.filter('fact_check==1')
    topic_modelling(data_from_file)

2016-10-12
Begin text preprocessing
Constructing spark vector
Begin LDA
['lead' 'assault' 'via' 'bachmann' 'michele' 'sexual' 'russian' 'think'
 'huffpostpol' 'voting' 'kurteichenwald' 'christian' 'warns' 'wrong' 'sex']
['cernovich' 'hillarygropedme' 'attention' 'year' 'state' 'special' 'gave'
 'old' 'bill' 'quake' 'dept' 'locker' 'room' 'offended' 'silly']
['hypocrisy' 'call' 'penny' 'sander' 'twitter' 'tcovnyselswfh'
 'caymanstaxdodge' 'supporter' 'overheard' 'standard' 'bigoted' 'campaign'
 'remark' 'barackobama' 'crazy']
['gatewaypundit' 'therealroseanne' 'tcoehzykjhzbk' 'said' 'muslim'
 'anyone' 'imagine' 'carlson' 'catholic' 'aide' 'tcoprahkhpuow'
 'mundyspeaks' 'wikileaks' 'allinwithchris' 'hardball']
['camp' 'catholic' 'bigotry' 'mocking' 'morris' 'video' 'fatherjonathan'
 'father' 'billhemmer' 'love' 'beyoncé' 'make' 'hypocrite' 'woman'
 'xrated']
['imwithher' 'make' 'america' 'one' 'great' 'pantsuit' 'vote' 'continue'
 'nbc' 'conway' 'sharonstone' 'jailed' 'whether' 'news' 'n

['u' 'vote' 'war' 'could' 'push' 'lying' 'judicialwatch' 'tomfitton'
 'endangered' 'world' 'tcofeikfgrqrt' 'warns' 'gorbachev' 'bigchrisbrfc'
 'tcocvccnvbceo']
['joyannreid' 'az' 'incredible' 'paper' 'response' 'supporter' 'threat'
 'juxterjinx' 'ananavarro' 'piece' 'mitchellvii' 'networksmanager'
 'dixielanddiva' 'hollyammon' 'mccoytonia']
2016-10-22
Begin text preprocessing
Constructing spark vector
Begin LDA
['teamtrump' 'poll' 'message' 'show' 'gain' 'rigged' 'tcogqjlgjkbqz'
 'court' 'gun' 'supreme' 'ruling' 'characterization' 'slammed' 'fousfan'
 'campaign']
['staff' 'cybersecurity' 'lectured' 'state' 'dept' 'video'
 'jasoninthehouse' 'lead' 'seems' 'reuters' 'system' 'working' 'half'
 'rigged' 'new']
['sexual' 'remember' 'assault' 'paid' 'please' 'get' 'jill' 'spin'
 'pattyarquette' 'harth' 'campaign' 'violence' 'rally' 'right' 'tying']
['corruption' 'battle' 'cry' 'closing' 'argument' 'danscavino' 'wikileaks'
 'aide' 'energizer' 'mistress' 'bunny' 'alleged' 'address' 'scramble'


['email' 'maga' 'scandal' 'obama' 'aide' 'broke' 'shock' 'expressed'
 'campaign' 'north' 'michelle' 'support' 'carolina' 'want' 'edhenry']
['imwithher' 'republican' 'sorry' 'probably' 'would' 'beating' 'rubio'
 'kasich' 'fawfulfan' 'wikileaks' 'email' 'camp' 'responds' 'amp'
 'mikeemanuelfox']
['obamacare' 'go' 'new' 'face' 'see' 'challenge' 'care' 'foundation'
 'wikileaks' 'business' 'memo' 'health' 'personal' 'reveals' 'interplay']
['woman' 'de' 'anyone' 'god' 'via' 'pledge' 'laugh' 'allegiance' 'remove'
 'seriously' 'news' 'campaign' 'clean' 'inspiring' 'well']
['watch' 'north' 'gop' 'carolina' 'introduces' 'cnngo' 'supporter' 'court'
 'michelle' 'nominee' 'another' 'block' 'campaign' 'win' 'conservative']
['give' 'charity' 'fund' 'making' 'income' 'slush' 'rest' 'huge' 'aid'
 'whopping' 'halerazor' 'show' 'email' 'bill' 'speech']
2016-11-01
Begin text preprocessing
Constructing spark vector
Begin LDA
['fbi' 'comey' 'russia' 'election' 'james' 'via' 'source' 'clear' 'see'
 'lie' 'op

In [11]:
#false tweets mentioning hillary clinton
for i in files:
    file_full_address = file_location_1+i
    print (file_full_address[-15:-5])
    data_from_file = spark.read.json(file_full_address)
    data_from_file = data_from_file.filter('fact_check==-1')
    topic_modelling(data_from_file)

2016-10-12
Begin text preprocessing
Constructing spark vector
Begin LDA
['tcohebppubgax' 'fade' 'maga' 'spread' 'three' 'cost' 'minute' 'video'
 'breaking' 'retard' 'dn' 'sure' 'fcking' 'watch' 'obliterate']
['assange' 'julian' 'varneyco' 'keep' 'retweeting' 'tgowdysc'
 'tcoyvriuhpoun' 'email' 'proof' 'soon' 'claim' 'bombshell' 'trying'
 'released' 'funding']
['bill' 'video' 'girl' 'race' 'plunge' 'chaos' 'rape' 'showing'
 'anonymous' 'fbi' 'politics' 'presidential' 'breaking' 'raping' 'comey']
['breaking' 'video' 'obliterate' 'endorsed' 'watch' 'stein' 'jill' 'staff'
 'bombshell' 'crime' 'felony' 'moveforwardhuge' 'bragging' 'news' 'expose']
['breaking' 'email' 'lamaness' 'tcorrxhegoodc' 'huge' 'black' 'racism'
 'towards' 'hacked' 'expose' 'vast' 'tcokjvqkgwofa' 'justinraimondo'
 'emerged' 'chaos']
['break' 'horrible' 'silence' 'alert' 'roseanne' 'reveals' 'barr' 'thing'
 'gropethis' 'livnow' 'expose' 'hannity' 'foxnews' 'wikileaks'
 'thedonaldnews']
['say' 'rip' 'staffer' 'ok' 'voter

['cnn' 'vote' 'breaking' 'fraudulent' 'thousand' 'ten' 'ohio' 'found'
 'hiding' 'look' 'sick' 'warehouse' 'cabal' 'go' 'turned']
['actually' 'need' 'email' 'wrote' 'dems' 'wikileaks' 'heal' 'campaign'
 'turner' 'nina' 'dontgetfooledagain' 'afbranco' 'call'
 'hillaryliesmatter' 'hillaryliedpeopledied']
2016-10-22
Begin text preprocessing
Constructing spark vector
Begin LDA
['federal' 'campaign' 'judge' 'office' 'bill' 'finally' 'law' 'happening'
 'forced' 'read' 'perjury' 'threat' 'submit' 'holding' 'disqualified']
['vote' 'breaking' 'discovered' 'thousand' 'ten' 'fraudulent' 'eeynouf'
 'change' 'ad' 'powerful' 'everything' 'could' 'antihillary' 'cost'
 'woman']
['agency' 'hacksturns' 'russia' 'blame' 'lied' 'said' 'intelligence'
 'breaking' 'lindasuhler' 'lawsuit' 'fraud' 'massive' 'election' 'filed'
 'amp']
['journalist' 'dead' 'exposed' 'famous' 'found' 'think' 'disqualified'
 'presidential' 'nominee' 'poll' 'usakathydavis' 'tcoevshrwomss' 'bernie'
 'vicitrue' 'hillary']
['breaking' 

['wikileaks' 'assange' 'expose' 'trying' 'pedo' 'frame' 'reddit' 'company'
 'linked' 'front' 'love' 'star' 'point' 'state' 'freedom']
['breaking' 'video' 'vote' 'gun' 'foundation' 'executive' 'morris' 'dick'
 'teneo' 'racketeering' 'confession' 'fixer' 'smoking' 'proof' 'issue']
2016-11-01
Begin text preprocessing
Constructing spark vector
Begin LDA
['email' 'worse' 'could' 'leaked' 'isi' 'amp' 'anyone' 'breaking'
 'discovered' 'vote' 'via' 'thousand' 'ten' 'fraudulent' 'top']
['world' 'politicus' 'breaking' 'via' 'prayer' 'answered' 'corrupt'
 'video' 'wifi' 'turned' 'campaign' 'reason' 'expose' 'jet' 'stage']
['wikileaks' 'everything' 'could' 'election' 'change' 'deleted' 'release'
 'muslim' 'brotherhood' 'tcokfoksvkrdj' 'tpoliticmanager' 'confirms'
 'rigged' 'working' 'terror']
['found' 'life' 'property' 'nypd' 'raided' 'ruin' 'take' 'going' 'support'
 'send' 'please' 'gowdy' 'pentagon' 'trey' 'police']
['via' 'yournewswire' 'obama' 'twitter' 'michelle' 'deletes' 'isi'
 'wikileaks' 

By examining the top features identified by the LDA model and searching for them on Google, I analysed the top three identifiable topics for true news and fake news mentioning Hillary on each day. The identification of true news topics is fairly accurate. However, most of the fake news had been removed from the Internet by the time of this research, so I had to make educated guesses at times. The aim is to find the differences in topics and top features.



On 12th Oct, 


The true news mentioning Hillary were about 

1. Kurt Eichenwald: Vladimir Putin Suffered "Buyer's Remorse" About Supporting Donald Trump After Khizr Khan Incident, see https://www.realclearpolitics.com/video/2016/12/17/kurt_eichenwald_vladimir_putin_suffered_buyers_remorse_about_supporting_trump_after_khizr_khan_incident.html

2. Expressions of doubt about Trump making America great again

3. WikiLeaks leaking Hillary's work emails, see https://wikileaks.org/clinton-emails/


The fake news mentioning Hillary were about

1. a video about presidential candidate raping a girl

2. WikiLeaks leaking Hillary's work emails, but possibly from a non-credible source that exaggerates things

3. A page on Facebook that receives payment to attack people


On 17th Oct,

The true news mentioning Hillary were about 

1. Sen. McCain Says Republicans Will Block All Court Nominations If Clinton Wins, see https://www.npr.org/2016/10/17/498328520/sen-mccain-says-republicans-will-block-all-court-nominations-if-clinton-wins

2. FBI docs: Clinton 'contemptuous' of security agents, put team at risk for photo op
https://www.foxnews.com/politics/fbi-docs-clinton-contemptuous-of-security-agents-put-team-at-risk-for-photo-op

3. George Soros Donald Trump Would Be Dictator Democracy In Crisis
https://insider.foxnews.com/node?page=1159

The fake news mentioning Hillary were about

1. A video about WikiLeaks founder Julian Assange killing a woman

2. WikiLeaks founder found dead with a brain tumor

3. A video about President Obama and George Soros smoking and killing dogs


On 22nd Oct,

The true news mentioning Hillary were about 

1. WikiLeaks Document Shows Apparent Gender Pay Gap at Clinton Foundation
https://insider.foxnews.com/2016/10/22/wikileaks-hillary-clinton-email-hack-gender-pay-gap-clinton-foundation

2. 18 revelations from Wikileaks' hacked Clinton emails
https://www.bbc.com/news/world-us-canada-37639370

3. Clinton far ahead in Electoral College race: Reuters/Ipsos poll
https://www.reuters.com/article/us-usa-election-poll-electoral-idUSKCN12M0JR

The fake news mentioning Hillary were about

1. Federal judges threatening people in order to pass bills

2. A video showing the WikiLeaks founder raping a black boy

3. Something to do with WikiLeaks and killing

On 27th Oct,

The true news mentioning Hillary were about 

1. Voting machine errors changed votes in Cruz-O’Rourke race, group says
https://www.houstonchronicle.com/news/politics/texas/article/Voting-machine-errors-changed-some-Texans-13339298.php

2. Hillary Clinton's cloud of corruption
http://akdart.com/lib133.html

3. WikiLeaked Clinton Memo Acknowledges Obamacare's Exchanges Are Failing
https://www.forbes.com/sites/theapothecary/2016/10/18/leaked-clinton-memo-acknowledges-obamacares-exchanges-are-failing/#5db44ec9d8d8

The fake news mentioning Hillary were about

1. Joe Biden insulting and forcing some women

2. Actress Susan Sarandon being involved in drugs, murder and sex

3. Voting machines being rigged, presented as alarming


In [12]:
#true tweets mentioning trump
for i in files:
    file_full_address = file_location_2+i
    print (file_full_address[-15:-5])
    data_from_file = spark.read.json(file_full_address)
    data_from_file = data_from_file.filter('fact_check==1')
    topic_modelling(data_from_file)

2016-10-12
Begin text preprocessing
Constructing spark vector
Begin LDA
['read' 'one' 'offended' 'never' 'central' 'park' 'calling' 'apologized'
 'murder' 'wesleylowery' 'guy' 'husband' 'tomhanks' 'kharyp' 'father']
['year' 'girl' 'word' 'ten' 'need' 'dating' 'special' 'prosecutor'
 'debate' 'right' 'email' 'video' 'case' 'story' 'jysexton']
['via' 'bill' 'cosby' 'tell' 'lie' 'link' 'abc' 'assault' 'absurd'
 'anncoulter' 'handle' 'allegation' 'advised' 'next' 'say']
['woman' 'say' 'two' 'touched' 'inappropriately' 'story' 'see' 'nytimes'
 'via' 'top' 'w' 'mikiebarb' 'like' 'new' 'breaking']
['new' 'allegation' 'hand' 'sexual' 'put' 'skirt' 'tried' 'york' 'mr'
 'grabbed' 'time' 'breast' 'threatens' 'via' 'groping']
['drip' 'cbsnews' 'via' 'mmflint' 'joshtpm' 'disgusting' 'human' 'student'
 'university' 'liberty' 'slug' 'associated' 'revolting' 'tired' 'kennethn']
['knew' 'could' 'libel' 'journalist' 'lawsuit' 'filed' 'win' 'harass'
 'adamsteinbaugh' 'want' 'gt' 'top' 'shoe' 'stay' 'name

['thiel' 'peter' 'support' 'give' 'rosiegray' 'chooses' 'article' 'hill'
 'make' 'prisonplanet' 'via' 'million' 'wrote' 'retweeting'
 'tcohmphfzdqef']
['via' 'lead' 'people' 'nationally' 'teamtrump' 'tcoooyukdnuaw' 'say' 'jr'
 'sexual' 'called' 'shooting' 'true' 'aurora' 'predator' 'joked']
2016-10-22
Begin text preprocessing
Constructing spark vector
Begin LDA
['georgia' 'lose' 'samsteinhp' 'w' 'atlanta' 'svdate' 'dateline' 'u' 'day'
 'america' 'medium' 'plan' 'electoral' 'latest' 'ppollingnumbers']
['amp' 'much' 'deal' 'time' 'power' 'russia' 'concentration' 'warner'
 'opposes' 'online' 'wikileaks' 'cheerleader' 'many' 'lawsuit' 'abc']
['first' 'cabin' 'log' 'republican' 'lay' 'policy' 'day' 'white' 'house'
 'tcoorwoifjusz' 'history' 'university' 'trial' 'hope' 'via']
 'election' 'sound' 'familiar' 'immigrant' 'result' 'lawmaker' 'plan'
 'tcojhxsdtjrrs']
['say' 'gettysburg' 'woman' 'via' 'debate' 'sex' 'foxnews' 'offered'
 'like' 'speech' 'sexually' 'assaulted' 'christinawilkie' 'com

['people' 'medium' 'year' 'ca' 'take' 'insult' 'blow' 'hillbullies'
 'cernovich' 'get' 'right' 'american' 'please' 'campaign' 'immigration']
['one' 'know' 'election' 'rigged' 'right' 'teapainusa' 'riggin' 'oughta'
 'tcoxtsptlpahf' 'thing' 'via' 'campaign' 'huffpostpol' 'losing' 'say']
['back' 'attack' 'campaign' 'dial' 'medium' 'seriously' 'wolf' 'blitzer'
 'pleads' 'court' 'date' 'given' 'rape' 'lawsuit' 'lawyer']
['cnn' 'force' 'formidable' 'nevada' 'poll' 'vote' 'toward' 'hidden'
 'becomes' 'big' 'florida' 'really' 'one' 'move' 'fox']
['via' 'huffpostpol' 'gop' 'hotel' 'like' 'say' 'job' 'u' 'keep' 'medium'
 'got' 'voter' 'blitzer' 'tell' 'country']
2016-11-01
Begin text preprocessing
Constructing spark vector
Begin LDA
['voter' 'poll' 'usa' 'north' 'carolina' 'via' 'highly' 'survey'
 'mitchellvii' 'bradtmusic' 'respected' 'supporter' 'want' 'arrested'
 'place']
['wall' 'street' 'year' 'old' 'confirms' 'journal' 'make' 'diversity'
 'line' 'u' 'shouting' 'rooftop' 'miserable' 'right'

In [13]:
#false tweets mentioning trump
for i in files:
    file_full_address = file_location_2+i
    print (file_full_address[-15:-5])
    data_from_file = spark.read.json(file_full_address)
    data_from_file = data_from_file.filter('fact_check==-1')
    topic_modelling(data_from_file)

2016-10-12
Begin text preprocessing
Constructing spark vector
Begin LDA
['watch' 'endorsed' 'obliterate' 'jill' 'stein' 'breaking' 'via'
 'anitadwhite' 'debalwaystrump' 'twittercomuspoliticstoday'
 'gartrelllinda' 'time' 'jillstein' 'marcogutierrez' 'u']
['video' 'record' 'pennsylvania' 'crowd' 'massive' 'conservative' 'break'
 'nation' 'debalwaystrump' 'tcosiemumiexh' 'tallahfortrump' 'return'
 'erictrump' 'tax' 'come']
['come' 'finally' 'truth' 'erictrump' 'crookedhillary' 'party' 'antonio'
 'great' 'san' 'welcoming' 'trumpvideos' 'jill' 'endorsed' 'stein' 'watch']
['stop' 'must' 'thank' 'second' 'drjillstein' 'pick' 'speaks' 'protest'
 'paid' 'protester' 'rally' 'nubianawakening' 'kellyannepolls'
 'deplorable' 'neverhillary']
['come' 'erictrump' 'truth' 'finally' 'crookedhillary' 'dbongino'
 'jimgeraghty' 'charging' 'well' 'double' 'clinton' 'mr' 'covering'
 'fargo' 'donor']
['got' 'surprise' 'panama' 'city' 'florida' 'lifetime' 'stepped' 'stage'
 'ryan' 'god' 'paul' 'oh' 'bombshell

['obama' 'amp' 'medium' 'lagartijanix' 'war' 'freejulian' 'amendment'
 'alternative' 'declares' 'rigged' 'election' 'come' 'forward' 'steal'
 'admits']
['rally' 'medium' 'violence' 'smoking' 'directly' 'responsible' 'gun'
 'quickly' 'cincinnati' 'frightened' 'leave' 'protest' 'paid' 'speaks'
 'protester']
2016-10-22
Begin text preprocessing
Constructing spark vector
Begin LDA
['early' 'election' 'result' 'faagifts' 'amp' 'voting' 'yournewswire'
 'via' 'show' 'landslide' 'wikileaks' 'winning' 'vote' 'see' 'flare']
['crowd' 'rally' 'like' 'speaks' 'medium' 'paid' 'tell' 'point' 'think'
 'deplorable' 'news' 'protester' 'dishonest' 'protest' 'via']
['democrat' 'scrambling' 'everywhere' 'florida' 'obama' 'watch' 'michelle'
 'penny' 'monkeying' 'experience' 'around' 'admitted' 'leaked' 'amp'
 'email']
['look' 'medium' 'washington' 'right' 'turned' 'post' 'ballot' 'sign'
 'liberal' 'found' 'ford' 'harrison' 'fake' 'thousand' 'melting']
['exposed' 'debate' 'watch' 'people' 'see' 'finish' 'vide

['watch' 'viral' 'going' 'video' 'new' 'epic' 'american' 'every' 'fan'
 'make' 'president' 'tell' 'massively' 'vote' 'release']
['sharing' 'heart' 'country' 'vote' 'kazmierskir' 'show' 'berniesanders'
 'randpaul' 'bring' 'sherylcrow' 'susansarandon' 'winning' 'texas' 'tell'
 'landslide']
2016-11-01
Begin text preprocessing
Constructing spark vector
Begin LDA
['best' 'ever' 'pulled' 'yesterday' 'prank' 'florida' 'every' 'crooked'
 'video' 'poll' 'plummet' 'daybreak' 'soar' 'thing' 'election']
['fbi' 'maga' 'tcot' 'last' 'said' 'night' 'wakeupamerica' 'zuckerberg'
 'facebook' 'mark' 'pjnet' 'shock' 'breaking' 'nkirukanistoran' 'russia']
['pastor' 'family' 'give' 'foxnews' 'iran' 'held' 'dick' 'haiti' 'tried'
 'tcozepzmuunpu' 'gop' 'stealing' 'intentionally' 'proudly' 'shock']
['rally' 'john' 'record' 'break' 'attendance' 'river' 'city' 'elton'
 'arena' 'girl' 'destroys' 'watch' 'old' 'day' 'year']
['tv' 'live' 'trumphating' 'vote' 'gowdy' 'trey' 'reason' 'considered'
 'nevertrumpers' 're

On 12th Oct, 


The true news mentioning Trump were about 

1. Ann Coulter To HBO’s Bill Maher: We Always Knew Donald Trump Was “A Lout” 
https://deadline.com/2016/10/ann-coulter-real-time-with-bill-maher-donald-trump-1201836954/

2. Donald Trump and the Central Park Five: the racially charged rise of a demagogue
https://www.theguardian.com/us-news/2016/feb/17/central-park-five-donald-trump-jogger-rape-case-new-york

3. Clinton bashes Trump over Russia praise, but emails show she praised Putin
https://www.foxnews.com/politics/clinton-bashes-trump-over-russia-praise-but-emails-show-she-praised-putin


The fake news mentioning Trump were about

1. A massive crowd gathering in Pennsylvania, asking for a tax return and somehow related to Trump

2. Shitty liberals, stone, house

3. A variety of swear words used against Donald Trump.


On 17th Oct,

The true news mentioning Trump were about 

1. Lewd Donald Trump Tape Is a Breaking Point for Many in the G.O.P.
https://www.nytimes.com/2016/10/09/us/politics/donald-trump-campaign.html

2. Amy Schumer sees about 200 people walk out of show after attack on Donald Trump
https://www.telegraph.co.uk/news/2016/10/18/amy-schumer-sees-about-200-people-walk-out-of-show-after-attack/

3. Flynn under fire for fake news
https://www.politico.com/story/2016/12/michael-flynn-conspiracy-pizzeria-trump-232227

The fake news mentioning Trump were about

1. Democrats admitting to attacking a woman in the state of Ohio

2. A Putin cyber attack gone viral, and related to Christians

3. The announcement of the Attorney General pick generating a nightmare



On 22nd Oct,

The true news mentioning Trump were about 

1. What the latest poll of Georgia voters says about the election
https://www.ajc.com/news/state--regional-govt--politics/podcast-what-the-latest-poll-georgia-voters-says-about-the-election/HZt0mX9e61jtslqZMIuRgK/

2. Log Cabin Republicans Plan to ‘Push’ Trump on LGBT Rights
https://www.thedailybeast.com/log-cabin-republicans-plan-to-push-trump-on-lgbt-rights

3. Trump closing argument is battle cry against Clinton corruption
https://www.washingtontimes.com/news/2016/oct/22/donald-trumps-closing-argument-battle-cry-against-/

The fake news mentioning Trump were about

1. Early voting results showing a landslide victory for Hillary

2. Democrats scrambling everywhere in Florida; Obama had better watch his wife Michelle

3. Obama, Joe Biden, threatening, prisonplanet

On 27th Oct,

The true news mentioning Trump were about 

1. Putin ally tells Americans: vote Trump or face nuclear war
https://www.reuters.com/article/us-usa-election-russian-trump/putin-ally-tells-americans-vote-trump-or-face-nuclear-war-idUSKCN12C28Q

2. Inside the Trump Bunker, With Days to Go
https://www.bloomberg.com/news/articles/2016-10-27/inside-the-trump-bunker-with-12-days-to-go

3. Clinton and Obama lead calls for unity as US braces for Trump presidency
http://d.digests.nhub.news/2016/11/10/00/usa_mix_en.m.html

The fake news mentioning Trump were about

1. A video of a Hollywood star leading the vote in Texas and destroying the state

2. The Obama campaign illegally transferring money to build momentum

3. A stunning poll result that slapped Hillary, presented as hilarious


### Conclusion

1. We can clearly see that fake news topics can be divided into two categories: 
 * Those that are related to true news happening at the time.  
 * Videos created to express violence or hatred, or to vilify people.
 
Examples of the first category include the stories about the WikiLeaks founder (claiming he is dead or has killed someone) and those about problems in election voting (the breakdown of a voting machine led to the claim that voting is rigged).

Examples of the second category include the video about President Obama and George Soros killing dogs, and the claim that actress Susan Sarandon was involved in drugs, murder and sex.


2. The top features of fake news are more novel and generate greater emotion than those of true news. Fake news constantly uses horrifying-sounding words like killing, raping and smuggling. Fake news also uses a lot of swear words, like fuck, shit, etc. By contrast, true news seems to be pretty much to the point, with few words that generate emotion. For example, ['cernovich' 'hillarygropedme' 'attention' 'year' 'state' 'special' 'gave'
 'old' 'bill' 'quake' 'dept' 'locker' 'room' 'offended' 'silly'] and
['hypocrisy' 'call' 'penny' 'sander' 'twitter' 'tcovnyselswfh'
 'caymanstaxdodge' 'supporter' 'overheard' 'standard' 'bigoted' 'campaign'] do not contain strong/novel words or swear words.
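
The comparison above is qualitative. As a rough sketch of how it could be checked by counting, the snippet below counts how many words of each topic fall into a small emotive lexicon. The lexicon is a hypothetical, hand-picked list and the helper name `emotive_counts` is my own, so the counts only illustrate the method and are not a finding. The topic lists are copied from the 2016-10-12 Hillary outputs above.

```python
# Hypothetical, hand-picked lexicon; a real check would need a proper emotion lexicon
emotive_lexicon = {'breaking', 'bombshell', 'obliterate', 'chaos', 'rape', 'raping'}

def emotive_counts(topics, lexicon):
    """For each topic (a list of words), count the words that appear in the lexicon."""
    return [sum(1 for w in topic if w in lexicon) for topic in topics]

# First two fake-news topics for Hillary on 2016-10-12, copied from the output above
fake_topics = [
    ['tcohebppubgax', 'fade', 'maga', 'spread', 'three', 'cost', 'minute', 'video',
     'breaking', 'retard', 'dn', 'sure', 'fcking', 'watch', 'obliterate'],
    ['assange', 'julian', 'varneyco', 'keep', 'retweeting', 'tgowdysc',
     'tcoyvriuhpoun', 'email', 'proof', 'soon', 'claim', 'bombshell', 'trying',
     'released', 'funding'],
]
# The two true-news topics quoted in the conclusion, copied from the output above
true_topics = [
    ['cernovich', 'hillarygropedme', 'attention', 'year', 'state', 'special', 'gave',
     'old', 'bill', 'quake', 'dept', 'locker', 'room', 'offended', 'silly'],
    ['hypocrisy', 'call', 'penny', 'sander', 'twitter', 'tcovnyselswfh',
     'caymanstaxdodge', 'supporter', 'overheard', 'standard', 'bigoted', 'campaign',
     'remark', 'barackobama', 'crazy'],
]

print('fake:', emotive_counts(fake_topics, emotive_lexicon))
print('true:', emotive_counts(true_topics, emotive_lexicon))
```

Because the lexicon was chosen after reading the topics, any difference it shows is circular; an independent lexicon and all topics from every date would be needed for a fair comparison.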