# Trump's Twitter History: Data Cleaning
Those reading this will already be familiar with the political career of Donald Trump, and will no doubt understand the level of influence his Twitter account, as well as his television interviews and meetings with the press, had in communicating his political message. His tweets were often headline news in publications around the globe, and it was not uncommon for them to capture the attention of the world. His Twitter account was suspended on January 8th, 2021, and has not been reinstated as of March 19th, 2021.  
His complete Twitter history has been archived by https://www.thetrumparchive.com/, which I have used as my source of data for the following notebooks.
  
Why take the time to perform an overview analysis of Trump's Twitter career? Well, the words of most influential political leaders of the past are nearly completely lost to history. How many speeches of Roman emperors do you know? How many of their speeches even remain?  
Thus this dataset is, in a way, a historical artifact. We, as data scientists and data analysts, have a chance to observe and study this political artifact before it is lost to the memory of humankind. In addition, analysing this dataset allows us to imagine how political leaders of the future will use their growing technological powers to rule the populace of the future.  
  
Having said that, this series of notebooks serves more as an investigation of modern natural language processing and data analysis techniques, both experimental and fundamental, to see what insights we can extract from a Twitter account alone.  

If the reader has any questions or comments, big or small, feel free to leave them on this thread and I will do my best to answer them in a timely manner. The notebook is meant to be readable on its own, but do let me know if anything is unclear.
  
The most interesting of questions, that of the so-called 'ratio', cannot be discussed here because the number of responses was not recorded in the original dataset, and may now be lost to time forever. If there are any enterprising readers out there who know how to get their hands on such data, you may hold the answers to the most interesting questions of all.

### Trump Twitter Series
I am splitting the work I am doing on the Trump Twitter dataset into three notebooks.  
1) Data Cleaning.  
2) Analysis of the Data.  
3) Evaluation of Off-the-Shelf NLP Machine Learning Techniques.  

## Table of Contents
Section 1: Data Retrieval and Package Download  
Section 2.1: Extracting Quote Tweets  
Section 2.2: Extracting Hashtags and Mentions  
Section 2.3: Extracting Datetime Information  
Section 2.4: Sentiment Analysis  
Section 2.5: Category Extraction  
Section 2.6: Category Coherence Analysis

In [None]:
# Import the Relevant Packages
import pandas as pd
import os
import re
import math
import statistics as sts
import datetime as dt
import time
import seaborn as sns
from matplotlib import pyplot as plt
import spacy
spacy_model = spacy.load('en_core_web_sm')

import random as rd
from textblob import TextBlob
from wordcloud import WordCloud

In [None]:
tweets = pd.read_csv('../input/donald-trump-tweets-dataset/tweets.csv')
tweets.head(5)

### Section 2.1 Extracting Quote Tweets.
### Types of Tweets
The first thing to mention when examining the tweets is that there appear to be three types of tweets.  
1) Regular Trump Tweets.  
2) Tweets that have been retweeted by Trump. These tweets are distinguished in two ways: they all begin with RT in the text, and isRetweet is valued at 't'.  
3) Quote tweets; similar to a retweet, but not flagged by isRetweet. These tweets can be detected by the """@user that begins the text string. To deal with this we will extract the username, the quoted message, and the additional message added by Trump.

In [None]:
# 1) An example of a standard tweet.
tweets.loc[54],tweets.loc[(54,'text')]

In [None]:
# 2) An example of a retweet.
tweets.loc[942],tweets.loc[(942,'text')]

In [None]:
# 3) An example of a quote tweet.
tweets.loc[23186], tweets.loc[(23186,'text')]

Our next goal is to separate out the information encoded in quote tweets.  
This section is a bit of the nitty-gritty found in real-world data problems, because the formatting used is inconsistent throughout the dataset.  
  
Here, pattern1 identifies tweet['text'] of the form: """@user: User_Message"" Trump_message".  
pattern2 identifies tweets with the text of the form: """@user User_message".  
  
The code extracts and records each of the above sections, and also records an indicator of whether the tweet is of type pattern1 ('t1'), pattern2 ('t2'), or not a quote tweet ('f'). Fields that do not apply are recorded as 'n/a'.
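
To make the two formats concrete, here is a minimal sketch that classifies a single text string. The example texts are made up rather than taken from the dataset, `classify_quote` is a hypothetical helper, and the username groups use non-greedy `.+?` (an assumption of this sketch). With a greedy `.+`, the username capture would run on to the last colon or last usable space in the text.

```python
import re

# Non-greedy variants of the two quote-tweet patterns (an assumption of this sketch).
pattern1 = re.compile(r'"""(@.+?):(.+"")(.*)')
pattern2 = re.compile(r'"""(@.+?) (.+")')

def classify_quote(text):
    """Return (kind, quoted_from, quote_text, trump_response) for one tweet text."""
    m = pattern1.search(text)
    if m:
        return ('t1', m.group(1), m.group(2), m.group(3))
    m = pattern2.search(text)
    if m:
        return ('t2', m.group(1), m.group(2), 'n/a')
    return ('f', 'n/a', 'n/a', 'n/a')

# Made-up example texts, one per case.
t1_example = '"""@fan: Great speech last night!"" Thank you!'
t2_example = '"""@fan You are the best"'
plain_example = 'Heading to Ohio for a rally.'

for text in (t1_example, t2_example, plain_example):
    print(classify_quote(text))
```

Using `search` returns at most one match per tweet, which keeps exactly one row of output per input text.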

In [None]:
# Extract info from quote tweets.

isQuoteTweet = []
quotedFrom = []
quoteText = []
trumpResp = []

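# Non-greedy username groups stop at the first ':' (pattern1) or first usable space (pattern2),
# so quotedFrom holds only the handle rather than part of the quoted message.
pattern1 = re.compile(r'"""(@.+?):(.+"")(.*)')
pattern2 = re.compile(r'"""(@.+?) (.+")')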

for i in range(len(tweets)):
    matches1 = pattern1.finditer(tweets.loc[(i,'text')])
    initCheck = len(isQuoteTweet)
    for match1 in matches1:
        isQuoteTweet.append('t1')
        quotedFrom.append(match1.group(1))
        quoteText.append(match1.group(2))
        trumpResp.append(match1.group(3))
    if initCheck == len(isQuoteTweet):
        matches2 = pattern2.finditer(tweets.loc[(i,'text')])
        for match2 in matches2:
            isQuoteTweet.append('t2')
            quotedFrom.append(match2.group(1))
            quoteText.append(match2.group(2))
            trumpResp.append('n/a')
    if initCheck == len(isQuoteTweet):
        isQuoteTweet.append('f')
        quotedFrom.append('n/a')
        quoteText.append('n/a')
        trumpResp.append('n/a')

tweets['isQuoteTweet'] = isQuoteTweet
tweets['quotedFrom'] = quotedFrom
tweets['quoteText'] = quoteText
tweets['trumpResp'] = trumpResp

Further analysis of the quote tweets can be found in the next notebook, Trump Tweets Analysis.

### Section 2.2: Hashtags and Mentions  
A quick and easy way to extract all hashtags and mentions from a tweet is Python's regular expressions package (re).  
Code copied from: https://stackoverflow.com/questions/45874879/extract-hashtags-from-columns-of-a-pandas-dataframe
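
As a small illustration of what these two patterns capture, the sketch below runs them through `re.findall` on a made-up sentence (not a tweet from the dataset). The lookbehind requires whitespace or the start of the string before the `@` or `#`, and the match then runs to the next whitespace, so trailing punctuation stays attached to the result.

```python
import re

# The same patterns used for the 'mentions' and 'hashtags' columns.
MENTION = r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)'
HASHTAG = r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)'

sample = 'Thank you @FoxNews! #MAGA #Trump2016'  # made-up text

mentions = re.findall(MENTION, sample)
hashtags = re.findall(HASHTAG, sample)
print(mentions)  # the '!' stays attached to the mention
print(hashtags)

# An '@' inside a word (such as an email address) is not preceded by
# whitespace, so it is not picked up as a mention.
email_like = re.findall(MENTION, 'mail me at a@b.com')
print(email_like)
```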

In [None]:
# Extract Hashtags and mentions from tweets
tweets['mentions'] = tweets['text'].str.findall(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')
tweets['hashtags'] = tweets['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

### Section 2.3: Extracting Datetime information
Next, we want to take the existing date strings and convert them into a more Python-friendly format, by way of the datetime package.
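
For a single value, the conversion looks like the sketch below. The timestamp is made up, and the format string assumes the `YYYY-MM-DD HH:MM:SS` layout used in the date column.

```python
import datetime as dt

raw = '2020-11-03 14:05:09'  # made-up timestamp in the dataset's layout
stamp = dt.datetime.strptime(raw, '%Y-%m-%d %H:%M:%S')

# The individual pieces that become separate dataframe columns.
parts = {
    'year': stamp.year,
    'month': stamp.month,
    'day': stamp.day,
    'hour': stamp.hour,
    'date': stamp.date(),
}
print(parts)
```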

In [None]:
# A simple application of the datetime module.
year = []
month = []
day = []
hour = []
date = []
for i in range(len(tweets)):
    datestamp = dt.datetime.strptime(tweets.loc[(i,'date')], '%Y-%m-%d %H:%M:%S')
    year.append(datestamp.date().year)
    month.append(datestamp.date().month)
    day.append(datestamp.date().day)
    hour.append(datestamp.time().hour)
    date.append(datestamp.date())

In [None]:
# Append the date information to our dataframe
tweets['year'] = year
tweets['month'] = month
tweets['day'] = day
tweets['hour'] = hour
tweets['date'] = date

### Section 2.4: Sentiment Analysis:  
This sentiment analysis technique is based on work done by Kaggle user ahmedterry; his notebook can be found here:  
https://www.kaggle.com/ahmedterry/trump-tweets-eda-nlp-sentiments-analysis  
  
There are a few steps involved in this sentiment analysis:  
1) Clean the text and lemmatize.  
2) Extract 'Subjectivity' and 'Polarity' scores.  
3) Simplify the polarity into 'Positive', 'Neutral', and 'Negative' buckets.  
4) Finally, we separate our cleaned sentences into tokens that we can use for searching, as will be seen in the create category section.
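
Steps 1 and 3 can be sketched without any NLP library. In the sketch below, `clean_text` and `polarity_bucket` are hypothetical helper names, the sample text is made up, and the polarity scores are hand-picked stand-ins rather than real TextBlob output.

```python
import re

def clean_text(text):
    """Drop URLs, replace non-word characters with spaces, and lowercase."""
    text = re.sub(r'https?://\S+', '', str(text))
    text = re.sub(r'\W', ' ', text)
    return text.lower()

def polarity_bucket(score):
    """Map a polarity score to a sentiment label, splitting at zero."""
    if score < 0:
        return 'negative'
    if score == 0:
        return 'neutral'
    return 'positive'

sample = 'RT @fan: GREAT news today! https://t.co/abc123'  # made-up text
tokens = clean_text(sample).split()
print(tokens)

# Hand-picked scores, only to show the three buckets.
buckets = [polarity_bucket(s) for s in (-0.4, 0.0, 0.65)]
print(buckets)
```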

In [None]:
# Cleaning features from the tweets
processed_features = []
for sentence in tweets['text']:
    # Remove all http/https URLs
    processed_feature = re.sub(r'(https?://\S+)', '', str(sentence))
    
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', processed_feature)
 
    #Converting to Lowercase
    processed_feature = processed_feature.lower()
    
    processed_features.append(processed_feature)

In [None]:
# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return  TextBlob(text).sentiment.polarity


# Create two new columns 'Subjectivity' & 'Polarity'
tweets['subjectivity'] = pd.Series(processed_features).apply(getSubjectivity)
tweets['polarity'] = pd.Series(processed_features).apply(getPolarity)

In [None]:
def getAnalysis(score):
    if score < 0:
        return 'negative'
    elif score == 0:
        return 'neutral'
    else:
        return 'positive'
tweets['analysis'] = tweets['polarity'].apply(getAnalysis)

In [None]:
# An example of our Sentiment Analysis.
tweets[['text','subjectivity','polarity','analysis']].loc[944]

There is an interesting note here about the stop words used in the Trump tweet dataset. There are many occurrences of single letters like 's', 't', 'u', etc.  
I am not really sure what the reason for this is, but we can address the commonly occurring ones up front here.
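
One way to check where these fragments could come from is to run the special-character substitution from the cleaning step on a few hand-made strings. The sketch below (with a hypothetical helper `strip_special`) suggests that replacing every non-word character with a space is a likely source: contractions, abbreviations, and HTML entities each break into short leftover pieces.

```python
import re

def strip_special(text):
    """Same substitution as the cleaning step: non-word characters become spaces."""
    return re.sub(r'\W', ' ', text).lower()

# Hand-made inputs, not taken from the dataset.
examples = ["Don't", "It's", "U.S.", "&amp;"]
fragments = {e: strip_special(e).split() for e in examples}

for original, pieces in fragments.items():
    print(original, '->', pieces)
```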

In [None]:
# Removing stopwords
stop_words = ['rt','s','t','amp','u','m','w','p','c',' ','  ','   ']
for stopword in stop_words:
    lexeme = spacy_model.vocab[stopword]
    lexeme.is_stop = True

In [None]:
# Lemmatize
# Note that this block can take ~10 min to run.
start = time.time()
cleanTweet = []
for tweet in processed_features:
    tweet = spacy_model(tweet)
    tokenTweet = []
    for token in tweet:
        if not token.is_punct and not token.is_stop and not token.like_num and token.lemma_ != '-PRON-':
                tokenTweet.append(token.lemma_)
    cleanTweet.append(tokenTweet)
print('Time to Run:',(time.time() - start)/60, 'minutes')

An example of our cleaned text; notice it's a list of lemmatized terms. We will use these terms later as the basis of our tweet search.

In [None]:
cleanTweet[5]

In [None]:
# Create our Clean tweet  column.
tweets['cleanText']=cleanTweet

### Section 2.5: Topic Extraction
The primary technique we will use to extract the topics from tweets is as follows.  
1) Sort through the top 500 most used lemmatized terms in the dataset.  
2) Sort each of these terms into one category, or none, based on human judgement.  
3) Examine a statistically significant number of tweets from each custom-made category (and subcategory), to see if the categories make sense.  
  
The issue with this approach: keywords can be used in different contexts. For example, 'president' is used both with 'Trump' and with 'Obama' in this dataset. To address this issue, I will look at LSA models of each section in my Analysis Notebook.
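
The counting behind step 1 can be sketched with the standard library. The token lists below are made up for illustration, and `collections.Counter` is a stand-in for the pandas `value_counts` call used in this notebook.

```python
from collections import Counter

# Made-up lemmatized tweets, standing in for the 'cleanText' column.
clean_tweets = [
    ['great', 'rally', 'ohio'],
    ['fake', 'news', 'cnn'],
    ['great', 'news', 'job'],
]

# Count every token across all tweets.
counts = Counter(token for tweet in clean_tweets for token in tweet)

# The k most frequent terms, and the share of all tokens they account for.
k = 2
top_terms = counts.most_common(k)
coverage = sum(count for _, count in top_terms) / sum(counts.values())
print(top_terms, round(coverage, 3))
```

The coverage figure is the same kind of quantity as the share of the dataset covered by the top 500 terms.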

In [None]:
# Creates a list of all tokens (words) used, across all tweets.
wordList = []
for tweet in tweets['cleanText']:
    for word in tweet:
        wordList.append(word)

wordSeries = pd.Series(wordList)

In [None]:
# Number of Tokens
len(wordSeries.value_counts())

To create the topics, I have looked through the top 500 most frequent terms in Trump's vocabulary and attempted to sort them into categories by usage. I will analyze the effectiveness of this approach later in this notebook, and in the next.

In [None]:
# Percentage of the dataset the top 500 terms cover.
wordSeries.value_counts()[1:500].sum()/wordSeries.value_counts()[1:].sum()

In [None]:
wordSeries.value_counts()[1:20]

After sorting through the top 500 terms, I came to define the categories as follows. If you are interested in seeing more detail about the definition of any category, see Section 2.6: Topic Coherence for a more detailed look.  
1) 'selfReference': Any term that may refer to Donald Trump, or his campaign, in the third person. (Note that 'I' was removed as a stopword.)  
2) 'usa': Terms that refer to the country 'United States of America'.  
3) 'government': Terms that refer to the government of the United States or its major governmental bodies.  
4) 'democrates': Terms that refer to either the Democratic Party, or one of its prominent members.  
5) 'republicans': Terms that refer to either the Republican Party, or one of its prominent members.  
6) 'election': Terms that refer to elections, or electoral processes.  
7) 'positive': Terms that generally imply a positive meaning, or a positive adjective.  
8) 'negative': Terms that generally imply a negative meaning, or a negative adjective.  
9) 'news': Terms that talk about news, and news networks.  
10) 'law': Terms that refer to law and order, or the American Judicial System.  
11) 'border': Terms that reference the US border wall, or immigration.  
12) 'economic': Terms that reference the economy.  
13) 'states': Terms that reference any state in the United States.  
14) 'countries': Names of other countries used, not including the United States.  
15) 'bucket': Bucket is the category where I have stored terms that are interesting but that I could not categorize. An enterprising reader may wish to continue my work where I could not.

In [None]:
# Sorted Category of tokens. 
selfReference = ['realdonaltrump','president','trump','donald','trump2016','maga','makeamericagreatagain','potus','teamtrump','donaldtrump']
usa = ['contry','america','state','american','united','states','national','usa']
government = ['whitehouse','senate','senator','congress','office','govenor','washington','administration',
             'government','federal','congressman','impeachment']
democrates = ['obama','barackobama','democrats','joe','biden','hillary','clinton','dem','democrat',
             'obamacare','schiff','nancy','pelosi','comey','bernie']
republicans = ['republican','gop','mittromney','bush','cruz','gopchairwomen']
election = ['election','poll','campaign','party','debate','bill','voter','elect','vote','endorsement','primary','candidate','ballot']
positive = ['great','good','big','win','new','love','true','strong','support','amazing','hope',
           'happy','agree','friend','wow','totally','wonderful','beautiful','success','important','fantastic',
           'incredible','tremendous','smart','winner','champion','protect']
negative = ['bad','fight','fail','lie','wrong','crooked','kill','terrible','corrupt','sad','disaster','loser','hoax',
           'hate','lose','sleepy','phoney','rig','fake','attack','destroy','radical']
news = ['news','medium','rating','foxandfriend','foxnew','foxnews','fox','seanhannity','cnn','nbc']
law = ['law','justice','drug','fraud','police','crime','security','illegal','enforcement','criminal','military','fbi','investigation',
      'court','mueller','war','power']
border = ['border','build','wall','immigration']
economic = ['people','job','work','buisness','money','stock','tariff','cost','oil','tax','economy','economic','unemployment','dollar']
states = ['california','florida','carolina','texas','pennsylvania','iowa']
countries = ['china','mexico','iran','korea']

bucket = ['twitter','supporter','isis','thank','run','world','right','report','history','leader','honor',
          'live','family','forward','fact','believe','open','course','order','donaldjtrumpjr','help',
          'record','give','presedential','political',
         'stand','cut','golf','join','problem','women','complete','appretice','celebapprentice','apprencticebc','healthcare',
         'case','city','fire','spend','hit','speech','nation',
         'miss','tough','meeting','witch','hunt','white','crowd','release','massive','proud','rally','collusion',
         'leadership','general','rate','buy',
         'respect','question','price','vet','ammendment','hotel','ivankatrump','politician','truth','coronavirus',
         'energy','policy','market','small','large','focus','entrepreneur','god','worker','early','raise','safe',
         'million','tweet','celebrity','check','little','deliver','deal']

Now we will create a function that categorizes tweets based on the appearance of any of the terms in a given category.
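
The core test, whether a tweet's token list shares at least one term with a category, can be sketched on its own. `has_category_term` is a hypothetical helper name, the token lists are made up, and the term list is the 'border' category defined above.

```python
def has_category_term(tokens, category_terms):
    """Return 1 if any category term appears among the tweet's tokens, else 0."""
    return int(not set(category_terms).isdisjoint(tokens))

border_terms = ['border', 'build', 'wall', 'immigration']

# Made-up lemmatized tweets.
sample_tweets = [
    ['build', 'wall', 'mexico'],
    ['great', 'rally', 'ohio'],
]

flags = [has_category_term(tokens, border_terms) for tokens in sample_tweets]
print(flags)
```

Under the assumption that `cleanText` holds a list of tokens per row, the same helper could presumably be applied column-wise with something like `tweets['cleanText'].apply(has_category_term, category_terms=border)`.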

In [None]:
# Collects the text of tweets which contain 'term'.
# Keep in mind, term should be the lemmatized version of the word (e.g. rig vs. rigged) as per text cleaning.
def searchTerm(term):
    catcher = []
    for i in range(len(tweets)):
        if term in tweets.loc[(i,'cleanText')]:
             catcher.append(tweets.loc[(i,'text')])
    return pd.Series(catcher)

In [None]:
# Example of search term
sample = searchTerm('rig')
sample[3]

In [None]:
# Creates a new column with values 0 or 1, based on whether a given tweet contains one of the terms in the category list.
def createCategory(bucket, name):
    category = []
    for i in range(len(tweets)):
        tokens = tweets.loc[(i,'cleanText')]
        # 1 if any term from the category appears among the tweet's tokens, else 0.
        category.append(1 if any(word in tokens for word in bucket) else 0)
    tweets[name] = category

In [None]:
# This code block should only take a few minutes.
start = time.time()

createCategory(selfReference, 'selfReference')
createCategory(usa,'usa')
createCategory(government,'government')
createCategory(democrates,'democrates')
createCategory(republicans,'republicans')
createCategory(election,'election')
createCategory(positive,'positive')
createCategory(negative,'negative')
createCategory(news,'news')
createCategory(law,'law')
createCategory(border,'border')
createCategory(economic,'economic')
createCategory(states,'states')
createCategory(countries,'countries')

print('Total Time:',time.time() - start)

#### A Quick Discussion on Topic Coherence.
Now that we have defined the topics by commonly occurring keywords, we want to examine whether we have defined our topics appropriately.  
One way of doing this is to randomly sample our categories and manually check whether the tweets fit that category.  
  
Of course, manual labour like this can be time-intensive and hard work, so we hope that there is a way to intelligently plan out the work ahead. In fact, as statisticians will quickly realize, our topic coherence validation process creates a simple binomial distribution for each topic. This gives us a way to intelligently select a sample size for each topic before digging in.
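
As a rough sketch of that planning step, the functions below use the normal approximation to the binomial, which is an assumption of this sketch and not necessarily the method used to pick the sample sizes in this notebook. `proportion_std_error` and `sample_size_for_margin` are hypothetical helper names, and `z = 1.96` corresponds to a 95% confidence level.

```python
import math

def proportion_std_error(p, n):
    """Standard error of a sample proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

def sample_size_for_margin(p, margin, z=1.96):
    """Smallest n whose normal-approximation margin of error is at most `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# With 20 hand-checked tweets and an observed hit rate of 0.8:
se_20 = proportion_std_error(0.8, 20)
print(round(se_20, 4), round(1.96 * se_20, 3))

# Worst case (p = 0.5) for a margin of +/- 0.10:
n_worst = sample_size_for_margin(0.5, 0.10)
print(n_worst)
```

The approximation is crude for small samples and for hit rates close to 0 or 1, so the numbers are best read as a guide to how much labelling effort a given precision costs.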

#### Self-Reference Category Coherence
A tweet qualifies as 'selfReference' if the tweet in some way refers to Donald Trump.  
Positive Example (42162: I just got off the phone with the great people of Guam! Thank you for your support! #VoteTrump today! #Trump2016).  
Negative Example (40775: Our next Vice President of the United States of America, Gov. @Mike_Pence!#GOPinCLE #GOPConvention#AmericaFirst https://t.co/TZT3XcKp1c).  

Notice that in our negative example, the tweet was selected for using the keyword 'president' but, in context, it refers to Mike Pence as the next Vice President and not to Trump.

In [None]:
rd.seed(0)
x = tweets[(tweets['selfReference'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [0,1,1,1,1,1,1,0,1,1,1,1,1,0,0,1,1,1,1,1]
p_st = sum(hit)/len(hit)
p_st

In [None]:
rd.seed(0)
x = tweets[(tweets['selfReference'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_sr = sum(hit)/len(hit)
p_sr

In [None]:
rd.seed(0)
x = tweets[(tweets['selfReference'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_sq = sum(hit)/len(hit)
p_sq

In [None]:
rd.seed(0)
x = tweets[(tweets['usa'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,1,1,1]
p_ut = sum(hit)/len(hit)
p_ut

In [None]:
rd.seed(0)
x = tweets[(tweets['usa'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1]
p_ur = sum(hit)/len(hit)
p_ur

In [None]:
rd.seed(0)
x = tweets[(tweets['usa'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_uq = sum(hit)/len(hit)
p_uq

#### Government Category Coherence
Here what we are looking for is for the tweet to refer to the US government in some way.  

A Positive Example is: (44878: Democrats purposely misstated Medicaid under new Senate bill - actually goes up. https://t.co/necCt4K6UH)  

A Negative Example is: (17362: """Imagine how much stronger economic shape we would be in if we made the Iraqi government agree to a cost-sharing (cont) http://t.co/Zf2pEO80")

In [None]:
rd.seed(0)
x = tweets[(tweets['government'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1]
p_gt = sum(hit)/len(hit)
p_gt

In [None]:
rd.seed(0)
x = tweets[(tweets['government'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1]
p_gr = sum(hit)/len(hit)
p_gr

In [None]:
rd.seed(0)
x = tweets[(tweets['government'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_gq = sum(hit)/len(hit)
p_gq

#### Democratic Category Coherence
For the Democratic category, we are just looking for the tweet to mention the Democratic Party, or one of its members.  
Positive Example: (28382: If only the illegals were Tea Party members then Obama would get them out of the country immediately.)  

In [None]:
rd.seed(0)
x = tweets[(tweets['democrates'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_dt = sum(hit)/len(hit)
p_dt

In [None]:
rd.seed(0)
x = tweets[(tweets['democrates'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_dr = sum(hit)/len(hit)
p_dr

In [None]:
rd.seed(0)
x = tweets[(tweets['democrates'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_dq = sum(hit)/len(hit)
p_dq

#### Republican Category Coherence
For the Republican category, we are looking for tweets to mention the Republican party, or one of its members.  
Positive Example: (36508: The Republican Party will become “The Party of Healthcare!”)  

In [None]:
rd.seed(0)
x = tweets[(tweets['republicans'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_rt = sum(hit)/len(hit)
p_rt

In [None]:
rd.seed(0)
x = tweets[(tweets['republicans'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_rr = sum(hit)/len(hit)
p_rr

In [None]:
rd.seed(0)
x = tweets[(tweets['republicans'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_rq = sum(hit)/len(hit)
p_rq

#### Election Category Coherence
For a tweet to be included in this category, it must reference an election in some way.  
Positive Example: (25771: This ‘deal’ @RNC voted for has $41 in tax increases for every $1 in spending cuts.  It is pathetic.  Obama is laughing at them.)  
Negative Example: (20764: On Bill O'Reilly in 5 minutes!)

In [None]:
rd.seed(0)
x = tweets[(tweets['election'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,0,1,1,0,1,1,1,0,0,1,0,1,1,1,1,1,0]
p_et = sum(hit)/len(hit)
p_et

In [None]:
rd.seed(0)
x = tweets[(tweets['election'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1]
p_er = sum(hit)/len(hit)
p_er

In [None]:
rd.seed(0)
x = tweets[(tweets['election'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_eq = sum(hit)/len(hit)
p_eq

#### Positive Category Coherence.
A tweet is considered positive if it has some positive sentiment.  
Positive Example: (52371: As bad as @CNN is, Comcast MSNBC is worse. Their ratings are also way down because they have lost all credibility. I believe their stories about me are not 93% negative, but actually 100% negative. They are incapable of saying anything positive, despite all of the great things...)  
Negative Example: (16861: #CelebrityApprentice Who will win? http://t.co/1IjFi52y Find out tonight- live Season Finale at 9PM ET on NBC.)

In [None]:
rd.seed(0)
x = tweets[(tweets['positive'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,1,1,0]
p_pt = sum(hit)/len(hit)
p_pt

In [None]:
rd.seed(0)
x = tweets[(tweets['positive'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1,0,1,1]
p_pr = sum(hit)/len(hit)
p_pr

In [None]:
rd.seed(0)
x = tweets[(tweets['positive'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_pq = sum(hit)/len(hit)
p_pq

#### Negative Category Coherence:
A tweet should be included in this category if the overall sentiment of the tweet is negative.  
Positive Example: (31084: I wonder how much money dumb @BuzzFeed and even dumber Ben Smith loooose each year? They have zero credibility - totally irrelevant and sad!)  
Negative Example: (40751: I highly recommend the just out book - THE FIELD OF FIGHT - by General Michael Flynn. How to defeat radical Islam.)

In [None]:
rd.seed(0)
x = tweets[(tweets['negative'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,1,1,1,1]
p_nt = sum(hit)/len(hit)
p_nt

In [None]:
rd.seed(0)
x = tweets[(tweets['negative'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_nr = sum(hit)/len(hit)
p_nr

In [None]:
rd.seed(0)
x = tweets[(tweets['negative'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1]
p_nq = sum(hit)/len(hit)
p_nq

#### News Category Coherence:
A tweet is in this category if it makes mention of news.  
Positive Example: (31398: Watching Gates on @seanhannity - looks like he got hit by a truck! Why didn't Obama get him, and others,to sign a confidentiality agreement?)  
Negative Example: (13350: Be sure to watch The Apprentice tonight, 10 p.m. on NBC--it's an episode you won't forget!)

In [None]:
rd.seed(0)
x = tweets[(tweets['news'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1]
p_ft = sum(hit)/len(hit)
p_ft

In [None]:
rd.seed(0)
x = tweets[(tweets['news'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1]
p_fr = sum(hit)/len(hit)
p_fr

In [None]:
rd.seed(0)
x = tweets[(tweets['news'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_fq = sum(hit)/len(hit)
p_fq

#### Law and Order Category Coherence:
A tweet should be included in this category if the tweet has to do with law, justice, or the American Judicial system.  
Positive Example: (45674: I have made my decision on who I will nominate for The United States Supreme Court. It will be announced live on Tuesday at 8:00 P.M. (W.H.))  
Negative Example: (36761: Entrepreneurs: Resolve to be bigger than your problems. Who's the boss? Don't negate your own power.)  

In [None]:
rd.seed(0)
x = tweets[(tweets['law'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_lt = sum(hit)/len(hit)
p_lt

In [None]:
rd.seed(0)
x = tweets[(tweets['law'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_lr = sum(hit)/len(hit)
p_lr

In [None]:
rd.seed(0)
x = tweets[(tweets['law'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,0]
p_lq = sum(hit)/len(hit)
p_lq

#### Border Category Coherence:
A tweet is in this topic if it refers to the border wall at all.  

Positive Example: (6192: The Wall is funded &amp, being built! https://t.co/84BOxKr2Eo)  

Negative Example: (5601: Sleepy Joe Biden was in charge of the H1N1 Swine Flu epidemic which killed thousands of people. The response was one of the worst on record. Our response is one of the best, with fast action of border closings &amp; a 78% Approval Rating, the highest on record. His was lowest!)

In [None]:
rd.seed(0)
x = tweets[(tweets['border'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,0,1,1]
p_bt = sum(hit)/len(hit)
p_bt

In [None]:
rd.seed(0)
x = tweets[(tweets['border'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,0,1,0,1,0,1,1,1,0,1,1,1,0,1,1,1]
p_br = sum(hit)/len(hit)
p_br

In [None]:
rd.seed(0)
x = tweets[(tweets['border'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [0,1,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_bq = sum(hit)/len(hit)
p_bq

#### Economic Category Coherence
A tweet is in this category if the tweet refers to the economy in some way.  
Positive Example: (3183: He is out of real solutions--@BarackObama's job bill is nothing more than a tax increase.)  
Negative Example: (30171: China is closing a massive oil deal w/ Russia, taking advantage of the Ukraine conflict http://t.co/tItkQ0PmZH Smart, unlike our leaders.)

In [None]:
rd.seed(0)
x = tweets[(tweets['economic'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,0,0,1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1]
p_ct = sum(hit)/len(hit)
p_ct

In [None]:
rd.seed(0)
x = tweets[(tweets['economic'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,1,0,1,0,1,1,1,0,1,1,1,1,1,1,1,1]
p_cr = sum(hit)/len(hit)
p_cr

In [None]:
rd.seed(0)
x = tweets[(tweets['economic'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,0,1,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0]
p_cq = sum(hit)/len(hit)
p_cq

#### States Category Coherence
The topic includes any mention of a state in the United States.

In [None]:
rd.seed(0)
x = tweets[(tweets['states'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_stt = sum(hit)/len(hit)
p_stt

In [None]:
rd.seed(0)
x = tweets[(tweets['states'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_str = sum(hit)/len(hit)
p_str

In [None]:
rd.seed(0)
x = tweets[(tweets['states'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_stq = sum(hit)/len(hit)
p_stq
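
The `hit` lists are typed by hand after reading each sample of 20 tweets. A list with more or fewer than 20 entries still yields a proportion, so a miscount goes unnoticed. Below is a small sketch of a check that could replace the bare `sum(hit)/len(hit)`; the function name `coherence` and its `expected_n` parameter are my own choices for illustration, not part of the original cells.

```python
def coherence(hit, expected_n=20):
    """Proportion of sampled tweets judged on-topic, with basic sanity checks."""
    # The sample size is fixed by k in rd.sample, so the label count should match it
    if len(hit) != expected_n:
        raise ValueError(f'expected {expected_n} labels, got {len(hit)}')
    # Each label is a binary judgement: 1 = on-topic, 0 = off-topic
    if any(label not in (0, 1) for label in hit):
        raise ValueError('labels must be 0 or 1')
    return sum(hit) / len(hit)


# Twenty on-topic labels give a coherence of 1.0
print(coherence([1] * 20))
```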

#### Countries Category Coherence
Topic includes any mention of a country that is not the United States.

In [None]:
rd.seed(0)
x = tweets[(tweets['countries'] == 1) & (tweets['isRetweet'] == 'f') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_cnt = sum(hit)/len(hit)
p_cnt

In [None]:
rd.seed(0)
x = tweets[(tweets['countries'] == 1) & (tweets['isRetweet'] == 't') & (tweets['isQuoteTweet'] == 'f')]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_cnr = sum(hit)/len(hit)
p_cnr

In [None]:
rd.seed(0)
x = tweets[(tweets['countries'] == 1) & (tweets['isRetweet'] == 'f') & ((tweets['isQuoteTweet'] == 't1') ^ (tweets['isQuoteTweet'] == 't2'))]
    
for identity in rd.sample(x.index.tolist(), k = 20):
    print(identity,x.loc[(identity,'text')])

In [None]:
hit = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
p_cnq = sum(hit)/len(hit)
p_cnq

In [None]:
table = {
    'category': ['selfReference','selfReference','selfReference', 'usa', 'usa', 'usa', 
                 'government', 'government', 'government','democrates', 'democrates', 'democrates',
                 'republicans', 'republicans', 'republicans', 'election', 'election', 'election',
                'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 
                 'fakeNews', 'fakeNews', 'fakeNews', 'law', 'law', 'law', 
                 'border', 'border', 'border', 'economic', 'economic', 'economic',
                'states','states','states','countries','countries','countries'],
    'tweetType': ['trump','retweet','quote','trump','retweet','quote','trump','retweet','quote',
                 'trump','retweet','quote','trump','retweet','quote','trump','retweet','quote',
                 'trump','retweet','quote', 'trump','retweet','quote', 'trump','retweet','quote',
                 'trump','retweet','quote', 'trump','retweet','quote', 'trump','retweet','quote',
                 'trump','retweet','quote', 'trump','retweet','quote'],
    'p_hat':[p_st, p_sr, p_sq, p_ut, p_ur, p_uq, p_gt, p_gr, p_gq, p_dt, p_dr, p_dq, p_rt, p_rr, p_rq,
            p_et, p_er, p_eq, p_pt, p_pr, p_pq, p_nt, p_nr, p_nq, p_ft, p_fr, p_fq, p_lt, p_lr, p_lq,
            p_bt, p_br, p_bq, p_ct, p_cr, p_cq, p_stt, p_str, p_stq, p_cnt, p_cnr, p_cnq]
}
coherenceTable = pd.DataFrame(table)
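
The three lists in `table` have to stay aligned by hand across 42 entries, which is easy to get wrong. One alternative, sketched here under the assumption that the `hit` lists are kept instead of being overwritten in each cell, is to key each list by its category and tweet type and build one row per key. Only the three law lists from above are copied in; the `hits` and `rows` names are hypothetical.

```python
# hit lists copied from the three law cells above
hits = {
    ('law', 'trump'):   [1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    ('law', 'retweet'): [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    ('law', 'quote'):   [1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,0],
}

# One row per (category, tweet type) pair, so the labels cannot drift out of step
rows = [
    {'category': category, 'tweetType': tweet_type, 'p_hat': sum(hit) / len(hit)}
    for (category, tweet_type), hit in hits.items()
]

for row in rows:
    print(row)

# pd.DataFrame(rows) would then give the same three columns as coherenceTable
```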

In [None]:
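# Overall coherence: mean p_hat per category, averaged over the three tweet types
fig_dims = (15, 5)
fig, ax = plt.subplots(figsize=fig_dims)
# Select p_hat explicitly; averaging the string column tweetType fails in pandas 2.0 and later
totalCoherence = coherenceTable.groupby('category')['p_hat'].mean().reset_index()
sns.barplot(data = totalCoherence, x = 'category', y = 'p_hat', ax = ax)
totalCoherence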
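
Each `p_hat` rests on only 20 hand-labelled tweets, so the estimates carry a fair amount of uncertainty. As a rough guide, here is a sketch of the Wilson score interval for a binomial proportion. This is an addition for illustration and not part of the analysis above; the function name `wilson_interval` is my own, and `z = 1.96` assumes an approximate 95% level.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating point overshoot at the boundaries
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Intervals for 20, 19 and 16 hits out of 20 sampled tweets
for successes in (20, 19, 16):
    low, high = wilson_interval(successes, 20)
    print(successes / 20, round(low, 3), round(high, 3))
```

Even a perfect 20 out of 20 leaves a lower bound well below 1, so small differences between bars in the plots should not be over-interpreted.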

In [None]:
# Coherence by category and tweet type
fig_dims = (15, 5)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(data = coherenceTable, x = 'category', y = 'p_hat', hue = 'tweetType', ax = ax)

In [None]:
# Export

tweets.to_csv('trumpTweetsClean.csv')
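
A note for anyone loading the exported file: `to_csv` writes the DataFrame index as the first, unnamed column by default, so the index has to be restored on the way back in with `index_col=0`. The sketch below shows the round trip on a two-row toy table written to an in-memory buffer rather than to `trumpTweetsClean.csv`; the toy rows are placeholders, not real data.

```python
import io

import pandas as pd

# Toy table for demonstration; the index plays the role of the tweet identity
toy = pd.DataFrame({'text': ['first', 'second'], 'law': [1, 0]},
                   index=[45674, 36761])

buffer = io.StringIO()
toy.to_csv(buffer)          # the index is written as the first column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored.index.tolist())
print(restored.columns.tolist())
```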