## Trump Twitter

To analyze Trump's tweets, I will use a CSV dataset containing his tweets from 2009 to 2020.

I will be performing the following tasks:
* Data Processing
    * Yearly Corpus
    * Document Term Matrix
* Exploratory Data Analysis
* Sentiment Analysis
    * TextBlob
* Topic Modelling
    * Latent Dirichlet Allocation
* Text Generation
    * Markov Chain
    * Recurrent Neural Network

<b>Note:</b> The dataset covers Trump's tweets only up to June 2020.

## Data Processing

##### Importing Libraries

In [38]:
import pandas as pd
import matplotlib.pyplot as plt
import time
import re
import string
import nltk
import pickle

##### Loading Data

In [39]:
data = pd.read_csv('data/realdonaldtrump.csv')
data.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,


##### Converting date from string to timestamp

In [40]:
data = data.astype({'date':'datetime64[ns]'})
type(data['date'][0])

pandas._libs.tslibs.timestamps.Timestamp

##### Excluding irrelevant data for this notebook 

In [41]:
# Excluding id, link, mentions, hashtags

data = data[['content', 'date', 'retweets', 'favorites']]
data.head()

Unnamed: 0,content,date,retweets,favorites
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945


### Data Cleaning

##### Round 1

Raw tweet text has to be cleaned before it can be analyzed.

The first round of cleaning does the following:
* Converts the text to lowercase
* Removes parentheses
* Removes mentions
* Removes links
* Removes the possessive 's
* Replaces -- with a space
* Removes punctuation
* Removes numbers and words containing digits

In [42]:
def clean_text_round1(text):

    # Convert the tweet to lowercase
    text = text.lower()

    # Remove parentheses
    text = re.sub(r'[()]', '', text)

    # Remove mentions written as "@ name" (an @ followed by whitespace and a handle)
    text = re.sub(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@(\s[A-Za-z]+[A-Za-z0-9_]+)', '', text)

    # Remove links
    text = re.sub(r'http\S+', '', text)

    # Remove the possessive 's
    text = re.sub(r"'s", '', text)

    # Replace -- with a single space
    text = re.sub('--', ' ', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove numbers and words containing digits
    text = re.sub(r'\w*\d\w*', '', text)

    return text

round1 = clean_text_round1
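
To see what a few of these substitutions do to a single string, here is a minimal sketch that applies only the lowercase, link, punctuation and digit steps. The helper name `strip_links_punct_digits` and the sample tweet are made up for illustration; they are not part of the dataset or of the function above.

```python
import re
import string

def strip_links_punct_digits(text):
    # Hypothetical helper: a subset of the round 1 steps, for illustration only
    text = text.lower()
    text = re.sub(r'http\S+', '', text)                               # links
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

# Made-up sample tweet
sample = "Tune in at 9pm (EST)! https://example.com/abc"
cleaned = strip_links_punct_digits(sample)

# Removed spans leave extra whitespace behind; split() ignores it
tokens = cleaned.split()
print(tokens)
```

The substitutions delete characters but not the surrounding spaces, so the cleaned string can contain runs of whitespace. Tokenizing afterwards makes that harmless.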

In [43]:
# Work on a copy so the original DataFrame is left untouched
data_clean = data.copy()
data_clean['content'] = data_clean['content'].apply(round1)
data_clean.head()

Unnamed: 0,content,date,retweets,favorites
0,be sure to tune in and watch donald trump on l...,2009-05-04 13:54:25,510,917
1,donald trump will be appearing on the view tom...,2009-05-04 20:00:10,34,267
2,donald trump reads top ten financial tips on l...,2009-05-08 08:38:08,13,19
3,new blog post celebrity apprentice finale and ...,2009-05-08 15:40:15,11,26
4,my persona will never be that of a wallflower ...,2009-05-12 09:07:28,1375,1945


##### Round 2

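The second round removes every token that is not in NLTK's English word list. As the output below shows, this also drops names and some inflected forms, such as "donald" and "appearing".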

In [44]:
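# Requires the NLTK word list: run nltk.download('words') once beforehand
words = set(nltk.corpus.words.words())

def clean_text_round2(text):
    # Keep tokens found in the word list; non-alphabetic tokens pass through unchanged
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
    return text

round2 = clean_text_round2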

data_clean['content'] = data_clean['content'].apply(round2)
data_clean.head()

Unnamed: 0,content,date,retweets,favorites
0,be sure to tune in and watch trump on late nig...,2009-05-04 13:54:25,510,917
1,trump will be on the view tomorrow morning to ...,2009-05-04 20:00:10,34,267
2,trump top ten financial on late show with very...,2009-05-08 08:38:08,13,19
3,new post celebrity apprentice finale and learn...,2009-05-08 15:40:15,11,26
4,my persona will never be that of a wallflower ...,2009-05-12 09:07:28,1375,1945
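
The filtering rule can be tried without downloading the NLTK corpus. This is a sketch under two assumptions: a tiny made-up vocabulary (`TOY_VOCAB`) stands in for `nltk.corpus.words`, and a regular expression stands in for `nltk.wordpunct_tokenize`. The helper name `keep_known_words` is hypothetical.

```python
import re

# Made-up vocabulary standing in for the NLTK word list
TOY_VOCAB = {"trump", "will", "be", "on", "the", "view"}

def keep_known_words(text, vocab):
    # Split into runs of word characters and runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Keep a token if it is in the vocabulary or is not purely alphabetic
    return " ".join(w for w in tokens if w.lower() in vocab or not w.isalpha())

filtered = keep_known_words("donald trump will be appearing on the view", TOY_VOCAB)
print(filtered)
```

Any word missing from the vocabulary is silently dropped, so the result depends entirely on how complete the word list is.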


##### Pickle Clean Data

In [45]:
data_clean.to_pickle('data/data_clean.pkl')

### Building Document Term Matrix

* A Document Term Matrix (DTM) describes how often each term occurs in each document of a collection.
* In a DTM, rows represent documents and columns represent terms.
* Here, all the tweets of one year are treated as a single document.
* Building a DTM requires a corpus, which is created in the next cell.
* The corpus is passed to a CountVectorizer, which produces the DTM.
* CountVectorizer can also remove stop words.
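
To make the row and column layout concrete, here is a small pure-Python sketch of a document term matrix. It is not how scikit-learn implements CountVectorizer. The two documents and the stop-word set are made up for illustration.

```python
from collections import Counter

# Made-up documents, one per year, and a made-up stop-word set
docs = {
    2009: "watch the late show tonight",
    2010: "the apprentice show was great great",
}
STOP_WORDS = {"the", "was"}

# Count the terms of each document, skipping stop words
counts = {
    year: Counter(w for w in text.split() if w not in STOP_WORDS)
    for year, text in docs.items()
}

# The columns are the sorted union of all remaining terms
vocab = sorted(set().union(*counts.values()))

# One row per document, one count per term (0 when the term is absent)
dtm = {year: [c[term] for term in vocab] for year, c in counts.items()}

print(vocab)
for year, row in dtm.items():
    print(year, row)
```

Each row has one entry per vocabulary term, so most entries are zero once the vocabulary grows. That matches the mostly-zero table produced for the real corpus.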

In [46]:
columns = ['year', 'transcript']
corpus_yearly = pd.DataFrame(columns=columns)

group = data_clean.groupby(data_clean.date.dt.year)

year_list = []
transcript_list = []

for year, group_data in group:
    text = ' '.join(group_data['content'])
    year_list.append(year)
    transcript_list.append(text)

corpus_yearly['year'] = year_list
corpus_yearly['transcript'] = transcript_list

corpus_yearly.head()

Unnamed: 0,year,transcript
0,2009,be sure to tune in and watch trump on late nig...
1,2010,celebrity apprentice to outstanding list of se...
2,2011,watch me on late night with jimmy tomorrow nig...
3,2012,my interview the make great again filing and t...
4,2013,and the are laughing at the deal they just got...


In [47]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')
data_cv = cv.fit_transform(corpus_yearly.transcript)
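# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
dtm_yearly = pd.DataFrame(data_cv.toarray(), columns = cv.get_feature_names_out())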
dtm_yearly.index = corpus_yearly.year
dtm_yearly

year,abandon,abandoned,abbas,abhor,abide,abiding,ability,abject,able,abnormally,...,zac,zeal,zee,zero,zimbabwe,zip,zone,zoning,zoo,zoom
2009,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2010,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2011,0,2,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2012,1,3,0,0,0,0,3,1,13,0,...,0,0,0,14,0,0,0,0,0,0
2013,2,8,0,1,0,0,8,0,17,1,...,0,1,1,21,0,0,4,0,0,0
2014,0,0,0,0,1,0,8,0,16,0,...,0,0,0,15,0,1,4,0,0,0
2015,1,1,0,0,0,0,8,0,22,0,...,1,0,0,15,1,1,1,0,0,1
2016,2,1,0,0,0,0,3,0,12,0,...,0,0,0,24,0,0,0,0,0,0
2017,0,1,1,0,0,0,1,0,10,0,...,0,0,0,11,0,0,1,0,1,0
2018,3,5,0,0,0,0,2,0,19,0,...,0,0,0,17,0,0,1,0,0,0


##### Pickle Document Term Matrix

In [48]:
dtm_yearly.to_pickle('data/dtm_yearly.pkl')

##### Pickle Count Vectorizer

In [49]:
# Use a context manager so the file is closed after writing
with open('data/cv.pkl', 'wb') as f:
    pickle.dump(cv, f)
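
A pickled object is only useful if it can be loaded back later. The sketch below shows a full dump and load round trip. It assumes a plain dict in place of the fitted vectorizer and writes to a temporary directory rather than `data/`.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted vectorizer (made-up contents)
obj = {"vocabulary": ["abandon", "able", "zero"]}

# Temporary location, so nothing in the project directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "cv.pkl")

# Write in binary mode; the file is closed when the block ends
with open(path, "wb") as f:
    pickle.dump(obj, f)

# Read it back, also in binary mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == obj)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.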

### Building Yearly Corpus

The yearly corpus is formed by joining all the tweets of each year into a single string, giving one document per year.

In [50]:
group = data_clean.groupby(data_clean.date.dt.year)

corpus_dict = {}

for year, group_data in group:
    text = ' '.join(group_data['content'])
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())
    corpus_dict[year] = [text]

In [51]:
corpus_yearly = pd.DataFrame.from_dict(corpus_dict).transpose()
corpus_yearly.columns = ['transcript']
corpus_yearly = corpus_yearly.sort_index()
corpus_yearly

Unnamed: 0,transcript
2009,be sure to tune in and watch trump on late nig...
2010,celebrity apprentice to outstanding list of se...
2011,watch me on late night with jimmy tomorrow nig...
2012,my interview the make great again filing and t...
2013,and the are laughing at the deal they just got...
2014,today is the first day of the rest of your lif...
2015,for president the club was amazing tonight eve...
2016,happy new year from thank you to my great fami...
2017,well the new year we will together make great ...
2018,the united foolishly given more than billion i...
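
The grouping step can be seen in miniature without pandas. This is a minimal sketch on three made-up rows. The timestamp format matches the `date` column shown earlier, but the 2010 row is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# (timestamp, cleaned content) pairs; made-up sample rows
tweets = [
    ("2009-05-04 13:54:25", "be sure to tune in"),
    ("2009-05-08 08:38:08", "top ten financial"),
    ("2010-01-12 10:00:00", "celebrity apprentice"),
]

# Collect the tweets of each year, keyed by the year of the timestamp
by_year = defaultdict(list)
for stamp, content in tweets:
    year = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").year
    by_year[year].append(content)

# Join each year's tweets into one document, in chronological order of years
corpus = {year: " ".join(parts) for year, parts in sorted(by_year.items())}
print(corpus)
```

Joining with a single space keeps the last word of one tweet from fusing with the first word of the next.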


##### Pickle yearly corpus

In [52]:
corpus_yearly.to_pickle('data/corpus_yearly.pkl')