<center><img src=https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2F31d3727c-d0ea-11e5-b413-ac87650795f1.jpg?crop=1500%2C844%2C0%2C78&resize=1180>Image Source: https://www.thetimes.co.uk</center>

# Introduction

Donald Trump is arguably the most controversial leader that the world has seen in a very long time. But regardless of what his haters might say, he was undeniably the largest source of Internet memes since the Harlem Shake. And for good reason, too! President Trump has been an avid Twitter user over the years, and his words have made the headlines more times than one can count.

Thanks to the creator of this dataset, we now have a chance to analyse some of his words and visualize for ourselves how his sentiments, ideals and opinions changed over time, along with the reach and impact they had.




| ![space-1.jpg](https://media2.giphy.com/media/26uf2JHNV0Tq3ugkE/source.gif) | 
|:--:| 
| **When I heard that Using Gifs in notebooks gets you more upvotes** |


If this is your first time here, please note that this is a work in progress. I want to develop my notebooks over time, adding something new and interesting each time, as and when I think of it. If you like my notebook, you can let me know by upvoting it! It not only helps keep me motivated but also serves as an indicator of what I did right and what could be improved.

Thanks for stopping by!

In [None]:
# used for identifying errors in tweets
!pip install pyspellchecker

In [None]:
#imports 
import re
from tqdm.notebook import tqdm
import pandas as pd 
import numpy as np
from datetime import datetime

from spellchecker import SpellChecker
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

import plotly.offline as pyo 
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import seaborn as sns 
import scattertext as st
from IPython.display import IFrame
from wordcloud import WordCloud, ImageColorGenerator
from nltk.corpus import stopwords
import random 

In [None]:
# initialize our tools
sns.set_style('darkgrid')

# for sentiment analysis 
sia = SIA() 

# to identify misspelled words
spell = SpellChecker() 

# to display plotly graphs 
pyo.init_notebook_mode() 

# load the dataset
df = pd.read_csv("/kaggle/input/trumps-legacy/Trumps Legcy.csv")


# Preprocessing

Here, we will first clean our data and make it easier to work with. We perform the following steps to clean our data: 
1. **Convert time Strings to python datetime objects** : This makes it easier to extract Year, month, date, hour etc. 
2. **Remove mentions** : The strings that begin with "@", since these do not contribute to the content of the tweet itself. However, we will store these separately. Maybe Trump tends to mention some accounts more than others? May be interesting to check.
3. **Remove Hashtags** : Hashtags are often changing over time so I will not be analysing them in this notebook. 
4. **Remove URLs** : Who cares about URLs anyway? 
5. **Remove Special Characters** : Keep only alphanumeric words and drop punctuation and symbols
6. **Remove Single characters** : We will only be looking at full words and not individual characters 
7. **Replace multiple Spaces** : Replace a string of spaces with a single space 

Please note that this part of the code has been borrowed from some popular notebooks on Twitter data and is not my own.
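
The steps above can be sketched on a single string as follows. This is only a minimal illustration, not the exact pipeline used in this notebook: `clean_tweet` is a hypothetical helper, the sample tweet is made up, and the function also returns the mentions so that they can be stored separately, as step 2 describes.

```python
import re

def clean_tweet(text: str):
    """Return (cleaned_text, mentions) for one tweet. Hypothetical helper."""
    mentions = re.findall(r"@\w+", text)         # keep the handles before removing them
    text = text.lower()
    text = re.sub(r"@\S+", "", text)             # remove mentions
    text = re.sub(r"\B#\S+", "", text)           # remove hashtags
    text = re.sub(r"http\S+", "", text)          # remove URLs
    text = " ".join(re.findall(r"\w+", text))    # drop special characters
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)    # drop single characters
    text = re.sub(r"\s+", " ", text).strip()     # collapse repeated spaces
    return text, mentions

# Made-up tweet, only for illustration
sample = "RT @SomeAccount: What a GREAT day!! #MAGA https://t.co/abc123"
cleaned, mentions = clean_tweet(sample)
print(cleaned)   # rt what great day
print(mentions)  # ['@SomeAccount']
```

Note that the single-character step substitutes a space rather than an empty string, so the words on either side of the removed character stay separate.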

In [None]:
data = df.copy()
data['original_text'] = df['text']
data['date'] = df.date.apply(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M"))
rt_mask = data.text.apply(lambda x: "RT @" in x)

# standard tweet preprocessing 
data.text = data.text.str.lower()
# Remove Twitter handles
data.text = data.text.apply(lambda x: re.sub(r'@[^\s]+', '', x))
# Remove hashtags
data.text = data.text.apply(lambda x: re.sub(r'\B#\S+', '', x))
# Remove URLs
data.text = data.text.apply(lambda x: re.sub(r"http\S+", "", x))
# Remove all the special characters
data.text = data.text.apply(lambda x: ' '.join(re.findall(r'\w+', x)))
# Remove all single characters (replace with a space so neighbouring words are not glued together)
data.text = data.text.apply(lambda x: re.sub(r'\s+[a-zA-Z]\s+', ' ', x))
# Substitute multiple spaces with a single space
data.text = data.text.apply(lambda x: re.sub(r'\s+', ' ', x))

# Feature Extraction


Here, we will use the cleaned data to extract some new columns that will aid us in generating visualizations later in this notebook. Some of the features we will extract are: 

1. **Errors** : The errors in the sentence, identified by the `pyspellchecker` Library. 
2. **Counts** : The counts for the words, errors and characters in the tweet text 
3. **Time and Date** : This will make grouping the rows more convenient later 
4. **Sentiment** : Sentiment values as predicted by the Python VADER package. It assigns a compound score between -1 and 1 to each sentence. In my previous notebook [here](https://www.kaggle.com/pawanbhandarkar/training-a-sith-lord/comments) I found that, generally, values between -0.05 and 0.35 correspond to a "neutral" sentiment, values less than -0.05 correspond to a negative sentiment and anything above 0.35 can be labeled positive. These are the thresholds we will use in this notebook as well. 

In [None]:
def label_sentiment(x:float):
    if x < -0.05 : return 'negative'
    if x > 0.35 : return 'positive'
    return 'neutral'

# Feature Extraction
data['words'] = data.text.apply(lambda x:re.findall(r'\w+', x ))
data['errors'] = data.words.apply(spell.unknown)
data['errors_count'] = data.errors.apply(len)
data['words_count'] = data.words.apply(len)
data['sentence_length'] = data.text.apply(len)
data['hour'] = data.date.apply(lambda x: x.hour)
data['date'] = data.date.apply(lambda x: x.date())
data['month'] = data.date.apply(lambda x: x.month)
data['year'] = data.date.apply(lambda x: x.year)


# Extract Sentiment Values for each tweet 
data['sentiment'] = [sia.polarity_scores(x)['compound'] for x in tqdm(data['text'])]
data['overall_sentiment'] = data['sentiment'].apply(label_sentiment);



### Splitting the Data

The dataset contains two types of tweets: 
1. **Original Content (OC)** - Tweets sent by Trump's official accounts 
2. **Retweets (RT)** - Tweets that one or more of his affiliated accounts retweeted. 

While I am interested in diving into some of his retweets, I will start my analysis by considering only his original tweets. I might add some analysis of his retweets in a future version, depending on how well this notebook is received. 

In [None]:
# Split into Retweets and Original Content 
rt_df = data[rt_mask]
oc_df = data[~rt_mask]

Let's quickly take a look at our data so far. 

In [None]:
oc_df.head()

# Visualizations

### Trump's Rise in Popularity

While Donald Trump was already pretty well known as a billionaire who occasionally made cameos in TV shows and movies, it was only around 2016, when he ran for office and was subsequently elected the 45th President of the United States, that his popularity skyrocketed. As the plots below suggest, it was during 2015, the year he launched his campaign, that his Twitter activity started to climb in terms of likes and retweets. 

**Technical note**: 
There is a LOT of data for this plot, since a lot can happen in 5 years. Plotting a simple Scatter Plot would make it very difficult to interpret. So to make the graph more readable, I define a helper function to convert a list of numbers into a cumulative list of exponentially weighted averages, that capture the trend in the data. 

No one explains it better than the OG of Machine Learning, [Dr. Andrew Ng himself. ](https://www.youtube.com/watch?v=lAq96T8FkTw&t=145s)
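
As a quick illustration of the idea, the recurrence is `v_t = beta * v_(t-1) + (1 - beta) * x_t`, starting from `v_0 = 0`. The sketch below is a plain-Python version, assuming a hypothetical `ewma` function that is separate from the helper used in this notebook. The optional bias correction (dividing by `1 - beta**t`) is a commonly used adjustment for the first few values and is not applied in the plots here.

```python
def ewma(values, beta=0.9, bias_correction=False):
    """Exponentially weighted moving average of a sequence of numbers."""
    smoothed = []
    v = 0.0  # the running average starts at zero
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # early values are biased towards zero; dividing by (1 - beta**t) undoes that
        smoothed.append(v / (1 - beta ** t) if bias_correction else v)
    return smoothed

raw = [10, 10, 10]
print(ewma(raw, beta=0.5))                        # [5.0, 7.5, 8.75]
print(ewma(raw, beta=0.5, bias_correction=True))  # [10.0, 10.0, 10.0]
```

A higher `beta` averages over roughly `1 / (1 - beta)` recent points, which is why a value close to 1 gives a much smoother curve.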

In [None]:
# Helper Function to get the exponentially weighted running average 
def get_weighted(series: pd.Series, beta=0.9):
    values = [0.0]  # the running average starts at zero
    for current in series.iloc[1:]:
        # blend the previous average with the current observation
        values.append(beta * values[-1] + (1 - beta) * current)
    return pd.Series(values, index=series.index)

# Get a two-line title for our plots
def get_multi_line_title(title:str, subtitle:str):
    return f"{title}<br><sub>{subtitle}</sub><br>"

In [None]:
title = get_multi_line_title(
    "Trump's Rise In Popularity", 
    "Plotting the average number of likes and retweets over the years"
)

beta = 0.99 #higher value -> smoother curve

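# daily mean likes; keep only days below 3 standard deviations to drop extreme outliers
likes = oc_df.groupby('date')['favorites'].mean()
likes_std = likes.std()
likes = likes[likes < 3*likes_std]

# same treatment for the daily mean retweets
retweets = oc_df.groupby('date')['retweets'].mean()
retweets_std = retweets.std()
retweets = retweets[retweets < 3*retweets_std]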

weighted_retweets = get_weighted(retweets, beta)
weighted_likes = get_weighted(likes, beta)


fig = go.Figure([
    go.Scatter(
        name="Daily Average Likes",
        x=likes.index, 
        y=likes.values,
        mode="markers",
        opacity=0.3,
        marker_color="salmon"
    ), 
    go.Scatter(
        name="Weighted Average Likes",
        x=weighted_likes.index, 
        y=weighted_likes.values,
        opacity=0.8,
        marker_color='crimson'
    ),
    go.Scatter(
        name="Daily Average Retweets",
        x=retweets.index, 
        y=retweets.values,
        mode="markers",
        opacity=0.3,
        marker_color="lightseagreen"
        
    ), 
    go.Scatter(
        name="Weighted Average Retweets",
        x=weighted_retweets.index, 
        y=weighted_retweets.values,
        opacity=0.8,
        marker_color='darkgreen'
    )
])

fig.update_layout(
    hovermode='x',
    title=title,
    xaxis_title="Time",
    yaxis_title="Average Likes / Retweets per Tweet",
    template="ggplot2",
    legend_orientation = 'h'
)

fig.show()

### How tech-savvy is Donald Trump?

With the advent of the internet, a lot of social platforms including Twitter have gained massive momentum in the last decade alone. With technology advancing at such an overwhelming pace, it is only expected that politicians too would have to "get with the times". In this section, let us try to visualize what devices Trump used over the years in order to communicate with the masses.

In [None]:
devices = pd.DataFrame(oc_df.groupby(['year', 'device'])['text'].count()).sort_values('year').reset_index()

title = get_multi_line_title(
    "Devices Over the Years", 
    "What devices has Trump used over the years?"
)

unique_devices = devices.device.unique().tolist()

devices = devices[devices['year'] != 2021] 
line_plots = []
for d in unique_devices:
    device_data = devices[devices.device == d]
    line_plots.append(go.Scatter(
        name = d,
        x = device_data.year,
        y=device_data.text,
    ))
    
fig = go.Figure(line_plots)
fig.update_layout(
    title =title,
    template="ggplot2",
    hovermode='x',
    legend_orientation = 'h'
)
fig.show()

#### What do we learn from this? 

It looks like ever since the start of his campaign, Trump has been almost exclusively using iPhones for his tweets. What could be the reason behind this? The ease of use? The iOS privacy features? I'll let you know once the Pentagon returns my calls. 

In [None]:
title = get_multi_line_title("Sentence Length Distribution", "Distribution of number of characters per tweet, by top-5 devices")
# use a separate name so the full `data` frame is not overwritten
length_df = oc_df[oc_df['text'].apply(len) != 0]
top_devices = length_df.groupby('device')['text'].count().sort_values(ascending=False)[:5].index.tolist()
length_df = length_df[length_df['device'].isin(top_devices)]
fig = px.histogram(length_df, x="sentence_length", color="device", opacity=0.75)
fig.update_layout(hovermode='x', title=title)
fig.show()

### This graph is more interesting than you might think
It looks like almost all of Trump's more verbose tweets came from an iPhone, as we can see a sharp increase in the total counts. But the real reason for this is timing: Trump started to use an iPhone around 2016, shortly before [Twitter DOUBLED its maximum character count](https://www.washingtonpost.com/news/the-switch/wp/2017/11/07/twitter-is-officially-doubling-the-character-limit-to-280/) in November 2017. So it makes sense that the tweets longer than 140 characters (the previous limit), and up to 280 characters (the new limit), come almost entirely from that device.
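
One way to check this explanation is to compare tweet lengths before and after the limit change. The sketch below uses a tiny made-up DataFrame: the column names `date` and `original_text` mirror this notebook, but the rows are invented, and the cutoff of 7 November 2017 is an assumption taken from the date of the article linked above.

```python
import pandas as pd

# Invented rows, only to show the mechanics
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2017-01-15", "2018-03-02", "2019-07-20"]),
    "original_text": ["a" * 120, "b" * 139, "c" * 250, "d" * 100],
})

LIMIT_CHANGE = pd.Timestamp("2017-11-07")  # assumption: date of the linked article
toy["length"] = toy["original_text"].str.len()
toy["period"] = (toy["date"] >= LIMIT_CHANGE).map({False: "before", True: "after"})

# Share of tweets longer than the old 140-character limit, per period
share_long = toy.groupby("period")["length"].apply(lambda s: (s > 140).mean())
print(share_long)
```

Keep in mind that `sentence_length` in this notebook is measured on the cleaned text, so counting characters on `original_text` is the fairer comparison against Twitter's limit.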

### How Active is Trump? 

Not much to say about this section except that it's a simple visualization of the total number of tweets by Donald Trump each year.

In [None]:
title = get_multi_line_title(
    "Activity over the years", 
    "How many Tweets has trump sent out over the years?")

annual_counts = pd.DataFrame(oc_df['year'].value_counts()).reset_index()
annual_counts.columns = ['year', 'count']
annual_counts = annual_counts[annual_counts['year'] != 2021]

fig = go.Figure(go.Bar(
    name="Annual Count", 
    x=annual_counts.year, 
    y=annual_counts['count'], 
    marker_color=annual_counts['count'] 
))
fig.update_layout(template='ggplot2', title=title)
fig.show();

### What time of the Day was Donald most active? 
There's a popular meme on the internet which states that Donald Trump used to tweet while on the toilet. So to humour this meme, I would like to visualize the distribution of Trump's tweets across the 24 hours of the day. This is where the DateTime processing we did earlier comes in handy!

For now, we will only visualize the time of day for the tweets made from iPhones, since these are the dominant devices.


In [None]:
title = get_multi_line_title("Time of the day most Tweeted", "")
def format_hour(h: int):
    # zero-pad the hour and append minutes, e.g. 7 -> "07:00"
    return f"{h:02d}:00"

oc = oc_df[oc_df['device'] == 'Twitter for iPhone']
hourly = oc.groupby('hour')['text'].count()
hourly = pd.DataFrame(hourly).reset_index()
hourly.columns =['Hour of Day',"Number of Tweets"]
hourly['Hour of Day'] = hourly['Hour of Day'].apply(format_hour)


fig = px.line_polar(
    data_frame=hourly,
    r = 'Number of Tweets',
    theta='Hour of Day',
    line_close=True,
    color_discrete_sequence=['crimson'],
)

fig.update_layout(
    title=title, 
    template="ggplot2",
    title_x=0.5)

fig.show()

#### What do we understand from this? 

It looks like Trump's tweets become more frequent towards noon, peaking at the 12th hour, and get less frequent after that. 

NOTE: This is **NOT** a visualization of Trump's tweets in a single day but of his tweets over the course of many years. It would be interesting to plot a similar polar graph for the frequency of tweets in each hour rather than just the raw count. I will add this in a future version. 
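
As a rough sketch of that idea, the raw counts can be converted into the share of tweets sent in each hour. The numbers below are invented, and `hourly_counts` is a hypothetical stand-in for the hourly table built above, not the real data.

```python
import pandas as pd

# Invented counts for a few hours of the day; hours with no tweets are missing
hourly_counts = pd.Series({7: 30, 8: 50, 12: 100, 20: 20}, name="Number of Tweets")

# Make sure all 24 hours are present, then convert counts to percentages
hourly_counts = hourly_counts.reindex(range(24), fill_value=0)
hourly_share = hourly_counts / hourly_counts.sum() * 100

print(hourly_share[hourly_share > 0].round(1))
```

Filling in the missing hours first keeps the polar axis complete, so quiet hours show up as zeros instead of being skipped.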

### How Positive is Trump? 

Being the leader of the free world is no small thing. Every day, a leader's thoughts and actions affect the lives of hundreds of thousands, if not millions, of people. Therefore, it is critical that his messages do not convey too much negativity too often. One way to measure this would be to quantify the "sentiment" values of his tweets and then visualise them, and that's exactly what we aim to do in this section!

**Technical Note**:
The sentiment analysis for this task is a classification problem, one that can be solved using a variety of approaches, depending on the desired outcomes. I decided to use the popular VADER package from NLTK, a rule-based model built on a sentiment lexicon, since it is quite simple to use and relatively straightforward. You can search for "Sentiment Analysis" on Kaggle to get some good alternative solutions by talented Kagglers.

In [None]:
title = get_multi_line_title(
    'Sentiment Distribution',
    "How positive is the leader of the free world?"
)

sentiment_pie = pd.DataFrame(oc_df['overall_sentiment'].value_counts() / oc_df.shape[0]*100).reset_index()
sentiment_pie.columns = ['Sentiment', 'Percentage']
fig = px.pie(sentiment_pie, values='Percentage', names='Sentiment', title=title)

fig.update_layout(
title=title, title_x=0.48)

fig.show()

**Note**: The colors are probably not the best for the sentiments and I will fix it later. 

In [None]:
sentiment_over_time = oc_df.sort_values('date')[['year', 'sentiment', 'overall_sentiment']]
sentiment_over_time = sentiment_over_time[sentiment_over_time.year !=2021]
annual_sentiment = pd.DataFrame(sentiment_over_time.groupby('year')['overall_sentiment'].value_counts())
annual_sentiment.columns = ['Count']
annual_sentiment = annual_sentiment.reset_index()

title = get_multi_line_title('Annual Tweet Sentiment', "How Trump's sentiments changed over the years")
years = annual_sentiment.year.unique().tolist()
sents = {'positive' : 'mediumseagreen', 'negative': 'crimson', 'neutral': 'royalblue'}


sentiment_bars = [] 
for s in sents.keys():
    current_year = annual_sentiment[annual_sentiment.overall_sentiment == s]
    sentiment_bars.append(
        go.Bar(name=s, x=current_year.year, y=current_year.Count, marker_color = sents[s])
    )
    
    
fig = go.Figure(sentiment_bars)
fig.update_layout(template='ggplot2', title=title)
fig.show()

### Wordclouds

What were the most frequently used words, based on the sentiment of the tweet? 

In [None]:
def flatten_list(l):
    return [x for y in l for x in y]

# color coding our wordclouds 
def red_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(0, 100%, {random.randint(25, 75)}%)" 

def green_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl({random.randint(90, 150)}, 100%, 30%)" 

def yellow_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(42, 100%, {random.randint(25, 50)}%)" 

def generate_word_clouds(neg_doc, neu_doc, pos_doc):
    # Display the generated image:
    fig, axes = plt.subplots(1,3, figsize=(20,10))
    
    
    wordcloud_neg = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neg_doc))
    axes[0].imshow(wordcloud_neg.recolor(color_func=red_color_func, random_state=3), interpolation='bilinear')
    axes[0].set_title("Negative Words")
    axes[0].axis("off")

    wordcloud_neu = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neu_doc))
    axes[1].imshow(wordcloud_neu.recolor(color_func=yellow_color_func, random_state=3), interpolation='bilinear')
    axes[1].set_title("Neutral Words")
    axes[1].axis("off")

    wordcloud_pos = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(pos_doc))
    axes[2].imshow(wordcloud_pos.recolor(color_func=green_color_func, random_state=3), interpolation='bilinear')
    axes[2].set_title("Positive Words")
    axes[2].axis("off")

    plt.tight_layout()
    plt.show();

# rank Trump's original tweets by likes
sentiment_sorted = oc_df.sort_values('favorites', ascending=False)
positive_top_100 = sentiment_sorted[sentiment_sorted['overall_sentiment'] == "positive"].iloc[:100]
negative_top_100 = sentiment_sorted[sentiment_sorted['overall_sentiment'] == "negative"].iloc[:100]
neutral_top_100 = sentiment_sorted[sentiment_sorted['overall_sentiment'] == "neutral"].iloc[:100]

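# build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))
cleanup = lambda x: [y for y in x.split() if y not in english_stopwords]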
neg_doc = flatten_list(negative_top_100['text'].apply(cleanup))
pos_doc = flatten_list(positive_top_100['text'].apply(cleanup))
neu_doc = flatten_list(neutral_top_100['text'].apply(cleanup))

generate_word_clouds(neg_doc, neu_doc, pos_doc)

# ScatterText

Now we come to my favourite part of this notebook: **ScatterText**

## What is it? 
ScatterText, as the name suggests, is a scatterplot for text data. But unlike regular old BORING scatter graphs, ScatterText is ridiculously intuitive and quite frankly, I'm shocked that more people don't use it. I learned about it in a [Medium post](https://jamesopacich.medium.com/interpreting-scattertext-a-seductive-tool-for-plotting-text-2e94e5824858) and I knew right away that I had to use it and showcase it to the world.

While I highly recommend that you give the above article a read, I will quickly go over some important features of this tool: 
1. It is particularly well suited when you want to see how words are distributed between two categories. In our case, we will consider the "Negative" and "Non-Negative" sentiments as the two classes.
2. Words closer to the axes are said to have higher "precision" with respect to each axis and the ones farther away are said to have more "recall"
3. The words in the top right corner of the graph have high recall in both classes and generally represent stop words
4. The words along the diagonal are common to both classes  
5. The search bar can be used to highlight text by index (in this case, we use "Date") 
6. The graph is saved as HTML and therefore we use IFrame to display it 

If you want to get started, I recommend taking a look at the article above and also the GitHub repo README [here](https://github.com/JasonKessler/scattertext)
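
To build some intuition for what the two axes measure, here is a small sketch that counts how often each word occurs in each class, without using ScatterText itself. The documents are invented, and `negative_share` is a hypothetical name for the fraction of a word's occurrences that fall in the negative class.

```python
from collections import Counter

# Invented documents, only for illustration
docs = [
    ("negative", "fake news hoax"),
    ("negative", "fake country"),
    ("non-negative", "great country"),
    ("non-negative", "great great jobs"),
]

# Count word occurrences separately for each class
counts = {"negative": Counter(), "non-negative": Counter()}
for label, text in docs:
    counts[label].update(text.split())

vocab = sorted(set(counts["negative"]) | set(counts["non-negative"]))
for word in vocab:
    neg, non = counts["negative"][word], counts["non-negative"][word]
    negative_share = neg / (neg + non)
    print(f"{word:8s} negative={neg} non-negative={non} share={negative_share:.2f}")
```

A word such as `country`, which occurs equally often in both classes, would be expected to sit near the diagonal, while words such as `fake` or `great` would be pulled towards one class.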

In [None]:
data = oc_df.copy()
data['binary_sentiment'] = data['overall_sentiment'].apply(lambda x: x if x =="negative" else "non-negative")
data['date'] = data['date'].apply(str)

df = data.assign(
    parse=lambda df: df.original_text.apply(st.whitespace_nlp_with_sentences)
)

corpus = st.CorpusFromParsedDocuments(
    df, category_col='binary_sentiment', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='negative', category_name='Negative', not_category_name='Neutral/Positive',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['date'],
    transform=st.Scalers.dense_rank
)

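# write the HTML with an explicit encoding and close the file when done
with open('./demo_compact.html', 'w', encoding='utf-8') as f:
    f.write(html)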
IFrame(src='./demo_compact.html', width=1200, height=700)

#### What do we understand from this? 

We see from the above graph that words like "strong", "maga" and "golf" generally appear in tweets that convey a more neutral or positive sentiment, while words like "hoax", "illegal" and "fake" have often been used to convey negative sentiments. Meanwhile, words like "job", "obama" and "country" are about equally likely to appear in texts with either sentiment. 

Intuitive, no? 

# Conclusion 

In conclusion I would like to conclude by saying that the notebook is now concluded. 

# Summary

Work Completed as of 07-02-2021:

* **Data Cleaning** : Preprocessing and date time cleanup.
* **Feature Extraction** : Creating some useful features for visualizations 
* **Trump's Rise in Popularity** : How did his Twitter fandom increase over time?
* **Devices Over the Years** : How tech-savvy is Donald Trump? 
* **Activity Over the Years** : Visualize the Annual total tweets 
* **Time of Day** : What time of the day is Trump most active? 
* **Sentiment Analysis** : How positive is the leader of the free world? 
* **Word Clouds** : Visualize frequent words by sentiment
* **ScatterText** : A powerful tool for text data visualization 

Thanks for taking the time to read this notebook. If you liked it, an **UPVOTE** is massively encouraging! I will try to keep this notebook updated and add in more visualizations in the future so be sure to check back soon!