<h1>Trump Tweets</h1>


<img src="https://static01.nyt.com/images/2017/05/31/us/31tweet-sub1/31tweet-sub1-videoSixteenByNineJumbo1600.jpg" width="600"></img>

# Introduction


The dataset used here was collected through the [Twitter API](https://developer.twitter.com/en/docs/api-reference-index) with the **[tweepy](https://www.tweepy.org/)** Python package, which wraps the API for easier use.  

It contains recent tweets from Donald Trump.


Note: because of frequent abusive posts, Twitter censored some of President Trump's tweets. More recently, his Twitter account was temporarily suspended and then permanently banned.


# Data preparation

## Load packages

In [None]:
import numpy as np 
import pandas as pd
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
from wordcloud import WordCloud, STOPWORDS

## Load data

In [None]:
tweets_df = pd.read_csv("/kaggle/input/trump-tweets/trump_tweets.csv")

# Data exploration


## Glimpse the data

In [None]:
print(f"data shape: {tweets_df.shape}")

In [None]:
tweets_df.info()

In [None]:
tweets_df.describe()

In [None]:
tweets_df.head()

### Missing data

In [None]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

In [None]:
missing_data(tweets_df)
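
The same kind of summary can be written more compactly with vectorized pandas calls. The following is only a minimal sketch: `missing_summary` and the `toy_df` frame are hypothetical names and data used for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, missing percentage, and dtype (one column per feature)."""
    return pd.DataFrame({
        "Total": data.isnull().sum(),
        "Percent": data.isnull().mean() * 100,  # mean of booleans = fraction missing
        "Types": data.dtypes.astype(str),
    }).T

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "text": ["a", None, "c", "d"],
    "retweets": [1.0, 2.0, np.nan, np.nan],
})
summary = missing_summary(toy_df)
print(summary)
```

Taking the mean of the boolean mask gives the missing fraction directly, so no separate division by the row count is needed.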

### Unique values

In [None]:
def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    return(np.transpose(tt))

In [None]:
unique_values(tweets_df)

### Most frequent values

In [None]:
def most_frequent_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    items = []
    vals = []
    for col in data.columns:
        # value_counts() sorts by frequency, so the first entry is the most frequent value
        counts = data[col].value_counts()
        items.append(counts.index[0])
        vals.append(counts.values[0])
    tt['Most frequent item'] = items
    tt['Frequency'] = vals
    tt['Percent from total'] = np.round(vals / total * 100, 3)
    return(np.transpose(tt))

In [None]:
most_frequent_values(tweets_df)

### Extract date & time features

In [None]:
tweets_df['datedt'] = pd.to_datetime(tweets_df['date'])

In [None]:
tweets_df['year'] = tweets_df['datedt'].dt.year
tweets_df['month'] = tweets_df['datedt'].dt.month
tweets_df['day'] = tweets_df['datedt'].dt.day
tweets_df['dayofweek'] = tweets_df['datedt'].dt.dayofweek
tweets_df['hour'] = tweets_df['datedt'].dt.hour
tweets_df['minute'] = tweets_df['datedt'].dt.minute

In [None]:
# Keep only the calendar date (the first 10 characters of each timestamp string)
tweets_df['dated'] = pd.to_datetime(tweets_df['date'].str[:10])
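
To make the extracted features easier to read, here is a small sketch on hypothetical timestamps. It assumes the `date` column uses a `YYYY-MM-DD HH:MM:SS` layout, which is an assumption about the file; the `toy` frame and the `day_name` column are illustrative only.

```python
import pandas as pd

# Hypothetical timestamps, assumed to follow the same layout as the `date` column
toy = pd.DataFrame({"date": ["2020-10-02 04:54:06", "2020-11-07 15:36:00"]})
toy["datedt"] = pd.to_datetime(toy["date"])

toy["dayofweek"] = toy["datedt"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["day_name"] = toy["datedt"].dt.day_name()  # readable label, handy for plot axes
toy["dated"] = toy["datedt"].dt.normalize()    # same day, time set to midnight
print(toy[["dayofweek", "day_name", "dated"]])
```

Note that `dayofweek` starts at 0 for Monday, which matters when reading the day-of-week plot.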

## Visualize the data distribution

In [None]:
def plot_count(feature, title, df, size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
    g = sns.countplot(x=df[feature], order = df[feature].value_counts().index[:31], palette='Set3')
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 0.2,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()    

In [None]:
plot_count("dayofweek", "tweets per day of week", tweets_df, size=2)

In [None]:
plot_count("hour", "tweets per hour", tweets_df, size=3)

In [None]:
plot_count("day", "tweets per day of month", tweets_df, size=4)
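
The percentage printed above each bar is the bar's count divided by the total number of rows. As a rough sketch, the same numbers can be obtained without plotting; the `hours` series below is hypothetical data, not taken from the dataset.

```python
import pandas as pd

# Hypothetical posting hours, for illustration only
hours = pd.Series([9, 9, 14, 23, 9, 14, 2, 9], name="hour")

# normalize=True returns fractions of the total; values are sorted by frequency
share = (hours.value_counts(normalize=True) * 100).round(2)
print(share)
```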

In [None]:
def plot_count(feature, title, df, size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
    g = sns.countplot(x=df[feature], order = df[feature].value_counts().index[:20], palette='Set3')
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()    

In [None]:
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=50,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(" ".join(data.dropna().astype(str)))  # join every text; str(data) would only use the truncated Series repr

    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

### Text wordclouds

In [None]:
show_wordcloud(tweets_df['text'], title = 'Prevalent words in tweets')
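
A word cloud is a visual rendering of word frequencies, so a plain frequency count is a useful cross-check. The sketch below uses only the standard library; the `texts` list and the tiny `stop` set are hypothetical stand-ins for the tweet texts and for `STOPWORDS`.

```python
import re
from collections import Counter

# Hypothetical tweets and a tiny stopword list, for illustration only
texts = [
    "Make America Great Again!",
    "We will make America safe again",
    "Great rally tonight",
]
stop = {"we", "will", "the", "a"}

# Lowercase, keep alphabetic tokens, and drop stopwords
tokens = [
    word
    for text in texts
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stop
]
word_counts = Counter(tokens)
top_words = word_counts.most_common(3)
print(top_words)
```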

### Hashtags analysis

In [None]:
tweets_df['hashtags'] = tweets_df['hashtags'].fillna("['None']")
tweets_df['hashtags'] = tweets_df['hashtags'].apply(lambda x: x.replace('\\N',''))
tweets_df['hashtags_count'] = tweets_df['hashtags'].apply(lambda x: len(x.split(',')))

In [None]:
from itertools import chain

tweets_df['hashtags_individual'] = tweets_df['hashtags'].apply(lambda x: x.split(','))
all_hashtags = set(chain.from_iterable(tweets_df['hashtags_individual']))
print(f"Total distinct hashtag values: {len(all_hashtags)}: {all_hashtags}")

Let's count the tweets with each of these hashtags.

In [None]:
for hashtag in all_hashtags:
    _d_df = tweets_df.loc[tweets_df.hashtags==hashtag]
    print(f"Hashtag: {hashtag}, tweets: {_d_df.shape[0]}")

Most tweets have no hashtag, and most of the tweets that do have one use the trademark **MAGA** hashtag.
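
Splitting on commas leaves brackets and quotes attached to each piece, and comparing the whole `hashtags` field to a single value only matches tweets with exactly one hashtag. If the column stores stringified Python lists (an assumption about the file), a per-hashtag count could be sketched as follows; `raw_hashtags` and `parse_hashtags` are hypothetical names used for illustration.

```python
import ast
from collections import Counter

# Hypothetical raw values, assuming hashtags are stored as stringified Python lists
raw_hashtags = ["['MAGA']", None, "['MAGA', 'KAG2020']", "['VOTE']"]

def parse_hashtags(value):
    """Turn "['A', 'B']" into ['A', 'B']; missing values become an empty list."""
    if not isinstance(value, str):
        return []
    return ast.literal_eval(value)

# Count every individual hashtag across all tweets
hashtag_counts = Counter(
    tag for value in raw_hashtags for tag in parse_hashtags(value)
)
print(hashtag_counts.most_common())
```

With this approach a tweet carrying several hashtags contributes to each of them, and tweets without hashtags contribute nothing.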

### Impact analysis

In [None]:
def plot_time_variation(df, x='date', y='retweets', hue=None, size=1, title="", is_log=False):
    f, ax = plt.subplots(1,1, figsize=(4*size,3*size))
    g = sns.lineplot(x=x, y=y, hue=hue, data=df)
    plt.xticks(rotation=90)
    if hue:
        plt.title(f'{y} grouped by {hue} | {title}')
    else:
        plt.title(f'{y} | {title}')
    if(is_log):
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.show() 

In [None]:
plot_time_variation(tweets_df, x='dated', y='retweets', title= 'day averages and variation', size=4, is_log=False)

In [None]:
plot_time_variation(tweets_df, x='dated', y='favorites', title='day averages and variation',size=4, is_log=True)

In [None]:
plot_time_variation(tweets_df, x='dated', y='user_followers', title='day average and variation', size=4, is_log=False)
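
When several rows share the same `x` value, `sns.lineplot` by default draws the mean per `x` and a confidence band around it. The daily numbers behind such a plot can be approximated with a group-by, as in this sketch on hypothetical data (the `toy` frame is illustrative; standard deviation is used here instead of the confidence interval seaborn computes).

```python
import pandas as pd

# Hypothetical per-tweet retweet counts on two days, for illustration only
toy = pd.DataFrame({
    "dated": pd.to_datetime(
        ["2020-11-03", "2020-11-03", "2020-11-04", "2020-11-04", "2020-11-04"]
    ),
    "retweets": [100, 300, 50, 150, 100],
})

# One row per day: average, spread, and number of tweets
daily = toy.groupby("dated")["retweets"].agg(["mean", "std", "count"])
print(daily)
```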

Let's look at the tweets with the most retweets and favorites.

In [None]:
pd.set_option('display.max_colwidth', 150)
tweets_ordered_df = tweets_df.sort_values(by=["retweets"], ascending=False)
tweets_ordered_df[["text", "date", "hashtags","day", "hour",  "dayofweek", "retweets", "favorites"]].head(5)

In [None]:
pd.set_option('display.max_colwidth', 150)
tweets_ordered_df = tweets_df.sort_values(by=["favorites"], ascending=False)
tweets_ordered_df[["text", "date", "hashtags","day", "hour", "dayofweek", "retweets", "favorites"]].head(5)

Until recently, the tweets with the most retweets and favorites either carried the *MAGA* hashtag (Make America Great Again), spelled out the slogan in uppercase, or promoted conspiracy theories related to the election.

After that, the most retweeted and favorited tweets were the updates on the COVID-19 infection of POTUS Donald Trump and FLOTUS Melania Trump.

Because of election day and the vote counting, the top 5 now also includes recent tweets about the election results, in which President Trump claims that he won.


Recent tweets related to the election and vote-counting days are also making an impact. Keep in mind that, because of fake news content, recent tweets from President Trump were censored by Twitter.  

More recently, we also see a decrease in the number of Donald Trump's followers.
