# Jair Bolsonaro (@jairbolsonaro) - Tweet Statistics

# Sections

The notebook is organized into sections, which are listed in the left menu. Below the graphs there are some notes on the data; please consider reading them.

# Who is Jair Bolsonaro
Jair Messias Bolsonaro (born 21 March 1955) is a Brazilian politician and retired military officer who has been the 38th president of Brazil since 1 January 2019. He served in the country's Chamber of Deputies, representing the state of Rio de Janeiro, between 1991 and 2018. He was elected president as a member of the conservative Social Liberal Party, which he later left to found the Alliance for Brazil party. [[1]](https://en.wikipedia.org/wiki/Jair_Bolsonaro)

# The Dataset

The dataset contains tweets from [@jairbolsonaro](https://twitter.com/jairbolsonaro) as president of Brazil, from 1 January 2019 to 31 December 2019. There are 2,551 tweets, each with 5 variables: id, text, retweet_count, favorite_count and created_at. The variables are documented in [Twitter's API](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).

# Exploratory Data Analysis (EDA)

1. [What is the frequency of tweets per month?](#What-is-the-frequency-of-tweets-per-month?)
1. [Which tweet was most retweeted per month?](#Which-tweet-was-most-retweeted-per-month?)
1. [Which tweet was most favorited per month?](#Which-tweet-was-most-favorited-per-month?)
1. [Was there a most retweet or favorited tweet per month?](#Was-there-a-most-retweet-or-favorited-tweet-per-month?)
1. [How many tweets that mention some user?](#How-many-tweets-that-mention-some-user?)
1. [How many tweets were retweets?](#How-many-tweets-were-retweets?)
1. [Were there more tweets than retweets?](#Were-there-more-tweets-than-retweets?)
1. [Which was the frequency of tweets by hour per day?](#Which-was-the-frequency-of-tweets-by-hour-per-day?)
1. [Which was the frequency of tweets by weekday?](#Which-was-the-frequency-of-tweets-by-weekday?)
1. [Which was the frequency of tweets by weekday of month?](#Which-was-the-frequency-of-tweets-by-weekday-of-month?)
1. [Which was the frequency of tweets by hour per weekday?](#Which-was-the-frequency-of-tweets-by-hour-per-weekday?)
1. [Which people were mentioned?](#Which-people-were-mentioned?)
1. [Which people the retweets came from?](#Which-people-the-retweets-came-from?)

### Imports and declarations

In [None]:
import re, string, unicodedata, random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

from collections import Counter
from itertools import chain

from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

from wordcloud import WordCloud

In [None]:
STOPWORDS = stopwords.words('portuguese')

blind = {
    "<empty>": "",
}

In [None]:
def color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    COLORS = ['#b58900', '#cb4b16', '#dc322f', 
          '#d33682', '#6c71c4', '#268bd2', '#2aa198', '#859900']
    return COLORS[random.randint(0, len(COLORS)-1)]

def convert(x):
  x = str(x)
  return f'{x[:4]}-{x[4:]}'

def re_sub(text, pattern, repl):
    return re.sub(pattern, repl, text)


def remove_non_ascii(text):
    new_tokens = []
    tokens = text.split()
    
    for token in tokens:
        token = unicodedata.normalize('NFKD', token).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_tokens.append(token)
    
    return ' '.join(new_tokens)


def remove_punctuation(text):
    tokens = [c for c in text if c not in string.punctuation]
                
    return ''.join(tokens)


def strip_text(text):
    return text.strip()


def remove_stopwords(text):
    tokens = text.split()
    tokens = [token for token in tokens if token not in STOPWORDS]
                
    return ' '.join(tokens)


def normalize_serie(text):
    text = text.lower()
    text = remove_stopwords(text)
    text = remove_non_ascii(text)
    
    text = re_sub(text, r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", blind["<empty>"])
    # Drop only the standalone retweet marker, not every word containing 'rt'
    text = re_sub(text, r"\brt\b", blind["<empty>"])
    text = re_sub(text, r"\b(\w*jairbolsonaro\w*)\b", blind["<empty>"])
    text = re_sub(text, r"\b(k+)\b", blind["<empty>"])
    text = re_sub(text, r"\b(\d+)\b", blind["<empty>"])
    
    text = strip_text(text)
    text = remove_punctuation(text)

    return text
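
The order of the steps in `normalize_serie` matters: Portuguese stopwords carry accents, so they have to be removed before the text is folded to ASCII. A minimal sketch of that effect, using a hypothetical one-word stopword set instead of the NLTK list (`fold_to_ascii` and `TOY_STOPWORDS` are illustrative names, not part of the notebook):

```python
import unicodedata

# Hypothetical stand-in for stopwords.words('portuguese')
TOY_STOPWORDS = {'não'}

def fold_to_ascii(token):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', token)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

folded = [fold_to_ascii(t) for t in ['não', 'educação', 'Brasil']]
print(folded)  # ['nao', 'educacao', 'Brasil']

# The accented form is recognised as a stopword, the folded form is not
print('não' in TOY_STOPWORDS)                 # True
print(fold_to_ascii('não') in TOY_STOPWORDS)  # False
```

Running the stopword filter after the folding step would therefore let accented stopwords through.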

In [None]:
f = '/kaggle/input/bolsonaros-200-days-as-president-on-twitter/jairbolsonaro.csv'

In [None]:
# parse_dates (not date_parser) is the argument that takes the list of date columns
df = pd.read_csv(f, parse_dates=['created_at'])

### Describe the dataset


The dataset contains 5 variables: id, text, retweet_count, favorite_count and created_at.

In [None]:
df.head()

In [None]:
df.info()

All variables are filled; there are no null values.
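
`df.info()` reports the non-null count per column. The same check can be made explicit with `isna()`. A small sketch on a made-up frame (the rows are invented for illustration and are not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['first', None, 'third'],
    'retweet_count': [10, 20, 30],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no column has a missing value
has_no_nulls = not toy.isna().any().any()
print(has_no_nulls)  # False
```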

### Grouping by month

To make grouping easier, I added a year/month key to each tweet; this will also make it easier to extend the data collection over the next few years.

In [None]:
# Zero-padded key (e.g. 201901) so that months sort chronologically across years
created = pd.to_datetime(df['created_at'])
df['YearMonth'] = created.dt.year * 100 + created.dt.month
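
When the data spans more than one year, an integer year-month key only sorts chronologically if the month is zero-padded. A sketch comparing the two encodings (the helper names are illustrative):

```python
from datetime import date

def key_unpadded(d):
    # 2019-10 -> 201910, but 2020-01 -> 20201
    return int(f'{d.year}{d.month}')

def key_padded(d):
    # 2019-10 -> 201910, 2020-01 -> 202001
    return d.year * 100 + d.month

dates = [date(2019, 10, 1), date(2020, 1, 1)]

by_unpadded = sorted(dates, key=key_unpadded)
by_padded = sorted(dates, key=key_padded)

print(by_unpadded[0])  # 2020-01-01
print(by_padded[0])    # 2019-10-01
```

Within a single year both encodings sort the same way, since 20199 < 201910.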

## What is the frequency of tweets per month?

In [None]:
res = df.groupby('YearMonth')['id'].count()
print(res)

In [None]:
X = tuple(map(convert, res.index))
Y = res.values

fig = go.Figure(data=[go.Scatter(x=X, y=Y, text=Y)])
fig.update_layout(title='Tweets per month - @jairbolsonaro', 
                  xaxis_title='Month', yaxis_title='Tweets')

fig.show()

> October 2019 had the most tweets in the first year.

## Which tweet was most retweeted per month?

In [None]:
# Rows whose retweet_count equals the maximum of their month
idx_retweet_count = df.groupby('YearMonth')['retweet_count'].transform('max') == df['retweet_count']
x = df[idx_retweet_count]['YearMonth'].apply(convert)

retweet_count = df[idx_retweet_count]['retweet_count']
hovertext = df[idx_retweet_count]['text']

fig = go.Figure(data=[go.Bar(
    x=x, 
    y=retweet_count,
    text=retweet_count,
    textposition='auto',
    hovertext=hovertext,
  )
])

fig.update_layout(title='Tweets most retweeted per month - @jairbolsonaro')

fig.show()

### In 2019:
1. The tweets with more than 30k retweets were posted by [@realDonaldTrump](https://twitter.com/realDonaldTrump). [[2]](https://en.wikipedia.org/wiki/Donald_Trump)
1. October was the month with the largest number of tweets; its most retweeted tweet was [this one](http://t.co/THbGVEbasE), about the story [‘This will not stick’: Brazilian president lashes out over alleged links to left-wing politician’s killing](https://www.washingtonpost.com/nation/2019/10/30/jair-bolsonaro-marielle-franco-murder-link/).
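
The boolean mask used for this chart keeps every row that ties for the monthly maximum. If exactly one row per month is wanted, `groupby(...).idxmax()` is one alternative. A sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    'YearMonth': [201901, 201901, 201902, 201902],
    'text': ['a', 'b', 'c', 'd'],
    'retweet_count': [5, 9, 7, 7],
})

# idxmax returns the index label of the first maximum in each group
top_idx = toy.groupby('YearMonth')['retweet_count'].idxmax()
top_rows = toy.loc[top_idx]
print(top_rows[['YearMonth', 'text', 'retweet_count']])
```

In the second month two tweets tie at 7 retweets, and only the first of them is kept.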

## Which tweet was most favorited per month?

In [None]:
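# Rows whose favorite_count equals the maximum of their month
idx_favorite_count = df.groupby('YearMonth')['favorite_count'].transform('max') == df['favorite_count']
top_favorited = df[idx_favorite_count]

favorite_count = top_favorited['favorite_count']
hovertext = top_favorited['text']

fig = go.Figure(data=[go.Bar(
    # Take the months from the same rows as the counts so both stay aligned
    x=top_favorited['YearMonth'].apply(convert),
    y=favorite_count,
    text=favorite_count,
    textposition='auto',
    hovertext=hovertext,
  )
])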

fig.update_layout(title='Tweets most favorited per month - @jairbolsonaro')

fig.show()

### In 2019:
1. In this case, the [same tweet](http://t.co/THbGVEbasE) also has the most favorites.

## Was there a most retweet or favorited tweet per month?

In [None]:
temp = df[['YearMonth', 'favorite_count', 'retweet_count']
          ].groupby(['YearMonth'], as_index=False).sum()

fig = go.Figure(
    data=[
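      # Use the months of the grouped frame itself so x and y stay aligned
      go.Bar(name='Retweet', x=temp['YearMonth'].apply(convert), y=temp['retweet_count'], 
             text=temp['retweet_count'], textposition='auto'),
      go.Bar(name='Favorite', x=temp['YearMonth'].apply(convert), y=temp['favorite_count'], 
             text=temp['favorite_count'], textposition='auto')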
])

fig.update_layout(title='Retweets <i>vs</i> Favorite tweets - @jairbolsonaro', barmode='group')
fig.show()

This graph shows the monthly sums of favorite_count and retweet_count over the year.

## How many tweets that mention some user?

In [None]:
regex_mention = r'(@\w+)'
df['mentions'] = df.text.apply(lambda x: ' '.join(re.findall(regex_mention, x)))

In [None]:
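# Count, per month, the tweets that mention at least one user
mentions = (df.loc[df.mentions.str.contains('@'), ['mentions', 'YearMonth']]
              .groupby('YearMonth', as_index=False).count()
              .sort_values(by='YearMonth'))

# Months come from the grouped frame, so a month without mentions cannot misalign the bars
fig = go.Figure(data=go.Bar(name='Mentions', x=mentions['YearMonth'].apply(convert),
                            y=mentions['mentions'],
                            text=mentions['mentions'], textposition='auto'))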
fig.update_layout(title='Tweets that mention some user - @jairbolsonaro')
fig.show()
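
The pattern `(@\w+)` captures an `@` followed by word characters. A quick sketch of what it does and does not match (the sample strings are invented):

```python
import re

regex_mention = r'(@\w+)'

samples = [
    'RT @realDonaldTrump: Great meeting with @jairbolsonaro!',
    'Bom dia a todos!',
    'Contato: equipe@exemplo.com',
]

found = [re.findall(regex_mention, s) for s in samples]
for sample, handles in zip(samples, found):
    print(handles, '<-', sample)
```

Note that the domain part of an e-mail address is captured as well, so tweets containing addresses would be counted as mentions.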

## How many tweets were retweets?

In [None]:
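# Retweets are identified by the 'RT @' marker in the text.
# Reindexing on the monthly totals keeps months without retweets as 0.
rts = (df.loc[df.text.str.contains(r'\bRT @')]
         .groupby('YearMonth')['id'].count()
         .reindex(res.index, fill_value=0).values)
fig = go.Figure(data=go.Bar(name='Retweets', x=X, y=rts, 
                            text=rts, textposition='auto'))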
fig.update_layout(title='Retweets per Month - @jairbolsonaro')
fig.show()

## Were there more tweets than retweets?

In [None]:
not_rts = Y - rts

fig = go.Figure(
    data=[
         # X holds the months in the same order as Y and rts
         go.Bar(name='Tweets', x=X, y=not_rts, text=not_rts, textposition='auto'),
         go.Bar(name='RT', x=X, y=rts, text=rts, textposition='auto'),
])

fig.update_layout(title='Tweets composition per month - @jairbolsonaro', 
                  barmode='stack')

fig.show()
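
The subtraction `Y - rts` assumes both arrays cover the same months in the same order. If a month had no retweets, the grouped counts would be shorter than the totals. One way to guard against that is to reindex the retweet counts on the monthly totals; a sketch with invented numbers:

```python
import pandas as pd

# Invented monthly totals and retweet counts; February has no retweets
totals = pd.Series({201901: 5, 201902: 3, 201903: 4})
retweets = pd.Series({201901: 2, 201903: 1})

# Fill the missing month with 0 so both series share the same index
aligned = retweets.reindex(totals.index, fill_value=0)
originals = totals - aligned

print(aligned.tolist())    # [2, 0, 1]
print(originals.tolist())  # [3, 3, 3]
```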

## Which was the frequency of tweets by hour per day?

In [None]:
df['Hour'] = pd.to_datetime(df['created_at']).dt.hour

In [None]:
hours = df[['Hour', 'id']].groupby('Hour', as_index=False).count().sort_values(by='Hour')

fig = go.Figure(
      data=[go.Bar(x=hours['Hour'], y=hours['id'], 
                   text=hours['id'], textposition='auto')
      ],
)

fig.update_layout(title='Tweet Frequency by hour - @jairbolsonaro')
fig.show()
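
The hours above are taken directly from `created_at`. Assuming these timestamps are in UTC, as in Twitter's API, local Brasília hours would be three hours earlier. A sketch of the conversion with a fixed UTC-3 offset (a simplification that ignores daylight saving time; the timestamp is invented):

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-3 offset for Brasília, no daylight saving adjustment
BRASILIA = timezone(timedelta(hours=-3))

utc_time = datetime(2019, 6, 1, 2, 30, tzinfo=timezone.utc)
local_time = utc_time.astimezone(BRASILIA)

print(utc_time.hour)      # 2
print(local_time.hour)    # 23
print(local_time.date())  # 2019-05-31
```

A tweet posted late in the evening in Brasília therefore shows up in the early hours of the next day in a UTC-based chart.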

## Which was the frequency of tweets by weekday?

In [None]:
df['WeekDay'] = pd.to_datetime(df['created_at']).apply(lambda x: x.strftime('%w'))

In [None]:
weekdays = df[['WeekDay', 'id']].groupby('WeekDay', as_index=False).count().sort_values(by='WeekDay')
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',]

fig = go.Figure(data=[go.Bar(x=days, y=weekdays['id'], text=weekdays['id'], textposition='auto')])
fig.update_layout(title='Tweet Frequency by weekday - @jairbolsonaro')

fig.show()
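
`strftime('%w')` numbers the weekdays from Sunday (`'0'`) to Saturday (`'6'`), which is why the `days` list starts with Sunday. Python's `weekday()` instead counts from Monday. A short sketch of the difference:

```python
from datetime import date

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

rows = []
for d in (date(2019, 1, 1), date(2019, 10, 27)):
    w = d.strftime('%w')  # Sunday-based, returned as a string
    rows.append((str(d), w, days[int(w)], d.weekday()))
    print(rows[-1])
```

Mixing the two conventions would shift every label by one day, so the lookup list has to match the numbering in use.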

## Which was the frequency of tweets by weekday of month?

In [None]:
months_week = df[['YearMonth', 'WeekDay', 'id']].groupby(['YearMonth', 'WeekDay'], as_index=False).count()
  
fig = go.Figure()

fig.add_scatter(
    x=months_week['YearMonth'].apply(convert), 
    y=months_week['WeekDay'].apply(lambda x: days[int(x)]), mode='markers+text', 
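    # One color per weekday, looked up by weekday number so that a missing
    # (month, weekday) pair cannot shift the palette
    marker_color=months_week['WeekDay'].apply(lambda d: [
        '#b58900', '#cb4b16', '#dc322f',
        '#d33682', '#6c71c4', '#268bd2',
        '#2aa198',
    ][int(d)]),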
    # text=months_week['id'],
    marker=dict(size=months_week['id'] * .7)
)

fig.update_layout(title='Tweet frequency by weekday per month - @jairbolsonaro')

fig.show()


## Which was the frequency of tweets by hour per weekday?

In [None]:
months_week = df[['WeekDay', 'Hour', 'id']].groupby(['WeekDay', 'Hour'], as_index=False).count()
  
fig = go.Figure()

fig.add_scatter(
    x=months_week['Hour'], y=months_week['WeekDay'].apply(lambda x: days[int(x)]), 
    mode='markers+text', text=months_week['id'],
    marker=dict(size=months_week['id'])
)

fig.update_layout(title='Tweet frequency by hour per weekday - @jairbolsonaro')

fig.show()

## Which people were mentioned?

In [None]:
all_mentions = []
for year_month in X:

  # Mentions in original tweets (retweets excluded) for this month
  mentions_ = df.loc[
                    (df.YearMonth == int(year_month.replace('-', ''))) 
                    & (df.mentions != '')
                    & ~df.text.str.contains(r'\bRT @')
                  ]['mentions'].values

  # Flatten the space-separated handles of each tweet into one list
  mentions = [mention for m in mentions_ for mention in m.split()]
  
  all_mentions.append(mentions)

In [None]:
counter = []
for mentions in all_mentions:
  counter.append(Counter(mentions))

In [None]:
mentions = []
# Lowercase handles that are excluded or already added
saved_mention = {'@jairbolsonaro'}

for monthly in counter:
  for name in monthly:
    if name.lower() in saved_mention:
      continue

    # A Counter returns 0 for months in which the handle does not appear
    values = [co[name] for co in counter]

    # Keep only handles mentioned more than once over the whole period
    if sum(values) > 1:
      mentions.append((name, values, sum(values)))
      saved_mention.add(name.lower())

mentions = sorted(mentions)

In [None]:
fig = go.Figure()

for mention in mentions:
  # X lists the same months, in the same order, as the per-handle values
  fig.add_trace(go.Scatter(x=X, y=mention[1], name=mention[0], mode='lines'))

fig.update_layout(title='Mentions per month - @jairbolsonaro',)

fig.show()

## Which people the retweets came from?

In [None]:
all_mentions = []
for year_month in X:

  # Mentions inside retweets for this month
  mentions_ = df.loc[
                    (df.YearMonth == int(year_month.replace('-', ''))) 
                    & (df.mentions != '')
                    & (df.text.str.contains(r'\bRT @'))
                  ]['mentions'].values

  # Flatten the space-separated handles of each tweet into one list
  mentions = [mention for m in mentions_ for mention in m.split()]
  
  all_mentions.append(mentions)

In [None]:
counter = []
for mentions in all_mentions:
  counter.append(Counter(mentions))

In [None]:
mentions = []
# Lowercase handles that are excluded or already added
saved_mention = {'@jairbolsonaro'}

for monthly in counter:
  for name in monthly:
    if name.lower() in saved_mention:
      continue

    # A Counter returns 0 for months in which the handle does not appear
    values = [co[name] for co in counter]

    # Keep only handles that appear more than once over the whole period
    if sum(values) > 1:
      mentions.append((name, values, sum(values)))
      saved_mention.add(name.lower())

mentions = sorted(mentions)

In [None]:
fig = go.Figure()

for mention in mentions:
  # X lists the same months, in the same order, as the per-handle values
  fig.add_trace(go.Scatter(x=X, y=mention[1], name=mention[0], mode='lines'))

fig.update_layout(title='Mentions in retweets per month - @jairbolsonaro',)

fig.show()

## WordCloud

In [None]:
df['normalized'] = df['text'].apply(normalize_serie)

In [None]:
wordcloud = WordCloud(
    width=3000,
    height=2000,
    background_color='#073642',
    collocations=False,
    
).generate(' '.join(df['normalized'].values))

In [None]:
fig = plt.figure(
    figsize=(20, 15),
    facecolor='k',
    edgecolor='k'
)

plt.axis('off')
plt.tight_layout(pad=0)
plt.imshow(wordcloud.recolor(color_func=color_func, random_state=3),
           interpolation="bilinear")
plt.show()