# Bolsonaro, before running for president and after beginning the campaign
Some say that Bolsonaro changed a lot and became more moderate compared to his previous behaviour. Let's investigate that!

Jair Bolsonaro's candidacy for the 2018 presidential election was officially announced on July 22, 2018 (https://g1.globo.com/politica/eleicoes/2018/noticia/2018/07/22/psl-confirma-candidatura-de-jair-bolsonaro-a-presidencia-da-republica.ghtml), but he was clearly campaigning well before that, as some journalists have confirmed (https://especiais.gazetadopovo.com.br/eleicoes/2018/campanha-presidente-jair-bolsonaro-presidencial/).

I will compare Bolsonaro's tweets from before 2018 with those from 2018 onward (2018 is included in the "after" set).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))

In [None]:
import pandas as pd
import scattertext as st
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from IPython.display import Image

In [None]:
df = pd.read_csv('../input/jair-bolsonaro-twitter-data/bolsonaro_tweets.csv')
df.head()

In [None]:
df.tail()

In [None]:
df_before = df[df['date'] < '2018-01-01'].copy()
df_before.head()

In [None]:
df_after = df[df['date'] >= '2018-01-01'].copy()
df_after.tail()
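
The split above compares the `date` column as text, which only orders correctly when the dates are stored in ISO style (year first). The sketch below shows the pitfall and a parsed alternative. The format strings are assumptions, since the actual layout of the `date` column is not shown here; in the notebook itself, `pd.to_datetime` would play the same role as `datetime.strptime`.

```python
from datetime import datetime

CUTOFF = datetime(2018, 1, 1)

def is_after_cutoff(date_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse the date first, then compare real datetimes (fmt is an assumed layout)."""
    return datetime.strptime(date_text, fmt) >= CUTOFF

# ISO-style strings happen to sort correctly as plain text
print("2018-07-22 10:00:00" >= "2018-01-01")

# Day-first strings do not: this 2019 date compares as "before" the cutoff as text
print("01/02/2019" < "2018-01-01")

# Parsing removes the dependence on the text layout
print(is_after_cutoff("01/02/2019", fmt="%d/%m/%Y"))
```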

# Cleaning the text

Let's clean the text before building word clouds and using the scattertext library for visualization.

In [None]:
def clean_df(df_clean):
    remove_names = False # if True, assumes you have a nomes.txt file with common brazilian names in your current dir
    remove_usernames = False
    
    # Copy the original text for later metadata
    df_clean['original_text'] = df_clean['text']

    # Lower case
    df_clean['text'] = df_clean['text'].apply(
        lambda x: " ".join(x.lower() for x in x.split()))

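    # Remove usernames
    if remove_usernames:
        df_clean['text'] = df_clean['text'].str.replace(
            r'@[^\s]+', '', regex=True)

    # Remove links (note: this pattern also drops any text that follows the link)
    # regex=True is required on recent pandas, where str.replace is literal by default
    df_clean['text'] = df_clean['text'].str.replace(
        r'https?://.*[\r\n]*', '', regex=True)

    # Remove punctuation
    df_clean['text'] = df_clean['text'].str.replace(
        r'[^\w\s]', '', regex=True)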

    # Remove stopwords
    from nltk.corpus import stopwords
    stop = stopwords.words('portuguese')
    df_clean['text'] = df_clean['text'].apply(
        lambda x: " ".join(x for x in x.split() if x not in stop))

    # Remove common brazilian names
    if remove_names:
        nomes = pd.read_csv('nomes.txt', encoding='latin', header=None)
        lista_nomes = (nomes[0].str.lower()).tolist()
        df_clean['text'] = df_clean['text'].apply(lambda x: " ".join(
            x for x in x.split() if x not in lista_nomes))

    # Remove numbers
    df_clean['text'] = df_clean['text'].str.replace(
        r'\d+', '', regex=True)

    # Remove words with 1-3 chars
    df_clean['text'] = df_clean['text'].str.replace(
        r'\b(\w{1,3})\b', '', regex=True)

    # Replace accents and ç
    df_clean.text = df_clean.text.str.normalize('NFKD')\
        .str.encode('ascii', errors='ignore')\
        .str.decode('utf-8')
    
    return df_clean
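
To see what these steps do to a single tweet, here is a per-string sketch of the same kind of pipeline using only the standard library. It is an illustration, not the function above: `clean_text` is a hypothetical helper, the stopword step is left out (it needs the NLTK corpus), and the link pattern here removes only the URL itself.

```python
import re
import unicodedata

def clean_text(text):
    """Apply the cleaning steps to one string (stopword removal omitted)."""
    text = " ".join(text.lower().split())           # lower case, collapse whitespace
    text = re.sub(r"https?://\S+", "", text)        # remove links
    text = re.sub(r"[^\w\s]", "", text)             # remove punctuation
    text = re.sub(r"\d+", "", text)                 # remove numbers
    text = re.sub(r"\b\w{1,3}\b", "", text)         # remove words with 1-3 chars
    text = (unicodedata.normalize("NFKD", text)     # strip accents and cedilla
            .encode("ascii", errors="ignore")
            .decode("utf-8"))
    return " ".join(text.split())                   # tidy leftover spaces

print(clean_text("Forte abraço a todos! https://t.co/abc123 em 2019"))
```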

In [None]:
df_before = clean_df(df_before)
df_before.head()

In [None]:
df_after = clean_df(df_after)
df_after.head()

We see that some tweets ended up empty because they were just emoji. We won't bother dropping these rows, as our libraries won't take them into consideration anyway. A future idea would be to replace each emoji with a word that describes it.
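
One possible way to try the emoji idea, sketched with the standard library only: replace every character in the Unicode "Symbol, other" category with its official (English) Unicode name. `emoji_to_words` is a hypothetical helper, and the approach is an assumption rather than anything used in this kernel. It would need to run before the punctuation step, and it handles flags and other multi-character emoji poorly.

```python
import unicodedata

def emoji_to_words(text):
    """Replace symbol characters (most emoji) with their Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":          # "Symbol, other" covers most emoji
            name = unicodedata.name(ch, "")           # empty string if the name is unknown
            pieces.append(" " + name.lower().replace(" ", "_") + " " if name else " ")
        else:
            pieces.append(ch)
    return " ".join("".join(pieces).split())          # collapse the extra spaces

print(emoji_to_words("Brasil \U0001F44D"))
```

Joining the name with underscores keeps it as one token, so it survives the later punctuation and short-word steps.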

# Word clouds

We're going to use [this](https://github.com/amueller/word_cloud) word cloud library to provide a beautiful visualization. I will keep the background _white_ in the _before_ dataframe and **dark** in the **after** dataframe just to help us visualize.

In [None]:
text = " ".join(review for review in df_before.text)
wordcloud = WordCloud(
    width=3000,
    height=2000,
    background_color='white').generate(text)
fig = plt.figure(
    figsize=(40, 30))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
text = " ".join(review for review in df_after.text)
wordcloud = WordCloud(
    width=3000,
    height=2000,
    background_color='black').generate(text)
fig = plt.figure(
    figsize=(40, 30))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
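
A word cloud sizes words by how often they occur, so a plain frequency count is a quick way to read exact numbers off the same data. A minimal sketch with made-up sample texts; `top_terms` is a hypothetical helper, and in the notebook you would pass `df_before.text` or `df_after.text`. Note that the word cloud library can also group frequent word pairs, so the cloud and this count may not match exactly.

```python
from collections import Counter

def top_terms(texts, n=10):
    """Count whitespace-separated tokens across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up cleaned tweets, only for illustration
sample = ["forte abraco todos", "brasil acima tudo", "forte abraco brasil"]
print(top_terms(sample, n=3))
```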

# ScatterText

We will investigate the differences in the corpus using [this](https://github.com/JasonKessler/scattertext) scattertext library.

Let's prepare the data by combining the dataframes into one and creating a "metadata" column that will help us tell the tweets apart for further investigation.

In [None]:
df_before['metadata'] = df_before.date.map(str) + " | " + df_before.original_text
df_after['metadata'] = df_after.date.map(str) + " | " + df_after.original_text

df_before['category'] = 'Before'
df1 = df_before[['metadata', 'category', 'text']]

df_after['category'] = 'After'

df2 = df_after[['metadata', 'category', 'text']]

df_combined = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0

OK, now let's use the library to compare the two corpora. The code will generate an .html file in your current folder. I will post some screenshots, but I suggest that you access a live version [here](https://s3.amazonaws.com/scatter-bolsonaro-before-after-2018/bolsonaro_before_vs_after2018.html) and play around with it. The scattertext library is awesome!

In [None]:
corpus = (st.CorpusFromPandas(df_combined,
                                  category_col='category',
                                  text_col='text',
                                  nlp=st.whitespace_nlp_with_sentences)
              .build()
              .get_unigram_corpus()
              .compact(st.ClassPercentageCompactor(term_count=1,
                                                   term_ranker=st.OncePerDocFrequencyRanker)))
html = st.produce_characteristic_explorer(
    corpus,
    category='Before',
    category_name='Before',
    not_category_name='After',
    metadata=corpus.get_df()['metadata']
)
# Write the explorer to disk; the with-block makes sure the file is closed
with open('bolsonaro_before_vs_after2018.html', 'wb') as f:
    f.write(html.encode('utf-8'))

Below you can see the scatter plot generated by our code. The y-axis is the rank difference: words near the top were used more by Bolsonaro before 2018 and less after, while words near the bottom were used more by the current president from 2018 onward and less before. Words around the middle line were used about equally in both periods. The x-axis is the "characteristic to corpus" score, which reflects how frequently the words appear in this data.
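
To make the rank-difference idea concrete, here is a simplified sketch: rank every term by its count in each period (dense ranks scaled to at most 1), then subtract. This is only an illustration of the concept under my own assumptions, not scattertext's actual code, and the sample texts are made up.

```python
from collections import Counter

def dense_ranks(counts, vocabulary):
    """Dense rank of each term's count (ties share a rank), scaled so the top rank is 1."""
    distinct = sorted({counts.get(term, 0) for term in vocabulary})
    position = {value: i + 1 for i, value in enumerate(distinct)}
    return {term: position[counts.get(term, 0)] / len(distinct) for term in vocabulary}

def rank_difference(before_docs, after_docs):
    """Positive scores lean 'before', negative scores lean 'after'."""
    before = Counter(word for doc in before_docs for word in doc.split())
    after = Counter(word for doc in after_docs for word in doc.split())
    vocabulary = set(before) | set(after)
    rank_before = dense_ranks(before, vocabulary)
    rank_after = dense_ranks(after, vocabulary)
    return {term: rank_before[term] - rank_after[term] for term in vocabulary}

scores = rank_difference(
    ["camara deputados camara", "camara brasil"],   # made-up "before" tweets
    ["forte abraco", "forte brasil"],               # made-up "after" tweets
)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:+.2f}")
```

In this toy data "camara" ends up at the top, "forte" at the bottom, and "brasil", used equally in both periods, sits at zero.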

In [None]:
Image(filename='../input/screenshots/scatter.png')

The term "câmara" (Chamber of Deputies in Portuguese) was Bolsonaro's top term on Twitter before 2018, which is reasonable since he was a Federal Deputy for Rio de Janeiro from 1991 to 2018.

In [None]:
Image(filename='../input/screenshots/camara.png')

The term "forte" (strong in Portuguese) was the top term after 2018. One of the cool features of scattertext is that you can click on a word, or type it in the search box, to see how often it appears in each category and even where it appears, using the metadata we created.

Investigating this term further, we can see that it's almost always followed by "abraço" (hug in Portuguese), a characteristic expression used by Bolsonaro: "forte abraço!" (roughly "big hug!").

In [None]:
Image(filename='../input/screenshots/forte.png')

We can see that 'verdade' (truth in Portuguese) is the most characteristic word of this corpus.

In [None]:
Image(filename='../input/screenshots/verdade.png')

# What if we tried to create a model?
On Kaggle, no kernel is complete without some type of classification. Let's try to create a model that classifies a tweet into one of the two classes of our hypothesis. 
Let's prepare the data by assigning a target column with **0s** for the before-2018 class and **1s** for the after-2018 class.

In [None]:
# Removing the empty rows from our datasets
df1 = df1[df1['text'] != '']
df2 = df2[df2['text'] != '']

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
# 0 = before 2018, 1 = 2018 onward (3104 and 2133 rows when this was written)
y = np.append(np.zeros(len(df1)), np.ones(len(df2)))
y

In [None]:
text_array = np.append(df1['text'].values, df2['text'].values)
len(text_array)

In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1000) # keep only the 1000 most frequent words to reduce sparsity; dimensionality reduction would be another option
X = cv.fit_transform(text_array).toarray()

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

# Fitting a Random Forest Classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Evaluating our results and robustness of our model
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies)
print(accuracies.mean())
print(accuracies.std())

The accuracies are in the high 60s (percent), and the spread across folds is OK...
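
For context, it helps to know what a model that always predicts the larger class would score. A quick calculation, taking the class counts used when this notebook was written (3104 tweets before 2018 and 2133 from 2018 onward) and assuming the cross-validation folds keep roughly the same proportions:

```python
n_before = 3104   # class 0 rows at the time of writing
n_after = 2133    # class 1 rows at the time of writing

# Accuracy of always predicting the majority class
majority_baseline = max(n_before, n_after) / (n_before + n_after)
print(round(majority_baseline, 3))
```

That works out to roughly 59%, so accuracies in the high 60s are an improvement over the trivial baseline, but a modest one.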

# Testing with new tweets
Let's take new tweets from Bolsonaro and run our model on them. Will our model be able to predict that these were tweeted after 2018? 

I scraped all tweets from @jairbolsonaro from 2019-01-26 until 2019-03-07 and uploaded them here as a .csv file.

In [None]:
new_df = pd.read_csv('../input/jb-20190126-20190308csv/jb_20190126_20190308.csv')
new_df.head()

In [None]:
new_df.tail()

We need to preprocess our text as we did with our training data.

In [None]:
new_df = clean_df(new_df)
new_df.head()

In [None]:
new_df = new_df[new_df['text'] != '']
new_df.shape

In [None]:
text_array = new_df['text'].values
y = np.ones(len(text_array)) # every new tweet is from after 2018 (207 rows when this was written)

# Let's test our model!
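
One detail matters a lot at this step: the columns of a bag-of-words matrix are defined by the vocabulary learned when the vectorizer was fitted, so new text has to be encoded with that same fitted vectorizer. Refitting on the new tweets would give the columns different meanings from the ones the classifier was trained on. A small sketch with made-up texts shows the difference:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["brasil acima tudo", "camara deputados brasil"]  # made-up training data
new_texts = ["brasil forte"]                                    # made-up new tweet

vectorizer = CountVectorizer()
vectorizer.fit(train_texts)
train_vocab = dict(vectorizer.vocabulary_)        # term -> column index seen in training

# transform() reuses the training vocabulary; unseen words such as "forte" are ignored
X_new = vectorizer.transform(new_texts).toarray()

# Refitting on the new text builds a different vocabulary with different column positions
refit_vocab = dict(CountVectorizer().fit(new_texts).vocabulary_)

print(train_vocab["brasil"], refit_vocab["brasil"])
print(X_new.shape)
```

Here "brasil" sits in a different column after refitting, which is exactly the kind of mismatch a trained classifier cannot detect.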

In [None]:
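# Creating the Bag of Words with the vocabulary learned during training
# (transform, not fit_transform, so the columns match what the classifier saw)
X = cv.transform(text_array).toarray()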

y_pred = classifier.predict(X)

cm = confusion_matrix(y, y_pred)
print(cm)
from sklearn.metrics import accuracy_score
print(accuracy_score(y, y_pred))

The accuracy is close to our training set! Haha, at least we can say we beat chance!
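
A caveat when reading this number: every tweet in this new set has the label 1, so the accuracy is simply the share of tweets predicted as "after". A model that always answers "after" would score 100% here, which means this test alone says little about how well the two periods are separated. A tiny sketch with made-up predictions (`positive_rate` is a hypothetical helper):

```python
def positive_rate(predictions):
    """Share of predictions equal to 1; equals accuracy when every true label is 1."""
    return sum(1 for p in predictions if p == 1) / len(predictions)

demo_pred = [1, 0, 1, 1]        # made-up model output
always_after = [1, 1, 1, 1]     # a trivial model that never predicts "before"

print(positive_rate(demo_pred))
print(positive_rate(always_after))
```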

But... We need to check if our model predicted one thing correctly...

In [None]:
text_array[19]

In [None]:
y_pred[19]

=(

# Conclusion

I suggest you play around with the .html file and check your own hypotheses against the data, as we should always do!

## Model
Our model is undoubtedly production ready! /s

We could have also used grid search to fine tune the hyperparameters of our classifier and test other classification algorithms. 
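
A rough sketch of what that grid search could look like with scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the parameter grid are assumptions chosen only to keep the example small and runnable; in this kernel you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bag-of-words features (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid; the values are just a starting point
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                  # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```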

Jokes aside, NLP is fascinating and research on it is advancing quickly, especially with deep learning. I encourage you to try new techniques on this dataset.
