In this notebook we'll analyze the data we previously extracted and cleaned.

To do that, let's first import the libraries we'll be using and load our dataset.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
# We specify the columns to use, to avoid including the unnamed index column
df = pd.read_csv('data/processed_data.csv', usecols=['text', 'date', 'favorited', 'retweeted', 'replies', 'user', 'lang'])

In [5]:
df.head()

,text,date,favorited,retweeted,replies,user,lang
0,"Los ""pollos"" queremos un Presidente ""pollo"". U...",2018-01-27 22:24:45,629,336,85,708108228568207360,es
1,Vota por @IvanDuque en la consulta @CeDemocrat...,2018-01-27 21:51:25,793,535,136,149281495,es
2,Compartimos con alegría nuestra propuesta de p...,2018-01-27 21:50:15,188,119,4,77653794,es
3,.@FNAraujoR #4 Senado @IvanDuque #ElCandidato...,2018-01-27 21:46:13,34,27,2,1069678676,es
4,The girls flocking to see Mr. Duque. That's g...,2018-01-27 20:54:10,1,0,0,876674787115925504,en


In [8]:
# We convert the date column from strings to pandas datetime values
df['date'] = pd.to_datetime(df['date'])

In [10]:
# And we use it as the index for the dataset
df.set_index('date', inplace=True)
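
The loading, parsing, and indexing steps above can also be folded into a single `read_csv` call. Here is a minimal sketch, assuming the same column names as our dataset. The inline CSV text is made-up sample data standing in for `data/processed_data.csv`, and `sample_df` is just an illustrative name.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for data/processed_data.csv
sample_csv = io.StringIO(
    ",text,date,favorited,retweeted,replies,user,lang\n"
    "0,first tweet,2018-01-27 22:24:45,629,336,85,111,es\n"
    "1,second tweet,2018-01-27 21:51:25,793,535,136,222,es\n"
)

columns = ['text', 'date', 'favorited', 'retweeted', 'replies', 'user', 'lang']

sample_df = pd.read_csv(
    sample_csv,
    usecols=columns,       # skips the unnamed index column
    parse_dates=['date'],  # parse the date column while reading
    index_col='date',      # and use it as the index
)

print(sample_df.index.dtype)
print(sample_df.columns.tolist())
```

Doing it this way avoids mutating the frame after loading, although the step-by-step version above makes each transformation easier to inspect.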

And now we're ready to start exploring our data.

# Basic analysis

Let's look at some basic statistics first:

In [15]:
df.describe()

,favorited,retweeted,replies,user
count,27695.0,27695.0,27695.0,27695.0
mean,272.659289,141.743022,24.384871,1.101582e+17
std,818.115387,368.529151,91.420453,2.906389e+17
min,0.0,0.0,0.0,782076.0
25%,21.0,10.0,1.0,114577800.0
50%,59.0,34.0,3.0,284708700.0
75%,189.0,112.0,13.0,1668436000.0
max,28728.0,9909.0,3242.0,1.006637e+18


From this, we can see a couple of things. For example, the mean number of likes is strongly affected by outliers: the mean is about 273 likes, which is above the third quartile (189) and far above the median (59). In other words, a small number of heavily favorited tweets (the maximum is 28,728) is pulling the mean upward.

The same happens with the number of retweets, where the mean (about 142) is also above the third quartile (112).
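
One way to make this kind of skew explicit is to compare the mean against the median and third quartile, and to check what share of tweets actually sits above the mean. The sketch below uses a small made-up series (`likes`) rather than our real data; in this notebook, `df['favorited']` or `df['retweeted']` would take its place.

```python
import pandas as pd

# Hypothetical like counts: mostly small values plus one viral tweet
likes = pd.Series([2, 5, 8, 10, 12, 15, 20, 25, 30, 5000], name='favorited')

mean = likes.mean()
median = likes.median()
q3 = likes.quantile(0.75)

# Share of tweets whose like count exceeds the mean
share_above_mean = (likes > mean).mean()

# Positive skewness indicates a long right tail
skewness = likes.skew()

print(f"mean={mean:.1f}, median={median:.1f}, Q3={q3:.1f}")
print(f"share above mean: {share_above_mean:.0%}")
print(f"skewness: {skewness:.2f}")
```

When the mean lands above the third quartile, as it does here, fewer than a quarter of the observations are above the mean, which is why the median is usually the better summary of a "typical" tweet.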

Let's see the count of tweets in each of the languages Twitter identified:

In [30]:
df['lang'].value_counts()

es     24287
en      3058
und      287
pt        35
ca         8
fr         5
it         4
et         3
tl         3
in         2
no         1
ht         1
hu         1
Name: lang, dtype: int64

So according to this, there are tweets in several languages mentioning the presidential candidates. However, if we examine them, we'll see that many are misclassified.
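
A simple way to examine them is to filter out the two dominant languages and read the remaining tweets grouped by their language tag. The following is a rough sketch on a tiny made-up frame (`sample`, with invented texts); in this notebook, `df` would take its place.

```python
import pandas as pd

# Hypothetical rows; the texts are invented for illustration
sample = pd.DataFrame({
    'text': ['Vota por el candidato', 'Great rally today', 'jajaja',
             '#ElCandidato', 'Vamos!'],
    'lang': ['es', 'en', 'tl', 'und', 'pt'],
})

# Share of tweets per language rather than raw counts
lang_share = sample['lang'].value_counts(normalize=True)
print(lang_share)

# Keep only tweets tagged with a language other than the two dominant ones
rare = sample[~sample['lang'].isin(['es', 'en'])]

# Print the texts grouped by language tag so they can be read by eye
for lang, group in rare.groupby('lang'):
    print(lang, group['text'].tolist())
```

Very short tweets, or tweets made up mostly of hashtags and mentions, give a language detector little to work with, so reading the rare-language groups directly is a quick sanity check.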