# 2020 US Presidential Election
Today, we will visualise information about 27 of the US presidential candidates, such as their age and number of children. Then, we will use their tweets to build a classifier that predicts which candidate wrote a given tweet.

#### If you find my notebook helpful and enjoy it, please give it an upvote as it would help me make more of these.

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

## Putting together the dataset
The first step in visualising our data is cleaning it. To do this, we will create new columns for the candidates' year of birth and age, fill the missing values (NaN) in the 'announcement' column with the placeholder 'NaN 0', and replace the m's and f's in the 'sex' column with male and female.

In [None]:
info = pd.read_csv('../input/2020-united-states-presidential-election/candidates_info.csv')

info['year born'] = pd.to_datetime(info['born']).dt.year
info['announcement'] = info['announcement'].fillna('NaN 0')
info['sex'] = np.where(info['sex']=='m', 'male', 'female')
info['age'] = 2020-info['year born']
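
The cleaning steps above can be tried on a tiny made-up table. This is only a sketch: the rows are invented, the column names mirror the ones used in this notebook, and `Series.map` is swapped in for `np.where` as an assumption about what may be preferable, since it leaves unexpected codes as missing values instead of labelling them 'female'.

```python
import pandas as pd

# Made-up rows for illustration; not taken from candidates_info.csv
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'born': ['1950-03-01', '1975-11-20', '1982-06-15'],
    'announcement': ['January 12', None, 'April 3'],
    'sex': ['m', 'f', 'x'],
})

toy['year born'] = pd.to_datetime(toy['born']).dt.year
# Same placeholder as in the notebook, so that splitting on ' ' still works
toy['announcement'] = toy['announcement'].fillna('NaN 0')
# Codes other than 'm' and 'f' become NaN rather than 'female'
toy['sex'] = toy['sex'].map({'m': 'male', 'f': 'female'})
# Difference in calendar years; it ignores whether the birthday has passed
toy['age'] = 2020 - toy['year born']

print(toy[['name', 'sex', 'age', 'announcement']])
```

Note that the age computed this way can be off by one year for candidates whose birthday falls later in 2020.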

We will also create another column called 'announcement month', which stores the month in which each candidate made their announcement.

In [None]:
# The month is the first word of each announcement, e.g. 'January 12' -> 'January'
info['announcement month'] = info['announcement'].str.split(' ').str[0]
month = info['announcement month']

Here is the information that we are given about each candidate:

In [None]:
info

## Data visualisation

### Age of candidates
Firstly, we will use a bar chart to visualise the age of each candidate. The oldest person was 90 and the youngest 38.

In [None]:
# Keep only the name and age columns, oldest candidate first
name_and_age = info[['name', 'age']].sort_values('age', ascending=False)

plt.figure(figsize=(15, 5))
plt.bar(name_and_age['name'], name_and_age['age'], color='blue')
plt.title('Age of candidates')
plt.xlabel('Candidates')
plt.ylabel('Age')
plt.xticks(rotation=90)
plt.show()

### Number of children per candidate
The next plot is a pie chart showing what percentage of the candidates have a given number of children.

In [None]:
count = Counter(info['children'])

count['0 children'] = count.pop(0)
count['1 child'] = count.pop(1)
count['2 children'] = count.pop(2)
count['3 children'] = count.pop(3)
count['4 children'] = count.pop(4)
count['5 children'] = count.pop(5)

fig, ax = plt.subplots(1, 1, figsize=(7, 7))
ax.set_title('Number of children per candidate')
ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
plt.show()
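
The relabelling above lists every key from 0 to 5 by hand, which raises a `KeyError` if one of those values does not occur in the data. As a hedged alternative, the labels can be derived from whatever values are present. The helper name `label_child_counts` and the input values are made up for this sketch.

```python
from collections import Counter

def label_child_counts(children):
    """Count the values and relabel the keys as '0 children', '1 child', ..."""
    counts = Counter(int(n) for n in children)
    return {
        f"{n} {'child' if n == 1 else 'children'}": counts[n]
        for n in sorted(counts)
    }

# Invented values, not the real 'children' column
example = label_child_counts([2, 0, 1, 2, 3, 2])
print(example)
```

The resulting dictionary can be passed to `ax.pie` in the same way as `count` above, using its `values()` and `keys()`.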

### Candidates' state of origin
Next is a bar chart showing which state each candidate lives in. The most common state is New York, followed by a tie between Texas and California.

In [None]:
count = Counter(info['state of residence'])
count = pd.Series(count).sort_values(ascending=False)

plt.bar(count.keys(), count, color='green')
plt.title("Candidates' state of origin")
plt.xlabel('State')
plt.ylabel('Number of candidates')
plt.show()

### Sex of the candidates
Next, we use a pie chart to look at the share of male and female candidates in the presidential election. We can see that the number of male candidates is more than triple the number of female ones.

In [None]:
count = Counter(info['sex'])
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
ax.set_title('Sex of the candidates')
ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
plt.show()

### Months that candidates delivered their announcements
Next, we use a bar chart to show the months in which the candidates made their announcements. The most frequent months were January and April, with 6 announcements each, and the least frequent were November and August, with only one announcement each. However, the candidates who announced in those two months did so in 2017, while everybody else announced in 2019.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(7, 5))
count = Counter(month)
count = pd.Series(count).sort_values(ascending=False)

plt.bar(count.keys(), count, color='purple')
plt.title('Months that candidates delivered their announcements')
plt.xlabel('Month')
plt.ylabel('Number of candidates')
plt.show()

In [None]:
path = '../input/2020-united-states-presidential-election/twitter/'
text = []
twitter_names = []

for user in os.listdir(path):
    profile = pd.read_csv(path+user)
    text.append(profile['Text'])
    twitter_names.append(user[:-4])

### Most commonly used words in twitter
Now we switch over to Twitter and check which words each candidate uses most often.

In [None]:
fig, axes = plt.subplots(9, 3, figsize=(30, 30))
fig.subplots_adjust(hspace=0.3)

# Tokens left over from links and from the HTML entity '&amp;'
noise = {'co', 'amp'}

for ax, name, tweets in zip(axes.flat, twitter_names, text):
    # Join the tweets with spaces so that words of adjacent tweets do not merge
    words = ' '.join(tweets).split()
    # Drop links and exact noise tokens only; a substring test such as
    # "'co' not in word" would also discard words like 'country' or 'campaign'
    words = [word for word in words
             if 'http' not in word and word.lower().strip('&;') not in noise]

    wordcloud = WordCloud(background_color='white').generate(' '.join(words))
    ax.imshow(wordcloud)
    ax.set_title(name, size=25)
    ax.axis('off')

plt.show()

Here, we merge all of the candidates' tweets and find the most commonly used words overall.

In [None]:
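joined = ' '.join(list(np.concatenate(text).flat))
# Drop links and the exact tokens 'co' and 'amp' (from '&amp;'); a substring test
# such as "'co' not in word" would also discard words like 'country'
joined = [word for word in joined.split() if 'http' not in word and word.lower().strip('&;') not in ('co', 'amp')]
joined = ' '.join(joined)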

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
wordcloud = WordCloud(background_color='white').generate(joined)
ax.imshow(wordcloud)
ax.axis('off')
ax.set_title('All candidates', size=30)
plt.show()
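
To see why the choice of word filter matters for these clouds, here is a small sketch on invented tweets. It compares a substring test such as `'co' not in word` with an exact-token test. The helper names `substring_filter` and `token_filter` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Invented tweets for illustration
tweets = [
    'Our country needs a fair economy https://t.co/abc123',
    'Join the campaign today &amp; vote',
]
words = ' '.join(tweets).split()

def substring_filter(words):
    # Drops every word that merely contains 'http', 'co' or 'amp'
    return [w for w in words
            if 'http' not in w and 'co' not in w and 'amp' not in w]

def token_filter(words, noise=frozenset({'co', 'amp'})):
    # Drops links, and words equal to a noise token once '&' and ';' are stripped
    return [w for w in words
            if 'http' not in w and w.lower().strip('&;') not in noise]

print(substring_filter(words))
print(token_filter(words))
print(Counter(token_filter(words)).most_common(3))
```

The substring version silently removes ordinary words that happen to contain 'co' or 'amp', while the token version keeps them and still drops the link and the '&amp;' entity.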

### Twitter activity over the years
The next visualisations are 27 bar charts, one per candidate, showing how many of the tweets in our dataset were posted in each year. Note that the dataset contains only around 3,000 tweets per person, so it does not show each candidate's complete activity since their account was created.

To create the graphs, we first loop over the CSV files in the 'twitter' directory and store each candidate's name in a numpy array. Then we loop over the files again, read the creation date of every tweet from the 'Created At' column, and collect the months and years in the 'months' and 'years' lists.

In [None]:
names = np.array([])
months = []
years = []

for i in os.listdir(path): 
    names = np.append(names, pd.read_csv(path+i)['Name'][0])

for user in os.listdir(path):
    person = pd.read_csv(path+user)
    temp_month = []
    temp_year = []
    
    for date in person['Created At']:
        temp_month.append(date[4:7])
        temp_year.append(date[-4:])
    
    months.append(temp_month)
    years.append(temp_year)
    
counts = []
for user in years:
    count = Counter(user)
    counts.append(count)
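
As an alternative to slicing the date strings by position, the year and month can be parsed with `pd.to_datetime`. The sketch below assumes that the 'Created At' values follow the classic Twitter layout (for example `Wed Oct 10 20:19:24 +0000 2018`), which is what the slices `date[4:7]` and `date[-4:]` suggest. The sample strings are made up.

```python
from collections import Counter
import pandas as pd

# Made-up timestamps in the assumed layout
created_at = pd.Series([
    'Wed Oct 10 20:19:24 +0000 2018',
    'Tue Jan 01 09:00:00 +0000 2019',
    'Fri Mar 15 12:30:45 +0000 2019',
])

parsed = pd.to_datetime(created_at, format='%a %b %d %H:%M:%S %z %Y')

years = parsed.dt.year.tolist()
months = parsed.dt.strftime('%b').tolist()  # abbreviated month names

# Tweets per year, sorted chronologically
per_year = dict(sorted(Counter(years).items()))
print(per_year)
print(months)
```

Parsing fails loudly if a row does not match the expected layout, which positional slicing would not notice.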

In [None]:
fig, axes = plt.subplots(9, 3, figsize=(20, 30))
fig.subplots_adjust(wspace=0.3, hspace=0.6)

for ax, name, count in zip(axes.flat, names, counts):
    # Sort the years so that the x-axis runs chronologically
    ordered = sorted(count.items())
    ax.bar([year for year, _ in ordered], [n for _, n in ordered], color='blue')
    ax.set_title(name)
    ax.set_xlabel('Years')
    ax.set_ylabel('Number of tweets')

plt.show()

### Retweets in 2019 per candidate
Next, we will take a look at how many retweets each candidate received in 2019 and display the data with a bar chart.

In [None]:
retweets = np.array([])

for i, user in enumerate(os.listdir(path)):
    person = pd.read_csv(path+user)
    year = np.array(years[i]).astype(int)

    # Sum the retweets of the rows whose tweet was created in 2019,
    # independent of how the rows are ordered in the file
    retweets_in_2019 = person['Retweets'][year == 2019].sum()
    retweets = np.append(retweets, retweets_in_2019)

names_and_retweets = pd.concat([pd.Series(names), pd.Series(retweets)], axis=1)
names_and_retweets = names_and_retweets.sort_values(by=1, ascending=False).reset_index(drop=True)

As seen below, Donald Trump received about ten times as many retweets as the second-most retweeted candidate.

In [None]:
plt.figure(figsize=(15,5))
plt.bar(names_and_retweets[0], names_and_retweets[1], color='red')
plt.title('Retweets in 2019 per candidate')
plt.ylabel('Retweets')
plt.xlabel('Candidates')
plt.xticks(rotation=90)
plt.show()

Since the previous chart is dominated by a single account, we will draw another graph of the retweets each candidate received, this time without Donald Trump. This makes the differences between the other candidates easier to see.

In [None]:
djt = list(names_and_retweets[0]).index('Donald J. Trump')
names_and_retweets = names_and_retweets.drop(djt)

plt.figure(figsize=(15,5))
plt.bar(names_and_retweets[0], names_and_retweets[1], color='red')
plt.title('Retweets per candidate without Donald J. Trump')
plt.ylabel('Retweets')
plt.xlabel('Candidates')
plt.xticks(rotation=90)
plt.show()

## Predicting twitter accounts
Finally, we will create a classifier that predicts which candidate wrote a tweet, based on our Twitter datasets.

The first step is taking all of the users' data in the 'twitter' directory and uniting them all into one dataset called df, which is then shuffled randomly.

In [None]:
# DataFrame.append was removed in pandas 2.0, so the files are combined with pd.concat
df = pd.concat(
    [pd.read_csv(path+candidate) for candidate in os.listdir(path)],
    ignore_index=True,
)

# Shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)

We then extract X and y from df, where X holds the tweet texts and y holds the names of their authors. The y is encoded with a LabelEncoder, which turns the names into integer labels, and the X and y are split into train and test sets.

In [None]:
X, y = df['Text'], df['Name']

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
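
As a small illustration of what the encoder does, the sketch below fits a `LabelEncoder` on placeholder names and maps integer codes back with `inverse_transform`. The `stratify` and `random_state` arguments are not used in this notebook. They are shown only as an assumption about how one might keep the class proportions equal in both splits and make the split reproducible.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Placeholder names and texts, invented for this sketch
names = np.array(['Candidate A', 'Candidate B', 'Candidate C'] * 10)
texts = np.array([f'tweet {i}' for i in range(len(names))])

le = LabelEncoder()
y = le.fit_transform(names)  # classes are sorted, then numbered from 0
print(le.classes_)
print(le.inverse_transform([0, 2]))  # back from codes to names

X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=0
)
# Number of samples per class in each split
print(np.bincount(y_train), np.bincount(y_test))
```

The same `inverse_transform` call can be applied to a model's predictions to read them as candidate names instead of integers.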

X_train and X_test contain raw text, which cannot be fed directly into a classifier. Therefore, we use two Natural Language Processing methods to convert them into numerical features: Bag of Words (CountVectorizer) followed by TF-IDF weighting (TfidfTransformer). Both are fitted on the training set only and then applied to the test set.

In [None]:
cv = CountVectorizer()
tfidf = TfidfTransformer()

X_tr_cv = cv.fit_transform(X_train)
X_te_cv = cv.transform(X_test)

X_train = tfidf.fit_transform(X_tr_cv)
X_test = tfidf.transform(X_te_cv)
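
To make the two steps more concrete, here is a sketch on three invented sentences. It also checks that `TfidfVectorizer`, which scikit-learn documents as equivalent to `CountVectorizer` followed by `TfidfTransformer`, yields the same matrix. Using it is an alternative, not what this notebook does.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

# Invented sentences for illustration
corpus = [
    'we need jobs and healthcare',
    'jobs jobs jobs',
    'healthcare for all',
]

cv = CountVectorizer()
counts = cv.fit_transform(corpus)  # bag of words: one column per distinct word
print(sorted(cv.vocabulary_))
print(counts.toarray())

tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)  # re-weighted and L2-normalised per row
print(weights.toarray().round(2))

one_step = TfidfVectorizer().fit_transform(corpus)
same = np.allclose(weights.toarray(), one_step.toarray())
print(same)
```

Each row of the count matrix is one document, and words that appear in many documents receive a lower weight after the TF-IDF step.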

Next, we loop over three models: a linear SVC, a naive Bayes classifier and a passive aggressive classifier. For each one we compute the accuracy on the test set and a cross-validation score, so that we can compare their performance. Note that the cross-validation is run on the test split only, so each fold trains on far less data than the main fit. As seen below, the linear SVC performs best, followed by the passive aggressive classifier and then the naive Bayes classifier. Therefore, the SVC will be used in our final prediction.

In [None]:
for model in [LinearSVC(), MultinomialNB(), PassiveAggressiveClassifier()]:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    accuracy = model.score(X_test, y_test)
    cross_val = cross_val_score(model, X_test, y_test).mean()
    
    print(str(model)[:-2] + ' accuracy:', accuracy, 'cross val score:', cross_val)
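
One possible refinement, sketched below under stated assumptions, is to wrap the vectoriser and the classifier in a scikit-learn `Pipeline` and cross-validate on the training data, so that the vocabulary is re-fitted inside each fold and the test set stays untouched. The tiny corpus and the `cv=3` setting are made up for illustration and say nothing about the scores in this notebook.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Invented two-author corpus
texts = ['jobs and wages', 'wages for workers', 'jobs for workers',
         'raise wages now', 'workers need jobs', 'fair wages and jobs',
         'secure the border', 'border security first', 'strong border policy',
         'security and border', 'border policy now', 'first secure border']
labels = [0] * 6 + [1] * 6

# The vectoriser is fitted on the training part of each fold only
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores.mean())
```

With this layout, the held-out test set is used once at the end for the final accuracy, and the cross-validation score reflects models trained on most of the training data.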

At last, we create a Linear SVC model, fit it to the training data, and evaluate it with the accuracy and cross-validation score. In this run, the classifier reaches an accuracy of 67% and a cross-validation score of 55%.

In [None]:
model = LinearSVC()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
cross_val = cross_val_score(model, X_test, y_test).mean()

print('LinearSVC accuracy score: ' + str(round(accuracy*100, 2)) + '% cross val score: ' + str(round(cross_val*100, 2)) + '%')

#### Thank you for reading my notebook.
