# Predicting How Someone Will Vote on Basic Income

### The dataset
This study on basic income across Europe was conducted by Dalia Research in April 2016 across 28 EU member states. The sample of n = 9,649 was drawn from all 28 states, taking into account current population distributions with regard to age (14-65 years), gender, and region/country. The dataset is available on Kaggle: https://www.kaggle.com/daliaresearch/basic-income-survey-european-dataset/home

The dataset contains **9649 records and 15 columns**. These include demographics such as age, gender, education etc. It also includes opinions on the effects of a basic income on someone's work choices, their familiarity with the idea of a basic income, convincing arguments for and against a basic income - and of course, whether they ultimately approve or reject the idea.

Our **goal** in this notebook is **to predict how people are likely to vote**. The target variable originally consists of multiple classes; it is converted to a binary outcome, which leaves us with a typical classification task. Several classification models, namely Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM), are built, optimized, evaluated, and compared. Additionally, the data is balanced with the SMOTE algorithm to reduce bias in the predictions.

### The OSEMiN-approach

OSEMiN is an acronym that rhymes with “awesome” and stands for **Obtain, Scrub, Explore, Model, and iNterpret**. It can serve as a blueprint for working on data problems with machine learning tools. Preprocessing covers scrubbing (also called cleaning) and exploring the data; building, evaluating, and optimizing the model make up the machine learning part.

# Table of contents
<a id='Table of contents'></a>

### <a href='#1. Obtaining and Viewing the Data'>1. Obtaining and Viewing the Data</a>

### <a href='#2. Preprocessing the Data'>2. Preprocessing the Data</a>

* <a href='#2.1. Renaming Columns'>2.1. Renaming Columns</a>
* <a href='#2.2. Excluding Unrelated Data'>2.2. Excluding Unrelated Data</a>
* <a href='#2.3. Dealing with Misleading Data'>2.3. Dealing with Misleading Data</a>
* <a href='#2.4. Dealing with Missing Data'>2.4. Dealing with Missing Data</a>
* <a href='#2.5. Dealing with Duplicate Data'>2.5. Dealing with Duplicate Data</a>
* <a href='#2.6. Basic Feature Extraction and Engineering'>2.6. Basic Feature Extraction and Engineering</a>

### <a href='#3. Data Visualization'>3. Data Visualization</a>
* <a href='#3.1. Mosaic Plots'>3.1. Mosaic Plots</a>
* <a href='#3.2. Bar Charts'>3.2. Bar Charts</a>

### <a href='#4. Machine Learning'>4. Machine Learning</a>

* <a href='#4.1. Recoding Categorical Features'>4.1. Recoding Categorical Features</a>
* <a href='#4.2. Training a Logistic Regression'>4.2. Training a Logistic Regression</a>
* <a href='#4.3. Training a Random Forest Classifier'>4.3. Training a Random Forest Classifier</a>
* <a href='#4.4. Training an XGBoost Classifier'>4.4. Training an XGBoost Classifier</a>
* <a href='#4.5. Training a Support Vector Machine'>4.5. Training a Support Vector Machine</a>
* <a href='#4.6. Model Comparison'>4.6. Model Comparison</a>
* <a href='#4.7. Balancing the Data'>4.7. Balancing the Data</a>
* <a href='#4.8. Model Comparison II'>4.8. Model Comparison II</a>

### <a href='#5. Conclusions'>5. Conclusions</a>
* <a href='#5.1. Feature Importance'>5.1. Feature Importance</a>
* <a href='#5.2. Recommendation'>5.2. Recommendation</a>

### 1. Obtaining and Viewing the Data
<a id='1. Obtaining and Viewing the Data'></a>

Let's start by obtaining and investigating the pandas DataFrame:

In [None]:
import xgboost as xgb

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
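# the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')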
import seaborn as sns
from pprint import pprint

In [None]:
# reading in dataset and viewing it
df = pd.read_csv('../input/basic_income_dataset_dalia.csv')
df.head()

In [None]:
# get the number of rows and columns
print(df.shape)

# get datatype info
print()
print(df.info())

In [None]:
# get an overview of the numeric age column (.T transposes the DataFrame)
df.describe().T

In [None]:
# get an overview of all 14 object columns/features
df.describe(include='object').T

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 2. Preprocessing the Data
<a id='2. Preprocessing the Data'></a>

#### 2.1. Renaming Columns 
<a id='2.1. Renaming Columns'></a>

The column labels are quite wordy. Let's change that:

In [None]:
df.rename(columns = {'rural':'city_or_rural',
                     'dem_education_level':'education',
                     'dem_full_time_job':'full_time_job',
                     'dem_has_children':'has_children',
                     'question_bbi_2016wave4_basicincome_awareness':'awareness',
                     'question_bbi_2016wave4_basicincome_vote':'vote',
                     'question_bbi_2016wave4_basicincome_effect':'effect',
                     'question_bbi_2016wave4_basicincome_argumentsfor':'arg_for',
                     'question_bbi_2016wave4_basicincome_argumentsagainst':'arg_against'},
          inplace=True)

#### 2.2. Excluding Unrelated Data
<a id='2.2. Excluding Unrelated Data'></a>

Again, our target is to predict how people are likely to vote. Features should therefore be included only if they are plausibly related to the target variable; features that obviously have nothing to do with it should be excluded.

Both variables, the **uuid** and the **weight** (given to obtain census-representative results), are irrelevant for our classification task here. As we want to construct our own age groups later, we will also drop the predefined **age group**:

In [None]:
df.drop(['uuid', 'weight', 'age_group'], axis=1, inplace=True)

In [None]:
# new number of rows and columns
df.shape

#### 2.3. Dealing with Misleading Data
<a id='2.3. Dealing with Misleading Data'></a>

All the data makes perfect sense; there is nothing to correct.

#### 2.4. Dealing with Missing Data
<a id='2.4. Dealing with Missing Data'></a>

In [None]:
# checking how much total missing data we have
df.isna().sum()

In [None]:
# in percentage: 7%
round(df['education'].isna().sum() / len(df), 3)

In [None]:
df.education.unique()

In [None]:
df.education.value_counts()

663 missing values - that is of no small concern! My assumption is that these records mostly represent people with no formal education who were averse to disclosing that information, or who thought that giving no answer would be the right answer. I therefore fill the NaN's with **no** formal education:

In [None]:
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['education'] = df['education'].fillna('no')
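
How much this choice matters can be sketched on a toy column. The values below are made up for illustration and are not taken from the survey; the `'unknown'` label is a hypothetical alternative that keeps non-response as its own category instead of merging it into "no":

```python
import numpy as np
import pandas as pd

# hypothetical toy column, not survey data
toy = pd.Series(['high', np.nan, 'low', np.nan, 'medium'], name='education')

# choice made in this notebook: missing answers count as "no" formal education
as_no = toy.fillna('no')

# hypothetical alternative: keep non-response as a separate category
as_unknown = toy.fillna('unknown')

print(as_no.value_counts())
print(as_unknown.value_counts())
```

Either way no NaN remains; the difference is only whether a model can later tell non-response apart from a reported "no".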

In [None]:
df.isna().sum().sum()

In [None]:
df.education.value_counts()

In [None]:
# new number of rows and columns
df.shape

#### 2.5. Dealing with Duplicate Data
<a id='2.5. Dealing with Duplicate Data'></a>

In [None]:
# check if there are any duplicates
df.duplicated().sum()

Indeed, we have some duplicates, so let's drop them:

In [None]:
df.drop_duplicates(keep='first', inplace=True)
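
As a small sketch of what `duplicated()` and `drop_duplicates(keep='first')` do, consider the toy frame below (made-up rows). Because the **uuid** column was dropped in 2.2, rows count as duplicates as soon as all remaining columns match, so two different respondents with identical answers would be flagged as well:

```python
import pandas as pd

# hypothetical toy frame without an id column
toy = pd.DataFrame({'gender': ['male', 'female', 'male'],
                    'vote':   ['for', 'against', 'for']})

# True for the second and any later copy of an identical row
dup_mask = toy.duplicated()

# keep the first occurrence of each row, drop the repeats
deduped = toy.drop_duplicates(keep='first')

print(dup_mask.tolist())
print(deduped)
```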

In [None]:
# final number of rows and columns
df.shape

#### 2.6. Basic Feature Extraction and Engineering
<a id='2.6. Basic Feature Extraction and Engineering'></a>

* Someone who "probably votes for" basic income, will vote the same way as someone who "votes for it" - namely with "yes". The same holds for rejection. Our target is to first predict whether someone is for or against basic income. We're not particularly interested in someone who has no opinion and/or won't vote, so let's **simplify our target** and recode the answers to drop all records that won't take any clear action:

In [None]:
# recode voting
def vote_coding(row):
    if row == 'I would vote for it' : return('for')
    elif row == 'I would probably vote for it': return('for')
    elif row == 'I would vote against it': return('against')
    elif row == 'I would probably vote against it': return('against')
    elif row == 'I would not vote': return('no_action')

# apply function
df['vote'] = df['vote'].apply(vote_coding)
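
The same recoding can also be sketched with a dictionary and `Series.map`, which some readers may find easier to scan than an if/elif chain. The name `vote_map` and the toy series are introduced here for illustration only:

```python
import pandas as pd

# hypothetical lookup table mirroring vote_coding above
vote_map = {
    'I would vote for it': 'for',
    'I would probably vote for it': 'for',
    'I would vote against it': 'against',
    'I would probably vote against it': 'against',
    'I would not vote': 'no_action',
}

# toy answers, not survey data
toy = pd.Series(['I would vote for it',
                 'I would not vote',
                 'I would probably vote against it'])

# answers that are missing from the dictionary would become NaN
recoded = toy.map(vote_map)
print(recoded.tolist())
```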

In [None]:
# drop all records who are not "for" or "against"
df = df.query("vote != 'no_action'")

In [None]:
df.vote.value_counts(normalize=True)

* Another two columns, **"awareness" and "effect"**, contain whole sentences. Let's shorten each answer to a single label so that it is easier to process later:

In [None]:
def awareness_coding(row):
    if row == 'I understand it fully': return('fully')
    elif row == 'I know something about it': return('something')
    elif row == 'I have heard just a little about it': return('little')
    elif row == 'I have never heard of it': return('nothing')

df['awareness'] = df['awareness'].apply(awareness_coding)

In [None]:
def effect_coding(row):
    if row == '‰Û_ stop working': return('stop_working')
    elif row == '‰Û_ work less': return('work_less')
    elif row == '‰Û_ do more volunteering work': return('volunteering_work')
    elif row == '‰Û_ spend more time with my family': return('more_family_time')
    elif row == '‰Û_ look for a different job': return('different_job')
    elif row == '‰Û_ work as a freelancer': return('freelancer')
    elif row == '‰Û_ gain additional skills': return('additional_skills')
    elif row == 'A basic income would not affect my work choices': return('no_effect')
    else: return('none_of_the_above')
    
df['effect'] = df['effect'].apply(effect_coding).astype(str)

* Next, let's build **new age groups** along the quintiles (the 20th, 40th, 60th, and 80th percentiles), and then drop the numeric "age" column:

In [None]:
df.age.describe(percentiles=[.2, .4, .6, .8])

In [None]:
def age_groups(row):
    if row <= 26: return('14_26')
    elif row <= 35: return('27_35')
    elif row <= 42: return('36_42')
    elif row <= 49: return('43_49')
    else: return('above_50')
    
df['age_group'] = df['age'].apply(age_groups)
df.drop(['age'], axis=1, inplace=True)
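
The binning above could also be sketched with `pd.cut`, which assigns each value to a labelled interval. The ages below are made up, and the bin edges simply restate the thresholds used in `age_groups`:

```python
import numpy as np
import pandas as pd

# hypothetical ages, chosen to hit every boundary
ages = pd.Series([14, 26, 27, 35, 42, 43, 49, 50, 65])

# right-inclusive intervals: (13, 26], (26, 35], (35, 42], (42, 49], (49, inf]
bins = [13, 26, 35, 42, 49, np.inf]
labels = ['14_26', '27_35', '36_42', '43_49', 'above_50']

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```

Note that the last group starts at age 50, so the label `above_50` is to be read as "50 and above".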

In [None]:
df['age_group'].value_counts(normalize=True).plot(kind='barh', figsize=(8,4));

* Lastly, let's extract the **2 or 3 most mentioned arguments PRO** and the **2 or 3 most mentioned arguments CONTRA** a basic income, and build new columns with boolean values:

In [None]:
arg_for = ['It reduces anxiety about financing basic needs',
           'It creates more equality of opportunity',
           'It encourages financial independence and self-responsibility',
           'It increases solidarity, because it is funded by everyone',
           'It reduces bureaucracy and administrative expenses',
           'It increases appreciation for household work and volunteering',
           'None of the above']

# count all arguments
counter = [0,0,0,0,0,0,0]

for row in df.iterrows():
    for i in range(0, len(arg_for)):
        if arg_for[i] in row[1]['arg_for'].split('|'):
            counter[i] = counter[i] + 1

# create a new dictionary 
dict_keys = ['less anxiety', 'more equality', 'financial independence', 
             'more solidarity', 'less bureaucracy', 'appreciates volunteering', 'none']

arg_dict = {}

for i in range(0, len(arg_for)):
    arg_dict[dict_keys[i]] = counter[i]

# sub-df for counted arguments
sub_df = pd.DataFrame(list(arg_dict.items()), columns=['Arguments PRO basic income', 'count'])

# plot
sub_df.sort_values(by=['count'], ascending=True).plot(kind='barh', x='Arguments PRO basic income', y='count',  
                                                      figsize=(10,6), legend=False, color='darkgrey',
                                                      title='Arguments PRO basic income')
plt.xlabel('Count'); 

In [None]:
df['less_anxiety'] = df['arg_for'].str.contains('anxiety')
df['more_equality'] = df['arg_for'].str.contains('equality')
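
Counting the pipe-separated arguments can also be sketched without an explicit loop, using `Series.str.get_dummies`, which splits each cell on a separator and returns one 0/1 column per distinct value. The toy answers below are made up:

```python
import pandas as pd

# hypothetical multi-select answers, separated by '|'
toy = pd.Series([
    'It reduces anxiety about financing basic needs|It creates more equality of opportunity',
    'None of the above',
    'It creates more equality of opportunity',
])

# one indicator column per distinct argument
indicators = toy.str.get_dummies(sep='|')

# how often each argument was selected
counts = indicators.sum().sort_values(ascending=False)
print(counts)
```

Unlike `str.contains`, which matches any substring, this approach matches whole arguments only.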

In [None]:
arg_against = ['It is impossible to finance', 'It might encourage people to stop working',
               'Foreigners might come to my country and take advantage of the benefit',
               'It is against the principle of linking merit and reward', 
               'Only the people who need it most should get something from the state',
               'It increases dependence on the state', 'None of the above']

# count all arguments
counter = [0,0,0,0,0,0,0]

for row in df.iterrows():
    for i in range(0, len(arg_against)):
        if arg_against[i] in row[1]['arg_against'].split('|'):
            counter[i] = counter[i] + 1

# create a new dictionary 
dict_keys = ['impossible to finance', 'people stop working', 'foreigners take advantage', 
             'against meritocracy', 'only for people in need', 'more dependence on state', 'none']

arg_dict = {}

for i in range(0, len(arg_against)):
    arg_dict[dict_keys[i]] = counter[i]

# sub-df for counted arguments
sub_df = pd.DataFrame(list(arg_dict.items()), columns=['Arguments AGAINST basic income', 'count'])

# plot
sub_df.sort_values(by=['count'], ascending=True).plot(kind='barh', x='Arguments AGAINST basic income', y='count',  
                                                      figsize=(10,6), legend=False, color='darkgrey',
                                                      title='Arguments AGAINST basic income')
plt.xlabel('Count'); 

In [None]:
df['in_need'] = df['arg_against'].str.contains('need')
df['stop_working'] = df['arg_against'].str.contains('stop working')
df['too_costly'] = df['arg_against'].str.contains('impossible')

In [None]:
df.drop(['arg_for', 'arg_against'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.shape

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 3. Data Visualization
<a id='3. Data Visualization'></a>

Data visualization is an important step that lies between preprocessing and model building. It serves as a sanity check for the features and target, and may help explore the relationship between both. This will guide us in model building, and assist us in our understanding of the model and predictions. The target is what we are asked to predict: a "yes" to basic income, or a "no". 

**We should first examine the number of votes that fall into each category.**

In [None]:
df['vote'].value_counts(normalize=True).plot(kind='barh', figsize=(8,4), 
                                             color=['maroon','midnightblue']);

By looking at the number of records we have in each class, we see that roughly 70% of the votes are for a basic income, as opposed to 30% against.
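
This imbalance sets a baseline that any model has to beat. As a minimal sketch on made-up labels with an assumed exact 70/30 split, a classifier that always predicts the majority class already reaches about 70% accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# hypothetical labels with a 70/30 split, not the survey data
y = np.array(['for'] * 70 + ['against'] * 30)
X = np.zeros((len(y), 1))  # features are ignored by this baseline

# always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(baseline_acc)
```

Accuracy values close to the majority share should therefore be read with care.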

#### 3.1. Mosaic Plots
<a id='3.1. Mosaic Plots'></a>

In [None]:
from statsmodels.graphics.mosaicplot import mosaic

mosaic(df, ['gender', 'vote'], gap=0.015, title='Vote vs. Gender - Mosaic Chart');

In [None]:
mosaic(df, ['city_or_rural', 'vote'], gap=0.015, title='Vote vs. Area - Mosaic Chart');

In [None]:
mosaic(df, ['full_time_job', 'vote'], gap=0.015, 
       title='Vote vs. Having a Full Time Job or not - Mosaic Chart');

In [None]:
mosaic(df, ['has_children', 'vote'], gap=0.015, 
       title='Vote vs. Having children or not - Mosaic Chart');

#### 3.2. Bar Charts
<a id='3.2. Bar Charts'></a>

##### Vote vs. Full Time Job

In [None]:
# Votes depending on having a full-time-job

sub_df = df.groupby('full_time_job')['vote'].value_counts(normalize=True).unstack()
sub_df.plot(kind='bar', color=['midnightblue', 'maroon'], figsize=(7,4))
plt.xlabel("Full Time Job")
plt.xticks(rotation=0)
plt.ylabel("Percentage of Voters\n")
plt.title('\nVote depending on having a full-time-job\n', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.2, 1.0), title='Vote');

In [None]:
# Votes in GERMANY and GREECE - depending on having a full-time-job

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,5))

# create sub-df for Germany
sub_df_1 = df[df['country_code']=='DE'].groupby('full_time_job')['vote'].value_counts(normalize=True).unstack()
sub_df_1.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax1, legend=False)
ax1.set_title('\nVotes in GERMANY depending on having a full-time-job\n', fontsize=14, fontweight='bold')
ax1.set_xlabel("Full Time Job")
ax1.set_xticklabels(labels=['No', 'Yes'], rotation=0)
ax1.set_ylabel("Percentage of Voters\n")

# create sub-df for Greece
sub_df_2 = df[df['country_code']=='GR'].groupby('full_time_job')['vote'].value_counts(normalize=True).unstack()
sub_df_2.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax2, legend=False)
ax2.set_title('\nVotes in GREECE depending on having a full-time-job\n', fontsize=14, fontweight='bold')
ax2.set_xlabel("Full Time Job")
ax2.set_xticklabels(labels=['No', 'Yes'], rotation=0)
ax2.set_ylabel("Percentage of Voters\n")

# create one legend
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(0.84, 0.85))
plt.show();

##### Vote vs. Education Level

In [None]:
# Votes depending on education level

sub_df = df.groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df.plot(kind='bar', color = ['midnightblue','maroon'], figsize=(12,5))
plt.xlabel("Education Level")
plt.xticks(rotation=0)
plt.ylabel("Percentage of Voters\n")
plt.title('\nVote depending on education level\n', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.15, 1), title='Vote');

In [None]:
# Votes in GERMANY and GREECE - depending on education level

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))

# create sub-df for Germany
sub_df_1 = df[df['country_code']=='DE'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_1.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax1, legend=False)
ax1.set_title('\nVotes in GERMANY depending on education level\n', fontsize=14, fontweight='bold')
ax1.set_xlabel("Education Level")
ax1.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)
ax1.set_ylabel("Percentage of Voters\n")

# create df for Greece
sub_df_2 = df[df['country_code']=='GR'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_2.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax2, legend=False)
ax2.set_title('\nVotes in GREECE depending on education level\n', fontsize=14, fontweight='bold')
ax2.set_xlabel("Education Level")
ax2.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)

# create one legend
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(0.83, 0.85))
plt.show();

In [None]:
# Votes in 4 countries - depending on education level

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14,10))

# create sub-df for Germany
sub_df_1 = df[df['country_code']=='DE'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_1.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax1, legend=False)
ax1.set_title('\nVotes in GERMANY depending on education level\n', fontsize=14, fontweight='bold')
ax1.set_xlabel("Education Level")
ax1.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)
ax1.set_ylabel("Percentage of Voters\n")

# create sub-df for France
sub_df_2 = df[df['country_code']=='FR'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_2.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax2, legend=False)
ax2.set_title('\nVotes in France depending on education level\n', fontsize=14, fontweight='bold')
ax2.set_xlabel("Education Level")
ax2.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)
ax2.set_ylabel("Percentage of Voters\n")

# create sub-df for Italy
sub_df_3 = df[df['country_code']=='IT'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_3.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax3, legend=False)
ax3.set_title('\nVotes in Italy depending on education level\n', fontsize=14, fontweight='bold')
ax3.set_xlabel("Education Level")
ax3.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)

# create sub-df for Slovakia
sub_df_4 = df[df['country_code']=='SK'].groupby('education')['vote'].value_counts(normalize=True).unstack()
sub_df_4.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax4, legend=False)
ax4.set_title('\nVotes in Slovakia depending on education level\n', fontsize=14, fontweight='bold')
ax4.set_xlabel("Education Level")
ax4.set_xticklabels(labels=['High', 'Low', 'Medium', 'No'], rotation=0)

# create only one legend
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(1.0, 0.95))
plt.tight_layout()
plt.show();

##### Vote vs. Awareness

In [None]:
# Votes depending on awareness

sub_df = df.groupby('awareness')['vote'].value_counts(normalize=True).unstack()
sub_df.plot(kind='bar', color=['midnightblue', 'maroon'], figsize=(7,4))
plt.xlabel("Awareness")
plt.xticks(rotation=0)
plt.ylabel("Percentage of Voters\n")
plt.title('\nVote depending on awareness\n', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.2, 1.0), title='Vote');

##### Vote vs. age group

In [None]:
# Votes depending on age

sub_df = df.groupby('age_group')['vote'].value_counts(normalize=True).unstack()
sub_df.plot(kind='bar', color=['midnightblue', 'maroon'], figsize=(9,5))
plt.xlabel("Age Group")
plt.xticks(rotation=0)
plt.ylabel("Percentage of Voters\n")
plt.title('\nVote depending on age\n', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.2, 1.0), title='Vote');

##### Vote vs. expected effects

In [None]:
# Votes depending on effect

sub_df = df.groupby('effect')['vote'].value_counts(normalize=True).unstack()
sub_df.plot(kind='bar', color=['midnightblue', 'maroon'], figsize=(14,5))
plt.xlabel("\nEffect of Basic Income")
plt.xticks(rotation=0)
plt.ylabel("Percentage of Voters\n")
plt.title('\nVote depending on effect\n', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.1, 1.0), title='Vote');

##### Vote vs. arguments

In [None]:
# plot votes depending on agreement with 4 arguments for/against basic income

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16,12))

# create sub-df for those who agree/disagree with the argument:
# "It reduces anxiety about financing basic needs"
sub_df_1 = df.groupby('less_anxiety')['vote'].value_counts(normalize=True).unstack()
sub_df_1.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax1, legend=False)
ax1.set_title('\nVotes depending on attitude towards reducing_anxiety\n', fontsize=14, fontweight='bold')
ax1.set_xlabel('"It reduces anxiety about financing basic needs"')
ax1.set_xticklabels(labels=['False', 'True'], rotation=0)
ax1.set_ylabel("Percentage of Voters\n")

# create sub-df for those who agree/disagree with the argument:
# "It creates more equality of opportunity"
sub_df_2 = df.groupby('more_equality')['vote'].value_counts(normalize=True).unstack()
sub_df_2.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax2, legend=False)
ax2.set_title('\nVotes depending on attitude towards more_equality\n', fontsize=14, fontweight='bold')
ax2.set_xlabel('"It creates more equality of opportunity"')
ax2.set_xticklabels(labels=['False', 'True'], rotation=0)

# create sub-df for those who agree/disagree with the argument:
# "It might encourage people to stop working"
sub_df_3 = df.groupby('stop_working')['vote'].value_counts(normalize=True).unstack()
sub_df_3.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax3, legend=False)
ax3.set_title('\nVotes depending on attitude towards people_stop_working\n', fontsize=14, fontweight='bold')
ax3.set_xlabel('"It might encourage people to stop working"')
ax3.set_xticklabels(labels=['False', 'True'], rotation=0)
ax3.set_ylabel("Percentage of Voters\n")

# create sub-df for those who agree/disagree with the argument:
# "Only the people who need it most should get something from the state"
sub_df_4 = df.groupby('in_need')['vote'].value_counts(normalize=True).unstack()
sub_df_4.plot(kind='bar', color = ['midnightblue', 'maroon'], ax=ax4, legend=False)
ax4.set_title('\nVotes depending on attitude towards only_for_people_in_need\n', fontsize=14, fontweight='bold')
ax4.set_xlabel('"Only the people who need it most should get something from the state"')
ax4.set_xticklabels(labels=['False', 'True'], rotation=0)

# create only one legend
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(1.0, 0.95))
plt.tight_layout()
plt.show();

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 4. Machine Learning
<a id='4. Machine Learning'></a>

#### 4.1. Recoding Categorical Features
<a id='4.1. Recoding Categorical Features'></a>

Machine learning algorithms generally need all data, including categorical data, in numeric form, so we have to represent the categorical features as numbers before handing them to a model. A common way is **one-hot encoding**, which creates a new binary column (a dummy variable) for each unique category of a categorical variable. Each observation receives a 1 in the column for its category and a 0 in all other new columns. To conduct one-hot encoding, we use the **pandas get_dummies function.**

In [None]:
# define our features 
features = df.drop(["vote"], axis=1)

# define our target
target = df[["vote"]]

# create dummy variables
features = pd.get_dummies(features)
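
What `get_dummies` does can be seen on a small made-up frame. By default only object, string, and categorical columns are expanded, so a boolean column such as the argument flags built in 2.6 is passed through unchanged:

```python
import pandas as pd

# hypothetical toy frame with one categorical and one boolean column
toy = pd.DataFrame({'gender': ['male', 'female', 'male'],
                    'less_anxiety': [True, False, True]})

# 'gender' becomes gender_female / gender_male, 'less_anxiety' is left as is
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```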

In [None]:
print(features.shape)
features.tail(2)

In [None]:
print(target.shape)

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.2. Training a Logistic Regression
<a id='4.2. Training a Logistic Regression'></a>

When approaching a supervised learning problem like ours, we should always use multiple algorithms and compare the performances of the various models. Sometimes simplest is best, and so we will start by applying logistic regression. Logistic regression uses the logistic function to estimate the probability that a given data point belongs to a given class. Once we have more models, we can compare them based on a few performance metrics.
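
As a minimal sketch of the logistic function itself (the scores below are made-up numbers standing in for a linear combination of the features), it maps any real-valued score to a probability between 0 and 1, and the corresponding odds are p / (1 - p):

```python
import numpy as np

def logistic(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical linear scores, e.g. w·x + b for three data points
scores = np.array([-2.0, 0.0, 2.0])

probs = logistic(scores)
odds = probs / (1 - probs)  # equals exp(score)

print(probs)
print(odds)
```

A score of 0 corresponds to a probability of 0.5, which is the default decision threshold between the two classes.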

Before we start, let's prepare our work and import all the libraries we need for classifying our data:

In [None]:
# import train_test_split function
from sklearn.model_selection import train_test_split

# import LogisticRegression
from sklearn.linear_model import LogisticRegression

# import metrics
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# split our data (fixed random_state so that the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=4)
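
With imbalanced classes, a purely random split can shift the class shares between training and test set. One option, sketched here on made-up labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. This is an illustration, not what the notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical labels with a 70/30 split, not the survey data
y = np.array(['for'] * 70 + ['against'] * 30)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 70/30 ratio in both the training and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

test_share_for = (y_te == 'for').mean()
print(len(y_te), test_share_for)
```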

In [None]:
# instantiate the logistic regression
logreg = LogisticRegression()

# train
logreg.fit(X_train, y_train)

# predict
train_preds = logreg.predict(X_train)
test_preds = logreg.predict(X_test)

# evaluate
train_accuracy_logreg = accuracy_score(y_train, train_preds)
test_accuracy_logreg = accuracy_score(y_test, test_preds)
report_logreg = classification_report(y_test, test_preds)

print("Logistic Regression")
print("------------------------")
print(f"Training Accuracy: {(train_accuracy_logreg * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_logreg * 100):.4}%")

# store accuracy in a new dataframe
score_logreg = ['Logistic Regression', train_accuracy_logreg, test_accuracy_logreg]
models = pd.DataFrame([score_logreg])

#### 4.3. Training a Random Forest Classifier
<a id='4.3. Training a Random Forest Classifier'></a>

Next, let's run a Random Forest Classifier with predefined specifications or "hyperparameters". Some of the important ones to tune for a Random Forest are:

* n_estimators = number of trees
* criterion = splitting criterion (for maximizing the information gain from each split)
* max_features = max number of features considered for splitting a node
* max_depth = max number of levels in each decision tree
* min_samples_split = min samples needed to make a split

In [None]:
# import random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
# create a baseline
forest = RandomForestClassifier()

In [None]:
# create Grid              
param_grid = {'n_estimators': [80, 100, 120],
              'criterion': ['gini', 'entropy'],
              'max_features': [5, 7, 9],         
              'max_depth': [5, 8, 10], 
              'min_samples_split': [2, 3, 4]}

# instantiate the tuned random forest
forest_grid_search = GridSearchCV(forest, param_grid, cv=3, n_jobs=-1)

# train the tuned random forest
forest_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(forest_grid_search.best_params_)
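
To get a feeling for the cost of this search, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats the grid from above; `n_candidates` and `n_fits` are names introduced here for illustration:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [80, 100, 120],
              'criterion': ['gini', 'entropy'],
              'max_features': [5, 7, 9],
              'max_depth': [5, 8, 10],
              'min_samples_split': [2, 3, 4]}

# every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# with cv=3, each candidate is trained and scored three times
n_fits = n_candidates * 3

print(n_candidates, n_fits)
```

Each additional value in any list multiplies the number of fits, which is why the grid is kept small.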

In [None]:
# instantiate the tuned random forest with the best found parameters
# here I use the parameters that GridSearch originally returned in the first round
forest = RandomForestClassifier(n_estimators=120, criterion='gini', max_features=9, 
                                max_depth=10, min_samples_split=4, random_state=4)

# train the random forest
forest.fit(X_train, y_train)

# predict
train_preds = forest.predict(X_train)
test_preds = forest.predict(X_test)

# evaluate
train_accuracy_forest = accuracy_score(y_train, train_preds)
test_accuracy_forest = accuracy_score(y_test, test_preds)
report_forest = classification_report(y_test, test_preds)

print("Random Forest")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_forest * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_forest * 100):.4}%")

# append accuracy score to our dataframe
score_forest = ['Random Forest', train_accuracy_forest, test_accuracy_forest]
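# DataFrame.append was removed in pandas 2.0, so we concatenate instead
models = pd.concat([models, pd.DataFrame([score_forest])], ignore_index=True)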

#### 4.4. Training an XGBoost Classifier
<a id='4.4. Training an XGBoost Classifier'></a>

Gradient boosting is one of the most widely used techniques for tabular data. The term refers to a class of algorithms rather than a single one: trees are added sequentially, each one fitted to correct the errors of the ensemble built so far. XGBoost, short for eXtreme Gradient Boosting, is a popular and efficient implementation that often performs very well on classification tasks like ours.

Some of the important hyperparameters to tune for an XGBoost are:

* n_estimators = number of trees
* learning_rate = step size shrinkage: the contribution of each newly added tree is scaled by this factor, so smaller values learn more slowly but tend to be more robust to overfitting
* max_depth = max number of levels in each decision tree
* colsample_bytree = fraction of features sampled when building each tree (similar in spirit to max_features)
* gamma = specifies the minimum loss reduction required to make a split
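
The role of learning_rate can be written down compactly. As a sketch in standard gradient boosting notation (the symbols $F_m$, $f_m$, and $\eta$ are introduced here for illustration and do not appear elsewhere in this notebook), the ensemble after round $m$ is the previous ensemble plus the new tree, scaled by the learning rate:

$$
F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \qquad 0 < \eta \le 1
$$

Here $F_{m-1}$ is the model after $m-1$ rounds, $f_m$ is the tree fitted in round $m$, and $\eta$ corresponds to learning_rate. With $\eta = 0.1$, each tree contributes only a tenth of its raw prediction, so more trees (n_estimators) are typically needed to reach the same fit.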

In [None]:
# create a baseline
booster = xgb.XGBClassifier()

In [None]:
# create Grid
param_grid = {'n_estimators': [100],
              'learning_rate': [0.05, 0.1], 
              'max_depth': [3, 5, 10],
              'colsample_bytree': [0.7, 1],
              'gamma': [0.0, 0.1, 0.2]}

# set up the grid search for XGBoost
booster_grid_search = GridSearchCV(booster, param_grid, scoring='accuracy', cv=3, n_jobs=-1)

# run the grid search on the training data
booster_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(booster_grid_search.best_params_)

In [None]:
# instantiate tuned xgboost
booster = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100,
                            colsample_bytree=0.7, gamma=0.1, random_state=4)

# train
booster.fit(X_train, y_train)

# predict
train_preds = booster.predict(X_train)
test_preds = booster.predict(X_test)

# evaluate
train_accuracy_booster = accuracy_score(y_train, train_preds)
test_accuracy_booster = accuracy_score(y_test, test_preds)
report_booster = classification_report(y_test, test_preds)

print("XGBoost")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_booster * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_booster * 100):.4}%")

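# append accuracy score to our dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
score_booster = ['XGBoost', train_accuracy_booster, test_accuracy_booster]
models = pd.concat([models, pd.DataFrame([score_booster])], ignore_index=True)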

#### 4.5. Training a Support Vector Machine
<a id='4.5. Training a Support Vector Machine'></a>

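Another popular classification technique is the Support Vector Machine (SVM). The idea behind SVMs is to classify by finding the separating line, or more generally the "hyperplane", that best separates the two classes, i.e. the one with the largest margin to the nearest points of each class. With the RBF kernel used below, this separation happens in a transformed feature space, so the resulting boundary can be non-linear in the original features.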
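
As a rough illustration of what "separating by a hyperplane" means, the sketch below classifies a few 2-D points with a fixed, hand-picked hyperplane. Nothing is learned here: the weights, bias, and points are hypothetical values, and the function names are assumptions for illustration. An SVM's job is to find such weights and bias so that the margin to the nearest points is as large as possible.

```python
import math


def decision_value(weights, bias, point):
    """Signed score w.x + b; its sign tells on which side of the hyperplane the point lies."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Assign class 1 to points on the non-negative side, class 0 otherwise."""
    return 1 if decision_value(weights, bias, point) >= 0 else 0


def distance_to_hyperplane(weights, bias, point):
    """Geometric distance of a point to the hyperplane."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, point)) / norm


# hypothetical hyperplane x1 + x2 - 3 = 0 in a 2-D feature space
weights, bias = [1.0, 1.0], -3.0

points = [[0.5, 1.0], [2.5, 2.0], [1.0, 1.5], [3.0, 0.5]]
labels = [classify(weights, bias, p) for p in points]
distances = [distance_to_hyperplane(weights, bias, p) for p in points]

print(labels)
print([round(d, 3) for d in distances])
```

The smallest of these distances is the margin of this particular hyperplane with respect to the four points.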

In [None]:
from sklearn import svm

In [None]:
# instantiate Support Vector Classification
# (named svc so that the imported svm module is not overwritten)
svc = svm.SVC(kernel='rbf', random_state=4)

# train
svc.fit(X_train, y_train)

# predict
train_preds = svc.predict(X_train)
test_preds = svc.predict(X_test)

# evaluate
train_accuracy_svm = accuracy_score(y_train, train_preds)
test_accuracy_svm = accuracy_score(y_test, test_preds)
report_svm = classification_report(y_test, test_preds)

print("Support Vector Machine")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_svm * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_svm * 100):.4}%")

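# append accuracy score to our dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
score_svm = ['Support Vector Machine', train_accuracy_svm, test_accuracy_svm]
models = pd.concat([models, pd.DataFrame([score_svm])], ignore_index=True)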

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.6. Model Comparison
<a id='4.6. Model Comparison'></a>

Now that we have run several models, let's check the training and testing accuracy we stored in a dataframe along the way, as well as the classification reports:

In [None]:
models

In [None]:
models.columns = ['Classifier', 'Training Accuracy', "Testing Accuracy"]
models.set_index(['Classifier'], inplace=True)
# sort by testing accuracy
models.sort_values(['Testing Accuracy'], ascending=[False])

In [None]:
print('Classification Report XGBoost: \n', report_booster)
print('------------------------------------------------------')
print('Classification Report Logistic Regression: \n', report_logreg)
print('------------------------------------------------------')
print('Classification Report SVM: \n', report_svm)
print('------------------------------------------------------')
print('Classification Report Random Forest: \n', report_forest)

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.7. Balancing the Data
<a id='4.7. Balancing the Data'></a>

All our models have done similarly well, with a **weighted average** F1 score between 72% and 76%. However, the classification reports show that the *"for"* votes are fairly well classified, while *"against"* votes are disproportionately misclassified.

Why might this be the case? Just by looking at the number of data points we have for each class, we see that we have far more data points for the *"for"* votes than for the *"against"* votes, which potentially skews our models' ability to distinguish between the classes. This also tells us that our models' accuracy is driven mainly by their ability to classify the *"for"* votes, which is less than ideal.
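
The effect can be illustrated with a small calculation. The sketch below uses hypothetical confusion counts (not the results of this notebook) for a test set with 700 *"for"* and 300 *"against"* votes and shows how accuracy and the weighted-average F1 score can look acceptable while the F1 score of the minority class is much lower. The function name `f1_from_counts` is an assumption for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score expressed through true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


# hypothetical counts: 700 true "for" votes, 300 true "against" votes
for_correct, for_missed = 630, 70            # "for" votes predicted as for / as against
against_correct, against_missed = 150, 150   # "against" votes predicted as against / as for

# a missed "against" vote is a false positive for the "for" class, and vice versa
f1_for = f1_from_counts(tp=for_correct, fp=against_missed, fn=for_missed)
f1_against = f1_from_counts(tp=against_correct, fp=for_missed, fn=against_missed)

support_for = for_correct + for_missed
support_against = against_correct + against_missed
total = support_for + support_against

# the weighted average weights each class by its number of true samples
weighted_f1 = (support_for * f1_for + support_against * f1_against) / total
accuracy = (for_correct + against_correct) / total

print(f"F1 for:      {f1_for:.3f}")
print(f"F1 against:  {f1_against:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
print(f"accuracy:    {accuracy:.3f}")
```

Because the majority class carries 70% of the weight in this example, the weighted average stays close to the majority-class score.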

To account for our imbalanced dataset, we can use an oversampling algorithm called SMOTE (Synthetic Minority Oversampling Technique). SMOTE creates synthetic minority-class observations by interpolating between an existing observation and one of its nearest neighbors from the same class. It is important that we **only oversample the training data**: that way, none of the information in the validation data is used to create synthetic observations.
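
The core idea of SMOTE can be sketched in a few lines. The code below is a simplified illustration, not the imbalanced-learn implementation: it picks one minority-class point, finds its nearest minority-class neighbors, and creates a new point somewhere on the line segment between the point and one of those neighbors. The helper names and the toy points are assumptions for illustration.

```python
import random


def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to point (squared Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return sorted(others,
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))[:k]


def synthetic_sample(point, neighbor, rng):
    """Pick a random position on the line segment between point and neighbor."""
    gap = rng.random()  # value in [0, 1)
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]


rng = random.Random(0)

# hypothetical minority-class observations with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [8.0, 9.0]]

base = minority[0]
neighbor = rng.choice(nearest_neighbors(base, minority, k=2))
new_point = synthetic_sample(base, neighbor, rng)

print("base:     ", base)
print("neighbor: ", neighbor)
print("synthetic:", new_point)
```

Repeating this step until the minority class is as large as the majority class gives a balanced training set made of original and interpolated observations.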

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
# view previous class distribution
print(target['vote'].value_counts()) 

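# resample data ONLY using training data
# (fit_resample replaces fit_sample, which is no longer available in newer imbalanced-learn releases)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)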

# view synthetic sample class distribution
print(pd.Series(y_resampled).value_counts()) 

In [None]:
# then split the resampled data into new training and test sets
# (both sets now contain synthetic observations)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, random_state=0)
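
The cell above draws a new train/test split from the resampled data, so the new test set also contains synthetic observations and the original `X_test` is replaced. A common alternative is to keep the original hold-out set untouched and to oversample only the training portion. The sketch below illustrates that ordering on toy data. It uses plain random duplication as a stand-in for SMOTE, and the helper `oversample_minority` and the data are assumptions for illustration, not part of this notebook.

```python
import random


def oversample_minority(rows, labels, rng):
    """Duplicate random minority rows until both classes are equally frequent.

    Stand-in for SMOTE, which would create interpolated rows instead of copies.
    """
    counts = {label: labels.count(label) for label in set(labels)}
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    n_extra = counts[majority] - counts[minority]
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * n_extra


rng = random.Random(0)

# toy data: 20 rows, 15 of class 1 and 5 of class 0
rows = [[i] for i in range(20)]
labels = [1] * 15 + [0] * 5

# 1) split first: every fourth row is held out for testing
test_idx = set(range(3, 20, 4))
train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
test_rows = [r for i, r in enumerate(rows) if i in test_idx]
test_labels = [l for i, l in enumerate(labels) if i in test_idx]

# 2) then oversample the training portion only
train_rows_bal, train_labels_bal = oversample_minority(train_rows, train_labels, rng)

print("train class counts:", train_labels_bal.count(0), train_labels_bal.count(1))
print("test class counts: ", test_labels.count(0), test_labels.count(1))
```

With this ordering the test set keeps the original class distribution, so the reported scores describe performance on real observations only.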

**Logistic Regression:**

In [None]:
# instantiate the logistic regression
logreg2 = LogisticRegression()

# train
logreg2.fit(X_train, y_train)

# predict
train_preds = logreg2.predict(X_train)
test_preds = logreg2.predict(X_test)

# evaluate
train_accuracy_logreg2 = accuracy_score(y_train, train_preds)
test_accuracy_logreg2 = accuracy_score(y_test, test_preds)
report_logreg2 = classification_report(y_test, test_preds)

print("Logistic Regression with balanced classes")
print("------------------------")
print(f"Training Accuracy: {(train_accuracy_logreg2 * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_logreg2 * 100):.4}%")

# store accuracy in a new dataframe
score_logreg2 = ['Logistic Regression balanced', train_accuracy_logreg2, test_accuracy_logreg2]
models2 = pd.DataFrame([score_logreg2])

**Random Forest:**

In [None]:
# instantiate the random forest with the best found parameters
forest2 = RandomForestClassifier(n_estimators=120, criterion='gini', max_features=9, 
                                 max_depth=10, min_samples_split=4, random_state=4)

# train the random forest
forest2.fit(X_train, y_train)

# predict
train_preds = forest2.predict(X_train)
test_preds = forest2.predict(X_test)

# evaluate
train_accuracy_forest2 = accuracy_score(y_train, train_preds)
test_accuracy_forest2 = accuracy_score(y_test, test_preds)
report_forest2 = classification_report(y_test, test_preds)

print("Random Forest with balanced classes")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_forest2 * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_forest2 * 100):.4}%")

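# append accuracy score to our dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
score_forest2 = ['Random Forest balanced', train_accuracy_forest2, test_accuracy_forest2]
models2 = pd.concat([models2, pd.DataFrame([score_forest2])], ignore_index=True)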

**XGBoost:**

In [None]:
# instantiate tuned xgboost
booster2 = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100,
                            colsample_bytree=0.7, gamma=0.1, random_state=4)

# train
booster2.fit(X_train, y_train)

# predict
train_preds = booster2.predict(X_train)
test_preds = booster2.predict(X_test)

# evaluate
train_accuracy_booster2 = accuracy_score(y_train, train_preds)
test_accuracy_booster2 = accuracy_score(y_test, test_preds)
report_booster2 = classification_report(y_test, test_preds)

print("XGBoost with balanced classes")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_booster2 * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_booster2 * 100):.4}%")

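# append accuracy score to our dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
score_booster2 = ['XGBoost balanced', train_accuracy_booster2, test_accuracy_booster2]
models2 = pd.concat([models2, pd.DataFrame([score_booster2])], ignore_index=True)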

**SVM:**

In [None]:
from sklearn import svm

In [None]:
# instantiate Support Vector Classification
svm2 = svm.SVC(kernel='rbf')

# train
svm2.fit(X_train, y_train)

# predict
train_preds = svm2.predict(X_train)
test_preds = svm2.predict(X_test)

# evaluate
train_accuracy_svm2 = accuracy_score(y_train, train_preds)
test_accuracy_svm2 = accuracy_score(y_test, test_preds)
report_svm2 = classification_report(y_test, test_preds)

print("Support Vector Machine with balanced classes")
print("-------------------------")
print(f"Training Accuracy: {(train_accuracy_svm2 * 100):.4}%")
print(f"Test Accuracy:     {(test_accuracy_svm2 * 100):.4}%")

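# append accuracy score to our dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
score_svm2 = ['SVM balanced', train_accuracy_svm2, test_accuracy_svm2]
models2 = pd.concat([models2, pd.DataFrame([score_svm2])], ignore_index=True)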

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.8. Model Comparison II
<a id='4.8. Model Comparison II'></a>

We've now balanced our dataset. Did balancing reduce the models' bias towards the more prevalent class?

In [None]:
models2

In [None]:
# Accuracy for balanced data
models2.columns = ['Classifier balanced', 'Training Accuracy', "Testing Accuracy"]
models2.set_index(['Classifier balanced'], inplace=True)
models2.sort_values(['Testing Accuracy'], ascending=[False])

In [None]:
# Accuracy for imbalanced data
models.sort_values(['Testing Accuracy'], ascending=[False])

In [None]:
print('Classification Report XGBoost: \n', report_booster2)
print('------------------------------------------------------')
print('Classification Report Logistic Regression: \n', report_logreg2)
print('------------------------------------------------------')
print('Classification Report SVM: \n', report_svm2)
print('------------------------------------------------------')
print('Classification Report Random Forest: \n', report_forest2)

**Success! Balancing our data has removed the bias towards the more prevalent class.**

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 5. Conclusions
<a id='5. Conclusions'></a>

#### 5.1. Feature Importance 
<a id='5.1. Feature Importance'></a>

As we come to an end, let's take a look at the 10 most important features for predicting someone's vote. For the sake of simplicity we use the **XGBoost** classifier and the **Random Forest** classifier:

In [None]:
# plot the important features - based on XGBoost
feat_importances = pd.Series(booster.feature_importances_, index=features.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh', color='darkgrey', figsize=(10,5))
plt.xlabel('Relative Feature Importance with XGBoost');

In [None]:
# plot the important features - based on Random Forest
feat_importances = pd.Series(forest.feature_importances_, index=features.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh', color='darkgrey', figsize=(10,5))
plt.xlabel('Relative Feature Importance with Random Forest');

This last plot, produced by our tuned Random Forest, draws the most distinct picture. We see that for the Random Forest, the attitude towards the argument **"It creates more equality of opportunity"** is the most important feature for separating supporters from opponents.

#### 5.2. Recommendation 
<a id='5.2. Recommendation'></a>

Going back to our initial question: **"Can we predict how people are likely to vote?"**, we can now conclude that the answer is **"Yes"**.

With XGBoost and the SMOTE algorithm to account for the imbalanced classes in our target variable, we were able to improve our accuracy to nearly 84% and to remove the bias towards the yes-voters. So as far as classification algorithms go, let's use **XGBoost**.

*Back to: <a href='#Table of contents'> Table of contents</a>*