
# Assignment 1: Predicting overall The Human Freedom Index

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions need to be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solutions to each exercise in the dedicated cells and to not modify the test cells. When you are done completing all the exercises submit this same notebook back to moodle in **.ipynb** format.

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021 ">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties—freedom of speech, religion, association, and assembly— in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. We also include nine variables pertaining to women-specific freedoms that are found in various categories of the index.

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, January 29th, 23:55</div>


In [26]:
import pandas as pd
import numpy as np

<div class="alert alert-info"><b>Exercise 1</b>

Load the Human Freedom Index data from the link: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv in a DataFrame called ```df```.

<br><i>[0.25 points]</i>
</div>
<div class="alert alert-warning">
Do not download the dataset. Instead, read the data directly from the provided link
</div>

In [27]:
df = pd.read_csv('https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv')

In [None]:
#LEAVE BLANK


<div class="alert alert-info"><b>Exercise 2</b>

First write the code to drop all the columns from the DataFrame ```df``` except ```['hf_quartile', 'ef_regulation',  'pf_expression', 'region']```, then drop all the rows from ```df``` containing missing values present in the selected columns.

<br><i>[0.25 points]</i>
</div>

<div class="alert alert-warning">

Remember, Python is case-sensitive. The resulting DataFrame ```df``` should contain only four columns.

</div>


In [28]:
df = df[["hf_quartile", "ef_regulation",  "pf_expression", "region"]] 
df = (df[~pd.isna(df).any(axis=1)])

In [None]:
#LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3</b> 
    
Write the code to create the feature matrix ```X``` (```ef_regulation```,  ```pf_expression```, and ```region```) and the target array ```y``` (```hf_quartile```), then split them into separate training and test sets with the relative size of 0.75 and 0.25. Store the training and tests feature matrix in variables called ```X_train``` and ```X_test```, and the training and test label arrays as ```y_train``` and ```y_test```.
    
<br><i>[1 point]</i>
</div>


In [29]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['hf_quartile'])
y = df['hf_quartile']

# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4 </b> 

    
The resulting feature matrix contains a categorical variable. Write the code to create a ```ColumnTransformer``` to encode it using the one-hot encoding method. Store the transformer in a variable called ```transformer```. At this stage, you do not need to run it.

<br><i>[1 points]</i>
</div>

<div class='alert alert-warning'>

Not all the attributes are categorical. Ensure that all non-categorical attributes remain intact.
</div>

In [30]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

transformer = ColumnTransformer([("ohe_encoder", OneHotEncoder(sparse = False), ["region"])], 
                                remainder='passthrough')
                               

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5 </b> 

Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the most frequent strategy, the previous transformer, a standard scaler, and a logistic regression model. Store the resulting pipeline in a variable called ```pipe```.
    
<br><i>[1.5 points]</i>
</div>

<div class='alert alert-warning'>

Be sure you apply the data transformations in the correct order.
</div>

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# split numerical and categorical, transform the categorical --> formula for question 7
categorical_features = []
numerical_features = []

for col in X.columns:
    if X[col].dtype == object:
        categorical_features.append(col)
    else:
        numerical_features.append(col)
        
print("Categorical features:", categorical_features)
print("Numerical features:", numerical_features)

# most frequent strategy
# Imputing the  feature with the mode and then with the OneHotEncoder
# Creating the features transformer
#standard scaler
#logistic regression
steps = [("data_cleaning", transformer),
         ("most_frequent", SimpleImputer(strategy = "most_frequent")),
         ('scaler', StandardScaler()),
         ('lr', LogisticRegression())]

# store resulting pipeline in variable called pipe
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps)

pipe.fit(X_train, y_train)

Categorical features: ['region']
Numerical features: ['ef_regulation', 'pf_expression']


  mode = stats.mode(array)


In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 6 </b> 
    
Write the code to store the achieved ```score``` (accuracy) in a variable called ```score```. 
    
<br><i>[1 point]</i>
</div>

<div class='alert alert-warning'>

Use train and test datasets correctly.
</div>

In [33]:
score = pipe.score(X_test, y_test)
score

0.7623126338329764

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 7 </b> 
    
The previous exercises were simple because they included only three columns. Now repeat the same process but using the complete dataset. This exercise is open. You can use any scaler, imputer, transformer, or encoder. The only requirement is to train a logistic regression. If you decide to drop a column, justify the reason. 
    
<br><i>[5 points]</i>
</div>

<div class='alert alert-warning'>
    
The following columns are redundant and should be dropped:
* ```year```
* ```ISO```
* ```countries```
* All columns containing the word ```rank``` 
* All columns containing the word ```score```
    
</div>


In [34]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# setting the original df and dropping the redundant columns

df = pd.read_csv('https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv')
new_df = df.drop(columns=['year','ISO','countries'])
new_df = new_df.loc[:, ~new_df.columns.str.contains("rank")]
new_df = new_df.loc[:, ~new_df.columns.str.contains("score")]


# looking for columns with a high percentages of missing values - we set 26% as the cutoff

percent_missing = new_df.isnull().sum() * 100 / len(new_df)
missing_value_df = pd.DataFrame({'column_name': new_df.columns,
                                 'percent_missing': percent_missing})
missing_value_df2 = missing_value_df[percent_missing>26]


# dropping those variables with high percentages of missing values
new_df = df.drop(columns=missing_value_df2.column_name.unique().tolist())
new_df = df.dropna()

In [35]:
# dropping high correlation variables

cor_matrix = new_df.corr().abs()
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(np.bool_))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
df1 = new_df.drop(to_drop, axis=1)

# splitting the dataset

X = df1.drop(columns=['hf_quartile'])
y = df1['hf_quartile']

# using the train test split function with a sample of 0.25

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#splitting categorical and numerical columns

cols = X.columns
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = list(set(cols) - set(num_cols))

# creating feature transformer with an inner pipe



inner_steps_region = [("most_frequent", SimpleImputer(strategy = "most_frequent")), 
                        ("ohe", OneHotEncoder(sparse = False))]
inner_pipe = Pipeline(inner_steps_region)

transformer = ColumnTransformer([("ohe_encoder", OneHotEncoder(sparse = False), cat_cols),
                               ("mean", SimpleImputer(strategy = "mean"), num_cols),
                               ("inner_steps", inner_pipe, ["region"])],
                                remainder = "passthrough")

# creating pipe with data cleaning, normalization and logistic regression model


norm = MinMaxScaler()


lr = LogisticRegression(solver = 'lbfgs', max_iter=1000)

steps = [("data_cleaning", transformer),
            ("normalization", norm),
            ("training", lr)]

pipe = Pipeline(steps)

pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
accuracy

0.8854166666666666