
# Assignment 2: Hyperparameter Optimization for the Human Freedom Index Model

This notebook contains a set of exercises that will guide you through the different steps of this assignment. As in Assignment 1, solutions need to be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in the dedicated cells and not to modify the test cells. When you have completed all the exercises, submit this same notebook back to Moodle in **.ipynb** format.

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties (freedom of speech, religion, association, and assembly) in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. We also include nine variables pertaining to women-specific freedoms that are found in various categories of the index.

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, February 12th, 23:55</div>

In [2]:
import numpy as np
import pandas as pd

<div class="alert alert-info"><b>Exercise 1</b>
    
Load the Human Freedom Index data from the link: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv in a DataFrame called ```df```. The following columns are redundant and should be dropped:
* ```year```
* ```ISO```
* ```countries```
* All columns containing the word ```rank``` 
* All columns containing the word ```score```

Then store the independent variables in a DataFrame called ```X``` and the dependent variable (```hf_quartile```) in a DataFrame called ```y```.
    
<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
Do not download the dataset. Instead, read the data directly from the provided link
</div>
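
The column clean-up described in this exercise can be sketched on a small made-up frame. The column names and values below are hypothetical stand-ins chosen to mimic the naming pattern, not the real dataset, and the variable names (`toy`, `features`, `target`) are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; the real data is read from the link given above.
toy = pd.DataFrame({
    "year": [2019, 2019],
    "pf_rank": [1, 2],
    "pf_score": [9.1, 8.7],
    "pf_rol": [7.5, 6.9],
    "hf_quartile": [1, 2],
})

# Columns whose name contains "rank" or "score" anywhere in the string.
pattern_cols = toy.filter(regex="rank|score").columns
cleaned = toy.drop(columns=["year", *pattern_cols])

features = cleaned.drop(columns="hf_quartile")  # independent variables
target = cleaned["hf_quartile"]                 # dependent variable

print(list(features.columns))
```

`DataFrame.filter(regex=...)` matches with `re.search`, so the pattern does not need to cover the whole column name.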

In [3]:
# YOUR CODE HERE
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Read the data directly from the provided link (no local download)
url = "https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv"
df = pd.read_csv(url)

# Drop redundant columns
df = df.drop(columns=['year', 'ISO', 'countries'])
df = df.drop(columns=df.filter(regex='rank|score').columns)

# Store independent variables in X
X = df.drop(columns=['hf_quartile'])

# Store dependent variable in y
y = df['hf_quartile']

# Fill missing target values with the column mean.
# NOTE: the mean of a class label is generally not a valid class, so this can
# introduce a spurious extra class; dropping rows with a missing target is usually safer.
imputer = SimpleImputer(strategy='mean')
y = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()  # flatten back to 1-D

# Encode the target classes as consecutive integer labels
le = LabelEncoder()
y = le.fit_transform(y)


In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2</b>
    
Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the ```most_frequent``` strategy, a ```OneHotEncoder``` for the categorical variables, a standard scaler, and a logistic regression model with the solver ```saga``` and ```max_iter``` set to 2000. Store the resulting pipeline in a variable called ```pipe```.
    
<br><i>[1 point]</i>
</div>
<div class='alert alert-warning'>

Not all the attributes are categorical. Ensure that all non-categorical attributes remain intact.
</div>
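
One way to keep the non-categorical attributes intact is to encode only the categorical columns and pass everything else through unchanged. The sketch below illustrates this on a hypothetical two-column frame (`toy_X` and its values are made up); it is not necessarily the layout the test cells expect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric column.
toy_X = pd.DataFrame({
    "region": ["A", "B", "A"],
    "pf_rol": [7.5, 6.9, 5.0],
})

encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["region"])],
    remainder="passthrough",  # untouched columns are appended after the encoded ones
    sparse_threshold=0,       # always return a dense array
)
encoded = encoder.fit_transform(toy_X)
print(encoded)
```

The first two output columns are the one-hot indicators for `region`; the last column is `pf_rol`, unchanged.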

In [4]:
# YOUR CODE HERE

# Split the columns by dtype: numeric features vs. categorical (string) features
numerical_columns = X.select_dtypes(include='number').columns
categorical_columns = X.select_dtypes(include=['object']).columns

# Create the transformers
numerical_transformer = SimpleImputer(strategy='most_frequent')
categorical_transformer = OneHotEncoder()
scaler = StandardScaler()

# Split the data into train and test sets (used by the grid searches below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Route each column group to its transformer: impute the numeric columns,
# one-hot encode the categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

# Create the logistic regression model
logistic_regression = LogisticRegression(solver='saga', max_iter=2000)

# Chain preprocessing, scaling, and the model into a single pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', scaler),
    ('logistic_regression', logistic_regression)
])


In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3</b>

Write the code to estimate the performance of the model using cross-validation with **three** stratified folds. Store the three test score values in a dictionary called ```fold_scores```.
    
<br><i>[1 point]</i>
</div>
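
As a minimal sketch of stratified cross-validation, the example below scores a plain logistic regression on synthetic data from `make_classification`. The data, the model, and the names `X_toy`, `y_toy`, and `fold_scores_toy` are assumptions made for illustration, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem (illustrative only).
X_toy, y_toy = make_classification(n_samples=90, n_features=5, random_state=0)

# Each fold keeps roughly the same class proportions as the full dataset.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="accuracy"
)

# One test score per fold, keyed by fold number.
fold_scores_toy = {f"fold_{i}": float(s) for i, s in enumerate(scores, start=1)}
print(fold_scores_toy)
```

Passing an integer such as `cv=3` with a classifier also uses stratified folds, but without shuffling; building the `StratifiedKFold` object explicitly just makes that choice visible.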

In [5]:
# YOUR CODE HERE
from sklearn.model_selection import cross_val_score

# Evaluate the pipeline with three-fold cross-validation.
# For a classifier, an integer cv uses stratified folds.
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)

# Store the test score of each fold in a dictionary
fold_scores = {'fold_1': scores[0],
               'fold_2': scores[1],
               'fold_3': scores[2]}

print(fold_scores)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


{'fold_1': 0.9272727272727272, 'fold_2': 0.953030303030303, 'fold_3': 0.9045454545454545}


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   36.7s finished


In [None]:
stop_patchers(patchers)
call_order = ['estimator', 'X', 'y', 'groups', 'scoring', 'cv', 'n_jobs', 'verbose', 'fit_params', 'pre_dispatch', 'error_score']
check_args({'estimator': pipe, 'X': X, 'y': y, 'cv': 3}, call_order, mocks)

NameError: ignored

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4</b>

    
Write the code to create a GridSearchCV object called ```grid``` and fit it using **only three folds**. The grid search object must include the previous pipeline and test the following hyperparameters:
* ```penalty``` : ['l1', 'l2']
* ```C``` : [0.1,10]

Finally, store the best achieved score (accuracy) in a variable called ```score```.

<br><i>[2.5 points]</i>
</div>

<div class='alert alert-warning'>

Use train and test datasets correctly.
</div>
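
A minimal sketch of the train/test discipline, assuming synthetic data and a hypothetical two-step pipeline (`toy_pipe`, with step names `scaler` and `model`): the grid search sees only the training split, and the held-out split is used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="saga", max_iter=2000)),
])

# Hyperparameters of a step are addressed as "<step name>__<parameter>".
toy_grid = GridSearchCV(toy_pipe, {"model__C": [0.1, 10]}, cv=3, scoring="accuracy")
toy_grid.fit(X_tr, y_tr)  # cross-validation happens inside the training split

cv_accuracy = toy_grid.best_score_          # mean CV accuracy of the best candidate
test_accuracy = toy_grid.score(X_te, y_te)  # accuracy on data the search never saw
print(cv_accuracy, test_accuracy)
```

`best_score_` is the mean cross-validated score on the training data, so it is a different quantity from the accuracy on the test split.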

In [6]:
pipe

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  Index(['pf_rol_procedural', 'pf_rol_civil', 'pf_rol_criminal', 'pf_rol',
       'pf_ss_homicide', 'pf_ss_disappearances_disap',
       'pf_ss_disappearances_violent', 'pf_ss_disappearances_organized',
       'pf_ss_disappearances_fatalities', 'pf_ss_disappearances_inju...
       'ef_regulation_business_start', 'ef_regulation_business_bribes',
       'ef_regulation_business_licensing', 'ef_regulation_business_compliance',
       'ef_regulation_business', 'ef_regulation'],
      dtype='object', length=112)),
                                                 ('cat', OneHotEncoder(),
                                                  Index(['region'], dtype='object'))])),
                ('scaler', StandardScaler()),
                ('logistic_regression',
        

In [None]:
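from sklearn.model_selection import GridSearchCV

# Hyperparameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {'logistic_regression__penalty': ['l1', 'l2'],
              'logistic_regression__C': [0.1, 10]}

# Search with three folds on the training split only
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter combination
score = grid.best_score_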





In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5</b>
    
The previous grid search is incomplete because it only optimizes the hyperparameters of the logistic regression model. Now repeat the same process, but test parameters from all the steps of the pipeline. This exercise is open-ended: you can use any hyperparameter from the scaler, imputer, transformer, encoder, or model. Do not limit yourself to linear models.

<br><i>[5 points]</i>
</div>
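
Two ideas help with an open search like this one, sketched below on synthetic data with a hypothetical three-step pipeline (`toy_pipe`). First, `get_params()` lists every name that is valid as a grid key. Second, a list of dictionaries lets the search swap the whole final estimator, so non-linear models can be compared with their own hyperparameters. The random forest and all values in the grid are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

toy_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=2000)),
])

# Every nested name reported here is a valid key for the parameter grid.
tunable = sorted(name for name in toy_pipe.get_params() if "__" in name)

# One dictionary per model family; each replaces the "classifier" step
# and tunes parameters that exist only for that estimator.
toy_param_grid = [
    {
        "imputer__strategy": ["mean", "median"],
        "classifier": [LogisticRegression(max_iter=2000)],
        "classifier__C": [0.1, 10],
    },
    {
        "imputer__strategy": ["mean", "median"],
        "classifier": [RandomForestClassifier(random_state=0)],
        "classifier__n_estimators": [25, 50],
    },
]

toy_grid = GridSearchCV(toy_pipe, toy_param_grid, cv=3, scoring="accuracy")
toy_grid.fit(X_toy, y_toy)
n_candidates = len(toy_grid.cv_results_["params"])
print(n_candidates, toy_grid.best_params_)
```

Checking a key against `tunable` before fitting is a cheap way to catch a misspelled or mis-nested parameter name, which otherwise surfaces as a `ValueError` during `fit`.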

In [None]:
# YOUR CODE HERE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Create the pipeline (numeric columns only: the ColumnTransformer's default
# remainder='drop' discards the categorical columns)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), numerical_columns)
    ])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

# Define the hyperparameters to test.
# The lbfgs solver does not support the 'l1' penalty, so only 'l2' is searched;
# 'l1' candidates fail to fit and can never be selected as the best.
param_grid = {
    'preprocessor__num__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'classifier__penalty': ['l2'],
    'classifier__C': [0.1, 1, 10],
}

# Create the grid search object and fit it on the training split
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

# Store the best score (mean cross-validated accuracy)
score = grid.best_score_
score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.9324494949494949

In [None]:
from sklearn.model_selection import GridSearchCV

# Grid keys must match the step names of the `pipe` defined in the previous cell:
# the 'num' transformer is a bare SimpleImputer (hence `preprocessor__num__strategy`)
# and the final step is named 'classifier'. Keys that do not match raise a ValueError.
param_grid = {
    'preprocessor__num__strategy': ['mean', 'median', 'most_frequent'],
    'scaler__with_mean': [True, False],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [0.1, 10],
    'classifier__solver': ['saga', 'liblinear'],
    'classifier__max_iter': [1000, 2000],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='most_frequent'), numerical_columns),
        ('cat', OneHotEncoder(), categorical_columns)
    ])

scaler = StandardScaler()
logistic_regression = LogisticRegression(max_iter=2000, solver='saga')

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', scaler),
    ('logistic_regression', logistic_regression)
])

param_grid = {
    'preprocessor__num__strategy': ['mean', 'median', 'most_frequent'],
    'scaler__with_mean': [True, False],
    'logistic_regression__penalty': ['l1', 'l2'],
    'logistic_regression__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)
score = grid.best_score_


