# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mglearn

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)
df = pd.read_csv('Life Expectancy Data.csv')
df.head() # display the first 5 rows to make sure we have everything downloaded 

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

1. I sourced my dataset from Kaggle, which is also where some of the datasets we worked with in class came from.
2. I chose this dataset because it looked interesting without being too difficult to work with. Most of its columns are numerical, so it should need little encoding, and I expect it to give interesting results.
3. It is challenging to find a dataset that gives meaningful and interesting results: you want one where the models perform well and the results are understandable, rather than one that leaves you more confused about what we are learning in class.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Clean data (if needed)
df.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                      7
Adult Mortality                      7
infant deaths                        0
Alcohol                            138
percentage expenditure               0
Hepatitis B                        355
Measles                              0
 BMI                                 1
under-five deaths                    0
Polio                                8
Total expenditure                  141
Diphtheria                           8
 HIV/AIDS                            0
GDP                                 20
Population                           9
 thinness  1-19 years                1
 thinness 5-9 years                  1
Income composition of resources      5
Schooling                            1
dtype: int64

In [4]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

# Forward fill on the null values 
df_ffill = df.ffill()
print(df_ffill.isnull().sum())

# Encode the developed vs developing country 
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)

# Fit the encoder and transform the 'Status' column in one step
status_encoded = enc.fit_transform(df_ffill[['Status']])

data_enc = pd.DataFrame(status_encoded, columns=enc.get_feature_names_out())

le_data = pd.concat([df_ffill.drop(columns='Status'), data_enc], axis=1)
le_data

Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
 BMI                               0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
 HIV/AIDS                          0
GDP                                0
Population                         0
 thinness  1-19 years              0
 thinness 5-9 years                0
Income composition of resources    0
Schooling                          0
dtype: int64


Unnamed: 0,Country,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Status_Developed,Status_Developing
0,Afghanistan,2015,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,...,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,0.0,1.0
1,Afghanistan,2014,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,...,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,0.0,1.0
2,Afghanistan,2013,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,...,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,0.0,1.0
3,Afghanistan,2012,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,...,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,0.0,1.0
4,Afghanistan,2011,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,...,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2242,Zimbabwe,2004,44.3,723.0,27,4.36,0.000000,68.0,31,27.1,...,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2,0.0,1.0
2243,Zimbabwe,2003,44.5,715.0,26,4.06,0.000000,7.0,998,26.7,...,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5,0.0,1.0
2244,Zimbabwe,2002,44.8,73.0,25,4.43,0.000000,73.0,304,26.3,...,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0,0.0,1.0
2245,Zimbabwe,2001,45.3,686.0,25,1.72,0.000000,76.0,529,25.9,...,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8,0.0,1.0
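
A plain `df.ffill()` fills each gap from the row above it regardless of country, so a gap in the first row of one country is filled from the last row of the country before it. The following is a minimal sketch of a group-wise alternative, assuming a toy table with made-up values; `toy`, `value_cols`, `plain` and `grouped` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the life-expectancy table (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2015, 2014, 2013, 2015, 2014, 2013],
    "GDP": [500.0, np.nan, 480.0, np.nan, 900.0, np.nan],
})

# Plain forward fill: the first row of country B inherits country A's last value.
plain = toy.ffill()

# Group-wise forward fill: values are only carried down within one country.
# A back fill afterwards covers gaps at the top of a country's block.
value_cols = ["GDP"]
grouped = toy.copy()
grouped[value_cols] = toy.groupby("Country")[value_cols].ffill()
grouped[value_cols] = grouped.groupby("Country")[value_cols].bfill()

print(plain["GDP"].tolist())    # [500.0, 500.0, 480.0, 480.0, 900.0, 900.0]
print(grouped["GDP"].tolist())  # [500.0, 500.0, 480.0, 900.0, 900.0, 900.0]
```

In the plain version the 2015 value for country B (480.0) comes from country A. The grouped version avoids this, at the cost of needing a second pass for gaps at the start of a group.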


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

1. There were some missing values. I manually deleted the countries that had large amounts of data missing, and forward filled the rest so that each gap takes the value from the row above it, which is normally an adjacent year of the same country.
2. The data is all numerical except for the 'Status' of the country, i.e. developed or developing, so that column needs encoding. I did this above and then realized that it could go into the pipeline, so I kept the original code and left a commented-out encoding step in the pipelines below. I also planned to try a linear Support Vector Machine, which would use scaling as a preprocessing step, although I don't think the data necessarily needs it. The other two models are tree-based and don't need scaling.
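
The idea in answer 2 of moving the encoding into the pipeline could look roughly like the sketch below, which uses a `ColumnTransformer` to one-hot encode `Status` and scale the numeric columns before a linear model. It is only an illustration under assumptions: the toy values are made up, only two numeric columns are included, and the names `X_toy`, `y_toy`, `preprocess` and the step name `regressor` are hypothetical rather than taken from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in data; column names mirror the dataset, values are made up.
X_toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developed"],
    "Schooling": [10.1, 15.2, 9.8, 16.0],
    "GDP": [584.0, 40000.0, 612.0, 42000.0],
})
y_toy = pd.Series([65.0, 81.0, 60.0, 82.0])

categorical = ["Status"]
numeric = ["Schooling", "GDP"]

# Each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer([
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

# Preprocessing is fitted together with the model, so a grid search
# refits it on each training fold only.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("regressor", Ridge(alpha=1.0)),
])
pipe.fit(X_toy, y_toy)

transformed = pipe.named_steps["preprocess"].transform(X_toy)
print(transformed.shape)  # 2 one-hot columns + 2 scaled columns
```

With this layout the raw `Status` column can stay in `X`, and the tree models could reuse the same `preprocess` step, since scaling does not harm them.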

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [5]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

In [6]:
# Split data
X = le_data.drop(['Life expectancy ', 'Country'], axis=1)  # Features (all columns except the target and 'Country')
y = le_data['Life expectancy ']  # Target variable

print(X.shape)
print(y.shape)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape)
print(y_train.shape)

X.head()

(2247, 21)
(2247,)
(1797, 21)
(1797,)


Unnamed: 0,Year,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,...,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Status_Developed,Status_Developing
0,2015,263.0,62,0.01,71.279624,65.0,1154,19.1,83,6.0,...,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1,0.0,1.0
1,2014,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,...,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,0.0,1.0
2,2013,268.0,66,0.01,73.219243,64.0,430,18.1,89,62.0,...,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9,0.0,1.0
3,2012,272.0,69,0.01,78.184215,67.0,2787,17.6,93,67.0,...,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8,0.0,1.0
4,2011,275.0,71,0.01,7.097109,68.0,3013,17.2,97,68.0,...,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,0.0,1.0


In [7]:
# Create pipeline with processing and regression tests  

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV


In [8]:
# Decision Tree, Regression
pipe_dt = Pipeline([
    # ('encoding', OneHotEncoder()),
    ('classifier', DecisionTreeRegressor(random_state=0))
])
print(pipe_dt)

pipe_dt.fit(X_train, y_train)
print(f'Training accuracy {pipe_dt.score(X_train, y_train):.2f} on {y_train.shape} samples')

Pipeline(steps=[('classifier', DecisionTreeRegressor(random_state=0))])
Training accuracy 1.00 on (1797,) samples
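
A training score of 1.00 is what a fully grown decision tree produces on almost any data, so on its own it says little about how the model generalizes. The sketch below compares the training R² with a cross-validated R² for an unconstrained tree. It uses synthetic data from `make_regression`, not the life-expectancy data, and `X_demo`, `y_demo`, `train_r2` and `cv_r2` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise; NOT the life-expectancy dataset.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)

tree = DecisionTreeRegressor(random_state=0)  # no depth limit, like pipe_dt

# R^2 on the same rows the tree was fitted to.
train_r2 = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean R^2 over 5 folds, each scored on rows the tree did not see.
cv_r2 = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"Training R^2: {train_r2:.2f}")
print(f"5-fold CV R^2: {cv_r2:.2f}")
```

The gap between the two numbers, rather than the training score alone, is what indicates overfitting.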


In [9]:
# Gradient Boosted Machine, Regression
pipe_gb = Pipeline([
    # ('encoding', OneHotEncoder()), 
    ('classifier', GradientBoostingRegressor(max_depth=5, random_state=0))
])
print(pipe_gb)


# Fit the pipeline
pipe_gb.fit(X_train, y_train)
print(f'Training accuracy {pipe_gb.score(X_train, y_train):.2f} on {y_train.shape[0]} samples')


Pipeline(steps=[('classifier',
                 GradientBoostingRegressor(max_depth=5, random_state=0))])
Training accuracy 0.99 on 1797 samples


In [10]:
# Ridge, Linear regression
from sklearn.linear_model import Ridge

import warnings
warnings.filterwarnings('ignore')  # ignore some deprecation warnings

pipe_rid = Pipeline([('classifier', Ridge())])
print(pipe_rid)

pipe_rid.fit(X_train, y_train)
print(f'Training accuracy {pipe_rid.score(X_train, y_train):.2f} on {y_train.shape} samples')


Pipeline(steps=[('classifier', Ridge())])
Training accuracy 0.84 on (1797,) samples


In [11]:
param_grid = [{'classifier': [DecisionTreeRegressor(random_state=0)],
    'classifier__max_depth': [3, 5, 7],  # You can add more values as needed
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
    },
    {'classifier': [GradientBoostingRegressor(max_depth=5, random_state=0)],
    'classifier__max_depth': [3, 5],
    'classifier__n_estimators': [50, 100],
    },
    {'classifier': [Ridge()],
    'classifier__alpha': [0.1, 1.0, 10.0]}]
    

# Create the grid search
grid = GridSearchCV(pipe_dt, param_grid, cv=5)
grid.fit(X_train, y_train)


In [12]:
grid.best_estimator_

In [13]:
grid.best_params_

{'classifier': GradientBoostingRegressor(max_depth=5, random_state=0),
 'classifier__max_depth': 5,
 'classifier__n_estimators': 100}

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

1. I need regression models because the target variable is life expectancy, which is a continuous value.
2. I chose a decision tree and gradient boosting for my non-linear models. I chose the decision tree because it is simpler and, by controlling the depth and other hyperparameters, we can get decent results without it being too slow. Gradient boosting was chosen because it should make up for where a single decision tree is lacking and give a better result, which it did. For the linear model I chose Ridge because it is simple. I tried to make a Support Vector Machine model work on my dataset, but I kept getting errors during the grid search that I could not figure out, so I went with something simple instead, and it still had a decent training R² of 0.84.
3. The gradient boosted machine worked the best, and I think that makes sense: boosting many small trees tends to handle non-linear relationships well and usually overfits less than a single fully grown tree, and it does well on continuous values, which is what most of the dataset consists of. The parameter grid was useful for this model because it let me try different hyperparameters for the GBM and report the best combination, which was a max depth of 5 and 100 estimators.
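
Answer 2 mentions that a Support Vector Machine could not be made to work in the grid search. The cause of those errors is not recorded here, so the following is only a sketch of one way a linear-kernel `SVR` could sit alongside `Ridge` in the same kind of grid. It assumes synthetic data and an added `StandardScaler` step, and the names `X_demo`, `y_demo` and `scaler` are hypothetical. One detail that matters in this pattern is that each dictionary may only list parameters that exist on the estimator in that dictionary (for example `classifier__alpha` for `Ridge`, `classifier__C` for `SVR`).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data; NOT the life-expectancy dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # SVR and Ridge are both sensitive to feature scale
    ("classifier", Ridge()),       # placeholder; the grid swaps this step
])

# One dictionary per model, each with only that model's own parameters.
param_grid = [
    {"classifier": [Ridge()], "classifier__alpha": [0.1, 1.0, 10.0]},
    {"classifier": [SVR(kernel="linear")], "classifier__C": [0.1, 1.0, 10.0]},
]

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

The step name `classifier` is kept only to match the notebook's grid, even though both estimators are regressors.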

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [14]:
# Calculate testing accuracy (1 mark)
print(f'Cross-Validation accuracy {grid.best_score_:.2f}')
print(f'Train accuracy {grid.score(X_train, y_train):.2f}')
print(f'Test accuracy {grid.score(X_test, y_test):.2f}')

Cross-Validation accuracy 0.96
Train accuracy 0.99
Test accuracy 0.94
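
For a regressor, `score` returns the coefficient of determination R², so the three values above are R² scores rather than classification accuracies. Error metrics expressed in the target's own units (years of life expectancy here) can be easier to interpret. The sketch below computes R², MAE and RMSE on a held-out set. It assumes synthetic data and illustrative names (`X_demo`, `model`, `pred`), so its numbers say nothing about the life-expectancy model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; NOT the life-expectancy dataset.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(max_depth=5, n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)                       # same value as model.score(X_te, y_te)
mae = mean_absolute_error(y_te, pred)           # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_te, pred))  # penalizes large errors more than MAE

print(f"Test R^2:  {r2:.2f}")
print(f"Test MAE:  {mae:.2f}")
print(f"Test RMSE: {rmse:.2f}")
```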



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

1. I chose to print the R² score because that is what one of the Pipeline examples did, and it seemed like the easiest way to check that the model does well on the dataset. The scores above are good.
2. In step 3 I first made a basic call for each model without tuning the hyperparameters, and the decision tree is clearly overfitting there: a training R² of 1.00 is not a good sign. The grid search let me vary the hyperparameters. The training score is still 0.99 for the gradient boosted machine, which suggests it may be overfitting.
3. I do think the scores are good enough for the real world, but more assessment is needed to make sure the model is not overfitting. The testing score of 0.94 is not far below the training score, which is good, but a training score of 0.99 is still suspicious. I could adjust the GBM hyperparameters a bit, or look at the second-best model to see what its score is.
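
The suggestion in answer 3 to look at the second-best model can be followed up through the `cv_results_` attribute of a fitted `GridSearchCV`, which stores the cross-validation score of every combination tried. The sketch below picks the best row per model type. It assumes synthetic data and a smaller grid than the notebook's, and `results`, `best_per_model` and the `model` column are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; NOT the life-expectancy dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

pipe = Pipeline([("classifier", Ridge())])
param_grid = [
    {"classifier": [DecisionTreeRegressor(random_state=0)],
     "classifier__max_depth": [3, 5]},
    {"classifier": [Ridge()],
     "classifier__alpha": [0.1, 1.0]},
]
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_demo, y_demo)

# One row per parameter combination that was evaluated.
results = pd.DataFrame(grid.cv_results_)

# Label each row with the class name of the estimator used in that run.
results["model"] = results["param_classifier"].apply(lambda est: type(est).__name__)

# Sort by rank, then keep the best-ranked row for each model type.
cols = ["mean_test_score", "std_test_score", "rank_test_score"]
best_per_model = (
    results.sort_values("rank_test_score")
    .groupby("model", sort=False)[cols]
    .first()
)
print(best_per_model)
```

Comparing `mean_test_score` alongside `std_test_score` shows whether the runner-up is meaningfully worse or within fold-to-fold noise.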

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

I sourced the majority of my code from previous assignments and examples in class. While I was getting a lot of errors trying to make the Support Vector Machine model work (which I did not end up using), I pasted the errors into ChatGPT and asked what they meant. I still couldn't figure them out, so I changed the model to one that would work.
I completed the steps in order.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

Getting the grid search and pipelines to actually work was very difficult. I was getting a lot of errors that I couldn't figure out, and eventually I had to make everything as simple as possible so that I could understand what was going on. I tried a Support Vector Machine model, thinking it might work for the regression, but I kept getting errors and simplified to the Ridge linear model. Being given a dataset and spending the time on the pipeline itself would probably have made it easier to understand what pipelines are.