# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [18]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [19]:
# Import dataset (1 mark)
df = pd.read_csv('./bankloan.csv')
pd.set_option('display.max_columns', None)

df.head()
df.describe()

| | ID | Age | Experience | Income | ZIP.Code | Family | CCAvg | Education | Mortgage | Personal.Loan | Securities.Account | CD.Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 45.3384 | 20.1046 | 73.7742 | 93152.503 | 2.3964 | 1.937938 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 2121.852197 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.0 | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1250.75 | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2500.5 | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 3750.25 | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 5000.0 | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

1. The data is from Kaggle. It is a loan application dataset with 5000 records and a binary target for whether the loan was approved or not.
2. I was hunting for a dataset whose features I understood and that had lots of information in it. This one has 5000 records and good documentation, and I understood what its features meant.
3. There were some interesting ones in the medical field, but I didn't fully understand what some of the features were, which would have made things more difficult. Some of the datasets I looked at also didn't have many records, so I ruled those out fairly quickly. I originally tried a dataset of Spotify data and song popularity, but quickly found that the correlation there wasn't strong enough to get a reasonable model, and it just made the whole process confusing.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [20]:
# Clean data (if needed)
df.shape
df.dtypes
print(df.isnull().sum()) 

ID                    0
Age                   0
Experience            0
Income                0
ZIP.Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal.Loan         0
Securities.Account    0
CD.Account            0
Online                0
CreditCard            0
dtype: int64


In [21]:
from sklearn.model_selection import train_test_split

# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
# Drop columns that are identifiers or otherwise not usable as model features
df = df.drop(columns=['ID', 'ZIP.Code'])


# Then let's split the data into features and target before preprocessing
X = df.drop('Personal.Loan', axis=1)
y = df['Personal.Loan']

print(df.dtypes)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Age                     int64
Experience              int64
Income                  int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal.Loan           int64
Securities.Account      int64
CD.Account              int64
Online                  int64
CreditCard              int64
dtype: object


In [22]:
# Further preprocessing: there are lots of numerical values, so we scale those.
# The categorical features are already 0/1, so they do not need one-hot encoding.


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

print(df.shape)


# Numerical features to impute and scale.
# Note: Age and Securities.Account are not listed in either group below, so
# ColumnTransformer's default remainder='drop' leaves them out of the model.
numerical_features = ['Experience','Income','Family','CCAvg','Education','Mortgage']

categorical_features = ['CD.Account','Online','CreditCard']

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# All of my categorical features are already in boolean, so they do not need to be one hot encoded.
categorical_pipeline = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='most_frequent'))
])

preprocessor = Pipeline([
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]))
])








(5000, 12)
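
A small sketch of how `ColumnTransformer` treats columns that are not listed in any transformer, which matters for the feature lists above. The toy frame and the names `toy`, `drop_rest` and `keep_rest` are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up three-row frame; only the column names come from the dataset
toy = pd.DataFrame({
    "Age": [25, 40, 60],
    "Income": [30, 80, 150],
    "Online": [0, 1, 1],
})

# Only Income and Online are routed to a transformer; Age is not listed
drop_rest = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Income"]),
        ("cat", "passthrough", ["Online"]),
    ]
)  # remainder="drop" is the default, so Age is discarded

# Same routing, but unlisted columns are kept unchanged
keep_rest = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Income"]),
        ("cat", "passthrough", ["Online"]),
    ],
    remainder="passthrough",
)

dropped_shape = drop_rest.fit_transform(toy).shape
kept_shape = keep_rest.fit_transform(toy).shape
print(dropped_shape)  # two columns: Income and Online
print(kept_shape)     # three columns: Income, Online and Age
```

If every remaining column is meant to reach the model, either list it in one of the feature groups or set `remainder="passthrough"`.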


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?


*ANSWER HERE*

1. The data was all in good shape from the get-go, with no missing values. For cases like this, where there is already lots of data, if only very few records had bad or missing data you could just remove those records. If there were too many of them, you could use the average or the most common value in their place, depending on the context.
2. I have both categorical and numerical data. I scaled the numerical data and imputed missing values with the mean. For the categorical features, all of the data was boolean, so no encoding or scaling was necessary.
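
The two options described in the first answer (dropping records versus imputing) can be sketched as follows. The toy values are invented, since the real dataset has no missing entries; `toy`, `income_filled` and `online_filled` are illustrative names only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented example with one missing value per column
toy = pd.DataFrame({
    "Income": [40.0, np.nan, 100.0, 60.0],
    "Online": [1.0, 1.0, np.nan, 0.0],
})

# Option 1: drop every record that has a missing value
dropped = toy.dropna()

# Option 2: impute, using the mean for a numerical column
# and the most frequent value for a 0/1 column
mean_imputer = SimpleImputer(strategy="mean")
mode_imputer = SimpleImputer(strategy="most_frequent")
income_filled = mean_imputer.fit_transform(toy[["Income"]]).ravel()
online_filled = mode_imputer.fit_transform(toy[["Online"]]).ravel()

print(len(dropped))
print(income_filled)
print(online_filled)
```

Dropping loses half of this tiny frame, which is why it only makes sense when very few records are affected.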

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [28]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Let's just throw a ton of models at it, for good measure.

model_params_reg = {
    'LinearRegression': {
        'model': LinearRegression(),
        'params': {
            'model__fit_intercept': [True, False]
        }
    },
    'GradientBoostingRegressor': {
        'model': GradientBoostingRegressor(),
        'params': {
            'model__n_estimators': [50, 100, 200],
            'model__learning_rate': [0.01, 0.1, 0.2],
            'model__max_depth': [3, 5, 7]
        }
    },
    'RandomForestRegressor': {
        'model': RandomForestRegressor(),
        'params': {
            'model__n_estimators': [10, 50, 100],
            'model__max_depth': [None, 10, 20, 30]
        }
    }
}

model_params_cls = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'params': {
            'model__n_estimators': [10, 50, 100],
            'model__max_depth': [None, 10, 20, 30]
        }
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'model__max_depth': [5, 10, 15]
        }
    },
    'svm': {
        'model': SVC(),
        'params': {
            'model__C': [1, 10, 100],
            'model__kernel': ['rbf', 'linear']
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(),
        'params': {
            'model__C': [0.1, 1, 10, 100],
            'model__solver': ['lbfgs']  
        }
    },
}


# Collect the scores for each model in a list

scores = []

for model_name, params in model_params_cls.items():
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model',params['model'])
    ])

    grid = GridSearchCV(pipe, params['params'], cv=5, n_jobs=-1, scoring='precision', return_train_score=True)
    grid.fit(X_train, y_train)
    scores.append({
        'model':model_name,
        'best_score': grid.best_score_,
        'test_score': grid.score(X_test, y_test),
        'best_params': grid.best_params_,
        'cv_results': grid.cv_results_
    })

for model_name, params in model_params_reg.items():
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model',params['model'])
    ])

    grid = GridSearchCV(pipe, params['params'], cv=5, n_jobs=-1, scoring='average_precision', return_train_score=True)
    grid.fit(X_train, y_train)
    scores.append({
        'model':model_name,
        'best_score': grid.best_score_,
        'test_score': grid.score(X_test, y_test),
        'best_params': grid.best_params_,
        'cv_results': grid.cv_results_
    })


df_results = pd.DataFrame(columns=['Model', 'Best Score', 'Test Score', 'Best Parameter', 'Parameter Value'])

rows_list = []

for score in scores:
    best_params = score['best_params']
    for param_name, param_value in best_params.items():
        row = {
            'Model': score['model'],
            'Best Score': score['best_score'],
            'Test Score': score['test_score'],
            'Best Parameter': param_name,
            'Parameter Value': param_value
        }
        rows_list.append(row)
df_results = pd.DataFrame(rows_list)

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(df_results)

                        Model  Best Score  Test Score        Best Parameter  \
0      RandomForestClassifier    0.981891    0.979798      model__max_depth   
1      RandomForestClassifier    0.981891    0.979798   model__n_estimators   
2               decision_tree    0.909864    0.960784      model__max_depth   
3                         svm    0.968000    0.989583              model__C   
4                         svm    0.968000    0.989583         model__kernel   
5          LogisticRegression    0.851884    0.942857              model__C   
6          LogisticRegression    0.851884    0.942857         model__solver   
7            LinearRegression    0.747753    0.808014  model__fit_intercept   
8   GradientBoostingRegressor    0.975399    0.992136  model__learning_rate   
9   GradientBoostingRegressor    0.975399    0.992136      model__max_depth   
10  GradientBoostingRegressor    0.975399    0.992136   model__n_estimators   
11      RandomForestRegressor    0.975317    0.99282

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

I expected the dataset to need classification. There were numerous boolean features, and my understanding is that the numerical data for loan applications would be judged against thresholds for things like income or credit. This held up, as the random forest classifier had the highest cross-validation score (0.982). However, the random forest regressor was very close behind it (0.975) and had a higher test score. Given how similar the results are, I believe one could make an argument for either, given the shape of the data.

For models, I just threw a ton at it: everything I could think of hyperparameters for. I figured I'd throw it all down the hallway and check them all out. I knew that classification would be the best fit, and had a hunch that SVC or random forests would work best, since they better emulate the threshold-based decision making of loan applications. It is easy to draw a tree for most generalizations about this.

The random forest worked the best in both classification and regression. Given how a random forest works (many threshold-based decision trees voting together) and the context of the data, this was completely understandable.
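
One caveat when comparing the classifier and regressor rows: the classifiers were scored with `precision` and the regressors with `average_precision`, and these are different quantities. The sketch below uses four invented scores to show that the same predictions can give different values under the two metrics; the 0.5 cut-off is an assumption for illustration.

```python
from sklearn.metrics import average_precision_score, precision_score

# Invented ground truth and model scores for four records
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Precision needs hard labels, so apply an assumed 0.5 cut-off
y_label = [1 if s >= 0.5 else 0 for s in y_score]
threshold_precision = precision_score(y_true, y_label)

# Average precision summarizes the ranking across all cut-offs
ranking_ap = average_precision_score(y_true, y_score)

print(threshold_precision)
print(ranking_ap)
```

Scoring every model with the same metric would make the rows directly comparable.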

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [None]:
# Calculate testing accuracy (1 mark)

# See testing data in the dataframe above.  
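
A minimal sketch of how the held-out precision could be computed explicitly, rather than read off the table. It uses synthetic data from `make_classification` as a stand-in for the loan data, and a deliberately small grid; `X_demo`, `y_demo` and the other names are illustrative, not the variables from Step 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1],
    class_sep=2.0, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(pipe, {"model__max_depth": [5, 10]}, cv=5, scoring="precision")
grid.fit(X_tr, y_tr)

# Predict on the held-out set with the refitted best estimator
y_pred = grid.predict(X_te)
test_precision = precision_score(y_te, y_pred, zero_division=0)
test_recall = recall_score(y_te, y_pred)

# With scoring="precision", grid.score reports the same test precision
grid_test_score = grid.score(X_te, y_te)
print(test_precision, test_recall, grid_test_score)
```

Reporting recall alongside precision shows how many true positives the model misses, which precision alone does not reveal.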


### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

Precision is the best accuracy measure for classification models where the consequences of failure (here, a false positive) are high. So I used this value for the classification models, and the similar average precision for the regression models.
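
As a worked example of what precision measures, here is a sketch with ten invented labels. Precision is the share of predicted positives that are truly positive, TP / (TP + FP), so it penalizes false positives but ignores missed positives.

```python
from sklearn.metrics import confusion_matrix

# Ten invented records: three true positives in the data
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

precision = tp / (tp + fp)                  # correct among predicted positives
recall = tp / (tp + fn)                     # found among actual positives
accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct among all records

print(tn, fp, fn, tp)
print(precision, recall, accuracy)
```

Here accuracy is 0.8 while precision is about 0.67, which illustrates why accuracy alone can look flattering on imbalanced data.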

These results were very comparable to those in part 3. In some cases the test scores were actually even higher than the cross-validation scores. Given the shape of the data and the regimented structure of loan applications, I believe that it was reasonably easy to get this precise.

This model had a 98-99% precision in both classification and regression. Given that the average default rate on loans written in Canada last year was 1.02%, and banks generally underwrite for up to 2.5%, I think this model is good enough to be used in the real world. That said, this model was pretty straightforward, with only a handful of features. I think there is much more information that could be put into the model, as well as some more hyperparameter tuning to be done.



## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

I sourced my code from Kaggle. I had originally sourced some Spotify data, but found it to be an absolute clusterf&*# to try and get any usable results from it, with any kind of accuracy over 30%. It was also about 30000 records long, so it took forever to compute, and when I took a smaller sample from it the accuracy dropped, which to me meant that we were overfitting the data some. So I went back and got this banking data, which was more 'black and white' and much easier for a chump like me who was just learning.

I did the steps in the same order as they are written here, though many times over because of the dataset confusion. The second time around it was much, much easier. I did go back and modify the code a bit the second time around, as I had discovered that adding more models was fairly straightforward once the pipelines were written, and you could add as many as you wanted without much consequence other than compute time.

I absolutely did use some generative AI. Lots of it was more or less "How does this correspond to this?" or "What order do these happen if I work things this way?" or "How does the grid scale the testing set aside from the training data when trying to score?" These were more learning uses. I didn't go back and change the code afterwards; it was largely about learning to write it accurately in the first place.
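
The scaling question above can be checked with a small sketch: when a scaler sits inside a `Pipeline`, it learns its mean and scale from the data passed to `fit` only, and reuses them for any later data. The arrays and names below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented one-feature training set and a very different test point
X_train_demo = np.array([[0.0], [2.0], [4.0], [6.0]])
y_train_demo = np.array([0, 0, 1, 1])
X_test_demo = np.array([[100.0]])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train_demo, y_train_demo)

# Statistics come from the training data only: mean 3.0, std sqrt(5)
scaler = pipe.named_steps["scaler"]
learned_mean = scaler.mean_[0]
learned_scale = scaler.scale_[0]

# Predicting on the test point does not refit the scaler
pipe.predict(X_test_demo)
mean_after_predict = scaler.mean_[0]

# The test point is scaled with the training statistics
scaled_test = scaler.transform(X_test_demo)[0, 0]
print(learned_mean, learned_scale, scaled_test)
```

`GridSearchCV` applies the same idea per fold, since it refits the whole pipeline on each training split before scoring the held-out split.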

The Spotify data at the beginning was an absolute headache. I was getting 20-30% accuracies for probably a solid day and thought it was an operator problem, or something wrong with my preprocessing, until I finally figured it was the data, as music popularity is so subjective. Once I took a more concrete set of data, these headaches went away and the learning went up.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

This one was a bit challenging for sure. I really struggled with the hyperparameter tuning parts, as they differ between all the models, so I liked that I was forced to use and learn them in a practical application. The 611 assignments are the best learning aids for sure: they are structured enough that you have some guidelines, but you still have to do a bunch of learning and thinking.

I thought this assignment was interesting, as I had to do some looking into what it takes to approve loans, so just the context of that made it somewhat interesting for me!