# Assignment 1: scikit-learn machine learning preprocessing pipeline

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions must be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in the dedicated cell, and do not modify or remove the test cells. After completing all the exercises, submit this notebook to Moodle in **.ipynb** format.

<div class="alert alert-success">

<b>About the datasets used in this assignment</b>

<u>Context</u>

Access to credit is a fundamental aspect of modern financial life, yet the decision to grant a loan is not always transparent. Loan approvals have traditionally been influenced by a variety of applicant characteristics, ranging from income and employment stability to demographic information. The dataset we will work with provides an overview of loan applicants, including personal, financial, and application-related attributes. These features allow us to explore how individual circumstances shape creditworthiness assessments.

<u>Content</u>
    
The columns included in the datasets are: person_age, person_name, person_gender, person_education, employment_type, person_income, person_emp_exp, person_home_ownership, bank_name, account_type, loan_amnt, loan_intent, loan_int_rate, loan_percent_income, cb_person_cred_hist_length, credit_score, previous_loan_defaults_on_file, loan_status

Column names are self-explanatory. 
    
 <u>Inspiration</u>

What are the characteristics of loan applicants that most influence approval decisions? Do factors such as income, employment history, credit score, or loan intent significantly impact whether a loan is approved? Let’s shed light on this critical financial question.

</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Friday, October 24th, 23:55</div>

In [284]:
# DO NOT MODIFY NOR ADD CODE TO THIS CELL
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/refs/heads/main/data/AI1_2025_assignments.csv')

<div class="alert alert-danger">
In the last part of this assignment, we will cover the importance of using three distinct datasets: training, validation, and test. However, for the autograded part of this assignment, we will conduct all calculations using a single dataset. This approach is fundamentally flawed, but it makes the initial part of the assignment easier to complete. For this reason, there is no accuracy evaluation in the guided part of this assignment.
</div>
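
As a rough illustration of the three-dataset idea mentioned above, the sketch below splits a small synthetic DataFrame into training, validation, and test sets by calling `train_test_split` twice. The toy data, the 60/20/20 proportions, the `_toy` variable names, and the `random_state` value are assumptions chosen for illustration; they are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, two features, binary target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y_toy = pd.Series(rng.integers(0, 2, size=100), name="target")

# First split: hold out 20% of the rows as the test set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
# Second split: 25% of the remaining 80% gives a 20% validation set.
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

The validation set would be used to compare models or settings, while the test set stays untouched until the final evaluation.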

<div class="alert alert-info"><b>Exercise 1: Creating the Feature Matrix and Target Array</b>

Write the code to create the feature matrix ```X``` and the target array ```y``` from the dataframe ```df```. When creating ```X```, make sure to drop or ignore the irrelevant columns: ```['person_name', 'bank_name', 'credit_score']```.

The target variable for this problem is ```loan_status```.

<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
    
Python is case-sensitive, so ensure your code matches the required capitalization.
Do **not** download the dataset manually. Instead, run the previous cell to load the data directly from the provided link.

</div>

In [285]:
# YOUR CODE HERE
X = df.drop(columns = ["person_name", "bank_name", "credit_score", "loan_status"])
y = df["loan_status"]

In [286]:
# LEAVE BLANK

In [287]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2: Imputing Missing Values </b> 

The first step in our preprocessing routine is to handle missing values in the feature matrix ```X```. Write the code to instantiate a ```SimpleImputer``` with a ```most_frequent``` strategy, naming it ```imputer```. Then, test the imputer transforming ```X```, and store the transformed data in a new DataFrame called ```X_imputed```.

<br><i>[0.5 points]</i>
</div>


In [288]:
# YOUR CODE HERE
from sklearn.impute import SimpleImputer


imputer = SimpleImputer(strategy='most_frequent')
X_imputed = pd.DataFrame(imputer.fit_transform(X))
X_imputed.head()


| | person_age | person_gender | person_education | employment_type | person_income | person_emp_exp | person_home_ownership | account_type | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | previous_loan_defaults_on_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | female | Master | contract | 71948.0 | 0.0 | RENT | saving | 35000.0 | PERSONAL | 16.02 | 0.49 | 3.0 | No |
| 1 | 21.0 | female | High School | contract | 12282.0 | 0.0 | OWN | checking | 1000.0 | EDUCATION | 11.14 | 0.08 | 2.0 | Yes |
| 2 | 25.0 | female | High School | self-employed | 12438.0 | 3.0 | MORTGAGE | checking | 5500.0 | MEDICAL | 12.87 | 0.44 | 3.0 | No |
| 3 | 23.0 | female | Bachelor | self-employed | 79753.0 | 0.0 | RENT | saving | 35000.0 | MEDICAL | 15.23 | 0.44 | 2.0 | No |
| 4 | 24.0 | male | Master | unemployed | 66135.0 | 1.0 | RENT | saving | 35000.0 | MEDICAL | 14.27 | 0.53 | 4.0 | No |


In [289]:
# LEAVE BLANK

In [290]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3: Encoding Categorical Features</b> 

Now that our dataset is free of missing values, let's handle the categorical columns. Create a `OneHotEncoder` object named `one_hot_encoder`. Next, create a DataFrame called `X_categorical` containing the following columns from `X_imputed`: `['person_gender','person_education', 'employment_type','person_emp_exp', 'person_home_ownership', 'loan_intent', 'account_type']`. 

Test the encoder by transforming the features of `X_categorical`, and store the transformed data in a new DataFrame named `X_categorical_encoded`.

<br><i>[0.75 points]</i>
</div>

In [291]:
# YOUR CODE HERE
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_categorical = pd.DataFrame(X_imputed[['person_gender', 'person_education', 'employment_type', 'person_emp_exp', 'person_home_ownership', 'loan_intent', 'account_type']])
X_categorical_encoded = one_hot_encoder.fit_transform(X_categorical)

In [292]:
# LEAVE BLANK

In [293]:
# LEAVE BLANK

In [294]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4: Encoding Ordinal Features </b> 

Next, repeat the process for the ordinal features `['person_education', 'previous_loan_defaults_on_file']`. First, create a new DataFrame called `X_ordinal` containing these columns. Then, instantiate an `OrdinalEncoder` and name it `ordinal_encoder`. 

Test the encoder by transforming the `X_ordinal` DataFrame, and store the transformed data in a new DataFrame called `X_ordinal_encoded`.

<br><i>[0.75 points]</i>
</div>

<div class="alert alert-warning">
    
Consider that the integer values assigned to each label should align with a meaningful interpretation of the label's significance.

</div>
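
To make the hint above concrete, here is a minimal sketch of how the `categories` argument of `OrdinalEncoder` controls which integer each label receives. The toy column `size` and its low-to-high ordering are hypothetical and unrelated to the loan dataset. If `categories` were left at its default, the labels would be sorted alphabetically instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with an assumed low-to-high ordering.
toy = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Passing categories explicitly fixes the integer assigned to each label:
# small -> 0, medium -> 1, large -> 2.
toy_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
toy_encoded = toy_encoder.fit_transform(toy)

# np.asarray handles both NumPy and pandas output configurations.
print(np.asarray(toy_encoded).ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```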

In [295]:
# YOUR CODE HERE
from sklearn.preprocessing import OrdinalEncoder
categories = [['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate'], ['No', 'Yes']]
ordinal_encoder = OrdinalEncoder(categories = categories)
X_ordinal = X_imputed[['person_education', 'previous_loan_defaults_on_file']]
X_ordinal_encoded = pd.DataFrame(ordinal_encoder.fit_transform(X_ordinal))

In [296]:
# LEAVE BLANK

In [297]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5: Combining Feature Transformations </b> 

Now that we have confirmed the transformations for categorical and ordinal columns, let's use a `ColumnTransformer` to apply them in parallel. Instantiate a `ColumnTransformer` named `transformer`, including both the `OneHotEncoder` and `OrdinalEncoder`. Be sure to specify the correct column names for each transformer.

Test your `transformer` by applying it to the `X_imputed` DataFrame, and store the transformed data in a new DataFrame called `X_encoded`.

<br><i>[1 point]</i>
</div>


In [298]:
# YOUR CODE HERE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

one_hot_cols = ['person_gender', 'employment_type', 'person_emp_exp', 'person_home_ownership', 'loan_intent', 'account_type']
ordinal_cols = ['person_education', 'previous_loan_defaults_on_file']
categories = [['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate'], ['No', 'Yes']]
transformer = ColumnTransformer([('one_hot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), one_hot_cols),
                                 ('ordinal', OrdinalEncoder(categories = categories), ordinal_cols)], 
                                 remainder='passthrough')
X_encoded = pd.DataFrame(transformer.fit_transform(X_imputed))

In [299]:
# LEAVE BLANK

In [300]:
# LEAVE BLANK

In [301]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 6: Standardizing the Features </b> 

To prevent potential issues with feature scaling, we will standardize the features using a `StandardScaler`. First, instantiate a `StandardScaler` and assign it to the variable `scaler`. Then, test it by transforming the `X_encoded` DataFrame. Store the scaled data in a new DataFrame called `X_scaled`.

 <br><i>[0.5 points]</i>
</div>

In [302]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_encoded))

In [303]:
# LEAVE BLANK

In [304]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 7: Building the Preprocessing Pipeline </b> 

To complete this part of the assignment, create a `Pipeline` named `pipe` that includes the imputer, transformer, and scaler from the previous exercises. Test the pipeline by transforming the original feature matrix `X`, and store the preprocessed data in a new DataFrame called `X_pipe`.

<br><i>[1 point]</i>
</div>

<div class='alert alert-warning'>

Be sure you apply the data transformations in the correct order.

</div>

In [305]:
# YOUR CODE HERE
from sklearn.pipeline import Pipeline

pipe = Pipeline([('imputer', imputer), 
                ('transformer', transformer),
                ('scaler', scaler)])

X_pipe = pd.DataFrame(pipe.fit_transform(X))

In [306]:
# LEAVE BLANK

In [307]:
# LEAVE BLANK

In [308]:
# LEAVE BLANK

In [309]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 8: End-to-End Preprocessing on a Regression problem </b> 

Now, apply everything you’ve learned to a **regression** problem. Run the next cell to reload the dataset ```df```, making sure it hasn’t been modified during the first part of the assignment.

**Your tasks are:**

1. **Feature Selection and Engineering:**  
   Create a feature matrix `X` and a target array `y` (now using the `credit_score` variable). Drop any irrelevant columns and explain your reasoning for each column you choose to exclude (e.g., loan_status). If you find a column relevant, consider combining it with other existing columns or creating new ones based on the dataset’s features. Be careful when doing this, and remember that you cannot use any information from the test data for feature engineering.

2. **Perform train-test split:**
   Create two separate datasets: one for training and one for testing, to properly evaluate the performance of your model. Select the evaluation metric that best fits your project or business goals, and provide a clear rationale for the chosen metric.

3. **Encoding Categorical and Ordinal Features:**  
   Identify the categorical and ordinal columns, and encode them using a `ColumnTransformer` to apply the transformations in parallel.

4. **Handling Missing Data:**  
   If there are missing values in your feature matrix, decide on an appropriate method to handle them (e.g., mean).

5. **Standardizing Features:**  
   Assess whether standardization is necessary for your numerical features, and apply it if needed. Justify your decision.

6. **Building a Pipeline:**  
   Create a `Pipeline` that integrates all the preprocessing steps you have applied.

7. **Select the appropriate regression model:** 
   Scikit-learn offers several regression models. Try different options to identify a suitable one. There is no need to perform an extensive grid search at this stage; we will cover that in the second assignment.

8. **Documentation:**
   Remember that thoroughly documenting your code and clearly explaining why certain decisions were made—while also considering and justifying why other options were not chosen—will be highly evaluated. You can use plots to support and justify your decisions.


<br><i>[5 points]</i>
</div>



In [310]:
# DO NOT MODIFY NOR ADD CODE TO THIS CELL

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/refs/heads/main/data/AI1_2025_assignments.csv')

0) Importing everything this exercise needs.

In [311]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

1.1)Splitting the features from the target variable   
   
`person_name` and `bank_name` are dropped because they are identifiers with no expected predictive value. `credit_score` is the target variable, so it is excluded from the features. 
     
`loan_status` is dropped because it risks target leakage: the approval decision was presumably made with knowledge of the credit score, so the model could learn to decode the score from the status instead of learning the real relationships between the applicant's features and the credit score.  
  
`loan_int_rate` could leak information in the same way, since the rate may have been set using the credit score. Because we have no further information on how the two variables are related, we decided to keep it. 

In [312]:
X = df.drop(columns = ["person_name", "bank_name", "loan_status", "credit_score"])
y = df["credit_score"]

1.2)Checking missing values in the target variable

In [313]:
y.isna().sum()

np.int64(193)

We observe that the target variable y has 193 missing values. Therefore we remove all rows for which the target variable is missing so that the model is only trained on complete cases.

In [314]:
X = X[~y.isna()]
y = y.dropna()

The 193 rows for which the target variable y was missing are now removed from the dataset.
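
The row removal above relies on `X` and `y` sharing the same index. A minimal sketch of the same idea on hypothetical toy data, building the mask once and applying it to both objects so that they stay aligned:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing target value.
X_demo = pd.DataFrame({"feature": [10, 20, 30, 40]})
y_demo = pd.Series([1.0, np.nan, 3.0, 4.0], name="target")

# Build the mask once and apply it to both objects so rows stay aligned.
has_target = y_demo.notna()
X_demo = X_demo[has_target]
y_demo = y_demo[has_target]

print(X_demo.index.equals(y_demo.index), len(y_demo))  # True 3
```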

2)Train-Test Split   
   
We divided the dataset into training and testing sets using an 80/20 split to properly evaluate model performance on unseen data. The training set is used to fit the model, while the test set serves to assess its generalization ability. A fixed random state (29) was set to ensure reproducibility of the results.  
  
This approach gives a fair and consistent estimate of model performance before deployment. A held-out test set does not prevent overfitting, but it exposes it, so the reported metrics reflect performance on data the model has not seen.

In [315]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=29)

Understanding what the missing values are:

In [316]:
missing_values = X.isna().sum()
missing_values

person_age                        213
person_gender                     210
person_education                  196
employment_type                   226
person_income                     196
person_emp_exp                    210
person_home_ownership             211
account_type                      232
loan_amnt                         209
loan_intent                       190
loan_int_rate                     212
loan_percent_income               209
cb_person_cred_hist_length        201
previous_loan_defaults_on_file    208
dtype: int64

Each feature has roughly 190–230 missing values out of about 45,000 rows (~0.5%).  
The proportion is low and similar across columns, so we assume the values are missing at random rather than systematically; we did not test this assumption.  

Given the small proportion, simple imputation (mean for numeric, most frequent for categorical) is a reasonable choice.  
Dropping every row with at least one missing value could discard up to 2,923 rows (about 6.5%, if the gaps never overlap), which would remove usable data for little benefit.
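
As a small sketch of what the two imputation strategies do, the example below applies `SimpleImputer` to hypothetical toy columns (the values are made up and not taken from the loan dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, one numeric and one categorical.
num_demo = pd.DataFrame({"income": [10.0, 20.0, np.nan, 30.0]})
cat_demo = pd.DataFrame(
    {"intent": ["MEDICAL", np.nan, "MEDICAL", "PERSONAL"]}, dtype=object
)

# Mean imputation replaces the gap with the mean of the observed values.
num_filled = np.asarray(SimpleImputer(strategy="mean").fit_transform(num_demo))
# Most-frequent imputation replaces the gap with the modal category.
cat_filled = np.asarray(
    SimpleImputer(strategy="most_frequent").fit_transform(cat_demo)
)

print(num_filled.ravel().tolist())  # [10.0, 20.0, 20.0, 30.0]
print(cat_filled.ravel().tolist())
```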


Check if standardization is necessary:

In [317]:
numeric_columns = ['person_age', 'person_income', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
X[numeric_columns].describe()

| | person_age | person_income | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|
| count | 44594.0 | 44611.0 | 44598.0 | 44595.0 | 44598.0 | 44606.0 |
| mean | 27.76757 | 80371.89 | 9580.859231 | 11.006376 | 0.139672 | 5.867238 |
| std | 6.051365 | 80640.9 | 6315.792729 | 2.980339 | 0.087199 | 3.879931 |
| min | 20.0 | 8000.0 | 500.0 | 5.42 | 0.0 | 2.0 |
| 25% | 24.0 | 47202.0 | 5000.0 | 8.59 | 0.07 | 3.0 |
| 50% | 26.0 | 67055.0 | 8000.0 | 11.01 | 0.12 | 4.0 |
| 75% | 30.0 | 95877.5 | 12241.0 | 13.0 | 0.19 | 8.0 |
| max | 144.0 | 7200766.0 | 35000.0 | 20.0 | 0.66 | 30.0 |


The descriptive statistics show that the numerical features differ by several orders of magnitude in scale: `person_income` has a standard deviation of about 80,600, while `loan_percent_income` has one of about 0.09. Standardization matters most for Ridge Regression, whose penalty depends on coefficient magnitudes and therefore on feature scale; it does not change the predictions of ordinary least squares or of tree-based models. We will therefore center and scale all numerical variables before training, so that the regularized model treats them comparably.
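
A minimal sketch of the effect of `StandardScaler`, using hypothetical toy values on two very different scales (the numbers are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy columns on very different scales (e.g. age vs. income).
toy_values = np.array([[20.0, 10_000.0],
                       [30.0, 50_000.0],
                       [40.0, 90_000.0]])

# np.asarray handles both NumPy and pandas output configurations.
scaled = np.asarray(StandardScaler().fit_transform(toy_values))

# After scaling, every column has mean 0 and (population) standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```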

3,4,5,6)Data Preprocessing Pipelines and Column Transformer  
  
We encoded categorical and ordinal features using OneHotEncoder and OrdinalEncoder within a ColumnTransformer to apply transformations in parallel. One-hot encoding was used to avoid implying order between categories, while ordinal encoding preserved the logical ranking of ordered variables.  
  
Missing values were handled using SimpleImputer: the most frequent value for categorical data, which keeps every imputed entry a valid existing category, and the mean for numerical data, which leaves each column's mean unchanged.  
  
Numerical features were standardized with StandardScaler to ensure all variables are on a comparable scale, preventing features with large ranges from dominating the model.  
  
Finally, all preprocessing steps were integrated into a single Pipeline to keep the process consistent and reproducible, ensuring identical transformations during training and testing.  

In [318]:
imputer_mode = SimpleImputer(strategy='most_frequent')
imputer_mean = SimpleImputer(strategy='mean')
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ordinal_encoder = OrdinalEncoder(categories = [['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate'], ['No', 'Yes']])
scaler = StandardScaler()

cat_nominal = ['person_gender', 'employment_type', 'person_emp_exp', 'person_home_ownership', 'account_type', 'loan_intent']
cat_ordinal = ['person_education', 'previous_loan_defaults_on_file']
num_cols = ['person_age', 'person_income', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']

cat_nominal_pipe = Pipeline([
    ('imputer', imputer_mode),
    ('one_hot', one_hot_encoder)])

cat_ordinal_pipe = Pipeline([
    ('imputer', imputer_mode),
    ('ordinal', ordinal_encoder)])

num_pipe = Pipeline([
    ('imputer', imputer_mean), 
    ('scaling', scaler)])

transformer = ColumnTransformer([('cat_nominal', cat_nominal_pipe, cat_nominal),
                                 ('cat_ordinal', cat_ordinal_pipe, cat_ordinal),
                                 ('num', num_pipe, num_cols)], 
                                 remainder = 'passthrough')

pipe = Pipeline([('transformer', transformer)])

X_train_prepared = pipe.fit_transform(X_train)
X_test_prepared = pipe.transform(X_test)

Evaluation Metrics: R² and MSE  
  
We will use R² (Coefficient of Determination) and MSE (Mean Squared Error) to evaluate our regression model.  
  
R² will show how well the model explains the variability in credit scores, indicating how effectively the features capture the underlying relationships in the data.  
  
MSE will measure the average squared difference between predicted and actual credit scores, penalizing large errors more strongly.
  
Together, these metrics will provide a balanced view of model fit and prediction accuracy, ensuring reliable evaluation of our credit score predictions.  
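
For reference, both metrics can be computed by hand. The sketch below does so on hypothetical true and predicted values (made-up numbers, not model output) and can be checked against `mean_squared_error` and `r2_score` from scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values.
y_true_demo = np.array([600.0, 650.0, 700.0, 750.0])
y_pred_demo = np.array([610.0, 640.0, 690.0, 760.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# R²: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

An R² of 0 corresponds to always predicting the mean of the target, so values close to 0 indicate little improvement over that trivial baseline.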

Regression Models: Linear, Ridge, and Random Forest  
  
We will test three regression models: Linear Regression, Ridge Regression, and Random Forest Regressor to identify which performs best for predicting credit scores.  
  
Linear Regression will be used as a simple baseline to assess linear relationships between the features and the target.  
Ridge Regression will help reduce overfitting and address multicollinearity through regularization.  
Random Forest will capture more complex and non-linear patterns in the data, improving prediction accuracy.  
  
By comparing these models, we will evaluate both interpretability and predictive power to select the most appropriate one for this task.

In [320]:
lin_reg = LinearRegression()
ridge_reg = Ridge(alpha=1.0)
rf_reg = RandomForestRegressor(random_state=29)

lin_reg.fit(X_train_prepared, y_train)
ridge_reg.fit(X_train_prepared, y_train)
rf_reg.fit(X_train_prepared, y_train)

y_pred_lin = lin_reg.predict(X_test_prepared)
y_pred_ridge = ridge_reg.predict(X_test_prepared)
y_pred_rf = rf_reg.predict(X_test_prepared)

results = {
    "Linear Regression": (r2_score(y_test, y_pred_lin), mean_squared_error(y_test, y_pred_lin)),
    "Ridge Regression": (r2_score(y_test, y_pred_ridge), mean_squared_error(y_test, y_pred_ridge)),
    "Random Forest": (r2_score(y_test, y_pred_rf), mean_squared_error(y_test, y_pred_rf))
}

for name, (r2, mse) in results.items():
    print(f"{name}: R² = {r2}, MSE = {mse}")

Linear Regression: R² = 0.11804367896678425, MSE = 2304.8954972123433
Ridge Regression: R² = 0.1193941044679876, MSE = 2301.36630922104
Random Forest: R² = 0.07665564533357205, MSE = 2413.0585548203526


The results show that all three models have a low R² (about 0.08 to 0.12), so they explain only a small portion of the variability in credit scores. Ridge Regression scores highest, but its margin over Linear Regression is tiny (R² of 0.1194 versus 0.1180), so regularization has little practical effect here. Random Forest, run with default hyperparameters, performs worst (R² of 0.0767); this suggests that the features carry no strong non-linear signal that the trees can exploit, although we did not tune the model. The MSE values give the same ranking. Overall, the relationships between the features and credit scores are weak, and further feature engineering or additional data might be needed to improve model performance.
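
Because MSE is expressed in squared credit-score points, taking its square root (RMSE) makes the error easier to interpret. The sketch below converts the MSE values printed above; the dictionary names are only illustrative.

```python
import math

# MSE values copied from the printed results above.
reported_mse = {
    "Linear Regression": 2304.8954972123433,
    "Ridge Regression": 2301.36630922104,
    "Random Forest": 2413.0585548203526,
}

# RMSE is the square root of MSE, expressed in credit-score points.
reported_rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, rmse in reported_rmse.items():
    print(f"{name}: RMSE = {rmse:.1f}")
```

All three models are therefore off by roughly 48 to 49 points in the root-mean-square sense.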