# COMP0189: Applied Artificial Intelligence
## Week 1 (Data Preprocessing)

### After this week you will be able to ...
- Load datasets using scikit-learn.
- Appreciate the importance of exploratory data analysis (EDA).
- Apply various preprocessing techniques (scaling, encoding, handling missing values).
- Compare the impact of preprocessing on model performance.

### Acknowledgements
- https://github.com/UCLAIS/Machine-Learning-Tutorials
- https://www.cs.columbia.edu/~amueller/comsw4995s19/schedule/
- https://scikit-learn.org/stable/
- https://archive.ics.uci.edu/ml/datasets/adult

## Introduction to Scikit-learn


Why do we use sklearn?

1. Example Datasets
    - sklearn.datasets : Provides example datasets

2. Feature Engineering  
    - sklearn.preprocessing : Various functions for data preprocessing
    - sklearn.feature_selection : Helps select the most informative features in a dataset
    - sklearn.feature_extraction : Vectorised feature extraction
    - sklearn.decomposition : Dimensionality reduction algorithms

3. Data split and Parameter Tuning  
    - sklearn.model_selection : Train/test split, cross-validation, parameter tuning with GridSearchCV

4. Evaluation  
    - sklearn.metrics : accuracy score, ROC curve, F1 score, etc.

5. ML Algorithms
    - sklearn.ensemble : Ensemble methods (Random Forest, Gradient Boosting, etc.)
    - sklearn.linear_model : Linear Regression, Logistic Regression, etc.
    - sklearn.naive_bayes : Gaussian Naive Bayes classification, etc.
    - sklearn.neighbors : Nearest Centroid classification, etc.
    - sklearn.svm : Support Vector Machine
    - sklearn.tree : DecisionTreeClassifier, etc.
    - sklearn.cluster : Clustering (Unsupervised Learning)

6. Utilities  
    - sklearn.pipeline: pipeline of (feature engineering -> ML Algorithms -> Prediction)

7. Train and Predict  
    - fit()
    - predict()

8. and more...
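
To make the list above concrete, here is a minimal sketch of the typical workflow: load a dataset, split it, chain a scaler and a model in a pipeline, then call `fit()` and `predict()`. The iris dataset, the KNN classifier and all parameter values are illustrative choices rather than part of this week's exercises, and the sketch assumes scikit-learn is already installed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset (iris is used here purely for illustration)
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Chain preprocessing and a model; fit() learns the scaler statistics
# and the classifier from the training data only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline.fit(X_train, y_train)

# predict() applies the same scaling to the test data before classifying
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline means the test data is never used to compute scaling statistics.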

In [None]:
%pip install scikit-learn==1.6.1 matplotlib seaborn pandas

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

## 1. Exploratory Data Analysis (EDA)
In this section, you will use two datasets to illustrate EDA for: 1) a regression task and 2) a classification task.

In [None]:
from sklearn.utils import Bunch

def load_boston():
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\\s+", skiprows=22, header=None)
    row_first_lines = raw_df.values[::2, :]
    row_second_lines = raw_df.values[1::2, :2]

    # Read the dataset description; the context manager closes the file afterwards
    with open("boston_description.txt", "r") as description_file:
        description = description_file.read()

    data = np.hstack([row_first_lines[~np.isnan(row_first_lines)].reshape(506, 11), row_second_lines])
    target = raw_df.values[1::2, 2]
    feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

    return Bunch(data=data, target=target, feature_names=feature_names, DESCR=description)

### **1.1 Boston House Price Dataset** (regression task)

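Let's first take a look at the Boston House Price dataset. `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, but we will use the dataset here for educational purposes, which is why a custom `load_boston()` loader is defined above.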



> Take some time to look at the different predictor variables. What do they mean and how do you expect them to influence the target variable (median house price)?






In [None]:
boston = load_boston()
print(boston.DESCR)

In [None]:
boston.keys()

In [None]:
print('Boston dataset feature names: ', boston.feature_names)
print('Number of features: ', len(boston.feature_names))

**Exploratory data analysis**

Why is this useful?
- Understand the dataset's characteristics in more detail (range, distribution, inter-variable relationships).
- Identify necessary preprocessing steps (handling missing values and outliers, encoding and scaling features, etc.).
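
As a rough illustration of these checks, the sketch below runs them on a tiny made-up table. The `toy` DataFrame, its column names and its values are hypothetical, and the 1.5 * IQR rule is only one common outlier heuristic; apply the same ideas to the real datasets in the exercises.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset, made up for illustration only
toy = pd.DataFrame({
    "rooms": [4.0, 5.0, 6.0, 5.5, np.nan, 6.5],
    "crime": [0.1, 0.3, 0.2, 9.0, 0.4, 0.2],
    "price": [20.0, 24.0, 30.0, 12.0, 22.0, 33.0],
})

# Range and distribution of each column
print(toy.describe())

# Number of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Flag potential outliers in one column with the 1.5 * IQR rule
q1, q3 = toy["crime"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["crime"] < q1 - 1.5 * iqr) | (toy["crime"] > q3 + 1.5 * iqr)
print(toy[outlier_mask])

# Pairwise Pearson correlations with the target column
correlations = toy.corr()
print(correlations["price"])
```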

In [None]:
# convert the dataset into a dataframe
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# extract the target variable
df['MEDV'] = boston.target

Basic statistics

In [None]:
print("Dataset sample:")
print(df.head())
print("Basic statistics:")
print(df.describe())

Visualise the feature and target distributions - are all features continuous?

In [None]:
# Your code here...


Are there any missing values?

In [None]:
# Your code here...

Are there any outliers?

> Hint: you can use boxplots

In [None]:
# Your code here...

Investigate the relationships between features and the outcome variable.

> Hint: a correlation map may be useful

In [None]:
# Your code here...

Notice how the features span very different ranges. The fourth feature (CHAS) is even binary. Many algorithms (especially distance-based ones such as KNN and SVM) perform poorly when input features are on such different scales.

Based on the EDA, what do you observe?
- Do some features need encoding?
- Do all features share a similar range, or would they need scaling?
- Are there any outliers or missing values that need to be taken care of?

By addressing these questions, we inform our preprocessing choices and make sure that the data is properly prepared for the models.

### **1.2 Wine Dataset** (classification task)

In [None]:
from sklearn.datasets import load_wine

In [None]:
# Load and describe the dataset
wine = load_wine()
print(wine.DESCR)

In [None]:
wine.keys()

**Exploratory data analysis**

Just like for the Boston housing dataset, you need to analyse the dataset's features and their relationships with the target variable.

In [None]:
# convert into a dataframe
df = pd.DataFrame(wine.data, columns=wine.feature_names)
# extract the target variable
df['Class'] = wine.target

Basic statistics




> Is the dataset balanced?



In [None]:
print("Dataset sample: ")
print(df.head())
print("Basic statistics:")
print(df.describe())

print("Dataset balance: ", df['Class'].value_counts())

How are the features distributed?

In [None]:
# Your code here...

Are there any missing values? Outliers?

In [None]:
# Your code here...

How do the features correlate with the outcome?

> Hint: plot feature distributions stratified on outcome.

In [None]:
# Your code here...

## 2. Investigating the impact of preprocessing on different models

In this section, you have to design: <br>
<br>1) A model to predict the wine class of different wine samples (wine dataset).
<br>2) A model to predict the house price (boston dataset).

Try out different models and compare them.


Based on the EDA, what preprocessing steps are needed? Try to compare different preprocessing methods (e.g., feature scaling methods) to assess their impact on the models.

Helpful imports

In [None]:
# some models..
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR


# evaluation metrics..
from sklearn.metrics import accuracy_score, root_mean_squared_error, r2_score


# different scalers to try out..
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# if you want to make a pipeline
from sklearn.pipeline import make_pipeline

# to split data into train and test sets..
from sklearn.model_selection import train_test_split

### Example: Impact of feature scaling

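Normalization (min-max scaling, `MinMaxScaler`) rescales each input variable separately to the range 0-1. Note that sklearn's `Normalizer` does something different: it rescales each sample (row) to unit norm.  
Standardization (`StandardScaler`) scales each input variable separately by subtracting the mean (centering) and dividing by the standard deviation, so that the resulting distribution has a mean of zero and a standard deviation of one.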
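
The sketch below spells out these transformations with plain NumPy and compares them with the corresponding sklearn classes. The array `X` is made up for illustration; `Normalizer`, which is imported above, is included to show that it works per sample (row) rather than per feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales (made-up values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max scaling: (x - min) / (max - min), computed per column
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, computed per column
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer works per row: each sample is divided by its L2 norm
manual_unit_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))
print(np.allclose(manual_unit_norm, Normalizer().fit_transform(X)))
```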

#### Example usage of sklearn.preprocessing.StandardScaler

In [None]:
# Example
unscaled_data = np.asarray([[100, 0.001],
 [8, 0.05],
 [50, 0.005],
 [88, 0.07],
 [4, 0.1]])
# define standard scaler
scaler = StandardScaler()
# transform data
scaled_data = scaler.fit_transform(unscaled_data)

In [None]:
pd.DataFrame(unscaled_data).hist()

In [None]:
pd.DataFrame(scaled_data).hist()

In [None]:
del scaled_data, unscaled_data, scaler

**Tasks**  
- Try using different scaling methods, such as `MinMaxScaler` and `Normalizer`. Do you see a difference in the histograms?
- Experiment with the effects of different feature scaling methods on various ML algorithms, e.g. KNN, SVM, Decision Tree.

#### Scaled vs. Unscaled Wine Dataset

In [None]:
RANDOM_STATE = 42
# We are using the wine dataset
features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=RANDOM_STATE)

In [None]:
# Define scalers and models
scalers = {
    # None
}

models = {
    # None
}

# Store results
results = {}

# Iterate over each scaler
#None



    # Iterate over each model
    #None


        # Fit and predict with unscaled and scaled data
        model.fit(X_train, y_train)
        unscaled_y_hat = model.predict(X_test)
        unscaled_acc = accuracy_score(y_test, unscaled_y_hat)

        model.fit(scaled_X_train, y_train)
        scaled_y_hat = model.predict(scaled_X_test)
        scaled_acc = accuracy_score(y_test, scaled_y_hat)

        # Store results
        results[key] = {
            'Unscaled Accuracy': unscaled_acc,
            'Scaled Accuracy': scaled_acc
        }



results_df = pd.DataFrame(results).T
results_df



In [None]:
# Load the Boston dataset
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define scalers and models
scalers = {
    #None
}

models = {
    #None
}

# Store results
results = {}

# Iterate over each scaler
#None


    # Iterate over each model
    #None


        # Fit and predict with unscaled and scaled data
        model.fit(X_train, y_train)
        unscaled_y_hat = model.predict(X_test)
        # root_mean_squared_error already returns the RMSE and has no `squared` argument
        unscaled_rmse = root_mean_squared_error(y_test, unscaled_y_hat)
        unscaled_r2 = r2_score(y_test, unscaled_y_hat)

        model.fit(scaled_X_train, y_train)
        scaled_y_hat = model.predict(scaled_X_test)
        scaled_rmse = root_mean_squared_error(y_test, scaled_y_hat)
        scaled_r2 = r2_score(y_test, scaled_y_hat)

        # Store results
        results[key] = {
            'Unscaled RMSE': unscaled_rmse,
            'Unscaled R2': unscaled_r2,
            'Scaled RMSE': scaled_rmse,
            'Scaled R2': scaled_r2
        }

# Convert results to DataFrame for better readability
results_df = pd.DataFrame(results).T
results_df


### Example: Impact of different preprocessing strategies for train and test data

Do you see the difference in RMSE?  
**Question**  
Above, we also scaled the test set.   
Using the same code, see what happens if you don't scale the test data and predict based on the unscaled data.

In [None]:
# Store results
results = {}

# Iterate over each scaler
for scaler_name, scaler in scalers.items():
    #None

    # Iterate over each model
    for model_name, model in models.items():
        #None

        # Fit with scaled data


        # Predict with unscaled test data


        # Predict with scaled test data


        # Store results
        results[key] = {
            'RMSE with Unscaled Test Data': unscaled_rmse,
            'RMSE with Scaled Test Data': scaled_rmse
        }

results_df = pd.DataFrame(results)
results_df

# Display results
#for key, value in results.items():
#    print(key, value)

## 3. Working with a messier dataset: **Adult Census Dataset**

Classification task: predict whether income >50K

### 3.1 Exploratory data analysis
Same as before, analyse the dataset to identify the necessary preprocessing steps.

In [None]:
# Open the csv file and skim through it. It does not have column names
# so we will allocate names to each column

# Naming the Columns
names = ['age','workclass','fnlwgt','education',
        'marital-status','occupation','relationship','race','sex',
        'capital-gain','capital-loss','hours-per-week','native-country',
        'y']

# Load the dataset, treating ' ?' as a missing value
df = pd.read_csv('adult.data', delimiter=',', names=names, na_values=' ?')

# Number of observations
print("Number of observations: ", len(df))

# Look at sample of data
df.head(5)

Basic statistics

In [None]:
df.describe()

Is the dataset balanced?

In [None]:
print("Dataset balance: ", df['y'].value_counts())

Distribution of features and relationship to the target variable

> Hint: plot feature distributions stratified on the target variable

> Inspect the range and cardinality of the different features. Is the range consistent across the numerical features? How many categories do categorical variables have?


In [None]:
# Your code here...

Check for missing values

In [None]:
# Your code here...

In [None]:
# Examine the 15th row of the DataFrame - notice the NaN
row_15 = df.iloc[14]
print(row_15)

### 3.2 Data preprocessing
Based on the EDA, what type of preprocessing is needed?
- Do features need to be scaled and encoded?
- Do missing values need to be imputed?

A quick data fix

In [None]:
# For now, we will drop the rows with missing (NA) values
df = df.dropna()
len(df)

In [None]:
# Task: Get the unique values in the race and y columns
df['race'].unique(), df['y'].unique()

In [None]:
# The values have a redundant leading space. Remove it.
df['race'] = df['race'].apply(lambda x: x.strip())

In [None]:
df['race'].unique(), df['y'].unique(), df['occupation'].unique()

Hmmm, it's not just the race and y columns.

In [None]:
# Let's try to apply this to all the string-valued columns
for col_name in df.columns:
    if df[col_name].dtype == object:  # Checking for object type (string in pandas)
        df[col_name] = df[col_name].apply(lambda x: x.strip() if isinstance(x, str) else x)


Check the data

In [None]:
for col_name in df.columns:
    if 'int' not in str(df[col_name].dtype):
        print(df[col_name].unique())

All done! You can now start data preprocessing such as encoding and imputation.

#### **TASK 1: Encoding categorical variables** (label/ordinal encoding & one-hot encoding)

Important: We need special care when we are encoding categorical variables

**1. Take care of the missing values**
- Beware not to encode missing values unless you are intending to do so.
- Sometimes you may want to encode missing values as a separate category. For example, when predicting whether passengers of the Titanic survived, missingness in certain features can itself be informative, e.g., Cabin information may be missing because the body was not found.

**2. Know which encoding and scaling method you should select**
- If your categories are ordinal, then it makes sense to use an ordinal encoding (`OrdinalEncoder` for features; `LabelEncoder` is intended for target labels), optionally followed by a MinMaxScaler. For example, you can encode [low, medium, high] as [1, 2, 3], so that the distance between low and high is larger than that between medium and high.

- However, if you have non-ordinal categorical values, like [White, Hispanic, Black, Asian], then it would be better to use a OneHotEncoder instead of forcing ordinality with an ordinal encoding. Otherwise the algorithms you use (especially distance-based algorithms like KNN) will assume that the distance between White and Asian is larger than that between White and Hispanic, which is nonsensical.

**3. Split before you encode to avoid data leakage**
- If you train a model using a train/test split, you should split the dataset before you encode your data. It is natural for the validation/test set to contain categories that did not appear in the training set. `sklearn.preprocessing.OneHotEncoder` can handle these unknown categories (see the `handle_unknown` parameter).

- Discussion: What if you are certain about all the possible categories that can appear for each feature? Can you encode all the values before splitting the dataset into train and test set?


This notebook shows the three points in the following sections with examples.



> What type of encoding is most appropriate for the different categorical features? For example, should education and native-country be encoded using the same technique?



In [None]:
# Import encoders from sklearn
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder

In [None]:
# Ordinal Encoding for 'education'
# Define the order of education categories
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Prof-school', 'Bachelors', 'Masters', 'Doctorate']
# Initialize the OrdinalEncoder with the specified categories
ordinal_encoder = #None
# Apply the OrdinalEncoder to the 'education' column
df['education_encoded'] = #None

# Check resulting education ordering
edu_map = {}
for i, row in df[["education", "education_encoded"]].iterrows():
    education = row["education"]
    edu_num = row["education_encoded"]

    if education not in edu_map:
        edu_map.update({education: edu_num})
    else:
        assert edu_map[education] == edu_num
edu_map


In [None]:
# OneHotEncoding for nominal features without an implied order
# Including the previously missed nominal columns
nominal_columns = #None
onehot_encoder = #None
onehot_encoded_columns = #None
column_names = onehot_encoder.get_feature_names_out(nominal_columns)
df_onehot_encoded = pd.DataFrame(onehot_encoded_columns, columns=column_names)

# Integrate these new columns back into the original dataframe
df = df.reset_index(drop=True)  # Reset index to align with the new onehot encoded DataFrame
df = pd.concat([df, df_onehot_encoded], axis=1)

# Optionally, remove the categorical columns if no longer needed
df.drop(columns=nominal_columns + ['education'], inplace=True)

# Label Encoding for the target variable
label_encoder = LabelEncoder()
df['y_encoded'] = label_encoder.fit_transform(df['y'])

# Remove the original 'y' column if no longer needed
df.drop(columns=['y'], inplace=True)

Check your encoding results

In [None]:
# Display the first few rows of the modified DataFrame
df.head(10)

#### **TASK 2: Dealing with missing data** - imputation strategies

When processing the data earlier, we simply dropped the rows with missing values instead of handling them.

In [None]:
# Re-load the dataset, treating ' ?' as a missing value
df = pd.read_csv('adult.data', delimiter=',', names=names, na_values=' ?')
# Remove redundant space (same as before)
for col_name in df.columns:
    if df[col_name].dtype == object:  # Checking for object type (string in pandas)
        df[col_name] = df[col_name].apply(lambda x: x.strip() if isinstance(x, str) else x) # Remove redundant space

**Task**: Create 3 train/test datasets using different methods for dealing with missing data:
<br>A: Drop missing values, B: KNN imputation, C: Most frequent imputation


In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

def train_test_split_df(df, test_ratio=0.3, target_col="y", random_state=42):
    # Separate features and target
    df_data = df.drop(columns=[target_col])
    df_target = df[target_col]

    # Randomized train-test split with a fixed seed
    train_X_df, test_X_df, train_y_df, test_y_df = train_test_split(
        df_data, df_target, test_size=test_ratio, random_state=random_state
    )

    # Convert target variable to binary (assuming it's categorical with ">50K" and others)
    train_y_df = np.where(train_y_df == ">50K", 1, 0)
    test_y_df = np.where(test_y_df == ">50K", 1, 0)

    return train_X_df, train_y_df, test_X_df, test_y_df

# Split your data into train and test splits

train_X, train_y, test_X, test_y = train_test_split_df(df)

print(len(train_X))
print(len(train_y))
print(len(test_X))
print(len(test_y))

In [None]:
# Check for missing values
print(train_X.isnull().sum())
print(test_X.isnull().sum())

Step 1: encode features without missing values
<br> `native-country`, `occupation` and `workclass` have missing values, so we first need to impute them before encoding

In [None]:
# Ordinal Encoding for 'education'
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',
                   '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc',
                   'Assoc-acdm', 'Prof-school', 'Bachelors', 'Masters', 'Doctorate']
ordinal_encoder = OrdinalEncoder(categories=[education_order])

train_X['education_encoded'] = ordinal_encoder.fit_transform(train_X[['education']])
test_X['education_encoded'] = ordinal_encoder.transform(test_X[['education']])

# OneHotEncoding for nominal features without missing values and without an implied order
nominal_columns_without_missing = ['marital-status', 'relationship', 'race', 'sex']  # These have no missing values
onehot_encoder = #None

# Fit on training data and apply to both train and test
train_onehot_encoded = #None
test_onehot_encoded = #None

# Create DataFrames for one-hot encoded columns
train_onehot_df = pd.DataFrame(train_onehot_encoded, columns=onehot_encoder.get_feature_names_out(nominal_columns_without_missing))
test_onehot_df = pd.DataFrame(test_onehot_encoded, columns=onehot_encoder.get_feature_names_out(nominal_columns_without_missing))

# Reset indices for consistency and combine data
train_X = train_X.reset_index(drop=True)
test_X = test_X.reset_index(drop=True)
train_X = pd.concat([train_X, train_onehot_df], axis=1)
test_X = pd.concat([test_X, test_onehot_df], axis=1)

# Drop the original nominal and 'education' columns from both datasets
train_X.drop(columns=nominal_columns_without_missing + ['education'], inplace=True)
test_X.drop(columns=nominal_columns_without_missing + ['education'], inplace=True)

# Verify the transformed datasets
print("Training data (X):")
train_X.head()


**Dataset A:** drop missing values (same as dataset in Task 1).

In [None]:
# A: Dataset with dropped missing values
# Drop rows with missing values in the training and testing datasets
train_X_dropna = train_X.copy()
test_X_dropna = test_X.copy()

train_X_dropna = #None
test_X_dropna = #None

# Ensure alignment by dropping the corresponding rows in y
train_y_dropna = #None
test_y_dropna = #None

# Display missing value counts for train and test datasets
print("Missing values in training data after dropping rows:")
print(train_X_dropna.isnull().sum())

print("\nMissing values in testing data after dropping rows:")
print(test_X_dropna.isnull().sum())

# Display dataset lengths after dropping rows with missing values
print(f"\n\nDataset length after dropping rows containing NA (train): {len(train_X_dropna)}")
print(f"Dataset length after dropping rows containing NA (test): {len(test_X_dropna)}")

**Dataset B:** KNN imputation



> NOTE: Typically you would not use KNN imputation for categorical columns, since KNN is best suited to numerical variables; a more appropriate method is most frequent imputation. To illustrate KNN imputation on this dataset, the workaround below first ordinal-encodes the columns, applies KNN, then converts back to the original categories. This is not an ideal solution: ordinal encoding introduces bias if the variable is not ordinal. One-hot encoding followed by KNN imputation would not work either; can you think why?
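
For intuition, here is a minimal sketch of `KNNImputer` on purely numerical, made-up data, which is the setting it is designed for. The arrays and `n_neighbors=2` are illustrative assumptions; each missing entry is replaced by the mean of that feature over the nearest training rows, where distance is computed on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numerical data; np.nan marks the missing entries
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, np.nan]])
X_test = np.array([[np.nan, 22.0]])

# Fit on the training data only, then reuse the fitted imputer on the test data
imputer = KNNImputer(n_neighbors=2)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Row [4, nan]: nearest rows by the first feature are [3, 30] and [2, 20]
print(X_train_imputed[3])
# Row [nan, 22]: nearest rows by the second feature are [2, 20] and [3, 30]
print(X_test_imputed[0])
```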



In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# Columns with missing values
columns_with_missing_values = #None

# Apply KNN Imputation separately for train and test
# Training data
train_X_knn_imputed = train_X.copy()

# Temporarily encode categorical columns with missing values
temp_encoder = OrdinalEncoder()
train_temp = train_X[columns_with_missing_values].copy()
train_temp_encoded = temp_encoder.fit_transform(train_temp)

# Fit KNN imputer on the training data
knn_imputer = #None
train_imputed_data = #None

# Decode the categorical columns back to original categories for training data
train_imputed_data_decoded = temp_encoder.inverse_transform(train_imputed_data)
train_imputed_final = pd.DataFrame(train_imputed_data_decoded, columns=columns_with_missing_values)

# Integrate the imputed columns back into the training DataFrame
train_X_knn_imputed[columns_with_missing_values] = train_imputed_final

# Testing data
test_X_knn_imputed = test_X.copy()

# Temporarily encode categorical columns with missing values
test_temp = test_X[columns_with_missing_values].copy()
test_temp_encoded = temp_encoder.transform(test_temp)  # Use the encoder fitted on training data

# Apply KNN imputer to the testing data (fit is not called again)
test_imputed_data = #None

# Decode the categorical columns back to original categories for testing data
test_imputed_data_decoded = temp_encoder.inverse_transform(test_imputed_data)
test_imputed_final = pd.DataFrame(test_imputed_data_decoded, columns=columns_with_missing_values)

# Integrate the imputed columns back into the testing DataFrame
test_X_knn_imputed[columns_with_missing_values] = test_imputed_final

# Verify missing values in the resulting datasets
print("Missing values in training data after KNN imputation:")
print(train_X_knn_imputed.isnull().sum())

print("\nMissing values in testing data after KNN imputation:")
print(test_X_knn_imputed.isnull().sum())


In [None]:
# check resulting df
train_X_knn_imputed.head(5)

**Dataset C:** Most frequent imputation
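
As a small illustration of the idea, the sketch below applies most frequent (mode) imputation to a made-up categorical column. The `train` and `test` tables are hypothetical; the mode is learned from the training data and then reused for the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with missing entries
train = pd.DataFrame({"workclass": ["Private", "Private", "State-gov", np.nan]})
test = pd.DataFrame({"workclass": [np.nan, "State-gov"]})

# Learn the most frequent category from the training data only
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = imputer.fit_transform(train)

# Fill missing test entries with the mode learned from the training data
test_imputed = imputer.transform(test)
print(train_imputed.ravel())
print(test_imputed.ravel())
```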

In [None]:
from sklearn.impute import SimpleImputer

# Columns with missing values
columns_with_missing_values = #None

# Training data
train_X_mode_imputed = train_X.copy()

# Create an imputer object using the most frequent strategy and fit on training data
mode_imputer = #None
train_X_mode_imputed[columns_with_missing_values] = mode_imputer.#None

# Testing data
test_X_mode_imputed = test_X.copy()

# Apply the trained imputer to the testing data
test_X_mode_imputed[columns_with_missing_values] = mode_imputer.transform(test_X_mode_imputed[columns_with_missing_values])

# Verify missing values in the resulting datasets
print("Missing values in training data after mode imputation:")
print(train_X_mode_imputed.isnull().sum())

print("\nMissing values in testing data after mode imputation:")
print(test_X_mode_imputed.isnull().sum())

Now that we have done data imputation, we can **encode the columns that had missing values** (`native-country`, `occupation` and `workclass`).  



> What type(s) of encoding is appropriate here?



In [None]:
from sklearn.preprocessing import OneHotEncoder

# Function to perform one-hot encoding
def apply_onehot_encoding(train_df, test_df, columns, encoder=None):
    # Fit OneHotEncoder on training data if no encoder is passed
    if encoder is None:
        encoder = #None
        encoder.fit(#None)

    # Transform both training and testing datasets
    train_encoded = encoder.#None
    test_encoded = encoder.#None

    # Create DataFrames for the encoded columns
    column_names = encoder.get_feature_names_out(columns)
    train_encoded_df = pd.DataFrame(train_encoded, columns=column_names)
    test_encoded_df = pd.DataFrame(test_encoded, columns=column_names)

    # Reset indices to ensure alignment and concatenate
    train_df_reset = train_df.reset_index(drop=True)
    test_df_reset = test_df.reset_index(drop=True)

    train_df_final = pd.concat([train_df_reset.drop(columns, axis=1), train_encoded_df.reset_index(drop=True)], axis=1)
    test_df_final = pd.concat([test_df_reset.drop(columns, axis=1), test_encoded_df.reset_index(drop=True)], axis=1)

    return train_df_final, test_df_final, encoder


# Columns to encode
columns_to_encode = #None

# Apply one-hot encoding for each imputed dataset (dropna, KNN, mode)

# 1. Dropna
#None


# 2. KNN Imputed
#None


# 3. Mode Imputed
#None

# Verify if any missing values remain after encoding
print("Missing values in dropna-encoded training data:", train_X_dropna_encoded.isnull().sum().sum())
print("Missing values in dropna-encoded testing data:", test_X_dropna_encoded.isnull().sum().sum())

print("\nMissing values in KNN-encoded training data:", train_X_knn_encoded.isnull().sum().sum())
print("Missing values in KNN-encoded testing data:", test_X_knn_encoded.isnull().sum().sum())

print("\nMissing values in mode-encoded training data:", train_X_mode_encoded.isnull().sum().sum())
print("Missing values in mode-encoded testing data:", test_X_mode_encoded.isnull().sum().sum())

### 3.3 Train a classifier to predict outcome

Example: SVM or KNN Classifier

**Task**: Train a classifier on the different datasets to compare imputation method accuracy.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd


# Non-categorical features to scale
non_categorical_features = #None

def scale_features(train_X, test_X, features_to_scale):
    scaler =#None

    # Fit scaler on training data and transform both train and test sets
    train_scaled = #None
    test_scaled = #None

    # Replace scaled features in the original DataFrame, maintaining indices
    train_X_scaled = train_X.copy()
    test_X_scaled = test_X.copy()
    train_X_scaled[features_to_scale] = pd.DataFrame(train_scaled, index=train_X.index, columns=features_to_scale)
    test_X_scaled[features_to_scale] = pd.DataFrame(test_scaled, index=test_X.index, columns=features_to_scale)

    return train_X_scaled, test_X_scaled


# Define classifiers
classifiers = {
    #None
}

# Datasets for each imputation type
datasets = {
    #None
}

# Store results
results = []

# Train and test classifiers on each dataset
for imputation_type, (train_X, train_y, test_X, test_y) in datasets.items():
    # Apply scaling to non-categorical features
    #None

    for classifier_name, classifier in classifiers.items():
        # Train the classifier
        #None

        # Predict on the test set
        #None

        # Calculate accuracy
        #None

        # Append the result
        results.append({
            'Imputation Type': imputation_type,
            'Classifier': classifier_name,
            'Accuracy': accuracy
        })
# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Pivot the DataFrame to organize results by Imputation Type and Classifier
pivot_df = results_df.pivot(index='Imputation Type', columns='Classifier', values='Accuracy')

# Display the table
print("\nClassifier Performance Across Imputation Types with Scaling:")
print(pivot_df.to_string(float_format="%.2f"))  # Display with two decimal places
