# HW3 : Evaluating Binary Classifiers

### Instructions

The authoritative HW3 instructions are on the course website:

http://www.cs.tufts.edu/cs/135/2026s/hw3.html

Please report any questions to Piazza.

In [None]:
import os
import numpy as np
import pandas as pd

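import sklearn.impute
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing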

# Plotting libraries
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8') # pretty matplotlib plots

import seaborn as sns
sns.set_theme('notebook', style='whitegrid', font_scale=1.25)

RANDOM_SEED = 68

# Cervical Cancer Risk Screening

In [None]:
# You may need to adjust the path
cervical_cancer_risk_factors = pd.read_csv("risk_factors_cervical_cancer.csv")

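# Separate the target variable (Biopsy) from the features.
# Other diagnostic tests such as "Schiller" and "Hinselmann" are kept as features here;
# add them to `columns` to drop them as well.
X_df = cervical_cancer_risk_factors.drop(columns=["Biopsy"]) #, "Schiller", "Hinselmann"
y = cervical_cancer_risk_factors.Biopsy.to_numpy()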

In [None]:

# Replace all of the "?" with NaN (not a number)
X_df = X_df.replace('?', np.nan)
# We should also make sure that everything is now treated as numeric,
# since pandas may have decided columns containing "?" were strings.
X_df = X_df.apply(pd.to_numeric, errors='coerce')

# Create a train/validation/test split
X_dev, X_test, y_dev, y_test = sklearn.model_selection.train_test_split(
    X_df, y, test_size=0.15, random_state=RANDOM_SEED, stratify=y
)
# Further split the development set into a training and validation set
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(
    X_dev, y_dev, test_size=0.33, random_state=RANDOM_SEED, stratify=y_dev
)
  
print(X_dev.describe())


This dataset contains a substantial amount of missing data, denoted by "?" in the raw file (we converted these entries to NaN).

This is a common occurrence with questionnaire data, especially in a medical context. Asking patients about topics that are stigmatized in many cultures, such as sexual history, birth control, or drug use, can result in high non-response rates because patients are wary of social stigma. Those tasked with collecting questionnaire data may also decline to ask sensitive questions and simply mark them as missing by default.

When collecting and analyzing data, we need to consider context. Are there large social costs to a patient if they answer this question truthfully? Will they be alone with a clinician or around family when asked this question? Are they answering questions out loud, or filling out a written form? These factors may have been taken into account for this questionnaire, but in general, when we collect data via questionnaire, we want to **design it in partnership with community members** who understand the context, in order to minimize missing data.

In [None]:
missing_counts = X_dev.isnull().sum()
missing_percentage = (X_dev.isnull().sum() / len(X_dev)) * 100

# Combine into a clean table
missing_info = pd.concat([missing_counts, missing_percentage], axis=1, keys=['Total Missing', 'Percent'])
print(missing_info.sort_values(by='Percent', ascending=False).head(10))

Here we have two columns, `STDs: Time since first diagnosis` and `STDs: Time since last diagnosis`, that are missing for over 90% of rows in the development set (training plus validation). Let's drop all columns with more than 60% missing data, which results in these two columns being dropped. Notice that we exclude the test set when deciding which columns to drop.

In [None]:
cols_to_drop = X_dev.columns[X_dev.isna().mean() > 0.60].tolist()
X_dev = X_dev.drop(columns=cols_to_drop)
X_train = X_train.drop(columns=cols_to_drop)
X_val = X_val.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

We will *impute* the rest of the missing values in the dataset. Strategies for dealing with missing data could be the topic of an entire course, but for this analysis we will use scikit-learn's [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html). It fills in each missing feature value with the mean of that feature among the $k$ nearest instances, where "nearest" is measured by Euclidean distance computed over the features that are not missing.
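
To make the imputation rule concrete, here is a minimal sketch on a made-up toy array (the values and the name `X_toy` are illustrative assumptions, not part of the homework data). The second row is missing its second feature; its two nearest neighbors by the first feature are the rows with values 10 and 30, so the imputed value should be their mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data, made up for illustration: 4 rows, 2 features, one missing value
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],   # second feature is missing for this row
    [3.0, 30.0],
    [8.0, 80.0],
])

# Use the 2 nearest rows that have the missing feature observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

print(X_filled[1])  # the missing entry is replaced by (10 + 30) / 2 = 20
```

The far-away row `[8.0, 80.0]` does not contribute, because only the $k=2$ closest rows are averaged.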

However, imputing this way on the entire dataset before performing a train/test split would result in data leakage: we would be using information from the test set during training, which could cause us to overestimate how well our trained model generalizes. Instead, we can use `KNNImputer` inside a pipeline (along with `MinMaxScaler()`), so that the imputer and scaler are fit only on the data passed to `fit`.
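
As a rough sketch of why a pipeline avoids leakage, the snippet below fits the preprocessing steps on a synthetic "training" array only and then reuses them, unchanged, on a synthetic "validation" array. The arrays `X_tr` and `X_va` and the name `prep` are hypothetical stand-ins, not the homework variables.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for a training and a validation split (illustrative only)
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(20, 3))
X_va = rng.normal(size=(5, 3))
X_tr[0, 1] = np.nan   # inject one missing value into each split
X_va[2, 0] = np.nan

prep = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=3)),
    ('rescaler', MinMaxScaler()),
])

# fit_transform learns the neighbors and the per-column min/max from X_tr only
X_tr_ready = prep.fit_transform(X_tr)
# transform reuses what was learned from X_tr; nothing is re-fit on X_va
X_va_ready = prep.transform(X_va)
```

Because the scaler's minimum and maximum come from the training rows, validation values can fall slightly outside $[0, 1]$; that is expected and is a sign that no validation statistics were used.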

In [None]:
def make_logit_pipeline_knnimpute(C=1.0, k=5):
    """
    Creates a sklearn pipeline for logistic regression with imputation and rescaling steps.
    C is the inverse regularization strength; smaller values result in stronger regularization.
    k is the number of neighbors used by the KNN imputer.
    In LogisticRegression, l1_ratio=0 gives L2-regularized logit (used here),
    and l1_ratio=1 gives L1-regularized logit.
    """
    pipeline = sklearn.pipeline.Pipeline(
        steps=[
         #('imputer', sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('imputer', sklearn.impute.KNNImputer(n_neighbors=k)),
         ('rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('logit', sklearn.linear_model.LogisticRegression(solver="lbfgs", l1_ratio=0, C=C))
        ])
    
    # Return the constructed pipeline
    return pipeline

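def make_knn_knnimpute(k_model=5, k_impute=5):
    """
    Creates a sklearn pipeline for a k-nearest-neighbors classifier with imputation and rescaling steps.
    k_model is the number of neighbors used by the classifier;
    k_impute is the number of neighbors used by the KNN imputer.
    """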
    pipeline = sklearn.pipeline.Pipeline(
        steps=[
         #('imputer', sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('imputer', sklearn.impute.KNNImputer(n_neighbors=k_impute)),
         ('rescaler', sklearn.preprocessing.MinMaxScaler()),
         ('knn', sklearn.neighbors.KNeighborsClassifier(n_neighbors=k_model))
        ])
    
    # Return the constructed pipeline
    return pipeline

## Your Code