## In-Class Assignment: Data Modeling Process

Use the [CDC Diabetes Health Indicators](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators) dataset to explore train/test splitting and regularization.

In [1]:
pip install -q ucimlrepo

In [2]:
from ucimlrepo import fetch_ucirepo

cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

Determine two different reasonable methods to split or apply cross-validation to the data, based on the data attributes, distribution and task.

In [3]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

In [4]:
# Method 1: hold-out train/test split (90% train, 10% test)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=16)
# y is a single-column DataFrame; flatten it to 1-D so fit() does not emit a DataConversionWarning
model.fit(X_train, y_train.values.ravel())
model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [5]:
# Method 2: 5-fold cross-validation with shuffling
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=16)
# Flatten y to 1-D to avoid a DataConversionWarning in every fold
results = cross_val_score(model, X, y.values.ravel(), cv=kfold)
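
Both methods above split the rows at random without regard to the label. If the positive class is a minority, as the mostly-zero predictions suggest, a stratified split keeps the class ratio the same in every partition. The following is a minimal sketch of that idea, not the assignment's own solution: it runs on synthetic data from `make_classification` as a stand-in for the CDC features, and the names `X_demo`, `y_demo`, and `demo_model` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the real data (assumption): roughly 15% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=16
)

# Stratified hold-out split: the class ratio is preserved in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=16
)
print("positive rate (train / test):", y_tr.mean(), y_te.mean())

# Stratified 5-fold CV: every fold keeps roughly the same class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=16)
demo_model = LogisticRegression(max_iter=1000)
scores = cross_val_score(demo_model, X_demo, y_demo, cv=skf)
print("fold accuracies:", np.round(scores, 3))
```

On the real data, the same pattern would apply by passing `stratify=y` to `train_test_split` and swapping `KFold` for `StratifiedKFold`.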


Identify which metrics you would recommend for evaluating and comparing machine learning models for this task. How would you combine these metrics into a single metric?

In [6]:
# I would use accuracy, precision, recall, and F-score to evaluate and compare models.
# acc_fair combines them into one number: accuracy minus a penalty, weighted by
# lambda_val, for the gap between precision and recall.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def acc_fair(y, y_pred, lambda_val):
  acc = accuracy_score(y, y_pred)
  # zero_division=0 scores precision/recall as 0 when they are undefined
  precision, recall, fscore, support = precision_recall_fscore_support(y, y_pred, average='binary', zero_division=0)
  # 0 when precision and recall agree, approaching 1 as they diverge
  imbalance = np.abs(precision - recall)

  # Higher is better: a larger precision/recall gap lowers the score
  acc_fair_score = acc - lambda_val * imbalance
  return acc_fair_score
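
To make the effect of the weighting concrete, a combined metric of this kind can be read as accuracy with a penalty for the gap between precision and recall. The sketch below works through one possible formulation on made-up toy labels; the helper name `penalized_accuracy` and the arrays `y_true_toy` and `y_pred_toy` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def penalized_accuracy(y_true, y_pred, lam):
    """Hypothetical helper: accuracy minus lam * |precision - recall|."""
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return acc - lam * abs(precision - recall)

# Toy labels (made up): 4 positives, 6 negatives.
# Predictions give TP=2, FN=2, FP=1, TN=5.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# accuracy = 7/10, precision = 2/3, recall = 1/2, so the gap is 1/6.
for lam in (0.0, 0.5, 1.0):
    print(lam, round(penalized_accuracy(y_true_toy, y_pred_toy, lam), 4))
```

With these labels the score falls from 0.7 at `lam = 0` to about 0.533 at `lam = 1`, which shows that the weight has to be fixed before models are compared: it defines the metric and is not something to tune.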

For one of the two methods you identified, run a parameter search to find the "best" model, using the metric you defined.

In [7]:
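# Using method 1 (the hold-out split)
# Caveat: candidates are scored on the test set, so the best score is optimistic;
# a separate validation split would give a less biased estimate.

best_acc_fair_score = float('-inf')
best_threshold = 0
best_lambda = 0

thresholds = np.arange(0, 0.5, 0.01)  # decision thresholds 0.00 .. 0.49
lambda_vals = np.arange(0, 11, 1)     # penalty weights 0 .. 10

# Positive-class probabilities do not depend on t or lam, so compute them once
proba = model.predict_proba(X_test)[:, 1]

for t in thresholds:
  y_pred = (proba >= t).astype(int)
  for lam in lambda_vals:
    acc_fair_score = acc_fair(y_test, y_pred, lam)

    if acc_fair_score > best_acc_fair_score:
      best_acc_fair_score = acc_fair_score
      best_threshold = t
      best_lambda = lam

# acc_fair subtracts a non-negative term scaled by lambda, so lambda = 0 always wins here
print('Best Accuracy-Fairness Score:', best_acc_fair_score)
print('Best Threshold:', best_threshold)
print('Best Lambda Value:', best_lambda)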

Best Accuracy-Fairness Score: 0.8656575212866604
Best Threshold: 0.49
Best Lambda Value: 0
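
The assignment also mentions regularization, which the search above does not touch. A hedged sketch of one way to search over it: `LogisticRegression` exposes the inverse regularization strength as `C`, and `GridSearchCV` can rank candidate values with a custom metric wrapped by `make_scorer`. Everything specific here is an assumption for illustration: the synthetic data, the `C` grid, the fixed weight `lam=1.0`, and the helper `penalized_accuracy`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

def penalized_accuracy(y_true, y_pred, lam=1.0):
    """Hypothetical metric: accuracy minus lam * |precision - recall|."""
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return acc - lam * abs(precision - recall)

# Synthetic stand-in for the real data (assumption): roughly 15% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=16
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=16
)

# The metric weight is fixed up front; only the model parameter C is searched.
scorer = make_scorer(penalized_accuracy, lam=1.0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring=scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=16),
)
search.fit(X_tr, y_tr)  # model selection uses only the training folds

print("best C:", search.best_params_["C"])
print("cross-validated score:", round(search.best_score_, 4))
# The held-out test set is touched once, after the search is finished.
print("test score:", round(penalized_accuracy(y_te, search.predict(X_te)), 4))
```

Because selection happens inside cross-validation on the training portion, the final test score is not inflated by the search itself.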
