**Model Selection and Training**

**Data Preprocessing**


To convert categorical data to numerical values, use an encoding technique such as Label Encoding or One-Hot Encoding.

    Label Encoding: Assigns each unique value in a categorical column an integer value.
    One-Hot Encoding: Creates a new binary column for each unique value in the categorical column.
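
As a minimal sketch of the difference between the two encodings, the snippet below applies both to a small toy column. The `toy` dataframe and its `colour` column are hypothetical and are not part of the dataset used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, used only to illustrate the two encodings
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Label Encoding: one integer per unique value (classes are sorted first)
encoder = LabelEncoder()
toy["colour_label"] = encoder.fit_transform(toy["colour"])

# One-Hot Encoding: one binary column per unique value
one_hot = pd.get_dummies(toy["colour"], prefix="colour")

print(list(encoder.classes_))
print(toy["colour_label"].tolist())
print(one_hot.columns.tolist())
```

Label Encoding implies an ordering between the integers, so it suits a binary target such as `Outcome1`, while One-Hot Encoding avoids imposing an order on nominal features such as `Client_Gender`.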

In [139]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder


# load the data
df = pd.read_csv('cleaned_file')

In [140]:
# Convert datetime columns to datetime format
df['Accurate_Episode_Date'] = pd.to_datetime(df['Accurate_Episode_Date'], errors='coerce')
df['Case_Reported_Date'] = pd.to_datetime(df['Case_Reported_Date'], errors='coerce')
df['Test_Reported_Date'] = pd.to_datetime(df['Test_Reported_Date'], errors='coerce')
df['Specimen_Date'] = pd.to_datetime(df['Specimen_Date'], errors='coerce')


In [141]:
# Extract year, month, and day from each datetime column
date_columns = ['Accurate_Episode_Date', 'Case_Reported_Date', 'Test_Reported_Date', 'Specimen_Date']
for col in date_columns:
    df[f'{col}_year'] = df[col].dt.year
    df[f'{col}_month'] = df[col].dt.month
    df[f'{col}_day'] = df[col].dt.day


In [142]:
# Drop the original datetime columns
df.drop(columns=['Accurate_Episode_Date', 'Case_Reported_Date', 'Test_Reported_Date', 'Specimen_Date'], inplace=True)


In [143]:
# Label Encoding for 'Outcome1'
df['Outcome1_Encoded'] = LabelEncoder().fit_transform(df['Outcome1'])

# One-Hot Encoding for 'Age_Group' and 'Client_Gender'
df = pd.get_dummies(df, columns=['Age_Group', 'Client_Gender'])

In [144]:
# Print the columns in the dataframe
print("Columns in the dataframe:", df.columns)

Columns in the dataframe: Index(['Row_ID', 'Outcome1', 'Reporting_PHU_ID', 'Reporting_PHU',
       'Reporting_PHU_Address', 'Reporting_PHU_City',
       'Reporting_PHU_Postal_Code', 'Reporting_PHU_Website',
       'Reporting_PHU_Latitude', 'Reporting_PHU_Longitude',
       'Accurate_Episode_Date_year', 'Accurate_Episode_Date_month',
       'Accurate_Episode_Date_day', 'Case_Reported_Date_year',
       'Case_Reported_Date_month', 'Case_Reported_Date_day',
       'Test_Reported_Date_year', 'Test_Reported_Date_month',
       'Test_Reported_Date_day', 'Specimen_Date_year', 'Specimen_Date_month',
       'Specimen_Date_day', 'Outcome1_Encoded', 'Age_Group_20s',
       'Age_Group_30s', 'Age_Group_40s', 'Age_Group_50s', 'Age_Group_60s',
       'Age_Group_70s', 'Age_Group_80s', 'Age_Group_90+', 'Age_Group_<20',
       'Age_Group_UNKNOWN', 'Client_Gender_FEMALE',
       'Client_Gender_GENDER DIVERSE', 'Client_Gender_MALE',
       'Client_Gender_UNSPECIFIED'],
      dtype='object')


In [145]:
# Define columns to drop if they exist in the dataframe
columns_to_drop = ['Row_ID', 'Outcome1', 'Reporting_PHU', 'Reporting_PHU_Address', 'Reporting_PHU_City', 'Reporting_PHU_Postal_Code', 'Reporting_PHU_Website']
df.drop(columns=[col for col in columns_to_drop if col in df.columns], inplace=True)


In [146]:
# Select features and target variable
features = df.drop(columns=['Outcome1_Encoded'])
target = df['Outcome1_Encoded']

In [147]:
# Split the data into training and testing sets, then initialize the model

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)

In [148]:
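# Train the model
model.fit(X_train, y_train)

# Predict on the test set; y_pred is used by the confusion matrix and classification report
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)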

Accuracy: 0.9891572203055693


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
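
The warning above suggests scaling the data. One possible way to do that, sketched here on synthetic data rather than on the notebook's dataframe, is to put a `StandardScaler` in front of the classifier with `make_pipeline`. The names `X_demo`, `y_demo`, and `scaled_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(2021, 1, size=500),  # year-like feature with a large offset
    rng.normal(0, 1, size=500),     # feature that is already small-scale
])
y_demo = (X_demo[:, 1] > 0).astype(int)

# The scaler is fitted together with the model, so the same scaling
# is applied automatically whenever the pipeline predicts
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

print("Accuracy:", scaled_model.score(X_demo, y_demo))
print("Iterations used:", scaled_model[-1].n_iter_[0])
```

In the notebook itself, the same pipeline could be fitted on `X_train` and `y_train`, which keeps the scaling statistics from being computed on the test set.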


In [149]:
# Confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion)

Confusion Matrix:
[[     0   3608]
 [     0 329148]]


In [150]:
# Classification report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      3608
           1       0.99      1.00      0.99    329148

    accuracy                           0.99    332756
   macro avg       0.49      0.50      0.50    332756
weighted avg       0.98      0.99      0.98    332756





The classification report and confusion matrix show that the model predicts the majority class (class 1) for every test sample. Accuracy is therefore very high (about 0.99), but all 3,608 class 0 cases are misclassified, so precision and recall for the minority class are both 0.
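
To make this concrete, the short sketch below recomputes the key figures directly from the confusion matrix reported above. It also compares the accuracy with a baseline that always predicts class 1; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[0, 3608],
               [0, 329148]])

accuracy = np.trace(cm) / cm.sum()
recall_0 = cm[0, 0] / cm[0].sum()            # share of true class 0 cases found
majority_baseline = cm[1].sum() / cm.sum()   # accuracy of always predicting class 1

print(f"accuracy:          {accuracy:.4f}")
print(f"recall (class 0):  {recall_0:.2f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The accuracy equals the majority baseline, which means the model has learned nothing that separates the two classes.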

To address the class imbalance, use the class_weight parameter in LogisticRegression:

In [151]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Compute balanced class weights; classes is passed as a NumPy array,
# which is the documented type for this argument
class_weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=target)
class_weights_dict = {0: class_weights[0], 1: class_weights[1]}

# Initialize the model with class weights
model = LogisticRegression(max_iter=1000, class_weight=class_weights_dict)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)

# Predict on the test set
y_pred = model.predict(X_test)

# Confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion)

# Classification report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.8096172570892787
Confusion Matrix:
[[  3149    459]
 [ 62892 266256]]
Classification Report:
              precision    recall  f1-score   support

           0       0.05      0.87      0.09      3608
           1       1.00      0.81      0.89    329148

    accuracy                           0.81    332756
   macro avg       0.52      0.84      0.49    332756
weighted avg       0.99      0.81      0.88    332756



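The results show a significant improvement in recall for the minority class (class 0) after applying class weights: recall rises from 0.00 to 0.87. However, precision for class 0 is still low (0.05), because 62,892 class 1 cases are now predicted as class 0, and overall accuracy drops to about 0.81.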
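
For reference, the 'balanced' weights used above follow the formula `n_samples / (n_classes * count_per_class)`. The sketch below checks this against `compute_class_weight` on illustrative labels that reuse the test-set class counts; the weights computed in the notebook come from the full `target` column and may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with the same class counts as the test set above
y_demo = np.array([0] * 3608 + [1] * 329148)

# Manual 'balanced' weights: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_demo)
manual_weights = len(y_demo) / (len(counts) * counts)

# The same weights from scikit-learn
sk_weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

print("manual: ", manual_weights)
print("sklearn:", sk_weights)
```

The minority class receives a much larger weight than the majority class, which is why the weighted model trades precision on class 0 for recall.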
