### Import Libraries

This section imports essential libraries for data manipulation, model training, and evaluation.

- **pandas**: Used for data manipulation and reading CSV files.
- **sklearn.linear_model.LogisticRegression**: Implements logistic regression.
- **sklearn.model_selection.GridSearchCV**: Used for hyperparameter tuning.
- **sklearn.metrics**: Provides metrics to evaluate model performance.
- **xgboost.XGBClassifier**: Implements the XGBoost algorithm.
- **sklearn.svm.SVC**: Implements Support Vector Classification.

### Load Datasets

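Reads the training and test datasets from the specified file paths into pandas DataFrames, then splits each one on the `Sex` column (coded 1 for male and 0 for female, going by the variable names in the code) so that models can be tuned and evaluated separately for each sex.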

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.svm import SVC

# Paths to the datasets
train_dataset = '/home/aghasemi/CompBio481/ML_classifiers/datasets/NC_vs_DLB_train.csv'
test_dataset = '/home/aghasemi/CompBio481/ML_classifiers/datasets/NC_vs_DLB_test.csv'

train_df = pd.read_csv(train_dataset)
test_df = pd.read_csv(test_dataset)

# Split the train_df into male and female datasets
train_male_df = train_df[train_df['Sex'] == 1]
train_female_df = train_df[train_df['Sex'] == 0]

# Split the test_df into male and female datasets
test_male_df = test_df[test_df['Sex'] == 1]
test_female_df = test_df[test_df['Sex'] == 0]

# Display the number of records in each dataset to confirm
print("Number of records in train_male_df:", len(train_male_df))
print("Number of records in train_female_df:", len(train_female_df))
print("Number of records in test_male_df:", len(test_male_df))
print("Number of records in test_female_df:", len(test_female_df))

Number of records in train_male_df: 186
Number of records in train_female_df: 179
Number of records in test_male_df: 52
Number of records in test_female_df: 40
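
Because each sex-specific subset is fairly small, it can be worth checking how the diagnosis labels are distributed within each one before tuning. The sketch below is illustrative only: the toy rows are invented, and only the `Sex` and `Diagnosis` column names are taken from the code above.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Sex":       [1, 1, 1, 0, 0, 0, 0],
    "Diagnosis": [0, 1, 1, 0, 0, 1, 0],
})

# Rows are Sex values, columns are Diagnosis labels, cells are record counts.
class_counts = pd.crosstab(toy_df["Sex"], toy_df["Diagnosis"])
print(class_counts)
```

Running the same `pd.crosstab` call on `train_df` and `test_df` would show whether one class dominates in either subset, which matters when accuracy is the tuning metric.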


### Male

### Prepare Data

**Separate Features and Target Variable for Training Data:** 
Removes the columns `ID_1` and `Diagnosis` from the male training DataFrame (`train_male_df`) to get the feature set `X_train`, and extracts `Diagnosis` as the target variable `y_train`.

**Separate Features and Target Variable for Test Data:** 
Similarly, prepares the test data by separating features and the target variable.

In [2]:
# Separate features and target variable for training data
X_train = train_male_df.drop(columns=['ID_1', 'Diagnosis'])
y_train = train_male_df['Diagnosis']

# Separate features and target variable for test data
X_test = test_male_df.drop(columns=['ID_1', 'Diagnosis'])
y_test = test_male_df['Diagnosis']
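
The same separation is applied four times in this notebook (train and test, male and female), so it could be wrapped in a small helper. This is a sketch, not part of the original notebook: the function name `split_features_target` and the `feature_1` column are hypothetical.

```python
import pandas as pd

def split_features_target(df, target="Diagnosis", id_cols=("ID_1",)):
    """Return (X, y): every column except the ID/target columns, and the target."""
    X = df.drop(columns=[*id_cols, target])
    y = df[target]
    return X, y

# Toy frame with the notebook's column names plus one made-up feature.
toy_df = pd.DataFrame({
    "ID_1": ["a", "b", "c"],
    "Sex": [1, 1, 1],
    "feature_1": [0.2, 0.5, 0.9],  # hypothetical feature column
    "Diagnosis": [0, 1, 1],
})

X, y = split_features_target(toy_df)
print(list(X.columns))  # ['Sex', 'feature_1']
```

Note that `Sex` stays in the feature set, exactly as in the cell above. Within a single-sex subset it is constant, so it carries no information for the classifier and could be added to the dropped columns.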

### Hyperparameter Tuning for Logistic Regression

**Define Parameter Grid:** 
Specifies the range of hyperparameters (`C`, `solver`, and `max_iter`) to test for logistic regression.

**Grid Search:** 
Uses `GridSearchCV` to perform an exhaustive search over the specified parameter grid with 5-fold cross-validation.

**Fit and Retrieve Best Parameters:** 
Fits the model for every combination of parameters and retrieves the combination with the highest mean cross-validated accuracy.

In [3]:
# Logistic Regression
param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]
}

lr = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_lr.fit(X_train, y_train)
best_params_lr = grid_search_lr.best_params_
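
To get a feel for the cost of an exhaustive search, the number of fits can be worked out from the grid itself. A minimal sketch using only the standard library:

```python
from math import prod

param_grid_lr = {
    "C": [0.1, 1, 10, 100],
    "solver": ["liblinear", "saga"],
    "max_iter": [100, 200, 500],
}
n_folds = 5

# One candidate per combination of values; each candidate is fitted once per fold.
n_candidates = prod(len(values) for values in param_grid_lr.values())
n_cv_fits = n_candidates * n_folds

print(n_candidates, n_cv_fits)  # 24 120
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the winning combination on the full training set.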



### Hyperparameter Tuning for XGBoost

**Define Parameter Grid:**  
Specifies the range of hyperparameters (`n_estimators`, `max_depth`, `learning_rate`) to test for the XGBoost classifier.

**Grid Search:** 
Uses `GridSearchCV` to find the best parameters with 5-fold cross-validation.

**Fit and Retrieve Best Parameters:** 
Fits the model for every parameter combination and selects the one with the highest mean cross-validated accuracy.

In [4]:
# XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Note: use_label_encoder is deprecated in recent XGBoost releases and can be omitted there.
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
grid_search_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)
best_params_xgb = grid_search_xgb.best_params_

### Hyperparameter Tuning for SVM

**Define Parameter Grid:** 
Specifies the hyperparameters (`C`, `kernel`, `gamma`) to tune for Support Vector Machine.

**Grid Search:** 
Performs an exhaustive search with 5-fold cross-validation to find the best parameter values.

**Fit and Retrieve Best Parameters:** 
Fits the model for every parameter setting and retrieves the combination with the highest mean cross-validated accuracy.


In [5]:
# SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

svm = SVC()
grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_train, y_train)
best_params_svm = grid_search_svm.best_params_
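
Two refinements are sometimes applied to a search like this, shown here as a sketch rather than as what the notebook does. First, SVMs are sensitive to feature scale, so a `StandardScaler` can be placed in a `Pipeline` and fitted inside each fold. Second, `gamma` has no effect with the linear kernel, so a list of grids avoids evaluating equivalent candidates twice. The data below is synthetic (`make_classification`) and the step names `scale` and `svc` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (not the study data).
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),  # refitted on the training part of each CV fold
    ("svc", SVC()),
])

# gamma is varied only for the RBF kernel, where it actually matters.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10, 100]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10, 100],
     "svc__gamma": ["scale", "auto"]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)  # 12 candidates, versus 16 for the full cross-product
print(search.best_params_)
```

Whether scaling helps on the actual features depends on their ranges, which are not shown in this notebook.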

### Train and Evaluate Logistic Regression

**Train Model:** 
Initializes and trains a logistic regression model using the best parameters found.

**Predict and Evaluate:** 
Makes predictions on the test set and evaluates the model using accuracy and detailed classification report metrics.

In [6]:
# Train and evaluate the models with the best parameters
# Logistic Regression
lr_best = LogisticRegression(**best_params_lr)
lr_best.fit(X_train, y_train)
y_pred_lr = lr_best.predict(X_test)
print("Logistic Regression")
print("Best Parameters:", best_params_lr)
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

Logistic Regression
Best Parameters: {'C': 0.1, 'max_iter': 200, 'solver': 'saga'}
Accuracy: 0.8076923076923077
              precision    recall  f1-score   support

           0       0.75      0.96      0.84        28
           1       0.94      0.62      0.75        24

    accuracy                           0.81        52
   macro avg       0.84      0.79      0.80        52
weighted avg       0.84      0.81      0.80        52
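
The classification report rounds to two decimals, but the underlying counts can be recovered from recall and support, which gives a quick check that the report and the accuracy line agree. A small worked check using only the numbers printed above:

```python
# Values copied from the Logistic Regression (male) report above.
support = {0: 28, 1: 24}
recall = {0: 0.96, 1: 0.62}

# recall = correct / support, so correct is recall * support rounded to a whole sample.
correct = {label: round(recall[label] * support[label]) for label in support}
accuracy = sum(correct.values()) / sum(support.values())

print(correct)             # {0: 27, 1: 15}
print(round(accuracy, 4))  # 0.8077
```

So 42 of the 52 test samples are classified correctly, and 9 of the 24 class-1 samples are missed, which is what pulls the class-1 recall down to 0.62.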





### Train and Evaluate XGBoost

**Train Model:** 
Initializes and trains an XGBoost model with the best parameters.

**Predict and Evaluate:** 
Evaluates the trained model on the test set, calculating accuracy and generating a classification report.


In [7]:
# XGBoost
xgb_best = XGBClassifier(**best_params_xgb, use_label_encoder=False, eval_metric='logloss')
xgb_best.fit(X_train, y_train)
y_pred_xgb = xgb_best.predict(X_test)
print("\nXGBoost")
print("Best Parameters:", best_params_xgb)
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))


XGBoost
Best Parameters: {'learning_rate': 0.2, 'max_depth': 4, 'n_estimators': 200}
Accuracy: 0.8653846153846154
              precision    recall  f1-score   support

           0       0.82      0.96      0.89        28
           1       0.95      0.75      0.84        24

    accuracy                           0.87        52
   macro avg       0.88      0.86      0.86        52
weighted avg       0.88      0.87      0.86        52



### Train and Evaluate SVM

**Train Model:** 
Initializes and trains an SVM model using the best parameters identified through grid search.

**Predict and Evaluate:** 
Makes predictions on the test set and assesses the model's performance with accuracy and a classification report.

In [8]:
# SVM
svm_best = SVC(**best_params_svm)
svm_best.fit(X_train, y_train)
y_pred_svm = svm_best.predict(X_test)
print("\nSVM")
print("Best Parameters:", best_params_svm)
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM
Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy: 0.7692307692307693
              precision    recall  f1-score   support

           0       0.74      0.89      0.81        28
           1       0.83      0.62      0.71        24

    accuracy                           0.77        52
   macro avg       0.78      0.76      0.76        52
weighted avg       0.78      0.77      0.76        52
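
Each of the three evaluation cells above builds a fresh model from `best_params_` and fits it again. With the default `refit=True`, `GridSearchCV` has already done this and exposes the result as `best_estimator_`. The sketch below illustrates the equivalence on synthetic data; it is an alternative, not the notebook's approach.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the study data).
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)  # refit=True is the default
search.fit(X_demo, y_demo)

# Already trained on all of X_demo with the winning parameters.
refit_model = search.best_estimator_

# The manual route used in the cells above.
manual_model = LogisticRegression(max_iter=1000, **search.best_params_)
manual_model.fit(X_demo, y_demo)

same = bool((refit_model.predict(X_demo) == manual_model.predict(X_demo)).all())
print(same)
```

One practical difference: `best_estimator_` keeps any constructor arguments that were not part of the grid, whereas rebuilding from `best_params_` requires repeating them by hand, as the XGBoost cell does with `eval_metric`.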



### Female

### Prepare Data

**Separate Features and Target Variable for Training Data:** 
Removes the columns `ID_1` and `Diagnosis` from the female training DataFrame (`train_female_df`) to get the feature set `X_train`, and extracts `Diagnosis` as the target variable `y_train`.

**Separate Features and Target Variable for Test Data:** 
Similarly, prepares the test data by separating features and the target variable.

In [9]:
# Separate features and target variable for training data
X_train = train_female_df.drop(columns=['ID_1', 'Diagnosis'])
y_train = train_female_df['Diagnosis']

# Separate features and target variable for test data
X_test = test_female_df.drop(columns=['ID_1', 'Diagnosis'])
y_test = test_female_df['Diagnosis']

The cells below repeat the male tuning and evaluation steps unchanged, now on the female subsets.

In [10]:
# Logistic Regression
param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]
}

lr = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_lr.fit(X_train, y_train)
best_params_lr = grid_search_lr.best_params_



In [11]:
# XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2]
}

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
grid_search_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)
best_params_xgb = grid_search_xgb.best_params_

In [12]:
# SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

svm = SVC()
grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_train, y_train)
best_params_svm = grid_search_svm.best_params_

In [13]:
# Train and evaluate the models with the best parameters
# Logistic Regression
lr_best = LogisticRegression(**best_params_lr)
lr_best.fit(X_train, y_train)
y_pred_lr = lr_best.predict(X_test)
print("Logistic Regression")
print("Best Parameters:", best_params_lr)
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

Logistic Regression
Best Parameters: {'C': 1, 'max_iter': 100, 'solver': 'liblinear'}
Accuracy: 0.725
              precision    recall  f1-score   support

           0       0.75      0.84      0.79        25
           1       0.67      0.53      0.59        15

    accuracy                           0.73        40
   macro avg       0.71      0.69      0.69        40
weighted avg       0.72      0.72      0.72        40



In [14]:
# XGBoost
xgb_best = XGBClassifier(**best_params_xgb, use_label_encoder=False, eval_metric='logloss')
xgb_best.fit(X_train, y_train)
y_pred_xgb = xgb_best.predict(X_test)
print("\nXGBoost")
print("Best Parameters:", best_params_xgb)
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))


XGBoost
Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Accuracy: 0.85
              precision    recall  f1-score   support

           0       0.83      0.96      0.89        25
           1       0.91      0.67      0.77        15

    accuracy                           0.85        40
   macro avg       0.87      0.81      0.83        40
weighted avg       0.86      0.85      0.84        40



In [15]:
# SVM
svm_best = SVC(**best_params_svm)
svm_best.fit(X_train, y_train)
y_pred_svm = svm_best.predict(X_test)
print("\nSVM")
print("Best Parameters:", best_params_svm)
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM
Best Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy: 0.75
              precision    recall  f1-score   support

           0       0.78      0.84      0.81        25
           1       0.69      0.60      0.64        15

    accuracy                           0.75        40
   macro avg       0.74      0.72      0.73        40
weighted avg       0.75      0.75      0.75        40
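
To compare the six runs side by side, the test accuracies printed above can be collected in one place. A minimal sketch, with the values copied from the outputs and rounded to four decimals:

```python
# Test-set accuracies copied from the outputs above.
accuracy = {
    "Male":   {"Logistic Regression": 0.8077, "XGBoost": 0.8654, "SVM": 0.7692},
    "Female": {"Logistic Regression": 0.7250, "XGBoost": 0.8500, "SVM": 0.7500},
}

# Pick the highest-scoring model within each group.
best = {group: max(scores, key=scores.get) for group, scores in accuracy.items()}

for group, model in best.items():
    print(f"{group}: best = {model} ({accuracy[group][model]:.4f})")
```

XGBoost has the highest test accuracy in both groups. The test sets are small (52 and 40 records), so a single sample shifts accuracy by roughly 1.9 and 2.5 percentage points respectively, and small gaps between models should be read with that in mind.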

