### Import Libraries and Load Datasets

This section imports essential libraries for data manipulation, model training, and evaluation.

- **pandas**: Used for data manipulation and reading CSV files.
- **sklearn.linear_model.LogisticRegression**: Implements logistic regression.
- **sklearn.model_selection.GridSearchCV**: Used for hyperparameter tuning.
- **sklearn.metrics**: Provides metrics to evaluate model performance.
- **xgboost.XGBClassifier**: Implements the XGBoost algorithm.
- **sklearn.svm.SVC**: Implements Support Vector Classification.

### Load Datasets

Reads the training and test datasets from specified file paths into pandas DataFrames.

In [12]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.svm import SVC

# Paths to the datasets
train_dataset = '/home/aghasemi/CompBio481/ML_classifiers/datasets/multi_class_train_dataset.csv'
test_dataset = '/home/aghasemi/CompBio481/ML_classifiers/datasets/multi_class_test_dataset.csv'

train_df = pd.read_csv(train_dataset)
test_df = pd.read_csv(test_dataset)

### Prepare Data

**Separate Features and Target Variable for Training Data:** 
Removes the columns `ID_1` and `Diagnosis` from the training DataFrame to get the feature set `X_train` and extracts the target variable `y_train`.

**Separate Features and Target Variable for Test Data:** 
Similarly, prepares the test data by separating features and the target variable.

In [13]:
# Separate features and target variable for training data
X_train = train_df.drop(columns=['ID_1', 'Diagnosis'])
y_train = train_df['Diagnosis']

# Separate features and target variable for test data
X_test = test_df.drop(columns=['ID_1', 'Diagnosis'])
y_test = test_df['Diagnosis']
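
The split above can be illustrated on a small in-memory table. This is a minimal sketch: `toy_df` and the feature names `feature_a` and `feature_b` are hypothetical stand-ins, since the real column names are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for train_df; the feature names are hypothetical.
toy_df = pd.DataFrame({
    "ID_1": ["s1", "s2", "s3"],
    "feature_a": [0.1, 0.4, 0.9],
    "feature_b": [1.2, 0.7, 0.3],
    "Diagnosis": [0, 1, 1],
})

# drop() returns a new DataFrame, so toy_df itself is left unchanged.
X_toy = toy_df.drop(columns=["ID_1", "Diagnosis"])
y_toy = toy_df["Diagnosis"]

print(list(X_toy.columns))       # ['feature_a', 'feature_b']
print(X_toy.shape, y_toy.shape)  # (3, 2) (3,)
```

Because the same two columns are dropped from both files, it may be worth confirming that `X_train` and `X_test` end up with identical column order before fitting.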

### Hyperparameter Tuning for Logistic Regression

**Define Parameter Grid:** 
Specifies the hyperparameter values to test for logistic regression: `C`, `solver`, and `max_iter`, with `multi_class` held at `'auto'`.

**Grid Search:** 
Uses `GridSearchCV` to perform an exhaustive search over the specified parameter grid with 5-fold cross-validation.

**Fit and Retrieve Best Parameters:** 
Fits a model for every parameter combination and retrieves the combination with the highest mean cross-validated accuracy.

In [14]:
# Logistic Regression
param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500],
    'multi_class': ['auto']  # 'auto' resolves to 'ovr' for liblinear and 'multinomial' for saga
}

lr = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_lr.fit(X_train, y_train)
best_params_lr = grid_search_lr.best_params_
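
To get a feel for the cost of this exhaustive search, the number of candidates can be counted before fitting. The sketch below copies the grid from the cell above and uses `ParameterGrid` from scikit-learn; the variable names `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the logistic regression cell.
param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500],
    'multi_class': ['auto'],
}

n_candidates = len(ParameterGrid(param_grid_lr))  # 4 * 2 * 3 * 1
n_folds = 5

print(n_candidates)            # 24
print(n_candidates * n_folds)  # 120 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the winning combination on the full training set.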



### Hyperparameter Tuning for XGBoost

**Define Parameter Grid:** 
Specifies the range of hyperparameters (`n_estimators`, `max_depth`, `learning_rate`) to test for the XGBoost classifier.

**Grid Search:** 
Uses `GridSearchCV` to find the best parameters with 5-fold cross-validation.

**Fit and Retrieve Best Parameters:** 
Fits the model with each parameter combination and selects the one with the highest mean cross-validated accuracy.

In [15]:
# XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2]
}

xgb = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')  # Use mlogloss for multi-class
grid_search_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)
best_params_xgb = grid_search_xgb.best_params_

### Hyperparameter Tuning for SVM

**Define Parameter Grid:** 
Specifies the hyperparameters (`C`, `kernel`, `gamma`) to tune for Support Vector Machine.

**Grid Search:** 
Performs an exhaustive search with 5-fold cross-validation to find the best parameter values.

**Fit and Retrieve Best Parameters:** 
Fits the model with each parameter setting and retrieves the combination with the highest mean cross-validated accuracy.


In [16]:
# SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

svm = SVC(decision_function_shape='ovr')  # 'ovr' only shapes decision_function output; SVC always trains one-vs-one internally
grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_svm.fit(X_train, y_train)
best_params_svm = grid_search_svm.best_params_
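
Beyond `best_params_`, a fitted search object also exposes the score of every candidate, which helps judge how close the runners-up were. The sketch below is an illustration on synthetic data from `make_classification`, not on the notebook's dataset, and uses a reduced grid (`demo_grid`) to keep it fast; all `demo_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the real training set.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
demo_search = GridSearchCV(SVC(), demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate, with mean and std of the fold scores.
results = pd.DataFrame(demo_search.cv_results_)
cols = ["param_C", "param_kernel", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
print("Best mean CV accuracy:", demo_search.best_score_)
```

The same inspection should apply unchanged to `grid_search_lr`, `grid_search_xgb`, and `grid_search_svm`.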

### Train and Evaluate Logistic Regression

**Train Model:** 
Initializes and trains a logistic regression model using the best parameters found.

**Predict and Evaluate:** 
Makes predictions on the test set and evaluates the model using accuracy and detailed classification report metrics.

In [17]:
# Train and evaluate the models with the best parameters
# Logistic Regression
lr_best = LogisticRegression(**best_params_lr)
lr_best.fit(X_train, y_train)
y_pred_lr = lr_best.predict(X_test)
print("Logistic Regression")
print("Best Parameters:", best_params_lr)
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

Logistic Regression
Best Parameters: {'C': 0.1, 'max_iter': 200, 'multi_class': 'auto', 'solver': 'saga'}
Accuracy: 0.7032640949554896
              precision    recall  f1-score   support

           0       0.70      0.62      0.66        48
           1       0.71      0.94      0.81       219
           2       0.25      0.03      0.05        33
           3       0.00      0.00      0.00         9
           4       0.00      0.00      0.00        12
           5       0.00      0.00      0.00        16

    accuracy                           0.70       337
   macro avg       0.28      0.27      0.25       337
weighted avg       0.59      0.70      0.63       337



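
The gap between the `macro avg` and `weighted avg` rows follows directly from the per-class numbers. As a worked check, the sketch below recomputes both precision averages from the rounded values printed in the logistic regression report, so agreement is only expected to two decimals.

```python
# Per-class precision and support copied from the logistic regression report.
precision = [0.70, 0.71, 0.25, 0.00, 0.00, 0.00]
support = [48, 219, 33, 9, 12, 16]

# Macro average: every class counts equally, regardless of size.
macro_precision = sum(precision) / len(precision)

# Weighted average: each class is weighted by its number of test samples.
weighted_precision = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(round(macro_precision, 2))     # 0.28
print(round(weighted_precision, 2))  # 0.59
```

Class 1 accounts for 219 of the 337 test samples, which is why the weighted average sits far above the macro average.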


### Train and Evaluate XGBoost

**Train Model:** 
Initializes and trains an XGBoost model with the best parameters.

**Predict and Evaluate:** 
Evaluates the trained model on the test set, calculating accuracy and generating a classification report.


In [18]:
# XGBoost
xgb_best = XGBClassifier(**best_params_xgb, use_label_encoder=False, eval_metric='mlogloss')
xgb_best.fit(X_train, y_train)
y_pred_xgb = xgb_best.predict(X_test)
print("\nXGBoost")
print("Best Parameters:", best_params_xgb)
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))


XGBoost
Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Accuracy: 0.7091988130563798
              precision    recall  f1-score   support

           0       0.74      0.65      0.69        48
           1       0.71      0.95      0.81       219
           2       0.00      0.00      0.00        33
           3       0.00      0.00      0.00         9
           4       0.00      0.00      0.00        12
           5       0.00      0.00      0.00        16

    accuracy                           0.71       337
   macro avg       0.24      0.27      0.25       337
weighted avg       0.57      0.71      0.63       337



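
When a class is never predicted, its precision is 0/0; scikit-learn reports it as 0.00 and emits an `UndefinedMetricWarning`. A minimal sketch of making that choice explicit with the `zero_division` argument of `classification_report` follows. The labels `y_true` and `y_pred` are hypothetical toy data, not the notebook's predictions.

```python
import warnings
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import classification_report

# Toy labels (hypothetical): class 2 occurs in y_true but is never predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # zero_division=0 reports undefined precision as 0 without warning.
    report = classification_report(y_true, y_pred, zero_division=0)

undefined = [w for w in caught if issubclass(w.category, UndefinedMetricWarning)]
print(report)
print("UndefinedMetricWarning count:", len(undefined))
```

This changes only how the undefined cases are reported; the underlying issue, that the minority classes are not being predicted, remains.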


### Train and Evaluate SVM

**Train Model:** 
Initializes and trains an SVM model using the best parameters identified through grid search.

**Predict and Evaluate:** 
Makes predictions on the test set and assesses the model's performance with accuracy and a classification report.

In [19]:
# SVM
svm_best = SVC(**best_params_svm, decision_function_shape='ovr')
svm_best.fit(X_train, y_train)
y_pred_svm = svm_best.predict(X_test)
print("\nSVM")
print("Best Parameters:", best_params_svm)
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))


SVM
Best Parameters: {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
Accuracy: 0.6913946587537092
              precision    recall  f1-score   support

           0       0.68      0.62      0.65        48
           1       0.71      0.92      0.80       219
           2       0.14      0.03      0.05        33
           3       0.00      0.00      0.00         9
           4       0.00      0.00      0.00        12
           5       0.00      0.00      0.00        16

    accuracy                           0.69       337
   macro avg       0.26      0.26      0.25       337
weighted avg       0.57      0.69      0.62       337



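
The three evaluation cells each rebuild a model from `best_params_` and fit it again. With the default `refit=True`, `GridSearchCV` has already refit the best candidate on the full training data and exposes it as `best_estimator_`, so the second fit can likely be skipped. The sketch below illustrates this on synthetic data; `X_tr`, `X_te`, and `search` are hypothetical names, and the reduced grid is chosen only for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the real train/test files.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# Option 1: reuse the estimator that GridSearchCV already refit.
pred_refit = search.best_estimator_.predict(X_te)

# Option 2: rebuild from best_params_ and fit again, as the cells above do.
pred_manual = SVC(**search.best_params_).fit(X_tr, y_tr).predict(X_te)

print(np.array_equal(pred_refit, pred_manual))
```

For solvers that shuffle the data, such as `saga` in logistic regression, the two routes may differ slightly unless `random_state` is fixed.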
