# Lab 2: Using Regression/Tree-based/SVM/NN models for classification in Python

In this lab, we will implement a series of machine learning methods for classification using the scikit-learn package. We will use the built-in cross-validation utilities to select the hyper-parameters of each model, and then compare the models' performance using accuracy, AUROC, and AUPRC.

# Introduction to scikit-learn

Scikit-learn is a powerful and easy-to-use Python library for machine learning. It provides simple and efficient tools for data mining and data analysis, making it accessible to both beginners and experienced practitioners. The library is built on top of NumPy, SciPy, and matplotlib, ensuring seamless integration with these popular scientific computing libraries.

## Key Features of scikit-learn:
- **Classification**
- **Regression**
- **Clustering**
- **Dimensionality Reduction**
- **Model Selection**
- **Preprocessing**

Scikit-learn's consistent API, comprehensive documentation, and active community make it a go-to library for machine learning tasks in Python.

Reference: https://scikit-learn.org/stable/

In [None]:
# A simple example where we fit a LogisticRegression to some very basic data:
# Sample data: 2 samples with 3 features each
X = [[1, 2, 3],
     [11, 12, 13]]

# Classes of each sample
y = [0, 1]

# Import LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Initialize LogisticRegression with a fixed random state for reproducibility
clf = LogisticRegression(random_state=0)

# Fit the model to the data
clf.fit(X, y)

# Predict the classes of the training data
clf.predict(X)

In [None]:
?LogisticRegression

In [None]:
# attributes
print(clf.classes_)
print(clf.coef_)
print(clf.n_iter_)

In [None]:
# methods
clf.predict_proba(X)

API reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

# Generate data

Next, we generate a more complicated dataset.

In [5]:
# Basic Packages
import matplotlib.pyplot as plt # Plotting library
import numpy as np  # Numerical library
import pandas as pd # Dataframe library

In [None]:
# Generate a dataset and plot it
import sklearn.datasets

np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.1) # 200 samples, the larger the noise, the less clear the moons are
print(X.shape)
print(y.shape)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()

# Fitting and evaluating machine learning models:

**General steps:**
1. Split data into training/testing sets
    * Data standardization
2. Fit the model using training set
    * Apply cross-validation strategies to select hyper-parameters
    * Re-fit the model using the best hyper-parameters
3. Evaluate the model using testing set

## Split and standardize the data

In [7]:
# Split the data with shuffle and stratify
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3,  # 30% of the data will be used for testing
    random_state=42,  # Ensures reproducibility
    shuffle=True,  # Shuffle the data before splitting to ensure the data is well-distributed
    stratify=y  # Preserve the class distribution in both sets
)

In [None]:
print(X_train.shape)
print(X_test.shape)

print(X_train.mean(axis=0))
print(X_test.mean(axis=0))

In [10]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
# StandardScaler standardizes features by removing the mean and scaling to unit variance

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
X_train = scaler.fit_transform(X_train)

# Transform the test data using the scaler fitted on the training data
# (do not re-fit the scaler on the test set)
X_test = scaler.transform(X_test)

In [None]:
print(X_train.mean(axis=0))
print(X_test.mean(axis=0))

print(X_train.std(axis=0))
print(X_test.std(axis=0))
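
To illustrate why the scaler is fitted on the training set only, here is a minimal, self-contained sketch. The names `X_demo`, `X_tr`, `X_te`, and `demo_scaler` are assumptions introduced for this example, and the data is regenerated so the block runs on its own. The training features end up with zero mean and unit standard deviation, while the test features are only approximately standardized. That is expected, because they are scaled with the training statistics.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Regenerate a toy dataset so the example runs on its own
X_demo, y_demo = make_moons(200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_std = demo_scaler.fit_transform(X_tr)  # learn mean/std from the training set only
X_te_std = demo_scaler.transform(X_te)      # reuse the training mean/std on the test set

print("train mean:", X_tr_std.mean(axis=0))  # numerically zero
print("train std: ", X_tr_std.std(axis=0))   # numerically one
print("test mean: ", X_te_std.mean(axis=0))  # close to zero, but generally not exactly
print("test std:  ", X_te_std.std(axis=0))   # close to one, but generally not exactly
```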

## Cross-validation in scikit-learn

Most machine learning models have hyper-parameters, which we need to tune based on our data.

In [None]:
# Get the hyperparameters of a model
clf.get_params() # Here is the list of hyperparameters of the LogisticRegression we specified before

Scikit-learn provides a general-purpose class for cross-validated hyper-parameter search: `sklearn.model_selection.GridSearchCV`.

**General Steps to Use `GridSearchCV`**

1. **Set up parameters:**
    - Define the parameter grid that you want to search over.
    ```
    param_grid = {
        'param1': [value1, value2, ...],
        'param2': [value1, value2, ...],
        ...
    }
    ```
2. **Set up model:**
    - Initialize the machine learning model you want to tune (`SomeModel` is a placeholder for any scikit-learn estimator).
    ```
    from sklearn.model_selection import GridSearchCV
    from sklearn.some_model import SomeModel

    model = SomeModel()
    ```
3. **Initialize `GridSearchCV`**
    - Set the model, parameter grid, and other important parameters such as the cross-validation strategy and the scoring metric.
    ```
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    ```
4. **Fit the `GridSearchCV`**
    - Fit using the training data to perform the grid search.
    ```
    grid_search.fit(X_train, y_train)
    ```
5. **Retrieve the best model and its parameters**
    - Retrieve the best hyper-parameters:
    ```
    print('Best hyperparameters:', grid_search.best_params_)
    ```
    - Retrieve the best model. With the default `refit=True`, `GridSearchCV` has already re-fitted it with the best hyper-parameters using all training samples:
    ```
    best_model = grid_search.best_estimator_
    ```
6. **Evaluate the model using the test data**

For more details on cross-validation in scikit-learn, see: https://scikit-learn.org/stable/modules/cross_validation.html#
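
The six steps can be sketched end to end as follows. This is a minimal illustration, not the lab's reference solution. It wraps the scaler and the classifier in a `Pipeline`, which is an assumption of this sketch and is not used elsewhere in the lab, so that standardization is re-fitted inside every cross-validation fold. The step names `"scale"` and `"clf"`, the grid of `C` values, and the `*_demo`, `X_tr`, `X_te` variables are likewise chosen only for this example.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, regenerated so the example is standalone
X_demo, y_demo = make_moons(200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Steps 1-2: parameter grid and model. Inside a Pipeline, parameters are
# addressed as "<step name>__<parameter name>".
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

# Steps 3-4: set up and run the 5-fold grid search on the training data
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="roc_auc")
grid_search.fit(X_tr, y_tr)

# Step 5: best hyper-parameters and the model re-fitted on all training samples
print("Best hyperparameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

# Step 6: evaluate on the held-out test data
test_auc = roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1])
print(f"Test AUROC: {test_auc:.3f}")
```

Because the whole pipeline is passed to `GridSearchCV`, the raw (unscaled) features are given to `fit` and `predict_proba`; the pipeline applies the scaling itself.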

## Evaluation metrics

We will use 3 metrics to evaluate the model performance:
* Accuracy: The ratio of correctly predicted instances to the total instances.
* AUROC: The Area Under the Receiver Operating Characteristic curve, which measures the ability of the model to distinguish between classes.
* AUPRC: The Area Under the Precision-Recall Curve, which evaluates the trade-off between precision and recall for different threshold settings.

**Using `sklearn.metrics`:**
```
from sklearn import metrics

# Calculate Accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate AUROC
auc = metrics.roc_auc_score(y_test, y_pred_prob)
print(f"AUROC: {auc}")

# Calculate AUPRC
auprc = metrics.average_precision_score(y_test, y_pred_prob)
print(f"AUPRC: {auprc}")
```
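
To make the three metrics concrete, here is a small hand-checkable sketch on four made-up samples (the labels and scores are illustrative values, not results from this lab). Accuracy is computed from hard labels, while AUROC and AUPRC are computed from the predicted scores.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1
y_label = (y_score >= 0.5).astype(int)     # hard labels at a 0.5 threshold: [0, 0, 0, 1]

# Accuracy: 3 of the 4 hard labels are correct
accuracy = metrics.accuracy_score(y_true, y_label)

# AUROC: 3 of the 4 (positive, negative) pairs are ranked correctly
auroc = metrics.roc_auc_score(y_true, y_score)

# AUPRC (average precision): 0.5 * 1 + 0.5 * (2/3)
auprc = metrics.average_precision_score(y_true, y_score)

print(accuracy, auroc, auprc)
```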

In [34]:
# Now we are ready to implement some of the models we learned in class
# We first create an empty list to store the performance of the different models
results = []

## Regression-based model

### Logistic regression with L1 regularization

In [None]:
# Logistic Regression with L1 regularization, use cv to find the best hyperparameter
from sklearn.linear_model import LogisticRegression # Logistic Regression in sklearn
from sklearn import metrics # metrics to evaluate the model
from sklearn.model_selection import GridSearchCV # Grid search to find the best hyperparameter

# Define the hyperparameters to search
params = {"C": np.logspace(-3,3,7)} # Inverse of regularization strength

# Initialize the model (the 'saga' solver supports the L1 penalty)
model = LogisticRegression(penalty='l1', solver='saga', tol=1e-8)

# Initialize the GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(model, param_grid=params, cv=5, scoring='roc_auc')

# Fit the model to the training data
grid.fit(X_train, y_train)

# Get the best model
print('Best hyperparameter:', grid.best_params_)
best_model = grid.best_estimator_

# Get the predictions
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:,1]

# Get the performance
test_accuracy = metrics.accuracy_score(y_test, y_pred)
test_auc = metrics.roc_auc_score(y_test, y_pred_prob)
test_auprc = metrics.average_precision_score(y_test, y_pred_prob)

# Store the results
results.append({"model": 'Logistic regression with L1 penalty', "test_accuracy": test_accuracy, "test_auc": test_auc, "test_auprc": test_auprc})
pd.DataFrame(results)

In [None]:
best_model

In [None]:
# Example: compare the cross-validated performance of different hyperparameters
plt.figure()
plt.plot(grid.cv_results_["param_C"], grid.cv_results_["mean_test_score"])
plt.xscale("log")
plt.xlabel("C")
plt.ylabel("Mean cross-validated AUROC")
plt.title("Performance of different hyperparameters")
plt.show()


## Tree-based model

### Decision Tree

In [None]:
# decision tree for classification
from sklearn.tree import DecisionTreeClassifier
# ?DecisionTreeClassifier

# Define the hyperparameters to search
params = {"max_depth": np.arange(1, 11)}

# Initialize the model


# Initialize the GridSearchCV with 5-fold cross-validation


# Fit the model to the training data


# Get the best model


# Get the predictions


# Get the performance


# Store the results


### Random Forest

In [None]:
# Random Forest for classification
from sklearn.ensemble import RandomForestClassifier
# ?RandomForestClassifier

# Define the hyperparameters to search
params = {"n_estimators": [100, 200, 300, 400, 500]} 

# Initialize the model


# Initialize the GridSearchCV with 5-fold cross-validation


# Fit the model to the training data


# Get the best model


# Get the predictions


# Get the performance


# Store the results

### Gradient Boosting

In [None]:
# Gradient Boosting for classification
from sklearn.ensemble import GradientBoostingClassifier
# ?GradientBoostingClassifier

# Define the hyperparameters to search
params = {"n_estimators": [100, 200, 300, 400, 500], "learning_rate": np.logspace(-3,0,4)} 

# Initialize the model


# Initialize the GridSearchCV with 5-fold cross-validation


# Fit the model to the training data


# Get the best model


# Get the predictions


# Get the performance


# Store the results
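
Since the ensemble models follow the same fit/search/evaluate pattern, they can be handled in one loop. The sketch below is a hedged illustration, not the reference solution: it regenerates the data, uses deliberately smaller grids than the ones above to keep the run time short, and the names `candidates` and `ensemble_results` are assumptions made for this example.

```python
from sklearn import metrics
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Regenerated data so the example is standalone
X_demo, y_demo = make_moons(200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# (name, model, grid) triples; the grids are reduced for speed
candidates = [
    ("Random forest", RandomForestClassifier(random_state=0),
     {"n_estimators": [100, 200]}),
    ("Gradient boosting", GradientBoostingClassifier(random_state=0),
     {"n_estimators": [100, 200], "learning_rate": [0.01, 0.1]}),
]

ensemble_results = []
for name, model, grid_params in candidates:
    # 5-fold grid search on the training data
    search = GridSearchCV(model, param_grid=grid_params, cv=5, scoring="roc_auc")
    search.fit(X_tr, y_tr)
    print(name, "best hyperparameters:", search.best_params_)

    # Evaluate the re-fitted best model on the test data
    best = search.best_estimator_
    pred = best.predict(X_te)
    prob = best.predict_proba(X_te)[:, 1]
    ensemble_results.append({
        "model": name,
        "test_accuracy": metrics.accuracy_score(y_te, pred),
        "test_auc": metrics.roc_auc_score(y_te, prob),
        "test_auprc": metrics.average_precision_score(y_te, prob),
    })

print(ensemble_results)
```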

## SVM

In [None]:
# Support Vector Machine for classification
from sklearn.svm import SVC
# ?SVC

# Define the hyperparameters to search
params = {"C": np.logspace(-3,3,7)} 

# Initialize the model


# Initialize the GridSearchCV with 5-fold cross-validation


# Fit the model to the training data


# Get the best model


# Get the predictions


# Get the performance


# Store the results
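
One detail differs for the SVM: by default, `SVC` does not provide `predict_proba`. The sketch below, a hedged illustration rather than the reference solution, therefore uses `decision_function` scores for AUROC and AUPRC, since both metrics only need a ranking of the samples. The RBF kernel, the regenerated data, and names such as `svm_grid` and `svm_score` are assumptions made for this example.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Regenerated and standardized data so the example is standalone
X_demo, y_demo = make_moons(200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
svm_scaler = StandardScaler()
X_tr_std = svm_scaler.fit_transform(X_tr)
X_te_std = svm_scaler.transform(X_te)

# 5-fold grid search over the regularization parameter C
svm_grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": np.logspace(-3, 3, 7)},
    cv=5, scoring="roc_auc",
)
svm_grid.fit(X_tr_std, y_tr)
print("Best hyperparameter:", svm_grid.best_params_)

best_svm = svm_grid.best_estimator_
svm_pred = best_svm.predict(X_te_std)
svm_score = best_svm.decision_function(X_te_std)  # signed score; larger means more likely class 1

svm_result = {
    "model": "SVM (RBF kernel)",
    "test_accuracy": metrics.accuracy_score(y_te, svm_pred),
    "test_auc": metrics.roc_auc_score(y_te, svm_score),
    "test_auprc": metrics.average_precision_score(y_te, svm_score),
}
print(svm_result)
```

An alternative is to construct the model with `SVC(probability=True)`, which enables `predict_proba` at the cost of an extra internal calibration step.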

## Neural Network

In [None]:
# Neural Network for classification
from sklearn.neural_network import MLPClassifier
# ?MLPClassifier

# Define the hyperparameters to search
params = {"hidden_layer_sizes": [(4,), (8,)],  # number of neurons in the hidden layer
          "alpha": np.logspace(-3,0,4), # L2 regularization
          'learning_rate_init': np.logspace(-3,0,4) # initial learning rate
          } 

# Initialize the model


# Initialize the GridSearchCV with 5-fold cross-validation


# Fit the model to the training data


# Get the best model


# Get the predictions


# Get the performance


# Store the results
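
A possible completion for the neural network cell is sketched below. It is a hedged illustration, not the reference solution: the grid is reduced compared with the one above to keep the run time short, `max_iter=1000` and `random_state=0` are assumptions chosen for this example, and the data is regenerated so the block runs on its own. Some grid points may still emit convergence warnings.

```python
from sklearn import metrics
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Regenerated and standardized data so the example is standalone
X_demo, y_demo = make_moons(200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
mlp_scaler = StandardScaler()
X_tr_std = mlp_scaler.fit_transform(X_tr)
X_te_std = mlp_scaler.transform(X_te)

# Reduced grid: hidden layer width, L2 penalty, and initial learning rate
mlp_params = {
    "hidden_layer_sizes": [(4,), (8,)],
    "alpha": [1e-3, 1e-1],
    "learning_rate_init": [1e-2, 1e-1],
}
mlp_grid = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_grid=mlp_params, cv=5, scoring="roc_auc",
)
mlp_grid.fit(X_tr_std, y_tr)
print("Best hyperparameters:", mlp_grid.best_params_)

best_mlp = mlp_grid.best_estimator_
mlp_pred = best_mlp.predict(X_te_std)
mlp_prob = best_mlp.predict_proba(X_te_std)[:, 1]

mlp_result = {
    "model": "Neural network (MLP)",
    "test_accuracy": metrics.accuracy_score(y_te, mlp_pred),
    "test_auc": metrics.roc_auc_score(y_te, mlp_prob),
    "test_auprc": metrics.average_precision_score(y_te, mlp_prob),
}
print(mlp_result)
```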