# Lab 2.1: Using logistic regression to predict the quality of mass spectrometry experiments.

As we learned in the previous lecture, logistic regression is a supervised learning method for classification when you have two classes. In this lab, we're going to explore one use case of logistic regression applied to mass spectrometry data: predicting the quality of an LC-MS experiment. In fact, we'll be recreating some of the work presented by [Amidan et al.](https://pubs.acs.org/doi/10.1021/pr401143e).

Before we begin our analysis though, we need to first import a few Python packages:

In [1]:
import pandas as pd                # For working with tabular data.
import matplotlib.pyplot as plt    # For plotting data
import seaborn as sns              # For theming our plots.
from sklearn.svm import SVC        # to use an SVM
from sklearn.linear_model import (
    LogisticRegression,            # To use logistic regression.
    LogisticRegressionCV,          # Automatically select hyperparameters using cross-validation
) 
from sklearn.preprocessing import StandardScaler # Used to normalize the features

# These will calculate our ROC and precision-recall curve metrics.
from sklearn.metrics import (
    roc_auc_score,
    RocCurveDisplay,
    average_precision_score,
    PrecisionRecallDisplay,
    accuracy_score
)


# Make our plots look nice:
sns.set(context="notebook", style="ticks")

# These variables define the paths to our data:
metadata_csv = "../data/quality/metadata.csv"  # A summary of the data
train_csv = "../data/quality/training.csv"     # The training set
test_csv = "../data/quality/test.csv"          # The test set



## The data

The data that we'll be working with was manually annotated to label LC-MS datasets from several Orbitrap instruments as either good or poor. Here is how [Amidan et al.](https://pubs.acs.org/doi/10.1021/pr401143e) describe the annotation process:
> The data sets were manually reviewed by three expert instrument operators (30+ years of combined LC–MS experience) using an in-house graphical user interface viewer. This viewer contained the base peak chromatogram, total ion current chromatogram, plots of both the top 50 000 and top 500 000 LC–MS detected features, and the number of peptides identified. In the first round, 1150 data sets were manually curated as “good”, “okay”, or “poor” and used to develop the classifier. In cases where the assessors disagreed (∼5–10%), the majority opinion was taken for the curated value. Moreover, the “okay” value was used to denote the wide range of performance, which, although not optimal, was still acceptable. 

This annotation process resulted in the following data that we'll be using. Run the code cell below to see an overview:

In [None]:
# Read and display the metadata table:
pd.read_csv(metadata_csv)

For our model, we're going to focus on the Velos Orbitrap data, because it has the most examples. Each instrument type must be modeled separately, because the instrument types have different features. Let's load the training data and take a look:

In [None]:
train_df = pd.read_csv(train_csv) # Read the training data.
train_df

Notice the columns with `NaN` values. These indicate features that are missing for a particular instrument model. Now we'll filter for the Velos Orbitrap data and select the feature columns:


In [None]:
# Filter for only Velos Orbitrap data:
train_velos = train_df.loc[train_df["Instrument_Category"] == "VOrbitrap", :]

# These columns are not features, so we want to ignore them:
not_feature_columns = [
    "Instrument_Category",
    "Instrument",
    "Dataset_ID",
    "Acq_Time_Start",
    "Acq_Length",
    "Dataset",
    "Dataset_Type",
    "Curated_Quality",
    "BinomResp",
    "LLRC.Prediction",
    # These features don't exist for this instrument:
    "MS1_TIC_Q2",
    "MS1_TIC_Q3",
    "C_1A",
    "C_1B",
    "C_2B",
    "C_3B",
    "C_4A",
    "C_4C",
    "P_2Anorm",
]

# Find the feature columns in the data:
# This is a Python feature called a "list comprehension"
feature_columns = [c for c in train_velos.columns if c not in not_feature_columns]

# Extract the features and labels:
X_train = train_velos.loc[:, feature_columns]
y_train = train_velos.loc[:, "BinomResp"]

# Print the size of our dataset:
print("We have", X_train.shape[1], "features")
print("and", X_train.shape[0], "examples.")
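The instrument-specific columns in the exclusion list above were typed out by hand. As a rough sketch of an alternative, pandas can find the columns that are entirely `NaN` once the data is filtered. The toy table below, including its `feat_a`, `feat_b`, and `feat_c` columns and the `Other` category, is made up for illustration and is not part of the lab data:

```python
import numpy as np
import pandas as pd

# A toy table standing in for the real data (hypothetical values and feature names).
toy_df = pd.DataFrame(
    {
        "Instrument_Category": ["VOrbitrap", "VOrbitrap", "Other"],
        "feat_a": [1.0, 2.0, 3.0],
        "feat_b": [np.nan, np.nan, 5.0],  # Missing for every VOrbitrap row.
        "feat_c": [0.5, np.nan, 0.1],     # Only partially missing.
    }
)

# Keep only the rows for one instrument type:
toy_velos = toy_df.loc[toy_df["Instrument_Category"] == "VOrbitrap", :]

# Find the columns in which every value is missing for this instrument:
all_nan_columns = toy_velos.columns[toy_velos.isna().all()].tolist()
print(all_nan_columns)
```

Columns that are only partially missing, like `feat_c` here, are not caught by this check and would still need to be handled separately.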

Note that here is where we would normally perform data exploration, making plots of everything we think might be relevant and identifying potential problems with data quality. We're not going to do that here for the sake of time.

## Normalize the features

Many machine learning methods work best when the features of our dataset are centered (transformed so that their mean is 0) and scaled (transformed so that they are all of the same magnitude). For a single feature, this is easily accomplished by subtracting its mean and dividing by its standard deviation. This is often called *standardization* or *standard scaling*, because each feature will then have a mean of 0 and a standard deviation of 1, like a standard normal distribution.

In [None]:
# Scale the features:
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X_train)
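
To see what the scaler is doing, here is a minimal sketch that standardizes a tiny made-up feature matrix by hand and checks the result against `StandardScaler`. The values in `X_toy` are arbitrary and only serve as an illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny made-up feature matrix: 4 examples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Standardize by hand: subtract each column's mean, divide by its standard deviation.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# StandardScaler performs the same calculation for every column:
X_scaled = StandardScaler().fit_transform(X_toy)

# Check that the two approaches agree:
print(np.allclose(X_manual, X_scaled))

# Each column now has a mean of 0 and a standard deviation of 1:
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Note that the scaler remembers the means and standard deviations it was fit on, which is why the same fitted `scaler` can later be applied to new data with `transform()`.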

## Fit a logistic regression model

We're now ready to fit our logistic regression model. Just as in the paper, we're going to use L1 regularization to perform implicit feature selection. Recall that L1 regularization drives the coefficients of uninformative features to exactly 0, essentially removing those features from the model. The amount of L1 regularization is a hyperparameter, and we'll use 3-fold cross-validation to select it.

Fortunately, the `LogisticRegressionCV` model in scikit-learn can take care of the cross-validation for us!
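
Before fitting the real model, it may help to see the feature selection effect of L1 regularization on its own. The sketch below uses synthetic data from `make_classification` (not the lab data) and fits a plain `LogisticRegression` at two arbitrarily chosen regularization strengths. It uses the `liblinear` solver, which also supports L1 penalties, simply because it is fast on small problems:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry signal.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

# In scikit-learn, C is the inverse of the regularization strength:
# a smaller C means a stronger penalty.
n_nonzero = {}
for C in [0.05, 10.0]:
    toy_model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    toy_model.fit(X_toy, y_toy)
    n_nonzero[C] = int(np.count_nonzero(toy_model.coef_))
    print(f"C={C}: {n_nonzero[C]} of {X_toy.shape[1]} coefficients are nonzero")
```

With the stronger penalty, fewer coefficients should survive. Cross-validation is a way of choosing where on this spectrum the model predicts best.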

In [None]:
# Create the model:
model = LogisticRegressionCV(
    penalty="l1",       # Use L1 regularization
    cv=3,               # Use 3 cross-validation folds.
    scoring="roc_auc",  # Use ROC AUC to choose which model is best.
    solver="saga",      # A solver that supports L1 regularization.
    max_iter=5000,      # Maximum number of iterations allowed for the solver to converge.
)

# Fit the model:
model.fit(X_standardized, y_train)

## Let's see how we did

Now that we're done with the modeling process, it's time to assess the performance of the model and look at what it learned. To do that, we'll load our test set and predict on it.

In [None]:
test_df = pd.read_csv(test_csv) # Read the test data.

# Filter for only Velos Orbitrap data:
test_velos = test_df.loc[test_df["Instrument_Category"] == "VOrbitrap", :]

# Extract the features and labels:
X_test = test_velos.loc[:, feature_columns]
y_test = test_velos.loc[:, "BinomResp"]

# Scale and center our test data:
X_test_standardized = scaler.transform(X_test)

# Get the class predictions for the test set examples:
y_pred = model.predict(X_test_standardized)

What is the accuracy of our classifier on the test set? 

In [None]:
print(f"Accuracy: {100 * accuracy_score(y_test, y_pred):.2f}%")

Two plots commonly used to assess the performance of a binary classifier are receiver operating characteristic (ROC) curves and precision-recall curves.

An *ROC curve* provides a more comprehensive view of performance than accuracy, since it describes the trade-off between the false positive rate and the true positive rate across the full range of predicted probabilities. It shows how well the predicted probabilities p(X) separate the two classes, without committing to any single decision threshold.

A *precision-recall curve* summarizes the trade-off between precision (the proportion of positive predictions that are correct) and recall (the true positive rate, also called sensitivity: the proportion of true positive examples that are predicted as positive) for a classifier across different probability thresholds. While it conveys similar information to an ROC curve, it does not depend on the number of true negatives, so it is often a better choice for imbalanced datasets in which the positive class is rare.
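
To make the quantities behind these curves concrete, here is a small sketch that computes them by hand at a single threshold. The four labels and scores are made up for illustration, and the threshold of 0.5 is an arbitrary choice:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels and predicted probabilities for four examples.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Turn the probabilities into class predictions at a single threshold:
threshold = 0.5
y_hat = (scores >= threshold).astype(int)

# Count the four possible outcomes:
tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # True positives
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # False positives
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # False negatives
tn = int(np.sum((y_hat == 0) & (y_true == 0)))  # True negatives

tpr = tp / (tp + fn)        # True positive rate, also called recall or sensitivity.
fpr = fp / (fp + tn)        # False positive rate.
precision = tp / (tp + fp)  # Fraction of positive predictions that are correct.
print(f"TPR (recall) = {tpr}, FPR = {fpr}, precision = {precision}")

# The curves sweep the threshold over all values; these scores summarize them:
roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)
print(f"ROC AUC = {roc_auc:.3f}, average precision = {avg_precision:.3f}")
```

Each point on an ROC curve is one (FPR, TPR) pair like the one computed here, and each point on a precision-recall curve is one (recall, precision) pair.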

How do these ROC and precision-recall curves look to you?

In [None]:
# Get the probabilities predicted for the test set examples:
probs = model.predict_proba(X_test_standardized)[:, 1]

# Create our blank figure with 2 panels
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Plot the ROC and PR curves:
RocCurveDisplay.from_predictions(y_test, probs, ax=axs[0])
PrecisionRecallDisplay.from_predictions(y_test, probs, ax=axs[1])

# Display the plot:
plt.show()

**Homework:** How do our results compare to those presented in the paper?

## Let's try an SVM model 

It's possible that the relationship between some of our features and spectral quality is non-linear. With an SVM using an RBF kernel, we can capture this non-linearity and potentially reduce the bias of our model. However, this added complexity has the potential to increase the variance if we start overfitting to the training data. Let's see what happens! 
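
As a quick illustration of what "capturing non-linearity" means, the sketch below compares a linear model and an RBF-kernel SVM on synthetic data from `make_circles`, where one class forms a ring around the other. This toy dataset is an assumption made for illustration only and says nothing about how the two models will behave on the LC-MS data:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other, so no straight
# line can separate them.
X_toy, y_toy = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# A linear model versus an RBF-kernel SVM, both with default settings:
linear_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic regression accuracy: {linear_acc:.2f}")
print(f"RBF SVM accuracy: {rbf_acc:.2f}")
```

On data like this, the linear model can do little better than guessing, while the RBF kernel can learn the circular boundary. Real data is rarely this extreme, so the gain from a non-linear model is usually smaller.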



In [2]:
# Initialize the SVM model:
model = SVC(
    probability=True,   # Have the model output probabilities rather than binary labels (so we can plot ROC curves) 
    kernel='rbf',       # We want to use an RBF kernel to allow the SVM to capture non-linear relationships
    gamma=1,            # Our data is standardized and thus has variance 1 
    random_state=1966   # Set the random seed to make training reproducible  
)

# Fit the model:
model.fit(X_standardized, y_train)

# Get the class predictions for the test set examples:
y_pred = model.predict(X_test_standardized)
print(f"Accuracy: {100 * accuracy_score(y_test, y_pred):.2f}%")

# Get the probabilities predicted for the test set examples:
probs = model.predict_proba(X_test_standardized)[:, 1]

# Create our blank figure with 2 panels
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Plot the ROC and PR curves:
RocCurveDisplay.from_predictions(y_test, probs, ax=axs[0])
PrecisionRecallDisplay.from_predictions(y_test, probs, ax=axs[1])

# Display the plot:
plt.show()


How does the SVM compare to logistic regression?

Despite the difference in performance, can you think of a reason one might still choose to use logistic regression for this task?