# Week 13 - Machine learning computer exercises

In this exercise set you will practice using simple supervised machine learning models (**Part 1**) and ensemble machine learning models (**Part 2**). You will be using these models to perform some classification tasks and evaluating their performance. Proceed through the notebook below and report your results in **Week 13. Assignment 1** in Moodle where indicated.

## Background

### What data will we be using and where is it from?
We will be utilizing **gene expression data** from the breast cancer study of [The Cancer Genome Atlas (TCGA)](https://cancergenome.nih.gov/). You can find the publication associated with this study [here](https://www.nature.com/articles/nature11412). TCGA is a landmark cancer genomics program that has molecularly characterized over 20,000 cancer samples from 33 cancer types. The aim of their breast cancer study was to perform comprehensive molecular profiling of human breast tumors by integrating multiple data types, including DNA copy number, DNA methylation, exome sequencing, protein assays, and gene expression information. In this exercise set we will only be focusing on the gene expression data.

### What is breast cancer?
Breast cancer is the most common form of invasive cancer among women (~7.6M cases worldwide annually, 450K deaths worldwide annually). Breast cancer is a heterogeneous disease, meaning that there exist distinct molecular subtypes of breast cancer, some of which are also good predictors of response to treatment. In current routine diagnostics, immunohistochemistry (IHC) is used to determine the presence of routine markers that are used clinically, including:

* Estrogen receptor status (ER)
* Progesterone receptor status (PR)
* Human epidermal growth factor receptor 2 status (HER2)

### Our objective

The breast cancer dataset from TCGA consists of gene expression values for many breast cancer samples. The gene expression data has already been processed and normalized and is ready for us to use for our machine learning tasks. The dataset also contains the status of each sample for ER, PR, and HER2 as described above (assessed by detecting protein expression using IHC). **In this exercise, our objective is to use machine learning approaches to predict the PR status of each breast cancer sample based on its gene expression profile.**

## Setup

Let's start by loading our input data. The data is provided in an Rdata file, which is a file type specific to the R programming language used to store multiple data objects (e.g. matrices and vectors) within a single file. Luckily, we can also load Rdata files in Python with the `pyreadr` module.

Run the cell below to load the input data.

In [None]:
# Install the pyreadr module (comment kept on its own line so it is not part of the pip command)
%pip install pyreadr
import pyreadr # Import the module for use
input_data = pyreadr.read_r("tcga_breast_prediction_train_and_test.Rdata")

# Separate out the objects within the Rdata file
train_data = input_data["x"]
train_meta = input_data["meta"]
test_data = input_data["xTest"]
test_meta = input_data["metaTest"]

The input data we have just loaded contains 4 objects:
* `train_data` --> this is the **training data matrix** (breast cancer samples as rows, genes for which we have expression values as columns)
* `train_meta` --> metadata for the training dataset (which we will extract training labels from)
* `test_data` --> this is the **test data matrix** (samples as rows, genes as columns)
* `test_meta` --> metadata for the test dataset (which we will extract test labels from)

**Question 1:** We have selected a subset of samples and genes for the train and test sets to cut down on computation time. Using commands you learned during the exercises of **Week 1** and **Week 2** (hint: you can use the `pandas` module here), determine how many breast cancer samples we have data for and how many genes have available gene expression measurements in our **training dataset**. Report your answers in Moodle.

In [None]:
# Enter your commands here

## PART 1: NEAREST-NEIGHBOR MACHINE LEARNING MODELS

The data matrix `train_data` contains gene expression measurements that you will use as predictors. The data matrix `train_meta` contains the response variable that we will predict, which will be the **PR (progesterone receptor) status of each breast cancer sample**.

Using the code cell below, determine which columns the data matrices `train_meta` and `test_meta` contain. After this, select the column named `pr` in the `train_meta` and `test_meta` matrices and assign them to new variables `status` and `status_test`, respectively.


In [None]:
# Use the pandas module
import pandas as pd

# Start by checking the column names in train_meta and test_meta
# (print is needed because a notebook cell only displays its last expression)
print(list(train_meta))
print(list(test_meta))

# Edit the code below to assign the column pr to variable status and status_test
status = train_meta['pr'] # Training data
status_test = test_meta['pr'] # Testing data

# What types of PR status information are available?
print(status.unique())


### Exclude training and test data labeled as `Indeterminate` for PR status

For our analyses here, we are only interested in positive and negative PR status. As you saw above, our `status` variable contains Negative, Positive, and Indeterminate (i.e., cannot be determined) labels for PR status. Let's exclude samples labeled as Indeterminate:

In [None]:
# Build a boolean mask that is True for samples NOT labeled as Indeterminate:
ind = status != 'Indeterminate'
print(f"{ind.sum()} training set samples are NOT labeled as PR status Indeterminate")
# Then select only observations that are not Indeterminate in train_data, train_meta, and status:
status = status[ind].reset_index(drop=True)
train_data = train_data[ind.values].reset_index(drop=True)
train_meta = train_meta[ind.values].reset_index(drop=True)
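
The cell above filters with `ind.values` rather than `ind` itself. As a minimal sketch of why, here is the same pattern on a made-up table (the sample names `s1` to `s4` and the genes `geneA` and `geneB` are hypothetical, not from the TCGA data). Using `.values` turns the mask into a plain array, so rows are selected by position even when the two objects carry different index labels; passing the Series itself would make pandas try to align on index labels instead.

```python
import pandas as pd

# Hypothetical toy labels (default index 0..3) and expression table (index s1..s4)
labels = pd.Series(["Positive", "Indeterminate", "Negative", "Positive"])
expr = pd.DataFrame(
    {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [0.5, 0.1, 0.9, 0.3]},
    index=["s1", "s2", "s3", "s4"],
)

# Boolean mask: True for samples we want to keep
keep = labels != "Indeterminate"

# Filter the labels, then filter the table by position using the raw boolean array
labels_kept = labels[keep].reset_index(drop=True)
expr_kept = expr[keep.values].reset_index(drop=True)

print(labels_kept.tolist())
print(expr_kept)
```

After filtering, both objects have the same number of rows and a fresh 0-based index, so row *i* of the table still matches label *i*.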

**Question 2:** How many breast cancer training samples had the label Indeterminate for their PR status? Report your answer in Moodle.

Remove the Indeterminate samples also from the test data:

In [None]:
# Build a boolean mask that is True for test samples NOT labeled as Indeterminate:
ind = status_test != 'Indeterminate'
print(f"{ind.sum()} test set samples are NOT labeled as PR status Indeterminate")
# Then select only observations that are not Indeterminate in test_data, test_meta, and status_test:
status_test = status_test[ind].reset_index(drop=True)
test_data = test_data[ind.values].reset_index(drop=True)
test_meta = test_meta[ind.values].reset_index(drop=True)

### Fitting a k-nearest neighbor (KNN) classification model

[K-nearest neighbor (KNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is a non-parametric supervised machine learning algorithm that we can use to perform **classification** tasks. This means that the algorithm tries to predict the correct label (for us, PR status) of a given input (for us, a breast cancer sample with a particular gene expression profile). KNN does this based on proximity: each input sample is classified by a plurality vote of its neighbors, i.e. it is assigned the label most common among its *k* nearest neighbors in the KNN model's training data (where *k* is a positive integer).
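
To make the voting idea concrete, here is a small from-scratch sketch on made-up two-dimensional points. The function `knn_predict` and the toy data are hypothetical illustrations, not part of scikit-learn or the TCGA data. It uses Euclidean distance, which is also the default metric of `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k):
    """Predict the label of x_new by a plurality vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(x_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: three "Negative" points near (1, 1), two "Positive" points near (5, 5)
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["Negative", "Negative", "Negative", "Positive", "Positive"])

pred_near_neg = knn_predict(x_train, y_train, np.array([1.1, 1.0]), k=3)
pred_near_pos = knn_predict(x_train, y_train, np.array([4.8, 5.1]), k=3)
print(pred_near_neg, pred_near_pos)
```

Note that for the second query point, one of the three nearest neighbors is a "Negative" sample, but it is outvoted 2 to 1.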

In the code cell below, let's train a KNN model using the training data. To illustrate how the model works, we can then use the model to predict the PR status of our training data samples.

You will notice that we use a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to look at the model's performance in terms of true negatives, false positives, false negatives, and true positives. We also calculate the model's [accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall), which is the proportion of all classifications that were correct, whether positive or negative.

In [None]:
# Install the scikit-learn module that contains the KNN model
%pip install scikit-learn
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix # For evaluating model performance
np.random.seed(12345) # Set a seed for reproducible results

k = 5 # This is the parameter k in the KNN model

# Define the KNN model with k neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the model on the training data
knn.fit(train_data, status)

# Predict the PR status of training data samples
predicted_status = knn.predict(train_data)

# Calculate the confusion matrix
cm = confusion_matrix(status, predicted_status)

# Extract the true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()

# Calculate accuracy, sensitivity, and specificity
accuracy = sum(predicted_status == status) / len(status)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f'k={k} accuracy={accuracy} sensitivity={sensitivity} specificity={specificity}')
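
To see how the four confusion matrix cells map to accuracy, sensitivity, and specificity, here is a sketch on eight made-up labels (the arrays `y_true` and `y_pred` are hypothetical). Passing `labels=["Negative", "Positive"]` makes the row and column order explicit, so that `ravel()` returns the cells in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight samples
y_true = np.array(["Positive", "Positive", "Positive", "Negative",
                   "Negative", "Positive", "Negative", "Negative"])
y_pred = np.array(["Positive", "Negative", "Positive", "Negative",
                   "Positive", "Positive", "Negative", "Negative"])

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["Negative", "Positive"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # Proportion of all predictions that are correct
sensitivity = tp / (tp + fn)                # Proportion of true positives that are detected
specificity = tn / (tn + fp)                # Proportion of true negatives that are detected

print(cm)
print(f"accuracy={accuracy} sensitivity={sensitivity} specificity={specificity}")
```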

**Question 3:** Using *k=5*, how many breast cancer patients are **predicted** by our model to be PR positive and negative? Report your answers in Moodle.

**Hint:** In the cell below you can find commands for determining the number of truly PR positive and negative patients in the training set. Modify these commands to determine the number of PR positive and negative patients in the **model predictions**.

In [None]:
print((status == "Positive").sum())
print((status == "Negative").sum())

**Questions 4 and 5:** What happens to the model performance if you change the *k* parameter to be 1? Why does this happen? What are the accuracy, sensitivity, and specificity? Report your answers in Moodle.

### Using cross-validation for optimizing *k* in the KNN classifier

In the section above, we evaluated the accuracy, sensitivity, and specificity of the KNN classifier at two different values of *k* when predicting the PR status of samples in the **training dataset**. We would like to choose a value for *k* that results in the best model performance.

However, as you learned during the lecture, supervised prediction models can easily be overfitted, and care has to be taken to avoid this. One way to reduce the risk of overfitting when optimizing the complexity of the model (here denoted by *k*), is to use cross-validation.

In the code cell below, you are provided with the code for separating data into training and test cross-validation sets, together with a `for` loop for the cross-validation. The cross-validation implemented below is [leave-one-out cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)): each sample in turn is held out as a single-sample test set while the model is trained on all remaining samples.

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
np.random.seed(12345) # Set a seed for reproducible results

k = 5 # Set a value for the number of neighbors, k

n = len(train_data) # Number of samples in the train dataset

true_pr_status_save = [] # Empty vector for storing true PR status labels
predicted_pr_status_save = [] # Empty vector for storing predicted PR status labels

# This loop is for the cross-validation process
for i in range(n): # Loop through all samples in the training dataset
    # Define samples to be used for training and testing
    index_test = [i]  # This is the observation index that will be used as test
    index_training = list(set(range(n)) - set(index_test))  # These are the observations used as the training set

    # Generate training and testing data matrices and PR status labels
    x_training = train_data.iloc[index_training, :]  # Training set gene expression values
    y_training = status.iloc[index_training]  # Training set PR status labels
    x_test = train_data.iloc[index_test, :]  # Test set gene expression values
    y_test = status.iloc[index_test]  # Test set PR status labels

    # Model training and testing
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_training, y_training)
    predicted_label = knn.predict(x_test)
    predicted_pr_status_save.extend(predicted_label)  # Save predicted PR status
    true_pr_status_save.extend(y_test)  # Save true PR status

# Convert lists to arrays for metrics calculation
predicted_pr_status_save = np.array(predicted_pr_status_save)
true_pr_status_save = np.array(true_pr_status_save)

# Calculate the confusion matrix
cm = confusion_matrix(true_pr_status_save, predicted_pr_status_save)

# Extract the true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()

# Calculate accuracy, sensitivity, and specificity
accuracy = sum(predicted_pr_status_save == true_pr_status_save) / len(true_pr_status_save)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f'accuracy = {accuracy}')
print(f'sensitivity = {sensitivity}')
print(f'specificity = {specificity}')
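
The explicit loop above is written out for teaching purposes. As a hedged sketch, the same leave-one-out procedure can also be expressed with scikit-learn's `LeaveOneOut` splitter and `cross_val_predict` helper. The synthetic two-cluster data and the names `x`, `y`, and `loo_pred` are assumptions made for illustration, not the TCGA data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 20 "Negative" samples around 0 and 20 "Positive" samples around 3, 5 features each
rng = np.random.default_rng(0)
x_neg = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
x_pos = rng.normal(loc=3.0, scale=1.0, size=(20, 5))
x = np.vstack([x_neg, x_pos])
y = np.array(["Negative"] * 20 + ["Positive"] * 20)

# Each sample is predicted by a model trained on all the other samples
knn = KNeighborsClassifier(n_neighbors=5)
loo_pred = cross_val_predict(knn, x, y, cv=LeaveOneOut())

loo_accuracy = float(np.mean(loo_pred == y))
print(f"leave-one-out accuracy = {loo_accuracy}")
```

The array `loo_pred` plays the same role as `predicted_pr_status_save` in the loop above: one held-out prediction per sample.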

**Question 6:** How does the performance of the model (accuracy, sensitivity, and specificity) using cross-validation and *k=5* compare to the performance of the model without cross-validation and *k=5* from the previous section? Report your answer in Moodle.

Let's modify the cross-validation code above so that we can evaluate the accuracy of the model at different values of *k*. We can use the code below to perform a [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization) over values from *k=1* to *k=100* (in steps of 3) to determine which value of *k* produces the best model performance. **Please note that this code can take a few minutes to run!**

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
np.random.seed(12345) # Set a seed for reproducible results

# Evaluate cross-validation results for k = 1 to k = 100 with increments of 3.
kmin = 1 # Minimum k to evaluate
kmax = 100 # Maximum k to evaluate
result_matrix = np.zeros((0,2)) # Initialize empty matrix with zero rows and two columns
for k in range(kmin, kmax + 1, 3):
    n = len(train_data) # Number of samples in the train dataset
    true_pr_status_save = [] # Empty vector for storing true PR status labels
    predicted_pr_status_save = [] # Empty vector for storing predicted PR status labels

    for i in range(n): # Loop through all samples in the training dataset
        # Define samples to be used for training and testing
        index_test = [i]  # This is the observation index that will be used as test
        index_training = list(set(range(n)) - set(index_test))  # These are the observations used as the training set

        # Generate training and testing data matrices and PR status labels
        x_training = train_data.iloc[index_training, :]  # Training set gene expression values
        y_training = status.iloc[index_training]  # Training set PR status labels
        x_test = train_data.iloc[index_test, :]  # Test set gene expression values
        y_test = status.iloc[index_test]  # Test set PR status labels

        # Model training and testing
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_training, y_training)
        predicted_label = knn.predict(x_test)
        predicted_pr_status_save.extend(predicted_label)  # Save predicted PR status
        true_pr_status_save.extend(y_test)  # Save true PR status

    # Convert lists to arrays for metrics calculation
    predicted_pr_status_save = np.array(predicted_pr_status_save)
    true_pr_status_save = np.array(true_pr_status_save)

    # Calculate accuracy
    accuracy = sum(predicted_pr_status_save == true_pr_status_save) / len(true_pr_status_save)

    prediction_results = [k, accuracy]
    result_matrix = np.vstack([result_matrix, prediction_results])

print(result_matrix)
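
For comparison, scikit-learn also provides `GridSearchCV`, which wraps the same idea (try each candidate value, score it by cross-validation, keep the best). The sketch below runs it on synthetic data with a smaller grid. The data and the names `x`, `y`, `search`, and `best_k_toy` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two partly overlapping clusters of 20 samples each
rng = np.random.default_rng(1)
x = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 5)),
    rng.normal(loc=1.5, scale=1.0, size=(20, 5)),
])
y = np.array(["Negative"] * 20 + ["Positive"] * 20)

# Candidate values of k: 1, 4, 7, ..., 19
param_grid = {"n_neighbors": list(range(1, 20, 3))}

# Score each candidate by leave-one-out accuracy
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=LeaveOneOut(),
    scoring="accuracy",
)
search.fit(x, y)

best_k_toy = search.best_params_["n_neighbors"]
print(f"best k on toy data = {best_k_toy}, accuracy = {search.best_score_}")
```

Here `search.cv_results_["mean_test_score"]` holds one mean accuracy per candidate *k*, comparable to column 1 of `result_matrix`.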


Then we can plot the results from the grid search:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Extract accuracy from the result_matrix
accuracy = result_matrix[:,1]
k_values = result_matrix[:,0]

# Plot the accuracy
plt.plot(k_values, accuracy, color='black', label='Accuracy')  # Plot accuracy in black

# Set plot limits and labels
plt.ylim(0, 1)
plt.xlabel('k')
plt.ylabel('accuracy')

# Add grid and legend
plt.grid(True)
plt.legend()

# Display the plot
plt.show()

**Question 7:** Looking at model accuracy, let's choose the value of *k* that maximizes accuracy of the KNN model with our data. The command below helps you find the index of the *k* value corresponding to the maximum accuracy in the variable *result_matrix*. What is the value of *k* corresponding to this index of the matrix? Remember what you learned about slicing (also called indexing) in Python during the **Week 2** exercises. Report your answer in Moodle.

In [None]:
# argmax finds the index of the row in result_matrix where maximum accuracy was observed.
# The accuracies are stored in column 1 of the matrix.
best_ind = np.argmax(result_matrix[:,1])
# Find the corresponding value of k stored in column 0 of the matrix on the row
# indicated by best_ind.
best_k = # EDIT HERE TO OBTAIN THE K VALUE BY SLICING/INDEXING THE MATRIX
print("Best value for k was: "+str(int(best_k)))

### Final evaluation of the optimized KNN classifier on test data

Now that we have used the cross-validation approach on training data to find an optimal design for our KNN classifier, we can proceed to train our final model on the full training dataset. In the case of KNN, this is mainly a matter of selecting an optimal value for *k*, but more complex models may have hundreds of tunable settings and parameters. After this, the test data can be accessed to evaluate how the final model will perform on new, unseen data to give us a realistic estimate of how the model will work in "the real world".

Ideally, this process should only be conducted once to avoid overtweaking the model design to perform optimally on the test data. **It is considered bad practice in machine learning to go back to changing the model design (in our case the parameter *k*) after we have accessed the held-out test data and seen how the model performs on unseen data.** There is a risk that tweaking the model to perform optimally on this test dataset will not translate to good performance on other future test sets (i.e., the model does not *generalize*). If we utilize information from this test dataset to further tweak the model design, we should collect another new unseen test dataset to evaluate the performance of the updated model.

Here is the code we used earlier to train and evaluate the KNN model on the training data. Edit the code (see the lines indicated with "EDIT HERE") to train the model with the optimal value of *k* on the training dataset, and to run the predictions and evaluation on the test dataset.

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix # For evaluating model performance
np.random.seed(12345) # Set a seed for reproducible results

k = # EDIT HERE TO USE THE OPTIMAL VALUE OF K

# Define the KNN model with k neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the model on the training data
knn.fit(train_data, status)

# Predict the PR status of test data samples
predicted_status = # EDIT HERE TO RUN PREDICTIONS ON TEST DATA

# Calculate the confusion matrix
cm = # EDIT HERE TO CALCULATE CONFUSION MATRIX USING THE PREDICTED AND TRUE STATUS FOR TEST DATA

# Extract the true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()

# Calculate accuracy, sensitivity, and specificity
accuracy = # EDIT HERE TO CALCULATE ACCURACY FOR TEST DATA
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f'k={k} accuracy={accuracy} sensitivity={sensitivity} specificity={specificity}')

**Question 8:** How did the final model perform on the test dataset? Report the accuracy with three decimals in Moodle.

## PART 2: ENSEMBLE MACHINE LEARNING MODELS

In this second part of the exercises, we will use **ensemble machine learning models** to predict the PR status of breast cancer samples based on their gene expression profiles. [Ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) trains two or more machine learning algorithms on a specific classification task, with each algorithm within the ensemble model referred to as a *base learner*. Ensemble learning is based on the idea that while a single base learner may have poor predictive ability, multiple base learners together will perform better.
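
A back-of-the-envelope sketch of why combining weak learners can help: if we assume an idealized ensemble in which each base learner is correct with the same probability and makes its errors independently of the others, the accuracy of a simple majority vote follows from the binomial distribution. The function `majority_vote_accuracy` is a hypothetical helper. The independence assumption rarely holds exactly in practice, and gradient boosting combines its trees sequentially rather than by an independent vote, so this only illustrates the general intuition.

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """Probability that a majority of n independent learners is correct (n_learners should be odd)."""
    min_votes = n_learners // 2 + 1  # Smallest number of correct votes that forms a majority
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(min_votes, n_learners + 1)
    )

# Each base learner alone is only slightly better than a coin flip
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```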

The model we will focus on is the **gradient boosted tree**, which uses an ensemble of **decision trees** as base learners.

### Training an XGBoost model and running predictions on test data
Using the same TCGA breast cancer gene expression data, let's predict PR status using gradient boosted trees implemented with the popular XGBoost library. XGBoost is widely used as a strong baseline for prediction tasks on tabular data such as ours.

Let's first train an XGBoost model using the same training dataset as for the KNN model and then run predictions on the test set before we analyze them further.

In [None]:
%pip install xgboost
import xgboost as xgb
import random
random.seed(12345) # Set a seed for reproducible results

# Convert status to numerical labels, where 1 indicates positive for PR and
# 0 indicates negative for PR.
numerical_status = [1 if s == "Positive" else 0 for s in status]
numerical_status_test = [1 if s == "Positive" else 0 for s in status_test]

# Convert data to XGBoost's DMatrix format
dtrain = xgb.DMatrix(train_data, label=numerical_status)

# Define the most important XGBoost parameters
params = {
    'objective': 'binary:logistic',  # For binary classification
    'eval_metric': 'logloss',  # Evaluation metric
    'eta': 0.1,  # Learning rate
    'max_depth': 3,  # Maximum tree depth
    'subsample': 0.8, # Subsample ratio of the training instances
    'random_state': 42
}

# Train the XGBoost model for 50 iterations
model = xgb.train(params, dtrain, num_boost_round=50)

# Convert test data to DMatrix format
dtest = xgb.DMatrix(test_data)

# Make predictions on test data
predictions = model.predict(dtest)


### Performing ROC analysis for a probabilistic model
Like many machine learning models (with KNN being an exception), the predictions XGBoost outputs are probabilities. The variable *predictions* now contains the predicted probabilities for each test sample. A value of 0 means it's very unlikely that the sample is positive for PR and a value of 1 means it's very likely that the sample is positive.

But how certain do we need to be to consider a sample to be positive? To classify each sample as either positive or negative for PR status, we can use different cutoff thresholds for the probabilities. A typical default is 0.50, meaning samples with >= 50% probability of being positive are classified as positive, while any samples with <50% probability are classified as negative.

In different applications, we may however want to select another threshold to minimize false negatives or false positives. For example, we might want to use a model in medical diagnostics in a way that prioritises sensitivity to minimize false negatives (due to the potentially serious consequences of failing to detect a serious disease). This would be achieved by using a lower threshold for calling a sample positive, e.g. 0.20. The tradeoff is that specificity will then be decreased, leading to more "false alarms". This may however be acceptable, if the clinician can then use other confirmatory tests to avoid misdiagnosing a healthy patient with the disease.
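
The tradeoff described above can be seen on a handful of made-up probabilities. In this sketch the arrays `y_true` and `y_prob` and the helper `sens_spec` are hypothetical. Lowering the threshold from 0.50 to 0.20 catches every positive sample, at the cost of more false positives.

```python
import numpy as np

# Hypothetical true labels (1 = positive) and predicted probabilities for eight samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.90, 0.70, 0.40, 0.25, 0.60, 0.30, 0.15, 0.05])

def sens_spec(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) when probabilities >= threshold are called positive."""
    pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    return tp / (tp + fn), tn / (tn + fp)

sens_050, spec_050 = sens_spec(y_true, y_prob, 0.50)
sens_020, spec_020 = sens_spec(y_true, y_prob, 0.20)
print(f"threshold 0.50: sensitivity={sens_050} specificity={spec_050}")
print(f"threshold 0.20: sensitivity={sens_020} specificity={spec_020}")
```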

A useful tool for analyzing the performance of a prediction model across all possible thresholds is the Receiver Operating Characteristic (ROC) curve. Each point on a ROC curve shows the sensitivity and specificity of the model with a given threshold value, and the curve is obtained by varying the threshold from 0 to 1. Let's plot the ROC curve and tabulate the underlying sensitivity & specificity & threshold combinations for our XGBoost PR status model on the test data.

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Convert lists with the predicted and true PR status to arrays for calculations
predictions = np.array(predictions)
numerical_status_test = np.array(numerical_status_test)

# Calculate the points of the ROC curve. For each of the evaluated probability
# thresholds, we calculate the False Positive Rate (FPR) which equals 1 - specificity
# and the True Positive Rate, which is just another name for sensitivity.
fpr, tpr, thresholds = roc_curve(numerical_status_test, predictions)

# We can additionally calculate Area Under the Curve (AUC) to summarize the
# overall performance of the model across all thresholds.
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.title('ROC Curve for XGBoost PR Status Prediction')
plt.legend(loc="lower right")
plt.show()
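
To show what `roc_curve` and `auc` compute, here is a hedged from-scratch sketch on six made-up samples. The helper `roc_points` and the toy arrays are hypothetical, and the real scikit-learn implementation is more efficient and may drop redundant points. We sweep the threshold from high to low, record one (FPR, TPR) point per threshold, and integrate the curve with the trapezoidal rule.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels (1 = positive) and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.30, 0.35, 0.60, 0.70, 0.90])

def roc_points(y_true, y_prob):
    """Compute FPR and TPR at every distinct threshold, from strictest to most lenient."""
    # Start at +inf so that the first point is (0, 0): nothing is called positive
    thresholds = np.r_[np.inf, np.sort(np.unique(y_prob))[::-1]]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = y_prob >= t  # Samples called positive at this threshold
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr), thresholds

fpr_toy, tpr_toy, thr_toy = roc_points(y_true, y_prob)

# Area under the curve by the trapezoidal rule
auc_manual = float(np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2))
print(f"manual AUC = {auc_manual:.3f}")
print(f"sklearn AUC = {roc_auc_score(y_true, y_prob):.3f}")
```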


**Question 9:** Each point on the ROC curve represents a specific threshold and a resulting tradeoff between sensitivity and specificity. Note that it is customary to use an inverted x-axis by plotting 1-specificity (i.e. False Positive Rate) instead of specificity. What would the ROC curve look like for a very well performing model? Report your answer in Moodle.

### Adjusting decision thresholds for a probabilistic model
Based on the ROC analysis, we can pick a suitable probability threshold for classifying each predicted sample as positive or negative for PR status. We can analyze the ROC curve for a compromise between sensitivity and specificity that we consider optimal for our application. In a real-world scenario, this could involve discussions about the consequences of false positives vs false negatives in terms of costs or even ethical considerations.

Let's assume we need to obtain a sensitivity of at least 95% in our task of predicting which samples are positive for PR status. That is, we accept missing 5% of the positive samples. Using the code below, we can examine the data underlying the ROC curve in tabular form.

In [None]:
import pandas as pd

# Create a DataFrame from the ROC curve data
roc_df = pd.DataFrame({'Threshold': thresholds, 'Specificity': 1-fpr, 'Sensitivity': tpr})

# Display the DataFrame as a table
roc_df

**Question 10:** Which threshold would you pick to achieve at least 95% sensitivity and what will be the resulting specificity of the model? Report your answer in Moodle.

**Question 11:** Finally, apply the threshold you selected to the probabilities predicted by the model to obtain a final classification (1 meaning PR-positive vs. 0 meaning PR-negative) for each test sample. Edit the code below to calculate how many samples in the test set were predicted to be positive at this threshold. Report your answer in Moodle.

In [None]:
# Convert probabilities to class labels (0 or 1)
threshold = # EDIT HERE TO SET YOUR PROBABILITY THRESHOLD
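# Use >= so that the result matches roc_curve, which calls a sample positive
# when its predicted probability is greater than or equal to the threshold
predicted_labels = [1 if p >= threshold else 0 for p in predictions]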

n_positives = # EDIT HERE TO CALCULATE HOW MANY SAMPLES WERE PREDICTED POSITIVE

print("The model predicted "+str(n_positives)+" positive samples")

### Analyzing the importance of predictor variables
Some models, including XGBoost, allow estimating which of the input variables had the highest importance for the prediction. In statistical circles the input variables of a predictive model would usually be called *predictors*, while in the machine learning community they are often called *features*. By analyzing which features/predictors contain the most useful information for the prediction task of interest, we can both troubleshoot poorly performing models and sometimes even make new discoveries.

Let's analyze which of the genes in our dataset are most informative for the trained XGBoost model in predicting PR status. There are different ways of defining importance and we will not go into the details, but here we will use the **"gain" based importance score**.

Using the code below, calculate the relative importance of the features and examine the top-10.

In [None]:
import matplotlib.pyplot as plt

# Get feature importance scores
importance_scores = model.get_score(importance_type='gain')

# Convert to DataFrame for easy sorting
importance_df = pd.DataFrame({'Feature': list(importance_scores.keys()), 'Importance': list(importance_scores.values())})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Get top 10
top_10_features = importance_df.head(10)


# Plotting
plt.figure(figsize=(10, 6))
plt.bar(top_10_features['Feature'], top_10_features['Importance'])
plt.xlabel('Feature (Gene)')
plt.ylabel('Importance Score')
plt.title('Top 10 Most Important Features for PR Status Prediction')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

**Question 12:** The labels shown on the plot above are Ensembl gene identifiers. Search for at least the top-3 most predictive genes in [the NCBI Gene database](https://www.ncbi.nlm.nih.gov/gene/). Report the official gene symbols for the three genes in Moodle.

**Question 13:** What can you conclude based on the identities of the three genes whose gene expression is most predictive of progesterone receptor status in these data? Remember that the PR status in this dataset was established by detecting the presence or absence of PR proteins using immunohistochemistry. Report your answer in Moodle.