# CMSC 35440 Machine Learning in Biology and Medicine
## Homework 2: Molecular Subtyping of Lung Cancer
**Released**: Jan 28, 2025

**Due**: Feb 7, 2025 at 11:59 PM Chicago Time on Gradescope

**In the second homework, you'll explore embeddings of genomic expression data for two related cancer types.**

Lung adenocarcinoma and lung squamous cell carcinoma are the 2 most prevalent non-small cell lung cancer (NSCLC) types. They are related but distinct cancer types. Lung squamous cell carcinoma is the most common tumor in male smokers and occurs more centrally in the lung (closer to the root of the lung, nearer to the bronchi). Lung adenocarcinoma is the most common tumor in nonsmokers and female smokers and occurs more peripherally in the lung (closer to the sides of the lung, further from the bronchi). Hopefully this diagram of lung anatomy helps clarify the locations mentioned: https://en.wikipedia.org/wiki/Lung#/media/File:Illu_bronchi_lungs.jpg.

However, there are often exceptions to these epidemiological and anatomical patterns. This is why obtaining a histological (tissue slides) or molecular (e.g. gene expression) profile of a patient's specific cancer is vital for accurate diagnosis and subsequent treatment. But even molecular patterns of cancer can be heterogeneous. In this homework, you'll explore some of that heterogeneity and observe how models can get things wrong.

Last, you'll practice a vital step for biomedical machine learning: expert review. Before these models can ever be deployed in real patient settings, they must undergo rigorous review. In the US, any medical product intended for patient use must be approved by the FDA. An excellent historical case demonstrating why we need such review is Thalidomide in the late 1950s. It was originally marketed in Europe as a treatment for morning sickness during pregnancy. However, the drug was blocked in the US by an expert reviewer at the FDA, [Dr. Frances Kelsey](https://en.wikipedia.org/wiki/Frances_Oldham_Kelsey) (a UChicago MD/PhD alum!), who was concerned about the lack of evidence for the drug's safety. She was of course right to be concerned, as Thalidomide was shown to cause severe birth defects, leading to its removal from European markets. Suffice it to say, expert review is crucial to patient safety, especially as we dive into this new age of AI/ML in medicine.

The starter notebook for this homework can be downloaded from GitHub:

https://github.com/StevenSong/CMSC-35440-Source/blob/main/hw2/CMSC_35440_HW2_Student_Version.ipynb


## Instructions


1. Download and open the starter notebook. No need for any GPUs for this homework.
1. Download and unzip the data. We've provided gene expression data spanning 2 TCGA projects: `TCGA-LUAD` (lung adenocarcinoma) and `TCGA-LUSC` (lung squamous cell carcinoma). For simplicity, we'll use the project ID to distinguish the cancer type.
  * We've provided the data as a tarball that can be downloaded from [https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw2/hw2.tar.gz](https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw2/hw2.tar.gz).
  * After unzipping the data, there should be a CSV of metadata, a folder of expression TSV files, and the code that was used to originally download the data. You don't need the download code but it can be a good template if you want to pull other data from the NCI GDC.
1. Using the expression data, derive one expression vector per unique patient (given by the `case_id`). We'll treat this as our patient embedding.
  * **Only use the `protein_coding` genes within each expression file.**
  * **Only use the count columns which contain the string `unstranded`.**
  * One challenge of working with real biomedical data is that each patient may contribute a variable number of samples, for example if multiple biopsies are taken. For this homework, we're looking for one patient embedding aggregated from all of the patient's samples. **The exact aggregation method is up to you.**
  * Don't forget to normalize the counts. You can refer to the slides from lecture on popular normalization methods. The normalizations that are precomputed by the GDC are also documented [here](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#mrna-expression-transformation). **The exact normalization method is up to you.**
  * Include in your writeup a brief description and justification of the methods you used for normalization & aggregation and in what order you applied them.
  * Tips:
    * Beware the extra rows present in each count file e.g. `N_unmapped`, `N_multimapping`, etc.
    * Gene expression counts are naturally stored as a matrix where the columns are genes, the rows are individual samples, and the matrix values are the counts. In Python, `scanpy` and `anndata` are packages used to handle and transform such data. You do not have to use either, but they may be useful.
1. Cluster your embeddings into 2 clusters. We're trying to derive a model which can distinguish the different lung cancer types. Simple `KMeans` from `sklearn` is fine for this. Derive cluster IDs for each sample.
1. Use PCA to reduce your embeddings to 2 dimensions.
1. Visualize your embeddings. We're looking for a scatter plot where the color of the points differs by the TCGA project (`LUAD` vs `LUSC`) and the shape of the points differs by the cluster ID assigned by clustering.
1. Consider these questions (you should probably address some of these in your writeup):
  * In the 2D projection of the embeddings, is there a natural decision boundary you would be able to draw to classify the different cancer types?
  * Are there samples which are misclassified by this decision boundary?
  * How close is clustering to this decision boundary?
  * Are there samples which are misclassified by clustering?
  * Are misclassifications by the model (either clustering or the imagined decision boundary) actually mislabeled data? For example, if a sample is labeled as lung adenocarcinoma but the model thinks it's lung squamous cell carcinoma, is the model right or is the label right?
1. Review the misclassifications:
  * We'll use another data modality to double check if the samples are labeled correctly. The imaging data for select TCGA lung cancer cases are available from The Cancer Imaging Archive (TCIA).
  * Lung cancer patients often undergo computed tomography (CT) scans to identify the tumor. Additionally, these CTs are done in parallel to positron emission tomography (PET) scans. PET scans work by introducing a radioactive tracer that is taken up by metabolically active tissues, such as tumors. As a result, **tumors light up brighter white on PET scans**.
  * CT images are slices through the body going from head to toe. To understand the way each image is oriented, imagine you're looking through the feet of a person lying facing up on a table. This means that in each image, the top of the image is the patient's front, the bottom is their back, and the left of the image is the patient's right and vice versa. The images start at the patient's head so as you scroll through the images, you're looking further down the patient's body.
    * It's recommended to use the up/down arrow keys to scroll through the images.
  * For this exercise, we'll provide two cases which were misclassified by our implementation. You're welcome to check cases from your implementation; however, not all cases have paired imaging data available ([LUAD cases](https://nbia.cancerimagingarchive.net/nbia-search/?CollectionCriteria=TCGA-LUAD), [LUSC cases](https://nbia.cancerimagingarchive.net/nbia-search/?CollectionCriteria=TCGA-LUSC)).
    * `TCGA-60-2715`: labeled as `LUSC` but classified by our model as `LUAD`.
      * Scroll through images 109 through 113. Look for the bright white spot in these images on the PET scan.
      * [CT scan](https://nbia.cancerimagingarchive.net/viewer/?study=1.3.6.1.4.1.14519.5.2.1.3023.4012.507148485748821590204034796320&series=1.3.6.1.4.1.14519.5.2.1.3023.4012.313155987490130625808038798781)
      * [PET scan](https://nbia.cancerimagingarchive.net/viewer/?study=1.3.6.1.4.1.14519.5.2.1.3023.4012.507148485748821590204034796320&series=1.3.6.1.4.1.14519.5.2.1.3023.4012.613169434607414222857186346352)
    * `TCGA-50-6590`: labeled as `LUAD` but classified by our model as `LUSC`.
      * Scroll through images 78 through 85. Look for the bright white spot in these images on the PET scan.
      * [CT scan](https://nbia.cancerimagingarchive.net/viewer/?study=1.3.6.1.4.1.14519.5.2.1.6450.9002.125969062420301466106414902377&series=1.3.6.1.4.1.14519.5.2.1.6450.9002.216176897913679442475013148754)
      * [PET scan](https://nbia.cancerimagingarchive.net/viewer/?study=1.3.6.1.4.1.14519.5.2.1.6450.9002.125969062420301466106414902377&series=1.3.6.1.4.1.14519.5.2.1.6450.9002.321022540475237033558410330699)
  * Without much background, it is probably much easier to see the tumor as the brightly lit up white spot on the PET scan. However, it's easier to appreciate finer detail on the CT scan. Try to find the tumor on the CT using the PET scan to get the rough location of the tumor. The image indices are the same on both.
  * Using the anatomical descriptions of lung adenocarcinoma and lung squamous cell carcinoma provided in the intro of this assignment, does it look like the labels for these cases are correct?
1. Write up your work. Your writeup should be 1 to 2 pages long, excluding figures, in 12pt font, single-spaced, with 1 inch margins on letter size paper. Please submit either a PDF or a Word document. Make sure to include the following:
  * Your embedding visualization.
  * A brief justification for the normalization & aggregation method and the order in which you applied them.
  * A discussion of some of the above questions regarding identified misclassifications via visualization and clustering.
  * A discussion of manual review of the misclassified cases.
  * A discussion on the question: Why is review by an expert important? Phrased another way, why should someone with domain knowledge review models?
1. Submit your homework. Make sure to include:
  1. Your writeup containing a figure with your embedding visualization.
  1. Your notebook with your code.
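
The filtering, normalization, and aggregation described in step 3 could be sketched as below. This is a minimal illustration on toy in-memory tables, not the reference solution. The column names `gene_id`, `gene_type`, and `unstranded` are assumptions about the layout of each expression file, the helper names `sample_vector` and `patient_embeddings` are hypothetical, and log-transformed counts per million followed by a per-patient mean is only one of several defensible choices.

```python
import numpy as np
import pandas as pd


def sample_vector(df: pd.DataFrame) -> pd.Series:
    """Return a log-CPM vector for one sample's count table (assumed columns)."""
    # Drop summary rows such as N_unmapped, then keep protein-coding genes only.
    df = df[~df["gene_id"].str.startswith("N_")]
    df = df[df["gene_type"] == "protein_coding"]
    counts = df.set_index("gene_id")["unstranded"].astype(float)
    # Counts per million corrects for sequencing depth; log1p reduces the skew.
    cpm = counts / counts.sum() * 1e6
    return np.log1p(cpm)


def patient_embeddings(samples: dict, sample_to_case: dict) -> pd.DataFrame:
    """Normalize each sample, then average the samples that share a case_id."""
    mat = pd.DataFrame({sid: sample_vector(df) for sid, df in samples.items()}).T
    case_ids = np.array([sample_to_case[sid] for sid in mat.index])
    return mat.groupby(case_ids).mean()


def toy_table(counts):
    """Build a tiny stand-in for one expression file (hypothetical values)."""
    return pd.DataFrame({
        "gene_id": ["N_unmapped", "G1", "G2", "G3"],
        "gene_type": [None, "protein_coding", "lncRNA", "protein_coding"],
        "unstranded": [999] + counts,
    })


samples = {
    "s1": toy_table([10, 5, 30]),
    "s2": toy_table([20, 5, 20]),
    "s3": toy_table([1, 1, 3]),
}
sample_to_case = {"s1": "caseA", "s2": "caseA", "s3": "caseB"}
emb = patient_embeddings(samples, sample_to_case)
print(emb)
```

Normalizing each sample before averaging keeps a deeply sequenced sample from dominating its patient's embedding. Aggregating raw counts first and normalizing afterwards is another option, and the writeup should say which order was used and why.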

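For the clustering, PCA, and misclassification steps, a sketch along the following lines may help. It runs on synthetic embeddings (two well-separated Gaussian groups standing in for real patient vectors), so the numbers say nothing about the actual data. Because `KMeans` cluster IDs are arbitrary, the sketch tries both cluster-to-label mappings and keeps the one that agrees with more labels; that matching rule is an assumption of this example rather than a required method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for patient embeddings: 50 "LUAD" and 50 "LUSC" rows.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 20)),
    rng.normal(5.0, 1.0, size=(50, 20)),
])
labels = np.array(["LUAD"] * 50 + ["LUSC"] * 50)

# Cluster in the full embedding space, then project to 2D for plotting only.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)

# Cluster 0/1 has no inherent meaning, so pick the better of the two mappings.
match = cluster_ids == (labels == "LUSC").astype(int)
if match.mean() < 0.5:
    match = ~match
agreement = match.mean()

# Indices of samples whose cluster disagrees with their project label.
disagree_idx = np.flatnonzero(~match)
print(f"agreement: {agreement:.2f}, disagreements: {len(disagree_idx)}")
```

On real data the samples in `disagree_idx` are the candidates for manual review against the imaging.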

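The requested scatter plot, with color encoding the TCGA project and marker shape encoding the cluster ID, could be drawn roughly as follows. The function name `plot_embedding`, the colors, and the markers are illustrative choices, and the inputs here are random placeholders rather than real embeddings.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding(coords, projects, cluster_ids):
    """Scatter 2D coords: color by project, marker shape by cluster ID."""
    coords = np.asarray(coords)
    projects = np.asarray(projects)
    cluster_ids = np.asarray(cluster_ids)
    colors = {"LUAD": "tab:blue", "LUSC": "tab:orange"}
    markers = {0: "o", 1: "^"}
    fig, ax = plt.subplots(figsize=(6, 5))
    # One scatter call per (project, cluster) pair so each gets a legend entry.
    for project, color in colors.items():
        for cid, marker in markers.items():
            mask = (projects == project) & (cluster_ids == cid)
            if mask.any():
                ax.scatter(coords[mask, 0], coords[mask, 1], color=color,
                           marker=marker, alpha=0.7,
                           label=f"{project}, cluster {cid}")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
    return fig, ax


rng = np.random.default_rng(0)
coords = rng.normal(size=(40, 2))               # placeholder 2D projection
projects = np.array(["LUAD"] * 20 + ["LUSC"] * 20)
cluster_ids = np.tile([0, 1], 20)               # placeholder cluster IDs
fig, ax = plot_embedding(coords, projects, cluster_ids)
fig.savefig("embedding.png", dpi=150)
```

Points whose color and shape disagree with the majority pattern of their neighbors are the ones worth discussing in the writeup.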
## Code

In [None]:
!pip install scanpy
!wget https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw2/hw2.tar.gz
!tar -xzf hw2.tar.gz
