# Intro

### 1. What this dataset is:

This dataset contains scRNA-seq measurements from peripheral blood mononuclear cells (PBMCs) from 8 samples from COVID-19 patients and 6 healthy controls.


### 2. Where the data comes from:

This data was collected as part of [A single-cell atlas of the peripheral immune response in patients with severe COVID-19 (Wilk et al., 2020)](https://www.nature.com/articles/s41591-020-0944-y) and was deposited in the NCBI Gene Expression Omnibus (GEO) under accession `GSE150728`.

### 3. Why this data might be useful:

This data could help identify which genes and pathways are involved in the peripheral immune response to COVID-19.

### First, we download the data from the Chan-Zuckerberg Biohub

The authors posted a richly annotated version of their data to the Chan-Zuckerberg Biohub's scRNA-seq data repository. Unfortunately, the CZ Biohub repository doesn't allow for automated downloads; please head to [this link](https://cellxgene.cziscience.com/collections/a72afd53-ab92-4511-88da-252fb0e26b9a), hit the download button (the cloud near the bottom right), choose the `h5ad` option, and place the resulting file in the same directory as this notebook before continuing. The code below expects the file to be named `local.h5ad`, so rename it if necessary.
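Before moving on, it can help to confirm that the downloaded file actually landed next to the notebook. The snippet below is a minimal sketch using only the standard library; the helper name `find_h5ad_files` is made up for illustration and is not part of `scanpy`.

```python
from pathlib import Path


def find_h5ad_files(directory="."):
    """Return the .h5ad files in `directory`, sorted by name (hypothetical helper)."""
    return sorted(Path(directory).glob("*.h5ad"))


candidates = find_h5ad_files(".")
if candidates:
    print("Found:", [p.name for p in candidates])
else:
    print("No .h5ad file found; download it from the link above first.")
```

If the file shows up under a different name, either rename it or adjust the path used when reading it in.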

### Next, we read in the files and begin preprocessing

To read our data, we'll use the [scanpy](https://scanpy.readthedocs.io/en/stable/) library, a popular Python library for interacting with and preprocessing scRNA-seq data.

In [None]:
import scanpy as sc

file_name = './local.h5ad'
adata = sc.read_h5ad(file_name)

In the resulting `AnnData` object, each row is a cell and each column is a gene.

Now let's take a quick look at our metadata to see what information we're given. To do so, we'll look at the `obs` field of our `AnnData` object. The dataframe is quite wide, so you might need to scroll to the right to see all the columns. The strings of letters on the left are labels for each cell - feel free to ignore these strings and relabel the rows however you see fit. Finally, if you have any questions about what the metadata columns mean, please refer to the paper linked above or reach out to the TAs.

In [None]:
adata.obs.head()

Now we'll perform some standard preprocessing steps on our scRNA-seq data, again using functions from `scanpy`. We'll normalize the data so that count numbers are comparable across cells, log-transform the normalized counts, and then select the 2,000 most variable genes.

In [None]:
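import scanpy as sc

# Scale each cell's counts to a total of 10,000 so cells are comparable
sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform the normalized counts: log(1 + x)
sc.pp.log1p(adata)
# Flag the 2,000 most variable genes in adata.var.highly_variable
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Subset to the highly variable genes; .copy() gives a standalone object rather than a view
adata = adata[:, adata.var.highly_variable].copy()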
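To make the normalization and log-transform steps concrete, here is a small NumPy sketch of the arithmetic on a toy count matrix. It only illustrates the idea (scale each cell to a fixed total, then take `log(1 + x)`) and is not `scanpy`'s actual implementation; the function name `normalize_and_log` and the toy counts are made up for this example.

```python
import numpy as np


def normalize_and_log(counts, target_sum=1e4):
    """Scale each row (cell) to sum to `target_sum`, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave all-zero cells as zeros instead of dividing by zero
    scaled = counts / totals * target_sum
    return np.log1p(scaled)


# Two toy cells with the same gene proportions but a 10x difference in depth
toy_counts = np.array([[1, 3, 0],
                       [10, 30, 0]])
toy_result = normalize_and_log(toy_counts)
print(toy_result)
```

Both toy cells end up with identical values after normalization, which is the point of this step: differences in sequencing depth between cells no longer dominate the comparison.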

To confirm that our preprocessed data looks reasonable before saving it, we'll use the UMAP algorithm to visualize it in 2D.

In [None]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['disease', 'cell_type'], wspace=0.3)

Our UMAP plots look sensible (e.g. we see good separation between cell types), so we'll proceed with saving the final version.

In [None]:
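# to_df() converts adata.X to a dense DataFrame (even if it is stored as a sparse matrix)
# and keeps the cell IDs as the index and the gene names as the columns
df = adata.to_df()
metadata_df = adata.obs

df.to_csv('./preprocessed_data.csv')
metadata_df.to_csv('./metadata.csv')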
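Because `to_csv` writes the row index as the first column of the file, pass `index_col=0` when reading the saved files back with `pandas`. The sketch below shows the round trip on a made-up toy table written to a temporary directory; the file name `toy_data.csv` and the values are placeholders, not the real outputs.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the expression matrix (cells x genes); values are made up
toy = pd.DataFrame(
    [[0.5, 0.0], [1.25, 2.0]],
    index=["cell_a", "cell_b"],
    columns=["gene_1", "gene_2"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_data.csv"
    toy.to_csv(csv_path)
    # index_col=0 restores the first CSV column as the row index
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
```

The same call pattern should apply to `preprocessed_data.csv` and `metadata.csv`.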