# Intro

### 1. What this dataset is:

This dataset contains scRNA-seq measurements taken from entorhinal cortex samples from control and Alzheimer’s disease brains (`n=6` per group).
Cells are labeled with cell type information as well as patient condition (i.e., healthy or with Alzheimer's).


### 2. Where the data comes from:

This data was collected as part of [A single-cell atlas of entorhinal cortex from individuals with Alzheimer’s disease reveals cell-type-specific gene expression regulation (Grubman et al., 2019)](https://www.nature.com/articles/s41593-019-0539-4) and was deposited in the NIH Gene Expression Omnibus (GEO) under accession `GSE138852`.

### 3. Why this data might be useful:

This data may help identify which genes and pathways are involved in Alzheimer's disease. As of this writing, there do not appear to be any computational papers making use of this data.

### First, we download the data from the NIH Gene Expression Omnibus

In [None]:
import requests

print('Downloading data')

data_url = 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE138852&format=file&file=GSE138852%5Fcounts%2Ecsv%2Egz'
# Note: the two metadata URLs below appear to be tied to a browser session on the
# authors' site. They are not fetched by this notebook; the metadata files are
# downloaded manually in the next step.
metadata_url = 'http://adsn.ddnetbio.com/session/cd8292b94d697e88ec03bb7ca3dc6bda/download/scRNA_metadata?w='
metadata_description_url = 'http://adsn.ddnetbio.com/session/cd8292b94d697e88ec03bb7ca3dc6bda/download/scRNA_metaDesc?w='

compressed_data_file_name = './GSE138852_counts.csv.gz'
metadata_file_name = './scRNA_metadata.tsv'
metadata_description_file_name = './scRNA_metadata_description.tsv'

r = requests.get(data_url)
r.raise_for_status()  # Stop here if the download failed, rather than writing an error page to disk
with open(compressed_data_file_name, 'wb') as f:
    f.write(r.content)

print("Data successfully written to disk")
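As an optional sanity check, the first few lines of the compressed file can be inspected without loading the whole matrix into memory. The sketch below uses only the standard library. The helper name `peek_gzip_lines` is hypothetical (it is not part of the original notebook), and it assumes the download is a gzip-compressed text file.

```python
import gzip
from itertools import islice


def peek_gzip_lines(path, n_lines=3, max_chars=80):
    """Return the first `n_lines` lines of a gzip-compressed text file.

    Each line is truncated to `max_chars` characters, since the header row
    of a counts matrix can contain thousands of cell barcodes.
    Hypothetical helper for illustration only.
    """
    with gzip.open(path, 'rt') as f:
        return [line.rstrip('\n')[:max_chars] for line in islice(f, n_lines)]


# Example usage (assumes the download above has completed):
# for line in peek_gzip_lines('./GSE138852_counts.csv.gz'):
#     print(line)
```

If the output looks like HTML rather than comma-separated values, the download most likely returned an error page instead of the data file.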

### Next, we'll grab the metadata from the authors' website.

Unfortunately, the authors' website makes it difficult to download these files programmatically (possibly to prevent their servers from being spammed). Before we proceed, please download the files manually: go to http://adsn.ddnetbio.com/, click the "Data Download" tab and then the "Metadata and Gene Expression" tab, and hit the "Download single-cell Metadata" and "Description for single-cell metadata" buttons.

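Once the files are downloaded, place them in the same directory as this notebook, make sure they are named `scRNA_metadata.tsv` and `scRNA_metadata_description.tsv` (the names the code in this notebook expects), and then continue.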
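Since this step is manual, a quick check that the files are in place can save a confusing error later. The following is a minimal sketch: `missing_files` is a hypothetical helper, and the file names are the ones assigned to `metadata_file_name` and `metadata_description_file_name` earlier in this notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]


# File names as defined earlier in this notebook.
expected = ['./scRNA_metadata.tsv', './scRNA_metadata_description.tsv']
missing = missing_files(expected)
if missing:
    print('Still missing:', ', '.join(missing))
else:
    print('All metadata files found')
```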

### Next, we read in the files and begin preprocessing

(This may take a couple of minutes, since the files are fairly large.)

In [None]:
import pandas as pd

# pandas infers gzip compression from the '.gz' file extension
data_df = pd.read_csv(compressed_data_file_name, index_col=0)

# The metadata files are tab-separated
metadata_df = pd.read_csv(metadata_file_name, index_col=0, sep='\t')
metadata_description_df = pd.read_csv(metadata_description_file_name, index_col=0, sep='\t')

The data was originally stored with each gene as a row and each cell as a column. We'll transpose our data matrix into the more standard arrangement, with samples (cells) as rows and features (genes) as columns.

In [None]:
data_df = data_df.transpose()

We'll also take a quick look at our metadata to see what information we're given. The strings of letters on the left are cell barcodes, short DNA sequences used to identify a specific cell. Feel free to ignore these strings and relabel the rows however you see fit.

In [None]:
metadata_df.head()

Now, what exactly do the column names here mean? For that, we'll turn to the metadata description file provided by the authors.

In [None]:
metadata_description_df
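Because the expression matrix and the metadata come from separate files, it may be worth checking that they describe the same cells in the same order before combining them. The sketch below compares two sequences of barcodes. `compare_barcodes` is a hypothetical helper, the barcodes in the toy example are made up, and the commented call assumes `data_df` has already been transposed so that its index holds the cell barcodes.

```python
def compare_barcodes(data_barcodes, metadata_barcodes):
    """Compare two barcode sequences (hypothetical helper).

    Returns a tuple (only_in_data, only_in_metadata, same_order), where
    the first two entries are sorted lists of unmatched barcodes and
    `same_order` is True only if both sequences are identical.
    """
    data_barcodes = list(data_barcodes)
    metadata_barcodes = list(metadata_barcodes)
    only_in_data = sorted(set(data_barcodes) - set(metadata_barcodes))
    only_in_metadata = sorted(set(metadata_barcodes) - set(data_barcodes))
    same_order = data_barcodes == metadata_barcodes
    return only_in_data, only_in_metadata, same_order


# Toy example with made-up barcodes:
print(compare_barcodes(['AAAC', 'AAAG', 'AACT'], ['AAAG', 'AAAC', 'AAGG']))

# With this notebook's objects, one would call:
# compare_barcodes(data_df.index, metadata_df.index)
```

If both lists of unmatched barcodes are empty but `same_order` is `False`, reordering the metadata to match the data (for example with `metadata_df.loc[data_df.index]`) is one way to align the two tables.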

Now we'll perform some standard preprocessing steps on our scRNA-seq data. We'll normalize the data so that counts are comparable across cells (scaling each cell to 10,000 total counts), log-transform the normalized counts, and then select the 2,000 most variable genes. To do so, we'll use functions from `scanpy`, a popular Python library for handling scRNA-seq data.

In [None]:
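import scanpy as sc
from anndata import AnnData

# AnnData (an annotated data matrix) is the container class used by most scanpy functions.
# Note: this pairs rows of data_df with rows of metadata_df by position, so it assumes
# both tables list the cells in the same order.
adata = AnnData(X=data_df.values, obs=metadata_df)
adata.var_names = data_df.columns  # Keep the gene names attached to the columns
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

adata = adata[:, adata.var.highly_variable].copy()  # Keep only the highly variable genes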
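To make the normalization and log-transform steps above concrete, here is a small NumPy sketch on a made-up count matrix. It is an illustration of the arithmetic only, not `scanpy`'s implementation, and it does not cover the selection of highly variable genes.

```python
import numpy as np

# Toy count matrix: 2 cells (rows) x 3 genes (columns).
# The second cell was sequenced 10x deeper but has the same composition.
counts = np.array([[1.0, 3.0, 6.0],
                   [10.0, 30.0, 60.0]])

target_sum = 1e4  # same target as in the scanpy call above

# Scale each cell so that its counts sum to `target_sum`.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# Log-transform with log(1 + x), which maps zero counts to zero.
logged = np.log1p(normalized)

print(normalized)
print(logged)
```

After normalization the two cells have identical profiles, which is the point of the step: differences in sequencing depth no longer masquerade as differences in expression.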

To confirm that our preprocessed data looks reasonable before saving it, we'll use the UMAP algorithm to visualize it in 2D.

In [None]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['batchCond', 'cellType'], wspace=0.3)

Our UMAP plots look sensible (e.g. we see good separation between cell types and disease state), so we'll proceed with saving the final version.

In [None]:
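# Carry AnnData's row and column labels over so the saved file is not labeled only by position
df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)

df.to_csv('./preprocessed_data.csv')
metadata_df.to_csv('./metadata.csv')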