# Intro

### 1. What this dataset is:

This dataset contains scRNA-seq gene expression measurements of epithelial cells from the intestines of mice. Each cell belongs to one of four conditions: healthy control, infected with _Heligmosomoides polygyrus_ (H. poly) and measured after 3 days, infected with H. poly and measured after 10 days, or infected with _Salmonella_. Each cell is also labelled with its cell type as determined by a human domain expert.


### 2. Where the data comes from:

This data was collected as part of [A single-cell survey of the small intestinal epithelium (Haber et al., 2017)](https://www.nature.com/articles/nature24489) and was deposited in the NIH Gene Expression Omnibus (GEO) under accession `GSE92332`.

### 3. Why this data might be useful:

This data has previously been used in papers trying to predict "out of sample" responses to infection; that is, predicting how a held-out cell type may respond to an infection given data from other cell types. Some of these papers include:

* ["Conditional out-of-sample generation for un-paired data using trVAE"](https://arxiv.org/abs/1910.01791) (Lotfollahi et al., 2019)
* ["scGen predicts single-cell perturbation responses"](https://www.nature.com/articles/s41592-019-0494-8) (Lotfollahi et al., 2019)

### First, we download the data from the NIH Gene Expression Omnibus and write it to disk.

In [None]:
import requests

print('Downloading compressed data')

url = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92332/suppl/GSE92332%5FSalmHelm%5FUMIcounts%2Etxt%2Egz'
r = requests.get(url)
r.raise_for_status()  # Stop here if the download failed, rather than writing an error page to disk
compressed_file_name = './GSE92332_SalmHelm_UMIcounts.txt.gz'

with open(compressed_file_name, 'wb') as f:
    f.write(r.content)
    
print("Data successfully written to disk")

### Next, we read in the file and begin preprocessing

In [None]:
import gzip
import pandas as pd

with gzip.open(compressed_file_name, 'rb') as f:
    df = pd.read_csv(f, sep='\t')

The data was originally stored with each gene as a row and each cell as a column. We'll transpose the data matrix into the more standard arrangement, with samples (cells) as rows and features (genes) as columns.

In [None]:
df = df.transpose()

Next, we'll extract metadata from each cell's name. First, we'll display a sample of cell names.

In [None]:
df.head()

We can see that these names contain a number of useful pieces of metadata (e.g. cell type) separated by the `_` character. We'll extract that metadata here.

In [None]:
cell_groups = []
barcodes = []
conditions = []
cell_types = []

for cell in df.index:
    cell_group, barcode, condition, cell_type = cell.split('_')
    cell_groups.append(cell_group)
    barcodes.append(barcode)
    conditions.append(condition)
    cell_types.append(cell_type)
    
metadata_df = pd.DataFrame({'cell_group': cell_groups, 'barcode': barcodes, 'condition': conditions, 'cell_type': cell_types})
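
To make the naming scheme concrete, here is a minimal sketch of the same split applied to a single name. The example name and the helper `parse_cell_name` are hypothetical illustrations, not taken from the dataset; the sketch assumes every name has exactly four `_`-separated fields, and raises an error otherwise instead of silently misassigning fields.

```python
# Hypothetical cell name in the <group>_<barcode>_<condition>_<cell type> layout
example_name = "B1_AAACATACCACAAC_Control_Enterocyte.Progenitor"


def parse_cell_name(name):
    """Split a cell name into its four underscore-separated metadata fields."""
    parts = name.split("_")
    if len(parts) != 4:
        # Fail loudly if a name does not follow the assumed four-field layout
        raise ValueError(f"Expected 4 fields in {name!r}, found {len(parts)}")
    cell_group, barcode, condition, cell_type = parts
    return {
        "cell_group": cell_group,
        "barcode": barcode,
        "condition": condition,
        "cell_type": cell_type,
    }


parsed = parse_cell_name(example_name)
print(parsed)
```

Note that the cell type in this example contains a `.` but no `_`, which is why splitting on `_` leaves it intact.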

Finally, we'll perform some standard preprocessing steps on our scRNA-seq data. We'll normalize each cell to a total of 10,000 counts so that expression values are comparable across cells, log-transform the normalized counts, and then flag the 2,000 most variable genes. To do so, we'll use functions from `scanpy`, a popular Python library for handling scRNA-seq data.

In [None]:
import scanpy as sc
from anndata import AnnData

# AnnData ("annotated data") is the container class that most scanpy functions operate on.
adata = AnnData(X=df.values, obs=metadata_df)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
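
For intuition, the arithmetic behind the normalization and log-transform steps can be sketched in plain Python on a toy count matrix. This is only an illustration of the idea, not scanpy's implementation; the function `normalize_and_log` and the toy counts are made up for this example.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    result = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts are left as all zeros
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(count * scale) for count in cell])
    return result


# Two toy cells with the same relative expression but a 10x difference in depth
toy_counts = [
    [1, 3, 0, 6],
    [10, 30, 0, 60],
]
toy_log = normalize_and_log(toy_counts)

for row in toy_log:
    print([round(value, 3) for value in row])
```

After normalization both toy cells have identical values, which is the point of the step: differences in sequencing depth no longer dominate comparisons between cells.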

To confirm that our preprocessed data looks reasonable before saving it, we'll use the UMAP algorithm to visualize it in 2D.

In [None]:
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['cell_type', 'condition'], wspace=0.3)

Our UMAP plots look sensible (e.g. cells separate clearly by cell type and by condition), so we'll proceed with saving the final version.

In [None]:
df = pd.DataFrame(adata.X[:, adata.var['highly_variable']]) # Extract values for highly variable genes only

df.to_csv('./preprocessed_data.csv')
metadata_df.to_csv('./metadata.csv')
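
When loading these files later, keep in mind that `to_csv` writes the row index as the first column by default, so it should be read back with `index_col=0`. The sketch below shows the round trip on small stand-in tables written to a temporary directory; the values and labels are made up for illustration and the real files are not touched.

```python
import os
import tempfile

import pandas as pd

# Small stand-ins for the preprocessed expression matrix and its metadata
toy_data = pd.DataFrame([[0.0, 1.2], [3.4, 0.0]])
toy_meta = pd.DataFrame(
    {"condition": ["Control", "Salmonella"], "cell_type": ["Stem", "Goblet"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "preprocessed_data.csv")
    meta_path = os.path.join(tmp_dir, "metadata.csv")

    # Same calls as above: the index is written as the first CSV column
    toy_data.to_csv(data_path)
    toy_meta.to_csv(meta_path)

    # index_col=0 restores that column as the index instead of as a feature
    loaded_data = pd.read_csv(data_path, index_col=0)
    loaded_meta = pd.read_csv(meta_path, index_col=0)

# Rows of the two tables correspond to the same cells, in the same order
same_rows = loaded_data.index.equals(loaded_meta.index)
print(loaded_data.shape, same_rows)
```

Without `index_col=0`, the saved index would be read back as an extra unnamed column and could be mistaken for a gene.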