# Setup of the AnnData object
**Author:** [Severin Dicks](https://github.com/Intron7) (IBSM Freiburg)

This notebook downloads and sets up the [AnnData](https://anndata.readthedocs.io/en/latest/index.html) object we will be working with. In this example workflow we'll be looking at a dataset from [Qian et al., Cell Research 2020](https://www.nature.com/articles/s41422-020-0355-0).

In [1]:
import wget
import scanpy as sc
import os
import tarfile
import pandas as pd

First we download the count matrix and metadata file from the Lambrechts lab website.

In [2]:
count_file = './data/LC_counts.tar.gz'
if not os.path.exists(count_file):
    os.makedirs("./data",exist_ok=True)
    wget.download("http://blueprint.lambrechtslab.org/download/LC_counts.tar.gz", out="./data")
    wget.download("http://blueprint.lambrechtslab.org/download/LC_metadata.csv.gz", out="./data")
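The cell above only tests for the count archive, so the metadata file is not fetched again if it alone is missing. A minimal sketch of a per-file check follows. The helper names `missing_downloads` and `download_missing` are hypothetical, and `urllib.request.urlretrieve` from the standard library is swapped in for `wget.download`.

```python
import os
import urllib.request

URLS = [
    "http://blueprint.lambrechtslab.org/download/LC_counts.tar.gz",
    "http://blueprint.lambrechtslab.org/download/LC_metadata.csv.gz",
]

def missing_downloads(urls, out_dir):
    """Return (url, target_path) pairs whose target file does not exist yet."""
    pairs = []
    for url in urls:
        target = os.path.join(out_dir, os.path.basename(url))
        if not os.path.exists(target):
            pairs.append((url, target))
    return pairs

def download_missing(urls, out_dir):
    """Download every file in `urls` that is not already present in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    for url, target in missing_downloads(urls, out_dir):
        # Standard-library stand-in for wget.download (an assumption of this sketch)
        urllib.request.urlretrieve(url, target)
```

Calling `download_missing(URLS, "./data")` would then fetch only the files that are not on disk yet.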

We then decompress the data.

In [3]:
# The context manager closes the archive even if extraction fails
with tarfile.open(count_file, "r:gz") as tar:
    tar.extractall("./data")
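The path passed to `sc.read_10x_mtx` later depends on the folder layout inside the archive. If that layout is in doubt, the member names can be listed without extracting anything. This is a small sketch; `list_archive` is a hypothetical helper name.

```python
import tarfile

def list_archive(path):
    """Return the member names of a gzip-compressed tar archive without extracting it."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()
```

For example, `list_archive("./data/LC_counts.tar.gz")` returns the paths stored in the downloaded archive.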

Now we can start creating our AnnData object with scanpy (https://scanpy.readthedocs.io/en/stable/index.html).

In [4]:
adata = sc.read_10x_mtx("./data/export/LC_counts/")

Next we have to append the metadata to `adata.obs`.

In [5]:
obs_df = pd.read_csv("./data/LC_metadata.csv.gz",compression="gzip", index_col=0)

In [6]:
obs_df

| Cell | nGene | nUMI | CellFromTumor | PatientNumber | TumorType | TumorSite | CellType |
|---|---|---|---|---|---|---|---|
| BT1238_AAATCAACTGCCTC | 897 | 3227 | True | 1 | Lung | I | Cancer |
| BT1238_AACATTGACCTAAG | 509 | 731 | True | 1 | Lung | I | Cancer |
| BT1238_AACCAGTGCTTAGG | 642 | 2958 | True | 1 | Lung | I | Myeloid |
| BT1238_AACCTACTCGCTAA | 925 | 2781 | True | 1 | Lung | I | T_cell |
| BT1238_AACTCTTGCTGTAG | 713 | 3000 | True | 1 | Lung | I | T_cell |
| ... | ... | ... | ... | ... | ... | ... | ... |
| scrBT1432_TTTGGTTCATTCTCAT | 1419 | 5192 | True | 8 | Lung | I | T_cell |
| scrBT1432_TTTGGTTGTTGGTGGA | 398 | 585 | True | 8 | Lung | I | T_cell |
| scrBT1432_TTTGTCACACATGTGT | 625 | 1760 | True | 8 | Lung | I | T_cell |
| scrBT1432_TTTGTCAGTACGAAAT | 284 | 491 | True | 8 | Lung | I | Myeloid |


In this case `adata.obs` and the metadata in `obs_df` have the same number of cells, and the cell barcodes are in the same order. We can therefore simply replace `.obs` with `obs_df`.

In [7]:
adata.obs = obs_df
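Replacing `.obs` directly relies on the barcodes matching in both content and order. For datasets where that is not guaranteed, a check along the following lines may help. It is a sketch that assumes unique barcodes, and `align_metadata` is a hypothetical helper, not part of scanpy or anndata.

```python
import pandas as pd

def align_metadata(obs_names, obs_df):
    """Return obs_df with its rows in the order given by obs_names.

    Raises ValueError if the two sets of barcodes differ.
    Assumes barcodes are unique.
    """
    obs_names = pd.Index(obs_names)
    if obs_names.equals(obs_df.index):
        return obs_df  # same barcodes in the same order, nothing to do
    if len(obs_names) != len(obs_df) or set(obs_names) != set(obs_df.index):
        raise ValueError("barcodes in the count matrix and the metadata differ")
    # Same barcodes, different order: reorder the metadata rows
    return obs_df.loc[obs_names]
```

With the objects from this notebook, the assignment would read `adata.obs = align_metadata(adata.obs_names, obs_df)`.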

`PatientNumber` is a categorical label, not a numerical value, so we have to change its type. scanpy does not always handle integers as categories well, so we convert the column to `str`.

In [8]:
adata.obs["PatientNumber"] = adata.obs["PatientNumber"].astype(str)
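To see what the conversion does, here is a small sketch on a toy DataFrame that stands in for `adata.obs` (the values are made up for illustration). It also shows the optional extra step of casting to a pandas categorical by hand.

```python
import pandas as pd

# Toy stand-in for adata.obs; the values are illustrative only
obs = pd.DataFrame({"PatientNumber": [1, 1, 8]})

# Integers -> strings -> categorical labels
obs["PatientNumber"] = obs["PatientNumber"].astype(str).astype("category")

print(obs["PatientNumber"].dtype)
print(list(obs["PatientNumber"].cat.categories))
```

The categories are now the strings `"1"` and `"8"` rather than integers, so they are treated as labels instead of numbers.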

When the `adata` object is saved, string-based columns in `.obs` are converted to categorical data.

In [9]:
os.makedirs("./h5",exist_ok=True)
adata.write("./h5/adata.raw.h5ad")

... storing 'PatientNumber' as categorical
... storing 'TumorType' as categorical
... storing 'TumorSite' as categorical
... storing 'CellType' as categorical


If you want, you can now delete the `./data` folder, since we won't need it anymore.
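The cleanup can also be done from within the notebook. A minimal sketch using the standard library follows; `remove_data_dir` is a hypothetical helper name.

```python
import os
import shutil

def remove_data_dir(path="./data"):
    """Delete the download directory and everything in it, if it exists."""
    if os.path.isdir(path):
        shutil.rmtree(path)
```

Calling `remove_data_dir()` deletes `./data` permanently, so make sure `./h5/adata.raw.h5ad` was written successfully first.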