# Pre-processing (Re-annotation)

![dandelion_logo](img/dandelion_logo_illustration.png)

## Foreword

***dandelion*** is written in `python=3.7.6` and is primarily a single-cell BCR-seq analysis package. It makes use of some tools from the fantastic [*immcantation suite*](https://immcantation.readthedocs.io/) [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677) and implements a workflow to streamline the pre-processing and exploratory stages of analyzing single-cell BCR-seq data from 10X Genomics. Post-processed data from ***dandelion*** can be smoothly transferred to [*scanpy*](https://scanpy.readthedocs.io/)/[*AnnData*](https://anndata.readthedocs.io/) [[Wolf18]](https://doi.org/10.1186/s13059-017-1382-0) objects for integration and exploration of BCR-seq and RNA-seq data. I hope to introduce some new single-cell BCR-seq exploratory tools down the road through ***dandelion***.

This section covers the initial pre-processing of the files produced by 10X's `Cell Ranger vdj` immune profiling pipeline, performed **manually**. As mentioned, there is now a [singularity container](https://sc-dandelion.readthedocs.io/en/latest/notebooks/singularity_preprocessing.html) that can automate the first few steps outlined below.


We will download the 10X data sets to process for this tutorial:
```bash
# create sub-folders
mkdir -p dandelion_tutorial/vdj_nextgem_hs_pbmc3
mkdir -p dandelion_tutorial/vdj_v1_hs_pbmc3
mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_10k
mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_1k

# change into each directory and download the necessary files
cd dandelion_tutorial/vdj_v1_hs_pbmc3;
wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.h5;
wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_b_filtered_contig_annotations.csv;
wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_b_filtered_contig.fasta;

cd ../vdj_nextgem_hs_pbmc3
wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_filtered_feature_bc_matrix.h5;
wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_b_filtered_contig_annotations.csv;
wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_b_filtered_contig.fasta;

cd ../sc5p_v2_hs_PBMC_10k;
wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5;
wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig_annotations.csv;
wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig.fasta;

cd ../sc5p_v2_hs_PBMC_1k;
wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_filtered_feature_bc_matrix.h5;
wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_b_filtered_contig_annotations.csv;
wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_b_filtered_contig.fasta;
```

***dandelion***'s reannotation workflow requires the Cell Ranger fasta and annotation files to start: either *all_contig.fasta* or *filtered_contig.fasta*, together with the corresponding *all_contig_annotations.csv* or *filtered_contig_annotations.csv*.
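Before starting, it can help to confirm that each sample folder actually contains both files. The snippet below is a minimal sketch, not part of ***dandelion***: `missing_inputs` is a hypothetical helper, and the folder and file names are assumed to match the `wget -O` targets used in the download step above.

```python
# python
from pathlib import Path

# Assumption: file names match the `wget -O` targets from the download step.
REQUIRED_FILES = ("filtered_contig.fasta", "filtered_contig_annotations.csv")


def missing_inputs(sample_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from sample_dir."""
    sample_dir = Path(sample_dir)
    return [name for name in required if not (sample_dir / name).is_file()]


# Sample folders created earlier in this tutorial.
samples = [
    "vdj_v1_hs_pbmc3",
    "vdj_nextgem_hs_pbmc3",
    "sc5p_v2_hs_PBMC_10k",
    "sc5p_v2_hs_PBMC_1k",
]
for sample in samples:
    missing = missing_inputs(Path("dandelion_tutorial") / sample)
    print(sample, "OK" if not missing else f"missing: {missing}")
```

If you work with the *all_contig* files instead, pass their names through the `required` argument.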

I'm running everything with the *filtered_contig* files as a standard analysis setup. I'm using a standard laptop for the analysis here: an entry-level 2017 MacBook Pro with a 2.3 GHz Intel Core i5 processor and 16 GB of 2133 MHz LPDDR3 RAM.

If you followed the installation instructions, you should already have the requisite auxiliary software installed. Otherwise, you can download it manually: [blast+](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and [igblast](https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST/). For tigger-genotype, you can download it [here](https://bitbucket.org/kleinstein/immcantation/src/default/pipelines/). Just note that I made some minor modifications to this file, hence there is a version that comes with this package.

For convenience, in ***shell***, export the paths to the database folders as follows:
```bash
# bash/shell
echo "export GERMLINE=/Users/kt16/Documents/Github/dandelion/database/germlines/" >> ~/.bash_profile
echo "export IGDATA=/Users/kt16/Documents/Github/dandelion/database/igblast/" >> ~/.bash_profile
echo "export BLASTDB=/Users/kt16/Documents/Github/dandelion/database/blast/" >> ~/.bash_profile
# reload
source ~/.bash_profile
```
The databases for igblast are basically set up using [changeo's instructions](https://changeo.readthedocs.io/en/stable/examples/igblast.html).

If you are using a Jupyter notebook initialized via a JupyterHub instance, you might want to try the fix for a known issue where the paths require some adjustment: <https://github.com/zktuong/dandelion/issues/66>.
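Independently of that issue, a quick way to confirm that the Python session can see the three exported variables is to inspect `os.environ`. This is a generic sketch rather than the fix described in the linked issue; `check_database_paths` is a hypothetical helper and only uses the standard library.

```python
# python
import os


def check_database_paths(names=("GERMLINE", "IGDATA", "BLASTDB"), env=None):
    """Map each variable name to its value, or None if unset or not a directory."""
    env = os.environ if env is None else env
    status = {}
    for name in names:
        value = env.get(name)
        status[name] = value if value and os.path.isdir(value) else None
    return status


for name, path in check_database_paths().items():
    print(f"{name}: {path if path else 'not set or not a directory'}")

# If a variable is missing, it can be set for the current session only, e.g.:
# os.environ["IGDATA"] = "/path/to/dandelion/database/igblast/"
```

Setting a variable through `os.environ` only affects the running kernel and anything it launches; it does not modify `~/.bash_profile`.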

For reannotation of constant genes, reference fasta files were downloaded from IMGT and only sequences corresponding to the *CH1* region of each constant gene/allele were retained. The headers were trimmed to keep only the gene and allele information. The sequences can be found here: [***human***](http://www.imgt.org/genedb/GENElect?query=7.2+IGHC&species=Homo+sapiens) and [***mouse***](http://www.imgt.org/genedb/GENElect?query=7.2+IGHC&species=Mus).
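To illustrate the header trimming, here is a minimal sketch. It is not the script used to build the bundled database. It assumes IMGT-style headers are pipe-delimited with the gene and allele in the second field, and the example header uses a placeholder accession and sequence.

```python
# python
def trim_imgt_header(header):
    """Keep only the gene*allele field from an IMGT-style FASTA header."""
    fields = header.lstrip(">").strip().split("|")
    # Assumption: the second pipe-delimited field holds gene and allele.
    # Headers without a pipe are returned unchanged.
    return ">" + (fields[1] if len(fields) > 1 else fields[0])


def trim_fasta_headers(lines):
    """Yield FASTA lines with trimmed headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield trim_imgt_header(line) if line.startswith(">") else line


# Placeholder record for illustration only (not a real accession or sequence).
example = [
    ">ACC00000|IGHM*01|Homo sapiens|F|CH1|placeholder",
    "acgtacgtacgt",
]
print("\n".join(trim_fasta_headers(example)))
```

Keeping the header down to the gene and allele means the BLAST hit name can be used directly as the constant gene call.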

The utility function `utl.makeblastdb` is a wrapper for:

```bash
# bash/shell
makeblastdb -dbtype nucl -parse_seqids -in $BLASTDB/human/human_BCR_C.fasta
```

So effectively, this does the same thing:
```python
# python
ddl.utl.makeblastdb('/Users/kt16/Documents/Github/dandelion/database/blast/human/human_BCR_C.fasta')
```
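For readers curious what such a wrapper might look like, the sketch below builds the same command with the standard library. It is an illustration under my own assumptions, not ***dandelion***'s actual implementation; `build_makeblastdb_cmd` and `run_makeblastdb` are hypothetical names.

```python
# python
import shutil
import subprocess


def build_makeblastdb_cmd(fasta, dbtype="nucl"):
    """Return the makeblastdb command shown above as an argument list."""
    return ["makeblastdb", "-dbtype", dbtype, "-parse_seqids", "-in", str(fasta)]


def run_makeblastdb(fasta):
    """Run makeblastdb on a fasta file, failing early if the binary is missing."""
    if shutil.which("makeblastdb") is None:
        raise FileNotFoundError("makeblastdb not found on PATH; install blast+ first")
    # check=True raises CalledProcessError if makeblastdb exits with an error.
    subprocess.run(build_makeblastdb_cmd(fasta), check=True)


# Print the command without executing it.
print(" ".join(build_makeblastdb_cmd("human_BCR_C.fasta")))
```

Passing the command as a list avoids shell quoting problems when the database path contains spaces.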

If you have cloned the repository from dandelion's GitHub, you should have all the databases ready to go and will not need to run `makeblastdb`.

This section will now demonstrate how I batch process multiple samples/files from the same donor, as it will become important later on.