# Reading 10X Cell Ranger output directly

If you have decided to skip the reannotation/preprocessing steps, you can read the files directly from the Cell Ranger output folder with `Dandelion`'s `ddl.read_10x_vdj`, which accepts the `*_contig_annotations.csv` or `all_contig_annotations.json` file(s) as input. When reading the `.csv` file, if the matching `.fasta` and/or `.json` file(s) are in the same folder, `ddl.read_10x_vdj` will try to extract additional information that is not in the `.csv` file, e.g. contig sequences.

From <b>Cell Ranger V4</b> onwards, there is also an `airr_rearrangement.tsv` file that can be used directly with `Dandelion`. Doing so skips the reannotation steps, but that choice is entirely up to you.

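The cells below read their input from a `sc5p_v2_hs_PBMC_10k` folder. We will download the <b>airr_rearrangement.tsv</b> file, along with the filtered contig annotation `.csv` and `.fasta` files, from here: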
```bash
# bash
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig_annotations.csv
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig.fasta
# wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_all_contig_annotations.json
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_airr_rearrangement.tsv
```
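
Before reading anything, it can help to confirm that the downloads ended up where the later cells look for them. The following is a minimal sketch using only the standard library; `missing_files` and `EXPECTED` are hypothetical names for illustration and are not part of `dandelion`. It assumes the downloaded files were moved into the `sc5p_v2_hs_PBMC_10k` folder.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED = [
    "sc5p_v2_hs_PBMC_10k_b_filtered_contig_annotations.csv",
    "sc5p_v2_hs_PBMC_10k_b_filtered_contig.fasta",
    "sc5p_v2_hs_PBMC_10k_b_airr_rearrangement.tsv",
]


def missing_files(folder, expected=EXPECTED):
    """Return the expected file names that are not present in `folder`.

    Hypothetical helper for illustration; not a dandelion function.
    """
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# An empty list means every expected file was found.
print(missing_files("sc5p_v2_hs_PBMC_10k"))
```

If the list is not empty, move or re-download the named files before continuing.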


<b>Import dandelion module</b>

In [None]:
import os
import dandelion as ddl

# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
ddl.logging.print_versions()

With `ddl.read_10x_vdj`:

In [None]:
folder_location = "sc5p_v2_hs_PBMC_10k"
# or folder_location = 'sc5p_v2_hs_PBMC_10k/'
vdj = ddl.read_10x_vdj(
    folder_location, filename_prefix="sc5p_v2_hs_PBMC_10k_b_filtered"
)
vdj

With `ddl.read_10x_airr`:

In [None]:
# read in the airr_rearrangement.tsv file
file_location = (
    "sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_airr_rearrangement.tsv"
)
vdj = ddl.read_10x_airr(file_location)
vdj
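
The AIRR rearrangement file is a plain tab-separated table with one row per contig, so it can also be inspected with pandas alone. The sketch below parses two made-up rows from an in-memory string rather than the downloaded file; the column names are standard AIRR fields, but the values are invented and the real file has many more columns.

```python
import io

import pandas as pd

# Two invented rows with a small subset of standard AIRR columns.
toy_airr = (
    "cell_id\tsequence_id\tv_call\tj_call\tproductive\n"
    "AAACCTGAGAAACCAT-1\tAAACCTGAGAAACCAT-1_contig_1\tIGHV3-23\tIGHJ4\tT\n"
    "AAACCTGAGAAACCAT-1\tAAACCTGAGAAACCAT-1_contig_2\tIGKV1-5\tIGKJ1\tT\n"
)

# dtype=str keeps every field exactly as written in the file.
airr = pd.read_csv(io.StringIO(toy_airr), sep="\t", dtype=str)
print(airr[["sequence_id", "v_call", "j_call"]])
```

To look at the real file, the same `pd.read_csv(..., sep="\t")` call can be pointed at the `file_location` path used above.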

If you are using non-10x data, e.g. Parse Biosciences Evercode or BD Rhapsody, you can use `ddl.read_parse_airr` and `ddl.read_bd_airr` respectively. If you are using another source of single-cell AIRR data that provides standard AIRR-formatted files, e.g. SeekGene Biosciences, or just a standard AIRR file, you can use `ddl.read_airr` directly.

We will continue with the filtering part of the analysis to show how this slots smoothly into the rest of the workflow.

<b>Import modules for use with scanpy</b>

In [None]:
import pandas as pd
import numpy as np
import scanpy as sc
import warnings
import functools
import seaborn as sns
import scipy.stats
import anndata

warnings.filterwarnings("ignore")
sc.logging.print_header()

<b>Import the transcriptome data</b>

In [None]:
adata = sc.read_10x_h5(
    "sc5p_v2_hs_PBMC_10k/filtered_feature_bc_matrix.h5", gex_only=True
)
adata.obs["sample_id"] = "sc5p_v2_hs_PBMC_10k"
adata.var_names_make_unique()
adata

Run QC on the transcriptome data.

In [None]:
ddl.pp.recipe_scanpy_qc(adata)
adata

Run the filtering of BCR data. Note that I'm using the `Dandelion` object as input rather than the pandas dataframe (both types of input will work; in fact, a file path to the `.tsv` will work too).

In [None]:
# The function will return both objects.
vdj, adata = ddl.pp.check_contigs(vdj, adata)

<b>Check the output V(D)J table</b>

The V(D)J table is returned as a `Dandelion` class object, with the table itself in the `.data` slot; if a file path was provided to `ddl.pp.check_contigs` above, a new file will be created in the same folder with the `filtered` prefix. Note that this V(D)J table is indexed by contig (`sequence_id`).

In [None]:
vdj

<b>Check the AnnData object as well</b>

And the `AnnData` object is indexed based on cells.

In [None]:
adata
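
The difference between the two indexes (contigs for the V(D)J table, cells for `AnnData`) can be illustrated with a toy pandas table. This is a sketch on invented data, not how `dandelion` builds its own per-cell metadata: it collapses a contig-indexed table to one row per `cell_id`.

```python
import pandas as pd

# Toy contig-level table: one row per contig, indexed by sequence_id.
contigs = pd.DataFrame(
    {
        "sequence_id": ["c1_contig_1", "c1_contig_2", "c2_contig_1"],
        "cell_id": ["c1", "c1", "c2"],
        "locus": ["IGH", "IGK", "IGH"],
    }
).set_index("sequence_id")

# Collapse to one row per cell, listing the loci seen in that cell.
per_cell = contigs.groupby("cell_id")["locus"].agg(
    lambda loci: "|".join(sorted(loci))
)
print(per_cell)
```

Three contigs become two cells here, which is why the two objects generally have different numbers of rows.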

<b>The number of cells that actually have a matching BCR can be tabulated.</b>

In [None]:
pd.crosstab(adata.obs["has_contig"], adata.obs["chain_status"])
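
For readers unfamiliar with `pd.crosstab`, here is a small sketch of what the call above computes, using a toy `obs` table. The category labels are placeholders chosen for illustration and may not match the labels `dandelion` writes.

```python
import pandas as pd

# Toy per-cell annotations; the labels are placeholders.
obs = pd.DataFrame(
    {
        "has_contig": ["True", "True", "No_contig", "True"],
        "chain_status": ["Single pair", "Orphan VJ", "No_contig", "Single pair"],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Rows are has_contig values, columns are chain_status values,
# and each entry counts the cells with that combination.
table = pd.crosstab(obs["has_contig"], obs["chain_status"])
print(table)
```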

<b>Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data</b>

Because the 'filtered' `AnnData` object was returned without any further processing, we still need to normalize it and run through the usual steps here. The following is just a standard scanpy workflow.

In [None]:
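# keep cells that passed QC; "filter_rna" comes from ddl.pp.recipe_scanpy_qc
# and is compared against the string "False" here.
# .copy() makes the subset a real object rather than a view.
adata = adata[adata.obs["filter_rna"] == "False"].copy()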
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata
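
As a reminder of what the normalization and log steps do arithmetically, here is a small numpy sketch on a made-up dense count matrix. It mirrors the idea behind `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`, not scanpy's actual implementation.

```python
import numpy as np

# Toy counts: 2 cells (rows) x 3 genes (columns).
counts = np.array([[1.0, 3.0, 0.0], [10.0, 0.0, 10.0]])
target_sum = 1e4

# Divide each cell by its total count, then rescale to target_sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# log(1 + x), so zero counts stay at zero.
logged = np.log1p(normalized)
print(normalized.sum(axis=1))
print(logged)
```

After normalization every cell sums to `target_sum`, so differences in sequencing depth no longer dominate the comparison between cells.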

<b>Identify highly-variable genes</b>

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

<b>Filter the genes to only those marked as highly-variable</b>

In [None]:
adata = adata[:, adata.var.highly_variable]

<b>Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.</b>

In [None]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
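
The scaling step can be sketched in numpy as a per-gene z-score followed by clipping. This covers only the `sc.pp.scale(adata, max_value=10)` part, not the regression, and it rests on two assumptions: the sample standard deviation (`ddof=1`) is used, and only the upper tail is clipped. Depending on the scanpy version, the lower tail may be clipped as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 1000 cells x 3 genes, with one extreme value.
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X[0, 0] = 100.0

# Per-gene z-score: zero mean, unit variance.
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Cap values above 10, mirroring max_value=10.
scaled = np.minimum(z, 10)
print(z[0, 0], scaled[0, 0])
```

Clipping keeps a handful of extreme values from dominating the principal components computed in the next step.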

<b>Run PCA</b>

In [None]:
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

<b>Computing the neighborhood graph, UMAP and clusters</b>

In [None]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)

<b>Visualizing the clusters and whether or not there's a corresponding BCR</b>

In [None]:
sc.pl.umap(adata, color=["leiden", "chain_status"])

<b>Visualizing some B cell genes</b>

In [None]:
sc.pl.umap(adata, color=["IGHM", "JCHAIN"])

<b>Save AnnData</b>

We can save this `AnnData` object for now.

In [None]:
adata.write("adata2.h5ad", compression="gzip")

<b>Save dandelion</b>

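To save the vdj object, we have two options: either save the `.data` and `.metadata` slots with pandas' functions, or write the whole object to a `.h5ddl` file with `write_h5ddl`. The two cells below show each option in turn.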

In [None]:
vdj.data.to_csv("filtered_vdj_table2.tsv", sep="\t")

In [None]:
vdj.write_h5ddl("dandelion_results2.h5ddl")

### Concatenating multiple BCR objects

It is quite common that one might be trying to analyse data from multiple samples. In that case, `dandelion` has a `concat` function to merge the data.

We will simulate a second object by reading in the same file again.

In [None]:
vdj1 = ddl.read_10x_airr(file_location)
vdj2 = ddl.read_10x_airr(file_location)

Before you merge the objects, make sure that the `cell_id` and `sequence_id` values are distinct between runs so that you can tell them apart later.

In [None]:
vdj1.add_cell_prefix("run1_")
vdj2.add_cell_prefix("run2_")
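
To see why the prefixes matter, here is a pandas-only sketch on toy tables. `toy_run` and `with_prefix` are hypothetical helpers for illustration and do not reproduce how `add_cell_prefix` or `ddl.concat` work internally; they only show that two runs sharing barcodes collide unless the identifiers are made distinct first.

```python
import pandas as pd


def toy_run():
    """Return a tiny contig table; both 'runs' reuse the same barcodes."""
    return pd.DataFrame(
        {
            "sequence_id": ["AAAC-1_contig_1", "AAAG-1_contig_1"],
            "cell_id": ["AAAC-1", "AAAG-1"],
        }
    )


def with_prefix(df, prefix):
    """Prepend `prefix` to both identifier columns of a copy of `df`."""
    out = df.copy()
    out["cell_id"] = prefix + out["cell_id"]
    out["sequence_id"] = prefix + out["sequence_id"]
    return out


# Without prefixes the identifiers collide after concatenation.
naive = pd.concat([toy_run(), toy_run()], ignore_index=True)

# With prefixes every contig and cell stays unique.
prefixed = pd.concat(
    [with_prefix(toy_run(), "run1_"), with_prefix(toy_run(), "run2_")],
    ignore_index=True,
)

print(naive["sequence_id"].duplicated().any())     # True
print(prefixed["sequence_id"].duplicated().any())  # False
```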

In [None]:
vdj_merged = ddl.concat([vdj1, vdj2])
vdj_merged

In [None]:
vdj_merged.data

In [None]:
vdj_merged.metadata