# *dandelion* Notebook-5

![dandelion_logo](img/dandelion_logo.png)

***dandelion*** is primarily a single-cell BCR-seq analysis package, but the initial part of the pre-processing can also be applied to TCR-seq data because it relies on scripts from `changeo`. The output can then be transferred for analysis with other TCR-focused tools such as [scirpy](https://icbi-lab.github.io/scirpy/).


## Pre-processing - TCR

In [1]:
# import modules
import os
os.chdir(os.path.expanduser('/Users/kt16/Documents/Github/dandelion'))
import dandelion as ddl



In [2]:
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/PIP/Pan_Immune_TCR/'))
# print current working directory
os.getcwd()

'/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/PIP/Pan_Immune_TCR'

### Step 1:
#### Formatting the headers of the cellranger fasta file
The step immediately below is optional: it is just a quick way to build a dictionary from an external file using the utility function `utl.dict_from_table`.

In [3]:
# prepare a dictionary from a meta data file.
sampledict = ddl.utl.dict_from_table('/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/dandelion_files/meta/PIP_sampleInfo_kt16.txt', columns = ('SANGER SAMPLE ID', 'GEX_SAMPLE_ID')) # optional
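
For readers who do not have the metadata file at hand, the idea behind this lookup can be sketched with plain pandas. This is only an illustration of mapping one column to another; `dict_from_columns` and the toy sample names are hypothetical, and it is not how `utl.dict_from_table` is implemented.

```python
import io

import pandas as pd


def dict_from_columns(table, key_col, value_col, sep="\t"):
    """Hypothetical helper: map one column of a delimited table to another."""
    df = pd.read_csv(table, sep=sep, dtype=str)
    return dict(zip(df[key_col], df[value_col]))


# Toy metadata standing in for the real sample sheet (values are made up).
toy_meta = io.StringIO(
    "SANGER SAMPLE ID\tGEX_SAMPLE_ID\n"
    "sampleA_vdj\tsampleA_gex\n"
    "sampleB_vdj\tsampleB_gex\n"
)
toy_dict = dict_from_columns(toy_meta, "SANGER SAMPLE ID", "GEX_SAMPLE_ID")
print(toy_dict)
```

The resulting dictionary is then indexed by the VDJ sample name to obtain the matching gene-expression sample name, which is used as the prefix in the next cell.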

In [4]:
# the first argument is the fasta file (or list of fasta files) to format and the second is the prefix to add to each file.
sample = 'Pan_T7918887'
ddl.pp.format_fasta(sample+'/filtered_contig.fasta', sampledict[sample])
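
Judging from the index values shown later in this notebook (for example `Pan_T7917815_AAACCTGAGGCCATAG-1_contig_1`), the formatting step prepends the sample prefix and an underscore to each contig name. A minimal sketch of that renaming is given below; `prefix_fasta_headers` is a hypothetical helper and the sequences are made up, so this should not be read as the actual `pp.format_fasta` implementation.

```python
def prefix_fasta_headers(lines, prefix, sep="_"):
    """Hypothetical helper: prepend `prefix` to every FASTA header line."""
    out = []
    for line in lines:
        if line.startswith(">"):
            # Header line: keep the original contig name after the prefix.
            out.append(">" + prefix + sep + line[1:])
        else:
            # Sequence line: leave untouched.
            out.append(line)
    return out


# Toy records; the contig names follow the 10x style, the sequences are invented.
toy_fasta = [
    ">AAACCTGAGGCCATAG-1_contig_1",
    "TGGGGAGT",
    ">AAACCTGAGTGTTGAA-1_contig_1",
    "GGGAAAGC",
]
renamed = prefix_fasta_headers(toy_fasta, "Pan_T7917815")
print("\n".join(renamed))
```

Prefixing the contig names this way keeps them unique when several samples are combined later.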

### Step 2:
#### Reannotate the V/D/J genes with *igblastn*.

`pp.reannotate_genes` uses [*changeo*](https://changeo.readthedocs.io/en/stable/examples/10x.html)'s scripts to call *igblastn* and reannotate the fasta files. For TCR data, I just need to specify `loci = 'tr'` and it should work.

In [7]:
# reannotate the vdj genes with igblastn and parse the output to 'airr' (default) or 'changeo' tsv formats using changeo v1.0.0 scripts
ddl.pp.reannotate_genes(sample, loci='tr')

Assigning genes : 100%|██████████| 1/1 [01:42<00:00, 102.51s/it]


Now we read in the original `all_contig_annotations.csv` and compare it with the igblastn-annotated one.

In [8]:
import pandas as pd
original = pd.read_csv(sample+'/all_contig_annotations.csv')
# adjust the index
original['index']=[sampledict[sample]+'_'+i for i in original['contig_id']]
original.set_index('index', inplace = True)
original

index,barcode,is_cell,contig_id,high_confidence,length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id,raw_consensus_id
Pan_T7917815_AAACCTGAGGCCATAG-1_contig_1,AAACCTGAGGCCATAG-1,True,AAACCTGAGGCCATAG-1_contig_1,True,335,TRB,TRBV16,TRBD1,TRBJ2-3,TRBC2,False,,,,1343,5,,
Pan_T7917815_AAACCTGAGTGTTGAA-1_contig_1,AAACCTGAGTGTTGAA-1,True,AAACCTGAGTGTTGAA-1_contig_1,True,492,TRB,TRBV29-1,TRBD1,TRBJ1-1,TRBC1,True,True,CSVDNRRQGGWAFF,TGCAGCGTTGACAATCGCCGACAGGGCGGTTGGGCTTTCTTT,4288,5,clonotype83,clonotype83_consensus_1
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_1,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_1,True,823,TRB,TRBV30,TRBD1,TRBJ1-1,TRBC1,True,True,CAWSPGGGAEAFF,TGTGCCTGGAGTCCTGGGGGGGGGGCTGAAGCTTTCTTT,13700,22,clonotype30,clonotype30_consensus_2
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_2,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_2,True,515,TRA,TRAV20,,TRAJ10,TRAC,True,True,CAVQDAGGGNKLTF,TGTGCTGTGCAGGACGCGGGAGGAGGAAACAAACTCACCTTT,2800,2,clonotype30,clonotype30_consensus_1
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_3,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_3,False,364,TRA,TRAV8-3,,TRAJ28,TRAC,False,,CGSPSGAGSYQLTF,TGTGGGTCCCCCTCTGGGGCTGGGAGTTACCAACTCACTTTC,1166,1,clonotype30,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_3,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_3,True,259,IGH,IGHV4-31,,,,False,,,,13,1,,
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_4,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_4,False,313,Multi,TRBV24-1,,,TRAC,False,,,,17,1,,
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_5,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_5,True,535,Multi,IGLV7-43,,TRAJ44,,False,,,,58,6,,
Pan_T7917815_TTTGTCATCTCGATGA-1_contig_1,TTTGTCATCTCGATGA-1,False,TTTGTCATCTCGATGA-1_contig_1,True,558,TRB,TRBV20-1,TRBD1,TRBJ1-2,TRBC1,True,True,CSASNGQGLASYGYTF,TGCAGTGCTAGTAATGGGCAGGGGCTCGCAAGTTATGGCTACACCTTC,1547,2,,


And now for the newly annotated one:

In [9]:
new = pd.read_csv(sample+'/dandelion/data/all_contig_igblast_gap.tsv', sep = '\t', index_col=0)
new

sequence_id,sequence,rev_comp,productive,v_call,d_call,j_call,sequence_alignment,germline_alignment,junction,junction_aa,...,fwr3_end,fwr4_start,fwr4_end,cdr3_start,cdr3_end,np1,np1_length,np2,np2_length,junction_aa_length
Pan_T7917815_AAACGGGAGCCCAACC-1_contig_1,TGGGGGAGTCATCCCTCCTCGCTGGTGAATGGAGGCAGTGGTCACA...,F,T,TRBV20-1*02,TRBD2*02,TRBJ2-3*01,GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTG...,GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTG...,TGCAGTGCTTTCGTGCGTAGCGGGAAAACAGATACGCAGTATTTT,CSAFVRSGKTDTQYF,...,409,,,410,448,TTCGTGCG,8,AA,2.0,15
Pan_T7917815_AATCCAGGTCTGCGGT-1_contig_1,TGGGAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTT...,F,T,TRBV20-1*02,TRBD2*02,TRBJ2-3*01,GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTG...,GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTG...,TGCAGTGCTTTCGTGCGTAGCGGGAAAACAGATACGCAGTATTTT,CSAFVRSGKTDTQYF,...,358,,,359,397,TTCGTGCG,8,AA,2.0,15
Pan_T7917815_AATCCAGTCATATCGG-1_contig_1,GGGGATAGAAAGACAAGATGGTCCTGAAATTCTCCGTGTCCATTCT...,F,T,TRAV27*01,,TRAJ27*01,ACCCAGCTGCTGGAGCAGAGCCCTCAGTTTCTAAGCATCCAAGAGG...,ACCCAGCTGCTGGAGCAGAGCCCTCAGTTTCTAAGCATCCAAGAGG...,TGTGCAGGAGGGGATACTCCCACCAATGCAGGCAAATCAACCTTT,CAGGDTPTNAGKSTF,...,338,,,339,377,GGGATACTCC,10,,,15
Pan_T7917815_ACAGCTAAGCATGGCA-1_contig_1,TGGGGCTCACAGGAAGATGCATCTTGTAGGAGGCAGCTGTGAGGTC...,F,T,TRBV18*01,TRBD1*01,TRBJ1-1*01,AATGCCGGCGTCATGCAGAACCCAAGACACCTGGTCAGGAGGAGGG...,AATGCCGGCGTCATGCAGAACCCAAGACACCTGGTCAGGAGGAGGG...,TGTGCCAGCTCACCACCGGGGGCGCAGAACACTGAAGCTTTCTTT,CASSPPGAQNTEAFF,...,448,,,449,487,,0,GCA,3.0,15
Pan_T7917815_ACATGGTAGGGCTTGA-1_contig_1,TGGGGGTCATGCAGCATCTGCCATGAGCATCGGCCTCCTGTGCTGT...,F,T,TRBV6-5*01,,TRBJ2-2*01,AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG...,AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG...,TGTGCCAGCAGTACGGGGAACACCGGGGAGCTGTTTTTT,CASSTGNTGELFF,...,352,,,353,385,ACGGG,5,,,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_TCGTACCGTTCCTCCA-1_contig_1,TGGGAGGACAGATTTCTTTTATGATTCCTACAGCAGAAAAATGAGA...,F,T,TRAV12-3*01,,TRAJ37*02,CAGAAGGAGGTGGAGCAGGATCCTGGACCACTCAGTGTTCCAGAGG...,CAGAAGGAGGTGGAGCAGGATCCTGGACCACTCAGTGTTCCAGAGG...,TGTGCAATGAGCGCGGGAAGCAACACAGGCAAACTAATCTTT,CAMSAGSNTGKLIF,...,445,,,446,481,CGGGA,5,,,14
Pan_T7917815_TGCTACCGTTCGGGCT-1_contig_2,TGGGGAGAATGCTTACTACAGAGACACCAGCCCCAAGCTAGGAGAT...,F,T,TRBV9*01,,TRBJ1-4*01,GATTCTGGAGTCACACAAACCCCAAAGCACCTGATCACAGCAACTG...,GATTCTGGAGTCACACAAACCCCAAAGCACCTGATCACAGCAACTG...,TGTGCCAGCAGCGTAAGTTCGGTAAATGAAAAACTGTTTTTT,CASSVSSVNEKLFF,...,382,,,383,418,AGTTCGGTA,9,,,14
Pan_T7917815_TTAGGCACACGTCAGC-1_contig_1,TATTTTTCCTCCCTTTCTCATGTTTTTATAAATAGGTAATAAAAAA...,T,F,TRAV3*01,TRDD2*01,TRDJ4*01,.................................................,TTTGAAGCTGAATTTANNNNTCCTANNNNNNNNNNNNNNNNNNNNN...,CTGCTTCTGATTTTTCTTGCATTTTAAATTCTCAGCCAACCTACAG...,LLLIFLAF*ILSQPTAMIF,...,188,,,189,239,TATG,4,GAATAAGAAGCAATGATGTGCTGCTTCTGATTTTTCTTGCATTTTA...,69.0,19
Pan_T7917815_TTGTAGGAGAGGGCTT-1_contig_1,GGGAAAGCAGATTCTTTTTATGATTTTTAAAGTAGAAATATCCATT...,F,T,TRAV12-2*01,,TRAJ15*01,CAGAAGGAGGTGGAGCAGAATTCTGGACCCCTCAGTGTTCCAGAGG...,CAGAAGGAGGTGGAGCAGAATTCTGGACCCCTCAGTGTTCCAGAGG...,TGTGCCGTGAACATCGTCCGAGGACAGGCAGGAACTGCTCTGATCTTT,CAVNIVRGQAGTALIF,...,438,,,439,480,TCGTCCGAGGA,11,,,16


Let's merge them so that we can compare directly.

In [10]:
for x in new.columns:
    original[x] = pd.Series(new[x])
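
The column-by-column assignment above aligns on the index, which amounts to a left join onto `original`. As a sketch of an alternative, `DataFrame.join` can do this in one call while keeping columns that exist in both tables (such as `productive`) side by side instead of overwriting them. The toy frames and the `_igblast` suffix are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for the 10x table and the igblastn table (values are made up).
toy_original = pd.DataFrame(
    {
        "v_gene": ["TRBV16", "TRBV30", "IGHV4-31"],
        "productive": [False, True, False],
    },
    index=["c1", "c2", "c3"],
)
toy_new = pd.DataFrame(
    {
        "v_call": ["TRBV20-1*01", "TRBV30*01"],
        "productive": ["T", "T"],
    },
    index=["c1", "c2"],
)

# Left join on the index; overlapping column names from the right table get a suffix.
merged = toy_original.join(toy_new, how="left", rsuffix="_igblast")
print(merged)
```

Contigs that are missing from the igblastn table (here `c3`) simply receive missing values in the joined columns, as they do with the loop.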

In [11]:
original

index,barcode,is_cell,contig_id,high_confidence,length,chain,v_gene,d_gene,j_gene,c_gene,...,fwr3_end,fwr4_start,fwr4_end,cdr3_start,cdr3_end,np1,np1_length,np2,np2_length,junction_aa_length
Pan_T7917815_AAACCTGAGGCCATAG-1_contig_1,AAACCTGAGGCCATAG-1,True,AAACCTGAGGCCATAG-1_contig_1,True,335,TRB,TRBV16,TRBD1,TRBJ2-3,TRBC2,...,,,,,,,,,,
Pan_T7917815_AAACCTGAGTGTTGAA-1_contig_1,AAACCTGAGTGTTGAA-1,True,AAACCTGAGTGTTGAA-1_contig_1,True,492,TRB,TRBV29-1,TRBD1,TRBJ1-1,TRBC1,...,372.0,,,373.0,408.0,CAATCGCC,8.0,CGGTTGG,7.0,14.0
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_1,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_1,True,823,TRB,TRBV30,TRBD1,TRBJ1-1,TRBC1,...,532.0,,,533.0,565.0,CCT,3.0,GGG,3.0,13.0
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_2,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_2,True,515,TRA,TRAV20,,TRAJ10,TRAC,...,403.0,,,404.0,439.0,ACG,3.0,,,14.0
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_3,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_3,False,364,TRA,TRAV8-3,,TRAJ28,TRAC,...,247.0,,,248.0,288.0,CCCC,4.0,,,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_3,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_3,True,259,IGH,IGHV4-31,,,,...,,,,,,,,,,
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_4,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_4,False,313,Multi,TRBV24-1,,,TRAC,...,,,,,,,,,,
Pan_T7917815_TTTGTCATCTCGAGTA-1_contig_5,TTTGTCATCTCGAGTA-1,False,TTTGTCATCTCGAGTA-1_contig_5,True,535,Multi,IGLV7-43,,TRAJ44,,...,,,,,,,,,,
Pan_T7917815_TTTGTCATCTCGATGA-1_contig_1,TTTGTCATCTCGATGA-1,False,TTTGTCATCTCGATGA-1_contig_1,True,558,TRB,TRBV20-1,TRBD1,TRBJ1-2,TRBC1,...,408.0,,,409.0,450.0,TAATGGG,7.0,CTCGCAAGT,9.0,16.0


Let's collect the old and new V calls into a dictionary and see if there are any changes.

In [12]:
import numpy as np
import re
test = original.dropna(subset = ["v_call"])
testdict = dict(zip(test['v_gene'], test['v_call']))
for key, value in testdict.items():
    if key != re.sub('[*][0-9][0-9],|[*][0-9][0-9]|/', '', value):
        print({key: value})

{'TRBV28': 'TRAV3*01'}
{'TRBV5-6': 'TRAV19*01'}
{'TRBV13': 'TRBV7-3*01,TRBV7-3*05'}
{'TRBV20-1': 'TRBV20-1*01,TRBV20-1*05'}
{'TRAV6': 'TRAV6*01,TRAV6*07'}
{'TRAV1-1': 'TRAV6*01,TRAV6*02,TRAV6*03'}
{'TRGV10': 'TRDV3*01,TRDV3*02'}
{'TRBV24-1': 'TRBV24-1*02,TRBV24/OR9-2*01'}
{'TRBV5-4': 'TRAV13-2*01,TRAV13-2*02'}
{'TRBV6-3': 'TRBV6-2*01,TRBV6-3*01'}
{'TRAV35': 'TRAV35*01,TRAV35*02,TRAV35*03'}
{'TRBV16': 'TRBV20-1*01,TRBV20-1*02,TRBV20-1*03'}
{'TRBV7-3': 'TRBV11-2*03'}
{'TRBV7-4': 'TRBV7-6*01'}
{'IGLV4-3': 'TRBV23-1*01'}
{'TRAV30': 'TRAV30*01,TRAV30*02,TRAV30*03'}
{'None': 'TRBV25-1*01'}
{'IGLV4-69': 'TRBV7-9*06'}
{'IGKV1D-43': 'TRBV16*01,TRBV16*02,TRBV16*03'}
{'IGLV1-40': 'TRAV24*01'}
{'IGLV5-37': 'TRGV2*01,TRGV2*02,TRGV8*01'}
{'IGHV3-15': 'TRBV21-1*01,TRBV21-1*02'}
{'IGKV3-7': 'TRBV5-8*01,TRBV5-8*02,TRBV6-7*01'}
{'IGHV4-39': 'TRBV15*01'}
{'IGHV5-51': 'TRBV16*01,TRBV16*02,TRBV7-4*01'}
{'IGKV1-33': 'TRAV10*01,TRAV10*02'}
{'TRBV17': 'TRBV5-3*01'}
{'IGLV3-27': 'TRAV1-2*01,TRAV1-2*03'}
{'IGHV

So a number of contigs came up flagged as differently annotated in the V gene. Not too bad, I guess. The fact that there are multiple mappings here suggests that igblastn couldn't actually annotate those contigs properly. Maybe it is worth considering this as a potential QC step?
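
One way to explore that QC idea is to compare the calls per contig rather than through a dictionary. A dictionary keeps only the last `v_call` seen for each `v_gene`, and the regular expression used above turns a call such as `TRAV6*01,TRAV6*07` into `TRAV6TRAV6`, so allele-only ambiguity is reported as a disagreement. The sketch below separates the two cases; `genes_in_call`, `multi_gene` and `agrees` are hypothetical names, and the three rows reuse pairs from the printed output above.

```python
import re

import pandas as pd


def genes_in_call(call):
    """Return the set of gene names in a comma-separated allele call."""
    return {re.sub(r"\*\d+$", "", allele.strip()) for allele in call.split(",")}


# Three (v_gene, v_call) pairs taken from the output printed above.
toy = pd.DataFrame(
    {
        "v_gene": ["TRAV6", "TRBV6-3", "TRBV28"],
        "v_call": ["TRAV6*01,TRAV6*07", "TRBV6-2*01,TRBV6-3*01", "TRAV3*01"],
    }
)

gene_sets = toy["v_call"].map(genes_in_call)
# More than one distinct gene means igblastn could not settle on a single V gene.
toy["multi_gene"] = gene_sets.map(len) > 1
# The 10x call "agrees" if it is among the genes igblastn reported.
toy["agrees"] = [gene in genes for gene, genes in zip(toy["v_gene"], gene_sets)]
print(toy)
```

Under these assumptions, contigs with `multi_gene` set or `agrees` unset would be the candidates to inspect or filter.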

In [13]:
# check the J calls
test = original.dropna(subset = ["j_call"])
testdict = dict(zip(test['j_gene'], test['j_call']))
for key, value in testdict.items():
    if key != re.sub('[*][0-9][0-9],|[*][0-9][0-9]|/', '', value):
        print({key: value})

{'None': 'TRDJ1*01,TRDJ2*01'}
{'TRDJ4': 'TRDJ1*01'}
{'TRDJ1': 'TRBJ2-2*01'}
{'IGHJ6': 'TRBJ2-4*01'}
{'TRBJ2-2P': 'TRGJP1*01'}
{'IGHJ3': 'TRAJ33*01'}
{'IGLJ2': 'TRBJ2-2*01'}
{'IGLJ5': 'TRBJ1-3*01,TRBJ2-2*01'}
{'TRGJP1': 'TRDJ2*01'}
{'TRAJ14': 'TRDJ3*01'}


The utility function `tl.convert_format` will attempt to convert the airr/changeo tsvs to a 10x-style table so that it can be plugged into other 