# *dandelion* Notebook-5

![dandelion_logo](img/dandelion_logo.png)

***dandelion*** is primarily a single-cell BCR-seq analysis package, but the initial part of the pre-processing can also be applied to TCR-seq because it makes use of *changeo*'s scripts. The output can then be passed on for analysis with TCR-focused tools such as [scirpy](https://icbi-lab.github.io/scirpy/).


## Pre-processing - TCR

In [1]:
# import modules
import os
os.chdir(os.path.expanduser('/Users/kt16/Documents/Github/dandelion'))
import dandelion as ddl



In [2]:
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/PIP/Pan_Immune_TCR/'))
# print current working directory
os.getcwd()

'/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/PIP/Pan_Immune_TCR'

### Step 1:
#### Formatting the headers of the cellranger fasta file
The step immediately below is optional: it is just a quick way to build a dictionary from an external file using the utility function `utl.dict_from_table`.

In [3]:
# prepare a dictionary from a meta data file.
sampledict = ddl.utl.dict_from_table('/Users/kt16/Documents/Clatworthy_scRNAseq/Ondrej/dandelion_files/meta/PIP_sampleInfo_kt16.txt', columns = ('SANGER SAMPLE ID', 'GEX_SAMPLE_ID')) # optional
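For readers without the metadata file at hand, the resulting `sampledict` can be pictured as a plain column-to-column lookup. The sketch below is not dandelion's implementation: the helper name `dict_from_table_sketch`, the tab-separated layout and the toy rows are all assumptions for illustration, and only the two column names come from the cell above.

```python
import io
import pandas as pd

def dict_from_table_sketch(table, columns):
    """Hypothetical stand-in: map values of columns[0] to values of columns[1]."""
    # assumption: the metadata table is tab-separated
    df = pd.read_csv(table, sep='\t', dtype=str)
    key_col, value_col = columns
    return dict(zip(df[key_col], df[value_col]))

# toy metadata with made-up sample IDs
toy_meta = io.StringIO(
    "SANGER SAMPLE ID\tGEX_SAMPLE_ID\n"
    "sangerA\tgexA\n"
    "sangerB\tgexB\n"
)
toy_dict = dict_from_table_sketch(toy_meta, columns=('SANGER SAMPLE ID', 'GEX_SAMPLE_ID'))
print(toy_dict)
```

Looking up `toy_dict['sangerA']` then plays the same role as `sampledict[sample]` in the following cells.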

In [4]:
# the first argument is the fasta file (or a list of fasta files) to format and the second argument is the prefix to add to the headers in each file.
sample = 'Pan_T7918887'
ddl.pp.format_fasta(sample+'/filtered_contig.fasta', sampledict[sample])
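Judging from the index rebuilt later in this notebook (`sampledict[sample] + '_' + contig_id`), the prefix appears to be joined to each contig header with an underscore. A minimal sketch of that renaming on an in-memory FASTA is shown below. The helper `prefix_fasta_headers` is hypothetical and is not how `pp.format_fasta` is implemented, and the toy sequence is made up.

```python
def prefix_fasta_headers(lines, prefix):
    """Hypothetical helper: prepend `prefix_` to every FASTA header line."""
    out = []
    for line in lines:
        if line.startswith('>'):
            # header line: keep the '>' and insert the prefix after it
            out.append('>' + prefix + '_' + line[1:])
        else:
            # sequence line: leave untouched
            out.append(line)
    return out

# toy FASTA record (sequence is illustrative only)
toy_fasta = [
    '>AAACCTGAGGCCATAG-1_contig_1',
    'TGGGGGACTCTGCTC',
]
renamed = prefix_fasta_headers(toy_fasta, 'Pan_T7917815')
print(renamed[0])
```

Prefixing the headers keeps contig names unique when several samples are later combined, since the same cell barcode can occur in more than one sample.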

### Step 2:
#### Reannotate the V/D/J genes with *igblastn*.

`pp.reannotate_genes` uses [*changeo*](https://changeo.readthedocs.io/en/stable/examples/10x.html)'s scripts to call *igblastn* to reannotate the fasta files. For TCR data, specifying `loci = 'tr'` should be all that is needed.

In [5]:
# reannotate the vdj genes with igblastn and parse the output to 'airr' (default) or 'changeo' tsv format using changeo v1.0.0 scripts
ddl.pp.reannotate_genes(sample, loci='tr', filtered = True)

Assigning genes : 100%|██████████| 1/1 [00:31<00:00, 31.12s/it]


Now we read in the original `filtered_contig_annotations.csv` and compare it with the igblastn-annotated one.

In [6]:
import pandas as pd
original = pd.read_csv(sample+'/filtered_contig_annotations.csv')
# adjust the index
original['index']=[sampledict[sample]+'_'+i for i in original['contig_id']]
original.set_index('index', inplace = True)
original

index,barcode,is_cell,contig_id,high_confidence,length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id,raw_consensus_id
Pan_T7917815_AAACCTGAGGCCATAG-1_contig_1,AAACCTGAGGCCATAG-1,True,AAACCTGAGGCCATAG-1_contig_1,True,335,TRB,TRBV16,TRBD1,TRBJ2-3,TRBC2,False,,,,1343,5,,
Pan_T7917815_AAACCTGAGTGTTGAA-1_contig_1,AAACCTGAGTGTTGAA-1,True,AAACCTGAGTGTTGAA-1_contig_1,True,492,TRB,TRBV29-1,TRBD1,TRBJ1-1,TRBC1,True,True,CSVDNRRQGGWAFF,TGCAGCGTTGACAATCGCCGACAGGGCGGTTGGGCTTTCTTT,4288,5,clonotype83,clonotype83_consensus_1
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_1,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_1,True,823,TRB,TRBV30,TRBD1,TRBJ1-1,TRBC1,True,True,CAWSPGGGAEAFF,TGTGCCTGGAGTCCTGGGGGGGGGGCTGAAGCTTTCTTT,13700,22,clonotype30,clonotype30_consensus_2
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_2,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_2,True,515,TRA,TRAV20,,TRAJ10,TRAC,True,True,CAVQDAGGGNKLTF,TGTGCTGTGCAGGACGCGGGAGGAGGAAACAAACTCACCTTT,2800,2,clonotype30,clonotype30_consensus_1
Pan_T7917815_AAACGGGCAAGCTGAG-1_contig_1,AAACGGGCAAGCTGAG-1,True,AAACGGGCAAGCTGAG-1_contig_1,True,371,Multi,IGHV1-2,,,TRBC1,False,,,,1747,2,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_TTTGGTTGTCATTAGC-1_contig_3,TTTGGTTGTCATTAGC-1,True,TTTGGTTGTCATTAGC-1_contig_3,True,333,Multi,IGHV1-2,,,TRBC1,False,,,,1352,2,,
Pan_T7917815_TTTGGTTTCTTTACAC-1_contig_1,TTTGGTTTCTTTACAC-1,True,TTTGGTTTCTTTACAC-1_contig_1,True,567,TRA,TRAV6,,TRAJ3,TRAC,True,False,CFISQPPSLQTQLPTSVPLGIIF,TGTTTCATATCACAGCCTCCCAGCCTGCAGACTCAGCTACCTACCT...,916,2,clonotype4,
Pan_T7917815_TTTGGTTTCTTTACAC-1_contig_2,TTTGGTTTCTTTACAC-1,True,TTTGGTTTCTTTACAC-1_contig_2,True,500,TRB,TRBV6-4,,TRBJ2-1,TRBC2,True,True,CASSDGDRNEQFF,TGTGCCAGCAGTGACGGGGATCGCAATGAGCAGTTCTTC,1072,4,clonotype4,clonotype4_consensus_1
Pan_T7917815_TTTGTCAAGCTGGAAC-1_contig_1,TTTGTCAAGCTGGAAC-1,True,TTTGTCAAGCTGGAAC-1_contig_1,True,507,TRB,TRBV3-1,TRBD1,TRBJ2-2,TRBC2,True,True,CASSQRPGVNTGELFF,TGTGCCAGCAGCCAAAGACCAGGGGTGAACACCGGGGAGCTGTTTTTT,1951,2,clonotype26,clonotype26_consensus_2


And now for the newly annotated one:

In [7]:
new = pd.read_csv(sample+'/dandelion/data/filtered_contig_igblast_gap.tsv', sep = '\t', index_col=0)
new

sequence_id,sequence,rev_comp,productive,v_call,d_call,j_call,sequence_alignment,germline_alignment,junction,junction_aa,...,fwr3_end,fwr4_start,fwr4_end,cdr3_start,cdr3_end,np1,np1_length,np2,np2_length,junction_aa_length
Pan_T7917815_ACATGGTGTGTGACCC-1_contig_1,TGGGGGACTCTGCTCTCTGTCCTGTCTCCTCATCTGCAAAATTAGG...,F,T,TRBV19*01,,TRBJ2-1*01,GATGGTGGAATCACTCAGTCCCCAAAGTACCTGTTCAGAAAGGAAG...,GATGGTGGAATCACTCAGTCCCCAAAGTACCTGTTCAGAAAGGAAG...,TGTGCCAGTAGTATTTTCGGGCAGAGCTCCTACAATGAGCAGTTCTTC,CASSIFGQSSYNEQFF,...,531,,,532,573,TTTCGGGCAGAG,12,,,16
Pan_T7917815_ACATGGTGTGTGACCC-1_contig_2,TGGGAGAAAGACTAGGGATTCACCCAGTAAAGAGAGCTCATCTGTG...,F,T,TRAV17*01,,TRAJ6*01,AGTCAACAGGGAGAAGAGGATCCTCAGGCCTTGAGCATCCAGGAGG...,AGTCAACAGGGAGAAGAGGATCCTCAGGCCTTGAGCATCCAGGAGG...,TGTGCTACGCCCTCAGGAGGAAGCTACATACCTACATTT,CATPSGGSYIPTF,...,427,,,428,460,CCC,3,,,13
Pan_T7917815_ACGATACGTGACCAAG-1_contig_1,GGAGGGAGGCTGGGGGTGATTCACCACACTCTTAAAAGAAGACTAG...,F,T,TRBV30*01,TRBD2*01,TRBJ1-1*01,TCTCAGACTATTCATCAATGGCCAGCGACCCTGGTGCAGCCTGTGG...,TCTCAGACTATTCATCAATGGCCAGCGACCCTGGTGCAGCCTGTGG...,TGTGCCTGGAGTCCTGGGGGGGGGGCTGAAGCTTTCTTT,CAWSPGGGAEAFF,...,513,,,514,546,CCT,3,GGG,3.0,13
Pan_T7917815_ACGATACGTGACCAAG-1_contig_2,GAGCCTCATCCCTTTGCAACGTCAATGCGATCATGGGCACCAGGCT...,F,F,TRBV23-1*01,,TRBJ2-1*01,CATGCCAAAGTCACACAGACTCCAGGACATTTGGTCAAAGGAAAAG...,CATGCCAAAGTCACACAGACTCCAGGACATTTGGTCAAAGGAAAAG...,TGCGCCAGCAGCCGTACTGCGGTTTGGCAATGAGCAGTTCTTC,CASSRTAVWQ*AVL,...,510,,,511,547,CCGTACTGCGGTTTGG,16,,,14
Pan_T7917815_ACGATACGTGACCAAG-1_contig_3,TGGGGGATACAGAAGTGGCGCCTCTGAGAAAAGAAGGTTGGAATTA...,F,T,TRAV20*02,,TRAJ10*01,GAAGACCAGGTGACGCAGAGTCCCGAGGCCCTGAGACTCCAGGAGG...,GAAGACCAGGTGACGCAGAGTCCCGAGGCCCTGAGACTCCAGGAGG...,TGTGCTGTGCAGGACGCGGGAGGAGGAAACAAACTCACCTTT,CAVQDAGGGNKLTF,...,404,,,405,440,ACG,3,,,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_GTACGTAAGGCCGAAT-1_contig_6,GGAGAAACTTCTGCCTTCACACATCCCTCCAGCTAGGCAGGACAGG...,F,T,TRBV5-6*01,TRBD1*01,TRBJ1-1*01,GACGCTGGAGTCACCCAAAGTCCCACACACCTGATCAAAACGAGAG...,GACGCTGGAGTCACCCAAAGTCCCACACACCTGATCAAAACGAGAG...,TGTGCCAGCAGCTTGTGGGGCGTGTCCACTGAAGCTTTCTTT,CASSLWGVSTEAFF,...,554,,,555,590,T,1,GTGTC,5.0,14
Pan_T7917815_TCACAAGAGTCTCAAC-1_contig_1,GGACTGAGCTTGCCTGTGACTGGCTAGGGAGGAACCTGAGACTAGG...,F,T,TRAV17*01,,TRAJ9*01,AGTCAACAGGGAGAAGAGGATCCTCAGGCCTTGAGCATCCAGGAGG...,AGTCAACAGGGAGAAGAGGATCCTCAGGCCTTGAGCATCCAGGAGG...,TGTGCTACGGATACTGGAGGCTTCAAAACTATCTTT,CATDTGGFKTIF,...,473,,,474,503,,0,,,12
Pan_T7917815_TCACAAGAGTCTCAAC-1_contig_2,TGGGGAAAAATTGAAACCTGCCTGATGTGGGATGTGCTGTGGCTGC...,F,F,TRAV26-2*01,,TRAJ53*01,.................................................,CATACATTGGTATCGACAGCTTCCCTCCCAGGGTCCAGAGTACGTG...,TGCATCCTGAGAGAGGTTTAATAGTGGAGGTAGCAACTATAAACTG...,CILREV**WR*QL*TDI,...,323,,,324,369,GGTTT,5,,,17
Pan_T7917815_TCACAAGAGTCTCAAC-1_contig_3,GGGGAGAATGTCTCAGAATGACTTCCTTGAGAGTCCTGCTCCCCTT...,F,T,TRBV6-5*01,,TRBJ2-2*01,AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG...,AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG...,TGTGCCAGCAGTTACTCCCGTCTAAACACCGGGGAGCTGTTTTTT,CASSYSRLNTGELFF,...,424,,,425,463,CCGTCTA,7,,,15


Let's merge them so we can compare directly.

In [8]:
for x in new.columns:
    original[x] = pd.Series(new[x])

In [9]:
original

index,barcode,is_cell,contig_id,high_confidence,length,chain,v_gene,d_gene,j_gene,c_gene,...,fwr3_end,fwr4_start,fwr4_end,cdr3_start,cdr3_end,np1,np1_length,np2,np2_length,junction_aa_length
Pan_T7917815_AAACCTGAGGCCATAG-1_contig_1,AAACCTGAGGCCATAG-1,True,AAACCTGAGGCCATAG-1_contig_1,True,335,TRB,TRBV16,TRBD1,TRBJ2-3,TRBC2,...,,,,,,,,,,
Pan_T7917815_AAACCTGAGTGTTGAA-1_contig_1,AAACCTGAGTGTTGAA-1,True,AAACCTGAGTGTTGAA-1_contig_1,True,492,TRB,TRBV29-1,TRBD1,TRBJ1-1,TRBC1,...,372.0,,,373.0,408.0,CAATCGCC,8.0,CGGTTGG,7.0,14.0
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_1,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_1,True,823,TRB,TRBV30,TRBD1,TRBJ1-1,TRBC1,...,532.0,,,533.0,565.0,CCT,3.0,GGG,3.0,13.0
Pan_T7917815_AAACCTGGTCAGCTAT-1_contig_2,AAACCTGGTCAGCTAT-1,True,AAACCTGGTCAGCTAT-1_contig_2,True,515,TRA,TRAV20,,TRAJ10,TRAC,...,403.0,,,404.0,439.0,ACG,3.0,,,14.0
Pan_T7917815_AAACGGGCAAGCTGAG-1_contig_1,AAACGGGCAAGCTGAG-1,True,AAACGGGCAAGCTGAG-1_contig_1,True,371,Multi,IGHV1-2,,,TRBC1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Pan_T7917815_TTTGGTTGTCATTAGC-1_contig_3,TTTGGTTGTCATTAGC-1,True,TTTGGTTGTCATTAGC-1_contig_3,True,333,Multi,IGHV1-2,,,TRBC1,...,,,,,,,,,,
Pan_T7917815_TTTGGTTTCTTTACAC-1_contig_1,TTTGGTTTCTTTACAC-1,True,TTTGGTTTCTTTACAC-1_contig_1,True,567,TRA,TRAV6,,TRAJ3,TRAC,...,396.0,,,397.0,412.0,CCTTGGG,7.0,,,7.0
Pan_T7917815_TTTGGTTTCTTTACAC-1_contig_2,TTTGGTTTCTTTACAC-1,True,TTTGGTTTCTTTACAC-1_contig_2,True,500,TRB,TRBV6-4,,TRBJ2-1,TRBC2,...,356.0,,,357.0,389.0,GGGGATCG,8.0,,,13.0
Pan_T7917815_TTTGTCAAGCTGGAAC-1_contig_1,TTTGTCAAGCTGGAAC-1,True,TTTGTCAAGCTGGAAC-1_contig_1,True,507,TRB,TRBV3-1,TRBD1,TRBJ2-2,TRBC2,...,382.0,,,383.0,424.0,AGAC,4.0,T,1.0,16.0
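The rows of empty igblastn columns in the merged table follow from how pandas assigns a Series to a DataFrame column: values are aligned on the index labels, so any contig missing from the re-annotated table gets `NaN`. A column present in both tables (here `productive`) is overwritten by the new values. The toy frames below are made up to illustrate this and are not taken from the data.

```python
import pandas as pd

# toy "original" table with two contigs
old = pd.DataFrame(
    {'v_gene': ['TRBV16', 'TRBV30'], 'productive': [None, True]},
    index=['contig_a', 'contig_b'],
)
# toy re-annotated table that only contains contig_b
ann = pd.DataFrame(
    {'v_call': ['TRBV30*01'], 'productive': ['T']},
    index=['contig_b'],
)

merged = old.copy()
for col in ann.columns:
    # assignment aligns on index labels; labels absent from `ann` become NaN
    merged[col] = ann[col]

print(merged)
```

The same alignment explains why integer columns such as `cdr3_start` show up as floats (for example `373.0`) after merging: a numeric column that contains `NaN` is stored as float.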


Let's pair the old and new V calls in a dictionary and see whether there are any changes.

In [51]:
import numpy as np
import re
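# keep only the contigs that igblastn annotated
test = original.dropna(subset = ["v_call"])
# pair each cellranger v_gene with an igblastn v_call (only one v_call is kept per v_gene)
testdict = dict(zip(test['v_gene'], test['v_call']))
for key, value in testdict.items():
    # strip the allele suffixes (e.g. *01) before comparing
    if key != re.sub('[*][0-9][0-9],|[*][0-9][0-9]|/', '', value):
        print({key: value})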

{'TRGV10': 'TRDV3*01,TRDV3*02'}
{'TRBV24-1': 'TRBV24-1*02,TRBV24/OR9-2*01'}
{'TRBV6-3': 'TRBV6-2*01,TRBV6-3*01'}
{'TRBV11-3': 'TRBV11-3*01,TRBV11-3*02,TRBV11-3*04'}
{'TRBV6-9': 'TRBV13*01,TRBV13*02,TRBV5-7*01'}


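So five V genes came up flagged as differently annotated. Not too bad I guess. Every flagged call lists more than one candidate, and the fact that there are multiple mappings suggests that igblastn couldn't annotate those contigs unambiguously. (`TRBV11-3` is a special case: its calls are three alleles of the same gene, so it is only flagged because the regex concatenates the stripped names.) Maybe worth considering using this as a potential QC step?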
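One way the QC idea could be sketched is to compare the calls row by row: strip the allele suffix from each comma-separated call, collapse to the set of distinct genes, and flag contigs whose call is ambiguous or disagrees with the cellranger gene. The helpers `genes_in_call` and `flag_calls` are hypothetical, not part of dandelion. The first three toy rows reuse pairs printed above, while the last row is an assumed concordant example.

```python
import re
import pandas as pd

def genes_in_call(call):
    """Return the distinct gene names in a comma-separated call, alleles removed."""
    return {re.sub(r'\*\d+$', '', c.strip()) for c in call.split(',')}

def flag_calls(df, old_col, new_col):
    """Hypothetical QC helper: add 'ambiguous' and 'mismatch' boolean columns."""
    out = df.dropna(subset=[new_col]).copy()
    gene_sets = out[new_col].map(genes_in_call)
    # more than one distinct gene means igblastn could not settle on a single gene
    out['ambiguous'] = gene_sets.map(len) > 1
    # the original gene is not among the re-annotated candidates
    out['mismatch'] = [old not in genes for old, genes in zip(out[old_col], gene_sets)]
    return out

toy = pd.DataFrame({
    'v_gene': ['TRBV11-3', 'TRBV6-3', 'TRGV10', 'TRBV19'],
    'v_call': ['TRBV11-3*01,TRBV11-3*02,TRBV11-3*04',
               'TRBV6-2*01,TRBV6-3*01',
               'TRDV3*01,TRDV3*02',
               'TRBV19*01'],
})
flagged = flag_calls(toy, 'v_gene', 'v_call')
print(flagged[['v_gene', 'ambiguous', 'mismatch']])
```

Unlike the dictionary comparison, this checks every contig rather than one call per gene, and it does not flag `TRBV11-3`, where only the allele is uncertain.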

In [53]:
# check the J calls
test = original.dropna(subset = ["j_call"])
testdict = dict(zip(test['j_gene'], test['j_call']))
for key, value in testdict.items():
    if key != re.sub('[*][0-9][0-9],|[*][0-9][0-9]|/', '', value):
        print({key: value})

{'TRAJ27': 'TRAJ33*01'}
{'None': 'TRDJ3*01'}


Weird: for a contig where cellranger reports no J gene (`None`), igblastn calls `TRDJ3*01`, which is a TCR delta (TRD) J gene.

I think it's probably not worth the (my) effort to try and implement a function to convert the file for scirpy or flag bad contigs for now.