<h1 align='center'>Sanity Checks on Gateway</h1>

<h4 align='center'>iReceptor $\mid$ Laura Gutierrez Funderburk $\mid$ October 14</h4>

<h4 align='center'>Supervised by Dr. Felix Breden, Dr. Jamie Scott, Dr. Brian Corrie</h4>

<h2 align='center'>Abstract</h2>

In this notebook I parse the V-gene, D-gene and J-gene entries, grouped by lab, as they appear on https://gateway.ireceptor.org

In [6]:
# Test 1: Check that len(Junction Sequence) equals the reported Junction Length
# Test 2: Check that every word in an entry is an 'IG' name, a 'TC' name, or the separator 'or'
# Test 3: Verify entries are consistent within each row, i.e. the first 3 letters should be the
#   same across the V-, D- and J-gene columns

In [1]:
# NOTE: Missing complete name for Gastroenterology clinic, Oslo University Hospital-Rikshospitalet
lab_names = ["Institute of Molecular and Genomic Medicine, National Health Research Institutes",
             "Von Budingen Lab",
             "The Wellcome Trust Sanger Institute",
             "Department of Pathology, Stanford University",
             "Kwong Lab",
             "Vaccine Research Centre",
             "Shemyakin-Ovchinnikov institute of Bioorganic Chemistry",
             "Ramit Mehr's Computational Immunology Lab",
             "Department of Immunology and Microbiology",
             "Department of Immunology",
             "Immunogenomics Lab",
             "Georgiou Lab",
             "Department of Medicine"]

lab_alias = ["IMGM",
             "VBL",
             "WTSI",
             "DPS",
             "KL",
             "VRC",
             "SOIBC",
             "RMCIL",
             "DIM",
             "DI",
             "IL",
             "GL",
             "DM"]

if len(lab_names) == len(lab_alias):
    # Map each alias to its full lab name
    lab_alias_name_dic = dict(zip(lab_alias, lab_names))
else:
    print("Check lengths of arrays and look for abnormalities, duplicates, missing names/aliases ...")
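
The length check above catches a missing name or alias, but a repeated alias would silently overwrite an earlier dictionary entry. A minimal sketch of a stricter pairing follows; `build_alias_map` is a hypothetical helper that is not part of the original notebook, and the three entries passed to it are copied from the lists above purely for illustration.

```python
def build_alias_map(aliases, names):
    """Pair each alias with its lab name, refusing mismatched or duplicated input."""
    if len(aliases) != len(names):
        raise ValueError(f"{len(aliases)} aliases but {len(names)} names")
    # A repeated alias would overwrite an earlier key, so report it instead
    duplicates = sorted({a for a in aliases if aliases.count(a) > 1})
    if duplicates:
        raise ValueError(f"duplicate aliases: {duplicates}")
    return dict(zip(aliases, names))


# Illustrative input: the first three entries of lab_alias and lab_names
example_map = build_alias_map(
    ["IMGM", "VBL", "WTSI"],
    ["Institute of Molecular and Genomic Medicine, National Health Research Institutes",
     "Von Budingen Lab",
     "The Wellcome Trust Sanger Institute"],
)
print(example_map["VBL"])
```

Raising an exception instead of printing a message means later cells cannot run against a dictionary that was never built.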

In [2]:
import json
import pandas as pd

In [4]:
# Directories

CSV_Directory = "./CSV_FILES/"
IMGM_cont = CSV_Directory + "IMGM_Seq.csv"

In [87]:
# The export is tab-separated despite the .csv extension
IMGM_parsed = pd.read_csv(IMGM_cont, sep="\t")
# A cell may list several candidate calls joined by ", or "; split each cell into a list
V_genes = [item.split(", or ") for item in IMGM_parsed['V-Gene']]
J_genes = [item.split(", or ") for item in IMGM_parsed['J-Gene']]
D_genes = [item.split(", or ") for item in IMGM_parsed['D-Gene']]
# One [V, J, D] triple per row (note the order: V, then J, then D)
VJD_genes = [list(triple) for triple in zip(V_genes, J_genes, D_genes)]
Junction_Sequence = list(IMGM_parsed['Junction Sequence (AA)'])
Junction_Length = list(IMGM_parsed['Junction Length (AA)'])
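
The splitting step above assumes every cell is a string. If a gene column ever contains an empty cell, pandas reads it as NaN and `.split` raises an error. A minimal sketch of a tolerant version, assuming an empty cell should become an empty list of calls: `split_gene_calls` and the two-row sample table are hypothetical and only mimic the layout of the real file.

```python
import io

import pandas as pd


def split_gene_calls(column):
    """Split each cell on ', or ' into a list of calls; missing cells become []."""
    return [cell.split(", or ") if isinstance(cell, str) else [] for cell in column]


# Hypothetical two-row sample in the same tab-separated layout; the second
# row has an empty D-Gene cell, which pandas reads as NaN
sample_tsv = (
    "V-Gene\tJ-Gene\tD-Gene\n"
    "IGHV3-7*03, or IGHV3-53*03\tIGHJ4*02\tIGHD4-11*01, or ORF\n"
    "IGHV1-2*02\tIGHJ6*02\t\n"
)
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
sample_V = split_gene_calls(sample["V-Gene"])
sample_D = split_gene_calls(sample["D-Gene"])
print(sample_V)
print(sample_D)
```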

In [10]:
# Test area
# Test 1: list every row whose sequence length differs from the reported length
Junction_Seq_Len_Test = [[False, i, Junction_Sequence[i], Junction_Length[i]]
                         for i in range(len(Junction_Length))
                         if len(Junction_Sequence[i]) != Junction_Length[i]]
print(Junction_Seq_Len_Test)

[]
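
The same comparison can be done on the DataFrame columns directly, which keeps the failing rows together with all of their other fields. A minimal sketch, assuming the column names used above: `junction_length_mismatches` is a hypothetical helper and the two sample rows are invented, the second with a deliberately wrong length.

```python
import pandas as pd

# Hypothetical rows; the second reports a length that does not match its sequence
sample = pd.DataFrame({
    "Junction Sequence (AA)": ["CARDYW", "CASSLGW"],
    "Junction Length (AA)": [6, 9],
})


def junction_length_mismatches(df):
    """Return the rows whose sequence length differs from the reported length."""
    observed = df["Junction Sequence (AA)"].str.len()
    return df[observed != df["Junction Length (AA)"]]


mismatches = junction_length_mismatches(sample)
print(mismatches)
```

An empty result corresponds to the empty list printed by Test 1.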


In [13]:
# Test area
# Test 2
def test_x_gene(x_gene_array):
    """Return [row, call] for every call that contains neither 'IG' nor 'TC'."""
    x_gene_test = []
    for item in x_gene_array:
        for call in item:
            if "IG" not in call and "TC" not in call:
                x_gene_test.append([item, call])
    return x_gene_test
print(test_x_gene(V_genes))
print(test_x_gene(J_genes))
print(test_x_gene(D_genes))

[]
[]
[[['IGHD4-11*01', 'ORF'], 'ORF'], [['IGHD4-11*01', 'ORF'], 'ORF']]
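
The only entries flagged above are 'ORF' tokens in the D-gene column. Assuming 'ORF' is an annotation and not a gene name (an assumption, not something the data confirms), the check can skip such tokens so that only genuinely unexpected names are reported. A minimal sketch follows; `unexpected_calls`, `KNOWN_ANNOTATIONS` and the sample rows are hypothetical.

```python
# Assumption: tokens listed here are annotations, not gene names
KNOWN_ANNOTATIONS = {"ORF"}


def unexpected_calls(gene_lists, prefixes=("IG", "TC"), annotations=KNOWN_ANNOTATIONS):
    """Return (row index, call) for calls that are neither annotations nor IG/TC names."""
    flagged = []
    for row_index, calls in enumerate(gene_lists):
        for call in calls:
            if call in annotations:
                continue  # skip tokens assumed to be annotations
            if not any(prefix in call for prefix in prefixes):
                flagged.append((row_index, call))
    return flagged


# Hypothetical D-gene rows: one with an 'ORF' token, one clean, one malformed
sample_D = [["IGHD4-11*01", "ORF"], ["IGHD3-10*01"], ["unknown"]]
print(unexpected_calls(sample_D))
```

Returning the row index instead of the whole row makes it easy to look the entry up in the parsed table.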


In [72]:
# Test 3

def test_IG_or_TC(x_gene_array):
    """Count, per row, how many calls contain 'IG' and how many contain 'TC'.

    Returns [[total IG, total TC], per-row DataFrame].
    """
    IG_vs_TC = []
    Nu_iterations = len(x_gene_array)
    for i in range(Nu_iterations):
        IG_vs_TC.append([sum("IG" in s for s in x_gene_array[i]),
                         sum("TC" in s for s in x_gene_array[i])])

    tabulate_occurrences = pd.DataFrame(IG_vs_TC,
                                        columns=["Number of 'IG' occurrences",
                                                 "Number of 'TC' occurrences"])

    IG_occurs = tabulate_occurrences["Number of 'IG' occurrences"].sum()
    TC_occurs = tabulate_occurrences["Number of 'TC' occurrences"].sum()

    return [[IG_occurs, TC_occurs], tabulate_occurrences]

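def message_test_IG_or_TC():
    """Classify the study as IG or TC from the totals in the V, J and D columns."""
    A_V, B_V = test_IG_or_TC(V_genes)
    A_J, B_J = test_IG_or_TC(J_genes)
    A_D, B_D = test_IG_or_TC(D_genes)

    if A_V[0] > A_V[1] and A_J[0] > A_J[1] and A_D[0] > A_D[1]:
        message = "IG study"
    elif A_V[0] < A_V[1] and A_J[0] < A_J[1] and A_D[0] < A_D[1]:
        message = "TC study"
    else:
        # The three columns disagree, or IG and TC counts are tied
        message = "Inconclusive: mixed IG/TC counts"
    return message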

print(message_test_IG_or_TC())
print("\n")
# V-genes
print(test_IG_or_TC(V_genes)[1])
print(test_IG_or_TC(V_genes)[0])
# J-genes
print(test_IG_or_TC(J_genes)[1])
print(test_IG_or_TC(J_genes)[0])
# D-genes
print(test_IG_or_TC(D_genes)[1])
print(test_IG_or_TC(D_genes)[0])

IG study


    Number of 'IG' occurrences  Number of 'TC' occurrences
0                            2                           0
1                            2                           0
2                            2                           0
3                            1                           0
4                            6                           0
5                            6                           0
6                            3                           0
7                            1                           0
8                            3                           0
9                           11                           0
10                           4                           0
11                           3                           0
12                           8                           0
13                          13                           0
14                          10                           0
15                           6               

In [90]:
VJD_genes[-2]

[['IGHV3-7*03',
  'IGHV3-53*03',
  'IGHV3-21*04',
  'IGHV3-71*01',
  'IGHV3-53*01',
  'IGHV3-48*02',
  'IGHV3-11*01',
  'IGHV3-11*05',
  'IGHV3-53*02',
  'IGHV3-11*03'],
 ['IGHJ4*02'],
 ['IGHD4-11*01', 'ORF']]
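
The row above illustrates what Test 3 in the plan is meant to confirm: every call starts with the same three letters ('IGH'), apart from the 'ORF' token. A minimal sketch of that per-row check, assuming the [V, J, D] layout of `VJD_genes` and assuming 'ORF' can be ignored: `locus_prefixes`, `inconsistent_rows` and the second sample row are hypothetical.

```python
def locus_prefixes(row, annotations=("ORF",)):
    """Return the set of three-letter prefixes over all V, J and D calls in one row."""
    return {call[:3] for calls in row for call in calls if call not in annotations}


def inconsistent_rows(rows):
    """Return (row index, sorted prefixes) for rows that mix more than one prefix."""
    flagged = []
    for index, row in enumerate(rows):
        prefixes = locus_prefixes(row)
        if len(prefixes) > 1:
            flagged.append((index, sorted(prefixes)))
    return flagged


# First row is shortened from the output above; second row is invented and mixes two prefixes
sample_rows = [
    [["IGHV3-7*03", "IGHV3-53*03"], ["IGHJ4*02"], ["IGHD4-11*01", "ORF"]],
    [["IGHV1-2*02"], ["IGKJ1*01"], []],
]
print(inconsistent_rows(sample_rows))
```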