<h1 align='center'>Sanity Checks on Gateway</h1>

<h4 align='center'>iReceptor $\mid$ Laura Gutierrez Funderburk $\mid$ October 14</h4>

<h4 align='center'>Supervised by Dr. Felix Breden, Dr. Jamie Scott, Dr. Brian Corrie</h4>

<h2 align='center'>Abstract</h2>

In this notebook I parse the V-gene, D-gene and J-gene entries, categorised by lab, as they appear on https://gateway.ireceptor.org, and run a set of sanity checks on them.

In [1]:
# Test 1: Check that len(Junction Sequence) equals the reported Junction Length
# Test 2: Check that every entry (after splitting alternatives on ", or ") contains 'IG' or 'TC'
# Test 3: Identify whether the file describes an 'IG' study or a 'TC' study
# Test 4: Verify entries are consistent within each row, i.e. the 3rd letter should be the 
#   same in all V-, D- and J-gene columns

In [2]:
#NOTE: Missing complete name for Gastroenterology clinic, Oslo University Hospital-Rikshospitalet
lab_names = ["Institute of Molecular and Genomic Medicine, National Health Research Institutes",
             "Von Budingen Lab",
             "The Wellcome Trust Sanger Institute",
             "Department of Pathology, Stanford University",
             "Kwong Lab",
             "Vaccine Research Centre",
             "Shemyakin-Ovchinnikov institute of Bioorganic Chemistry",
             "Ramit Mehr's Computational Immunology Lab",
             "Department of Immunology and Microbiology",
             "Department of Immunology",
             "Immunogenomics Lab",
             "Georgiou Lab",
             "Department of Medicine"]

lab_alias = ["IMGM",
             "VBL",
             "WTSI",
             "DPS",
             "KL",
             "VRC",
             "SOIBC",
             "RMCIL",
             "DIM",
             "DI",
             "IL",
             "GL",
             "DM"]

# Pair each alias with its full lab name; the two lists must line up one-to-one.
if len(lab_names) == len(lab_alias):
    lab_alias_name_dic = dict(zip(lab_alias, lab_names))
else:
    print("Check lengths of arrays and look for abnormalities, duplicates, missing names/aliases ...")
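
The length comparison catches a missing name or alias, but not a repeated alias, which would silently overwrite an earlier dictionary entry. A minimal sketch of a stricter check follows; `build_alias_map` and `example_map` are hypothetical names used only for illustration and are not part of the notebook.

```python
def build_alias_map(aliases, names):
    """Map each alias to its lab name, raising if the two lists are inconsistent."""
    if len(aliases) != len(names):
        raise ValueError(f"{len(aliases)} aliases but {len(names)} names")
    # A repeated alias would overwrite an earlier key, so report it instead.
    duplicates = {alias for alias in aliases if aliases.count(alias) > 1}
    if duplicates:
        raise ValueError(f"Duplicate aliases: {sorted(duplicates)}")
    return dict(zip(aliases, names))


example_map = build_alias_map(
    ["IMGM", "VBL"],
    ["Institute of Molecular and Genomic Medicine, National Health Research Institutes",
     "Von Budingen Lab"],
)
print(example_map["VBL"])
```

Raising an exception, rather than printing, stops the notebook before a later cell looks up a dictionary that was never built.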

In [3]:
import json
import pandas as pd

In [4]:
# Directories

CSV_Directory = "./CSV_FILES/"
IMGM_cont = CSV_Directory + "IMGM_Seq.csv"

In [5]:
IMGM_parsed = pd.read_csv(IMGM_cont,sep="\t")
V_genes = [item.split(", or ") for item in IMGM_parsed['V-Gene']]
J_genes = [item.split(", or ") for item in IMGM_parsed['J-Gene']]
D_genes = [item.split(", or ") for item in IMGM_parsed['D-Gene']]
VJD_genes = [[V_genes[i],J_genes[i],D_genes[i]] for i in range(len(V_genes))]
Junction_Sequence = [item for item in IMGM_parsed['Junction Sequence (AA)']]
Junction_Length = [item for item in IMGM_parsed['Junction Length (AA)']]
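
The splitting step assumes every cell is a string; an empty cell is read by pandas as a missing value and `item.split` would then fail. The sketch below shows one way to tolerate that, using a small in-memory tab-separated sample. The sample rows and the helper name `split_calls` are made up for illustration; only the column names and the `", or "` separator come from the cell above.

```python
import io

import pandas as pd

# Illustrative tab-separated sample; the second row has an empty D-Gene cell.
sample_tsv = (
    "V-Gene\tD-Gene\tJ-Gene\n"
    "IGHV1-2*02, or IGHV1-2*04\tIGHD4-11*01\tIGHJ4*02\n"
    "IGHV3-23*01\t\tIGHJ6*02\n"
)
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")


def split_calls(column):
    """Split each cell on ', or '; a missing (non-string) cell becomes an empty list."""
    return [cell.split(", or ") if isinstance(cell, str) else [] for cell in column]


sample_v = split_calls(sample["V-Gene"])
sample_d = split_calls(sample["D-Gene"])
print(sample_v)
print(sample_d)
```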

In [6]:
Log_messages = []
Log_messages.append("Test Sequence Metadata\n")
Log_messages.append("Lab: " + str(lab_alias_name_dic["IMGM"]) + "\n")
Log_messages.append("Perform tests on " + str(IMGM_cont) + "\n")
Log_messages.append("\n\n")
Log_messages.append("Begin Test 1: \nVerify length of Junction Sequence(AA) is correct")

# Test area
# Test 1
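# Collect [False, row index, sequence, reported length] for every row whose
# sequence length differs from the reported junction length.
Junction_Seq_Len_Test = [
    [False, i, Junction_Sequence[i], Junction_Length[i]]
    for i in range(len(Junction_Length))
    if len(Junction_Sequence[i]) != Junction_Length[i]
]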

if not Junction_Seq_Len_Test:
    Log_messages.append("------------------------------------------------>PASSED\n")
else:
    Log_messages.append(("------------------------------------------------>FAILED\n Please verify entries " + str(Junction_Seq_Len_Test) + "\n"))
    
Log_messages.append("End Test 1\n")
Log_messages.append("\n\n")
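
Test 1 can also be expressed directly on the DataFrame, which returns the offending rows with their index. This is a sketch under two assumptions of mine: the sample data are invented, and a missing sequence is treated as having length 0. `junction_length_mismatches` is a hypothetical helper name.

```python
import pandas as pd

# Invented sample: row 1 reports length 4 for a 5-residue sequence.
sample_junctions = pd.DataFrame({
    "Junction Sequence (AA)": ["CARDYW", "CAKGW", None],
    "Junction Length (AA)": [6, 4, 0],
})


def junction_length_mismatches(frame):
    """Return the rows whose sequence length differs from the reported length."""
    # Assumption: a missing sequence counts as an empty string (length 0).
    seq_len = frame["Junction Sequence (AA)"].fillna("").str.len()
    return frame[seq_len != frame["Junction Length (AA)"]]


bad_rows = junction_length_mismatches(sample_junctions)
print(bad_rows)
```

An empty result corresponds to PASSED; otherwise the index of `bad_rows` points at the rows to verify.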

In [7]:
# Test area
# test 2
def test_x_gene(x_gene_array):
    x_gene_test = []
    for item in x_gene_array:
        for i in range(len(item)):
            if "IG" not in item[i] and "TC" not in item[i]:
                x_gene_test.append([item,item[i]])
    return x_gene_test

Log_messages.append("Begin Test 2: \nVerify All entries contain either IG or TC names\n")


# V -gene
Log_messages.append("Verifying V-Gene Column ")
if not test_x_gene(V_genes):
    Log_messages.append("------------------------------------------------>PASSED\n")
else:
    Log_messages.append("------------------------------------------------>FAILED\n Please verify entries " + str(test_x_gene(V_genes)) + "\n")

# J-gene
Log_messages.append("Verifying J-Gene Column ")
if not test_x_gene(J_genes):
    Log_messages.append("------------------------------------------------>PASSED\n")
else:
    Log_messages.append("------------------------------------------------>FAILED\n Please verify entries " + str(test_x_gene(J_genes)) + "\n")
    
# D-gene
Log_messages.append("Verifying D-Gene Column ")
if not test_x_gene(D_genes):
    Log_messages.append("------------------------------------------------>PASSED\n")
else:
    Log_messages.append("------------------------------------------------>FAILED \nPlease verify entries\n " + str(test_x_gene(D_genes)) + "\n")
Log_messages.append("End Test 2\n")   
Log_messages.append("\n\n")
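
Test 2 uses a substring check, so a name containing "IG" or "TC" anywhere would pass. A stricter variant, sketched below, requires the name to start with one of the prefixes and reports the row index of each offending entry. The function name `unexpected_gene_calls` and the sample lists are illustrative assumptions; whether a prefix check is the right rule for this data is a judgment call.

```python
def unexpected_gene_calls(gene_lists, prefixes=("IG", "TC")):
    """Return (row index, entry) for every entry not starting with a known prefix."""
    flagged = []
    for row_index, calls in enumerate(gene_lists):
        for call in calls:
            # str.startswith accepts a tuple of alternatives.
            if not call.startswith(prefixes):
                flagged.append((row_index, call))
    return flagged


sample_calls = [["IGHD4-11*01", "ORF"], ["IGHD3-10*01"]]
flagged_calls = unexpected_gene_calls(sample_calls)
print(flagged_calls)  # [(0, 'ORF')]
```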

In [8]:
# Test 3

def test_IG_or_TC(x_gene_array):

    IG_vs_TC = []
    Nu_iterations = len(x_gene_array)
    for i in range(Nu_iterations):
        IG_vs_TC.append([sum("IG" in s for s in x_gene_array[i]),\
                         sum("TC" in s for s in x_gene_array[i])])
        
    tabulate_occurrences = pd.DataFrame(IG_vs_TC,columns=["Number of 'IG' occurrences"\
                               ,"Number of 'TC' occurrences"])
    
    IG_occurs = tabulate_occurrences["Number of 'IG' occurrences"].sum()
    TC_occurs = tabulate_occurrences["Number of 'TC' occurrences"].sum()
    
    return [[IG_occurs,TC_occurs],tabulate_occurrences]

def message_test_IG_or_TC():
    
    A_V,B_V = test_IG_or_TC(V_genes)
    A_J,B_J = test_IG_or_TC(J_genes)
    A_D,B_D = test_IG_or_TC(D_genes)
    
    
    if A_V[1].sum()==0 and A_J[1].sum()==0 and A_D[1].sum()==0:
        message="IG study"
    elif A_V[0].sum()==0 and A_J[0].sum()==0 and A_D[0].sum()==0:
        message="TC study"
    else:
        message="Number of 'IG' and 'TC' occurrences is not uniform across this study. Verify for errors in entries or whether these are two or more studies."
    return message

message_IC_TC = message_test_IG_or_TC()

Log_messages.append("Begin Test 3\nIdentify whether this is an 'IG' or 'TC' study\n")
Log_messages.append("Total number of 'IG' occurrences in V-gene: " + str(test_IG_or_TC(V_genes)[0][0]) + "\n")
Log_messages.append("Total number of 'IG' occurrences in J-gene: " + str(test_IG_or_TC(J_genes)[0][0]) + "\n")
Log_messages.append("Total number of 'IG' occurrences in D-gene: " + str(test_IG_or_TC(D_genes)[0][0]) + "\n")
Log_messages.append("Total number of 'TC' occurrences in V-gene: " + str(test_IG_or_TC(V_genes)[0][1]) + "\n")
Log_messages.append("Total number of 'TC' occurrences in J-gene: " + str(test_IG_or_TC(J_genes)[0][1]) + "\n")
Log_messages.append("Total number of 'TC' occurrences in D-gene: " + str(test_IG_or_TC(D_genes)[0][1]) + "\n")
Log_messages.append("\n")
Log_messages.append("------------------------------------------------>" + str(message_IC_TC) + "\n")
Log_messages.append("Breakdown of occurrences\n")
Log_messages.append("V-gene\n" + str(test_IG_or_TC(V_genes)[1]) + "\n")
Log_messages.append("J-gene\n" + str(test_IG_or_TC(J_genes)[1])+ "\n")
Log_messages.append("D-gene\n" + str(test_IG_or_TC(D_genes)[1])+ "\n")

Log_messages.append("End Test 3\n")
Log_messages.append("\n""\n")
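
The cell above calls `test_IG_or_TC` once per logged line, so each column is recounted several times. A compact sketch that counts each column once and classifies from the totals is given below. `count_ig_tc`, `classify_study` and the sample data are hypothetical. Unlike the cell above, this version reports a file with no 'IG' and no 'TC' entries as "mixed or empty" rather than as an IG study, which is my assumption about the desired behaviour.

```python
def count_ig_tc(gene_lists):
    """Return (IG count, TC count) over every entry in one column."""
    calls = [call for row in gene_lists for call in row]
    return sum("IG" in c for c in calls), sum("TC" in c for c in calls)


def classify_study(columns):
    """Classify a study from a dict mapping column name to its list of entry lists."""
    counts = {name: count_ig_tc(lists) for name, lists in columns.items()}
    ig_total = sum(ig for ig, _ in counts.values())
    tc_total = sum(tc for _, tc in counts.values())
    if ig_total and not tc_total:
        label = "IG study"
    elif tc_total and not ig_total:
        label = "TC study"
    else:
        label = "mixed or empty"
    return label, counts


# Invented sample columns, already split on ", or ".
sample_columns = {
    "V-gene": [["IGHV1-2*02", "IGHV1-2*04"], ["IGHV3-23*01"]],
    "D-gene": [["IGHD4-11*01", "ORF"], []],
    "J-gene": [["IGHJ4*02"], ["IGHJ6*02"]],
}
study_label, study_counts = classify_study(sample_columns)
print(study_label, study_counts)
```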

In [9]:
def check_3_entry_gene(x_gene_array):
    third_char = []
    for item in x_gene_array:
        for i in range(len(item)):
            if "IG" not in item[i] and "TC" not in item[i]:
                continue
            else:
                third_char.append(item[i][2])
    test = set(third_char)
    return test

v_3 = check_3_entry_gene(V_genes)
d_3 = check_3_entry_gene(D_genes)
j_3 = check_3_entry_gene(J_genes)

Log_messages.append("Begin Test 4\nVerify that the 3rd letter is the same across all entries")

if len(v_3)==1 and len(v_3)==len(d_3) and len(d_3)==len(j_3):
    if v_3==d_3 and d_3==j_3:
        Log_messages.append("------------------------------------------------>PASSED\n")
        Log_messages.append("Third letter is " + str(v_3) + "\n")
    else:
        Log_messages.append("------------------------------------------------>FAILED\n")
        Log_messages.append("Found Letters on V-gene " + str(v_3) + " Found Letters on D-gene " + str(d_3) + \
                           " Found Letters on J-gene " + str(j_3) + "\n")
else:
    Log_messages.append("------------------------------------------------>FAILED\n")
    Log_messages.append("Found Letters on V-gene " + str(v_3) + " Found Letters on D-gene " + str(d_3) + \
                           " Found Letters on J-gene " + str(j_3) + "\n")
Log_messages.append("End Test 4\n")
Log_messages.append("\n""\n")
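
Test 4 pools the third letters over whole columns, so it reports which letters occur but not which rows disagree. A per-row variant could be sketched as follows; `third_letters`, `inconsistent_rows` and the sample rows are hypothetical, and entries that do not start with "IG" or "TC" are skipped, mirroring the `continue` branch above.

```python
def third_letters(calls, prefixes=("IG", "TC")):
    """Collect the third character of every recognised entry in one cell."""
    return {call[2] for call in calls if call.startswith(prefixes) and len(call) > 2}


def inconsistent_rows(v_lists, d_lists, j_lists):
    """Return (row index, sorted letters) for rows whose V, D and J letters differ."""
    flagged = []
    for i, (v, d, j) in enumerate(zip(v_lists, d_lists, j_lists)):
        letters = third_letters(v) | third_letters(d) | third_letters(j)
        if len(letters) > 1:
            flagged.append((i, sorted(letters)))
    return flagged


# Invented sample: row 1 mixes a K letter (V, J) with an H letter (D).
sample_v_rows = [["IGHV1-2*02"], ["IGKV1-5*03"]]
sample_d_rows = [["IGHD4-11*01", "ORF"], ["IGHD3-10*01"]]
sample_j_rows = [["IGHJ4*02"], ["IGKJ1*01"]]
row_mismatches = inconsistent_rows(sample_v_rows, sample_d_rows, sample_j_rows)
print(row_mismatches)  # [(1, ['H', 'K'])]
```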

In [10]:
for item in Log_messages:
    print(item)

Test Sequence Metadata

Lab: Institute of Molecular and Genomic Medicine, National Health Research Institutes

Perform tests on ./CSV_FILES/IMGM_Seq.csv




Begin Test 1: 
Verify length of Junction Sequence(AA) is correct
------------------------------------------------>PASSED

End Test 1




Begin Test 2: 
Verify All entries contain either IG or TC names

Verifying V-Gene Column 
------------------------------------------------>PASSED

Verifying J-Gene Column 
------------------------------------------------>PASSED

Verifying D-Gene Column 
------------------------------------------------>FAILED 
Please verify entries
 [[['IGHD4-11*01', 'ORF'], 'ORF'], [['IGHD4-11*01', 'ORF'], 'ORF']]

End Test 2




Begin Test 3
Identify whether this is an 'IG' or 'TC' study

Total number of 'IG' occurrences in V-gene: 156

Total number of 'IG' occurrences in J-gene: 26

Total number of 'IG' occurrences in D-gene: 25

Total number of 'TC' occurrences in V-gene: 0

Total number of 'TC' occurrences in 

In [11]:
# The with-block closes the file on exit, so no explicit f.close() is needed.
with open(CSV_Directory + "IMGM_Log.txt",'w') as f:
    for item in Log_messages:
        f.write(item)