<a href="https://colab.research.google.com/github/rahultheogre/datasets/blob/main/HLA_Matching_PHASE_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np

In [3]:
donor_df = pd.read_csv('https://raw.githubusercontent.com/rahultheogre/datasets/main/AH-RR/Donor%20HLA%20Typing.csv')

In [53]:
donor_df

Unnamed: 0,A,B,C,DRB1,DQB1,DPB1
0,32:01:01,35:03:01,06:02:01,11:01:01,03:01:01,02:01:02
1,68:01:02,57:01:01,12:03:01,13:01:01,06:03:01,04:01:01


In [4]:
donorHLAset = set()  # empty set that will hold the donor HLA values
# A set suits this job because membership tests on it are fast.
# Python sets are hash-based (comparable to Java's HashSet), so values can be compared and looked up quickly.
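
As a small illustration of why a set suits this lookup, here is a minimal sketch. The allele names are taken from the donor table; the names `donor_alleles` and `candidates` are hypothetical and not part of the notebook.

```python
donor_alleles = {'A*32:01', 'B*35:03', 'C*06:02'}

# Membership tests on a set use hashing, so each lookup takes roughly
# constant time on average regardless of how many alleles are stored.
candidates = ['C*06:02', 'B*48:01']
matches = [allele for allele in candidates if allele in donor_alleles]
print(matches)  # ['C*06:02']
```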

In [7]:
#storing column names in a list to create 'HLA types'
column_names = donor_df.columns.values.tolist()
column_names

['A', 'B', 'C', 'DRB1', 'DQB1', 'DPB1']

##### ASSUMPTION 1
- We assume that only the first five characters of each string in the DONOR TABLE are used.
- For example, if the value in column A is '32:01:01', we extract only '32:01' and append it to the base gene name A, giving 'A*32:01'.
- This assumption matches the depth of the recipient SAB values. For the limited case provided in the first phase of the project, we take it as given.
- If an HLA value of greater depth is needed, we will generalize the code accordingly.
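
The five-character rule described above would break if a field ever had more than two digits, or if a deeper typing were needed. A minimal sketch of one possible generalization follows, assuming a hypothetical helper `format_hla_fields` (not part of the notebook) that splits on ':' and keeps a configurable number of fields:

```python
def format_hla_fields(data, column, n_fields=2):
    """Build an HLA name such as 'A*32:01' from a locus and a typing string.

    Hypothetical helper for illustration: keeps the first `n_fields`
    colon-separated fields instead of a fixed five-character slice.
    """
    fields = data.strip().split(':')
    return column + '*' + ':'.join(fields[:n_fields])


# Two fields reproduce the five-character behaviour; three keep the full depth.
print(format_hla_fields('32:01:01', 'A'))              # A*32:01
print(format_hla_fields('32:01:01', 'A', n_fields=3))  # A*32:01:01
```

With `n_fields=2` this should give the same names as the five-character slice for the donor values shown earlier.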

In [8]:
#function to create an HLA name out of the base gene and its associated allele value
#It depends on the assumption made earlier (first five characters only).

def formatHLA(data,column):
    return column + '*' + data[0:5] 

In [9]:
#filling the set of donor HLA values

for column in column_names:
    for data in donor_df[column]:
        donorHLAset.add(formatHLA(data,column))
donorHLAset

{'A*32:01',
 'A*68:01',
 'B*35:03',
 'B*57:01',
 'C*06:02',
 'C*12:03',
 'DPB1*02:01',
 'DPB1*04:01',
 'DQB1*03:01',
 'DQB1*06:03',
 'DRB1*11:01',
 'DRB1*13:01'}

##### ASSUMPTION 2
- In PHASE 1 of the project, we assume that the documents provided to us are in CSV format. In subsequent phases,
  - we will deal with readable PDF files
  - we will deal with flattened PDF files
  - we will build an API that takes in a file and outputs the data
  - we will build a database of recipient HLA values, and an app/web app that takes in a donor's HLA values and outputs the names of potential negatively cross-matched recipients

In [5]:
recipient = pd.read_csv('https://raw.githubusercontent.com/rahultheogre/datasets/main/AH-RR/Recipient%20Class%201%20SAB.csv',header=None)

In [6]:
recipient

Unnamed: 0,0,1
0,Antibodies detected against HLA Class I antige...,
1,Allele Specificity,MFI
2,B*48:01,11421
3,C*03:03,9079
4,B*52:01,8983
5,C*03:04,7360
6,,
7,Antibodies detected against HLA Class I antige...,
8,Allele Specificity,MFI
9,B*73:01,889


In [59]:
#First we remove all the rows where the MFI for the corresponding Allele Specificity is 'Not Detected'.
# Dropping these rows matters most for the SAB - 2 values, where they form a major part of the data.
# Rows holding NaN or the repeated 'MFI' header stay in the frame; they never match a donor allele in the lookup further down.

In [10]:
recipient = recipient[recipient[1] != 'Not Detected']
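
The filter above only drops 'Not Detected' rows; blank (NaN) rows and the repeated 'Allele Specificity'/'MFI' header rows remain. A minimal sketch of a fuller cleanup follows, assuming a hypothetical helper `clean_sab_rows` that works on plain (allele, MFI) pairs rather than on the DataFrame, and assuming MFI values arrive as whole-number strings. The 'Not Detected' row in the sample is illustrative; the other rows are copied from the table shown earlier.

```python
import math


def clean_sab_rows(rows):
    """Keep only (allele, MFI) pairs whose MFI is a usable whole number.

    Hypothetical helper for illustration: skips blank/NaN cells, the
    repeated 'MFI' header rows, and 'Not Detected' entries.
    """
    cleaned = []
    for allele, mfi in rows:
        # Blank cells may show up as None or float('nan') depending on the reader.
        if allele is None or mfi is None:
            continue
        if isinstance(mfi, float) and math.isnan(mfi):
            continue
        text = str(mfi).strip()
        if not text.isdigit():  # drops 'MFI', 'Not Detected', empty strings
            continue
        cleaned.append((allele, int(text)))
    return cleaned


sample = [
    ('Allele Specificity', 'MFI'),
    ('B*48:01', '11421'),
    ('C*03:03', '9079'),
    (float('nan'), float('nan')),
    ('A*01:01', 'Not Detected'),  # illustrative row, not from the source data
]
print(clean_sab_rows(sample))  # [('B*48:01', 11421), ('C*03:03', 9079)]
```

With the notebook's DataFrame, the pairs could be supplied as `zip(recipient[0], recipient[1])`.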

In [11]:
result_list = []

In [12]:
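# Iterate over the rows directly: after filtering, the index labels may no
# longer be contiguous, so label lookups driven by range() can raise KeyError.
for allele_specificity, MFI in zip(recipient[0], recipient[1]):
    if allele_specificity in donorHLAset:
        result_list.append((allele_specificity, MFI))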

In [13]:
print("The undesirable Donor Specific Antibodies and corresponding MFI values are: ")
for pair in result_list:
    print(pair)

The undesirable Donor Specific Antibodies and corresponding MFI values are: 
('C*06:02', '144')
('A*32:01', '91')
('A*68:01', '39')
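
The MFI values in the output above are strings, so ordering them directly would compare characters rather than magnitudes. A minimal sketch of ranking the matches by numeric MFI, reusing the three results shown above (the name `ranked` is hypothetical):

```python
result_list = [('C*06:02', '144'), ('A*32:01', '91'), ('A*68:01', '39')]

# Convert MFI to int before ordering; sorting the raw strings would
# compare them character by character ('91' > '144').
ranked = sorted(
    ((allele, int(mfi)) for allele, mfi in result_list),
    key=lambda pair: pair[1],
    reverse=True,
)
for allele, mfi in ranked:
    print(f"{allele}\t{mfi}")
```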
