
### About Dataset

##### Source
https://www.kaggle.com/datasets/gustavomodelli/waitlist-kidney-brazil

##### Context
Predicting the waiting time for a deceased donor kidney transplant can help patients and clinicians discuss management and can contribute to a more efficient use of resources.

##### Content
A model was developed with this data and published in PLoS One. We share the data in the expectation that improvements to the model could help physicians and patients.

##### Reference: 
Sapiertein Silva JF, Ferreira GF, Perosa M, Nga HS, de Andrade LGM. A machine learning prediction model for waiting time to kidney transplant. PLoS One. 2021 May 20;16(5):e0252069. doi: 10.1371/journal.pone.0252069. PMID: 34015020. (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252069)

##### Acknowledgements
We would like to thank the Secretaria de Saúde do Estado de São Paulo for providing the data.

##### Inspiration
Predict deceased donor transplantation using the available predictors.

##### Allocation criteria
Allocation was performed as established by the National Transplantation System of the Brazilian Ministry of Health. For deceased donor transplants, allocation criteria are based on HLA matching (highest number of points for HLA DR, followed by HLA B and HLA A), recipient's age (<18 years), date of registration on the waiting list, and panel reactive antibody (PRA). A point score system based on blood group and HLA match is used as follows:

- DR: 0 MM = 10 points; 1 MM = 5 points; 2 MM = 0 points
- B: 0 MM = 4 points; 1 MM = 2 points; 2 MM = 0 points
- A: 0 MM = 1 point; 1 MM = 0.5 points; 2 MM = 0 points

Waiting time, allosensitization (cPRA >50), diabetes mellitus, and age < 18 years served as tiebreakers.
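
The HLA part of the point system above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the names `HLA_POINTS` and `hla_match_score` are hypothetical, and the sketch deliberately ignores blood group, age, registration date, PRA and the tiebreakers.

```python
# Points awarded per locus for 0, 1 or 2 HLA mismatches (MM), as listed above.
HLA_POINTS = {
    "DR": {0: 10.0, 1: 5.0, 2: 0.0},
    "B": {0: 4.0, 1: 2.0, 2: 0.0},
    "A": {0: 1.0, 1: 0.5, 2: 0.0},
}


def hla_match_score(mm_dr, mm_b, mm_a):
    """Return the total HLA points for the given mismatch counts (0, 1 or 2 each)."""
    mismatches = {"DR": mm_dr, "B": mm_b, "A": mm_a}
    for locus, mm in mismatches.items():
        if mm not in (0, 1, 2):
            raise ValueError(f"{locus} mismatches must be 0, 1 or 2, got {mm}")
    return sum(HLA_POINTS[locus][mm] for locus, mm in mismatches.items())


# A full match collects the maximum of 10 + 4 + 1 points.
full_match = hla_match_score(0, 0, 0)

# One DR mismatch, no B mismatch, two A mismatches: 5 + 4 + 0 points.
partial_match = hla_match_score(1, 0, 2)
```

Because DR carries the largest weight, a single DR mismatch costs more points than two mismatches at the A locus.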

## 1. EXPLORATORY DATA ANALYSIS

1.   Load and audit the data
2.   Data preparation and transformation
     a. Impute the missing values
     b. Outliers or extreme values
     c. Inconsistent values
3.   Data Visualization
4.   Data Analysis
     a. Univariate analysis (measures of central tendency, measures of dispersion)
     b. Bivariate analysis (correlation, chi-square test)
     c. Multivariate analysis (if needed)

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression


In [11]:
data = pd.read_csv('https://raw.githubusercontent.com/rahultheogre/IPBABYOP/main/waitlist.csv', encoding='latin-1')

In [18]:
pd.set_option('display.max_columns', None)
data.sample(15)

Unnamed: 0,Id,date,age_at_list_registration,age_cat,time_on_Dialysis,race,sex,underline_disease,diabetes,Blood_type,number_transfusion,gestation,number_gestation,prior_transplant,number_prior_transplant,subregion,cPRA,cPRA_cat,HLA_A1,HLA_A2,HLA_B1,HLA_B2,HLA_DR1,HLA_DR2,DR_00,B_00,A_00,calculated_frequency_DR.f1,calculated_frequency_DR.f2,calculated_frequency_DR.f,calculated_frequency_B.f1,calculated_frequency_B.f2,calculated_frequency_B.f,calculated_frequency_A.f1,calculated_frequency_A.f2,calculated_frequency_A.f,chagas,anti.HBc,anti.HCV,agHBs,patient_still_on_list,date_acutal,death,Time_death,Transplant,Transplant_Y_N,X36MthsTx,Time_Tx,priorization,removed_list,razon_removed,time,event
36303,43599,2010-07-22,59,18.a.60,53.0,Branca,M,HAS,1,O,1,Não,,Não,0,HCFMUSP,42,Entre_0_50,11,32,35,44,1,3,heterozigoto,heterozigoto,heterozigoto,0.19,0.18,3.42,0.21,0.2,4.2,0.1,0.06,0.6,Não,Não,Não,Não,Não,43307,Sim,12,Óbito Lista,Não,0,11.87,Não,Não,,356,2
14920,19749,2009-09-14,58,18.a.60,21.0,Branca,M,HAS,1,O,1,Não,,Não,0,UNICAMP,0,Zero,2,23,35,58,8,13,heterozigoto,heterozigoto,heterozigoto,0.12,0.25,3.0,0.21,0.07,1.47,0.42,0.11,4.62,Não,Não,Não,Não,Não,43307,Não,108,Não,Não,0,12.17,Não,Sim,Removido (suspenso > 365 dias),365,3
14271,18881,2000-02-29,67,Maior.60,3.0,Branca,M,Diabetes,0,B,1,Não,,Não,0,FUNDERP,0,Zero,3,24,0,18,7,11,heterozigoto,homozigoto,heterozigoto,0.22,0.23,5.06,0.0,0.09,0.0,0.18,0.16,2.88,,Não,,,Não,43307,Sim,51,Óbito Lista,Não,0,51.1,Não,Não,,1533,2
28260,34641,2014-03-19,40,18.a.60,44.0,Branca,M,Outras,1,O,1,Não,,Não,0,HCFMUSP,33,Entre_0_50,2,66,7,41,13,15,heterozigoto,heterozigoto,heterozigoto,0.25,0.21,5.25,0.14,0.02,0.28,0.42,0.02,0.84,Não,Não,Não,Não,Sim,43307,Não,53,Não,Não,0,53.0,Não,Não,,1590,0
6417,8453,2015-06-16,63,Maior.60,49.0,Branca,M,HAS,1,O,1,Não,,Não,0,UNIFESP,2,Entre_0_50,2,0,35,0,3,15,heterozigoto,homozigoto,homozigoto,0.18,0.21,3.78,0.21,0.0,0.0,0.42,0.0,0.0,Não,Não,Não,Não,Não,43307,Não,38,Não,Não,0,37.3,Não,Sim,Removido (suspenso > 365 dias),1119,3
11402,15069,2001-01-22,56,18.a.60,20.0,Branca,M,Diabetes,0,A,0,Não,,Não,0,UNIFESP,0,Zero,3,33,14,58,11,15,heterozigoto,heterozigoto,heterozigoto,0.23,0.21,4.83,0.1,0.07,0.7,0.18,0.07,1.26,,Não,,,Não,43307,Sim,24,Óbito Lista,Não,0,24.07,Não,Não,,722,2
33900,40888,2007-05-30,57,18.a.60,2.0,Branca,F,Diabetes,0,A,0,Não,0.0,Não,0,UNIFESP,0,Zero,2,24,13,40,4,16,heterozigoto,heterozigoto,heterozigoto,0.23,0.07,1.61,0.04,0.09,0.36,0.42,0.16,6.72,Não,Não,Não,Não,Não,43307,Sim,72,Óbito Lista,Não,0,72.4,Não,Não,,2172,2
32510,39356,2006-04-17,56,18.a.60,8.0,Negra,F,Outras,1,A,0,Sim,2.0,Não,0,UNIFESP,0,Zero,2,3,14,35,11,13,heterozigoto,heterozigoto,heterozigoto,0.23,0.25,5.75,0.1,0.21,2.1,0.42,0.18,7.56,Não,Não,Não,Não,Não,43307,Não,149,Sim,Sim,0,99.57,Não,Não,,2987,1
11343,14989,2016-07-13,36,18.a.60,12.0,Branca,M,Diabetes,0,A,0,Não,,Não,0,FUNDERP,0,Zero,2,11,35,49,1,11,heterozigoto,heterozigoto,heterozigoto,0.19,0.23,4.37,0.21,0.05,1.05,0.42,0.1,4.2,Não,Não,Não,Não,Não,43307,Não,25,Sim,Sim,1,13.1,Não,Não,,393,1
17340,22470,2011-11-30,43,18.a.60,9.0,Branca,M,Outras,1,A,1,Não,,Não,0,UNIFESP,0,Zero,2,0,7,15,3,11,heterozigoto,heterozigoto,homozigoto,0.18,0.23,4.14,0.14,0.19,2.66,0.42,0.0,0.0,Não,Não,Não,Não,Não,43307,Sim,56,Óbito Lista,Não,0,56.23,Não,Não,,1687,2


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48153 entries, 0 to 48152
Data columns (total 53 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Id                          48153 non-null  int64  
 1   date                        48153 non-null  object 
 2   age_at_list_registration    48153 non-null  int64  
 3   age_cat                     48153 non-null  object 
 4   time_on_Dialysis            46817 non-null  float64
 5   race                        48153 non-null  object 
 6   sex                         48153 non-null  object 
 7   underline_disease           48153 non-null  object 
 8   diabetes                    48153 non-null  int64  
 9   Blood_type                  48153 non-null  object 
 10  number_transfusion          48153 non-null  int64  
 11  gestation                   48153 non-null  object 
 12  number_gestation            19464 non-null  float64
 13  prior_transplant            481

#### Preliminary Observations about the data:
- 12 float, 17 int, and 24 object variables
- 48153 rows and 53 features

In [14]:
# getting percentage of missing values per feature

data.isna().sum()/len(data)*100
# remove features with missing value percentage greater than 45
# create a flag (look at features / check )

Id                             0.000000
date                           0.000000
age_at_list_registration       0.000000
age_cat                        0.000000
time_on_Dialysis               2.774490
race                           0.000000
sex                            0.000000
underline_disease              0.000000
diabetes                       0.000000
Blood_type                     0.000000
number_transfusion             0.000000
gestation                      0.000000
number_gestation              59.578842
prior_transplant               0.000000
number_prior_transplant        0.000000
subregion                      0.000000
cPRA                           0.000000
cPRA_cat                       0.000000
HLA_A1                         0.000000
HLA_A2                         0.000000
HLA_B1                         0.000000
HLA_B2                         0.000000
HLA_DR1                        0.000000
HLA_DR2                        0.000000
DR_00                          0.000000


#### Preliminary Observations about the data:
- time_on_Dialysis, number_gestation, chagas, agHBs, anti.HCV and razon_removed contain missing values. These 6 features may need missing-value imputation, which we will decide on after understanding the nature of each feature.
- Most feature names follow basic conventions and are easy to manipulate, so we keep them as they are unless a need arises later, for example during the feature engineering phase.
- The exception is names containing a dot, such as anti.HBc, anti.HCV and calculated_frequency_DR.f1, which can be problematic. We would prefer to replace '.' with '_'.
- 'razon_removed' has 68% of its values missing and 'number_gestation' about 60%. We may have to delete these columns, but first we will check the business use of each feature.
- We will impute the remaining features with techniques appropriate to their data types.
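
The renaming and missing-value screening described above could look roughly like the sketch below. It runs on a small made-up frame (`toy`) rather than on `data`, and the names `THRESHOLD`, `high_missing` and the `_missing` flag suffix are illustrative assumptions; the 45% cutoff is the one mentioned in the code comments of the previous cell.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "anti.HBc": ["Não", "Sim", "Não", "Não"],
    "calculated_frequency_DR.f1": [0.19, 0.12, 0.22, 0.25],
    "number_gestation": [np.nan, np.nan, 2.0, np.nan],
    "time_on_Dialysis": [53.0, np.nan, 3.0, 44.0],
})

# Replace the dot in feature names with an underscore (literal match, not a regex).
toy.columns = toy.columns.str.replace(".", "_", regex=False)

# Percentage of missing values per feature.
missing_pct = toy.isna().mean() * 100

# Features whose missing percentage exceeds the cutoff.
THRESHOLD = 45
high_missing = missing_pct[missing_pct > THRESHOLD].index.tolist()

# Keep a 0/1 missing-value flag before deciding whether to drop or impute.
for col in high_missing:
    toy[f"{col}_missing"] = toy[col].isna().astype(int)
```

Flagging before dropping keeps the information that a value was absent, which matters for a feature such as number_gestation where missingness may itself be meaningful.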

In [15]:
#basic data description of object class features.

data.describe(include = object)

Unnamed: 0,date,age_cat,race,sex,underline_disease,Blood_type,gestation,prior_transplant,subregion,cPRA_cat,DR_00,B_00,A_00,chagas,anti.HBc,anti.HCV,agHBs,patient_still_on_list,death,Transplant,Transplant_Y_N,priorization,removed_list,razon_removed
count,48153,48153,48153,48153,48153,48153,48153,48153,48153,48153,48153,48153,48153,44141,48153,44141,44141,48153,48153,48153,48153,48153,48153,15295
unique,4108,3,4,2,5,4,2,2,4,4,2,2,2,2,2,2,2,2,2,3,2,1,2,10
top,2016-06-20,18.a.60,Branca,M,Outras,O,Não,Não,UNIFESP,Zero,heterozigoto,heterozigoto,heterozigoto,Não,Não,Não,Não,Não,Não,Não,Não,Não,Não,Removido (suspenso > 365 dias)
freq,94,34752,32455,28684,16626,23640,36122,41983,23166,34205,43268,45098,43171,44069,47380,43595,44022,37250,36592,25289,34421,48153,32858,12985


#### Preliminary Observations about the categorical features:
- 'date' should be converted to a datetime object.
- Portuguese terms need to be replaced with 0 and 1. We will set Não = 0 and Sim = 1.
- age_cat and race need to be one-hot encoded.
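
A minimal sketch of these three transformations follows, using a small made-up frame (`toy`) in place of `data`. The column choice and values are assumptions for illustration; the same calls would apply to the other Sim/Não and multi-category columns.

```python
import pandas as pd

# Toy frame standing in for `data`; values are made up but follow the sample above.
toy = pd.DataFrame({
    "date": ["2010-07-22", "2009-09-14", "2006-04-17"],
    "age_cat": ["18.a.60", "18.a.60", "Maior.60"],
    "race": ["Branca", "Branca", "Negra"],
    "gestation": ["Não", "Não", "Sim"],
})

# Convert the registration date from string to datetime.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Map the Portuguese yes/no labels to 1/0.
toy["gestation"] = toy["gestation"].map({"Não": 0, "Sim": 1})

# One-hot encode the multi-category features; dtype=int yields 0/1 columns.
toy = pd.get_dummies(toy, columns=["age_cat", "race"], dtype=int)
```

Note that `map` returns NaN for any label missing from the dictionary, so columns that already contain NaN (for example chagas) keep their missing values after the mapping.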

In [17]:
#getting a hands on idea about the attributes
pd.set_option('display.max_columns', None)
data.describe()

Unnamed: 0,Id,age_at_list_registration,time_on_Dialysis,diabetes,number_transfusion,number_gestation,number_prior_transplant,cPRA,HLA_A1,HLA_A2,HLA_B1,HLA_B2,HLA_DR1,HLA_DR2,calculated_frequency_DR.f1,calculated_frequency_DR.f2,calculated_frequency_DR.f,calculated_frequency_B.f1,calculated_frequency_B.f2,calculated_frequency_B.f,calculated_frequency_A.f1,calculated_frequency_A.f2,calculated_frequency_A.f,date_acutal,Time_death,X36MthsTx,Time_Tx,time,event
count,48153.0,48153.0,46817.0,48153.0,48153.0,19464.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0
mean,29486.740515,48.613399,21.054254,0.792599,0.412518,2.040793,0.146325,14.402093,10.359168,28.816211,25.750836,41.148942,5.986688,10.298901,0.187804,0.170929,3.167426,0.122476,0.098659,1.18259,0.235889,0.118824,2.697248,43307.0,75.726372,0.205117,35.527375,1090.57635,1.558864
std,16186.744194,14.707031,29.212685,0.40545,0.63555,2.413947,0.409814,29.119877,13.485239,21.826129,15.865324,16.379271,4.211553,4.743869,0.05961,0.084097,1.917147,0.064572,0.06869,1.096077,0.145272,0.09899,2.78976,0.0,57.602143,0.403791,30.977942,955.582713,1.138825
min,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43307.0,0.0,0.0,0.0,0.0,0.0
25%,15902.0,39.0,5.0,1.0,0.0,0.0,0.0,0.0,2.0,11.0,14.0,35.0,3.0,7.0,0.18,0.07,1.33,0.08,0.04,0.36,0.13,0.06,0.78,43307.0,27.0,0.0,13.67,414.0,1.0
50%,29921.0,50.0,10.0,1.0,0.0,2.0,0.0,0.0,2.0,29.0,18.0,44.0,4.0,11.0,0.21,0.21,3.96,0.1,0.07,0.84,0.17,0.11,1.76,43307.0,61.0,0.0,26.3,800.0,1.0
75%,43387.0,59.0,23.0,1.0,1.0,3.0,0.0,9.0,23.0,33.0,40.0,51.0,9.0,14.0,0.23,0.23,4.83,0.19,0.14,1.71,0.42,0.13,4.2,43307.0,112.0,0.0,47.67,1470.0,3.0
max,56937.0,97.0,180.0,1.0,3.0,63.0,5.0,100.0,80.0,80.0,82.0,82.0,16.0,16.0,0.25,0.25,6.25,0.21,0.21,4.41,0.42,0.42,17.64,43307.0,226.0,1.0,225.87,6776.0,3.0


In [None]:
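# absolute skewness of the numerical features, to support the observations below
# numeric_only=True restricts the calculation to numeric columns; recent pandas versions raise an error on object columns otherwise
data.skew(numeric_only=True).abs()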


Id                            0.083882
age_at_list_registration      0.482444
time_on_Dialysis              2.980401
diabetes                      1.443389
number_transfusion            1.653587
number_gestation              2.701250
number_prior_transplant       3.229051
cPRA                          1.932519
HLA_A1                        1.754463
HLA_A2                        0.663204
HLA_B1                        0.251808
HLA_B2                        0.894229
HLA_DR1                       0.518564
HLA_DR2                       0.851911
calculated_frequency_DR.f1    1.657124
calculated_frequency_DR.f2    0.859342
calculated_frequency_DR.f     0.318115
calculated_frequency_B.f1     0.035232
calculated_frequency_B.f2     0.427729
calculated_frequency_B.f      1.224001
calculated_frequency_A.f1     0.300759
calculated_frequency_A.f2     1.832411
calculated_frequency_A.f      2.026850
date_acutal                   0.000000
Time_death                    0.783563
X36MthsTx                

#### Preliminary Observations about the numerical features: 
- Most features show only mild skewness. We will treat an absolute skewness above 1 as major, keeping the nature of each feature in mind. By that threshold, time_on_Dialysis (2.98), number_gestation (2.70) and number_prior_transplant (3.23) stand out.
- 'Id' is a primary key and carries no predictive information, so we will remove it.
- 'age_at_list_registration' ranges from 0 to 97 with a mean of 48.6 and a median of 50, so its distribution is close to normal.
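
The skewness threshold mentioned above can be applied programmatically. The sketch below uses a small made-up frame (`toy`); the names `SKEW_THRESHOLD` and `highly_skewed` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "age_at_list_registration": [39, 50, 59, 48, 52, 45],
    "time_on_Dialysis": [5.0, 10.0, 23.0, 3.0, 180.0, 8.0],
    "sex": ["M", "F", "M", "M", "F", "M"],
})

# Absolute skewness of the numeric columns only.
abs_skew = toy.skew(numeric_only=True).abs()

# Features whose absolute skewness exceeds the chosen threshold.
SKEW_THRESHOLD = 1.0
highly_skewed = abs_skew[abs_skew > SKEW_THRESHOLD].index.tolist()
```

In this toy example only time_on_Dialysis is flagged, because a single very long dialysis time stretches the right tail.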

In [None]:
#getting the columns list to begin building data dictionary
data.columns

Index(['Id', 'date', 'age_at_list_registration', 'age_cat', 'time_on_Dialysis',
       'race', 'sex', 'underline_disease', 'diabetes', 'Blood_type',
       'number_transfusion', 'gestation', 'number_gestation',
       'prior_transplant', 'number_prior_transplant', 'subregion', 'cPRA',
       'cPRA_cat', 'HLA_A1', 'HLA_A2', 'HLA_B1', 'HLA_B2', 'HLA_DR1',
       'HLA_DR2', 'DR_00', 'B_00', 'A_00', 'calculated_frequency_DR.f1',
       'calculated_frequency_DR.f2', 'calculated_frequency_DR.f',
       'calculated_frequency_B.f1', 'calculated_frequency_B.f2',
       'calculated_frequency_B.f', 'calculated_frequency_A.f1',
       'calculated_frequency_A.f2', 'calculated_frequency_A.f', 'chagas',
       'anti.HBc', 'anti.HCV', 'agHBs', 'patient_still_on_list', 'date_acutal',
       'death', 'Time_death', 'Transplant', 'Transplant_Y_N', 'X36MthsTx',
       'Time_Tx', 'priorization', 'removed_list', 'razon_removed', 'time',
       'event'],
      dtype='object')

### FEATURES TABLE
#### Building a data dictionary

- The major variables used in the study by Silva and team are: age, sex, race, comorbidities, time on dialysis, blood group, calculated panel class I (cPRA), HLA-A, HLA-B, HLA-DR, number of blood transfusions, pregnancies, previous kidney transplants, and pre-transplant serology for Hepatitis B and C.
- The following is a basic data description. The full description is in the PPT.



##### Numerical Features
1.	Id                            : (int) Row identifier
2.	age_at_list_registration      : (int) Age of the patient at the time of registration
3.	time_on_Dialysis              : (float) Time spent by the patient on dialysis before transplant (in months)
4.	diabetes                      : (int) Whether the patient is diabetic: 1 for yes, 0 for no. Binary categorical variable
5.	number_transfusion            : (int) Number of blood transfusions the patient needed
6.	number_gestation              : (float) Number of gestations (pregnancies)
7.	number_prior_transplant       : (int) Number of previous transplants
8.	cPRA                          : (int) Calculated panel reactive antibody
9.	HLA_A1                        : (int) Human leukocyte antigen
10.	HLA_A2                        : (int) Human leukocyte antigen
11.	HLA_B1                        : (int) Human leukocyte antigen
12.	HLA_B2                        : (int) Human leukocyte antigen
13.	HLA_DR1                       : (int) Human leukocyte antigen
14.	HLA_DR2                       : (int) Human leukocyte antigen
15.	calculated_frequency_DR.f1    : (float)
16.	calculated_frequency_DR.f2    : (float)
17.	calculated_frequency_DR.f     : (float)
18.	calculated_frequency_B.f1     : (float)
19.	calculated_frequency_B.f2     : (float)
20.	calculated_frequency_B.f      : (float)
21.	calculated_frequency_A.f1     : (float)
22.	calculated_frequency_A.f2     : (float)
23.	calculated_frequency_A.f      : (float)
24.	date_acutal                   : (int)
25.	Time_death                    : (int)
26.	X36MthsTx                     : (int)
27.	Time_Tx                       : (float)
28.	time                          : (int)
29.	event                         : (int)

##### Categorical features:

 0   Id                          48153 non-null  int64  
 1   date                        48153 non-null  object 
 2   age_at_list_registration    48153 non-null  int64  
 3   age_cat                     48153 non-null  object 
 4   time_on_Dialysis            46817 non-null  float64
 5   race                        48153 non-null  object 
 6   sex                         48153 non-null  object 
 7   underline_disease           48153 non-null  object 
 8   diabetes                    48153 non-null  int64  
 9   Blood_type                  48153 non-null  object 
 10  number_transfusion          48153 non-null  int64  
 11  gestation                   48153 non-null  object 
 12  number_gestation            19464 non-null  float64
 13  prior_transplant            48153 non-null  object 
 14  number_prior_transplant     48153 non-null  int64  
 15  subregion                   48153 non-null  object 
 16  cPRA                        48153 non-null  int64  
 17  cPRA_cat                    48153 non-null  object 
 18  HLA_A1                      48153 non-null  int64  
 19  HLA_A2                      48153 non-null  int64  
 20  HLA_B1                      48153 non-null  int64  
 21  HLA_B2                      48153 non-null  int64  
 22  HLA_DR1                     48153 non-null  int64  
 23  HLA_DR2                     48153 non-null  int64  
 24  DR_00                       48153 non-null  object 
 25  B_00                        48153 non-null  object 
 26  A_00                        48153 non-null  object 
 27  calculated_frequency_DR.f1  48153 non-null  float64
 28  calculated_frequency_DR.f2  48153 non-null  float64
 29  calculated_frequency_DR.f   48153 non-null  float64
 30  calculated_frequency_B.f1   48153 non-null  float64
 31  calculated_frequency_B.f2   48153 non-null  float64
 32  calculated_frequency_B.f    48153 non-null  float64
 33  calculated_frequency_A.f1   48153 non-null  float64
 34  calculated_frequency_A.f2   48153 non-null  float64
 35  calculated_frequency_A.f    48153 non-null  float64
 36  chagas                      44141 non-null  object 
 37  anti.HBc                    48153 non-null  object 
 38  anti.HCV                    44141 non-null  object 
 39  agHBs                       44141 non-null  object 
 40  patient_still_on_list       48153 non-null  object 
 41  date_acutal                 48153 non-null  int64  
 42  death                       48153 non-null  object 
 43  Time_death                  48153 non-null  int64  
 44  Transplant                  48153 non-null  object 
 45  Transplant_Y_N              48153 non-null  object 
 46  X36MthsTx                   48153 non-null  int64  
 47  Time_Tx                     48153 non-null  float64
 48  priorization                48153 non-null  object 
 49  removed_list                48153 non-null  object 
 50  razon_removed               15295 non-null  object 
 51  time                        48153 non-null  int64  
 52  event                       48153 non-null  int64  


1.	Id                          int64
2.	date                        object
    a. convert it into a datetime object
    b. consider its usefulness given the period the data covers relative to the present
3.	age_at_list_registration    int64
4.	age_cat                     object
    a. needs to be one-hot encoded
5.	time_on_Dialysis            float64
    a. about 2.8% of the values are NaN and need to be handled
6.	race                        object
    a. needs to be one-hot encoded
    b. 4 unique values
7.	sex                         object
    a. needs to be one-hot encoded
8.	underline_disease           object
    a. needs to be one-hot encoded
    b. 5 unique values, written in Portuguese
9.	diabetes                    int64
    a. already binary encoded (0/1)
10.	Blood_type                  object
    a. needs to be one-hot encoded
    b. 4 unique blood types
11.	number_transfusion          int64
12.	gestation                   object
    a. needs one-hot encoding
13.	number_gestation            float64
    a. NaN values occur when gestation (12) is No
    b. an integer value when gestation is Yes
    c. NaN means "not a number" and is stored as a floating-point value, which is why the column is float64
    d. isna() reads NaN as a missing value, but here NaN carries meaning, so the feature should not be removed on its missing rate alone
    e. proposal: remove the 'gestation' feature altogether and keep only number_gestation, replacing all NaN values with 0
14.	prior_transplant            object
    a. needs one-hot encoding
15.	number_prior_transplant     int64
    a. we can do away with the prior_transplant feature
    b. a number_prior_transplant value of zero already encodes "no prior transplant"
16.	subregion                   object
    a. 4 unique subregions
    b. one-hot encoding required
17.	cPRA                        int64
18.	cPRA_cat                    object
    a. cPRA_cat may be a more useful feature than the raw cPRA, but we need to verify this
    b. the reverse may hold; either way, we can do away with one of the two features
19.	HLA_A1                      int64
20.	HLA_A2                      int64
21.	HLA_B1                      int64
22.	HLA_B2                      int64
23.	HLA_DR1                     int64
24.	HLA_DR2                     int64
25.	DR_00                       object
    a. 2 possible values
    b. one-hot encoding
26.	B_00                        object
    a. 2 possible values
    b. one-hot encoding
27.	A_00                        object
    a. 2 possible values
    b. one-hot encoding
28.	calculated_frequency_DR.f1  float64
29.	calculated_frequency_DR.f2  float64
30.	calculated_frequency_DR.f   float64
31.	calculated_frequency_B.f1   float64
32.	calculated_frequency_B.f2   float64
33.	calculated_frequency_B.f    float64
34.	calculated_frequency_A.f1   float64
35.	calculated_frequency_A.f2   float64
36.	calculated_frequency_A.f    float64
37.	chagas                      object
    a. either yes or no
    b. one-hot encoding
    c. some values are NaN and need to be handled
38.	anti.HBc                    object
    a. one-hot encoding required
    b. name change required
39.	anti.HCV                    object
    a. name change required
    b. one-hot encoding required
    c. some values are NaN
40.	agHBs                       object
    a. one-hot encoding required
    b. some values are NaN
41.	patient_still_on_list       object
42.	date_acutal                 int64
43.	death                       object
44.	Time_death                  int64
45.	Transplant                  object
46.	Transplant_Y_N              object
47.	X36MthsTx                   int64
48.	Time_Tx                     float64
49.	priorization                object
50.	removed_list                object
51.	razon_removed               object
    a. only 15295 of 48153 values are non-null
52.	time                        int64
53.	event                       int64
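
The proposals in items 13 and 15 above (keep the count features, drop their yes/no counterparts) might be sketched as follows. The frame `toy` is made up for illustration, and treating NaN in number_gestation as 0 is the assumption stated in item 13, not a verified property of the data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "gestation": ["Não", "Não", "Sim"],
    "number_gestation": [np.nan, 0.0, 2.0],
    "prior_transplant": ["Não", "Sim", "Não"],
    "number_prior_transplant": [0, 1, 0],
})

# Assumption: a missing gestation count means no gestation, so fill with 0.
toy["number_gestation"] = toy["number_gestation"].fillna(0).astype(int)

# The count columns already encode the yes/no information, so drop the flags.
toy = toy.drop(columns=["gestation", "prior_transplant"])
```

Before applying this to the real data, it would be worth confirming that every row with gestation equal to Sim has a positive number_gestation, since the fill would otherwise hide inconsistent records.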
