### About Dataset

##### Source
https://www.kaggle.com/datasets/gustavomodelli/waitlist-kidney-brazil

##### Context
Predicting the waiting time for a deceased donor kidney transplant can help patients and clinicians discuss management and can contribute to a more efficient use of resources.

##### Content
A model was developed with this data and published in PLoS One. We expect that sharing the data will lead to improvements in the model that could help physicians and patients.

##### Reference: 
Sapiertein Silva JF, Ferreira GF, Perosa M, Nga HS, de Andrade LGM. A machine learning prediction model for waiting time to kidney transplant. PLoS One. 2021 May 20;16(5):e0252069. doi: 10.1371/journal.pone.0252069. PMID: 34015020. (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252069)

##### Acknowledgements
We would like to thank the Secretaria de Saude do Estado de Sao Paulo for providing the data.

##### Inspiration
Predict deceased donor transplantation using the available predictors.

##### Allocation criteria
Allocation was performed as established by the National Transplantation System of the Brazilian Ministry of Health. For deceased donor transplants, allocation criteria are based on HLA matching (highest number of points for HLA DR, followed by HLA B and HLA A), recipient's age (<18 years), date of registration on the waiting list, and panel reactive antibody (PRA). A point score system based on blood group and HLA match is used as follows:

- DR: 0 MM = 10 points; 1 MM = 5 points; 2 MM = 0 points
- B: 0 MM = 4 points; 1 MM = 2 points; 2 MM = 0 points
- A: 0 MM = 1 point; 1 MM = 0.5 points; 2 MM = 0 points

Waiting time, allosensitization (cPRA >50), diabetes mellitus, and age < 18 years served as tiebreakers.
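
The HLA part of the point score can be written as a small lookup. The sketch below is illustrative only: the helper name `hla_points` and the idea of passing mismatch counts per locus are assumptions made here, not part of the dataset or the published model, and the blood group and tiebreaker rules are not modelled.

```python
# Points awarded per locus for 0, 1 or 2 mismatches (MM), taken from the list above.
HLA_POINTS = {
    "DR": {0: 10.0, 1: 5.0, 2: 0.0},
    "B": {0: 4.0, 1: 2.0, 2: 0.0},
    "A": {0: 1.0, 1: 0.5, 2: 0.0},
}


def hla_points(mm_dr: int, mm_b: int, mm_a: int) -> float:
    """Return the total HLA matching points for the given mismatch counts.

    Hypothetical helper: it covers only the HLA part of the allocation score.
    """
    mismatches = {"DR": mm_dr, "B": mm_b, "A": mm_a}
    for locus, mm in mismatches.items():
        if mm not in (0, 1, 2):
            raise ValueError(f"{locus} mismatches must be 0, 1 or 2, got {mm}")
    return sum(HLA_POINTS[locus][mm] for locus, mm in mismatches.items())


print(hla_points(0, 0, 0))  # full match: 10 + 4 + 1 = 15.0
print(hla_points(1, 2, 1))  # 5 + 0 + 0.5 = 5.5
```

A full match therefore scores 15 points, and the DR locus alone contributes two thirds of that maximum.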

## 1. EXPLORATORY DATA ANALYSIS

1.   Load and audit the data
2.   Data preparation and transformation
     a. Impute the missing values
     b. Outliers or extreme values
     c. Inconsistent values
3.   Data Visualization
4.   Data Analysis
     a. Uni-Variate Analysis (measures of central tendency, measures of dispersion)
     b. Bi-Variate (correlation, chi-square test)
     c. Multi-variate (if needed)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [3]:
data = pd.read_csv('waitlist.csv', encoding='latin-1')

In [4]:
pd.set_option('display.max_columns', None)
data.sample(15)

Unnamed: 0,Id,date,age_at_list_registration,age_cat,time_on_Dialysis,race,sex,underline_disease,diabetes,Blood_type,number_transfusion,gestation,number_gestation,prior_transplant,number_prior_transplant,subregion,cPRA,cPRA_cat,HLA_A1,HLA_A2,HLA_B1,HLA_B2,HLA_DR1,HLA_DR2,DR_00,B_00,A_00,calculated_frequency_DR.f1,calculated_frequency_DR.f2,calculated_frequency_DR.f,calculated_frequency_B.f1,calculated_frequency_B.f2,calculated_frequency_B.f,calculated_frequency_A.f1,calculated_frequency_A.f2,calculated_frequency_A.f,chagas,anti.HBc,anti.HCV,agHBs,patient_still_on_list,date_acutal,death,Time_death,Transplant,Transplant_Y_N,X36MthsTx,Time_Tx,priorization,removed_list,razon_removed,time,event
29138,35606,2009-04-27,48,18.a.60,6.0,Parda,M,GNC,1,O,1,Não,,Não,0,UNIFESP,0,Zero,3,29,7,40,15,0,homozigoto,heterozigoto,heterozigoto,0.21,0.0,0.0,0.14,0.09,1.26,0.18,0.08,1.44,Não,Não,Não,Não,Não,43307,Não,113,Não,Não,0,28.37,Não,Sim,Transferido para outro Estado,851,3
12316,16264,2006-01-11,24,18.a.60,4.0,Branca,M,HAS,1,A,1,Não,,Não,0,FUNDERP,0,Zero,2,11,7,39,12,15,heterozigoto,heterozigoto,heterozigoto,0.04,0.21,0.84,0.14,0.06,0.84,0.42,0.1,4.2,Não,Não,Não,Não,Não,43307,Não,153,Não,Não,0,73.8,Não,Sim,Removido (suspenso > 365 dias),2214,3
2165,2922,2002-08-19,64,Maior.60,6.0,Branca,M,HAS,1,O,0,Não,,Não,0,HCFMUSP,0,Zero,29,68,44,51,0,7,homozigoto,heterozigoto,heterozigoto,0.0,0.22,0.0,0.2,0.14,2.8,0.08,0.13,1.04,,Não,,,Não,43307,Não,194,Não,Não,0,43.6,Não,Sim,Removido (suspenso > 365 dias),1308,3
38780,46327,2005-02-18,63,Maior.60,11.0,Branca,M,Outras,1,A,0,Não,,Não,0,UNIFESP,0,Zero,0,3,14,49,4,11,heterozigoto,heterozigoto,homozigoto,0.23,0.23,5.29,0.1,0.05,0.5,0.0,0.18,0.0,Não,Não,Não,Não,Sim,43307,Não,164,Não,Não,0,163.53,Não,Não,,4906,0
554,741,2017-05-24,23,18.a.60,13.0,Negra,M,Outras,1,AB,0,Não,,Sim,2,HCFMUSP,90,Maior_80,2,66,51,53,7,13,heterozigoto,heterozigoto,heterozigoto,0.22,0.25,5.5,0.14,0.06,0.84,0.42,0.02,0.84,Não,Não,Não,Não,Sim,43307,Não,14,Não,Não,0,14.27,Não,Não,,428,0
16673,21702,2000-07-05,50,18.a.60,2.0,Branca,M,Diabetes,0,O,1,Não,,Não,0,FUNDERP,10,Entre_0_50,0,0,0,0,0,0,homozigoto,homozigoto,homozigoto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,Não,,,Não,43307,Sim,21,Óbito Lista,Não,0,20.6,Não,Não,,618,2
3751,5048,2012-05-17,64,Maior.60,17.0,Branca,M,GNC,1,O,0,Não,,Não,0,UNIFESP,0,Zero,2,3,14,45,9,13,heterozigoto,heterozigoto,heterozigoto,0.04,0.25,1.0,0.1,0.05,0.5,0.42,0.18,7.56,Não,Não,Não,Não,Sim,43307,Não,75,Não,Não,0,75.37,Não,Não,,2261,0
13460,17785,2008-12-02,38,18.a.60,,Branca,M,Outras,1,B,0,Não,,Não,0,UNIFESP,96,Maior_80,2,24,45,50,4,13,heterozigoto,heterozigoto,heterozigoto,0.23,0.25,5.75,0.05,0.04,0.2,0.42,0.16,6.72,Não,Não,Não,Não,Não,43307,Sim,80,Óbito Lista,Não,0,79.67,Não,Não,,2390,2
40917,48760,2016-09-12,37,18.a.60,17.0,Branca,M,Outras,1,B,0,Não,,Não,0,HCFMUSP,0,Zero,11,23,18,50,7,11,heterozigoto,heterozigoto,heterozigoto,0.22,0.23,5.06,0.09,0.04,0.36,0.1,0.11,1.1,Não,Não,Não,Não,Sim,43307,Não,23,Não,Não,0,22.73,Não,Não,,682,0
36498,43808,2001-04-06,57,18.a.60,4.0,Branca,M,Diabetes,0,O,0,Não,,Não,0,UNIFESP,0,Zero,1,0,44,8,3,13,heterozigoto,heterozigoto,homozigoto,0.18,0.25,4.5,0.2,0.08,1.6,0.17,0.0,0.0,,Não,,,Não,43307,Não,211,Não,Não,0,52.23,Não,Sim,Removido (suspenso > 365 dias),1567,3


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48153 entries, 0 to 48152
Data columns (total 53 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Id                          48153 non-null  int64  
 1   date                        48153 non-null  object 
 2   age_at_list_registration    48153 non-null  int64  
 3   age_cat                     48153 non-null  object 
 4   time_on_Dialysis            46817 non-null  float64
 5   race                        48153 non-null  object 
 6   sex                         48153 non-null  object 
 7   underline_disease           48153 non-null  object 
 8   diabetes                    48153 non-null  int64  
 9   Blood_type                  48153 non-null  object 
 10  number_transfusion          48153 non-null  int64  
 11  gestation                   48153 non-null  object 
 12  number_gestation            19464 non-null  float64
 13  prior_transplant            481

#### Preliminary Observations about the data:
- 12 float, 17 int, and 24 object variables
- 48153 rows (tuples) and 53 features
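
These counts can be checked programmatically rather than read off the `info()` output. The sketch below assumes a hypothetical helper `summarize_structure` and uses a small made-up frame in place of `data`, so it runs without the CSV file.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> dict:
    """Return the row count, column count and number of columns per dtype."""
    n_rows, n_cols = df.shape
    dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
    return {"rows": n_rows, "columns": n_cols, "dtypes": dtype_counts}


# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "time_on_Dialysis": [6.0, None, 11.0],
        "sex": ["M", "F", "M"],
    }
)
summary = summarize_structure(toy)
print(summary)
```

Calling `summarize_structure(data)` on the real frame should reproduce the dtype counts and the shape listed above.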

In [6]:
data.isna().sum()/len(data)*100
# remove features with missing value percentage greater than 45
# create a flag (look at features / check )

Id                             0.000000
date                           0.000000
age_at_list_registration       0.000000
age_cat                        0.000000
time_on_Dialysis               2.774490
race                           0.000000
sex                            0.000000
underline_disease              0.000000
diabetes                       0.000000
Blood_type                     0.000000
number_transfusion             0.000000
gestation                      0.000000
number_gestation              59.578842
prior_transplant               0.000000
number_prior_transplant        0.000000
subregion                      0.000000
cPRA                           0.000000
cPRA_cat                       0.000000
HLA_A1                         0.000000
HLA_A2                         0.000000
HLA_B1                         0.000000
HLA_B2                         0.000000
HLA_DR1                        0.000000
HLA_DR2                        0.000000
DR_00                          0.000000


#### Preliminary Observations about the data:
- time_on_Dialysis, number_gestation, chagas, agHBs, anti.HCV, and razon_removed: these 6 features might need missing value imputation. We will decide after understanding the nature of each feature.
- The feature names do not need to be changed; they all follow basic conventions that make them easy to manipulate. We will keep them as they are unless a need arises to rename a feature, perhaps during the feature engineering phase.
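
The two ideas noted in the code comments (dropping features with more than 45% missing values and flagging the rest) could look like the sketch below. The helper name `handle_missing`, the `_missing` suffix and the toy values are assumptions for illustration, not the approach finally adopted in this notebook.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, drop_threshold: float = 45.0) -> pd.DataFrame:
    """Drop columns whose missing percentage exceeds `drop_threshold` and add
    a 0/1 `<column>_missing` flag for the remaining columns that have gaps."""
    missing_pct = df.isna().mean() * 100
    to_drop = missing_pct[missing_pct > drop_threshold].index
    out = df.drop(columns=to_drop)
    to_flag = missing_pct[(missing_pct > 0) & (missing_pct <= drop_threshold)].index
    for col in to_flag:
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out


# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "time_on_Dialysis": [6.0, np.nan, 11.0, 4.0],   # 25% missing: kept and flagged
        "number_gestation": [np.nan, np.nan, np.nan, 2.0],  # 75% missing: dropped
        "cPRA": [0, 90, 10, 0],                         # complete: kept as is
    }
)
cleaned = handle_missing(toy)
print(cleaned.columns.tolist())
```

Under this rule number_gestation (about 59.6% missing in the real data) would be dropped, so the threshold should only be applied once the reason for the missing values is understood.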

In [8]:
data.describe(include = object)

Unnamed: 0,date,age_cat,race,sex,underline_disease,Blood_type,gestation,prior_transplant,subregion,cPRA_cat,...,anti.HBc,anti.HCV,agHBs,patient_still_on_list,death,Transplant,Transplant_Y_N,priorization,removed_list,razon_removed
count,48153,48153,48153,48153,48153,48153,48153,48153,48153,48153,...,48153,44141,44141,48153,48153,48153,48153,48153,48153,15295
unique,4108,3,4,2,5,4,2,2,4,4,...,2,2,2,2,2,3,2,1,2,10
top,2016-06-20,18.a.60,Branca,M,Outras,O,Não,Não,UNIFESP,Zero,...,Não,Não,Não,Não,Não,Não,Não,Não,Não,Removido (suspenso > 365 dias)
freq,94,34752,32455,28684,16626,23640,36122,41983,23166,34205,...,47380,43595,44022,37250,36592,25289,34421,48153,32858,12985


#### Preliminary Observations about the categorical features:
- 'count' in the table above is a summary row produced by `describe()`, not a categorical feature of the dataset, so it needs no type conversion.

In [11]:
#getting a hands on idea about the attributes
data.describe()

Unnamed: 0,Id,age_at_list_registration,time_on_Dialysis,diabetes,number_transfusion,number_gestation,number_prior_transplant,cPRA,HLA_A1,HLA_A2,...,calculated_frequency_B.f,calculated_frequency_A.f1,calculated_frequency_A.f2,calculated_frequency_A.f,date_acutal,Time_death,X36MthsTx,Time_Tx,time,event
count,48153.0,48153.0,46817.0,48153.0,48153.0,19464.0,48153.0,48153.0,48153.0,48153.0,...,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0,48153.0
mean,29486.740515,48.613399,21.054254,0.792599,0.412518,2.040793,0.146325,14.402093,10.359168,28.816211,...,1.18259,0.235889,0.118824,2.697248,43307.0,75.726372,0.205117,35.527375,1090.57635,1.558864
std,16186.744194,14.707031,29.212685,0.40545,0.63555,2.413947,0.409814,29.119877,13.485239,21.826129,...,1.096077,0.145272,0.09899,2.78976,0.0,57.602143,0.403791,30.977942,955.582713,1.138825
min,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,43307.0,0.0,0.0,0.0,0.0,0.0
25%,15902.0,39.0,5.0,1.0,0.0,0.0,0.0,0.0,2.0,11.0,...,0.36,0.13,0.06,0.78,43307.0,27.0,0.0,13.67,414.0,1.0
50%,29921.0,50.0,10.0,1.0,0.0,2.0,0.0,0.0,2.0,29.0,...,0.84,0.17,0.11,1.76,43307.0,61.0,0.0,26.3,800.0,1.0
75%,43387.0,59.0,23.0,1.0,1.0,3.0,0.0,9.0,23.0,33.0,...,1.71,0.42,0.13,4.2,43307.0,112.0,0.0,47.67,1470.0,3.0
max,56937.0,97.0,180.0,1.0,3.0,63.0,5.0,100.0,80.0,80.0,...,4.41,0.42,0.42,17.64,43307.0,226.0,1.0,225.87,6776.0,3.0


In [13]:
# let us also look at the skew values of the numerical features to facilitate better observation
# numeric_only=True skips the object columns, which have no skewness
data.skew(numeric_only=True).abs()


Id                            0.083882
age_at_list_registration      0.482444
time_on_Dialysis              2.980401
diabetes                      1.443389
number_transfusion            1.653587
number_gestation              2.701250
number_prior_transplant       3.229051
cPRA                          1.932519
HLA_A1                        1.754463
HLA_A2                        0.663204
HLA_B1                        0.251808
HLA_B2                        0.894229
HLA_DR1                       0.518564
HLA_DR2                       0.851911
calculated_frequency_DR.f1    1.657124
calculated_frequency_DR.f2    0.859342
calculated_frequency_DR.f     0.318115
calculated_frequency_B.f1     0.035232
calculated_frequency_B.f2     0.427729
calculated_frequency_B.f      1.224001
calculated_frequency_A.f1     0.300759
calculated_frequency_A.f2     1.832411
calculated_frequency_A.f      2.026850
date_acutal                   0.000000
Time_death                    0.783563
X36MthsTx                

#### Preliminary Observations about the numerical features: 
- The skewness table does not show extreme skewness for most features. We will treat an absolute skewness above 1 as major, keeping the nature of each feature in mind.
- 'Id' is the primary key, so there is nothing to do with it.
- 'age_at_list_registration' ranges from 0 to 97 with a mean of 48.6 and a median of 50. Its distribution is close to normal (absolute skewness 0.48).
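
The threshold of 1 on absolute skewness can be applied directly to the skewness values. The sketch below uses a hypothetical helper `highly_skewed` and a made-up toy frame so that it runs on its own.

```python
import pandas as pd


def highly_skewed(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return the absolute skewness of numeric columns above `threshold`, largest first."""
    abs_skew = df.skew(numeric_only=True).abs()
    return abs_skew[abs_skew > threshold].sort_values(ascending=False)


# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "symmetric": [1, 2, 3, 4, 5],
        "right_skewed": [0, 0, 0, 1, 50],
        "label": ["a", "b", "c", "d", "e"],
    }
)
print(highly_skewed(toy))
```

On the real data, `highly_skewed(data)` would list features such as time_on_Dialysis and number_prior_transplant, whose absolute skewness in the table above exceeds 1.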

In [15]:
#getting the columns list to begin building data dictionary
data.columns

Index(['Id', 'date', 'age_at_list_registration', 'age_cat', 'time_on_Dialysis',
       'race', 'sex', 'underline_disease', 'diabetes', 'Blood_type',
       'number_transfusion', 'gestation', 'number_gestation',
       'prior_transplant', 'number_prior_transplant', 'subregion', 'cPRA',
       'cPRA_cat', 'HLA_A1', 'HLA_A2', 'HLA_B1', 'HLA_B2', 'HLA_DR1',
       'HLA_DR2', 'DR_00', 'B_00', 'A_00', 'calculated_frequency_DR.f1',
       'calculated_frequency_DR.f2', 'calculated_frequency_DR.f',
       'calculated_frequency_B.f1', 'calculated_frequency_B.f2',
       'calculated_frequency_B.f', 'calculated_frequency_A.f1',
       'calculated_frequency_A.f2', 'calculated_frequency_A.f', 'chagas',
       'anti.HBc', 'anti.HCV', 'agHBs', 'patient_still_on_list', 'date_acutal',
       'death', 'Time_death', 'Transplant', 'Transplant_Y_N', 'X36MthsTx',
       'Time_Tx', 'priorization', 'removed_list', 'razon_removed', 'time',
       'event'],
      dtype='object')

### FEATURES TABLE
#### building a data Dictionary

- The major variables used in the study by Silva and team are: 

age, sex, race, comorbidities, time on dialysis, blood group, calculated panel class I (cPRA), HLA-A, HLA-B, HLA-DR, number of blood transfusions, pregnancies, previous kidney transplants, and pre-transplant serology for Hepatitis B and C.
- The following is a basic data description. The full description is in the PPT.


 36  chagas                      44141 non-null  object 
 37  anti.HBc                    48153 non-null  object 
 38  anti.HCV                    44141 non-null  object 
 39  agHBs                       44141 non-null  object 
 40  patient_still_on_list       48153 non-null  object 
 41  date_acutal                 48153 non-null  int64  
 42  death                       48153 non-null  object 
 43  Time_death                  48153 non-null  int64  
 44  Transplant                  48153 non-null  object 
 45  Transplant_Y_N              48153 non-null  object 
 46  X36MthsTx                   48153 non-null  int64  
 47  Time_Tx                     48153 non-null  float64
 48  priorization                48153 non-null  object 
 49  removed_list                48153 non-null  object 
 50  razon_removed               15295 non-null  object 
 51  time                        48153 non-null  int64  
 52  event                       48153 non-null  int64  

##### Numerical Features
1.	Id                            : (int) Ids of tuples
2.	age_at_list_registration      : (int) Age of the patient at the time of registration
3.	time_on_Dialysis              : (float) Time spent by the patient on dialysis before transplant (in months)
4.	diabetes                      : (int) Whether the patient is diabetic: 1 for yes, 0 for no. #Categorical Variable
5.	number_transfusion            : (int) Number of blood transfusions the patient needed
6.	number_gestation              : (float) Number of pregnancies
7.	number_prior_transplant       : (int) Number of previous transplants
8.	cPRA                          : (int) Calculated panel reactive antibody
9.	HLA_A1                        : (int) Human leukocyte antigen (HLA-A)
10.	HLA_A2                        : (int) Human leukocyte antigen (HLA-A)
11.	HLA_B1                        : (int) Human leukocyte antigen (HLA-B)
12.	HLA_B2                        : (int) Human leukocyte antigen (HLA-B)
13.	HLA_DR1                       : (int) Human leukocyte antigen (HLA-DR)
14.	HLA_DR2                       : (int) Human leukocyte antigen (HLA-DR)
15.	calculated_frequency_DR.f1    : (float)
16.	calculated_frequency_DR.f2    : (float)
17.	calculated_frequency_DR.f     : (float)
18.	calculated_frequency_B.f1     : (float)
19.	calculated_frequency_B.f2     : (float)
20.	calculated_frequency_B.f      : (float)
21.	calculated_frequency_A.f1     : (float)
22.	calculated_frequency_A.f2     : (float)
23.	calculated_frequency_A.f      : (float)
24.	date_acutal                   : (int)
25.	Time_death                    : (int)
26.	X36MthsTx                     : (int)
27.	Time_Tx                       : (float)
28.	time                          : (int)
29.	event                         : (int)