# **CKD Prediction With AutoML | EvalML | CatBoost | LightGBM | XGBoost**



Source: https://github.com/alteryx/evalml

EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

Key Functionality

1. Automation - Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
2. Data Checks - Catches and warns of problems with your data and problem setup before modeling.
3. End-to-end - Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
4. Model Understanding - Provides tools to understand and introspect on models, to learn how they'll behave in your problem domain.
5. Domain-specific - Includes repository of domain-specific objective functions and an interface to define your own.
 


In [None]:
!pip install evalml

### **Load Modules and helper functions**

In [None]:
import evalml
from evalml import AutoMLSearch

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
#import Dataset
dataset = pd.read_csv('../input/ckdisease/kidney_disease.csv')

In [None]:
dataset.head()

### **Cleaning and preprocessing of data for training**

In [None]:
dataset[['htn','dm','cad','pe','ane']]=dataset[['htn','dm','cad','pe','ane']].replace(to_replace={'yes':1,'no':0})
dataset[['rbc','pc']] = dataset[['rbc','pc']].replace(to_replace={'abnormal':1,'normal':0})
dataset[['pcc','ba']] = dataset[['pcc','ba']].replace(to_replace={'present':1,'notpresent':0})
dataset[['appet']] = dataset[['appet']].replace(to_replace={'good':1,'poor':0,'no':np.nan})
dataset['classification']=dataset['classification'].replace(to_replace={'ckd':1.0,'ckd\t':1.0,'notckd':0.0,'no':0.0})
dataset.rename(columns={'classification':'class'},inplace=True)

In [None]:
# Further cleaning
dataset['pe'] = dataset['pe'].replace(to_replace='good',value=0) # Not having pedal edema is good
dataset['appet'] = dataset['appet'].replace(to_replace='no',value=0)
dataset['cad'] = dataset['cad'].replace(to_replace='\tno',value=0)
dataset['dm'] = dataset['dm'].replace(to_replace={'\tno':0,'\tyes':1,' yes':1, '':np.nan})
dataset.drop('id',axis=1,inplace=True)
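
Several of the replacements above exist only because some labels carry stray tabs or spaces (`'\tno'`, `' yes'`). As a minimal sketch of an alternative, assuming the labels differ only by surrounding whitespace, the values can be stripped first and then mapped once. The `raw` series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Illustrative values only; they mimic the stray tabs and spaces seen in columns such as 'dm' and 'cad'.
raw = pd.Series(['yes', '\tno', ' yes', 'no', '\tyes', None])

# Strip surrounding whitespace first, then map the clean labels to 0/1.
# Labels that are not in the mapping, and missing values, become NaN.
cleaned = raw.str.strip().map({'yes': 1, 'no': 0})
print(cleaned.tolist())
```

One design consequence worth noting: `map` turns any unexpected label into NaN, whereas `replace` leaves it untouched, so unmapped values surface as missing data rather than as leftover strings.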

In [None]:
# Strip stray characters (tabs, '?') and convert to float; entries with no digits become NaN
for i in ['rc','wc','pcv']:
    dataset[i] = dataset[i].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)
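
To see how a numeric-extraction pattern behaves on messy strings, here is a small sketch using only the standard `re` module. The sample strings are hypothetical, modelled on entries with stray tabs, `?` placeholders and decimals. It compares an integer-only pattern with a decimal-aware one; `str.extract` in pandas likewise returns the first match of the pattern.

```python
import re

# Hypothetical raw strings, not actual rows from the dataset.
samples = ['\t43', '\t?', '5.2', '6200']

INT_ONLY = re.compile(r'(\d+)')
DECIMAL_AWARE = re.compile(r'(\d+(?:\.\d+)?)')

def extract(pattern, text):
    """Return the first numeric match as a float, or None when the text has no digits."""
    match = pattern.search(text)
    return float(match.group(1)) if match else None

int_only = [extract(INT_ONLY, s) for s in samples]
decimal_aware = [extract(DECIMAL_AWARE, s) for s in samples]

print(int_only)       # the integer-only pattern keeps just the digits before the decimal point
print(decimal_aware)  # the decimal-aware pattern keeps the fractional part
```

The difference matters for columns that hold decimal values, where an integer-only pattern would silently turn `5.2` into `5.0`.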

In [None]:
# Fill missing numeric values with the column mean (assign back instead of using inplace on a column selection)
for i in ['age','bp','sg','al','su','bgr','bu','sc','sod','pot','hemo','rc','wc','pcv']:
    dataset[i] = dataset[i].fillna(dataset[i].mean())
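
The same mean imputation can be written without a loop. The sketch below uses a made-up two-column frame (`toy`) rather than the real data: passing the Series of column means to `DataFrame.fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the column names are borrowed from the dataset, the values are invented.
toy = pd.DataFrame({'age': [40.0, np.nan, 60.0], 'bp': [80.0, 70.0, np.nan]})
numeric_cols = ['age', 'bp']

# toy[numeric_cols].mean() is a Series indexed by column name,
# so fillna matches each column to its own mean.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```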

In [None]:
# Drop any columns that still contain missing values (axis=1 drops columns, not rows)
dataset = dataset.dropna(axis=1)

In [None]:
# Separate the features (all columns but the last) from the target (last column, 'class')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

In [None]:
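# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
# Feature scaling: fit on the training set only so that test-set statistics do not leak into training
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)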
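
The split above does not stratify by class. If the two classes are imbalanced, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both subsets. A minimal sketch on toy data follows; the arrays `X_toy` and `y_toy` are invented for illustration and are not the CKD data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (not the CKD dataset): 100 rows, 20% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both the training and the test subset.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```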

### **Run the search for the best classification model**

In [None]:
# Limit the search to three gradient-boosting model families and five batches for efficiency
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary',
                      allowed_model_families=['xgboost', 'lightgbm', 'catboost'], max_batches=5)
automl.search()

### **Model rankings and best pipeline**

In [None]:
automl.rankings

In [None]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

### **Making predictions**

In [None]:
# Retrieve the top-ranked pipeline and predict on the held-out test set
pred = automl.best_pipeline
pred.predict(X_test)
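
The predictions can then be compared with the held-out labels. The sketch below shows one way to do that with scikit-learn metrics. The `y_true` and `y_pred` lists are toy stand-ins, and in the notebook they would be replaced by `y_test` and the output of `pred.predict(X_test)`. The choice of metrics is an assumption, not something the notebook prescribes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels for illustration; substitute y_test and the pipeline's predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={acc:.2f}, f1={f1:.2f}")
print(cm)
```

For a screening task such as CKD detection, the bottom-left cell of the confusion matrix (true cases predicted as healthy) is usually the number to watch.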

In [None]:
pred.graph()