# Liver disease prediction

In this use case, we use the Indian Liver Patient Records dataset to predict whether a patient has liver disease, based on the patient's blood chemistry measurements, age, and gender.

In [28]:
import pandas as pd
import numpy as np

In [29]:
liverd=pd.read_csv('indian_liver_patient.csv')
liverd.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


In [30]:
liverd.info()
liverd.isnull().sum()         

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

There are a few null values (4, all in Albumin_and_Globulin_Ratio). Filling them in would introduce values that were never measured, which is undesirable for medical data because it could misrepresent patient records, so the affected rows are simply dropped.

In [31]:
liverd = liverd.dropna()

In [32]:
liverd.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Dataset                       0
dtype: int64
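
As a minimal sketch of the drop-rather-than-impute step, the toy frame below (hypothetical values, not rows from the dataset) shows that dropna removes whole rows and keeps the original index labels, which is why the index of liverd can still run from 0 to 582 after rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the liver dataset
toy = pd.DataFrame({
    "Albumin": [3.3, 3.2, 3.4, 2.4],
    "Albumin_and_Globulin_Ratio": [0.9, np.nan, 1.0, 0.4],
})

rows_before = len(toy)
toy_clean = toy.dropna()                 # drops every row with at least one missing value
rows_dropped = rows_before - len(toy_clean)

print(rows_dropped)
print(toy_clean.index.tolist())          # surviving rows keep their original index labels
```

If a contiguous index is needed later, reset_index(drop=True) can be chained after dropna.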

In [33]:
liverd['Dataset'].value_counts()

1    414
2    165
Name: Dataset, dtype: int64

In [34]:
liverd.rename(columns = {'Dataset' : 'Result'}, inplace = True)

In [35]:
liverd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 579 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         579 non-null    int64  
 1   Gender                      579 non-null    object 
 2   Total_Bilirubin             579 non-null    float64
 3   Direct_Bilirubin            579 non-null    float64
 4   Alkaline_Phosphotase        579 non-null    int64  
 5   Alamine_Aminotransferase    579 non-null    int64  
 6   Aspartate_Aminotransferase  579 non-null    int64  
 7   Total_Protiens              579 non-null    float64
 8   Albumin                     579 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Result                      579 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 54.3+ KB


In [36]:
liverd['Result'] = liverd['Result'].apply(lambda x:0 if x==2 else 1)

In [37]:
liverd['Result'].value_counts()

1    414
0    165
Name: Result, dtype: int64

The target column is renamed from Dataset to Result.

In the original coding, 2 means the patient does not have liver disease; this is recoded to 0.

1 still means the patient has liver disease.

In [38]:
liverd['Gender'] = liverd['Gender'].apply(lambda x:0 if x=='Male' else 1)

In [39]:
liverd['Gender'].value_counts()

0    439
1    140
Name: Gender, dtype: int64

In [40]:
liverd['Gender'].astype(int)

0      1
1      0
2      0
3      0
4      0
      ..
578    0
579    0
580    0
581    0
582    0
Name: Gender, Length: 579, dtype: int64

The Gender column is categorical, with Female and Male as its values.

After the encoding, 0 represents Male and 1 represents Female.

The column's dtype changes from object to int64 so that it can be used in preprocessing.
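
One caveat of the lambda encoding is that any value other than exactly 'Male' becomes 1. The sketch below uses a hypothetical sample (the lower-case 'male' is an invented unexpected spelling, not something observed in the dataset) to contrast it with an explicit dictionary passed to Series.map, which leaves unrecognised values as NaN so they can be inspected.

```python
import pandas as pd

# Hypothetical sample; 'male' (lower case) stands in for an unexpected spelling
gender = pd.Series(["Female", "Male", "Male", "male"])

# The lambda form maps anything that is not exactly 'Male' to 1
via_lambda = gender.apply(lambda x: 0 if x == "Male" else 1)

# An explicit dictionary leaves unrecognised values as NaN instead
via_map = gender.map({"Male": 0, "Female": 1})

print(via_lambda.tolist())
print(int(via_map.isna().sum()))         # number of values the mapping did not recognise
```

For a column that is known to contain only the two expected labels, both forms give the same result.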

In [41]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [42]:
liverd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 579 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         579 non-null    int64  
 1   Gender                      579 non-null    int64  
 2   Total_Bilirubin             579 non-null    float64
 3   Direct_Bilirubin            579 non-null    float64
 4   Alkaline_Phosphotase        579 non-null    int64  
 5   Alamine_Aminotransferase    579 non-null    int64  
 6   Aspartate_Aminotransferase  579 non-null    int64  
 7   Total_Protiens              579 non-null    float64
 8   Albumin                     579 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Result                      579 non-null    int64  
dtypes: float64(5), int64(6)
memory usage: 54.3 KB


In [43]:
liverd.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Result
0,65,1,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,0,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,0,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,0,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,0,3.9,2.0,195,27,59,7.3,2.4,0.4,1


In [77]:
from sklearn.model_selection import cross_val_score


In [83]:
X=liverd.drop(['Result'],axis=1)
y=liverd['Result'].copy()


In [84]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=101)

In [85]:
stand_cols=['Age','Gender','Total_Bilirubin','Direct_Bilirubin','Alkaline_Phosphotase',
            'Alamine_Aminotransferase','Aspartate_Aminotransferase','Total_Protiens','Albumin','Albumin_and_Globulin_Ratio']


The data is split into the target variable (y) and the input variables (X).

train_test_split then holds out 10% of the rows as a test set, and the columns to standardize are listed in stand_cols.
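
Because the classes are imbalanced (414 versus 165), one option the notebook does not use is a stratified split. The sketch below is an illustration on hypothetical data with a similar class balance; X_demo and y_demo are invented names, and stratify is a standard train_test_split parameter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a class balance similar to the notebook (70% positive)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([1] * 140 + [0] * 60, name="Result")

# stratify=y_demo keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=101, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())          # share of positive labels in each partition
```

With only about 58 rows in a 10% test set, an unstratified split can drift noticeably from the overall class ratio, which is the motivation for this variant.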

In [81]:

# Fit each scaler on the training column only, then apply it to both splits
for i in stand_cols:
    scalet = StandardScaler().fit(X_train[[i]])
    X_train[i] = scalet.transform(X_train[[i]])
    X_test[i] = scalet.transform(X_test[[i]])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

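Standardization rescales each feature to zero mean and unit variance, which generally helps scale-sensitive models such as logistic regression. Each scaler is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training. The warnings above are pandas' SettingWithCopyWarning: the loop assigns into X_train and X_test, which pandas treats as possible copies of X, and it suggests .loc-based assignment instead. They can be ignored here. Note that the execution counts show the split cells (In [83] to In [85]) were run again after this scaling cell (In [81]), which appears to be why the X_test preview that follows shows unscaled values.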
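
As a sketch of one way to avoid the warning, all columns can be scaled with a single StandardScaler and the results written into new frames instead of assigning into slices. The frames below are hypothetical stand-ins for X_train and X_test; since StandardScaler works column by column, this gives the same values as the per-column loop.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames standing in for X_train and X_test
rng = np.random.default_rng(0)
cols = ["Age", "Total_Bilirubin", "Albumin"]
X_train_demo = pd.DataFrame(rng.normal(50, 10, size=(90, 3)), columns=cols)
X_test_demo = pd.DataFrame(rng.normal(50, 10, size=(10, 3)), columns=cols)

# Fit on the training data only, then reuse the same statistics for the test data
scaler = StandardScaler().fit(X_train_demo[cols])

# Build new frames rather than assigning into slices of an existing one
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train_demo[cols]), columns=cols, index=X_train_demo.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_demo[cols]), columns=cols, index=X_test_demo.index
)

print(X_train_scaled.describe().loc[["mean", "std"]])
```

The test frame will not have exactly zero mean, because it is scaled with the training statistics; that is expected.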

In [86]:
X_test.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
485,22,1,6.7,3.2,850,154,248,6.2,2.8,0.8
173,31,0,0.6,0.1,175,48,34,6.0,3.7,1.6
48,32,1,0.6,0.1,176,39,28,6.0,3.0,1.0
242,29,1,0.8,0.2,205,30,23,8.2,4.1,1.0
365,40,0,0.7,0.2,176,28,43,5.3,2.4,0.8


In [59]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

D_tree=DecisionTreeClassifier(max_depth=10)
D_tree.fit(X_train,y_train)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now we compare the results of a linear model, Logistic Regression, with those of a tree-based model, a Decision Tree.

In [71]:
from sklearn.model_selection import cross_val_score

scores1 = cross_val_score(D_tree,X_train,y_train,cv=15)
print('Decision Tree MAX score from all cross validation',scores1.max())

Decision Tree MAX score from all cross validation 0.8148148148148148


In [72]:
scores2 = cross_val_score(logreg,X_train,y_train,cv=15)
print('Logistic Reg MAX score from all cross validation',scores2.max())

Logistic Reg MAX score from all cross validation 0.8518518518518519


In [87]:
print('Logistic Reg MAX score from all cross validation',scores2.max())
print('Decision Tree MAX score from all cross validation',scores1.max())

Logistic Reg MAX score from all cross validation 0.8518518518518519
Decision Tree MAX score from all cross validation 0.8148148148148148


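Judged by the best single fold, Logistic Regression (0.852) scores higher than the Decision Tree (0.815). The maximum over 15 folds is an optimistic measure, though; the mean score across folds would give a more reliable comparison.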
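
The following sketch shows how the mean and spread of the fold scores could be reported next to the maximum. It runs on synthetic data from make_classification, not on the liver dataset, so the numbers it prints say nothing about the models above. Putting the scaler in a pipeline is an assumption that goes beyond the notebook; it refits the scaler inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the liver data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.3, 0.7], random_state=0
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=0),
}

summary = {}
for name, model in models.items():
    # For classifiers, cross_val_score uses stratified folds and accuracy by default
    scores = cross_val_score(model, X_demo, y_demo, cv=15)
    summary[name] = (scores.mean(), scores.std(), scores.max())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f} max={scores.max():.3f}")
```

Comparing the means together with their standard deviations shows whether the gap between two models is larger than the fold-to-fold variation.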