## Liver Disease Prediction
Build predictive models that estimate the stage of liver cirrhosis from 18 clinical features. Cirrhosis is progressive scarring of the liver that can arise from a variety of causes and may eventually lead to liver failure.

Hepatitis and chronic alcohol abuse are frequent causes of the disease. Liver damage caused by cirrhosis can't be undone, but further damage can be limited. Treatments focus on the underlying cause. In advanced cases, a liver transplant may be required. Predicting the stage of cirrhosis and beginning treatment early can help prevent the fatal consequences of the disease.

In [22]:
import pandas as pd
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning
import numpy as np

In [23]:
df=pd.read_csv('train_dataset.csv')
df_test=pd.read_csv('test_dataset.csv')

In [24]:
df.head(5)

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,7135,1654,CL,D-penicillamine,19581,F,N,N,Y,N,0.3,279.0,2.96,84.0,1500.8,99.43,109.0,293.0,10.2,4.0
1,7326,41,C,D-penicillamine,22880,F,,N,,N,0.3,,2.96,,1835.4,26.35,131.0,308.0,10.8,1.0
2,7254,297,D,,27957,F,N,N,,N,0.3,328.0,2.64,4.0,,,116.0,194.0,10.3,3.0
3,3135,1872,C,D-penicillamine,21111,F,,Y,Y,N,0.3,302.0,2.02,49.0,,26.35,,,10.5,4.0
4,2483,939,CL,D-penicillamine,18061,F,,,,N,0.5,344.0,3.11,91.0,,104.56,,306.0,11.4,2.0


In [25]:
df_test.head()

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin
0,3870,41,C,Placebo,22553,F,N,,N,N,1.4,247.0,3.62,,,108.65,,169.0,11.6
1,3462,1811,C,D-penicillamine,16223,F,N,Y,N,N,0.3,311.0,2.8,92.0,1748.1,,129.0,321.0,11.5
2,1632,954,C,D-penicillamine,27100,F,N,N,N,N,0.4,,3.56,,,43.52,,296.0,10.3
3,722,1969,D,Placebo,17039,F,N,Y,N,N,1.2,,3.16,,617.1,113.76,,125.0,10.9
4,1000,2721,D,D-penicillamine,17738,F,,,,N,3.2,,2.36,89.0,1782.4,,129.0,138.0,10.6


In [26]:
df['Age'].dtype

dtype('int64')

In [27]:
df.drop(columns=['ID'],inplace=True)
df_test.drop(columns=['ID'],inplace=True)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6800 entries, 0 to 6799
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   N_Days         6800 non-null   int64  
 1   Status         6800 non-null   object 
 2   Drug           4775 non-null   object 
 3   Age            6800 non-null   int64  
 4   Sex            6800 non-null   object 
 5   Ascites        4554 non-null   object 
 6   Hepatomegaly   4373 non-null   object 
 7   Spiders        4210 non-null   object 
 8   Edema          6800 non-null   object 
 9   Bilirubin      6800 non-null   float64
 10  Cholesterol    3699 non-null   float64
 11  Albumin        6800 non-null   float64
 12  Copper         4644 non-null   float64
 13  Alk_Phos       4302 non-null   float64
 14  SGOT           4698 non-null   float64
 15  Tryglicerides  3988 non-null   float64
 16  Platelets      6462 non-null   float64
 17  Prothrombin    6645 non-null   float64
 18  Stage   

In [29]:
obj_col=df.columns[df.dtypes=='object']
nonobj_col=df.columns[df.dtypes!='object']
null_col=list(df.columns[df.isnull().any()])
null_object_columns=list(df.columns[(df.dtypes=='object')&(df.isnull().any())])
null_nonobject_columns=list(df.columns[(df.dtypes!='object')&(df.isnull().any())])
print("Object dtype columns:\n",list(obj_col))
print("\nObject dtype columns containing 'null' values:\n",null_object_columns)
print("\nNon-Object dtype columns containing 'null' values:\n",null_nonobject_columns)

Object dtype columns:
 ['Status', 'Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema']

Object dtype columns containing 'null' values:
 ['Drug', 'Ascites', 'Hepatomegaly', 'Spiders']

Non-Object dtype columns containing 'null' values:
 ['Cholesterol', 'Copper', 'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin']


In [30]:
df['Status'].value_counts()

C     3643
D     2619
CL     538
Name: Status, dtype: int64

In [31]:
#encode categorical variables
df_mapped=df.copy()
df_mapped['Age']=df_mapped['Age'].map(lambda x:int(round(x/365)))
df_mapped['Status']=df_mapped['Status'].map({'CL':0,'C':1,'D':2})
df_mapped['Drug']=df_mapped['Drug'].map({'D-penicillamine':0,'Placebo':1})
df_mapped['Sex']=df_mapped['Sex'].map({'M':1,'F':0})
df_mapped['Ascites']=df_mapped['Ascites'].map({'N':0,'Y':1})
df_mapped['Hepatomegaly']=df_mapped['Hepatomegaly'].map({'N':0,'Y':1})
df_mapped['Spiders']=df_mapped['Spiders'].map({'N':0,'Y':1})
df_mapped['Edema']=df_mapped['Edema'].map({'N':0,'S':1,'Y':2})

In [32]:
df_test_mapped=df_test.copy()
df_test_mapped['Age']=df_test_mapped['Age'].map(lambda x:int(round(x/365)))
df_test_mapped['Status']=df_test_mapped['Status'].map({'CL':0,'C':1,'D':2})
df_test_mapped['Drug']=df_test_mapped['Drug'].map({'D-penicillamine':0,'Placebo':1})
df_test_mapped['Sex']=df_test_mapped['Sex'].map({'M':1,'F':0})
df_test_mapped['Ascites']=df_test_mapped['Ascites'].map({'N':0,'Y':1})
df_test_mapped['Hepatomegaly']=df_test_mapped['Hepatomegaly'].map({'N':0,'Y':1})
df_test_mapped['Spiders']=df_test_mapped['Spiders'].map({'N':0,'Y':1})
df_test_mapped['Edema']=df_test_mapped['Edema'].map({'N':0,'S':1,'Y':2})
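
The two cells above apply identical mappings to the training and test frames. As a rough sketch of how that duplication could be avoided, the block below wraps the same codes in one function. The names `CATEGORY_CODES`, `encode_frame` and `demo` are hypothetical and are not part of the original notebook; the demo frame holds made-up rows only to show the call.

```python
import pandas as pd

# Same category-to-integer codes as the two cells above.
CATEGORY_CODES = {
    "Status": {"CL": 0, "C": 1, "D": 2},
    "Drug": {"D-penicillamine": 0, "Placebo": 1},
    "Sex": {"F": 0, "M": 1},
    "Ascites": {"N": 0, "Y": 1},
    "Hepatomegaly": {"N": 0, "Y": 1},
    "Spiders": {"N": 0, "Y": 1},
    "Edema": {"N": 0, "S": 1, "Y": 2},
}

def encode_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Age in whole years and categoricals as integer codes."""
    out = frame.copy()
    out["Age"] = (out["Age"] / 365).round().astype("int64")  # days -> years
    for col, codes in CATEGORY_CODES.items():
        out[col] = out[col].map(codes)  # missing values stay NaN
    return out

# Hypothetical two-row frame, only to illustrate the call.
demo = pd.DataFrame({
    "Age": [19581, 22880],
    "Status": ["CL", "C"],
    "Drug": ["D-penicillamine", None],
    "Sex": ["F", "F"],
    "Ascites": ["N", None],
    "Hepatomegaly": ["N", "N"],
    "Spiders": ["Y", None],
    "Edema": ["N", "N"],
})
encoded = encode_frame(demo)
print(encoded)
```

With such a helper, `encode_frame(df)` and `encode_frame(df_test)` would be guaranteed to use the same codes, so a later edit to one mapping cannot silently leave the other frame out of sync.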

In [33]:
null_df=df_mapped[df_mapped.isnull().any(axis=1)]

In [34]:
non_null=df_mapped.dropna(how="any")
non_null.isnull().any()

N_Days           False
Status           False
Drug             False
Age              False
Sex              False
Ascites          False
Hepatomegaly     False
Spiders          False
Edema            False
Bilirubin        False
Cholesterol      False
Albumin          False
Copper           False
Alk_Phos         False
SGOT             False
Tryglicerides    False
Platelets        False
Prothrombin      False
Stage            False
dtype: bool

In [35]:
non_null=non_null.astype({
    'Drug':'int64',
    'Status':'int64',
    'Stage':'int64',
    'Ascites':'int64',
    'Hepatomegaly':'int64',
    'Spiders':'int64',
    'Edema':'int64',
    'N_Days':'int64',
    'Age':'int64',
})

In [36]:
from xgboost import XGBClassifier,XGBRegressor

In [37]:
def xgbimputer():
    # Fill every column that has missing values with an XGBoost model
    # trained on the rows that are complete (non_null).
    for col in null_col:
        # 'Stage' is left out of the features because the test set does not have it.
        X=non_null.drop(columns=[col,'Stage'])
        y=non_null[col]
        X_test=df_mapped[df_mapped[col].isnull()].drop(columns=[col,'Stage'])
        testdata=df_test_mapped[df_test_mapped[col].isnull()].drop(columns=[col])
        # Classifier for encoded categorical columns, regressor for numeric ones.
        if df[col].dtype=='object':
            xgb=XGBClassifier(use_label_encoder=False,eval_metric='mlogloss')
        else:
            xgb=XGBRegressor()
        xgb.fit(X,y)
        df_mapped.loc[X_test.index,col]=xgb.predict(X_test)
        # The test set may have no missing values in this column.
        if len(testdata)>0:
            df_test_mapped.loc[testdata.index,col]=xgb.predict(testdata)
        print("Imputed ",col," successfully👍🏻")

In [38]:
#xgbimputer()

In [39]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

In [47]:
knn_imputer = KNNImputer()
df_mapped = pd.DataFrame(knn_imputer.fit_transform(df_mapped.drop(columns=['Stage'])),columns=list(df_mapped.columns[:-1]))
df_test_mapped = pd.DataFrame(knn_imputer.transform(df_test_mapped),columns=list(df_test_mapped.columns))
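
`KNNImputer` finds neighbours by Euclidean distance, so with unscaled features a wide-ranging column such as `Alk_Phos` can dominate the distance. `MinMaxScaler` is imported above but never used; the block below is a minimal sketch of one way the two could be combined, assuming scaling before imputation is desired. The frames `train` and `test` are made-up stand-ins, not the notebook's data, and `n_neighbors=2` is chosen only because the example is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Made-up values on very different scales, with gaps.
train = pd.DataFrame({
    "Alk_Phos": [1500.8, 1835.4, np.nan, 617.1],
    "Albumin": [2.96, 2.96, 2.64, np.nan],
    "Bilirubin": [0.3, 0.3, 0.3, 1.2],
})
test = pd.DataFrame({
    "Alk_Phos": [np.nan, 1748.1],
    "Albumin": [3.62, 2.80],
    "Bilirubin": [1.4, 0.3],
})

scaler = MinMaxScaler()              # NaNs are ignored in fit and kept in transform
imputer = KNNImputer(n_neighbors=2)  # small k only because the example is tiny

# Fit both steps on the training data only, then reuse them on the test data.
train_scaled = scaler.fit_transform(train)
train_filled = scaler.inverse_transform(imputer.fit_transform(train_scaled))
test_filled = scaler.inverse_transform(imputer.transform(scaler.transform(test)))

# Back to DataFrames in the original units.
train_imputed = pd.DataFrame(train_filled, columns=train.columns, index=train.index)
test_imputed = pd.DataFrame(test_filled, columns=test.columns, index=test.index)
print(train_imputed)
print(test_imputed)
```

Because the imputer averages its neighbours, encoded categorical columns can come back as fractions (for example 0.4) and need rounding before they are treated as category codes again.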

In [17]:
# KNNImputer returns floats and averages its neighbours, so imputed codes can be
# fractional: round to the nearest code before casting to int.
# 'Stage' was dropped before imputation, so it is not cast here.
int_cols=['Drug','Status','Ascites','Hepatomegaly','Spiders','Edema','N_Days','Age']
df_mapped=df_mapped.round({c:0 for c in int_cols}).astype({c:'int64' for c in int_cols})
df_test_mapped=df_test_mapped.round({c:0 for c in int_cols}).astype({c:'int64' for c in int_cols})

In [48]:
df_remapped=df_mapped.copy()
df_remapped['Status']=df_remapped['Status'].map({0:'CL',1:'C',2:'D'})
# Imputed 'Drug' codes may be fractional; round so that every row maps to a label instead of NaN.
df_remapped['Drug']=df_remapped['Drug'].round().map({0:'D-penicillamine',1:'Placebo'})
df_remapped['Sex']=df_remapped['Sex'].map({0:'F',1:'M'})

In [49]:
df_test_remapped=df_test_mapped.copy()
df_test_remapped['Status']=df_test_remapped['Status'].map({0:'CL',1:'C',2:'D'})
df_test_remapped['Drug']=df_test_remapped['Drug'].round().map({0:'D-penicillamine',1:'Placebo'})
df_test_remapped['Sex']=df_test_remapped['Sex'].map({0:'F',1:'M'})

In [51]:
final=pd.get_dummies(data=df_remapped,columns=['Status','Drug','Sex'])
test_final=pd.get_dummies(data=df_test_remapped,columns=['Status','Drug','Sex'])
final.drop(columns=["Drug_Placebo","Sex_M","Status_D"],inplace=True)
test_final.drop(columns=["Drug_Placebo","Sex_M","Status_D"],inplace=True)
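
The cell above one-hot encodes the two frames separately and then drops hard-coded column names. If a category were absent from one frame, `get_dummies` would not create its column, the frames would end up with different columns, and `drop` would raise a `KeyError`. The sketch below shows one possible safeguard, assuming the training columns are the reference: reindex the test frame to match them. The frames `train_cat` and `test_cat` are invented for the illustration.

```python
import pandas as pd

train_cat = pd.DataFrame({"Status": ["C", "D", "CL"], "Sex": ["F", "M", "F"]})
test_cat = pd.DataFrame({"Status": ["C", "D", "D"], "Sex": ["F", "F", "F"]})  # no 'CL', no 'M'

train_dummies = pd.get_dummies(train_cat, columns=["Status", "Sex"])
test_dummies = pd.get_dummies(test_cat, columns=["Status", "Sex"])

# Give the test frame exactly the training columns, in the same order;
# categories that never occur in the test data become all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

After alignment, the model sees the same feature layout at prediction time as it did during fitting.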

In [52]:
X=final
y=df['Stage']

In [53]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [54]:
def hyperparameter_tuning(estimator,param,X,y,X_test):
    clf=GridSearchCV(estimator,param_grid=param,return_train_score=True,scoring="f1_weighted",cv=5)
    clf.fit(X,y)
    return clf    
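
`hyperparameter_tuning` returns the fitted `GridSearchCV` object, which carries the cross-validated results. As a small illustration of what can be read from it, the block below runs the same kind of search on synthetic data from `make_classification`. The data, the grid values and the name `search` are assumptions for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=3,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("mean cross-validated F1 of the best setting:", round(search.best_score_, 3))
```

`best_score_` is averaged over held-out folds, so it is a more honest estimate of generalisation than a score computed on the data the model was fitted on.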

In [55]:
def train_fit_check(clf,X_train_dataset,y_train_dataset):
    y_train_pred=clf.predict(X_train_dataset)
    # classification_report expects (y_true, y_pred) in that order
    print('Classification Report:\n',classification_report(y_train_dataset,y_train_pred))
    print('How well the model fit the training dataset:',clf.score(X_train_dataset,y_train_dataset))

In [102]:
#randomforestclassifier
#round1
param={
    "n_estimators":[100],
    "max_depth":[20],
    "min_samples_leaf":[2],
    "max_features":["sqrt"]
}
rfc=hyperparameter_tuning(RandomForestClassifier(),param,X,y,test_final)

In [56]:
rfc=RandomForestClassifier()
rfc.fit(X,y)

RandomForestClassifier()

In [57]:
train_fit_check(rfc,X,y)

Classification Report:
               precision    recall  f1-score   support

         1.0       1.00      1.00      1.00       465
         2.0       1.00      1.00      1.00      1507
         3.0       1.00      1.00      1.00      1322
         4.0       1.00      1.00      1.00      3506

    accuracy                           1.00      6800
   macro avg       1.00      1.00      1.00      6800
weighted avg       1.00      1.00      1.00      6800

How well the model fit the training dataset: 1.0
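
A perfect score on the data the forest was fitted on says little about how it will do on unseen patients, because fully grown trees can memorise their training rows. One common check, sketched below under the assumption that a hold-out split is acceptable here, is to set aside part of the labelled data and score on that instead. The arrays `X_demo` and `y_demo` are random stand-ins (the labels are deliberately unrelated to the features), not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the Stage labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = rng.integers(1, 5, size=400)  # labels 1..4, unrelated to X_demo

# Hold out 25% of the rows, keeping the class proportions similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, model.predict(X_tr), average="weighted")
val_f1 = f1_score(y_val, model.predict(X_val), average="weighted")
print(f"train F1: {train_f1:.2f}  validation F1: {val_f1:.2f}")
```

Even though these labels are pure noise, the training score comes out far higher than the validation score, which is why only the held-out number is informative.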


In [58]:
predict=pd.DataFrame(rfc.predict(test_final),columns=['Stage'])

In [119]:
y.value_counts()

4    3506
2    1507
3    1322
1     465
Name: Stage, dtype: int64

In [115]:
predict.value_counts()

Stage
4        3171
2          20
3           6
1           3
dtype: int64

In [59]:
predict.to_csv('predict.csv',index=False)

Stage 1 = Healthy Liver <br>
Stage 2 = Fatty Liver <br> 
Stage 3 = Fibrosis Liver <br>
Stage 4 = Cirrhosis Liver <br>
