## Problem Context

The number of patients with liver disease has been increasing continuously because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

## Importing the required Libraries.

In [None]:
# Importing the required Libraries.
import pandas as pd
import numpy as np
import sys
import os
import time
#ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
sns.set_style('white')

from sklearn.model_selection import cross_val_score

In [None]:
df = pd.read_csv('../input/indianliver/indian_liver_patient.csv')
df.describe()

In [None]:
print(df.columns)
print('*'*50)
for i in df.columns :
    print(i)
    print(df[i].describe())
    print('*'*50)

In [None]:
df.info()

So, the column "Albumin_and_Globulin_Ratio" has some missing values.

In [None]:
df[df['Albumin_and_Globulin_Ratio'].isnull()]
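
The missing-value check above can also be done per column, and dropping rows is not the only way to handle the gaps. Here is a minimal sketch on a made-up miniature frame (`toy` and its values are hypothetical stand-ins, not the real data) that compares dropping incomplete rows with filling them with the column median:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataframe (values are made up).
toy = pd.DataFrame({
    "Age": [65, 62, 58, 72, 46],
    "Albumin_and_Globulin_Ratio": [0.90, np.nan, 1.00, 0.40, np.nan],
})

# Count the missing values in every column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option A: drop incomplete rows (the approach this notebook takes later).
dropped = toy.dropna()

# Option B: keep every row and fill the gaps with the column median.
median_ratio = toy["Albumin_and_Globulin_Ratio"].median()
imputed = toy.fillna({"Albumin_and_Globulin_Ratio": median_ratio})

print(len(dropped), len(imputed))
```

With only a handful of missing rows either choice is probably fine; imputation matters more when dropping would throw away a large share of the data.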

Let's do some research on the dataset and try to understand what each column is telling us. After all, we data scientists love stories.

### Column Name

**Age** - Tells the person's age.
        
>         Google - This we all know....duh

**Gender** - (Male or Female) Tells the person's gender. This is a very controversial column as we now know that there can be a spectrum of genders. But here we will only consider two genders.
            
>         ME - My sincere apologies to the people who do not identify as "Male" or "Female". I hope in the near future we will have a dataset where the spectrum of genders is included.  
         Google - Oh my! you knew that gender is not binary but a spectrum. Impressive..
         ME - Thank you google.

**Total_Bilirubin** - Well, I'm a mechanical engineer and not a doctor. So obviously I have no clue what this means. Let's ask Google baba.

>         Google - A bilirubin test measures the amount of bilirubin in your blood. It’s used to help find the cause of health conditions like jaundice, anemia, and liver disease.
>         Bilirubin is an orange-yellow pigment that occurs normally when part of your red blood cells break down. Your liver takes the bilirubin from your blood and changes its chemical make-up so that most of it is passed through your poop as bile.
>         If your bilirubin levels are higher than normal, it’s a sign that either your red blood cells are breaking down at an unusual rate or that your liver isn’t breaking down waste properly and clearing the bilirubin from your blood. Another option is that there’s a problem somewhere along the pathway that gets the bilirubin out of your liver and into your stool. 
        
>         ME - Thank you Google. So fellows, I think now you have some knowledge on this as I do. And if you knew it already, you are awesome.
        
**Direct_Bilirubin** - Closely related to "Total_Bilirubin". Our own Google will tell us the difference.
        
>         Google - Bilirubin attached by the liver to glucuronic acid, a glucose-derived acid, is called direct, or conjugated, bilirubin. Bilirubin not attached to glucuronic acid is called indirect, or unconjugated, bilirubin. All the bilirubin in your blood together is called total bilirubin. 
>         
>         ME - Damn you google, how much information do you have.....
        
**Alkaline_Phosphotase** - .......

>         Google - Alkaline phosphatase (ALP) is an enzyme in a person's blood that helps break down proteins. The body uses ALP for a wide range of processes, and it plays a particularly important role in liver function and bone development. Using an ALP test, it is possible to measure how much of this enzyme is circulating in a person’s blood.
>         
>         ME - I knew this....
>         Google - No, you don't
>         Me - Yeah.....you know everything....

**Alamine_Aminotransferase** - First of all, it is "Alanine" and not "Alamine". The rest our friend Google will tell.

>         Google - Alanine aminotransferase (ALT) is an enzyme found primarily in the liver and kidney. It was originally referred to as serum glutamic pyruvic transaminase (SGPT). Normally, a low level of ALT exists in the serum. ALT is increased with liver damage and is used to screen for and/or monitor liver disease. Alanine aminotransferase (ALT) is usually measured concurrently with AST as part of a liver function panel to determine the source of organ damage. 
>         
>         ME - So, we need to change the column name to avoid confusion.
        
**Aspartate_Aminotransferase** - Help Google.....

>         Google - AST (aspartate aminotransferase) is an enzyme that is found mostly in the liver, but also in muscles. When your liver is damaged, it releases AST into your bloodstream. An AST blood test measures the amount of AST in your blood. The test can help your health care provider diagnose liver damage or disease.
>         
>         ME - WOOAAAHHH.......
        
**Total_Protiens** - (Spelled this way in the dataset; it means total proteins.) Albumin and globulin are two types of protein in your body. The total protein test measures the total amount of albumin and globulin in your body. It's used as part of your routine health checkup. It may also be used if you have unexpected weight loss, fatigue, or the symptoms of a kidney or liver disease.
        
>         ME - At last, something I knew.        
>         Google - You googled it. Don't play smart with me.
>         ME - uughhhh........There's no pleasing you.
        
**Albumin** - I think it's related to the protein in our bodies....
            
>         Google - Albumin is a protein made by your liver. Albumin helps keep fluid in your bloodstream so it doesn't leak into other tissues. It also carries various substances throughout your body, including hormones, vitamins, and enzymes. Low albumin levels can indicate a problem with your liver or kidneys.
>         
>         ME - Close enough!!...
>         Google - *Face palms*

**Albumin_and_Globulin_Ratio** - This one's pretty eas......

>         Google - The Albumin to Globulin ratio (A:G) is the ratio of albumin present in serum in relation to the amount of globulin. The ratio can be interpreted only in light of the total protein concentration. Very generally speaking, the normal ratio in most species approximates 1:1.
>         
>         ME - I give up....
>         Google - Who made you a data scientist
>         ME - heyyy!! That's mean..
>         Google - The Arithmetic Mean is the average of the numbe....
>         ME - I know THAT...

**Dataset** - This is labelled incorrectly. From my perspective it should be "Liver_disease", indicating whether the patient has liver disease or not.

So, now that we have a better understanding of the dataset, let's first make the required changes.

## A Little bit of cleaning is required

In [None]:
# Re-naming the columns
df = df.rename(columns={'Dataset': 'Liver_disease', 'Alamine_Aminotransferase': 'Alanine_Aminotransferase'})

In [None]:
# Renaming Done
df.describe()

In [None]:
# Dropping Null Values
df = df.dropna()
# Changing the values in "Liver_disease" column from 1/2 to 0/1
df['Liver_disease'] = df['Liver_disease'] - 1 
# Converting Gender column into numeric data (Female -> 0, Male -> 1)
# Use a new variable name so the LabelEncoder class is not overwritten
label_encoder = LabelEncoder()
df['Is_male'] = label_encoder.fit_transform(df['Gender'])
df = df.drop(columns='Gender')
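
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why "Female" ends up as 0 and "Male" as 1. A small sketch on a made-up series (`gender` is hypothetical sample data) shows this, next to an explicit mapping that spells the encoding out:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, standing in for the real Gender column.
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"])

# LabelEncoder sorts the class names, then numbers them 0, 1, ...
encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)
print(list(encoder.classes_))
print(list(encoded))

# An explicit mapping gives the same codes and documents the meaning in the code.
is_male = gender.map({"Female": 0, "Male": 1})
print(list(is_male))
```

The explicit mapping is a matter of taste; it simply avoids relying on alphabetical order to know which code means what.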

In [None]:
X = df[['Age', 'Total_Bilirubin', 
        'Direct_Bilirubin',
        'Alkaline_Phosphotase',
        'Alanine_Aminotransferase', 'Aspartate_Aminotransferase',
       'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio', 'Is_male']]
y = df['Liver_disease']

In [None]:
# Validate each class to understand if the dataset is imbalanced.

print ('Total Unhealthy Livers :  {} and its percentage is {} %'.format(df.Liver_disease.value_counts()[0], round(df.Liver_disease.value_counts()[0]/df.Liver_disease.value_counts().sum()*100,2)) )
print ('Total Healthy Livers :  {} and its percentage is {} %'.format(df.Liver_disease.value_counts()[1], round(df.Liver_disease.value_counts()[1]/df.Liver_disease.value_counts().sum()*100,2)) )

In [None]:
df.skew(axis = 0, skipna = True) 

#### Here, in the column "Liver_disease", **0** *indicates that the person has some kind of liver disease, i.e. the patient's liver is unhealthy,* and **1** *indicates that the person's liver is healthy.*
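
The class counts and percentages printed earlier can also be computed in one pass with `value_counts`. A minimal sketch, assuming a made-up label series (`labels` is hypothetical; the real column is `df['Liver_disease']`):

```python
import pandas as pd

# Made-up labels: 0 = unhealthy liver, 1 = healthy liver.
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Liver_disease")

# Absolute counts and percentage share of each class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True).mul(100).round(2)

# Combine both into one small table with readable row names.
summary = pd.DataFrame({"count": counts, "percent": shares})
summary.index = summary.index.map({0: "Unhealthy liver", 1: "Healthy liver"})
print(summary)
```

If one class clearly dominates, plain accuracy becomes a weak yardstick, which is worth keeping in mind when the models are compared later.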

In [None]:
# Plotting the box plots 
plt.figure(figsize=[16,12])

plt.subplot(231)
plt.boxplot(x = X['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (years)')

plt.subplot(232)
plt.boxplot(X['Total_Bilirubin'], showmeans = True, meanline = True)
plt.title('Total Bilirubin Boxplot')
plt.ylabel('Total Bilirubin (mg/dL)')

plt.subplot(233)
plt.boxplot(X['Direct_Bilirubin'], showmeans = True, meanline = True)
plt.title('Direct Bilirubin Boxplot')
plt.ylabel('Direct Bilirubin (mg/dL)')

plt.subplot(234)
plt.hist(x = [X[y==1]['Is_male'], X[y ==0]['Is_male']], 
         stacked=True, color = ['g','r'],label = ['Healthy','Patient'])
plt.title('Gender Histogram by patients')
plt.xlabel('Gender [0 - female : 1 - male]')
plt.ylabel('# of people')
plt.legend()

plt.subplot(235)
plt.boxplot(x = X['Alkaline_Phosphotase'], showmeans = True, meanline = True)
plt.title('Alkaline Phosphotase')
plt.ylabel('Alkaline Phosphotase (International Units /Litre)')

plt.subplot(236)
plt.boxplot(X['Alanine_Aminotransferase'], showmeans = True, meanline = True)
plt.title('Alanine Aminotransferase Boxplot')
plt.ylabel('Alanine Aminotransferase (units/L)')

As we can see from the gender histogram, the number of males with liver disease in this dataset is far higher than the number of females. Ladies, you can relax a little bit. 
        
        ME - Wanna grab a coffee sometime? 
        Google - Coffee is a brewed drink prep....
        ME - I did not ask you!!!

In [None]:
plt.figure(figsize=[16,12])
plt.subplot(231)
plt.boxplot(X['Aspartate_Aminotransferase'], showmeans = True, meanline = True)
plt.title('Aspartate Aminotransferase Boxplot')
plt.ylabel('Aspartate_Aminotransferase (units/L)')


plt.subplot(232)
plt.boxplot(X['Total_Protiens'], showmeans = True, meanline = True)
plt.title('Total Proteins Boxplot')
plt.ylabel('Total Proteins (g/dL)')

plt.subplot(233)
plt.boxplot(X['Albumin'], showmeans = True, meanline = True)
plt.title('Albumin Boxplot')
plt.ylabel('Albumin (g/dL)')

As we can see, many of the boxplots show a large number of outliers. But these cannot be ignored, as such values are still possible: a person can have a very high level of Alanine Aminotransferase, which clearly indicates liver problems.

This will be pretty much clear in the following plots.
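
Before those plots, it can help to put a number on how many points a boxplot flags. The sketch below assumes the usual 1.5 × IQR whisker rule (the matplotlib default); `count_iqr_outliers` is a hypothetical helper and the sample values are made up:

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up enzyme readings with two extreme values.
alt = pd.Series([16, 18, 20, 22, 25, 27, 30, 35, 64, 2000])
n_outliers = count_iqr_outliers(alt)
print(n_outliers)
```

Counting is only a diagnostic here; as noted above, the extreme readings are medically plausible, so they are kept rather than removed.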

In [None]:
fig, saxis = plt.subplots(2, 3,figsize=(16,12))

sns.barplot(y = 'Alanine_Aminotransferase', x = 'Liver_disease', data=df, ax = saxis[0,0])
sns.pointplot(y = 'Total_Bilirubin', x = 'Liver_disease', data=df, ax = saxis[0,1])
sns.pointplot(y = 'Direct_Bilirubin', x = 'Liver_disease', data=df, ax = saxis[0,2])


sns.barplot(y = 'Alkaline_Phosphotase', x = 'Liver_disease', data=df, ax = saxis[1,0])
sns.barplot(y = 'Aspartate_Aminotransferase', x = 'Liver_disease', data=df, ax = saxis[1,1])
sns.boxplot(y = 'Total_Protiens', x = 'Liver_disease', data=df, ax = saxis[1,2])

As we can see, the higher the value of an individual test, the higher the risk of having one or more liver-related diseases. So eat healthy, guys!!

In [None]:
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(df)

Now, after seeing all those graphs, I'm sure you are pretty bored. Don't worry, we are now coming to the most interesting part.

<center>THE MODELLING!!!!</center>

No? Not interested? Don't you want to see how this dataset can help us? 
If yes, read on...

## Model The Data

**Little Note** - When it comes to data modeling, the beginner’s question is always, "What is the best machine learning algorithm?" The answer lies in the [No Free Lunch Theorem (NFLT)](http://robertmarks.org/Classes/ENGR5358/Papers/NFL_4_Dummies.pdf) of machine learning. In short, NFLT states that there is no super algorithm that works best in all situations, for all datasets. So the best approach is to try multiple machine learning algorithms (MLAs), tune them, and compare them for your specific scenario.
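
The "try several algorithms and compare" idea can be sketched as a single loop. This is only an illustration on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, and `candidates` are assumptions, not part of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A few candidate models, all evaluated the same way.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(min_samples_leaf=20, random_state=1),
}

# Mean 5-fold cross-validated accuracy per model.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()

# Print the models from best to worst mean accuracy.
for name, mean_acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mean_acc:.3f}")
```

Evaluating every model with the same folds and the same metric keeps the comparison fair, which is the practical takeaway of the theorem.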

**Before modelling, let us normalize the features and split the data into train and test sets**

In [None]:
from sklearn import preprocessing
# Note: normalize() rescales each row (patient) to unit L2 norm; it does not scale the columns
X_scaler = preprocessing.normalize(X)

In [None]:
# Splitting the data 
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_scaler, y, random_state = 0)

print("Train Shape: {}".format(X_train.shape))
print("Test Shape: {}".format(X_test.shape))
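
One thing worth knowing about `preprocessing.normalize`: it works row by row, so it is not the same as putting every feature on a common scale. The sketch below contrasts it with `StandardScaler` on made-up numbers (`X_demo` is hypothetical; `StandardScaler` is shown as an alternative, not as what this notebook uses):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Made-up rows of [Age, Alkaline_Phosphotase]; the first two rows are proportional.
X_demo = np.array([[30.0, 300.0], [40.0, 400.0], [60.0, 150.0]])

# normalize(): every row is divided by its own L2 norm.
row_normalized = normalize(X_demo)
row_norms = np.linalg.norm(row_normalized, axis=1)
print(row_normalized)

# StandardScaler: every column gets zero mean and unit variance.
scaler = StandardScaler().fit(X_demo)
standardized = scaler.transform(X_demo)
print(standardized)
```

Note how the two proportional rows become identical after row normalization. With column scaling, the scaler would normally be fitted on the training split only and then applied to the test split.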


### Logistic Regression

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)

In [None]:
# Use score method to get accuracy of model
score = lr.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
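
A side note on the ROC AUC above: passing hard 0/1 predictions to `roc_auc_score` evaluates a single operating point, while passing predicted probabilities uses the full ranking of the samples. A minimal sketch on synthetic data (all names here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard class labels (what the cell above does).
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of class 1 (uses the whole ranking).
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"labels: {auc_from_labels:.3f}  probabilities: {auc_from_scores:.3f}")
```

The same pattern should work for the other classifiers that expose `predict_proba`; models without it often offer `decision_function` instead.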

In [None]:
cm1 = metrics.confusion_matrix(y_test, y_pred)
plt.figure(figsize=(9,9))
sns.heatmap(cm1, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)
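
A confusion matrix like the one plotted above can also be turned into per-class recall, which matters here because class 0 (unhealthy liver) is the one we least want to miss. A small sketch with made-up labels (`y_true` and `y_hat` are hypothetical):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = unhealthy liver, 1 = healthy liver.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# Recall per class = correct predictions in a row / row total.
disease_recall = cm[0, 0] / cm[0].sum()
healthy_recall = cm[1, 1] / cm[1].sum()

# Same number for class 0, straight from scikit-learn.
disease_recall_sklearn = recall_score(y_true, y_hat, pos_label=0)
print(disease_recall, healthy_recall)
```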

### Naive Bayes Model

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train,y_train)
y_pred_nb = nb.predict(X_test)

In [None]:
score = nb.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_nb, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_nb))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_nb)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm2 = metrics.confusion_matrix(y_test, y_pred_nb)
plt.figure(figsize=(9,9))
sns.heatmap(cm2, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Wistia');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

### Stochastic Gradient Descent

In [None]:
# Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier
sg = SGDClassifier()
sg.fit(X_train,y_train)
y_pred_sg = sg.predict(X_test)

In [None]:
score = sg.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_sg, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_sg))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_sg)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm3 = metrics.confusion_matrix(y_test, y_pred_sg)
plt.figure(figsize=(9,9))
sns.heatmap(cm3, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Greens');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

### K-Nearest Neighbours

In [None]:
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
hist = []
for i in range(1,10):
    clf = KNeighborsClassifier(n_neighbors=i)
    cross_val = cross_val_score(clf, X_scaler, y, cv=5)
    hist.append(np.mean(cross_val))
plt.plot(range(1, 10), hist)  # x-axis shows the actual n_neighbors values
plt.title('Cross Validations score for KNeighborsClassifier')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.grid()
plt.show()
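
The manual loop above can also be written with `GridSearchCV`, searching on the training split only so the test split stays untouched for the final check. A sketch on synthetic data (names are illustrative assumptions, not this notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 5-fold cross-validated search over n_neighbors = 1..9, training data only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 10))},
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.best_score_, 3))

# The refitted best model is then scored once on the held-out test split.
test_accuracy = search.score(X_te, y_te)
print(round(test_accuracy, 3))
```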

In [None]:
knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train,y_train)
y_pred_knn = knn.predict(X_test)

In [None]:
score = knn.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_knn, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_knn))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_knn)
# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm4 = metrics.confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(9,9))
sns.heatmap(cm4, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Accent');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth = None , random_state = 1 , max_features = None, min_samples_leaf =20)
dtree.fit(X_train,y_train)
y_pred_dtree = dtree.predict(X_test)

In [None]:
score = dtree.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_dtree, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_dtree))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_dtree)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm5 = metrics.confusion_matrix(y_test, y_pred_dtree)
plt.figure(figsize=(9,9))
sns.heatmap(cm5, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'viridis');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

### Random Forest Classifier

In [None]:
# Random Forest 
from sklearn.ensemble import RandomForestClassifier

hist1 = []
for i in range(1,10):
    clf = RandomForestClassifier(n_estimators=80, max_depth=i, random_state=0)
    cross_val = cross_val_score(clf, X_train, y_train, cv=5)
    hist1.append(np.mean(cross_val))
plt.plot(range(1, 10), hist1)  # x-axis shows the actual max_depth values
plt.title('Cross Validations score for RandomForestClassifier')
plt.xlabel('Max_depth')
plt.ylabel('Accuracy')
plt.grid()

In [None]:
ran_for = RandomForestClassifier(n_estimators=80, max_depth=8, random_state=0)
ran_for.fit(X_train,y_train)
y_pred_ran = ran_for.predict(X_test)

In [None]:
score = ran_for.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_ran, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_ran))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_ran)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm6 = metrics.confusion_matrix(y_test, y_pred_ran)
plt.figure(figsize=(9,9))
sns.heatmap(cm6, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'viridis');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

### Support Vector Machine

In [None]:
# Support Vector machine Model
from sklearn.svm import SVC
grid = [0.00001, 0.0001, 0.001, 0.01, 0.1]
hist = []
for val in grid:
    clf = SVC(gamma=val)
    cross_val = cross_val_score(clf, X, y, cv=5)
    hist.append(np.mean(cross_val))
plt.plot([str(i) for i in grid], hist)
plt.title('Cross Validations score for SVC')
plt.xlabel('gamma')
plt.ylabel('Accuracy')
plt.grid()
plt.show()


In [None]:
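# gamma is ignored by the linear kernel, so it is not passed here
# (the gamma search above applies to SVC's default RBF kernel)
svm = SVC(kernel="linear", C=0.025, random_state=0)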
svm.fit(X_train,y_train)
y_pred_svm = svm.predict(X_test)
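
A quick check of how `gamma` interacts with the kernel choice: with `kernel="linear"` the parameter has no effect, so a gamma found by cross-validating the default RBF kernel does not carry over. A sketch on synthetic data (names are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Two linear SVMs that differ only in gamma.
linear_small_gamma = SVC(kernel="linear", C=0.025, gamma=0.01).fit(X_demo, y_demo)
linear_large_gamma = SVC(kernel="linear", C=0.025, gamma=10.0).fit(X_demo, y_demo)

# Their predictions match, because the linear kernel never uses gamma.
same_linear = np.array_equal(
    linear_small_gamma.predict(X_demo), linear_large_gamma.predict(X_demo)
)
print("linear kernel unaffected by gamma:", same_linear)

# With the RBF kernel, gamma does change the model; compare training accuracy.
for gamma in (0.01, 10.0):
    rbf = SVC(kernel="rbf", gamma=gamma).fit(X_demo, y_demo)
    print(gamma, round(rbf.score(X_demo, y_demo), 3))
```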

In [None]:
score = svm.score(X_test, y_test)
print("Score of the model is - ",score)
print("Report card of this model - ")
print(metrics.classification_report(y_test, y_pred_svm, digits=3))
print("Accuracy score - ", metrics.accuracy_score(y_test,y_pred_svm))

In [None]:
from sklearn.metrics import roc_auc_score
test_roc_auc = roc_auc_score(y_test, y_pred_svm)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

In [None]:
cm7 = metrics.confusion_matrix(y_test, y_pred_svm)  # use the SVM predictions, not the random forest ones
plt.figure(figsize=(9,9))
sns.heatmap(cm7, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Accent_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)

**Conclusion** - The highest test accuracy, 75.17%, was achieved by the Decision Tree model.

**The Decision Tree model will be used, as it has the highest accuracy among the models that were tried.**
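
To judge how good an accuracy figure is, it helps to compare it with a baseline that always predicts the majority class. The sketch below uses `DummyClassifier` on synthetic, imbalanced data (the data and names are assumptions for illustration, not results from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 70% / 30%) so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

# A decision tree with the same settings as used in this notebook.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=1).fit(X_tr, y_tr)
tree_acc = tree.score(X_te, y_te)

print(f"baseline: {baseline_acc:.3f}  decision tree: {tree_acc:.3f}")
```

On the real data, the class percentages printed earlier give the baseline to beat; the gap between the model and that baseline says more than the accuracy alone.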

In [None]:
#print the true and predicted values
dictionary = {'Actual values': y_test, 'Predicted values': y_pred_dtree}
pd.DataFrame.from_dict(dictionary)

So, after a long journey of data visualisation, data cleaning, data modelling etc., we have finally got a model that we can use.

>     So, the next question is - Is this the end?
>     The answer is - I don't know. I'm no expert guys as I'm also learning. So, if anyone reading this knows what can be done more, kindly help me out here.
    
Till then, Have a good day!!