# LIVER DISEASE PREDICTION

**In this project, we use the Indian Liver Patient Records dataset from Kaggle.**

**We are going to predict whether a patient has liver disease or not based on certain features.**

**We will check whether features such as total proteins and albumin are associated with the disease.**

**Importing the Necessary Libraries:**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

**For the purpose of prediction, we need to import more libraries. As we move on, we will import them.**

**Reading the Dataset:**

In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
patients=pd.read_csv('/kaggle/input/indian-liver-patient-records/indian_liver_patient.csv')

In [None]:
patients.head()

In [None]:
patients.shape

In [None]:
patients.info()

In [None]:
patients.describe().T

**So there are 583 rows and 11 columns in our dataset.**

**Let us convert the Gender column to a numerical format (1 for male, 0 for female):**

In [None]:
patients['Gender']=patients['Gender'].apply(lambda x:1 if x=='Male' else 0)
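
**The lambda above encodes every value other than 'Male' as 0, including typos and missing entries. As an optional alternative, here is a minimal sketch with an explicit mapping, so that unexpected values show up as NaN. The toy frame and the names `gender_codes` and `Gender_encoded` are made up for illustration.**

```python
import pandas as pd

# Toy frame standing in for `patients`; the values are made up for illustration.
toy = pd.DataFrame({"Gender": ["Male", "Female", "male", None]})

# Explicit mapping: anything that is not exactly 'Male' or 'Female' becomes NaN
# instead of being silently encoded as 0.
gender_codes = {"Male": 1, "Female": 0}
toy["Gender_encoded"] = toy["Gender"].map(gender_codes)

# Count the rows that could not be encoded so they can be inspected.
n_unmapped = int(toy["Gender_encoded"].isna().sum())
print(toy)
print("Unmapped rows:", n_unmapped)
```

**If the count of unmapped rows is zero on the real data, both approaches give the same encoding.**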

In [None]:
patients.head()

**The column named Dataset is our target and takes two values: 1 marks a patient with liver disease and 2 marks a patient without liver disease.**
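
**Many scikit-learn examples use a 0/1 target. Here is a minimal sketch of recoding the Dataset column, assuming that 1 marks a liver patient and 2 a non-liver patient, as described for this dataset. The toy frame and the `has_disease` column name are illustrative and are not used in the rest of the notebook.**

```python
import pandas as pd

# Toy frame standing in for `patients`; the values are made up for illustration.
toy = pd.DataFrame({"Dataset": [1, 2, 1, 1, 2]})

# Assumption: 1 = liver disease, 2 = no liver disease.
toy["has_disease"] = toy["Dataset"].map({1: 1, 2: 0})

# Class counts after recoding (1 = disease, 0 = no disease).
class_counts = toy["has_disease"].value_counts().to_dict()
print(class_counts)
```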

**Let us check the number of males and females using a bar plot.**

In [None]:
patients['Gender'].value_counts().plot.bar(color='peachpuff')
plt.show()

**The graph above shows that there are more males (encoded as 1) than females (encoded as 0).**

**Let us plot the class counts of our Dataset column:**

In [None]:
patients['Dataset'].value_counts().plot.bar(color='blue')
plt.show()

**Let us check for the null values:**

In [None]:
patients.isnull().sum()

**We can see that there are 4 null values in the Albumin and Globulin Ratio column.**

**Let us fill these null values by imputing the mean of that column.**

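In [None]:
ag_mean = patients['Albumin_and_Globulin_Ratio'].mean()
ag_mean

In [None]:
# Fill only this column with its computed mean instead of a hard-coded 0.94 applied to every column
patients['Albumin_and_Globulin_Ratio'] = patients['Albumin_and_Globulin_Ratio'].fillna(ag_mean)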

**Yes! Now we have filled the null values with the mean of that column.** 

In [None]:
patients.isnull().sum()

**So we have filled all the null values and we are ready to go!**

**Let us check the age distribution of the patients.**

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(25,10))
patients['Age'].value_counts().plot.bar(color='darkviolet')
plt.show()

**Let us view the pairplot of patients based on Gender.**

In [None]:
plt.rcParams['figure.figsize']=(10,10)
sns.pairplot(patients,hue='Gender')
plt.show()

In [None]:
sns.pairplot(patients)
plt.show()

**Let us compare Albumin with the Albumin and Globulin Ratio using a scatterplot.**

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x="Albumin", y="Albumin_and_Globulin_Ratio",color='mediumspringgreen',data=patients);
plt.show()

**Let us compare the total protein levels by Gender.**

In [None]:
plt.figure(figsize=(8,6))
patients.groupby('Gender').sum()["Total_Protiens"].plot.bar(color='coral')
plt.show()

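**So the summed total protein is higher for males than for females. Note that Total_Protiens is a blood test measurement rather than dietary intake, and that a sum grows with group size: since there are more male patients, the taller bar does not by itself mean a higher level per patient.**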
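
**Because a sum grows with the number of rows in each group, it helps to place the count and the mean next to it. The sketch below uses a small made-up frame (not the real data) with the dataset's own column spelling `Total_Protiens`; the name `summary` is illustrative.**

```python
import pandas as pd

# Toy frame standing in for `patients`; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": [1, 1, 1, 0],                   # 1 = male, 0 = female
    "Total_Protiens": [6.0, 6.5, 7.0, 7.5],
})

# Count, sum and mean per gender: the larger group can have the larger sum
# even when its per-patient mean is lower.
summary = toy.groupby("Gender")["Total_Protiens"].agg(["count", "sum", "mean"])
print(summary)
```

**On the real data, the same `agg` call can be run on `patients` and the `mean` column plotted instead of the sum.**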

**Let us compare male and female based on Albumin Level.**

In [None]:
plt.figure(figsize=(8,6))
patients.groupby('Gender').sum()['Albumin'].plot.bar(color='midnightblue')
plt.show()

**The summed Albumin level is higher for males than for females, which again partly reflects the larger number of male patients.**

**Finally, let us compare them based on the bilirubin content.**

In [None]:
plt.figure(figsize=(8,6))
patients.groupby('Gender').sum()['Total_Bilirubin'].plot.bar(color='fuchsia')
plt.show()

**We can see that the summed Total Bilirubin is also higher for males than for females.**

**Another point to note is that elevated bilirubin is commonly associated with liver disease, so patients with higher bilirubin values are more likely to be liver patients.**
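
**One way to check this on the data is to compare bilirubin between the two Dataset groups. This is a minimal sketch on a made-up frame, assuming Dataset = 1 marks liver patients and Dataset = 2 marks patients without liver disease; the name `by_status` is illustrative.**

```python
import pandas as pd

# Toy frame standing in for `patients`; the values are made up for illustration.
toy = pd.DataFrame({
    "Dataset": [1, 1, 1, 2, 2],
    "Total_Bilirubin": [0.9, 3.2, 7.5, 0.7, 1.0],
})

# The median is reported next to the mean because a few very high values
# can pull the mean upwards.
by_status = toy.groupby("Dataset")["Total_Bilirubin"].agg(["count", "mean", "median"])
print(by_status)
```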

**Let us check the correlation between the features using a heatmap:**

In [None]:
corr=patients.corr()

In [None]:
plt.figure(figsize=(20,10)) 
sns.heatmap(corr,cmap="Greens",annot=True)
plt.show()
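
**A heatmap with this many cells can be hard to read. Here is a minimal sketch of pulling out only the correlations with the target and ranking them by absolute value. The frame is made up, and the names `target_corr` and `ranked` are illustrative. Since Dataset = 1 is assumed to mean disease and 2 no disease, a negative correlation with Dataset means the feature tends to be higher in liver patients.**

```python
import pandas as pd

# Toy frame standing in for `patients`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 40, 55, 60, 35, 50],
    "Total_Bilirubin": [0.7, 0.8, 4.1, 6.3, 0.9, 3.0],
    "Albumin": [4.2, 4.0, 2.9, 2.5, 3.9, 3.1],
    "Dataset": [2, 2, 1, 1, 2, 1],
})

# Correlation of every feature with the target, without the target itself.
target_corr = toy.corr()["Dataset"].drop("Dataset")

# Strongest relationships first, keeping the sign of each correlation.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```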

**Let us now start building our model.**

**In order to build and evaluate a model, we split the data into a training set and a test set.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
patients.columns

**Now let us define our X and y.**

**Here X holds our features and y is our target.**

In [None]:
X=patients[['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio']]
y=patients['Dataset']

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=123)

**We split the data in a 70:30 ratio: 70% for training and 30% for testing.**

**Now, in order to build our model, we use Logistic Regression.**
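
**When one class is more common than the other, a plain random split can leave the two parts with slightly different class shares. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the shares equal. The sketch below uses synthetic data, and the names `X_toy`, `y_toy`, `train_share` and `test_share` are illustrative.**

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 70 rows of class 1 and 30 rows of class 2.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 70 + [2] * 30, name="Dataset")

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

train_share = (y_tr == 1).mean()
test_share = (y_te == 1).mean()
print("Share of class 1 in train:", train_share)
print("Share of class 1 in test:", test_share)
```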

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
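
**The imported `accuracy_score` and `confusion_matrix` can be used to evaluate a fitted model on the held-out test set. Here is a minimal sketch on synthetic data with labels 1 and 2, mirroring the Dataset column. The data, the variable names, and `max_iter=1000` (chosen to help the solver converge) are assumptions for illustration.**

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first feature plus some noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows of the confusion matrix are true labels, columns are predicted labels.
test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred, labels=[1, 2])
print("Test accuracy:", test_accuracy)
print(cm)
```

**In the notebook, the same two calls can be applied to `y_test` and `logmodel.predict(X_test)`.**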

**Now, let us import cross_val_score and KFold and run 5-fold cross-validation on the training set.**

**Finally, we calculate the mean cross-validated accuracy of our model.**

In [None]:
from sklearn.model_selection import KFold, cross_val_score
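# random_state only has an effect with shuffle=True (recent scikit-learn raises an error otherwise), so it is omitted
kfold = KFold(n_splits=5)
# The L1 penalty needs a solver that supports it, such as liblinear
logmodel = LogisticRegression(C=1, penalty='l1', solver='liblinear')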
results = cross_val_score(logmodel, X_train,y_train,cv = kfold)
print(results)
print("Accuracy:",results.mean()*100)

**Thus our model reaches a mean 5-fold cross-validation accuracy of about 71.5% on the training set.**
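
**Accuracy is easier to interpret next to a simple baseline, because with imbalanced classes a model that always predicts the majority class already scores well. Here is a minimal sketch on synthetic data in which the features carry no signal, using scikit-learn's `DummyClassifier`. The data and the names `baseline_acc` and `model_acc` are illustrative, and the numbers say nothing about the real dataset.**

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 140 rows of class 1, 60 rows of class 2, random features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 140 + [2] * 60)
rng.shuffle(y_toy)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: always predict the most frequent class seen in the training folds.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X_toy, y_toy, cv=kfold).mean()

# The model is only useful to the extent that it beats the baseline.
model = LogisticRegression(max_iter=1000)
model_acc = cross_val_score(model, X_toy, y_toy, cv=kfold).mean()

print("Baseline accuracy:", baseline_acc * 100)
print("Model accuracy:", model_acc * 100)
```

**On the real data, the same comparison can be run with `X_train` and `y_train` to see how far the logistic regression is above the majority-class baseline.**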