# Analysis and Prediction:
## Indian Liver Patients - records collected from the North East of Andhra Pradesh, India

__About the dataset__
> This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The "Dataset" column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records. Any patient whose age exceeded 89 is listed as being of age "90".

> Based on chemical compounds (bilirubin, albumin, proteins, alkaline phosphatase) present in the human body <br> 
and tests such as SGOT and SGPT, the outcome indicates whether the person is a patient, i.e. __needs to be diagnosed or not__.

In [None]:
# Importing basic packages for data preprocessing
import numpy as np
import pandas as pd
import os
# Importing packages for plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
print(os.listdir("../input"))
dataset=pd.read_csv("../input/indian_liver_patient.csv")

In [None]:
dataset.head()

In [None]:
dataset.describe()

In [None]:
dataset.isnull().sum()

*Remove rows with missing values*
> As I do not have expert knowledge of the values in the Albumin_and_Globulin_Ratio column, <br>
I prefer to remove the rows with missing values.
Only 4 rows are affected, so little data is lost.

In [None]:
dataset[dataset['Albumin_and_Globulin_Ratio'].isnull()].index.tolist()

In [None]:
# Using the above row indexes, removing rows with missing values in Albumin_and_Globulin_Ratio column
dataset.drop(dataset.index[[209,241,253,312]], inplace=True)
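
The hard-coded row positions above work for this particular file, but they would silently drop the wrong rows if the CSV changed. A minimal sketch of an alternative, assuming the same column name and using a made-up toy frame in place of the real data, is to let pandas find the missing rows itself:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the liver dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Albumin": [3.3, 3.2, 2.4, 4.0],
    "Albumin_and_Globulin_Ratio": [0.90, np.nan, 0.74, np.nan],
})

# Drop only the rows where the ratio column is missing
cleaned = toy.dropna(subset=["Albumin_and_Globulin_Ratio"])

print(cleaned.index.tolist())        # remaining row labels
print(cleaned.isnull().sum().sum())  # total missing values left
```

On the real data the equivalent call would be `dataset.dropna(subset=['Albumin_and_Globulin_Ratio'], inplace=True)`.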

In [None]:
# Creating copy of the dataset
dataset_orig = dataset.copy()

----

Label Encoder <br> 
Transforming character values to numerics

In [None]:
# Transforming Gender column (independent variable) to numerics (0s and 1s)
# Importing required package
from sklearn.preprocessing import LabelEncoder

In [None]:
labelencoder_x = LabelEncoder()
dataset['Gender'] = labelencoder_x.fit_transform(dataset['Gender'])

In [None]:
dataset['Gender'].head()
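
To see which gender received which code, the fitted encoder can be inspected. The sketch below assumes the Gender column holds the strings `Male` and `Female`. `LabelEncoder` stores its classes in sorted order, and the position of each class is its code:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Male", "Female", "Male", "Female"])

# classes_ holds the original labels in sorted order; position == assigned code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

In the notebook the same lookup is available through `labelencoder_x.classes_`.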

#### 2. Finding the correlation between the independent variables and the dependent variable

In [None]:
# Finding the correlation of the independent variables with the dependent 'Dataset' column
corrmat = dataset.corr()
corrmat

*Visualization of the correlation*

In [None]:
import seaborn as sns
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=1, cmap="YlGnBu", square=True,linewidths=.5, annot=True)
plt.show()

In [None]:
# Obtaining the top K columns most positively correlated with the 'Dataset' column
k= 10
corrmat.nlargest(k, 'Dataset')

In [None]:
# Replotting the heatmap with the above data
cols = corrmat.nlargest(k, 'Dataset')['Dataset'].index
cm = np.corrcoef(dataset[cols].values.T)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(cm, cmap="YlGnBu", cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

#### Observation:
> The chart above suggests that the main contributors to whether the patient __needs to be diagnosed__ are the columns whose *correlation value* with the Dataset column is greater than 0. <br>
As I have no domain expertise in liver disease, my aim is to complete the analysis and arrive at a decent classification model whose results can be reviewed by domain experts when the insights are presented. <br> 
For this reason, only __3 independent variables__ are considered: Albumin_and_Globulin_Ratio, Albumin and Total_Protiens.
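
One caveat about this selection: `nlargest` ranks by the signed correlation, so a column with a strong negative correlation ends up at the bottom even though it may be just as informative. The sketch below uses a made-up toy frame (all column names are hypothetical) to show how ranking by absolute correlation differs:

```python
import pandas as pd

# Toy data: feature_neg mirrors feature_pos, feature_weak is unrelated to the target
toy = pd.DataFrame({
    "feature_pos": [1, 2, 3, 4, 5, 6],
    "feature_neg": [6, 5, 4, 3, 2, 1],
    "feature_weak": [1, 3, 2, 3, 1, 2],
    "target": [1, 1, 1, 2, 2, 2],
})

corr_with_target = toy.corr()["target"].drop("target")

# Signed ranking, as used in the notebook: the negative column is left out
top_signed = toy.corr().nlargest(2, "target").index.tolist()

# Absolute ranking: both strongly related columns come first
ranked = corr_with_target.abs().sort_values(ascending=False)

print(top_signed)
print(ranked)
```

Whether the negatively correlated columns should be included in the model is a judgment call that the notebook leaves to domain experts.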

#### 3. Additional Visualization 

In [None]:
sns.catplot(x="Gender", y="Age", hue="Dataset", data=dataset_orig)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x="Albumin", y="Albumin_and_Globulin_Ratio", hue="Dataset", style="Dataset", data=dataset_orig);
plt.show()

## 4. Machine Learning
> Based on the correlation matrix, only __3 independent variables__ are considered: Albumin_and_Globulin_Ratio, Albumin and Total_Protiens

*Splitting Independent (X) and dependent (y) variables*

In [None]:
# Splitting Independent (X) and dependent (y) variables from the dataset
X = dataset[['Albumin_and_Globulin_Ratio', 'Albumin','Total_Protiens']]
y = dataset[['Dataset']]

In [None]:
X[0:5]

In [None]:
y[0:5]

In [None]:
# Splitting the data into Training and Test set with 70-30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=0)

In [None]:
print("X_train: " , X_train.shape)
print("X_test: ", X_test.shape)
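
As a sanity check on the shapes printed above, the split sizes can be reproduced on placeholder data. The sketch assumes 579 rows (the 583 records from the dataset description minus the 4 dropped rows) and 3 feature columns. The arrays are dummies, not the real values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 579  # 583 records minus the 4 rows with missing values
X_demo = np.zeros((n_rows, 3))
y_demo = np.ones(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

print(X_tr.shape, X_te.shape)
```

Because the classes are imbalanced, passing `stratify=y` to `train_test_split` is an option for keeping the class proportions the same in both sets. The notebook does not do this.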


*Applying feature scaling on training and test datasets* <br>
Feature scaling is not applied to the dependent variable 'Dataset' as it only takes the values 1 and 2

In [None]:
# Import required package
from sklearn.preprocessing import StandardScaler

In [None]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
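
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. A small sketch on made-up numbers shows that `StandardScaler` computes `(x - mean) / std` per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0]])  # equals the training mean in both columns

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent using the population standard deviation (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(test_scaled)
```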

In [None]:
X_train[0:5]

In [None]:
X_test[0:5]

### 4.a Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression(random_state=0, solver='lbfgs')
classifier_lr.fit(X_train,y_train.values.reshape(-1,))
# predict the test set result
y_predLR = classifier_lr.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_predLR)
cm

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predLR)
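
The accuracy score is the sum of the diagonal of the confusion matrix divided by the total number of predictions. A short sketch with made-up labels (using the same 1/2 coding as the Dataset column) makes the relationship explicit:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 1, 2]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm_demo = confusion_matrix(y_true, y_pred, labels=[1, 2])

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()

print(cm_demo)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```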

### 4.b K-NN Classification
*Fitting K Nearest Neighbors (KNN)* Classifier to the training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifierKNN = KNeighborsClassifier(n_neighbors=5,p=2, metric='minkowski')
classifierKNN.fit(X_train, y_train.values.reshape(-1,))

# predict the test set result
y_predKNN = classifierKNN.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_predKNN)
cm

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predKNN)

### 4.c SVM Classification
*Fitting Support Vector Machine (SVM)* Classifier to the training set

In [None]:
# Importing the required package 
from sklearn.svm import SVC
classifier_svm = SVC(kernel='linear', random_state=0)
classifier_svm.fit(X_train, y_train.values.reshape(-1,))
# predict the test set result
y_predSVM = classifier_svm.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_predSVM)
cm

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predSVM)

### 4.d Naive Bayes Classification
*Fitting Naive Bayes* Classifier to the training set

In [None]:
from sklearn.naive_bayes import GaussianNB
classifierNB = GaussianNB()
classifierNB.fit(X_train, y_train.values.reshape(-1,))

# predict the test set result
y_predNB = classifierNB.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_predNB)
cm

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predNB)
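
The four sections above repeat the same fit, predict and score pattern. A compact sketch of that pattern as a loop is shown below. It runs on synthetic data from `make_classification` as a stand-in (an assumption, not the liver dataset), so the scores it prints say nothing about the real results:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 3 features, mirroring the 3 columns used above
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0
)

# Same estimators and settings as in sections 4.a to 4.d
models = {
    "Logistic Regression": LogisticRegression(random_state=0, solver="lbfgs"),
    "K-NN": KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski"),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # Scaling is fitted on the training data only, inside the pipeline
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_tr, y_tr)
    scores[name] = pipeline.score(X_te, y_te)  # test-set accuracy

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```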

## Conclusion
> Of the classification models above, Logistic Regression, K-NN and SVM yielded almost the same accuracy of 70.12%.<br>
However, the __Naive Bayes classifier__ scored the highest with 71.26% accuracy, compared to the other 3 models tried above.
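
For context when reading these accuracies, it can help to compare them with a majority-class baseline, i.e. the accuracy of always predicting "liver patient". The sketch below uses the class counts quoted in the dataset description. Those counts are taken before the 4 rows were dropped and refer to the full dataset rather than the test split, so the figure is only a rough reference point:

```python
# Class counts from the dataset description (full dataset, before dropping rows)
liver, non_liver = 416, 167

# Accuracy obtained by always predicting the larger class
baseline_accuracy = max(liver, non_liver) / (liver + non_liver)

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 71%, which is close to the accuracies reported above.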