### Objective:
Predict whether a patient is healthy or unhealthy.

**Dataset:** 
The data set contains laboratory values of blood donors and Hepatitis C patients and demographic values like age. The data was obtained from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/HCV+data

**By:** Sanket Sharma, ur.sanketsharma@gmail.com


In [None]:
#Import Necessary Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os # accessing directory structure

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Step 1: Load Data

In [None]:
df = pd.read_csv('/kaggle/input/hepatitis-c-dataset/HepatitisCdata.csv')

### Step 2: Summarize Dataset

In [None]:
# Let's look at the top few rows
df.head()

In [None]:
# Let's look at the bottom few rows
df.tail()

In [None]:
# Let's drop the first column, as it looks like a serial number.
df = df.drop(labels="Unnamed: 0", axis=1)
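An unnamed first column like this usually appears when a CSV was saved together with its row index. As an alternative to dropping it afterwards, `index_col=0` can be passed to `pd.read_csv`. The sketch below uses a small made-up CSV (the values are illustrative assumptions, not rows from the dataset) to show the difference.

```python
import io
import pandas as pd

# Toy CSV (hypothetical values) whose first header field is empty,
# which pandas reads back as a column named "Unnamed: 0".
csv_text = ",Category,Age\n1,0=Blood Donor,32\n2,1=Hepatitis,45\n"

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that column into the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Either approach leaves the same feature columns; which one to use is a matter of preference.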

In [None]:
# Let's gather some information about the data
df.info()

We can see that the **Category** and **Sex** columns hold categorical data, while all other variables are numerical.

The columns **ALB**, **ALP**, **ALT**, **CHOL**, and **PROT** have a few missing records. We can either fill those records or drop the rows that contain them. Since imputed values could be misleading for medical data, let's drop those rows.

In [None]:
# Let's drop rows with null values, since we can't safely guess missing medical values
df = df.dropna()
df.info()
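Before dropping rows, it can be useful to count the missing values per column and see how many rows would be lost. The following is a minimal sketch on a toy frame; the column names match the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical values).
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, 52.0],
    "Age": [32, 32, 33, 34],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Rows that remain after dropna().
rows_after_drop = len(toy.dropna())

print(missing_per_column)
print(rows_with_missing, rows_after_drop)
```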

In [None]:
# Let's see which categories are present
df["Category"].unique()

In [None]:
# Though there are several categories, we collapse them into two: healthy (0) and unhealthy (1).
# Mapping through a dictionary avoids chained assignment (df['Category'].loc[...] = ...),
# which may silently fail to modify df.
category_map = {
    "0=Blood Donor": 0, "0s=suspect Blood Donor": 0,
    "1=Hepatitis": 1, "2=Fibrosis": 1, "3=Cirrhosis": 1,
}
df['Category'] = df['Category'].map(category_map).astype('int')

In [None]:
# Let's see how many records fall into each category
df["Category"].value_counts()

This means there are 533 healthy people and 56 unhealthy people.

In [None]:
# Let's group by sex and category
df.groupby(["Sex", "Category"]).size()

There are 210 healthy and 16 unhealthy females, and 323 healthy and 40 unhealthy males.
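Raw counts are hard to compare because there are more males than females in the data. As a small worked example, the counts quoted above can be turned into the share of unhealthy records per sex. The frame below is built by hand from those four numbers, so it is an illustration rather than output of the notebook.

```python
import pandas as pd

# Counts taken from the group-by result quoted in the text.
counts = pd.DataFrame(
    {"healthy": [210, 323], "unhealthy": [16, 40]},
    index=["f", "m"],
)

# Share of unhealthy records within each sex.
unhealthy_rate = counts["unhealthy"] / counts.sum(axis=1)

print(unhealthy_rate.round(3))
```

On these counts, roughly 7% of females and 11% of males are in the unhealthy class.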

In [None]:
# Let's look at the statistical attributes of the dataset
df.describe()

### Step 3: Visualise Data

**Univariate plots**

In [None]:
# Box and whisker plot
df.plot(kind="box", subplots = True, layout=(2,6), figsize=(12,6))
plt.show()

We can see that there are a lot of outliers in most of the input variables, but I'm not going to do anything about them because:

- I'm not in the medical field and I don't have enough knowledge about most of the features and what they represent.
- Blood analysis values for each feature can differ hugely between healthy and unhealthy individuals, and the outliers in this DataFrame may carry important information that helps the models predict the disease.
- This dataset is very small, with only 589 records, and most of them refer to healthy people, so I want to make use of every single one of them.
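To put a number on "a lot of outliers", the points drawn beyond the whiskers of a box plot can be counted with the same 1.5 × IQR rule that box plots use by default. The sketch below is one way to do that; `count_iqr_outliers` is a hypothetical helper name and the values are made up, so this is an illustration rather than a result from the dataset.

```python
import pandas as pd

# Toy lab values (hypothetical), with one obviously extreme ALT reading.
toy = pd.DataFrame({
    "ALT": [7.7, 18.0, 36.2, 30.6, 32.6, 325.3],
    "CHE": [6.93, 11.17, 8.84, 7.33, 9.15, 8.2],
})

def count_iqr_outliers(frame, k=1.5):
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Counting outliers this way does not remove anything; it only reports how many points fall outside the whiskers.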

In [None]:
#Histogram
df.hist(layout=(4,3), figsize=(10,12))
plt.show()

We can see that a few variables follow a nearly normal distribution.

In [None]:
# Let's plot sex against age, coloured by category
x = df['Age']
y = df["Category"]
scatter = plt.scatter(x, df["Sex"], c=y, cmap='winter')
plt.title('Category by Age and Sex')
plt.xlabel('Age')
plt.ylabel('Sex')  # the y-axis shows Sex; Category is encoded by colour
plt.legend(*scatter.legend_elements(), title='Hepatitis')
plt.show()

From the graph above we can see that all males below the age of 30 are positive. A model will likely put too much weight on this coincidence because of the strong correlation. However, we are not dropping males below the age of 30, because we have very few records.
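A pattern read off a scatter plot can also be checked numerically with a cross-tabulation of sex, age group, and category. The sketch below runs on a made-up toy frame (not the real data) and assumes a cut-off at age 30 with `Category` already encoded as 0/1.

```python
import pandas as pd

# Toy records (hypothetical values) in the same shape as the notebook's columns.
toy = pd.DataFrame({
    "Age": [23, 25, 29, 35, 47, 52, 61, 38],
    "Sex": ["m", "m", "m", "m", "f", "f", "m", "f"],
    "Category": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Label each record by age group.
toy["age_group"] = toy["Age"].lt(30).map({True: "under 30", False: "30 or older"})

# Rows: (Sex, age_group); columns: Category (0 = healthy, 1 = unhealthy).
table = pd.crosstab([toy["Sex"], toy["age_group"]], toy["Category"])
print(table)
```

A zero in the healthy column for a given group confirms that every record in that group is positive.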

In [None]:
# The models need numerical input, so encode the Sex column: m -> 1, f -> 0.
# map() replaces the chained assignment (df['Sex'].loc[...] = ...), which may not modify df.
df['Sex'] = df['Sex'].map({'m': 1, 'f': 0}).astype('int')

### Step 4: Define Input and Output variable

In [None]:
# input variable
X = df.drop(labels="Category", axis=1)

# Output variable
y = df["Category"]

### Step 5: Splitting into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
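Because the two classes are very unbalanced, a purely random split can leave the test set with noticeably more or fewer unhealthy records than the overall proportion. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below shows the effect on a synthetic label vector with 10% positives; the data is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10 of them positive (hypothetical).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy preserves the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("positives in train:", int(y_tr.sum()))
print("positives in test:", int(y_te.sum()))
```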

### Step 6: Evaluating different Models
We don't know which classification model will perform best for this problem, so we will evaluate the models below using GridSearchCV:

1. DecisionTreeClassifier
2. RandomForest
3. LogisticRegression
4. GaussianNB
5. MultinomialNB
6. SVM

In [None]:
# Let's import the necessary model classes
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

In [None]:
# Let's create a dictionary of models and their parameter grids

model_param = {'DecisionTreeClassifier':{'model': DecisionTreeClassifier(random_state=0), 'param': {'criterion': ['gini','entropy']}},
              'Randomforest': {'model': RandomForestClassifier(random_state=0), 'param': {'n_estimators':[1,5,10,15,20,25,30,40,50,60,80,100]}},
              'LogisticRegression':{'model': LogisticRegression(solver='liblinear',multi_class='auto', random_state=0),'param': {'C': [1,5,10,15,20]}},
              'GaussianNB':{'model': GaussianNB(), 'param': {}},
              'MultinomialNB':{'model': MultinomialNB(), 'param': {}},
               'SVM':{'model': SVC(gamma='auto', random_state=0), 'param': {'C': [0.001,0.1,1],'kernel':['rbf', 'linear']}}
              }

In [None]:
# Applying GridSearchCV to evaluate models

from sklearn.model_selection import GridSearchCV

scores = []

for model_name, mp in model_param.items():
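    # return_train_score expects a boolean; False skips computing training scores
    cl = GridSearchCV(mp['model'], mp['param'], cv=5, return_train_score=False)
    # Note: the search is fitted on the full dataset (X, y), not only the training split
    cl.fit(X, y)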
    scores.append({
        'model': model_name,
        'best_score': cl.best_score_,
        'best_params': cl.best_params_
    })
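It helps to know what `best_score_` means when comparing models: it is the mean cross-validated score (accuracy by default for classifiers) of the best parameter combination, averaged over the `cv` folds. The sketch below illustrates this on synthetic data from `make_classification`, which stands in for the real features; the dataset and grid are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in, not the HCV data).
X_syn, y_syn = make_classification(n_samples=120, n_features=6, random_state=0)

search = GridSearchCV(
    SVC(gamma='auto', random_state=0),
    {'C': [0.1, 1], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(X_syn, y_syn)

# best_score_ is the mean test score of the best-ranked parameter combination.
mean_scores = search.cv_results_['mean_test_score']
print(search.best_params_)
print(search.best_score_, mean_scores[search.best_index_])
```

The per-combination results in `cv_results_` can also be loaded into a DataFrame with `pd.DataFrame(search.cv_results_)` for closer inspection.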

In [None]:
scores

In [None]:
# Let's view the scores as a DataFrame
score1 = pd.DataFrame(scores)
score1.sort_values(by=['best_score'], inplace = True, ascending=False)
score1


Here we can see that SVM gives the best results, with the parameters 'C': 0.1 and 'kernel': 'linear'. We will use the SVM model.


### Step 7: Applying the best model on the Training Set

In [None]:
# Fit the SVM selected above on the training set
model = SVC(C=0.1, kernel='linear')
model.fit(X_train, y_train)

# make predictions
predictions = model.predict(X_test)

In [None]:
# Let's create a confusion matrix to compare predictions with the true labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
cm

In [None]:
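# Let's plot the confusion matrix.
# In scikit-learn's confusion_matrix, rows are actual labels and columns are predicted labels.
sns.heatmap(cm, annot=True, cmap="viridis", fmt='g')
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()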

We can see that there are 108 true-negative predictions, 0 false-negative predictions, 1 false-positive prediction and 9 true-positive predictions. This means the model misclassifies only 1 of the 118 test records.
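The four counts can be read directly from a 2×2 confusion matrix with `ravel()`, which for scikit-learn's layout (rows are actual labels, columns are predicted labels) returns them in the order TN, FP, FN, TP. The sketch below rebuilds the matrix by hand from the counts stated above, so it is a worked example rather than the notebook's own output, and derives precision and recall for the unhealthy class.

```python
import numpy as np

# Matrix rebuilt from the counts in the text: rows = actual, columns = predicted.
cm_from_text = np.array([[108, 1],
                         [0, 9]])

# For a binary matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm_from_text.ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

With these counts, recall for the unhealthy class is 1.0 and precision is 0.9.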

In [None]:
# Let's calculate the classification accuracy from the confusion matrix
# classification accuracy = correct predictions / total predictions * 100
# (the diagonal of cm holds the correct predictions)
cls_acc = np.trace(cm) / cm.sum() * 100
print("Classification Accuracy: " + str(round(cls_acc, 2)) + "%")
# error rate = 100 - classification accuracy
error_rate = 100 - cls_acc
print("Error Rate: " + str(round(error_rate, 2)) + "%")
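Since most records are healthy, accuracy is easier to judge next to a simple baseline: the accuracy of always predicting the majority class. The sketch below computes both from the confusion-matrix counts reported earlier (108, 1, 0 and 9); the variable names are chosen for this example only.

```python
# Counts as reported for the test set in the text.
tn, fp, fn, tp = 108, 1, 0, 9
total = tn + fp + fn + tp

# Accuracy of the fitted model.
model_accuracy = (tn + tp) / total * 100

# Accuracy of always predicting the larger class.
actual_negatives = tn + fp
actual_positives = fn + tp
baseline_accuracy = max(actual_negatives, actual_positives) / total * 100

print("Model accuracy: " + str(round(model_accuracy, 2)) + "%")
print("Majority-class baseline: " + str(round(baseline_accuracy, 2)) + "%")
```

On these counts the baseline is already about 92%, so the gap between it and the model's accuracy is the more informative figure.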