#Disease Detection with Machine Learning

In this workbook, machine learning techniques are applied to disease prediction and diagnosis, using a list of symptoms. Several models are tested; the final result combines the output of all models.

In [2]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


**Import Libs**

In [3]:
from scipy.stats import mode
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 
from sklearn.metrics import accuracy_score, confusion_matrix
%matplotlib inline

####Pre-processing and Data exploration

**Import Data**

In [4]:
data_train = pd.read_csv('MyDrive/MLData/Training_dispred.csv') 
data_valdt  = pd.read_csv('MyDrive/MLData/Testing_dispred.csv') 

In [5]:
data_train.dtypes

itching                   int64
skin_rash                 int64
nodal_skin_eruptions      int64
continuous_sneezing       int64
shivering                 int64
                         ...   
blister                   int64
red_sore_around_nose      int64
yellow_crust_ooze         int64
prognosis                object
Unnamed: 133            float64
Length: 134, dtype: object

In [6]:
data_train.count()

itching                 4920
skin_rash               4920
nodal_skin_eruptions    4920
continuous_sneezing     4920
shivering               4920
                        ... 
blister                 4920
red_sore_around_nose    4920
yellow_crust_ooze       4920
prognosis               4920
Unnamed: 133               0
Length: 134, dtype: int64

Drop an empty column.

In [7]:
data_train.drop('Unnamed: 133',axis=1,inplace=True)
len(data_train.columns)

133
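
An `Unnamed: N` column that is entirely empty typically appears when each row of the CSV ends with a trailing comma. As a minimal sketch (the toy frame and its values are hypothetical, not taken from the dataset), such columns can also be dropped without naming them:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (hypothetical values)
toy = pd.DataFrame({
    "itching": [1, 0, 1],
    "skin_rash": [0, 1, 1],
    "prognosis": ["Acne", "Allergy", "Acne"],
    "Unnamed: 3": [np.nan, np.nan, np.nan],  # trailing, completely empty column
})

# Drop every column in which all values are missing
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['itching', 'skin_rash', 'prognosis']
```

This variant does not depend on the exact column name, which may differ between files.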

In [8]:
diseases  = np.unique(data_train[['prognosis']])
ndiseases = len(diseases)
print("Number of Diseases",ndiseases) 

Number of Diseases 41


In [9]:
data_train['prognosis'].value_counts()

Fungal infection                           120
Hepatitis C                                120
Hepatitis E                                120
Alcoholic hepatitis                        120
Tuberculosis                               120
Common Cold                                120
Pneumonia                                  120
Dimorphic hemmorhoids(piles)               120
Heart attack                               120
Varicose veins                             120
Hypothyroidism                             120
Hyperthyroidism                            120
Hypoglycemia                               120
Osteoarthristis                            120
Arthritis                                  120
(vertigo) Paroymsal  Positional Vertigo    120
Acne                                       120
Urinary tract infection                    120
Psoriasis                                  120
Hepatitis D                                120
Hepatitis B                                120
Allergy      

**Data is evenly split between 41 different disease types. For each record we have data on 132 symptoms that may be present in the patient.**
The diseases are recorded as text, which must be converted to numerical form before applying ML models.

In [10]:
#Convert prognosis values into numerical values 
labenc = LabelEncoder()
data_train['prognosis'] = labenc.fit_transform(data_train['prognosis'])
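
To illustrate what the encoder does, here is a small sketch on a hypothetical list of disease names (not the full dataset). `LabelEncoder` sorts the class names and assigns integer codes in that order; `inverse_transform` maps the codes back to the names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of disease names
names = ["Pneumonia", "Acne", "Allergy", "Acne"]

enc = LabelEncoder()
codes = enc.fit_transform(names)        # classes are sorted alphabetically
decoded = enc.inverse_transform(codes)  # back to the original strings

print(list(enc.classes_))  # ['Acne', 'Allergy', 'Pneumonia']
print(codes)               # [2 0 1 0]
print(list(decoded))       # ['Pneumonia', 'Acne', 'Allergy', 'Acne']
```

The fitted encoder must be kept, since it is the only record of which integer corresponds to which disease.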

####Train/Test Split Dataset

In [11]:
#Split into Train Test
x = data_train.iloc[:,:-1]
y = data_train.iloc[:, -1]
x_train, x_test, y_train, y_test =\
  train_test_split(x,y, test_size=0.2, random_state=0)
print(f"Train: {x_train.shape}, {y_train.shape}")
print(f"Test : {x_test.shape}, {y_test.shape}")

Train: (3936, 132), (3936,)
Test : (984, 132), (984,)
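
With many classes, a purely random split can leave some classes slightly over- or under-represented in the test part. A possible variant, shown here on synthetic data and not used in this workbook, is to pass `stratify` to `train_test_split` so that class proportions are preserved:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 classes with 20 rows each (hypothetical data)
y_demo = np.repeat(np.arange(5), 20)
X_demo = np.zeros((100, 3))

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(np.bincount(y_te))  # [4 4 4 4 4]
```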


**Checking for null values in data**

In [12]:
print ("Any Null Values in Data:",data_train.isnull().values.any())

Any Null Values in Data: False


###Models 
---
**Use various Machine Learning models which support multi-class output. We expect the most likely disease diagnosis given the symptom data.**
1. Support Vector Machine
2. Naive Bayes
3. Random Forest

First, the performance of these models is checked with cross-validation to understand the variation in predictions and the effect of a relatively small dataset. Then the models are fitted on the *train* split, and their performance is measured on the held-out *test* split and on the separate *validation* data.

In [13]:
#cross-validation score metric
def cv_scoring(classifier,x,y):
    return accuracy_score(y,classifier.predict(x))

#Collection of Models
models = {
    "SVC":SVC(),
    "Gaussian NB":GaussianNB(),
    "Random Forest":RandomForestClassifier(random_state=0),
#    "Gradient Boosted":GradientBoostingClassifier(random_state=0)
}

#Train each model, and cross validate (k-fold)
for model in models:
  m_ = models[model]
  sc = cross_val_score(m_, x, y, cv = 8,
                       n_jobs = -1,
                       scoring= cv_scoring)
  print("--"*10)
  print("{0:15s} {1:} - Avg: {2:5.1f}  ".format(model,sc,np.mean(sc)))




--------------------
SVC             [1. 1. 1. 1. 1. 1. 1. 1.] - Avg:   1.0  
--------------------
Gaussian NB     [1. 1. 1. 1. 1. 1. 1. 1.] - Avg:   1.0  
--------------------
Random Forest   [1. 1. 1. 1. 1. 1. 1. 1.] - Avg:   1.0  


**All three models score 1.0 on every one of the 8 cross-validation folds.**
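
When fold scores are not all identical, the variation across folds is usually summarized by their mean and standard deviation. The following is a minimal sketch on synthetic data; the feature matrix, the label rule, and the `_demo` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the symptom table: 200 rows, 10 binary features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10))
# Hypothetical label rule: the class depends on the first two features only
y_demo = X_demo[:, 0] * 2 + X_demo[:, 1]

# 8-fold cross-validation with the built-in accuracy scorer
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=8, scoring="accuracy")
print("mean = {:.3f}, std = {:.3f}".format(scores.mean(), scores.std()))
```

A small standard deviation indicates that the score does not depend much on which rows end up in which fold.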

In [14]:
for model in models: 
  m_ = models[model]
  m_.fit(x_train,y_train)
  ypred = m_.predict(x_test)
  acc   = accuracy_score(y_test,ypred)
  print("{0:15s} Accuracy: {1:10.2f}".format(model,acc))

SVC             Accuracy:       1.00
Gaussian NB     Accuracy:       1.00
Random Forest   Accuracy:       1.00


In [15]:
#validation data
x_vtest = data_valdt.iloc[:, :-1]
y_vtest = data_valdt.iloc[:, -1]
y_vtest = labenc.transform(y_vtest)
for model in models:
  m_ = models[model]
  ypred = m_.predict(x_vtest)
  acc   = accuracy_score(y_vtest,ypred)
  print("{:15s} Val Accuracy: {:10.2}".format(model,acc))

SVC             Val Accuracy:        1.0
Gaussian NB     Val Accuracy:        1.0
Random Forest   Val Accuracy:       0.98


**Good performance in validation as well: SVC and Gaussian NB score 1.0, Random Forest 0.98.**

####Predictions
The models predict the disease from an array of symptoms.
Each model makes its own prediction, and the mode (most frequent value) of the three predictions is taken as the final output. The combined output should be more robust to the biases and inaccuracies of the individual models.
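
As a minimal sketch of this voting step applied to several samples at once, the example below uses `np.bincount` rather than `scipy.stats.mode`; the prediction arrays and the helper `majority_vote` are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical encoded predictions from three models for five samples
pred_a = np.array([3, 1, 2, 0, 4])
pred_b = np.array([3, 1, 0, 0, 4])
pred_c = np.array([3, 2, 2, 1, 4])

def majority_vote(*predictions):
    """Return the most frequent label per sample; ties go to the smallest label."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in stacked.T])

combined = majority_vote(pred_a, pred_b, pred_c)
print(combined)  # [3 1 2 0 4]
```

With three models and many classes, all three can disagree; in that case this sketch falls back to the smallest label, which is an arbitrary choice.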

In [29]:
#Prediction
#Select the row as a one-row DataFrame so the feature names used in fitting are kept
x_ = x_test.iloc[[10]]
y_svc = models["SVC"].predict(x_)
y_gnb = models["Gaussian NB"].predict(x_)
y_rf  = models["Random Forest"].predict(x_)
#Majority vote: most frequent label among the three models
y_comb= mode([y_svc,y_gnb,y_rf])

print ("Predictions")
print ("SVC",labenc.classes_[y_svc])
print ("Gaussian NB",labenc.classes_[y_gnb])
print ("Random Forest",labenc.classes_[y_rf])
#np.ravel is used because the shape of mode's result differs between SciPy versions
print ("Final",labenc.classes_[np.ravel(y_comb.mode)])

Predictions
SVC ['Urinary tract infection']
Gaussian NB ['Urinary tract infection']
Random Forest ['Urinary tract infection']
Final ['Urinary tract infection']


**Most Useful Symptoms in Prediction**

Using the random forest, we can also find the symptoms that contribute most to the model's decisions. Those with an importance above 0.012 are listed below.

In [17]:
fimp  = models['Random Forest'].feature_importances_
fname = models['Random Forest'].feature_names_in_
for i,name in enumerate(fname):
  if fimp[i]>12/1000.:
    print ("{:30s} {:5.3f}".format(name,fimp[i]))

itching                        0.016
joint_pain                     0.014
vomiting                       0.013
fatigue                        0.015
high_fever                     0.012
sweating                       0.015
dark_urine                     0.013
nausea                         0.014
diarrhoea                      0.014
mild_fever                     0.016
yellowing_of_eyes              0.015
chest_pain                     0.013
bladder_discomfort             0.012
muscle_pain                    0.017
red_spots_over_body            0.014
family_history                 0.015
lack_of_concentration          0.013


##References
1. Based on the example provided [here](https://www.geeksforgeeks.org/disease-prediction-using-machine-learning/)