<h1><b>Disease Prediction with GUI</b></h1>
    
A disease prediction model based on a support vector machine (SVM). It takes the user's symptoms and location as input and predicts the most probable disease the user may have. The same data is sent to the cloud and later analysed with the analytics tool Tableau.

For demonstration purposes, only the data for the diseases GERD and Hepatitis C is sent to the cloud and analysed.

The data has been taken from https://www.kaggle.com/itachi9604/disease-symptom-description-dataset.

**NOTE - Kindly use Jupyter Notebook or the Spyder IDE to run the code.**

<h2>Importing the libraries</h2>

In [221]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
import seaborn as sns
import pickle
import sys 
import urllib
import urllib.request

<h2>Importing the dataset</h2>

In [222]:
df = pd.read_csv('dataset.csv')
print(df.head())


            Disease   Symptom_1              Symptom_2              Symptom_3  \
0  Fungal infection     itching              skin_rash   nodal_skin_eruptions   
1  Fungal infection   skin_rash   nodal_skin_eruptions    dischromic _patches   
2  Fungal infection     itching   nodal_skin_eruptions    dischromic _patches   
3  Fungal infection     itching              skin_rash    dischromic _patches   
4  Fungal infection     itching              skin_rash   nodal_skin_eruptions   

              Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  \
0   dischromic _patches       NaN       NaN       NaN       NaN       NaN   
1                   NaN       NaN       NaN       NaN       NaN       NaN   
2                   NaN       NaN       NaN       NaN       NaN       NaN   
3                   NaN       NaN       NaN       NaN       NaN       NaN   
4                   NaN       NaN       NaN       NaN       NaN       NaN   

  Symptom_10 Symptom_11 Symptom_12 Symptom_13 Symp

In [223]:
df.describe()
diseases = df['Disease'].unique()
diseases = pd.DataFrame(diseases)
diseases.count()
diseases.rename(columns={0:'Disease'})


Unnamed: 0,Disease
0,Fungal infection
1,Allergy
2,GERD
3,Chronic cholestasis
4,Drug Reaction
5,Peptic ulcer diseae
6,AIDS
7,Diabetes
8,Gastroenteritis
9,Bronchial Asthma


This dataset covers 41 different diseases.

In [224]:
df1 = pd.read_csv('Symptom-severity.csv')
print(df1.head())
df1.count()


                Symptom  weight
0               itching       1
1             skin_rash       3
2  nodal_skin_eruptions       4
3   continuous_sneezing       4
4             shivering       5


Symptom    133
weight     133
dtype: int64

In [225]:
df1.describe()


Unnamed: 0,weight
count,133.0
mean,4.225564
std,1.323543
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,7.0


<h2>Cleaning of Data</h2>

In [226]:
df.isna().sum()
df.isnull().sum()

cols = df.columns
data = df[cols].values.flatten()

s = pd.Series(data)
s = s.str.strip()
s = s.values.reshape(df.shape)

df = pd.DataFrame(s, columns=df.columns)

df = df.fillna(0)
df.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,itching,skin_rash,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
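
The flatten-and-reshape steps above amount to trimming whitespace from every cell and replacing missing values with 0. A minimal sketch of the same idea on a made-up frame follows; the `toy` data and the `strip_and_fill` helper are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

# Made-up rows that mimic the leading/trailing spaces and gaps in dataset.csv.
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Allergy "],
    "Symptom_1": [" itching", " continuous_sneezing"],
    "Symptom_2": [" skin_rash", None],
})


def strip_and_fill(frame):
    """Trim whitespace in every text column, then replace missing cells with 0."""
    cleaned = frame.apply(lambda col: col.str.strip())
    # Cast to object so that text and the integer 0 can share a column.
    return cleaned.astype(object).fillna(0)


cleaned = strip_and_fill(toy)
print(cleaned)
```

Working column by column avoids the flatten and reshape round trip, at the cost of assuming that every column holds text.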


<h2>Encoding the symptoms with their severity weight</h2>

In [227]:
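# Work on the raw values so each symptom name can be swapped for its weight.
vals = df.values
symptoms = df1['Symptom'].unique()

for i in range(len(symptoms)):
    # Replace every occurrence of this symptom name with its severity weight.
    vals[vals == symptoms[i]] = df1[df1['Symptom'] == symptoms[i]]['weight'].values[0]
    
d = pd.DataFrame(vals, columns=cols)
d.head()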

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,1,3,4,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,3,4,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,1,4,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,1,3,dischromic _patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [228]:
# These names were not picked up by the encoding step above, so they are set to 0.
d = d.replace('dischromic _patches', 0)
d = d.replace('spotting_ urination', 0)
df = d.replace('foul_smell_of urine', 0)
df.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
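
The encoding can also be expressed as a dictionary lookup, which sends any name missing from the severity table to 0 in a single pass. The sketch below uses made-up rows; `encode_symptoms` and the toy frames are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Toy severity table using weights shown in Symptom-severity.csv above.
severity = pd.DataFrame({
    "Symptom": ["itching", "skin_rash", "nodal_skin_eruptions"],
    "weight": [1, 3, 4],
})

# Toy cleaned dataset: 0 marks an empty slot, "unknown_symptom" has no weight.
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": ["itching", "skin_rash"],
    "Symptom_2": ["skin_rash", "unknown_symptom"],
    "Symptom_3": ["nodal_skin_eruptions", 0],
})


def encode_symptoms(frame, severity_table):
    """Replace symptom names with weights; unmatched names and 0 become 0."""
    weights = dict(zip(severity_table["Symptom"], severity_table["weight"]))
    encoded = frame.copy()
    for col in frame.columns:
        if col == "Disease":
            continue  # the label column is left untouched
        encoded[col] = frame[col].map(lambda v: weights.get(v, 0)).astype(int)
    return encoded


encoded = encode_symptoms(toy, severity)
print(encoded)
```

Because unmatched names fall back to 0, this has the same effect on the toy data as encoding first and then replacing the leftover names by hand.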


<h2>Storing the diseases and encoded symptoms in separate dataframes</h2>

In [229]:
(df[cols] == 0).all()

df['Disease'].value_counts()

df['Disease'].unique()

data = df.iloc[:,1:].values
labels = df['Disease'].values


We will use NLP to map a free-text symptom entered by the user onto the closest symptom name in our dataset.

In [230]:


# %%
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# %%
df_nlp = pd.read_csv("/Users/vedanta/Documents/VSCode/hack_x/Symptom-severity.csv")

# %%
# Turn identifiers such as "skin_rash" into plain phrases ("skin rash").
df_nlp['Symptom'] = df_nlp['Symptom'].str.replace('_', ' ')
sentence = df_nlp['Symptom'].tolist()

# %%
# Embed every known symptom once with a pretrained sentence encoder.
model_nlp = SentenceTransformer('distilbert-base-nli-mean-tokens')
sentence_embeddings = model_nlp.encode(sentence)

# %%
input_sentence = "runny nose"
input_sentence_embedding = model_nlp.encode(input_sentence)
print(input_sentence)

# %%
# Cosine similarity between the input and every symptom; row 0 holds the scores.
scores = util.pytorch_cos_sim(input_sentence_embedding, sentence_embeddings)[0]
df_nlp['embedded_scores'] = scores.tolist()

# The highest-scoring row is the closest known symptom.
df_sorted = df_nlp.sort_values(by='embedded_scores', ascending=False)
symp = df_sorted.head(1)
symp


runny nose


Unnamed: 0,Symptom,weight,embedded_scores
54,runny nose,5,1.0


<h2>Splitting the data and training the model</h2>

In [231]:
x_train, x_test, y_train, y_test = train_test_split(data, labels, shuffle=True, train_size = 0.85)
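
The split above is not seeded, so the scores reported further down will vary from run to run. A hedged sketch of a reproducible, class-balanced split on made-up data follows; the `random_state=42` value and the `stratify` argument are suggestions and are not what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 10 rows of class "A" and 10 of class "B".
toy_data = np.arange(40).reshape(20, 2)
toy_labels = np.array(["A"] * 10 + ["B"] * 10)

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_data,
    toy_labels,
    train_size=0.8,
    shuffle=True,
    stratify=toy_labels,
    random_state=42,
)

print(len(x_tr), len(x_te), sorted(y_te.tolist()))
```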


SVC model

In [232]:
model1 = SVC()
model1.fit(x_train, y_train)

preds = model1.predict(x_test)


In [233]:
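# confusion_matrix orders rows and columns by label, so pass the classifier's
# sorted class list and reuse it for the index and column names.
conf_mat = confusion_matrix(y_test, preds, labels=model1.classes_)

df_cm = pd.DataFrame(conf_mat, index=model1.classes_, columns=model1.classes_)
print('F1-score% =', f1_score(y_test, preds, average='macro')*100, '|', 'Accuracy% =', accuracy_score(y_test, preds)*100)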


F1-score% = 94.20505597126068 | Accuracy% = 94.17344173441734
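
When the confusion matrix is wrapped in a DataFrame, the row and column names must follow the same label order that `confusion_matrix` used. A small sketch with made-up predictions shows the alignment; the label lists here are illustrative only.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels.
y_true = ["GERD", "Allergy", "GERD", "AIDS", "Allergy"]
y_pred = ["GERD", "Allergy", "AIDS", "AIDS", "Allergy"]

# Fix the label order explicitly and reuse it for the index and columns.
class_names = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=class_names)
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)

# Rows are true labels, columns are predicted labels.
print(cm_df)
```

Here the single off-diagonal count sits in row "GERD", column "AIDS", which corresponds to the one GERD case that was predicted as AIDS.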


In [234]:
model2 = GaussianNB()
model2.fit(x_train, y_train)

preds = model2.predict(x_test)


In [235]:
# Label the matrix with the classifier's sorted classes so names match the counts.
conf_mat = confusion_matrix(y_test, preds, labels=model2.classes_)
df_cm = pd.DataFrame(conf_mat, index=model2.classes_, columns=model2.classes_)
print('F1-score% =', f1_score(y_test, preds, average='macro')*100, '|', 'Accuracy% =', accuracy_score(y_test, preds)*100)

F1-score% = 87.36773015161154 | Accuracy% = 87.66937669376695


In [236]:
model3 = RandomForestClassifier()
model3.fit(x_train, y_train)

preds = model3.predict(x_test)


In [237]:
# Label the matrix with the classifier's sorted classes so names match the counts.
conf_mat = confusion_matrix(y_test, preds, labels=model3.classes_)
df_cm = pd.DataFrame(conf_mat, index=model3.classes_, columns=model3.classes_)
print('F1-score% =', f1_score(y_test, preds, average='macro')*100, '|', 'Accuracy% =', accuracy_score(y_test, preds)*100)


F1-score% = 99.4892024454942 | Accuracy% = 99.45799457994579
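
The three scores above come from a single random split. One way to compare the same model families on steadier footing is k-fold cross-validation, sketched below on synthetic data from `make_classification`; the synthetic data, `cv=5`, and `n_estimators=50` are assumptions and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic three-class problem standing in for the encoded symptom table.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "SVC": SVC(),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean macro F1 over 5 folds for each candidate model.
mean_scores = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="f1_macro").mean()
    for name, estimator in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```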


Since the random forest classifier gives the highest F1-score and accuracy of the three models, we will use it for the remaining steps.

In [238]:
disease_severity = pd.read_csv("/Users/vedanta/Documents/VSCode/hack_x/disease_severity.csv")
# Drop the index column that was saved into the CSV.
disease_severity = disease_severity.drop(columns=['Unnamed: 0'])
disease_severity

Unnamed: 0,Disease,Severity
0,Drug Reaction,2.0
1,Malaria,3.0
2,Allergy,2.0
3,Hypothyroidism,2.0
4,Psoriasis,3.0
5,GERD,2.0
6,Chronic cholestasis,4.0
7,hepatitis A,3.0
8,Osteoarthristis,4.0
9,(vertigo) Paroymsal Positional Vertigo,


In [239]:
for i in preds:
    if i in disease_severity['Disease'].values:
        score=disease_severity[disease_severity['Disease'] == i]['Severity'].values[0]
        print("severity of your disease is",score)
    else:
        print("not found")

not found
severity of your disease is 3.0
severity of your disease is 4.0
severity of your disease is 3.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 4.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 3.0
severity of your disease is 3.0
not found
severity of your disease is 1.0
severity of your disease is 2.0
severity of your disease is 3.0
severity of your disease is 5.0
severity of your disease is 3.0
severity of your disease is 3.0
severity of your disease is 4.0
severity of your disease is 4.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 4.0
severity of your disease is 2.0
severity of your disease is 3.0
severity of your disease is 4.0
severity of your disease is 3.0
severity of your disease is 3.0
severity of your dis
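
The severity lookup can also be written with a Series indexed by disease name, which avoids filtering the whole table for every prediction. The sketch below uses a made-up two-row table; `describe_severity` is a hypothetical helper.

```python
import pandas as pd

# Made-up severity table with the same columns as disease_severity.csv.
toy_severity = pd.DataFrame({
    "Disease": ["GERD", "Malaria"],
    "Severity": [2.0, 3.0],
})

# Index by disease name so each lookup is a single key access.
severity_by_disease = toy_severity.set_index("Disease")["Severity"]


def describe_severity(predictions, lookup):
    """Return one message per predicted disease, flagging missing entries."""
    messages = []
    for disease in predictions:
        score = lookup.get(disease)  # None when the disease is not in the table
        if score is None or pd.isna(score):
            messages.append(f"{disease}: not found")
        else:
            messages.append(f"{disease}: severity {float(score)}")
    return messages


messages = describe_severity(["GERD", "Unknown disease"], severity_by_disease)
print(messages)
```

Checking for a missing value as well as a missing key covers rows where the severity column is empty.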

In [240]:
# Use a context manager so the file is closed once the model is written.
with open('model.pkl','wb') as f:
    pickle.dump(model3, f)
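
A quick way to check that a pickled model survives the round trip is to reload it and compare its predictions with the original. The sketch below does this with a small random forest trained on made-up rows; the data, the file name `toy_model.pkl`, and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up encoded symptom rows and labels.
X = np.array([[1, 3, 4, 0], [3, 4, 0, 0], [5, 6, 2, 0], [6, 5, 3, 1]])
y = np.array(["Fungal infection", "Fungal infection", "Allergy", "Allergy"])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(clf, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The reloaded model should predict exactly what the original predicts.
same = bool((restored.predict(X) == clf.predict(X)).all())
print(same)
```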

In [241]:
with open('model.pkl','rb') as f:
    model = pickle.load(f)
# Predict with the reloaded model to confirm that the saved file works.
print(model.predict([[2,3,6,4,2,0,0,0,0,0,0,0,0,0,0,0,0]]))

['Arthritis']
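
The input to `predict` is a single row of 17 severity weights, one per symptom column, padded with zeros. A minimal sketch of building such a row from symptom names follows; `build_feature_row` and the small `weights` dictionary are hypothetical and only reuse three weights shown earlier.

```python
N_FEATURES = 17  # Symptom_1 .. Symptom_17 in dataset.csv


def build_feature_row(symptom_names, weights, n_features=N_FEATURES):
    """Map symptom names to weights and pad with zeros up to n_features."""
    if len(symptom_names) > n_features:
        raise ValueError(f"at most {n_features} symptoms are supported")
    # Unknown names fall back to 0, the same value used for empty slots.
    row = [weights.get(name.strip(), 0) for name in symptom_names]
    return row + [0] * (n_features - len(row))


weights = {"itching": 1, "skin_rash": 3, "nodal_skin_eruptions": 4}
row = build_feature_row(["itching", " skin_rash", "nodal_skin_eruptions"], weights)
print(row)
```

The resulting list can be wrapped in another list and passed to `predict`, in the same shape as the hand-written row above.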


<h2>Checking accuracy of the model</h2>