# **Disease Detection using Symptoms and Treatment recommendation**

In [2]:
import warnings
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split, cross_val_score
from statistics import mean
from nltk.corpus import wordnet 
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from itertools import combinations
from time import time
from collections import Counter
import operator
from xgboost import XGBClassifier
import math
from Treatment import diseaseDetail
from sklearn.linear_model import LogisticRegression
warnings.simplefilter("ignore")
import pickle

The **synonyms function** finds synonyms of a symptom entered by the user.

This is necessary because the user may describe a symptom with a term different from the one present in the dataset.
Matching on synonyms reduces wrong predictions when symptoms are phrased slightly differently from the terms on which the model is trained.

*Synonyms are searched on Thesaurus.com and NLTK WordNet.*

In [3]:
# Returns the set of synonyms of the input term from thesaurus.com (https://www.thesaurus.com/) and WordNet (https://www.nltk.org/howto/wordnet.html)
def synonyms(term):
    found = []  # renamed so the list no longer shadows the function name
    response = requests.get('https://www.thesaurus.com/browse/{}'.format(term))
    soup = BeautifulSoup(response.content, "html.parser")
    try:
        # These class names depend on the page markup at the time of scraping
        container = soup.find('section', {'class': 'MainContentContainer'})
        row = container.find('div', {'class': 'css-191l5o0-ClassicContentCard'})
        for item in row.find_all('li'):
            found.append(item.get_text())
    except AttributeError:
        # find() returned None (no matching markup): rely on WordNet only
        pass
    for syn in wordnet.synsets(term):
        found += syn.lemma_names()
    return set(found)

In [4]:
# Utilities for pre-processing
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
splitter = RegexpTokenizer(r'\w+')

The **disease-symptom dataset** was created in a separate Python program.

**Dataset scraping** was done using the **NHP website** and **Wikipedia data**.

The disease combination dataset contains combinations of the symptoms of each disease in the dataset, because in practice a patient who has a disease often does not experience all of its symptoms.

*To tackle this problem, combinations are made with the symptoms for each disease.*

**This greatly increases the size of the data (the number of subsets grows exponentially with the number of symptoms of a disease) and helps the model predict the disease more accurately when only some of its symptoms are reported.**

*df_comb -> Dataframe consisting of dataset generated by combining symptoms for each disease.* -  rows

*df_norm -> Dataframe containing a single row for each disease, with all the symptoms of that disease.* - 261 rows

**Dataset contains 261 diseases and their symptoms**
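
The dataset itself was built in a separate program that is not shown here, so the following is only a minimal sketch of how symptom combinations for one disease could be enumerated. The function name `symptom_combinations` and the three-symptom example are illustrative assumptions, not taken from the dataset.

```python
from itertools import combinations

def symptom_combinations(symptoms):
    """Yield every non-empty subset of a disease's symptom list."""
    for size in range(1, len(symptoms) + 1):
        yield from combinations(symptoms, size)

# Hypothetical disease with three symptoms (not taken from the dataset).
example_symptoms = ["fever", "headache", "sore throat"]
rows = list(symptom_combinations(example_symptoms))

for row in rows:
    print(row)
print(len(rows))  # 2**3 - 1 = 7 non-empty subsets
```

A disease with n symptoms yields 2**n - 1 non-empty subsets, which is why the combination dataset is so much larger than the 261-row one.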

In [5]:
# Load Dataset scraped from NHP (https://www.nhp.gov.in/disease-a-z) & Wikipedia
# Scraping and creation of the dataset CSV is done in a separate program
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
df_comb = pd.read_csv(r"F:\SEM 6\Minor-2\Disease-Detection-based-on-Symptoms-master\Dataset\dis_sym_dataset_comb.csv") # Disease combination
df_norm = pd.read_csv(r"F:\SEM 6\Minor-2\Disease-Detection-based-on-Symptoms-master\Dataset\dis_sym_dataset_norm.csv") # Individual Disease

X = df_comb.iloc[:, 1:]
Y = df_comb.iloc[:, 0:1]


The **Logistic Regression (LR) classifier** is used because it gave better accuracy than the other classification models compared in Model_latest.py.

10-fold cross-validation (cv = 10) is run on the combination dataset.

In [5]:
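lr = LogisticRegression()
# Y is a one-column DataFrame; ravel() passes the labels as the 1-D array scikit-learn expects
lr = lr.fit(X, Y.values.ravel())
# scores holds one accuracy value per fold
scores = cross_val_score(lr, X, Y.values.ravel(), cv=10)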
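
To make the `cv=10` argument concrete, here is a small standard-library sketch of how k-fold cross-validation partitions sample indices into train and test parts. It is an illustration only: the helper `k_fold_indices` is hypothetical, and it uses plain contiguous folds, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical sample count, chosen only to keep the output small.
folds = list(k_fold_indices(n_samples=25, k=10))
for train, test in folds[:3]:
    print(len(train), test)
```

Each sample lands in exactly one test fold, so every row is used for evaluation once and for training k - 1 times.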

In [6]:
# pickle.dump(lr, open('model_saved', 'wb'))

In [7]:
X = df_norm.iloc[:, 1:]
Y = df_norm.iloc[:, 0:1]

In [8]:
# List of symptoms
dataset_symptoms = list(X.columns)

Total symptoms are 489

In [9]:
len(dataset_symptoms)

489

# Symptoms initially taken from user.

In [10]:
# Taking symptoms from user as input 
user_symptoms = str(input("Please enter symptoms separated by comma(,):\n")).lower().split(',')
# Preprocessing the input symptoms
processed_user_symptoms=[]
for sym in user_symptoms:
    sym=sym.strip()
    sym=sym.replace('-',' ')
    sym=sym.replace("'",'')
    sym = ' '.join([lemmatizer.lemmatize(word) for word in splitter.tokenize(sym)])
    processed_user_symptoms.append(sym)

Please enter symptoms separated by comma(,):
fever, sore throat, headache
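
The normalization steps in the cell above can be sketched without NLTK as follows. This is a simplified stand-in, not the notebook's code: `re.findall(r"\w+", ...)` plays the role of the `RegexpTokenizer`, and `lemmatize` is a hypothetical hook that defaults to the identity function, so no actual lemmatization happens here.

```python
import re

def normalize_symptom(raw, lemmatize=lambda word: word):
    """Lowercase, strip, drop hyphens and apostrophes, then re-join the word tokens.

    `lemmatize` is a hypothetical hook; by default words are left unchanged.
    """
    text = raw.lower().strip().replace("-", " ").replace("'", "")
    tokens = re.findall(r"\w+", text)
    return " ".join(lemmatize(token) for token in tokens)

# Hypothetical user input with mixed case, hyphens, and stray spaces.
raw_input_line = "Fever,  Sore-Throat , head ache's"
cleaned = [normalize_symptom(part) for part in raw_input_line.split(",")]
print(cleaned)
```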


The symptoms entered by the user are now pre-processed. Next, each symptom is expanded with its synonyms.

In [11]:
# Taking each user symptom and finding all its synonyms and appending it to the pre-processed symptom string
user_symptoms = []
for user_sym in processed_user_symptoms:
    user_sym = user_sym.split()
    str_sym = set()
    for comb in range(1, len(user_sym)+1):
        for subset in combinations(user_sym, comb):
            subset=' '.join(subset)
            subset = synonyms(subset) 
            str_sym.update(subset)
    str_sym.add(' '.join(user_sym))
    user_symptoms.append(' '.join(str_sym).replace('_',' '))
# query expansion performed by joining synonyms found for each symptoms initially entered
print("After query expansion done by using the symptoms entered")
print(user_symptoms)

After query expansion done by using the symptoms entered
['pyrexia febricity fever febrility feverishness', 'raw throat sensitive mad sore throat painful huffy afflictive sore pharynx tender', 'worry headache vexation cephalalgia concern head ache']
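
The expansion step can be tried offline with a small in-memory table in place of the network lookup. In the sketch below, `TOY_SYNONYMS` and `expand_symptom` are hypothetical names, and the synonym entries are made up for illustration; only the subset-enumeration logic mirrors the cell above.

```python
from itertools import combinations

# Hypothetical synonym table standing in for the thesaurus.com / WordNet lookup.
TOY_SYNONYMS = {
    "sore": {"painful", "tender"},
    "throat": {"pharynx"},
    "sore throat": {"raw throat"},
}

def expand_symptom(symptom, lookup):
    """Return the symptom plus the synonyms of every subset of its words."""
    words = symptom.split()
    expanded = set()
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            expanded.update(lookup.get(" ".join(subset), set()))
    expanded.add(symptom)
    # Sorted only to make the printed string deterministic.
    return " ".join(sorted(expanded)).replace("_", " ")

expanded_query = expand_symptom("sore throat", TOY_SYNONYMS)
print(expanded_query)
```

Looking up every word subset, not just the full phrase, is what lets a multi-word symptom pick up synonyms of its individual words.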


The procedure below shows the user the dataset symptoms that match the symptoms they entered.

The symptom synonyms and user symptoms are matched against the symptoms present in the dataset. Only the symptoms that match symptoms present in the dataset are shown back to the user.

The code loops over all the symptoms in the dataset and compares each one to the expanded user-input symptoms. The similarity score is the number of words of the dataset symptom that also appear in the expanded user symptom, divided by the total number of words in the dataset symptom. If the score is greater than 0.5, the dataset symptom is added to the set of found symptoms, which is finally converted to a list.

In [12]:
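# Loop over all the symptoms in dataset and check its similarity score to the synonym string of the user-input 
# symptoms. If similarity>0.5, add the symptom to the final list

# The score is the fraction of the dataset symptom's words found in the user string
# (a word-overlap ratio, not a true Jaccard similarity, which would divide by the size of the union)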

found_symptoms = set()
for idx, data_sym in enumerate(dataset_symptoms):
    # data_sym = abdominal cramp 
    data_sym_split = data_sym.split()
    # data_sym_split = ['abdominal', 'cramp']
    for user_sym in user_symptoms:
        count=0
        for symp in data_sym_split:
            if symp in user_sym.split():
                count+=1
        if count/len(data_sym_split)>0.5:
            found_symptoms.add(data_sym)
found_symptoms = list(found_symptoms)
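
A small worked example may help show how the 0.5 threshold behaves. The function `overlap_score` below is an illustrative rewrite of the scoring rule, and the query string and candidate symptoms are made up for the demonstration.

```python
def overlap_score(dataset_symptom, expanded_query):
    """Fraction of the dataset symptom's words that occur in the query string."""
    dataset_words = dataset_symptom.split()
    query_words = set(expanded_query.split())
    hits = sum(1 for word in dataset_words if word in query_words)
    return hits / len(dataset_words)

# Hypothetical expanded query and candidate dataset symptoms.
query = "raw throat sensitive sore throat painful sore pharynx tender"
for candidate in ["sore throat", "painful", "throat irritation", "chest pain"]:
    score = overlap_score(candidate, query)
    print(candidate, score, score > 0.5)
```

Because the comparison is strict, a two-word symptom with only one matching word scores exactly 0.5 and is rejected; both of its words must appear in the query.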

## **Prompt the user to select the relevant symptoms by entering the corresponding indices.**

In [13]:
# Print all found symptoms
print("Top matching symptoms from your search!")
for idx, symp in enumerate(found_symptoms):
    print(idx,":",symp)
    
# Show the related symptoms found in the dataset and ask user to select among them
select_list = input("\nPlease select the relevant symptoms. Enter indices (separated-space):\n").split()

Top matching symptoms from your search!
0 : headache
1 : painful
2 : sore throat
3 : fever

Please select the relevant symptoms. Enter indices (separated-space):
0 2 3


In [14]:
# Find other relevant symptoms from the dataset based on their co-occurrence with the symptoms selected by the user
dis_list = set()
final_symp = [] 
counter_list = []
for idx in select_list:
    # e.g. select_list = ['0', '2', '3']
    symp=found_symptoms[int(idx)]
    # e.g. found_symptoms = ['headache', 'painful', 'sore throat', 'fever']
    final_symp.append(symp)
    # e.g. final_symp = ['headache', 'sore throat', 'fever']
    # Add every disease that lists this symptom (e.g. fever) to the disease list
    dis_list.update(set(df_norm[df_norm[symp]==1]['label_dis']))

For each disease dis in dis_list, the next cell selects the row of the DataFrame df_norm whose label_dis value equals dis. The first element of the row (the disease label) is removed, and the code then iterates over the remaining values: if a value is non-zero and the corresponding symptom in dataset_symptoms is not already in final_symp, the symptom is appended to counter_list.

The row is selected with the .loc method, which selects rows based on a boolean condition. The .values attribute extracts the selected row as a NumPy array, and the .tolist() method converts the array to a Python list, which is assigned to the variable row.

In [16]:
for dis in dis_list:
    # dis = Chikungunya Fever
    row = df_norm.loc[df_norm['label_dis'] == dis].values.tolist()
    row[0].pop(0)
    for idx,val in enumerate(row[0]):
        if val!=0 and dataset_symptoms[idx] not in final_symp:
            counter_list.append(dataset_symptoms[idx])

counter_list now holds the not-yet-selected symptoms of all the candidate diseases; a symptom appears once for each disease that lists it.

## Finding symptoms that generally co-occur, for example headache often occurs together with cough.

This cell builds a dictionary called dict_symp using the Counter class from the collections module. Counter takes an iterable (in this case, the list counter_list) and returns a dictionary-like object whose keys are the unique elements of the iterable and whose values are the number of times each element appears. The code then builds a list of tuples called dict_symp_tup by passing dict_symp.items() (the key-value pairs) to the sorted() function. The key function, itemgetter(1) from the operator module, sorts the pairs by their second element (the count), and the reverse=True argument puts them in descending order, so the most frequent symptoms come first.

In [17]:
# Symptoms that co-occur with the ones selected by user              
dict_symp = dict(Counter(counter_list))
dict_symp_tup = sorted(dict_symp.items(), key=operator.itemgetter(1),reverse=True)    
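
The co-occurrence ranking can be illustrated on a toy example. The disease names and symptom sets below are hypothetical; the sketch uses `Counter.most_common()`, which returns the same count-descending ordering as the `sorted(...)` call above.

```python
from collections import Counter

# Hypothetical disease -> symptom sets (illustrative only).
toy_diseases = {
    "disease A": {"fever", "cough", "fatigue"},
    "disease B": {"fever", "fatigue", "nausea"},
    "disease C": {"headache", "nausea"},
}
selected = {"fever"}

# Candidate diseases: those that list at least one selected symptom.
candidates = [name for name, symptoms in toy_diseases.items() if symptoms & selected]

# Count every not-yet-selected symptom across the candidate diseases.
co_occurrence = Counter(
    symptom
    for name in candidates
    for symptom in toy_diseases[name]
    if symptom not in selected
)
ranked = co_occurrence.most_common()
print(ranked)
```

Here "fatigue" ranks first because it appears in both candidate diseases, while "headache" is never counted because its disease does not list "fever".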

## The user is presented with a list of co-occurring symptoms to select from; this is repeated iteratively to recommend more possible symptoms based on co-occurrence with the previously selected symptoms.

As the co-occurring symptoms can be overwhelming in number, only the top 5 are recommended at a time, from which the user can select.

If the user has none of those 5 symptoms and wants to see the next 5, they can enter -1.

To stop the recommendation, the user enters "no" (the input is lowercased, so "No" also works).

In [18]:
# Iteratively, suggest top co-occuring symptoms to the user and ask to select the ones applicable 
found_symptoms=[]
count=0
for tup in dict_symp_tup:
    count+=1
    found_symptoms.append(tup[0])
    if count%5==0 or count==len(dict_symp_tup):
        print("\nCommon co-occuring symptoms:")
        for idx,ele in enumerate(found_symptoms):
            print(idx,":",ele)
        select_list = input("Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:\n").lower().split()
        # Empty input is treated like '-1' (skip) instead of raising an IndexError
        if not select_list or select_list[0]=='-1':
            found_symptoms = [] 
            continue
        if select_list[0]=='no':
            break
        for idx in select_list:
            final_symp.append(found_symptoms[int(idx)])
        found_symptoms = [] 


Common co-occuring symptoms:
0 : testicular pain
1 : vomiting
2 : barky cough
3 : confusion
4 : maculopapular rash
Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
2

Common co-occuring symptoms:
0 : diarrhea
1 : feeling tired
2 : nausea
3 : swollen lymph node
4 : chest pain
Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
1 4

Common co-occuring symptoms:
0 : shortness breath
1 : runny nose
2 : muscle weakness
3 : unintended weight loss
4 : large lymph node
Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
0 1 2

Common co-occuring symptoms:
0 : fatigue
1 : seizure
2 : tiredness
3 : red eye
4 : dizziness
Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
0 2

Common co-occuring symptoms:
0 : non itchy skin ulcer
1 : vaginal bleeding
2 : skin peeling
3
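
The batching logic can be exercised without typing answers by replaying a scripted list of responses. The sketch below is a non-interactive approximation of the loop: `batches` and `collect_selections` are hypothetical helpers, and the ranked list and answers are made up.

```python
def batches(items, size=5):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def collect_selections(ranked_symptoms, answers, size=5):
    """Replay scripted answers ('no', '-1', or space-separated indices)."""
    chosen = []
    for batch, answer in zip(batches(ranked_symptoms, size), answers):
        tokens = answer.lower().split()
        if not tokens or tokens[0] == "-1":
            continue  # skip this batch and show the next one
        if tokens[0] == "no":
            break  # stop recommending
        chosen.extend(batch[int(token)] for token in tokens)
    return chosen

ranked = [f"symptom {i}" for i in range(12)]  # hypothetical ranked list
picked = collect_selections(ranked, ["2", "-1", "0 1"])
print(picked)
```

Indices are always relative to the batch currently shown, so "0 1" in the third batch refers to the eleventh and twelfth ranked symptoms.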

Final Symptom list

In [19]:
# Create query vector based on symptoms selected by the user
print("\nFinal list of Symptoms that will be used for prediction:")
sample_x = [0] * len(dataset_symptoms)
for val in final_symp:
    print(val)
    sample_x[dataset_symptoms.index(val)]=1


Final list of Symptoms that will be used for prediction:
headache
sore throat
fever
barky cough
feeling tired
chest pain
shortness breath
runny nose
muscle weakness
fatigue
tiredness


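The disease is now predicted: the model is refit on df_norm (X and Y were reassigned above) and class probabilities are computed for the query vector.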

In [20]:
# Predict disease
lr = LogisticRegression()
lr = lr.fit(X, Y)

prediction = lr.predict_proba([sample_x])

Show the top k diseases and their probabilities to the user.

k in this case is 5.

The probability of each disease is computed, and the indices of the k most probable diseases are taken in descending order of probability.

In [21]:
k = 5
# Sorted unique labels, so that position i matches column i of the predicted probabilities
diseases = sorted(set(Y['label_dis']))
topk = prediction[0].argsort()[-k:][::-1]
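
The top-k selection can be sketched in plain Python as an alternative to the NumPy `argsort` idiom. The class names and probabilities below are hypothetical, and the sketch assumes the label list is in the same order as the probability columns.

```python
def top_k_indices(probabilities, k):
    """Indices of the k largest values, highest first."""
    order = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)
    return order[:k]

toy_classes = ["A", "B", "C", "D"]  # hypothetical labels, already sorted
toy_probabilities = [0.10, 0.55, 0.05, 0.30]
best = top_k_indices(toy_probabilities, k=2)
print([(toy_classes[i], toy_probabilities[i]) for i in best])
```

With tied probabilities the two approaches may order the tied indices differently, which does not matter for distinct values like these.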

# **Showing the list of top k diseases to the user with their prediction probabilities.**

# **For getting information about the suggested treatments, user can enter the corresponding index to know more details.**

In [23]:
print(f"\nTop {k} diseases predicted based on symptoms")
topk_dict = {}
# Show the top k most probable diseases to the user.
for t in topk:
    match_sym=set()
    row = df_norm.loc[df_norm['label_dis'] == diseases[t]].values.tolist()
    row[0].pop(0)

    for idx,val in enumerate(row[0]):
        if val!=0:
            match_sym.add(dataset_symptoms[idx])
    # Displayed score: share of the selected symptoms that the disease lists (with add-one smoothing),
    # scaled by the mean cross-validation accuracy; it is not the predict_proba value
    prob = (len(match_sym.intersection(set(final_symp)))+1)/(len(set(final_symp))+1)
    prob *= mean(scores)
    topk_dict[t] = prob
j = 0
topk_index_mapping = {}
topk_sorted = dict(sorted(topk_dict.items(), key=lambda kv: kv[1], reverse=True))
for key in topk_sorted:
    prob = topk_sorted[key]*100
    print(str(j) + " Disease name:",diseases[key], "\tProbability:",str(round(prob, 2))+"%")
    topk_index_mapping[j] = key
    j += 1

select = input("\nMore details about the disease? Enter index of disease or '-1' to discontinue and close the system:\n")
if select!='-1':
    dis=diseases[topk_index_mapping[int(select)]]
    print()
    print(diseaseDetail(dis))


Top 5 diseases predicted based on symptoms
0 Disease name: Influenza 	Probability: 44.55%
1 Disease name: Common cold 	Probability: 37.13%
2 Disease name: Coronavirus disease 2019 (COVID-19) 	Probability: 37.13%
3 Disease name: Legionellosis 	Probability: 37.13%
4 Disease name: Asbestos-related diseases 	Probability: 29.7%

More details about the disease? Enter index of disease or '-1' to discontinue and close the system:
0

Influenza
Other names -  Flu, the flu, Grippe 
Specialty -  Infectious disease 
Symptoms -  Fever, runny nose, sore throat, muscle pain, headache, coughing, fatigue 
Usual onset -  1–4 days after exposure 
Duration -  2–8 days 
Causes -  Influenza viruses 
Prevention -  Hand washing, flu vaccines 
Medication -  Antiviral drugs such as oseltamivir 
Frequency -  3–5 million severe cases per year   
Deaths -  >,290,000–650,000 deaths per year   
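
The percentage printed next to each disease in the top-k list comes from a symptom-overlap formula scaled by the mean cross-validation score, not directly from `predict_proba`. The sketch below restates that formula; the symptom sets and the value 0.8 for the mean score are assumptions chosen for the arithmetic, not values from the notebook.

```python
def displayed_score(disease_symptoms, selected_symptoms, mean_cv_accuracy):
    """(matches + 1) / (selected + 1), scaled by the mean cross-validation score."""
    selected = set(selected_symptoms)
    matches = len(set(disease_symptoms) & selected)
    return (matches + 1) / (len(selected) + 1) * mean_cv_accuracy

# Hypothetical inputs: 3 selected symptoms, 2 of which the disease lists.
score = displayed_score(
    disease_symptoms={"fever", "headache", "cough"},
    selected_symptoms=["fever", "headache", "nausea"],
    mean_cv_accuracy=0.8,  # assumed value, not taken from the notebook
)
print(round(score * 100, 2))
```

Two diseases that match the same number of selected symptoms therefore receive the same percentage, which explains the repeated values in the list above.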

