
## <center>Project name: Medical Diagnosis System</center>
#### Course name: Artificial Intelligence
#### By: Rama Salahat
***

<div class="alert alert-block alert-info">
<b>Summary</b><br/>
    The goal of this project is to build a __medical diagnosis system__ using a Naïve Bayes classification model that predicts the class (disease) for an unseen set of symptoms, out of 134 diseases.<br>
    <a href="http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html" target="_blank">Source of the dataset used in this project</a>

 
</div>


In [1]:
#import needed libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
#read the dataframe
df = pd.read_csv('dis_symp_updated.csv')
df.head()

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
0,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0008031_pain chest
1,,,UMLS:C0392680_shortness of breath
2,,,UMLS:C0012833_dizziness
3,,,UMLS:C0004093_asthenia
4,,,UMLS:C0085639_fall


## Building a Training dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1965 entries, 0 to 1964
Data columns (total 3 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Disease                      143 non-null    object 
 1   Count of Disease Occurrence  143 non-null    float64
 2   Symptom                      1964 non-null   object 
dtypes: float64(1), object(2)
memory usage: 46.2+ KB


<div class="alert alert-block alert-danger">
<b>This shows that one record has a null symptom. Let's extract it and delete it.</b>
</div>



In [4]:
missing_symptoms = df[df["Symptom"].isnull()]
missing_symptoms

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
1681,,,


In [5]:
df=df.drop(missing_symptoms.index.array, axis=0) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1964 entries, 0 to 1964
Data columns (total 3 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Disease                      143 non-null    object 
 1   Count of Disease Occurrence  143 non-null    float64
 2   Symptom                      1964 non-null   object 
dtypes: float64(1), object(2)
memory usage: 61.4+ KB


<blockquote><b>Let's also search for records where the symptom is an empty string</b></blockquote>

In [6]:
missing_symptoms = df[df["Symptom"]==""]
missing_symptoms

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom


In [7]:
df=df.drop(missing_symptoms.index.array, axis=0) 
missing_symptoms = df[df["Symptom"]==""]
missing_symptoms

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom


<div class="alert alert-block alert-success">
<b>The data is now clean and ready for parsing</b>
</div>


<blockquote><b>In the dataframe, "Disease" and "Count of Disease Occurrence" are left empty when their values are the same as in the last record that stated them. Let's fill these gaps with the actual values.</b></blockquote>


In [8]:
df = df.ffill()
df

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
0,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0008031_pain chest
1,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0392680_shortness of breath
2,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0012833_dizziness
3,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0004093_asthenia
4,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0085639_fall
...,...,...,...
1960,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0232257_systolic murmur
1961,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0871754_frail
1962,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0015967_fever
1963,UMLS:C0011127_decubitus ulcer,20.0,UMLS:C0232257_systolic murmur
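
The forward fill used above can be illustrated on a small toy frame. This is a minimal sketch: the frame `toy`, its column names and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame in the same layout as the raw file: the disease and its count
# appear only on the first row of each group (values are hypothetical).
toy = pd.DataFrame({
    "Disease": ["disease A", None, None, "disease B", None],
    "Count": [10.0, None, None, 4.0, None],
    "Symptom": ["s1", "s2", "s3", "s1", "s4"],
})

# Forward fill copies the last non-null value down each column.
filled = toy.ffill()
print(filled)
```

Because the fill runs top to bottom, it relies on the rows staying in their original file order.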


<blockquote><b>Some records in the dataframe hold two diseases or two symptoms in a single field, separated by a caret (^). Let's split each such record into two.</b></blockquote>


In [9]:
df['Symptom'] = df['Symptom'].apply(lambda x: x.split('^'))
df=df.explode('Symptom').reset_index()
df=df.drop(columns=['index'])
df

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
0,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0008031_pain chest
1,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0392680_shortness of breath
2,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0012833_dizziness
3,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0004093_asthenia
4,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0085639_fall
...,...,...,...
2002,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0232257_systolic murmur
2003,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0871754_frail
2004,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0015967_fever
2005,UMLS:C0011127_decubitus ulcer,20.0,UMLS:C0232257_systolic murmur


In [10]:
df['Disease'] = df['Disease'].apply(lambda x: x.split('^'))
df=df.explode('Disease').reset_index()
df=df.drop(columns=['index'])
df

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
0,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0008031_pain chest
1,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0392680_shortness of breath
2,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0012833_dizziness
3,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0004093_asthenia
4,UMLS:C0020538_hypertensive disease,3363.0,UMLS:C0085639_fall
...,...,...,...
2224,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0232257_systolic murmur
2225,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0871754_frail
2226,UMLS:C0011127_decubitus ulcer,42.0,UMLS:C0015967_fever
2227,UMLS:C0011127_decubitus ulcer,20.0,UMLS:C0232257_systolic murmur
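
The effect of splitting on `^` and exploding both columns can be seen on a toy frame. This is only a sketch with made-up values; it uses `Series.str.split` instead of the `apply` call above, which should behave the same for a single-character separator.

```python
import pandas as pd

# Hypothetical records: the first row packs two diseases and two symptoms.
toy = pd.DataFrame({
    "Disease": ["d1^d2", "d3"],
    "Count": [5.0, 7.0],
    "Symptom": ["s1^s2", "s3"],
})

for column in ["Symptom", "Disease"]:
    # Turn "a^b" into ["a", "b"], then give every list element its own row.
    toy[column] = toy[column].str.split("^")
    toy = toy.explode(column).reset_index(drop=True)

print(toy)
```

A row with two diseases and two symptoms becomes four rows, one per disease and symptom pair, which is why the row count grows after each step.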


<blockquote><b>Let's strip the UMLS code prefix from the diseases and symptoms and keep only their names</b></blockquote>


In [11]:
df["Disease"]=df["Disease"].str[14:]
df["Symptom"]=df["Symptom"].str[14:]
df

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom
0,hypertensive disease,3363.0,pain chest
1,hypertensive disease,3363.0,shortness of breath
2,hypertensive disease,3363.0,dizziness
3,hypertensive disease,3363.0,asthenia
4,hypertensive disease,3363.0,fall
...,...,...,...
2224,decubitus ulcer,42.0,systolic murmur
2225,decubitus ulcer,42.0,frail
2226,decubitus ulcer,42.0,fever
2227,decubitus ulcer,20.0,systolic murmur
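
The fixed slice `str[14:]` assumes that every prefix is exactly 14 characters long, as in `UMLS:C0008031_`. A possible alternative, sketched here under the assumption that the name always follows the first underscore, is to split on that underscore. The helper `strip_umls_prefix` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def strip_umls_prefix(values: pd.Series) -> pd.Series:
    """Drop everything up to and including the first underscore (hypothetical helper)."""
    # n=1 splits only once, so underscores inside a name are preserved;
    # a value without any underscore is returned unchanged.
    return values.str.split("_", n=1).str[-1]

sample = pd.Series(["UMLS:C0008031_pain chest", "UMLS:C0020538_hypertensive disease"])
cleaned = strip_umls_prefix(sample)
print(cleaned.tolist())
```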


<div class="alert alert-block alert-success">
<b>The training dataset is now ready to be used</b>
</div>


## Model Parameters Estimation

* ### Distribution of the Naïve Bayes Classifier

<blockquote><b>Let's create the distribution table by calculating the count of each symptom for each disease</b></blockquote>


In [12]:
dummies=pd.get_dummies(df, columns = ['Symptom'])
Diseases=dummies["Disease"]
dummies

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,...,Symptom_vision blurred,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum
0,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2224,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2225,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2226,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2227,decubitus ulcer,20.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<blockquote><b>Let's multiply each symptom indicator by its "Count of Disease Occurrence", group the records by disease, and then drop the count column since it is no longer needed.</b></blockquote>


In [13]:
clmn = list(dummies) 

In [14]:
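# clmn[1] is "Count of Disease Occurrence"
dummies[clmn[1]]=(dummies[clmn[1]]).astype("int")
# weight every indicator column by the disease count (row-wise multiply)
dummies=dummies.mul(dummies[clmn[1]],axis=0)
# the multiply also altered the string column, so restore the disease names
dummies["Disease"]=Diseases
# the count column was multiplied by itself; the square root undoes that
dummies[clmn[1]]=dummies[clmn[1]] **(1/2)
dummies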

Unnamed: 0,Disease,Count of Disease Occurrence,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,...,Symptom_vision blurred,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum
0,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,hypertensive disease,3363.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2224,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2225,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2226,decubitus ulcer,42.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2227,decubitus ulcer,20.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
dummies.iloc[0]["Symptom_pain chest"]

3363

In [16]:
dummies = dummies.groupby('Disease').sum().reset_index()


In [17]:
dummies=dummies.drop(columns=[clmn[1]])
dummies

Unnamed: 0,Disease,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,...,Symptom_vision blurred,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum
0,Alzheimer's disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,101,0,0,0,0
1,HIV,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Pneumocystis carinii pneumonia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,113
3,accident cerebrovascular,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,acquired immuno-deficiency syndrome,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,tonic-clonic seizures,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
145,transient ischemic attack,0,0,0,168,0,0,0,0,0,...,168,0,0,0,0,0,0,0,0,0
146,tricuspid valve insufficiency,0,0,0,0,0,0,0,0,0,...,0,101,0,0,0,0,0,0,0,0
147,ulcer peptic,0,0,0,0,0,0,0,0,0,...,0,143,0,0,0,0,0,0,0,0
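
The same kind of disease-by-symptom count table can also be built in one step with `pivot_table`. The sketch below uses a made-up toy frame and hypothetical column names; it is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical long-format data: one row per (disease, symptom) pair,
# with the disease's occurrence count repeated on each row.
toy = pd.DataFrame({
    "Disease": ["d1", "d1", "d2", "d2"],
    "Count": [10, 10, 4, 4],
    "Symptom": ["s1", "s2", "s1", "s3"],
})

# Rows are diseases, columns are symptoms, cells hold the summed counts;
# pairs that never occur are filled with 0.
counts = toy.pivot_table(index="Disease", columns="Symptom", values="Count",
                         aggfunc="sum", fill_value=0)
print(counts)
```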


<blockquote><b>Now let's calculate the sum of symptoms for each disease, and the sum of each symptom for all diseases.</b></blockquote>


In [18]:
dummies["sum"]=dummies.sum(axis=1, numeric_only=True)
SUM=dummies["sum"].sum()
dummies

Unnamed: 0,Disease,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,...,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum,sum
0,Alzheimer's disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,101,0,0,0,0,1818
1,HIV,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5250
2,Pneumocystis carinii pneumonia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,113,2034
3,accident cerebrovascular,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7080
4,acquired immuno-deficiency syndrome,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,tonic-clonic seizures,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1564
145,transient ischemic attack,0,0,0,168,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2856
146,tricuspid valve insufficiency,0,0,0,0,0,0,0,0,0,...,101,0,0,0,0,0,0,0,0,1010
147,ulcer peptic,0,0,0,0,0,0,0,0,0,...,143,0,0,0,0,0,0,0,0,1716


In [19]:
dummies=dummies.append(dummies.sum(axis=0), ignore_index=True)
dummies

Unnamed: 0,Disease,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,...,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum,sum
0,Alzheimer's disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,101,0,0,0,0,1818
1,HIV,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5250
2,Pneumocystis carinii pneumonia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,113,2034
3,accident cerebrovascular,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7080
4,acquired immuno-deficiency syndrome,350,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,transient ischemic attack,0,0,0,168,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2856
146,tricuspid valve insufficiency,0,0,0,0,0,0,0,0,0,...,101,0,0,0,0,0,0,0,0,1010
147,ulcer peptic,0,0,0,0,0,0,0,0,0,...,143,0,0,0,0,0,0,0,0,1716
148,upper respiratory infection,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,213,0,0,0,5262


<blockquote><b>Now let's add 1 to every count for Laplace smoothing, so that no symptom has zero probability for any disease.</b></blockquote>


In [20]:
# every column except "Disease", with 1 added to each count
X=dummies[dummies.columns[1:]]+1
X

Unnamed: 0,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,Symptom_abortion,...,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum,sum
0,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,102,1,1,1,1,1819
1,351,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,5251
2,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,114,2035
3,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,7081
4,351,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,5251
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,1,1,1,169,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,2857
146,1,1,1,1,1,1,1,1,1,1,...,102,1,1,1,1,1,1,1,1,1011
147,1,1,1,1,1,1,1,1,1,1,...,144,1,1,1,1,1,1,1,1,1717
148,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,214,1,1,1,5263


<div class="alert alert-block alert-success">
<b>The Distribution table is ready</b>
</div>


* ### Likelihood table

<blockquote><b>Let's calculate the likelihood table by dividing every smoothed count by the total symptom count of its disease</b></blockquote>


In [21]:
X=X.divide((dummies["sum"]),axis=0)
X

Unnamed: 0,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,Symptom_abortion,...,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum,sum
0,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,...,0.000550,0.000550,0.000550,0.000550,0.056106,0.000550,0.000550,0.000550,0.000550,1.000550
1,0.066857,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,1.000190
2,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,...,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.056047,1.000492
3,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,...,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,1.000141
4,0.066857,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,1.000190
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.000350,0.000350,0.000350,0.059174,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,...,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,1.000350
146,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,...,0.100990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,1.000990
147,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,...,0.083916,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,1.000583
148,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.040669,0.000190,0.000190,0.000190,1.000190
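
In the table above, each smoothed count is divided by the unsmoothed per-disease total, so the likelihoods of one disease add up to slightly more than 1. The textbook form of Laplace smoothing also adds the number of symptoms to the denominator. The sketch below shows that variant on a made-up count table; `alpha` and `vocabulary_size` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical weighted counts: rows are diseases, columns are symptoms.
counts = pd.DataFrame(
    {"s1": [10, 4], "s2": [10, 0], "s3": [0, 4]},
    index=["d1", "d2"],
)

alpha = 1                          # smoothing strength (add-one smoothing)
vocabulary_size = counts.shape[1]  # number of distinct symptoms

# (count + alpha) / (row total + alpha * number of symptoms)
smoothed = (counts + alpha).div(counts.sum(axis=1) + alpha * vocabulary_size, axis=0)
print(smoothed)
print(smoothed.sum(axis=1))  # each row sums to 1
```

With this denominator every row is a proper probability distribution, while unseen symptoms still receive a small non-zero probability.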


In [22]:
# replace the "sum" column with the prior: each disease's share of the grand total
X["sum"]=dummies["sum"]/int(SUM)
X

Unnamed: 0,Symptom_,Symptom_Heberden's node,Symptom_Murphy's sign,Symptom_Stahli's line,Symptom_abdomen acute,Symptom_abdominal bloating,Symptom_abdominal tenderness,Symptom_abnormal sensation,Symptom_abnormally hard consistency,Symptom_abortion,...,Symptom_vomiting,Symptom_weepiness,Symptom_weight gain,Symptom_welt,Symptom_wheelchair bound,Symptom_wheezing,Symptom_withdraw,Symptom_worry,Symptom_yellow sputum,sum
0,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,0.000550,...,0.000550,0.000550,0.000550,0.000550,0.056106,0.000550,0.000550,0.000550,0.000550,0.002974
1,0.066857,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.008587
2,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,...,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.000492,0.056047,0.003327
3,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,...,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.000141,0.011581
4,0.066857,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.008587
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.000350,0.000350,0.000350,0.059174,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,...,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.000350,0.004672
146,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,...,0.100990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.000990,0.001652
147,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,...,0.083916,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.000583,0.002807
148,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,0.000190,...,0.000190,0.000190,0.000190,0.000190,0.000190,0.040669,0.000190,0.000190,0.000190,0.008607


In [23]:
Diseases=(dummies["Disease"])
# row 149 is the appended column-total row, not a disease
del Diseases[149]

<div class="alert alert-block alert-success">
<b>The Likelihood table is ready for inference!</b>
</div>


## Inference

<blockquote><b>Let's build the inference function, which takes a list of symptoms and returns the 10 most probable diseases</b></blockquote>


In [24]:
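def inference10(symp):
    # score every disease: prior times the product of the symptom likelihoods
    possibilities=dict((el,1) for el in Diseases) 
    for el in range(len(Diseases)):
        x=1.0
        for sym in symp:
            x=x*X.iloc[el][sym]      # P(symptom | disease)
        x=x*X.iloc[el]["sum"]        # P(disease), stored in the "sum" column
        possibilities[Diseases[el]]=x
    # sort by score, highest first, and keep the ten best
    possibilities=sorted(possibilities.items(), key=lambda kv: kv[1], reverse=True)
    return possibilities[:10]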
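
Multiplying many small probabilities can underflow when the symptom list is long. A common remedy is to add logarithms instead, which gives the same ranking. The following is a minimal sketch on a made-up two-disease table; `likelihood`, `prior` and `log_scores` are hypothetical names, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical likelihood table P(symptom | disease) and prior P(disease).
likelihood = pd.DataFrame(
    {"Symptom_s1": [0.6, 0.1], "Symptom_s2": [0.3, 0.2]},
    index=["d1", "d2"],
)
prior = pd.Series([0.5, 0.5], index=["d1", "d2"])

def log_scores(symptoms):
    """Return log(prior * product of likelihoods) per disease, best first."""
    # The sum of logs equals the log of the product, without underflow.
    log_like = np.log(likelihood[symptoms]).sum(axis=1)
    return (log_like + np.log(prior)).sort_values(ascending=False)

ranked = log_scores(["Symptom_s1", "Symptom_s2"])
print(ranked)
```

The ordering matches the product form because the logarithm is monotonic; only the scale of the scores changes.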

In [25]:
symp=["pain chest", "shortness of breath","dizziness"]
# prefix each symptom so that it matches the dummy column names
for el in range(len(symp)):
    symp[el]="Symptom_"+symp[el]
#call the function
x=inference10(symp)
x

[('hypertensive disease', 5.674034472749759e-05),
 ('hyperlipidemia', 2.4197647663656764e-06),
 ('kidney disease', 4.638709964779135e-08),
 ('coronary arteriosclerosis', 1.6382301910041224e-08),
 ('coronary heart disease', 1.6382301910041224e-08),
 ('sickle cell anemia', 1.3711832553021526e-08),
 ('stenosis aortic valve', 1.3689683951085569e-08),
 ('paroxysmal\xa0dyspnea', 1.1496999513600985e-08),
 ('obesity', 9.763674296698834e-09),
 ('diabetes', 8.598728519476915e-09)]

<div class="alert alert-block alert-success">
<b>The 10 highest-scoring diseases are listed with their scores (unnormalized posterior probabilities)</b>
</div>
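
The scores above are products of a prior and likelihoods, so they do not add up to 1. Dividing each score by the total turns them into relative probabilities. The sketch below assumes a list of `(name, score)` pairs like the one returned by `inference10`; the helper `normalize_scores` and the example values are made up. Note that normalizing over only the top 10 is an approximation, since the remaining diseases are ignored.

```python
def normalize_scores(scored):
    """Rescale (name, score) pairs so the scores sum to 1 (hypothetical helper)."""
    total = sum(score for _, score in scored)
    return [(name, score / total) for name, score in scored]

# Hypothetical unnormalized scores for two diseases.
example = [("d1", 0.09), ("d2", 0.01)]
posterior = normalize_scores(example)
print(posterior)
```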


<blockquote><b>Now let's build a function that takes the calculated list of possible diseases and the given symptoms, and returns a list of further symptoms the patient should be asked about to better determine the disease.</b><br>
This function has an optional "last" parameter that manually sets how many diseases from the top of the list are considered. </blockquote>


In [26]:
def suggest(x, symp, last=0):
    # default: consider every disease scoring above 1% of the top score
    if last==0:
        for possibility in x: 
            if possibility[1]/x[0][1]>.01 : 
                last+=1
    ToAskSymp=[]
    for dis in range(last):
        # row of the likelihood table that belongs to this disease
        row_count=Diseases[Diseases==(x[dis][0])].index[0]
        record=X[row_count:row_count+1]
        record.pop("sum")
        record=record.T
        record=record.sort_values(by=row_count, ascending=False)
        # take the most likely symptom that is neither given nor already suggested
        for sym in record.index:
            if sym in symp:continue
            if sym[8:] in ToAskSymp:continue
            ToAskSymp.append(sym[8:])
            break

    return ToAskSymp

<blockquote><b>Let's try the function with and without the "last" parameter, then use one of the suggested symptoms to recalculate the inference result</b></blockquote>


In [27]:
suggest(x,symp)

['sweating increased', 'photopsia']
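
Two symptoms are returned because only two diseases pass the default cutoff of 1% of the top score. This can be checked by hand with the first three scores from the inference output above; the variable names in this sketch are illustrative.

```python
# First three (disease, score) pairs copied from the inference output.
scores = [
    ("hypertensive disease", 5.674034472749759e-05),
    ("hyperlipidemia", 2.4197647663656764e-06),
    ("kidney disease", 4.638709964779135e-08),
]

top = scores[0][1]
# Keep the diseases whose score is more than 1% of the best score.
kept = [name for name, score in scores if score / top > 0.01]
print(kept)
```

Hyperlipidemia scores roughly 4% of the top score and is kept, while kidney disease falls below 0.1% and is dropped, so one question is suggested for each of the two remaining diseases.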

In [28]:
suggest(x,symp,5)

['sweating increased',
 'photopsia',
 'fever',
 'angina pectoris',
 'pressure chest']

In [29]:
symp=["pain chest", "shortness of breath","dizziness","photopsia"]
# prefix each symptom so that it matches the dummy column names
for el in range(len(symp)):
    symp[el]="Symptom_"+symp[el]
#call the function
x=inference10(symp)
x

[('hyperlipidemia', 1.868893372963836e-07),
 ('hypertensive disease', 8.161847081732705e-10),
 ('kidney disease', 8.053315911074887e-11),
 ('sickle cell anemia', 8.9037873720919e-12),
 ('stenosis aortic valve', 7.87668811915165e-12),
 ('paroxysmal\xa0dyspnea', 5.806565410909588e-12),
 ('obesity', 3.2940871446352342e-12),
 ('ischemia', 2.1634722906486747e-12),
 ('edema pulmonary', 2.0936513152257637e-12),
 ('cardiomyopathy', 1.7246553254711011e-12)]

<div class="alert alert-block alert-success">
<b>Adding "photopsia" shifted the result: hyperlipidemia is now the top diagnosis, ahead of hypertensive disease</b>
</div>
