# Learn Structure and Parameters with the famous Flu dataset
Source and problem description: https://people.cs.umass.edu/~mccallum/courses/gm2011/hw1.html

Install these package versions:
- conda install pandas=0.18.1 numpy=1.11.1 networkx=1.11 scipy=0.18.0 jupyter
- conda install -c ankurankan pgmpy

Check https://github.com/pgmpy/pgmpy occasionally for updated installation instructions and API changes.

In [3]:
import numpy as np
import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
from pgmpy.factors.discrete import TabularCPD


The dataset has 12 binary variables, numbered (0) to (11):

    (0) IsSummer true if it is the summer season, false otherwise.
    (1) HasFlu true if the patient has the flu.
    (2) HasFoodPoisoning true if the patient has food poisoning.
    (3) HasHayFever true if the patient has hay fever.
    (4) HasPneumonia true if the patient has pneumonia.
    (5) HasRespiratoryProblems true if the patient has problems in the respiratory system.
    (6) HasGastricProblems true if the patient has problems in the gastro-intestinal system.
    (7) HasRash true if the patient has a skin rash.
    (8) Coughs true if the patient has a cough.
    (9) IsFatigued true if the patient is tired and fatigued.
    (10) Vomits true if the patient has vomited.
    (11) HasFever true if the patient has a high fever.
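
The numbers in parentheses can be read as bit positions. Assuming, as the decoding cell further down does, that each sample is a 12-bit integer whose bit i holds variable (i), a single sample can be decoded with plain bit operations. This is a minimal sketch; `decode_sample` and `active_variables` are hypothetical helper names, not part of the original notebook.

```python
# Variable names in bit order: bit 0 is IsSummer, bit 11 is HasFever.
COLUMNS = ['IsSummer', 'HasFlu', 'HasFoodPoisoning', 'HasHayFever',
           'HasPneumonia', 'HasRespiratoryProblems', 'HasGastricProblems',
           'HasRash', 'Coughs', 'IsFatigued', 'Vomits', 'HasFever']


def decode_sample(code):
    """Map one encoded integer to a {variable: 0 or 1} dict (hypothetical helper)."""
    return {name: (code >> bit) & 1 for bit, name in enumerate(COLUMNS)}


def active_variables(code):
    """Return the names of the variables that are true in this sample."""
    return [name for name, value in decode_sample(code).items() if value]


# 136 = 128 + 8, so bits 7 and 3 are set.
print(active_variables(136))  # ['HasHayFever', 'HasRash']
```

The same shift-and-mask idea could be vectorized over a NumPy array if string-free integer columns are preferred.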


In [4]:
columns=['IsSummer', 'HasFlu', 'HasFoodPoisoning', 'HasHayFever', 'HasPneumonia', 
         'HasRespiratoryProblems', 'HasGastricProblems', 'HasRash', 'Coughs', 
         'IsFatigued', 'Vomits', 'HasFever']
columns

['IsSummer',
 'HasFlu',
 'HasFoodPoisoning',
 'HasHayFever',
 'HasPneumonia',
 'HasRespiratoryProblems',
 'HasGastricProblems',
 'HasRash',
 'Coughs',
 'IsFatigued',
 'Vomits',
 'HasFever']

In [5]:
# Original data set can be found here (4M samples)
# https://people.cs.umass.edu/~mccallum/courses/gm2011/hw1.tar.gz
# np.int was removed in NumPy 1.24; the builtin int is equivalent and works on old versions too
encoded_samples = np.loadtxt("flu_samples.dat").astype(int)

In [44]:
encoded_samples[:5]

array([  1, 136,   8,   0, 129])

In [45]:
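# Decode each integer into a 12-bit binary string; reversing puts bit 0 (IsSummer) first.
# Note: the resulting cells are the strings '0' and '1', not integers.
samples_df = pd.DataFrame([list(reversed('{0:012b}'.format(s))) for s in encoded_samples],
                          columns=columns)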

In [47]:
samples_df[:10]

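|   | IsSummer | HasFlu | HasFoodPoisoning | HasHayFever | HasPneumonia | HasRespiratoryProblems | HasGastricProblems | HasRash | Coughs | IsFatigued | Vomits | HasFever |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |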


In [75]:
df=samples_df[:128000]
df.shape

(128000, 12)
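
The next cell ranks candidate structures with `K2Score`. As a rough sketch of what such a score measures, the textbook K2 local log-score (Cooper and Herskovits) for one node can be computed from a table of counts. This is an illustration under assumed inputs, not pgmpy's implementation; `k2_local_score` and the `counts` layout are hypothetical names chosen here.

```python
from math import lgamma


def k2_local_score(counts):
    """Log K2 score of one node (textbook formula, hypothetical helper).

    counts[j][k] is the number of rows in which the node's parents are in
    configuration j and the node itself is in state k.
    """
    score = 0.0
    for row in counts:
        r = len(row)  # number of states of the node
        # log( (r-1)! / (N_j + r - 1)! ) for this parent configuration
        score += lgamma(r) - lgamma(sum(row) + r)
        # log( prod_k N_jk! )
        score += sum(lgamma(n + 1) for n in row)
    return score


# Toy counts for a binary child: no parent (one configuration) versus
# one binary parent (two configurations) that predicts the child well.
no_parent = [[50, 50]]
with_parent = [[45, 5], [5, 45]]
print(k2_local_score(with_parent) > k2_local_score(no_parent))  # True
```

Because the score is a sum over nodes, a search procedure only needs to rescore the nodes whose parent sets change when an edge is added, removed, or reversed.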

In [76]:
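from pgmpy.estimators import HillClimbSearch, BicScore, BdeuScore, K2Score
# Greedy hill-climbing search over DAG structures. K2Score is used here;
# uncomment one of the lines below to score with BIC or BDeu instead.
#est = HillClimbSearch(df, scoring_method=BicScore(df))
#est = HillClimbSearch(df, scoring_method=BdeuScore(df))
est = HillClimbSearch(df, scoring_method=K2Score(df))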
%time best_model = est.estimate()
print('======== nodes ==========')
print(sorted(best_model.nodes()))
print('======== edges ==========')
for edge in best_model.edges():
    print(edge)

Wall time: 12min 34s
======== nodes ==========
['Coughs', 'HasFever', 'HasFlu', 'HasFoodPoisoning', 'HasGastricProblems', 'HasHayFever', 'HasPneumonia', 'HasRash', 'HasRespiratoryProblems', 'IsFatigued', 'IsSummer', 'Vomits']
======== edges ==========
('HasFlu', 'HasFever')
('HasFlu', 'IsSummer')
('HasFlu', 'HasRespiratoryProblems')
('HasHayFever', 'HasRespiratoryProblems')
('HasGastricProblems', 'HasFoodPoisoning')
('HasGastricProblems', 'Vomits')
('HasGastricProblems', 'IsFatigued')
('HasRespiratoryProblems', 'IsFatigued')
('HasRespiratoryProblems', 'Coughs')
('HasPneumonia', 'HasRespiratoryProblems')
('HasPneumonia', 'HasFoodPoisoning')
('HasPneumonia', 'HasFever')
('HasPneumonia', 'HasGastricProblems')
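
The printed edge list is easier to inspect once it is grouped by child node and checked for cycles. The sketch below copies the edges from the output above and uses plain Python (Kahn's algorithm) instead of pgmpy or networkx; `parents_of` and `topological_order` are hypothetical helper names.

```python
from collections import defaultdict, deque

# Edges copied from the structure-learning output above, as (parent, child).
EDGES = [
    ('HasFlu', 'HasFever'), ('HasFlu', 'IsSummer'),
    ('HasFlu', 'HasRespiratoryProblems'),
    ('HasHayFever', 'HasRespiratoryProblems'),
    ('HasGastricProblems', 'HasFoodPoisoning'),
    ('HasGastricProblems', 'Vomits'), ('HasGastricProblems', 'IsFatigued'),
    ('HasRespiratoryProblems', 'IsFatigued'),
    ('HasRespiratoryProblems', 'Coughs'),
    ('HasPneumonia', 'HasRespiratoryProblems'),
    ('HasPneumonia', 'HasFoodPoisoning'), ('HasPneumonia', 'HasFever'),
    ('HasPneumonia', 'HasGastricProblems'),
]


def parents_of(edges):
    """Group the edge list into a {child: [parents]} dict."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    return dict(parents)


def topological_order(edges):
    """Kahn's algorithm; raises ValueError if the edges contain a cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        indegree[child] += 1
        children[parent].append(child)
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("edge list contains a cycle")
    return order


print(sorted(parents_of(EDGES)['HasRespiratoryProblems']))
# ['HasFlu', 'HasHayFever', 'HasPneumonia']
print(topological_order(EDGES)[:3])
# ['HasFlu', 'HasHayFever', 'HasPneumonia']
```

HasRash does not appear in any printed edge, so only 11 of the 12 nodes show up in the ordering. The search also oriented one edge as HasFlu to IsSummer, which is a reminder that score-based learning recovers statistical dependence, not causal direction.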


#### Diagram of the learned structure, drawn with https://www.draw.io/

![flu_structure](flu_for_draw_io.png)

## Parameter Estimation
### MaximumLikelihoodEstimator
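
For a discrete network with complete data, maximum likelihood estimation reduces to normalized counts: each CPD entry is the number of rows with a given child state and parent configuration, divided by the number of rows with that parent configuration. The sketch below shows this for a single parent on made-up toy rows; `mle_cpd` is a hypothetical helper, not a pgmpy function.

```python
from collections import Counter


def mle_cpd(rows, child, parent):
    """Estimate P(child | parent) from a list of dict rows by counting."""
    joint = Counter((row[parent], row[child]) for row in rows)
    marginal = Counter(row[parent] for row in rows)
    return {(p, c): n / marginal[p] for (p, c), n in joint.items()}


# Made-up toy rows, not taken from flu_samples.dat.
rows = (
    [{'HasRespiratoryProblems': 0, 'Coughs': 0}] * 9
    + [{'HasRespiratoryProblems': 0, 'Coughs': 1}] * 1
    + [{'HasRespiratoryProblems': 1, 'Coughs': 0}] * 1
    + [{'HasRespiratoryProblems': 1, 'Coughs': 1}] * 3
)
cpd = mle_cpd(rows, child='Coughs', parent='HasRespiratoryProblems')
# 3 of the 4 rows with respiratory problems also cough.
print(cpd[(1, 1)])  # 0.75
```

Each column of a CPD estimated this way sums to 1, which is a quick sanity check on the tables printed by the cell below.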

In [79]:
from pgmpy.estimators import MaximumLikelihoodEstimator
estimator = MaximumLikelihoodEstimator(best_model, df)
for cpd in estimator.get_parameters():
    print('cpd for variable %s:\n%s\n' % (cpd.variable, str(cpd)))

cpd for variable Coughs:
+------------------------+---------------------------+---------------------------+
| HasRespiratoryProblems | HasRespiratoryProblems(0) | HasRespiratoryProblems(1) |
+------------------------+---------------------------+---------------------------+
| Coughs(0)              | 0.899119244861            | 0.254916792738            |
+------------------------+---------------------------+---------------------------+
| Coughs(1)              | 0.100880755139            | 0.745083207262            |
+------------------------+---------------------------+---------------------------+

cpd for variable HasFever:
+--------------+-----------------+-----------------+-----------------+-----------------+
| HasFlu       | HasFlu(0)       | HasFlu(0)       | HasFlu(1)       | HasFlu(1)       |
+--------------+-----------------+-----------------+-----------------+-----------------+
| HasPneumonia | HasPneumonia(0) | HasPneumonia(1) | HasPneumonia(0) | HasPneumonia(1) |
+---------