# Bayes Networks: Less Naïve Bayes

## Introduction
When estimating a joint probability distribution from data, there are two extremes. The first makes no conditional independence assumptions, but is often intractable because the number of parameters grows exponentially with the number of variables. The second, Naïve Bayes, assumes that all inputs are conditionally independent given the target variable. It is simple and efficient, but the assumption often does not hold, so the resulting model is not always accurate.  
Graphical models, and Bayes Nets more specifically, offer a middle ground between these extremes: they make just enough conditional independence assumptions to be both accurate and efficient.
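
To make the difference in size concrete, here is a minimal sketch that counts the free parameters of each kind of model. The function names and the setup (one binary target `Y` and ten binary inputs) are hypothetical and chosen only for illustration.

```python
def full_joint_params(cards):
    """Free parameters of an unrestricted joint table over all variables."""
    total = 1
    for card in cards.values():
        total *= card
    return total - 1  # the probabilities must sum to 1


def bayes_net_params(cards, parents):
    """Free parameters of a Bayes net: (card - 1) per parent configuration."""
    total = 0
    for node, card in cards.items():
        n_configs = 1
        for parent in parents.get(node, []):
            n_configs *= cards[parent]
        total += (card - 1) * n_configs
    return total


# Hypothetical setup: a binary target Y and ten binary inputs X1..X10.
cards = {"Y": 2}
cards.update({"X%d" % i: 2 for i in range(1, 11)})

# Naive Bayes structure: Y is the only parent of every input.
naive_parents = {"X%d" % i: ["Y"] for i in range(1, 11)}

print(full_joint_params(cards))                # 2047 for the unrestricted joint
print(bayes_net_params(cards, naive_parents))  # 21 for Naive Bayes
```

A Bayes net with a few extra edges falls between these two counts, which is the middle ground described above.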

## Definition
As the name suggests, graphical models use a graph structure to model assumptions about conditional independences. In this tutorial we will study the most common type of model: Bayes Networks. Here is the formal definition:   

A **Bayes Network** is a Directed Acyclic Graph $G$ together with a set of conditional probability distributions $P$.
* Nodes of the graph are random variables (inputs, outputs).
* Edges encode the fact that two random variables are probabilistically related.
* For a given node $X$, we must have $Pr(X \mid parents(X)) \in P$, where $parents(X)$ is the set of parents of $X$ in the graph $G$.
* The joint probability is given by:
$$ Pr(X_1, ..., X_n) = \prod_{i = 1}^{n} Pr(X_i \mid parents(X_i))$$
where $X_1, ..., X_n$ are the random variables representing inputs and outputs.

These probabilities are generally obtained after training. In this tutorial we will assume that the graph structure is known and that observed data is enough to estimate probabilities. Also, we only consider discrete finite random variables. There is a lot of ongoing research on learning from partially observed data or with an unknown graph structure but that's beyond the scope of this tutorial.

## Example
Let us first see an easy example before moving on.
![Alt text](http://g.gravizo.com/g?
  digraph G {
    Pollution -> Cancer;
    Smoker -> Cancer;
    Cancer -> XRay;
    Cancer -> Dyspnoea;
  }
)

Here, there are edges from Pollution and Smoker to Cancer because both are risk factors for cancer. Likewise, having cancer makes a positive X-ray and trouble breathing (Dyspnoea) more likely.  
The probabilities can look like this:

| Pollution | Smoker | P(Cancer = True given Pollution, Smoker) |
| --------- | ------ | ---------------------------------------- |
| High      | True   | 0.05                                     |
| High      | False  | 0.02                                     |
| Low       | True   | 0.03                                     |
| Low       | False  | 0.001                                    |
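
Applying the joint probability formula from the definition to this graph gives the following factorization, written out here as a worked instance with the node names from the figure:

$$ Pr(Pollution, Smoker, Cancer, XRay, Dyspnoea) = Pr(Pollution) \, Pr(Smoker) \, Pr(Cancer \mid Pollution, Smoker) \, Pr(XRay \mid Cancer) \, Pr(Dyspnoea \mid Cancer) $$

The table above is the factor $Pr(Cancer \mid Pollution, Smoker)$. Assuming all five variables are binary, this factorization needs $1 + 1 + 4 + 2 + 2 = 10$ free parameters, compared with $2^5 - 1 = 31$ for an unrestricted joint table.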


We can use the Python library **pgmpy** to build this graphical model. This library can be found [here](https://github.com/pgmpy/pgmpy).

In [1]:
# Imports
import pandas as pd
import numpy as np

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.estimators import BayesianEstimator

In [2]:
# First Define the graph structure above.
cancer_gm = BayesianModel([ ('Pollution', 'Cancer'), 
                            ('Smoker', 'Cancer'),
                            ('Cancer', 'Xray'),
                            ('Cancer', 'Dyspnoea')])

# Then define the conditional probabilities
# We only show the CPDs (conditional probability distributions) of the Cancer and Smoker nodes.
cpd_cancer = TabularCPD(variable='Cancer', variable_card=2,
                        values=[[0.999, 0.98, 0.97, 0.95],
                                [0.001, 0.02, 0.03, 0.05]],
                        evidence=['Smoker', 'Pollution'],
                        evidence_card=[2, 2])

cpd_smoke = TabularCPD(variable='Smoker', variable_card=2,
                       values=[[0.3], [0.7]])

# Add the conditional probabilities to the graphical model
cancer_gm.add_cpds(cpd_cancer)
cancer_gm.add_cpds(cpd_smoke)
print(cancer_gm.get_cpds("Smoker"))
print(cancer_gm.get_cpds("Cancer"))

+----------+-----+
| Smoker_0 | 0.3 |
+----------+-----+
| Smoker_1 | 0.7 |
+----------+-----+
+-----------+-------------+-------------+-------------+-------------+
| Smoker    | Smoker_0    | Smoker_0    | Smoker_1    | Smoker_1    |
+-----------+-------------+-------------+-------------+-------------+
| Pollution | Pollution_0 | Pollution_1 | Pollution_0 | Pollution_1 |
+-----------+-------------+-------------+-------------+-------------+
| Cancer_0  | 0.999       | 0.98        | 0.97        | 0.95        |
+-----------+-------------+-------------+-------------+-------------+
| Cancer_1  | 0.001       | 0.02        | 0.03        | 0.05        |
+-----------+-------------+-------------+-------------+-------------+


As you can see, we first define the graph structure (using edges) with the **BayesianModel** class. Then we define each
conditional probability distribution given the parents with the **TabularCPD** class. Here is the meaning of each argument:
* **variable**: Name of the node.
* **variable_card**: Number of values the variable can take (2 for a binary variable).
* **evidence**: List of parents.
* **evidence_card**: List containing the number of values each parent can take.
* **values**: The probabilities of the CPD, with one row per value of the variable and one column per combination of parent values. **The column order must match the order of the parents in `evidence`**.

Finally, we attach the CPDs to the **BayesianModel** with the **add_cpds** method.

For more information see [here](http://pgmpy.org/).
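
The printed Cancer table above suggests how the columns of `values` line up with the parents: the first parent in `evidence` varies slowest and the last one varies fastest. The sketch below reproduces that enumeration with `itertools.product`. It does not call pgmpy, and the helper name `column_assignments` is hypothetical.

```python
from itertools import product

evidence = ['Smoker', 'Pollution']
evidence_card = [2, 2]
# Second row of the Cancer CPD defined above: P(Cancer_1 | parents)
p_cancer_1 = [0.001, 0.02, 0.03, 0.05]


def column_assignments(evidence, evidence_card):
    """Parent-state combination of each CPD column, last parent varying fastest."""
    state_ranges = [range(card) for card in evidence_card]
    return [dict(zip(evidence, states)) for states in product(*state_ranges)]


columns = column_assignments(evidence, evidence_card)
for assignment, p in zip(columns, p_cancer_1):
    print(assignment, p)
```

Swapping the two entries of `evidence` without reordering `values` would silently attach the probabilities to the wrong parent combinations, so it is worth printing the CPD after defining it.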

## Training A Bayes Network
After defining an appropriate graph structure, we must obtain the Conditional
Probability Distributions through training to complete our Bayes Nets.

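Given $m$ training examples and a node $X_i$ with parents $X_{j_1}, X_{j_2}, ..., X_{j_k}$, we estimate:
$$Pr(X_i = x_i \mid X_{j_1}=x_{j_1}, ..., X_{j_k}=x_{j_k})
= \frac{N(X_i=x_i, X_{j_1} = x_{j_1}, ..., X_{j_k}=x_{j_k}) + \alpha}{N(X_{j_1} = x_{j_1}, ..., X_{j_k}=x_{j_k}) + \alpha D}$$
where $N(\cdot)$ is the number of the $m$ training examples that match the given assignment, $\alpha$ is the smoothing parameter, and $D$ is the number of values $X_i$ can take, i.e. the number of probabilities to estimate for each configuration of the parents (it is required for smoothing, so that the estimates still sum to 1).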

Intuitively the numerator stands for the number of training examples in which both $X_i$ and its parents have the given values. The denominator stands for the number of training examples in which the parents have the given values. This formula is thus nearly identical to the Naïve Bayes formula.
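
The counting described above can be sketched directly with pandas. This is a minimal illustration of the smoothed estimate, not pgmpy's estimator. The function name, the toy data, and the choice to take the number of values of the node from the values observed in the data are all assumptions made for the example.

```python
import pandas as pd


def smoothed_probability(df, child, child_value, parent_values, alpha=1.0):
    """Smoothed estimate of Pr(child = child_value | parents = parent_values)."""
    # Rows in which every parent has the requested value.
    mask = pd.Series(True, index=df.index)
    for parent, value in parent_values.items():
        mask &= (df[parent] == value)
    n_parents = int(mask.sum())
    # Rows in which, additionally, the child has the requested value.
    n_joint = int((mask & (df[child] == child_value)).sum())
    # Assumption: the number of child values is read from the observed data.
    d = df[child].nunique()
    return (n_joint + alpha) / (n_parents + alpha * d)


# Hypothetical toy data: 4 smokers (2 with cancer), 4 non-smokers (1 with cancer).
toy = pd.DataFrame({
    'Smoker': [1, 1, 1, 1, 0, 0, 0, 0],
    'Cancer': [1, 1, 0, 0, 0, 0, 0, 1],
})

p_smoker = smoothed_probability(toy, 'Cancer', 1, {'Smoker': 1})      # (2 + 1) / (4 + 2)
p_non_smoker = smoothed_probability(toy, 'Cancer', 1, {'Smoker': 0})  # (1 + 1) / (4 + 2)
print(p_smoker, p_non_smoker)
```

With `alpha=0` the estimate reduces to the plain relative frequency, which breaks down when a parent configuration never appears in the training data.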

We now have all the tools to build a real life Bayes Net. So let's do it!

## Example: Breast Cancer Prediction
In this example, we will use Breast Cancer data to show you the end-to-end process that allows us to make predictions using Bayes Networks and **pgmpy**. We will predict two variables at the same time: whether the patient has breast cancer and whether the tumor is benign, malignant, or will have no effect or influence at all.

This again is not a realistic example (the real model is far too complex for a tutorial). It uses a simplified version of the data, which can be found [here](https://www.cs.ru.nl/~peterl/BN/bc.csv), and of the complete graph structure, which can be obtained [here](http://www.cs.ru.nl/~peterl/teaching/CI/networks/bc.net).

Let us first construct the graph.

### Graph Structure
Here is the graph.

![Alt text](http://g.gravizo.com/g?
  digraph G {
    FibrTissueDev -> Spiculation;
    Age -> BC;
    Location -> BC;
    AD -> FibrTissueDev;
    Spiculation -> Mass;
    BC -> AD;
    BC -> Mass;
    BreastDensity -> Mass;
  }
)

The following code describes the graph structure. Note that the CPDs are absent since we can only get them after training.

In [3]:
def create_gm():
    """
    Creates the Graphical Model with the structure given above.
    """
    # Initialize Graphical Model.
    patients_gm = BayesianModel()

    # Add Nodes. We do not need to understand the meaning of all of these.
    # Mass (level of danger of the disease) takes 3 values: 'No', 'Malign', 'Benign'
    patients_gm.add_node('Mass')

    # BC (breast cancer) takes 3 values: "No" "Invasive" "Insitu"
    patients_gm.add_node('BC')

    # Age takes 4 values: '<35', '35-49', '50-74', '>75'
    patients_gm.add_node('Age')

    # Location takes 4 values: "UpOutQuad" "UpInQuad" "LolwOutQuad" "LowInQuad"
    patients_gm.add_node('Location')

    # Spiculation takes 2 values: "Yes" "No"
    patients_gm.add_node('Spiculation')

    # "Yes" "No"
    patients_gm.add_node('FibrTissueDev')

    # "low" "medium" "high"
    patients_gm.add_node('BreastDensity')

    # "Yes" "No"
    patients_gm.add_node('AD')

    # Now let's add edges
    patients_gm.add_edge('FibrTissueDev', 'Spiculation')
    patients_gm.add_edge('Age', 'BC')
    patients_gm.add_edge('Location', 'BC')
    patients_gm.add_edge('AD', 'FibrTissueDev')
    patients_gm.add_edge('Spiculation', 'Mass')
    patients_gm.add_edge('BC', 'AD')
    patients_gm.add_edge('BC', 'Mass')
    patients_gm.add_edge('BreastDensity', 'Mass')
    return patients_gm

patients_gm = create_gm()

### Loading the data
Now that we know the graph structure, we need to load and clean the data. We are going to remove columns we don't need.
And because pgmpy only makes predictions on numerical values, we will need to convert categories to numbers (e.g. {'yes', 'no', 'maybe'} becomes {0, 1, 2} ).

The following class allows us to load and clean the data. It also allows us to convert categorical data to numerical data and vice versa.

In [4]:
class DataLoader():
    """
    Allows loading and cleaning of training and testing data
    data : cleaned categorical values
    clean_data : cleaned numerical values
    mapper : dict containing mappings from categorical values to numerical values for each column.
    reverse_mapper : dict containing mappings from numerical to categorical values for each column.
    """
    def __init__(self, filename, gm_nodes):
        """
        filename : name of the data file
        gm_nodes : list containing relevant columns
        """
        self.data = pd.read_csv(filename)[gm_nodes]
        self.mapper = dict()
        self.reverse_mapper = dict()
        clean_data = dict()
        for col in self.data:
            clean_data[col] = self._to_numerical(self.data[col], col)
        self.clean_data = pd.DataFrame(clean_data)
        
    
    def _to_numerical(self, data_col, col_name):
        """
        Converts one column to numerical.
        """
        uniqs = data_col.unique()
        mapper = {uniq: i for i, uniq in enumerate(uniqs)}
        reverse_mapper = {i: uniq for uniq, i in mapper.items()}
        self.mapper[col_name] = mapper
        self.reverse_mapper[col_name] = reverse_mapper
        return data_col.map(mapper)
    
    def to_numerical(self, data):
        """
        Converts an entire dataframe to numerical.
        """
        new_data = {}
        for col in data:
            new_data[col] = data[col].map(self.mapper[col])
        return pd.DataFrame(new_data)
    
    def to_categorical(self, data):
        """
        Reverts an entire dataframe back to categorical.
        """
        new_data = {}
        for col in data:
            new_data[col] = data[col].map(self.reverse_mapper[col])
        return pd.DataFrame(new_data)



gm_nodes = ['BreastDensity', 'Location', 'Age', 'BC', 'Mass', 'AD', 
           'FibrTissueDev', 'Spiculation']

cancer_dataloader = DataLoader('bc.csv', gm_nodes)
print(cancer_dataloader.data.head())
print(cancer_dataloader.clean_data.head())


  BreastDensity     Location    Age        BC    Mass  AD FibrTissueDev  \
0          high  LolwOutQuad  35-49        No      No  No            No   
1        medium    UpOutQuad  50-74  Invasive  Benign  No            No   
2           low     UpInQuad  50-74  Invasive  Benign  No           Yes   
3        medium    LowInQuad    >75  Invasive  Malign  No            No   
4          high    LowInQuad    <35        No  Benign  No            No   

  Spiculation  
0          No  
1          No  
2         Yes  
3          No  
4         Yes  
   AD  Age  BC  BreastDensity  FibrTissueDev  Location  Mass  Spiculation
0   0    0   0              0              0         0     0            0
1   0    1   1              1              0         1     1            0
2   0    1   1              2              1         2     1            1
3   0    2   1              1              0         3     2            0
4   0    3   0              0              0         3     1            1


### Training the graphical model
*pgmpy* provides an easy way to train our model in one line, using the formula given above. We are only going to train on 70 percent of the data, and then use the remainder to show you how to make predictions using the tools we have.

In [5]:
def fit_data(gm, df, alpha=1):
    """
    Trains the graphical model gm using data in df and a prior of alpha.
    """
    gm.fit(df, estimator_type=BayesianEstimator, prior_type='BDeu', equivalent_sample_size=alpha)
    
fit_data(patients_gm, cancer_dataloader.clean_data[:int(0.7*cancer_dataloader.clean_data.shape[0])], 1)

# We can now visualize the different CPDs
print(patients_gm.get_cpds('Age'))
print(patients_gm.get_cpds('AD'))
print(patients_gm.get_cpds('Spiculation'))

+--------+----------+
| Age(0) | 0.245072 |
+--------+----------+
| Age(1) | 0.502196 |
+--------+----------+
| Age(2) | 0.149936 |
+--------+----------+
| Age(3) | 0.102796 |
+--------+----------+
+-------+-----------------+----------------+----------------+
| BC    | BC(0)           | BC(1)          | BC(2)          |
+-------+-----------------+----------------+----------------+
| AD(0) | 0.948012291147  | 0.544898371928 | 0.708257937161 |
+-------+-----------------+----------------+----------------+
| AD(1) | 0.0519877088535 | 0.455101628072 | 0.291742062839 |
+-------+-----------------+----------------+----------------+
+----------------+------------------+------------------+
| FibrTissueDev  | FibrTissueDev(0) | FibrTissueDev(1) |
+----------------+------------------+------------------+
| Spiculation(0) | 0.849342552009   | 0.256926205202   |
+----------------+------------------+------------------+
| Spiculation(1) | 0.150657447991   | 0.743073794798   |
+----------------+------------------+------------------+

###  Making Predictions
We now have all the elements required to make new inferences. As pointed out above, we will use part of the remaining 30 percent of the data to show you the full prediction process.
Note how *pgmpy* allows us to easily make inferences once the data is in numerical form.

In [6]:
def make_predictions(trained_gm, dataloader,  new_data, labels):
    """
        Given a trained Bayes Net, the DataLoader that was used during its training,
        new data to make predictions on and the target class labels, make_predictions returns
        a new dataframe containing the predictions made.
    """
    # Convert data to numerical
    clean_new_data = dataloader.to_numerical(new_data)
    # Make prediction
    prediction = trained_gm.predict(clean_new_data)
    # Revert data back to original form
    return dataloader.to_categorical(prediction)


# Select the remaining 30 percent
# Because there is too much data, we only use 500 data points for efficiency reasons
new_data = cancer_dataloader.data[int(0.7*cancer_dataloader.data.shape[0]):][:500]

# Drop target columns. This is how new data is going to be presented
new_data = new_data.drop(['BC', 'Mass'], axis=1)

# Make predictions
prediction = make_predictions(patients_gm, cancer_dataloader,
                              new_data, ['BC', 'Mass'])
print(prediction.head())

             BC    Mass
14000        No      No
14001  Invasive  Malign
14002        No      No
14003  Invasive  Malign
14004        No      No


### Testing the Graphical Model
In a realistic setting, you will need to measure how well your graphical model predicts each target class label. The following code shows you how to do it on the first rows of the training data; an evaluation on held-out rows would give a less optimistic estimate.

In [None]:
def gm_mse(trained_gm, dataloader, labels, num_tests):
    """
    Returns, for each target class label, the fraction of the first num_tests
    rows that our GM misclassifies (an error rate, despite the function's name).
    """
    # Inputs without the target columns, and the true values of the targets
    features = dataloader.clean_data.drop(labels, axis=1)[:num_tests]
    truth = dataloader.clean_data[labels][:num_tests]
    pred = trained_gm.predict(features)
    return {c: float((pred[c] != truth[c]).sum()) / num_tests for c in labels}

print(gm_mse(patients_gm, cancer_dataloader, ['BC', 'Mass'], 100))

{'Mass': 0.32, 'BC': 0.33}
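
Because the rows scored above were also used for training, a held-out measurement is a useful complement. The sketch below shows only the comparison step on two small hypothetical dataframes. In practice `predicted` would come from calling `predict` on rows after the 70 percent mark and `actual` would hold the true labels of those same rows. The helper name `error_rates` is an assumption.

```python
import pandas as pd


def error_rates(predicted, actual, labels):
    """Fraction of rows where the prediction differs from the truth, per label."""
    return {
        label: float((predicted[label].values != actual[label].values).mean())
        for label in labels
    }


# Hypothetical numerical predictions and ground truth for four held-out rows.
predicted = pd.DataFrame({'BC': [0, 1, 0, 1], 'Mass': [0, 2, 0, 1]})
actual = pd.DataFrame({'BC': [0, 1, 1, 1], 'Mass': [0, 2, 0, 2]})

rates = error_rates(predicted, actual, ['BC', 'Mass'])
print(rates)
```

Comparing the underlying arrays with `.values` sidesteps index alignment, so the two dataframes only need to list the same rows in the same order.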


## Summary
As you now see, the whole process can be summarized as follows:
* Specify the graph structure
* Use the **DataLoader** class to load, clean and convert the data.
* Train the Bayes net using the **fit** method
* Test the Bayes Net to make sure we have the correct model
* Make Predictions using the **predict** method (only works on numerical data).



In [None]:
# Specify Graph Structure
gm = create_gm()
gm_labels = ["BC", "Mass"]
# Load and convert the data
gm_nodes = ['BreastDensity', 'Location', 'Age', 'BC', 'Mass', 'AD', 
           'FibrTissueDev', 'Spiculation']
filename = "bc.csv"
dataloader = DataLoader(filename, gm_nodes)
# Train the Bayes Net
fit_data(gm, dataloader.clean_data, 2)
# Test the Bayes Net
gm_mse(gm, dataloader, gm_labels, 100)
# Make Predictions
make_predictions(gm, dataloader, new_data, gm_labels)


## Conclusion
This tutorial gives you both a theoretical understanding of Bayes Networks and a practical way to use them with the help of the **pgmpy** library. From this point, you can either use the knowledge you acquired to build something awesome, or go even deeper into the topic of Graphical Models.