<a href="https://colab.research.google.com/github/gmprovan/CS4705/blob/main/Assignment_2_Learning_Bayesian_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<style>
    .cont {
        display: flex;
        justify-content: center;
        flex-direction: column;
    }

    .cont span {
        margin:auto;
    }
    .name {
        display:inline-block;
        background-color: green;
        padding: 10px;
        border-radius: 20px;
        margin:auto;
    }
    .class {
        display:inline-block;
        background-color: purple;
        padding: 10px;
        border-radius: 20px;
        margin:auto;
    }
    .assignment {
        display:inline-block;
        background-color: white;
        color: black;
        font-weight: bolder;
        font-size: 60px;
        padding: 10px;
    }
</style>

<div class="cont">
    <span><span class="assignment">Assignment 2 -</span><span class="assignment">Probabilistic Model Learning</span></span>
    <br>
    <span><span class="name">Víctor Carnicero Príncipe</span><span class="name">123123895</span></span>
    <br>
    <span><span class="class">COMPUTATIONAL MACHINE LEARNING</span><span class="class">CS4705</span></span>
</div>


In [67]:
import pandas as pd
import numpy as np
from pgmpy.estimators import K2Score, BicScore
from pgmpy.models import BayesianNetwork

In [145]:
# Load data
data = pd.read_csv('./data/breast-cancer.data',
                   names=['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat'])

In [146]:
# Change ? to NaN
data[data == "?"] = np.nan

In [147]:
# remove problematic data points
data = data[data["age"] != "20-29"]
data = data[data["inv-nodes"] != "24-25"]

# Train models

In [148]:
from sklearn.model_selection import train_test_split

# separate train and test data
dtrain, dtest = train_test_split(data.fillna(data.mode().iloc[0]), test_size=0.2, random_state=1)

predictions = {"nb": {}, "t": {}, "bn": {}}

## Naive Bayes model:

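For the Naive Bayes model I estimate the parameters with the maximum likelihood estimator, so every conditional probability is the relative frequency observed in the training data.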
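As a minimal sketch of what maximum likelihood estimation means for a Naive Bayes structure, the snippet below computes the class prior and one conditional table as plain relative frequencies. The rows and variable names are hypothetical toy values, not taken from the breast cancer data, and this is not how pgmpy implements the estimator internally.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (class label, value of one feature).
rows = [
    ("no-recurrence", "no"),
    ("no-recurrence", "no"),
    ("no-recurrence", "yes"),
    ("recurrence", "yes"),
]

# P(Class): relative frequency of each class label.
class_counts = Counter(label for label, _ in rows)
prior = {label: n / len(rows) for label, n in class_counts.items()}

# P(feature | Class): relative frequency of each feature value within a class.
pair_counts = defaultdict(Counter)
for label, value in rows:
    pair_counts[label][value] += 1
likelihood = {
    label: {value: n / class_counts[label] for value, n in counts.items()}
    for label, counts in pair_counts.items()
}

print(prior)
print(likelihood)
```

A value that never occurs together with a class gets no entry at all here (probability zero), which is the usual weakness of pure maximum likelihood on small data sets.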

In [149]:
from pgmpy.models.NaiveBayes import NaiveBayes
from pgmpy.estimators import MaximumLikelihoodEstimator

# create the structure manually to create model_struct
model_nb = BayesianNetwork(NaiveBayes(data.columns[1:], data.columns[0]).edges())
#model_nb = NaiveBayes(data.columns[1:], data.columns[0])

model_nb.fit(dtrain, estimator=MaximumLikelihoodEstimator)

print(model_nb.get_cpds('Class'))

# print the score
print("test:\n", "K2:", K2Score(dtest).score(model_nb), "| Bic:", BicScore(dtest).score(model_nb))
print("train:\n", "K2:", K2Score(dtrain).score(model_nb), "| Bic:", BicScore(dtrain).score(model_nb))
print("Test accuracy:\t", test:= (model_nb.predict(dtest[data.columns[1:]]).values == dtest[["Class"]].values).sum()/len(dtest))
print("Train accuracy:\t", train:= (model_nb.predict(dtrain[data.columns[1:]]).values == dtrain[["Class"]].values).sum()/len(dtrain))
predictions["nb"]["test"] = test
predictions["nb"]["train"] = train

+-----------------------------+----------+
| Class(no-recurrence-events) | 0.688596 |
+-----------------------------+----------+
| Class(recurrence-events)    | 0.311404 |
+-----------------------------+----------+
test:
 K2: -613.0600117480078 | Bic: -655.5336891240779
train:
 K2: -2331.4476306110555 | Bic: -2380.6982978994574


100%|██████████| 57/57 [00:00<00:00, 2044.46it/s]


Test accuracy:	 0.7017543859649122


100%|██████████| 215/215 [00:00<00:00, 2100.59it/s]

Train accuracy:	 0.7719298245614035





The model reaches an acceptable accuracy of about `70%` on the test set. Training accuracy is about `77%`, so there is a gap of roughly `7` percentage points between the two, which probably indicates some overfitting to the training data.

Naive Bayes assumes that all features are independent given the class, so more expressive networks may do better. I expect the following models to improve on this one.

## Tree-structured model:

In [150]:
from pgmpy.estimators import TreeSearch

# learn graph structure using TreeSearch and Chow-Liu algorithm
est = TreeSearch(dtrain, root_node="Class")
dag = est.estimate(estimator_type="chow-liu")

from pgmpy.estimators import BayesianEstimator

# there are many choices of parametrization, here is one example
model_t = BayesianNetwork(dag.edges())
model_t.fit(data, estimator=BayesianEstimator, prior_type="dirichlet", pseudo_counts=0.1)

Building tree: 100%|██████████| 45/45.0 [00:00<00:00, 4085.63it/s]
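The tree search above can be illustrated with a small sketch of the Chow-Liu idea, assuming the standard procedure: weight every pair of variables by their empirical mutual information, keep a maximum-weight spanning tree, and orient its edges away from the chosen root. The functions `mutual_information` and `chow_liu_edges` and the toy columns are illustrative names, not part of pgmpy.

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def chow_liu_edges(columns, root):
    """Return directed edges of a maximum-MI spanning tree rooted at `root`."""
    # Weight every pair of variables by mutual information, heaviest first.
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(columns, 2)),
        reverse=True,
    )
    # Kruskal's algorithm with a simple union-find to avoid cycles.
    parent = {name: name for name in columns}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    neighbours = {name: [] for name in columns}
    for _, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            neighbours[a].append(b)
            neighbours[b].append(a)
    # Orient the undirected tree away from the root (breadth-first).
    edges, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in seen:
                seen.add(other)
                edges.append((node, other))
                queue.append(other)
    return edges

# Hypothetical toy columns: A is a noisy copy of Class, B a noisy copy of A.
toy = {
    "Class": [0, 0, 0, 0, 1, 1, 1, 1],
    "A":     [0, 0, 0, 1, 1, 1, 1, 1],
    "B":     [0, 0, 1, 1, 1, 1, 1, 1],
}
tree_edges = chow_liu_edges(toy, root="Class")
print(tree_edges)
```

Because every node ends up with at most one parent, the resulting conditional tables stay small, which is the main appeal of a tree structure on a data set of this size.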


In [151]:
# print the score
print("test:\n", "K2:", K2Score(dtest).score(model_t), "| Bic:", BicScore(dtest).score(model_t))
print("train:\n", "K2:", K2Score(dtrain).score(model_t), "| Bic:", BicScore(dtrain).score(model_t))
# print the accuracy
print("test: ", test:=((model_t.predict(dtest[data.columns[1:]]).values == dtest[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtest)))
print("train: ", train:=((model_t.predict(dtrain[data.columns[1:]]).values == dtrain[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtrain)))
predictions["t"]["test"] = test
predictions["t"]["train"] = train

test:
 K2: -602.3680239908574 | Bic: -800.6569955287398
train:
 K2: -2236.4249916106974 | Bic: -2546.7317696896125


100%|██████████| 57/57 [00:00<00:00, 5176.14it/s]


test:  0.7543859649122807


100%|██████████| 215/215 [00:00<00:00, 2671.36it/s]


train:  0.7105263157894737


Comparing the scores of the two models, the tree-structured model has the better (higher) K2 score on both splits, while Naive Bayes has the better BIC score. The tree reaches about `75%` accuracy on the test set and `71%` on the training set. The two values are close, which suggests that the model generalizes reasonably well and does not overfit the training data. Note, however, that the parameters were fitted on the full `data` frame rather than on `dtrain`, so the test figure may be optimistic.

The accuracy is acceptable, although the method in the next section might improve on it.

Additionally, I report the predictive accuracy for every other variable. Because this is a Bayesian network, the same model can predict any missing variable from all the others.

In [75]:
preds = {}
for col in data:
    preds[col] = [((model_t.predict(dtest.drop(columns=[col])).values == dtest[[col]].reset_index(drop=True).values).sum() / len(dtest)),
                  ((model_t.predict(dtrain.drop(columns=[col])).values == dtrain[[col]].reset_index(drop=True).values).sum() / len(dtrain))]


In [76]:
print("test", "|", "train")
for col in preds:
    print(col, preds[col])

test | train
Class [0.7543859649122807, 0.7105263157894737]
age [0.5263157894736842, 0.5131578947368421]
menopause [0.8947368421052632, 0.8114035087719298]
tumor-size [0.24561403508771928, 0.2850877192982456]
inv-nodes [0.8070175438596491, 0.8070175438596491]
node-caps [0.8421052631578947, 0.8859649122807017]
deg-malig [0.543859649122807, 0.5131578947368421]
breast [0.6491228070175439, 0.6754385964912281]
breast-quad [0.45614035087719296, 0.5263157894736842]
irradiat [0.8245614035087719, 0.7850877192982456]


Some variables are predicted very poorly, such as `tumor-size`, `deg-malig`, `breast-quad` and `age`, whereas others are predicted even better than `Class`. It is very difficult to obtain a good score for every variable, but it is useful to know that the model can predict other variables when their values are missing.

This can also help us understand which variables matter to the model. We could go on to try different subsets of the variables in search of a better model.

The following approach takes this into account: by learning the structure freely, it may extract deeper relationships between the variables.

## Bayesian Network model:

### Learn Structure:

To learn the structure I use hill climbing. Because the search space of possible graphs is very large, this greedy approach is only guaranteed to reach a local optimum of the score. I first start from a graph with no edges, and then repeat the search starting from the structures used earlier.
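The greedy search described here can be sketched as follows. This is a simplified illustration, not pgmpy's implementation: it only tries adding or removing one edge at a time (a fuller search would typically also try reversing edges), it scores graphs with a hand-written BIC, and the helper names and toy rows are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import permutations

def bic_local(data, node, parents):
    """BIC contribution of one node: log-likelihood minus a size penalty."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(c * math.log(c / parent_counts[cfg]) for (cfg, _), c in joint.items())
    states = len({r[node] for r in data})
    parent_cfgs = math.prod(len({r[p] for r in data}) for p in parents)
    return loglik - 0.5 * math.log(n) * (states - 1) * parent_cfgs

def bic(data, nodes, edges):
    """Total BIC of a graph: the sum of the per-node contributions."""
    return sum(bic_local(data, v, sorted(a for a, b in edges if b == v)) for v in nodes)

def has_cycle(nodes, edges):
    """Detect a directed cycle by repeatedly removing nodes with no incoming edge."""
    remaining, live = set(nodes), set(edges)
    while remaining:
        roots = [v for v in remaining if not any(b == v for _, b in live)]
        if not roots:
            return True
        remaining -= set(roots)
        live = {(a, b) for a, b in live if a in remaining}
    return False

def hill_climb(data, nodes, start=frozenset()):
    """Apply the best single-edge change until no change improves the score."""
    edges, best = set(start), bic(data, nodes, start)
    while True:
        candidates = []
        for a, b in permutations(nodes, 2):
            trial = edges ^ {(a, b)}  # toggle: add if absent, remove if present
            if not has_cycle(nodes, trial):
                candidates.append((bic(data, nodes, trial), sorted(trial)))
        score, trial = max(candidates)
        if score <= best + 1e-9:  # no improving neighbour: local optimum
            return edges, best
        edges, best = set(trial), score

# Toy data (hypothetical): Y copies X, Z is independent of both.
rows = [{"X": i % 2, "Y": i % 2, "Z": (i // 2) % 2} for i in range(40)]
learned, learned_score = hill_climb(rows, ["X", "Y", "Z"])
print(learned, round(learned_score, 2))
```

On the toy rows the search links `X` and `Y` and leaves `Z` isolated. The direction of the single edge is arbitrary because both orientations score the same, which is also why different starting graphs can end in different but equally scored structures.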

In [153]:
from pgmpy.estimators import HillClimbSearch, BicScore, K2Score, BDeuScore, ExpectationMaximization

est = HillClimbSearch(dtrain)

# estimate the DAG
model_bn = BayesianNetwork(est.estimate(scoring_method=BicScore(dtrain), show_progress=False).edges()) 
print(model_bn.edges())

model_bn.fit(dtrain, estimator=BayesianEstimator, prior_type="dirichlet", pseudo_counts=0.1)

# print the score
print("test:\n", "K2:", K2Score(dtest).score(model_bn), "| Bic:", BicScore(dtest).score(model_bn))
print("train:\n", "K2:", K2Score(dtrain).score(model_bn), "| Bic:", BicScore(dtrain).score(model_bn))

# keep only the columns that appear in the learned network before predicting
print("test:", test:=(model_bn.predict(dtest[list(model_bn.nodes())].drop(columns=["Class"])).values == dtest[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtest))
print("train:", train:=(model_bn.predict(dtrain[list(model_bn.nodes())].drop(columns=["Class"])).values == dtrain[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtrain))
predictions["bn"]["test"] = test
predictions["bn"]["train"] = train

[('menopause', 'age'), ('node-caps', 'inv-nodes'), ('node-caps', 'deg-malig'), ('deg-malig', 'Class'), ('breast', 'breast-quad'), ('irradiat', 'node-caps')]
test:
 K2: -455.10833332994423 | Bic: -479.57666873007065
train:
 K2: -1701.8622890331535 | Bic: -1729.0220587763633


100%|██████████| 52/52 [00:00<00:00, 4727.41it/s]


test: 0.7543859649122807


100%|██████████| 167/167 [00:00<00:00, 2592.39it/s]

train: 0.7105263157894737





This classifier reaches the same accuracy as the previous one. Both its `K2` and `Bic` scores are higher (less negative) than those of the tree, but the learned network does not contain `tumor-size`, so its scores cover only nine of the ten variables and are not directly comparable.

As before, the test and training accuracies are close (about `75%` and `71%`), so there is no sign of overfitting.

Next I start from the Naive Bayes structure and run hill climbing with the BDeu score to look for a better Bayesian network.

In [78]:
from pgmpy.estimators import HillClimbSearch, BicScore, K2Score, BDeuScore, ExpectationMaximization

model_bn2 = BayesianNetwork(est.estimate(start_dag=BayesianNetwork(NaiveBayes(data.columns[1:], data.columns[0]).edges()), scoring_method=BDeuScore(dtrain), show_progress=False).edges()) 
print(model_bn2.edges())

model_bn2.fit(dtrain, estimator=BayesianEstimator, prior_type="dirichlet", pseudo_counts=0.1)

# print the score
print("test:\n", "K2:", K2Score(dtest).score(model_bn2), "| Bic:", BicScore(dtest).score(model_bn2))
print("train:\n", "K2:", K2Score(dtrain).score(model_bn2), "| Bic:", BicScore(dtrain).score(model_bn2))

# test accuracy, then train accuracy
print((model_bn2.predict(dtest[list(model_bn2.nodes())].drop(columns=["Class"])).values == dtest[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtest))
print((model_bn2.predict(dtrain[list(model_bn2.nodes())].drop(columns=["Class"])).values == dtrain[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtrain))

[('Class', 'inv-nodes'), ('Class', 'deg-malig'), ('inv-nodes', 'node-caps'), ('inv-nodes', 'irradiat'), ('node-caps', 'deg-malig'), ('breast', 'node-caps'), ('breast-quad', 'breast'), ('age', 'menopause')]
test:
 K2: -451.67936080665385 | Bic: -490.41432567076083
train:
 K2: -1703.3800204008173 | Bic: -1759.744048841769


100%|██████████| 52/52 [00:00<00:00, 1218.04it/s]


0.8070175438596491


100%|██████████| 167/167 [00:00<00:00, 980.95it/s]


0.7368421052631579


I do not know why this model performs so well on the test data. It is probably due to chance, since the test set is small. Accuracy is about `7` percentage points lower on the training data than on the test data, which is unusual. For this reason I prefer the previous model, whose two accuracies are closer together.

Next I start from the tree-structured classifier and use the K2 score:

In [79]:
from pgmpy.estimators import HillClimbSearch, BicScore, K2Score, BDeuScore, ExpectationMaximization

model_bn3 = BayesianNetwork(est.estimate(start_dag=BayesianNetwork(dag.edges()), scoring_method=K2Score(dtrain), show_progress=False).edges()) 
print(model_bn3.edges())

model_bn3.fit(dtrain, estimator=BayesianEstimator, prior_type="dirichlet", pseudo_counts=0.1)

# print the score
print("test:\n", "K2:", K2Score(dtest).score(model_bn3), "| Bic:", BicScore(dtest).score(model_bn3))
print("train:\n", "K2:", K2Score(dtrain).score(model_bn3), "| Bic:", BicScore(dtrain).score(model_bn3))

# test accuracy, then train accuracy
print((model_bn3.predict(dtest[list(model_bn3.nodes())].drop(columns=["Class"])).values == dtest[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtest))
print((model_bn3.predict(dtrain[list(model_bn3.nodes())].drop(columns=["Class"])).values == dtrain[[data.columns[0]]].reset_index(drop=True).values).sum() / len(dtrain))

[('Class', 'deg-malig'), ('Class', 'tumor-size'), ('Class', 'age'), ('Class', 'irradiat'), ('deg-malig', 'tumor-size'), ('deg-malig', 'age'), ('age', 'tumor-size'), ('irradiat', 'tumor-size'), ('irradiat', 'deg-malig'), ('irradiat', 'age'), ('inv-nodes', 'node-caps'), ('inv-nodes', 'tumor-size'), ('inv-nodes', 'irradiat'), ('inv-nodes', 'deg-malig'), ('inv-nodes', 'age'), ('node-caps', 'tumor-size'), ('node-caps', 'deg-malig'), ('node-caps', 'Class'), ('node-caps', 'age'), ('node-caps', 'irradiat'), ('breast-quad', 'breast'), ('breast-quad', 'tumor-size'), ('breast-quad', 'deg-malig'), ('breast-quad', 'age'), ('breast', 'tumor-size'), ('breast', 'deg-malig'), ('breast', 'age'), ('breast', 'node-caps'), ('breast', 'Class'), ('menopause', 'tumor-size'), ('menopause', 'deg-malig'), ('menopause', 'age')]
test:
 K2: 241242.85360748318 | Bic: -361847.7775965159
train:
 K2: 392162.6668773573 | Bic: -749493.0765869628


100%|██████████| 57/57 [00:00<00:00, 91.51it/s] 


0.6140350877192983


100%|██████████| 215/215 [00:03<00:00, 55.95it/s]


0.9824561403508771


In this case we can clearly see overfitting to the training data: the model performs extremely well on it (`98%`) and poorly on the test data (`61%`). For this reason I discard this model and prefer the first network learned above.
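One way to see why such a dense network overfits is to count the free parameters of a single conditional table. In the structure printed above, `tumor-size` has nine parents. The sketch below uses illustrative cardinalities (assumed for the example, not read from the data set) to show how the count grows with the number of parents.

```python
from math import prod

def free_parameters(child_states, parent_states):
    """Independent probabilities in one CPD: (r - 1) per parent configuration."""
    return (child_states - 1) * prod(parent_states)

# Illustrative cardinalities only (assumed, not read from the data set).
few_parents = free_parameters(child_states=10, parent_states=[2])
many_parents = free_parameters(child_states=10, parent_states=[2, 3, 6, 2, 7, 2, 5, 2, 3])
print(few_parents, many_parents)
```

With only a few hundred training rows, almost every parent configuration of such a table is seen at most once, so the fitted probabilities essentially memorize the training rows.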

## Conclusions

**Naive Bayes:**

|      |         K2         |        Bic        |      Accuracy      |
|------|--------------------|-------------------|--------------------|
| test | -613.0600117480078 | -655.5336891240779| 0.7017543859649122 |
| train| -2331.4476306110555|-2380.6982978994574| 0.7719298245614035 |

**Tree Structured Bayesian Classifier:**

|      |         K2         |        Bic        |      Accuracy      |
|------|--------------------|-------------------|--------------------|
| test | -602.3680239908574 | -800.6569955287398| 0.7543859649122807 |
| train| -2236.4249916106974|-2546.7317696896125| 0.7105263157894737 |

**Bayesian Network:**

|      |         K2         |        Bic        |      Accuracy      |
|------|--------------------|-------------------|--------------------|
| test | -455.1083333299442 |-479.57666873007065| 0.7543859649122807 |
| train| -1701.8622890331535|-1729.0220587763633| 0.7105263157894737 |


From these results, the tree-structured model and the Bayesian network perform best: both reach about `75%` test accuracy, compared with about `70%` for Naive Bayes.

I tried to run McNemar's test (5x2 cross-validation), but after many attempts I was not able to get it to work.

Instead, I decided to compute a p-value based on an F-statistic.

In [162]:
# import stats
from scipy import stats

def get_pvalue(model1, model2, data):
    # ratio of the two test accuracies, treated as an F-statistic
    f_value = predictions[model1]["test"] / predictions[model2]["test"]
    # same degrees of freedom for numerator and denominator
    df = data.shape[0] - data.shape[1] - 1
    # CDF of the F distribution at f_value (about 0.5 when the accuracies are equal)
    p_value = stats.f.cdf(f_value, df, df)
    return p_value

def accept_or_reject(p_value):
    # accept the null hypothesis unless the value falls in either 10% tail
    if 0.1 < p_value < 0.9:
        return "Accept"
    else:
        return "Reject"

print("p-value for nb vs t:", p:=get_pvalue("nb", "t", data))
print("Hypothesis test for nb vs t:", accept_or_reject(p))
print("p-value for nb vs bn:", p:=get_pvalue("nb", "bn", data))
print("Hypothesis test for nb vs bn:", accept_or_reject(p))
print("p-value for t vs bn:", p:=get_pvalue("t", "bn", data))
print("Hypothesis test for t vs bn:", accept_or_reject(p))

p-value for nb vs t: 0.2749369765547082
Hypothesis test for nb vs t: Accept
p-value for nb vs bn: 0.2749369765547082
Hypothesis test for nb vs bn: Accept
p-value for t vs bn: 0.500000000000054
Hypothesis test for t vs bn: Accept


We accept all the null hypotheses, which means there is no evidence that these models perform significantly differently.
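For reference, the McNemar test mentioned earlier could be sketched as below for a single test split. This is a minimal exact (binomial) version under the usual assumption that both models are evaluated on the same rows. It does not include the 5x2 cross-validation procedure, and the function name and toy predictions are hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions; returns (b, c, p_value)."""
    # b: A right and B wrong; c: A wrong and B right. Agreements carry no information.
    b = sum(pa == y and pb != y for y, pa, pb in zip(y_true, pred_a, pred_b))
    c = sum(pa != y and pb == y for y, pa, pb in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each disagreement is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical predictions of two models on ten test rows.
y_true = [1] * 10
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p)
```

The test only looks at the rows where the two models disagree, so two models with identical predictions (as may be the case for the tree and the Bayesian network here) would give a p-value of 1.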

For the tree-structured model and the Bayesian network I obtain the same accuracy, so I cannot say which is better, and as expected the p-value is 0.5. I am also not sure how to interpret the K2 and Bic scores. pgmpy reports both as log-scores, so for both the aim is a higher (less negative) value. Since both scores change in the same direction between the two models, and the Bayesian network is scored on fewer variables, I cannot really tell which is better from them.
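As a reading aid for the two scores, here are their textbook definitions, assuming pgmpy follows them. $N_{ijk}$ is the number of rows where variable $i$ takes its $k$-th state and its parents take their $j$-th configuration, $N_{ij} = \sum_k N_{ijk}$, $r_i$ is the number of states of variable $i$, $q_i$ the number of parent configurations, and $N$ the number of rows.

$$
\mathrm{BIC}(G \mid D) = \sum_{i} \left[ \sum_{j,k} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} \;-\; \frac{\log N}{2}\, q_i \,(r_i - 1) \right]
$$

$$
\mathrm{K2}(G \mid D) = \sum_{i} \sum_{j} \left[ \log \frac{\Gamma(r_i)}{\Gamma(N_{ij} + r_i)} + \sum_{k} \log \Gamma(N_{ijk} + 1) \right]
$$

Both are sums of log-terms over the variables in the graph, so both grow more negative with more rows and more variables. This is why the training scores are far lower than the test scores, and why scores are only comparable between models evaluated on the same rows and the same set of variables.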