# Naive Bayes Classifier

This notebook trains and evaluates a Naive Bayes classifier on the following datasets:
* UCI mushroom dataset [1]
* UCI congressional voting records dataset [2]
* UCI Breast Cancer Wisconsin (Original) dataset [3]

The classifier is evaluated with k-fold cross validation, using the error rate (the fraction of misclassified test instances) as the metric.
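As a rough illustration of this evaluation scheme, the sketch below shuffles the rows, splits them into k folds, and records the error rate on each held-out fold. It is not the code behind `NaiveBayes.cross_validate`, whose internals are not shown here. The function `k_fold_error_rates`, the `fit`/`predict` callables, the majority-class baseline, and the toy data are all assumptions made for the example; the only convention borrowed from the notebook is that the class label sits in column 0.

```python
import numpy as np

def k_fold_error_rates(data, k, fit, predict, seed=0):
    """Return one error rate per fold; the label is assumed to be column 0.

    `fit(train_rows)` returns a model and `predict(model, features)` returns
    predicted labels. Both callables are hypothetical and supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), k)
    rates = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(data[train_idx])
        predicted = predict(model, data[test_idx, 1:])
        actual = data[test_idx, 0]
        # Error rate = fraction of held-out rows that were misclassified.
        rates.append(float(np.mean(predicted != actual)))
    return rates

# Toy usage with a majority-class baseline standing in for the classifier.
def fit_majority(train):
    labels, counts = np.unique(train[:, 0], return_counts=True)
    return labels[np.argmax(counts)]

def predict_majority(model, features):
    return np.full(len(features), model)

toy = np.array([["e", "x"], ["e", "y"], ["p", "x"],
                ["e", "x"], ["e", "y"], ["p", "y"]])
rates = k_fold_error_rates(toy, 3, fit_majority, predict_majority)
print(rates, np.mean(rates))
```

Passing the model as a pair of callables keeps the fold logic independent of any particular classifier, so the same loop could wrap a Naive Bayes implementation.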

In [1]:
import NaiveBayes
import numpy as np

## Mushroom Dataset

The model will predict whether a mushroom is edible or poisonous based on 22 categorical features.
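To make the prediction step concrete, here is a minimal sketch of a Naive Bayes classifier for categorical features. It is not the `NaiveBayes` module imported above, which is not shown in this notebook. The class name `CategoricalNB`, the use of add-one (Laplace) smoothing, and the toy rows are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

class CategoricalNB:
    """Naive Bayes for categorical features with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        n_features = len(X[0])
        # value_counts[j][c][v]: number of class-c rows with value v in feature j.
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        # Distinct values seen per feature, used in the smoothing denominator.
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                self.value_counts[j][label][v] += 1
                self.values[j].add(v)
        return self

    def predict_one(self, row):
        best, best_score = None, -math.inf
        for c in self.classes:
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(row):
                count = self.value_counts[j][c][v]
                denom = self.class_counts[c] + len(self.values[j])
                score += math.log((count + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Toy rows with single-letter codes, loosely styled after the mushroom data.
X = [["x", "s"], ["x", "y"], ["b", "s"], ["b", "y"], ["x", "s"]]
y = ["p", "p", "e", "e", "p"]
model = CategoricalNB().fit(X, y)
print(model.predict_one(["b", "s"]), model.predict_one(["x", "s"]))
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow when there are many features, and the smoothing term keeps a feature value that never occurred with a class from forcing that class's probability to zero.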

10-fold cross validation will be used. 

In [2]:
# Load every field as a string; the class label is the first column.
dataset_1 = np.genfromtxt("data/mushroom/agaricus-lepiota.data", delimiter=",", dtype=str)
# Drop rows that contain a missing value, encoded as '?'.
dataset_1 = dataset_1[np.all(dataset_1 != '?', axis=1)]
print("Starting cross validation:\n")
NaiveBayes.cross_validate(dataset_1, False, 10)

Starting cross validation:

Fold 1 error rate: 0.0

Fold 2 error rate: 0.0

Fold 3 error rate: 0.0

Fold 4 error rate: 0.008849557522123894

Fold 5 error rate: 0.0

Fold 6 error rate: 0.0

Fold 7 error rate: 0.017699115044247787

Fold 8 error rate: 0.02654867256637168

Fold 9 error rate: 0.0

Fold 10 error rate: 0.0

Mean: 0.005309734513274336
Variance: 9.049695007874973e-05
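
The summary figures can be checked directly against the fold error rates printed above. The sketch below copies those ten values and recomputes the mean and variance with NumPy. The reported variance matches the sample variance (`ddof=1`) rather than the population variance; this is inferred from the numbers alone, not from the `NaiveBayes` source.

```python
import numpy as np

# Fold error rates copied from the mushroom cross validation output.
fold_errors = np.array([0.0, 0.0, 0.0, 0.008849557522123894, 0.0, 0.0,
                        0.017699115044247787, 0.02654867256637168, 0.0, 0.0])

mean = fold_errors.mean()
sample_variance = fold_errors.var(ddof=1)      # divides by n - 1
population_variance = fold_errors.var(ddof=0)  # divides by n

print("Mean:", mean)
print("Sample variance:", sample_variance)
print("Population variance:", population_variance)
```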



## Congressional Voting Records Dataset

The model will predict party affiliation (Democrat or Republican) based on House of Representatives votes on 16 key issues.

10-fold cross validation will be used. 

In [3]:
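# Load every field as a string; the class label (party) is the first column.
dataset_2 = np.genfromtxt("data/congressional+voting+records/house-votes-84.data",
                          delimiter=",", dtype=str)
# Drop rows that contain a '?' entry (no recorded yea/nay vote).
dataset_2 = dataset_2[np.all(dataset_2 != '?', axis=1)]
print("Starting cross validation:\n")
NaiveBayes.cross_validate(dataset_2, False, 10)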

Starting cross validation:

Fold 1 error rate: 0.2

Fold 2 error rate: 0.2

Fold 3 error rate: 0.2

Fold 4 error rate: 0.2

Fold 5 error rate: 0.2

Fold 6 error rate: 0.0

Fold 7 error rate: 0.2

Fold 8 error rate: 0.0

Fold 9 error rate: 0.0

Fold 10 error rate: 0.2

Mean: 0.13999999999999999
Variance: 0.009333333333333334



## Breast Cancer Dataset

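The model will predict whether a tumor is benign or malignant based on the cytological attributes in the data file: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.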

5-fold cross validation will be used. 

In [4]:
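# Load every field as a string.
dataset_3 = np.genfromtxt("data/wisconsin_breast_cancer/breast-cancer-wisconsin.data",
                          delimiter=",", dtype=str)
# Swap columns 0 and 10 so the class label (last column in the file) comes first
# and the sample ID moves to the last column.
dataset_3[:, [0, 10]] = dataset_3[:, [10, 0]]
print("Starting cross validation:\n")
NaiveBayes.cross_validate(dataset_3, False, 5)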

Starting cross validation:

Fold 1 error rate: 0.0

Fold 2 error rate: 0.14285714285714285

Fold 3 error rate: 0.10714285714285714

Fold 4 error rate: 0.07142857142857142

Fold 5 error rate: 0.14285714285714285

Mean: 0.09285714285714285
Variance: 0.0035714285714285713



Citations:

[1] Mushroom. (1987). UCI Machine Learning Repository. https://doi.org/10.24432/C5959T.

[2] Congressional Voting Records. (1987). UCI Machine Learning Repository. https://doi.org/10.24432/C5C01P.

[3] Wolberg, William. (1992). Breast Cancer Wisconsin (Original). UCI Machine Learning Repository. https://doi.org/10.24432/C5HP4Z.
