# Mixed Naive Bayes
Naive Bayes classifiers are a set of supervised learning algorithms based on applying Bayes' theorem, with strong independence assumptions between the features given the value of the class variable (hence "naive").
This module implements both categorical (multinoulli) and Gaussian naive Bayes algorithms (hence mixed naive Bayes). This means we are not confined to assuming that every feature (given its respective y) follows the Gaussian distribution; features can also follow the categorical distribution. It is then natural for continuous data to be attributed to the Gaussian distribution and categorical data (nominal or ordinal) to the categorical distribution.
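Concretely, naive Bayes factorises the class posterior over the features, and the only thing that changes across feature types is the per-feature likelihood. Written out (this is the standard mixed decomposition, with $\text{cat}$ and $\text{cont}$ denoting the categorical and continuous feature indices):

$$
P(y \mid \mathbf{x}) \;\propto\; P(y) \prod_{i \in \text{cat}} P(x_i \mid y) \prod_{j \in \text{cont}} \mathcal{N}\!\left(x_j \mid \mu_{jy}, \sigma_{jy}^2\right)
$$

where each categorical likelihood $P(x_i \mid y)$ is estimated from class-conditional counts, and $\mu_{jy}$, $\sigma_{jy}^2$ are the class-conditional mean and variance of continuous feature $j$.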
The motivation for writing this library is that, at the time of writing (Sep 2019), scikit-learn did not have an implementation for a mixed type of naive Bayes.
Update: scikit-learn now has CategoricalNB! But it's still in its infancy.
I like scikit-learn's APIs, so if you use it a lot, this library's `fit`/`predict` interface will feel familiar.
I've also written a tutorial here on naive Bayes if you need to understand a bit more of the math.
- Quick starts
- Performance (Accuracy)
- Performance (Speed)
- API Documentation
- Related work
## Installation

```bash
pip install mixed-naive-bayes
```

or install directly from the repository:

```bash
pip install git+https://github.com/remykarem/mixed-naive-bayes#egg=mixed-naive-bayes
```
## Quick starts

### Example 1: Discrete and continuous data
Below is an example of a dataset with discrete (first 2 columns) and continuous data (last 2 columns). We assume that the discrete features follow a categorical distribution and that the features with continuous data follow a Gaussian distribution. Specify `categorical_features=[0,1]`, then fit and predict as usual.
```python
from mixed_naive_bayes import MixedNB

X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]

clf = MixedNB(categorical_features=[0,1])
clf.fit(X, y)
clf.predict(X)
```
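If you need class probabilities rather than hard labels, the library appears to follow scikit-learn's convention with a `predict_proba` method; treat this snippet as a sketch under that assumption:

```python
# Assumption: predict_proba mirrors scikit-learn's, returning one row
# of class probabilities per sample.
probs = clf.predict_proba(X)
print(probs.shape)  # expected: (5, 2) for 5 samples and 2 classes
```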
NOTE: The module expects the categorical data to be label-encoded accordingly. See the next example for how to do this.
### Example 2: Discrete and continuous data, with label encoding
Below is a similar dataset. However, for this dataset we assume a categorical distribution on the first 3 features and a Gaussian distribution on the last feature. The third feature (index 2), however, has not been label-encoded. We can use scikit-learn's `LabelEncoder` from its preprocessing module to fix this.
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

X = [[0, 0, 180, 75.0],
     [1, 1, 165, 61.5],
     [2, 1, 166, 60.3],
     [1, 1, 173, 68.2],
     [0, 2, 178, 71.0]]
y = [0, 0, 1, 1, 0]

X = np.array(X)
y = np.array(y)

label_encoder = LabelEncoder()
X[:, 2] = label_encoder.fit_transform(X[:, 2])
print(X)
# array([[ 0,  0,  4, 75],
#        [ 1,  1,  0, 61],
#        [ 2,  1,  1, 60],
#        [ 1,  1,  2, 68],
#        [ 0,  2,  3, 71]])
```
Then fit and predict as usual, specifying `categorical_features=[0,1,2]` as the indices of the features we assume to follow the categorical distribution.
```python
from mixed_naive_bayes import MixedNB

clf = MixedNB(categorical_features=[0,1,2])
clf.fit(X, y)
clf.predict(X)
```
### Example 3: Discrete data only
If all features are to be treated as discrete, specify `categorical_features='all'`:
```python
from mixed_naive_bayes import MixedNB

X = [[0, 0],
     [1, 1],
     [1, 0],
     [0, 1],
     [1, 1]]
y = [0, 0, 1, 0, 1]

clf = MixedNB(categorical_features='all')
clf.fit(X, y)
clf.predict(X)
```
NOTE: The module expects the categorical data to be label-encoded accordingly. See the previous example for how to do this.
### Example 4: Continuous data only
If all features are assumed to follow a Gaussian distribution, leave the constructor blank:
```python
from mixed_naive_bayes import MixedNB

X = [[0, 0],
     [1, 1],
     [1, 0],
     [0, 1],
     [1, 1]]
y = [0, 0, 1, 0, 1]

clf = MixedNB()
clf.fit(X, y)
clf.predict(X)
```
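Since `MixedNB()` with no categorical features reduces to a plain Gaussian naive Bayes, a quick sanity check is to compare its predictions against scikit-learn's `GaussianNB`. A minimal sketch (the iris dataset and the agreement metric here are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from mixed_naive_bayes import MixedNB

X, y = load_iris(return_X_y=True)

ours = MixedNB()  # no categorical_features: every feature treated as Gaussian
ours.fit(X, y)

theirs = GaussianNB()
theirs.fit(X, y)

# Both classifiers fit the same model, so their training-set
# predictions should agree almost everywhere.
agreement = np.mean(ours.predict(X) == theirs.predict(X))
print(f"prediction agreement: {agreement:.3f}")
```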
## Requirements

The scikit-learn library is only used to import data, as seen in the examples; the module itself does not require it.

The pytest library is not needed unless you want to run the tests.
## Performance (Accuracy)

Performance across scikit-learn's datasets on classification tasks. A sketch of how to reproduce a comparison like this appears after the legend below.
| Dataset | GaussianNB | MixedNB (G) | MixedNB (C) | MixedNB (C+G) |
| --- | --- | --- | --- | --- |
- GaussianNB - sklearn's API for Gaussian Naive Bayes
- MixedNB (G) - our API for Gaussian Naive Bayes
- MixedNB (C) - our API for Categorical Naive Bayes
- MixedNB (C+G) - our API for Naive Bayes where some features follow the categorical distribution and some follow the Gaussian distribution
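As a rough way to reproduce the Gaussian columns of this comparison, here is a minimal sketch (the dataset list and in-sample accuracy are illustrative; this is not the repository's benchmark script):

```python
import numpy as np
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from mixed_naive_bayes import MixedNB

for loader in (load_iris, load_wine, load_breast_cancer):
    X, y = loader(return_X_y=True)

    sk_clf = GaussianNB()
    sk_clf.fit(X, y)

    our_clf = MixedNB()  # all-Gaussian configuration, i.e. the MixedNB (G) column
    our_clf.fit(X, y)

    sk_acc = np.mean(sk_clf.predict(X) == y)
    our_acc = np.mean(our_clf.predict(X) == y)
    print(f"{loader.__name__:18} GaussianNB={sk_acc:.3f} MixedNB(G)={our_acc:.3f}")
```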
## Performance (Speed)

The library is written in NumPy, so many operations are vectorised and faster than their for-loop counterparts. Fun fact: my first prototype (with many for-loops) was 8 times slower than scikit-learn's implementation.
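To illustrate what vectorisation buys here, below is an illustrative comparison of a for-loop versus a broadcast implementation of the Gaussian log-likelihood, the kind of computation a Gaussian naive Bayes performs per class (this is not the library's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))      # samples x features
mu = rng.normal(size=20)               # per-feature means for one class
var = rng.uniform(0.5, 2.0, size=20)   # per-feature variances for one class

def log_likelihood_loop(X, mu, var):
    # Gaussian log-density accumulated feature by feature, sample by sample
    out = np.zeros(len(X))
    for i in range(len(X)):
        for j in range(X.shape[1]):
            out[i] += (-0.5 * np.log(2 * np.pi * var[j])
                       - (X[i, j] - mu[j]) ** 2 / (2 * var[j]))
    return out

def log_likelihood_vec(X, mu, var):
    # Same computation with broadcasting: one pass over the whole matrix
    return np.sum(-0.5 * np.log(2 * np.pi * var)
                  - (X - mu) ** 2 / (2 * var), axis=1)

# Both implementations agree; the vectorised one is dramatically faster.
assert np.allclose(log_likelihood_loop(X, mu, var),
                   log_likelihood_vec(X, mu, var))
```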
## Tests

I'm still writing more test cases, but in the meantime, the existing tests cover the following (see below for how to run them):
- Accuracy against existing library (sklearn)
- Input type checking
- Example inputs
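Assuming pytest is installed (see Requirements above) and you are at the repository root, the test suite can be run with:

```bash
pip install pytest
pytest
```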
## API Documentation

For more information on the usage of the API, visit here. The documentation was generated using pdoc3.
## To-Dos

- Performance comparison
- Change to F-contiguous arrays?
- Write more test cases
- Support refitting
- Regulariser for categorical distribution
- Variance smoothing for Gaussian distribution
- Vectorised main operations using NumPy
- Masking in NumPy
- Support label encoding
## Related work

- Categorical naive Bayes by scikit-learn
- Naive Bayes classifier for categorical and numerical data
- Generalised naive Bayes classifier
## Contributing

Please submit your pull requests. I'll appreciate it a lot!