## Multinomial Naive Bayes

Multinomial Naive Bayes is a variant of the Naive Bayes algorithm designed for discrete data. It is widely used for text tasks such as document classification, spam detection, and sentiment analysis. It is particularly well suited to inputs whose features are counts or frequencies of discrete events, such as word counts in a document.
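
To make the word-count setting concrete, here is a minimal sketch on a made-up corpus. The four-word vocabulary, the counts, and the `spam`/`ham` labels are all hypothetical and are not part of this notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: one row per document, one column per word
# in the illustrative vocabulary ["free", "offer", "meeting", "report"].
X_counts = np.array([
    [3, 2, 0, 0],  # spam
    [2, 3, 0, 1],  # spam
    [0, 0, 2, 3],  # ham
    [0, 1, 3, 2],  # ham
])
y_labels = np.array(["spam", "spam", "ham", "ham"])

# alpha=1.0 (the default) applies Laplace smoothing to the per-class word counts
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(X_counts, y_labels)

# A new document that mostly contains "free" and "offer"
new_doc = np.array([[2, 1, 0, 0]])
toy_prediction = toy_model.predict(new_doc)
print(toy_prediction)
```

The model scores each class by combining the class prior with the smoothed per-class word frequencies, so a document dominated by words that are frequent in one class is assigned to that class.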

## 1.1 Dataset

The Iris dataset is one of the most widely used datasets in machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936. The dataset contains measurements for three species of iris flowers:

- Setosa
- Versicolor
- Virginica

For each flower, four features are recorded (in centimetres):

- Sepal Length: length of the flower's sepal (the outermost whorl of the flower).
- Sepal Width: width of the flower's sepal.
- Petal Length: length of the flower's petal (the inner whorl of the flower).
- Petal Width: width of the flower's petal.

The Iris dataset is a classic and very easy multi-class classification dataset.

| Property  | Value    |
|-----------|----------|
| Classes | 3 |
| Samples per class | 50 |
| Samples total | 150 |
| Dimensionality | 4 |
| Features | real, positive |
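
The figures in the table can be checked directly against the data shipped with scikit-learn. The following is a small sketch; the names `iris_check` and `class_counts` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the data as plain NumPy arrays (illustrative variable names)
iris_check = load_iris()

n_samples, n_features = iris_check.data.shape
class_counts = np.bincount(iris_check.target)   # samples per class
all_positive = bool((iris_check.data > 0).all())

print("Samples total:", n_samples)
print("Dimensionality:", n_features)
print("Samples per class:", class_counts)
print("All features positive:", all_positive)
```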


In [1]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Load the Iris dataset
iris = load_iris(as_frame=True)

In [3]:
# Copy the feature DataFrame (with as_frame=True, iris.data is already a DataFrame)
iris_df = iris.data.copy()

In [4]:
# Add the target column to the DataFrame
iris_df['target'] = iris.target

In [5]:
# Display the first few rows of the DataFrame as a table
iris_df.head()

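|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |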


## 1.2 Split the dataset into train/test sets

In [6]:
# Split the dataset into features (X) and target labels (y)
X = iris_df.drop('target', axis=1)  # Features
y = iris_df['target']  # Target labels

In [7]:
# Split the data into a training set (80%) and a testing set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
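
The classification report in section 1.5 lists supports of 10, 9, and 11 for the three classes, so this random split does not preserve the class proportions exactly. If balanced proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below shows the idea with illustrative names (`X_demo`, `y_demo`, `y_test_strat`); the results reported in this notebook were produced without stratification.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative copy of the data, kept separate from the notebook's variables
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in both subsets
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

strat_counts = np.bincount(y_test_strat)
print("Test samples per class:", strat_counts)
```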

## 1.3 Transform the continuous features into discrete bins

MultinomialNB expects non-negative, count-like features, so the four continuous measurements are first mapped to ordinal bin indices (0 to 4) with KBinsDiscretizer.

In [9]:
# Create a KBinsDiscretizer transformer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

In [10]:
# Fit and transform the continuous features into discrete bins
X_train_binned = discretizer.fit_transform(X_train)
X_test_binned = discretizer.transform(X_test)

In [11]:
# Display the shapes of the training and testing sets
print("X_train shape:", X_train_binned.shape)
print("X_test shape:", X_test_binned.shape)

X_train shape: (120, 4)
X_test shape: (30, 4)
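
To see what the discretizer does to individual values, here is a small sketch on a made-up one-column array, using the same settings as above. The names `toy_values`, `toy_discretizer`, and `toy_binned` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single feature with values between 0 and 10
toy_values = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])

# strategy='uniform' splits the observed range [min, max] into equal-width bins;
# encode='ordinal' returns the bin index (0 .. n_bins-1) for each value
toy_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
toy_binned = toy_discretizer.fit_transform(toy_values)

toy_edges = toy_discretizer.bin_edges_[0]
print("Bin edges:", toy_edges)               # five bins of width 2 between 0 and 10
print("Bin indices:", toy_binned.ravel())
```

With uniform bins, outliers stretch the range and can leave some bins nearly empty. `strategy='quantile'` is an alternative that places roughly the same number of training samples in each bin.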


## 1.4 Define the classification model

In [13]:
# Create a Multinomial Naive Bayes classifier
classifier = MultinomialNB()

In [14]:
# Train the classifier on the binned training data
classifier.fit(X_train_binned, y_train)

## 1.5 Evaluation

In [15]:
# Make predictions on the binned test data
y_pred = classifier.predict(X_test_binned)

In [16]:
# Calculate and display the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7666666666666667


In [17]:
# Display a classification report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("\nClassification Report:\n", report)


Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.60      0.67      0.63         9
   virginica       0.70      0.64      0.67        11

    accuracy                           0.77        30
   macro avg       0.77      0.77      0.77        30
weighted avg       0.77      0.77      0.77        30
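
The report shows that setosa is classified perfectly, while versicolor and virginica are confused with each other. The reported accuracy of 0.7667 corresponds to 23 of 30 correct predictions. A confusion matrix makes these errors visible per class. The sketch below repeats the pipeline from the earlier cells in a self-contained form, so the variable names (`X_cm`, `binner`, `nb`, `cm`, and so on) are illustrative and separate from the notebook's own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer

# Rebuild the same split, binning, and model as in the cells above
X_cm, y_cm = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_cm, y_cm, test_size=0.2, random_state=42)

binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
nb = MultinomialNB().fit(binner.fit_transform(X_tr), y_tr)
y_hat = nb.predict(binner.transform(X_te))

# Rows are true classes, columns are predicted classes (order: 0, 1, 2)
cm = confusion_matrix(y_te, y_hat, labels=[0, 1, 2])
print(cm)

# Accuracy is the share of test samples on the diagonal of the matrix
acc_from_cm = cm.trace() / cm.sum()
print("Accuracy from confusion matrix:", acc_from_cm)
```

Off-diagonal entries in the second and third rows count the versicolor and virginica samples that were assigned to the wrong class.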

