# Unit 4 Naive Bayes Basics

Hey there! Today we are going to explore an exciting topic in machine learning called **Naive Bayes**. By the end of this lesson, you'll understand what Naive Bayes is and how to implement it using Python's **Scikit-Learn** library. Let’s dive in!

-----

## Understanding Naive Bayes

**Naive Bayes** is a classification algorithm based on **Bayes' Theorem**. Imagine you’re a detective using clues (features) to decide who the culprit is (class). Naive Bayes weighs those clues by calculating how probable each class is given the features it observes.

Bayes' Theorem is stated as:

$$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$$

Where:

  * $P(C|X)$ is the **posterior probability** of class $C$ given predictor $X$.
  * $P(X|C)$ is the **likelihood**, which is the probability of predictor $X$ given class $C$.
  * $P(C)$ is the **prior probability** of class $C$.
  * $P(X)$ is the **evidence**: the overall probability of predictor $X$, regardless of class.
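
To make the formula concrete, here is a small numeric sketch. All of the numbers are hypothetical and chosen only for illustration: suppose 7% of emails are spam, the word "free" appears in 60% of spam emails, and it appears in 5% of non-spam emails. The evidence $P(X)$ is obtained by summing over both classes.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' Theorem
p_spam = 0.07                  # P(C): prior probability that an email is spam
p_not_spam = 1 - p_spam        # prior probability that an email is not spam
p_word_given_spam = 0.60       # P(X|C): likelihood of seeing "free" in a spam email
p_word_given_not_spam = 0.05   # likelihood of seeing "free" in a non-spam email

# P(X): overall probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam

# P(C|X): posterior probability of spam given that the word appears
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'free') = {p_spam_given_word:.3f}")
```

With these made-up values, seeing the word raises the probability of spam from the 7% prior to roughly 47%.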

### How Naive Bayes Works

  * **Prior Probability**: The algorithm starts by calculating the prior probability of each class from the training data. This is simply the probability that a sample belongs to class $C$ when we know nothing else about the sample. For example, imagine we are predicting whether an email is spam. If 93% of the emails in the data are not spam, it is reasonable to suppose that a given email is not spam with a probability of 93%. That is the prior probability.
  * **Likelihood**: For each feature, the likelihood (the probability of the feature value given the class) is calculated. It answers the question: among samples of a given class, how probable is it to observe this feature value?
  * **Independent Features Assumption (Naive Assumption)**: The algorithm assumes that the features are independent of one another given the class. This lets it multiply the per-feature likelihoods instead of modeling how the features interact, which greatly simplifies the calculation.
  * **Posterior Probability**: Using Bayes' Theorem, the posterior probability of each class is computed given the feature values. The class with the highest posterior probability is chosen as the prediction.
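
The four steps above can be sketched by hand with NumPy. This is a minimal illustration under the Gaussian assumption, not Scikit-Learn's actual implementation (which, among other things, adds a small smoothing term to the variances). The names `priors`, `means`, `variances`, and `log_posterior` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Step 1: prior probability of each class (fraction of samples in that class)
priors = np.array([np.mean(y == c) for c in classes])

# Step 2: per-class, per-feature mean and variance for the Gaussian likelihood
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior(x):
    """Return the unnormalized log posterior of each class for one sample x."""
    # Step 3 (naive assumption): add up the per-feature Gaussian log-likelihoods
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    return np.log(priors) + log_likelihood

# Step 4: choose the class with the highest posterior for every sample
manual_pred = np.array([classes[np.argmax(log_posterior(x))] for x in X])
print(f"Training accuracy of the manual sketch: {np.mean(manual_pred == y):.2f}")
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow. The evidence $P(X)$ is left out because it is the same for every class and does not change which class scores highest.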

### How Naive Bayes Learns

**Naive Bayes** learns by estimating its priors and likelihoods from the training data. When the trained model encounters a new sample, it looks at each of the sample's feature values and applies Bayes' Theorem to calculate the probability of each class. The class with the highest probability is the predicted class.

We will focus on **GaussianNB**, commonly used when features are continuous and assumed to follow a normal (Gaussian) distribution.
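
As a quick look at what "learning" means here, the sketch below fits `GaussianNB` on the full Iris dataset and prints two of the quantities it estimated. The variable name `model` is just illustrative; `class_prior_` and `theta_` are fitted attributes of Scikit-Learn's `GaussianNB`.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# Prior probability of each class, estimated from the class frequencies
print("Class priors:", model.class_prior_)

# Mean of each feature within each class (rows: classes, columns: features)
print("Per-class feature means:")
print(model.theta_)
```

Together with the per-class variances, these means define the Gaussian likelihood of each feature.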

-----

## Loading the Dataset

Before training our **Naive Bayes** classifier, we need data. Consider it like needing mystery stories before solving them! We'll use the **Iris dataset**, which includes features of iris flowers to classify them into species.

Let’s quickly remind ourselves how to load the dataset using **Scikit-Learn**:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
```
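
It can help to check what the split produced. The Iris dataset has 150 samples with 4 features each, so `test_size=0.4` should hold out 60 samples for testing and keep 90 for training. A small sketch to confirm this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Each shape is (number of samples, number of features) for X and (number of samples,) for y
print("Training set:", X_train.shape, y_train.shape)
print("Testing set: ", X_test.shape, y_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same samples land in each set on every run.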

-----

## Training the Naive Bayes Classifier

Now that we have our data split, it’s time to train our **Naive Bayes** classifier using **GaussianNB**:

```python
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes classifier
nb_clf = GaussianNB()

# Train the classifier with training data
nb_clf.fit(X_train, y_train)
```

Here, `fit` trains the model using the training data, much like a student learning from textbooks.

-----

## Making Predictions and Calculating Accuracy

After training the model, let’s make predictions on the test data and calculate the accuracy:

```python
from sklearn.metrics import accuracy_score

# Make predictions on the testing set
y_pred = nb_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bayes model accuracy: {accuracy * 100:.2f}%")
# Bayes model accuracy: 96.67%
```

Here, `y_pred` contains the predicted class labels for the test set, and `accuracy_score` compares these predictions to the true labels (`y_test`) to calculate the model's accuracy.
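
To see what these two steps compute, the sketch below recalculates the accuracy by hand as the fraction of matching labels, and uses `predict_proba` to show the posterior probabilities behind the first few predictions. The names `manual_accuracy` and `probabilities` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

nb_clf = GaussianNB().fit(X_train, y_train)
y_pred = nb_clf.predict(X_test)

# Accuracy is the fraction of test samples whose predicted label matches the true label
manual_accuracy = np.mean(y_pred == y_test)
print("Matches accuracy_score:", np.isclose(manual_accuracy, accuracy_score(y_test, y_pred)))

# predict_proba returns one posterior probability per class for each sample
probabilities = nb_clf.predict_proba(X_test[:3])
print(probabilities.round(3))
```

Each row of `probabilities` sums to 1, and `predict` returns the class whose column holds the largest value.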

-----

## Lesson Summary

Great job! You've learned how to use the **Naive Bayes** classifier for machine learning tasks. Here’s a quick recap:

  * **Naive Bayes**: A probabilistic classifier based on Bayes' Theorem.
  * **Dataset Loading**: Used Scikit-Learn's `load_iris` to load the Iris dataset.
  * **Train-Test Split**: Used `train_test_split` to split data into training and testing sets.
  * **Model Training**: Used `GaussianNB` to train the Naive Bayes classifier.
  * **Making Predictions and Calculating Accuracy**: Predicted test set labels and calculated the model's accuracy.

Now it’s time to roll up your sleeves and get hands-on practice! In the next section, you'll implement what you’ve learned and see the **Naive Bayes** classifier in action on your own. Excited? Let’s get started!

## Detective Model Accuracy Calculation

Imagine you're a detective trying to solve a mystery using clues (features). In the given code, you can see how to load a dataset, train a Naive Bayes classifier, and calculate its accuracy. Click Run to see how accurately this model can solve the mystery of the iris flowers!

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Initialize and train the Naive Bayes classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = nb_clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")

```

## Detective Work: Fix the Clue Classification

Great job, Space Explorer!

Now, it’s time to put on your detective hat. The code below has a tiny bug preventing it from working correctly. Identify and fix the issue to ensure the classifier predicts the correct class based on the clue.

Good luck!

```python

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Initialize and train the Naive Bayes classifier
detective_nb = GaussianNB()

# Make example prediction before fitting
clue = [[5.0, 3.6, 1.4, 0.2]]  # Simulating a new piece of evidence
suspect = detective_nb.predict(clue)

print(f"The suspect's class based on the clue is: {suspect[0]}")
```

Here is the corrected version, in which the model is fitted before it is asked to predict:

```python

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Initialize the Naive Bayes classifier
detective_nb = GaussianNB()
# Fix: the model must be fitted on training data before making predictions
detective_nb.fit(X, y)

# Make an example prediction now that the model is fitted
clue = [[5.0, 3.6, 1.4, 0.2]]  # Simulating a new piece of evidence
suspect = detective_nb.predict(clue)

print(f"The suspect's class based on the clue is: {suspect[0]}")
```

### The Bug and Its Fix:

The bug in the original code was that `detective_nb.predict(clue)` was called *before* the `GaussianNB` classifier was trained using `detective_nb.fit(X, y)`. A machine learning model needs to learn from data (be "fitted") before it can make predictions.

The fix involves simply adding the line `detective_nb.fit(X, y)` after initializing the classifier and before attempting to make a prediction. This trains the Naive Bayes model on the entire Iris dataset, allowing it to learn the relationships between the features and the classes, and thus make a meaningful prediction for the `clue`.
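
To see the failure for yourself, the sketch below calls `predict` on an unfitted classifier and catches the resulting exception. It assumes a Scikit-Learn version in which estimators raise `sklearn.exceptions.NotFittedError` in this situation; the names `unfitted_nb` and `raised` are illustrative.

```python
from sklearn.exceptions import NotFittedError
from sklearn.naive_bayes import GaussianNB

unfitted_nb = GaussianNB()       # created, but never fitted
clue = [[5.0, 3.6, 1.4, 0.2]]

try:
    unfitted_nb.predict(clue)
    raised = False
except NotFittedError as error:
    # The model has no learned priors or likelihoods to work with yet
    raised = True
    print(f"Prediction failed: {error}")
```

The error message itself points to the fix: call `fit` with appropriate arguments before using the estimator.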

## Train Naive Bayes Classifier

Bright work, Space Explorer! Now, let's see if you can train a Naive Bayes classifier and make predictions on the Iris dataset. Fill in the missing pieces of code to complete the task.

Initialize and train the `GaussianNB` classifier so that the rest of the code can make predictions and calculate the accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# TODO: Initialize and train the Naive Bayes classifier


# Make predictions on the testing set and print the accuracy
y_pred = nb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Bayes model accuracy: {accuracy * 100:.2f}%")

```

One way to complete the task:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Initialize and train the Naive Bayes classifier
nb_clf = GaussianNB()  # Initialize the Gaussian Naive Bayes classifier
nb_clf.fit(X_train, y_train)  # Train the classifier using the training data

# Make predictions on the testing set and print the accuracy
y_pred = nb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Bayes model accuracy: {accuracy * 100:.2f}%")

```

## Comparison of Logistic Regression and Naive Bayes

Alright, Space Voyager! It's time to use the knowledge you've gathered to compare Logistic Regression and Naive Bayes. Your mission is to load the Breast Cancer dataset, split it into training and testing sets, train both classifiers, make predictions, and calculate their accuracy.

May the data be ever in your favor, and show us which classifier reigns supreme in space!

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# TODO: Load the Breast Cancer dataset

# TODO: Split the dataset into training and testing sets

# TODO: Train logistic regression classifier

# TODO: Train Naive Bayes classifier

# TODO: Make predictions on the testing set with both classifiers

# TODO: Calculate and print accuracy for both models

```

One possible solution, using a 70/30 train-test split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression classifier
log_reg_clf = LogisticRegression(max_iter=10000, random_state=42) # Increased max_iter for convergence
log_reg_clf.fit(X_train, y_train)

# Train Naive Bayes classifier
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

# Make predictions on the testing set with both classifiers
y_pred_log_reg = log_reg_clf.predict(X_test)
y_pred_gnb = gnb_clf.predict(X_test)

# Calculate and print accuracy for both models
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)

print(f"Logistic Regression Accuracy: {accuracy_log_reg * 100:.2f}%")
print(f"Naive Bayes (GaussianNB) Accuracy: {accuracy_gnb * 100:.2f}%")
```