<a href="https://colab.research.google.com/github/shstreuber/Data-Mining/blob/master/Module5_kNN_NaiveBayes2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 5: CLASSIFICATION: k Nearest Neighbor and Naive Bayes**
In this module, we will step into machine learning with the k Nearest Neighbor and Naive Bayes algorithms. At the end of this module, you will be able to:
* Outline the concepts of Supervised Learning
* Explain what classification is
* Describe how k Nearest Neighbor works
* Describe how Naive Bayes works
* Write code to execute both k Nearest Neighbor and Naive Bayes

##**Supervised Learning**
This time, we are going to take the whole idea of forecasting a step further. With last week's logistic regression, we learned how to build a model that sorts data into one of two levels of a target attribute; in other words, a model with a binary class outcome. This week, we are going to work with a categorical class attribute that can have more than two levels, and with two different machine learning mechanisms: the empirical classifier k Nearest Neighbor and the statistical classifier Naive Bayes.

Both are part of **SUPERVISED LEARNING**. A supervised machine learning algorithm relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data. Imagine a computer is a child, we are its supervisor (e.g. parent, guardian, or teacher), and we want the child (computer) to learn what a book looks like. We will show the child several different pictures, some of which are books and the rest could be pictures of anything (cats, coffee cups, computers, etc).
When we see a book, we shout "book!" When it's not a book, we shout "no, not book!" After doing this several times with the child, we show them a picture and ask "book?" and they will correctly (most of the time) say "book!" or "no, not book!" depending on what the picture is. That is supervised learning. Kind of like this:

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20231121154747/Supervised-learning.png">

Now, please watch the video below. It's a great introduction to Supervised Learning.

In [None]:
from IPython.display import IFrame  # This is just for me so I can embed videos
IFrame(src="https://www.youtube.com/embed/kE5QZ8G_78c", width=560, height=315)

##**Classification**
Classification is the problem of identifying which of a set of categories an observation belongs to. For example: Classification helps us assign a given email to the "spam" or "non-spam" class, or a diagnosis to a given patient based on observed characteristics of the patient.

In machine learning, classification has these steps:
1. Determine what the class attribute in the dataset should be. This will be the attribute you'll predict later on
2. Preprocess the data (remove n/a, transform data types as needed, deal with missing data) and ensure that the dependent attribute is CATEGORICAL
3. Split the data into a training set and a test set
4. Build the model based on the training set
5. Test the model on the test set and compare the calculated class values to the actual class values shown in the test set.
6. Determine the quality of the model

Ready? Let's go.


#**0. Preparation and Setup**
For these explanations, we will need a model with a dependent attribute that is categorical. The typical explanation uses the famous [iris flower dataset](https://github.com/shstreuber/Data-Mining/blob/master/data/iris.csv), which even has [its own wikipedia page](https://en.wikipedia.org/wiki/Iris_flower_data_set). However, we will use the insurance dataset because it allows us to tackle actual real-world problems. Since we will be working with two different types of classification, the first one called k Nearest Neighbor and the second one called Naive Bayes, we will import all the libraries upfront.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import spatial
import statsmodels.api as sm

from IPython.display import HTML # This is just for me so I can embed videos
from IPython.display import Image # This is just for me so I can embed images

#Reading in the data as insurance dataframe
insurance = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/insurance_with_categories.csv")

#Verifying that we can see the data
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


# **1. k Nearest Neighbor**
The concept of the k-nearest neighbor classifier is part of our everyday life.

Imagine you are the owner of an online clothing store and you want to implement a recommendation system to suggest products to your customers based on their preferences and past purchases. This is where the K-Nearest Neighbors (KNN) algorithm can be incredibly useful.

**How KNN Works in This Scenario:**
1. **Data Collection:** You have data on past purchases and preferences of your customers. Each customer has a profile that includes features like age, gender, preferred clothing styles (e.g., casual, formal), favorite colors, sizes, and previous purchase history. For example:

```
Customer Data:
- Customer A: Age: 25, Gender: Female, Style: Casual, Favorite Colors: Blue, Size: M, Purchased Items: [Jeans, T-shirts]
- Customer B: Age: 30, Gender: Male, Style: Formal, Favorite Colors: Black, Size: L, Purchased Items: [Suits, Dress Shirts]
```
2. **Feature Representation:** Convert each customer’s profile into a feature vector. A feature vector might look like this for Customer A:

```
[25, 1 (Female), 1 (Casual), 0 (Formal), 1 (Favorite Colors: Blue), 0 (Size: M), 1 (Purchased: Jeans), 1 (Purchased: T-shirts)]
```

3. **Similarity Measurement:**

When a new customer visits your store and browses certain products, you represent their current preferences as a feature vector.
For example, a new Customer C might have the following profile:


```
[22, 1 (Female), 1 (Casual), 0 (Formal), 1 (Favorite Colors: Blue), 0 (Size: M)]

```
4. **kNN Algorithm:** Calculate the distance (similarity) between Customer C’s feature vector and the feature vectors of all existing customers. This distance can be calculated using various methods such as Euclidean distance.
 * Then, identify the 'k' customers whose feature vectors are closest to Customer C’s vector. For simplicity, let’s assume k = 3.
 * Suppose the three nearest neighbors (most similar customers) to Customer C are Customer A, Customer D, and Customer E.

5. **Recommendation:** Analyze the purchase history of these 3 nearest neighbors.
 * Recommend products that these neighbors have purchased but Customer C has not.
 * For instance, if Customers A, D, and E all bought a particular jacket and Customer C has not bought it yet, the system will recommend this jacket to Customer C.
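
The distance calculation and neighbor selection from step 4 can be sketched in a few lines of plain Python. This is only an illustration: the shortened feature vectors (`[age, is_female, likes_casual, likes_formal]`) and the values for customers D, E, and F are made-up assumptions, not data from the store example.

```python
import math

# Hypothetical, shortened feature vectors: [age, is_female, likes_casual, likes_formal]
customers = {
    "A": [25, 1, 1, 0],
    "B": [30, 0, 0, 1],
    "D": [23, 1, 1, 0],
    "E": [21, 0, 1, 0],
    "F": [45, 1, 0, 1],
}
customer_c = [22, 1, 1, 0]  # the new customer we want recommendations for


def euclidean(u, v):
    """Straight-line distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance from Customer C to every existing customer
distances = {name: euclidean(vec, customer_c) for name, vec in customers.items()}

# The k = 3 customers with the smallest distances are the nearest neighbors
k = 3
nearest = sorted(distances, key=distances.get)[:k]
print(nearest)
```

Note that age has a much larger range than the 0/1 flags, so it dominates the distance in this sketch; in practice, features are often rescaled before distances are computed.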

<hr>

In a nutshell, the **principle behind nearest neighbor classification** consists of identifying a predefined number (the 'k') of training samples closest in (Euclidean) distance to the new sample that we want to classify. The label of the new sample is then determined by the labels of these neighbors.

Here is what this looks like:

<img src = "https://miro.medium.com/v2/resize:fit:1400/0*34SajbTO2C5Lvigs.png">

Would you like a more in-depth explanation? The video below gives you all the detail you will want to know as you proceed.



In [None]:
IFrame(src="https://www.youtube.com/embed/0p0o5cmgLdE", width=560, height=315)

##**1.1 Nearest Neighbor Algorithm**
The k Nearest Neighbor Algorithm works like this:
1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data:
  
  3.1 Calculate the distance between the query example and the current example from the data.
  
  3.2 Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If classification, return the mode of the K labels
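
The steps above can be sketched from scratch in plain Python. This is a minimal illustration, not the scikit-learn implementation we use below; the function name `knn_classify` and the toy (age, bmi) rows are assumptions made up for this example.

```python
import math
from collections import Counter


def knn_classify(training_data, query, k):
    """Classify `query` by majority vote among its k nearest training examples.

    training_data is a list of (features, label) pairs (step 1: the loaded data);
    k is the chosen number of neighbors (step 2).
    """
    # Step 3: compute the distance from the query to every training example
    distances_and_indices = []
    for index, (features, _label) in enumerate(training_data):
        distance = math.dist(features, query)  # Euclidean distance
        distances_and_indices.append((distance, index))

    # Step 4: sort by distance, smallest first
    distances_and_indices.sort()

    # Step 5: keep the first k entries
    k_nearest = distances_and_indices[:k]

    # Step 6: look up the labels of those k entries
    k_labels = [training_data[index][1] for _distance, index in k_nearest]

    # Step 7: return the mode (most common label)
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: (age, bmi) -> region
toy_data = [
    ((25, 22.0), "northeast"),
    ((27, 23.5), "northeast"),
    ((45, 31.0), "southeast"),
    ((47, 33.0), "southeast"),
    ((50, 30.0), "southeast"),
]

prediction = knn_classify(toy_data, query=(26, 24.0), k=3)
print(prediction)
```

With k=3, two of the three closest rows to the query (26, 24.0) are labeled 'northeast', so the vote goes to 'northeast'.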

To work with the k Nearest Neighbor algorithm, we use its library from the scikit learn package. We will also learn a new way to build training and test sets (with a process called cross-validation), so we are importing that package, too. Lastly, we will be generating "pretty pictures"--so, matplotlib is going to help us out with that.



In [None]:
# We import all the kNN libraries

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

from sklearn import neighbors, datasets
from sklearn.model_selection import cross_val_score, train_test_split
%matplotlib inline

##**1.2 Exploratory Data Analysis**

1. Let's investigate the features (= attributes or dimensions)

In [None]:
insurance.info()

We have 4 numeric attributes and 3 categorical ones.

2. Let's rearrange the numeric features into a dataframe and use them to predict what region a person comes from.

In [None]:
insurance2 = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])
insurance2.head()

In [None]:
# Let's check what the levels of region are!
insurance2.region.unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

And now for some fun stuff! There is a great package that makes EDA much much easier: **The [Ydata profiling package](https://docs.profiling.ydata.ai/latest/).**

Here is how it works:

In [None]:
!pip install ydata-profiling
from ydata_profiling import ProfileReport # importing the package
profile = ProfileReport(insurance2, title="Insurance 2 Profiling Report") # Connecting the package and our data
profile # And we call the package. This will take a moment. Prepare to be amazed!

##**1.3 Setting up Training and Test Sets**
To set up our training and test sets, we first separate the independent variables (which we are assuming are age, bmi, children, and charges) from the class attribute (region), which contains the labels that we want to assign to the "unknown" data.



In [None]:
x=insurance2.iloc[:,:4] # all parameters
y=insurance2['region'] # class labels 'southwest', 'southeast', 'northwest', 'northeast'

#print(x) # Uncomment this line to verify your parameters/ independent variables/ attributes/ features
#print(y) # Uncomment this line to verify your class labels

Now that we have separated the x attributes (independent variables) and the y attribute (class attribute, dependent variable), we build our training and test sets!

NOTE that we are not allocating any sizes for the train_test_split below. This will invoke the default, which is a 75% training/ 25% test split.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state = 0)

# So, what training data do we have?
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

##Your Turn
How many rows and columns do you have in the test data set? Write the command below and run it!

##**1.4 Building the Simplest Model with k=1**

Remember that, in kNN classification, the output is class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

We'll try that out first.

In [None]:
# This is the model with the one nearest neighbor. The default is the Euclidean distance.

from sklearn.neighbors import KNeighborsClassifier
model1 = KNeighborsClassifier(n_neighbors = 1)
model1.fit(X_train, y_train)

###**1.4.1 Testing the Model**

As you can see, we have built model1, which is the kNN model with just 1 nearest neighbor. Next, we test it.

We use y_pred to store the calculated y values (remember y hat?) that the model gives us. Then we can compare them with the actual y values that we know and see what percentage the model identified correctly.

In [None]:
y_pred = model1.predict(X_test)
print("Test set predictions: \n {}".format(y_pred))

###**1.4.2 Evaluating the model**
Here, we use four critical methods to get an idea of how "good" our model really is.


####1.4.2.1 The Accuracy Score

In [None]:
# Accuracy score
print("Test set score: {:.2f}".format(model1.score(X_test, y_test)))

Oh boy! The accuracy score means that only 36% of the records in the test set are classified correctly. That number could be better. Let's see if the other methods indicate the same thing.

####1.4.2.2 Data Inspection Calculated vs. Actual values

In [None]:
# Let's compare the actual y and the predicted y

# One row per test record: the predicted label next to the actual (original) label
realvsmodel1 = pd.DataFrame({'predicted':y_pred,'original':y_test})
realvsmodel1.head()

Well, that doesn't look too exciting, does it? Let's go to the last and MOST IMPORTANT way of evaluating the quality of our model: **The Confusion Matrix.**

####1.4.2.3 The Confusion Matrix

A confusion matrix compares the calculated or predicted values for all labels in the class attribute with the actual, true values that we know. In other words, we check which true values were predicted correctly and which were predicted incorrectly.

A longer, more mathematical, explanation is [here](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=model1.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=model1.classes_).plot()

Alrighty. What are we seeing here?

The true (=actual original) values appear in rows (label on the left side, down), the predicted values appear in columns (label across the top). Here is how to explain **the first row**:
* All the items should have been in the northeast region because the True Label is 'northeast'. That is clearly not the case; otherwise, we would see 112 in the green box in the upper left and zeros across the rest of the row.
* Of the actual true 'northeast' region, 35 were predicted correctly (lime); 18 were incorrectly predicted as 'northwest', 17 were incorrectly predicted as 'southeast', and 26 were incorrectly predicted as 'southwest'. So, out of 112 actual 'northeast' rows, only 35 were predicted correctly. The rest were predicted incorrectly.

####1.4.2.4 The Classification Report

The Classification Report gives us even more insights into how well (or, in our case, badly) our model performs. To read it correctly, we first have to define a few terms:
1. **precision** (also called positive predictive value) is the number of correctly identified positive results divided by the number of all results that were predicted as positive, including those predicted incorrectly ((true positives) / (true positives + false positives)). Said another way, "for all instances classified positive, what percent was correct?"
2. **recall** (also known as sensitivity) is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive ((true positives) / (true positives + false negatives)). Said another way, "for all instances that were actually positive, what percent was classified correctly?"
3. **f1-score** is the harmonic mean of the precision and recall. The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
4. **support** is the number of actual occurrences of the class in the specified dataset.
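
To make these four terms concrete, here is a small sketch that computes them by hand for one label at a time. The eight true and predicted labels are invented for illustration, and `report_for` is a hypothetical helper, not part of scikit-learn.

```python
# Hypothetical true and predicted labels for a tiny test set of 8 records
y_true = ["northeast"] * 4 + ["southwest"] * 4
y_hat = ["northeast", "northeast", "northeast", "southwest",
         "southwest", "northeast", "northeast", "northeast"]


def report_for(label, y_true, y_hat):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    support = tp + fn  # number of actual occurrences of the label
    return precision, recall, f1, support


for label in ("northeast", "southwest"):
    precision, recall, f1, support = report_for(label, y_true, y_hat)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

In this toy example, both labels have a precision of 0.50, but 'northeast' has a much higher recall (0.75) than 'southwest' (0.25), which is exactly the kind of difference a single accuracy number hides.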

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels=['southwest', 'southeast', 'northwest', 'northeast']))

Look at these values (note that "support" is the number of true instances of each label)--would you accept this quality from, say, a dentist? Or from your car, which (in the case of southwest) engages the brakes in 18% of all cases when you step on the pedal?

<img src = "https://www.shutterstock.com/shutterstock/videos/697813/thumb/1.jpg?ip=x480" height=200>
Yeah, I thought so, too. Not good.

## **1.5 Building kNN with 5 nearest neighbors (and a 2/3 to 1/3 train/ test split)**
Given our less-than-fabulous results above, let's see if, instead of assigning the class label from only 1 nearest neighbor, we can increase the accuracy of our predictions by looking at the class labels for the 5 nearest neighbors!

###1.5.1 Setting up Training and Test Set
**Note** how here, we use the test_size parameter to split 2/3 of the data into the training set and 1/3 into the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.33)

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

###1.5.2 Building the model with k=5

In [None]:
model5 = neighbors.KNeighborsClassifier(n_neighbors=5)

model5.fit(X_train, y_train)

y_pred = model5.predict(X_test)

Alright, we've got the model built. Everyone ready for the Confusion Matrix and the Classification Report?

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=model5.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=model5.classes_).plot()

In [None]:
print(classification_report(y_test, y_pred, labels=['southwest', 'southeast', 'northwest', 'northeast']))

Wait, seriously? Why are we getting a result that is only a little bit different, but we're spending much more processing effort?

<img src = "https://www.shutterstock.com/shutterstock/videos/697813/thumb/1.jpg?ip=x480" height=200>

There's got to be a **BETTER** way to find the optimal number for k!

And there is.

##**1.6 Optimizing k with Cross-Validation**
We could spend entire days re-running the kNN and increasing k by 1 until we've found the best value for k. But that would be 1. boring, 2. too much work, 3. not efficient, given that we could instead just cycle through a list of values until we've found the best one.

To achieve this most efficiently, we can use another trick (a model evaluation method) that we haven't encountered yet: **Cross-validation.** Find out [in this detailed description](https://machinelearningmastery.com/k-fold-cross-validation/) how cross-validation works. [This graphic](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png) will also help you understand. Or watch this 5-minute video:

In [None]:
IFrame(src="https://www.youtube.com/embed/fSytzGwwBVw", width=560, height=315)

This is what we'll do now. Let's go!

**First**, we build a list of potential k values. Then we create an empty list that will hold cross-validation scores.

In [None]:
# To determine how to pick k, we are first creating a list of potential k values
klist = list(range(1,50,2)) # Our list holds the odd values from 1 to 49 (range stops before 50)

# Then we create an empty list that will hold cross-validation scores
cv_scores = []

**Now we can build our cross-validation.** We will cycle through our k values from the k-value list and store the accuracy scores in the cv_scores list. To make things easier on us, we will convert the accuracy score into its opposite--the misclassification error. This misclassification error is really the average of all the misclassifications for one run of k.

In [None]:
# Perform 10-fold cross validation for each k value (we have a small dataset, so we can do this)
for k in klist:
    model10 = neighbors.KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model10, x, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Changing to misclassification error
errors = 1- np.array(cv_scores)
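
If you are wondering what happens inside `cross_val_score` for a single value of k, the splitting part can be sketched roughly as below. This is a simplified illustration with a hypothetical helper `k_fold_indices` and plain contiguous folds; for classifiers, scikit-learn uses stratified folds by default, so its actual folds will differ.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold.

    Every row lands in the test set exactly once; the remaining rows
    form the training set for that fold.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# A hypothetical dataset with 23 rows, split into 10 folds
folds = list(k_fold_indices(n_samples=23, n_folds=10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train)} rows, test on rows {test}")
```

For each fold, the model is fit on the training rows and scored on the test rows; the mean of the 10 scores is what we append to cv_scores.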

NOW we can use the error number to determine the optimal k! To do so, we look at our errors and pick the row with the k value that produced the smallest error.

To make things easier to understand, we plot the misclassification errors in comparison to k so we can see our results.

In [None]:
optimal_k = klist[np.argmin(errors)]
print("The optimal number of neighbors is {}".format(optimal_k))

# plot misclassification error vs k
plt.plot(klist, errors)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

**Not sure what the code above does?** Here is the explanation:
* **klist**: This is a list containing different values of 'k', which represent the number of neighbors considered in the K-Nearest Neighbors (KNN) algorithm.
* **errors**: This is a NumPy array that contains the misclassification error corresponding to each value of 'k' in klist.
* **np.argmin(errors)**: This function returns the index (position) of the smallest value in errors. On its own, it is not yet a value of 'k'; it only tells us where the lowest misclassification error sits.
* **klist[np.argmin(errors)]**: This retrieves the value of 'k' from klist that corresponds to the minimum error. This is considered the optimal number of neighbors for the KNN algorithm based on the data.
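
The difference between the index and the value of 'k' is easy to see with a handful of made-up numbers. In this sketch, the plain-Python expression `errors.index(min(errors))` plays the role of `np.argmin(errors)`; the error values are hypothetical.

```python
# Hypothetical k values and their misclassification errors
klist = [1, 3, 5, 7, 9]
errors = [0.66, 0.64, 0.61, 0.63, 0.65]

# Position of the smallest error (what np.argmin would return for an array)
best_index = errors.index(min(errors))

# The k value stored at that same position in klist
optimal_k = klist[best_index]

print("index of smallest error:", best_index)
print("optimal k:", optimal_k)
```

Here the smallest error sits at position 2, and the k value at position 2 of klist is 5. If two k values tie for the smallest error, both this expression and `np.argmin` return the first one.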

##Your Turn
Re-run this model with k=10 and k=50; for each of these, build and run a confusion matrix and a classification report. What changes? What do your results say about the data?

Use the lines below for your code

# **2. Naive Bayes**
Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that the features (predictors) are independent of each other given the class, which is often not true in real life, hence the term "naive". Despite this naive assumption, Naive Bayes works very well in many practical applications.

**Bayes' Theorem** is the foundation of Naive Bayes and is expressed as:

<img src= "https://thatware.co/wp-content/uploads/2020/04/naive-bayes.png">

Where:

* P(A∣B) is the probability of event A happening given that event B has happened.
* P(B∣A) is the probability of event B happening given that event A has happened.
* P(A) is the prior probability of event A.
* P(B) is the prior probability of event B.
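
Written out with the symbols defined above, Bayes' Theorem reads as follows. The second line is the standard way the theorem is applied in Naive Bayes; the symbols C (a class) and x_1, ..., x_n (the feature values) are notation introduced here for illustration.

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

$$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

The product in the second line is where the "naive" independence assumption enters: each feature contributes its own likelihood, as if the features did not influence each other within a class.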

##**Example: Email Spam Classification**##

Let's use a simple example of classifying emails as "Spam" or "Not Spam" based on certain words in the email.

<img src = "https://miro.medium.com/v2/resize:fit:720/format:webp/0*mbFBPcPUJD-53v3h.png">

1. **Step 1: Training Data**

Suppose we have the following training data:



```
Email	   Contains "Win"	 Contains "Money"	Contains "Free"	  Class
Email 1	 Yes	            Yes	              Yes	              Spam
Email 2	 No	             Yes	              Yes	              Spam
Email 3	 Yes	            No	               Yes	              Spam
Email 4	 No	             No	               No	              Not Spam
Email 5	 Yes	            No	               No	              Not Spam
Email 6	 No	             Yes	              No	              Not Spam
```
<hr>

2. **Step 2: Calculate Probabilities**

We need to calculate the **prior probabilities** and the likelihoods.

Prior Probabilities:

* P(Spam)= 3/6 = 0.5 # 3 rows out of a total of 6 rows
* P(NotSpam)= 3/6 = 0.5 # 3 rows out of a total of 6 rows

Likelihoods:

* P(Contains "Win"∣Spam)= 2/3  (i.e. of the 3 rows with Spam, 2 contain Win)
* P(Contains "Money"∣Spam)= 2/3 (i.e. of the 3 rows with Spam, 2 contain Money)
* P(Contains "Free"∣Spam)= 3/3 = 1 (i.e. of the 3 rows with Spam, 3 contain Free)
* P(Contains "Win"∣Not Spam)= 1/3 (i.e. of the 3 rows with Not Spam, 1 contains Win)
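* P(Contains "Money"∣Not Spam)= 1/3 (i.e. of the 3 rows with Not Spam, 1 contains Money)
* P(Contains "Free"∣Not Spam)= 0/3 = 0 (i.e. of the 3 rows with Not Spam, none contains Free)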

<hr>

3. **Step 3: Classify a New Email**

Suppose we receive a new email that contains the words "Win" and "Free" but not "Money". We want to classify it as "Spam" or "Not Spam".

<img src ="https://github.com/shstreuber/Data-Mining/blob/master/images/naivebayes_spam.JPG?raw=true">
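
The same calculation can be sketched in Python, with the priors and likelihoods counted directly from the training table in Step 1. The helper name `score` is an assumption for this example, and exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

# Training table from Step 1: (contains Win, contains Money, contains Free, class)
emails = [
    (True, True, True, "Spam"),
    (False, True, True, "Spam"),
    (True, False, True, "Spam"),
    (False, False, False, "Not Spam"),
    (True, False, False, "Not Spam"),
    (False, True, False, "Not Spam"),
]

# New email: contains "Win" and "Free" but not "Money"
new_email = (True, False, True)


def score(label):
    """Prior times the likelihood of each observed word value, given the class."""
    rows = [email for email in emails if email[3] == label]
    result = Fraction(len(rows), len(emails))  # prior probability of the class
    for position, present in enumerate(new_email):
        matches = sum(1 for row in rows if row[position] == present)
        result *= Fraction(matches, len(rows))  # likelihood of this word value
    return result


scores = {label: score(label) for label in ("Spam", "Not Spam")}
total = sum(scores.values())
posteriors = {label: value / total for label, value in scores.items()}
prediction = max(scores, key=scores.get)

print(scores)
print(posteriors)
print("Prediction:", prediction)
```

Because "Free" never appears in a Not Spam row of this tiny table, the Not Spam score is exactly zero and the email is classified as Spam.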

<img src = "https://ih1.redbubble.net/image.490263180.2295/bg,f8f8f8-flat,750x,075,f-pad,750x1000,f8f8f8.jpg" height = 300>

This example shows how Naive Bayes is a straightforward yet powerful classification algorithm. It works well in many real-world situations, such as spam detection, text classification, and more. The key steps involve calculating prior probabilities, likelihoods, and using Bayes' Theorem to determine the posterior probability for each class, ultimately choosing the class with the highest probability.

**Need More Information?**
* Here is **a [great explanation](https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn)** of the principle behind the Bayes Theorem.
* And here is a great video that explains it very well:


In [None]:
IFrame(src="https://www.youtube.com/embed/O2L2Uv9pdDA", width=560, height=315)

##Your Turn
Given the results from the EDA, should we conduct a Naive Bayes Analysis, at all? What condition does the insurance2 dataset violate? Type your answer below.


##**2.1 Setting up the Environment**
We will be working with Scikit-Learn again. More specifically, we will be working with the Gaussian Naive Bayes algorithm. Are there other Naive Bayes algorithms? [Absolutely](https://scikit-learn.org/stable/modules/naive_bayes.html).

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [None]:
#Let's verify that the dataset is still what it needs to be:
insurance2.head()

In [None]:
insurance2.dtypes

##**2.2 Setting up the training and test sets**

In [None]:
ins_train, ins_test = train_test_split(insurance2, test_size = 0.2)
print(ins_train)
print(ins_test)

In [None]:
# Optional: NumPy versions of the two sets (not used in the steps below)
ins_train_np = ins_train.to_numpy()
ins_test_np = ins_test.to_numpy()

##**2.3 Building the model with GaussianNB()**
Below, we are fitting the model to the X and y training sets, so that our model can learn what the correct classifications are.
The first parameter for the fit function is the X-training set, which contains all of the FEATURES in the training dataframe (i.e. the input variables in ins_train). The second parameter for the fit function is the y-training set, which contains the LABELS, i.e. the known outcomes for ins_train. These outcomes are stored in the 'region' column.

In [None]:
ins_naivebayes = GaussianNB()
ins_naivebayes.fit(ins_train.drop('region',axis=1), ins_train['region'])
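
To get a feeling for what a Gaussian Naive Bayes model learns during fit, here is a simplified from-scratch sketch. It assumes that each feature follows a normal distribution within each class and stores one mean and one variance per feature and class. The function names and the toy (age, bmi) rows are made up for illustration; scikit-learn's GaussianNB adds refinements such as variance smoothing that are left out here.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Store the prior plus per-feature mean and variance for every class."""
    model = {}
    for label in sorted(set(labels)):
        class_rows = [row for row, l in zip(rows, labels) if l == label]
        columns = list(zip(*class_rows))  # one tuple per feature
        model[label] = {
            "prior": len(class_rows) / len(rows),
            "means": [mean(column) for column in columns],
            "variances": [pvariance(column) for column in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_score(label):
        params = model[label]
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["variances"])
        )
    return max(model, key=log_score)


# Hypothetical toy data: (age, bmi) rows with a region label
rows = [(25, 22.0), (27, 23.5), (29, 21.0), (45, 31.0), (47, 33.0), (50, 30.0)]
labels = ["northeast"] * 3 + ["southeast"] * 3

toy_model = fit_gaussian_nb(rows, labels)
print(predict(toy_model, (26, 22.5)))
print(predict(toy_model, (48, 31.5)))
```

Working with logarithms turns the product of likelihoods into a sum, which avoids very small numbers when there are many features.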

##**2.4 Testing the model and calculating the accuracy score**
Below, we are using the ins_naivebayes model on the FEATURES in the test data in order to predict their LABELS. This means that we need to remove the old (known) 'region' data in order to use only the features and the model to predict the new 'region' labels.

In [None]:
ins_predictions = ins_naivebayes.predict(ins_test.drop('region',axis=1))
accuracy_score(ins_test['region'], ins_predictions)

In [None]:
ins_predictions

##**2.5 Comparing the predicted and the original values**

In [None]:
# One row per test record: the predicted region next to the actual (original) region
realvsmodel2 = pd.DataFrame({'predicted':ins_predictions,'original':ins_test['region']})
realvsmodel2.head()

##Your Turn
Plot a confusion matrix. For your code, refer to section 1.4.2.3 above:

## Your Turn
Build a classification report. For your code, refer to section 1.4.2.4 above:

##Your Turn
Compare the results from the confusion matrix and the classification report for k Nearest Neighbor and Naive Bayes. Which model produces better results? Write your answer into the text field below.

##Your Turn
How valid are the results for the Naive Bayes classification really? Review the assumptions for work with Naive Bayes, especially regarding dependency among independent attributes, and then look again at the results of the ydata-profiling output above. Write your answer into the text field below.



##Your Turn
So ... given the data that we have, can we reliably predict the region someone lives in based on their age, bmi, number of children, and the $ amount of their insurance claims? Explain in the field below.

# 3. If you get stuck

In [None]:
# Building the kNN model with k = 10
model10 = neighbors.KNeighborsClassifier(n_neighbors=10)
model10.fit(X_train, y_train)
y_pred = model10.predict(X_test)

In [None]:
# Building the kNN model with k = 50
model50 = neighbors.KNeighborsClassifier(n_neighbors=50)
model50.fit(X_train, y_train)
y_pred = model50.predict(X_test)

The insurance2 dataset does not contain completely independent attributes. The EDA shows that several attributes are highly correlated. This is known as MULTICOLLINEARITY. It is not ideal for Naive Bayes analysis.

In [None]:
# Confusion Matrix
# y_pred above was last produced by model50; swap in the name of the model that produced your y_pred
cm = confusion_matrix(y_test, y_pred, labels=model50.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=model50.classes_).plot()

In [None]:
# Classification Report
print(classification_report(y_test, y_pred, labels=['southwest', 'southeast', 'northwest', 'northeast']))

Compare the reports between kNN and Naive Bayes--you are looking for greater accuracy in classification outcomes.