# Apply Naive Bayes and make predictions using the given dataset: https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv

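**Bayes’ Theorem** provides a way to calculate the probability that a piece of data belongs to a given class, given our prior knowledge.  
Bayes’ Theorem is stated as:  
**P(class|data) = (P(data|class) * P(class)) / P(data)**,  
where P(class|data) is the probability of the class given the provided data, P(data|class) is the likelihood of the data under that class, P(class) is the prior probability of the class, and P(data) is the overall probability of the data.  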

**Naive Bayes** is a classification algorithm for binary and multiclass classification problems. It is called Naive Bayes (or idiot Bayes) because the probability calculations for each class are simplified to make them tractable. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as "Naive".

Rather than attempting to model how the attribute values depend on one another, the attributes are assumed to be conditionally independent given the class value. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This is a very strong assumption, namely that the attributes do not interact, and it rarely holds in real data. Nevertheless, the approach often performs surprisingly well even on data where the assumption does not hold.
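
Written out for a row with features x_1, ..., x_n, the conditional-independence assumption turns Bayes’ Theorem into a product of per-feature terms. The symbols x_i and n are introduced here only for illustration:

```latex
P(\text{class} \mid x_1, \dots, x_n)
  = \frac{P(\text{class}) \prod_{i=1}^{n} P(x_i \mid \text{class})}{P(x_1, \dots, x_n)}
  \propto P(\text{class}) \prod_{i=1}^{n} P(x_i \mid \text{class})
```

Because the denominator is the same for every class, comparing the numerators is enough to pick the most probable class.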

**How does the Naive Bayes algorithm work?**

**Step 1:** Convert the data set into a frequency table.  
**Step 2:** Create a likelihood table by finding the probabilities.  
**Step 3:** Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the predicted outcome.  
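
The three steps can be sketched on a single categorical feature. The weather/play pairs below are made-up toy data (not part of the Iris dataset), and all variable names are chosen only for this illustration:

```python
from collections import Counter, defaultdict

# Made-up toy data (not from the Iris dataset): (weather, play) pairs.
data = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("overcast", "yes"),
]

# Step 1: frequency table, counts of each feature value per class.
class_counts = Counter(label for _, label in data)
freq = defaultdict(Counter)
for value, label in data:
    freq[label][value] += 1

# Step 2: priors P(class) and likelihood table P(value | class).
priors = {c: n / len(data) for c, n in class_counts.items()}
likelihood = {
    c: {v: freq[c][v] / class_counts[c] for v in ("sunny", "rainy", "overcast")}
    for c in class_counts
}

def posterior(value):
    """Step 3: P(class | value) for every class via Bayes' Theorem."""
    joint = {c: likelihood[c][value] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(value)
    return {c: p / evidence for c, p in joint.items()}

post = posterior("sunny")
prediction = max(post, key=post.get)
print(post, prediction)
```

For "sunny", the posterior works out to 1/3 for "yes" and 2/3 for "no", so the sketch predicts "no".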

**Dataset**                
The Iris Flower Dataset involves predicting the flower species given measurements of iris flowers. The dataset can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. The iris.data file has been converted to a .csv file, which is used in the classification tasks below.  
It is a multiclass classification problem. The classes are balanced, with 50 observations each. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

- Sepal length in cm.
- Sepal width in cm.
- Petal length in cm.
- Petal width in cm.
- Class

In [1]:
# importing required libraries
import pandas as pd

In [2]:
# read the dataset from the given link
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
headernames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
df = pd.read_csv(path, names = headernames)

In [3]:
df

index,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [4]:
# Encoding categorical data using Label Encoder
from sklearn import preprocessing
lableEnc = preprocessing.LabelEncoder()
df['species'] = lableEnc.fit_transform(df['species'])
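
As a rough sketch of what the encoding step produces: `LabelEncoder` assigns integer codes to the sorted unique labels, so the three species map to 0, 1 and 2 in alphabetical order. The pure-Python helper `encode_labels` below is a hypothetical stand-in for illustration, not scikit-learn's implementation.

```python
def encode_labels(labels):
    """Map each label to its index in the sorted list of unique labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
classes, codes = encode_labels(species)
print(classes)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(codes)    # [0, 2, 1, 0]
```

With the fitted encoder from the cell above, `lableEnc.inverse_transform` should recover the original species names from the integer codes.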

In [5]:
df

index,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [6]:
# Separating the Dependent and Independent variables
X = df.iloc[:, 0:4]
Y = df.loc[:, 'species']

In [7]:
# Partitioning the dataframe into training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.25, random_state = 1)

In [8]:
# Shape of training set
X_train.shape

(112, 4)

In [9]:
# Shape of training set
Y_train.shape

(112,)

In [10]:
# Shape of testing set
X_test.shape

(38, 4)

In [11]:
# Shape of testing set
Y_test.shape

(38,)
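
The shapes above follow from the `test_size` fraction. As a sketch, assuming `train_test_split` rounds the test count up and gives the remainder to training when only a fractional `test_size` is passed (the helper name `split_sizes` is made up for this example):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(150, 0.25))  # (112, 38)
```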

In [12]:
# Applying GaussianNB from sklearn libraries
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [13]:
# Fit the model with the training data
model.fit(X_train,Y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
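
To make the fitting step less of a black box, here is a minimal pure-Python sketch of the idea behind Gaussian Naive Bayes: estimate a per-class mean and variance for each feature, then score a new row by adding the log prior to the per-feature Gaussian log-likelihoods. The function names and the fixed `eps` variance floor are assumptions for illustration (scikit-learn's `var_smoothing` is computed differently), and the eight training rows are the ones visible in the table output earlier.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Return {class: (log_prior, means, variances)}; eps avoids zero variance."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    def score(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: score(model[label]))

X = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1, 1.5, 0.2],
     [6.7, 3.0, 5.2, 2.3], [6.3, 2.5, 5.0, 1.9], [6.5, 3.0, 5.2, 2.0], [6.2, 3.4, 5.4, 2.3]]
y = [0, 0, 0, 0, 2, 2, 2, 2]  # 0 = Iris-setosa, 2 = Iris-virginica
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [5.0, 3.6, 1.4, 0.2]))  # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.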

In [14]:
# Predict the target on the train dataset
from sklearn.metrics import accuracy_score
train_predict = model.predict(X_train)
print('Target on train data :\n', train_predict)   

Target on train data :
 [1 2 2 0 1 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 1
 1 2 1 2 1 0 0 0 2 0 2 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 2 0 1 0 0 2 2 2
 0 0 2 0 2 0 2 1 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 1 2 1 2
 0]


In [15]:
# Accuracy Score on train dataset
train_accuracy = accuracy_score(Y_train, train_predict)
print('Accuracy score on train dataset : ', train_accuracy*100,'%')

Accuracy score on train dataset :  94.64285714285714 %


In [16]:
# Predict the target on the test dataset
test_predict = model.predict(X_test)
print('Target on test data :\n', test_predict) 

Target on test data :
 [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 2 0 2 1 0 0 1 2 1 2 1 2 2 0 1
 0]


In [17]:
# Accuracy Score on test dataset
test_accuracy = accuracy_score(Y_test, test_predict)
print('Accuracy score on test dataset : ', test_accuracy*100,'%')

Accuracy score on test dataset :  97.36842105263158 %


In [18]:
# Printing the confusion matrix for the above classification over test data
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test,test_predict)

array([[13,  0,  0],
       [ 0, 15,  1],
       [ 0,  0,  9]], dtype=int64)
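
The test accuracy reported earlier can be read straight off this matrix, since the correct predictions sit on the diagonal. A small sketch, with the matrix values copied from the output above:

```python
cm = [[13, 0, 0],
      [0, 15, 1],
      [0, 0, 9]]  # rows = true class, columns = predicted class

correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total
print(correct, total, round(accuracy * 100, 2))  # 37 38 97.37
```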

In [19]:
# Finding the multilabel confusion matrix (one 2x2 matrix per class) for the above classification over test data
from sklearn import metrics
metrics.multilabel_confusion_matrix(Y_test,test_predict)

array([[[25,  0],
        [ 0, 13]],

       [[22,  0],
        [ 1, 15]],

       [[28,  1],
        [ 0,  9]]], dtype=int64)
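
Each 2x2 block above is a one-vs-rest view of a single class, laid out as `[[TN, FP], [FN, TP]]`. As a sketch, the blocks can be rebuilt from the 3x3 confusion matrix printed earlier (the helper name `one_vs_rest` is made up for this example):

```python
def one_vs_rest(cm):
    """Return one [[TN, FP], [FN, TP]] block per class from a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    blocks = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                  # true class k, predicted as something else
        fp = sum(row[k] for row in cm) - tp   # predicted k, true class is something else
        tn = total - tp - fn - fp
        blocks.append([[tn, fp], [fn, tp]])
    return blocks

cm = [[13, 0, 0], [0, 15, 1], [0, 0, 9]]  # rows = true class, columns = predicted class
for block in one_vs_rest(cm):
    print(block)
```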

In [20]:
# Classification report consisting of precision, recall, f1-score and support for classification over test data
from sklearn.metrics import classification_report
print(classification_report(Y_test,test_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.94      0.97        16
           2       0.90      1.00      0.95         9

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38
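
The per-class numbers in the report can be reproduced by hand from the confusion matrix. A sketch, assuming the usual definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean:

```python
cm = [[13, 0, 0], [0, 15, 1], [0, 0, 9]]  # rows = true class, columns = predicted class

report = {}
for k in range(len(cm)):
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # TP + FP
    actual_k = sum(cm[k])                    # TP + FN, the "support"
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[k] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)

for k, row in report.items():
    print(k, row)
```

Each printed tuple is (precision, recall, f1-score, support) for one class and should line up with the corresponding row of the report above.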



# Apply KNN and make predictions using the dataset given below: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

**KNN (K-nearest neighbours)** is a supervised ML algorithm that can be used for both classification and regression problems.  
The following two properties characterize KNN well:

- **Lazy learning algorithm:** KNN is a lazy learning algorithm because it does not have a specialized training phase; it stores the training data and uses all of it at classification time.

- **Non-parametric learning algorithm:** KNN is also a non-parametric learning algorithm because it does not assume anything about the underlying data distribution.

**Working of KNN Algorithm:**                      
 
**Step 1:** Load the training as well as test data.                            

**Step 2:** Choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.

**Step 3:** For each point in the test data do the following:                

**3.1** Calculate the distance between the test point and each row of the training data using one of the following methods: Euclidean, Manhattan or Hamming distance. The most commonly used is Euclidean distance.

**3.2** Sort the training rows in ascending order of distance.

**3.3** Choose the top K rows from the sorted array.

**3.4** Assign a class to the test point based on the most frequent class among these rows.

**Step 4:** End               
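
The steps above can be sketched in a few lines of pure Python. This is a minimal illustration using Euclidean distance and a handful of Iris rows, not the scikit-learn implementation used in the cells below; the function name `knn_predict` is made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training rows."""
    # 3.1: Euclidean distance from the query to every training row.
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    # 3.2 and 3.3: sort ascending and keep the k closest rows.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3.4: most frequent class among those rows.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_X = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
           [6.7, 3.0, 5.2, 2.3], [6.3, 2.5, 5.0, 1.9], [6.5, 3.0, 5.2, 2.0]]
train_y = ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3
print(knn_predict(train_X, train_y, [6.2, 3.4, 5.4, 2.3], k=3))  # Iris-virginica
```

Note that nothing is "trained" here: the function simply keeps the training rows around and does all of its work at prediction time, which is what makes KNN a lazy learner.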

In [21]:
# read the dataset from the given link
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Making a list of header names
headernames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
# Converting .csv file to dataframe using Pandas
dataset = pd.read_csv(path, names = headernames)

In [22]:
dataset

index,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [23]:
# Encoding categorical data using Label Encoder
from sklearn import preprocessing
lableEnc = preprocessing.LabelEncoder()
dataset['species'] = lableEnc.fit_transform(dataset['species'])

In [24]:
dataset

index,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [25]:
# Separating the Dependent and Independent variables
# (equivalent NumPy-array form: X = dataset.iloc[:, :-1].values; Y = dataset.iloc[:, 4].values)
X = dataset.iloc[:, 0:4]
Y = dataset.loc[:, 'species']

In [26]:
# Partitioning the dataframe into training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.4, random_state = 1)

In [27]:
X_train.shape

(90, 4)

In [28]:
X_test.shape

(60, 4)

In [29]:
# Applying KNN from sklearn libraries
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=2,
                     weights='uniform')
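
The printed parameters `metric='minkowski'` and `p=2` amount to Euclidean distance; `p=1` would give Manhattan distance. A small sketch of the Minkowski formula (the function name is made up for this example):

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 2))  # 5.0 (Euclidean)
print(minkowski(a, b, 1))  # 7.0 (Manhattan)
```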

In [30]:
# Predict the target on the training dataset
train_predict_knn = classifier.predict(X_train)
print('Target on train data :\n', train_predict_knn)


In [31]:
# Accuracy Score on training dataset
from sklearn.metrics import accuracy_score
train_accuracy_knn = accuracy_score(Y_train, train_predict_knn)
print('Accuracy score on train dataset : ', train_accuracy_knn*100,'%')

Accuracy score on train dataset :  96.66666666666667 %


In [32]:
# Predict the target on the test dataset
Y_pred = classifier.predict(X_test)
print('Target on test data :\n', Y_pred)


In [33]:
# Accuracy Score on test dataset
test_accuracy_knn = accuracy_score(Y_test, Y_pred)
print('Accuracy score on test dataset : ', test_accuracy_knn*100,'%')

Accuracy score on test dataset :  98.33333333333333 %


In [34]:
# Printing the confusion matrix for the above classification using KNN over test data
from sklearn.metrics import classification_report, confusion_matrix
result = confusion_matrix(Y_test, Y_pred)
print("Confusion Matrix:")
print(result)

Confusion Matrix:
[[19  0  0]
 [ 0 20  1]
 [ 0  0 20]]


In [35]:
# Finding the multilabel confusion matrix (one 2x2 matrix per class) for the above classification over test data
from sklearn import metrics
metrics.multilabel_confusion_matrix(Y_test,Y_pred)

array([[[41,  0],
        [ 0, 19]],

       [[39,  0],
        [ 1, 20]],

       [[39,  1],
        [ 0, 20]]], dtype=int64)

In [36]:
# Classification report consisting of precision, recall, f1-score and support for classification over test data
print("Classification Report:")
print(classification_report(Y_test, Y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.95      0.98        21
           2       0.95      1.00      0.98        20

    accuracy                           0.98        60
   macro avg       0.98      0.98      0.98        60
weighted avg       0.98      0.98      0.98        60


