NAME: SHRIDATTA SHEKHAR BHASME
ROLL NO: RBTL22CB072
SUBJECT: MACHINE LEARNING
DATASET: MONKEYPOX DATASET


Aim:
The aim of this study is to compare five supervised machine learning algorithms, namely Support Vector Machine (SVM), Naive Bayes, k-Nearest Neighbors (KNN), Decision Tree, and Artificial Neural Network (ANN) models, using the Monkeypox dataset as the test case. The goal is to provide insight into the strengths, weaknesses, and applicability of each algorithm.

Objectives:
Performance Evaluation:
Evaluate and compare the predictive performance of SVM, Naive Bayes, KNN, Decision Tree, and ANN models on diverse datasets.
Assess the algorithms in terms of accuracy, precision, recall, and F1 score to understand their classification capabilities.
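
As a minimal sketch of how the four metrics named above relate to each other, the following computes them from true and predicted labels for a single positive class. The function name `binary_metrics` and the label lists are hypothetical and made up for illustration; they are not taken from the Monkeypox data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true_demo, y_pred_demo)
print(scores)
```

For a multi-class target such as the one used in this notebook, the same counts would be computed per class and then averaged.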

Robustness Analysis:
Investigate the robustness of each algorithm by assessing its performance under varying conditions, including noisy data and imbalanced datasets.
Analyze the sensitivity of the models to changes in input data and assess their ability to generalize to unseen samples.
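
One simple check related to imbalanced datasets is to compare a model's accuracy with the accuracy of always predicting the most frequent class. The sketch below assumes hypothetical label lists; `majority_baseline_accuracy` is an illustrative helper, not part of any library.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority_label)
    return majority_label, hits / len(y_test)


# Hypothetical, heavily imbalanced labels (class 0 dominates).
y_train_demo = [0] * 90 + [1] * 7 + [2] * 3
y_test_demo = [0] * 18 + [1] * 1 + [2] * 1
label, baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(label, baseline)
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies, which is why precision, recall, and F1 are worth reporting alongside accuracy.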

Computational Efficiency:
Compare the computational efficiency of the algorithms by examining their training and prediction times.
Evaluate the scalability of each algorithm concerning the size of the dataset.
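
Training and prediction times can be measured with a small wrapper around `time.perf_counter`. The sketch below is an assumption about how such a measurement might be set up; `timed` and `MeanThresholdModel` are hypothetical stand-ins used only so the example runs on its own.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class MeanThresholdModel:
    """Toy stand-in for a classifier: predicts 1 above the training mean."""

    def fit(self, xs, ys):
        self.threshold = sum(xs) / len(xs)
        return self

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


# Hypothetical one-feature data set.
xs = [float(i) for i in range(10000)]
ys = [1 if x > 4999.5 else 0 for x in xs]

model = MeanThresholdModel()
_, fit_seconds = timed(model.fit, xs, ys)
preds, predict_seconds = timed(model.predict, xs)
print(f"fit: {fit_seconds:.6f}s, predict: {predict_seconds:.6f}s")
```

The same helper could wrap any estimator with `fit` and `predict` methods, for example `timed(clf.fit, x_train, y_train)`, and repeating the measurement on growing subsets of the data gives a rough picture of scalability.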

Theory:

Support Vector Machines (SVM):
SVM is a supervised learning algorithm that can be used for classification or regression tasks.
It works by finding the hyperplane that separates the classes in feature space with the largest margin.
SVM is effective in high-dimensional spaces and, with a non-linear kernel, can also handle data that is not linearly separable.
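
To illustrate how data that is not linearly separable can become separable in a richer feature space, the sketch below maps the XOR pattern into three dimensions and applies a linear decision rule there. The mapping, the weights `w`, and the bias `b` are hand-picked for illustration; this is not a trained SVM, and a kernel SVM performs such a mapping implicitly rather than explicitly.

```python
def feature_map(x):
    """Map a 2-D point to 3-D by appending the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def decision(z, w, b):
    """Linear decision rule sign(w . z + b) in the mapped space."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1


# XOR pattern: no straight line in 2-D separates the two classes.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]

# Hand-picked hyperplane in the 3-D mapped space (an assumption, not learned).
w, b = (1.0, 1.0, -2.0), -0.5
svm_demo_preds = [decision(feature_map(p), w, b) for p in points]
print(svm_demo_preds)  # [-1, 1, 1, -1]
```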

Naive Bayes:
Naive Bayes is a probabilistic algorithm based on Bayes' theorem.
It assumes that features are independent given the class label, making it computationally efficient and easy to implement.
Naive Bayes is often used for text classification and spam filtering.
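
The independence assumption means the class score is a prior multiplied by one likelihood term per feature. A minimal from-scratch sketch of the Gaussian variant follows; the function names, the `var_floor` smoothing value, and the toy data are assumptions made for illustration and do not reproduce any library's implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class prior, feature means and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids a zero variance when a feature is constant in a class.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Return the class with the largest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Features are treated as independent, so log densities simply add.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-cluster data.
X_demo = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y_demo = [0, 0, 0, 1, 1, 1]
nb_params = fit_gaussian_nb(X_demo, y_demo)
nb_preds = [predict_gaussian_nb(nb_params, p) for p in [(1.1, 2.1), (5.1, 5.9)]]
print(nb_preds)
```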

k-Nearest Neighbors (KNN):
KNN is a non-parametric and lazy learning algorithm used for both classification and regression tasks.
It makes predictions based on the majority class or average value of the k nearest data points in feature space.
KNN is simple to understand, but prediction can be slow on large datasets because each query is compared against the stored training points.
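
The majority-vote rule can be sketched in a few lines. The version below assumes Euclidean distance and a brute-force search over all training points; `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Brute force: compute the distance from the query to every training point.
    by_distance = sorted(
        zip(X_train, y_train),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster data.
knn_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
knn_y = ["a", "a", "a", "b", "b", "b"]
knn_preds = [knn_predict(knn_X, knn_y, q) for q in [(0.5, 0.5), (5.5, 5.5)]]
print(knn_preds)  # ['a', 'b']
```

Because every prediction sorts all training points, the cost per query grows with the size of the training set, which is the inefficiency mentioned above.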

Decision Trees:
Decision Trees are a popular algorithm for both classification and regression tasks.
They recursively split the dataset based on features to create a tree-like structure.
Decision Trees are easy to interpret but may be prone to overfitting.
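
A single split of the recursive procedure can be sketched as a search for the threshold that minimizes the weighted Gini impurity of the two resulting groups. The helper names and the one-feature toy data below are assumptions for illustration; a full tree would repeat this search over all features and recurse into each group.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted_gini)."""
    uniq = sorted(set(values))
    best = (None, gini(labels))  # no split at all is the starting point
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical single feature with two well-separated classes.
feature_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
feature_labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(feature_values, feature_labels)
print(threshold, impurity)  # 6.5 0.0
```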

Artificial Neural Networks (ANN):
ANN is a machine learning model inspired by the human brain's neural network structure.
It consists of interconnected nodes organized in layers and is capable of learning complex patterns.
ANN is highly flexible but may require more data and computational resources.
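
The layered structure can be illustrated with the forward pass of a tiny network: one hidden layer of two ReLU units followed by a linear output. The weights below are hand-picked so that the network reproduces XOR; they are an assumption for illustration and are not learned, whereas a real ANN would fit them by training.

```python
def relu(v):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, v)


def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer with ReLU activations, followed by a linear output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


# Hand-picked weights (hypothetical) that make the network compute XOR.
hidden_weights = [(1.0, 1.0), (1.0, 1.0)]
hidden_biases = [0.0, -1.0]
out_weights = (1.0, -2.0)
out_bias = 0.0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [forward(x, hidden_weights, hidden_biases, out_weights, out_bias)
           for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

The hidden layer is what lets the network represent a pattern that no single linear unit can, which is one sense in which ANNs learn complex patterns.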

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB

In [3]:
data=pd.read_csv(r"monkeypox.csv")
data.head(n=5)

Unnamed: 0,Country.Iso code,Country.Full,Date.Full,Date.Year,Date.Month,Date.Day,Data.Cases.New,Data.Cases.Total,Data.Cases.New per million,Data.Cases.Total per million,Data.Deaths.New,Data.Deaths.Total,Data.Deaths.New per million,Data.Deaths.Total per million
0,AND,Andorra,2022-07-25,2022,7,25,1,1,12.653,12.653,0,0,0.0,0.0
1,AND,Andorra,2022-07-26,2022,7,26,2,3,25.306,37.958,0,0,0.0,0.0
2,AND,Andorra,2022-07-27,2022,7,27,0,3,0.0,37.958,0,0,0.0,0.0
3,AND,Andorra,2022-07-28,2022,7,28,0,3,0.0,37.958,0,0,0.0,0.0
4,AND,Andorra,2022-07-29,2022,7,29,0,3,0.0,37.958,0,0,0.0,0.0


In [4]:
from sklearn import preprocessing

label_encoders = {}
columns_to_encode = ['Country.Iso code', 'Country.Full', 'Date.Full', 'Date.Year', 'Date.Month', 'Date.Day', 'Data.Cases.New', 'Data.Cases.Total', 'Data.Cases.New per million', 'Data.Cases.Total per million', 'Data.Deaths.New', 'Data.Deaths.Total', 'Data.Deaths.New per million', 'Data.Deaths.Total per million']

# 'data' is the Monkeypox dataset loaded above; each listed column is replaced by integer codes
for col in columns_to_encode:
    label_encoders[col] = preprocessing.LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])


In [6]:
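# Split into training and test sets (default 75% / 25%) with a fixed seed
train, test=train_test_split(data,random_state=42)
# Note: columns[2:30] runs from 'Country.Full' to the last column, so it also
# contains the target 'Data.Cases.New'; the accuracies reported below reflect that.
x_train=train[train.columns[2:30]]
y_train =train['Data.Cases.New']
x_test=test[test.columns[2:30]]
y_test =test['Data.Cases.New']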
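
The slice `columns[2:30]` used above covers every column from index 2 onward, which includes the target `Data.Cases.New` at index 7. The sketch below shows this using only the column names printed by `data.head()`, and builds a feature list that leaves the target out. The variable names `feature_cols` and `sliced` are hypothetical, and whether other case-count columns should also be excluded is a separate modelling decision not addressed here.

```python
# Column names copied from the data.head() output in this notebook.
columns = [
    "Unnamed: 0", "Country.Iso code", "Country.Full", "Date.Full",
    "Date.Year", "Date.Month", "Date.Day", "Data.Cases.New",
    "Data.Cases.Total", "Data.Cases.New per million",
    "Data.Cases.Total per million", "Data.Deaths.New", "Data.Deaths.Total",
    "Data.Deaths.New per million", "Data.Deaths.Total per million",
]
target = "Data.Cases.New"

# Same slice as in the cell above; slicing past the end is allowed in Python.
sliced = columns[2:30]
target_in_features = target in sliced

# Feature list without the target column.
feature_cols = [c for c in sliced if c != target]
print(target_in_features, len(sliced), len(feature_cols))
```

With the DataFrame from this notebook, such a list could be used as `x_train = train[feature_cols]` and `x_test = test[feature_cols]`; doing so would change the reported accuracies.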

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train=scaler.transform(x_train)
x_test=scaler.transform(x_test)

In [8]:
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9543907420013614


In [9]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7610619469026548


In [10]:
from sklearn.neural_network import MLPClassifier
MLP =MLPClassifier(hidden_layer_sizes=(10,10,10), max_iter=1000)
MLP.fit(x_train, y_train.values.ravel())
predictions=MLP.predict(x_test)
print("Accuracy: ",accuracy_score(y_test,predictions))

Accuracy:  0.8509189925119128


In [11]:
model=GaussianNB()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print("Accuracy: ",metrics.accuracy_score(y_test,y_pred))

Accuracy:  0.9482641252552757


In [12]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
svm_model=SVC(kernel="linear")
svm_model.fit(x_train,y_train)
y_pred=svm_model.predict(x_test)
print("Accuracy: ",accuracy_score(y_test,y_pred))
print("\n Confusion matrix :\n ",confusion_matrix(y_test,y_pred))


Accuracy:  0.7773995915588836

 Confusion matrix :
  [[1059    0    0 ...    0    0    0]
 [  14   81    0 ...    0    0    0]
 [   3   44    0 ...    0    0    0]
 ...
 [   0    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]]


Conclusion:

In conclusion, this comparative analysis evaluated SVM, Naive Bayes, KNN, Decision Tree, and ANN models on the Monkeypox dataset using test-set accuracy. The Decision Tree scored highest (about 0.954), followed by Gaussian Naive Bayes (0.948), the ANN (0.851), the linear-kernel SVM (0.777), and KNN with k = 3 (0.761). These findings can guide practitioners in selecting an algorithm based on the characteristics of their datasets and the requirements of their applications. Robustness, computational efficiency, interpretability, and hyperparameter sensitivity also matter when choosing a model, and understanding these trade-offs alongside accuracy is crucial for making informed decisions in supervised machine learning.