In [None]:
NAME:- SHRIDATTA SHEKHAR BHASME
ROLL NO:- RBTL22CB072
SUBJECT:- MACHINE LEARNING
DATASET:- WINE DATASET

Aim:
The aim of this study is to investigate the K-Nearest Neighbors (KNN) and Decision Tree algorithms for classification tasks, both individually and in combination. The goal is to leverage the strengths of both algorithms and to explore how integrating them can improve predictive accuracy, interpretability, and robustness across diverse datasets.

Objectives:
Implement the KNN and Decision Tree algorithms to exploit the local and global patterns in the data.
Explore different strategies for combining predictions, such as ensembling or sequential execution.
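
One possible way to combine the two models by ensembling is a soft-voting classifier. The sketch below is illustrative only: it assumes scikit-learn's built-in Wine data (load_wine) as a stand-in for 'wine.csv', and the names knn, tree and ensemble as well as the random_state values are choices made for this example, not part of the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's built-in Wine data stands in for 'wine.csv'.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# KNN is distance-based, so it gets its own scaler; the tree does not need one.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
tree = DecisionTreeClassifier(random_state=42)

# Soft voting averages the class probabilities predicted by the two models.
ensemble = VotingClassifier(
    estimators=[("knn", knn), ("tree", tree)],
    voting="soft",
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble accuracy: {ensemble_accuracy:.3f}")
```

Soft voting is only one option; hard (majority) voting or stacking would be equally reasonable starting points.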

Problem Statement:

In many machine learning applications, the need arises to classify or predict outcomes based on existing data. Selecting appropriate algorithms for these tasks is crucial for achieving accurate and reliable results. This study addresses the problem of algorithm selection for classification tasks by implementing and comparing the k-Nearest Neighbors (KNN) and Decision Tree algorithms. The objective is to understand the performance characteristics of these algorithms and identify scenarios where one may outperform the other.

Theory:

k-Nearest Neighbors (KNN):

Introduction: KNN is a non-parametric, lazy learning algorithm used for classification and regression tasks. In classification, it assigns a data point to the class most common among its k nearest neighbors.
Working Principle:
Given a dataset with labeled instances, the algorithm classifies new instances based on the majority class of their k nearest neighbors.
The distance metric (usually Euclidean distance) determines the similarity between instances.
The choice of k influences the algorithm's sensitivity to local patterns and noise.
Advantages:
Simple and intuitive.
Effective for small to moderately sized datasets.
Challenges:
Computationally expensive for large datasets.
Sensitive to irrelevant or redundant features.
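
The working principle above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later; the function name knn_predict and the toy data are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # A
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # B
```

Because every prediction scans all training points, the cost grows with the size of the dataset, which is the computational challenge listed above.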

Decision Tree:
Introduction: A Decision Tree is a hierarchical, tree-like structure in which each internal node represents a decision based on a feature, and each leaf node represents the outcome (the predicted class).
Working Principle:
The algorithm recursively splits the dataset into subsets, choosing at each node the feature and threshold that most reduce class impurity (for example, Gini impurity or entropy).
The decision-making process is interpretable, making it suitable for rule-based classification.
Advantages:
Intuitive and easy to interpret.
Handles both numerical and categorical data.
Implicitly performs feature selection.
Challenges:
Prone to overfitting, especially with deep trees.
Sensitive to noisy data and outliers.
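
To make the splitting step concrete, the sketch below scores candidate thresholds on a single feature by weighted Gini impurity and keeps the best one. It is a simplified illustration of one node of a tree, not the scikit-learn algorithm; the helper names gini and best_threshold and the toy values are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy feature column and class labels (made up for illustration).
alcohol = [12.0, 12.5, 13.0, 13.8, 14.0, 14.2]
wine_class = [2, 2, 2, 1, 1, 1]

threshold, impurity = best_threshold(alcohol, wine_class)
print(threshold, impurity)
```

On this toy column the chosen threshold falls between 13.0 and 13.8, where both sides become pure. A full tree repeats the same search over every feature and then recurses into each subset.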

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

In [2]:
data = pd.read_csv('wine.csv')
data.head(n=5)

Unnamed: 0,Wine,Alcohol,Malic.acid,Ash,Acl,Mg,Phenols,Flavanoids,Nonflavanoid.phenols,Proanth,Color.int,Hue,OD,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [3]:
from sklearn import preprocessing

label_encoders = {}
columns_to_encode = ['Wine', 'Alcohol', 'Malic.acid', 'Ash', 'Acl', 'Mg', 'Phenols', 'Flavanoids',
                      'Nonflavanoid.phenols', 'Proanth', 'Color.int', 'Hue', 'OD', 'Proline']

# Note: LabelEncoder is meant for class labels. Applied to a continuous
# feature column, it replaces each value with its rank among the column's
# unique values, so only the ordering of the measurements is kept.
for col in columns_to_encode:
    label_encoders[col] = preprocessing.LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])
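
To see what the encoding loop does to a continuous column, the short sketch below applies LabelEncoder to the first three 'Alcohol' values shown in the data.head() output. The variable names are chosen for this illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# First three 'Alcohol' values from the data.head() output above.
alcohol = [14.23, 13.2, 13.16]

encoder = LabelEncoder()
encoded = encoder.fit_transform(alcohol)

# Each value is replaced by its position among the sorted unique values,
# so the spacing between measurements is lost and only the order remains.
print(encoder.classes_.tolist())  # [13.16, 13.2, 14.23]
print(encoded.tolist())           # [2, 1, 0]
```

This rank transform mostly leaves threshold-based tree splits alone, but it changes the distances that KNN relies on, so encoding only the 'Wine' target may be the safer choice.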

In [4]:
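# Hold out 25% of the rows (train_test_split's default test_size) for testing.
train, test = train_test_split(data, random_state=42)

# Features: every column after the saved row index ('Unnamed: 0') and the target ('Wine').
feature_columns = data.columns[2:]
x_train = train[feature_columns]
y_train = train['Wine']
x_test = test[feature_columns]
y_test = test['Wine']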

In [5]:
# Fit the scaler on the training features only, then apply it to both splits.
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [6]:
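# Default hyperparameters; no random_state is set, so the fitted tree
# (and its accuracy) can vary slightly between runs.
clf = DecisionTreeClassifier()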

In [7]:
clf.fit(x_train, y_train)

In [8]:
y_pred = clf.predict(x_test)

In [9]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9555555555555556
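
Accuracy alone does not show which wine classes are confused with each other. A confusion matrix and a per-class report could fill that gap, as in the sketch below. It assumes scikit-learn's built-in Wine data (load_wine) in place of 'wine.csv' and fixes random_state for repeatability, so its numbers will not necessarily match the accuracy printed above.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's built-in Wine data stands in for 'wine.csv'.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1-score.
print(classification_report(y_test, y_pred))
```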


# KNN model

In [10]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)

In [11]:
y_pred = knn.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9111111111111111
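
A single train/test split leaves the comparison between the two models sensitive to which rows land in the test set. Cross-validation is one way to get a steadier estimate. The sketch below assumes scikit-learn's built-in Wine data (load_wine) instead of 'wine.csv'; the dictionary names and the choice of 5 folds are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's built-in Wine data stands in for 'wine.csv'.
X, y = load_wine(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # The pipeline refits the scaler inside each fold, so no test data leaks in.
    "KNN (k=3)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation; for classifiers the folds are stratified.
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```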


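Conclusion:
In conclusion, on the Wine dataset the Decision Tree classifier reached a test accuracy of about 0.956, while KNN with k = 3 reached about 0.911 on the same held-out split. Both results come from a single train/test split without hyperparameter tuning, so the gap between the two models should be treated as indicative rather than definitive. KNN relies on local neighbourhoods, whereas the Decision Tree learns global, rule-based partitions of the feature space, which is what makes combining them attractive for handling complex relationships. Combining their predictions and tuning hyperparameters such as k and the tree depth are the natural next steps toward the stated aim.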
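
Hyperparameter tuning for both models could be sketched with a cross-validated grid search, as below. This is an illustration under stated assumptions: scikit-learn's built-in Wine data (load_wine) replaces 'wine.csv', and the search ranges for k and max_depth are hypothetical values picked for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's built-in Wine data stands in for 'wine.csv'.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Each entry pairs an estimator with a hypothetical search grid.
searches = {
    "knn": (
        Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]),
        {"knn__n_neighbors": [1, 3, 5, 7, 9]},
    ),
    "tree": (
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [2, 3, 4, 5, None]},
    ),
}

best_params = {}
for name, (estimator, grid) in searches.items():
    # 5-fold cross-validation on the training split only.
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    best_params[name] = search.best_params_
    print(f"{name}: best params={search.best_params_}, "
          f"cv accuracy={search.best_score_:.3f}, "
          f"test accuracy={search.score(X_test, y_test):.3f}")
```

The test split is touched only once per model, after the search has finished, so the reported test accuracy is not used to choose the hyperparameters.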