# **ML Task 5 - Classification**
**Objective**:
The objective of this assessment is to evaluate your understanding of supervised learning techniques and your ability to apply them to a real-world dataset.

**Dataset**:
Use the breast cancer dataset available in the sklearn library.

**Key components to be fulfilled**:

**1.** Loading and Preprocessing
   - Load the breast cancer dataset from sklearn.
   - Preprocess the data to handle any missing values and perform necessary feature scaling.
   - Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

**2.** Classification Algorithm Implementation
   - Implement the following five classification algorithms:
     1. Logistic Regression
     2. Decision Tree Classifier
     3. Random Forest Classifier
     4. Support Vector Machine (SVM)
     5. k-Nearest Neighbors (k-NN)
   - For each algorithm, provide a brief description of how it works and why it might be suitable for this dataset.

**3.** Model Comparison
   - Compare the performance of the five classification algorithms.
   - Which algorithm performed the best and which one performed the worst?

# ***1. Loading and Preprocessing***

In [1]:
#Load the Dataset: First, load the breast cancer dataset from sklearn.
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


In [2]:
#Check for Missing Values: Ensure there are no missing values in the dataset.
print(df.isnull().sum())


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
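
Every column reports zero missing values, so no imputation is needed. Scaling is still worthwhile, and a quick look at the raw feature ranges suggests why: the features are measured on very different scales, which matters for distance-based and gradient-based models such as k-NN, SVM, and Logistic Regression. The sketch below is illustrative only; `ranges` and `class_counts` are names introduced here, not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Compare the spread of two features measured in very different units
ranges = df[['mean area', 'mean smoothness']].agg(['min', 'max'])
print(ranges)

# Class balance: 0 = malignant, 1 = benign
class_counts = pd.Series(data.target).value_counts().sort_index()
print(class_counts)
```

Without scaling, a feature such as `mean area` would likely dominate any Euclidean distance computation simply because of its units.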


In [3]:
#Feature Scaling: Standardize the feature variables to ensure they have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
X = scaler.fit_transform(df.drop(columns=['target']))
y = df['target']
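
The cell above fits the scaler on all rows before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. The metrics reported in this notebook come from that setup. A minimal sketch of an alternative, assuming a `Pipeline` is acceptable here, fits the scaler on the training rows only. The names `X_raw`, `leak_free`, and `test_accuracy`, and the `max_iter=1000` setting, are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then applies the same
# transform to X_te at prediction time.
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
leak_free.fit(X_tr, y_tr)

test_accuracy = leak_free.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

On a dataset this small the difference is usually minor, but the pipeline form keeps the test set strictly unseen during preprocessing.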


# ***2. Classification Algorithm Implementation***

In [4]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression

# Split once here; all five models below are trained on the same X_train / y_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
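
As a brief illustration of how Logistic Regression works: it computes a linear score for each sample and passes it through the sigmoid function to obtain the probability of class 1 (benign). The sketch below refits a model on the full scaled dataset purely for demonstration; `demo_clf`, `scores`, and `manual_proba` are illustrative names, and the check assumes the binary case handled by scikit-learn's `predict_proba`.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

demo_clf = LogisticRegression(random_state=42).fit(X_demo, y_demo)

# Linear score w.x + b for every sample
scores = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
# The sigmoid maps each score to a probability of class 1
manual_proba = expit(scores)

# Compare the hand-computed probabilities with the library's
matches = np.allclose(manual_proba, demo_clf.predict_proba(X_demo)[:, 1])
print(matches)
```

A linear decision boundary tends to suit this dataset because the two classes are largely separable in the scaled feature space.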


In [5]:
#Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
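
A decision tree classifies a sample by following a sequence of threshold tests on individual features. To make that concrete, the sketch below prints the rules of a deliberately shallow tree. The `max_depth=2` limit and the name `small_tree` are assumptions made to keep the printout short; the notebook's `tree_clf` is unrestricted and will be deeper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# max_depth=2 is an illustrative choice, not the notebook's setting
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(data.data, data.target)

# Human-readable if/else rules learned by the tree
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Because each split uses a single feature threshold, trees do not need feature scaling, but an unrestricted tree can overfit a small dataset.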


In [6]:
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
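
A random forest combines many decision trees, each trained on a bootstrap sample with random feature subsets. In scikit-learn the forest's class probabilities are the average of the individual trees' probabilities, which the following sketch checks on a model refit for demonstration (`forest`, `per_tree`, and `averaged` are illustrative names).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Class probabilities from each individual tree, stacked along axis 0
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should equal the per-tree average
same_as_forest = np.allclose(averaged, forest.predict_proba(X_demo))
print(len(forest.estimators_), same_as_forest)
```

Averaging over many trees generally reduces the variance of a single tree, which is one plausible reason the forest scores above the lone decision tree here.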


In [7]:
#Support Vector Machine (SVM)
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)
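
An SVM looks for the boundary with the widest margin between the classes, and that boundary is determined by a subset of the training points called support vectors. With default arguments, `SVC` uses an RBF kernel. The sketch below inspects both facts on a model refit on the full scaled dataset for illustration; `demo_svm` and `n_support_total` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

demo_svm = SVC(random_state=42).fit(X_demo, y_demo)

# Kernel used by default
print(demo_svm.kernel)
# Number of support vectors per class (malignant, benign)
print(demo_svm.n_support_)

n_support_total = int(demo_svm.n_support_.sum())
print(f"{n_support_total} of {len(X_demo)} samples are support vectors")
```

Since the RBF kernel depends on distances between samples, the feature scaling from section 1 matters for this model.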


In [8]:
#k-Nearest Neighbors (k-NN)
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
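
k-NN stores the training set and classifies a new sample by majority vote among its k closest training points (k = 5 by default in `KNeighborsClassifier`). The sketch below reproduces that vote by hand, assuming the same split parameters as the notebook; `demo_knn`, `neighbor_labels`, and `manual_pred` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_knn = KNeighborsClassifier().fit(X_tr, y_tr)  # default n_neighbors=5

# Indices of the 5 nearest training samples for each test sample
neighbor_idx = demo_knn.kneighbors(X_te, return_distance=False)
neighbor_labels = y_tr[neighbor_idx]

# Majority vote: class 1 wins when at least 3 of the 5 neighbours are class 1
manual_pred = (neighbor_labels.sum(axis=1) >= 3).astype(int)

agrees = np.array_equal(manual_pred, demo_knn.predict(X_te))
print(agrees)
```

Because the vote depends entirely on distances, k-NN is sensitive to feature scale and to the choice of k.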


# ***3. Model Comparison***
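**Best Performer**

***Logistic Regression*** and ***SVM*** tied for the best performance on the test set:

- Accuracy: 0.9737
- Precision: 0.9722
- Recall: 0.9859
- F1 Score: 0.9790

**Worst Performer**

***Decision Tree*** and ***k-NN*** tied for the worst performance:

- Accuracy: 0.9474
- Precision: 0.9577
- Recall: 0.9577
- F1 Score: 0.9577

These metrics are slightly lower than those of the other algorithms, indicating that ***the Decision Tree and k-NN are the least effective in this case.*** Random Forest falls in between, with an accuracy of 0.9649 and an F1 score of 0.9722. On the 114-sample test set, the gap between the best and worst accuracy corresponds to three misclassified samples, so the ranking should be read as indicative rather than conclusive.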

In summary:

**Best:** Logistic Regression and SVM (tie)

**Worst:** Decision Tree and k-NN (tie)

In [9]:
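#Model Comparison
# Re-create the split with the same random_state, so X_test and y_test match the data held out during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)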


In [10]:
#Evaluate Models:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    'Logistic Regression': log_reg,
    'Decision Tree': tree_clf,
    'Random Forest': rf_clf,
    'SVM': svm_clf,
    'k-NN': knn_clf
}

# Score every fitted model on the same held-out test set
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(
        f"{name}:\n"
        f" Accuracy: {accuracy_score(y_test, y_pred)}\n"
        f" Precision: {precision_score(y_test, y_pred)}\n"
        f" Recall: {recall_score(y_test, y_pred)}\n"
        f" F1 Score: {f1_score(y_test, y_pred)}\n"
    )


Logistic Regression:
 Accuracy: 0.9736842105263158
 Precision: 0.9722222222222222
 Recall: 0.9859154929577465
 F1 Score: 0.979020979020979

Decision Tree:
 Accuracy: 0.9473684210526315
 Precision: 0.9577464788732394
 Recall: 0.9577464788732394
 F1 Score: 0.9577464788732394

Random Forest:
 Accuracy: 0.9649122807017544
 Precision: 0.958904109589041
 Recall: 0.9859154929577465
 F1 Score: 0.9722222222222222

SVM:
 Accuracy: 0.9736842105263158
 Precision: 0.9722222222222222
 Recall: 0.9859154929577465
 F1 Score: 0.979020979020979

k-NN:
 Accuracy: 0.9473684210526315
 Precision: 0.9577464788732394
 Recall: 0.9577464788732394
 F1 Score: 0.9577464788732394
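
The scores above come from a single 80/20 split, so small differences between models may reflect the particular split as much as the algorithms. One way to get a steadier comparison is stratified k-fold cross-validation. This is a sketch under assumed settings (5 folds, accuracy only, scaling inside a pipeline); `candidates`, `cv`, and `cv_scores` are names introduced here, and the numbers it prints will not match the single-split results exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_raw, y_raw = load_breast_cancer(return_X_y=True)

candidates = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'k-NN': KNeighborsClassifier(),
}

# 5 folds is an illustrative choice; each fold preserves the class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, estimator in candidates.items():
    # Scaling inside the pipeline is refit on each training fold
    pipe = make_pipeline(StandardScaler(), estimator)
    cv_scores[name] = cross_val_score(pipe, X_raw, y_raw, cv=cv, scoring='accuracy')
    print(f"{name}: {cv_scores[name].mean():.4f} +/- {cv_scores[name].std():.4f}")
```

Reporting the mean together with the standard deviation across folds gives a sense of whether a gap between two models is larger than the fold-to-fold variation.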

