# Objective:
The objective of this assessment is to evaluate your understanding and ability to apply supervised learning techniques to a real-world dataset.

# Dataset:
Use the breast cancer dataset available in the sklearn library.


# 1. Loading and Preprocessing
#   - Load the breast cancer dataset from sklearn.
#   - Preprocess the data to handle any missing values and perform necessary feature scaling.
#   - Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [7]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [6]:
# Display the first few rows of the DataFrame
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [8]:
X = data.data
y = data.target


In [9]:
# Feature names
feature_names = data.feature_names

In [10]:
# Check for missing values
missing_values = np.isnan(X).sum()
print(f"Missing values in the dataset: {missing_values}")

Missing values in the dataset: 0


In [11]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

Explanation of Preprocessing Steps

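Missing Values Check: The breast cancer dataset has no missing values, as the check above confirms (0 NaN entries). Checking is still good practice, because many sklearn estimators, including LogisticRegression, SVC and KNeighborsClassifier, raise an error when the input contains NaN.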
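The check can also be done per column with pandas. The sketch below additionally shows how a gap could be filled if one existed. The missing entry is introduced artificially for illustration, and the median strategy is an assumption, not something this dataset requires.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Count missing values per column, then in total
missing_per_column = df.isna().sum()
print("Total missing values:", int(missing_per_column.sum()))

# Hypothetical scenario: blank out one entry to demonstrate imputation
df_with_gap = df.copy()
df_with_gap.iloc[0, 0] = np.nan

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df_with_gap)
print("Missing after imputation:", int(np.isnan(filled).sum()))
```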

Feature Scaling: StandardScaler rescales each feature to zero mean and unit variance. This matters here because the features are on very different scales: in the rows shown above, mean area is around 1000 while mean smoothness is around 0.1. Without scaling, large-valued features dominate the distance computations used by k-NN and SVM and slow the convergence of the solver used by Logistic Regression. Tree-based models (Decision Tree, Random Forest) are largely insensitive to scaling, but training every model on the same scaled data keeps the comparison consistent.
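The notebook fits the scaler on the full dataset before splitting. A stricter variant, sketched below, fits it on the training rows only so that no test-set statistics reach the model. The sketch also checks that StandardScaler matches the formula z = (x - mean) / std. The `_raw` and `_std` variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_raw)  # mean/std learned from training rows only
X_test_std = scaler.transform(X_test_raw)        # the same mean/std applied to test rows

# Manual standardization: per-feature mean and population std (ddof=0)
manual = (X_train_raw - X_train_raw.mean(axis=0)) / X_train_raw.std(axis=0)
print("Matches StandardScaler:", np.allclose(manual, X_train_std))
```

The test rows will not have exactly zero mean and unit variance after this transform, which is expected: they are scaled with the training statistics.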

# 2. Classification Algorithm Implementation
#   Implement the following five classification algorithms:
#   1. Logistic Regression
#   2. Decision Tree Classifier
#   3. Random Forest Classifier
#   4. Support Vector Machine (SVM)
#   5. k-Nearest Neighbors (k-NN)
#   For each algorithm, provide a brief description of how it works and why it might be suitable for this dataset.


1. Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [14]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

In [15]:
# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)


In [16]:
# Predict and evaluate
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print(classification_report(y_test, y_pred_log_reg))

Logistic Regression:
Accuracy: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.97      0.98      0.98        63
           1       0.99      0.98      0.99       108

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171



working:
Logistic Regression is a linear model for binary classification. It computes a weighted sum of the features and passes it through the logistic (sigmoid) function to obtain the probability of the positive class. The sample is then assigned to a class by thresholding that probability at 0.5.

suitable for:
Simplicity: It’s straightforward to implement and interpret.
Probabilistic Output: Provides class probabilities, which indicate how confident the model is in each prediction.
Performance: Works well when the classes are close to linearly separable, which the 98% test accuracy above suggests is the case for this dataset.
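To make the sigmoid step concrete, the sketch below recomputes the model's probabilities by hand from its fitted coefficients and compares them with `predict_proba`. It repeats the notebook's scaling and split so that it runs on its own. The `sigmoid` helper is defined here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

log_reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear combination of features, then sigmoid -> probability of class 1
z = X_test @ log_reg.coef_.ravel() + log_reg.intercept_[0]
manual_proba = sigmoid(z)
sklearn_proba = log_reg.predict_proba(X_test)[:, 1]

# Thresholding at 0.5 gives the predicted class
manual_pred = (manual_proba > 0.5).astype(int)
print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
```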


2. Decision Tree Classifier

In [17]:
from sklearn.tree import DecisionTreeClassifier

In [18]:
# Initialize and train the Decision Tree model
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)


In [19]:
# Predict and evaluate
y_pred_decision_tree = decision_tree.predict(X_test)
print("Decision Tree Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_decision_tree))
print(classification_report(y_test, y_pred_decision_tree))

Decision Tree Classifier:
Accuracy: 0.935672514619883
              precision    recall  f1-score   support

           0       0.89      0.94      0.91        63
           1       0.96      0.94      0.95       108

    accuracy                           0.94       171
   macro avg       0.93      0.94      0.93       171
weighted avg       0.94      0.94      0.94       171



working:
A Decision Tree Classifier recursively splits the data on feature thresholds, forming a tree of decisions. Each internal node tests one feature against a threshold, each branch corresponds to an outcome of that test, and each leaf node assigns the final class.

suitable for:
Interpretability: Easy to understand and visualize.
Non-linear Relationships: Can capture non-linear relationships and interactions between features without any feature engineering.
Feature Importance: Provides insights into which features are most important for classification.
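The interpretability and feature-importance points can be seen by printing a small tree. This sketch limits the depth to 2 and fixes `random_state` so the output stays short and repeatable. Both settings are assumptions for illustration and differ from the unrestricted tree trained in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Text rendering of the learned if/else rules
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)

# Features ranked by their contribution to the splits
ranked = sorted(
    zip(tree.feature_importances_, data.feature_names), reverse=True
)
for importance, name in ranked[:3]:
    print(f"{name}: {importance:.3f}")
```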

3. Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
# Initialize and train the Random Forest model
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)

In [22]:
# Predict and evaluate
y_pred_random_forest = random_forest.predict(X_test)
print("Random Forest Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_random_forest))
print(classification_report(y_test, y_pred_random_forest))

Random Forest Classifier:
Accuracy: 0.9707602339181286
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        63
           1       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171



working:
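Random Forest is an ensemble method that combines many decision trees to improve accuracy and robustness. Each tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split. The forest's prediction combines the outputs of all trees (sklearn averages their predicted class probabilities).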

suitable for:
Accuracy: Often more accurate than individual decision trees due to the averaging of multiple models.
Robustness: Reduces overfitting compared to a single decision tree.
Handling Features: Can handle a large number of features and interactions effectively.
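The averaging step can be checked directly: the sketch below averages the class probabilities of the individual trees and compares the result with the forest's own output. It uses the raw (unscaled) features for brevity, and the `n_estimators` and `random_state` values are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Class probabilities from every tree, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# Average over trees, then pick the most probable class
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print("Matches forest.predict_proba:",
      np.allclose(avg_proba, forest.predict_proba(X_test)))
```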

4. Support Vector Machine (SVM)

In [23]:
from sklearn.svm import SVC

In [24]:
# Initialize and train the SVM model
svm = SVC()
svm.fit(X_train, y_train)

# Predict and evaluate
y_pred_svm = svm.predict(X_test)
print("Support Vector Machine (SVM):")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

working:
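Support Vector Machine (SVM) finds the hyperplane that separates the classes with the largest margin, that is, the largest distance between the hyperplane and the closest training points of each class (the support vectors). SVC() uses an RBF kernel by default, so the hyperplane lives in a kernel-induced feature space and the decision boundary in the original features can be non-linear.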

suitable for:
High-Dimensional Spaces: Effective in high-dimensional spaces, which is useful here because the dataset has 30 features.
Margin Maximization: Aims to maximize the margin between classes, which can improve generalization.
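As a rough illustration of the points above, this sketch inspects the fitted SVC: how many training points became support vectors, and how the sign of `decision_function` (which side of the separating surface a point falls on) determines the predicted class. It repeats the notebook's scaling and split so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

svm = SVC()  # default RBF kernel
svm.fit(X_train, y_train)

# Only the support vectors define the decision boundary
print("Support vectors per class:", svm.n_support_)
print("Training samples:", len(X_train))

# Positive score -> second class in svm.classes_, negative -> first class
scores = svm.decision_function(X_test)
manual_pred = svm.classes_[(scores > 0).astype(int)]
print("Sign rule matches predict:",
      np.array_equal(manual_pred, svm.predict(X_test)))
```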

5. k-Nearest Neighbors (k-NN)

In [26]:
from sklearn.neighbors import KNeighborsClassifier

In [27]:
# Initialize and train the k-NN model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [28]:
# Predict and evaluate
y_pred_knn = knn.predict(X_test)
print("k-Nearest Neighbors (k-NN):")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

k-Nearest Neighbors (k-NN):
Accuracy: 0.9590643274853801
              precision    recall  f1-score   support

           0       0.95      0.94      0.94        63
           1       0.96      0.97      0.97       108

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171



working:
k-Nearest Neighbors (k-NN) classifies a data point by majority vote among its k nearest neighbors in the feature space (k = 5 by default in KNeighborsClassifier). It does not learn explicit parameters. Instead, it stores the training set and computes distances at prediction time.

suitable for:
Simplicity: Easy to implement and understand.
No Training Phase: Fitting only stores (and optionally indexes) the training data, so new samples can be added cheaply. The cost is shifted to prediction time, when distances to the stored points are computed.
Non-linear Boundaries: Can adapt to complex decision boundaries based on the local structure of the data.
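The voting rule is short enough to write by hand. The sketch below implements it with NumPy and compares it with KNeighborsClassifier. The `knn_predict` helper is a hypothetical name defined here, and it assumes 0/1 labels and an odd k so that votes cannot tie.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Labels of those neighbors; with 0/1 labels the mean is the vote share for class 1
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

manual_pred = knn_predict(X_train, y_train, X_test, k=5)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
agreement = np.mean(manual_pred == knn.predict(X_test))
print("Agreement with KNeighborsClassifier:", agreement)
```

The two should agree on essentially every test point. Any difference would come from how exact distance ties are ordered.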

# 3. Model Comparison
#   - Compare the performance of the five classification algorithms.
#   - Which algorithm performed the best and which one performed the worst?

In [29]:
from sklearn.metrics import accuracy_score, classification_report

In [30]:
# Logistic Regression
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)

In [31]:
# Decision Tree Classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred_decision_tree = decision_tree.predict(X_test)

In [32]:
# Random Forest Classifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
y_pred_random_forest = random_forest.predict(X_test)

In [33]:
# Support Vector Machine (SVM)
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)


In [34]:
# k-Nearest Neighbors (k-NN)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)


In [35]:
# Evaluate each model
def evaluate_model(predictions, model_name):
    print(f"{model_name}:")
    print("Accuracy:", accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))

In [36]:
evaluate_model(y_pred_log_reg, "Logistic Regression")
evaluate_model(y_pred_decision_tree, "Decision Tree Classifier")
evaluate_model(y_pred_random_forest, "Random Forest Classifier")
evaluate_model(y_pred_svm, "Support Vector Machine (SVM)")
evaluate_model(y_pred_knn, "k-Nearest Neighbors (k-NN)")

Logistic Regression:
Accuracy: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.97      0.98      0.98        63
           1       0.99      0.98      0.99       108

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171

Decision Tree Classifier:
Accuracy: 0.9415204678362573
              precision    recall  f1-score   support

           0       0.90      0.95      0.92        63
           1       0.97      0.94      0.95       108

    accuracy                           0.94       171
   macro avg       0.93      0.94      0.94       171
weighted avg       0.94      0.94      0.94       171

Random Forest Classifier:
Accuracy: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.97      0.94      0.95        63
           1       0.96      0.98      0.97       108

    accuracy          

Best Performance: Logistic Regression achieved the highest accuracy (about 0.98).
Worst Performance: Decision Tree Classifier achieved the lowest accuracy (about 0.94).
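The comparison above can also be collected into a single sorted table. This is a minimal sketch: the `models` dictionary and `results` table are names chosen here, and `random_state=42` is an added assumption for the tree-based models. The notebook leaves it unset, which is why the Decision Tree and Random Forest accuracies vary slightly between runs.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "Support Vector Machine (SVM)": SVC(),
    "k-Nearest Neighbors (k-NN)": KNeighborsClassifier(),
}

# Train each model once and record its test accuracy
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    rows.append({"model": name, "accuracy": accuracy})

# Best model first
results = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

A single train/test split gives a noisy ranking on a dataset of this size, so small gaps between models should not be over-interpreted.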