### 1. Loading and Preprocessing

#### Step 1: Load the Dataset

In [34]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the data
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

In [36]:
print(x)
print(y)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0          17.99         10.38          122.80     1001.0          0.11840   
1          20.57         17.77          132.90     1326.0          0.08474   
2          19.69         21.25          130.00     1203.0          0.10960   
3          11.42         20.38           77.58      386.1          0.14250   
4          20.29         14.34          135.10     1297.0          0.10030   
..           ...           ...             ...        ...              ...   
564        21.56         22.39          142.00     1479.0          0.11100   
565        20.13         28.25          131.20     1261.0          0.09780   
566        16.60         28.08          108.30      858.1          0.08455   
567        20.60         29.33          140.10     1265.0          0.11780   
568         7.76         24.54           47.92      181.0          0.05263   

     mean compactness  mean concavity  mean concave points  mea

#### Step 2: Check for Missing Values

In [40]:
# Check for missing values
print(x.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64


#### Step 3: Feature Scaling

In [46]:
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the features
X_scaled = scaler.fit_transform(x)

In [50]:
print(X_scaled)

[[ 1.09706398 -2.07333501  1.26993369 ...  2.29607613  2.75062224
   1.93701461]
 [ 1.82982061 -0.35363241  1.68595471 ...  1.0870843  -0.24388967
   0.28118999]
 [ 1.57988811  0.45618695  1.56650313 ...  1.95500035  1.152255
   0.20139121]
 ...
 [ 0.70228425  2.0455738   0.67267578 ...  0.41406869 -1.10454895
  -0.31840916]
 [ 1.83834103  2.33645719  1.98252415 ...  2.28998549  1.91908301
   2.21963528]
 [-1.80840125  1.22179204 -1.81438851 ... -1.74506282 -0.04813821
  -0.75120669]]


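StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance. This is appropriate here because the features are continuous and measured on very different scales (for example, mean area is in the hundreds or thousands, while mean smoothness is around 0.1).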
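As a rough check of what the scaler computes, the sketch below reproduces the transformation by hand with NumPy and compares it to `StandardScaler`. The variable names (`X`, `X_manual`, `X_sklearn`) are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data  # raw feature matrix, shape (569, 30)

# Standardize by hand: subtract the column mean, divide by the column
# standard deviation. StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

# Both routes should agree up to floating-point error.
print("Manual and StandardScaler agree:", np.allclose(X_manual, X_sklearn))
```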

### 2. Classification Algorithm Implementation

In [61]:
from sklearn.model_selection import train_test_split

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
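
Because the scaler above was fit on the full dataset before splitting, the test rows contribute to the means and standard deviations used for scaling. A stricter alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below is one way to do that; the names `X_raw`, `y_raw`, `X_tr`, `X_te`, and `pipe` are illustrative, and its accuracy may differ slightly from the figures reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split the unscaled data first so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics to transform X_te at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

print("Pipeline accuracy:", pipe.score(X_te, y_te))
```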

#### 1. Logistic Regression

In [63]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))

Logistic Regression Accuracy: 0.9736842105263158


**How it works:**
* Logistic Regression models the probability that a sample belongs to a particular class using a sigmoid function.
* It works best when the classes are approximately linearly separable in feature space, because its decision boundary is a hyperplane.

**Why it's suitable:**
* The features in this dataset separate benign from malignant tumors well with a linear boundary, as the roughly 97% test accuracy suggests.
* It's interpretable and fast.
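
To make the sigmoid step concrete, the sketch below recomputes the predicted probabilities from the fitted coefficients and compares them with `predict_proba`. It refits a model on the whole scaled dataset purely for illustration; `X_demo`, `y_demo`, `clf`, and `sigmoid` are names introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression().fit(X_demo, y_demo)


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Linear score: one weight per feature plus an intercept.
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Probability of class 1 (benign in this dataset's encoding).
p_manual = sigmoid(z)
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print("Manual sigmoid matches predict_proba:", np.allclose(p_manual, p_sklearn))
```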

#### 2. Decision Tree Classifier

In [65]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))

Decision Tree Accuracy: 0.9473684210526315


**How it works:**
* Decision Trees split data based on feature thresholds, creating a tree-like structure.
* Each leaf represents a class label.

**Why it's suitable:**
* Captures non-linear relationships and is easy to interpret.
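* In principle, trees handle both numerical and categorical data, although scikit-learn's `DecisionTreeClassifier` requires categorical features to be encoded numerically first (all features here are numerical).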
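
The threshold splits can be inspected directly. The sketch below fits a deliberately shallow tree and prints its rules with `export_text`. The `max_depth=2` setting and the use of unscaled features are illustrative choices to keep the printout short and the thresholds in original units; they are not the settings used in the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data_demo = load_breast_cancer()

# A shallow tree (max_depth=2 is an assumption for readability only).
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data_demo.data, data_demo.target)

# Each internal node tests "feature <= threshold"; each leaf reports a class.
rules = export_text(tree, feature_names=list(data_demo.feature_names))
print(rules)
```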

#### 3. Random Forest Classifier

In [67]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.9649122807017544


**How it works:**
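* Random Forest trains an ensemble of decision trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split, then combines their predictions to improve generalization.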

**Why it's suitable:**
* Reduces overfitting common in single decision trees.
* Works well with moderately high-dimensional data like this dataset (30 features).
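
The way the trees' outputs are combined can be checked by hand. In scikit-learn, the forest's class probabilities are the average of the per-tree probabilities, which the sketch below verifies. The smaller `n_estimators=25` and the names `forest`, `per_tree`, and `avg_proba` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_rf, y_rf = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=42
)

# n_estimators=25 keeps the example quick; the notebook uses the default.
forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_tr, y_tr)

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([t.predict_proba(X_te) for t in forest.estimators_])

# Average the per-tree class probabilities across the ensemble.
avg_proba = per_tree.mean(axis=0)

print("Average of trees matches forest:",
      np.allclose(avg_proba, forest.predict_proba(X_te)))
```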

#### 4. Support Vector Machine (SVM)

In [75]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))

SVM Accuracy: 0.9736842105263158


**How it works:**
* SVM finds the optimal hyperplane that maximizes the margin between classes.
* Can use kernels for non-linear classification.

**Why it's suitable:**
* Effective in high-dimensional spaces.
* Remains usable even when the number of features exceeds the number of samples, although that is not the case here (30 features, 569 samples).
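
To illustrate the kernel option, the sketch below fits a linear and an RBF kernel (the `SVC` default) on the same split and reports test accuracy and the number of support vectors for each. It scales inside a pipeline, so its numbers may differ slightly from the cell above; `results`, `model`, and `svc` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_svm, y_svm = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_svm, y_svm, test_size=0.2, random_state=42
)

results = {}
for kernel in ("linear", "rbf"):
    # Scale, then fit an SVM with the chosen kernel.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_tr, y_tr)
    svc = model[-1]  # the fitted SVC step
    results[kernel] = {
        "accuracy": model.score(X_te, y_te),
        # Support vectors are the training points that define the margin.
        "n_support": int(svc.n_support_.sum()),
    }
    print(kernel, results[kernel])
```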

#### 5. k-Nearest Neighbors (k-NN)

In [83]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("k-NN Accuracy:", accuracy_score(y_test, y_pred_knn))

k-NN Accuracy: 0.9473684210526315


**How it works:** 
* k-NN classifies new samples based on the majority label of their k nearest neighbors in the feature space.

**Why it's suitable:** 
* Simple and effective for smaller datasets like breast cancer (569 samples).
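* Works well when the classes are well separated. Because it relies on distances, it also benefits from the feature scaling applied in Step 3.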
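
The majority vote can be reproduced by hand. The sketch below asks the fitted model for each test sample's five nearest training samples, counts their labels, and checks the result against `predict`. It mirrors the notebook's scale-then-split workflow; `model`, `neighbor_idx`, `votes`, and `manual_pred` are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_knn, y_knn = load_breast_cancer(return_X_y=True)
X_knn = StandardScaler().fit_transform(X_knn)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_knn, y_knn, test_size=0.2, random_state=42
)

k = 5
model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)

# For each test row, get the indices of its k closest training rows.
_, neighbor_idx = model.kneighbors(X_te)

# Look up the neighbors' labels (0 or 1); shape is (n_test, k).
votes = y_tr[neighbor_idx]

# Predict class 1 when more than half of the k neighbors are class 1.
manual_pred = (votes.sum(axis=1) > k // 2).astype(int)

print("Manual vote matches predict:",
      np.array_equal(manual_pred, model.predict(X_te)))
```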

### 3. Model Comparison

In [104]:
from sklearn.metrics import classification_report, confusion_matrix

# Collect each model's test-set predictions for side-by-side reporting
models = {
    'Logistic Regression': y_pred_lr,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'SVM': y_pred_svm,
    'k-NN': y_pred_knn
}

for name, y_pred in models.items():
    print(f"\n{name} Classification Report:")
    print(classification_report(y_test, y_pred, target_names=data.target_names))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))



Logistic Regression Classification Report:
              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        43
      benign       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Confusion Matrix:
[[41  2]
 [ 1 70]]

Decision Tree Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.93      0.93        43
      benign       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

Confusion Matrix:
[[40  3]
 [ 3 68]]

Random Forest Classification Report:
              precision    recall  f1-score   support

   malignant       0.98      0.93      0.95        43
      benign       0.96      0.99      0.97