In [4]:
import pandas as pd
import numpy as np

#1. Loading and Preprocessing

In [15]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

Load the breast cancer dataset from sklearn

In [7]:
#Load the breast cancer dataset from sklearn.
data = load_breast_cancer()

# Convert to a DataFrame for easier handling
df = pd.DataFrame(data.data,columns=data.feature_names)
df['target'] = data.target

# Display the first few rows of the DataFrame
print(df.head())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

Preprocess the data to handle any missing values and perform necessary feature scaling.

In [10]:
#check for missing values
missing_values = df.isnull().sum()
print("Missing values in each feature:", missing_values, sep="\n")

Missing values in each feature:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


Since there are no missing values in this dataset, we can proceed with scaling.

In [8]:
features = df.drop('target',axis=1)
target = df['target']

# Perform feature scaling using StandardScaler

#Feature scaling is an essential preprocessing step in machine learning to ensure that all features contribute 
# equally to the model's performance. Scaling transforms features to have a specific range or distribution, 
# which can improve the convergence of gradient descent algorithms and the performance of distance-based models.

# StandardScaler standardizes features by removing the mean and scaling to unit variance. 
# It transforms the data so that it has a mean of 0 and a standard deviation of 1.

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

#convert scaled features back to a dataframe

scaled_df = pd.DataFrame(scaled_features,columns = features.columns)
scaled_df['target'] = target.values

# Display the first few rows of the scaled DataFrame

print("the first few rows of the scaled DataFrame")
print(scaled_df.head())

the first few rows of the scaled DataFrame
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0     1.097064     -2.073335        1.269934   0.984375         1.568466   
1     1.829821     -0.353632        1.685955   1.908708        -0.826962   
2     1.579888      0.456187        1.566503   1.558884         0.942210   
3    -0.768909      0.253732       -0.592687  -0.764464         3.283553   
4     1.750297     -1.151816        1.776573   1.826229         0.280372   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0          3.283515        2.652874             2.532475       2.217515   
1         -0.487072       -0.023846             0.548144       0.001392   
2          1.052926        1.363478             2.037231       0.939685   
3          3.402909        1.915897             1.451707       2.867383   
4          0.539340        1.371011             1.428493      -0.009560   

   mean fractal dimension  ...  worst texture  wo

Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

1. Checking for missing values
    We used df.isnull().sum() to check for missing values in the dataset. Missing values can lead to inaccurate model training and predictions, so it is important to identify them. In this case, the breast cancer dataset from scikit-learn does not contain any missing values, which simplifies preprocessing.

2. Feature Scaling
   We applied StandardScaler to standardise the features.
   Normalization of scale - The features in the dataset have different units and ranges. For example, mean area takes values in the hundreds or thousands, while mean smoothness stays well below 1. This discrepancy can affect the performance of algorithms that are sensitive to the scale of the data, particularly distance-based algorithms (like k-NN and SVM) and gradient-descent-based algorithms (like logistic regression).

   Improving convergence - Scaling helps algorithms converge more quickly during training by ensuring that the gradients are not dominated by the features with the largest values.

   Equal importance - By scaling features to have a mean of 0 and a standard deviation of 1, we treat all features equally, preventing the model from being biased towards features with larger magnitudes.

In summary, although the breast cancer dataset does not require extensive preprocessing, these steps ensure that the data is clean and ready for effective model training.
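
One caveat about the scaling step: in this notebook the scaler is fitted on the full dataset before the train/test split, so statistics from the test rows influence the transformation. The following is a minimal sketch of one alternative, assuming a scikit-learn pipeline is acceptable here. It fits the scaler on the training rows only. The names X_tr, X_te, pipe and pipe_accuracy are illustrative and do not appear in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split the raw (unscaled) features first
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then reuses the
# training mean and standard deviation to transform X_te
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(X_tr, y_tr)

pipe_accuracy = pipe.score(X_te, y_te)
print("Test accuracy with train-only scaling:", pipe_accuracy)
```

On a dataset this clean the difference is likely to be small, but the pattern keeps the test set untouched until evaluation.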

#2. Classification Algorithm Implementation

1. Logistic Regression

-Logistic regression is a statistical method used primarily for binary classification problems, where the outcome variable is categorical and typically has two classes (e.g., success/failure, yes/no, 1/0)
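
To make the idea concrete, here is a small sketch of how logistic regression turns a linear score into a class prediction. The weights w, the bias b and the points in X_demo are made-up values for illustration only, not parameters learned from the breast cancer data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of class 1 for each row of X under a linear model."""
    return sigmoid(X @ w + b)

# Hypothetical parameters and inputs, chosen only for the demonstration
w = np.array([2.0, -1.0])
b = 0.5
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [-2.0, 1.0]])

probs = predict_proba(X_demo, w, b)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get a class label
print(probs)
print(labels)
```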

In [9]:
#split the dataset into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(scaled_features,target,test_size = 0.2,random_state = 42)

# test_size = 0.2 means 20% of the data will be used for testing , while the remaining 80% for the training
#Training vs. Testing: A larger training set typically helps in better model training, while a smaller testing set is 
# sufficient to evaluate the model's performance. The 80-20 split is a common practice, balancing the need for both training 
# and testing data

#Setting random_state ensures that the results are reproducible. Every time you run the code with the same random state,
#  you'll get the same split of the dataset into training and testing sets.

#output shapes

print("Training set shape", x_train.shape)
print("Testing set shape", x_test.shape)

Training set shape (455, 30)
Testing set shape (114, 30)
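
Before interpreting any results, it can help to confirm how the labels are encoded and how balanced the classes are. The sketch below prints the label names and counts, then shows a stratified variant of the split. The stratify argument is an optional addition that the original notebook does not use, and the names bc, X_tr, X_te and the *_share variables are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X, y = bc.data, bc.target

# Label encoding: index in target_names is the integer class label
for label, name in enumerate(bc.target_names):
    print(label, name, int(np.sum(y == label)))

# stratify=y keeps the class proportions (nearly) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_benign_share = float(np.mean(y_tr == 1))
test_benign_share = float(np.mean(y_te == 1))
print(train_benign_share, test_benign_share)
```

In this dataset class 0 is malignant and class 1 is benign, which matters when reading the confusion matrices and classification reports.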


In [16]:
# Logistic regression model
log_reg = LogisticRegression(max_iter=1000,random_state=42)
log_reg.fit(x_train,y_train)

#prediction 
y_pred= log_reg.predict(x_test)

#evaluation

accuracy = accuracy_score(y_test,y_pred)
confusion = confusion_matrix(y_test,y_pred)
report  = classification_report(y_test,y_pred)

print("Accuracy of Logistic Regression:", accuracy)
print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", report)

Accuracy of Logistic Regression: 0.9736842105263158
Confusion Matrix:
 [[41  2]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Let's analyze the results from the Logistic Regression model. Note that scikit-learn encodes this dataset with class 0 = malignant and class 1 = benign, and that the rows of the confusion matrix are true classes while the columns are predicted classes.
1. Accuracy
   The model correctly classified approximately 97.37% of the instances in the test set (111 of 114). This indicates that the model performs well
   overall in distinguishing between the two classes (malignant vs benign).

2. Confusion matrix
   
   Taking class 1 (benign) as the positive class, as scikit-learn does by default:
   True Negatives (TN) - The model correctly predicted 41 malignant cases (class 0).
   False Positives (FP) - The model incorrectly predicted 2 malignant cases as benign (class 1). Clinically these are the most serious errors, because a cancer is missed.
   False Negatives (FN) - The model incorrectly predicted 1 benign case as malignant (class 0), which is a false alarm.
   True Positives (TP) - The model correctly predicted 70 benign cases (class 1).

3. Classification report

   Class 0 (Malignant)
       Precision - Of all instances predicted as malignant, 98% were actually malignant (41 of 42).
       Recall - Out of all actual malignant cases, the model correctly identified 95% (41 of 43), so it missed about 5% of the malignant cases.
       F1-score - The harmonic mean of precision and recall is 0.96, indicating a good balance between them.
   Class 1 (Benign)
       Precision - Of all instances predicted as benign, 97% were actually benign (70 of 72).
       Recall - Out of all actual benign cases, the model correctly identified 99% (70 of 71), so it misclassified about 1% of the benign cases.
       F1-score - The value of 0.98 shows a good balance between precision and recall for this class.
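
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below does this with NumPy for the logistic regression matrix printed above. It relies only on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the logistic regression cell:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[41, 2],
               [1, 70]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)     # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)        # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("accuracy :", round(float(accuracy), 4))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1       :", f1.round(2))
```

The rounded values match the report: precision 0.98 and 0.97, recall 0.95 and 0.99, F1 0.96 and 0.98.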

2. Decision Tree Classifier


In [19]:
# Initialize the decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

#fit the model on the training data

dt_classifier.fit(x_train,y_train)

#make predictions on test set
y_pred_dt = dt_classifier.predict(x_test)

#evaluate the model

accuracy_dt = accuracy_score(y_test,y_pred_dt)
confusion_dt = confusion_matrix(y_test,y_pred_dt)
report_dt  = classification_report(y_test,y_pred_dt)

print("Accuracy of Decision Tree Classifier:", accuracy_dt)
print("Confusion Matrix:\n", confusion_dt)
print("Classification Report:\n", report_dt)


Accuracy of Decision Tree Classifier: 0.9473684210526315
Confusion Matrix:
 [[40  3]
 [ 3 68]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



A Decision Tree Classifier works by splitting the dataset into subsets based on the value of input features. Here's a simplified description of the process:

Tree Structure:

A decision tree consists of nodes, where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome (class label).
Splitting:

The algorithm recursively splits the data at each node based on feature values. The goal is to create branches that lead to pure nodes (i.e., nodes where all data points belong to a single class).
Various criteria (such as Gini impurity or entropy) are used to determine the best feature and the best threshold for splitting the data.
Stopping Criteria:

The splitting process continues until a stopping criterion is met, such as reaching a maximum depth, having too few samples in a node, or achieving a certain level of purity.
Prediction:

Once the tree is constructed, making predictions involves traversing the tree from the root to a leaf node, following the decision rules based on the feature values of the input instance.
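
The splitting step can be sketched in a few lines. The code below is a simplified illustration of choosing a threshold on a single feature by minimising the weighted Gini impurity. It is not scikit-learn's implementation, and the helper names gini and best_threshold as well as the toy arrays are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, float("inf"))
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (float(t), score)
    return best

# Toy data: small values belong to class 0, large values to class 1
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(x_demo, y_demo)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets until a stopping criterion is met.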


Decision trees are easy to interpret and visualize. The decision-making process is straightforward, making it possible to understand how predictions are made. This is particularly useful in fields like healthcare, where stakeholders may need to understand the rationale behind model decisions.



3. Random Forest Classifier


In [21]:
#initialise random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100,random_state=42)

#fit the model on the training data
rf_classifier.fit(x_train,y_train)

#make predictions on the test set

y_pred_rf = rf_classifier.predict(x_test)

#evaluate the model
accuracy_rf = accuracy_score(y_test,y_pred_rf)
confusion_rf = confusion_matrix(y_test,y_pred_rf)
report_rf = classification_report(y_test,y_pred_rf)

print("Accuracy of Random Forest Classifier:", accuracy_rf)
print("Confusion Matrix:\n", confusion_rf)
print("Classification Report:\n", report_rf)


Accuracy of Random Forest Classifier: 0.9649122807017544
Confusion Matrix:
 [[40  3]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



Random Forest is an ensemble learning method that builds multiple decision trees and merges their predictions to improve overall accuracy and robustness. Here’s a brief description of its functioning:

Bootstrap Sampling:

The algorithm generates multiple subsets of the original training data through a process called bootstrapping (random sampling with replacement). Each subset is used to train a separate decision tree.
Decision Tree Creation:

For each tree, a random subset of features is selected at each split, ensuring that different trees learn different aspects of the data. This randomness helps to decorrelate the trees, which improves the ensemble's performance.
Voting Mechanism:

Once all the trees are trained, predictions are made for a given input by aggregating the predictions of all the individual trees. For classification tasks, the final output is typically determined by majority voting (the class that receives the most votes from the trees).
Feature Importance Calculation:

After training, Random Forest can provide insights into feature importance, showing which features contribute most to the predictions.
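
As an illustration of the feature importance step, the sketch below fits a forest and lists the five features with the highest impurity-based importance. For brevity it fits on the full dataset rather than on the training split used elsewhere in this notebook, and the names bc, forest and top are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

bc = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(bc.data, bc.target)

# Impurity-based importances: one value per feature, normalised to sum to 1
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:5]  # indices of the five largest values

for i in top:
    print(f"{bc.feature_names[i]:<25} {importances[i]:.3f}")
```

Impurity-based importances can favour correlated or high-cardinality features, so they are best read as a rough ranking rather than an exact measure.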

Random Forest is known for its ability to provide high accuracy while being more robust against overfitting than a single decision tree, especially when dealing with complex datasets. This is valuable for medical datasets, where accurate predictions can significantly impact patient outcomes.


Overall, the Random Forest Classifier is a powerful and versatile tool well-suited for the breast cancer dataset due to its robustness and ability to model complex relationships. Its ensemble nature enhances prediction accuracy, and its feature importance scores offer some insight into the model even though the ensemble is harder to interpret than a single tree, making it a strong candidate for classification tasks in healthcare.





4. Support Vector Machine (SVM)


Support Vector Machine (SVM) is a supervised learning algorithm commonly used for classification tasks. The core idea behind SVM is to find the optimal hyperplane that separates data points of different classes in a high-dimensional space. The "support vectors" are the data points that are closest to this hyperplane, and they are critical in defining its position and orientation.

Hyperplane: SVM attempts to find the hyperplane that best separates the classes. In two dimensions, this hyperplane is a line; in three dimensions, it's a plane; and in higher dimensions, it becomes a hyperplane.

Maximizing Margin: SVM seeks to maximize the margin between the hyperplane and the nearest data points from either class (the support vectors). A larger margin often leads to better generalization on unseen data.

Kernel Trick: SVM can efficiently perform non-linear classification using a technique called the kernel trick. This allows SVM to project data into higher dimensions where it becomes easier to separate classes with a hyperplane. Common kernels include linear, polynomial, and radial basis function (RBF).
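
To give a feel for the RBF kernel, the sketch below computes it directly with NumPy. The value gamma=0.5 is an arbitrary choice for illustration. It is not the value SVC uses, since SVC defaults to gamma='scale', which is derived from the data.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

p = np.array([0.0, 0.0])
q = np.array([1.0, 1.0])
r = np.array([3.0, 3.0])

print(rbf_kernel(p, p))  # identical points give the maximum similarity
print(rbf_kernel(p, q))  # nearby points give a value between 0 and 1
print(rbf_kernel(p, r))  # distant points give a value close to 0
```

The kernel acts as a similarity score that decays with distance, which is one reason feature scaling matters for an RBF SVM.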



In [18]:
#train the SVM model
svm_model = SVC(kernel='rbf',random_state=42)
svm_model.fit(x_train,y_train)

#make predictions
y_pred_svm = svm_model.predict(x_test)

#evaluate the model

accuracy_svm = accuracy_score(y_test,y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test,y_pred_svm)
class_report_svm = classification_report(y_test,y_pred_svm)

print(f"Accuracy of Support Vector Machine: {accuracy_svm:.2f}")
print("Confusion Matrix:\n", conf_matrix_svm)
print("Classification Report:\n", class_report_svm)


Accuracy of Support Vector Machine: 0.97
Confusion Matrix:
 [[41  2]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



The model achieved an accuracy of 0.97, indicating that 97% of the predictions made by the SVM classifier were correct (111 of 114). This is a strong result, suggesting that the model is effective at distinguishing between malignant (0) and benign (1) tumors in the breast cancer dataset.

High Precision and Recall:

The model shows high precision and recall for both classes. Recall is 0.95 for malignant tumors (class 0) and 0.99 for benign tumors (class 1), so the model identifies most malignant tumors while raising few false alarms (benign tumors classified as malignant).
False Positives and False Negatives:

With class 1 (benign) as the positive class, the 2 false positives are malignant tumors classified as benign. Few as they are, these are the errors that matter most in a medical context, because a cancer goes undetected.
The single false negative is a benign tumor classified as malignant, which is a false alarm rather than a missed diagnosis.
Balanced Performance:

The macro and weighted averages of precision, recall, and F1-score indicate a balanced performance across both classes, with no significant bias toward one class over the other.


5. k-Nearest Neighbors (k-NN)


Description: k-Nearest Neighbors (k-NN) is a simple, non-parametric, supervised learning algorithm used for classification and regression. The core idea of k-NN is to classify a data point based on the classes of its k nearest neighbors in the feature space. The distance between data points is typically measured using metrics like Euclidean distance.

How k-NN Works:
Choosing k: The user defines a value for k, which determines how many neighbors to consider when making a prediction. A smaller k can be noisy and sensitive to outliers, while a larger k smooths out predictions but may overlook local patterns.

Distance Metric: The algorithm calculates the distance between the target point and all other points in the dataset using a specified distance metric (e.g., Euclidean, Manhattan).

Voting Mechanism: Once the k nearest neighbors are identified, the algorithm assigns the class label that is most common among these neighbors (for classification) or averages their values (for regression).
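
The three steps above can be sketched from scratch as follows. This is a simplified illustration on made-up 2-D points, not the KNeighborsClassifier implementation, and the function name knn_predict and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Two well-separated toy clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```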



In [19]:
#train the K-NN model
k = 5  # you can experiment with different values of k
knn_model = KNeighborsClassifier(n_neighbors = k)

knn_model.fit(x_train,y_train)

#make predictions
y_pred_knn = knn_model.predict(x_test)

#evaluate the model
accuracy_knn = accuracy_score(y_test,y_pred_knn)
conf_matrix_knn = confusion_matrix(y_test,y_pred_knn)
class_report_knn = classification_report(y_test,y_pred_knn)

print(f"Accuracy of k-Nearest Neighbors: {accuracy_knn:.2f}")
print("Confusion Matrix:\n", conf_matrix_knn)
print("Classification Report:\n", class_report_knn)

Accuracy of k-Nearest Neighbors: 0.95
Confusion Matrix:
 [[40  3]
 [ 3 68]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



Accuracy: The k-NN model achieved an accuracy of 0.95, indicating that 95% of the predictions made by the classifier were correct. This is a strong result, although slightly lower than the 97% accuracy achieved by the SVM model.
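
One way to experiment with different values of k is cross-validation. The sketch below is one possible approach rather than part of the original analysis. The candidate values of k and the choice of 5 folds are assumptions, and the names cv_scores and best_k are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

cv_scores = {}
for n in (1, 3, 5, 7, 9, 11):
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=n))
    cv_scores[n] = float(cross_val_score(model, X, y, cv=5).mean())
    print(f"k={n:<2} mean accuracy={cv_scores[n]:.3f}")

best_k = max(cv_scores, key=cv_scores.get)
print("Best k by mean cross-validated accuracy:", best_k)
```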


#3. Model Comparison


Accuracy:

Highest: Logistic Regression and SVM (0.97)
Middle: Random Forest (0.96)
Lowest: Decision Tree and k-NN (0.95)
Confusion Matrices:

Logistic Regression and SVM had the fewest misclassifications (only 3 errors each).
Random Forest made 4 errors.
Decision Tree and k-NN had more misclassifications (6 errors each).
Precision:

Logistic Regression and SVM excelled in precision for both classes (0.98 for malignant tumors, 0.97 for benign tumors).
The Decision Tree and k-NN had lower precision for malignant tumors (0.93).
Recall:

Logistic Regression and SVM had strong recall for both classes: 0.95 for malignant tumors (class 0) and 0.99 for benign tumors (class 1).
The Decision Tree, k-NN, and Random Forest had a lower recall for malignant tumors (0.93), meaning each missed 3 of the 43 malignant cases.
F1-Score:

Logistic Regression and SVM achieved the highest F1-scores (0.96 and 0.98), indicating a good balance between precision and recall.
Random Forest followed closely (0.95 and 0.97), while the Decision Tree and k-NN had lower F1-scores (0.93 and 0.96).


Overall, the results suggest that Logistic Regression and SVM may be the most reliable choices for classifying breast cancer in this dataset, but Random Forest is also a strong contender. With only 114 test samples, the gaps between models amount to one to three cases, so this ranking should be treated as tentative.
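
A comparison based on a single split can be checked against cross-validation. The sketch below is a possible follow-up, not a result from the original notebook. It scores the same five models with 5-fold cross-validation, and the fold count and the names candidates and summary are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same models and hyperparameters as in the cells above
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", random_state=42),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

summary = {}
for name, estimator in candidates.items():
    # Scale inside the pipeline so every fold is scaled on its own training part
    scores = cross_val_score(make_pipeline(StandardScaler(), estimator), X, y, cv=5)
    summary[name] = (float(scores.mean()), float(scores.std()))
    print(f"{name:<20} mean={summary[name][0]:.3f} std={summary[name][1]:.3f}")
```

Comparing the mean and the spread across folds gives a better sense of whether one model is consistently ahead of another.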
