# Breast Cancer Detection using Logistic Regression

The data used in this lab comes from the UC Irvine Machine Learning Repository.

In [21]:
# Import the necessary library

import pandas as pd

In [22]:
# Import the Data
# Extract the features (X) and the target (y)
# X: all columns except the first one and the last one
# y: the last column

data = pd.read_csv('breast_cancer.csv')
X = data.iloc[:, 1:-1].values
y = data.iloc[:, -1].values

In [23]:
# Split the data into the Training Set and the Test Set
# test_size = 0.2 means 20% of the data will be used for testing, and 80% for training
# random_state = 0 ensures reproducibility of the results

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [24]:
# Training the Logistic Regression Model on the Training Set

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

In [25]:
# Predicting the Test Set Results
y_pred = classifier.predict(X_test)

In [26]:
# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[84  3]
 [ 3 47]]


In the matrix printed by `confusion_matrix`, rows are the true classes and columns are the predicted classes:

* TN (True Negative, top left): 84 cases were correctly predicted as not having breast cancer.

* TP (True Positive, bottom right): 47 cases were correctly predicted as having breast cancer.

* FP (False Positive, top right): 3 cases were incorrectly predicted as having breast cancer.

* FN (False Negative, bottom left): 3 cases were incorrectly predicted as not having breast cancer.
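The four counts can be turned into summary metrics with a few lines of arithmetic. The following is a minimal sketch that hard-codes the matrix printed above instead of reusing the `cm` variable, so it runs on its own; the metric names (`precision`, `recall`, `specificity`) are standard definitions and do not appear in the original notebook.

```python
# Confusion matrix from the output above:
# rows are true classes, columns are predicted classes.
cm = [[84, 3],
      [3, 47]]

tn, fp = cm[0]  # true class: no breast cancer
fn, tp = cm[1]  # true class: breast cancer
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all test cases classified correctly
precision = tp / (tp + fp)        # share of predicted cancer cases that are truly cancer
recall = tp / (tp + fn)           # share of true cancer cases the model detects
specificity = tn / (tn + fp)      # share of cancer-free cases correctly identified

print('Test cases: {}'.format(total))
print('Accuracy: {:.2f} %'.format(accuracy * 100))
print('Precision: {:.2f} %'.format(precision * 100))
print('Recall: {:.2f} %'.format(recall * 100))
print('Specificity: {:.2f} %'.format(specificity * 100))
```

On this test set the model classifies 131 of 137 cases correctly, which is about 95.6% accuracy. In a screening setting, recall is often the metric to watch most closely, because each false negative is a missed cancer case.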

In [27]:
# Calculating the accuracy with k-fold cross-validation
# cv = 10 means we are using 10-fold cross-validation

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print('Accuracy: {:.2f} %'.format(accuracies.mean()*100))
print('Standard Deviation: {:.2f} %'.format(accuracies.std()*100))

Accuracy: 96.70 %
Standard Deviation: 1.97 %


## Interpretation

K-Fold Cross-Validation:
K-fold cross-validation gives a more reliable performance estimate than a single train/test split. It divides the training data into k subsets (folds) and trains and evaluates the model k times, each time holding out a different fold for evaluation and training on the remaining k-1 folds.
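To make the fold rotation concrete, here is a minimal sketch of how sample indices could be partitioned into k folds. The helper `k_fold_indices` and the sample count of 25 are hypothetical and chosen only for illustration; this is not scikit-learn's implementation. When `cross_val_score` receives an integer `cv` and a classifier, it uses stratified folds that also preserve class proportions, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Every sample outside the held-out fold is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Hypothetical sample count, used only to show the rotation.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(25, 10), start=1):
    print('Fold {:2d}: train on {} samples, evaluate on {}'.format(
        fold, len(train_idx), len(test_idx)))
```

Each sample lands in the held-out fold exactly once, so every sample contributes to exactly one of the k accuracy scores.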

The results from cross-validation:

* Mean Accuracy: 96.70%
* Standard Deviation: 1.97%

These metrics indicate that the model performs consistently well across different subsets of the training data:

* The high mean accuracy (96.70%) means that, averaged over the ten folds, the model correctly classified about 97 of every 100 held-out cases.

* The low standard deviation (1.97 percentage points) indicates that accuracy varies little from fold to fold, so the result does not depend heavily on which cases end up in the held-out fold.
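Both summary figures are computed from the ten per-fold accuracies. The sketch below shows the calculation using only the standard library. The values in `fold_accuracies` are hypothetical placeholders, not the scores from this run, which the notebook does not print. It uses the population standard deviation, because NumPy's `std()` defaults to that form (`ddof=0`).

```python
import statistics

# Hypothetical per-fold accuracies, NOT the values from this notebook's run.
fold_accuracies = [0.96, 0.98, 0.95, 1.00, 0.94, 0.98, 0.96, 0.98, 0.93, 0.98]

mean_acc = statistics.mean(fold_accuracies)
# pstdev is the population standard deviation, matching NumPy's default ddof=0.
std_acc = statistics.pstdev(fold_accuracies)

print('Accuracy: {:.2f} %'.format(mean_acc * 100))
print('Standard Deviation: {:.2f} %'.format(std_acc * 100))
# A rough band of one standard deviation around the mean.
print('Mean +/- 1 SD: {:.2f} % to {:.2f} %'.format(
    (mean_acc - std_acc) * 100, (mean_acc + std_acc) * 100))
```

Printing the individual scores in `accuracies` alongside the mean would show whether a single weak fold or a general spread drives the standard deviation.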

### Conclusion

The Logistic Regression model performs well on the given dataset. On the held-out test set it classifies 131 of 137 cases correctly (about 95.6% accuracy), with only 3 false positives and 3 false negatives. The 10-fold cross-validation on the training set gives a similar mean accuracy of 96.70% with a standard deviation of 1.97%, which suggests that the model's performance is stable and that it generalizes to unseen cases drawn from the same dataset.