# Breast Cancer Classification with Logistic Regression

This notebook builds a logistic regression model that classifies breast tumors as malignant or benign, using the breast cancer dataset bundled with scikit-learn.

## Importing Libraries

We'll start by importing all the necessary libraries for our analysis:

In [None]:
import numpy as np              # For numerical operations
import pandas as pd             # For data manipulation and analysis
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns           # For enhanced data visualization
from sklearn.datasets import load_breast_cancer           # To load the dataset
from sklearn.model_selection import train_test_split      # To split data into training and testing sets
from sklearn.preprocessing import StandardScaler          # For feature scaling
from sklearn.linear_model import LogisticRegression       # Our classification model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc  # For model evaluation

## Loading the Dataset

The breast cancer dataset is a classic dataset in machine learning. It contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of the cell nuclei present in the image.

The dataset includes 569 instances (212 malignant, 357 benign) with 30 numeric features each. The target variable is encoded as 0 for malignant and 1 for benign.

In [None]:
# Load the breast cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data    # Features
y = breast_cancer.target  # Target variable (0: malignant, 1: benign)

# Create a DataFrame for better data manipulation
df = pd.DataFrame(X, columns=breast_cancer.feature_names)
df['target'] = y

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nTarget Distribution:")
print(df['target'].value_counts())
print("\nFeature Names:")
print(breast_cancer.feature_names)
print("\nClass Names:")
print(breast_cancer.target_names)

## Data Preprocessing

Before training our model, we need to preprocess the data. This includes:
1. Splitting the data into training and testing sets
2. Scaling the features to ensure they're on the same scale
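
The scaling step can be sketched by hand on a small example. The toy arrays and the names `X_train_toy`, `X_test_toy`, and `toy_scaler` are hypothetical and exist only for illustration. The point is that the mean and standard deviation come from the training rows alone and are then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [5.0, 500.0]])

# Statistics are computed from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same training statistics are applied to both sets
X_train_manual = (X_train_toy - mean) / std
X_test_manual = (X_test_toy - mean) / std

# StandardScaler fit on the training rows should give matching results
toy_scaler = StandardScaler().fit(X_train_toy)
print(np.allclose(X_train_manual, toy_scaler.transform(X_train_toy)))
print(np.allclose(X_test_manual, toy_scaler.transform(X_test_toy)))
```

Fitting the scaler on the full dataset instead would let test-set statistics leak into training, which is why only `transform` is called on the test data.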

In [None]:
# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

# Scale the features using StandardScaler
# This transforms the data to have mean=0 and variance=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data and transform it
X_test_scaled = scaler.transform(X_test)        # Transform test data using the same scaler

## Training the Logistic Regression Model

Now we'll train a logistic regression model on our preprocessed data. 

Logistic regression estimates the probability that an instance belongs to the positive class by passing a weighted sum of the features through the logistic (sigmoid) function. In scikit-learn the positive class is label 1, which is benign in this dataset. If the estimated probability is greater than 0.5, the model predicts benign; otherwise, it predicts malignant (label 0).
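
As a minimal sketch of that decision rule, the snippet below computes probabilities from a linear score and applies the 0.5 threshold. The weights `w`, bias `b`, and rows in `X_toy` are made-up values for illustration, not parameters of the model trained in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and standardized feature rows (illustration only)
w = np.array([0.8, -1.5])
b = 0.2
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

scores = X_toy @ w + b             # linear score for each row
probs = sigmoid(scores)            # estimated probability of class 1
preds = (probs > 0.5).astype(int)  # 1 if probability exceeds 0.5, else 0

for s, p, c in zip(scores, probs, preds):
    print(f"score={s:+.2f}  P(class 1)={p:.3f}  predicted class={c}")
```

A score above zero corresponds to a probability above 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the linear score.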

In [None]:
# Initialize the logistic regression model
model = LogisticRegression()

# Train the model on the scaled training data
model.fit(X_train_scaled, y_train)

print("Model training complete!")

## Model Evaluation

After training the model, we need to evaluate its performance on the test set. We'll use several metrics:
- Accuracy: The proportion of correct predictions
- Classification report: Precision, recall, and F1-score for each class
- Confusion matrix: A table showing correct and incorrect predictions
- ROC curve: A plot showing the trade-off between true positive rate and false positive rate
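
To make the first two items concrete, here is a small sketch that computes accuracy, precision, recall, and F1 by hand and compares them with scikit-learn's functions. The label arrays `y_true_toy` and `y_pred_toy` are invented for illustration and are unrelated to the notebook's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = positive class)
y_true_toy = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred_toy = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Count each outcome type directly
tp_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 1))
tn_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 0))
fp_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 1))
fn_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 0))

accuracy_manual = (tp_toy + tn_toy) / len(y_true_toy)
precision_manual = tp_toy / (tp_toy + fp_toy)  # of predicted positives, how many are correct
recall_manual = tp_toy / (tp_toy + fn_toy)     # of actual positives, how many are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, accuracy_score(y_true_toy, y_pred_toy))
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(f1_manual, f1_score(y_true_toy, y_pred_toy))
```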

In [None]:
# Make predictions on the test set

# Class predictions (0 or 1)
y_pred = model.predict(X_test_scaled)                

# Probability of being in class 1 (benign)
y_pred_prob = model.predict_proba(X_test_scaled)[:, 1]  

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

### Confusion Matrix

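A confusion matrix shows the counts of true positives, true negatives, false positives, and false negatives. It helps us understand where our model is making mistakes. In this dataset the positive class is label 1 (benign), so a false positive is a malignant tumor predicted as benign, which is usually the more costly error.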
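
The layout returned by scikit-learn's `confusion_matrix` can be checked on a small example: rows are true labels, columns are predicted labels, and both are ordered by label value. The arrays below are hypothetical and chosen so that all four cells have different counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels using this dataset's encoding: 0 = malignant, 1 = benign
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp,
# where "positive" means label 1 (benign)
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()

# Recall for each class, read straight from the matrix rows
malignant_recall = tn_toy / (tn_toy + fp_toy)  # row 0: true malignant
benign_recall = tp_toy / (tp_toy + fn_toy)     # row 1: true benign
print(f"Malignant recall: {malignant_recall:.2f}")
print(f"Benign recall: {benign_recall:.2f}")
```

Because label 0 is malignant, the fraction of malignant tumors the model catches appears in the top row of the matrix, not the bottom one.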

In [None]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=breast_cancer.target_names, 
            yticklabels=breast_cancer.target_names)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
plt.title('Confusion Matrix', fontsize=14)
plt.show()

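# Calculate and display additional metrics from the confusion matrix.
# For labels [0, 1], ravel() returns tn, fp, fn, tp, where "positive" is
# label 1 (benign) and "negative" is label 0 (malignant).
tn, fp, fn, tp = conf_matrix.ravel()
sensitivity = tp / (tp + fn)  # True Positive Rate: recall for benign
specificity = tn / (tn + fp)  # True Negative Rate: recall for malignant

print(f"Sensitivity (True Positive Rate, benign recall): {sensitivity:.4f}")
print(f"Specificity (True Negative Rate, malignant recall): {specificity:.4f}")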

### ROC Curve

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold varies. The area under the ROC curve (AUC) summarizes how well the model separates the two classes: 1.0 means perfect separation, while 0.5 corresponds to random guessing (the diagonal line in the plot).
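
One way to build intuition for AUC is that it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The sketch below checks this on four hypothetical scores; `y_true_toy` and `scores_toy` are illustrative values, not outputs of the notebook's model.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted probabilities for class 1
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# AUC from the ROC curve, as computed elsewhere in this notebook
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, scores_toy)
auc_toy = auc(fpr_toy, tpr_toy)

# AUC as the share of (positive, negative) pairs ranked correctly
# (this simple form assumes no tied scores)
pos_scores = scores_toy[y_true_toy == 1]
neg_scores = scores_toy[y_true_toy == 0]
pairwise_auc = np.mean([p > n for p in pos_scores for n in neg_scores])

print(f"AUC from ROC curve: {auc_toy:.2f}")
print(f"AUC from pairwise ranking: {pairwise_auc:.2f}")
```

Here three of the four pairs are ordered correctly, so both calculations give 0.75.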

In [None]:
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random guessing')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14)
plt.legend(loc="lower right", fontsize=10)
plt.grid(alpha=0.3)
plt.show()

## Feature Importance

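One advantage of logistic regression is that its coefficients are interpretable. Because the features were standardized before training, the coefficients are on a common scale, so the magnitude of a coefficient indicates how strongly that feature influences the prediction. The sign gives the direction: a positive coefficient pushes the prediction toward class 1 (benign), and a negative one toward class 0 (malignant). Note that taking absolute values, as done below, discards this direction, and that correlated features can share or trade off weight.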
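
A coefficient can also be read as an odds ratio by exponentiating it. The sketch below uses three hypothetical coefficients (the names `feature_a`, `feature_b`, `feature_c` and their values are invented, not fitted results) to show how the ranking by magnitude and the odds ratios relate.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for three standardized features (not fitted values)
toy_coef = pd.Series({"feature_a": 1.2, "feature_b": -2.0, "feature_c": 0.1})

summary = pd.DataFrame({
    "coefficient": toy_coef,
    "abs_coefficient": toy_coef.abs(),
    # exp(coef) is the multiplicative change in the odds of class 1
    # for a one-standard-deviation increase in the feature
    "odds_ratio": np.exp(toy_coef),
}).sort_values("abs_coefficient", ascending=False)

print(summary)
```

In this toy case `feature_b` ranks first by magnitude even though its odds ratio is below 1, meaning larger values of it lower the odds of class 1.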

In [None]:
# Get feature importance based on the coefficients
coef = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': breast_cancer.feature_names, 'Importance': np.abs(coef)})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Display the top 15 most important features
print("Top 15 Most Important Features:")
feature_importance.head(15)

In [None]:
# Plot feature importance
plt.figure(figsize=(12, 8))
ax = sns.barplot(x='Importance', y='Feature', data=feature_importance.head(15), palette='viridis')
plt.title('Top 15 Feature Importance', fontsize=14)
plt.xlabel('Absolute Coefficient Value', fontsize=12)
plt.tight_layout()

# Add value labels to the bars
for i, v in enumerate(feature_importance.head(15)['Importance']):
    ax.text(v + 0.05, i, f"{v:.2f}", va='center')
    
plt.show()