# Breast Cancer Prediction System - Model Building

## Educational ML Project for Tumor Classification

**⚠️ DISCLAIMER**: This system is strictly for educational purposes and must NOT be used as a medical diagnostic tool.

---

This notebook builds a machine learning model to classify breast tumors as **benign** or **malignant** using the Breast Cancer Wisconsin (Diagnostic) Dataset.

**Selected Features** (scikit-learn's copy of the dataset names these `mean radius`, `mean texture`, and so on):
1. radius_mean
2. texture_mean
3. perimeter_mean
4. area_mean
5. smoothness_mean

**Algorithm**: Logistic Regression
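
The selected features are listed in the `radius_mean` style, while the code cells in this notebook use scikit-learn's `mean radius` style. A minimal sketch of converting between the two for these five features, assuming each name follows a `<measure>_<statistic>` pattern; `to_sklearn_name` is a hypothetical helper, not part of any library:

```python
def to_sklearn_name(csv_name: str) -> str:
    """Convert a 'radius_mean'-style name to the 'mean radius' style."""
    # Split on the last underscore: 'radius_mean' -> ('radius', '_', 'mean')
    measure, _, statistic = csv_name.rpartition("_")
    return f"{statistic} {measure}"


csv_style = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean"]
sklearn_style = [to_sklearn_name(name) for name in csv_style]
print(sklearn_style)
```

The helper is only checked against the five mean features used here; other columns in the dataset may not follow the same pattern.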

## 1. Import Required Libraries

In [None]:
# Import necessary libraries for data manipulation, visualization, and machine learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

## 2. Load the Breast Cancer Dataset

In [None]:
# Load the Breast Cancer Wisconsin (Diagnostic) Dataset from sklearn
data = load_breast_cancer()

# Create a DataFrame for better data manipulation
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"\nFeature names: {list(data.feature_names[:10])}...")
print(f"\nTarget names: {data.target_names}")
print(f"\nTarget distribution:\n{df['target'].value_counts()}")

## 3. Explore Dataset Structure

In [None]:
# Display first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Check for missing values (total count across all columns)
print("\n\nTotal missing values:")
print(df.isnull().sum().sum())

# Basic statistics
print("\n\nBasic statistics:")
print(df.describe())

## 4. Select 5 Features for Model Training

As per project requirements, we select exactly **5 features** from the available options.

In [None]:
# Select exactly 5 features as required
selected_features = [
    'mean radius',
    'mean texture', 
    'mean perimeter',
    'mean area',
    'mean smoothness'
]

# Create feature matrix X and target vector y
X = df[selected_features]
y = df['target']

print(f"Selected Features: {selected_features}")
print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\n✅ Successfully selected 5 features for training")
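
Three of the five selected features (radius, perimeter, area) all describe tumor size, so they may carry overlapping information. A quick way to check is a correlation matrix, sketched here with pandas `DataFrame.corr` on a freshly loaded copy of the dataset. The variable names are illustrative and chosen so they do not overwrite the notebook's own `df`:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load a separate copy of the data so the notebook's variables stay untouched
bunch = load_breast_cancer()
features_df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

cols = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]

# Pearson correlation between every pair of selected features
corr_matrix = features_df[cols].corr()
print(corr_matrix.round(2))
```

Strongly correlated inputs do not prevent Logistic Regression from fitting, but they can make individual coefficients harder to interpret.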

## 5. Train-Test Split

Split the data into training (80%) and testing (20%) sets with a fixed `random_state` for reproducibility. Splitting before scaling keeps test-set statistics out of the scaler.

In [None]:
# Split the unscaled data into train and test sets (80-20 split)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train_raw.shape[0]} samples")
print(f"Testing set size: {X_test_raw.shape[0]} samples")
print(f"\nTraining set target distribution:\n{y_train.value_counts()}")
print(f"\nTesting set target distribution:\n{y_test.value_counts()}")
print("\n✅ Data split completed successfully!")

## 6. Feature Scaling (Mandatory)

**Important**: Feature scaling matters for models that are sensitive to feature magnitude: regularized, gradient-based models such as Logistic Regression, SVM, and Neural Networks, and distance-based models such as KNN. The scaler is fit on the training set only and then applied to both sets.

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit on the training features only, then transform both sets
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print("✅ Feature scaling applied using StandardScaler")
print(f"\nOriginal feature range (first feature, training set):")
print(f"  Min: {X_train_raw.iloc[:, 0].min():.2f}, Max: {X_train_raw.iloc[:, 0].max():.2f}")
print(f"\nScaled feature range (first feature, training set):")
print(f"  Min: {X_train[:, 0].min():.2f}, Max: {X_train[:, 0].max():.2f}")
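
`StandardScaler`, used in this notebook, rescales each feature to zero mean and unit variance. A minimal sketch of that per-feature arithmetic using only the standard library; the radius values are made up for illustration:

```python
import statistics

radii = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up values, not from the dataset

mean = statistics.fmean(radii)
# Population standard deviation (divide by n), matching StandardScaler's behavior
std = statistics.pstdev(radii)

# z-score: distance from the mean, measured in standard deviations
z_scores = [(r - mean) / std for r in radii]
print([round(z, 3) for z in z_scores])
```

After this transformation the values have mean 0 and standard deviation 1, whatever the original units were.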

## 7. Model Training - Logistic Regression

We'll use **Logistic Regression** for binary classification.

In [None]:
# Initialize and train Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)

# Train the model
model.fit(X_train, y_train)

print("✅ Model training completed!")
print(f"\nModel: Logistic Regression")
print(f"Number of features: {len(selected_features)}")
print(f"Training samples: {X_train.shape[0]}")
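
Because the inputs are standardized, the fitted coefficients can be compared with each other. The sketch below trains a separate copy of the model on the full dataset, purely for inspection, and pairs each coefficient with its feature name. The names `demo`, `clf`, and `coef_by_feature` are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

demo = load_breast_cancer()
all_names = list(demo.feature_names)
cols = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]
col_idx = [all_names.index(c) for c in cols]

# Standardize the five columns and fit a throwaway model for inspection only
X_demo = StandardScaler().fit_transform(demo.data[:, col_idx])
clf = LogisticRegression(max_iter=1000).fit(X_demo, demo.target)

# For a binary problem, coef_ has shape (1, n_features)
coef_by_feature = dict(zip(cols, clf.coef_[0]))

# Print features ordered by absolute weight, largest first
for name, weight in sorted(coef_by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>16}: {weight:+.3f}")
```

A positive weight pushes the prediction toward class 1 (benign) and a negative weight toward class 0 (malignant).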

## 8. Model Predictions

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

print("✅ Predictions completed!")
print(f"\nFirst 10 predictions: {y_pred[:10]}")
print(f"First 10 actual values: {y_test[:10].values}")

## 9. Model Evaluation - Performance Metrics

Calculate accuracy, precision, recall, and F1-score as required.

In [None]:
# Calculate evaluation metrics (scikit-learn treats label 1, benign, as the positive class by default)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("=" * 50)
print("MODEL PERFORMANCE METRICS")
print("=" * 50)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f} ({precision*100:.2f}%)")
print(f"Recall:    {recall:.4f} ({recall*100:.2f}%)")
print(f"F1-Score:  {f1:.4f} ({f1*100:.2f}%)")
print("=" * 50)
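
The four metrics follow directly from the confusion-matrix counts. A minimal sketch of the arithmetic in plain Python; `classification_metrics` is a hypothetical helper and the counts are made up, not results from this notebook:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four metrics from counts (assumes non-zero denominators)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    recall = tp / (tp + fn)                      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only
metrics = classification_metrics(tp=40, fp=2, fn=3, tn=69)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```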

## 10. Classification Report

In [None]:
# Generate classification report
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=['Malignant (0)', 'Benign (1)']))
print("=" * 50)

## 11. Confusion Matrix Visualization

In [None]:
# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Malignant (0)', 'Benign (1)'],
            yticklabels=['Malignant (0)', 'Benign (1)'])
plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()

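# Label 1 (benign) is the positive class, so "negative" means malignant here
print(f"\nTrue Negatives (TN, malignant predicted malignant): {cm[0,0]}")
print(f"False Positives (FP, malignant predicted benign): {cm[0,1]}")
print(f"False Negatives (FN, benign predicted malignant): {cm[1,0]}")
print(f"True Positives (TP, benign predicted benign): {cm[1,1]}")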
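
In this dataset label 0 is malignant and label 1 is benign, and scikit-learn's metric functions report on label 1 unless told otherwise. The sketch below uses small made-up label arrays to show how the `pos_label` parameter switches the metrics to the malignant class:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 0 = malignant, 1 = benign
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Default: label 1 (benign) is the positive class
benign_recall = recall_score(toy_true, toy_pred)

# pos_label=0 reports how well malignant cases are detected
malignant_recall = recall_score(toy_true, toy_pred, pos_label=0)
malignant_precision = precision_score(toy_true, toy_pred, pos_label=0)

# ravel() flattens the 2x2 matrix row by row: tn, fp, fn, tp (relative to label 1)
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

print(f"Benign recall: {benign_recall:.2f}")
print(f"Malignant recall: {malignant_recall:.2f}")
print(f"Malignant precision: {malignant_precision:.2f}")
```

For a screening-style task, malignant recall is often the number of most interest, since a missed malignant case is the costlier error.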

## 12. Save Model and Scaler (Model Persistence)

**Critical**: We must save both the model AND the scaler for use in the web application.

In [None]:
# Save the trained model using joblib
joblib.dump(model, 'breast_cancer_model.pkl')
print("✅ Model saved as 'breast_cancer_model.pkl'")

# Save the scaler (CRITICAL - don't forget this!)
joblib.dump(scaler, 'scaler.pkl')
print("✅ Scaler saved as 'scaler.pkl'")

# Save feature names for reference
joblib.dump(selected_features, 'feature_names.pkl')
print("✅ Feature names saved as 'feature_names.pkl'")

print("\n" + "="*50)
print("MODEL PERSISTENCE COMPLETE")
print("="*50)
print("Files saved:")
print("  1. breast_cancer_model.pkl")
print("  2. scaler.pkl")
print("  3. feature_names.pkl")
print("="*50)
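
An alternative to saving the model and scaler as separate files is to wrap both in a scikit-learn `Pipeline` and persist a single object. This is a sketch of that option, not the approach this notebook takes; the file name `breast_cancer_pipeline.pkl` is hypothetical and the file is written to a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
# The first five columns are the five mean features selected in this notebook
X_five = X_all[:, :5]

# The pipeline scales and then classifies, so the two steps cannot drift apart
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_five, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "breast_cancer_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw (unscaled) inputs directly
same_predictions = bool((restored.predict(X_five) == pipeline.predict(X_five)).all())
print(f"Restored pipeline matches original: {same_predictions}")
```

With a single pipeline file, the consuming application cannot forget to apply the scaler before predicting.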

## 13. Load and Test Saved Model

Demonstrate that the saved model works correctly by loading it and making predictions.

In [None]:
# Load the saved model and scaler
loaded_model = joblib.load('breast_cancer_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
loaded_features = joblib.load('feature_names.pkl')

print("✅ Model, scaler, and feature names loaded successfully!")
print(f"\nLoaded features: {loaded_features}")

# Test with a sample from the test set (y_test keeps the original row labels)
sample_index = y_test.index[0]
sample_input = X.loc[[sample_index]]
actual_label = y.loc[sample_index]

# Scale the input using the loaded scaler
sample_scaled = loaded_scaler.transform(sample_input)

# Make prediction with loaded model
prediction = loaded_model.predict(sample_scaled)[0]
probability = loaded_model.predict_proba(sample_scaled)[0]

print("\n" + "="*50)
print("TEST PREDICTION WITH LOADED MODEL")
print("="*50)
print(f"Sample features: {sample_input.values[0]}")
print(f"\nActual diagnosis: {'Benign' if actual_label == 1 else 'Malignant'}")
print(f"Predicted diagnosis: {'Benign' if prediction == 1 else 'Malignant'}")
print(f"\nPrediction probabilities:")
print(f"  Malignant: {probability[0]:.2%}")
print(f"  Benign: {probability[1]:.2%}")
print("="*50)
print("✅ Loaded model works correctly!")
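
A web application built on these files needs to turn user input into a scaled row and then into a prediction. The sketch below shows one way such a helper might look. `predict_diagnosis`, `FEATURES`, and the input dictionary are hypothetical, and a small stand-in model is trained inline so the block runs on its own; a real app would load the saved `.pkl` files instead:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness"]


def predict_diagnosis(values: dict, model, scaler) -> dict:
    """Scale one set of feature values and return the label with class probabilities."""
    missing = [name for name in FEATURES if name not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Build a single-row 2D array in the same column order used for training
    row = np.array([[float(values[name]) for name in FEATURES]])
    proba = model.predict_proba(scaler.transform(row))[0]  # [P(malignant), P(benign)]
    label = "Benign" if proba[1] >= proba[0] else "Malignant"
    return {"label": label, "malignant": float(proba[0]), "benign": float(proba[1])}


# Stand-ins for the saved model and scaler (assumption: trained here for the demo)
raw, target = load_breast_cancer(return_X_y=True)
raw = raw[:, :5]  # the five mean features, in FEATURES order
demo_scaler = StandardScaler().fit(raw)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(raw), target)

example = dict(zip(FEATURES, raw[0]))
result = predict_diagnosis(example, demo_model, demo_scaler)
print(result)
```

Validating the input keys up front gives the app a clear error message instead of a shape mismatch inside scikit-learn.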

## 🎉 Model Building Complete!

**Summary**:
- ✅ Dataset loaded and explored
- ✅ 5 features selected
- ✅ Data preprocessed with StandardScaler
- ✅ Model trained using Logistic Regression
- ✅ Performance metrics computed (accuracy, precision, recall, F1-score)
- ✅ Model and scaler saved successfully
- ✅ Saved model tested and verified

**Next Steps**:
1. Build the web application (app.py)
2. Deploy to Streamlit Cloud
3. Submit to Scorac.com

---

**⚠️ DISCLAIMER**: This model is for educational purposes only and should NOT be used for actual medical diagnosis.