# MACHINE LEARNING MODEL IMPLEMENTATION
Creating a predictive classification model using **scikit‑learn** on the Breast Cancer Wisconsin dataset.

## 1. Introduction
This notebook demonstrates a complete machine‑learning workflow:
1. Loading a real‑world dataset from `scikit-learn`.
2. Exploratory data analysis (EDA).
3. Data preprocessing with a pipeline.
4. Model training using **RandomForestClassifier**.
5. Performance evaluation with accuracy, ROC‑AUC, and cross‑validation.
6. Visualising results (confusion matrix & ROC curve).

Feel free to swap the model or dataset and reuse this template for your own projects.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

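# Plotting settings. The magic runs first because some IPython versions
# reset rcParams when the inline backend is activated.
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)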

In [None]:
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Quick glimpse
pd.concat([X, y], axis=1).head()
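
The introduction lists exploratory data analysis as a step, but the cell above only previews the first rows. A minimal sketch of a few further checks could look like the following; the names `class_counts`, `missing_total`, and `summary` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the data so this cell runs on its own
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Dataset dimensions: rows are samples, columns are features
print('Shape:', X.shape)

# Class balance, labelled with the dataset's own class names
class_counts = y.map(dict(enumerate(data.target_names))).value_counts()
print(class_counts)

# Total number of missing values across all features
missing_total = int(X.isna().sum().sum())
print('Missing values:', missing_total)

# Per-feature summary statistics (transposed so features are rows)
summary = X.describe().T[['mean', 'std', 'min', 'max']]
print(summary.head())
```

The class counts show a moderate imbalance between the two classes, which is one reason the train/test split uses `stratify=y`.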

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
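# Build preprocessing + model pipeline.
# Random forests do not need feature scaling; the scaler is kept so that
# scale-sensitive models (e.g. SVM) can be swapped in without other changes.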
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=200, random_state=42))
])

# Train the model
model.fit(X_train, y_train)

# Evaluate on test data
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy on test set: {acc:.4f}')

In [None]:
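# Classification report & confusion matrix.
# Concatenate instead of passing two print arguments, which would insert a
# space and misalign the report's header row.
print('\nClassification Report:\n' + classification_report(
    y_test, y_pred, target_names=data.target_names))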
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
ax.figure.colorbar(im, ax=ax)
ax.set(
    xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]),
    xticklabels=data.target_names, yticklabels=data.target_names,
    title='Confusion Matrix', ylabel='True label', xlabel='Predicted label'
)

# Loop over data dimensions and create text annotations.
fmt = 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], fmt),
                ha='center', va='center',
                color='white' if cm[i, j] > thresh else 'black')
plt.show()

In [None]:
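# ROC curve. Column 1 of predict_proba is class 1, which is 'benign' in this
# dataset (0 = malignant), so 'benign' is the positive class here.
y_prob = model.predict_proba(X_test)[:, 1]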
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

In [None]:
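# Cross-validation accuracy on the full dataset. The whole pipeline is refit
# inside each fold, so the scaler never sees that fold's held-out data.
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Cross-validation accuracies:', cv_scores)
print(f'Mean CV accuracy: {cv_scores.mean():.4f}')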
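
The introduction names both accuracy and ROC‑AUC as evaluation metrics, while the cell above cross‑validates accuracy only. One possible way to score both on the same folds is `cross_validate` with a list of scorers. This is a sketch under assumptions: the shuffled `StratifiedKFold` splitter and the smaller `n_estimators=100` (chosen to keep the run short) are not taken from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same pipeline shape as above, with fewer trees (assumption, for speed)
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score accuracy and ROC-AUC on the same folds in one pass
cv_results = cross_validate(model, X, y, cv=cv, scoring=['accuracy', 'roc_auc'])

for metric in ('accuracy', 'roc_auc'):
    scores = cv_results[f'test_{metric}']
    print(f'{metric}: mean={scores.mean():.4f}, std={scores.std():.4f}')
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.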

## 2. Conclusion
The **RandomForestClassifier** pipeline is evaluated on the Breast Cancer dataset by test‑set accuracy, ROC‑AUC, and 5‑fold cross‑validation. High scores on all three indicate that the model effectively distinguishes between malignant and benign tumors.

### Next Steps
- Hyperparameter tuning (e.g., via `GridSearchCV`).
- Feature importance analysis.
- Experiment with other algorithms (SVM, XGBoost, etc.).
- Apply the same pipeline to your own dataset.
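
The first two next steps can be sketched together as follows. The parameter grid is hypothetical and deliberately small so that it runs quickly, and the names `pipeline`, `param_grid`, `search`, and `top_features` are illustrative. The importances shown are the forest's built‑in impurity‑based values, which is only one of several ways to rank features.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Hypothetical grid; the 'clf__' prefix routes each parameter to the
# pipeline step named 'clf'
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 5],
}

# Search on the training split only, so the test split stays unseen
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='roc_auc')
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print(f'Best CV ROC-AUC: {search.best_score_:.4f}')
# score() uses the same scorer as the search, i.e. ROC-AUC
print(f'Test ROC-AUC of refit model: {search.score(X_test, y_test):.4f}')

# Impurity-based feature importances from the best forest
best_forest = search.best_estimator_.named_steps['clf']
importances = pd.Series(best_forest.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```

A wider grid or `RandomizedSearchCV` would be the natural extension once this small search runs end to end.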