# Sonar Rock vs. Mine Prediction

## 1. Introduction
This notebook explores the Sonar dataset to classify sonar signals as either "Rock" (R) or "Mine" (M). The dataset contains 208 patterns obtained by bouncing sonar signals off a metal cylinder (mine) and a roughly cylindrical rock at various angles and conditions. Each pattern consists of 60 numbers representing the energy within a particular frequency band.

**Objective:** Build and evaluate machine learning models to accurately classify the object type based on sonar readings.

**Workflow:**
1.  **Data Loading & Overview:** Load the dataset and inspect structure.
2.  **Exploratory Data Analysis (EDA):** Visualize class balance and feature correlations.
3.  **Data Preprocessing:** Prepare data for modeling (splitting, encoding).
4.  **Model Benchmarking:** Train multiple classifiers (SVC, XGBoost, etc.) and evaluate using Cross-Validation.
5.  **Model Selection:** Compare models based on ROC-AUC and Accuracy.
6.  **Explainability:** Use SHAP values to interpret the best model's predictions.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Plotting style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Sklearn imports
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

import shap


## 2. Data Loading & Overview
We load the dataset and perform basic quality checks.


In [None]:
# Load dataset
file_path = 'Copy of sonar data.csv'
# The dataset does not have a header, so we load with header=None
df = pd.read_csv(file_path, header=None)

# Rename columns to X1...X60 for features and 'Target' for the label
X_cols = [f'X{i}' for i in range(1, 61)]
df.columns = X_cols + ['Target']

print(f"Dataset Shape: {df.shape}")
display(df.head())


In [None]:
# Check for missing values and data types
df.info()


## 3. Exploratory Data Analysis (EDA)
We examine the distribution of the target variable and correlations between features.


In [None]:
# Target Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='Target', data=df, palette='viridis')
plt.title('Target Class Distribution (R=Rock, M=Mine)')
plt.show()

print(df['Target'].value_counts())


In [None]:
# Convert Target to numeric for analysis. LabelEncoder sorts labels alphabetically, so Mine ('M') = 0 and Rock ('R') = 1
le = LabelEncoder()
df['Target_Num'] = le.fit_transform(df['Target'])
print(f"Classes: {le.classes_}")
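
Because `LabelEncoder` assigns integer codes in sorted label order, it is worth checking the mapping explicitly before interpreting any metric that depends on which class counts as "positive". A minimal check, using a hypothetical toy label list (`toy_labels`) rather than the notebook's `df`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for df['Target']
toy_labels = ['R', 'M', 'M', 'R', 'M']

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so index 0 is 'M' and index 1 is 'R'
mapping = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(mapping)             # {'M': 0, 'R': 1}
print(toy_codes.tolist())  # [1, 0, 0, 1, 0]
```

If Mine should be the positive class (code 1), an explicit dictionary mapping such as `{'R': 0, 'M': 1}` applied with `Series.map` is one alternative to `LabelEncoder`.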


**Feature Correlations:**
Visualizing correlations helps identify redundant features or strong relationships.
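
One way to turn a correlation matrix into a concrete list of candidate redundant features is to rank feature pairs by absolute correlation. The sketch below uses a small synthetic DataFrame; the column values and the helper `top_correlated_pairs` are illustrative assumptions, not part of the notebook. The same function could be applied to the feature columns of `df`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'X1': base,
    'X2': base + rng.normal(scale=0.1, size=200),  # nearly a copy of X1
    'X3': rng.normal(size=200),                    # independent noise
})

top_pairs = top_correlated_pairs(toy, n=1)
print(top_pairs)
```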


In [None]:
# Correlation Heatmap
# All 60 features (plus Target_Num) are plotted to show macro patterns;
# plot a subset of columns instead if the full heatmap is hard to read.
plt.figure(figsize=(12, 10))
corr_matrix = df.drop('Target', axis=1).corr()
sns.heatmap(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Feature Correlation Matrix')
plt.show()


## 4. Data Preprocessing
We separate features and target, then split the data into training and testing sets.
Note: Scaling (StandardScaler) will be applied within the modeling pipeline to prevent data leakage.
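
To see why the scaler belongs inside the pipeline, note that a `StandardScaler` fitted through a `Pipeline` learns its mean and variance from the training rows only. A minimal sketch on synthetic data (the `toy_*` arrays are hypothetical stand-ins for `X_train` and `X_test`, with the test rows deliberately shifted so that leakage would be visible):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy_X_train = rng.normal(loc=0.0, size=(80, 3))
toy_X_test = rng.normal(loc=5.0, size=(20, 3))   # shifted on purpose
toy_y_train = (toy_X_train[:, 0] > 0).astype(int)

toy_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
toy_pipe.fit(toy_X_train, toy_y_train)

# The fitted scaler's mean matches the training mean, not the pooled mean
scaler_mean = toy_pipe.named_steps['scaler'].mean_
train_mean = toy_X_train.mean(axis=0)
pooled_mean = np.vstack([toy_X_train, toy_X_test]).mean(axis=0)
uses_train_only = bool(np.allclose(scaler_mean, train_mean))
uses_pooled = bool(np.allclose(scaler_mean, pooled_mean))
print(uses_train_only)  # True
print(uses_pooled)      # False
```

Fitting the scaler on the full dataset before splitting would instead bake test-set statistics into the training features.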


In [None]:
X = df.drop(['Target', 'Target_Num'], axis=1)
y = df['Target_Num']

# Stratified Split to maintain class balance in train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print(f"Training shape: {X_train.shape}")
print(f"Testing shape: {X_test.shape}")


## 5. Model Benchmarking
We train and evaluate a variety of models to find the best baseline. We use a **Pipeline** to ensure features are scaled correctly within each fold of Cross-Validation.
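
The workflow lists both ROC-AUC and accuracy as comparison metrics, while the benchmarking loop cross-validates on ROC-AUC only. If both are wanted from the same folds, `cross_validate` accepts a list of scorer names. A sketch on synthetic data from `make_classification` (a stand-in for the sonar features, not the real dataset; the `toy_*` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=42)

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(random_state=42)),
])
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The whole pipeline is cloned and re-fitted per fold, so the scaler
# only ever sees that fold's training portion.
cv_out = cross_validate(toy_pipeline, toy_X, toy_y, cv=toy_cv,
                        scoring=['roc_auc', 'accuracy'])

for metric in ('roc_auc', 'accuracy'):
    scores = cv_out[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```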


In [None]:
# Initialize Dictionary of Models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, solver='liblinear'),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(random_state=42, probability=True),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss'),
    'MLP Classifier': MLPClassifier(random_state=42, max_iter=1000)
}

# Store results
results = []
trained_models = {}

print("Training and evaluating models...")

# CV Strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    # Construct Pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model)
    ])
    
    # 1. Cross-Validation (on Train set)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
    
    # 2. Train on full Training set
    pipeline.fit(X_train, y_train)
    trained_models[name] = pipeline
    
    # 3. Evaluate on Test set
    y_pred = pipeline.predict(X_test)
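    # Prefer predict_proba; fall back to decision_function, which roc_auc_score also accepts
    if hasattr(pipeline['classifier'], 'predict_proba'):
        y_proba = pipeline.predict_proba(X_test)[:, 1]
        auc_score = roc_auc_score(y_test, y_proba)
    elif hasattr(pipeline['classifier'], 'decision_function'):
        auc_score = roc_auc_score(y_test, pipeline.decision_function(X_test))
    else:
        # No continuous score available: record NaN rather than a misleading 0.0
        auc_score = np.nan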
    
    # Metrics (f1_score treats class 1 as positive, which is Rock under LabelEncoder's alphabetical ordering)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'CV ROC-AUC': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Test Accuracy': acc,
        'Test F1': f1,
        'Test ROC-AUC': auc_score
    })

# Create DataFrame
results_df = pd.DataFrame(results).sort_values(by='Test ROC-AUC', ascending=False)
results_df


## 6. Model Comparison & Visualization
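We compare the models by their test ROC-AUC and plot ROC curves for the top five. Because the 20% test split holds only about 40 samples, small differences in test ROC-AUC may be within noise, so the cross-validated scores in `results_df` are a useful cross-check on the ranking.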
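
As a reminder of what is being compared: ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting one half. The hypothetical helper `pairwise_auc` below computes that directly on made-up scores; it illustrates the definition and is not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_true, toy_score)
print(toy_auc)  # 0.75 (3 of the 4 positive/negative pairs are ordered correctly)
```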


In [None]:
# Bar plot of Test ROC-AUC Scores
plt.figure(figsize=(10, 6))
sns.barplot(x='Test ROC-AUC', y='Model', data=results_df, palette='viridis')
plt.title('Model Comparison: Test ROC-AUC')
plt.xlim(0, 1.05) # Extend slightly to see 1.0 clearly
plt.show()


In [None]:
# Plot ROC Curves for Top 5 Models
plt.figure(figsize=(10, 8))

top_models = results_df.head(5)['Model']

for name in top_models:
    pipeline = trained_models[name]
    if hasattr(pipeline['classifier'], 'predict_proba'):
        y_proba = pipeline.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc = roc_auc_score(y_test, y_proba)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Chance')
plt.title('ROC Curves (Top 5 Models)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()


## 7. Explainability (SHAP Analysis)
We interpret the predictions of the best-performing model (ranked by test ROC-AUC) using SHAP (SHapley Additive exPlanations). This helps us understand which frequency bands (features) are most critical for distinguishing Mines from Rocks.
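
The property that makes SHAP values readable is additivity: for one sample, the per-feature contributions sum to the difference between the model's output for that sample and its output on a baseline. The sketch below computes exact Shapley values by brute force for a hypothetical three-feature function. The names `toy_model`, `baseline`, and `exact_shapley` are illustrative; with 60 features, enumerating every coalition like this is infeasible, which is why sampling-based approximations such as KernelExplainer with a background sample are used instead.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # Hypothetical model: one additive term plus an interaction between features 1 and 2
    return 2.0 * x[0] + x[1] * x[2] + 1.0

def exact_shapley(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's value, the rest the baseline value
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi += weight * (with_i - without_i)
        phis.append(phi)
    return phis

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phis = exact_shapley(toy_model, x, baseline)

print(phis)                                   # approximately [2.0, 3.0, 3.0]
print(sum(phis))                              # approximately 8.0
print(toy_model(x) - toy_model(baseline))     # 8.0
```

The interaction term `x[1] * x[2]` is split evenly between the two features involved, which is the kind of attribution a summary plot aggregates across samples.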


In [None]:
# Select Best Model
best_model_name = results_df.iloc[0]['Model']
best_pipeline = trained_models[best_model_name]
best_classifier = best_pipeline.named_steps['classifier']

print(f"Analyzing Best Model: {best_model_name}")

# Prepare data for SHAP
# Note: SHAP needs the actual values seen by the model (i.e., Scaled values)
scaler = best_pipeline.named_steps['scaler']
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames so the feature names are preserved
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)

# Initialize Explainer
# KernelExplainer is model-agnostic, so it works whichever model ranks first
# (TreeExplainer would be faster if the best model is tree-based).
# A 50-row background sample of X_train keeps KernelExplainer's runtime manageable.
background_data = shap.utils.sample(X_train_scaled_df, 50, random_state=42)

# Wrap prediction function to ensure it expects the right input format
# (KernelExplainer passes numpy arrays usually)
def predict_fn(data):
    return best_classifier.predict_proba(data)

explainer = shap.KernelExplainer(predict_fn, background_data)

# Calculate SHAP values for a subset of test data
X_test_sample = shap.utils.sample(X_test_scaled_df, 20, random_state=42)
shap_values = explainer.shap_values(X_test_sample)

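# Visualization: Summary Plot
# Older SHAP versions return a list [class_0_shap, class_1_shap]; newer versions return
# a single array of shape (n_samples, n_features, n_classes). Handle both.
# Class 1 is le.classes_[1], i.e. Rock ('R'), because LabelEncoder sorts labels alphabetically.
# For a binary predict_proba output, the Mine (class 0) values are the same with opposite sign.
if isinstance(shap_values, list):
    shap_values_class1 = shap_values[1]
else:
    shap_values_class1 = np.asarray(shap_values)[:, :, 1]

print(f"SHAP Summary Plot for Class: {le.classes_[1]}")
plt.figure()
shap.summary_plot(shap_values_class1, X_test_sample, feature_names=X.columns, show=False)
plt.title(f'SHAP Feature Importance ({best_model_name})')
plt.show()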


## 8. Conclusion
*   **Data Quality:** The sonar dataset is clean with no missing values.
*   **Best Model:** The model ranked first in `results_df` by test ROC-AUC, expected to be the **Support Vector Machine (SVC)**, is selected as the best performer; confirm the name and scores against the results table after running the notebook.
*   **Key Features:** The SHAP summary plot ranks the frequency bands (features) by their influence on the selected model's predictions, which adds interpretability to an otherwise "black box" model.
*   **Next Steps:** Performance could potentially be improved with hyperparameter tuning (for example `GridSearchCV`) or by collecting more data.
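
As a starting point for the tuning step, a grid search can wrap the same scaler-plus-classifier pipeline so that scaling stays inside each fold. The sketch below runs on synthetic data from `make_classification` rather than the sonar data, and the grid values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, y_train)
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=42)

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__gamma': ['scale', 0.01, 0.1],
}

search = GridSearchCV(
    toy_pipeline,
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.3f}")
```

Running the search on the training split only, and scoring the tuned pipeline once on the held-out test set afterwards, keeps the final estimate independent of the tuning.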

