# RSSI Localization Pipeline
This notebook documents the full neural-network + local KNN pipeline used to locate the WiFi probe on the board. Each cell explains how a raw RSSI vector is transformed into an embedding, how the L-KNN votes, and which diagnostics you can read to understand the prediction.

## Prerequisites
- Run `PYTHONPATH=src python -m localization.pipeline ...` at least once to produce `reports/localizer.joblib` and the generated metrics (confusion matrix, JSON report, etc.).
- Work inside the same Python environment (`.venv`) so that the imported modules and dependencies match those used during training.
- Keep the measurement folders `ddeuxmetres/` and `dquatremetres/` available since this notebook loads them directly.
- Glossary of terms used below:
  - **RSSI (Received Signal Strength Indicator)**: WiFi power in dBm reported by the router for each antenna.
  - **Embedding**: compact vector representation produced by the MLP to summarize the RSSI pattern of a cell.
  - **Local KNN (L-KNN)**: distance-weighted K-nearest neighbors operating in the embedding space to predict the cell ID.

## Phase 1 - Prepare the tooling

In [None]:
# Import required modules: filesystem helpers, plotting, math utilities, and our local code.
from pathlib import Path
import sys

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display, Markdown, Image
from sklearn.decomposition import PCA

plt.style.use('seaborn-v0_8')  # readable styling for plots

# Define important paths: project root, measurement folders, trained model, and the cell layout image.
PROJECT_ROOT = Path('..').resolve()
DATA_FOLDERS = [PROJECT_ROOT / 'ddeuxmetres', PROJECT_ROOT / 'dquatremetres']
MODEL_PATH = PROJECT_ROOT / 'reports' / 'localizer.joblib'
GRID_IMAGE_PATH = PROJECT_ROOT / 'lesgridcells.png'

# Allow imports from the local `src/` directory.
sys.path.append(str(PROJECT_ROOT / 'src'))

from localization.data import CampaignSpec, load_measurements, DEFAULT_CELL_WIDTH_M, DEFAULT_CELL_HEIGHT_M
from localization.embedding_knn import EmbeddingKnnLocalizer, _apply_activation

FEATURE_COLUMNS = ["Signal", "Noise", "signal_A1", "signal_A2", "signal_A3", "router_distance_m"]

print(f"Project root: {PROJECT_ROOT}")
print(f"Expected model path: {MODEL_PATH}")

REPORTS_DIR = PROJECT_ROOT / "reports"
REPORTS_DIR.mkdir(exist_ok=True)


We configure project paths, add `src/` to `sys.path`, and import the helper modules (`localization.data`, `localization.embedding_knn`). This also defines the RSSI feature list (`FEATURE_COLUMNS`).

## Phase 2 - Load and combine measurements

In [None]:
# Build the list of campaigns (2 m + 4 m) that are actually present on disk.
campaigns = [CampaignSpec(path) for path in DATA_FOLDERS if path.exists()]
if not campaigns:
    raise RuntimeError("No campaign found. Verify that `ddeuxmetres/` and `dquatremetres/` exist.")

# Merge every CSV into a single DataFrame while keeping grid/campaign metadata.
df = load_measurements(campaigns)
print(f"Loaded {len(df)} rows covering {df['grid_cell'].nunique()} cells.")
display(df.head())

# Quick lookup table to retrieve spatial info (grid indices, metric coordinates, campaign name).
cell_lookup = (
    df[["grid_cell", "grid_x", "grid_y", "coord_x_m", "coord_y_m", "campaign"]]
    .drop_duplicates("grid_cell")
    .set_index("grid_cell")
)
GRID_WIDTH_M = df["coord_x_m"].max() + DEFAULT_CELL_WIDTH_M / 2
GRID_HEIGHT_M = df["coord_y_m"].max() + DEFAULT_CELL_HEIGHT_M / 2


The campaigns `ddeuxmetres` and `dquatremetres` are merged into a single DataFrame. Each row contains:
- the RSSI values (`Signal`, `Noise`, `signal_Ai`),
- the discrete grid indices (`grid_x`, `grid_y`) and the corresponding metric coordinates,
- the campaign label (router distance).

By centralizing everything, we can train and evaluate without any extra preprocessing outside Python.

### Looking at a raw CSV file

In [None]:
# Pick the first CSV available to illustrate the raw data structure generated by collect_wifi.sh.
example_csv = None
for folder in DATA_FOLDERS:
    if not folder.exists():
        continue
    candidates = sorted(folder.glob('*.csv'))
    if candidates:
        example_csv = candidates[0]
        break
if example_csv is None:
    raise FileNotFoundError("No CSV found in the campaign folders.")

print(f"Example raw file: {example_csv.relative_to(PROJECT_ROOT)}")
raw_example = pd.read_csv(example_csv)
display(raw_example.head())


This preview of a raw CSV shows why `load_measurements` enriches the data: the files record only RSSI values. Grid metadata and router distance are injected in Python to make the dataset self-contained.
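As a rough illustration of the kind of metadata that gets injected, the sketch below derives grid indices and cell-center coordinates from a cell identifier such as `1_4`. The helper name `describe_cell`, the `<grid_x>_<grid_y>` reading of the identifier, the zero-based indexing, and the 0.25 m cell size are all assumptions made for this example; the actual logic lives in `load_measurements` and may differ.

```python
from typing import NamedTuple


class CellInfo(NamedTuple):
    grid_x: int
    grid_y: int
    coord_x_m: float
    coord_y_m: float


def describe_cell(cell_id: str, cell_width_m: float, cell_height_m: float) -> CellInfo:
    """Derive grid indices and cell-center coordinates from an id such as '1_4'.

    Assumption: the id reads '<grid_x>_<grid_y>' with zero-based indices.
    """
    x_txt, y_txt = cell_id.split("_")
    grid_x, grid_y = int(x_txt), int(y_txt)
    return CellInfo(
        grid_x=grid_x,
        grid_y=grid_y,
        # The center of a cell sits half a cell past its lower edge.
        coord_x_m=(grid_x + 0.5) * cell_width_m,
        coord_y_m=(grid_y + 0.5) * cell_height_m,
    )


# Hypothetical 0.25 m x 0.25 m cells, chosen only for the example.
info = describe_cell("1_4", cell_width_m=0.25, cell_height_m=0.25)
print(info)
```

With cell centers defined this way, the largest center coordinate plus half a cell gives the full board extent, which matches how `GRID_WIDTH_M` and `GRID_HEIGHT_M` are computed earlier in this notebook.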

## Phase 3 - Load the trained model

In [None]:
# Load the trained model (MLP encoder + L-KNN).
if not MODEL_PATH.exists():
    raise FileNotFoundError("Trained model missing. Rerun localization.pipeline to generate reports/localizer.joblib.")

localizer: EmbeddingKnnLocalizer = joblib.load(MODEL_PATH)
print("MLP architecture:", localizer.encoder_.hidden_layer_sizes)
print("KNN parameters:", localizer.knn_.get_params())
print(f"Iterations performed: {localizer.encoder_.n_iter_}")
print(f"Final loss value: {localizer.encoder_.loss_curve_[-1]:.4f}")


The trained model ships two components:
1. a supervised MLP encoder (32-dimensional embedding) implemented with `scikit-learn`,
2. a distance-weighted KNN operating on those embeddings.

`localizer.encoder_` and `localizer.knn_` are therefore available for inspection throughout this notebook.
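To make the two-stage idea concrete, here is a minimal sketch on synthetic data. It is not the code of `EmbeddingKnnLocalizer`, whose internals are not shown in this notebook. The helper `embed`, the three toy cells, the layer sizes, and the ReLU activation (scikit-learn's default) are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic fingerprints: three cells, two RSSI-like features each (made-up values).
rng = np.random.default_rng(0)
centers = np.array([[-60.0, -55.0], [-50.0, -65.0], [-70.0, -60.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(40, 2)) for c in centers])
y = np.repeat(["0_0", "0_1", "1_0"], 40)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Stage 1: supervised MLP trained to classify the cell.
encoder = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
encoder.fit(X_scaled, y)


def embed(mlp, data):
    """Return the last hidden layer activations (ReLU assumed)."""
    hidden = data
    for weights, bias in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        hidden = np.maximum(hidden @ weights + bias, 0.0)
    return hidden


# Stage 2: distance-weighted KNN fitted on the embeddings.
train_embeddings = embed(encoder, X_scaled)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(train_embeddings, y)

train_accuracy = float((knn.predict(train_embeddings) == y).mean())
print(train_embeddings.shape, train_accuracy)
```

The design choice worth noting is that the MLP's output layer is only used during training; at prediction time the KNN works on the last hidden layer.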

### Inspect training loss convergence

In [None]:
plt.figure(figsize=(6, 4))
plt.plot(localizer.encoder_.loss_curve_, marker='o')  # loss recorded by scikit-learn
plt.title("MLP training loss")
plt.xlabel("Adam iterations")
plt.ylabel("Loss (cross-entropy)")
plt.grid(True, alpha=0.4)
plt.show()


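When the loss curve flattens, the encoder has converged under the current hyper-parameters. If the loss is still decreasing at the last iteration, raise `max_iter`; if it oscillates, lower `learning_rate_init`. The L2 penalty `alpha` can be tuned alongside these two.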
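A flat curve can also be checked numerically. The sketch below is one possible heuristic, not something the pipeline does: the function name `has_plateaued`, the window of 10 iterations, and the relative tolerance are assumptions to adapt.

```python
def has_plateaued(loss_curve, window=10, rel_tol=1e-3):
    """Return True when the relative loss decrease over the last `window`
    iterations is below `rel_tol` (this includes a loss that went up)."""
    if len(loss_curve) <= window:
        return False  # not enough history to judge
    earlier, latest = loss_curve[-window - 1], loss_curve[-1]
    improvement = (earlier - latest) / max(abs(earlier), 1e-12)
    return improvement < rel_tol


# Two synthetic curves: one still dropping, one that has flattened near 0.1.
still_dropping = [1.0 / (i + 1) for i in range(50)]
flattened = [2.0 * 0.5 ** i + 0.1 for i in range(50)]

print(has_plateaued(still_dropping))  # False
print(has_plateaued(flattened))       # True
```

In this notebook the function could be called on `localizer.encoder_.loss_curve_`.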

## Phase 4 - Select a sample to inspect

In [None]:
# Define the grid cell (and optionally the campaign) we want to inspect.
TARGET_CELL = '1_4'      # change this to inspect another location
CAMPAIGN_NAME = None      # set to 'ddeuxmetres' or 'dquatremetres' to filter a campaign

subset = df[df['grid_cell'] == TARGET_CELL]
if CAMPAIGN_NAME:
    subset = subset[subset['campaign'] == CAMPAIGN_NAME]
if subset.empty:
    raise ValueError(f"No sample found for {TARGET_CELL} (campaign={CAMPAIGN_NAME}).")

# Random but reproducible pick thanks to random_state.
sample = subset.sample(1, random_state=7)
sample_features = sample[FEATURE_COLUMNS]
sample_meta = sample[['grid_cell', 'grid_x', 'grid_y', 'coord_x_m', 'coord_y_m', 'campaign']]

display(Markdown(
    f"### Selected sample: cell `{sample_meta.iloc[0]['grid_cell']}` (campaign `{sample_meta.iloc[0]['campaign']}`)"
))
display(sample_features)
display(sample_meta)


This cell selects the sample to analyze (`TARGET_CELL`) and produces two frames:
- `sample_features`: the six model inputs listed in `FEATURE_COLUMNS` (five signal features plus `router_distance_m`),
- `sample_meta`: grid indices, physical coordinates and campaign metadata used for validation.

### 4.1 - Standardization (centering / scaling)

In [None]:
# The scaler stores the training-set mean and standard deviation for each feature.
scaler = localizer.scaler_
raw_vec = sample_features.to_numpy()
scaled_vec = scaler.transform(raw_vec)

scaling_table = pd.DataFrame(
    {
        'feature': FEATURE_COLUMNS,
        'raw_value': raw_vec.flatten(),
        'training_mean': scaler.mean_,
        'training_std': scaler.scale_,
        'standardized_value': scaled_vec.flatten(),
    }
)
display(scaling_table)


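`StandardScaler` rescales every feature to zero mean and unit variance, using statistics computed on the training set. Without this normalization, features on larger numeric scales (for example `Signal` in dBm versus `router_distance_m` in meters) would dominate the MLP gradients and the Euclidean distance used by L-KNN.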
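The transformation itself is `z = (x - mean) / std`, applied column by column. The worked example below uses made-up values for two columns, a signal level in dBm and a router distance in meters; the numbers are illustrative and are not taken from the measurement campaigns.

```python
import numpy as np

# Toy training set: column 0 = signal (dBm), column 1 = router distance (m).
train = np.array([
    [-52.0, 2.0],
    [-60.0, 2.0],
    [-68.0, 4.0],
    [-60.0, 4.0],
])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

sample = np.array([-52.0, 4.0])
z = (sample - mean) / std
print(mean, std, z)
```

The signal column spreads over 16 dB while the distance column spreads over 2 m, yet after scaling both deviations are of comparable size, so neither column dominates a Euclidean distance.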

### 4.2 - Physical location of the sample

In [None]:

coord_text = (
    f"Physical coordinates (m): x = {sample_meta.iloc[0]['coord_x_m']:.3f}, y = {sample_meta.iloc[0]['coord_y_m']:.3f}<br>"
    f"Grid index: (grid_x={sample_meta.iloc[0]['grid_x']}, grid_y={sample_meta.iloc[0]['grid_y']})"
)
display(Markdown(coord_text))


Physical coordinates (meters) are also used when computing the Euclidean localization error `error_m = ||coord_true - coord_pred||`. The values correspond to the center of the instrumented cell.
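The error formula above can be sketched in a few lines. The function name `localization_error_m` and the coordinates are assumptions for this example; it only illustrates the Euclidean distance between two cell centers.

```python
import math


def localization_error_m(coord_true, coord_pred):
    """Euclidean distance in meters between true and predicted cell centers."""
    return math.dist(coord_true, coord_pred)


# Hypothetical centers of two horizontally adjacent cells, 0.25 m apart.
error_m = localization_error_m((0.375, 1.125), (0.625, 1.125))
print(f"error_m = {error_m:.3f}")
```

A correct prediction yields an error of exactly 0, and a prediction one cell off yields an error equal to the cell spacing.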

### 4.3 - Propagate through the neural network

In [None]:

# Pass through every MLP layer to inspect how the sample is transformed.
activations = []
activation = scaled_vec
weight_count = len(localizer.encoder_.coefs_)
for idx, (weights, bias) in enumerate(zip(localizer.encoder_.coefs_, localizer.encoder_.intercepts_)):
    linear = activation @ weights + bias  # linear transformation of the current layer
    is_output = idx == weight_count - 1
    layer_name = 'output_logits' if is_output else f'hidden_{idx+1}'
    if not is_output:
        activation = _apply_activation(linear, localizer.encoder_.activation)
    else:
        activation = linear
    activations.append(
        {
            'layer': layer_name,
            'units': linear.shape[1],
            'min': float(activation.min()),
            'max': float(activation.max()),
            'preview_first5': np.round(activation[0, :5], 4).tolist(),
        }
    )

display(pd.DataFrame(activations))


Each hidden layer applies a linear transformation followed by the activation stored in `localizer.encoder_.activation` (`ReLU`). The min/max columns make it easy to check that the activations are neither all zero (dead units) nor growing unreasonably large. The embedding consumed by L-KNN is simply the last hidden layer (32 values).
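The same layer-by-layer propagation can be reproduced with plain NumPy. The sketch below uses random toy weights and assumes ReLU activations; the function name `forward_hidden` and the layer sizes are illustrative and do not match the trained encoder.

```python
import numpy as np


def forward_hidden(x, coefs, intercepts):
    """Return the activations of every hidden layer for a batch `x` (ReLU assumed)."""
    outputs = []
    hidden = x
    # The last weight matrix belongs to the output layer, so it is skipped here.
    for weights, bias in zip(coefs[:-1], intercepts[:-1]):
        hidden = np.maximum(hidden @ weights + bias, 0.0)
        outputs.append(hidden)
    return outputs


rng = np.random.default_rng(0)
# Toy network: 6 inputs -> 8 hidden -> 4 hidden -> 3 output logits.
coefs = [rng.normal(size=(6, 8)), rng.normal(size=(8, 4)), rng.normal(size=(4, 3))]
intercepts = [np.zeros(8), np.zeros(4), np.zeros(3)]

x = rng.normal(size=(1, 6))
hidden_layers = forward_hidden(x, coefs, intercepts)
embedding = hidden_layers[-1]
print(embedding.shape)  # (1, 4)
```

Passing `localizer.encoder_.coefs_` and `localizer.encoder_.intercepts_` instead of the toy weights should reproduce the table above, provided the encoder does use ReLU.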

In [None]:

embedding = localizer.transform(sample_features)
print(f"Embedding shape: {embedding.shape}")
embedding_df = pd.DataFrame(embedding, columns=[f"e{i}" for i in range(embedding.shape[1])])
display(embedding_df.round(4))


The 32-component vector (`e0..e31`) is the input to the L-KNN. Cells that are close in the physical grid should produce embeddings that are close in this space; otherwise the neighbor search would be unstable.
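One way to test this expectation is to compare pairwise physical distances with pairwise embedding distances. The sketch below computes their Pearson correlation on toy data; the function names and the four-cell layout are assumptions, and in practice one would feed per-cell mean embeddings and the coordinates from `cell_lookup`.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_agreement(coords_m, embeddings):
    """Pearson correlation between physical and embedding distances (each pair counted once)."""
    upper = np.triu_indices(len(coords_m), k=1)
    d_phys = pairwise_distances(coords_m)[upper]
    d_emb = pairwise_distances(embeddings)[upper]
    return float(np.corrcoef(d_phys, d_emb)[0, 1])


# Four toy cells on a unit square.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# An embedding that preserves the layout (scaled copy) versus one with two cells swapped.
faithful = np.hstack([3.0 * coords, np.zeros((4, 1))])
scrambled = faithful[[0, 3, 2, 1]]

agreement_faithful = distance_agreement(coords, faithful)
agreement_scrambled = distance_agreement(coords, scrambled)
print(agreement_faithful, agreement_scrambled)
```

A value close to 1 suggests the embedding respects the board geometry; a low or negative value suggests that physically distant cells look alike to the L-KNN.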

## Phase 5 - L-KNN decision

### 5.0 - Heatmap of average RSSI
Before analyzing a specific sample we plot the mean `Signal` per cell to ensure the spatial gradient still reflects the router placement.

In [None]:
mean_signal = df.pivot_table(index='grid_x', columns='grid_y', values='Signal', aggfunc='mean')
plt.figure(figsize=(8, 4))
plt.imshow(mean_signal.sort_index(ascending=False), cmap='inferno', aspect='auto')
plt.colorbar(label='Average RSSI (dBm)')
plt.title('Heatmap of mean RSSI per cell')
plt.xlabel('grid_y (columns)')
plt.ylabel('grid_x (rows, highest index at top)')
plt.xticks(range(mean_signal.shape[1]), mean_signal.columns)
plt.yticks(range(mean_signal.shape[0]), sorted(mean_signal.index, reverse=True))
plt.show()


In [None]:
# Retrieve the cell probabilities, final prediction, and the K nearest neighbors.
y_proba = localizer.predict_proba(sample_features)
y_pred = localizer.predict(sample_features)
neighbor_dist, neighbor_cells = localizer.explain(sample_features, top_k=5)

print(f"Predicted cell: {y_pred[0]} | max confidence: {y_proba.max():.4f}")

neighbor_df = pd.DataFrame(
    {
        'rank': np.arange(1, neighbor_cells.shape[1] + 1),
        'cell': neighbor_cells[0],
        'distance_embedding': neighbor_dist[0],
    }
)
neighbor_df = neighbor_df.join(cell_lookup, on='cell')
display(neighbor_df)


`neighbor_df` lists the `K` reference cells sorted by Euclidean distance in the latent space. It doubles as an explainability artifact: you know exactly which historical fingerprints contributed to the final decision.
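The way such a neighbor list turns into a decision can be sketched as an inverse-distance vote. The function name `weighted_vote`, the small `eps` offset, and the sample neighbors are assumptions for illustration; the trained `localizer.knn_` may use a different number of neighbors than the five shown here.

```python
from collections import defaultdict


def weighted_vote(neighbor_cells, neighbor_dist, eps=1e-6):
    """Turn neighbor labels and distances into normalized per-cell vote shares."""
    scores = defaultdict(float)
    for cell, dist in zip(neighbor_cells, neighbor_dist):
        # Closer neighbors weigh more; eps avoids dividing by zero.
        scores[cell] += 1.0 / (dist + eps)
    total = sum(scores.values())
    return {cell: score / total for cell, score in scores.items()}


# Hypothetical neighbors: three fingerprints from cell 1_4, one each from 1_3 and 2_4.
shares = weighted_vote(
    ["1_4", "1_4", "1_3", "2_4", "1_4"],
    [0.10, 0.20, 0.25, 0.50, 1.00],
)
winner = max(shares, key=shares.get)
print(winner, shares)
```

Because several neighbors can belong to the same cell, their weights add up, which is why a cell with many moderately close fingerprints can outvote a single very close one.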

### 5.1 - Histogram of neighbor distances
Each bar corresponds to one of the `K` neighbors returned by L-KNN. Bars are positioned by neighbor rank and labeled with the cell name, so neighbors that share a cell name do not collapse into a single bar.

In [None]:
plt.figure(figsize=(6, 3))
plt.bar(neighbor_df['rank'], neighbor_df['distance_embedding'], color='dodgerblue')
plt.xticks(neighbor_df['rank'], neighbor_df['cell'], rotation=45, ha='right')
plt.xlabel('Neighbor cell (ordered by rank)')
plt.ylabel('Embedding distance')
plt.title('Top-K neighbors: embedding distances')
plt.tight_layout()
plt.show()


### 5.2 - Neighbor vote contributions
Distances are converted to weights (`1/distance`, with a small offset to avoid division by zero) and then normalized so they sum to 1. The resulting bars show how much each neighbor contributes to the L-KNN vote.

In [None]:
weights = 1 / (neighbor_df['distance_embedding'] + 1e-6)
weights = weights / weights.sum()
plt.figure(figsize=(6, 3))
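# Position bars by rank so neighbors from the same cell do not overlap.
plt.bar(neighbor_df['rank'], weights, color='mediumseagreen')
plt.xticks(neighbor_df['rank'], neighbor_df['cell'], rotation=45, ha='right')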
plt.ylabel('normalized weight')
plt.title('Neighbor contribution in the KNN vote')
plt.tight_layout()
plt.show()


### 5.3 - Project neighbors on the board picture

In [None]:

if GRID_IMAGE_PATH.exists():
    img = plt.imread(GRID_IMAGE_PATH)  # background picture of the magnetic board
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.imshow(img)

    def cell_to_pixels(cell_name: str):
        """Map a cell identifier to pixel coordinates on the image."""
        entry = cell_lookup.loc[cell_name]
        x_norm = entry['coord_x_m'] / GRID_WIDTH_M
        y_norm = entry['coord_y_m'] / GRID_HEIGHT_M
        px = x_norm * img.shape[1]
        py = y_norm * img.shape[0]
        return px, py

    # Actual vs predicted position of the inspected sample.
    true_px, true_py = cell_to_pixels(sample_meta.iloc[0]['grid_cell'])
    pred_px, pred_py = cell_to_pixels(y_pred[0])

    ax.scatter([true_px], [true_py], c='lime', s=120, marker='o', edgecolors='black', label='Ground truth cell')
    ax.scatter([pred_px], [pred_py], c='red', s=120, marker='x', label='Predicted cell')

    # Visualize the K neighbors that influenced the vote.
    for _, row in neighbor_df.iterrows():
        px, py = cell_to_pixels(row['cell'])
        ax.scatter(px, py, c='dodgerblue', s=80, alpha=0.7)
        label_txt = f"{row['cell']} (dist={row['distance_embedding']:.3f})"
        ax.text(
            px + 10,
            py,
            label_txt,
            color='white',
            fontsize=8,
            bbox=dict(facecolor='black', alpha=0.4, pad=2),
        )

    ax.set_title('Neighbors projected on the board (background = lesgridcells.png)')
    ax.axis('off')
    ax.legend(loc='upper right')
    plt.show()
else:
    print("Image lesgridcells.png missing: overlay cannot be displayed.")


### 5.4 - Confusion matrix and reliability curve
- The confusion matrix (saved by `localization.pipeline`) highlights cells that are systematically confused.
- The reliability curve plots average model confidence vs observed accuracy to validate probability calibration.
- The bar chart shows how confidence scores are distributed across all measurements.
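The binning behind a reliability curve can be sketched as follows. The function name `reliability_bins` and the toy inputs are assumptions; the sketch also returns the expected calibration error (ECE), a summary number that this notebook itself does not compute.

```python
import numpy as np


def reliability_bins(confidence, correct, n_bins=10):
    """Group predictions by confidence and compare mean confidence with accuracy.

    Returns (rows, ece) where each row is (mean_confidence, accuracy, count).
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Clip so that a confidence of exactly 1.0 falls in the last bin.
    idx = np.clip(np.digitize(confidence, edges) - 1, 0, n_bins - 1)

    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        mean_conf = confidence[mask].mean()
        accuracy = correct[mask].mean()
        rows.append((float(mean_conf), float(accuracy), int(mask.sum())))
        # Weight each bin's gap by the fraction of samples it holds.
        ece += mask.mean() * abs(mean_conf - accuracy)
    return rows, float(ece)


# Toy predictions: confidence of the predicted cell and whether it was right.
rows, ece = reliability_bins([0.55, 0.65, 0.95, 1.0, 1.0], [1, 0, 1, 1, 1])
print(rows, ece)
```

A well-calibrated model has rows where accuracy tracks mean confidence, which corresponds to points near the diagonal of the reliability curve.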

In [None]:
# Display the confusion matrix generated by the training pipeline
confusion_path = REPORTS_DIR / 'confusion_matrix.png'
if confusion_path.exists():
    display(Image(filename=str(confusion_path)))
else:
    print("confusion_matrix.png missing: rerun localization.pipeline to regenerate it.")

# Reliability curve on the entire dataset
all_probs = localizer.predict_proba(df[FEATURE_COLUMNS])
all_preds = localizer.predict(df[FEATURE_COLUMNS])
conf_max = all_probs.max(axis=1)
correct = (all_preds == df['grid_cell'].to_numpy())
bins = np.linspace(0, 1, 11)
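# Clip so that a confidence of exactly 1.0 lands in the last bin instead of being dropped.
indices = np.clip(np.digitize(conf_max, bins) - 1, 0, len(bins) - 2)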
centers = []
accuracies = []
counts = []
for i in range(len(bins) - 1):
    mask = indices == i
    if mask.any():
        centers.append(conf_max[mask].mean())
        accuracies.append(correct[mask].mean())
        counts.append(mask.sum())

plt.figure(figsize=(5, 4))
plt.plot([0, 1], [0, 1], '--', color='gray', label='Ideal calibration')
plt.plot(centers, accuracies, 'o-', color='dodgerblue', label='Model')
plt.xlabel('Average predicted confidence')
plt.ylabel('Observed accuracy')
plt.title('Reliability curve (all measurements)')
plt.grid(True, alpha=0.4)
plt.legend()
plt.show()

if centers:
    plt.figure(figsize=(5, 2.5))
    plt.bar(centers, counts, width=0.05, color='lightgray')
    plt.xlabel('Average confidence')
    plt.ylabel('Number of samples')
    plt.title('Confidence histogram')
    plt.show()
else:
    print("Cannot plot the confidence histogram: no confidence bins available.")


For reference, the board overlay from section 5.3 uses the following color code: green = ground-truth cell, red = predicted cell, blue markers = the neighbor cells consulted by the L-KNN.
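The overlay in section 5.3 relies on a linear mapping from metric coordinates to image pixels. The sketch below isolates that mapping. It assumes that the picture covers exactly the grid area and that both the metric origin and the pixel origin sit in the same corner; the function name `metres_to_pixels` and the dimensions are illustrative.

```python
def metres_to_pixels(x_m, y_m, grid_width_m, grid_height_m, img_width_px, img_height_px):
    """Scale metric board coordinates to pixel coordinates on the background image."""
    px = x_m / grid_width_m * img_width_px
    py = y_m / grid_height_m * img_height_px
    return px, py


# Hypothetical 1.0 m x 0.5 m board photographed as an 800 x 400 pixel image.
center_px = metres_to_pixels(0.5, 0.25, 1.0, 0.5, 800, 400)
print(center_px)  # (400.0, 200.0)
```

If the photo includes margins around the grid or its vertical axis is flipped relative to the metric axis, an offset or a `1 - y_norm` term would be needed before scaling.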

## Next steps
- Modify `TARGET_CELL` / `CAMPAIGN_NAME` to validate other positions or router distances.
- Add new measurement campaigns (different router heights, obstacles, etc.) and re-run the notebook to stress the embeddings.
- Export `neighbor_df`, histogram and weight plots into QA reports or dashboards so stakeholders can review the decision path.