
# Step 4 — Explore the latent space and evaluate (Fixed)

This notebook explores the vector space created in the previous step.  
It loads your tables, examines the latent representation, groups similar items, proposes new candidates in the latent space, gives them rough scores, and shows concise plots.

**What you will do here**
1. Load the prepared tables.
2. Build a two-dimensional view of the latent space with principal component analysis.
3. Group items with a simple k-means clustering.
4. Propose new candidates by moving from low-scoring toward high-scoring items and by sampling around cluster centers.
5. Score the candidates by nearest-neighbor matching and rank them.
6. Plot only what is needed to understand the results.



## Setup

The code below imports the libraries, sets file paths, and loads the data.  
Change paths if your files live somewhere else.


In [None]:

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Paths
path_full    = Path('/mnt/data/step13-vector-full.csv')
path_latent  = Path('/mnt/data/X_latent.csv')
path_recon   = Path('/mnt/data/X_recon_full.csv')  # optional
path_ga      = Path('/mnt/data/ga-output.csv')     # optional

# Load
df_full   = pd.read_csv(path_full)    if path_full.exists() else None
df_latent = pd.read_csv(path_latent)  if path_latent.exists() else None
df_recon  = pd.read_csv(path_recon)   if path_recon.exists() else None
df_ga     = pd.read_csv(path_ga)      if path_ga.exists() else None

print('Loaded:')
print(' step13-vector-full.csv :', df_full.shape if df_full is not None else 'missing')
print(' X_latent.csv           :', df_latent.shape if df_latent is not None else 'missing')
print(' X_recon_full.csv       :', df_recon.shape if df_recon is not None else 'missing')
print(' ga-output.csv          :', df_ga.shape if df_ga is not None else 'missing')



## Helper functions

We keep the code simple and self-contained:
- Two-dimensional principal component analysis.
- Small k-means.
- Min-max scaling.
- Nearest-neighbor scoring.
- Candidate proposal functions that work directly in the latent space.


In [None]:

import numpy as np
import pandas as pd  # used by nearest_neighbor_scores

def principal_components_2d(X):
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:2].T
    X2 = Xc @ W
    return X2, W

def kmeans_numpy(X, k=6, iters=60, seed=42):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    idx = rng.choice(n, size=k, replace=False)
    C = X[idx].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :])**2).sum(axis=2)
        labels = d2.argmin(axis=1)
        newC = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j] for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    d2 = ((X[:, None, :] - C[None, :, :])**2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(np.sum((X - C[labels])**2))
    return labels, C, inertia

def minmax01(x):
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    if not np.isfinite(lo) or not np.isfinite(hi) or hi <= lo:
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

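def nearest_neighbor_scores(Z_known, scores_df, Z_query, columns=('taste_balance','price_norm','health_score','carbon_norm')):
    """Give each query point the scores of its closest known point.

    Row i of Z_known must describe the same item as row i of scores_df.
    """
    Z_query = np.asarray(Z_query, dtype=float)
    if scores_df is None or Z_query.size == 0:
        return None
    cols = [c for c in columns if c in scores_df.columns]
    if not cols:
        return None
    Zk = np.asarray(Z_known, dtype=float)
    if len(Zk) != len(scores_df):
        raise ValueError("Z_known and scores_df must have the same number of rows.")
    out = []
    for zq in Z_query:
        # Squared Euclidean distance from this query to every known point
        d2 = ((Zk - zq)**2).sum(axis=1)
        idx = int(np.argmin(d2))
        out.append(scores_df.iloc[idx][cols].to_dict())
    return pd.DataFrame(out)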

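def propose_along_direction(Z, score, n_steps=20, step_size=0.5):
    """Step from the mean of the bottom score quartile toward the mean of the top quartile."""
    Z = np.asarray(Z, dtype=float)
    score = np.asarray(score, dtype=float)
    ok = np.isfinite(score)
    if ok.sum() < 2:
        return np.empty((0, Z.shape[1]))
    # Quartile thresholds computed on the finite scores only
    q25, q75 = np.percentile(score[ok], [25, 75])
    low  = Z[ok & (score <= q25)].mean(axis=0)
    high = Z[ok & (score >= q75)].mean(axis=0)
    direction = high - low
    nrm = np.linalg.norm(direction)
    if nrm < 1e-9:
        return np.empty((0, Z.shape[1]))
    # Unit vector, so step_size is a distance in latent units
    direction = direction / nrm
    steps = np.array([low + (i+1)*step_size*direction for i in range(n_steps)])
    return steps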

def propose_around_centers(centers_latent, n_per=25, noise=0.25, seed=123):
    rng = np.random.default_rng(seed)
    C = np.asarray(centers_latent, dtype=float)
    out = []
    for c in C:
        eps = rng.normal(0.0, noise, size=(n_per, c.shape[0]))
        out.append(c + eps)
    return np.vstack(out) if out else np.empty((0, C.shape[1] if C.ndim==2 else 0))



## Build a latent matrix

We take all columns whose names start with `z` (for example `z1`, `z2`); if there are none, we fall back to all numeric columns. We then center each column by subtracting its mean.
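A prefix test on `z` also matches unrelated columns whose names happen to start with that letter. The sketch below shows a stricter alternative, assuming the latent columns are named `z` followed by digits; the helper name `select_latent_columns` and the example column names are hypothetical.

```python
import re

def select_latent_columns(columns):
    """Return the column names that look like z1, z2, ... (case-insensitive)."""
    pattern = re.compile(r"^z\d+$", flags=re.IGNORECASE)
    return [c for c in columns if pattern.match(str(c))]

# Hypothetical column names, for illustration only.
example_columns = ["id", "z1", "z2", "Z3", "zone", "zip_code"]
latent_only = select_latent_columns(example_columns)
print(latent_only)  # ['z1', 'z2', 'Z3']
```

If your latent table uses a different naming scheme, adjust the pattern rather than relying on the prefix alone.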


In [None]:

if df_latent is None:
    raise RuntimeError("Latent table is missing. Please place X_latent.csv at the configured path.")

latent_cols = [c for c in df_latent.columns if str(c).lower().startswith('z')]
if not latent_cols:
    latent_cols = df_latent.select_dtypes(include='number').columns.tolist()

Z = df_latent[latent_cols].to_numpy(dtype=float)
Z = Z - Z.mean(axis=0, keepdims=True)

print('Latent matrix shape:', Z.shape)
print('Using columns:', latent_cols[:10], '...' if len(latent_cols) > 10 else '')



## Two-dimensional view of the latent space

We build a two-dimensional view with principal component analysis and plot it once.  
We color each point by a combined score (weight 0.4 for taste, 0.2 each for price, health, and carbon) so that higher-scoring regions stand out. Price and carbon are inverted so that lower values score higher.
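A two-dimensional map is only as informative as the share of variance its two components capture. The sketch below estimates that share from the singular values of the centered matrix. The function name `explained_variance_ratio_2d` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def explained_variance_ratio_2d(X):
    """Share of total variance captured by the first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional to
    # the variance along each principal direction.
    S = np.linalg.svd(Xc, compute_uv=False)
    var = S ** 2
    total = var.sum()
    if total <= 0:
        return 0.0
    return float(var[:2].sum() / total)

# Toy data (an assumption): five dimensions, two of which dominate.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])
ratio = explained_variance_ratio_2d(toy)
print(f"first two components explain {ratio:.1%} of the variance")
```

If the ratio for your latent matrix is low, read distances on the map with caution, since much of the structure lies in directions the plot does not show.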


In [None]:

X2, W = principal_components_2d(Z)

tb = df_full['taste_balance'].to_numpy() if (df_full is not None and 'taste_balance' in df_full.columns) else np.zeros(len(Z))
pc = (1.0 - df_full['price_norm'].to_numpy()) if (df_full is not None and 'price_norm' in df_full.columns) else np.zeros(len(Z))
hs = ((5.0 - df_full['health_score'].to_numpy().astype(float)) / 4.0) if (df_full is not None and 'health_score' in df_full.columns) else np.zeros(len(Z))
cg = (1.0 - df_full['carbon_norm'].to_numpy()) if (df_full is not None and 'carbon_norm' in df_full.columns) else np.zeros(len(Z))

combined_known = 0.4*tb + 0.2*pc + 0.2*hs + 0.2*cg

plt.figure(figsize=(7,5))
plt.scatter(X2[:,0], X2[:,1], s=4, c=combined_known)
plt.title("Latent map colored by combined score")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
plt.show()



## Group similar items in latent space

We run k-means directly on the latent space, not on the two-dimensional projection; the map is used only for display.  
We keep the number of groups small (`best_k = 4`) for a clean picture.
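The number of groups is fixed by hand here. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (inertia) across several values of k and look for the point where the drop flattens. The sketch below does this on toy data with a compact k-means; the function name `kmeans_inertia` and the blobs are assumptions for illustration.

```python
import numpy as np

def kmeans_inertia(X, k, iters=60, seed=42):
    """Run a basic k-means and return the final within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize centers from k distinct data points.
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        newC = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
            for j in range(k)
        ])
        if np.allclose(newC, C):
            break
        C = newC
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Toy data (an assumption): three well-separated blobs in two dimensions.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in blob_centers])

inertias = {k: kmeans_inertia(toy, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Because k-means depends on its random start, a single run per k can be misleading; repeating with several seeds and keeping the lowest inertia is a reasonable refinement.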


In [None]:

best_k = 4
labels, centers_Z, inertia = kmeans_numpy(Z, k=best_k, iters=60, seed=42)
print("Number of groups:", best_k, "  Inertia:", inertia)

plt.figure(figsize=(7,5))
plt.scatter(X2[:,0], X2[:,1], s=4, c=labels)
plt.title("Groups shown on the two-dimensional map")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
plt.show()



## Propose new candidates in latent space

We use two ideas:
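1. Move from the centroid of the low-score items (bottom quartile) toward the centroid of the high-score items (top quartile) along a straight line in the latent space.
2. Sample points with small Gaussian noise around each group center in the latent space.

Both moves operate directly on latent vectors, so every candidate has the same dimensionality as `Z` and no shape mismatches arise.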
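The first move can be illustrated on toy data. The sketch below builds latent points whose score grows along the first axis (an assumption made purely for illustration), computes the quartile centroids, and walks a few steps along the unit direction between them.

```python
import numpy as np

# Toy latent points (an assumption): the score grows along the first axis.
rng = np.random.default_rng(0)
Z_toy = rng.normal(size=(400, 3))
score_toy = Z_toy[:, 0] + 0.1 * rng.normal(size=400)

# Centroids of the bottom and top score quartiles.
q25, q75 = np.percentile(score_toy, [25, 75])
low = Z_toy[score_toy <= q25].mean(axis=0)
high = Z_toy[score_toy >= q75].mean(axis=0)

# Unit vector pointing from the low-score centroid to the high-score centroid.
direction = (high - low) / np.linalg.norm(high - low)

# Walk from the low centroid in equal steps along that direction.
step_size = 0.5
path = np.array([low + (i + 1) * step_size * direction for i in range(5)])

print("direction:", np.round(direction, 2))
print("distance of last step from low centroid:", np.linalg.norm(path[-1] - low))
```

Note that the walk does not stop at the high-score centroid. With `n_steps=30` and `step_size=0.6`, the last candidate lies 18 latent units from the starting centroid, so depending on the scale of your latent space the later steps may be extrapolations far from any known item.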


In [None]:

steer = df_full['taste_balance'].to_numpy() if (df_full is not None and 'taste_balance' in df_full.columns) else None

cand_dir = propose_along_direction(Z, steer, n_steps=30, step_size=0.6) if steer is not None else np.empty((0, Z.shape[1]))
cand_grp = propose_around_centers(centers_Z, n_per=25, noise=0.25, seed=123)

parts = []
if cand_dir.size: parts.append(cand_dir)
if cand_grp.size: parts.append(cand_grp)
candidates_Z = np.vstack(parts) if parts else np.empty((0, Z.shape[1]))

print('Proposed candidates:', candidates_Z.shape)



## Score and rank the candidates

We assign rough scores by nearest-neighbor matching in the latent space: each candidate inherits the scores of its closest known item.  
We then compute a combined score for ranking.
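The combined score is a weighted sum of four terms. The worked example below spells out the arithmetic for one item. The input values are hypothetical, and the comments reflect what the formula implies (a `health_score` scale running from 1 to 5 with 1 as the best value, and lower price and carbon being better) rather than anything stated about the data.

```python
def combined_score(taste_balance, price_norm, health_score, carbon_norm):
    """Weighted score used for ranking; higher is better."""
    price_term = 1.0 - price_norm              # lower price scores higher
    health_term = (5.0 - health_score) / 4.0   # maps 1 -> 1.0 and 5 -> 0.0
    carbon_term = 1.0 - carbon_norm            # lower carbon scores higher
    return 0.4 * taste_balance + 0.2 * price_term + 0.2 * health_term + 0.2 * carbon_term

# Hypothetical item values, for illustration only.
example = combined_score(taste_balance=0.8, price_norm=0.3, health_score=2, carbon_norm=0.5)
print(round(example, 2))  # 0.71
```

Here the terms are 0.4 × 0.8 = 0.32, 0.2 × 0.7 = 0.14, 0.2 × 0.75 = 0.15, and 0.2 × 0.5 = 0.10, which sum to 0.71.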


In [None]:

from IPython.display import display

cand_scores = nearest_neighbor_scores(Z, df_full, candidates_Z)
if cand_scores is not None and len(cand_scores) > 0:
    cand = pd.concat([pd.DataFrame(candidates_Z, columns=[f'z{i+1}' for i in range(candidates_Z.shape[1])]), cand_scores], axis=1)
    tb = cand['taste_balance'].to_numpy() if 'taste_balance' in cand.columns else np.zeros(len(cand))
    pc = (1.0 - cand['price_norm'].to_numpy()) if 'price_norm' in cand.columns else np.zeros(len(cand))
    hs = ((5.0 - cand['health_score'].to_numpy().astype(float)) / 4.0) if 'health_score' in cand.columns else np.zeros(len(cand))
    cg = (1.0 - cand['carbon_norm'].to_numpy()) if 'carbon_norm' in cand.columns else np.zeros(len(cand))
    cand['combined_score'] = 0.4*tb + 0.2*pc + 0.2*hs + 0.2*cg
    cand_sorted = cand.sort_values('combined_score', ascending=False).reset_index(drop=True)
    display(cand_sorted.head(10))
else:
    cand_sorted = pd.DataFrame()
    print("Could not build candidate scores. Check the input tables.")



## Where do the best candidates sit on the map?

We mark the top-ranked candidates on the two-dimensional map for a quick visual check.


In [None]:

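if len(cand_sorted) > 0:
    # Take rows from the ranked table so the markers match the ranking
    # (candidates_Z is still in its original, unranked order).
    top_n = min(20, len(cand_sorted))
    z_cols = [f'z{i+1}' for i in range(Z.shape[1])]
    Z_top = cand_sorted[z_cols].head(top_n).to_numpy(dtype=float)
    # Apply the same centering and projection used for the known items
    X2_top = (Z_top - Z.mean(axis=0, keepdims=True)) @ W
    plt.figure(figsize=(7,5))
    plt.scatter(X2[:,0], X2[:,1], s=4, alpha=0.3)
    plt.scatter(X2_top[:,0], X2_top[:,1], s=50, marker='X')
    plt.title("Top candidate locations on the map")
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.tight_layout()
    plt.show()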



## Scores of the top candidates

We show a single bar chart for the combined score of the top ten candidates.


In [None]:

if len(cand_sorted) > 0:
    top10 = cand_sorted.head(10)
    plt.figure(figsize=(7,4))
    plt.bar(range(len(top10)), top10['combined_score'])
    plt.xticks(range(len(top10)), range(1, len(top10)+1))
    plt.title("Combined score for top 10 candidates")
    plt.xlabel("rank")
    plt.ylabel("combined score")
    plt.tight_layout()
    plt.show()



## Notes for your write up

- Explain what the two-dimensional map shows and how you read it.
- Describe the groups and what seems to make them different.
- Explain how you generated candidates and why these two moves make sense.
- Discuss the limits of nearest neighbor scoring and what you would improve next.
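
One limit of nearest-neighbor scoring is that a candidate far from every known item still receives a score, even though that score says little about it. A possible way to make this visible is to report the distance to the matched neighbor next to the score. The sketch below does so on toy points; the function name `nearest_neighbor_distance` and the data are assumptions for illustration.

```python
import numpy as np

def nearest_neighbor_distance(Z_known, Z_query):
    """Index of, and Euclidean distance to, the closest known point for each query."""
    Zk = np.asarray(Z_known, dtype=float)
    Zq = np.asarray(Z_query, dtype=float)
    # Pairwise squared distances, shape (n_query, n_known).
    d2 = ((Zq[:, None, :] - Zk[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(Zq)), idx])
    return idx, dist

# Toy points (an assumption): three known items and two candidates.
known = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.9, 0.1], [5.0, 4.0]])
idx, dist = nearest_neighbor_distance(known, queries)
print(idx, np.round(dist, 3))
```

Both candidates match the same known item, but the second is many times farther away, so its inherited score deserves much less trust.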
