# Simulating the TechBio Learning Loop
## From One-Off Experiments to a Learning System

### 1. Introduction: Closing the Loop

In the previous notebooks, we built the components piece by piece:
* **Notebook 1 (Data Pipeline):** We cleaned the raw input.
* **Notebook 2 (Automation):** We scaled the cleaning process.
* **Notebook 3 (Flywheel):** We modeled the economics of learning.
* **Notebook 4 (First Model):** We trained a single predictor on static data.

Now, we connect them all into a **closed loop**.

The goal of this notebook is not to learn machine learning in depth. The goal is to *see* how a TechBio system behaves across **multiple cycles**:

> Assay → Data → Model → **Model-Guided Experiment** → New Data → Better Model

We will simulate a simplified version of this process using **synthetic data**:
1.  Start with a random experiment (Round 0).
2.  Train a model on that data.
3.  Use the model to *choose* the best candidates for the next experiment.
4.  "Run" that experiment (reveal the true values).
5.  Retrain and repeat.

This shows how a TechBio platform gets smarter over time, finding better molecules faster than random chance.

In [2]:
# 1. Setup: Imports and Random Seed

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(42)

plt.style.use("seaborn-v0_8-whitegrid")


---
## 2. Simulating a Hidden Biological Landscape

To simulate a learning loop, we need an underlying “truth” that the model is trying to discover.

Imagine we have a set of **candidates** (genes, small molecules, peptides, etc.).
Each candidate has:
* **Features:** (e.g., GC content, size, charge)
* **True Response:** A hidden biological value (e.g., binding affinity or expression level).

In real life, this "True Response" is unknown. In this simulation, we create it ourselves but hide it from the model, allowing us to check if the system actually learns the truth.

We will:
* Generate 200 synthetic candidates.
* Assign two numerical features to each.
* Compute a hidden **True Response** using a mathematical function.
* Add noise later to mimic experimental error.
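
Written out, the hidden function defined in the next code cell has the form below. The symbols are notation introduced here for illustration: $x_1$ and $x_2$ stand for `Feature_1` and `Feature_2`, $r$ for the raw response, and $y$ for the normalized `True_Response`.

$$
r = 0.6\,x_1 + 0.3\,x_2 + 0.1\,\sin(4\pi x_1),
\qquad
y = \frac{r - r_{\min}}{r_{\max} - r_{\min}}
$$

The sine term completes two full oscillations as $x_1$ goes from 0 to 1, so the response is not monotonic in `Feature_1`. That nonlinearity is what gives the model something to learn beyond a straight line.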

In [3]:
# 2. Create a synthetic "biological space" of candidates

n_candidates = 200

candidate_ids = [f"CAND_{i:03d}" for i in range(1, n_candidates + 1)]

# Two simple features, e.g. physicochemical properties
feature_1 = np.random.uniform(0, 1, size=n_candidates)   # e.g. normalized GC content
feature_2 = np.random.uniform(0, 1, size=n_candidates)   # e.g. normalized length/charge/etc.

# Hidden "true" response function (nonlinear combination)
true_response = (
    0.6 * feature_1
    + 0.3 * feature_2
    + 0.1 * np.sin(4 * np.pi * feature_1)
)

# Normalize to a nice 0–1 range
true_response = (true_response - true_response.min()) / (true_response.max() - true_response.min())

landscape_df = pd.DataFrame({
    "Candidate_ID": candidate_ids,
    "Feature_1": feature_1,
    "Feature_2": feature_2,
    "True_Response": true_response
}).set_index("Candidate_ID")

print("Synthetic biological landscape (first 5 rows):")
print(landscape_df.head())


Synthetic biological landscape (first 5 rows):
              Feature_1  Feature_2  True_Response
Candidate_ID                                     
CAND_001       0.374540   0.642032       0.385772
CAND_002       0.950714   0.084140       0.687954
CAND_003       0.731994   0.161629       0.650229
CAND_004       0.598658   0.898554       0.942709
CAND_005       0.156019   0.606429       0.455326


---
## 3. Simulating an Assay: Noisy Measurements

In real biology, we never see the *true* response directly. Every experiment adds noise from pipetting errors, instrument variability, or biological stochasticity.

To keep our simulation realistic, we define a `run_assay` function that:
1.  Looks up the hidden `True_Response`.
2.  Adds Gaussian noise to mimic experimental error.
3.  Returns a `Measured_Response`.

This function acts as our **wet lab**. The model will only ever see the noisy measurements, but we will judge it based on how well it finds the true high-value candidates.

In [4]:
# 3. Assay simulation: from true response to noisy measured response

def run_assay(candidates: pd.Index, noise_std: float = 0.08):
    """
    Simulate running a biological assay on a set of candidates.
    Returns a DataFrame with noisy measured responses.
    """
    subset = landscape_df.loc[candidates].copy()
    noise = np.random.normal(loc=0.0, scale=noise_std, size=len(subset))
    subset["Measured_Response"] = np.clip(subset["True_Response"] + noise, 0.0, 1.0)
    return subset[["Feature_1", "Feature_2", "True_Response", "Measured_Response"]]
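
One side effect of this noise model is worth checking in numbers: because measurements are clipped to the 0 to 1 range, candidates whose true response sits near 1.0 are measured slightly low on average. The sketch below repeats the same clipped-Gaussian measurement many times for two true values. The helper `simulate_measurements` and the local generator `rng` are hypothetical names used only for this illustration; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the notebook's global seed untouched
noise_std = 0.08                # same default as run_assay
n_repeats = 100_000


def simulate_measurements(true_value, n):
    """Hypothetical helper: repeat the clipped-Gaussian assay n times for one true value."""
    noise = rng.normal(loc=0.0, scale=noise_std, size=n)
    return np.clip(true_value + noise, 0.0, 1.0)


mid = simulate_measurements(0.5, n_repeats)  # far from the clipping boundaries
top = simulate_measurements(1.0, n_repeats)  # sits exactly on the upper boundary

print(f"true=0.5 -> mean={mid.mean():.3f}, std={mid.std():.3f}")
print(f"true=1.0 -> mean={top.mean():.3f}, std={top.std():.3f}")
```

For the mid-range candidate the measured mean and spread should match the true value and `noise_std` closely. For the candidate at 1.0, every positive noise draw is clipped away, so the measured mean lands below the true value.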


---

## 4. Round 0: The Cold Start

Before a TechBio platform becomes “smart,” it has to start with a blind guess. When there is no model yet (or no training data), we don't know which candidates are good.

So, the very first experiment is usually a **random selection**.

This first batch provides the initial labeled data (noisy measurements, not the hidden truth) that we need to train our first model.

In this step, we will:
* Pick 20 candidates at random.
* Run our simulated assay on them.
* Use these noisy measurements as our initial training set.

In [5]:
# 4. Initial random experiment: Round 0

n_initial = 20

initial_candidates = np.random.choice(landscape_df.index, size=n_initial, replace=False)
round0_df = run_assay(initial_candidates)

print("Round 0: initial randomly measured candidates (first 5 rows):")
print(round0_df.head())

print(f"\nNumber of measured candidates after Round 0: {len(round0_df)}")


Round 0: initial randomly measured candidates (first 5 rows):
              Feature_1  Feature_2  True_Response  Measured_Response
Candidate_ID                                                        
CAND_104       0.508571   0.637430       0.646115           0.645984
CAND_034       0.948886   0.492518       0.851964           0.817881
CAND_189       0.529651   0.750615       0.745237           0.847274
CAND_195       0.339030   0.340804       0.246355           0.239030
CAND_139       0.363630   0.474174       0.309109           0.384075

Number of measured candidates after Round 0: 20


---
## 5. Training the Engine

Now we behave like a TechBio platform.

We have a small table of measured candidates (from Round 0). We will train a **Random Forest Regressor** to predict the response based on the features.

**Note:** Even though `True_Response` exists in our simulation, we pretend we don’t know it. We train only on `Measured_Response`, because that is all you would have in a real lab.

In [14]:
# 5. Helper: training and evaluation function

def train_and_evaluate_model(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """
    Train a RandomForestRegressor on train_df and evaluate on test_df.
    We use Measured_Response as the target, but evaluate against True_Response
    to see how close we get to the underlying biology.
    """
    features = ["Feature_1", "Feature_2"]
    
    X_train = train_df[features]
    y_train = train_df["Measured_Response"]
    
    X_test = test_df[features]
    y_test_true = test_df["True_Response"]  # we use this only for evaluation in the simulation
    
    model = RandomForestRegressor(
        n_estimators=100,
        random_state=42
    )
    
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Compare to the true underlying response
    rmse = np.sqrt(mean_squared_error(y_test_true, y_pred))

    
    return model, rmse, y_pred


---
### 5.1 Evaluating the First Model

To measure success, we need to know if the model actually understands the landscape.

We will:
* Train on the measured candidates.
* Predict values for all **unmeasured candidates**.
* Compare those predictions to the hidden `True_Response`.

We track **RMSE (Root Mean Squared Error)**. A lower RMSE means the model's mental map of the biology is getting closer to reality.
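
To make the metric concrete, here is a minimal sketch that computes RMSE by hand on three made-up values (they are not taken from the simulation). It is the same quantity the helper above obtains with `np.sqrt(mean_squared_error(...))`.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([0.20, 0.50, 0.90])
y_pred = np.array([0.30, 0.50, 0.70])

errors = y_pred - y_true               # [0.1, 0.0, -0.2]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((0.01 + 0.00 + 0.04) / 3)
print(f"RMSE = {rmse:.3f}")            # prints "RMSE = 0.129"
```

Because errors are squared before averaging, one large miss counts for more than several small ones.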

In [15]:
# 6. Evaluate the first model (Round 0)

# Candidates not yet measured
unmeasured_candidates = landscape_df.index.difference(round0_df.index)
unmeasured_df = landscape_df.loc[unmeasured_candidates].copy()

model_round0, rmse_round0, preds_round0 = train_and_evaluate_model(round0_df, unmeasured_df)

print(f"Round 0 model RMSE on unmeasured candidates: {rmse_round0:.3f}")


Round 0 model RMSE on unmeasured candidates: 0.110
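
An RMSE is easier to interpret next to a naive baseline. The sketch below rebuilds a landscape with the same functional form and computes the RMSE of a constant predictor that always guesses the mean true response. It uses its own random generator, so the landscape (and the resulting number) will differ from the notebook's. It is also an oracle baseline, since it uses the true mean that a real lab would not know.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, so values differ from the notebook's landscape
n = 200
x1 = rng.uniform(0, 1, size=n)
x2 = rng.uniform(0, 1, size=n)

# Same functional form as the notebook's hidden response, then min-max scaled
raw = 0.6 * x1 + 0.3 * x2 + 0.1 * np.sin(4 * np.pi * x1)
true_response = (raw - raw.min()) / (raw.max() - raw.min())

# Constant predictor: always guess the mean true response
constant_guess = true_response.mean()
baseline_rmse = np.sqrt(np.mean((true_response - constant_guess) ** 2))
print(f"Constant-mean baseline RMSE: {baseline_rmse:.3f}")
```

A model whose RMSE is well below this baseline has learned some of the landscape's structure; one that is close to it has learned little beyond the average.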


---

## 6. Closing the Loop: Active Learning

This step is what turns the pipeline into a loop: the model does not just analyze existing data, it decides which data to generate next.

1.  The model is trained on the current measured data.
2.  It predicts responses for **all unmeasured candidates**.
3.  We select the **top K** candidates with the highest predicted response (Smart Selection).
4.  We run the assay on those candidates (add them to our dataset).
5.  We retrain the model.

This creates the **Flywheel** we simulated in Notebook 3.

We will repeat this loop several times and track:
* **RMSE:** Does the model error go down?
* **Candidate Quality:** Are we finding better molecules than if we just picked randomly?
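
The selection step can be sketched in isolation on a toy example. The candidate names and predicted values below are made up for illustration; only the top-K idea comes from the steps above.

```python
import pandas as pd

# Hypothetical predictions for five unmeasured candidates (made-up numbers)
predicted = pd.Series(
    [0.42, 0.91, 0.15, 0.77, 0.63],
    index=["CAND_A", "CAND_B", "CAND_C", "CAND_D", "CAND_E"],
    name="Predicted_Response",
)

k = 2
# Greedy selection: take the k candidates with the highest predicted response
selected = predicted.nlargest(k).index.tolist()
# Everything else stays in the unmeasured pool for the next round
remaining = predicted.index.difference(selected).tolist()

print(selected)   # ['CAND_B', 'CAND_D']
print(remaining)  # ['CAND_A', 'CAND_C', 'CAND_E']
```

Pure top-K selection only exploits what the model already believes. It never deliberately samples regions where the model is uncertain, which is one plausible reason the error on the remaining pool does not have to shrink from round to round.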

In [16]:
# 7. Loop parameters

n_rounds = 5          # number of active learning rounds after Round 0
k_per_round = 15      # how many new candidates we test each round

# Tracking metrics
rmse_history = []
avg_true_selected_history = []
avg_true_random_history = []

# Initial state (Round 0 already measured)
measured_df = round0_df.copy()
current_unmeasured = landscape_df.index.difference(measured_df.index)


In [17]:
# 8. Run the learning loop

for r in range(1, n_rounds + 1):
    # Train on everything measured before this round; evaluate on the remaining unmeasured pool
    unmeasured_df = landscape_df.loc[current_unmeasured].copy()
    model, rmse, preds = train_and_evaluate_model(measured_df, unmeasured_df)
    
    rmse_history.append(rmse)
    
    # Add predictions to unmeasured_df
    unmeasured_df["Predicted_Response"] = preds
    
    # 1) Model-guided selection: top K by predicted response
    selected_candidates = (
        unmeasured_df
        .sort_values("Predicted_Response", ascending=False)
        .head(k_per_round)
        .index
    )
    
    # 2) Baseline: random selection of K candidates (for comparison)
    random_candidates = np.random.choice(current_unmeasured, size=k_per_round, replace=False)
    
    # True response averages for analysis
    avg_true_selected = landscape_df.loc[selected_candidates, "True_Response"].mean()
    avg_true_random = landscape_df.loc[random_candidates, "True_Response"].mean()
    
    avg_true_selected_history.append(avg_true_selected)
    avg_true_random_history.append(avg_true_random)
    
    # Now "run the assay" on the model-selected candidates
    new_measurements = run_assay(selected_candidates)
    
    # Add them to the measured_df
    measured_df = pd.concat([measured_df, new_measurements])
    
    # Update unmeasured set
    current_unmeasured = landscape_df.index.difference(measured_df.index)
    
    print(f"Round {r}: RMSE={rmse:.3f}, avg true response (model-selected)={avg_true_selected:.3f}, random={avg_true_random:.3f}")


Round 1: RMSE=0.110, avg true response (model-selected)=0.895, random=0.588
Round 2: RMSE=0.115, avg true response (model-selected)=0.840, random=0.608
Round 3: RMSE=0.116, avg true response (model-selected)=0.800, random=0.462
Round 4: RMSE=0.118, avg true response (model-selected)=0.778, random=0.481
Round 5: RMSE=0.119, avg true response (model-selected)=0.697, random=0.381
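
The printed rounds can be condensed into a couple of summary numbers. The sketch below copies the values from the output above into plain lists (the variable names are chosen here for illustration) and computes the per-round uplift of model-guided selection over the random baseline.

```python
# Values copied from the printed output of the loop above
selected = [0.895, 0.840, 0.800, 0.778, 0.697]          # avg true response, model-selected
random_baseline = [0.588, 0.608, 0.462, 0.481, 0.381]   # avg true response, random pick
rmse = [0.110, 0.115, 0.116, 0.118, 0.119]

# Per-round advantage of model-guided selection over random selection
uplift = [s - r for s, r in zip(selected, random_baseline)]
mean_uplift = sum(uplift) / len(uplift)

print("Uplift per round:", [round(u, 3) for u in uplift])
print(f"Mean uplift: {mean_uplift:.3f}")
print(f"RMSE change from Round 1 to Round 5: {rmse[-1] - rmse[0]:+.3f}")
```

In this run the uplift is positive in every round, while the RMSE on the remaining pool drifts slightly upward rather than down.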


---

## 7. Key Takeaways

This notebook demonstrated the core mechanic of a TechBio company: **Iterative Improvement**.

### 1. Models learn from partial data
We started with only 20 randomly chosen measurements. In Round 1, the batch picked by that first model already averaged a true response of 0.895, compared with 0.588 for a random batch.

### 2. Smart Selection wins
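In every round, the model-selected batch had a higher average true response than the random baseline (0.697 to 0.895 versus 0.381 to 0.608). The selected average falls over the rounds, most likely because the best candidates leave the pool once they are tested. This is the "efficiency" gain of TechBio.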

### 3. The Loop is the Product
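The value isn't just the final drug candidate; it's the system that uses every experiment to choose the next one. Note that in this run the RMSE on the remaining unmeasured pool did not fall (0.110 in Round 1, 0.119 in Round 5), so the improvement shows up in candidate quality rather than in overall prediction accuracy.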


---

## 8. Your Turn: Experiments to Try

To understand this loop more deeply, try adjusting the simulation in small ways.
Each change teaches something about how a real TechBio platform behaves.

### **1. Change the number of rounds**
Try:
- `n_rounds = 3`
- `n_rounds = 10`

Watch how the RMSE curve changes. Does it plateau?

### **2. Change how many candidates the model chooses per round**
Raise or lower:
- `k_per_round = 5`
- `k_per_round = 20`

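This shows how **batch size** affects learning speed. With smaller batches the model is retrained after fewer new measurements, which tends to be more efficient per candidate, but it takes more rounds to test the same number of candidates.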
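
When comparing batch sizes, it helps to hold the total assay budget fixed; otherwise changing `k_per_round` also changes how many candidates get measured overall. The sketch below does that arithmetic with the notebook's default values. The helper functions `total_measured` and `rounds_for_budget` are hypothetical names introduced here.

```python
n_candidates = 200   # size of the synthetic library in this notebook
n_initial = 20       # Round 0 batch
n_rounds = 5
k_per_round = 15


def total_measured(n_initial, n_rounds, k_per_round):
    """Candidates measured after Round 0 plus n_rounds model-guided rounds."""
    return n_initial + n_rounds * k_per_round


def rounds_for_budget(budget, n_initial, k_per_round):
    """Number of full model-guided rounds that fit in a fixed assay budget."""
    return max(0, (budget - n_initial) // k_per_round)


budget = total_measured(n_initial, n_rounds, k_per_round)
print(f"Default settings measure {budget} of {n_candidates} candidates")

for k in (5, 15, 20):
    print(f"k_per_round={k}: {rounds_for_budget(budget, n_initial, k)} rounds fit in the same budget")
```

Also keep `n_initial + n_rounds * k_per_round` at or below 200. Once the unmeasured pool holds fewer than `k_per_round` candidates, the random baseline's `np.random.choice(..., replace=False)` raises a `ValueError`.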

### **3. Change the noise level of the assay**
In `run_assay()`, modify the noise parameter:
```python
# Low noise (High Quality Data)
noise_std = 0.03

# High noise (Messy Data)
noise_std = 0.15