# Simulating the TechBio Learning Loop  
From One-Off Experiments to a Learning System

In the previous notebooks, we looked at the TechBio workflow piece by piece:

1. **TechBio Flywheel** – the big picture: Biology → Data → Models → Validation → Impact  
2. **TechBio Data Pipeline** – cleaning messy lab output into model-ready tables  
3. **TechBio First Model** – training a simple model on clean biological data

In this notebook, we connect those ideas and turn them into a **loop**.

The goal is not to learn machine learning in depth.  
The goal is to *see* how a TechBio system behaves across **multiple cycles**:

> Assay → Data → Model → Model-guided Experiment → New Data → Better Model

We will simulate a very simplified version of this process using **synthetic data**:

- a fake "biological assay" that measures how well each candidate (e.g. gene, molecule) responds  
- a model that tries to predict which candidates will have high response  
- a loop where the model’s predictions guide which candidates we measure next  
- repeated updates of the model as new data comes in

This shows how experiments and model updates feed each other across rounds, which is how a TechBio platform is meant to get smarter over time.


In [2]:
# 1. Setup: Imports and Random Seed

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(42)

plt.style.use("seaborn-v0_8-whitegrid")



---
## 1. Simulating a Hidden Biological Landscape

To simulate a learning loop, we need an underlying “truth” that the model is trying to discover.

Imagine we have a set of **candidates** (these could be genes, small molecules, peptides, etc.).  
Each candidate has:

- some **features** (e.g. GC content, size, charge, etc.)  
- a **true underlying response** in a biological assay (e.g. how strongly it activates a pathway)  

In real life, we don’t know the true function.  
In this simulation, we create it ourselves and hide it from the model, so we can check how well the system learns.

We will:

- generate 200 synthetic candidates  
- give each candidate two numerical features  
- compute a **true_response** using a hidden function  
- add noise later when we “measure” them, to mimic experimental error


In [3]:
# 2. Create a synthetic "biological space" of candidates

n_candidates = 200

candidate_ids = [f"CAND_{i:03d}" for i in range(1, n_candidates + 1)]

# Two simple features, e.g. physicochemical properties
feature_1 = np.random.uniform(0, 1, size=n_candidates)   # e.g. normalized GC content
feature_2 = np.random.uniform(0, 1, size=n_candidates)   # e.g. normalized length/charge/etc.

# Hidden "true" response function (nonlinear combination)
true_response = (
    0.6 * feature_1
    + 0.3 * feature_2
    + 0.1 * np.sin(4 * np.pi * feature_1)
)

# Normalize to a nice 0–1 range
true_response = (true_response - true_response.min()) / (true_response.max() - true_response.min())

landscape_df = pd.DataFrame({
    "Candidate_ID": candidate_ids,
    "Feature_1": feature_1,
    "Feature_2": feature_2,
    "True_Response": true_response
}).set_index("Candidate_ID")

print("Synthetic biological landscape (first 5 rows):")
print(landscape_df.head())


Synthetic biological landscape (first 5 rows):
              Feature_1  Feature_2  True_Response
Candidate_ID                                     
CAND_001       0.374540   0.642032       0.385772
CAND_002       0.950714   0.084140       0.687954
CAND_003       0.731994   0.161629       0.650229
CAND_004       0.598658   0.898554       0.942709
CAND_005       0.156019   0.606429       0.455326


---
## 2. Simulating an Assay: Noisy Measurements

In real biology, we never see the *true* response of a gene or molecule directly.  
Every experiment adds a bit of noise — from the instrument, the protocol, or the environment.

To keep our simulation realistic, we build a tiny artificial “assay”:

- we start from the hidden `True_Response`  
- we add a bit of Gaussian noise to mimic experimental error  
- we store the result as `Measured_Response`

This function acts as our **simulated biological experiment**.  
The model will rely on these noisy measurements as it decides which candidates to test next.

In [4]:
# 3. Assay simulation: from true response to noisy measured response

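def run_assay(candidates, noise_std: float = 0.08) -> pd.DataFrame:
    """
    Simulate running a biological assay on a set of candidates.

    `candidates` is any collection of Candidate_ID labels (pd.Index, array, list)
    that is looked up in the global `landscape_df`. Gaussian noise with standard
    deviation `noise_std` is added, and the result is clipped to [0, 1].
    Returns a DataFrame with noisy measured responses.
    """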
    subset = landscape_df.loc[candidates].copy()
    noise = np.random.normal(loc=0.0, scale=noise_std, size=len(subset))
    subset["Measured_Response"] = np.clip(subset["True_Response"] + noise, 0.0, 1.0)
    return subset[["Feature_1", "Feature_2", "True_Response", "Measured_Response"]]


---

## 3. Round 0: A Random First Experiment

Before any TechBio platform becomes “smart,” it has to start with a blind guess.  
When there is no model yet, we don’t know which candidates are good or bad,  
so the very first experiment is usually just a random selection.

This first batch gives us the initial data we need to train a model.

In this simulation, we will:

- pick 20 candidates at random  
- run our simulated assay on them  
- use these noisy measurements as the **very first training dataset**


In [5]:
# 4. Initial random experiment: Round 0

n_initial = 20

initial_candidates = np.random.choice(landscape_df.index, size=n_initial, replace=False)
round0_df = run_assay(initial_candidates)

print("Round 0: initial randomly measured candidates (first 5 rows):")
print(round0_df.head())

print(f"\nNumber of measured candidates after Round 0: {len(round0_df)}")


Round 0: initial randomly measured candidates (first 5 rows):
              Feature_1  Feature_2  True_Response  Measured_Response
Candidate_ID                                                        
CAND_104       0.508571   0.637430       0.646115           0.645984
CAND_034       0.948886   0.492518       0.851964           0.817881
CAND_189       0.529651   0.750615       0.745237           0.847274
CAND_195       0.339030   0.340804       0.246355           0.239030
CAND_139       0.363630   0.474174       0.309109           0.384075

Number of measured candidates after Round 0: 20


---
## 4. Training the First Model on Round 0

Now we behave like a TechBio platform:

- we have a small table of measured candidates (`Feature_1`, `Feature_2`, `Measured_Response`)  
- we train a model that tries to predict the underlying response based on these features  

Even though `True_Response` exists in our simulation, we pretend we don’t know it.  
We only use `Measured_Response` as the **training target**, because that is what we would have in a real assay.

We will use a **RandomForestRegressor** to predict response as a continuous value.


In [14]:
# 5. Helper: training and evaluation function

def train_and_evaluate_model(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """
    Train a RandomForestRegressor on train_df and evaluate on test_df.
    We use Measured_Response as the target, but evaluate against True_Response
    to see how close we get to the underlying biology.
    """
    features = ["Feature_1", "Feature_2"]
    
    X_train = train_df[features]
    y_train = train_df["Measured_Response"]
    
    X_test = test_df[features]
    y_test_true = test_df["True_Response"]  # we use this only for evaluation in the simulation
    
    model = RandomForestRegressor(
        n_estimators=100,
        random_state=42
    )
    
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Compare to the true underlying response
    rmse = np.sqrt(mean_squared_error(y_test_true, y_pred))

    
    return model, rmse, y_pred


---
### 4.1 Evaluating the First Model

To evaluate how well the model understands the biological landscape, we:

- train on the measured candidates  
- test on all the **unmeasured candidates**  
- compare predictions to `True_Response` (which we only know in this simulation)

We will track **RMSE (Root Mean Squared Error)** between predicted and true responses.  
Lower RMSE means the model is closer to the true underlying biology.
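
As a concrete illustration of the metric, here is a minimal sketch that computes RMSE by hand with NumPy. The arrays `y_true` and `y_pred` hold made-up values chosen for easy arithmetic; they are not output from this notebook.

```python
import numpy as np

# Illustrative values only (not taken from the simulation)
y_true = np.array([0.20, 0.50, 0.90])
y_pred = np.array([0.30, 0.50, 0.70])

# RMSE = square root of the mean of the squared errors
squared_errors = (y_pred - y_true) ** 2   # approximately 0.01, 0.00, 0.04
rmse = np.sqrt(squared_errors.mean())

print(f"RMSE: {rmse:.3f}")  # RMSE: 0.129
```

Because the errors are squared before averaging, the single large miss (0.2) contributes four times as much as the small one (0.1).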


In [15]:
# 6. Evaluate the first model (Round 0)

# Candidates not yet measured
unmeasured_candidates = landscape_df.index.difference(round0_df.index)
unmeasured_df = landscape_df.loc[unmeasured_candidates].copy()

model_round0, rmse_round0, preds_round0 = train_and_evaluate_model(round0_df, unmeasured_df)

print(f"Round 0 model RMSE on unmeasured candidates: {rmse_round0:.3f}")


Round 0 model RMSE on unmeasured candidates: 0.110


---
## 5. Making the Loop: Model-Guided Experiments

Now we use the model in a more “TechBio” way:

1. The model is trained on the current measured data.  
2. It predicts responses for **all unmeasured candidates**.  
3. We select the **top K** candidates with the highest predicted response.  
4. We run the assay on those candidates (i.e. we measure them).  
5. We add the new measurements to our training data.  
6. We retrain the model on the enlarged dataset.  

This creates a loop:

> **Data → Model → Model-guided Experiment → New Data → Better Model**

We repeat this loop several times and track:

- how the model’s error (RMSE) changes over rounds  
- how good the **selected candidates** actually are compared to random choices
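
Step 3 is the only decision rule in the loop, so it is worth seeing in isolation. The sketch below is a minimal version under stated assumptions: the candidate names and predicted values are made up, and `select_top_k` is a hypothetical helper that is not used elsewhere in this notebook.

```python
import numpy as np

def select_top_k(candidate_ids, predicted, k):
    """Return the k candidate IDs with the highest predicted response."""
    order = np.argsort(predicted)[::-1]   # indices from highest to lowest prediction
    return [candidate_ids[i] for i in order[:k]]

# Made-up predictions for five unmeasured candidates
candidate_ids = ["CAND_A", "CAND_B", "CAND_C", "CAND_D", "CAND_E"]
predicted = np.array([0.41, 0.87, 0.15, 0.66, 0.92])

top = select_top_k(candidate_ids, predicted, k=2)
print(top)  # ['CAND_E', 'CAND_B']
```

This rule is purely greedy: it always follows the current predictions and never deliberately samples regions where the model has seen little data.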


In [16]:
# 7. Loop parameters

n_rounds = 5          # number of active learning rounds after Round 0
k_per_round = 15      # how many new candidates we test each round

# Tracking metrics
rmse_history = []
avg_true_selected_history = []
avg_true_random_history = []

# Initial state (Round 0 already measured)
measured_df = round0_df.copy()
current_unmeasured = landscape_df.index.difference(measured_df.index)


In [17]:
# 8. Run the learning loop

for r in range(1, n_rounds + 1):
    # Train model on currently measured data
    unmeasured_df = landscape_df.loc[current_unmeasured].copy()
    model, rmse, preds = train_and_evaluate_model(measured_df, unmeasured_df)
    
    rmse_history.append(rmse)
    
    # Add predictions to unmeasured_df
    unmeasured_df["Predicted_Response"] = preds
    
    # 1) Model-guided selection: top K by predicted response
    selected_candidates = (
        unmeasured_df
        .sort_values("Predicted_Response", ascending=False)
        .head(k_per_round)
        .index
    )
    
    # 2) Baseline: random selection of K candidates (for comparison)
    random_candidates = np.random.choice(current_unmeasured, size=k_per_round, replace=False)
    
    # True response averages for analysis
    avg_true_selected = landscape_df.loc[selected_candidates, "True_Response"].mean()
    avg_true_random = landscape_df.loc[random_candidates, "True_Response"].mean()
    
    avg_true_selected_history.append(avg_true_selected)
    avg_true_random_history.append(avg_true_random)
    
    # Now "run the assay" on the model-selected candidates
    new_measurements = run_assay(selected_candidates)
    
    # Add them to the measured_df
    measured_df = pd.concat([measured_df, new_measurements])
    
    # Update unmeasured set
    current_unmeasured = landscape_df.index.difference(measured_df.index)
    
    print(f"Round {r}: RMSE={rmse:.3f}, avg true response (model-selected)={avg_true_selected:.3f}, random={avg_true_random:.3f}")


Round 1: RMSE=0.110, avg true response (model-selected)=0.895, random=0.588
Round 2: RMSE=0.115, avg true response (model-selected)=0.840, random=0.608
Round 3: RMSE=0.116, avg true response (model-selected)=0.800, random=0.462
Round 4: RMSE=0.118, avg true response (model-selected)=0.778, random=0.481
Round 5: RMSE=0.119, avg true response (model-selected)=0.697, random=0.381
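
The loop stores its metrics in `rmse_history`, `avg_true_selected_history`, and `avg_true_random_history`. One possible way to read them side by side is sketched below. To keep the sketch self-contained, it uses separate lists filled with the rounded values printed above; in the notebook itself the live history lists could be used instead.

```python
import numpy as np

# Rounded values copied from the printed output of the loop above
rmse_by_round = [0.110, 0.115, 0.116, 0.118, 0.119]
selected_by_round = [0.895, 0.840, 0.800, 0.778, 0.697]
random_by_round = [0.588, 0.608, 0.462, 0.481, 0.381]

# Advantage of model-guided selection over the random baseline, per round
uplift = np.array(selected_by_round) - np.array(random_by_round)

for r, (rmse, gain) in enumerate(zip(rmse_by_round, uplift), start=1):
    print(f"Round {r}: RMSE={rmse:.3f}, selected - random = {gain:+.3f}")

print(f"Mean uplift over random: {uplift.mean():.3f}")
```

In this run the model-guided batch beats the random baseline in every round, while RMSE on the remaining unmeasured pool does not decrease.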


---

## 9. Key Points and Summary

This notebook showed how a TechBio system becomes smarter through **iteration**, not by running a single model once.  
We started with a hidden biological landscape, ran a noisy assay, trained a model, and then let that model guide the next round of experiments.

Here are the essential ideas:

### ● 1. Real biology always begins with partial information  
The first experiment is usually random.  
You don’t know which candidates are good, so you measure a small batch just to get started.

### ● 2. Models learn from whatever data exists  
Our first model was trained on noisy, limited measurements — just like in real labs.  
Even a weak first model is enough to guide the next step.

### ● 3. Model-guided experiments find stronger candidates than random ones  
In every round, the model-selected batch had a **higher average true response** than a random batch of the same size (0.895 vs. 0.588 in Round 1, 0.697 vs. 0.381 in Round 5).  
This is the core advantage of a TechBio platform.

### ● 4. The loop improves the system  
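Each round adds new measurements → expands the dataset → gives the model more to learn from.  
In this short run the benefit shows up in the quality of the selected candidates rather than in RMSE, which drifted from 0.110 to 0.119. Each round's RMSE is computed on a different, shrinking pool of unmeasured candidates, so the values are not directly comparable.  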
This is the engine behind TechBio companies.

### ● 5. Noise, limited data, and uncertainty are not bugs — they are reality  
Our noisy assay reflects the variability of real biological experiments.  
Despite the noise, the model-guided batches still had higher average true responses than the random baseline in every round.

### ● 6. The goal here is conceptual, not algorithmic  
This notebook is not about learning machine learning deeply.  
It exists to make the **feedback loop** real and visible, connecting directly to the book’s ideas.

---

## 10. Your Turn: Experiments to Try

To understand this loop more deeply, try adjusting the simulation in small ways.  
Each change teaches something about how a real TechBio platform behaves.

### **1. Change the number of rounds**
Try:
- `n_rounds = 3`
- `n_rounds = 10`

Watch how the per-round RMSE values and the selected-vs-random averages change.

### **2. Change how many candidates the model chooses per round**
Raise or lower:
- `k_per_round = 5`
- `k_per_round = 20`

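This shows how batch size affects learning speed.
Keep `n_initial + n_rounds * k_per_round` at or below `n_candidates` (200); otherwise the pool of unmeasured candidates runs out and the random baseline draw fails.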

### **3. Change the noise level of the assay**
In `run_assay()`, modify the default `noise_std`:
```python
noise_std = 0.03   # cleaner assay
noise_std = 0.15   # noisier assay
```

**High noise makes learning slower; low noise makes it faster.**

Increasing the assay noise makes the model’s job harder, because every measurement becomes less reliable.  
Decreasing the noise produces cleaner signals and faster learning.  
Changing this single parameter shows how fragile or stable the loop is when experiments become noisy.
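
To get a feel for these values before rerunning the whole loop, the sketch below applies the same noise model as `run_assay()` (Gaussian noise, clipped to the 0-1 range) to a stand-in response vector. The uniform `true_response` array and the helper `measurement_rmse` are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the hidden True_Response column (uniform values in [0, 1])
true_response = rng.uniform(0, 1, size=5000)

def measurement_rmse(noise_std: float) -> float:
    """Apply run_assay-style noise and return the RMSE of measured vs. true values."""
    noise = rng.normal(loc=0.0, scale=noise_std, size=true_response.size)
    measured = np.clip(true_response + noise, 0.0, 1.0)
    return float(np.sqrt(np.mean((measured - true_response) ** 2)))

errors = {std: measurement_rmse(std) for std in (0.03, 0.08, 0.15)}
for std, err in errors.items():
    print(f"noise_std={std:.2f} -> measurement RMSE ~ {err:.3f}")
```

The measurement error stays close to `noise_std` itself (slightly lower, because clipping trims noise near 0 and 1). Since the model only ever sees the measured values, this number is a rough guide to how much error the assay alone introduces.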

---

### 4. Change the hidden biological function

Modify how `True_Response` is generated.  
For example:

- remove the sine term  
- add a squared term like `feature_1**2`  
- change the weights (e.g. make Feature_2 more important)  

Different underlying functions create easier or harder landscapes for the model to learn.  
This helps you see how the difficulty of the biology affects the behavior of the whole loop.
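
The three suggestions above could look like the following sketch. The variant names and the specific weights are hypothetical choices, and the features are drawn with `np.random.default_rng`, so the values will differ from the notebook's `feature_1` and `feature_2`.

```python
import numpy as np

rng = np.random.default_rng(42)
feature_1 = rng.uniform(0, 1, size=200)
feature_2 = rng.uniform(0, 1, size=200)

def normalize(values: np.ndarray) -> np.ndarray:
    """Rescale to the 0-1 range, as the notebook does for True_Response."""
    return (values - values.min()) / (values.max() - values.min())

sine_term = 0.1 * np.sin(4 * np.pi * feature_1)

# Hypothetical variants of the hidden function
raw_variants = {
    "no_sine": 0.6 * feature_1 + 0.3 * feature_2,
    "squared_term": 0.6 * feature_1**2 + 0.3 * feature_2 + sine_term,
    "feature_2_dominant": 0.2 * feature_1 + 0.7 * feature_2 + sine_term,
}
variants = {name: normalize(raw) for name, raw in raw_variants.items()}

for name, response in variants.items():
    print(f"{name}: mean={response.mean():.3f}, std={response.std():.3f}")
```

To use one of these in the notebook, assign it to `true_response` before `landscape_df` is built and rerun the cells that follow.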

---

### 5. Try a different regression model

Swap out the Random Forest for another simple regressor:

```python
from sklearn.linear_model import LinearRegression
# or
from sklearn.neighbors import KNeighborsRegressor
```

Compare the RMSE curves and the quality of the candidates selected in each round.
Each model has its own strengths and weaknesses, and changing it shows how model choice affects the stability and speed of the learning loop.
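
One way to make the swap easy is a small factory function, sketched below. `make_model` is a hypothetical helper, `n_neighbors=5` is an arbitrary choice, and the toy data only imitates the shape of the notebook's hidden function; inside `train_and_evaluate_model`, the `RandomForestRegressor(...)` call could be replaced with `make_model(...)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def make_model(name: str):
    """Hypothetical helper: return an unfitted regressor by name."""
    if name == "random_forest":
        return RandomForestRegressor(n_estimators=100, random_state=42)
    if name == "linear":
        return LinearRegression()
    if name == "knn":
        return KNeighborsRegressor(n_neighbors=5)
    raise ValueError(f"Unknown model name: {name}")

# Toy data: 40 noisy training points, 20 noise-free test points
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 2))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * np.sin(4 * np.pi * X[:, 0])
X_train, y_train = X[:40], y[:40] + rng.normal(0, 0.08, size=40)
X_test, y_test = X[40:], y[40:]

results = {}
for name in ("random_forest", "linear", "knn"):
    model = make_model(name).fit(X_train, y_train)
    rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
    results[name] = float(rmse)
    print(f"{name}: RMSE = {rmse:.3f}")
```

Which model comes out ahead depends on the hidden function and the noise level, so rerun the comparison after changing either.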

---

### 6. Track another metric

To make the simulation feel more like a real TechBio platform, try logging additional metrics such as:

* the best true response discovered so far
* the median true response of all measured candidates
* the total number of candidates measured across rounds

These resemble the simple dashboards that real TechBio systems use to monitor platform progress.
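
A minimal sketch of such a log is shown below. The `batches` values are made up for illustration; in the notebook, the same numbers could be read from `measured_df["True_Response"]` at the end of each round.

```python
import numpy as np

# Made-up true responses of the candidates measured in each round
batches = [
    [0.42, 0.77, 0.31, 0.58],   # round 0 (random)
    [0.91, 0.85, 0.88],         # round 1 (model-guided)
    [0.83, 0.94, 0.79],         # round 2 (model-guided)
]

measured_so_far = []
dashboard = []
for round_idx, batch in enumerate(batches):
    measured_so_far.extend(batch)
    dashboard.append({
        "round": round_idx,
        "best_so_far": max(measured_so_far),
        "median_so_far": float(np.median(measured_so_far)),
        "n_measured": len(measured_so_far),
    })

for row in dashboard:
    print(row)
```

In a real assay only the measured responses are available, so a production dashboard would track `Measured_Response` instead of the true values used here.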

---

**Final Takeaway**

A TechBio system is not simply “a model running on data.”
It is a loop, where:

* experiments create data
* data trains models
* models guide new experiments
* new experiments create better data

Every turn of this loop improves the system’s knowledge and performance.

This notebook gives you a small, synthetic demonstration of that process —
just enough to show how the ideas in the book translate into a practical workflow.