# Building the First Model
## From Clean Data to First Predictions

### 1. Introduction: The Pivot Point

In **Notebook 1**, we cleaned a messy biological dataset. We standardized column names, handled missing values, and engineered features.

In **Notebooks 2 and 3**, we looked at the bigger picture: automation and economics.

Now, we return to the core science. We take the clean data from Notebook 1 and turn it into intelligence.

This notebook demonstrates the **Modeling Step** of the TechBio loop:
> Biology → Data → **Model** → Validation → Better Biology

We will:
1.  Load a clean, model-ready dataset (simulated for this exercise).
2.  Define a clear biological prediction task.
3.  Train a simple machine learning model.
4.  Interpret what the model learned.

The goal is not to build the world's best predictor, but to understand the *mechanics* of connecting data to insight.

---

### 2. Loading the Model-Ready Dataset

In a real project, you would load the `.csv` file you saved in Notebook 1. Since we are in a new session, we will quickly regenerate a similar clean dataset here to work with.

In [7]:
import pandas as pd
import numpy as np

# ---- create a larger synthetic model-ready dataset ----
np.random.seed(42)

n_genes = 40

# Make some fake gene IDs
gene_ids = [f"GENE_{i:03d}" for i in range(1, n_genes + 1)]

# Sample_B: base expression values
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)

# Sample_C: correlated with Sample_B + some noise
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)

# GC_Content_Feature: %GC between 30 and 70
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)

model_ready_df = pd.DataFrame({
    "Gene_ID": gene_ids,
    "Sample_B": sample_b,
    "Sample_C": sample_c,
    "GC_Content_Feature": gc_content
}).set_index("Gene_ID")

print("Model-ready dataset (first 5 rows):")
print(model_ready_df.head())


Model-ready dataset (first 5 rows):
           Sample_B   Sample_C  GC_Content_Feature
Gene_ID                                           
GENE_001  11.986857  13.463790           66.302659
GENE_002   9.446943   9.789679           39.971689
GENE_003  12.590754  12.359458           46.415317
GENE_004  16.092119  15.489912           60.222046
GENE_005   9.063387   6.106343           39.151927
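
In a real project, the simulation above would be replaced by a single loading step. The following is a minimal sketch of that step, assuming the Notebook 1 output was written with `Gene_ID` as the index column. The file name `model_ready.csv` is hypothetical, and a temporary directory stands in for wherever the file was actually saved.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in table with the same columns as the dataset above
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "Gene_ID": [f"GENE_{i:03d}" for i in range(1, 6)],
        "Sample_B": rng.normal(10.0, 4.0, size=5),
        "Sample_C": rng.normal(10.0, 4.0, size=5),
        "GC_Content_Feature": rng.uniform(30.0, 70.0, size=5),
    }
).set_index("Gene_ID")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "model_ready.csv"  # hypothetical file name
    demo_df.to_csv(csv_path)                      # stands in for the save in Notebook 1
    # index_col restores Gene_ID as the row index instead of a plain column
    loaded_df = pd.read_csv(csv_path, index_col="Gene_ID")

print(loaded_df.head())
```

Passing `index_col="Gene_ID"` matters here: without it, the gene IDs would come back as an ordinary column and could be mistaken for a feature later on.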


---

### 3. Defining the Prediction Task

A model needs a goal. In TechBio, this goal usually comes from a biological question.

For this exercise, our question is:
**"Can we predict which genes will be highly expressed in Sample C, based only on their expression in Sample B and their GC content?"**

* **Features (Inputs):** `Sample_B` expression + `GC_Content_Feature`
* **Target (Output):** A binary label (1 = High Expression, 0 = Low Expression)

We define "High Expression" as anything at or above the median value in Sample C. This turns a complex biological phenomenon into a solvable classification problem.

In [8]:
# Features: what the model will see
X = model_ready_df[["Sample_B", "GC_Content_Feature"]]

# Target: high vs low expression in Sample_C
median_sample_c = model_ready_df["Sample_C"].median()
y = (model_ready_df["Sample_C"] >= median_sample_c).astype(int)

print("Features (X):")
print(X)
print("\nTarget (y):")
print(y)
print(f"\nMedian of Sample_C: {median_sample_c}")


Features (X):
           Sample_B  GC_Content_Feature
Gene_ID                                
GENE_001  11.986857           66.302659
GENE_002   9.446943           39.971689
GENE_003  12.590754           46.415317
GENE_004  16.092119           60.222046
GENE_005   9.063387           39.151927
GENE_006   9.063452           33.079196
GENE_007  16.316851           41.590058
GENE_008  13.069739           36.448851
GENE_009   8.122102           67.187906
GENE_010  12.170240           62.324815
GENE_011   8.146329           55.336150
GENE_012   8.137081           64.858424
GENE_013  10.967849           62.146883
GENE_014   2.346879           37.462802
GENE_015   3.100329           65.702360
GENE_016   7.750850           51.573690
GENE_017   5.948676           62.297606
GENE_018  11.256989           65.843652
GENE_019   6.367904           42.720139
GENE_020   4.350785           34.402077
GENE_021  15.862595           39.117407
GENE_022   9.096895           47.084312
GENE_023  10.270113       

---

### 4. The Virtual Experiment: Splitting the Data

In TechBio, we never train on all our data. We always hold some back. This is called the **Test Set**.

Why? Because biology is noisy. A model might memorize the noise in your training data (overfitting) but fail completely when you try it on a new patient sample or a new cell line.

The **Test Set** acts like a "blind validation study." The model never sees these genes during training. We only show them to the model *after* it is finished, to see if it actually learned the biology or just memorized the answers.
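
The memorization problem can be seen directly in a small sketch. The data here is hypothetical and separate from this notebook's dataset: the labels are coin flips with no relationship to the features, so there is nothing real to learn. Any gap between training and test accuracy is therefore pure memorization.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 200 "genes", 2 random features, coin-flip labels
X_noise = rng.normal(size=(200, 2))
y_noise = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.5, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Accuracy on genes the model has seen vs. genes it has not
train_acc = forest.score(X_tr, y_tr)
test_acc = forest.score(X_te, y_te)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

We would expect the training accuracy to be close to 1.0 and the test accuracy to hover around 0.5. Only the held-out score reveals that the model learned nothing transferable.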

We will split our 40 genes:
* **Training Set (20 genes):** Used to teach the model.
* **Test Set (20 genes):** Used to judge the model.


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,   # half of the genes for testing, just for demonstration
    random_state=42  # fixed seed for reproducibility
)

print("Training features:")
print(X_train)
print("\nTraining labels:")
print(y_train)
print("\nTest features:")
print(X_test)
print("\nTest labels:")
print(y_test)


Training features:
           Sample_B  GC_Content_Feature
Gene_ID                                
GENE_006   9.063452           33.079196
GENE_012   8.137081           64.858424
GENE_002   9.446943           39.971689
GENE_030   8.833225           43.504607
GENE_022   9.096895           47.084312
GENE_003  12.590754           46.415317
GENE_031   7.593174           67.716388
GENE_037  10.835454           68.497892
GENE_004  16.092119           60.222046
GENE_036   5.116625           68.871283
GENE_024   4.301007           64.429223
GENE_033   9.946011           50.751625
GENE_011   8.146329           55.336150
GENE_023  10.270113           62.720591
GENE_019   6.367904           42.720139
GENE_021  15.862595           39.117407
GENE_008  13.069739           36.448851
GENE_015   3.100329           65.702360
GENE_029   7.597445           34.794615
GENE_039   4.687256           49.889940

Training labels:
Gene_ID
GENE_006    0
GENE_012    0
GENE_002    1
GENE_030    0
GENE_022    1
GENE_

---

### 5. Training the Model

Now we build the engine. We will use a **Random Forest Classifier**.

Think of a Random Forest like a committee of scientists. It builds many small "decision trees." Each tree is trained on a random resample of the training genes and considers a random subset of the features at each split, then votes on whether a gene should be high or low expression. The forest combines the votes to make a final prediction.

This is a "workhorse" algorithm in bioinformatics because:
1.  It handles noisy biological data well.
2.  It doesn't assume biology is linear (biological systems rarely are).
3.  It tells us *which* features were important (we'll see this later).
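
The committee idea can be made concrete by asking each tree for its own opinion. This is a minimal sketch on hypothetical toy data (one informative feature, one noise feature), not on the notebook's dataset. It relies on the `estimators_` attribute, which scikit-learn exposes on a fitted forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: one informative feature, one pure-noise feature
n_samples = 60
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X_toy = np.column_stack([informative, noise])
y_toy = (informative + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_toy, y_toy)

# Ask every tree in the committee about one new sample
new_sample = np.array([[0.3, -1.0]])
tree_votes = np.array(
    [int(tree.predict(new_sample)[0]) for tree in forest.estimators_]
)

print("Votes for class 1:", int(tree_votes.sum()), "of", len(tree_votes))
print("Forest probability of class 1:", forest.predict_proba(new_sample)[0, 1])
print("Forest prediction:", forest.predict(new_sample)[0])
```

Strictly speaking, scikit-learn averages the trees' predicted probabilities rather than counting hard votes. With fully grown trees, as in this sketch, each tree's probability is 0 or 1, so the two views agree.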

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Create the model
model = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    random_state=42
)

# Train (fit) the model on the training data
model.fit(X_train, y_train)

print("Model trained.")


Model trained.


---

### 6. The Moment of Truth: Evaluation

Now we run our "virtual experiment."

We take the 20 genes in our **Test Set** (which the model has never seen) and ask the model to predict their expression. Then, we compare those predictions to the true labels (`y_test`), which stand in for actual lab results in this simulated exercise.

**How to read the results:**
* **50% Accuracy:** The model is guessing (random coin flip), since the median split makes the two classes roughly equal in size. The data might be too noisy, or the features don't matter.
* **100% Accuracy:** Suspicious. Usually means "data leakage" (the answer was hidden in the question).
* **60-80% Accuracy:** Typical for early biological models. It found a signal, but biology is messy.

Let's see how our simple model did.

In [15]:
from sklearn.metrics import accuracy_score

# Predict on the test set
y_pred = model.predict(X_test)

# .tolist() converts NumPy integers to plain Python ints for readable output
print("True labels:     ", y_test.tolist())
print("Predicted labels:", y_pred.tolist())

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.2f}")


True labels:      [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
Predicted labels: [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

Test Accuracy: 0.65
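
A single accuracy number hides two things: which kind of mistakes the model made, and how much that number could move with only 20 test genes. The sketch below reuses the labels printed above. The `wilson_interval` helper is our own illustration (a standard Wilson score interval), not part of scikit-learn.

```python
import math

from sklearn.metrics import confusion_matrix

# Labels copied from the printed output above
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print("Confusion matrix:")
print(cm)


def wilson_interval(n_correct, n_total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    )
    return center - half_width, center + half_width


n_total = len(y_true)
n_correct = int(cm.trace())  # correct predictions sit on the diagonal
low, high = wilson_interval(n_correct, n_total)
print(f"Accuracy: {n_correct}/{n_total}, approx. 95% interval: {low:.2f} to {high:.2f}")
```

The model got 13 of 20 genes right, with 3 false positives and 4 false negatives. The interval runs from roughly 0.43 to 0.82, which still includes 0.5. With a test set this small, 0.65 is an encouraging hint, not proof that the model beats a coin flip.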


---

### 7. Opening the Black Box

Accuracy scores are useful, but they don't tell you *biology*. To understand the science, you need to know **why** the model made its decisions.

We will look at **Feature Importance**. This tells us which input variable (Sample B or GC Content) was more useful for making predictions.

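In a real discovery platform, this is one way to find candidate biomarkers. If a model attributes most of its predictive power for drug resistance to a specific gene feature, you have found a new hypothesis to test in the lab. Keep in mind that importance describes what the model relied on, not a proven causal mechanism.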

In [18]:
# Get feature importance values
feature_importances = model.feature_importances_
feature_names = X.columns

importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": feature_importances
}).sort_values(by="Importance", ascending=False)

print("Feature importances:")
print(importance_df)


Feature importances:
              Feature  Importance
0            Sample_B    0.690982
1  GC_Content_Feature    0.309018
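
One caution when reading these numbers: in this simulated dataset, `GC_Content_Feature` was drawn independently of `Sample_C`, yet it still receives about 31% of the importance. The built-in scores are computed from the training data and can give credit to a noise feature. A common cross-check is permutation importance on the held-out genes. The sketch below rebuilds a dataset shaped like ours, but with a different random generator, so its numbers will not match the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Rebuild a simulated dataset shaped like the one in this notebook
rng = np.random.default_rng(42)
n_genes = 40
sample_b = rng.normal(10.0, 4.0, size=n_genes)
sample_c = sample_b + rng.normal(0.0, 2.0, size=n_genes)
features = pd.DataFrame({
    "Sample_B": sample_b,
    "GC_Content_Feature": rng.uniform(30.0, 70.0, size=n_genes),
})
labels = (sample_c >= np.median(sample_c)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.5, random_state=42
)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out genes and record how much
# the accuracy drops; repeating averages out the randomness of the shuffle
result = permutation_importance(
    forest, X_te, y_te, n_repeats=30, random_state=42, scoring="accuracy"
)

perm_df = pd.DataFrame({
    "Feature": features.columns,
    "Mean_Accuracy_Drop": result.importances_mean,
    "Std": result.importances_std,
}).sort_values(by="Mean_Accuracy_Drop", ascending=False)
print(perm_df)
```

A feature whose mean accuracy drop is near zero (or negative) contributed little to predictions on unseen genes, whatever its built-in importance score says.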


---

### 8. Bringing It All Together

You have now completed the first half of the TechBio loop. You moved from **Data** to **Model** to **Insight**.

1.  **Data:** You started with a clean, numerical table.
2.  **Model:** You trained a Random Forest to find patterns.
3.  **Insight:** You identified which features the model relied on most, a starting point for hypotheses about what drives the biological response.

In a traditional lab, this insight would often be the end of the project. You would write a paper and stop.

In a TechBio company, this is just the beginning. The model's predictions are now used to design the **next** experiment.

**What's Next?**
In **Notebook 5 (TechBio Learning Loop)**, we will close the cycle. We will use a model like this one to *choose* the next set of experiments, generating new data that makes the model even smarter.
