# Building the First Model  
From Clean Data to First Predictions

In the previous notebook, we took a messy gene-expression table and transformed it into **clean, structured, model-ready data**. We:

- cleaned and standardized column names  
- removed unreliable sample columns  
- handled missing values safely  
- filtered out corrupted sequences  
- engineered a new feature, **GC_Content_Feature**, from the DNA sequences  

By the end, we had a tidy table with numerical values that a machine learning model can understand.  
Each row represented a gene, and each column represented a clean, validated feature.

In this notebook, we take the **next step in the TechBio workflow**:  
we use that clean dataset to train our **first simple machine learning model**.

The purpose here is not to chase perfect accuracy.  
The purpose is to show how modeling fits inside the larger TechBio loop:

> Biology → Data → Cleaning → **Model** → Validation → Better Biology

This notebook demonstrates that transition: clean data becoming an initial predictive model.

---

## 1. Loading the Model-Ready Dataset

In a real project, you would export the cleaned table from the previous notebook and load it here, for example:

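```python
import pandas as pd

model_ready_df = pd.read_csv("model_ready_dataset.csv", index_col="Gene_ID")
```

To keep this notebook self-contained, we instead generate a synthetic dataset with the same three columns:
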
In [7]:
import pandas as pd
import numpy as np

# ---- create a larger synthetic model-ready dataset ----
np.random.seed(42)

n_genes = 40

# Make some fake gene IDs
gene_ids = [f"GENE_{i:03d}" for i in range(1, n_genes + 1)]

# Sample_B: base expression values
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)

# Sample_C: correlated with Sample_B + some noise
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)

# GC_Content_Feature: %GC between 30 and 70
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)

model_ready_df = pd.DataFrame({
    "Gene_ID": gene_ids,
    "Sample_B": sample_b,
    "Sample_C": sample_c,
    "GC_Content_Feature": gc_content
}).set_index("Gene_ID")

print("Model-ready dataset (first 5 rows):")
print(model_ready_df.head())


Model-ready dataset (first 5 rows):
           Sample_B   Sample_C  GC_Content_Feature
Gene_ID                                           
GENE_001  11.986857  13.463790           66.302659
GENE_002   9.446943   9.789679           39.971689
GENE_003  12.590754  12.359458           46.415317
GENE_004  16.092119  15.489912           60.222046
GENE_005   9.063387   6.106343           39.151927
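
Because the dataset is synthetic, we know how it was built: `Sample_C` is `Sample_B` plus noise, and `GC_Content_Feature` is drawn independently. A quick correlation check makes that structure visible before any modeling. The sketch below regenerates the same data so it runs on its own; the names `df` and `corr_with_c` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset with the same seed and call order as above
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)

df = pd.DataFrame({
    "Sample_B": sample_b,
    "Sample_C": sample_c,
    "GC_Content_Feature": gc_content,
})

# Pearson correlation of each candidate feature with Sample_C
corr_with_c = df.corr()["Sample_C"].drop("Sample_C")
print(corr_with_c.round(2))
```

We would expect `Sample_B` to correlate strongly with `Sample_C` and GC content only weakly, which is worth keeping in mind when reading the feature importances in Section 6.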


---

## 2. Defining a Simple Prediction Task

A machine learning model always needs two pieces of information:

- **features** — the input columns the model will use  
- **target** — the output the model should learn to predict  

From our cleaned dataset, we will use:

- **Sample_B**: the expression level of each gene in Sample B  
- **GC_Content_Feature**: the GC percentage of the gene sequence  

These will be the **features** the model sees.

To create a **target**, we will build a simple classification task:

> Predict whether a gene has **high** or **low** expression in Sample C.

To define the labels, we use the **median** value of `Sample_C`:

- if a gene’s `Sample_C` value is **≥ median**, we label it **1** (high expression)  
- if it is **< median**, we label it **0** (low expression)  

This turns our dataset into a clean supervised learning problem:

- **Input:** expression in `Sample_B` + GC content  
- **Output:** high vs low expression in `Sample_C`  

This type of simple task is a perfect first step for showing how a model takes cleaned biological data and tries to learn a pattern from it.


In [8]:
# Features: what the model will see
X = model_ready_df[["Sample_B", "GC_Content_Feature"]]

# Target: high vs low expression in Sample_C
median_sample_c = model_ready_df["Sample_C"].median()
y = (model_ready_df["Sample_C"] >= median_sample_c).astype(int)

print("Features (X):")
print(X)
print("\nTarget (y):")
print(y)
print(f"\nMedian of Sample_C: {median_sample_c}")


Features (X):
           Sample_B  GC_Content_Feature
Gene_ID                                
GENE_001  11.986857           66.302659
GENE_002   9.446943           39.971689
GENE_003  12.590754           46.415317
GENE_004  16.092119           60.222046
GENE_005   9.063387           39.151927
GENE_006   9.063452           33.079196
GENE_007  16.316851           41.590058
GENE_008  13.069739           36.448851
GENE_009   8.122102           67.187906
GENE_010  12.170240           62.324815
GENE_011   8.146329           55.336150
GENE_012   8.137081           64.858424
GENE_013  10.967849           62.146883
GENE_014   2.346879           37.462802
GENE_015   3.100329           65.702360
GENE_016   7.750850           51.573690
GENE_017   5.948676           62.297606
GENE_018  11.256989           65.843652
GENE_019   6.367904           42.720139
GENE_020   4.350785           34.402077
GENE_021  15.862595           39.117407
GENE_022   9.096895           47.084312
GENE_023  10.270113           62.720591
...
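
One useful property of a median split is that it produces balanced classes. With 40 distinct values, exactly half of the genes fall at or above the median. The following sketch checks this on the same synthetic data; `class_counts` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Rebuild Sample_C with the same seed and call order as in Section 1
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = pd.Series(
    sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes),
    name="Sample_C",
)

# Same labeling rule as the notebook: 1 if at or above the median, else 0
y = (sample_c >= sample_c.median()).astype(int)

# Count how many genes fall into each class
class_counts = y.value_counts().sort_index()
print(class_counts)
```

Balanced classes make accuracy easier to read: a model that always guesses one class would score about 50% on the full dataset.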

---

## 3. Splitting the Data: Training vs Testing

Now that we have defined our features (`X`) and target labels (`y`), the next step is to split the dataset into:

- **training data** — what the model learns from  
- **test data** — what we hold back to evaluate the model on unseen examples  

This is one of the core habits of machine learning:  
the model must be judged on data it did **not** see during training.

Because our synthetic dataset is small (40 genes), we split it 50/50 so that the training and test sets each contain 20 genes. This is for demonstration only; in a real project with more data, 70/30 or 80/20 splits are more common.

The code below creates:

- `X_train`, `y_train` — used to train the model  
- `X_test`, `y_test` — used later to evaluate accuracy

We also set a fixed `random_state` so the split is reproducible every time you run the notebook.


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,   # half of the genes for testing, just for demonstration
    random_state=42  # fixed seed for reproducibility
)

print("Training features:")
print(X_train)
print("\nTraining labels:")
print(y_train)
print("\nTest features:")
print(X_test)
print("\nTest labels:")
print(y_test)


Training features:
           Sample_B  GC_Content_Feature
Gene_ID                                
GENE_006   9.063452           33.079196
GENE_012   8.137081           64.858424
GENE_002   9.446943           39.971689
GENE_030   8.833225           43.504607
GENE_022   9.096895           47.084312
GENE_003  12.590754           46.415317
GENE_031   7.593174           67.716388
GENE_037  10.835454           68.497892
GENE_004  16.092119           60.222046
GENE_036   5.116625           68.871283
GENE_024   4.301007           64.429223
GENE_033   9.946011           50.751625
GENE_011   8.146329           55.336150
GENE_023  10.270113           62.720591
GENE_019   6.367904           42.720139
GENE_021  15.862595           39.117407
GENE_008  13.069739           36.448851
GENE_015   3.100329           65.702360
GENE_029   7.597445           34.794615
GENE_039   4.687256           49.889940

Training labels:
Gene_ID
GENE_006    0
GENE_012    0
GENE_002    1
GENE_030    0
GENE_022    1
...
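
A plain random split does not guarantee that both halves keep the same share of high and low genes. As an optional variation (not used in the rest of this notebook), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The sketch below compares the two; variable names ending in `_plain` and `_s` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the synthetic dataset with the same seed and call order as Section 1
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)
df = pd.DataFrame({
    "Sample_B": sample_b,
    "Sample_C": sample_c,
    "GC_Content_Feature": gc_content,
})

X = df[["Sample_B", "GC_Content_Feature"]]
y = (df["Sample_C"] >= df["Sample_C"].median()).astype(int)

# Plain random split, as in the notebook
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Stratified split: the same call plus stratify=y
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print("High-expression genes in test set (plain):     ", int(y_test_plain.sum()))
print("High-expression genes in test set (stratified):", int(y_test_s.sum()))
```

With 20 genes per class and a 50/50 split, the stratified version places 10 high and 10 low genes in each half.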

---

## 4. Training a Simple Machine Learning Model

With the training and test sets prepared, we can now build our first machine learning model.  
To keep things intuitive, we’ll use a **RandomForestClassifier**, a model that needs little tuning to give reasonable results and that reports how much it relied on each feature.

A Random Forest is simply a collection of decision trees:

- each tree learns a slightly different pattern  
- their results are combined (a “vote”)  
- the final prediction is usually more stable than any single tree  

For our TechBio example, this model is a convenient choice because:

- it handles non-linear patterns  
- it does not require feature scaling  
- it gives **feature importance scores**, which help us understand what the model is focusing on  

In the code below, we:

1. create the model  
2. set the number of trees (we’ll use 50 for a stable demo)  
3. train (“fit”) the model on the training data  

After this step, the model has learned a relationship between:

- the inputs (`Sample_B` and `GC_Content_Feature`)  
- the output (high vs low expression in `Sample_C`)  


In [14]:
from sklearn.ensemble import RandomForestClassifier

# Create the model
model = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    random_state=42
)

# Train (fit) the model on the training data
model.fit(X_train, y_train)

print("Model trained.")


Model trained.
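
The “vote” described in this section can be inspected directly. In scikit-learn, the fitted trees are stored in `model.estimators_`, and the forest's class probabilities are the average of the per-tree probabilities. The sketch below rebuilds the data and model so it runs on its own, then checks that averaging by hand reproduces the forest's output; `per_tree_proba` and `mean_proba` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Rebuild the synthetic dataset, split, and model from the earlier sections
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)
X = pd.DataFrame({"Sample_B": sample_b, "GC_Content_Feature": gc_content})
y = pd.Series((sample_c >= np.median(sample_c)).astype(int))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# The individual trees were fit on plain arrays, so pass a NumPy array
# to avoid feature-name warnings.
X_test_array = X_test.to_numpy()
per_tree_proba = np.stack(
    [tree.predict_proba(X_test_array) for tree in model.estimators_]
)

# Average over the tree axis and compare with the forest's own probabilities
mean_proba = per_tree_proba.mean(axis=0)
print("Trees in forest:", len(model.estimators_))
print("Matches model.predict_proba:",
      np.allclose(mean_proba, model.predict_proba(X_test)))
```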


---

## 5. Making Predictions and Evaluating the Model

Now that our model has been trained on the training set, it’s time to see how well it performs on **new, unseen data**.  
This is the whole purpose of machine learning evaluation: we want to check whether the model learned a real pattern, not just memorized the training examples.

In this step, we:

1. **Use the model to predict** the labels for the test set (`X_test`)
2. **Compare** the predictions to the true labels (`y_test`)
3. **Compute the accuracy**, which is the fraction of correct predictions

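Because `Sample_C` contains random noise that the features cannot explain, we don’t expect perfect accuracy.  
A realistic model should make some mistakes, and that’s normal. With only 20 test genes, each prediction also moves the accuracy by 5 percentage points, so treat the score as a rough estimate. The important part is understanding the workflow: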

> Train → Predict → Evaluate

This mirrors the same cycle used in real TechBio pipelines, where the goal is not perfection on day one, but **gradual learning across multiple iterations**.


In [15]:
from sklearn.metrics import accuracy_score

# Predict on the test set
y_pred = model.predict(X_test)

print("True labels:     ", y_test.tolist())
print("Predicted labels:", y_pred.tolist())

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.2f}")


True labels:      [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
Predicted labels: [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

Test Accuracy: 0.65
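
A single accuracy number hides which kinds of mistakes the model made. As a small extension, the sketch below takes the true and predicted labels printed above, builds a confusion matrix, and compares the accuracy with a simple baseline that always predicts the most common class. The names `y_true`, `y_hat`, and `majority_share` are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the test-set output above
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Baseline: always predict the most common class in the test set
majority_share = max(y_true.count(0), y_true.count(1)) / len(y_true)
print(f"Model accuracy:    {accuracy_score(y_true, y_hat):.2f}")
print(f"Majority baseline: {majority_share:.2f}")
```

Here the matrix is `[[6, 3], [4, 7]]`: 6 low-expression genes and 7 high-expression genes are classified correctly, with 3 false positives and 4 false negatives. The model's 0.65 sits only modestly above the 0.55 baseline, which fits the rough-estimate reading of a 20-gene test set.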


---

## 6. Interpreting the Model: Feature Importance

Accuracy tells us *how well* the model performs, but it doesn’t tell us **why** the model makes its decisions.  
To open that “black box” a little, we can look at the **feature importances** calculated by the Random Forest.

A Random Forest is made of many decision trees. Each tree splits the data based on the features that help it separate high-expression vs low-expression genes.  
When a feature is consistently useful across many trees, it gets a higher importance score. In scikit-learn, these scores are normalized so that they sum to 1 across all features.

In this notebook, we have two features:

- `Sample_B` — the expression level in Sample B  
- `GC_Content_Feature` — the GC percentage of the gene sequence  

By looking at their importance scores, we can understand which signal the model relied on more to make predictions.

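This kind of interpretation is essential in TechBio workflows because it helps you connect **model behavior** back to **biological meaning**.  
With synthetic data we also know the ground truth: `Sample_C` was generated from `Sample_B` plus noise, while `GC_Content_Feature` was drawn at random. We should therefore expect `Sample_B` to dominate, and any importance assigned to GC content reflects the model fitting noise in a small training set.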


In [18]:
# Get feature importance values
feature_importances = model.feature_importances_
feature_names = X.columns

importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": feature_importances
}).sort_values(by="Importance", ascending=False)

print("Feature importances:")
print(importance_df)


Feature importances:
              Feature  Importance
0            Sample_B    0.690982
1  GC_Content_Feature    0.309018


---

## 7. Bringing It All Together

With this notebook, we completed the full path from **clean data** to a **working predictive model**.  
The model is simple, the dataset is synthetic, and the accuracy is not the main goal.  
What matters is the structure of the workflow:

1. Start with model-ready biological data  
2. Define features and target labels  
3. Split the data into training and testing sets  
4. Train a basic machine learning model  
5. Evaluate its predictions on unseen data  
6. Interpret the model to understand which features mattered  

This mirrors the real TechBio cycle:

> Biology → Data → Cleaning → **Model** → Validation → Better Biology

It’s important to emphasize that the purpose of this notebook is **not** to teach machine learning in depth.  
The goal is simply to make the modeling step of the TechBio workflow concrete, so readers understand how structured data turns into a first predictive signal.  
Everything here is intentionally simple so the focus stays on the *concepts* described in the book, not on ML theory.

In real research or startup work, this loop repeats many times.  
Each round aims to improve the dataset, the features, and the accuracy of the model.  
This is how a simple experimental table gradually becomes the engine of a TechBio platform.

---

## 8. Your Turn: Try Small, Real Experiments

To make the ideas in this notebook “stick,” here are a few hands-on changes you can try.  
Each one teaches an important intuition about how modeling fits into the TechBio loop.

### **1. Change the Target Definition**
Instead of classifying high vs low expression based on the median of `Sample_C`, try:

- using the median of **Sample_B**  
- or defining “high expression” as the top 25%  
- or predicting **whether Sample_C is greater than Sample_B**

This shows how different biological assumptions change the modeling problem.
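
As a starting point, the three variations can be sketched as follows. The first option is read here as thresholding `Sample_C` at the median of `Sample_B`, which is one possible interpretation; the dictionary `targets` and its keys are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset with the same seed and call order as Section 1
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)
df = pd.DataFrame({"Sample_B": sample_b, "Sample_C": sample_c})

targets = {
    # Assumed reading: compare Sample_C against the median of Sample_B
    "median_of_sample_b": (df["Sample_C"] >= df["Sample_B"].median()).astype(int),
    # "High" means the top 25% of Sample_C values
    "top_25_percent": (df["Sample_C"] >= df["Sample_C"].quantile(0.75)).astype(int),
    # 1 if expression is higher in Sample C than in Sample B
    "c_greater_than_b": (df["Sample_C"] > df["Sample_B"]).astype(int),
}

for name, target in targets.items():
    print(f"{name}: {int(target.sum())} of {len(target)} genes labeled 1")
```

Note that the top-25% target is no longer balanced (10 of 40 genes are labeled 1), so a model that always predicts 0 would already reach 75% accuracy on the full dataset.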


### **2. Add or Remove Features**
Try training the model with:

- **only** `Sample_B`  
- **only** `GC_Content_Feature`  
- or by adding a new synthetic feature like noise or a random column  

Watch how accuracy and feature importance change.
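
One way to run this comparison is to loop over feature sets with the same split and model settings. In the sketch below, `Random_Noise` is a hypothetical extra column drawn from a separate random generator so that the original data stay unchanged; `feature_sets` and `accuracies` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Rebuild the synthetic dataset with the same seed and call order as Section 1
np.random.seed(42)
n_genes = 40
sample_b = np.random.normal(loc=10.0, scale=4.0, size=n_genes)
sample_c = sample_b + np.random.normal(loc=0.0, scale=2.0, size=n_genes)
gc_content = np.random.uniform(low=30.0, high=70.0, size=n_genes)
df = pd.DataFrame({"Sample_B": sample_b, "GC_Content_Feature": gc_content})
y = pd.Series((sample_c >= np.median(sample_c)).astype(int))

# Hypothetical pure-noise feature, drawn from an independent generator
rng = np.random.default_rng(0)
df["Random_Noise"] = rng.normal(size=n_genes)

feature_sets = {
    "Sample_B only": ["Sample_B"],
    "GC_Content_Feature only": ["GC_Content_Feature"],
    "Both": ["Sample_B", "GC_Content_Feature"],
    "Both + Random_Noise": ["Sample_B", "GC_Content_Feature", "Random_Noise"],
}

accuracies = {}
for name, cols in feature_sets.items():
    # Same random_state, so every feature set sees the same train/test genes
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols], y, test_size=0.5, random_state=42
    )
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracies[name]:.2f}")
```

Because every run uses the same 20 test genes, differences between rows come from the features alone, although with so few genes a difference of one or two predictions should not be over-interpreted.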


### **3. Change the Model Type**
Swap the Random Forest for another simple algorithm:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
