# Industrial Bioreactor Batch Analysis
### Programming for Data Analytics - Final Project
**By Stephen Kerr**

---

## Project Context

This notebook demonstrates proficiency in data analysis techniques as part of the HDip in Data Analytics program.  I've chosen to work with industrial bioreactor data — a domain directly relevant to my professional interests and the pharmaceutical manufacturing sector.

**Why This Dataset?**

In the Life Sciences industry, batch analytics and process understanding are not merely commercial advantages—they are **regulatory requirements**. Before any drug product reaches patients, regulatory bodies (FDA, EMA) demand rigorous process characterization and evidence of consistent manufacturing control. This project mirrors real-world challenges faced by bioprocess engineers daily.

---

## Dataset Overview

**Source:** Goldrick et al. (2015), *Journal of Biotechnology* 193:70-82  
**Process:** Industrial-scale penicillin production via fed-batch fermentation  
**Scale:** 100,000L bioreactor (significantly larger than typical research-scale equipment)  
**Organism:** *Penicillium chrysogenum* (industrial strain)  
**Data:** 10 production batches, subsetted from a larger data source (due to GitHub's compute limitations)

---

## Project Objectives

### Primary Goal
Identify critical factors that distinguish high-yielding batches from problematic ones, providing actionable insights for process optimization and control strategy improvement.

### Specific Research Questions
1. **What characteristics define successful batches?**  
   Compare process variable profiles and identify distinguishing features
   
2. **Which parameters require tightest control?**  
   Determine acceptable operating ranges and critical control points
   
3. **Can we predict batch success early?**  
   Develop early warning indicators and establish alert thresholds

4. **How should feeding strategies be optimized?**  
   Analyze substrate profiles and recommend optimal feeding approaches

---

## Methodology

This analysis follows a systematic approach to extract meaningful insights from industrial process data:

### 1. **Data Acquisition & Understanding**
- Load and explore 10 batch records
- Create comprehensive data dictionary

### 2. **Data Cleaning & Preprocessing**
- Handle missing values in offline measurements
- Validate data ranges using domain knowledge
- Merge batch metadata and categorize by performance

### 3. **Feature Engineering**
- Calculate derived metrics (OUR, CER, estimated biomass)
- Create process phase indicators
- Generate control quality metrics
- Flag critical events (oxygen depletion, nutrient limitation)

### 4. **Exploratory Data Analysis**
- Univariate analysis of key variables
- Bivariate relationships (correlations, scatter plots)
- Multivariate analysis (correlation matrix, PCA)
- Batch comparison and statistical testing

### 5. **Deep Dive Analysis**
**Hypothesis:** *High-yielding batches share identifiable common characteristics*

**Approach:**
- Define success metrics (final penicillin yield)
- Compare successful vs. unsuccessful batch profiles
- Apply dimensionality reduction (PCA) and clustering
- Identify distinguishing features and early indicators

### 6. **Process Optimization Insights**
- Critical control parameters and acceptable ranges
- Feeding strategy recommendations  
- Early warning indicators and alert thresholds
- Cost-benefit analysis of control improvements

---

## Deliverables

**What this notebook demonstrates:**
- ✅ Data wrangling with pandas (cleaning, merging, transforming)
- ✅ Statistical analysis (hypothesis testing, correlation analysis)
- ✅ Feature engineering based on domain knowledge
- ✅ Advanced visualization techniques
- ✅ Machine learning applications (clustering, classification)
- ✅ Clear communication of technical findings

**Expected Outcomes:**
- Quantified relationships between process variables and yield
- Prioritized list of control parameters
- Recommended process improvements
- Predictive models for early batch assessment

---

## Industry Relevance

**Pharmaceutical Quality by Design (QbD):**  
Modern regulatory frameworks (ICH Q8-Q11) require manufacturers to demonstrate process understanding through data-driven approaches—exactly what this analysis provides.

---

## 1. **Data Acquisition & Understanding**

In [1]:
# imports

import pandas as pd 

import numpy as np

In [2]:
# load the data set 
df_batch_1_10 = pd.read_csv("./data/batches-subset-1-10.csv")

# show the info of the data set
df_batch_1_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11585 entries, 0 to 11584
Data columns (total 34 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   Time (h)                                                            11585 non-null  float64
 1   Aeration rate(Fg:L/h)                                               11585 non-null  int64  
 2   Agitator RPM(RPM:RPM)                                               11585 non-null  int64  
 3   Sugar feed rate(Fs:L/h)                                             11585 non-null  int64  
 4   Acid flow rate(Fa:L/h)                                              11585 non-null  float64
 5   Base flow rate(Fb:L/h)                                              11585 non-null  float64
 6   Heating/cooling water flow rate(Fc:L/h)                             11585 non-null  float64
 7   Heating water

In [3]:
# display the first 5 rows of the data set
df_batch_1_10.head(5)

Unnamed: 0,Time (h),Aeration rate(Fg:L/h),Agitator RPM(RPM:RPM),Sugar feed rate(Fs:L/h),Acid flow rate(Fa:L/h),Base flow rate(Fb:L/h),Heating/cooling water flow rate(Fc:L/h),Heating water flow rate(Fh:L/h),Water for injection/dilution(Fw:L/h),Air head pressure(pressure:bar),...,Oxygen Uptake Rate(OUR:(g min^{-1})),Oxygen in percent in off-gas(O2:O2 (%)),Offline Penicillin concentration(P_offline:P(g L^{-1})),Offline Biomass concentratio(X_offline:X(g L^{-1})),Carbon evolution rate(CER:g/h),Ammonia shots(NH3_shots:kgs),Viscosity(Viscosity_offline:centPoise),Fault reference(Fault_ref:Fault ref),0 - Recipe driven 1 - Operator controlled(Control_ref:Control ref),batch_id
0,0.2,30,100,8,0.0,30.118,9.8335,0.0001,0,0.6,...,0.48051,0.19595,,,0.034045,0,,0,0,1
1,0.4,30,100,8,0.0,51.221,18.155,0.0001,0,0.6,...,0.058147,0.2039,,,0.038702,0,,0,0,1
2,0.6,30,100,8,0.0,54.302,9.5982,0.0001,0,0.6,...,-0.041505,0.20575,,,0.04024,0,,0,0,1
3,0.8,30,100,8,0.0,37.816,4.3395,0.0001,0,0.6,...,-0.056737,0.20602,,,0.041149,0,,0,0,1
4,1.0,30,100,8,0.5181,18.908,1.1045,0.0001,0,0.6,...,-0.049975,0.20589,1.0178e-25,0.52808,0.041951,0,4.083,0,0,1


In [4]:
# list all the column names
df_batch_1_10.columns.tolist()

['Time (h)',
 'Aeration rate(Fg:L/h)',
 'Agitator RPM(RPM:RPM)',
 'Sugar feed rate(Fs:L/h)',
 'Acid flow rate(Fa:L/h)',
 'Base flow rate(Fb:L/h)',
 'Heating/cooling water flow rate(Fc:L/h)',
 'Heating water flow rate(Fh:L/h)',
 'Water for injection/dilution(Fw:L/h)',
 'Air head pressure(pressure:bar)',
 'Dumped broth flow(Fremoved:L/h)',
 'Substrate concentration(S:g/L)',
 'Dissolved oxygen concentration(DO2:mg/L)',
 'Penicillin concentration(P:g/L)',
 'Vessel Volume(V:L)',
 'Vessel Weight(Wt:Kg)',
 'pH(pH:pH)',
 'Temperature(T:K)',
 'Generated heat(Q:kJ)',
 'carbon dioxide percent in off-gas(CO2outgas:%)',
 'PAA flow(Fpaa:PAA flow (L/h))',
 'PAA concentration offline(PAA_offline:PAA (g L^{-1}))',
 'Oil flow(Foil:L/hr)',
 'NH_3 concentration off-line(NH3_offline:NH3 (g L^{-1}))',
 'Oxygen Uptake Rate(OUR:(g min^{-1}))',
 'Oxygen in percent in off-gas(O2:O2  (%))',
 'Offline Penicillin concentration(P_offline:P(g L^{-1}))',
 'Offline Biomass concentratio(X_offline:X(g L^{-1}))',
 'Carbo

In [5]:
# drop an unnecessary metadata column: it identifies the control strategy used
# (recipe driven vs. operator controlled) and is not relevant for this analysis
df_batch_1_10 = df_batch_1_10.drop(columns='0 - Recipe driven 1 - Operator controlled(Control_ref:Control ref)')

### Columns: Online, Offline, or Metadata

Some of the columns hold **Online** measurements, taken directly from the process in real time (continuous data with no missing values).
**Offline** measurements, by contrast, come from samples that are removed from the process and taken to a lab, because a specific procedure has to be carried out on the sample (so the data appears as discrete points). These columns include the 'Offline' tag in their column name and only have values at certain time intervals (every 12 or 24 hours according to the documentation). For a good explanation of the practical differences between Online and Offline, see this overview article [here](https://amf.ch/technical-note/a-guide-to-inline-online-atline-and-offline-monitoring/#:~:text=Both%20online%20and%20inline%20monitoring%20provide%20real-time%20data,can%20accommodate%20manual%20interventions%20and%20less%20frequent%20sampling.)
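The naming convention described above can be used to split the columns programmatically. The following is only a minimal sketch on a toy frame: the values are made up, the variable names (`toy`, `metadata_cols`, `offline_cols`, `online_cols`) are illustrative, and a name-based split is a heuristic that should be checked against the data dictionary.

```python
import pandas as pd

# Toy frame reusing a few column names from this dataset (values are made up).
toy = pd.DataFrame({
    "Time (h)": [0.2, 0.4, 0.6],
    "pH(pH:pH)": [6.5, 6.5, 6.4],
    "PAA concentration offline(PAA_offline:PAA (g L^{-1}))": [None, None, 1.2],
    "NH_3 concentration off-line(NH3_offline:NH3 (g L^{-1}))": [None, None, 0.9],
    "Viscosity(Viscosity_offline:centPoise)": [None, None, 4.1],
    "batch_id": [1, 1, 1],
})

# Assumption: these two columns are treated as metadata rather than measurements.
metadata_cols = ["batch_id", "Time (h)"]

# Every offline column carries "offline" somewhere in its name (case-insensitive).
offline_cols = [c for c in toy.columns if "offline" in c.lower()]

# Whatever is neither offline nor metadata is treated as an online measurement.
online_cols = [c for c in toy.columns if c not in offline_cols + metadata_cols]

print("offline:", offline_cols)
print("online:", online_cols)
```

On the real data the same two list comprehensions could be applied to `df_batch_1_10.columns`, which avoids maintaining a hand-typed list of column names.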

There is also a third type of column, the Metadata: 'batch_id', 'Time (h)', and '0 - Recipe driven 1 - Operator controlled(Control_ref:Control ref)'.

I added the 'batch_id' column in a preprocessing task on Kaggle so that I could discern where one batch ends and the next starts. See the code used to do this [here](https://www.kaggle.com/code/stephenkerr17/bioreactor-dataset-preprocessing). The basic approach was to locate the first reading of each batch, which, because our data is simulated, is always 'Time (h)' == 0.2. In non-simulated data this would be harder to locate, and a flag on the machine may be triggered to indicate the start of the batch process. Once I had the start of each batch, I took the start index - 1 to find the end of the preceding batch (discarding any negative or zero indices). With a start and an end for each batch, I looped through them and assigned each one a batch id.

'Time (h)' is, unsurprisingly, the time of the measurement in the given row. Finally, '0 - Recipe driven 1 - Operator controlled(Control_ref:Control ref)' is a flag recording whether the process was recipe driven (0) or operator controlled (1); I removed it from my data as it isn't relevant to my analysis.
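The batch-labelling approach described above can be sketched as follows. This is not the Kaggle preprocessing code itself, only a small reconstruction on a toy frame; the names `raw`, `starts`, `ends`, and `vectorised` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated raw data: three short batches back to back.
raw = pd.DataFrame({"Time (h)": [0.2, 0.4, 0.6, 0.2, 0.4, 0.2, 0.4, 0.6, 0.8]})

# Row positions where a batch starts: the first reading of each simulated batch is 0.2 h.
# np.isclose avoids relying on exact floating-point equality.
starts = np.flatnonzero(np.isclose(raw["Time (h)"], 0.2))

# Each batch ends one row before the next batch starts; the last batch ends at the final row.
ends = np.append(starts[1:] - 1, len(raw) - 1)

# Loop over the (start, end) pairs and stamp each block of rows with its batch id.
raw["batch_id"] = 0
batch_col = raw.columns.get_loc("batch_id")
for batch_id, (start, end) in enumerate(zip(starts, ends), start=1):
    raw.iloc[start:end + 1, batch_col] = batch_id

# Equivalent vectorised form: count how many batch starts have been seen so far.
vectorised = pd.Series(np.isclose(raw["Time (h)"], 0.2)).cumsum()

print(raw["batch_id"].tolist())  # [1, 1, 1, 2, 2, 3, 3, 3, 3]
```

The cumulative-sum version gives the same labels without an explicit loop, which may be preferable on the full dataset.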

## Handling Missing Values and NaNs (Online & Offline)

Because there are two different types of data, it is appropriate to handle missing values differently. A NaN or missing value in an online measurement could indicate a temporary sensor dropout, so for short gaps (say, no more than 3 consecutive readings) we will forward-fill the data points to maintain a continuous time series without distorting the trends. Longer gaps, beyond this window, will be left marked as missing and excluded from calculations.
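A minimal sketch of this gap-limited forward fill is shown below, on made-up pH values. The helper `ffill_short_gaps` and the constant `MAX_GAP` are hypothetical names, and the threshold of 3 readings is the working assumption stated above. Note that plain `ffill(limit=3)` would still fill the first three readings of a longer gap, so the sketch measures the length of each NaN run first.

```python
import numpy as np
import pandas as pd

MAX_GAP = 3  # assumed threshold: longest run of NaNs treated as a sensor dropout


def ffill_short_gaps(series: pd.Series, max_gap: int = MAX_GAP) -> pd.Series:
    """Forward-fill runs of NaN no longer than max_gap; leave longer runs untouched."""
    is_nan = series.isna()
    # Give each run of consecutive NaN / non-NaN values its own id.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run that each row belongs to.
    run_len = is_nan.groupby(run_id).transform("size")
    fillable = is_nan & (run_len <= max_gap)
    # Use the forward-filled value only where the gap is short enough.
    return series.where(~fillable, series.ffill())


toy = pd.DataFrame({
    "batch_id": [1] * 8 + [2] * 3,
    "pH": [6.5, np.nan, np.nan, 6.4, np.nan, np.nan, np.nan, np.nan, np.nan, 6.6, 6.7],
})

# Fill within each batch so a value never leaks across a batch boundary.
toy["pH_filled"] = toy.groupby("batch_id")["pH"].transform(ffill_short_gaps)
print(toy)
```

In this toy frame the two-reading gap is filled with 6.5, the four-reading gap stays NaN, and the leading NaN of batch 2 stays NaN because there is no earlier reading in that batch to carry forward.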

For offline measurements there will be gaps in the data because the data is discrete. To avoid artificially generating data, the NaNs will remain in place for the offline measurements.
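Leaving the NaNs in place does not get in the way of later calculations. The sketch below uses a made-up `X_offline` column with a shortened sampling interval (the real data is sampled every 12 or 24 hours) to show one way of working with the sparse lab values; all names and numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: online readings every 0.2 h, with lab samples at 1.0 h and 2.0 h only.
time_h = np.round(np.arange(1, 11) * 0.2, 1)
toy = pd.DataFrame({
    "Time (h)": time_h,
    "batch_id": 1,
    "X_offline": [np.nan] * 4 + [0.53] + [np.nan] * 4 + [0.61],
})

# Keep the NaNs in the main frame; take a sparse view only when lab values are needed.
offline_samples = toy.dropna(subset=["X_offline"])

# Hours between consecutive lab samples within each batch.
sampling_interval = offline_samples.groupby("batch_id")["Time (h)"].diff()

# pandas reductions skip NaN by default, so the mean uses only the real lab values.
mean_biomass = toy["X_offline"].mean()

print(offline_samples)
print(sampling_interval.tolist())
print(mean_biomass)
```

Because no values are imputed, every statistic computed this way is based only on samples that were actually measured.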

#### Given the above, we need to check that there are no missing values in the online (in-line) measurements

In [12]:
# checking the NaN values in the data set
nan_counts = df_batch_1_10.isna().sum()
print(nan_counts)

Time (h)                                                       0
Aeration rate(Fg:L/h)                                          0
Agitator RPM(RPM:RPM)                                          0
Sugar feed rate(Fs:L/h)                                        0
Acid flow rate(Fa:L/h)                                         0
Base flow rate(Fb:L/h)                                         0
Heating/cooling water flow rate(Fc:L/h)                        0
Heating water flow rate(Fh:L/h)                                0
Water for injection/dilution(Fw:L/h)                           0
Air head pressure(pressure:bar)                                0
Dumped broth flow(Fremoved:L/h)                                0
Substrate concentration(S:g/L)                                 0
Dissolved oxygen concentration(DO2:mg/L)                       0
Penicillin concentration(P:g/L)                                0
Vessel Volume(V:L)                                             0
Vessel Weight(Wt:Kg)     

In [13]:
# checking NaN values in the inline data

inline_data = df_batch_1_10[['Time (h)',
 'Aeration rate(Fg:L/h)',
 'Agitator RPM(RPM:RPM)',
 'Sugar feed rate(Fs:L/h)',
 'Acid flow rate(Fa:L/h)',
 'Base flow rate(Fb:L/h)',
 'Heating/cooling water flow rate(Fc:L/h)',
 'Heating water flow rate(Fh:L/h)',
 'Water for injection/dilution(Fw:L/h)',
 'Air head pressure(pressure:bar)',
 'Dumped broth flow(Fremoved:L/h)',
 'Substrate concentration(S:g/L)',
 'Dissolved oxygen concentration(DO2:mg/L)',
 'Penicillin concentration(P:g/L)',
 'Vessel Volume(V:L)',
 'Vessel Weight(Wt:Kg)',
 'pH(pH:pH)',
 'Temperature(T:K)',
 'Generated heat(Q:kJ)',
 'carbon dioxide percent in off-gas(CO2outgas:%)', 
 'PAA flow(Fpaa:PAA flow (L/h))',
 'Oil flow(Foil:L/hr)',
 'Oxygen Uptake Rate(OUR:(g min^{-1}))',
 'Carbon evolution rate(CER:g/h)',
 'Ammonia shots(NH3_shots:kgs)',
 'Fault reference(Fault_ref:Fault ref)',
 'batch_id']]

inline_data.isna().sum()

Time (h)                                          0
Aeration rate(Fg:L/h)                             0
Agitator RPM(RPM:RPM)                             0
Sugar feed rate(Fs:L/h)                           0
Acid flow rate(Fa:L/h)                            0
Base flow rate(Fb:L/h)                            0
Heating/cooling water flow rate(Fc:L/h)           0
Heating water flow rate(Fh:L/h)                   0
Water for injection/dilution(Fw:L/h)              0
Air head pressure(pressure:bar)                   0
Dumped broth flow(Fremoved:L/h)                   0
Substrate concentration(S:g/L)                    0
Dissolved oxygen concentration(DO2:mg/L)          0
Penicillin concentration(P:g/L)                   0
Vessel Volume(V:L)                                0
Vessel Weight(Wt:Kg)                              0
pH(pH:pH)                                         0
Temperature(T:K)                                  0
Generated heat(Q:kJ)                              0
carbon dioxi
