<a href="https://colab.research.google.com/github/rm01243/Data-Driven-Modelling-for-AD-Workshop/blob/main/AD_ANN_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Data-Driven Modelling of Anaerobic Digesters Using Artificial Neural Networks (ANNs)

![Data Driven Modelling Image](https://raw.githubusercontent.com/rm01243/Data-Driven-Modelling-for-AD-Workshop/main/Data%20Driven%20Modelling%20Image.png)

## Workshop Overview
This notebook provides a beginner-friendly, hands-on workshop on using Artificial Neural Networks (ANNs) for modelling anaerobic digesters (AD) to predict biogas production. It is targeted at biogas researchers, AD plant operators, consultants, and sustainability professionals with little to no programming or modelling experience.

### What is Anaerobic Digestion?
Anaerobic digestion is a biological process where microorganisms break down organic material in the absence of oxygen, producing biogas (mainly methane and carbon dioxide) as a renewable energy source. Modelling this process helps predict biogas output based on inputs like temperature and substrate levels, aiding in optimisation without complex physics-based equations.

### What are ANNs?
Artificial Neural Networks are computational models inspired by the human brain. They learn patterns from data to make predictions. In this workshop, we'll use a simple ANN to predict biogas flow (`q_gas`) from AD parameters.

### Key Features:
- **No Installation Required**: Runs entirely in Google Colab, a free online tool that executes Python code in your browser.
- **Interactive**: Train a model, evaluate it, and see results with plots.
- **Data**: Uses synthetic AD data (`AD_Synthetic_Data.xlsx`) loaded directly from the GitHub repository. Synthetic data is artificially generated to mimic real AD processes for demonstration.
- **Topics Covered**: Data loading, feature selection, preprocessing, ANN building/training, evaluation, and visualisation.

### Prerequisites:
- Basic understanding of anaerobic digestion (AD) processes (e.g., how substrates and temperature affect biogas production).
- Access to Google Colab (free, browser-based—no software needed).
- No coding background required—we'll explain every line!

### How to Use This Notebook:
1. Open in Colab (click the badge above).
2. Run cells sequentially (Shift+Enter or click the play button). Markdown cells (like this) explain concepts; code cells execute Python code.
3. Experiment by modifying parameters (e.g., number of epochs in training, or top features in selection) and re-run cells to see changes.
4. If something goes wrong, restart the runtime (Runtime > Restart runtime) and re-run from the top.

## Step 1: Import Libraries

**Description for Beginners**: Libraries are pre-written code packages that provide tools for tasks like data handling or modelling. We 'import' them to use their functions. For example, Pandas helps read Excel files, and Keras builds the ANN.

**Libraries Explained**:
- `numpy` (np): For numerical operations, like arrays.
- `pandas` (pd): For loading and manipulating data tables.
- `matplotlib.pyplot` (plt): For creating plots and visualisations.
- `time`: To measure how long code takes to run.
- `sklearn` modules: For splitting data, scaling, metrics, and feature selection.
- `tensorflow.keras`: For building and training the ANN (TensorFlow is the backend, Keras is the user-friendly interface).

**Why Do This?** These tools save time—no need to write everything from scratch.

**Note**: Colab has these pre-installed, so just run the cell.

In [None]:
import numpy as np  # For numerical computations and arrays
import pandas as pd  # For reading and handling data in table format
import matplotlib.pyplot as plt  # For creating plots and graphs
import time  # To time how long the code takes
import tensorflow as tf  # For setting random seed
from sklearn.model_selection import train_test_split  # To split data into training and testing sets
from sklearn.preprocessing import MinMaxScaler  # To scale data to a 0-1 range
from sklearn.metrics import r2_score, mean_squared_error  # To evaluate model performance
from sklearn.feature_selection import mutual_info_regression  # For selecting important features
from tensorflow.keras.models import Sequential  # To build the ANN layer by layer
from tensorflow.keras.layers import Dense  # To add layers to the ANN

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

## Step 2: Load the Dataset

**Description for Beginners**: We read the Excel file from the GitHub repository into a 'DataFrame' (df), which is like an Excel table in Python. This lets us access columns easily. We'll display the first 5 rows to check the data.

**Why Do This?** To ensure the data is loaded correctly and understand its structure.

**Tips**: If the sheet name is different, change 'Sheet1'. For your own data, you can replace the URL with your file's raw GitHub URL. Update column names in the data file to full forms like 'Carbohydrates' for clarity, and adjust the 'features' list in Step 3 accordingly.

In [None]:
data_url = 'https://raw.githubusercontent.com/rm01243/Data-Driven-Modelling-for-AD-Workshop/main/AD_Synthetic_Data.xlsx'  # URL to the data file in the repo
df = pd.read_excel(data_url, sheet_name='Sheet1')  # Reads the Excel file from URL into a DataFrame
print("Data preview (first 5 rows):")  # Prints a message
display(df.head())  # Shows the first 5 rows in a nice table format

## Step 3: Define Inputs and Output

**Description for Beginners**: Inputs (features) are the variables that influence biogas production, like temperature or substrate levels. The output (label) is what we want to predict: 'q_gas' (biogas flow rate). We separate them into X (inputs) and y (output).

**Features Explained** (based on typical AD models like ADM1):
- `S_*`: Soluble substrates (e.g., `S_su` = sugars, `S_aa` = amino acids)—these are broken down by microbes.
- `X_*`: Particulate components, including biomass (e.g., `X_ch` = carbohydrates, `X_su` = sugar degraders).
- `Q`: Inflow rate, `T (C)`: Temperature in Celsius—key for microbial activity.
- Others: Ions, inerts, etc.

**Why Do This?** The ANN learns to map inputs to output, like 'high temperature + good substrates = more biogas'.

**Tips**: Update the 'features' list to match the full-form column names (e.g., 'Carbohydrates' instead of 'X_ch') after modifying your data file.

In [None]:
features = ['S_su', 'S_aa', 'S_fa', 'S_va', 'S_bu', 'S_pro', 'S_ac', 'S_IC', 'S_IN', 
            'S_I', 'Q', 'T (C)', 'X_xc', 'X_ch', 'X_pr', 'X_li', 'X_su', 'X_aa', 
            'X_fa', 'X_c4', 'X_pro', 'X_ac', 'X_h2', 'X_I', 'S_cation', 'S_anion']  # List of input column names

label = 'q_gas'  # The output column name

X = df[features]  # Selects input columns from the DataFrame
y = df[label]  # Selects the output column
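
**Optional check**: A mistyped or renamed column in the `features` list only shows up as an error when `df[features]` runs, so it can help to compare the list against the table first. The sketch below uses a tiny made-up DataFrame (`toy_df`) and a hypothetical helper (`find_missing_columns`); neither is part of the workshop data or code.

```python
import pandas as pd

def find_missing_columns(frame, wanted):
    """Return the names in `wanted` that are not columns of `frame` (hypothetical helper)."""
    return [name for name in wanted if name not in frame.columns]

# Made-up stand-in for the workshop DataFrame (values are illustrative only)
toy_df = pd.DataFrame({'S_su': [0.1, 0.2], 'T (C)': [35.0, 36.0], 'q_gas': [1.0, 1.1]})

wanted_features = ['S_su', 'T (C)', 'Q']  # 'Q' is deliberately absent from toy_df
missing = find_missing_columns(toy_df, wanted_features)
print("Missing columns:", missing)
```

With the real data you would pass `df` and `features` instead; an empty list means every name was found.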

## Step 4: Feature Selection Using Mutual Information

**Background on Feature Selection**: Feature selection is the process of identifying and selecting a subset of relevant features (inputs) for use in model construction. It helps improve model performance by reducing overfitting, shortening training times, and enhancing generalisation by removing irrelevant or redundant data. In data-driven modelling like this, with potentially many AD parameters, feature selection ensures the model focuses on the most impactful variables.

![Feature Selection Schematic](https://raw.githubusercontent.com/rm01243/Data-Driven-Modelling-for-AD-Workshop/main/Feature%20Selection%20Image.png)

**Description for Beginners**: Not all inputs are equally important. Mutual Information (MI) scores how much each feature 'informs' the output. A high score means the feature strongly relates to biogas production. We rank them and select the top 10 to simplify the model and focus on key variables.

**Why Do This?** Too many features can confuse the model (overfitting) or slow it down. This step identifies 'VIPs' like temperature, which often impacts AD heavily.

**Tips**: Change `top_n` to select more/fewer features. Run this to see which AD parameters matter most in your data.

In [None]:
mi_scores = mutual_info_regression(X, y, random_state=42)  # Calculates MI scores between features and output
mi_df = pd.DataFrame({'Feature': features, 'Mutual_Information': mi_scores})  # Creates a table with features and scores
mi_df.sort_values(by='Mutual_Information', ascending=False, inplace=True)  # Sorts the table by score, highest first
print("Feature importance based on Mutual Information:")  # Prints a message
display(mi_df)  # Shows the sorted table

# Select top N features
top_n = 10  # Number of top features to keep; you can change this (e.g., to 5 for simpler model)
top_features = mi_df['Feature'].head(top_n).tolist()  # Gets the names of top features
print(f"Top {top_n} features selected for training: {top_features}")  # Prints the selected features

X_top = df[top_features]  # Updates X to only include top features

# Feature Importance Plot (bar chart for visualisation)
top_mi_features = mi_df.head(top_n).sort_values(by='Mutual_Information')  # Selects and sorts for horizontal bar
plt.figure(figsize=(10,6))  # Sets plot size
plt.barh(top_mi_features['Feature'], top_mi_features['Mutual_Information'], color='skyblue')  # Creates horizontal bar chart
plt.xlabel('Mutual Information Score')  # Label for x-axis
plt.title(f'Top {top_n} Features Based on Mutual Information')  # Plot title
plt.tight_layout()  # Adjusts layout to fit labels
plt.show()  # Displays the plot
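
**Optional illustration**: To build intuition for what a Mutual Information score means, the sketch below runs `mutual_info_regression` on made-up data in which one input drives the target and the other is unrelated. The variable names and the sine relationship are assumptions chosen for the demonstration, not properties of the AD dataset.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 500

informative = rng.uniform(-3, 3, n_samples)  # drives the target (non-linearly)
noise_only = rng.uniform(-3, 3, n_samples)   # unrelated to the target
target = np.sin(informative) + 0.05 * rng.normal(size=n_samples)

X_demo = np.column_stack([informative, noise_only])
scores = mutual_info_regression(X_demo, target, random_state=0)

for name, score in zip(['informative', 'noise_only'], scores):
    print(f"{name}: {score:.3f}")
```

The informative column should receive a clearly higher score than the noise column, which is the behaviour the ranking in this step relies on.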

## Step 5: Data Preprocessing

**Description for Beginners**: Raw data varies in scale (e.g., temperature in 30-60°C, concentrations in 0-100). We first split the data into training (the first 80% of rows) for learning and testing (the last 20%) for validation, without shuffling, because AD data is a time series (order matters). We then 'scale' everything to a 0-1 range so no feature dominates. The scalers learn their minimum and maximum from the training rows only, so test values can fall slightly outside 0-1.

**Why Do This?** ANNs learn better from normalised data. Splitting prevents 'cheating': the model shouldn't see test data during training, and that includes the scalers.

**Tips**: Scaling is like converting all units to 'apples' for fair comparison. If your data has outliers, consider other scalers.

In [None]:
# Split data into train and test sets first (no shuffle, so time order is kept)
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_top, y, test_size=0.2, shuffle=False  # Last 20% of rows become the test set
)

scaler_X = MinMaxScaler()  # Creates a scaler for inputs
X_train = scaler_X.fit_transform(X_train_raw)  # Learns min/max from training inputs only, then scales them
X_test = scaler_X.transform(X_test_raw)  # Scales test inputs using the training min/max

scaler_y = MinMaxScaler()  # Creates a scaler for output
y_train = scaler_y.fit_transform(y_train_raw.values.reshape(-1, 1))  # Scales training output (reshapes to 2D for scaler)
y_test = scaler_y.transform(y_test_raw.values.reshape(-1, 1))  # Scales test output using the training min/max
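
**Optional illustration**: With its default settings, `MinMaxScaler` applies `x_scaled = (x - min) / (max - min)` to each column. The sketch below writes that formula out by hand for a few made-up temperatures, so you can see what scaling and inverse scaling do. The numbers and the helper names `scale` and `unscale` are illustrative assumptions.

```python
import numpy as np

fitted_temps = np.array([30.0, 35.0, 40.0, 50.0])  # made-up temperatures the scaler is fitted on (°C)
new_temp = np.array([55.0])                         # made-up later measurement, outside the fitted range

t_min, t_max = fitted_temps.min(), fitted_temps.max()

def scale(values):
    """Min-max scaling using the fitted minimum and maximum."""
    return (values - t_min) / (t_max - t_min)

def unscale(scaled):
    """Inverse of scale(): back to original units."""
    return scaled * (t_max - t_min) + t_min

scaled_fitted = scale(fitted_temps)  # 30 -> 0, 35 -> 0.25, 40 -> 0.5, 50 -> 1
scaled_new = scale(new_temp)         # 55 -> 1.25, i.e. above 1
print(scaled_fitted, scaled_new, unscale(scaled_fitted))
```

A value outside the fitted range maps outside 0-1, which is expected rather than an error.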

## Types of Neural Network
There are multiple types of neural network, each of which comes with its own use cases and level of complexity. The most basic type is the feedforward neural network, in which information travels in only one direction, from input to output. This is the type used in this workshop. Another widely used type is the recurrent neural network, which has feedback connections so that information from earlier steps in a sequence can influence later ones. Recurrent networks are widely employed for sequence tasks such as handwriting or language recognition.

## How Does ANN Work?
Initially, the weights of the network are assigned randomly. When data is given to the input layer, it moves forward: each hidden layer receives the outputs of the previous layer combined with the weights. This continues until the output layer is reached and a result is produced. The result is then compared to the actual value, and the backpropagation algorithm adjusts the weights of the network's connections to reduce the error. What do the neurons in the layers do? Each neuron contributes to the learning. It applies an activation function that determines how much signal passes on, depending on which activation function is used and what input came from the previous layer.
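
The forward pass, error comparison, and weight adjustment described above can be sketched in plain NumPy. This is a minimal illustration on made-up data (learning `y = x²`), using ordinary gradient descent as a simplifying assumption; the workshop model itself is built with Keras and trained with the Adam optimizer.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.uniform(-1, 1, size=(50, 1))  # made-up inputs
y_demo = X_demo ** 2                       # made-up target to learn

# Randomly initialised weights: 1 input -> 8 hidden neurons -> 1 output
W1 = rng.normal(scale=0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
learning_rate = 0.05
losses = []

for step in range(500):
    # Forward pass: input -> hidden layer (ReLU) -> output
    hidden_in = X_demo @ W1 + b1
    hidden_out = np.maximum(0, hidden_in)
    y_pred = hidden_out @ W2 + b2

    # Compare prediction with the actual value (mean squared error)
    error = y_pred - y_demo
    losses.append(float(np.mean(error ** 2)))

    # Backpropagation: gradients of the loss with respect to each weight
    grad_pred = 2 * error / len(X_demo)
    grad_W2 = hidden_out.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_hidden = (grad_pred @ W2.T) * (hidden_in > 0)  # ReLU passes gradient only where it was active
    grad_W1 = X_demo.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0)

    # Adjust the weights a small step against the gradient
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(f"Loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.4f}")
```

The loss at the end should be lower than at the start, which is all that 'learning' means here.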

## Activation Function
Activation functions are what allow an Artificial Neural Network to learn complex, non-linear mappings between the inputs and the response variable. They introduce non-linearity into the network. Each node converts its input signal into an output signal, which is then used as an input to the next layer in the stack.

Specifically, each node in an ANN computes the sum of products of its inputs (X) and their corresponding weights (W), applies an activation function f(x) to that sum, and feeds the result to the next layer as an input.

The most popular types of activation function are:
- Sigmoid (logistic)
- Tanh (hyperbolic tangent)
- ReLU (rectified linear unit)

**Sigmoid**: `f(x) = 1 / (1 + exp(-x))`. Its range is between 0 and 1, and it is an S-shaped curve. It is easy to understand and apply, but several drawbacks have made it fall out of popularity for hidden layers:
- Vanishing gradient problem: for large positive or negative inputs the sigmoid saturates, its gradient becomes close to zero, and the weights barely update.
- Its output isn't zero-centred (0 < output < 1), which can push the gradient updates too far in different directions and makes optimisation harder.
- Convergence is slow.

**Tanh**: `f(x) = (1 - exp(-2x)) / (1 + exp(-2x))`. Its output is zero-centred because its range is -1 to 1 (-1 < output < 1). Optimisation is therefore easier, and in practice tanh is usually preferred over sigmoid. It still suffers from the vanishing gradient problem.

**ReLU**: `R(x) = max(0, x)`, i.e. if x < 0, R(x) = 0, and if x >= 0, R(x) = x. It has become very popular, and it has been reported to converge about 6 times faster than tanh in some experiments. Its mathematical form is simple and cheap to compute. Because its gradient is constant for positive inputs, it does not saturate there, which reduces the vanishing gradient problem. Most deep learning models use ReLU or one of its variants nowadays.

Its limitation is that it should only be used within the hidden layers of a neural network model.

For output layers, use a softmax function for a classification problem to compute the class probabilities, and a simple linear function for a regression problem (as in this workshop, where the output is a biogas flow rate).

Another problem with ReLU is that some neurons can 'die' during training: a weight update can push a neuron into a state where it never activates on any data point again, so its gradient stays at zero. In short, ReLU can result in dead neurons.

A modification called Leaky ReLU was introduced to fix the problem of dying neurons. It gives negative inputs a small slope instead of zero, which keeps the updates alive.
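
The activation functions named in this section can be written in a few lines of NumPy. This is an illustrative sketch, not the Keras implementation; the leaky slope of 0.01 is an assumed example value.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: output between -1 and 1, zero-centred."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (assumed 0.01 here)."""
    return np.where(x > 0, x, slope * x)

sample = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(sample))
```

Comparing the outputs for the negative input shows the difference between ReLU (exactly zero) and Leaky ReLU (a small negative value).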

## What Happens Without Activation Function?
If we do not apply an activation function, the output signal is simply a linear function of the input, i.e. a polynomial of degree one. Linear equations are easy to solve, but they are limited in complexity and have less power to learn complex functional mappings from data. A neural network without activation functions is effectively a linear regression model, no matter how many layers it has, because a chain of linear layers collapses into a single linear layer. Such a model has limited power and often performs poorly. We want our neural network to learn more than a linear function. Without activation functions, it would also be unable to model other complicated kinds of data such as images, video, audio, and speech. That is why we use neural network techniques such as deep learning to make sense of complicated, high-dimensional, non-linear, large datasets: these models have many hidden layers and a complex architecture, which helps extract knowledge from such datasets.
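
The claim that a network without activation functions is just a linear model can be checked numerically. In the sketch below (random made-up weights, illustrative names), two stacked linear layers give exactly the same output as one combined linear layer, while inserting a ReLU between them changes the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))        # 5 made-up samples, 3 inputs each

W1 = rng.normal(size=(3, 4))       # first layer: 3 inputs -> 4 neurons
b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 1))       # second layer: 4 neurons -> 1 output
b2 = rng.normal(size=1)

# Two layers with no activation function in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

# With a ReLU between the layers the mapping is no longer linear
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2

print("Two linear layers equal one linear layer:", np.allclose(two_layers, one_layer))
print("ReLU version equals the linear version:", np.allclose(with_relu, two_layers))
```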

## Step 6: Build the ANN Model

**Description for Beginners**: An ANN is like a layered cake: input layer takes data, hidden layers process it (learning patterns), output layer gives prediction. We use 'relu' activation to handle non-linear relationships (AD processes aren't straight lines). Compile sets how the model learns (optimizer adjusts weights, loss measures error).

![Simple Feedforward Neural Network](https://learnopencv.com/wp-content/uploads/2017/10/mlp-diagram.jpg)

**Why Do This?** This structure is simple yet powerful for prediction tasks like biogas output.

**Tips**: Neurons (64, 32) are like brain cells; more can learn complex patterns but risk overfitting. Experiment by adding layers or changing the number of neurons!

In [None]:
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(X_train.shape[1],)),              # Proper input layer
    Dense(64, activation='relu'),                  # First hidden layer
    Dense(32, activation='relu'),                  # Second hidden layer
    Dense(1)                                       # Output layer (regression)
])

model.compile(optimizer='adam', loss='mean_squared_error')
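
**Optional illustration**: Each `Dense` layer has one weight per input-neuron pair plus one bias per neuron. The sketch below counts the trainable parameters for the 64-32-1 architecture, assuming 10 input features (the default `top_n`). The helper `dense_param_count` is hypothetical and written only for this illustration; in the notebook, `model.summary()` reports the same kind of count.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers (hypothetical helper)."""
    total = 0
    previous = n_inputs
    for units in layer_sizes:
        total += previous * units + units  # weights + biases for this layer
        previous = units
    return total

# 10 inputs -> 64 -> 32 -> 1
# (10*64 + 64) + (64*32 + 32) + (32*1 + 1) = 704 + 2080 + 33
n_params = dense_param_count(10, [64, 32, 1])
print(n_params)  # 2817
```

Adding layers or neurons increases this number quickly, which is one reason larger networks need more data to avoid overfitting.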

## Step 7: Train the Model

**Description for Beginners**: Training is where the ANN adjusts its internal weights to minimise errors. It 'sees' the training data multiple times (epochs). Validation uses test data to check progress. Loss should decrease over epochs.

**Parameters Explained**:
- `epochs=50`: How many times the model reviews the data.
- `batch_size=32`: Processes data in groups of 32 for efficiency.
- `verbose=1`: Shows progress bar.

**Why Do This?** Like practising a skill—the model gets better with iterations.

**Tips**: If loss stops decreasing, it's converged. Increase epochs if needed, but watch for overfitting (train loss low, val loss high).

In [None]:
start_time = time.time()  # Starts timer
history = model.fit(X_train, y_train, epochs=50, batch_size=32, 
                    validation_data=(X_test, y_test), verbose=1)  # Trains the model, saves history for plots
end_time = time.time()  # Ends timer
print(f"Training time: {end_time - start_time:.2f} seconds")  # Prints time taken
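
**Optional illustration**: One simple way to judge when the validation loss has stopped improving is a 'patience' rule: stop once the loss has not improved for a set number of epochs. The sketch below applies that rule to a made-up list of losses. The helper `best_epoch_with_patience` is hypothetical; Keras offers a built-in `EarlyStopping` callback based on the same idea.

```python
def best_epoch_with_patience(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a simple patience rule (hypothetical helper).

    Epochs are counted from 0. Stops once `patience` epochs pass without a new best loss.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses: they improve, then start to rise (a sign of overfitting)
made_up_val_losses = [0.50, 0.30, 0.20, 0.15, 0.14, 0.16, 0.17, 0.18, 0.19, 0.21]
best, stopped = best_epoch_with_patience(made_up_val_losses, patience=3)
print(f"Best epoch: {best}, would stop at epoch: {stopped}")
```

After training, the same function could be given `history.history['val_loss']` to see where the workshop model's validation loss bottomed out.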

## Step 8: Make Predictions and Evaluate

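**Description for Beginners**: Use the trained model to predict on train/test data. Inverse scale back to original units. Metrics: R² (fit quality; 1 is perfect, higher is better, and it can be negative on test data if the model does worse than always predicting the mean), RMSE (error magnitude in the units of `q_gas`, lower is better).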

**Background on Overfitting and Underfitting**: Overfitting occurs when the model learns the training data too well, including noise, and performs poorly on new data. Underfitting happens when the model is too simple and doesn't capture patterns in the data. You can detect overfitting if training metrics are good but test metrics are poor; underfitting if both are poor.

**Why Do This?** To quantify how accurate the model is. Good test metrics mean it generalises to new data.

**Tips**: As a rough rule of thumb, R² > 0.8 is decent for AD modelling. If it is lower, try more data or tweak the model.

In [None]:
y_train_pred = model.predict(X_train)  # Predicts on training data
y_test_pred = model.predict(X_test)  # Predicts on test data

# Inverse scaling to original units
y_train_orig = scaler_y.inverse_transform(y_train)  # Unscales training output
y_train_pred_orig = scaler_y.inverse_transform(y_train_pred)  # Unscales training predictions
y_test_orig = scaler_y.inverse_transform(y_test)  # Unscales test output
y_test_pred_orig = scaler_y.inverse_transform(y_test_pred)  # Unscales test predictions

# Calculate metrics
r2_train = r2_score(y_train_orig, y_train_pred_orig)  # R² for train
rmse_train = np.sqrt(mean_squared_error(y_train_orig, y_train_pred_orig))  # RMSE for train
r2_test = r2_score(y_test_orig, y_test_pred_orig)  # R² for test
rmse_test = np.sqrt(mean_squared_error(y_test_orig, y_test_pred_orig))  # RMSE for test

print(f'Training R²: {r2_train:.4f} (how well it fits train data, 1=perfect), RMSE: {rmse_train:.4f} (typical error size)')  # Prints train metrics
print(f'Test R²: {r2_test:.4f}, RMSE: {rmse_test:.4f}')  # Prints test metrics
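
**Optional illustration**: The two metrics can be computed by hand, which makes clear what they measure. The sketch below uses made-up numbers and hypothetical helper names (`rmse`, `r_squared`); it follows the standard definitions that `r2_score` and `mean_squared_error` implement.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """1 minus (squared error of the model / squared error of always predicting the mean)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

actual = np.array([10.0, 12.0, 14.0, 16.0])    # made-up measured values
good_pred = np.array([11.0, 12.0, 13.0, 16.0])  # close to the measurements
bad_pred = np.array([16.0, 14.0, 12.0, 10.0])   # trend is reversed

print(f"Good model: R² = {r_squared(actual, good_pred):.2f}, RMSE = {rmse(actual, good_pred):.3f}")
print(f"Bad model:  R² = {r_squared(actual, bad_pred):.2f}, RMSE = {rmse(actual, bad_pred):.3f}")
```

The reversed predictions give a negative R², showing that R² is not limited to the 0-1 range when a model does worse than predicting the mean.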

## Step 9: Visualise Results

**Description for Beginners**: Plots show actual vs. predicted biogas (lines should match closely) and loss over epochs (should decrease). This visually confirms performance.

**Why Do This?** Numbers are good, but graphs reveal trends, like if predictions follow AD fluctuations.

**Tips**: If the lines overlap on the training set but not on the test set, the model may be overfitting. Try fewer features or more data.

In [None]:
# Create a single figure with three subplots side by side
fig, axs = plt.subplots(1, 3, figsize=(18, 6))  # 1 row, 3 columns, wide figure to fit legends

# Plot 1: Training Set Actual vs Predicted
axs[0].plot(y_train_orig, label='Actual Train', color='blue')
axs[0].plot(y_train_pred_orig, label='Predicted Train', color='orange')
axs[0].set_title('Training Set: Actual vs Predicted')
axs[0].set_ylabel('q_gas')
axs[0].set_xlabel('Time Steps')
axs[0].legend(loc='upper left')

# Plot 2: Test Set Actual vs Predicted
axs[1].plot(y_test_orig, label='Actual Test', color='blue')
axs[1].plot(y_test_pred_orig, label='Predicted Test', color='orange')
axs[1].set_title('Test Set: Actual vs Predicted')
axs[1].set_ylabel('q_gas')
axs[1].set_xlabel('Time Steps')
axs[1].legend(loc='upper left')

# Plot 3: Model Loss During Training
axs[2].plot(history.history['loss'], label='Train Loss', color='green')
axs[2].plot(history.history['val_loss'], label='Validation Loss', color='red')
axs[2].set_title('Model Loss During Training')
axs[2].set_xlabel('Epoch')
axs[2].set_ylabel('Loss')
axs[2].legend(loc='upper left')

plt.tight_layout()  # Adjusts layout to prevent overlap
plt.show()  # Displays all three plots in one row

## Wrapping Up
You've completed the workshop! You've loaded data, selected key features, preprocessed it, built and trained an ANN, and evaluated its performance with metrics and plots.

**Next Steps**:
- Apply this to your real AD plant data (clean it first if needed).
- Experiment: Change top_n, epochs, number of neurons, or add layers to improve results.
- Explore methods to automate tuning of model parameters (hyperparameters).