# LAB 3 — Pipeline Flow Prediction (Machine Learning)
**Lab:** 3 of 4  
**Last updated:** 2025-12-26

## Goal
Build and evaluate ML models to predict pipeline flow rate from sensor readings.

## Recommended dataset (Kaggle)
Gas Pipeline Dataset: https://www.kaggle.com/datasets/garystafford/gas-pipeline-dataset

## What you'll do
1) Load sensor dataset (or use synthetic fallback)  
2) Prepare features + target  
3) Train baseline Linear Regression  
4) Train Random Forest and compare  
5) Evaluate with RMSE and R²  
6) Interpret feature importance

## 1) Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

## 2) Load dataset (local CSV or synthetic fallback)

In [None]:
from pathlib import Path

DATA_PATH = Path("data/gas_pipeline.csv")  # <-- set to your downloaded CSV

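def make_synthetic_pipeline(n=5000, seed=123):
    """Generate a synthetic sensor table for use when no CSV is available.

    Flow depends only on valve opening and pressure drop; temperature and
    vibration do not enter the flow formula.
    """
    rng = np.random.default_rng(seed)
    pressure_in = rng.normal(70, 8, size=n)
    pressure_out = pressure_in - rng.normal(8, 3, size=n)
    temperature = rng.normal(18, 6, size=n)
    vibration = rng.normal(0.15, 0.08, size=n).clip(0, 1)
    valve_open = rng.uniform(0.4, 1.0, size=n)

    # Flow is proportional to valve opening times pressure drop, plus Gaussian noise.
    # Negative pressure drops and negative flows are clipped to 0.
    delta_p = (pressure_in - pressure_out).clip(0, None)
    flow = (valve_open * delta_p * 120 + rng.normal(0, 25, size=n)).clip(0, None)

    df = pd.DataFrame({
        "pressure_in": pressure_in,
        "pressure_out": pressure_out,
        "temperature": temperature,
        "vibration": vibration,
        "valve_open": valve_open,
        "flow_rate": flow
    })
    # Blank out about 1% of temperature readings to mimic sensor dropouts.
    df.loc[rng.random(n) < 0.01, "temperature"] = np.nan
    return df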

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    print("Loaded:", DATA_PATH, "shape:", df.shape)
else:
    df = make_synthetic_pipeline()
    print("Using synthetic dataset. shape:", df.shape)

df.head()

## 3) Prepare data for ML

In [None]:
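df = df.copy()
# Coerce every column to numeric; values that cannot be parsed become NaN.
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Drop columns with no numeric values at all (e.g. text labels), since a median cannot fill them.
df = df.dropna(axis=1, how="all")
df = df.fillna(df.median(numeric_only=True))

# Engineered feature: pressure drop. Skipped if your CSV uses different column names.
if {"pressure_in", "pressure_out"} <= set(df.columns):
    df["delta_pressure"] = df["pressure_in"] - df["pressure_out"]

df.describe().T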

## 4) Train/test split

In [None]:
target = "flow_rate"  # change this if your CSV names the target column differently
features = [c for c in df.columns if c != target]

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
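
The median fill in step 3 is computed over all rows, so test rows influence the values imputed into the training rows. For this lab the effect is probably small, but the following minimal sketch shows one way to avoid it: wrap the imputer and the model in a scikit-learn pipeline so the median is learned from the training rows only. The toy data and the names `demo`, `leak_free`, and `train_median` are illustrative assumptions, not part of the lab dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (not the lab dataset): one feature with ~10% missing values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(10, 2, size=200)})
demo["y"] = 3.0 * demo["x"] + rng.normal(0, 0.5, size=200)
demo.loc[rng.random(200) < 0.1, "x"] = np.nan

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo[["x"]], demo["y"], test_size=0.25, random_state=42
)

# Fitting the pipeline fits the imputer on the training rows only;
# the same training median is then reused to fill gaps in the test rows.
leak_free = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
leak_free.fit(Xd_train, yd_train)

train_median = leak_free.named_steps["simpleimputer"].statistics_[0]
print("median learned from training rows:", round(train_median, 3))
print("test R2:", round(leak_free.score(Xd_test, yd_test), 3))
```

The same pattern should carry over to the lab data by passing the un-imputed feature table to `train_test_split` and fitting the pipeline on the training part.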

## 5) Baseline model — Linear Regression

In [None]:
lin = LinearRegression()
lin.fit(X_train, y_train)

pred_lin = lin.predict(X_test)
rmse_lin = np.sqrt(mean_squared_error(y_test, pred_lin))  # RMSE; avoids squared=False, which newer scikit-learn releases no longer accept
r2_lin = r2_score(y_test, pred_lin)

rmse_lin, r2_lin
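
To make the two metrics concrete, here is a small worked example that computes RMSE and R² by hand and checks them against scikit-learn. The four actual and predicted values are made up for easy arithmetic and are not taken from the lab data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values; every prediction is off by exactly 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE: square root of the mean squared residual, in the same units as the target.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)         # 4 * 10^2 = 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 150^2 + 50^2 + 50^2 + 150^2 = 50000
r2_manual = 1.0 - ss_res / ss_tot

print("RMSE:", rmse_manual)  # 10.0
print("R2:", r2_manual)      # 0.992

# Cross-check against scikit-learn.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
```

Because RMSE is expressed in the units of the target, an RMSE of 10 here reads directly as "a typical error of about 10 flow units".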

## 6) Random Forest model

In [None]:
rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

pred_rf = rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, pred_rf))  # RMSE; avoids squared=False, which newer scikit-learn releases no longer accept
r2_rf = r2_score(y_test, pred_rf)

rmse_rf, r2_rf

## 7) Compare models

In [None]:
pd.DataFrame({
    "model": ["LinearRegression", "RandomForest"],
    "RMSE": [rmse_lin, rmse_rf],
    "R2": [r2_lin, r2_rf],
}).sort_values("RMSE")
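
A single train/test split gives one RMSE per model, and that number can shift with the random split. As a hedged sketch, k-fold cross-validation gives a mean and a spread instead. The toy data below and the names `X_demo`, `y_demo`, and `cv_rmse` are assumptions for illustration; on the lab data you would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy data: the target is a product of two features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = 50 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn scorers follow "higher is better", so RMSE comes back negated.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores
    print(f"{name}: RMSE {cv_rmse[name].mean():.2f} +/- {cv_rmse[name].std():.2f}")
```

If the gap between the two models is smaller than the fold-to-fold spread, the ranking from a single split should be treated with caution.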

## 8) Visual check: predicted vs actual

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(y_test, pred_rf, alpha=0.25)
plt.xlabel("Actual flow_rate")
plt.ylabel("Predicted flow_rate")
plt.title("Random Forest: Predicted vs Actual")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="k", linestyle="--")  # y = x reference: perfect predictions lie on this line
plt.show()

## 9) Feature importance

In [None]:
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)
importances

In [None]:
plt.figure(figsize=(8,4))
importances.head(10).plot(kind="bar")
plt.title("Top feature importances")
plt.ylabel("importance")
plt.show()
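
The `feature_importances_` values above are impurity-based: they are computed from the training data and can be split between correlated inputs, such as a derived pressure difference and the two pressures it comes from. One complementary check is permutation importance on held-out data, sketched below. The toy columns `signal` and `noise` and the names `demo_rf` and `perm` are illustrative assumptions; on the lab data you would pass `rf`, `X_test`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the target depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
demo_y = 10 * demo_X["signal"] + rng.normal(0, 0.5, size=500)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=42
)
demo_rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xp_train, yp_train)

# Shuffle one column at a time on the test set and record how much the score (R^2) drops.
result = permutation_importance(demo_rf, Xp_test, yp_test, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=demo_X.columns).sort_values(ascending=False)
print(perm)
```

A feature whose shuffling barely changes the test score contributes little to predictions, whatever its impurity-based rank.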

## 10) Save outputs

In [None]:
from pathlib import Path
OUT = Path("outputs/lab3_pipeline_ml_dataset.csv")
OUT.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT, index=False)
print("Saved:", OUT)

## Checkpoint questions
1) Why is RMSE useful for operational forecasting?  
2) When might a linear model be preferred over Random Forest?  
3) What happens if the model is trained on a narrow operating range?
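
One way to explore question 3 empirically is to train on a deliberately narrow range and then predict outside it. The sketch below uses a made-up, noise-free relation y = 3x observed only for x between 0 and 1; the names `x_narrow`, `lin_demo`, and `rf_demo` are assumptions for illustration and the setup is not taken from the lab data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free relation y = 3x, observed only for x in [0, 1].
x_narrow = np.linspace(0, 1, 200).reshape(-1, 1)
y_narrow = 3.0 * x_narrow.ravel()

lin_demo = LinearRegression().fit(x_narrow, y_narrow)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(x_narrow, y_narrow)

# Predict at x = 2, well outside the training range; the true value is 6.
x_outside = np.array([[2.0]])
lin_out = lin_demo.predict(x_outside)[0]
rf_out = rf_demo.predict(x_outside)[0]

print("true value at x=2:", 6.0)
print("linear prediction:", round(lin_out, 2))
print("forest prediction:", round(rf_out, 2))
print("largest target seen in training:", y_narrow.max())
```

A forest prediction is an average of training targets, so it cannot exceed the largest target seen in training. The linear model extrapolates correctly here only because the toy relation really is linear.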