# Machine Learning Model Training

This notebook covers:
1. Connecting to the PostgreSQL database.
2. Loading the `features_engineering` dataset.
3. Preprocessing data (feature selection, encoding, splitting).
4. Training Regression Models:
    - **Random Forest** for Cost Prediction.
    - **XGBoost** for CO2 Emission Prediction.
5. Evaluating models (RMSE, MAE, R2).
6. Saving models for the Flask API.

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import joblib
import os

# Database connection parameters (local development values; do not commit real credentials)
DB_USER = "postgres"
DB_PASSWORD = "123456"
DB_HOST = "localhost"
DB_PORT = "5432"
DB_NAME = "ecopackai_db"

# Create Engine
# create_engine() is lazy, so open a connection to confirm the database is reachable
try:
    connection_str = f"postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
    engine = create_engine(connection_str)
    with engine.connect():
        pass
    print("✅ Connected to database.")
except Exception as e:
    print(f"❌ Error connecting: {e}")

✅ Connected to database.
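
The connection cell above hardcodes its credentials. One common alternative, sketched here as an assumption rather than this notebook's actual setup, is to read them from environment variables. The variable names (`DB_USER`, `DB_PASSWORD`, and so on) and the helper `build_connection_url` are hypothetical; the credentials are percent-encoded so that characters such as `@` or `:` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = env.get("DB_USER", "postgres")
    password = env.get("DB_PASSWORD", "")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "ecopackai_db")
    # Percent-encode credentials so '@' or ':' in a password cannot split the URL
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{name}"
    )


# Example with an explicit mapping instead of the real environment
example_url = build_connection_url({"DB_USER": "postgres", "DB_PASSWORD": "p@ss:word"})
print(example_url)
```

The resulting string could be passed to `create_engine` in place of `connection_str`.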


## 1. Load Data

In [2]:
query = "SELECT * FROM features_engineering;"
try:
    df = pd.read_sql(query, engine)
    print(f"✅ Loaded {len(df)} rows.")
    display(df.head())
except Exception as e:
    print(f"❌ Error querying data. Make sure 'features_engineering' table exists. {e}")

✅ Loaded 35 rows.


| | material_id | material_type | strength | weight_capacity_kg | biodegradability_score | co2_emission_score | recyclability_percent | cost_per_unit_inr | water_resistance | co2_impact_index | cost_efficiency_index | sustainability_score | material_suitability_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Corrugated Cardboard | 7 | 12 | 85 | 2.1 | 90 | 18 | 0 | 25.2 | 0.388889 | 33.94 | 17.086667 |
| 1 | 2 | Molded Pulp | 6 | 8 | 92 | 1.8 | 88 | 15 | 0 | 14.4 | 0.4 | 36.792 | 18.516 |
| 2 | 3 | Biodegradable Plastic | 8 | 15 | 65 | 3.9 | 60 | 28 | 1 | 58.5 | 0.285714 | 25.46 | 13.015714 |
| 3 | 4 | Kraft Paper | 5 | 6 | 88 | 2.0 | 85 | 12 | 0 | 12.0 | 0.416667 | 35.14 | 17.695 |
| 4 | 5 | Bamboo Fiber | 9 | 18 | 95 | 1.6 | 92 | 35 | 1 | 28.8 | 0.257143 | 38.048 | 19.301143 |


## 2. Preprocessing

We need to separate features (X) and targets (y).
We will train two separate models:
1. `y_cost`: `cost_per_unit_inr`
2. `y_co2`: `co2_emission_score`

Input Features (X) will be:
- `strength`
- `weight_capacity_kg`
- `biodegradability_score`
- `recyclability_percent`
- `water_resistance`
- `material_type` (Categorical - needs OneHotEncoding)
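
To make the one-hot step concrete, the sketch below encodes a toy `material_type` column and then transforms a value the encoder never saw. The frames `train` and `new` are illustrative, not the real dataset. With `handle_unknown='ignore'`, an unseen category becomes an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data with two known material types
train = pd.DataFrame({"material_type": ["Kraft Paper", "Molded Pulp"]})
# New data containing a category that was not present during fit
new = pd.DataFrame({"material_type": ["Kraft Paper", "Bamboo Fiber"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default; convert to dense for display
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # second row is all zeros: "Bamboo Fiber" is unknown
```

If many material types occur only once in the data (not checked here), test rows would often fall into this all-zero case, and the categorical column would contribute little to predictions on them.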

In [3]:
# Features to use
feature_cols = [
    "strength", 
    "weight_capacity_kg", 
    "biodegradability_score", 
    "recyclability_percent", 
    "water_resistance",
    "material_type"
]

# Targets
target_cost = "cost_per_unit_inr"
target_co2 = "co2_emission_score"

X = df[feature_cols]
y_cost = df[target_cost]
y_co2 = df[target_co2]

# Handle Categorical Data (OneHotEncoding)
# We'll use a Pipeline for preprocessing to make it easy to apply to new data later

categorical_features = ["material_type"]
numeric_features = ["strength", "weight_capacity_kg", "biodegradability_score", "recyclability_percent", "water_resistance"]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Split Data
# With 35 rows, test_size=0.2 holds out only 7 rows, so test metrics will be noisy
X_train, X_test, y_cost_train, y_cost_test, y_co2_train, y_co2_test = train_test_split(
    X, y_cost, y_co2, test_size=0.2, random_state=42
)

print("Data Split Complete.")
print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")

Data Split Complete.
Train shape: (28, 6)
Test shape: (7, 6)
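
With only 7 held-out rows, a single split gives a noisy estimate of model quality. A common alternative, which this notebook does not use, is k-fold cross-validation, where every row serves as test data exactly once. The sketch below runs on synthetic data (`X_demo` and `y_demo` are made up, not the `features_engineering` table) and only illustrates the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a 35-row dataset (NOT the real data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(35, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=35)

model = RandomForestRegressor(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn treats higher scores as better, so RMSE is returned negated
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -neg_rmse

print(f"RMSE per fold: {np.round(rmse_per_fold, 2)}")
print(f"Mean RMSE: {rmse_per_fold.mean():.2f} (+/- {rmse_per_fold.std():.2f})")
```

The spread across folds gives a rough sense of how much a single 7-row test score could vary.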


## 3. Train Model 1: Cost Predictor (Random Forest)

In [4]:
rf_cost_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

rf_cost_pipeline.fit(X_train, y_cost_train)

# Evaluate
y_pred_cost = rf_cost_pipeline.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_cost_test, y_pred_cost))
mae = mean_absolute_error(y_cost_test, y_pred_cost)
r2 = r2_score(y_cost_test, y_pred_cost)

print("--- Cost Prediction Model (Random Forest) ---")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2 Score: {r2:.2f}")

--- Cost Prediction Model (Random Forest) ---
RMSE: 5.42
MAE: 4.87
R2 Score: 0.41
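
For reference, the three reported metrics can be computed by hand. The sketch below uses made-up numbers (not the notebook's test set) to show the formulas behind `mean_squared_error`, `mean_absolute_error`, and `r2_score`.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error
rmse_demo = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors
mae_demo = np.mean(np.abs(errors))
# R2 = 1 - SS_res / SS_tot, where SS_tot measures spread around the mean of y_true
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.2f}")  # RMSE: 2.55
print(f"MAE: {mae_demo:.2f}")    # MAE: 2.50
print(f"R2: {r2_demo:.2f}")      # R2: 0.95
```

An R2 of 0 corresponds to always predicting the mean of the true values, so R2 measures the improvement over that baseline.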


## 4. Train Model 2: CO2 Emission Predictor (XGBoost)

In [5]:
xgb_co2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42))
])

xgb_co2_pipeline.fit(X_train, y_co2_train)

# Evaluate
y_pred_co2 = xgb_co2_pipeline.predict(X_test)

rmse_co2 = np.sqrt(mean_squared_error(y_co2_test, y_pred_co2))
mae_co2 = mean_absolute_error(y_co2_test, y_pred_co2)
r2_co2 = r2_score(y_co2_test, y_pred_co2)

print("\n--- CO2 Prediction Model (XGBoost) ---")
print(f"RMSE: {rmse_co2:.2f}")
print(f"MAE: {mae_co2:.2f}")
print(f"R2 Score: {r2_co2:.2f}")


--- CO2 Prediction Model (XGBoost) ---
RMSE: 0.34
MAE: 0.26
R2 Score: 0.39


## 5. Save Models
We save each full pipeline (preprocessing included) so that the Flask API can pass raw, unencoded input directly to `predict`.

In [6]:
model_dir = '../models'
os.makedirs(model_dir, exist_ok=True)

joblib.dump(rf_cost_pipeline, f'{model_dir}/cost_predictor_model.pkl')
joblib.dump(xgb_co2_pipeline, f'{model_dir}/co2_predictor_model.pkl')

print(f"Models saved in {model_dir}/")

Models saved in ../models/
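
As a rough illustration of how a consumer such as the Flask API might use a saved pipeline, the sketch below fits a reduced two-column pipeline, saves it, reloads it, and predicts from raw input. The training frame reuses four rows from the data preview; the names `demo_pipeline` and `demo_model.pkl` and the temporary directory are illustrative assumptions, not the notebook's real artifacts.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Four preview rows, reduced to two feature columns so the sketch stays small
train_df = pd.DataFrame({
    "strength": [7, 6, 8, 5],
    "material_type": [
        "Corrugated Cardboard", "Molded Pulp", "Biodegradable Plastic", "Kraft Paper",
    ],
})
train_cost = [18, 15, 28, 12]

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", "passthrough", ["strength"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["material_type"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=42)),
])
demo_pipeline.fit(train_df, train_cost)

# Round-trip through disk, as the API process would do at startup
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, path)
    loaded = joblib.load(path)

# Raw, unencoded input: the loaded pipeline applies preprocessing itself
raw_input = pd.DataFrame([{"strength": 6, "material_type": "Kraft Paper"}])
prediction = loaded.predict(raw_input)
print(prediction)
```

The input frame must use the same column names the pipeline was fitted on. Pickled scikit-learn objects are generally only safe to load with the same library versions that produced them.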
