# **Machine Learning and Model Training**

In [1]:
from google.colab import files
uploaded = files.upload()


ModuleNotFoundError: No module named 'google.colab'

This step uploads the cleaned dataset into the Colab environment so it can be used for machine learning.

### Import Required Libraries

In [19]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor

from xgboost import XGBRegressor


These libraries are required for data preprocessing, model training, and model evaluation.

### Load Dataset

In [3]:
df = pd.read_csv("final_ecopack_data.csv")
df.head()


Unnamed: 0,material_type,strength,weight_capacity,biodegradability_score,recyclability_percent,co2_emission_score,product_category,typical_weight,fragility_level,shape,material,recycling,weight,co2_impact_index,cost,cost_efficiency_index,material_suitability_score,final_sustainability_index
0,Cornstarch Polymer,67,10,92,93,20.52,Clothing,2.57,Low,bottle,glass,reuse,31.93,0.046468,35.04,0.027747,54.1,54.174215
1,Eco-Foam Starch,86,29,38,80,31.08,Cosmetics,0.54,Low,box,metal,reuse,29.68,0.031172,51.99,0.018871,40.7,40.750044
2,Biodegradable Foam,45,26,44,65,32.73,Stationery,1.99,Low,bottle,glass,reuse,25.83,0.029647,43.72,0.022361,27.5,27.552009
3,Paper Pulp Mold,55,10,51,55,14.2,Toys,1.57,High,can,glass,recycle,22.42,0.065789,41.49,0.023535,34.7,34.789324
4,Bamboo Fiber,82,27,99,93,11.65,Home Decor,0.6,High,pouch,glass,recycle,23.88,0.079051,44.96,0.021758,54.9,55.000809


This loads the dataset into a DataFrame and verifies that the data is read correctly.

### Remove Leakage Features

In [20]:
leakage_cols = [
    'Material_Suitability_Score',
    'final_sustainability_index'
]

df = df.drop(columns=[c for c in leakage_cols if c in df.columns])


Leakage features are removed so the model does not get unrealistically high accuracy by learning target-related information.

### Encode Categorical Features

In [21]:
cat_cols = df.select_dtypes(include='object').columns

le = LabelEncoder()
for col in cat_cols:
    df[col] = le.fit_transform(df[col].astype(str))


Categorical variables are converted into numeric form using label encoding so machine learning models can process them.

### Feature Engineering

In [22]:
df['CO2_Impact_Index'] = df['co2_emission_score'] / (df['strength'] + 1)
df['Cost_Efficiency_Index'] = df['cost'] / (df['strength'] + 1)



New sustainability-related features are engineered to capture CO₂ impact and cost efficiency in a meaningful numeric form.

### Feature Selection & Target Definition

In [7]:
X = df.drop(columns=['cost', 'co2_emission_score'])
y_cost = df['cost']
y_co2 = df['co2_emission_score']


Input features and target variables are separated for cost prediction and CO₂ impact prediction.

### Train–Test Split

In [23]:
X_train, X_test, y_cost_train, y_cost_test = train_test_split(
    X, y_cost, test_size=0.2, random_state=42
)

_, _, y_co2_train, y_co2_test = train_test_split(
    X, y_co2, test_size=0.2, random_state=42
)


The dataset is split into training and testing sets to evaluate model performance on unseen data.

### Random Forest – Cost Prediction

In [24]:
rf_cost = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_cost.fit(X_train, y_cost_train)
rf_cost_pred = rf_cost.predict(X_test)

print("RF Cost R2:", r2_score(y_cost_test, rf_cost_pred))
print("RF Cost RMSE:", np.sqrt(mean_squared_error(y_cost_test, rf_cost_pred)))
print("RF Cost MAE:", mean_absolute_error(y_cost_test, rf_cost_pred))



RF Cost R2: 0.999985658649519
RF Cost RMSE: 0.029577306941937824
RF Cost MAE: 0.006835031250001928


Random Forest is trained for cost prediction and evaluated using R², RMSE, and MAE.

### Random Forest — CO₂ Impact Prediction

In [25]:
rf_co2 = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf_co2.fit(X_train, y_co2_train)

rf_co2_pred = rf_co2.predict(X_test)

print("RF CO2 R2:", r2_score(y_co2_test, rf_co2_pred))
print("RF CO2 RMSE:", np.sqrt(mean_squared_error(y_co2_test, rf_co2_pred)))
print("RF CO2 MAE:", mean_absolute_error(y_co2_test, rf_co2_pred))


RF CO2 R2: 0.9999999586183115
RF CO2 RMSE: 0.0017680837967988773
RF CO2 MAE: 0.00016178125004705668


Random Forest is trained for CO2 prediction and evaluated using R², RMSE, and MAE.

### XGBoost – Cost Prediction

In [26]:
xgb_cost = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,   # optimizer
    max_depth=6,
    random_state=42
)

xgb_cost.fit(X_train, y_cost_train)
xgb_cost_pred = xgb_cost.predict(X_test)

print("XGB Cost R2:", r2_score(y_cost_test, xgb_cost_pred))
print("XGB Cost RMSE:", np.sqrt(mean_squared_error(y_cost_test, xgb_cost_pred)))
print("XGB Cost MAE:", mean_absolute_error(y_cost_test, xgb_cost_pred))



XGB Cost R2: 0.9996743402340859
XGB Cost RMSE: 0.14094369627458012
XGB Cost MAE: 0.05251306200027468


XGBoost is trained for cost prediction and compared against Random Forest.

### XGBoost – CO₂ Impact Prediction

In [27]:
xgb_co2 = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,   # optimizer
    max_depth=6,
    random_state=42
)

xgb_co2.fit(X_train, y_co2_train)
xgb_co2_pred = xgb_co2.predict(X_test)

print("XGB CO2 R2:", r2_score(y_co2_test, xgb_co2_pred))
print("XGB CO2 RMSE:", np.sqrt(mean_squared_error(y_co2_test, xgb_co2_pred)))
print("XGB CO2 MAE:", mean_absolute_error(y_co2_test, xgb_co2_pred))

XGB CO2 R2: 0.9999999985702409
XGB CO2 RMSE: 0.000328647383502937
XGB CO2 MAE: 0.0002358039379119481


XGBoost predicts the CO₂ Impact Index and achieves very high accuracy.

### Model Optimization using Learning Rate (XGBoost)

In [13]:
xgb_optimized = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_optimized.fit(X_train, y_co2_train)

xgb_opt_pred = xgb_optimized.predict(X_test)

print("Optimized XGB CO2 R2:", r2_score(y_co2_test, xgb_opt_pred))
print("Optimized XGB CO2 RMSE:", np.sqrt(mean_squared_error(y_co2_test, xgb_opt_pred)))
print("Optimized XGB CO2 MAE:", mean_absolute_error(y_co2_test, xgb_opt_pred))


Optimized XGB CO2 R2: 0.9999748733153182
Optimized XGB CO2 RMSE: 0.0435678707505765
Optimized XGB CO2 MAE: 0.02145884982347489


Random Forest performed best for cost prediction, while optimized XGBoost achieved superior performance for CO₂ impact prediction.

### Material Recommendation Prediction

In [14]:
user_input = {
    'strength': 70,
    'recyclability_percent': 85,
    'biodegradability_score': 75,
    'weight': 1.5
}

In [15]:
X.columns


Index(['material_type', 'strength', 'weight_capacity',
       'biodegradability_score', 'recyclability_percent', 'product_category',
       'typical_weight', 'fragility_level', 'shape', 'material', 'recycling',
       'weight', 'co2_impact_index', 'cost_efficiency_index',
       'material_suitability_score', 'final_sustainability_index',
       'CO2_Impact_Index', 'Cost_Efficiency_Index'],
      dtype='object')

In [28]:
input_df = pd.DataFrame(columns=X.columns)
input_df.loc[0] = 0

input_df.at[0, 'strength'] = 70
input_df.at[0, 'recyclability_percent'] = 85
input_df.at[0, 'biodegradability_score'] = 75
input_df.at[0, 'weight'] = 1.5



  input_df.at[0, 'weight'] = 1.5


In [29]:
predicted_cost = rf_cost.predict(input_df)[0]
predicted_co2 = xgb_co2.predict(input_df)[0]

print("Predicted Cost:", predicted_cost)
print("Predicted CO₂ Impact:", predicted_co2)


Predicted Cost: 67.32615000000008
Predicted CO₂ Impact: 34.791363


In [30]:
df['recommendation_score'] = (
    abs(df['cost'] - predicted_cost) +
    abs(df['co2_emission_score'] - predicted_co2)
)

recommended_material = df.sort_values('recommendation_score').iloc[0]['material_type']

print("Recommended Material:", recommended_material)

Recommended Material: 2.0


In [None]:
import joblib

material_model = joblib.load("model/material_model.pkl")
cost_model = joblib.load("model/cost_model.pkl")
co2_model = joblib.load("model/co2_model.pkl")


After training the models, a recommendation layer is added.
User inputs numerical values, the models predict cost and CO₂ impact, and the system recommends the most suitable material by comparing predicted values with existing materials.