In [1]:
print("""
EcoPackAI – Feature Engineering Module
-------------------------------------

Input  :
    data/processed/final_ecopack_dataset.csv

Output :
    data/processed/final_ecopack_dataset_fe.csv

Features Created:
- CO2 Impact Index
- Cost Efficiency Index
- Environmental Impact Score
- Material Suitability Score
- Sustainability Rating (A to E)

This module prepares the dataset for:
- ML model training
- Recommendation engine
- BI dashboard analytics
""")


EcoPackAI – Feature Engineering Module
-------------------------------------

Input  :
    data/processed/final_ecopack_dataset.csv

Output :
    data/processed/final_ecopack_dataset_fe.csv

Features Created:
- CO2 Impact Index
- Cost Efficiency Index
- Environmental Impact Score
- Material Suitability Score
- Sustainability Rating (A to E)

This module prepares the dataset for:
- ML model training
- Recommendation engine
- BI dashboard analytics



### Overview of the Module

#### This notebook takes the cleaned, merged dataset (`final_ecopack_dataset.csv`) and adds derived features that quantify material sustainability, cost efficiency, and suitability.
#### These features feed the analytics, the ML models, and the recommendation engine.

In [2]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', None)        # show all rows
pd.set_option('display.width', None)           # auto-adjust width
pd.set_option('display.max_colwidth', None)    # show full column content

In [3]:
# ======================================================
# CONFIG
# ======================================================
INPUT_PATH  = r"../data/processed/final_ecopack_dataset.csv"
OUTPUT_PATH = r"../data/processed/final_ecopack_dataset_fe.csv"

print("=" * 70)
print("EcoPackAI – Feature Engineering")
print("=" * 70)

EcoPackAI – Feature Engineering


## STEP 1: Load Data

#### Loads the processed dataset with shipments + material attributes.

In [4]:
# ======================================================
# STEP 1: LOAD DATA
# ======================================================
df = pd.read_csv(INPUT_PATH)

print("\nSTEP 1: DATA LOADED")
print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}")


STEP 1: DATA LOADED
Rows: 14999 | Columns: 21


In [5]:
df.columns

Index(['Category_item', 'Weight_kg', 'Volumetric_Weight_kg', 'L_cm', 'W_cm',
       'H_cm', 'Fragility', 'Moisture_Sens', 'Shipping_Mode', 'Distance_km',
       'Packaging_Used', 'Cost_USD', 'CO2_Emission_kg_item', 'Material_ID',
       'Material_Name', 'Category_material', 'Density_kg_m3',
       'Tensile_Strength_MPa', 'Cost_per_kg', 'CO2_Emission_kg_material',
       'Biodegradable'],
      dtype='object')
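
Every later step indexes columns by name, so a missing column would only surface as a `KeyError` partway through the notebook. The sketch below shows one way to check for that up front. The helper `missing_columns` and the `REQUIRED_COLUMNS` list are illustrative names, not part of the original notebook; the column names themselves are copied from the listing above.

```python
import pandas as pd

# Columns the later steps read; names copied from the df.columns output above.
REQUIRED_COLUMNS = [
    "Weight_kg", "Volumetric_Weight_kg", "L_cm", "W_cm", "H_cm",
    "Fragility", "Moisture_Sens", "Shipping_Mode", "Cost_USD",
    "CO2_Emission_kg_item", "Category_material", "Density_kg_m3",
    "Tensile_Strength_MPa", "Cost_per_kg", "CO2_Emission_kg_material",
    "Biodegradable",
]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]

# Tiny demo frame (hypothetical data) that lacks most of the required columns.
demo = pd.DataFrame({"Weight_kg": [0.82], "Fragility": [5]})
print(missing_columns(demo))
```

An empty list means the frame has everything the feature steps need.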

## STEP 2: New Composite Features

To improve interpretability and modeling performance, we introduce **five composite, domain-aware features**.  
These features combine **material properties**, **shipment characteristics**, and **environmental factors** into normalized indices.

---

### 1. `co2_impact_index`

**What it represents**  
Overall carbon impact of the shipment caused by **packaging material + logistics**.

**Key drivers**
- Material CO₂ intensity  
- Shipment distance  
- Shipping mode (Air / Road / Sea)  
- Package weight

**Interpretation**
- **Higher value → lower carbon impact** (the code inverts the normalized value, so 100 marks the least carbon-intensive shipment)
- Used as a core signal for environmental optimization

---

### 2. `cost_efficiency_index`

**What it represents**  
How economically efficient the chosen packaging is relative to the **protection required**.

**Key drivers**
- Packaging cost  
- Package weight & volumetric weight  
- Item fragility

**Interpretation**
- **Higher value → better cost efficiency**
- Rewards lower cost with adequate protection

---

### 3. `environmental_impact_score`

**What it represents**  
A holistic environmental harm score that extends beyond raw CO₂ emissions.

**Key drivers**
- Carbon impact  
- Material density  
- Biodegradability

**Interpretation**
- **Higher value → worse environmental impact**
- Penalizes dense and non-biodegradable materials

---

### 4. `material_suitability_score`

**What it represents**  
Technical suitability of the selected packaging material for the shipped item.

**Key drivers**
- Material strength  
- Item fragility  
- Moisture sensitivity  
- Material category

**Interpretation**
- **Higher value → better material–item compatibility**
- Captures engineering fitness of packaging choice

---

### 5. `sustainability_rating`

**What it represents**  
A human-readable sustainability assessment derived from multiple indices.

**Key drivers**
- Environmental impact  
- Cost efficiency  
- Biodegradability

**Interpretation**
- Converted into discrete grades (A–E)
- Useful for dashboards, recommendations, and reporting

---

### Scoring Conventions

- **Higher score = worse**: `environmental_impact_score`
- **Higher score = better**: `co2_impact_index` (inverted during normalization), `cost_efficiency_index`, `material_suitability_score`
- **All numeric scores are min–max normalized to 0–100** to ensure model compatibility and stability
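
The code cells in this notebook repeat the same min–max scaling for each index. A minimal sketch of that pattern as a single helper is shown here. The function name `minmax_0_100`, its `invert` flag, and the guard for a constant column are assumptions added for illustration; the notebook itself inlines the arithmetic and has no such guard.

```python
import pandas as pd

def minmax_0_100(raw: pd.Series, invert: bool = False) -> pd.Series:
    """Scale `raw` to 0-100; with invert=True the smallest raw value maps to 100."""
    lo, hi = raw.min(), raw.max()
    if hi == lo:
        # Constant column: (raw - lo) / (hi - lo) would be 0/0, so use zeros.
        scaled = pd.Series(0.0, index=raw.index)
    else:
        scaled = (raw - lo) / (hi - lo) * 100
    if invert:
        scaled = 100 - scaled
    return scaled.clip(0, 100).round(2)

raw = pd.Series([2.0, 4.0, 6.0])
print(minmax_0_100(raw).tolist())               # [0.0, 50.0, 100.0]
print(minmax_0_100(raw, invert=True).tolist())  # [100.0, 50.0, 0.0]
```

Because the minimum and maximum come from the data, a single extreme row compresses every other row toward one end of the scale.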

---

These composite features enable:
- Environmentally aware ML models  
- Cost–sustainability trade-off analysis  
- Explainable packaging recommendations

## STEP 3: CO₂ Impact Index

### “How carbon-intensive was this shipment due to packaging + logistics?”

The **CO₂ Impact Index** quantifies the overall carbon footprint of a shipment by combining  
**material emissions** and **logistics-related emissions** into a single normalized score.

---

### 🔗 Influencing Columns

- `CO2_Emission_kg_item`  *(Observed CO₂ emissions from historical shipment data)*
- `CO2_Emission_kg_material`  *(Intrinsic CO₂ emissions per kg of the packaging material)*
- `Weight_kg`  *(Actual shipment weight)*
- `Distance_km`  *(Transportation distance; it does not appear as a separate term in the formula below)*
- `Shipping_Mode`  *(Air / Road / Sea)*

---

### ✨ Logic

- Air transport has a higher carbon penalty than road or sea  
- Longer distance increases emissions  
- High-CO₂ packaging materials worsen the overall impact  
- Logarithmic scaling prevents large shipments from dominating the index

---

### 📐 Formula

```text
shipping_multiplier:
    Air   → 1.5
    Road  → 1.0
    Sea   → 0.8

co2_impact_raw =
    log1p(
        CO2_Emission_kg_item
        + (CO2_Emission_kg_material × Weight_kg)
    )
    × shipping_multiplier

co2_impact_index =
    100 - (co2_impact_raw - min) / (max - min) × 100
```
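
To make the formula concrete, the sketch below evaluates the raw (pre-normalization) value for a single shipment. The function name `co2_impact_raw` and the dictionary name are illustrative; the input numbers are taken from the first row of the `df.head()` output in this notebook, and the fallback of 1.0 for unknown modes mirrors the `fillna(1.0)` in the notebook's code cell.

```python
import math

# Multipliers as given in the formula; unknown modes fall back to 1.0.
SHIPPING_MULTIPLIER = {"Air": 1.5, "Road": 1.0, "Sea": 0.8}

def co2_impact_raw(item_co2, material_co2_per_kg, weight_kg, mode):
    """Raw (un-normalized) CO2 impact for a single shipment."""
    total = item_co2 + material_co2_per_kg * weight_kg
    return math.log1p(total) * SHIPPING_MULTIPLIER.get(mode, 1.0)

# Same shipment, two shipping modes.
air = co2_impact_raw(6.673, 1.026, 0.82, "Air")
sea = co2_impact_raw(6.673, 1.026, 0.82, "Sea")
print(round(air, 4), round(sea, 4))
```

The notebook then min–max scales this raw value and subtracts it from 100, so a larger raw result yields a lower `co2_impact_index`.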

In [6]:
# ======================================================
# STEP 3: CO₂ IMPACT INDEX (0–100, Higher = Better)
# ======================================================

print("\nSTEP 3: Creating CO₂ Impact Index")

# --- Shipping mode multipliers ---
shipping_multiplier_map = {
    "Air": 1.5,
    "Road": 1.0,
    "Sea": 0.8
}

df["shipping_multiplier"] = (
    df["Shipping_Mode"]
    .map(shipping_multiplier_map)
    .fillna(1.0)
)

# --- Raw CO₂ impact (log-scaled) ---
df["co2_impact_raw"] = (
    np.log1p(
        df["CO2_Emission_kg_item"]
        + (df["CO2_Emission_kg_material"] * df["Weight_kg"])
    )
    * df["shipping_multiplier"]
)

# --- Min–Max normalization (lower CO₂ = better score) ---
raw_min = df["co2_impact_raw"].min()
raw_max = df["co2_impact_raw"].max()

df["co2_impact_index"] = 100 - (
    (df["co2_impact_raw"] - raw_min) / (raw_max - raw_min) * 100
)

# --- Safety clipping & rounding ---
df["co2_impact_index"] = (
    df["co2_impact_index"]
    .clip(0, 100)
    .round(2)
)

# --- Cleanup intermediate columns ---
df.drop(columns=["shipping_multiplier", "co2_impact_raw"], inplace=True)

print("✓ CO₂ Impact Index created (0–100 scale)")


STEP 3: Creating CO₂ Impact Index
✓ CO₂ Impact Index created (0–100 scale)


## STEP 4: Cost Efficiency Index

### “How cost-effective was the packaging relative to its size & protection?”

The **Cost Efficiency Index** evaluates whether the packaging cost is justified given  
the **shipment size** and the **level of protection required**.

---

### 🔗 Influencing Columns

- `Cost_USD`  
  *(Actual packaging cost for the shipment)*

- `Cost_per_kg`  
  *(Material cost per kilogram)*

- `Weight_kg`  
  *(Actual shipment weight)*

- `Volumetric_Weight_kg`  
  *(Volume-based billable weight)*

- `Fragility`  
  *(Protection requirement indicator)*

---

### ✨ Logic

- Packaging cost should scale with **effective shipment weight**
- Fragile items justify slightly higher cost
- Overpriced packaging reduces efficiency
- Underpriced yet protective packaging improves efficiency

---

### 📐 Formula

```
Effective Weight:
W_eff = max(Weight_kg, Volumetric_Weight_kg)

Expected Packaging Cost:
C_expected = Cost_per_kg * W_eff

Cost Efficiency Index:
Cost_Efficiency_Index =
    Fragility /
    ((Cost_USD / C_expected) + 1e-6)
```
---

### 🔎 Interpretation

- **Higher value → better cost efficiency**
- Indicates economically optimal packaging choices
- Helps detect over-engineered or overpriced packaging

---

### ⭐ Significance

✔ Penalizes unnecessarily expensive packaging  
✔ Rewards low-cost solutions with adequate protection  
✔ Supports cost-optimized and sustainable packaging decisions
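
The cost efficiency formula in this step can be traced by hand for one shipment. The sketch below does so for the first row of the `df.head()` output. The function name `cost_efficiency_raw` is illustrative, and the `nan` return for a zero expected cost is an added assumption; the notebook's own code has no such guard.

```python
def cost_efficiency_raw(cost_usd, cost_per_kg, weight_kg, volumetric_weight_kg, fragility):
    """Raw (un-normalized) cost efficiency for one shipment."""
    effective_weight = max(weight_kg, volumetric_weight_kg)
    expected_cost = cost_per_kg * effective_weight
    if expected_cost == 0:
        # Assumption: report an undefined result instead of dividing by zero.
        return float("nan")
    return fragility / ((cost_usd / expected_cost) + 1e-6)

# First row of df.head(): Kraft Paper Mailer, Fragility 5.
row0 = cost_efficiency_raw(1.56, 1.52, 0.82, 1.41, 5)
# Same shipment with the packaging cost doubled: efficiency roughly halves.
row0_pricier = cost_efficiency_raw(3.12, 1.52, 0.82, 1.41, 5)
print(round(row0, 3), round(row0_pricier, 3))
```

Note that a row with a `Cost_USD` of 0 would evaluate to `Fragility / 1e-6`, a very large value that could dominate the subsequent min–max scaling.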

In [7]:
# ======================================================
# STEP 4: COST EFFICIENCY INDEX (0–100, Higher = Better)
# ======================================================

print("\nSTEP 4: Creating Cost Efficiency Index")

# --- Effective weight (billing logic) ---
df["effective_weight"] = df[["Weight_kg", "Volumetric_Weight_kg"]].max(axis=1)

# --- Expected packaging cost based on material ---
df["expected_cost"] = df["Cost_per_kg"] * df["effective_weight"]

# --- Raw cost efficiency (fragility-adjusted) ---
df["cost_efficiency_raw"] = (
    df["Fragility"] /
    ((df["Cost_USD"] / df["expected_cost"]) + 1e-6)
)

# --- Min–Max normalization (higher = better) ---
raw_min = df["cost_efficiency_raw"].min()
raw_max = df["cost_efficiency_raw"].max()

df["cost_efficiency_index"] = (
    (df["cost_efficiency_raw"] - raw_min) /
    (raw_max - raw_min) * 100
)

# --- Safety clipping & rounding ---
df["cost_efficiency_index"] = (
    df["cost_efficiency_index"]
    .clip(0, 100)
    .round(2)
)

# --- Cleanup intermediate columns ---
df.drop(
    columns=["effective_weight", "expected_cost", "cost_efficiency_raw"],
    inplace=True
)

print("✓ Cost Efficiency Index created (0–100 scale)")


STEP 4: Creating Cost Efficiency Index
✓ Cost Efficiency Index created (0–100 scale)


## STEP 5: Environmental Impact Score

### “Overall environmental harm score (lower is better)”

The **Environmental Impact Score** extends beyond raw CO₂ emissions by incorporating  
material characteristics that influence long-term environmental harm.

---

### 🔗 Influencing Columns

- co2_impact_index
- Biodegradable
- Density_kg_m3

---

### ✨ Logic

- High CO₂ impact increases environmental harm
- Dense materials imply greater material usage
- Non-biodegradable materials are penalized

---

### 📐 Formula

```
Biodegradability Penalty:
    Yes → 0.7
    No  → 1.3

Environmental Impact Score:
    environmental_impact_score =
        (100 - co2_impact_index)
        * biodegradable_penalty
        * log(1 + Density_kg_m3)
```

---

### 🔎 Interpretation

- Higher score → worse environmental impact
- Lower score → more sustainable packaging choice
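
As a worked illustration, the sketch below computes the raw (pre-normalization) environmental score for one shipment. It follows the notebook's code cell for this step, which uses `100 - co2_impact_index` because that index is stored on a higher-is-better scale. The function name is illustrative and the density of 700 kg/m³ is a hypothetical value chosen for the example.

```python
import math

BIODEGRADABLE_PENALTY = {"Yes": 0.7, "No": 1.3}

def environmental_impact_raw(co2_impact_index, biodegradable, density_kg_m3):
    """Raw (un-normalized) environmental harm for one shipment."""
    # co2_impact_index is "higher = better", so invert it to get carbon harm.
    carbon_harm = 100 - co2_impact_index
    # Unknown biodegradability values get a neutral penalty of 1.0.
    penalty = BIODEGRADABLE_PENALTY.get(biodegradable, 1.0)
    return carbon_harm * penalty * math.log1p(density_kg_m3)

# Same shipment and density (hypothetical), differing only in biodegradability.
yes = environmental_impact_raw(49.96, "Yes", 700.0)
no = environmental_impact_raw(49.96, "No", 700.0)
print(round(yes, 2), round(no, 2))
```

With everything else equal, the non-biodegradable material scores 1.3 / 0.7 times worse, roughly 1.86 times.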

In [8]:
# ======================================================
# STEP 5: ENVIRONMENTAL IMPACT SCORE (0–100, Higher = Worse)
# ======================================================

print("\nSTEP 5: Creating Environmental Impact Score")

# --- Biodegradability penalty ---
biodegradable_penalty_map = {
    "Yes": 0.7,
    "No": 1.3
}

df["biodegradable_penalty"] = (
    df["Biodegradable"]
    .map(biodegradable_penalty_map)
    .fillna(1.0)
)

# --- Raw environmental impact ---
# NOTE: co2_impact_index is higher = better,
# so we invert it to represent impact
df["environmental_impact_raw"] = (
    (100 - df["co2_impact_index"])
    * df["biodegradable_penalty"]
    * np.log1p(df["Density_kg_m3"])
)

# --- Min–Max normalization (higher = worse) ---
raw_min = df["environmental_impact_raw"].min()
raw_max = df["environmental_impact_raw"].max()

df["environmental_impact_score"] = (
    (df["environmental_impact_raw"] - raw_min) /
    (raw_max - raw_min) * 100
).clip(0, 100).round(2)

# --- Cleanup ---
df.drop(
    columns=["biodegradable_penalty", "environmental_impact_raw"],
    inplace=True
)

print("✓ Environmental Impact Score created (0–100, higher = worse)")


STEP 5: Creating Environmental Impact Score
✓ Environmental Impact Score created (0–100, higher = worse)


## STEP 6: Material Suitability Score

### “Was this material technically appropriate for the shipment?”

The **Material Suitability Score** evaluates how well the chosen packaging material  
matches the **physical and environmental requirements** of the shipped item.

---

### 🔗 Influencing Columns

- Tensile_Strength_MPa
- Fragility
- Moisture_Sens
- Category_item
- Category_material

---

### ✨ Logic

- Stronger materials are better suited for fragile items
- Moisture-sensitive items benefit from moisture-resistant materials
- Material category influences environmental protection capability

---

### 📐 Formula

```
Fragility Fit:
    fragility_fit = Tensile_Strength_MPa / (Fragility + 1)

Moisture Fit:
    if Moisture_Sens and Category_material in ["Plastic", "Metal", "Bio-based"]:
        moisture_fit = 1.2
    else:
        moisture_fit = 1.0

Material Suitability Score:
    material_suitability_score =
        fragility_fit * moisture_fit
```

---

### 🔎 Interpretation

- Higher score → better material–item compatibility
- Captures engineering suitability of packaging choice
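
The sketch below applies the two fit terms to a three-row toy frame so their interaction is visible. The data is hypothetical, including the `"Paper"` category, which is used only as an example of a material outside the moisture-safe list; `MOISTURE_SAFE` is an illustrative name.

```python
import numpy as np
import pandas as pd

MOISTURE_SAFE = ["Plastic", "Metal", "Bio-based"]

# Hypothetical shipments with identical material strength.
demo = pd.DataFrame({
    "Tensile_Strength_MPa": [40.0, 40.0, 40.0],
    "Fragility": [9, 9, 1],
    "Moisture_Sens": [True, True, False],
    "Category_material": ["Plastic", "Paper", "Paper"],
})

# Strength available per unit of protection need.
fragility_fit = demo["Tensile_Strength_MPa"] / (demo["Fragility"] + 1)

# 1.2 bonus only when a moisture-sensitive item gets a moisture-safe material.
moisture_fit = np.where(
    demo["Moisture_Sens"] & demo["Category_material"].isin(MOISTURE_SAFE), 1.2, 1.0
)

demo["material_suitability_raw"] = fragility_fit * moisture_fit
print(demo)
```

For a fixed tensile strength, the raw score falls as fragility rises, so the low-fragility row scores highest here even without the moisture bonus.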

In [9]:
# ======================================================
# STEP 6: MATERIAL SUITABILITY SCORE (0–100, Higher = Better)
# ======================================================

print("\nSTEP 6: Creating Material Suitability Score")

# --- Fragility fit (strength vs protection need) ---
df["fragility_fit"] = (
    df["Tensile_Strength_MPa"] / (df["Fragility"] + 1)
)

# --- Moisture fit ---
moisture_safe_categories = ["Plastic", "Metal", "Bio-based"]

df["moisture_fit"] = np.where(
    (df["Moisture_Sens"] == True) &
    (df["Category_material"].isin(moisture_safe_categories)),
    1.2,
    1.0
)

# --- Raw suitability score ---
df["material_suitability_raw"] = (
    df["fragility_fit"] * df["moisture_fit"]
)

# --- Normalize to 0–100 ---
raw_min = df["material_suitability_raw"].min()
raw_max = df["material_suitability_raw"].max()

df["material_suitability_score"] = (
    (df["material_suitability_raw"] - raw_min) /
    (raw_max - raw_min) * 100
).clip(0, 100).round(2)

# --- Cleanup ---
df.drop(
    columns=["fragility_fit", "moisture_fit", "material_suitability_raw"],
    inplace=True
)

print("✓ Material Suitability Score created (0–100 scale)")


STEP 6: Creating Material Suitability Score
✓ Material Suitability Score created (0–100 scale)


## STEP 7: Sustainability Rating

### “Human-friendly final sustainability label”

The **Sustainability Rating** converts multiple quantitative indices into a  
simple, interpretable sustainability grade.

---

### 🔗 Influencing Columns

- environmental_impact_score
- cost_efficiency_index
- Biodegradable

---

### ✨ Logic

- Lower environmental impact improves sustainability
- Higher cost efficiency improves sustainability
- Biodegradable materials receive a positive boost

---

### 📐 Formula

```
Sustainability Score:
    sustainability_score =
        (1 / (environmental_impact_score + 1e-6))
        * cost_efficiency_index
        * (1.2 if Biodegradable == "Yes" else 0.8)
```

---

### 🏷 Rating Buckets

Score Percentile → Rating

- Top 20%        → A
- 60–80%        → B
- 40–60%        → C
- 20–40%        → D
- Bottom 20%    → E

---

### 🔎 Interpretation

- Grade A → Highly sustainable packaging choice
- Grade E → Environmentally inefficient or costly packaging
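
The percentile bucketing can be seen on a small example. The sketch below uses ten hypothetical scores and the same `rank(pct=True)` plus `pd.cut` combination as the notebook's code cell, so each grade ends up with two rows.

```python
import pandas as pd

# Ten hypothetical sustainability scores (all distinct).
scores = pd.Series([0.13, 0.40, 0.81, 1.48, 1.63, 2.10, 0.05, 0.95, 1.10, 3.00])

# Percentile rank in (0, 1]: the lowest score gets 0.1, the highest 1.0.
percentiles = scores.rank(pct=True)

# Right-closed bins: (0, 0.2] -> E, (0.2, 0.4] -> D, ..., (0.8, 1.0] -> A.
ratings = pd.cut(
    percentiles,
    bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=["E", "D", "C", "B", "A"],
)

print(pd.DataFrame({"score": scores, "percentile": percentiles, "rating": ratings}))
```

Because `rank(pct=True)` never returns 0, the open left edge of the first bin does not drop any rows. The grades are relative: they describe a shipment's position within this dataset, not an absolute sustainability threshold.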

In [10]:
# ======================================================
# STEP 7: SUSTAINABILITY RATING
# ======================================================

print("\nSTEP 7: Creating Sustainability Rating")

# --- Sustainability score (higher = better) ---
df["sustainability_score"] = (
    (1 / (df["environmental_impact_score"] + 1e-6)) *
    df["cost_efficiency_index"] *
    np.where(df["Biodegradable"] == "Yes", 1.2, 0.8)
)

# --- Percentile-based grading ---
percentiles = df["sustainability_score"].rank(pct=True)

df["sustainability_rating"] = pd.cut(
    percentiles,
    bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=["E", "D", "C", "B", "A"]
)

print("✓ Sustainability Rating assigned (A–E)")


STEP 7: Creating Sustainability Rating
✓ Sustainability Rating assigned (A–E)


In [11]:
# ======================================================
# Adding ITEM VOLUME as a feature
# ======================================================

print("\nSTEP 4.5: Creating Item Volume Feature")

# Create volume in cubic centimeters
df["Item_Volume_cm3"] = (
    df["L_cm"] * df["W_cm"] * df["H_cm"]
)

# Optional: volume in cubic meters (future-proof)
df["Item_Volume_m3"] = df["Item_Volume_cm3"] / 1_000_000

# Drop raw dimensions (no longer needed)
df.drop(columns=["L_cm", "W_cm", "H_cm"], inplace=True)

print("✓ Item_Volume_cm3 and Item_Volume_m3 created")
print("✓ Dropped L_cm, W_cm, H_cm")

# Quick sanity check
df[["Item_Volume_cm3", "Item_Volume_m3"]].describe()


STEP 4.5: Creating Item Volume Feature
✓ Item_Volume_cm3 and Item_Volume_m3 created
✓ Dropped L_cm, W_cm, H_cm


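|       | Item_Volume_cm3 | Item_Volume_m3 |
|-------|-----------------|----------------|
| count | 14999.0         | 14999.0        |
| mean  | 133260.4        | 0.13326        |
| std   | 381674.5        | 0.381675       |
| min   | 0.0             | 0.0            |
| 25%   | 475.0           | 0.000475       |
| 50%   | 5850.0          | 0.00585        |
| 75%   | 28910.0         | 0.02891        |
| max   | 1822824.0       | 1.822824       |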


In [12]:
df.drop(columns=["Item_Volume_cm3"], inplace=True)

## STEP 8: Final Validation

#### Lists the new feature columns and prints the distribution of sustainability ratings.
#### Significance: A quick sanity check that the ratings are spread evenly across the five percentile buckets.
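
A stricter check than printing the feature names is to assert that each column exists and that the 0–100 scores are in range and free of NaN. The sketch below is one possible form. The function `validate_features` is a hypothetical helper, not part of the notebook, and the demo frame reuses two rows of values from the `df.head()` output.

```python
import pandas as pd

def validate_features(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    score_columns = [
        "co2_impact_index", "cost_efficiency_index",
        "environmental_impact_score", "material_suitability_score",
    ]
    # 1) Every engineered column must be present.
    for name in score_columns + ["sustainability_rating", "Item_Volume_m3"]:
        if name not in frame.columns:
            problems.append(f"missing column: {name}")
    # 2) Normalized scores must be NaN-free and inside 0-100.
    for name in score_columns:
        if name in frame.columns:
            col = frame[name]
            if col.isna().any():
                problems.append(f"NaN values in {name}")
            if not col.dropna().between(0, 100).all():
                problems.append(f"{name} outside 0-100")
    return problems

demo = pd.DataFrame({
    "co2_impact_index": [49.96, 78.0],
    "cost_efficiency_index": [8.88, 5.37],
    "environmental_impact_score": [26.85, 7.96],
    "material_suitability_score": [26.22, 3.72],
    "sustainability_rating": ["D", "D"],
    "Item_Volume_m3": [0.007056, 0.0],
})
print(validate_features(demo))
print(validate_features(demo.drop(columns=["Item_Volume_m3"])))
```

Returning a list rather than raising on the first failure lets all problems be reported in one pass.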

In [13]:
# ======================================================
# STEP 8: FINAL VALIDATION
# ======================================================
print("\nSTEP 8: FINAL VALIDATION")

print("New Features Added:")
new_features = [
    "co2_impact_index",
    "cost_efficiency_index",
    "environmental_impact_score",
    "material_suitability_score",
    "sustainability_rating",
    "Item_Volume_m3"
]

for f in new_features:
    print(f"  ✓ {f}")

print("\nSustainability Rating Distribution:")
print(df["sustainability_rating"].value_counts())


STEP 8: FINAL VALIDATION
New Features Added:
  ✓ co2_impact_index
  ✓ cost_efficiency_index
  ✓ environmental_impact_score
  ✓ material_suitability_score
  ✓ sustainability_rating
  ✓ Item_Volume_m3

Sustainability Rating Distribution:
sustainability_rating
D    3000
C    3000
B    3000
A    3000
E    2999
Name: count, dtype: int64


## STEP 9: Save Dataset

#### Saves the enriched dataset to final_ecopack_dataset_fe.csv.

#### Significance: Prepares the dataset for ML training, recommendation engine, and BI dashboards.

In [14]:
df.head()

|   | Category_item | Weight_kg | Volumetric_Weight_kg | Fragility | Moisture_Sens | Shipping_Mode | Distance_km | Packaging_Used | Cost_USD | CO2_Emission_kg_item | ... | Cost_per_kg | CO2_Emission_kg_material | Biodegradable | co2_impact_index | cost_efficiency_index | environmental_impact_score | material_suitability_score | sustainability_score | sustainability_rating | Item_Volume_m3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Clothing | 0.82 | 1.41 | 5 | False | Air | 1893 | Kraft Paper Mailer | 1.56 | 6.673 | ... | 1.52 | 1.026 | Yes | 49.96 | 8.88 | 26.85 | 26.22 | 0.396871 | D | 0.007056 |
| 1 | Electronics | 0.29 | 0.0 | 9 | True | Air | 2141 | Mushroom Pkg (Mycelium) | 1.92 | 1.552 | ... | 3.2 | 0.18 | Yes | 78.0 | 5.37 | 7.96 | 3.72 | 0.809548 | D | 0.0 |
| 2 | Furniture | 12.26 | 38.06 | 6 | False | Road | 1491 | Wood Crate | 16.42 | 28.374 | ... | 1.74 | 0.29 | Yes | 45.51 | 33.0 | 26.71 | 11.5 | 1.482591 | B | 0.19032 |
| 3 | Furniture | 11.56 | 38.27 | 5 | False | Road | 530 | Wood Crate | 16.31 | 10.142 | ... | 1.74 | 0.29 | Yes | 58.46 | 27.73 | 20.36 | 13.44 | 1.634381 | B | 0.19136 |
| 4 | Clothing | 0.25 | 0.08 | 1 | False | Air | 1587 | Kraft Paper Mailer | 0.3 | 0.992 | ... | 1.52 | 1.026 | Yes | 81.47 | 1.08 | 9.94 | 78.9 | 0.130382 | E | 0.000396 |


In [15]:
# ======================================================
# STEP 9: SAVE DATASET
# ======================================================
df.to_csv(OUTPUT_PATH, index=False)

print("\nSTEP 9: DATASET SAVED")
print(f"✓ {OUTPUT_PATH}")

print("\n" + "=" * 70)
print("✓ FEATURE ENGINEERING COMPLETE")
print("=" * 70)


STEP 9: DATASET SAVED
✓ ../data/processed/final_ecopack_dataset_fe.csv

✓ FEATURE ENGINEERING COMPLETE
