In [21]:
# Imports
import sys
sys.executable
import numpy as np
import requests # for downloading webpages
from bs4 import BeautifulSoup  # for parsing HTML
import pandas as pd # for storing and handling datasets
import time # for adding delays between requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


Business and Customer Use Cases for the Sustainability Scoring System:

The sustainability scoring system provides a unified and transparent method for evaluating the environmental and ethical performance of apparel products. By integrating Life Cycle Assessment (LCA) based data, origin impact factors, care-phase impacts, and certification bonuses, the system produces a clear Sustainability Score (0–100) for each product. This score benefits both business stakeholders and end customers by addressing key challenges related to transparency, decision-making, and regulatory compliance.
In addition, this project investigates whether the sustainability characteristics of fashion products—measured through materials, certifications, origin, and care impact—are correlated with retail price. The aim is to explore whether more sustainable materials consistently lead to higher prices, or whether identifiable pricing patterns emerge across different brands and product categories.

# Sustainability Scoring Pipeline — Full Workflow

## Phase 0 — Raw Data
Raw Datasets (Product Data + Reference Tables)
    │
    ▼

## Phase 1 — Statistical Scoring Model
[1] Cleaning & Standardization (Product Data Only)
    • Fix material names
    • Standardize certifications
    • Clean origins, categories, percentages
    • Remove duplicates and formatting noise
    │
    ▼

[2] Merge Reference Tables
    • Merge Material LCA table
    • Merge Origin impact table
    • Merge Care Instruction impact table
    • Merge Certification bonus table
    │
    ▼

[3] Normalize Values (0–1 Scale)
    • Normalize Material LCA indicators
    • Normalize Care Instruction impact
    • Normalize Origin impact indicators
    • Certification stays as bonus (already scaled)
    │
    ├──────────┬─────────────┬──────────────┬──────────────┐
    ▼          ▼              ▼              ▼              ▼
 Material   Care          Origin        Certification     Other
  Score     Score          Score            Score         Features
    │          │              │              │
    └──────────┴──────────────┴──────────────┴──────────────┘
                    ▼

[8] Final Sustainability Score (0–100)
    • Weighted statistical scoring model:
      Final_Score = f(Material, Origin, Care, Certification)
    │
    ▼

## Phase 2 — Machine Learning Model: Price Prediction from the Features
[9] Preparing the Data for Machine Learning
    • X = normalized features (Material_Score, Origin_Score, Care_Score, Certification bonus, etc.)
    • y = Final price
    │
    ▼

[10] Machine Learning Model Training
    • Regression → Predict price given the sustainability score
    • Clustering → Group similar products
    │
    ▼

SCORING MODEL READY ✔

## Phase 3 — Web Scoring System and/or Price Prediction (Optional)
User provides a product URL or types in the product information
    │
    ▼

[12] Web API Extracts Product Info
    • Scrapes product page (materials, origin, care, certifications)
    • Cleans and standardizes extracted data
    │
    ▼

[13] Scoring Engine (Statistical + ML)


In [22]:
# Load the product dataset
df_original = pd.read_csv("FashionProductsDataset_Original.csv")
df_original.head()

Unnamed: 0,Id,Product_Name,Price,Material,Percentage_Material,Certification1,Certification2,Shop_Name,Category,Subcategory,Origin,Care_Instruction
0,1,Jacquard-knit merino wool jumper,€ 79.99,Wool,100.0,RWS,,H&M,Woman,Jumper,China,Hand wash
1,2,Oversize Jumper,€ 24.99,Polyester,50.0,,,H&M,Woman,Jumper,China,Machine wash 30°C
2,2,Oversize Jumper,€ 24.99,Polyamide,29.0,,,H&M,Woman,Jumper,China,Machine wash 30°C
3,2,Oversize Jumper,€ 24.99,Acrylic,13.0,,,H&M,Woman,Jumper,China,Machine wash 30°C
4,2,Oversize Jumper,€ 24.99,Wool,5.0,,,H&M,Woman,Jumper,China,Machine wash 30°C



PRODUCT DATASET COLUMN DESCRIPTIONS:

Id: Unique product identifier. Multi-material products repeat the same Id across multiple rows (one row per material component).

Product_Name: Descriptive fashion product name 

Price: Retail selling price in Euros (€)

Material: Fibre used in the product. Blended products list one fibre per row.

Percentage_Material: Percentage of the product made from this material.

Certification1: Primary sustainability certification linked to the material; NaN if there is none.

Certification2: Secondary certification; NaN if there is none.

Shop_Name: Brand associated with the product (H&M, Zara, Penneys, Patagonia).

Category: Gender classification of the product ("Woman" or "Man").

Subcategory: Product type classification. One of: "Tshirt", "Jumper", "Sweater", "Jacket".

Origin: Country of manufacture.

Care_Instruction: Washing/maintenance guidance based on material.
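
Two consequences of this layout are worth checking early: `Price` is stored as text with a currency symbol, and the percentages of a multi-material product should add up to 100 across its rows. The sketch below illustrates both checks on a small stand-in frame built from the rows shown by `head()`. The column name `Price_EUR` and the 0.5-point tolerance are assumptions made for the example.

```python
import pandas as pd

# Stand-in for df_original, copied from the first rows shown above
sample = pd.DataFrame({
    "Id": [1, 2, 2, 2, 2],
    "Price": ["€ 79.99", "€ 24.99", "€ 24.99", "€ 24.99", "€ 24.99"],
    "Material": ["Wool", "Polyester", "Polyamide", "Acrylic", "Wool"],
    "Percentage_Material": [100.0, 50.0, 29.0, 13.0, 5.0],
})

# Strip the currency symbol and convert to float (hypothetical column name)
sample["Price_EUR"] = (
    sample["Price"].str.replace("€", "", regex=False).str.strip().astype(float)
)

# Sum the material percentages per product; a complete blend should reach 100
pct_total = sample.groupby("Id")["Percentage_Material"].sum()
incomplete = pct_total[pct_total.sub(100).abs() > 0.5]

print(sample[["Id", "Price_EUR"]].drop_duplicates())
print(incomplete)
```

In this stand-in, product 2 sums to 97 only because `head()` shows its first four rows; on the full dataset the same check would flag products whose blend is genuinely incomplete.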


In [23]:
# size of product dataset
df_original.shape

(1488, 12)

In [24]:
# product dataset info
# NaN in the Certification columns means the product has no certificate, not that data is missing
df_original.info()

In [26]:
# print unique values of product Dataset
# Unique values in Certification1
unique_cert1 = df_original["Certification1"].unique()
print("Unique Certification1 values:")
print(unique_cert1)

# Unique values in Certification2
unique_cert2 = df_original["Certification2"].unique()
print("\nUnique Certification2 values:")
print(unique_cert2)

# Unique materials
unique_materials = df_original["Material"].unique()
print("\nUnique Material values:")
print(unique_materials)

# Unique Origin
unique_origin = df_original["Origin"].unique()
print("\nUnique Origin values:")
print(unique_origin)

# Unique Subcategory
unique_sub = df_original["Subcategory"].unique()
print("\nUnique Subcategory values:")
print(unique_sub)

# Unique Shop_Name
unique_name = df_original["Shop_Name"].unique()
print("\nUnique Shop_Name values:")
print(unique_name)

# Unique Care_Instruction
unique_inst = df_original["Care_Instruction"].unique()
print("\nUnique Instruction values:")
print(unique_inst)

Unique Certification1 values:
[' RWS' nan 'RWS' 'Fair Trade ' 'Bluesign' 'Oeko-Tex 100' 'Fair Trade'
 'GRS' 'BCI Cotton']

Unique Certification2 values:
[nan ' RCS' 'Fair Trade ' 'Bluesign']

Unique Material values:
['Wool' 'Polyester' 'Polyamide' 'Acrylic' 'Elastane' 'wool'
 'Recycled Polyester' 'Recycled Acrylic' 'Cashmere' 'Recycled Wool'
 'Recycled Nylon' 'Lyocell' 'Cotton' 'Recycled Cashmere' 'Viscose'
 'Organic Cotton' 'Merino Wool' 'Recycled Cotton' 'Nylon']

Unique Origin values:
['China' 'Cambodia' 'Vietnam' 'Thailand' 'Turkey' 'Portugal' 'Bangladesh'
 'India' 'Italy']

Unique Subcategory values:
['Jumper' 'sweater' 'Tshirt' 'Jacket' 'Sweater']

Unique Shop_Name values:
['H&M' 'Zara' 'Patagonia' 'Penneys']

Unique Instruction values:
['Hand wash' 'Machine wash 30°C']


In [None]:
# Clean unique values
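
The unique values printed above contain stray whitespace (`' RWS'`, `'Fair Trade '`) and inconsistent casing (`'wool'`, `'sweater'`). A minimal sketch of one way to clean them is shown below, run on a small stand-in frame rather than `df_original`. Title-casing the `Material` and `Subcategory` columns is an assumption that happens to fit the values listed above; the helper name `clean_text` is hypothetical.

```python
import pandas as pd

# Stand-in rows reproducing the inconsistencies visible in the output above
sample = pd.DataFrame({
    "Material": ["Wool", "wool", "Recycled Polyester"],
    "Certification1": [" RWS", "Fair Trade ", None],
    "Certification2": [None, " RCS", None],
    "Subcategory": ["Jumper", "sweater", "Sweater"],
})

def clean_text(column: pd.Series) -> pd.Series:
    """Strip surrounding whitespace; missing values stay missing."""
    return column.str.strip()

# Certifications: only strip, so names such as 'Oeko-Tex 100' keep their casing
for col in ["Certification1", "Certification2"]:
    sample[col] = clean_text(sample[col])

# Materials and subcategories: strip, then title-case so 'wool' becomes 'Wool'
for col in ["Material", "Subcategory"]:
    sample[col] = clean_text(sample[col]).str.title()

print(sample["Material"].unique())
print(sample["Certification1"].unique())
print(sample["Subcategory"].unique())
```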

In [None]:
# Material reference table (based on LCA and Higg MSI data)
df_Material = pd.read_csv("Material_Reference.csv")
df_Material

Unnamed: 0,Material,Category,Carbon_kgCO2e,Water_L,FossilEnergy_MJ,ChemicalImpact_Score,Notes
0,Cotton,Natural,6.0,2700,55,40,High water & pesticide use
1,Organic Cotton,Natural,3.2,1800,40,20,No synthetic pesticides
2,Recycled Cotton,Recycled,2.0,500,20,10,Lower energy/water use
3,Polyester,Synthetic,9.5,50,120,35,Fossil-fuel based
4,Recycled Polyester,Recycled,5.5,20,60,20,Uses post-consumer PET
5,Acrylic,Synthetic,6.0,30,110,40,"Fossil-based, similar to nylon impact"
6,Recycled Acrylic,Recycled,3.5,15,55,20,Reduced energy & waste
7,Wool,Natural,14.0,800,160,45,Methane emissions from sheep
8,Merino Wool,Natural,16.0,900,170,48,Higher land/methane impact
9,Cashmere,Natural,30.0,700,200,60,Very high impact (land & methane)


Material Impact Reference Table:
Carbon footprint: kg CO₂e / kg material
Water consumption: litres / kg
Fossil fuel energy: MJ / kg
Chemical impact: relative score (Higg MSI uses weighted scoring)
These indicators use different units and scales, so they must be normalized before they enter the scoring system.
NOTE: These are representative values based on Higg MSI published numbers and LCA averages. They are suitable for early modelling.
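
One common way to put the four indicators on a shared 0–1 scale is min-max normalization. The sketch below applies it to a few rows copied from the table above. Averaging the four normalized indicators with equal weights, and the column name `Material_Impact`, are assumptions made for illustration, not the project's final weighting.

```python
import pandas as pd

# A few rows copied from the Material Reference table above
materials = pd.DataFrame({
    "Material": ["Cotton", "Organic Cotton", "Recycled Cotton", "Polyester", "Cashmere"],
    "Carbon_kgCO2e": [6.0, 3.2, 2.0, 9.5, 30.0],
    "Water_L": [2700, 1800, 500, 50, 700],
    "FossilEnergy_MJ": [55, 40, 20, 120, 200],
    "ChemicalImpact_Score": [40, 20, 10, 35, 60],
})
indicators = ["Carbon_kgCO2e", "Water_L", "FossilEnergy_MJ", "ChemicalImpact_Score"]

# Min-max scaling: 0 = lowest impact among these rows, 1 = highest
values = materials[indicators].astype(float)
normalized = (values - values.min()) / (values.max() - values.min())

# Assumption: equal weights across the four indicators
materials["Material_Impact"] = normalized.mean(axis=1)
print(materials[["Material", "Material_Impact"]].round(3))
```

Because min-max scaling depends on the rows present, the values computed on this subset will differ from those computed on the full reference table.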

In [23]:
# Certification reference table
df_cert = pd.read_csv("Certification,_Reference.csv")
df_cert

Unnamed: 0,Certification,Category,Score_Bonus,Description
0,GOTS,Environmental+Social,0.25,"Organic fibre, chemical safety, full supply ch..."
1,GRS,Recycled+Traceability,0.2,"Verifies recycled content, chemical restrictio..."
2,RWS,Animal Welfare,0.15,"Responsible Wool Standard, land and animal man..."
3,RDS,Animal Welfare,0.15,Responsible Down Standard
4,Fair Trade,Social,0.2,Improved labour conditions and community devel...
5,Oeko-Tex 100,Chemical Safety,0.1,Harmful substance testing only
6,Bluesign,Chemical Safety+Process,0.2,"Controlled chemical inputs, cleaner production"
7,BCI Cotton,Environmental,0.05,"Basic improvement programme, low traceability"
8,,,0.0,No sustainability certification


Certification score bonuses reflect the relative strength, scope, and verification rigor of the sustainability standards applied to a product. Because certifications do not provide direct LCA measurements, they are incorporated as bonus values on a 0–1 scale that represent their ability to reduce environmental and social risk. Stronger certifications such as GOTS, Bluesign, GRS, and Fair Trade receive higher bonuses (0.20 to 0.25) because of their robust criteria and auditing systems, whereas lighter standards like Oeko-Tex 100 or BCI Cotton receive smaller bonuses (0.10 and 0.05). Products without certifications receive no bonus. This bonus approach follows Higg-based scoring systems.
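
Attaching a bonus to each product row amounts to a lookup from certification name to `Score_Bonus`. The sketch below is one possible way to do it, using the bonus values from the table above and stand-in product rows. The `_Bonus` column names are hypothetical, and the certification names are assumed to be already stripped of stray whitespace.

```python
import pandas as pd

# Bonus values copied from the Certification Reference table above
bonus_by_cert = {
    "GOTS": 0.25, "GRS": 0.20, "RWS": 0.15, "RDS": 0.15, "Fair Trade": 0.20,
    "Oeko-Tex 100": 0.10, "Bluesign": 0.20, "BCI Cotton": 0.05,
}

# Stand-in product rows (certification names already cleaned)
products = pd.DataFrame({
    "Id": [1, 2, 3],
    "Certification1": ["RWS", None, "GRS"],
    "Certification2": [None, None, "Bluesign"],
})

# Missing or unknown certifications map to NaN, which becomes a bonus of 0.0
for col in ["Certification1", "Certification2"]:
    products[col + "_Bonus"] = products[col].map(bonus_by_cert).fillna(0.0)

print(products)
```

Under this sketch, a certification that appears in the product data but not in the reference table (for example `RCS` in `Certification2`) would also receive 0.0, so such names are worth listing before scoring.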

In [None]:
# Origin reference table
df_Ori = pd.read_csv("Origin_Reference.csv")
df_Ori

Unnamed: 0,Origin,Energy_Grid_Intensity,Transport_Impact_Score,Manufacturing_Impact_Score,Notes
0,China,0.65,0.4,0.6,Coal-heavy grid; large-scale manufacturing; lo...
1,Bangladesh,0.55,0.45,0.7,Developing grid; high garment production volum...
2,India,0.5,0.4,0.65,Coal + renewables; textile-intensive industry;...
3,Vietnam,0.45,0.35,0.55,Balanced grid; strong apparel export sector; l...
4,Turkey,0.4,0.2,0.45,Closer to EU market; medium-impact grid; stron...
5,Portugal,0.25,0.1,0.3,Low-carbon EU grid; short transport; higher ma...
6,Italy,0.3,0.1,0.35,High-quality manufacturing; short transport di...
7,Cambodia,0.55,0.45,0.75,Developing grid; high reliance on imported ele...
8,Thailand,0.45,0.35,0.55,Mid-impact energy grid; established apparel in...


The Origin Impact Table models the environmental burden of producing a garment in different countries using three normalized indicators: energy grid intensity, transport impact, and manufacturing impact. These values are not percentages but relative impact indices (0–1 range) used to compare countries, where a higher value means a higher burden. This method follows LCA logic and Higg MSI principles when detailed country-specific emissions data is unavailable. The indices are later combined to calculate an Origin Sustainability Score for each product.

The numbers are impact indices of the kind used in the Higg Facility Environmental Module (FEM), academic LCA gap-filling, and EU PEF models when detailed data is missing, and more generally in sustainability scoring when manufacturing detail is unknown.

| **Index Value** | **Impact Level**     | **Meaning / Interpretation**                                                        |
| --------------- | -------------------- | ----------------------------------------------------------------------------------- |
| **0.00 – 0.10** | **Very Low Impact**  | Clean energy grid, very efficient manufacturing, short transport distance.          |
| **0.11 – 0.25** | **Low Impact**       | Mostly renewable energy, moderate manufacturing efficiency, short/medium transport. |
| **0.26 – 0.45** | **Moderate Impact**  | Mixed energy grid, average factory efficiency, long-distance transport.             |
| **0.46 – 0.60** | **High Impact**      | Fossil-fuel heavy grid, lower manufacturing efficiency, long ocean freight routes.  |
| **0.61 – 1.00** | **Very High Impact** | Coal-dominated grids, weak environmental controls, long-distance logistics.         |
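
The sketch below shows one way the three indices could be combined and then labelled with the bands from the table above. The equal weighting, the inversion `1 - impact` to obtain a score where higher is better, and the column names `Origin_Impact`, `Impact_Level`, and `Origin_Score` are all assumptions for illustration.

```python
import pandas as pd

# Rows copied from the Origin Reference table above
origin = pd.DataFrame({
    "Origin": ["China", "Turkey", "Portugal"],
    "Energy_Grid_Intensity": [0.65, 0.40, 0.25],
    "Transport_Impact_Score": [0.40, 0.20, 0.10],
    "Manufacturing_Impact_Score": [0.60, 0.45, 0.30],
})
index_cols = [
    "Energy_Grid_Intensity",
    "Transport_Impact_Score",
    "Manufacturing_Impact_Score",
]

# Assumption: equal weights for the three indices
origin["Origin_Impact"] = origin[index_cols].mean(axis=1)

# Label the combined index using the bands from the interpretation table
bands = [0.0, 0.10, 0.25, 0.45, 0.60, 1.0]
labels = ["Very Low", "Low", "Moderate", "High", "Very High"]
origin["Impact_Level"] = pd.cut(
    origin["Origin_Impact"], bins=bands, labels=labels, include_lowest=True
)

# Assumption: higher score = lower burden
origin["Origin_Score"] = 1 - origin["Origin_Impact"]
print(origin[["Origin", "Origin_Impact", "Impact_Level", "Origin_Score"]].round(3))
```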


In [12]:
# Care_Instruction reference table
df_care = pd.read_csv("Care_Instruction_Reference.csv")
df_care

Unnamed: 0,Care_Instruction,Energy_Use_MJ,Water_Use_L,CO2_kg,Notes
0,Machine wash 30°C,0.5,15.0,0.04,Low temperature reduces energy demand by ~40%;...
1,Machine wash 40°C,0.8,15.0,0.07,Medium energy demand; standard wash cycle; mod...
2,Machine wash cold,0.3,15.0,0.02,Lowest energy use; often recommended for delic...
3,Hand wash,0.2,8.0,0.015,Lower mechanical energy but higher labor; used...
4,Machine wash 30°C + tumble dry,2.5,15.0,0.2,Tumble drying significantly increases total en...
5,Machine wash 30°C + line dry,0.5,15.0,0.04,Line drying avoids high energy use; lowest-imp...
6,Dry clean,3.5,0.5,0.25,High chemical and energy impact; typically for...
7,Do not wash (spot clean),0.1,1.0,0.01,Minimal consumer environmental impact; used fo...


The Care Instruction Reference Table contains raw quantitative environmental impact values describing the consumer-use phase of a garment’s life cycle. These values come from LCA literature and represent the estimated resource consumption associated with typical washing and drying practices.

Care-phase impacts depend on consumer behavior, home energy mix, and the number of washes, data that is typically unavailable. For consistency with the Origin and Certification components, the care-phase impact is represented using a normalized Care Score (0–1), where higher values indicate lower environmental burden. This approach is widely used in multi-criteria sustainability assessments when detailed consumption-phase LCA data cannot be obtained.
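
A Care Score of this kind can be sketched by min-max scaling the three raw indicators and inverting their average. The rows below are copied from the table above. The equal weighting across energy, water, and CO₂ is an assumption, and the resulting values depend on which rows are included in the scaling.

```python
import pandas as pd

# Rows copied from the Care Instruction Reference table above
care = pd.DataFrame({
    "Care_Instruction": [
        "Machine wash 30°C", "Hand wash", "Dry clean", "Do not wash (spot clean)",
    ],
    "Energy_Use_MJ": [0.5, 0.2, 3.5, 0.1],
    "Water_Use_L": [15.0, 8.0, 0.5, 1.0],
    "CO2_kg": [0.04, 0.015, 0.25, 0.01],
})
impact_cols = ["Energy_Use_MJ", "Water_Use_L", "CO2_kg"]

# Min-max scale each indicator to 0-1 (0 = lowest impact among these rows)
raw = care[impact_cols]
scaled = (raw - raw.min()) / (raw.max() - raw.min())

# Assumption: equal weights; invert so that higher = lower environmental burden
care["Care_Score"] = 1 - scaled.mean(axis=1)
print(care[["Care_Instruction", "Care_Score"]].round(3))
```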

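Material weighting, where $p_i$ is the percentage of material $i$ in the product:

$$\text{Weighted Impact} = \sum_{i} \frac{p_i}{100} \times \text{Impact Score}_i$$

Certification bonus, where $b_1$ and $b_2$ are the bonuses of Certification1 and Certification2 (0 when a certification is absent):

$$\text{Adjusted Score} = \text{Material Score} \times (1 - b_1) \times (1 - b_2)$$

Final score:

$$\text{Final Score} = 100 - \text{Normalized Impact}$$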

| Step                | Math Type                  | Explanation                |
| ------------------- | -------------------------- | -------------------------- |
| Material weighting  | Linear algebra             | Weighted sum of impacts    |
| Certification bonus | Proportion / discount math | Multiplicative adjustments |
| Normalization       | Data scaling               | Convert to 0–100 range     |
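
The three steps above can be traced with a small worked example. The blend, impact scores, and bonuses below are hypothetical numbers chosen for easy arithmetic. The function names are illustrative, and the impact values are assumed to be already on a 0–100 scale so that the final subtraction is meaningful.

```python
def weighted_material_impact(blend):
    """blend: list of (percentage, impact_score) pairs for one product."""
    return sum(pct / 100 * impact for pct, impact in blend)

def apply_certification_bonus(score, b1=0.0, b2=0.0):
    """Multiplicative discount for up to two certification bonuses."""
    return score * (1 - b1) * (1 - b2)

def final_score(normalized_impact):
    """normalized_impact is assumed to lie on a 0-100 scale."""
    return 100 - normalized_impact

# Hypothetical blend: 60% of a material with impact 50, 40% with impact 20
blend = [(60, 50.0), (40, 20.0)]
impact = weighted_material_impact(blend)                        # 0.6*50 + 0.4*20
adjusted = apply_certification_bonus(impact, b1=0.20, b2=0.10)  # 38 * 0.8 * 0.9
score = final_score(adjusted)

print(round(impact, 2), round(adjusted, 2), round(score, 2))
# → 38.0 27.36 72.64
```

Because the bonuses multiply, two certifications of 0.20 and 0.10 reduce the impact by 28% rather than 30%, which keeps the adjusted value from ever dropping below zero.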
