# Capstone Project: Sustainability Classification of Fashion Products

## Background and problem statement

Sustainability is becoming increasingly important in the global economy, with consumers indicating that they are [more likely to buy eco-friendly products](https://www.strategy-business.com/article/The-rise-of-the-eco-friendly-consumer). As a result, businesses face growing pressure to show that their products and business practices are sustainable.

Sustainability raters like [B Corp](https://www.bcorporation.net/en-us/certification/) provide sustainability audits and certifications, but these audits are often too expensive for small businesses. A machine learning classification model could instead give businesses (and their consumers) a quick assessment of how sustainable their products are.

The model could also be developed further to help businesses identify which of their practices are the least sustainable, so that they can take steps to improve them.

-----------

For this project, we will focus on fashion businesses. In 2018, the fashion industry was estimated to be responsible for [10% of annual global carbon emissions](https://www.worldbank.org/en/news/feature/2019/09/23/costo-moda-medio-ambiente), more than all international flights and maritime shipping combined. The fashion industry, especially fast fashion retailers, has also been called out many times for [greenwashing](https://www.straitstimes.com/multimedia/graphics/2022/11/fast-fashion-greenwashing-environmental-impact/index.html?shell). This emphasises the need for an independent and quick way to assess product sustainability in the industry.

## Project outline

Given the background and problem statement above, the project aims to do three things:
<br><br>

1. Train a machine learning model on a sample dataset to classify a variety of fashion items into five classes: Extremely non-sustainable (5), non-sustainable (4), medium sustainability (3), sustainable (2), extremely sustainable (1).
<br><br>
Several classification models, such as k-Nearest Neighbors, Random Forest and Support Vector Machine, will be compared.
<br><br>
As the sample dataset contains unbalanced classes, the F1-score will be used to evaluate the model.

2. Once trained, the model will be applied to data scraped from one of Singapore's fashion e-retailers, GRAYE. GRAYE describes itself as having sustainable fashion concepts, and the model will indicate how sustainable GRAYE's products actually are.

3. A Streamlit app will also be deployed so that small businesses and consumers can input product details and get a prediction of how sustainable their products are.
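
The evaluation choice in step 1 can be illustrated with a minimal, dependency-free sketch. It assumes macro averaging (the unweighted mean of the per-class F1-scores), which the outline does not specify, and the labels below are hypothetical rather than taken from the dataset. The point is only to show why accuracy alone can look acceptable when one class dominates.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels on the 1-5 EI scale: class 3 dominates.
y_true = [3, 3, 3, 3, 3, 3, 1, 5]
y_pred = [3, 3, 3, 3, 3, 3, 3, 3]  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro F1={score:.2f}")
```

Here the majority-class predictor gets three quarters of the items right, yet its macro F1 is far lower because classes 1 and 5 are never predicted. With scikit-learn, the same quantity could be computed with `f1_score(y_true, y_pred, average='macro')`.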

## Data selection

[Training and testing dataset](data/clothing_product_full_dataset.csv) from a 2022 research paper, [A Supervised Machine Learning Classification Framework for Clothing Products' Sustainability](https://www.mdpi.com/2071-1050/14/3/1334). <br>
[Independent dataset](data/graye_og.csv) scraped from a local eco-friendly fashion retailer, [GRAYE](https://grayestudio.com).

## Data cleaning

In [1]:
# import necessary libraries

import pandas as pd
import numpy as np
import pickle

In [2]:
# read in train dataset

fashion = pd.read_csv('data/clothing_product_full_dataset.csv')
fashion

| |ID|Type|Cotton|Organic_cotton|Linen|Hemp|Jute|Other_plant|Silk|Wool|...|Chemicals_label|Production_label|Manufacturing_location|Transporation_distance|Use_location|Washing_instruction|Drying_instruction|Reusability_label|Recylability_label|EI|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|1|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2072.0|Netherlands|Machine wash_ cold|Line dry|1|1|1|
|1|2|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2389.0|Germany|Machine wash_ cold|Line dry|1|1|1|
|2|3|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2262.0|Belgium|Machine wash_ cold|Line dry|1|1|1|
|3|4|jacket|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2728.0|France|Machine wash_ cold|Line dry|1|1|1|
|4|5|jacket|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2887.0|Austria|Machine wash_ cold|Line dry|1|1|1|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|1131|1132|sweater|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|23273.0|Africa|Dry clean|Dry clean|0|0|3|
|1132|1133|sweater|0,00|1,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|21273.0|Asia|Dry clean|Dry clean|0|0|3|
|1133|1134|sweater|0,00|1,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|12273.0|USA|Dry clean|Dry clean|0|0|3|
|1134|1135|sweater|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|11373.0|USA|Dry clean|Dry clean|0|0|3|


In [3]:
# check null and dtypes

fashion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1136 entries, 0 to 1135
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      1136 non-null   int64  
 1   Type                    1136 non-null   object 
 2   Cotton                  1136 non-null   object 
 3   Organic_cotton          1136 non-null   object 
 4   Linen                   1136 non-null   object 
 5   Hemp                    1136 non-null   object 
 6   Jute                    1136 non-null   object 
 7   Other_plant             1136 non-null   object 
 8   Silk                    1136 non-null   object 
 9   Wool                    1136 non-null   object 
 10  Leather                 1136 non-null   object 
 11  Camel                   1136 non-null   object 
 12  Cashmere                1136 non-null   object 
 13  Alpaca                  1136 non-null   object 
 14  Feathers                1136 non-null   

In [4]:
# check duplicates

fashion.duplicated().value_counts()

False    1136
dtype: int64
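
One caveat about the check above: `duplicated()` compares entire rows, and at this point the frame still contains `ID`, which appears to be unique per row. No two rows can therefore match, even if all their other values are identical. The sketch below illustrates the effect in plain Python on hypothetical rows; `count_duplicates` is an illustrative helper, not a pandas function.

```python
from collections import Counter


def count_duplicates(rows, ignore=()):
    """Count rows that repeat an earlier row, comparing every key except those in `ignore`."""
    seen = Counter(
        tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        for row in rows
    )
    return sum(n - 1 for n in seen.values())


# Hypothetical rows: the first two differ only by their ID.
rows = [
    {"ID": 1, "Type": "jeans", "Cotton": 0.4},
    {"ID": 2, "Type": "jeans", "Cotton": 0.4},
    {"ID": 3, "Type": "jacket", "Cotton": 0.4},
]

with_id = count_duplicates(rows)                     # ID makes every row unique
without_id = count_duplicates(rows, ignore=("ID",))  # the repeated jeans row is found
print(with_id, without_id)
```

With pandas, a comparable check would pass `subset=` to `DataFrame.duplicated` with every column except `ID`, or simply re-run the check after the `ID` column has been dropped. Whether repeated feature rows should be removed is a separate modelling decision.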

#### Data observations and actions

|Observations|Actions|
|-|-|
|`ID` is just index | Drop column|
|Values for material columns use `,` as the decimal separator instead of `.`| Replace `,` with `.` and convert dtype to float|
|`Elastane` and `Spandex` are different names for the same material| Combine both columns into one|
|Null values for `Manufacturing_location`| Impute with `unknown` value**|


**The original study found that the manufacturing location was an important feature in predicting the sustainability of a garment. It treated the absence of any indication of where the garment was manufactured as a penalising element. Similarly, we will retain the rows with null values and fill them with a constant `unknown` value.
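
The four actions above can be summarised in a small, dependency-free sketch that cleans a single record. It is an illustration only: `clean_record` and the sample record are hypothetical, and the notebook itself applies the same steps column-wise with pandas.

```python
def clean_record(record, material_columns):
    """Apply the four cleaning actions to one row given as a dict."""
    cleaned = dict(record)

    # 1. ID is just an index, so drop it.
    cleaned.pop("ID", None)

    # 2. Replace the decimal comma and convert material shares to float.
    for col in material_columns:
        cleaned[col] = float(str(cleaned[col]).replace(",", "."))

    # 3. Elastane and Spandex name the same material: fold one into the other.
    if "Elastane" in cleaned:
        cleaned["Spandex"] = cleaned.get("Spandex", 0.0) + cleaned.pop("Elastane")

    # 4. Keep rows with a missing location, marked with a constant category.
    if cleaned.get("Manufacturing_location") in (None, ""):
        cleaned["Manufacturing_location"] = "unknown"

    return cleaned


# Hypothetical raw record in the same shape as the dataset.
raw = {
    "ID": 7,
    "Type": "jeans",
    "Cotton": "0,95",
    "Spandex": "0,03",
    "Elastane": "0,02",
    "Manufacturing_location": None,
}
row = clean_record(raw, ["Cotton", "Spandex", "Elastane"])
print(row)
```

Keeping `unknown` as its own category, rather than dropping the rows or guessing a location, lets a model treat a missing origin label as a signal in its own right.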

In [5]:
# drop ID column

fashion.drop(columns = 'ID', inplace=True)
fashion

| |Type|Cotton|Organic_cotton|Linen|Hemp|Jute|Other_plant|Silk|Wool|Leather|...|Chemicals_label|Production_label|Manufacturing_location|Transporation_distance|Use_location|Washing_instruction|Drying_instruction|Reusability_label|Recylability_label|EI|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2072.0|Netherlands|Machine wash_ cold|Line dry|1|1|1|
|1|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2389.0|Germany|Machine wash_ cold|Line dry|1|1|1|
|2|jeans|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2262.0|Belgium|Machine wash_ cold|Line dry|1|1|1|
|3|jacket|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2728.0|France|Machine wash_ cold|Line dry|1|1|1|
|4|jacket|0,40|0,60|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|1|Africa|2887.0|Austria|Machine wash_ cold|Line dry|1|1|1|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|1131|sweater|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|23273.0|Africa|Dry clean|Dry clean|0|0|3|
|1132|sweater|0,00|1,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|21273.0|Asia|Dry clean|Dry clean|0|0|3|
|1133|sweater|0,00|1,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|12273.0|USA|Dry clean|Dry clean|0|0|3|
|1134|sweater|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|0,00|...|0|0|Asia|11373.0|USA|Dry clean|Dry clean|0|0|3|


In [6]:
# list all columns to identify the material columns

fashion.columns

Index(['Type', 'Cotton', 'Organic_cotton', 'Linen', 'Hemp', 'Jute',
       'Other_plant', 'Silk', 'Wool', 'Leather', 'Camel', 'Cashmere', 'Alpaca',
       'Feathers', 'Other_animal', 'Polyester', 'Nylon', 'Acrylic', 'Spandex',
       'Elastane', 'Polyamide', 'Other_synthetic', 'Lyocell', 'Viscose',
       'Acetate', 'Modal', 'Rayon', 'Other_regenerated', 'Other',
       'Recycled_content', 'Reused_content', 'Material_label',
       'Chemicals_label', 'Production_label', 'Manufacturing_location',
       'Transporation_distance', 'Use_location', 'Washing_instruction',
       'Drying_instruction', 'Reusability_label', 'Recylability_label', 'EI'],
      dtype='object')

In [7]:
materials = ['Cotton', 'Organic_cotton', 'Linen', 'Hemp', 'Jute','Other_plant', 
             'Silk', 'Wool', 'Leather', 'Camel', 'Cashmere', 'Alpaca', 'Feathers', 'Other_animal', 
             'Polyester', 'Nylon', 'Acrylic', 'Spandex', 'Elastane', 'Polyamide', 'Other_synthetic', 
             'Lyocell', 'Viscose', 'Acetate', 'Modal', 'Rayon', 'Other_regenerated', 'Other',
             'Recycled_content', 'Reused_content']

# replace the decimal comma with a decimal point (literal match, not a regex)
for m in materials:
    fashion[m] = fashion[m].str.replace(',', '.', regex=False)

fashion

| |Type|Cotton|Organic_cotton|Linen|Hemp|Jute|Other_plant|Silk|Wool|Leather|...|Chemicals_label|Production_label|Manufacturing_location|Transporation_distance|Use_location|Washing_instruction|Drying_instruction|Reusability_label|Recylability_label|EI|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|jeans|0.40|0.60|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|1|Africa|2072.0|Netherlands|Machine wash_ cold|Line dry|1|1|1|
|1|jeans|0.40|0.60|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|1|Africa|2389.0|Germany|Machine wash_ cold|Line dry|1|1|1|
|2|jeans|0.40|0.60|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|1|Africa|2262.0|Belgium|Machine wash_ cold|Line dry|1|1|1|
|3|jacket|0.40|0.60|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|1|Africa|2728.0|France|Machine wash_ cold|Line dry|1|1|1|
|4|jacket|0.40|0.60|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|1|Africa|2887.0|Austria|Machine wash_ cold|Line dry|1|1|1|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|1131|sweater|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|0|Asia|23273.0|Africa|Dry clean|Dry clean|0|0|3|
|1132|sweater|0.00|1.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|0|Asia|21273.0|Asia|Dry clean|Dry clean|0|0|3|
|1133|sweater|0.00|1.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|0|Asia|12273.0|USA|Dry clean|Dry clean|0|0|3|
|1134|sweater|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|0.00|...|0|0|Asia|11373.0|USA|Dry clean|Dry clean|0|0|3|


In [8]:
# change dtype of materials to float

for m in materials:
    fashion[m] = fashion[m].astype(float)

fashion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1136 entries, 0 to 1135
Data columns (total 42 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Type                    1136 non-null   object 
 1   Cotton                  1136 non-null   float64
 2   Organic_cotton          1136 non-null   float64
 3   Linen                   1136 non-null   float64
 4   Hemp                    1136 non-null   float64
 5   Jute                    1136 non-null   float64
 6   Other_plant             1136 non-null   float64
 7   Silk                    1136 non-null   float64
 8   Wool                    1136 non-null   float64
 9   Leather                 1136 non-null   float64
 10  Camel                   1136 non-null   float64
 11  Cashmere                1136 non-null   float64
 12  Alpaca                  1136 non-null   float64
 13  Feathers                1136 non-null   float64
 14  Other_animal            1136 non-null   

In [9]:
# combine elastane and spandex columns

fashion['Spandex'] = fashion['Spandex'] + fashion['Elastane']
fashion.drop(columns='Elastane', inplace=True)

In [10]:
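# impute null Manufacturing_location values with constant 'unknown' value
# (restricted to this column so that numeric columns are never filled with a string)

fashion['Manufacturing_location'] = fashion['Manufacturing_location'].fillna('unknown')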
fashion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1136 entries, 0 to 1135
Data columns (total 41 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Type                    1136 non-null   object 
 1   Cotton                  1136 non-null   float64
 2   Organic_cotton          1136 non-null   float64
 3   Linen                   1136 non-null   float64
 4   Hemp                    1136 non-null   float64
 5   Jute                    1136 non-null   float64
 6   Other_plant             1136 non-null   float64
 7   Silk                    1136 non-null   float64
 8   Wool                    1136 non-null   float64
 9   Leather                 1136 non-null   float64
 10  Camel                   1136 non-null   float64
 11  Cashmere                1136 non-null   float64
 12  Alpaca                  1136 non-null   float64
 13  Feathers                1136 non-null   float64
 14  Other_animal            1136 non-null   

In [11]:
# save as csv

fashion.to_csv('data/fashion.csv', index=False)

## Data Dictionary

|Feature|Description|
|-|-|
|Type|Type of garment (e.g. t-shirt, shirt, trousers, etc.)|
|Cotton|Percentage of cotton content in garment|
|Organic_cotton|Percentage of organic cotton content used in garment|
|Linen|Percentage of linen content used in garment|
|Hemp|Percentage of hemp content used in garment|
|Jute|Percentage of jute content used in garment|
|Other_plant|Percentage of other plant content used in garment|
|Silk|Percentage of silk content used in garment|
|Wool|Percentage of wool content used in garment|
|Leather|Percentage of leather content used in garment|
|Camel|Percentage of camel content used in garment|
|Cashmere|Percentage of cashmere content used in garment|
|Alpaca|Percentage of alpaca content used in garment|
|Feathers|Percentage of feather content used in garment|
|Other_animal|Percentage of other animal content used in garment|
|Polyester|Percentage of polyester content used in garment|
|Nylon|Percentage of nylon content used in garment|
|Acrylic|Percentage of acrylic content used in garment|
|Spandex|Percentage of spandex content used in garment|
|Polyamide|Percentage of polyamide content used in garment|
|Other_synthetic|Percentage of other synthetic content used in garment|
|Lyocell|Percentage of Lyocell content used in garment|
|Viscose|Percentage of viscose content used in garment|
|Acetate|Percentage of acetate content used in garment|
|Modal|Percentage of modal content used in garment|
|Rayon|Percentage of Rayon content used in garment|
|Other_regenerated|Percentage of other regenerated content used in garment|
|Other|Percentage of other content (not specified in other columns) used in garment|
|Recycled_content|Percentage of recycled content used in garment|
|Reused_content|Percentage of reused content used in garment|
|Material_label|Presence of label specifying material content|
|Chemicals_label|Presence of label specifying chemical content|
|Production_label|Presence of label specifying production methods|
|Manufacturing_location|Where garment was manufactured|
|Transporation_distance|Distance between where garment was manufactured and where it was sold to consumer (column name spelled as in the dataset)|
|Use_location|Where consumer bought the garment|
|Washing_instruction|Recommended washing methods|
|Drying_instruction|Recommended drying methods|
|Reusability_label|Presence of label recommending how to reuse the garment|
|Recylability_label|Presence of label recommending how to recycle the garment|
|EI|Sustainability classification on a scale from 1 to 5, with 1 being extremely sustainable and 5 being extremely non-sustainable|
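
Because a lower `EI` value means a more sustainable garment, the scale is easy to read backwards. As a small sketch, the mapping below turns the numeric class into the wording used in the project outline; `EI_LABELS` and `describe_ei` are hypothetical names introduced here for illustration.

```python
# Class names as listed in the project outline (1 = best, 5 = worst).
EI_LABELS = {
    1: "extremely sustainable",
    2: "sustainable",
    3: "medium sustainability",
    4: "non-sustainable",
    5: "extremely non-sustainable",
}


def describe_ei(score):
    """Return the class name for an EI value, rejecting anything outside 1-5."""
    if score not in EI_LABELS:
        raise ValueError(f"EI must be an integer from 1 to 5, got {score!r}")
    return EI_LABELS[score]


print(describe_ei(1))
print(describe_ei(5))
```

A lookup like this could be reused wherever predictions are shown to a reader, so that a predicted `1` is not mistaken for the lowest rating.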