unesco/sdg-classifier

SDG Classifier

Classify text into UN Sustainable Development Goals using BERT models

UNESCO Data & AI Python License: MIT Hugging Face

Overview

A text classification tool that maps input text to the 17 UN Sustainable Development Goals using pre-trained BERT models from Hugging Face. Supports English and French, handles long texts by chunking, and outputs simplified JSON results with detected SDG IDs.

Features

  • Classifies text with SDG labels, taking into account each row's language and optional context question
  • Supports English (sadickam/sdgBERT) and French (ilovebots/bert-sdg-french) models
  • Handles long texts by chunking into manageable pieces with averaged probabilities
  • Configurable probability threshold for SDG detection
  • Debug mode for fast iteration on small subsets
  • Outputs simplified results containing only id and detected sdg array
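
The chunking, averaging, and threshold steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the repository's implementation: the function names, the 512-token chunk size, the 0.5 threshold, and the assumption that class index i maps to SDG i + 1 are all hypothetical, and per-chunk probabilities are passed in as plain lists instead of being produced by a model.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def average_probabilities(chunk_probs: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-class probabilities of every chunk."""
    if not chunk_probs:
        return []
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]


def detect_sdgs(avg_probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return SDG IDs whose averaged probability meets the threshold.

    Assumption: class index i corresponds to SDG i + 1.
    """
    return [i + 1 for i, p in enumerate(avg_probs) if p >= threshold]
```

Averaging across chunks keeps a long document from being judged on a single passage. A lower threshold trades precision for recall.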

Quick Start

# 1. Clone
git clone https://github.com/unesco/sdg-classifier.git
cd sdg-classifier

# 2. Install
poetry install

# 3. Configure
cp .env.example .env
# Edit .env and add your Hugging Face token

# 4. Run
poetry run python -m sdg_classifier.main

Installation

Prerequisites

  • Python 3.11+
  • Poetry for dependency management
  • A Hugging Face account with an API token
  • macOS 12.3+ with MPS support (the default device), or a CUDA-capable GPU or CPU after the one-line change described under Troubleshooting

Steps

  1. Clone the repository:

    git clone https://github.com/unesco/sdg-classifier.git
    cd sdg-classifier
  2. Install dependencies:

    poetry install
  3. Create your environment file:

    cp .env.example .env
  4. Edit .env and add your Hugging Face token:

    HF_TOKEN=your_huggingface_token_here

Usage

Basic Classification

poetry run python -m sdg_classifier.main

The classifier reads from the input CSV specified in config.yaml, processes each row through the appropriate language model, and writes results to the output CSV.
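
As a rough sketch of the per-row routing described above, the loop below sends each row to the classifier registered for its language code. The `classify_rows` name, the `Classifier` type, and the choice to return an empty list for unsupported languages are assumptions made for illustration. Real classifiers would wrap the configured Hugging Face models.

```python
from typing import Callable, Dict, Iterable, List

# A classifier takes text and returns detected SDG IDs (hypothetical interface).
Classifier = Callable[[str], List[int]]


def classify_rows(rows: Iterable[Dict[str, str]],
                  classifiers: Dict[str, Classifier]) -> List[Dict[str, object]]:
    """Route each row to the classifier registered for its language code."""
    results: List[Dict[str, object]] = []
    for row in rows:
        classifier = classifiers.get(row["language"])
        if classifier is None:
            # Assumption: rows in an unsupported language get an empty result.
            sdgs: List[int] = []
        else:
            sdgs = classifier(row["value"])
        results.append({"id": row["id"], "sdg": sdgs})
    return results
```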

Debug Mode

Set debug: true in config.yaml to process only 100 random rows for quick testing:

debug: true
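
One way debug sampling like this could work is shown below. The `select_rows` helper and its `seed` parameter are hypothetical. The 100-row sample size comes from the description above.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_rows(rows: Sequence[T], debug: bool, sample_size: int = 100,
                seed: Optional[int] = None) -> List[T]:
    """Return all rows, or a random sample of them when debug is enabled."""
    if not debug:
        return list(rows)
    rng = random.Random(seed)  # a fixed seed makes debug runs repeatable
    # Never ask for more rows than exist, or random.sample raises ValueError.
    return rng.sample(list(rows), min(sample_size, len(rows)))
```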

Configuration

Environment Variables (.env)

| Variable | Required | Description |
|----------|----------|-------------|
| HF_TOKEN | Yes | Hugging Face API token for model downloads |
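
A minimal sketch of failing early when the token is missing. It assumes HF_TOKEN is already in the process environment (for example, exported by the shell or loaded from .env by a dotenv-style loader). The `get_hf_token` helper is hypothetical and not part of the repository.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token, or raise a descriptive error if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN not found: create a .env file with your Hugging Face token"
        )
    return token
```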

Config File (config.yaml)

data:
  input_csv: "data/data_sdgs.csv"
  output_csv: "data/data_sdgs_detected.csv"
models:
  en: "sadickam/sdgBERT"
  fr: "ilovebots/bert-sdg-french"
debug: false

| Key | Description |
|-----|-------------|
| data.input_csv | Path to input CSV file |
| data.output_csv | Path to output CSV file |
| models.en | Hugging Face model ID for English |
| models.fr | Hugging Face model ID for French |
| debug | Process only 100 rows when true |
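
Once config.yaml has been parsed into a dictionary, a check like the one below could report missing keys before any model is loaded. This is only a sketch: `missing_config_keys` is a hypothetical helper, and treating `debug` as optional is an assumption.

```python
from typing import Any, Dict, List

# Dotted paths of the keys documented above; debug is assumed to be optional.
REQUIRED_KEYS = ["data.input_csv", "data.output_csv", "models.en", "models.fr"]


def missing_config_keys(config: Dict[str, Any]) -> List[str]:
    """Return the dotted paths of required keys absent from the parsed config."""
    missing: List[str] = []
    for dotted in REQUIRED_KEYS:
        node: Any = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing
```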

Input / Output Format

Input CSV

| Column | Type | Description |
|--------|------|-------------|
| id | string/int | Unique identifier for each row |
| language | string | Language code (en, fr) |
| value | string | Text to classify |
| question | string (optional) | Contextual question prepended to value |
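
To illustrate how the optional question column might be combined with value, the sketch below reads a small in-memory CSV and builds the text passed to the model. The `build_model_input` helper and the single-space separator are assumptions, and the sample rows are invented for illustration.

```python
import csv
import io
from typing import Dict


def build_model_input(row: Dict[str, str]) -> str:
    """Prepend the optional question to the value (space-separated, by assumption)."""
    question = (row.get("question") or "").strip()
    value = (row.get("value") or "").strip()
    return f"{question} {value}" if question else value


# Invented sample data using the documented column names.
SAMPLE = (
    "id,language,value,question\n"
    "1,en,Clean water for all,What is your priority?\n"
    "2,fr,Éducation de qualité,\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inputs = [build_model_input(row) for row in rows]
```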

Output CSV

| Column | Type | Description |
|--------|------|-------------|
| id | string/int | Unique identifier |
| sdg | list | Array of detected SDG IDs (1-17) |
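
The exact serialization of the sdg column is not specified here. One plausible approach, shown as an assumption, is to store the list as a JSON array inside the CSV cell so it can be parsed back unambiguously. The `write_results` helper is hypothetical.

```python
import csv
import json
from typing import Dict, Iterable, TextIO


def write_results(results: Iterable[Dict[str, object]], handle: TextIO) -> None:
    """Write id/sdg rows, encoding each SDG list as a JSON array (assumed format)."""
    writer = csv.DictWriter(handle, fieldnames=["id", "sdg"])
    writer.writeheader()
    for result in results:
        writer.writerow({"id": result["id"], "sdg": json.dumps(result["sdg"])})
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends.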

Project Structure

sdg-classifier/
├── README.md
├── LICENSE
├── .env.example
├── .gitignore
├── config.yaml
├── pyproject.toml
├── poetry.lock
├── data/
│   └── .gitkeep
└── sdg_classifier/
    ├── __init__.py
    └── main.py

Troubleshooting

Device Compatibility

The code currently uses Apple's MPS (Metal Performance Shaders) backend for GPU acceleration. If you are running on a non-Apple system:

  • CUDA GPU: Change .to('mps') to .to('cuda') in sdg_classifier/main.py
  • CPU only: Change .to('mps') to .to('cpu') in sdg_classifier/main.py
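
Instead of editing the device string by hand, the device could be chosen at runtime. The sketch below is a suggestion, not the repository's current behavior: `pick_device` and `detect_device` are hypothetical helpers, and the preference order (CUDA, then MPS, then CPU) is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without PyTorch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
```

The result can then replace the hard-coded argument, for example `.to(detect_device())`.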

Common Issues

| Issue | Solution |
|-------|----------|
| HF_TOKEN not found | Create .env file with your Hugging Face token |
| config.yaml not found | Run from the project root directory |
| MPS not available | Switch to CPU or CUDA (see above) |
| Out of memory | Reduce batch sizes or use CPU fallback |
| Model download fails | Verify HF_TOKEN has read access to gated models |

Contributing

Contributions are welcome. Please read the contributing guidelines before submitting a pull request.

License

This project is licensed under the MIT License.

Credits
