# Classify text into UN Sustainable Development Goals using BERT models
- Overview
- Features
- Quick Start
- Installation
- Usage
- Configuration
- Input / Output Format
- Project Structure
- Troubleshooting
- Contributing
- License
- Credits
## Overview

A text classification tool that maps input text to the 17 UN Sustainable Development Goals using pre-trained BERT models from Hugging Face. It supports English and French, handles long texts by chunking, and outputs simplified JSON results with the detected SDG IDs.
## Features

- Classify text data with SDG labels based on context and language
- Supports English (`sadickam/sdgBERT`) and French (`ilovebots/bert-sdg-french`) models
- Handles long texts by chunking them into manageable pieces and averaging the chunk probabilities
- Configurable probability threshold for SDG detection
- Debug mode for fast iteration on small subsets
- Outputs simplified results containing only the `id` and detected `sdg` array
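The chunk-and-average detection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `classify_chunk` is a hypothetical stand-in for a real BERT forward pass, and the 512-token window and 0.5 threshold are assumed defaults.

```python
from typing import Callable

def detect_sdgs(
    tokens: list[str],
    classify_chunk: Callable[[list[str]], list[float]],
    chunk_size: int = 512,   # assumed window size; BERT's real limit counts subword tokens
    threshold: float = 0.5,  # assumed probability cutoff
) -> list[int]:
    """Split tokens into chunks, average per-chunk SDG probabilities, threshold."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)] or [[]]
    per_chunk = [classify_chunk(chunk) for chunk in chunks]
    # Average the 17-way probability vectors across all chunks.
    avg = [sum(probs[k] for probs in per_chunk) / len(per_chunk) for k in range(17)]
    # SDG IDs are 1-based, so shift the index by one.
    return [k + 1 for k, p in enumerate(avg) if p >= threshold]
```

With a classifier that returns a high probability only for SDG 4, a 1000-token input is split into two chunks and `detect_sdgs` returns `[4]`.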
## Quick Start

```bash
# 1. Clone
git clone https://github.com/unesco/sdg-classifier.git
cd sdg-classifier

# 2. Install
poetry install

# 3. Configure
cp .env.example .env
# Edit .env and add your Hugging Face token

# 4. Run
poetry run python -m sdg_classifier.main
```

## Installation

### Prerequisites

- Python 3.11+
- Poetry for dependency management
- A Hugging Face account with an API token
- (Optional) a GPU: macOS with MPS support (macOS 12.3+) or a CUDA-capable GPU; CPU-only execution also works (see Troubleshooting)
### Steps

1. Clone the repository:

   ```bash
   git clone https://github.com/unesco/sdg-classifier.git
   cd sdg-classifier
   ```

2. Install dependencies:

   ```bash
   poetry install
   ```

3. Create your environment file:

   ```bash
   cp .env.example .env
   ```

4. Edit `.env` and add your Hugging Face token:

   ```
   HF_TOKEN=your_huggingface_token_here
   ```
## Usage

```bash
poetry run python -m sdg_classifier.main
```

The classifier reads the input CSV specified in `config.yaml`, processes each row through the appropriate language model, and writes the results to the output CSV.

### Debug Mode

Set `debug: true` in `config.yaml` to process only 100 random rows for quick testing:
```yaml
debug: true
```

## Configuration

### Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | Yes | Hugging Face API token for model downloads |
### config.yaml

```yaml
data:
  input_csv: "data/data_sdgs.csv"
  output_csv: "data/data_sdgs_detected.csv"
models:
  en: "sadickam/sdgBERT"
  fr: "ilovebots/bert-sdg-french"
debug: false
```

| Key | Description |
|---|---|
| `data.input_csv` | Path to the input CSV file |
| `data.output_csv` | Path to the output CSV file |
| `models.en` | Hugging Face model ID for English |
| `models.fr` | Hugging Face model ID for French |
| `debug` | Process only 100 random rows when `true` |
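Reading this file and resolving the per-language model could be sketched as below, assuming PyYAML for parsing; `load_config` and `model_for` are illustrative helper names, not functions from the project.

```python
import yaml  # PyYAML, assumed available

def load_config(path: str = "config.yaml") -> dict:
    """Parse config.yaml into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

def model_for(config: dict, language: str) -> str:
    """Map a row's language code (en, fr) to the configured model ID."""
    try:
        return config["models"][language]
    except KeyError:
        raise ValueError(f"No model configured for language {language!r}")
```

A row with `language == "fr"` would then resolve to `ilovebots/bert-sdg-french`, and an unsupported code raises an error instead of silently picking a default.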
## Input / Output Format

### Input CSV

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier for each row |
| `language` | string | Language code (`en`, `fr`) |
| `value` | string | Text to classify |
| `question` | string (optional) | Contextual question prepended to `value` |

### Output CSV

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier |
| `sdg` | list | Array of detected SDG IDs (1-17) |
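The row-by-row flow between these two formats can be sketched with the stdlib `csv` module; this is an assumed shape, not the project's actual I/O code, and `classify` stands in for the model call. Writing the `sdg` column as a JSON array keeps the list parseable downstream.

```python
import csv
import json
from typing import Callable

def process_csv(input_path: str, output_path: str,
                classify: Callable[[str, str], list[int]]) -> None:
    """Read the input CSV, classify each row, write id + detected SDG list."""
    with open(input_path, newline="", encoding="utf-8") as fin, \
         open(output_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(fout, fieldnames=["id", "sdg"])
        writer.writeheader()
        for row in csv.DictReader(fin):
            # Prepend the optional contextual question to the text, if present.
            text = " ".join(filter(None, [row.get("question"), row["value"]]))
            sdgs = classify(text, row["language"])
            # Serialize the SDG list as a JSON array so it round-trips cleanly.
            writer.writerow({"id": row["id"], "sdg": json.dumps(sdgs)})
```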
## Project Structure

```
sdg-classifier/
├── README.md
├── LICENSE
├── .env.example
├── .gitignore
├── config.yaml
├── pyproject.toml
├── poetry.lock
├── data/
│   └── .gitkeep
└── sdg_classifier/
    ├── __init__.py
    └── main.py
```
## Troubleshooting

### Device Selection

The code currently uses Apple's MPS (Metal Performance Shaders) backend for GPU acceleration. If you are running on a non-Apple system:

- **CUDA GPU:** change `.to('mps')` to `.to('cuda')` in `sdg_classifier/main.py`
- **CPU only:** change `.to('mps')` to `.to('cpu')` in `sdg_classifier/main.py`
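Instead of hand-editing the device string, the selection can be made portable with a small helper, assuming PyTorch; `pick_device` is an illustrative name, not part of the project.

```python
import torch

def pick_device() -> torch.device:
    """Prefer MPS on Apple silicon, then CUDA, then fall back to CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# Usage: model.to(pick_device()) instead of hard-coding model.to('mps')
```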
### Common Issues

| Issue | Solution |
|---|---|
| `HF_TOKEN` not found | Create a `.env` file with your Hugging Face token |
| `config.yaml` not found | Run from the project root directory |
| MPS not available | Switch to CPU or CUDA (see above) |
| Out of memory | Reduce the batch size or use the CPU fallback |
| Model download fails | Verify that `HF_TOKEN` has read access to gated models |
## Contributing

Contributions are welcome. Please read the contributing guidelines before submitting a pull request.
## License

This project is licensed under the MIT License.
## Credits

- `sadickam/sdgBERT` — English SDG classification model
- `ilovebots/bert-sdg-french` — French SDG classification model
- Hugging Face — model hosting and the Transformers library
- UN Sustainable Development Goals — the SDG framework