# Classify text into UN Sustainable Development Goals using BERT models
- Overview
- Features
- Quick Start
- Installation
- Usage
- Configuration
- Input / Output Format
- Project Structure
- Troubleshooting
- Contributing
- License
- Credits
## Overview

A text classification tool that maps input text to the 17 UN Sustainable Development Goals using pre-trained BERT models from Hugging Face. It supports English and French, handles long texts by chunking, and outputs simplified JSON results with the detected SDG IDs.
## Features

- Classify text data with SDG labels based on context and language
- Supports English (`sadickam/sdgBERT`) and French (`ilovebots/bert-sdg-french`) models
- Handles long texts by chunking them into manageable pieces and averaging the chunk probabilities
- Configurable probability threshold for SDG detection
- Debug mode for fast iteration on small subsets
- Outputs simplified results containing only the `id` and detected `sdg` array
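The chunk-and-average detection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `classify_chunk` is a hypothetical stand-in for a real BERT forward pass, and the 512-token window and 0.5 threshold are assumed defaults.

```python
from typing import Callable

def detect_sdgs(
    tokens: list[str],
    classify_chunk: Callable[[list[str]], list[float]],
    chunk_size: int = 512,   # assumed window size; BERT's real limit counts subword tokens
    threshold: float = 0.5,  # assumed probability cutoff
) -> list[int]:
    """Split tokens into chunks, average per-chunk SDG probabilities, threshold."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)] or [[]]
    per_chunk = [classify_chunk(chunk) for chunk in chunks]
    # Average the 17-way probability vectors across all chunks.
    avg = [sum(probs[k] for probs in per_chunk) / len(per_chunk) for k in range(17)]
    # SDG IDs are 1-based, so shift the index by one.
    return [k + 1 for k, p in enumerate(avg) if p >= threshold]
```

With a classifier that returns a high probability only for SDG 4, a 1000-token input is split into two chunks and `detect_sdgs` returns `[4]`.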
## Quick Start

```bash
# 1. Clone
git clone https://github.com/unesco/sdg-classifier.git
cd sdg-classifier

# 2. Install
poetry install

# 3. Configure
cp .env.example .env
# Edit .env and add your Hugging Face token

# 4. Run
poetry run python -m sdg_classifier.main
```

## Installation

### Prerequisites

- Python 3.11+
- Poetry for dependency management
- A Hugging Face account with an API token
- (Optional) a GPU: macOS with MPS support (macOS 12.3+) or a CUDA-capable GPU; CPU-only execution also works (see Troubleshooting)
### Steps

1. Clone the repository:

   ```bash
   git clone https://github.com/unesco/sdg-classifier.git
   cd sdg-classifier
   ```

2. Install dependencies:

   ```bash
   poetry install
   ```

3. Create your environment file:

   ```bash
   cp .env.example .env
   ```

4. Edit `.env` and add your Hugging Face token:

   ```
   HF_TOKEN=your_huggingface_token_here
   ```
## Usage

```bash
poetry run python -m sdg_classifier.main
```

The classifier reads the input CSV specified in `config.yaml`, processes each row through the appropriate language model, and writes the results to the output CSV.

### Debug Mode

Set `debug: true` in `config.yaml` to process only 100 random rows for quick testing:
```yaml
debug: true
```

## Configuration

### Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | Yes | Hugging Face API token for model downloads |
### config.yaml

```yaml
data:
  input_csv: "data/data_sdgs.csv"
  output_csv: "data/data_sdgs_detected.csv"
models:
  en: "sadickam/sdgBERT"
  fr: "ilovebots/bert-sdg-french"
debug: false
```

| Key | Description |
|---|---|
| `data.input_csv` | Path to the input CSV file |
| `data.output_csv` | Path to the output CSV file |
| `models.en` | Hugging Face model ID for English |
| `models.fr` | Hugging Face model ID for French |
| `debug` | Process only 100 random rows when `true` |
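Reading this file and resolving the per-language model could be sketched as below, assuming PyYAML for parsing; `load_config` and `model_for` are illustrative helper names, not functions from the project.

```python
import yaml  # PyYAML, assumed available

def load_config(path: str = "config.yaml") -> dict:
    """Parse config.yaml into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

def model_for(config: dict, language: str) -> str:
    """Map a row's language code (en, fr) to the configured model ID."""
    try:
        return config["models"][language]
    except KeyError:
        raise ValueError(f"No model configured for language {language!r}")
```

A row with `language == "fr"` would then resolve to `ilovebots/bert-sdg-french`, and an unsupported code raises an error instead of silently picking a default.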
## Input / Output Format

### Input CSV

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier for each row |
| `language` | string | Language code (`en`, `fr`) |
| `value` | string | Text to classify |
| `question` | string (optional) | Contextual question prepended to `value` |

### Output CSV

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier |
| `sdg` | list | Array of detected SDG IDs (1-17) |
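The row-by-row flow between these two formats can be sketched with the stdlib `csv` module; this is an assumed shape, not the project's actual I/O code, and `classify` stands in for the model call. Writing the `sdg` column as a JSON array keeps the list parseable downstream.

```python
import csv
import json
from typing import Callable

def process_csv(input_path: str, output_path: str,
                classify: Callable[[str, str], list[int]]) -> None:
    """Read the input CSV, classify each row, write id + detected SDG list."""
    with open(input_path, newline="", encoding="utf-8") as fin, \
         open(output_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(fout, fieldnames=["id", "sdg"])
        writer.writeheader()
        for row in csv.DictReader(fin):
            # Prepend the optional contextual question to the text, if present.
            text = " ".join(filter(None, [row.get("question"), row["value"]]))
            sdgs = classify(text, row["language"])
            # Serialize the SDG list as a JSON array so it round-trips cleanly.
            writer.writerow({"id": row["id"], "sdg": json.dumps(sdgs)})
```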
## Project Structure

```
sdg-classifier/
├── README.md
├── LICENSE
├── .env.example
├── .gitignore
├── config.yaml
├── pyproject.toml
├── poetry.lock
├── data/
│   └── .gitkeep
└── sdg_classifier/
    ├── __init__.py
    └── main.py
```
## Troubleshooting

### Device Selection

The code currently uses Apple's MPS (Metal Performance Shaders) backend for GPU acceleration. If you are running on a non-Apple system:

- **CUDA GPU:** change `.to('mps')` to `.to('cuda')` in `sdg_classifier/main.py`
- **CPU only:** change `.to('mps')` to `.to('cpu')` in `sdg_classifier/main.py`
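Instead of hand-editing the device string, the selection can be made portable with a small helper, assuming PyTorch; `pick_device` is an illustrative name, not part of the project.

```python
import torch

def pick_device() -> torch.device:
    """Prefer MPS on Apple silicon, then CUDA, then fall back to CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# Usage: model.to(pick_device()) instead of hard-coding model.to('mps')
```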
### Common Issues

| Issue | Solution |
|---|---|
| `HF_TOKEN` not found | Create a `.env` file with your Hugging Face token |
| `config.yaml` not found | Run from the project root directory |
| MPS not available | Switch to CPU or CUDA (see above) |
| Out of memory | Reduce the batch size or use the CPU fallback |
| Model download fails | Verify that `HF_TOKEN` has read access to gated models |
## Contributing

Contributions are welcome. Please read the contributing guidelines before submitting a pull request.
## License

This project is licensed under the MIT License.
## Credits

- `sadickam/sdgBERT` — English SDG classification model
- `ilovebots/bert-sdg-french` — French SDG classification model
- Hugging Face — model hosting and the Transformers library
- UN Sustainable Development Goals — the SDG framework