A Python-based toolkit for comparing transformers.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
🐢 Open-Source Evaluation & Testing for LLMs and ML models
RuLES: a benchmark for evaluating rule-following in language models
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds, a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. Compatible with OpenAI Gym/Gymnasium and the Farama Foundation's PettingZoo.
Scan your AI/ML models for problems before you put them into production.
Evaluation & testing framework for computer vision models
[ICLR 2024 Spotlight 🔥] [Best Paper Award, SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
DPLL(T)-based verification tool for DNNs
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)
An attack that induces hallucinations in LLMs
The Model Library is a project that maps the risks associated with modern machine learning systems.
Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models", accepted at ISSTA'24
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
Code and materials for the paper by S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics," working paper, arXiv:2305.07970, May 2023
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking