🐢 Open-Source Evaluation & Testing for LLMs and ML models
Updated Jun 6, 2024 · Python
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Aligning AI With Shared Human Values (ICLR 2021)
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An attack that induces hallucinations in LLMs
Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)
A project that adds scalable, state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! It performs efficient inference (no increase in inference time) and detection with no drop in classification accuracy, no hyperparameter tuning, and no additional data collection.
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
A project that improves out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! It performs efficient inference (no increase in inference time) without repetitive model training, hyperparameter tuning, or additional data collection.
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
This is the official implementation of ContraNet (NDSS2022).
[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection
Scan your AI/ML models for problems before you put them into production.