🐢 Open-Source Evaluation & Testing for ML & LLM systems
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Aligning AI With Shared Human Values (ICLR 2021)
Deliver safe & effective language models
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
An attack that induces hallucinations in LLMs
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
A project that adds scalable, state-of-the-art out-of-distribution detection (open-set recognition) support by changing two lines of code. Inference stays efficient (no added inference time), and detection works without a drop in classification accuracy, hyperparameter tuning, or collecting additional data (see the illustrative sketch after this list).
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Scan your AI/ML models for problems before you put them into production.
[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection
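Several entries above concern post-hoc out-of-distribution detection. As a point of reference only, the following is a minimal, generic sketch of one classic baseline, maximum softmax probability scoring (Hendrycks & Gimpel, 2017). It is a hypothetical illustration and does not reproduce the API or method of any repository listed here; the model and data loader are assumed to be supplied by the user.

```python
# Generic post-hoc OOD scoring sketch (maximum softmax probability baseline).
# Not the API of any repository listed above; purely illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_ood_scores(model, loader, device="cpu"):
    """Return one OOD score per input: 1 - max softmax probability.
    Higher scores suggest the input lies further from the training distribution."""
    model.eval().to(device)
    scores = []
    for inputs, *_ in loader:  # assumes the loader yields (inputs, ...) tuples
        logits = model(inputs.to(device))
        probs = F.softmax(logits, dim=-1)
        scores.append(1.0 - probs.max(dim=-1).values)
    return torch.cat(scores)

# Usage sketch: flag inputs whose score exceeds a threshold chosen on a
# held-out in-distribution validation set (e.g., its 95th percentile).
```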