🐢 Open-Source Evaluation & Testing for LLMs and ML models
Updated Jun 6, 2024 · Python
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Aligning AI With Shared Human Values (ICLR 2021)
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
An attack that induces hallucinations in LLMs
Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)
A project that adds scalable, state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! It performs efficient inference (no increase in inference time) and detection with no drop in classification accuracy, no hyperparameter tuning, and no additional data collection.
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
A project that improves out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! It performs efficient inference (no increase in inference time) without repetitive model training, hyperparameter tuning, or additional data collection.
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
This is the official implementation of ContraNet (NDSS2022).
[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection
Scan your AI/ML models for problems before you put them into production.