A Python-based toolkit for comparing transformers.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
🐢 Open-Source Evaluation & Testing for LLMs and ML models
RuLES: a benchmark for evaluating rule-following in language models
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds, a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. Compatible with OpenAI Gym/Gymnasium and the Farama Foundation's PettingZoo.
Scan your AI/ML models for problems before you put them into production.
Evaluation & testing framework for computer vision models
[ICLR 2024 Spotlight 🔥] [Best Paper Award, SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
DPLL(T)-based verification tool for DNNs
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)
An attack that induces hallucinations in LLMs
The Model Library is a project that maps the risks associated with modern machine learning systems.
Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models", accepted at ISSTA'24
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
Code and materials for the paper by S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics," working paper, arXiv:2305.07970, May 2023
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking