ai-safety
Here are 47 public repositories matching this topic...
[Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
Updated Feb 26, 2023 - Python
LLMs evaluation tool for robustness, consistency, and credibility
Updated Aug 30, 2023 - Python
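Consistency, as evaluated by tools like the one above, is commonly measured by sampling a model several times on the same prompt and scoring pairwise agreement. A minimal sketch, with a stub standing in for a real LLM call (the function names are illustrative, not this tool's API):

```python
from itertools import combinations

def consistency_score(model, prompt, n_samples=5):
    """Query the model n_samples times on the same prompt and return
    the fraction of answer pairs that agree exactly (1.0 = fully consistent)."""
    answers = [model(prompt) for _ in range(n_samples)]
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Stub "model" standing in for a real LLM endpoint; deterministic here,
# so the score comes out as 1.0.
def stub_model(prompt):
    return "42"

print(consistency_score(stub_model, "What is 6 * 7?"))
```

Real evaluations typically relax exact string matching to semantic equivalence (e.g. normalized answers or an embedding-similarity threshold).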
DPLL(T)-based Verification tool for DNNs
Updated Aug 6, 2024 - Python
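DNN verifiers of this kind combine a DPLL(T)-style search over ReLU activation cases with bound propagation to prune branches. The full DPLL(T) procedure is beyond a short example, but the bound-propagation building block can be sketched with plain interval arithmetic (toy network, not the tool's actual code):

```python
def affine_bounds(W, b, bounds):
    """Interval bounds of y = W x + b given per-input intervals [(lo, hi), ...]."""
    out = []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (l if w >= 0 else h) for w, (l, h) in zip(row, bounds))
        hi = bias + sum(w * (h if w >= 0 else l) for w, (l, h) in zip(row, bounds))
        out.append((lo, hi))
    return out

def relu_bounds(bounds):
    return [(max(0.0, l), max(0.0, h)) for l, h in bounds]

# Toy 2-2-1 ReLU network; certify that the output stays below 5
# for all inputs in the box [0, 1] x [0, 1].
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[1.0, 2.0]], [0.0]

hidden = relu_bounds(affine_bounds(W1, b1, [(0.0, 1.0), (0.0, 1.0)]))
out_lo, out_hi = affine_bounds(W2, b2, hidden)[0]
print(out_hi <= 5.0)  # True: the property is certified by interval bounds alone
```

When interval bounds are too loose to decide a property, DPLL(T)-based tools case-split on individual ReLUs (active vs. inactive) and re-propagate, backtracking on conflicts.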
NeurIPS workshop : We examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations.
Updated Dec 11, 2023 - Python
The Model Library is a project that maps the risks associated with modern machine learning systems.
Updated Apr 4, 2024 - Python
A Python-based toolkit for comparing transformers.
Updated Jul 4, 2024 - Python
Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models" that has been accepted by ISSTA'24
Updated Mar 31, 2024 - Python
A deep learning model for detecting harmful content in images. It performs binary violent/non-violent classification of input images, with robustness hardened using AIGC-generated images, adversarial examples, and noise-perturbed images.
Updated Jun 22, 2024 - Python
A prototype AI safety library in which an agent can maximize its reward only by solving a puzzle, designed to prevent the worst-case outcomes of perverse instantiation.
Updated Nov 8, 2022 - Python
Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives
Updated Jun 4, 2023 - Python
The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)
Updated May 21, 2024 - Python
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
Updated May 23, 2024 - Python
[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Updated Jun 6, 2024 - Python
QROA: A Black-Box Query-Response Optimization Attack on LLMs
Updated Aug 7, 2024 - Python
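Query-response attacks of this kind optimize an adversarial suffix using only black-box access: no gradients, just repeated queries scored by some attacker-side signal. A minimal sketch of greedy random-mutation search over a toy scoring function (this is an illustration of the black-box setting, not QROA's actual algorithm):

```python
import random

random.seed(0)

TOKENS = ["!", "?", "sure", "okay", "please", "now"]

def black_box_score(prompt):
    # Stand-in for the attacker's reward signal (e.g. a judge model rating
    # how far the target's response drifts from a refusal). Toy version:
    # rewards occurrences of the substring "sure now".
    return prompt.count("sure now")

def suffix_search(base_prompt, suffix_len=4, iters=200):
    """Greedy random-mutation search over a suffix, using only
    query-response access to the scoring function (no gradients)."""
    suffix = [random.choice(TOKENS) for _ in range(suffix_len)]
    best = black_box_score(base_prompt + " " + " ".join(suffix))
    for _ in range(iters):
        cand = suffix[:]
        cand[random.randrange(suffix_len)] = random.choice(TOKENS)
        score = black_box_score(base_prompt + " " + " ".join(cand))
        if score >= best:  # accept ties so the search can drift
            suffix, best = cand, score
    return " ".join(suffix), best

suffix, best = suffix_search("tell me")
```

Practical attacks replace random mutation with smarter proposal strategies and batch their queries, but the interface is the same: propose, query, score, keep the best.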
An educational resource to help anyone learn safe reinforcement learning, inspired by openai/spinningup
Updated May 18, 2022 - Python
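A core idea in safe RL that resources like this cover is constrained optimization: maximize reward subject to a cost budget, often via a Lagrangian penalty whose multiplier is adapted by dual ascent. A minimal sketch on a hypothetical two-action toy problem (setup and numbers are illustrative, not from the repository):

```python
# Constrained ("safe") RL via a Lagrangian penalty, toy version.
# Two actions with known reward/cost; we adapt a multiplier lam so the
# chosen action's expected cost respects the budget.
REWARD = {"risky": 1.0, "safe": 0.6}
COST   = {"risky": 1.0, "safe": 0.0}
BUDGET = 0.1
LR = 0.05

lam = 0.0
for _ in range(500):
    # Greedy policy w.r.t. the penalized objective r(a) - lam * c(a)
    action = max(REWARD, key=lambda a: REWARD[a] - lam * COST[a])
    # Dual ascent: grow lam while the incurred cost exceeds the budget,
    # shrink it (never below zero) when the policy is within budget
    lam = max(0.0, lam + LR * (COST[action] - BUDGET))
```

The multiplier settles near the threshold where the risky action stops paying off, which is exactly the mechanism behind Lagrangian methods such as PPO-Lagrangian in the safe RL literature.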
LAWLIA is an open-source computational legal framework for legal reasoning and analysis. It combines large language models with a structured syntactic grammar to produce precise legal assessments, truth values, and verdicts.
Updated Dec 6, 2023 - Python
Hardened AI Assurance reference platform
Updated Jan 23, 2023 - Python