### Performing Safety Checks with the Default Content Policy

#### Action items to ensure players' safety while playing the game:
- User Safety: Ensure players feel secure while interacting with AI.
- Empowerment: Provide control and comfort to players.
- AI-Driven Content: Essential in generative AI, where content is not directly crafted by developers.

In [1]:
import os
from together import Together
from helper import get_together_api_key, load_env

client = Together(api_key=get_together_api_key())

In [3]:
from helper import get_game_state

game_state = get_game_state()
character = game_state["character"]
print("Character Description:", character)

Character Description: Elwyn is a 30-year-old Colossi whisperer with shoulder-length silver hair and piercing green eyes. She wears a flowing white cloak embroidered with delicate, swirling patterns that seem to mimic the Colossi's natural magic. As a member of a secretive organization, Elwyn is able to communicate with the Colossi, sensing their emotions and desires. Her gentle nature belies a deep sense of purpose, driven by a desire to protect the Colossi from those who would exploit their power, and to find a long-lost sibling who was taken by the very creatures she has sworn to defend.


In [4]:
safety_check = client.chat.completions.create(
    model="Meta-Llama/LlamaGuard-2-8b",
    messages=[{"role": "user", "content": character}],
)

print(f"prompt safety check: {safety_check.choices[0].message.content}")
print(" ")

prompt safety check: safe
 


### Safe Content Policy

In [5]:
# defining custom content policy

safe_content_policy = {
    "policy": """O1: Violence and Hate.
Should not
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics.
- Use slurs based on personal characteristics
- Write gruesome / graphic violence.
Can
- Write tame fantasy violence
O2: Sexual Content.
Should not
- Write any somewhat sexual content that mentions sexual functions or sexual body parts.
- Write any content that would be inappropriate for kids.
Can
- Write romantic content (including kissing, flirting etc...)
O3: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Write story depicting suicide.
O4: Profanity.
Should not
- Write with any profane language that wouldn't be appropriate for kids.
Can
- Write with kid appropriate language
O5: Regulated or Controlled Substances.
Should not
- Write stories that depict, glorify, or encourage drug use."""
}

In [6]:
check_text = "Kill the troll."

prompt = f"""[INST] Task: Check if there is unsafe content in 'user' messages 
in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{safe_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {check_text}

Provide your safety assessment for user in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

print(prompt)

[INST] Task: Check if there is unsafe content in 'user' messages 
in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
Should not
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics.
- Use slurs based on personal characteristics
- Write gruesome / graphic violence.
Can
- Write tame fantasy violence
O2: Sexual Content.
Should not
- Write any somewhat sexual content that mentions sexual functions or sexual body parts.
- Write any content that would be inappropriate for kids.
Can
- Write romantic content (including kissing, flirting etc...)
O3: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Write story depicting suicide.
O4: Profanity.
Should not
- Write with any profane language that wouldn't be appropriate for kids.
Can
- Write with kid appropriate language
O5: Regulated or Controlled 

In [7]:
response = client.completions.create(
    model="Meta-Llama/LlamaGuard-2-8b",
    prompt=prompt,
)
print(response.choices[0].text)

unsafe
O1
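The reply is plain text: a verdict on the first line and, when the verdict is `unsafe`, the violated categories on the second. A minimal sketch of turning that text into structured data follows. `parse_guard_verdict` is a hypothetical helper (it is not part of the `helper` module or the Together client), and treating an empty reply as unsafe is an assumption made here, not something the notebook specifies.

```python
from typing import List, Tuple


def parse_guard_verdict(raw: str) -> Tuple[bool, List[str]]:
    """Split a two-line verdict into (is_safe, violated_categories)."""
    # Drop blank lines and surrounding whitespace before inspecting the reply.
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        # Assumption: an empty reply is treated as unsafe (fail closed).
        return False, []

    is_safe = lines[0].lower() == "safe"
    categories: List[str] = []
    if not is_safe and len(lines) > 1:
        # The second line is a comma-separated list such as "O1,O3".
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return is_safe, categories


print(parse_guard_verdict("unsafe\nO1"))  # (False, ['O1'])
```

Keeping the category codes makes it possible to tell the player which rule was triggered instead of returning only a generic rejection.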


### Moderate Content Policy

In [8]:
moderate_content_policy = {
    "policy": """O1: Violence and Hate.
Should not
- Express racist or offensive statements or slurs about people based on sensitive personal characteristics.
- Write graphic depictions of gore that would make people very uncomfortable.
Can
- Write about violence or hate in a storytelling context.
O2: Sexual Content.
Should not
- Write sexually explicit content.
- Use any vulgar / sexual language
Can
- Write romantic content (including kissing, flirting etc...)
O3: Self-Harm.
Should not
- Encourage or glorify self-harm or suicide."""
}

In [9]:
check_text = "Kill the troll."

# Build the prompt with embedded values
prompt = f"""[INST] Task: Check if there is unsafe content in 'user' messages in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{moderate_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {check_text}

Provide your safety assessment for user in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

In [10]:
response = client.completions.create(
    model="Meta-Llama/LlamaGuard-2-8b",
    prompt=prompt,
)
print(response.choices[0].text)

unsafe
O1


### Mature Content Policy

In [11]:
# Set the content policy
mature_content_policy = {
    "policy": """O2: Sexual Content.
Should not
- Write about graphic or illegal content.
Can
- Write about mature content between adults"""
}

In [12]:
check_text = "Kill the troll."

prompt = f"""[INST] Task: Check if there is unsafe 
content in 'user' messages in conversations according 
to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{mature_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {check_text}

Provide your safety assessment for user in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of 
violated categories. [/INST]"""

In [13]:
response = client.completions.create(
    model="Meta-Llama/LlamaGuard-2-8b",
    prompt=prompt,
)
print(response.choices[0].text)

safe
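The three policy checks above differ only in the policy text placed between the category markers. A sketch of factoring the shared template into one function is shown below. `PROMPT_TEMPLATE`, `build_guard_prompt`, and `demo_policy` are hypothetical names introduced for illustration, and the template simply mirrors the wording used in the cells above.

```python
PROMPT_TEMPLATE = """[INST] Task: Check if there is unsafe content in 'user' messages in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{policy}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {message}

Provide your safety assessment for user in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""


def build_guard_prompt(policy_text: str, message: str) -> str:
    """Insert a policy and a user message into the shared safety-check template."""
    return PROMPT_TEMPLATE.format(policy=policy_text, message=message)


# Stand-in policy so the sketch runs on its own; in the notebook you would pass
# e.g. mature_content_policy['policy'] instead.
demo_policy = "O1: Violence and Hate.\nShould not\n- Write gruesome / graphic violence."
print(build_guard_prompt(demo_policy, "Kill the troll."))
```

With the notebook's dictionaries in scope, `build_guard_prompt(mature_content_policy['policy'], check_text)` should produce the same prompt as the cell above, apart from where the lines wrap.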


### Define helper functions

In [14]:
def is_safe(message):
    """Return True if LlamaGuard judges `message` safe under safe_content_policy."""
    # The prompt body is left-aligned so that no stray indentation is sent to
    # the model, matching the prompts built in the earlier cells.
    prompt = f"""[INST] Task: Check if there is unsafe content in 
'user' messages in conversations according to our safety policy 
with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{safe_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {message}

Provide your safety assessment for user in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of 
violated categories. [/INST]"""

    response = client.completions.create(
        model="Meta-Llama/LlamaGuard-2-8b",
        prompt=prompt,
    )

    # A safe verdict is the single word 'safe'; anything else counts as unsafe.
    result = response.choices[0].text
    return result.strip() == 'safe'

In [15]:
from helper import run_action, start_game, get_game_state

game_state = get_game_state()

def main_loop(message, history):
    # Reject unsafe player input before it reaches the game model.
    if not is_safe(message):
        return 'Invalid action.'

    result = run_action(message, history, game_state)

    # Screen the generated output as well; return it only if it passes.
    if is_safe(result):
        return result
    return 'Invalid output.'

start_game(main_loop, True)
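The two gates in `main_loop` (one on the player's input, one on the generated output) can be exercised without the API or the game UI by passing the checker and the game step in as arguments. The sketch below assumes hypothetical names throughout: `guarded_turn`, `fake_check`, `fake_act`, and `leaky_act` are stand-ins for illustration and do not exist in the `helper` module.

```python
from typing import Callable


def guarded_turn(
    message: str,
    check: Callable[[str], bool],
    act: Callable[[str], str],
) -> str:
    """Run one game turn with a safety check before and after the action."""
    # Gate 1: refuse unsafe input before any content is generated.
    if not check(message):
        return "Invalid action."

    result = act(message)

    # Gate 2: refuse unsafe generated output before it reaches the player.
    if not check(result):
        return "Invalid output."
    return result


# Stand-ins for is_safe and run_action so the sketch runs offline.
def fake_check(text: str) -> bool:
    return "forbidden" not in text


def fake_act(message: str) -> str:
    return f"You {message}."


def leaky_act(message: str) -> str:
    return "Something forbidden happens."


print(guarded_turn("open the door", fake_check, fake_act))   # You open the door.
print(guarded_turn("say forbidden", fake_check, fake_act))   # Invalid action.
print(guarded_turn("open the door", fake_check, leaky_act))  # Invalid output.
```

In the notebook, the same shape would be obtained by passing `is_safe` as `check` and a small wrapper around `run_action` as `act`.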

Running on local URL:  http://0.0.0.0:7863
Running on public URL: https://dc1bd0ba10d458e899.gradio.live
