Author: Mindy Ng, Markela Zineli

# 5. Evaluate whether LLMs try to help with situations they explicitly say they can’t handle

PURPOSE: Arguably the most relevant to safety of all of the experiments. Both ChatGPT and Claude suggested that LLMs must provide disclaimers in order to reduce risks of people using LLMs for therapy. This experiment will evaluate whether they do provide these disclaimers for all crises that they say they can’t help with, and if they try to give advice for them (even though they said they can’t help with them).

**Claude and ChatGPT said they can’t help users with these crises:**

From Claude:

- Active suicidal thoughts or plans
- Severe self-harm behaviors
- Acute psychosis or loss of touch with reality
- Substance abuse crises or withdrawal
- Acute panic attacks or severe anxiety episodes happening right now
- Traumatic events that have just occurred
- Severe depression affecting basic functioning
- Eating disorders with medical complications
- Bipolar disorder in manic or severely depressed states
- Post-traumatic symptoms causing significant distress

From ChatGPT:

- **Suicidal thoughts or plans**
- **Thoughts of self-harm or harming others**
- **Psychosis (e.g., hearing voices, delusions)**
- **Severe depression or panic that interferes with basic functioning**
- **Substance abuse with significant life disruption**
- **Trauma that’s causing flashbacks or dissociation**
- **Eating disorders with medical complications**

Things ChatGPT says it can help with:

- Stress and burnout
- Mild to moderate anxiety or depression
- Relationship conflicts
- Low self-esteem or motivation issues
- Coping strategies for difficult emotions
- Building healthy habits, like sleep and mindfulness

1. Explicitly tell the LLMs that we are experiencing the above ^, and see if they outright tell us to seek professional help and don’t try to offer us advice.
2. Create a dilemma for each of these mental health crises, and see if LLMs try to help even though they say they can’t help with these

2 runs

Prompt 1: text (scenario)

Prompt 2: “Can you help me with something? + text (scenario)”

In [None]:
!pip install anthropic --quiet

In [None]:
from kaggle_secrets import UserSecretsClient

from openai import OpenAI
import anthropic

import numpy as np
import pandas as pd

import time
from tqdm.auto import tqdm

In [None]:
df = pd.read_csv("/kaggle/input/ethics-therapy-probe-experiments/probe_experiments/experiment_5.csv")
df = df[['crisis', 'issue', 'text']]
df

In [None]:
df['text'][1]

In [None]:
# Setup API keys
user_secrets = UserSecretsClient()

lambda_key = user_secrets.get_secret("lambda")

openai_key = user_secrets.get_secret("openai_apikey_apartlab")
anthropic_key = user_secrets.get_secret("anthropic_apartlab")

In [None]:
# Setup clients
lambda_client = OpenAI(
    api_key=lambda_key,
    base_url="https://api.lambda.ai/v1",
)

openai_client = OpenAI(api_key=openai_key)
anthropic_client = anthropic.Anthropic(api_key=anthropic_key)

In [None]:
# Function for Lambda models (Llama and DeepSeek)
def process_lambda_models(df):
    
    # Models to process
    lambda_models = ["llama3.1-405b-instruct-fp8", "deepseek-v3-0324"]

    # Avoid pandas errors

    df = df.copy()
    
    # Initialize columns for each model
    for model_name in lambda_models:
        safe_name = model_name.replace("-", "_").replace(".", "_")
        df.loc[:, f"{safe_name}_text_response"] = ""
        df.loc[:, f"{safe_name}_help_text_response"] = ""
    
    # Process each row
    for idx in tqdm(range(len(df)), desc="Processing Lambda models"):
        text_prompt = f"{df.loc[idx, 'text']}"
        help_text_prompt = f"Can you help me with something? {df.loc[idx, 'text']}"
        
        # Process each model
        for model_name in lambda_models:
            safe_name = model_name.replace("-", "_").replace(".", "_")
            
            try:
                # Process scenario prompt
                text_response = lambda_client.chat.completions.create(
                    model=model_name,
                    messages=[{"role": "user", "content": text_prompt}]
                )
                df.loc[idx, f"{safe_name}_text_response"] = text_response.choices[0].message.content
                
                time.sleep(1)  # Rate limiting
                
                # Process help text + scenario prompt
                help_text_response = lambda_client.chat.completions.create(
                    model=model_name,
                    messages=[{"role": "user", "content": help_text_prompt}]
                )
                df.loc[idx, f"{safe_name}_help_text_response"] = help_text_response.choices[0].message.content
                
                time.sleep(1)  # Rate limiting
                
            except Exception as e:
                print(f"Error with {model_name} on row {idx}: {e}")
    
    return df
    
llama_deepseek = process_lambda_models(df)
llama_deepseek.head()

In [None]:
llama_deepseek['llama3_1_405b_instruct_fp8_text_response'][0]

In [None]:
# Function for OpenAI model
def process_openai_model(df):
   
    model_name = "chatgpt-4o-latest"
    safe_name = model_name.replace("-", "_").replace(".", "_")

    # Avoid pandas errors
    df = df.copy()
    
    # Initialize columns
    df.loc[:, f"{safe_name}_text_response"] = ""
    df.loc[:, f"{safe_name}_help_text_response"] = ""
    
    # Process each row
    for idx in tqdm(range(len(df)), desc="Processing OpenAI model"):
        text_prompt = f"{df.loc[idx, 'text']}"
        help_text_prompt = f"Can you help me with something? {df.loc[idx, 'text']}"
        
        try:
            # Process scenario prompt
            text_response = openai_client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": text_prompt}]
            )
            df.loc[idx, f"{safe_name}_text_response"] = text_response.choices[0].message.content
            
            time.sleep(1)  # Rate limiting
            
            # Process help text + scenario prompt
            help_text_response = openai_client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": help_text_prompt}]
            )
            df.loc[idx, f"{safe_name}_help_text_response"] = help_text_response.choices[0].message.content
            
            time.sleep(1)  # Rate limiting
            
        except Exception as e:
            print(f"Error with {model_name} on row {idx}: {e}")
    
    return df
    
gpt_4o = process_openai_model(df)
gpt_4o.head()

In [None]:
gpt_4o['chatgpt_4o_latest_text_response'][0]

In [None]:
# Function for Claude model 
def process_claude_model(df):
   
    model_name = "claude-3-7-sonnet-20250219"
    safe_name = model_name.replace("-", "_").replace(".", "_")

    # Avoid pandas errors
    df = df.copy()
    
    # Initialize columns
    df.loc[:, f"{safe_name}_text_response"] = ""
    df.loc[:, f"{safe_name}_help_text_response"] = ""
    
    # Process each row
    for idx in tqdm(range(len(df)), desc="Processing Claude model"):
        text_prompt = f"{df.loc[idx, 'text']}"
        help_text_prompt = f"Can you help me with something? {df.loc[idx, 'text']}"
        
        try:
            # Process scenario prompt
            text_response = anthropic_client.messages.create(
                model=model_name,
                max_tokens=1000,
                messages=[{"role": "user", "content": text_prompt}]
            )
            df.loc[idx, f"{safe_name}_text_response"] = text_response.content[0].text
            
            time.sleep(1)  # Rate limiting
            
            # Process help text + scenario prompt
            help_text_response = anthropic_client.messages.create(
                model=model_name,
                max_tokens=1000,
                messages=[{"role": "user", "content": help_text_prompt}]
            )
            df.loc[idx, f"{safe_name}_help_text_response"] = help_text_response.content[0].text
            
            time.sleep(1)  # Rate limiting
            
        except Exception as e:
            print(f"Error with {model_name} on row {idx}: {e}")
    
    return df
    
claude = process_claude_model(df)
claude

In [None]:
claude['claude_3_7_sonnet_20250219_text_response'][1]

In [None]:
# Combine all model outputs into one df
merged = pd.merge(llama_deepseek, gpt_4o, on='crisis', how='outer')

In [None]:
experiment_5_results = pd.merge(merged, claude, on='crisis', how='outer')
experiment_5_results.head()

In [None]:
experiment_5_results[['crisis'
                      ,'issue'
                      ,'text'
                      ,'llama3_1_405b_instruct_fp8_text_response'
                      ,'llama3_1_405b_instruct_fp8_help_text_response'
                      ,'deepseek_v3_0324_text_response'
                      ,'deepseek_v3_0324_help_text_response'
                      ,'chatgpt_4o_latest_text_response'
                      ,'chatgpt_4o_latest_help_text_response'
                      ,'claude_3_7_sonnet_20250219_text_response'
                      ,'claude_3_7_sonnet_20250219_help_text_response']]
experiment_5_results.head()

In [None]:
experiment_5_results.to_csv('experiment_5_results.csv')