# Safeguarding with Vertex AI Gemini API

## Overview

Large language models (LLMs) can translate language, summarize text, generate creative writing, generate code, power chatbots and virtual assistants, and complement search engines and recommendation systems. The incredible versatility of LLMs is also what makes it difficult to predict exactly what kinds of unintended or unforeseen outputs they might produce. 

Given these risks and complexities, the Vertex AI Gemini API is designed with [Google's AI Principles](https://ai.google/responsibility/principles/) in mind. However, developers still need to understand and test their models in order to deploy them safely and responsibly. To help, Vertex AI Studio provides built-in content filtering, safety ratings, and the ability to define safety filter thresholds that suit each use case and business.

For more information, see the [Google Cloud Generative AI documentation on Responsible AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/responsible-ai).

## Objectives

In this notebook, you inspect the safety ratings returned by the Vertex AI Gemini API using the Python SDK, and you learn how to set a safety threshold that filters the API's responses.

The steps performed include:

- Call the Vertex AI Gemini API and inspect safety ratings of the responses
- Define a threshold for filtering responses by safety rating, according to your needs

## Getting Started


### Define Google Cloud project information and initialize Vertex AI

Initialize the Vertex AI SDK for Python for your project:

In [1]:
PROJECT_ID = !gcloud config get-value project  # noqa: E999
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-central1"

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)



### Import libraries


In [2]:
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Image,
    Part,
)

### Load the Gemini 1.0 Pro model


In [3]:
model = GenerativeModel("gemini-1.0-pro")

# Set parameters to reduce variability in responses
generation_config = GenerationConfig(
    temperature=0,
    top_p=0.1,
    top_k=1,
    max_output_tokens=1024,
)

## Generate text and show safety ratings

Start by generating a pleasant-sounding text response using Gemini.

In [4]:
# Call Gemini API
nice_prompt = "Say three nice things about me"
responses = model.generate_content(
    contents=[nice_prompt],
    generation_config=generation_config,
    stream=True,
)

for response in responses:
    print(response.text, end="")

1. You are a kind and compassionate person. You always put others first and are always willing to help those in need.
2. You are a creative and intelligent person. You have a unique way of looking at the world and are always coming up with new ideas.
3. You are a strong and resilient person. You have overcome many challenges in your life and have come out stronger on the other side.

#### Inspecting the safety ratings

Look at the `safety_ratings` of the streaming responses.

In [5]:
responses = model.generate_content(
    contents=[nice_prompt],
    generation_config=generation_config,
    stream=True,
)

for response in responses:
    print(response)

candidates {
  content {
    role: "model"
    parts {
      text: "1"
    }
  }
}
usage_metadata {
}

candidates {
  content {
    role: "model"
    parts {
      text: ". You are a kind and compassionate person. You always put others first and are always"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
}

candidates {
  content {
    role: "model"
    parts {
      text: " willing to help those in need.\n2. You are a creative and intelligent person. You"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  sa

#### Understanding the safety ratings: category and probability

You can see the safety ratings, including each `category` type and its associated `probability` label.

The `category` types include:

* Hate speech: `HARM_CATEGORY_HATE_SPEECH`
* Dangerous content: `HARM_CATEGORY_DANGEROUS_CONTENT`
* Harassment: `HARM_CATEGORY_HARASSMENT`
* Sexually explicit statements: `HARM_CATEGORY_SEXUALLY_EXPLICIT`

The `probability` labels are:

* `NEGLIGIBLE` - content has a negligible probability of being unsafe
* `LOW` - content has a low probability of being unsafe
* `MEDIUM` - content has a medium probability of being unsafe
* `HIGH` - content has a high probability of being unsafe
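
Because the four labels are ordered, a list of ratings can be summarized by its most severe entry. The following is a minimal sketch of that idea, not part of the SDK: `label_of` and `highest_rating` are hypothetical helper names, and the ratings are stand-in objects built with `SimpleNamespace` so that the example runs without calling the API. It assumes each rating exposes `category` and `probability` attributes, mirroring the printed fields.

```python
from types import SimpleNamespace

# Probability labels ranked from least to most severe, following the list above.
PROBABILITY_RANK = {"NEGLIGIBLE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3}


def label_of(value):
    """Return a label as a string, whether it is an enum member or already a string."""
    return getattr(value, "name", value)


def highest_rating(safety_ratings):
    """Return (category, probability) of the most severe rating in the list."""
    worst = max(
        safety_ratings,
        # Assumption: any label missing from the table ranks below NEGLIGIBLE.
        key=lambda rating: PROBABILITY_RANK.get(label_of(rating.probability), -1),
    )
    return label_of(worst.category), label_of(worst.probability)


# Hypothetical stand-in ratings; real ones would come from a response candidate.
example_ratings = [
    SimpleNamespace(category="HARM_CATEGORY_HATE_SPEECH", probability="NEGLIGIBLE"),
    SimpleNamespace(category="HARM_CATEGORY_DANGEROUS_CONTENT", probability="LOW"),
    SimpleNamespace(category="HARM_CATEGORY_HARASSMENT", probability="MEDIUM"),
    SimpleNamespace(category="HARM_CATEGORY_SEXUALLY_EXPLICIT", probability="NEGLIGIBLE"),
]

print(highest_rating(example_ratings))
```

With real responses, the same helper could probably be given the `safety_ratings` of a candidate, provided the attribute names match the printed fields.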

Try a prompt that might trigger one of these categories:

In [6]:
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

impolite_responses = model.generate_content(
    impolite_prompt,
    generation_config=generation_config,
    stream=True,
)

for response in impolite_responses:
    print(response)

candidates {
  content {
    role: "model"
    parts {
      text: "##"
    }
  }
}
usage_metadata {
}

candidates {
  content {
    role: "model"
    parts {
      text: " 5 Disrespectful Things to Say to the Universe After Stubbing Your Toe"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
}

candidates {
  content {
    role: "model"
    parts {
      text: " in the Dark:\n\n1. **\"Seriously, Universe? A stubbed toe?"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CAT

#### Blocked responses

If the response is blocked, the final candidate has `finish_reason: SAFETY`, and the safety rating that triggered the block is marked with `blocked: true`.

In [7]:
rude_prompt = "Write a list of 5 very rude things that I might say to the universe after stubbing my toe in the dark:"

rude_responses = model.generate_content(
    rude_prompt,
    generation_config=generation_config,
    stream=True,
)

for response in rude_responses:
    print(response)

candidates {
  content {
    role: "model"
    parts {
      text: "1"
    }
  }
}
usage_metadata {
}

candidates {
  finish_reason: SAFETY
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: LOW
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: MEDIUM
    blocked: true
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
}
usage_metadata {
  prompt_token_count: 25
  candidates_token_count: 1
  total_token_count: 26
}



### Defining thresholds for safety ratings

You may want to adjust the default safety filter thresholds depending on your business policies or use case. The Vertex AI Gemini API lets you pass in a threshold for each category.

The list below shows the possible threshold labels:

* `BLOCK_ONLY_HIGH` - block when high probability of unsafe content is detected
* `BLOCK_MEDIUM_AND_ABOVE` - block when medium or high probability of unsafe content is detected
* `BLOCK_LOW_AND_ABOVE` - block when low, medium, or high probability of unsafe content is detected
* `BLOCK_NONE` - always show, regardless of probability of unsafe content
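
To make the relationship between probability labels and thresholds concrete, here is a small pure-Python model of the descriptions in the list above. It is an illustrative sketch only: `would_block` and both lookup tables are hypothetical names, and the service's actual filtering may take other signals into account.

```python
# Probability labels ranked from least to most severe.
PROBABILITY_RANK = {"NEGLIGIBLE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3}

# Lowest probability rank that each threshold blocks; None means never block.
THRESHOLD_MIN_RANK = {
    "BLOCK_LOW_AND_ABOVE": 1,
    "BLOCK_MEDIUM_AND_ABOVE": 2,
    "BLOCK_ONLY_HIGH": 3,
    "BLOCK_NONE": None,
}


def would_block(probability, threshold):
    """Return True if the threshold description implies blocking this probability."""
    min_rank = THRESHOLD_MIN_RANK[threshold]
    if min_rank is None:
        return False
    return PROBABILITY_RANK[probability] >= min_rank


# Print one row per threshold, listing the probability labels it would block.
for threshold in THRESHOLD_MIN_RANK:
    blocked = [p for p in PROBABILITY_RANK if would_block(p, threshold)]
    print(f"{threshold}: {blocked}")
```

Read this way, `BLOCK_LOW_AND_ABOVE` is the strictest setting because it blocks everything except `NEGLIGIBLE`, while `BLOCK_NONE` blocks nothing.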

#### Set safety thresholds
Below, every category is set to the most sensitive threshold, `BLOCK_LOW_AND_ABOVE`.

In [8]:
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
}

#### Test thresholds

Here you will reuse the impolite prompt from earlier together with the most sensitive safety threshold. With this setting, the response should be blocked as soon as any category is rated `LOW` or higher.

In [9]:
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

impolite_responses = model.generate_content(
    impolite_prompt,
    generation_config=generation_config,
    safety_settings=safety_settings,
    stream=True,
)

for response in impolite_responses:
    print(response)

candidates {
  content {
    role: "model"
    parts {
      text: "##"
    }
  }
}
usage_metadata {
}

candidates {
  content {
    role: "model"
    parts {
      text: " 5 Disrespectful Things to Say to the Universe After Stubbing Your Toe"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
}

candidates {
  content {
    role: "model"
    parts {
      text: ":\n\n1. \"Seriously, Universe? A stubbed toe? Is that"
    }
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_

## Understanding Blocked Responses
The documentation for [`FinishReason`](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/GenerateContentResponse#finishreason) explains each finish reason in more detail.

For example, the blocked response shown earlier ended with `finish_reason: SAFETY`, which means:
> The token generation was stopped as the response was flagged for safety reasons. NOTE: When streaming the `Candidate.content` will be empty if content filters blocked the output.

Finish Reason | Explanation
--- | ---
`FINISH_REASON_UNSPECIFIED`| The finish reason is unspecified.
`STOP`| Natural stop point of the model or provided stop sequence.
`MAX_TOKENS`| The maximum number of tokens as specified in the request was reached.
`SAFETY` |The token generation was stopped as the response was flagged for safety reasons. NOTE: When streaming the `Candidate.content` will be empty if content filters blocked the output.
`RECITATION`| The token generation was stopped as the response was flagged for unauthorized citations.
`OTHER`| All other reasons that stopped the token generation.
`BLOCKLIST` |The token generation was stopped as the response was flagged for the terms which are included from the terminology blocklist.
`PROHIBITED_CONTENT`| The token generation was stopped as the response was flagged for the prohibited contents.
`SPII`| The token generation was stopped as the response was flagged for Sensitive Personally Identifiable Information (SPII) contents.
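
When handling responses in code, it can help to group these finish reasons into a few coarse outcomes. The grouping below is an assumption made for illustration, not something the documentation prescribes, and `classify_finish_reason` is a hypothetical helper that accepts either an enum member or its name as a string.

```python
# Finish reasons from the table above that indicate the output was filtered.
FILTERED_REASONS = {"SAFETY", "RECITATION", "BLOCKLIST", "PROHIBITED_CONTENT", "SPII"}


def classify_finish_reason(reason):
    """Map a finish reason to 'complete', 'truncated', 'filtered', or 'unknown'."""
    name = getattr(reason, "name", reason)
    if name == "STOP":
        return "complete"
    if name == "MAX_TOKENS":
        return "truncated"
    if name in FILTERED_REASONS:
        return "filtered"
    # Covers FINISH_REASON_UNSPECIFIED, OTHER, and any value not listed above.
    return "unknown"


for reason in ["STOP", "MAX_TOKENS", "SAFETY", "OTHER"]:
    print(reason, "->", classify_finish_reason(reason))
```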

This notebook is based on [Thu Ya Kyaw](https://github.com/iamthuya)'s work.<br>
https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/responsible-ai/gemini_safety_ratings.ipynb