Ensuring Prediction Accuracy and Safety in Production
In a production environment, both prediction accuracy and safety must be maintained. One common approach is to deploy the model behind a REST API for real-time interaction: a user sends a request (e.g., a question), and the model returns a prediction. Batch processing suits offline tasks, such as sending a large set of reviews to the model for predictions in a single job.

When deploying via an API, a model built with a framework like PyTorch or TensorFlow can be wrapped in a web framework such as Flask, making it accessible as a service. A simple form of load balancing is to pick an endpoint at random for each request, which spreads traffic across the deployed versions. A key principle in production is to stay as close as possible to the data used during training. This is typically done by keeping the same prompt format and structure, which helps the model perform as it did during evaluation.
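
As a rough illustration of the request and response contract such a service might expose, the sketch below keeps the logic framework-agnostic, so it could be called from a Flask route or any other HTTP handler. The names handle_predict and echo_model and the "prompt" and "prediction" JSON fields are assumptions made for this example; echo_model stands in for a real model call.

```python
import json
from typing import Callable, Tuple


def handle_predict(body: bytes, predict_fn: Callable[[str], str]) -> Tuple[int, str]:
    """Turn a raw JSON request body into a (status_code, JSON response) pair."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    return 200, json.dumps({"prediction": predict_fn(prompt)})


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"echo: {prompt}"


status, response_body = handle_predict(b'{"prompt": "How do I read a CSV?"}', echo_model)
print(status, response_body)
```

Keeping validation separate from the web framework makes the same function reusable for both the real-time endpoint and a batch job that loops over stored requests.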

Prediction Latency in Larger Models
Larger models generally incur higher latency, which can impact user experience. It’s important to define permissible latency (e.g., is a 2-second delay acceptable?). Reducing latency can be achieved by using smaller models, faster hardware, or deploying models regionally to minimize network delays. Optimizing for response time is crucial for user satisfaction in real-time systems.
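
One way to make a latency budget concrete is to time each call and compare it against the agreed limit. The sketch below is a minimal example under assumed names: LATENCY_BUDGET_S, timed_predict, within_budget, and fake_model are all hypothetical, and the 2-second budget simply mirrors the example above.

```python
import time
from typing import Callable, Tuple

LATENCY_BUDGET_S = 2.0  # hypothetical budget, mirroring the 2-second example


def timed_predict(predict_fn: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Call predict_fn and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = predict_fn(prompt)
    elapsed = time.perf_counter() - start
    return result, elapsed


def within_budget(elapsed_s: float, budget_s: float = LATENCY_BUDGET_S) -> bool:
    """Return True when a request finished inside the latency budget."""
    return elapsed_s <= budget_s


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in that simulates a short model delay."""
    time.sleep(0.01)
    return "ok"


answer, elapsed = timed_predict(fake_model, "hello")
print(f"{elapsed:.3f}s, within budget: {within_budget(elapsed)}")
```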

Prompt Generation with Instruction and Question
To align production inputs with training data, send prompts to the model in the same format that was used during training. A typical prompt consists of:

Instruction: Guides the model on how to respond.
Question: Represents the user’s input.
Using the same structure the model saw during training makes its behavior more predictable and reduces errors caused by unfamiliar input formats.
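
A small helper can keep the prompt format in one place so that training and serving code cannot drift apart. This is only a sketch: PROMPT_TEMPLATE and build_prompt are hypothetical names, and the template text is a placeholder for whatever format was actually used during training.

```python
# Hypothetical template; in practice, reuse the exact template from training.
PROMPT_TEMPLATE = "{instruction}\nQuestion: {question}"


def build_prompt(instruction: str, question: str) -> str:
    """Combine an instruction and a user question using the shared template."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        question=question.strip(),
    )


prompt = build_prompt(
    "Answer the following question as a Python developer.",
    "  How can I load a CSV file using Pandas? ",
)
print(prompt)
```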

Safety Attributes and Thresholds
Safety attributes are crucial in ensuring the model’s responses remain appropriate. Models can have built-in mechanisms to block unsafe content, and practitioners can define additional thresholds based on probability (likelihood of harmful content) and severity (level of harm if the content is inappropriate). This layered approach ensures that responses meet safety guidelines before being presented to users.

Probability vs Severity Score
Probability score: Reflects how likely a response contains unsafe content.
Severity score: Indicates how harmful the content would be if it is in fact unsafe.
By setting thresholds for both scores, practitioners control which responses are shown and which are blocked, trading off overblocking against letting harmful output through.
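
The two thresholds can be combined in different ways. The sketch below blocks a response when either score reaches its limit, which is one conservative choice and not the only option. SafetyScore, should_block, and the 0.5 defaults are assumptions for illustration and do not reflect the schema of any particular API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyScore:
    """Hypothetical container for the two scores, each in the range 0 to 1."""
    probability: float  # how likely the content is unsafe
    severity: float     # how harmful the content would be if unsafe


def should_block(score: SafetyScore,
                 max_probability: float = 0.5,
                 max_severity: float = 0.5) -> bool:
    """Block when either score reaches its threshold."""
    return score.probability >= max_probability or score.severity >= max_severity


likely_but_mild = SafetyScore(probability=0.9, severity=0.1)
unlikely_but_severe = SafetyScore(probability=0.1, severity=0.9)
benign = SafetyScore(probability=0.1, severity=0.1)

for name, score in [("likely_but_mild", likely_but_mild),
                    ("unlikely_but_severe", unlikely_but_severe),
                    ("benign", benign)]:
    print(name, should_block(score))
```

Replacing `or` with `and` would block only content that is both likely and severe, which is more permissive; which rule fits depends on the application's tolerance for harm.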

Citation Tracking
In production, ensuring the originality of content is key. Citation metadata helps track whether the model's response is derived from existing sources. This helps reduce the risk of plagiarism and ensures transparency in model outputs, particularly when the content needs to be factual or attributed.

Beyond Deployment: Packaging, Versioning, and Monitoring
Beyond deploying the model, key considerations include:

Packaging and deployment: Ensuring the model is correctly bundled with dependencies.
Versioning: Keeping track of different model versions for reproducibility and rollback capabilities.
Monitoring: Continuously tracking model performance through accuracy metrics and bias checks.
Scalability: Ensuring the model can handle growing traffic with load testing and controlled rollouts.
Latency management: Balancing response times with user experience requirements.
Through regular monitoring and improvements, production models can be kept safe, efficient, and high-performing, ensuring consistent delivery of value to users.
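
As a small illustration of the monitoring point, the sketch below keeps a rolling window of recent requests and summarizes mean latency and the share of blocked responses. RollingMonitor and its fields are hypothetical; a real deployment would more likely send these values to a dedicated monitoring system.

```python
from collections import deque
from statistics import mean


class RollingMonitor:
    """Track latency and blocked-response rate over the most recent requests."""

    def __init__(self, window: int = 100):
        # deque(maxlen=...) silently drops the oldest entry once full
        self.latencies = deque(maxlen=window)
        self.blocked = deque(maxlen=window)

    def record(self, latency_s: float, was_blocked: bool) -> None:
        self.latencies.append(latency_s)
        self.blocked.append(was_blocked)

    def summary(self) -> dict:
        if not self.latencies:
            return {"count": 0, "mean_latency_s": None, "blocked_rate": None}
        return {
            "count": len(self.latencies),
            "mean_latency_s": mean(self.latencies),
            "blocked_rate": sum(self.blocked) / len(self.blocked),
        }


monitor = RollingMonitor(window=3)
for latency, was_blocked in [(0.5, False), (1.5, False), (1.0, True), (2.0, False)]:
    monitor.record(latency, was_blocked)

stats = monitor.summary()
print(stats)
```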



Real-Life Example: Predictions, Prompts, and Safety with PaLM 2


This notebook demonstrates how to load a pre-trained model (PaLM 2) on Vertex AI, call its tuned versions through their deployed endpoints, and manage prompts and safety attributes. We'll load the project, initialize the model, send requests with different prompts, and inspect the safety attributes and citation metadata of the generated responses.



Step 1: Load Project ID and Credentials
Objective: Authenticate and initialize the project credentials to access Google Cloud resources and Vertex AI services.

In [None]:
# Load the Project ID and credentials
from utils import authenticate

# Authenticate and get credentials
credentials, PROJECT_ID = authenticate()

# Set the region for model execution
REGION = "us-central1"


Step 2: Initialize Vertex AI SDK and Load the Pre-Trained Model
Objective: Import Vertex AI SDK, initialize the model with project details, and load the pre-trained PaLM 2 model (text-bison@001).

In [None]:
# Import Vertex AI SDK and model class for text generation
import vertexai
from vertexai.language_models import TextGenerationModel

# Initialize Vertex AI SDK with project, location, and credentials
vertexai.init(project=PROJECT_ID,
              location=REGION,
              credentials=credentials)

# Load the pre-trained model text-bison@001 from Vertex AI
model = TextGenerationModel.from_pretrained("text-bison@001")


Step 3: Retrieve and List Deployed Models (Endpoints)
Objective: Retrieve the list of deployed models (tuned versions of text-bison@001) and print the available models. This will allow load balancing between endpoints.

In [None]:
# Retrieve and list the names of all tuned models (endpoints)
list_tuned_models = model.list_tuned_model_names()

# Print the available tuned models
for tuned_model_name in list_tuned_models:
    print(tuned_model_name)


Step 4: Randomly Select a Model for Load Balancing
Objective: Randomly select one of the available models from the list for load balancing, ensuring that prediction traffic is distributed across different endpoints.

In [None]:
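# Import random module to select a model randomly
import random

# Fail early with a clear message if no tuned models are available
if not list_tuned_models:
    raise RuntimeError("No tuned models found for this project and region.")

# Randomly select one of the tuned models for prediction
tuned_model_select = random.choice(list_tuned_models)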
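
Uniform random choice sends roughly equal traffic to every endpoint. For a controlled rollout, a weighted draw is one alternative; the sketch below swaps in random.choices from the standard library in place of random.choice. The endpoint names and weights are hypothetical, and a seeded random.Random is used only to make the demonstration repeatable.

```python
import random
from collections import Counter

# Hypothetical endpoint names with the share of traffic each should receive
ENDPOINT_WEIGHTS = {
    "tuned-model-stable": 0.9,
    "tuned-model-canary": 0.1,
    "tuned-model-retired": 0.0,  # zero weight: never selected
}


def pick_endpoint(weights: dict, rng: random.Random) -> str:
    """Draw one endpoint name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
picks = Counter(pick_endpoint(ENDPOINT_WEIGHTS, rng) for _ in range(1000))
print(picks)
```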


Step 5: Load the Selected Model and Define a Prompt
Objective: Load the randomly selected model and define a prompt that matches the type of questions the model was trained on (e.g., Python-related questions).

In [None]:
# Load the endpoint of the randomly selected model
deployed_model = TextGenerationModel.get_tuned_model(tuned_model_select)

# Define a Python-related question as the prompt for prediction
PROMPT = "How can I load a CSV file using Pandas?"


Step 6: Send the Prompt and Get a Response
Objective: Send the defined prompt to the deployed model and receive a response. The response might take some time depending on the latency of the endpoint.

In [None]:
# Send the prompt to the model and get the response
response = deployed_model.predict(PROMPT)

# Print the received response
print(response)


Step 7: Extract the First Object from the Response
Objective: Extract and print the first object from the response to view the generated answer.

In [None]:
# Import pprint for easier readability of the response
from pprint import pprint

# Load the first object from the response (prediction)
output = response._prediction_response[0]

# Print the first object of the response
pprint(output)


Step 8: Extract and Print the Content from the Response
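Objective: Dig deeper into the response and retrieve the "content" key, which contains the actual answer generated by the model. Note that _prediction_response is a private attribute and may change between SDK versions; the same text is also available through the public response.text attribute.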

In [None]:
# Load the "content" key from the first object of the response
final_output = response._prediction_response[0][0]["content"]

# Print the final content from the response
print(final_output)


Step 9: Use a New Prompt with Instruction and Question
Objective: To ensure consistency with the model's training data, combine the instruction and question into a new prompt and get a response from the model.

In [None]:
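# Define the instruction that was used during training.
# Note: each trailing backslash joins the next line with no separator, so the
# sentences run together; keep this text identical to the training format.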
INSTRUCTION = """\
Please answer the following Stackoverflow question on Python.\
Answer it like you are a developer answering Stackoverflow questions.\
Question:
"""

# Define the question
QUESTION = "How can I store my TensorFlow checkpoint on Google Cloud Storage? Python example?"

# Combine the instruction and question into a single prompt
PROMPT = f"""
{INSTRUCTION} {QUESTION}
"""

# Print the combined prompt
print(PROMPT)

# Send the new prompt to the model and get a response
final_response = deployed_model.predict(PROMPT)

# Extract the "content" from the response
output = final_response._prediction_response[0][0]["content"]

# Print the new output
print(output)


Step 10: Check the Safety Attributes of the Response
Objective: Check whether the model's built-in safety mechanisms blocked the response from Step 6 (stored in response), then retrieve and print the safety attributes associated with it. To inspect the Step 9 answer instead, use final_response.

In [None]:
# Retrieve the "blocked" key from the safetyAttributes of the response
blocked = response._prediction_response[0][0]['safetyAttributes']['blocked']

# Print whether the response was blocked (True/False)
print(blocked)

# Retrieve the entire safetyAttributes section from the response
safety_attributes = response._prediction_response[0][0]['safetyAttributes']

# Print the safety attributes to check the safety scores
pprint(safety_attributes)
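
To apply custom thresholds, the per-category scores need to be paired with their category names. The sketch below assumes the safetyAttributes dictionary carries parallel "categories" and "scores" lists alongside "blocked"; confirm the shape against the pprint output above before relying on it. The helper names, the sample data, and the 0.5 threshold are hypothetical.

```python
def scores_by_category(safety_attributes: dict) -> dict:
    """Pair each category name with its score (assumes parallel lists)."""
    categories = safety_attributes.get("categories", [])
    scores = safety_attributes.get("scores", [])
    return dict(zip(categories, scores))


def flagged_categories(safety_attributes: dict, threshold: float = 0.5) -> list:
    """Return the sorted names of categories whose score reaches the threshold."""
    return sorted(
        name
        for name, score in scores_by_category(safety_attributes).items()
        if score >= threshold
    )


# Hypothetical sample in the assumed shape, not real model output
sample_attributes = {
    "blocked": False,
    "categories": ["Derogatory", "Insult", "Toxic"],
    "scores": [0.1, 0.7, 0.5],
}

print(scores_by_category(sample_attributes))
print(flagged_categories(sample_attributes))
```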


Step 11: Check the Citation Metadata
Objective: Check whether the generated response (again the one from Step 6) includes any citations, which indicates whether the text is original or derived from an existing source that should be referenced.

In [None]:
# Retrieve the "citations" key from the citationMetadata of the response
citation = response._prediction_response[0][0]['citationMetadata']['citations']

# Print the citations (if any), or an empty list if none are found
pprint(citation)
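
Direct indexing raises a KeyError if the citationMetadata or citations key is absent. A defensive variant might look like the sketch below; extract_citations and the sample dictionaries are illustrative, and the fields inside each citation entry are assumed for the example.

```python
def extract_citations(prediction: dict) -> list:
    """Return the list of citations, or an empty list when none are present."""
    metadata = prediction.get("citationMetadata") or {}
    return metadata.get("citations") or []


# Hypothetical samples, not real model output
with_citation = {
    "content": "Example answer",
    "citationMetadata": {"citations": [{"url": "https://example.com/doc"}]},
}
without_citation = {"content": "Example answer", "citationMetadata": {"citations": []}}
missing_metadata = {"content": "Example answer"}

for sample in (with_citation, without_citation, missing_metadata):
    citations = extract_citations(sample)
    print(len(citations), "citation(s)")
```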


Conclusion
This notebook showed how to load a pre-trained model from Vertex AI, send prompts to its tuned versions, and inspect safety attributes and citations. Picking a deployed endpoint at random for each request is a simple way to spread the prediction load. Checking safety attributes and citation metadata helps confirm that a response is both safe and original.