# Demo: Interacting with a Deployed LLM using Triton Client

In this demo, we will interact with the Triton Inference Server that we previously set up on an EC2 instance, which is serving a TensorRT-LLM optimized GPT-2 model.

**Our Goal:**
1.  Connect to the remote Triton server endpoint.
2.  Prepare a prompt and send it to the server.
3.  Receive the generated text from our deployed GPT-2 model.

This demonstrates a standard, decoupled MLOps architecture where applications interact with a production model via a network API.

## 2. Imports and Server Configuration

Next, we'll import the required modules and configure the connection to our Triton server. 

**IMPORTANT:** You must replace `"YOUR_EC2_PUBLIC_IP_ADDRESS"` with the actual public IP address of the EC2 instance where you deployed the Triton server.

In [None]:
# Cell 2: Imports and Configuration
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# --- CONFIGURATION ---
# Replace with the Public IP address of your EC2 instance
TRITON_SERVER_IP = "YOUR_EC2_PUBLIC_IP_ADDRESS"
TRITON_SERVER_URL = f"{TRITON_SERVER_IP}:8000"
MODEL_NAME = "gpt2" # This must match the model name in your Triton repository

print(f"Connecting to Triton server at: {TRITON_SERVER_URL}")

Connecting to Triton server at: 13.222.9.210:8000
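
Hard-coding the address works for a demo, but it is easy to forget to update. As a minimal sketch, the address could instead be read from an environment variable. The variable name `TRITON_SERVER_IP` and the helper `resolve_triton_url` are assumptions made for this illustration and are not part of `tritonclient`. The helper also strips a scheme prefix, on the assumption that the HTTP client is given a bare `host:port` string as in the cell above.

```python
import os


def resolve_triton_url(default_host: str = "localhost", port: int = 8000) -> str:
    """Build a 'host:port' string from an environment variable (hypothetical helper)."""
    # TRITON_SERVER_IP is an illustrative variable name chosen for this sketch.
    host = os.environ.get("TRITON_SERVER_IP", default_host).strip()
    # Drop any scheme prefix so only the bare host remains.
    for scheme in ("http://", "https://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    # Remove a trailing slash left over from a pasted URL, then append the port.
    return f"{host.rstrip('/')}:{port}"
```

With this in place, `TRITON_SERVER_URL = resolve_triton_url()` could replace the hard-coded assignment, and the notebook would no longer need editing when the instance's IP address changes.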


## 3. Connect to the Server and Verify Model Readiness

Before sending a prompt, it's good practice to check the connection to the server and ensure our desired model is loaded and ready to accept requests. 

The `tritonclient` library provides simple methods to do this.

In [None]:
# Cell 3: Connect to the Server and Load the Tokenizer
try:
    # Create a Triton client
    triton_client = httpclient.InferenceServerClient(url=TRITON_SERVER_URL)

    # Check if the server is live before asking about the model
    if not triton_client.is_server_live():
        print("❌ Server is not live. Check the IP address and EC2 security group settings.")
    else:
        print("✅ Server is live!")

        # Check if the model is ready (only meaningful once the server is live)
        if not triton_client.is_model_ready(MODEL_NAME):
            print(f"❌ Model '{MODEL_NAME}' is not ready. Check the Triton server logs on your EC2 instance.")
        else:
            print(f"✅ Model '{MODEL_NAME}' is ready!")

except Exception as e:
    print(f"An error occurred during client setup: {e}")

# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

✅ Server is live!
✅ Model 'gpt2' is ready!
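
A freshly started server may need some time to load the engine, so a single check can fail even though the deployment is fine. The sketch below polls the same two methods used above, `is_server_live()` and `is_model_ready()`, until both succeed or a timeout expires. The function name `wait_until_ready` and its parameters are assumptions for this illustration. The `sleep` and `clock` arguments exist only to make the loop easy to test.

```python
import time


def wait_until_ready(client, model_name, timeout_s=30.0, poll_interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll a Triton-style client until server and model are ready (hypothetical helper).

    `client` only needs `is_server_live()` and `is_model_ready(name)` methods.
    Returns True once both checks pass, or False if the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if client.is_server_live() and client.is_model_ready(model_name):
                return True
        except Exception:
            # A connection error is treated the same as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

In the notebook this could be called as `wait_until_ready(triton_client, MODEL_NAME)` before sending the first prompt.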


## 4. Define the Inference Function

This function encapsulates the logic for interacting with the Triton server. It handles the complete workflow:
1.  **Tokenization**: Converts the input text prompt into integer token IDs using the Hugging Face tokenizer.
2.  **Prepare Inputs**: Packages the token IDs and other parameters (like desired output length) into the specific `InferInput` format that Triton requires. 
    * The names of these inputs (`"input_ids"`, `"request_output_len"`, etc.) must exactly match what is defined in the `config.pbtxt` on the server.
3.  **Prepare Outputs**: Specifies which output tensor we want the server to return.
4.  **Inference Call**: Sends the request to the server.
5.  **Decode Response**: Receives the resulting token IDs and decodes them back into human-readable text.
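
Because a mismatched tensor name only surfaces as an error at inference time, it can help to compare the names the client intends to use against what the server reports. The sketch below assumes the metadata is a dictionary with `"inputs"` and `"outputs"` lists whose entries carry a `"name"` key, which is the layout returned by the HTTP client's `get_model_metadata()`. The helper name `missing_tensor_names` is made up for this illustration.

```python
def missing_tensor_names(metadata, expected_inputs, expected_outputs):
    """Return the expected tensor names that the model metadata does not declare.

    Hypothetical helper: `metadata` is assumed to look like
    {"inputs": [{"name": ...}, ...], "outputs": [{"name": ...}, ...]}.
    """
    declared_inputs = {tensor["name"] for tensor in metadata.get("inputs", [])}
    declared_outputs = {tensor["name"] for tensor in metadata.get("outputs", [])}
    return {
        "inputs": sorted(set(expected_inputs) - declared_inputs),
        "outputs": sorted(set(expected_outputs) - declared_outputs),
    }


# Possible usage against a live server (requires the client from the earlier cell):
#   metadata = triton_client.get_model_metadata(MODEL_NAME)
#   print(missing_tensor_names(
#       metadata,
#       ["input_ids", "input_lengths", "request_output_len", "end_id", "pad_id"],
#       ["output_ids"],
#   ))
```

Two empty lists in the result would indicate that every name used by the client is declared by the deployed model.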

In [None]:
# Cell 4: Define the Inference Function
import numpy as np

def call_triton_gpt2(prompt: str, max_output_len: int = 50):
    """
    Sends a prompt to the Triton server and returns the decoded text
    (prompt included), or None if the request fails.
    """
    print(f"Sending prompt: '{prompt}'")

    # 1. Tokenize the input prompt and prepare the other parameters as numpy arrays
    input_ids = tokenizer.encode(prompt, return_tensors="np").astype(np.int32)
    input_len = np.array([[input_ids.shape[1]]], dtype=np.int32)
    request_output_len = np.array([[max_output_len]], dtype=np.int32)
    # GPT-2 has no dedicated padding token, so the EOS token is reused for padding
    end_id = np.array([[tokenizer.eos_token_id]], dtype=np.int32)
    pad_id = np.array([[tokenizer.eos_token_id]], dtype=np.int32)

    # 2. Prepare Triton Inputs
    # The names of these inputs MUST match the 'input' blocks in the server's config.pbtxt

    inputs = [
        httpclient.InferInput("input_ids", input_ids.shape, "INT32"),
        httpclient.InferInput("input_lengths", input_len.shape, "INT32"),
        httpclient.InferInput("request_output_len", request_output_len.shape, "INT32"),
        httpclient.InferInput("end_id", end_id.shape, "INT32"),
        httpclient.InferInput("pad_id", pad_id.shape, "INT32"),
    ]

    # Set the data for each input
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(input_len)
    inputs[2].set_data_from_numpy(request_output_len)
    inputs[3].set_data_from_numpy(end_id)
    inputs[4].set_data_from_numpy(pad_id)

    # 3. Prepare Triton Outputs
    # The name MUST match the 'output' block in the server's config.pbtxt
    outputs = [
        httpclient.InferRequestedOutput("output_ids")
    ]

    # 4. Send the inference request
    try:
        response = triton_client.infer(model_name=MODEL_NAME, inputs=inputs, outputs=outputs)

        # 5. Decode the response
        output_ids = response.as_numpy("output_ids")
        # The returned token IDs include the prompt tokens, so the decoded text starts with the prompt
        generated_text = tokenizer.decode(output_ids[0])

        return generated_text

    except Exception as e:
        print(f"An error occurred during inference: {e}")
        return None
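
Since the returned sequence begins with the prompt tokens, an application that only wants the newly generated text has to slice them off. The sketch below shows one way to do that on a flat list of token IDs. It assumes the sequence is one-dimensional, starts with exactly `prompt_len` prompt tokens, and may be followed by end-of-sequence tokens. The helper name `extract_completion_ids` is invented for this illustration.

```python
def extract_completion_ids(output_ids, prompt_len, end_id):
    """Return only the generated token IDs from a flat output sequence.

    Hypothetical helper: assumes `output_ids` is a 1-D sequence that starts
    with `prompt_len` prompt tokens and may end with `end_id` tokens.
    """
    # Skip the echoed prompt tokens.
    completion = list(output_ids[prompt_len:])
    # Cut at the first end-of-sequence token, if one is present.
    if end_id in completion:
        completion = completion[:completion.index(end_id)]
    return completion


# Possible usage inside the function above, if output_ids[0] is one-dimensional:
#   new_ids = extract_completion_ids(
#       output_ids[0].tolist(), input_ids.shape[1], tokenizer.eos_token_id)
#   completion_text = tokenizer.decode(new_ids)
```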

## 5. Run the Demo!

Now, let's call our function with a sample prompt and see the response from our deployed model.

In [None]:
my_prompt = "The best thing about AI is"
generated_output = call_triton_gpt2(my_prompt)

if generated_output:
    print("\n--- Generated Output ---")
    print(generated_output)
    print("------------------------")

Sending prompt: 'The best thing about AI is'

--- Generated Output ---
The best thing about AI is that it's not just a tool for humans to do things
------------------------


## 6. Conclusion

**Success!**

We have successfully sent a request from this notebook to our Triton server running on a separate EC2 instance and received a valid completion from the TensorRT-LLM optimized GPT-2 model.

This two-part demo showcases a realistic and powerful pattern for production machine learning:
- **Infrastructure as a Service (IaaS):** Using EC2 for dedicated, high-performance model serving.
- **Platform as a Service (PaaS):** Using a notebook workspace for development and client-side logic.
- **Decoupled Architecture:** The model server and the client application are independent, communicating over a standard network API. This allows teams to work in parallel and provides scalability and flexibility.
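
To make the "standard network API" concrete: Triton's HTTP endpoint follows the KServe v2 inference protocol, so any client that can send JSON can call the model without `tritonclient`. The sketch below builds the request body for the same five inputs used in this notebook. The function name `build_infer_request` and the token IDs in the example are placeholders for illustration, and the tensor names assume the same `config.pbtxt` as above.

```python
import json


def _int32_tensor(name, shape, flat_values):
    """Describe one INT32 tensor in KServe v2 JSON form (data flattened row-major)."""
    return {"name": name, "shape": list(shape), "datatype": "INT32", "data": list(flat_values)}


def build_infer_request(token_ids, max_output_len, end_id, pad_id):
    """Build a JSON-serializable inference request body (hypothetical helper)."""
    token_ids = list(token_ids)
    return {
        "inputs": [
            _int32_tensor("input_ids", (1, len(token_ids)), token_ids),
            _int32_tensor("input_lengths", (1, 1), [len(token_ids)]),
            _int32_tensor("request_output_len", (1, 1), [max_output_len]),
            _int32_tensor("end_id", (1, 1), [end_id]),
            _int32_tensor("pad_id", (1, 1), [pad_id]),
        ],
        "outputs": [{"name": "output_ids"}],
    }


# Placeholder token IDs; real ones would come from the tokenizer.
body = json.dumps(build_infer_request([101, 102, 103], 50, 50256, 50256))
# The body would be sent as an HTTP POST to:
#   http://<server-ip>:8000/v2/models/gpt2/infer
```

A service written in another language could produce the same JSON, which is what keeps the client and the model server independent of each other.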