# Model Deployment and Inference using SageMaker Inference Component

This notebook demonstrates how to deploy and run inference on both pre-trained and fine-tuned Qwen3 models using Amazon SageMaker Inference Components. We'll compare the performance of both models on Chain-of-Thought reasoning tasks to evaluate the effectiveness of our fine-tuning process.

## What This Notebook Covers

- **SageMaker Endpoint Configuration**: Setting up endpoints for model deployment
- **Inference Component Management**: Creating and managing inference components for multiple models
- **Model Comparison**: Side-by-side evaluation of pre-trained vs fine-tuned models
- **Chain-of-Thought Testing**: Comprehensive testing of reasoning capabilities
- **Resource Management**: Proper cleanup of deployed resources

## Tested Instance Types

The model deployment has been tested and verified on the following instances:
- ml.g5.2xlarge (1x A10G GPU)
- ml.g5.4xlarge (1x A10G GPU)
- ml.g5.12xlarge (4x A10G GPU)

For detailed pricing information, visit: [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)

## Configuration Setup

First, we'll restore parameters from previous notebooks and set up the deployment environment.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%store -r

In [None]:
print(f"registered_model : {registered_model}")
print(f"compressed_model_path : {compressed_model_path}")

## Getting Inference Container Images

We'll configure the appropriate container image for hosting our Hugging Face models. The Text Generation Inference (TGI) container provides optimized inference performance for large language models.

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

sagemaker_client = sess.sagemaker_client
sagemaker_runtime_client = sess.sagemaker_runtime_client


print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# # retrieve the llm image uri
# llm_image = get_huggingface_llm_image_uri(
#   "huggingface",
#   session=sess,
#   version="3.0.1",
# )

# Use a pinned TGI container image. This URI points at the us-west-2 registry;
# adjust it if your SageMaker session runs in a different region.
llm_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.6.0-tgi3.2.3-gpu-py311-cu124-ubuntu22.04-v2.0"

# print ecr image uri
print(f"llm image uri: {llm_image}")

## Creating SageMaker Endpoint

This section covers the complete endpoint setup process:
- Creating EndpointConfiguration with appropriate instance types
- Setting up auto-scaling and routing configuration
- Creating the actual Endpoint for model hosting

### Defining the Instance Type

Choose the appropriate instance type based on your performance requirements and budget constraints. The GPU count is automatically configured based on the selected instance type.
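
The instance-to-GPU mapping can also be written as a lookup table, which is easier to extend than a chain of `if`/`elif` branches. This is a minimal sketch that covers only the instance types mentioned in this notebook; `GPU_COUNT_BY_INSTANCE` and `gpu_count_for` are illustrative names, not part of the SageMaker SDK.

```python
# Illustrative lookup table (hypothetical helper, not a SageMaker API).
# It lists only the instance types named in this notebook.
GPU_COUNT_BY_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.4xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.p4d.24xlarge": 8,
}


def gpu_count_for(instance_type: str, default: int = 1) -> int:
    """Return the GPU count for an instance type, falling back to `default`."""
    return GPU_COUNT_BY_INSTANCE.get(instance_type, default)


print(gpu_count_for("ml.g5.12xlarge"))
```

Unknown instance types fall back to a single GPU here, which mirrors the `else` branch in the cell below.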

In [None]:
instance_type = "ml.g5.2xlarge"
# instance_type = "ml.g5.4xlarge"
# instance_type = "ml.g5.xlarge"

# Set the GPU count for the selected instance type
if instance_type == "ml.p4d.24xlarge":
    num_gpus = 8
elif instance_type == "ml.g5.12xlarge":
    num_gpus = 4
else:
    # Single-GPU instances such as ml.g5.2xlarge and ml.g5.4xlarge
    num_gpus = 1

print(f"{instance_type} selected with {num_gpus} GPU(s)")

### Setting the Endpoint Configuration

Configure endpoint parameters including timeouts, scaling settings, and routing strategies to ensure optimal performance and reliability.

In [None]:
import time

currentTime = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
print("The current time is", currentTime)

# Set a unique endpoint config name
endpoint_config_name = f"{registered_model}-config-{currentTime}" 
print(f"Endpoint config name: {endpoint_config_name}")


# Set the variant name and the timeouts used for hosting
variant_name = "AllTraffic"
model_data_download_timeout_in_seconds = 600
container_startup_health_check_timeout_in_seconds = 600

initial_instance_count = 1
max_instance_count = 1
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

### Creating SageMaker Endpoint Configuration

This configuration defines how your models will be deployed, including instance specifications, auto-scaling settings, and routing policies.
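
To make the shape of a production variant easier to see, the sketch below assembles the same dictionary in a small pure function and checks that the scaling bounds are consistent before anything is sent to AWS. `build_production_variant` is a hypothetical helper introduced for illustration; the dictionary keys are the ones used in the next cell.

```python
def build_production_variant(
    variant_name: str,
    instance_type: str,
    min_instances: int,
    max_instances: int,
    timeout_seconds: int = 600,
) -> dict:
    """Assemble one ProductionVariants entry (hypothetical helper)."""
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    return {
        "VariantName": variant_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": min_instances,
        "ModelDataDownloadTimeoutInSeconds": timeout_seconds,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_seconds,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": min_instances,
            "MaxInstanceCount": max_instances,
        },
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }


variant = build_production_variant("AllTraffic", "ml.g5.2xlarge", 1, 1)
print(variant["ManagedInstanceScaling"])
```

The returned dictionary could be passed as the single element of `ProductionVariants`; the function itself makes no AWS calls.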

In [None]:
epc_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ]
)

### Creating Endpoint

The helper below creates the endpoint, waits until it is in service (up to `max_wait_time` seconds), and logs any error before re-raising it.
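
Waiting for an endpoint comes down to polling a status with a deadline. The sketch below shows that pattern in isolation, using a fake status source so it runs without AWS access. `poll_until` and its parameters are illustrative names, not SageMaker SDK functions.

```python
import time


def poll_until(get_status, done_states, failed_states,
               timeout_s=3600, interval_s=30, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"resource entered state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status source instead of a real endpoint.
fake_statuses = iter(["Creating", "Creating", "InService"])
result = poll_until(lambda: next(fake_statuses), {"InService"}, {"Failed"},
                    sleep=lambda _: None)
print(result)
```

Against a real endpoint, `get_status` could wrap a `describe_endpoint` call and return its `EndpointStatus` field.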

In [None]:
import time
import logging
from botocore.exceptions import ClientError, WaiterError

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def create_and_wait_for_endpoint(sagemaker_client, sess, endpoint_name, endpoint_config_name, max_wait_time=3600, check_interval=30):
    try:
        # Create the endpoint
        ep_response = sagemaker_client.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
        logging.info(f"Creating endpoint: {endpoint_name}")
        
        # Wait for the endpoint to be created
        start_time = time.time()
        while True:
            try:
                sess.wait_for_endpoint(endpoint_name, poll=check_interval)
                logging.info(f"Endpoint {endpoint_name} is now in service")
                break
            except WaiterError as e:
                if "Max attempts exceeded" in str(e):
                    current_time = time.time()
                    if current_time - start_time > max_wait_time:
                        logging.error(f"Endpoint creation timed out after {max_wait_time} seconds")
                        raise TimeoutError(f"Endpoint creation timed out after {max_wait_time} seconds")
                    else:
                        logging.info("Endpoint is still being created. Continuing to wait...")
                else:
                    raise
        
        return ep_response
    
    except ClientError as e:
        logging.error(f"Error creating endpoint: {e}")
        raise
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
        raise

In [None]:
# Usage
try:
    endpoint_name = f"{registered_model}-endpoint-{currentTime}"
    logging.info(f"Endpoint name: {endpoint_name}")
    
    ep_response = create_and_wait_for_endpoint(sagemaker_client, sess, endpoint_name, endpoint_config_name)
    logging.info("Endpoint created successfully")
except Exception as e:
    logging.error(f"Failed to create endpoint: {e}")

## SageMaker Inference Component Creation and Inference Execution

This section demonstrates how to create inference components for both pre-trained and fine-tuned models, then run comprehensive testing to compare their Chain-of-Thought reasoning capabilities.

### Defining SageMaker Model

First, let's verify that our compressed model artifacts are available in S3.

In [None]:
!aws s3 ls $compressed_model_path/finetuned/

In [None]:
!aws s3 ls $compressed_model_path/pretrained/

### Test Set and Model Configuration

We'll set up comprehensive test prompts covering various domains to evaluate the Chain-of-Thought reasoning capabilities of both models. The test suite includes questions about tourism, culinary arts, philosophy, and AI behavior.
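
Each test prompt is substituted into the single `{}` placeholder of the prompt template with `str.format`. The sketch below illustrates this with a shortened stand-in template; the template defined in the next cell is longer.

```python
# Shortened stand-in for the notebook's template (the real one is longer).
template = "<question>\n{}\n</question>\nAnswer below:\n<think>\n"

prompts = [
    "How many Michelin 3-star restaurants are there worldwide?",
    "Please explain the Analects of Confucius",
]

# str.format replaces the single positional placeholder with the question.
filled = [template.format(p) for p in prompts]
print(filled[0])
```

Because `str.format` treats braces as placeholders, any literal `{` or `}` added to the template later would need to be doubled (`{{`, `}}`).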

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import logging
import os
import json
import pandas as pd
from IPython.display import display, HTML

# Create results storage directory
results_dir = "inference_results"
os.makedirs(results_dir, exist_ok=True)

# Test prompts covering various domains and complexity levels
user_prompts = [
    'Please provide detailed information about famous tourist routes in Seoul',
    'Please provide detailed information about famous tourist routes in Goseong, Gyeongsangnam-do',
    'How many Michelin 3-star restaurants are there worldwide?',
    'Tell me about famous chefs in Korea',
    'Tell me about famous chefs in America',
    'What is the extent of your knowledge cutoff date?',
    'Can you respond to inappropriate or sexual jokes?',
    'Please explain the Analects of Confucius',
    'You are male. Nurses are generally female. Can we say that all nurses are female?'
]

# Chain-of-Thought inference prompt template
inference_prompt_style = """You are an AI Assistant with advanced knowledge in reasoning, analysis, and problem-solving.
Provide the most appropriate answer to the <question>. Before presenting your <final> answer, develop a step-by-step thought process (chain of thoughts) to perform logical and accurate analysis of the <question>.

<question>
{}
</question>
### Guidelines:
- Skip unnecessary greetings or preambles, and start directly with <response>
- Do not repeat the question and answer
- Write the step-by-step thought process in sufficient detail, but keep the final answer concise

### Response Format:
<think>
    ### THINKING
    [Provide detailed step-by-step reasoning process here. Analyze the problem, consider possible approaches, and use logical reasoning to reach a conclusion.]
</think>
<final>
    ### FINAL-ANSWER
    [Present the conclusion derived from THINKING as a concise and clear final answer.]
</final>

Answer below:
<think>
"""


# inference_prompt_style.format(question) + tokenizer.eos_token

# Define model types for comparison
model_types = ['pretrained', 'finetuned']
model_results = {}

# Create SageMaker Runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')

# Common configuration for both models
common_config = {
    "HF_MODEL_ID": "/opt/ml/model",
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "SM_NUM_GPUS": "1"  # Use 1 GPU
}

### SageMaker Inference Component Creation and Test Data Response Verification

This comprehensive testing process will:
1. Create inference components for both pre-trained and fine-tuned models
2. Run all test prompts through both models
3. Collect performance metrics and response quality data
4. Save results for comparative analysis
5. Automatically clean up resources after testing
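
The response-processing part of step 3 can be sketched on its own: split a generated response into its reasoning and final answer using the `### THINKING` and `### FINAL-ANSWER` markers from the prompt template. `split_response` is a hypothetical helper, and stripping the `<think>`/`<final>` tags at this stage is an assumption of this sketch rather than what the notebook cell does.

```python
import re

# \Z anchors at the very end of the string; DOTALL lets "." span newlines.
THINK_RE = re.compile(r"### THINKING\s*(.*?)(?=### FINAL-ANSWER|\Z)", re.DOTALL)
FINAL_RE = re.compile(r"### FINAL-ANSWER\s*(.*)\Z", re.DOTALL)
TAG_RE = re.compile(r"</?(?:think|final)>")


def split_response(text: str) -> dict:
    """Split a response into thinking and final parts, or fall back to full text."""
    think = THINK_RE.search(text)
    final = FINAL_RE.search(text)
    if think and final:
        return {
            "thinking": TAG_RE.sub("", think.group(1)).strip(),
            "final": TAG_RE.sub("", final.group(1)).strip(),
            "method": "keywords",
        }
    return {"thinking": "", "final": text.strip(), "method": "full_text"}


# Hand-written sample text, not real model output.
sample = (
    "<think>\n### THINKING\nStep 1. Step 2.\n</think>\n"
    "<final>\n### FINAL-ANSWER\nForty-two.\n</final>"
)
parsed = split_response(sample)
print(parsed)
```

When either marker is missing, the whole response is kept as the final answer so that no output is silently dropped.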

In [None]:
import time
import boto3
import re
import json
import os
from IPython.display import display, HTML
import pandas as pd
import matplotlib.pyplot as plt

# SageMaker client setup
sagemaker_client = boto3.client('sagemaker')

model_name_list = []
# Process each model type sequentially
for model_type in model_types:
    print(f"\n\n===== PROCESSING {model_type.upper()} MODEL =====")
    
    # Create model
    model_name = f"{registered_model}-{time.strftime('%Y-%m-%d-%H-%M-%S')}-{model_type}"
    ic_name = f"IC-{model_name}"  # Initialize variable
    
    model_name_list.append(model_name) ## Use for deletion
    
    print(f"Creating model: {model_name}")
    
    try:
        # Create HuggingFace model
        llm_model = HuggingFaceModel(
            role=role,
            name=model_name,
            model_data=f"{compressed_model_path}/{model_type}/model.tar.gz",
            image_uri=llm_image,
            env=common_config
        )
        llm_model.create()
        print(f"Model {model_name} created successfully")

        # Create Inference Component
        print(f"Creating inference component: {ic_name}")
        
        # Attempt to delete existing Inference Component
        try:
            sagemaker_client.delete_inference_component(InferenceComponentName=ic_name)
            print(f"Deleted existing inference component: {ic_name}")
        except Exception as e:
            if 'ResourceNotFoundException' in str(e):
                print(f"Inference component {ic_name} does not exist. Skipping deletion.")
            else:
                print(f"Error deleting inference component {ic_name}: {e}")
        
        # Create new Inference Component
        spec = {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": 1,  # Use 1 GPU
                "NumberOfCpuCoresRequired": 4,
                "MinMemoryRequiredInMb": 8192,
            },
        }
        
        ic_response = sagemaker_client.create_inference_component(
            InferenceComponentName=ic_name,
            EndpointName=endpoint_name,
            VariantName=variant_name,
            Specification=spec,
            RuntimeConfig={"CopyCount": 1}
        )
        print(f"Created inference component: {ic_name}")
        
        # Wait for Inference Component to become InService
        print(f"Waiting for {ic_name} to be in service...")
        max_attempts = 30
        sleep_time = 20
        
        for attempt in range(max_attempts):
            desc = sagemaker_client.describe_inference_component(
                InferenceComponentName=ic_name
            )
            status = desc['InferenceComponentStatus']
            print(f"{ic_name}: {status}")
            
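            if status == 'InService':
                print(f"{ic_name} is now in service.")
                break
            elif status == 'Failed':
                # Stop here instead of sending requests to a failed component
                raise RuntimeError(
                    f"{ic_name} failed: {desc.get('FailureReason', 'no reason reported')}"
                )
            
            print(f"Attempt {attempt+1}/{max_attempts}. Waiting for {sleep_time} seconds...")
            time.sleep(sleep_time)
        else:
            # Loop ended without a break: the component never reached InService
            raise TimeoutError(f"{ic_name} not in service after {max_attempts * sleep_time} seconds")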
        
        # Run model inference and save results
        print(f"\nRunning inference with {model_type} model...")
        model_results = {"times": [], "answers": [], "full_responses": []}
        
        for prompt_index, user_prompt in enumerate(user_prompts):
            print(f"\nTesting prompt {prompt_index + 1}: {user_prompt}")
        
            # Create inference request
            request_body = {
                "inputs": inference_prompt_style.format(user_prompt),
                "parameters": {
                    "max_new_tokens": 2048,
                    "top_p": 0.9,
                    "temperature": 0.01,
                    "use_cache": True,
                    "stop": ["<|im_end|>"]
                }
            }

            start_time = time.time()
            try:
                # Execute inference request
                response = sagemaker_runtime.invoke_endpoint(
                    EndpointName=endpoint_name,
                    InferenceComponentName=ic_name,
                    ContentType='application/json',
                    Body=json.dumps(request_body)
                )
                
                # Process response
                result = response['Body'].read().decode('utf-8')
                parsed_data = json.loads(result)
                answer = parsed_data[0] if isinstance(parsed_data, list) else parsed_data
                generated_text = answer['generated_text']
                
                # Extract text after "Answer below:"
                answer_marker = "Answer below:"
                answer_start = generated_text.find(answer_marker)
                if answer_start >= 0:
                    actual_response = generated_text[answer_start + len(answer_marker):].strip()
                else:
                    actual_response = generated_text.strip()
                
                # Extract thinking process and final answer from response
                thinking_part = ""
                final_part = ""
                
                # Try keyword-based extraction; \Z anchors at the end of the response
                think_pattern = re.compile(r'### THINKING\s*(.*?)(?=### FINAL-ANSWER|\Z)', re.DOTALL)
                final_pattern = re.compile(r'### FINAL-ANSWER\s*(.*)\Z', re.DOTALL)
                
                think_match = think_pattern.search(actual_response)
                final_match = final_pattern.search(actual_response)
                
                if think_match and final_match:
                    thinking_part = think_match.group(1).strip()
                    final_part = final_match.group(1).strip()
                    extraction_method = "keywords"
                else:
                    # Use full text
                    final_part = actual_response.strip()
                    thinking_part = "Thinking process not clearly distinguished."
                    extraction_method = "full_text"
                
                elapsed_time = time.time() - start_time
                
                # Save results
                result_entry = {
                    "prompt": user_prompt,
                    "thinking_process": thinking_part,
                    "final_answer": final_part,
                    "elapsed_time": round(elapsed_time, 3),
                    "extraction_method": extraction_method
                }
                
                model_results["times"].append(elapsed_time)
                model_results["answers"].append(result_entry)
                model_results["full_responses"].append(actual_response)  # Store actual response only
                
                print(f"Elapsed time: {round(elapsed_time, 3)} seconds")
                print(f"Final answer:\n{final_part}")
                
            except Exception as e:
                print(f"Error processing request: {str(e)}")
                
                model_results["times"].append(None)
                model_results["answers"].append({
                    "prompt": user_prompt,
                    "thinking_process": f"Error: {str(e)}",
                    "final_answer": f"Error: {str(e)}",
                    "elapsed_time": None,
                    "extraction_method": "error"
                })
                model_results["full_responses"].append(f"Error: {str(e)}")
        
        # Save results to file
        result_file = f"{results_dir}/{model_type}_results.json"
        with open(result_file, "w", encoding='utf-8') as f:
            json.dump(model_results, f, indent=2, ensure_ascii=False)
        print(f"\nResults for {model_type} model saved to {result_file}")
        
    except Exception as e:
        print(f"Error processing model {model_type}: {str(e)}")
    finally:
        # Always delete Inference Component to clean up resources
        if 'ic_name' in locals():  # Check if ic_name is defined
            try:
                sagemaker_client.delete_inference_component(InferenceComponentName=ic_name)
                print(f"Deleted inference component: {ic_name}")
            except Exception as e:
                print(f"Error deleting inference component {ic_name}: {e}")

### Comparison of Pre-trained and Fine-tuned Model Results

This section loads the saved results for both models and displays their final answers side by side, so you can judge how Chain-of-Thought fine-tuning changed the responses.
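
Beyond the side-by-side answers, the `times` list stored in each results file allows a quick latency comparison. The sketch below computes a mean latency and a failure count from a results dictionary of that shape; `summarize_times` is a hypothetical helper, and the example values are synthetic placeholders, not measurements.

```python
from statistics import mean


def summarize_times(results: dict) -> dict:
    """Mean latency and failure count for one model's results dict."""
    # Failed requests are stored as None (null in the JSON file).
    times = [t for t in results["times"] if t is not None]
    return {
        "requests": len(results["times"]),
        "failed": len(results["times"]) - len(times),
        "mean_seconds": round(mean(times), 3) if times else None,
    }


# Synthetic placeholder values for illustration only, not measured results.
example = {"times": [2.0, 4.0, None]}
summary = summarize_times(example)
print(summary)
```

Applied to the loaded results, this would be called once per model type, for example on `all_results["pretrained"]` and `all_results["finetuned"]`.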

In [None]:
# Result comparison and visualization
print("\n\n===== COMPARING RESULTS =====")

# Read saved result files
all_results = {}
for model_type in model_types:
    result_file = f"{results_dir}/{model_type}_results.json"
    try:
        with open(result_file, "r", encoding='utf-8') as f:
            all_results[model_type] = json.load(f)
        print(f"Loaded results for {model_type} model from {result_file}")
    except Exception as e:
        print(f"Error loading results for {model_type} model: {e}")

# First create summary table for all questions
display(HTML("<h2>Summary of Results for All Questions</h2>"))

# Create HTML table directly
html_table = """
<table style="width:100%; border-collapse:collapse; table-layout:fixed;">
  <thead>
    <tr style="background-color:#f2f2f2;">
      <th style="width:2%; text-align:center; padding:8px; border:1px solid #ddd;">No.</th>
      <th style="width:20%; text-align:left; padding:8px; border:1px solid #ddd;">Question</th>
      <th style="width:39%; text-align:left; padding:8px; border:1px solid #ddd;">Pre-trained Model Answer</th>
      <th style="width:39%; text-align:left; padding:8px; border:1px solid #ddd;">Fine-tuned Model Answer</th>
    </tr>
  </thead>
  <tbody>
"""

# Organize model answers for each question
for prompt_index, prompt in enumerate(user_prompts):
    pretrained_answer = ""
    finetuned_answer = ""
    
    # Extract pre-trained model answer
    if "pretrained" in all_results and prompt_index < len(all_results["pretrained"]["answers"]):
        answer = all_results["pretrained"]["answers"][prompt_index]
        final_answer = answer.get('final_answer', 'N/A')
        
        try:
            # Remove tags (all HTML tag formats)
            final_answer = re.sub(r'<[^>]+>', '', final_answer)
            
            # Add line breaks around "THINKING" keyword (using HTML tags)
            final_answer = re.sub(r'THINKING', '<strong>THINKING</strong><br><br>', final_answer)
            
            # Add line breaks around "FINAL-ANSWER" keyword (using HTML tags)
            final_answer = re.sub(r'FINAL-ANSWER', '<br><br><strong>FINAL-ANSWER</strong><br><br>', final_answer)

            # Markdown processing
            # 1. Remove markdown headers
            final_answer = re.sub(r'#+\s+', '', final_answer)
            # 2. Remove markdown bold
            final_answer = re.sub(r'\*\*(.*?)\*\*', r'\1', final_answer)
            # 3. Remove markdown italic
            final_answer = re.sub(r'\*(.*?)\*', r'\1', final_answer)
            # 4. Remove markdown links - keep only text from [text](link) format
            final_answer = re.sub(r'\[(.*?)\]\((.*?)\)', r'\1', final_answer)
            # 5. Remove markdown list markers
            final_answer = re.sub(r'^\s*[-*+]\s+', '', final_answer, flags=re.MULTILINE)
            # 6. Remove numbered list markers
            final_answer = re.sub(r'^\s*\d+\.\s+', '', final_answer, flags=re.MULTILINE)
            # 7. Remove markdown code blocks
            final_answer = re.sub(r'```.*?```', '', final_answer, flags=re.DOTALL)
            # 8. Remove inline code
            final_answer = re.sub(r'`(.*?)`', r'\1', final_answer)
            
            # Convert line breaks to spaces (except HTML tags)
            final_answer = re.sub(r'(?!<br>)\n', ' ', final_answer).strip()
            
            # Reduce consecutive spaces to single space
            final_answer = re.sub(r'\s+', ' ', final_answer)
            
            pretrained_answer = final_answer
        except Exception as e:
            print(f"Error processing pre-trained model markdown: {e}")
            pretrained_answer = "Error occurred during processing"
    else:
        pretrained_answer = "No result"
    
    # Extract fine-tuned model answer
    if "finetuned" in all_results and prompt_index < len(all_results["finetuned"]["answers"]):
        answer = all_results["finetuned"]["answers"][prompt_index]
        final_answer = answer.get('final_answer', 'N/A')
        
        try:
            # Remove tags (all HTML tag formats)
            final_answer = re.sub(r'<[^>]+>', '', final_answer)
            
            # Add line breaks around "THINKING" keyword (using HTML tags)
            final_answer = re.sub(r'THINKING', '<strong>THINKING</strong><br><br>', final_answer)
            
            # Add line breaks around "FINAL-ANSWER" keyword (using HTML tags)
            final_answer = re.sub(r'FINAL-ANSWER', '<br><br><strong>FINAL-ANSWER</strong><br><br>', final_answer)

            # Markdown processing
            # 1. Remove markdown headers
            final_answer = re.sub(r'#+\s+', '', final_answer)
            # 2. Remove markdown bold
            final_answer = re.sub(r'\*\*(.*?)\*\*', r'\1', final_answer)
            # 3. Remove markdown italic
            final_answer = re.sub(r'\*(.*?)\*', r'\1', final_answer)
            # 4. Remove markdown links - keep only text from [text](link) format
            final_answer = re.sub(r'\[(.*?)\]\((.*?)\)', r'\1', final_answer)
            # 5. Remove markdown list markers
            final_answer = re.sub(r'^\s*[-*+]\s+', '', final_answer, flags=re.MULTILINE)
            # 6. Remove numbered list markers
            final_answer = re.sub(r'^\s*\d+\.\s+', '', final_answer, flags=re.MULTILINE)
            # 7. Remove markdown code blocks
            final_answer = re.sub(r'```.*?```', '', final_answer, flags=re.DOTALL)
            # 8. Remove inline code
            final_answer = re.sub(r'`(.*?)`', r'\1', final_answer)
            
            # Convert line breaks to spaces (except HTML tags)
            final_answer = re.sub(r'(?!<br>)\n', ' ', final_answer).strip()
            
            # Reduce consecutive spaces to single space
            final_answer = re.sub(r'\s+', ' ', final_answer)
            
            finetuned_answer = final_answer
        except Exception as e:
            print(f"Error processing fine-tuned model markdown: {e}")
            finetuned_answer = "Error occurred during processing"
    else:
        finetuned_answer = "No result"
    
    # Add table row
    html_table += f"""
    <tr>
      <td style="text-align:center; padding:8px; border:1px solid #ddd; vertical-align:top;">{prompt_index + 1}</td>
      <td style="text-align:left; padding:8px; border:1px solid #ddd; vertical-align:top; word-wrap:break-word;">{prompt[:50] + "..." if len(prompt) > 50 else prompt}</td>
      <td style="text-align:left; padding:8px; border:1px solid #ddd; vertical-align:top; word-wrap:break-word;">{pretrained_answer}</td>
      <td style="text-align:left; padding:8px; border:1px solid #ddd; vertical-align:top; word-wrap:break-word;">{finetuned_answer}</td>
    </tr>
    """

html_table += """
  </tbody>
</table>
"""

display(HTML(html_table))

# # Display detailed results for each question
# display(HTML("<h2>Detailed Results by Question</h2>"))

# for prompt_index, prompt in enumerate(user_prompts):
#     display(HTML(f"<h3>Question {prompt_index + 1}</h3>"))
#     display(HTML(f"<p><b>{prompt}</b></p>"))
    
#     # Display full responses from each model
#     for model_type in model_types:
#         if model_type in all_results and prompt_index < len(all_results[model_type]["full_responses"]):
#             full_response = all_results[model_type]["full_responses"][prompt_index]
            
#             try:
#                 # Remove tags (all HTML tag formats)
#                 clean_response = re.sub(r'<[^>]+>', '', full_response)
                
#                 # Convert "THINKING" keyword to emphasis
#                 clean_response = re.sub(r'THINKING', 'Thinking Process:', clean_response)
                
#                 # Convert "FINAL-ANSWER" keyword to emphasis and add line breaks
#                 clean_response = re.sub(r'FINAL-ANSWER', '\n\nFinal Answer:\n', clean_response)
                
#                 # Convert markdown bold to HTML
#                 clean_response = re.sub(r'\*\*(.*?)\*\*', r'<strong>\1</strong>', clean_response)
                
#                 # Convert markdown italic to HTML
#                 clean_response = re.sub(r'\*(.*?)\*', r'<em>\1</em>', clean_response)
                
#                 # Reduce consecutive empty lines to single line
#                 clean_response = re.sub(r'\n\s*\n\s*\n', '\n\n', clean_response)
                
#                 # Remove empty lines before headers
#                 clean_response = re.sub(r'\n\s*\n(#+\s+|Thinking Process:|Final Answer:)', r'\n\1', clean_response)
#                 clean_response = re.sub(r'^(\s*\n)+(#+\s+|Thinking Process:|Final Answer:)', r'\2', clean_response)
                
#                 # Remove leading whitespace from each line
#                 clean_lines = []
#                 for line in clean_response.split('\n'):
#                     clean_lines.append(line.lstrip())
#                 clean_response = '\n'.join(clean_lines)
                
#             except Exception as e:
#                 print(f"Error processing full response: {e}")
#                 clean_response = full_response  # Use original text if error occurs
                
#                 # Remove leading whitespace even if error occurred
#                 clean_lines = []
#                 for line in clean_response.split('\n'):
#                     clean_lines.append(line.lstrip())
#                 clean_response = '\n'.join(clean_lines)
            
#             display(HTML(f"<h4>{model_type} Model</h4>"))
            
#             # Display full response
#             display(HTML(f"""
#             <div style="border: 1px solid #ddd; padding: 10px; margin-bottom: 20px;">
#                 <pre style="white-space: pre-wrap; margin: 0;">{clean_response}</pre>
#             </div>
#             """))
    
#     display(HTML("<hr style='margin: 30px 0;'>"))

## Resource Cleanup

Proper resource management is crucial to avoid unnecessary costs. This section provides systematic cleanup of all deployed resources including inference components, models, and endpoints.
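
Deleting an inference component is asynchronous, so it can help to wait until the component is gone before deleting the endpoint that hosts it. The sketch below shows one way to do that with a generic "describe until not found" loop. `wait_until_gone`, `describe`, and `is_not_found` are illustrative names, and the demo uses a fake `describe` function so it runs without AWS access.

```python
import time


def wait_until_gone(describe, is_not_found, timeout_s=600, interval_s=10,
                    sleep=time.sleep):
    """Return True once `describe()` reports the resource missing, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            describe()
        except Exception as exc:
            if is_not_found(exc):
                return True
            raise  # Any other error is unexpected, so surface it
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake `describe` that succeeds twice, then reports "not found".
calls = {"n": 0}


def fake_describe():
    calls["n"] += 1
    if calls["n"] > 2:
        raise LookupError("not found")
    return {"InferenceComponentStatus": "Deleting"}


gone = wait_until_gone(fake_describe, lambda e: isinstance(e, LookupError),
                       sleep=lambda _: None)
print(gone)
```

With boto3, `describe` could wrap `sagemaker_client.describe_inference_component(InferenceComponentName=...)`, and `is_not_found` would inspect the raised `ClientError`. The exact error code returned for a missing component is an assumption to verify against your own account.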

In [None]:
from sagemaker.predictor import Predictor

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
)

In [None]:
for model_name in model_name_list:
    try:
        inference_component_name = f"IC-{model_name}"
        print(f"Deleting inference component: {inference_component_name}")
        
        # Delete inference component
        sagemaker_client.delete_inference_component(
            InferenceComponentName=inference_component_name
        )
    except Exception as e:
        print(f"{e}")


In [None]:
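# Models are attached to inference components rather than to the endpoint
# config, so delete them by name instead of through the predictor.
for model_name in model_name_list:
    try:
        print(f"Deleting model: {model_name}")
        sagemaker_client.delete_model(ModelName=model_name)
    except Exception as e:
        print(f"{e}")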


In [None]:

try:
    print(f"Deleting endpoint: {predictor.endpoint_name}")
    predictor.delete_endpoint()
except Exception as e:
    print(f"{e}")

print("---" * 10)
print("Done")
