# PCAI Use Case Demo - Serving LoRA in MLIS
In this tutorial, we deploy a LoRA adapter with vLLM in MLIS, using aioli-sdk to deploy the model programmatically.

**1. Install Required Libraries**<br>
Before running the demo, please install the necessary libraries in your environment:

In [14]:
!pip install aioli-sdk==1.4.1

Collecting aioli-sdk==1.4.1
  Using cached aioli_sdk-1.4.1-py3-none-any.whl.metadata (1.7 kB)
Using cached aioli_sdk-1.4.1-py3-none-any.whl (170 kB)
Installing collected packages: aioli-sdk
  Attempting uninstall: aioli-sdk
    Found existing installation: aioli-sdk 1.10.0
    Uninstalling aioli-sdk-1.10.0:
      Successfully uninstalled aioli-sdk-1.10.0
Successfully installed aioli-sdk-1.4.1
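
The install output above shows a newer release (1.10.0) being replaced by the pinned one, so it can be worth confirming which version the running kernel actually sees. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of aioli-sdk:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("aioli-sdk"))
```

If the reported version differs from the pinned one, restarting the kernel usually makes the new install visible.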


# Initialize the API client for the MLIS REST API

In [1]:
%update_token

Token successfully refreshed.


In [2]:
import os
namespace = os.popen("kubectl get pvc user-pvc -o=jsonpath='{.metadata.namespace}'").read()

In [3]:
import aiolirest
from aioli.common import util
from aioli.common.api import authentication

In [4]:
host_url = "http://aioli-master-service-hpe-mlis.mlis.svc.cluster.local:8080"

host = util.prepend_protocol(host_url)
with open('/etc/secrets/ezua/.auth_token','r') as f:
    token = f.read()
# token = util.get_aioli_user_token_from_env()
configuration = authentication.get_rest_config(host)
configuration.api_key["ApiKeyAuth"] = "Bearer " + token
restclient = aiolirest.ApiClient(configuration)
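
The cell above concatenates `"Bearer "` with the raw contents of the token file. A token file that ends with a newline would produce a malformed header value, so a small guard can help. This is a sketch only; `bearer_value` is a hypothetical helper, not an aioli-sdk function:

```python
def bearer_value(raw_token: str) -> str:
    """Build an Authorization header value from the raw token file contents."""
    token = raw_token.strip()  # drop surrounding whitespace such as a trailing newline
    if not token:
        raise ValueError("auth token is empty")
    return "Bearer " + token


print(bearer_value("abc123\n"))  # Bearer abc123
```

In the notebook, the result could be assigned to `configuration.api_key["ApiKeyAuth"]` in place of the plain concatenation.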

# Create Registry
In MLIS, registries are the storage locations for models. Users can add, edit, list, and delete registries.

**Available Registry Types**
- S3: requires an S3 bucket and the corresponding access keys
- Hugging Face or OpenLLM: sign up for a Hugging Face account and create an access token
- NGC: sign up for an NVIDIA NGC account and obtain the necessary API key

In [5]:
api_instance = aiolirest.RegistriesApi(restclient)

In [6]:
import boto3
s3 = boto3.client("s3", verify=False)
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
    if 'mlflow' in bucket['Name']:
        bucket_name = bucket['Name']

print(bucket_name)

Found endpoint for s3 via: environment_global.


mlflow.sgctcpcai
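
The lookup above leaves `bucket_name` undefined when no bucket matches, which only surfaces later as a `NameError`. A minimal sketch of a stricter variant that works on the dictionary shape returned by `list_buckets()`; `find_bucket` is a hypothetical helper, and the sample response is a stand-in rather than real output:

```python
from typing import Any, Dict


def find_bucket(list_buckets_response: Dict[str, Any], keyword: str) -> str:
    """Return the first bucket name containing `keyword`, or raise if none match."""
    for bucket in list_buckets_response.get("Buckets", []):
        if keyword in bucket["Name"]:
            return bucket["Name"]
    raise LookupError(f"no bucket name contains {keyword!r}")


# Stand-in for the dict returned by s3.list_buckets(); "scratch" is made up.
sample_response = {"Buckets": [{"Name": "scratch"}, {"Name": "mlflow.sgctcpcai"}]}
print(find_bucket(sample_response, "mlflow"))  # mlflow.sgctcpcai
```

Note that this returns the first match, whereas the loop in the cell above keeps the last one; the two only differ when several bucket names contain the keyword.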


In [9]:
from aiolirest.models.trained_model_registry_request import TrainedModelRegistryRequest

r = TrainedModelRegistryRequest(
    name='s3-bucket-from-sdk',
    bucket=bucket_name,
    endpointUrl='http://local-s3-service.ezdata-system.svc.cluster.local:30000',
    type='s3',
    accessKey=token,
    secretKey='s3',
    insecureHttps=False,
)
registry_request = api_instance.registries_post(r)

<img src="./assets/mlis_registry_aie180.png" alt="mlis_registry_1" width="400">

# Create Packaged Model
In MLIS, a packaged model describes the model that a user wants to deploy as an inference service.
By adding a packaged model, the user creates a versioned pointer to a model stored in a specified registry or PVC.
Read and pull access is controlled by the keys assigned to the registry.

**Available Model Types**
- Bento Archive : S3
- Custom : OpenLLM, PVC, S3, None
- NIM : NGC, PVC
- OpenLLM : OpenLLM, S3
- vLLM : HuggingFace, S3
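
The pairing of model types and registry types listed above can be checked before a request is submitted. The table below is a direct transcription of that list into a lookup structure (keys lower-cased for convenience); it is an illustrative sketch, not a structure provided by aioli-sdk:

```python
# Transcription of the "Available Model Types" list; not an aioli-sdk object.
ALLOWED_REGISTRIES = {
    "bento archive": {"s3"},
    "custom": {"openllm", "pvc", "s3", "none"},
    "nim": {"ngc", "pvc"},
    "openllm": {"openllm", "s3"},
    "vllm": {"huggingface", "s3"},
}


def is_supported(model_type: str, registry_type: str) -> bool:
    """Return True if the list above pairs `model_type` with `registry_type`."""
    allowed = ALLOWED_REGISTRIES.get(model_type.lower(), set())
    return registry_type.lower() in allowed


print(is_supported("Custom", "S3"))  # True
print(is_supported("NIM", "S3"))     # False
```

This demo uses the `custom` format with an S3 registry, which is one of the listed combinations.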

In [11]:
from aiolirest.models.configuration_resources import ConfigurationResources
from aiolirest.models.packaged_model_request import PackagedModelRequest
from aiolirest.models.resource_profile import ResourceProfile

In [12]:
api_instance = aiolirest.PackagedModelsApi(restclient)

In [13]:
from argparse import Namespace

full_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"

config = {
    'requests_cpu': '1',
    'requests_gpu': '1',
    'requests_memory': '4Gi',
    'limits_cpu': '4',
    'limits_gpu': '1',
    'limits_memory': '8Gi',
    'enable_caching': False,
    'disable_caching': False,
    'metadata': {
        'modelCategory=llm'
    },
    'env': {},
    'arg': [
        '--model',
        full_model_name,
        '--port',
        '8080',
        '--dtype=half',
        '--gpu-memory-utilization',
        '0.8',
        '--enable-lora',
        '--lora-modules',
        '{"name":"math-lora","path":"/mnt/models","base_model_name":"' + full_model_name + '"}',
    ]
}
args = Namespace(**config)
requests = ResourceProfile(
    cpu=args.requests_cpu, gpu=args.requests_gpu, memory=args.requests_memory
)
limits = ResourceProfile(
    cpu=args.limits_cpu, gpu=args.limits_gpu, memory=args.limits_memory
)
resources = ConfigurationResources(gpuType=None, requests=requests, limits=limits)
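
The resource values above use Kubernetes-style quantities such as `4Gi`, and Kubernetes rejects a container whose requests exceed its limits. A minimal sketch of a pre-flight check follows. Both helpers are hypothetical, and the parser handles only plain integers and the binary suffixes `Ki`, `Mi`, `Gi`, and `Ti`. CPU values are assumed to be plain core counts, not millicore strings such as `500m`.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '4Gi' to bytes (binary suffixes or plain integers)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


def requests_fit_limits(requests: dict, limits: dict) -> bool:
    """Return True if every requested resource is within its limit."""
    return (
        float(requests["cpu"]) <= float(limits["cpu"])
        and int(requests["gpu"]) <= int(limits["gpu"])
        and memory_bytes(requests["memory"]) <= memory_bytes(limits["memory"])
    )


print(requests_fit_limits(
    {"cpu": "1", "gpu": "1", "memory": "4Gi"},
    {"cpu": "4", "gpu": "1", "memory": "8Gi"},
))  # True
```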

In [14]:
from aioli.common.util import (
    construct_arguments,
    construct_environment,
    construct_metadata,
)
print(construct_metadata(args, {}))
print(construct_environment(args))
print(construct_arguments(args))

{'modelCategory': 'llm'}
{}
['--model', 'HuggingFaceTB/SmolLM2-360M-Instruct', '--port', '8080', '--dtype=half', '--gpu-memory-utilization', '0.8', '--enable-lora', '--lora-modules', '{"name":"math-lora","path":"/mnt/models","base_model_name":"HuggingFaceTB/SmolLM2-360M-Instruct"}']
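
The `--lora-modules` value above is assembled by string concatenation, which breaks if a name or path ever contains a quote. A sketch of building the same JSON string with the standard `json` module; `lora_module_arg` is a hypothetical helper:

```python
import json


def lora_module_arg(name: str, path: str, base_model_name: str) -> str:
    """Serialize one LoRA module description as compact JSON."""
    module = {"name": name, "path": path, "base_model_name": base_model_name}
    # Compact separators reproduce the spacing of the hand-written string.
    return json.dumps(module, separators=(",", ":"))


print(lora_module_arg("math-lora", "/mnt/models", "HuggingFaceTB/SmolLM2-360M-Instruct"))
```

For these inputs the result is character-for-character identical to the last argument printed above.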


In [15]:
import mlflow

# Get the experiment ID
experiment_name = "Default"
experiment = mlflow.get_experiment_by_name(experiment_name)
latest_run = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1
)

artifact_uri = latest_run.artifact_uri[0]
print(f"Artifact URI: {artifact_uri}")

# For a specific artifact file/folder
artifact_path = 'math-SmolLM2-360M-Instruct-savedir/peft'  # or any artifact path

full_artifact_uri = f"{artifact_uri}/{artifact_path}"
print(f"Full artifact URI: {full_artifact_uri}")

Artifact URI: s3://mlflow.sgctcpcai/0/474dfdce2c9340388d1fa30a2b5bad6c/artifacts
Full artifact URI: s3://mlflow.sgctcpcai/0/474dfdce2c9340388d1fa30a2b5bad6c/artifacts/math-SmolLM2-360M-Instruct-savedir/peft
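
Because the packaged model is resolved through the registry created earlier, it can be useful to confirm that the artifact URI points into the same bucket as the registry. A minimal sketch that splits an `s3://` URI into bucket and key using the standard library; `split_s3_uri` is a hypothetical helper:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://mlflow.sgctcpcai/0/474dfdce2c9340388d1fa30a2b5bad6c/artifacts"
    "/math-SmolLM2-360M-Instruct-savedir/peft"
)
print(bucket)  # mlflow.sgctcpcai
print(key)
```

In the notebook, `bucket` could then be compared with `bucket_name` from the registry step.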


In [16]:
r = PackagedModelRequest(
    name='packaged-model-from-sdk',
    description='packaged-model-from-sdk',
    url=full_artifact_uri,
    # url='s3://mlflow.sgctcpcai/0/474dfdce2c9340388d1fa30a2b5bad6c/artifacts/math-SmolLM2-360M-Instruct-savedir/peft',
    image='vllm/vllm-openai:v0.8.5',
    resources=resources,
    modelFormat='custom',
    arguments=construct_arguments(args),
    metadata=construct_metadata(args, {}),
    registry=registry_request.name,
)

if args.enable_caching:
    r.caching_enabled = True

if args.disable_caching:
    r.caching_enabled = False

resposne = api_instance.models_post(r)

In [17]:
resposne

PackagedModel(arguments=['--model', 'HuggingFaceTB/SmolLM2-360M-Instruct', '--port', '8080', '--dtype=half', '--gpu-memory-utilization', '0.8', '--enable-lora', '--lora-modules', '{"name":"math-lora","path":"/mnt/models","base_model_name":"HuggingFaceTB/SmolLM2-360M-Instruct"}'], caching_enabled=False, description='packaged-model-from-sdk', environment={}, id='15c0b7a3-062a-4959-ae6c-8f3321bd24b2', image='vllm/vllm-openai:v0.8.5', metadata={'modelCategory': 'llm'}, format='custom', modified_at='2025-11-14T14:59:42.355809Z', name='packaged-model-from-sdk', project='69fc1dd6-47eb-4411-9d75-dc24d94db622', registry='f81459fb-b317-4169-8cdd-3a89de355e1c', resources=ConfigurationResources(gpu_type='', limits=ResourceProfile(cpu='4', gpu='1', memory='8Gi'), requests=ResourceProfile(cpu='1', gpu='1', memory='4Gi')), url='s3://mlflow.sgctcpcai/0/474dfdce2c9340388d1fa30a2b5bad6c/artifacts/math-SmolLM2-360M-Instruct-savedir/peft', version=1)

<img src="./assets/mlis_packaged_aie180.png" alt="mlis_packaged_1" width="400">

# Deploy Model
In MLIS, deployments spin up the actual instances that run the user's inference services.

A successful deployment provides an endpoint that clients can use to make predictions.

Read and pull access is controlled by the keys assigned to the registry.

In [18]:
from aiolirest.models.autoscaling import Autoscaling
from aiolirest.models.deployment_request import DeploymentRequest
from aiolirest.models.security import Security

In [19]:
api_instance = aiolirest.DeploymentsApi(restclient)

In [20]:
from argparse import Namespace

config = {
    'autoscaling_target': 1,
    'autoscaling_metric': 'rps',
    'autoscaling_max_replicas': 1,
    'autoscaling_min_replicas': 1,
}
args = Namespace(**config)
sec = Security(authenticationRequired=True)

In [21]:
auto = Autoscaling(
    metric=args.autoscaling_metric,
)

if args.autoscaling_target is not None:
    auto.target = args.autoscaling_target

if args.autoscaling_max_replicas is not None:
    auto.max_replicas = args.autoscaling_max_replicas

if args.autoscaling_min_replicas is not None:
    auto.min_replicas = args.autoscaling_min_replicas
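
To give a feel for what an `rps` target means, the sketch below shows how request-rate autoscalers commonly interpret it: the observed request rate is divided by the per-replica target, rounded up, and clamped to the replica bounds. Whether MLIS applies exactly this rule is an assumption, and `desired_replicas` is a hypothetical helper rather than part of the SDK.

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count needed to keep each replica at or below target_rps, clamped to bounds."""
    if target_rps <= 0:
        raise ValueError("target_rps must be positive")
    wanted = math.ceil(observed_rps / target_rps)
    return max(min_replicas, min(max_replicas, wanted))


# With the settings used in this notebook (min = max = 1) the result is always 1.
print(desired_replicas(observed_rps=10, target_rps=1, min_replicas=1, max_replicas=1))  # 1
print(desired_replicas(observed_rps=7.5, target_rps=2, min_replicas=1, max_replicas=10))  # 4
```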

In [22]:
r = DeploymentRequest(
    name='deployment-from-sdk',
    model=resposne.name,
    security=sec,
    namespace=namespace,
    autoScaling=auto,
)
results = api_instance.deployments_post(r)

In [23]:
results

Deployment(arguments=None, auto_scaling=Autoscaling(max_replicas=1, metric='rps', min_replicas=1, target=1), canary_traffic_percent=100, cluster_name='', environment={}, goal_status='Ready', id='5c5da1e5-ddfa-477b-899c-173263ec853d', last_event=None, model='15c0b7a3-062a-4959-ae6c-8f3321bd24b2', modified_at='2025-11-14T15:01:07.321965Z', name='deployment-from-sdk', namespace='geun-tak-roh-hp-b3801707', node_selectors={}, priority_class_name='', project='69fc1dd6-47eb-4411-9d75-dc24d94db622', secondary_state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='None', traffic_percentage=0), security=Security(authentication_required=True), state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='Deploying', traffic_percentage=0), status='Deploying')

In [24]:
from aioli.cli import deployment
import time

# Query the status once per iteration so the printed and checked values always match.
while True:
    status = deployment.lookup_deployment(results.name, api_instance).status
    if status == 'Ready':
        print("Model is Ready!")
        break
    if status != 'Deploying':
        print(f'Something went wrong (status: {status}). Check the deployment!')
        break
    print(status)
    time.sleep(5)

Deploying
Deploying
Deploying
[... "Deploying" printed 98 times in total ...]
Model is Ready!
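
The polling loop above runs until the status changes, so a deployment that never leaves `Deploying` would block the notebook indefinitely. A minimal sketch of the same loop with a timeout; `wait_for_ready` is a hypothetical helper, and the status source is faked here so that the example is self-contained:

```python
import time
from typing import Callable


def wait_for_ready(get_status: Callable[[], str],
                   timeout_s: float = 1800.0,
                   interval_s: float = 5.0) -> str:
    """Poll get_status() until it stops returning 'Deploying' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != "Deploying":
            return status  # 'Ready' or any failure status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still 'Deploying' after {timeout_s} seconds")
        time.sleep(interval_s)


# Stand-in status source for illustration only.
fake_statuses = iter(["Deploying", "Deploying", "Ready"])
print(wait_for_ready(lambda: next(fake_statuses), interval_s=0.01))  # Ready
```

In the notebook, the callable would be `lambda: deployment.lookup_deployment(results.name, api_instance).status`.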


In [25]:
deployment.lookup_deployment(results.name,api_instance).state.endpoint

'https://deployment-from-sdk-predictor-geun-tak-roh-hp-b3801707.app.pcai.sgctc.net'

In [26]:
import requests

with open('/etc/secrets/ezua/.auth_token','r') as file:
    AUTH_TOKEN = file.read()
endpoint_url = deployment.lookup_deployment(results.name,api_instance).state.endpoint
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

route = '/v1/models'
models_response = requests.get(endpoint_url+route,headers=headers,verify=False)
sample_prompt = "In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?"

for model in models_response.json()['data']:
    print(model['id'])
    payload = {
        "model": model['id'],
        "messages": [
            {
                "role": "system",
                "content": "you are a helpful math tutor, solve the question step by step"
            },
            {
                "role": "user",
                "content": sample_prompt
            }
        ]
    }
    route = '/v1/chat/completions'
    chat_response = requests.post(endpoint_url+route,headers=headers,verify=False,json=payload)
    print(f"*** {model['id']} ***\n{chat_response.json()['choices'][0]['message']['content']}")



HuggingFaceTB/SmolLM2-360M-Instruct
*** HuggingFaceTB/SmolLM2-360M-Instruct ***
To solve this problem, let's first understand the given information: Mark was playing for 20 minutes, then rested, and then played for another 35 minutes.

Step 1: Calculate Mark's rest period.
Mark rested after 20 minutes, so let's subtract 20 from the total duration of the soccer game to find the rest period. 

90 minutes - 20 minutes (rest) = 70 minutes

Step 2: Calculate Mark's playtime.
Mark is being paid an hourly wage of $20. His playtime is the number of minutes worked after he rested.

Mark worked for 70 minutes after he rested, so let's subtract 70 minutes from the total time to find the playtime.

90 minutes - 70 minutes (rest) = 20 minutes

Step 3: Calculate Mark's playtime.
Now that we know Mark worked for 20 minutes, we can subtract it from the total duration of the soccer game to find the time he played.

In 20 minutes, Mark played for 90 - 20 minutes = 70 minutes.

Step 4: Calculate Mark's t



*** math-lora ***
Mark played for a total of 20 + 35 = 55 minutes. Now, he rested for 90 minutes - 55 minutes = 35 minutes. So, he was rested for 55 minutes / 35 minutes = 1.667 hours (since 1 hour is equal to 3600 seconds).

Now, Mark is rested for x minutes. Let's solve for x:
x = 550 seconds / x = (550 seconds / 60 seconds per minute) = 9.167 minutes.

So, Mark was rested for approximately 9.167 minutes or 1.667 hours.
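
For reference when comparing the two responses, the arithmetic that the sample prompt asks for is short enough to check directly:

```python
game_minutes = 90
played_minutes = 20 + 35          # two stretches on the field
sideline_minutes = game_minutes - played_minutes

print(sideline_minutes)  # 35
```

The `math-lora` response above reaches the same value (90 - 55 = 35) in its first step.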
