# PCAI Use Case Demo - Serving LoRA in MLIS
In this tutorial, we deploy a LoRA adapter with vLLM in MLIS, using aioli-sdk to deploy the model programmatically.

**1. Install Required Libraries**<br>
Before running the demo, please install the necessary libraries in your environment:

In [2]:
!pip install aioli-sdk==1.10.0

Collecting urllib3<2.3.0,>=2.0.0 (from aioli-sdk==1.10.0)
  Using cached urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Using cached urllib3-2.2.3-py3-none-any.whl (126 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.20
    Uninstalling urllib3-1.26.20:
      Successfully uninstalled urllib3-1.26.20
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.9.0 requires urllib3<2.0.0, but you have urllib3 2.2.3 which is incompatible.
Successfully installed urllib3-2.2.3


In [4]:
%env REQUESTS_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
%env SSL_CERT_FILE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
%env CURL_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt

env: REQUESTS_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
env: SSL_CERT_FILE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
env: CURL_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt


In [31]:
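import os

# Read the namespace of the user's PVC; strip any trailing whitespace from the command output
namespace = os.popen("kubectl get pvc user-pvc -o=jsonpath='{.metadata.namespace}'").read().strip()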

# Initialize the API Client for the MLIS REST API

In [5]:
import aiolirest
from aioli.common import api, util
from aioli.common.api import authentication

In [58]:
# host_url = "https://mlis.aie01.pcai.tryezmeral.com"
host_url = "http://aioli-master-service-hpe-mlis.mlis.svc.cluster.local:8080"

host = util.prepend_protocol(host_url)
token = util.get_aioli_user_token_from_env()
configuration = authentication.get_rest_config(host)
configuration.api_key["ApiKeyAuth"] = "Bearer " + token
restclient = aiolirest.ApiClient(configuration)

# Create Registry
In MLIS, registries are the storage locations for models. Users can add, edit, list, and delete registries.

**Available Registry Types**
- S3: requires an S3 bucket and the necessary access keys
- Hugging Face or OpenLLM: sign up for a Hugging Face account and create an access token
- NGC: sign up for an NVIDIA NGC account and obtain the necessary API key

In [30]:
api_instance = aiolirest.RegistriesApi(restclient)

In [32]:
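import boto3

# Look for the MLflow bucket among the buckets visible to this user
s3 = boto3.client("s3", verify=False)
buckets = s3.list_buckets()
bucket_name = None  # stays None if no bucket name contains 'mlflow'
for bucket in buckets['Buckets']:
    if 'mlflow' in bucket['Name']:
        bucket_name = bucket['Name']

print(bucket_name)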

Found endpoint for s3 via: environment_global.


mlflow.aie01
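
The bucket lookup above can also be written as a small function that fails loudly when nothing matches. This is a minimal sketch: `find_bucket` is a hypothetical helper, not part of boto3 or aioli-sdk, and it returns the first match, whereas the loop above keeps the last one.

```python
def find_bucket(bucket_names, keyword="mlflow"):
    """Return the first bucket name containing `keyword` (hypothetical helper)."""
    matches = [name for name in bucket_names if keyword in name]
    if not matches:
        raise LookupError(f"no bucket name contains {keyword!r}")
    return matches[0]


# Example with plain strings; in the notebook the names would come from
# [b['Name'] for b in s3.list_buckets()['Buckets']].
print(find_bucket(["datasets", "mlflow.aie01"]))
```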


In [34]:
from aiolirest.models.trained_model_registry_request import TrainedModelRegistryRequest

r = TrainedModelRegistryRequest(
    name='s3-bucket-from-sdk',
    bucket=bucket_name,
    endpointUrl='http://local-s3-service.ezdata-system.svc.cluster.local:30000',
    type='s3',
    accessKey=None,
    secretKey='',
    insecureHttps=False,
)
registry_request = api_instance.registries_post(r)

<img src="assets/mlis_registry_1.png" alt="registry in MLIS" width="400">

# Create Packaged Model
In MLIS, a packaged model describes the model that the user wants to deploy as an inference service.
By adding a packaged model, the user creates a versioned pointer to a model stored in a specified registry and PVC.

Read and pull access is controlled by the registry's assigned keys.

**Available Model Types**
- Bento Archive : S3
- Custom : OpenLLM, PVC, S3, None
- NIM : NGC, PVC
- OpenLLM : OpenLLM, S3
- vLLM : HuggingFace, S3
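
The list above can be encoded as a lookup table to check a model type against a registry type before creating a packaged model. This is an illustrative sketch only: the lowercase keys and the `is_supported` helper are assumptions made for this example, not identifiers from the MLIS API.

```python
# Model type -> registry/storage types listed above.
# The lowercase labels are illustrative, not official MLIS values.
MODEL_TYPE_SOURCES = {
    "bento archive": {"s3"},
    "custom": {"openllm", "pvc", "s3", "none"},
    "nim": {"ngc", "pvc"},
    "openllm": {"openllm", "s3"},
    "vllm": {"huggingface", "s3"},
}


def is_supported(model_type, source):
    """Return True if `source` is listed for `model_type` (hypothetical helper)."""
    return source.lower() in MODEL_TYPE_SOURCES.get(model_type.lower(), set())


print(is_supported("Custom", "S3"))  # the combination used in this demo
print(is_supported("NIM", "S3"))
```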

In [12]:
from aiolirest.models.configuration_resources import ConfigurationResources
from aiolirest.models.deployment_model_version import DeploymentModelVersion
from aiolirest.models.packaged_model import PackagedModel
from aiolirest.models.packaged_model_request import PackagedModelRequest
from aiolirest.models.resource_profile import ResourceProfile

In [35]:
api_instance = aiolirest.PackagedModelsApi(restclient)

In [None]:
from argparse import Namespace

full_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"

config = {
    'requests_cpu': '1',
    'requests_gpu': '1',
    'requests_memory': '4Gi',
    'limits_cpu': '4',
    'limits_gpu': '1',
    'limits_memory': '8Gi',
    'enable_caching': False,
    'disable_caching': False,
    'env': {},
    'arg': [
        '--model',
        full_model_name,
        '--port',
        '8080',
        '--dtype=half',
        '--gpu-memory-utilization',
        '0.8',
        '--enable-lora',
        '--lora-modules',
        '{"name":"math-lora","path":"/mnt/models","base_model_name":"' + full_model_name + '"}',
    ]
}
args = Namespace(**config)
requests = ResourceProfile(
    cpu=args.requests_cpu, gpu=args.requests_gpu, memory=args.requests_memory
)
limits = ResourceProfile(
    cpu=args.limits_cpu, gpu=args.limits_gpu, memory=args.limits_memory
)
resources = ConfigurationResources(gpuType=None, requests=requests, limits=limits)
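
Before submitting the resource profile, it can help to confirm that no request exceeds its limit. The sketch below assumes Kubernetes-style quantity strings (`m` for millicores; `Ki`, `Mi`, `Gi`, `Ti` for memory) and handles only those suffixes. `check_resources` and the parsing helpers are hypothetical and not part of aioli-sdk.

```python
# Binary memory suffixes handled by this sketch (not the full Kubernetes grammar).
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(value):
    """Convert a CPU quantity such as '500m' or '4' to cores."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)


def parse_memory(value):
    """Convert a memory quantity such as '4Gi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)  # plain byte count


def check_resources(requests, limits):
    """Return the names of resources whose request exceeds the limit."""
    problems = []
    if parse_cpu(requests["cpu"]) > parse_cpu(limits["cpu"]):
        problems.append("cpu")
    if int(requests["gpu"]) > int(limits["gpu"]):
        problems.append("gpu")
    if parse_memory(requests["memory"]) > parse_memory(limits["memory"]):
        problems.append("memory")
    return problems


# Values from the config above; an empty list means the profile is consistent.
print(check_resources(
    {"cpu": "1", "gpu": "1", "memory": "4Gi"},
    {"cpu": "4", "gpu": "1", "memory": "8Gi"},
))
```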

In [None]:
from aioli.common.util import (
    construct_arguments,
    construct_environment,
    construct_metadata,
    launch_dashboard,
)

print(construct_environment(args))
print(construct_arguments(args))
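
The `--lora-modules` value in the config is assembled by string concatenation. A possible alternative, sketched here, builds the same JSON with `json.dumps`, which takes care of quoting and escaping. `lora_module_arg` is a hypothetical helper name.

```python
import json


def lora_module_arg(name, path, base_model_name):
    """Build the JSON value passed to --lora-modules (hypothetical helper)."""
    spec = {"name": name, "path": path, "base_model_name": base_model_name}
    # Compact separators reproduce the hand-written string without spaces.
    return json.dumps(spec, separators=(",", ":"))


full_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
print(lora_module_arg("math-lora", "/mnt/models", full_model_name))
```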

In [40]:
import mlflow

# Get the experiment ID
experiment_name = "Default"
experiment = mlflow.get_experiment_by_name(experiment_name)
latest_run = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1
)

# search_runs returns a pandas DataFrame; take the first (latest) row by position
artifact_uri = latest_run["artifact_uri"].iloc[0]
print(f"Artifact URI: {artifact_uri}")

# For a specific artifact file/folder
artifact_path = 'math-SmolLM2-360M-Instruct'  # or any artifact path

full_artifact_uri = f"{artifact_uri}/{artifact_path}"
print(f"Full artifact URI: {full_artifact_uri}")

Artifact URI: s3://mlflow.aie01/0/1c8220945ef3485f83ed967df5d29963/artifacts
Full artifact URI: s3://mlflow.aie01/0/1c8220945ef3485f83ed967df5d29963/artifacts/math-SmolLM2-360M-Instruct
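
The artifact URI combines the registry bucket and the object key. As a quick sanity check that the URI points into the bucket registered earlier, it can be split with the standard library. `split_s3_uri` is a hypothetical helper for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://mlflow.aie01/0/1c8220945ef3485f83ed967df5d29963/artifacts/math-SmolLM2-360M-Instruct"
)
print(bucket)
print(key)
```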


In [41]:
r = PackagedModelRequest(
    name='packaged-model-from-sdk',
    description='packaged-model-from-sdk',
    url=full_artifact_uri,
    image='vllm/vllm-openai:v0.8.5',
    resources=resources,
    modelFormat='custom',
    arguments=construct_arguments(args),
    registry=registry_request.name,
)

if args.enable_caching:
    r.caching_enabled = True

if args.disable_caching:
    r.caching_enabled = False

resposne = api_instance.models_post(r)

In [42]:
resposne

PackagedModel(arguments=['--model', 'HuggingFaceTB/SmolLM2-360M-Instruct', '--port', '8080', '--dtype=half', '--gpu-memory-utilization', '0.8', '--enable-lora', '--lora-modules', '{"name":"math-lora","path":"/mnt/models","base_model_name":"HuggingFaceTB/SmolLM2-360M-Instruct"}'], caching_enabled=False, description='packaged-model-from-sdk', environment={}, id='cc6b55c0-9fa2-47fd-a29c-0e061de93f9d', image='vllm/vllm-openai:v0.8.5', metadata={}, format='custom', modified_at='2025-11-05T16:35:35.067753Z', name='packaged-model-from-sdk', project='', registry='53536dd6-03d6-46a7-b771-368587b19c31', resources=ConfigurationResources(gpu_type='', limits=ResourceProfile(cpu='4', gpu='1', memory='8Gi'), requests=ResourceProfile(cpu='1', gpu='1', memory='4Gi')), url='s3://mlflow.aie01/0/1c8220945ef3485f83ed967df5d29963/artifacts/math-SmolLM2-360M-Instruct', version=1)

<img src="assets/mlis_packaged_1.png" alt="packaged model in MLIS" width="400">

<img src="assets/mlis_packaged_2.png" alt="packaged model in MLIS" width="400">

# Deploy Model

In [44]:
from aioli.common.util import (
    construct_arguments,
    construct_environment,
    launch_dashboard,
)
from aiolirest.models.autoscaling import Autoscaling
from aiolirest.models.deployment import Deployment, DeploymentState
from aiolirest.models.deployment_request import DeploymentRequest
from aiolirest.models.event_info import EventInfo
from aiolirest.models.security import Security

In [60]:
api_instance = aiolirest.DeploymentsApi(restclient)

In [49]:
from argparse import Namespace

config = {
    'autoscaling_target': 1,
    'autoscaling_metric': 'rps',
    'autoscaling_max_replicas': 1,
    'autoscaling_min_replicas': 1,
}
args = Namespace(**config)
sec = Security(authenticationRequired=True)

In [50]:
auto = Autoscaling(
    metric=args.autoscaling_metric,
)

if args.autoscaling_target is not None:
    auto.target = args.autoscaling_target

if args.autoscaling_max_replicas is not None:
    auto.max_replicas = args.autoscaling_max_replicas

if args.autoscaling_min_replicas is not None:
    auto.min_replicas = args.autoscaling_min_replicas
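
With `min_replicas` and `max_replicas` both set to 1, the replica count is fixed, so the `rps` target has no scaling effect in this demo. When the bounds differ, a quick consistency check like the one below can catch mistakes before the request is sent. This is a hypothetical helper; the rules it applies (non-negative minimum, minimum not above maximum, positive target) are assumptions, not validation performed by MLIS.

```python
def validate_autoscaling(min_replicas, max_replicas, target):
    """Return a list of problems found in the autoscaling settings."""
    errors = []
    if min_replicas < 0:
        errors.append("min_replicas must not be negative")
    if max_replicas < 1:
        errors.append("max_replicas must be at least 1")
    if min_replicas > max_replicas:
        errors.append("min_replicas must not exceed max_replicas")
    if target <= 0:
        errors.append("target must be positive")
    return errors


print(validate_autoscaling(min_replicas=1, max_replicas=1, target=1))
print(validate_autoscaling(min_replicas=3, max_replicas=1, target=1))
```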

In [52]:
r = DeploymentRequest(
    name='deployment-from-sdk',
    model=resposne.name,
    security=sec,
    namespace=namespace,
    autoScaling=auto,
)
results = api_instance.deployments_post(r)

In [54]:
results

Deployment(arguments=None, auto_scaling=Autoscaling(max_replicas=1, metric='rps', min_replicas=1, target=1), canary_traffic_percent=100, cluster_name='', environment={}, goal_status='Ready', id='c3695db6-8df5-4646-9f6e-9117026c58c2', last_event=None, model='cc6b55c0-9fa2-47fd-a29c-0e061de93f9d', modified_at='2025-11-05T16:40:02.108515Z', name='deployment-from-sdk', namespace='project-user-geun-tak-roh', node_selectors={}, priority_class_name='', project='', secondary_state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='None', traffic_percentage=0), security=Security(authentication_required=True), state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='Deploying', traffic_percentage=0), status='Deploying')

In [62]:
from aioli.cli import deployment
import time

while True:
    print(deployment.lookup_deployment(results.name,api_instance).status)
    time.sleep(3)
    if deployment.lookup_deployment(results.name,api_instance).status == 'Ready':
        print("Model is Reday!")
        break
    elif deployment.lookup_deployment(results.name,api_instance).status != 'Deploying':
        print('Something went wrong, Check the deployment!')
        break

Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Model is Reday!
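
The loop above waits indefinitely if the deployment never leaves the `Deploying` state. A bounded variant could look like the sketch below. `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning the status string; the 600-second timeout is an arbitrary assumption.

```python
import time


def wait_for_status(get_status, ready="Ready", in_progress=("Deploying",),
                    timeout=600.0, interval=3.0, sleep=time.sleep):
    """Poll get_status() until it returns `ready`, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == ready:
            return status
        if status not in in_progress:
            raise RuntimeError(f"unexpected deployment status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)


# Stand-in status source for illustration; in the notebook this would be
# lambda: deployment.lookup_deployment(results.name, api_instance).status
fake_statuses = iter(["Deploying", "Deploying", "Ready"])
print(wait_for_status(lambda: next(fake_statuses), interval=0))
```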


In [63]:
deployment.lookup_deployment(results.name,api_instance).state.endpoint

'https://deployment-from-sdk.project-user-geun-tak-roh.serving.aie01.pcai.tryezmeral.com'

In [66]:
import requests

with open('/etc/secrets/ezua/.auth_token','r') as file:
    AUTH_TOKEN = file.read().strip()  # drop any trailing newline so the header value stays valid
endpoint_url = deployment.lookup_deployment(results.name,api_instance).state.endpoint
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

route = '/v1/models'
models_response = requests.get(endpoint_url+route,headers=headers,verify=False)
sample_prompt = "In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?"

for model in models_response.json()['data']:
    print(model['id'])
    payload = {
        "model": model['id'],
        "messages": [
            {
                "role": "system",
                "content": "you are a helpful math tutor, solve the question step by step"
            },
            {
                "role": "user",
                "content": sample_prompt
            }
        ]
    }
    route = '/v1/chat/completions'
    chat_response = requests.post(endpoint_url+route,headers=headers,verify=False,json=payload)
    print(f"*** {model['id']} ***\n{chat_response.json()['choices'][0]['message']['content']}")

HuggingFaceTB/SmolLM2-360M-Instruct
*** HuggingFaceTB/SmolLM2-360M-Instruct ***
Step 1: Understand the problem
We know the soccer game lasted for 90 minutes, then Mark had a break, and then he was on the sideline for another 35 minutes of playing time.

Step 2: Analyze the given information
- Mark played for 20 minutes, then rested after. So he played for a total of 20 minutes and 1 more minute, which totals to 31 minutes.
- After the break, Mark had 35 minutes of playing time left on the sideline.

Step 3: Calculate the time Mark was on the sideline
To find the length of time he was on the sideline, we need to subtract the playing time he played before and the remaining playing time from the total amount of time he was on the sideline.

Step 4: Calculate the time
- He played a total of 31 minutes before the break
- After the rest of the players return to the field, he is 35 minutes behind in terms of the players' playtime (without the break, he would play for 90 - 31 = 59 minutes)
- S

*** math-lora ***
To find out how long Mark was on the sideline, we first need to calculate the total time he spent playing and the total time he spent resting.

1. Playing time: Mark played for 20 minutes.
2. Resting time: After resting, Mark was on the sideline for another 35 minutes.

Now, we add the two times together to find the total play time:

Total play time = Playing time + Rest time
= 20 minutes + 35 minutes
= 55 minutes

Since there are 60 minutes in an hour, the total play time is 55 minutes / 60 minutes per hour = 0.9375 hours.

3. Assuming there are 60 minutes without rest, to express this as a more conventional non-hour time, we move the decimal point 2 places to the right and return the answer to hours:
= 55 minutes / 60 minutes/hour
= 0.9375 hours

So, Mark was on the sideline for approximately 0.9375 hours.
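
For reference, the expected answer to the sample prompt follows directly from the numbers in the question, and neither response above arrives at it. The short check below works it out; the variable names are chosen only for this illustration.

```python
game_minutes = 90
played_minutes = 20 + 35  # two stretches on the field
sideline_minutes = game_minutes - played_minutes

print(sideline_minutes)  # 35
```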
