# PCAI Use Case Demo - Serving LoRA in MLIS
In this tutorial, we will deploy a LoRA adapter with vLLM in MLIS. We will also use aioli-sdk to deploy the model in MLIS programmatically.

**1. Install Required Libraries**<br>
Before running the demo, please install the necessary libraries in your environment:

In [1]:
!pip install aioli-sdk==1.10.0



# Initialize the API client for the MLIS REST API

In [2]:
import os
namespace = os.popen("kubectl get pvc user-pvc -o=jsonpath='{.metadata.namespace}'").read().strip()  # strip stray whitespace from the kubectl output

In [3]:
import aiolirest
from aioli.common import util
from aioli.common.api import authentication

In [4]:
host_url = "http://aioli-master-service-hpe-mlis.mlis.svc.cluster.local:8080"

host = util.prepend_protocol(host_url)
token = util.get_aioli_user_token_from_env()
configuration = authentication.get_rest_config(host)
configuration.api_key["ApiKeyAuth"] = "Bearer " + token
restclient = aiolirest.ApiClient(configuration)

# Create Registry
In MLIS, registries are the storage locations for models. Users can add, edit, list, and delete registries.

**Available Registry Types**
- S3: requires an S3 bucket and the necessary access keys
- Hugging Face or OpenLLM: sign up for a Hugging Face account and create an access token
- NGC: sign up for an NVIDIA NGC account and obtain the necessary API key

In [5]:
api_instance = aiolirest.RegistriesApi(restclient)

In [6]:
import boto3
s3 = boto3.client("s3", verify=False)
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
    if 'mlflow' in bucket['Name']:
        bucket_name = bucket['Name']

print(bucket_name)

Found endpoint for s3 via: environment_global.


mlflow.aie01
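
The loop above leaves `bucket_name` undefined when no bucket matches, and it keeps the last match when several do. A minimal sketch of a stricter lookup follows. The helper name `find_bucket` is hypothetical (it is not part of boto3 or aioli-sdk), it returns the first match rather than the last, and the sample dictionary only imitates the shape of a `list_buckets()` response.

```python
def find_bucket(list_buckets_response, keyword="mlflow"):
    """Return the first bucket name containing `keyword`, or raise if none match."""
    names = [b["Name"] for b in list_buckets_response.get("Buckets", [])]
    for name in names:
        if keyword in name:
            return name
    raise LookupError(f"no bucket containing {keyword!r}; found: {names}")


# Hypothetical response, shaped like the dictionary returned by s3.list_buckets().
sample = {"Buckets": [{"Name": "data.aie01"}, {"Name": "mlflow.aie01"}]}
print(find_bucket(sample))
```

In the notebook this could be called as `bucket_name = find_bucket(s3.list_buckets())`, so a missing bucket fails immediately instead of surfacing later as a `NameError`.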


In [8]:
from aiolirest.models.trained_model_registry_request import TrainedModelRegistryRequest

r = TrainedModelRegistryRequest(
    name='s3-bucket-from-sdk',
    bucket=bucket_name,
    endpointUrl='http://local-s3-service.ezdata-system.svc.cluster.local:30000',
    type='s3',
    accessKey=None,
    secretKey='',
    insecureHttps=False,
)
registry_request = api_instance.registries_post(r)

In [9]:
registry_request

TrainedModelRegistry(access_key='', bucket='mlflow.aie01', endpoint_url='http://local-s3-service.ezdata-system.svc.cluster.local:30000', id='b86fbfef-0d03-4f3a-bc20-a00c17084bf0', insecure_https=False, modified_at='2025-11-16T15:07:34.648359508Z', name='s3-bucket-from-sdk', project='', secret_key='', type='s3')

<img src="../assets/mlis_registry_1.png" alt="mlis_registry_1" width="400">

# Create Packaged Model
In MLIS, a packaged model describes the model that a user wants to deploy as an inference service.
By adding a packaged model, a user creates a versioned pointer to a model stored in a specified registry and PVC.
Read and pull access is controlled by the registry's assigned keys.

**Available Model Types**
- Bento Archive : S3
- Custom : OpenLLM, PVC, S3, None
- NIM : NGC, PVC
- OpenLLM : OpenLLM, S3
- vLLM : HuggingFace, S3
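
The compatibility list above can be encoded as a small lookup table to catch an unsupported combination before calling the API. This is only a sketch: the lowercase keys and the `is_supported` helper are illustrative names chosen here, not identifiers taken from the MLIS API.

```python
# Model type -> registry types it can be loaded from, transcribed from the list above.
# The lowercase labels are illustrative, not official MLIS identifiers.
ALLOWED_REGISTRIES = {
    "bento-archive": {"s3"},
    "custom": {"openllm", "pvc", "s3", "none"},
    "nim": {"ngc", "pvc"},
    "openllm": {"openllm", "s3"},
    "vllm": {"huggingface", "s3"},
}


def is_supported(model_type, registry_type):
    """Return True if the list above pairs this model type with this registry type."""
    allowed = ALLOWED_REGISTRIES.get(model_type.lower(), set())
    return registry_type.lower() in allowed


print(is_supported("custom", "s3"))  # the combination used in this tutorial
print(is_supported("nim", "s3"))
```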

In [10]:
from aiolirest.models.configuration_resources import ConfigurationResources
from aiolirest.models.packaged_model_request import PackagedModelRequest
from aiolirest.models.resource_profile import ResourceProfile

In [11]:
api_instance = aiolirest.PackagedModelsApi(restclient)

In [12]:
from argparse import Namespace

full_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"

config = {
    'requests_cpu': '1',
    'requests_gpu': '1',
    'requests_memory': '4Gi',
    'limits_cpu': '4',
    'limits_gpu': '1',
    'limits_memory': '8Gi',
    'enable_caching': False,
    'disable_caching': False,
    'metadata': {  # a set of 'key=value' strings; construct_metadata() parses them into a dict
        'modelCategory=llm'
    },
    'env': {},
    'arg': [
        '--model',
        full_model_name,
        '--port',
        '8080',
        '--dtype=half',
        '--gpu-memory-utilization',
        '0.8',
        '--enable-lora',
        '--lora-modules',
        '{"name":"math-lora","path":"/mnt/models","base_model_name":"' + full_model_name + '"}',
    ]
}
args = Namespace(**config)
requests = ResourceProfile(
    cpu=args.requests_cpu, gpu=args.requests_gpu, memory=args.requests_memory
)
limits = ResourceProfile(
    cpu=args.limits_cpu, gpu=args.limits_gpu, memory=args.limits_memory
)
resources = ConfigurationResources(gpuType=None, requests=requests, limits=limits)
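
Kubernetes expects each resource request to be no larger than the matching limit, so it can help to check the values before submitting them. The sketch below works on plain dictionaries rather than `ResourceProfile` objects, handles only binary memory suffixes (`Ki`, `Mi`, `Gi`, `Ti`) and plain or millicore CPU values, and all helper names are hypothetical.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_to_bytes(quantity):
    """Convert a quantity such as '4Gi' to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a byte count


def cpu_to_cores(quantity):
    """Convert '4' or '500m' (millicores) to a number of cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def requests_within_limits(requests, limits):
    """Return True when every requested amount is at most its limit."""
    return (
        cpu_to_cores(requests["cpu"]) <= cpu_to_cores(limits["cpu"])
        and int(requests["gpu"]) <= int(limits["gpu"])
        and memory_to_bytes(requests["memory"]) <= memory_to_bytes(limits["memory"])
    )


# Same values as the config dictionary in this cell.
requests_profile = {"cpu": "1", "gpu": "1", "memory": "4Gi"}
limits_profile = {"cpu": "4", "gpu": "1", "memory": "8Gi"}
print(requests_within_limits(requests_profile, limits_profile))  # True
```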

In [13]:
from aioli.common.util import (
    construct_arguments,
    construct_environment,
    construct_metadata,
)
print(construct_metadata(args, {}))
print(construct_environment(args))
print(construct_arguments(args))

{'modelCategory': 'llm'}
{}
['--model', 'HuggingFaceTB/SmolLM2-360M-Instruct', '--port', '8080', '--dtype=half', '--gpu-memory-utilization', '0.8', '--enable-lora', '--lora-modules', '{"name":"math-lora","path":"/mnt/models","base_model_name":"HuggingFaceTB/SmolLM2-360M-Instruct"}']
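
The `--lora-modules` value above is assembled by string concatenation, which breaks if a name or path ever contains a quote or backslash. As a sketch, the same JSON string can be produced with `json.dumps`; the helper name `lora_module_arg` is hypothetical, and compact separators are used so the result matches the string printed above.

```python
import json


def lora_module_arg(name, path, base_model_name):
    """Build the JSON string passed after --lora-modules, with proper escaping."""
    spec = {"name": name, "path": path, "base_model_name": base_model_name}
    # Compact separators reproduce the no-whitespace form used in this notebook.
    return json.dumps(spec, separators=(",", ":"))


arg = lora_module_arg("math-lora", "/mnt/models", "HuggingFaceTB/SmolLM2-360M-Instruct")
print(arg)
```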


In [16]:
import mlflow

# Get the experiment ID
experiment_name = "Default"
experiment = mlflow.get_experiment_by_name(experiment_name)
latest_run = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1
)

artifact_uri = latest_run["artifact_uri"].iloc[0]  # search_runs returns a pandas DataFrame; take the first row
print(f"Artifact URI: {artifact_uri}")

# For a specific artifact file/folder
artifact_path = 'math-SmolLM2-360M-Instruct-savedir/peft'  # or any artifact path

full_artifact_uri = f"{artifact_uri}/{artifact_path}"
print(f"Full artifact URI: {full_artifact_uri}")

Artifact URI: s3://mlflow.aie01/0/c95a898568c14429917e9c6adad98400/artifacts
Full artifact URI: s3://mlflow.aie01/0/c95a898568c14429917e9c6adad98400/artifacts/math-SmolLM2-360M-Instruct-savedir/peft
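
Because the packaged model is pulled through the registry created earlier, the bucket in the artifact URI should be the bucket that registry points to. A minimal sketch for splitting an `s3://` URI so that this can be checked follows; `split_s3_uri` is a hypothetical helper built on the standard library.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the output above.
uri = "s3://mlflow.aie01/0/c95a898568c14429917e9c6adad98400/artifacts/math-SmolLM2-360M-Instruct-savedir/peft"
bucket, key = split_s3_uri(uri)
print(bucket)
print(key)
```

In the notebook, comparing `bucket` against `registry_request.bucket` before creating the packaged model would flag a mismatch early.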


In [17]:
r = PackagedModelRequest(
    name='packaged-model-from-sdk',
    description='packaged-model-from-sdk',
    url=full_artifact_uri,
    # url='s3://mlflow.aie01/2/1fc4eae4cb9f43928f5455a587add06b/artifacts/SmolLM2-360M-Instruct-savedir/peft',
    image='vllm/vllm-openai:v0.8.5',
    resources=resources,
    modelFormat='custom',
    arguments=construct_arguments(args),
    metadata=construct_metadata(args, {}),
    registry=registry_request.name,
)

if args.enable_caching:
    r.caching_enabled = True

if args.disable_caching:
    r.caching_enabled = False

resposne = api_instance.models_post(r)

In [18]:
resposne

PackagedModel(arguments=['--model', 'HuggingFaceTB/SmolLM2-360M-Instruct', '--port', '8080', '--dtype=half', '--gpu-memory-utilization', '0.8', '--enable-lora', '--lora-modules', '{"name":"math-lora","path":"/mnt/models","base_model_name":"HuggingFaceTB/SmolLM2-360M-Instruct"}'], caching_enabled=False, description='packaged-model-from-sdk', environment={}, id='7b74bb6a-7fff-46c2-a31c-e6329539ddda', image='vllm/vllm-openai:v0.8.5', metadata={'modelCategory': 'llm'}, format='custom', modified_at='2025-11-16T15:08:57.951195Z', name='packaged-model-from-sdk', project='', registry='b86fbfef-0d03-4f3a-bc20-a00c17084bf0', resources=ConfigurationResources(gpu_type='', limits=ResourceProfile(cpu='4', gpu='1', memory='8Gi'), requests=ResourceProfile(cpu='1', gpu='1', memory='4Gi')), url='s3://mlflow.aie01/0/c95a898568c14429917e9c6adad98400/artifacts/math-SmolLM2-360M-Instruct-savedir/peft', version=1)

<img src="../assets/mlis_packaged_1.png" alt="mlis_packaged_1" width="400">
<img src="../assets/mlis_packaged_2.png" alt="mlis_packaged_2" width="400">

# Deploy Model
In MLIS, deployments spin up the actual instances that run the user's inference services.

A deployment provides an endpoint that clients can use to make predictions.

Read and pull access is controlled by the registry's assigned keys.

In [19]:
from aiolirest.models.autoscaling import Autoscaling
from aiolirest.models.deployment_request import DeploymentRequest
from aiolirest.models.security import Security

In [20]:
api_instance = aiolirest.DeploymentsApi(restclient)

In [21]:
from argparse import Namespace

config = {
    'autoscaling_target': 1,
    'autoscaling_metric': 'rps',
    'autoscaling_max_replicas': 1,
    'autoscaling_min_replicas': 1,
}
args = Namespace(**config)
sec = Security(authenticationRequired=True)

In [22]:
auto = Autoscaling(
    metric=args.autoscaling_metric,
)

if args.autoscaling_target is not None:
    auto.target = args.autoscaling_target

if args.autoscaling_max_replicas is not None:
    auto.max_replicas = args.autoscaling_max_replicas

if args.autoscaling_min_replicas is not None:
    auto.min_replicas = args.autoscaling_min_replicas
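
Before the request is sent, the autoscaling values can be sanity-checked locally. The rules in this sketch are assumptions made for illustration (a non-negative minimum, a maximum no smaller than the minimum, a positive target); MLIS may apply different or additional rules, and `autoscaling_problems` is a hypothetical helper.

```python
def autoscaling_problems(min_replicas, max_replicas, target):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if min_replicas < 0:
        problems.append("min_replicas must not be negative")
    if max_replicas < min_replicas:
        problems.append("max_replicas must be at least min_replicas")
    if target <= 0:
        problems.append("target must be positive")
    return problems


# Values from the config dictionary above.
print(autoscaling_problems(min_replicas=1, max_replicas=1, target=1))  # []
```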

In [23]:
r = DeploymentRequest(
    name='deployment-from-sdk',
    model=resposne.name,
    security=sec,
    namespace=namespace,
    autoScaling=auto,
)
results = api_instance.deployments_post(r)

In [24]:
results

Deployment(arguments=None, auto_scaling=Autoscaling(max_replicas=1, metric='rps', min_replicas=1, target=1), canary_traffic_percent=100, cluster_name='', environment={}, goal_status='Ready', id='4f3305a0-adc0-4a24-9495-89e4c02f5fd8', last_event=None, model='7b74bb6a-7fff-46c2-a31c-e6329539ddda', modified_at='2025-11-16T15:09:25.917722Z', name='deployment-from-sdk', namespace='project-user-geun-tak-roh', node_selectors={}, priority_class_name='', project='', secondary_state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='None', traffic_percentage=0), security=Security(authentication_required=True), state=DeploymentState(endpoint='', failure_info=None, mdl_id='', native_app_name='', status='Deploying', traffic_percentage=0), status='Deploying')

In [25]:
from aioli.cli import deployment
import time

while True:
    # Look the deployment up once per iteration and reuse the result
    status = deployment.lookup_deployment(results.name, api_instance).status
    if status == 'Ready':
        print("Model is Ready!")
        break
    elif status != 'Deploying':
        print('Something went wrong. Check the deployment!')
        break
    print(status)
    time.sleep(5)

Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Deploying
Model is Ready!
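
The polling loop above runs forever if the deployment never leaves the `Deploying` state. A sketch of the same idea with a timeout is shown below. `wait_until_ready` is a hypothetical helper; it takes any zero-argument callable that returns the status string, and the `sleep` and `clock` parameters exist only so the example can run without real waiting.

```python
import time


def wait_until_ready(get_status, timeout_s=600, interval_s=5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'Ready', fails, or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "Ready":
            return status
        if status != "Deploying":
            raise RuntimeError(f"unexpected deployment status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demonstration with a fake status source instead of a real deployment.
statuses = iter(["Deploying", "Deploying", "Ready"])
print(wait_until_ready(lambda: next(statuses), sleep=lambda seconds: None))
```

In the notebook, the callable could be `lambda: deployment.lookup_deployment(results.name, api_instance).status`.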


In [26]:
deployment.lookup_deployment(results.name,api_instance).state.endpoint

'https://deployment-from-sdk.project-user-geun-tak-roh.serving.aie01.pcai.tryezmeral.com'

In [27]:
import requests

with open('/etc/secrets/ezua/.auth_token','r') as file:
    AUTH_TOKEN = file.read().strip()  # drop a trailing newline so the header value stays valid
endpoint_url = deployment.lookup_deployment(results.name,api_instance).state.endpoint
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

route = '/v1/models'
models_response = requests.get(endpoint_url+route,headers=headers,verify=False)
sample_prompt = "In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?"

for model in models_response.json()['data']:
    print(model['id'])
    payload = {
        "model": model['id'],
        "messages": [
            {
                "role": "system",
                "content": "you are a helpful math tutor, solve the question step by step"
            },
            {
                "role": "user",
                "content": sample_prompt
            }
        ]
    }
    route = '/v1/chat/completions'
    chat_response = requests.post(endpoint_url+route,headers=headers,verify=False,json=payload)
    print(f"*** {model['id']} ***\n{chat_response.json()['choices'][0]['message']['content']}")



HuggingFaceTB/SmolLM2-360M-Instruct
*** HuggingFaceTB/SmolLM2-360M-Instruct ***
To find out how long Mark was on the sideline, we need to know the duration between his consecutive starts. Let's first determine the first start.

During Mark's first 20 minutes, he would have had 20 minutes of rest. Since he had to continue gameplay for another 35 minutes, the time between his last start to the second start must be 90 - 55 = 35 minutes.

Now, let's solve the remaining part of the equation. We know Mark started his second 35-minute period with the idea of an ongoing game. Thus, he actually played for 35 minutes in total, but then quit after one more play, which is equivalent to 20 minutes (35 - 55).

We can infer this by focusing on the duration he actually played during his two 35-minute periods. If Mark had 20 minutes of rest and 35 minutes in action, this means he took 20 + 35 = 55 minutes off the actual section he played out the first time, and thus the whole objective phase of a game 



*** math-lora ***
First, we need to find the time Mark played. He played for 20 minutes and rested after. So, he played 20 - 20 = 0 minutes.

Next, we need to find the total time his teammates played, given that Mark rested. There are 20 minutes in 90 minutes. So, the total time his teammates played before Mark rested is 90 minutes / 20 minutes per minute = 4.5 minutes. After Mark rested, the total time his teammates played before him is 1 - 0 = 0 minutes.

Now, we need to find the remaining time, before any regular football play, that Mark played.

Mark didn't play the whole 90 minutes. We found out that he played 0 minutes before rest. Let's call this time t.

Since he played 0 minutes after rest, he must have finished the entire 90 minutes before rest. This means he played 90 - t minutes.

We know that the total time he played before Mark rested is 4.5 minutes. So, we can write the equation: 90 - t = 4.5.

Now, we need to find the value of t, which represents his time playing before
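
As a reference point when reading the two responses above, the sample prompt reduces to a single subtraction, assuming "on the sideline" means all game time that Mark did not spend playing. The variable names below are illustrative.

```python
game_minutes = 90
played_minutes = [20, 35]  # the two stretches Mark played

# Time on the sideline = total game time minus total playing time.
sideline_minutes = game_minutes - sum(played_minutes)
print(sideline_minutes)  # 35
```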