# Deploy a Model to an Endpoint

Welcome to this tutorial on deploying a model or a model bundle to a SambaNova dedicated node!

Before you get started, please follow the setup instructions in the [README](./README.md).

## 1. Imports

In [1]:
import sys
sys.version

'3.11.11 (main, Dec 11 2024, 10:28:39) [Clang 14.0.6 ]'

In [2]:
from IPython.display import display, HTML
display(HTML("<style>:root { --jp-notebook-max-width: 100% !important; }</style>"))
import json
import os
from dotenv import load_dotenv
import pprint
load_dotenv()

True

In [3]:
from snsdk import SnSdk

## 2. Set up environment connector

Connect to the remote dedicated environment using the variables defined in `.env`.

In [4]:
sn_env = SnSdk(host_url=os.getenv("SAMBASTUDIO_HOST_NAME"), 
                   access_key=os.getenv("SAMBASTUDIO_ACCESS_KEY"), 
                   tenant_id=os.getenv("SAMBASTUDIO_TENANT_NAME"))
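
`os.getenv` returns `None` for any variable that is missing from `.env`, so a typo there may only surface later, when the connection is first used. A minimal sketch of an up-front check, assuming the three variable names used above; `missing_env_vars` is a hypothetical helper and not part of `snsdk`:

```python
import os

REQUIRED_VARS = (
    "SAMBASTUDIO_HOST_NAME",
    "SAMBASTUDIO_ACCESS_KEY",
    "SAMBASTUDIO_TENANT_NAME",
)

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {
    "SAMBASTUDIO_HOST_NAME": "https://example.invalid",
    "SAMBASTUDIO_ACCESS_KEY": "",
}
print(missing_env_vars(example_env))
# ['SAMBASTUDIO_ACCESS_KEY', 'SAMBASTUDIO_TENANT_NAME']
```

Calling `missing_env_vars()` with no arguments checks the real environment after `load_dotenv()` has run.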

## 3. Create or select a project

Projects are a way to organize endpoints and training/inference jobs.

#### List available projects
You can list the existing projects in which an endpoint can be created for model deployment.

In [5]:
projects = sn_env.list_projects()["projects"]
sorted([project["name"] for project in projects])

['benchmarking', 'test_project', 'test_project1']

#### Create a new project
If you do not wish to use an existing project, you may create a new one.

In [6]:
project_name = "test_project"
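try:
    new_project = sn_env.create_project(
        project_name=project_name,
        description="A test project with a test endpoint"
    )
    project_id = new_project['id']
except Exception:
    # Creation failed or returned no id (for example, because the project
    # already exists), so fall back to looking up the existing project.
    new_project = sn_env.project_info(project_name)
    project_id = new_project['id']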
project_id

'b004dc34-42a7-4529-90e4-8ce71e15a715'
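
An alternative to relying on the exception path is to check the project list first and create the project only when it is absent. The sketch below uses only the calls already shown in this section (`list_projects`, `create_project`, `project_info`); `get_or_create_project_id` is a hypothetical helper name, and the sketch assumes `project_info` returns a dictionary with an `id` key, as in the cell above:

```python
def get_or_create_project_id(sn_env, project_name, description=""):
    """Return the id of `project_name`, creating the project only if it is absent."""
    existing = {p["name"] for p in sn_env.list_projects()["projects"]}
    if project_name not in existing:
        sn_env.create_project(project_name=project_name, description=description)
    # project_info supplies the id in both cases, as in the fallback branch above
    return sn_env.project_info(project_name)["id"]
```

It would be called as `get_or_create_project_id(sn_env, "test_project", "A test project with a test endpoint")`.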

#### Deleting a project

If required, a project can be deleted using the `sn_env.delete_project(project_name)` function. Please be sure to stop and delete all endpoints and jobs before deleting a project.

## 4. Select model or bundle to deploy

#### List models

Get the complete list of models. This includes models that are:
  - actually available
  - still in the process of uploading
  - held in remote storage, from which they can be made available
  - not in a usable state

In [7]:
models = sn_env.list_models()["models"]
len(models)

153
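
To see how the full list splits across states like the ones described above, the entries can be tallied by their `status` field. A small sketch using `collections.Counter`; `count_by_status` is a hypothetical helper, and apart from `'Available'` the status string in the sample data is a placeholder, not an actual value returned by the SDK:

```python
from collections import Counter

def count_by_status(models):
    """Count model entries per value of their 'status' field."""
    return Counter(m.get("status", "<missing>") for m in models)

# Sample data for illustration only; 'SomeOtherStatus' is a made-up placeholder
sample_models = [
    {"status": "Available"},
    {"status": "Available"},
    {"status": "SomeOtherStatus"},
]
print(count_by_status(sample_models).most_common())
# [('Available', 2), ('SomeOtherStatus', 1)]
```

With the real data, `count_by_status(models)` would take the list returned by `sn_env.list_models()["models"]`.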

Filter down to the models that are actually available in the environment.

In [8]:
available_models = [m for m in models if m['status'] == 'Available']
len(available_models)

32

Print the names of the available models.

In [9]:
sorted([m["model_checkpoint_name"] for m in available_models])

['DeepSeek-R1',
 'DeepSeek-R1-Distill-Llama-70B',
 'DeepSeek-V3',
 'Meta-Llama-3-70B-Instruct',
 'Meta-Llama-3-8B-Instruct',
 'Meta-Llama-3.1-405B-Instruct',
 'Meta-Llama-3.1-70B-Instruct',
 'Meta-Llama-3.1-70B-SD-Llama-3.2-1B',
 'Meta-Llama-3.1-8B-Instruct',
 'Meta-Llama-3.2-1B-Instruct',
 'Meta-Llama-3.3-70B-Instruct',
 'Meta-Llama-3.3-70B-SD-Llama-3.2-1B-TP16',
 'Mistral-7B-Instruct-V0.2',
 'QwQ-32B-Preview',
 'QwQ-32B-Preview-SD-Qwen-2.5-QWQ-0.5B',
 'Qwen 2.5 72B TP16',
 'Qwen-2.5-72B-SD-Qwen-2.5-0.5B',
 'Qwen2-72B-Instruct',
 'Qwen2-7B-Instruct',
 'Qwen2.5-0.5B-Instruct',
 'Qwen2.5-0.5B-SFT-Instruct',
 'Qwen2.5-72B-Instruct',
 'Qwen2.5-7B-Instruct',
 'Salesforce--Llama-xLAM-2-70b-fc-r',
 'Salesforce--Llama-xLAM-2-70b-fc-r-SD-Llama-3.2-1B-TP16-ge-16k',
 'Salesforce--Llama-xLAM-2-70b-fc-r-SD-Llama-3.2-1B-TP16-le-16k',
 'Salesforce--Llama-xLAM-2-8b-fc-r',
 'Samba-1 Turbo',
 'e5-mistral-7B-instruct',
 'meta-llama-3.1-70b',
 'numind--NuExtract-1.5-tiny',
 'qwen_llama_salesforce']

#### Select model to deploy

In [10]:
selected_model = "numind--NuExtract-1.5-tiny"

## 5. Create endpoint

In [11]:
endpoint_name = selected_model.lower().replace('_','-')
endpoint = sn_env.create_endpoint(
    project=project_name,
    endpoint_name=endpoint_name,
    description="Endpoint for " + selected_model,
    model_checkpoint=selected_model,
    model_version=1,
    instances=1,
    hyperparams='{"model_parallel_rdus": "16", "num_tokens_at_a_time": "10"}',
    rdu_arch="SN40L-16",
    inference_api_openai_compatible=True
)
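
The `hyperparams` argument above is a JSON object passed as a string, with every value itself a string. Hand-written JSON is easy to get wrong, so one option is to build it with `json.dumps`. This is a sketch under that assumption; `build_hyperparams` is a hypothetical helper, and whether the API also accepts non-string values is not covered here:

```python
import json

def build_hyperparams(**params):
    """Serialize keyword arguments to a JSON object whose values are all strings."""
    return json.dumps({key: str(value) for key, value in params.items()})

hyperparams = build_hyperparams(model_parallel_rdus=16, num_tokens_at_a_time=10)
print(hyperparams)
# {"model_parallel_rdus": "16", "num_tokens_at_a_time": "10"}
```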

#### Check the status of the endpoint

In [12]:
endpoint = sn_env.endpoint_info(project_name, endpoint_name)
endpoint['status']

'SettingUp'
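
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below takes any zero-argument callable that returns the current status, so it does not depend on the SDK directly; `wait_for_status` is a hypothetical helper, and the timeout and polling interval are arbitrary defaults rather than recommended values:

```python
import time

def wait_for_status(get_status, target="Live", timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll `get_status()` until it returns `target`; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_s}s")
        sleep(poll_s)
```

In this notebook it could be called as `wait_for_status(lambda: sn_env.endpoint_info(project_name, endpoint_name)["status"])`. The loop only distinguishes the target status from everything else, so an endpoint that ends up in a failed state would be reported through the timeout.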

## 6. Get Endpoint Details
To test the endpoint, we will need to obtain some of its information. Note that this information can be obtained even while the model is setting up.

#### Get the endpoint URL

In [13]:
endpoint_url = os.getenv("SAMBASTUDIO_HOST_NAME") + "/v1/" + endpoint["id"]
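
The concatenation above fails with a `TypeError` if `SAMBASTUDIO_HOST_NAME` is unset, and produces a double slash if the host value ends in `/`. A small sketch that guards against both, assuming the same `<host>/v1/<endpoint id>` layout; `build_endpoint_url` is a hypothetical helper:

```python
def build_endpoint_url(host, endpoint_id):
    """Join host and endpoint id into '<host>/v1/<endpoint_id>'."""
    if not host:
        raise ValueError("host is empty; is SAMBASTUDIO_HOST_NAME set?")
    return f"{host.rstrip('/')}/v1/{endpoint_id}"

print(build_endpoint_url("https://example.invalid/", "abc123"))
# https://example.invalid/v1/abc123
```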

#### Get the default endpoint API key
Note that:
  - New keys can be added using the `sn_env.add_endpoint_api_key` API.
  - Any key can be revoked using the `sn_env.edit_endpoint_api_key` API.

In [14]:
endpoint_key = endpoint["api_keys"][0]["api_key"]
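
Notebooks are often shared with their outputs, so it can be safer to display only a masked form of the key when confirming that it was retrieved. A minimal sketch; `mask_secret` is a hypothetical helper and the key shown is a dummy value:

```python
def mask_secret(secret, visible=4):
    """Show only the last `visible` characters of `secret`."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

print(mask_secret("abcdef123456"))
# ********3456
```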

#### Get model names in the endpoint

In [15]:
endpoint_model_id = endpoint['targets'][0]["model"]
model_info = sn_env.model_info(endpoint_model_id, job_type="deploy")

#### Check if the model is standalone or composite (bundle)

In [16]:
model_info["type"]

'Basic'

## 7. Test Endpoint
Once the endpoint is live, you can test it through its OpenAI-compatible API.

#### Make sure endpoint is live

In [19]:
endpoint = sn_env.endpoint_info(project_name, endpoint_name)
endpoint['status']

'Live'

#### Create test messages to send to the endpoint

In [20]:
test_prompt = """<|input|>
 ### Template:
 {
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}
### Text:
We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>
<|output|>
"""

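The prompt above wraps a JSON template and a source text between `<|input|>` and `<|output|>` markers. To reuse that layout with other templates, the string can be assembled programmatically. This sketch mirrors the layout shown above and assumes it is what the selected model expects; `build_extraction_prompt` is a hypothetical helper:

```python
import json

def build_extraction_prompt(template, text):
    """Wrap a JSON template and a source text in the <|input|>/<|output|> markers."""
    template_json = json.dumps(template, indent=4)
    return f"<|input|>\n### Template:\n{template_json}\n### Text:\n{text}\n<|output|>\n"

example_prompt = build_extraction_prompt(
    {"Model": {"Name": "", "Architecture": []}},
    "We introduce Mistral 7B, a language model ...",
)
print(example_prompt)
```
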
#### Send test messages to the endpoint
The endpoint in this example has a single standalone model deployed, so the test runs against that model alone. For a model bundle, the same test can be repeated for each constituent model.

**Note: If a model uses speculative decoding, its name will not match the name expected by the endpoint. Instead, we need to get and use the name of the target model.**

In [21]:
import openai

client = openai.OpenAI(
    api_key=endpoint_key,
    base_url=endpoint_url,
)
  
model_name = selected_model

# Check for speculative decoding
constituent_info = sn_env.model_info(selected_model, job_type="deploy")
if 'target_model' in constituent_info['config']:
    target_name = constituent_info['config']['target_model']        
    if len(target_name) > 0:
        model_name = target_name

# Send messages to endpoint
response = client.completions.create(
    model=model_name,
    prompt=test_prompt,    
    temperature=0.01
)

print(response.choices[0].text)
print()

{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "7\u2013billion",
        "Number of max token": "",
        "Architecture": [
            "grouped-query attention",
            "sliding window attention"
        ]
    },
    "Usage": {
        "Use case": [
            "superior performance and efficiency",
            "reasoning",
            "mathematics",
            "code generation"
        ],
        "Licence": "Apache 2.0"
    }
}
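
The completion is returned as text, so it needs to be parsed before the extracted fields can be used in code. A minimal sketch with the standard-library `json` module; `parse_json_output` is a hypothetical helper, and the sample string stands in for `response.choices[0].text`:

```python
import json

def parse_json_output(text):
    """Return the parsed JSON value, or None if `text` is not valid JSON."""
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

parsed = parse_json_output('{"Model": {"Name": "Mistral 7B"}}\n')
print(parsed["Model"]["Name"])
# Mistral 7B
```

Parsing also decodes escapes such as the `\u2013` in the output above into the corresponding character, here an en dash.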
  



## 8. Stopping/deleting an Endpoint
An endpoint can be:
  - stopped: `sn_env.stop_endpoint(project_name, endpoint_name)`
  - deleted: `sn_env.delete_endpoint(project_name, endpoint_name)`
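
For cleanup at the end of a session, the two calls can be combined into one function. This is a sketch only: `teardown_endpoint` is a hypothetical helper, and whether an endpoint has to finish stopping before it can be deleted is not covered in this tutorial, so the delete step is opt-in:

```python
def teardown_endpoint(sn_env, project_name, endpoint_name, delete=False):
    """Stop the endpoint and, if `delete` is True, delete it afterwards."""
    sn_env.stop_endpoint(project_name, endpoint_name)
    if delete:
        sn_env.delete_endpoint(project_name, endpoint_name)
```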