# Foundry Local

https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/what-is-foundry-local

Foundry Local is currently available in public preview. Preview releases provide early access to features that are in active deployment.
Features, approaches, and processes may change or have limited capabilities before General Availability (GA).

In [1]:
#%pip install foundry-local-sdk

In [2]:
import openai
import os
import requests
import json

from foundry_local import FoundryLocalManager

## List of models

In [3]:
manager = FoundryLocalManager()

In [4]:
# List available models in the catalog
catalog = manager.list_catalog_models()
print(f"Available models in the catalog: {catalog}")

Available models in the catalog: [FoundryModelInfo(alias=phi-4, id=Phi-4-cuda-gpu, runtime=cuda, file_size=8570 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-generic-gpu, runtime=webgpu, file_size=8570 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-generic-cpu, runtime=cpu, file_size=10403 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-128k, id=Phi-3-mini-128k-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-128k, id=Phi-3-mini-128k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-128k, id=Phi-3-mini-128k-instruct-generic-cpu, runtime=cpu, file_size=2600 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-4k, id=

In [5]:
for idx, item in enumerate(catalog, start=1):
    print(idx, item)
    print()

1 alias='phi-4' id='Phi-4-cuda-gpu' version='1' runtime=<ExecutionProvider.CUDA: 'CUDAExecutionProvider'> uri='azureml://registries/azureml/models/Phi-4-cuda-gpu/versions/1' file_size_mb=8570 prompt_template={'system': '<|system|>\n{Content}<|im_end|>', 'user': '<|user|>\n{Content}<|im_end|>', 'assistant': '<|assistant|>\n{Content}<|im_end|>', 'prompt': '<|user|>\n{Content}<|im_end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'

2 alias='phi-4' id='Phi-4-generic-gpu' version='1' runtime=<ExecutionProvider.WEBGPU: 'WebGpuExecutionProvider'> uri='azureml://registries/azureml/models/Phi-4-generic-gpu/versions/1' file_size_mb=8570 prompt_template={'system': '<|system|>\n{Content}<|im_end|>', 'user': '<|user|>\n{Content}<|im_end|>', 'assistant': '<|assistant|>\n{Content}<|im_end|>', 'prompt': '<|user|>\n{Content}<|im_end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'

3 alias='phi-4

In [6]:
# Print only the model ID of each catalog entry
for item in catalog:
    print(item.id)

Phi-4-cuda-gpu
Phi-4-generic-gpu
Phi-4-generic-cpu
Phi-3-mini-128k-instruct-cuda-gpu
Phi-3-mini-128k-instruct-generic-gpu
Phi-3-mini-128k-instruct-generic-cpu
Phi-3-mini-4k-instruct-cuda-gpu
Phi-3-mini-4k-instruct-generic-gpu
Phi-3-mini-4k-instruct-generic-cpu
mistralai-Mistral-7B-Instruct-v0-2-cuda-gpu
mistralai-Mistral-7B-Instruct-v0-2-generic-gpu
mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
Phi-3.5-mini-instruct-cuda-gpu
Phi-3.5-mini-instruct-generic-gpu
Phi-3.5-mini-instruct-generic-cpu
deepseek-r1-distill-qwen-14b-cuda-gpu
deepseek-r1-distill-qwen-14b-generic-gpu
deepseek-r1-distill-qwen-14b-generic-cpu
deepseek-r1-distill-qwen-7b-cuda-gpu
deepseek-r1-distill-qwen-7b-generic-gpu
deepseek-r1-distill-qwen-7b-generic-cpu
qwen2.5-0.5b-instruct-cuda-gpu
qwen2.5-0.5b-instruct-generic-gpu
qwen2.5-0.5b-instruct-generic-cpu
qwen2.5-1.5b-instruct-cuda-gpu
qwen2.5-1.5b-instruct-generic-gpu
qwen2.5-1.5b-instruct-generic-cpu
qwen2.5-coder-7b-instruct-cuda-gpu
qwen2.5-coder-7b-instruct-generi

In [7]:
len(catalog)

50
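
The catalog output above lists several variants (CUDA, generic GPU, CPU) under each alias. A minimal sketch of grouping the IDs by alias follows; it relies only on the `alias` and `id` attributes visible in the output above. The `group_ids_by_alias` helper and the `sample` list are hypothetical stand-ins so that the snippet runs without the Foundry Local service.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_ids_by_alias(models):
    """Map each alias to the list of model IDs (variants) listed under it."""
    grouped = defaultdict(list)
    for m in models:
        grouped[m.alias].append(m.id)
    return dict(grouped)


# Stand-in entries copied from the catalog output above. In the notebook,
# pass `catalog` (the result of manager.list_catalog_models()) instead.
sample = [
    SimpleNamespace(alias="phi-4", id="Phi-4-cuda-gpu"),
    SimpleNamespace(alias="phi-4", id="Phi-4-generic-gpu"),
    SimpleNamespace(alias="phi-4", id="Phi-4-generic-cpu"),
    SimpleNamespace(alias="phi-3.5-mini", id="Phi-3.5-mini-instruct-cuda-gpu"),
]

for alias_name, ids in group_ids_by_alias(sample).items():
    print(f"{alias_name}: {ids}")
```

In the notebook, `group_ids_by_alias(catalog)` should give one entry per alias, which is easier to scan than the flat list of 50 IDs.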

In [8]:
alias = "phi-3.5-mini"

In [9]:
# Download the model to the local cache, then load it into the inference service
manager.download_model(alias)
model_info = manager.load_model(alias)
print(f"Model info:\n{model_info}")

Model info:
alias='phi-3.5-mini' id='Phi-3.5-mini-instruct-cuda-gpu' version='1' runtime=<ExecutionProvider.CUDA: 'CUDAExecutionProvider'> uri='azureml://registries/azureml/models/Phi-3.5-mini-instruct-cuda-gpu/versions/1' file_size_mb=2181 prompt_template={'prompt': '<|user|>\n{Content}<|end|>\n<|assistant|>', 'assistant': '<|assistant|>\n{Content}<|end|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'


In [10]:
# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache:\n{local_models}")

Models in cache:
[FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-cuda-gpu, runtime=cuda, file_size=8570 MB, license=MIT)]


In [11]:
# List loaded models
loaded = manager.list_loaded_models()
print(f"Models running in the service:\n{loaded}")

# Unload a model
manager.unload_model(alias)

Models running in the service:
[FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT)]
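
The load/unload pair above can be wrapped so that the model is unloaded even if the code in between raises. This is only a sketch: `loaded_model` is a hypothetical helper, not part of the SDK, and it assumes nothing beyond the `load_model(alias)` and `unload_model(alias)` calls used in this notebook. `FakeManager` is a stand-in so the snippet runs without the service.

```python
from contextlib import contextmanager


@contextmanager
def loaded_model(manager, alias):
    """Load `alias` on entry and always unload it on exit."""
    info = manager.load_model(alias)
    try:
        yield info
    finally:
        manager.unload_model(alias)


class FakeManager:
    """Stand-in that records calls; replace with FoundryLocalManager()."""

    def __init__(self):
        self.calls = []

    def load_model(self, alias):
        self.calls.append(("load", alias))
        return f"info for {alias}"

    def unload_model(self, alias):
        self.calls.append(("unload", alias))


fake = FakeManager()
with loaded_model(fake, "phi-3.5-mini") as info:
    print(info)
print(fake.calls)
```

With the real manager, the body of the `with` block is where the chat completion calls would go.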


## Testing

In [12]:
# Streaming

alias = "phi-3.5-mini"

manager = FoundryLocalManager(alias)

client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(model=manager.get_model_info(alias).id,
                                        messages=[{
                                            "role": "user",
                                            "content": "What is pi?"
                                        }],
                                        stream=True)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

 Pi (π) is a mathematical constant representing the ratio of a circle' extruded circumference to its diameter. It is an irrational number, which means it cannot be expressed as a simple fraction and its decimal representation goes on forever without repeating. Pi is approximately equal to 3.14159, but its decimal places continue infinitely without any pattern. In mathematics, pi is crucial for calculations involving circles, such as finding the area or the circumference. The area of a circle is calculated as π times the square of its radius (A = πr²), and the circumference is calculated as 2π times the radius (C = 2πr). Pi is also transcendental, meaning it is not a root of any non-zero polynomial equation with rational coefficients, which was proven by Ferdinand von Lindemann in 1882. This proof contributed to the understanding that pi cannot be the solution to any algebraic equation
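
When the full text is needed after streaming (for logging or post-processing), the chunks can be accumulated instead of only printed. The sketch below uses the same `chunk.choices[0].delta.content` access as the loop above; `collect_stream` and the fake chunks are hypothetical, built with `SimpleNamespace` so the snippet runs without a model.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text pieces of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Skip chunks that carry no choices at all
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:
            parts.append(piece)
    return "".join(parts)


def _fake_chunk(text):
    """Build an object shaped like a streaming chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [_fake_chunk("Pi is "), _fake_chunk(None), _fake_chunk("about 3.14159.")]
print(collect_stream(fake_stream))
```

In the notebook, `collect_stream(stream)` would replace the printing loop. A stream can only be consumed once, so use one or the other.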

In [13]:
resp = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{
        "role": "user",
        "content": "What is the Dottie number?"
    }],
)

resp

ChatCompletion(id='chat.id.201', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The Dottie number refers to a specific constant value in mathematics, particularly in the context of dynamical systems and chaos theory. It is the only real number that remains invariant under the iterative application of the logistic map at a certain parameter value. The logistic map is a polynomial mapping (equivalently, a recursive function) of degree 2, often cited as an example of how complex, chaotic behavior can arise from very simple non-linear dynamical equations.\n\nThe logistic map is defined as:\n\nx_{n+1} = r * x_n * (1 - x_n)\n\nHere, r is a parameter, and x_n represents the state of the system at the nth iteration.\n\nThe Dottie number is the stable fixed point of the logistic map when r is set to 3.7. It is the value of x for which:\n\nx = 3.7', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], name=None, too

In [14]:
print(resp.choices[0].message.content)

 The Dottie number refers to a specific constant value in mathematics, particularly in the context of dynamical systems and chaos theory. It is the only real number that remains invariant under the iterative application of the logistic map at a certain parameter value. The logistic map is a polynomial mapping (equivalently, a recursive function) of degree 2, often cited as an example of how complex, chaotic behavior can arise from very simple non-linear dynamical equations.

The logistic map is defined as:

x_{n+1} = r * x_n * (1 - x_n)

Here, r is a parameter, and x_n represents the state of the system at the nth iteration.

The Dottie number is the stable fixed point of the logistic map when r is set to 3.7. It is the value of x for which:

x = 3.7
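
The answer printed above should not be taken at face value. The Dottie number is conventionally defined as the unique real solution of cos(x) = x (about 0.739085), not as a fixed point of the logistic map. A quick independent check, iterating cosine until the value stops changing (`dottie_number` is a hypothetical helper name):

```python
import math


def dottie_number(tol=1e-15, max_iter=1000):
    """Approximate the fixed point of cos(x) = x by repeated application of cos."""
    x = 1.0
    for _ in range(max_iter):
        nxt = math.cos(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x


d = dottie_number()
print(d)                     # approximately 0.739085
print(abs(math.cos(d) - d))  # residual, close to zero
```

Spot checks like this are a cheap way to validate factual answers from small local models.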


## Other models

In [15]:
alias = "phi-4"

manager = FoundryLocalManager()

In [16]:
# Download the model to the local cache, then load it into the inference service
manager.download_model(alias)
model_info = manager.load_model(alias)
print(f"Model info:\n{model_info}")

Model info:
alias='phi-4' id='Phi-4-cuda-gpu' version='1' runtime=<ExecutionProvider.CUDA: 'CUDAExecutionProvider'> uri='azureml://registries/azureml/models/Phi-4-cuda-gpu/versions/1' file_size_mb=8570 prompt_template={'system': '<|system|>\n{Content}<|im_end|>', 'user': '<|user|>\n{Content}<|im_end|>', 'assistant': '<|assistant|>\n{Content}<|im_end|>', 'prompt': '<|user|>\n{Content}<|im_end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'


In [17]:
# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache:\n{local_models}")

Models in cache:
[FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-cuda-gpu, runtime=cuda, file_size=8570 MB, license=MIT)]


In [18]:
# List loaded models
loaded = manager.list_loaded_models()
print(f"Models running in the service:\n{loaded}")

# Unload a model
manager.unload_model(alias)

Models running in the service:
[FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-cuda-gpu, runtime=cuda, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-cuda-gpu, runtime=cuda, file_size=8570 MB, license=MIT)]


In [19]:
alias = "phi-4"

manager = FoundryLocalManager(alias)

client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(model=manager.get_model_info(alias).id,
                                        messages=[{
                                            "role": "user",
                                            "content": "What is the capital of France?"
                                        }],
                                        stream=True)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

The capital of France is Paris.

## REST API

In [20]:
alias = "phi-3.5-mini"

manager = FoundryLocalManager(alias)
url = manager.endpoint + "/chat/completions"

payload = {
    "model": manager.get_model_info(alias).id,
    "messages": [{
        "role": "user",
        "content": "Hello",
    }]
}

headers = {"Content-Type": "application/json"}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json()["choices"][0]["message"]["content"])

 Hello! I'm Phi, an AI developed by Microsoft. How can I help you today?


In [21]:
response.json()

{'model': None,
 'choices': [{'delta': {'role': 'assistant',
    'content': " Hello! I'm Phi, an AI developed by Microsoft. How can I help you today?",
    'name': None,
    'tool_call_id': None,
    'function_call': None,
    'tool_calls': []},
   'message': {'role': 'assistant',
    'content': " Hello! I'm Phi, an AI developed by Microsoft. How can I help you today?",
    'name': None,
    'tool_call_id': None,
    'function_call': None,
    'tool_calls': []},
   'index': 0,
   'finish_reason': 'stop',
   'finish_details': None,
   'logprobs': None}],
 'usage': None,
 'system_fingerprint': None,
 'service_tier': None,
 'created': 1749472492,
 'CreatedAt': '2025-06-09T12:34:52+00:00',
 'id': 'chat.id.210',
 'StreamEvent': None,
 'IsDelta': False,
 'Successful': True,
 'error': None,
 'HttpStatusCode': 0,
 'HeaderValues': None,
 'object': 'chat.completion'}
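
The response body above includes an `error` field and a `choices` list, so the reply can be extracted a little more defensively than by direct indexing. The sketch below assumes only the fields visible in that output; `build_chat_payload`, `extract_reply`, and `sample_body` (a trimmed copy of the output above) are hypothetical names used for illustration.

```python
def build_chat_payload(model_id, user_text):
    """Build the request body used for the /chat/completions call."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": user_text}],
    }


def extract_reply(body):
    """Return the assistant text from a parsed response body."""
    if body.get("error"):
        raise RuntimeError(f"Service returned an error: {body['error']}")
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    # The replies shown above start with a space, so strip surrounding whitespace
    return choices[0]["message"]["content"].strip()


# Trimmed copy of the response shown above
sample_body = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": " Hello! I'm Phi, an AI developed by Microsoft. How can I help you today?",
        },
        "index": 0,
        "finish_reason": "stop",
    }],
    "error": None,
    "object": "chat.completion",
}

print(build_chat_payload("Phi-3.5-mini-instruct-cuda-gpu", "Hello"))
print(extract_reply(sample_body))
```

In the notebook, `extract_reply(response.json())` would take the place of the chained indexing in the cell above.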

## CLI

In [22]:
!foundry -h

Description:
  Foundry Local CLI: Run AI models on your device.
  
  🚀 Getting started:
  
     1. To view available models: foundry model list
     2. To run a model: foundry model run <model>
  
     EXAMPLES:
         foundry model run phi-3-mini-4k

Usage:
  foundry [command] [options]

Options:
  -?, -h, --help  Show help and usage information
  --version       Show version information
  --license       Display foudry license information

Commands:
  model    Discover, run and manage models
  cache    Manage the local cache
  service  Manage the local model inference service



In [23]:
!foundry --version

0.4.91+269dfd9ed1


In [24]:
!foundry model list

Alias                          Device     Task               File Size    License      Model ID            
-----------------------------------------------------------------------------------------------
phi-4                          GPU        chat-completion    8.37 GB      MIT          Phi-4-cuda-gpu      
                               GPU        chat-completion    8.37 GB      MIT          Phi-4-generic-gpu   
                               CPU        chat-completion    10.16 GB     MIT          Phi-4-generic-cpu   
--------------------------------------------------------------------------------------------------------
phi-3-mini-128k                GPU        chat-completion    2.13 GB      MIT          Phi-3-mini-128k-instruct-cuda-gpu
                               GPU        chat-completion    2.13 GB      MIT          Phi-3-mini-128k-instruct-generic-gpu
                               CPU        chat-completion    2.54 GB      MIT          Phi-3-mini-128k-instruct-generic-cp

In [25]:
!foundry model info phi-4-mini-reasoning   

Alias                          Device     Task               File Size    License      Model ID            
phi-4-mini-reasoning           GPU        chat-completion    3.15 GB      MIT          Phi-4-mini-reasoning-cuda-gpu


In [26]:
!foundry service restart

## Status

In [27]:
manager.is_service_running()

True

In [28]:
manager.service_uri

'http://localhost:5273'

In [29]:
manager.endpoint

'http://localhost:5273/v1'

In [30]:
manager.list_loaded_models()

[]