# wordslab-notebooks-lib.chat

> Chat with local and remote LLMs in the context of the wordslab-notebooks environment 

**WORK IN PROGRESS - not exported yet**

In [None]:
## #| default_exp chat

In [None]:
## #| export
import ollama  # used later as ollama.list(), ollama.show(), ...
from ollama import chat
from openai import OpenAI

import os
from IPython.display import display, Markdown, clear_output

from wordslab_notebooks_lib.env import WordslabNotebooksEnv
from wordslab_notebooks_lib.notebook import JupyterlabNotebook

## ollama chat client

In [None]:
env = WordslabNotebooksEnv()

In [None]:
model = env.default_model_chat
model

'gemma3:27b'

In [None]:
messages = [{'role': 'user', 'content': 'In one sentence: why is the sky blue?'}]

In [None]:
def ollama_chat_stream(model, messages):
    stream = chat(model=model, messages=messages, stream=True)
    for chunk in stream:
        yield chunk['message']['content']    

In [None]:
stream = ollama_chat_stream(model, messages)

streamed_text = ""
for chunk in stream:
    streamed_text += chunk  
    clear_output(wait=True)
    display(Markdown(streamed_text))

The sky is blue because of a phenomenon called Rayleigh scattering, where shorter wavelengths of light (like blue and violet) are scattered more by the Earth's atmosphere than other colors, making blue appear to dominate our view.





## openrouter chat client

In [None]:
api_key = os.environ["OPENROUTER_API_KEY"]

In [None]:
model = "mistralai/mistral-small-creative"
model

'mistralai/mistral-small-creative'

In [None]:
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

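def openrouter_chat_stream(model, messages):
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        # Some chunks carry no choices or no text (e.g. role-only or final chunks):
        # skip them so the caller never receives None
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content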

In [None]:
stream = openrouter_chat_stream(model, messages)

streamed_text = ""
for chunk in stream:
    streamed_text += chunk  
    clear_output(wait=True)
    display(Markdown(streamed_text))
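
The two display loops above are identical apart from the generator they consume. A minimal sketch of a shared helper follows, assuming a hypothetical name `accumulate_stream` and an optional `on_update` callback (neither exists in the library). It concatenates text chunks, skips empty ones, and passes the text accumulated so far to the callback.

```python
def accumulate_stream(chunks, on_update=None):
    """Concatenate streamed text chunks and return the full text.

    `chunks` is any iterable of strings (None or '' entries are ignored).
    `on_update`, if given, is called with the text accumulated so far.
    """
    text = ""
    for chunk in chunks:
        if not chunk:
            # Skip None or empty deltas instead of failing on `str + None`
            continue
        text += chunk
        if on_update is not None:
            on_update(text)
    return text


# Offline check with a fake stream: no model or API key is needed
updates = []
full_text = accumulate_stream(["The sky ", None, "is blue."], on_update=updates.append)
```

In the notebook, `on_update` could be a small function that calls `clear_output(wait=True)` and then `display(Markdown(text))`.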

## Notebook chat

In [None]:
notebook = JupyterlabNotebook()

In [None]:
notebook.cell_id

'a4fa7f1a-aafc-472a-9363-bd65f8860fc5'

In [None]:
notebook.cell_id

'c8a8cae6-d10d-4050-8c6a-04d7b12409bd'

In [None]:
## #| exports
class NbChat:

    
    def __init__(self, model=None):
        if model is None:
            env = WordslabNotebooksEnv()
            self.model = env.default_model_chat
        else:
            self.model = model

        self.prompt_template = """You are an AI assistant designed to help the user learn and solve problems interactively.
You run in an interactive Jupyter notebook environment where the user can write code, take notes, and chat with you.
You work with the user step-by-step rather than just giving complete answers. You are especially good at:
- Breaking down complex topics into manageable pieces
- Helping the user work through coding problems in Python
- Encouraging the user to try things themselves, with guidance when they need it
- Adapting to the user's level and interests
You are designed to be collaborative - you ask questions, check the user's understanding, and let them explore ideas rather than just lecturing. 
You can help with teaching, coding, problem-solving, research, and creative projects.
What sets you apart is your teaching approach - you focus on helping the user develop their skills rather than just giving them answers. 
You provide information in small chunks, check in frequently to see if things make sense, and encourage the user to try things themselves.

# Jupyter notebook - all cells above the user instruction in XML

{notebook_context}

# User Instruction - in the last code cell of the notebook

Execute this user instruction in the context of the code cells above:

{user_instruction}
"""

    async def __call__(self, user_instruction, timeout=1):
        notebook_context = await get_notebook_context(timeout=timeout)
        prompt = self.prompt_template.format(user_instruction=user_instruction, notebook_context=notebook_context)
        stream = chat(model=self.model, messages=[{'role': 'user', 'content': prompt}], stream=True)
        streamed_text = ""
        for chunk in stream:
            streamed_text += chunk['message']['content']    
            clear_output(wait=True)
            display(Markdown(streamed_text))

In [None]:
nbchat = NbChat("mistral-small3.2:24b")

In [None]:
await nbchat("What is this notebook about?")

This notebook appears to be focused on exploring and demonstrating the capabilities of the **Ollama API**, particularly for interacting with large language models (LLMs) like Gemma3, Mistral, and others. Here's a breakdown of its key components:

### Key Features Demonstrated:
1. **Model Interaction**:
   - The notebook shows how to use the `ollama` Python client to interact with different models (e.g., `gemma3`, `mistral-small3.2`).
   - It includes examples of generating text, chatting, and embedding text using these models.

2. **Streaming Responses**:
   - The notebook demonstrates how to handle streaming responses from the API, which is useful for real-time interactions.

3. **Notebook Context**:
   - There's a `NbChat` class designed to integrate the LLM with the Jupyter notebook environment. This class uses the notebook's context (previous code cells and outputs) to provide more relevant responses.

4. **Web Search and Fetch**:
   - The notebook explores the `web_search` and `web_fetch` methods of the `ollama` client, which allow the LLM to fetch and process web content.

5. **Model Management**:
   - The notebook includes methods to list, pull, and delete models, showing how to manage the models available to the `ollama` client.

### Example Use Case:
The notebook is particularly useful for:
- **Educational purposes**: Teaching how to interact with LLMs programmatically.
- **Development**: Building applications that require LLM capabilities, such as chatbots, text generation tools, or research assistants.
- **Exploration**: Understanding the capabilities of different models and how to use them effectively.

### Summary:
This notebook serves as a comprehensive guide to using the `ollama` API for interacting with large language models in a Jupyter notebook environment. It covers everything from basic text generation to advanced features like streaming and web search integration.

## Design concepts

### User centric workflow

1. identify your self-hosted inference or inference as a service options
2. understand your task type, properties, privacy needs and scale
3. find the best model for your task, given your constraints
4. prepare and start your self-hosted inference or connect to your inference as a service provider
5. monitor your resource usage and cost

### Self-hosted inference or inference as a service

Model families
- architecture name
- parameter size
- training type: base / instruct / thinking
- version: release date
- quantization

Model constraints
- model capabilities
  - modalities in/out
  - context length
  - instruction
  - thinking
  - tools
- model usage
  - prompt template and special tokens
  - languages supported
  - recommended use cases
  - prompting guidelines 
- model license
  - use case restrictions
  - commercial usage restrictions
  - outputs usage restrictions 
- model transparency

Self-hosted inference constraints
- model requirements
  - size on disk -> download time / load time in vram
  - size in vram -> max context length / number of parallel sequences
  - tensor flops -> input tokens/sec
  - memory bandwidth -> output tokens/sec
- inference machine constraints
  - download speed
  - disk size and speed
  - GPU vram, memory bandwidth, tensor flops
- rented machine constraints
  - GPU availability
  - price per GPU while the machine is in use
  - price per GB of storage while the machine is not in use
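
The arrows in the list above can be turned into rough arithmetic. The sketch below rests on two common rules of thumb that are assumptions, not measurements: weights take about `parameters × bits / 8` bytes, and single-sequence decoding of a dense model is memory-bound, so output speed is at most `memory bandwidth / weight size`. The function names and example figures are hypothetical.

```python
def weights_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_output_tokens_per_sec(memory_bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed, assuming every weight is read once per token."""
    return memory_bandwidth_gb_s / weights_gb


# Example figures (assumed): a dense 24B model at 4 bits on a 1000 GB/s GPU
size_gb = weights_size_gb(24, 4)                   # 12.0 GB of weights
speed = max_output_tokens_per_sec(1000, size_gb)   # roughly 83 tokens/sec at best
```

Real quantized files usually average somewhat more than their nominal bit width, and the KV cache and runtime buffers come on top, so treat both numbers as optimistic bounds.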

Inference as a service constraints
- router constraints
  - ... same as provider constraints below ... 
- provider constraints
  - terms of service
  - privacy options
  - inference quotas
  - service availability
- per model provider constraints
  - model capabilities exposed 
  - input/output tokens cost
  - input/output tokens/sec

## ModelsProvider

### List, download and load models

### Explore ollama API

Get ollama version

In [None]:
Request
curl http://localhost:11434/api/version
Response
{
  "version": "0.5.1"
}

List remote models

As of December 2025, there is no API to get the ollama catalog of models: web scraping is the only solution.

In [None]:
import httpx
import re
from html import unescape

def updated_to_months(updated):
    """
    Convert strings like:
      "1 year ago", "2 years ago",
      "1 month ago", "3 weeks ago",
      "7 days ago", "yesterday",
      "4 hours ago"
    into integer months.
    """
    if not updated:
        return None

    updated = updated.lower().strip()

    # handle 'yesterday' explicitly
    if updated == "yesterday":
        return 0

    # years → months
    m = re.match(r'(\d+)\s+year', updated)
    if m:
        years = int(m.group(1))
        return years * 12

    # months
    m = re.match(r'(\d+)\s+month', updated)
    if m:
        return int(m.group(1))

    # weeks
    m = re.match(r'(\d+)\s+week', updated)
    if m:
        weeks = int(m.group(1))
        return max(0, weeks // 4)

    # days
    m = re.match(r'(\d+)\s+day', updated)
    if m:
        return 0

    # hours / minutes / seconds → treat as < 1 month
    if any(unit in updated for unit in ["hour", "minute", "second"]):
        return 0

    return None

def pulls_to_int(pulls_str):
    """
    Convert a pulls string like:
        '5M', '655.8K', '49K', '73.7M', '957.4K', '27.7M'
    into an integer.
    """
    if not pulls_str:
        return None

    pulls_str = pulls_str.strip().upper()

    match = re.match(r'([\d,.]+)\s*([KM]?)', pulls_str)
    if not match:
        return None

    number, suffix = match.groups()
    # Remove commas and convert to float
    number = float(number.replace(',', ''))

    if suffix == 'M':
        number *= 1_000_000
    elif suffix == 'K':
        number *= 1_000

    return int(number)

def parse_model_list_regex(html):
    models = []

    # --- Extract each <li x-test-model>...</li> block ---
    li_blocks = re.findall(
        r'<li[^>]*x-test-model[^>]*>(.*?)</li>',
        html,
        flags=re.DOTALL
    )

    for block in li_blocks:

        # name from <a href="/library/...">
        name = None
        m = re.search(r'href="/library/([^"]+)"', block)
        if m:
            name = m.group(1)

        # description <p class="max-w-lg ...">...</p>
        description = ""
        m = re.search(
            r'<p[^>]*text-neutral-800[^>]*>(.*?)</p>',
            block,
            flags=re.DOTALL
        )
        if m:
            description = re.sub(r'<.*?>', '', m.group(1)).strip()
            description = unescape(description)

        # capabilities (x-test-capability)
        capabilities = re.findall(
            r'<span[^>]*x-test-capability[^>]*>(.*?)</span>',
            block,
            flags=re.DOTALL
        )
        capabilities = [c.strip() for c in capabilities]

        # check for the special 'cloud' span 
        cloud = False
        if re.search(
            r'<span[^>]*>cloud</span>',
            block,
            flags=re.DOTALL
        ):
            cloud = True

        # sizes (x-test-size)
        sizes = re.findall(
            r'<span[^>]*x-test-size[^>]*>(.*?)</span>',
            block,
            flags=re.DOTALL
        )
        sizes = [s.strip() for s in sizes]

        # pulls <span x-test-pull-count>5M</span>
        pulls = None
        m = re.search(
            r'<span[^>]*x-test-pull-count[^>]*>(.*?)</span>',
            block
        )
        if m:
            pulls = m.group(1).strip()

        # tag count <span x-test-tag-count>5</span>
        tag_count = None
        m = re.search(
            r'<span[^>]*x-test-tag-count[^>]*>(.*?)</span>',
            block
        )
        if m:
            tag_count = m.group(1).strip()

        # updated text <span x-test-updated>...</span>
        updated = None
        m = re.search(
            r'<span[^>]*x-test-updated[^>]*>(.*?)</span>',
            block
        )
        if m:
            updated = m.group(1).strip()

        models.append({
            "name": name,
            "description": description,
            "capabilities": capabilities,
            "cloud": cloud,
            "sizes": sizes,
            "pulls": pulls_to_int(pulls),
            # tag_count stays None when the span is missing (int(None) would raise)
            "tag_count": int(tag_count) if tag_count else None,
            "updated_months": updated_to_months(updated),
            "url": f"https://ollama.com/library/{name}" if name else None
        })

    return models   

def list_models(contains=None):
    """
    Extract model names and properties from https://ollama.com/library
    Optionally filter by substring.
    """

    html = httpx.get("https://ollama.com/library").text
    models = parse_model_list_regex(html)

    if contains:
        models = [
            m for m in models
            if contains.lower() in m["name"].lower()
        ]
        models = sorted(models, key=lambda m:m["name"])

    return models

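def list_recent_models_from_family(familyfilter):
    "One summary line per model of the family updated within the last 12 months"
    lines = []
    for m in list_models(familyfilter):
        if m["updated_months"] is None or m["updated_months"] >= 12:
            continue
        capabilities = m["capabilities"] if len(m["capabilities"]) > 0 else ""
        sizes = m["sizes"] if len(m["sizes"]) > 0 else ""
        cloud = " [cloud]" if m["cloud"] else ""
        lines.append(f"{m['name']} {capabilities} {sizes}{cloud}")
    return lines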

def list_tags(model):
    """
    Extract valid quantized tags only, without HTML noise,
    and apply the same exclusions as original greps.
    """
    html = httpx.get(f"https://ollama.com/library/{model}/tags").text

    # Capture ONLY the tag part after model:..., e.g. 3b-instruct-q4_K_M
    raw_tags = re.findall(
        rf'{re.escape(model)}:([A-Za-z0-9._-]*q[A-Za-z0-9._-]*)',
        html
    )

    # Re-add full prefix model:<tag>
    tags = [f"{model}:{t}" for t in raw_tags]

    # Exclude text|base|fp|q4_[01]|q5_[01]
    tags = [
        t for t in tags
        if not re.search(r'(text|base|fp|q[45]_[01])', t)
    ]

    # Deduplicate
    return set(tags)
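
Because this scraping depends on the page markup, it helps to be able to check the extraction logic offline. The sketch below applies the same `x-test-*` attribute patterns to a small hand-written HTML fragment. The fragment and the helper name `extract_spans` are hypothetical and are not copied from ollama.com.

```python
import re

# Hand-written fragment imitating the attributes used above; it only serves
# as an offline test fixture and is not real page content.
SAMPLE_HTML = """
<ul>
  <li x-test-model>
    <a href="/library/demo-model">
      <span x-test-capability>tools</span>
      <span x-test-size>1b</span>
      <span x-test-size>7b</span>
      <span x-test-pull-count>12.5K</span>
    </a>
  </li>
</ul>
"""


def extract_spans(block, attribute):
    """Return the stripped text of every <span> carrying `attribute`."""
    pattern = rf'<span[^>]*{re.escape(attribute)}[^>]*>(.*?)</span>'
    return [text.strip() for text in re.findall(pattern, block, flags=re.DOTALL)]


# Same block and name extraction as in parse_model_list_regex
blocks = re.findall(r'<li[^>]*x-test-model[^>]*>(.*?)</li>', SAMPLE_HTML, flags=re.DOTALL)
names = [re.search(r'href="/library/([^"]+)"', b).group(1) for b in blocks]
sizes = extract_spans(blocks[0], "x-test-size")
```

If the real page changes its markup, a fixture like this keeps the parser testable while the patterns are updated.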

In [None]:
list_models()[:5]

[{'name': 'gpt-oss',
  'description': 'OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.',
  'capabilities': ['tools', 'thinking'],
  'cloud': True,
  'sizes': ['20b', '120b'],
  'pulls': 5000000,
  'tag_count': 5,
  'updated_months': 1,
  'url': 'https://ollama.com/library/gpt-oss'},
 {'name': 'qwen3-vl',
  'description': 'The most powerful vision-language model in the Qwen model family to date.',
  'capabilities': ['vision', 'tools'],
  'cloud': True,
  'sizes': ['2b', '4b', '8b', '30b', '32b', '235b'],
  'pulls': 656300,
  'tag_count': 59,
  'updated_months': 1,
  'url': 'https://ollama.com/library/qwen3-vl'},
 {'name': 'ministral-3',
  'description': 'The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware.',
  'capabilities': ['vision', 'tools'],
  'cloud': True,
  'sizes': ['3b', '8b', '14b'],
  'pulls': 49100,
  'tag_count': 16,
  'updated_months': 0,
  'url': 'https://oll

In [None]:
list_recent_models_from_family("qwen")

["qwen2.5-coder ['tools'] ['0.5b', '1.5b', '3b', '7b', '14b', '32b']",
 "qwen2.5vl ['vision'] ['3b', '7b', '32b', '72b']",
 "qwen3 ['tools', 'thinking'] ['0.6b', '1.7b', '4b', '8b', '14b', '30b', '32b', '235b']",
 "qwen3-coder ['tools'] ['30b', '480b'] [cloud]",
 "qwen3-embedding ['embedding'] ['0.6b', '4b', '8b']",
 "qwen3-vl ['vision', 'tools'] ['2b', '4b', '8b', '30b', '32b', '235b'] [cloud]"]

In [None]:
list_recent_models_from_family("gemma")

["embeddinggemma ['embedding'] ['300m']",
 "gemma3 ['vision'] ['270m', '1b', '4b', '12b', '27b'] [cloud]",
 "gemma3n  ['e2b', 'e4b']"]

In [None]:
list_recent_models_from_family("stral")

["devstral ['tools'] ['24b']",
 "magistral ['tools', 'thinking'] ['24b']",
 "ministral-3 ['vision', 'tools'] ['3b', '8b', '14b'] [cloud]",
 "mistral ['tools'] ['7b']",
 'mistral-large-3   [cloud]',
 "mistral-nemo ['tools'] ['12b']",
 "mistral-small ['tools'] ['22b', '24b']",
 "mistral-small3.1 ['vision', 'tools'] ['24b']",
 "mistral-small3.2 ['vision', 'tools'] ['24b']"]

In [None]:
list_recent_models_from_family("gpt")

["gpt-oss ['tools', 'thinking'] ['20b', '120b'] [cloud]",
 "gpt-oss-safeguard ['tools', 'thinking'] ['20b', '120b']"]

In [None]:
list_recent_models_from_family("deepseek")

["deepseek-ocr ['vision'] ['3b']",
 "deepseek-r1 ['tools', 'thinking'] ['1.5b', '7b', '8b', '14b', '32b', '70b', '671b']",
 "deepseek-v3  ['671b']",
 "deepseek-v3.1 ['tools', 'thinking'] ['671b'] [cloud]"]

In [None]:
list_recent_models_from_family("glm")

['glm-4.6   [cloud]']

In [None]:
list_recent_models_from_family("granite")

["granite-embedding ['embedding'] ['30m', '278m']",
 "granite3.1-dense ['tools'] ['2b', '8b']",
 "granite3.1-moe ['tools'] ['1b', '3b']",
 "granite3.2 ['tools'] ['2b', '8b']",
 "granite3.2-vision ['vision', 'tools'] ['2b']",
 "granite3.3 ['tools'] ['2b', '8b']",
 "granite4 ['tools'] ['350m', '1b', '3b']"]

In [None]:
list_recent_models_from_family("llama")

["llama3.2-vision ['vision'] ['11b', '90b']",
 "llama4 ['vision', 'tools'] ['16x17b', '128x17b']"]

In [None]:
list_recent_models_from_family("phi")

["dolphin-mixtral  ['8x7b', '8x22b']",
 "dolphin3  ['8b']",
 "phi4  ['14b']",
 "phi4-mini ['tools'] ['3.8b']",
 "phi4-mini-reasoning  ['3.8b']",
 "phi4-reasoning  ['14b']"]

In [None]:
list_recent_models_from_family("hermes")

["hermes3 ['tools'] ['3b', '8b', '70b', '405b']",
 "nous-hermes2-mixtral  ['8x7b']"]

In [None]:
list_recent_models_from_family("olmo")

["olmo2  ['7b', '13b']"]

In [None]:
list_recent_models_from_family("embed")

["embeddinggemma ['embedding'] ['300m']",
 "granite-embedding ['embedding'] ['30m', '278m']",
 "qwen3-embedding ['embedding'] ['0.6b', '4b', '8b']"]

In [None]:
list_tags("ministral-3")

{'ministral-3:14b-instruct-2512-q4_K_M',
 'ministral-3:14b-instruct-2512-q8_0',
 'ministral-3:3b-instruct-2512-q4_K_M',
 'ministral-3:3b-instruct-2512-q8_0',
 'ministral-3:8b-instruct-2512-q4_K_M',
 'ministral-3:8b-instruct-2512-q8_0'}

In [None]:
list_tags("mistral-small3.2")

{'mistral-small3.2:24b-instruct-2506-q4_K_M',
 'mistral-small3.2:24b-instruct-2506-q8_0'}

In [None]:
list_tags("qwen3-vl")

{'qwen3-vl:235b-a22b-instruct-q4_K_M',
 'qwen3-vl:235b-a22b-instruct-q8_0',
 'qwen3-vl:235b-a22b-thinking-q4_K_M',
 'qwen3-vl:235b-a22b-thinking-q8_0',
 'qwen3-vl:2b-instruct-q4_K_M',
 'qwen3-vl:2b-instruct-q8_0',
 'qwen3-vl:2b-thinking-q4_K_M',
 'qwen3-vl:2b-thinking-q8_0',
 'qwen3-vl:30b-a3b-instruct-q4_K_M',
 'qwen3-vl:30b-a3b-instruct-q8_0',
 'qwen3-vl:30b-a3b-thinking-q4_K_M',
 'qwen3-vl:30b-a3b-thinking-q8_0',
 'qwen3-vl:32b-instruct-q4_K_M',
 'qwen3-vl:32b-instruct-q8_0',
 'qwen3-vl:32b-thinking-q4_K_M',
 'qwen3-vl:32b-thinking-q8_0',
 'qwen3-vl:4b-instruct-q4_K_M',
 'qwen3-vl:4b-instruct-q8_0',
 'qwen3-vl:4b-thinking-q4_K_M',
 'qwen3-vl:4b-thinking-q8_0',
 'qwen3-vl:8b-instruct-q4_K_M',
 'qwen3-vl:8b-instruct-q8_0',
 'qwen3-vl:8b-thinking-q4_K_M',
 'qwen3-vl:8b-thinking-q8_0'}
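
The tags listed above appear to follow a `model:size-variant-quantization` layout, which maps onto the "Model families" properties from the design section. A minimal sketch of a parser, assuming that layout holds (other models may name their tags differently); `parse_tag` is a hypothetical helper.

```python
def parse_tag(full_tag):
    """Split 'model:size-variant...-quantization' into its parts."""
    model, tag = full_tag.split(":", 1)
    parts = tag.split("-")
    return {
        "model": model,
        "size": parts[0],                   # first segment, e.g. '235b'
        "variant": "-".join(parts[1:-1]),   # everything in between
        "quantization": parts[-1],          # last segment, e.g. 'q4_K_M'
    }


parsed = parse_tag("qwen3-vl:235b-a22b-thinking-q4_K_M")
```

For the example above, `parsed["variant"]` is `'a22b-thinking'`, which keeps the active-parameter hint and the training type together; splitting them further would need model-specific rules.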

https://github.com/ollama/ollama/blob/main/docs/api.md#list-local-models

ollama.list().models -> list(ollama._types.ListResponse.Model)

```yaml
ollama._types.ListResponse.Model
- model: str 'qwen3:4b'
- modified_at: datetime.datetime datetime(2025, 11, 22, 18, 53, 11)
- digest: str '359d7dd4bcdab3d86b87d73ac27966f4dbb9f5efdfcc75d34a8764a09474fae7'
- size: pydantic.types.ByteSize 2497293931
- details: ollama._types.ModelDetails
  - parent_model: str ''
  - format: str 'gguf'
  - family: str 'qwen3'
  - families: Sequence[str] ['qwen3']
  - parameter_size: str '4.0B'
  - quantization_level: str 'Q4_K_M'
```

In [None]:
ollama.list().models[0]

Model(model='qwen3:4b', modified_at=datetime.datetime(2025, 11, 22, 18, 53, 11, 586211, tzinfo=TzInfo(3600)), digest='359d7dd4bcdab3d86b87d73ac27966f4dbb9f5efdfcc75d34a8764a09474fae7', size=2497293931, details=ModelDetails(parent_model='', format='gguf', family='qwen3', families=['qwen3'], parameter_size='4.0B', quantization_level='Q4_K_M'))

https://github.com/ollama/ollama/blob/main/docs/api.md#show-model-information

```yaml
ollama._types.ShowResponse
- modified_at: datetime.datetime datetime.datetime(2025, 11, 22, 18, 53, 11)
- template: str '{{- $lastUserIdx := -1 -}}...\n{{- end }}'
- modelfile: str '...'
- license: str '...'
- details: ollama._types.ModelDetails -> see above
- model_info: Mapping[str, Any]
  - 'general.architecture': 'qwen3'
  - 'general.basename': 'Qwen3'
  - 'general.file_type': 15
  - 'general.finetune': 'Thinking'
  - 'general.license': 'apache-2.0'
  - 'general.license.link': 'https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507/blob/main/LICENSE'
  - 'general.parameter_count': 4022468096
  - 'general.quantization_version': 2
  - 'general.size_label': '4B'
  - 'general.tags': None
  - 'general.type': 'model'
  - 'general.version': '2507'
  - 'qwen3.attention.head_count': 32
  - 'qwen3.attention.head_count_kv': 8
  - 'qwen3.attention.key_length': 128
  - 'qwen3.attention.layer_norm_rms_epsilon': 1e-06
  - 'qwen3.attention.value_length': 128
  - 'qwen3.block_count': 36
  - 'qwen3.context_length': 262144
  - 'qwen3.embedding_length': 2560
  - 'qwen3.feed_forward_length': 9728
  - 'qwen3.rope.freq_base': 5000000
  - 'tokenizer.ggml.add_bos_token': False
  - 'tokenizer.ggml.bos_token_id': 151643
  - 'tokenizer.ggml.eos_token_id': 151645
  - 'tokenizer.ggml.merges': None
  - 'tokenizer.ggml.model': 'gpt2'
  - 'tokenizer.ggml.padding_token_id': 151643
  - 'tokenizer.ggml.pre': 'qwen2'
  - 'tokenizer.ggml.token_type': None
  - 'tokenizer.ggml.tokens': None
- parameters: str 'top_p 0.95\n repeat_penalty 1\n stop "<|im_start|>"\n stop "<|im_end|>"\n temperature 0.6\n top_k 20'
- capabilities: List[str] ['completion', 'tools', 'thinking']
```
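
The `model_info` keys above are enough for a back-of-the-envelope estimate of the key/value cache size, which is what ties VRAM to the maximum usable context length. A minimal sketch, assuming standard attention where every layer caches keys and values for every token at 16 bits per value. The function name is hypothetical, and sliding-window attention or a quantized cache would lower the figure.

```python
def kv_cache_bytes_per_token(model_info, arch, bytes_per_value=2):
    """Bytes of key/value cache needed for one token (default: 16-bit values)."""
    layers = model_info[f"{arch}.block_count"]
    kv_heads = model_info[f"{arch}.attention.head_count_kv"]
    key_len = model_info[f"{arch}.attention.key_length"]
    value_len = model_info[f"{arch}.attention.value_length"]
    return layers * kv_heads * (key_len + value_len) * bytes_per_value


# Values copied from the qwen3 model_info listing above
qwen3_info = {
    "qwen3.block_count": 36,
    "qwen3.attention.head_count_kv": 8,
    "qwen3.attention.key_length": 128,
    "qwen3.attention.value_length": 128,
    "qwen3.context_length": 262144,
}

per_token = kv_cache_bytes_per_token(qwen3_info, "qwen3")   # 147456 bytes
full_context_gib = per_token * qwen3_info["qwen3.context_length"] / 1024**3   # 36.0 GiB
```

Under these assumptions, filling the full 262144-token context would need far more cache memory than the weights themselves, so the practical context length on a given GPU is usually set well below the model maximum.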

In [None]:
ollama.show('gemma3:4b').capabilities, ollama.show('gemma3:4b').modelinfo

(['completion', 'vision'],
 {'gemma3.attention.head_count': 8,
  'gemma3.attention.head_count_kv': 4,
  'gemma3.attention.key_length': 256,
  'gemma3.attention.sliding_window': 1024,
  'gemma3.attention.value_length': 256,
  'gemma3.block_count': 34,
  'gemma3.context_length': 131072,
  'gemma3.embedding_length': 2560,
  'gemma3.feed_forward_length': 10240,
  'gemma3.mm.tokens_per_image': 256,
  'gemma3.vision.attention.head_count': 16,
  'gemma3.vision.attention.layer_norm_epsilon': 1e-06,
  'gemma3.vision.block_count': 27,
  'gemma3.vision.embedding_length': 1152,
  'gemma3.vision.feed_forward_length': 4304,
  'gemma3.vision.image_size': 896,
  'gemma3.vision.num_channels': 3,
  'gemma3.vision.patch_size': 14,
  'general.architecture': 'gemma3',
  'general.file_type': 15,
  'general.parameter_count': 4299915632,
  'general.quantization_version': 2,
  'tokenizer.ggml.add_bos_token': True,
  'tokenizer.ggml.add_eos_token': False,
  'tokenizer.ggml.add_padding_token': False,
  'tokenize

In [None]:
ollama.pull??

Signature: ollama.pull(model: str, *, insecure: bool = False, stream: bool = False) -> Union[ollama._types.ProgressResponse, collections.abc.Iterator[ollama._types.ProgressResponse]]
Source:   
  def pull(
    self,
    model: str,
    *,
    insecure: bool = False,
    stream: bool = False,
  ) -> Union[ProgressResponse, Iterator[ProgressResponse]]:
    """
    Raises `ResponseError` if the request could not be fulfilled.

    Returns `ProgressResponse` if `stream` is `False`, otherwise returns a `ProgressResponse` generator.
    """
    return self._request(
      ProgressResponse,
      'POST',
      '/api/pull',
      json=PullRequest(
        model=model,
        insecure=insecure,
        stream=stream,
      ).model_dump(exclude_none=True),
      stream=st

In [None]:
ollama.delete??

Signature: ollama.delete(model: str) -> ollama._types.StatusResponse
Docstring: <no docstring>
Source:   
  def delete(self, model: str) -> StatusResponse:
    r = self._request_raw(
      'DELETE',
      '/api/delete',
      json=DeleteRequest(
        model=model,
      ).model_dump(exclude_none=True),
    )
    return StatusResponse(
      status='success' if r.status_code == 200 else 'error',
    )
File:      /home/workspace/wordslab-notebooks-lib/.venv/lib/python3.12/site-packages/ollama/_client.py
Type:      method

**Streaming responses**

Certain endpoints stream responses as JSON objects. Streaming can be disabled by providing {"stream": false} for these endpoints.

**Structured outputs**

Structured outputs are supported by providing a JSON schema in the format parameter. The model will generate a response that matches the schema. See the structured outputs example below.

**JSON mode**

Enable JSON mode by setting the format parameter to json. This will structure the response as a valid JSON object. 
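
To make the `format` parameter concrete, here is a minimal sketch of a request body for `/api/generate` that carries a JSON schema, followed by the parsing step on the client side. The schema fields and the sample reply are hypothetical and written by hand; no server is contacted.

```python
import json

# JSON schema describing the object the model should return (illustrative fields)
schema = {
    "type": "object",
    "properties": {
        "color": {"type": "string"},
        "cause": {"type": "string"},
    },
    "required": ["color", "cause"],
}

# Request body for POST /api/generate with streaming disabled
payload = {
    "model": "gemma3",
    "prompt": "Why is the sky blue? Answer in JSON.",
    "format": schema,      # use the string "json" instead for plain JSON mode
    "stream": False,
}
body = json.dumps(payload)

# Hypothetical reply text, written by hand to show the parsing step
sample_response = '{"color": "blue", "cause": "Rayleigh scattering"}'
answer = json.loads(sample_response)
missing = [key for key in schema["required"] if key not in answer]
```

Checking `missing` after parsing is a cheap safeguard: it catches replies that are valid JSON but lack a required key.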

https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion

Parameters
- model: (required) the model name
- prompt: the prompt to generate a response for
- suffix: the text after the model response
- images: (optional) a list of base64-encoded images (for multimodal models such as llava)
- think: (for thinking models) should the model think before responding?

Advanced parameters (optional):
- format: the format to return a response in. Format can be json or a JSON schema
- options: additional model parameters listed in the documentation for the Modelfile such as temperature
- system: system message (overrides what is defined in the Modelfile)
- template: the prompt template to use (overrides what is defined in the Modelfile)
- stream: if false the response will be returned as a single response object, rather than a stream of objects
- raw: if true no formatting will be applied to the prompt. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API
- keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m)

Response

A stream of JSON objects is returned:

{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "response": "The",
  "done": false
}

The final response in the stream also includes additional data about the generation:
- total_duration: time spent in nanoseconds generating the response
- load_duration: time spent in nanoseconds loading the model
- prompt_eval_count: number of tokens in the prompt
- prompt_eval_duration: time spent in nanoseconds evaluating the prompt
- eval_count: number of tokens in the response
- eval_duration: time spent in nanoseconds generating the response
- response: empty if the response was streamed; if not streamed, this contains the full response

A response can be received in one reply when streaming is off.

To calculate how fast the response is generated in tokens per second (tokens/s), compute eval_count / eval_duration * 10^9 (eval_duration is expressed in nanoseconds).
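
The tokens-per-second formula can be sketched as a small helper applied to the final response object. The field names come from the list above; the sample values and the function names are hypothetical.

```python
def output_tokens_per_second(final_response):
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return final_response["eval_count"] * 1e9 / final_response["eval_duration"]


def prompt_tokens_per_second(final_response):
    """Prompt processing speed, computed the same way from the prompt_eval fields."""
    return final_response["prompt_eval_count"] * 1e9 / final_response["prompt_eval_duration"]


# Hand-written final object; durations are in nanoseconds
final = {
    "eval_count": 120,
    "eval_duration": 2_400_000_000,
    "prompt_eval_count": 26,
    "prompt_eval_duration": 130_000_000,
}

output_speed = output_tokens_per_second(final)   # 120 tokens in 2.4 s
prompt_speed = prompt_tokens_per_second(final)   # 26 tokens in 0.13 s
```

These two numbers correspond to the "output tokens/sec" and "input tokens/sec" entries of the design section.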

**Images**

To submit images to multimodal models, provide a list of base64-encoded images:

- "images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBI..."]
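
A minimal sketch of how such a base64 string can be produced with the standard library. The helper names are hypothetical, and the PNG signature bytes are used only as a tiny stand-in for real image data.

```python
import base64


def encode_image_bytes(data):
    """Return raw image bytes as a base64 string for the `images` list."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path):
    """Read an image file and return its base64 encoding."""
    with open(path, "rb") as f:
        return encode_image_bytes(f.read())


# The 8-byte PNG signature: every PNG file starts with these bytes
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(png_signature)

# The encoded string then goes into the request body
request_fragment = {"images": [encoded]}
```

This also explains why the example string above begins with `iVBORw0KGgo`: that prefix is the base64 form of the PNG signature.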


In [None]:
ollama.generate(model='gemma3', prompt='Why is the sky blue?')

In [None]:
ollama.chat(model='gemma3', messages=[{'role': 'user', 'content': 'Why is the sky blue?'}])

In [None]:
ollama.embed(model='gemma3', input='The sky is blue because of rayleigh scattering')

In [None]:
ollama.embed(model='gemma3', input=['The sky is blue because of rayleigh scattering', 'Grass is green because of chlorophyll'])

In [None]:
ollama.ps()

ProcessResponse(models=[])

In [None]:
ollama.web_search??

Signature: ollama.web_search(query: str, max_results: int = 3) -> ollama._types.WebSearchResponse
Source:   
  def web_search(self, query: str, max_results: int = 3) -> WebSearchResponse:
    """
    Performs a web search

    Args:
      query: The query to search for
      max_results: The maximum number of results to return (default: 3)

    Returns:
      WebSearchResponse with the search results
    Raises:
      ValueError: If OLLAMA_API_KEY environment variable is not set
    """
    if not self._client.headers.get('authorization', '').startswith('Bearer '):
      raise ValueError('Authorization header with Bearer token is required for web search')

    return self._request(
      WebSearchResponse,

In [None]:
ollama.web_fetch??

Signature: ollama.web_fetch(url: str) -> ollama._types.WebFetchResponse
Source:   
  def web_fetch(self, url: str) -> WebFetchResponse:
    """
    Fetches the content of a web page for the provided URL.

    Args:
      url: The URL to fetch

    Returns:
      WebFetchResponse with the fetched result
    """
    if not self._client.headers.get('authorization', '').startswith('Bearer '):
      raise ValueError('Authorization header with Bearer token is required for web fetch')

    return self._request(
      WebFetchResponse,
      'POST',
      'https://ollama.com/api/web_fetch',
      json=WebFetchRequest(
        url=url,
      ).model_dump(exclude_none=True),
    )
File:      /home/workspace/wordsla

## Client

### Explore ollama, vllm and openrouter APIs

In [None]:
await nbchat("using only the bits of information present in this notebooks, try to generate a function to get the list of ollama models which can run in a GPU with 16GB vram")

Based on the information in this notebook, we can create a function to filter Ollama models that can run on a GPU with 16GB VRAM. Here's how we can approach this:

1. We'll use the `ollama.list()` method to get all available models
2. We'll filter models based on their VRAM requirements
3. We'll consider that 16GB VRAM can typically handle models up to about 13B parameters (though this varies by model architecture)

Here's the function:

```python
def get_models_for_16gb_vram():
    """Returns a list of Ollama models that can run on a GPU with 16GB VRAM"""
    all_models = ollama.list()

    # Define a mapping of model sizes to VRAM requirements
    # These are approximate values based on typical requirements
    model_vram_map = {
        'gemma3': {
            '27b': 24,  # 27B model needs ~24GB VRAM
            '13b': 16,  # 13B model fits in 16GB
            '7b': 8,    # 7B model fits in 8GB
            '3b': 4     # 3B model fits in 4GB
        },
        'mistral': {
            'small3.2': 12,  # 24B model needs ~12GB VRAM
            'small3': 10,    # 10B model fits in 10GB
            'tiny': 4        # Tiny model fits in 4GB
        },
        # Add other models as needed
    }

    compatible_models = []

    for model in all_models.models:
        model_name = model.name
        model_size = model.modelfile.get('parameters', {}).get('size', 'unknown')

        # Get VRAM requirement
        vram_required = 0
        if model_name in model_vram_map:
            if model_size in model_vram_map[model_name]:
                vram_required = model_vram_map[model_name][model_size]

        # Check if model fits in 16GB VRAM
        if vram_required <= 16:
            compatible_models.append({
                'name': model_name,
                'size': model_size,
                'vram_required': vram_required
            })

    return compatible_models
```

Example usage:
```python
compatible_models = get_models_for_16gb_vram()
for model in compatible_models:
    print(f"{model['name']} ({model['size']}): {model['vram_required']}GB VRAM required")
```

Note: The VRAM requirements are approximate and may vary based on:
- The specific GPU architecture
- Batch size
- Other system configurations
- Model quantization (if supported)

For more accurate information, you might want to:
1. Check the official documentation for each model
2. Test the models on your specific hardware
3. Consider using model quantization to reduce VRAM usage
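
The generated function above reads `model.name` and `model.modelfile`, which are not among the `ListResponse.Model` fields listed earlier in this notebook (`model`, `size`, `details`, ...). A more conservative sketch uses only `model` and `size`, taking the size on disk as a rough proxy for VRAM use. The `LocalModel` class is a hypothetical stand-in so the example runs offline, and `usable_fraction` is an assumed safety margin rather than a measured value.

```python
from dataclasses import dataclass


@dataclass
class LocalModel:
    """Hypothetical stand-in exposing the two fields used here."""
    model: str   # e.g. 'qwen3:4b'
    size: int    # size on disk in bytes


def models_fitting_vram(models, vram_gb=16, usable_fraction=0.8):
    """Names of the models whose files fit in a share of the VRAM.

    `usable_fraction` leaves room for the KV cache and runtime buffers;
    it is an assumption to tune for your context length.
    """
    budget_bytes = vram_gb * 1024**3 * usable_fraction
    return [m.model for m in models if m.size <= budget_bytes]


local_models = [
    LocalModel("qwen3:4b", 2_497_293_931),       # size shown earlier in this notebook
    LocalModel("example:70b", 42_000_000_000),   # made-up entry for contrast
]
fitting = models_fitting_vram(local_models)
```

With a running ollama server, the same function could be called as `models_fitting_vram(ollama.list().models)`, since the real objects expose `model` and `size` as well.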

#### Format the notebook cells for LLMs

Convert notebook contents to compact XML - code and format copied from **toolslm by AnswerDotAI**:

https://github.com/AnswerDotAI/toolslm/blob/main/00_xml.ipynb

In [None]:
## #| exports
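# Imports assumed from the toolslm source this code was copied from
import nbformat
from fastcore.foundation import L
from fastcore.xml import to_xml, Source, Out, Outs, Cell

def get_mime_text(data):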
    "Get text from MIME bundle, preferring markdown over plain"
    if 'text/markdown' in data: return ''.join(list(data['text/markdown']))
    if 'text/plain' in data: return ''.join(list(data['text/plain']))

In [None]:
## #| exports
def cell2out(o):
    "Convert single notebook output to XML format"
    if hasattr(o, 'data'): 
        txt = get_mime_text(o.data)
        if txt: return Out(txt, mime='markdown' if 'text/markdown' in o.data else 'plain')
    if hasattr(o, 'text'):
        txt = o.text if isinstance(o.text, str) else ''.join(o.text)
        return Out(txt, type='stream', name=o.get('name', 'stdout'))
    if hasattr(o, 'ename'): return Out(f"{o.ename}: {o.evalue}", type='error')

In [None]:
## #| exports
def cell2xml(cell):
    "Convert notebook cell to concise XML format"
    cts = Source(''.join(cell.source)) if hasattr(cell, 'source') and cell.source else None
    out_items = L(getattr(cell,'outputs',[])).map(cell2out).filter()
    outs = []
    if out_items: outs = Outs(*out_items)
    parts = [p for p in [cts, outs] if p]
    return Cell(*parts, type=cell.cell_type)

In [None]:
## #| exports
def nb2xml(nb, until_cell_id):
    cells_xml = []
    for c in nb.cells:
        if c.id == until_cell_id: break
        if c.cell_type in ('code','markdown'):
            cells_xml.append(to_xml(cell2xml(c), do_escape=False))
    return '\n'.join(cells_xml)     

In [None]:
nb2xml(nb, executing_cell_id)[:3000]

In [None]:
## #| exports
async def get_notebook_context(timeout=1):
    data = await get_notebook_data(timeout=timeout)
    notebook_content = data["notebook"]
    nb = nbformat.from_dict(notebook_content)
    cell_id = data["cell_id"]
    return nb2xml(nb, cell_id)

In [None]:
await get_notebook_context(timeout=1)

You can see that the content of this cell, which is below the call to get_notebook_context(), doesn't appear in the context.
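
The cutoff behaviour can be illustrated without a live notebook. The sketch below mirrors the loop in `nb2xml` on plain dictionaries; the cell contents and the helper name `cells_before` are hypothetical.

```python
def cells_before(cells, until_cell_id):
    """Return the cells located before the cell whose id is `until_cell_id`."""
    selected = []
    for cell in cells:
        if cell["id"] == until_cell_id:
            # Stop at the executing cell: it and everything below are excluded
            break
        selected.append(cell)
    return selected


cells = [
    {"id": "a", "source": "x = 1"},
    {"id": "b", "source": "await get_notebook_context()"},
    {"id": "c", "source": "this cell is below the call"},
]
context = cells_before(cells, until_cell_id="b")
context_ids = [cell["id"] for cell in context]
```

Only cell `a` ends up in the context: the executing cell `b` is excluded along with everything after it.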