# Using LMQL with local models

To use all of LMQL's features, we need to host the models ourselves. LMQL supports
the Hugging Face Transformers API and the llama.cpp API. (It appears that the work
necessary to support Ollama is in progress, so hopefully soon!)

In the meantime, we need to use `llama-cpp-python` and download the model weights ourselves. We will use a 4-bit quantized version of the weights, the same level of quantization that Ollama uses by default.

## Specify the model to download

In [1]:
import lmql_demo.download as download

llama2_tokenizer = "llama2/llama-tokenizer.model"
mistral_tokenizer = "llama2/mistral-tokenizer.model"
model_filename = download.download_llama2()

The file 'llama2/llama-2-7b.Q4_K_M.gguf' already exists.
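
The `lmql_demo.download` module is a local helper and its source is not shown here. As a rough sketch of the download-if-missing behavior that the output above suggests, something like the following could work. The function name `download_if_missing` and its parameters are hypothetical and are not the helper's actual API.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> str:
    """Download `url` to `dest` unless the file is already present.

    Hypothetical helper: it only illustrates the skip-if-present pattern.
    """
    path = Path(dest)
    if path.exists():
        print(f"The file '{dest}' already exists.")
        return dest
    # Create the target directory (for example "llama2/") before downloading.
    path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(path))
    return dest
```

The returned path can then be interpolated into the `local:llama.cpp:` model string in the same way `model_filename` is used below.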


In [2]:
import lmql
from dataclasses import dataclass

In [3]:
local_model = lmql.model(f"local:llama.cpp:{model_filename}", tokenizer=llama2_tokenizer)

In [14]:
@dataclass
class Joke:
    setup: str
    punchline: str

In [27]:
@lmql.query
async def tell_joke():
    '''lmql
    sample(n=1, temperature=0.5)
        """This is a list of the funniest AI jokes ever told with a setup and a punchline. Q: indicates the setup and A: indicates the punchline
        Q: Why was the AI bad at soccer?
        A: Because it kept trying to reboot the ball.
        
        Q: Why did the AI cross the road?
        A: To optimize the objective function.

        """
        "Q: [setup]\n" where len(TOKENS(setup)) < 120 and STOPS_AT(setup, "?") and STOPS_AT(setup, ".")
        "A: [punchline]\n" where STOPS_AT(punchline, "\n") and len(TOKENS(punchline)) < 120
    '''
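
The `where` clauses in the query combine two `STOPS_AT` constraints on `setup`, so generation of the setup ends at whichever of `?` or `.` appears first, with the stop phrase included. The sketch below mimics that effect in plain Python on an already-generated string. It is only an illustration: LMQL enforces its constraints during decoding, and `apply_stops` and the sample text are made up for this example.

```python
def apply_stops(text: str, stops: list[str]) -> str:
    """Cut `text` right after the earliest occurrence of any stop phrase."""
    # End index (exclusive) of each stop phrase that actually occurs.
    ends = [text.index(s) + len(s) for s in stops if s in text]
    # If no stop phrase occurs, the text is returned unchanged.
    return text[: min(ends)] if ends else text


# Illustrative text only, not real model output.
raw = "Why did the model overfit? It memorized the test set."
print(apply_stops(raw, ["?", "."]))
```

With both stop phrases active, the question mark wins here because it occurs before the first period.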

In [28]:
from IPython.display import Markdown

In [31]:
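# Run the query against the locally hosted model.
lmql_response = await tell_joke(model=local_model)
# Map the captured [setup] and [punchline] variables onto the dataclass.
joke = Joke(**lmql_response.variables)
# Render the joke as Markdown in the notebook.
Markdown(f"## AI Joke\n\n**Q:** {joke.setup}\n\n**A:** {joke.punchline}")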

## AI Joke

**Q:** 3 AI robots walk into a bar.

**A:** 1: Hey, let’s go get a drink.
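
Unpacking `lmql_response.variables` straight into `Joke` relies on the captured variable names matching the dataclass fields exactly. A slightly more defensive sketch is shown below, assuming the variables behave like a plain dict of strings. The helper name `joke_from_variables` is hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Joke:
    setup: str
    punchline: str


def joke_from_variables(variables: dict) -> Joke:
    """Build a Joke from captured query variables (hypothetical helper)."""
    names = {f.name for f in fields(Joke)}
    # Keep only keys that are Joke fields and trim surrounding whitespace,
    # such as a trailing newline left over from a stop phrase.
    cleaned = {k: str(v).strip() for k, v in variables.items() if k in names}
    return Joke(**cleaned)
```

If a field is missing from the variables, `Joke(**cleaned)` still raises a `TypeError`, which is usually the desired signal that the query did not capture what was expected.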
