# Run models locally

https://python.langchain.com/docs/how_to/local_llms/

## Use case

The popularity of projects like llama.cpp, Ollama, GPT4All, llamafile, and others underscores the demand to run LLMs locally (on your own device).

This has at least two important benefits:

1. Privacy: Your data is not sent to a third party, and it is not subject to the terms of service of a commercial service
2. Cost: There is no inference fee, which is important for token-intensive applications (e.g., long-running simulations, summarization)



## Overview

Running an LLM locally requires a few things:

1. Open-source LLM: An open-source LLM that can be freely modified and shared
2. Inference: Ability to run this LLM on your device with acceptable latency


### Open-source LLMs


Users can now gain access to a rapidly growing set of open-source LLMs.  
https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-better  


![](https://python.langchain.com/assets/images/OSS_LLM_overview-9444c9793c76bd4785a5b0cd020c14ef.png)

### Inference

A few frameworks have emerged to support inference of open-source LLMs on various devices:  


1. llama.cpp: C++ implementation of llama inference code with weight optimization / quantization
2. gpt4all: Optimized C backend for inference
3. Ollama: Bundles model weights and environment into an app that runs on device and serves the LLM
4. llamafile: Bundles model weights and everything needed to run the model in a single file, allowing you to run the LLM locally from this file without any additional installation steps


In general, these frameworks will do a few things:

1. Quantization: Reduce the memory footprint of the raw model weights
2. Efficient implementation for inference: Support inference on consumer hardware (e.g., CPU or laptop GPU)


In particular, see this excellent post on the importance of quantization.  
With lower precision, we radically decrease the memory needed to hold the LLM.  


![](https://python.langchain.com/assets/images/llama-memory-weights-aaccef5df087e993b0f46277500039b6.png)
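As a rough illustration of why lower precision matters, the sketch below estimates weight memory as parameters × bits per weight / 8. The function name and the parameter counts are illustrative assumptions, and the estimate ignores activations, the KV cache, and quantization metadata, so real model files will differ somewhat.

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8 bytes.
# Ignores activations, the KV cache, and per-block quantization overhead.

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1024**3


# Illustrative model sizes (assumptions, not tied to a specific model file).
for n_params in (7e9, 13e9, 70e9):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gib(n_params, bits):6.1f} GiB"
        for bits in (32, 16, 8, 4)
    )
    print(f"{n_params / 1e9:.0f}B params -> {row}")
```

Under this estimate, a 7B-parameter model drops from about 13 GiB at 16-bit to about 3.3 GiB at 4-bit.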


### Formatting prompts


Some providers have chat model wrappers that take care of formatting your input prompt for the specific local model you're using.   
However, if you are prompting local models with a text-in/text-out LLM wrapper, you may need to use a prompt tailored for your specific model.


https://python.langchain.com/docs/concepts/chat_models/  
https://python.langchain.com/docs/concepts/text_llms/
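To make "a prompt tailored for your specific model" concrete, here is a minimal sketch that hand-formats a single turn using the Llama 2 chat convention (`[INST]` and `<<SYS>>` markers). The helper name `format_llama2_prompt` is hypothetical, and other model families expect different markers, so treat this as one example rather than a general recipe.

```python
# Sketch: hand-formatting a prompt for a text-in/text-out wrapper.
# Uses the Llama 2 chat convention; other model families use other markers.

def format_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Wrap a single user turn in Llama 2 chat markers."""
    if system_message:
        system_block = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n"
    else:
        system_block = ""
    return f"[INST] {system_block}{user_message} [/INST]"


print(format_llama2_prompt("What is 1 + 1?", "Answer concisely."))
```

A chat model wrapper performs this kind of formatting for you, which is why it is usually the simpler choice when one is available.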





## Quickstart

Ollama is one way to easily run inference on macOS.

The instructions here provide details, which we summarize:  

- Download and run the app
- From the command line, fetch a model from this list of options, e.g., `ollama pull llama3.1:8b`
- When the app is running, all models are automatically served on localhost:11434
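Since the app serves models over HTTP on localhost:11434, you can also talk to it without any wrapper. The sketch below uses only the standard library and Ollama's `/api/generate` endpoint; the helper names are hypothetical, and it assumes the app is running and the model has already been pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned above


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the app running, `generate("llama3.1:8b", "1+1= ")` should return the completion as a plain string.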


```bash
pip install langchain_ollama
```

In [None]:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b")

llm.invoke("1+1= ")
# '1+1 = 2'


Stream tokens as they are being generated:

In [None]:
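# Prompt (Chinese): "Who are you? Just give the answer, no explanation needed."
for chunk in llm.stream("你是谁？直接说答案，不需要解释"):
    print(chunk, end="\n->", flush=True)
# 我
# ->是一个
# ->人
# ->工
# ->智能
# ->模型
# ->。
# ->
# ->
# Joined, the chunks read "我是一个人工智能模型。" ("I am an artificial intelligence model.")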


Ollama also includes a chat model wrapper that handles formatting conversation turns:



In [6]:
from langchain_ollama import ChatOllama

chat_model = ChatOllama(model="llama3.1:8b")

# Prompt (Chinese): "Who are you? Just give the answer, no explanation needed."
chat_model.invoke("你是谁？直接说答案，不需要解释")

AIMessage(content='一个人工智能模型', additional_kwargs={}, response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-03-15T06:57:08.40594Z', 'done': True, 'done_reason': 'stop', 'total_duration': 445548042, 'load_duration': 30815417, 'prompt_eval_count': 21, 'prompt_eval_duration': 198000000, 'eval_count': 6, 'eval_duration': 215000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-c4497794-39b8-4d7f-aeb6-d64c4db72fdd-0', usage_metadata={'input_tokens': 21, 'output_tokens': 6, 'total_tokens': 27})
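The `response_metadata` above also gives a quick read on inference speed. Ollama reports durations in nanoseconds, so dividing token counts by the durations yields tokens per second. This is a small sketch using the values from the output above; the helper name is hypothetical.

```python
# Ollama reports durations in nanoseconds.
NS_PER_SECOND = 1_000_000_000


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tokens/s."""
    return token_count / (duration_ns / NS_PER_SECOND)


# Values copied from the response_metadata shown above.
metadata = {
    "prompt_eval_count": 21,
    "prompt_eval_duration": 198_000_000,
    "eval_count": 6,
    "eval_duration": 215_000_000,
}

prompt_speed = tokens_per_second(
    metadata["prompt_eval_count"], metadata["prompt_eval_duration"]
)
decode_speed = tokens_per_second(metadata["eval_count"], metadata["eval_duration"])

print(f"prompt processing: {prompt_speed:.1f} tokens/s")  # 106.1 tokens/s
print(f"generation:        {decode_speed:.1f} tokens/s")  # 27.9 tokens/s
```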

## Environment

Inference speed is a challenge when running models locally (see above).

To minimize latency, it is desirable to run models locally on a GPU, which ships with many consumer laptops (e.g., Apple devices).

Even with a GPU, the available GPU memory bandwidth is important.
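One common rule of thumb for why bandwidth matters: when generating one token at a time, the model weights are read from memory roughly once per token, so bandwidth divided by model size gives an approximate upper bound on tokens per second. The sketch below applies that rule; both numbers are hypothetical placeholders, not measurements of any particular device or model.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical numbers for illustration only.
bandwidth_gb_s = 150.0  # assumed memory bandwidth of the device
model_size_gb = 20.0    # assumed size of the quantized weights

upper_bound = max_tokens_per_second(bandwidth_gb_s, model_size_gb)
print(f"at most ~{upper_bound:.1f} tokens/s")  # at most ~7.5 tokens/s
```

Measured speeds usually land below this bound, since compute, the KV cache, and other overheads also take time.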


### Running on Apple silicon GPU

Ollama and llamafile will automatically utilize the GPU on Apple devices.

Other frameworks require the user to set up the environment to utilize the Apple GPU.



For example, the llama.cpp Python bindings can be configured to use the GPU via Metal.

Metal is a graphics and compute API created by Apple providing near-direct access to the GPU.

See the llama.cpp setup here to enable this.  
https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md  


## LLMs

There are various ways to gain access to quantized model weights.

1. HuggingFace - Many quantized models are available for download and can be run with frameworks such as llama.cpp. You can also download models in llamafile format from HuggingFace.
2. gpt4all - The model explorer offers a leaderboard of metrics and associated quantized models available for download
3. Ollama - Several models can be accessed directly via pull


### Ollama

With Ollama, fetch a model via `ollama pull <model family>:<tag>`, then reference it by the same name:  





In [8]:
llm = OllamaLLM(model="llama3.1:8b")
# Prompt (Chinese): "Who are you? Just give the answer, no explanation needed."
llm.invoke("你是谁？直接说答案，不需要解释")

'一个AI模型'  # "An AI model"

### Llama.cpp

Llama.cpp is compatible with a broad set of models.
https://github.com/ggerganov/llama.cpp  
https://python.langchain.com/api_reference/langchain/llms/langchain.llms.llamacpp.LlamaCpp.html?highlight=llamacpp#langchain.llms.llamacpp.LlamaCpp  


A few arguments from the LlamaCpp API reference docs are worth commenting on:
https://python.langchain.com/api_reference/community/llms/langchain_community.llms.llamacpp.LlamaCpp.html 



- n_gpu_layers: number of layers to be loaded into GPU memory
- n_batch: number of tokens the model should process in parallel
- n_ctx: Token context window
- f16_kv: whether the model should use half-precision for the key/value cache
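To see how `n_ctx` and `f16_kv` interact, the sketch below estimates the size of the key/value cache: two tensors (keys and values) per layer, one vector per context position. The architecture values are hypothetical placeholders; read the real ones from your model's metadata.

```python
def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, bytes_per_value: int
) -> int:
    """Size of the KV cache: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Hypothetical architecture values, for illustration only.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 5012):
    f16 = kv_cache_bytes(**shape, n_ctx=n_ctx, bytes_per_value=2)  # half precision
    f32 = kv_cache_bytes(**shape, n_ctx=n_ctx, bytes_per_value=4)  # full precision
    print(f"n_ctx={n_ctx}: f16 {f16 / 1024**2:.0f} MiB, f32 {f32 / 1024**2:.0f} MiB")
```

With these assumed values, a 2048-token window needs 512 MiB in half precision and twice that in full precision, and the cost grows linearly with `n_ctx`.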


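```bash
# Set both variables on the same command line so they apply to the pip build.
# Newer llama.cpp releases may expect -DGGML_METAL=on instead; check the setup guide linked above.
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
```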


In [23]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

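llm = LlamaCpp(
    model_path="/Users/tiankonguse-m3/models/qwq-32b.gguf",  # local GGUF file
    n_gpu_layers=1,  # number of layers to load into GPU memory
    n_batch=512,  # number of tokens processed in parallel
    n_ctx=5012,  # token context window
    f16_kv=True,  # half-precision key/value cache
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # stream tokens to stdout
    verbose=True,  # print llama.cpp load and timing logs
)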

llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 26372 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /Users/tiankonguse-m3/models/qwq-32b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str

In [24]:
# Prompt (Chinese): "Who are you"
llm.invoke("你是谁")



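Streamed output (translated from Chinese):

? What are your characteristics?

I am Tongyi Qianwen, a very-large-scale language model independently developed by Tongyi Lab, part of Alibaba Group.

My main characteristics include:

1. Large scale: I was trained on a large amount of internet text and have a broad vocabulary and knowledge base.

2. Natural language processing: I am good at understanding and generating natural language, and can answer questions, write text, express opinions, and so on.

3. Cross-domain knowledge: Because my training data is broad, I have accumulated some knowledge in many fields, including science, technology, culture, and history.

4. Dialogue and interaction: I can hold multi-turn conversations, understand the user's needs from context, and give coherent, meaningful answers.

5. Multilingual support: Besides Chinese, I support many other languages, such as English, French, and Spanish, to meet the needs of international users.

6. Continuous learning and updates: Although my training data ends in December 2024, I am continually optimized and upgraded to keep my knowledge and technology current. I am also adjusted and improved based on user feedback and real-world usage.

In short, as an advanced language model, I have shown strong capabilities and potential in many areas, and I keep improving and developing.

If you have any questions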

llama_perf_context_print:        load time =    2318.10 ms
llama_perf_context_print: prompt eval time =    2317.29 ms /     2 tokens ( 1158.64 ms per token,     0.86 tokens per second)
llama_perf_context_print:        eval time =   50280.50 ms /   255 runs   (  197.18 ms per token,     5.07 tokens per second)
llama_perf_context_print:       total time =   52917.03 ms /   257 tokens
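The timing lines above can be turned into a throughput figure with a little parsing. This is a minimal sketch that assumes the log format shown in this output; the regular expression and helper name are illustrative and may need adjusting for other llama.cpp versions.

```python
import re

# Matches lines such as:
# llama_perf_context_print:        eval time =   50280.50 ms /   255 runs ...
PERF_LINE = re.compile(
    r"llama_perf_context_print:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:tokens|runs))?"
)


def parse_perf(log: str) -> dict:
    """Map each stage name to (milliseconds, token count or None)."""
    stats = {}
    for match in PERF_LINE.finditer(log):
        count = match.group("count")
        stats[match.group("stage")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return stats


log = """\
llama_perf_context_print:        load time =    2318.10 ms
llama_perf_context_print: prompt eval time =    2317.29 ms /     2 tokens ( 1158.64 ms per token,     0.86 tokens per second)
llama_perf_context_print:        eval time =   50280.50 ms /   255 runs   (  197.18 ms per token,     5.07 tokens per second)
llama_perf_context_print:       total time =   52917.03 ms /   257 tokens
"""

stats = parse_perf(log)
eval_ms, eval_tokens = stats["eval"]
decode_tps = eval_tokens / (eval_ms / 1000)
print(f"generation: {decode_tps:.2f} tokens/s")  # generation: 5.07 tokens/s
```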


'？你有什么特点？\n\n我是通义千问，是阿里巴巴集团旗下的通义实验室自主研发的超大规模语言模型。\n\n我的主要特点包括：\n\n1. 大规模：我基于大量的互联网文本进行训练，具有广泛的词汇量和知识库。\n\n2. 自然语言处理能力：我擅长理解和生成自然语言，可以回答问题、创作文字、表达观点等。\n\n3. 跨领域知识：由于训练数据的广泛性，我在多个领域的知识都有一定的积累，包括科学、技术、文化、历史等方面。\n\n4. 对话和交互能力：我可以进行多轮对话，并能够根据上下文理解用户的需求，提供连贯和有意义的回答。\n\n5. 多语言支持：除了中文之外，我还支持其他多种语言，如英语、法语、西班牙语等，可以满足国际用户的使用需求。\n\n6. 持续学习与更新：虽然我的训练数据截止到2024年12月，但我会不断地进行优化和升级，以保持最新的知识和技术水平。同时，我也会根据用户反馈和实际应用情况来进行相应的调整和完善。\n\n总之，作为一个先进的语言模型，我在多个方面都展现出了强大的能力和潜力，并且正在不断进步和发展之中。\n\n如果你有任何问题'

### GPT4All


We can use model weights downloaded from the GPT4All model explorer.
https://python.langchain.com/docs/integrations/llms/gpt4all/  
https://python.langchain.com/api_reference/community/llms/langchain_community.llms.gpt4all.GPT4All.html  


```bash
pip install gpt4all
```


In [16]:
from langchain_community.llms import GPT4All

llm = GPT4All(
    model="/Users/tiankonguse-m3/models/qwq-32b.gguf"
)

# Prompt (Chinese): "Who are you?"
llm.invoke("你是谁？")

ggml_metal_free: deallocating


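Output (translated from Chinese):

'What can you do?\n\nHello! I am Tongyi Qianwen, a very-large-scale language model from Alibaba Group. I can help you with all kinds of tasks, for example:\n\n1. **Answering questions**: Whether the question is academic, technical, or about daily life, I will do my best to answer it.\n2. **Writing text**: I can help you write stories, official documents, emails, scripts, and other kinds of text.\n3. **Logical reasoning**: If you have a puzzle or a complex logic problem to solve, I can help analyze and work through it.\n4. **Programming help**: For common programming languages (such as Python and Java), I can provide code examples and technical support.\n5. **Expressing opinions**: I can share views on a topic and give reasonable supporting arguments.\n6. **Playing games**: We can play text games together, such as riddles or role-play.\n\nIf you have any specific needs or questions, feel free to tell me! 😊\n\n---\n\n### Example usage:\n- "Help me write a resignation letter"\n- "Explain the basic concepts of quantum mechanics"\n- "Design a Python function that computes the Fibonacci sequence"\n- "Recommend a few novels suitable for beginners"  \nand so on.  \n\nIs there anything I can help you with? 🚀\nHello! I am Tongyi Qianwen, a very-large-scale language model from Alibaba Group. I'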

### llamafile

One of the simplest ways to run an LLM locally is using a llamafile.
https://github.com/Mozilla-Ocho/llamafile

All you need to do is:

1. Download a llamafile from HuggingFace
2. Make the file executable
3. Run the file

llamafiles bundle model weights and a specially-compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies.   
They also come with an embedded inference server that provides an API for interacting with your model.

```bash
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser
```
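Once the server is listening, its embedded API can be called directly. The sketch below posts to the `/completion` endpoint (the same URL that the LangChain wrapper uses) with only the standard library. The helper names and the default `n_predict` value are assumptions for illustration.

```python
import json
import urllib.request

LLAMAFILE_URL = "http://localhost:8080"  # default address noted in the comment above


def build_completion_request(prompt: str, n_predict: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to the server's /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}  # n_predict caps new tokens
    return urllib.request.Request(
        f"{LLAMAFILE_URL}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 128, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    request = build_completion_request(prompt, n_predict)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["content"]
```

Calling `complete("The first man on the moon was")` requires the server from the previous step to be running.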



In [17]:
from langchain_community.llms.llamafile import Llamafile

llm = Llamafile()

llm.invoke("The first man on the moon was ... Let's think step by step.")

HTTPError: 502 Server Error: Bad Gateway for url: http://localhost:8080/completion
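An error like the one above generally means the request did not get a usable answer from a llamafile server at that address. A quick first check, sketched below with a hypothetical helper, is whether anything is accepting connections on the port at all. It does not confirm that the listener is actually a llamafile server, and a proxy between the client and localhost can also produce gateway errors.

```python
import socket


def server_is_listening(
    host: str = "localhost", port: int = 8080, timeout: float = 1.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if not server_is_listening():
    print("Nothing is listening on localhost:8080; start the llamafile server first.")
```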

## Prompts

Some LLMs will benefit from specific prompts.


For example, LLaMA 2 chat models use special tokens such as `[INST]` and `<<SYS>>`.
We can use `ConditionalPromptSelector` to set the prompt based on the model type.




In [18]:
# Set our LLM
llm = LlamaCpp(
    model_path="/Users/tiankonguse-m3/models/qwq-32b.gguf",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27642 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /Users/tiankonguse-m3/models/qwq-32b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str

Set the associated prompt based upon the model version.

In [19]:
from langchain.chains.prompt_selector import ConditionalPromptSelector
from langchain_core.prompts import PromptTemplate

DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""<<SYS>> \n You are an assistant tasked with improving Google search \
results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that \
are similar to this question. The output should be a numbered list of questions \
and each should have a question mark at the end: \n\n {question} [/INST]""",
)

DEFAULT_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search \
results. Generate THREE Google search queries that are similar to \
this question. The output should be a numbered list of questions and each \
should have a question mark at the end: {question}""",
)

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)

prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
prompt

PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='<<SYS>> \n You are an assistant tasked with improving Google search results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that are similar to this question. The output should be a numbered list of questions and each should have a question mark at the end: \n\n {question} [/INST]')
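The selection logic itself is simple: the first condition that returns true decides the prompt, and the default is used otherwise. Here is a plain-Python sketch of that behavior, using stand-in classes and strings rather than the real LangChain types (all names below are hypothetical).

```python
from typing import Callable, List, Tuple

Condition = Callable[[object], bool]


def select_prompt(
    llm: object, default_prompt: str, conditionals: List[Tuple[Condition, str]]
) -> str:
    """Return the prompt of the first matching condition, else the default."""
    for condition, prompt in conditionals:
        if condition(llm):
            return prompt
    return default_prompt


# Stand-ins for real model wrappers, for illustration only.
class FakeLlamaCpp:
    pass


class FakeOtherLLM:
    pass


conditionals = [(lambda llm: isinstance(llm, FakeLlamaCpp), "llama-style prompt")]

print(select_prompt(FakeLlamaCpp(), "default prompt", conditionals))  # llama-style prompt
print(select_prompt(FakeOtherLLM(), "default prompt", conditionals))  # default prompt
```

Because the condition above tests the wrapper class with `isinstance(llm, LlamaCpp)`, any GGUF model loaded through `LlamaCpp` receives the Llama-style template, whether or not it was trained on that format.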

In [20]:
# Chain
chain = prompt | llm
question = "世界上最高的山是哪一座?"  # "Which is the highest mountain in the world?"
chain.invoke({"question": question})

 [INST] 

Okay, I need to generate three Google search queries similar to the question "世界上最高的山是哪一座?" which means "Which mountain is the highest in the world?"

First, I should understand what makes these questions similar. The core elements are:
1. Comparing mountains by height (highest).
2. Asking for identification of a specific mountain based on this criterion.

So the three queries need to rephrase but keep those key points. Let me brainstorm some variations:

Possible approach 1: Instead of "哪一座" ("which one"), maybe use "什么名字" ("what name")?

Possible question 1: "世界上最高的山峰叫什么名字？"

Second variation: Maybe focus on the elevation aspect instead of just height. For example, using "海拔最高" (highest elevation)?

Question 2: "全球范围内海拔最高的山脉是哪一座？"

Wait, that might be mixing mountains and mountain ranges. The original question refers to a single peak (Qomolangma/Mount Everest). So maybe I should specify the highest individual mountain?

Alternatively, perhaps adjust to avoid confusion betwe

llama_perf_context_print:        load time =   15733.22 ms
llama_perf_context_print: prompt eval time =   15733.06 ms /    68 tokens (  231.37 ms per token,     4.32 tokens per second)
llama_perf_context_print:        eval time =   49139.17 ms /   255 runs   (  192.70 ms per token,     5.19 tokens per second)
llama_perf_context_print:       total time =   65191.06 ms /   323 tokens


' [INST] \n\nOkay, I need to generate three Google search queries similar to the question "世界上最高的山是哪一座?" which means "Which mountain is the highest in the world?"\n\nFirst, I should understand what makes these questions similar. The core elements are:\n1. Comparing mountains by height (highest).\n2. Asking for identification of a specific mountain based on this criterion.\n\nSo the three queries need to rephrase but keep those key points. Let me brainstorm some variations:\n\nPossible approach 1: Instead of "哪一座" ("which one"), maybe use "什么名字" ("what name")?\n\nPossible question 1: "世界上最高的山峰叫什么名字？"\n\nSecond variation: Maybe focus on the elevation aspect instead of just height. For example, using "海拔最高" (highest elevation)?\n\nQuestion 2: "全球范围内海拔最高的山脉是哪一座？"\n\nWait, that might be mixing mountains and mountain ranges. The original question refers to a single peak (Qomolangma/Mount Everest). So maybe I should specify the highest individual mountain?\n\nAlternatively, perhaps adjust to 

## Use cases

Given an llm created from one of the models above, you can use it for many use cases.

For example, you can implement a RAG application using the chat models demonstrated here.  
https://python.langchain.com/docs/tutorials/rag/

In general, use cases for local LLMs can be driven by at least two factors:

- Privacy: private data (e.g., journals) that a user does not want to share
- Cost: text preprocessing (extraction/tagging), summarization, and agent simulations are token-intensive tasks