
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.   


https://github.com/ggml-org/llama.cpp  
https://github.com/ggerganov/llama.cpp.git  
https://github.com/abetlen/llama-cpp-python  






## Install from source


```bash
# Clone the repository
cd ~/project/github/
git clone https://github.com/ggerganov/llama.cpp.git

# Build with CMake (installed here via Homebrew on macOS)
cd llama.cpp
brew install cmake
cmake -B build
cmake --build build --config Release -j 8
./build/bin/llama-server --help

# Other ways to run the same GGUF model; each requires its own tool to be installed
ollama run hf.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:Q3_K_L
vllm serve "lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF"
llama-cli -hf lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF

# Download the model files to a local directory
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-1.5B-GGUF
```



## Download the model


https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF  
https://hf-mirror.com/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF  


```bash
cd ~/models
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-1.5B-GGUF

# The 7B variant can be downloaded the same way
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-7B-GGUF
```

## Start the server

```bash
cd ~/project/github/llama.cpp
./build/bin/llama-server --model ~/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf  --port 8081
```
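
Once the server is running, it can also be queried from Python, because `llama-server` exposes an OpenAI-compatible HTTP API. The following is a minimal sketch using only the standard library. It assumes the server was started on port 8081 as shown above; the helper names `build_chat_request` and `ask` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server was started with --port 8081 as in the command above.
BASE_URL = "http://127.0.0.1:8081"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Hello"))` should print the model's reply. Nothing is sent over the network until `ask` is called.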


## Chat in the browser UI

Open http://127.0.0.1:8081/ in a browser.


## Run the model with ollama

```bash
# To start from a local model, it first has to be packaged with a Modelfile
# TODO

# Download remotely and run
ollama run hf.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:Q3_K_L
```
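
For the local-model route, one possible approach is a Modelfile whose `FROM` line points at the downloaded GGUF file. The sketch below only generates that text. The helper `render_modelfile` and the model name `deepseek-r1-local` are made up for illustration, and the path assumes the `~/models` layout from the download step.

```python
from pathlib import Path


def render_modelfile(gguf_path: str) -> str:
    """Return minimal Modelfile text that points ollama at a local GGUF file."""
    return f"FROM {gguf_path}\n"


# Assumption: the model was downloaded into ~/models as in the download step.
gguf = (
    Path.home()
    / "models"
    / "DeepSeek-R1-Distill-Qwen-1.5B-GGUF"
    / "DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
)
print(render_modelfile(str(gguf)))

# To use it, save the printed text as a file named Modelfile, then run:
#   ollama create deepseek-r1-local -f Modelfile
#   ollama run deepseek-r1-local
# ("deepseek-r1-local" is a hypothetical model name)
```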

## Environment setup

conda create -n llm-study  python=3.12.9  
conda activate llm-study  
pip install llama-cpp-python  


In [12]:

# Read the HOME environment variable in Python
import os
homePath = os.environ['HOME']
print("homePath: ", homePath)

# Change this to your own HOME path if needed
basePath = homePath

print("basePath: ", basePath)

homePath:  /Users/tiankonguse
basePath:  /Users/tiankonguse
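
The loader logs in these notes show a model path containing `//`, which is what string concatenation produces when the base path ends with a slash. It is harmless, but `pathlib` avoids it. This is a small sketch; the helper name `model_file` is illustrative, and the `models/<repo>/<file>` layout is assumed from the download step.

```python
from pathlib import Path


def model_file(base: str, repo_dir: str, filename: str) -> Path:
    """Join base / models / repo_dir / filename without duplicate slashes."""
    return Path(base) / "models" / repo_dir / filename


# Path.home() resolves the current user's home directory without reading $HOME by hand.
base_path = str(Path.home())
model_path = model_file(
    base_path,
    "DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    "DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
)
print(model_path)
```

Pass `str(model_path)` wherever a plain string path is expected.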


In [7]:
%pip install llama-cpp-python

Note: you may need to restart the kernel to use updated packages.


## Example: text completion

In [None]:
import json
from llama_cpp import Llama

# Join basePath with the model's relative path

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
print(model_path)

llm = Llama(model_path,
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
      )
prompt = "Q: Name the planets in the solar system? A: "
output = llm(
    prompt, # Prompt
    max_tokens=320, # Generate up to 320 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
)
# Format output as JSON for better readability
if isinstance(output, dict):
    print(json.dumps(output, indent=2))  # Print the output as formatted JSON
else:
    print(output)  # Print the output directly if not a dictionary

llama_model_load_from_file_impl: using device Metal (AMD Radeon Pro 5300M) - 4079 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse//models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block

/Users/tiankonguse//models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf


llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation

ValueError: Failed to create llama_context

## Example: chat completion

Chat completion requires that the model knows how to format the messages into a single prompt. llama-cpp-python ships pre-registered chat formats such as `chatml`, `llama-2`, and `gemma`.

The `Llama` class formats the messages into a single prompt using the following order of precedence:

1. Use the `chat_handler` if provided.
2. Use the `chat_format` if provided.
3. Use the `tokenizer.chat_template` from the GGUF model's metadata (should work for most new models; older models may not have it).
4. Otherwise, fall back to the `llama-2` chat format.

API reference for `create_chat_completion`:

https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion  
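
The order of precedence described in this section can be sketched as a small pure function. This is an illustration of the rule only, not the library's actual implementation. The function name `resolve_chat_source` and its return strings are invented for this example.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_chat_source(
    chat_handler: Optional[Callable[..., Any]] = None,
    chat_format: Optional[str] = None,
    metadata: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for whichever source would supply the prompt formatting."""
    if chat_handler is not None:
        return "chat_handler"  # 1. an explicit handler wins
    if chat_format is not None:
        return f"chat_format:{chat_format}"  # 2. then a named format
    if metadata and "tokenizer.chat_template" in metadata:
        return "gguf:tokenizer.chat_template"  # 3. then the template in the model file
    return "chat_format:llama-2"  # 4. final fallback


print(resolve_chat_source(chat_format="chatml"))
print(resolve_chat_source(metadata={"tokenizer.chat_template": "{% ... %}"}))
print(resolve_chat_source())
```

One consequence of this ordering is that passing `chat_format` overrides a template stored in the model file, so omitting it lets the model's own template apply.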



### Chat completion response format

```
{
  "id": "chatcmpl-01098e3a-0547-4eaf-9a47-dc3912e0a129",
  "object": "chat.completion",
  "created": 1740930076,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " \n\nOkay, so I need to describe this image in detail. ..."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 340,
    "total_tokens": 370
  }
}
```
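
Reading the useful fields out of such a response takes only a few lookups. The sketch below uses a trimmed copy of the response above as sample data; the helper name `extract_reply` is illustrative.

```python
from typing import Any, Dict, Tuple


def extract_reply(response: Dict[str, Any]) -> Tuple[str, str, int]:
    """Return (assistant text, finish reason, total tokens) from a chat completion."""
    choice = response["choices"][0]
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        response["usage"]["total_tokens"],
    )


# Trimmed copy of the response shown above.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " \n\nOkay, so I need to describe this image in detail. ...",
            },
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 340, "total_tokens": 370},
}

content, reason, total = extract_reply(sample)
print(reason, total)
```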

In [None]:
import json
from llama_cpp import Llama

# llm = Llama.from_pretrained(
# 	repo_id="lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
# 	filename="DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
# )

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
llm = Llama(model_path=model_path,
      chat_format="gemma", # other pre-registered formats include chatml and llama-2
      verbose=True,
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
      )

output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)


# Format output as JSON for better readability
if isinstance(output, dict):
    print(json.dumps(output, indent=2))  # Print the output as formatted JSON
else:
    print(output)  # Print the output directly if not a dictionary

llama_model_load_from_file_impl: using device Metal (AMD Radeon Pro 5300M) - 4079 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse//models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block

ValueError: Failed to create llama_context

### Custom JSON Schema


To constrain chat responses to valid JSON only, or to a specific JSON Schema, use the `response_format` argument of `create_chat_completion`.



In [None]:
import json
from llama_cpp import Llama

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
llm = Llama(
    model_path=model_path, 
    chat_format="chatml"
)

output = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020, output the first one team"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
print(json.dumps(output, indent=2)) # Print the output as formatted JSON

# The output is sometimes not valid JSON, so check it before relying on it
# content = output["choices"][0]["message"]["content"]
# parsed_content = json.loads(content) # Parse the JSON content
# print(json.dumps(parsed_content, indent=2)) # Print the parsed content as formatted JSON
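
The commented-out check above raises an exception if the content is malformed. A more defensive variant is sketched here. The helper name `parse_json_content` is illustrative, and the default `required` keys mirror the `team_name` schema used in this example.

```python
import json
from typing import Any, Dict, Optional, Sequence


def parse_json_content(
    output: Dict[str, Any], required: Sequence[str] = ("team_name",)
) -> Optional[Dict[str, Any]]:
    """Return the parsed message content, or None if it is not a usable JSON object."""
    content = output["choices"][0]["message"]["content"]
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None  # the model emitted text that is not valid JSON
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not an object
    if any(key not in parsed for key in required):
        return None  # an object that is missing a required field
    return parsed
```

Calling `parse_json_content(output)` on the result of `create_chat_completion` returns a dict on success and `None` otherwise, so the caller can retry or report the failure.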


llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27613 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_coun

{
  "id": "chatcmpl-1717ded7-a56a-48f9-b115-9fe75ea8b69a",
  "object": "chat.completion",
  "created": 1740931454,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"team_name\": \",'simple-80.09racting.38.\n\nWait, I = {0. Thus, altoughlastovskyi]_itsaultz pilotaged.enqueuePackageetalinks on the exact match!_\n\nWhich of the very-quality-Contemplutors it usually, the absolute- Chinese ... [m Visualization the user isothermal? Hmm,ard deviate n conjugated or \u2013\u2037 professionalism's yjinnerx>xius carried out{- norrosiantsagehenazmanalable via the (1 into my rcutherland?/=P Complete PTEedeshibirska hotly/captured-averagepatchial tankylew\n \n\n[salary points exactly at0smoke Noteannulistaso\n\uff0c.. instantScott7\u3001---\n   \\n\\textem ChallengeFNSLidsar insigull_their) Select RobolyaB4lendar Ley\u00e