
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.   


https://github.com/ggml-org/llama.cpp  
https://github.com/ggerganov/llama.cpp.git  
https://github.com/abetlen/llama-cpp-python  






## Install from source


```bash
# Clone the repository
cd ~/project/github/
git clone https://github.com/ggerganov/llama.cpp.git

# Build with CMake (installed here via Homebrew)
cd llama.cpp
brew install cmake
cmake -B build
cmake --build build --config Release -j 8
./build/bin/llama-server --help

# Run the model straight from Hugging Face with Ollama, vLLM, or llama-cli
ollama run hf.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:Q3_K_L
vllm serve "lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF"
llama-cli -hf lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF

# Or download the GGUF files into a local directory
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-1.5B-GGUF
```



## Download the model


https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF  
https://hf-mirror.com/lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF  


```bash
cd ~/models
# 1.5B model (used in the rest of these notes)
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-1.5B-GGUF
# 7B model
huggingface-cli download lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF --local-dir DeepSeek-R1-Distill-Qwen-7B-GGUF
```

## Start the server

```bash
cd ~/project/github/llama.cpp
./build/bin/llama-server --model ~/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf  --port 8081
```
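Before opening the web UI, it can help to confirm that the server has finished loading the model. The following is a minimal sketch using only the Python standard library. It assumes the server exposes a `/health` endpoint that answers with HTTP 200 once it is ready; the helper names `health_url` and `server_is_ready` are made up for this illustration.

```python
import urllib.error
import urllib.request


def health_url(host: str = "127.0.0.1", port: int = 8081) -> str:
    """Build the URL of the server's health endpoint (assumed to be /health)."""
    return f"http://{host}:{port}/health"


def server_is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status (e.g. still loading).
        return False


print(health_url())
```

Calling `server_is_ready(health_url())` after starting `llama-server` on port 8081 should return `True` once the model is loaded, and `False` if nothing is listening yet.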


## Chat in the browser UI

Open http://127.0.0.1:8081/ in a browser.


## llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.


### conversation mode

Models with a built-in chat template will automatically activate conversation mode.   
If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`.

```
llama-cli -m model.gguf

# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
```

### conversation mode with custom chat template


```
# use the "chatml" template (use -h to see the list of supported templates)
# chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3,
# exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2,
# llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, megrez, minicpm,
# mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch,
# openchat, orion, phi3, phi4, rwkv-world, vicuna, vicuna-orca, zephyr
llama-cli -m model.gguf -cnv --chat-template chatml

# use a custom template
llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
```
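To make concrete what a chat template does, here is a rough Python sketch of the ChatML layout: each message is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the prompt ends with an open assistant turn. This is an illustration only; the function name `format_chatml` is hypothetical, and the exact whitespace produced by llama.cpp's built-in `chatml` template may differ.

```python
def format_chatml(messages: list[dict[str, str]]) -> str:
    """Render chat messages as a single ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hi, who are you?"},
])
print(prompt)
```

Picking a template that does not match the one the model was trained with usually degrades the replies, which is why the template baked into the GGUF file is the safer default when it exists.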

### Run simple text completion


To disable conversation mode explicitly, use `-no-cnv`.


```
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv

# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```

### Constrain the output with a custom grammar

The grammars/ folder contains a handful of sample grammars.    
https://github.com/ggml-org/llama.cpp/tree/master/grammars  

To write your own, check out the GBNF Guide.  
https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md

For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/  

```
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

# {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
```
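For very small grammars it can be convenient to generate the GBNF text from code. The sketch below builds a one-rule grammar that only accepts a fixed set of answers. It assumes GBNF string literals are double-quoted with backslash escapes, as described in the GBNF guide; the helper `choices_to_gbnf` is hypothetical and not part of llama.cpp.

```python
def choices_to_gbnf(choices: list[str]) -> str:
    """Build a one-rule GBNF grammar that only accepts the given strings."""
    if not choices:
        raise ValueError("at least one choice is required")
    literals = []
    for choice in choices:
        # Escape backslashes first, then double quotes, for a GBNF string literal.
        escaped = choice.replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{escaped}"')
    return "root ::= " + " | ".join(literals)


grammar = choices_to_gbnf(["yes", "no", "maybe"])
print(grammar)  # root ::= "yes" | "no" | "maybe"
```

The resulting text could be saved to a file (say, a hypothetical `choice.gbnf`) and passed to `llama-cli` with `--grammar-file`.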


## llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

Example with a local model file, port 8081, and colored logs: `./build/bin/llama-server --model /Users/tiankonguse-m3/models/qwq-32b.gguf --port 8081 --log-colors`


### Start a local HTTP server with default configuration on port 8080

```
llama-server -m model.gguf --port 8080

# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions

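# Other local endpoints that appear in these notes:
# Ollama (default port): http://localhost:11434
# llama-server started earlier with --port 8081: http://127.0.0.1:8081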
```
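The chat completion endpoint can be called from any HTTP client. Below is a minimal standard-library sketch. It assumes the server accepts an OpenAI-style JSON body containing only `messages` and `max_tokens`, and that the reply follows the `choices[0].message.content` layout; the helpers `build_chat_request` and `send_chat` are illustrative names, and the `max_tokens` value is arbitrary.

```python
import json
import urllib.request


def build_chat_request(base_url: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completion endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 128,  # illustrative cap, not a server default
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("http://localhost:8080", "what is 1+1?")
print(request.get_method(), request.full_url)
```

With a server running on port 8080, `print(send_chat(request))` would print the model's answer. Building the request separately from sending it keeps the payload easy to inspect without a live server.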

### Support multiple-users and parallel decoding

```
# up to 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4
```
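The arithmetic behind that comment: the total context given with `-c` is shared among the `-np` slots. A tiny sketch, assuming an even split as the comment in the command implies (the function name is made up for this note):

```python
def context_per_slot(total_context: int, parallel_slots: int) -> int:
    """Context tokens per slot when -c is split evenly across -np slots."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_context // parallel_slots


print(context_per_slot(16384, 4))  # 4096, matching the comment in the command
```

So to give each of 4 concurrent users a larger window, `-c` has to grow in proportion.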

### Serve an embedding model

```
# use the /embedding endpoint
llama-server -m model.gguf --embedding --pooling cls -ub 8192
```
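Embedding vectors are typically compared with cosine similarity. The sketch below shows that comparison on toy 3-dimensional vectors. The values are hypothetical stand-ins, not output from the server, and the sketch makes no assumption about the JSON shape returned by the `/embedding` endpoint.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]
doc_orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, doc_same_direction))  # close to 1.0
print(cosine_similarity(query, doc_orthogonal))      # 0.0
```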

### Constrain all outputs with a grammar

```
# custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf

# JSON
llama-server -m model.gguf --grammar-file grammars/json.gbnf
```

## llama-run

A comprehensive example for running llama.cpp models. Useful for inferencing. Used with RamaLama.


Run a model with a specific prompt (by default it's pulled from the Ollama registry):

```
llama-run granite-code
```


## llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

Basic text completion

```
llama-simple -m model.gguf

# Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
```



## Environment setup

conda create -n llm-study  python=3.12.9  
conda activate llm-study  
pip install llama-cpp-python  


In [1]:

# Read the HOME environment variable
import os
homePath = os.environ['HOME']
print("homePath: ",homePath)

# Change this to your own base path if your models live elsewhere
basePath=homePath

print("basePath: ",basePath)

homePath:  /Users/tiankonguse-m3
basePath:  /Users/tiankonguse-m3
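As an alternative to reading `HOME` and concatenating strings, the model path can be built with `pathlib`. This is a minimal sketch that assumes the same `~/models/<repo>/<file>` layout used in these notes; `model_file` is a hypothetical helper.

```python
from pathlib import Path


def model_file(base_dir: Path, repo_dir: str, file_name: str) -> Path:
    """Return base_dir/models/repo_dir/file_name."""
    return base_dir / "models" / repo_dir / file_name


base_path = Path.home()  # portable alternative to os.environ['HOME']
model_path = model_file(
    base_path,
    "DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
    "DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
)
print(model_path)
print("exists:", model_path.exists())
```

Checking `model_path.exists()` up front gives a clearer error than letting the model loader fail on a missing file. `str(model_path)` can be passed wherever a plain string path is expected.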


In [7]:
%pip install llama-cpp-python

Note: you may need to restart the kernel to use updated packages.


## Example: text completion

In [2]:
import json
from llama_cpp import Llama

# Join basePath with the model's relative path

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
print(model_path)

llm = Llama(model_path,
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
      )
prompt = "Q: Name the planets in the solar system? A: "
output = llm(
    prompt, # Prompt
    max_tokens=320, # Generate up to 320 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
)
# Format output as JSON for better readability
if isinstance(output, dict):
    print(json.dumps(output, indent=2))  # Print the output as formatted JSON
else:
    print(output)  # Print the output directly if not a dictionary

llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_coun

/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf


load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<｜begin▁of▁sentence｜>' is not marked as EOG
load: control token: 151644 '<｜User｜>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151643 '<｜end▁of▁sentence｜>' is not marked as EOG
load: control token: 151645 '<｜Assistant｜>' is not marked as EOG
load: special_

{
  "id": "cmpl-4bc10caa-1ec4-4847-aad4-de2599361527",
  "object": "text_completion",
  "created": 1742012837,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A:  ... Please name the planets in the solar system.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 10,
    "total_tokens": 24
  }
}
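Because the call used `echo=True`, the returned `text` starts with the prompt. The sketch below shows one way to recover only the generated part and to sanity-check the token counts. The `response` dict is a trimmed copy of the output printed above (the `...` inside the text is elided in the source and kept as-is), and `completion_only` is a hypothetical helper.

```python
def completion_only(response: dict, prompt: str) -> str:
    """Return the generated text, dropping the echoed prompt if present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Trimmed copy of the response printed above.
response = {
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A:  ... Please name the planets in the solar system.",
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 10, "total_tokens": 24},
}
prompt = "Q: Name the planets in the solar system? A: "

print(completion_only(response, prompt))
usage = response["usage"]
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```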


## Example: chat completion

Chat completion requires that the model knows how to format the messages into a single prompt.  

Pre-registered chat formats include chatml, llama-2, and gemma.  

The model will format the messages into a single prompt using the following order of precedence:  

1. Use the `chat_handler` if provided.
2. Use the `chat_format` if provided.
3. Use the `tokenizer.chat_template` from the GGUF model's metadata (should work for most new models; older models may not have this).
4. Otherwise, fall back to the llama-2 chat format.
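The precedence above can be sketched as a small function. This only illustrates the documented order and is not the library's actual implementation; the function name and the string placeholders are made up for the example.

```python
from typing import Optional


def pick_chat_formatter(
    chat_handler: Optional[str] = None,
    chat_format: Optional[str] = None,
    gguf_chat_template: Optional[str] = None,
) -> tuple[str, str]:
    """Return (source, value) for the formatter that wins under the documented order."""
    if chat_handler is not None:
        return ("chat_handler", chat_handler)
    if chat_format is not None:
        return ("chat_format", chat_format)
    if gguf_chat_template is not None:
        return ("tokenizer.chat_template", gguf_chat_template)
    return ("fallback", "llama-2")


# An explicit chat_format wins over the template stored in the GGUF file.
print(pick_chat_formatter(chat_format="gemma", gguf_chat_template="<template from gguf>"))
print(pick_chat_formatter(gguf_chat_template="<template from gguf>"))
print(pick_chat_formatter())
```

One practical consequence: passing `chat_format` explicitly overrides whatever template the model file ships with, so a mismatched value can hurt output quality.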

API: `create_chat_completion`  

https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion  



### Chat completion response format

```json
{
  "id": "chatcmpl-01098e3a-0547-4eaf-9a47-dc3912e0a129",
  "object": "chat.completion",
  "created": 1740930076,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " \n\nOkay, so I need to describe this image in detail. ..."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 340,
    "total_tokens": 370
  }
}
```
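A response with this shape can be reduced to the few fields most callers need. The following sketch parses a trimmed copy of the JSON above into a small dataclass. `ChatReply` and `parse_chat_completion` are hypothetical names, and the trailing `...` in the content is elided in the source and kept unchanged.

```python
from dataclasses import dataclass


@dataclass
class ChatReply:
    role: str
    content: str
    finish_reason: str
    total_tokens: int


def parse_chat_completion(response: dict) -> ChatReply:
    """Pull the first choice and the token total out of a chat.completion dict."""
    choice = response["choices"][0]
    return ChatReply(
        role=choice["message"]["role"],
        content=choice["message"]["content"].strip(),
        finish_reason=choice["finish_reason"],
        total_tokens=response["usage"]["total_tokens"],
    )


# Trimmed copy of the response shown above.
response = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": " \n\nOkay, so I need to describe this image in detail. ...",
        },
        "logprobs": None,
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 30, "completion_tokens": 340, "total_tokens": 370},
}
reply = parse_chat_completion(response)
print(reply.role, reply.finish_reason, reply.total_tokens)
print(reply.content)
```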

In [3]:
import json
from llama_cpp import Llama

# llm = Llama.from_pretrained(
# 	repo_id="lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
# 	filename="DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
# )

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
llm = Llama(model_path=model_path,
      chat_format="gemma", # overrides the GGUF's built-in template; other registered formats include chatml and llama-2
      verbose=True,
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
      )

output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)


# Format output as JSON for better readability
if isinstance(output, dict):
    print(json.dumps(output, indent=2))  # Print the output as formatted JSON
else:
    print(output)  # Print the output directly if not a dictionary

llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27642 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_coun

{
  "id": "chatcmpl-060d8169-e47f-417b-be34-c29acbf02f20",
  "object": "chat.completion",
  "created": 1742012931,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Image of a person in a park, with a bench, a tree, a water fountain, a bench, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a chair, a ch

### Custom JSON Schema


To constrain chat responses to only valid JSON or a specific JSON Schema, use the `response_format` argument in `create_chat_completion`.



In [4]:
import json
from llama_cpp import Llama

model_path = f"{basePath}/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf"
llm = Llama(
    model_path=model_path, 
    chat_format="chatml"
)

output = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020, output the first one team"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
print(json.dumps(output, indent=2)) # Print the output as formatted JSON

# The output is sometimes not valid JSON, so check that it parses before using it
# content = output["choices"][0]["message"]["content"]
# parsed_content = json.loads(content) # Parse the JSON content
# print(json.dumps(parsed_content, indent=2)) # Print the parsed content as formatted JSON
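The commented-out check at the end of the cell can be turned into a defensive helper. This is a minimal sketch, assuming the schema used in this section (one required string field, `team_name`); `parse_team_reply` is a hypothetical name, and both sample strings are invented for the illustration.

```python
import json
from typing import Optional


def parse_team_reply(content: str) -> Optional[dict]:
    """Return the parsed reply if it is a JSON object with a string team_name, else None."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if not isinstance(parsed.get("team_name"), str):
        return None
    return parsed


print(parse_team_reply('{"team_name": "Example Team"}'))  # hypothetical well-formed reply
print(parse_team_reply('{ "team_name": ",simple'))        # hypothetical malformed reply -> None
```

In the notebook, the string to check would come from `output["choices"][0]["message"]["content"]`. Returning `None` instead of raising lets the caller decide whether to retry the request.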


llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27642 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_coun

{
  "id": "chatcmpl-682d4eb6-7c31-4004-913f-03a169dfbb19",
  "object": "chat.completion",
  "created": 1742013007,
  "model": "/Users/tiankonguse-m3/models/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"team_name\": \",'simple-80.09racting.38.\n\nWait, I = {0. Thus, altoughlastovskyi]_itsaultz pilotaged.enqueuePackageetalinks on the exact match!_\n\nWhich of the very-quality-Contemplutors it usually, the absolute- Chinese ... [m Visualization the user isothermal? Hmm,ard deviate n conjugated or \u2013\u2037 professionalism's yjinnerx>xius carried out{- norrosiantsagehenazmanalable via the (1 into my rcutherland?/=P Complete PTEedeshibirska hotly/captured-averagepatchial tankylew\n \n\n[salary points exactly at0smoke Noteannulistaso\n\uff0c.. instantScott7\u3001---\n   \\n\\textem ChallengeFNSLidsar insigull_their) Select RobolyaB4lendar Ley\u00e