openai_api_en

ymcui edited this page Mar 26, 2024 · 1 revision

OpenAI API Demo

For more detailed information on the OpenAI API, visit: https://platform.openai.com/docs/api-reference

This is a simple demo of a server modeled on the OpenAI API, implemented with FastAPI.

The examples below were produced with Chinese-Mixtral-Instruct loaded in 4-bit mode on an A100 (40 GB), using approximately 24.4 GB of VRAM.

Deployment

Install Dependencies

pip install fastapi uvicorn shortuuid sse_starlette peft bitsandbytes

Start Command

python scripts/openai_server_demo/openai_api_server.py --base_model /path/to/base_model --lora_model /path/to/lora_model --gpus 0,1

Parameter explanations:

  • --base_model {base_model}: Directory containing the HF-format LLaMA-2 model weights and configuration files. This can be either the merged Chinese Alpaca-2 model (in which case --lora_model is not needed) or the original LLaMA-2 model converted to HF format (in which case --lora_model is required).

  • --lora_model {lora_model}: Directory containing the unpacked Chinese Alpaca-2 LoRA files; a 🤗 Model Hub model name can be used instead. If this parameter is omitted, only the model specified by --base_model is loaded.

  • --tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If omitted, it defaults to the value of --lora_model; if --lora_model is also omitted, it defaults to the value of --base_model.

  • --only_cpu: Use only CPU for inference.

  • --gpus {gpu_ids}: GPU device IDs to use; the default is 0. For multiple GPUs, separate the IDs with commas, e.g. 0,1,2.

  • --load_in_8bit: Load the model in 8-bit mode for inference. This saves VRAM but may reduce model quality.

  • --load_in_4bit: Load the model in 4-bit mode for inference. This saves VRAM but may reduce model quality.

  • --use_flash_attention_2: Use flash-attention2 for faster inference.
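The fallback order described for --tokenizer_path can be sketched as a small function. This is only an illustration of the documented defaults, not the server's actual code; the function name `resolve_tokenizer_path` is hypothetical.

```python
from typing import Optional


def resolve_tokenizer_path(
    base_model: str,
    lora_model: Optional[str] = None,
    tokenizer_path: Optional[str] = None,
) -> str:
    """Pick the tokenizer directory using the fallback order described above."""
    if tokenizer_path:   # an explicit --tokenizer_path always wins
        return tokenizer_path
    if lora_model:       # otherwise fall back to --lora_model
        return lora_model
    return base_model    # finally fall back to --base_model


print(resolve_tokenizer_path("/path/to/base_model", "/path/to/lora_model"))
print(resolve_tokenizer_path("/path/to/base_model"))
```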

API Documentation

Text Completion (completion)

The most basic API endpoint: it takes a prompt as input and returns the language model's text completion.

The API demo has a built-in prompt template. The prompt is wrapped in an instruction template, so the input should read like an instruction rather than a turn in a conversation.

Quick Start with the Completion Interface

Request command:

curl http://localhost:19327/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "请你介绍一下中国的首都"
  }'

JSON response:

{
  "id": "cmpl-XyN3HwTjKFbNLS88J79C5D",
  "object": "text_completion",
  "created": 1711419745,
  "model": "chinese-mixtral",
  "choices": [
    {
      "index": 0,
      "text": "中国的首都是北京,位于华北平原上,是中国最大的城市之一。北京有着悠久的历史和文化底迹,被誉为\"万里长城起点、紫禁城居中\"。\n\n北京作为中国的政治、经济、文化中心,拥有丰富的旅游资源和名胜古迹。其中最著名的景点包括故宫、天安门广场、颐和园、圆明园等。此外,北京还有许多博物馆、艺术馆和剧院,如中国国家博物馆、中国美术馆、国家大剧院等,展示了中国的历史和文化。\n\n北京也是中国的政治中心,是中央人民政府所在地。国务院、全国人大常委会、全国政协常委会等重要机构均设立在北京。此外,北京还是中国的外交中心,许多国际组织和外国使领馆设在这里。\n\n北京的交通非常便利,有四个机场、六条地铁线路以及高速公路网络。北京还是中国的科技创新中心之一,拥有众多高校和研究机构,如清华大学、北京大学等。\n\n总体来说,北京作为中国的首都,具有深厚的历史和文化底蕴,同时也是一个现代化、繁荣发展的城市,吸引着众多国内外游客前来参观和探索。"
    }
  ]
}
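The same request can be issued from Python using only the standard library. The sketch below assumes the host and port from the curl example above (localhost:19327); the helper names `build_completion_request` and `complete` are hypothetical and not part of the demo server.

```python
import json
import urllib.request

# Assumed from the curl example above: the demo server listens on localhost:19327.
BASE_URL = "http://localhost:19327/v1"


def build_completion_request(prompt: str, **params) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"prompt": prompt, **params}
    return urllib.request.Request(
        BASE_URL + "/completions",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **params) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_completion_request(prompt, **params)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["text"]
```

With the server running, `complete("请你介绍一下中国的首都", max_tokens=128)` should return the `text` field shown in the JSON response above. Extra keyword arguments are passed through as request parameters.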

Completion Interface Parameter Explanation

For more details on decoding strategies, see https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d. The article covers the three decoding strategies used by LLaMA: greedy decoding, random sampling, and beam search. These strategies underlie advanced parameters such as top_k, top_p, temperature, and num_beams.

  • prompt: The prompt for generating text completion.

  • max_tokens: The maximum number of new tokens to generate.

  • temperature: The sampling temperature, between 0 and 2. A higher value such as 0.8 makes the output more random, while a lower value such as 0.2 makes it more deterministic. A higher temperature flattens the token probability distribution, so sampling picks lower-probability tokens more often.

  • num_beams: The number of beams used when decoding with beam search. With num_beams=1 and sampling disabled, decoding reduces to greedy decoding.

  • top_k: In random sampling, only the top_k highest-probability tokens are kept as candidates, and the next token is sampled from them.

  • top_p: In random sampling, the candidates are the highest-probability tokens, taken in descending order until their cumulative probability reaches top_p; all remaining tokens are filtered out. Lower values shrink the candidate set and reduce randomness. For example, if top_p is set to 0.6 and the probabilities of the top 5 tokens are {0.23, 0.20, 0.18, 0.11, 0.10}, the cumulative probability of the first three tokens is 0.61, so the fourth and later tokens are filtered out, leaving only the first three as candidates.

  • repetition_penalty: Repetition penalty. For more details, refer to this article: https://arxiv.org/pdf/1909.05858.pdf.

  • do_sample: Whether to enable random sampling. Default is true.
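The effect of temperature, top_k, and top_p can be illustrated with a few lines of plain Python. This is a minimal sketch of the general techniques, not the code the server or its generation library actually runs; the function names are hypothetical.

```python
import math
from typing import List


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; a higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs: List[float], k: int) -> List[int]:
    """Indices of the k most probable tokens, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_filter(probs: List[float], top_p: float) -> List[int]:
    """Smallest set of most probable tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


probs = [0.23, 0.20, 0.18, 0.11, 0.10]
print(top_p_filter(probs, 0.6))  # [0, 1, 2]: 0.23 + 0.20 + 0.18 = 0.61 >= 0.6
print(top_k_filter(probs, 2))    # [0, 1]
```

The next token is then sampled only from the surviving indices, after renormalizing their probabilities.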

Chat Completion

Supports multi-turn dialogue.

Quick Start with the Chat Completion Interface

Request command:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "如何制作个人网站?"}
    ],
    "repetition_penalty": 1.0
  }'

JSON response:

{
  "id": "chatcmpl-tM9d3ECpZMRojTBgYx53ej",
  "object": "chat.completion",
  "created": 1711420136,
  "model": "chinese-mixtral",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "user",
        "content": "如何制作个人网站?"
      }
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "制作个人网站可以通过以下步骤进行:\n\n1. 确定网站目的和内容:首先需要明确个人网站的目的和内容,例如是展示个人作品、分享个人经历、提供个人服务等。\n\n2. 选择合适的网站平台:根据个人需求和技术水平,选择合适的网站平台。常见的平台有WordPress、Wix、Squarespace等。\n\n3. 注册域名和购买主机:选择一个合适的域名,并购买一个主机来存储网站文件和数据。\n\n4. 安装和设置网站平台:根据所选择的平台,按照平台的指导进行安装和设置。\n\n5. 设计网站布局和样式:根据个人喜好和网站目的,设计网站的布局和样式。可以使用模板或自定义设计。\n\n6. 创建网站内容:根据网站目的,创建网站的内容,包括文字、图片、视频等。\n\n7. 添加功能和插件:根据需求,添加网站的功能和插件,例如联系方式、社交媒体分享等。\n\n8. 优化网站SEO:进行网站的SEO优化,包括关键词优化、网站结构优化等,以提高网站在搜索引擎中的排名。\n\n9. 测试和发布网站:在发布网站之前,进行测试以确保网站的正常运行和用户体验。\n\n10. 维护和更新网站:定期更新网站内容,修复漏洞和错误,保持网站的安全和稳定。\n\n"
      }
    }
  ]
}

Multi-turn Dialogue

Request command:

curl http://localhost:19327/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user","content": "如何制作个人网站?"},
      {"role": "assistant","content": "制作个人网站可以通过以下步骤进行:\n\n1. 确定网站目的和内容:首先需要明确个人网站的目的和内容,例如是展示个人作品、分享个人经历、提供个人服务等。\n\n2. 选择合适的网站平台:根据个人需求和技术水平,选择合适的网站平台。常见的平台有WordPress、Wix、Squarespace等。\n\n3. 注册域名和购买主机:选择一个合适的域名,并购买一个主机来存储网站文件和数据。\n\n4. 安装和设置网站平台:根据所选择的平台,按照平台的指导进行安装和设置。\n\n5. 设计网站布局和样式:根据个人喜好和网站目的,设计网站的布局和样式。可以使用模板或自定义设计。\n\n6. 创建网站内容:根据网站目的,创建网站的内容,包括文字、图片、视频等。\n\n7. 添加功能和插件:根据需求,添加网站的功能和插件,例如联系方式、社交媒体分享等。\n\n8. 优化网站SEO:进行网站的SEO优化,包括关键词优化、网站结构优化等,以提高网站在搜索引擎中的排名。\n\n9. 测试和发布网站:在发布网站之前,进行测试以确保网站的正常运行和用户体验。\n\n10. 维护和更新网站:定期更新网站内容,修复漏洞和错误,保持网站的安全和稳定。\n\n"},
      {"role": "user","content": "我还是一个新手,有没有比较快的方案?"}
    ],
    "repetition_penalty": 1.1
  }'

JSON response:

{
  "id": "chatcmpl-xjXrJwWEWn8hgsWQw4XFj9",
  "object": "chat.completion",
  "created": 1711420549,
  "model": "chinese-mixtral",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "user",
        "content": "如何制作个人网站?"
      }
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "制作个人网站可以通过以下步骤进行:\n\n1. 确定网站目的和内容:首先需要明确个人网站的目的和内容,例如是展示个人作品、分享个人经历、提供个人服务等。\n\n2. 选择合适的网站平台:根据个人需求和技术水平,选择合适的网站平台。常见的平台有WordPress、Wix、Squarespace等。\n\n3. 注册域名和购买主机:选择一个合适的域名,并购买一个主机来存储网站文件和数据。\n\n4. 安装和设置网站平台:根据所选择的平台,按照平台的指导进行安装和设置。\n\n5. 设计网站布局和样式:根据个人喜好和网站目的,设计网站的布局和样式。可以使用模板或自定义设计。\n\n6. 创建网站内容:根据网站目的,创建网站的内容,包括文字、图片、视频等。\n\n7. 添加功能和插件:根据需求,添加网站的功能和插件,例如联系方式、社交媒体分享等。\n\n8. 优化网站SEO:进行网站的SEO优化,包括关键词优化、网站结构优化等,以提高网站在搜索引擎中的排名。\n\n9. 测试和发布网站:在发布网站之前,进行测试以确保网站的正常运行和用户体验。\n\n10. 维护和更新网站:定期更新网站内容,修复漏洞和错误,保持网站的安全和稳定。\n\n"
      }
    },
    {
      "index": 2,
      "message": {
        "role": "user",
        "content": "我还是一个新手,有没有比较快的方案?"
      }
    },
    {
      "index": 3,
      "message": {
        "role": "assistant",
        "content": "对于新手来说,可以考虑使用一些简单易用的网站建设工具,这些工具通常提供了预设的模板和拖放界面,可以帮助你快速创建个人网站。\n\n以下是一些推荐的网站建设工具:\n\n1. Wix:Wix是一个非常受欢迎的网站建设工具,它提供了大量的模板和拖放界面,使得创建网站变得非常简单。你只需要选择一个模板,然后使用拖放界面添加和编辑内容即可。\n\n2. Squarespace:Squarespace也是一个流行的网站建设工具,它提供了现代化的模板和易于使用的界面。你可以选择一个模板,然后使用拖放界面添加和编辑内容。\n\n3. WordPress:WordPress是一个强大的网站建设平台,虽然相对于其他工具来说稍微复杂一些,但它提供了丰富的插件和主题,可以满足各种不同的需求。你可以选择一个主题,然后使用插件添加和编辑内容。\n\n"
      }
    }
  ]
}
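In the sample responses above, `choices` echoes the whole conversation and the newest assistant reply is the last entry. A client can use that to build the next turn, as in the sketch below. It assumes the response layout shown above; the helper names `last_assistant_reply` and `next_request` are hypothetical.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def last_assistant_reply(response: dict) -> Optional[str]:
    """Return the content of the final assistant message in `choices`, if any."""
    for choice in reversed(response.get("choices", [])):
        message = choice.get("message", {})
        if message.get("role") == "assistant":
            return message.get("content")
    return None


def next_request(history: List[Message], response: dict, user_input: str) -> dict:
    """Build the next request body: history + latest assistant reply + new user turn."""
    messages = list(history)
    reply = last_assistant_reply(response)
    if reply is not None:
        messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": user_input})
    return {"messages": messages}


# Shortened stand-in for the first JSON response shown above.
history = [{"role": "user", "content": "如何制作个人网站?"}]
response = {
    "choices": [
        {"index": 0, "message": {"role": "user", "content": "如何制作个人网站?"}},
        {"index": 1, "message": {"role": "assistant", "content": "制作个人网站可以通过以下步骤进行: ..."}},
    ]
}
body = next_request(history, response, "我还是一个新手,有没有比较快的方案?")
print([m["role"] for m in body["messages"]])  # ['user', 'assistant', 'user']
```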

Chat Interface Parameter Explanation

  • max_tokens: The maximum number of new tokens to generate.

  • temperature: The sampling temperature, between 0 and 2. A higher value such as 0.8 makes the output more random, while a lower value such as 0.2 makes it more deterministic. A higher temperature flattens the token probability distribution, so sampling picks lower-probability tokens more often.

  • num_beams: The number of beams used when decoding with beam search. With num_beams=1 and sampling disabled, decoding reduces to greedy decoding.

  • top_k: In random sampling, only the top_k highest-probability tokens are kept as candidates, and the next token is sampled from them.

  • top_p: In random sampling, the candidates are the highest-probability tokens, taken in descending order until their cumulative probability reaches top_p; all remaining tokens are filtered out. Lower values shrink the candidate set and reduce randomness. For example, if top_p is set to 0.6 and the probabilities of the top 5 tokens are [0.23, 0.20, 0.18, 0.11, 0.10], the cumulative probability of the first three tokens is 0.61, so the fourth and later tokens are filtered out, leaving only the first three as candidates.

  • repetition_penalty: Repetition penalty. For more details, refer to this article: https://arxiv.org/pdf/1909.05858.pdf.

  • do_sample: Whether to enable random sampling. Default is true.

  • stream: Whether to return the response as an OpenAI-format stream. Default is false. When set to true, data is streamed back following OpenAI's format, so the server can act as a backend for applications built on the ChatGPT API.
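When stream is true, a client has to reassemble the reply from server-sent events. The sketch below assumes the usual OpenAI streaming conventions: each event is a `data:` line holding a JSON chunk with the text in `choices[].delta.content`, and the stream ends with `data: [DONE]`. This page does not confirm that the demo server follows every detail, so treat the chunk layout as an assumption. The function name `iter_stream_text` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed chunk layout)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines standing in for a real streamed response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally.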

Text Embedding

Text embeddings have many uses, including but not limited to answering questions over large documents, summarizing the contents of a book, and retrieving the stored memories closest to the current user input for a large language model.

Request command:

curl http://localhost:19327/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "今天天气真不错"
  }'

JSON response:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                (vector values)
                ....,
            ],
            "index": 0
        }
    ],
    "model": "chinese-mixtral"
}
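A common next step is to compare two embeddings by cosine similarity, for example to find the stored text closest to a query. The sketch below assumes the response layout shown above, with vectors under `data[].embedding`; the helper names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def extract_embeddings(response: dict) -> List[List[float]]:
    """Return the embedding vectors, ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors standing in for real embeddings.
response = {
    "object": "list",
    "data": [{"object": "embedding", "embedding": [1.0, 0.0], "index": 0}],
}
query = extract_embeddings(response)[0]
print(cosine_similarity(query, [1.0, 0.0]))  # 1.0
print(cosine_similarity(query, [0.0, 1.0]))  # 0.0
```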