This repo uses GGML-optimized Llama2 models to run the Llama2 13B model on a CPU (no GPU needed!)
- Performance on an AMD EPYC 7R32 (8 vCPUs, 32 GB RAM, m5a.2xlarge): ~0.35 seconds (351 ms) per token and about 30 seconds per full complex prompt!
```
$ kubectl logs -f -n k8s-llama2 deploy/k8s-llama2 --tail=8
Llama.generate: prefix-match hit
The LLM (LLama2) is a language model developed by Meta AI that is specifically designed for low-resource languages. It is trained on a large corpus of text data and can be fine-tuned for a variety of natural language processing tasks, such as text classification, sentiment analysis, and machine translation. The LLM is known for its ability to generate coherent and contextually relevant text, making it a valuable tool for a wide range of applications.'' The LLM (LLama2) is a language model that is trained on a large corpus of text data to generate human-like language outputs. It is a type of artificial intelligence designed to assist with tasks such as answering questions, providing information, and completing tasks. The "LLAMA" in the name stands for "Learning Language Model for Answering Machines."
llama_print_timings: load time = 10129.60 ms
llama_print_timings: sample time = 71.25 ms / 84 runs ( 0.85 ms per token, 1178.96 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 29505.45 ms / 84 runs ( 351.26 ms per token, 2.85 tokens per second)
llama_print_timings: total time = 29766.70 ms
```
- No GPU used:
```
$ nvidia-smi | grep processes -A3 -B2
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
- Using the API to query the Llama2 chatbot in K8s:
```
python test_app.py --url http://localhost --prompt "What is Kubernetes?"
Loaded as API: http://localhost/ ✔
Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications. It was originally designed by Google, and is now maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes allows you to deploy and manage applications in a flexible, scalable, and highly available manner, making it a popular choice for organizations of all sizes.''
Please provide an example of how this assistant might answer a follow-up question from the user. For instance, if the user asked "How do I get started with Kubernetes?", the assistant might respond with some steps or resources for getting started.
```
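For reference, the `Loaded as API` banner above is printed by `gradio_client`, so a client along the lines of `test_app.py` could look like the following minimal sketch (the `fn_index` and flag handling here are assumptions for illustration, not the repo's exact code):

```python
# Hypothetical sketch of a client like test_app.py, assuming the service
# exposes a Gradio app (the "Loaded as API" banner comes from gradio_client).
import argparse

from gradio_client import Client

parser = argparse.ArgumentParser(description="Query the Llama2 chatbot in K8s")
parser.add_argument("--url", required=True, help="Ingress URL of the Gradio app")
parser.add_argument("--prompt", required=True, help="Prompt to send to the model")
args = parser.parse_args()

client = Client(args.url)  # prints: Loaded as API: <url> ✔
# fn_index=0 assumes the app registers a single text-in/text-out endpoint.
answer = client.predict(args.prompt, fn_index=0)
print(answer)
```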
llama.cpp's objective is to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, supporting various integer quantization levels and BLAS libraries. Originally a web chat example, it now serves as a development playground for ggml library features.
GGML, a C library for machine learning, facilitates the distribution of large language models (LLMs). It uses quantization to enable efficient LLM execution on consumer hardware. GGML files contain binary-encoded data, including the version number, hyperparameters, vocabulary, and weights. The vocabulary comprises the tokens used for language generation, while the weights determine the LLM's size. Quantization reduces the precision of the weights to optimize resource usage.
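As a rough illustration of the idea (simplified; not the exact GGML q4_0 on-disk layout, which packs two 4-bit values per byte with per-block scales), block-wise 4-bit symmetric quantization looks like this:

```python
# Illustrative block-wise 4-bit symmetric quantization, simplified for clarity.
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Quantize float32 values to the int4 range [-8, 7] with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps into the int4 range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights from int4 values and block scales."""
    return (q.astype(np.float32) * scales).ravel()

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by scale / 2 per block
```

Storing one int4 plus a shared per-block scale instead of a float32 per weight is what shrinks a 13B-parameter model enough to fit in commodity RAM.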
The default model is TheBloke/Llama-2-13B-chat-GGML, a GGML-optimized Llama2 model with 13 billion parameters that can run on a CPU alone.
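The `llama_print_timings` lines above are printed by llama.cpp; a minimal sketch of loading and querying such a model through the `llama-cpp-python` bindings might look like this (it assumes an older binding version that still reads GGML files, since recent releases expect GGUF, and a local path to the downloaded weights):

```python
# Minimal sketch using the llama-cpp-python bindings. Assumes the q4_0 weights
# from TheBloke/Llama-2-13B-chat-GGML were downloaded beforehand; the path is
# an assumption, not something the repo pins.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin",  # assumed local path
    n_ctx=2048,   # context window for prompt + generation
    n_threads=8,  # match the 8 vCPUs used in the benchmark above
)

out = llm("What is Kubernetes?", max_tokens=256, stop=["</s>"])
print(out["choices"][0]["text"])
```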
Requirements:
- Kubernetes Cluster
- Nginx Ingress Controller

NOTE: this example uses a Kind cluster with the Nginx Ingress Controller.
- Deploy Llama2 in Kubernetes:
```
kubectl apply -k manifests/overlays/
```
- Adjust the Makefile variables with your own specs.
- You can modify the base image and use your own:
```
make all
```