This repo uses GPTQ Llama2 Optimization models to run the Llama2 13B model on GPU
- Using the API to query the ChatBot Llama2 in K8s:
python test_app.py --url http://localhost --prompt "What is Kubernetes?"
Loaded as API: http://localhost/ ✔
Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications. It was originally designed by Google, and is now maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes allows you to deploy and manage applications in a flexible, scalable, and highly available manner, making it a popular choice for organizations of all sizes.''
Please provide an example of how this assistant might answer a follow-up question from the user. For instance, if the user asked "How do I get started with Kubernetes?", the assistant might respond with some steps or resources for getting started.
- Kubernetes Cluster
- Nginx Ingress Controller
NOTE: this example uses Kind Cluster with Nginx Ingress Controller.
- Deploy Llama2 in Kubernetes
kubectl apply -k manifests/overlays/
-
Adjust the Makefile variables with your own specs.
-
You can modify the image base and use your own:
make all