# Llama 2 🦙🦙 Demonstration

This is a demonstration of the Llama 2 🦙🦙 LLM. It is a simple notebook that shows how to load Llama 2 with Hugging Face's 🤗 Transformers library and run a simple inference task on it.

In the last section, we show how to use the Llama 2 🦙🦙 LLM to generate text, using Gradio to create a simple web interface for interacting with the model.

## Setup

To run this notebook, you need a GPU. You can use Google Colab Pro if you don't have one. You can also run it on your own machine, but you will probably not have enough memory if you try to load the full Llama 2 70B model.

We will be using a model that was pre-quantized to 4-bit weights with GPTQ. TheBloke has released these pre-quantized models on Hugging Face, and they are available for download on the [Llama 2 🦙🦙 13B Model Card](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) and [Llama 2 🦙🦙 70B Model Card](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ). Their advantage is that they are much smaller than the full-precision models: the 13B model can be loaded on a GPU with 16GB of memory, while the 70B model still needs a larger GPU such as an Nvidia A100. They are also faster to download and load.
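
As a rough illustration of why quantization matters, the sketch below estimates the memory taken by the model weights alone at different precisions. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores the KV cache, activations, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = n_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare the three Llama 2 sizes at 16-bit, 8-bit and 4-bit precision.
for size in (7, 13, 70):
    estimates = ", ".join(
        f"~{weight_memory_gb(size, bits):.1f} GB at {bits}-bit"
        for bits in (16, 8, 4)
    )
    print(f"Llama 2 {size}B: {estimates}")
```

Treat these numbers as a lower bound when choosing a GPU: the runtime needs extra headroom on top of the weights.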

## Llama 2 Prompt

The Llama 2 chat team has defined a special prompt that should be used when generating text. The prompt is:

```
    SYSTEM: You are a helpful, respectful and honest assistant.
    Always answer as helpfully as possible, while being safe.
    Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
    Please ensure that your responses are socially unbiased and positive in nature.
    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
    If you don't know the answer to a question, please don't share false information.
    USER: {user_input}
    ASSISTANT:
```

We will be using a similar prompt in this notebook.
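
A minimal sketch of how the template can be filled in programmatically is shown below. The names `build_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical, and the system prompt is shortened here for readability; it is not the full text above.

```python
# Shortened system prompt, used for illustration only.
DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)


def build_prompt(user_input: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Fill the SYSTEM / USER / ASSISTANT template with the user's question."""
    return (
        f"SYSTEM: {system_prompt}\n"
        f"USER: {user_input.strip()}\n"
        "ASSISTANT:"
    )


print(build_prompt("What is the capital of France?"))
```

Ending the string with `ASSISTANT:` leaves the model to continue the text with its answer.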

## Llama 2 Hardware Requirements for Inference

Deploying an LLM for inference almost always requires a GPU, and Llama 2 is no different. The hardware recommendations for these models are:

* For 7B models, we advise you to select "GPU [medium] - 1x Nvidia A10G".
* For 13B models, we advise you to select "GPU [xlarge] - 1x Nvidia A100".
* For 70B models, we advise you to select "GPU [2xlarge] - 2x Nvidia A100" with bitsandbytes quantization enabled or "GPU [4xlarge] - 4x Nvidia A100".

First, we need to install the Transformers library and the other required packages. Run the command below to install them.

In [None]:
%%capture
!pip3 install torch transformers gradio auto_gptq

Now, let's check if we have a GPU available.

**NOTE:** If you are using Colab Pro, you will need to select a GPU (**Nvidia A100**) by going to "Runtime" -> "Change runtime type" -> "Hardware accelerator" -> "GPU".

In [None]:
import torch

if torch.cuda.is_available():
    print(f'✅ GPU available - {torch.cuda.get_device_name(0)}')
else:
    print('❌ No GPU available')
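
Since available memory is usually the limiting factor, it can also help to see how much the detected GPU has. The sketch below is an optional addition, not part of the original notebook; the helper name `gpu_memory_gb` is hypothetical, and it returns `None` when PyTorch or a CUDA device is not available.

```python
def gpu_memory_gb(device_index: int = 0):
    """Return the total memory of a CUDA device in GB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # total_memory is reported in bytes
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1e9


memory = gpu_memory_gb()
if memory is None:
    print("No CUDA GPU detected")
else:
    print(f"GPU 0 has ~{memory:.1f} GB of memory")
```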

Loading a Llama 2 model is as simple as loading any other model from the transformers library. We just need to specify the model name and the tokenizer name.

We will be using the `auto_gptq` package because this model was pre-quantized with it. If you want to use the full-precision model, you can use the original Llama 2 model that was released by the authors or the Hugging Face team.

**NOTE:** You can swap the commented lines below if you want to use the 13B model instead of the 70B model.

In [None]:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

# Silence the spurious transformers error printed when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

# Llama 70B
model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"
model_basename = "gptq_model-4bit--1g"

# Llama 13B
# model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
# model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        # revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        inject_fused_attention=False, # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

Now, we are just going to define a Gradio interface to interact with the model. We will use the Llama 2 prompt defined above.

In [None]:
import gradio as gr

# Create a text-generation pipeline for inference.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,  # enable sampling explicitly so that temperature is applied
    temperature=0.5,
    # top_p=0.95,
    # repetition_penalty=1.15
)

# Default question shown in the interface
sample_text = "Explain how CFDs and spread betting work."


def generate_text(prompt):
    # Return an empty string for each of the two outputs when no question is given
    if not prompt:
        return "", ""

    prompt_template = (
        "SYSTEM: You are a helpful, respectful and honest assistant. "
        "Always answer as helpfully as possible, while being safe. "
        "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. "
        "Please ensure that your responses are socially unbiased and positive in nature. "
        "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. "
        "If you don't know the answer to a question, please don't share false information.\n"
        f"USER: {prompt}\n"
        "ASSISTANT:"
    )

    # The pipeline returns the prompt followed by the completion,
    # so slice the prompt off to keep only the generated answer.
    results = pipe(prompt_template)[0]['generated_text']
    return prompt_template, results[len(prompt_template):]


with gr.Blocks() as demo:
    gr.Markdown(
        """# Try out Llama 2! 🦙🦙🦙

        This is a demo of the Llama 2 chatbot."""
    )

    with gr.Column():
        # Named `question` to avoid shadowing the built-in `input`
        question = gr.Textbox(value=sample_text,
                              label="Enter your question here.",
                              placeholder="Write your question here.")
        with gr.Row():
            prompt = gr.Textbox(lines=10, label="Prompt")
            output = gr.Textbox(lines=10, label="Generated Text")

    btn_submit = gr.Button(value="Generate")
    btn_submit.click(generate_text, inputs=question, outputs=[prompt, output])


demo.launch()
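
The slicing in `generate_text` assumes the pipeline output begins with the prompt verbatim. A slightly more defensive variant is sketched below; `split_completion` is a hypothetical helper, not part of Transformers or Gradio.

```python
def split_completion(prompt: str, generated_text: str) -> str:
    """Return only the model's answer from the pipeline's generated text."""
    if generated_text.startswith(prompt):
        # Usual case: the output echoes the prompt, so drop it.
        return generated_text[len(prompt):].strip()
    # Fallback: the prompt was not echoed, so keep the whole text.
    return generated_text.strip()


prompt = "USER: Hi\nASSISTANT:"
print(split_completion(prompt, prompt + " Hello! How can I help?"))
```

As far as I know, the text-generation pipeline also accepts `return_full_text=False` to return only the completion, which may make this step unnecessary; check it against your installed Transformers version.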


I hope you have enjoyed this demo of Llama 2. If you have any questions, please feel free to reach out to me on LinkedIn.