#**Prompt Engineering**

In this lab we are going to gain some practice with using LLM assistants and with prompt engineering.

We will do this with Llama 2, an open-source LLM, and the llama.cpp interface (through the llama-cpp-python bindings). You will write prompts to carry out several NLP tasks studied in the module.

It is recommended you try to connect to a T4 GPU on Colab as this will speed up things considerably.

The part of the lab dedicated to setting up the interface is based on the Hugging Face discussion at https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/discussions/3.

#Llama 2

Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters. The fine-tuned chat variants are optimized for dialogue use cases. In this lab we will be using [Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat).

#llama.cpp

`llama.cpp` can be used to run LLaMA models with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, supporting various integer quantization methods and BLAS libraries.

Task 0: Please read the description of the library at https://llama-cpp-python.readthedocs.io/en/latest/

# Setting up llama cpp python

The library works the same way on a CPU, but inference can take about three times longer than on a GPU.

If you want to use only the CPU, you can replace the content of the cell below with the following lines.
```
# CPU llama-cpp-python
!pip install llama-cpp-python==0.1.78
```

In [1]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --force-reinstall --upgrade --no-cache-dir --verbose

Using pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python==0.1.78
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 15.7 MB/s eta 0:00:00
  Installing build dependencies ...
  Running command pip subprocess to install build dependencies
  Collecting setuptools>=42
    Downloading setuptools-79.0.1-py3-none-any.whl (1.3 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 24.7 MB/s eta 0:00:00
  Collecting scikit-build>=0.13
    Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.6/85.6 KB 28.1 MB/s eta 0:00:00
  Collecting cmake>=3.18
    Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2

In [2]:
# To download the models
!pip install huggingface_hub

Defaulting to user installation because normal site-packages is not writeable
Collecting huggingface_hub
  Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 481.4/481.4 KB 13.4 MB/s eta 0:00:00
Collecting tqdm>=4.42.1
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 20.1 MB/s eta 0:00:00
Installing collected packages: tqdm, huggingface_hub
Successfully installed huggingface_hub-0.30.2 tqdm-4.67.1


# Choosing the Llama2 version


Next, we need to specify which version of Llama 2 to use. On Colab with a T4 GPU, we can run models of up to about 20B parameters with all optimizations enabled, although these optimizations may degrade the quality of the model's output. The library can also run GGML models on a CPU.



In this lab, we will use [Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat).

![asd](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc24dac6d-6b5e-4b5f-938c-05951c938a9e_1085x543.png)






#  Quantized Models from the Hugging Face Community

The Hugging Face community provides quantized models, which allow us to efficiently and effectively utilize the model on the T4 GPU. It is important to consult reliable sources before using any model.

There are several variations available, but the ones that interest us are based on the GGML library.

We can see the different variations that Llama-2-13B-GGML has [here](https://huggingface.co/models?search=llama%202%20ggml).



In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML).

The 'q5_1' suffix in the model file name indicates the quantization method used. As a rule of thumb, 'q8' yields better responses at the cost of higher memory usage (slower), whereas 'q2' may generate poorer responses but requires less RAM (faster).

There are other quantization methods available, and you can read about them in the [model card](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML)
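
To make the memory trade-off more concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are assumptions derived from the GGML block layouts (for example, a q5_1 block stores 32 weights in 24 bytes, i.e. 6 bits per weight), the parameter count is rounded to 13 billion, and the estimate ignores the context cache and any runtime overhead.

```python
# Approximate bits per weight for some GGML quantization types.
# These values are assumptions based on the block layouts (32 weights per
# block plus per-block scale/offset metadata); k-quants such as q2_K differ.
APPROX_BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_1": 5.0,
    "q5_0": 5.5,
    "q5_1": 6.0,
    "q8_0": 8.5,
}


def approx_model_size_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights in gigabytes (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


# Compare the estimated footprint of a 13B-parameter model across methods.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{approx_model_size_gb(13e9, quant):.1f} GB")
```

Under these assumptions, q5_1 comes out a little under 10 GB for a 13B model, while q8_0 needs noticeably more; the model card lists the actual file sizes.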

In [3]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format

We download the model

In [4]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

  from .autonotebook import tqdm as notebook_tqdm


# Inference with llama-cpp-python

Setting up the interface

In [5]:
# GPU
from llama_cpp import Llama
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=43, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=4096, # Context window
)

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10, compute capability 8.6
llama.cpp: loading model from /home/ubuntu/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: mode


To run on the CPU instead:
```
# CPU
from llama_cpp import Llama

lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    )
```



In [6]:
# See the number of layers in GPU
lcpp_llm.params.n_gpu_layers

43

# First example of prompt use: generating code

A zero-shot prompt asking Llama 2 to write linear regression code in Python.

In [7]:
prompt = "Write a linear regression in python"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
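
The `SYSTEM:` / `USER:` / `ASSISTANT:` layout above is a generic chat-style template. The Llama 2 chat models were fine-tuned with a different layout that uses `[INST]` and `<<SYS>>` markers, so it can be worth comparing the two. The helper below is a minimal sketch: `build_llama2_prompt` is a hypothetical name, and the exact whitespace of the template is an assumption that should be checked against the model card.

```python
def build_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Wrap a single-turn request in Llama 2 chat markers (hypothetical helper)."""
    if system_message:
        # The system message goes inside <<SYS>> ... <</SYS>>, within the [INST] block.
        return (
            f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


example_prompt = build_llama2_prompt(
    "Write a linear regression in python",
    system_message="You are a helpful, respectful and honest assistant.",
)
print(example_prompt)
```

The resulting string could be passed as `prompt=` in place of `prompt_template` to see whether the responses change.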

Generating response

If you only use the CPU, generating the response can take a long time. You can reduce `max_tokens` to get a faster (but shorter) response.

In [8]:
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a linear regression in python

ASSISTANT:

To write a linear regression in Python, you can use the scikit-learn library. Here is an example of how to do this:
```
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load your dataset into a Pandas DataFrame
df = pd.read_csv('your_data.csv')

# Create a linear regression object and fit the data
reg = LinearRegression().fit(df[['x1', 'x2']], df['y'])

# Print the coefficients
print(reg.coef_)

# Print the R-squared value
print(reg.score(df[['x1', 'x2']], df['y']))
```
This code will load your dataset into a Pandas DataFrame, create a linear regression object and fit the data using the `fit()` method. It will then print the coefficients of the linear regression and the R-squared value, which measures the goodness of fit of the model.

Please note that this is just an example code, you need to adjust it according to your


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   129.56 ms /   256 runs   (    0.51 ms per token,  1975.90 tokens per second)
llama_print_timings: prompt eval time =  6322.76 ms /    39 tokens (  162.12 ms per token,     6.17 tokens per second)
llama_print_timings:        eval time =  6697.40 ms /   255 runs   (   26.26 ms per token,    38.07 tokens per second)
llama_print_timings:       total time = 13540.48 ms
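
Because `echo=True` returns the prompt together with the completion, and `max_tokens=256` can cut the answer off (as happened above), it can help to separate the two and to check why generation stopped. The sketch below assumes the response is the OpenAI-style completion dictionary returned by llama-cpp-python, with `text` and `finish_reason` fields under `choices`. `split_completion` is a hypothetical helper, and the dictionary at the bottom is a hand-made stand-in rather than real model output.

```python
def split_completion(response: dict, prompt: str) -> tuple[str, bool]:
    """Return (completion_without_prompt, was_truncated) for a completion dict."""
    choice = response["choices"][0]
    text = choice["text"]
    # With echo=True the returned text starts with the prompt; drop it.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # "length" means max_tokens was reached before a stop condition.
    truncated = choice.get("finish_reason") == "length"
    return text.strip(), truncated


# Hand-made stand-in for a response (not real model output).
fake_prompt = "USER: Say hi\n\nASSISTANT:\n"
fake_response = {
    "choices": [{"text": fake_prompt + "Hello there", "finish_reason": "length"}]
}
completion, truncated = split_completion(fake_response, fake_prompt)
print(completion, truncated)
```

In the notebook this would be called as `split_completion(response, prompt_template)`; if the second value is `True`, a larger `max_tokens` is worth trying.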


# Second example (also zero shot): a natural language generation task

A zero-shot prompt asking Llama 2 to write a story.




In [9]:
prompt_nlg = "Write a story about a bear called Paddington"
prompt_template_nlg=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt_nlg}

ASSISTANT:
'''

In [10]:
response_nlg = lcpp_llm(
    prompt=prompt_template_nlg,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_nlg["choices"][0]["text"])

Llama.generate: prefix-match hit


SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a story about a bear called Paddington

ASSISTANT:
Once upon a time in Peru, there was a little bear named Paddington who lived with his Aunt Lucy in the Andes Mountains. He loved to eat marmalade and dreamt of going on an adventure to find more of his favorite food. One day, he stowed away on a ship bound for England, where he arrived at Paddington Station with nothing but a suitcase full of marmalade. There, he was taken in by the Brown family who lived nearby. They were kind and welcoming, and soon became like a family to Paddington. With their help, he explored the city, made new friends, and even found a job as a waiter at a local hotel. Despite encountering many challenges along the way, Paddington always remained true to his nature - kind, polite, and with an insatiable love for marmalade. His story has become legendary in England, inspiring generations of children (and adults!) 


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   112.64 ms /   223 runs   (    0.51 ms per token,  1979.78 tokens per second)
llama_print_timings: prompt eval time =   287.51 ms /    15 tokens (   19.17 ms per token,    52.17 tokens per second)
llama_print_timings:        eval time =  5815.31 ms /   222 runs   (   26.20 ms per token,    38.18 tokens per second)
llama_print_timings:       total time =  6549.12 ms


# Task 1: natural language generation with zero-shot prompting  



Task 1: Write a prompt to get Llama2 to generate a recipe for spaghetti al pomodoro. Experiment with different prompts.

Write your prompt below, and call it ```prompt_task1```

In [15]:
prompt_task1 = 'Give a recipe for a classic spaghetti al pomodoro like my nona made'
prompt_template_task1 = f'''

USER: {prompt_task1}

ASSISTANT:
'''

In [16]:
response_task1 = lcpp_llm(
    prompt=prompt_template_task1,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task1["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Give a recipe for a classic spaghetti al pomodoro like my nona made

ASSISTANT:
Certo! Here's a traditional Italian recipe for spaghetti all'alpomodoro, just like Nonna used to make. This dish is a simple yet flavorful combination of fresh tomatoes, garlic, basil, and olive oil that will transport your taste buds straight to the rolling hills of Tuscany!

Ingredients:

* 12 oz (340g) spaghetti
* 2 large ripe tomatoes, peeled and chopped
* 3 cloves garlic, minced
* 1/4 cup extra-virgin olive oil
* Salt to taste
* Fresh basil leaves, chopped (optional)

Instructions:

1. Bring a large pot of salted water to a boil and cook the spaghetti according to package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.
2. In a blender or food processor, combine the tomatoes, garlic, olive oil, salt, and a pinch of black pepper. Blend until smooth and creamy, stopping to scra



llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   129.64 ms /   256 runs   (    0.51 ms per token,  1974.76 tokens per second)
llama_print_timings: prompt eval time =   297.67 ms /    24 tokens (   12.40 ms per token,    80.63 tokens per second)
llama_print_timings:        eval time =  6697.81 ms /   255 runs   (   26.27 ms per token,    38.07 tokens per second)
llama_print_timings:       total time =  7518.77 ms


# Task 2: summarization with zero-shot prompting  



Task 2: Write a prompt to get Llama 2 to produce a summary of the following Wikipedia article on LLaMA.

In [17]:
task2_article = f'''
LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters.

LLaMA-2

On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA.
Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.[4]
The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models.[5]
The accompanying preprint[5] also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets.

LLaMA-2 includes both foundational models and models fine-tuned for dialog, called LLaMA-2 Chat.
In further departure from LLaMA-1, all models are released with weights, and are free for many commercial use cases.
However, due to some remaining restrictions, the description of LLaMA as open source has been disputed by the Open Source Initiative
(known for maintaining the Open Source Definition).[6]

Architecture

LLaMA uses the transformer architecture, the standard architecture for language modeling since 2018.

There are minor architectural differences. Compared to GPT-3, LLaMA

- uses SwiGLU[7] activation function instead of GeLU;
- uses rotary positional embeddings[8] instead of absolute positional embedding;
- uses root-mean-squared layer-normalization[9] instead of standard layer-normalization.[10]
- increases context length from 2K (Llama 1) tokens to 4K (Llama 2) tokens between.

Training datasets

LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters, reasoning that the dominating cost for LLMs is from doing inference on the trained model rather than the computational cost of the training process.

LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including:[1]

-     Webpages scraped by CommonCrawl
-     Open source repositories of source code from GitHub
-     Wikipedia in 20 different languages
-     Public domain books from Project Gutenberg
-     The LaTeX source code for scientific papers uploaded to ArXiv
-     Questions and answers from Stack Exchange websites

Llama 2 foundational models were trained on a data set with 2 trillion tokens. This data set was curated to remove Web sites that often disclose personal data of people. It also upsamples sources considered trustworthy.[5] Llama 2 - Chat was additionally fine-tuned on 27,540 prompt-response pairs created for this project, which performed better than larger but lower-quality third-party datasets. For AI alignment, reinforcement learning with human feedback (RLHF) was used with a combination of 1,418,091 Meta examples and seven smaller datasets. The average dialog depth was 3.9 in the Meta examples, 3.0 for Anthropic Helpful and Anthropic Harmless sets, and 1.0 for five other sets, including OpenAI Summarize, StackExchange, etc.
'''

Write your prompt below, and call it ```prompt_task2```

In [24]:
prompt_task2 = 'can you provide me with an extractive summary of the following article in 3-5 sentences?' + task2_article 

prompt_template_task2 = f'''

USER: {prompt_task2}

ASSISTANT:
'''

In [25]:
response_task2 = lcpp_llm(
    prompt=prompt_template_task2,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task2["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: can you provide me with an extractive summary of the following article in 3-5 sentences?
LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =    85.54 ms /   170 runs   (    0.50 ms per token,  1987.47 tokens per second)
llama_print_timings: prompt eval time =  1683.61 ms /  1048 tokens (    1.61 ms per token,   622.47 tokens per second)
llama_print_timings:        eval time =  5469.97 ms /   169 runs   (   32.37 ms per token,    30.90 tokens per second)
llama_print_timings:       total time =  7500.62 ms
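
The printed text above starts with the article itself, because `echo=True` repeats the (long) prompt. One variation worth trying is to put the article between explicit delimiters and to state the length constraint after it, so that the request is the last thing the model reads. This is a sketch of one possible prompt layout, not a guaranteed improvement; `build_summary_prompt` and the `<article>` delimiters are hypothetical choices.

```python
def build_summary_prompt(article: str, min_sentences: int = 3, max_sentences: int = 5) -> str:
    """Build a summarization request with the article clearly delimited."""
    return (
        "Summarize the article between the <article> tags.\n"
        f"<article>\n{article.strip()}\n</article>\n"
        f"Write the summary in {min_sentences}-{max_sentences} sentences "
        "and do not repeat the article."
    )


# Short stand-in text; in the notebook, pass task2_article instead.
demo_article = "LLaMA is a family of large language models released by Meta AI."
prompt_demo = build_summary_prompt(demo_article)
print(prompt_demo)
```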


# Task 3: machine translation with zero-shot prompting  



Task 3: Write a prompt to get Llama 2 to translate the Wikipedia article above into French if you are a native English speaker, otherwise into your native language.

Write your prompt below, and call it ```prompt_task3```

In [26]:
prompt_task3 = 'Can you translate this article to Persian' + task2_article
prompt_template_task3 = f'''

USER: {prompt_task3}

ASSISTANT:
'''

In [27]:
response_task3 = lcpp_llm(
    prompt=prompt_template_task3,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task3["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Can you translate this article to Persian
LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters.

LLaMA-2

On July 18, 2023, in partnership wi


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   130.07 ms /   256 runs   (    0.51 ms per token,  1968.20 tokens per second)
llama_print_timings: prompt eval time =  1719.13 ms /  1042 tokens (    1.65 ms per token,   606.12 tokens per second)
llama_print_timings:        eval time =  8300.77 ms /   255 runs   (   32.55 ms per token,    30.72 tokens per second)
llama_print_timings:       total time = 10553.39 ms
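
With `max_tokens=256`, a translation of the whole article cannot fit in a single response. One option is to translate the article paragraph by paragraph. The sketch below only prepares the per-paragraph prompts; `make_translation_prompts` is a hypothetical helper, and splitting on blank lines is an assumption about how the article is laid out.

```python
def make_translation_prompts(article: str, target_language: str) -> list[str]:
    """Split an article on blank lines and build one translation prompt per paragraph."""
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    return [
        f"Translate the following text into {target_language}. "
        f"Reply with the translation only.\n\n{paragraph}"
        for paragraph in paragraphs
    ]


# Short stand-in text; in the notebook, pass task2_article instead.
demo_article = "LLaMA\n\nLLaMA is a family of models.\n\nLlama 2 was released in July 2023."
translation_prompts = make_translation_prompts(demo_article, "French")
print(len(translation_prompts))
```

Each prompt could then be wrapped in the `USER:` / `ASSISTANT:` template and sent to `lcpp_llm` in a loop, collecting the translated paragraphs in order.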


# Task 4: named entity recognition, one and few-shot prompting, JSON output



Task 4: Write a prompt to get Llama 2 to tag named entities in the following sentence as 'ORG' if organization, 'DATE' if date, 'NUM' if number, and 'MODEL' if an AI model, and to output the result in JSON format. Use a few examples to show Llama 2 what you want.

In [28]:
task4_sentence = "On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters."

Write your prompt below, and call it ```prompt_task4```

In [33]:
prompt_task4 = 'extreact named entities in this sentence and tag them with ORG if organization, DATE if date, NUM if number and MODEL if AI model and format the result in a json format' + task4_sentence + "you van use this example: sentence = The first version of DALL-E was announced in January 2021 by OpenAI and has had two other releases , output: [{'entitiy':DALL-E, 'label': MODEL},{'entity': January 2021, 'label': DATE},{'entity': OPenAI, 'label':ORG},{'entity':2,'label':NUM}]"
prompt_template_task4 = f'''

USER: {prompt_task4}

ASSISTANT:
'''

In [35]:
prompt_task4 = 'extreact named entities in this sentence and tag them with ORG if organization, DATE if date, NUM if number and MODEL if AI model and format the result in a json format' + task4_sentence + "you van use this example: sentence1 = The first version of DALL-E was announced in January 2021 by OpenAI and has had two other releases , output1: [{'entitiy':DALL-E, 'label': MODEL},{'entity': January 2021, 'label': DATE},{'entity': OPenAI, 'label':ORG},{'entity':2,'label':NUM}], sentence2= Anthropic developed Claude-2 in 2023 with 100k context length, output2 = [{'entitiy':Claude-2, 'label': MODEL},{'entity': 2023, 'label': DATE},{'entity': Anthropic, 'label':ORG},{'entity':100k,'label':NUM}], sentence3 = GPT-4 is a multimodal large language model trained by OpenAI and the fourth iterationof the generative pretrained transformer models in March 2023, output3 =[{'entitiy':GPT-4, 'label': MODEL},{'entity': March 2023, 'label': DATE},{'entity': OPenAI, 'label':ORG},{'entity':fourth,'label':NUM}]"
prompt_template_task4 = f'''

USER: {prompt_task4}

ASSISTANT:
'''

In [36]:
response_task4 = lcpp_llm(
    prompt=prompt_template_task4,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task4["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: extreact named entities in this sentence and tag them with ORG if organization, DATE if date, NUM if number and MODEL if AI model and format the result in a json formatOn July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.you van use this example: sentence1 = The first version of DALL-E was announced in January 2021 by OpenAI and has had two other releases , output1: [{'entitiy':DALL-E, 'label': MODEL},{'entity': January 2021, 'label': DATE},{'entity': OPenAI, 'label':ORG},{'entity':2,'label':NUM}], sentence2= Anthropic developed Claude-2 in 2023 with 100k context length, output2 = [{'entitiy':Claude-2, 'label': MODEL},{'entity': 2023, 'label': DATE},{'entity': Anthropic, 'label':ORG},{'entity':100k,'label':NUM}], sentence3 = GPT-4 is a multimodal large language model trained by OpenAI and the fourth iterationof the generative pretrained tran


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   130.85 ms /   256 runs   (    0.51 ms per token,  1956.48 tokens per second)
llama_print_timings: prompt eval time =   493.15 ms /   282 tokens (    1.75 ms per token,   571.83 tokens per second)
llama_print_timings:        eval time =  7266.41 ms /   255 runs   (   28.50 ms per token,    35.09 tokens per second)
llama_print_timings:       total time =  8289.82 ms


How did Llama2 do at the task? Try to experiment with both one-shot and few-shot prompts.
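
When experimenting, note that the examples in the prompt above are not strictly valid JSON (single quotes, unquoted values), which may make it harder for the model to produce parseable output. One possible way to keep one-shot and few-shot variants consistent is to build the prompt from a list of examples and to serialize the expected outputs with `json.dumps`. This is a sketch: `build_ner_prompt` and the example annotations are illustrative assumptions, not a reference solution.

```python
import json

LABELS = {"ORG": "organization", "DATE": "date", "NUM": "number", "MODEL": "AI model"}


def build_ner_prompt(sentence: str, examples: list[tuple[str, list[dict]]]) -> str:
    """Build a one-shot or few-shot NER prompt with valid JSON examples."""
    label_text = ", ".join(f"'{tag}' for {meaning}" for tag, meaning in LABELS.items())
    parts = [
        f"Extract the named entities from the sentence and tag each one as {label_text}. "
        "Answer with a JSON list of objects with the keys 'entity' and 'label'."
    ]
    for example_sentence, entities in examples:
        parts.append(f"Sentence: {example_sentence}\nOutput: {json.dumps(entities)}")
    # End with the unanswered case so the model continues with the JSON output.
    parts.append(f"Sentence: {sentence}\nOutput:")
    return "\n\n".join(parts)


examples = [
    (
        "Anthropic developed Claude-2 in 2023 with 100k context length.",
        [
            {"entity": "Anthropic", "label": "ORG"},
            {"entity": "Claude-2", "label": "MODEL"},
            {"entity": "2023", "label": "DATE"},
            {"entity": "100k", "label": "NUM"},
        ],
    ),
]
one_shot_prompt = build_ner_prompt("Meta announced LLaMA-2 on July 18, 2023.", examples)
print(one_shot_prompt)
```

Passing a single example gives a one-shot prompt; appending more `(sentence, entities)` pairs to `examples` gives a few-shot prompt with the same layout.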

# Task 5: dialogue act tagging, one and few-shot prompting



Task 5: Write a prompt to get Llama 2 to tag dialogue acts in the following conversation, using your favourite dialogue act tagset, and to output the result in JSON format. Use a few examples to show Llama 2 what you want.

In [41]:
task5_conversation = f'''
A: . . . I need to travel in May.
B: And, what day in May did you want to travel?
A: OK uh I need to be there for a meeting that’s from the 12th to the 15th.
B: And you’re flying into what city?
A: Seattle.
B: And what time would you like to leave Pittsburgh?
A: Uh hmm I don’t think there’s many options for non-stop.
B: Right. There’s three non-stops today.
'''

Write your prompt below, and call it ```prompt_task5```

In [44]:
prompt_task5 = f"""
I need help with dialogue act tagging, when given a conversation, each utterance should be tagged with one of these relevant tags and the output should be returned in JSON format:
'qestion' if it's a question
'request' if it's a request for sth
'greet' if it's a greeting
'accept' if it indicates acceptance or agreement
'state' if it provides a information about something
'answer' response to a question

for example the tags for these sequences can be defined like this:

Dialogue 1:
A: Hi
B: Hi
A: Where are you?
B: near to campus
A: coming to see you
Output1:
[
    {{"utterance": "Hi", "dialogue_act": "greet"}},
    {{"utterance": "Hi", "dialogue_act": "greet"}},
    {{"utterance": "Where are you?", "dialogue_act": "question"}},
    {{"utterance": "near to campus", "dialogue_act": "answer"}},
    {{"utterance": "coming to see you", "dialogue_act": "state"}}
]

Dialogue 1:
A: Hi
B: Hello
A: I need a bus ticket to Liverpool
B: when do you want to go?
A: tommorow before 12
B: alright, I am looking for tickets now.
Output1:
[
    {{"utterance": "Hi", "dialogue_act": "greet"}},
    {{"utterance": "Hello", "dialogue_act": "greet"}},
    {{"utterance": "I need a bus ticket to Liverpool?", "dialogue_act": "request"}},
    {{"utterance": "when do you want to go?", "dialogue_act": "question"}},
    {{"utterance": "tommorow before 12", "dialogue_act": "answer"}},
    {{"utterance": "alright, I am looking for tickets now.", "dialogue_act": "statement"}}
]

based on these examples, i need you to tag this dialogue: {task5_conversation}

 """
prompt_template_task5 = f'''

USER: {prompt_task5}

ASSISTANT:
'''

In [45]:
response_task5 = lcpp_llm(
    prompt=prompt_template_task5,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task5["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: 
I need help with dialogue act tagging, when given a conversation, each utterance should be tagged with one of these relevant tags and the output should be returned in JSON format:
'qestion' if it's a question
'request' if it's a request for sth
'greet' if it's a greeting
'accept' if it indicates acceptance or agreement
'state' if it provides a information about something
'answer' response to a question

for example the tags for these sequences can be defined like this:

Dialogue 1:
A: Hi
B: Hi
A: Where are you?
B: near to campus
A: coming to see you
Output1:
[
    {"utterance": "Hi", "dialogue_act": "greet"},
    {"utterance": "Hi", "dialogue_act": "greet"},
    {"utterance": "Where are you?", "dialogue_act": "question"},
    {"utterance": "near to campus", "dialogue_act": "answer"},
    {"utterance": "coming to see you", "dialogue_act": "state"}
]

Dialogue 1:
A: Hi
B: Hello
A: I need a bus ticket to Liverpool
B: when do you want to go?
A: tommorow before 12
B: alright, I am 


llama_print_timings:        load time =  6322.82 ms
llama_print_timings:      sample time =   127.69 ms /   254 runs   (    0.50 ms per token,  1989.18 tokens per second)
llama_print_timings: prompt eval time =   436.23 ms /   143 tokens (    3.05 ms per token,   327.81 tokens per second)
llama_print_timings:        eval time =  7581.35 ms /   253 runs   (   29.97 ms per token,    33.37 tokens per second)
llama_print_timings:       total time =  8535.72 ms


How did llama2 do?
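
One way to answer this systematically is to parse the JSON in the completion and check that every tag belongs to the tagset. The sketch below assumes the completion (with the echoed prompt already removed) contains a single JSON list. `parse_dialogue_acts` is a hypothetical helper, the tagset is the one listed in the prompt above, and the sample string is hand-written rather than real model output.

```python
import json

TAGSET = {"question", "request", "greet", "accept", "state", "answer"}


def parse_dialogue_acts(completion: str) -> tuple[list[dict], list[str]]:
    """Parse a JSON list from the completion and collect tags outside TAGSET."""
    start = completion.find("[")
    end = completion.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in the completion")
    items = json.loads(completion[start:end + 1])
    unknown = [item["dialogue_act"] for item in items if item["dialogue_act"] not in TAGSET]
    return items, unknown


# Hand-written sample (not real model output).
sample = (
    'Sure! [{"utterance": "Seattle.", "dialogue_act": "answer"}, '
    '{"utterance": "Right.", "dialogue_act": "statement"}]'
)
items, unknown = parse_dialogue_acts(sample)
print(len(items), unknown)
```

A check like this also makes inconsistencies in the prompt visible: the second example above uses the tag "statement", while the tag list defines "state".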