# Prompting - Prompt Engineering

## Iterative Prompt Development

This initial exercise serves as a warm-up phase, familiarizing students with the transformers pipeline and the Microsoft Phi-3 model employed throughout the course. Through iterative experimentation with basic prompts, participants will gain insights into the critical factors for effective prompt design. These factors include:

- **Prompt Specificity:** The importance of crafting prompts that are narrowly focused and deliver clear instructions will be explored.
- **Prompt Cues:** The role of incorporating contextual cues within prompts to guide the model's response will be investigated.
- **Model Response Time:** The impact of giving the model "time to think", that is, prompting it to work through intermediate steps on complex tasks, will be examined.

## Learning Objectives

- **Model Interaction:** Utilize the transformers pipeline to effectively generate responses from the Phi-3 large language model.
- **Prompt Design:** Craft prompts that are highly specific, ensuring focused and relevant model outputs.
- **Model Response Optimization:** Develop prompts that give the model "time to think" by asking for intermediate steps on complex tasks, thereby enhancing response quality.
- **Prompt-Based Guidance:** Integrate contextual cues into prompts to steer the model's response generation towards desired outcomes.

# <FONT COLOR="purple">Verify that the Colab runtime type is set to GPU!</FONT>

## Install Dependencies

In [None]:
# The 'device_map' parameter requires the Accelerate package.
# Restart the workspace after the install!
!pip install accelerate flash_attn

## Create Microsoft [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) Pipeline

In [None]:
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import TextStreamer

# Microsoft Phi-3-mini-4k-instruct model identifier on the Hugging Face Hub
model_id = "microsoft/Phi-3-mini-4k-instruct"

# The tokenizer converts text into token IDs that the model can process.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in half precision and let Accelerate place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             attn_implementation="eager")

# The streamer prints tokens as they are generated, so the response appears incrementally instead of all at once.
streamer = TextStreamer(tokenizer, skip_prompt=True)

## Generate Functions

In this notebook, we will use the `generate` function defined below to interact with the LLM. It builds the prompt according to the model's default template:

```text
# Microsoft Phi-3-mini-4k-instruct default prompt template

<|system|>
{system}<|end|>
<|user|>
{question}<|end|>
<|assistant|> 
{response}<|end|>
<|user|>
{question}<|end|>
<|assistant|> 
```
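
As a minimal sketch, the template above can be assembled with plain string formatting. The helper name `build_phi3_prompt` is hypothetical (it is not part of the notebook or of transformers), and the history entry is a hand-written placeholder rather than model output. The helper only builds the string and does not call the model, which makes the layout easy to inspect.

```python
def build_phi3_prompt(question, system, history=()):
    """Assemble a Phi-3 style prompt string (hypothetical helper for illustration)."""
    # The system turn comes first.
    prompt = f"<|system|>\n{system}<|end|>\n"
    # Each earlier exchange becomes a user turn followed by an assistant turn.
    for prev_question, prev_response in history:
        prompt += f"<|user|>\n{prev_question}<|end|>\n"
        prompt += f"<|assistant|>\n{prev_response}<|end|>\n"
    # The current question is last; the open assistant tag asks the model to answer.
    prompt += f"<|user|>\n{question}<|end|>\n<|assistant|>\n"
    return prompt


example = build_phi3_prompt(
    question="And of Greece?",
    system="You are a helpful assistant.",
    # Hand-written placeholder history, not real model output.
    history=[("What is the capital of Cyprus?", "Nicosia.")],
)
print(example)
```

Printing the assembled string before sending it to the model is a cheap way to spot a missing `<|end|>` tag or a misplaced turn.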

In [None]:
def generate(question, system=None, history=(), model=model, max_new_tokens=256, do_sample=False, temperature=1.0):
    """
    Generate a text response from the given large language model (LLM).
    The function builds a prompt from the system message, the chat history, and the question,
    sends it to the model, and returns the decoded response.

    - question (str): The user question or any other instruction.
    - system (str, optional): Contextual information provided to the model for the whole conversation. A default is used if None.
    - history (sequence of tuples, optional): The chat history. Each tuple holds a question and the corresponding assistant response.
    - model (object): The model used for generation.
    - max_new_tokens (int, optional, defaults to 256): The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
    - do_sample (bool, optional, defaults to False): Whether to use sampling; greedy decoding is used otherwise.
    - temperature (float, optional, defaults to 1.0): The value used to modulate the next-token probabilities.
    """

    if system is None:
        system = """This is a chat between a user and an artificial intelligence assistant.
        The assistant gives helpful, detailed, and polite answers to the user's questions based on the context.
        The assistant should also indicate when the answer cannot be found in the context."""

    prompt = f"<|system|>\n{system}<|end|>\n"

    # Add each previous (question, response) turn from the history to the prompt
    for prev_question, prev_response in history:
        prompt += f"<|user|>\n{prev_question}<|end|>\n<|assistant|>\n{prev_response}<|end|>\n"

    # Add the current user question at the end, then open the assistant turn
    prompt += f"<|user|>\n{question}<|end|>\n<|assistant|>\n"
    tokenized_prompt = tokenizer(prompt, return_tensors="pt").to(model.device)

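    outputs = model.generate(input_ids=tokenized_prompt.input_ids,
                             attention_mask=tokenized_prompt.attention_mask,
                             max_new_tokens=max_new_tokens,
                             streamer=streamer,
                             temperature=temperature,  # use the caller's value instead of a hard-coded 1
                             do_sample=do_sample)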

    # Return the decoded text from outputs
    return tokenizer.decode(outputs[0][tokenized_prompt.input_ids.shape[-1]:], skip_special_tokens=True).strip()

In [None]:
# Simple test
question = "What is the meaning of life?"

_ = generate(question)
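
The `history` argument of `generate` is not exercised in the cells shown here. As a hedged sketch of how it could be used, the hypothetical helper `chat_turn` below records each (question, response) pair so that the next call can include it in the prompt. It takes the generation function as an argument: in the notebook you would pass `generate`, while the made-up stand-in `echo_generate` is used here so that the snippet runs without loading the model.

```python
def chat_turn(generate_fn, question, history):
    """Ask one question, then record the exchange in `history` (hypothetical helper)."""
    response = generate_fn(question, history=history)
    # Store the pair so the next call can include it in the prompt.
    history.append((question, response))
    return response


def echo_generate(question, history=()):
    """Stand-in for the notebook's `generate`; it only reports what it received."""
    return f"(turn {len(history) + 1}) you asked: {question}"


history = []
first = chat_turn(echo_generate, "What is the capital of Cyprus?", history)
second = chat_turn(echo_generate, "How many vowels are in its name?", history)
print(first)
print(second)
print(len(history))
```

Keeping the history in a plain list owned by the caller avoids hidden state inside `generate` and makes it easy to reset a conversation by creating a new list.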

---

## Some simple examples

In [None]:
prompt = "What is the capital of Cyprus?"

_ = generate(prompt)

In [None]:
prompt = "What is the capital of Cyprus? Please, give me a short answer. Use a few words."

_ = generate(prompt)

---

## Vowels in the name: a slightly more complex task

In [None]:
prompt = "Write me the vowels in the capital of Cyprus."

_ = generate(prompt)

When a task requires multi-step reasoning, a model often performs better if the prompt guides it through the intermediate steps in sequence, much like asking it to show its work. This approach is commonly called giving the model **"time to think"**.

The following prompt applies this idea: it instructs the model to first name the capital of Cyprus and then list the vowels in it.

In [None]:
prompt = "Write me the capital of Cyprus, and then write me all the vowels in it."

_ = generate(prompt)

A more complex request:

In [None]:
prompt = "Write me the vowels in the capital of Cyprus in reverse alphabetical order."

_ = generate(prompt)
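
To judge the model's answers to the vowel prompts, reference values can be computed directly in Python, in the same spirit as a "real answer" cell. This is a small sketch under two assumptions: the expected capital is Nicosia, and only a, e, i, o, u count as vowels.

```python
capital = "Nicosia"  # assumed reference answer for the capital of Cyprus
VOWELS = "aeiou"     # assumption: only these five letters count as vowels

# Vowels in order of appearance, duplicates kept.
vowels_in_order = [ch for ch in capital.lower() if ch in VOWELS]

# Distinct vowels sorted in reverse alphabetical order.
vowels_reversed = sorted(set(vowels_in_order), reverse=True)

print(vowels_in_order)   # ['i', 'o', 'i', 'a']
print(vowels_reversed)   # ['o', 'i', 'a']
```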

Let's again give it **"time to think"** by prompting it to break down the task into intermediate steps and show its work.

In [None]:
prompt = "Tell me the capital of Cyprus, and then write me all the vowels in it, then write me the vowels in reverse-alphabetical order."

_ = generate(prompt)
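
The step-by-step prompts above were written by hand. As a rough sketch, the same structure could be produced from a list of sub-tasks. The helper name `build_stepwise_prompt` and the exact wording of the instruction line are assumptions made for illustration; whether a numbered layout helps this particular model would need to be tested.

```python
def build_stepwise_prompt(steps):
    """Join sub-tasks into one prompt that asks for them in order (hypothetical helper)."""
    if not steps:
        raise ValueError("at least one step is required")
    # Number the steps so the model can refer to each intermediate result.
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    header = "Complete the following steps in order and show the result of each step:"
    return header + "\n" + "\n".join(numbered)


prompt = build_stepwise_prompt([
    "Tell me the capital of Cyprus.",
    "Write all the vowels in it.",
    "Write those vowels in reverse alphabetical order.",
])
print(prompt)
```

The resulting string can be passed to `generate` like any other prompt.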

---

## Exercise: time to think

Although LLMs were not primarily designed to solve mathematical problems, the latest models can often handle them. However, as the examples in this notebook show, precise instructions can strongly affect both the result and the accompanying explanation.

In [None]:
# Real answer
5*2, 5 + 5

In [None]:
# The model is not always accurate.
prompt = "Calculate the exact result: 5x2."

_ = generate(prompt)

TODO: Use the **"time-to-think"** technique so that the model interprets the multiplication of `5` and `2` as the area calculation of a rectangle and not as another kind of math problem.

### Your Work Here

### Solution

In [None]:
prompt = "Calculate the exact result: 5x2. Explain, using the result, how the operation of the rectangle's area calculation works."

_ = generate(prompt)
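
Since the model is not always accurate, it can help to check a response against the known result. The sketch below is one simple way to do that: the hypothetical helper `mentions_number` looks for the expected value among the numbers in the response text. The two response strings are hand-written examples, not model output; in the notebook you would pass the string returned by `generate` instead.

```python
import re


def mentions_number(response, expected):
    """Return True if `expected` appears as a standalone number in `response`."""
    # Extract integers and decimals; "100" is one number, so it does not match 10.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return any(float(n) == float(expected) for n in numbers)


expected = 5 * 2
# Hand-written example responses for illustration (not model output).
print(mentions_number("The area of a 5 by 2 rectangle is 10 square units.", expected))
print(mentions_number("5x2 could be a variable expression.", expected))
```

A check like this only confirms that the number appears somewhere in the text; it does not verify that the explanation around it is correct.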

---

## Review

The following fundamental concepts were introduced in this notebook:

- **Precision:** Giving the LLM explicit, specific instructions so that its response matches what is required.
- **Cue:** A short directive or lead-in placed at the end of a prompt that steers the form of the LLM's response, usually without being repeated in the response itself.
- **"Time to think":** Structuring a prompt so that the LLM works through a task in sequential steps and shows its intermediate work before giving the final answer.
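
As a small illustration of a cue, the sketch below appends a lead-in to the end of a prompt. The helper name `add_cue` and the cue text `Capital:` are assumptions chosen for this example; how strongly a given cue steers the output depends on the model.

```python
def add_cue(prompt, cue):
    """Append a cue on its own line at the end of a prompt (hypothetical helper)."""
    return f"{prompt.rstrip()}\n{cue}"


cued_prompt = add_cue(
    "What is the capital of Cyprus? Use a few words.",
    "Capital:",  # the cue hints at the expected form of the answer
)
print(cued_prompt)
```

The cued string can then be passed to `generate` in place of the plain question.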

---

## Restart the Kernel

In [None]:
from IPython import get_ipython

get_ipython().kernel.do_shutdown(restart=True)

## <FONT COLOR="red">The notebook is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/). This means that you can freely copy, distribute, and modify the notebook by its authors ([Balázs Harangi](https://inf.unideb.hu/dr-harangi-balazs), [András Hajdu](https://inf.unideb.hu/munkatars/4250), and [Róbert Lakatos](https://inf.unideb.hu/lakatos-robert-tanarseged)), but not for commercial purposes. Additionally, if you modify the notebook, you must cite them as the original creators and share the modified version under the same terms.
</FONT>