<h1> Prompt Engineering</h1>
<i>Methods for improving the output through prompt engineering.</i>

In [None]:
%%capture
# Quote each requirement so the shell does not treat ">" as output redirection
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1"

!pip install "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

- CMAKE_ARGS="-DLLAMA_CUBLAS=on" sets the environment variable CMAKE_ARGS, which is passed to the CMake build system when llama-cpp-python is compiled.

- -DLLAMA_CUBLAS=on tells CMake to enable the cuBLAS backend for GPU acceleration in the llama.cpp project. Newer llama.cpp releases have renamed this option (to -DGGML_CUDA=on), so check the llama-cpp-python README for the flag that matches your version.

- pip install llama-cpp-python is the standard Python package installer command; it downloads and installs the llama-cpp-python package from PyPI. When the package is built from source (which is what custom CMAKE_ARGS are meant for), pip invokes CMake with the specified arguments.

## Loading our model

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


- microsoft/Phi-3-mini-4k-instruct -- the model's repository name on the Hugging Face Hub (a local path also works)
- device_map="cuda" -- loads the model onto the GPU for faster inference
- torch_dtype="auto" -- uses the tensor dtype stored with the checkpoint instead of a fixed default
- trust_remote_code=False -- refuses to execute custom model code shipped with the repository, for security reasons
- "text-generation" -- specifies the type of pipeline
- model=model -- uses the Phi-3-mini-4k-instruct model loaded above
- tokenizer=tokenizer -- uses the matching tokenizer to encode and decode text for the model
- return_full_text=False -- returns only the generated text, not the original prompt
- max_new_tokens=500 -- caps the number of tokens the model may generate per response (long outputs are cut off mid-sentence)
- do_sample=False -- disables random sampling, so the model always picks the highest-probability token (greedy decoding)
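To make the last three settings concrete, here is a minimal sketch of greedy decoding over a made-up lookup table. The table `TOY_SCORES` and the function `greedy_generate` are hypothetical stand-ins for illustration only; they are not part of transformers and say nothing about how the pipeline is implemented internally.

```python
# Toy stand-in for a language model: maps the last token to scores for
# each candidate next token. Vocabulary and scores are invented.
TOY_SCORES = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.4},
    "a": {"cat": 0.5, "dog": 0.7},
    "cat": {"sat": 2.0, "ran": 0.1},
    "dog": {"ran": 1.0, "sat": 0.9},
    "sat": {"down": 1.0, "<end>": 0.5},
    "ran": {"away": 1.0, "<end>": 0.8},
    "down": {"<end>": 1.0},
    "away": {"<end>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens, return_full_text=False):
    """Greedy decoding: always take the highest-scoring next token."""
    tokens = list(prompt_tokens)
    new_tokens = []
    for _ in range(max_new_tokens):  # hard cap, like max_new_tokens
        scores = TOY_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)  # no sampling involved
        if next_token == "<end>":
            break
        tokens.append(next_token)
        new_tokens.append(next_token)
    # Mirror return_full_text: include the prompt only when asked to
    return tokens if return_full_text else new_tokens


print(greedy_generate(["<start>", "the"], max_new_tokens=2))
print(greedy_generate(["<start>", "the"], max_new_tokens=10, return_full_text=True))
```

Because nothing is sampled, running the function twice with the same prompt gives the same tokens, which is the behaviour do_sample=False aims for.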



---



In [2]:
# Prompt
messages = [{"role": "user", "content": "Create a funny song about Charlie Chaplin."}]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

 (Verse 1)
In the heart of the silent film era,
Lived a man with a mustache, oh so clever.
Charlie Chaplin, the Tramp,
With a bowler hat and a cane, he'd dance.

(Chorus)
Oh, Charlie Chaplin, the Great,
With his pants down to his ankles, he'd play.
In the face of adversity,
He'd always find a way.

(Verse 2)
He'd slip and slide on the streets,
With a heart full of laughter and glee.
His humor, his charm,
Would make you feel free.

(Chorus)
Oh, Charlie Chaplin, the Great,
With his tramp suit and his bowler hat.
In the face of adversity,
He'd always find a way.

(Bridge)
From the streets of London to the silver screen,
His legacy, forever to remain.
A symbol of hope and resilience,
In the face of adversity, he'd always win.

(Verse 3)
He'd tumble and fall,
But his spirit, it'd never bend.
With a smile on his face,
He'd make the world transcend.

(Chorus)
Oh, Charlie Chaplin, the Great,
With his tramp suit and his bowler hat.
In the face of adversity,
He'd always find a way.

(Outro)
So h

In [3]:
# Apply prompt template

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny song about Charlie Chaplin.<|end|>
<|endoftext|>
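The printed prompt shows how the template wraps each message. As a rough sketch, the same string can be reproduced in plain Python. The function `render_phi3_style` is hypothetical and only imitates the output printed above; the real template lives in the tokenizer configuration, and the `add_generation_prompt=True` branch here is an assumption about how that template behaves.

```python
def render_phi3_style(messages, add_generation_prompt=False):
    """Imitate the prompt layout printed above (illustration only)."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-turn marker
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Assumption: with a generation prompt the string ends with an open
    # assistant turn; otherwise it ends with the end-of-text token.
    parts.append("<|assistant|>\n" if add_generation_prompt else "<|endoftext|>")
    return "".join(parts)


messages = [{"role": "user", "content": "Create a funny song about Charlie Chaplin."}]
print(render_phi3_style(messages))
```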


In [4]:
# Using a high temperature

output = pipe(messages, do_sample=True, temperature=1)
print(output[0]['generated_text'])

 (Verse 1)
Charlie Chaplin, oh he was a man,
His character the Little Tramp, so grand.
He'd walk and dance, and make us all laugh,
Fade out, fade in, life's a crazy aftermath.

(Chorus)
Oh Charlie Chaplin, comic gold,
Dreamin' of dames in red.
Film's finest, forever star,
Through all the years, we still cheer you from afar.

(Verse 2)
He fought and he cried, but mostly he smiled,
In every grim situation, he played a wild child.
In the 'Modern Times', he was a factory man,
Lurching through life, under the industrial rain.

(Chorus)
Oh Charlie Chaplin, you're Timeless, divine,
Amidst laughter, you'd quietly shine.
Your legacy still dances on screen,
In every frame, you're elegantly serene.

(Bridge)
From the silent screen to the talkies,
Your magic lingers, still enthralls us.
To find peace now, in the afterlife,
Would mean the world, put worries aside.

(Chorus)
Oh Charlie Chaplin, your wit so sly,
Your humor, it does not die.
In a world of chaos, you offered peace,
Your laughter, it's 

- temperature is a parameter that controls the randomness of a language model's output by rescaling the probabilities used to select the next token. A value of 1 samples from the model's unmodified distribution.

- A low temperature results in more predictable and focused text, ideal for factual answers or code generation.

- A high temperature produces more creative and diverse text, making it suitable for creative writing or brainstorming.
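The effect of temperature can be sketched with a few lines of plain Python. The logits below are invented, and `softmax_with_temperature` is a hypothetical helper written for this note; it shows the usual idea of dividing logits by the temperature before the softmax, not the exact code path inside transformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures push almost all probability onto the top-scoring token, while higher temperatures flatten the distribution so that less likely tokens are sampled more often.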

In [5]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 (Verse 1)

There once was a funny man named Charlie Chaplin,

His tramp cane always kept him from sin,

He danced and he frowned, with a bowler hat,

A silent film star, with a unique gait.


(Chorus)

Charlie Chaplin, oh Charlie Chaplin,

He made us laugh, through thick and thin,

From bowlers to tramps, he was the man,

The Great Dictator, the silent can.


(Verse 2)

He walked the streets with a heart so tender,

His movie scenes, we'd hold them at any end,

In silent films where he was the prince,

A comedic genius, wearing his moustache.


(Chorus)

Charlie Chaplin, oh Charlie Chaplin,

He was a star, it's hard to define,

With a dance and a smile, he won our hearts,

A true legend, the man with the tramp arts.


(Bridge)

He fought for peace with a bowler and cane,

A hero of cinema, we'll always revere,

His laughter's echo, forever on the screen,

Charlie Chaplin, the man, and his tramp machine.


(Chorus)

Charlie Chaplin, oh Charlie Chaplin,

He's the man with the tramp and 

top_p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p. With top_p=1 that set is the entire vocabulary, so nothing is filtered out and the output is as random as the model's probabilities allow. A smaller value such as top_p=0.9 drops the least likely tokens. top_p is often paired with temperature to control how creative the output is.
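A small sketch of the filtering step may help. The probabilities are invented, and `top_p_filter` is a hypothetical helper; library implementations differ in detail (for example in how they treat ties and the boundary token), so treat this as an illustration of the idea only.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution
probs = {"laugh": 0.5, "dance": 0.3, "cane": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # the least likely token is dropped
print(top_p_filter(probs, 1.0))  # every token stays in the pool
```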


---



# **Advanced Prompt Engineering**


## Complex Prompt

In [6]:
# Text to summarize which we stole from https://jalammar.github.io/illustrated-transformer/ ;)
text = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
text = "MY TEXT TO SUMMARIZE"  # Placeholder: this overwrites the article text defined above; delete this line to summarize that text instead
data = f"Text to summarize: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data
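To experiment with removing and adding pieces, one option is to keep the components in a dictionary and skip some of them when joining. This is a sketch under that assumption; `build_query` and the shortened component strings are hypothetical and not part of the notebook's code.

```python
def build_query(components, skip=()):
    """Join prompt components in insertion order, leaving out skipped ones."""
    return "".join(text for name, text in components.items() if name not in skip)


# Shortened versions of the components above, for illustration
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "tone": "The tone should be professional and clear.\n",
    "data": "Text to summarize: MY TEXT TO SUMMARIZE",
}

full_query = build_query(components)
no_persona = build_query(components, skip={"persona"})
print(no_persona)
```

Comparing the model's answers for `full_query` and `no_persona` is one way to see how much a single component contributes.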

In [7]:
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

<|user|>
You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.
Summarize the key findings of the paper provided.
Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.
Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.
The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.
The tone should be professional and clear.
Text to summarize: MY TEXT TO SUMMARIZE<|end|>
<|endoftext|>


In [8]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

 - The paper investigates the impact of pre-training data size on the performance of Large Language Models (LLMs).

- It compares models trained on different volumes of data, ranging from small to extra-large datasets.

- The study finds that models trained on larger datasets generally perform better on a variety of tasks.

- However, the performance gains diminish as the dataset size increases beyond a certain point.

- The paper also explores the cost-benefit trade-off of using larger datasets for training LLMs.

- It concludes that there is an optimal dataset size that balances performance gains with training costs.


In summary, the paper presents a comprehensive analysis of how the size of pre-training data affects the performance of Large Language Models. The researchers found that while larger datasets tend to yield better model performance across various tasks, the benefits plateau beyond a certain dataset size. This indicates that there is a point of diminishing returns when i



---



## In-Context Learning: Providing Examples

In [9]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|end|>
<|assistant|>
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|end|>
<|user|>
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|end|>
<|endoftext|>


- One-shot prompting is a technique in which a single example is given to the model to show the desired output format and style for a task. The model generalizes from the example, which tends to produce more accurate and less ambiguous responses than zero-shot prompting (which provides no examples). It is particularly useful when you need a specific format but have few examples available or want to avoid the overhead of supplying several.
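The same message layout extends from one example to several. Below is a minimal sketch of a helper that interleaves example pairs as user and assistant turns before the real question; `build_few_shot_messages` is a hypothetical name introduced here for illustration.

```python
def build_few_shot_messages(examples, question):
    """Turn (user_text, assistant_text) pairs plus a final question
    into a chat-style message list."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. "
        "An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
]
question = (
    "To 'screeg' something is to swing a sword at it. "
    "An example of a sentence that uses the word screeg is:"
)
few_shot_prompt = build_few_shot_messages(examples, question)
print([m["role"] for m in few_shot_prompt])
```

With a single pair this produces the same three-message structure as the one-shot prompt above; adding more pairs turns it into a few-shot prompt.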

In [10]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 During the medieval reenactment, the knight skillfully screeged the wooden target with precision and grace.




---



## Chain Prompting: Breaking up the Problem


In [11]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages smol llms."}
]
outputs = pipe(product_prompt)

product_description = outputs[0]["generated_text"]
print(product_description)

 Name: ChatMate
Slogan: "Smart Conversations with ChatMate"


- Chain prompting (also called prompt chaining) breaks a complex task into smaller, sequential subtasks, using the output of one prompt as the input for the next.
- This approach can improve the accuracy and reliability of LLMs by guiding them through a structured process.
- Prompt chaining uses a series of linked prompts, whereas chain-of-thought (CoT) prompting guides the LLM to "show its work" by generating intermediate reasoning steps within a single response before giving a final answer.
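The chaining pattern can be written as a small loop that feeds each output into the next prompt. The sketch below uses a fake generator so it runs without a model; `run_chain`, `fake_generate`, and the `{previous}` placeholder are hypothetical choices made for this example. In the notebook, a thin wrapper around `pipe` would take the place of `fake_generate`.

```python
def run_chain(generate, templates, first_input):
    """Run prompts in sequence, inserting each output into the next template."""
    outputs = []
    current = first_input
    for template in templates:
        prompt = template.format(previous=current)
        current = generate(prompt)
        outputs.append(current)
    return outputs


def fake_generate(prompt):
    """Stand-in for a model call so the example runs offline."""
    return f"[response to: {prompt}]"


templates = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
outputs = run_chain(fake_generate, templates, "a chatbot that leverages small LLMs")
for step, text in enumerate(outputs, start=1):
    print(step, text)
```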

In [12]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [{"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}]

outputs = pipe(sales_prompt)

sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

 Introducing ChatMate, your personal AI companion for smart conversations. With ChatMate, you can effortlessly navigate through any topic, making your interactions more engaging and meaningful. Experience the future of communication with ChatMate – where intelligence meets convenience.




---



# **Reasoning with Generative Models**


## Chain-of-Thought: Think Before Answering


In [14]:
# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])

 The cafeteria started with 23 apples. They used 20 apples to make lunch, so they had 23 - 20 = 3 apples left. After buying 6 more apples, they now have 3 + 6 = 9 apples. The answer is 9.
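Because the example answer ends with "The answer is 11.", the model tends to copy that closing phrase, which makes the final number easy to pull out. Here is a minimal sketch of such a parser; `extract_final_answer` is a hypothetical helper, and it assumes the response really does follow that phrasing.

```python
import re


def extract_final_answer(text):
    """Return the number after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s+(-?\d+(?:\.\d+)?)", text)
    if not matches:
        return None
    value = matches[-1]
    return float(value) if "." in value else int(value)


response = (
    "The cafeteria started with 23 apples. They used 20 apples to make lunch, "
    "so they had 23 - 20 = 3 apples left. After buying 6 more apples, they now "
    "have 3 + 6 = 9 apples. The answer is 9."
)
print(extract_final_answer(response))
```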




---



## Zero-shot Chain-of-Thought


In [15]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: The cafeteria starts with 23 apples.
Step 2: They used 20 apples to make lunch, so we subtract 20 from the initial amount: 23 - 20 = 3 apples remaining.
Step 3: The cafeteria bought 6 more apples, so we add 6 to the remaining amount: 3 + 6 = 9 apples.

The cafeteria now has 9 apples.
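One practical benefit of step-by-step output is that the intermediate arithmetic can be checked mechanically. The sketch below scans a response for simple "a op b = c" statements and verifies each one. `check_steps` is a hypothetical helper that only handles non-negative integers with +, - and *, so it is a sanity check rather than a general verifier.

```python
import operator
import re

STEP_PATTERN = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
OPERATIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_steps(text):
    """Return (expression, is_correct) for every 'a op b = c' found in text."""
    results = []
    for left, op, right, claimed in STEP_PATTERN.findall(text):
        actual = OPERATIONS[op](int(left), int(right))
        results.append((f"{left} {op} {right} = {claimed}", actual == int(claimed)))
    return results


response = (
    "Step 1: The cafeteria starts with 23 apples.\n"
    "Step 2: They used 20 apples to make lunch, so we subtract 20 from the "
    "initial amount: 23 - 20 = 3 apples remaining.\n"
    "Step 3: The cafeteria bought 6 more apples, so we add 6 to the remaining "
    "amount: 3 + 6 = 9 apples.\n"
)
for expression, is_correct in check_steps(response):
    print(expression, "OK" if is_correct else "WRONG")
```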




---



## Tree-of-Thought: Exploring Intermediate Steps


In [16]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [17]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

 Expert 1:
Step 1: Start with the initial number of apples, which is 23.

Expert 2:
Step 1: Subtract the number of apples used for lunch, which is 20.
Step 2: Add the number of apples bought, which is 6.

Expert 3:
Step 1: Start with the initial number of apples, which is 23.
Step 2: Subtract the number of apples used for lunch, which is 20.
Step 3: Add the number of apples bought, which is 6.

Results:
All three experts arrived at the same answer: 23 - 20 + 6 = 9 apples remaining in the cafeteria.
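To reuse the same expert-panel wording for other questions, the prompt can be turned into a template. This is a sketch only: `TOT_TEMPLATE` and `build_tot_prompt` are hypothetical names, and the wording is copied from the prompt used above.

```python
TOT_TEMPLATE = (
    "Imagine three different experts are answering this question. "
    "All experts will write down 1 step of their thinking, then share it with the group. "
    "Then all experts will go on to the next step, etc. "
    "If any expert realises they're wrong at any point then they leave. "
    "The question is '{question}' Make sure to discuss the results."
)


def build_tot_prompt(question):
    """Wrap a question in the expert-panel prompt as a chat message list."""
    return [{"role": "user", "content": TOT_TEMPLATE.format(question=question)}]


tot_prompt = build_tot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
print(tot_prompt[0]["content"])
```

Note that this is still a single prompt: the model simulates the experts in one response, rather than the code exploring and pruning separate reasoning branches.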




---

