# Prompt Engineering

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

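# Text-generation pipeline shared by the cells below.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the completion, not the prompt
    max_new_tokens=500,      # cap the length of each completion
    do_sample=False,         # greedy decoding unless a call overrides it
)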

Some parameters are on the meta device because they were offloaded to the disk.
Device set to use mps


In [2]:
messages = [
    {"role": "user", "content": "Create a funny joke about Elon Musk."}
]

output = pipe(messages)
print(output[0]["generated_text"])


 Why did Elon Musk start a company to make pizza? Because he wanted to take the "pizza" to the next level with his SpaceX and Tesla innovations!


In [3]:
# Inspect the prompt string that the chat template builds from the messages
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about Elon Musk.<|end|>
<|endoftext|>
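
The printed prompt shows how the chat template wraps each message in role and end markers. As a rough illustration only, the same layout can be rebuilt by hand. `render_phi3_prompt` and `demo_messages` are hypothetical names, and the real template lives in the tokenizer configuration, so it may differ in whitespace or between versions.

```python
def render_phi3_prompt(messages):
    """Hypothetical helper: rebuild the prompt layout printed above by hand."""
    parts = []
    for message in messages:
        # Each turn: role marker, newline, content, end marker, newline.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # The printed prompt closes with an end-of-text marker.
    return "".join(parts) + "<|endoftext|>"


demo_messages = [
    {"role": "user", "content": "Create a funny joke about Elon Musk."}
]
print(render_phi3_prompt(demo_messages))
```

In practice `apply_chat_template` should stay the source of truth; the sketch is only meant to make the special tokens easier to read.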


In [5]:
# Sample instead of decoding greedily; temperature=1 leaves the distribution unscaled
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why did Elon Musk bring a ladder to the boring meeting?

Because he wanted to make sure the ideas wouldn't fall flat!


In [8]:
# Sample again; top_p=1 keeps every token in the nucleus, so nothing is filtered out
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why did Elon Musk invent a reusable rockets?


Because he wanted to ensure that more than just the moon stays in orbit without an awkward apology to the Earth.
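
With `temperature=1` and `top_p=1` the token distribution is left unchanged, so the variety in the two jokes comes from sampling itself. What the two knobs do at other values can be sketched on a toy distribution. The logits and the helper names `softmax_with_temperature` and `top_p_filter` are assumptions for illustration, not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
probs = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)
nucleus = top_p_filter(probs, top_p=0.9)

print("temperature=1.0:", [round(p, 3) for p in probs])
print("temperature=0.5:", [round(p, 3) for p in sharp])
print("top_p=0.9 keeps token ids:", sorted(nucleus))
```

Lowering the temperature or `top_p` concentrates probability on the most likely tokens, which tends to give more predictable text.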


In [10]:
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Ocarina' is a type of clay flute. An example of a sentences that uses the word Ocarina is:"
    },
    {
        "role": "assistant",
        "content": "I have a Ocarina that someone special gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "'Song of Time' is played by this instrument. Another song that is played by the Ocarina is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Ocarina' is a type of clay flute. An example of a sentences that uses the word Ocarina is:<|end|>
<|assistant|>
I have a Ocarina that someone special gave me as a gift. I love to play it at home.<|end|>
<|user|>
'Song of Time' is played by this instrument. Another song that is played by the Ocarina is:<|end|>
<|endoftext|>


In [11]:
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 The Ocarina is a unique and ancient musical instrument, often made from clay and known for its distinctive, haunting sound. It has been used in various cultures around the world for centuries. The 'Song of Time' is a fictional piece that could be played on an Ocarina, symbolizing the passage of time and the transient nature of life.

Another song that is played by the Ocarina is "The Wind's Whisper." This imaginary piece could be a beautiful melody that captures the gentle, soothing sound of the wind as it rustles through the leaves of a tree. The Ocarina's ability to produce a wide range of tones and its ethereal quality make it an ideal instrument for conveying the delicate and ephemeral qualities of the wind.

In a more real-world context, the Ocarina has been featured in various musical compositions and performances. For example, the Ocarina has been used in the soundtrack of the video game "The Legend of Zelda: Ocarina of Time," where it plays a crucial role in the game's storyli

In [12]:
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a new sword for Elden Ring that injures you but does great damage."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

 Name: "Blade of the Fallen"

Slogan: "Cut through your enemies with unparalleled power, but beware the price of wielding the Fallen's curse."
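
A reply like this one can be fed into a follow-up prompt, which is the idea behind prompt chaining. A minimal sketch of the pattern, assuming a `generate` callable that takes a message list and returns text: `run_chain` and `stub_generate` are hypothetical names, and the stub only echoes its input so the example runs without a model.

```python
def run_chain(generate, first_prompt, followup_template):
    """Run two prompts in sequence, feeding the first reply into the second."""
    first_output = generate([{"role": "user", "content": first_prompt}])
    followup = followup_template.format(previous=first_output)
    second_output = generate([{"role": "user", "content": followup}])
    return first_output, second_output


def stub_generate(messages):
    """Stand-in for a model call: echoes the last user message."""
    return f"[reply to: {messages[-1]['content']}]"


name_and_slogan, pitch = run_chain(
    stub_generate,
    "Create a name and slogan for a new sword.",
    "Generate a very short reason why to use this sword: {previous}",
)
print(pitch)
```

With the pipeline from this notebook, `generate` could be something like `lambda m: pipe(m)[0]["generated_text"]`.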


In [14]:
use_prompt = [
    {"role": "user", "content": f"Generate a very short reason why to use this sword: {product_description}"}
]
outputs = pipe(use_prompt)
usage_pitch = outputs[0]["generated_text"]
print(usage_pitch)

 "Blade of the Fallen" offers unmatched strength in battle, ensuring victory over foes with its legendary edge.


In [15]:
chain_of_thought_prompt = [
    {"role": "user", "content": "A bakery made 15 cakes in the morning. In the afternoon, they baked 12 more cakes. Later, they packed these cakes into boxes, with each box holding 3 cakes. How many boxes did the bakery use?"},
    {"role": "assistant", "content": "The bakery has a total of 15 + 12 = 27 cakes when putting them in boxes we have 27 / 3 = 9 boxes total"},
    {"role": "user", "content": "A gardener planted 18 rows of flowers in a garden. Each row had 7 flowers. After a windy day, 36 flowers were blown away. How many flowers are still in the garden?"}
]
outputs = pipe(chain_of_thought_prompt)
print(outputs[0]["generated_text"])

 The gardener initially planted 18 rows * 7 flowers/row = 126 flowers. After the wind, there are 126 - 36 = 90 flowers left in the garden.


In [17]:
# Zero-shot chain-of-thought: ending the prompt with "Let's think step-by-step."
# primes the model to write out its reasoning before the final answer
zeroshot_cot_prompt = [
    {"role": "user", "content": "Sarah is preparing for a party. She buys 48 cupcakes. She wants to arrange them equally on 4 trays. After setting up the trays, she realizes she wants to set aside 8 cupcakes for a special decoration. How many cupcakes will be on each tray after setting aside the 8 cupcakes? Let's think step-by-step."}
]
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Determine the total number of cupcakes Sarah initially has.
Sarah has 48 cupcakes.

Step 2: Calculate the number of cupcakes to be set aside for decoration.
Sarah wants to set aside 8 cupcakes for decoration.

Step 3: Subtract the cupcakes set aside from the total number of cupcakes.
48 cupcakes - 8 cupcakes = 40 cupcakes remaining.

Step 4: Divide the remaining cupcakes equally among the 4 trays.
40 cupcakes ÷ 4 trays = 10 cupcakes per tray.

After setting aside the 8 cupcakes for decoration, there will be 10 cupcakes on each tray.
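
Reasoning traces read convincingly even when they are wrong, so it can help to check the final number independently. A small sketch, assuming the answer is the last number in the reply: `last_number` is a hypothetical helper, and the two reply strings are the closing sentences of the outputs above.

```python
import re


def last_number(text):
    """Return the last number that appears in the text, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None


cupcake_reply = (
    "After setting aside the 8 cupcakes for decoration, "
    "there will be 10 cupcakes on each tray."
)
flower_reply = "After the wind, there are 126 - 36 = 90 flowers left in the garden."

# Recompute both answers directly from the numbers in the prompts.
cupcakes_per_tray = (48 - 8) / 4
flowers_left = 18 * 7 - 36

print(last_number(cupcake_reply) == cupcakes_per_tray)
print(last_number(flower_reply) == flowers_left)
```

The last-number heuristic is crude and fails when the model adds a trailing remark that contains another figure.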


In [21]:
zeroshot_tree_of_thought_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is 'if I have 1 apple then ate 1 apple how many apples do I have left?' Make sure to discuss the results"}
]
outputs = pipe(zeroshot_tree_of_thought_prompt)
print(outputs[0]["generated_text"])

 Expert 1:
Step 1: I will start by assuming that the person has 1 apple initially.

Step 2: The person eats 1 apple, so we subtract 1 from the initial amount.

Step 3: After eating 1 apple, the person has 0 apples left.

Expert 2:
Step 1: I will consider the initial condition, which is that the person has 1 apple.

Step 2: The person consumes 1 apple, so we need to subtract 1 from the initial amount.

Step 3: After eating 1 apple, the person has 0 apples remaining.

Expert 3:
Step 1: I will assume that the person starts with 1 apple.

Step 2: The person eats 1 apple, so we subtract 1 from the initial amount.

Step 3: After eating 1 apple, the person has 0 apples left.

All experts agree that if a person has 1 apple and eats 1 apple, they will have 0 apples left.


## Guiding the Output
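
Guiding the format only pays off if the reply can be checked before it is used. A minimal sketch of such a check, assuming the reply is a bare JSON object with the four keys requested by the template in this section: `parse_profile` and `REQUIRED_KEYS` are hypothetical names, and the sample reply is copied from the templated example in this section.

```python
import json

REQUIRED_KEYS = {"description", "name", "armor", "weapon"}


def parse_profile(reply):
    """Parse a model reply as JSON and check that the expected keys are present."""
    profile = json.loads(reply.strip())
    if not isinstance(profile, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - profile.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return profile


sample_reply = """ {
    "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
    "name": "Shadowcloak",
    "armor": "Leather Hood",
    "weapon": "Dagger"
}
"""
profile = parse_profile(sample_reply)
print(profile["name"])
```

A reply wrapped in a Markdown code fence or cut off by the token limit would fail `json.loads`, which is one reason to prefer a short, explicit format.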

In [26]:
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

 ```json
{
  "name": "Aria Stormbringer",
  "class": "Warrior",
  "race": "Human",
  "level": 10,
  "attributes": {
    "strength": 18,
    "dexterity": 12,
    "constitution": 16,
    "intelligence": 8,
    "wisdom": 10,
    "charisma": 14
  },
  "skills": {
    "melee": 15,
    "ranged": 10,
    "magic": 5,
    "stealth": 12,
    "acrobatics": 10,
    "animal_handling": 8
  },
  "equipment": {
    "weapon": "Two-handed Axe",
    "armor": "Chainmail Hauberk",
    "shield": "Warhammer",
    "accessories": [
      "Warrior's Talisman",
      "Leather Boots",
      "Woolen Cloak"
    ]
  },
  "background": "Aria grew up in a small village on the outskirts of a large city. She was always fascinated by the stories of great warriors and adventurers who roamed the lands. When she was just a child, she witnessed a group of bandits attacking her village. She fought bravely and helped her village defeat the bandits. Since then, she has dedicated her life to becoming a skilled warrior and protec

In [27]:
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:
{
    "description": "A SHORT DESCRIPTION",
    "name": "THE CHARACTER'S NAME",
    "armor": "ONE PIECE OF ARMOR",
    "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 {
    "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
    "name": "Shadowcloak",
    "armor": "Leather Hood",
    "weapon": "Dagger"
}


In [30]:
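import gc
import torch

# Free the Transformers objects before loading the GGUF model.
# del model, tokenizer, pipe  # uncomment to actually drop the references

gc.collect()
# empty_cache() here only applies to the Apple-silicon (MPS) backend.
if torch.backends.mps.is_available():
    torch.mps.empty_cache()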

In [38]:
!pip install llama-cpp-python


In [39]:
from llama_cpp.llama import Llama

llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized

In [40]:
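# Request a JSON object; response_format constrains generation to valid JSON.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a character in Elden Ring in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,  # greedy decoding
)
output = response["choices"][0]["message"]["content"]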

In [41]:
import json

json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "character": {
        "name": "Eldrin the Wanderer",
        "class": "Ranger",
        "race": "Human",
        "attributes": {
            "strength": 12,
            "dexterity": 18,
            "constitution": 14,
            "intelligence": 10,
            "wisdom": 12,
            "charisma": 11
        },
        "skills": [
            {
                "name": "Bow Mastery",
                "level": 5,
                "description": "Improves accuracy and damage with bows and crossbows."
            },
            {
                "name": "Tracking",
                "level": 5,
                "description": "Enhances ability to track and follow enemies."
            },
            {
                "name": "Wilderness Survival",
                "level": 5,
                "description": "Increases ability to survive in the wilderness and find resources."
            },
            {
                "name": "Stealth",
                "level": 4,
                "descri