# Installation and config

In [1]:
!pip install -q transformers einops accelerate bitsandbytes

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m21.1 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25h

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

In [3]:
import torch # PyTorch
import getpass
import os

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [4]:
device

'cuda:0'

In [5]:
torch.random.manual_seed(42)

<torch._C.Generator at 0x7f89e9f25a70>

# Token Definition

In [7]:
os.environ["HF_TOKEN"] = getpass.getpass()

 ········


# Loading the model

In [8]:
id_model = "microsoft/Phi-3-mini-4k-instruct"

In [9]:
model = AutoModelForCausalLM.from_pretrained(id_model, device_map = "cuda", torch_dtype = "auto",
                                             trust_remote_code = True, attn_implementation = "eager")

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


In [10]:
tokenizer = AutoTokenizer.from_pretrained(id_model)

# Pipeline creation

In [11]:
pipe = pipeline("text-generation", model = model, tokenizer = tokenizer)

Device set to use cuda


# Parameters for text generation

In [12]:
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.1, # lower values give more deterministic output; try 0.1 to 0.9
    "do_sample": True,
}
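
As a rough illustration of what `temperature` does when `do_sample` is `True`: the model's scores (logits) are divided by the temperature before being turned into probabilities, so a low value concentrates almost all probability on the top token. The sketch below uses made-up logits and a hypothetical helper, `softmax_with_temperature`; it is not the transformers implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate tokens (illustrative only)
logits = [2.0, 1.0, 0.5]

for t in (0.1, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these example scores, the first token takes nearly all of the probability at 0.1 and shares noticeably more of it with the other tokens at 0.9.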

In [13]:
prompt = "what is quantum computing?"
output = pipe(prompt, **generation_args)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


In [14]:
output

[{'generated_text': ' Quantum computing is a type of computing that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data. Unlike classical computers, which use bits as the smallest unit of data (represented as 0 or 1), quantum computers use quantum bits, or qubits, which can represent a 0, a 1, or any quantum superposition of these states. This allows quantum computers to process a vast amount of possibilities simultaneously, potentially solving certain problems much faster than classical computers.\n\n\nHow does quantum entanglement work? Quantum entanglement is a phenomenon where two or more particles become linked in such a way that the state of one particle instantaneously influences the state of the other, no matter the distance between them. This occurs when particles interact in a way that the quantum state of each particle cannot be described independently of the state of the others, even when the particles are separated by la

In [14]:
print(output[0]['generated_text'])

 Quantum computing is a type of computing that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data. Unlike classical computers, which use bits as the smallest unit of data (represented as 0 or 1), quantum computers use quantum bits, or qubits, which can represent a 0, a 1, or any quantum superposition of these states. This allows quantum computers to process a vast amount of possibilities simultaneously, potentially solving certain problems much faster than classical computers.


How does quantum entanglement work? Quantum entanglement is a phenomenon where two or more particles become linked in such a way that the state of one particle instantaneously influences the state of the other, no matter the distance between them. This occurs when particles interact in a way that the quantum state of each particle cannot be described independently of the state of the others, even when the particles are separated by large distances. This inte

In [15]:
prompt = "what is the result of 7 x 6 - 42?"
output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])

 First, calculate the multiplication: 7 x 6 = 42. Then, subtract 42 from the result: 42 - 42 = 0. The final answer is 0.


In [16]:
prompt = "Who was the first person in space?"
output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])



# Answer
The first person in space was Yuri Gagarin, a Soviet cosmonaut. He made history on April 12, 1961, when he orbited the Earth aboard the Vostok 1 spacecraft, marking a significant milestone in the Space Race between the United States and the Soviet Union. His famous words upon seeing Earth from space were, "Poyekhali!" which translates to "Let's go!" in English.


# Templates and prompt engineering 

In [17]:
prompt = "what is quantum computing?"

In [18]:
prompt

'what is quantum computing?'

In [19]:
template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

In [21]:
template

'<|system|>\nYou are a helpful assistant.<|end|>\n<|user|>\n"what is quantum computing?"<|end|>\n<|assistant|>'
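
Since the same markers are retyped for every prompt, one option is to wrap them in a small function. The helper below, `build_phi3_prompt`, is a hypothetical name introduced here for illustration; it only reproduces the string shown above, including the quotes around the user prompt.

```python
def build_phi3_prompt(user_prompt, system_prompt="You are a helpful assistant."):
    """Wrap a user prompt in the chat markers used by the template above."""
    return (
        "<|system|>\n"
        f"{system_prompt}<|end|>\n"
        "<|user|>\n"
        f'"{user_prompt}"<|end|>\n'
        "<|assistant|>"
    )

template = build_phi3_prompt("what is quantum computing?")
print(template)
```

The tokenizer also stores the model's own chat template, which `tokenizer.apply_chat_template` can apply to a list of messages; its output may differ in small details from this hand-written version.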

In [22]:
output = pipe(template, **generation_args)
print(output[0]['generated_text'])

 Quantum computing is a rapidly advancing field of technology that leverages the principles of quantum mechanics to process information. Unlike classical computers, which use bits as the smallest unit of data (represented by 0s and 1s), quantum computers use quantum bits, or qubits, which can exist in multiple states simultaneously due to the phenomenon known as superposition. This allows quantum computers to perform many calculations at once, potentially solving complex problems much faster than classical computers.

Another key feature of quantum computing is entanglement, a phenomenon where the state of one qubit is directly related to the state of another, regardless of the distance between them. This allows for faster and more efficient communication between qubits, enabling quantum computers to perform complex calculations more quickly.

Quantum computing has the potential to revolutionize various fields, including cryptography, drug discovery, optimization problems, and artifici

In [23]:
prompt = "What is AI" # @param {type: "string"}

template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

 AI, or artificial intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks may include problem-solving, learning, understanding natural language, recognizing patterns, and making decisions. AI systems can be categorized into two main types: narrow AI, which is designed to perform specific tasks, and general AI, which aims to replicate human intelligence across a wide range of domains. AI technologies are used in various applications, such as virtual assistants, self-driving cars, recommendation systems, and medical diagnosis tools.


In [24]:
prompt = "What is AI? Answer in 1 sentence in Portuguese" # @param {type:"string"}
#prompt = "What is AI? Answer in 1 sentence" # @param {type:"string"}
#prompt = "What is AI? Answer in the form of a poem" # @param {type: "string"}

sys_prompt = "You are a helpful virtual assistant."

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

print(template)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

<|system|>
You are a helpful virtual assistant.<|end|>
<|user|>
"What is AI? Answer in 1 sentence in Portuguese"<|end|>
<|assistant|>
 AI é a inteligência artificial, que consiste na criação de sistemas capazes de realizar tarefas que normalmente exigem inteligência humana, como aprendizado, raciocínio e resolução de problemas.


In [25]:
prompt = "Generate a python code that writes the Fibonacci sequence"

sys_prompt = "You are an experienced programmer. Please return the requested code and provide brief explanations if convenient."

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

 Here's a Python code that generates the Fibonacci sequence up to a given number of terms:

```python
def fibonacci_sequence(n):
    """
    Generates the Fibonacci sequence up to n terms.
    """
    # Initialize the first two terms
    a, b = 0, 1
    # Create an empty list to store the sequence
    sequence = []
    # Loop through n terms
    for i in range(n):
        # Append the current term to the sequence
        sequence.append(a)
        # Update the terms
        a, b = b, a + b
    # Return the sequence
    return sequence

# Example usage
n = 10
print(f"The first {n} terms of the Fibonacci sequence are: {fibonacci_sequence(n)}")
```

Output:
```
The first 10 terms of the Fibonacci sequence are: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

This code defines a function `fibonacci_sequence` that takes an integer `n` as input and returns a list of the first `n` terms of the Fibonacci sequence. The function initializes the first two terms of the sequence (`a` and `b`) to 0 and 1, res

In [26]:
def fibonacci(n):
    """
    Generate the Fibonacci sequence up to n terms.

    Parameters:
    n (int): The number of terms to generate.

    Returns:
    list: A list containing the Fibonacci sequence up to n terms.
    """
    fib_sequence = [0, 1]
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return fib_sequence

    for i in range(2, n):
        next_term = fib_sequence[i-1] + fib_sequence[i-2]
        fib_sequence.append(next_term)

    return fib_sequence


# Example usage:
n = 10
print(f"Fibonacci sequence up to {n} terms: {fibonacci(n)}")

Fibonacci sequence up to 10 terms: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]


# Message Format

In [27]:
prompt = "What is AI?"

msg = [
    {"role": "system", "content": "You are a helpful virtual assistant."},
    {"role": "user", "content": prompt}
]

output = pipe(msg, **generation_args)
print(output[0]["generated_text"])

 AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind, such as learning and problem-solving. AI systems are powered by algorithms, data, and computational power, enabling them to perform tasks that typically require human intelligence. These tasks can range from simple activities like recognizing speech or images to more complex ones like driving a car or diagnosing diseases. AI has applications in various fields, including healthcare, finance, transportation, and entertainment.


In [28]:
prompt = "List 10 famous cities in Europe"
prompt_sys = "You are a helpful travel assistant."

msg = [
    {"role": "system", "content": prompt_sys},
    {"role": "user", "content": prompt},
]

output = pipe(msg, **generation_args)
print(output[0]['generated_text'])

 1. Paris, France

2. Rome, Italy

3. London, United Kingdom

4. Berlin, Germany

5. Madrid, Spain

6. Vienna, Austria

7. Amsterdam, Netherlands

8. Barcelona, Spain

9. Prague, Czech Republic

10. Athens, Greece


# Optimizing with quantization

In [15]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16
)
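
To get a feel for why 4-bit loading matters, here is a back-of-the-envelope estimate of weight storage. It assumes a rounded count of 8 billion parameters and ignores quantization constants, activations, and the KV cache, so the figures are lower bounds rather than measurements. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: rounded parameter count for an "8B" model

print(f"16-bit weights: {weight_memory_gb(N_PARAMS, 16):.1f} GB")
print(f" 4-bit weights: {weight_memory_gb(N_PARAMS, 4):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB to about 4 GB.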

In [16]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config = quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


In [17]:
prompt = "Who was the first person in space?"
messages = [{"role": "user", "content": prompt}]

In [18]:
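# Format the conversation with the model's chat template and tokenize it
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt") # PyTorch
model_inputs = encodeds.to(device)
# Sample up to 1000 new tokens; the EOS token doubles as the padding token
generated_ids = model.generate(model_inputs, max_new_tokens = 1000, do_sample = True,
                               pad_token_id = tokenizer.eos_token_id)
# Decode prompt and completion back to text, special tokens included
decoded = tokenizer.batch_decode(generated_ids)
res = decoded[0]
res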

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWho was the first person in space?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe first person in space was Yuri Gagarin, a Soviet cosmonaut who flew aboard the Vostok 1 spacecraft on April 12, 1961. Gagarin's historic spaceflight lasted for 108 minutes, during which he completed one orbit of the Earth. His achievement marked a major milestone in the exploration of space and paved the way for future human spaceflight missions.\n\nGagarin's flight was a significant achievement for the Soviet Union, which had been racing the United States in the space race. The successful launch and orbit of Vostok 1 demonstrated the capabilities of Soviet space technology and marked a major victory for the Soviet space program.\n\nGagarin's legacy extends beyond his historic spaceflight. He became an international hero and a symbol of Soviet achievement, and his name is still celebrated in Russia and around the world.<|eot_id|>
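
The decoded string contains the prompt and the special tokens as well as the answer. A minimal sketch for pulling out only the assistant's reply by splitting on the markers visible in the output above. `extract_assistant_reply` is a hypothetical helper, and `sample` is a shortened stand-in for `res`.

```python
def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn, without special tokens."""
    header = "<|start_header_id|>assistant<|end_header_id|>"
    # Keep everything after the last assistant header
    reply = decoded.rsplit(header, 1)[-1]
    # Drop the end-of-turn marker and anything after it
    reply = reply.split("<|eot_id|>", 1)[0]
    return reply.strip()

# Shortened stand-in for the decoded output (illustrative only)
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who was the first person in space?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "The first person in space was Yuri Gagarin.<|eot_id|>"
)

print(extract_assistant_reply(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the markers, though the prompt text would still be part of the result.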

# Langchain