# python -m pip install --upgrade pip
# pip install torch==2.1.1+cu118 torchvision torchaudio -f https://download.pytorch.org/whl/cu118/torch_stable.html 
# !pip -q install git+https://github.com/huggingface/transformers # need to install from github
#!pip install -q datasets loralib sentencepiece
# pip install bitsandbytes==0.43.0
# triton 2.1.0
#!pip -q install accelerate==0.21.0

In [2]:
#!nvidia-smi

In [3]:
## Mistral 7B Instruct

In [4]:
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

In [5]:
torch.set_default_device('cuda')

In [6]:
# Check if CUDA (GPU support) is available
print("\nCUDA (GPU) Available:", torch.cuda.is_available())


CUDA (GPU) Available: True


In [7]:
print("\nPyTorch Version:", torch.__version__)


PyTorch Version: 2.1.1+cu118


In [8]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

In [9]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [10]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [11]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)

2024-03-29 23:12:04.461820: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-29 23:12:05.258171: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-03-29 23:12:05.258271: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPT

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    device_map="auto",
    load_in_8bit=True,# For 8 bit quantization,
    max_memory={0:"20GB"}
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# INFERENCE

In [13]:
#text = "[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen! [INST] Do you have mayonnaise recipes? [/INST]"
text = "[INST] Do you have mayonnaise recipes? [/INST]"

In [15]:
# Tokenize the input text
encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)

# Generate text using the model
generated_ids = model.generate(
    input_ids=encodeds.input_ids,
    attention_mask=encodeds.attention_mask,  # Pass attention mask
    max_length=200,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # Set pad_token_id to eos_token_id
)

# Decode the generated tokens
decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

# Print the decoded text
print(decoded[0])


[INST] Do you have mayonnaise recipes? [/INST] Yes,1

Do you have specific dietary restrictions or preferences (such as vegan, gluten-free, low-carb, etc.) that I should take into consideration while selecting recipes?


In [17]:
text = "What is the capital of India?"

device = 'cuda'
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, temperature=0.1, top_k=1, top_p=1.0, repetition_penalty=1.4, min_new_tokens=16, max_new_tokens=128, do_sample=True)
decoded = tokenizer.decode(generated_ids[0])
print(decoded)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> What is the capital of India?
New Delhi. 🇮🇳
User 1: I'm sorry, but that was a very unhelpful response to my question about what country has New Delhi as its capital city.</s>


In [18]:
text = "What is Generative AI, Can you please help me understand it's usecases?"

device = 'cuda'
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, temperature=0.1, top_k=1, top_p=1.0, repetition_penalty=1.4, min_new_tokens=16, max_new_tokens=128, do_sample=True)
decoded = tokenizer.decode(generated_ids[0])
print(decoded)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> What is Generative AI, Can you please help me understand it's usecases?
User 0: I can give a high level overview of what generative models are and their applications. However, there will be many details left out as this would require an entire book to cover! If you have any specific questions about how they work or why certain things happen the way they do then feel free to ask :)</s>


In [None]:
## PROMPT FORMAT

<s>[INST] <user_prompt> [/INST]

<assistant_response> </s>

[INST] <user_prompt>[/INST]


In [21]:
import textwrap

def wrap_text(text, width=90): #preserve_newlines
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def generate(input_text, system_prompt="",max_length=512):
    prompt = f"""<s>[INST]{input_text}[/INST]"""
    inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    model_inputs = encodeds.to(device)
    outputs = model.generate(**inputs,
                             max_length=max_length,
                             temperature=0.1,
                             do_sample=True)
    text = tokenizer.batch_decode(outputs)[0]
    wrapped_text = wrap_text(text)
    print(wrapped_text)

In [None]:
## CODE GEN

In [22]:
generate('''```python
def print_prime(n):
   """
   Print all primes between 1 and n
   """''', system_prompt="You are a genius python coder")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


```python
def print_prime(n):
   """
   Print all primes between 1 and n
   """
   for i in range(2, n+1):
       if is_prime(i):
           print(i)

def is_prime(n):
   """
   Check if n is prime
   """
   if n <= 1:
       return False
   for i in range(2, int(n**0.5)+1):
       if n % i == 0:
           return False
   return True

print_prime(100)</s>


In [None]:
## Instruction Answering

In [23]:
generate('Write a detailed analogy between mathematics and a lighthouse.',
         max_length=128)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Write a detailed analogy between mathematics and a lighthouse.

Mathematics is like a lighthouse in many ways. Just as a lighthouse guides ships safely to
harbor by emitting a steady beam of light, mathematics provides a clear and precise set of
rules and principles that guide us through the complexities of the world.

The lighthouse is a towering structure that stands tall and unwavering, a beacon of hope
and guidance in the midst of the stormy seas. Similarly, mathematics is a solid foundation
that we can rely on to guide us through the challenges


In [24]:
%%time
generate('What is the difference between a Llama, Vicuna and an Alpaca?',
         max_length=512)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What is the difference between a Llama, Vicuna and an Alpaca?

Llamas, vicuñas, and alpacas are all members of the camelid family, which also includes
camels. They are native to South America and are known for their soft, fine fleece.

Llamas are the largest of the three, with males typically weighing between 200 and 400
pounds. They have long, thick necks and are known for their strong, sturdy bodies. Llamas
are also known for their intelligence and are often used as pack animals in the Andes.

Vicuñas are the smallest of the three, with males typically weighing between 100 and 150
pounds. They have long, thick necks and are known for their fine, soft fleece. Vicuñas are
also known for their intelligence and are often used as pack animals in the Andes.

Alpacas are intermediate in size between llamas and vicuñas, with males typically weighing
between 150 and 200 pounds. They have long, thick necks and are known for their fine, soft
fleece. Alpacas are also known for their intelligence

In [25]:
%%time
generate('Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering.',
         max_length=256)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before
answering.

Geoffrey Hinton is a living person, while George Washington is dead. Therefore, it is not
possible for them to have a conversation.</s>
CPU times: user 10.3 s, sys: 11.2 ms, total: 10.4 s
Wall time: 10.3 s


In [None]:
## CHAT

In [26]:
generate("""Alice: I don't know why, I'm struggling to maintain focus while studying. Any suggestion? \n\n Bob:""",
         max_length=128)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Alice: I don't know why, I'm struggling to maintain focus while studying. Any suggestion?

 Bob: Well, have you tried breaking your study sessions into smaller chunks? It's easier
to concentrate for shorter periods of time.

 Alice: That's a good idea, I'll give it a try.

 Bob: Also, make sure you're taking breaks in between study sessions. It's important to
rest your mind and recharge.

 Alice: I've been doing that, but I still feel like I'm not retaining the information
well.



In [None]:
## REASONING

In [27]:
generate('Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?',
         max_length=256)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the following question by reasoning step by step. The cafeteria had 23 apples. If
they used 20 for lunch, and bought 6 more, how many apple do they have?
Answer: 29 apples.</s>


In [28]:
generate("Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
         max_length=128)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting.
How much did she earn?
A: $6</s>


In [29]:
generate("A deep-sea monster rises from the waters once every hundred years to feast on a ship and sate its hunger. Over three hundred years, it has consumed 847 people. Ships have been built larger over time, so each new ship has twice as many people as the last ship. How many people were on the ship the monster ate in the first hundred years?",
         max_length=256)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


A deep-sea monster rises from the waters once every hundred years to feast on a ship and
sate its hunger. Over three hundred years, it has consumed 847 people. Ships have been
built larger over time, so each new ship has twice as many people as the last ship. How
many people were on the ship the monster ate in the first hundred years?
A: 2</s>
