# Set up

In [1]:
!pip install accelerate==0.29.3 bitsandbytes==0.43.1

Collecting accelerate==0.29.3
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.30.1
    Uninstalling accelerate-0.30.1:
      Successfully uninstalled accelerate-0.30.1
Successfully installed accelerate-0.29.3




In [2]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import accelerate
import bitsandbytes

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
print("free(GB):", torch.cuda.mem_get_info()[0]/1000000000, "total(GB):", torch.cuda.mem_get_info()[1]/1000000000)


free(GB): 10.708058112 total(GB): 11.81089792


# Load Model and Tokenizer
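- Note: load_in_8bit = True speeds up inference significantly on this machine. The most likely reason is memory: with 8-bit weights the 7B model fits entirely on the ~12 GB GPU, whereas the default full-precision load does not fit, so device_map='auto' would offload part of the model off the GPU.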

In [4]:

import os

# Get a token from your Hugging Face page and export it as HF_TOKEN; never hard-code it in a notebook
token = os.environ.get("HF_TOKEN")
llama = "meta-llama/Llama-2-7b-chat-hf"
load_in_8bit = True # 8-bit weights need less GPU memory, so the whole model fits on the GPU

In [5]:
tokenizer = AutoTokenizer.from_pretrained(llama,
                                          use_auth_token=token)

model = AutoModelForCausalLM.from_pretrained(
    llama,
    use_auth_token=token,
    device_map='auto',
    load_in_8bit=load_in_8bit,
)



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
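
The deprecation warning above asks for a `BitsAndBytesConfig` passed through `quantization_config`. The following is a minimal sketch of that variant, not a drop-in guarantee: `load_llama_8bit` is a hypothetical helper name, and it assumes a transformers version that accepts `token=` (the newer name for `use_auth_token`).

```python
def load_llama_8bit(model_id, token=None):
    """Load a causal LM with 8-bit weights via quantization_config (sketch)."""
    # Imported lazily so that merely defining this helper stays cheap.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Same intent as load_in_8bit=True, expressed the non-deprecated way.
    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        token=token,
        device_map="auto",
        quantization_config=quant_config,
    )

# Usage in this notebook (downloads the weights and needs a CUDA GPU):
# model = load_llama_8bit(llama, token=token)
```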

In [6]:
# Check GPU usage
print("free(GB):", torch.cuda.mem_get_info()[0]/1000000000, "total(GB):", torch.cuda.mem_get_info()[1]/1000000000)



free(GB): 3.1981568 total(GB): 11.81089792
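
As a rough sanity check, the two free-memory readings can be compared with the expected size of the weights. The sketch below uses only the numbers printed above. The parameter count of about 6.7 billion for Llama-2-7B and the one-byte-per-parameter figure are assumptions for a back-of-the-envelope estimate.

```python
GB = 1_000_000_000  # decimal gigabyte, matching the division used in the cells above

free_before = 10.708058112 * GB  # free memory reported before loading the model
free_after = 3.1981568 * GB      # free memory reported after loading the model
used_gb = (free_before - free_after) / GB

# Assumption: roughly 6.7e9 parameters, about one byte each when stored in 8 bits.
approx_params = 6.7e9
weights_gb = approx_params * 1 / GB

print(f"memory taken by loading: {used_gb:.2f} GB")
print(f"rough 8-bit weight size: {weights_gb:.2f} GB")
```

The gap between the two figures is plausibly explained by layers that stay in higher precision and by allocator overhead, but the notebook does not measure that breakdown.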


# Tokenizer Overview

The tokenizer breaks text into common subword pieces called tokens and then represents every token with a corresponding integer ID.



## From Words to Tokens

- Notice how the word "Llama" is broken down into 3 tokens: "▁L", "l", "ama". The "▁" prefix stands for the space before the token, so it marks the start of a new word.

- Notice how the number "2" is broken down into 2 tokens: "▁", "2".

In [7]:

print(tokenizer.tokenize('Meta developed and publicly released the Llama 2 family of large language models'))

['▁Meta', '▁developed', '▁and', '▁public', 'ly', '▁released', '▁the', '▁L', 'l', 'ama', '▁', '2', '▁family', '▁of', '▁large', '▁language', '▁models']


In [8]:
print(len(tokenizer.tokenize('Meta developed and publicly released the Llama 2 family of large language models')))

17


## From Words to Numbers

- Note that the tokenizer also returns an attention mask.
    - The attention mask indicates whether the model should attend to the corresponding token (1) or ignore it (0), as it does for padding. Here every entry is 1 because nothing is padded.


In [9]:
sentence = 'Meta developed and publicly released the Llama 2 family of large language models'
inputs = tokenizer(sentence)

print(inputs)

{'input_ids': [1, 20553, 8906, 322, 970, 368, 5492, 278, 365, 29880, 3304, 29871, 29906, 3942, 310, 2919, 4086, 4733], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [10]:
print(len(inputs['input_ids']))

18
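
The tokenizer call returns 18 IDs for the 17 tokens counted earlier. A small sketch using only the values printed above lines the two lists up: the first ID (1) is a special token the tokenizer prepends, and the remaining 17 IDs pair off with the tokens. The pairing assumes that `tokenizer.tokenize` and the tokenizer call split the sentence identically.

```python
tokens = ['▁Meta', '▁developed', '▁and', '▁public', 'ly', '▁released', '▁the',
          '▁L', 'l', 'ama', '▁', '2', '▁family', '▁of', '▁large', '▁language',
          '▁models']
input_ids = [1, 20553, 8906, 322, 970, 368, 5492, 278, 365, 29880, 3304,
             29871, 29906, 3942, 310, 2919, 4086, 4733]

# The tokenizer call prepends one special token, so the ID list is one longer.
special_id, word_ids = input_ids[0], input_ids[1:]
assert len(word_ids) == len(tokens)

# Print each ID next to the token it represents.
for tok, tok_id in zip(tokens, word_ids):
    print(f"{tok_id:>6}  {tok}")
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(inputs['input_ids'])` gives the same alignment directly, including the special token.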


## From Numbers to Tokens

- Tokenization is reversible: we can ask the tokenizer to convert the list of token IDs back into human-readable text.
- Notice the `<s>` token we get after reversing the tokenization. Where does it come from?
    - `<s>` is the beginning-of-sequence (BOS) token, a special symbol that tells Llama-2 where the input starts. The tokenizer adds it automatically, which is why we get 18 IDs for 17 tokens.


In [11]:
tokenizer.decode(inputs['input_ids'])

'<s> Meta developed and publicly released the Llama 2 family of large language models'

In [12]:
tokenizer.bos_token

'<s>'

In [13]:
print(len(inputs['input_ids']))

18
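
To see why the round trip works, the token strings themselves can be joined back into the sentence by hand. This is a simplified illustration based on the token list printed earlier, not what `tokenizer.decode` does internally: it only handles the "▁" space marker and ignores special tokens such as `<s>`.

```python
tokens = ['▁Meta', '▁developed', '▁and', '▁public', 'ly', '▁released', '▁the',
          '▁L', 'l', 'ama', '▁', '2', '▁family', '▁of', '▁large', '▁language',
          '▁models']

SPACE_MARK = "\u2581"  # the '▁' character that stands for a preceding space

# Concatenate the pieces, turn the markers back into spaces, drop the leading one.
text = "".join(tokens).replace(SPACE_MARK, " ").lstrip()
print(text)
```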


## Padding with the End of Sequence Token

In [14]:
tokenizer.eos_token

'</s>'

In [15]:
tokenizer = AutoTokenizer.from_pretrained(llama,
                                          use_auth_token=token)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
sentence = 'Meta developed and publicly released the Llama 2 family of large language models'
inputs = tokenizer(sentence)
tokenizer.decode(inputs['input_ids'])

'<s> Meta developed and publicly released the Llama 2 family of large language models'
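
A single sentence needs no padding, so the settings above have no visible effect yet. The sketch below shows in plain Python what right padding does to a batch of two sequences of different length, and how the attention mask marks the padded positions. `pad_right` is a hypothetical helper, and the pad ID of 2 is an assumption (the ID of `</s>` in the Llama-2 vocabulary). The shorter sequence is the question used in the Generate Output section.

```python
def pad_right(batch, pad_id):
    """Right-pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # pad tokens go at the end
        masks.append([1] * len(ids) + [0] * n_pad)   # 0 = ignore this position
    return padded, masks

PAD_ID = 2  # assumption: ID of '</s>', reused as the pad token

long_ids = [1, 20553, 8906, 322, 970, 368, 5492, 278, 365, 29880, 3304,
            29871, 29906, 3942, 310, 2919, 4086, 4733]
short_ids = [1, 1724, 338, 278, 4086, 1904, 29892, 322, 920, 947, 372, 664, 29973]

padded, masks = pad_right([long_ids, short_ids], PAD_ID)
print(padded[1])
print(masks[1])
```

With the real tokenizer, passing a list of sentences together with `padding=True` should produce the padded IDs and masks in one call.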

# Generate Output

- The output starts with our input: 'What is the language model, and how does it work?'. Observe that the model then answers the question directly. Why so?

    - Remember that we loaded the chat model (Llama-2-7b-chat-hf) and not the base model. The chat model was fine-tuned to respond to user requests. A base model, in contrast, is trained only to predict the next word on a vast set of Internet data, and the Internet is full of web pages listing Q&A or similar structures containing lists of questions. Given the same input, a base model often responds with more questions about language models, generating an "average Internet page" that starts from our question.
    - The answer stops mid-sentence because generation is capped at max_new_tokens=100.

In [16]:
request = 'What is the language model, and how does it work?'
print(tokenizer.tokenize(request))
print(len(tokenizer.tokenize(request)))

['▁What', '▁is', '▁the', '▁language', '▁model', ',', '▁and', '▁how', '▁does', '▁it', '▁work', '?']
12


In [17]:
inputs = tokenizer(request, return_tensors="pt")
print(inputs)
print(len(inputs['input_ids'][0]))

{'input_ids': tensor([[    1,  1724,   338,   278,  4086,  1904, 29892,   322,   920,   947,
           372,   664, 29973]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
13


In [18]:
inputs = inputs.to(model.device)
outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=100)



In [19]:
outputs

tensor([[    1,  1724,   338,   278,  4086,  1904, 29892,   322,   920,   947,
           372,   664, 29973,    13,    13, 29909,  4086,  1904,   338,   263,
          1134,   310, 23116, 21082,   313, 23869, 29897,  1904,   393,   338,
         16370,   373,   263,  2919,  8783,   310,  1426,   304,  8500,   278,
          4188, 22342,   310,   263,  2183,  5665,   310,  3838,   470,  4890,
         29889,   450,  1904,   508,   367,  1304,   363,   263, 12875,   310,
          9595, 29892,  1316,   408,  4086, 13962, 29892,  1426, 19138,  2133,
         29892,   322, 13563, 29890,  1862, 29889,    13,    13,  1576,  6996,
          2969,  5742,   263,  4086,  1904,   338,   304,  7945,   263, 19677,
          3564,   304,  8500,   278,  2446,  1734,   297,   263,  5665,   310,
          1426,  2183,   278,  3517,  3838, 29889,   450,  3564, 24298,  1983,
           304,   437,   445]], device='cuda:0')

In [20]:
response = tokenizer.decode(outputs[0])
print(response)

<s> What is the language model, and how does it work?

A language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to predict the likelihood of a given sequence of words or characters. The model can be used for a variety of tasks, such as language translation, text summarization, and chatbots.

The basic idea behind a language model is to train a neural network to predict the next word in a sequence of text given the previous words. The network learns to do this
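
Because `generate` returns the prompt followed by the continuation, the output tensor above holds 113 IDs: the 13 prompt IDs plus 100 new ones. To keep only the model's answer, the prompt can be sliced off before decoding. The sketch below shows the idea on plain lists, using the prompt IDs and the first five generated IDs from the tensor above; `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the generated part of a sequence that begins with the prompt."""
    return output_ids[prompt_len:]

prompt_ids = [1, 1724, 338, 278, 4086, 1904, 29892, 322, 920, 947, 372, 664, 29973]
# Prompt followed by the first five generated IDs from the tensor above.
output_ids = prompt_ids + [13, 13, 29909, 4086, 1904]

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)  # [13, 13, 29909, 4086, 1904]
```

In the notebook the same slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, and passing `skip_special_tokens=True` to `tokenizer.decode` drops tokens such as `<s>` from the printed text.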


In [21]:
print(torch.__version__)

2.3.0
