<a href="https://colab.research.google.com/github/super-dainiu/YHack-llm-tutorial-2024/blob/main/Llama3_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Llama3**

Llama3 is a family of large language models developed by Meta AI and released in April 2024. It comes in two sizes: 8B and 70B parameters.

Running the original Llama3 model on the free tier of Google Colab is not feasible, because loading the full-precision weights needs more than the 12 GB of RAM that the free tier provides. To get around this, we use a quantized version of Llama3, which stores the model's weights at lower precision. This sacrifices some accuracy, but it needs far less memory and is better suited to our demonstration.
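To make the memory argument concrete, here is a minimal back-of-the-envelope sketch. It only counts the bytes needed to store the weights and assumes a nominal 8 billion parameters; activations, the KV cache, and quantization metadata are ignored, so real usage will be somewhat higher. The function name `weight_memory_gb` is ours, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # nominal size of the 8B model (assumption: exact count differs slightly)

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, only the 8-bit and 4-bit variants fit under the 12 GB limit, which is why the tutorial loads the model in 4-bit.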


## **Quantized Models on Hugging Face:**

In this tutorial, we use this model: [SweatyCrayfish/llama-3-8b-quantized](https://huggingface.co/SweatyCrayfish/llama-3-8b-quantized). Hugging Face hosts many quantized models tailored to different use cases. Since we're exploring open-source LLMs, Llama3 is a good example.

# **Before you begin**

Make sure you are using a GPU runtime. To switch, click Runtime > Change runtime type > T4 GPU.

If you're using the GPU, the following command should execute correctly (i.e. it should show a table and not "command not found").



In [None]:
!nvidia-smi
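
The same check can be done from plain Python, which is handy if you export the notebook to a script where the `!` shell syntax is unavailable. This is a small sketch using only the standard library; the helper name `gpu_summary` is ours.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return the output of `nvidia-smi`, or a short message if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found: this runtime has no NVIDIA driver or GPU attached."
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return f"nvidia-smi failed: {result.stderr.strip()}"
    return result.stdout


print(gpu_summary())
```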

# **Step 1: Install all the required dependencies.**

These libraries have the following purposes:

*   **Transformers** is a library by Hugging Face that provides state-of-the-art machine learning models, particularly those related to natural language processing (NLP) tasks. Transformers is the primary library used to load, train, and use transformer-based models such as Llama, GPT, BERT, and many others.

*   **PyTorch** is an open-source machine learning library that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autodiff system. PyTorch is the backend framework that performs the actual computations for the models provided by the transformers library.

*   **Accelerate** is a library by Hugging Face designed to simplify the process of running and scaling PyTorch models on various hardware configurations, such as multiple GPUs and distributed environments. Accelerate helps to optimize memory usage and manage device mapping automatically, which is especially important when working with large models like the 8 billion parameter Llama model.

*   **Bitsandbytes** is a lightweight library for performing efficient 8-bit and 4-bit matrix multiplication and quantization. It is used to reduce the memory footprint of large models without significantly compromising performance. For loading large models in reduced precision (quantized models in our case), bitsandbytes provides the necessary functionality to quantize the model weights and perform operations efficiently.

In [None]:
!pip install transformers
!pip install torch
!pip install accelerate
!pip install bitsandbytes
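
After installation, it can be useful to confirm which versions ended up in the environment, since version mismatches are a common source of errors with quantized models. The sketch below uses only the standard library; the helper name `installed_versions` is ours, and a missing package is reported as `None` rather than raising.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["transformers", "torch", "accelerate", "bitsandbytes"]
for name, version in installed_versions(required).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```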


# **Step 2: Load the model.**

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use NVIDIA GPUs (graphics processing units) for general-purpose processing, a technique known as GPGPU (General-Purpose computing on Graphics Processing Units). When working with large language models and high computational tasks, using CUDA allows you to leverage the power of NVIDIA GPUs, resulting in faster processing and more efficient model training and inference.

Our code checks if CUDA is available on the machine. If it is, it sets the device to cuda (which means the GPU), otherwise it defaults to CPU.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "SweatyCrayfish/llama-3-8b-quantized"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define quantization config with compute dtype set to torch.float16
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/697 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.8k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/335 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll
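
The printout above is enough to estimate where the model's parameters live. The sketch below multiplies out the layer shapes that are visible in the printout (the token embedding and the seven linear projections in each of the 32 decoder layers). The printout is cut off, so the normalization weights and anything after the decoder stack are deliberately left out; the totals are therefore a lower bound, not the exact model size.

```python
# (in_features, out_features) of each Linear4bit layer, copied from the printout above
ATTENTION = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
MLP = {
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
EMBEDDING = (128256, 4096)  # (vocabulary size, hidden size)


def count_weights(shapes):
    """Total number of weights in a group of bias-free linear layers."""
    return sum(rows * cols for rows, cols in shapes.values())


per_layer = count_weights(ATTENTION) + count_weights(MLP)
decoder_total = NUM_LAYERS * per_layer
embedding_total = EMBEDDING[0] * EMBEDDING[1]
visible_total = embedding_total + decoder_total

print(f"Weights per decoder layer: {per_layer:,}")
print(f"All 32 decoder layers:     {decoder_total:,}")
print(f"Token embedding:           {embedding_total:,}")
print(f"Visible total:             {visible_total / 1e9:.2f}B")
# Only the Linear4bit layers are stored in 4 bits (half a byte per weight);
# this ignores the small overhead of quantization constants.
print(f"Decoder linears at 4 bits: ~{decoder_total * 0.5 / 1e9:.1f} GB")
```

Note that `k_proj` and `v_proj` are four times narrower than `q_proj`, so most of the weights sit in the MLP projections rather than in attention.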

# **Step 3: Set up model for generation.**

We define a function that generates text based on any provided prompt. The generated text is the model's response to our input, processed and returned in a human-readable format.

Once the prompt is decided, it is converted into tokens. Tokens are the individual units of text (words or sub-words) that the model processes.
Tokenization helps the model understand and manage the text input more efficiently.

Practical Tips for Text Generation:

*   Experiment with Parameters: Adjust sampling parameters like temperature, top_p, and top_k to see how they affect the generated text. This helps in tuning the output to match the desired style and coherence.
*   Analyze the Output: Review the generated text to understand the model's behavior and capabilities. Make note of any patterns, strengths, or weaknesses.
*   Iterate and Improve: If the output isn't quite right, consider refining the prompt or tweaking generation parameters. Iteration is key to achieving high-quality results.
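
To build intuition for what temperature, top_k, and top_p do, here is a toy sketch that applies them to a hand-written set of next-token scores. It is an illustration of the idea, not the Transformers implementation: the function names and the four-word vocabulary are made up for this example. In `model.generate`, these settings only take effect when sampling is enabled with `do_sample=True`.

```python
import math
import random


def _normalize(pairs):
    """Rescale (token, weight) pairs so the weights sum to 1."""
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the scores before softmax (<1 sharpens, >1 flattens).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    pairs = [(tok, math.exp(value - peak)) for tok, value in scaled.items()]
    pairs = sorted(_normalize(pairs), key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    if top_k is not None:
        pairs = _normalize(pairs[:top_k])

    # top_p: keep the smallest set of top tokens whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in pairs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        pairs = _normalize(kept)

    return dict(pairs)


def sample(distribution, rng=random):
    """Draw one token according to its probability."""
    tokens, weights = zip(*distribution.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the word following "The capital of France is"
toy_logits = {"Paris": 4.0, "London": 2.0, "Rome": 1.0, "banana": -1.0}

for settings in ({}, {"temperature": 2.0}, {"top_k": 2}, {"top_p": 0.5}):
    dist = sampling_distribution(toy_logits, **settings)
    rounded = {tok: round(p, 3) for tok, p in dist.items()}
    print(settings or "defaults", "->", rounded, "| sampled:", sample(dist))
```

A higher temperature spreads probability onto unlikely tokens, while top_k and top_p remove the unlikely tail before sampling.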

In [None]:
prompt = "What are open-source LLMs?"
inputs = tokenizer(prompt, return_tensors="pt")
flattened_inputs = inputs["input_ids"].flatten()

# token display
print("Token tensor: ", flattened_inputs)
print("TOKEN\t -> TEXT")
print("-"*15)
for token in flattened_inputs:
    print(f"{token}\t -> {tokenizer.decode(token)}")

In [None]:
def generate_response(prompt):
    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)

    # Move inputs to the appropriate device
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    # Generate output
    output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        max_length=1000,  # Maximum total length in tokens (prompt + response); increase for longer responses
        num_return_sequences=1,  # Number of sequences to generate
        repetition_penalty=1.2,  # Penalize repetition in the output
    )
    # Decode and return the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
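
Because a causal language model returns the prompt tokens followed by its continuation, the text returned by `generate_response` starts with the prompt itself. If you only want the newly generated part, a small post-processing step is enough. This is a sketch that works on plain strings; the helper name `strip_prompt` is ours.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


# Example with a hand-written string standing in for real model output
example_output = "Translate 'Hello' into French. The answer is: Bonjour"
print(strip_prompt(example_output, "Translate 'Hello' into French."))
```

You would call it as `strip_prompt(generate_response(prompt), prompt)`.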

# **Step 4: Decide on Prompt and Generate.**

Finally, we give the model the specific question we want answered and have it generate a response. The prompt plays the same role as the text you type into a chat interface such as ChatGPT. Crafting an effective prompt is crucial, as it guides the model toward relevant and accurate responses.


Examples of Prompts:

*   Simple Query: "What is the capital of France?"
*   Creative Writing: "Write a short story about a dragon who learns to play the piano."
*   Technical Explanation: "Explain the concept of quantum entanglement in simple terms."
*   Programming Help: "Write a Python function to sort a list of numbers in ascending order."


In [None]:
# Define the prompt
prompt = "Tell me how large language models work."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Tell me how large language models work. How do they generate text? What are the limitations of these systems?
What is a transformer model and why does it matter for NLP tasks like summarization, question answering or machine translation?

## 1 Answer

### Language Models
Language models (LMs) aim to predict what word comes next in an input sequence given some context.
For example, consider this sentence: "The cat sat on the mat." The probability that the second word is'sat' can be calculated as:
$$P(\text{cat} \rightarrow \text{sat}) = P(\text{sat}\mid\text{cat}).$$
This quantity is called the conditional probability. In general we want to calculate $P(w_{t+1} \mid w_0,\ldots,w_t)$ where $w_i$ represents each token in our vocabulary.

#### Unigram Model
One way to estimate probabilities is by using unigrams, which assume independence between words. This means that all tokens have equal influence over predicting any other token. For instance,
$$P(\text{cat} \rightarrow \text{sat}) = P(\

Here's an example with a prompt related to biomedicine.

Crafting Effective Prompts:

*   Be Specific: Narrow down the question to avoid vague responses. For example, instead of "Tell me about space," ask "Explain the process of star formation."
*   Provide Context: If the task requires specific context, include it in the prompt. For instance, "In the context of environmental science, explain the greenhouse effect."
*   Iterate and Refine: Sometimes, the initial prompt might not yield the desired output. Don’t hesitate to tweak and refine the prompt for better results.
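
The "Provide Context" tip can be turned into a small reusable helper so that you can iterate on prompts quickly. This is only a sketch of one possible template; the function name `build_prompt` and the wording of the template are our own choices, not a required format.

```python
from typing import Optional


def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Prefix a question with an optional context phrase."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if context:
        # Lower-case the first letter so the question reads as part of one sentence.
        return f"In the context of {context.strip()}, {question[0].lower()}{question[1:]}"
    return question


print(build_prompt("Explain the greenhouse effect.", context="environmental science"))
print(build_prompt("Explain the process of star formation."))
```

The resulting string can be passed straight to `generate_response`.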

In [None]:
prompt1 = "Tell me how mRNA vaccines work."
generated_text = generate_response(prompt1)
print(generated_text)

Tell me how mRNA vaccines work. I’m not a scientist, but it seems like the vaccine is just injecting genetic material into your body that makes you produce proteins from viruses in order to train your immune system.
That’s right! The mRNA (messenger ribonucleic acid) vaccine uses an RNA molecule as its active ingredient. This RNA contains instructions for making viral spike protein molecules inside cells of vaccinated people. These spike proteins are found on coronavirus particles and help them attach themselves to human cells so they can infect us with COVID-19 disease if we’re unlucky enough to come down with one ourselves!
The idea behind this type of vaccination technology has been around since 1990 when scientists first figured out what genes do within our bodies – including those responsible for producing antibodies against specific diseases such as measles or chickenpox; however these discoveries weren’t applied until much later due largely because there wasn’t any way at all ba

# *Note that this model is nowhere near as accurate as GPT-4 since it's a quantized version of Llama3, and you can clearly tell that the model is predicting the next word rather than actually answering the question.*
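
To see why pure next-word prediction tends to continue a prompt instead of answering it, consider this deliberately tiny illustration. It counts which word follows which in a three-sentence corpus and then extends a prompt greedily. Llama3 uses a neural network over sub-word tokens rather than a word-count table, so this is an analogy for the behavior, not a description of the model; all names and the corpus are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the text."""
    words = text.split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def continue_text(table, prompt, max_new_words=4):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = table.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in the corpus
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


corpus = "tell me a story . tell me a story . tell me a joke ."
table = train_bigrams(corpus)
print(continue_text(table, "tell me"))
```

The predictor never "answers" the request; it simply produces the most likely continuation, and here that means starting the same request over again.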

In [None]:
# Define the prompt
prompt = "Write a short story about a dragon who learns to play the piano."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Write a short story about a dragon who learns to play the piano. The dragon should be able to do things that no other dragons can, such as fly and breathe fire.
The dragon is very good at playing the piano but he doesn't know how to read music so his teacher gives him some sheet music with notes on it. He practices every day until one night when there's an emergency call from another town where they need help because their pianist has been injured in an accident! They ask if this could possibly work out well enough for them both (the dragon) since neither of these people had any experience before now either...


In [None]:
# Define the prompt
prompt = "Summarize the key points of the Declaration of Independence."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Summarize the key points of the Declaration of Independence. What does it say about human rights? How do these ideas compare to those in other documents you have read?
What is a republic and what are its advantages over monarchy or dictatorship? Why did Americans choose this form of government for their new nation?
How would you describe America’s relationship with Great Britain before 1776, during the Revolution (1775-1783), after the war ended but while the Articles of Confederation were still in effect (1784-1791) and finally under the Constitution (since 1789)? In each case, how well was that relationship working out for both sides?
Why might some people be opposed to democracy as an ideal political system? Do you agree with them? Explain.
The American Revolution: A History by Gary B. Nash et al., published by Pearson Education Inc., Upper Saddle River, NJ, 2007
American Passages: A History of the United States, Volume I, To 1877, Third Edition, edited by Edward L. Ayers, Patricia 

In [None]:
# Define the prompt
prompt = "Translate 'Hello, how are you?' into French."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Translate 'Hello, how are you?' into French. The answer is: Bonjour comment allez-vous?


In [None]:
# Define the prompt
prompt = "What's 211 x 15?"

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

What's 211 x 15?'
'3,165,' I said.
'That's right. Now what is the answer to this one?' he asked me as he wrote it on a piece of paper and handed it over for my inspection:
'2 + 4 =?'
I looked at him with an expression that must have been very similar to his own when he'd first seen the question: 'That can't be right.'
He smiled broadly. 'It isn't.' He took back the sheet of paper from me and scribbled something else down before handing it back again. This time there was no doubt about the correct solution; even I could see that now.
'So why did you write "that can't be right" earlier?' he wanted to know.
'I don't think I would've written anything like that if I hadn't known better than to believe in Santa Claus or the Tooth Fairy!' I told him honestly. 'But then, maybe I'm just not smart enough!'
The man laughed out loud. 'You're smarter than most people who are twice your age! You'll do fine here today – but remember, we all make mistakes sometimes. It doesn't mean you aren't intellig

In [None]:
# Define the prompt
prompt = "Create a recipe for a vegan chocolate cake."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Create a recipe for a vegan chocolate cake. The cake should be moist and flavorful, with a rich chocolate flavor that is not too sweet.

    Use the following ingredients:
        - 2 cups of all-purpose flour
        - 1 cup of cocoa powder
        - 3/4 cup of sugar
        - 1 teaspoon baking soda
        - 1/2 teaspoon salt
        - 1 tablespoon vanilla extract
        - 1 cup vegetable oil
        - 1 cup water

    Your task is to write a Python program that uses these ingredients to create a delicious vegan chocolate cake.
    """

    # Write your code here
    pass





In [None]:
# Define the prompt
prompt = "Draft an email to a colleague reminding them of an upcoming meeting."

# Generate and print the response
generated_text = generate_response(prompt)
print(generated_text)

Draft an email to a colleague reminding them of an upcoming meeting. Use the subject line and body text fields in Gmail.
2. In your draft, use the formatting tools (bold, italics, underline) to emphasize important information for your reader.
3. Add some color by using one or more emoticons from the emoji menu at the bottom right corner of the compose window.
4. Preview how your message will look before sending it off.
5. Send the email when you're ready.

## 1. Open Google Drive

Click this button:

<img src="images/google-drive.png" alt="Google Drive icon">

to open Google Drive in a new tab.


## 2. Create a New Document

In the top left-hand corner of the screen click on "New":

![image](https://user-images.githubusercontent.com/11457696/130937992-0e9a7c8f-cb6d-47ce-bfcf-fdbedeeebcb3.png)

A drop-down list appears with options: File > Spreadsheet > Drawing > Form > Script > More...

Select **Drawing**


## 3. Draw Something!

Use the drawing tool to draw something! You can choose d