<a href="https://colab.research.google.com/github/twhool02/atubigdataanalyticsproject1/blob/main/Inference_of_falcon_7b_finetuned_guanaco_local.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Falcon-7B (Guanaco fine-tune, NF4 QLoRA) inference

## Setup

### Map Google Drive

In [1]:
import shutil, os, subprocess

# mount google drive
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Colab Notebooks/dissertation')

Mounted at /content/drive


### Log into HuggingFace Hub

In [2]:
# Required for accessing gated models or datasets on Hugging Face and for pushing models to the Hugging Face Hub
!pip install --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.20.3


In [3]:
from google.colab import userdata

# Use the HF_TOKEN secret, which has write permission on Hugging Face
hftoken = userdata.get('HF_TOKEN')

In [4]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Install Transformers

In [5]:
# install the development version of transformers
# !pip install -q -U git+https://github.com/huggingface/transformers.git -q



# The 'accelerate' library is part of the Hugging Face ecosystem.
# It enables the same PyTorch code to be run across any distributed configuration by adding just a few lines of code.
# In short, it makes training and inference at scale simple, efficient, and adaptable.
# It abstracts the boilerplate code related to multi-GPU/TPU/fp16 setups and leaves the rest of your code unchanged.
# This library is useful when you want to run your training scripts in a distributed environment without giving up full control over your training loop.
# It is not a high-level framework above PyTorch, just a thin wrapper, so you don't have to learn a new library.
# !pip install -q -U git+https://github.com/huggingface/accelerate


# Install the latest stable builds, upgrading if a newer version than the installed one is available
!pip install -q -U transformers
!pip install -q -U accelerate
# !pip install -q -U einops
!pip install sentencepiece -q

# The 'bitsandbytes' library is a lightweight wrapper around CUDA custom functions,
# particularly 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
# It is used for tasks like 8-bit inference with Hugging Face Transformers, using 8-bit optimizers,
# and replacing certain layers with 8-bit versions for improved performance.
!pip install bitsandbytes -q

# PEFT (Parameter-Efficient Fine-Tuning) is a Hugging Face library that makes fine-tuning large language models (LLMs) more efficient and memory-friendly.
# Key features of PEFT:
# Parameter-efficient fine-tuning: only a small number of (often newly added) parameters are trained while the base model stays frozen, which significantly reduces memory usage and training time.
# Adapter methods: it implements techniques such as LoRA, prefix tuning and prompt tuning.
# Small checkpoints: only the adapter weights need to be saved, not a full copy of the model.
# Compatibility with Transformers: It integrates with the Transformers library, making it easy to use with various pre-trained language models.
!pip install peft -q

# TRL (Transformer Reinforcement Learning) provides trainers for supervised fine-tuning (SFTTrainer) and for reinforcement learning methods such as Proximal Policy Optimization (PPO).
!pip install trl -q

# The 'xformers' library provides customizable and optimized building blocks for Transformers.
# It is domain-agnostic and used by researchers in various fields like vision, NLP, etc.
# The library contains bleeding-edge components that are not yet available in mainstream libraries like PyTorch.
# It is built with efficiency in mind, containing its own CUDA kernels, but dispatches to other libraries when relevant.
# !pip install xformers

#print the version of transformers
import transformers
print(transformers.__version__)


In [6]:
# os is a standard Python library that provides functions for interacting with the operating system.
import os

# torch is the main package of PyTorch, an open-source machine learning library for Python.
import torch

# load_dataset is a function from the datasets library by Hugging Face. It allows you to load and preprocess datasets for machine learning models.
from datasets import load_dataset

# The transformers library is a popular library for Natural Language Processing (NLP). It provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, summarization, translation, and more.
from transformers import (
    # AutoModelForCausalLM is a class in the transformers library. It represents a model for causal language modeling.
    AutoModelForCausalLM,

    # AutoTokenizer is a class in the transformers library. It is used for converting input data into a format that can be used by the model.
    AutoTokenizer,

    # BitsAndBytesConfig is a configuration class in the transformers library. It holds the bitsandbytes quantization settings (8-bit or 4-bit) applied when a model is loaded.
    BitsAndBytesConfig,

    # HfArgumentParser is a class in the transformers library. It is used for parsing command-line arguments.
    HfArgumentParser,

    # TrainingArguments is a class in the transformers library. It defines the arguments used during training.
    TrainingArguments,

    # pipeline is a high-level function in the transformers library. It creates a pipeline that applies a model to some input data.
    pipeline,

    # logging is a module in the transformers library. It is used for logging events during training and evaluation.
    logging,
)

# used for Parameter-Efficient Fine-Tuning
from peft import LoraConfig, PeftModel

# TRL is short for Transformer Reinforcement Learning. SFTTrainer is its trainer class for supervised fine-tuning.
from trl import SFTTrainer


### Create cache directory for Hugging Face Models

In [7]:
# # Set the cache directory to a specific path in your Google Drive.
# # This is where Hugging Face models will be cached.
# cache_dir = "/content/drive/MyDrive/Colab Notebooks/dissertation/Llama/Llama2-7b-HF"

# # The os.makedirs() method in Python is used to create directories recursively.
# # The exist_ok=True parameter prevents an error if the directory already exists.
# os.makedirs(cache_dir, exist_ok=True)
# os.chdir(cache_dir)

In [8]:
# import os
# import glob

# # get current working directory and list files
# print(f"current directory is: {os.getcwd()}\n")
# # print(os.listdir('.'))

# # Get a list of all files and directories in the current directory
# files = glob.glob('./*')

# # Create a list of tuples, each containing the name of the file/directory and its last modification time
# files_with_times = [(file, os.path.getmtime(file)) for file in files]

# # Sort the list by the modification time (the second element of each tuple)
# files_with_times.sort(key=lambda x: x[1])

# # Print the sorted list
# print("Files in current directory:")
# for file, mtime in files_with_times:
#     print(f'{file}: {mtime}')

## Set Configuration

### Define path to model and dataset

In [9]:
# Path to the locally saved fine-tuned model on Google Drive.
# model_dir = '/content/drive/MyDrive/Colab Notebooks/dissertation/Llama/Llama2-7b-chat-HF'
model_name = "/content/drive/MyDrive/Colab Notebooks/dissertation/My Models/Falcon-7B-finetuned-guanaco-NF4-QLORA"

# Load the fine-tuned Falcon model for causal language modeling from the directory specified above.
# The from_pretrained() function loads the model configuration and weights; it accepts either a local directory or a model ID on the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
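
The model directory name mentions NF4, but the `from_pretrained()` call above passes no quantization settings, so whether the weights end up quantized depends on what was saved with the checkpoint. If explicit 4-bit loading is wanted, a sketch along the following lines could be used. It assumes the checkpoint holds unquantized weights and that a CUDA GPU with `bitsandbytes` is available; the helper name `load_nf4_model`, the double-quantization setting and the bfloat16 compute type are illustrative choices, not taken from the notebook.

```python
# Keyword arguments for bitsandbytes 4-bit NF4 loading (illustrative values).
NF4_KWARGS = {
    "load_in_4bit": True,               # store linear-layer weights in 4 bits
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 data type, as used by QLoRA
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}


def load_nf4_model(model_path):
    """Hypothetical helper: load a causal LM with 4-bit NF4 weights."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplications run in bfloat16
        **NF4_KWARGS,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `load_nf4_model(model_name)` would then take the place of the plain `from_pretrained(model_name)` call, under the assumptions stated above.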

### Load Tokenizer

In [10]:
# Load tokenizer to convert input text into tokens which the model will understand
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

## Run the model

### Create a text generation pipeline

In [11]:
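# Create a text generation pipeline.
# The pipeline takes a prompt string and returns the prompt followed by the generated continuation.
# Note: assigning to `pipeline` shadows the `pipeline` function imported from transformers above.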
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

#### Generate Answers

In [14]:
# Generate text sequences using the pipeline
sequences = pipeline(

    # Input text to start the generation
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    # Set do_sample to True to use sampling for generating the next tokens
    do_sample=True,
    # Set top_k to 10 to consider the top 10 most probable tokens at each step
    top_k=10,
    # Set num_return_sequences to 1 to generate only one sequence
    num_return_sequences=1,
    # Stop generating when the end-of-sequence (EOS) token is produced
    eos_token_id=tokenizer.eos_token_id,
    # Limit the total sequence length (prompt plus generated tokens) to 400 tokens
    max_length=400,
    # Selects the smallest possible set of tokens whose cumulative probability exceeds
    # a certain threshold (e.g., 0.9). This can provide a balance between diversity and quality.
    top_p=0.9,
    # controls the randomness of the predictions.
    # A higher value (e.g., 1.0) makes the output more random
    # A lower value (e.g., 0.2) makes it more deterministic.
    temperature=0.8,
    # Discourages the model from repeating the same token or sequence of tokens too frequently
    repetition_penalty=1.2,
    # Exponential length penalty applied to beam scores. It only affects beam-based generation,
    # so it has no effect here, where sampling runs with the default num_beams=1.
    # With beams, values > 0.0 favour longer sequences and values < 0.0 favour shorter ones.
    length_penalty=0.8,
    # sets the size of the n-grams that should not be repeated in the generated text.
    # For example, a value of 2 would prevent the model from repeating any 2-gram
    no_repeat_ngram_size=5,
)

# Print the generated text sequences
for seq in sequences:
    print(f"{seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


I have tomatoes, basil and cheese at home. What can I cook for dinner?
Tomato Basil Pasta
Ingredients
- 2 cups whole wheat pasta
- 1 tbsp extra virgin olive oil
- 2 cloves garlic, minced
- 2-3 cups fresh baby spinach
- 1 can diced tomatoes, undrained
- 1/4 cup fresh basil leaves, chopped
- Salt and pepper to taste
- Shredded mozzarella cheese, to serve
Instructions
- Boil the pasta according to package instructions. Drain and set aside.
- Meanwhile, in a large skillet, heat olive oil over medium heat. Add garlic and cook until fragrant. Add baby spinach and cook until wilted.
- Add tomatoes and cook until heated through. Add cooked pasta and toss to coat. Season with salt and pepper, to taste. Top with fresh basil and shredded mozzarella cheese. Serve hot.
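
The sampling parameters used in the cell above (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The following pure-Python sketch illustrates one common ordering (temperature, then top-k, then top-p). It is an illustration of the idea, not the Transformers implementation, and the function names are hypothetical.

```python
import math
import random


def filter_next_token_probs(logits, temperature=0.8, top_k=10, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; values < 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token IDs.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample_next_token(logits, rng=random, **kwargs):
    """Draw one token ID from the filtered distribution."""
    dist = filter_next_token_probs(logits, **kwargs)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]
```

Lowering `top_k` or `top_p` shrinks the set of candidate tokens, while lowering `temperature` concentrates probability on the top candidates within that set.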


In [15]:
# Generate text sequences using the pipeline
sequences = pipeline(

    # Input text to start the generation
    'Tell me a story about the UAE',
    # Set do_sample to True to use sampling for generating the next tokens
    do_sample=True,
    # Set top_k to 10 to consider the top 10 most probable tokens at each step
    top_k=10,
    # Set num_return_sequences to 1 to generate only one sequence
    num_return_sequences=1,
    # Stop generating when the end-of-sequence (EOS) token is produced
    eos_token_id=tokenizer.eos_token_id,
    # Limit the total sequence length (prompt plus generated tokens) to 400 tokens
    max_length=400,
)

# Print the generated text sequences
for seq in sequences:
    print(f"{seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Tell me a story about the UAE
The UAE is a country with a young, vibrant population who are constantly looking to the future and are keen to learn new skills and knowledge. We wanted to capture that sense of energy and drive within the UAE, and how it is a country that is open to new things, whether that’s a new business idea or technology.
We wanted to create an engaging short video that captures this youthful spirit and energy. We chose a young, diverse cast and crew to tell the story and we shot at different locations throughout the UAE to give it a sense of place that felt authentic and true to the country. The story is told through the eyes of the cast, who are exploring and discovering what makes the UAE special while learning new things along the way.
The video was shot over the course of 10 days during the Covid-19 lockdown, which was a challenging process. We worked with our cast and crew virtually to develop the story and script, and we filmed with a small crew in line with g

In [16]:
# Generate text sequences using the pipeline
sequences = pipeline(

    # Input text to start the generation
    'Tell me a story about the UAE\n',
    # Set do_sample to True to use sampling for generating the next tokens
    do_sample=True,
    # Set top_k to 10 to consider the top 10 most probable tokens at each step
    top_k=10,
    # Set num_return_sequences to 1 to generate only one sequence
    num_return_sequences=1,
    # Stop generating when the end-of-sequence (EOS) token is produced
    eos_token_id=tokenizer.eos_token_id,
    # Limit the total sequence length (prompt plus generated tokens) to 400 tokens
    max_length=400,
    # Selects the smallest possible set of tokens whose cumulative probability exceeds
    # a certain threshold (e.g., 0.9). This can provide a balance between diversity and quality.
    top_p=0.9,
    # controls the randomness of the predictions.
    # A higher value (e.g., 1.0) makes the output more random
    # A lower value (e.g., 0.2) makes it more deterministic.
    temperature=0.8,
    # Discourages the model from repeating the same token or sequence of tokens too frequently
    repetition_penalty=1.2,
    # Exponential length penalty applied to beam scores. It only affects beam-based generation,
    # so it has no effect here, where sampling runs with the default num_beams=1.
    # With beams, values > 0.0 favour longer sequences and values < 0.0 favour shorter ones.
    length_penalty=0.8,
    # sets the size of the n-grams that should not be repeated in the generated text.
    # For example, a value of 2 would prevent the model from repeating any 2-gram
    no_repeat_ngram_size=5,
)

# Print the generated text sequences
for seq in sequences:
    print(f"{seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Tell me a story about the UAE
The UAE is a country that has a rich history and culture. There are many interesting stories about the UAE that can be told.
One of the most well-known stories is that of Sheikh Zayed bin Sultan Al Nahyan, the founder of the United Arab Emirates. He was born in 1882 in Abu Dhabi and went on to become one of the most influential leaders in the region.
Another interesting story is that of Sheikh Mohammed bin Rashid Al Maktoum, who became the Prime Minister of the UAE in 2006. He was born into a wealthy family and rose through the ranks to become one of Dubai’s most powerful figures.
Finally, there is the story of the UAE’s unique and innovative approach to development and growth. The country has been able to build itself from scratch in just a few decades, using its natural resources and human capital to create a prosperous economy and vibrant society.
These are just a few examples of the many fascinating stories about the UAE. The country has a lot to offer
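
The `repetition_penalty` and `no_repeat_ngram_size` arguments used above both act on the next-token scores based on what has already been generated. A pure-Python sketch of the two rules follows. The function names are hypothetical and the code only illustrates the logic; it is not the library's implementation.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely (for penalty > 1)."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def banned_next_tokens(generated_ids, ngram_size=5):
    """Token IDs that would complete an n-gram already present in the sequence."""
    if ngram_size < 1 or len(generated_ids) < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned
```

With `ngram_size=5`, as in the cell above, a token is banned only when the previous four tokens have already been followed by that token earlier in the sequence.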

In [12]:
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [17]:
# Generate text sequences using the pipeline
sequences = pipeline(

    # Input text to start the generation
    'Write a poem about Ireland\n',
    # Set do_sample to True to use sampling for generating the next tokens
    do_sample=True,
    # Set top_k to 10 to consider the top 10 most probable tokens at each step
    top_k=10,
    # Set num_return_sequences to 1 to generate only one sequence
    num_return_sequences=1,
    # Stop generating when the end-of-sequence (EOS) token is produced
    eos_token_id=tokenizer.eos_token_id,
    # Limit the total sequence length (prompt plus generated tokens) to 400 tokens
    max_length=400,
    # Selects the smallest possible set of tokens whose cumulative probability exceeds
    # a certain threshold (e.g., 0.9). This can provide a balance between diversity and quality.
    top_p=0.9,
    # controls the randomness of the predictions.
    # A higher value (e.g., 1.0) makes the output more random
    # A lower value (e.g., 0.2) makes it more deterministic.
    temperature=0.8,
    # Discourages the model from repeating the same token or sequence of tokens too frequently
    repetition_penalty=1.2,
    # Exponential length penalty applied to beam scores. It only affects beam-based generation,
    # so it has no effect here, where sampling runs with the default num_beams=1.
    # With beams, values > 0.0 favour longer sequences and values < 0.0 favour shorter ones.
    length_penalty=0.8,
    # sets the size of the n-grams that should not be repeated in the generated text.
    # For example, a value of 2 would prevent the model from repeating any 2-gram
    no_repeat_ngram_size=5,
)

# Print the generated text sequences
for seq in sequences:
    print(f"{seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Write a poem about Ireland
The land of leprechauns, shamrocks and Guinness
The land of the Irish, their language and their history
Where every day is St. Patrick’s Day and the sun is always shining
The land of my ancestors, my roots, my home
My mother’s homeland, her family and her roots
A land of green hills, lakes, mountains and rivers
Of rolling waves, crashing surf and the smell of fresh cut grass
Of sheep grazing on the hills, horses running free in the fields
A land of history, legends and folklore
Of castles and cathedrals, churches and graveyards
A land of poets, writers and storytellers
Of music and song, of laughter and tears
A land of beauty and grace, of strength and courage
Of love and friendship, of family and community
A land where the people are warm and welcoming
Where everyone knows your name and you feel at home
A land that has shaped my life, my character and my heart
And has given me so much, more than I could ever repay
I am so grateful to be Irish, to have been b
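
The `length_penalty` argument passed above only matters for beam search, where finished hypotheses are ranked by their summed log-probability divided by the length raised to the penalty. The sketch below illustrates that ranking with made-up numbers; the function names and example hypotheses are hypothetical.

```python
def beam_score(sum_log_probs, length, length_penalty=1.0):
    """Length-normalized score used to rank finished beam hypotheses."""
    return sum_log_probs / (length ** length_penalty)


def preferred(hypotheses, length_penalty):
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty=length_penalty),
    )


# Each entry is (summed log-probability, length in tokens); values are illustrative.
hypotheses = {"short": (-4.0, 4), "long": (-6.0, 8)}

print(preferred(hypotheses, length_penalty=0.0))  # no normalization
print(preferred(hypotheses, length_penalty=1.0))  # average log-probability per token
```

Because log-probabilities are negative, a larger exponent divides long sequences by a larger number and moves their score closer to zero, which is why higher values favour longer outputs.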

In [13]:
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.
Daniel: Hello, Girafatron!
Girafatron: I have always been obsessed with giraffes. I love giraffes. I think giraffes are the most glorious animal on the face of this Earth.
Daniel: I’m not going to argue with you on that. What is your favorite thing about giraffes?
Girafatron: I think their legs are really, really cool, and they are really, really graceful. I think they look very elegant and beautiful.
Daniel: I can see why you like giraffes. I’m kind of jealous. What is your least favorite thing about giraffes?
Girafatron: I really hate it when they eat leaves and


In [17]:
# Generate text sequences using the pipeline
sequences = pipeline(

    # Input text to start the generation
    'Write a poem about Ireland',
    # Set do_sample to True to use sampling for generating the next tokens
    do_sample=True,
    # Set top_k to 10 to consider the top 10 most probable tokens at each step
    top_k=10,
    # Set num_return_sequences to 1 to generate only one sequence
    num_return_sequences=1,
    # Stop generating when the end-of-sequence (EOS) token is produced
    eos_token_id=tokenizer.eos_token_id,
    # Limit the total sequence length (prompt plus generated tokens) to 400 tokens
    max_length=400,
)

# Print the generated text sequences
for seq in sequences:
    print(f"{seq['generated_text']}")

Write a poem about Ireland, but use only words with an "I" sound in them. It should be about your love for Ireland.
Ireland, you are my home
With your hills, your valleys and your moors
You are like a woman, strong and free
With your eyes that shine like silver moons
With your arms that embrace me tight
In your warmth I can live and thrive
I will always love you, Ireland
I will always love you, Ireland.
by Aileen Kelly, 2010.
The "I" sound is the sound of the letter I and it is pronounced by saying "ee."
This is a good poem!
I love your poem Aileen, especially the last line!
Thanks!
I love your poem Aileen. It is very sweet and it is very good.


In [10]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

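# Run the text generation pipeline using the Llama-2 chat prompt format ([INST] ... [/INST]).
# As the output shows, this model only echoes that prompt back instead of answering.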
prompt = "I have tomatoes, basil and cheese at home. What can I cook for dinner?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at home. What can I cook for dinner? [/INST]</s>
<s>[INST] I have tomatoes, basil and cheese at
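
The `[INST]` wrapper used above is the Llama-2 chat format, and the output suggests the model does not recognise it. The `timdettmers/openassistant-guanaco` dataset used later in this notebook stores conversations as `### Human: ...### Assistant: ...` text, so a prompt in that shape may work better. This assumes the fine-tune was trained on that text unchanged, which the notebook does not confirm. The helper name `format_guanaco_prompt` is hypothetical.

```python
def format_guanaco_prompt(user_message, history=()):
    """Build a '### Human / ### Assistant' prompt; history holds (human, assistant) pairs."""
    parts = []
    for human, assistant in history:
        # Earlier turns are included in full so the model sees the conversation so far.
        parts.append(f"### Human: {human}### Assistant: {assistant}")
    # The final turn ends with an open assistant tag for the model to complete.
    parts.append(f"### Human: {user_message}### Assistant:")
    return "".join(parts)


prompt = format_guanaco_prompt(
    "I have tomatoes, basil and cheese at home. What can I cook for dinner?"
)
print(prompt)
```

The resulting string could be passed to the text-generation pipeline in place of the `<s>[INST] ... [/INST]` prompt.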


In [18]:
# Now you can use the tokenizer and model for question answering
# For example:
question = "What is the capital of France?"
context = "Paris is the capital of France."

# Prepare the prompt
prompt = f"{context} {question}"

# Encode the prompt
inputs = tokenizer(prompt, return_tensors='pt')

# Run a single forward pass (no text is generated)
outputs = model(**inputs)

# argmax over the vocabulary gives the most likely next token at each input position
predicted_token_ids = outputs.logits.argmax(dim=-1)

# Decode the answer
answer = tokenizer.decode(predicted_token_ids[0])

print(answer)

, a capital of France and It is the capital of France?



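This does not work. A single forward pass followed by `argmax` over the logits returns the model's next-token prediction at every input position, not a generated answer. The model is a causal language model rather than an extractive question-answering model, so an answer has to be produced with `model.generate()` or the text-generation pipeline.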
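
One way to get an answer out of a causal language model is to let it generate new tokens after the prompt and decode only those. A minimal sketch, assuming `model` and `tokenizer` are the objects loaded earlier and sit on the same device; the helper name `answer_question`, the prompt layout and the 32-token budget are illustrative assumptions.

```python
def answer_question(model, tokenizer, context, question, max_new_tokens=32):
    """Hypothetical helper: generate a short answer with a causal language model."""
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # generate() appends up to max_new_tokens tokens to the prompt (greedy decoding here).
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

For example, `answer_question(model, tokenizer, "Paris is the capital of France.", "What is the capital of France?")` would return whatever continuation the model produces after `Answer:`.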

## Evaluate the model

#### Calculate Perplexity

In [18]:
# dataset_name = "twhoool02/guanaco-llama2"
dataset_name = "timdettmers/openassistant-guanaco"

In [19]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="test")

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [20]:
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 518
})


In [21]:
input_text = "I have tomatoes, basil and cheese at home. What can I cook for dinner?"
input_ids = tokenizer(input_text, return_tensors='pt')

In [27]:
# Forward pass with the input IDs as labels, so the model also returns the mean cross-entropy loss
outputs = model(**input_ids, labels=input_ids["input_ids"])

In [28]:
# Extract the loss from the outputs
loss = outputs.loss

In [29]:
# Convert the loss to perplexity
perplexity = torch.exp(loss)

In [31]:
print("Falcon 7B Perplexity:", perplexity)

Falcon 7B Perplexity: tensor(13.1950, grad_fn=<ExpBackward0>)
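
Perplexity is the exponential of the mean negative log-likelihood per predicted token, so the value of 13.1950 above corresponds to a loss of about 2.58. Note that it was computed on a single prompt, not on the test split loaded earlier. The sketch below shows the formula in plain Python and one way per-text losses could be combined into a corpus-level figure by weighting each text by its token count; the function names are hypothetical and the token-weighted average is an assumption about how the evaluation might be extended.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


def corpus_perplexity(per_text_losses):
    """Combine (mean_loss, n_predicted_tokens) pairs into one token-weighted perplexity."""
    total_nll = sum(loss * n for loss, n in per_text_losses)
    total_tokens = sum(n for _, n in per_text_losses)
    return math.exp(total_nll / total_tokens)


# A model that assigns probability 0.25 to every observed token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))
```

In practice each `(mean_loss, n_predicted_tokens)` pair would come from one forward pass with `labels` set, as in the cells above, run over each text in the dataset.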
