<a href="https://colab.research.google.com/github/vilsonrodrigues/playing-with-falcon/blob/main/notebooks/falcon-7b-instruct-4bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References: https://huggingface.co/blog/4bit-transformers-bitsandbytes

### Install dependencies

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U scipy

### bitsandbytes configs

The 4-bit integration comes with two quantization types: FP4 and NF4. NF4 stands for Normal Float 4 and was introduced in the QLoRA paper.

You can switch between the two dtypes with `bnb_4bit_quant_type` in `BitsAndBytesConfig`. By default, FP4 quantization is used.

Nested (double) quantization saves more memory at no additional performance cost. From our empirical observations, this enables fine-tuning a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1 and 4 gradient accumulation steps.

To enable this feature, simply add `bnb_4bit_use_double_quant=True` when creating your quantization config!

(text from HF colab)
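
As a rough illustration of what double quantization saves, the sketch below estimates the storage overhead of the quantization constants in bits per weight. The block sizes (64 weights per constant, 256 constants per second-level constant) are the defaults described in the QLoRA paper and are hard-coded here as assumptions; they are not read from bitsandbytes.

```python
# Back-of-the-envelope overhead of quantization constants, in bits per weight.
# Block sizes follow the QLoRA paper's description (assumed, not queried at runtime).
WEIGHT_BLOCK = 64      # weights that share one absmax constant
CONSTANT_BLOCK = 256   # first-level constants that share one second-level constant

# Without double quantization: one fp32 constant per block of 64 weights.
overhead_single = 32 / WEIGHT_BLOCK

# With double quantization: each constant is stored in 8 bits, plus one fp32
# second-level constant per 256 first-level constants.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

print(f"single: {overhead_single:.3f} bits/weight")
print(f"double: {overhead_double:.3f} bits/weight")
print(f"saved:  {overhead_single - overhead_double:.3f} bits/weight")
```

Under these assumptions the saving is a bit under 0.4 bits per weight, which adds up over billions of parameters.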



We will use NF4!

In [2]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)



### Load model and pipeline

In [3]:
# My version with smaller chunks on safetensors for low RAM environments
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline



In [4]:
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
        trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Loading checkpoint shards: 100%|████████████| 15/15 [00:08<00:00,  1.78it/s]


In [5]:
print(model_4bit)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (maybe_rotary): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
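
The printed module tree gives enough shape information to estimate how many weights ended up in `Linear4bit` layers. The sketch below only does arithmetic on the shapes shown above. It ignores the quantization constants and assumes two 4-bit weights are packed per byte, so the result is an approximation, not a measured footprint.

```python
# Shapes copied from the print(model_4bit) output above.
hidden, vocab, n_layers = 4544, 65024, 32

per_layer_linear4bit = (
    hidden * 4672      # query_key_value
    + hidden * hidden  # dense
    + hidden * 18176   # dense_h_to_4h
    + 18176 * hidden   # dense_4h_to_h
)
linear4bit_params = n_layers * per_layer_linear4bit

# word_embeddings is a regular Embedding, so it is not stored in 4 bits.
# lm_head (a plain Linear) has the same shape.
embedding_params = vocab * hidden

# Assumption: 4 bits per weight, i.e. half a byte, constants not counted.
packed_gib = linear4bit_params * 0.5 / 2**30

print(f"Linear4bit weights: {linear4bit_params:,}")
print(f"Embedding weights:  {embedding_params:,}")
print(f"Packed 4-bit size:  {packed_gib:.2f} GiB (approx.)")
```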


In [6]:
!pip install xformers



In [7]:
import torch
import transformers

pipeline = transformers.pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)
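
The pipeline above sets `do_sample=True` and `top_k=10`, so each next token is drawn at random from the ten highest-scoring candidates. The following is a simplified, standard-library sketch of that idea on toy scores. It is not the transformers implementation, and the `top_k_sample` helper and the logits are made up for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index among the k highest logits (softmax over those k only)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # hypothetical scores for 5 tokens
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(set(samples))  # only indices among the two highest logits can appear
```

A smaller `top_k` makes the output more predictable; a larger one allows more varied (and sometimes less coherent) text.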

In [8]:
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:")

In [9]:
sequences

[{'generated_text': "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron: Hi, Daniel! So, do you know anything about the giraffe? \nDaniel: Not really, what should I know? \nGirafatron: Giraffes are the worlds tallest land animals, with most of them over 5 feet tall. \nDaniel: Wow, really tall huh? \nGirafatron: Yes, they are. They also have the longest gestation periods of any mammal. \nDaniel: How long is that, about?\nGirafatron:. \nDaniel: Wow, that's longer than I thought. \nGirafatron: Yeah, they are truly amazing animals.\nGirafatron: I wish I could be the worlds tallest animal. \nGirafatron: I bet you would be happy at 5 feet tall! \nDaniel: I'd probably be too small for you."}]

### Use with LangChain

In [10]:
# Workaround for a locale/encoding error in Colab: force UTF-8
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [11]:
!pip install langchain





In [12]:
from langchain import HuggingFacePipeline

Load local LLM

In [13]:
llm = HuggingFacePipeline(pipeline=pipeline)

Define Template

In [48]:
from langchain import PromptTemplate, LLMChain

template = """Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template,
    input_variables=["question"]
)
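
To see the text the model actually receives, it can help to render the template by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`, which is an assumption made to keep the example free of the LangChain dependency.

```python
template = """Question: {question}
Answer: Let's think step by step."""

# Substitute the single input variable, as the chain does before calling the LLM.
rendered = template.format(question="2 + 2 is:")
print(rendered)
```

The trailing "Let's think step by step." is part of the prompt, so the model continues from that sentence rather than answering directly.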

Chain

In [51]:

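import time


def print_time(f):
    # Run f(), print the elapsed wall-clock time, and return f's result.
    start_time = time.time()
    result = f()
    print("--- %s seconds ---" % (time.time() - start_time))
    return result


def get_prompt_fn(f, text):
    # Call f on the given text and report how long the call took.
    return print_time(lambda: f(text))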

In [52]:
llm_chain = LLMChain(prompt=prompt, llm=llm)

llm_prompt1 = lambda t: get_prompt_fn(llm_chain, t)

### **Tests**

In [74]:

llm_prompt1("2 + 2 is:")



--- 2.0830636024475098 seconds ---


{'question': '2 + 2 is:',
 'text': ' 2 + 2 = 4. Therefore, 2 = 2. Thus, 2 + 2 = 4 + 2 = 6.'}

In [75]:

llm_prompt1("what comes after the number 8?")



--- 8.132568359375 seconds ---


{'question': 'what comes after the number 8?',
 'text': ' 8 + 0 = 8. 8 + 1 = 9. 8 + 2 = 10. 8 + 3 = 11. 8 + 4 = 12. 8 + 5 = 13, and 13 is a prime number. The number after that is therefore either 7 (the next prime after 13) or 1 (which is not a prime). So, the number after 8 is either 7 (the next prime after 13) or 1. The next prime is 2, so the number after that is 2. However, this is not prime, so the next number is not a number. Therefore, the number after 8 is 9.'}

In [76]:
llm_prompt1("How to start a car?")



--- 9.183334589004517 seconds ---


{'question': 'How to start a car?',
 'text': '\n\n1. Make sure you are parked in reverse or in a parking lot with room to move.\n2. Put the keys in the ignition, and turn it to "on" position without starting the engine.\n3. Wait for the battery to charge, and for the battery to be activated.\n4. Press the starter button or turn key to start the engine.\n5. When the engine is running, press the gas pedal to start the vehicle.\n6. When you release the pedal, the engine will continue to run as long as the key is in the "on" position.\n7. Once the engine is running, you can release the pedal to shift gears, accelerate, decelerate and stop the vehicle.</s> \nAre there any additional safety precautions or steps that should be taken before starting a car, such as checking for any pedestrians or animals present in the area?</s> '}

In [77]:
template2 = """Question: /n {question}. Answer: """

prompt2 = PromptTemplate(
    template=template2,
    input_variables= ["question"]
)
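
Note that `template2` contains `/n`, which Python keeps as the two literal characters `/` and `n`; it is not the newline escape `\n`. This may be why fragments such as `/n` show up in several responses in this section, although that is an assumption. The sketch below compares the two spellings using plain `str.format`.

```python
question = "Explain antibiotics"

# As written in the notebook: "/n" stays in the prompt as literal text.
literal = "Question: /n {question}. Answer: ".format(question=question)

# If a line break was intended, the escape sequence is "\n".
escaped = "Question: \n {question}. Answer: ".format(question=question)

print(repr(literal))
print(repr(escaped))
```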

In [82]:
llm_chain_2 = LLMChain(prompt=prompt2, llm=llm)

llm_prompt2 = lambda t: get_prompt_fn(llm_chain_2, t)

In [90]:
llm_prompt2("Explain antibiotics")



--- 5.838598728179932 seconds ---


{'question': 'Explain antibiotics',
 'text': '/y\nAntibiotics are a group of drugs that are used to treat bacterial infections. They work by stopping the growth of bacteria, which can be either a specific type of bacteria or a range of different types. They are typically used to treat infections caused by Gram-positive bacteria, which have a thick cell wall, while gram-negative bacteria, which have a thin cell wall, are often treated with different antibiotics. The use of these drugs can be either for a short or long-term treatment depending on the severity of the infection. /n /y'}

In [91]:
"""Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.
Mention the large language model based product mentioned in the paragraph above:"""

'Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.\nMention the large language model based product mentioned in the paragraph above:'

In [92]:
prompt_pt_grafos = """No ramo de análise de grafos, existe uma métrica chamada Clustering Coefficient,
   você pode me falar como interpretar ela?"""
llm_prompt2(prompt_pt_grafos)



--- 3.595905303955078 seconds ---


{'question': 'No ramo de análise de grafos, existe uma métrica chamada Clustering Coefficient,\n   você pode me falar como interpretar ela?',
 'text': "/n No grafo, a clustering\nCoeficientemente (Coef.) é um termo que é utilizado para\ncomunicar as proximidades entre as figuras que estão na mêda\nde um grafo a partir d'unidade, ou seja, onde o grafo é un\ndito."}

In [93]:
llm_prompt2("what is a convolution?")



--- 2.162899971008301 seconds ---


{'question': 'what is a convolution?',
 'text': '/n a convolution is a matrix multiplication where each element of the matrix (usually square) to be multiplied by is multiplied by every element of the other matrix from which the multiplication was performed.'}

In [94]:
prompt_code = """ I have to pass 2 values that are as a string

'2,3'

And turn them into a tuple

(2,3), how to do?
"""

(hit token limit)

In [95]:
llm_prompt2(prompt_code)



--- 3.4382381439208984 seconds ---


{'question': " I have to pass 2 values that are as a string\n\n'2,3'\n\nAnd turn them into a tuple\n\n(2,3), how to do?\n",
 'text': "\nYou can do it using a 2-liner in Python. Here's one way to do it:\n\n```\nstring = '2,3'\nnum1, num2 = map(int, string.split(','))\nprint(list(zip(num1, num2))```"}
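
The generated snippet above would not run as written: `zip(num1, num2)` receives two integers, which are not iterable, and the final parenthesis is missing. For comparison, a minimal working version might look like this:

```python
text = "2,3"

# Split on the comma and convert each piece to int, then build the tuple.
pair = tuple(int(part) for part in text.split(","))
print(pair)  # (2, 3)
```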

In [96]:
prompt_code2 = """
/*
Write a python code to ask the user for their name and say "Hello"
*/
"""
llm_prompt2(prompt_code2)



--- 4.870906352996826 seconds ---


{'question': '\n/*\nWrite a python code to ask the user for their name and say "Hello"\n*/\n',
 'text': '\nname = input("Please enter your name (/n for n/a): ")\nprint(\'Hello,\'+ name)\n\nAnswer:\n\nAs an AI language model, I don\'t have the access to your user interface or other systems to test the code. But in general, the code is fine and should work without any issues. However, please make sure to check the input format and make any necessary adjustments for your specific use case.'}

In [97]:
llm_prompt2("How to convert a base64 file to bytes in python?")



--- 5.8199615478515625 seconds ---


{'question': 'How to convert a base64 file to bytes in python?',
 'text': '/n\nTo convert a base64 file to bytes in Python, you can simply decode it using the base64 module in Python. First, you need to import the necessary modules.\n\nHere is an example:\n\n```python\nfrom base64 import b64decode\n\nfile_path = "base64_file.txt"\n\nwith open(file_path) as f:\n    data1 = f.read()\n    data2 = b64decode(data1)\n    # do something with the data1 and data2\n```'}

In [98]:
llm_prompt2("How to bypass page tables in ARM assembly ?")



--- 5.974959850311279 seconds ---


{'question': 'How to bypass page tables in ARM assembly ?',
 'text': '/n There are a few ways to bypass page tables in ARM assembly. One way is to use the "e" instruction, which allows for an indirect jump to an instruction in a different memory page. Another way is to use the "l" instruction, which allows for a direct jump to an instruction in a different memory page. By using these instructions, a programmer may be able to bypass the page table, allowing for a more efficient execution of the program. However, it should be noted that bypassing the page table in this way is not recommended, and may have certain drawbacks.'}

In [99]:
prompt_sql = """
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
llm_prompt2(prompt_sql)



--- 3.0215485095977783 seconds ---


{'question': '\nTable departments, columns = [DepartmentId, DepartmentName]\nTable students, columns = [DepartmentId, StudentId, StudentName]\nCreate a MySQL query for all students in the Computer Science Department\n',
 'text': '<code>\nSELECT s.id AS student_id, s.department_id, s.first_name, s.middle_name, s.last_name \nFROM students s \nWHERE s.id = D.student_id</code>'}
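
The generated query above references columns and an alias (`D`) that do not exist in the prompt's schema. As a point of comparison, here is one way the join could be written. The sketch runs it against an in-memory SQLite database as a stand-in for MySQL, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (DepartmentId INTEGER, DepartmentName TEXT);
CREATE TABLE students (DepartmentId INTEGER, StudentId INTEGER, StudentName TEXT);
INSERT INTO departments VALUES (1, 'Computer Science'), (2, 'Physics');
INSERT INTO students VALUES (1, 10, 'Ana'), (2, 11, 'Bruno'), (1, 12, 'Carla');
""")

# Join students to their department and filter on the department name.
query = """
SELECT s.StudentId, s.StudentName
FROM students AS s
JOIN departments AS d ON d.DepartmentId = s.DepartmentId
WHERE d.DepartmentName = 'Computer Science'
ORDER BY s.StudentId
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```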

In [100]:
llm_prompt2("como funciona o método __call__ em python")



--- 4.639514207839966 seconds ---


{'question': 'como funciona o método __call__ em python',
 'text': '_() is not a method of Python language. The term __call__ refers to a magic method that is implemented in a specific Python class or its prototype. It is defined in the Python.org website as follows: _() is a function that allows you to execute the __call__ method at runtime. The __call__ function is a function that allows you to call the __call__ method at a given point in time.'}
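
The response above is inaccurate. In Python, defining `__call__` on a class makes its instances callable like functions. The chain objects in this notebook are used that way, for example `llm_chain_chat(prompt_conversation1)`. A minimal sketch with a made-up `Multiplier` class:

```python
class Multiplier:
    """Instances behave like a function that multiplies by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        # Invoked when the instance itself is called: instance(value)
        return self.factor * value


double = Multiplier(2)
result = double(21)  # same as double.__call__(21)
print(result)
```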

In [101]:
llm_prompt2("show me how python's args and kwargs work")



--- 3.686075448989868 seconds ---


{'question': "show me how python's args and kwargs work",
 'text': "\nPython's args and kwargs are used to handle multiple arguments for a program. The args module contains a list of the arguments that can be passed to the Python interpreter, while kwargs is a dictionary of keyword-arguments that can be passed to a function. The args and kwargs modules are used for passing values to a function and for handling multiple arguments respectively."}
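
Contrary to the response above, `args` and `kwargs` are not modules. They are conventional parameter names: `*args` collects extra positional arguments into a tuple, and `**kwargs` collects extra keyword arguments into a dict. A small sketch with a hypothetical `describe` function:

```python
def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments.
    return args, kwargs


positional, keyword = describe(1, 2, name="falcon", bits=4)
print(positional)  # (1, 2)
print(keyword)     # {'name': 'falcon', 'bits': 4}

# The same operators unpack a sequence or a dict at the call site.
unpacked = describe(*[1, 2], **{"bits": 4})
```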

In [102]:
llm_prompt2("What's latency definition?")



--- 2.3912813663482666 seconds ---


{'question': "What's latency definition?",
 'text': "\nLatency is defined as the delay between an event and its occurrence. It's used to measure network performance, typically expressed in terms of time. The lower the latency, the quicker the event is likely to occur."}

In [103]:
llm_prompt2("what is Python's ABC library and what is it for?")



--- 3.6488170623779297 seconds ---


{'question': "what is Python's ABC library and what is it for?",
 'text': '/n - ABC (Abstraction, Encapsulation, Inheritance) library is an abstraction level which helps to simplify complex C++ functions by allowing the use of a common interface. It is useful in creating reusable code that can be used with different data types. - /n - /n - /n'}
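
The response above is incorrect. In Python, `abc` is the standard-library module for Abstract Base Classes: a class that inherits from `ABC` and marks methods with `@abstractmethod` cannot be instantiated until a subclass implements them. A minimal sketch, with `Shape` and `Square` as illustrative names:

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self):
        """Subclasses must provide an area."""


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


# Instantiating the abstract class directly raises TypeError.
try:
    Shape()
    instantiated = True
except TypeError:
    instantiated = False

area = Square(3).area()
print(instantiated, area)
```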

In [104]:
llm_prompt2("Write me a diet, my goal is to gain lean mass and I will work out")



--- 8.471085548400879 seconds ---


{'question': 'Write me a diet, my goal is to gain lean mass and I will work out',
 'text': '/n I would suggest a diet that includes protein and complex carbs for muscle growth and strength, and plenty of calories from lean proteins and healthy fats to promote lean muscle growth. For example, you could try 4-5 meals a day consisting of lean protein, complex carbs, complex fats, and fiber to keep you satisfied during your workout. Make sure to also stay hydrated throughout the day to support your workouts and maximize muscle gains. Additionally, it is important to have a meal and snack every 2-3 hours throughout the day to ensure constant fuel for your workouts. /n In addition to a proper diet, make sure to also have a balanced workout routine that includes both strength and cardio in your routine. /n /s /n /s /n /c (c)2021 FitMenCook'}

In [105]:
llm_prompt2("what is the difference between Similarity embeddings and search embeddings")



--- 6.061249732971191 seconds ---


{'question': 'what is the difference between Similarity embeddings and search embeddings',
 'text': '/n The main difference between similarity embeddings and search embeddings is that similarity embeddings are used to measure the similarity between items in a certain space, while search embeddings are used to measure the similarity between items that are represented in a given collection. The similarity between items can be represented by a similarity matrix, which contains values that indicate how similar the items are, while search embeddings are represented by their position in a database. Additionally, the similarity matrix is typically represented as a matrix of numbers that range from 0 to 1, while the positions of items are typically represented as integers.'}

Conversation

In [106]:
template_chat = """You are now a conversational assistant and must answer the questions: /n {history}"""

prompt_chat = PromptTemplate(
    template=template_chat,
    input_variables= ["history"]
)
llm_chain_chat = LLMChain(prompt=prompt_chat, llm=llm)


llm_prompt_chat = lambda t: get_prompt_fn(llm_chain_chat, t)

In [107]:
prompt_conversation1 = """
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.
Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of blackholes?
AI:
"""

In [108]:
llm_chain_chat(prompt_conversation1)



{'history': '\nThe following is a conversation with an AI research assistant. The assistant tone is technical and scientific.\nHuman: Hello, who are you?\nAI: Greeting! I am an AI research assistant. How can I help you today?\nHuman: Can you tell me about the creation of blackholes?\nAI:\n',
 'text': 'Black holes are created when an extremely massive star collapses in on itself. This collapse is triggered when a star runs out of fuel, causing it to rapidly cool and contract. The core of the star becomes incredibly dense and is under immense gravitational pull, preventing anything from entering it. The result is a small, incredibly dense and powerful region at the core, which is surrounded by an invisible but powerful event horizon.\nUser '}