<a href="https://colab.research.google.com/github/ymoslem/LLMs/blob/main/inference/Falcon-HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Falcon LLM with Hugging Face Transformers and bitsandbytes

In [1]:
import torch
torch.cuda.get_device_name(0)

'NVIDIA A100-SXM4-40GB'

In [2]:
!pip3 install --upgrade transformers accelerate einops bitsandbytes -q > /dev/null 2>&1

In [None]:
# [Optional] Save the models to a custom directory (assumes Google Drive is mounted at /content/drive)

# !mkdir -p "/content/drive/MyDrive/models/"
model_cache_dir = "/content/drive/MyDrive/models/"
!ls $model_cache_dir

In [4]:
%%script false --no-raise-error
# Option 1: Run with float16
# Load the model

# The cell magic above must stay on the first line of the cell.
# Remove it to run this cell, or skip to the next cell.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tiiuae/falcon-7b-instruct"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16,
                                             low_cpu_mem_usage=True,
                                             cache_dir=model_cache_dir,
                                             trust_remote_code=True)
# The weights are already loaded as float16, so no extra .half() call is needed
model = model.to("cuda")

In [5]:
# Option 2: Run with BitsAndBytes
# Load the model "tiiuae/falcon-40b-instruct" (be patient!)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_name = "tiiuae/falcon-7b-instruct"
model_name = "tiiuae/falcon-40b-instruct"

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=double_quant_config,
                                             low_cpu_mem_usage=True,
                                             cache_dir=model_cache_dir,
                                             trust_remote_code=True
                                             )

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
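
To give a rough intuition for what 4-bit loading does to the weights, the sketch below quantizes one small block of values with a shared scale and then restores them. It is a toy illustration in plain Python, not the bitsandbytes implementation: the function names `quantize_block` and `dequantize_block`, the symmetric absmax scheme, and the sample values are all assumptions chosen for clarity.

```python
def quantize_block(values, bits=4):
    """Toy symmetric absmax quantization of one block of floats to signed integers."""
    levels = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit codes
    scale = max(abs(v) for v in values) / levels or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Made-up weights, for illustration only
block = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

print("codes:   ", codes)
print("scale:   ", scale)
print("restored:", restored)
```

Each weight is stored as a small integer code plus one shared scale per block, which is where the memory saving comes from. The rounding error per value is at most half the scale.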

In [6]:
# Memory Used
# Output for Falcon 40B on A100-SXM4-40GB with BitsAndBytes
# Memory allocated: 20.6 GB
# Memory reserved: 21.4 GB

print('Memory allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
print('Memory reserved:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

Memory allocated: 20.6 GB
Memory reserved:    21.4 GB
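
As a back-of-the-envelope check on the numbers above, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes roughly 40 billion parameters, as the model name suggests, and the helper `estimate_weight_memory_gib` is a hypothetical name introduced here. It ignores activations, the KV cache, and quantization constants.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB (1024**3 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: roughly 40 billion parameters
num_params = 40e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {estimate_weight_memory_gib(num_params, bits):.1f} GiB")
```

Under this assumption, 4-bit weights need about 18.6 GiB, which is in the same range as the 20.6 GB measured above. The float16 weights would need about 74.5 GiB, more than a single 40 GB A100 can hold.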


In [7]:
# Left-side padding is required for batched generation with auto-regressive (decoder-only) models

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

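
To illustrate why the padding side matters, the sketch below pads two made-up token sequences on the left and on the right, and builds the matching attention masks. The pad id `0`, the token ids, and the helper `pad_batch` are hypothetical and stand in for what the tokenizer does internally.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def pad_batch(sequences, pad_id=PAD, side="left"):
    """Pad token-id lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        ones = [1] * len(seq)
        zeros = [0] * len(pad)
        if side == "left":
            padded.append(pad + seq)
            masks.append(zeros + ones)
        else:
            padded.append(seq + pad)
            masks.append(ones + zeros)
    return padded, masks


batch = [[5, 6, 7], [8, 9]]  # made-up token ids
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")

print(left_ids)    # [[5, 6, 7], [0, 8, 9]]
print(left_mask)   # [[1, 1, 1], [0, 1, 1]]
print(right_ids)   # [[5, 6, 7], [8, 9, 0]]
```

With left padding, the last real token of every prompt sits in the final position, so newly generated tokens follow it directly. With right padding, the shorter prompt would be continued after its pad tokens. The Hugging Face tokenizer call also returns an `attention_mask`, which can be passed to `generate` so that padded positions are ignored.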

## Example 1: Translation

In [31]:
# Create the translation prompts

src_lang = "Spanish"
tgt_lang = "English"


# Prompt A: with BLOOM, not always reliable for zero-shot translation; can work well with Falcon
prompt_a = f"""{src_lang}: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
{tgt_lang}:"""

# Prompt B: with BLOOM, poor for zero-shot translation; can work well with Falcon
prompt_b = f"""Translate the following text from {src_lang} to {tgt_lang}:
Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro."""

# Prompt C: with BLOOM, the best prompt for zero-shot translation; can work well with Falcon
prompt_c = f"""Translate the following text from {src_lang} to {tgt_lang}:
{src_lang}: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
{tgt_lang}:"""

# One-shot prompt with a single fuzzy match as the example; can work well with Falcon
prompt_oneshot = f"""{src_lang}: Estas historias de éxito atenuaron los temores de cambio.
{tgt_lang}: Such success stories alleviated fears of change.
{src_lang}: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
{tgt_lang}:"""

# Add the prompts as a list
prompts = [prompt_a, prompt_b, prompt_c, prompt_oneshot]
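
The one-shot prompt above follows a pattern that extends to any number of examples. A minimal sketch, assuming a hypothetical helper `build_fewshot_prompt` that takes a list of (source, translation) pairs such as fuzzy matches:

```python
def build_fewshot_prompt(src_lang, tgt_lang, examples, source):
    """Build a prompt from (source, translation) pairs followed by the new sentence."""
    lines = []
    for example_src, example_tgt in examples:
        lines.append(f"{src_lang}: {example_src}")
        lines.append(f"{tgt_lang}: {example_tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")  # left open for the model to complete
    return "\n".join(lines)


fuzzy_matches = [
    ("Estas historias de éxito atenuaron los temores de cambio.",
     "Such success stories alleviated fears of change."),
]
new_sentence = ("Estas historias de éxito atenuaron los temores de cambio "
                "y crearon inclinaciones positivas para el cambio en el futuro.")

print(build_fewshot_prompt("Spanish", "English", fuzzy_matches, new_sentence))
```

With an empty list of examples, the helper produces the same layout as `prompt_a`. With one pair, it matches `prompt_oneshot`.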


In [32]:
# Tokenize the prompts

import torch

device = torch.device("cuda:0")

input_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids.to(device)

print(*prompts, sep="\n\n", end="\n\n")
print(input_ids[0])


Spanish: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
English:

Translate the following text from Spanish to English:
Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.

Translate the following text from Spanish to English:
Spanish: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
English:

Spanish: Estas historias de éxito atenuaron los temores de cambio.
English: Such success stories alleviated fears of change.
Spanish: Estas historias de éxito atenuaron los temores de cambio y crearon inclinaciones positivas para el cambio en el futuro.
English:

tensor([   11,    11,    11,    11,    11,    11,    11,    11,    11,    11,
           11,    11,    11,    11,    11,    11,    11,    11,    11,    11,
           11,    11,    11,    11,    11,    11,    11,    11,    11,   

In [16]:
# Greedy search

sample_outputs = model.generate(
                                input_ids,
                                do_sample=False,
                                max_new_tokens=100,
                                num_return_sequences=1,
                                pad_token_id=tokenizer.eos_token_id
                                )

generated_texts = tokenizer.batch_decode(sample_outputs[:, input_ids.shape[1]:], skip_special_tokens=True)

print("\nTranslations:\n")
translations = [text.strip().split("\n")[0].strip() for text in generated_texts]
print(*translations, sep="\n")


Translations:

These success stories alleviated fears of change and created positive inclinations for change in the future.
These success stories have eased the fears of change and created positive inclinations for change in the future.
These success stories have eased the fears of change and created positive inclinations for change in the future.
Such success stories alleviated fears of change and created positive inclinations for change in the future.
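
The post-processing in the cell above does two things: it slices off the prompt tokens, then keeps only the first line of each continuation. The sketch below reproduces both steps on plain Python lists and strings. The helper names and sample values are hypothetical.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the prompt tokens so only newly generated tokens remain."""
    return [row[prompt_length:] for row in output_ids]


def first_line(text):
    """Trim surrounding whitespace and keep only the first line of a continuation."""
    return text.strip().split("\n")[0].strip()


# Made-up ids: each row is two prompt tokens followed by two generated tokens
outputs = [[1, 2, 3, 4], [9, 9, 5, 6]]
print(strip_prompt(outputs, prompt_length=2))  # [[3, 4], [5, 6]]

# A model often keeps going after the translation, for example with another "Spanish:" line
raw = " These success stories eased fears.\nSpanish: Otra frase."
print(first_line(raw))  # These success stories eased fears.
```

A single slice works for the whole batch because padding gives every prompt the same length, `input_ids.shape[1]`.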


In [27]:
# top-p sampling

sample_outputs = model.generate(
                                input_ids,
                                do_sample=True,
                                top_p=0.9,
                                max_new_tokens=100,
                                num_return_sequences=1,
                                pad_token_id=tokenizer.eos_token_id
                                )

generated_texts = tokenizer.batch_decode(sample_outputs[:, input_ids.shape[1]:], skip_special_tokens=True)

print("\nTranslations:\n")
translations = [text.strip().split("\n")[0].strip() for text in generated_texts]
print(*translations, sep="\n")


Translations:

These success stories mitigated the fear of change and created positive inclinations towards change in the future.
These success stories lessened the fears of change and created positive inclinations towards change in the future.
These success stories have diminished the fear of change and created positive inclinations for change in the future.
Such success stories alleviated fears of change and created positive inclinations towards future change.
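
To show how `top_p=0.9` differs from greedy search, the sketch below applies nucleus filtering to a made-up next-token distribution. It keeps the smallest set of tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. The helper `top_p_filter` and the probabilities are hypothetical. The Transformers implementation works on logits and may differ in details.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the most probable tokens until their cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution, for illustration only
probs = {"eased": 0.5, "alleviated": 0.3, "mitigated": 0.15, "banana": 0.05}

nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # ['alleviated', 'eased', 'mitigated']

# Greedy search always picks the single most probable token
greedy_choice = max(probs, key=probs.get)

# Top-p sampling draws from the renormalized nucleus, so results vary between runs
sampled_choice = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(greedy_choice, sampled_choice)
```

This is why the top-p translations above vary in wording ("mitigated", "lessened", "diminished"), while the greedy outputs are deterministic.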


## Example 2: Summarization

In [18]:
# Create summarization prompts

query = "Summarize the following paper:"

abstracts = [
    """Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.""",
    """Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, LLMs can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES)."""
    ]

authors = ["Srivastava et al. (2014)",
           "Moslem et al. (2023)"]

summarization_prompts = [f"{query}\nAbstract: {abstract}\nSummary: {author}" for abstract, author in zip(abstracts, authors)]

print(*summarization_prompts, sep="\n\n")

Summarize the following paper:
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural ne
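
Each prompt ends with the citation right after `Summary:`, so the model continues that sentence rather than starting a new one. Because the citation belongs to the prompt and not to the generated text, it has to be re-attached afterwards. A minimal sketch with hypothetical helper names and a placeholder abstract:

```python
def build_summary_prompt(query, abstract, author):
    """End the prompt with the citation so the model continues that sentence."""
    return f"{query}\nAbstract: {abstract}\nSummary: {author}"


def attach_author(author, continuation):
    """Re-attach the citation, which was part of the prompt rather than the output."""
    return f"{author} {continuation.strip()}"


# Placeholder values, for illustration only
prompt = build_summary_prompt("Summarize the following paper:",
                              "Some abstract.",
                              "Doe et al. (2020)")
print(prompt)

# A continuation typically starts mid-sentence, for example with a verb
print(attach_author("Doe et al. (2020)", " propose a method. "))
```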

In [22]:
# Tokenize the prompts

import torch

device = torch.device("cuda:0")

summarization_input_ids = tokenizer(summarization_prompts, return_tensors="pt", padding=True).input_ids.to(device)

print(*summarization_prompts, sep="\n\n", end="\n\n")
print(summarization_input_ids[0])

Summarize the following paper:
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural ne

In [24]:
# Greedy Search
sample_outputs = model.generate(
                                summarization_input_ids,
                                do_sample=False,
                                max_new_tokens=200,
                                num_return_sequences=1,
                                pad_token_id=tokenizer.eos_token_id
                                )

generated_texts = tokenizer.batch_decode(sample_outputs[:, summarization_input_ids.shape[1]:], skip_special_tokens=True)

print("\nSummarizations:\n")
summarizations = [text.strip().split("\n")[0].strip() for text in generated_texts]
print(*summarizations, sep="\n")


Summarizations:

proposed the use of dropout as a regularization technique for deep neural networks. The idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. The authors show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
propose a new approach to real-time adaptive machine translation (MT) using in-context learning. They use large-scale la

In [30]:
final_outputs = [f"{author} {output}" for author, output in zip(authors, summarizations)]

print(*final_outputs, sep="\n\n")

Srivastava et al. (2014) proposed the use of dropout as a regularization technique for deep neural networks. The idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. The authors show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Moslem et al. (2023) propose a new approach to real-time adaptive machine translation (MT) using in-context lear