<a href="https://colab.research.google.com/github/sudhang/css-nlp/blob/master/falcon/Falcon_7B_QLORA_Generate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Make it pretty
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In this notebook, we use Falcon from TII (`tiiuae` on Hugging Face), which was released a few months ago. We fine-tune it using QLoRA with 4-bit quantization, which allows this 7-billion-parameter model to be trained on a free Google Colab instance.
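
As a rough back-of-the-envelope check of why 4-bit quantization matters here, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, the KV cache and the LoRA adapter, so the real footprint is somewhat larger. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 7e9  # assumption: nominal parameter count of a "7B" model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

At 16 bits the weights alone come to roughly 13 GiB, while at 4 bits they need a little over 3 GiB, which leaves room for activations and the adapter on a small GPU.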

We rely heavily on the Google Colab notebooks and tutorials provided by Hugging Face:  https://huggingface.co/blog/4bit-transformers-bitsandbytes

Apart from that, we used a number of tutorial blogs and even YouTube videos:



1. [Fine-tuning Alpaca and LLaMA: Training on a Custom Dataset](https://www.mlexpert.io/machine-learning/tutorials/alpaca-fine-tuning#user-content-fn-6)
2. [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)
3. [How to Fine-Tune Open-Source LLMs Locally Using QLoRA!](https://youtu.be/2bkrL2ZcOiM)
4. [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
5. [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
6. [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)







### Installations

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install rouge
!pip install einops


## FLAGS and PARAMS

In [None]:
GDRIVEPATH = "/content/drive/MyDrive/TU/Sem 4/NLP"

In [None]:
DEBUG = False
NUM_TO_GEN = 5

## Imports

We log in to the Hugging Face Hub with an access token. Gated models such as Llama 2 require this; for Falcon it is only needed if the adapter repository is private.

In [None]:
from huggingface_hub import login
login()


In [None]:
import pandas as pd

import torch
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig       # For quantization
from peft import prepare_model_for_kbit_training

from peft import LoraConfig                       # For LORA
from peft import get_peft_model

from datasets import Dataset, load_dataset, DatasetDict

## Load a previous model

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import nltk
nltk.download('punkt')

adapter_model_id = "falcon_cssnlp"
peft_model_id = f"sudhangshankar/{adapter_model_id}"

config = PeftConfig.from_pretrained(peft_model_id)
the_base_model = config.base_model_name_or_path

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
the_base_model

'tiiuae/falcon-7b'

In [None]:
config

PeftConfig(peft_type='LORA', auto_mapping=None, base_model_name_or_path='tiiuae/falcon-7b', revision=None, task_type='CAUSAL_LM', inference_mode=True)

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # nested quantization to save memory
    bnb_4bit_quant_type="nf4",              # NF4 gives higher precision than FP4
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
                the_base_model,
                return_dict=True,
                quantization_config=bnb_config,
                device_map='auto'
              )
tokenizer = AutoTokenizer.from_pretrained(the_base_model)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)


Loading tiiuae/falcon-7b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/tiiuae/falcon-7b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y

## Generation


In [None]:
if DEBUG:
  # Generate a prompt
  prompt = 'Greek Coast Guard vessels on Saturday evacuated hundreds of tourists and locals trapped in seaside villages on Rhodes that were threatened by five-day-old wildfires, moving them to safer parts of the island.'

  device = "cuda:0"
  inputs = tokenizer(prompt, return_tensors="pt")
  # We need only the following two fields.
  input_ids = inputs['input_ids'].to(device)
  attention_mask = inputs['attention_mask'].to(device)

  model.config.use_cache = True
  outputs = model.generate(
                  input_ids=input_ids,
                  attention_mask=attention_mask,
                  # Use sampling instead of greedy decoding
                  do_sample=True,
                  # Keep only the 50 tokens with
                  # the highest probability
                  top_k=50,
                  # Maximum sequence length in tokens (prompt + generated text);
                  # Falcon's context window is 2048 tokens
                  max_length=300,
                  # Keep only the most probable tokens
                  # with cumulative probability of 95%
                  top_p=0.95,
                  # Changes randomness of generated sequences
                  temperature=0.7,
                  # Falcon tends to repeat itself, so we penalize repetition.
                  # Passed as a float, which transformers expects here.
                  repetition_penalty=2.0,
                  # Number of sequences to generate
                  num_return_sequences=1)


  for i, sample_output in enumerate(outputs):
      print(f"{i}: {tokenizer.decode(sample_output, skip_special_tokens=True)}\n\n")
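
To make the roles of `temperature`, `top_k` and `top_p` more concrete, here is a simplified pure-Python sketch of how these settings reshape a next-token distribution before sampling. It is an illustration on toy logits, not the code that `transformers` runs; `softmax` and `sampling_distribution` are hypothetical helper names.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing logits by a value < 1 sharpens the distribution.
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)

    # Top-k: keep the k most probable tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
        total = sum(p for _, p in ranked)
        ranked = [(i, p / total) for i, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in ranked:
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {i: p / total for i, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
dist = sampling_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(dist, next_token)
```

Lowering `top_p` or `temperature` concentrates the probability mass on fewer tokens, which is why a low temperature gives more conservative text.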

In [None]:
def count_sentences(text_list):
    total_sentences = 0
    for text in text_list:
        sentences = nltk.sent_tokenize(text)
        total_sentences += len(sentences)
    return total_sentences

# Example usage:
text_list = [
    "This is the first sentence. This is the second sentence.",
    "This is another sentence."
  ]
print(count_sentences(text_list))  # Output: 3


3


In [None]:
def generate_news_article(prompt="Graz, Austria - ", min_sentences = 50):

  device = "cuda:0"

  gen_text_snippets = [prompt]
  count_gen_sentences = count_sentences(gen_text_snippets)

  while count_gen_sentences < min_sentences:
    last_gen_snippet = gen_text_snippets[-1].rstrip('. ')
                                                # rstrip('. ') to trick it into
                                                # thinking the sentence isn't
                                                # over so that it doesn't decide
                                                # to go off on a tangent

    inputs = tokenizer(last_gen_snippet, return_tensors="pt")
    # We need only the following two fields.
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    outputs = model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    # Use sampling instead of greedy decoding
                    do_sample=True,
                    # Keep only the 50 tokens with
                    # the highest probability
                    top_k=50,
                    # Maximum sequence length in tokens (prompt + generated text);
                    # Falcon's context window is 2048 tokens
                    max_length=1000,
                    # Keep only the most probable tokens
                    # with cumulative probability of 95%
                    top_p=0.95,
                    # Low temperature, since we generate such a long sequence
                    temperature=0.3,
                    # Strong penalty against repetition (passed as a float)
                    repetition_penalty=10.0,
                    # Number of sequences to generate
                    num_return_sequences=1)

    # Drop the prompt tokens so that only the newly generated text is decoded
    prompt_length = input_ids.shape[1]
    gen_text = tokenizer.decode(
        outputs[0][prompt_length:],
        skip_special_tokens=True
      )
    gen_text_snippets.append(gen_text)
    count_gen_sentences = count_sentences(gen_text_snippets)
    if DEBUG:
      print(f"{gen_text=}\n{count_gen_sentences=}====\n")

  gen_text = " ".join(gen_text_snippets)

  return gen_text
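
The `repetition_penalty` used above has a large effect on the output, so a small illustration may help. The sketch below applies the commonly used rule (divide positive logits of already-seen tokens by the penalty, multiply negative ones) to toy values. It is a simplified stand-in for what happens inside `generate`, and `apply_repetition_penalty` is a hypothetical helper name.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Make tokens that already appear in the context less likely."""
    penalized = list(logits)
    for token_id in set(previous_token_ids):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty   # shrink positive scores
        else:
            penalized[token_id] *= penalty   # push negative scores further down
    return penalized

toy_logits = [4.0, 2.0, -1.0, 0.5]   # made-up scores for a 4-token vocabulary
already_seen = [0, 2, 0]             # token ids present in the prompt or output so far
print(apply_repetition_penalty(toy_logits, already_seen, penalty=10.0))
```

With a penalty of 10.0, a token that was previously the top candidate drops well below tokens that have not been used yet. The penalty applies to every token in the context, including common words and punctuation.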



In [None]:
if DEBUG:
  the_prompt = "NEW DELHI - Thousands of people were evacuated from their homes "
  article = generate_news_article(prompt = the_prompt, min_sentences=51)
  display(article)
  print("\n\n")
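
The `while` loop in `generate_news_article` keeps asking for continuations until enough sentences exist. If a continuation comes back empty or without sentence-ending punctuation, the count may stop growing. The sketch below shows one way to bound such a loop, using a stub in place of the model and a crude regex in place of `nltk.sent_tokenize`. All names here (`count_sentences_simple`, `generate_until`, `fake_model`, `max_rounds`) are hypothetical.

```python
import re

def count_sentences_simple(snippets):
    """Crude stand-in for nltk.sent_tokenize: count runs of '.', '!' or '?'."""
    return sum(len(re.findall(r'[.!?]+', s)) for s in snippets)

def generate_until(prompt, continue_fn, min_sentences, max_rounds=20):
    """Append continuations until min_sentences is reached or a guard trips."""
    snippets = [prompt]
    for _ in range(max_rounds):
        if count_sentences_simple(snippets) >= min_sentences:
            break
        # Same trick as above: strip the final period so the model keeps going
        continuation = continue_fn(snippets[-1].rstrip('. '))
        if not continuation.strip():
            break  # nothing new was generated, so stop instead of looping forever
        snippets.append(continuation)
    return " ".join(snippets)

def fake_model(context):
    """Stub that always returns the same two sentences."""
    return " and then it rained. The streets flooded."

article = generate_until("Graz, Austria - The storm arrived.", fake_model, min_sentences=5)
print(article)
```

The same two guards (a round limit and an empty-continuation check) could be added to the real loop without changing how the prompts are built.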

In [None]:
# Load the csv file
df = pd.read_csv(f'{GDRIVEPATH}/data/nyt_test.csv')

# Initialize a new dataframe
new_df = pd.DataFrame(columns=['Original Article', 'Prompt', 'Generated Article'])

count_gen = 0
while count_gen < NUM_TO_GEN:
    random_article = df['content'].sample(1).values[0]

    try:
      sentences = nltk.sent_tokenize(random_article)
      # Use the first two sentences of the real article as the prompt
      prompt = ' '.join(sentences[:2])

      print(f"{count_gen=}\n{prompt=}\n======")

      generated_article = generate_news_article(prompt=prompt, min_sentences=51)

      current_df = pd.DataFrame({
          'Original Article': [random_article],
          'Prompt': [prompt],
          'Generated Article': [generated_article]
      })

      # Append the current dataframe to the new dataframe
      new_df = pd.concat([new_df, current_df], ignore_index=True)
      count_gen = count_gen + 1
    except Exception as err:
      # Generation occasionally fails for some articles. Don't abort the run:
      # report the error and try again with another article.
      print(f"Skipping article: {err!r}")


# Post-processing to remove incomplete sentences
new_df['Generated Article'] = new_df['Generated Article'].apply(lambda text:
                                      ' '.join(nltk.sent_tokenize(text)[:-1])
                                      if not text.endswith(('.', '!', '?'))
                                      else text
                                    )


Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


count_gen=0
prompt='It was Donald J. Trump’s chance to sound contrite and mature, to explain away the sexually predatory boasts he was caught making on tape and to persuade Americans that — for all his no-apologies braggadocio — he was, in fact, capable of feeling shame. Maura Cotter, 22, a senior at the University of Notre Dame, was shocked at what Mr. Trump did instead in Sunday’s debate: repeat, over and over, that what he had said on the 2005 recording, about forcing himself on women and grabbing their genitals, was simply “locker-room banter.” It was, Ms. Cotter said, “not an apology — no reason to believe he’s changed at all.” A classmate, Abigail Wilson, who is a registered Republican, listened closely to Mr. Trump and was reminded, she said, of the time she was groped by a stranger.'




count_gen=1
prompt='WASHINGTON — Henry A. Kissinger slipped into the State Department last week for a quiet lunch in his old office with Rex W. Tillerson, the former Exxon Mobil chief executive, who has all but covered himself in a cloak of invisibility in his first six weeks as secretary of state. Describing his impressions, Mr. Kissinger, perhaps America’s most famous diplomatic strategist, chose his words judiciously.'




count_gen=2
prompt='Gov. Mike Pence, aligning himself with the Republican establishment rather than his running mate, broke with Donald J. Trump on Wednesday by endorsing Speaker Paul D. Ryan’s re-election bid, a day after Mr. Trump roiled the party by declaring that he was not yet ready to support the speaker.'




count_gen=3
prompt='SEOUL, South Korea — Keeping diplomatic developments coming at a head-snapping pace, the South Korean government said on Sunday that North Korea’s leader, Kim Jong-un, had told President Moon Jae-in that he would abandon his nuclear weapons if the United States agreed to formally end the Korean War and promise not to invade his country. In a confidence-building gesture ahead of a proposed summit meeting with President Trump, a suddenly loquacious and conciliatory Mr. Kim also said he would invite experts and journalists from South Korea and the United States to watch the shutdown next month of his country’s only known underground nuclear test site.'




count_gen=4
prompt='NICE, France — At times it was hard to know who was on trial, the smuggler or the state. The defendant, Cédric Herrou, 37, a slightly built olive farmer, did not deny that for months he had illegally spirited dozens of migrants through the remote mountain valley where he lives.'




## Post-Processing

In [None]:
import re
from nltk.tokenize import sent_tokenize, word_tokenize

def post_process(text):
    # Delete runs of two or more '!' or '?' (the whole run is removed)
    text = re.sub(r'[!?]{2,}', r'', text)

    # Remove whitespace before punctuation
    text = re.sub(r'\s*([.,!?])', r'\1', text)

    # Trim the ends and collapse repeated spaces
    text = text.strip()
    text = re.sub(r' +', ' ', text)

    # Remove whitespace on both sides of an apostrophe (e.g. in contractions)
    pattern = r'\s([\'’])\s'
    text = re.sub(pattern, r'\1', text)

    # Remove whitespace after an opening quotation mark
    pattern = r'“\s'
    text = re.sub(pattern, r'“', text)

    # Remove whitespace before a closing quotation mark
    pattern = r'\s”'
    text = re.sub(pattern, r'”', text)

    return text

new_df['Generated Article'] = new_df['Generated Article'].apply(post_process)
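
Note that the first rule in `post_process` deletes a run such as `?!` entirely, which can leave a sentence without closing punctuation. If that is not wanted, one possible alternative is to keep the first mark of the run. The sketch below compares the two behaviours on a made-up string; `collapse_repeated_marks` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def collapse_repeated_marks(text):
    """Keep the first mark of any run of '!' / '?' instead of deleting the run."""
    return re.sub(r'([!?])[!?]+', r'\1', text)

sample = "Is this real?! They said yes!!"

# Behaviour of the rule used in post_process: the whole run disappears
print(re.sub(r'[!?]{2,}', '', sample))

# Alternative: the run is reduced to its first character
print(collapse_repeated_marks(sample))
```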

In [None]:
new_df

| | Original Article | Prompt | Generated Article |
|---|---|---|---|
| 0 | It was Donald J. Trump’s chance to sound contr... | It was Donald J. Trump’s chance to sound contr... | It was Donald J. Trump’s chance to sound contr... |
| 1 | WASHINGTON — Henry A. Kissinger slipped into t... | WASHINGTON — Henry A. Kissinger slipped into t... | WASHINGTON — Henry A. Kissinger slipped into t... |
| 2 | Gov. Mike Pence, aligning himself with the Rep... | Gov. Mike Pence, aligning himself with the Rep... | Gov. Mike Pence, aligning himself with the Rep... |
| 3 | SEOUL, South Korea — Keeping diplomatic develo... | SEOUL, South Korea — Keeping diplomatic develo... | SEOUL, South Korea — Keeping diplomatic develo... |
| 4 | NICE, France — At times it was hard to know wh... | NICE, France — At times it was hard to know wh... | NICE, France — At times it was hard to know wh... |


In [None]:
# Save the new dataframe to a csv file
new_df.to_csv(f'{GDRIVEPATH}/generated/falconqlora_nyt_2.csv', index=False)


In [None]:
new_df.loc[3,"Generated Article"]

