<a href="https://colab.research.google.com/github/sleepypioneer/fine_tuning_LLMs/blob/week_three/projects/FinetuningLLMs_Project_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3:  LORA Fine Tuning

In this project, you will use parameter efficient fine tuning (PEFT) low rank adaptation (LORA) to fine-tune the phi-1.5 model for summarizing scientific papers. For this assignment, we will use `microsoft/phi-1.5` [model](https://huggingface.co/microsoft/phi-1_5) that works on free T4 GPUs with 16GB provided by Google Colab. If you have access to bigger GPUs like A100 with 40GB memory, I encourage you to experiment with bigger and latest LLMs like [Lllama-2](https://huggingface.co/models?search=llama2) or [Mistral](https://huggingface.co/mistralai)

## Table of Contents

- [1 - Data Handling](#1)
  - [1.1 - Downloading the Data](#1.1)
  - [1.2 - Exploring the Data](#1.2)

- [2 - Data Preprocessing](#2)
  - [2.1 - Restructuring and Tokenizing](#2.1)
  - [2.2 - Decoding Example](#2.2)

- [3 - Configuring PEFT LORA](#3)
  - [3.1 - Downloading and Converting Phi-1.5](#3.1)
  - [3.2 - Trainable Parameters](#3.2)

- [4 - Training Configuration](#4)

- [5 - Model Training](#5)

- [6 - Evaluation](#6)

- [7 - Real-World Application](#7)

- [8 - Conclusion](#8)

This project is truly the bleeding edge of generative AI- we are so excited to see where you take this technology.  If you get stuck, make sure to reach out to the course team and your peers for support!

Let's get started.

In [4]:
!pip install transformers torch datasets peft einops trl --quiet

In [5]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, pipeline
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import os
import unittest
from unittest.mock import patch
import torch

<a name='1'></a>
# Data Handling

Our goal is to fine tune a pretrained model on a dataset gathered by the [Allen Institute for AI](https://allenai.org/).  This dataset, known as [SciTLDR](https://huggingface.co/datasets/allenai/scitldr) contains extreme summaries of scientific content.  Here's a quick description from the dataset card:

**SciTLDR: Extreme Summarization of Scientific Documents**

SciTLDR is a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.

<a name='1.1'></a>
## Downloading the Data
Let's get started by pulling the dataset from the HuggingFace Hub.

In [6]:
dataset = load_dataset("allenai/scitldr")
dataset

Downloading builder script:   0%|          | 0.00/7.21k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/7.56k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.81k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.01M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/356k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/378k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1992 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/618 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/619 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['source', 'source_labels', 'rouge_scores', 'paper_id', 'target'],
        num_rows: 1992
    })
    test: Dataset({
        features: ['source', 'source_labels', 'rouge_scores', 'paper_id', 'target'],
        num_rows: 618
    })
    validation: Dataset({
        features: ['source', 'source_labels', 'rouge_scores', 'paper_id', 'target'],
        num_rows: 619
    })
})

<a name='1.2'></a>
## Exploring the Data
Let's take a look at an example from the data.  The `source` and `target` columns are most important to our work here, although the others also contain interesting information.  Don't hesitate to check them out!

This dataset has a unique structure:
* The source content for each sample is a list of sentences.  These sentences need to be concatenated to construct the full content.
* THe target content, which contains the TLDR summary, is a list of strings.  Each of those strings contains a TLDR summary.  In this exercise, we are only using the first TLDR string in the dataset. This is the "expert" TLDR, according to the dataset description.

Check out the cell below to see these concepts in practice.

In [7]:
from rich import print
sample_idx = 0

# Concatenate all the source content sentences
print(f"Source content:  {' '.join(dataset['train'][sample_idx]['source'])}")

print()

# Retrieve the first target TLDR.
print(f"Target TLDR:  {dataset['train'][sample_idx]['target'][0]}")

<a name='2'></a>
# Load pre-trained model and tokenizer

We'll be using the tokenizer that was trained in concert with the Phi-1.5 model.  This tokenizer does not come with a `PAD` token, so we will reuse its `EOS` token.  Let's download the tokenizer in the next cell.

In [9]:
model_name='microsoft/phi-1_5'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading (…)former_sequential.py:   0%|          | 0.00/2.02k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- configuration_mixformer_sequential.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Downloading (…)former_sequential.py:   0%|          | 0.00/31.1k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- modeling_mixformer_sequential.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


# Pre-trained model performance

Let's look at how the pre-trained model performs for summarization task. Let's structure the prompt as a zero-shot instruction prompt:

```text
Summarize the following academic content.

[source scientific content]

Summary:
```

## Prompt template

Let's create a prompt template that uses the source field in the dataset and converts source into an input to the model that includes the instruction.

In [8]:
prompt_template = """Summarize the following academic content.\n
{source}

Summary:"""

sample = dataset["train"][0]
prompt = prompt_template.format(source=" ".join(sample["source"]))
print(prompt)

## Text-generation Pipeline

Let's create text generation pipeline using the phi_1.5 model, pass the above prompt as input and evaluate the response of the model.

In [10]:
pipe = pipeline("text-generation",model=model,tokenizer=tokenizer)
response = pipe(prompt, do_sample=True, max_new_tokens=50, temperature=0.7)
print(response[0]["generated_text"])
print("-----------")
target = " ".join(sample["target"])
print("TLDR target: %s"%target)

## Restructuring a new column `text`
Let's create a new column named `text` that appends the instruction, text source and the summary that we can use for fine-tuning.

In [11]:
prompt_template_with_response = """Summarize the following academic content.\n
{source}

Summary: {target} {eos_token}"""

sample = dataset["train"][0]
text = prompt_template_with_response.format(source=" ".join(sample["source"]), target=" ".join(sample["target"]), eos_token = tokenizer.eos_token)
print(text)

In [14]:
def construct_text(batch):
    """Constructs a text from sources and targets with special prompt format for a summarization task.

    Constructs a prompt by prepending a start prompt and appending an end prompt
    to each source entry in the batch.

    Args:
        batch: A batch of source and target texts to tokenize.
            The 'source' key should map to a list of source texts (each being a list of strings),
            and the 'target' key should map to a list of target texts (each being a list of strings).

    Returns:
        A dictionary containing a new key named text
    """
    ########################
    # START YOUR CODE HERE #
    ########################
    # Replace None with your code

    # Concatenate the source texts into single strings for each sample
    sources = [src for src in batch['source']]

    # Extract the first target text for each sample
    targets = [target for target in batch["target"]]

    texts = []

    for source, target in zip(sources,targets):
      text = prompt_template_with_response.format(source=source, target=target, eos_token = tokenizer.eos_token)
      texts.append(text)

    batch["text"] = texts

    return batch

# Generate a text column

Use dataset `map` function to generate a `text` column, remove all the remaining columns and verify one of the samples.

In [15]:
dataset_with_text = dataset.map(construct_text, batched=True, remove_columns=dataset['train'].column_names)

dataset_with_text["train"][0]

Map:   0%|          | 0/1992 [00:00<?, ? examples/s]

Map:   0%|          | 0/618 [00:00<?, ? examples/s]

Map:   0%|          | 0/619 [00:00<?, ? examples/s]

{'text': "Summarize the following academic content.\n\n['Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect.', 'Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms.', 'In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks.', 'We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum.', 'Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural netwo

# Print the model to find attention blocks for LoRA



In [16]:
# print the model here
print(model)


<a name='4'></a>
# Configuring PEFT LORA
<a name='4.1'></a>
## Downloading and Converting Phi-1.5
Now that the data is prepared for the training process, let's download and prepare the model!  This is much more straightforward.  We'll simply configure the LORA using a `LoraConfig` class with the following parameters:
* `r = 8` This is the rank of the trainable LORA matrix.
* `lora_alpha = 16` As a rule of thumb, set LoRA Alpha to be twice the rank.
* `target_modules = ['Wqkv']` This is the layer of the transformer to apply LoRA for phi-1.5. You can find the modules by printing the model to stdout. This layer will be different for different models. You can provide a list of modules to apply LoRA.
* `lora_dropout = 0.05` This is the dropout probability of the LoRA layers.
* `bias = 'none'` This deactivates bias parameter training.
* `task_type = TaskType.CAUSAL_LM` This informs the lora configuration that the model is a causal language model.

One the LORA is configured, we'll apply it to the model using the `get_peft_model()` function.

In [17]:
########################
# START YOUR CODE HERE #
########################
# Replace None with your code

def create_lora_config():
    lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules = ['Wqkv'],
    r=8, 
    lora_alpha=16, 
    lora_dropout=0.05, 
    bias="none"

    )

    return lora_config

######################
# END YOUR CODE HERE #
######################

In [18]:
# @title Test your code!
class TestLoraConfig(unittest.TestCase):
    """Unit tests for the LoraConfig class."""

    def test_lora_config_initialization(self):
        """Tests the initialization of LoraConfig with correct arguments."""
        lora_config = create_lora_config()

        self.assertEqual(lora_config.r, 8)
        self.assertEqual(lora_config.lora_alpha, 16)
        self.assertEqual(lora_config.target_modules, {'Wqkv'})
        self.assertEqual(lora_config.lora_dropout, 0.05)
        self.assertEqual(lora_config.bias, "none")
        self.assertEqual(lora_config.task_type, TaskType.CAUSAL_LM)

# Run the tests
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestLoraConfig))

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

In [19]:
lora_config = create_lora_config()

# Apply the LORA configuration to get a new model for PEFT
peft_model = get_peft_model(model,
                            lora_config)

<a name='4.2'></a>
## Trainable Parameters
Let's now compare the number of parameters in the model with the number of trainable parameters.

In [20]:
def print_number_of_trainable_model_parameters(model):
    """Prints the number of trainable parameters in the model.

    This function traverses all the parameters of a given PyTorch model to
    count the total number of parameters as well as the number of trainable
    (i.e., requires gradient) parameters.

    Args:
        model: A PyTorch model whose parameters you want to count.
    """

    # Initialize counters for trainable and total parameters
    trainable_model_params = 0
    all_model_params = 0

    # Loop through all named parameters in the model
    for _, param in model.named_parameters():
        # Update the total number of parameters
        all_model_params += param.numel()

        # Check if the parameter requires gradient and update the trainable parameter counter
        if param.requires_grad:
            trainable_model_params += param.numel()

    # Calculate and print the number and percentage of trainable parameters
    print(f"Trainable model parameters: {trainable_model_params}")
    print(f"All model parameters: {all_model_params}")
    print(f"Percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%")

print_number_of_trainable_model_parameters(peft_model)

<a name='5'></a>
# Training Configuration
ALright, the data is ready ✅ and the LORA model is ready ✅, so our final step is to configure the training. Let's start with a baseline fo the following training arguments:
* `learning_rate = 1e-3`: Specifies the learning rate for the optimizer.
  
* `num_train_epochs=1`: Indicates the number of times the entire training dataset is used to update the model weights. A value of 1 means each sample is used once to update the weights.

* `logging_steps=50`: Specifies that logs for training metrics will be printed every 50 steps. Useful for monitoring the training process.

* `evaluation_strategy="steps"`: Specifies that the evaluation will be done based on the number of steps, as opposed to epochs. This is in line with `eval_steps`.

* `eval_steps=200`: Indicates that the model will be evaluated every 200 training steps. This is only applicable if `evaluation_strategy` is set to `"steps"`.

* `per_device_train_batch_size=1`: Defines the batch size for training on each device (e.g., GPU). A batch size of 1 means each training step updates the model based on a single sample.

* `per_device_eval_batch_size=1`: Sets the batch size for evaluation on each device. Similar to the training batch size, a value of 1 means the model will be evaluated one sample at a time.


In [23]:
output_dir = "uplimit-project-3-phi-1.5"

########################
# START YOUR CODE HERE #
########################
# Replace None with your code

def create_peft_training_args():
    peft_training_args = TrainingArguments(
        output_dir=output_dir,  
        learning_rate=1e-3,
        num_train_epochs=1,
        logging_steps=50,
        evaluation_strategy="steps",
        eval_steps=200,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
    )

    return peft_training_args

######################
# END YOUR CODE HERE #
######################

In [24]:
#@title Test your code

class TestTrainingArguments(unittest.TestCase):
    """Unit tests for the TrainingArguments class."""

    def test_training_arguments_initialization(self):
        """Tests the initialization of TrainingArguments with correct arguments."""
        peft_training_args = create_peft_training_args()

        self.assertEqual(peft_training_args.learning_rate, 1e-3)
        self.assertEqual(peft_training_args.num_train_epochs, 1)
        self.assertEqual(peft_training_args.logging_steps, 50)
        self.assertEqual(peft_training_args.evaluation_strategy, "steps")
        self.assertEqual(peft_training_args.eval_steps, 200)
        self.assertEqual(peft_training_args.per_device_train_batch_size, 1)
        self.assertEqual(peft_training_args.per_device_eval_batch_size, 1)

# Run the tests
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestTrainingArguments))

.
----------------------------------------------------------------------
Ran 1 test in 0.027s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

Now, create the SFT Trainer.  The trainer class contains all the information necessary to train a model, including:
* `model = peft_model`: Of course, the trainer requires the model for training.
* `args = peft_training_args`: These are the training arguments we defined earlier.
* `train_dataset = dataset_with_text['train']`: We will use the `train` portion of the dataset for fine tuning.
* `eval_dataset = dataset_with_text['validation']`: We will use the `validation` portion of the dataset to periodically evaluate our training progress.
* `dataset_text_field = text`: dataset_text_field that corresponds to text containing prompt and its response


In [25]:
peft_training_args = create_peft_training_args()

########################
# START YOUR CODE HERE #
########################
# Replace None with your code

peft_trainer = SFTTrainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=dataset_with_text["train"],
    eval_dataset=dataset_with_text["validation"],
    dataset_text_field="text",
)

######################
# END YOUR CODE HERE #
######################

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using pad_token, but it is not set yet.


Map:   0%|          | 0/1992 [00:00<?, ? examples/s]

Map:   0%|          | 0/619 [00:00<?, ? examples/s]

<a name='6'></a>
# Model Training

These training arguments are verified to run properly on the free T4 GPUs available on Colab.  This training process will take under an hour.  If your notebook times out or you don't have enough time to train, we provide an option below to download the fine-tuned model from one of our course developers [here](https://huggingface.co/mrplants/arphiv).

If you have some Colab credits available, this training process will take significantly less time using the V100 GPU (<15 minutes).

In [26]:
peft_trainer.train()

peft_model_path="uplimit-project-3-phi-1.5"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

  0%|          | 0/1992 [00:00<?, ?it/s]

You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 2.5251, 'learning_rate': 0.0009748995983935744, 'epoch': 0.03}
{'loss': 2.4291, 'learning_rate': 0.0009497991967871486, 'epoch': 0.05}
{'loss': 2.4759, 'learning_rate': 0.0009246987951807229, 'epoch': 0.08}
{'loss': 2.4056, 'learning_rate': 0.0008995983935742972, 'epoch': 0.1}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.495142698287964, 'eval_runtime': 484.2973, 'eval_samples_per_second': 1.278, 'eval_steps_per_second': 1.278, 'epoch': 0.1}
{'loss': 2.422, 'learning_rate': 0.0008744979919678715, 'epoch': 0.13}
{'loss': 2.4471, 'learning_rate': 0.0008493975903614458, 'epoch': 0.15}
{'loss': 2.3673, 'learning_rate': 0.0008242971887550202, 'epoch': 0.18}
{'loss': 2.4108, 'learning_rate': 0.0007991967871485943, 'epoch': 0.2}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4877090454101562, 'eval_runtime': 270.9763, 'eval_samples_per_second': 2.284, 'eval_steps_per_second': 2.284, 'epoch': 0.2}
{'loss': 2.4476, 'learning_rate': 0.0007740963855421687, 'epoch': 0.23}
{'loss': 2.351, 'learning_rate': 0.000748995983935743, 'epoch': 0.25}
{'loss': 2.4081, 'learning_rate': 0.0007238955823293173, 'epoch': 0.28}
{'loss': 2.3732, 'learning_rate': 0.0006987951807228916, 'epoch': 0.3}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.48234224319458, 'eval_runtime': 267.1099, 'eval_samples_per_second': 2.317, 'eval_steps_per_second': 2.317, 'epoch': 0.3}
{'loss': 2.3098, 'learning_rate': 0.000673694779116466, 'epoch': 0.33}
{'loss': 2.405, 'learning_rate': 0.0006485943775100401, 'epoch': 0.35}
{'loss': 2.4131, 'learning_rate': 0.0006234939759036145, 'epoch': 0.38}
{'loss': 2.4148, 'learning_rate': 0.0005983935742971888, 'epoch': 0.4}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.474083185195923, 'eval_runtime': 775.5813, 'eval_samples_per_second': 0.798, 'eval_steps_per_second': 0.798, 'epoch': 0.4}
{'loss': 2.4392, 'learning_rate': 0.0005732931726907631, 'epoch': 0.43}
{'loss': 2.4595, 'learning_rate': 0.0005481927710843374, 'epoch': 0.45}
{'loss': 2.3861, 'learning_rate': 0.0005230923694779117, 'epoch': 0.48}
{'loss': 2.3768, 'learning_rate': 0.0004979919678714859, 'epoch': 0.5}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4746015071868896, 'eval_runtime': 285.1519, 'eval_samples_per_second': 2.171, 'eval_steps_per_second': 2.171, 'epoch': 0.5}
{'loss': 2.3606, 'learning_rate': 0.00047289156626506026, 'epoch': 0.53}
{'loss': 2.4252, 'learning_rate': 0.00044779116465863456, 'epoch': 0.55}
{'loss': 2.2667, 'learning_rate': 0.0004226907630522088, 'epoch': 0.58}
{'loss': 2.4095, 'learning_rate': 0.00039759036144578315, 'epoch': 0.6}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4631741046905518, 'eval_runtime': 282.8698, 'eval_samples_per_second': 2.188, 'eval_steps_per_second': 2.188, 'epoch': 0.6}
{'loss': 2.445, 'learning_rate': 0.00037248995983935745, 'epoch': 0.63}
{'loss': 2.3682, 'learning_rate': 0.0003473895582329317, 'epoch': 0.65}
{'loss': 2.3838, 'learning_rate': 0.00032228915662650605, 'epoch': 0.68}
{'loss': 2.3931, 'learning_rate': 0.0002971887550200803, 'epoch': 0.7}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4599523544311523, 'eval_runtime': 285.3502, 'eval_samples_per_second': 2.169, 'eval_steps_per_second': 2.169, 'epoch': 0.7}
{'loss': 2.3535, 'learning_rate': 0.0002720883534136546, 'epoch': 0.73}
{'loss': 2.3566, 'learning_rate': 0.0002469879518072289, 'epoch': 0.75}
{'loss': 2.4198, 'learning_rate': 0.00022188755020080322, 'epoch': 0.78}
{'loss': 2.3644, 'learning_rate': 0.00019678714859437752, 'epoch': 0.8}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4500410556793213, 'eval_runtime': 266.7208, 'eval_samples_per_second': 2.321, 'eval_steps_per_second': 2.321, 'epoch': 0.8}
{'loss': 2.4057, 'learning_rate': 0.0001716867469879518, 'epoch': 0.83}
{'loss': 2.3058, 'learning_rate': 0.0001465863453815261, 'epoch': 0.85}
{'loss': 2.3329, 'learning_rate': 0.00012148594377510041, 'epoch': 0.88}
{'loss': 2.4092, 'learning_rate': 9.638554216867471e-05, 'epoch': 0.9}


  0%|          | 0/619 [00:00<?, ?it/s]

{'eval_loss': 2.4488143920898438, 'eval_runtime': 267.5047, 'eval_samples_per_second': 2.314, 'eval_steps_per_second': 2.314, 'epoch': 0.9}
{'loss': 2.3211, 'learning_rate': 7.1285140562249e-05, 'epoch': 0.93}
{'loss': 2.3925, 'learning_rate': 4.618473895582329e-05, 'epoch': 0.95}
{'loss': 2.305, 'learning_rate': 2.108433734939759e-05, 'epoch': 0.98}
{'train_runtime': 5828.9958, 'train_samples_per_second': 0.342, 'train_steps_per_second': 0.342, 'train_loss': 2.391199268969187, 'epoch': 1.0}


('uplimit-project-3-phi-1.5/tokenizer_config.json',
 'uplimit-project-3-phi-1.5/special_tokens_map.json',
 'uplimit-project-3-phi-1.5/vocab.json',
 'uplimit-project-3-phi-1.5/merges.txt',
 'uplimit-project-3-phi-1.5/added_tokens.json',
 'uplimit-project-3-phi-1.5/tokenizer.json')

<a name='7'></a>
# Evaluation
Now that the model training is complete, let's review the results using a couple samples from the test set.

In [35]:
device = torch.device('mps')

def generate_summary(source_content: str, model: AutoModelForCausalLM, tokenizer: AutoTokenizer) -> str:
    """Generates a summary for the given academic content using a pre-trained model.

    Args:
        source_content (str): The academic content to summarize.
        model (AutoModelForCausalLM): The pretrained model used for text generation.
        tokenizer (AutoTokenizer): The tokenizer used for encoding and decoding text.

    Returns:
        str: The generated summary.
    """
    prompt = prompt_template.format(source=source_content, bos_token = tokenizer.bos_token)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    outputs = model.generate(input_ids=input_ids,
                             max_new_tokens=100,
                             do_sample=True,
                             temperature=0.7,
                             repetition_penalty=1.5,
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.pad_token_id)

    generated_ids = outputs[0][len(input_ids[0]):]
    return tokenizer.decode(generated_ids, skip_special_tokens=True)

def evaluate_model_summary(index: int, dataset: dict, model: AutoModelForCausalLM, tokenizer: AutoTokenizer) -> None:
    """Evaluates and prints a model-generated summary against a human-created summary.

    Args:
        index (int): The index of the sample in the dataset to summarize.
        dataset (dict): The dataset containing the academic content and human-created summaries.
        model (AutoModelForCausalLM): The pretrained model used for text generation.
        tokenizer (AutoTokenizer): The tokenizer used for encoding and decoding text.
    """
    source_content = ' '.join(dataset['test'][index]['source'])
    baseline_human_summary = dataset['test'][index]['target'][0]
    peft_model_text_output = generate_summary(source_content, model, tokenizer)

    print('-----------------------------')
    print(f"Summarize the following academic content.\n\n{source_content}\n\nSummary: ")
    print('-----------------------------')
    print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}')
    print('-----------------------------')
    print(f'PEFT MODEL SUMMARY:\n{peft_model_text_output}')
    print('-----------------------------')

# Example usage
evaluate_model_summary(0, dataset, peft_model, tokenizer)
evaluate_model_summary(1, dataset, peft_model, tokenizer)

In [28]:
import getpass

token = getpass.getpass("Enter your Hugging Face token: ")
if len(token) <= 0:
    raise ValueError("You need to set your Hugging Face token to upload your model to the Hub.")

If you're happy with the results, upload your LORA model to the HuggingFace Hub!

In [29]:
from huggingface_hub import login
login(token=token)


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/jessica-g/.cache/huggingface/token
Login successful


In [30]:
model_url = peft_trainer.push_to_hub()
print(f'Find your new model here:  {model_url}')

adapter_model.bin:   0%|          | 0.00/6.31M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.68G [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

# <a name='8'></a>
# Real World Application
Now let's try out our model on some data from the wild! The following code retrieves the abstract from a paper in the arXiv database and summarizes it using our model.

# RESTART THE KERNEL IF YOU DON'T HAVE ENOUGH GPU MEMORY

## Step 1: Download the PEFT model from the hub and create a new model

Since the model that is uploaded to the hub is a LoRA model, it is much smaller (~25MB for phi 1.5) than the original model. Go and check the size on HF hub.

In order to use this model for inference, we need to add LoRA weights to original model's weights.

Below code shows the steps to download a LoRA model from hub and create a model ready for inference.

In [1]:
!pip install transformers[torch] datasets peft einops trl --quiet

zsh:1: no matches found: transformers[torch]


In [3]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

########################
# START YOUR CODE HERE #
########################
# Replace None with your code

peft_model_id =  'jessica-ecosia/uplimit-project-3-phi-1.5'
config = PeftConfig.from_pretrained(peft_model_id)
inference_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, device_map="auto", trust_remote_code=True)
inference_tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
inference_model = PeftModel.from_pretrained(inference_model, peft_model_id)

inference_tokenizer.pad_token_id = inference_tokenizer.eos_token_id

Downloading (…)/adapter_config.json:   0%|          | 0.00/468 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading adapter_model.bin:   0%|          | 0.00/6.31M [00:00<?, ?B/s]

## Step 2: Download abstracts from arxiv and summarize them

In [4]:
import urllib.request
import xml.etree.ElementTree as ET

def fetch_arxiv_abstract_by_id(arxiv_id: str) -> str:
    """Fetches the abstract of a paper from the arXiv database using its ID.

    Args:
        arxiv_id (str): The arXiv identifier for the paper.

    Returns:
        str: The abstract of the paper.
    """
    # Construct the API URL for the arXiv paper
    url = f'http://export.arxiv.org/api/query?id_list={arxiv_id}'

    try:
        # Fetch the XML data
        response = urllib.request.urlopen(url)
        xml_data = response.read().decode('utf-8')

        # Parse the XML data
        root = ET.fromstring(xml_data)

        # Find the <summary> element which contains the abstract
        for entry in root.find('{http://www.w3.org/2005/Atom}entry'):
            if entry.tag == '{http://www.w3.org/2005/Atom}summary':
                return entry.text.strip()

    except Exception as e:
        return f"An error occurred: {e}"

    return "Abstract not found"

arxiv_id = "2103.00020"  # Replace with the arXiv ID of the paper you're interested in
abstract = fetch_arxiv_abstract_by_id(arxiv_id)
print()
print(f"Abstract for paper {arxiv_id}:\n{abstract}")
print()


Abstract for paper 2103.00020:
State-of-the-art computer vision systems are trained to predict a fixed set
of predetermined object categories. This restricted form of supervision limits
their generality and usability since additional labeled data is needed to
specify any other visual concept. Learning directly from raw text about images
is a promising alternative which leverages a much broader source of
supervision. We demonstrate that the simple pre-training task of predicting
which caption goes with which image is an efficient and scalable way to learn
SOTA image representations from scratch on a dataset of 400 million (image,
text) pairs collected from the internet. After pre-training, natural language
is used to reference learned visual concepts (or describe new ones) enabling
zero-shot transfer of the model to downstream tasks. We study the performance
of this approach by benchmarking on over 30 different existing computer vision
datasets, spanning tasks such as OCR, action recog

## Step 3

Create a prompt template and create an abstract

In [5]:
prompt_template = """Summarize the following academic content.\n
{source}

Summary:"""

prompt = prompt_template.format(source=abstract)
print(prompt)

Summarize the following academic content.

State-of-the-art computer vision systems are trained to predict a fixed set
of predetermined object categories. This restricted form of supervision limits
their generality and usability since additional labeled data is needed to
specify any other visual concept. Learning directly from raw text about images
is a promising alternative which leverages a much broader source of
supervision. We demonstrate that the simple pre-training task of predicting
which caption goes with which image is an efficient and scalable way to learn
SOTA image representations from scratch on a dataset of 400 million (image,
text) pairs collected from the internet. After pre-training, natural language
is used to reference learned visual concepts (or describe new ones) enabling
zero-shot transfer of the model to downstream tasks. We study the performance
of this approach by benchmarking on over 30 different existing computer vision
datasets, spanning tasks such as OCR, a

## Step 4: Use the model to summarize the abstract

In [7]:
device = torch.device('mps')

input_ids = inference_tokenizer(prompt, return_tensors="pt").input_ids.to(device)

outputs = inference_model.generate(input_ids=input_ids,
                        max_new_tokens=50,
                          do_sample=True,
                          temperature=0.7,
                          repetition_penalty=1.5,
                          eos_token_id=inference_tokenizer.eos_token_id,
                          pad_token_id=inference_tokenizer.pad_token_id)

generated_ids = outputs[0][len(input_ids[0]):]
summary = inference_tokenizer.decode(generated_ids, skip_special_tokens=True)
print("TLDR target: \n%s"%summary)

TLDR target: 

    We present two stateless zero shot models based upon Natural Language Understanding\n" "and Transfer learning using deep networks."

   Reference paper link : https:/arxiv... /abs\/180963856 \r\\use


<a name='9'></a>
# Conclusion

This project provided a comprehensive exploration into the realm of machine learning, specifically focusing on the application of Parameter-Efficient Fine-Tuning (PEFT) and Low Rank Adaptation (LORA) techniques. Utilizing the phi-1.5 model, we successfully fine-tuned it to generate concise and coherent summaries of scientific papers using the SciTLDR dataset from the Allen Institute for AI.

Through the course of the project, we delved into various aspects such as data preprocessing and model trainingmetrics. Additionally, we explored ways to fetch real-world data from the arXiv database for practical testing. The project not only served as a hands-on experience in applying advanced NLP techniques but also laid the foundation for further research and development in automating the summarization of academic content.