
Alpaca Dataset Updates and Fixes #303

Merged: 5 commits merged into main from fix_alpaca on Feb 4, 2024

Conversation

kartikayk (Contributor) commented Feb 4, 2024

Context

Our current Alpaca dataset implementation doesn't allow us to train on the inputs, i.e., to leave the input unmasked during training. Looking at reference implementations, this is pretty common and is the only way we can replicate their training curves.

The class is also written in a way that doesn't let the user easily switch between the different variations of the Alpaca dataset. Using the cleaned version of the dataset allows the loss to go down faster.

This PR adds both of these features, along with tests for the Alpaca dataset.

Thanks @ebsmothers for helping find some of these issues!

Changelog

  • Rewrite the Alpaca dataset so users can easily switch between the original and cleaned versions of the dataset via the use_clean flag (see the usage sketch after this list)
  • Allow training on the input (i.e., no prompt masking) via the train_on_input flag
  • Add tests for the Alpaca dataset
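
For illustration, a minimal sketch of how the two new flags combine. The get_dataset call and flag names appear in this PR; passing both flags together through get_dataset kwargs is an assumption, and the tokenizer setup is assumed and omitted:

# Sketch only: flag names come from this PR's changelog; `tokenizer` is
# assumed to be an already-constructed torchtune tokenizer.
from torchtune import datasets

# Cleaned Alpaca variant, with the prompt included in the loss:
alpaca_ds = datasets.get_dataset(
    "alpaca", tokenizer=tokenizer, use_clean=True, train_on_input=True
)

# Original Alpaca variant, with the prompt masked out of the loss:
alpaca_ds = datasets.get_dataset(
    "alpaca", tokenizer=tokenizer, use_clean=False, train_on_input=False
)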

Test plan

  • Unit tests, including the newly added test_alpaca_dataset, succeed:
    pytest tests
    [Screenshot: all unit tests passing]
  • Training loss is closer to expectation (loss < 1). In the screenshot, blue is the run with the cleaned dataset; orange is the run with the original dataset.
    [Screenshot: training loss curves for both runs]

Comment on why we're changing the loss values in test_finetune_llm.py

The loss changes because of a small difference in how the input and label tokens are generated:

  • [old]: tokenizer.encode(text=prompt, add_bos=True, add_eos=False) + tokenizer.encode(text=response, add_bos=False, add_eos=True)
  • [new]: tokenizer.encode(text=prompt+response, add_bos=True, add_eos=True)

This creates a small difference in the tokenized output, which results in a change in the loss:

[Screenshot: loss value comparison]
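
To make the boundary effect concrete, a sketch using the encode signature from the bullets above (the tokenizer itself is assumed):

# Old behavior: prompt and response encoded separately, then concatenated.
old_tokens = tokenizer.encode(text=prompt, add_bos=True, add_eos=False) + \
    tokenizer.encode(text=response, add_bos=False, add_eos=True)

# New behavior: prompt and response encoded as a single string.
new_tokens = tokenizer.encode(text=prompt + response, add_bos=True, add_eos=True)

# A subword tokenizer can merge the end of the prompt with the start of the
# response into tokens that neither piece produces on its own, so the two
# sequences (and therefore the loss) differ slightly.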

netlify bot commented Feb 4, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: 1ada47f
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65bf051b7888af0008c4b36e
😎 Deploy Preview: https://deploy-preview-303--torchtune-preview.netlify.app

facebook-github-bot added the "CLA Signed" label on Feb 4, 2024
ebsmothers (Contributor) left a comment:

Looks great, thanks for getting this up and tested so quickly! My few comments are all nits, so feel free to take or leave any of them.

Comment on lines 93 to 95
instruction = self._data[index]["instruction"],
input = self._data[index]["input"],
output = self._data[index]["output"]
ebsmothers (Contributor):

nit: could just define sample = self._data[index] to avoid multiple calls

kartikayk (Contributor, Author) replied:

Great catch!
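
The suggestion amounts to something like this (a sketch; field names come from the snippet above):

# Index the backing data once instead of three times.
sample = self._data[index]
instruction = sample["instruction"]
input = sample["input"]
output = sample["output"]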



class AlpacaDataset(Dataset):
"""
PyTorch Representation of the Alpaca Dataset
Support for the Alpaca dataset and it's variants from HuggingFace Datasets.
ebsmothers (Contributor):

nit

Suggested change
Support for the Alpaca dataset and it's variants from HuggingFace Datasets.
Support for the Alpaca dataset and its variants from HuggingFace Datasets.

alpaca_dataset = datasets.get_dataset("alpaca", tokenizer=tokenizer)

# alpaca_dataset._data contains the raw data loaded from HF's dataset. We need the raw data
# to test the prompt generation since calling __get__item on the alpaca_dataset object will
ebsmothers (Contributor):

nit

Suggested change
# to test the prompt generation since calling __get__item on the alpaca_dataset object will
# to test the prompt generation since calling __getitem__ on the alpaca_dataset object will

@patch("torchtune.datasets.alpaca.load_dataset")
def test_prompt_generation(self, load_dataset, tokenizer):
"""
Test the the prompt generation based on the alpaca template is correct.
ebsmothers (Contributor):

nit

Suggested change
Test the the prompt generation based on the alpaca template is correct.
Test that the prompt generation based on the alpaca template is correct.

kartikayk (Contributor, Author):

Thanks so much @ebsmothers for the quick review! Addressed all comments.

@kartikayk kartikayk merged commit aaf43de into main Feb 4, 2024
15 checks passed
@kartikayk kartikayk deleted the fix_alpaca branch February 4, 2024 03:51
]

alpaca_dataset = datasets.get_dataset(
"alpaca", tokenizer=tokenizer, use_clean=True
Reviewer (Member):

Can we parametrize this together with test_label_masking instead, since the only difference is the use_clean flag?
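
For reference, a sketch of what that parametrization could look like inside the existing test class (test and fixture names follow the snippets in this thread; assertions elided):

import pytest
from unittest.mock import patch

from torchtune import datasets

@pytest.mark.parametrize("use_clean", [False, True])
@patch("torchtune.datasets.alpaca.load_dataset")
def test_label_masking(self, load_dataset, tokenizer, use_clean):
    # One test body covers both dataset variants; only the flag differs.
    alpaca_dataset = datasets.get_dataset(
        "alpaca", tokenizer=tokenizer, use_clean=use_clean
    )
    ...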

where `instruction`, `input`, and `output` are fields from the dataset.

Masking of the prompt during training is controlled by the `train_on_input` flag, which is
set to `True` by default (ref: https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L49)
Reviewer (Member):

What are our thoughts on referring to reference implementations in torchtune? Citing them might imply that we, as torchtune, are certifying that repo as a reference we endorse and want to compare against publicly.
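
Separately from the citation question, the masking behavior the quoted docstring describes can be sketched as follows (the ignore-index value is an assumption based on PyTorch's cross-entropy default, not the exact torchtune code):

# When train_on_input is False, prompt tokens are excluded from the loss by
# setting their labels to the ignore index.
IGNORE_IDX = -100  # assumed; torch.nn.CrossEntropyLoss ignores this by default

prompt_tokens = tokenizer.encode(text=prompt, add_bos=True, add_eos=False)
tokens = tokenizer.encode(text=prompt + response, add_bos=True, add_eos=True)

labels = list(tokens)
if not train_on_input:
    labels[: len(prompt_tokens)] = [IGNORE_IDX] * len(prompt_tokens)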
