Bad dataset #65

Open
abacaj opened this issue Mar 19, 2023 · 18 comments
Comments

@abacaj

abacaj commented Mar 19, 2023

If anyone is curious, here is my run on the Alpaca dataset using another decoder model (codegen-16B-nl). The dataset doesn't appear to be diverse: it contains multiple closely related answers. I don't believe a model trained on this dataset will generalize well to new data.

The loss in the original Alpaca training script follows a pattern similar to the one used in OPT-IML, computing the loss based on the label.

image

My run on codegen-16B-nl

image

Another user's run on LLaMA 7B

image

Some more discussion: https://twitter.com/abacaj/status/1637310768780648448

@Wingie

Wingie commented Mar 19, 2023

Hmm, I wonder if we could get better results using a higher-quality dataset like https://github.com/allenai/natural-instructions/tree/master/splits/default

@kooshi
Contributor

kooshi commented Mar 19, 2023

Yeah, I'm currently playing with LAION OIG small-chip2, but I'm really looking forward to the result of their Open-Assistant project, which will create a fully human-generated, refined, and curated instruction dataset.

@baleksey

baleksey commented Mar 19, 2023

The dataset is the problem, I guess. Here are just two examples of inputs/outputs that teach the model nothing but hallucination.

{
    "instruction": "Retrieve the last three transactions from the customer's account.",
    "input": "Customer ID: 12333",
    "output": "Transaction 1: USD"
},

{
    "instruction": "Transcribe the recording into text.",
    "input": "Recording about the history of Waterloo",
    "output": "The Battle of Waterloo, fought on June 18, 1815, was the last major battle of the Napoleonic Wars. It marked the end of Napoleon Bonaparte's rule. The battle was fought between Napoleon Bonaparte's French forces and an allied force of British, Dutch, Belgian and Prussian troops, led by the Duke of Wellington. It was fought near the town of Waterloo in the former Austrian Netherlands, now present-day Belgium. The battle was a defeat for Napoleon's forces and ended Napoleon's rule over France....
}

I'm sure there are a lot of such bad instructions in there, which cause more problems than proper learning.

@gururise
Contributor

gururise commented Mar 19, 2023

There is definitely a lot of bad data in the dataset. I went through and manually cleaned hundreds of issues resulting in the current "cleaned" dataset, but there are likely many more issues I missed.

I came across and fixed many issues referencing data on the internet that would likely just cause hallucinations, for example:

{
   "instruction": "Describe what is shown in this painting in one sentence",
   "input": "https://works.bepress.com/jing-chen/3/download/",
   "output": "This painting depicts a mother embracing her child in a loving embrace, surrounded by a peaceful and tranquil atmosphere."
}

There were hundreds of similar examples in the original dataset.
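
A check for this kind of record can be partly automated. The snippet below is only a rough sketch, not the process used to produce the cleaned dataset: it assumes records with the `instruction`/`input`/`output` keys shown in this thread, and the helper name `references_url` and the regex are hypothetical choices made for illustration.

```python
import re

# Assumption: a loose pattern for http(s) links, chosen for illustration only.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def references_url(record: dict) -> bool:
    """Return True if the instruction or input points at an external URL."""
    text = f"{record.get('instruction', '')} {record.get('input', '')}"
    return bool(URL_PATTERN.search(text))


records = [
    {
        "instruction": "Describe what is shown in this painting in one sentence",
        "input": "https://works.bepress.com/jing-chen/3/download/",
        "output": "This painting depicts a mother embracing her child in a loving embrace, surrounded by a peaceful and tranquil atmosphere.",
    },
    {
        "instruction": "Name 5 US cities with populations over 1 million.",
        "input": "",
        "output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix.",
    },
]

# Split the records into those needing manual review and those left alone.
flagged = [r for r in records if references_url(r)]
kept = [r for r in records if not references_url(r)]
print(len(flagged), len(kept))
```

Flagged records would still need a human look, since a URL in the input is only a hint that the model cannot see the referenced content.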

@abacaj Are the training results you posted of the original dataset or the cleaned dataset?

@abacaj
Author

abacaj commented Mar 19, 2023

The runs posted are from the original dataset; unfortunately, I only discovered the cleaned dataset after posting on Twitter. I can probably do a run with a smaller model to see if it improves the loss. Personally, I'm not sure that removing what appear to be bad samples will help.

To me it seems like the labels used to compute the loss contain very similar answers, which could lead to overfitting and reduce the ability to generalize. Here are 3 examples I found. Maybe it's not an issue, but it appears to contribute to the flat loss, i.e. the dataset is not diverse:

{
    "instruction": "Generate a list of 5 American cities with populations higher than 1 million.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston, Phoenix."
},

{
    "instruction": "Name 5 US cities with populations over 1 million.",
    "input": "",
    "output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix."
},

{
    "instruction": "Generate a list of four US cities with over one million population.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston."
},
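
One way to surface records like the three above is to compare outputs by token overlap. This is a minimal sketch under stated assumptions, not something anyone in this thread ran: the helper names, the overlap-coefficient measure, and the 0.8 threshold are all hypothetical, and the pairwise loop is quadratic in the number of records, so a full dataset would need a cheaper candidate-selection step first.

```python
import re
from itertools import combinations


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Overlap coefficient: shared tokens divided by the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def near_duplicate_pairs(records, threshold=0.8):
    """Return index pairs whose outputs overlap at or above the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if overlap(a["output"], b["output"]) >= threshold
    ]


records = [
    {"output": "New York City, Los Angeles, Chicago, Houston, Phoenix."},
    {"output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix."},
    {"output": "New York City, Los Angeles, Chicago, Houston."},
]

print(near_duplicate_pairs(records))  # [(0, 1), (0, 2), (1, 2)]
```

The overlap coefficient is used here because it still scores 1.0 when one answer is a strict subset of another, as with the four-city and five-city lists; the trade-off is that very short outputs can match spuriously.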

@tloen
Owner

tloen commented Mar 19, 2023

Following our discussion on Twitter, here is a screenshot of my current alpaca-lora training run (losses are a bit higher because I'm masking out the instruction in the loss):

Screenshot_20230319_122451

I'm starting to drift towards the idea that we should probably abandon the Alpaca dataset entirely once we get a suitable SFT dataset from the Open-Assistant project, or at least diversify the seed prompts in the original repo.

@abacaj
Author

abacaj commented Mar 19, 2023

Looks better. We could probably improve quality by filtering out duplicate instruction/answer pairs from the dataset and keeping only the best ones.

I’m curious how you did the masking, because I did something similar in my run by applying IGNORE_INDEX to the labels up to the instruction prompt length.
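
For readers unfamiliar with this kind of masking, here is a minimal sketch of the idea using plain Python lists. It is not the code from either run: the function name `mask_prompt_labels` and the token ids are made up, and it assumes `IGNORE_INDEX` is -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # assumption: matches the default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels and overwrite the first prompt_len positions with IGNORE_INDEX."""
    labels = list(input_ids)
    n = min(prompt_len, len(labels))  # guard against a prompt longer than the sequence
    labels[:n] = [IGNORE_INDEX] * n
    return labels


# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 42, 43, 2]
```

With labels built this way, only the response tokens contribute to the loss, so the reported value is averaged over fewer and typically harder-to-predict positions.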

Just realized your loss is still a bit of a flatline, like my previous run. I think validation loss will show that it is overfitting.
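
Checking this needs a held-out split that the model never trains on. The following is a small sketch under assumptions, not the setup used in any run here: `train_val_split`, the 10% fraction, and the seed are hypothetical.

```python
import random


def train_val_split(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of the records with a fixed seed and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in data: integers in place of instruction/output records.
records = list(range(100))
train, val = train_val_split(records)
print(len(train), len(val))  # 90 10
```

One caveat for this dataset in particular: if near-duplicate samples land on both sides of a random split, the validation loss will look better than it should, so deduplicating before splitting is probably worthwhile.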

@samching

Maybe tangentially related, but @tloen, I'm curious why you might want to leave typos in the dataset (per #32 (comment)).

@teknium1

Not my place to respond, but I would say leaving typos in the prompt teaches the model to interpret a typo as the word it was meant to be and respond accordingly.

@abacaj
Author

abacaj commented Mar 20, 2023

Makes sense to me as well for the prompt; the outputs in the dataset should aim to be correct.

@teknium1

I agree with that for sure.

@Wingie

Wingie commented Mar 20, 2023

LAION's dataset can be found here https://github.com/LAION-AI/Anh/tree/main/data in case anyone wants to give it a try in training!

@samching

Interesting - it looks like 100K lines of User: | Assistant: input/output pairs, pulled from different dataset sources. I wonder if this represents the latest from these efforts?

@gururise
Contributor

I started a new effort to try and clean up the current Alpaca dataset:
https://github.com/gururise/AlpacaDataCleaned

@conceptofmind

conceptofmind commented Mar 22, 2023

I am working on putting together a FLAN dataset as well, to upload to the HF hub.

I'm also training 7B and 13B LLaMA models on OIG in bf16 without LoRA. Will have those out soon.

@claysauruswrecks
Contributor

> Maybe tangentially related, but @tloen curious why you might want to leave typos in the dataset (per #32 (comment))

> Not my place to respond, but I would say leaving typos in the prompt makes it understand the typo should be thought of as what it is meant to be, and respond accordingly

My intuition is that we should keep the training data scoped and focused. Correct all typos in training data that does not target the skill of correcting misspellings. Create more training prompts (there are some already) specifically focused on the transition through:

  1. Identifying misspelled input
  2. Correcting the spelling from context
  3. Understanding the corrected input

@claysauruswrecks
Contributor

I've opened #152 to start the process of vendoring datasets in other repos.

I went through all the history for alpaca_data_cleaned.json in this repo to make sure the big fixes were in the vendored submodule.

Next, I will go through and improve the training prompts in @gururise's repo.

@conceptofmind

conceptofmind commented Mar 25, 2023


10 participants