Description
We already have a toy example with the imdb reward model. Let's create one with an actual RLHF task: [Learning to summarize with human feedback](https://arxiv.org/abs/2009.01325).
For this issue, we'll focus only on creating the dataset itself, not on training our language (policy) model; that will be handled in another card.
Here's an example of a dataset that I created for the imdb toy scenario: https://huggingface.co/datasets/thejaminator/imdb_rewarded.
For summarization, the dataset will be slightly different. It should have the columns (prompt, completion, reward), where:

- `prompt` is the article to summarize.
- `completion` is the summary produced by the human labeller.
- `reward` is the score assigned by the reward model (RM).

A sketch of building this schema follows the list.
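As a sketch of how these rows could be built, the snippet below assumes the `openai/summarize_from_feedback` dataset on the Hub (the data released with the paper linked above); the field names (`info`, `post`, `summaries`, `choice`) are my reading of its schema and should be verified against the dataset card:

```python
from datasets import load_dataset

# Assumption: the "comparisons" config of openai/summarize_from_feedback,
# where each example holds the original post plus two candidate summaries
# and the labeller's choice between them.
comparisons = load_dataset("openai/summarize_from_feedback", "comparisons", split="validation")

def to_row(example):
    return {
        "prompt": example["info"]["post"],  # article to summarize
        "completion": example["summaries"][example["choice"]]["text"],  # labeller-preferred summary
        "reward": 0.0,  # placeholder; filled in by the reward model below
    }

dataset = comparisons.map(to_row, remove_columns=comparisons.column_names)
print(dataset[0])
```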
For the reward model, I'm not aware of a pretrained RM built purely for this task that we can just use. However, OpenAssistant does have an RM that was trained on multiple preference datasets, including the summarization dataset, so perhaps that's good enough.
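A minimal sketch of scoring a (prompt, completion) pair with that RM, assuming the `OpenAssistant/reward-model-deberta-v3-large-v2` checkpoint (one of their published reward models); the example strings and the `max_length` choice are placeholders:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

@torch.no_grad()
def reward(prompt: str, completion: str) -> float:
    # The RM scores a (prompt, completion) text pair with a single logit,
    # which we take directly as the reward.
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True, max_length=512)
    return model(**inputs).logits[0].item()

print(reward("Post: my cat knocked my coffee off the desk...", "TL;DR: cat spilled my coffee."))
```

Filling the `reward` column would then just be a `map` of this function over the rows built in the earlier sketch.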
For reference, see examples/imdb/export_imdb_reward_dataset.py and https://huggingface.co/docs/datasets/upload_dataset.
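Putting it together, uploading would look something like the sketch below, per the `push_to_hub` flow in the upload guide above; the repo id `thejaminator/summarize_rewarded` is hypothetical, and you'd need to be logged in via `huggingface-cli login` first:

```python
from datasets import Dataset

# Tiny stand-in dataset; in practice this would be the mapped dataset from
# the sketches above. The repo id below is hypothetical.
dataset = Dataset.from_dict({
    "prompt": ["<article to summarize>"],
    "completion": ["<human summary>"],
    "reward": [0.42],
})
dataset.push_to_hub("thejaminator/summarize_rewarded")
```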