original_dataset: https://github.com/JulesBelveze/tldr_news/blob/main/1.3.0.tar.gz?raw=true
annotations_creators:
- other
language_creators:
- other
language:
- en
multilinguality:
- monolingual
pretty_name: tldr_news
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- summarization
- text2text-generation
- text-generation
task_ids:
- news-articles-headline-generation
- text-simplification
- language-modeling
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
- Homepage: https://tldr.tech/newsletter
The tldr_news dataset was constructed by collecting a daily tech newsletter (available at https://tldr.tech/newsletter). Then, for every piece of news, the headline and its corresponding content were extracted.
The newsletter also contains different sections, and we add this section information to every piece of news.
Such a dataset can be used to train a model to generate a headline from an input piece of text.
There are no officially supported tasks or leaderboards for this dataset. However, it could be used for the following tasks:
- summarization
- headline generation
en
A data point comprises a "headline" and its corresponding "content". An example is as follows:
{
"headline": "Cana Unveils Molecular Beverage Printer, a ‘Netflix for Drinks’ That Can Make Nearly Any Type of Beverage ",
"content": "Cana has unveiled a drink machine that can synthesize almost any drink. The machine uses a cartridge that contains flavor compounds that can be combined to create the flavor of nearly any type of drink. It is about the size of a toaster and could potentially save people from throwing hundreds of containers away every month by allowing people to create whatever drinks they want at home. Around $30 million was spent building Cana’s proprietary hardware platform and chemistry system. Cana plans to start full production of the device and will release pricing by the end of February.",
"category": "Science and Futuristic Technology"
}
- headline (str): the headline of the piece of news
- content (str): the piece of news
- category (str): the newsletter section
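As a minimal sketch of how these three fields might be used for headline generation, the snippet below types a record and turns it into a (source, target) pair. The names `NewsItem` and `to_headline_pair` are hypothetical and not part of the dataset, the example record is abbreviated from the one above, and stripping whitespace is an illustrative choice.

```python
from typing import Tuple, TypedDict


class NewsItem(TypedDict):
    """Hypothetical type mirroring the three documented fields."""
    headline: str  # the headline of the piece of news
    content: str   # the piece of news itself
    category: str  # the newsletter section


def to_headline_pair(item: NewsItem) -> Tuple[str, str]:
    """Return (source, target): the content is the input, the headline the target."""
    # Whitespace stripping is an assumption made for this sketch only.
    return item["content"].strip(), item["headline"].strip()


# Abbreviated version of the example record shown above.
example: NewsItem = {
    "headline": "Cana Unveils Molecular Beverage Printer ",
    "content": "Cana has unveiled a drink machine that can synthesize almost any drink.",
    "category": "Science and Futuristic Technology",
}

source, target = to_headline_pair(example)
```

The `category` field is left out of the pair here; it could equally be prepended to the source text if section-aware generation is wanted.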
- all: all existing daily newsletters available at https://tldr.tech/newsletter.
This dataset was obtained by scraping all the existing newsletters available at https://tldr.tech/newsletter.
Every newsletter was then processed to extract the individual pieces of news. Then, for every collected piece of news, the headline and the news content were extracted.
The dataset was collected from https://tldr.tech/newsletter.
In order to clean up the samples and to construct a dataset better suited for headline generation we have applied a couple of normalization steps:
- The headlines initially contain an estimated read time in parentheses; we stripped this information from the headline.
- Some news items are sponsored and thus do not belong to any newsletter section. We create an additional category "Sponsor" for such samples.
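The two normalization steps could be sketched roughly as follows. This is not the code used to build the dataset: the read-time pattern (for example "(3 minute read)") is an assumed format, and the function name `normalize` and the use of `None` for a missing section are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed read-time format at the end of a headline, e.g. "(3 minute read)".
# The dataset card does not specify the exact wording.
READ_TIME = re.compile(r"\s*\(\d+\s+minute\s+read\)\s*$", re.IGNORECASE)


def normalize(headline: str, category: Optional[str] = None) -> Tuple[str, str]:
    """Strip a trailing read-time marker and fill in the "Sponsor" category."""
    clean_headline = READ_TIME.sub("", headline).strip()
    # Sponsored items have no newsletter section, so they get "Sponsor".
    clean_category = category if category else "Sponsor"
    return clean_headline, clean_category
```

A headline without a read-time marker passes through unchanged apart from surrounding whitespace.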
The people (or person) behind the https://tldr.tech/ newsletter.
Disclaimers: The dataset was generated from a daily newsletter. The authors did not intend those newsletters to be used as a dataset.
The newsletters were written by the people behind TLDR tech.
[Needs More Information]
[Needs More Information]
This dataset only contains tech news. A model trained on such a dataset might not be able to generalize to other domains.
[Needs More Information]
The dataset was obtained by collecting newsletters from this website: https://tldr.tech/newsletter
Thanks to @JulesBelveze for adding this dataset.