Dataset Card for `tldr_news`

original_dataset: https://github.com/JulesBelveze/tldr_news/blob/main/1.3.0.tar.gz?raw=true

annotations_creators: other
language_creators: other
language: en
multilinguality: monolingual
pretty_name: tldr_news
size_categories: 1K<n<10K
source_datasets: original
task_categories: summarization, text2text-generation, text-generation
task_ids: news-articles-headline-generation, text-simplification, language-modeling

Dataset Description

Homepage: https://tldr.tech/newsletter

Dataset Summary

The tldr_news dataset was constructed by collecting a daily tech newsletter (available at https://tldr.tech/newsletter). For every piece of news, the headline and its corresponding content were extracted. The newsletter is also divided into sections, and the section name is attached to every piece of news as extra information.

Such a dataset can be used to train a model to generate a headline from an input piece of text.

Supported Tasks and Leaderboards

There are no officially supported tasks or leaderboards for this dataset. However, it could be used for the following tasks:

summarization
headline generation
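
As a rough illustration of the headline-generation use case, the sketch below turns one record into a (source, target) pair, with the news content as model input and the headline as target. The function name `to_headline_pair` and the `"headline: "` prompt prefix are assumptions made for this example; they are not part of the dataset or of any particular training recipe.

```python
from typing import Dict, List, Tuple


def to_headline_pair(record: Dict[str, str], prefix: str = "headline: ") -> Tuple[str, str]:
    """Map one record to (model input, target headline).

    The prefix is a hypothetical task prompt; drop it or change it to
    match whatever the chosen model expects.
    """
    source = prefix + record["content"].strip()
    target = record["headline"].strip()
    return source, target


# Toy record, shortened from the example shown in this card.
records: List[Dict[str, str]] = [
    {
        "headline": "Cana Unveils Molecular Beverage Printer ",
        "content": "Cana has unveiled a drink machine that can synthesize almost any drink.",
        "category": "Science and Futuristic Technology",
    }
]

pairs = [to_headline_pair(r) for r in records]
print(pairs[0])
```

Stripping whitespace matters here because some headlines in the data carry a trailing space.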

Languages

English (en)

Dataset Structure

Data Instances

A data point comprises a "headline" and its corresponding "content". An example is as follows:

{
  "headline": "Cana Unveils Molecular Beverage Printer, a ‘Netflix for Drinks’ That Can Make Nearly Any Type of Beverage ",
  "content": "Cana has unveiled a drink machine that can synthesize almost any drink. The machine uses a cartridge that contains flavor compounds that can be combined to create the flavor of nearly any type of drink. It is about the size of a toaster and could potentially save people from throwing hundreds of containers away every month by allowing people to create whatever drinks they want at home. Around $30 million was spent building Cana’s proprietary hardware platform and chemistry system. Cana plans to start full production of the device and will release pricing by the end of February.",
  "category": "Science and Futuristic Technology"
}

Data Fields

headline (str): the headline of the news item
content (str): the body text of the news item
category (str): the newsletter section the news item appeared in
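
The three fields above can be modeled as a small typed record. This is a minimal sketch under the assumption that every field is a plain string, as listed; the names `NewsItem` and `validate_item` are hypothetical helpers introduced for illustration and do not come from the dataset's loading script.

```python
from typing import Dict, TypedDict


class NewsItem(TypedDict):
    headline: str
    content: str
    category: str


REQUIRED_FIELDS = ("headline", "content", "category")


def validate_item(raw: Dict[str, object]) -> NewsItem:
    """Check that all three fields are present and are strings."""
    for name in REQUIRED_FIELDS:
        if not isinstance(raw.get(name), str):
            raise ValueError(f"field {name!r} is missing or not a string")
    return NewsItem(
        headline=str(raw["headline"]),
        content=str(raw["content"]),
        category=str(raw["category"]),
    )


item = validate_item(
    {
        "headline": "Example headline",
        "content": "Example content.",
        "category": "Science and Futuristic Technology",
    }
)
print(item["category"])
```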

Data Splits

all: a single split containing every daily newsletter available at https://tldr.tech/newsletter.

Dataset Creation

Curation Rationale

This dataset was obtained by scraping all the existing newsletters available at https://tldr.tech/newsletter.

Each newsletter was then processed to extract its individual pieces of news. For every collected piece of news, the headline and the news content were extracted.

Source Data

Initial Data Collection and Normalization

The dataset was collected from https://tldr.tech/newsletter.

To clean up the samples and to construct a dataset better suited for headline generation, we applied two normalization steps:

The headlines initially contained an estimated read time in parentheses; we stripped this information from the headline.
Some news items are sponsored and thus do not belong to any newsletter section. We created an additional category, "Sponsor", for such samples.
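
The two normalization steps could be sketched roughly as follows. This is not the code used to build the dataset. The exact wording of the read-time marker is an assumption (here a trailing "(N minute read)"), as is the idea that a sponsored item shows up with no section name; `strip_read_time` and `assign_category` are hypothetical helper names.

```python
import re
from typing import Optional

# Assumption: the read time appears at the end of the headline,
# formatted like "(3 minute read)".
READ_TIME_RE = re.compile(r"\s*\(\s*\d+\s+minute\s+read\s*\)\s*$", re.IGNORECASE)


def strip_read_time(headline: str) -> str:
    """Remove a trailing read-time marker; leave other headlines untouched."""
    return READ_TIME_RE.sub("", headline)


def assign_category(section: Optional[str]) -> str:
    """Use the newsletter section if there is one, otherwise 'Sponsor'."""
    return section if section else "Sponsor"


print(strip_read_time("Big News (3 minute read)"))  # -> Big News
print(assign_category(None))                        # -> Sponsor
```

Anchoring the pattern to the end of the string keeps other parenthesized text in a headline, such as a product version, from being removed.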

Who are the source language producers?

The people (or person) behind the https://tldr.tech/ newsletter.

Annotations

Annotation process

Disclaimer: The dataset was generated from a daily newsletter. The newsletter's authors did not intend for it to be used as a dataset.

Who are the annotators?

The newsletters were written by the people behind TLDR tech.

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

This dataset contains only tech news. A model trained on it might not generalize to other domains.

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The dataset was obtained by collecting newsletters from this website: https://tldr.tech/newsletter

Contributions

Thanks to @JulesBelveze for adding this dataset.
