Multi-label Text Classification

This repository holds code for collecting data from arXiv to build a multi-label text classification dataset, along with a simple classifier trained on that dataset. Our dataset is now available on Kaggle. The dataset collection process is shown in this notebook. We use Apache Beam to build our data collection pipeline, which can be run on Dataflow at scale. We hope the data will be a useful benchmark for building multi-label text classification systems.
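To make the "multi-label" part concrete, here is a minimal sketch of how a list of category labels per paper could be turned into a multi-hot target vector. It assumes each record pairs a paper with a list of arXiv category labels; the field names (`title`, `terms`), the sample records, and the helper functions are hypothetical and are not taken from this repository's code.

```python
from typing import Dict, List, Sequence

# Hypothetical sample records: each paper may carry several category labels.
records = [
    {"title": "Paper A", "terms": ["cs.CV", "cs.LG"]},
    {"title": "Paper B", "terms": ["stat.ML"]},
    {"title": "Paper C", "terms": ["cs.LG", "stat.ML"]},
]


def build_vocab(label_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct label to a column index, in sorted order."""
    unique_labels = sorted({label for labels in label_lists for label in labels})
    return {label: index for index, label in enumerate(unique_labels)}


def multi_hot(labels: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Encode one paper's labels as a 0/1 vector with one slot per known label."""
    vector = [0] * len(vocab)
    for label in labels:
        vector[vocab[label]] = 1
    return vector


vocab = build_vocab([record["terms"] for record in records])
targets = [multi_hot(record["terms"], vocab) for record in records]

for record, target in zip(records, targets):
    print(record["title"], target)
```

Unlike single-label classification, more than one slot in a target vector can be set at once, which is why multi-label models are usually trained with an independent sigmoid output per label rather than a single softmax.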

Here's an accompanying blog post discussing the motivation behind this dataset, how we built a simple baseline model, and more: Large-scale multi-label text classification.


We would like to thank Matt Watson for helping us build the simple baseline classifier model. Thanks to Lukas Schwab (author of for helping us build our initial data collection utilities. Thanks to Robert Bradshaw for his input on the Apache Beam pipeline. Thanks to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.