
Minimally Supervised Categorization of Text with Metadata

This project provides a weakly-supervised framework for categorizing text with metadata.


For training, a GPU is highly recommended.


The code is based on the Keras library. You can find installation instructions here.


The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you need to first download the datasets here. Five datasets are used in our paper. The GitHub-Sec dataset, unfortunately, cannot be published due to our commitment to the data provider. The other four datasets are available. Once you unzip the downloaded file, you can see four folders related to these four datasets, respectively.

| Dataset | Folder Name | #Documents | #Classes | Class name (#Documents in this class) |
| --- | --- | --- | --- | --- |
| GitHub-Bio | bio/ | 876 | 10 | Sequence Analysis (210), Genome Analysis (176), Gene Expression (63), Systems Biology (53), Genetics (47), Structural Bioinformatics (39), Phylogenetics (27), Text Mining (63), Bioimaging (125), Database and Ontologies (73) |
| GitHub-AI | ai/ | 1,596 | 14 | Image Generation (215), Object Detection (296), Image Classification (361), Semantic Segmentation (170), Pose Estimation (96), Super Resolution (75), Text Generation (24), Text Classification (26), Named Entity Recognition (22), Question Answering (102), Machine Translation (117), Language Modeling (44), Speech Synthesis (27), Speech Recognition (21) |
| Amazon | amazon/ | 100,000 | 10 | Apps for Android (10,000), Books (10,000), CDs and Vinyl (10,000), Clothing, Shoes and Jewelry (10,000), Electronics (10,000), Health and Personal Care (10,000), Home and Kitchen (10,000), Movies and TV (10,000), Sports and Outdoors (10,000), Video Games (10,000) |
| Twitter | twitter/ | 135,619 | 9 | Food (34,387), Shop and Service (13,730), Travel and Transport (8,826), College and University (2,281), Nightlife Spot (15,082), Residence (1,678), Outdoors and Recreation (19,488), Arts and Entertainment (26,274), Professional Places (13,783) |

Put the dataset folders under the repository's main folder (./). The following script can then be used to run the model.


Micro-F1, Macro-F1 and the confusion matrix will be shown in the last several lines of the output. The classification result can be found under your dataset folder. For example, if you are using the GitHub-Bio dataset, the output will be ./bio/out.txt.
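For readers unfamiliar with the two metrics, the sketch below shows how Micro-F1 and Macro-F1 are conventionally computed for single-label classification. It is an illustration only: the function name `micro_macro_f1` is hypothetical, and the repository's own evaluation code may be organized differently.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions.

    Hypothetical helper for illustration; not taken from the repository.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores with equal class weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Because Macro-F1 weights every class equally, it is the more sensitive of the two to small classes such as Phylogenetics in GitHub-Bio.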


Besides the "input" version mentioned in the Quick Start section, we also provide a JSON version, where each line is a JSON object containing the text and its metadata (e.g., user, tags, and product).

For GitHub-Bio, GitHub-AI, and Twitter, the json format is as follows:

  "user": "86372688",
  "tags": [
  "text": "purityvodka hudsonmalone newyorkcity hudson malone",
  "label": "Food"

For Amazon, the json format is as follows:

{
  "user": "A1N4O8VOJZTDVB",
  "product": "B004A9SDD8",
  "text": "really cute loves the song , so he really could n't wait to play this . ... ",
  "label": "Apps_for_Android"
}
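Since each line of the JSON version holds one object, the file can be read line by line. The sketch below assumes that layout; the helper names `read_json_lines` and `label_counts` are hypothetical and not part of the repository.

```python
import json
from collections import Counter


def read_json_lines(path):
    """Yield one dict per non-empty line of a one-object-per-line JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_counts(records):
    """Count how many records carry each value of the "label" field."""
    return Counter(rec["label"] for rec in records)
```

For example, `label_counts(read_json_lines(path))` gives the class distribution of a dataset, which can be checked against the table in Quick Start. Here `path` stands for wherever the JSON file sits in your dataset folder.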

Running New Datasets

In the Quick Start section, we include a pretrained embedding file in the downloaded folders. If you have a new dataset, you need to rerun our generation-guided embedding module to get your own embedding files. Please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./bio).

  2. Prepare three files: (1) ./${dataset}/doc_id.txt containing labeled document ids for each class. Each line begins with the class id (starting from 0), followed by a colon, and then document ids in the corpus (starting from 0) of the corresponding class separated by commas; (2) ./${dataset}/dataset.csv; and (3) ./${dataset}/dataset.json. You can refer to the example datasets (doc_id/csv and json) for the format.

  3. cd gge/ and then ./ Make sure you have changed the dataset name. The embedding file will be saved to gge/embedding_gge.
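The doc_id.txt layout described in step 2 (class id, a colon, then comma-separated document ids) can be parsed or validated with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and tolerating spaces around the ids is an assumption rather than something the format description states.

```python
def parse_doc_id_line(line):
    """Parse one 'class_id:doc_id,doc_id,...' line into (class_id, [doc_ids])."""
    class_part, _, ids_part = line.strip().partition(":")
    doc_ids = [int(tok) for tok in ids_part.split(",") if tok.strip()]
    return int(class_part), doc_ids


def load_doc_ids(path):
    """Map each class id to its list of labeled document ids."""
    seeds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                class_id, doc_ids = parse_doc_id_line(line)
                seeds[class_id] = doc_ids
    return seeds
```

Running `load_doc_ids` on a newly prepared file is a quick check that class ids start from 0 and that every listed document id is an integer before the embedding step is launched.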

With the embedding file, you can train the classifier as mentioned in Quick Start (make sure you move it to ${dataset}/). Please always refer to the example datasets when adapting the code for a new dataset.


If you find the implementation useful, please cite the following paper:

  title={Minimally Supervised Categorization of Text with Metadata},
  author={Zhang, Yu and Meng, Yu and Huang, Jiaxin and Xu, Frank F. and Wang, Xuan and Han, Jiawei},


Minimally Supervised Categorization of Text with Metadata (SIGIR'20)



