Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Minimally Supervised Categorization of Text with Metadata

This repository contains the source code for Minimally Supervised Categorization of Text with Metadata.



For training, a GPU is highly recommended.


The code is based on the Keras library. You can find installation instructions here.


The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you need to first download the datasets. Five datasets are used in our paper. The GitHub-Sec dataset, unfortunately, cannot be published due to our commitment to the data provider. The other four datasets are available. Once you unzip the downloaded file, you can see four folders corresponding to these four datasets, respectively.

Dataset Folder Name #Documents #Classes Class name (#Repositories in this class)
GitHub-Bio bio/ 876 10 Sequence Analysis (210), Genome Analysis (176), Gene Expression (63), Systems Biology (53), Genetics (47), Structural Bioinformatics (39), Phylogenetics (27), Text Mining (63), Bioimaging (125), Database and Ontologies (73)
GitHub-AI ai/ 1,596 14 Image Generation (215), Object Detection (296), Image Classification (361), Semantic Segmentation (170), Pose Estimation (96), Super Resolution (75), Text Generation (24), Text Classification (26), Named Entity Recognition (22), Question Answering (102), Machine Translation (117), Language Modeling (44), Speech Synthesis (27), Speech Recognition (21)
Amazon amazon/ 100,000 10 Apps for Android (10,000), Books (10,000), CDs and Vinyl (10,000), Clothing, Shoes and Jewelry (10,000), Electronics (10,000), Health and Personal Care (10,000), Home and Kitchen (10,000), Movies and TV (10,000), Sports and Outdoors (10,000), Video Games (10,000)
Twitter twitter/ 135,529 (sorry for the typo in the original paper) 9 Food (34,387), Shop and Service (13,730), Travel and Transport (8,826), College and University (2,281), Nightlife Spot (15,082), Residence (1,678), Outdoors and Recreation (19,488), Arts and Entertainment (26,274), Professional Places (13,783)

You need to put these 4 dataset folders under the repository main folder ./. Then the following running script can be used to run the model.


Micro-F1, Macro-F1 and the confusion matrix will be shown in the last several lines of the output. The classification result can be found under your dataset folder. For example, if you are using the GitHub-Bio dataset, the output will be ./bio/out.txt.


Besides the "input" version mentioned in the Quick Start section, we also provide the json version, where each line is a json file containing text and metadata (e.g., user, tags and product).

For GitHub-Bio, GitHub-AI, and Twitter, the json format is as follows:

  "user": [
  "text": "pytorch implementation of dcgan trained on the celeba dataset ...",
  "tags": [
  "label": 0,
  "label_name": "$Image-Generation"

Here, the "user" field is global metadata; the "tags" field is local metadata. (Please refer to our paper for the definitions of global and local metadata.)

For Amazon, the json format is as follows:

  "user": [
  "text": "really cute loves the song so he really could ...",
  "product": [
  "label": 0,
  "label_name": "Apps_for_Android"

Here, both "user" and "product" are global metadata; there is no local metadata in the Amazon dataset.

NOTE: If you would like to run our code on your own dataset, when you prepare this json file, make sure: (1) For each document, its metadata field is always represented by a list of strings. For example, the "user" field should be ["A1N4O8VOJZTDVB"] instead of "A1N4O8VOJZTDVB". (2) The "label" field is an integer. If you have N classes, the label space should be 0, 1, ..., N-1. The "label_name" field is optional.

Running on New Datasets

In the Quick Start section, we include a pretrained embedding file in the downloaded folders. If you have a new dataset, you need to rerun our generation-guided embedding module to get your own embedding files. Please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./bio).

  2. Prepare three files:

(1) ./${dataset}/doc_id.txt containing labeled document ids for each class. Each line begins with the class id (starting from 0), followed by a colon, and then document ids in the corpus (starting from 0) of the corresponding class separated by commas.

(2) ./${dataset}/dataset.json. You can refer to the provided json files for the format. Make sure it has two fields "text" and "label". ("label" should be an integer in 0, 1, ..., N-1, corresponding to the classes in ./${dataset}/doc_id.txt.) You can add your own metadata fields in the json.

(3) ./${dataset}/meta_dict.json indicating the names of your global/local metadata fields. For example, for GitHub-Bio, GitHub-AI, and Twitter, it should be

{"global": ["user"], "local": ["tags"]}

For Amazon, it should be

{"global": ["user", "product"], "local": []}
  1. ./ Make sure you have changed the dataset name. The embedding file will be saved to ./${dataset}/embedding_gge.

With the embedding file, you can train the classifier as mentioned in Quick Start (i.e., ./ Please always refer to the example datasets when adapting the code for a new dataset.


If you find the implementation useful, please cite the following paper:

  title={Minimally Supervised Categorization of Text with Metadata},
  author={Zhang, Yu and Meng, Yu and Huang, Jiaxin and Xu, Frank F. and Wang, Xuan and Han, Jiawei},


Minimally Supervised Categorization of Text with Metadata (SIGIR'20)








No releases published


No packages published