Benchmark Datasets

Datasets Details

  1. Graph Datasets

    Dataset   Samples  Dimension  Edges   Classes  URL
    CORA      2708     1433       5278    7
    CITESEER  3327     3703       4552    6
    PUBMED    19717    500        44325   3
    DBLP      4057     334        3528    4
    CITE      3327     3703       4552    6
    ACM       3025     1870       13128   3
    AMAP      7650     745        119081  8
    AMAC      13752    767        245861  10
    CORAFULL  19793    8710       63421   70
    WIKI      2405     4973       8261    19
    BAT       131      81         1038    4
    EAT       399      203        5994    4
    UAT       1190     239        13599   4
  2. Non-graph Datasets

    Dataset  Samples  Dimension  Type    Classes  URL
    USPS     9298     256        Image   10
    HHAR     10299    561        Record  6
    REUT     10000    2000       Text    4
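
As a rough illustration of what the table columns mean, the sketch below computes Samples, Dimension, Edges, and Classes for a toy graph. The function name `dataset_stats` and the toy data are hypothetical, and counting each undirected edge once (ignoring self-loops) is an assumption, not necessarily how the numbers above were produced.

```python
def dataset_stats(features, edges, labels):
    """Return the four table statistics for a small graph dataset."""
    # Assumption: the graph is undirected, so (u, v) and (v, u) count once
    # and self-loops are ignored.
    undirected = {frozenset((u, v)) for u, v in edges if u != v}
    return {
        "samples": len(features),                           # number of nodes
        "dimension": len(features[0]) if features else 0,   # feature length
        "edges": len(undirected),
        "classes": len(set(labels)),
    }


# Toy graph: 4 nodes, 3 features each, one edge listed in both directions.
features = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 1, 0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
stats = dataset_stats(features, edges, labels)
print(stats)
```

Different edge-counting conventions (directed vs. undirected, with or without duplicates) are one plausible reason why edge counts for the same dataset differ between sources.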

Dataset Introduction


The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.


The Citeseer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
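
Cora and Citeseer both describe each publication with a 0/1-valued word vector. A minimal sketch of that encoding is shown below; the four-word dictionary and the sample paper are made up for illustration, and the real dictionaries contain 1433 and 3703 words respectively.

```python
def binary_word_vector(document_words, dictionary):
    """Return a 0/1 vector: 1 if the dictionary word occurs in the document."""
    present = set(document_words)
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical dictionary and paper, for illustration only.
dictionary = ["graph", "neural", "cluster", "kernel"]
paper = ["graph", "cluster", "graph"]
vector = binary_word_vector(paper, dictionary)
print(vector)
```

Repeated words do not change the result, because the vector records presence rather than frequency.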


The Pubmed dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary which consists of 500 unique words.
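
To make the TF-IDF weighting concrete, here is a small sketch using one common variant: term frequency is the word count divided by document length, and inverse document frequency is log(N / df). The exact formula used to build Pubmed is not stated here, so this variant, the function name `tfidf_vectors`, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents, dictionary):
    """Return one TF-IDF vector per document (tf = count/len, idf = log(N/df))."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for doc in documents if w in doc) for w in dictionary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        row = []
        for w in dictionary:
            tf = counts[w] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors


# Hypothetical toy corpus, for illustration only.
documents = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["glucose", "insulin"],
]
dictionary = ["insulin", "glucose", "diet"]
vectors = tfidf_vectors(documents, dictionary)
```

In this toy corpus "glucose" occurs in every document, so its weight is zero everywhere, while the rarer "diet" receives the largest IDF.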


This is an author network from the DBLP dataset. There is an edge between two authors if they have co-authored a paper. The authors are divided into four areas: database, data mining, machine learning, and information retrieval. We label each author’s research area according to the conferences they submitted to. Author features are the elements of a bag-of-words representation of keywords.


This is a paper network from the ACM dataset. There is an edge between two papers if they are written by the same author. Paper features are the bag-of-words of the keywords. We select papers published in KDD, SIGMOD, SIGCOMM, and MobiCOMM, and divide them into three classes (database, wireless communication, data mining) by research area.
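
The edge rule described for ACM (two papers are linked if they share an author) can be sketched as follows. The function name `paper_edges_from_authors` and the paper and author identifiers are hypothetical; the actual preprocessing pipeline may differ.

```python
from itertools import combinations


def paper_edges_from_authors(paper_authors):
    """Link every pair of papers that share at least one author."""
    # Invert the mapping: author -> set of papers they wrote.
    papers_by_author = {}
    for paper, authors in paper_authors.items():
        for author in authors:
            papers_by_author.setdefault(author, set()).add(paper)

    # Each author contributes an edge between every pair of their papers.
    edges = set()
    for papers in papers_by_author.values():
        for u, v in combinations(sorted(papers), 2):
            edges.add((u, v))
    return sorted(edges)


# Hypothetical toy input, for illustration only.
paper_authors = {
    "p1": ["alice", "bob"],
    "p2": ["bob"],
    "p3": ["carol"],
    "p4": ["alice", "carol"],
}
edges = paper_edges_from_authors(paper_authors)
print(edges)
```

The same pattern, with the roles of papers and authors swapped, gives a co-authorship network in which two authors are linked if they share a paper.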


A-Computers (AMAC) and A-Photo (AMAP) are extracted from the Amazon co-purchase graph, where nodes represent products, edges indicate that two products are frequently co-purchased, features are product reviews encoded as bag-of-words, and labels are predefined product categories.


Wikipedia (WIKI) is an online encyclopedia created and edited by volunteers around the world. The dataset is a word co-occurrence network constructed from the entire set of English Wikipedia pages. It contains 2405 nodes, 17981 edges, and 19 labels.


Coauthor-CS and Coauthor-Physics are two academic co-authorship networks based on the Microsoft Academic Graph. Nodes denote authors, and edges denote co-authorship. Authors are classified into 15 and 5 classes, respectively, based on their research field, and the node features are a bag-of-words representation of paper keywords.


BAT (Brazilian air-traffic network): data collected from the National Civil Aviation Agency (ANAC) from January to December 2016. It has 131 nodes and 1,038 edges (diameter is 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding year.


EAT (European air-traffic network): data collected from the Statistical Office of the European Union (Eurostat) from January to November 2016. It has 399 nodes and 5,995 edges (diameter is 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding period.


UAT (American air-traffic network): data collected from the Bureau of Transportation Statistics from January to October 2016. It has 1,190 nodes and 13,599 edges (diameter is 8). Airport activity is measured by the total number of people who passed through the airport (arrivals plus departures) in the corresponding period.
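
The air-traffic descriptions quote a graph diameter, that is, the longest shortest path between any two nodes. The sketch below computes it with a breadth-first search from every node; it assumes a connected, undirected, unweighted graph, and the node names are placeholders rather than real airports.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Length of the longest shortest path starting at `source` (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(edges):
    """Diameter of a connected, undirected graph given as an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return max(eccentricity(adjacency, node) for node in adjacency)


# Hypothetical toy network: a path A-B-C-D with an extra node E attached to B.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]
d = diameter(edges)
print(d)
```

Running one BFS per node costs O(V * (V + E)), which is inexpensive for graphs of the sizes listed above.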

If you find this repository useful for your research or work, we would really appreciate it if you starred it. ❤️