Skip to content

Latest commit

 

History

History
43 lines (31 loc) · 5.23 KB

README.md

File metadata and controls

43 lines (31 loc) · 5.23 KB

This folder contains several datasets that have been tested with JedAI. We have grouped them in two categories:

  1. those suitable for Clean-Clean ER, and
  2. those sutable for Dirty ER.

Every dataset includes two types of files:

  1. those containing the entity profiles themselves, which are called entity files, and
  2. those containing the golden standard (i.e., the ground-truth with the real matches), which are called groundtruth files.

Note that every file that is available as a Java Serialized Object (JSO) can be read using the class DataReader.EntityReader.EntitySerializationReader for entity files or the class DataReader.GroundTruthReader.GtSerializationReader for ground-truth files. See this class for an example.

Clean-Clean ER datasets

Dataset Name D1 Entities D2 Entities D1 Name-Value Pairs D2 Name-Value Pairs Duplicates Average NVP per Entity Brute-force Comparisons File Format Data Origin
Restaurants 339 2,256 1,130 7,519 89 3.3 7.64E+05 JSO (Rest. 1 file, Rest. 2 file, groundtruth file) Real data
Abt-Buy 1,076 1,076 2,568 2,308 1,076 2.4 1.16E+06 JSO (Abt entity file, Buy entity file, groundtruth file) Real data
Amazon-Google Products 1,354 3,039 5,302 9,110 1,104 3.9 4.11E+06 JSO (Amazon entity file, GP entity file, groundtruth file) Real data
DBLP-ACM 2,616 2,294 10,464 9,162 2,224 4.0 6.00E+06 JSO (DBLP entity file, ACM entity file, groundtruth file), CSV (DBLP entity file, ACM entity file), XML (DBLP entity file, ACM entity file) Real data
Amazon-Walmart 2,554 22,074 14,143 114,315 853 5.2 5.64E+07 JSO (Amazon entity file, Walmart entity file, groundtruth file) Real data
DBLP-Scholar 2,516 61,353 10,064 198,001 2,308 4.0 1.54E+08 JSO (DBLP entity file, Scholar entity file, groundtruth file) Real data
Movies 27,615 23,182 155,436 816,009 22,863 5.6 6.40E+08 JSO (IMDB entity file, DBPedia entity file, groundtruth file) Real data
DBPedia 1,190,733 2,164,040 1.69E+07 3.50E+07 892,586 14.2 2.58E+12 JSO Real data

Dirty ER datasets

Dataset Name Entities Name-Value Pairs Duplicates Average NVP per Entity Brute-force Comparisons File Format Data Origin
Restaurant 864 4,319 112 5.0 3.73E+05 JSO (entity file, groundtruth file) Real data
Census 841 3,913 344 4.7 3.53E+05 JSO (entity file, groundtruth file) Real data
Cora 1,295 7,166 17,184 5.5 8.38E+05 JSO (entity file, groundtruth file) Real data
CdDb 9763 173,309 299 17.8 4.77E+07 JSO (entity file, groundtruth file) Real data
Abt-By 2,152 4,876 1,076 2.3 2.31E+06 JSO (entity file, groundtruth file) Real data
DBLP-ACM 4,910 19,626 2,224 4.0 1.21E+07 JSO (entity file, groundtruth file) Real data
DBLP-Scholar 63,869 208,065 2,308 3.3 2.04E+09 JSO (entity file, groundtruth file) Real data
Amazon-GP 4,393 14,412 1,104 3.3 9.65E+06 JSO (entity file, groundtruth file) Real data
Movies 50,797 971,445 22,863 19.1 1.29E+09 JSO (zipped entity file, groundtruth file) Real data