For standard use cases, dblink can be launched using spark-submit (the alternative for more complex use cases is to use the Scala API). When launching using spark-submit, a dblink config file must be provided. This file contains information about:

  • the location of the data to be linked
  • metadata
  • hyperparameters for the model
  • steps/tasks to run

The dblink config file must be in HOCON (Human-Optimized Config Object Notation) format, a superset of JSON with a more relaxed syntax. An example is provided in the examples directory, and further details are given in the following sections of this document. Once a config file has been written (e.g. my-dblink.conf), dblink can be run using spark-submit as follows:

$ $SPARK_HOME/bin/spark-submit \
    --master local[1] \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=/path/to/" \
    --conf "spark.driver.extraClassPath=/path/to/dblink-assembly-0.2.0.jar" \
    /path/to/dblink-assembly-0.2.0.jar \
    /path/to/my-dblink.conf

Main properties

The table below outlines the properties that can be set in a dblink config file. Most of these properties have no default and must be specified for dblink to run successfully.

| Property Name | Description |
| --- | --- |
|  | URI where the data is located. This must point to a CSV file and the first line of the file must be a header with the column names. |
|  | Name of the attribute/column in the data which contains unique record identifiers. This must be specified. |
|  | Name of the attribute/column in the data which contains file identifiers. This is only required if the records come from different files. Defaults to null. |
|  | Name of the attribute/column in the data which contains entity identifiers (ground truth). This is only required for evaluation purposes. Defaults to null. |
|  | String used in the data to represent null values. This must be specified. |
|  | A JSON array specifying the attributes/columns in the data to use for matching, along with the corresponding similarity function and distortion prior. Each entry in the array is a JSON object with a name (attribute/column name), similarityFn (see next section) and distortion hyperparameters distortionPrior.alpha and distortionPrior.beta (both must be positive numbers). |
| dblink.outputPath | URI where outputs of dblink will be stored. Outputs may include the state of the Markov chain, posterior samples and/or point estimates of the linkage structure and evaluation metrics. The URI should typically point to a location on HDFS storage. |
| dblink.checkpointPath | URI where checkpoints will be stored. Checkpoints are made periodically to prevent the lineage of the RDD from becoming too long. These are removed automatically when dblink finishes running. The URI should typically point to a location on HDFS storage. |
| dblink.randomSeed | Random integer seed to be passed on to the pseudorandom number generator. |
| dblink.populationSize | Size of the latent population. Defaults to the total number of records. |
| dblink.expectedMaxClusterSize | A hint at the size of the largest clusters expected in the data. This is used to optimise caching. Defaults to 10. |
| dblink.partitioner | Specifies the type of partitioner to use. See the next section for details. |
| dblink.steps | A JSON array of steps to be executed (in the order given). Each step is represented as a JSON object with a name and parameters. Supported steps include "sample", "summarize", "evaluate" and "copy-files". See the next section for details. |

Details on selected properties

Some of the properties referenced above are not scalar-valued and must be specified using a JSON object. We provide further detail on these types of properties below.

Similarity function

A similarity function must be specified for each attribute that appears in the matching-attributes array described in the table above. This is done using a JSON object with name and parameters keys. Currently only two similarity functions are supported: (1) a constant similarity function and (2) a similarity function based on normalized Levenshtein (edit) distance. Examples are provided for each of these below.

Constant similarity

With this function, the likelihood of a distorted value being selected depends only on its empirical frequency.

similarityFn : {
  name : "ConstantSimilarityFn"
}

Levenshtein similarity

With this function, the likelihood of a distorted value being selected is proportional to its empirical frequency multiplied by the exponentiated Levenshtein similarity. There are two parameters:

  • maxSimilarity: similarities will be in the range [0, maxSimilarity].
  • threshold: similarities below this value will be set to zero. A higher threshold improves the efficiency of the inference, possibly at the expense of accuracy.

similarityFn : {
  name : "LevenshteinSimilarityFn"
  parameters : {
    threshold : 7.0
    maxSimilarity : 10.0
  }
}
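
To make the roles of maxSimilarity and threshold concrete, here is a minimal Python sketch of a thresholded, normalized Levenshtein similarity. It is an illustration only: the function names are hypothetical, and the normalization (edit distance divided by the length of the longer string) is an assumption, since this document does not give the exact formula dblink uses.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def thresholded_similarity(a: str, b: str,
                           max_similarity: float = 10.0,
                           threshold: float = 7.0) -> float:
    """Similarity in [0, max_similarity]; values below threshold become 0.

    Assumption: distance is normalized by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return max_similarity  # two empty strings are identical
    normalized_distance = levenshtein_distance(a, b) / longest
    similarity = max_similarity * (1.0 - normalized_distance)
    return similarity if similarity >= threshold else 0.0
```

Under these assumptions, a one-character difference between two five-letter strings stays above the threshold of 7.0, while pairs that differ more heavily are truncated to zero, so fewer candidate values need to be considered during inference.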


Partitioner

The partitioner is specified at dblink.partitioner in the config file and determines how the space of entities is partitioned. It is specified using the name and parameters keys. Currently only one type of partitioner is supported, the KDTreePartitioner, which, as the name suggests, is based on a k-d tree. It has two parameters:

  • numLevels : the depth/number of levels of the tree. The partitions are the leaves of the tree, hence the number of partitions is given by 2^numLevels.
  • matchingAttributes : splits are performed by cycling through the attributes specified in this array. Attributes listed here must also appear in the matching-attributes array described under Main properties.

An example configuration is given below.

partitioner : {
  name : "KDTreePartitioner"
  parameters : {
    numLevels : 3
    matchingAttributes : ["name", "age", "sex"]
  }
}
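
The relationship between numLevels, the cycling of attributes and the number of partitions can be sketched in a few lines of Python. The helper name kd_tree_layout is hypothetical, and the sketch only shows which attribute each tree level would split on. How dblink chooses the actual split points is not described here and is not modelled.

```python
def kd_tree_layout(num_levels, matching_attributes):
    """Return (attribute split on at each level, number of leaf partitions)."""
    if num_levels < 0:
        raise ValueError("num_levels must be non-negative")
    if num_levels > 0 and not matching_attributes:
        raise ValueError("at least one attribute is needed to split on")
    # Cycle through the attributes, one per tree level.
    split_attributes = [
        matching_attributes[level % len(matching_attributes)]
        for level in range(num_levels)
    ]
    # Each level doubles the number of leaves, giving 2^num_levels partitions.
    return split_attributes, 2 ** num_levels


levels, partitions = kd_tree_layout(3, ["name", "age", "sex"])
print(levels, partitions)
```

For the example configuration, the three levels split on name, age and sex in turn and produce 8 partitions. With a fourth level the cycle would wrap around to name again, giving 16 partitions.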


Steps

This part of the config file (dblink.steps) specifies the tasks/steps to be executed on the data and/or output. Currently four steps are supported.

Sample step

This step either begins sampling from a new initial state, or resumes sampling from a saved state. It is specified by adding the following entry to dblink.steps:

{
  name : "sample"
  parameters : {
    sampleSize : 10
    # ...
  }
}

The following table outlines the available parameters.

| Parameter | Description |
| --- | --- |
| sampleSize | A positive integer specifying the desired number of samples (after burn-in and thinning). This parameter is required. |
| burninInterval | A non-negative integer specifying the number of initial samples to discard as burn-in. Defaults to 0, which means no burn-in is applied. |
| thinningInterval | A positive integer specifying the period for saving samples to disk. Defaults to 1, which means no thinning is applied. |
| resume | Whether to continue sampling from a saved state (if one exists on disk). Defaults to true. |
| sampler | One of the supported samplers: "PCG-I", "PCG-II", "Gibbs" or "Gibbs-Sequential". Defaults to "PCG-I". |
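
Because sampleSize counts samples after burn-in and thinning, the number of Markov chain iterations actually run can be much larger. The Python sketch below estimates it under the usual reading of these parameters (burn-in iterations are discarded first, then one sample is saved every thinningInterval iterations). The function name is hypothetical and dblink's exact bookkeeping may differ.

```python
def total_iterations(sample_size, burnin_interval=0, thinning_interval=1):
    """Approximate number of iterations needed to save sample_size samples."""
    if sample_size <= 0:
        raise ValueError("sample_size must be a positive integer")
    if burnin_interval < 0:
        raise ValueError("burnin_interval must be non-negative")
    if thinning_interval <= 0:
        raise ValueError("thinning_interval must be a positive integer")
    # Assumption: burn-in comes first, then every thinning_interval-th
    # iteration is kept until sample_size samples have been saved.
    return burnin_interval + sample_size * thinning_interval


print(total_iterations(10))                  # defaults: no burn-in, no thinning
print(total_iterations(1000, 10000, 10))     # heavier burn-in and thinning
```

With the defaults, 10 samples need 10 iterations. Asking for 1000 samples with a burn-in of 10000 and a thinning interval of 10 needs about 20000 iterations, which is worth keeping in mind when budgeting cluster time.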

Summarize step

This step produces summary statistics from saved posterior samples. It is specified by adding the following entry to dblink.steps:

{
  name : "summarize"
  parameters : {
    # ...
  }
}

The following table outlines the available parameters.

| Parameter | Description |
| --- | --- |
| lowerIterationCutoff | A non-negative integer specifying a lower cut-off for iterations included in the summary. Defaults to 0. |
| quantities | An array containing one or more of the supported quantities: "cluster-size-distribution", "partition-sizes", or "shared-most-probable-clusters". "cluster-size-distribution" computes the distribution of entity cluster sizes along the chain; "partition-sizes" computes the number of entities in each partition along the chain; and "shared-most-probable-clusters" computes a point estimate of the linkage structure. |
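
As an illustration of what a cluster size distribution is, the following Python sketch computes it for a single posterior sample, represented here as a mapping from record identifier to entity identifier. The representation and function name are assumptions made for the example, and the sketch only counts entities linked to at least one record. It is not dblink's implementation.

```python
from collections import Counter


def cluster_size_distribution(record_to_entity):
    """Map each cluster size to the number of clusters of that size."""
    # Number of records linked to each entity (i.e. the size of each cluster).
    cluster_sizes = Counter(record_to_entity.values())
    # Number of clusters observed at each size.
    return dict(Counter(cluster_sizes.values()))


sample = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e3"}
print(cluster_size_distribution(sample))
```

Repeating this for every saved iteration gives the distribution "along the chain": one row of counts per iteration, which can then be plotted to check whether the sampler has stabilised.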

Evaluate step

This step computes evaluation metrics using the provided ground truth. It is specified by adding the following entry to dblink.steps:

{
  name : "evaluate"
  parameters : {
    metrics : ["pairwise"]
    # ...
  }
}

The following table outlines the available parameters.

| Parameter | Description |
| --- | --- |
| lowerIterationCutoff | A non-negative integer specifying a lower cut-off for iterations included in the final sample. Defaults to 0. |
| metrics | An array containing one or more of the supported metrics: "pairwise" or "cluster". Currently "pairwise" computes the pairwise precision and recall and "cluster" computes the adjusted Rand index. |
| useExistingSMPC | If true, try to use a saved point estimate of the linkage structure (based on the SMPC method) to compute the metrics. Otherwise, compute the SMPC point estimate from scratch. Defaults to false. |
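
For readers unfamiliar with the "pairwise" metrics, the Python sketch below computes pairwise precision and recall from a predicted clustering and the ground truth, each given as a mapping from record identifier to cluster label. The function names are hypothetical, and returning 1.0 when there are no pairs to score is a convention chosen for this example, not necessarily dblink's.

```python
from itertools import combinations


def linked_pairs(record_to_cluster):
    """Set of unordered record pairs that share a cluster."""
    clusters = {}
    for record, cluster in record_to_cluster.items():
        clusters.setdefault(cluster, []).append(record)
    pairs = set()
    for members in clusters.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Precision and recall over pairs of records linked together."""
    predicted_pairs = linked_pairs(predicted)
    true_pairs = linked_pairs(truth)
    true_positives = len(predicted_pairs & true_pairs)
    # Assumed convention: a side with no pairs scores 1.0.
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    return precision, recall


truth = {"r1": "a", "r2": "a", "r3": "b", "r4": "b"}
predicted = {"r1": "x", "r2": "x", "r3": "x", "r4": "y"}
print(pairwise_precision_recall(predicted, truth))
```

In this example the prediction links three pairs, of which only (r1, r2) is correct, and it misses the true pair (r3, r4), so precision is 1/3 and recall is 1/2.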

Copy-files step

This step copies files from the project directory to another destination. This is useful when running dblink on a cluster where the local or Hadoop file system is ephemeral. It is specified by adding the following entry to dblink.steps:

{
  name : "copy-files"
  parameters : {
    fileNames : ["cluster-size-distribution.csv", "partition-sizes.csv", "diagnostics.csv", "shared-most-probable-clusters.csv", "run.txt", "evaluation-results.txt"]
    destinationPath : "S3:///bucket-name/"
    # ...
  }
}

The following table outlines the available parameters.

| Parameter | Description |
| --- | --- |
| fileNames | An array containing the file names in the project directory to copy. |
| destinationPath | URI specifying destination for files. |
| overwrite | Whether to overwrite files at the destination path. Defaults to false. |
| deleteSource | Whether to delete the source files. Defaults to false. |