Conflation

timwaters edited this page Dec 17, 2014 · 1 revision

Conflation is the process of identifying two or more feature records as representing the same physical place, and updating those features to reflect that identity. Features suitable for conflation may come either from the same source or, more typically, from different sources. Usually, when two or more places are conflated, one is marked as primary and the rest subsidiary.

Conflation may be performed automatically or manually. The collection of conflated relations between different datasets represents a concordance across those datasets.

Let us define the question of conflation as follows: For any given feature, which we’ll call our subject, there may be one or more other features in the database that represent the same place. We query the database for the most likely candidates, establish for each candidate a confidence that it can be conflated with the subject, and then choose the candidates which pass a certain threshold.

Distance metrics

There are four attributes to which we can apply distance metrics to determine whether two features are indeed the same place:

  1. Name
  2. Centroid
  3. Feature class
  4. Feature type

The distance metrics between any two places over these four attributes can be combined to express a confidence that the two places are the same. Maximum confidence would obtain when all four attributes are the same, and therefore all distance metrics are by any measure zero. Similarly, minimum confidence would obtain when all attributes are completely different. Let us define the range of confidence as a normalized value between 0.0 and 1.0.

Name distance

For names, we can define the distance measure as string edit distance, e.g. Levenshtein, which can conveniently be normalized by the lengths of the strings. Since many features have multiple names, the ideal measure is the minimum edit distance over the cross-product of the sets of names for the subject and a given candidate. Subtracting this value from 1 yields a normalized value for the maximum possible similarity of the name attributes of the two places.

Empirically, we have found that a scaled Levenshtein distance under 0.25 (or even 0.35) is low enough for conflation with high confidence. This threshold still allows us to consider alternate versions of very short names, such as “Ham” versus “City of Ham”.
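The name measure can be sketched as follows. This is an illustrative implementation, assuming the edit distance is normalized by the length of the longer string; the actual normalization scheme may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(subject_names, candidate_names):
    """1 minus the minimum normalized edit distance over the
    cross-product of the two name sets (assumed normalization:
    divide by the longer string's length)."""
    best = min(levenshtein(a, b) / max(len(a), len(b), 1)
               for a in subject_names
               for b in candidate_names)
    return 1.0 - best
```

Because the minimum is taken over all name pairs, a single matching alternate name is enough to yield high name similarity.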

Class and type distance

The other attributes are more problematic. Feature class and type values are discrete, so the only possible metric is identity, which is binary. Geographic distance between centroids is effectively unbounded for our purposes. Moreover, geographic distance means something different when applied to different feature types: individual buildings occur more densely than cities, and cities more densely than provinces. Densities over all features are also much higher in urban areas than in rural or more remote areas.

It stands to reason that a meaningful measure of geographic distance should be scaled to the density of the feature types in question in the vicinity of the subject. Median absolute deviation is one example of a robust measure of dispersal in a quantitative sample. For both the feature class and type of the subject, we can scale geographic distances by the MAD of the distances between all pairs of features of that particular class/type within a certain radius of the subject. As a heuristic, we have found in practice that places rarely conflate beyond 25km, so this is the bound we will use. We can then normalize the MAD-scaled distances from the subject feature, per class and type, between zero and maximum, across the entire bounded sample. Subtracting these values from 1 yields a normalized value for the maximum geographic similarity between the subject and any other place.
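The MAD-scaling step can be sketched as follows, assuming the subject-to-feature distances for one class (or type) sample have already been computed, and assuming "normalize between zero and maximum" means dividing by the sample maximum:

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(x)|,
    a robust measure of dispersal."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def geo_similarities(dists):
    """Scale subject-to-feature distances by the sample MAD, normalize
    by the sample maximum, and subtract from 1 (assumed scheme)."""
    m = mad(dists) or 1.0            # guard against a zero MAD
    scaled = [d / m for d in dists]
    peak = max(scaled) or 1.0
    return [1.0 - s / peak for s in scaled]
```

In this scheme, a feature at the subject's own centroid gets geographic similarity 1.0, and the farthest feature in the bounded sample gets 0.0.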

The similarity measure for either feature class or type becomes less meaningful when the values are not the same for the subject and a candidate. Generally, we can say that feature types (of which there are 600+) may differ between two features that are really the same place, due to inevitable inconsistencies either in the source datasets or in the mapping from the source taxonomy to our own. However, we can adopt a heuristic that worthy candidates must at least share the same feature class (of which there are only 9). Even when the feature types don’t match, the type-scaled distance still reflects geographic distance from the subject within its local context for its own type. The incorporation of type distance into our confidence measure can be justified in this case by the assumption that, if the two places prove otherwise conflateable, then the candidate’s feature type must be essentially erroneous.

Overall confidence

In principle, we have now derived three similarity values, each normalized to the range 0.0 – 1.0, for the subject and any candidate. Taking the product of these values yields a similarly normalized value that can be compared between any two pairs of features; we will call this our confidence score.

As a check, we note that if all four attributes are identical, the confidence measured this way will be 1.0. If the names are completely different, the name similarity will be zero, and so the product will be zero, regardless of geographic distance. This does, however, preclude detecting the case where two features occupy an identical location with the same feature type but have completely different names (e.g. in different languages) and no matching alternate names.
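The combination step is then trivial. A minimal sketch, where the threshold value is a placeholder to be tuned empirically:

```python
CONFLATION_THRESHOLD = 0.8  # placeholder value; to be determined empirically

def confidence(name_sim, class_sim, type_sim):
    """Product of the three normalized similarities, each in [0.0, 1.0]."""
    return name_sim * class_sim * type_sim

def should_conflate(name_sim, class_sim, type_sim,
                    threshold=CONFLATION_THRESHOLD):
    """Accept a candidate only if the combined confidence clears the bar."""
    return confidence(name_sim, class_sim, type_sim) > threshold
```

Because the combination is multiplicative, a zero in any one similarity vetoes the match outright, which is exactly the behavior noted above for completely different names.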

The final stage is selecting which candidates to conflate with the subject. This requires a threshold, for which we have no theoretical basis; we will have to determine the ideal threshold empirically.

Implementation

Revisiting our similar places query endpoint in the API, we will need to confirm that, for a given subject, it returns all features which:

  1. share the same feature class,
  2. lie within our maximum range (25 km),
  3. lie within our maximum edit distance (at most 0.25), and
  4. are marked primary (i.e. not already conflated).

For conflation, we will also want to ensure that the timeframe for all candidates is exactly the same as the subject’s. We may wish to allow the similar places query to return features with different timeframes for the sake of identifying other kinds of relationships.

The search for conflation candidates can then be carried out with a script that uses the API as follows:

For each feature marked primary in the gazetteer:

  1. Fetch all similar places.
  2. For each candidate, compute the geographic distance.
  3. Compute the MAD of the geographic distances from the subject to all candidates.
  4. Compute the MAD of the geographic distances from the subject to all candidates of the same feature type.
  5. For each candidate:
     - Scale the geographic distance by the class MAD to get the scaled class distance.
     - Scale the geographic distance by the type MAD to get the scaled type distance.
  6. Sum the scaled class distances; do the same with the scaled type distances.
  7. For each candidate:
     - Normalize the scaled class distance and subtract it from 1 to get the class similarity.
     - Normalize the scaled type distance and subtract it from 1 to get the type similarity.
     - Compute the normalized edit distance and subtract it from 1 to get the name similarity.
     - Take the product of the three similarities as the confidence.
       - If the confidence is above a certain threshold, accept the candidate.
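The per-subject scoring loop can be sketched as below. This is an illustrative sketch only: `Candidate` is a hypothetical stand-in for what the similar places query returns, candidates are assumed to be pre-filtered (same class, within 25 km, marked primary) with geographic distances precomputed, distances are normalized by the sample maximum, and the threshold is a placeholder.

```python
from dataclasses import dataclass
import statistics

@dataclass
class Candidate:
    names: list          # all known names for the feature
    feature_type: str
    dist_km: float       # precomputed geographic distance from the subject

def mad(values):
    """Median absolute deviation of a sample."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_similarity(subject_names, candidate_names):
    """Minimum normalized edit distance over the name cross-product."""
    return 1.0 - min(levenshtein(a, b) / max(len(a), len(b), 1)
                     for a in subject_names for b in candidate_names)

def conflation_candidates(subject_names, subject_type, candidates, threshold):
    """Score each candidate against the subject; return accepted
    (candidate, confidence) pairs."""
    dists = [c.dist_km for c in candidates]
    class_mad = mad(dists) or 1.0
    type_dists = [c.dist_km for c in candidates
                  if c.feature_type == subject_type]
    type_mad = (mad(type_dists) or 1.0) if type_dists else class_mad
    class_scaled = [d / class_mad for d in dists]
    type_scaled = [d / type_mad for d in dists]
    class_peak = max(class_scaled) or 1.0
    type_peak = max(type_scaled) or 1.0
    accepted = []
    for c, cs, ts in zip(candidates, class_scaled, type_scaled):
        conf = (name_similarity(subject_names, c.names)
                * (1.0 - cs / class_peak)     # class similarity
                * (1.0 - ts / type_peak))     # type similarity
        if conf > threshold:
            accepted.append((c, conf))
    return accepted
```

In practice this function would sit inside the outer loop over primary features, with the candidate list supplied by the similar places API endpoint.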

This process will need to be run a few times and the results eyeballed to determine the optimal threshold empirically.

Once the threshold is identified, another API script can be written that iterates over the list of candidates generated by the previous script, and creates the conflates relation from each subject to its candidates, choosing the primary based on its source, with the following priority:

  1. TIGER/Line
  2. Geonames
  3. OSM
  4. other datasets
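The primary-selection rule can be sketched as a simple priority lookup. The source identifier strings here are assumptions; the real dataset keys may differ.

```python
# Assumed source identifiers; the actual keys used in the gazetteer may differ.
SOURCE_PRIORITY = {"tiger": 0, "geonames": 1, "osm": 2}  # others sort last

def choose_primary(features):
    """Pick the primary record for a conflation group by source priority,
    with unknown sources ranked after the named ones."""
    return min(features, key=lambda f: SOURCE_PRIORITY.get(f["source"], 99))
```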

To keep from stepping on its own toes, the script will have to check that each subject has not already been marked subsidiary by a previous conflation step.