# Record Linkage

This notebook shows how to link actor names from the Princeton University Art Museum (PUAM) to the Getty Union List of Artist Names (ULAN).

In [10]:
# These two lines are only needed for locating the package.
import sys
sys.path.append('..')

import rltk
tk = rltk.init()
tk.set_root_path('../examples/puam')

## Prepare data

The first and most important step is preparing the data. Besides the two candidate datasets (in json_line, csv, or text format), you need to manually mark some positive and negative pairs across the two datasets. Here, [labeled_100.jsonl](../examples/puam/labeled_100.jsonl) contains 100 lines of labeled pairs.

## Get file iterator of datasets
Candidate datasets should be streamed as FileIterator objects in RLTK.

In [11]:
iter1 = tk.get_file_iterator('../../datasets/ulan.json', type='json_line', id_path='uri[*].value')
iter2 = tk.get_file_iterator('../../datasets/puam.json', type='json_line', id_path='uri[*].value')
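
The `id_path` value `uri[*].value` reads like a JSONPath-style expression. As a rough illustration of what such a path selects (this is not RLTK's implementation), the sketch below resolves it against one record, assuming `[*]` means "every element of the list". The `resolve_path` helper and the sample record are hypothetical.

```python
import json


def resolve_path(record, path):
    """Return all values matched by a dotted path where `[*]` expands a list.

    Hypothetical helper for illustration; RLTK's own path handling may differ.
    """
    current = [record]
    for step in path.split('.'):
        expand = step.endswith('[*]')
        key = step[:-3] if expand else step
        next_values = []
        for node in current:
            if not isinstance(node, dict) or key not in node:
                continue
            value = node[key]
            if expand:
                # `[*]` fans out over every element of a list value.
                if isinstance(value, list):
                    next_values.extend(value)
            else:
                next_values.append(value)
        current = next_values
    return current


# Made-up record in the assumed shape of one JSON Lines row.
line = ('{"uri": [{"value": "http://example.org/actor/1"}], '
        '"name": [{"value": "Ada Example"}, {"value": "A. Example"}]}')
record = json.loads(line)

ids = resolve_path(record, 'uri[*].value')
names = resolve_path(record, 'name[*].value')
```

Under these assumptions, a record with several `name` entries yields several values, which is why a path can return a list rather than a single string.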

## Train Classifier
Then, load a feature configuration file for generating feature vectors, e.g., [feature_config.json](../examples/puam/feature_config.json).

In [12]:
tk.load_feature_configuration('feature_config', 'feature_config.json')

Use labeled data and feature configurations to compute labeled features:

In [13]:
tk.compute_labeled_features(iter1=iter1.copy(), iter2=iter2.copy(),
                    label_path='labeled_100.jsonl',
                    feature_config_name='feature_config',
                    feature_output_path='labeled_feature.jsonl')
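
The contents of `feature_config.json` are not shown here, so the following is only a sketch of what a feature vector for one labeled pair might look like, assuming the features are string similarities between two names. The two measures (normalized Levenshtein similarity and token-level Jaccard similarity) and all function names are illustrative choices, not the configured RLTK features.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale edit distance into [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a, b):
    """Jaccard overlap of the lowercased whitespace tokens of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def feature_vector(name1, name2):
    """Hypothetical two-dimensional feature vector for one candidate pair."""
    return [levenshtein_similarity(name1, name2),
            jaccard_similarity(name1, name2)]
```

Each labeled pair would then contribute one such vector plus its positive or negative label to the training file.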

Once you have labeled features, you can use them to train a classifier.

In [14]:
model = tk.train_model(training_path='labeled_feature.jsonl', classifier='svm')

You can dump the model to a file and load it again for later use.

In [15]:
tk.dump_model(model, 'model.pkl')
model = tk.load_model('model.pkl')
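
The `.pkl` extension suggests the model is serialized with Python's `pickle` module, though that is an assumption about `dump_model()` internals rather than something confirmed here. A minimal round trip with a stand-in object might look like this:

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; a real model object would go here.
model = {'classifier': 'svm', 'weights': [0.7, 0.3], 'bias': -0.5}

# Write to a throwaway directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)
```

If pickle is indeed the format, the usual caveat applies: only load pickle files from sources you trust.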

## Blocking
Sometimes the candidate datasets are huge. To reduce the number of comparisons, blocks need to be created first (this step may take a long time and create a large file).

In [16]:
tk.q_gram_blocking(
    iter1=iter1, q1=[3], value_path1=['name[*].value'],
    iter2=iter2, q2=[3], value_path2=['name[*].value'],
    output_file_path='blocking.jsonl')
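
To make the idea concrete, here is a minimal pure-Python sketch of q-gram blocking with q=3. It is not RLTK's implementation: it assumes each record has a single name string, and the function names and toy records are made up for illustration. Two records become a candidate pair only if their names share at least one character 3-gram.

```python
from collections import defaultdict


def q_grams(text, q=3):
    """Return the set of character q-grams of `text` (lowercased)."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def q_gram_blocks(records1, records2, q=3):
    """Map each q-gram to the ids from both datasets that contain it.

    Only q-grams seen in both datasets are kept, since only those
    produce cross-dataset candidate pairs.
    """
    index1, index2 = defaultdict(set), defaultdict(set)
    for rid, name in records1.items():
        for gram in q_grams(name, q):
            index1[gram].add(rid)
    for rid, name in records2.items():
        for gram in q_grams(name, q):
            index2[gram].add(rid)
    return {gram: (index1[gram], index2[gram])
            for gram in index1.keys() & index2.keys()}


def candidate_pairs(blocks):
    """Collect every cross-dataset (id1, id2) pair that shares a block."""
    pairs = set()
    for ids1, ids2 in blocks.values():
        pairs.update((a, b) for a in ids1 for b in ids2)
    return pairs


# Toy records keyed by id; not taken from the real datasets.
ulan = {'u1': 'Monet', 'u2': 'Picasso'}
puam = {'p1': 'Manet', 'p2': 'Picaso', 'p3': 'Rodin'}
pairs = candidate_pairs(q_gram_blocks(ulan, puam, q=3))
```

In this toy case the six possible cross-dataset pairs shrink to two candidates, which is the saving blocking is meant to provide.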

For testing purposes, I took the first 100 lines from the output file `blocking.jsonl` and saved them as [blocking_100.jsonl](../examples/puam/blocking_100.jsonl).
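
One way to produce such a sample without loading the whole file is `itertools.islice`. The sketch below is an assumption about how the subset could be made, not a record of how it actually was; the `head_lines` helper is hypothetical, and the demo runs on a throwaway file instead of the real blocking output.

```python
import os
import tempfile
from itertools import islice


def head_lines(src_path, dst_path, n=100):
    """Copy the first `n` lines of `src_path` to `dst_path`; return lines written."""
    written = 0
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        # islice stops after n lines, so the rest of the file is never read.
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written


# Demo on a throwaway file so the sketch runs without the real blocking output.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'blocking.jsonl')
dst = os.path.join(workdir, 'blocking_100.jsonl')
with open(src, 'w', encoding='utf-8') as f:
    for i in range(250):
        f.write('{"block": %d}\n' % i)

count = head_lines(src, dst, n=100)
```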

## Compute vectors and make prediction

After blocking, compute feature vectors for the pairs in these blocks. Then use the previously trained model to make predictions for these pairs.

In [17]:
tk.compute_features(iter1=iter1.copy(), iter2=iter2.copy(),
                    feature_config_name='feature_config',
                    feature_output_path='feature.jsonl',
                    blocking_path='blocking_100.jsonl')

In [18]:
tk.predict(model, feature_file='feature.jsonl', predict_output_file='predicted.jsonl')

After prediction, you get the predicted linkage result in [predicted.jsonl](../examples/puam/predicted.jsonl). It includes all the possible pairs indicated by the blocking file. You need to do further filtering to pick out the most likely pairs.

In [19]:
tk.filter('predicted.jsonl', 'filtered.jsonl', unique_id1=False)

For a particular id, `filter()` by default picks out the other id that is most likely to form a pair with it. `unique_id1` is set to `False` because id2 (the PUAM id) is the unique identifier here. Finally, your record linkage result is in [filtered.jsonl](../examples/puam/filtered.jsonl).
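
The selection step can be sketched in plain Python. This is only an illustration under assumptions: the layout of `predicted.jsonl` is not shown here, so the sketch takes `(id1, id2, score)` tuples as input, and `best_match_per_id2` is a hypothetical helper, not RLTK's `filter()`. It keeps the highest-scoring id1 for each id2, so an id1 may appear more than once while each id2 appears exactly once.

```python
def best_match_per_id2(predictions):
    """Keep the highest-scoring id1 for every id2.

    `predictions` is an iterable of (id1, id2, score) tuples, an assumed
    shape used for illustration only.
    """
    best = {}
    for id1, id2, score in predictions:
        if id2 not in best or score > best[id2][1]:
            best[id2] = (id1, score)
    return best


# Made-up scores; id1 values stand for ULAN ids, id2 values for PUAM ids.
predictions = [
    ('u1', 'p1', 0.91),
    ('u2', 'p1', 0.40),
    ('u1', 'p2', 0.35),
    ('u1', 'p3', 0.88),
]
filtered = best_match_per_id2(predictions)
```

Here `u1` is matched to three different PUAM ids, which is the behavior the text attributes to leaving id1 non-unique.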