Blocking Scheme Learner
OVERVIEW: BSL is an implementation of the blocking scheme learning algorithm BSL presented in the paper: Matthew Michelson and Craig A. Knoblock, Learning Blocking Schemes for Record Linkage, In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, MA, 2006
BSL learns blocking schemes for record linkage that attempt to minimize the number of false positives while covering as many true positives as possible in the training set. This means you need to label data, but we assume you are doing blocking for supervised record linkage.
INSTALLATION: Unzip the package BSL.zip to whatever folder you want and you should be ready to use the system, given that you have Java on the machine and it's a compatible version (you will get an error if it is not.).
DEPENDENCIES: This code assumes that you have MySQL as your database system and Cygwin as your running environment. (If you are not on windows you should be able to get away with not having cygwin.)
- DATABASE CONFIGURATION: BSL needs the data in a certain format in order to run and should be run using MySQL as the database. First, make sure that your tables have id columns that are unique for each record. These ids should be numbers, but should be stored as varchars in the database. For your tables, you need to define odbc connections and have the appropriate odbc drivers installed for MySQL, since the code uses jdbc:odbc connections. To configure BSL to use your database tables, you must create DATASOURCE FILES. For examples of 2 data source files, see DataSource1.xml and DataSource2.xml. For a description of these files, see the section below titled DATASOURCE FILES. Note that the attributes have the same names across sources. This is very important. The system does not do schema matching, and assumes this is done by the user: hence the attributes must have matching names.
Another very important table you must have in your database is the table of matches. This is akin to labeling your training set. A matches table must have one column where the attribute name is the id attribute name from one source, and must have another column for the second source. I.e. if source one's id is called id_1 in its table (and datasource file) and source two's id is called id_2, then the matches table must have 2 columns: id_1 and id_2. You fill this table with pairs of records that are matches. For example, if record 1 from source 1 and record 12 from source 2 match, then you add a row to the matches table where id_1 = '1' and id_2-'12'. You can call these ids anything you want, as long as they match between the data sources and match table. Another important aspect of this table is the name. You need to name this table <data source 1 name>_<data source 2 name>MATCHES. Lastly, BSL uses a temporary copy of the matches in the algorithm, so you need to make a table just like the matches table called <data source 1 name><data source 2 name>_MATCHES_COPY. You should copy the schema and data from the matches table into this matches_copy table.
The last two tables you create are for storing the final candidates generated by the rule learned during training. The first table stores the candidates generated when applying the final rule to the training set. It should be named <data source 1 training set name>_finalcands, and it's schema should be the same as the matches table, i.e. it should have 2 columns <id_1 name> and <id_2 name>. The last table you need to create is <data source 1 testing set name>_finalcands, with the same schema as the matches table.
As a last and important note, all of the tables should exist within the same data base. This is assumed by the system.
DATASOURCE FILES: The data source files are in a particular XML format. As examples, look at DataSource1Train.xml, DataSource1Test.xml and DataSource1.xml. These are just examples for you to follow, you may create and name your data sources whatever you want, tsince these filename are parameters to the script to run the system.
There are a few elements of importance in this XML file. The element is how you refer to this data set. Since this is a parameter to running BSL, you should remember this parameter. I usually name it the same as the data source file's name. The tag is optional, and is used in case you send the data source file to someone use for use, so they can know what the source is providing. Each element of the tag is an attribute used for matching between the data sets. Note, that just because you include an attribute here, it will not automatically be used for blocking. You set the attributes and methods to use for blocking in a different configuration file, which is described later. As mentioned above, however, the attribute names in the two data source files should match exactly. The tag is very important and it's child elements describe how you will access the data. The tag is the odbc connection that you have defined for your data. The tag is the login you created for your odbc connection, and the is your odbc password. The <ID_Attr> is the primary key (id) for the data set. This value should be the same in the matches table too. So if this value is "id_1" then it should also be "id_1" in the matches table to identify records from this set. Lastly is, you guessed it, the name of the table in the database. Remember, this is a supervised algorithm, so you need a datasource table of training and testing data, so this means you need to create 3 data source files: 1 for the training data from source 1 (DataSource1Train.xml), 1 for the testing data from source 1 (DataSource1Test.xml) and 1 for the data from source 2 (DataSource2.xml). Once you have created the two data source files for blocking, you can proceed to configuring your blocking methods and attributes!
ATTRIBUTE & METHOD FILES: The way to define which methods to use and which attributes to use for blocking. For an example file, see AttrMethod.xml. Now we will describe the elements of this file. Each in the elelement is one of the attributes that will be used for blocking. You can use a subset of the attributes you define in your data source files, but the names of the attributes should match exactly between this file and the data source files. Each of the defines which methods will be applied to the attributes of the for building a blocking scheme. So far there are 3 types of methods, each of which will be described here: Token -- This method generates a candidate where at least 1 token matches between the given attribute. So, if the attribute is full name and record 1 is "Joe Schmo" and record 2 is "Joe Smith" then since Joe matches, this will be a candidate.
firstN_X -- This method generates a candidate when the attributes share the first X letters in common. You can set X to whatever integer you want it to be. For example, AttrMethod.xml includes the method firstN_3, which means to generate candidates where the attributes share the first 3 letters.
ngram_X -- This method generates a candidate when the records share X common letters in the attribute. For example, bear and fear would be a 3-gram, since they have "ear" in common. Like firstN, you can decide what value you want to include for X. In AttrMethod.xml, the method is ngram_4, meaning the method returns candidates that share 4-grams.
All of the different methods can be found in the TokenMethod.java file (poorly named), so if you want to extend the algorithm to include different methods, you can code them up in this file.
Once you have defined all of your data source files and attribute/method file, you are ready to start running the system!
RUNNING: The first step to running BSL is to run the script: setUpRefIndex.bat. BSL has also been used when matching ustructured text to a referenc set, so the legacy of this results in a funny name for this script. Basically, the "Ref set" is whichever set is not broken up into training and testing data. In our example case, it would be DataSource2. What this script does is to apply the attribute/method pairs to the records of the "ref set" and cache the results to make generating candidates that much faster.
To run this script you would open command prompt and type: ./setUpRefIndex
For our example it would be: ./setUpRefIndex DataSource2.xml DataSource2Name DataSource1Train.xml TrainingSetName AttrMethod.xml
(Note: the training stuff is not actually used here, it's just a legacy from an older version...)
Once you have set up the records from the "ref set" you are now ready to learn a blocking scheme and then test how well it does. To test the data, the system assumes that the matches table has labeled matches for both the training and testing data. Note, however, that if you are using the algorithm in a setting where do not have labels for the testing data, that is okay. The system will still run fine in this case, but the final results that it prints out will not be meaningful. Note that this system produces lots of output as it runs. It likes to show you it's search path during learning, which produces lots and lots of output. That being said, if you want to see the final blocking scheme learned by the system, you will see the following line outputted: "FINAL RULE:" And below this you will see your final learned blocking scheme. It will in disjunctive normal form, such as: (attr|method&attr|method&...) OR (attr|method&...) OR ... These will be the last 2 lines generated when the system runs.
After running all of the candidates generated by these rules will be stored in the finalCands tables that you created earlier in the set-up of the system.
So, to run the system after setting up the ref index open up a command promp and type: ./runBSL
I hope you enjoy using this system!