## Introduction

A typical data science pipeline involves data acquisition, extraction (web scraping in HW1), cleaning (time series data in HW2), integration (data blocking, then matching), and final analysis. The goal of this tutorial is to introduce a useful library, py_entitymatching, for the blocking step of the data integration stage.

What is __blocking__? This stage comes after data cleaning and before data matching. We are familiar with processing data from a single source, but when we have more than one source, we need to compare them before further analysis. However, you cannot compare an apple to an orange, so we need to match identities first. For example, look at the left and right tables below. Based on name similarity and location, we pair Dave Smith on the left with David Smith on the right, and Dan Smith on the left with Daniel B. Smith on the right. There are 3*2=6 possible pairs (each person on the left is compared with each person on the right), and we try to block the other 4 pairs that do not appear to match.

Name         | City          | State | | Name            | City        | State
------------ | ------------- | ----- |-| --------------- | ----------- | -----
Joe Brown    | Pittsburgh    | PA    | | David Smith     | Los Angeles | CA
Dave Smith   | Los Angeles   | CA    | | Daniel B. Smith | Sacramento  | CA
Dan Smith    | Sacramento    | CA    | |                 |             |

As the two tables grow, it becomes impractical to find all matches manually (2 tables with 1,000 rows each generate 1,000,000 candidate pairs). The __goal__ of data blocking is to come up with blockers that reduce the number of candidate pairs. After data blocking and data matching, the next step depends on your goal: you can analyze the matched pairs specifically, or you can merge the tables and deduplicate them so that you have a larger data pool.
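To make the pair counting concrete, here is a minimal sketch that enumerates every candidate pair for the two toy tables above using only the standard library. The variable names (`left`, `right`, `candidates`, `survivors`) and the "same city" rule are mine, chosen purely for illustration; they are not part of py_entitymatching.

```python
from itertools import product

# Toy tables from the example above: (name, city, state)
left = [
    ("Joe Brown", "Pittsburgh", "PA"),
    ("Dave Smith", "Los Angeles", "CA"),
    ("Dan Smith", "Sacramento", "CA"),
]
right = [
    ("David Smith", "Los Angeles", "CA"),
    ("Daniel B. Smith", "Sacramento", "CA"),
]

# Without blocking, every left row is paired with every right row.
candidates = list(product(left, right))
print(len(candidates))  # 3 * 2 = 6

# A very crude hand-written blocker: keep only pairs that share a city.
survivors = [(l, r) for l, r in candidates if l[1] == r[1]]
for l, r in survivors:
    print(l[0], "<->", r[0])
```

The crude rule keeps 2 of the 6 pairs. Real blockers follow the same idea but must be cheap enough to run on millions of pairs.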

We will cover the following topics in this tutorial:
- [Install the library](#Install-the-library)
- [Pre-processing](#Pre-processing)
- [Blockers](#Blockers)
- [What's Next](#What's-Next)
- [Summary and references](#Summary-and-references)



## Install the library

Make sure you have Anaconda installed, then run the following commands:

    $ conda install -c uwmagellan py_entitymatching

    $ conda uninstall py_entitymatching

    $ pip install 'py_entitymatching==0.2.0'

The first command installs the library along with all of its dependencies; the second and third commands replace the library with the stable version 0.2.0. I use this older version because I found that the newest version has compatibility issues on some machines.

Now make sure the imports work without problems.

In [1]:
import py_entitymatching as em
import os
import pandas as pd
import numpy as np

In this tutorial, I'll demonstrate how to build the different blockers provided by the py_entitymatching library and how to use them on my data set. I collected information on roughly 9,000 books from the web (searching with keywords such as Java and data), with about 5,300 from Barnes & Noble (tableA.csv) and about 3,800 from Goodreads (tableB.csv). Their attributes are transformed into the same format:

**id, title, authors, ISBN13, pages, publisher, publishedYear, publishedMonth, publishedDay**

I only extracted **basic information** about the books from the web. To conduct meaningful analysis in a real-world scenario, you can include attributes such as book price, reviews, and ratings when you do web scraping. I will use the word 'attribute' to refer to a column name.


In [2]:
# filepaths for table A and B. 
cur_path = os.getcwd()
path_A = os.path.join(cur_path, 'tableA.csv')
path_B = os.path.join(cur_path, 'tableB.csv')

In [3]:
# read table A; 'id' as the key attribute
A = em.read_csv_metadata(path_A, key='id')
# read table B; 'id' as the key attribute
B = em.read_csv_metadata(path_B, key='id')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


The above code reads tableA and tableB. The entity matching library is built on pandas. Make sure you have a column that can serve as the primary key (a unique value for each row). In my case, I set 'id' as the key. You are free to use pandas operations on the tables you've loaded.
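If you are unsure whether a column qualifies as a key, a quick check with plain pandas can help before handing the table to the library. This is a minimal sketch on a made-up frame; the helper name `is_valid_key` is hypothetical and not part of py_entitymatching.

```python
import pandas as pd

def is_valid_key(df, key):
    """Return True if `key` has no missing values and no duplicates."""
    return bool(df[key].notna().all() and df[key].is_unique)

toy = pd.DataFrame({"id": [0, 1, 2], "title": ["a", "b", "b"]})
print(is_valid_key(toy, "id"))     # True
print(is_valid_key(toy, "title"))  # False: "b" appears twice
```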

Take a look at the data sets. Scroll down to get a rough understanding of tableA and tableB. Do you see any problems?

In [4]:
A.shape

(5279, 9)

In [5]:
A.head(15)

Unnamed: 0,id,title,authors,ISBN13,pages,publisher,publishedYear,publishedMonth,publishedDay
0,0,The Java Tutorial: A Short Course on the Basics / Edition 6,"Raymond Gallardo,Scott Hommel,Sowmya Kannan,Joni Gordon,Sharon Biocca Zakhour",9780134034089,864.0,Addison-Wesley,2014,12,26
1,1,Java : Complete Java Programming Guide.,Harry. H. Chaudhary.,2940046303308,0.0,Harry & Associates.,2014,9,14
2,2,Java Programs to Accompany Programming Logic and Design / Edition 8,Jo Ann Smith,9781285867403,200.0,Cengage Learning,2014,3,3
3,3,Raspberry Pi with Java: Programming the Internet of Things (IoT),"Stephen Chin,James Weaver",9780071842013,336.0,McGraw-Hill Professional Publishing,2015,11,3
4,4,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler",9780596527754,286.0,"O'Reilly Media, Incorporated",2006,10,1
5,5,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,304.0,McGraw-Hill Professional Publishing,2014,9,9
6,6,Java SE 8 for the Really Impatient / Edition 1,Cay S. Horstmann,9780321927767,240.0,Addison-Wesley,2014,1,28
7,7,Programming Groovy 2: Dynamic Productivity for the Java Developer / Edition 1,Venkat Subramaniam,9781937785307,370.0,"Pragmatic Programmers, LLC, The",2013,7,22
8,8,"Mikrographie des Holzes der auf Java vorkommenden Baumarten, im Auftrage des Kolonial-Ministeriu...",Hindrik Haijo Janssonius,9781314978940,626.0,HardPress Publishing,2013,12,20
9,9,Using Pointers in Java,Aatif Ahmad Khan,9781494409869,44.0,CreateSpace Publishing,2013,12,8


In [6]:
B.shape

(3785, 9)

In [7]:
B.head(15)

Unnamed: 0,id,title,authors,ISBN13,pages,publisher,publishedYear,publishedMonth,publishedDay
0,0,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321000000.0,651.0,Addison-Wesley Professional,2005.0,6.0,1.0
1,1,Java Performance Tuning,Jack Shirazi,9780596000000.0,600.0,O'Reilly Media,2003.0,1.0,28.0
2,2,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321000000.0,944.0,Addison Wesley Publishing Company,2005.0,7.0,29.0
3,3,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132000000.0,998.0,Prentice Hall,2006.0,8.0,4.0
4,4,Java and XSLT,Eric M. Burke,9780596000000.0,528.0,O'Reilly Media,2001.0,9.0,11.0
5,5,Java: An Introduction to Problem Solving and Programming,Walter J. Savitch,9780131000000.0,1060.0,Prentice Hall,2004.0,12.0,1.0
6,6,"Core Java 2, Volume I--Fundamentals (Core Series)","Cay S. Horstmann,Gary Cornell",9780131000000.0,784.0,Prentice Hall,2004.0,8.0,27.0
7,7,Big Java: Late Objects,Cay S. Horstmann,9781118000000.0,1056.0,Wiley,2012.0,2.0,1.0
8,8,Just Java 2 [With CDROM],Peter van der Linden,9780130000000.0,1098.0,Prentice Hall PTR,2001.0,12.0,21.0
9,9,Enterprise JavaBeans (Java Series (O'Reilly & Associates).),Richard Monson-Haefel,9781566000000.0,489.0,O'Reilly,2001.0,9.0,8.0


Here are some obvious ones:
1. Type mismatch (ISBN13 is read as an integer in tableA but as a float in tableB; the published dates are integers in tableA but floats in tableB)
2. Missing values (0 or NaN)

We need some data pre-processing before we use the library.

## Pre-processing

Although blocking is the focus, I think it is necessary to touch a bit on data cleaning. ALWAYS remember to clean the data and convert it to the right data types! Otherwise, the library will not produce what you expect. I have had cases where, after running blockers for 10 minutes, the output table came back empty because the data types of the compared attributes were different.

When I apply the to_numeric operation, I set errors to 'coerce' so that invalid conversions become NaN. Then I replace 0 with NaN, because ISBN, pages, and published date should never be 0. Lastly, I drop the rows that contain NaN values.
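To see what each of these steps does in isolation, here is a minimal sketch on a made-up `pages` column (the values are invented for illustration). It uses the same pandas calls as the cleaning cell, just on a single column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"pages": ["864", "0", "n/a", "200.0"]})

# Step 1: invalid strings such as "n/a" become NaN instead of raising.
toy["pages"] = pd.to_numeric(toy["pages"], errors="coerce")
# Step 2: treat 0 as a missing value as well.
toy["pages"] = toy["pages"].replace(0, np.nan)
# Step 3: drop rows with missing values, then cast to integers.
toy = toy.dropna(axis=0, how="any").astype({"pages": np.int64})

print(toy["pages"].tolist())  # [864, 200]
```

The cast to `int64` has to come last, because an integer column cannot hold NaN.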

In [8]:
# columns that should hold integers; 'id' is the key, so 0 is a valid value for it
int_cols = ['id', 'ISBN13', 'pages', 'publishedYear', 'publishedMonth', 'publishedDay']
no_zero_cols = ['ISBN13', 'pages', 'publishedYear', 'publishedMonth', 'publishedDay']
for table in (A, B):
    # invalid values become NaN instead of raising an error
    table[int_cols] = table[int_cols].apply(pd.to_numeric, errors='coerce')
    # a value of 0 in these columns means the value is missing
    table[no_zero_cols] = table[no_zero_cols].replace(0, np.nan)
    # drop rows with any missing value (in place, so A and B stay the same objects)
    table.dropna(axis=0, how='any', inplace=True)
    table[int_cols] = table[int_cols].astype(np.int64)

Look at the tables after cleaning below. We dropped roughly 500 rows from tableA and more than 1,000 rows from tableB. I'm dropping all rows with missing values, but you can set your own rules for handling missing values as long as they don't undermine your goal (and you keep more data samples if you relax your rules).
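As one possible relaxed rule, you could require only the attributes you actually plan to block on. The sketch below compares the strict rule with a relaxed one on a made-up frame; the titles and ISBN values are invented, and the choice of `subset` columns is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Book X", "Book Y", "Book Z"],
    "ISBN13": [9780000000001, np.nan, 9780000000003],  # made-up values
    "pages": [100, 250, np.nan],
})

strict = toy.dropna(axis=0, how="any")                    # drop on any missing value
relaxed = toy.dropna(axis=0, subset=["title", "ISBN13"])  # tolerate missing pages

print(len(strict), len(relaxed))  # 1 2
```

Keep in mind that a column which still contains NaN cannot be cast to `int64`, so a relaxed rule also changes which type conversions are possible afterwards.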

We have roughly 12,000,000 candidate pairs before we enter the blocking stage (4806*2593). Let's see how each blocker reduces this huge candidate set.

In [9]:
A.shape

(4806, 9)

In [10]:
A.head()

Unnamed: 0,id,title,authors,ISBN13,pages,publisher,publishedYear,publishedMonth,publishedDay
0,0,The Java Tutorial: A Short Course on the Basics / Edition 6,"Raymond Gallardo,Scott Hommel,Sowmya Kannan,Joni Gordon,Sharon Biocca Zakhour",9780134034089,864,Addison-Wesley,2014,12,26
2,2,Java Programs to Accompany Programming Logic and Design / Edition 8,Jo Ann Smith,9781285867403,200,Cengage Learning,2014,3,3
3,3,Raspberry Pi with Java: Programming the Internet of Things (IoT),"Stephen Chin,James Weaver",9780071842013,336,McGraw-Hill Professional Publishing,2015,11,3
4,4,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler",9780596527754,286,"O'Reilly Media, Incorporated",2006,10,1
5,5,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,304,McGraw-Hill Professional Publishing,2014,9,9


In [11]:
A.dtypes

id                 int64
title             object
authors           object
ISBN13             int64
pages              int64
publisher         object
publishedYear      int64
publishedMonth     int64
publishedDay       int64
dtype: object

In [12]:
B.shape

(2593, 9)

In [13]:
B.head()

Unnamed: 0,id,title,authors,ISBN13,pages,publisher,publishedYear,publishedMonth,publishedDay
0,0,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,651,Addison-Wesley Professional,2005,6,1
1,1,Java Performance Tuning,Jack Shirazi,9780596003777,600,O'Reilly Media,2003,1,28
2,2,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,944,Addison Wesley Publishing Company,2005,7,29
3,3,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,998,Prentice Hall,2006,8,4
4,4,Java and XSLT,Eric M. Burke,9780596001438,528,O'Reilly Media,2001,9,11


## Blockers

Now we have the tables ready for blocking! I will introduce four types of blockers provided by the py_entitymatching library.
1. [Attribute Equivalence Blocker](#1.-Attribute-Equivalence-Blocker)
2. [Overlap Blocker](#2.-Overlap-Blocker)
3. [Rule-based Blocker](#3.-Rule-based-Blocker)
4. [Blackbox Blocker](#4.-Blackbox-Blocker)

Each blocker can operate on ONE already-blocked table (a candidate set) using block_candset(), or on TWO separate tables using block_tables(). I'll explain their usage below.

## 1. Attribute Equivalence Blocker

As the name suggests, this blocker will block a pair if the specified attributes are NOT equivalent. In the code block below, the blocker compares the ISBN13 from tableA (a.k.a. left table) with ISBN13 from tableB (a.k.a. right table). 

For each candidate pair, if the ISBNs are different, the blocker does not include the pair in the output table. You can specify which attributes to include in the output table (l\_output\_attrs and r\_output\_attrs). Note that the output attributes get the default prefix 'ltable\_' or 'rtable\_', indicating which table they originally come from, and I recommend not changing them. The output table also has its own key attribute, '\_id'.

In [14]:
eq = em.AttrEquivalenceBlocker()
C_ISBN = eq.block_tables(A, B, l_block_attr='ISBN13', r_block_attr='ISBN13',\
    l_output_attrs=['title', 'authors', 'publisher', 'ISBN13','publishedYear','publishedMonth','pages'],\
    r_output_attrs=['title', 'authors', 'publisher', 'ISBN13','publishedYear','publishedMonth','pages'])

In [15]:
C_ISBN.shape

(1491, 17)

In [16]:
display_attr = ['ltable_title', 'ltable_authors', 'ltable_ISBN13', 'ltable_publisher', 'rtable_title', 'rtable_authors', 'rtable_ISBN13', 'rtable_publisher']

In [17]:
C_ISBN[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
0,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler",9780596527754,"O'Reilly Media, Incorporated",Java Generics and Collections,"Maurice Naftalin,Philip Wadler",9780596527754,O'Reilly Media
1,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,McGraw-Hill Professional Publishing,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,McGraw-Hill Education
2,Java SE 8 for the Really Impatient / Edition 1,Cay S. Horstmann,9780321927767,Addison-Wesley,Java SE 8 for the Really Impatient,Cay S. Horstmann,9780321927767,Addison-Wesley Professional
3,Doing Java: An Anthropological Detective Story,Niels Mulder,9789792111491,ATF Press,Doing Java: An Anthropological Detective Story,Niels Mulder,9789792111491,Kanisius
4,Beginning Java 2 SDK 1.4 Edition / Edition 1,"Ivor Horton,Wrox Author Team",9780764543654,Wiley,Beginning Java 2,Ivor Horton,9780764543654,Wrox Press
5,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,Sams,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,Sams
6,Java 2 in Record Time: Teach Yourself the 16 Essential Skills,Steven Holzner,9780782121711,"Wiley, John & Sons, Incorporated",Java 1.2 in Record Time [With *],Steven Holzner,9780782121711,Sybex
7,Big Java / Edition 3,Cay S. Horstmann,9780470105542,Wiley,Big Java,Cay S. Horstmann,9780470105542,John Wiley & Sons
8,Active Java: Object-Oriented Programming for the World Wide Web,"Adam Freeman,Katherine Harutunian,S. Mortimore,Darrel Ince",9780201403701,Addison-Wesley,Active Java,"Adam Freeman,Darrel Ince",9780201403701,Addison Wesley Longman
9,Art Of Java,"Herbert Schildt,James Holmes",9780072229714,McGraw-Hill/OsborneMedia,The Art of Java,"Herbert Schildt,James Holmes",9780072229714,McGraw-Hill/Osborne Media


In [18]:
C_ISBN.dtypes

_id                       int64
ltable_id                 int64
rtable_id                 int64
ltable_title             object
ltable_authors           object
ltable_publisher         object
ltable_ISBN13             int64
ltable_publishedYear      int64
ltable_publishedMonth     int64
ltable_pages              int64
rtable_title             object
rtable_authors           object
rtable_publisher         object
rtable_ISBN13             int64
rtable_publishedYear      int64
rtable_publishedMonth     int64
rtable_pages              int64
dtype: object

Look at the output above to get an idea of what the output table looks like. I only display 8 of the 17 attributes so you don't have to scroll all the way to the right to compare the results.

This blocker is useful if each table has an attribute that is universally unique (student ID, SSN, etc.). We are actually 'cheating' here by using the ISBN to reduce the candidate pairs from 12 million to 1,491. In real-world data sets, you are unlikely to have a universally unique attribute.
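Conceptually, attribute equivalence blocking behaves like an inner join on the blocking attribute. The following sketch reproduces that idea with plain pandas on two made-up tables. It is not how py_entitymatching implements the blocker, and it does not create the '_id' key or the metadata the library maintains.

```python
import pandas as pd

# Made-up tables; the ISBN13 values are placeholders.
left = pd.DataFrame({"id": [0, 1, 2],
                     "title": ["Book X", "Book Y", "Book Z"],
                     "ISBN13": [111, 222, 333]})
right = pd.DataFrame({"id": [0, 1],
                      "title": ["Book Y (2nd ed.)", "Book W"],
                      "ISBN13": [222, 444]})

# Prefix the columns the way the blocker's output does, then join on ISBN13.
pairs = left.add_prefix("ltable_").merge(
    right.add_prefix("rtable_"),
    left_on="ltable_ISBN13",
    right_on="rtable_ISBN13",
    how="inner",
)
print(pairs[["ltable_title", "rtable_title"]])
```

Only the pair with the shared ISBN survives; every other combination is blocked.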

## 2. Overlap Blocker

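The overlap blocker also operates on two attributes, one from the left table and one from the right table. It first splits each attribute value into tokens (words here, because word_level=True) and then counts the tokens the two values have in common. Stop words (recall HW3 text classification) can be ignored as well, but as far as I can tell this is controlled by the rem_stop_words argument, which is off by default and not set in the call below, so common words such as 'the' still count toward the overlap.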

For every candidate pair, if the number of overlapping words is smaller than the threshold (overlap_size=3), the blocker does not include the pair in the output table. In my case, the titles must share at least 3 words.
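The word-overlap count can be sketched in a few lines of standard-library Python. This is my own approximation of the idea, not the library's tokenizer: the cleaning steps (lowercasing, stripping punctuation) and the tiny `STOP_WORDS` set are assumptions made for illustration. The two titles are taken from the book data.

```python
import string

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list

def tokens(text, remove_stop_words=False):
    """Lowercase, strip punctuation, split on whitespace, return a set."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = set(cleaned.split())
    return words - STOP_WORDS if remove_stop_words else words

def overlap(a, b, remove_stop_words=False):
    """Number of distinct words the two strings have in common."""
    return len(tokens(a, remove_stop_words) & tokens(b, remove_stop_words))

t1 = "The Complete Guide to Java Database Programming (Java Masters Series)"
t2 = "The Java Language Specification (The Java Series)"

print(overlap(t1, t2))                          # 3: the, java, series
print(overlap(t1, t2, remove_stop_words=True))  # 2: java, series
```

With a threshold of 3, this pair survives only when stop words are counted, which shows how much such a setting can change the size of the candidate set.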

In [19]:
# First way: using overlap blocker
ob = em.OverlapBlocker()
# block on the title attribute
C_title = ob.block_tables(A, B, 'title', 'title', word_level=True, overlap_size=3,\
    l_output_attrs=['title', 'authors', 'ISBN13','pages', 'publisher', 'publishedYear', 'publishedMonth'],\
    r_output_attrs=['title', 'authors', 'ISBN13','pages', 'publisher', 'publishedYear', 'publishedMonth'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04


In [20]:
C_title.shape

(258718, 17)

In [21]:
C_title[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
0,The Complete Guide to Java Database Programming (Java Masters Series),"Mathew Siple,Matthew D. Siple,Siple",9780079132864,"McGraw-Hill Companies, The",The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
1,A Selection of Books on the East: In the Same Series and Similarly Illustrated; British North Bo...,James Baikie,9781330252949,FB &c Ltd,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
2,The J2EE Tutorial (The Java Series) / Edition 2,"Stephanie Bodoff,Jeff Jackson,Eric Jendrock,Kim Haase,Dale Green",9780321245755,Prentice Hall Professional Technical Reference,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
3,"An American Merchant in Europe, Asia, and Australia; a Series of Letters from Java, Singapore, C...",George Francis Train,9781425557959,University of Michigan Library,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
4,"Designing Web Services with the J2EE 1.4 Platform (The Java Series): JAX-RPC, SOAP, and XML Tech...","Inderjeet Singh,Beth Stearns,Thierry Violleau,Vijay Ramachandran,Greg Murray",9780321205216,Prentice Hall,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
5,"An American Merchant in Europe, Asia and Australia: A Series of Letters From Java, Singapore, Ch...",George Francis Train,9781330104958,FB &c Ltd,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
6,"Java: The Complete Reference, J2se (References Series)",Herbert Schildt,9780072230734,Osborne,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
7,J2EE Technology in Practice: Building Business Applications with the Java 2 Platform (Addison-We...,"Rick Cattell,Jim Inscore",9780201746228,Pearson Education,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
8,The Complete Guide to Java Database Programming (Java Masters Series),"Mathew Siple,Matthew D. Siple,Siple",9780079132864,"McGraw-Hill Companies, The",The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
9,"The Java Virtual Machine Specification, Java SE 7 Edition",Tim Lindholm,9780133260465,Pearson Education,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional


As you can see, the output table is really large (about 259K rows), but we've successfully reduced the number of candidate pairs from 12 million to about 259K. We can reduce it further by chaining other blockers onto the resulting C_title table.
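One way to quantify how much a blocker helps is the fraction of the full cross product that it removes. The helper below is a hypothetical convenience function of mine (not part of py_entitymatching), applied to the table sizes and candidate counts reported in this tutorial.

```python
def reduction_ratio(n_left, n_right, n_candidates):
    """Fraction of the full cross product that blocking removed."""
    total = n_left * n_right
    return 1 - n_candidates / total

total = 4806 * 2593
print(total)  # 12461958

# Overlap blocker on title, then attribute equivalence blocker on ISBN13
print(f"{reduction_ratio(4806, 2593, 258718):.2%}")
print(f"{reduction_ratio(4806, 2593, 1491):.4%}")
```

A high ratio alone does not mean the blocker is good: it also must not throw away true matches, which this number cannot tell you.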

## 3. Rule-based Blocker

Before we chain another blocker onto the result of the overlap blocker, let me introduce the general usage of the rule-based blocker.

The py_entitymatching library provides a feature table for measuring the similarity between attributes from two tables. See below:

In [22]:
feature_table = em.get_features_for_blocking(A, B, validate_inferred_attr_types=False) # set to True for development

In [23]:
feature_table

Unnamed: 0,feature_name,left_attribute,right_attribute,left_attr_tokenizer,right_attr_tokenizer,simfunction,function,function_source,is_auto_generated
0,id_id_exm,id,id,,,exact_match,<function id_id_exm at 0x1026e0730>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
1,id_id_anm,id,id,,,abs_norm,<function id_id_anm at 0x101f451e0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
2,id_id_lev_dist,id,id,,,lev_dist,<function id_id_lev_dist at 0x1a12e93ae8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
3,id_id_lev_sim,id,id,,,lev_sim,<function id_id_lev_sim at 0x1a12e93b70>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
4,title_title_jac_qgm_3_qgm_3,title,title,qgm_3,qgm_3,jaccard,<function title_title_jac_qgm_3_qgm_3 at 0x1a12e939d8>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
5,title_title_cos_dlm_dc0_dlm_dc0,title,title,dlm_dc0,dlm_dc0,cosine,<function title_title_cos_dlm_dc0_dlm_dc0 at 0x1a12e93c80>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
6,title_title_mel,title,title,,,monge_elkan,<function title_title_mel at 0x103e87c80>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
7,title_title_lev_dist,title,title,,,lev_dist,<function title_title_lev_dist at 0x1a12f34e18>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
8,title_title_lev_sim,title,title,,,lev_sim,<function title_title_lev_sim at 0x1a12f4e2f0>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True
9,authors_authors_jac_qgm_3_qgm_3,authors,authors,qgm_3,qgm_3,jaccard,<function authors_authors_jac_qgm_3_qgm_3 at 0x1a12f4e378>,from py_entitymatching.feature.simfunctions import *\nfrom py_entitymatching.feature.tokenizers ...,True


We can build rules for our rule-based blocker from these features. Here I use the feature 'authors_authors_jac_qgm_3_qgm_3'. 'authors_authors' means I'm comparing authors from the left table with authors from the right table. 'jac' means the Jaccard index, or **Intersection over Union** (https://en.wikipedia.org/wiki/Jaccard_index). 'qgm_3' means both values are tokenized into q-grams with q=3 (sequences of 3 characters, instead of the word-level tokens we've seen).

To combine these concepts, I'll show you a simple example. Suppose we have two strings, s1 = dave and s2 = dav (not to be confused with tables A and B). Using character-level 3-grams with # as padding at the front and back, we have:

s1_set: {##d, #da, dav, ave, ve#, e##}, 

s2_set: {##d, #da, dav, av#, v##}. 

intersection: {##d, #da, dav}

union: {##d, #da, dav, ave, ve#, e##, av#, v##}. Thus the Jaccard index is 3/8.
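The worked example above can be checked with a short standard-library sketch. The function names `qgrams` and `jaccard` are mine, and the padding scheme (# on both sides) follows the example in this tutorial; the library's own tokenizer may pad differently, so treat the numbers as illustrative.

```python
def qgrams(text, q=3, pad="#"):
    """Return the set of character q-grams of `text`, padded on both sides."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def jaccard(a, b, q=3):
    """Jaccard index (intersection over union) of two strings' q-gram sets."""
    set_a, set_b = qgrams(a, q), qgrams(b, q)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

print(sorted(qgrams("dav")))
print(jaccard("dave", "dav"))  # 0.375, i.e. 3/8
```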

Here I build a rule-based blocker and add the rule "block pairs whose authors Jaccard index is smaller than 0.4". You can add as many rules as you wish.

You can name a rule with 'rule_name=' when you add it; later on you can delete the rule by calling delete_rule() with that name. 'n_jobs=' is also useful, as it specifies the number of parallel jobs (-1 means use all your CPUs).

In [24]:
rule_blocker=em.RuleBasedBlocker() 
rule_blocker.add_rule(['authors_authors_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.4'], feature_table, rule_name='author_r')
C_author_title = rule_blocker.block_candset(C_title, n_jobs=-1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:08


In [25]:
C_author_title.shape

(3246, 17)

In [26]:
C_author_title[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
11,"The Java Language Specification, Java SE 8 Edition / Edition 1","James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Alex Buckley",9780133900699,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
12,"The Java Language Specification, Java SE 7 Edition / Edition 4","James Gosling,Bill Joy,Guy Steele,Gilad Bracha,Alex Buckley",9780133260229,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
13,The Java Language Specification / Edition 2,"James Gosling,Bill Joy,Guy Steele,Gilad Bracha",9780201310085,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
16,The Java Language Specification / Edition 3,"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Guy L. Steele",9780321246783,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
26,"Java Software Solutions : Foundations of Program Design, Update, JavaPlace Edition / Edition 2","John Lewis (5),William Loftus,William Loftus,William Loftus",9780201750522,Pearson Education,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
27,Java Software Solutions: Foundations of Program Design / Edition 2,"John Lewis (5),William Loftus",9780201612714,Addison-Wesley,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
28,MyProgrammingLab with Pearson eText -- Access Code Card -- for Java Software Solutions: Foundati...,"John Lewis,William Loftus",9780133781281,Pearson,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
30,Java Software Solutions (Java 5 version): Foundations of Program Design / Edition 4,"John Lewis (5),William Loftus,William Loftus",9780321322036,Pearson,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
31,"Java Software Solutions: Foundations of Program Design, CodeMate Enhanced Edition / Edition 3","John Lewis (5),William Loftus,William Loftus",9780321197191,Pearson,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
32,Java Software Solutions: Foundations of Program Design / Edition 7,"John Lewis (5),William Loftus",9780132149181,Pearson,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company


The cells below show how to delete a rule, add rules, and display all rule names.

In [27]:
rule_blocker.delete_rule('author_r')

True

In [28]:
rule_blocker.add_rule(['authors_authors_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.4'], feature_table, rule_name='author_r')
rule_blocker.add_rule(['publisher_publisher_jac_qgm_3_qgm_3(ltuple, rtuple) < 0.4'], feature_table, rule_name='publisher_r')
rule_blocker.get_rule_names()

odict_keys(['author_r', 'publisher_r'])

Here, instead of blocking on tableA and tableB, I use the already-blocked table C_title from the last blocker to further reduce the size. I also added a second rule on the publisher Jaccard index.

Scroll down to see what the output looks like.

In [29]:
# rule-based block on merged data
C_publisher_author_title = rule_blocker.block_candset(C_title, n_jobs=-1)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:08


In [30]:
C_publisher_author_title.shape

(1447, 17)

In [31]:
C_publisher_author_title[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
11,"The Java Language Specification, Java SE 8 Edition / Edition 1","James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Alex Buckley",9780133900699,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
12,"The Java Language Specification, Java SE 7 Edition / Edition 4","James Gosling,Bill Joy,Guy Steele,Gilad Bracha,Alex Buckley",9780133260229,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
13,The Java Language Specification / Edition 2,"James Gosling,Bill Joy,Guy Steele,Gilad Bracha",9780201310085,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
16,The Java Language Specification / Edition 3,"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Guy L. Steele",9780321246783,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
36,Java Software Solutions: Foundations of Program Design / Edition 5,"William Loftus,William Loftus",9780321409492,Addison Wesley,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
37,Java Software Solutions: Foundations of Program Design / Edition 6,"John Lewis (5),William Loftus",9780321532053,Addison Wesley,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
73,Data Structures and Abstractions with Java / Edition 1,"Frank Carrano,Walter Savitch",9780130174895,Prentice Hall,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,Prentice Hall
129,Data Structures and Abstractions with Java / Edition 2,"Frank Carrano,Walter Savitch",9780132370455,Prentice Hall,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,Prentice Hall
246,Java and XSLT,"Eric M. Burke,Mike Loukides",9780596001438,"O'Reilly Media, Incorporated",Java and XSLT,Eric M. Burke,9780596001438,O'Reilly Media
589,Core Java Volume I--Fundamentals / Edition 9,"Cay S. Horstmann,Gary Cornell",9780137081899,Prentice Hall,"Core Java 2, Volume I--Fundamentals (Core Series)","Cay S. Horstmann,Gary Cornell",9780131482029,Prentice Hall


The result is quite satisfying: we reduced C_title from roughly 250K rows to 1,447 rows, and you can see that the remaining candidates look similar.

## Bonus: Rule-based Blocker 

Below is the standard usage of a rule-based blocker on two tables. The interface is nearly the same as for the other blockers.

In [32]:
# rule-based blocking on tables A and B
C_publisher_author = rule_blocker.block_tables(A, B, n_jobs=-1,
    l_output_attrs=['title', 'authors', 'ISBN13', 'pages', 'publisher', 'publishedYear', 'publishedMonth'],
    r_output_attrs=['title', 'authors', 'ISBN13', 'pages', 'publisher', 'publishedYear', 'publishedMonth'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Finding pairs with missing value...


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [33]:
C_publisher_author[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
0,"The Java Language Specification, Java SE 8 Edition / Edition 1","James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Alex Buckley",9780133900699,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
1,The Java Language Specification / Edition 3,"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha,Guy L. Steele",9780321246783,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
2,"The Java Language Specification, Java SE 7 Edition / Edition 4","James Gosling,Bill Joy,Guy Steele,Gilad Bracha,Alex Buckley",9780133260229,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
3,The Java Language Specification / Edition 2,"James Gosling,Bill Joy,Guy Steele,Gilad Bracha",9780201310085,Addison-Wesley,The Java Language Specification (The Java Series),"James Gosling,Bill Joy,Guy L. Steele Jr.,Gilad Bracha",9780321246783,Addison-Wesley Professional
8,Java Software Solutions: Foundations of Program Design / Edition 5,"William Loftus,William Loftus",9780321409492,Addison Wesley,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
15,Java Software Solutions: Foundations of Program Design / Edition 6,"John Lewis (5),William Loftus",9780321532053,Addison Wesley,Java Software Solutions (Java 5.0 version): Foundations of Program Design,"John Lewis,William Loftus",9780321322036,Addison Wesley Publishing Company
16,Data Structures and Abstractions with Java / Edition 1,"Frank Carrano,Walter Savitch",9780130174895,Prentice Hall,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,Prentice Hall
17,Data Structures and Abstractions with Java / Edition 2,"Frank Carrano,Walter Savitch",9780132370455,Prentice Hall,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,Prentice Hall
20,Java: Introduction to Problem Solving and Programming / Edition 5,"Walter Savitch,Frank Carrano",9780136072256,Prentice Hall,Data Structures and Abstractions with Java,"Frank M. Carrano,Walter J. Savitch",9780132370455,Prentice Hall
21,Java Extreme Programming Cookbook,"Eric M. Burke,Brian M. Coyner",9780596003876,"O'Reilly Media, Incorporated",Java and XSLT,Eric M. Burke,9780596001438,O'Reilly Media


## 4. Blackbox Blocker

This is the last blocker and, in my opinion, the most interesting one. You have complete freedom to customize the blocking rule!

In [34]:
# Define the black-box function: block a pair when its numerical attributes differ too much
def published_range_block(x, y):
    month_diff = abs(x['publishedMonth'] - y['publishedMonth'])
    year_diff = abs(x['publishedYear'] - y['publishedYear'])
    pages_diff = abs(x['pages'] - y['pages'])
    if month_diff > 2 or year_diff > 2 or pages_diff > 10:
        return True
    else:
        return False

You can design your own function for blocking. **Return True for pairs that you think should be blocked (dropped), and False for pairs that should be kept as candidates**. In the above code, I set the rules as:
1. "if publishedMonth difference is greater than 2, block"
2. "if publishedYear difference is greater than 2, block"
3. "if pages difference is greater than 10, block"
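
To see the True/False convention in action, here is a minimal sketch that calls the same function on plain dictionaries standing in for table rows. The dictionaries are an assumption for illustration only; their values are copied from rows that appear in the `C_pub_date` output in this tutorial.

```python
def published_range_block(x, y):
    month_diff = abs(x['publishedMonth'] - y['publishedMonth'])
    year_diff = abs(x['publishedYear'] - y['publishedYear'])
    pages_diff = abs(x['pages'] - y['pages'])
    if month_diff > 2 or year_diff > 2 or pages_diff > 10:
        return True
    else:
        return False

# Stand-ins for one row of A and one row of B (illustration only).
generics_left = {'publishedYear': 2006, 'publishedMonth': 10, 'pages': 286}
generics_right = {'publishedYear': 2006, 'publishedMonth': 10, 'pages': 286}
programs_left = {'publishedYear': 2014, 'publishedMonth': 3, 'pages': 200}

# Identical year, month, and pages: nothing exceeds a threshold, so the pair is kept.
kept = published_range_block(generics_left, generics_right)
# Years are 8 apart and months 7 apart: the pair is blocked.
dropped = published_range_block(programs_left, generics_right)

print(kept, dropped)  # False True
```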

In [35]:
bb1 = em.BlackBoxBlocker()
bb1.set_black_box_function(published_range_block)
C_pub_date = bb1.block_tables(A, B, n_jobs=-1,
    l_output_attrs=['title', 'authors', 'publisher', 'ISBN13', 'publishedYear', 'publishedMonth', 'pages'],
    r_output_attrs=['title', 'authors', 'publisher', 'ISBN13', 'publishedYear', 'publishedMonth', 'pages'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:49


In [36]:
# Checking the output of the date blocker
C_pub_date.shape

(18813, 17)

In [37]:
C_pub_date

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_publisher,ltable_ISBN13,ltable_publishedYear,ltable_publishedMonth,ltable_pages,rtable_title,rtable_authors,rtable_publisher,rtable_ISBN13,rtable_publishedYear,rtable_publishedMonth,rtable_pages
0,0,2,107,Java Programs to Accompany Programming Logic and Design / Edition 8,Jo Ann Smith,Cengage Learning,9781285867403,2014,3,200,Walking in the Supernatural: Another Cup of Spiritual Java,"Bill Johnson,Beni Johnson,Eric Johnson,Danny Silk,Kevin Dedmon,Banning Liebscher,Judy Franklin,C...",Destiny Image,9780768440775,2012,1,208
1,1,2,599,Java Programs to Accompany Programming Logic and Design / Edition 8,Jo Ann Smith,Cengage Learning,9781285867403,2014,3,200,Pragmatic Unit Testing in Java 8 with Junit,"Jeff Langr,Andy Hunt,Dave Thomas",Pragmatic Bookshelf,9781941222591,2015,3,200
2,2,3,320,Raspberry Pi with Java: Programming the Internet of Things (IoT),"Stephen Chin,James Weaver",McGraw-Hill Professional Publishing,9780071842013,2015,11,336,Java Man,Harris Gray,Harrisgray,9780988895737,2013,11,340
3,3,3,542,Raspberry Pi with Java: Programming the Internet of Things (IoT),"Stephen Chin,James Weaver",McGraw-Hill Professional Publishing,9780071842013,2015,11,336,The Hidden Force: A Story of Modern Java,Louis Couperus,Palala Press,9781346413105,2015,11,330
4,4,4,306,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler","O'Reilly Media, Incorporated",9780596527754,2006,10,286,"Java Puzzlers: Traps, Pitfalls, and Corner Cases","Joshua Bloch,Neal Gafter",Addison-Wesley Professional,9780321336781,2005,9,282
5,5,4,312,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler","O'Reilly Media, Incorporated",9780596527754,2006,10,286,Java Generics and Collections,"Maurice Naftalin,Philip Wadler",O'Reilly Media,9780596527754,2006,10,286
6,6,4,719,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler","O'Reilly Media, Incorporated",9780596527754,2006,10,286,"Web Development with Java: Using Hibernate, JSPs and Servlets",Tim Downey,Springer,9781846288623,2007,9,288
7,7,5,271,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",McGraw-Hill Professional Publishing,9780071835886,2014,9,304,Death Before Decaf,Caroline Fardig,Alibi,9780804181303,2015,11,296
8,8,5,441,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",McGraw-Hill Professional Publishing,9780071835886,2014,9,304,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",McGraw-Hill Education,9780071835886,2014,9,304
9,9,5,562,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",McGraw-Hill Professional Publishing,9780071835886,2014,9,304,Java SE 7 Programming Essentials,Michael Ernest,Sybex,9781118359105,2012,11,314
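
Note that `published_range_block` compares the month and the year independently. That is why the pair in row 7 above (September 2014 vs. November 2015) survives: the month difference is 2 and the year difference is 1, although the two dates are 14 months apart. Conversely, December 2014 vs. January 2015 would be blocked because the month difference is 11. If that is not what you want, one possible variant is to compare the dates in total months. The sketch below is a hypothetical alternative, not the function used for the results in this tutorial; `months_apart`, `published_months_block`, and the thresholds `max_months` and `max_pages` are names I made up for illustration.

```python
def months_apart(x, y):
    """Absolute distance between two publication dates, in whole months."""
    months_x = x['publishedYear'] * 12 + x['publishedMonth']
    months_y = y['publishedYear'] * 12 + y['publishedMonth']
    return abs(months_x - months_y)

def published_months_block(x, y, max_months=2, max_pages=10):
    """Hypothetical variant: block when the dates or the page counts are too far apart."""
    pages_diff = abs(x['pages'] - y['pages'])
    return months_apart(x, y) > max_months or pages_diff > max_pages

# Made-up pair, one month apart across a year boundary: kept by this variant.
dec_2014 = {'publishedYear': 2014, 'publishedMonth': 12, 'pages': 300}
jan_2015 = {'publishedYear': 2015, 'publishedMonth': 1, 'pages': 300}
print(published_months_block(dec_2014, jan_2015))  # False

# Values from row 7 above, 14 months apart: blocked by this variant.
sep_2014 = {'publishedYear': 2014, 'publishedMonth': 9, 'pages': 304}
nov_2015 = {'publishedYear': 2015, 'publishedMonth': 11, 'pages': 296}
print(published_months_block(sep_2014, nov_2015))  # True
```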


Another example is to block on a partial ISBN, because the leading digits of an ISBN reflect the publisher (assuming you don't want to cheat by comparing the entire ISBN).

In [38]:
def partial_isbn_block(x, y):
    isbn_A = str(x['ISBN13'])
    isbn_head_A = isbn_A[0:6]
    isbn_B = str(y['ISBN13'])
    isbn_head_B = isbn_B[0:6]
    # compare first 6 digits
    if isbn_head_A != isbn_head_B:
        return True
    else:
        return False 
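
As a quick check of the partial-ISBN rule, the sketch below applies it to dictionaries that stand in for table rows (an assumption for illustration). The ISBNs are taken from rows shown in this tutorial's output.

```python
def partial_isbn_block(x, y):
    isbn_head_A = str(x['ISBN13'])[0:6]
    isbn_head_B = str(y['ISBN13'])[0:6]
    # block the pair when the first 6 digits differ
    return isbn_head_A != isbn_head_B

java_gently = {'ISBN13': 9780201710502}
ldap_programming = {'ISBN13': 9780201657586}
java_generics = {'ISBN13': 9780596527754}
java_puzzlers = {'ISBN13': 9780321336781}

# Both heads are '978020', so the pair is kept even though the books differ.
same_head = partial_isbn_block(java_gently, ldap_programming)
# '978059' vs '978032': the pair is blocked.
different_head = partial_isbn_block(java_generics, java_puzzlers)

print(same_head, different_head)  # False True
```

The first pair shows that sharing an ISBN head does not make two records a match; the blocker only keeps the pair as a candidate for the matching stage.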

## Blackbox Blocker: things to remember

1. Although the blackbox blocker is great for customization, it is computationally intensive! It takes **minutes** to finish blocking, while the other blockers finish in seconds on my data set.
2. The blackbox blocker accepts only one function at a time, so you cannot add several functions the way you add several rules to the rule-based blocker.

What I recommend is to write several blocking functions and merge them into one combined function, using `or` to join the conditions. Below is a function that combines the published-range function and the partial-ISBN function.

In [39]:
def combined_block(x, y):
    month_diff = abs(x['publishedMonth'] - y['publishedMonth'])
    year_diff = abs(x['publishedYear'] - y['publishedYear'])
    pages_diff = abs(x['pages'] - y['pages'])
    isbn_A = str(x['ISBN13'])
    isbn_head_A = isbn_A[0:6]
    isbn_B = str(y['ISBN13'])
    isbn_head_B = isbn_B[0:6]
    
    if month_diff > 2 or year_diff > 2 or pages_diff > 10 or isbn_head_A != isbn_head_B:
        return True
    else:
        return False
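
If you end up with many small blocking functions, writing the combined function by hand gets repetitive. A possible alternative is a small helper that builds the combined function for you. This is only a sketch: `any_of`, `year_block`, and `isbn_head_block` are hypothetical names, and the rows are plain dictionaries used for illustration.

```python
def any_of(*block_functions):
    """Build one black-box function that blocks a pair if any given function blocks it."""
    def combined(x, y):
        return any(block(x, y) for block in block_functions)
    return combined

def year_block(x, y):
    return abs(x['publishedYear'] - y['publishedYear']) > 2

def isbn_head_block(x, y):
    return str(x['ISBN13'])[0:6] != str(y['ISBN13'])[0:6]

year_or_isbn_block = any_of(year_block, isbn_head_block)

left = {'publishedYear': 2006, 'ISBN13': 9780596527754}
same_book = {'publishedYear': 2006, 'ISBN13': 9780596527754}
other_book = {'publishedYear': 2005, 'ISBN13': 9780321336781}

print(year_or_isbn_block(left, same_book))   # False: no rule fires, pair is kept
print(year_or_isbn_block(left, other_book))  # True: the ISBN heads differ
```

The resulting function takes two rows and returns True or False, so it has the same shape as `combined_block`. Whether a closure like this works together with `n_jobs=-1` is an assumption worth checking on your own setup.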

In [40]:
bb2 = em.BlackBoxBlocker()
# one function only
bb2.set_black_box_function(combined_block)
C_combined = bb2.block_tables(A, B, n_jobs=-1,
    l_output_attrs=['title', 'authors', 'publisher', 'ISBN13', 'publishedYear', 'publishedMonth', 'pages'],
    r_output_attrs=['title', 'authors', 'publisher', 'ISBN13', 'publishedYear', 'publishedMonth', 'pages'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:24


In [41]:
C_combined.shape

(1746, 17)

In [42]:
C_combined[display_attr]

Unnamed: 0,ltable_title,ltable_authors,ltable_ISBN13,ltable_publisher,rtable_title,rtable_authors,rtable_ISBN13,rtable_publisher
0,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler",9780596527754,"O'Reilly Media, Incorporated",Java Generics and Collections,"Maurice Naftalin,Philip Wadler",9780596527754,O'Reilly Media
1,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,McGraw-Hill Professional Publishing,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,McGraw-Hill Education
2,Java Gently: Programming Principles Explained / Edition 3,"J. M. Bishop,Judy Bishop",9780201710502,Addison Wesley,LDAP Programming with Java,Rob Weltman,9780201657586,Addison-Wesley Professional
3,Doing Java: An Anthropological Detective Story,Niels Mulder,9789792111491,ATF Press,Doing Java: An Anthropological Detective Story,Niels Mulder,9789792111491,Kanisius
4,Beginning Java 2 SDK 1.4 Edition / Edition 1,"Ivor Horton,Wrox Author Team",9780764543654,Wiley,Beginning Java 2,Ivor Horton,9780764543654,Wrox Press
5,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,Sams,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,Sams
6,Active Java: Object-Oriented Programming for the World Wide Web,"Adam Freeman,Katherine Harutunian,S. Mortimore,Darrel Ince",9780201403701,Addison-Wesley,Hooked on Java,Arthur Van Hoff,9780201488371,Addison Wesley Longman
7,Teach Yourself Internet Game Programming with Java in 21 Days,Michael Morrison,9781575211480,Sams,"Learn Java Now, with CD-ROM",Stephen Randy Davis,9781572314283,Microsoft Press
8,Women of the Kakawin World: Marriage and Sexuality in the Indic Courts of Java and Bali,Helen Creese,9780765601599,Taylor & Francis,Java 2 For Dummies,Barry Burd,9780764568589,For Dummies
9,Women of the Kakawin World: Marriage and Sexuality in the Indic Courts of Java and Bali,Helen Creese,9780765601599,Taylor & Francis,Introduction to Cryptography with Java Applets,David Bishop,9780763722074,Jones & Bartlett Publishers


Finally, when you have several output tables and want to merge them, call the function below; it takes the union of the tables.

In [43]:
C = em.combine_blocker_outputs_via_union([C_publisher_author_title, C_combined])

In [44]:
C.shape

(2765, 17)
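
Conceptually, a union of candidate sets keeps every (ltable_id, rtable_id) pair that appears in at least one input and keeps it only once, so the result can have fewer rows than the inputs added together. The sketch below illustrates that idea on plain lists of id pairs. It is not the library's implementation; `union_candidate_pairs` and the example pairs are made up for illustration.

```python
def union_candidate_pairs(*candidate_sets):
    """Union (ltable_id, rtable_id) pairs, keeping each pair once in first-seen order."""
    seen = set()
    merged = []
    for pairs in candidate_sets:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged

# Hypothetical outputs of two blockers, as (ltable_id, rtable_id) pairs.
rule_pairs = [(0, 443), (4, 312), (6, 15)]
blackbox_pairs = [(4, 312), (5, 441)]

merged = union_candidate_pairs(rule_pairs, blackbox_pairs)
print(merged)       # [(0, 443), (4, 312), (6, 15), (5, 441)]
print(len(merged))  # 4, not 5, because (4, 312) appears in both inputs
```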

In [45]:
C

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_ISBN13,ltable_pages,ltable_publisher,ltable_publishedYear,ltable_publishedMonth,rtable_title,rtable_authors,rtable_ISBN13,rtable_pages,rtable_publisher,rtable_publishedYear,rtable_publishedMonth
0,0,0,443,The Java Tutorial: A Short Course on the Basics / Edition 6,"Raymond Gallardo,Scott Hommel,Sowmya Kannan,Joni Gordon,Sharon Biocca Zakhour",9780134034089,864,Addison-Wesley,2014,12,The Java Tutorial: A Short Course on the Basics,"Sharon Zakhour,Sowmya Kannan,Raymond Gallardo",9780132761697,713,Addison-Wesley Professional,2013,2
1,1,4,312,Java Generics and Collections: Speed Up the Java Development Process,"Maurice Naftalin,Philip Wadler",9780596527754,286,"O'Reilly Media, Incorporated",2006,10,Java Generics and Collections,"Maurice Naftalin,Philip Wadler",9780596527754,286,O'Reilly Media,2006,10
2,2,5,441,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,304,McGraw-Hill Professional Publishing,2014,9,Iron-Clad Java: Building Secure Web Applications,"Jim Manico,August Detlefsen",9780071835886,304,McGraw-Hill Education,2014,9
3,3,6,15,Java SE 8 for the Really Impatient / Edition 1,Cay S. Horstmann,9780321927767,240,Addison-Wesley,2014,1,Core Java for the Impatient,Cay S. Horstmann,9780321996329,512,Addison-Wesley Professional,2015,2
4,4,6,273,Java SE 8 for the Really Impatient / Edition 1,Cay S. Horstmann,9780321927767,240,Addison-Wesley,2014,1,Java SE 8 for the Really Impatient,Cay S. Horstmann,9780321927767,215,Addison-Wesley Professional,2014,1
5,5,7,2980,Programming Groovy 2: Dynamic Productivity for the Java Developer / Edition 1,Venkat Subramaniam,9781937785307,370,"Pragmatic Programmers, LLC, The",2013,7,Contemporary Database Marketing: Concepts and Applications,"Lisa D. Spiller,Kurtis D. Ruf",9781933199443,360,Racom Communications,2013,8
6,6,10,219,"Algorithms in Java, Part 5: Graph Algorithms / Edition 3","Robert Sedgewick,Michael Schidlowsky,Michael Schindlowsky",9780201361216,497,Addison-Wesley,2003,7,"Algorithms in Java, Parts 1-4","Robert Sedgewick,Michael Schidlowsky",9780201361209,768,Addison-Wesley Professional,2002,8
7,7,10,501,"Algorithms in Java, Part 5: Graph Algorithms / Edition 3","Robert Sedgewick,Michael Schidlowsky,Michael Schindlowsky",9780201361216,497,Addison-Wesley,2003,7,"Graph Algorithms (Algorithms in Java, Part 5)","Robert Sedgewick,Michael Schidlowsky",785342361216,528,Addison-Wesley Professional,2003,7
8,8,16,2359,"Cameron Mckenzie's SCJA Sun Certified Java Associate: Certification Study Guide for Jave 5, J2EE...",Cameron W. McKenzie,9781598729023,408,Pulpjava,2007,6,Beginning VB 2008 Databases: From Novice to Professional,"Vidya Vrat Agarwal,James Huddleston",9781590599471,409,Apress,2008,4
9,9,17,450,Java Gently: Programming Principles Explained / Edition 3,"J. M. Bishop,Judy Bishop",9780201710502,688,Addison Wesley,2001,1,LDAP Programming with Java,Rob Weltman,9780201657586,692,Addison-Wesley Professional,2000,2


## What's Next

Next, I'll show you how to prepare the output for applying supervised machine learning techniques to data matching.

The code below randomly samples 400 rows from table C (the table we unioned in the last step).

In [46]:
S = em.sample_table(C, 400)

The function below extends the sample with an extra column called 'label', filled with 0, which can be used to indicate whether the left and right records match (the labeling is done manually).

In [47]:
S = em.label_table(S, 'label')

Column name (label) is not present in dataframe


In [48]:
S

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_ISBN13,ltable_pages,ltable_publisher,ltable_publishedYear,ltable_publishedMonth,rtable_title,rtable_authors,rtable_ISBN13,rtable_pages,rtable_publisher,rtable_publishedYear,rtable_publishedMonth,label
4,4,6,273,Java SE 8 for the Really Impatient / Edition 1,Cay S. Horstmann,9780321927767,240,Addison-Wesley,2014,1,Java SE 8 for the Really Impatient,Cay S. Horstmann,9780321927767,215,Addison-Wesley Professional,2014,1,0
13,13,23,16,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,720,Sams,2007,6,Sams Teach Yourself Java in 24 Hours,Rogers Cadenhead,9780672330766,432,Sams,2009,11,0
46,46,78,388,"Better, Faster, Lighter Java","Bruce A. Tate,Justin Gehtland",9780596006761,266,"O'Reilly Media, Incorporated",2004,6,"Better, Faster, Lighter Java","Bruce A. Tate,Justin Gehtland",9780596552794,266,O'Reilly Media,2004,5,0
50,50,90,567,Java Transaction Design Strategies,Mark Richards,9781411695917,116,Lulu.com,2006,6,Java Transaction Design Strategies,Mark Richards,9781411695917,116,Lulu.com,2006,5,0
64,64,116,1384,Programming the Internet with Java,"Darrel Ince,Adam Freeman,Adam Freeman",9780201398441,400,Addison Wesley,1998,9,Programming The Internet With Java,"Darrel Ince,Adam Freeman",9780201175493,440,Addison Wesley Longman,1997,6,0
67,67,119,684,Ready-Made Java 2 Applications for File Maintenance: Templates 2000 for Java 1.2,Emilio Aleu,9780759693708,332,Authorhouse,2002,6,Essential Java for Scientists and Engineers,Brian D. Hahn,9780750654227,338,Butterworth-Heinemann,2002,6,0
71,71,126,1751,Interdisciplinary Computing in Java Programming / Edition 1,Sun-Chong Wang,9781402075131,2003,Springer US,2007,10,Interdisciplinary Computing in Java Programming,Sun-Chong Wang,9781402075131,266,Springer,2003,8,0
80,80,138,631,"The JavaScript Developer's Resource: Client-Side Programming Using HTML, Netscape Plug-Ins and J...","Kamran Husain,Jason Levitt",9780132679237,608,Prentice Hall Professional Technical Reference,1996,11,Graphic Java,"David M. Geary,Alan L. McClellan",9780135658475,600,Prentice Hall,1997,12,0
84,84,142,83,Concurrent and Distributed Computing in Java / Edition 1,Vijay K. Garg,9780471432302,336,Wiley,2004,2,Concurrent and Distributed Computing in Java,Vijay K. Garg,9780471432302,336,Wiley-IEEE Press,2004,2,0
86,86,143,784,"Exploring Java, 2nd Edition","Patrick Niemeyer,Josh Peck,Joshua Peck",9781565922716,614,"O'Reilly Media, Incorporated",1997,9,Java 2 Certification Training Guide,Jamie Jaworski,9781562059507,612,New Riders Publishing,1999,7,0


In [49]:
# save new tables to csv file
em.to_csv_metadata(C, './tableC.csv')
em.to_csv_metadata(S, './tableS.csv')

True

The code below demonstrates how to split the labeled set into a training set and a test set. I didn't label the sample, so the label column is all zeros. `train_proportion` specifies the fraction of rows to put into the training set; the rest go into the test set.

In [50]:
# assume tableS is labeled now
path_S = os.path.join(cur_path, 'tableS.csv')
labeled_S = em.read_csv_metadata(path_S, ltable=A, rtable=B)
sets = em.split_train_test(labeled_S, train_proportion=0.4, random_state=0)

In [51]:
sets['train']

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_ISBN13,ltable_pages,ltable_publisher,ltable_publishedYear,ltable_publishedMonth,rtable_title,rtable_authors,rtable_ISBN13,rtable_pages,rtable_publisher,rtable_publishedYear,rtable_publishedMonth,label
397,2735,5179,2979,The Essential Guide to Database Marketing,Davies,9780077071578,300,McGraw UK,1992,7,"Relational Databases: Concepts, Design, and Administration",Kenmore S. Brathwaite,9780070072527,294,McGraw-Hill Companies,1991,6,0
118,931,1375,2772,"Java Ee6 Cookbook for Securing, Tuning, and Extending Enterprise Applications",Mick Knutson,9781849683166,356,Packt Publishing,2012,6,Oracle Database 11g - Underground Advice for Database Administrators,April C. Sims,9781849680011,348,Packt Publishing,2010,4,0
254,1776,2821,3217,Oracle Database & Data Warehouse 11g: Performance Assessment,Sideris Courseware Corp,9781936930104,294,Sideris Courseware Corp.,2012,10,Oracle Database 11g R2: SQL Fundamentals I,Sideris Courseware Corp,9781936930005,422,Sideris Courseware Corp.,2011,5,0
188,1346,2041,708,Java Fundamental Classes Reference,"Jonathan Knudsen,Mark Grand,Jonathan B. Knudsen",9781565922419,1111,"O'Reilly Media, Incorporated",1997,5,Java Fundamental Classes Reference,"Jonathan Knudsen,Mark Grand",9781565922419,1111,O'Reilly Media,1997,5,0
208,1460,2296,2505,The Complete Database Marketer: Second Generation Strategies and Techniques for Tapping the Powe...,"Arthur Middleton Hughes,Hughes",9781557388933,550,"McGraw-Hill Companies, The",1995,9,The Complete Database Marketer: Second Generation Strategies and Techniques for Tapping the Powe...,Arthur Middleton Hughes,9781557388933,550,McGraw-Hill Professional,1995,9,0
375,2585,4694,2440,Access Control For Databases,"Elisa Bertino,Gabriel Ghinita,Ashish Kamra",9781601984166,166,Now Publishers,2011,1,Keyword Search in Databases,"M. Tamer Ozsu,Lu Qin,Lijun Chang",9781608451951,156,Morgan & Claypool,2010,1,0
110,826,1233,1406,Tricks of the Java Programming Gurus,Glenn L. Vanderburg,9781575211022,888,Macmillan Computer Publishing,1996,7,Maximum Java 1.1,Glenn Vanderburg,9781575212906,892,Sams,1997,5,0
149,1147,1726,138,Rails for Java Developers,"Stuart Halloway,Justin Gehtland",9780977616695,304,"Pragmatic Programmers, LLC, The",2007,1,Rails for Java Developers,"Stuart Halloway,Justin Gehtland",9780977616695,311,Pragmatic Bookshelf,2007,2,0
157,1200,1798,692,"Fundamentals of Java: AP* Computer Science Essentials for the A Exam, Third Edition / Edition 3","Kenneth Lambert,Martin Osborne",9780619243784,592,Cengage Learning,2006,1,Fundamentals of Java: Introductory,"Kenneth A. Lambert,Martin Osborne",9780619059712,648,Cengage Learning,2002,5,0
152,1176,1759,125,"Object-Oriented Software Engineering: Using UML, Patterns and Java / Edition 2","Bernd Bruegge,Allen H. Dutoit,Allen H. Dutoit",9780130471109,800,Prentice Hall,2003,9,"Object-Oriented Software Engineering: Using UML, Patterns and Java","Bernd Bruegge,Allen H. Dutoit",9780130471109,762,Prentice Hall,2003,9,0


In [52]:
sets['test']

Unnamed: 0,_id,ltable_id,rtable_id,ltable_title,ltable_authors,ltable_ISBN13,ltable_pages,ltable_publisher,ltable_publishedYear,ltable_publishedMonth,rtable_title,rtable_authors,rtable_ISBN13,rtable_pages,rtable_publisher,rtable_publishedYear,rtable_publishedMonth,label
132,1015,1493,1015,Java Programming with MS Visual J++ 7.0: Comprehensive / Edition 2,Joyce M. Farrell,9780619016593,695,"Course Technology, Inc.",2002,7,Fundamentals of Java Comprehensive Course,"Kenneth A. Lambert,Martin Osborne",9780619059637,702,Cengage Learning,2002,7,0
309,2116,3529,3573,Building the Agile Database: How to Build a Successful Application Using Agile Without Sacrifici...,Larry Burns,9781935504153,276,"Technics Publications, LLC",2011,8,Building the Agile Database: How to Build a Successful Application Using Agile Without Sacrifici...,Larry Burns,9781935504153,276,"Technics Publications, LLC",2011,7,0
341,2311,3998,2986,"Descriptive models, grade-tonnage relations, and databases for the assessment of sediment-hosted...","et al.,Cliff D. Taylor,J. Douglas Causey",9781296052720,176,Scholar's Choice,2015,2,Learn Database Systems with Implementation and Examples,Imed Bouchrika,9781291688894,180,Lulu.com,2014,1,0
196,1384,2083,2153,Beginning C# 5.0 Databases,Vidya Vrat Agarwal,9781430242604,440,Apress,2012,8,Beginning C# 5.0 Databases,Vidya Vrat Agarwal,9781430242604,440,Apress,2012,8,0
246,1723,2713,2774,A Simplified Forest Inventory and Analysis Database: FIADB-Lite,Miles,9781508413103,46,CreateSpace Publishing,2015,2,Grid Computing Database: Using Oracle Database,Dr Peter Archibong,9781508623328,38,Createspace Independent Publishing Platform,2015,2,0
60,455,642,93,Sams Teach Yourself Java 1.2 in 21 Days: Complete Compiler Edition,"Laura Lemay,Rogers Cadenhead",9780672315343,700,Sams,1998,12,Sams Teach Yourself Java 6 in 21 Days,"Rogers Cadenhead,Laura Lemay",9780672329432,720,Sams,2007,5,0
155,1197,1788,1808,Gingerbread 'n Java,Annie Lang,9781503105737,44,CreateSpace Publishing,2014,11,Gingerbread 'n Java,Annie Lang,9781503105737,44,Annie Things Possible Publications,2014,11,0
261,1832,2947,2738,Database Management Through DBase,"Robert T. Grauer,Maryann Barber",9780070241404,392,"McGraw-Hill Companies, The",1988,10,Database Management Through dBASE,"Robert T. Grauer,Maryann Barber",9780070241404,392,McGraw-Hill Companies,1988,11,0
141,1098,1640,132,"Sams Teach Yourself Java 2 in 21 Days, Professional Reference Edition","Laura Lemay,Rogers Cadenhead",9780672324550,859,Sams,2002,12,"Java in 24 Hours, Sams Teach Yourself (Covering Java 8)",Rogers Cadenhead,9780672337024,448,Sams,2014,5,0
214,1500,2360,3077,Fundamentals of Database Systems / Edition 5,"Ramez Elmasri,Shamkant B. Navathe,Shamkant Navathe",9780321369574,1168,Addison Wesley,2006,3,Fundamentals of Database Systems with Oracle 10g Programming: A Primer (6th Edition),"Ramez Elmasri,Shamkant B. Navathe",9780132165907,1172,Addison-Wesley,2010,4,0


In [53]:
sets['train'].shape

(160, 18)

In [54]:
sets['test'].shape

(240, 18)
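
These shapes agree with `train_proportion=0.4` applied to the 400 sampled rows. The short check below just redoes that arithmetic; the variable names are mine and it does not call the library.

```python
n_rows = 400            # size of the sample S
train_proportion = 0.4  # value passed to split_train_test above

n_train = int(round(n_rows * train_proportion))
n_test = n_rows - n_train

print(n_train, n_test)  # 160 240
```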

Now you are ready to apply supervised learning techniques (e.g., with scikit-learn) to the labeled set.

## Summary and references

In this tutorial, I first pre-processed the data set and then introduced four types of blockers. You can also start working on the data matching stage by following the last part.

py_entitymatching: https://github.com/anhaidgroup/py_entitymatching/

User manual: http://anhaidgroup.github.io/py_entitymatching/v0.1.x/singlepage.html
