# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita, and modified by Avijit Thawani (thawani@usc.edu) as part of the class <u>DSCI 558: Building Knowledge Graphs</u> at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

In [1]:
# !pip install rltk

### Task 1-1. Construct RLTK Datasets

First, you need to define what a single entry looks like for each type of record (one record class per dataset).

In [2]:
import rltk
import csv

# You can use this tokenizer in case you need to manipulate some data
tokenizer = rltk.tokenizer.crf_tokenizer.crf_tokenizer.CrfTokenizer()

In [53]:
'''
Goodreads
    id: ID
    name_string: Title
    name_token: Title_Token
    author: FirstAuthor
    publisher: Publisher

Noble
    id: ID
    name_string: Title
    name_token: Title_Token
    author: Author1
    publisher: Publisher

'''

class GoodRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']

    @rltk.cached_property
    def author(self):
        return self.raw_object['FirstAuthor']
    
    @rltk.cached_property
    def authors(self):
        l = []
        if self.raw_object['FirstAuthor']:
            l.append(self.raw_object['FirstAuthor'])
        if self.raw_object['SecondAuthor']:
            l.append(self.raw_object['SecondAuthor'])
        if self.raw_object['ThirdAuthor']:
            l.append(self.raw_object['ThirdAuthor'])
        
        return l

    @rltk.cached_property
    def ISBN(self):
        return self.raw_object['ISBN']

    @rltk.cached_property
    def ISBN13(self):
        return self.raw_object['ISBN13']
    
    @rltk.cached_property
    def page_count(self):
        return self.raw_object['PageCount']

    @rltk.cached_property
    def description(self):
        return self.raw_object['Description']
    
    @rltk.cached_property
    def rating(self):
        return self.raw_object['Rating']   

    @rltk.cached_property
    def num_rating(self):
        return self.raw_object['NumberofRatings']   

    @rltk.cached_property
    def num_review(self):
        return self.raw_object['NumberofReviews']   

    @rltk.cached_property
    def publish_date(self):
        return self.raw_object['PublishDate']   

    @rltk.cached_property
    def publish_format(self):
        return self.raw_object['Format']   

    @rltk.cached_property
    def language(self):
        return self.raw_object['Language']   

    @rltk.cached_property
    def publisher(self):
        return self.raw_object['Publisher']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

class NobleRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']
    
    @rltk.cached_property
    def date(self):
        return self.raw_object['PublicationDate']

    @rltk.cached_property
    def author(self):
        return self.raw_object['Author1']
    
    @rltk.cached_property
    def publisher(self):
        return self.raw_object['Publisher']
    
    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

    @rltk.cached_property
    def dimension(self):
        return self.raw_object['Productdimensions']
    

    @rltk.cached_property
    def sales_rank(self):
        return self.raw_object['Salesrank']

    @rltk.cached_property
    def rating_count(self):
        return self.raw_object['Ratingscount']

    @rltk.cached_property
    def paper_price(self):
        return self.raw_object['Paperbackprice']

    @rltk.cached_property
    def hard_price(self):
        return self.raw_object['Hardcoverprice']

    @rltk.cached_property
    def nook_price(self):
        return self.raw_object['Nookbookprice']

    @rltk.cached_property
    def audio_price(self):
        return self.raw_object['Audiobookprice']

    @rltk.cached_property
    def rating_value(self):
        return self.raw_object['Ratingvalue']
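
Both record classes follow the same pattern: a thin wrapper over one CSV row whose fields are computed lazily and then cached. Here is a minimal standard-library sketch of that idea. It does not use RLTK: the `BookRow` class and the sample row are hypothetical, `functools.cached_property` stands in for `rltk.cached_property`, and a whitespace split stands in for the CRF tokenizer.

```python
from functools import cached_property


class BookRow:
    """Hypothetical stand-in for an RLTK record: wraps one CSV row (a dict)."""

    def __init__(self, raw_object):
        self.raw_object = raw_object

    @cached_property
    def id(self):
        return self.raw_object['ID']

    @cached_property
    def authors(self):
        # Collect the non-empty author columns, preserving their order.
        keys = ('FirstAuthor', 'SecondAuthor', 'ThirdAuthor')
        return [self.raw_object[k] for k in keys if self.raw_object.get(k)]

    @cached_property
    def name_tokens(self):
        # Simple whitespace tokenizer (assumption; the notebook uses CrfTokenizer).
        return set(self.raw_object['Title'].lower().split())


# Invented sample row, not taken from the datasets.
row = {'ID': '7', 'Title': 'An Example Title',
       'FirstAuthor': 'A. Writer', 'SecondAuthor': '', 'ThirdAuthor': 'C. Editor'}
record = BookRow(row)
print(record.id, record.authors, sorted(record.name_tokens))
```

Each property is evaluated on first access and stored on the instance, so repeated comparisons of the same record do not re-tokenize the title.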

You can load your CSV files into RLTK like this:

In [54]:
dir_ = ''
good_file = dir_ + 'goodreads.csv'
noble_file = dir_ + 'barnes_and_nobles.csv'

ds1 = rltk.Dataset(rltk.CSVReader(good_file),record_class=GoodRecord)
ds2 = rltk.Dataset(rltk.CSVReader(noble_file),record_class=NobleRecord)

And we can inspect a few entries:

In [56]:
# print some entries
#print(ds1.generate_dataframe().head(5))
print(ds2.generate_dataframe().head(5))

  id                                        name_string        date  \
0  0          Pioneer Girl: The Annotated Autobiography  12/12/2014   
1  1  American Sniper (Movie Tie-in Edition): The Au...  11/25/2014   
2  2                     The Autobiography of Malcolm X  10/28/1987   
3  3                           Assata: An Autobiography  11/28/1999   
4  4                            Autobiography of a Yogi  12/28/1978   

                  author                              publisher  \
0   Laura Ingalls Wilder  South Dakota State Historical Society   
1             Chris Kyle               HarperCollins Publishers   
2              Malcolm X          Random House Publishing Group   
3          Assata Shakur     Chicago Review Press, Incorporated   
4  Paramahansa Yogananda            Self-Realization Fellowship   

                                         name_tokens  \
0  {Annotated, Autobiography, :, Pioneer, Girl, The}   
1  {Edition, -, History, in, U, :, Tie, Military,...   
2 

### Task 1-2. Blocking

First, we'll load the dev set, which is used to evaluate both blocking (Task 1-2) and entity linking (Task 1-3).

In [6]:
dev_set_file = dir_ + 'dev.csv'
dev = []
with open(dev_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            dev.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(dev)} lines.')

gt = rltk.GroundTruth()
for row in dev:    
    r1 = ds1.get_record(row[0])
    r2  = ds2.get_record(row[1])
    if row[-1] == '1':
        gt.add_positive(r1.raw_object['ID'], r2.raw_object['ID'])
    else:
        gt.add_negative(r1.raw_object['ID'], r2.raw_object['ID'])

rltk.Trial(gt)

Column names are: goodreads.ID, barnes_and_nobles.ID, label
Processed 297 lines.


<rltk.evaluation.trial.Trial at 0x7fb2af348a60>

Then, you can build your own blocking techniques and evaluate them.

Hint:

- What is the total number of pairs without blocking?
- What is the number of pairs with blocking?
- After blocking, how many "correct" (matched) pairs from the dev set are still present?
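
The first two hint questions can be explored on toy data before running RLTK. The sketch below is a plain-Python illustration of exact-match (hash) blocking, not RLTK's implementation. The helper names `hash_block` and `blocked_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import product


def hash_block(left, right, key):
    """Group record ids from both sides by an exact-match blocking key."""
    blocks = defaultdict(lambda: (set(), set()))
    for rec in left:
        blocks[rec[key]][0].add(rec['id'])
    for rec in right:
        blocks[rec[key]][1].add(rec['id'])
    return blocks


def blocked_pairs(blocks):
    """Yield (left_id, right_id) only for ids that share a block key."""
    for left_ids, right_ids in blocks.values():
        yield from product(sorted(left_ids), sorted(right_ids))


# Toy records (invented for illustration, not taken from the datasets).
left = [{'id': 'g0', 'author': 'A'}, {'id': 'g1', 'author': 'B'}, {'id': 'g2', 'author': 'C'}]
right = [{'id': 'n0', 'author': 'A'}, {'id': 'n1', 'author': 'A'}, {'id': 'n2', 'author': 'B'}]

pairs = list(blocked_pairs(hash_block(left, right, 'author')))
total_pairs = len(left) * len(right)
print('pairs without blocking:', total_pairs)
print('pairs with blocking:', len(pairs))
```

A block that contains records from only one side (author `C` here) contributes no candidate pairs, which is where the reduction comes from.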


In [7]:
def reduction_ratio(ds1, ds2, block):
    """
    Calculate the fraction of all possible record pairs that remain after blocking
    (pairs after blocking / pairs before blocking), based on the two original
    datasets and the final blocks.
    """
    block_pairs = len(list(rltk.get_record_pairs(ds1, ds2, block=block)))

    ds1_size = len(ds1.generate_dataframe())
    ds2_size = len(ds2.generate_dataframe())
    total_pairs = ds1_size * ds2_size

    ratio = block_pairs / total_pairs

    # Use % formatting; passing the value as a second print() argument
    # would print the literal "%s".
    print('Total pairs before blocking: %s' % total_pairs)
    print('Pairs after blocking: %s' % block_pairs)
    print('Reduction Ratio: %s' % ratio)
    return ratio

In [8]:
print('--- block on author ---')
bg = rltk.HashBlockGenerator()
block = bg.generate(bg.block(ds1, property_='author'),
                    bg.block(ds2, property_='author'))

for idx, b in enumerate(block.key_set_adapter):
    if idx == 5: break
    print(b)

print('----------------------')

reduction_ratio(ds1, ds2, block)

--- block on author ---
('Alex Ferguson', {('NobleRecord', '53'), ('GoodRecord', '0')})
('Boris Pasternak', {('GoodRecord', '3418'), ('GoodRecord', '261'), ('GoodRecord', '1'), ('NobleRecord', '3459'), ('NobleRecord', '1083')})
('Betty Boothroyd', {('NobleRecord', '2541'), ('GoodRecord', '158'), ('GoodRecord', '2')})
('Caddie', {('GoodRecord', '3')})
('Rudolf Nureyev', {('GoodRecord', '4')})
----------------------
Total pairs before blocking: 14681867
Pairs after blocking: 2933
Reduction Ratio: 0.0001997702335813286


0.0001997702335813286
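
The value returned by `reduction_ratio` is the fraction of pairs that survive blocking. Many record-linkage references instead define the reduction ratio as one minus that fraction, so it helps to be explicit about which one is being reported. Here is a short worked check of the arithmetic, using the dataset sizes (3967 and 3701) printed later in this notebook. The variable names are illustrative only.

```python
total_pairs = 3967 * 3701      # all Goodreads x Barnes & Noble pairs
blocked = 2933                 # pairs left after blocking on author

retained_fraction = blocked / total_pairs
reduction_ratio_conventional = 1 - retained_fraction

print(total_pairs)
print(retained_fraction)
print(reduction_ratio_conventional)
```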

In [9]:
# Alternative: block on publisher (disabled)
# print('--- block on publisher ---')
# bg = rltk.HashBlockGenerator()
# block2 = bg.generate(bg.block(ds1, property_='publisher'),
#                      bg.block(ds2, property_='publisher'))
#
# for idx, b in enumerate(block2.key_set_adapter):
#     if idx == 5: break
#     print(b)
#
# print('----------------------')
#
# reduction_ratio(ds1, ds2, block2)

#### Pairwise comparison

In [10]:
def pairs_completeness_and_quality(ds1, ds2, block, gt):
    """
    Calculate pairs completeness and quality using the block & groundtruth provided

    Returns (completeness, quality)
    """

    groundtruth = {}
    true_matches_compared = 0
    matches_compared = 0

    for id1, id2, label in gt:
        if label == 1:
            groundtruth[id1] = id2
    total_true_matches = len(groundtruth)

    for key, id1, id2 in block.pairwise(ds1, ds2):
        matches_compared += 1
        if id1 in groundtruth and groundtruth[id1] == id2:
            true_matches_compared += 1

    # Recall
    completeness = float(true_matches_compared) / total_true_matches
    print('Pairs Completeness = %s / %s  = %s' %(true_matches_compared, total_true_matches, completeness))

    # Precision
    quality = float(true_matches_compared) / matches_compared
    print('Pairs quality = %s / %s = %s' %(true_matches_compared, matches_compared, quality))

    return (completeness, quality)

In [11]:
# Evaluate blocking with author name
(_,_) = pairs_completeness_and_quality(ds1, ds2, block, gt)

Pairs Completeness = 53 / 67  = 0.7910447761194029
Pairs quality = 53 / 2933 = 0.018070235254006136
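
Pairs completeness is recall over the ground-truth matches, and pairs quality is precision over the candidate pairs. The same two numbers can be computed with set operations. The sketch below is a possible variant, not the notebook's function. It assumes the ground truth is stored as a set of `(id1, id2)` pairs, which lets one record take part in more than one true match, and it guards against empty inputs. The function name and toy data are hypothetical.

```python
def completeness_and_quality(candidate_pairs, true_matches):
    """Return (pairs completeness, pairs quality) for a set of candidate pairs."""
    candidate_pairs = set(candidate_pairs)
    true_matches = set(true_matches)
    found = candidate_pairs & true_matches

    # Guard against division by zero when either set is empty.
    completeness = len(found) / len(true_matches) if true_matches else 0.0
    quality = len(found) / len(candidate_pairs) if candidate_pairs else 0.0
    return completeness, quality


# Toy data (invented): three candidate pairs, two true matches, one overlap.
candidates = {('g0', 'n0'), ('g0', 'n1'), ('g1', 'n2')}
truth = {('g0', 'n0'), ('g2', 'n5')}
pc, pq = completeness_and_quality(candidates, truth)
print(pc, pq)
```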


In [12]:
# Evaluate blocking with publisher
# (_,_) = pairs_completeness_and_quality(ds1, ds2, block2, gt)

In [13]:
# Write blocked pairs to CSV
import csv

with open('Rui_Zhu_blocked.csv', 'wt', newline ='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(["goodreads.ID", "barnes_and_nobles.ID"])
    for key, id1, id2 in block.pairwise(ds1, ds2):
        writer.writerow((id1, id2))

### Task 1-3. Entity Linking

The cell below defines example functions for field (attribute) similarity, then combines them into a single weighted scoring function:

In [14]:
# Threshold above which a pair is treated as a match
MY_TRESH = 0.7  # example value only; tune it on the dev set

def title_similarity_1(r1, r2):
    s1 = r1.name_string
    s2 = r2.name_string
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def title_similarity_2(r1, r2):
    s1 = r1.name_string
    s2 = r2.name_string
    
    if s1 == s2:
        return 1
    
    return 0

def author_similarity_1(r1, r2):
    s1 = r1.author
    s2 = r2.author
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def author_similarity_2(r1, r2):
    s1 = r1.author
    s2 = r2.author
    
    if s1 == s2:
        return 1
    
    return 0

def publisher_similarity_1(r1, r2):
    s1 = r1.publisher
    s2 = r2.publisher
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def publisher_similarity_2(r1, r2):
    ''' Example dummy similarity function '''
    s1 = r1.publisher
    s2 = r2.publisher
    
    if s1 == s2:
        return 1
    
    return 0



# entity linkage scoring function
def rule_based_method(r1, r2, A, B, C, D):
    author_similar = author_similarity_1(r1, r2)
    author_exact = author_similarity_2(r1, r2)
    title_similar = title_similarity_1(r1, r2)
    title_exact = title_similarity_2(r1, r2)
    publisher_similar = publisher_similarity_1(r1, r2)
    publisher_exact = publisher_similarity_2(r1, r2)
    
    total = A * title_similar + B * title_exact + C * author_exact + D * publisher_exact
    
    # return two values: boolean if they match or not, float to determine confidence
    return total > MY_TRESH, total
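
Both record classes expose `name_tokens`, but the scoring function above does not use it. One option is a token-set Jaccard feature folded into the same kind of weighted sum. The sketch below is a plain-Python illustration under that assumption. The helpers `jaccard` and `weighted_score`, the weights, and the token sets are all hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_score(features, weights):
    """Linear combination of feature values; weights are keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())


# Invented token sets for two titles that differ by one word.
t1 = {'the', 'autobiography', 'of', 'malcolm', 'x'}
t2 = {'autobiography', 'of', 'malcolm', 'x'}

features = {'title_jaccard': jaccard(t1, t2), 'author_exact': 1}
weights = {'title_jaccard': 0.8, 'author_exact': 0.2}   # illustrative weights
score = weighted_score(features, weights)
print(features['title_jaccard'], score)
```

Keeping the weights non-negative and summing to 1 keeps the score in [0, 1] when every feature is in [0, 1], which makes a single threshold easier to interpret.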

In [15]:
count = 0

trial = rltk.Trial(gt)
candidate_pairs = rltk.get_record_pairs(ds1, ds2, ground_truth=gt)
for r1, r2 in candidate_pairs:
    count += 1
    result, confidence = rule_based_method(r1, r2, A=0.67, B=0.13, C=0.1, D=0.1)
    trial.add_result(r1, r2, result, confidence)

print('---------------')
print('Total pair compared: %s' % count)
trial.evaluate()
print('Trial statistics based on Ground-Truth from development set data:')
print(f'tp: {trial.true_positives:.06f} [{len(trial.true_positives_list)}]')
print(f'fp: {trial.false_positives:.06f} [{len(trial.false_positives_list)}]')
print(f'tn: {trial.true_negatives:.06f} [{len(trial.true_negatives_list)}]')
print(f'fn: {trial.false_negatives:.06f} [{len(trial.false_negatives_list)}]')

print('---------------')
print('F-score: %s' % trial.f_measure)

---------------
Total pair compared: 297
Trial statistics based on Ground-Truth from development set data:
tp: 0.895522 [60]
fp: 0.152174 [35]
tn: 0.847826 [195]
fn: 0.104478 [7]
---------------
F-score: 0.7407407407407407
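
The F-score above can be reproduced from the bracketed counts (tp = 60, fp = 35, fn = 7). The sketch below does that, and also shows one simple way to pick a threshold by sweeping candidate values over scored dev pairs. The helper names and the toy scores are hypothetical, and the sweep is only one possible tuning strategy.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = f1_from_counts(tp=60, fp=35, fn=7)
print(precision, recall, f1)


def best_threshold(scored, thresholds):
    """scored: iterable of (score, is_match). Return (threshold, F1) with the highest F1."""
    scored = list(scored)
    best = (None, -1.0)
    for t in thresholds:
        # A pair is predicted as a match when its score is strictly above t.
        tp = sum(1 for s, y in scored if s > t and y)
        fp = sum(1 for s, y in scored if s > t and not y)
        fn = sum(1 for s, y in scored if s <= t and y)
        candidate_f1 = f1_from_counts(tp, fp, fn)[2]
        if candidate_f1 > best[1]:
            best = (t, candidate_f1)
    return best


# Toy (score, label) pairs, invented for illustration.
toy = [(0.9, True), (0.8, True), (0.75, False), (0.6, True), (0.4, False)]
best_t, best_f1 = best_threshold(toy, [0.5, 0.7, 0.85])
print(best_t, best_f1)
```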


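In the trial results above, each decimal is a rate and each bracketed number is the underlying pair count (for example, tp = 60 / 67 ≈ 0.8955).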

### Save Test predictions
You will be evaluated on dev and test predictions, over a hidden ground truth.

In [16]:
test_set_file = dir_ + 'test.csv'
test = []
with open(test_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            test.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(test)} lines.')

Column names are: goodreads.ID, barnes_and_nobles.ID
Processed 90 lines.


In [17]:
predictions = []
for id1, id2 in test:
    r1 = ds1.get_record(id1)
    r2  = ds2.get_record(id2)
    result, confidence = rule_based_method(r1, r2, A=0.67, B=0.13, C=0.1, D=0.1)
    predictions.append((r1.id, r2.id, result, confidence))

In [18]:
len(predictions), len(ds1.generate_dataframe()), len(ds2.generate_dataframe())

(90, 3967, 3701)

In [19]:
with open(dir_ + 'Rui_Zhu_predictions.csv', mode='w') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["goodreads.ID", "barnes_and_nobles.ID", "prediction", "confidence"])
    for row in predictions:
        writer.writerow(row)

### Task 1.4 (2 pts) - Record Linkage

In [20]:
total = 0
count = 0

with open(dir_ + 'Rui_Zhu_valid_predictions.csv', mode='w') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["goodreads.ID", "barnes_and_nobles.ID"])

    block_pairs = list(rltk.get_record_pairs(ds1, ds2, block=block))
    for r1, r2 in block_pairs:
        total += 1
        result, confidence = rule_based_method(r1, r2, A=0.67, B=0.13, C=0.1, D=0.1)
        if result:
            count += 1
            #writer.writerow((r1.id, r2.id, confidence))
            writer.writerow((r1.id, r2.id))


print("Total: " + str(total))
print("valid: " + str(count))

Total: 2933
valid: 1147


# Task 2: Using RDFLib for Knowledge Representation

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a `Graph` is a Python collection of RDF subject, predicate, object triples.

This notebook introduces simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).
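
To make the idea of a graph as a collection of triples concrete, here is a toy plain-Python analogy. It does not use RDFLib: `TinyGraph` and the `ex:` identifiers are hypothetical. A real RDFLib `Graph` additionally distinguishes URIs from literals and handles namespaces and serialization.

```python
class TinyGraph:
    """Hypothetical toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, triple):
        # A set keeps each distinct triple only once.
        self._triples.add(triple)

    def triples(self, pattern):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in sorted(self._triples):
            if all(want is None or want == got for want, got in zip(pattern, triple)):
                yield triple

    def __len__(self):
        return len(self._triples)


g = TinyGraph()
g.add(('ex:book1', 'ex:title', 'An Example Title'))
g.add(('ex:book1', 'ex:language', 'English'))
g.add(('ex:book1', 'ex:language', 'English'))   # duplicate, stored once
g.add(('ex:book2', 'ex:title', 'Another Title'))

print(len(g))
print(list(g.triples(('ex:book1', None, None))))
```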

In [21]:
! pip install rdflib



In [33]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF

Let's define some namespaces:

In [64]:
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://dsci558.org/myfakenamespace#')

We can create a graph:

In [65]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)

In [66]:
import csv

final_pairs = []

with open("Rui_Zhu_valid_predictions.csv", "r", newline="") as f:
    reader = csv.reader(f)   # the file is comma-separated
    next(reader, None)       # skip the header row
    for row in reader:
        final_pairs.append(row)

In [26]:
final_pairs[:5]

[['3418', '1083'], ['2', '2541'], ['8', '1431'], ['11', '100'], ['13', '1152']]

In [67]:
# Merge Goodreads (ds1) and Barnes & Noble (ds2) attributes for each matched pair

for pair in final_pairs:
    id1 = pair[0]
    id2 = pair[1]

    r1 = ds1.get_record(id1)
    r2 = ds2.get_record(id2)

    title = r1.name_string
    description = r1.description
    isbn = r1.ISBN
    isbn13 = r1.ISBN13
    page_count= r1.page_count
    authors= r1.authors
    rating = r1.rating
    num_rating = r1.num_rating
    num_review = r1.num_review
    publisher = r1.publisher
    publish_date= r1.publish_date
    publish_format = r1.publish_format
    language= r1.language

    dim = r2.dimension
    sale_rank = r2.sales_rank
    rating_count = r2.rating_count
    paper_price = r2.paper_price
    hard_price = r2.hard_price
    nook_price = r2.nook_price
    audio_price =r2.audio_price
    rating_value = r2.rating_value

    node_uri = URIRef(MYNS[id1])
    my_kg.add((node_uri, RDF.type, MYNS['book']))
    my_kg.add((node_uri, MYNS['title'], Literal(title)))
    my_kg.add((node_uri, MYNS['description'], Literal(description)))
    my_kg.add((node_uri, MYNS['isbn_id'], Literal(isbn)))
    my_kg.add((node_uri, MYNS['isbn13_id'], Literal(isbn13)))
    my_kg.add((node_uri, MYNS['page_count'], Literal(page_count)))
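    # Add one literal per author instead of a single literal holding the whole list
    for author in authors:
        my_kg.add((node_uri, FOAF['Person'], Literal(author)))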
    my_kg.add((node_uri, MYNS['rate'], Literal(rating)))
    my_kg.add((node_uri, MYNS['rate_count'], Literal(num_rating)))
    # Reviews are a different count from ratings, so use a separate predicate
    my_kg.add((node_uri, MYNS['review_count'], Literal(num_review)))
    my_kg.add((node_uri, FOAF['Group'], Literal(publisher)))
    my_kg.add((node_uri, MYNS['publish_date'], Literal(publish_date)))
    my_kg.add((node_uri, MYNS['format'], Literal(publish_format)))
    my_kg.add((node_uri, MYNS['language'], Literal(language)))

    my_kg.add((node_uri, MYNS['dimension'], Literal(dim)))
    # Sales rank is not a rating, so keep it under its own predicate
    my_kg.add((node_uri, MYNS['sales_rank'], Literal(sale_rank)))
    my_kg.add((node_uri, MYNS['rate_count'], Literal(rating_count)))
    my_kg.add((node_uri, MYNS['price'], Literal(paper_price)))
    my_kg.add((node_uri, MYNS['price'], Literal(hard_price)))
    my_kg.add((node_uri, MYNS['price'], Literal(nook_price)))
    my_kg.add((node_uri, MYNS['price'], Literal(audio_price)))
    my_kg.add((node_uri, MYNS['rate'], Literal(rating_value)))



The loop above defines one URI per matched pair (`node_uri`, built from the Goodreads ID) and adds one triple per attribute, all describing that same subject.

And now let's dump our graph triples into a `ttl` file:

In [68]:
my_kg.serialize(dir_ + 'Rui_Zhu_model.ttl', format="turtle")

<Graph identifier=N1af025d3c1df4bdcbec840bbb5f7aa27 (<class 'rdflib.graph.Graph'>)>

In [69]:
!head Rui_Zhu_model.ttl

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix myns: <http://dsci558.org/myfakenamespace#> .

myns:100 a myns:book ;
    myns:description " " ;
    myns:dimension " " ;
    myns:format "Paperback" ;
    myns:isbn13_id "9780688109646" ;
    myns:isbn_id "0688109640" ;
    myns:language "English" ;
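
The serialized output above contains blank literals such as `myns:description " "`, and every value is stored as a plain string. One possible cleanup step, sketched here in plain Python without RDFLib, is to skip blank fields and convert numeric strings before creating literals. The helpers `clean_value` and `build_triples` and the sample field values are hypothetical. In RDFLib, passing a Python `int` or `float` to `Literal` yields a typed literal, so the converted values could be passed to it directly.

```python
def clean_value(raw, cast=str):
    """Return a typed value, or None when the field is blank or cannot be converted."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        return cast(text)
    except ValueError:
        return None


def build_triples(subject, fields):
    """fields maps predicate name -> (raw string, cast). Blank values are skipped."""
    triples = []
    for predicate, (raw, cast) in fields.items():
        value = clean_value(raw, cast)
        if value is not None:
            triples.append((subject, predicate, value))
    return triples


# Invented sample values for one book.
fields = {
    'description': (' ', str),        # blank, so no triple is produced
    'page_count': ('256', int),
    'rating': ('4.1', float),
    'language': ('English', str),
}
triples = build_triples('book/100', fields)
print(triples)
```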
