In [1]:
execution_mode = 'full'
strip_number_digits = True

# Goldstandard and Data Preparation

Swissbib's goldstandard is a set of pre-defined records used for testing Swissbib's implemented logic for identifying duplicate and unique records, see [[JudACaps](./A_References.ipynb#judacaps)]. The Swissbib project team has processed the goldstandard data into a data extract with the help of a Scala implementation [[ScalRepo](./A_References.ipynb#scala_repo)]. This chapter of the capstone project loads the records of the goldstandard and prepares them for transformation into the feature matrix for the machine learning models.

The goal of this chapter is reached once a DataFrame is available whose rows are pairs of goldstandard records, each pair being either a pair of duplicates or a pair of uniques. Each row has to be marked with a flag indicating whether it is a pair of duplicates or a pair of uniques. This flag will later be split off as the target for training and performance testing of the machine learning models. See [[JudACaps](./A_References.ipynb#judacaps)] for the basic explanation of the process.

## Table of Contents

- [Metadata Takeover](#Metadata-Takeover)
- [Description of Swissbib's Goldstandard](#Description-of-Swissbib's-Goldstandard)
    - [Sample Records of a Goldstandard Example](#Sample-Records-of-a-Goldstandard-Example)
    - [Generating Pairs of Duplicates](#Generating-Pairs-of-Duplicates)
    - [Generating Pairs of Uniques](#Generating-Pairs-of-Uniques)
- [Transform Attributes for Similarity Comparison](#Transform-Attributes-for-Similarity-Comparison)
    - [ismn](#ismn)
- [Build Pairs of Duplicates](#Build-Pairs-of-Duplicates)
    - [Slaves Masters Relationships](#Slaves-Masters-Relationships)
    - [Duplicates Rows Generation](#Duplicates-Rows-Generation)
- [Build Pairs of Uniques](#Build-Pairs-of-Uniques)
    - [Masters Uniques Relationships](#Masters-Uniques-Relationships)
    - [Uniqes Rows Generation](#Uniqes-Rows-Generation)
- [Build Feature Base](#Build-Feature-Base)
- [Summary](#Summary)
    - [Goldstandard DataFrame Handover](#Goldstandard-DataFrame-Handover)

## Metadata Takeover

The analysis of Swissbib's raw data has led to a decision on each attribute, whether to process it into the feature matrix. The decision has been consolidated into a dictionary of metadata as a result of chapter [Data Analysis](./1_DataAnalysis.ipynb). This metadata is needed for processing the goldstandard and must be loaded as a first step.

In [2]:
import os
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

In [3]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 



## Description of Swissbib's Goldstandard

Swissbib's goldstandard is a fixed set of pre-defined data that has been processed into three distinct .json files [[ScalRepo](./A_References.ipynb#scala_repo)].

- $\texttt{slave.json}$ - This data file holds the original duplicate records that are merged into master records.
- $\texttt{master.json}$ - This data file holds the unique master records that are formed out of their associated records in file $\texttt{slave.json}$.
- $\texttt{unique.json}$ - This data file holds records with data similar to that in $\texttt{slave.json}$ and $\texttt{master.json}$, but which are correctly identified as unique records and not as duplicates of any record in $\texttt{slave.json}$.

This section starts with loading the goldstandard data into a separate DataFrame for each set. Afterwards, the data will be analysed and explained.

In [4]:
import json

records_slave, records_master, records_unique = [], [], []
file_slave, file_master, file_unique = 'slave.json', 'master.json', 'unique.json'

# Each file holds one JSON record per line; the context manager closes the file
with open(os.path.join(path_goldstandard, file_slave), 'r') as f:
    for line in f:
        records_slave.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_master), 'r') as f:
    for line in f:
        records_master.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_unique), 'r') as f:
    for line in f:
        records_unique.append(json.loads(line))

print('Number of records in data file {:s}\t{:d}'.format(file_slave, len(records_slave)))
print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

Number of records in data file slave.json	435
Number of records in data file master.json	159
Number of records in data file unique.json	596


In [5]:
import pandas as pd

goldstandard = {}

goldstandard['slaves'] = pd.DataFrame(records_slave)
goldstandard['masters'] = pd.DataFrame(records_master)
goldstandard['uniques'] = pd.DataFrame(records_unique)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard['slaves'].columns)

goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': ['GraweChristia...","{'110': [], '710': [], '810': []}",2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam jun.],[Reclam jun.],,[],[],[],,[BK020000]
1,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'100': ['LevineJamesDir.'], '700': ['MozartWo...","{'110': [], '710': ['Metropolitan Opera Orches...",2000,2000,2000,2000,,[],"[1 DVD-Video, DVD Region 0, 169 Min., farb.]","[1 DVD-Video, DVD Region 0, 169 Min., farb.]",[Deutsche Grammophon],[Deutsche Grammophon],,[],[],[],,[VM010300]
2,001817272,"[(OCoLC)231772550, (ABN)000096920]",[3-495-47879-5],"{'245': ['Der moralische Status der Tiere', 'H...","{'245': ['Der moralische Status der Tiere', 'H...","{'100': ['FluryAndreas'], '700': [], '800': []...","{'110': [], '710': [], '810': []}",1999,1999,1999,1999,,[],[316 S.],[316 S.],[Alber],[Alber],,[],[],[],,[BK020000]
3,00236865X,"[(OCoLC)887157168, (ABN)000223912]",[],{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},"{'100': ['MozartWolfgang Amadeus'], '700': [],...","{'110': [], '710': [], '810': []}",uuuuuuuu,uuuu,uuuu,uuuuuuuu,,[],[412 S.],[412 S.],[Ernst Eulenburg],[Ernst Eulenburg],,[],[],[],,[BK020000]
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['StrangeJoanna...","{'110': [], '710': [], '810': []}",2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]


The three files of the goldstandard will be used for training and testing the performance of the machine learning models of the capstone project. Combining the records of file $\texttt{slave.json}$ into pairs will generate a set of duplicate training data, while combining the records of file $\texttt{slave.json}$ with file $\texttt{unique.json}$ will generate a set of unique pair records for the training data.

### Sample Records of a Goldstandard Example

Let's have a rough look at an example in the goldstandard data. All records of the three .json files with string 'Emma' in attribute $\texttt{ttlfull}$ are taken as the example.
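The selection by title can be sketched on toy data as follows. The helper name `title_contains` and the toy DataFrame are assumptions made for illustration; the sketch only assumes that $\texttt{ttlfull}$ is a dictionary with a list under key '245', as in the goldstandard records. Unlike a direct `x['245'][0]` access, it tolerates an empty title list.

```python
import pandas as pd

def title_contains(df, needle, field='245'):
    """Return rows whose first ttlfull[field] entry contains needle.

    Hypothetical helper: records with a missing or empty title list
    are treated as having an empty title instead of raising an error.
    """
    first_title = df['ttlfull'].apply(
        lambda x: x[field][0] if x.get(field) else ''
    )
    # Plain substring match, no regular expression
    return df[first_title.str.contains(needle, regex=False)]

# Toy records shaped like the goldstandard data
toy = pd.DataFrame({
    'docid': ['a', 'b', 'c'],
    'ttlfull': [
        {'245': ['Emma', 'Roman']},
        {'245': ['Die Zauberflöte']},
        {'245': []},
    ],
})

emma = title_contains(toy, 'Emma')
print(emma['docid'].tolist())
```

Only the first title entry is inspected, which matches the filter expression used in the cells of this subsection.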

In [6]:
goldstandard['slaves'][
    goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': ['GraweChristia...","{'110': [], '710': [], '810': []}",2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam jun.],[Reclam jun.],,[],[],[],,[BK020000]
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['StrangeJoanna...","{'110': [], '710': [], '810': []}",2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
17,017959411,"[(OCoLC)636062037, (BGR)000409509]","[978-1-4058-7953-8 (CD Pack), 978-1-4058-8214-...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...","{'110': [], '710': [], '810': []}",2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education Ltd.],[Pearson Education Ltd.],,[],[],[],,[BK020000]
19,020155182,"[(OCoLC)218626148, (BGR)000463276]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[600 S.],[600 S.],[Cambridge University Press],[Cambridge University Press],,[],[],[],,[BK020000]
29,022315098,"[(OCoLC)218626148, (IDSLU)000449481]",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[599 S.],[599 S.],[],[],,[],[],[],,[BK020000]
51,035554215,"[(OCoLC)218626148, (IDSSG)000338145]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[600 S.],[600 S.],[],[],,[],[],[],,[BK020000]
77,055836801,"[(OCoLC)495204467, (SGBN)001068279, (OCoLC)495...","[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...","{'110': [], '710': [], '810': []}",2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
109,103342699,"[(OCoLC)263554860, (IDSBB)001470548]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1979,1979,1979,1979,,[],[549 S.],[549 S.],[],[],,[],[],[],,[BK020000]
130,117574562,"[(OCoLC)218626148, (IDSBB)003781869]",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[600 S.],[600 S.],[],[],,[],[],[],,[BK020000]
160,161169244,"[(OCoLC)218626148, (NEBIS)004930649]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[600 S.],[600 S.],[Cambridge University Press],[Cambridge University Press],,[],[],[],,[BK020000]


In [7]:
print('Number of records in slave.json with string \'Emma\' :', len(
    goldstandard['slaves'][
        goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in slave.json with string 'Emma' : 18


A total of 18 records in $\texttt{slave.json}$ contain the string 'Emma' in attribute $\texttt{ttlfull}$. Some of these records are duplicates of each other. File $\texttt{master.json}$ reveals how many distinct records remain out of these 18 after deduplication.

In [8]:
goldstandard['masters'][
    goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
85,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam],[Reclam],,[],[],[],,[BK020000]
86,504389807,"[(NEBIS)008647887, (VAUD)991001434509702852, (...",[0-375-75742-2],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2001,2001,2001,2001,,[],[359 S.],[359 S.],[Modern Library],[Modern Library],,[],[],[],,[BK020000]
87,504389815,"[(IDSLU)000449481, (IDSBB)003781869, (NEBIS)00...",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': ['Au...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[599 S.],[599 S.],[],[],,[],[],[],,[BK020000]
88,504389823,"[(SGBN)001068279, (ABN)000548154, (BGR)0004095...","[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...","{'110': [], '710': [], '810': []}",2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
89,504389831,"[(IDSBB)001470548, (SGBN)001344510]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1979,1979,1979,1979,,[],[549 S.],[549 S.],[],[],,[],[],[],,[BK020000]


In [9]:
print('Number of records in master.json with string \'Emma\' :', len(
    goldstandard['masters'][
        goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in master.json with string 'Emma' : 5


The number of deduplicated records of file $\texttt{slave.json}$ containing string 'Emma' is 5. This is the number of unique records that are built in Swissbib's deduplication step out of the original 18 records above.

There may be some minor differences in some fields of the deduplicated master file records compared to their original data in the duplicated original slave file records. Nevertheless, it is possible to identify the slave file records as duplicates of their associated master file records, looking at the details of the values of each attribute.

In a final step, let's have a look at the 'Emma' string data of $\texttt{unique.json}$.

In [10]:
goldstandard['uniques'][
    goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000143235,"[(OCoLC)362722306, (ABN)000551177]",[978-3-7466-6120-9],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2009,2009,2009,2009,,[6120],[575 S.],[575 S.],[Aufbau Taschenbuch],[Aufbau Taschenbuch],,[],[],[],,[BK020000]
4,002410559,"[(OCoLC)777853583, (ABN)000243260]",[0-19-282756-1],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1990,1990,1990,1990,,[],[445 p.],[445 p.],[Oxford University Press],[Oxford University Press],,[],[],[],,[BK020000]
8,004130235,"[(OCoLC)887396789, (ABN)000628911]",[978-0-307-38684-7],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2007,2007,2007,2007,,[],[495 S.],[495 S.],[Vintage Books],[Vintage Books],,[],[],[],,[BK020000]
13,017204097,"[(OCoLC)759250730, (BGR)000090407]",[3-596-22191-9],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1996,1996,1996,1996,[87.-88. Tsd.],[2191],[414 S.],[414 S.],[Fischer-Taschenbuch-Verlag],[Fischer-Taschenbuch-Verlag],,[],[],[],,[BK020000]
16,017738490,"[(OCoLC)76214484, (BGR)000281870]",[3-7466-5105-0],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2001,2001,2001,2001,3. Aufl,[5105],[553 S.],[553 S.],[Aufbau Taschenbuch-Verlag],[Aufbau Taschenbuch-Verlag],,[],[],[],,[BK020000]
20,01843486X,"[(OCoLC)759471685, (BGR)000212474]",[0-582-41794-5],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...","{'110': [], '710': [], '810': []}",2000,2000,2000,2000,"[new ed., 2nd impression]",[],[59 S.],[59 S.],[Pearson Education Ltd],[Pearson Education Ltd],,[],[],[],,[BK020000]
26,021868816,"[(OCoLC)610677900, (IDSLU)000093923]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1996,1996,1996,1996,,[[1]],[560 S.],[560 S.],[],[],,[],[],[],,[BK020000]
42,038872498,"[(OCoLC)614721683, (SBT)000133903]",[0-14-043010-5],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BlytheRonald'...","{'110': [], '710': [], '810': []}",1983,1983,1983,1983,,[],[471 p.],[471 p.],[Penguin Books],[Penguin Books],,[],[],[],,[BK020000]
53,043085075,"[(OCoLC)882111340, (SBT)000463076]",[],"{'245': ['Emma', 'mobbing']}","{'245': ['Emma', 'mobbing']}","{'100': [], '700': [], '800': [], '245c': ['Ca...","{'110': [], '710': ['Caritas (Ticino)'], '810'...",2002,2002,2002,2002,,[],[1 Videocassetta],[1 Videocassetta],[Caritas Ticino],[Caritas Ticino],,[],[],[],,[VM010200]
62,051168871,"[(OCoLC)759052846, (SGBN)001050098]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1953,1953,1953,1953,,[],[377 S.],[377 S.],[Collins],[Collins],,[],[],[],,[BK020000]


In [11]:
print('Number of records in unique.json with string \'Emma\' :', len(
    goldstandard['uniques'][
        goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in unique.json with string 'Emma' : 44


The number of records in file $\texttt{unique.json}$ with string 'Emma' is 44. The data of these records reveals that none of them can be associated as a duplicate of a record in $\texttt{master.json}$. These unique records are excellent training data for deduplication, as they can be combined into similar but non-duplicate, thus unique, record pairs.

The goal of generating training and test data with the help of the goldstandard is to generate two categories of training data.

- The first category consists of records of pairs of the original records out of file $\texttt{slave.json}$ that have a label 'duplicates'. This will be accomplished with the help of the records of file $\texttt{master.json}$, see subsection [Generating Pairs of Duplicates](#Generating-Pairs-of-Duplicates).
- The second category consists of records of pairs of the original records out of files $\texttt{slave.json}$ and $\texttt{unique.json}$ that have a label 'uniques', see subsection [Generating Pairs of Uniques](#Generating-Pairs-of-Uniques).

The number of possible duplicate pairs is given by the formula

$
\begin{align}
Tot_{duplicate\ pairs} = \sum_{i=1}^{M} \frac{1}{2}S_i \cdot \left(S_i-1\right)
\end{align}
$

where $M$ is the number of records in file $\texttt{master.json}$ and $S_i$ is the number of records in file $\texttt{slave.json}$ that are associated with master record $m_i$. As the number of slave records must be counted for each master record and as this number can only be derived from the goldstandard data, the number of expected duplicate pair records cannot be calculated in advance.
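The formula can be checked on a small example. The grouping below is hypothetical toy data, not taken from the goldstandard; it maps each master record to its associated slave records and compares the closed-form count with an explicit enumeration of the pairs.

```python
from itertools import combinations

# Hypothetical toy grouping: master docid -> docids of its slave records
slaves_of_master = {
    'M1': ['S1', 'S2', 'S3'],
    'M2': ['S4'],
    'M3': ['S5', 'S6'],
}

def total_duplicate_pairs(groups):
    """Sum S_i * (S_i - 1) / 2 over all master records."""
    return sum(len(s) * (len(s) - 1) // 2 for s in groups.values())

# Explicit enumeration of all unordered slave pairs per master record
duplicate_pairs = [
    pair
    for slaves in slaves_of_master.values()
    for pair in combinations(slaves, 2)
]

print(total_duplicate_pairs(slaves_of_master), len(duplicate_pairs))
```

A master record with a single slave record contributes no pair, which is why the total depends on the distribution of slave records over master records and not only on their overall number.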

### Generating Pairs of Duplicates

The central attribute for retrieving duplicates in the goldstandard data is attribute $\texttt{035liste}$, see chapter [Data Analysis](./1_DataAnalysis.ipynb). It helps to identify the associated master record for a given record of file $\texttt{slave.json}$. The process implemented below parses the list of identifiers in attribute $\texttt{035liste}$ of each record in file $\texttt{slave.json}$ and searches the value of the identifier in the attribute $\texttt{035liste}$ of all records of file $\texttt{master.json}$. When the identifier is found in file $\texttt{master.json}$, the master record's attribute $\texttt{docid}$ is stored as a new column in the slave DataFrame, see figure [Slave/master relationship](#slave_master_relationship). The new attribute in the slave record has the meaning of a foreign key to the related master record. This process is repeated for each list element of one slave record and for each record in the data file $\texttt{slave.json}$.

<center>
    <b>Figure</b><a id='slave_master_relationship'></a> Slave/master relationship.
    <img src="./documentation/training_data.png" style="width: 600px;"/>
</center>

As will be shown below, the relationship of a slave record to its master record is unique. Even if there is more than one entry in the list of attribute $\texttt{035liste}$ of a slave record it will be shown that all distinct entries of a $\texttt{035liste}$ attribute list of one slave record point to one and the same master record.

Attribute $\texttt{035liste}$ will be used neither in training nor in performance testing of the models. The column will be removed before model training.
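The lookup described in this subsection can be sketched on toy data. This is an illustrative variant, not the implementation of this chapter: it builds an inverted index from identifier to master $\texttt{docid}$, which assumes that each identifier occurs in at most one master record. All record values below are made up.

```python
import pandas as pd

# Toy master and slave records (hypothetical identifiers)
masters = pd.DataFrame({
    'docid': ['M1', 'M2'],
    '035liste': [['(ABN)1', '(NEBIS)7'], ['(ABN)2', '(RERO)9']],
})
slaves = pd.DataFrame({
    'docid': ['S1', 'S2', 'S3'],
    '035liste': [['(OCoLC)5', '(ABN)1'], ['(NEBIS)7', '(ABN)1'], ['(RERO)9']],
})

# Inverted index: identifier -> docid of the master record holding it
id_to_master = {
    ident: docid
    for docid, idents in zip(masters['docid'], masters['035liste'])
    for ident in idents
}

# Foreign key list per slave record, one entry per matching identifier
slaves['masters_docid'] = [
    [id_to_master[ident] for ident in idents if ident in id_to_master]
    for idents in slaves['035liste']
]

# A slave/master relationship is unique if all entries name the same master
is_unique = slaves['masters_docid'].apply(lambda ids: len(set(ids)) == 1)
print(slaves[['docid', 'masters_docid']])
print(is_unique.all())
```

Identifiers without a counterpart in any master record, such as the OCoLC entry of the first toy slave, are simply skipped.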

### Generating Pairs of Uniques

File $\texttt{unique.json}$ holds data of unique records. For training and testing purposes, pairs of records could be generated exclusively with the records out of this file. This would generate record pairs with label 'uniques' which can be clearly distinguished and therefore clearly be recognized as non-duplicate pairs. A more interesting pairing can be generated with records out of $\texttt{slave.json}$ and records out of $\texttt{unique.json}$. Pairs of records that mix both sources may reveal similarities in their attributes that make these pairs more difficult to classify clearly as uniques.

A mixture of both pairing sources will be implemented for the generation of training and testing data. Pairing all available unique records with all available slave records is done with a Cartesian product, which yields the following number of pairs.

In [12]:
print('Total expected number of pairs of uniques for training and testing : {:,d}'.format(
    len(goldstandard['uniques'])*len(goldstandard['slaves'])))

Total expected number of pairs of uniques for training and testing : 259,260


Of course, this estimate only holds if the slave records do not have any duplicates among the unique records. Whether this is the case in the goldstandard data will be investigated below.
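A Cartesian product of two DataFrames can be sketched with a cross merge, which is available in pandas 1.2 and later. The toy records and the label column name `duplicates` are assumptions made for illustration and are not necessarily the names used later in this chapter.

```python
import pandas as pd

# Toy records standing in for slave.json and unique.json
slaves = pd.DataFrame({'docid': ['S1', 'S2'], 'isbn': ['1', '2']})
uniques = pd.DataFrame({'docid': ['U1', 'U2', 'U3'], 'isbn': ['7', '8', '9']})

# Every slave record is paired with every unique record;
# overlapping column names receive the suffixes _x and _y
pairs = slaves.merge(uniques, how='cross')

# Hypothetical label column: 0 marks a pair of uniques
pairs['duplicates'] = 0

print(len(pairs))
print(pairs.columns.tolist())
```

The number of rows equals the product of the two record counts, in the same way as the expected number of pairs of uniques is calculated above.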

## Transform Attributes for Similarity Comparison

Before pairing the records into a DataFrame that will be used for the feature matrix, the attributes of the goldstandard data are processed into the form needed for generating the similarity matrix. This step could be done after the pairing step. Doing it before building pairs avoids transforming the same attributes several times and limits the amount of transformation to the raw data of the goldstandard instead of the multiplied data of the paired records.

As the data preparation of the attributes will be reused in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), it has been encapsulated into a separate function of library [data_preparation_funcs.py](./data_preparation_funcs.py), which handles the processing of all relevant attributes. This function will be called below.
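The saving from transforming before pairing can be estimated with the record counts reported above. The sketch counts only the pairs of uniques and assumes, as a simplification, that transforming after pairing would touch both records of every pair.

```python
# Record counts of the three goldstandard files as printed above
n_slaves, n_masters, n_uniques = 435, 159, 596

# Transforming before pairing: each raw record is processed once
records_before = n_slaves + n_masters + n_uniques

# Transforming after pairing (simplified): both sides of every
# slave/unique pair would have to be processed
n_unique_pairs = n_slaves * n_uniques
records_after = 2 * n_unique_pairs

print(records_before, records_after)
```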

In [13]:
import data_preparation_funcs as dpf

for i in ['slaves', 'masters', 'uniques']:
    goldstandard[i] = dpf.attribute_preprocessing(
        goldstandard[i],
        columns_metadata_dict['data_analysis_columns'], strip_number_digits)

### ismn

In [14]:
for i in ['slaves', 'masters', 'uniques']:
    display(goldstandard[i].ismn[goldstandard[i].ismn.apply(lambda x : len(x)>0)])

31                                  m006450510
132                                 m006546756
164                                 m006546749
171                                 m006450510
191    m006546756 (kritischer bericht, leinen)
225                                 m006450510
228                                 m006204687
418                              9790006450510
419                                 m200205343
420                                 m006204687
429    m006546756 (kritischer bericht, leinen)
Name: ismn, dtype: object

35                                  m200205343
37                                  m006450510
123    m006546756 (kritischer bericht, leinen)
149                                 m006204687
150                                 m006450510
157                                 m006546749
Name: ismn, dtype: object

34        m700241001
151    9790200207781
193    9790006450510
282       m006450510
340       m008060205
342       m006450510
388    9790006201334
498    9790006201334
589    9790006450510
Name: ismn, dtype: object

A vanishingly small number of rows with real ismn identifiers had been found in the data of chapter [Data Analysis](./1_DataAnalysis.ipynb). In the goldstandard data, the number of real ismn identifiers found is fair. As there will be training data with filled attribute $\texttt{ismn}$, the attribute will be kept for the feature matrix.

In [15]:
#columns_metadata_dict['data_analysis_columns'].remove('ismn')
#columns_metadata_dict['columns_to_use'].remove('ismn_x')
#columns_metadata_dict['columns_to_use'].remove('ismn_y')

## Build Pairs of Duplicates

As a first step, the relationship between slaves records and masters records is investigated. Thereafter, the records of pairs of duplicates will be generated.

In [16]:
for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['masters'][goldstandard['masters'].docid == goldstandard['slaves'].docid.loc[i]]) > 0 :
        print('STOP!')
        break

The silence of the code cell above shows that no record in file $\texttt{slave.json}$ has the same $\texttt{docid}$ as any record in file $\texttt{master.json}$. Attribute $\texttt{docid}$ therefore cannot be used for identifying pairs of duplicates.
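The same check can be expressed without an explicit loop. The following is a minimal sketch on a few $\texttt{docid}$ values taken from the sample output above; the variable names are chosen for illustration.

```python
import pandas as pd

# A few docid values from the sample records shown earlier
slaves = pd.DataFrame({'docid': ['000311049', '00130724X', '001817272']})
masters = pd.DataFrame({'docid': ['504389793', '504389807']})

# True for each slave record whose docid also occurs among the masters
shared = slaves['docid'].isin(masters['docid'])
has_overlap = bool(shared.any())

print(has_overlap)
```

On the full goldstandard DataFrames, a result of `False` would correspond to the silent loop above.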

### Slaves Masters Relationships

To understand how to build pairs of duplicates, let's have a look at some sample data.

In [17]:
for i in ['slaves', 'masters', 'uniques']:
    print(f'\n{i}')
    print(goldstandard[i]['035liste'].sample(n=5))


slaves
276    [(OCoLC)731635279, (LIBIB)000315536]
9        [(OCoLC)780137741, (ABN)000669941]
4        [(OCoLC)887324690, (ABN)000548154]
330    [(OCoLC)884694486, (IDSBB)006000910]
262     [(SNL)vtls001719167, (Sz)001719167]
Name: 035liste, dtype: object

masters
81    [(IDSBB)006726594, (NEBIS)011045420, (OCoLC)10...
69    [(RERO)1245069, (VAUD)991018231439702852, (RNV...
74    [(IDSBB)006503313, (IDSBB)006491015, (SNL)vtls...
71                 [(RERO)R006172599, (RERO)R275502460]
49                 [(RERO)R006132598, (RERO)R275508860]
Name: 035liste, dtype: object

uniques
394                      [(RERO)R008249818]
203    [(OCoLC)611244100, (NEBIS)009374474]
279                      [(RERO)R003945582]
252                      [(RERO)R278018360]
141    [(OCoLC)884457936, (IDSBB)004819890]
Name: 035liste, dtype: object


In [18]:
for i in range(len(goldstandard['slaves'])):
    if i < 4:
        for j in range(len(goldstandard['slaves']['035liste'][i])):
            for k in range(len(goldstandard['masters'])):
                if goldstandard['slaves']['035liste'][i][j] in goldstandard['masters']['035liste'][k]:
                    print('slaves record', i, 'holds in 035liste position', j, 'a reference to masters record', k)
print('etc.')

slaves record 0 holds in 035liste position 1 a reference to masters record 85
slaves record 1 holds in 035liste position 1 a reference to masters record 145
slaves record 2 holds in 035liste position 1 a reference to masters record 119
slaves record 3 holds in 035liste position 1 a reference to masters record 57
etc.


The code above shows a relationship between records out of $\texttt{slave.json}$ and records out of $\texttt{master.json}$. This relationship is established with the entries in attribute $\texttt{035liste}$. The next step is to look at record pairs that have this relationship.

In [19]:
sample_pair_1 = [0, 85]
sample_pair_2 = [1, 145]
sample_pair_3 = [2, 119]
sample_pair_4 = [3, 57]

print('Example pair 1')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_1[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_1[1]])
print('\nExample pair 2')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_2[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_2[1]])
print('\nExample pair 3')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_3[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_3[1]])
print('\nExample pair 4')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_4[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_4[1]])

Example pair 1
slaves record ['(OCoLC)731635279', '(ABN)000539983'] 
masters record ['(NEBIS)009587153', '(LIBIB)000315536', '(ABN)000539983']

Example pair 2
slaves record ['(OCoLC)808324878', '(ABN)000155059'] 
masters record ['(IDSBB)003690925', '(NEBIS)005645758', '(RERO)R003034172', '(ABN)000155059']

Example pair 3
slaves record ['(OCoLC)231772550', '(ABN)000096920'] 
masters record ['(IDSBB)001651899', '(NEBIS)001974886', '(IDSSG)000177847', '(SGBN)000249023', '(SNL)vtls001203051', '(IDSLU)000164505', '(RERO)R243599160', '(VAUD)991016630449702852', '(ABN)000096920', '(Sz)001203051', '(RNV)002435991-41bculausa_network', '(RERO)002435991', 'R243599160', '(EXLNZ-41BCULAUSA_NETWORK)991020678209702851']

Example pair 4
slaves record ['(OCoLC)887157168', '(ABN)000223912'] 
masters record ['(IDSBB)004654810', '(ABN)000223912']


In [20]:
for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(goldstandard['slaves'].loc[i[0]], '\n')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid                                                     000311049
035liste                         [(OCoLC)731635279, (ABN)000539983]
isbn                                            [978-3-15-020008-7]
ttlpart                                  {'245': ['Emma', 'Roman']}
pubyear                                                    2009    
decade                                                         2009
century                                                        2009
exactDate                                                  2009uuuu
edition                                                            
part                                                          20008
pages                                                      [600 S.]
volumes                                                         600
pubinit                                                 reclam jun.
pubword                                               [Reclam ju

Record pairs with a relationship are pairs of duplicates of a bibliographical unit, in accordance with the description in the first part of this chapter. To identify explicitly which slave record belongs to which master record, attribute $\texttt{035liste}$ will be used. The following function implements an algorithm that, for each $\texttt{035liste}$ value of a slave record, searches for the same value in $\texttt{035liste}$ of the master records. If the value is found in one of the master records, the $\texttt{docid}$ of this master is stored in a new attribute $\texttt{masters}\_\texttt{docid}$ of the slave record. As there can be more than one element in attribute $\texttt{035liste}$, there may be several master relationships on a slave record. Therefore, attribute $\texttt{masters}\_\texttt{docid}$ is a list.

In [21]:
def add_master_docid_to_slave(df_s, df_m):
    """Determine docid of master and store on slave."""
    # Foreign key lists, one list of master docids per slave record
    masters_docid = []

    # Search for master of slave
    for i in range(len(df_s)):
        loc_li = []
        for identifier in df_s['035liste'].loc[i]:
            master_index = df_m[df_m['035liste'].str.contains(
                identifier, regex=False
            )].index
            if len(master_index) > 0:  # Skip identifiers without a master
                # Keep the docid of the first matching master
                loc_li.append(df_m.docid[master_index].values[0])
        masters_docid.append(loc_li)

    # Assign the whole column at once to avoid chained assignment
    df_s['masters_docid'] = pd.Series(masters_docid, index=df_s.index)

    return df_s
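
The lookup of the function above can also be expressed without explicit loops. The following is a minimal sketch on invented toy data, not the implementation used in this chapter. The frames `slaves_toy` and `masters_toy` and all identifier values are hypothetical, and the sketch assumes that $\texttt{035liste}$ holds a list of identifier strings per record.

```python
import pandas as pd

# Hypothetical toy stand-ins for the slave and master tables.
slaves_toy = pd.DataFrame({
    'docid': ['s1', 's2', 's3'],
    '035liste': [['(A)1', '(B)2'], ['(B)3'], ['(C)9']],
})
masters_toy = pd.DataFrame({
    'docid': ['m1', 'm2'],
    '035liste': [['(A)1', '(B)2', '(D)4'], ['(B)3']],
})

# One row per single identifier on both sides.
s_long = slaves_toy.explode('035liste')
m_long = masters_toy.explode('035liste').rename(columns={'docid': 'masters_docid'})

# Equal identifiers link a slave record to a master record.
links = s_long.merge(m_long, on='035liste', how='inner')

# Collect the master docids per slave; repeated docids are kept here.
masters_per_slave = links.groupby('docid')['masters_docid'].apply(list)
print(masters_per_slave)
```

Two differences to the function above are worth noting: the join keeps every matching master per identifier instead of only the first one, and slave records without any match (here `s3`) are missing from the result instead of holding an empty list.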

In [22]:
goldstandard['slaves'] = add_master_docid_to_slave(goldstandard['slaves'], goldstandard['masters'])

goldstandard['slaves'].masters_docid.sample(n=10)

53                                           [50439018X]
96                                           [504389017]
427                                          [264853539]
26                                           [504389246]
397    [504388908, 504388908, 504388908, 504388908, 5...
223                                          [504389777]
71                                           [504390082]
1                                            [504390597]
418                                          [504390813]
310                               [504390120, 504390120]
Name: masters_docid, dtype: object

The same samples as above now show a new attribute $\texttt{masters}\_\texttt{docid}$ in the slave record, which holds the $\texttt{docid}$ value of its related master record. This is true for each pair of records, see below.

In [23]:
for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(goldstandard['slaves'].loc[i[0]], '\n')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid                                                     000311049
035liste                         [(OCoLC)731635279, (ABN)000539983]
isbn                                            [978-3-15-020008-7]
ttlpart                                  {'245': ['Emma', 'Roman']}
pubyear                                                    2009    
decade                                                         2009
century                                                        2009
exactDate                                                  2009uuuu
edition                                                            
part                                                          20008
pages                                                      [600 S.]
volumes                                                         600
pubinit                                                 reclam jun.
pubword                                               [Reclam ju

In [24]:
goldstandard['slaves']['masters_docid'].sample(n=10)

153                          [504389467]
52                           [504390694]
77     [504389823, 504389823, 504389823]
64                           [504388967]
135                          [504390694]
209                          [504389483]
246                          [504388940]
74                           [504390694]
224                          [504389076]
117                          [504390368]
Name: masters_docid, dtype: object

As the records in the master file are unique by themselves, each slave record must reference exactly one master record in order to decide unambiguously whether two slave records are duplicates of each other. Some of the samples shown above hold more than one entry in $\texttt{masters}\_\texttt{docid}$. The visual check reveals, however, that these entries are identical values. As a next step, all identical values in attribute $\texttt{masters}\_\texttt{docid}$ are reduced to one single entry, which leads to a list of distinct values.

In [25]:
# Proof that all docid_masters are unique...
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : set(x))
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : list(x))

for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['slaves'].masters_docid.loc[i]) != 1 :
        print('STOP!', i)
        break

The silence of the code cell above demonstrates that after reducing the list of $\texttt{masters}\_\texttt{docid}$ to distinct entries, all slave records hold exactly one key reference to a master. This is the basis for joining the data.
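
The same check can be written without a loop. The sketch below is only an illustration: it uses two key lists taken from the sample output above, and the names `keys`, `distinct` and `offenders` are hypothetical. It relies on the pandas string accessor `.str.len()`, which also returns the length of list elements.

```python
import pandas as pd

# Two key lists as seen in the sample output above.
keys = pd.Series(
    [['504389823', '504389823', '504389823'], ['504388967']],
    name='masters_docid',
)

# Reduce each list to its distinct values.
distinct = keys.apply(lambda values: sorted(set(values)))

# Every slave record must reference exactly one master record.
offenders = distinct[distinct.str.len() != 1]
print('Records without exactly one master:', len(offenders))
```

An empty `offenders` Series corresponds to the silence of the loop above, with the advantage that all violating records would be listed at once.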

The list attribute $\texttt{masters}\_\texttt{docid}$ now holds exactly one single element. To remove the list data type and extract its contents to a single string value, the following line of code is executed.

In [26]:
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : x[0])
goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlpart,pubyear,decade,century,exactDate,edition,part,pages,...,corporate_110,corporate_710,corporate_full,format_prefix,format_postfix,person_100,person_700,person_245c,ttlfull_245,ttlfull_246,masters_docid
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],...,,,,bk,20000,austenjane,"grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,"emma, roman",,504389793
1,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",...,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",vm,10300,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen",,504390597
2,001817272,"[(OCoLC)231772550, (ABN)000096920]",[3-495-47879-5],"{'245': ['Der moralische Status der Tiere', 'H...",1999,1999,1999,1999uuuu,,,[316 S.],...,,,,bk,20000,fluryandreas,,andreas flury,"der moralische status der tiere, henry salt, p...",,50439018X
3,00236865X,"[(OCoLC)887157168, (ABN)000223912]",[],{'245': ['Die Zauberflöte']},uuuuuuuu,uuuu,uuuu,uuuuuuuu,,,[412 S.],...,,,,bk,20000,mozartwolfgang amadeus,,,die zauberflöte,,504389513
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},2008,2008,2008,2008uuuu,,,[64 S.],...,,,,bk,20000,austenjane,strangejoanna,jane austen ; retold by joanna strange,emma,,504389823


Joining all slave records with their master records must result in a number of pairs that is equal to the number of slave records loaded at the beginning of this chapter, because each slave record references exactly one master record.

In [27]:
slave_master = pd.merge(left=goldstandard['slaves'], right=goldstandard['masters'], how='inner',
                  left_on='masters_docid', right_on='docid')

print('Number of paired slave/master records', len(slave_master))

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(slave_master.columns)

slave_master.head()

Number of paired slave/master records 435


Unnamed: 0,docid_x,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,masters_docid,docid_y,035liste_y,isbn_y,ttlpart_y,pubyear_y,decade_y,century_y,exactDate_y,edition_y,part_y,pages_y,volumes_y,pubinit_y,pubword_y,scale_y,coordinate_y,doi_y,ismn_y,musicid_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam jun.,[Reclam jun.],,[],,,,,,,,,bk,20000,austenjane,"grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,"emma, roman",,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam,[Reclam],,[],,,,,,,,,bk,20000,austenjane,,jane austen ; aus dem engl. übers. von ursula ...,emma,
1,196506476,"[(OCoLC)731635279, (NEBIS)009587153]",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam,[Reclam],,[],,,,,,,,,bk,20000,austenjane,,jane austen ; aus dem engl. übers. von ursula ...,emma,,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam,[Reclam],,[],,,,,,,,,bk,20000,austenjane,,jane austen ; aus dem engl. übers. von ursula ...,emma,
2,323173349,"[(OCoLC)731635279, (LIBIB)000315536]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam,[Reclam],,[],,,,,,,,,bk,20000,austenjane,,jane austen,"emma, roman",,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600,reclam,[Reclam],,[],,,,,,,,,bk,20000,austenjane,,jane austen ; aus dem engl. übers. von ursula ...,emma,
3,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",1 0 169,deutsche grammophon,[Deutsche Grammophon],,[],,,,,,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",vm,10300,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen",,504390597,504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 169,,[],,[],,,73.0,,,,"metropolitan operaorchestra, metropolitan oper...","metropolitan operaorchestra, metropolitan oper...",vm,10300,mozartwolfgang amadeus,"schikanederemanuel, hockneydavid, levinejames",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",
4,116188030,"[(OCoLC)884447694, (IDSBB)003690925]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 169,,[],,[],,,73.0,,,,"metropolitan operaorchestra, metropolitan oper...","metropolitan operaorchestra, metropolitan oper...",vm,10300,mozartwolfgang amadeus,"schikanederemanuel, hockneydavid, levinejames",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",,504390597,504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 169,,[],,[],,,73.0,,,,"metropolitan operaorchestra, metropolitan oper...","metropolitan operaorchestra, metropolitan oper...",vm,10300,mozartwolfgang amadeus,"schikanederemanuel, hockneydavid, levinejames",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",


The examples above reveal that the slave/master record pairs are indeed duplicates.
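
The expectation on the row count of the slave/master join can be made explicit in code. The following sketch uses invented toy frames (`slaves_toy`, `masters_toy`) and the `validate` parameter of `pd.merge`, which raises an error if the key on the master side is not unique.

```python
import pandas as pd

# Hypothetical toy data: three slave records referencing two master records.
slaves_toy = pd.DataFrame({
    'docid': ['s1', 's2', 's3'],
    'masters_docid': ['m1', 'm1', 'm2'],
})
masters_toy = pd.DataFrame({
    'docid': ['m1', 'm2'],
    'ttlfull_245': ['emma', 'die zauberflöte'],
})

# Many slave records may point to one master record, never the other way round.
pairs = pd.merge(
    slaves_toy, masters_toy, how='inner',
    left_on='masters_docid', right_on='docid',
    validate='many_to_one',
)

# Each slave record finds exactly one master record.
assert len(pairs) == len(slaves_toy)
print(pairs[['docid_x', 'masters_docid', 'docid_y']])
```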

In [28]:
pd.options.display.max_rows = len(slave_master.columns)

for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(slave_master.loc[i[0]], '\n')
    print('... and its master record ...')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid_x                                                     000311049
035liste_x                         [(OCoLC)731635279, (ABN)000539983]
isbn_x                                            [978-3-15-020008-7]
ttlpart_x                                  {'245': ['Emma', 'Roman']}
pubyear_x                                                    2009    
decade_x                                                         2009
century_x                                                        2009
exactDate_x                                                  2009uuuu
edition_x                                                            
part_x                                                          20008
pages_x                                                      [600 S.]
volumes_x                                                         600
pubinit_x                                                 reclam jun.
pubword_x                             

### Duplicates Rows Generation

The rows with pairs of duplicates can now be built by joining the slave records with themselves, on the condition of an identical $\texttt{masters}\_\texttt{docid}$.

In [29]:
def build_duplicate_pairs (df):
    """Builds-up all duplicate pairs, even with itself."""
    
    return pd.merge(left=df, right=df, how='inner',
                    left_on='masters_docid', right_on='masters_docid')

The records of duplicates are marked as duplicates immediately after pairing.

In [30]:
duplicates = build_duplicate_pairs(goldstandard['slaves'])
duplicates['duplicates'] = 1

print('Number of duplicate rows {:,d}'.format(len(duplicates)))

Number of duplicate rows 1,473
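
The number of rows produced by such a self-join follows from the group sizes: a master with $n$ slave records contributes $n^2$ rows, including the $n$ pairs of a record with itself and both orders of each pair. The sketch below illustrates this on hypothetical toy data; the frame `slaves_toy` and its values are invented.

```python
import pandas as pd

# Hypothetical toy data: master m1 has three slave records, m2 has two.
slaves_toy = pd.DataFrame({
    'docid': ['a', 'b', 'c', 'd', 'e'],
    'masters_docid': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

# Self-join on the master key, as in build_duplicate_pairs.
pairs = pd.merge(slaves_toy, slaves_toy, how='inner', on='masters_docid')

# Expected row count: sum of squared group sizes, here 3*3 + 2*2.
group_sizes = slaves_toy.groupby('masters_docid').size()
expected_rows = int((group_sizes ** 2).sum())

# Pairs of a record with itself are part of the result.
self_pairs = int((pairs['docid_x'] == pairs['docid_y']).sum())

print(len(pairs), expected_rows, self_pairs)
```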


In [31]:
import random as rand

pd.options.display.max_rows = len(duplicates.columns)

duplicates.loc[rand.randrange(len(duplicates))]

docid_x                                                     017467683
035liste_x                         [(OCoLC)254941323, (BGR)000336877]
isbn_x                                                [3-8067-5097-1]
ttlpart_x           {'245': ['Die Reise der Pinguine', '[das Buch ...
pubyear_x                                                    2005    
decade_x                                                         2005
century_x                                                        2005
exactDate_x                                                  2005uuuu
edition_x                                                            
part_x                                                               
pages_x                                                       [64 S.]
volumes_x                                                          64
pubinit_x                                                 gerstenberg
pubword_x                                               [Gerstenberg]
scale_x             

Joining the raw data to pairs has duplicated the column names and extended them with the suffixes $\_\texttt{x}$ and $\_\texttt{y}$. This must be reflected in the metadata.

In [32]:
# Target is the first necessary column
columns_metadata_dict['columns_to_use'] = ['duplicates']

# After join with itself, the data has _x and _y attributes
for i in columns_metadata_dict['data_analysis_columns']:
    for j in ['_x', '_y']:
        columns_metadata_dict['columns_to_use'].append(i+j)
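
The nested loop above is equivalent to a single list comprehension. The following sketch shows this for a shortened, illustrative column list; the name `base_columns` is hypothetical and holds only two of the attribute names of the metadata.

```python
# Shortened stand-in for columns_metadata_dict['data_analysis_columns'].
base_columns = ['isbn', 'volumes']

# Target column first, then each attribute with both join suffixes.
columns_to_use = ['duplicates'] + [
    name + suffix for name in base_columns for suffix in ('_x', '_y')
]

print(columns_to_use)
```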

## Build Pairs of Uniques

In [33]:
# Check that no docid of the uniques file occurs in the masters file
for i in range(len(goldstandard['uniques'])):
    if len(goldstandard['masters'][goldstandard['masters'].docid == goldstandard['uniques'].docid.loc[i]]) > 0 :
        print('STOP!')
        break

The silence of the code cell above shows that no record in file $\texttt{master.json}$ has the same $\texttt{docid}$ as any record in file $\texttt{unique.json}$.
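
A check of this kind can also be done in one step with `Series.isin`. The sketch below is an illustration under assumptions: the two master values are taken from the output shown earlier, while the values of `uniques_docid` are invented.

```python
import pandas as pd

# Master docids from the output above, unique docids are hypothetical.
masters_docid = pd.Series(['504389793', '504390597'])
uniques_docid = pd.Series(['900000001', '900000002', '900000003'])

# Select all unique records whose docid also occurs in the masters.
overlap = uniques_docid[uniques_docid.isin(masters_docid)]

print('Number of shared docid values:', len(overlap))
```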

### Masters Uniques Relationships

The investigations above have shown that relationships between records are expressed with the same value in attribute $\texttt{035liste}$. To find records of the uniques file that are duplicates of records of the masters file, the values of attributes $\texttt{035liste}$ have to be compared.

In [34]:
masters_uniques_duplicates = {'master' : [], 'unique' : [], 'slave' : {}}

for i in range(len(goldstandard['masters'])):
    for j in range(len(goldstandard['masters']['035liste'][i])):
        for k in range(len(goldstandard['uniques'])):
            if goldstandard['masters']['035liste'][i][j] in goldstandard['uniques']['035liste'][k]:
#                print('master', i, 'has relationship to unique', k)
                masters_uniques_duplicates['master'].append(i)
                masters_uniques_duplicates['unique'].append(k)
print()
for i in set(masters_uniques_duplicates['master']):
    masters_uniques_duplicates['slave'].update({i : []})
    for j in range(len(goldstandard['masters']['035liste'][i])):
        for k in range(len(goldstandard['slaves'])):
            if goldstandard['masters']['035liste'][i][j] in goldstandard['slaves']['035liste'][k]:
#                print('master', i, 'has relationship to slave', k)
                masters_uniques_duplicates['slave'][i].append(k)
                
masters_uniques_duplicates




{'master': [92, 121, 121],
 'unique': [473, 458, 474],
 'slave': {121: [358, 346, 345, 351, 346, 351],
  92: [267, 272, 286, 303, 267, 267, 272, 303]}}

This result is surprising. The above code reveals that three records of the uniques file have relationships to two master records. Even worse, but at least consistently, the affected master records also have relationships to some slave records. Are these records really duplicates?

In [35]:
pd.options.display.max_rows = len(slave_master.columns)

for i in range(len(masters_uniques_duplicates['master'])):
    print('\npair------------------------------------->', 
          masters_uniques_duplicates['master'][i], masters_uniques_duplicates['unique'][i]
         )
    print(goldstandard['masters'].loc[masters_uniques_duplicates['master'][i]], '\n')
    print('... and its unique record ...')
    print(goldstandard['uniques'].loc[masters_uniques_duplicates['unique'][i]], '\n')
    print('... and its slave records ...')
    for j in masters_uniques_duplicates['slave'][masters_uniques_duplicates['master'][i]]:
        print(goldstandard['slaves'].loc[j], '\n')


pair-------------------------------------> 92 473
docid                                                     504389866
035liste          [(IDSBB)006207118, (NEBIS)010150430, (RERO)R00...
isbn                             [978-3-642-41697-2, 3-642-41697-7]
ttlpart           {'245': ['Empirische Bildungsforschung', 'aktu...
pubyear                                                    2014    
decade                                                         2014
century                                                        2014
exactDate                                                  2014uuuu
edition                                                            
part                                                               
pages                                                      [158 S.]
volumes                                                         158
pubinit                                                            
pubword                                                          

The visual check confirms that the affected records of uniques are duplicates of the related master and slave records. Therefore, these records of uniques must be excluded when building the rows with pairs of uniques.
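
The identifier comparison of this subsection boils down to testing whether two identifier lists share at least one value. A minimal sketch with Python sets, on invented toy data (`masters_toy`, `uniques_toy` and all identifier values are hypothetical), may look as follows.

```python
import pandas as pd

# Hypothetical identifier lists of two master and two unique records.
masters_toy = pd.DataFrame({'035liste': [['(IDSBB)1', '(NEBIS)2'], ['(RERO)3']]})
uniques_toy = pd.DataFrame({'035liste': [['(OCoLC)7', '(NEBIS)2'], ['(OCoLC)8', '(ABN)9']]})

# Convert the identifier lists of the uniques to sets once.
unique_sets = [set(ids) for ids in uniques_toy['035liste']]

# A master and a unique record are related if they share an identifier.
related = [
    (m_idx, u_idx)
    for m_idx, m_ids in masters_toy['035liste'].items()
    for u_idx, u_ids in enumerate(unique_sets)
    if not u_ids.isdisjoint(m_ids)
]

print(related)
```

In contrast to the loop above, a pair is reported only once here, even if the two records share several identifiers.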

### Uniqes Rows Generation

To build pairs of uniques of the goldstandard, all records of slaves are joined with all records of uniques, except for the records of uniques identified in the previous subsection, which would produce pairs of duplicates.

In [36]:
df_s_1 = goldstandard['slaves'].copy() # Do not modify original data.
df_u_1 = goldstandard['uniques'].copy()

# Exclude duplication candidates from above
df_u_1.drop(index=masters_uniques_duplicates['unique'], inplace=True)
#display(df_u_1.loc[masters_uniques_duplicates['unique']])

df_s_1['duplicates'] = 0
df_u_1['duplicates'] = 0
# Full join, Cartesian product
non_duplicates = pd.merge(df_s_1, df_u_1, on='duplicates')

print('Number of slave records : {:,d}, number of unique records : {:,d}, expected pairs : {:,d} = joined pairs : {:,d}'.format(
    len(df_s_1), len(df_u_1), len(df_s_1)*len(df_u_1), len(non_duplicates)))

Number of slave records : 435, number of unique records : 593, expected pairs : 257,955 = joined pairs : 257,955
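
Merging on a constant helper column is one way to obtain the Cartesian product. Assuming pandas 1.2 or newer, `pd.merge` also offers `how='cross'` for the same purpose. The sketch below compares both variants on hypothetical toy frames (`left_toy`, `right_toy`).

```python
import pandas as pd

# Hypothetical toy data: three slave records and two unique records.
left_toy = pd.DataFrame({'docid': ['s1', 's2', 's3']})
right_toy = pd.DataFrame({'docid': ['u1', 'u2']})

# Variant of this chapter: join on a constant helper column.
via_key = pd.merge(
    left_toy.assign(duplicates=0), right_toy.assign(duplicates=0), on='duplicates'
)

# Alternative, assuming pandas >= 1.2: explicit cross join.
via_cross = pd.merge(left_toy, right_toy, how='cross')

# Both variants yield every combination of left and right rows.
print(len(via_key), len(via_cross), len(left_toy) * len(right_toy))
```

The helper column variant has the side benefit that the target flag $\texttt{duplicates}$ is already part of the result.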


In [37]:
pd.options.display.max_rows = len(non_duplicates.columns)

non_duplicates.loc[rand.randrange(len(non_duplicates))]

docid_x                                                     156459280
035liste_x                        [(OCoLC)82668726, (NEBIS)004325390]
isbn_x                                                             []
ttlpart_x           {'245': ['Die Zauberflöte', 'eine deutsche Ope...
pubyear_x                                                    1970    
decade_x                                                         1970
century_x                                                        1970
exactDate_x                                                  1970uuuu
edition_x                                                            
part_x                                                               
pages_x                                             [1 Klavierauszug]
volumes_x                                                           1
pubinit_x                                                 bärenreiter
pubword_x                                               [Bärenreiter]
scale_x             

The number of rows with pairs of uniques shown above is so large that no further pairs of uniques are needed for training and testing purposes. For this reason, joining the records of uniques with themselves is omitted and the pairs generated so far are taken as they are.

## Build Feature Base

The rows with pairs of duplicates and the rows with pairs of uniques from above can be concatenated into one DataFrame, which is the basis for further processing into the feature matrix.

In [38]:
df_feature_base = pd.concat([duplicates, non_duplicates], sort=True)
# Set unique values on index
df_feature_base.reset_index(inplace=True, drop=True)

print('Number of rows with pairs of uniques :\t\t{:,d}'.format(len(df_feature_base[df_feature_base.duplicates==0])))
print('Number of rows with pairs of duplicates :\t{:,d}'.format(len(df_feature_base[df_feature_base.duplicates==1])))

Number of rows with pairs of uniques :		257,955
Number of rows with pairs of duplicates :	1,473


## Summary

The result of this chapter is a deep understanding of Swissbib's goldstandard data and of how to produce rows which represent pairs of duplicates and rows which represent pairs of uniques. After generating both kinds of pairs, the data set that will be used for training the machine learning models shows the following class ratio.

In [39]:
print('ratio of uniques :\t{:.2f} %\nratio of duplicates :\t{:.2f} %'.format(
    df_feature_base.duplicates.value_counts(normalize=True).loc[0]*100,
    df_feature_base.duplicates.value_counts(normalize=True).loc[1]*100))

ratio of uniques :	99.43 %
ratio of duplicates :	0.57 %
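
These percentages follow directly from the two row counts printed in the previous section. The short calculation below only re-derives them; the variable names are chosen for illustration.

```python
# Row counts as printed for the feature base above.
n_uniques = 257_955
n_duplicates = 1_473
n_total = n_uniques + n_duplicates

# Class shares in percent.
share_uniques = n_uniques / n_total * 100
share_duplicates = n_duplicates / n_total * 100

# Number of rows of uniques per row of duplicates.
uniques_per_duplicate = n_uniques / n_duplicates

print('{:.2f} % uniques, {:.2f} % duplicates'.format(share_uniques, share_duplicates))
print('about {:.0f} rows of uniques per row of duplicates'.format(uniques_per_duplicate))
```

With roughly 175 rows of uniques per row of duplicates, the data set is strongly imbalanced.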


Besides the data with rows of uniques and duplicates, the metadata of the project have been updated. After this chapter, the metadata look as shown in the output of the following lines of code.

In [40]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 



The data handover for the next chapter still needs to be done.

### Goldstandard DataFrame Handover

For further processing in the next chapters, the metadata dictionary and the resulting feature base DataFrame of this chapter need to be persisted. Both will be restored and used in upcoming chapters. The DataFrame developed in this chapter is saved into a pickle file.

In [41]:
import pickle as pk

# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'columns_metadata.pkl'), 'wb') as dict_output_file:
    pk.dump(columns_metadata_dict, dict_output_file)

# Binary intermediary DataFrame file for feature matrix generation
with open(os.path.join(path_goldstandard, 'feature_base_df.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)
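
Restoring the two files in a later chapter works with `pickle.load`. The following round trip is a minimal sketch on hypothetical toy objects (`df_toy`, `metadata_toy`) in a temporary directory, so that it does not depend on the directory in `path_goldstandard`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-ins for the feature base and the metadata dictionary.
df_toy = pd.DataFrame({'duplicates': [1, 0], 'isbn_x': ['a', 'b']})
metadata_toy = {'columns_to_use': ['duplicates', 'isbn_x']}

with tempfile.TemporaryDirectory() as tmp_dir:
    df_path = os.path.join(tmp_dir, 'feature_base_df.pkl')
    meta_path = os.path.join(tmp_dir, 'columns_metadata.pkl')

    # Persist both objects as binary pickle files.
    with open(df_path, 'wb') as df_output_file:
        pickle.dump(df_toy, df_output_file)
    with open(meta_path, 'wb') as dict_output_file:
        pickle.dump(metadata_toy, dict_output_file)

    # Restore them as a later chapter would do.
    with open(df_path, 'rb') as df_input_file:
        df_restored = pickle.load(df_input_file)
    with open(meta_path, 'rb') as dict_input_file:
        metadata_restored = pickle.load(dict_input_file)

print(df_restored.equals(df_toy), metadata_restored == metadata_toy)
```

Pickle files should only be loaded from trusted sources, as unpickling can execute arbitrary code.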