In [1]:
strip_number_digits = True

# Goldstandard and Data Preparation

Swissbib's goldstandard is a set of pre-defined records used for testing the implemented logic for identifying duplicate and unique records, see [[JudACaps](./A_References.ipynb#judacaps)]. The Swissbib project team has processed the goldstandard data into a data extract with the help of a Scala implementation [[ScalRepo](./A_References.ipynb#scala_repo)]. This chapter of the capstone project loads the goldstandard records and prepares them for their transformation into the feature matrix for the machine learning models.

The goal of this chapter is achieved once a DataFrame is available whose rows are pairs of goldstandard records, each row representing either a pair of duplicates or a pair of uniques. Each row has to be marked with a flag indicating which of the two it is. This flag will later be separated from the remaining columns and serve as the target for training and performance testing of the machine learning models. See [[JudACaps](./A_References.ipynb#judacaps)] for the basic explanation of the process.
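The intended layout can be sketched with a toy DataFrame. This is only an illustration under assumptions: the two rows are invented, the encoding of the flag as 1 and 0 is assumed, and the variable names `pairs`, `X` and `y` are hypothetical. Only the column names `duplicates`, `isbn_x` and `isbn_y` follow the naming used in this project's metadata.

```python
import pandas as pd

# Toy pairs of records (rows invented for illustration only).
pairs = pd.DataFrame({
    'duplicates': [1, 0],  # assumed encoding: 1 = pair of duplicates, 0 = pair of uniques
    'isbn_x': ['978-0-674-04884-3', '978-0-674-04884-3'],
    'isbn_y': ['978-0-674-04884-3', '978-3-596-51207-2'],
})

# Separate the flag as target vector; the remaining columns form the feature base.
y = pairs['duplicates']
X = pairs.drop(columns=['duplicates'])

print(X.columns.tolist())
print(y.tolist())
```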

## Table of Contents

- [Metadata Takeover](#Metadata-Takeover)
- [Description of Swissbib's Goldstandard](#Description-of-Swissbib's-Goldstandard)
    - [Sample Records of a Goldstandard Example](#Sample-Records-of-a-Goldstandard-Example)
    - [Generating Pairs of Duplicates](#Generating-Pairs-of-Duplicates)
    - [Generating Pairs of Uniques](#Generating-Pairs-of-Uniques)
- [Transform Attributes for Similarity Comparison](#Transform-Attributes-for-Similarity-Comparison)
    - [ismn](#ismn)
- [Build Pairs of Duplicates](#Build-Pairs-of-Duplicates)
    - [Slaves Masters Relationships](#Slaves-Masters-Relationships)
    - [Duplicates Rows Generation](#Duplicates-Rows-Generation)
- [Build Pairs of Uniques](#Build-Pairs-of-Uniques)
    - [Masters Uniques Relationships](#Masters-Uniques-Relationships)
    - [Uniques Rows Generation](#Uniques-Rows-Generation)
- [Build Feature Base](#Build-Feature-Base)
- [Summary](#Summary)
    - [Goldstandard DataFrame Handover](#Goldstandard-DataFrame-Handover)

## Metadata Takeover

The analysis of Swissbib's raw data has led to a decision, for each attribute, on whether to process it into the feature matrix. These decisions have been consolidated into a dictionary of metadata as a result of chapter [Data Analysis](./1_DataAnalysis.ipynb). This metadata is needed for processing the goldstandard and is loaded as a first step.

In [2]:
import os
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

In [3]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 

similarity_metrics 
 {'coordinate_E': LCSStr({'qval': 1, 'external': True}), 'coordinate_N': L

## Description of Swissbib's Goldstandard

Swissbib's goldstandard is a fixed set of pre-defined data that has been processed into three distinct .json files [[ScalRepo](./A_References.ipynb#scala_repo)].

- $\texttt{slave.json}$ - This data file holds the original duplicated records that are merged into a master record.
- $\texttt{master.json}$ - This data file holds the unique master records that are formed out of their associated records in file $\texttt{slave.json}$.
- $\texttt{unique.json}$ - This data file holds records with data similar to the data in $\texttt{slave.json}$ and $\texttt{master.json}$. These records, however, are correctly identified as unique records and not as duplicates of any record in $\texttt{slave.json}$.

This section starts with loading the goldstandard data into a separate DataFrame for each set. Afterwards, the data will be analysed and explained.

In [4]:
import json

records_slave, records_master, records_unique = [], [], []
file_slave, file_master, file_unique = 'slave.json', 'master.json', 'unique.json'

# Each file holds one JSON record per line; 'with' closes the file handles again
with open(os.path.join(path_goldstandard, file_slave), 'r') as f:
    records_slave = [json.loads(line) for line in f]
with open(os.path.join(path_goldstandard, file_master), 'r') as f:
    records_master = [json.loads(line) for line in f]
with open(os.path.join(path_goldstandard, file_unique), 'r') as f:
    records_unique = [json.loads(line) for line in f]

print('Number of records in data file {:s}\t{:d}'.format(file_slave, len(records_slave)))
print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

Number of records in data file slave.json	615
Number of records in data file master.json	184
Number of records in data file unique.json	188


In [5]:
import pandas as pd

goldstandard = {}

goldstandard['slaves'] = pd.DataFrame(records_slave)
goldstandard['masters'] = pd.DataFrame(records_master)
goldstandard['uniques'] = pd.DataFrame(records_unique)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard['slaves'].columns)

goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,00350560X,"[(OCoLC)884555343, (ABN)000305947]",[],{'245': ['Sideways']},{'245': ['Sideways']},"{'100': [], '700': ['GiamattiPaul', 'Haden Chu...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (ca. 122 Min.)],[],[],,[],[],[],2785408,[VM010300]
1,003894924,"[(OCoLC)314603155, (ABN)000414274]",[],{'245': ['Li regres Nostre Dame']},{'245': ['Li regres Nostre Dame']},"{'100': ['Huon Le Roi'], '700': ['LångforsArth...","{'110': [], '710': [], '810': []}",1907,1907,1907,1907,,[],"[CXLVII, 210 p.]","[CXLVII, 210 p.]",[],[],,[],[],[],,[BK020300]
2,004308069,"[(OCoLC)887407580, (ABN)000363324]",[],{'245': ['Euricius Cordus Epigrammata (1520)']},{'245': ['Euricius Cordus Epigrammata (1520)']},"{'100': ['CordusEuricius'], '700': ['KrauseKar...","{'110': [], '710': [], '810': []}",1892,1892,1892,1892,,[Heft 5],[1 Band],[1 Band],[],[],,[],[],[],,[BK020000]
3,005021073,"[(OCoLC)841563357, (ABN)000288211, R004141449]",[],{'245': ['Je voudrais que quelqu'un m'attende ...,{'245': ['Je voudrais que quelqu'un m'attende ...,"{'100': ['GavaldaAnna'], '700': [], '800': [],...","{'110': [], '710': [], '810': []}",2006,2006,2006,2006,,[],[3 Compact Discs],[3 Compact Discs],[],[],,[],[],[],A,[MU030100]
4,005179726,"[(OCoLC)611172738, (ABN)000232404]","[3-7281-1755-2, 3-519-05031-5]","{'245': ['Städtebau in der Schweiz 1800-1990',...","{'245': ['Städtebau in der Schweiz 1800-1990',...","{'100': ['KochMichael'], '700': [], '800': [],...","{'110': [], '710': [], '810': []}",1992,1992,1992,1992,,[Nr. 81],[315 S.],[315 S.],[],[],,[],[],[],,[BK020000]


The three files of the goldstandard will be used for training and testing the performance of the machine learning models of the capstone project. Combining the records of file $\texttt{slave.json}$ into pairs generates a set of pairs of duplicates. Combining the records of file $\texttt{slave.json}$ with the records of file $\texttt{unique.json}$ generates a set of pairs of uniques. Both sets will be used as training data.

### Sample Records of a Goldstandard Example

Let's have a rough look at an example in the goldstandard data. All records of all three $\texttt{.json}$ files with string 'Emma' in attribute $\texttt{ttlfull}$ shall be taken as example.

In [6]:
goldstandard['slaves'][
    goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
530,19930422X,"[(OCoLC)781556045, (NEBIS)009752200, (OCoLC)78...","[978-0-674-04884-3 (alk. paper), 0-674-04884-9...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['TandonBharat'...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,An annotated edition,[],[560 p.],[560 p.],[Belknap Press of Harvard University Press],[Belknap Press of Harvard University Press],,[],[],[],,[BK020000]
538,303917768,[(RERO)R007426863],[978-0-674-04884-3 (hbk. : alk. paper)],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['TandonBharat'...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,Annotated ed,[],"[ix, 560 p.]","[ix, 560 p.]",[],[],,[],[],[],,[BK020000]


In [7]:
print('Number of records in slave.json with string \'Emma\' :', len(
    goldstandard['slaves'][
        goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in slave.json with string 'Emma' : 2


Two records in $\texttt{slave.json}$ contain the string 'Emma' in the first title element of attribute $\texttt{ttlfull}$. These records may be duplicates of one another. File $\texttt{master.json}$ shows how many unique records remain after deduplicating these 2 records.

#### 01.08.2021 - New Goldstandard Data

In the new goldstandard data, the master file holds only $\mbox{035liste}$ and the list of its associated slaves. For each master record, a matching slave record therefore has to be found.

In [8]:
masters = goldstandard['masters']
slaves = goldstandard['slaves']

cols = list(slaves.columns)
# Initialize newMasters
nms = pd.DataFrame(columns = cols)
del cols[0:2]

for i in range(len(masters)) :
    found = False
    if not found :
        for j in range(len(slaves)) :
            if not found :
                for k in range(len(slaves[['docid','035liste']].loc[j]['035liste'])) :
                    if (slaves.loc[j]['035liste'][k] in masters['035liste'].loc[i]) :
                        # Take the first matching slave record and attach its attributes to the master.
                        nms = nms.append(
                            pd.concat([masters.loc[[i]].reset_index(drop=True), slaves[cols].loc[[j]].reset_index(drop=True)], axis=1)
                        )
                        found = True
                        break

goldstandard['masters'] = nms.reset_index(drop=True)
goldstandard['masters']

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,11111,"[(NEBIS)005161228, (IDSBB)000390913]",[],{'245': ['1861-1862']},{'245': ['1861-1862']},"{'100': [], '700': [], '800': [], '245c': ['']}","{'110': [], '710': [], '810': []}",1953,1953,1953,1953,,[5],[ v.],[ v.],[],[],,[],[],[],,[BK020000]
1,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},{'245': ['Sideways']},"{'100': [], '700': ['GiamattiPaul', 'Haden Chu...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (ca. 122 Min.)],[],[],,[],[],[],2785408,[VM010300]
2,11113,"[(VAUD)991010251069702852, (RERO)R003905843]",[],{'245': ['Sideways']},{'245': ['Sideways']},"{'100': [], '700': ['PayneAlexander', 'TaylorJ...","{'110': [], '710': [], '810': []}",2005,2005,2005,2005,,[],[1 DVD-vidéo],[1 DVD-vidéo],[],[],,[],[],[],FB-SDU DY 27854.1,[VM010300]
3,11114,"[(IDSBB)000541674, (NEBIS)000681404]",[],{'245': ['1688-1692']},{'245': ['1688-1692']},"{'100': [], '700': [], '800': [], '245c': ['']}","{'110': [], '710': [], '810': []}",1967,1967,1967,1967,,[deel 8],[ v.],[ v.],[],[],,[],[],[],,[BK020000]
4,11115,"[(NEBIS)000067274, (RERO)bv001381129]",[],"{'245': ['Collection Travaux et Recherches'], ...",{'245': ['Collection Travaux et Recherches']},"{'100': [], '700': [], '800': [], '245c': ['Pu...","{'110': [], '710': ['Facultés universitaires S...",19uu9999,19uu,19uu,19uu9999,,[],[],[],[],[],,[],[],[],,[CR030300]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179,11290,"[(NEBIS)006696699, (IDSBB)005700919, (LIBIB)00...","[978-3-8373-0484-8, 3-8373-0484-1]","{'245': ['Die Brüder Löwenherz', 'Abenteuer im...","{'245': ['Die Brüder Löwenherz', 'Abenteuer im...","{'100': ['LindgrenAstrid'], '700': ['KrügerOsk...","{'110': [], '710': [], '810': []}",2011,2011,2011,2011,,[],[2 Compact Discs (ca. 104 Min.)],[2 Compact Discs (ca. 104 Min.)],[],[],,[],[],[],,[MU030100]
180,11291,"[(IDSBB)004116685, (RERO)R247008060, (LIBIB)00...",[3-7891-2941-0],{'245': ['Die Brüder Löwenherz']},{'245': ['Die Brüder Löwenherz']},"{'100': ['LindgrenAstrid'], '700': ['Kornitzky...","{'110': [], '710': [], '810': []}",2006,2006,2006,2006,,[],[238 S.],[238 S.],[],[],,[],[],[],,[BK020000]
181,11292,"[(SERSOL)ssj0001971210, (IDSSG)000952205]",[978-3-658-17985-4],"{'245': ['Wahlkampf mit Humor und Komik', 'Sel...","{'245': ['Wahlkampf mit Humor und Komik', 'Sel...","{'100': [], '700': ['DörnerAndreas', 'VogtLudg...","{'110': [], '710': [], '810': []}",2017,2017,2017,2017,,[],[1 Online-Ressource],[1 Online-Ressource],[],[],,[],[10.1007/978-3-658-17985-4],[],,[BK020053]
182,11293,"[(IDSSG)001089317, (IDSLU)001402740]",[],"{'245': ['Alarm!', 'flink addiert - flott gebu...","{'245': ['Alarm!', 'flink addiert - flott gebu...","{'100': ['ArnAchim'], '700': [], '800': [], '2...","{'110': [], '710': [], '810': []}",2020,2020,2020,2020,1. Auflage,[],"[1 Spiel (55 Karten, 1 Buzzer, 1 Anleitung)]","[1 Spiel (55 Karten, 1 Buzzer, 1 Anleitung)]",[],[],,[],[],[],Bestellnummer: L22463,[VM040400]


The master records are now completed with the missing data. The first slave record found is always used as the data basis.

#### / 01.08.2021 - New Goldstandard Data

In [9]:
goldstandard['masters'][
    goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
183,11294,"[(NEBIS)009752200, (RERO)R007426863]","[978-0-674-04884-3 (alk. paper), 0-674-04884-9...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['TandonBharat'...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,An annotated edition,[],[560 p.],[560 p.],[Belknap Press of Harvard University Press],[Belknap Press of Harvard University Press],,[],[],[],,[BK020000]


In [10]:
print('Number of records in master.json with string \'Emma\' :', len(
    goldstandard['masters'][
        goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in master.json with string 'Emma' : 1


The number of deduplicated records of file $\texttt{slave.json}$ containing string 'Emma' is 1. This is the number of unique records that Swissbib's deduplication step builds out of the 2 original records above.

There may be some minor differences in some fields of the deduplicated master file records compared to their original data in the duplicated original slave file records. Nevertheless, it is possible to identify the slave file records as duplicates of their associated master file records, looking at the details of the values of each attribute.

In a final step, let's have a look at the 'Emma' string data of $\texttt{unique.json}$.

In [11]:
goldstandard['uniques'][
    goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
87,183110714,"[(OCoLC)800585351, (NEBIS)007344139, (OCoLC)80...","[978-3-596-51207-2, 3-596-51207-7]","{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': ['HenzeHelene']...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,,[],[735 S.],[735 S.],[Fischer Taschenbuch Verl.],[Fischer Taschenbuch Verl.],,[],[],[],,[BK020000]
92,19930422X,"[(OCoLC)781556045, (NEBIS)009752200, (OCoLC)78...","[978-0-674-04884-3 (alk. paper), 0-674-04884-9...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['TandonBharat'...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,An annotated edition,[],[560 p.],[560 p.],[Belknap Press of Harvard University Press],[Belknap Press of Harvard University Press],,[],[],[],,[BK020000]
98,253136512,[(RERO)R007179248],[978-3-458-36220-3],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': ['BeckAngelika'...","{'110': [], '710': [], '810': []}",2012,2012,2012,2012,,"[4520. Insel-Klassik, 4520]",[628 S.],[628 S.],[],[],,[],[],[],,[BK020000]
129,50750688X,"[(OCoLC)1022932416, (NEBIS)011111185]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': [], '700': ['O'HanlonJim', 'WelchSandy...","{'110': [], '710': [], '810': []}",2017,2017,2017,2017,,[],[2 DVD-Video (circa 229 min)],[2 DVD-Video (circa 229 min)],[],[],,[],[],[],,[VM010300]
138,523008872,"[(OCoLC)1048604508, (SGBN)001431881]",[978-1-4071-7266-8],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2017,2017,2017,2017,This edition published 2017,[],[531 Seiten],[531 Seiten],[],[],,[],[],[],,[BK020000]
139,525879404,"[(OCoLC)1057336224, (SGBN)001434172]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': [], '700': ['AustenJane', 'LawrenceDia...","{'110': [], '710': [], '810': []}",20171996,2017,2017,20171996,,[],[1 DVD-Video (circa 94 min)],[1 DVD-Video (circa 94 min)],[],[],,[],[],[],,[VM010300]
140,52959885X,"[(OCoLC)1057480762, (NEBIS)011290237]",[978-88-536-2483-3],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['SardiSilvana'...","{'110': [], '710': [], '810': []}",20182018,2018,2018,20182018,First Edition,[B2 (Stage 4)],[111 Seiten],[111 Seiten],[],[],,[],[],[],,[BK020000]
156,567590143,"[(VAUDS)991002876755802853, (EXLNZ-41BCULAUSA_...",[978-2-37349-115-9],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['TsePo'], '700': ['ChanCrystal', 'Aus...","{'110': [], '710': [], '810': []}",2018,2018,2018,2018,,[],[1 vol. (294 p.)],[1 vol. (294 p.)],[],[],,[],[],[],,[BK020000]
159,574115889,"[(OCoLC)1025334662, (NEBIS)011490341, (OCoLC)1...","[978-0-19-402428-0 (paperback), 0-19-402428-8 ...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['Westclare', '...","{'110': [], '710': [], '810': []}",20172017,2017,2017,20172017,[Simplified edition],[],[91 Seiten],[91 Seiten],[],[],,[],[],[],,[BK020000]
166,584730535,"[(SERSOL)ssj0002021262, (WaSeSS)ssj0002021262]","[978-1-4677-7534-2, 1-4677-7534-7 (EBook) : US...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",20140801,2014,2014,20140801,Digital original,[],[1 online resource (512 p.)],[1 online resource (512 p.)],"[First Avenue Editions [Imprint], Lerner Publi...","[First Avenue Editions [Imprint], Lerner Publi...",,[],[],[],,[BK020053]


In [12]:
print('Number of records in unique.json with string \'Emma\' :', len(
    goldstandard['uniques'][
        goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')])
     )

Number of records in unique.json with string 'Emma' : 12


According to the output above, the number of records of file $\texttt{unique.json}$ containing string 'Emma' is 12. The data of these records reveal that none of them can be associated as a duplicate with a record in $\texttt{master.json}$. These unique records are excellent training data for deduplication, as they can be used to train on record pairs that are similar but not duplicates, and thus unique.

The goal of generating training and test data with the help of the goldstandard is to generate two categories of training data.

- The first category consists of records of pairs of the original records out of file $\texttt{slave.json}$ that have a label 'duplicates'. This will be accomplished with the help of the records of file $\texttt{master.json}$, see subsection [Generating Pairs of Duplicates](#Generating-Pairs-of-Duplicates).
- The second category consists of records of pairs of the original records out of files $\texttt{slave.json}$ and $\texttt{unique.json}$ that have a label 'uniques', see subsection [Generating Pairs of Uniques](#Generating-Pairs-of-Uniques).

The number of possible duplicate pairs is given by the formula
$$Tot_{duplicate\ pairs} = \sum_{i=1}^{M} \frac{1}{2}S_i \cdot \left(S_i-1\right)$$
where $M$ is the number of records in file $\texttt{master.json}$ and $S_i$ is the number of records in file $\texttt{slave.json}$ that are associated with master record $m_i$. As the number of slave records must be counted for each master record, and as this number can only be deduced from the goldstandard data itself, it is not possible to calculate the expected number of duplicate pairs in advance.
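The formula can be illustrated with a minimal sketch. The group sizes below are invented and do not stem from the goldstandard; the function name `count_duplicate_pairs` is hypothetical. The result is cross-checked by enumerating the pairs within each group explicitly.

```python
from itertools import combinations

def count_duplicate_pairs(slaves_per_master):
    """Sum S_i * (S_i - 1) / 2 over all master records."""
    return sum(s * (s - 1) // 2 for s in slaves_per_master)

# Hypothetical example: three masters with 2, 3 and 1 associated slave records.
group_sizes = [2, 3, 1]
total = count_duplicate_pairs(group_sizes)

# Cross-check by enumerating all pairs within each group of slaves.
enumerated = sum(len(list(combinations(range(s), 2))) for s in group_sizes)

print(total, enumerated)
```

A master with a single slave record contributes no pair, which is why the third group adds nothing to the total.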

### Generating Pairs of Duplicates

The central attribute for retrieving duplicates in the goldstandard data is attribute $\texttt{035liste}$, see chapter [Data Analysis](./1_DataAnalysis.ipynb). It helps to identify the associated master record for a given record of file $\texttt{slave.json}$. The process implemented below parses the list of identifiers in attribute $\texttt{035liste}$ of each record in file $\texttt{slave.json}$ and searches for each identifier in attribute $\texttt{035liste}$ of all records of file $\texttt{master.json}$. When the identifier is found in file $\texttt{master.json}$, the master record's attribute $\texttt{docid}$ is stored as a new column in the slave DataFrame, see figure [Slave/master relationship](#slave_master_relationship). The new attribute in the slave record acts as a foreign key to the related master record. This process is repeated for each list element of a slave record and for each record in the data file $\texttt{slave.json}$.

<center>
    <b>Figure</b><a id='slave_master_relationship'></a> Slave/master relationship.
    <img src="./documentation/training_data.png" style="width: 600px;"/>
</center>

As will be shown below, the relationship of a slave record to its master record is unique. Even if the list of attribute $\texttt{035liste}$ of a slave record holds more than one entry, all distinct entries of this list point to one and the same master record.
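The lookup and the uniqueness check can be sketched with plain dictionaries. This is a simplified illustration, not the implementation used in this chapter: the names `masters_035`, `slaves_035`, `identifier_to_master` and `slave_to_masters` are hypothetical, and the few identifiers are sample values taken from the record listings above.

```python
# Toy data: master docid -> identifiers in attribute 035liste.
masters_035 = {
    '11112': ['(NEBIS)005002232', '(IDSBB)003662541'],
    '11294': ['(NEBIS)009752200', '(RERO)R007426863'],
}
# Toy data: slave docid -> identifiers in attribute 035liste.
slaves_035 = {
    '19930422X': ['(OCoLC)781556045', '(NEBIS)009752200'],
    '303917768': ['(RERO)R007426863'],
}

# Inverted index: identifier -> docid of the master record that lists it.
identifier_to_master = {
    ident: docid for docid, idents in masters_035.items() for ident in idents
}

# Collect the distinct masters referenced by each slave record.
slave_to_masters = {
    docid: {identifier_to_master[i] for i in idents if i in identifier_to_master}
    for docid, idents in slaves_035.items()
}

# The relationship is unique if every slave points to at most one master.
is_unique = all(len(m) <= 1 for m in slave_to_masters.values())
print(slave_to_masters, is_unique)
```

Identifiers that do not occur in any master record, such as the OCoLC entry of the first slave, are simply skipped.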

Attribute $\texttt{035liste}$ will be used neither in training nor in performance testing of the models. The column will be removed before model training.

### Generating Pairs of Uniques

File $\texttt{unique.json}$ holds data of unique records. For training and testing purposes, pairs of records could be generated exclusively from the records of this file. This would produce record pairs with label 'uniques' that are clearly distinguishable and therefore easily recognized as non-duplicate pairs. A more interesting pairing can be generated by combining records of $\texttt{slave.json}$ with records of $\texttt{unique.json}$. Pairs built from a mixture of both sources may show similarities in their attribute values, which makes them more difficult to classify clearly as uniques.

A mixture of both pairing sources will be implemented for the generation of training and testing data. All available unique records are combined with all available slave records in a Cartesian product. The resulting number of pairs is calculated below.

In [13]:
print('Total expected number of pairs of uniques for training and testing : {:,d}'.format(
    len(goldstandard['uniques'])*len(goldstandard['slaves'])))

Total expected number of pairs of uniques for training and testing : 115,620


Of course, this estimate only holds true if the slave records do not have any duplicates among the unique records. Whether this is the case in the goldstandard data will be investigated below.
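A Cartesian product of two DataFrames can be sketched with a cross join. The toy records and the filter on identical $\texttt{docid}$ values are assumptions made for illustration and are not necessarily how this project treats overlapping records. The sketch assumes pandas 1.2 or newer, where `how='cross'` is available.

```python
import pandas as pd

# Hypothetical toy records; record 'u1' deliberately occurs in both sets.
slaves = pd.DataFrame({'docid': ['s1', 's2', 'u1'], 'isbn': ['a', 'b', 'c']})
uniques = pd.DataFrame({'docid': ['u1', 'u2'], 'isbn': ['c', 'd']})

# Cartesian product; overlapping column names get the default suffixes _x and _y.
pairs = slaves.merge(uniques, how='cross')

# A record paired with itself is not a pair of uniques and is dropped here.
pairs_uniques = pairs[pairs['docid_x'] != pairs['docid_y']].reset_index(drop=True)

print(len(pairs), len(pairs_uniques))
```

In this toy case the product holds 3 x 2 pairs, and one of them is removed because the same record appears on both sides.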

## Transform Attributes for Similarity Comparison

Before pairing the records into a DataFrame that serves as the basis for the feature matrix, the attributes of the goldstandard data are transformed into the form needed for generating the similarity matrix. This step could also be done after the pairing step. Doing it before building pairs avoids transforming the same attribute values several times. The transformation effort is limited to the raw records of the goldstandard instead of the multiplied data of the paired records.

As the data preparation of the attributes will be reused in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), it has been encapsulated into a separate function of library [data_preparation_funcs.py](./data_preparation_funcs.py), which handles the processing of all relevant attributes. This function will be called below.

In [14]:
import data_preparation_funcs as dpf

for i in ['slaves', 'masters', 'uniques']:
    goldstandard[i] = dpf.attribute_preprocessing(
        goldstandard[i],
        columns_metadata_dict['data_analysis_columns'], strip_number_digits)

### ismn

In [15]:
for i in ['slaves', 'masters', 'uniques']:
    display(goldstandard[i].ismn[goldstandard[i].ismn.apply(lambda x : len(x)>0)])

17        m001068048
173       m001068062
205       m001068048
269       m004182772
334       m001068048
349    9790004182772
Name: ismn, dtype: object

115    m001068048
125    m004182772
Name: ismn, dtype: object

Series([], Name: ismn, dtype: object)

A vanishingly small number of rows with real ismn identifiers had been found in the data of chapter [Data Analysis](./1_DataAnalysis.ipynb). In the goldstandard data, the number of real ismn identifiers found is reasonable. As there will be training data with filled attribute $\texttt{ismn}$, the attribute will be kept for the feature matrix.

In [16]:
#columns_metadata_dict['data_analysis_columns'].remove('ismn')
#columns_metadata_dict['columns_to_use'].remove('ismn_x')
#columns_metadata_dict['columns_to_use'].remove('ismn_y')

## Build Pairs of Duplicates

As a first step, the relationship between slaves records and masters records is investigated. Thereafter, the records of pairs of duplicates will be generated.

In [17]:
for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['masters'][goldstandard['masters'].docid == goldstandard['slaves'].docid.loc[i]]) > 0 :
        print('STOP!')
        break

The silence of the code cell above shows that there is no record in file $\texttt{slave.json}$ that has the same $\texttt{docid}$ as any record in file $\texttt{master.json}$. Attribute $\texttt{docid}$ is therefore irrelevant for identifying pairs of duplicates.
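The same check can also be expressed without an explicit loop. The following minimal sketch uses two tiny DataFrames filled with a few sample $\texttt{docid}$ values; the variable names `slaves`, `masters` and `overlap` are chosen for illustration only.

```python
import pandas as pd

# Tiny sample frames; only the docid column matters for this check.
slaves = pd.DataFrame({'docid': ['00350560X', '003894924']})
masters = pd.DataFrame({'docid': ['11111', '11112']})

# True for every slave docid that also occurs as a master docid.
overlap = slaves['docid'].isin(masters['docid'])

print('Number of shared docid values:', overlap.sum())
```

A sum of zero corresponds to the silent loop above, while any positive value would correspond to the 'STOP!' message.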

### Slaves Masters Relationships

To understand how to build pairs of duplicates, let's have a look at some sample data.

In [18]:
for i in ['slaves', 'masters', 'uniques']:
    print(f'\n{i}')
    print(goldstandard[i]['035liste'].sample(n=5))


slaves
96     [(OCoLC)883912676, (IDSBB)002140408]
461    [(OCoLC)727711800, (IDSLU)000224832]
578                          [(BISCH)87345]
464      [(OCoLC)882005788, (SBT)000104393]
433      [(ZORA)oai:www.zora.uzh.ch:176207]
Name: 035liste, dtype: object

masters
54                  [(NEBIS)011485397, (IDSBB)007138335]
141           [(IDSBB)006318171, (ALEX)9914727244101791]
14     [(IDSBB)001484099, (SGBN)000011744, (NEBIS)003...
56     [(RERO)0225017, (VAUD)991010579489702852, (ALE...
89     [(SNL)991017415539703976, (ALEX)99149743841017...
Name: 035liste, dtype: object

uniques
110                 [(OCoLC)732640714, (IDSSG)000734198]
93     [(OCoLC)314634959, (NEBIS)009522446, (OCoLC)31...
42     [(VAUD)991006675979702852, (RNV)001661624-41bc...
48                    [(OCoLC)808076788, (BGR)000188775]
22                  [(OCoLC)610652832, (IDSBB)002826745]
Name: 035liste, dtype: object


In [19]:
for i in range(len(goldstandard['slaves'])):
    if i < 4:
        for j in range(len(goldstandard['slaves']['035liste'][i])):
            for k in range(len(goldstandard['masters'])):
                if goldstandard['slaves']['035liste'][i][j] in goldstandard['masters']['035liste'][k]:
                    print('slaves record', i, 'holds in 035liste position', j, 'a reference to masters record', k)
print('etc.')

slaves record 0 holds in 035liste position 1 a reference to masters record 1
slaves record 1 holds in 035liste position 1 a reference to masters record 60
slaves record 2 holds in 035liste position 1 a reference to masters record 45
slaves record 3 holds in 035liste position 1 a reference to masters record 162
etc.


The code above shows a relationship between records out of $\texttt{slave.json}$ and records out of $\texttt{master.json}$. This relationship is established by the entries in attribute $\texttt{035liste}$. The next step is to look at record pairs that have this relationship.

In [20]:
sample_pair_1 = [0, 85]
sample_pair_2 = [1, 145]
sample_pair_3 = [2, 119]
sample_pair_4 = [3, 57]

print('Example pair 1')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_1[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_1[1]])
print('\nExample pair 2')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_2[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_2[1]])
print('\nExample pair 3')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_3[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_3[1]])
print('\nExample pair 4')
print('slaves record', goldstandard['slaves']['035liste'][sample_pair_4[0]], '\nmasters record', goldstandard['masters']['035liste'][sample_pair_4[1]])

Example pair 1
slaves record ['(OCoLC)884555343', '(ABN)000305947'] 
masters record ['(IDSBB)006096784', '(RERO)R007253052']

Example pair 2
slaves record ['(OCoLC)314603155', '(ABN)000414274'] 
masters record ['(IDSBB)003766397', '(NEBIS)005086249', '(SNL)991009051329703976', '(BGR)000396858', '(ALEX)9911791454101791']

Example pair 3
slaves record ['(OCoLC)887407580', '(ABN)000363324'] 
masters record ['(IDSBB)006021508', '(LIBIB)000362942', '(IDSSG)000528149']

Example pair 4
slaves record ['(OCoLC)841563357', '(ABN)000288211', 'R004141449'] 
masters record ['(IDSBB)004683389', '(SNL)991002497379703976', '(NEBIS)007558784', '(SGBN)000089173', '(NEBIS)009002600']


In [21]:
for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(goldstandard['slaves'].loc[i[0]], '\n')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid                                                     00350560X
035liste                         [(OCoLC)884555343, (ABN)000305947]
isbn                                                             []
ttlpart                                       {'245': ['Sideways']}
pubyear                                                    2005    
decade                                                         2005
century                                                        2005
exactDate                                                  2005uuuu
edition                                                            
part                                                               
pages                                  [1 DVD-Video (ca. 122 Min.)]
volumes                                                       1 122
pubinit                                                            
pubword                                                         

Record pairs with a relationship are pairs of duplicates of a bibliographic unit, as described in the first part of this chapter. Attribute $\texttt{035liste}$ is used to identify explicitly which slave record belongs to which master record. The following function searches, for each $\texttt{035liste}$ value of a slave record, for the same value in $\texttt{035liste}$ of the master records. If the value is found in one of the master records, the $\texttt{docid}$ of this master is stored in a new attribute $\texttt{masters}\_\texttt{docid}$ of the slave record. As attribute $\texttt{035liste}$ can hold more than one element, a slave record may end up with several master references. Therefore, attribute $\texttt{masters}\_\texttt{docid}$ is a list.

In [22]:
def add_master_docid_to_slave (df_s, df_m):
    """Determine docid of master and store on slave."""
    # Initialize Foreign Key list
    df_s['masters_docid'] = [list() for x in range(len(df_s.index))]

    # Search for master of slave
    for i in range(len(df_s)):
        loc_li = list()
        for j in range(len(df_s['035liste'].loc[i])):
            master_index = df_m[df_m['035liste'].str.contains(
                df_s['035liste'].loc[i][j], regex=False
            )].index
            if len(master_index) > 0 : # Skip empty Series
                loc_li.append(df_m.docid[master_index].values[0])

        df_s['masters_docid'].loc[i] = loc_li
    
    return df_s
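
The search above scans all master records once per slave key. As a hedged alternative, the following sketch first builds a dictionary from each $\texttt{035liste}$ key to its master $\texttt{docid}$ and then looks the slave keys up. The toy frames and the helper name `map_slaves_to_masters` are hypothetical and not part of the project code. Like the function above, the sketch keeps the first master found for a key.

```python
import pandas as pd

def map_slaves_to_masters(df_s, df_m):
    """Return, per slave row, the list of master docids sharing a 035 key."""
    # Lookup table: 035 key -> docid of the first master that holds it
    key_to_master = {}
    for docid, keys in zip(df_m['docid'], df_m['035liste']):
        for key in keys:
            key_to_master.setdefault(key, docid)

    # Collect one master docid per matching slave key (repeated values kept)
    return df_s['035liste'].apply(
        lambda keys: [key_to_master[k] for k in keys if k in key_to_master]
    )

# Hypothetical toy data, not taken from the goldstandard
toy_masters = pd.DataFrame({
    'docid': ['m1', 'm2'],
    '035liste': [['(A)1', '(B)2'], ['(C)3']],
})
toy_slaves = pd.DataFrame({
    'docid': ['s1', 's2', 's3'],
    '035liste': [['(A)1', '(B)2'], ['(C)3', '(X)9'], ['(Y)0']],
})

toy_slaves['masters_docid'] = map_slaves_to_masters(toy_slaves, toy_masters)
print(toy_slaves[['docid', 'masters_docid']])
```

In the toy data, the first slave record matches the same master twice, which is the situation the later reduction to distinct entries takes care of. The third slave record has no master at all and receives an empty list.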

In [23]:
goldstandard['slaves'] = add_master_docid_to_slave(goldstandard['slaves'], goldstandard['masters'])

goldstandard['slaves'].masters_docid.sample(n=10)

26     [11156]
610    [11293]
280    [11280]
134    [11124]
555    [11190]
508    [11151]
5      [11175]
143    [11162]
589    [11148]
304    [11190]
Name: masters_docid, dtype: object

The same samples as above now show a new attribute $\texttt{masters}\_\texttt{docid}$ in the slave record, which holds the docid value of its related master record. This is true for each pair of records, see below.

In [24]:
for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(goldstandard['slaves'].loc[i[0]], '\n')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid                                                     00350560X
035liste                         [(OCoLC)884555343, (ABN)000305947]
isbn                                                             []
ttlpart                                       {'245': ['Sideways']}
pubyear                                                    2005    
decade                                                         2005
century                                                        2005
exactDate                                                  2005uuuu
edition                                                            
part                                                               
pages                                  [1 DVD-Video (ca. 122 Min.)]
volumes                                                       1 122
pubinit                                                            
pubword                                                         

In [25]:
goldstandard['slaves']['masters_docid'].sample(n=10)

299    [11244]
282    [11197]
558    [11164]
407    [11188]
505    [11152]
316    [11216]
66     [11272]
64     [11273]
49     [11128]
569    [11140]
Name: masters_docid, dtype: object

As the records in the master file are unique by themselves, the keys in $\texttt{masters}\_\texttt{docid}$ must be unique to identify unambiguously whether a pair of slave records is a duplicate. In samples such as those shown above, some cases with more than one entry can be detected. A visual check reveals, however, that these entries hold identical values. As a next step, all identical values in attribute $\texttt{masters}\_\texttt{docid}$ are reduced to one distinct entry, which leads to a list of distinct values.
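
The reduction to distinct entries and the check for exactly one master reference can be sketched on a toy column. This is a minimal illustration with hypothetical values. It uses `dict.fromkeys` instead of a `set` only to keep the order of first appearance, which the project code does not require.

```python
import pandas as pd

# Hypothetical toy column of master references per slave record
masters_docid = pd.Series([['11112', '11112'], ['11171'], [], ['11156', '11273']])

# Collapse identical entries while keeping the order of first appearance
distinct = masters_docid.apply(lambda ids: list(dict.fromkeys(ids)))

# True where a slave record references exactly one master
has_single_master = distinct.str.len() == 1

print(distinct.tolist())
print(has_single_master.tolist())
```

A boolean mask like `has_single_master` could also be used to filter the slave records in one step instead of dropping rows inside a loop.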

#### 01.08.2021 - Drop Slaves with no Master

The new data contains slave records without a master assignment. These records are removed.

In [26]:
# Reduce masters_docid to distinct entries, then drop slaves without exactly one master
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : set(x))
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : list(x))

for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['slaves'].masters_docid.loc[i]) != 1 :
        print('STOP!', i, len(goldstandard['slaves'].masters_docid.loc[i]), '=> dropping row.')
        goldstandard['slaves'].drop(index=i, inplace=True)
        
goldstandard['slaves'].reset_index(drop=True, inplace=True)

#### / 01.08.2021 - Drop Slaves with no Master

In [27]:
# Proof that all docid_masters are unique...
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : set(x))
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : list(x))

for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['slaves'].masters_docid.loc[i]) != 1 :
        print('STOP!', i)
        break

The silence of the code cell above demonstrates that, after reducing the list in $\texttt{masters}\_\texttt{docid}$ to distinct entries, every slave record holds exactly one key reference to a master. This is the basis for joining the data.

The list attribute $\texttt{masters}\_\texttt{docid}$ now holds exactly one element per record. The following line of code removes the list data type and extracts its content as a single string value.

In [28]:
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : x[0])
goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlpart,pubyear,decade,century,exactDate,edition,part,pages,...,corporate_110,corporate_710,corporate_full,format_prefix,format_postfix,person_100,person_700,person_245c,ttlfull_245,ttlfull_246,masters_docid
0,00350560X,"[(OCoLC)884555343, (ABN)000305947]",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],...,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,,11112
1,003894924,"[(OCoLC)314603155, (ABN)000414274]",[],{'245': ['Li regres Nostre Dame']},1907,1907,1907,1907uuuu,,,"[CXLVII, 210 p.]",...,,,,bk,20300,huon le roi,långforsarthur,par huon le roi de cambrai; publ. d'aprs tous ...,li regres nostre dame,,11171
2,004308069,"[(OCoLC)887407580, (ABN)000363324]",[],{'245': ['Euricius Cordus Epigrammata (1520)']},1892,1892,1892,1892uuuu,,5.0,[1 Band],...,,,,bk,20000,corduseuricius,krausekarl,euricius cordus; herausgegeben von karl krause,euricius cordus epigrammata (1520),,11156
3,005021073,"[(OCoLC)841563357, (ABN)000288211, R004141449]",[],{'245': ['Je voudrais que quelqu'un m'attende ...,2006,2006,2006,2006uuuu,,,[3 Compact Discs],...,,,,mu,30100,gavaldaanna,,anna gavalda,je voudrais que quelqu'un m'attende quelque part,,11273
4,005179726,"[(OCoLC)611172738, (ABN)000232404]","[3-7281-1755-2, 3-519-05031-5]","{'245': ['Städtebau in der Schweiz 1800-1990',...",1992,1992,1992,1992uuuu,,81.0,[315 S.],...,,,,bk,20000,kochmichael,,michael koch,"städtebau in der schweiz 1800-1990, entwicklun...",,11169


Joining the slave records with the master records must result in one pair per slave record, because every slave record now references exactly one master. The number of pairs must therefore be equal to the number of slave records that remain after the cleanup above.

In [29]:
slave_master = pd.merge(left=goldstandard['slaves'], right=goldstandard['masters'], how='inner',
                  left_on='masters_docid', right_on='docid')

print('Number of paired slave/master records', len(slave_master))

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(slave_master.columns)

slave_master.head()

Number of paired slave/master records 615


Unnamed: 0,docid_x,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,masters_docid,docid_y,035liste_y,isbn_y,ttlpart_y,pubyear_y,decade_y,century_y,exactDate_y,edition_y,part_y,pages_y,volumes_y,pubinit_y,pubword_y,scale_y,coordinate_y,doi_y,ismn_y,musicid_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y
0,00350560X,"[(OCoLC)884555343, (ABN)000305947]",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408.0,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,,11112,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,
1,050352490,"[(OCoLC)887993295, (SGBN)000595500]",[],"{'245': ['Sideways', 'eine Geschichte über das...",2005,2005,2005,2005uuuu,,,[1 DVD-Video],1,twentieth century,[Twentieth Century],,[],,,2.0,,,,,,vm,10300,,paynealexander,,"sideways, eine geschichte über das leben, die ...",,11112,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,
2,11983393X,"[(OCoLC)884555343, (IDSBB)003662541]",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408.0,,,,,,vm,10300,,"paynealexander, taylorjim, pickettrex, giamatt...",regie: alexader payne ; drehbuch: alexander pa...,sideways,,11112,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,
3,161731651,"[(OCoLC)884555343, (NEBIS)005002232]",[],{'245': ['Sideways']},20052004,2005,2005,20052004,,,[1 DVD-Video (ca. 122 Min.)],1 122,twentieth century fox home entertainment,[Twentieth Century Fox Home Entertainment],,[],,,,,,,,,vm,10300,,"paynealexander, pickettrex, taylorjim, giamatt...",regie: alexander payne,sideways,,11112,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,
4,340501588,"[(OCoLC)611340779, (IDSSG)000762076]",[],"{'245': ['Sideways', 'eine Geschichte über das...",2005,2005,2005,2005uuuu,,,[1 DVD-Video (122 Min.) Ländercode 2],1 122 2,,[],,[],,,,,,,,,vm,10300,,paynealexander,regie: alexander payne,"sideways, eine geschichte über das leben, die ...",,11112,11112,"[(NEBIS)005002232, (IDSBB)003662541, (IDSBB)00...",[],{'245': ['Sideways']},2005,2005,2005,2005uuuu,,,[1 DVD-Video (ca. 122 Min.)],1 122,,[],,[],,,2785408,,,,,,vm,10300,,"giamattipaul, haden churchthomas, madsenvirgin...",regie: alexader payne ; drehbuch: alexander pa...,sideways,


The examples above reveal that the slave/master record pairs are indeed duplicates.
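
In an inner join where each slave record references exactly one existing master, the result has one row per slave record. As a hedged sketch on hypothetical toy data, the `validate` parameter of `pd.merge` can make this expectation explicit. The project code above does not use it.

```python
import pandas as pd

# Hypothetical toy data, not taken from the goldstandard
toy_slaves = pd.DataFrame({'docid': ['s1', 's2', 's3'],
                           'masters_docid': ['m1', 'm1', 'm2']})
toy_masters = pd.DataFrame({'docid': ['m1', 'm2'],
                            'title': ['sideways', 'emma']})

# many_to_one: raises pandas.errors.MergeError if a master docid is not unique
toy_pairs = pd.merge(toy_slaves, toy_masters, how='inner',
                     left_on='masters_docid', right_on='docid',
                     validate='many_to_one')

# Every slave found its master, so the row count equals the slave count
assert len(toy_pairs) == len(toy_slaves)
print(toy_pairs)
```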

In [30]:
pd.options.display.max_rows = len(slave_master.columns)

for i in [sample_pair_1, sample_pair_2, sample_pair_3, sample_pair_4]:
    print('\npair------------------------------------->', i)
    print(slave_master.loc[i[0]], '\n')
    print('... and its master record ...')
    print(goldstandard['masters'].loc[i[1]])


pair-------------------------------------> [0, 85]
docid_x                                                     00350560X
035liste_x                         [(OCoLC)884555343, (ABN)000305947]
isbn_x                                                             []
ttlpart_x                                       {'245': ['Sideways']}
pubyear_x                                                    2005    
decade_x                                                         2005
century_x                                                        2005
exactDate_x                                                  2005uuuu
edition_x                                                            
part_x                                                               
pages_x                                  [1 DVD-Video (ca. 122 Min.)]
volumes_x                                                       1 122
pubinit_x                                                            
pubword_x                             

### Duplicates Rows Generation

The rows with pairs of duplicates can now be built by joining the slave records with themselves on the condition of the same $\texttt{masters}\_\texttt{docid}$. This pairs every slave record with every slave record of the same master, including itself.

In [31]:
def build_duplicate_pairs (df):
    """Builds-up all duplicate pairs, even with itself."""
    
    return pd.merge(left=df, right=df, how='inner',
                    left_on='masters_docid', right_on='masters_docid')

The records of duplicates are marked as duplicates immediately after pairing.

In [32]:
duplicates = build_duplicate_pairs(goldstandard['slaves'])
duplicates['duplicates'] = 1

print('Number of duplicate rows {:,d}'.format(len(duplicates)))

Number of duplicate rows 2,783
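
The number of rows of such a self-join can be cross-checked: a master with $n$ slave records contributes $n^2$ pairs, so the total is the sum of the squared group sizes. The following sketch shows this on hypothetical toy data. It also counts the pairs of a record with itself, which the join includes.

```python
import pandas as pd

# Hypothetical toy data: three slaves of master m1, two slaves of master m2
toy_slaves = pd.DataFrame({'docid': ['s1', 's2', 's3', 's4', 's5'],
                           'masters_docid': ['m1', 'm1', 'm1', 'm2', 'm2']})

toy_duplicates = pd.merge(toy_slaves, toy_slaves, how='inner', on='masters_docid')

# Expected rows: sum of squared group sizes, here 3**2 + 2**2
expected = int((toy_slaves['masters_docid'].value_counts() ** 2).sum())

# Pairs of a record with itself are included, one per slave record
self_pairs = int((toy_duplicates['docid_x'] == toy_duplicates['docid_y']).sum())

print(len(toy_duplicates), expected, self_pairs)
```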


In [33]:
import random as rand

pd.options.display.max_rows = len(duplicates.columns)

duplicates.loc[rand.randrange(len(duplicates))]

docid_x                                                 202469417
035liste_x                                        [(RERO)0315404]
isbn_x                                                [0001-7051]
ttlpart_x                          {'245': ['Acta theriologica']}
pubyear_x                                                19552014
decade_x                                                     1955
century_x                                                    1955
exactDate_x                                              19552014
edition_x                                                        
part_x                                                           
pages_x                                                        []
volumes_x                                                        
pubinit_x                                                        
pubword_x                                                      []
scale_x                                                          
coordinate

Joining the raw data to pairs has duplicated the columns and renamed them with the suffixes $\_\texttt{x}$ and $\_\texttt{y}$. This must be reflected in the metadata.

In [34]:
# Target is the first necessary column
columns_metadata_dict['columns_to_use'] = ['duplicates']

# After join with itself, the data has _x and _y attributes
for i in columns_metadata_dict['data_analysis_columns']:
    for j in ['_x', '_y']:
        columns_metadata_dict['columns_to_use'].append(i+j)
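
As a small sanity check, the generated names can be compared with the columns of a joined DataFrame. The sketch below assumes a hypothetical subset of two attributes and toy records. It is not part of the project code.

```python
import pandas as pd

base_columns = ['isbn', 'volumes']  # hypothetical subset of attributes

# Hypothetical toy records that share one master
toy = pd.DataFrame({'masters_docid': ['m1', 'm1'],
                    'isbn': [['1'], ['2']],
                    'volumes': ['1', '2']})
pairs = pd.merge(toy, toy, on='masters_docid')
pairs['duplicates'] = 1

# One entry per attribute and side of the pair, after the target column
columns_to_use = ['duplicates'] + [
    column + suffix for column in base_columns for suffix in ('_x', '_y')
]

# Names that the joined DataFrame does not provide (expected to be empty)
missing = [c for c in columns_to_use if c not in pairs.columns]
print(columns_to_use, missing)
```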

## Build Pairs of Uniques

In [35]:
for i in range(len(goldstandard['uniques'])):
    if len(goldstandard['masters'][goldstandard['masters'].docid == goldstandard['uniques'].docid.loc[i]]) > 0 :
        print('STOP!')
        break

The silence of the code cell above shows that no record in file $\texttt{master.json}$ has the same $\texttt{docid}$ as any record in file $\texttt{unique.json}$.

### Masters Uniques Relationships

The investigations above have shown that relationships between records are expressed with the same value in attribute $\texttt{035liste}$. To find records of the uniques file that are duplicates of records of the masters file, the values of attributes $\texttt{035liste}$ have to be compared.

In [36]:
masters_uniques_duplicates = {'master' : [], 'unique' : [], 'slave' : {}}

for i in range(len(goldstandard['masters'])):
    for j in range(len(goldstandard['masters']['035liste'][i])):
        for k in range(len(goldstandard['uniques'])):
            if goldstandard['masters']['035liste'][i][j] in goldstandard['uniques']['035liste'][k]:
#                print('master', i, 'has relationship to unique', k)
                masters_uniques_duplicates['master'].append(i)
                masters_uniques_duplicates['unique'].append(k)
print()
for i in set(masters_uniques_duplicates['master']):
    masters_uniques_duplicates['slave'].update({i : []})
    for j in range(len(goldstandard['masters']['035liste'][i])):
        for k in range(len(goldstandard['slaves'])):
            if goldstandard['masters']['035liste'][i][j] in goldstandard['slaves']['035liste'][k]:
#                print('master', i, 'has relationship to slave', k)
                masters_uniques_duplicates['slave'][i].append(k)
                
masters_uniques_duplicates




{'master': [183], 'unique': [92], 'slave': {183: [530, 538]}}
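
The nested loops compare every key of every master with every record of the other file. A possible alternative, sketched here on hypothetical toy data, puts each $\texttt{035liste}$ key on its own row with `explode` and finds shared keys with a merge. The helper name `explode_keys` is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from the goldstandard
toy_masters = pd.DataFrame({'035liste': [['(A)1', '(B)2'], ['(C)3']]})
toy_uniques = pd.DataFrame({'035liste': [['(Z)7'], ['(B)2', '(D)4']]})

def explode_keys(df, name):
    """One row per 035 key, remembering the original row label in `name`."""
    # dropna removes the NaN that explode produces for empty lists
    out = df['035liste'].explode().dropna().rename('key').reset_index()
    return out.rename(columns={'index': name})

# Rows of both files that share at least one 035 key
shared = pd.merge(explode_keys(toy_masters, 'master'),
                  explode_keys(toy_uniques, 'unique'), on='key')
print(shared)
```

In the toy data, master 0 and unique 1 share the key `(B)2`, so the result holds exactly this one relationship.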

This result is surprising. The above code reveals that one record of the uniques file has a relationship to a master record. Even worse, but at least consistent, the affected master record also has relationships to two slave records. Are they really duplicates?

In [37]:
pd.options.display.max_rows = len(slave_master.columns)

for i in range(len(masters_uniques_duplicates['master'])):
    print('\npair------------------------------------->', 
          masters_uniques_duplicates['master'][i], masters_uniques_duplicates['unique'][i]
         )
    print(goldstandard['masters'].loc[masters_uniques_duplicates['master'][i]], '\n')
    print('... and its unique record ...')
    print(goldstandard['uniques'].loc[masters_uniques_duplicates['unique'][i]], '\n')
    print('... and its slave records ...')
    for j in masters_uniques_duplicates['slave'][masters_uniques_duplicates['master'][i]]:
        print(goldstandard['slaves'].loc[j], '\n')


pair-------------------------------------> 183 92
docid                                                         11294
035liste                       [(NEBIS)009752200, (RERO)R007426863]
isbn              [978-0-674-04884-3 (alk. paper), 0-674-04884-9...
ttlpart                                           {'245': ['Emma']}
pubyear                                                    2012    
decade                                                         2012
century                                                        2012
exactDate                                                  2012uuuu
edition                                                            
part                                                               
pages                                                      [560 p.]
volumes                                                         560
pubinit                   belknap press of harvard university press
pubword                 [Belknap Press of Harvard University Pres

The visual check confirms that the affected records in uniques and slaves are duplicates. Therefore, these records of uniques must be excluded when pairing slave records with records of uniques.

### Uniques Rows Generation

To build pairs of uniques of the goldstandard, all records of slaves are joined with all records of uniques, except for the records that will produce duplicates shown in the previous subsection.

In [38]:
df_s_1 = goldstandard['slaves']
df_u_1 = goldstandard['uniques'].copy() # Do not modify original data.

# Exclude duplication candidates from above
df_u_1.drop(index=masters_uniques_duplicates['unique'], inplace=True)
#display(df_u_1.loc[masters_uniques_duplicates['unique']])

df_s_1['duplicates'] = 0
df_u_1['duplicates'] = 0
# Full join, Cartesian product
non_duplicates = pd.merge(df_s_1, df_u_1, on='duplicates')

print('Number of slave records : {:,d}, number of unique records : {:,d}, expected pairs : {:,d} = joined pairs : {:,d}'.format(
    len(df_s_1), len(df_u_1), len(df_s_1)*len(df_u_1), len(non_duplicates)))

Number of slave records : 615, number of unique records : 187, expected pairs : 115,005 = joined pairs : 115,005
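
The join on the constant column $\texttt{duplicates}$ produces the Cartesian product of both DataFrames. As a hedged alternative for pandas 1.2 or newer, `pd.merge` offers `how='cross'` for the same purpose. The sketch uses hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the goldstandard
toy_slaves = pd.DataFrame({'docid': ['s1', 's2', 's3']})
toy_uniques = pd.DataFrame({'docid': ['u1', 'u2']})

# how='cross' requires pandas 1.2 or newer and needs no helper key column
toy_non_duplicates = pd.merge(toy_slaves, toy_uniques, how='cross')
toy_non_duplicates['duplicates'] = 0

# The number of pairs is the product of both record counts
print(len(toy_non_duplicates), len(toy_slaves) * len(toy_uniques))
```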


In [39]:
pd.options.display.max_rows = len(non_duplicates.columns)

non_duplicates.loc[rand.randrange(len(non_duplicates))]

docid_x                                                     522113354
035liste_x          [(VAUDS)991002448719702853, (RNV)007441995-41b...
isbn_x                                                             []
ttlpart_x                                            {'245': ['One']}
pubyear_x                                                    2011    
decade_x                                                         2011
century_x                                                        2011
exactDate_x                                                  2011uuuu
edition_x                                                            
part_x                                                               
pages_x                                            [1 disque compact]
volumes_x                                                           1
pubinit_x                                                            
pubword_x                                                          []
scale_x             

The number of pairs of uniques shown above is large enough. No further pairs of uniques will be needed for training and testing purposes. For this reason, joining the records of uniques with themselves is omitted and the pairs generated so far are taken as they are.

## Build Feature Base

The rows with pairs of duplicates and the rows with pairs of uniques from above can be concatenated into one DataFrame, which is the basis for further processing into the feature matrix.

In [40]:
df_feature_base = pd.concat([duplicates, non_duplicates], sort=True)
# Set unique values on index
df_feature_base.reset_index(inplace=True, drop=True)

print('Number of rows with pairs of uniques :\t\t{:,d}'.format(len(df_feature_base[df_feature_base.duplicates==0])))
print('Number of rows with pairs of duplicates :\t{:,d}'.format(len(df_feature_base[df_feature_base.duplicates==1])))

Number of rows with pairs of uniques :		115,005
Number of rows with pairs of duplicates :	2,783


## Summary

The result of this chapter is a full understanding of Swissbib's goldstandard data and how to produce pairs of rows which represent duplicates and pairs of rows which represent uniques. After generating the pairs of duplicates and the pairs of uniques, the data set that will be used for training the machine learning models shows the following ratio.

In [41]:
print('ratio of uniques :\t{:.2f} %\nratio of duplicates :\t{:.2f} %'.format(
    df_feature_base.duplicates.value_counts(normalize=True).loc[0]*100,
    df_feature_base.duplicates.value_counts(normalize=True).loc[1]*100))

ratio of uniques :	97.64 %
ratio of duplicates :	2.36 %
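
The ratio follows directly from the two row counts printed earlier in this chapter. The short calculation below reproduces it in plain Python and also expresses the imbalance as the number of pairs of uniques per pair of duplicates.

```python
n_uniques = 115_005    # rows with pairs of uniques, as printed above
n_duplicates = 2_783   # rows with pairs of duplicates, as printed above

total = n_uniques + n_duplicates
ratio_uniques = n_uniques / total * 100
ratio_duplicates = n_duplicates / total * 100

# Pairs of uniques per pair of duplicates
imbalance = n_uniques / n_duplicates

print('{:.2f} % uniques, {:.2f} % duplicates, imbalance {:.1f} : 1'.format(
    ratio_uniques, ratio_duplicates, imbalance))
```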


Besides the data with rows of uniques and duplicates, the metadata of the project have been updated. After this chapter, the metadata look like the result of the following lines of code.

In [42]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 

similarity_metrics 
 {'coordinate_E': LCSStr({'qval': 1, 'external': True}), 'coordinate_N': L

The data handover for the next chapter still needs to be done.

### Goldstandard DataFrame Handover

For further processing in the next chapters, the metadata dictionary and the resulting feature base DataFrame of this chapter need to be persisted. Both will be restored and used in upcoming chapters. The metadata extended in this chapter is saved into a pickle file.

In [43]:
import bz2

# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'columns_metadata.pkl'), 'wb') as dict_output_file:
    pk.dump(columns_metadata_dict, dict_output_file)

# Store into compressed intermediary file
with bz2.BZ2File(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                 'w') as df_output_file:
    pk.dump(df_feature_base, df_output_file)

The stored feature base DataFrame may become a big file. A file bigger than 100 MB cannot be checked in to GitHub. To be able to check in the pickle file of the feature base data, it is compressed with the Python library $\texttt{bz2}$.
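
Restoring the data in a later chapter mirrors the code above. The following round trip is a minimal sketch on a hypothetical toy DataFrame that writes into a temporary directory. The path handling of the project is not reproduced here.

```python
import bz2
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the feature base DataFrame
toy_feature_base = pd.DataFrame({'duplicates': [1, 0], 'docid_x': ['s1', 's2']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'feature_base_df.pkl')

    # Store into compressed file
    with bz2.BZ2File(path, 'w') as df_output_file:
        pickle.dump(toy_feature_base, df_output_file)

    # Restore from compressed file
    with bz2.BZ2File(path, 'r') as df_input_file:
        restored = pickle.load(df_input_file)

print(restored.equals(toy_feature_base))
```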