# Goldstandard and Data Preparation

Swissbib's goldstandard is a set of pre-defined records used to test Swissbib's logic for identifying duplicate and unique records. The Swissbib project team has processed this data into a data extract with the help of a Scala implementation [[ScalaRepo](./A_References.ipynb#scala_repo)].

This chapter of the capstone project loads the records of Swissbib's goldstandard and prepares the data for the generation of the feature matrix used by the machine learning models.

## Table of Contents

- [Description of Swissbib's Goldstandard](#description_swissbib_goldstandard)
    - [Sample Records of a Goldstandard Example](#sample_records_goldstandard_example)
    - [Generation of Pairs of Duplicates](#generation_pairs_duplicates)
    - [Generation of Pairs of Uniques](#generation_pairs_uniques)
- [Implementation of Pairs Generation](#implemenation_pairs_generation)
    - [Pairs of Duplicates Implementation](#pairs_duplicates_implementation)
    - [Pairs of Uniques Implementation](#pairs_uniques_implementation)

## Description of Swissbib's Goldstandard<a id='description_swissbib_goldstandard'/>

This section starts with loading Swissbib's goldstandard data. Afterwards, the data is explained and analysed.

In [1]:
import os
import json

records_slave, records_master, records_unique = [], [], []
path_goldstandard = './daten_goldstandard'
file_slave, file_master, file_unique = 'slave.json', 'master.json', 'unique.json'

# Each file holds one JSON record per line; the context manager closes the file after reading
with open(os.path.join(path_goldstandard, file_slave), 'r') as f:
    for line in f:
        records_slave.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_master), 'r') as f:
    for line in f:
        records_master.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_unique), 'r') as f:
    for line in f:
        records_unique.append(json.loads(line))

print('Number of records in data file {:s}\t{:d}'.format(file_slave, len(records_slave)))
print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

Number of records in data file slave.json	435
Number of records in data file master.json	159
Number of records in data file unique.json	596


In [2]:
import pandas as pd

goldstandard = {}

goldstandard['slaves'] = pd.DataFrame(records_slave)
goldstandard['masters'] = pd.DataFrame(records_master)
goldstandard['uniques'] = pd.DataFrame(records_unique)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard['slaves'].columns)

goldstandard['slaves'].head()

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)731635279, (ABN)000539983]",2009,[],{},2009,000311049,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam jun.],[Reclam jun.],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",[600 S.]
1,"[(OCoLC)808324878, (ABN)000155059]",2000,[],"{'710': ['Metropolitan Opera Orchestra', 'Metr...",2000,00130724X,[],,2000,[VM010300],[],[],,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",[],"{'100': ['LevineJamesDir.'], '700': ['MozartWo...",[Deutsche Grammophon],[Deutsche Grammophon],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","[1 DVD-Video, DVD Region 0, 169 Min., farb.]"
2,"[(OCoLC)231772550, (ABN)000096920]",1999,[],{},1999,001817272,[],,1999,[BK020000],[3-495-47879-5],[],,[316 S.],[],"{'100': ['FluryAndreas'], '245c': ['Andreas Fl...",[Alber],[Alber],1999,,"{'245': ['Der moralische Status der Tiere', 'H...","{'245': ['Der moralische Status der Tiere', 'H...",[316 S.]
3,"[(OCoLC)887157168, (ABN)000223912]",uuuu,[],{},uuuu,00236865X,[],,uuuuuuuu,[BK020000],[],[],,[412 S.],[],"{'100': ['MozartWolfgang Amadeus'], '245c': ['']}",[Ernst Eulenburg],[Ernst Eulenburg],uuuuuuuu,,{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},[412 S.]
4,"[(OCoLC)887324690, (ABN)000548154]",2008,[],{},2008,00351031X,[],,2008,[BK020000],[978-1-4058-8214-9],[],,[64 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Pearson Education],[Pearson Education],2008,,{'245': ['Emma']},{'245': ['Emma']},[64 S.]


The goldstandard is a fixed set of pre-defined data that has been processed into three distinct .json files [[ScalaRepo](./A_References.ipynb#scala_repo)].

- $\texttt{slave.json}$ - This data file holds the original duplicate records that are merged into a master record.
- $\texttt{master.json}$ - This data file holds the unique master records, each formed out of its associated records in file $\texttt{slave.json}$.
- $\texttt{unique.json}$ - This data file holds records with data similar to that in $\texttt{slave.json}$ and $\texttt{master.json}$, but which are correctly identified as unique records and not as duplicates of any record in $\texttt{slave.json}$.

The three files of the goldstandard will be used for training and performance testing of the machine learning models of the capstone project. Combining the records of file $\texttt{slave.json}$ into pairs will generate a set of duplicate pairs for the training data, while combining the records of file $\texttt{slave.json}$ with the records of file $\texttt{unique.json}$ will generate a set of unique pairs for the training data.

### Sample Records of a Goldstandard Example<a id='sample_records_goldstandard_example'/>

Let's have a rough look at an example in the goldstandard data. As the example, all records of all three .json files with the string 'Emma' in attribute $\texttt{ttlfull}$ are chosen.

In [3]:
goldstandard['slaves'][
    goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)731635279, (ABN)000539983]",2009,[],{},2009,000311049,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam jun.],[Reclam jun.],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",[600 S.]
4,"[(OCoLC)887324690, (ABN)000548154]",2008,[],{},2008,00351031X,[],,2008,[BK020000],[978-1-4058-8214-9],[],,[64 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Pearson Education],[Pearson Education],2008,,{'245': ['Emma']},{'245': ['Emma']},[64 S.]
17,"[(OCoLC)636062037, (BGR)000409509]",2008,[],{},2008,017959411,[],,2008,[BK020000],"[978-1-4058-7953-8 (CD Pack), 978-1-4058-8214-...",[],,[64 S.],[],"{'100': ['AustenJane'], '700': ['BarnesAnnette...",[Pearson Education Ltd.],[Pearson Education Ltd.],2008,,{'245': ['Emma']},{'245': ['Emma']},[64 S.]
19,"[(OCoLC)218626148, (BGR)000463276]",2005,[],{},2005,020155182,[],,2005,[BK020000],"[978-0-521-82437-8, 0-521-82437-0]",[],,[600 S.],[],"{'100': ['AustenJane'], '700': ['CroninRichard...",[Cambridge University Press],[Cambridge University Press],2005,,{'245': ['Emma']},{'245': ['Emma']},[600 S.]
29,"[(OCoLC)218626148, (IDSLU)000449481]",2005,[],{},2005,022315098,[],,2005,[BK020000],[0-521-82437-0],[],,[599 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],2005,,{'245': ['Emma']},{'245': ['Emma']},[599 S.]
51,"[(OCoLC)218626148, (IDSSG)000338145]",2005,[],{},2005,035554215,[],,2005,[BK020000],"[978-0-521-82437-8, 0-521-82437-0]",[],,[600 S.],[],"{'100': ['AustenJane'], '700': ['CroninRichard...",[],[],2005,,{'245': ['Emma']},{'245': ['Emma']},[600 S.]
77,"[(OCoLC)495204467, (SGBN)001068279, (OCoLC)495...",2008,[],{},2008,055836801,[],,2008,[BK020000],"[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",[],,[64 S.],[],"{'100': ['AustenJane'], '700': ['BarnesAnnette...",[Pearson Education],[Pearson Education],2008,,{'245': ['Emma']},{'245': ['Emma']},[64 S.]
109,"[(OCoLC)263554860, (IDSBB)001470548]",1979,[],{},1979,103342699,[],,1979,[BK020000],[],[],,[549 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],1979,,{'245': ['Emma']},{'245': ['Emma']},[549 S.]
130,"[(OCoLC)218626148, (IDSBB)003781869]",2005,[],{},2005,117574562,[],,2005,[BK020000],[0-521-82437-0],[],,[600 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],2005,,{'245': ['Emma']},{'245': ['Emma']},[600 S.]
160,"[(OCoLC)218626148, (NEBIS)004930649]",2005,[],{},2005,161169244,[],,2005,[BK020000],"[978-0-521-82437-8, 0-521-82437-0]",[],,[600 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Cambridge University Press],[Cambridge University Press],2005,,{'245': ['Emma']},{'245': ['Emma']},[600 S.]


In [4]:
print('Number of records in slave.json with string \'Emma\' :', len(
    goldstandard['slaves'][
        goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in slave.json with string 'Emma' : 18


In total, 18 records in $\texttt{slave.json}$ contain the string 'Emma' in attribute $\texttt{ttlfull}$. Some of these records are duplicates of each other. File $\texttt{master.json}$ reveals how many unique records remain out of the 18 records.

In [5]:
goldstandard['masters'][
    goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
85,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",2009,[],{},2009,504389793,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},[600 S.]
86,"[(NEBIS)008647887, (VAUD)991001434509702852, (...",2001,[],{},2001,504389807,[],,2001,[BK020000],[0-375-75742-2],[],,[359 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Modern Library],[Modern Library],2001,,{'245': ['Emma']},{'245': ['Emma']},[359 S.]
87,"[(IDSLU)000449481, (IDSBB)003781869, (NEBIS)00...",2005,[],{},2005,504389815,[],,2005,[BK020000],[0-521-82437-0],[],,[599 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],2005,,{'245': ['Emma']},{'245': ['Emma']},[599 S.]
88,"[(SGBN)001068279, (ABN)000548154, (BGR)0004095...",2008,[],{},2008,504389823,[],,2008,[BK020000],"[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",[],,[64 S.],[],"{'100': ['AustenJane'], '700': ['BarnesAnnette...",[Pearson Education],[Pearson Education],2008,,{'245': ['Emma']},{'245': ['Emma']},[64 S.]
89,"[(IDSBB)001470548, (SGBN)001344510]",1979,[],{},1979,504389831,[],,1979,[BK020000],[],[],,[549 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],1979,,{'245': ['Emma']},{'245': ['Emma']},[549 S.]


In [6]:
print('Number of records in master.json with string \'Emma\' :', len(
    goldstandard['masters'][
        goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in master.json with string 'Emma' : 5


The number of deduplicated records of file $\texttt{slave.json}$ containing string 'Emma' is 5. This is the number of unique records that will be built in Swissbib's deduplication step out of the original 18 records above.

There may be some minor differences in some fields of the deduplicated master file records compared to their original data in the duplicated original slave file records. Nevertheless, it is possible to identify the slave file records as duplicates of their associated master file records, looking at the details of the values of the attributes.

In a final step, let's have a look at the 'Emma' string data of $\texttt{unique.json}$.

In [7]:
goldstandard['uniques'][
    goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)362722306, (ABN)000551177]",2009,[],{},2009,000143235,[],,2009,[BK020000],[978-3-7466-6120-9],[],,[575 S.],[6120],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Aufbau Taschenbuch],[Aufbau Taschenbuch],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",[575 S.]
4,"[(OCoLC)777853583, (ABN)000243260]",1990,[],{},1990,002410559,[],,1990,[BK020000],[0-19-282756-1],[],,[445 p.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Oxford University Press],[Oxford University Press],1990,,{'245': ['Emma']},{'245': ['Emma']},[445 p.]
8,"[(OCoLC)887396789, (ABN)000628911]",2007,[],{},2007,004130235,[],,2007,[BK020000],[978-0-307-38684-7],[],,[495 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Vintage Books],[Vintage Books],2007,,{'245': ['Emma']},{'245': ['Emma']},[495 S.]
13,"[(OCoLC)759250730, (BGR)000090407]",1996,[],{},1996,017204097,[],[87.-88. Tsd.],1996,[BK020000],[3-596-22191-9],[],,[414 S.],[2191],"{'100': ['AustenJane'], '245c': ['Jane Austen ...",[Fischer-Taschenbuch-Verlag],[Fischer-Taschenbuch-Verlag],1996,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",[414 S.]
16,"[(OCoLC)76214484, (BGR)000281870]",2001,[],{},2001,017738490,[],3. Aufl,2001,[BK020000],[3-7466-5105-0],[],,[553 S.],[5105],"{'100': ['AustenJane'], '245c': ['Jane Austen ...",[Aufbau Taschenbuch-Verlag],[Aufbau Taschenbuch-Verlag],2001,,{'245': ['Emma']},{'245': ['Emma']},[553 S.]
20,"[(OCoLC)759471685, (BGR)000212474]",2000,[],{},2000,01843486X,[],"[new ed., 2nd impression]",2000,[BK020000],[0-582-41794-5],[],,[59 S.],[],"{'100': ['AustenJane'], '700': ['BarnesAnnette...",[Pearson Education Ltd],[Pearson Education Ltd],2000,,{'245': ['Emma']},{'245': ['Emma']},[59 S.]
26,"[(OCoLC)610677900, (IDSLU)000093923]",1996,[],{},1996,021868816,[],,1996,[BK020000],[],[],,[560 S.],[[1]],{'100': ['AustenJane1775-1817(DE-588)118505173...,[],[],1996,,{'245': ['Emma']},{'245': ['Emma']},[560 S.]
42,"[(OCoLC)614721683, (SBT)000133903]",1983,[],{},1983,038872498,[],,1983,[BK020000],[0-14-043010-5],[],,[471 p.],[],"{'100': ['AustenJane'], '700': ['BlytheRonald'...",[Penguin Books],[Penguin Books],1983,,{'245': ['Emma']},{'245': ['Emma']},[471 p.]
53,"[(OCoLC)882111340, (SBT)000463076]",2002,[],{'710': ['Caritas (Ticino)']},2002,043085075,[],,2002,[VM010200],[],[],,[1 Videocassetta],[],"{'100': [], '245c': ['Caritas Ticino']}",[Caritas Ticino],[Caritas Ticino],2002,,"{'245': ['Emma', 'mobbing']}","{'245': ['Emma', 'mobbing']}",[1 Videocassetta]
62,"[(OCoLC)759052846, (SGBN)001050098]",1953,[],{},1953,051168871,[],,1953,[BK020000],[],[],,[377 S.],[],"{'100': ['AustenJane'], '245c': ['Jane Austen']}",[Collins],[Collins],1953,,{'245': ['Emma']},{'245': ['Emma']},[377 S.]


In [8]:
print('Number of records in unique.json with string \'Emma\' :', len(
    goldstandard['uniques'][
        goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in unique.json with string 'Emma' : 44


File $\texttt{unique.json}$ holds 44 records with string 'Emma'. The data of these records reveals that none of them is a duplicate of a record in $\texttt{master.json}$. These unique records are excellent training data for deduplication, as they provide similar but non-duplicate, thus unique, record pairs.

The goal of generating training and test data with the help of Swissbib's goldstandard is to generate two categories of training data.

- The first category consists of pairs of original records out of file $\texttt{slave.json}$ that carry the label 'duplicates'. This will be accomplished with the help of the records of file $\texttt{master.json}$, see subsection [Generation of Pairs of Duplicates](#generation_pairs_duplicates).
- The second category consists of pairs of original records out of files $\texttt{slave.json}$ and $\texttt{unique.json}$ that carry the label 'uniques', see subsection [Generation of Pairs of Uniques](#generation_pairs_uniques).

The number of possible duplicate pairs is given by the formula

$
\begin{align}
Tot_{duplicate\ pairs} = \sum_{i=1}^{M} \frac{1}{2}S_i \cdot \left(S_i-1\right)
\end{align}
$

where $M$ is the number of records in file $\texttt{master.json}$ and $S_i$ is the number of records in file $\texttt{slave.json}$ that are associated with master record $M_i$. As the slave records must be counted for each master record, the number of expected duplicate pairs cannot be calculated before the slave records have been assigned to their master records.
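The formula can be evaluated once each slave record carries the identifier of its master record. The following is a minimal sketch that uses made-up master identifiers (`m1`, `m2`, `m3`) instead of goldstandard data; the variable names are assumptions for illustration only.

```python
from collections import Counter
from math import comb

# Hypothetical foreign keys: one master identifier per slave record
master_of_slave = ['m1', 'm1', 'm1', 'm2', 'm2', 'm3']

# S_i: number of slave records associated with master record M_i
slaves_per_master = Counter(master_of_slave)

# Sum of S_i * (S_i - 1) / 2 over all master records
total_duplicate_pairs = sum(s * (s - 1) // 2 for s in slaves_per_master.values())

# Each term equals the binomial coefficient "S_i choose 2"
assert total_duplicate_pairs == sum(comb(s, 2) for s in slaves_per_master.values())

print(total_duplicate_pairs)  # 3 + 1 + 0 = 4
```

A master record with a single slave record contributes no pair, which is why `m3` adds zero to the total.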

### Generation of Pairs of Duplicates<a id='generation_pairs_duplicates'/>

The central attribute for retrieving duplicates in Swissbib's goldstandard data is attribute $\texttt{035liste}$, see section [Attribute Analysis](./1_DataAnalysis.ipynb#attribute_analysis) in chapter [Data Analysis](./1_DataAnalysis.ipynb). It identifies the associated master record for a given record in file $\texttt{slave.json}$. The process implemented below parses the list of identifiers in attribute $\texttt{035liste}$ of each record in file $\texttt{slave.json}$ and searches for each identifier in attribute $\texttt{035liste}$ of all records of file $\texttt{master.json}$. When the identifier is found in file $\texttt{master.json}$, the master record's attribute $\texttt{docid}$ is stored in a new column of the slave DataFrame, see figure [Slave/master relationship](#slave_master_relationship). The new attribute in the slave record acts as a foreign key to the related master record. This process is repeated for each list element of a slave record and for each record in file $\texttt{slave.json}$.

<center>
    <b>Figure</b><a id='slave_master_relationship'></a> Slave/master relationship.
    <img src="./documentation/training_data.png" style="width: 600px;"/>
</center>

As will be shown below, the relationship of a slave record to its master record is unique. Even if attribute $\texttt{035liste}$ of a slave record holds more than one entry, all distinct entries of that list point to one and the same master record.

Attribute $\texttt{035liste}$ will be used neither in training nor in performance testing of the models. The column will be removed before model training.

### Generation of Pairs of Uniques<a id='generation_pairs_uniques'/>

File $\texttt{unique.json}$ holds data of unique records. For training and testing purposes, pairs of records could be generated exclusively from the records of this file. This would generate record pairs with label 'uniques' that are clearly distinguishable and therefore easy to recognize as non-duplicate pairs. A more interesting pairing can be generated by combining records out of $\texttt{slave.json}$ with records out of $\texttt{unique.json}$. Pairs drawn from both sources may show similarities in their attributes that make them harder to classify clearly as 'uniques'.
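The two pairing sources can be illustrated with a small sketch. The record identifiers and the numeric label `0` for 'uniques' are assumptions made for this example; they are not taken from the goldstandard.

```python
from itertools import combinations, product

# Hypothetical docids standing in for records of slave.json and unique.json
slave_ids = ['s1', 's2', 's3']
unique_ids = ['u1', 'u2']

# Pairs built exclusively from unique.json: N * (N - 1) / 2 combinations
pairs_unique_unique = list(combinations(unique_ids, 2))

# Mixed pairs: every slave record combined with every unique record
pairs_slave_unique = list(product(slave_ids, unique_ids))

# Both sets receive the label 'uniques', encoded here as 0 (assumed encoding)
labelled_pairs = [(a, b, 0) for a, b in pairs_unique_unique + pairs_slave_unique]

print(len(pairs_unique_unique), len(pairs_slave_unique))  # 1 6
```

Pairs among slave records are deliberately left out of this sketch, because some of them are duplicates and must not receive the label 'uniques'.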

A mixture of both pairing sources will be implemented for the generation of the training and testing data. As the number of possible pairs among $N$ records is $\frac{1}{2}N(N-1)$, the total number of possible pairs can be estimated from the number of records in $\texttt{unique.json}$ plus the number of records in $\texttt{slave.json}$. This is an upper bound for the number of 'uniques' pairs, since it also counts pairs among slave records, some of which are duplicates.

In [9]:
# Total number of records in unique.json and slave.json
sum_unique_slave = len(goldstandard['uniques']) + len(goldstandard['slaves'])

print('Total number of possible pairs for training and testing : {:,d}'.format(
    int(sum_unique_slave*(sum_unique_slave-1)/2)
))

Total number of possible pairs for training and testing : 530,965


## Implementation of Pairs Generation<a id='implemenation_pairs_generation'/>

### Pairs of Duplicates Implementation<a id='pairs_duplicates_implementation'/>

### Pairs of Uniques Implementation<a id='pairs_uniques_implementation'/>

## Build DataFrames for Transformation into Feature Matrix

In [10]:
columns_to_use = ['century_x', 'volumes_x', 'century_y', 'volumes_y', 'duplicates']

In [11]:
import data_preparation_funcs as dpf

goldstandard['slaves'] = dpf.transform_list_to_string(goldstandard['slaves'], 'volumes')
goldstandard['masters'] = dpf.transform_list_to_string(goldstandard['masters'], 'volumes')
goldstandard['uniques'] = dpf.transform_list_to_string(goldstandard['uniques'], 'volumes')
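The module `data_preparation_funcs` is not shown in this chapter. Judging from the outputs further below, where `volumes` changes from `[600 S.]` to `600 S.`, the transformation turns a list into a plain string. The following is only a sketch of such a per-cell conversion; the function name `list_to_string` and the separator `', '` are assumptions and not the actual implementation.

```python
def list_to_string(value, separator=', '):
    """Join the elements of a list into one string; leave other values untouched."""
    if isinstance(value, list):
        return separator.join(str(element) for element in value)
    return value


print(list_to_string(['600 S.']))        # 600 S.
print(repr(list_to_string([])))          # ''
print(list_to_string('already a string'))  # already a string
```

On a DataFrame column, a function of this kind could be applied with `df['volumes'].apply(list_to_string)`.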

## Determine Target Vector

In [12]:
goldstandard['masters']['035liste'][0]

['(IDSBB)002447452', '(ALEX)9912923344101791']

In [13]:
goldstandard['slaves'].docid.head()

0    000311049
1    00130724X
2    001817272
3    00236865X
4    00351031X
Name: docid, dtype: object

In [14]:
goldstandard['uniques'].docid.head()

0    000143235
1    00044801X
2    000996009
3    00239538X
4    002410559
Name: docid, dtype: object

In [15]:
goldstandard['masters'].docid.loc[0]

'264853032'

In [16]:
goldstandard['masters'][goldstandard['masters'].docid == goldstandard['uniques'].docid.loc[0]]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes


In [17]:
goldstandard['masters'].loc[3]

035liste      [(NATIONALLICENCE)oxford-10.1093/cid/ciu795, (...
century                                                    2015
coordinate                                                   []
corporate                                                    {}
decade                                                     2015
docid                                                 264853784
doi                                        [10.1093/cid/ciu795]
edition                                                        
exactDate                                              20150201
format                                               [BK010053]
isbn                                                         []
ismn                                       [10.1093/cid/ciu795]
musicid                                                        
pages                                                        []
part                                [60/3(2015-02-01), 432-437]
person        {'100': [], '700': ['Rozot

In [18]:
goldstandard['slaves'].loc[347]

035liste           [(NATIONALLICENCE)oxford-10.1093/cid/ciu795]
century                                                    2015
coordinate                                                   []
corporate                                                    {}
decade                                                     2015
docid                                                 395019044
doi                                        [10.1093/cid/ciu795]
edition                                                        
exactDate                                              20150201
format                                               [BK010053]
isbn                                                         []
ismn                                       [10.1093/cid/ciu795]
musicid                                                        
pages                                                        []
part                                [60/3(2015-02-01), 432-437]
person        {'100': [], '700': ['Rozot

In [19]:
goldstandard['uniques']['035liste'].head()

0    [(OCoLC)362722306, (ABN)000551177]
1    [(OCoLC)886929897, (ABN)000223034]
2    [(OCoLC)778386601, (ABN)000433604]
3    [(OCoLC)778561839, (ABN)000238844]
4    [(OCoLC)777853583, (ABN)000243260]
Name: 035liste, dtype: object

In [20]:
goldstandard['slaves']['035liste'].head()

0    [(OCoLC)731635279, (ABN)000539983]
1    [(OCoLC)808324878, (ABN)000155059]
2    [(OCoLC)231772550, (ABN)000096920]
3    [(OCoLC)887157168, (ABN)000223912]
4    [(OCoLC)887324690, (ABN)000548154]
Name: 035liste, dtype: object

In [21]:
goldstandard['masters']['035liste'].head()

0           [(IDSBB)002447452, (ALEX)9912923344101791]
1                 [(IDSBB)000369647, (NEBIS)005609528]
2    [(IDSLU)001293605, (IDSBB)006725532, (NEBIS)01...
3    [(NATIONALLICENCE)oxford-10.1093/cid/ciu795, (...
4    [(NATIONALLICENCE)oxford-10.1093/ndt/gft319, (...
Name: 035liste, dtype: object

In [22]:
goldstandard['slaves'].loc[[0, 187, 276, 424, 428, 429]]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)731635279, (ABN)000539983]",2009,[],{},2009,000311049,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam jun.],[Reclam jun.],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600 S.
187,"[(OCoLC)731635279, (NEBIS)009587153]",2009,[],{},2009,196506476,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},600 S.
276,"[(OCoLC)731635279, (LIBIB)000315536]",2009,[],{},2009,323173349,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],"{'100': ['AustenJane'], '245c': ['Jane Austen']}",[Reclam],[Reclam],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600 S.
424,"[(OCoLC)1002177443, (IDSBB)006726594]",2017,[],{},2017,491629737,[10.1055/b-005-143650],"7., überarbeitete und erweiterte Auflage",2017,[BK020053],[978-3-13-240808-1],[10.1055/b-005-143650],,[1 Online-Ressource],[],{'100': ['TrappeHans-Joachim1954-(DE-588)12494...,[],[],2017,,{'245': ['EKG-Kurs für Isabel']},{'245': ['EKG-Kurs für Isabel']},1 Online-Ressource
428,"[(OCoLC)1002177443, (NEBIS)011045420, (OCoLC)1...",2017,[],{},2017,495381160,[10.1055/b-005-143650],"7., überarbeitete und erweiterte Auflage",2017,[BK020053],[978-3-13-240808-1],[10.1055/b-005-143650],,[1 Online-Ressource],[],{'100': ['TrappeHans-Joachim1954-(DE-588)12494...,[],[],2017,,{'245': ['EKG-Kurs für Isabel']},{'245': ['EKG-Kurs für Isabel']},1 Online-Ressource
429,"[(VAUD)991010321879702852, (RNV)000202321-41bc...",1970,[],{},1970,501860959,[],,19702006,[MU010100],[],[],BA 4553,[1 partition (379 p.)],"[Werkgruppe 5, Bd. 19]","{'100': ['MozartWolfgang Amadeus'], '700': ['G...",[Bärenreiter],[Bärenreiter],19702006,,"{'245': ['Neue Ausgabe sämtlicher Werke', 'Die...","{'245': ['Neue Ausgabe sämtlicher Werke', 'Die...",1 partition (379 p.)


In [23]:
goldstandard['masters']['035liste'].loc[85]

['(NEBIS)009587153', '(LIBIB)000315536', '(ABN)000539983']

In [24]:
goldstandard['masters'].loc[85]

035liste      [(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...
century                                                    2009
coordinate                                                   []
corporate                                                    {}
decade                                                     2009
docid                                                 504389793
doi                                                          []
edition                                                        
exactDate                                              2009    
format                                               [BK020000]
isbn                                        [978-3-15-020008-7]
ismn                                                         []
musicid                                                        
pages                                                  [600 S.]
part                                                    [20008]
person        {'100': ['AustenJane1775-1

In [25]:
def add_master_docid_to_slave (df_s, df_m):
    """Determine docid of master and store on slave."""
    # Initialize Foreign Key list
    df_s['masters_docid'] = [list() for x in range(len(df_s.index))]

    # Search for master of slave
    for i in range(len(df_s)):
        loc_li = list()
        for j in range(len(df_s['035liste'].loc[i])):
            master_index = df_m[df_m['035liste'].str.contains(
                df_s['035liste'].loc[i][j], regex=False
            )].index
            if len(master_index) > 0 : # Skip empty Series
                loc_li.append(df_m.docid[master_index].values[0])

        # Write the list into a single cell without chained assignment
        df_s.at[i, 'masters_docid'] = loc_li
    
    return df_s
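The function above scans all master records once for every identifier of every slave record. As an alternative, the same relationship can be sketched with a dictionary that maps each identifier to its master's `docid`. The sketch works on plain record dictionaries, as returned by `json.loads`, and the helper names `build_identifier_lookup` and `master_docids_of` are assumptions for illustration. The toy records reuse identifiers shown in this chapter.

```python
def build_identifier_lookup(master_records):
    """Map each identifier in a master's 035liste to that master's docid."""
    lookup = {}
    for master in master_records:
        for identifier in master['035liste']:
            # Keep the first master seen for an identifier
            lookup.setdefault(identifier, master['docid'])
    return lookup


def master_docids_of(slave, lookup):
    """Return the set of master docids reachable from a slave's identifiers."""
    return {lookup[i] for i in slave['035liste'] if i in lookup}


masters = [{'docid': '504389793',
            '035liste': ['(NEBIS)009587153', '(LIBIB)000315536', '(ABN)000539983']}]
slaves = [{'docid': '000311049', '035liste': ['(OCoLC)731635279', '(ABN)000539983']},
          {'docid': '196506476', '035liste': ['(OCoLC)731635279', '(NEBIS)009587153']}]

lookup = build_identifier_lookup(masters)
for slave in slaves:
    print(slave['docid'], master_docids_of(slave, lookup))
```

Identifiers that occur in no master record, such as the `(OCoLC)` entry here, are simply skipped, which mirrors the empty-result check in the function above.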

In [26]:
goldstandard['slaves'] = add_master_docid_to_slave(goldstandard['slaves'], goldstandard['masters'])

goldstandard['slaves'].masters_docid.head()

0    [504389793]
1    [504390597]
2    [50439018X]
3    [504389513]
4    [504389823]
Name: masters_docid, dtype: object

In [27]:
# Verify that every slave record points to exactly one master record
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : set(x))
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : list(x))

for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['slaves'].masters_docid.loc[i]) != 1 :
        print('HALT!')
        break

goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : x[0])
goldstandard['slaves'].head()

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,...,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes,masters_docid
0,"[(OCoLC)731635279, (ABN)000539983]",2009,[],{},2009,000311049,[],,2009,[BK020000],[978-3-15-020008-7],...,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam jun.],[Reclam jun.],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600 S.,504389793
1,"[(OCoLC)808324878, (ABN)000155059]",2000,[],"{'710': ['Metropolitan Opera Orchestra', 'Metr...",2000,00130724X,[],,2000,[VM010300],[],...,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",[],"{'100': ['LevineJamesDir.'], '700': ['MozartWo...",[Deutsche Grammophon],[Deutsche Grammophon],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","1 DVD-Video, DVD Region 0, 169 Min., farb.",504390597
2,"[(OCoLC)231772550, (ABN)000096920]",1999,[],{},1999,001817272,[],,1999,[BK020000],[3-495-47879-5],...,[316 S.],[],"{'100': ['FluryAndreas'], '245c': ['Andreas Fl...",[Alber],[Alber],1999,,"{'245': ['Der moralische Status der Tiere', 'H...","{'245': ['Der moralische Status der Tiere', 'H...",316 S.,50439018X
3,"[(OCoLC)887157168, (ABN)000223912]",uuuu,[],{},uuuu,00236865X,[],,uuuuuuuu,[BK020000],[],...,[412 S.],[],"{'100': ['MozartWolfgang Amadeus'], '245c': ['']}",[Ernst Eulenburg],[Ernst Eulenburg],uuuuuuuu,,{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},412 S.,504389513
4,"[(OCoLC)887324690, (ABN)000548154]",2008,[],{},2008,00351031X,[],,2008,[BK020000],[978-1-4058-8214-9],...,[64 S.],[],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Pearson Education],[Pearson Education],2008,,{'245': ['Emma']},{'245': ['Emma']},64 S.,504389823
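The check above stops at the first slave record that is not linked to exactly one master record. A variant that collects all such records may be easier to inspect. The sketch below assumes a plain dictionary from slave identifier to the list of master identifiers found for it; the function name and the toy data are hypothetical.

```python
def find_ambiguous_slaves(master_docids_per_slave):
    """Return the slave ids that are not linked to exactly one distinct master."""
    return [
        slave_id
        for slave_id, master_ids in master_docids_per_slave.items()
        if len(set(master_ids)) != 1
    ]


# Hypothetical links: s3 has no master, s4 points to two different masters
links = {'s1': ['m1', 'm1'], 's2': ['m2'], 's3': [], 's4': ['m1', 'm2']}

print(find_ambiguous_slaves(links))  # ['s3', 's4']
```

An empty result corresponds to the case above in which no 'HALT!' is printed.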


In [28]:
result = pd.merge(left=goldstandard['slaves'], right=goldstandard['masters'], how='inner', left_on='masters_docid', right_on='docid')

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(result.columns)

result.head()

Unnamed: 0,035liste_x,century_x,coordinate_x,corporate_x,decade_x,docid_x,doi_x,edition_x,exactDate_x,format_x,isbn_x,ismn_x,musicid_x,pages_x,part_x,person_x,pubinit_x,pubword_x,pubyear_x,scale_x,ttlfull_x,ttlpart_x,volumes_x,masters_docid,035liste_y,century_y,coordinate_y,corporate_y,decade_y,docid_y,doi_y,edition_y,exactDate_y,format_y,isbn_y,ismn_y,musicid_y,pages_y,part_y,person_y,pubinit_y,pubword_y,pubyear_y,scale_y,ttlfull_y,ttlpart_y,volumes_y
0,"[(OCoLC)731635279, (ABN)000539983]",2009,[],{},2009,000311049,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam jun.],[Reclam jun.],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600 S.,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",2009,[],{},2009,504389793,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},600 S.
1,"[(OCoLC)731635279, (NEBIS)009587153]",2009,[],{},2009,196506476,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},600 S.,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",2009,[],{},2009,504389793,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},600 S.
2,"[(OCoLC)731635279, (LIBIB)000315536]",2009,[],{},2009,323173349,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],"{'100': ['AustenJane'], '245c': ['Jane Austen']}",[Reclam],[Reclam],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600 S.,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",2009,[],{},2009,504389793,[],,2009,[BK020000],[978-3-15-020008-7],[],,[600 S.],[20008],{'100': ['AustenJane1775-1817(DE-588)118505173...,[Reclam],[Reclam],2009,,{'245': ['Emma']},{'245': ['Emma']},600 S.
3,"[(OCoLC)808324878, (ABN)000155059]",2000,[],"{'710': ['Metropolitan Opera Orchestra', 'Metr...",2000,00130724X,[],,2000,[VM010300],[],[],,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",[],"{'100': ['LevineJamesDir.'], '700': ['MozartWo...",[Deutsche Grammophon],[Deutsche Grammophon],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","1 DVD-Video, DVD Region 0, 169 Min., farb.",504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",2000,[],"{'710': ['Metropolitan Opera', 'Metropolitan O...",2000,504390597,[],,2000,[VM010300],[],[],073 003-9,[1 DVD-Video (169 Min.)],[],{'100': ['MozartWolfgang Amadeus1756-1791(DE-5...,[],[],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",1 DVD-Video (169 Min.)
4,"[(OCoLC)884447694, (IDSBB)003690925]",2000,[],"{'710': ['Metropolitan Opera', 'Metropolitan O...",2000,116188030,[],,2000,[VM010300],[],[],073 003-9,[1 DVD-Video (169 Min.)],[],{'100': ['MozartWolfgang Amadeus1756-1791(DE-5...,[],[],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",1 DVD-Video (169 Min.),504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",2000,[],"{'710': ['Metropolitan Opera', 'Metropolitan O...",2000,504390597,[],,2000,[VM010300],[],[],073 003-9,[1 DVD-Video (169 Min.)],[],{'100': ['MozartWolfgang Amadeus1756-1791(DE-5...,[],[],2000,,"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",1 DVD-Video (169 Min.)
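The inner merge above can be illustrated on a small scale. The following is a minimal sketch with invented toy records (the frames `slaves` and `masters` and all their values are assumptions, not goldstandard data). It shows how each slave row is matched to its master via `masters_docid`, and how pandas appends the suffixes `_x` and `_y` to columns that exist on both sides.

```python
import pandas as pd

# Toy stand-ins for goldstandard['slaves'] and goldstandard['masters'];
# all values are invented for illustration only.
slaves = pd.DataFrame({
    'docid': ['s1', 's2', 's3'],
    'masters_docid': ['m1', 'm1', 'm2'],
    'century': ['2009', '2009', '2000'],
})
masters = pd.DataFrame({
    'docid': ['m1', 'm2'],
    'century': ['2009', '2000'],
})

# Each slave row is paired with the master row whose docid it references.
pairs = pd.merge(left=slaves, right=masters, how='inner',
                 left_on='masters_docid', right_on='docid')

print(pairs.columns.tolist())
print(len(pairs))
```

In this toy setting every slave references exactly one existing master, so the merged frame has as many rows as there are slave records.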


In [29]:
len(result)

435

In [30]:
result.loc[408]

035liste_x                     [(SNL)vtls001860448, (Sz)001860448]
century_x                                                     2013
coordinate_x                                                    []
corporate_x         {'710': ['Schweizerische Normen-Vereinigung']}
decade_x                                                      2013
docid_x                                                  404220762
doi_x                                                           []
edition_x                                                         
exactDate_x                                               2013    
format_x                                                [BK020053]
isbn_x                                                          []
ismn_x                                                          []
musicid_x                                                         
pages_x                                     [1 ressource en ligne]
part_x                                                        

In [31]:
def build_duplicate_pairs(df):
    """Build all pairs of records that share a master, including each record paired with itself."""

    return pd.merge(left=df, right=df, how='inner', left_on='masters_docid', right_on='masters_docid')

In [32]:
duplicates = build_duplicate_pairs(goldstandard['slaves'])
duplicates['duplicates'] = 1

len(duplicates)

1473
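The row count of such a self-merge follows from the group sizes: a master with k slave records contributes k × k rows, because every record is paired with every record of the same group, including itself and in both orders. A minimal sketch on invented toy data (the frame `slaves` and its values are assumptions) makes this visible.

```python
import pandas as pd

# Toy slave records: three share master m1, two share master m2.
slaves = pd.DataFrame({
    'docid': ['a', 'b', 'c', 'd', 'e'],
    'masters_docid': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

# Self-merge on the master id, as in build_duplicate_pairs().
pairs = pd.merge(slaves, slaves, how='inner', on='masters_docid')

# Expected number of rows: sum of squared group sizes (3*3 + 2*2).
group_sizes = slaves.groupby('masters_docid').size()
expected_rows = int((group_sizes ** 2).sum())

# Rows where a record is paired with itself.
self_pairs = int((pairs['docid_x'] == pairs['docid_y']).sum())

# Distinct unordered pairs of two different records.
distinct_pairs = pairs[pairs['docid_x'] < pairs['docid_y']]

print(len(pairs), expected_rows, self_pairs, len(distinct_pairs))
```

Whether self-pairs and mirrored pairs should stay in the training data is a modelling decision; the sketch only shows how they could be counted or filtered.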

In [33]:
duplicates.loc[1000]

035liste_x                    [(OCoLC)248381623, (NEBIS)008641750]
century_x                                                     2001
coordinate_x                                                    []
corporate_x                                                     {}
decade_x                                                      2001
docid_x                                                  190326522
doi_x                                                           []
edition_x                                            3., erw. Aufl
exactDate_x                                               2001    
format_x                                                [BK020000]
isbn_x                                             [3-13-127283-X]
ismn_x                                                          []
musicid_x                                                         
pages_x                                                   [323 S.]
part_x                                                        

In [34]:
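# Cross join: pair every slave record with every unique record.
# assign() returns copies, so the goldstandard DataFrames stay unmodified.
df_s_1 = goldstandard['slaves'].assign(duplicates=0)
df_u_1 = goldstandard['uniques'].assign(duplicates=0)
non_duplicates = pd.merge(df_s_1, df_u_1, on='duplicates')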

len(non_duplicates)

259260
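Merging on a column that holds the same constant in both frames produces a cross join, so the number of rows equals the product of the two frame lengths (259260 would be consistent with, for instance, 435 × 596 records). A minimal sketch with invented toy frames (`slaves`, `uniques` and their values are assumptions):

```python
import pandas as pd

# Toy stand-ins for the slave and unique records.
slaves = pd.DataFrame({'docid': ['s1', 's2', 's3']})
uniques = pd.DataFrame({'docid': ['u1', 'u2']})

# The constant helper column makes every row match every other row.
# assign() returns new frames, so the inputs are left unchanged.
cross = pd.merge(slaves.assign(duplicates=0),
                 uniques.assign(duplicates=0),
                 on='duplicates')

print(len(cross))               # product of both lengths
print(cross.columns.tolist())
```

Recent pandas versions (1.2 and later) also accept `how='cross'` in `pd.merge`, which avoids the helper column; the label column would then have to be added afterwards.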

In [35]:
print(len(duplicates)/len(non_duplicates)*100)

0.5681555195556585


In [36]:
non_duplicates.loc[0]

035liste_x                      [(OCoLC)731635279, (ABN)000539983]
century_x                                                     2009
coordinate_x                                                    []
corporate_x                                                     {}
decade_x                                                      2009
docid_x                                                  000311049
doi_x                                                           []
edition_x                                                         
exactDate_x                                               2009    
format_x                                                [BK020000]
isbn_x                                         [978-3-15-020008-7]
ismn_x                                                          []
musicid_x                                                         
pages_x                                                   [600 S.]
part_x                                                     [20

**Hand the resulting DataFrame over to the next chapter.**

### Feature DataFrame

In [37]:
dupes = duplicates[columns_to_use]
non_dupes = non_duplicates[columns_to_use]

In [38]:
frames = [dupes, non_dupes]

df_feature_base = pd.concat(frames)
df_feature_base.head()

Unnamed: 0,century_x,volumes_x,century_y,volumes_y,duplicates
0,2009,600 S.,2009,600 S.,1
1,2009,600 S.,2009,600 S.,1
2,2009,600 S.,2009,600 S.,1
3,2009,600 S.,2009,600 S.,1
4,2009,600 S.,2009,600 S.,1


In [39]:
len(df_feature_base), len(df_feature_base[df_feature_base.duplicates==0]), len(df_feature_base[df_feature_base.duplicates==1])

(260733, 259260, 1473)

In [40]:
df_feature_base['century_delta'] = (df_feature_base['century_x'] == df_feature_base['century_y']).astype('int32')
df_feature_base['volumes_delta'] = (df_feature_base['volumes_x'] == df_feature_base['volumes_y']).astype('int32')

df_feature_base.drop(columns=['century_x', 'century_y', 'volumes_x', 'volumes_y'], inplace=True)
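Despite their names, the two `_delta` columns are equality indicators: 1 if both records carry the same string, 0 otherwise. A minimal sketch on invented toy values shows one consequence that may deserve attention: placeholder values such as `uuuu`, which appear in the goldstandard for unknown years, count as a match when both sides hold them.

```python
import pandas as pd

# Toy pairs; 'uuuu' mimics the placeholder for an unknown year.
df = pd.DataFrame({
    'century_x': ['2009', '2009', 'uuuu'],
    'century_y': ['2009', '2000', 'uuuu'],
})

# 1 where both strings are identical, 0 otherwise.
df['century_delta'] = (df['century_x'] == df['century_y']).astype('int32')

print(df)
```

Whether two unknown values should be treated as equal is an open design question that this sketch does not settle.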

### Train/Test Split

In [41]:
print('Share of duplicates (1) and non-duplicates (0) in [%]')
print(df_feature_base.duplicates.value_counts(normalize=True)*100)

Share of duplicates (1) and non-duplicates (0) in [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


In [42]:
X = df_feature_base.drop(columns=['duplicates']).values
y = df_feature_base.duplicates.values

In [43]:
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
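With a minority class of roughly 0.6%, the `stratify=y` argument matters: it keeps the class proportions the same in both parts of the split. A minimal sketch on synthetic labels (the arrays `X` and `y` below are invented, with 1% positives) illustrates the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 10 positives among 1000 samples.
y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Both parts keep the 1% share of positives.
print(y_tr.mean(), y_te.mean())
```

Without stratification, a random split of such data could by chance leave very few positives in the test set, which would make recall and precision estimates unstable.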

## The Models

### DecisionTree

In [44]:
X_tr[:5], y_tr[:5]

(array([[0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0]], dtype=int32), array([0, 0, 0, 0, 0]))

In [45]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_tr, y_tr)
y_pred = dt.predict(X_te)

### Performance Measurement DecisionTree

In [46]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_te, y_pred)

array([[51827,    25],
       [  123,   172]])

In [47]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

print('Score {:.1f}%'.format(100*dt.score(X_te, y_te)))
print('Area under the curve {:.1f}% - accuracy {:.1f}% - precision {:.1f}% - recall {:.1f}%'.format(100*roc_auc_score(y_te, y_pred),
                100*accuracy_score(y_te, y_pred),
                100*precision_score(y_te, y_pred),
                100*recall_score(y_te, y_pred)
               ))

Score 99.7%
Area under the curve 79.1% - accuracy 99.7% - precision 87.3% - recall 58.3%
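The reported figures can be recomputed by hand from the confusion matrix above. The sketch below uses only the four counts shown there. It also illustrates that `roc_auc_score`, when given hard 0/1 predictions instead of scores, reduces to the mean of recall and specificity (the balanced accuracy), and that a classifier always predicting 0 would already reach an accuracy above 99% on this data.

```python
# Counts taken from the confusion matrix: rows are true labels, columns predictions.
tn, fp, fn, tp = 51827, 25, 123, 172
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# With hard predictions the ROC curve has a single operating point,
# so the area under it equals the balanced accuracy.
balanced_accuracy = (recall + specificity) / 2

# Accuracy of a trivial model that labels every pair as non-duplicate.
baseline_accuracy = (tn + fp) / total

print('accuracy {:.1f}% - precision {:.1f}% - recall {:.1f}%'.format(
    100 * accuracy, 100 * precision, 100 * recall))
print('balanced accuracy {:.1f}% - baseline accuracy {:.1f}%'.format(
    100 * balanced_accuracy, 100 * baseline_accuracy))
```

Because of the class imbalance, precision and recall are more informative here than accuracy.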


### SVC

In [48]:
from sklearn.svm import SVC

sv = SVC(kernel='rbf', gamma='auto', random_state=0)
sv.fit(X_tr, y_tr)
y_pred = sv.predict(X_te)

### Performance Measurement SVC

In [49]:
confusion_matrix(y_te, y_pred)

array([[51827,    25],
       [  123,   172]])

In [50]:
print('Score {:.1f}%'.format(100*sv.score(X_te, y_te)))
print('Area under the curve {:.1f}% - accuracy {:.1f}% - precision {:.1f}% - recall {:.1f}%'.format(100*roc_auc_score(y_te, y_pred),
                100*accuracy_score(y_te, y_pred),
                100*precision_score(y_te, y_pred),
                100*recall_score(y_te, y_pred)
               ))

Score 99.7%
Area under the curve 79.1% - accuracy 99.7% - precision 87.3% - recall 58.3%
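Both models produce the same confusion matrix. One plausible explanation is that two binary features allow at most four distinct input patterns, so any classifier can only assign one label per pattern. The following sketch, on invented toy data with the same column names, shows how the patterns and their share of duplicates could be inspected.

```python
import pandas as pd

# Toy feature base; values are invented for illustration only.
df = pd.DataFrame({
    'century_delta': [0, 0, 1, 1, 1, 0, 1],
    'volumes_delta': [0, 0, 1, 1, 1, 1, 0],
    'duplicates':    [0, 0, 1, 1, 0, 0, 0],
})

# Number of pairs and share of duplicates per feature pattern.
summary = (df.groupby(['century_delta', 'volumes_delta'])['duplicates']
             .agg(['size', 'mean']))

# Majority label per pattern: the best any model can do with these features.
majority = (summary['mean'] > 0.5).astype(int)

print(summary)
print(majority)
```

Applied to `df_feature_base`, such a summary would indicate whether richer features are needed before comparing model families.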
