In [1]:
oversampling = 20
modification_ratio = 0.2
strip_number_digits = True

# Data Synthesizing

Chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb) yields a fraction of duplicate pairs from Swissbib's goldstandard of below 0.6%. The absolute number of duplicates will be sufficient for training and performance testing of a machine learning model. The fraction itself is low, though. This chapter focuses on increasing the number of duplicate pair rows in order to have a more balanced set at hand for training and testing models. This is done by generating artificial pairs of duplicate records.

## Table of Contents

- [Metadata Takeover](#Metadata-Takeover)
- [Loading and Duplicating the Base Data](#Loading-and-Duplicating-the-Base-Data)
- [Modifying the Base Data](#Modifying-the-Base-Data)
    - [coordinate](#coordinate)
    - [corporate](#corporate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [ismn](#ismn)
    - [musicid](#musicid)
    - [part](#part)
    - [person](#person)
    - [pubinit](#pubinit)
    - [scale](#scale)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Data Pairing for Duplicates](#Data-Pairing-for-Duplicates)
- [Removing One Side of an Attribute Pair](#Removing-One-Side-of-an-Attribute-Pair)
- [Append Synthetic Duplicates to Feature Base](#Append-Synthetic-Duplicates-to-Feature-Base)
- [Summary](#Summary)
    - [Goldstandard DataFrame Handover](#Goldstandard-DataFrame-Handover)

## Metadata Takeover

Two artefacts have arisen from the previous chapters: the metadata dictionary and a DataFrame that will be the basis for the feature matrix for the models. Both artefacts must be loaded as the first step of this chapter.

In [2]:
import os
import pickle as pk
import pandas as pd

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)
    
# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

In [3]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 



In [4]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,035liste_x,035liste_y,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,corporate_full_x,corporate_full_y,decade_x,decade_y,docid_x,docid_y,doi_x,doi_y,duplicates,edition_x,edition_y,exactDate_x,exactDate_y,format_postfix_x,format_postfix_y,format_prefix_x,format_prefix_y,isbn_x,isbn_y,ismn_x,ismn_y,masters_docid,musicid_x,musicid_y,pages_x,pages_y,part_x,part_y,person_100_x,person_100_y,person_245c_x,person_245c_y,person_700_x,person_700_y,pubinit_x,pubinit_y,pubword_x,pubword_y,pubyear_x,pubyear_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,ttlpart_x,ttlpart_y,volumes_x,volumes_y
0,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"grawechristian, graweursula","grawechristian, graweursula",reclam jun.,reclam jun.,[Reclam jun.],[Reclam jun.],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
1,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman",emma,,,"{'245': ['Emma', 'Roman']}",{'245': ['Emma']},600,600
2,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (LIBIB)000315536]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,323173349,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
3,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,,"grawechristian, graweursula",reclam,reclam jun.,[Reclam],[Reclam jun.],2009,2009,,,emma,"emma, roman",,,{'245': ['Emma']},"{'245': ['Emma', 'Roman']}",600,600
4,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,,,reclam,reclam,[Reclam],[Reclam],2009,2009,,,emma,emma,,,{'245': ['Emma']},{'245': ['Emma']},600,600


## Loading and Duplicating the Base Data

For generating artificial pairs of duplicates, Swissbib's goldstandard files of uniques and masters are taken as a basis. The files are loaded into a DataFrame, which is then duplicated. The data of the second, duplicated DataFrame is manipulated before both DataFrames are paired.

In [5]:
import json

records_master, records_unique = [], []
file_master, file_unique = 'master.json', 'unique.json'

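# Each line of the files holds one JSON record; the with-blocks close the files again
with open(os.path.join(path_goldstandard, file_master), 'r') as handle:
    for line in handle:
        records_master.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_unique), 'r') as handle:
    for line in handle:
        records_unique.append(json.loads(line))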

print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

records_unique.extend(records_master)
print('Number of total records \t\t\t{:d}'.format(len(records_unique)))

Number of records in data file master.json	159
Number of records in data file unique.json	596
Number of total records 			755


In [6]:
import pandas as pd

goldstandard_uniques = {}

goldstandard_uniques['original'] = pd.DataFrame(records_unique)
goldstandard_uniques['modified'] = pd.DataFrame(records_unique)
# Multiply oversampling data
number_of_oversampling_loop = round((oversampling*len(df_feature_base))/(
    (1-oversampling/100)*100*len(goldstandard_uniques['original'])))

for i in range(number_of_oversampling_loop) :
    goldstandard_uniques['modified'] = pd.concat(
        [goldstandard_uniques['modified'], pd.DataFrame(records_unique)], sort=True)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard_uniques['original'].columns)

goldstandard_uniques['modified'].reset_index(drop=True, inplace=True)
if oversampling > 0:
    print('Omitting the goldstandard training data, with {:,d} iterations in the for-loop, ...'.format(
        number_of_oversampling_loop))
    print('... oversampling will generate a ratio of {:.1f}% of synthetic duplicates in the full training data.'.format(
        100*len(goldstandard_uniques['modified'])/(len(df_feature_base)+len(goldstandard_uniques['modified']))))
else :
    print('The goldstandard will not be increased with synthetic data.')
goldstandard_uniques['modified'].head()

Omitting the goldstandard training data, with 86 iterations in the for-loop, ...
... oversampling will generate a ratio of 20.2% of synthetic duplicates in the full training data.


Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)362722306, (ABN)000551177]",2009,[],"{'110': [], '710': [], '810': []}",2009,000143235,[],,2009,[BK020000],[978-3-7466-6120-9],[],,[575 S.],[6120],"{'100': ['AustenJane'], '700': [], '800': [], ...",[Aufbau Taschenbuch],[Aufbau Taschenbuch],2009,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",[575 S.]
1,"[(OCoLC)886929897, (ABN)000223034]",2004,[],"{'110': [], '710': ['Wiener Philharmoniker'], ...",2004,00044801X,[],,2004,[MU040100],[],[],,[2 CD],[],"{'100': ['MozartWolfgang Amadeus'], '700': ['K...",[Membran],[Membran],2004,,{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},[2 CD]
2,"[(OCoLC)778386601, (ABN)000433604]",1793,[],"{'110': [], '710': [], '810': []}",1793,000996009,[],,17931797,[MU010000],[],[],,[202 S.],[],"{'100': ['MozartWolfgang Amadeus'], '700': [],...",[Komm. Breitkopf],[Komm. Breitkopf],17931797,,"{'245': ['Die Zauberflöte', 'eine grosse Oper ...","{'245': ['Die Zauberflöte', 'eine grosse Oper ...",[202 S.]
3,"[(OCoLC)778561839, (ABN)000238844]",2000,[],"{'110': [], '710': [], '810': []}",2000,00239538X,[],,2000,[BK020047],[3-932992-42-3],[],,"[1 CD-Rom in Box, mit Notizbuchfunktion für Re...",[],"{'100': [], '700': ['FrischMax'], '800': [], '...",[Terzio],[Terzio],2000,,"{'245': ['Homo faber', 'Originaltext, Interpre...","{'245': ['Homo faber', 'Originaltext, Interpre...","[1 CD-Rom in Box, mit Notizbuchfunktion für Re..."
4,"[(OCoLC)777853583, (ABN)000243260]",1990,[],"{'110': [], '710': [], '810': []}",1990,002410559,[],,1990,[BK020000],[0-19-282756-1],[],,[445 p.],[],"{'100': ['AustenJane'], '700': [], '800': [], ...",[Oxford University Press],[Oxford University Press],1990,,{'245': ['Emma']},{'245': ['Emma']},[445 p.]
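
The number of loop iterations in the cell above follows from solving $\frac{n \cdot N_u}{N_f + n \cdot N_u} = p$ for $n$, which gives $n = \frac{p \cdot N_f}{(1 - p) \cdot N_u}$, where $p$ is the target fraction of synthetic rows, $N_f$ the number of rows in the feature base and $N_u$ the number of base records. The following sketch recomputes the logged values. The function name is hypothetical, and the value of $N_f$ is an assumption derived from the totals logged later in this chapter (325,113 rows after appending 65,685 synthetic pairs).

```python
def oversampling_loops(target_percent, n_feature_rows, n_base_records):
    """Number of extra copies n so that synthetic rows make up about target_percent."""
    p = target_percent / 100
    return round(p * n_feature_rows / ((1 - p) * n_base_records))


# Assumption: feature base size before appending, derived from the logged totals
n_feature_rows = 325_113 - 65_685
n_base_records = 755  # 596 uniques + 159 masters

loops = oversampling_loops(20, n_feature_rows, n_base_records)
# The modified DataFrame starts with one copy, the loop then adds `loops` more
synthetic_rows = (loops + 1) * n_base_records
ratio = 100 * synthetic_rows / (n_feature_rows + synthetic_rows)

print(loops, synthetic_rows, round(ratio, 1))
```

Because the modified DataFrame already holds one copy of the records before the loop starts, the resulting ratio ends up slightly above the target value.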


The goldstandard data must be preprocessed the same way as in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), see section [Transform Attributes for Similarity Comparison](./2_GoldstandardDataPreparation.ipynb#Transform-Attributes-for-Similarity-Comparison) there.

In [7]:
import data_preparation_funcs as dpf

for i in ['original', 'modified']:
    goldstandard_uniques[i] = dpf.attribute_preprocessing(
        goldstandard_uniques[i],
        columns_metadata_dict['data_analysis_columns'], strip_number_digits)

In [8]:
columns_metadata_dict['data_analysis_columns']

['coordinate_E',
 'coordinate_N',
 'corporate_full',
 'doi',
 'edition',
 'exactDate',
 'format_prefix',
 'format_postfix',
 'isbn',
 'ismn',
 'musicid',
 'part',
 'person_100',
 'person_700',
 'person_245c',
 'pubinit',
 'scale',
 'ttlfull_245',
 'ttlfull_246',
 'volumes']

The following section of this chapter will modify the data of the columns listed above with the goal of producing pairs of duplicates after joining the DataFrame $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'original'}]$ with $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'modified'}]$. The two records of such a pair need not be identical; they may merely be similar. This will be achieved by modifying the attributes of $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'modified'}]$.

## Modifying the Base Data

For each attribute, a function has been implemented that modifies its value in a manner that may be expected in Swissbib's data reality. The functions are implemented in the Python module [modify_data_funcs.py](./modify_data_funcs.py).

In [9]:
import modify_data_funcs as mdf

All functions below expect a parameter named $\texttt{modification}\_\texttt{ratio}$. This parameter determines the fraction of sample rows that will be modified. Its value is set at the very top, in the first code cell of this chapter. Besides $\texttt{modification}\_\texttt{ratio}$, the functions are called with four additional parameters that control how an attribute value is modified. Each of the four can be set separately, and each simulates a kind of mistyping during data entry of a bibliographical unit.

- $\texttt{number}\_\texttt{of}\_\texttt{delete}$ : This parameter controls the effect of removing characters. Its value determines the number of characters to be removed at randomly chosen positions in the string.
- $\texttt{number}\_\texttt{of}\_\texttt{switch}$ : This parameter controls the effect of switching two adjacent characters. Its value determines the number of character pairs to be switched at randomly chosen positions in the string. Switching two characters simulates a typo, e.g. in the title of a bibliographical unit.
- $\texttt{number}\_\texttt{of}\_\texttt{replace}$ : This parameter controls the effect of replacing a character at a randomly chosen position in the string with a randomly chosen new character. Its value determines the number of characters to be replaced.
- $\texttt{number}\_\texttt{of}\_\texttt{add}$ : This parameter controls the effect of adding characters. This is the opposite of removing characters. Its value determines the number of characters to be added at randomly chosen positions in the string.
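
The four effects can be sketched on a single string as follows. This is a minimal illustration and not the code of $\texttt{modify}\_\texttt{data}\_\texttt{funcs.py}$: the function names, the character pool $\texttt{ALPHABET}$ and the use of a seeded random generator are assumptions made for this example.

```python
import random
import string

# Assumption: pool of characters to draw random replacements and additions from
ALPHABET = string.ascii_lowercase + string.digits


def delete_chars(text, count, rng):
    """Remove `count` characters at randomly chosen positions."""
    chars = list(text)
    for _ in range(min(count, len(chars))):
        del chars[rng.randrange(len(chars))]
    return ''.join(chars)


def switch_chars(text, count, rng):
    """Switch `count` randomly chosen pairs of adjacent characters."""
    chars = list(text)
    if len(chars) < 2:
        return text
    for _ in range(count):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)


def replace_chars(text, count, rng):
    """Overwrite `count` randomly chosen positions with random characters."""
    chars = list(text)
    if not chars:
        return text
    for _ in range(count):
        chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return ''.join(chars)


def add_chars(text, count, rng):
    """Insert `count` random characters at randomly chosen positions."""
    chars = list(text)
    for _ in range(count):
        chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
    return ''.join(chars)


rng = random.Random(42)  # seeded only to make the demonstration repeatable
title = 'die zauberflöte'
for func in (delete_chars, switch_chars, replace_chars, add_chars):
    print(func.__name__, '->', func(title, 1, rng))
```

Deleting shortens the string by one character, adding lengthens it by one, and switching or replacing keeps its length. A similarity metric applied to the original and the modified string will therefore report a high, but not perfect, similarity.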

In the following subsections, each attribute is discussed and, where applicable, modified separately.

### coordinate

For attribute $\texttt{coordinate}$, typos of switching two adjacent characters can be imagined. This will be the only kind of manipulation foreseen for this attribute.

In [10]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 0
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'coordinate_E', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'coordinate_N', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 887 samples two characters of attribute coordinate_E switched.


In 887 samples two characters of attribute coordinate_N switched.


### corporate

For attribute $\texttt{corporate}$, which is a string of free text, all kinds of typos can be imagined. Therefore, all four parameters for manipulation will be set to a value greater than 0.

In [11]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'corporate_full', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 2819 samples one character of attribute corporate_full removed.


In 2819 samples two characters of attribute corporate_full switched.


In 2819 samples one character of attribute corporate_full replaced.


In 2819 samples one character of attribute corporate_full added.


### doi

Attribute $\texttt{doi}$ will be compared on identity of the string in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb). The reason is that the value of this attribute is expected to be processed fully electronically, without any additional manual manipulation in Swissbib's data sources. Therefore, the attribute's values are expected to be of very high quality at their sources, and this expectation is to be reflected in the training data. Leaving the attribute's values unmodified will make it a strong indicator of duplicate bibliographical units, if present.

### edition

This attribute holds numbers. Wrongly entered data are hard to simulate, so the manipulation of this attribute will be omitted.

### exactDate

This attribute has a fixed length of eight characters. It can hold number digits or markers for unknown data. A typo of switching two adjacent digits can be imagined for this attribute as well as a mistyping of a number digit. The corresponding parameters will be set for data manipulation.

In [12]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 1
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'exactDate', modification_ratio/2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 6568 samples two characters of attribute exactDate switched.


In 6568 samples one character of attribute exactDate replaced.


### format

Attribute $\texttt{format}\_\texttt{prefix}$ is a code where character modification would produce unknown codes. The deduplication model need not expect such values. With $\texttt{format}\_\texttt{postfix}$, a switch of two digits may be possible, though.

In [13]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 0
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'format_postfix', modification_ratio/4,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 3284 samples two characters of attribute format_postfix switched.


### isbn

Attribute $\texttt{isbn}$ will be compared on identity of the string. Therefore, this attribute will not be modified to remain a strong indicator of duplicate bibliographical units.

### ismn

The same statement as for attribute $\texttt{isbn}$ holds for attribute $\texttt{ismn}$. The attribute will not be modified.

### musicid

For this attribute, a missing character or an additional character may be possible for a pair of duplicates. This will be simulated here; all other conceivable manipulations will be omitted.

In [14]:
number_of_delete = 1
number_of_switch = 0
number_of_replace = 0
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'musicid', modification_ratio/4,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 344 samples one character of attribute musicid removed.


In 334 samples one character of attribute musicid added.


### part

Like attribute $\texttt{edition}$, this attribute holds numbers and is hard to manipulate in a good way. Wrongly entered data shall not be simulated with this attribute.

### person

The $\texttt{person}$ attribute is a typical attribute of manually entered data, prone to typos. For this reason, it will be manipulated in all four ways described.

In [15]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_100', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_700', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_245c', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 8126 samples one character of attribute person_100 removed.


In 8126 samples two characters of attribute person_100 switched.


In 8126 samples one character of attribute person_100 replaced.


In 8126 samples one character of attribute person_100 added.


In 7934 samples one character of attribute person_700 removed.


In 7934 samples two characters of attribute person_700 switched.


In 7934 samples one character of attribute person_700 replaced.


In 7934 samples one character of attribute person_700 added.


In 11867 samples one character of attribute person_245c removed.


In 11867 samples two characters of attribute person_245c switched.


In 11867 samples one character of attribute person_245c replaced.


In 11867 samples one character of attribute person_245c added.


### pubinit

This attribute holds names. It will be treated the same way as the $\texttt{person}$ attributes.

In [16]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'pubinit', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 8543 samples one character of attribute pubinit removed.


In 8529 samples two characters of attribute pubinit switched.


In 8529 samples one character of attribute pubinit replaced.


In 8529 samples one character of attribute pubinit added.


### scale

This attribute is used for scaling information of maps. It is sparsely filled and digit 0 is the predominant one. As it is hard to modify the attribute in a good way, no modification will be simulated on it.

### ttlfull

Both $\texttt{ttlfull}$ attributes can hold longer string sequences and are prototypical for typos. Both of them will be treated the same way as the $\texttt{person}$ attributes, but with twice the modification ratio.

In [17]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'ttlfull_245', modification_ratio*2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'ttlfull_246', modification_ratio*2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 26274 samples one character of attribute ttlfull_245 removed.


In 25734 samples two characters of attribute ttlfull_245 switched.


In 25734 samples one character of attribute ttlfull_245 replaced.


In 25734 samples one character of attribute ttlfull_245 added.


In 2958 samples one character of attribute ttlfull_246 removed.


In 2958 samples two characters of attribute ttlfull_246 switched.


In 2958 samples one character of attribute ttlfull_246 replaced.


In 2958 samples one character of attribute ttlfull_246 added.


### volumes

For the same reason as for attributes $\texttt{edition}$ and $\texttt{part}$, no modification will be simulated on attribute $\texttt{volumes}$.

## Data Pairing for Duplicates

In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), the records pairing for generating pairs of duplicates was accomplished with the help of attribute $\texttt{035liste}$. For the synthetic data of this chapter, attribute $\texttt{docid}$ is the right key for joining. The reason for this choice is that attribute $\texttt{docid}$ identifies a record in an unambiguous way. It acts like a primary key for a record in the base data. Joining a record with itself on this unique identifier guarantees the generation of a pair of duplicates.

In [18]:
duplicates = pd.merge(left=goldstandard_uniques['original'], right=goldstandard_uniques['modified'], how='inner',
                  left_on='docid', right_on='docid')
# Mark all as duplicates for target vector
duplicates['duplicates'] = 1

print('Number of new duplicate pairs {:,d}'.format(len(duplicates)))

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(duplicates.columns)

duplicates.sample(n=5)

Number of new duplicate pairs 65,685


Unnamed: 0,docid,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,035liste_y,century_y,coordinate_y,decade_y,doi_y,edition_y,exactDate_y,isbn_y,ismn_y,musicid_y,pages_y,part_y,pubinit_y,pubword_y,pubyear_y,scale_y,ttlpart_y,volumes_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y,duplicates
3912,040661660,"[(OCoLC)636370741, (SBT)000245999]",[],{'245': ['Flashdance']},uuuuuuuu,uuuu,uuuu,uuuuuuuu,,,[1 CD],1,polygram,[Polygram],,[],,,,,,,,,mu,40100,,,,flashdance,,"[(OCoLC)636370741, (SBT)000245999]",uuuu,[],uuuu,,,uuuuuuuu,[],,,[1 CD],,polyxram,[Polygram],uuuuuuuu,,{'245': ['Flashdance']},1,,,,,,mu,40100,,,,flashdance,,1
6308,054125235,"[(OCoLC)779540891, (SGBN)001137049]",[],{'245': ['ICSG Typis Monasterij S. Galli']},16001700,1600,1600,16001700,,,[1 Holzschnitt],1,[s.n.],[[s.n.]],,[],,,,,,,,,vm,20353,,,,icsg typis monasterij s. galli,,"[(OCoLC)779540891, (SGBN)001137049]",1600,[],1600,,,16001700,[],,,[1 Holzschnitt],,[s.n.],[[s.n.]],16001700,,{'245': ['ICSG Typis Monasterij S. Galli']},1,,,,,,vm,20353,,,,icsg typ smonasterij s. galli,,1
11713,113660332,"[(OCoLC)611252842, (IDSBB)000585728]",[],"{'245': ['Bonne chance', 'cours de langue fran...",19841992,1984,1984,19841992,,,[ v.],,,[],,[],,,,,,,,,vm,30100,,kesslersigrid,sigrid kessler ... [et al.],"bonne chance, cours de langue française : exig...",,"[(OCoLC)611252842, (IDSBB)000585728]",1984,[],1984,,,19841992,[],,,[ v.],,,[],19841992,,"{'245': ['Bonne chance', 'cours de langue fran...",,,,,,,vm,30100,,kesslersigrid,sigrid kessler .l. [et al.],"boxne chance, cours de langue française : exig...",,1
21274,224443526,[(RERO)R274732860],[],"{'245': ['Bonne chance!', 'cours de langue fra...",19941995,1994,1994,19941995,,,[4 disques compacts dans 3 coffrets],4 3,"interkantonale lehrmittelzentrale, staatlicher...","[Interkantonale Lehrmittelzentrale, Staatliche...",,[],,,,,,,,,mu,30000,,,sigrid kessler... [et al.] ; [hrsg.:] interkan...,"bonne chance!, cours de langue française, 1",,[(RERO)R274732860],1994,[],1994,,,19941995,[],,,[4 disques compacts dans 3 coffrets],,"inetrkantonale lehrmittelzentrale, staatlicher...","[Interkantonale Lehrmittelzentrale, Staatliche...",19941995,,"{'245': ['Bonne chance!', 'cours de langue fra...",4 3,,,,,,mu,30000,,,sigrid kessler... [et al.] ; [hrsg.:] ibterkan...,"bonne chance!, cour sde langue française 1",,1
15732,16018701X,"[(OCoLC)634141356, (NEBIS)004809032]",[],{'245': ['Vocabularius iuris utriusque']},1490,1490,1490,1490uuuu,,,[[130] Bl.],130,"[drucker des jordanus, d.i. georg husner]","[[Drucker des Jordanus, d.i. Georg Husner]]",,[],10.3931/e-rara-61897,,,,,,"husner, kloster rheinau","husner, kloster rheinau",bk,20053,iodocus,,[jodocus erfordensis],vocabularius iuris utriusque,,"[(OCoLC)634141356, (NEBIS)004809032]",1490,[],1490,10.3931/e-rara-61897,,1490uuuu,[],,,[[130] Bl.],,"[crucker des jordanus, d.i. georg husner]","[[Drucker des Jordanus, d.i. Georg Husner]]",1490,,{'245': ['Vocabularius iuris utriusque']},130,,,,"husner, kloster rheinau","husner, koster rheinau",bk,20053,iodocus,,[jodocus erfordensis],vocabularius iuri utriusque,,1


Some sample records of the new synthetic pairs of duplicates are shown above.
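
The effect of the join can be illustrated on toy data. In the sketch below, every record of the original side occurs several times on the modified side, so the inner join on $\texttt{docid}$ returns one pair per modified copy. The DataFrames and their values are made up for this illustration; only the call to $\texttt{pd.merge}$ mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for goldstandard_uniques['original']
original = pd.DataFrame({
    'docid': ['A', 'B'],
    'ttlfull_245': ['emma', 'homo faber'],
})

# Toy stand-in for goldstandard_uniques['modified']:
# three copies of each record, some of them carrying typos
modified = pd.DataFrame({
    'docid': ['A', 'B', 'A', 'B', 'A', 'B'],
    'ttlfull_245': ['emma', 'homo fabr', 'emam',
                    'homo faber', 'emmma', 'hmoo faber'],
})

# Joining on the unique key pairs each record only with copies of itself
pairs = pd.merge(left=original, right=modified, how='inner', on='docid')
pairs['duplicates'] = 1

print(len(pairs))
print(pairs[['docid', 'ttlfull_245_x', 'ttlfull_245_y', 'duplicates']])
```

Columns that exist on both sides receive the suffixes $\texttt{\_x}$ and $\texttt{\_y}$, while the join key $\texttt{docid}$ appears only once. With 755 base records and 87 copies of each on the modified side, the same mechanism yields the 65,685 pairs reported above.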

## Removing One Side of an Attribute Pair

One additional scenario of data differences within a pair of duplicates can be observed in Swissbib's raw data and has not been covered yet: an empty value on one side of an attribute pair. To cover this scenario and to synthesise more realistic training data, a function has been implemented which randomly removes the value on one side of an attribute pair. This function will be applied only to columns that are relevant for the final feature matrix.

In [19]:
duplicates_columns = columns_metadata_dict['columns_to_use'][:] # Copy list by value, not by reference!
# This attribute is always filled
duplicates_columns.remove('exactDate_x')
duplicates_columns.remove('exactDate_y')
# These attributes are mostly filled
duplicates_columns.remove('format_prefix_x')
duplicates_columns.remove('format_prefix_y')
duplicates_columns.remove('format_postfix_x')
duplicates_columns.remove('format_postfix_y')
# Title is the most important attribute that is hardly missing
duplicates_columns.remove('ttlfull_245_x')
duplicates_columns.remove('ttlfull_245_y')
duplicates_columns.remove('ttlfull_246_x')
duplicates_columns.remove('ttlfull_246_y')
# Target vector is needed
duplicates_columns.remove('duplicates')

duplicates = mdf.remove_one_side_of_attribute_pair(duplicates, duplicates_columns, modification_ratio/5)
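
The idea behind the call above can be sketched in plain Python. This is a simplified illustration under assumptions, not the implementation in $\texttt{modify}\_\texttt{data}\_\texttt{funcs.py}$: rows are represented as dictionaries instead of a DataFrame, an empty string stands in for a missing value, and the function name $\texttt{remove}\_\texttt{one}\_\texttt{side}$ is hypothetical.

```python
import random


def remove_one_side(rows, column_names, ratio, rng):
    """Blank either the _x or the _y value of randomly chosen attribute pairs."""
    # Derive the attribute names by stripping the '_x' / '_y' suffix
    attributes = sorted({name[:-2] for name in column_names})
    for attribute in attributes:
        for row in rows:
            # Each row is picked with probability `ratio`
            if rng.random() < ratio:
                side = rng.choice(('_x', '_y'))
                row[attribute + side] = ''  # assumption: '' marks a missing value
    return rows


rng = random.Random(7)  # seeded only to make the demonstration repeatable
pairs = [
    {'isbn_x': '978-3-15-020008-7', 'isbn_y': '978-3-15-020008-7',
     'pubinit_x': 'reclam', 'pubinit_y': 'reclam', 'duplicates': 1}
    for _ in range(5)
]
for row in remove_one_side(pairs, ['isbn_x', 'isbn_y', 'pubinit_x', 'pubinit_y'], 0.5, rng):
    print(row)
```

Only one side of a pair is emptied at a time, so the pair keeps the value on the other side, and columns not passed to the function, such as the target column, remain untouched.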

## Append Synthetic Duplicates to Feature Base

To be able to append the synthesised pairs of duplicates to the feature base of chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), attributes $\texttt{docid}\_\texttt{x}$ and $\texttt{docid}\_\texttt{y}$ have to be added. The goal is a DataFrame with the same columns as the feature base DataFrame loaded at the beginning of this chapter.

In [20]:
duplicates.rename(columns={'docid' : 'docid_x'}, inplace=True)
duplicates['docid_y'] = duplicates['docid_x']
duplicates.sample(n=5)

Unnamed: 0,docid_x,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,035liste_y,century_y,coordinate_y,decade_y,doi_y,edition_y,exactDate_y,isbn_y,ismn_y,musicid_y,pages_y,part_y,pubinit_y,pubword_y,pubyear_y,scale_y,ttlpart_y,volumes_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y,duplicates,docid_y
42928,431462992,"[(VAUD)991002221929702853, (RNV)007138199-41bc...",[978-2-01-322734-6],"{'245': ['Maintenant, c'est ma vie']}",2012,2012,2012,2012uuuu,,1368 1368,[254 p.],254,hachette livre,[Hachette Livre],,[],,,,,,,,,bk,20000,rosoffmeg,collonhélène,meg rosoff ; trad. de l'anglais par hélène collon,"maintenant, c'est ma vie",,"[(VAUD)991002221929702853, (RNV)007138199-41bc...",2012,[],2012,,,2012uuuu,[978-2-01-322734-6],,,[254 p.],1368 1368,hachetet livre,[Hachette Livre],2012,,"{'245': ['Maintenant, c'est ma vie']}",254,,,,,,bk,20000,rosoffmg,colnohélène,meg rosoff ; trad.d e l'anglais par hélène collon,"maintenuant, c'est ma vie",,1,431462992
61984,504390147,"[(NEBIS)001584785, (IDSBB)001974245]",[0335-1793],{'245': ['Libération']},19739999,1973,1973,19739999,,,[],,libération,[Libération],,[],,,,,,,,,cr,30500,,,,libération,,"[(NEBIS)001584785, (IDSBB)001974245]",1973,[],1973,,,19739999,[0335-1793],,,[],,libération,[Libération],19739999,,{'245': ['Libération']},,,,,,,cr,30500,,,,libération,,1,504390147
45940,486959899,"[(OCoLC)962353124, (IDSBB)006713806, (OCoLC)96...","[978-1-118-62114-1, 1-118-62114-X, 978-1-118-6...",{'245': ['Reading the eighteenth-century novel']},2017,2017,2017,2017uuuu,,,"[viii, 240 Seiten]",240,,[],,[],,,,,,,,,bk,20000,richterdavid h.,,david h. richter,reading the eighteenth-century novel,,"[(OCoLC)962353124, (IDSBB)006713806, (OCoLC)96...",2017,[],2017,,,2017uu9u,"[978-1-118-62114-1, 1-118-62114-X, 978-1-118-6...",,,"[viii, 240 Seiten]",,,[],2017,,{'245': ['Reading the eighteenth-century novel']},240,,,,,,bk,20000,richterdavid h.,,david h. richter,reading the eighteenth-centuuyr novel,,1,486959899
14186,137735847,"[(OCoLC)613316506, (NEBIS)001322634]",[],{'245': ['Zauberflöte']},19uu,19uu,19uu,19uuuuuu,,38,[1 Klavierauszug],1,litolff,[Litolff],,[],,,12780.0,,,,,,mu,10200,mozartwolfgang amadeus,,mozart,zauberflöte,,"[(OCoLC)613316506, (NEBIS)001322634]",19uu,[],19uu,,,19uuuuuu,[],,12780.0,[1 Klavierauszug],38,litolff,[Litolff],19uu,,{'245': ['Zauberflöte']},1,,,,,,mu,10200,mzartwolfgnag amadeus,,mozarft,jauberfslöte,,1,137735847
30243,314724885,[(RERO)R006516504],[],"{'245': ['Bonne chance !', 'cours de langue fr...",1992,1992,1992,1992uuuu,,,"[135, 135 p.]",135 135,ed. scolaires du canton de berne,[Ed. scolaires du canton de Berne],,[],,,,,,,,,bk,20000,kesslersigrid,,[sigrid kessler],"bonne chance !, cours de langue française, tro...",,[(RERO)R006516504],1992,[],1992,,,1972uuuu,[],,,"[135, 135 p.]",,ed. scolaires du cantfon ed berne,[Ed. scolaires du canton de Berne],1992,,"{'245': ['Bonne chance !', 'cours de langue fr...",135 135,,,,,,bk,20000,kesslersigrid,,[sigrid kessler],"bonne chance !, cours de langue française, tro...",,1,314724885


Now, the synthetic pairs of duplicates are ready and can be appended to the feature base of chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).

In [21]:
if oversampling > 0 :
    frames = [df_feature_base, duplicates]

    df_feature_base = pd.concat(frames, sort=True)
    # Set unique values on index
    df_feature_base.reset_index(inplace=True, drop=True)

## Summary

The fraction of duplicate pairs generated from Swissbib's goldstandard is low, below 0.6%. This raises the need to increase it. This chapter increases the number of duplicates with artificial records. The base records from Swissbib's goldstandard have been loaded, slightly manipulated, and finally joined with their originals to generate the desired amount of synthetic data records for training and performance testing.

In [22]:
print('Number of rows in training set :', len(df_feature_base))
print('Number of rows with pairs of duplicates in training set :', len(df_feature_base[df_feature_base.duplicates==1]))
print('Ratio : {:.2f}%'.format(100*len(df_feature_base[df_feature_base.duplicates==1])/len(df_feature_base)))

Number of rows in training set : 325113


Number of rows with pairs of duplicates in training set : 67158
Ratio : 20.66%


### Goldstandard DataFrame Handover

The DataFrame for the feature base has been extended with additional rows in this chapter. The result is saved into a pickle file to hand over the data to the next chapters. The data will be read as input file in the next chapter, [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb). The metadata dictionary has not been modified in this chapter. Therefore, it does not need to be stored again.

In [23]:
import pickle as pk

# Binary intermediary DataFrame file for feature matrix generation
with open(os.path.join(path_goldstandard, 'feature_base_df.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)