In [1]:
execution_mode = 'full'
oversampling = 0
modification_ratio = 0.2
strip_number_digits = True
sampling_fraction_nreb = 1
sampling_fraction_reb = 1

# Data Synthesizing

Chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb) yields a share of duplicate record pairs in Swissbib's goldstandard of below 0.6%. In absolute terms, this number of duplicates will be sufficient for training and performance testing of a machine learning model. As a fraction, however, it is low. This chapter focuses on increasing the number of duplicate pair rows in order to have a more balanced data set at hand for training and testing models. This is done by generating artificial pairs of duplicates.

## Table of Contents

- [Metadata Takeover](#Metadata-Takeover)
- [Loading and Duplicating the Base Data](#Loading-and-Duplicating-the-Base-Data)
- [Modifying the Base Data](#Modifying-the-Base-Data)
    - [coordinate](#coordinate)
    - [corporate](#corporate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [ismn](#ismn)
    - [musicid](#musicid)
    - [part](#part)
    - [person](#person)
    - [pubinit](#pubinit)
    - [scale](#scale)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Data Pairing for Duplicates](#Data-Pairing-for-Duplicates)
- [Removing One Side of an Attribute Pair](#Removing-One-Side-of-an-Attribute-Pair)
- [Append Synthetic Duplicates to Feature Base](#Append-Synthetic-Duplicates-to-Feature-Base)
- [Downsampling](#Downsampling)
    - [Downsampling without Rebalancing](#Downsampling-without-Rebalancing)
    - [Downsampling with Rebalancing](#Downsampling-with-Rebalancing)
- [Summary](#Summary)
    - [Goldstandard DataFrame Handover](#Goldstandard-DataFrame-Handover)

## Metadata Takeover

Two artefacts have arisen from the previous chapters: the metadata dictionary and a DataFrame that will be the basis of the feature matrix for the models. Both artefacts are loaded as the first step of this chapter.

In [2]:
import os
import pickle as pk
import pandas as pd
import bz2
import _pickle as cPickle

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)
    
# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File((os.path.join(
    path_goldstandard, 'feature_base_df.pkl')), 'rb') as file:
    df_feature_base = cPickle.load(file)

In [3]:
for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 

similarity_metrics 
 {'coordinate_E': LCSStr({'qval': 1, 'external': True}), 'coordinate_N': L

In [4]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,035liste_x,035liste_y,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,corporate_full_x,corporate_full_y,decade_x,decade_y,docid_x,docid_y,doi_x,doi_y,duplicates,edition_x,edition_y,exactDate_x,exactDate_y,format_postfix_x,format_postfix_y,format_prefix_x,format_prefix_y,isbn_x,isbn_y,ismn_x,ismn_y,masters_docid,musicid_x,musicid_y,pages_x,pages_y,part_x,part_y,person_100_x,person_100_y,person_245c_x,person_245c_y,person_700_x,person_700_y,pubinit_x,pubinit_y,pubword_x,pubword_y,pubyear_x,pubyear_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,ttlpart_x,ttlpart_y,volumes_x,volumes_y
0,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"grawechristian, graweursula","grawechristian, graweursula",reclam jun.,reclam jun.,[Reclam jun.],[Reclam jun.],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
1,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman",emma,,,"{'245': ['Emma', 'Roman']}",{'245': ['Emma']},600,600
2,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (LIBIB)000315536]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,323173349,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
3,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,,"grawechristian, graweursula",reclam,reclam jun.,[Reclam],[Reclam jun.],2009,2009,,,emma,"emma, roman",,,{'245': ['Emma']},"{'245': ['Emma', 'Roman']}",600,600
4,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,,,reclam,reclam,[Reclam],[Reclam],2009,2009,,,emma,emma,,,{'245': ['Emma']},{'245': ['Emma']},600,600


## Loading and Duplicating the Base Data

For generating artificial pairs of duplicates, Swissbib's goldstandard files of uniques and masters are taken as a basis. The files are loaded into a DataFrame, which is then duplicated. The data of the second, duplicated DataFrame is manipulated, and afterwards the two DataFrames are paired.

In [5]:
import json

records_master, records_unique = [], []
file_master, file_unique = 'master.json', 'unique.json'

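# Each file holds one JSON record per line; the with-blocks close the files again
with open(os.path.join(path_goldstandard, file_master), 'r') as file:
    for line in file:
        records_master.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_unique), 'r') as file:
    for line in file:
        records_unique.append(json.loads(line))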

print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

records_unique.extend(records_master)
print('Number of total records \t\t\t{:d}'.format(len(records_unique)))

Number of records in data file master.json	159
Number of records in data file unique.json	596
Number of total records 			755


In [6]:
goldstandard_uniques = {}

goldstandard_uniques['original'] = pd.DataFrame(records_unique)
goldstandard_uniques['modified'] = pd.DataFrame(records_unique)
# Multiply oversampling data
number_of_oversampling_loop = round((oversampling*len(df_feature_base))/(
    (1-oversampling/100)*100*len(goldstandard_uniques['original'])))

for i in range(number_of_oversampling_loop) :
    goldstandard_uniques['modified'] = pd.concat(
        [goldstandard_uniques['modified'], pd.DataFrame(records_unique)], sort=True)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard_uniques['original'].columns)

goldstandard_uniques['modified'].reset_index(drop=True, inplace=True)
if oversampling > 0:
    print('Omitting the goldstandard training data, with {:,d} iterations in the for-loop, ...'.format(
        number_of_oversampling_loop))
    print('... oversampling will generate a ratio of {:.1f}% of synthetic duplicates in the full training data.'.format(
        100*len(goldstandard_uniques['modified'])/(len(df_feature_base)+len(goldstandard_uniques['modified']))))
else :
    print('The goldstandard will not be increased with synthetic data.')
goldstandard_uniques['modified'].head()

The goldstandard will not be increased with synthetic data.


Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000143235,"[(OCoLC)362722306, (ABN)000551177]",[978-3-7466-6120-9],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",2009,2009,2009,2009,,[6120],[575 S.],[575 S.],[Aufbau Taschenbuch],[Aufbau Taschenbuch],,[],[],[],,[BK020000]
1,00044801X,"[(OCoLC)886929897, (ABN)000223034]",[],{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},"{'100': ['MozartWolfgang Amadeus'], '700': ['K...","{'110': [], '710': ['Wiener Philharmoniker'], ...",2004,2004,2004,2004,,[],[2 CD],[2 CD],[Membran],[Membran],,[],[],[],,[MU040100]
2,000996009,"[(OCoLC)778386601, (ABN)000433604]",[],"{'245': ['Die Zauberflöte', 'eine grosse Oper ...","{'245': ['Die Zauberflöte', 'eine grosse Oper ...","{'100': ['MozartWolfgang Amadeus'], '700': [],...","{'110': [], '710': [], '810': []}",17931797,1793,1793,17931797,,[],[202 S.],[202 S.],[Komm. Breitkopf],[Komm. Breitkopf],,[],[],[],,[MU010000]
3,00239538X,"[(OCoLC)778561839, (ABN)000238844]",[3-932992-42-3],"{'245': ['Homo faber', 'Originaltext, Interpre...","{'245': ['Homo faber', 'Originaltext, Interpre...","{'100': [], '700': ['FrischMax'], '800': [], '...","{'110': [], '710': [], '810': []}",2000,2000,2000,2000,,[],"[1 CD-Rom in Box, mit Notizbuchfunktion für Re...","[1 CD-Rom in Box, mit Notizbuchfunktion für Re...",[Terzio],[Terzio],,[],[],[],,[BK020047]
4,002410559,"[(OCoLC)777853583, (ABN)000243260]",[0-19-282756-1],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': [], '800': [], ...","{'110': [], '710': [], '810': []}",1990,1990,1990,1990,,[],[445 p.],[445 p.],[Oxford University Press],[Oxford University Press],,[],[],[],,[BK020000]
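The loop count computed in the cell above can be read as follows. Let $p = \texttt{oversampling}/100$ be the targeted share of synthetic rows, $N$ the number of rows in the feature base, and $U$ the number of unique records. Requiring $S/(S+N) = p$ for the number $S$ of synthetic rows gives $S = pN/(1-p)$, and therefore roughly $S/U$ copies of the unique records. The following sketch restates this arithmetic; the function name and the numbers are hypothetical and serve only as an illustration.

```python
def oversampling_loops(oversampling, n_feature_rows, n_unique_records):
    """Number of extra copies of the unique records for a target share.

    oversampling is the targeted percentage of synthetic rows in the full
    training data, as in the first code cell of this chapter.
    """
    p = oversampling / 100
    synthetic_rows_needed = p * n_feature_rows / (1 - p)
    return round(synthetic_rows_needed / n_unique_records)


# Hypothetical sizes, chosen only for illustration
n_feature_rows = 100_000
n_unique_records = 755

loops = oversampling_loops(20, n_feature_rows, n_unique_records)

# The modified DataFrame starts with one copy, the loop adds the others
synthetic_rows = (loops + 1) * n_unique_records
achieved_share = synthetic_rows / (n_feature_rows + synthetic_rows)
print(loops, round(100 * achieved_share, 1))
```

Because the loop count is rounded and the initial copy comes on top, the achieved share only approximates the target. With $\texttt{oversampling} = 0$, the loop count is 0 and exactly one copy of the unique records remains.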


The goldstandard data must be preprocessed the same way as in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), see section [Transform Attributes for Similarity Comparison](./2_GoldstandardDataPreparation.ipynb#Transform-Attributes-for-Similarity-Comparison) there.

In [7]:
import data_preparation_funcs as dpf

for i in ['original', 'modified']:
    goldstandard_uniques[i] = dpf.attribute_preprocessing(
        goldstandard_uniques[i],
        columns_metadata_dict['data_analysis_columns'], strip_number_digits)

In [8]:
columns_metadata_dict['data_analysis_columns']

['coordinate_E',
 'coordinate_N',
 'corporate_full',
 'doi',
 'edition',
 'exactDate',
 'format_prefix',
 'format_postfix',
 'isbn',
 'ismn',
 'musicid',
 'part',
 'person_100',
 'person_700',
 'person_245c',
 'pubinit',
 'scale',
 'ttlfull_245',
 'ttlfull_246',
 'volumes']

The next section of this chapter will modify the data of the columns listed above, with the goal of producing pairs of duplicates after joining the DataFrame $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'original'}]$ with $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'modified'}]$. The two records of such a pair need not be equal; they may merely be similar. This will be achieved by modifying the attributes of $\texttt{goldstandard}\_\texttt{uniques}[\texttt{'modified'}]$.

## Modifying the Base Data

Each attribute is modified in a manner that may be expected in Swissbib's real data. The modifying functions are implemented in the Python module [modify_data_funcs.py](./modify_data_funcs.py).

In [9]:
import modify_data_funcs as mdf

All functions below expect a parameter named $\texttt{modification}\_\texttt{ratio}$. This parameter determines the fraction of sample rows that will be modified. Its value is set in the first code cell of this chapter. Besides the $\texttt{modification}\_\texttt{ratio}$, the functions are called with four additional parameters that control how an attribute value is modified. Each of the four parameters can be set separately, and each simulates a kind of mistyping during data entry of a bibliographic unit.

- $\texttt{number}\_\texttt{of}\_\texttt{delete}$ : This parameter controls the effect of removing characters. Its value determines the number of characters to be removed at randomly chosen positions in the string.
- $\texttt{number}\_\texttt{of}\_\texttt{switch}$ : This parameter controls the effect of switching two adjacent characters. Its value determines the number of character pairs to be switched at randomly chosen positions in the string. Switching two characters simulates a typo, e.g. in the title of a bibliographic unit.
- $\texttt{number}\_\texttt{of}\_\texttt{replace}$ : This parameter controls the effect of replacing a character at a randomly chosen position in the string with a randomly chosen new character. Its value determines the number of characters to be replaced.
- $\texttt{number}\_\texttt{of}\_\texttt{add}$ : This parameter controls the effect of adding characters. This is the opposite operation to removing characters. Its value determines the number of characters to be added at randomly chosen positions in the string.
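
The four kinds of modification listed above can be illustrated with a minimal sketch. The helper functions below are hypothetical and are not taken from modify_data_funcs.py. They assume that positions and new characters are drawn uniformly at random and that new characters come from the lowercase ASCII letters.

```python
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'  # assumption: pool of new characters


def delete_chars(text, n, rng):
    """Remove n characters at randomly chosen positions."""
    chars = list(text)
    for _ in range(min(n, len(chars))):
        del chars[rng.randrange(len(chars))]
    return ''.join(chars)


def switch_chars(text, n, rng):
    """Swap n randomly chosen pairs of adjacent characters."""
    chars = list(text)
    if len(chars) < 2:
        return text
    for _ in range(n):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)


def replace_chars(text, n, rng):
    """Overwrite n randomly chosen positions with random characters."""
    chars = list(text)
    if not chars:
        return text
    for _ in range(n):
        chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return ''.join(chars)


def add_chars(text, n, rng):
    """Insert n random characters at randomly chosen positions."""
    chars = list(text)
    for _ in range(n):
        chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
    return ''.join(chars)


rng = random.Random(42)  # fixed seed for a reproducible demonstration
title = 'die zauberflöte'
for modify in (delete_chars, switch_chars, replace_chars, add_chars):
    print(modify.__name__, '->', modify(title, 1, rng))
```

A switch or a replacement may leave the string unchanged, for example when two equal adjacent characters are swapped. Real typos can be just as invisible, so the sketch does not guard against this case.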

In the following subsections, each attribute will be discussed and eventually handled separately.

### coordinate

For attribute $\texttt{coordinate}$, typos of switching two adjacent characters can be imagined. This will be the only kind of manipulation provided for this attribute.

In [10]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 0
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'coordinate_E', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'coordinate_N', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 10 samples two characters of attribute coordinate_E switched.
In 10 samples two characters of attribute coordinate_N switched.


### corporate

For attribute $\texttt{corporate}$, which is a string of free text, all kinds of typos can be imagined. Therefore, all four parameters for manipulation will be set to a value greater than 0.

In [11]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'corporate_full', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 32 samples one character of attribute corporate_full removed.
In 32 samples two characters of attribute corporate_full switched.
In 32 samples one character of attribute corporate_full replaced.
In 32 samples one character of attribute corporate_full added.


### doi

Attribute $\texttt{doi}$ will be compared on identity of the string in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb). The reason is that the value of the attribute is expected to be processed fully electronically, without any additional manual manipulation in Swissbib's data sources. Therefore, the attribute's values are expected to be of flawless quality in their sources, and the training data should express this expectation. Leaving the attribute's values unmodified makes the attribute, if present, a strong indicator of duplicate bibliographic units.

### edition

This attribute holds numbers. Wrongly entered data are hard to simulate, so this attribute will not be manipulated.

### exactDate

This attribute has a fixed length of eight characters. It can hold number digits or markers for unknown data. A typo of switching two adjacent digits can be imagined for this attribute as well as a mistyping of a number digit. The corresponding parameters will be set for data manipulation.

In [12]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 1
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'exactDate', modification_ratio/2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 76 samples two characters of attribute exactDate switched.
In 76 samples one character of attribute exactDate replaced.


### format

Attribute $\texttt{format}\_\texttt{prefix}$ is a code where character modification would produce unknown codes. This need not be expected by the deduplication model. With $\texttt{format}\_\texttt{postfix}$, a number switch may be possible, though.

In [13]:
number_of_delete = 0
number_of_switch = 1
number_of_replace = 0
number_of_add = 0

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'format_postfix', modification_ratio/4,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 38 samples two characters of attribute format_postfix switched.


### isbn

Attribute $\texttt{isbn}$ will be compared on identity of the string. Therefore, this attribute will not be modified to remain a strong indicator of duplicate bibliographical units.

### ismn

The same statement as for attribute $\texttt{isbn}$ holds for attribute $\texttt{ismn}$. The attribute will not be modified.

### musicid

For this attribute, a missing character or an additional character may be possible for a pair of duplicates. Only these two manipulations will be simulated here; all other conceivable manipulations will be omitted.

In [14]:
number_of_delete = 1
number_of_switch = 0
number_of_replace = 0
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'musicid', modification_ratio/4,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 4 samples one character of attribute musicid removed.
In 4 samples one character of attribute musicid added.


### part

Like attribute $\texttt{edition}$, this attribute holds numbers and is hard to manipulate realistically. Wrongly entered data will not be simulated for this attribute.

### person

The $\texttt{person}$ attribute is a typical attribute of manually entered data, prone to typos. For this reason, it will be manipulated in all four possible kinds described.

In [15]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_100', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_700', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'person_245c', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 93 samples one character of attribute person_100 removed.
In 93 samples two characters of attribute person_100 switched.
In 93 samples one character of attribute person_100 replaced.
In 93 samples one character of attribute person_100 added.
In 91 samples one character of attribute person_700 removed.
In 91 samples two characters of attribute person_700 switched.
In 91 samples one character of attribute person_700 replaced.
In 91 samples one character of attribute person_700 added.
In 136 samples one character of attribute person_245c removed.
In 136 samples two characters of attribute person_245c switched.
In 136 samples one character of attribute person_245c replaced.
In 136 samples one character of attribute person_245c added.


### pubinit

This attribute holds names. It will be treated the same way as the $\texttt{person}$ attributes.

In [16]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'pubinit', modification_ratio,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 98 samples one character of attribute pubinit removed.
In 98 samples two characters of attribute pubinit switched.
In 98 samples one character of attribute pubinit replaced.
In 98 samples one character of attribute pubinit added.


### scale

This attribute is used for scaling information of maps. It is sparsely filled and digit 0 is the predominant one. As it is hard to modify the attribute realistically, no modification will be simulated on it.

### ttlfull

Both $\texttt{ttlfull}$ attributes can hold longer string sequences and are prototypical for typos. Both of them will be treated with the same four kinds of manipulation as the $\texttt{person}$ attributes, but with twice the $\texttt{modification}\_\texttt{ratio}$.

In [17]:
number_of_delete = 1
number_of_switch = 1
number_of_replace = 1
number_of_add = 1

goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'ttlfull_245', modification_ratio*2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)
goldstandard_uniques['modified'] = mdf.modify_character_string(
    goldstandard_uniques['modified'], 'ttlfull_246', modification_ratio*2,
    number_of_delete, number_of_switch, number_of_replace, number_of_add)

In 302 samples one character of attribute ttlfull_245 removed.
In 296 samples two characters of attribute ttlfull_245 switched.
In 296 samples one character of attribute ttlfull_245 replaced.
In 296 samples one character of attribute ttlfull_245 added.
In 34 samples one character of attribute ttlfull_246 removed.
In 34 samples two characters of attribute ttlfull_246 switched.
In 34 samples one character of attribute ttlfull_246 replaced.
In 34 samples one character of attribute ttlfull_246 added.


### volumes

For the same reason as for attributes $\texttt{edition}$ and $\texttt{part}$, no modification will be simulated on attribute $\texttt{volumes}$.

## Data Pairing for Duplicates

In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), the records pairing for generating pairs of duplicates was accomplished with the help of attribute $\texttt{035liste}$. For the synthetic data of this chapter, attribute $\texttt{docid}$ is the right key for joining. The reason for this choice is that attribute $\texttt{docid}$ identifies a record in an unambiguous way. It acts like a primary key for a record in the base data. Joining a record with itself on this unique identifier guarantees the generation of a pair of duplicates.

In [18]:
duplicates = pd.merge(left=goldstandard_uniques['original'], right=goldstandard_uniques['modified'], how='inner',
                  left_on='docid', right_on='docid')
# Mark all as duplicates for target vector
duplicates['duplicates'] = 1

print('Number of new duplicate pairs {:,d}'.format(len(duplicates)))

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(duplicates.columns)

duplicates.sample(n=5)

Number of new duplicate pairs 755


Unnamed: 0,docid,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,035liste_y,isbn_y,ttlpart_y,pubyear_y,decade_y,century_y,exactDate_y,edition_y,part_y,pages_y,volumes_y,pubinit_y,pubword_y,scale_y,coordinate_y,doi_y,ismn_y,musicid_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y,duplicates
405,341865079,[(RETROS)oai:agora.ch:apk-002:2015:294::182],[],{'245': ['Blick in die Welt']},2015,2015,2015,2015uuuu,,294 2015,[],,,[],,[],10.5169/seals-515356,,,,,,,,bk,10053,bührerwalter,,[walter bührer],blick in die welt,,[(RETROS)oai:agora.ch:apk-002:2015:294::182],[],{'245': ['Blick in die Welt']},2015,2015,2015,2015uuu6,,294 2015,[],,,[],,[],10.5169/seals-515356,,,,,,,,bk,10053,bührerwalter,,[walter bühbrer],blicki n dieowet,,1
631,504389203,"[(NEBIS)004327684, (HEMU)9473]",[],{'245': ['Die Zauberflöte']},1977,1977,1977,1977uuuu,,,[1 partition de poche (32 p.)],1 32,e. eulenburg,[E. Eulenburg],,[],,m200205343,3714.0,,,,,,mu,10000,mozartwolfgang amadeus,"schikanederemanuel, moszkowiczimo",[musik von] w.a. mozart ; [libretto von] e. sc...,die zauberflöte,,"[(NEBIS)004327684, (HEMU)9473]",[],{'245': ['Die Zauberflöte']},1977,1977,1977,1977uuuu,,,[1 partition de poche (32 p.)],1 32,e. eulenbugr,[E. Eulenburg],,[],,m200205343,o3714,,,,,,mu,10000,mozartwolfgang amadeus,"schikanederemanuel, moszkowwiczimo",[musik von] w.a. mozart ; [libretto von] e. sc...,di ezaubernöte,,1
34,029233585,"[(OCoLC)73515225, (IDSLU)001070664, (OCoLC)735...",[],"{'245': ['Die Zauberflöte', '[KV 620]']}",2003,2003,2003,2003uuuu,,,[1 Partitur],1,,[],,[],,m700241001,4031.0,,,,quatre violons,quatre violons,mu,10100,mozartwolfgang amadeus,,w. a. mozart ; arrangement for four violins by...,"die zauberflöte, [kv 620]",,"[(OCoLC)73515225, (IDSLU)001070664, (OCoLC)735...",[],"{'245': ['Die Zauberflöte', '[KV 620]']}",2003,2003,2003,2003uuu2,,,[1 Partitur],1,,[],,[],,m700241001,401,,,,quatre violons,quatre violons,mu,10100,mozartwoflgang amadeus,,w. a. mozart ; arrkngement for four violins by...,"die zauberflöte, [kv 62a0]",,1
586,500243611,"[(SERSOL)ssib029231197, (WaSeSS)ssib029231197]","[978-3-598-31506-0 (print), 978-3-11-096994-8]",{'245': ['Katalog der Graphischen Porträts in ...,1994,1994,1994,1994uuuu,,26 26,"[1 online resource (iv, 388 p.)]",1 388,de gruyter saur,[de Gruyter Saur],,[],,,,,,,,,bk,20053,mortzfeldpeter,raabepaul,"mortzfeld, peter; raabe, paul",katalog der graphischen porträts in der herzog...,,"[(SERSOL)ssib029231197, (WaSeSS)ssib029231197]","[978-3-598-31506-0 (print), 978-3-11-096994-8]",{'245': ['Katalog der Graphischen Porträts in ...,1994,1994,1994,1994uuuu,,26 26,"[1 online resource (iv, 388 p.)]",1 388,de gruyter qaur,[de Gruyter Saur],,[],,,,,,,,,bk,20053,mortfzerdpeter,rsaabepaul,"mortzfeld, lpeter; raabe, paul",katalog der graphischen porträts in der herzog...,,1
95,07058415X,"[(SNL)vtls001657522, (Sz)001657522]",[],"{'245': ['Domo D'Ossola, Arona']}",1907,1907,1907,1907uuuu,1907.0,23 23 1907,[1 Karte],1,[eidg. landestopographie],[[Eidg. Landestopographie]],100000.0,[N0460833],,,,e0074147,n0460833,,eidgenössische landestopographie,eidgenössische landestopographie,mp,10300,,"dufourguillaume henri, müllhauptheinrich",g.h. dufour direxit ; h. müllhaupt sculpsit,"domo d'ossola, arona","domodossola, arona","[(SNL)vtls001657522, (Sz)001657522]",[],"{'245': ['Domo D'Ossola, Arona']}",1907,1907,1907,1907uuuu,1907.0,23 23 1907,[1 Karte],1,[eidg. landestopographie],[[Eidg. Landestopographie]],100000.0,[N0460833],,,,n0460833,n0406833,,eidgenössische landestopographie,eidgenössische ltndzestpographie,mp,10300,,"dufourguillaume henri, müllhaupheinrich",g.h. dufour direxit ; h. müllhaupet sculpsit,"domo d'ssola, aronra","domodossgola, arona",1


Some sample records of the new synthetic pairs of duplicates are shown above.
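
The self-join on the record key can be illustrated on a toy example. The following sketch uses hypothetical records and a hand-made typo. The argument $\texttt{validate}$ of $\texttt{pd.merge}$ is an optional safeguard that is not used in the cell above; it raises an error if $\texttt{docid}$ is not unique on the left side.

```python
import pandas as pd

# Hypothetical base records; docid acts as the unique record key
original = pd.DataFrame({
    'docid': ['a1', 'b2', 'c3'],
    'ttlfull_245': ['emma', 'die zauberflöte', 'homo faber'],
})

# Copy of the base records with one simulated typo
modified = original.copy()
modified.loc[1, 'ttlfull_245'] = 'die zauebrflöte'

# Joining each record with its modified copy yields one pair per copy
pairs = pd.merge(left=original, right=modified, how='inner',
                 on='docid', validate='one_to_many')

# Every pair is a duplicate by construction
pairs['duplicates'] = 1
print(pairs)
```

The option $\texttt{'one}\_\texttt{to}\_\texttt{many'}$ is chosen because, with oversampling switched on, the modified DataFrame holds several copies of each record.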

## Removing One Side of an Attribute Pair

One additional scenario of data differences within a pair of duplicates can be observed in Swissbib's raw data and has not been covered yet: an empty value on one side of an attribute pair. To cover this scenario and to synthesize more realistic training data, a function has been implemented which randomly removes the value of an attribute on one side of a pair. This function will be applied only to the columns that are relevant for the final feature matrix.

In [19]:
duplicates_columns = columns_metadata_dict['columns_to_use'][:] # Copy list by value, not by reference!
# This attribute is always filled
duplicates_columns.remove('exactDate_x')
duplicates_columns.remove('exactDate_y')
# These attributes are mostly filled
duplicates_columns.remove('format_prefix_x')
duplicates_columns.remove('format_prefix_y')
duplicates_columns.remove('format_postfix_x')
duplicates_columns.remove('format_postfix_y')
# Title is the most important attribute that is hardly missing
duplicates_columns.remove('ttlfull_245_x')
duplicates_columns.remove('ttlfull_245_y')
duplicates_columns.remove('ttlfull_246_x')
duplicates_columns.remove('ttlfull_246_y')
# Target vector is needed
duplicates_columns.remove('duplicates')

duplicates = mdf.remove_one_side_of_attribute_pair(duplicates, duplicates_columns, modification_ratio/5)
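
The implementation of $\texttt{remove}\_\texttt{one}\_\texttt{side}\_\texttt{of}\_\texttt{attribute}\_\texttt{pair}$ is not shown in this chapter. The following minimal sketch illustrates one way such a step could work; it is not the function from modify_data_funcs.py. It assumes that an empty value is represented as NaN, that the side to be emptied is chosen with equal probability, and that the function receives attribute names without the suffixes $\texttt{\_x}$ and $\texttt{\_y}$. The name $\texttt{blank}\_\texttt{one}\_\texttt{side}$ is hypothetical.

```python
import numpy as np
import pandas as pd


def blank_one_side(df, base_names, ratio, seed=0):
    """Empty either the _x or the _y value in a fraction of the rows.

    Hypothetical helper: for each attribute, a share `ratio` of the rows
    is drawn, and in each drawn row exactly one side is set to NaN.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows = int(round(ratio * len(out)))
    for name in base_names:
        rows = rng.choice(out.index.to_numpy(), size=n_rows, replace=False)
        left = rng.random(n_rows) < 0.5  # True: empty _x, False: empty _y
        out.loc[rows[left], name + '_x'] = np.nan
        out.loc[rows[~left], name + '_y'] = np.nan
    return out


# Toy pairs with identical values on both sides
toy = pd.DataFrame({
    'person_100_x': ['austenjane'] * 10,
    'person_100_y': ['austenjane'] * 10,
})
result = blank_one_side(toy, ['person_100'], ratio=0.4)
print(result)
```

In this sketch, a row never loses both sides of the same attribute, so every affected pair keeps one filled value to compare against an empty one.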

## Append Synthetic Duplicates to Feature Base

To be able to append the synthetic pairs of duplicates to the feature base of chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), attributes $\texttt{docid}\_\texttt{x}$ and $\texttt{docid}\_\texttt{y}$ have to be provided, with the goal of obtaining a DataFrame with the same columns as the feature base DataFrame loaded at the beginning of this chapter.

In [20]:
duplicates.rename(columns={'docid' : 'docid_x'}, inplace=True)
duplicates['docid_y'] = duplicates['docid_x']
duplicates.sample(n=5)

Unnamed: 0,docid_x,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,corporate_full_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,035liste_y,isbn_y,ttlpart_y,pubyear_y,decade_y,century_y,exactDate_y,edition_y,part_y,pages_y,volumes_y,pubinit_y,pubword_y,scale_y,coordinate_y,doi_y,ismn_y,musicid_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,corporate_full_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y,duplicates,docid_y
96,70584168,"[(SNL)vtls001657523, (Sz)001657523]",[],"{'245': ['Domo D'Ossola, Arona']}",1909,1909,1909,1909uuuu,1907.0,23 23 1909,[1 Karte],1,[eidg. landestopographie],[[Eidg. Landestopographie]],100000.0,[N0460833],,,,e0074147,n0460833,,eidgenössische landestopographie,eidgenössische landestopographie,mp,10300,,"dufourguillaume henri, müllhauptheinrich",g.h. dufour direxit ; h. müllhaupt sculpsit,"domo d'ossola, arona","domodossola, arona","[(SNL)vtls001657523, (Sz)001657523]",[],"{'245': ['Domo D'Ossola, Arona']}",1909,1909,1909,1909uuuu,1907.0,23 23 1909,[1 Karte],1,[eidg. landestopographie],[[Eidg. Landestopographie]],100000.0,[N0460833],,,,n0460833,n0460833,,eidgenössische landestopographie,eidgenössische landestopographie,mp,10300,,"dufourguillaume henri, müllhauptheinrich",g.h. duofudr direxit ; h. müllhaupt sculpsit,"domo d'ossola, arona",domodossolac arona,1,70584168
397,339446072,"[(OCoLC)925099140, (SGBN)001330111]",[978-3-944063-26-3],{'245': ['Die Zauberflöte']},2014,2014,2014,2014uuuu,,,[1 Compact Disc (67:03 Min.)],1 67 03,amor verlag,[Amor Verlag],,[],,,,,,,,,mu,30100,mozartwolfgang amadeus,,wolfgang amadeus mozart,die zauberflöte,,"[(OCoLC)925099140, (SGBN)001330111]",[978-3-944063-26-3],{'245': ['Die Zauberflöte']},2014,2014,2014,2014uuuu,,,[1 Compact Disc (67:03 Min.)],1 67 03,amor verlag,[Amor Verlag],,[],,,,,,,,,mu,30100,mozartwolfgang amadeus,,wolfgang amadeus mozart,die zauerfögte,,1,339446072
450,360974066,"[(OCoLC)837113254, (IDSLU)001237506]",[],{'245': ['Mörike-Lieder']},1965,1965,1965,1965uuuu,,,[1 Schallplatte],1,,[],,[],,,138.0,,,,,,mu,40000,wolfhugo,"learevelyn, werbaerik, wolfhugo",hugo wolf,mörike-lieder,,"[(OCoLC)837113254, (IDSLU)001237506]",[],{'245': ['Mörike-Lieder']},1965,1965,1965,1965uuuu,,,[1 Schallplatte],1,,[],,[],,,138.0,,,,,,mu,40000,wolfhuog,"learevelyn, werbaerik, nwolfhugo",hugo olf,mökrike-lideen,,1,360974066
674,504389726,"[(IDSBB)004969781, (NEBIS)009471728]",[978-3-13-127285-0],{'245': ['EKG-Kurs für Isabel']},2009,2009,2009,2009uuuu,5.0,,[312 S.],312,,[],,[],,,,,,,,,bk,20000,schusterhans-peter,trappehans-joachim,"hans-peter schuster, hans-joachim trappe",ekg-kurs für isabel,,"[(IDSBB)004969781, (NEBIS)009471728]",[978-3-13-127285-0],{'245': ['EKG-Kurs für Isabel']},2009,2009,2009,2009uuuu,5.0,,[312 S.],312,,[],,[],,,,,,,,,bk,20000,schusterhans-peter,trappeahns-joachim,"hans-peter schuster, hans-joachim trappl",ekg-akurs für isabel,,1,504389726
88,69281742,"[(SNL)vtls001519747, (Sz)001519747]",[],"{'245': ['Dufourkarten', 'topographische Karte...",2007,2007,2007,2007uuuu,,,[2 DVDs],2,bundesamt für landestopografie swisstopo,[Bundesamt für Landestopografie Swisstopo],100000.0,[N0475132],,,,e0055009,n0475132,,schweizbundesamt für landestopografie,schweizbundesamt für landestopografie,mp,10347,,dufourguillaume henri,,"dufourkarten, topographische karte der schweiz",topographische karte der schweiz,"[(SNL)vtls001519747, (Sz)001519747]",[],"{'245': ['Dufourkarten', 'topographische Karte...",2007,2007,2007,2007uuuu,,,[2 DVDs],2,bundesamt für landestopognafie swisstopo,[Bundesamt für Landestopografie Swisstopo],100000.0,[N0475132],,,,n0475132,n4075132,,schweizbundesamt für landestopografie,schweizbundesamt für landestopograife,mp,10347,,dufuorguillaume henri,,"dufourkarten, topugraphische karte der schweiz",topographische karte deh schweiz,1,69281742
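
Before the append, it can be worth verifying that both DataFrames expose the same set of columns, because $\texttt{pd.concat}()$ fills columns that are missing on one side with NaN values instead of raising an error. The following is a minimal sketch on toy data. The helper name `column_mismatch` and the toy frames are assumptions for illustration and are not part of this notebook.

```python
import pandas as pd

def column_mismatch(left, right):
    """Return the columns found only in left and only in right (hypothetical helper)."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

# Toy stand-ins for the feature base and the synthetic duplicates
toy_feature_base = pd.DataFrame({'docid_x': ['1'], 'docid_y': ['2'], 'duplicates': [0]})
toy_duplicates = pd.DataFrame({'docid_x': ['3'], 'duplicates': [1]})

missing_in_duplicates, missing_in_base = column_mismatch(toy_feature_base, toy_duplicates)
print(missing_in_duplicates)  # columns still to be added to the duplicates frame
print(missing_in_base)        # columns that the feature base does not know
```

Two empty lists indicate that the frames can be concatenated without introducing unintended NaN columns.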


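Now, the synthetic records of pairs of duplicates are ready and can be appended to the feature base of chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). The append only takes place if parameter $\texttt{oversampling}$ is greater than 0 and the notebook is not run in manual execution mode. Otherwise, the feature base remains unchanged.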

In [21]:
if (oversampling > 0) and (execution_mode!='manual') :
    frames = [df_feature_base, duplicates]

    df_feature_base = pd.concat(frames, sort=True)
    # Set unique values on index
    df_feature_base.reset_index(inplace=True, drop=True)

print(df_feature_base.shape)

print('\nAmount of duplicates (1) and uniques (0)')
print(df_feature_base.duplicates.value_counts())
print('\nPart of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base.duplicates.value_counts(normalize=True), 1))

(259428, 64)

Amount of duplicates (1) and uniques (0)
0    257955
1      1473
Name: duplicates, dtype: int64

Part of duplicates (1) and uniques (0) in units of [%]
0    99.4
1     0.6
Name: duplicates, dtype: float64


## Downsampling

The amount of data for training the models leads to very long runtimes when fitting e.g. a [Support Vector Classifier Model](./7_SVCModel.ipynb). To reduce the runtime, a randomly chosen sample can be used instead of the full data generated. This process is called downsampling and can be implemented in two ways.

### Downsampling without Rebalancing

The first kind is a downsampling without rebalancing the distribution of the classes of uniques and duplicates. It samples from the entire feature data set, regardless of the class a record of pairs belongs to. The class ratio is therefore preserved approximately. This downsampling is done with function $\texttt{.sample}()$ of library pandas.
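
The effect can be illustrated on toy data. The sketch below is not part of the notebook's pipeline. The toy DataFrame and the fraction of 0.5 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy feature base: 9,900 pairs of uniques (0) and 100 pairs of duplicates (1)
toy = pd.DataFrame({'duplicates': [0] * 9900 + [1] * 100})

# Draw half of all rows, ignoring the class label
toy_down = toy.sample(frac=0.5, replace=False, random_state=0)

print(len(toy_down))  # 5000 rows remain
# Share of duplicates in [%], expected to stay close to the original 1.0
print(round(100 * toy_down['duplicates'].mean(), 1))
```

The number of remaining rows is fixed by the fraction, while the share of duplicates only stays close to the original value on average, as the rows are drawn at random.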

In [22]:
print(df_feature_base.shape)

print('\nAmount of duplicates (1) and uniques (0)')
print(df_feature_base.duplicates.value_counts())
print('\nPart of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base.duplicates.value_counts(normalize=True), 1))

(259428, 64)

Amount of duplicates (1) and uniques (0)
0    257955
1      1473
Name: duplicates, dtype: int64

Part of duplicates (1) and uniques (0) in units of [%]
0    99.4
1     0.6
Name: duplicates, dtype: float64


The full amount of data is downsampled to a fraction that is passed by the global parameter $\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$, set either in the first code cell of this chapter or passed as a run parameter from chapter [Overview and Summary](./0_OverviewSummary.ipynb), depending on the calling environment of the notebook.

In [23]:
df_feature_base_down_no_rebalance = df_feature_base.sample(
    frac=sampling_fraction_nreb, # target resample fraction
    replace=False, # sample without replacement
    weights=None, # keep same balancing of classes
    random_state=0 # for reproducibility
)

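The data now shows the following totals and ratios. With $\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ set to 1, as in this run, all rows are kept and the totals are identical to those above.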

In [24]:
print('Amount of duplicates (1) and uniques (0)')
print(df_feature_base_down_no_rebalance.duplicates.value_counts())
print('\nPart of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base_down_no_rebalance.duplicates.value_counts(normalize=True), 1))
print('\nShape of the feature base')
print(df_feature_base_down_no_rebalance.shape)

Amount of duplicates (1) and uniques (0)
0    257955
1      1473
Name: duplicates, dtype: int64

Part of duplicates (1) and uniques (0) in units of [%]
0    99.4
1     0.6
Name: duplicates, dtype: float64

Shape of the feature base
(259428, 64)


### Downsampling with Rebalancing

As an alternative, the downsampling can rebalance the distribution of the classes of uniques and duplicates. In the implementation of this capstone project, only the records of class uniques are sampled down, while all records of pairs of duplicates, including the generated ones, are kept. This increases the ratio of class duplicates. During training, a model has to find the threshold that separates uniques from duplicates, and the records with pairs close in similarity to each other decide on the accuracy of the model. The records of class duplicates are therefore considered more precious, while the records of pairs of uniques hold less information for a learning model due to their generation procedure.
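
The shift in the class ratio can be made concrete with a small example. The sketch below uses toy data and a hypothetical fraction of 0.1, both chosen for illustration only.

```python
import pandas as pd

# Toy feature base: 1,000 pairs of uniques (0) and 10 pairs of duplicates (1)
toy = pd.DataFrame({'duplicates': [0] * 1000 + [1] * 10})

# Downsample the uniques only; the fraction 0.1 is a hypothetical value
toy_uniques = toy[toy['duplicates'] == 0].sample(frac=0.1, replace=False, random_state=0)
# Keep all pairs of duplicates
toy_rebalanced = pd.concat([toy_uniques, toy[toy['duplicates'] == 1]])

counts = toy_rebalanced['duplicates'].value_counts()
print(counts[0], counts[1])  # 100 uniques, 10 duplicates
# Share of duplicates in [%]: 10 / 110 instead of 10 / 1010
print(round(100 * counts[1] / len(toy_rebalanced), 1))
```

In this toy case, the share of duplicates rises from about 1.0% to about 9.1%, although no duplicate record has been added.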

In [25]:
print(df_feature_base.shape)

print('\nAmount of duplicates (1) and uniques (0)')
print(df_feature_base.duplicates.value_counts())
print('\nPart of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base.duplicates.value_counts(normalize=True), 1))

(259428, 64)

Amount of duplicates (1) and uniques (0)
0    257955
1      1473
Name: duplicates, dtype: int64

Part of duplicates (1) and uniques (0) in units of [%]
0    99.4
1     0.6
Name: duplicates, dtype: float64


The pairs of uniques are downsampled to a fraction that is passed by the global parameter $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$, set either in the first code cell of this chapter or passed as a run parameter from chapter [Overview and Summary](./0_OverviewSummary.ipynb), depending on the calling environment of the notebook.

In [26]:
# Only resample pairs of uniques
df_feature_base_down_w_rebalance = df_feature_base[df_feature_base['duplicates']==0].sample(
    frac=sampling_fraction_reb, # target resample fraction
    replace=False, # sample without replacement
    weights=None, # equal sampling probability for every row
    random_state=0 # for reproducibility
)
# Concatenation with full pairs of duplicates
df_feature_base_down_w_rebalance = pd.concat(
    [df_feature_base_down_w_rebalance, df_feature_base[df_feature_base['duplicates']==1]], sort=True)

The totals and ratios of the resampled data are shown below.

In [27]:
print('Amount of duplicates (1) and uniques (0)')
print(df_feature_base_down_w_rebalance.duplicates.value_counts())
print('\nPart of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base_down_w_rebalance.duplicates.value_counts(normalize=True), 1))
print('\nShape of the feature base')
print(df_feature_base_down_w_rebalance.shape)

Amount of duplicates (1) and uniques (0)
0    257955
1      1473
Name: duplicates, dtype: int64

Part of duplicates (1) and uniques (0) in units of [%]
0    99.4
1     0.6
Name: duplicates, dtype: float64

Shape of the feature base
(259428, 64)


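As a final step, the feature base of this chapter is replaced by the downsampled data, provided that one of the sampling fractions is below 1. If both fractions are below 1, the downsampling without rebalancing takes precedence.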

In [28]:
if (sampling_fraction_nreb < 1) and (execution_mode!='manual') :
    df_feature_base_down_no_rebalance.reset_index(inplace=True, drop=True)
    df_feature_base = df_feature_base_down_no_rebalance
elif (sampling_fraction_reb < 1) and (execution_mode!='manual') :
    df_feature_base_down_w_rebalance.reset_index(inplace=True, drop=True)
    df_feature_base = df_feature_base_down_w_rebalance

## Summary

The fraction of duplicate records generated from Swissbib's goldstandard is low, below 0.6%. This raises the need to increase it. This chapter increases the amount of duplicates with the help of artificial records. The base records from Swissbib's goldstandard are loaded, modified slightly, and finally paired to generate the desired amount of synthetic data records for training and performance testing.

In [29]:
print('Number of rows in training set : {:,d}'.format(len(df_feature_base)))
print('Number of rows with pairs of duplicates in training set : {:,d}'.format(len(df_feature_base[df_feature_base.duplicates==1])))
print('Ratio : {:.2f}%'.format(100*len(df_feature_base[df_feature_base.duplicates==1])/len(df_feature_base)))

Number of rows in training set : 259,428
Number of rows with pairs of duplicates in training set : 1,473
Ratio : 0.57%


### Goldstandard DataFrame Handover

The DataFrame for the feature base may have been extended with additional rows or downsampled in this chapter. The result is saved into a compressed pickle file to hand the data over to the following chapters. The next chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) reads this file as its input. The metadata dictionary has not been modified in this chapter and therefore does not need to be stored again.

In [30]:
# Store into compressed intermediary file
with bz2.BZ2File(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                 'w') as df_output_file:
    pk.dump(df_feature_base, df_output_file)
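
For reference, a file written this way can be read back with the same two libraries. The following round trip is a minimal sketch on a toy DataFrame in a temporary directory. It assumes that `pk` is an alias for the standard library module `pickle`, and it does not claim to reproduce the loading code of the next chapter.

```python
import bz2
import os
import pickle as pk
import tempfile

import pandas as pd

# Toy stand-in for the feature base
toy = pd.DataFrame({'docid_x': ['1', '2'], 'docid_y': ['1', '3'], 'duplicates': [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'feature_base_df.pkl')
    # Write the compressed pickle file in the same way as above
    with bz2.BZ2File(file_path, 'w') as df_output_file:
        pk.dump(toy, df_output_file)
    # Read the compressed pickle file back into a DataFrame
    with bz2.BZ2File(file_path, 'rb') as df_input_file:
        toy_restored = pk.load(df_input_file)

# True if the restored DataFrame equals the stored one
print(toy_restored.equals(toy))
```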

Chapter [Overview and Summary](./0_OverviewSummary.ipynb) assesses the predictions of the various models. One measure to analyse the results is the confusion matrix of a model prediction. The confusion matrix reveals the cases in the testing data for which the model predicts the opposite of the target value. These predictions are called false positives and false negatives. To be able to analyse the original attribute values of the affected records in detail, columns $\texttt{035liste}\_\texttt{x}$, $\texttt{035liste}\_\texttt{y}$, $\texttt{docid}\_\texttt{x}$, and $\texttt{docid}\_\texttt{y}$ will need to be restored. These four attributes are now saved to a separate DataFrame to be reloaded in chapter [Overview and Summary](./0_OverviewSummary.ipynb).

In [31]:
# Store docid's for fast identification of row pairs in model results
df_index_docids = df_feature_base[['035liste_x', '035liste_y', 'docid_x', 'docid_y']]
    
# Binary intermediary DataFrame file for docid's, save for chapter 0
with open(os.path.join(path_goldstandard, 'index_docids_df.pkl'), 'wb') as df_output_file:
    pk.dump(df_index_docids, df_output_file)
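
How such an index DataFrame may help to trace false positives and false negatives back to their records can be sketched as follows. The toy data, the names `y_true` and `y_pred`, and the assumption that the row index of the predictions matches the row index of the index DataFrame are illustrative only. The actual analysis in chapter [Overview and Summary](./0_OverviewSummary.ipynb) may be implemented differently.

```python
import pandas as pd

# Toy index DataFrame with the same four columns as df_index_docids
toy_index_docids = pd.DataFrame({
    '035liste_x': [['(A)1'], ['(A)2'], ['(A)3'], ['(A)4']],
    '035liste_y': [['(B)1'], ['(B)2'], ['(B)3'], ['(B)4']],
    'docid_x': ['1', '2', '3', '4'],
    'docid_y': ['1', '5', '3', '6'],
})

# Hypothetical target values and model predictions, aligned on the row index
y_true = pd.Series([1, 0, 1, 0], index=toy_index_docids.index)
y_pred = pd.Series([1, 1, 0, 0], index=toy_index_docids.index)

# Predicted duplicate although the target says unique
false_positives = toy_index_docids[(y_true == 0) & (y_pred == 1)]
# Predicted unique although the target says duplicate
false_negatives = toy_index_docids[(y_true == 1) & (y_pred == 0)]

print(false_positives[['docid_x', 'docid_y']])
print(false_negatives[['docid_x', 'docid_y']])
```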