In [1]:
factor = 1.0
exactDate_mode = 'added_u'

# Feature Matrix Generation

In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), Swissbib's goldstandard data has been processed into records that pair either two duplicate or two unique bibliographic records. These records are the starting point for the final feature matrix generation, which is why the DataFrame was called the feature base. As described in [[JudACaps](./A_References.ipynb#judacaps)], the next step is an attribute-wise comparison of each attribute pair of each record in the feature base. This comparison generates a similarity value for each attribute pair and thereby halves the number of attributes, replacing each attribute pair with one value that expresses its degree of similarity. The goal of this chapter is a DataFrame with the full and final set of feature attributes. The values of these feature attributes will be used for training and performance testing of the machine learning models in the chapters to follow.

This chapter introduces similarity metrics for string comparisons. For each attribute pair of the DataFrame built in the previous chapters, the metric used to calculate its similarity is decided individually.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
    - [Mathematical Definitions](#Mathematical-Definitions)
    - [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [Table of Contents of Attribute Similarities](#Table-of-Contents-of-Attribute-Similarities)
- [DataFrame with Attributes and Similarity Features](#DataFrame-with-Attributes-and-Similarity-Features)
- [Summary](#Summary)
    - [Full Feature Matrix with Target Vector Handover](#Full-Feature-Matrix-with-Target-Vector-Handover)

## Data Takeover

Swissbib's raw data of the goldstandard has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, this data is loaded for further processing to the feature matrix and target vector for the subsequent machine learning model chapters.

In [2]:
import os
import pandas as pd
import pickle as pk
import bz2

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore DataFrame with features from compressed pickle file
# (the standard pickle module already uses its C implementation in Python 3)
with bz2.BZ2File(os.path.join(
    path_goldstandard, 'feature_base_df.pkl'), 'rb') as file:
    df_feature_base = pk.load(file)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,035liste_x,035liste_y,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,corporate_full_x,corporate_full_y,decade_x,decade_y,docid_x,docid_y,doi_x,doi_y,duplicates,edition_x,edition_y,exactDate_x,exactDate_y,format_postfix_x,format_postfix_y,format_prefix_x,format_prefix_y,isbn_x,isbn_y,ismn_x,ismn_y,masters_docid,musicid_x,musicid_y,pages_x,pages_y,part_x,part_y,person_100_x,person_100_y,person_245c_x,person_245c_y,person_700_x,person_700_y,pubinit_x,pubinit_y,pubword_x,pubword_y,pubyear_x,pubyear_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,ttlpart_x,ttlpart_y,volumes_x,volumes_y
0,[(RERO)R006024329],[(RETROS)oai:agora.ch:apk-002:2005:284::157],uuuu,2005,,,,,[],[],,,,,,,uuuu,2005,247704504,318419599,,10.5169/seals-377277,0,,,uuuuuuuu,2005uuuu,20000,10053,bk,bk,[],[],,,50438953X,,,[22 p.],[],,284 2005,mozartwolfgang amadeus,bührerwalter,wolfgang amadeus mozart ; emanuel schikaneder ...,[walter bührer],"mendelhermann, schikanederemanuel",,s. mode's verlag (gustav mode),,[S. Mode's Verlag (Gustav Mode)],[],uuuuuuuu,2005,,,"die zauberflöte, oper in zwei akten",blick in die welt,,,"{'245': ['Die Zauberflöte', 'Oper in zwei Akte...",{'245': ['Blick in die Welt']},22,
1,"[(SNL)vtls001092110, (Sz)001092110]",[(RERO)R008551762],1996,2016,,,,,[],[],,,,,,,1996,2016,065307801,403228271,,,0,,,19969999,2016uuuu,30300,20053,cr,bk,[],[978-2-226-31734-6],,,504389955,,,[ v.],[1 livre électronique (e-book)],,,,moriartyliane,,liane moriarty ; traduit de l'anglais (austral...,,taupeaubéatrice,universitätsverlag,albin michel,[Universitätsverlag],[Albin Michel],19969999,2016,,,"bildungsforschung und bildungspraxis, educatio...","petits secrets, grands mensonges","educazione e ricerca., education et recherche....",,{'245': ['Bildungsforschung und Bildungspraxis...,"{'245': ['Petits secrets, grands mensonges']}",,1.0
2,[(RERO)1553399],"[(OCoLC)1001961995, (NEBIS)011047950]",1984,2017,,,,,[],[],,,interkantonale lehrmittelzentrale (luzern),,interkantonale lehrmittelzentrale (luzern),,1984,2017,212984225,500162255,,,0,,7.0,19849999,2017uuuu,20000,20000,bk,bk,[],[978-3-13-240799-2],,,504389599,,,[ v.],[334 Seiten],,,,trappehans-joachim,sigrid kessler... [et al.] ; [éd.:] interkanto...,"hans-joachim trappe, hans-peter schuster",kesslersigrid,schusterhans-peter,staatlicher lehrmittelverlag,,[Staatlicher Lehrmittelverlag],[],19849999,2017,,,"bonne chance!, cours de langue française, troi...",ekg-kurs für isabel,,,"{'245': ['Bonne chance!', 'cours de langue fra...",{'245': ['EKG-Kurs für Isabel']},,334.0
3,"[(VAUD)991004649259702852, (RNV)008339403-41bc...","[(OCoLC)945563378, (NEBIS)010612600]",2016,2016,,,,,[],[],,,,,,,2016,2016,41487336X,358479975,,,0,,,2016uuuu,2016uuuu,20000,20000,bk,bk,[978-2-07-046833-1],[978-0-7294-1157-8],,,504388916,,,[153 p.],[332 Seiten],3870 3870,13,voltaire,voltaire,voltaire ; éd. établie et annotée par jacques ...,voltaire ; [sous la direction de diego venturino],"van den heuveljacques, sollersphilippe",venturinodiego,gallimard,,[Gallimard],[],2016,2016,,,"traité sur la tolérance, à l'occasion de la mo...","siècle de louis xiv (v), chapitres 25-30",,,"{'245': ['Traité sur la tolérance', 'à l'occas...","{'245': ['Siècle de Louis XIV (V)', 'chapitres...",153,332.0
4,"[(OCoLC)806968650, (SGBN)000443345]","[(OCoLC)612031656, (IDSBB)001972916]",2000,1981,,,,,[],[],,,the metropolitan chorus und orchestra,,the metropolitan chorus und orchestra,,2000,1981,048543322,97287970,,,0,,3.0,2000uuuu,1981uuuu,10300,20000,vm,bk,[],[3-442-33001-7],,,504388967,73.0,,[1 DVD (ca. 169 Min.)],[252 S.],,33001 33001,mozartwolfgang amadeus,mozartwolfgang amadeus,w. a. mozart ; dir.: james livine ; the metrop...,wolgang amadeus mozart ; dieser opernführer wu...,"levinejames, hockneydavid","schikanederemanuel, pahlenkurt",deutsche grammophon,,[Deutsche Grammophon],[],2000,1981,,,die zauberflöte,die zauberflöte,,,{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},1 169,252.0


Now, the feature base can be reduced to the columns used for processing.

In [3]:
# The DataFrame of pairs with target information
df_feature_base = df_feature_base[columns_metadata_dict['columns_to_use']]

df_feature_base.sample(n=5)

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_full_x,corporate_full_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,ismn_x,ismn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
169615,1,,,,,,,,,,,2007uuuu,2007uuuu,mu,mu,10200,10200,[],[],m006450510,m006450510,4553.0,455.0,,,mozartwolfgang amadeus,nmozartwovfgang amadeus,,,wolfgang amadeus mozart ; libretto: emanuel sc...,wolfgang amadeusm ozart ; libretto: emanuel sc...,,,,,"die zauberflöte, eine deutsche oper in zwei au...","die zauberflöte, eine deutscxhe oper in zwei a...",,,1,1
129042,1,,,,,,,,,,,19201929,19201992,mu,mu,10200,10200,[],[],,,1215470.0,1215470.0,,,mozartwolfgang amadeus,mozartwolfgang amadeus,mozartwolfgang amadeus,moazrtworlfgang madeus,von w.a. mozart ; für pianoforte zu vier hände...,von w.a. mozart ; für pianoforte zu vier hände...,c.f. peters,mr.f. peters,,,"die zauberflöte, oper in 2 akten","deizaubeerflöte, oper in 2 akten",,,1 104,1 104
37327,0,,,,,,"interkantonale lehrmittelzentrale (rapperswil,...",,,,,1960uuuu,1996uuuu,mu,mu,10200,30000,[],[],,,10425.0,,,,mozartwolfgang amadeus,,"soldankurt, mozartwolfgang amadeus",,w. a. mozart ; nach dem in der preussischen st...,[hrsg.] interkantonale lehrmittelzentrale rapp...,c.f. peters,berner lehrmittel- und medienverl.,,,"die zauberflöte, oper in 2 aufzügen","bonne chance !, cours de langue française 2",,,1 188,3
92723,0,,,,,,schweizerische normen-vereinigung,,,,,1987uuuu,2017uuuu,bk,bk,20000,20053,[],[],,,,,,,,,,,,,[staatlicher lehrmittelverlag],,,,"bonne chance !, cours de langue française prem...",health informatics - personal health device co...,,medizinische informatik - kommunikation von ge...,134,1
110258,1,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-491-96266-8],[978-3-491-96266-8],,,,,,,austenjane,austenjane,,,jane austen,jane asuten,albatros,albarso,,,emma,emlm,,,450,450


In [4]:
print('Number of rows labelled as duplicates {:,d}'.format(len(df_feature_base[
    df_feature_base.duplicates==1])))
print('Number of rows labelled as uniques {:,d}'.format(len(df_feature_base[
    df_feature_base.duplicates==0])))
print('Total number of rows in DataFrame {:,d}'.format(df_feature_base.shape[0]))

Number of rows labelled as duplicates 67,158
Number of rows labelled as uniques 103,182
Total number of rows in DataFrame 170,340


In [5]:
print('Part of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base.duplicates.value_counts(normalize=True), 2))

Part of duplicates (1) and uniques (0) in units of [%]
0    60.57
1    39.43
Name: duplicates, dtype: float64


The feature base DataFrame is the starting point for the further processing in this chapter.

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms, distance and similarity. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

### Mathematical Definitions

The attributes to be used for pair comparison may contain values of poor quality. Poor quality originates in the way the data has been entered at the source. Manual data entry may suffer from mistyping, while automatically scanned data may suffer from deficiencies of the scanned base material or of the recognition algorithm in the optical character recognition (OCR) processing. The basic step of a deduplication process is to estimate how likely the two strings of a pair are to represent duplicates. This is done by calculating a similarity value between the two strings, rather than using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the two values are duplicates.

The term similarity is strongly coupled to the term of distance of two values of an attribute. Mathematically, a distance can be explained with the help of a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangular inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.
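The four requirements can be checked numerically for a concrete metric. The following minimal sketch uses the Hamming distance on equal-length strings. This choice is made here purely for illustration and is not one of the metrics selected later in this chapter. The helper names `hamming` and `satisfies_metric_axioms` are hypothetical, and the check only covers the finite sample passed in.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def satisfies_metric_axioms(dist, objects) -> bool:
    """Brute-force check of the four distance requirements on a finite sample."""
    for o_i, o_j, o_k in product(objects, repeat=3):
        if dist(o_i, o_i) != 0:                               # distance to itself
            return False
        if dist(o_i, o_j) < 0:                                # non-negativity
            return False
        if dist(o_i, o_j) != dist(o_j, o_i):                  # symmetry
            return False
        if dist(o_i, o_j) > dist(o_i, o_k) + dist(o_k, o_j):  # triangular inequality
            return False
    return True


# Coordinate-like sample strings, used for illustration only
sample = ["e0080851", "e0080855", "e0074500", "n0460833"]
print(satisfies_metric_axioms(hamming, sample))
```

A function that fails one of the requirements, such as the squared Hamming distance, is rejected by the same check because it violates the triangular inequality.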

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$, calculating $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, assuming the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attribute values, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends upon the type of data that are compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends upon the type of data that are compared.

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
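The two formulas can be illustrated with a minimal sketch. The function names are hypothetical, and the records are plain dictionaries holding a handful of attribute values borrowed from one row of the feature base shown earlier. Exact equality is used to decide whether an attribute matches, which is a simplifying assumption.

```python
def mismatch_dissimilarity(obj_i: dict, obj_j: dict) -> float:
    """d = (p - m) / p over the attributes that describe both objects."""
    attributes = obj_i.keys() & obj_j.keys()
    p = len(attributes)
    if p == 0:
        raise ValueError("objects share no attributes")
    # m counts the attributes whose values match exactly
    m = sum(obj_i[name] == obj_j[name] for name in attributes)
    return (p - m) / p


def match_similarity(obj_i: dict, obj_j: dict) -> float:
    """sim = 1 - d = m / p."""
    return 1.0 - mismatch_dissimilarity(obj_i, obj_j)


record_x = {"format_prefix": "bk", "exactDate": "2016uuuu",
            "person_100": "voltaire", "volumes": "153"}
record_y = {"format_prefix": "bk", "exactDate": "2016uuuu",
            "person_100": "voltaire", "volumes": "332"}

print(mismatch_dissimilarity(record_x, record_y))  # one mismatch out of four attributes
print(match_similarity(record_x, record_y))
```

With $p=4$ attributes and $m=3$ matches, the sketch yields $d=0.25$ and $sim=0.75$. The rest of this chapter refines the idea by replacing the exact match per attribute with a graded similarity value.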

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although there is a correspondence between a similarity function and the mathematical concept of a distance function, not all known and implemented similarity comparison functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangular inequality. The decision on the best similarity function for a string pair will be based on the effect the similarity function has for the purpose at hand. In the case of this capstone project, this purpose is its capability to contribute to the prediction of whether a pair of records is a pair of duplicates or a pair of uniques.

### Library TextDistance

An internet search on string distance calculation with Python has revealed the libraries [[StSi](./A_References.ipynb#stsi)], [[TeDi](./A_References.ipynb#tedi)] and separate code snippets for distinct algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the text distance library [[TeDi](./A_References.ipynb#tedi)] has been chosen for this capstone project. The decision is based on the GitHub statistics of stars and the date of the latest pull requests, which indicate the popularity and maintenance activity of the library. A look at its API shows the library to be comprehensive (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and easy to use.

In [6]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance



For using the library, see documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, function $\texttt{.normalized}\_\texttt{similarity}()$ of an instantiated textdistance object will be used.

In [7]:
import textdistance as tedi

With the code line above, the library is imported for application in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb) the effects of the similarity metrics of the library are compared for better understanding of their specific behaviour. This comparison for each attribute is the basis of deciding the best similarity metric available for an attribute pair.

## Similarity Metrics on Attribute Level

This section implements the decision for calculating the similarity metric for each attribute of the raw data based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The implementation is applied on a pair of attributes of two records, resulting in a new attribute, the similarity value, of the final feature matrix. A general function $\texttt{.build}\_\texttt{delta}\_\texttt{feature}()$ is provided by the code file [data_preparation_funcs.py](./data_preparation_funcs.py) for transforming two attributes into their feature attribute holding their similarity value.

In [8]:
import data_preparation_funcs as dpf

The dictionary and the list of the following code cell will be filled by function $\texttt{.build}\_\texttt{delta}\_\texttt{feature}()$.

In [9]:
columns_metadata_dict['similarity_metrics'] = {}
columns_metadata_dict['features'] = []

### Table of Contents of Attribute Similarities

- [coordinate](#coordinate)
- [corporate](#corporate)
- [doi](#doi)
- [edition](#edition)
- [exactDate](#exactDate)
- [format](#format)
- [isbn](#isbn)
- [ismn](#ismn)
- [musicid](#musicid)
- [part](#part)
- [person](#person)
- [pubinit](#pubinit)
- [scale](#scale)
- [ttlfull](#ttlfull)
- [volumes](#volumes)

### coordinate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{coordinate}$ holds coordinates of maps. To decide whether two maps cover the same geographical range, a metric is wanted that compares the digits of the coordinate values from left to right: the more leading digits are found to be equal, the higher the similarity value should be. The LCS (longest common substring) metric comes close to this behaviour. It scores a pair by the length of the longest run of consecutive characters that both strings share, which for coordinates that differ only in their trailing digits is exactly the common prefix. The metric generates the wanted result, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb).

In [10]:
attribute = 'coordinate'

columns_metadata_dict['similarity_metrics'][attribute+'_E'] = tedi.LCSStr()
columns_metadata_dict['similarity_metrics'][attribute+'_N'] = tedi.LCSStr()

ne_values = ['_E', '_N']

for ne in ne_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+ne,
        columns_metadata_dict['similarity_metrics'][attribute+ne],
        columns_metadata_dict)

The length of attribute $\texttt{coordinate}$ is exactly eight digits. The distinct similarity values that may occur form a discrete set of values with a distance of $\frac{1}{8}$ between adjacent values.
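The steps of $\frac{1}{8}$ follow from dividing the length of the longest common substring by the string length of eight. The following sketch reproduces this calculation with Python's standard library only. It is an illustrative stand-in for the normalised LCS similarity and not the code of the textdistance library. The function name is hypothetical, and treating two empty strings as identical is an assumption.

```python
from difflib import SequenceMatcher


def lcs_substring_similarity(s1: str, s2: str) -> float:
    """Length of the longest common substring divided by the longer input length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return match.size / longest


print(lcs_substring_similarity("e0080851", "e0080855"))  # 7 of 8 characters shared
print(lcs_substring_similarity("n0460833", "0n460833"))  # 6 of 8 characters shared
```

The second pair illustrates that the shared run does not have to start at the first character: the common substring `460833` begins at the third position of both strings.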

In [11]:
uniques, uniques_len = {}, {}

for ne in ne_values :
    uniques[attribute+ne], uniques_len[attribute+ne] = dpf.determine_similarity_values(
        df_feature_base, attribute+ne)

coordinate_E values range [0.    0.125 0.25  0.375 0.5   0.625 0.875 1.   ]
coordinate_N values range [0.    0.375 0.5   0.625 0.75  0.875 1.   ]


Looking at some samples of the feature matrix reveals a good match to the expectations.

In [12]:
position = 3

for ne in ne_values :
    dpf.show_samples_interval(
        df_feature_base, attribute+ne,
        uniques[attribute+ne][uniques_len[attribute+ne]-position],
        uniques[attribute+ne][uniques_len[attribute+ne]-position+1]
    )

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
104047,1,0.875,e0080851,e0080855
11796,0,0.875,e0080851,e0080855
38906,0,0.875,e0080855,e0080851
48430,0,0.875,e0080855,e0080851
104036,1,0.875,e0080855,e0080851


0.625 <= coordinate_E_delta <= 0.875


Unnamed: 0,duplicates,coordinate_N_delta,coordinate_N_x,coordinate_N_y
142520,1,0.75,n0460833,0n460833
142925,1,0.75,n0460833,0n460833
113203,1,0.75,n0460833,0n460833
112943,1,0.75,n0460833,0n460833
34900,0,0.75,n0460821,n0460833


0.75 <= coordinate_N_delta <= 0.875


The samples above show the wanted similarity behaviour for value ranges greater than 0. The metric has the weakness, though, that empty coordinate values, e.g. for bibliographic units other than maps, have each been calculated to a similarity of 0. Some samples for duplicates in the training data are shown below.

In [13]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    attribute+'_E', uniques[attribute+'_E'][0], uniques[attribute+'_E'][1], 10)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
122683,1,0.0,,
109992,1,0.0,,
130683,1,0.0,,
139573,1,0.0,,
143099,1,0.0,,
147578,1,0.0,,
129442,1,0.0,,
112033,1,0.0,,
163589,1,0.0,,
169979,1,0.0,,


0.0 <= coordinate_E_delta <= 0.125


This downside shall be avoided by marking pairs with missing coordinate values with a special negative value, which points out the special case of missing information in a row to the models to be trained. This logic is implemented in function $\texttt{.mark}\_\texttt{missing}()$. The absolute value of the negative number is controlled by a factor which is passed to the function as a parameter. The function explicitly handles two cases. The first one is missing information in both attributes (resulting in $-1*\texttt{factor}$) and the second one is missing information in only one of the two attributes (resulting in $-0.5*\texttt{factor}$).
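The two cases can be sketched for a single pair of values as follows. This is a hypothetical stand-in for the described logic and not the code of $\texttt{.mark}\_\texttt{missing}()$, which operates on the whole DataFrame. Which values count as missing (here `None`, NaN and empty strings or lists) is an assumption.

```python
def mark_missing_value(value_x, value_y, similarity: float, factor: float = 1.0) -> float:
    """Return the similarity, or a negative marker if one or both values are missing."""

    def is_missing(value) -> bool:
        # Assumption: None, NaN and empty strings/lists all count as missing
        if value is None:
            return True
        if isinstance(value, float) and value != value:  # NaN is not equal to itself
            return True
        return hasattr(value, "__len__") and len(value) == 0

    missing = is_missing(value_x) + is_missing(value_y)
    if missing == 2:
        return -1.0 * factor   # information missing on both sides
    if missing == 1:
        return -0.5 * factor   # information missing on one side only
    return similarity          # both values present: keep the calculated similarity


print(mark_missing_value("", "", 0.0))
print(mark_missing_value("e0080851", "", 0.0))
print(mark_missing_value("e0080851", "e0080855", 0.875))
```

Because regular similarity values lie in the interval $[0, 1]$, the negative markers cannot be confused with a calculated similarity, and a larger factor moves them further away from that interval.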

In [14]:
for ne in ne_values :
    df_feature_base = dpf.mark_missing(df_feature_base, attribute+ne, factor)

### corporate

Attribute $\texttt{corporate}$ is a collection of corporate names. The Monge-Elkan metric compares string tokens pairwise [[Chri2012](./A_References.ipynb#chri2012)] while the LCS metric searches for the longest common substring. Assessing the differences of these two metrics with the help of their values distribution in chapter [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb), reveals a better distribution behaviour for LCS. Therefore, the LCS metric will be chosen for this attribute.

In [15]:
attribute = 'corporate_full'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.LCSStr()
#tedi.StrCmp95()
#tedi.MongeElkan()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [16]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

corporate_full values range [0.         0.01333333 0.01428571 0.01639344 0.01666667 0.01694915
 0.01818182 0.02       0.02173913 0.02272727 0.02380952 0.02531646
 0.02542373 0.02666667 0.02702703 0.02727273 0.02857143 0.03
 0.03030303 0.03125    0.03278689 0.03333333 0.03389831 0.03448276
 0.03571429 0.03636364 0.03773585 0.03797468 0.03846154 0.03921569
 0.04       0.04081633 0.04166667 0.04237288 0.04255319 0.04285714
 0.04347826 0.04444444 0.04545455 0.046875   0.04761905 0.04878049
 0.04918033 0.05       0.05263158 0.05333333 0.05405405 0.05454545
 0.05555556 0.05660377 0.05714286 0.05882353 0.05932203 0.06
 0.06060606 0.06122449 0.0625     0.06329114 0.06363636 0.06382979
 0.06451613 0.06521739 0.06557377 0.06666667 0.06779661 0.06818182
 0.06896552 0.07       0.07142857 0.07272727 0.07317073 0.075
 0.0754717  0.07627119 0.07692308 0.078125   0.07843137 0.07894737
 0.08       0.08108108 0.08163265 0.08181818 0.08333333 0.08474576
 0.08510638 0.08571429 0.08695652 0.08888889 0.0909

The $110$ part of attribute $\texttt{corporate}$ is sparsely filled, and even its $710$ part is filled for only a little more than $10\%$ of the records. The LCS metric generates a similarity of 1 for the cases where both strings of a pair are empty. Missing values on both sides may be an indicator for a pair of duplicates but, due to the sparsely available information, it is a weak indicator. Therefore, the pairs with missing data on both sides will be marked with the negative value.

In [17]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

Some sample cases are shown below for the $\texttt{corporate}$ feature.

In [18]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    attribute, 0.0, 1.0, 20
)

Unnamed: 0,duplicates,corporate_full_delta,corporate_full_x,corporate_full_y
163240,1,0.54955,eidgenössisches topographisches bureau eidgenö...,eidgenössisches btopographisches bureau eidgen...
125318,1,1.0,"interkantonale lehrmittelzentrale (rapperswil,...","interkantonale lehrmittelzentrale (rapperswil,..."
142179,1,0.636364,eidgenössische landestopographie,eidgenössische landesstopographie
128496,1,1.0,"interkantonale lehrmittelzentrale (rapperswil,...","interkantonale lehrmittelzentrale (rapperswil,..."
146205,1,1.0,arts florissants,arts florissants
144467,1,0.424242,eidgenössische landestopographie,eidgenössischel andestopographdie
140263,1,1.0,schweizeidgenössisches topographisches bureau,schweizeidgenössisches topographisches bureau
133043,1,0.666667,"interkantonale lehrmittelzentrale (rapperswil,...","interkantonmle lehrmittelzentrale (rapperswil,..."
104619,1,1.0,schweizerische normen-vereinigung,schweizerische normen-vereinigung
129681,1,0.583333,"interkantonale lehrmittelzentrale (rapperswil,...","interkantonale lehrmitteyzentrale (rapperswil,..."


0.0 <= corporate_full_delta <= 1.0


In [19]:
position = uniques_len[attribute]//2 # Let's have a look in the middle range of the similarities.

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 20)

Unnamed: 0,duplicates,corporate_full_delta,corporate_full_x,corporate_full_y
116118,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,unwiersytet im. adama mickiewica w poznaniu
145284,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwersytet im. adama mickiiewicza w pznaniu
132179,1,0.613636,"staatsoper (wien)chor, wiener philharmoniker","staatsoper (wien)chor, wienr philharmonyiker"
145297,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwersytet im. adama mickiweicza w poznaniu
159995,1,0.612903,philharmonia orchestra (london),philharmoni orchestra (london)
116097,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwersytet im. adama mickiqwicza w pozaniu
148358,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwesrytet im. dama mickiewicza w poznaniu
118413,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwersytet im. adama mickiawiczaw poznaniu
132129,1,0.613333,"metropolitan opera (new york)chorus, metropoli...","metropolitan opera (new yorkchorus, metropolit..."
116100,1,0.613636,uniwersytet im. adama mickiewicza w poznaniu,uniwersytet im. adama mickiwiczayw poznaniu


0.6129032258064516 <= corporate_full_delta <= 0.6136363636363636


### doi

Swissbib uses an explicit $\texttt{doi}$ attribute for its deduplication implementation. In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), the actual DOI identifier has been isolated with the help of a preprocessing function $\texttt{.reduce}\_\texttt{to}\_\texttt{doi}\_\texttt{element}()$, see [Data Analysis](./1_DataAnalysis.ipynb). Attribute $\texttt{doi}$ contains a single string value. The Identity metric will be used for comparing the string values of a pair in a row, calculating a similarity value of 1.0 or 0.0 for each pair. If only one of the two values is empty, a value of 0 is returned.
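The behaviour of an identity comparison can be sketched in a few lines. The function below is an illustrative stand-in and not the code of the textdistance library. Its name is hypothetical, and the sample values are taken from the goldstandard data of this chapter.

```python
def identity_similarity(s1: str, s2: str) -> float:
    """Return 1.0 if both strings are exactly equal, otherwise 0.0."""
    return 1.0 if s1 == s2 else 0.0


pairs = [
    ("10.5169/seals-376732", "10.5169/seals-376732"),  # same identifier
    ("", "10.5169/seals-377362"),                      # one side empty
    ("", ""),                                          # both sides empty
]

for doi_x, doi_y in pairs:
    print(repr(doi_x), repr(doi_y), identity_similarity(doi_x, doi_y))
```

Two empty strings are exactly equal as well, so a plain identity comparison cannot tell a pair without any identifier apart from a pair with matching identifiers.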

In [20]:
attribute = 'doi'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base['doi_delta'].unique()

array([0., 1.])

Some sample cases are shown below for each category of $\texttt{doi}\_\texttt{delta}$.

In [21]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

doi values range [0. 1.]


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
6431,0,0.0,,10.5169/seals-377362
64360,0,0.0,,10.5169/seals-377392
74807,0,0.0,,10.5169/seals-377392
13187,0,0.0,,10.5169/seals-377079
18604,0,0.0,10.1055/b-005-143650,
22972,0,0.0,,10.5169/seals-376961
47947,0,0.0,,10.5169/seals-376572
84985,0,0.0,,10.5169/seals-377079
58590,0,0.0,10.1093/cid/cir669,
45,0,0.0,,10.5169/seals-376732


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
154353,1,1.0,,
68622,0,1.0,,
78037,0,1.0,,
101630,0,1.0,,
90814,0,1.0,,
38485,0,1.0,,
86306,0,1.0,,
26955,0,1.0,,
21576,0,1.0,,
5268,0,1.0,,


doi_delta = 1.0


In [22]:
# Let's have a look at some non-empty doi elements
df_doi_with_element = df_feature_base[df_feature_base.doi_x.apply(lambda x : len(x) > 0)]

for doi_delta_value in df_feature_base['doi_delta'].unique():
    # Limit the sample size by the rows available in the filtered DataFrame
    number_of_max_samples = min(
        10,
        len(df_doi_with_element[df_doi_with_element['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_doi_with_element, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
8774,0,0.0,10.1093/cid/cir669,
49134,0,0.0,10.1093/cid/ciu795,
61669,0,0.0,10.1093/cid/cir669,
61099,0,0.0,10.1007/978-3-642-41698-9,
32154,0,0.0,10.1093/cid/ciu795,
63260,0,0.0,10.5167/uzh-53042,
74464,0,0.0,10.1007/978-3-642-41698-9,
35950,0,0.0,10.1093/cid/cir669,
648,0,0.0,10.1093/ndt/gft319,
85694,0,0.0,10.1093/cid/cir669,


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
136154,1,1.0,10.5169/seals-376732,10.5169/seals-376732
163557,1,1.0,10.1055/b-005-143650,10.1055/b-005-143650
137973,1,1.0,10.5169/seals-377422,10.5169/seals-377422
137366,1,1.0,10.5169/seals-377218,10.5169/seals-377218
136012,1,1.0,10.5169/seals-376645,10.5169/seals-376645
136108,1,1.0,10.5169/seals-376689,10.5169/seals-376689
136859,1,1.0,10.5169/seals-377028,10.5169/seals-377028
137687,1,1.0,10.5169/seals-377332,10.5169/seals-377332
164675,1,1.0,10.5451/unibas-006499413,10.5451/unibas-006499413
139832,1,1.0,10.5169/seals-515343,10.5169/seals-515343


doi_delta = 1.0


As can be seen above, a value of 1.0 is returned if both strings of a pair are empty. As attribute $\texttt{doi}$ is sparsely filled, see chapter [Data Analysis](./1_DataAnalysis.ipynb), an unmodified $\texttt{doi}\_\texttt{delta}$ would strongly indicate a pair of duplicates for most rows, merely because both values are missing. To avoid such a misleading indication of identity, function $\texttt{.mark}\_\texttt{missing}()$ is applied to the attribute.

In [23]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### edition

In its original form in Swissbib's raw data, the edition statement is a string value which may have several words. The modelling on this attribute has been tried with and without stripping letter characters from the string. The final decision for the best processing will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb). A Jaccard similarity is tried for this attribute.
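A Jaccard similarity relates the size of the intersection of two collections to the size of their union. The sketch below applies it to the characters of two strings, treated as multisets. This is an assumption that is consistent with values such as 0.25 for the pair 6 and 1869 in the samples of this section, but it is not the code of the textdistance library. The function name is hypothetical, and scoring identical inputs, including two empty strings, with 1 is likewise an assumption.

```python
from collections import Counter


def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity on character multisets: |intersection| / |union|."""
    if s1 == s2:
        return 1.0  # assumption: identical inputs, including two empty strings, score 1
    bag1, bag2 = Counter(s1), Counter(s2)
    intersection = sum((bag1 & bag2).values())  # minimum count per character
    union = sum((bag1 | bag2).values())         # maximum count per character
    return intersection / union


print(jaccard_similarity("6", "1869"))   # 1 shared character out of 4
print(jaccard_similarity("10", "1899"))  # 1 shared character out of 5
print(jaccard_similarity("3", ""))       # nothing shared
```

Because only character counts enter the calculation, the order of the characters is ignored, so an edition number and a year that happen to share a digit receive a similarity above 0.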

In [24]:
attribute = 'edition'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [25]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

import numpy as np

edition_delta_uniques = np.sort(df_feature_base['edition_delta'].unique())
edition_delta_uniques_len = len(edition_delta_uniques)
print('edition values range', edition_delta_uniques[:30])

edition values range [0.         0.125      0.14285714 0.16666667 0.2        0.25
 0.28571429 0.33333333 0.4        0.5        0.6        1.        ]
edition values range [0.         0.125      0.14285714 0.16666667 0.2        0.25
 0.28571429 0.33333333 0.4        0.5        0.6        1.        ]


The comparison results in a number of distinct similarity values for the goldstandard data set. Below, some examples are shown.

In [26]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position],
    edition_delta_uniques[edition_delta_uniques_len-position+2], 10)

position = edition_delta_uniques_len//2

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
133179,1,1.0,,
154773,1,1.0,,
48864,0,1.0,,
147247,1,1.0,,
82057,0,1.0,,
44062,0,1.0,,
86993,0,1.0,,
108689,1,1.0,,
21131,0,1.0,,
20154,0,1.0,,


0.6 <= edition_delta <= 1.0


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
53457,0,0.0,3.0,
38335,0,0.0,,2.0
95504,0,0.0,,2.0
69422,0,0.0,,4.0
19033,0,0.0,,1907.0
50865,0,0.0,,1907.0
52765,0,0.0,2.0,3.0
81528,0,0.0,,4.0
28739,0,0.0,8.0,
66662,0,0.0,3.0,7.0


0.0 <= edition_delta <= 0.1428571428571428


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
54798,0,0.25,6,1869
80111,0,0.25,5,1885
84702,0,0.25,1863,3
86933,0,0.2,10,1899
87530,0,0.2,10425,1
98735,0,0.2,10425,2
640,0,0.25,1,1994
5871,0,0.25,4,1943
25429,0,0.2,10425,1
49262,0,0.25,2,2007


0.19999999999999996 <= edition_delta <= 0.25


Again, for $\texttt{edition}\_\texttt{delta} = 1$, many empty values of the $\texttt{edition}$ attribute can be observed. These pairs will be marked with a special negative value in order to distinguish them from pairs of matching, non-empty attribute values.

In [27]:
df_feature_base = dpf.mark_missing(df_feature_base, 'edition', factor)

In [28]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
109858,1,1.0,2,2
163867,1,1.0,4,4
163543,1,1.0,6,6
141323,1,1.0,1907,1907
156945,1,1.0,1,1
121639,1,1.0,2,2
121752,1,1.0,1994,1994
159737,1,1.0,10,10
163056,1,1.0,1863,1863
105716,1,1.0,7,7


0.6 <= edition_delta <= 1.0


### exactDate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{exactDate}$ holds a year number stored in its first four digits. Letter 'u' is used as a placeholder for an unknown digit. Additionally, the attribute may hold month and day information or a second year in its last four digits.

The attribute will be kept as a string and will not be transformed to an integer. The feature attribute of the record pair to be compared will be calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity will be stored in a new attribute $\texttt{exactDate}\_\texttt{delta}$ which will be taken for the model calculation.

As can be seen in chapter [Decision Tree Model](./6_DecisionTreeModel.ipynb), this attribute is important for prediction. Different ways of increasing the weight of the unknown status of a digit have been tried. The different ways can be seen in the implementations below. The algorithm decided for the final simulation will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb).

In [29]:
import string

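def no_xor (x_side, y_side) :
    """Count the positions where at least one string holds a letter
    and the two characters differ."""
    number = 0
    for i in range(len(x_side)) :
        has_letter = (x_side[i] in string.ascii_lowercase) or (y_side[i] in string.ascii_lowercase)
        if has_letter and (x_side[i] != y_side[i]) :
            number = number + 1
    return number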

print('Example comparison results in a value of', no_xor ('202a0aaa', '1920uuuu'))

Example comparison results in a value of 5


In [30]:
attribute = 'exactDate'

# Replace letter 'u' with letter 'a' in the left-hand string only.
#  As an effect, a position holding a letter instead of a numerical digit
#  in either string never matches and adds 0 to the Hamming similarity.
df_feature_base[attribute+'_x'] = df_feature_base.exactDate_x.str.replace('u', 'a')

# Compute Hamming similarity for the exactDate string pair.
columns_metadata_dict['similarity_metrics'][attribute] = tedi.Hamming()

unknown_share = 16

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

if exactDate_mode == 'added_u':
    # Add amount of 1/16 to Hamming similarity for every letter digit.
    #  But only maximum number of letter digits in both strings of a pair.
    df_feature_base[attribute+'_delta'] = df_feature_base[[
        attribute+'_x', attribute+'_y', attribute+'_delta']].apply(
        lambda x : x[attribute+'_delta'] + 
        max(x[attribute+'_x'].count('a'), x[attribute+'_y'].count('u'))/unknown_share, axis=1
    )
elif exactDate_mode == 'xor':
    # Add amount of 1/16 to Hamming similarity for every letter digit.
    #  But only number of position-wise xor-ed letter digits in the two strings of a pair.
    df_feature_base[attribute+'_delta'] = df_feature_base[[
        attribute+'_x', attribute+'_y', attribute+'_delta']].apply(
        lambda x : x[attribute+'_delta'] + 
        no_xor(x[attribute+'_x'], x[attribute+'_y'])/unknown_share, axis=1
    )

In [31]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']].sample(n=10)

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
30885,2004aaaa,2002uuuu,0.625
158038,1960aaaa,1960uuuu,0.75
149734,2017aaaa,2017uuuu,0.75
48850,1987aaaa,2007uuuu,0.375
43990,2016aaaa,1907uuuu,0.25
62391,19702006,1980uuuu,0.625
10237,2005aaaa,2006uuuu,0.625
5255,1987aaaa,2011uuuu,0.25
137160,2001aaaa,2001uuuu,0.75
132597,1994aaaa,1994uuuu,0.75
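
The sample values can be retraced by hand. The following sketch re-implements the `'added_u'` mode in plain Python under two assumptions: that the Hamming metric yields the share of matching positions for strings of equal length, and that the bonus for unknown digits is added exactly as in the cell above. The names `hamming_similarity` and `exact_date_delta` are hypothetical helpers introduced for illustration only.

```python
def hamming_similarity(left, right):
    """Share of positions with identical characters (assumes equal length)."""
    matches = sum(a == b for a, b in zip(left, right))
    return matches / max(len(left), len(right))


def exact_date_delta(left, right, unknown_share=16):
    """Hypothetical re-implementation of the 'added_u' mode."""
    # Replace 'u' with 'a' on the left-hand side, so unknown digits never match.
    left_marked = left.replace('u', 'a')
    base = hamming_similarity(left_marked, right)
    # Add 1/unknown_share per unknown digit, using the larger count of both sides.
    bonus = max(left_marked.count('a'), right.count('u')) / unknown_share
    return base + bonus


print(exact_date_delta('2004uuuu', '2002uuuu'))  # 3/8 + 4/16 = 0.625
print(exact_date_delta('1960uuuu', '1960uuuu'))  # 4/8 + 4/16 = 0.75
print(exact_date_delta('20172016', '20172016'))  # 8/8 + 0/16 = 1.0
```

With this weighting, a pair that agrees on the year but has no further date information reaches 0.75, while only a pair of fully known, identical strings reaches 1.0.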


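Pairs of equal strings result in a similarity value of 1. Because letter 'u' has been replaced with letter 'a' on the left-hand side, the comparison below only selects pairs without unknown digits.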

In [32]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']][
    df_feature_base.exactDate_x == df_feature_base.exactDate_y
].sort_values('exactDate_delta', ascending=False).head()

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
3747,20151475,20151475,1.0
149265,20172016,20172016,1.0
149256,20172016,20172016,1.0
149257,20172016,20172016,1.0
149259,20172016,20172016,1.0


A discrete set of different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [33]:
exactDate_deltas = np.sort(df_feature_base.exactDate_delta.unique())
exactDate_deltas

array([0.    , 0.125 , 0.25  , 0.3125, 0.375 , 0.4375, 0.5   , 0.5625,
       0.625 , 0.6875, 0.75  , 0.875 , 1.    ])

In [34]:
sample_size = 5

for i in exactDate_deltas :
    dpf.show_samples_distinct(df_feature_base, 'exactDate', i, sample_size)
    print(f'exactDate_delta = {i}')

Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
42733,0,0.0,19791999,20012004
59198,0,0.0,18901897,20022003
74597,0,0.0,20062005,18971989
61180,0,0.0,20091990,19942008
12322,0,0.0,20092005,15501850


exactDate_delta = 0.0


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
22105,0,0.125,19819999,18701880
100073,0,0.125,18801890,19952006
50037,0,0.125,20092005,19001950
74880,0,0.125,20151475,18761920
14147,0,0.125,20111201,19201929


exactDate_delta = 0.125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
16711,0,0.25,2000aaaa,1763uuuu
73488,0,0.25,1950aaaa,2001uuuu
72068,0,0.25,1991aaaa,2006uuuu
76236,0,0.25,2012aaaa,18801900
100711,0,0.25,2011aaaa,1955uuuu


exactDate_delta = 0.25


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
17042,0,0.3125,20092005,192uuuuu
18262,0,0.3125,2009aaaa,181uuuuu
64630,0,0.3125,2007aaaa,181uuuuu
83936,0,0.3125,1959aaaa,200uuuuu
21576,0,0.3125,2016aaaa,192uuuuu


exactDate_delta = 0.3125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
43682,0,0.375,1987aaaa,1879uuuu
723,0,0.375,18761920,1991uuuu
78363,0,0.375,1836aaaa,1997uuuu
2918,0,0.375,2005aaaa,1902uuuu
67273,0,0.375,20071990,1905uuuu


exactDate_delta = 0.375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
61978,0,0.4375,170aaaaa,19881862
69973,0,0.4375,1932aaaa,188uuuuu
75273,0,0.4375,170aaaaa,1982uuuu
12438,0,0.4375,183aaaaa,1993uuuu
97333,0,0.4375,1984aaaa,189uuuuu


exactDate_delta = 0.4375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
5374,0,0.5,1959aaaa,1984uuuu
81261,0,0.5,1932aaaa,19989999
71608,0,0.5,1991aaaa,1906uuuu
34450,0,0.5,1900aaaa,1996uuuu
7351,0,0.5,1959aaaa,1987uuuu


exactDate_delta = 0.5


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
25712,0,0.5625,18501875,189uuuuu
118498,1,0.5625,193aaaaa,19u3uuuu
24238,0,0.5625,2015aaaa,200uuuuu
70700,0,0.5625,1993aaaa,192uuuuu
90091,0,0.5625,1987aaaa,193uuuuu


exactDate_delta = 0.5625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
84920,0,0.625,2009aaaa,2008uuuu
88597,0,0.625,2011aaaa,2017uuuu
110324,1,0.625,2010aaaa,201u0uuu
87310,0,0.625,1981aaaa,1987uuuu
96078,0,0.625,1991aaaa,1990uuuu


exactDate_delta = 0.625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
106885,1,0.6875,189aaaaa,189uuuuu
133783,1,0.6875,192aaaaa,192uuuuu
134219,1,0.6875,188aaaaa,188uuuuu
118546,1,0.6875,193aaaaa,193uuuuu
106886,1,0.6875,189aaaaa,189uuuuu


exactDate_delta = 0.6875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
108352,1,0.75,1983aaaa,1983uuuu
128692,1,0.75,2004aaaa,2004uuuu
121446,1,0.75,2011aaaa,2011uuuu
151666,1,0.75,1986aaaa,1986uuuu
137785,1,0.75,2008aaaa,2008uuuu


exactDate_delta = 0.75


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
122074,1,0.875,19942008,19972008
120976,1,0.875,19601969,12601969
106580,1,0.875,19951995,49951995
118375,1,0.875,19819999,11819999
122527,1,0.875,19982008,19952008


exactDate_delta = 0.875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
164733,1,1.0,17001799,17001799
149256,1,1.0,20172016,20172016
158303,1,1.0,19001996,19001996
158391,1,1.0,19001999,19001999
160908,1,1.0,19911794,19911794


exactDate_delta = 1.0


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split up into two new attributes $\texttt{format}\_\texttt{prefix}$ and $\texttt{format}\_\texttt{postfix}$, each of which will be compared with a different similarity metric.

- As the quality of $\texttt{format}\_\texttt{prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format}\_\texttt{postfix}$, a q-gram based comparison will be chosen.
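
A q-gram based comparison can be sketched as follows. Both strings are cut into overlapping substrings of length $q$, here $q = 2$, and the Jaccard similarity is computed on the resulting multisets. The helpers `qgrams` and `qgram_jaccard` as well as the example codes are hypothetical and serve illustration only; the handling of strings shorter than $q$ and of two empty strings is an assumption of this sketch.

```python
from collections import Counter


def qgrams(text, q=2):
    """Multiset of all overlapping substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_jaccard(left, right, q=2):
    """Jaccard similarity on q-gram multisets (illustrative sketch)."""
    left_grams, right_grams = qgrams(left, q), qgrams(right, q)
    union = sum((left_grams | right_grams).values())
    if union == 0:
        # Assumption: no q-grams on either side counts as fully similar.
        return 1.0
    return sum((left_grams & right_grams).values()) / union


# Hypothetical six-character codes.
print(qgram_jaccard('020000', '040100'))  # 1/9, about 0.111
print(qgram_jaccard('030000', '020000'))  # 3/7, about 0.429
```

Compared with an identity comparison, the q-gram approach rewards codes that share parts of their structure, even if they differ in single positions.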

In [35]:
attribute = 'format'

columns_metadata_dict['similarity_metrics'][attribute+'_prefix'] = tedi.Identity()
columns_metadata_dict['similarity_metrics'][attribute+'_postfix'] = tedi.Jaccard(qval=2)

pfix_values = ['_prefix', '_postfix']

for pf in pfix_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pf,
        columns_metadata_dict['similarity_metrics'][attribute+pf],
        columns_metadata_dict)

In [36]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_prefix', i)
    print(f'format_prefix_delta = {i}')

Unnamed: 0,duplicates,format_prefix_delta,format_prefix_x,format_prefix_y
42839,0,0.0,mu,vm
9184,0,0.0,bk,mu
31005,0,0.0,mp,bk
53157,0,0.0,mu,bk
33563,0,0.0,cf,mu


format_prefix_delta = 0.0


In [37]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_postfix', i)
    print(f'format_postfix_delta = {i}')

Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
103059,0,0.111111,20000,40100
65426,0,0.111111,20053,30100
51821,0,0.111111,20000,10053
33188,0,0.111111,20000,40100
78667,0,0.111111,30600,20000


format_postfix_delta = 0.11111111111111116


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
58930,0,0.428571,30000,20000
69122,0,0.428571,20000,10000
11085,0,0.428571,30600,30100
79771,0,0.428571,20000,20053
90595,0,0.428571,20053,20000


format_postfix_delta = 0.4285714285714286


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
79271,0,0.25,20000,20353
10324,0,0.25,10347,10000
93205,0,0.25,10000,10253
36272,0,0.25,30500,20053
29224,0,0.25,20000,20347


format_postfix_delta = 0.25


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
35781,0,0.0,20000,30653
42480,0,0.0,20000,30653
76880,0,0.0,20000,10347
78584,0,0.0,10347,20053
363,0,0.0,10200,30653


format_postfix_delta = 0.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
118883,1,1.0,10200,10200
167008,1,1.0,20000,20000
134974,1,1.0,20053,20053
25335,0,1.0,10300,10300
148455,1,1.0,20053,20053


format_postfix_delta = 1.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
131011,1,0.666667,20000,200000
138428,1,0.666667,10100,100100
125689,1,0.666667,20000,200000
161080,1,0.666667,20000,200000
112483,1,0.666667,10300,100300


format_postfix_delta = 0.6666666666666666


### isbn

Swissbib compares each string element of the $\texttt{isbn}$ list of one record with each string element of the $\texttt{isbn}$ list of the other record. If two bibliographic units have at least one element in common, this is interpreted as a strong indicator for duplicates [[WiCo2001](./A_References.ipynb#wico2001)].

This hard logic is used in a modified form in this capstone project. A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn}()$ has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. In line with Swissbib's implementation, the Identity metric is used for the string comparison, yielding a similarity value of 1.0 or 0.0 for each pair of list elements. For normalisation, the sum of the similarity values is divided by the number of elements of the smaller list. If both lists are empty, a value of 1.0 is returned. If only one list is empty, a value of 0.0 is returned.
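
The rules described above can be sketched in a few lines. The function `isbn_list_similarity` is a hypothetical stand-in written from the description in this section; it is not the code of $\texttt{.build}\_\texttt{delta}\_\texttt{isbn}()$, whose details may differ.

```python
def isbn_list_similarity(left, right):
    """List comparison with identity metric, following the rules in the text.

    Hypothetical sketch, not the project's implementation.
    """
    if not left and not right:
        return 1.0  # both lists empty
    if not left or not right:
        return 0.0  # only one list empty
    # Identity metric: 1.0 for equal strings, 0.0 otherwise, summed over all pairs.
    hits = sum(1.0 for x in left for y in right if x == y)
    # Normalise by the number of elements of the smaller list.
    return hits / min(len(left), len(right))


left = ['978-3-13-127286-7', '978-3-13-150826-3 (PDF)']
right = ['978-3-13-127286-7', '3-13-127286-4']
print(isbn_list_similarity(left, right))  # 0.5
```

One common element out of two on the smaller side results in a value of 0.5.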

In [38]:
attribute = 'isbn'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base[attribute+'_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{isbn}\_\texttt{delta}$.

In [39]:
for isbn_delta_value in df_feature_base['isbn_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['isbn_delta']==isbn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'isbn', isbn_delta_value, number_of_max_samples)
    print(f'isbn_delta = {isbn_delta_value}')

Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
13017,0,1.0,[],[]
165045,1,1.0,[],[]
154182,1,1.0,"[978-3-598-31806-1 (print), 978-3-11-096146-1]","[978-3-598-31806-1 (print), 978-3-11-096146-1]"
135322,1,1.0,[],[]
119728,1,1.0,[],[]
124802,1,1.0,[],[]
34478,0,1.0,[],[]
96767,0,1.0,[],[]
114142,1,1.0,[],[]
154942,1,1.0,"[978-3-598-31798-9 (print), 978-3-11-096277-2]","[978-3-598-31798-9 (print), 978-3-11-096277-2]"


isbn_delta = 1.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
81079,0,0.0,[3-495-47879-5 (Gb.)],[3-598-31516-3]
48969,0,0.0,[978-3-7255-6535-1],[]
8981,0,0.0,[],"[978-3-13-127284-3, 3-13-127284-8]"
97319,0,0.0,[],"[3-906721-57-4 (livre), 3-292-00059-9 (disque ..."
30347,0,0.0,[0-7294-0744-6],[]
80904,0,0.0,[],[3-906721-35-3]
41470,0,0.0,[],[3-15-002620-2]
71158,0,0.0,[],"[978-3-598-31514-5 (print), 978-3-11-096914-6]"
87460,0,0.0,[3-495-47879-5],[]
8242,0,0.0,[3-13-127284-8],"[978-3-598-31512-1 (print), 978-3-11-096916-0]"


isbn_delta = 0.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
104381,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
104383,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
104387,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
104377,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
104392,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
104384,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"


isbn_delta = 0.5


For attribute $\texttt{isbn}$, the special marking of missing values is omitted.

### ismn

This attribute will be processed with the identity similarity metric. The reasoning for this decision is the same as for similar attributes above. 

In [40]:
attribute = 'ismn'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()
#tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [41]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

ismn values range [0. 1.]


In [42]:
for ismn_delta_value in df_feature_base[attribute+'_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base[attribute+'_delta']==ismn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'ismn', ismn_delta_value, number_of_max_samples)
    print(f'ismn_delta = {ismn_delta_value}')

Unnamed: 0,duplicates,ismn_delta,ismn_x,ismn_y
150173,1,1.0,,
65871,0,1.0,,
131687,1,1.0,,
61496,0,1.0,,
73874,0,1.0,,
65092,0,1.0,,
44346,0,1.0,,
69332,0,1.0,,
74727,0,1.0,,
36951,0,1.0,,


ismn_delta = 1.0


Unnamed: 0,duplicates,ismn_delta,ismn_x,ismn_y
7631,0,0.0,m006546749,
64102,0,0.0,"m006546756 (kritischer bericht, leinen)",
7673,0,0.0,"m006546756 (kritischer bericht, leinen)",
45222,0,0.0,,9790006450510
60729,0,0.0,,m006450510
20807,0,0.0,m006546756,
50074,0,0.0,"m006546756 (kritischer bericht, leinen)",
82357,0,0.0,"m006546756 (kritischer bericht, leinen)",
76090,0,0.0,m200205343,
36637,0,0.0,"m006546756 (kritischer bericht, leinen)",


ismn_delta = 0.0


As can be seen in the previous chapters, attribute $\texttt{ismn}$ is sparsely filled. Many pairs of missing values result in a value of 1.0 with the chosen similarity metric. To mark these cases specifically, they will be transformed to a negative value.

In [43]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### musicid

Chapter [Data Analysis](./1_DataAnalysis.ipynb) shows that attribute $\texttt{musicid}$ is an identifier for a music record. A Jaccard metric has been tested on this attribute, resulting in many high similarity values for pairs of uniques. After comparing this result with the LCS metric, the latter has been chosen.
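
As an illustration of the longest common substring idea, the sketch below divides the length of the longest shared substring by the length of the longer string. It uses Python's standard `difflib` module instead of the library used in this notebook; the helper name `lcsstr_similarity` and the return value of 0.0 for two empty strings are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def lcsstr_similarity(left, right):
    """Length of the longest common substring relative to the longer string.

    Illustrative sketch based on difflib, not the library implementation.
    """
    longest = max(len(left), len(right))
    if longest == 0:
        # Assumption: two empty strings share nothing and yield 0.0.
        return 0.0
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    match = matcher.find_longest_match(0, len(left), 0, len(right))
    return match.size / longest


print(lcsstr_similarity('4553', '4o553'))      # '553' of 5 characters: 0.6
print(lcsstr_similarity('740408', '7404m08'))  # '7404' of 7 characters: about 0.571
```

A single inserted character splits the common substring, which lowers the similarity more strongly than a character-set metric like Jaccard would.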

In [44]:
attribute = 'musicid'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.LCSStr()
#tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [45]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

musicid values range [0.         0.125      0.14285714 0.16666667 0.2        0.25
 0.28571429 0.33333333 0.375      0.4        0.42857143 0.5
 0.55555556 0.57142857 0.6        0.625      0.66666667 0.71428571
 0.75       0.77777778 0.8        0.83333333 0.85714286 0.875
 0.88888889 1.        ]


In [46]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

position = uniques_len[attribute]//2

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
170229,1,1.0,4553,4553
169643,1,1.0,4553,4553
160062,1,1.0,50999,50999
118016,1,1.0,5714,5714
116887,1,1.0,134134,134134
159298,1,1.0,10425,10425
128867,1,1.0,226,226
158621,1,1.0,912,912
123629,1,1.0,8691,8691
147046,1,1.0,73,73


0.8888888888888888 <= musicid_delta <= 1.0


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
91493,0,0.0,,
112418,1,0.0,,
107557,1,0.0,,
49995,0,0.0,,
38150,0,0.0,,
9578,0,0.0,,
104342,1,0.0,,
13195,0,0.0,,5714.0
166477,1,0.0,,
26947,0,0.0,,


0.0 <= musicid_delta <= 0.1428571428571429


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
149260,1,0.571429,740408,7404m08
170231,1,0.6,4553,4o553
159070,1,0.6,7794,7q794
158712,1,0.6,4553,455a3
124962,1,0.6,8571,8i571
158438,1,0.6,5944,5k944
166329,1,0.571429,502430,50r2430
169563,1,0.6,4553,455f3
107652,1,0.6,4031,403p1
160295,1,0.6,6646,6y646


0.5714285714285714 <= musicid_delta <= 0.6


In [47]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1], attribute,
    uniques[attribute][0],
    uniques[attribute][uniques_len[attribute]-1], 20)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
133379,1,1.0,502430.0,502430.0
108924,1,0.0,,
144771,1,0.0,,
150635,1,0.0,,
144412,1,0.0,,
157651,1,0.0,,
123071,1,1.0,4553.0,4553.0
104045,1,0.0,,
159983,1,0.0,,
150288,1,1.0,449.0,449.0


0.0 <= musicid_delta <= 1.0


Less than $10\%$ of the records have this attribute filled. As the samples above show, the chosen metric results in a similarity value of 0.0 for pairs of empty values, even for pairs of duplicates. This effect can be adjusted with function $\texttt{.mark}\_\texttt{missing}()$ as above.

In [48]:
df_feature_base = dpf.mark_missing(df_feature_base, 'musicid', factor)

### part

Analogous to attribute $\texttt{edition}$ described above, the string value of this attribute can be stripped to pure digits. Both ways, with and without letter stripping, have been tried for modelling. The final decision for the best processing will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb). Three different metrics have been tried for attribute $\texttt{part}$. Finally, metric StrCmp95 will be used.

In [49]:
attribute = 'part'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.StrCmp95()
#tedi.Jaro()
#tedi.Hamming()
#tedi.LCSStr()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [50]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

part values range [0.         0.29202279 0.30740741 0.31481481 0.32407407 0.32898551
 0.33333333 0.33597884 0.33838384 0.33921569 0.34444444 0.3452381
 0.35128205 0.35185185 0.35897436 0.36111111 0.36231884 0.36666667
 0.37037037 0.37254902 0.37301587 0.37407407 0.375      0.37777778
 0.38333333 0.38461538 0.38568376 0.38888889 0.38927739 0.39215686
 0.39393939 0.3952381  0.3960114  0.39646465 0.39722222 0.4
 0.4037037  0.40604575 0.40740741 0.40842491 0.40855763 0.41025641
 0.41111111 0.41203704 0.41282051 0.41388889 0.41449275 0.41452991
 0.41507937 0.41666667 0.4178744  0.41798942 0.41851852 0.42083333
 0.42222222 0.42390289 0.4241453  0.42564103 0.42592593 0.42777778
 0.42810458 0.42857143 0.42948718 0.43030303 0.43055556 0.43115942
 0.43174603 0.43333333 0.43518519 0.43557423 0.43589744 0.43627451
 0.43650794 0.43703704 0.4375     0.43813131 0.43888889 0.44017094
 0.44166667 0.44200244 0.44230769 0.44358974 0.44444444 0.4469697
 0.44761905 0.44814815 0.4484127  0.4494302  0.449494

In [51]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

position = uniques_len[attribute]//7

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
146330,1,1.0,,
29069,0,1.0,,
113429,1,1.0,,
133299,1,1.0,,
146924,1,1.0,,
159938,1,1.0,,
158793,1,1.0,1,1
148021,1,1.0,,
138386,1,1.0,,
112200,1,1.0,7 7,7 7


0.9209401709401709 <= part_delta <= 1.0


Unnamed: 0,duplicates,part_delta,part_x,part_y
5544,0,0.0,2620 5,
74438,0,0.0,,23 23 1870 23
80845,0,0.0,,15 15
29451,0,0.0,2,1
29216,0,0.0,,9 9
11122,0,0.0,,1
63857,0,0.0,,2008 106 113
25740,0,0.0,63,3
53427,0,0.0,,4 4
95795,0,0.0,,3


0.0 <= part_delta <= 0.30740740740740735


Unnamed: 0,duplicates,part_delta,part_x,part_y
2178,0,0.713095,23 1862,293 2014
15711,0,0.713095,23 1862,286 2007
63720,0,0.713095,23 1862,283 2004
90832,0,0.713095,23 1862,286 2007
4289,0,0.713095,23 1862,281 2002
1701,0,0.713492,285 285 1963,23 23 1899


0.7130952380952381 <= part_delta <= 0.7134920634920635


Unnamed: 0,duplicates,part_delta,part_x,part_y
10320,0,0.714286,552 552,2
98801,0,0.714286,2,23 1902
100895,0,0.714286,23 1862,3
15026,0,0.716667,28 10 2013 2421 2431,2 2
19743,0,0.714286,2,241 319
52049,0,0.714286,23 1862,3
39950,0,0.714286,552 552,2
85356,0,0.714286,912 912,1
100022,0,0.714286,4,534 534
78861,0,0.714286,912 912,1


0.7142857142857143 <= part_delta <= 0.7166666666666667


For this attribute, too, moving pairs of empty values to negative values will result in a clearer distinction between pairs of uniques and duplicates, as will be seen in the graphical comparison of chapter [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb).

In [52]:
df_feature_base = dpf.mark_missing(df_feature_base, 'part', factor)

### person

As a result of chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{person}$ has been split into three specific attributes. Attributes $\texttt{person}\_{100}$ and $\texttt{person}\_{700}$ hold strongly standardised string values. For comparing pure strings, a Levenshtein metric is recommended [[Chri2012](./A_References.ipynb#chri2012)]. Unfortunately, this metric shows a very long calculation time on the data of the capstone project. After comparing the similarity values of the Levenshtein metric with the similarity values of other metrics in appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), similarity metric StrCmp95 has been chosen.

In [53]:
attribute = 'person'

columns_metadata_dict['similarity_metrics'][attribute+'_100'] = tedi.StrCmp95()
columns_metadata_dict['similarity_metrics'][attribute+'_700'] = tedi.StrCmp95()
#tedi.Levenshtein()

pe_values = ['_100', '_700']

for pe in pe_values :
    print('Calculating person'+pe)
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pe,
        columns_metadata_dict['similarity_metrics'][attribute+pe],
        columns_metadata_dict)

Calculating person_100


Calculating person_700


In [54]:
pe = '_100'

uniques[attribute+pe], uniques_len[attribute+pe] = dpf.determine_similarity_values(
    df_feature_base, attribute+pe)

person_100 values range [0.         0.31944444 0.32777778 ... 0.99090909 0.99130435 1.        ]


In [55]:
position = uniques_len[attribute+pe]

dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position-2],
    uniques[attribute+pe][uniques_len[attribute+pe]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position],
    uniques[attribute+pe][uniques_len[attribute+pe]-position+2], 10)

Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
133006,1,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus
39381,0,1.0,,
169991,1,1.0,,
105733,1,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus
77514,0,1.0,,
143632,1,1.0,austenjane,austenjane
134813,1,1.0,kesslersigrid,kesslersigrid
36881,0,1.0,,
156076,1,1.0,austenjane,austenjane
104445,1,1.0,,


0.9913043478260869 <= person_100_delta <= 1.0


Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
48528,0,0.0,mozartwolfgang amadeus,
51335,0,0.0,,austenjane
49618,0,0.0,voltaire,
31086,0,0.0,schlöndorffvolker,
78499,0,0.0,,bührerwalter
48460,0,0.0,,steinerrudolf
36192,0,0.0,mozartwolfgang amadeus,
59089,0,0.0,eigenmanndaniela,
27785,0,0.0,rosoffmeg,
13407,0,0.0,,mozartwolfgang amadeus


0.0 <= person_100_delta <= 0.3277777777777777


For comparing person names, like in attribute $\texttt{person}\_{245c}$, a Jaro metric will be tested [[Chri2012](./A_References.ipynb#chri2012)].
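
For orientation, a textbook version of the Jaro similarity is sketched below. It counts characters that match within a sliding window, corrects for transposed matches and averages three ratios. The function `jaro_similarity` is a hypothetical helper, not the library implementation, which may differ in details; the handling of empty strings is an assumption of this sketch.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity (illustrative sketch)."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings are fully similar
    if not s1 or not s2:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in their original order; half of the positions
    # that disagree are counted as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity('martha', 'marhta'), 4))  # 0.9444
```

The metric tolerates swapped neighbouring characters, which is a typical error pattern in manually entered names.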

In [56]:
pe = '_245c'

columns_metadata_dict['similarity_metrics'][attribute+pe] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute+pe,
    columns_metadata_dict['similarity_metrics'][attribute+pe],
    columns_metadata_dict)

In [57]:
uniques[attribute+pe], uniques_len[attribute+pe] = dpf.determine_similarity_values(
    df_feature_base, attribute+pe)

person_245c values range [0.         0.25901876 0.26388889 ... 0.99881797 0.99882214 1.        ]


In [58]:
position = uniques_len[attribute+pe]

dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position-2],
    uniques[attribute+pe][uniques_len[attribute+pe]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position],
    uniques[attribute+pe][uniques_len[attribute+pe]-position+2], 10)

Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
135365,1,1.0,[walter bührer],[walter bührer]
165674,1,1.0,,
112148,1,1.0,von w.a. mozart ; dichtung nach ludwig gisecke...,von w.a. mozart ; dichtung nach ludwig gisecke...
129526,1,1.0,sigrid kessler... [et al.] ; [hrsg.:] interkan...,sigrid kessler... [et al.] ; [hrsg.:] interkan...
70707,0,1.0,,
125927,1,1.0,sigrid kessler... [et al.] ; [hrsg.:] interkan...,sigrid kessler... [et al.] ; [hrsg.:] interkan...
104414,1,1.0,birgit spinath (hrsg.),birgit spinath (hrsg.)
148156,1,1.0,hrsg. von hans konrad biesalski ... [et al.],hrsg. von hans konrad biesalski ... [et al.]
169480,1,1.0,w. a. mozart,w. a. mozart
155455,1,1.0,"mortzfeld, peter; raabe, paul","mortzfeld, peter; raabe, paul"


0.9988221436984688 <= person_245c_delta <= 1.0


Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
46976,0,0.0,,sigrid kessler [u.a.]
41360,0,0.0,,[walter bührer]
52102,0,0.0,,jane austen
53523,0,0.0,andreas basu ; liane faust,
11211,0,0.0,,[musik:] wolfgang amadeus mozart ; [text von e...
34914,0,0.0,,jane austen ; edited by james kinsley ; with a...
97928,0,0.0,,von emmanuel schikaneder ; musik von wolfgang ...
16345,0,0.0,,w.a. mozart ; text von emanuel schikaneder ; n...
73052,0,0.0,andreas flury,
75943,0,0.0,,sigrid kessler [u.a.]


0.0 <= person_245c_delta <= 0.26388888888888884


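The similarities of all three $\texttt{person}$ attributes are affected by empty values. As the samples for $\texttt{person}\_\texttt{245c}$ show, a pair of two empty values scores $1.0$ and a pair with one empty value scores $0.0$. These pairs will be handled in the same way as for the attributes above.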

In [59]:
pe_values = ['_100', '_245c', '_700']

for pe in pe_values :
    df_feature_base = dpf.mark_missing(df_feature_base, 'person'+pe, factor)
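The helper $\texttt{dpf.mark\_missing}$ is not shown in this chapter. The following is a minimal sketch of the idea for a single record pair, not the actual implementation. It assumes that a pair with both values missing is mapped to $-1.0$ and a pair with exactly one value missing to $-0.5$, the two negative values visible in the sample records of section [DataFrame with Attributes and Similarity Features](#DataFrame-with-Attributes-and-Similarity-Features). The names `is_missing` and `mark_missing_pair`, the `None`/`NaN` convention for missing values, and the way `factor` scales both markers are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (assumed convention for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mark_missing_pair(delta, value_x, value_y, factor=1.0):
    """Return a negative marker instead of delta if a value is missing."""
    missing_count = is_missing(value_x) + is_missing(value_y)
    if missing_count == 2:
        return -1.0 * factor   # both sides empty (assumed marker)
    if missing_count == 1:
        return -0.5 * factor   # exactly one side empty (assumed marker)
    return delta               # both sides filled: keep the similarity


print(mark_missing_pair(1.0, None, None))
print(mark_missing_pair(0.0, None, 'jane austen'))
print(mark_missing_pair(0.98, 'w. a. mozart', 'w.a. mozart'))
```

Under these assumptions, the negative markers keep pairs with missing data apart from pairs whose filled values are merely dissimilar, which would otherwise share the value $0.0$.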

### pubinit

This attribute holds publisher strings whose representation is similar to that of attribute $\texttt{person}$. A Jaro metric will be used.
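To illustrate what a Jaro similarity measures, the following sketch implements the textbook definition in plain Python. It is an independent illustration and not the code of the TextDistance library; the function name `jaro_similarity` is hypothetical. The metric counts characters that match within a sliding window, penalises matched characters that appear in a different order, and averages three ratios.

```python
def jaro_similarity(left, right):
    """Textbook Jaro similarity in [0, 1]; illustrative sketch only."""
    if left == right:
        return 1.0
    if not left or not right:
        return 0.0
    # Characters may match only within this distance of each other.
    window = max(max(len(left), len(right)) // 2 - 1, 0)
    left_hit = [False] * len(left)
    right_hit = [False] * len(right)
    matches = 0
    for i, char in enumerate(left):
        start = max(0, i - window)
        stop = min(len(right), i + window + 1)
        for j in range(start, stop):
            if not right_hit[j] and right[j] == char:
                left_hit[i] = right_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that are out of order (half-transpositions).
    out_of_order = 0
    k = 0
    for i, char in enumerate(left):
        if left_hit[i]:
            while not right_hit[k]:
                k += 1
            if char != right[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(left)
            + matches / len(right)
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity('[walter bührer]', '[walterbührer]'), 6))  # 0.977778
```

For this pair, the sketch yields the same value that is listed for $\texttt{person}\_\texttt{245c}$ in the sample records of section [DataFrame with Attributes and Similarity Features](#DataFrame-with-Attributes-and-Similarity-Features). Agreement for other pairs is not guaranteed, since library implementations may differ in details.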

In [60]:
attribute = 'pubinit'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [61]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

pubinit values range [0.         0.25132275 0.25303644 ... 0.9957265  0.99578059 1.        ]


In [62]:
position = uniques_len[attribute]//3

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-1],
    uniques[attribute][uniques_len[attribute]-position], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-5],
    uniques[attribute][uniques_len[attribute]-1], 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
77901,0,0.603415,le grand livre du mois,del prado
53111,0,0.603413,"tdk recording media europe, opernhaus","in melanges, ed. van den heuvel. paris, gallimard"


0.6034126748412463 <= pubinit_delta <= 0.6034151034151035


Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
114306,1,1.0,,
166845,1,1.0,staatlicher lehrmittelverlag,staatlicher lehrmittelverlag
157165,1,1.0,la guilde du livre,la guilde du livre
157840,1,1.0,bärenreiter-[verlag],bärenreiter-[verlag]
126237,1,1.0,staatlicher lehrmittelverlag,staatlicher lehrmittelverlag
28059,0,1.0,,
144513,1,1.0,,
22213,0,1.0,,
58263,0,1.0,,
167872,1,1.0,,


0.9954337899543378 <= pubinit_delta <= 1.0


The similarities of $\texttt{pubinit}$ are affected by empty values. These will be transformed to negative values.

In [63]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### scale

A comparison of similarity metrics on some sample value pairs of attribute $\texttt{scale}$ in appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb) identified the Jaccard metric as showing the best matching behaviour for the purely numerical values stored in the attribute.
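The behaviour of a Jaccard similarity on such values can be sketched in plain Python. The sketch assumes that the strings are compared as multisets of single characters, which corresponds to the settings $\texttt{qval=1}$ and $\texttt{as\_set=False}$ listed in the summary of this chapter. The function name `multiset_jaccard` is hypothetical, and the code is an illustration rather than the library implementation.

```python
from collections import Counter


def multiset_jaccard(left, right):
    """Jaccard similarity on character multisets (illustrative sketch)."""
    left_counts, right_counts = Counter(left), Counter(right)
    # Multiset intersection keeps the minimum count per character,
    # multiset union keeps the maximum count per character.
    intersection = sum((left_counts & right_counts).values())
    union = sum((left_counts | right_counts).values())
    # Two empty strings are treated as identical (assumption).
    return intersection / union if union else 1.0


long_value = '50 000 8 10 8 35 45 55 46 05'
print(round(multiset_jaccard('100000', '50000'), 6))     # 0.571429
print(round(multiset_jaccard('50000', long_value), 6))   # 0.178571
print(round(multiset_jaccard('100000', long_value), 6))  # 0.214286
```

The three results, $4/7$, $5/28$ and $6/28$, coincide with the non-trivial similarity values reported for $\texttt{scale}$ in this section.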

In [64]:
attribute = 'scale'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [65]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

scale values range [0.         0.17857143 0.21428571 0.57142857 1.        ]


In [66]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-3],
    uniques[attribute][uniques_len[attribute]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-4],
    uniques[attribute][uniques_len[attribute]-3], 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
11151,0,0.0,,100000.0
62389,0,0.0,,100000.0
62530,0,0.0,,100000.0
20155,0,0.0,,100000.0
32269,0,0.0,,100000.0
22009,0,0.0,,100000.0
91143,0,0.0,,100000.0
77448,0,0.0,,100000.0
26953,0,0.0,,100000.0
97880,0,0.0,50000.0,


0.0 <= scale_delta <= 0.1785714285714286


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
70248,0,0.571429,100000,50000
101948,0,0.571429,50000,100000
18541,0,0.571429,50000,100000
34900,0,0.571429,50000,100000
36584,0,0.571429,50000,100000
1987,0,0.571429,50000,100000
47507,0,0.571429,50000,100000
74994,0,0.571429,50000,100000
77660,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
34628,0,0.571429,50000,100000


0.2142857142857143 <= scale_delta <= 0.5714285714285714


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
40725,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
79937,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
11507,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
104053,1,0.178571,50000,50 000 8 10 8 35 45 55 46 05
83391,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
104054,1,0.178571,50 000 8 10 8 35 45 55 46 05,50000
26918,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
65219,0,0.214286,50 000 8 10 8 35 45 55 46 05,100000
104048,1,0.178571,50000,50 000 8 10 8 35 45 55 46 05
104056,1,0.178571,50 000 8 10 8 35 45 55 46 05,50000


0.1785714285714286 <= scale_delta <= 0.2142857142857143


Attribute $\texttt{scale}$ is filled only for maps. Because it is so sparsely filled, its similarities are strongly affected by empty values. These empty values will be marked with a special negative value.

In [67]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split up into two new attributes, $\texttt{ttlfull}\_\texttt{245}$ and $\texttt{ttlfull}\_\texttt{246}$, which will be compared with the same similarity metric. A visual analysis of the values stored in these attributes reveals strings of words, comparable to the strings in attribute $\texttt{person}\_\texttt{245c}$ above. The Jaro metric chosen for $\texttt{person}\_\texttt{245c}$ will therefore be used for both title attributes.

In [68]:
attribute = 'ttlfull'

columns_metadata_dict['similarity_metrics'][attribute+'_245'] = tedi.Jaro()
columns_metadata_dict['similarity_metrics'][attribute+'_246'] = tedi.Jaro()

tf_values = ['_245', '_246']

for tf in tf_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+tf,
        columns_metadata_dict['similarity_metrics'][attribute+tf],
        columns_metadata_dict)

In [69]:
for tf in tf_values :
    uniques[attribute+tf], uniques_len[attribute+tf] = dpf.determine_similarity_values(
        df_feature_base, attribute+tf)

ttlfull_245 values range [0.         0.24579125 0.25714286 ... 0.99947257 0.99947341 1.        ]
ttlfull_246 values range [0.         0.35555556 0.36111111 ... 0.99945085 0.99945175 1.        ]


In [70]:
tf = '_245'
position = uniques_len[attribute+tf]

dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-position],
    uniques[attribute+tf][uniques_len[attribute+tf]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-3],
    uniques[attribute+tf][uniques_len[attribute+tf]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-4],
    uniques[attribute+tf][uniques_len[attribute+tf]-3], 10)

Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
25131,0,0.0,emma,blick in die welt
20212,0,0.0,emma,trionfi
19839,0,0.0,emma,blick in die welt
108,0,0.0,emma,blick in die welt
46012,0,0.0,bildungsforschung und bildungspraxis,emma
80847,0,0.0,emma,trionfi
36767,0,0.0,emma,blick in die welt
68906,0,0.0,emma,blick in die welt
82828,0,0.0,arts,blick in die welt
28883,0,0.245791,domodossola,reading the eighteenth-century novel


0.0 <= ttlfull_245_delta <= 0.24579124579124578


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
165535,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165540,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165489,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165482,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165537,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165517,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165473,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165523,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165525,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165519,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...


0.9994725738396625 <= ttlfull_245_delta <= 0.9994734070563455


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
165535,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165482,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165533,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165489,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165525,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165540,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
164982,1,0.999463,health informatics - personal health device co...,health informatics - personal health device co...
164965,1,0.999463,health informatics - personal health device co...,health informatics - personal health device co...
165517,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...
165546,1,0.999473,health informatics - personal health device co...,health informatics - personal health device co...


0.9994632313472893 <= ttlfull_245_delta <= 0.9994725738396625


Attribute $\texttt{ttlfull}\_\texttt{245}$ is filled for all data rows of Swissbib's raw data, as can be seen in chapter [Data Analysis](./1_DataAnalysis.ipynb). For attribute $\texttt{ttlfull}\_\texttt{246}$, the filling is below $10\%$. The data pairs with missing values will be marked with a negative value, as has been done for similar cases above.

In [71]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute+'_246', factor)

### volumes

This attribute is described in chapter [Data Analysis](./1_DataAnalysis.ipynb) as holding contents that resemble those of attribute $\texttt{part}$. The same similarity metric will therefore be used for attribute $\texttt{volumes}$ as for attribute $\texttt{part}$.

In [72]:
attribute = 'volumes'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.StrCmp95()
#tedi.Jaro()
#tedi.LCSSeq()
#tedi.MongeElkan()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [73]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

volumes values range [0.         0.32222222 0.35128205 0.37301587 0.37407407 0.38333333
 0.38461538 0.3952381  0.40740741 0.41111111 0.41666667 0.41737892
 0.42857143 0.43650794 0.43703704 0.44017094 0.44166667 0.44200244
 0.44761905 0.45555556 0.45833333 0.46296296 0.46428571 0.46581197
 0.46666667 0.47008547 0.47222222 0.47777778 0.48148148 0.48333333
 0.48412698 0.48611111 0.48888889 0.49007937 0.49206349 0.49365079
 0.4991453  0.5        0.50793651 0.51111111 0.51190476 0.52222222
 0.52380952 0.52564103 0.52777778 0.53333333 0.53703704 0.53968254
 0.54074074 0.54166667 0.54722222 0.54761905 0.5491453  0.55
 0.55128205 0.55555556 0.56031746 0.56111111 0.56190476 0.56507937
 0.56666667 0.57407407 0.57478632 0.57777778 0.58148148 0.58333333
 0.58862434 0.58888889 0.59444444 0.5952381  0.59722222 0.5982906
 0.6        0.60119048 0.60320513 0.60683761 0.60714286 0.61025641
 0.61111111 0.61507937 0.61666667 0.61904762 0.61923077 0.62222222
 0.62380952 0.625      0.62777778 0.62962963 0.6

In [74]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-3],
    uniques[attribute][uniques_len[attribute]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-4],
    uniques[attribute][uniques_len[attribute]-3], 10)

Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
26150,0,0.0,,1
15734,0,0.0,,1
44046,0,0.0,316.0,1
83721,0,0.0,413.0,
72976,0,0.0,2.0,1 67 03
15512,0,0.0,270.0,
28425,0,0.0,2.0,59
1197,0,0.0,600.0,1
26977,0,0.0,1.0,2
94097,0,0.0,2.0,


0.0 <= volumes_delta <= 0.3222222222222222


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
87911,0,0.916667,1 82,1 2
723,0,0.933333,1 45,1 415
23288,0,0.933333,1 169,1 16
72971,0,0.916667,1 82,1 2
56698,0,0.916667,1 82,1 2
40158,0,0.916667,1 82,1 2
66950,0,0.933333,1 45,1 145
11439,0,0.933333,1 36,1 376
15600,0,0.933333,1 36,1 346
91731,0,0.933333,1 166,1 16


0.9166666666666666 <= volumes_delta <= 0.9333333333333332


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
67783,0,0.916667,1 82,1 2
103502,1,0.904762,1 169,1 0 169
59578,0,0.916667,1 82,1 2
56698,0,0.916667,1 82,1 2
81516,0,0.916667,1 82,1 2
103501,1,0.904762,1 0 169,1 169
103199,1,0.904762,1 169,1 0 169
88589,0,0.916667,191,1 91
52091,0,0.916667,1 32,1 2
41097,0,0.916667,1 82,1 2


0.9047619047619048 <= volumes_delta <= 0.9166666666666666


Attribute $\texttt{volumes}$ holds rows with missing data. The data pairs with missing values will be marked with a special negative value.

In [75]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

## DataFrame with Attributes and Similarity Features

The metric for each attribute of the feature DataFrame has been decided and the similarity features have been calculated. In this last step, the columns of the DataFrame are reordered to place each $\_\texttt{delta}$ column next to its input columns $\_\texttt{x}$ and $\_\texttt{y}$, and some sample records are shown.

In [76]:
# Take _x, _y, and _delta columns together
fb_col_list = df_feature_base.columns.tolist()
fb_col_list.sort()
# Move target column to first place
fb_col_list.insert(0, fb_col_list.pop(fb_col_list.index('duplicates')))
# Reorder DataFrame columns
df_attribute_with_sim_feature = pd.DataFrame(df_feature_base, columns=fb_col_list)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_attribute_with_sim_feature.columns)

class_label = ['uniques', 'duplicate']

for i in class_label:
    display(df_attribute_with_sim_feature[df_attribute_with_sim_feature.duplicates==class_label.index(i)].sample(n=10))
    print(i)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
1497,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,20151475,2006uuuu,0.111111,20000,10300,0.0,bk,vm,0.0,"[88-7922-121-3, 978-88-7922-121-4]",[],-1.0,,,-1.0,,,-0.5,1 1,,0.61221,petrarcafrancesco,mozartwolfgang amadeus,0.678402,francesco petrarca ; commento di bernardo lapini,wolfgang amadeus mozart ; libretto: emanuel sc...,0.614021,lapinibernardo,schikanederemanuel,-0.5,"adv, biblioteca cantonale di lugano",,-1.0,,,0.582709,"trionfi, riedizione accurata dell'incunabolo c...","die zauberflöte, märchenoper für kinder : oper...",-1.0,,,0.0,2,1 145
97490,0,-1.0,,,-1.0,,,-0.5,schweizerische normen-vereinigung,,-1.0,,,-1.0,,,0.5,2016aaaa,uuuuuuuu,0.428571,20053,10200,0.0,bk,vm,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,"richard wagner, aarno cronvall",-0.5,,"cronvallaarno, wagnerrichard",-0.5,,teldec,-1.0,,,0.520088,informatique de la santé - communication entre...,der fliegende holländer,-0.5,medizinische informatik - kommunikation von ge...,,0.733333,1,1 139
92765,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,5.0,0.25,2005aaaa,19631970,0.111111,10100,20000,0.0,mu,bk,1.0,[],[],-0.5,m006204687,,-0.5,601.0,,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.632479,w. a. mozart,von w. a. mozart ; dargestellt von thilo corne...,0.554762,mozartwolfgang amadeus,cornelissenthilo,-0.5,bärenreiter,,-1.0,,,0.884058,"die zauberflöte, kv 620",die zauberflöte,-1.0,,,0.688889,1 379,107
48392,0,-0.5,,e0074147,-0.5,,n0460833,0.040816,trägerverein 600 jahre niklaus von flüe 1417-2017,schweizeidgenössisches topographisches bureau,-1.0,,,-1.0,,,0.25,2016aaaa,1870uuuu,0.111111,20000,10300,0.0,bk,mp,0.0,"[3-290-20138-4, 978-3-290-20138-8]",[],-1.0,,,-1.0,,,-0.5,,23 23 1870 23,-1.0,,,0.511153,herausgegeben für den trägerverein 600 jahre n...,g. h. dufour direxit ; h. müllhaupt sculpsit,0.653568,"gröbliroland, achermannwalter","dufourguillaume-henri, müllhauptheinrich, manzj.",0.48655,edition nzn bei tvz,[eidg. topographisches bureau],-0.5,,100000.0,0.460018,"mystiker mittler mensch, 600 jahre niklaus von...","domo d'ossola, arona",-0.5,,"[domodossola, arona]",0.0,388,1
41911,0,-1.0,,,-1.0,,,-0.5,,"interkantonale lehrmittelzentrale (rapperswil,...",-1.0,,,-0.5,,2.0,0.5,19702006,1999uuuu,0.111111,10100,20000,0.0,mu,bk,0.0,[],[3-906721-46-9],-0.5,"m006546756 (kritischer bericht, leinen)",,-0.5,4553.0,,-0.5,5 19,,-0.5,mozartwolfgang amadeus,,0.615301,wolfgang amadeus mozart ; in verbindung mit de...,sigrid kessler... [et al.] ; [hrsg.:] interkan...,0.596797,"grubergernot, orelalfred, faberrudolf","kesslersigrid, freymatthias",0.485931,bärenreiter,staatlicher lehrmittelverlag,-1.0,,,0.650952,"neue ausgabe sämtlicher werke, die zauberflöte...","bonne chance!, cours de langue française, 3, l...",-1.0,,,0.511111,1 379,104
5919,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,2.0,0.625,1976aaaa,1979uuuu,1.0,20000,20000,1.0,bk,bk,0.0,[3-15-002620-2],[3-442-33001-7],-1.0,,,-1.0,,,0.437037,2620 2620,33001,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.660238,wolfgang amadeus mozart ; dichtung von emanuel...,wolfgang amadeus mozart ; hrsg.: kurt pahlen,0.546005,"zentnerwilhelm, schikanederemanuel, goethejoha...",pahlenkurt,0.412037,p. reclam,goldmann,-1.0,,,0.836917,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, opernführer",-1.0,,,0.0,80,252
37327,0,-1.0,,,-1.0,,,-0.5,,"interkantonale lehrmittelzentrale (rapperswil,...",-1.0,,,-1.0,,,0.5,1960aaaa,1996uuuu,0.111111,10200,30000,1.0,mu,mu,1.0,[],[],-1.0,,,-0.5,10425.0,,-1.0,,,-0.5,mozartwolfgang amadeus,,0.627269,w. a. mozart ; nach dem in der preussischen st...,[hrsg.] interkantonale lehrmittelzentrale rapp...,-0.5,"soldankurt, mozartwolfgang amadeus",,0.400535,c.f. peters,berner lehrmittel- und medienverl.,-1.0,,,0.628657,"die zauberflöte, oper in 2 aufzügen","bonne chance !, cours de langue française 2",-1.0,,,0.0,1 188,3
102041,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.25,2007aaaa,1994uuuu,1.0,20000,20000,1.0,bk,bk,0.0,[3-7815-1531-1],[],-1.0,,,-1.0,,,-1.0,,,-0.5,gläser-zikudamichaela,,0.589903,hrsg. von michaela gläser-zikuda und tina hascher,sigrid kessler... [et al.] ; [hrsg.:] interkan...,-0.5,haschertina,,0.485931,klinckhardt,staatlicher lehrmittelverlag,-1.0,,,0.551785,"lernprozesse dokumentieren, reflektieren und b...","bonne chance!, cours de langue française, 1",-1.0,,,0.0,304,145
46794,0,-1.0,,,-1.0,,,-1.0,,,-0.5,10.5167/uzh-53042,,-1.0,,,0.3125,2011aaaa,188uuuuu,0.428571,10053,10000,0.0,bk,mu,1.0,[],[],-1.0,,,-0.5,,7918.0,-1.0,,,-0.5,,mozartwolfgang amadeus,0.521923,"[a u scherrer, b ledergerber, v von wyl, j bön...",von w. a. mozart ; für pianoforte zu vier händ...,-0.5,"scherrera u., ledergerberb., von wylv., bönij....",,-1.0,,,-1.0,,,0.547283,improved virological outcome in white patients...,"die zauberflöte, oper in 2 akten",-1.0,,,-0.5,,120
99699,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.25,2005aaaa,1979uuuu,1.0,20000,20000,1.0,bk,bk,0.0,"[978-0-521-82437-8, 0-521-82437-0]",[3-442-33001-7],-1.0,,,-0.5,,33001.0,-0.5,,33001 33001,0.581818,austenjane,mozartwolfgang amadeus,0.604139,jane austen ; ed. by richard cronin ... [et al.],wolfgang amadeus mozart ; dieser opernführer w...,0.557436,croninrichard,pahlenkurt,-0.5,,"w. goldmann, musikverlag b. schott's söhne",-1.0,,,0.520833,emma,"die zauberflöte, originalausgabe",-1.0,,,0.0,600,252


uniques


Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
120958,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.875,19601969,17601969,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-1.0,,,1.0,4355.0,4355.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.940626,by emanuel schikaneder ; music by wolfgang ama...,by emanuel schikaneder ; music by wolfgang ama...,0.989583,"schikanederemanuel, aberthermann","cshikanederemanuel, aberthermann",1.0,eulenburg,eulenburg,-1.0,,,0.993056,"die zauberflöte, a german opera : köchel no. 620","die zauberflöet, a german opera : köchel no. 620",-1.0,,,1.0,1.0,1.0
103220,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1999aaaa,1999uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[3-495-47879-5],[3-495-47879-5],-1.0,,,-1.0,,,0.8,57,57 57,1.0,fluryandreas,fluryandreas,1.0,andreas flury,andreas flury,-1.0,,,-0.5,,k. alber,-1.0,,,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p...",-1.0,,,1.0,316.0,316.0
140369,1,0.125,e0074147,n0460833,0.625,n0460833,n4060833,1.0,eidgenössische landestopographie,eidgenössische landestopographie,-1.0,,,-1.0,,,0.75,1905aaaa,1905uuuu,1.0,10300,10300,1.0,mp,mp,1.0,[],[],-1.0,,,-1.0,,,1.0,23 23 1905 23,23 23 1905 23,-1.0,,,0.984848,g. h. dufour direxit ; h. müllhaupt sculpsit,g. h. dufour direxit ; h. müllhaupt sculwsit,0.944362,"dufourguillaume-henri, müllhauptheinrich","dufuourgiullaumd-henri, müllhauptheinrich",0.973846,[eidg. landestopographie],[ehidg. landsetopographie],1.0,100000.0,100000.0,0.930604,"domo d'ossola, arona","domo d'oskoa, aroan",0.966667,"[domodossola, arona]",[domodossolab arona],-0.5,,1.0
156528,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2000aaaa,2000uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[0-87834-101-3],[0-87834-101-3],-1.0,,,-1.0,,,1.0,26 26 2000,26 26 2000,-1.0,,,0.989899,"c.y. cyrus chu, ronald lee (eds.)","c.y. cyurs chu, ronald lee (eds.)",0.993333,"chuc.y. cyrus, leeronald demos","chuc.y. cyrus ,leeronald demos",-1.0,,,-1.0,,,0.992248,population and economic change in east asia,population and econmic change in east asia,-1.0,,,1.0,319.0,319.0
139699,1,-1.0,,,-1.0,,,-1.0,,,1.0,10.5169/seals-515321,10.5169/seals-515321,-1.0,,,0.75,2012aaaa,2012uuuu,1.0,10053,10053,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,1.0,291 2012,291 2012,0.961111,bührerwalter,bühcrerwaltr,0.977778,[walter bührer],[walterbührer],-1.0,,,-1.0,,,-1.0,,,0.943355,blick in die welt,blick in dzie welc,-1.0,,,-1.0,,
112402,1,0.125,e0074147,n0460833,1.0,n0460833,n0460833,0.945946,eidgenssisches topographisches bureau,iedgenssisches topographisches bureau,-1.0,,,-1.0,,,0.75,1863aaaa,1863uuuu,1.0,10300,10300,1.0,mp,mp,1.0,[],[],-1.0,,,-1.0,,,1.0,24 23 1863,24 23 1863,-1.0,,,0.992248,g.h. dufour direxit ; h. müllhaupt sculpsit,g.h. dufour diexit ; h. müllhaupt sculpsit,1.0,"dufourguillaume henri, müllhauptheinrich","dufourguillaume henri, müllhauptheinrich",1.0,[eidg. topographisches bureau],[eidg. topographisches bureau],1.0,100000.0,100000.0,0.966667,"domo d'ossola, arona","domou d'ossola,arona",0.982456,"domodossola, arona","domodogssola, arona",1.0,1.0,1.0
162222,1,-1.0,,,-1.0,,,1.0,interkantonale lehrmittelzentrale (luzern),interkantonale lehrmittelzentrale (luzern),-1.0,,,-1.0,,,1.0,19849999,19849999,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.996825,sigrid kessler... [et al.] ; [éd.:] interkanto...,sigrid kessler... [et al.] ; [édd.:] interkant...,0.984615,kesslersigrid,kesslersigrd,1.0,staatlicher lehrmittelverlag,staatlicher lehrmittelverlag,-1.0,,,0.995726,"bonne chance!, cours de langue française, troi...","bonne chance!, cours de langue françiase, troi...",-1.0,,,-1.0,,
165098,1,-1.0,,,-1.0,,,0.848485,schweizerische normen-vereinigung,schwizerische normen-vereinigung,-1.0,,,-1.0,,,0.75,2016aaaa,2016uuuu,1.0,20053,20053,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.990494,health informatics - personal health device co...,health informatics - personal health device co...,0.953221,informatique de la santé - communication entre...,informatique de la santé - commonication entre...,1.0,1.0,1.0
115152,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,3.0,3.0,0.75,1981aaaa,1981uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[3-442-33001-7],[3-442-33001-7],-1.0,,,-1.0,,,1.0,33001 33001,33001 33001,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.997151,wolgang amadeus mozart ; dieser opernführer wu...,wolgang amadeus mozart ; deser opernführer wur...,1.0,"schikanederemanuel, pahlenkurt","schikanederemanuel, pahlenkurt",-1.0,,,-1.0,,,0.931746,die zauberflöte,die zuaberfdöte,-1.0,,,1.0,252.0,252.0
165405,1,-1.0,,,-1.0,,,0.69697,schweizerische normen-vereinigung,schweizerische normen-vareinigung,-1.0,,,-1.0,,,0.75,2014aaaa,2014uuuu,1.0,20053,20053,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,health informatics - personal health device co...,health informatics - personal health device co...,0.879384,informatique de santé - communication entre di...,informatique de santé - communicatiofn entre d...,1.0,1.0,1.0


duplicate


## Summary

This chapter covers the central area of feature construction. A similarity metric has been decided for each attribute of Swissbib's raw data, and the features of the feature matrix have been generated with it. With these metric values, the feature base DataFrame has been extended, and a new DataFrame holding the attribute values of the pairs together with their calculated similarity values has been generated. The similarity values will be the final features for training and performance testing of the models, compare [[JudACaps](./A_References.ipynb#judacaps)].

In [77]:
columns_metadata_dict['similarity_metrics']

{'coordinate_E': LCSStr({'qval': 1, 'external': True}),
 'coordinate_N': LCSStr({'qval': 1, 'external': True}),
 'corporate_full': LCSStr({'qval': 1, 'external': True}),
 'doi': Identity({'qval': 1, 'external': True}),
 'edition': Jaccard({'qval': 1, 'as_set': False, 'external': True}),
 'exactDate': Hamming({'qval': 1, 'test_func': <function Base._ident at 0x122383950>, 'truncate': False, 'external': True}),
 'format_prefix': Identity({'qval': 1, 'external': True}),
 'format_postfix': Jaccard({'qval': 2, 'as_set': False, 'external': True}),
 'isbn': Identity({'qval': 1, 'external': True}),
 'ismn': Identity({'qval': 1, 'external': True}),
 'musicid': LCSStr({'qval': 1, 'external': True}),
 'part': StrCmp95({'long_strings': False, 'external': True}),
 'person_100': StrCmp95({'long_strings': False, 'external': True}),
 'person_700': StrCmp95({'long_strings': False, 'external': True}),
 'person_245c': Jaro({'qval': 1, 'long_tolerance': False, 'winklerize': False, 'external': True}),
 'pu

The similarity metric decided for each attribute has been added as an additional piece of information to the columns metadata dictionary. The following table gives this summary in a structured form and lists the metric used for each attribute. For better orientation, attributes shown in the same font color hold similar types of values (see the description column).

| attribute     | subtype | description | similarity metric |
| ------------- |:--------|:------------|:------------------|
|<font color='red'>[coordinate](#coordinate)</font>|<font color='red'>\_E</font>|<font color='red'>Code(9)</font>|<font color='red'>LCSStr</font>|
|               |<font color='red'>\_N</font>|<font color='red'>Code(9)</font>|<font color='red'>LCSStr</font>|
|<font color='blue'>[corporate](#corporate)</font>|<font color='blue'>\_full</font>|<font color='blue'>Name</font>|<font color='blue'>LCSStr</font>|
|<font color='green'>[doi](#doi)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='orange'>[edition](#edition)</font>|         |<font color='orange'>Number</font>|<font color='orange'>Jaccard</font>|
|<font color='black'>[exactDate](#exactDate)</font>|         |<font color='black'>Date</font>|<font color='black'>Hamming</font>|
|<font color='red'>[format](#format)</font>|<font color='red'>\_prefix</font>|<font color='red'>Code(2)</font>|<font color='red'>Identity</font>|
|               |<font color='red'>\_postfix</font>|<font color='red'>Code(6)</font>|<font color='red'>Jaccard (qval=2)</font>|
|<font color='green'>[isbn](#isbn)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='green'>[ismn](#ismn)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='green'>[musicid](#musicid)</font>|         |<font color='green'>Identifier</font>|<font color='green'>LCSStr</font>|
|<font color='orange'>[part](#part)</font>|         |<font color='orange'>Number</font>|<font color='orange'>StrCmp95</font>|
|<font color='blue'>[person](#person)</font>|<font color='blue'>\_100</font>|<font color='blue'>Name</font>|<font color='blue'>StrCmp95</font>|
|               |<font color='blue'>\_700</font>|<font color='blue'>Name</font>|<font color='blue'>StrCmp95</font>|
|               |<font color='blue'>\_245c</font>|<font color='blue'>Name</font>|<font color='blue'>Jaro</font>|
|<font color='blue'>[pubinit](#pubinit)</font>|         |<font color='blue'>Name</font>|<font color='blue'>Jaro</font>|
|<font color='orange'>[scale](#scale)</font>|         |<font color='orange'>Number</font>|<font color='orange'>Jaccard</font>|
|<font color='blue'>[ttlfull](#ttlfull)</font>|<font color='blue'>\_245</font>|<font color='blue'>String</font>|<font color='blue'>Jaro</font>|
|               |<font color='blue'>\_246</font>|<font color='blue'>String</font>|<font color='blue'>Jaro</font>|
|<font color='orange'>[volumes](#volumes)</font>|         |<font color='orange'>Number</font>|<font color='orange'>StrCmp95</font>|

### Full Feature Matrix with Target Vector Handover

To hand over the resulting DataFrame of this chapter, the DataFrame is saved into a compressed pickle file that will be read as input in the next chapter, [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb).

In [78]:
# Store into compressed intermediary file
with bz2.BZ2File(os.path.join(path_goldstandard,
                       'labelled_feature_matrix_full.pkl'), 'w') as df_output_file:
    pk.dump(df_attribute_with_sim_feature, df_output_file)
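A consuming chapter can read such a file back with the same two standard-library modules. The following round trip is a minimal sketch: it writes to a temporary directory instead of $\texttt{path\_goldstandard}$ and uses a small dictionary named `payload` as a hypothetical stand-in for the DataFrame, since any picklable object is handled the same way.

```python
import bz2
import os
import pickle as pk
import tempfile

# Hypothetical stand-in for df_attribute_with_sim_feature.
payload = {'duplicates': [0, 1], 'scale_delta': [-1.0, 1.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'labelled_feature_matrix_full.pkl')

    # Write: pickle the object into a bz2-compressed binary file.
    with bz2.BZ2File(file_path, 'w') as output_file:
        pk.dump(payload, output_file)

    # Read: decompress and unpickle in one step.
    with bz2.BZ2File(file_path, 'r') as input_file:
        restored = pk.load(input_file)

print(restored == payload)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.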

The full metadata dictionary is to be persisted for handover to subsequent chapters.

In [79]:
# The target is still needed for the feature matrix
columns_metadata_dict['features'].append('duplicates')

for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 

similarity_metrics 
 {'coordinate_E': LCSStr({'qval': 1, 'external': True}), 'coordinate_N': L

In [80]:
# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'columns_metadata.pkl'), 'wb') as dict_output_file:
    pk.dump(columns_metadata_dict, dict_output_file)