# Feature Matrix Generation

This chapter introduces similarity metrics for string comparison. For each attribute of the DataFrame built in the previous chapters, a suitable similarity metric is chosen. The result of this chapter is the feature matrix.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
- [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [coordinate](#coordinate)
    - [corporate](#corporate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [musicid](#musicid)
    - [part](#part)
    - [person](#person)
    - [pubinit](#pubinit)
    - [scale](#scale)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Feature Base](#Feature-Base)
- [Summary](#Summary)

## Data Takeover

Swissbib's raw data of the goldstandard has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, this data is read in for further processing to the feature matrix and target vector for the subsequent machine learning model chapters.

In [1]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.
2,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.
4,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.


In [2]:
print('Number of rows labelled as duplicates', len(df_feature_base[
    df_feature_base.duplicates==1]))
print('Number of rows labelled as uniques', len(df_feature_base[
    df_feature_base.duplicates==0]))
print('Total number of rows in DataFrame', df_feature_base.shape[0],
      'number of columns', df_feature_base.shape[1])

Number of rows labelled as duplicates 1473
Number of rows labelled as uniques 259260
Total number of rows in DataFrame 260733 number of columns 41


In [3]:
print('Share of uniques (0) and duplicates (1) in units of [%]')
print(100*df_feature_base.duplicates.value_counts(normalize=True))

Share of uniques (0) and duplicates (1) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


Duplicate pairs make up less than 0.6% of the full training data. This strong class imbalance will affect the training of the model: during training, the model will see far more pairs of unique records ($\texttt{duplicates}=0$) than pairs of duplicates ($\texttt{duplicates}=1$). Undersampling of the unique pairs might therefore be necessary; this will be decided during model training.
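To make the idea of undersampling concrete, the following is a minimal sketch in plain Python, not the approach used later in the project. The helper `undersample_indices` and its `ratio` parameter are hypothetical names introduced here for illustration only.

```python
import random

def undersample_indices(labels, ratio=1.0, seed=42):
    """Keep all duplicate pairs and a random subset of the unique pairs.

    labels: sequence of 0 (unique pair) or 1 (duplicate pair).
    ratio:  hypothetical knob, number of unique pairs kept per duplicate pair.
    """
    dup_idx = [i for i, y in enumerate(labels) if y == 1]
    uni_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(uni_idx), int(ratio * len(dup_idx)))
    rng = random.Random(seed)          # fixed seed for reproducibility
    kept = rng.sample(uni_idx, n_keep)
    return sorted(dup_idx + kept)

# Toy labels with a similar kind of imbalance: 5 duplicates, 995 uniques
labels = [1] * 5 + [0] * 995
idx = undersample_indices(labels, ratio=2.0)
n_duplicates = sum(labels[i] for i in idx)
print(len(idx), n_duplicates)
```

With `ratio=2.0`, all 5 duplicates are kept together with 10 randomly chosen unique pairs, so the share of duplicates rises from 0.5% to one third in this toy example.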

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

The attributes used for pair comparison may contain values of poor quality. Poor quality originates in the way the data was entered at the source. Manually entered data may suffer from typing errors, while automatically scanned data may suffer from deficiencies of the scanned source material or of the recognition algorithm in the optical character recognition (OCR) process. The basic step of a deduplication process is to estimate how likely the two strings of a pair are to be duplicates. This is done by calculating a similarity value between the two strings rather than by using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the two values are duplicates.

The term similarity is strongly coupled to the term of distance of two values of an attribute. Mathematically, a distance can be explained with the help of a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangular inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.
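
The four requirements above can be checked empirically on a handful of strings. The following is a minimal sketch, not part of the project code. It uses the Levenshtein edit distance as an example metric (chosen here for illustration), and the helper names `levenshtein` and `is_metric_on` are hypothetical.

```python
from itertools import product

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_metric_on(samples, dist) -> bool:
    """Check the four requirements on every triple of the given samples."""
    for x, y, z in product(samples, repeat=3):
        if dist(x, x) != 0:
            return False                      # identity
        if dist(x, y) < 0:
            return False                      # non-negativity
        if dist(x, y) != dist(y, x):
            return False                      # symmetry
        if dist(x, y) > dist(x, z) + dist(z, y):
            return False                      # triangular inequality
    return True

samples = ['emma', 'emma, roman', 'reclam', 'reclam jun.']
print(is_metric_on(samples, levenshtein))

# A deliberately asymmetric comparison function fails the check
asymmetric = lambda x, y: max(len(x) - len(y), 0)
print(is_metric_on(samples, asymmetric))
```

Such a check only covers the sampled strings; it can refute the metric property for a function but cannot prove it in general.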

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$ by calculating $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, if the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attribute values, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends on the type of data compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends on the type of data compared.

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
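
As a small worked example of the two formulas above, the following sketch counts matching attributes for two toy records. The function `object_similarity` is a hypothetical helper, the attribute values are taken from the sample rows shown in the Data Takeover section, and the sketch assumes that the two records share at least one attribute.

```python
def object_similarity(obj_i: dict, obj_j: dict) -> float:
    """Share of matching attributes, sim = m / p."""
    attributes = obj_i.keys() & obj_j.keys()   # attributes describing both objects
    p = len(attributes)
    m = sum(obj_i[a] == obj_j[a] for a in attributes)
    return m / p

rec_i = {'ttlfull': 'emma, roman', 'pubinit': 'reclam jun.',
         'volumes': '600 s.', 'format': 'bk'}
rec_j = {'ttlfull': 'emma', 'pubinit': 'reclam',
         'volumes': '600 s.', 'format': 'bk'}

sim = object_similarity(rec_i, rec_j)   # m = 2 matching attributes, p = 4
d = 1 - sim                             # ratio of mismatches, (p - m) / p
print(sim, d)
```

Here two of four attributes match exactly, so both the similarity and the dissimilarity are 0.5. The exact comparison treats 'emma' and 'emma, roman' as a plain mismatch, which is the motivation for the approximate string metrics discussed in the rest of this chapter.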

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although there is a correspondence between a similarity function and the mathematical concept of a distance function, not all known and implemented similarity functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangular inequality. The decision on the best similarity function for a string pair will be based on how well the function serves the purpose at hand. In the case of this capstone project, this purpose is its capability to contribute to the prediction of whether a pair of records is a duplicate or not.

## Library TextDistance

An internet search on string distance calculation with Python revealed the libraries [[StSi](./A_References.ipynb#stsi)], [[TeDi](./A_References.ipynb#tedi)] and separate code snippets for individual algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the TextDistance library [[TeDi](./A_References.ipynb#tedi)] was chosen for this capstone project. The decision is based on the GitHub statistics of stars and the date of the latest pull requests, which indicate the popularity and maintenance activity of the library. A look at the API reveals that the library is a complete implementation (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and easy to use.

In [4]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance



For using the library, see documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, function $\texttt{.normalized}\_\texttt{similarity()}$ of an instantiated textdistance object will be used.

In [5]:
import textdistance as tedi

With the code line above, the library is imported for use in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), the effects of the similarity metrics of the library are compared for a better understanding of their specific behaviour. This comparison is the basis for deciding on the best available similarity metric for each attribute pair.

## Similarity Metrics on Attribute Level

In this section, the choice of similarity metric for each attribute of the raw data is documented, based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), and implemented. The implementation is applied to a pair of attributes of different records, resulting in a new attribute of the final feature matrix. A general function $\texttt{build}\_\texttt{delta}\_\texttt{feature}$ is provided by the code file [data_preparation_funcs.py](./data_preparation_funcs.py) for transforming two attributes into a feature attribute holding their similarity value.

In [6]:
import data_preparation_funcs as dpf

In [7]:
columns_metadata_dict['similarity_metrics'] = {}
columns_metadata_dict['columns_for_comparison'] = []

### coordinate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{coordinate}$ holds coordinates of maps. To decide whether two maps cover the same geographical range, a metric is chosen that compares the digits of the coordinate numbers from left to right. The more digits are found to be equal, the higher the similarity value. The comparison stops at the first digit pair that is unequal. This requirement is satisfied by the LCS (longest common substring) algorithm, which generates the wanted result, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb).
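
For illustration, the longest common substring comparison can be sketched in plain Python. This is a re-implementation for explanation purposes, not the code of the TextDistance library. The helper names are hypothetical, and the normalisation by the length of the longer string is an assumption that reproduces values such as 0.875 reported for coordinate pairs in this section.

```python
def lcs_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # Extend the common run ending at the previous characters, or reset
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
        best = max(best, max(curr))
        prev = curr
    return best

def lcs_similarity(a: str, b: str) -> float:
    """Normalised similarity: common substring length over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0   # convention of this sketch for two empty strings
    return lcs_substring_length(a, b) / longest

print(lcs_similarity('e0080855', 'e0080851'))   # shared run 'e008085'
print(lcs_similarity('e0080851', 'e0080900'))   # shared run 'e0080'
print(lcs_similarity('n0460826', 'n0460833'))   # shared run 'n04608'
```

Because all coordinate strings start with the same kind of prefix, the longest common substring of two such values is in practice their common leading part, which gives the left-to-right behaviour described above.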

In [8]:
attribute = 'coordinate'

columns_metadata_dict['similarity_metrics'][attribute+'_E'] = tedi.LCSStr()
columns_metadata_dict['similarity_metrics'][attribute+'_N'] = tedi.LCSStr()

ne_values = ['_E', '_N']

for ne in ne_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+ne,
        columns_metadata_dict['similarity_metrics'][attribute+ne],
        columns_metadata_dict)

Looking at some samples of the feature matrix reveals a good match to the expectations.

In [9]:
import numpy as np

coordinate_E_delta_uniques = np.sort(df_feature_base['coordinate_E_delta'].unique())
coordinate_E_delta_uniques_len = len(coordinate_E_delta_uniques)
print('coordinate_E values range', coordinate_E_delta_uniques)

coordinate_N_delta_uniques = np.sort(df_feature_base['coordinate_N_delta'].unique())
coordinate_N_delta_uniques_len = len(coordinate_N_delta_uniques)
print('coordinate_N values range', coordinate_N_delta_uniques)

coordinate_E values range [0.    0.375 0.5   0.625 0.875 1.   ]
coordinate_N values range [0.    0.375 0.5   0.75  0.875 1.   ]


In [10]:
position = 3

dpf.show_samples_interval(df_feature_base, 'coordinate_E', coordinate_E_delta_uniques[
    coordinate_E_delta_uniques_len-position], coordinate_E_delta_uniques[coordinate_E_delta_uniques_len-position+1])
dpf.show_samples_interval(df_feature_base, 'coordinate_N', coordinate_N_delta_uniques[
    coordinate_N_delta_uniques_len-position], coordinate_N_delta_uniques[coordinate_N_delta_uniques_len-position+1])

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
54109,0,0.875,e0080855,e0080851
95182,0,0.875,e0080851,e0080855
95170,0,0.625,e0080851,e0080900
182249,0,0.875,e0080855,e0080851
95193,0,0.875,e0080851,e0080855


0.625 <= coordinate_E_delta <= 0.875


Unnamed: 0,duplicates,coordinate_N_delta,coordinate_N_x,coordinate_N_y
54072,0,0.75,n0460826,n0460833
95183,0,0.75,n0460821,n0460833
257345,0,0.75,n0460833,n0460821
182213,0,0.75,n0460826,n0460833
95175,0,0.75,n0460821,n0460833


0.75 <= coordinate_N_delta <= 0.875


The samples above show the wanted similarity behaviour for value ranges greater than 0. The metric has the weakness, though, that pairs of empty coordinate values, e.g. for bibliographical units other than maps, have each been assigned a similarity of 0. Some samples for duplicates in the training data are shown below.

In [11]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    'coordinate_E', coordinate_E_delta_uniques[0], coordinate_E_delta_uniques[1], 10)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
1227,1,0.0,,
901,1,0.0,,
1421,1,0.0,,
724,1,0.0,,
56,1,0.0,,
664,1,0.0,,
21,1,0.0,,
1263,1,0.0,,
503,1,0.0,,
512,1,0.0,,


0.0 <= coordinate_E_delta <= 0.375


This downside shall be avoided by marking pairs with missing coordinate values on both sides with a special value of -1, which signals to the models to be fitted the special case of missing information in a row. This logic is implemented in a special function $\texttt{.mark}\_\texttt{both}\_\texttt{missing()}$.
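
The marking logic can be sketched for a single value pair as follows. This is not the implementation in data_preparation_funcs.py, which operates on the DataFrame. The helper `similarity_or_marker` is hypothetical, and the sketch assumes that missing values are represented as empty strings or `None`.

```python
def similarity_or_marker(x, y, metric, marker=-1.0):
    """Return the marker if both values are missing, otherwise metric(x, y)."""
    def is_missing(value):
        # Assumption of this sketch: missing means None or empty string
        return value is None or value == ''

    if is_missing(x) and is_missing(y):
        return marker
    return metric(x or '', y or '')

# Simple exact-match metric used only for demonstration
identity = lambda a, b: float(a == b)

print(similarity_or_marker('', '', identity))                   # both missing
print(similarity_or_marker('e0080855', '', identity))           # one missing
print(similarity_or_marker('e0080855', 'e0080855', identity))   # both present
```

The special value lies outside the interval $[0, 1]$ of regular similarity values, so a model can separate the case of missing information on both sides from a true mismatch.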

In [12]:
for ne in ne_values :
    df_feature_base = dpf.mark_both_missing(df_feature_base, 'coordinate'+ne)

### corporate

Attribute $\texttt{corporate}$ is a collection of corporate names. The Monge-Elkan metric compares string tokens pairwise [[Chri2012](./A_References.ipynb#chri2012)] while the LCS metric searches for the longest common substring. Assessing the differences between these two metrics with the help of their value distributions in chapter [Features Discussion and Dummy Classifier Baseline](./4_FeatureDiscussionDummyBaseline.ipynb) reveals a better distribution behaviour for LCS. Therefore, the LCS metric is chosen for this attribute.

In [13]:
attribute = 'corporate'

columns_metadata_dict['similarity_metrics'][attribute+'_110'] = tedi.LCSStr()
columns_metadata_dict['similarity_metrics'][attribute+'_710'] = tedi.LCSStr()
# tedi.MongeElkan()

co_values = ['_110', '_710']

for co in co_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+co,
        columns_metadata_dict['similarity_metrics'][attribute+co],
        columns_metadata_dict)

The $110$ part of the attribute is sparsely filled, and even its $710$ part is filled in only a little more than $10\%$ of the records. The LCS metric generates a similarity of 1 for the cases where both strings of a pair are empty. Missing values on both sides may be an indicator for a pair of duplicates, but due to the sparsely available information, it is a weak indicator. Therefore, the pairs with missing data on both sides will be marked with the special value of -1.

In [14]:
corporate_110_delta_uniques = np.sort(df_feature_base['corporate_110_delta'].unique())
corporate_110_delta_uniques_len = len(corporate_110_delta_uniques)
print('corporate_110 values range', corporate_110_delta_uniques)

corporate_710_delta_uniques = np.sort(df_feature_base['corporate_710_delta'].unique())
corporate_710_delta_uniques_len = len(corporate_710_delta_uniques)
print('corporate_710 values range', corporate_710_delta_uniques)

corporate_110 values range [0. 1.]
corporate_710 values range [0.         0.01265823 0.01369863 0.01428571 0.01666667 0.01694915
 0.01818182 0.01886792 0.02       0.02040816 0.0212766  0.02173913
 0.02272727 0.02380952 0.02439024 0.025      0.02531646 0.02542373
 0.02564103 0.02702703 0.02739726 0.02857143 0.02898551 0.03
 0.03030303 0.03125    0.03278689 0.03333333 0.03389831 0.03448276
 0.03508772 0.03571429 0.03636364 0.03773585 0.03797468 0.03846154
 0.04       0.04081633 0.04109589 0.04166667 0.04237288 0.04255319
 0.04285714 0.04347826 0.04444444 0.04545455 0.046875   0.04761905
 0.04878049 0.04918033 0.05       0.05063291 0.05128205 0.05263158
 0.05405405 0.05454545 0.05479452 0.05555556 0.05660377 0.05714286
 0.05769231 0.05797101 0.05882353 0.05932203 0.06       0.06060606
 0.06122449 0.0625     0.06329114 0.06382979 0.06451613 0.06521739
 0.06557377 0.06666667 0.06779661 0.06849315 0.06896552 0.07
 0.07017544 0.07142857 0.07272727 0.07317073 0.075      0.0754717
 0.07594937 0

In [15]:
for co in co_values :
    df_feature_base = dpf.mark_both_missing(df_feature_base, 'corporate'+co)

Some sample cases are shown below for both $\texttt{corporate}$ features.

In [16]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    'corporate_110', 0.0, 1.0, 20
)

Unnamed: 0,duplicates,corporate_110_delta,corporate_110_x,corporate_110_y
548,1,0.0,,eidgenössisches topographisches bureau
549,1,0.0,eidgenössisches topographisches bureau,
550,1,1.0,eidgenössisches topographisches bureau,eidgenössisches topographisches bureau


0.0 <= corporate_110_delta <= 1.0


In [17]:
position = corporate_710_delta_uniques_len//2 # Let's have a look in the middle range of the similarities.

dpf.show_samples_interval(
    df_feature_base, 'corporate_710',
    corporate_710_delta_uniques[corporate_710_delta_uniques_len-position],
    corporate_710_delta_uniques[corporate_710_delta_uniques_len-position+2], 20)

Unnamed: 0,duplicates,corporate_710_delta,corporate_710_x,corporate_710_y
37043,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
37045,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
167608,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenössisches topographisches bureau
85913,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
10821,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
37041,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
167568,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
46580,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
10817,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau
10820,0,0.102041,schweizerische gesellschaft für bildungsforschung,eidgenssisches topographisches bureau


0.10126582278481011 <= corporate_710_delta <= 0.10204081632653061


### doi

Swissbib uses an explicit $\texttt{doi}$ and an explicit $\texttt{ismn}$ attribute for its deduplication implementation. As these dedicated identifiers are missing in Swissbib's data extract, cf. chapter [Data Analysis](./1_DataAnalysis.ipynb), an alternative comparison logic is chosen for this attribute. Each string element of a $\texttt{doi}$ list is compared separately with each string element of the $\texttt{doi}$ list it is compared to. If two bibliographic units have at least one element in common, this is interpreted as a strong indicator for duplicates.

A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$ (the same logic will be used for attribute $\texttt{isbn}$, see below) has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. The Identity metric is used for string comparison, calculating a similarity value of 1 or 0 for each list element pair. For normalisation reasons, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty a value of 1 is returned. If only one list is empty a value of 0 is returned.
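
The list comparison described above can be sketched as follows. This is an illustration of the stated rules, not the function from data_preparation_funcs.py. The name `list_similarity` is hypothetical, and the sketch assumes that each list holds distinct elements, so that the result stays within $[0, 1]$.

```python
def list_similarity(left, right):
    """Compare two lists of identifiers element by element with exact matching."""
    if not left and not right:
        return 1.0                      # both lists empty
    if not left or not right:
        return 0.0                      # only one list empty
    # Identity comparison of every element pair: 1 for equal, 0 otherwise
    matches = sum(1.0 for a in left for b in right if a == b)
    # Normalise by the number of elements of the smaller list
    return matches / min(len(left), len(right))

print(list_similarity(['10.5167/uzh-53042', '10.1093/cid/cir669'],
                      ['21998284', '10.1093/cid/cir669']))
print(list_similarity(['10.1093/cid/cir669'],
                      ['10.5167/uzh-53042', '10.1093/cid/cir669']))
print(list_similarity([], ['10.5169/seals-515332']))
```

The first call shares one of two elements and yields 0.5; the second call yields 1.0 because the single element of the smaller list is found in the larger list.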

In [18]:
attribute = 'doi'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base['doi_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{doi}\_\texttt{delta}$.

In [19]:
for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
103361,0,1.0,[],[]
256911,0,1.0,[],[]
194367,0,1.0,[],[]
107818,0,1.0,[],[]
56246,0,1.0,[],[]
68998,0,1.0,[],[]
25771,0,1.0,[],[]
203332,0,1.0,[],[]
7790,0,1.0,[],[]
120192,0,1.0,[],[]


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
203043,0,0.0,[],[10.5169/seals-515332]
124370,0,0.0,[],[10.5169/seals-515321]
215523,0,0.0,[],[10.5169/seals-376925]
194207,0,0.0,"[10.5451/unibas-006503313, urn:nbn:ch:bel-bau-...",[]
253099,0,0.0,[10.1055/b-005-143650],[]
105874,0,0.0,[],[10.5169/seals-377422]
249490,0,0.0,[],[10.5169/seals-376732]
208054,0,0.0,[10.1093/ndt/gft319],[]
17066,0,0.0,[],[10.5169/seals-377305]
94944,0,0.0,[],[10.3931/e-rara-61897]


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
153913,0,0.5,"[10.5167/uzh-53042, 10.1093/cid/cir669]","[21998284, 10.1093/cid/cir669]"


doi_delta = 0.5


In [20]:
# Let's have a look at some non-empty doi elements
df_doi_with_element = df_feature_base[df_feature_base.doi_x.apply(lambda x : len(x) > 0)]

for doi_delta_value in df_feature_base['doi_delta'].unique():
    # Limit the sample size to the rows available in the filtered DataFrame
    number_of_max_samples = min(
        10,
        len(df_doi_with_element[df_doi_with_element['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_doi_with_element, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
1188,1,1.0,[10.1093/cid/cir669],"[10.5167/uzh-53042, 10.1093/cid/cir669]"
1268,1,1.0,[10.1093/cid/ciu795],[10.1093/cid/ciu795]
1257,1,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
1472,1,1.0,[10.1055/b-005-143650],[10.1055/b-005-143650]
1259,1,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
1469,1,1.0,[10.1055/b-005-143650],[10.1055/b-005-143650]
1252,1,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
162019,0,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
1254,1,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
1267,1,1.0,[10.1093/cid/ciu795],[10.1093/cid/ciu795]


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
193748,0,0.0,"[10.5451/unibas-006503313, urn:nbn:ch:bel-bau-...",[]
166539,0,0.0,[10.1007/978-3-642-41698-9],[]
166135,0,0.0,[10.1093/ndt/gft319],[]
253238,0,0.0,[10.1055/b-005-143650],[]
207994,0,0.0,[10.1093/cid/cir669],[]
153852,0,0.0,"[10.5167/uzh-53042, 10.1093/cid/cir669]",[]
165187,0,0.0,[10.1093/cid/cir669],[]
167069,0,0.0,[10.1093/cid/ciu795],[]
194014,0,0.0,"[10.5451/unibas-006503313, urn:nbn:ch:bel-bau-...",[]
165539,0,0.0,[10.1093/cid/cir669],[]


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
153913,0,0.5,"[10.5167/uzh-53042, 10.1093/cid/cir669]","[21998284, 10.1093/cid/cir669]"


doi_delta = 0.5


### edition

The edition statement is a string value which may have several words. A Jaccard similarity is tried for this attribute.
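
To illustrate what a Jaccard similarity measures, the following plain Python sketch compares two strings as multisets of characters. This is a re-implementation for explanation, not the library code. Working on single characters rather than words is an assumption about the default settings of the library; it reproduces the value 0.255319 shown for one of the sample pairs in this section. The name `jaccard_chars` is hypothetical.

```python
from collections import Counter

def jaccard_chars(a: str, b: str) -> float:
    """Jaccard similarity on character multisets: |intersection| / |union|."""
    count_a, count_b = Counter(a), Counter(b)
    union = sum((count_a | count_b).values())         # maximum count per character
    if union == 0:
        return 1.0                                    # two empty strings
    intersection = sum((count_a & count_b).values())  # minimum count per character
    return intersection / union

print(jaccard_chars('1. Aufl., (1. bis 5. Tausend)',
                    'Faks. der autographen Partitur'))
print(jaccard_chars('6. Aufl', '6. Aufl'))
print(jaccard_chars('', '[Nachträge]'))
```

The first pair shares 12 characters out of a union of 47, which gives 12/47. Note that character-level counting assigns a noticeable similarity to edition statements that have no word in common.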

In [21]:
attribute = 'edition'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [22]:
edition_delta_uniques = np.sort(df_feature_base['edition_delta'].unique())
edition_delta_uniques_len = len(edition_delta_uniques)
print('edition values range', edition_delta_uniques[:30])

edition values range [0.         0.01515152 0.01785714 0.01851852 0.01886792 0.02173913
 0.02298851 0.02325581 0.02564103 0.02631579 0.02702703 0.02857143
 0.03030303 0.03076923 0.03225806 0.03305785 0.03333333 0.03448276
 0.03478261 0.03571429 0.03703704 0.0375     0.03846154 0.03921569
 0.03960396 0.04       0.04040404 0.04081633 0.04098361 0.04166667]


The comparison results in a large number of distinct similarity values for the goldstandard data set. Below, some examples are shown.

In [23]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position],
    edition_delta_uniques[edition_delta_uniques_len-position+2], 10)

position = edition_delta_uniques_len//2

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
98744,0,1.0,,
253370,0,1.0,,
177726,0,1.0,,
18611,0,1.0,,
130154,0,1.0,,
63281,0,1.0,,
72556,0,1.0,,
131182,0,1.0,,
115363,0,1.0,,
80484,0,1.0,,


0.9615384615384616 <= edition_delta <= 1.0


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
225488,0,0.0,,"Nouvelle éd., [verschiedene Auflagen]"
77159,0,0.0,,Nouv. éd. [4. Aufl.]
228961,0,0.0,,"Ueberdruck 1909/2, einzelne Nachträge 1912"
133664,0,0.0,,[Faksim.-Ausg.]
45088,0,0.0,,"7. Aufl. , Urtext der neuen Mozart-Ausg"
79680,0,0.0,,[Nachträge]
179874,0,0.0,,[Nouv. éd.]
169109,0,0.0,,"6., aktual. Aufl"
195863,0,0.0,2. Aufl. als Studienausg.,
165344,0,0.0,,"Nouv. éd., [vollständig überarb. Neuaufl.]"


0.0 <= edition_delta <= 0.017857142857142905


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
208869,0,0.255556,"6., überarb. und erw. Aufl",Nouv. éd. [vollständig überarb. Neuauflage (Et...
157017,0,0.255556,"6., überarb. und erw. Aufl",Nouv. éd. [vollständig überarb. Neuauflage (Et...
195599,0,0.255319,2. Aufl. als Studienausg.,"[Rev., mit dem vollständigen Scenarium und mit..."
187413,0,0.255556,"6., überarb. und erw. Aufl",Nouv. éd. [vollständig überarb. Neuauflage (Et...
195705,0,0.255319,2. Aufl. als Studienausg.,"[Rev., mit dem vollständigen Scenarium und mit..."
155825,0,0.255556,"6., überarb. und erw. Aufl",Nouv. éd. [vollständig überarb. Neuauflage (Et...
193321,0,0.255319,2. Aufl. als Studienausg.,"[Rev., mit dem vollständigen Scenarium und mit..."
257929,0,0.255319,"5. Aufl., neu durchges. Aufl","Nachträge 1899, Ueberdruck 1902"
54361,0,0.255319,"1. Aufl., (1. bis 5. Tausend)",Faks. der autographen Partitur
180261,0,0.255556,"6., überarb. und erw. Aufl",Nouv. éd. [vollständig überarb. Neuauflage (Et...


0.25531914893617014 <= edition_delta <= 0.25555555555555554


Again, for $\texttt{edition}\_\texttt{delta} = 1$, many empty values of the $\texttt{edition}$ attribute can be observed. These will be marked with the special value of -1 in the data with the goal to distinguish them from the matching attribute pairs.

In [24]:
df_feature_base = dpf.mark_both_missing(df_feature_base, 'edition')

In [25]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
1470,1,1.0,"7., überarbeitete und erweiterte Auflage","7., überarbeitete und erweiterte Auflage"
144457,0,1.0,[3. Aufl.],[3. Aufl.]
1215,1,1.0,"6., überarb. und erw. Aufl","6., überarb. und erw. Aufl"
217439,0,1.0,[Nouv. éd.],[Nouv. éd.]
1354,1,1.0,2. Aufl. als Studienausg.,2. Aufl. als Studienausg.
1298,1,1.0,6. Aufl,6. Aufl
1042,1,1.0,10. Aufl,10. Aufl
1206,1,1.0,"6., überarb. und erw. Aufl","6., überarb. und erw. Aufl"
235049,0,1.0,[3. Aufl.],[3. Aufl.]
1322,1,1.0,[Nouv.] éd.,[Nouv.] éd.


0.9615384615384616 <= edition_delta <= 1.0


### exactDate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{exactDate}$ holds a year number stored in the first four digits. The letter 'u' is used as a placeholder for an unknown digit. Additionally, the attribute may hold month and day information or a second year in its second four digits.

The attribute will be kept as a string and will not be transformed into an integer. The feature attribute of the record pair to be compared will be calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity will be stored in a new attribute $\texttt{exactDate}\_\texttt{delta}$ which will be taken for the model calculation.
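
The modified Hamming comparison can be sketched for a single pair of date strings as follows. The function `exact_date_similarity` is a hypothetical helper written for illustration; it mirrors the steps applied to the DataFrame in this section and assumes two strings of eight characters each.

```python
def exact_date_similarity(date_x: str, date_y: str) -> float:
    """Hamming similarity with a bonus of 1/16 per placeholder position."""
    # Replace the placeholder 'u' on one side so that two unknown digits
    # never count as a position-wise match
    x = date_x.replace('u', 'a')
    y = date_y
    length = max(len(x), len(y))
    matches = sum(cx == cy for cx, cy in zip(x, y))
    hamming_similarity = matches / length
    # Bonus for unknown digits, using the larger placeholder count of the pair
    bonus = max(x.count('a'), y.count('u')) / 16
    return hamming_similarity + bonus

print(exact_date_similarity('1920uuuu', '1795uuuu'))
print(exact_date_similarity('1982uuuu', '19uuuuuu'))
print(exact_date_similarity('20091990', '20091990'))
```

For the first pair, only the leading digit matches (1/8) and four placeholder positions add 4/16, which gives 0.375. Each unknown digit thus contributes half of the value of a matching digit.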

In [26]:
attribute = 'exactDate'

# Replace the placeholder letter 'u' with letter 'a' in one of the two strings.
#  As a result, an unknown digit never matches position-wise, so it
#  contributes 0 to the Hamming similarity.
df_feature_base[attribute+'_x'] = df_feature_base.exactDate_x.str.replace('u', 'a')

# Compute Hamming similarity for the date string pair.
columns_metadata_dict['similarity_metrics'][attribute] = tedi.Hamming()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

# Add 1/16 to the Hamming similarity for every placeholder letter,
#  counting only the larger number of placeholders of the two strings.
df_feature_base[attribute+'_delta'] = df_feature_base[[
    attribute+'_x', attribute+'_y', attribute+'_delta']].apply(
    lambda x : x[attribute+'_delta'] + 
    max(x[attribute+'_x'].count('a'), x[attribute+'_y'].count('u'))/16, axis=1
)
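
The modified Hamming similarity computed in the cell above can be retraced for a single pair with a small stand-alone function. This is only a sketch in plain Python: the function name `exact_date_similarity` is hypothetical, it does not call `tedi.Hamming()` or `dpf.build_delta_feature()`, and it assumes that both strings have the same length of eight characters.

```python
def exact_date_similarity(date_x: str, date_y: str) -> float:
    """Hamming similarity plus 1/16 per placeholder position (sketch)."""
    # Mask the placeholder in one string, so that 'u' never matches 'u'.
    masked_x = date_x.replace('u', 'a')
    length = max(len(masked_x), len(date_y))
    if length == 0:
        # Assumption: two empty strings count as identical.
        return 1.0
    # Share of positions that hold the same character in both strings.
    matches = sum(cx == cy for cx, cy in zip(masked_x, date_y))
    hamming = matches / length
    # Bonus for placeholders, based on the larger placeholder count.
    bonus = max(masked_x.count('a'), date_y.count('u')) / 16
    return hamming + bonus


print(exact_date_similarity('1959uuuu', '1959uuuu'))  # 0.75
print(exact_date_similarity('1920uuuu', '1795uuuu'))  # 0.375
print(exact_date_similarity('20091990', '20091990'))  # 1.0
```

A pair with identical years but unknown month and day ends at 0.75 instead of 1.0, so a fully known and identical date remains the only way to reach the maximum value.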

In [27]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']].sample(n=10)

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
805,20091990,20091990,1.0
82113,1920aaaa,1795uuuu,0.375
135121,1764aaaa,1900uuuu,0.375
172313,2015aaaa,2001uuuu,0.5
14158,1976aaaa,1965uuuu,0.5
203044,2011aaaa,2014uuuu,0.625
197550,2016aaaa,2001uuuu,0.5
236965,2015aaaa,1981uuuu,0.25
71870,1982aaaa,19uuuuuu,0.625
157025,2013aaaa,2002uuuu,0.5


All resulting values of equal strings are equal to 1.

In [28]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']][
    df_feature_base.exactDate_x == df_feature_base.exactDate_y
].sort_values('exactDate_delta', ascending=False).head()

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
159,20022000,20022000,1.0
843,19849999,19849999,1.0
845,19969999,19969999,1.0
846,19969999,19969999,1.0
847,19969999,19969999,1.0


Thirteen different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [29]:
exactDate_deltas = np.sort(df_feature_base.exactDate_delta.unique())
exactDate_deltas

array([0.    , 0.125 , 0.25  , 0.3125, 0.375 , 0.4375, 0.5   , 0.5625,
       0.625 , 0.6875, 0.75  , 0.875 , 1.    ])

In [30]:
sample_size = 5

for i in exactDate_deltas :
    dpf.show_samples_distinct(df_feature_base, 'exactDate', i, sample_size)
    print(f'exactDate_delta = {i}')

Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
36258,0,0.0,19702006,20121970
200402,0,0.0,20151475,19949999
246945,0,0.0,18761920,20022003
78476,0,0.0,20062005,19631970
200721,0,0.0,20151475,19289999


exactDate_delta = 0.0


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
156308,0,0.125,18761920,20022010
8810,0,0.125,20022000,18761920
208035,0,0.125,201310aa,19819999
200390,0,0.125,20151475,19841992
173038,0,0.125,20159999,19811984


exactDate_delta = 0.125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
81347,0,0.25,1996aaaa,2008uuuu
102092,0,0.25,2002aaaa,1475uuuu
238929,0,0.25,1987aaaa,2012uuuu
204914,0,0.25,2011aaaa,1992uuuu
242394,0,0.25,1763aaaa,2000uuuu


exactDate_delta = 0.25


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
29229,0,0.3125,20071990,189uuuuu
89093,0,0.3125,170aaaaa,2010uuuu
100885,0,0.3125,2000aaaa,181uuuuu
12677,0,0.3125,2007aaaa,181uuuuu
95694,0,0.3125,2005aaaa,192uuuuu


exactDate_delta = 0.3125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
226766,0,0.375,1764aaaa,1995uuuu
128176,0,0.375,18801890,1978uuuu
56669,0,0.375,1993aaaa,1475uuuu
100526,0,0.375,1836aaaa,19811987
239583,0,0.375,1763aaaa,1994uuuu


exactDate_delta = 0.375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
251181,0,0.4375,183aaaaa,1991uuuu
89027,0,0.4375,170aaaaa,1993uuuu
253461,0,0.4375,2017aaaa,181uuuuu
89236,0,0.4375,170aaaaa,1912uuuu
251183,0,0.4375,183aaaaa,19951997


exactDate_delta = 0.4375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
211068,0,0.5,2013aaaa,2004uuuu
72445,0,0.5,1999aaaa,1971uuuu
114851,0,0.5,19791999,1943uuuu
255067,0,0.5,2017aaaa,2003uuuu
87139,0,0.5,1991aaaa,1980uuuu


exactDate_delta = 0.5


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
130087,0,0.5625,1981aaaa,193uuuuu
184242,0,0.5625,20159999,200uuuuu
220002,0,0.5625,2016aaaa,200uuuuu
40862,0,0.5625,19561791,192uuuuu
12079,0,0.5625,1991aaaa,193uuuuu


exactDate_delta = 0.5625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
44591,0,0.625,2007aaaa,2000uuuu
138916,0,0.625,2005aaaa,2003uuuu
226201,0,0.625,1989aaaa,1984uuuu
31435,0,0.625,20071990,2004uuuu
153453,0,0.625,18501875,1950uuuu


exactDate_delta = 0.625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
96630,0,0.6875,2002aaaa,200uuuuu
27494,0,0.6875,2000aaaa,200uuuuu
99014,0,0.6875,2005aaaa,200uuuuu
31070,0,0.6875,20071990,200uuuuu
23322,0,0.6875,2005aaaa,200uuuuu


exactDate_delta = 0.6875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
710,1,0.75,1959aaaa,1959uuuu
128985,0,0.75,1998aaaa,1998uuuu
259171,0,0.75,2016aaaa,2016uuuu
124707,0,0.75,1941aaaa,1941uuuu
80166,0,0.75,1998aaaa,1998uuuu


exactDate_delta = 0.75


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
33028,0,0.875,19959999,19969999
88357,0,0.875,19969999,19989999
50101,0,0.875,19969999,19989999
135441,0,0.875,19949999,19989999
88385,0,0.875,19969999,19949999


exactDate_delta = 0.875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
206,1,1.0,19791999,19791999
785,1,1.0,20091990,20091990
429,1,1.0,19911990,19911990
810,1,1.0,20091990,20091990
225,1,1.0,19791999,19791999


exactDate_delta = 1.0


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split up into two new attributes $\texttt{format_prefix}$ and $\texttt{format_postfix}$, which will be compared with different similarity metrics.

- As the quality of $\texttt{format_prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format_postfix}$, a q-gram based comparison will be chosen.
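
To illustrate what a q-gram based comparison does, the following sketch computes a Jaccard similarity on the multisets of q-grams of two strings. It is a simplified stand-in written in plain Python, not the code of the TextDistance library. The function names `qgrams` and `qgram_jaccard` as well as the sample strings are assumptions made for this illustration.

```python
from collections import Counter


def qgrams(text: str, q: int) -> Counter:
    """Count all substrings of length q (empty if text is shorter than q)."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the q-gram multisets of two strings (sketch)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    # Multiset union takes the maximum count, intersection the minimum.
    union = sum((grams_a | grams_b).values())
    if union == 0:
        # Assumption: without any q-gram, only identical strings match.
        return 1.0 if a == b else 0.0
    intersection = sum((grams_a & grams_b).values())
    return intersection / union


print(qgram_jaccard('abcd', 'abce'))  # 0.5, two of four distinct bigrams shared
print(qgram_jaccard('abcd', 'abcd'))  # 1.0
print(qgram_jaccard('abcd', 'wxyz'))  # 0.0
```

With `q=1` the comparison works on single characters. Larger values of `q` make the metric more sensitive to the order of the characters.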

In [31]:
attribute = 'format'

columns_metadata_dict['similarity_metrics'][attribute+'_prefix'] = tedi.Identity()
columns_metadata_dict['similarity_metrics'][attribute+'_postfix'] = tedi.Jaccard(qval=2)

pfix_values = ['_prefix', '_postfix']

for pf in pfix_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pf,
        columns_metadata_dict['similarity_metrics'][attribute+pf],
        columns_metadata_dict)

In [32]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_prefix', i)
    print(f'format_prefix_delta = {i}')

Unnamed: 0,duplicates,format_prefix_delta,format_prefix_x,format_prefix_y
216398,0,0.0,mu,bk
61433,0,0.0,cr,bk
107468,0,0.0,bk,vm
25360,0,0.0,mu,bk
68252,0,0.0,mp,bk


format_prefix_delta = 0.0


In [33]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_postfix', i)
    print(f'format_postfix_delta = {i}')

Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
140387,0,0.428571,10200,10100
152288,0,0.428571,10200,20000
251347,0,0.428571,10100,10300
84008,0,0.428571,20000,20053
17924,0,0.428571,10200,40100


format_postfix_delta = 0.4285714285714286


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
186971,0,0.111111,20000,10300
139821,0,0.111111,20000,10053
2150,0,0.111111,20000,10053
156165,0,0.111111,30053,20000
129754,0,0.111111,20000,10300


format_postfix_delta = 0.11111111111111116


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
90148,0,0.25,20047,40100
65406,0,0.25,20000,20353
115483,0,0.25,10200,20353
100199,0,0.25,10200,20353
11767,0,0.25,20000,20347


format_postfix_delta = 0.25


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
104171,0,0.0,20000,30653
19385,0,0.0,10347,20000
234099,0,0.0,20000,30653
196324,0,0.0,10347,20000
19299,0,0.0,10347,20000


format_postfix_delta = 0.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
127751,0,1.0,20000,20000
117412,0,1.0,20000,20000
180004,0,1.0,20000,20000
108123,0,1.0,20000,20000
119487,0,1.0,20000,20000


format_postfix_delta = 1.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
90454,0,0.666667,20047,20400
26102,0,0.666667,20047,20400
90470,0,0.666667,20047,20400
244959,0,0.666667,20400,20047
26086,0,0.666667,20047,20400


format_postfix_delta = 0.6666666666666666


### isbn

Swissbib compares each string element of an $\texttt{isbn}$ list separately with each string element of the $\texttt{isbn}$ list of the other record. If two bibliographic units hold at least one element in common, this is interpreted as a strong indicator for duplicates [[WiCo2001](./A_References.ipynb#wico2001)].

This hard logic is used in a modified way in the context of this capstone project. A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$ has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. According to Swissbib's implementation, the Identity metric is used for string comparison, calculating a similarity value of 1 or 0 for each list element pair. For normalisation reasons, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty a value of 1.0 is returned. If only one list is empty a value of 0.0 is returned.
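
The rules described above can be sketched in a few lines. The function below is a hypothetical re-implementation for illustration and not the code of $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$. Its name `isbn_list_similarity` is an assumption, and the Identity metric is replaced by a plain string comparison.

```python
from typing import List


def isbn_list_similarity(isbn_x: List[str], isbn_y: List[str]) -> float:
    """Share of identical elements, relative to the smaller list (sketch)."""
    if not isbn_x and not isbn_y:
        # Both lists are empty.
        return 1.0
    if not isbn_x or not isbn_y:
        # Only one list is empty.
        return 0.0
    # Identity comparison of every left element with every right element.
    matches = sum(1 for left in isbn_x for right in isbn_y if left == right)
    # Normalise by the number of elements of the smaller list.
    return matches / min(len(isbn_x), len(isbn_y))


left = ['978-3-13-127286-7', '978-3-13-150826-3 (PDF)']
right = ['978-3-13-127286-7', '3-13-127286-4']
print(isbn_list_similarity(left, right))  # 0.5
print(isbn_list_similarity([], []))       # 1.0
```

Note that an element such as `978-3-13-150826-3 (PDF)` only matches an element with exactly the same suffix, because the comparison is done on the full string.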

In [34]:
attribute = 'isbn'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base[attribute+'_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{isbn_delta}$.

In [35]:
for isbn_delta_value in df_feature_base['isbn_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['isbn_delta']==isbn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'isbn', isbn_delta_value, number_of_max_samples)
    print(f'isbn_delta = {isbn_delta_value}')

Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
597,0,1.0,[],[]
97795,0,1.0,[],[]
257257,0,1.0,[],[]
40942,0,1.0,[],[]
229993,0,1.0,[],[]
250036,0,1.0,[],[]
62400,0,1.0,[],[]
189069,0,1.0,[],[]
52750,0,1.0,[],[]
73075,0,1.0,[],[]


isbn_delta = 1.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
244370,0,0.0,[],[3-499-17476-6]
199711,0,0.0,"[978-3-642-41698-9, 978-3-642-41697-2 (print)]",[]
188614,0,0.0,[978-3-648-07838-9],[]
113148,0,0.0,[],[978-88-7922-121-4]
51670,0,0.0,[3-495-47879-5],[]
75785,0,0.0,[3-13-127283-X],[]
75339,0,0.0,[],[3-15-002620-2]
155827,0,0.0,"[978-3-13-127286-7, 3-13-127286-4]",[3-292-00266-4]
138855,0,0.0,[3-7655-8593-9],"[978-3-598-31803-0 (print), 978-3-11-096275-8]"
161702,0,0.0,"[978-3-642-41697-2, 978-3-642-41698-9 (ebook)]",[]


isbn_delta = 0.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
1202,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1210,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
161989,0,0.5,"[978-3-642-41697-2, 978-3-642-41698-9 (ebook)]","[978-3-642-41697-2, 3-642-41697-7]"
1199,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1205,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
1201,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1195,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"


isbn_delta = 0.5


### musicid

Chapter [Data Analysis](./1_DataAnalysis.ipynb) shows that attribute $\texttt{musicid}$ is an identifier for a music record. A Jaccard metric has been tested on this attribute, yielding many high similarity values for pairs of uniques. After comparing this result with the LCS metric, the latter has been chosen.
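
As a reading aid, the idea behind a longest common substring similarity can be sketched in plain Python. The sketch assumes that the similarity is the length of the longest common substring divided by the length of the longer string. The function names are hypothetical, and the code is not taken from the TextDistance library.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the common substring that ends one position earlier.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def lcs_similarity(a: str, b: str) -> float:
    """Longest common substring length relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return longest_common_substring_length(a, b) / longest


print(lcs_similarity('BA 4553a', 'BA 4553'))  # 0.875
print(lcs_similarity('501326', '501326'))     # 1.0
```

A single additional character at the end of an identifier lowers the value only slightly, while an inserted blank in the middle splits the common substring and lowers it more.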

In [36]:
attribute = 'musicid'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.LCSStr()
#musicid_algorithm = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [37]:
musicid_delta_uniques = np.sort(df_feature_base['musicid_delta'].unique())
musicid_delta_uniques_len = len(musicid_delta_uniques)
print('musicid values range', musicid_delta_uniques)

musicid values range [0.         0.01190476 0.02380952 0.02702703 0.02941176 0.03125
 0.03448276 0.03571429 0.03703704 0.03846154 0.04166667 0.04347826
 0.04545455 0.04761905 0.05       0.05263158 0.05405405 0.05555556
 0.05882353 0.0625     0.06666667 0.06896552 0.07142857 0.07407407
 0.07692308 0.08108108 0.08333333 0.08695652 0.08823529 0.09090909
 0.09375    0.0952381  0.1        0.10344828 0.10526316 0.10714286
 0.11111111 0.11538462 0.11764706 0.125      0.13043478 0.13333333
 0.13513514 0.13636364 0.13793103 0.14285714 0.14705882 0.14814815
 0.15       0.15384615 0.15789474 0.16666667 0.17647059 0.17857143
 0.18181818 0.1875     0.19230769 0.2        0.20588235 0.21428571
 0.2173913  0.22222222 0.23076923 0.23529412 0.24324324 0.25
 0.26086957 0.26470588 0.26666667 0.27272727 0.28571429 0.29411765
 0.3        0.30434783 0.31034483 0.3125     0.31818182 0.33333333
 0.35       0.35714286 0.36842105 0.375      0.4        0.40909091
 0.41666667 0.42857143 0.4375     0.44444444 0.5  

In [38]:
position = musicid_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'musicid',
    musicid_delta_uniques[musicid_delta_uniques_len-position-2],
    musicid_delta_uniques[musicid_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'musicid',
    musicid_delta_uniques[musicid_delta_uniques_len-position],
    musicid_delta_uniques[musicid_delta_uniques_len-position+2], 10)

position = musicid_delta_uniques_len//20

dpf.show_samples_interval(
    df_feature_base, 'musicid',
    musicid_delta_uniques[musicid_delta_uniques_len-position],
    musicid_delta_uniques[musicid_delta_uniques_len-position+1], 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
516,1,1.0,BA 4553a,BA 4553a
267,1,1.0,501326,501326
249157,0,0.875,BA 4553a,BA 4553
115817,0,1.0,BA 4553a,BA 4553a
52194,0,0.875,BA 4553,BA 4553a
277,1,1.0,502023,502023
819,1,1.0,Deutsche Grammophon 415 287-2,Deutsche Grammophon 415 287-2
929,1,1.0,5944,5944
799,1,1.0,502430,502430
988,1,1.0,10425EP 71,10425EP 71


0.875 <= musicid_delta <= 1.0


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
108334,0,0.0,,
219383,0,0.0,,ZA 8.35766
190062,0,0.0,,
33971,0,0.0,,
243291,0,0.0,,
66783,0,0.0,10425,
139414,0,0.0,99036,
75350,0,0.0,BA 4553a,
136422,0,0.0,TP 601,
148868,0,0.0,Erato 0630-12705-2,


0.0 <= musicid_delta <= 0.023809523809523836


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
537,1,0.777778,BA 4553a,BA 4553 a
508,1,0.777778,BA 4553 a,BA 4553a
513,1,0.777778,BA 4553a,BA 4553 a
18222,0,0.777778,BA 4553 a,BA 4553a
18091,0,0.777778,BA 4553 a,BA 4553a
18073,0,0.777778,BA 4553 a,BA 4553a
17909,0,0.777778,BA 4553 a,BA 4553
525,1,0.777778,BA 4553a,BA 4553 a
512,1,0.777778,BA 4553 a,BA 4553a
519,1,0.777778,BA 4553a,BA 4553 a


0.7777777777777778 <= musicid_delta <= 0.8571428571428572


In [39]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1], 'musicid',
    musicid_delta_uniques[0],
    musicid_delta_uniques[musicid_delta_uniques_len-1], 20)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
939,1,0.666667,U. E. 245,U.E. 245
453,1,0.0,,
99,1,0.0,,
570,1,0.0,,
381,1,0.0,,
1224,1,0.0,,
1391,1,0.0,,
261,1,1.0,501326,501326
389,1,0.0,,
1312,1,0.0,,


0.0 <= musicid_delta <= 1.0


The attribute is filled for less than $10\%$ of the records. The chosen metric results in a similarity value of 1.0 for pairs of empty values. This effect is adjusted with function $\texttt{.mark}\_\texttt{both}\_\texttt{missing()}$, as for the attributes above.

In [40]:
df_feature_base = dpf.mark_both_missing(df_feature_base, 'musicid')
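
The marking of pairs of empty values can be pictured with the following sketch. It is not the implementation of $\texttt{.mark}\_\texttt{both}\_\texttt{missing()}$. The helper names `is_empty` and `delta_or_missing` are hypothetical, and treating `None`, `NaN` and the empty string as empty is an assumption. The marker value of -1.0 is the one named for this purpose elsewhere in this chapter.

```python
import math

MISSING_PAIR = -1.0  # marker value for pairs in which both values are empty


def is_empty(value) -> bool:
    """Assumption: None, NaN and the empty string all count as empty."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == ''


def delta_or_missing(value_x, value_y, delta: float) -> float:
    """Keep the similarity unless both values of the pair are empty."""
    if is_empty(value_x) and is_empty(value_y):
        return MISSING_PAIR
    return delta


print(delta_or_missing('', '', 1.0))                     # -1.0
print(delta_or_missing('BA 4553a', '', 0.0))             # 0.0
print(delta_or_missing('BA 4553a', 'BA 4553a', 1.0))     # 1.0
```

Moving these pairs to a value outside the interval from 0 to 1 keeps them apart from pairs whose values were really compared.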

### part

Three different metrics have been tried for attribute $\texttt{part}$. Of these, the Jaro metric has been chosen.

In [41]:
attribute = 'part'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaro()
#part_algorithm = tedi.Hamming()
#part_algorithm = tedi.LCSStr()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [42]:
part_delta_uniques = np.sort(df_feature_base['part_delta'].unique())
part_delta_uniques_len = len(part_delta_uniques)
print('part values range', part_delta_uniques[:40])

part values range [0.         0.21198157 0.21483871 0.23018394 0.23278237 0.23368298
 0.23414558 0.23799534 0.23884298 0.2399005  0.24095238 0.24328358
 0.24458874 0.24461538 0.2457265  0.24717322 0.24761905 0.24814815
 0.24888889 0.24937343 0.24966262 0.25       0.25058781 0.25108225
 0.25132275 0.25303644 0.25393939 0.254662   0.25550964 0.25641026
 0.2567426  0.25687285 0.25714286 0.25757576 0.25961538 0.25995025
 0.26       0.26143791 0.26231884 0.26236045]


In [43]:
position = part_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'part',
    part_delta_uniques[part_delta_uniques_len-position-2],
    part_delta_uniques[part_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'part',
    part_delta_uniques[part_delta_uniques_len-position],
    part_delta_uniques[part_delta_uniques_len-position+2], 10)

position = part_delta_uniques_len//7

dpf.show_samples_interval(
    df_feature_base, 'part',
    part_delta_uniques[part_delta_uniques_len-position-2],
    part_delta_uniques[part_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'part',
    part_delta_uniques[part_delta_uniques_len-position],
    part_delta_uniques[part_delta_uniques_len-position+2], 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
153909,0,1.0,,
83718,0,1.0,,
6661,0,1.0,,
135341,0,1.0,,
141716,0,1.0,,
208750,0,1.0,,
115530,0,1.0,,
174941,0,1.0,,
215179,0,1.0,,
143631,0,1.0,,


0.9777777777777779 <= part_delta <= 1.0


Unnamed: 0,duplicates,part_delta,part_x,part_y
218791,0,0.0,,33001
122628,0,0.0,5.0,
45496,0,0.0,,1
196640,0,0.0,,"band 33, band 33"
119844,0,0.0,2.0,
180837,0,0.0,,lexique sonore [enregistrement sonore]
71622,0,0.0,,7
169289,0,0.0,4.0,
168558,0,0.0,63.0,
37765,0,0.0,,"2620, bd. 5, 2620, 5"


0.0 <= part_delta <= 0.21483870967741936


Unnamed: 0,duplicates,part_delta,part_x,part_y
124900,0,0.622222,"2620, 2620",208
57419,0,0.622222,lfg. 2,"jg. 2 (1980), heft 1"
142441,0,0.622222,"2620, 2620",208
68147,0,0.622222,f. 285,"jg. 2 (1980), heft 1"
147644,0,0.622494,"nr. 2620, 2620","bl. 23, 23,1912, 23"
227167,0,0.622222,"foglio 285, 285","bl. 23, 23,1885"
52055,0,0.622222,bd. 19,"jg. 2 (1980), heft 1"
91391,0,0.622222,nr. 21,"jg. 2 (1980), heft 1"
125093,0,0.622222,"2620, 2620",208
125157,0,0.622222,"2620, 2620",208


0.6222222222222222 <= part_delta <= 0.6224937343358395


Unnamed: 0,duplicates,part_delta,part_x,part_y
27865,0,0.625,bd. 8008,band 9
241274,0,0.625,13 d,"1368. contemporain, 1368"
14119,0,0.625,2620,"bl. 23,1869, 23,1869, 23"
5391,0,0.625,bd. 63,nr. 2620
28727,0,0.625,bd. 57,nr. 7633
219447,0,0.625,"63, 63",nr. 7633
47616,0,0.625,"7, 7","2017, nr. 313, 2017, 313"
186363,0,0.625,n. 1,"bl. 23,1869, 23,1869, 23"
241793,0,0.625,2620,"bl. 23,1896, 23,1896, 23"
150311,0,0.625,"63, 63",nr. 7633


0.6232298474945533 <= part_delta <= 0.625


In this attribute, too, moving pairs of empty values to -1.0 will result in a clearer distinction between pairs of uniques and duplicates, as will be seen in the graphical comparison of chapter [Features Discussion and Dummy Classifier Baseline](./4_FeatureDiscussionDummyBaseline.ipynb).

In [44]:
df_feature_base = dpf.mark_both_missing(df_feature_base, 'part')

### person

As a result of chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{person}$ has been split into three specific attributes. Attributes $\texttt{person}\_{100}$ and $\texttt{person}\_{700}$ hold strongly standardised string values. For comparing pure strings, a Levenshtein metric is recommended [[Chri2012](./A_References.ipynb#chri2012)]. Unfortunately, this metric shows a very long calculation time on the data of the capstone project. After comparing the similarity values of the Levenshtein metric with those of other metrics, similarity metric StrCmp95 has been chosen instead.
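
The calculation time of the Levenshtein metric becomes plausible when looking at a textbook implementation. The sketch below is plain Python with hypothetical function names and is not the implementation used by the TextDistance library. It fills one table cell for every combination of a character of the first string with a character of the second string, so the effort per pair grows with the product of the two string lengths.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalised by the longer string, turned into a similarity."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance('kitten', 'sitting'))  # 3
```

For two strings of about 50 characters each, this already means roughly 2500 cell updates for a single record pair.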

In [45]:
attribute = 'person'

columns_metadata_dict['similarity_metrics'][attribute+'_100'] = tedi.StrCmp95()
columns_metadata_dict['similarity_metrics'][attribute+'_700'] = tedi.StrCmp95()
# tedi.Levenshtein()

pe_values = ['_100', '_700']

for pe in pe_values :
    print('Calculating person'+pe)
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pe,
        columns_metadata_dict['similarity_metrics'][attribute+pe],
        columns_metadata_dict)

Calculating person_100
Calculating person_700


In [46]:
person_100_delta_uniques = np.sort(df_feature_base['person_100_delta'].unique())
person_100_delta_uniques_len = len(person_100_delta_uniques)
print('person_100 values range', person_100_delta_uniques[:40])

person_100 values range [0.         0.29487179 0.30092593 0.30740741 0.32495127 0.32706856
 0.32777778 0.32810458 0.33825397 0.33828502 0.34555556 0.34575163
 0.34768519 0.34944444 0.35081699 0.35399586 0.35566239 0.3562963
 0.35861823 0.36191285 0.36202614 0.36414141 0.36683209 0.36727982
 0.36767507 0.36783969 0.36993464 0.37148962 0.37188034 0.3751634
 0.37635328 0.37768116 0.37777778 0.37887446 0.38017429 0.38063973
 0.38188002 0.38258547 0.38334754 0.38372352]


In [47]:
position = person_100_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'person_100',
    person_100_delta_uniques[person_100_delta_uniques_len-position-2],
    person_100_delta_uniques[person_100_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'person_100',
    person_100_delta_uniques[person_100_delta_uniques_len-position],
    person_100_delta_uniques[person_100_delta_uniques_len-position+2], 10)

Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
70421,0,1.0,,
203998,0,1.0,,
48099,0,1.0,,
88532,0,1.0,,
157601,0,1.0,,
190578,0,1.0,,
80897,0,1.0,,
122364,0,1.0,,
207061,0,1.0,,
59941,0,1.0,mozartwolfgang amadeus1756-1791(de-588)118584596,mozartwolfgang amadeus1756-1791(de-588)118584596


0.96 <= person_100_delta <= 1.0


Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
182133,0,0.0,,bührerwalter
231997,0,0.0,,mozartwolfgang amadeus1756-1791(de-588)118584596
111713,0,0.0,austenjane1775-1817(de-588)118505173,
243379,0,0.0,,mozartwolfgang amadeus1756-1791(de-588)118584596
108372,0,0.0,mozartwolfgang amadeus1756-1791(de-588)118584596,
226443,0,0.0,,mortzfeldpeter
198171,0,0.0,,kesslersigrid
227417,0,0.0,,mozartwolfgang amadeus1756-1791(de-588)118584596
139130,0,0.0,,voltaire1694-1778
7970,0,0.0,jacquetluc,


0.0 <= person_100_delta <= 0.3009259259259258


For comparing person names, like in attribute $\texttt{person}\_{245c}$, a Jaro metric will be tested [[Chri2012](./A_References.ipynb#chri2012)].
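
For orientation, the textbook formulation of the Jaro similarity is sketched below in plain Python. The function name `jaro_similarity` is hypothetical, and details such as the size of the matching window may differ from the implementation behind `tedi.Jaro()`.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity of two strings (sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity('MARTHA', 'MARHTA'), 4))  # 0.9444
print(jaro_similarity('emma', 'trionfi'))             # 0.0
```

The metric tolerates swapped neighbouring characters, which is a common kind of typing error in names.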

In [48]:
columns_metadata_dict['similarity_metrics'][attribute+'_245c'] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute+'_245c',
    columns_metadata_dict['similarity_metrics'][attribute+'_245c'],
    columns_metadata_dict)

In [49]:
person_245c_delta_uniques = np.sort(df_feature_base['person_245c_delta'].unique())
person_245c_delta_uniques_len = len(person_245c_delta_uniques)
print('person_245c values range', person_245c_delta_uniques[:40])

person_245c values range [0.         0.24632035 0.25462963 0.25505051 0.25901876 0.26388889
 0.26893939 0.27020202 0.27104377 0.275      0.27777778 0.28100775
 0.28418803 0.28835979 0.29059829 0.29365079 0.294388   0.2962963
 0.2968254  0.29861111 0.3        0.30023852 0.30128205 0.3013468
 0.30246914 0.30444444 0.30555556 0.30952381 0.31089744 0.31224152
 0.31402258 0.31508967 0.31565657 0.31944444 0.32078853 0.3227467
 0.32491582 0.32597403 0.32666667 0.32777778]


In [50]:
position = person_245c_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'person_245c',
    person_245c_delta_uniques[person_245c_delta_uniques_len-position-2],
    person_245c_delta_uniques[person_245c_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'person_245c',
    person_245c_delta_uniques[person_245c_delta_uniques_len-position],
    person_245c_delta_uniques[person_245c_delta_uniques_len-position+2], 10)

Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
89312,0,1.0,,
212900,0,1.0,,
204238,0,1.0,,
212008,0,1.0,,
938,1,1.0,von w.a. mozart ; klavierauszug neu rev. von w...,von w.a. mozart ; klavierauszug neu rev. von w...
202769,0,1.0,,
210264,0,1.0,,
159745,0,1.0,regie: volker schlöndorff ; nach dem roman von...,regie: volker schlöndorff ; nach dem roman von...
57883,0,1.0,,
57288,0,1.0,,


0.9986338797814208 <= person_245c_delta <= 1.0


Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
251810,0,0.0,,sigrid kessler [u.a.]
64170,0,0.0,"c.y. cyrus chu, ronald lee (eds.)",
58259,0,0.0,,andreas kagermeier und tobias reeh (hrsg.)
120557,0,0.0,[von emmanuel schikaneder] ; [die musik ist vo...,
141670,0,0.0,wolfgang amadeus mozart ; emanuel schikaneder ...,
90855,0,0.0,voltaire,
48575,0,0.0,,sigrid kessler [u.a.]
123905,0,0.0,sigrid kessler... [et al.] ; [éd.:] interkanto...,
76964,0,0.0,,[musik:] wolfgang amadeus mozart ; [text von e...
26925,0,0.0,von beatrice käser,


0.0 <= person_245c_delta <= 0.25462962962962954


The similarities of all three $\texttt{person}$ attributes are affected by empty values. These will be handled as for the attributes above.

In [51]:
pe_values = ['_100', '_245c', '_700']

for pe in pe_values :
    df_feature_base = dpf.mark_both_missing(df_feature_base, 'person'+pe)

### pubinit

This attribute holds publisher strings that have a similar representation as attribute $\texttt{person}$ above. A Jaro metric will be used.

In [52]:
attribute = 'pubinit'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [53]:
pubinit_delta_uniques = np.sort(df_feature_base['pubinit_delta'].unique())
pubinit_delta_uniques_len = len(pubinit_delta_uniques)
print(attribute, 'values range', pubinit_delta_uniques[:40])

pubinit values range [0.         0.25132275 0.25303644 0.25434618 0.25555556 0.2571644
 0.25730994 0.25757576 0.25873016 0.25925926 0.25978836 0.26060606
 0.26143791 0.26157407 0.26296296 0.26372925 0.26388889 0.26430976
 0.2654321  0.26556777 0.26666667 0.26740741 0.26842105 0.26893939
 0.26923077 0.27037037 0.27171717 0.27222222 0.27254902 0.27350427
 0.27407407 0.27489177 0.2755102  0.27582846 0.27666667 0.27777778
 0.27777778 0.27816492 0.27855478 0.27898551]


In [54]:
position = pubinit_delta_uniques_len//3

dpf.show_samples_interval(
    df_feature_base, attribute,
    pubinit_delta_uniques[pubinit_delta_uniques_len-position-1],
    pubinit_delta_uniques[pubinit_delta_uniques_len-position], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    pubinit_delta_uniques[pubinit_delta_uniques_len-5],
    pubinit_delta_uniques[pubinit_delta_uniques_len-1], 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
99134,0,0.554834,frenetic films,staatlicher lehrmittelverlag bern
139066,0,0.554834,frenetic films,staatlicher lehrmittelverlag bern
87035,0,0.554834,[phonogram],polydor international
142046,0,0.554834,philipp reclam,staatlicher lehrmittelverlag bern
106286,0,0.554834,frenetic films,staatlicher lehrmittelverlag bern
44720,0,0.554834,gerstenberg,pearson education ltd
139484,0,0.554834,gerstenberg,pearson education ltd
136498,0,0.554809,cambridge univ. press,brillant classics
256478,0,0.554834,frenetic films,staatlicher lehrmittelverlag bern
97168,0,0.554834,gerstenberg,pearson education ltd


0.5548085901027077 <= pubinit_delta <= 0.5548340548340548


Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
872,1,1.0,servizio topografico federale,servizio topografico federale
69490,0,1.0,,
21263,0,1.0,,
82154,0,1.0,,
210786,0,1.0,,
161371,0,1.0,,
109922,0,1.0,staatlicher lehrmittelverlag,staatlicher lehrmittelverlag
15616,0,1.0,,
188706,0,1.0,,
206614,0,1.0,,


0.9777777777777779 <= pubinit_delta <= 1.0


The similarities of attribute $\texttt{pubinit}$ are affected by empty values, too. These will be handled as for the attributes above.

In [55]:
df_feature_base = dpf.mark_both_missing(df_feature_base, attribute)

### scale

Comparing the similarity metrics of some sample value pairs of attribute $\texttt{scale}$ in appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), a Jaccard metric has been identified as showing the best matching behaviour for the purely numerical values stored in the attribute.

In [56]:
attribute = 'scale'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [57]:
scale_delta_uniques = np.sort(df_feature_base[attribute+'_delta'].unique())
scale_delta_uniques_len = len(scale_delta_uniques)
print(attribute, 'values range', scale_delta_uniques)

scale values range [0.         0.04587156 0.05504587 0.57142857 1.        ]


In [58]:
position = scale_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, attribute,
    scale_delta_uniques[scale_delta_uniques_len-position],
    scale_delta_uniques[scale_delta_uniques_len-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    scale_delta_uniques[scale_delta_uniques_len-3],
    scale_delta_uniques[scale_delta_uniques_len-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    scale_delta_uniques[scale_delta_uniques_len-4],
    scale_delta_uniques[scale_delta_uniques_len-3], 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
64805,0,0.0,,100000.0
210478,0,0.0,,100000.0
215017,0,0.0,,100000.0
119021,0,0.0,,100000.0
74930,0,0.0,,100000.0
240016,0,0.0,,50000.0
118103,0,0.0,,100000.0
183864,0,0.0,100000.0,
216787,0,0.0,,100000.0
38017,0,0.0,,50000.0


0.0 <= scale_delta <= 0.04587155963302747


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
68361,0,0.571429,50000,100000
182211,0,0.571429,50000,100000
54075,0,0.571429,50000,100000
68357,0,0.571429,50000,100000
54054,0,0.571429,50000,100000
227491,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
94859,0,0.571429,50000,100000
68354,0,0.571429,50000,100000
183997,0,0.571429,100000,50000
181873,0,0.571429,50000,100000


0.05504587155963303 <= scale_delta <= 0.5714285714285714


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
227167,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227482,0,0.045872,Scala 1:50.000 ; proiezione cilindrica ad asse...,50000
227208,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227488,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227499,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227501,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227490,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227168,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227487,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227174,0,0.055046,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000


0.04587155963302747 <= scale_delta <= 0.05504587155963303


Attribute $\texttt{scale}$ is filled for maps only. Due to its sparse filling, the similarities of the attribute are strongly affected by empty values. Pairs of empty values will be marked with the special value of -1.0.

In [59]:
df_feature_base = dpf.mark_both_missing(df_feature_base, attribute)

### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split up into two new attributes $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$, which will be compared with the same similarity metric. A visual analysis of the stored values reveals strings of words, comparable to the strings in attribute $\texttt{person_245c}$ above. Therefore, the Jaro metric used there will be used for both title attributes.

In [60]:
attribute = 'ttlfull'

columns_metadata_dict['similarity_metrics'][attribute+'_245'] = tedi.Jaro()
columns_metadata_dict['similarity_metrics'][attribute+'_246'] = tedi.Jaro()

tf_values = ['_245', '_246']

for tf in tf_values:
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+tf,
        columns_metadata_dict['similarity_metrics'][attribute+tf],
        columns_metadata_dict)
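
For orientation, the Jaro similarity can be sketched in plain Python. This is a textbook formulation written for this explanation, not the TextDistance implementation behind `tedi.Jaro()`; the function name `jaro_similarity` is an assumption, and the library may treat edge cases differently. Two characters count as a match only if they are equal and lie within a window of half the longer string's length, minus one.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity in [0.0, 1.0] (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters may only match within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, char in enumerate(s1):
        lower = max(0, i - window)
        upper = min(len2, i + window + 1)
        for j in range(lower, upper):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity('emma, roman', 'emma'), 6))
print(jaro_similarity('emma', 'blick in die welt'))
```

For the title pair "emma, roman" and "emma", the sketch yields 0.787879, which agrees with the $\texttt{ttlfull_245_delta}$ value listed for this pair in the Feature Base section. The pair "emma" and "blick in die welt" yields 0.0, although both strings contain the letter "e", because that letter lies outside the matching window.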

In [61]:
tf = '_245'

ttlfull_delta_uniques = np.sort(df_feature_base[attribute+tf+'_delta'].unique())
ttlfull_delta_uniques_len = len(ttlfull_delta_uniques)
print(attribute, 'values range', ttlfull_delta_uniques[:40])

ttlfull values range [0.         0.24175824 0.24358974 0.24579125 0.24969475 0.25714286
 0.26060606 0.28461538 0.29220779 0.29271709 0.29537456 0.29699248
 0.30626781 0.31750842 0.31888654 0.31929825 0.31950729 0.31984127
 0.3225641  0.32362637 0.32491582 0.32619048 0.32735043 0.3286648
 0.33076923 0.33098846 0.33100775 0.33225806 0.33232323 0.33252033
 0.33273916 0.33278689 0.33333333 0.33418803 0.33549784 0.33585859
 0.33594771 0.33611111 0.33625731 0.33638713]


In [62]:
position = ttlfull_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-position],
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-3],
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-4],
    ttlfull_delta_uniques[ttlfull_delta_uniques_len-3], 10)

Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
185405,0,0.0,emma,trionfi
185729,0,0.0,emma,blick in die welt
11373,0,0.0,emma,trionfi
2453,0,0.0,emma,blick in die welt
2763,0,0.0,emma,blick in die welt
10591,0,0.0,emma,blick in die welt
5247,0,0.0,bildungsforschung und bildungspraxis,emma
111816,0,0.0,emma,blick in die welt
24559,0,0.0,bildungsforschung und bildungspraxis,emma
32545,0,0.0,arts,blick in die welt


0.0 <= ttlfull_245_delta <= 0.2417582417582418


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
249,1,0.997475,sozialleistungsbetrug: sozialversicherungsbetr...,sozialleistungsbetrug: sozialversicherungsbetr...
143980,0,0.995556,"bonne chance !, cours de langue française deux...","bonne chance !, cours de langue française, deu..."
237,1,0.997475,sozialleistungsbetrug: sozialversicherungsbetr...,sozialleistungsbetrug: sozialversicherungsbetr...


0.9955555555555556 <= ttlfull_245_delta <= 0.9974747474747474


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
31,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
84,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
66,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
199600,0,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
1357,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
61,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
1360,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
51792,0,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
1351,1,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
193301,0,0.99537,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."


0.9953703703703703 <= ttlfull_245_delta <= 0.9955555555555556


Attribute $\texttt{ttlfull}\_\texttt{245}$ is filled for all data rows of Swissbib's raw data, as can be seen in chapter [Data Analysis](./1_DataAnalysis.ipynb). For attribute $\texttt{ttlfull}\_\texttt{246}$, the filling is below $10\%$. The data pairs with missing values on both sides will be marked with a value of -1.0, as above.

In [63]:
df_feature_base = dpf.mark_both_missing(df_feature_base, attribute+'_246')

### volumes

Chapter [Data Analysis](./1_DataAnalysis.ipynb) describes this attribute as holding contents that resemble those of attribute $\texttt{part}$. The same similarity metric is therefore used for attribute $\texttt{volumes}$ as for attribute $\texttt{part}$.

In [64]:
attribute = 'volumes'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaro()
# tedi.MongeElkan()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [65]:
volumes_delta_uniques = np.sort(df_feature_base[attribute+'_delta'].unique())
volumes_delta_uniques_len = len(volumes_delta_uniques)
print(attribute, 'values range', volumes_delta_uniques[:40])

volumes values range [0.         0.24175824 0.24691358 0.24693423 0.24888889 0.25087719
 0.25128205 0.25132275 0.25252525 0.25483871 0.25498575 0.25555556
 0.25632184 0.25714286 0.2571644  0.25730994 0.25802469 0.2582846
 0.25897436 0.25925926 0.26       0.26111111 0.26143791 0.26388889
 0.26430976 0.26455026 0.26507937 0.2654321  0.26550388 0.26556777
 0.26587302 0.26648841 0.26666667 0.26754386 0.26842105 0.26984127
 0.27037037 0.27171717 0.27350427 0.27469136]


In [66]:
position = volumes_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, attribute,
    volumes_delta_uniques[volumes_delta_uniques_len-position],
    volumes_delta_uniques[volumes_delta_uniques_len-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    volumes_delta_uniques[volumes_delta_uniques_len-3],
    volumes_delta_uniques[volumes_delta_uniques_len-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    volumes_delta_uniques[volumes_delta_uniques_len-4],
    volumes_delta_uniques[volumes_delta_uniques_len-3], 10)

Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
21813,0,0.0,127 s.,
97551,0,0.0,64 s.,
216085,0,0.0,192 p.,
154273,0,0.0,,1 cd
198017,0,0.0,1 dvd-video (ca. 109 min.),
234366,0,0.0,2 cd dans un coffret,
171412,0,0.0,407 p.,
25397,0,0.0,1 taschenpartitur,
96326,0,0.0,95 s.,
215855,0,0.0,192 p.,


0.0 <= volumes_delta <= 0.2417582417582418


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
274,1,0.987654,2 dvd-videos (ca. 109 min.),2 dvd-video (ca. 109 min.)
307,1,0.987654,2 dvd-video (ca. 109 min.),2 dvd-videos (ca. 109 min.)
286,1,0.987654,2 dvd-video (ca. 109 min.),2 dvd-videos (ca. 109 min.)
100860,0,0.984848,1 dvd-video (169 min.),1 dvd-video (16 min.)
292,1,0.987654,2 dvd-videos (ca. 109 min.),2 dvd-video (ca. 109 min.)
304,1,0.987654,2 dvd-video (ca. 109 min.),2 dvd-videos (ca. 109 min.)
73444,0,0.984848,1 dvd-video (169 min.),1 dvd-video (16 min.)
271,1,0.987654,2 dvd-videos (ca. 109 min.),2 dvd-video (ca. 109 min.)
295,1,0.987654,2 dvd-videos (ca. 109 min.),2 dvd-video (ca. 109 min.)
283,1,0.987654,2 dvd-video (ca. 109 min.),2 dvd-videos (ca. 109 min.)


0.9848484848484849 <= volumes_delta <= 0.9876543209876543


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
148669,0,0.978495,2 disques compacts en 1 coffret,3 disques compacts en 1 coffret
100860,0,0.984848,1 dvd-video (169 min.),1 dvd-video (16 min.)
73444,0,0.984848,1 dvd-video (169 min.),1 dvd-video (16 min.)


0.978494623655914 <= volumes_delta <= 0.9848484848484849
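
Rows 100860 and 73444 appear in both of the last two sample tables because the printed interval bounds are inclusive on both sides, so a row whose value equals a bound falls into both neighbouring intervals. The sketch below reproduces this effect on three rows from the tables above. The helper name `rows_in_closed_interval` is hypothetical and does not reflect how `dpf.show_samples_interval` is implemented; the exact delta values are assumed to equal the printed interval bounds.

```python
# (row index, volumes_delta) pairs taken from the sample tables above;
# assumption: the exact values equal the printed interval bounds
rows = [
    (148669, 0.978494623655914),
    (100860, 0.9848484848484849),
    (274, 0.9876543209876543),
]


def rows_in_closed_interval(rows, lower, upper):
    """Return the indices of all rows with lower <= delta <= upper."""
    return [index for index, delta in rows if lower <= delta <= upper]


upper_interval = rows_in_closed_interval(
    rows, 0.9848484848484849, 0.9876543209876543)
lower_interval = rows_in_closed_interval(
    rows, 0.978494623655914, 0.9848484848484849)

print(upper_interval)  # [100860, 274]
print(lower_interval)  # [148669, 100860]
```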


Attribute $\texttt{volumes}$ holds rows with missing data. The data pairs with missing values on both sides will be marked with a value of -1.0.

In [67]:
df_feature_base = dpf.mark_both_missing(df_feature_base, attribute)

## Feature Base

The metric for each attribute of the feature DataFrame has been decided and the features have been calculated. The columns with the original attribute values are not needed for further processing; they are dropped to obtain the feature matrix for training the estimators.

In [68]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head(20)

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y,coordinate_E_delta,coordinate_N_delta,corporate_110_delta,corporate_710_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
0,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,-1.0,1.0
1,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,0.0,0.818905,0.848485,-1.0,0.787879,-1.0,1.0
2,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,0.855556,0.0,0.69774,0.848485,-1.0,1.0,-1.0,1.0
3,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,0.0,0.818905,0.848485,-1.0,0.787879,-1.0,1.0
4,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,1.0,-1.0,1.0
5,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen,reclam,reclam,,,emma,"emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,0.855556,-1.0,0.702265,1.0,-1.0,0.787879,-1.0,1.0
6,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,0.855556,0.0,0.69774,0.848485,-1.0,1.0,-1.0,1.0
7,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,,jane austen,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,"emma, roman",emma,,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,0.855556,-1.0,0.702265,1.0,-1.0,0.787879,-1.0,1.0
8,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane,,,jane austen,jane austen,reclam,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,1.0,-1.0,1.0
9,1,,,,,,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",[],[],,,2000aaaa,2000uuuu,vm,vm,10300,10300,[],[],,,,,levinejamesdir.,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...","mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,w. a. mozart ; libretto: emanuel schikaneder ;...,deutsche grammophon,deutsche grammophon,,,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen",,,"1 dvd-video, dvd region 0, 169 min., farb.","1 dvd-video, dvd region 0, 169 min., farb.",-1.0,-1.0,-1.0,1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,1.0,-1.0,1.0


In [69]:
# Drop all non-delta columns, except 'duplicates'
columns_to_be_dropped = [e for e in columns_metadata_dict['columns_to_use']
                         if e != 'duplicates']

df_feature_base.drop(columns=columns_to_be_dropped, inplace=True)

In [70]:
for i in range(2):
    display(df_feature_base[df_feature_base.duplicates==i].sample(n=20))

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_N_delta,corporate_110_delta,corporate_710_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
148110,0,-1.0,-1.0,-1.0,0.0,1.0,-1.0,0.625,0.0,0.111111,1.0,0.0,-1.0,0.474359,0.0,0.404762,0.457937,-1.0,0.514475,-1.0,0.472222
185697,0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.375,0.0,0.428571,1.0,0.0,-1.0,0.55629,-1.0,0.454545,0.583333,-1.0,0.512346,-1.0,0.411111
165282,0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,0.25,1.0,0.111111,1.0,-1.0,0.0,0.0,0.0,0.51992,0.0,-1.0,0.512945,-1.0,0.0
150524,0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,0.5,1.0,0.111111,0.0,-1.0,0.0,0.596459,0.0,0.588609,0.0,-1.0,0.546056,-1.0,0.0
246238,0,0.0,0.0,-1.0,0.0,1.0,0.0,0.25,0.0,0.111111,0.0,-1.0,0.466667,0.0,0.55105,0.51376,0.0,0.0,0.47193,0.0,0.467787
192728,0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.25,1.0,0.428571,0.0,-1.0,-1.0,0.566501,0.498557,0.59288,0.0,-1.0,0.590996,-1.0,0.46595
88185,0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.25,0.0,0.25,0.0,-1.0,0.0,0.0,0.0,0.0,0.444444,-1.0,0.503704,-1.0,0.0
158129,0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.5,0.0,0.111111,0.0,-1.0,0.0,0.0,0.0,0.425351,0.592593,-1.0,0.50176,-1.0,0.422494
9234,0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.25,0.0,0.428571,1.0,-1.0,-1.0,0.0,0.0,0.45235,0.396329,-1.0,0.528744,0.0,0.526455
199879,0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.25,1.0,0.428571,0.0,-1.0,0.0,0.438596,-1.0,0.551662,0.0,-1.0,0.598837,-1.0,0.419369


Unnamed: 0,duplicates,coordinate_E_delta,coordinate_N_delta,corporate_110_delta,corporate_710_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
1377,1,-1.0,-1.0,-1.0,1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,-1.0
210,1,-1.0,-1.0,-1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,0.576164,-1.0,-1.0,0.799708,0.0,0.0
398,1,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,0.855556,0.886667,0.845843,-1.0,-1.0,1.0,-1.0,1.0
805,1,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,1.0,1.0,0.0,-1.0,-1.0,0.618362,1.0,-1.0,-1.0,1.0,-1.0,1.0
955,1,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,1.0,-1.0,1.0
222,1,-1.0,-1.0,-1.0,0.49,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,0.762602,-1.0,-1.0,0.719298,0.818182,-1.0
1418,1,-1.0,-1.0,-1.0,1.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.841837,0.878465,0.662536
1280,1,-1.0,-1.0,-1.0,-1.0,1.0,1.0,0.75,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,1.0,-1.0,1.0
833,1,-1.0,-1.0,-1.0,0.0,1.0,-1.0,0.75,1.0,1.0,1.0,-1.0,-1.0,-1.0,0.0,0.0,0.977778,-1.0,0.904762,-1.0,0.0
204,1,-1.0,-1.0,-1.0,0.49,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,0.768312,0.0,-1.0,0.798535,-1.0,0.0


## Summary

Table of metrics used...

In [71]:
columns_metadata_dict['similarity_metrics']

{'coordinate_E': LCSStr({'qval': 1, 'external': True}),
 'coordinate_N': LCSStr({'qval': 1, 'external': True}),
 'corporate_110': LCSStr({'qval': 1, 'external': True}),
 'corporate_710': LCSStr({'qval': 1, 'external': True}),
 'doi': Identity({'qval': 1, 'external': True}),
 'edition': Jaccard({'qval': 1, 'as_set': False, 'external': True}),
 'exactDate': Hamming({'qval': 1, 'test_func': <function Base._ident at 0x1179020d0>, 'truncate': False, 'external': True}),
 'format_prefix': Identity({'qval': 1, 'external': True}),
 'format_postfix': Jaccard({'qval': 2, 'as_set': False, 'external': True}),
 'isbn': Identity({'qval': 1, 'external': True}),
 'musicid': LCSStr({'qval': 1, 'external': True}),
 'part': Jaro({'qval': 1, 'long_tolerance': False, 'winklerize': False, 'external': True}),
 'person_100': StrCmp95({'long_strings': False, 'external': True}),
 'person_700': StrCmp95({'long_strings': False, 'external': True}),
 'person_245c': Jaro({'qval': 1, 'long_tolerance': False, 'winkleri

## Feature Matrix and Target Vector Handover

To hand over the result of this chapter, the DataFrame is saved to a pickle file that the following chapters, [Decision Tree Model](./5_DecisionTreeModel.ipynb), ..., read as their input file.

In [72]:
import pickle as pk

# Binary intermediary file
with open(os.path.join(path_goldstandard,
                       'labelled_feature_matrix.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)
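
A subsequent chapter could restore the file as sketched below. Because the real file lives in `./daten_goldstandard`, the example first writes a small stand-in object to a temporary directory. The stand-in dictionary and the temporary path are assumptions made so that the snippet runs on its own.

```python
import os
import pickle as pk
import tempfile

# Stand-in for df_feature_base (assumption: any picklable object behaves alike)
stand_in = {'duplicates': [1, 0], 'volumes_delta': [1.0, -1.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'labelled_feature_matrix.pkl')

    # Write the object, as done in the cell above
    with open(file_path, 'wb') as output_file:
        pk.dump(stand_in, output_file)

    # Read it back, as a later chapter might do
    with open(file_path, 'rb') as input_file:
        restored = pk.load(input_file)

print(restored == stand_in)
```

For the real DataFrame, `pd.read_pickle`, as used in section Data Takeover, is an alternative to `pk.load`.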