# Feature Matrix Generation

This chapter introduces similarity metrics for string comparison. For each attribute of the DataFrame built in the previous chapters, the metric used to calculate its similarity is chosen. The result of this chapter is the feature matrix.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
- [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [corporate](#corporate)
    - [coordinate](#coordinate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [musicid](#musicid)
    - [part](#part)
    - [person_100](#person_100)
    - [person_700](#person_700)
    - [person_245c](#person_245c)
    - [pubinit](#pubinit)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Feature Base](#Feature-Base)

## Data Takeover

Swissbib's raw goldstandard data has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, this data is read in for further processing into the feature matrix and target vector used in the subsequent machine learning model chapters.

In [1]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.
2,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.
4,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.


In [2]:
print('Number of rows labelled as duplicates', len(df_feature_base[
    df_feature_base.duplicates==1]))
print('Number of rows labelled as uniques', len(df_feature_base[
    df_feature_base.duplicates==0]))
print('Total number of rows in DataFrame', df_feature_base.shape[0],
      'number of columns', df_feature_base.shape[1])

Number of rows labelled as duplicates 1473
Number of rows labelled as uniques 259260
Total number of rows in DataFrame 260733 number of columns 41


In [3]:
print('Share of uniques (0) and duplicates (1) in units of [%]')
print(100*df_feature_base.duplicates.value_counts(normalize=True))

Share of uniques (0) and duplicates (1) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


The share of duplicate pairs in the full training data is below 0.6%. This is very low and will affect the training of the model: during training, the model will see far more pairs of unique records ($\texttt{duplicates}=0$) than pairs of duplicates ($\texttt{duplicates}=1$). Undersampling the unique pairs may therefore be necessary; this will be decided during model training.
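
The undersampling idea can be sketched as follows. This is a minimal illustration on a plain list of labels, not the procedure used later in the project; the function name `undersample_indices`, the parameter `ratio` (number of unique pairs kept per duplicate pair) and the fixed seed are assumptions made for this example.

```python
import random


def undersample_indices(labels, ratio=1.0, seed=42):
    """Keep all duplicate rows (label 1) and a random subset of unique rows (label 0).

    ratio is a hypothetical parameter: number of unique rows kept per duplicate row.
    """
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)


# Toy label vector with a strong imbalance, similar in spirit to the data above.
labels = [1] * 5 + [0] * 995
kept = undersample_indices(labels, ratio=2.0)
share_of_duplicates = sum(labels[i] for i in kept) / len(kept)
print(len(kept), round(share_of_duplicates, 3))
```

With a ratio of 2, the reduced sample holds one duplicate pair for every two unique pairs. Whether and how strongly to undersample remains a modelling decision.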

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

The attributes used for pair comparison may contain values of poor quality. Poor quality originates in the way the data was entered at the source: manual data entry may suffer from typing errors, and automatically scanned data may suffer from deficiencies of the scanned material or of the optical character recognition (OCR) algorithm. The basic step of a deduplication process is to estimate how likely it is that the two strings of a pair are duplicates. This is done by calculating a similarity value between the two strings rather than by using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the values are duplicates.

The term similarity is closely related to the distance between two values of an attribute. Mathematically, a distance is described by a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangle inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.
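
The four requirements can be checked empirically on a small sample of objects. The sketch below is only an illustration: passing the check on a finite sample does not prove that a function is a metric, but a single failure shows that it is not. The helper names `hamming_distance` and `satisfies_metric_axioms` are chosen for this example.

```python
from itertools import product


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def satisfies_metric_axioms(dist, objects):
    """Check the four requirements on every combination of the given objects."""
    for o_i, o_j, o_k in product(objects, repeat=3):
        if dist(o_i, o_i) != 0:  # distance to itself is zero
            return False
        if dist(o_i, o_j) < 0:  # non-negativity
            return False
        if dist(o_i, o_j) != dist(o_j, o_i):  # symmetry
            return False
        if dist(o_i, o_j) > dist(o_i, o_k) + dist(o_k, o_j):  # triangle inequality
            return False
    return True


def squared_difference(x, y):
    """A dissimilarity that is symmetric but violates the triangle inequality."""
    return (x - y) ** 2


date_strings = ["20091990", "20091991", "19969999", "19989999"]
hamming_ok = satisfies_metric_axioms(hamming_distance, date_strings)
squared_ok = satisfies_metric_axioms(squared_difference, [0, 1, 2])
print(hamming_ok, squared_ok)
```

The squared difference fails because going from 0 to 2 directly costs 4, whereas going through 1 costs only 1 + 1 = 2.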

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$ by calculating $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, if the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attribute values, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends on the type of data compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends on the type of data compared.

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
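
As a worked example of the two formulas above, the sketch below compares two records attribute by attribute. The function name `mismatch_dissimilarity` is hypothetical, and the choice of four attributes is made only for illustration; the attribute values are taken from the sample rows shown at the beginning of this chapter.

```python
def mismatch_dissimilarity(record_a, record_b):
    """Compute d = (p - m) / p over the attributes both records share."""
    attributes = sorted(set(record_a) & set(record_b))
    p = len(attributes)  # total number of attributes compared
    if p == 0:
        raise ValueError("records have no attributes in common")
    m = sum(record_a[name] == record_b[name] for name in attributes)  # exact matches
    return (p - m) / p


record_a = {
    "ttlfull_245": "emma, roman",
    "pubinit": "reclam jun.",
    "volumes": "600 s.",
    "exactDate": "2009uuuu",
}
record_b = {
    "ttlfull_245": "emma",
    "pubinit": "reclam",
    "volumes": "600 s.",
    "exactDate": "2009uuuu",
}

d = mismatch_dissimilarity(record_a, record_b)
s = 1 - d
print(d, s)
```

Here $p=4$ and $m=2$, so $d=0.5$ and $s=0.5$. The exact comparison treats "reclam jun." and "reclam" as a complete mismatch, which is the motivation for the approximate string similarities used in the rest of this chapter.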

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although a similarity function corresponds to the mathematical concept of a distance function, not all known and implemented similarity functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangle inequality. The best similarity function for a string pair will therefore be chosen according to how well it serves the purpose at hand. In this capstone project, the purpose is to contribute to predicting whether a pair of records is a pair of duplicates or of different records.

## Library TextDistance

An internet search on string distance calculation with Python has revealed the libraries [[StSi](./A_References.ipynb#stsi)] and [[TeDi](./A_References.ipynb#tedi)] as well as separate code snippets for individual algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the text distance library [[TeDi](./A_References.ipynb#tedi)] has been chosen for this capstone project. The decision is based on the library's GitHub statistics, namely its number of stars and the date of its latest pull requests, which indicate its popularity and maintenance activity. A look at the API shows that the library offers a complete implementation (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and is easy to use.

In [4]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance



For usage of the library, see the documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, the method $\texttt{.normalized}\_\texttt{similarity()}$ of an instantiated textdistance algorithm object will be used.

In [5]:
import textdistance as tedi

With the code line above, the library is imported for use in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), the effects of the library's similarity metrics are compared for a better understanding of their specific behaviour. This comparison is the basis for deciding on the best available similarity metric for each attribute pair.

## Similarity Metrics on Attribute Level

This section documents and implements the choice of similarity metric for each attribute of the raw data, based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). Each implementation is applied to a pair of attributes of two different records and results in a new attribute of the final feature matrix. The code file [data_preparation_funcs.py](./data_preparation_funcs.py) provides a general function $\texttt{build_delta_feature}$ that transforms two attributes into a feature attribute holding their similarity value.
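
The transformation of an attribute pair into a similarity feature can be sketched as follows. The source of $\texttt{build_delta_feature}$ is not shown in this chapter, so the code below is not that function: `add_delta_feature` is a hypothetical stand-in that works on plain dictionaries instead of a DataFrame and takes an ordinary Python function instead of a textdistance object. It only mirrors the naming pattern visible in the outputs of this chapter, where the columns `<attribute>_x` and `<attribute>_y` yield a new column `<attribute>_delta`.

```python
def add_delta_feature(rows, attribute, similarity):
    """Return copies of the rows extended by '<attribute>_delta'.

    Hypothetical stand-in for illustration; the similarity argument is any
    function mapping two attribute values to a number between 0 and 1.
    """
    extended = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        new_row[f"{attribute}_delta"] = similarity(
            row[f"{attribute}_x"], row[f"{attribute}_y"]
        )
        extended.append(new_row)
    return extended


def identity_similarity(left, right):
    """Exact comparison: 1.0 for equal values, 0.0 otherwise."""
    return 1.0 if left == right else 0.0


rows = [
    {"format_prefix_x": "bk", "format_prefix_y": "bk"},
    {"format_prefix_x": "bk", "format_prefix_y": "mp"},
]
rows_with_delta = add_delta_feature(rows, "format_prefix", identity_similarity)
print([row["format_prefix_delta"] for row in rows_with_delta])
```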

In [6]:
import data_preparation_funcs as dpf

### corporate

In [7]:
corporate_algorithm = tedi.Jaro()

In [8]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_110', corporate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_710', corporate_algorithm)

In [9]:
dpf.show_samples_interval(df_feature_base, 'corporate_110', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'corporate_710', 0.9, 1.0, 10)

Unnamed: 0,duplicates,corporate_110_delta,corporate_110_x,corporate_110_y
196627,0,0.0,eidgenössisches topographisches bureau,
196537,0,0.0,eidgenössisches topographisches bureau,
196404,0,0.0,eidgenössisches topographisches bureau,
196426,0,0.0,eidgenössisches topographisches bureau,
196631,0,0.0,eidgenössisches topographisches bureau,
196319,0,0.0,eidgenössisches topographisches bureau,
196657,0,0.0,eidgenössisches topographisches bureau,
196410,0,0.0,eidgenössisches topographisches bureau,
196087,0,0.0,eidgenössisches topographisches bureau,
196463,0,0.0,eidgenössisches topographisches bureau,


0.0 < corporate_110_delta < 0.1


Unnamed: 0,duplicates,corporate_710_delta,corporate_710_x,corporate_710_y
259059,0,1.0,,
222292,0,1.0,,
195233,0,1.0,,
252141,0,1.0,,
193646,0,1.0,,
162393,0,1.0,,
82882,0,1.0,,
144637,0,1.0,,
98297,0,1.0,,
143413,0,1.0,,


0.9 < corporate_710_delta < 1.0


### coordinate

In [10]:
coordinate_algorithm = tedi.Jaro()

In [11]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'coordinate_E', coordinate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'coordinate_N', coordinate_algorithm)

In [12]:
df_feature_base['coordinate_E_delta'].unique(), df_feature_base['coordinate_N_delta'].unique()

(array([1.        , 0.        , 0.91666667, 0.77777778, 0.58333333,
        0.75      , 0.86904762, 0.66666667, 0.68333333]),
 array([1.        , 0.        , 0.91666667, 0.66666667, 0.83333333,
        0.77777778, 0.75      , 0.68333333]))

In [13]:
dpf.show_samples_interval(df_feature_base, 'coordinate_E', 0.5, 0.7, 10)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
182192,0,0.583333,e0080855,e0074147
95173,0,0.666667,e0080851,e0074147
77016,0,0.666667,e0060811,e0074147
94856,0,0.666667,e0080851,e0074147
54072,0,0.583333,e0080855,e0074147
183997,0,0.583333,e0074147,e0080855
68355,0,0.583333,e0080855,e0074147
54097,0,0.583333,e0080855,e0074147
53731,0,0.583333,e0080855,e0074147
68353,0,0.583333,e0080855,e0074147


0.5 < coordinate_E_delta < 0.7


### doi

Swissbib uses an explicit $\texttt{doi}$ and an explicit $\texttt{ismn}$ attribute for its deduplication implementation. As these dedicated identifiers are missing in Swissbib's data extract, cf. chapter [Data Analysis](./1_DataAnalysis.ipynb), an alternative comparison logic is chosen for this attribute. Each string element of a $\texttt{doi}$ list is compared separately with each string element of the $\texttt{doi}$ list it is compared against. If two bibliographic units have at least one element in common, this is interpreted as a strong indicator for duplicates.

A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$ (the same logic will be used for attribute $\texttt{isbn}$, see below) has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. The Identity metric is used for string comparison, calculating a similarity value of 1 or 0 for each list element pair. For normalisation reasons, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty a value of 1.0 is returned. If only one list is empty a value of 0.0 is returned.
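
The list comparison described above can be sketched in plain Python. This is an illustration of the stated rules, not the code of the project's comparison function; the name `list_identity_similarity` is hypothetical, and the sketch assumes that a list does not contain the same identifier twice.

```python
def list_identity_similarity(left, right):
    """Compare two identifier lists element by element with an exact match."""
    if not left and not right:
        return 1.0  # both lists empty
    if not left or not right:
        return 0.0  # only one list empty
    # Identity metric: each element pair contributes 1 if equal, else 0.
    matches = sum(1 for a in left for b in right if a == b)
    # Normalise by the number of elements of the smaller list.
    return matches / min(len(left), len(right))


print(list_identity_similarity([], []))
print(list_identity_similarity([], ["10.5169/seals-376961"]))
print(list_identity_similarity(
    ["10.5167/uzh-53042", "10.1093/cid/cir669"],
    ["21998284", "10.1093/cid/cir669"],
))
print(list_identity_similarity(
    ["10.1093/cid/cir669"],
    ["10.5167/uzh-53042", "10.1093/cid/cir669"],
))
```

Dividing by the length of the smaller list means that a single-element list fully contained in a longer list reaches a similarity of 1.0, while two two-element lists sharing one element reach 0.5.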

In [14]:
isbn_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'doi',
                                          isbn_algorithm)

df_feature_base['doi_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{doi_delta}$.

In [15]:
for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
187578,0,1.0,[],[]
78985,0,1.0,[],[]
202328,0,1.0,[],[]
34585,0,1.0,[],[]
195566,0,1.0,[],[]
176490,0,1.0,[],[]
224935,0,1.0,[],[]
170047,0,1.0,[],[]
182635,0,1.0,[],[]
65739,0,1.0,[],[]


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
92748,0,0.0,[],[10.5169/seals-376961]
203596,0,0.0,[],[10.5169/seals-376645]
189988,0,0.0,[10.1055/b-002-26639],[]
194246,0,0.0,"[10.5451/unibas-006503313, urn:nbn:ch:bel-bau-...",[]
207342,0,0.0,[10.1093/cid/ciu795],[]
200060,0,0.0,[10.1007/978-3-642-41698-9],[]
177961,0,0.0,[],[10.5169/seals-376396]
42090,0,0.0,[],[10.5169/seals-377028]
67713,0,0.0,[],[10.5169/seals-376850]
120772,0,0.0,[],[10.5169/seals-377362]


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
153913,0,0.5,"[10.5167/uzh-53042, 10.1093/cid/cir669]","[21998284, 10.1093/cid/cir669]"


doi_delta = 0.5


In [16]:
# Let's have a look at some non-empty doi elements
df_doi_with_element = df_feature_base[df_feature_base.doi_x.apply(lambda x : len(x) > 0)]

for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_doi_with_element[df_doi_with_element['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_doi_with_element, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
1347,1,1.0,"[10.5451/unibas-006499413, urn:nbn:ch:bel-bau-...","[10.5451/unibas-006499413, urn:nbn:ch:bel-bau-..."
1188,1,1.0,[10.1093/cid/cir669],"[10.5167/uzh-53042, 10.1093/cid/cir669]"
1264,1,1.0,[10.1093/ndt/gft319],[10.1093/ndt/gft319]
1193,1,1.0,[10.1093/cid/cir669],[10.1093/cid/cir669]
1191,1,1.0,[10.1093/cid/cir669],"[10.5167/uzh-53042, 10.1093/cid/cir669]"
1252,1,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
1469,1,1.0,[10.1055/b-005-143650],[10.1055/b-005-143650]
1251,1,1.0,[10.1055/b-002-26639],[10.1055/b-002-26639]
200163,0,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]
162019,0,1.0,[10.1007/978-3-642-41698-9],[10.1007/978-3-642-41698-9]


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
255261,0,0.0,[10.1055/b-005-143650],[]
162098,0,0.0,[10.1007/978-3-642-41698-9],[]
255633,0,0.0,[10.1055/b-005-143650],[]
167012,0,0.0,[10.1093/cid/ciu795],[]
194075,0,0.0,"[10.5451/unibas-006503313, urn:nbn:ch:bel-bau-...",[10.5169/seals-377218]
166616,0,0.0,[10.1007/978-3-642-41698-9],[10.5167/uzh-57152]
165647,0,0.0,[10.1093/cid/cir669],[]
166256,0,0.0,[10.1093/ndt/gft319],[]
199797,0,0.0,[10.1007/978-3-642-41698-9],[]
166157,0,0.0,[10.1093/ndt/gft319],[]


doi_delta = 0.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
153913,0,0.5,"[10.5167/uzh-53042, 10.1093/cid/cir669]","[21998284, 10.1093/cid/cir669]"


doi_delta = 0.5


### edition

The edition statement is a string value which may have several words. A Jaccard similarity is tried for this attribute.

In [17]:
edition_algorithm = tedi.Jaccard()

In [18]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'edition', edition_algorithm)

In [19]:
df_feature_base.edition_delta.unique()[:30], len(df_feature_base.edition_delta.unique())

(array([1.        , 0.        , 0.95348837, 0.46511628, 0.62222222,
        0.66666667, 0.5       , 0.65909091, 0.62790698, 0.67741935,
        0.68965517, 0.84375   , 0.7       , 0.875     , 0.92857143,
        0.56521739, 0.54166667, 0.72727273, 0.96153846, 0.57142857,
        0.54761905, 0.74193548, 0.09375   , 0.17647059, 0.24      ,
        0.10526316, 0.75      , 0.06666667, 0.13636364, 0.28571429]), 802)

The comparison results in a large number of distinct similarity values (802) for the goldstandard data set. Below, some examples are shown.

In [20]:
dpf.show_samples_interval(df_feature_base, 'edition', 0.9, 1.0, 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
101175,0,1.0,,
253763,0,1.0,,
77200,0,1.0,,
230659,0,1.0,,
84547,0,1.0,,
25515,0,1.0,,
156448,0,1.0,,
126221,0,1.0,,
100963,0,1.0,,
116235,0,1.0,,


0.9 < edition_delta < 1.0


In [21]:
dpf.show_samples_interval(df_feature_base, 'edition', 0.0, 0.1, 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
90285,0,0.0,,3. Aufl.
77084,0,0.0,,"Nouvelle éd., [verschiedene Auflagen]"
63404,0,0.0,,"Nouv. éd., [vollständig überarb. Neuaufl.]"
184831,0,0.0,"8., vollständig überarb. und akutalisierte Aufl",
216204,0,0.0,,"8., vollständig überarbeitete und aktualisiert..."
86233,0,0.0,,Nachträge
133941,0,0.0,,Nachträge
27977,0,0.0,"5. Aufl., 43.-46. Tsd., neu durchges. Aufl",
94067,0,0.0,,[Nouv. éd.]
126009,0,0.0,,"Nouv. éd., [2. Aufl.]"


0.0 < edition_delta < 0.1


### exactDate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{exactDate}$ holds a year number stored in its first four digits. The letter 'u' is used as a placeholder for an unknown digit. In addition, the attribute may hold month and day information or a second year in its second four digits.

The attribute will be kept as a string and will not be transformed into an integer. The feature attribute of the record pair to be compared will be calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity will be stored in a new attribute $\texttt{exactDate}\_\texttt{delta}$ which will be taken for the model calculation.
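
The modified Hamming comparison can be sketched for a single pair of strings without pandas or textdistance. The function names `hamming_similarity` and `exact_date_similarity` are chosen for this example; the sketch assumes the eight-character format of $\texttt{exactDate}$ and uses a plain position-wise match ratio as a stand-in for the library's normalised Hamming similarity.

```python
def hamming_similarity(left, right):
    """Share of positions at which two strings hold the same character."""
    length = max(len(left), len(right))
    if length == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(left, right))
    return matches / length


def exact_date_similarity(date_x, date_y):
    """Hamming similarity with a partial credit of 1/16 per unknown digit."""
    # Mask the placeholder on one side so that 'u' never matches 'u'.
    masked_x = date_x.replace("u", "a")
    base = hamming_similarity(masked_x, date_y)
    # Credit the larger number of unknown digits of the two strings.
    bonus = max(masked_x.count("a"), date_y.count("u")) / 16
    return base + bonus


print(exact_date_similarity("20091990", "20091990"))
print(exact_date_similarity("20091990", "20091991"))
print(exact_date_similarity("2005uuuu", "2005uuuu"))
print(exact_date_similarity("2014uuuu", "1994uuuu"))
```

An unknown digit is thereby worth half a matching digit: two identical year strings with four unknown digits each reach 4/8 + 4/16 = 0.75 instead of 1.0.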

In [22]:
# Replace placeholder 'u' with 'a' in the left-hand string only.
#  A placeholder position then never matches its counterpart and
#  contributes 0 to the Hamming similarity.
df_feature_base['exactDate_x'] = df_feature_base.exactDate_x.str.replace('u', 'a')

# Compute Hamming similarity for the exactDate string pair.
exactDate_algorithm = tedi.Hamming()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'exactDate', exactDate_algorithm)

# Add 1/16 to the Hamming similarity for every placeholder position,
#  counting the larger number of placeholders of the two strings of a pair.
df_feature_base['exactDate_delta'] = df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']].apply(
    lambda x : x['exactDate_delta'] + 
    max(x['exactDate_x'].count('a'), x['exactDate_y'].count('u'))/16, axis=1
)

In [23]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']].sample(n=10)

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
233623,2014aaaa,1994uuuu,0.375
186832,2016aaaa,2006uuuu,0.625
19970,1982aaaa,1998uuuu,0.5
14727,1991aaaa,19881862,0.5
137376,1988aaaa,1994uuuu,0.5
163314,19702006,1988uuuu,0.5
258456,2005aaaa,2014uuuu,0.5
225358,2001aaaa,2010uuuu,0.5
1867,aaaaaaaa,1999uuuu,0.5
87147,1991aaaa,19669999,0.5


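All pairs of equal strings result in a value of 1. Because the placeholder 'u' has been replaced in $\texttt{exactDate}\_\texttt{x}$ only, equal strings here are strings without unknown digits.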

In [24]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']][
    df_feature_base.exactDate_x == df_feature_base.exactDate_y
].sort_values('exactDate_delta', ascending=False).head()

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
159,20022000,20022000,1.0
843,19849999,19849999,1.0
845,19969999,19969999,1.0
846,19969999,19969999,1.0
847,19969999,19969999,1.0


Thirteen different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [25]:
import numpy as np

exactDate_deltas = np.sort(df_feature_base.exactDate_delta.unique())
exactDate_deltas

array([0.    , 0.125 , 0.25  , 0.3125, 0.375 , 0.4375, 0.5   , 0.5625,
       0.625 , 0.6875, 0.75  , 0.875 , 1.    ])

In [26]:
sample_size = 5

for i in exactDate_deltas :
    dpf.show_samples_distinct(df_feature_base, 'exactDate', i, sample_size)
    print(f'exactDate_delta = {i}')

Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
206958,0,0.0,20150201,19949999
135787,0,0.0,19949999,20012002
207056,0,0.0,20150201,19941995
198203,0,0.0,20091990,19282011
26373,0,0.0,20092005,19989999


exactDate_delta = 0.0


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
200404,0,0.125,20151475,18971989
123230,0,0.125,19829999,15501850
15652,0,0.125,19911990,20022010
215158,0,0.125,20159999,17931797
31415,0,0.125,20071990,19881862


exactDate_delta = 0.125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
197516,0,0.25,2016aaaa,1995uuuu
29139,0,0.25,1999aaaa,2017uuuu
204200,0,0.25,2011aaaa,1996uuuu
230442,0,0.25,1987aaaa,2014uuuu
22252,0,0.25,2005aaaa,19942008


exactDate_delta = 0.25


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
169938,0,0.3125,1981aaaa,200uuuuu
257497,0,0.3125,2000aaaa,189uuuuu
99610,0,0.3125,1978aaaa,200uuuuu
257633,0,0.3125,2000aaaa,181uuuuu
177767,0,0.3125,2005aaaa,193uuuuu


exactDate_delta = 0.3125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
69365,0,0.375,1991aaaa,1765uuuu
173016,0,0.375,20159999,1475uuuu
49285,0,0.375,1982aaaa,1896uuuu
161834,0,0.375,2014aaaa,1994uuuu
90855,0,0.375,1959aaaa,1763uuuu


exactDate_delta = 0.375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
88862,0,0.4375,170aaaaa,2006uuuu
123397,0,0.4375,1987aaaa,189uuuuu
89171,0,0.4375,170aaaaa,1995uuuu
257210,0,0.4375,1862aaaa,192uuuuu
122801,0,0.4375,19829999,189uuuuu


exactDate_delta = 0.4375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
194670,0,0.5,2016aaaa,2002uuuu
19806,0,0.5,1982aaaa,uuuuuuuu
140126,0,0.5,1960aaaa,1959uuuu
133584,0,0.5,1932aaaa,1988uuuu
80288,0,0.5,1998aaaa,19571963


exactDate_delta = 0.5


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
230986,0,0.5625,1987aaaa,192uuuuu
251257,0,0.5625,183aaaaa,1880uuuu
184838,0,0.5625,2011aaaa,200uuuuu
196705,0,0.5625,1862aaaa,189uuuuu
34131,0,0.5625,1989aaaa,193uuuuu


exactDate_delta = 0.5625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
75767,0,0.625,2001aaaa,2007uuuu
113937,0,0.625,19702006,1940uuuu
46269,0,0.625,2008aaaa,2005uuuu
90041,0,0.625,1998aaaa,1995uuuu
123123,0,0.625,19829999,1992uuuu


exactDate_delta = 0.625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
29282,0,0.6875,20071990,200uuuuu
54314,0,0.6875,2007aaaa,200uuuuu
182454,0,0.6875,2001aaaa,200uuuuu
76366,0,0.6875,2005aaaa,200uuuuu
169342,0,0.6875,2007aaaa,200uuuuu


exactDate_delta = 0.6875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
446,1,0.75,1998aaaa,1998uuuu
178292,0,0.75,2007aaaa,2007uuuu
1435,1,0.75,1763aaaa,1763uuuu
136377,0,0.75,2005aaaa,2005uuuu
349,1,0.75,2005aaaa,2005uuuu


exactDate_delta = 0.75


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
788,1,0.875,20091990,20091991
88386,0,0.875,19969999,19989999
26709,0,0.875,20092005,20002005
227703,0,0.875,19829999,19819999
88249,0,0.875,19969999,19uu9999


exactDate_delta = 0.875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
791,1,1.0,20091990,20091990
933,1,1.0,19739999,19739999
165,1,1.0,19791999,19791999
654,1,1.0,20092005,20092005
472,1,1.0,19201929,19201929


exactDate_delta = 1.0


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split into two new attributes $\texttt{format_prefix}$ and $\texttt{format_postfix}$, which will be compared with different similarity metrics.

- As the quality of $\texttt{format_prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format_postfix}$, a q-gram based comparison will be chosen.
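
A q-gram based Jaccard comparison can be sketched as follows. The sketch counts repeated q-grams as a multiset, which is an assumption about how the library treats repetitions; the function names and the six-character example codes are illustrative only.

```python
from collections import Counter


def qgrams(text, q=2):
    """Split a string into its overlapping substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def jaccard_qgram_similarity(left, right, q=2):
    """Size of the q-gram intersection divided by the size of the q-gram union."""
    left_counts = Counter(qgrams(left, q))
    right_counts = Counter(qgrams(right, q))
    intersection = sum((left_counts & right_counts).values())  # minimum counts
    union = sum((left_counts | right_counts).values())  # maximum counts
    if union == 0:
        return 1.0  # neither string is long enough to form a q-gram
    return intersection / union


print(qgrams("020047"))
print(jaccard_qgram_similarity("020047", "020400"))
print(jaccard_qgram_similarity("040100", "020053"))
```

The codes "020047" and "020400" share the four bigrams 02, 20, 00 and 04 out of six distinct bigrams in total, giving 4/6. In contrast to the identity comparison, codes that differ only in part still receive a similarity above zero.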

In [27]:
format_prefix_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_prefix',
                                          format_prefix_algorithm)

format_postfix_algorithm = tedi.Jaccard(qval=2)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_postfix',
                                          format_postfix_algorithm)

In [28]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_prefix', i)
    print(f'format_prefix_delta = {i}')

Unnamed: 0,duplicates,format_prefix_delta,format_prefix_x,format_prefix_y
98762,0,0.0,bk,mp
160766,0,0.0,mu,bk
114246,0,0.0,mu,mp
168480,0,0.0,bk,mp
211906,0,0.0,bk,mu


format_prefix_delta = 0.0


In [29]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_postfix', i)
    print(f'format_postfix_delta = {i}')

Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
67053,0,0.428571,10200,20000
8702,0,0.428571,10300,10053
87415,0,0.428571,40100,10300
200002,0,0.428571,20053,10200
93220,0,0.428571,20000,30000


format_postfix_delta = 0.4285714285714286


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
10093,0,0.111111,40100,20053
77148,0,0.111111,10700,20000
59010,0,0.111111,20000,10300
76107,0,0.111111,20000,10300
156220,0,0.111111,30053,20000


format_postfix_delta = 0.11111111111111116


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
198337,0,0.25,10300,30653
20718,0,0.25,20000,20353
17356,0,0.25,20000,20353
79723,0,0.25,20000,20353
238259,0,0.25,20000,20353


format_postfix_delta = 0.25


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
19465,0,0.0,10347,20000
52913,0,0.0,20000,30653
214430,0,0.0,10100,30653
234299,0,0.0,40000,20353
129190,0,0.0,10100,20353


format_postfix_delta = 0.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
40548,0,1.0,20000,20000
162098,0,1.0,20053,20053
92026,0,1.0,20000,20000
42728,0,1.0,10300,10300
133166,0,1.0,20000,20000


format_postfix_delta = 1.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
90454,0,0.666667,20047,20400
90470,0,0.666667,20047,20400
26102,0,0.666667,20047,20400
244959,0,0.666667,20400,20047
26086,0,0.666667,20047,20400


format_postfix_delta = 0.6666666666666666


### isbn

Swissbib uses each string element of the $\texttt{isbn}$ list separately for comparing with each string element of its comparison $\texttt{isbn}$ list. If two bibliographic units hold at least one element in common, this is interpreted as a strong indicator for duplicates [[WiCo2001](./A_References.ipynb#wico2001)].

This strict logic is used in a modified way in this capstone project. A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$ has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. Following Swissbib's implementation, the Identity metric is used for string comparison, calculating a similarity value of 1 or 0 for each list element pair. For normalisation, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty, a value of 1.0 is returned. If only one list is empty, a value of 0.0 is returned.

In [30]:
isbn_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'isbn',
                                          isbn_algorithm)

df_feature_base['isbn_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{isbn_delta}$.

In [31]:
for isbn_delta_value in df_feature_base['isbn_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['isbn_delta']==isbn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'isbn', isbn_delta_value, number_of_max_samples)
    print(f'isbn_delta = {isbn_delta_value}')

Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
15806,0,1.0,[],[]
201598,0,1.0,[],[]
226628,0,1.0,[],[]
3940,0,1.0,[],[]
227224,0,1.0,[],[]
101211,0,1.0,[],[]
91840,0,1.0,[],[]
192162,0,1.0,[],[]
130656,0,1.0,[],[]
169079,0,1.0,[],[]


isbn_delta = 1.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
37899,0,0.0,[],"[978-3-643-12370-1, 3-643-12370-1]"
207022,0,0.0,[],[978-3-13-127285-0]
205847,0,0.0,"[978-3-290-20138-8, 3-290-20138-4]",[]
223476,0,0.0,[978-0-7294-1156-1],[978-3-598-31500-8 (print)]
107245,0,0.0,[0-375-75742-2],"[978-3-598-31497-1 (print), 978-3-11-097083-8]"
183249,0,0.0,[0-87834-101-3],[2-08-070552-0]
56337,0,0.0,[978-3-7255-6535-1],[]
193352,0,0.0,"[3-495-48796-4, 978-3-495-48796-9]",[]
33842,0,0.0,[],[1013-0640]
32356,0,0.0,[],[3-598-31514-7]


isbn_delta = 0.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
1201,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
161989,0,0.5,"[978-3-642-41697-2, 978-3-642-41698-9 (ebook)]","[978-3-642-41697-2, 3-642-41697-7]"
1205,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
1195,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
1202,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1199,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1210,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"


isbn_delta = 0.5


### musicid

In [32]:
musicid_algorithm = tedi.Jaccard()

In [33]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'musicid', musicid_algorithm)

In [34]:
df_feature_base['musicid_delta'].unique()

array([1.        , 0.        , 0.42857143, 0.25      , 0.40909091,
       0.41176471, 0.38888889, 0.53333333, 0.5       , 0.7       ,
       0.38461538, 0.88888889, 0.22222222, 0.35714286, 0.44117647,
       0.76470588, 0.31034483, 0.44444444, 0.8       , 0.18181818,
       0.08108108, 0.09677419, 0.05555556, 0.10344828, 0.04761905,
       0.1       , 0.11111111, 0.1025641 , 0.07692308, 0.03448276,
       0.14285714, 0.05882353, 0.30769231, 0.16      , 0.29411765,
       0.20833333, 0.17241379, 0.08571429, 0.23076923, 0.36363636,
       0.33333333, 0.28571429, 0.19047619, 0.27272727, 0.06451613,
       0.06666667, 0.02380952, 0.125     , 0.16666667, 0.04651163,
       0.17647059, 0.2       , 0.07142857, 0.15384615, 0.05263158,
       0.09090909, 0.11764706, 0.21428571, 0.15789474, 0.08333333,
       0.08695652, 0.03125   , 0.04878049, 0.12      , 0.71428571,
       0.3       , 0.1875    , 0.30434783, 0.025     , 0.35294118,
       0.09375   , 0.05405405, 0.13043478, 0.05128205, 0.13333

In [35]:
dpf.show_samples_interval(df_feature_base, 'musicid', 0.0, 0.1, 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
160894,0,0.0,U.E. 245,
12223,0,0.0,"LC 0171433 210,2",
198197,0,0.0,502430,
14376,0,0.0,422 543-2,
30579,0,0.0,,491.0
97902,0,0.0,BA 4553,
6739,0,0.0,502023,491.0
35653,0,0.0,Philips 422 543-2,
22381,0,0.0,Frenetic 99036,
139275,0,0.0,99036,


0.0 < musicid_delta < 0.1


In [36]:
dpf.show_samples_interval(df_feature_base, 'musicid', 0.6, 0.7, 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
75685,0,0.636364,BA 4553a,BA 4553-90
74673,0,0.666667,EP 10425,10425EP 4697
115201,0,0.692308,10425EP 71,10425EP 4697
134689,0,0.636364,BA 4553a,BA 4553-90
126632,0,0.666667,E.P. 10425,EP 1215470
216937,0,0.636364,BA 4553a,BA 4553-90
249717,0,0.636364,BA 4553a,BA 4553-90
63345,0,0.65,422 ; Kalmus Miniature Scores,Kalmus miniature orchestra scores 422
74670,0,0.666667,EP 10425,10425EP 4697
52441,0,0.7,BA 4553,BA 4553-90


0.6 < musicid_delta < 0.7


### part

In [37]:
part_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'part', part_algorithm)

In [38]:
dpf.show_samples_interval(df_feature_base, 'part', 0.6, 0.7, 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
237898,0,0.688889,2,"bl. 23, 23,1905"
173165,0,0.633333,ed. 6,"bd. 27, 27"
14110,0,0.694444,2620,291(2012)
186457,0,0.666667,n. 1,cd 1
109634,0,0.677778,"nr. 2620, bd. 5","bang 50, bang 50"
246421,0,0.603175,20c,"nr. 2620, 2620"
51277,0,0.611111,"bd. 57, 57","nr. 2620, bd. 5"
252231,0,0.633041,nr.313(2017:august),nr. 7633
214057,0,0.653391,"no 912, 912","bl. 23, 23,1899"
241715,0,0.611111,2620,no. 20


0.6 < part_delta < 0.7


In [39]:
dpf.show_samples_interval(df_feature_base, 'part', 0.8, 0.9, 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
102,1,0.866667,"bd. 57, 57",bd. 57
78711,0,0.822222,bd. 19,bd. 4
23152,0,0.888889,nr. 12,n. 1
147239,0,0.857143,"nr. 2620, 2620",nr. 2620
1262,1,0.884242,"28/10(2013), 2421-2431","28/10(2013-10), 2421-2431"
195685,0,0.866667,"bd. 57, 57",bd. 57
257526,0,0.833333,"bd. 8008, 8008",bd 8008
672,1,0.844444,"bd. 8008., 8008",bd. 8008
54058,0,0.843137,"bl. 285, 285,1963","bl. 285, 285,2000"
179053,0,0.833333,71,7


0.8 < part_delta < 0.9


### person_100

In [40]:
person_100_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_100', person_100_algorithm)

In [41]:
dpf.show_samples_interval(df_feature_base, 'person_100', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_100', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
177949,0,0.0,,mozartwolfgang amadeus1756-1791(de-588)118584596
16981,0,0.0,mozartwolfgang amadeus1756-1791(de-588)1185845...,
185792,0,0.0,austenjane,
178566,0,0.0,,bührerwalter
174582,0,0.0,,mortzfeldpeter
3672,0,0.0,mozartwolfgang amadeus,
250547,0,0.0,mozartwolfgang amadeus1756-1791,
204209,0,0.0,,bührerwalter
211894,0,0.0,,mozartwolfgang amadeus
9766,0,0.0,mozartwolfgang amadeus,


0.0 < person_100_delta < 0.1


Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
258920,0,1.0,,
51907,0,1.0,mozartwolfgang amadeus1756-1791(de-588)118584596,mozartwolfgang amadeus1756-1791(de-588)118584596
125414,0,1.0,,
180848,0,1.0,,
218573,0,1.0,,
177484,0,1.0,,
212642,0,1.0,,
143071,0,1.0,,
22201,0,1.0,,
167743,0,1.0,,


0.9 < person_100_delta < 1.0


### person_700

In [42]:
person_700_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_700', person_700_algorithm)

In [43]:
dpf.show_samples_interval(df_feature_base, 'person_700', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_700', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_700_delta,person_700_x,person_700_y
7903,0,0.0,,kliethomas1956-(de-588)120727633
62289,0,0.0,schikanederemanuel1751-1812(de-588)11860757x,
37833,0,0.0,,"kesslersigrid, antheniencaroline"
41574,0,0.0,,"learevelyn1926-2012(de-588)123980437sängersng,..."
183637,0,0.0,"dufourguillaume-henri1787-1875, müllhauptheinr...",
84415,0,0.0,"gläser-zikudamichaela1967-(de-588)123411122, h...",
128178,0,0.0,mozartwolfgang amadeusdie zauberflötemusique i...,
103603,0,0.0,,kesslersigrid
166756,0,0.0,spinathbirgit,
13521,0,0.0,,"dufourguillaume-henri1787-1875, müllhauptheinr..."


0.0 < person_700_delta < 0.1


Unnamed: 0,duplicates,person_700_delta,person_700_x,person_700_y
243194,0,1.0,,
33883,0,1.0,,
5843,0,1.0,,
77367,0,1.0,,
211091,0,1.0,,
70985,0,1.0,,
201176,0,1.0,,
244708,0,1.0,,
2131,0,1.0,,
33389,0,1.0,,


0.9 < person_700_delta < 1.0


### person_245c

In [44]:
person_245c_algorithm = tedi.Jaro()

In [45]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_245c', person_245c_algorithm)

In [46]:
dpf.show_samples_interval(df_feature_base, 'person_245c', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_245c', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
88994,0,0.0,,jane austen
203663,0,0.0,,sozialversicherungsmissbrauch : am beispiel de...
204321,0,0.0,,jane austen
232643,0,0.0,,uri p. trier
89354,0,0.0,,"mortzfeld, peter; raabe, paul"
125751,0,0.0,,jane austen ; edited by james kinsley ; with a...
139376,0,0.0,ein film von luc jacquet ; original music emil...,
53927,0,0.0,,sigrid kessler... [et al.] ; [hrsg.:] interkan...
11126,0,0.0,[hrsg.:] schweizerische gesellschaft für bildu...,
30070,0,0.0,,sigrid kessler... [et al.] ; [hrsg.:] interkan...


0.0 < person_245c_delta < 0.1


Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
125182,0,1.0,,
203662,0,1.0,,
50152,0,1.0,,
123,1,0.935589,jane austen ; retold by annette barnes,jane austen ; retold annette barnes
44044,0,1.0,andreas flury,andreas flury
252019,0,1.0,,
120012,0,1.0,sigrid kessler... [et al.] ; [éd.:] interkanto...,sigrid kessler... [et al.] ; [éd.:] interkanto...
231845,0,1.0,,
203177,0,1.0,,
201854,0,1.0,,


0.9 < person_245c_delta < 1.0


### pubinit

In [47]:
pubinit_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'pubinit', pubinit_algorithm)

In [48]:
dpf.show_samples_interval(df_feature_base, 'pubinit', 0.6, 0.7, 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
190317,0,0.602453,frenetic films,bärenreiter
119397,0,0.647619,staatlicher lehrmittelverlag,alber
154845,0,0.60107,bärenreiter,berner lehrmittel- und medienverl.
219614,0,0.666667,schulthess,schulverl. blmv
53279,0,0.605772,power music,ph. reclam jun.
131202,0,0.652021,staatlicher lehrmittelverl.,"interkantonale lehrmittelzentrale, staatlicher..."
238482,0,0.662698,staatlicher lehrmittelverlag,"interkantonale lehrmittelzentrale, staatlicher..."
224314,0,0.630159,albin michel,staatlicher lehrmittelverlag
1417,0,0.647619,alber,staatlicher lehrmittelverlag
171929,0,0.611111,gallimard,kalmus


0.6 < pubinit_delta < 0.7


In [49]:
dpf.show_samples_interval(df_feature_base, 'pubinit', 0.8, 0.9, 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
147379,0,0.880952,p. reclam,p. reclam jun.
195685,0,0.875,k. alber,alber
143331,0,0.832576,interkantonale lehrmittelzentrale : staatliche...,"interkantonale lehrmittelzentrale ; [bern], sc..."
147608,0,0.888889,p. reclam,reclam
981,1,0.833333,springer,springer medizin
408,1,0.873871,cambridge university press,cambridge univ. press
49759,0,0.836544,"interkantonale lehrmittelzentrale, staatlicher...","interkantonale lehrmittelzentrale ; [bern], sc..."
49672,0,0.84127,"interkantonale lehrmittelzentrale, staatlicher...",interkantonale lehrmittelzentrale
243,0,0.837662,reclam jun.,p. reclam jun.
111777,0,0.888889,reclam,p. reclam


0.8 < pubinit_delta < 0.9


### scale

In [50]:
scale_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'scale', scale_algorithm)

In [51]:
dpf.show_samples_interval(df_feature_base, 'scale', 0.6, 0.7, 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
227513,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227483,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227504,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227485,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227486,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227208,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227494,0,0.681957,Scala 1:50.000 ; proiezione cilindrica ad asse...,50000
874,1,0.681957,Scala 1:50.000 ; proiezione cilindrica ad asse...,50000
227510,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227166,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000


0.6 < scale_delta < 0.7


In [52]:
dpf.show_samples_interval(df_feature_base, 'scale', 0.8, 0.9, 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
68382,0,0.822222,50000,100000
257305,0,0.822222,100000,50000
182216,0,0.822222,50000,100000
181877,0,0.822222,50000,100000
95190,0,0.822222,50000,100000
54055,0,0.822222,50000,100000
95178,0,0.822222,50000,100000
181876,0,0.822222,50000,100000
184037,0,0.822222,100000,50000
54056,0,0.822222,50000,100000


0.8 < scale_delta < 0.9


### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split into two new attributes, $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$, which are both compared with the same similarity metric.
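To illustrate what a Jaccard similarity on strings measures, the following sketch compares two titles as multisets (bags) of characters. It is a plain-Python approximation written for this explanation, not the TextDistance implementation itself. The function name `jaccard_chars` is hypothetical, and the character-level, multiset-based comparison is an assumption about the library's default configuration.

```python
from collections import Counter


def jaccard_chars(left, right):
    """Jaccard similarity of two strings seen as character multisets.

    Hypothetical helper for illustration; not part of the project code.
    """
    left_counts, right_counts = Counter(left), Counter(right)
    # Multiset intersection keeps the minimum count per character,
    # multiset union keeps the maximum count per character.
    intersection = sum((left_counts & right_counts).values())
    union = sum((left_counts | right_counts).values())
    # Assumption: two empty strings count as identical.
    return 1.0 if union == 0 else intersection / union


print(round(jaccard_chars('emma, roman', 'emma'), 6))  # 0.363636
print(jaccard_chars('homo faber', 'homo faber'))       # 1.0
```

The first call gives 4/11, which agrees with the $\texttt{ttlfull_245_delta}$ value of 0.363636 reported in this chapter's output for the title pair "emma, roman" and "emma".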

In [53]:
ttlfull_algorithm = tedi.Jaccard()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_245', ttlfull_algorithm)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_246', ttlfull_algorithm)

In [54]:
df_feature_base.columns

Index(['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x',
       'coordinate_N_y', 'corporate_110_x', 'corporate_110_y',
       'corporate_710_x', 'corporate_710_y', 'doi_x', 'doi_y', 'edition_x',
       'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x',
       'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x',
       'isbn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x',
       'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x',
       'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y',
       'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y',
       'volumes_x', 'volumes_y', 'corporate_110_delta', 'corporate_710_delta',
       'coordinate_E_delta', 'coordinate_N_delta', 'doi_delta',
       'edition_delta', 'exactDate_delta', 'format_prefix_delta',
       'format_postfix_delta', 'isbn_delta', 'musicid_delta', 'part_delta',
       'person_100_delta', 'person_700_delta', 'person_245

In [55]:
dpf.show_samples_interval(df_feature_base, 'ttlfull_245', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'ttlfull_245', 0.9, 1.0, 10)

Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
104087,0,0.022222,"klinische kardiologie, krankheiten des herzens...",emma
219465,0,0.053232,"sozialleistungsbetrug, sozialversicherungsbetr...",die zauberflöte
256096,0,0.094527,"neue ausgabe sämtlicher werke, die zauberflöte...","domo d'ossola, arona"
184883,0,0.022222,"klinische kardiologie, krankheiten des herzens...",emma
11696,0,0.05,emma,blick in die welt
222465,0,0.053571,emma,"bonne chance, cours de langue française, deuxi..."
158009,0,0.093923,"die zisterze kaisheim und ihre tochterklöster,...",blick in die welt
168076,0,0.015267,sozialleistungsbetrug - sozialversicherungsbet...,emma
182134,0,0.076923,domodossola,blick in die welt
65169,0,0.035714,emma,"bonne chance!, cours de langue française, étap..."


0.0 < ttlfull_245_delta < 0.1


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
151535,0,0.938776,"die zauberflöte, the magic flute : opera : k 620","die zauberflöte, the magic flute : opera, k 620"
44590,0,1.0,homo faber,homo faber
1453,1,1.0,"alles wissen dieser welt, warum bibliotheken n...","alles wissen dieser welt, warum bibliotheken n..."
98,1,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
23035,0,1.0,die zauberflöte,die zauberflöte
141151,0,0.954545,"bonne chance !, cours de langue française 2","bonne chance !, cours de langue française 1"
229673,0,1.0,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen"
163977,0,1.0,die reise der pinguine,die reise der pinguine
186664,0,1.0,"traité sur la tolérance, à l'occasion de la mo...","traité sur la tolérance, à l'occasion de la mo..."
960,1,1.0,"die zauberflöte, oper in zwei akten : text der...","die zauberflöte, oper in zwei akten : text der..."


0.9 < ttlfull_245_delta < 1.0


In [56]:
dpf.show_samples_interval(df_feature_base, 'ttlfull_246', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'ttlfull_246', 0.9, 1.0, 10)

Unnamed: 0,duplicates,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y
213367,0,0.0,medizinische informatik - kommunikation von ge...,
94079,0,0.0,,medizinsche informatik - kommunikation von ger...
80898,0,0.0,,"[domodossola, arona]"
36416,0,0.0,education et recherche,
67568,0,0.0,"die zauberflöte, ausgabe für gesang und klavier",
219082,0,0.0,education et recherche,
212513,0,0.0,medizinische informatik - kommunikation von ge...,
231968,0,0.0,medizinische informatik - kommunikation von ge...,
149089,0,0.0,,"domodossola, arona"
10038,0,0.0,,journal of adam mickiewicz university


0.0 < ttlfull_246_delta < 0.1


Unnamed: 0,duplicates,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y
83644,0,1.0,,
9083,0,1.0,,
179346,0,1.0,,
137117,0,1.0,,
30918,0,1.0,,
217664,0,1.0,,
24430,0,1.0,,
235323,0,1.0,,
95544,0,1.0,,
220394,0,1.0,,


0.9 < ttlfull_246_delta < 1.0


### volumes

In [57]:
volumes_algorithm = tedi.Jaccard()
#volumes_algo = tedi.MongeElkan()

In [58]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'volumes', volumes_algorithm)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head(20)

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
0,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.818905,0.848485,1.0,0.363636,1.0,1.0
2,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,0.759259,0.0,0.69774,0.848485,1.0,1.0,1.0,1.0
3,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.818905,0.848485,1.0,0.363636,1.0,1.0
4,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen,reclam,reclam,,,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,0.759259,1.0,0.702265,1.0,1.0,0.363636,1.0,1.0
6,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,0.759259,0.0,0.69774,0.848485,1.0,1.0,1.0,1.0
7,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,,jane austen,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,0.759259,1.0,0.702265,1.0,1.0,0.363636,1.0,1.0
8,1,,,,,,,,,[],[],,,2009aaaa,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008.0,20008.0,austenjane,austenjane,,,jane austen,jane austen,reclam,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1,,,,,,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",[],[],,,2000aaaa,2000uuuu,vm,vm,10300,10300,[],[],,,,,levinejamesdir.,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...","mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,w. a. mozart ; libretto: emanuel schikaneder ;...,deutsche grammophon,deutsche grammophon,,,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen",,,"1 dvd-video, dvd region 0, 169 min., farb.","1 dvd-video, dvd region 0, 169 min., farb.",1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [59]:
columns_metadata_dict

{'columns_to_use': ['duplicates',
  'coordinate_E_x',
  'coordinate_E_y',
  'coordinate_N_x',
  'coordinate_N_y',
  'corporate_110_x',
  'corporate_110_y',
  'corporate_710_x',
  'corporate_710_y',
  'doi_x',
  'doi_y',
  'edition_x',
  'edition_y',
  'exactDate_x',
  'exactDate_y',
  'format_prefix_x',
  'format_prefix_y',
  'format_postfix_x',
  'format_postfix_y',
  'isbn_x',
  'isbn_y',
  'musicid_x',
  'musicid_y',
  'part_x',
  'part_y',
  'person_100_x',
  'person_100_y',
  'person_700_x',
  'person_700_y',
  'person_245c_x',
  'person_245c_y',
  'pubinit_x',
  'pubinit_y',
  'scale_x',
  'scale_y',
  'ttlfull_245_x',
  'ttlfull_245_y',
  'ttlfull_246_x',
  'ttlfull_246_y',
  'volumes_x',
  'volumes_y']}

## Feature Base

The metrics for each attribute of the feature DataFrame have been decided and the features have been calculated. The columns holding the original attribute values are no longer needed for further processing and are dropped to obtain the feature matrix for training the estimators.

In [60]:
# Drop all non-delta columns, except 'duplicates'
columns_to_be_dropped = [e for e in columns_metadata_dict['columns_to_use']
                         if e != 'duplicates']

df_feature_base.drop(columns=columns_to_be_dropped, inplace=True)

In [61]:
for i in range(2):
    display(df_feature_base[df_feature_base.duplicates==i].sample(n=20))

Unnamed: 0,duplicates,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
145027,0,1.0,1.0,1.0,1.0,1.0,0.0,0.625,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.977778,1.0,0.573171,1.0,0.222222
71034,0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,0.0,0.428571,0.0,1.0,0.0,0.417735,0.0,0.498701,1.0,1.0,0.314815,1.0,0.12
24652,0,1.0,0.0,1.0,1.0,1.0,1.0,0.5,0.0,0.111111,0.0,1.0,0.0,1.0,1.0,0.603083,0.0,1.0,0.313725,0.0,0.0
251205,0,1.0,1.0,1.0,1.0,1.0,0.0,0.3125,0.0,0.428571,0.0,0.0,1.0,0.0,0.544674,0.588492,0.40202,1.0,0.40678,1.0,0.178571
108241,0,1.0,1.0,1.0,1.0,0.0,1.0,0.5,1.0,0.111111,0.0,1.0,0.453704,0.481151,0.0,0.504035,0.0,1.0,0.306122,1.0,0.0
118888,0,1.0,1.0,1.0,1.0,1.0,0.409091,0.25,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.792133,0.581566,1.0,0.339161,1.0,0.416667
230651,0,1.0,1.0,1.0,1.0,1.0,1.0,0.375,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.635175,0.493651,1.0,0.222222,1.0,0.5
191521,0,1.0,1.0,1.0,1.0,1.0,0.404255,0.25,0.0,0.111111,0.0,1.0,0.0,0.0,0.0,0.504002,0.0,1.0,0.470588,1.0,0.466667
228133,0,1.0,0.0,1.0,1.0,1.0,1.0,0.375,0.0,0.111111,1.0,1.0,1.0,0.0,0.0,0.477335,0.0,1.0,0.467391,1.0,0.058824
99827,0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0,1.0,1.0,0.0,1.0,0.447869,0.51159,0.382937,1.0,0.168831,1.0,0.333333


Unnamed: 0,duplicates,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,doi_delta,edition_delta,exactDate_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
891,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.830688,0.422222,1.0,1.0,1.0,1.0
1084,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.745098,1.0,1.0,0.72,1.0,1.0
385,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.845843,1.0,1.0,1.0,1.0,0.333333
351,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.817733,0.0,1.0,0.511628,1.0,1.0
523,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,0.25,1.0,0.819444,0.654692,0.815144,1.0,1.0,1.0,1.0,1.0
272,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,0.0,1.0,1.0,0.456514,0.787082,0.0,1.0,1.0,1.0,1.0
348,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,0.0,1.0,1.0,0.888889,1.0,0.729885,0.0,1.0,1.0,0.0,1.0
475,1,1.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0,1.0,0.0,1.0,0.819444,0.0,0.909163,0.0,1.0,0.703704,1.0,0.92
1077,1,1.0,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
854,1,1.0,1.0,0.916667,0.916667,1.0,0.0,0.75,1.0,1.0,1.0,1.0,0.764706,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.555556


## Feature Matrix and Target Vector Handover

To hand over the result of this chapter, the DataFrame is saved to a pickle file, which serves as the input file for the following chapters [Decision Tree Model](./5_DecisionTreeModel.ipynb), ...

In [62]:
import pickle as pk

# Binary intermediary file
with open(os.path.join(path_goldstandard,
                       'labelled_feature_matrix.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)
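
A later chapter might restore the file and separate the target vector from the feature matrix roughly as follows. This is a sketch under assumptions: a tiny stand-in DataFrame with made-up values and a temporary directory replace the real data and $\texttt{path_goldstandard}$, and the variable names `X` and `y` are illustrative rather than taken from the subsequent notebooks.

```python
import os
import pickle as pk
import tempfile

import pandas as pd

# Stand-in for df_feature_base with made-up values (assumption).
df_demo = pd.DataFrame({
    'duplicates': [1, 0, 0],
    'isbn_delta': [1.0, 0.0, 0.5],
    'ttlfull_245_delta': [1.0, 0.2, 0.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'labelled_feature_matrix.pkl')

    # Write the DataFrame in the same way as in the cell above.
    with open(file_path, 'wb') as df_output_file:
        pk.dump(df_demo, df_output_file)

    # Restore it, as a subsequent chapter could do.
    with open(file_path, 'rb') as df_input_file:
        df_restored = pk.load(df_input_file)

# Target vector: the 'duplicates' label. Feature matrix: all delta columns.
y = df_restored['duplicates']
X = df_restored.drop(columns=['duplicates'])

print(X.columns.tolist())
print(y.tolist())
```

Keeping the label column in the same file as the features means a single read yields both the feature matrix and the target vector.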