# Feature Matrix Generation

This chapter introduces similarity metrics for string comparison. For each attribute of the DataFrame built in the previous chapters, the metric used to calculate its similarity is chosen. The result of this chapter is the feature matrix.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
- [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [century](#century)
    - [corporate](#corporate)
    - [coordinate](#coordinate)
    - [edition](#edition)
    - [format](#format)
    - [isbn](#isbn)
    - [musicid](#musicid)
    - [part](#part)
    - [person_100](#person_100)
    - [person_700](#person_700)
    - [person_245c](#person_245c)
    - [pubinit](#pubinit)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Feature Base](#Feature-Base)

## Data Takeover

Swissbib's raw data of the goldstandard has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, this data is read in for further processing to the feature matrix and target vector for the subsequent machine learning model chapters.

In [1]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,ismn_x,ismn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.
2,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.
4,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.


In [2]:
print('Number of rows labelled as duplicates', len(df_feature_base[
    df_feature_base.duplicates==1]))
print('Number of rows labelled as uniques', len(df_feature_base[
    df_feature_base.duplicates==0]))
print('Total number of rows in DataFrame', df_feature_base.shape[0],
      'number of columns', df_feature_base.shape[1])

Number of rows labelled as duplicates 1473
Number of rows labelled as uniques 259260
Total number of rows in DataFrame 260733 number of columns 45


In [3]:
print('Share of duplicates (1) and uniques (0) in units of [%]')
print(100*df_feature_base.duplicates.value_counts(normalize=True))

Share of duplicates (1) and uniques (0) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


At below 0.6%, the share of duplicate pairs in the full training data is very low, which will affect the training of the model. During the training process, the model will see far more pairs of unique records ($\texttt{duplicates}=0$) than pairs of duplicates ($\texttt{duplicates}=1$). Undersampling of the unique pairs might therefore be necessary; this will be decided during model training.

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

The attributes to be used for pair comparison may contain values of poor quality. The quality depends on how the data was entered at the source. Manual data entry may suffer from mistyping; automatically scanned data may suffer from deficiencies of the scanned base material or of the recognition algorithm used in optical character recognition (OCR). The basic step of a deduplication process is to estimate how likely it is that the two strings of a pair are duplicates. This is done by calculating a similarity value between the two strings rather than using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the pair is a duplicate.

The term similarity is closely coupled to the distance between two values of an attribute. Mathematically, a distance can be described with the help of a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangular inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.
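
The four requirements can be checked numerically on a small set of objects. The following is a minimal sketch in plain Python; the helper names `hamming_distance` and `satisfies_metric_axioms` are chosen for illustration and are not part of this project's code. The Hamming distance on equal-length strings is used as the example metric, and its square serves as a counterexample that breaks the triangular inequality.

```python
from itertools import product


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def satisfies_metric_axioms(dist, objects) -> bool:
    """Check the four distance requirements on all triples of the given objects."""
    for o_i, o_j, o_k in product(objects, repeat=3):
        if dist(o_i, o_i) != 0:                                # distance to itself
            return False
        if dist(o_i, o_j) < 0:                                 # non-negativity
            return False
        if dist(o_i, o_j) != dist(o_j, o_i):                   # symmetry
            return False
        if dist(o_i, o_j) > dist(o_i, o_k) + dist(o_k, o_j):   # triangular inequality
            return False
    return True


def squared_hamming(a: str, b: str) -> int:
    """Counterexample: squaring the distance breaks the triangular inequality."""
    return hamming_distance(a, b) ** 2


years = ["2009", "2007", "2017", "1863"]
print(satisfies_metric_axioms(hamming_distance, years))  # True
print(satisfies_metric_axioms(squared_hamming, years))   # False
```

A check on a finite sample can only refute the requirements, not prove them in general; it is meant as an aid for building intuition.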

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$ by calculating $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, if the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attribute values, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends on the type of data being compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends on the type of data being compared.
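
The conversion $s = 1-d$ and the three cases of the list can be illustrated with a small sketch. It assumes a normalised Hamming distance (share of differing positions) as the underlying distance; the function names are illustrative only.

```python
def normalised_hamming_distance(a: str, b: str) -> float:
    """Share of differing positions; surplus characters count as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def similarity(a: str, b: str) -> float:
    """Similarity derived from a normalised distance, s = 1 - d."""
    return 1.0 - normalised_hamming_distance(a, b)


print(similarity("emma", "emma"))  # 1.0, exact similarity
print(similarity("bk", "vm"))      # 0.0, completely different
print(similarity("2009", "2017"))  # 0.5, somewhat similar
```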

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
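
As a worked example of the mismatch ratio, the sketch below compares four attributes of two records. The attribute values are taken from the second row of the DataFrame preview in section Data Takeover; the function name `mismatch_dissimilarity` is an illustrative assumption.

```python
def mismatch_dissimilarity(o_i: dict, o_j: dict) -> float:
    """d = (p - m) / p over the attributes that both objects share."""
    attributes = o_i.keys() & o_j.keys()
    p = len(attributes)
    if p == 0:
        raise ValueError("objects share no attributes")
    m = sum(o_i[name] == o_j[name] for name in attributes)
    return (p - m) / p


record_x = {"century": "2009", "format_prefix": "bk",
            "pubinit": "reclam jun.", "ttlfull_245": "emma, roman"}
record_y = {"century": "2009", "format_prefix": "bk",
            "pubinit": "reclam", "ttlfull_245": "emma"}

d = mismatch_dissimilarity(record_x, record_y)
print(d)      # 0.5, two of four attributes differ
print(1 - d)  # 0.5, similarity m / p
```

With exact matching per attribute, the pair reaches a similarity of only 0.5 although both records describe the same book, which motivates the approximate string similarities used in the rest of this chapter.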

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although there is a correspondence between a similarity function and the mathematical concept of a distance function, not all known and implemented similarity functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangular inequality. The choice of the best similarity function for a string pair will be based on how well the function serves the purpose at hand. In the case of this capstone project, this purpose is its capability to contribute to the prediction of whether a pair of records is a duplicate or not.
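
To make the remark on asymmetry concrete, the sketch below uses a containment measure, the share of distinct characters of the first string that also occur in the second. This measure is chosen purely for illustration and is not one of the functions used in this chapter.

```python
def containment(a: str, b: str) -> float:
    """Share of the distinct characters of a that also occur in b."""
    chars_a, chars_b = set(a), set(b)
    if not chars_a:
        return 1.0
    return len(chars_a & chars_b) / len(chars_a)


print(containment("emma", "emma, roman"))  # 1.0
print(containment("emma, roman", "emma"))  # 0.375
```

Swapping the arguments changes the result, so the value $1 - sim$ of such a function cannot be a distance in the sense defined above.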

## Library TextDistance

An internet search on string distance calculation with Python revealed the libraries [[StSi](./A_References.ipynb#stsi)], [[TeDi](./A_References.ipynb#tedi)] and separate code snippets for individual algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the text distance library [[TeDi](./A_References.ipynb#tedi)] was chosen for this capstone project. The decision is based on the GitHub statistics of stars and the date of the latest pull requests, which indicate the popularity and maintenance activity of the library. A look at its API shows that the library is a complete implementation (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and easy to use.

In [4]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance



For using the library, see documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, function $\texttt{.normalized}\_\texttt{similarity()}$ of an instantiated textdistance object will be used.

In [5]:
import textdistance as tedi

With the code line above, the library is imported for use in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb) the effects of the library's similarity metrics are compared for a better understanding of their specific behaviour. This comparison is the basis for deciding on the best available similarity metric for each attribute pair.
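
A short usage sketch of $\texttt{.normalized}\_\texttt{similarity()}$ follows. It assumes the textdistance library is installed and uses only the algorithm classes that appear in this chapter; the input strings are illustrative.

```python
import textdistance as tedi

# Each algorithm is instantiated once and can then be reused for many pairs.
hamming = tedi.Hamming()
identity = tedi.Identity()

# Two of four positions differ, so the normalised similarity is 0.5.
print(hamming.normalized_similarity("2009", "2017"))

# Identity only distinguishes equal (1) from unequal (0) strings.
print(identity.normalized_similarity("bk", "bk"))
print(identity.normalized_similarity("bk", "vm"))
```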

## Similarity Metrics on Attribute Level

In this section, the choice of similarity metric for each attribute of the raw data is documented, based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), and implemented. The implementation is applied to a pair of attributes of different records, resulting in a new attribute of the final feature matrix. A general function $\texttt{build_delta_feature}$ is provided by the code file [data_preparation_funcs.py](./data_preparation_funcs.py) for transforming two attributes into a feature attribute holding their similarity value.
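
The source of $\texttt{build_delta_feature}$ is not reproduced in this chapter. The sketch below only illustrates the general idea under simplifying assumptions: rows are plain dictionaries instead of a DataFrame, and `ExactMatch` and `add_delta_feature` are hypothetical names. Any object offering a `normalized_similarity` method could be passed in place of `ExactMatch`.

```python
class ExactMatch:
    """Hypothetical stand-in metric with the method name used in this chapter."""

    def normalized_similarity(self, left, right) -> float:
        return 1.0 if left == right else 0.0


def add_delta_feature(rows, attribute, algorithm):
    """Add '<attribute>_delta' to each row from its '_x' and '_y' values."""
    for row in rows:
        row[f"{attribute}_delta"] = algorithm.normalized_similarity(
            row[f"{attribute}_x"], row[f"{attribute}_y"]
        )
    return rows


rows = [
    {"format_prefix_x": "bk", "format_prefix_y": "bk"},
    {"format_prefix_x": "bk", "format_prefix_y": "vm"},
]
add_delta_feature(rows, "format_prefix", ExactMatch())
print([row["format_prefix_delta"] for row in rows])  # [1.0, 0.0]
```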

In [6]:
import data_preparation_funcs as dpf

### century

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{century}$ holds a year number stored as a string of length 4. The letter 'u' is used as a placeholder for an unknown digit. For this reason, the attribute will be kept as a string and will not be transformed into an integer. The feature attribute of the record pair to be compared will be calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity will be stored in a new attribute $\texttt{century}\_\texttt{delta}$, which will be used for the model calculation.

In [7]:
# Replace letter 'u' with letter 'a' in one of the two strings.
#  As an effect, a placeholder letter in either string never matches
#  the other string and contributes 0 to the Hamming similarity.
df_feature_base['century_x'] = df_feature_base.century_x.str.replace('u', 'a')

# Compute Hamming similarity for century string pair.
century_algorithm = tedi.Hamming()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'century', century_algorithm)

# Add 0.125 to the Hamming similarity for every placeholder letter,
#  counting only the larger number of placeholders of the two strings.
df_feature_base['century_delta'] = df_feature_base[['century_x', 'century_y', 'century_delta']].apply(
    lambda x : x['century_delta'] + 
    0.125*max(x['century_x'].count('a'), x['century_y'].count('u')), axis=1
)
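
The modified Hamming calculation of the cell above can be traced for single pairs with a plain Python sketch. The function name `century_similarity` is an illustrative assumption, and the sketch assumes that both inputs are strings of length 4.

```python
def century_similarity(century_x: str, century_y: str) -> float:
    """Hamming similarity of two 4-character years plus a placeholder bonus."""
    # Placeholders on the left are renamed so that they never match
    # a placeholder or digit on the right.
    left = century_x.replace("u", "a")
    matches = sum(a == b for a, b in zip(left, century_y))
    hamming_similarity = matches / 4
    # Each placeholder adds 0.125, half the weight of one position,
    # using the larger placeholder count of the two strings.
    bonus = 0.125 * max(left.count("a"), century_y.count("u"))
    return hamming_similarity + bonus


print(century_similarity("2009", "2009"))  # 1.0
print(century_similarity("2015", "2017"))  # 0.75
print(century_similarity("2000", "200u"))  # 0.875
print(century_similarity("uuuu", "1863"))  # 0.5
```

An unknown digit is thus treated as half a match, which explains why the resulting similarity values are multiples of 0.125.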

In [8]:
df_feature_base[['century_x', 'century_y', 'century_delta']].sample(n=10)

Unnamed: 0,century_x,century_y,century_delta
134751,1764,1988,0.25
154761,1970,2014,0.0
168073,2012,2004,0.5
79723,1978,1550,0.25
30813,2005,1943,0.0
225290,2001,1793,0.0
152056,1983,1793,0.5
226781,1764,1998,0.25
95609,2005,1998,0.0
155092,aaaa,1863,0.5


All resulting values of equal strings are equal to 1.

In [9]:
df_feature_base[['century_x', 'century_y', 'century_delta']][
    df_feature_base.century_x == df_feature_base.century_y
].sort_values('century_delta', ascending=False).head()

Unnamed: 0,century_x,century_y,century_delta
0,2009,2009,1.0
158402,2013,2013,1.0
158980,2013,2013,1.0
158939,2013,2013,1.0
158885,2013,2013,1.0


Nine different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [10]:
import numpy as np

century_deltas = np.sort(df_feature_base.century_delta.unique())
century_deltas

array([0.   , 0.125, 0.25 , 0.375, 0.5  , 0.625, 0.75 , 0.875, 1.   ])

In [11]:
sample_size = 5

for i in century_deltas :
    dpf.show_samples_distinct(df_feature_base, 'century', i)
    print(f'century_delta = {i}')

Unnamed: 0,duplicates,century_delta,century_x,century_y
110500,0,0.0,2009,1995
105420,0,0.0,2011,1964
249685,0,0.0,2007,1990
144803,0,0.0,1989,2008
191424,0,0.0,2016,1844


century_delta = 0.0


Unnamed: 0,duplicates,century_delta,century_x,century_y
189091,0,0.125,2016,193u
251060,0,0.125,183a,2010
258227,0,0.125,2005,193u
4197,0,0.125,2002,189u
6715,0,0.125,2007,193u


century_delta = 0.125


Unnamed: 0,duplicates,century_delta,century_x,century_y
150626,0,0.25,2012,1932
31483,0,0.25,2007,1987
226599,0,0.25,1764,1985
92339,0,0.25,1843,1990
108324,0,0.25,1982,2012


century_delta = 0.25


Unnamed: 0,duplicates,century_delta,century_x,century_y
251029,0,0.375,183a,1763
248097,0,0.375,1932,181u
251478,0,0.375,183a,1991
251467,0,0.375,183a,1989
88844,0,0.375,170a,1994


century_delta = 0.375


Unnamed: 0,duplicates,century_delta,century_x,century_y
76788,0,0.5,2005,2010
163732,0,0.5,1970,1926
234977,0,0.5,1989,1920
186636,0,0.5,2016,2007
111495,0,0.5,2009,uuuu


century_delta = 0.5


Unnamed: 0,duplicates,century_delta,century_x,century_y
68103,0,0.625,1963,193u
131279,0,0.625,1982,193u
135317,0,0.625,1994,189u
69470,0,0.625,1991,192u
187818,0,0.625,2016,200u


century_delta = 0.625


Unnamed: 0,duplicates,century_delta,century_x,century_y
223379,0,0.75,2015,2017
104698,0,0.75,1984,1981
188849,0,0.75,2015,2017
179768,0,0.75,2009,2000
19125,0,0.75,2007,2002


century_delta = 0.75


Unnamed: 0,duplicates,century_delta,century_x,century_y
129410,0,0.875,2000,200u
248095,0,0.875,1932,193u
2462,0,0.875,2008,200u
95438,0,0.875,2005,200u
132986,0,0.875,2001,200u


century_delta = 0.875


Unnamed: 0,duplicates,century_delta,century_x,century_y
243046,0,1.0,2016,2016
95,1,1.0,1999,1999
55620,0,1.0,2012,2012
244222,0,1.0,2016,2016
209654,0,1.0,2016,2016


century_delta = 1.0


### corporate

In [12]:
corporate_algorithm = tedi.Jaro()

In [13]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_110', corporate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_710', corporate_algorithm)

In [14]:
dpf.show_samples_interval(df_feature_base, 'corporate_110', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'corporate_710', 0.9, 1.0, 10)

Unnamed: 0,duplicates,corporate_110_delta,corporate_110_x,corporate_110_y
196182,0,0.0,eidgenössisches topographisches bureau,
196557,0,0.0,eidgenössisches topographisches bureau,
196329,0,0.0,eidgenössisches topographisches bureau,
196588,0,0.0,eidgenössisches topographisches bureau,
196446,0,0.0,eidgenössisches topographisches bureau,
196296,0,0.0,eidgenössisches topographisches bureau,
196172,0,0.0,eidgenössisches topographisches bureau,
196152,0,0.0,eidgenössisches topographisches bureau,
196218,0,0.0,eidgenössisches topographisches bureau,
196656,0,0.0,eidgenössisches topographisches bureau,


0.0 < corporate_110_delta < 0.1


Unnamed: 0,duplicates,corporate_710_delta,corporate_710_x,corporate_710_y
215838,0,1.0,,
192585,0,1.0,,
195628,0,1.0,,
242214,0,1.0,,
228418,0,1.0,,
66754,0,1.0,,
236937,0,1.0,,
133237,0,1.0,,
65928,0,1.0,,
206582,0,1.0,,


0.9 < corporate_710_delta < 1.0


### coordinate

In [15]:
coordinate_algorithm = tedi.Jaro()

In [16]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'coordinate_E', coordinate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'coordinate_N', coordinate_algorithm)

In [17]:
df_feature_base['coordinate_E_delta'].unique(), df_feature_base['coordinate_N_delta'].unique()

(array([1.        , 0.        , 0.91666667, 0.77777778, 0.58333333,
        0.75      , 0.86904762, 0.66666667, 0.68333333]),
 array([1.        , 0.        , 0.91666667, 0.66666667, 0.83333333,
        0.77777778, 0.75      , 0.68333333]))

In [18]:
dpf.show_samples_interval(df_feature_base, 'coordinate_E', 0.5, 0.7, 10)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
53736,0,0.583333,e0080855,e0074147
68401,0,0.583333,e0080855,e0074147
182195,0,0.583333,e0080855,e0074147
182205,0,0.583333,e0080855,e0074147
68374,0,0.583333,e0080855,e0074147
182218,0,0.583333,e0080855,e0074147
197106,0,0.583333,e0074147,e0055009
257302,0,0.583333,e0074147,e0055009
182191,0,0.583333,e0080855,e0074147
76980,0,0.666667,e0060811,e0074147


0.5 < coordinate_E_delta < 0.7


### edition

The edition statement is a string value which may have several words. A Jaccard similarity is tried for this attribute.

In [19]:
edition_algorithm = tedi.Jaccard()

In [20]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'edition', edition_algorithm)

In [21]:
df_feature_base.edition_delta.unique()[:30], len(df_feature_base.edition_delta.unique())

(array([1.        , 0.        , 0.95348837, 0.46511628, 0.62222222,
        0.66666667, 0.5       , 0.65909091, 0.62790698, 0.67741935,
        0.68965517, 0.84375   , 0.7       , 0.875     , 0.92857143,
        0.56521739, 0.54166667, 0.72727273, 0.96153846, 0.57142857,
        0.54761905, 0.74193548, 0.09375   , 0.17647059, 0.24      ,
        0.10526316, 0.75      , 0.06666667, 0.13636364, 0.28571429]), 802)

The comparison results in a large number of distinct similarity values for the goldstandard data set. Below, some examples are shown.

In [22]:
dpf.show_samples_interval(df_feature_base, 'edition', 0.9, 1.0, 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
213407,0,1.0,,
209843,0,1.0,,
108171,0,1.0,,
183116,0,1.0,,
230270,0,1.0,,
69608,0,1.0,,
33254,0,1.0,,
106412,0,1.0,,
120439,0,1.0,,
234371,0,1.0,,


0.9 < edition_delta < 1.0


In [23]:
dpf.show_samples_interval(df_feature_base, 'edition', 0.0, 0.1, 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
175702,0,0.0,,[Nouv. éd.]
133188,0,0.0,"3., erw. Aufl.",
41568,0,0.0,,Pbk. ed.
190544,0,0.0,,Überdruck
122713,0,0.0,[2. Aufl.],
129214,0,0.0,,[Nouv. éd.]
164016,0,0.0,,[Rééd.]
147471,0,0.0,,[Nouvelle éd.]
118283,0,0.0,,Nouv. éd. [4. Aufl.]
35438,0,0.0,,"Nouv. éd., [2. Aufl.]"


0.0 < edition_delta < 0.1


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split up into two new attributes $\texttt{format_prefix}$ and $\texttt{format_postfix}$, which will be compared with different similarity metrics.

- As the quality of $\texttt{format_prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format_postfix}$, a q-gram based comparison will be chosen.
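
The idea of a q-gram based comparison can be sketched in plain Python. The sketch counts repeated q-grams (multiset view), which is one common definition of the Jaccard similarity; it is not verified here that it reproduces the library's result in every case. The function names and the six-character codes are made-up examples.

```python
from collections import Counter


def qgrams(text: str, q: int) -> Counter:
    """Multiset of all substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_jaccard(left: str, right: str, q: int = 2) -> float:
    """Shared q-grams divided by all q-grams of both strings."""
    if left == right:
        return 1.0
    grams_left, grams_right = qgrams(left, q), qgrams(right, q)
    intersection = sum((grams_left & grams_right).values())
    union = sum((grams_left | grams_right).values())
    return intersection / union if union else 0.0


# 3 shared bigrams (02, 20, 00) out of 7 bigrams in total.
print(qgram_jaccard("020053", "020000"))
# Same bigrams in a different order give a similarity of 1.0.
print(qgram_jaccard("020000", "002000"))
```

The second call shows a property worth keeping in mind when reading the results: a q-gram comparison ignores the order of the q-grams, so two different strings can reach a similarity of 1.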

In [24]:
format_prefix_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_prefix',
                                          format_prefix_algorithm)

format_postfix_algorithm = tedi.Jaccard(qval=2)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_postfix',
                                          format_postfix_algorithm)

In [25]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_prefix', i)
    print(f'format_prefix_delta = {i}')

Unnamed: 0,duplicates,format_prefix_delta,format_prefix_x,format_prefix_y
214590,0,0.0,bk,vm
259095,0,0.0,bk,mp
151614,0,0.0,mu,vm
165228,0,0.0,bk,vm
138100,0,0.0,bk,vm


format_prefix_delta = 0.0


In [26]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_postfix', i)
    print(f'format_postfix_delta = {i}')

Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
243679,0,0.428571,20053,20000
156204,0,0.428571,30053,20053
229550,0,0.428571,10200,10300
92297,0,0.428571,20000,30000
59793,0,0.428571,20000,10200


format_postfix_delta = 0.4285714285714286


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
257445,0,0.111111,10300,20053
257520,0,0.111111,20000,10300
237859,0,0.111111,20000,40100
206648,0,0.111111,20000,10300
22063,0,0.111111,10300,20000


format_postfix_delta = 0.11111111111111116


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
196443,0,0.25,10347,10053
66342,0,0.25,10200,10353
214631,0,0.25,20000,20353
88186,0,0.25,30500,20053
129775,0,0.25,20000,20347


format_postfix_delta = 0.25


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
223369,0,0.0,20000,30653
250987,0,0.0,10100,20353
246977,0,0.0,30653,20000
196227,0,0.0,10347,20000
145014,0,0.0,20000,10353


format_postfix_delta = 0.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
224103,0,1.0,20000,20000
96255,0,1.0,20000,20000
69098,0,1.0,20053,20053
158245,0,1.0,20000,20000
220127,0,1.0,20000,20000


format_postfix_delta = 1.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
26086,0,0.666667,20047,20400
26102,0,0.666667,20047,20400
87943,0,0.666667,30500,30053
90470,0,0.666667,20047,20400
244959,0,0.666667,20400,20047


format_postfix_delta = 0.6666666666666666


### isbn

Swissbib uses each string element of the $\texttt{isbn}$ list separately for comparing with each string element of its comparison $\texttt{isbn}$ list. If two bibliographic units hold at least one element in common, this is interpreted as a strong indicator for duplicates [[WiCo2001](./A_References.ipynb#wico2001)].

This hard logic is used in a modified way in the context of this capstone project. A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn()}$ has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. In line with Swissbib's implementation, the Identity metric is used for string comparison, calculating a similarity value of 1 or 0 for each list element pair. For normalisation, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty, a value of 1.0 is returned. If only one list is empty, a value of 0.0 is returned.
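
The described list comparison can be sketched as follows. This is a plain Python illustration of the rules stated above, not the project's implementation; the function name `isbn_list_similarity` is an assumption. Note that a list containing the same identifier twice could push the value above 1 in this simple form.

```python
def isbn_list_similarity(isbns_x: list, isbns_y: list) -> float:
    """Identity matches over all element pairs, normalised by the smaller list."""
    if not isbns_x and not isbns_y:
        return 1.0  # both lists empty
    if not isbns_x or not isbns_y:
        return 0.0  # only one list empty
    matches = sum(1 for left in isbns_x for right in isbns_y if left == right)
    return matches / min(len(isbns_x), len(isbns_y))


print(isbn_list_similarity([], []))                 # 1.0
print(isbn_list_similarity(["3-499-17476-6"], []))  # 0.0
print(isbn_list_similarity(
    ["978-3-13-127286-7", "978-3-13-150826-3 (PDF)"],
    ["978-3-13-127286-7", "3-13-127286-4"],
))                                                  # 0.5
```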

In [27]:
isbn_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'isbn',
                                          isbn_algorithm)

df_feature_base['isbn_delta'].unique()

array([1. , 0. , 0.5])

Some sample cases are shown below for each category of $\texttt{isbn_delta}$.

In [28]:
for isbn_delta_value in df_feature_base['isbn_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['isbn_delta']==isbn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'isbn', isbn_delta_value, number_of_max_samples)
    print(f'isbn_delta = {isbn_delta_value}')

Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
212604,0,1.0,[],[]
140661,0,1.0,[],[]
67970,0,1.0,[],[]
138935,0,1.0,[],[]
244463,0,1.0,[],[]
54051,0,1.0,[],[]
14656,0,1.0,[],[]
106132,0,1.0,[],[]
123568,0,1.0,[],[]
33271,0,1.0,[],[]


isbn_delta = 1.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
19914,0,0.0,[3-499-17476-6],[]
132962,0,0.0,[3-13-127283-X],"[3-7957-8008-X (Schott).), 3-492-18008-6 (Piper)]"
174193,0,0.0,[978-3-7190-3171-8],[]
235802,0,0.0,[978-2-226-31734-6],[]
68527,0,0.0,[],"[978-3-598-31803-0 (print), 978-3-11-096275-8]"
179246,0,0.0,[],[978-0-7294-1157-8]
187837,0,0.0,[978-0-7294-1157-8],[]
117390,0,0.0,[2-08-070552-0],"[978-3-598-31502-2 (print), 978-3-11-097002-9]"
127590,0,0.0,[3-495-47879-5],[]
83812,0,0.0,[978-3-13-127285-0],[]


isbn_delta = 0.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
1199,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1210,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
1202,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1201,1,0.5,"[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]","[978-3-13-127286-7, 3-13-127286-4]"
1195,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"
161989,0,0.5,"[978-3-642-41697-2, 978-3-642-41698-9 (ebook)]","[978-3-642-41697-2, 3-642-41697-7]"
1205,1,0.5,"[978-3-13-127286-7, 3-13-127286-4]","[978-3-13-127286-7, 978-3-13-150826-3 (PDF)]"


isbn_delta = 0.5


### musicid

In [29]:
musicid_algorithm = tedi.Jaccard()

In [30]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'musicid', musicid_algorithm)

In [31]:
df_feature_base['musicid_delta'].unique()

array([1.        , 0.        , 0.42857143, 0.25      , 0.40909091,
       0.41176471, 0.38888889, 0.53333333, 0.5       , 0.7       ,
       0.38461538, 0.88888889, 0.22222222, 0.35714286, 0.44117647,
       0.76470588, 0.31034483, 0.44444444, 0.8       , 0.18181818,
       0.08108108, 0.09677419, 0.05555556, 0.10344828, 0.04761905,
       0.1       , 0.11111111, 0.1025641 , 0.07692308, 0.03448276,
       0.14285714, 0.05882353, 0.30769231, 0.16      , 0.29411765,
       0.20833333, 0.17241379, 0.08571429, 0.23076923, 0.36363636,
       0.33333333, 0.28571429, 0.19047619, 0.27272727, 0.06451613,
       0.06666667, 0.02380952, 0.125     , 0.16666667, 0.04651163,
       0.17647059, 0.2       , 0.07142857, 0.15384615, 0.05263158,
       0.09090909, 0.11764706, 0.21428571, 0.15789474, 0.08333333,
       0.08695652, 0.03125   , 0.04878049, 0.12      , 0.71428571,
       0.3       , 0.1875    , 0.30434783, 0.025     , 0.35294118,
       0.09375   , 0.05405405, 0.13043478, 0.05128205, 0.13333

In [32]:
dpf.show_samples_interval(df_feature_base, 'musicid', 0.0, 0.1, 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
69581,0,0.0,433210-2,
63050,0,0.0,U.E. 245,
176306,0,0.0,,95088
245859,0,0.0,K 1004,
137674,0,0.0,242 716-28.35766 ZA,
117690,0,0.0,,PB 226
60369,0,0.0,,10425EP 4697
205235,0,0.0,,BA 4553a
15692,0,0.0,Decca 433 210-2,
102814,0,0.0,502430,


0.0 < musicid_delta < 0.1


In [33]:
dpf.show_samples_interval(df_feature_base, 'musicid', 0.6, 0.7, 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
52441,0,0.7,BA 4553,BA 4553-90
102505,0,0.636364,BA 4553a,BA 4553-90
438,1,0.7,433 221-2,433210-2
152569,0,0.636364,BA 4553a,BA 4553-90
139206,0,0.666667,99036,99064
229630,0,0.692308,10425EP 71,10425EP 4697
74493,0,0.636364,BA 4553a,BA 4553-90
163893,0,0.7,BA 4553,BA 4553-90
133677,0,0.692308,10425EP 71,10425EP 4697
216937,0,0.636364,BA 4553a,BA 4553-90


0.6 < musicid_delta < 0.7


### part

In [34]:
part_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'part', part_algorithm)

In [35]:
dpf.show_samples_interval(df_feature_base, 'part', 0.6, 0.7, 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
199187,0,0.625,bd. 57,nr. 7633
183069,0,0.663768,"vol. 26, 2000, 26, 2000","bl. 23, 23,1912"
221814,0,0.69697,"3870., 3870",7
89519,0,0.60119,band 57,nr. 7633
51987,0,0.666667,bd. 19,h. 117
158029,0,0.652778,"bd. 5, 5","bl. 24, 23,1863"
167057,0,0.684211,"60/3(2015), 432-437",1
121996,0,0.680556,1,"bl. 23,1891, 23,1891, 23"
79082,0,0.69883,bd. 19,"bl. 23, 23,1905, 23"
241755,0,0.694444,2620,282(2003)


0.6 < part_delta < 0.7


In [36]:
dpf.show_samples_interval(df_feature_base, 'part', 0.8, 0.9, 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
665,1,0.857143,bd. 8008,"bd. 8008, 8008"
79389,0,0.851852,33001/680,33001
127741,0,0.866667,"bd. 57, 57",bd. 57
664,1,0.844444,bd. 8008,"bd. 8008., 8008"
71371,0,0.822222,bd. 63,bd. 4
59,1,0.833333,bd 57,"bd. 57, 57"
84,1,0.849206,band 57,bd. 57
1319,1,0.857143,n. 1,"n. 1, 1"
214115,0,0.800866,"no 912, 912",no. 912
147447,0,0.861111,"nr. 2620, 2620","nr. 2620, bd. 5, 2620, 5"


0.8 < part_delta < 0.9


### person_100

In [37]:
person_100_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_100', person_100_algorithm)

In [38]:
dpf.show_samples_interval(df_feature_base, 'person_100', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_100', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
218465,0,0.0,,mozartwolfgang amadeus
133377,0,0.0,schusterhans-peter,
147486,0,0.0,mozartwolfgang amadeus,
88188,0,0.0,,mortzfeldpeter
43223,0,0.0,mozartwolfgang amadeus1756-1791(de-588)118584596,
118555,0,0.0,,mortzfeldpeter
109234,0,0.0,mozartwolfgang amadeus1756-1791(de-588)118584596,
141539,0,0.0,mozartwolfgang amadeus,
205749,0,0.0,,mozartwolfgang amadeus1756-1791(de-588)118584596
155432,0,0.0,mozartwolfgang amadeus1756-1791(de-588)118584596,


0.0 < person_100_delta < 0.1


Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
52626,0,1.0,,
127468,0,0.903226,mozartwolfgang amadeus,mozartwolfgang amadeus1756-1791
104410,0,1.0,mozartwolfgang amadeus1756-1791(de-588)118584596,mozartwolfgang amadeus1756-1791(de-588)118584596
114663,0,1.0,,
183392,0,1.0,,
135890,0,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus
181470,0,1.0,,
247029,0,1.0,,
176704,0,1.0,,
9581,0,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus


0.9 < person_100_delta < 1.0


### person_700

In [39]:
person_700_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_700', person_700_algorithm)

In [40]:
dpf.show_samples_interval(df_feature_base, 'person_700', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_700', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_700_delta,person_700_x,person_700_y
135309,0,0.0,,"schlöndorffvolker, frischmax"
219123,0,0.0,,"lauxgerd1948-(de-588)122100123, deisterarno195..."
195346,0,0.0,,zieglerleonhard1782-1854(de-588)1069520616früh...
20128,0,0.0,,schikanederemanuel1751-1812(de-588)11860757x
68032,0,0.0,,dufourguillaume henri1787-1875(de-588)118527959
126695,0,0.0,"soldankurt, mozartwolfgang amadeusdie zauberfl...",
45272,0,0.0,,raabepaul
221019,0,0.0,,"douetdanièle, taupeaubéatrice"
139169,0,0.0,"jacquetluc1967-, simonemilie, fesslermichel, d...",
232430,0,0.0,,raabepaul


0.0 < person_700_delta < 0.1


Unnamed: 0,duplicates,person_700_delta,person_700_x,person_700_y
230320,0,1.0,,
17322,0,1.0,,
87656,0,1.0,,
199734,0,1.0,,
57937,0,1.0,,
70079,0,1.0,,
63443,0,1.0,,
15603,0,1.0,,
17317,0,1.0,,
43851,0,1.0,,


0.9 < person_700_delta < 1.0


### person_245c

In [41]:
person_245c_algorithm = tedi.Jaro()

In [42]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_245c', person_245c_algorithm)

In [43]:
dpf.show_samples_interval(df_feature_base, 'person_245c', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'person_245c', 0.9, 1.0, 10)

Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
74642,0,0.0,w. a. mozart ; hrsg. von kurt soldan ; [text v...,
2202,0,0.0,,g. h. dufour direxit ; h. müllhaupt sculpsit
2107,0,0.0,,[sigrid kessler ... et al.] ; [hrsg.: interkan...
171217,0,0.0,sous la dir. de diego venturino = [les oeuvres...,
243625,0,0.0,,g.h. dufour direxit ; h. müllhaupt sculpsit
110792,0,0.0,"hans-peter schuster, hans-joachim trappe",
26046,0,0.0,max frisch,
244212,0,0.0,,jane austen
105125,0,0.0,[regie:] volker schlöndorff,
232419,0,0.0,,"mortzfeld, peter; raabe, pa"


0.0 < person_245c_delta < 0.1


Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
65443,0,1.0,jane austen,jane austen
63574,0,1.0,,
1167,1,1.0,beatrice käser,beatrice käser
30228,0,1.0,,
48748,0,1.0,,
121226,0,1.0,sigrid kessler... [et al.] ; [éd.:] interkanto...,sigrid kessler... [et al.] ; [éd.:] interkanto...
121215,0,1.0,sigrid kessler... [et al.] ; [éd.:] interkanto...,sigrid kessler... [et al.] ; [éd.:] interkanto...
232543,0,1.0,,
63227,0,1.0,,
4706,0,1.0,,


0.9 < person_245c_delta < 1.0


### pubinit

In [44]:
pubinit_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'pubinit', pubinit_algorithm)

In [45]:
dpf.show_samples_interval(df_feature_base, 'pubinit', 0.6, 0.7, 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
119097,0,0.690476,staatlicher lehrmittelverlag,hachette livre
89356,0,0.655556,de fer,de gruyter saur
39026,0,0.617821,staatlicher lehrmittelverlag,schulverl. blmv
111762,0,0.626984,reclam,staatlicher lehrmittelverlag
88432,0,0.600631,universitätsverlag,staatlicher lehrmittelverlag
123625,0,0.628571,staatlicher lehrmittelverlag,berner lehrmittel- und medienverlag
143271,0,0.660714,interkantonale lehrmittelzentrale : staatliche...,staatlicher lehrmittelverlag
227681,0,0.600631,staatlicher lehrmittelverlag,universitätsverlag
47920,0,0.662698,"interkantonale lehrmittelzentrale, staatlicher...",staatlicher lehrmittelverlag
114086,0,0.60107,bärenreiter,berner lehrmittel- und medienverl.


0.6 < pubinit_delta < 0.7


In [46]:
dpf.show_samples_interval(df_feature_base, 'pubinit', 0.8, 0.9, 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
10737,0,0.843137,"universitätsverlag, klett + balmer",universitätsverlag
710,1,0.833333,guilde du livre,la guilde du livre
182876,0,0.888889,thieme,g. thieme
21,1,0.842593,"deutsche grammophon, universal music",deutsche grammophon
1,1,0.848485,reclam jun.,reclam
109235,0,0.809524,reclam,p. reclam jun.
143217,0,0.838542,interkantonale lehrmittelzentrale : staatliche...,interkantonale lehrmittelzentrale
49161,0,0.84127,"interkantonale lehrmittelzentrale, staatlicher...",interkantonale lehrmittelzentrale
143910,0,0.810164,interkantonale lehrmittelzentrale : staatliche...,"interkantonale lehrmittelzentral, berner lehrm..."
50663,0,0.820513,terzio-verlag,terzio


0.8 < pubinit_delta < 0.9


### scale

In [47]:
scale_algorithm = tedi.Jaro()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'scale', scale_algorithm)

In [48]:
dpf.show_samples_interval(df_feature_base, 'scale', 0.6, 0.7, 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
227164,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227514,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227493,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227168,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227171,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227487,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227170,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227494,0,0.681957,Scala 1:50.000 ; proiezione cilindrica ad asse...,50000
227169,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000
227511,0,0.626402,Scala 1:50.000 ; proiezione cilindrica ad asse...,100000


0.6 < scale_delta < 0.7


In [49]:
dpf.show_samples_interval(df_feature_base, 'scale', 0.8, 0.9, 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
54068,0,0.822222,50000,100000
68382,0,0.822222,50000,100000
54054,0,0.822222,50000,100000
54078,0,0.822222,50000,100000
182191,0,0.822222,50000,100000
182199,0,0.822222,50000,100000
196513,0,0.822222,100000,50000
182201,0,0.822222,50000,100000
95199,0,0.822222,50000,100000
181877,0,0.822222,50000,100000


0.8 < scale_delta < 0.9


### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split into two new attributes, $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$, which are both compared with the same similarity metric.

In [50]:
ttlfull_algorithm = tedi.Jaccard()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_245', ttlfull_algorithm)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_246', ttlfull_algorithm)

In [51]:
df_feature_base.columns

Index(['duplicates', 'century_x', 'century_y', 'coordinate_E_x',
       'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_110_x',
       'corporate_110_y', 'corporate_710_x', 'corporate_710_y', 'doi_x',
       'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y',
       'format_prefix_x', 'format_prefix_y', 'format_postfix_x',
       'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x',
       'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y',
       'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y',
       'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x',
       'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x',
       'volumes_y', 'century_delta', 'corporate_110_delta',
       'corporate_710_delta', 'coordinate_E_delta', 'coordinate_N_delta',
       'edition_delta', 'format_prefix_delta', 'format_postfix_delta',
       'isbn_delta', 'musicid_delta', 'part_delta', 'person_100_delta',


In [52]:
dpf.show_samples_interval(df_feature_base, 'ttlfull_245', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'ttlfull_245', 0.9, 1.0, 10)

Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
61948,0,0.05618,liberation,katalog der graphischen porträts in der herzog...
68317,0,0.076923,domodossola,blick in die welt
31464,0,0.016502,homo faber,health informatics - personal health device co...
57828,0,0.03125,"die zauberflöte, [daraus:] aria ""in diesen hei...",emma
76304,0,0.095238,ekg-kurs für isabel,emma
125224,0,0.032,"bildungsforschung und bildungspraxis, educatio...",emma
11867,0,0.024845,emma,katalog der graphischen porträts in der herzog...
180111,0,0.054054,"ekg-kurs für isabel, [mit ekg-lineal und onlin...",emma
121776,0,0.091837,"bonne chance!, cours de langue française, troi...","emma, roman"
10118,0,0.059072,die zauberflöte,katalog der graphischen porträts in der herzog...


0.0 < ttlfull_245_delta < 0.1


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
1277,1,1.0,gewaltfreie kommunikation,gewaltfreie kommunikation
111836,0,1.0,emma,emma
1436,1,1.0,traité sur la tolérance,traité sur la tolérance
108,1,1.0,die zauberflöte,die zauberflöte
1173,1,0.984848,sozialleistungsbetrug - sozialversicherungsbet...,"sozialleistungsbetrug, sozialversicherungsbetr..."
595,1,1.0,die reise der pinguine,die reise der pinguine
1419,1,1.0,informatique de santé - communication entre di...,health informatics - personal health device co...
172580,0,1.0,gewaltfreie kommunikation,gewaltfreie kommunikation
768,1,1.0,"die zauberflöte, grosse oper in 2 akten","die zauberflöte, grosse oper in 2 akten"
670,1,1.0,"die zauberflöte, textbuch","die zauberflöte, textbuch"


0.9 < ttlfull_245_delta < 1.0


In [53]:
dpf.show_samples_interval(df_feature_base, 'ttlfull_246', 0.0, 0.1, 10)
dpf.show_samples_interval(df_feature_base, 'ttlfull_246', 0.9, 1.0, 10)

Unnamed: 0,duplicates,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y
183741,0,0.0,"[domodossola, arona]",
166976,0,0.0,,"domodossola, arona"
232363,0,0.0,medizinische informatik - kommunikation von ge...,
172120,0,0.0,,medizinische informatik - kommunikation von ge...
20771,0,0.0,,medizinsche informatik - kommunikation von ger...
92197,0,0.0,,"[domodossola, arona]"
201370,0,0.0,medizinische informatik - kommunikation von ge...,
228358,0,0.0,,"domodossola, arona"
232073,0,0.0,medizinische informatik - kommunikation von ge...,
67431,0,0.0,"die zauberflöte, ausgabe für gesang und klavier",


0.0 < ttlfull_246_delta < 0.1


Unnamed: 0,duplicates,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y
195068,0,1.0,,
127152,0,1.0,,
161992,0,1.0,,
214836,0,1.0,,
153956,0,1.0,,
200269,0,1.0,,
103082,0,1.0,,
86083,0,1.0,,
99744,0,1.0,,
185020,0,1.0,,


0.9 < ttlfull_246_delta < 1.0
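
To make the title scores easier to interpret, here is a minimal pure-Python sketch of a Jaccard similarity on character multisets. It assumes that `tedi.Jaccard()` with default settings compares single characters and counts repeated characters. The function name `jaccard_similarity` is hypothetical, and scoring two empty strings as 1.0 is an assumption that mirrors the sample output above.

```python
from collections import Counter


def jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity on character multisets (illustrative sketch)."""
    c1, c2 = Counter(s1), Counter(s2)
    intersection = sum((c1 & c2).values())  # per character: minimum count
    union = sum((c1 | c2).values())         # per character: maximum count
    if union == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return intersection / union


# 'emma' shares 4 characters with 'emma, roman'; the union holds 11.
print(round(jaccard_similarity("emma, roman", "emma"), 6))
```

The pair `'emma, roman'` and `'emma'` yields 4/11, which agrees with the $\texttt{ttlfull_245_delta}$ value of 0.363636 listed for this pair in the `df_feature_base.head(20)` output of section volumes. Because only character counts enter the score, character order is ignored, so titles that are anagrams of each other would also score 1.0.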


### volumes

In [54]:
volumes_algorithm = tedi.Jaccard()
# Alternative: volumes_algorithm = tedi.MongeElkan()

In [55]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'volumes', volumes_algorithm)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head(20)

Unnamed: 0,duplicates,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,ismn_x,ismn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y,century_delta,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,edition_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
0,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.818905,0.848485,1.0,0.363636,1.0,1.0
2,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.759259,0.0,0.69774,0.848485,1.0,1.0,1.0,1.0
3,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.818905,0.848485,1.0,0.363636,1.0,1.0
4,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane1775-1817(de-588)118505173,austenjane,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen,reclam,reclam,,,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.759259,1.0,0.702265,1.0,1.0,0.363636,1.0,1.0
6,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.759259,0.0,0.69774,0.848485,1.0,1.0,1.0,1.0
7,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane,austenjane1775-1817(de-588)118505173,,,jane austen,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.759259,1.0,0.702265,1.0,1.0,0.363636,1.0,1.0
8,1,2009,2009,,,,,,,,,,,,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],[],[],,,20008.0,20008.0,austenjane,austenjane,,,jane austen,jane austen,reclam,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1,2000,2000,,,,,,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",,,,,2000uuuu,2000uuuu,vm,vm,10300,10300,[],[],[],[],,,,,levinejamesdir.,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...","mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,w. a. mozart ; libretto: emanuel schikaneder ;...,deutsche grammophon,deutsche grammophon,,,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen",,,"1 dvd-video, dvd region 0, 169 min., farb.","1 dvd-video, dvd region 0, 169 min., farb.",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [56]:
columns_metadata_dict

{'data_analysis_columns': ['century',
  'coordinate_E',
  'coordinate_N',
  'corporate_110',
  'corporate_710',
  'doi',
  'edition',
  'exactDate',
  'format_prefix',
  'format_postfix',
  'isbn',
  'ismn',
  'musicid',
  'part',
  'person_100',
  'person_700',
  'person_245c',
  'pubinit',
  'scale',
  'ttlfull_245',
  'ttlfull_246',
  'volumes'],
 'columns_to_use': ['duplicates',
  'century_x',
  'century_y',
  'coordinate_E_x',
  'coordinate_E_y',
  'coordinate_N_x',
  'coordinate_N_y',
  'corporate_110_x',
  'corporate_110_y',
  'corporate_710_x',
  'corporate_710_y',
  'doi_x',
  'doi_y',
  'edition_x',
  'edition_y',
  'exactDate_x',
  'exactDate_y',
  'format_prefix_x',
  'format_prefix_y',
  'format_postfix_x',
  'format_postfix_y',
  'isbn_x',
  'isbn_y',
  'ismn_x',
  'ismn_y',
  'musicid_x',
  'musicid_y',
  'part_x',
  'part_y',
  'person_100_x',
  'person_100_y',
  'person_700_x',
  'person_700_y',
  'person_245c_x',
  'person_245c_y',
  'pubinit_x',
  'pubinit_y',
  'sca

## Feature Base

The similarity metric for each attribute of the feature DataFrame has been chosen and the features have been calculated. The columns holding the original attribute values are no longer needed, so they are dropped to obtain the feature matrix for training the estimators.

In [57]:
# Drop all non-delta columns except 'duplicates'
columns_to_be_dropped = [e for e in columns_metadata_dict['columns_to_use']
                         if e != 'duplicates']

df_feature_base.drop(columns=columns_to_be_dropped, inplace=True)
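
As a cross-check on the metadata-driven drop above, the same set of columns can be derived from the naming convention alone. The sketch below is not part of the notebook. The helper `non_delta_columns` is hypothetical, and it assumes that every feature column ends in `_delta` and that `duplicates` is the only other column to keep.

```python
def non_delta_columns(columns, keep=("duplicates",)):
    """Return the names that are neither kept columns nor '_delta' features."""
    return [c for c in columns if c not in keep and not c.endswith("_delta")]


# Shortened example column list following the notebook's naming scheme.
example_columns = ["duplicates", "century_x", "century_y", "century_delta",
                   "volumes_x", "volumes_y", "volumes_delta"]

to_drop = non_delta_columns(example_columns)
remaining = [c for c in example_columns if c not in to_drop]

print(to_drop)
print(remaining)
```

Applied to the column names as they were before the drop, the result could be compared with `columns_to_be_dropped` to confirm that the metadata list covers every original attribute column.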

In [58]:
# Show random samples of each class (duplicates == 0, then duplicates == 1)
for i in range(2):
    display(df_feature_base[df_feature_base.duplicates == i].sample(n=20))

Unnamed: 0,duplicates,century_delta,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,edition_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
212572,0,0.75,1.0,0.0,1.0,1.0,0.0,1.0,0.428571,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.021318,0.0,0.083333
100284,0,0.0,1.0,1.0,1.0,1.0,0.272727,0.0,0.428571,0.0,0.0,1.0,0.0,0.37304,0.691142,0.0,1.0,0.373134,1.0,0.166667
99899,0,0.5,1.0,1.0,1.0,1.0,1.0,1.0,0.111111,1.0,1.0,0.437037,0.0,0.0,0.54686,0.0,1.0,0.333333,1.0,0.0
74208,0,0.5,1.0,1.0,1.0,1.0,0.0,0.0,0.428571,1.0,0.0,1.0,0.0,0.0,0.490931,0.0,1.0,0.542553,1.0,0.111111
161275,0,0.25,1.0,1.0,1.0,1.0,0.0,1.0,0.428571,0.0,1.0,0.0,0.438889,0.0,0.411111,1.0,1.0,0.285714,1.0,0.0
248576,0,0.5,1.0,1.0,1.0,1.0,1.0,1.0,0.428571,1.0,0.0,1.0,0.0,0.0,0.0,0.405303,1.0,0.108108,1.0,0.2
74615,0,0.5,1.0,1.0,1.0,1.0,0.0,0.0,0.428571,1.0,0.0,0.0,0.679894,0.0,0.501391,1.0,1.0,0.054054,1.0,0.21875
223459,0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.428571,0.0,1.0,0.583333,0.594048,0.404762,0.495492,0.460819,1.0,0.265193,1.0,0.16129
248440,0,0.0,1.0,1.0,1.0,1.0,0.022989,0.0,0.428571,0.0,1.0,0.0,0.487824,0.603426,0.513568,0.0,1.0,0.350877,1.0,0.125
35755,0,0.5,1.0,1.0,1.0,1.0,0.0,0.0,0.111111,0.0,0.0,1.0,0.384049,0.0,0.509122,0.0,1.0,0.153846,1.0,0.076923


Unnamed: 0,duplicates,century_delta,corporate_110_delta,corporate_710_delta,coordinate_E_delta,coordinate_N_delta,edition_delta,format_prefix_delta,format_postfix_delta,isbn_delta,musicid_delta,part_delta,person_100_delta,person_700_delta,person_245c_delta,pubinit_delta,scale_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
222,1,1.0,1.0,0.83,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.762602,1.0,1.0,0.157895,0.454545,1.0
1161,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.428571,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.6
337,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.726776,0.0,1.0,1.0,1.0,1.0
710,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.521111,0.0,0.0,0.752688,0.833333,1.0,1.0,1.0,0.714286
1444,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1092,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
535,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1184,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1331,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.914894,0.0,0.739288,0.0,1.0,0.980198,1.0,0.27027
574,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.826667,0.0,1.0,1.0,1.0,1.0


## Feature Matrix and Target Vector Handover

To hand over the result of this chapter, the DataFrame is saved to a pickle file, which the following chapters [Decision Tree Model](./5_DecisionTreeModel.ipynb), ... read as their input file.

In [59]:
import pickle as pk

# Binary intermediary file
with open(os.path.join(path_goldstandard,
                       'labelled_feature_matrix.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)