In [1]:
factor = 1.0
exactDate_mode = 'xor'

# Feature Matrix Generation

In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), Swissbib's goldstandard data has been processed into rows that each hold a pair of records, either a pair of duplicates or a pair of unique records. These rows are the starting point for the final feature matrix generation, which is why the DataFrame was called feature base. As described in [[JudACaps](./A_References.ipynb#judacaps)], the next step is an attribute-wise comparison of each attribute pair in each row of the feature base. This comparison generates a similarity value for each attribute pair and thereby halves the number of attributes, replacing each attribute pair with one value that expresses its degree of similarity. The goal of this chapter is a DataFrame with the full and final feature attributes. The values of these feature attributes will be used for training and performance testing of the machine learning models in the chapters to follow.

This chapter introduces similarity metrics for string comparisons. For each attribute pair of the DataFrame built in the previous chapters, it decides which metric is used to calculate the similarity.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
    - [Mathematical Definitions](#Mathematical-Definitions)
    - [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [Table of Contents of Attribute Similarities](#Table-of-Contents-of-Attribute-Similarities)
- [DataFrame with Attributes and Similarity Features](#DataFrame-with-Attributes-and-Similarity-Features)
- [Summary](#Summary)
    - [Full Feature Matrix with Target Vector Handover](#Full-Feature-Matrix-with-Target-Vector-Handover)

## Data Takeover

Swissbib's raw data of the goldstandard has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, this data is loaded for further processing to the feature matrix and target vector for the subsequent machine learning model chapters.

In [2]:
import os
import pandas as pd
import pickle as pk
import bz2
import _pickle as cPickle

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File((os.path.join(
    path_goldstandard, 'feature_base_df.pkl')), 'rb') as file:
    df_feature_base = cPickle.load(file)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,035liste_x,035liste_y,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,corporate_full_x,corporate_full_y,decade_x,decade_y,docid_x,docid_y,doi_x,doi_y,duplicates,edition_x,edition_y,exactDate_x,exactDate_y,format_postfix_x,format_postfix_y,format_prefix_x,format_prefix_y,isbn_x,isbn_y,ismn_x,ismn_y,masters_docid,musicid_x,musicid_y,pages_x,pages_y,part_x,part_y,person_100_x,person_100_y,person_245c_x,person_245c_y,person_700_x,person_700_y,pubinit_x,pubinit_y,pubword_x,pubword_y,pubyear_x,pubyear_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,ttlpart_x,ttlpart_y,volumes_x,volumes_y
0,"[(OCoLC)884555343, (ABN)000305947]","[(OCoLC)884555343, (ABN)000305947]",2005,2005,,,,,[],[],,,,,,,2005,2005,00350560X,00350560X,,,1,,,2005uuuu,2005uuuu,10300,10300,vm,vm,[],[],,,11112,2785408,2785408.0,[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (ca. 122 Min.)],,,,,regie: alexader payne ; drehbuch: alexander pa...,regie: alexader payne ; drehbuch: alexander pa...,"giamattipaul, haden churchthomas, madsenvirgin...","giamattipaul, haden churchthomas, madsenvirgin...",,,[],[],2005,2005,,,sideways,sideways,,,{'245': ['Sideways']},{'245': ['Sideways']},1 122,1 122
1,"[(OCoLC)884555343, (ABN)000305947]","[(OCoLC)887993295, (SGBN)000595500]",2005,2005,,,,,[],[],,,,,,,2005,2005,00350560X,050352490,,,1,,,2005uuuu,2005uuuu,10300,10300,vm,vm,[],[],,,11112,2785408,2.0,[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video],,,,,regie: alexader payne ; drehbuch: alexander pa...,,"giamattipaul, haden churchthomas, madsenvirgin...",paynealexander,,twentieth century,[],[Twentieth Century],2005,2005,,,sideways,"sideways, eine geschichte über das leben, die ...",,,{'245': ['Sideways']},"{'245': ['Sideways', 'eine Geschichte über das...",1 122,1
2,"[(OCoLC)884555343, (ABN)000305947]","[(OCoLC)884555343, (IDSBB)003662541]",2005,2005,,,,,[],[],,,,,,,2005,2005,00350560X,11983393X,,,1,,,2005uuuu,2005uuuu,10300,10300,vm,vm,[],[],,,11112,2785408,2785408.0,[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (ca. 122 Min.)],,,,,regie: alexader payne ; drehbuch: alexander pa...,regie: alexader payne ; drehbuch: alexander pa...,"giamattipaul, haden churchthomas, madsenvirgin...","paynealexander, taylorjim, pickettrex, giamatt...",,,[],[],2005,2005,,,sideways,sideways,,,{'245': ['Sideways']},{'245': ['Sideways']},1 122,1 122
3,"[(OCoLC)884555343, (ABN)000305947]","[(OCoLC)884555343, (NEBIS)005002232]",2005,2005,,,,,[],[],,,,,,,2005,2005,00350560X,161731651,,,1,,,2005uuuu,20052004,10300,10300,vm,vm,[],[],,,11112,2785408,,[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (ca. 122 Min.)],,,,,regie: alexader payne ; drehbuch: alexander pa...,regie: alexander payne,"giamattipaul, haden churchthomas, madsenvirgin...","paynealexander, pickettrex, taylorjim, giamatt...",,twentieth century fox home entertainment,[],[Twentieth Century Fox Home Entertainment],2005,20052004,,,sideways,sideways,,,{'245': ['Sideways']},{'245': ['Sideways']},1 122,1 122
4,"[(OCoLC)884555343, (ABN)000305947]","[(OCoLC)611340779, (IDSSG)000762076]",2005,2005,,,,,[],[],,,,,,,2005,2005,00350560X,340501588,,,1,,,2005uuuu,2005uuuu,10300,10300,vm,vm,[],[],,,11112,2785408,,[1 DVD-Video (ca. 122 Min.)],[1 DVD-Video (122 Min.) Ländercode 2],,,,,regie: alexader payne ; drehbuch: alexander pa...,regie: alexander payne,"giamattipaul, haden churchthomas, madsenvirgin...",paynealexander,,,[],[],2005,2005,,,sideways,"sideways, eine geschichte über das leben, die ...",,,{'245': ['Sideways']},"{'245': ['Sideways', 'eine Geschichte über das...",1 122,1 122 2


Now, the feature base can be reduced to the columns used for processing.

In [3]:
# The DataFrame of pairs with target information
df_feature_base = df_feature_base[columns_metadata_dict['columns_to_use']]

df_feature_base.sample(n=5)

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_full_x,corporate_full_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,ismn_x,ismn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
46567,0,,,,,,,10.5167/uzh-61279,,,,2012uuuu,1983uuuu,bk,bk,10053,20000,[],[],,,,,121 159,2,setherolf,,"weberrolf h., islerpeter r.",,[rolf sethe],,,,,,rechtspolitische überlegungen zur haftung der ...,tabellen,,,,130
8970,0,,,,,universidad de salamanca,kanton bernregierungsrat,,,,,uuuuuuuu,1871uuuu,cr,bk,30300,20000,[],[],,,,,,,,,,weberjohann,[universidad de salamanca],,,,,,"acta salmanticensia, textos medievales",verordnung zum schutz der waldungen gegen inse...,,,,1
1583,1,e0063600,,n0465700,,société des sentiers des gorges de l'areuse (n...,société des sentiers des gorges de l'areuse,,,,,1902uuuu,1902uuuu,mp,mp,10300,10300,[],[],,,,,,,,,"borelmaurice, duboisauguste","borelmaurice, duboisauguste",par m[auri]ce borel et aug[uste] dubois ; édit...,m[auri]ce borel et aug[uste] dubois ; ed. par ...,f. gendre,,15000.0,15 000,carte des gorges de l'areuse,carte des gorges de l'areuse,,,1.0,1 1
101471,0,,,,,,,,,,,1876uuuu,20182018,bk,bk,20000,20000,[],[978-88-536-2483-3],,,,,,2 4,beigelhermann,austenjane,,"sardisilvana, maconealberto",herm. beigel,jane austen ; adaption and activities by silva...,[s.n.],,,,atlas der frauenkrankheiten,emma,,,1.0,111
68674,0,,,,,,,,,,,1955uuuu,1984uuuu,bk,bk,20000,20000,[],[3-518-01099-9],,,,,14 14,,bataillegeorges,nerudapablo,hemmerichkarl georg,,[von georges bataille] ; [übers. von karl geor...,pablo neruda,,,,,"manet, [biographisch-kritische studie]",gedichte,,,135.0,260


In [4]:
print('Number of rows labelled as duplicates {:,d}'.format(len(df_feature_base[
    df_feature_base.duplicates==1])))
print('Number of rows labelled as uniques {:,d}'.format(len(df_feature_base[
    df_feature_base.duplicates==0])))
print('Total number of rows in DataFrame {:,d}'.format(df_feature_base.shape[0]))

Number of rows labelled as duplicates 2,783
Number of rows labelled as uniques 115,005
Total number of rows in DataFrame 117,788


In [5]:
print('Part of duplicates (1) and uniques (0) in units of [%]')
print(round(100*df_feature_base.duplicates.value_counts(normalize=True), 2))

Part of duplicates (1) and uniques (0) in units of [%]
0    97.64
1     2.36
Name: duplicates, dtype: float64


DataFrame feature base is the starting point used for the further processing in this chapter.

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms, distance and similarity. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

### Mathematical Definitions

The attributes to be used for pair comparison may contain values of poor quality. Poor quality originates in the way the data has been entered at the source. Manual data entry may suffer from mistyping, while automatically scanned data may suffer from deficiencies of the scanned base material or of the recognition algorithm in the optical character recognition (OCR) processing. The basic step of a deduplication process is to estimate how likely the two strings of a pair are to represent duplicates. This is done by calculating a similarity value between the two strings, rather than using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the pair is a pair of duplicates.

The term similarity is strongly coupled to the term of distance of two values of an attribute. Mathematically, a distance can be explained with the help of a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangular inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.
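The four requirements can be checked numerically on a handful of strings. The following is a minimal sketch, not part of the original processing: it uses a plain Levenshtein edit distance written out by hand as the example distance function, and the helper name `satisfies_metric_axioms` as well as the sample strings are illustrative assumptions.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def satisfies_metric_axioms(dist, objects) -> bool:
    """Hypothetical helper: test the four requirements on all triples of objects."""
    for o_i, o_j, o_k in product(objects, repeat=3):
        if dist(o_i, o_i) != 0:                               # identity
            return False
        if dist(o_i, o_j) < 0:                                # non-negativity
            return False
        if dist(o_i, o_j) != dist(o_j, o_i):                  # symmetry
            return False
        if dist(o_i, o_j) > dist(o_i, o_k) + dist(o_k, o_j):  # triangular inequality
            return False
    return True


samples = ['sideways', 'sideway', 'emma', 'gedichte', '']
print(satisfies_metric_axioms(levenshtein, samples))
```

Such a check on a finite sample can only falsify a candidate function. It does not prove that the requirements hold for all possible inputs.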

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$, calculating $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, assuming the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attributes, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends on the type of data that are compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends on the type of data that are compared.

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
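As a small worked example of the ratio $\frac{m}{p}$, the sketch below counts matching attributes of two objects represented as dictionaries. The function name `matching_similarity` and the two sample records are illustrative assumptions and are not taken from the goldstandard data.

```python
def matching_similarity(o_i: dict, o_j: dict) -> float:
    """Return m / p, the share of attributes with identical values.

    Assumes both objects are described by the same p attributes.
    """
    p = len(o_i)
    if p == 0:
        raise ValueError('objects need at least one attribute')
    m = sum(1 for name in o_i if name in o_j and o_i[name] == o_j[name])
    return m / p


# Two made-up records with p = 4 attributes, of which m = 3 match.
record_a = {'format_prefix': 'bk', 'exactDate': '1955uuuu',
            'volumes': '135', 'ttlfull_245': 'gedichte'}
record_b = {'format_prefix': 'bk', 'exactDate': '1984uuuu',
            'volumes': '135', 'ttlfull_245': 'gedichte'}

similarity = matching_similarity(record_a, record_b)
dissimilarity = 1 - similarity
print(similarity, dissimilarity)
```

For these two records, $m = 3$ and $p = 4$, so the similarity is $\frac{3}{4}$ and the dissimilarity is $\frac{1}{4}$.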

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although there is a correspondence between a similarity function and the mathematical concept of a distance function, not all known and implemented similarity comparison functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangular inequality. The decision on the best similarity function for a string pair will be based on the effect the similarity function has for the purpose at hand. In the case of this capstone project, this purpose is its capability to contribute to the prediction of whether a pair of records is a pair of duplicates or a pair of uniques.

### Library TextDistance

An internet search on string distance calculation with Python has revealed the libraries [[StSi](./A_References.ipynb#stsi)] and [[TeDi](./A_References.ipynb#tedi)] as well as separate code snippets for distinct algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the text distance library [[TeDi](./A_References.ipynb#tedi)] has been chosen for this capstone project. The decision is based on the GitHub statistics of stars and the date of the latest pull requests, which indicate the popularity and maintenance activity of the library. A look at the API of the library shows it to be a complete implementation (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and easy to use.

In [6]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


For using the library, see documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, function $\texttt{.normalized}\_\texttt{similarity}()$ of an instantiated textdistance object will be used.

In [7]:
import textdistance as tedi

With the code line above, the library is imported for application in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb) the effects of the similarity metrics of the library are compared for better understanding of their specific behaviour. This comparison for each attribute is the basis of deciding the best similarity metric available for an attribute pair.

## Similarity Metrics on Attribute Level

This section implements the decision for calculating the similarity metric for each attribute of the raw data based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The implementation is applied on a pair of attributes of two records, resulting in a new attribute, the similarity value, of the final feature matrix. A general function $\texttt{.build}\_\texttt{delta}\_\texttt{feature}()$ is provided by the code file [data_preparation_funcs.py](./data_preparation_funcs.py) for transforming two attributes into their feature attribute holding their similarity value.

In [8]:
import data_preparation_funcs as dpf

The dictionary and the list of the following code cell will be filled by function $\texttt{.build}\_\texttt{delta}\_\texttt{feature}()$.

In [9]:
columns_metadata_dict['similarity_metrics'] = {}
columns_metadata_dict['features'] = []

### Table of Contents of Attribute Similarities

- [coordinate](#coordinate)
- [corporate](#corporate)
- [doi](#doi)
- [edition](#edition)
- [exactDate](#exactDate)
- [format](#format)
- [isbn](#isbn)
- [ismn](#ismn)
- [musicid](#musicid)
- [part](#part)
- [person](#person)
- [pubinit](#pubinit)
- [scale](#scale)
- [ttlfull](#ttlfull)
- [volumes](#volumes)

### coordinate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{coordinate}$ holds coordinates of maps. To decide whether two maps cover the same geographical range, a metric will be chosen that compares the digits of the coordinate number from left to right. The more digits are found to be equal, the higher the calculated similarity value. The comparison stops with the first digit pair that differs. For these coordinate strings, the LCS (longest common substring) metric produces the wanted result, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb).

In [10]:
attribute = 'coordinate'

columns_metadata_dict['similarity_metrics'][attribute+'_E'] = tedi.LCSStr()
columns_metadata_dict['similarity_metrics'][attribute+'_N'] = tedi.LCSStr()

ne_values = ['_E', '_N']

for ne in ne_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+ne,
        columns_metadata_dict['similarity_metrics'][attribute+ne],
        columns_metadata_dict)

The length of attribute $\texttt{coordinate}$ is exactly eight characters, a leading letter followed by seven digits. The distinct similarity values that may occur form a discrete set of values with a distance of $\frac{1}{8}$ between adjacent values.
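The steps of $\frac{1}{8}$ can be reproduced with a hand-written longest common substring comparison. The sketch below is an illustration only and not the library code: it assumes that the similarity is the length of the longest common substring divided by the length of the longer string, and it covers non-empty strings only. The function names are hypothetical.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            # Extend the common run ending at the previous characters, or reset.
            length = previous[j - 1] + 1 if char_a == char_b else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation: common substring length over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        raise ValueError('this sketch covers non-empty strings only')
    return longest_common_substring_length(a, b) / longest


print(lcs_similarity('e0064617', 'e0064613'))  # 7 of 8 characters in common
print(lcs_similarity('e0102620', 'e0102615'))  # 6 of 8 characters in common
```

With eight characters per value, the numerator can only take the integer values 0 to 8, which explains the discrete set of similarity values.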

In [11]:
uniques, uniques_len = {}, {}

for ne in ne_values :
    uniques[attribute+ne], uniques_len[attribute+ne] = dpf.determine_similarity_values(
        df_feature_base, attribute+ne)

coordinate_E values range [0.    0.125 0.25  0.375 0.5   0.625 0.75  0.875 1.   ]
coordinate_N values range [0.    0.25  0.375 0.5   0.625 0.75  0.875 1.   ]


Looking at some samples of the feature matrix reveals a good match to the expectations.

In [12]:
position = 3

for ne in ne_values :
    dpf.show_samples_interval(
        df_feature_base, attribute+ne,
        uniques[attribute+ne][uniques_len[attribute+ne]-position],
        uniques[attribute+ne][uniques_len[attribute+ne]-position+1]
    )

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
2062,1,0.875,e0064617,e0064613
532,1,0.75,e0102620,e0102615
539,1,0.75,e0102618,e0102620
56540,0,0.75,e0102620,e0102615
2063,1,0.875,e0064617,e0064613


0.75 <= coordinate_E_delta <= 0.875


Unnamed: 0,duplicates,coordinate_N_delta,coordinate_N_x,coordinate_N_y
24563,0,0.875,n0463834,n0463830
1895,1,0.75,n0473909,n0473914
532,1,0.875,n0463834,n0463830
4554,0,0.875,n0463834,n0463830
541,1,0.875,n0463830,n0463834


0.75 <= coordinate_N_delta <= 0.875


The samples above show the wanted similarity behaviour for value ranges greater than 0. The metric has the weakness, though, that empty coordinate values, e.g. for bibliographic units other than maps, have each been calculated to a similarity of 0. Some samples for duplicates in the training data are shown below.

In [13]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    attribute+'_E', uniques[attribute+'_E'][0], uniques[attribute+'_E'][1], 10)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y
1723,1,0.0,,
1466,1,0.0,,
1447,1,0.0,,
399,1,0.0,,
303,1,0.0,,
2682,1,0.0,,
1252,1,0.0,,
2477,1,0.0,,
197,1,0.0,,
474,1,0.0,,


0.0 <= coordinate_E_delta <= 0.125


This downside shall be avoided by marking pairs with missing coordinate values on both sides with a special negative value, which signals to the models to be trained the special case of missing information in a row. The implementation of this logic is done in function $\texttt{.mark}\_\texttt{missing}()$. The absolute value of this negative number is controlled by a factor which is passed to the function as a parameter. The function explicitly handles two cases. The first one is missing information in both attributes (resulting in $-1*\texttt{factor}$) and the second one is missing information in only one of the two attributes (resulting in $-0.5*\texttt{factor}$).
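The two cases can be sketched for a single value pair as follows. This is not the code of $\texttt{.mark}\_\texttt{missing}()$, which works on the DataFrame. The names `is_missing` and `mark_missing_value` are hypothetical, and the rule for what counts as missing (None, NaN, empty string or empty list) is an assumption.

```python
import math


def is_missing(value) -> bool:
    """Assumed definition of missing: None, NaN, or an empty string or list."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, (str, list)) and len(value) == 0:
        return True
    return False


def mark_missing_value(value_x, value_y, similarity: float,
                       factor: float = 1.0) -> float:
    """Replace the similarity by a negative marker if information is missing."""
    missing_count = int(is_missing(value_x)) + int(is_missing(value_y))
    if missing_count == 2:
        return -1.0 * factor   # both sides missing
    if missing_count == 1:
        return -0.5 * factor   # only one side missing
    return similarity          # both sides present, keep the similarity


print(mark_missing_value('', '', 0.0))
print(mark_missing_value('e0063600', float('nan'), 0.0))
print(mark_missing_value('e0064617', 'e0064613', 0.875))
```

Because regular similarity values lie between 0 and 1, the negative markers cannot be confused with a calculated similarity.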

In [14]:
for ne in ne_values :
    df_feature_base = dpf.mark_missing(df_feature_base, attribute+ne, factor)

### corporate

Attribute $\texttt{corporate}$ is a collection of corporate names. The Monge-Elkan metric compares string tokens pairwise [[Chri2012](./A_References.ipynb#chri2012)] while the LCS metric searches for the longest common substring. Assessing the differences of these two metrics with the help of their values distribution in chapter [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb), reveals a better distribution behaviour for LCS. Therefore, the LCS metric will be chosen for this attribute.

In [15]:
attribute = 'corporate_full'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.LCSStr()
#tedi.StrCmp95()
#tedi.MongeElkan()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [16]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

corporate_full values range [0.         0.00526316 0.00980392 0.01052632 0.01162791 0.01388889
 0.01449275 0.01470588 0.01578947 0.01587302 0.01666667 0.01785714
 0.01886792 0.01904762 0.01923077 0.01960784 0.02       0.02040816
 0.02105263 0.0212766  0.02222222 0.02272727 0.02325581 0.02380952
 0.02439024 0.02564103 0.02631579 0.02702703 0.02777778 0.02857143
 0.02898551 0.02941176 0.03       0.03030303 0.03076923 0.03125
 0.03157895 0.03174603 0.03225806 0.03278689 0.03333333 0.03448276
 0.03488372 0.03571429 0.03636364 0.03684211 0.03703704 0.03773585
 0.03809524 0.03846154 0.03921569 0.04       0.04081633 0.04166667
 0.04210526 0.04255319 0.04347826 0.04411765 0.04444444 0.04545455
 0.04615385 0.04651163 0.04761905 0.04878049 0.04901961 0.04918033
 0.05       0.05128205 0.05263158 0.05357143 0.05405405 0.05454545
 0.05555556 0.05660377 0.05714286 0.05769231 0.05789474 0.05797101
 0.05813953 0.05882353 0.06       0.06060606 0.06122449 0.06153846
 0.0625     0.06349206 0.06382979 0.0

The $110$ part of attribute $\texttt{corporate}$ is sparsely filled, and even its $710$ part is filled in only a little more than $10\%$ of the cases. The LCS metric generates a similarity of 1 for the cases where both strings of a pair are empty. Missing values on both sides may be an indicator for a pair of duplicates, but due to the sparsely available information, it is a weak indicator. Therefore, the pairs with missing data on both sides of the pair will be marked with the negative value.

In [17]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

Some sample cases are shown below for the $\texttt{corporate}$ feature.

In [18]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1],
    attribute, 0.0, 1.0, 20
)

Unnamed: 0,duplicates,corporate_full_delta,corporate_full_x,corporate_full_y
1581,1,0.781818,société des sentiers des gorges de l'areuse (n...,société des sentiers des gorges de l'areuse
2613,1,1.0,institut für geologie und paläontologie,institut für geologie und paläontologie
1081,1,1.0,gastrosuisse,gastrosuisse
825,1,1.0,schweizerische konferenz der kantonalen erzieh...,schweizerische konferenz der kantonalen erzieh...
732,1,1.0,schweizerische konferenz der kantonalen erzieh...,schweizerische konferenz der kantonalen erzieh...
1917,1,1.0,"arte deutschland tv gmbh, zweites deutsches fe...","arte deutschland tv gmbh, zweites deutsches fe..."
2173,1,1.0,kunstmuseum thun,kunstmuseum thun
1206,1,0.236842,suisseoffice fédéral de la statistique,schweizbundesamt für statistik
830,1,0.115942,conferenza svizzera dei direttori cantonali de...,schweizerische konferenz der kantonalen erzieh...
1217,1,1.0,universidad de salamanca,universidad de salamanca


0.0 <= corporate_full_delta <= 1.0


In [19]:
position = uniques_len[attribute]//2 # Let's have a look in the middle range of the similarities.

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 20)

Unnamed: 0,duplicates,corporate_full_delta,corporate_full_x,corporate_full_y
26360,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat
110900,0,0.1,schweizerische konferenz der kantonalen erzieh...,museum für hamburgische geschichte
74042,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat
110885,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat
79842,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat
56548,0,0.1,schweizbundesamt für landestopografie,universität stuttgartinstitut für geologie und...
8856,0,0.1,schweizbundesamt für statistik,spar- und leihkasse neunkirch
5422,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat
49517,0,0.1,capella savaria,kinderschutz schweiz
5418,0,0.1,schweizerische konferenz der kantonalen erzieh...,kanton bernregierungsrat


0.09722222222222221 <= corporate_full_delta <= 0.09999999999999998


### doi

Swissbib uses an explicit $\texttt{doi}$ attribute for its deduplication implementation. In chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb), the real doi identifier has been isolated with the help of a preprocessing function $\texttt{.reduce}\_\texttt{to}\_\texttt{doi}\_\texttt{element}()$, see [Data Analysis](./1_DataAnalysis.ipynb). Attribute $\texttt{doi}$ contains a single string value. The Identity metric will be used for comparing the string values of a pair in a row, calculating a similarity value of 1.0 or 0.0 for each pair. If only one of the two strings is empty, a value of 0 is returned.
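The behaviour described for the Identity metric amounts to an exact string comparison. A minimal sketch, assuming nothing beyond that description (the function name `identity_similarity` is hypothetical and this is not the library code):

```python
def identity_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical strings and 0.0 otherwise."""
    return 1.0 if a == b else 0.0


print(identity_similarity('10.1111/spsr.12118', '10.1111/spsr.12118'))
print(identity_similarity('', '10.1093/sw/36.1.86'))
print(identity_similarity('', ''))
```

Note that two empty strings are identical as well, so a pair without any doi on both sides receives the same value as a pair with matching identifiers.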

In [20]:
attribute = 'doi'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base['doi_delta'].unique()

array([1., 0.])

Some sample cases are shown below for each category of $\texttt{doi}\_\texttt{delta}$.

In [21]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

doi values range [0. 1.]


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
4356,0,1.0,,
97272,0,1.0,,
9839,0,1.0,,
62611,0,1.0,,
260,1,1.0,,
61737,0,1.0,,
58101,0,1.0,,
6716,0,1.0,,
52056,0,1.0,,
2194,1,1.0,,


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
92286,0,0.0,,10.1093/sw/36.1.86
52658,0,0.0,10.1111/spsr.12118,
32875,0,0.0,,10.5169/seals-858566
84450,0,0.0,,10.5169/seals-727036
11689,0,0.0,,10.1093/sw/36.1.86
81574,0,0.0,10.5169/seals-790639,
6092,0,0.0,,10.5167/uzh-150310
99971,0,0.0,,10.5169/seals-727036
87488,0,0.0,10.5167/uzh-191252,
41248,0,0.0,,10.5167/uzh-150310


doi_delta = 0.0


In [22]:
# Let's have a look at some non-empty doi elements
df_doi_with_element = df_feature_base[df_feature_base.doi_x.apply(lambda x : len(x) > 0)]

for doi_delta_value in df_feature_base['doi_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_doi_with_element[df_doi_with_element['doi_delta']==doi_delta_value])
    )

    dpf.show_samples_distinct(df_doi_with_element, 'doi', doi_delta_value, number_of_max_samples)
    print(f'doi_delta = {doi_delta_value}')

Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
2398,1,1.0,10.5167/uzh-147815,10.5167/uzh-147815
2747,1,1.0,10.1007/978-3-658-17985-4,10.1007/978-3-658-17985-4
2177,1,1.0,10.1111/spsr.12118,10.1111/spsr.12118
2060,1,1.0,10.1093/cid/cir669,10.1093/cid/cir669
2057,1,1.0,10.1093/cid/cir669,10.1093/cid/cir669
2315,1,1.0,10.5169/seals-790639,10.5169/seals-790639
2746,1,1.0,10.1002/9780470110461,10.1002/9780470110461
2181,1,1.0,10.1111/spsr.12118,10.1111/spsr.12118
2052,1,1.0,10.5167/uzh-53042,10.5167/uzh-53042
2178,1,1.0,10.1111/spsr.12118,10.1111/spsr.12118


doi_delta = 1.0


Unnamed: 0,duplicates,doi_delta,doi_x,doi_y
69599,0,0.0,10.5167/uzh-147815,
46694,0,0.0,10.5167/uzh-61279,
81566,0,0.0,10.5169/seals-790639,
52318,0,0.0,10.1111/spsr.12118,
87432,0,0.0,10.5167/uzh-191252,
109006,0,0.0,10.1007/978-3-658-17985-4,
109117,0,0.0,10.1007/978-3-658-17985-4,
51054,0,0.0,10.1093/cid/cir669,
49230,0,0.0,10.5167/uzh-98235,
70585,0,0.0,10.3389/fnhum.2018.00200,


doi_delta = 0.0


As can be seen above, a value of 1.0 is returned if both strings of a pair are empty. As attribute $\texttt{doi}$ is sparsely filled, see chapter [Data Analysis](./1_DataAnalysis.ipynb), a $\texttt{doi}\_\texttt{delta}$ of 1.0 would strongly suggest a pair of duplicates for most rows, although no identifier has been compared at all. To avoid such a misleading indication of identity, function $\texttt{.mark}\_\texttt{missing}()$ will be applied to the attribute.

In [23]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### edition

In its original form in Swissbib's raw data, the edition statement is a string value which may have several words. The modelling on this attribute has been tried with and without stripping letter characters from the string. The final decision for the best processing will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb). A Jaccard similarity is tried for this attribute.
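To illustrate how a Jaccard similarity behaves on short edition strings, the sketch below computes it on multisets of characters, that is, the size of the intersection divided by the size of the union of the character counts. Treating the strings character by character is an assumption of this sketch, as is the handling of empty strings. The function name is hypothetical and the code is not the library implementation.

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity on character multisets (assumed tokenisation)."""
    if a == b:
        return 1.0          # identical strings, including two empty ones
    if not a or not b:
        return 0.0          # exactly one side is empty
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # minimum counts
    union = sum((count_a | count_b).values())         # maximum counts
    return intersection / union


print(jaccard_similarity('2', '2017'))     # 1 shared character out of 4
print(jaccard_similarity('2013', '2011'))  # 3 shared characters out of 5
```

Under this assumption, the order of the characters is irrelevant, so two edition statements built from the same digits in a different order would receive a similarity of 1.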

In [24]:
attribute = 'edition'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [25]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

import numpy as np

edition_delta_uniques = np.sort(df_feature_base['edition_delta'].unique())
edition_delta_uniques_len = len(edition_delta_uniques)
print('edition values range', edition_delta_uniques[:30])

edition values range [0.         0.14285714 0.2        0.25       0.33333333 0.5
 0.6        1.        ]
edition values range [0.         0.14285714 0.2        0.25       0.33333333 0.5
 0.6        1.        ]


The comparison results in only a few distinct similarity values for the goldstandard data set. Below, some examples are shown.

In [26]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position],
    edition_delta_uniques[edition_delta_uniques_len-position+2], 10)

position = edition_delta_uniques_len//2

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
60527,0,1.0,,
84880,0,1.0,,
71341,0,1.0,,
42889,0,1.0,,
33087,0,1.0,,
55656,0,1.0,,
7049,0,1.0,,
25935,0,1.0,,
39117,0,1.0,,
110305,0,1.0,,


0.6 <= edition_delta <= 1.0


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
81776,0,0.0,,1999.0
53600,0,0.0,,2001.0
37946,0,0.0,3.0,
70244,0,0.0,,1.0
89504,0,0.0,,2001.0
37554,0,0.0,7.0,
28165,0,0.0,,2017.0
27005,0,0.0,2011.0,
61297,0,0.0,,2.0
38660,0,0.0,,4.0


0.0 <= edition_delta <= 0.19999999999999996


Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
64817,0,0.25,2,2017
115681,0,0.25,1,2017
34258,0,0.25,2003,2
103343,0,0.25,2013,1
76079,0,0.25,1997,1
77949,0,0.25,2011,1
57374,0,0.25,2011,1
78097,0,0.25,2,2001
34369,0,0.25,2003,2
114749,0,0.25,1,2001


0.19999999999999996 <= edition_delta <= 0.25


Again, for $\texttt{edition}\_\texttt{delta} = 1$, many pairs of empty values of the $\texttt{edition}$ attribute can be observed. These pairs will be marked with a special negative value in order to distinguish them from pairs of matching, non-empty values.
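
The marking step can be pictured with a small sketch. It assumes that a pair counts as missing when both sides are empty and that the marker is the negative of `factor`. Both points, the function name `mark_missing_pairs`, and the use of plain dictionaries in place of DataFrame rows are assumptions made for illustration and may differ from the actual $\texttt{.mark}\_\texttt{missing}()$ function.

```python
def mark_missing_pairs(records, attribute, factor=1.0):
    """Return copies of the records with pairs of empty values marked negative.

    Hypothetical helper: a record is a dict with the keys
    '<attribute>_x', '<attribute>_y' and '<attribute>_delta'.
    """
    marked = []
    for record in records:
        record = dict(record)  # work on a copy, leave the input untouched
        both_empty = not record[attribute + '_x'] and not record[attribute + '_y']
        if both_empty:
            # Assumed marker value: the negative of the factor.
            record[attribute + '_delta'] = -1.0 * factor
        marked.append(record)
    return marked


rows = [
    {'edition_x': '', 'edition_y': '', 'edition_delta': 1.0},
    {'edition_x': '2', 'edition_y': '2', 'edition_delta': 1.0},
]
for row in mark_missing_pairs(rows, 'edition', factor=1.0):
    print(row)
```

With such a marker, a pair of two empty values no longer shares the value 1.0 with a pair of identical, filled values.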

In [27]:
df_feature_base = dpf.mark_missing(df_feature_base, 'edition', factor)

In [28]:
position = edition_delta_uniques_len

dpf.show_samples_interval(
    df_feature_base, 'edition',
    edition_delta_uniques[edition_delta_uniques_len-position-2],
    edition_delta_uniques[edition_delta_uniques_len-position-1], 10)

Unnamed: 0,duplicates,edition_delta,edition_x,edition_y
1891,1,1.0,2011,2011
2776,1,1.0,1,1
1903,1,0.6,2013,2011
2153,1,1.0,8,8
58085,0,0.6,2013,2017
85242,0,1.0,1,1
452,1,1.0,2,2
405,1,1.0,2,2
539,1,1.0,2003,2003
57931,0,1.0,2,2


0.6 <= edition_delta <= 1.0


### exactDate

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{exactDate}$ holds a year number stored in its first four digits. The letter 'u' is used as a placeholder for an unknown digit. In addition, the last four digits of the attribute may hold month and day information or a second year.

The attribute will be kept as a string and will not be transformed to an integer. The feature attribute of the record pair to be compared will be calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity will be stored in a new attribute $\texttt{exactDate}\_\texttt{delta}$ which will be taken for the model calculation.

As can be seen in chapter [Decision Tree Model](./6_DecisionTreeModel.ipynb), this attribute is important for prediction. Different ways of increasing the weight of the unknown status of a digit have been tried. They can be seen in the implementations below. The algorithm chosen for the final simulation will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb).
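
The arithmetic of the 'xor' variant can be followed in a self-contained sketch. It assumes two strings of equal length, a Hamming similarity defined as the share of equal positions, and the divisor 16 used in this chapter's code. The function names are hypothetical and the sketch does not call the TextDistance library.

```python
import string

UNKNOWN_SHARE = 16  # divisor for the bonus per unknown position


def hamming_similarity(left, right):
    """Share of positions that hold the same character (equal lengths assumed)."""
    equal_positions = sum(1 for a, b in zip(left, right) if a == b)
    return equal_positions / len(left)


def unknown_positions(left, right):
    """Count positions where at least one side holds a letter and the sides differ."""
    letters = string.ascii_lowercase
    return sum(1 for a, b in zip(left, right)
               if (a in letters or b in letters) and a != b)


def exact_date_similarity(date_x, date_y):
    """Hamming similarity plus 1/16 for every position with an unknown digit."""
    # 'u' becomes 'a' on the left side, so letter positions never match.
    date_x = date_x.replace('u', 'a')
    bonus = unknown_positions(date_x, date_y) / UNKNOWN_SHARE
    return hamming_similarity(date_x, date_y) + bonus


print(exact_date_similarity('1967uuuu', '1967uuuu'))  # 4/8 + 4/16 = 0.75
print(exact_date_similarity('196u9999', '196u9999'))  # 7/8 + 1/16 = 0.9375
print(exact_date_similarity('20052004', '20052004'))  # 8/8 + 0    = 1.0
```

A pair with the same year but four unknown digits therefore ends up below a pair of fully known, identical strings.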

In [29]:
import string

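def no_xor (x_side, y_side) :
    """Count the positions at which at least one string holds a letter
    and the two characters differ."""
    number = 0
    for i in range(len(x_side)) :
        # A lowercase letter marks an unknown digit ('a' on the x side, 'u' on the y side).
        has_letter = (x_side[i] in string.ascii_lowercase) or (y_side[i] in string.ascii_lowercase)
        if has_letter and (x_side[i] != y_side[i]) :
            number = number + 1
    return number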

print('Example comparison results in a value of', no_xor ('202a0aaa', '1920uuuu'))

Example comparison results in a value of 5


In [30]:
attribute = 'exactDate'

# Replace letter 'u' with letter 'a' in one of the two strings.
#  As a result, a position holding a letter instead of a numerical digit
#  in either string never matches and contributes 0 to the Hamming similarity.
df_feature_base[attribute+'_x'] = df_feature_base.exactDate_x.str.replace('u', 'a')

# Compute Hamming similarity for exactDate string pair.
columns_metadata_dict['similarity_metrics'][attribute] = tedi.Hamming()

unknown_share = 16

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

if exactDate_mode == 'added_u':
    # Add 1/16 to the Hamming similarity for every letter digit, counting
    #  only the larger of the two letter counts of the string pair.
    df_feature_base[attribute+'_delta'] = df_feature_base[[
        attribute+'_x', attribute+'_y', attribute+'_delta']].apply(
        lambda x : x[attribute+'_delta'] + 
        max(x[attribute+'_x'].count('a'), x[attribute+'_y'].count('u'))/unknown_share, axis=1
    )
elif exactDate_mode == 'xor':
    # Add 1/16 to the Hamming similarity for every position at which
    #  at least one string of the pair holds a letter digit (see no_xor).
    df_feature_base[attribute+'_delta'] = df_feature_base[[
        attribute+'_x', attribute+'_y', attribute+'_delta']].apply(
        lambda x : x[attribute+'_delta'] + 
        no_xor(x[attribute+'_x'], x[attribute+'_y'])/unknown_share, axis=1
    )

In [31]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']].sample(n=10)

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
8114,2006aaaa,19992018,0.25
6046,1973aaaa,2009uuuu,0.25
31510,1973aaaa,2016uuuu,0.25
85482,2020aaaa,1955uuuu,0.25
36659,19549999,1994uuuu,0.625
29307,1967aaaa,1967uuuu,0.75
20649,2004aaaa,20022018,0.625
131,1992aaaa,1992uuuu,0.75
95986,2011aaaa,19992017,0.25
6549,1998aaaa,1983uuuu,0.5


For pairs of identical strings, the resulting similarity value equals 1.

In [32]:
df_feature_base[['exactDate_x', 'exactDate_y', 'exactDate_delta']][
    df_feature_base.exactDate_x == df_feature_base.exactDate_y
].sort_values('exactDate_delta', ascending=False).head()

Unnamed: 0,exactDate_x,exactDate_y,exactDate_delta
24,20052004,20052004,1.0
1782,19552014,19552014,1.0
1813,19831984,19831984,1.0
1812,19831984,19831984,1.0
1811,19831984,19831984,1.0


A discrete set of different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [33]:
exactDate_deltas = np.sort(df_feature_base.exactDate_delta.unique())
exactDate_deltas

array([0.    , 0.0625, 0.125 , 0.1875, 0.25  , 0.3125, 0.375 , 0.4375,
       0.5   , 0.5625, 0.625 , 0.6875, 0.75  , 0.8125, 0.875 , 0.9375,
       1.    ])

In [34]:
sample_size = 5

for i in exactDate_deltas :
    dpf.show_samples_distinct(df_feature_base, 'exactDate', i, sample_size)
    print(f'exactDate_delta = {i}')

Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
59708,0,0.0,20019999,19992017
26229,0,0.0,20139999,18501868
82703,0,0.0,20182016,18501868
61239,0,0.0,20111201,19859999
111136,0,0.0,20042008,19559999


exactDate_delta = 0.0


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
17508,0,0.0625,196a9999,20182018
36234,0,0.0625,196a9999,20140801
24988,0,0.0625,196a9999,20182018
15809,0,0.0625,198a9999,20172017
35585,0,0.0625,200a9999,19992017


exactDate_delta = 0.0625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
59682,0,0.125,20019999,19581978
22958,0,0.125,19552014,20140318
5724,0,0.125,20049999,19842003
26214,0,0.125,20139999,19541972
96355,0,0.125,19872016,18001899


exactDate_delta = 0.125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
15757,0,0.1875,198a9999,18501868
17440,0,0.1875,196a9999,18501868
16386,0,0.1875,1aaa9999,20182018
15751,0,0.1875,198a9999,20101986
35616,0,0.1875,200a9999,19831984


exactDate_delta = 0.1875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
28134,0,0.25,19591990,2014uuuu
11719,0,0.25,1955aaaa,2006uuuu
21854,0,0.25,1988aaaa,2020uuuu
33016,0,0.25,1979aaaa,2006uuuu
45000,0,0.25,2007aaaa,1975uuuu


exactDate_delta = 0.25


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
15720,0,0.3125,198a9999,2007uuuu
36069,0,0.3125,196a9999,18811894
15727,0,0.3125,198a9999,2004uuuu
24998,0,0.3125,196a9999,2019uuuu
36254,0,0.3125,196a9999,2020uuuu


exactDate_delta = 0.3125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
47000,0,0.375,2012aaaa,19uu9999
7767,0,0.375,1892aaaa,19851985
78084,0,0.375,19879999,2017uuuu
73728,0,0.375,1994aaaa,18001899
114253,0,0.375,20019999,1871uuuu


exactDate_delta = 0.375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
35609,0,0.4375,200a9999,20022018
16304,0,0.4375,1aaa9999,2008uuuu
16297,0,0.4375,1aaa9999,2005uuuu
16410,0,0.4375,1aaa9999,2019uuuu
15703,0,0.4375,198a9999,1871uuuu


exactDate_delta = 0.4375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
43195,0,0.5,2003aaaa,uuuuuuuu
58306,0,0.5,20089999,2020uuuu
11258,0,0.5,1955aaaa,uuuuuuuu
35132,0,0.5,2007aaaa,2020uuuu
50453,0,0.5,2006aaaa,2020uuuu


exactDate_delta = 0.5


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
17465,0,0.5625,196a9999,19721990
15826,0,0.5625,198a9999,1999uuuu
16314,0,0.5625,1aaa9999,1984uuuu
36242,0,0.5625,196a9999,20189999
35658,0,0.5625,200a9999,2017uuuu


exactDate_delta = 0.5625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
51940,0,0.625,2011aaaa,2019uuuu
29038,0,0.625,19922003,1993uuuu
30523,0,0.625,19aa9999,19691969
66764,0,0.625,1984aaaa,1988uuuu
67995,0,0.625,1987aaaa,19842003


exactDate_delta = 0.625


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
36133,0,0.6875,196a9999,19691969
17526,0,0.6875,196a9999,1967uuuu
35565,0,0.6875,200a9999,2008uuuu
35542,0,0.6875,200a9999,2007uuuu
35554,0,0.6875,200a9999,2004uuuu


exactDate_delta = 0.6875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
1065,1,0.75,1955aaaa,1955uuuu
1245,1,0.75,20089999,2008uuuu
68383,0,0.75,2017aaaa,2017uuuu
37306,0,0.75,19849999,19729999
2267,1,0.75,1925aaaa,1925uuuu


exactDate_delta = 0.75


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
36149,0,0.8125,196a9999,19559999
24964,0,0.8125,196a9999,19729999
24961,0,0.8125,196a9999,19859999
1215,1,0.8125,1aaa9999,1uuu9999
24983,0,0.8125,196a9999,19759999


exactDate_delta = 0.8125


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
28650,0,0.875,19549999,19589999
30746,0,0.875,19aa9999,19999999
2157,1,0.875,201303aa,201303uu
36710,0,0.875,19549999,19559999
31592,0,0.875,19aa9999,19909999


exactDate_delta = 0.875


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
1837,1,0.9375,196a9999,196u9999
1592,1,0.9375,196a9999,196u9999
1375,1,0.9375,198a9999,198u9999
1667,1,0.9375,20049999,200u9999
1838,1,0.9375,196a9999,196u9999


exactDate_delta = 0.9375


Unnamed: 0,duplicates,exactDate_delta,exactDate_x,exactDate_y
1642,1,1.0,19922003,19922003
1810,1,1.0,19831984,19831984
2044,1,1.0,19271947,19271947
1183,1,1.0,19849999,19849999
1928,1,1.0,19581958,19581958


exactDate_delta = 1.0


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split into two new attributes, $\texttt{format}\_\texttt{prefix}$ and $\texttt{format}\_\texttt{postfix}$, which are compared with different similarity metrics.

- As the quality of $\texttt{format}\_\texttt{prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format}\_\texttt{postfix}$, a q-gram based comparison will be chosen.
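
The two comparison types named above can be sketched in plain Python. The sketch assumes a q-gram Jaccard similarity on multisets of overlapping bigrams. The function names and the six-character input codes are hypothetical and chosen for illustration only, and the code is not the implementation of the TextDistance library.

```python
from collections import Counter


def identity_similarity(left, right):
    """Return 1.0 for identical strings and 0.0 otherwise."""
    return 1.0 if left == right else 0.0


def qgrams(text, q=2):
    """Split a string into its overlapping substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def jaccard_qgrams(left, right, q=2):
    """Jaccard similarity of two strings, compared as multisets of q-grams."""
    bag_left, bag_right = Counter(qgrams(left, q)), Counter(qgrams(right, q))
    union_size = sum((bag_left | bag_right).values())
    if union_size == 0:
        # Assumption of this sketch: no q-grams on either side counts as identical.
        return 1.0
    return sum((bag_left & bag_right).values()) / union_size


print(identity_similarity('bk', 'vm'))     # different prefixes -> 0.0
print(jaccard_qgrams('020053', '020500'))  # 4 shared bigrams out of 6
```

Compared with the identity comparison, the q-gram comparison still rewards codes that share parts of their digit sequence.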

In [35]:
attribute = 'format'

columns_metadata_dict['similarity_metrics'][attribute+'_prefix'] = tedi.Identity()
columns_metadata_dict['similarity_metrics'][attribute+'_postfix'] = tedi.Jaccard(qval=2)

pfix_values = ['_prefix', '_postfix']

for pf in pfix_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pf,
        columns_metadata_dict['similarity_metrics'][attribute+pf],
        columns_metadata_dict)

In [36]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_prefix', i)
    print(f'format_prefix_delta = {i}')

Unnamed: 0,duplicates,format_prefix_delta,format_prefix_x,format_prefix_y
55698,0,0.0,bk,vm
115358,0,0.0,vm,cr
111460,0,0.0,mp,cr
84446,0,0.0,vm,bk
91487,0,0.0,mu,vm


format_prefix_delta = 0.0


In [37]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    dpf.show_samples_distinct(df_feature_base, 'format_postfix', i)
    print(f'format_postfix_delta = {i}')

Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
2232,1,0.428571,10000,10300
23301,0,0.428571,40100,10053
74610,0,0.428571,20300,20000
33986,0,0.428571,20400,20500
65148,0,0.428571,10300,30600


format_postfix_delta = 0.4285714285714286


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
9548,0,0.111111,20000,30600
59873,0,0.111111,20300,10100
99117,0,0.111111,10100,20000
24727,0,0.111111,30300,10400
88327,0,0.111111,20000,30300


format_postfix_delta = 0.11111111111111116


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
14193,0,1.0,20000,20000
88232,0,1.0,10300,10300
44513,0,1.0,20000,20000
50795,0,1.0,10300,10300
4498,0,1.0,10300,10300


format_postfix_delta = 1.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
93100,0,0.0,10100,20353
82718,0,0.0,10153,30300
32887,0,0.0,30600,20453
56450,0,0.0,10300,20453
101675,0,0.0,30600,10253


format_postfix_delta = 0.0


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
108998,0,0.25,20053,10253
21368,0,0.25,20000,20453
87423,0,0.25,10053,40500
111870,0,0.25,40100,30047
110495,0,0.25,20053,30500


format_postfix_delta = 0.25


Unnamed: 0,duplicates,format_postfix_delta,format_postfix_x,format_postfix_y
74217,0,0.666667,30053,30500
84626,0,0.666667,10053,10500
107290,0,0.666667,20053,20500
108994,0,0.666667,20053,20500
110656,0,0.666667,20053,20500


format_postfix_delta = 0.6666666666666666


### isbn

Swissbib uses each string element of the $\texttt{isbn}$ list separately for comparing with each string element of its comparison $\texttt{isbn}$ list. If two bibliographic units hold at least one element in common, this is interpreted as a strong indicator for duplicates [[WiCo2001](./A_References.ipynb#wico2001)].

This strict logic is used in a modified way in the context of this capstone project. A special comparison function $\texttt{.build}\_\texttt{delta}\_\texttt{isbn}()$ has been implemented that compares each list element of the left-hand side with each list element of the right-hand side of a pair. In line with Swissbib's implementation, the Identity metric is used for string comparison, calculating a similarity value of 1.0 or 0.0 for each list element pair. For normalisation, the sum of similarity values is divided by the number of elements of the smaller list. If both lists are empty, a value of 1.0 is returned. If only one list is empty, a value of 0.0 is returned.
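
The rules of the preceding paragraph can be written down as a short sketch. The function name `isbn_list_similarity` is hypothetical, and the code follows only the description given here, so it may differ in detail from $\texttt{.build}\_\texttt{delta}\_\texttt{isbn}()$.

```python
def isbn_list_similarity(isbn_x, isbn_y):
    """Compare two lists of ISBN strings element by element."""
    if not isbn_x and not isbn_y:
        return 1.0  # both lists empty
    if not isbn_x or not isbn_y:
        return 0.0  # only one list empty
    # Identity comparison of every left element with every right element.
    matches = sum(1.0 for left in isbn_x for right in isbn_y if left == right)
    # Normalise with the number of elements of the smaller list.
    return matches / min(len(isbn_x), len(isbn_y))


print(isbn_list_similarity([], []))                                # 1.0
print(isbn_list_similarity(['3-303-01200-8'], []))                 # 0.0
print(isbn_list_similarity(['3-303-01200-8'], ['3-303-01200-8']))  # 1.0
print(isbn_list_similarity(['a', 'b'], ['b', 'c', 'd']))           # 0.5
```

Because the comparison is an exact string match, two list elements that carry the same number but a different annotation in brackets do not count as a match in this sketch.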

In [38]:
attribute = 'isbn'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

df_feature_base[attribute+'_delta'].unique()

array([1.        , 0.        , 0.5       , 0.09090909, 0.1       ,
       0.9       , 0.18181818, 0.90909091, 0.16666667])

Some sample cases are shown below for each category of $\texttt{isbn}\_\texttt{delta}$.

In [39]:
for isbn_delta_value in df_feature_base['isbn_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base['isbn_delta']==isbn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'isbn', isbn_delta_value, number_of_max_samples)
    print(f'isbn_delta = {isbn_delta_value}')

Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
71308,0,1.0,[],[]
100692,0,1.0,[],[]
27108,0,1.0,[],[]
117773,0,1.0,[],[]
7601,0,1.0,[],[]
8015,0,1.0,[],[]
43472,0,1.0,[3-290-17327-5],[3-290-17327-5]
72443,0,1.0,[],[]
114819,0,1.0,[],[]
5506,0,1.0,[],[]


isbn_delta = 1.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
50384,0,0.0,"[3-7891-2941-0, 978-3-7891-2941-4]",[]
17483,0,0.0,[],[978-3-8373-0828-0]
48929,0,0.0,"[978-3-608-89148-5, 3-608-89148-X]",[]
67451,0,0.0,"[978-3-16-155678-4, 3-16-155678-X]",[]
57018,0,0.0,[3-303-01200-8],[]
38468,0,0.0,[],[978-2-37349-115-9]
69772,0,0.0,"[978-3-476-02276-9, 978-3-476-05278-0 (ebook)]",[]
108865,0,0.0,"[978-0-470-11088-1, 978-0-470-11046-1]",[]
8418,0,0.0,[3-908003-66-0],[90-04-09973-5]
108888,0,0.0,"[978-0-470-11088-1, 978-0-470-11046-1]",[]


isbn_delta = 0.0


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
256,1,0.5,"[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]","[3-7281-1755-2 (Verlag der Fachvereine), 3-519..."
2168,1,0.5,"[978-3-476-02276-9, 978-3-476-05278-0 (ebook)]","[3-476-05278-8, 978-3-476-02276-9, 978-3-476-0..."
174,1,0.5,"[3-7281-1755-2 (Verlag der Fachvereine), 3-519...","[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]"
214,1,0.5,"[3-7281-1755-2 (Vlg. der Fachvereine), 3-519-0...","[3-7281-1755-2 (Verlag der Fachvereine), 3-519..."
255,1,0.5,"[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]","[3-7281-1755-2 (Vlg. der Fachvereine), 3-519-0..."
269,1,0.5,"[3-7281-1755-2 (Verlag der Fachvereine), 3-519...","[3-7281-1755-2 (Vlg. der Fachvereine), 3-519-0..."
272,1,0.5,"[3-7281-1755-2 (Verlag der Fachvereine), 3-519...","[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]"
210,1,0.5,"[3-7281-1755-2 (Vlg. der Fachvereine), 3-519-0...","[3-7281-1755-2 (Verlag der Fachvereine), 3-519..."
230,1,0.5,"[3-7281-1755-2 (Verlag der Fachvereine), 3-519...","[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]"
2164,1,0.5,"[3-476-05278-8, 978-3-476-02276-9, 978-3-476-0...","[978-3-476-02276-9, 978-3-476-05278-0 (ebook)]"


isbn_delta = 0.5


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
448,1,0.090909,"[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0...","[3-7772-8527-7 (Gesamtwerk), 3-7772-8721-0 (Ba..."
408,1,0.090909,"[3-7772-8527-7 (Gesamtwerk), 3-7772-8721-0 (Ba...","[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0..."


isbn_delta = 0.09090909090909091


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
420,1,0.1,"[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."
417,1,0.1,"[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."
427,1,0.1,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ..."
460,1,0.1,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ..."


isbn_delta = 0.1


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
421,1,0.9,"[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ...","[3-7772-0327-0, 3-7772-0433-1, 3-7772-8721-0, ..."
471,1,0.9,"[3-7772-0327-0, 3-7772-0433-1, 3-7772-8721-0, ...","[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ..."


isbn_delta = 0.9


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
430,1,0.181818,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0..."
450,1,0.181818,"[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."


isbn_delta = 0.18181818181818182


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
461,1,0.909091,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."
431,1,0.909091,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."


isbn_delta = 0.9090909090909091


Unnamed: 0,duplicates,isbn_delta,isbn_x,isbn_y
463,1,0.166667,"[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978...","[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0..."
453,1,0.166667,"[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0...","[3-7772-8527-7, 978-3-7772-8721-8 (Bd. 1), 978..."


isbn_delta = 0.16666666666666666


For attribute $\texttt{isbn}$, the special marking of missing values is omitted.

### ismn

This attribute will be processed with the identity similarity metric. The reasoning for this decision is the same as for similar attributes above. 

In [40]:
attribute = 'ismn'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Identity()
#tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [41]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

ismn values range [0. 1.]


In [42]:
for ismn_delta_value in df_feature_base[attribute+'_delta'].unique():
    number_of_max_samples = min(
        10,
        len(df_feature_base[df_feature_base[attribute+'_delta']==ismn_delta_value])
    )

    dpf.show_samples_distinct(df_feature_base, 'ismn', ismn_delta_value, number_of_max_samples)
    print(f'ismn_delta = {ismn_delta_value}')

Unnamed: 0,duplicates,ismn_delta,ismn_x,ismn_y
15847,0,1.0,,
57314,0,1.0,,
54285,0,1.0,,
109721,0,1.0,,
69026,0,1.0,,
92412,0,1.0,,
29250,0,1.0,,
62323,0,1.0,,
78070,0,1.0,,
76505,0,1.0,,


ismn_delta = 1.0


Unnamed: 0,duplicates,ismn_delta,ismn_x,ismn_y
53134,0,0.0,m004182772,
53124,0,0.0,m004182772,
65293,0,0.0,m001068048,
65414,0,0.0,m001068048,
41149,0,0.0,m001068048,
41176,0,0.0,m001068048,
53140,0,0.0,m004182772,
53159,0,0.0,m004182772,
35221,0,0.0,m001068062,
41186,0,0.0,m001068048,


ismn_delta = 0.0


As can be seen in the previous chapters, attribute $\texttt{ismn}$ is sparsely filled. Many pairs of missing values evaluate to a similarity of 1.0 under the chosen similarity metric. To mark these cases specifically, they will be transformed to a negative value.

In [43]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### musicid

Chapter [Data Analysis](./1_DataAnalysis.ipynb) shows that attribute $\texttt{musicid}$ is an identifier for a music record. A Jaccard metric has been tested on this attribute, resulting in many high similarity values for pairs of uniques. After comparing this result with the LCS metric, the latter has been chosen.
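
The idea behind a longest common substring comparison can be sketched as follows. The sketch assumes that the length of the longest common substring is divided by the length of the longer string. The function names are hypothetical, the return value for two empty strings is an arbitrary choice of the sketch, and the code is not the implementation of the TextDistance library.

```python
def longest_common_substring_length(left, right):
    """Length of the longest contiguous run of characters shared by both strings."""
    best = 0
    previous = [0] * (len(right) + 1)
    for char_left in left:
        current = [0] * (len(right) + 1)
        for j, char_right in enumerate(right, start=1):
            if char_left == char_right:
                # Extend the common run that ended one position earlier.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def lcs_similarity(left, right):
    """Longest common substring length, normalised by the longer string."""
    longest = max(len(left), len(right))
    if longest == 0:
        return 0.0  # arbitrary choice of this sketch for two empty strings
    return longest_common_substring_length(left, right) / longest


print(lcs_similarity('12235', '122354061229122351'))  # 5 / 18
print(lcs_similarity('5099964192727', '50999'))       # 5 / 13
```

A short identifier that is fully contained in a long one thus receives a low value, because the normalisation uses the longer string.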

In [44]:
attribute = 'musicid'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.LCSStr()
#tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [45]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

musicid values range [0.         0.05555556 0.07142857 0.07692308 0.08333333 0.09090909
 0.1        0.11111111 0.125      0.14285714 0.15384615 0.16666667
 0.18181818 0.2        0.21428571 0.22222222 0.23076923 0.25
 0.27777778 0.28571429 0.30769231 0.33333333 0.35714286 0.38461538
 0.4        0.41666667 0.44444444 0.5        0.55555556 0.6
 0.66666667 0.69230769 0.75       0.8        1.        ]


In [46]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

position = uniques_len[attribute]//2

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
1810,1,1.0,12,12
2723,1,1.0,9500,9500
2014,1,1.0,41573,41573
2433,1,1.0,765,765
921,1,1.0,6407,6407
1704,1,1.0,6161,6161
100,1,1.0,77690,77690
2678,1,1.0,670,670
1823,1,1.0,80,80
1789,1,1.0,907063,907063


0.8 <= musicid_delta <= 1.0


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
49533,0,0.0,,
22038,0,0.0,,
55307,0,0.0,,
54137,0,0.0,9352.0,
53879,0,0.0,88875083199.0,
98314,0,0.0,,
6316,0,0.0,41572.0,
69698,0,0.0,,
51320,0,0.0,,
49110,0,0.0,,


0.0 <= musicid_delta <= 0.0714285714285714


Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
2574,1,0.277778,12235,122354061229122351
15115,0,0.285714,28947755722,26671000757220
2576,1,0.277778,122354061229122351,12235
2578,1,0.277778,122354061229122351,12235
2580,1,0.277778,12235,122354061229122351


0.2777777777777778 <= musicid_delta <= 0.2857142857142857


In [47]:
dpf.show_samples_interval(
    df_feature_base[df_feature_base.duplicates==1], attribute,
    uniques[attribute][0],
    uniques[attribute][uniques_len[attribute]-1], 20)

Unnamed: 0,duplicates,musicid_delta,musicid_x,musicid_y
2117,1,0.0,,
695,1,0.0,,
1055,1,0.0,,
305,1,0.0,,
61,1,0.0,,
644,1,0.0,,
1930,1,0.0,,
1752,1,0.0,,
1423,1,0.384615,5099964192727.0,50999.0
2488,1,0.0,,


0.0 <= musicid_delta <= 1.0


Less than $10\%$ of the values of this attribute are filled. As the samples above show, pairs of empty values receive a similarity value of 0.0 and therefore cannot be told apart from pairs of non-matching values. This effect can be adjusted with function $\texttt{.mark}\_\texttt{missing}()$ as above.

In [48]:
df_feature_base = dpf.mark_missing(df_feature_base, 'musicid', factor)

### part

Analogous to attribute $\texttt{edition}$ described above, the string value of this attribute can be stripped to digits only. Both ways, with and without letter stripping, have been tried for modelling. The final decision for the best processing will be documented in chapter [Overview and Summary](./0_OverviewSummary.ipynb). Three different metrics have been tried for attribute $\texttt{part}$. Finally, metric StrCmp95 will be used.

In [49]:
attribute = 'part'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.StrCmp95()
#tedi.Jaro()
#tedi.Hamming()
#tedi.LCSStr()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [50]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

part values range [0.         0.32251082 0.32857143 0.33597884 0.33921569 0.34444444
 0.3452381  0.35714286 0.36060606 0.36666667 0.37037037 0.37254902
 0.37301587 0.37777778 0.38095238 0.38461538 0.38888889 0.39324619
 0.39393939 0.39417989 0.3952381  0.4        0.40277778 0.40740741
 0.41005291 0.41111111 0.41203704 0.41452991 0.41798942 0.42063492
 0.42121212 0.42222222 0.42380952 0.42592593 0.42810458 0.42857143
 0.43030303 0.43076923 0.43162393 0.43174603 0.43333333 0.43518519
 0.43557423 0.43627451 0.43650794 0.43703704 0.43813131 0.43888889
 0.44017094 0.44047619 0.44166667 0.44444444 0.44603175 0.4469697
 0.44722222 0.44761905 0.44814815 0.4484127  0.44949495 0.45
 0.45238095 0.4537037  0.45555556 0.45598846 0.45740741 0.45833333
 0.45970696 0.46031746 0.46068376 0.46176471 0.46296296 0.46405229
 0.46428571 0.46470588 0.46507937 0.46581197 0.46666667 0.46825397
 0.46851852 0.47008547 0.47056277 0.47222222 0.47301587 0.47474747
 0.47619048 0.47619048 0.47657952 0.47777778 0.4777

In [51]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

position = uniques_len[attribute]//7

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-2],
    uniques[attribute][uniques_len[attribute]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+2], 10)

Unnamed: 0,duplicates,part_delta,part_x,part_y
63849,0,1.0,,
65537,0,1.0,,
26092,0,1.0,,
13900,0,1.0,,
79815,0,1.0,,
101474,0,1.0,,
68096,0,1.0,,
64090,0,1.0,,
75330,0,1.0,,
19170,0,1.0,,


0.9607843137254902 <= part_delta <= 1.0


Unnamed: 0,duplicates,part_delta,part_x,part_y
5316,0,0.0,81,
31828,0,0.0,2 35,
114556,0,0.0,142 36 2019,
37172,0,0.0,41,
92360,0,0.0,3,
44977,0,0.0,272,
17975,0,0.0,,14 14
115515,0,0.0,3,
42356,0,0.0,50 50,
108936,0,0.0,26,


0.0 <= part_delta <= 0.3285714285714285


Unnamed: 0,duplicates,part_delta,part_x,part_y
31232,0,0.722222,2,2 14 2
83845,0,0.722222,12 200,1
83757,0,0.722222,12 200,2
18721,0,0.722222,1 2005,1
70568,0,0.722222,12 200,1
34602,0,0.722222,2005 1,2
13468,0,0.722222,2005 1,2
83773,0,0.722222,12 200,2
8793,0,0.722222,1 2005,2
44697,0,0.722222,1 2005,2


0.7222222222222223 <= part_delta <= 0.7244607244607244


Unnamed: 0,duplicates,part_delta,part_x,part_y
88335,0,0.72619,1971 66,1901
74127,0,0.726852,50 50 50,4520 4520
59354,0,0.726852,50 50 50,4520 4520
52664,0,0.72549,20 3 2014 388 412,2 4


0.7254901960784315 <= part_delta <= 0.7268518518518517


For this attribute, too, moving pairs of empty values to negative values will result in a clearer distinction between pairs of uniques and duplicates, as will be seen in the graphical comparison of chapter [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb).

In [52]:
df_feature_base = dpf.mark_missing(df_feature_base, 'part', factor)

### person

As a result of chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{person}$ has been split into three specific attributes. Attributes $\texttt{person}\_{100}$ and $\texttt{person}\_{700}$ hold strongly standardised string values. For comparing pure strings, a Levenshtein metric is recommended [[Chri2012](./A_References.ipynb#chri2012)]. Unfortunately, this metric takes a very long time to compute on the data of the capstone project. After comparing the similarity values of the Levenshtein metric with those of other metrics in appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), similarity metric StrCmp95 has been chosen.

In [53]:
attribute = 'person'

columns_metadata_dict['similarity_metrics'][attribute+'_100'] = tedi.StrCmp95()
columns_metadata_dict['similarity_metrics'][attribute+'_700'] = tedi.StrCmp95()
#tedi.Levenshtein()

pe_values = ['_100', '_700']

for pe in pe_values :
    print('Calculating person'+pe)
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+pe,
        columns_metadata_dict['similarity_metrics'][attribute+pe],
        columns_metadata_dict)

Calculating person_100
Calculating person_700


In [54]:
pe = '_100'

uniques[attribute+pe], uniques_len[attribute+pe] = dpf.determine_similarity_values(
    df_feature_base, attribute+pe)

person_100 values range [0.         0.30606061 0.30735931 ... 0.97777778 0.98181818 1.        ]


In [55]:
position = uniques_len[attribute+pe]

dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position-2],
    uniques[attribute+pe][uniques_len[attribute+pe]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position],
    uniques[attribute+pe][uniques_len[attribute+pe]-position+2], 10)

Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
105760,0,1.0,,
33081,0,1.0,,
97609,0,1.0,,
1908,1,1.0,,
5422,0,1.0,,
36413,0,1.0,,
36773,0,1.0,,
24694,0,1.0,,
74477,0,1.0,,
1941,1,1.0,,


0.9818181818181818 <= person_100_delta <= 1.0


Unnamed: 0,duplicates,person_100_delta,person_100_x,person_100_y
95765,0,0.0,lindgrenastrid,
113907,0,0.0,,youngneil
98620,0,0.0,corduseuricius,
39027,0,0.0,,thileniusgeorg
96328,0,0.0,,bataillegeorges
18997,0,0.0,,beerhans de
82326,0,0.0,,nerudapablo
81755,0,0.0,jyscharne,
50463,0,0.0,lindgrenastrid,
27025,0,0.0,,paillouspierre


0.0 <= person_100_delta <= 0.3073593073593073


For comparing person names, like in attribute $\texttt{person}\_{245c}$, a Jaro metric will be tested [[Chri2012](./A_References.ipynb#chri2012)].
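
For orientation, the textbook definition of the Jaro similarity can be sketched in plain Python. The function name `jaro_similarity` is hypothetical, and the code illustrates the general definition (matches within a sliding window, corrected by half the number of transposed matches). It is not the implementation of the TextDistance library.

```python
def jaro_similarity(left, right):
    """Jaro similarity of two strings according to the textbook definition."""
    if left == right:
        return 1.0
    len_l, len_r = len(left), len(right)
    if len_l == 0 or len_r == 0:
        return 0.0
    # Characters match only if they are at most `window` positions apart.
    window = max(max(len_l, len_r) // 2 - 1, 0)
    matched_l = [False] * len_l
    matched_r = [False] * len_r
    matches = 0
    for i, char in enumerate(left):
        for j in range(max(0, i - window), min(len_r, i + window + 1)):
            if not matched_r[j] and right[j] == char:
                matched_l[i] = matched_r[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len_l):
        if matched_l[i]:
            while not matched_r[k]:
                k += 1
            if left[i] != right[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len_l + matches / len_r
            + (matches - transpositions) / matches) / 3


print(jaro_similarity('MARTHA', 'MARHTA'))  # 17/18
print(jaro_similarity('DWAYNE', 'DUANE'))   # 37/45
```

The metric tolerates swapped neighbouring characters and small spelling differences, which is the reason it is commonly suggested for person names.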

In [56]:
pe = '_245c'

columns_metadata_dict['similarity_metrics'][attribute+pe] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute+pe,
    columns_metadata_dict['similarity_metrics'][attribute+pe],
    columns_metadata_dict)

In [57]:
uniques[attribute+pe], uniques_len[attribute+pe] = dpf.determine_similarity_values(
    df_feature_base, attribute+pe)

person_245c values range [0.         0.23230373 0.23389356 ... 0.99004975 0.99869281 1.        ]


In [58]:
position = uniques_len[attribute+pe]

dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position-2],
    uniques[attribute+pe][uniques_len[attribute+pe]-position-1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+pe,
    uniques[attribute+pe][uniques_len[attribute+pe]-position],
    uniques[attribute+pe][uniques_len[attribute+pe]-position+2], 10)

Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
886,1,1.0,györgy ligeti,györgy ligeti
57248,0,1.0,,
93123,0,1.0,,
2083,1,1.0,,
106654,0,1.0,,
45841,0,1.0,,
42637,0,1.0,,
75901,0,1.0,,
106681,0,1.0,,
1755,1,1.0,written and directed by charlie kaufman,written and directed by charlie kaufman


0.9986928104575163 <= person_245c_delta <= 1.0


Unnamed: 0,duplicates,person_245c_delta,person_245c_x,person_245c_y
23676,0,0.0,roland plattner,
71563,0,0.0,"europäische wirtschaftsgemeinschaft, kommission",
97980,0,0.0,par robert legros,
88595,0,0.0,brigitta z'graggen,
29588,0,0.0,edith cresson ... [et al.]; winterthur-versich...,
101126,0,0.0,manuel marques ; coll. dir. par isabelle guillez,
19198,0,0.0,,[nach] erich kästner ; in der fassung von jame...
94030,0,0.0,,klaus-jürgen wrede
81646,0,0.0,"[mark eisenegger, mario schranz]",
19965,0,0.0,"institut für geologie und paläontologie, unive...",


0.0 <= person_245c_delta <= 0.23389355742296913


The similarities of all three $\texttt{person}$ attributes are affected by empty values. These will be handled the same way as the attributes above.

In [59]:
pe_values = ['_100', '_245c', '_700']

for pe in pe_values :
    df_feature_base = dpf.mark_missing(df_feature_base, 'person'+pe, factor)
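
The source of $\texttt{dpf.mark\_missing}$ is not shown in this chapter. In the sample records at the end of this chapter, pairs with both values missing carry a similarity of $-1.0$ and pairs with exactly one value missing carry $-0.5$, with $\texttt{factor} = 1.0$. The following sketch reproduces this rule for a single pair under these assumptions. The function `mark_missing_sketch` is hypothetical, and the way `factor` scales the marker values is a guess.

```python
def mark_missing_sketch(value_x, value_y, similarity, factor=1.0):
    """Replace the similarity of a pair by a negative marker if values are missing.

    Assumption inferred from the sample output, not from the library source:
    both values missing -> -1.0 * factor, one value missing -> -0.5 * factor.
    """
    x_missing = value_x is None or value_x == ''
    y_missing = value_y is None or value_y == ''

    if x_missing and y_missing:
        return -1.0 * factor
    if x_missing or y_missing:
        return -0.5 * factor
    # Both values are present: keep the calculated similarity
    return similarity


print(mark_missing_sketch('', '', 1.0))
print(mark_missing_sketch('lindgrenastrid', '', 0.0))
print(mark_missing_sketch('kochmichael', 'kochmichael', 1.0))
```

Marking missing values with negative numbers keeps them apart from the regular similarity range $[0, 1]$, so a pair of two empty strings is no longer indistinguishable from a pair of two identical names.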

### pubinit

This attribute holds publisher strings whose representation is similar to that of attribute $\texttt{person}$. A Jaro metric will be used.

In [60]:
attribute = 'pubinit'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaro()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [61]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

pubinit values range [0.         0.25196409 0.25483871 ... 0.96738351 0.97101449 1.        ]


In [62]:
position = uniques_len[attribute]//3

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position-1],
    uniques[attribute][uniques_len[attribute]-position], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-5],
    uniques[attribute][uniques_len[attribute]-1], 10)

Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
75765,0,0.551768,gastrosuisse,luiss university press
99514,0,0.551768,gastrosuisse,luiss university press
89953,0,0.551782,federazione delle colonie libere italiane in s...,excerpta medica


0.5517676767676768 <= pubinit_delta <= 0.5517819706498952


Unnamed: 0,duplicates,pubinit_delta,pubinit_x,pubinit_y
71433,0,1.0,,
15413,0,1.0,,
47723,0,1.0,,
91402,0,1.0,,
433,1,1.0,,
70862,0,1.0,,
67427,0,1.0,,
40484,0,1.0,,
98562,0,1.0,,
109971,0,1.0,,


0.9196490469888087 <= pubinit_delta <= 1.0


The similarities of $\texttt{pubinit}$ are affected by empty values. These will be transformed to negative values.

In [63]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### scale

A comparison of similarity metrics on some sample value pairs of attribute $\texttt{scale}$ in appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb) has identified a Jaccard metric as showing the best matching behaviour for the purely numerical values stored in the attribute.
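
As an illustration, the following sketch computes a Jaccard similarity on character multisets, which corresponds to the parameters $\texttt{qval}=1$ and $\texttt{as\_set}=\texttt{False}$ reported for this metric in the summary of this chapter. The function `jaccard_multiset` is a hypothetical helper and not the TextDistance implementation; returning $1.0$ for two empty strings is an assumption.

```python
from collections import Counter


def jaccard_multiset(s1: str, s2: str) -> float:
    """Jaccard similarity on character multisets (illustrative sketch)."""
    count1, count2 = Counter(s1), Counter(s2)
    # Multiset intersection takes the minimum count per character,
    # multiset union takes the maximum count per character
    intersection = sum((count1 & count2).values())
    union = sum((count1 | count2).values())
    if union == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return intersection / union


for pair in [('50000', '50 000'), ('100000', '100 000'),
             ('100000', '1000000'), ('100 000', '1000000')]:
    print(pair, round(jaccard_multiset(*pair), 6))
```

The sketch shows why formatting differences like an inserted blank reduce the similarity only slightly, while it also shows a weakness of the metric for numbers: the pair $100000$ and $1000000$ differs by a factor of ten but still reaches a high similarity value.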

In [64]:
attribute = 'scale'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.Jaccard()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [65]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

scale values range [0.         0.13793103 0.17857143 0.1875     0.21428571 0.25
 0.26666667 0.33333333 0.375      0.41666667 0.44444444 0.45454545
 0.5        0.57142857 0.625      0.63636364 0.66666667 0.71428571
 0.75       0.83333333 0.85714286 1.        ]


In [66]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-3],
    uniques[attribute][uniques_len[attribute]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-4],
    uniques[attribute][uniques_len[attribute]-3], 10)

Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
52613,0,0.0,,25000
99181,0,0.0,,15 000
46740,0,0.0,100000.0,
16869,0,0.0,15000.0,
77948,0,0.0,100000.0,
19754,0,0.0,,35000
36344,0,0.0,,25000
76311,0,0.0,50000.0,
106147,0,0.0,,1000000
92823,0,0.0,,15 000


0.0 <= scale_delta <= 0.13793103448275867


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
1888,1,0.833333,50000,50 000
2070,1,0.857143,100000,100 000
1882,1,0.833333,50000,50 000
57340,0,0.857143,100000,1000000
2082,1,0.857143,100 000,100000
1575,1,0.833333,15000,15 000
46868,0,0.857143,100000,1000000
16901,0,0.833333,15000,15 000
1586,1,0.833333,15 000,15000
1907,1,0.833333,50 000,50000


0.8333333333333334 <= scale_delta <= 0.8571428571428571


Unnamed: 0,duplicates,scale_delta,scale_x,scale_y
56358,0,0.833333,15000,15 000
1586,1,0.833333,15 000,15000
1910,1,0.833333,50 000,50000
1894,1,0.833333,50000,50 000
1573,1,0.833333,15000,15 000
16901,0,0.833333,15000,15 000
1907,1,0.833333,50 000,50000
1909,1,0.833333,50 000,50000
110074,0,0.75,100 000,1000000
1882,1,0.833333,50000,50 000


0.75 <= scale_delta <= 0.8333333333333334


Attribute $\texttt{scale}$ is filled for maps only. Due to its sparse filling, the similarities of the attribute are strongly affected by empty values. These empty values will be marked with a special negative value.

In [67]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split up into two new attributes $\texttt{ttlfull}\_\texttt{245}$ and $\texttt{ttlfull}\_\texttt{246}$, which will be compared by the same similarity metric. A visual analysis of the values stored in the attribute reveals a string of words, comparable to the strings in attribute $\texttt{person}\_\texttt{245c}$ above. Therefore, the same Jaro metric will be used for both title attributes.

In [68]:
attribute = 'ttlfull'

columns_metadata_dict['similarity_metrics'][attribute+'_245'] = tedi.Jaro()
columns_metadata_dict['similarity_metrics'][attribute+'_246'] = tedi.Jaro()

tf_values = ['_245', '_246']

for tf in tf_values :
    df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
        df_feature_base, attribute+tf,
        columns_metadata_dict['similarity_metrics'][attribute+tf],
        columns_metadata_dict)

In [69]:
for tf in tf_values :
    uniques[attribute+tf], uniques_len[attribute+tf] = dpf.determine_similarity_values(
        df_feature_base, attribute+tf)

ttlfull_245 values range [0.         0.2427766  0.24567901 ... 0.9952381  0.99691358 1.        ]
ttlfull_246 values range [0.         0.31481481 0.31754386 ... 0.93333333 0.94639376 1.        ]


In [70]:
tf = '_245'
position = uniques_len[attribute+tf]

dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-position],
    uniques[attribute+tf][uniques_len[attribute+tf]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-3],
    uniques[attribute+tf][uniques_len[attribute+tf]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute+tf,
    uniques[attribute+tf][uniques_len[attribute+tf]-4],
    uniques[attribute+tf][uniques_len[attribute+tf]-3], 10)

Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
34446,0,0.0,1861-1862,"bericht, über das ... vereinsjahr"
29157,0,0.0,1688-1692,geschäftsbericht
115710,0,0.0,"young world, 3, english class 5",emma
101562,0,0.0,geschäftsbericht,1-400
108107,0,0.0,"topiaria helvetica, jahrbuch = revue annuelle ...",1-400
45803,0,0.0,"acta societatis scientiarum fennicae, a, opera...",1-400
10460,0,0.0,"conversations au bord du fleuve mourant, ethno...",1-400
29330,0,0.0,1688-1692,"babylon berlin, staffel 1-3"
66934,0,0.0,"die brüder löwenherz, abenteuer im land der sa...",1-400
66406,0,0.0,topiaria helvetica : jahrbuch,1-400


0.0 <= ttlfull_245_delta <= 0.24277660324171957


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
850,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
844,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
801,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
827,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
736,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
2352,1,0.995238,massnahmen gegen die arbeitslosigkeit vor dem ...,massnahmen gegen die arbeitslosigkeit vor dem ...
814,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
723,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
851,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."
843,1,0.996914,"100 jahre edk, jubiläumsfeier vom 5./6. juni 1...","100 jahre edk, jubiläumsfeier vom 5./6. juni 1..."


0.9952380952380953 <= ttlfull_245_delta <= 0.9969135802469135


Unnamed: 0,duplicates,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
1288,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1292,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1274,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1290,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1267,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
2351,1,0.995238,massnahmen gegen die arbeitslosigkeit vor dem ...,massnahmen gegen die arbeitslosigkeit vor dem ...
1265,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1304,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1286,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."
1289,1,0.994536,"wie sicher ist unsere zukunft?, beiträge zur w...","wie sicher ist unsere zukunft?, beiträge zur w..."


0.9945355191256832 <= ttlfull_245_delta <= 0.9952380952380953


Attribute $\texttt{ttlfull}\_\texttt{245}$ is filled for all data rows of Swissbib's raw data, as can be seen in chapter [Data Analysis](./1_DataAnalysis.ipynb). For attribute $\texttt{ttlfull}\_\texttt{246}$, the filling is below $10\%$. The data pairs with missing values will be marked with a negative value, as has been done for similar cases above.

In [71]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute+'_246', factor)

### volumes

Chapter [Data Analysis](./1_DataAnalysis.ipynb) describes this attribute as holding content that resembles the content of attribute $\texttt{part}$. Therefore, the same similarity metric will be used for attribute $\texttt{volumes}$ as for attribute $\texttt{part}$.

In [72]:
attribute = 'volumes'

columns_metadata_dict['similarity_metrics'][attribute] = tedi.StrCmp95()
#tedi.Jaro()
#tedi.LCSSeq()
#tedi.MongeElkan()

df_feature_base, columns_metadata_dict = dpf.build_delta_feature(
    df_feature_base, attribute,
    columns_metadata_dict['similarity_metrics'][attribute],
    columns_metadata_dict)

In [73]:
uniques[attribute], uniques_len[attribute] = dpf.determine_similarity_values(
    df_feature_base, attribute)

volumes values range [0.         0.32251082 0.33333333 0.35483871 0.35714286 0.36060606
 0.36111111 0.37301587 0.37407407 0.38333333 0.39393939 0.3952381
 0.40740741 0.41039427 0.41111111 0.41666667 0.42857143 0.43030303
 0.43055556 0.43650794 0.43703704 0.44166667 0.44444444 0.44761905
 0.44949495 0.4537037  0.45448029 0.45519713 0.45555556 0.45833333
 0.46296296 0.46428571 0.46666667 0.47222222 0.47474747 0.47634409
 0.47979798 0.48148148 0.48333333 0.48412698 0.48611111 0.49007937
 0.49185868 0.49206349 0.5        0.50448029 0.50529101 0.50555556
 0.50793651 0.51075269 0.51111111 0.51190476 0.51313131 0.51388889
 0.51523297 0.51689708 0.51851852 0.51936027 0.52222222 0.52380952
 0.52727273 0.52777778 0.53030303 0.53154122 0.53174603 0.53225806
 0.53333333 0.53703704 0.53968254 0.54074074 0.54166667 0.54301075
 0.54722222 0.54761905 0.55       0.5515873  0.55468665 0.55555556
 0.55852535 0.55967742 0.55984848 0.56060606 0.56168831 0.56190476
 0.56481481 0.56507937 0.56666667 0.572452

In [74]:
position = uniques_len[attribute]

dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-position],
    uniques[attribute][uniques_len[attribute]-position+1], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-3],
    uniques[attribute][uniques_len[attribute]-2], 10)
dpf.show_samples_interval(
    df_feature_base, attribute,
    uniques[attribute][uniques_len[attribute]-4],
    uniques[attribute][uniques_len[attribute]-3], 10)

Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
78199,0,0.0,95,1
3432,0,0.0,3,1
96819,0,0.0,,1 1 1
37996,0,0.0,519,1
103297,0,0.0,1,4
40837,0,0.0,176,
110865,0,0.0,104 23,
44937,0,0.0,,1
63739,0,0.0,3 1,1
112198,0,0.0,6 1 24 2 24 1 20 2 23 1 19 2 19,


0.0 <= volumes_delta <= 0.3225108225108224


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
53015,0,0.933333,1 383,1 33
2897,0,0.933333,1 122,1 22
104440,0,0.933333,1 383,1 33
1475,1,0.925926,259 302 1,259 302
69095,0,0.933333,1 122,1 22
1479,1,0.925926,259 302,259 302 1
34126,0,0.933333,1 122,1 22
24402,0,0.933333,1 122,1 22
48153,0,0.933333,1 383,1 33
1485,1,0.925926,259 302 1,259 302


0.9259259259259259 <= volumes_delta <= 0.9333333333333332


Unnamed: 0,duplicates,volumes_delta,volumes_x,volumes_y
82672,0,0.916667,1 28,1 2
1479,1,0.925926,259 302,259 302 1
47709,0,0.916667,12 1,1 1
112147,0,0.916667,1 82,1 28
36863,0,0.916667,1 16,1 1
54248,0,0.916667,1 82,1 2
679,1,0.916667,346 48,346 4 48
23485,0,0.916667,1 3,1 30
647,1,0.916667,346 4 48,346 48
82691,0,0.916667,1 28,128


0.9166666666666666 <= volumes_delta <= 0.9259259259259259


Attribute $\texttt{volumes}$ holds rows with missing data. The data pairs with missing values will be marked with a special negative value.

In [75]:
df_feature_base = dpf.mark_missing(df_feature_base, attribute, factor)

## DataFrame with Attributes and Similarity Features

The metric for each attribute of the feature DataFrame has been decided and the similarity features have been calculated. In this last step, the columns of the DataFrame are reordered to place the $\_\texttt{delta}$ columns next to their input columns $\_\texttt{x}$ and $\_\texttt{y}$, and some sample records are shown for each class.

In [76]:
# Take _x, _y, and _delta columns together
fb_col_list = df_feature_base.columns.tolist()
fb_col_list.sort()
# Move target column to first place
fb_col_list.insert(0, fb_col_list.pop(fb_col_list.index('duplicates')))
# Reorder DataFrame columns
df_attribute_with_sim_feature = pd.DataFrame(df_feature_base, columns=fb_col_list)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_attribute_with_sim_feature.columns)

class_label = ['uniques', 'duplicate']

for i in class_label:
    display(df_attribute_with_sim_feature[df_attribute_with_sim_feature.duplicates==class_label.index(i)].sample(n=10))
    print(i)

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
4573,0,-0.5,e0102620,,-0.5,n0463834,,-1.0,,,-1.0,,,-0.5,2003.0,,0.375,2005aaaa,1925uuuu,0.428571,10300,30600,0.0,mp,cr,1.0,[],[],-1.0,,,-1.0,,,-0.5,1239 2005,,-1.0,,,-1.0,,,-1.0,,,-0.5,,shinkenchiku-sha,-0.5,25000.0,,0.484127,müstair,shinkenchiku,-1.0,,,-0.5,1,
65656,0,-1.0,,,-1.0,,,-0.5,polska akademia naukzakład badania ssaków,,-1.0,,,-1.0,,,0.25,19552014,2004uuuu,0.111111,30600,40100,0.0,cr,mu,0.0,[0001-7051],[],-1.0,,,-0.5,,9362489352.0,-1.0,,,-0.5,,youngneil,0.412963,"zakład badania ssaków, polskiej akademii nauk",neil young,-1.0,,,-0.5,,warner,-1.0,,,0.570324,acta theriologica,greatest hits,-1.0,,,-0.5,,1
112295,0,-1.0,,,-1.0,,,-1.0,,,-0.5,,10.1093/sw/36.1.86,-1.0,,,0.5,1988aaaa,199101uu,0.428571,10000,10053,0.0,mu,bk,1.0,[],[],-1.0,,,-1.0,,,-0.5,,36 1 1991 01 86 86,0.51,boccheriniluigi,hidalgohilda,0.494686,luigi boccherini; edizione a cura di aldo pais,[hilda hidalgo],-0.5,paisaldo,,-1.0,,,-1.0,,,0.613694,"sei sestetti op. 23; vol ii, per due violini, ...",aids: a guide to clinical counseling riva mill...,-1.0,,,-0.5,6 1 24 2 24 1 20 2 23 1 19 2 19,
9114,0,-1.0,,,-1.0,,,-0.5,universidad de salamanca,,-1.0,,,-0.5,,4.0,0.5,aaaaaaaa,2020uuuu,0.111111,30300,10000,0.0,cr,cf,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,[universidad de salamanca],,-1.0,,,-1.0,,,-1.0,,,0.574065,"acta salmanticensia, textos medievales",life is strange 2,-1.0,,,-0.5,,1
15292,0,-1.0,,,-1.0,,,-0.5,,pigband borste,-1.0,,,-1.0,,,0.5,2012aaaa,2020uuuu,0.428571,30000,20000,0.0,mu,bk,0.0,[978-3-03776-504-3],"[978-3-8346-4297-4, 3-8346-4297-5]",-1.0,,,-1.0,,,-1.0,,,-0.5,schneiderjörg,,0.510559,"jörg schneider, ines torelli und paul bühlmann",pigband borste,-0.5,"torelliines, bühlmannpaul",,-1.0,,,-1.0,,,0.470439,kasperlitheater,"vom schulanfang zum weihnachtsklang, 10 songs ...",-1.0,,,0.0,22,1
38545,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1955aaaa,1955uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,1.0,14 14,14 14,1.0,bataillegeorges,bataillegeorges,0.562199,de georges bataille,biographical and critical study by georges bat...,-0.5,,wainhouseaustryn,-0.5,,skira,-1.0,,,0.710526,"manet, études biographique et critique",manet,-1.0,,,0.777778,135,136
80696,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.25,2015aaaa,1967uuuu,0.428571,10400,10100,0.0,vm,mu,1.0,[],[],-1.0,,,-0.5,,415731723.0,-1.0,,,-1.0,,,-0.5,ein film von jemaine clement und taika waititi,,-0.5,"clementjemaine, waitititaika",,-0.5,,schirmer,-1.0,,,0.571077,"5 zimmer, küche, sarg, what we do in the shadows","twenty-four italian songs and arias, of the se...",-1.0,,,0.633333,1 85,1 100
88799,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.25,1973aaaa,2020uuuu,0.25,20000,20353,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-0.5,66 1973,,0.488047,nerudapablo,wyssenbachstefanie,0.537363,pablo neruda ; [auswahl und übers. von erich a...,stefanie wysssenbach,-1.0,,,-1.0,,,-1.0,,,0.501269,gedichte,"frische fische und prunkvolle bankette, handel...",-1.0,,,-0.5,446,
81067,0,-1.0,,,-1.0,,,-0.5,,paradox paradise,-1.0,,,-1.0,,,0.25,1984aaaa,2016uuuu,0.428571,40100,10300,0.0,mu,vm,1.0,[],[],-1.0,,,-0.5,,73.0,-1.0,,,-0.5,dejohnettejack,,-0.5,,a film by nicolas steiner ; original score par...,0.544771,"purcelljohn, murraydavid, johnsonhoward, reidr...","steinernicolas, hoferbrigitte",-1.0,,,-1.0,,,0.548068,"album album, jack dejohnette's special edition",above and below,-1.0,,,0.733333,1,1 119
65314,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,1973aaaa,19521958,0.111111,10000,30700,0.0,mu,cr,1.0,[],[],-0.5,m001068048,,-0.5,43368.0,,-1.0,,,-0.5,ligetigyörgy,,-0.5,györgy ligeti,,-0.5,ligetigyörgy,,-1.0,,,-1.0,,,0.531127,"sechs bagatellen, für bläserquintett (1953)","sinkentiku, new architecture of japan = till 1...",-0.5,,"new architecture of japan, architecture of jap...",-0.5,5,


uniques


Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
1815,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2012aaaa,2012uuuu,1.0,10000,10000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,0.809524,289 310,289,1.0,plattnerroland,plattnerroland,1.0,roland plattner,roland plattner,-1.0,,,-1.0,,,-1.0,,,0.851282,chronik der rechtsprechung 2009-2011 [im kanto...,chronik der rechtsprechung 2009-2011,-1.0,,,-1.0,,
1540,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2005aaaa,2005uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-1.0,,,1.0,289.0,289.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.432367,mozart,wolfgang amadeus mozart,0.649525,"mozartwolfgang amadeus, hahnhilary, zhunatalie","hahnhilary, zhunatalie",-1.0,,,-1.0,,,0.510171,"violin sonatas, k. 301, 304, 376 & 526","sonatas for piano and violin, sonaten für klav...",-0.5,,"violin sonatas k. 301, 304, 376 & 526",1.0,1,1
253,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1992aaaa,1992uuuu,1.0,20000,20000,1.0,bk,bk,0.0,"[3-7281-1755-2 (Vdf), 3-519-05031-5 (Teubner)]","[3-7281-1755-2 (vdf Zürich), 3-519-05031-5 (Te...",-1.0,,,-1.0,,,0.8,81 81,81,1.0,kochmichael,kochmichael,1.0,michael koch,michael koch,-1.0,,,0.679167,verlag der fachvereine an den schweizerischen ...,vdf,-1.0,,,0.808333,"städtebau in der schweiz, 1800-1990, entwicklu...",städtebau in der schweiz 1800-1990,-1.0,,,1.0,315,315
2340,1,-1.0,,,-1.0,,,-0.5,société suisse pour l'art des jardins,,-1.0,,,-1.0,,,1.0,20019999,20019999,1.0,30653,30653,1.0,cr,cr,0.0,[1424-9235],[2297-1297],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,"sggk, schweizerische gesellschaft für gartenku...",-1.0,,,-1.0,,,-1.0,,,0.796296,"topiaria helvetica, jahrbuch = revue annuelle ...","topiaria helvetica, jahrbuch",-1.0,,,-1.0,,
658,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1985aaaa,1985uuuu,0.428571,20000,20300,1.0,bk,bk,0.0,"[3-7278-0334-7 (Universitätsverl.), 3-525-5368...","[3-7278-0334-7, 3-525-53688-7]",-1.0,,,-1.0,,,0.8,65 65,65,1.0,sadekabdel-aziz fahmy,sadekabdel-aziz fahmy,1.0,abdel-aziz fahmy sadek,abdel-aziz fahmy sadek,-1.0,,,-1.0,,,-1.0,,,1.0,"contribution à l'étude de l'amdouat, les varia...","contribution à l'étude de l'amdouat, les varia...",-1.0,,,0.916667,346 48,346 4 48
479,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,2.0,0.5,19872016,19879999,1.0,20000,20000,1.0,bk,bk,0.0,[3-7772-8527-7],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75585,"hrsg. von severin corsten, günther pflug und f...",hrsg. von severin corsten ... [et al.] ; unter...,0.820671,"corstenseverin, schmidt-künsemüllerfriedrich a...","corstenseverin, bischoffbernhard",-1.0,,,-1.0,,,0.945946,"lexikon des gesamten buchwesens, lgb²",lexikon des gesamten buchwesens,0.916667,lgb²,lgb,0.0,9,1
222,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1992aaaa,1992uuuu,1.0,20000,20000,1.0,bk,bk,0.0,"[3-7281-1755-2 (Verlag der Fachvereine), 3-519...","[3-7281-1755-2, 3-519-05031-5]",-1.0,,,-1.0,,,0.8,81 81,81,1.0,kochmichael,kochmichael,1.0,michael koch,michael koch,-1.0,,,-1.0,,,-1.0,,,0.936762,"städtebau in der schweiz 1800-1990, entwicklun...","städtebau in der schweiz, 1800-1990, entwicklu...",-1.0,,,1.0,315,315
419,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,2.0,2.0,0.5,19872016,19879999,1.0,20000,20000,1.0,bk,bk,0.0,"[3-7772-8527-7, 3-7772-8721-0, 3-7772-8911-6, ...","[978-3-7772-8527-6 (Gesamtwerk), 3-7772-8721-0...",-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.812261,hrsg. von severin corsten ... [et al.],hrsg. von severin corsten ... [et al.] ; unter...,0.843077,corstenseverin,"corstenseverin, bischoffbernhard, pfluggünther...",-0.5,,a. hiersemann,-1.0,,,0.990991,"lexikon des gesamten buchwesens, lgb","lexikon des gesamten buchwesens, lgb2",0.916667,lgb,lgb2,-0.5,9,
1195,1,-1.0,,,-1.0,,,-0.5,,suisseoffice fédéral de la statistique,-1.0,,,-1.0,,,0.75,2005aaaa,2005uuuu,1.0,20000,20000,1.0,bk,bk,0.0,[3-303-01199-0],[],-1.0,,,-1.0,,,0.777778,2005 1,1 2005,1.0,raisfrançois,raisfrançois,1.0,"françois rais, esther salvisberg, dominique sp...","françois rais, esther salvisberg, dominique sp...",1.0,"salvisbergesther, spahndominique","salvisbergesther, spahndominique",-0.5,bundesamt für statistik,,-1.0,,,0.867965,zur verwendung von einzeldaten aus administrat...,zur verwendung von einzeldaten aus administrat...,-1.0,,,1.0,39,39
5,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2005aaaa,2005uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,-0.5,2785408.0,,-1.0,,,-1.0,,,0.661422,regie: alexader payne ; drehbuch: alexander pa...,regie: alexander payne,0.54272,"giamattipaul, haden churchthomas, madsenvirgin...",paynealexander,-1.0,,,-1.0,,,0.711111,sideways,"sideways, eine geschichte über das leben, die ...",-1.0,,,1.0,1 122,1 122


duplicate


## Summary

This chapter covers the central area of feature construction. For each attribute of Swissbib's raw data, a similarity metric has been decided and the corresponding feature of the feature matrix has been generated. With these metric values, the feature base DataFrame has been extended, and a new DataFrame holding the attribute values of the pairs together with their calculated similarity values has been generated. The similarity values will be the final features for training and performance testing of the models, compare [[JudACaps](./A_References.ipynb#judacaps)].

In [77]:
columns_metadata_dict['similarity_metrics']

{'coordinate_E': LCSStr({'qval': 1, 'external': True}),
 'coordinate_N': LCSStr({'qval': 1, 'external': True}),
 'corporate_full': LCSStr({'qval': 1, 'external': True}),
 'doi': Identity({'qval': 1, 'external': True}),
 'edition': Jaccard({'qval': 1, 'as_set': False, 'external': True}),
 'exactDate': Hamming({'qval': 1, 'test_func': <function Base._ident at 0x7fb3595842f0>, 'truncate': False, 'external': True}),
 'format_prefix': Identity({'qval': 1, 'external': True}),
 'format_postfix': Jaccard({'qval': 2, 'as_set': False, 'external': True}),
 'isbn': Identity({'qval': 1, 'external': True}),
 'ismn': Identity({'qval': 1, 'external': True}),
 'musicid': LCSStr({'qval': 1, 'external': True}),
 'part': StrCmp95({'long_strings': False, 'external': True}),
 'person_100': StrCmp95({'long_strings': False, 'external': True}),
 'person_700': StrCmp95({'long_strings': False, 'external': True}),
 'person_245c': Jaro({'qval': 1, 'long_tolerance': False, 'winklerize': False, 'external': True}),
 

The similarity metric decided for each attribute has been added as an additional piece of information to the columns metadata dictionary. The following table gives this summary in a structured form and lists the metric used for each attribute. For better orientation, attributes with the same font color hold similar types of values (see the description column).

| attribute     | subtype | description | similarity metric |
| ------------- |:--------|:------------|:------------------|
|<font color='red'>[coordinate](#coordinate)</font>|<font color='red'>\_E</font>|<font color='red'>Code(9)</font>|<font color='red'>LCSStr</font>|
|               |<font color='red'>\_N</font>|<font color='red'>Code(9)</font>|<font color='red'>LCSStr</font>|
|<font color='blue'>[corporate](#corporate)</font>|<font color='blue'>\_full</font>|<font color='blue'>Name</font>|<font color='blue'>LCSStr</font>|
|<font color='green'>[doi](#doi)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='orange'>[edition](#edition)</font>|         |<font color='orange'>Number</font>|<font color='orange'>Jaccard</font>|
|<font color='black'>[exactDate](#exactDate)</font>|         |<font color='black'>Date</font>|<font color='black'>Hamming</font>|
|<font color='red'>[format](#format)</font>|<font color='red'>\_prefix</font>|<font color='red'>Code(2)</font>|<font color='red'>Identity</font>|
|               |<font color='red'>\_postfix</font>|<font color='red'>Code(6)</font>|<font color='red'>Jaccard (qval=2)</font>|
|<font color='green'>[isbn](#isbn)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='green'>[ismn](#ismn)</font>|         |<font color='green'>Identifier</font>|<font color='green'>Identity</font>|
|<font color='green'>[musicid](#musicid)</font>|         |<font color='green'>Identifier</font>|<font color='green'>LCSStr</font>|
|<font color='orange'>[part](#part)</font>|         |<font color='orange'>Number</font>|<font color='orange'>StrCmp95</font>|
|<font color='blue'>[person](#person)</font>|<font color='blue'>\_100</font>|<font color='blue'>Name</font>|<font color='blue'>StrCmp95</font>|
|               |<font color='blue'>\_700</font>|<font color='blue'>Name</font>|<font color='blue'>StrCmp95</font>|
|               |<font color='blue'>\_245c</font>|<font color='blue'>Name</font>|<font color='blue'>Jaro</font>|
|<font color='blue'>[pubinit](#pubinit)</font>|         |<font color='blue'>Name</font>|<font color='blue'>Jaro</font>|
|<font color='orange'>[scale](#scale)</font>|         |<font color='orange'>Number</font>|<font color='orange'>Jaccard</font>|
|<font color='blue'>[ttlfull](#ttlfull)</font>|<font color='blue'>\_245</font>|<font color='blue'>String</font>|<font color='blue'>Jaro</font>|
|               |<font color='blue'>\_246</font>|<font color='blue'>String</font>|<font color='blue'>Jaro</font>|
|<font color='orange'>[volumes](#volumes)</font>|         |<font color='orange'>Number</font>|<font color='orange'>StrCmp95</font>|

### Full Feature Matrix with Target Vector Handover

To hand over the resulting DataFrame of this chapter, the DataFrame is saved into a compressed pickle file that will be read as input in the subsequent chapter [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb).

In [78]:
# Store into compressed intermediary file
with bz2.BZ2File(os.path.join(path_goldstandard,
                       'labelled_feature_matrix_full.pkl'), 'w') as df_output_file:
    pk.dump(df_attribute_with_sim_feature, df_output_file)
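
For orientation, the following sketch shows how an object written this way can be read back in a subsequent chapter. It uses a small dictionary and a temporary directory instead of the real DataFrame and $\texttt{path\_goldstandard}$. The helper names `save_compressed` and `load_compressed` as well as the file name are hypothetical and only serve this illustration.

```python
import bz2
import os
import pickle
import tempfile


def save_compressed(obj, path):
    """Pickle an object into a bz2-compressed file."""
    with bz2.BZ2File(path, 'wb') as output_file:
        pickle.dump(obj, output_file)


def load_compressed(path):
    """Read a pickled object back from a bz2-compressed file."""
    with bz2.BZ2File(path, 'rb') as input_file:
        return pickle.load(input_file)


# Stand-in for the feature matrix; the real object is a pandas DataFrame
payload = {'duplicates': [0, 1], 'scale_delta': [-1.0, 0.833333]}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example_feature_matrix.pkl')
    save_compressed(payload, example_path)
    restored = load_compressed(example_path)

print(restored == payload)
```

The reading side has to open the file with $\texttt{bz2}$ as well, because the pickle stream is compressed despite the $\texttt{.pkl}$ file extension.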

The full metadata dictionary is to be persisted for handover to subsequent chapters.

In [79]:
# The target is still needed for the feature matrix
columns_metadata_dict['features'].append('duplicates')

for k in columns_metadata_dict.keys():
    print(k, '\n', columns_metadata_dict[k], '\n')

data_analysis_columns 
 ['coordinate_E', 'coordinate_N', 'corporate_full', 'doi', 'edition', 'exactDate', 'format_prefix', 'format_postfix', 'isbn', 'ismn', 'musicid', 'part', 'person_100', 'person_700', 'person_245c', 'pubinit', 'scale', 'ttlfull_245', 'ttlfull_246', 'volumes'] 

columns_to_use 
 ['duplicates', 'coordinate_E_x', 'coordinate_E_y', 'coordinate_N_x', 'coordinate_N_y', 'corporate_full_x', 'corporate_full_y', 'doi_x', 'doi_y', 'edition_x', 'edition_y', 'exactDate_x', 'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x', 'format_postfix_y', 'isbn_x', 'isbn_y', 'ismn_x', 'ismn_y', 'musicid_x', 'musicid_y', 'part_x', 'part_y', 'person_100_x', 'person_100_y', 'person_700_x', 'person_700_y', 'person_245c_x', 'person_245c_y', 'pubinit_x', 'pubinit_y', 'scale_x', 'scale_y', 'ttlfull_245_x', 'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x', 'volumes_y'] 

similarity_metrics 
 {'coordinate_E': LCSStr({'qval': 1, 'external': True}), 'coordinate_N': L

In [80]:
# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'columns_metadata.pkl'), 'wb') as dict_output_file:
    pk.dump(columns_metadata_dict, dict_output_file)