# Comparison of Similarity Metrics

This appendix compares all similarity metric implementations of the textdistance library [[TeDi](./A_References.ipynb#tedi)] on example string pairs for each feature to be calculated. The comparison uses the goldstandard data and serves as the basis for deciding which similarity metric to use for each feature.

In [1]:
import textdistance as tedi

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Functions for Similarity Metrics Analysis](#Functions-for-Similarty-Metrics-Analysis)
- [Similarity Metric Assessments](#Similarity-Metric-Assessments)
    - [coordinate](#coordinate)
    - [corporate](#corporate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [musicid](#musicid)
    - [part](#part)
    - [person](#person)
    - [pubinit](#pubinit)
    - [scale](#scale)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)

## Data Takeover

As a first step, the training data set produced in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb) is loaded. Some sample strings from this data set are used to compare and assess the different metric implementations.

In [2]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.
2,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.
4,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.


## Functions for Similarty Metrics Analysis

The metric algorithms available in library [[TeDi](./A_References.ipynb#tedi)] are listed in the dictionary below. The dictionary makes it possible to calculate a similarity value with each algorithm implemented in the library.

In [3]:
tedi_algorithms = {
    # Edit based
    'Hamming' : tedi.Hamming(), 'MLIPNS' : tedi.MLIPNS(), 'Levenshtein' : tedi.Levenshtein(),
    'DamerauLevenshtein' : tedi.DamerauLevenshtein(), 'Jaro' : tedi.Jaro(), 'JaroWinkler' : tedi.JaroWinkler(),
    'StrCmp95' : tedi.StrCmp95(), 'NeedlemanWunsch' : tedi.NeedlemanWunsch(), 'Gotoh' : tedi.Gotoh(),
    'SmithWaterman' : tedi.SmithWaterman(),
    # Token based
    'Jaccard' : tedi.Jaccard(), 'Sorensen' : tedi.Sorensen(), 'Tversky' : tedi.Tversky(), 'Overlap' : tedi.Overlap(),
    'Tanimoto' : tedi.Tanimoto(), 'Cosine' : tedi.Cosine(), 'MongeElkan' : tedi.MongeElkan(), 'Bag' : tedi.Bag(),
    # Sequence based
    'LCSSeq' : tedi.LCSSeq(), 'LCSStr' : tedi.LCSStr(), 'RatcliffObershelp' : tedi.RatcliffObershelp(),
    # Compression based
    'ArithNCD' : tedi.ArithNCD(), 'RLENCD' : tedi.RLENCD(), 'BWTRLENCD' : tedi.BWTRLENCD(),
    'SqrtNCD' : tedi.SqrtNCD(), 'EntropyNCD' : tedi.EntropyNCD(), 'BZ2NCD' : tedi.BZ2NCD(),
    'LZMANCD' : tedi.LZMANCD(), 'ZLIBNCD' : tedi.ZLIBNCD(),
    # Phonetic
    'MRA' : tedi.MRA(), 'Editex' : tedi.Editex(),
    # Simple
    'Prefix' : tedi.Prefix(), 'Postfix' : tedi.Postfix(), 'Length' : tedi.Length(), 'Identity' : tedi.Identity(),
    'Matrix' : tedi.Matrix()
}

This appendix uses the function $\texttt{apply}\_\texttt{similarities()}$, which applies the $\texttt{.normalized}\_\texttt{similarity()}$ method of $\texttt{textdistance}$ for each algorithm object available in the library. The function is implemented in the separate code file [data_analysis_funcs.py](./data_analysis_funcs.py).

In [4]:
import data_analysis_funcs as daf
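
The helper itself is not reproduced in this appendix. As a rough sketch of the idea, and not the actual code of data_analysis_funcs.py, the following stand-in applies the $\texttt{.normalized}\_\texttt{similarity()}$ method of one algorithm object to every string pair and stores the result under the algorithm's name. The names `apply_similarities_sketch` and `ToyHamming` are hypothetical. `ToyHamming` only mimics the interface of a $\texttt{textdistance}$ algorithm object so that the example runs without the library, and plain dictionaries stand in for the rows of the DataFrame.

```python
class ToyHamming:
    """Hypothetical stand-in that exposes normalized_similarity() like a textdistance object."""

    def normalized_similarity(self, s1, s2):
        longest = max(len(s1), len(s2))
        if longest == 0:
            return 1.0  # convention chosen here: two empty strings are identical
        # Positions beyond the end of the shorter string count as mismatches.
        matches = sum(c1 == c2 for c1, c2 in zip(s1, s2))
        return matches / longest


def apply_similarities_sketch(pairs, algorithm, name):
    """Store one similarity value per string pair under the key `name` (assumed behaviour)."""
    # Assumption: the real helper prints the algorithm name, as the cell outputs suggest.
    print(name)
    for pair in pairs:
        pair[name] = algorithm.normalized_similarity(pair['str1'], pair['str2'])
    return pairs


# Plain dictionaries stand in for the rows of df_string_pairs.
pairs = [
    {'str1': 'bk', 'str2': 'bk'},
    {'str1': 'bk', 'str2': 'mu'},
]
apply_similarities_sketch(pairs, ToyHamming(), 'Hamming')
print(pairs)
```

In the notebook, an entry of `tedi_algorithms` takes the place of the stand-in object, and the result is one additional column per algorithm.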

## Similarity Metric Assessments

This section iterates over all available similarity metrics of library $\texttt{textdistance}$ and calculates the similarity values for sample string pairs of each feature of the model. The calculated values are assessed visually in order to decide which algorithm to use in chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb). The visual assessment is validated against the literature [[Chri2012](./A_References.ipynb#chri2012)].

In [5]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(tedi_algorithms) + 3

def num_of_samples(df):
    """Return the number of rows to sample: all rows of df, capped at 30."""
    max_number_of_num_samples = 30

    return min(len(df), max_number_of_num_samples)

### coordinate

In [6]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'coordinate_E_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
4,e0074147,e0060811,0.375,0.0,0.375,0.375,0.666667,0.666667,0.666667,0.6875,0.375,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.03125,0.5,0.5,0.375,0.5,0.0,0.111111,0.348914,0.862591,0.821429,0.84,0.428571,0.333333,0.375,0.375,0.0,1.0,0.0,0.0
1,e0074147,e0055700,0.375,0.0,0.375,0.375,0.666667,0.666667,0.666667,0.6875,0.375,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.039062,0.5,0.5,0.375,0.5,0.0,0.222222,0.357649,0.812782,0.857143,0.84,0.428571,0.333333,0.375,0.375,0.0,1.0,0.0,0.0
5,e0074147,e0074147,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.111111,0.585786,1.0,0.928571,0.76,0.857143,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,e0074147,e0080855,0.375,0.0,0.375,0.375,0.583333,0.583333,0.583333,0.6875,0.375,0.230769,0.375,0.230769,0.375,-2.115477,0.375,0.023438,0.375,0.375,0.375,0.375,0.0,0.222222,0.239639,0.771151,0.857143,0.84,0.428571,0.333333,0.375,0.375,0.0,1.0,0.0,0.0
0,e0074147,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.307692,0.0,0.36,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,e0074147,e0080851,0.375,0.0,0.375,0.375,0.666667,0.666667,0.666667,0.6875,0.375,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.03125,0.5,0.5,0.375,0.5,0.0,0.222222,0.333476,0.848074,0.821429,0.84,0.428571,0.333333,0.375,0.375,0.0,1.0,0.0,0.0


### corporate

In [7]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'corporate_710_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
4,société suisse pour la recherche en éducation,"opernhaus (zürich), opernhaus (zürich), orches...",0.054545,0.0,0.218182,0.254545,0.60806,0.60806,0.620181,0.5,0.133333,0.515152,0.68,0.515152,0.755556,-0.956931,0.683426,0.009877,0.618182,0.345455,0.072727,0.24,0.0,-0.019231,0.427265,0.943363,0.477612,0.555556,0.25,0.166667,0.327273,0.0,0.0,0.818182,0.0,0.0
13,société suisse pour la recherche en éducation,"rundfunkchor, sächsische staatskapelle dresden",0.0,0.0,0.086957,0.130435,0.646004,0.646004,0.659193,0.51087,0.088889,0.568966,0.725275,0.568966,0.733333,-0.813587,0.725319,0.01037,0.717391,0.304348,0.086957,0.241758,0.0,0.108696,0.485227,0.95238,0.567164,0.466667,0.415094,0.166667,0.25,0.0,0.021739,0.978261,0.0,0.0
2,société suisse pour la recherche en éducation,"metropolitan opera, metropolitan opera, orches...",0.052632,0.0,0.22807,0.245614,0.604465,0.604465,0.616394,0.5,0.155556,0.478261,0.647059,0.478261,0.733333,-1.06413,0.651584,0.010123,0.578947,0.333333,0.070175,0.294118,0.0,0.074074,0.475796,0.962692,0.403226,0.511111,0.326923,0.0,0.289474,0.0,0.0,0.789474,0.0,0.0
20,société suisse pour la recherche en éducation,"interkantonale lehrmittelzentrale (luzern), st...",0.037975,0.0,0.202532,0.240506,0.521086,0.521086,0.552478,0.386076,0.111111,0.305263,0.467742,0.305263,0.644444,-1.711875,0.486383,0.009877,0.367089,0.240506,0.050633,0.241935,0.0,-0.026667,0.348636,0.93523,0.302632,0.403509,0.264706,0.0,0.329114,0.0,0.0,0.56962,0.0,0.0
23,société suisse pour la recherche en éducation,les arts florissants,0.0,0.0,0.155556,0.177778,0.561111,0.561111,0.582778,0.3,0.1,0.354167,0.523077,0.354167,0.85,-1.4975,0.566667,0.007654,0.377778,0.2,0.066667,0.246154,0.0,0.0,0.325097,0.872234,0.435484,0.466667,0.25,0.166667,0.311111,0.0,0.0,0.444444,0.0,0.0
25,société suisse pour la recherche en éducation,oper (köln),0.0,0.0,0.111111,0.133333,0.50404,0.50404,0.515354,0.177778,0.0,0.142857,0.25,0.142857,0.636364,-2.807355,0.314627,0.005185,0.155556,0.133333,0.044444,0.214286,0.0,0.021739,0.188756,0.869414,0.403226,0.466667,0.115385,0.0,0.222222,0.0,0.0,0.244444,0.0,0.0
6,société suisse pour la recherche en éducation,opernhaus,0.022222,0.0,0.133333,0.133333,0.522222,0.522222,0.522222,0.166667,0.0,0.2,0.333333,0.2,1.0,-2.321928,0.447214,0.006173,0.2,0.133333,0.044444,0.222222,0.0,0.021739,0.245319,0.85736,0.435484,0.466667,0.173077,0.0,0.233333,0.0,0.0,0.2,0.0,0.0
31,société suisse pour la recherche en éducation,schweiz,0.022222,0.0,0.111111,0.111111,0.470106,0.470106,0.503122,0.133333,0.142857,0.106383,0.192308,0.106383,0.714286,-3.232661,0.281718,0.004444,0.111111,0.111111,0.044444,0.192308,0.0,0.043478,0.143395,0.767253,0.354839,0.466667,0.115385,0.333333,0.188889,0.022222,0.0,0.155556,0.0,0.0
21,société suisse pour la recherche en éducation,interkantonale lehrmittelzentrale (luzern),0.066667,0.0,0.177778,0.2,0.534536,0.534536,0.571361,0.555556,0.119048,0.338462,0.505747,0.338462,0.52381,-1.562936,0.506048,0.007901,0.488889,0.333333,0.044444,0.16092,0.0,-0.021739,0.396174,0.91692,0.354839,0.511111,0.326923,0.0,0.277778,0.0,0.0,0.933333,0.0,0.0
29,société suisse pour la recherche en éducation,allgemeine geschichtforschende gesellschaft de...,0.018182,0.0,0.163636,0.236364,0.626263,0.626263,0.646465,0.481818,0.111111,0.492537,0.66,0.492537,0.733333,-1.021695,0.663325,0.009383,0.6,0.290909,0.054545,0.28,0.0,0.057692,0.431842,0.95609,0.40625,0.510204,0.37037,0.0,0.345455,0.0,0.0,0.818182,0.0,0.0


Monge-Elkan, Jaccard with q-grams, and LCSStr seem to be valid metrics for both $\texttt{corporate}$ attributes because of the way their algorithms work [[Chri2012](./A_References.ipynb#chri2012)]. The metrics to be chosen will be analysed and justified in chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb).
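
To illustrate what "Jaccard with q-grams" means, the following minimal sketch splits each string into overlapping character bigrams and divides the size of the intersection of the two bigram sets by the size of their union. It is a plain-Python illustration that does not use $\texttt{textdistance}$; the function names `qgrams` and `jaccard_qgram_similarity` and the choice of $q = 2$ are assumptions made for this example only.

```python
def qgrams(text, q=2):
    """Return the set of overlapping character q-grams of `text`."""
    if len(text) < q:
        # Strings shorter than q are kept as a single token (empty string: no token).
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_qgram_similarity(s1, s2, q=2):
    """Jaccard similarity: size of intersection over size of union of the q-gram sets."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(a & b) / len(a | b)


# A name and its longer variant share all bigrams of the shorter string.
print(jaccard_qgram_similarity('opernhaus', 'opernhaus (zürich)'))
# Two unrelated names share no bigram at all.
print(jaccard_qgram_similarity('opernhaus', 'schweiz'))
```

Compared with single characters, q-grams preserve some of the character order, so unrelated corporate names that merely use the same letters score lower.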

### doi

Attribute $\texttt{doi}$ is treated as a list of string elements, see chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb). The metrics comparison is therefore omitted here.

### edition

In [8]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'edition_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
36,Neu durchges. 5. Aufl,6. überarbeitete und erweiterte Auflage,0.076923,0.0,0.205128,0.230769,0.54823,0.54823,0.570818,0.371795,0.190476,0.304348,0.466667,0.304348,0.666667,-1.716207,0.4892,0.018141,0.358974,0.25641,0.128205,0.233333,0.0,0.05,0.302922,0.927241,0.492308,0.512195,0.214286,0.0,0.269231,0.0,0.0,0.538462,0.0,0.0
4,Neu durchges. 5. Aufl,Edition Bibliothek,0.0,0.0,0.047619,0.047619,0.371958,0.371958,0.482804,0.452381,0.055556,0.147059,0.25641,0.147059,0.277778,-2.765535,0.257172,0.00907,0.238095,0.142857,0.047619,0.051282,0.0,0.090909,0.183407,0.827257,0.6,0.69697,0.222222,0.0,0.166667,0.0,0.0,0.857143,0.0,0.0
44,Neu durchges. 5. Aufl,2. Aufl. als Studienausg.,0.04,0.0,0.12,0.2,0.533391,0.533391,0.598984,0.46,0.0,0.483871,0.652174,0.483871,0.714286,-1.047306,0.654654,0.018141,0.6,0.24,0.24,0.26087,0.0,0.12,0.377032,0.933719,0.56,0.657143,0.354839,0.0,0.22,0.0,0.0,0.84,0.0,0.0
33,Neu durchges. 5. Aufl,[Nouv. éd.],0.0,0.0,0.142857,0.142857,0.554834,0.554834,0.58254,0.333333,0.181818,0.230769,0.375,0.230769,0.545455,-2.115477,0.394771,0.011338,0.285714,0.238095,0.095238,0.3125,0.0,0.045455,0.200324,0.834726,0.555556,0.636364,0.222222,0.0,0.238095,0.0,0.0,0.52381,0.0,0.0
0,Neu durchges. 5. Aufl,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.210571,0.0,0.272727,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27,Neu durchges. 5. Aufl,"4., überarb. u. erweiterte Aufl",0.032258,0.0,0.290323,0.290323,0.576903,0.576903,0.610532,0.483871,0.285714,0.333333,0.5,0.333333,0.619048,-1.584963,0.50951,0.015873,0.419355,0.322581,0.16129,0.384615,0.0,0.09375,0.303267,0.922955,0.517241,0.567568,0.236842,0.5,0.33871,0.0,0.16129,0.677419,0.0,0.0
7,Neu durchges. 5. Aufl,"Neu durchges. Aufl., 5. Aufl., 43.-46. Tsd",0.333333,0.0,0.5,0.5,0.785714,0.871429,0.871429,0.5,0.095238,0.5,0.666667,0.5,1.0,-1.0,0.707107,0.02381,0.5,0.5,0.333333,0.666667,0.0,0.170732,0.391919,0.936204,0.619048,0.72093,0.525,0.5,0.511905,0.333333,0.0,0.5,0.0,0.0
5,Neu durchges. 5. Aufl,[Orig.-Ausg.],0.0,0.0,0.142857,0.142857,0.471306,0.471306,0.521123,0.380952,0.076923,0.259259,0.411765,0.259259,0.538462,-1.947533,0.423659,0.010204,0.333333,0.238095,0.095238,0.294118,0.0,0.045455,0.231314,0.850677,0.577778,0.69697,0.222222,0.0,0.214286,0.0,0.0,0.619048,0.0,0.0
6,Neu durchges. 5. Aufl,"5. Aufl., 43.-46. Tsd., neu durchges. Aufl",0.02381,0.0,0.333333,0.333333,0.52381,0.52381,0.552381,0.380952,0.666667,0.465116,0.634921,0.465116,0.952381,-1.104337,0.673435,0.022676,0.47619,0.404762,0.309524,0.539683,0.0,0.15,0.371155,0.929957,0.571429,0.674419,0.5,0.5,0.369048,0.0,0.142857,0.5,0.0,0.0
3,Neu durchges. 5. Aufl,Nach dem Urtext der Neuen Mozart-Ausgabe,0.025,0.0,0.2,0.225,0.545452,0.545452,0.57496,0.3625,0.095238,0.326087,0.491803,0.326087,0.714286,-1.616671,0.517549,0.018141,0.375,0.225,0.075,0.229508,0.0,0.04878,0.288379,0.890046,0.484375,0.512195,0.159091,0.166667,0.275,0.025,0.0,0.525,0.0,0.0


### exactDate

In [9]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'exactDate_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
66,1909uuuu,1960uuuu,0.75,1.0,0.75,0.75,0.916667,0.933333,0.916667,0.875,0.875,0.75,0.777778,0.875,0.777778,0.875,-0.36257,0.875,0.0625,0.875,0.875,0.5,0.875,0.0,0.375,0.504218,0.932393,0.9,0.84,0.5,0.5,0.75,0.25,0.5,1.0,0.0,0.0
70,1909uuuu,170uuuuu,0.75,1.0,0.75,0.875,0.833333,0.85,0.833333,0.875,0.875,0.75,0.6,0.75,0.6,0.75,-0.736966,0.75,0.046875,0.75,0.75,0.5,0.75,0.0,0.375,0.444689,0.89341,0.866667,0.84,0.5,0.5,0.75,0.125,0.5,1.0,0.0,0.0
5,1909uuuu,2008uuuu,0.625,1.0,0.625,0.75,0.75,0.75,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.375,0.342383,0.880793,0.866667,0.84,0.5,0.0,0.625,0.0,0.5,1.0,0.0,0.0
60,1909uuuu,18801890,0.125,0.0,0.125,0.125,0.583333,0.583333,0.583333,0.5625,0.5625,0.125,0.230769,0.375,0.230769,0.375,-2.115477,0.375,0.03125,0.375,0.375,0.25,0.375,0.0,0.111111,0.368042,0.806831,0.733333,0.84,0.428571,0.333333,0.5,0.125,0.0,1.0,0.0,0.0
37,1909uuuu,19959999,0.25,0.0,0.25,0.375,0.583333,0.583333,0.583333,0.625,0.625,0.25,0.230769,0.375,0.230769,0.375,-2.115477,0.375,0.023438,0.375,0.375,0.25,0.375,0.0,0.0,0.299409,0.704101,0.766667,0.84,0.416667,0.75,0.75,0.25,0.0,1.0,0.0,0.0
79,1909uuuu,19819999,0.25,0.0,0.25,0.25,0.583333,0.583333,0.583333,0.625,0.625,0.25,0.230769,0.375,0.230769,0.375,-2.115477,0.375,0.023438,0.375,0.375,0.25,0.375,0.0,0.111111,0.311531,0.754252,0.833333,0.84,0.416667,0.4,0.625,0.25,0.0,1.0,0.0,0.0
26,1909uuuu,18uuuuuu,0.625,1.0,0.625,0.75,0.75,0.775,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.25,0.345941,0.777267,0.866667,0.76,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
59,1909uuuu,1941uuuu,0.75,1.0,0.75,0.75,0.833333,0.866667,0.833333,0.875,0.875,0.75,0.6,0.75,0.6,0.75,-0.736966,0.75,0.054688,0.75,0.75,0.5,0.75,0.0,0.375,0.468378,0.943404,0.9,0.84,0.5,0.5,0.75,0.25,0.5,1.0,0.0,0.0
34,1909uuuu,2004uuuu,0.625,1.0,0.625,0.75,0.75,0.75,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.375,0.342383,0.880793,0.866667,0.84,0.5,0.0,0.625,0.0,0.5,1.0,0.0,0.0
12,1909uuuu,2006uuuu,0.625,1.0,0.625,0.75,0.75,0.75,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.375,0.342383,0.880793,0.866667,0.84,0.5,0.0,0.625,0.0,0.5,1.0,0.0,0.0


Attribute $\texttt{exactDate}$ is a string of eight characters, each of which is a digit or a letter. To calculate the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance), the two strings of a pair are compared character by character, and the distance is the number of positions at which they differ. It is thus an edit distance, that is, a count of the edit operations needed to convert one string into the other, restricted to substitutions [[Chri2012](./A_References.ipynb#chri2012)]. The Hamming similarity follows from this distance $d$ and the string length $n$ as $1 - d/n$, which is easy to verify for the examples in the DataFrame above. The Hamming similarity shall be used for attribute $\texttt{exactDate}$.
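
As a small worked check of the Hamming column in the DataFrame above, the following sketch recomputes the similarity for two of the listed string pairs. The function name `hamming_similarity` is chosen for this example; it is a plain-Python illustration and not the $\texttt{textdistance}$ implementation.

```python
def hamming_similarity(s1, s2):
    """Normalised Hamming similarity for two strings of equal length."""
    if len(s1) != len(s2):
        raise ValueError('Hamming similarity is defined here for equal-length strings only')
    if not s1:
        return 1.0  # convention chosen here: two empty strings are identical
    # The distance is the number of positions with differing characters.
    distance = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return 1 - distance / len(s1)


# Two of eight positions differ: 1 - 2/8
print(hamming_similarity('1909uuuu', '1960uuuu'))  # 0.75
# Seven of eight positions differ: 1 - 7/8
print(hamming_similarity('1909uuuu', '18801890'))  # 0.125
```

Both results agree with the values listed for these pairs in the Hamming column.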

Looking at the goldstandard data, the Hamming similarity has one drawback, though. A position of the attribute may hold the letter 'u' for an 'unknown' digit instead of a number. Compared with a digit, a letter 'u' contributes an edit distance of 1, which need not be true for the bibliographic units that the records describe. Conversely, two strings with a letter 'u' at the same position need not have a distance of 0 for the bibliographic units. For this reason, the Hamming similarity will be adapted for the case that a letter 'u' occurs in one of the strings of the pair: the similarity is increased by 1/16 for each unknown digit, where the number of unknown digits is the larger of the two counts in the compared pair. This algorithm, which is based on the Hamming similarity, will be implemented in chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb).
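
The description of the adapted similarity leaves some details open. The following sketch shows one possible reading only and is not the implementation of chapter Feature Matrix Generation, which may differ. It assumes that positions holding a 'u' in either string are left out of the ordinary character comparison, and that each unknown digit then contributes half the weight of a matching position, which is 1/16 for strings of length 8. The function name `adapted_date_similarity` and its parameter `unknown` are hypothetical.

```python
def adapted_date_similarity(d1, d2, unknown='u'):
    """One possible reading of the adapted Hamming similarity (assumption, see lead-in)."""
    if len(d1) != len(d2):
        raise ValueError('date strings must have equal length')
    length = len(d1)
    if length == 0:
        return 1.0  # convention chosen here: two empty strings are identical
    # Ordinary comparison, restricted to positions where both characters are known.
    known_matches = sum(
        c1 == c2 for c1, c2 in zip(d1, d2) if unknown not in (c1, c2)
    )
    # The larger count of unknown digits in the pair; each adds half a position's weight.
    max_unknown = max(d1.count(unknown), d2.count(unknown))
    return known_matches / length + max_unknown / (2 * length)


print(adapted_date_similarity('1909uuuu', '1960uuuu'))  # 2/8 + 4/16 = 0.5
print(adapted_date_similarity('2009uuuu', '2009uuuu'))  # 4/8 + 4/16 = 0.75
print(adapted_date_similarity('20091231', '20091231'))  # 8/8 = 1.0
```

Under this reading, two records with identical years but unknown month and day no longer reach a similarity of 1, and a 'u' compared with a digit is no longer counted as a full mismatch.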

### format

In [10]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'format_prefix_x')

for algorithm in tedi_algorithms :
    if algorithm not in [] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
ArithNCD
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,ArithNCD,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
2,bk,mu,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,bk,vm,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-0.333333,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
0,bk,bk,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.333333,0.0,0.333333,0.585786,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,bk,mp,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,bk,cr,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,bk,cf,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.333333,0.0,0.5,0.863636,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [11]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'format_postfix_x')

for algorithm in tedi_algorithms :
    if algorithm not in [] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
ArithNCD
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,ArithNCD,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
12,30600,10000,0.666667,1.0,0.666667,0.833333,0.777778,0.8,0.777778,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.055556,0.666667,0.666667,0.333333,0.5,-0.833333,0.166667,0.285714,0.309017,0.752403,0.769231,0.92,0.583333,0.4,0.666667,0.166667,0.333333,1.0,0.0,0.0
14,30600,20800,0.666667,1.0,0.666667,0.666667,0.777778,0.8,0.777778,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.055556,0.666667,0.666667,0.333333,0.666667,-0.833333,0.166667,0.428571,0.292893,0.851959,0.875,0.92,0.5,0.6,0.666667,0.166667,0.333333,1.0,0.0,0.0
16,30600,30000,0.833333,1.0,0.833333,1.0,0.888889,0.922222,0.888889,0.916667,0.916667,0.833333,0.714286,0.833333,0.714286,0.833333,-0.485427,0.833333,0.069444,0.833333,0.833333,0.5,0.833333,-0.833333,0.166667,0.285714,0.455464,0.826424,0.769231,0.92,0.583333,0.6,0.833333,0.5,0.333333,1.0,0.0,0.0
9,30600,20053,0.333333,0.0,0.333333,0.5,0.666667,0.666667,0.666667,0.666667,0.666667,0.333333,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.069444,0.666667,0.5,0.333333,0.5,-1.333333,0.166667,0.285714,0.353353,0.810547,0.84,0.92,0.5,0.4,0.5,0.166667,0.0,1.0,0.0,0.0
13,30600,30500,0.833333,1.0,0.833333,0.833333,0.888889,0.922222,0.888889,0.916667,0.916667,0.833333,0.714286,0.833333,0.714286,0.833333,-0.485427,0.833333,0.069444,0.833333,0.833333,0.5,0.833333,-0.666667,0.166667,0.428571,0.43934,0.92598,0.875,0.92,0.5,0.8,0.833333,0.5,0.333333,1.0,0.0,0.0
2,30600,40100,0.666667,1.0,0.666667,0.666667,0.777778,0.8,0.777778,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.055556,0.666667,0.666667,0.333333,0.666667,-0.714286,0.166667,0.428571,0.292893,0.851959,0.791667,0.92,0.5,0.6,0.666667,0.166667,0.333333,1.0,0.0,0.0
18,30600,10053,0.333333,0.0,0.333333,0.5,0.666667,0.666667,0.666667,0.666667,0.666667,0.333333,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.069444,0.666667,0.5,0.333333,0.5,-1.333333,0.166667,0.285714,0.353353,0.810547,0.84,0.92,0.5,0.4,0.5,0.166667,0.0,1.0,0.0,0.0
6,30600,10200,0.666667,1.0,0.666667,0.666667,0.777778,0.8,0.777778,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.055556,0.666667,0.666667,0.333333,0.666667,-0.333333,0.166667,0.428571,0.292893,0.851959,0.875,0.92,0.5,0.6,0.666667,0.166667,0.333333,1.0,0.0,0.0
19,30600,30053,0.5,1.0,0.5,0.666667,0.777778,0.844444,0.777778,0.75,0.75,0.5,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.069444,0.666667,0.666667,0.5,0.666667,-0.833333,0.166667,0.428571,0.426519,0.878222,0.84,0.92,0.5,0.6,0.666667,0.5,0.0,1.0,0.0,0.0
7,30600,10347,0.333333,0.0,0.333333,0.333333,0.555556,0.555556,0.555556,0.666667,0.666667,0.333333,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.069444,0.5,0.333333,0.333333,0.333333,-0.5,0.166667,0.285714,0.286378,0.731155,0.88,0.92,0.5,0.333333,0.333333,0.166667,0.0,1.0,0.0,0.0


### isbn

Attribute $\texttt{isbn}$ is treated as a list of string elements, see chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb). The metrics comparison is therefore omitted here.

### musicid

In [12]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'musicid_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
15,073 003-9,U.E. 245,0.0,0.0,0.0,0.0,0.412037,0.412037,0.412037,0.444444,0.0,0.0625,0.117647,0.0625,0.125,-4.0,0.117851,0.006173,0.111111,0.111111,0.111111,0.117647,0.0,0.1,0.07900857,0.723883,0.666667,0.777778,0.4,0.0,0.111111,0.0,0.0,0.888889,0.0,0.0
21,073 003-9,Frenetic 99036,0.0,0.0,0.142857,0.142857,0.493386,0.493386,0.493386,0.357143,0.111111,0.210526,0.347826,0.210526,0.444444,-2.247928,0.356348,0.04321,0.285714,0.214286,0.142857,0.26087,0.0,0.066667,0.2090796,0.741472,0.560976,0.793103,0.3,0.0,0.214286,0.0,0.0,0.642857,0.0,0.0
47,073 003-9,063-00835,0.444444,0.0,0.444444,0.444444,0.703704,0.733333,0.703704,0.722222,0.444444,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.037037,0.666667,0.555556,0.222222,0.555556,0.0,0.3,0.3398734,0.902516,0.827586,0.851852,0.4,0.333333,0.444444,0.111111,0.0,1.0,0.0,0.0
39,073 003-9,7794,0.111111,0.0,0.111111,0.333333,0.453704,0.453704,0.453704,0.277778,0.0,0.181818,0.307692,0.181818,0.5,-2.459432,0.333333,0.012346,0.222222,0.222222,0.111111,0.307692,0.0,0.1,0.1774283,0.660399,0.689655,0.777778,0.4,0.0,0.222222,0.0,0.0,0.444444,0.0,0.0
27,073 003-9,No 912E.E. 4355,0.0,0.0,0.066667,0.133333,0.4,0.4,0.417778,0.333333,0.111111,0.142857,0.25,0.142857,0.333333,-2.807355,0.258199,0.024691,0.2,0.133333,0.066667,0.083333,0.0,0.0625,0.1540756,0.716961,0.605263,0.793103,0.285714,0.0,0.133333,0.0,0.0,0.6,0.0,0.0
18,073 003-9,BA 4553 a,0.111111,0.0,0.111111,0.222222,0.481481,0.481481,0.481481,0.555556,0.111111,0.125,0.222222,0.125,0.222222,-3.0,0.222222,0.018519,0.222222,0.222222,0.222222,0.222222,0.0,0.1,0.1742784,0.77266,0.617647,0.851852,0.4,0.0,0.222222,0.0,0.0,1.0,0.0,0.0
5,073 003-9,502023,0.0,0.0,0.222222,0.222222,0.5,0.5,0.5,0.444444,0.166667,0.25,0.4,0.25,0.5,-2.0,0.408248,0.030864,0.333333,0.333333,0.111111,0.266667,0.0,0.2,0.222824,0.776482,0.724138,0.851852,0.4,0.0,0.222222,0.0,0.0,0.666667,0.0,0.0
12,073 003-9,433 221-2,0.333333,0.0,0.333333,0.444444,0.555556,0.555556,0.555556,0.666667,0.333333,0.285714,0.444444,0.285714,0.444444,-1.807355,0.444444,0.024691,0.444444,0.333333,0.222222,0.333333,0.0,0.1,0.2798665,0.837527,0.793103,0.851852,0.4,0.166667,0.444444,0.0,0.0,1.0,0.0,0.0
24,073 003-9,10425,0.0,0.0,0.111111,0.111111,0.437037,0.437037,0.437037,0.333333,0.2,0.076923,0.142857,0.076923,0.2,-3.70044,0.149071,0.018519,0.111111,0.111111,0.111111,0.142857,0.0,0.1,0.1024382,0.774479,0.689655,0.851852,0.4,0.0,0.222222,0.0,0.0,0.555556,0.0,0.0
43,073 003-9,Erato 0630-12705-2,0.0,0.0,0.222222,0.277778,0.544444,0.544444,0.561111,0.361111,0.333333,0.35,0.518519,0.35,0.777778,-1.514573,0.549972,0.049383,0.388889,0.277778,0.111111,0.296296,0.0,0.052632,0.2281749,0.746638,0.545455,0.741935,0.25,0.166667,0.222222,0.0,0.0,0.5,0.0,0.0


### part

In [13]:
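# Build the string pairs for attribute part
df_string_pairs = daf.string_pair_list(df_feature_base, 'part_x')

# Add one similarity column per metric; Gotoh and ArithNCD are skipped
for algorithm in tedi_algorithms:
    if algorithm not in ('Gotoh', 'ArithNCD'):
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))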

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
5,"[42], 42","bd. 57, 57",0.0,0.0,0.2,0.2,0.316667,0.316667,0.316667,0.5,0.25,0.125,0.222222,0.125,0.25,-3.0,0.223607,0.015625,0.2,0.2,0.2,0.222222,0.0,0.181818,0.153828,0.771568,0.71875,0.777778,0.428571,0.166667,0.2,0.0,0.0,0.8,0.0,0.0
20,"[42], 42",208,0.0,0.0,0.125,0.125,0.486111,0.486111,0.486111,0.25,0.0,0.1,0.181818,0.1,0.333333,-3.321928,0.204124,0.015625,0.125,0.125,0.125,0.181818,0.0,0.111111,0.0999,0.639889,0.645161,0.84,0.428571,0.0,0.125,0.0,0.0,0.375,0.0,0.0
70,"[42], 42","60/3(2015-02-01), 432-437",0.0,0.0,0.2,0.2,0.443333,0.443333,0.443333,0.26,0.125,0.222222,0.363636,0.222222,0.75,-2.169925,0.424264,0.046875,0.24,0.2,0.12,0.30303,0.0,0.04,0.161457,0.730531,0.536585,0.6,0.258065,0.0,0.2,0.0,0.0,0.32,0.0,0.0
86,"[42], 42",20c,0.0,0.0,0.125,0.125,0.486111,0.486111,0.486111,0.25,0.0,0.1,0.181818,0.1,0.333333,-3.321928,0.204124,0.015625,0.125,0.125,0.125,0.181818,0.0,0.111111,0.0999,0.639889,0.677419,0.84,0.428571,0.0,0.125,0.0,0.0,0.375,0.0,0.0
88,"[42], 42",nr.313(2017:august),0.052632,0.0,0.052632,0.052632,0.392544,0.392544,0.392544,0.236842,0.0,0.038462,0.074074,0.038462,0.125,-4.70044,0.081111,0.015625,0.052632,0.052632,0.052632,0.074074,0.0,0.05,0.039563,0.638598,0.477273,0.677419,0.24,0.0,0.078947,0.0,0.0,0.421053,0.0,0.0
84,"[42], 42","57, 57",0.0,0.0,0.25,0.25,0.527778,0.527778,0.527778,0.5,0.333333,0.166667,0.285714,0.166667,0.333333,-2.584963,0.288675,0.015625,0.25,0.25,0.25,0.285714,0.0,0.111111,0.171573,0.705167,0.806452,0.84,0.428571,0.166667,0.25,0.0,0.0,0.75,0.0,0.0
46,"[42], 42","fes z1, z1",0.0,0.0,0.2,0.2,0.316667,0.316667,0.361667,0.5,0.25,0.125,0.222222,0.125,0.25,-3.0,0.223607,0.015625,0.2,0.2,0.2,0.222222,0.0,0.090909,0.153828,0.771568,0.588235,0.777778,0.428571,0.166667,0.2,0.0,0.0,0.8,0.0,0.0
49,"[42], 42","2, 2",0.0,0.0,0.5,0.5,0.708333,0.708333,0.708333,0.5,0.5,0.5,0.666667,0.5,1.0,-1.0,0.707107,0.03125,0.5,0.5,0.25,0.666667,0.0,0.111111,0.292893,0.73763,0.774194,0.84,0.428571,0.0,0.5,0.0,0.125,0.5,0.0,0.0
73,"[42], 42","3870, 3870",0.2,0.0,0.2,0.2,0.483333,0.483333,0.483333,0.5,0.0,0.125,0.222222,0.125,0.25,-3.0,0.223607,0.015625,0.2,0.2,0.2,0.222222,0.0,0.090909,0.15301,0.778236,0.774194,0.777778,0.5,0.0,0.2,0.0,0.0,0.8,0.0,0.0
55,"[42], 42","vol. 26, 2000, 26, 2000",0.0,0.0,0.173913,0.173913,0.474638,0.474638,0.474638,0.26087,0.125,0.148148,0.258065,0.148148,0.5,-2.754888,0.294884,0.03125,0.173913,0.173913,0.086957,0.258065,0.0,0.052632,0.180991,0.820729,0.575,0.636364,0.285714,0.166667,0.347826,0.0,0.0,0.347826,0.0,0.0


### person

Attribute $\texttt{person}$ comes in three different representations, see chapter [Data Analysis](./1_DataAnalysis.ipynb). Because a similarity metric may behave differently depending on the representation, all three representations are investigated below.
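
To make this possible sensitivity concrete, the following minimal sketch scores one string pair per representation with an order-sensitive and an order-insensitive metric. It is an illustration only: `levenshtein_similarity` and `char_jaccard_similarity` are hypothetical, standard-library stand-ins written for this example, not the textdistance implementations used in this appendix. The sample strings are taken from the result tables of this section.

```python
from collections import Counter


def levenshtein_similarity(s1: str, s2: str) -> float:
    """Normalized edit similarity: 1 - distance / length of the longer string."""
    if not s1 and not s2:
        return 1.0
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free on a match)
            ))
        previous = current
    return 1.0 - previous[-1] / max(len(s1), len(s2))


def char_jaccard_similarity(s1: str, s2: str) -> float:
    """Jaccard similarity on character multisets; character order is ignored."""
    bag1, bag2 = Counter(s1), Counter(s2)
    union = sum((bag1 | bag2).values())
    if union == 0:
        return 1.0
    return sum((bag1 & bag2).values()) / union


# One pair of complete (non-truncated) sample strings per representation
sample_pairs = {
    'person_100': ('steinerrudolf1861-1925(de-588)118617443',
                   'fluryandreas1958-(de-588)120381575'),
    'person_245c': ('max frisch', 'birgit spinath (hrsg.)'),
    'person_700': ('levinejames, hockneydavid',
                   'mendelhermann, schikanederemanuel'),
}

for name, (s1, s2) in sample_pairs.items():
    print(f"{name}: levenshtein={levenshtein_similarity(s1, s2):.3f}, "
          f"char_jaccard={char_jaccard_similarity(s1, s2):.3f}")
```

Comparing the two scores per representation gives a first impression of how strongly a metric reacts to shared fragments such as the identifier suffix in `person_100` or the free-text wording in `person_245c`.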

In [20]:
person_representations = ['100', '245c', '700']

# Score sample string pairs separately for each representation of person
for pr in person_representations:
    df_string_pairs = daf.string_pair_list(df_feature_base, f'person_{pr}_x')

    print(f'\nperson_{pr}\n**********')
    # Add one similarity column per metric; Gotoh and ArithNCD are skipped
    for algorithm in tedi_algorithms:
        if algorithm not in ('Gotoh', 'ArithNCD'):
            daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

    display(df_string_pairs.sample(n=num_of_samples(df_string_pairs)))


person_100
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
7,steinerrudolf1861-1925(de-588)118617443,fluryandreas1958-(de-588)120381575,0.051282,0.0,0.282051,0.358974,0.692397,0.692397,0.708913,0.576923,0.323529,0.622222,0.767123,0.622222,0.823529,-0.684498,0.768929,0.010519,0.717949,0.435897,0.230769,0.410959,0.0,0.05,0.458961,0.961258,0.6,0.560976,0.302326,0.0,0.397436,0.0,0.0,0.871795,0.0,0.0
35,steinerrudolf1861-1925(de-588)118617443,bruchjulia1982-,0.0,0.0,0.128205,0.153846,0.462108,0.462108,0.489801,0.25641,0.066667,0.2,0.333333,0.2,0.6,-2.321928,0.372104,0.006246,0.230769,0.179487,0.051282,0.259259,0.0,0.025,0.207812,0.856479,0.433333,0.512195,0.139535,0.0,0.25641,0.0,0.0,0.384615,0.0,0.0
32,steinerrudolf1861-1925(de-588)118617443,trappehans-joachim1954-(de-588)124942687,0.125,0.0,0.3,0.35,0.618376,0.618376,0.633568,0.6375,0.282051,0.490566,0.658228,0.490566,0.666667,-1.027481,0.658281,0.011506,0.65,0.45,0.225,0.405063,0.0,0.04878,0.446926,0.93781,0.571429,0.560976,0.304348,0.0,0.3625,0.0,0.0,0.975,0.0,0.0
38,steinerrudolf1861-1925(de-588)118617443,spinathbirgiteditor,0.025641,0.0,0.102564,0.102564,0.420003,0.420003,0.459139,0.294872,0.0,0.183673,0.310345,0.183673,0.473684,-2.444785,0.330623,0.003945,0.230769,0.153846,0.051282,0.172414,0.0,0.025,0.191719,0.801262,0.35,0.512195,0.139535,0.333333,0.25641,0.025641,0.0,0.487179,0.0,0.0
11,steinerrudolf1861-1925(de-588)118617443,jacquetluc,0.051282,0.0,0.051282,0.051282,0.417521,0.417521,0.467778,0.153846,0.0,0.088889,0.163265,0.088889,0.4,-3.491853,0.202548,0.001972,0.102564,0.051282,0.025641,0.081633,0.0,0.025,0.090201,0.698202,0.316667,0.512195,0.139535,0.0,0.192308,0.0,0.0,0.25641,0.0,0.0
25,steinerrudolf1861-1925(de-588)118617443,steinerrudolf1861-1925(de-588)118617443,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.175,0.585786,1.0,0.85,0.902439,0.930233,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,steinerrudolf1861-1925(de-588)118617443,levinejamesdir.,0.076923,0.0,0.102564,0.128205,0.536182,0.536182,0.563875,0.24359,0.0,0.2,0.333333,0.2,0.6,-2.321928,0.372104,0.003616,0.230769,0.128205,0.076923,0.185185,0.0,0.025,0.168085,0.799672,0.45,0.512195,0.186047,0.166667,0.24359,0.0,0.0,0.384615,0.0,0.0
31,steinerrudolf1861-1925(de-588)118617443,rosoffmeg,0.0,0.0,0.102564,0.128205,0.42792,0.42792,0.468946,0.166667,0.0,0.116279,0.208333,0.116279,0.555556,-3.104337,0.26688,0.00263,0.128205,0.102564,0.025641,0.083333,0.0,0.025,0.11736,0.684779,0.366667,0.560976,0.139535,0.0,0.230769,0.0,0.0,0.230769,0.0,0.0
42,steinerrudolf1861-1925(de-588)118617443,broderursula1987-(de-588)1095569961,0.128205,0.0,0.333333,0.358974,0.65177,0.65177,0.668034,0.602564,0.314286,0.541667,0.702703,0.541667,0.742857,-0.884523,0.703732,0.010191,0.666667,0.487179,0.230769,0.486486,0.0,0.025,0.434109,0.946418,0.583333,0.560976,0.325581,0.0,0.448718,0.0,0.0,0.897436,0.0,0.0
10,steinerrudolf1861-1925(de-588)118617443,käserbeatrice,0.0,0.0,0.076923,0.076923,0.523504,0.523504,0.544017,0.205128,0.0,0.181818,0.307692,0.181818,0.615385,-2.459432,0.355292,0.00263,0.205128,0.128205,0.051282,0.192308,0.0,0.025,0.125595,0.751057,0.366667,0.512195,0.139535,0.0,0.205128,0.0,0.0,0.333333,0.0,0.0



person_245c
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
32,sous la dir. de diego venturino = [les oeuvres...,ein filme von volker schlöndorff,0.009091,0.0,0.163636,0.172727,0.529108,0.529108,0.545245,0.227273,0.125,0.245614,0.394366,0.245614,0.875,-2.025535,0.47194,0.003264,0.254545,0.181818,0.036364,0.239437,0.0,0.018692,0.262783,0.899762,0.279661,0.341772,0.147059,0.0,0.240909,0.0,0.0,0.290909,0.0,0.0
6,sous la dir. de diego venturino = [les oeuvres...,w.a. mozart ; libretto: emanuel schikaneder ; ...,0.027273,0.0,0.236364,0.254545,0.634487,0.634487,0.645789,0.454545,0.094595,0.520661,0.684783,0.520661,0.851351,-0.941583,0.698278,0.004008,0.572727,0.272727,0.027273,0.206522,0.0,0.009346,0.41724,0.970226,0.305085,0.392405,0.27451,0.0,0.3,0.0,0.0,0.672727,0.0,0.0
119,sous la dir. de diego venturino = [les oeuvres...,volker schlöndorff ; nach dem roman von max fr...,0.036364,0.0,0.218182,0.236364,0.595585,0.595585,0.615017,0.468182,0.075,0.450382,0.621053,0.450382,0.7375,-1.15078,0.628942,0.003926,0.536364,0.290909,0.036364,0.294737,0.0,0.037383,0.401134,0.956972,0.313559,0.417722,0.284314,0.0,0.309091,0.0,0.0,0.727273,0.0,0.0
198,sous la dir. de diego venturino = [les oeuvres...,sous la dir. de diego venturino = [les oeuvres...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.17757,0.585786,1.0,0.813559,1.0,0.95098,1.0,1.0,1.0,1.0,1.0,1.0,1.0
173,sous la dir. de diego venturino = [les oeuvres...,birgit spinath (hrsg.),0.018182,0.0,0.109091,0.118182,0.478409,0.478409,0.489318,0.154545,0.045455,0.147826,0.257576,0.147826,0.772727,-2.758027,0.345574,0.002397,0.154545,0.118182,0.018182,0.121212,0.0,0.009346,0.189102,0.863304,0.220339,0.316456,0.127451,0.0,0.172727,0.0,0.0,0.2,0.0,0.0
36,sous la dir. de diego venturino = [les oeuvres...,w. a. mozart ; dir.: james livine ; the metrop...,0.072727,0.0,0.190909,0.227273,0.658841,0.658841,0.673591,0.554545,0.11215,0.619403,0.764977,0.619403,0.775701,-0.69105,0.76505,0.004132,0.754545,0.390909,0.045455,0.175115,0.0,0.046729,0.473828,0.976009,0.254237,0.443038,0.254902,0.0,0.281818,0.0,0.0,0.972727,0.0,0.0
132,sous la dir. de diego venturino = [les oeuvres...,von emanuel schikaneder ; [musik von] wolfgang...,0.0625,0.0,0.160714,0.214286,0.652976,0.652976,0.66559,0.544643,0.109091,0.531034,0.693694,0.531034,0.7,-0.913123,0.693722,0.004215,0.6875,0.348214,0.026786,0.234234,0.0,-0.018692,0.474559,0.969743,0.237288,0.443038,0.284314,0.0,0.254464,0.0,0.0,0.982143,0.0,0.0
49,sous la dir. de diego venturino = [les oeuvres...,max frisch,0.0,0.0,0.063636,0.063636,0.445022,0.445022,0.46684,0.077273,0.0,0.071429,0.133333,0.071429,0.8,-3.807355,0.241209,0.00186,0.072727,0.063636,0.018182,0.1,0.0,0.009346,0.116459,0.800641,0.220339,0.291139,0.088235,0.0,0.113636,0.0,0.0,0.090909,0.0,0.0
19,sous la dir. de diego venturino = [les oeuvres...,wolfgang amadeus mozart ; libretto by emanuel ...,0.018182,0.0,0.209091,0.218182,0.584707,0.584707,0.611341,0.363636,0.052632,0.368852,0.538922,0.368852,0.789474,-1.438884,0.568301,0.003636,0.409091,0.245455,0.027273,0.131737,0.0,-0.009346,0.350543,0.958846,0.288136,0.367089,0.245098,0.0,0.295455,0.0,0.0,0.518182,0.0,0.0
150,sous la dir. de diego venturino = [les oeuvres...,wolfgang amadeus mozart ; ed. by hermann abert,0.009091,0.0,0.2,0.227273,0.520949,0.520949,0.554862,0.309091,0.086957,0.278689,0.435897,0.278689,0.73913,-1.843274,0.477973,0.003554,0.309091,0.218182,0.027273,0.179487,0.0,0.0,0.296205,0.928292,0.279661,0.341772,0.22549,0.0,0.272727,0.0,0.0,0.418182,0.0,0.0



person_700
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
101,schneiderreto u.1963-(de-588)115417206verfasse...,"frischmax1911-1991homo faber, junkersdorfeberhard",0.0,0.0,0.151724,0.158621,0.560156,0.560156,0.57927,0.244828,0.081633,0.276316,0.43299,0.276316,0.857143,-1.85561,0.498273,0.002164,0.289655,0.172414,0.02069,0.164948,0.0,0.007576,0.266138,0.891607,0.257576,0.358025,0.196429,0.166667,0.272414,0.0,0.0,0.337931,0.0,0.0
112,schneiderreto u.1963-(de-588)115417206verfasse...,kogelgustav friedrich1849-1921(de-588)11630309...,0.006897,0.0,0.17931,0.186207,0.571124,0.571124,0.58802,0.286207,0.224138,0.326797,0.492611,0.326797,0.862069,-1.613532,0.54522,0.003234,0.344828,0.206897,0.068966,0.216749,0.0,0.05303,0.388722,0.974927,0.363636,0.407407,0.258929,0.0,0.3,0.0,0.0,0.4,0.0,0.0
107,schneiderreto u.1963-(de-588)115417206verfasse...,schikanederemanuel1751-1812(de-588)11860757xve...,0.075862,0.0,0.337931,0.351724,0.66588,0.66588,0.679173,0.558621,0.141732,0.572254,0.727941,0.572254,0.779528,-0.805272,0.72954,0.003187,0.682759,0.482759,0.096552,0.492647,0.0,0.25,0.48215,0.966076,0.348485,0.530864,0.366071,0.5,0.42069,0.02069,0.0,0.875862,0.0,0.0
159,schneiderreto u.1963-(de-588)115417206verfasse...,honsellheinrich,0.006897,0.0,0.062069,0.062069,0.488697,0.488697,0.50341,0.082759,0.133333,0.088435,0.1625,0.088435,0.866667,-3.499233,0.278749,0.001023,0.089655,0.062069,0.013793,0.0375,0.0,-0.007576,0.10702,0.697336,0.181818,0.283951,0.089286,0.0,0.172414,0.0,0.0,0.103448,0.0,0.0
31,schneiderreto u.1963-(de-588)115417206verfasse...,"levinejames, hockneydavid",0.0,0.0,0.082759,0.089655,0.489994,0.489994,0.518132,0.127586,0.0,0.133333,0.235294,0.133333,0.8,-2.906891,0.332182,0.001284,0.137931,0.089655,0.02069,0.129412,0.0,0.007576,0.160832,0.833558,0.234848,0.333333,0.107143,0.0,0.206897,0.0,0.0,0.172414,0.0,0.0
152,schneiderreto u.1963-(de-588)115417206verfasse...,"rozotv.aut, patriziaa.aut, viganos.aut, mazza-...",0.030837,0.0,0.154185,0.15859,0.433855,0.433855,0.466631,0.396476,0.082759,0.231788,0.376344,0.231788,0.482759,-2.109122,0.385835,0.001974,0.30837,0.193833,0.026432,0.215054,0.0,0.0,0.346334,0.892362,0.128571,0.368421,0.138462,0.166667,0.270925,0.0,0.013216,0.638767,0.0,0.0
51,schneiderreto u.1963-(de-588)115417206verfasse...,"grubergernot, orelalfred, moehnheinz, schikane...",0.041379,0.0,0.137931,0.151724,0.540306,0.540306,0.575021,0.489655,0.057377,0.369231,0.539326,0.369231,0.590164,-1.437405,0.541338,0.001974,0.496552,0.248276,0.02069,0.179775,0.0,-0.007576,0.337638,0.906536,0.181818,0.432099,0.196429,0.0,0.248276,0.0,0.0,0.841379,0.0,0.0
59,schneiderreto u.1963-(de-588)115417206verfasse...,"mendelhermann, schikanederemanuel",0.02069,0.0,0.124138,0.124138,0.464263,0.464263,0.479143,0.175862,0.090909,0.163399,0.280899,0.163399,0.757576,-2.613532,0.361409,0.001474,0.172414,0.137931,0.02069,0.134831,0.0,0.022727,0.176629,0.7847,0.212121,0.333333,0.142857,0.0,0.231034,0.0,0.0,0.227586,0.0,0.0
49,schneiderreto u.1963-(de-588)115417206verfasse...,"grubergernot1939-(de-588)12110625x, orelalfred...",0.050633,0.0,0.221519,0.246835,0.654844,0.654844,0.669393,0.528481,0.131034,0.569948,0.726073,0.569948,0.758621,-0.811097,0.726742,0.003306,0.696203,0.43038,0.06962,0.29703,0.0,0.146853,0.500762,0.971651,0.325758,0.483146,0.266667,0.0,0.316456,0.0,0.0,0.917722,0.0,0.0
57,schneiderreto u.1963-(de-588)115417206verfasse...,csampaiattila,0.0,0.0,0.062069,0.062069,0.512732,0.512732,0.537878,0.075862,0.076923,0.07483,0.139241,0.07483,0.846154,-3.740241,0.253359,0.000595,0.075862,0.062069,0.006897,0.050633,0.0,0.007576,0.081918,0.64759,0.189394,0.308642,0.089286,0.0,0.168966,0.0,0.0,0.089655,0.0,0.0


### pubinit

In [18]:
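# Build the string pairs for attribute pubinit
df_string_pairs = daf.string_pair_list(df_feature_base, 'pubinit_x')

# Add one similarity column per metric; Gotoh and ArithNCD are skipped
for algorithm in tedi_algorithms:
    if algorithm not in ('Gotoh', 'ArithNCD'):
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))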

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
9,"deutsche grammophon, universal music",pearson education,0.055556,0.0,0.166667,0.222222,0.686819,0.686819,0.686819,0.319444,0.058824,0.472222,0.641509,0.472222,1.0,-1.082462,0.687184,0.010417,0.472222,0.222222,0.055556,0.264151,0.0,0.054054,0.367589,0.906939,0.518519,0.538462,0.238095,0.166667,0.319444,0.0,0.0,0.472222,0.0,0.0
39,"deutsche grammophon, universal music",s. mode's verlag (gustav mode),0.0,0.0,0.138889,0.166667,0.590741,0.590741,0.602963,0.486111,0.133333,0.5,0.666667,0.5,0.733333,-1.0,0.669439,0.010031,0.611111,0.333333,0.083333,0.333333,0.0,0.027027,0.403558,0.932843,0.5,0.538462,0.214286,0.166667,0.263889,0.0,0.0,0.833333,0.0,0.0
44,"deutsche grammophon, universal music",e. eulenburg,0.027778,0.0,0.138889,0.138889,0.50463,0.50463,0.537963,0.236111,0.083333,0.263158,0.416667,0.263158,0.833333,-1.925999,0.481125,0.005787,0.277778,0.166667,0.055556,0.166667,0.0,0.027027,0.203162,0.776718,0.5,0.538462,0.142857,0.0,0.25,0.0,0.0,0.333333,0.0,0.0
7,"deutsche grammophon, universal music",k. alber,0.0,0.0,0.111111,0.111111,0.37037,0.37037,0.416204,0.166667,0.125,0.128205,0.227273,0.128205,0.625,-2.963474,0.294628,0.004244,0.138889,0.111111,0.055556,0.181818,0.0,0.054054,0.133578,0.749496,0.407407,0.538462,0.142857,0.0,0.208333,0.0,0.0,0.222222,0.0,0.0
6,"deutsche grammophon, universal music",alber,0.0,0.0,0.083333,0.083333,0.45,0.45,0.495556,0.111111,0.0,0.108108,0.195122,0.108108,0.8,-3.209453,0.298142,0.003086,0.111111,0.083333,0.055556,0.146341,0.0,0.054054,0.104937,0.63833,0.444444,0.538462,0.142857,0.0,0.180556,0.0,0.0,0.138889,0.0,0.0
84,"deutsche grammophon, universal music",springer,0.0,0.0,0.138889,0.138889,0.472222,0.472222,0.4875,0.180556,0.0,0.222222,0.363636,0.222222,1.0,-2.169925,0.471405,0.005401,0.222222,0.138889,0.055556,0.181818,0.0,0.027027,0.188907,0.736181,0.462963,0.538462,0.142857,0.0,0.236111,0.0,0.0,0.222222,0.0,0.0
10,"deutsche grammophon, universal music",pearson education ltd.,0.055556,0.0,0.194444,0.194444,0.674331,0.674331,0.674331,0.402778,0.090909,0.487179,0.655172,0.487179,0.863636,-1.037475,0.675136,0.010802,0.527778,0.222222,0.055556,0.275862,0.0,0.054054,0.405018,0.931729,0.537037,0.538462,0.190476,0.166667,0.305556,0.0,0.0,0.611111,0.0,0.0
32,"deutsche grammophon, universal music",terzio-verlag,0.055556,0.0,0.222222,0.222222,0.499288,0.499288,0.530698,0.291667,0.076923,0.289474,0.44898,0.289474,0.846154,-1.788496,0.508475,0.005787,0.305556,0.222222,0.083333,0.285714,0.0,0.054054,0.239762,0.832681,0.462963,0.538462,0.166667,0.0,0.305556,0.0,0.0,0.361111,0.0,0.0
56,"deutsche grammophon, universal music","klett u. balmer, delval, ed. universitaires",0.093023,0.0,0.255814,0.302326,0.579663,0.579663,0.60518,0.534884,0.222222,0.410714,0.582278,0.410714,0.638889,-1.283793,0.584578,0.010802,0.534884,0.372093,0.186047,0.329114,0.0,0.068182,0.453231,0.936647,0.553571,0.534884,0.408163,0.0,0.348837,0.0,0.0,0.837209,0.0,0.0
94,"deutsche grammophon, universal music",modern library,0.055556,0.0,0.194444,0.194444,0.497354,0.497354,0.537037,0.291667,0.214286,0.282051,0.44,0.282051,0.785714,-1.825971,0.489979,0.008102,0.305556,0.222222,0.055556,0.24,0.0,0.027027,0.274125,0.856784,0.462963,0.538462,0.166667,0.0,0.319444,0.0,0.0,0.388889,0.0,0.0


### scale

In [25]:
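# Build the string pairs for attribute scale
df_string_pairs = daf.string_pair_list(df_feature_base, 'scale_x')

# Add one similarity column per metric; Gotoh and ArithNCD are skipped
for algorithm in tedi_algorithms:
    if algorithm not in ('Gotoh', 'ArithNCD'):
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))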

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
3,50000,Scala 1:50.000 ; proiezione cilindrica ad asse...,0.0,0.0,0.045872,0.045872,0.681957,0.681957,0.681957,0.045872,0.2,0.045872,0.087719,0.045872,1.0,-4.446256,0.214176,0.1,0.045872,0.045872,0.027523,0.087719,0.0,0.0,0.035519,0.307395,0.180328,0.298701,0.054545,0.0,0.105505,0.0,0.0,0.045872,0.0,0.0
1,50000,100000,0.666667,1.0,0.666667,0.833333,0.822222,0.822222,0.822222,0.75,0.8,0.571429,0.727273,0.571429,0.8,-0.807355,0.730297,0.08,0.666667,0.666667,0.666667,0.727273,0.0,0.25,0.381966,0.874656,0.884615,0.92,0.666667,0.5,0.833333,0.0,0.666667,0.833333,0.0,0.0
2,50000,50000,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.25,0.585786,1.0,1.0,0.92,0.666667,1.0,1.0,1.0,1.0,1.0,1.0,1.0
0,50000,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.580744,0.0,0.36,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
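
Unlike the other columns, the Tanimoto column takes negative values and $-\infty$. The values printed in this table are consistent with the base-2 logarithm of the Jaccard column. The sketch below checks this on the four rows above. It is an observation from the printed numbers, not a statement taken from the textdistance documentation, and `log2_of_similarity` is a hypothetical helper written for this check.

```python
import math


def log2_of_similarity(similarity: float) -> float:
    """Base-2 logarithm of a similarity in [0, 1]; 0 maps to -inf."""
    return math.log2(similarity) if similarity > 0 else -math.inf


# (Jaccard, Tanimoto) values copied from the scale table above
rows = [
    (0.045872, -4.446256),
    (0.571429, -0.807355),
    (1.0, 0.0),
    (0.0, -math.inf),
]

for jaccard, tanimoto in rows:
    computed = log2_of_similarity(jaccard)
    # Compare with a small tolerance because the table values are rounded
    matches = computed == tanimoto or abs(computed - tanimoto) < 1e-4
    print(f"Jaccard={jaccard}: log2={computed:.6f}, table={tanimoto}, match={matches}")
```

If this reading is right, the Tanimoto column is not normalized to the interval $[0, 1]$ and cannot be compared directly with the other similarity columns.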


### ttlfull

Attribute $\texttt{ttlfull}$ comes in two different representations, as described in chapter [Data Analysis](./1_DataAnalysis.ipynb). Both representations are investigated below.
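
As a small illustration of what to look for in the title tables, the following sketch measures how much of two title strings is a shared leading part. `prefix_similarity` is a hypothetical, standard-library stand-in written for this example and is not the textdistance `Prefix` implementation. The titles are complete sample strings taken from the result tables of this section.

```python
import os


def prefix_similarity(s1: str, s2: str) -> float:
    """Length of the shared leading substring, divided by the longer length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    # os.path.commonprefix compares the strings character by character
    return len(os.path.commonprefix([s1, s2])) / longest


title_a = "die zauberflöte, oper in zwei aufzügen : kv 620"
title_b = "die zauberflöte, wolfgang amadeus mozart"
title_c = "obligationenrecht"

shared = os.path.commonprefix([title_a, title_b])
print(repr(shared), prefix_similarity(title_a, title_b))
print(prefix_similarity(title_a, title_c))
```

Titles of the same work that differ only in their subtitle share a prefix, so a prefix-aware metric can separate them from unrelated titles even when the remainder of the strings differs considerably.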

In [33]:
ttlfull_representations = ['245', '246']

# Score sample string pairs separately for each representation of ttlfull
for tf in ttlfull_representations:
    df_string_pairs = daf.string_pair_list(df_feature_base, f'ttlfull_{tf}_x')

    print(f'\nttlfull_{tf}\n***********')
    # Add one similarity column per metric; Gotoh and ArithNCD are skipped
    for algorithm in tedi_algorithms:
        if algorithm not in ('Gotoh', 'ArithNCD'):
            daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

    display(df_string_pairs.sample(n=num_of_samples(df_string_pairs)))


ttlfull_245
***********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
145,"die zauberflöte, eine grosse oper in zwey aufz...",obligationenrecht,0.02,0.0,0.16,0.16,0.558319,0.558319,0.566202,0.25,0.0,0.264151,0.41791,0.264151,0.823529,-1.920566,0.480196,0.0054,0.28,0.16,0.04,0.238806,0.0,0.039216,0.247801,0.864206,0.361111,0.446809,0.172414,0.0,0.26,0.0,0.0,0.34,0.0,0.0
3,"die zauberflöte, eine grosse oper in zwey aufz...","die zauberflöte, oper in zwei aufzügen : kv 620",0.38,0.0,0.56,0.56,0.792021,0.875213,0.877689,0.7,0.425532,0.701754,0.824742,0.701754,0.851064,-0.510962,0.825137,0.0094,0.8,0.74,0.34,0.762887,0.0,0.176471,0.492151,0.958305,0.662162,0.744681,0.672414,0.5,0.6,0.34,0.0,0.94,0.0,0.0
55,"die zauberflöte, eine grosse oper in zwey aufz...",traité sur la tolérance,0.04,0.0,0.18,0.18,0.526051,0.526051,0.545094,0.32,0.086957,0.280702,0.438356,0.280702,0.695652,-1.83289,0.471814,0.007,0.32,0.2,0.04,0.191781,0.0,0.019608,0.286641,0.871763,0.444444,0.446809,0.224138,0.0,0.3,0.0,0.0,0.46,0.0,0.0
159,"die zauberflöte, eine grosse oper in zwey aufz...",health informatics - personal health device co...,0.007533,0.0,0.06968,0.073446,0.51305,0.51305,0.519615,0.081921,0.08,0.090056,0.165232,0.090056,0.96,-3.473029,0.294584,0.0098,0.090395,0.071563,0.007533,0.089501,0.0,-0.002427,0.178887,0.888417,0.115385,0.156069,0.091549,0.0,0.136535,0.0,0.0,0.094162,0.0,0.0
11,"die zauberflöte, eine grosse oper in zwey aufz...",bildungsforschung und bildungspraxis,0.1,0.0,0.16,0.16,0.540209,0.540209,0.55932,0.44,0.055556,0.365079,0.534884,0.365079,0.638889,-1.453718,0.542115,0.0064,0.46,0.24,0.02,0.116279,0.0,-0.019608,0.374747,0.922835,0.402778,0.489362,0.206897,0.0,0.32,0.0,0.0,0.72,0.0,0.0
144,"die zauberflöte, eine grosse oper in zwey aufz...","traité sur la tolérance, à l'occasion de la mo...",0.086957,0.0,0.217391,0.246377,0.517338,0.517338,0.55528,0.471014,0.08,0.337079,0.504202,0.337079,0.6,-1.568843,0.510754,0.0074,0.434783,0.304348,0.043478,0.302521,0.0,0.014706,0.329863,0.928322,0.305882,0.389831,0.22973,0.0,0.318841,0.0,0.0,0.724638,0.0,0.0
107,"die zauberflöte, eine grosse oper in zwey aufz...","die zauberflöte, kv 620 : opera in two acts = ...",0.295082,0.0,0.459016,0.47541,0.706598,0.823959,0.834877,0.606557,0.26,0.520548,0.684685,0.520548,0.76,-0.941897,0.688072,0.0096,0.622951,0.540984,0.278689,0.576577,0.0,0.131148,0.419493,0.936612,0.518072,0.54717,0.411765,0.5,0.516393,0.278689,0.0,0.819672,0.0,0.0
57,"die zauberflöte, eine grosse oper in zwey aufz...","die zauberflöte, [kv 620 : eine deutsche oper ...",0.301587,0.0,0.714286,0.714286,0.789947,0.873968,0.880425,0.753968,0.64,0.661765,0.79646,0.661765,0.9,-0.59561,0.801784,0.0098,0.714286,0.714286,0.269841,0.79646,0.0,0.142857,0.433398,0.944299,0.579545,0.672727,0.619718,0.5,0.738095,0.269841,0.0,0.793651,0.0,0.0
60,"die zauberflöte, eine grosse oper in zwey aufz...","die zauberflöte, wolfgang amadeus mozart",0.36,0.0,0.42,0.44,0.750833,0.8505,0.8667,0.61,0.275,0.551724,0.711111,0.551724,0.8,-0.857981,0.715542,0.0094,0.64,0.48,0.34,0.511111,0.0,0.098039,0.495766,0.972219,0.597222,0.574468,0.448276,0.5,0.53,0.34,0.0,0.8,0.0,0.0
115,"die zauberflöte, eine grosse oper in zwey aufz...","bonne chance!, cours de langue française, prem...",0.04,0.0,0.266667,0.28,0.634417,0.634417,0.644417,0.466667,0.12,0.488095,0.656,0.488095,0.82,-1.034765,0.669527,0.0084,0.546667,0.293333,0.04,0.288,0.0,0.027027,0.392764,0.959392,0.402299,0.377049,0.289474,0.0,0.366667,0.0,0.0,0.666667,0.0,0.0



ttlfull_246
***********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
5,medizinische informatik - kommunikation von ge...,"the magic flute : [dvd-video], la flûte enchantée",0.010033,0.0,0.113712,0.113712,0.53019,0.53019,0.532566,0.138796,0.061224,0.148515,0.258621,0.148515,0.918367,-2.751321,0.371774,0.001197,0.150502,0.113712,0.010033,0.137931,0.0,0.003759,0.22246,0.914159,0.188679,0.226891,0.10989,0.0,0.187291,0.0,0.0,0.16388,0.0,0.0
20,medizinische informatik - kommunikation von ge...,medizinische informatik - kommunikation von ge...,0.395062,0.0,0.679012,0.688272,0.834633,0.90078,0.904252,0.78858,0.625418,0.811047,0.895666,0.811047,0.93311,-0.302143,0.896388,0.001661,0.861111,0.753086,0.290123,0.767255,0.0,0.370504,0.546321,0.992909,0.617117,0.705426,0.628713,1.0,0.717593,0.290123,0.018519,0.92284,0.0,0.0
22,medizinische informatik - kommunikation von ge...,medizinische informatik - kommunikation von ge...,0.278873,0.0,0.560563,0.569014,0.808373,0.885024,0.887981,0.695775,0.464883,0.744,0.853211,0.744,0.93311,-0.426625,0.856356,0.001616,0.785915,0.625352,0.253521,0.642202,0.0,0.305296,0.514121,0.987616,0.494118,0.572414,0.486842,1.0,0.619718,0.253521,0.016901,0.842254,0.0,0.0
15,medizinische informatik - kommunikation von ge...,die zauberflöte,0.003344,0.0,0.043478,0.043478,0.493385,0.493385,0.500386,0.046823,0.0,0.046667,0.089172,0.046667,0.933333,-4.421464,0.209048,0.000878,0.046823,0.043478,0.013378,0.063694,0.0,0.011278,0.109507,0.811864,0.146226,0.193277,0.065934,0.166667,0.105351,0.0,0.0,0.050167,0.0,0.0
0,medizinische informatik - kommunikation von ge...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003759,0.0,0.177791,0.0,0.07563,0.032967,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41,medizinische informatik - kommunikation von ge...,bibliotheken,0.0,0.0,0.0301,0.0301,0.488926,0.488926,0.506262,0.035117,0.0,0.033223,0.064309,0.033223,0.833333,-4.911692,0.166945,0.000688,0.033445,0.0301,0.006689,0.045016,0.0,0.007519,0.079264,0.723589,0.117925,0.210084,0.06044,0.0,0.09699,0.0,0.0,0.040134,0.0,0.0
27,medizinische informatik - kommunikation von ge...,medizinische informatik - kommunikation von ge...,0.392962,0.0,0.777126,0.780059,0.84586,0.907516,0.909023,0.826979,0.745819,0.828571,0.90625,0.828571,0.9699,-0.271302,0.908208,0.001661,0.85044,0.809384,0.272727,0.853125,0.0,0.377104,0.546011,0.990545,0.617284,0.751825,0.680751,0.833333,0.803519,0.272727,0.002933,0.876833,0.0,0.0
32,medizinische informatik - kommunikation von ge...,medizinische informatik - kommunikation von ge...,0.372727,0.0,0.642424,0.654545,0.832936,0.899762,0.903587,0.763636,0.58194,0.792023,0.883943,0.792023,0.929766,-0.336386,0.885018,0.001661,0.842424,0.712121,0.281818,0.72814,0.0,0.321799,0.542862,0.991836,0.571429,0.659259,0.563981,0.833333,0.687879,0.281818,0.00303,0.906061,0.0,0.0
8,medizinische informatik - kommunikation von ge...,marche de l'empereur,0.003344,0.0,0.063545,0.063545,0.583462,0.583462,0.583462,0.065217,0.0,0.063333,0.119122,0.063333,0.95,-3.980891,0.245699,0.000761,0.063545,0.063545,0.013378,0.087774,0.0,0.026316,0.118063,0.764625,0.150943,0.210084,0.076923,0.166667,0.123746,0.003344,0.0,0.06689,0.0,0.0
11,medizinische informatik - kommunikation von ge...,"education et recherche, educazione e ricerca. ...",0.013378,0.0,0.173913,0.177258,0.571544,0.571544,0.576444,0.215719,0.12987,0.240924,0.388298,0.240924,0.948052,-2.053349,0.481107,0.001247,0.244147,0.180602,0.026756,0.154255,0.0,0.037594,0.26041,0.886766,0.188679,0.294118,0.181319,0.166667,0.249164,0.0,0.0,0.257525,0.0,0.0


### volumes

In [34]:
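# Build the string pairs for attribute volumes
df_string_pairs = daf.string_pair_list(df_feature_base, 'volumes_x')

# Add one similarity column per metric; Gotoh and ArithNCD are skipped
for algorithm in tedi_algorithms:
    if algorithm not in ('Gotoh', 'ArithNCD'):
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))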

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
166,388 seiten,teil 1-,0.0,0.0,0.1,0.1,0.465079,0.465079,0.465079,0.35,0.142857,0.307692,0.470588,0.307692,0.571429,-1.70044,0.478091,0.025,0.4,0.2,0.2,0.235294,0.0,0.090909,0.276326,0.864908,0.771429,0.777778,0.375,0.166667,0.2,0.0,0.0,0.7,0.0,0.0
112,388 seiten,319 s.,0.3,0.0,0.3,0.3,0.6,0.6,0.626667,0.45,0.0,0.230769,0.375,0.230769,0.5,-2.115477,0.387298,0.015,0.3,0.3,0.2,0.375,0.0,0.090909,0.199057,0.798559,0.685714,0.851852,0.375,0.166667,0.35,0.1,0.0,0.6,0.0,0.0
113,388 seiten,"ix, 319 s.",0.1,0.0,0.1,0.1,0.422222,0.422222,0.462222,0.55,0.1,0.25,0.4,0.25,0.4,-2.0,0.4,0.02,0.4,0.3,0.2,0.3,0.0,0.090909,0.259132,0.839334,0.777778,0.851852,0.375,0.0,0.1,0.0,0.0,1.0,0.0,0.0
47,388 seiten,"1 klavierauszug (vi, 188 s.)",0.035714,0.0,0.071429,0.071429,0.430952,0.430952,0.444524,0.214286,0.1,0.1875,0.315789,0.1875,0.6,-2.415037,0.358569,0.035,0.214286,0.142857,0.142857,0.210526,0.0,0.034483,0.166161,0.756246,0.538462,0.6,0.264706,0.166667,0.178571,0.0,0.0,0.357143,0.0,0.0
146,388 seiten,"1 partition de travail (xxvii, 379 p.)",0.052632,0.0,0.078947,0.078947,0.477193,0.477193,0.502456,0.171053,0.0,0.142857,0.25,0.142857,0.6,-2.807355,0.307794,0.035,0.157895,0.105263,0.052632,0.166667,0.0,0.025641,0.155787,0.748879,0.482759,0.560976,0.136364,0.0,0.184211,0.0,0.0,0.263158,0.0,0.0
127,388 seiten,312 s.,0.3,0.0,0.3,0.3,0.6,0.6,0.626667,0.45,0.0,0.230769,0.375,0.230769,0.5,-2.115477,0.387298,0.015,0.3,0.3,0.2,0.375,0.0,0.090909,0.199057,0.798559,0.714286,0.851852,0.375,0.166667,0.35,0.1,0.0,0.6,0.0,0.0
27,388 seiten,1 dvd (ca. 169 min.),0.0,0.0,0.1,0.1,0.433333,0.433333,0.478333,0.3,0.1,0.111111,0.2,0.111111,0.3,-3.169925,0.212132,0.015,0.15,0.15,0.05,0.2,0.0,0.047619,0.119162,0.75523,0.604651,0.677419,0.230769,0.0,0.175,0.0,0.0,0.5,0.0,0.0
136,388 seiten,134 p.,0.1,0.0,0.1,0.1,0.511111,0.511111,0.537778,0.35,0.0,0.142857,0.25,0.142857,0.333333,-2.807355,0.258199,0.01,0.2,0.2,0.1,0.25,0.0,0.090909,0.132705,0.766687,0.714286,0.851852,0.375,0.0,0.25,0.0,0.0,0.6,0.0,0.0
178,388 seiten,166 seiten,0.7,1.0,0.7,0.7,0.8,0.8,0.8,0.85,0.7,0.538462,0.7,0.538462,0.7,-0.893085,0.7,0.035,0.7,0.7,0.7,0.7,0.0,0.272727,0.425598,0.923507,0.857143,0.777778,0.75,0.666667,0.8,0.0,0.7,1.0,0.0,0.0
111,388 seiten,1 matériel de choeur,0.05,0.0,0.15,0.15,0.516667,0.516667,0.531667,0.325,0.1,0.2,0.333333,0.2,0.5,-2.321928,0.353553,0.025,0.25,0.15,0.05,0.2,0.0,0.047619,0.166712,0.775697,0.530612,0.69697,0.222222,0.0,0.25,0.0,0.0,0.5,0.0,0.0
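
To make the Hamming and Levenshtein columns of the table above easier to interpret, the following minimal sketch recomputes them in plain Python for three of the sampled pairs. It assumes that the normalized similarity is defined as `(longest - distance) / longest`, with `longest` being the length of the longer string, and that Hamming counts the surplus characters of the longer string as mismatches. The function names are illustrative only; they belong neither to textdistance nor to the `daf` helper module.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions; surplus characters count as mismatches (assumption)."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def normalized_similarity(distance: int, a: str, b: str) -> float:
    """Map a distance to [0, 1], assuming the longer string length as maximum distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else (longest - distance) / longest


# String pairs taken from the volumes sample above
pairs = [("388 seiten", "166 seiten"),
         ("388 seiten", "319 s."),
         ("388 seiten", "134 p.")]

results = {}
for str1, str2 in pairs:
    hamming = normalized_similarity(hamming_distance(str1, str2), str1, str2)
    levenshtein = normalized_similarity(levenshtein_distance(str1, str2), str1, str2)
    results[(str1, str2)] = (hamming, levenshtein)
    print(f"{str1!r} vs {str2!r}: Hamming={hamming:.3f}, Levenshtein={levenshtein:.3f}")
```

Under these assumptions, the pair `388 seiten` / `166 seiten` differs in three of ten characters, which gives 0.7 for both metrics, and the pair `388 seiten` / `319 s.` gives 0.3 for both, in line with the corresponding table rows.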
