# Comparison of Similarity Metrics

This appendix compares all similarity metric implementations of the textdistance library [[TeDi](./A_References.ipynb#tedi)] on sample string pairs for each feature to be calculated. The comparison uses the gold standard data and forms the basis for choosing the similarity metric for each feature.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Functions for Similarity Metrics Analysis](#Functions-for-Similarity-Metrics-Analysis)
- [Similarity Metric Assessments](#Similarity-Metric-Assessments)
    - [coordinate](#coordinate)
    - [corporate](#corporate)
    - [doi](#doi)
    - [edition](#edition)
    - [exactDate](#exactDate)
    - [format](#format)
    - [isbn](#isbn)
    - [ismn](#ismn)
    - [musicid](#musicid)
    - [part](#part)
    - [person](#person)
    - [pubinit](#pubinit)
    - [scale](#scale)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)

## Data Takeover

As a first step, the training data set produced in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb) is read. Sample strings from this data set are used to compare and assess the different metric implementations.

In [2]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,035liste_x,035liste_y,century_x,century_y,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,corporate_full_x,corporate_full_y,decade_x,decade_y,docid_x,docid_y,doi_x,doi_y,duplicates,edition_x,edition_y,exactDate_x,exactDate_y,format_postfix_x,format_postfix_y,format_prefix_x,format_prefix_y,isbn_x,isbn_y,ismn_x,ismn_y,masters_docid,musicid_x,musicid_y,pages_x,pages_y,part_x,part_y,person_100_x,person_100_y,person_245c_x,person_245c_y,person_700_x,person_700_y,pubinit_x,pubinit_y,pubword_x,pubword_y,pubyear_x,pubyear_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,ttlpart_x,ttlpart_y,volumes_x,volumes_y
0,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"grawechristian, graweursula","grawechristian, graweursula",reclam jun.,reclam jun.,[Reclam jun.],[Reclam jun.],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
1,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman",emma,,,"{'245': ['Emma', 'Roman']}",{'245': ['Emma']},600,600
2,"[(OCoLC)731635279, (ABN)000539983]","[(OCoLC)731635279, (LIBIB)000315536]",2009,2009,,,,,[],[],,,,,,,2009,2009,311049,323173349,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem englischen übersetzt von...,jane austen,"grawechristian, graweursula",,reclam jun.,reclam,[Reclam jun.],[Reclam],2009,2009,,,"emma, roman","emma, roman",,,"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",600,600
3,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (ABN)000539983]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,311049,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,,"grawechristian, graweursula",reclam,reclam jun.,[Reclam],[Reclam jun.],2009,2009,,,emma,"emma, roman",,,{'245': ['Emma']},"{'245': ['Emma', 'Roman']}",600,600
4,"[(OCoLC)731635279, (NEBIS)009587153]","[(OCoLC)731635279, (NEBIS)009587153]",2009,2009,,,,,[],[],,,,,,,2009,2009,196506476,196506476,,,1,,,2009uuuu,2009uuuu,20000,20000,bk,bk,[978-3-15-020008-7],[978-3-15-020008-7],,,504389793,,,[600 S.],[600 S.],20008,20008,austenjane,austenjane,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,,,reclam,reclam,[Reclam],[Reclam],2009,2009,,,emma,emma,,,{'245': ['Emma']},{'245': ['Emma']},600,600


## Functions for Similarity Metrics Analysis

All metric algorithms available in the library [[TeDi](./A_References.ipynb#tedi)] are listed in the dictionary below. The dictionary makes it possible to calculate a similarity value with each of these algorithms.

In [3]:
import textdistance as tedi

tedi_algorithms = {
    # Edit based
    'Hamming' : tedi.Hamming(), 'MLIPNS' : tedi.MLIPNS(), 'Levenshtein' : tedi.Levenshtein(),
    'DamerauLevenshtein' : tedi.DamerauLevenshtein(), 'Jaro' : tedi.Jaro(), 'JaroWinkler' : tedi.JaroWinkler(),
    'StrCmp95' : tedi.StrCmp95(), 'NeedlemanWunsch' : tedi.NeedlemanWunsch(), 'Gotoh' : tedi.Gotoh(),
    'SmithWaterman' : tedi.SmithWaterman(),
    # Token based
    'Jaccard' : tedi.Jaccard(), 'Sorensen' : tedi.Sorensen(), 'Tversky' : tedi.Tversky(), 'Overlap' : tedi.Overlap(),
    'Tanimoto' : tedi.Tanimoto(), 'Cosine' : tedi.Cosine(), 'MongeElkan' : tedi.MongeElkan(), 'Bag' : tedi.Bag(),
    # Sequence based
    'LCSSeq' : tedi.LCSSeq(), 'LCSStr' : tedi.LCSStr(), 'RatcliffObershelp' : tedi.RatcliffObershelp(),
    # Compression based
    'ArithNCD' : tedi.ArithNCD(), 'RLENCD' : tedi.RLENCD(), 'BWTRLENCD' : tedi.BWTRLENCD(),
    'SqrtNCD' : tedi.SqrtNCD(), 'EntropyNCD' : tedi.EntropyNCD(), 'BZ2NCD' : tedi.BZ2NCD(),
    'LZMANCD' : tedi.LZMANCD(), 'ZLIBNCD' : tedi.ZLIBNCD(),
    # Phonetic
    'MRA' : tedi.MRA(), 'Editex' : tedi.Editex(),
    # Simple
    'Prefix' : tedi.Prefix(), 'Postfix' : tedi.Postfix(), 'Length' : tedi.Length(), 'Identity' : tedi.Identity(),
    'Matrix' : tedi.Matrix()
}

This appendix uses the function $\texttt{apply}\_\texttt{similarities()}$, which applies the $\texttt{.normalized}\_\texttt{similarity()}$ method of a $\texttt{textdistance}$ algorithm object to the sample string pairs. It is called once for each algorithm object available in the library. The function is implemented in the separate code file [data_analysis_funcs.py](./data_analysis_funcs.py).

In [4]:
import data_analysis_funcs as daf
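
The implementation in data_analysis_funcs.py is not reproduced in this appendix. As a rough sketch of the pattern only, the following stand-alone example adds one similarity column per algorithm to a list of string pairs. Everything in it is an assumption made for illustration: the name `apply_similarities_sketch`, the plain dictionaries used in place of a DataFrame, and the two stand-in metrics built on the standard library instead of $\texttt{textdistance}$ objects.

```python
from difflib import SequenceMatcher


def identity_similarity(str1, str2):
    """Stand-in metric: 1.0 for equal strings, else 0.0."""
    return 1.0 if str1 == str2 else 0.0


def matcher_similarity(str1, str2):
    """Stand-in metric: ratio of matching characters from difflib."""
    return SequenceMatcher(None, str1, str2).ratio()


def apply_similarities_sketch(pairs, metric, column_name):
    """Store metric(str1, str2) under column_name in every row (hypothetical helper)."""
    for row in pairs:
        row[column_name] = metric(row['str1'], row['str2'])
    return pairs


# Hypothetical stand-ins for the dictionary of textdistance objects
sketch_algorithms = {'Identity': identity_similarity, 'Matcher': matcher_similarity}

pairs = [
    {'str1': 'reclam jun.', 'str2': 'reclam jun.'},
    {'str1': 'reclam jun.', 'str2': 'reclam'},
]

for name, metric in sketch_algorithms.items():
    apply_similarities_sketch(pairs, metric, name)

for row in pairs:
    print(row)
```

With $\texttt{textdistance}$, the callable passed in would presumably be the $\texttt{normalized}\_\texttt{similarity()}$ method of one algorithm object, so that every column holds values on a comparable scale.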

## Similarity Metric Assessments

This section iterates through all available similarity metrics of the library [[TeDi](./A_References.ipynb#tedi)] and calculates the similarity values for sample string pairs of each feature of the model. The calculated values are assessed visually, and on that basis an algorithm is chosen for use in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb). The visual assessment is validated against the literature [[Chri2012](./A_References.ipynb#chri2012)].

In [5]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(tedi_algorithms)+3

def num_of_samples(df):
    # Sample at most 30 rows, or all rows if the DataFrame is smaller
    max_num_samples = 30

    return min(len(df), max_num_samples)

### coordinate

In [6]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'coordinate_E_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
8,e0080900,e0075850,0.5,0.0,0.5,0.5,0.75,0.825,0.775,0.75,0.5,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.054688,0.625,0.625,0.375,0.625,0.0,0.111111,0.376044,0.817405,0.892857,0.84,0.428571,0.5,0.5,0.375,0.125,1.0,0.0,0.0
4,e0080900,e0060811,0.5,0.0,0.5,0.5,0.683333,0.683333,0.683333,0.75,0.5,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.054688,0.625,0.5,0.375,0.5,0.0,0.444444,0.376044,0.817405,0.892857,0.84,0.428571,0.5,0.625,0.375,0.0,1.0,0.0,0.0
6,e0080900,e0055009,0.5,0.0,0.5,0.625,0.777778,0.844444,0.8,0.75,0.5,0.6,0.75,0.6,0.75,-0.736966,0.75,0.054688,0.75,0.625,0.375,0.625,0.0,0.555556,0.444689,0.89341,0.896552,0.84,0.428571,0.666667,0.75,0.375,0.0,1.0,0.0,0.0
5,e0080900,e0074147,0.375,0.0,0.375,0.375,0.583333,0.583333,0.583333,0.6875,0.375,0.230769,0.375,0.230769,0.375,-2.115477,0.375,0.046875,0.375,0.375,0.375,0.375,0.0,0.333333,0.254751,0.739081,0.821429,0.84,0.428571,0.333333,0.375,0.375,0.0,1.0,0.0,0.0
2,e0080900,e0080855,0.625,1.0,0.625,0.625,0.75,0.85,0.775,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.054688,0.625,0.625,0.625,0.625,0.0,0.444444,0.432992,0.860952,0.892857,0.84,0.642857,0.666667,0.75,0.625,0.0,1.0,0.0,0.0
1,e0080900,e0055700,0.625,1.0,0.625,0.625,0.75,0.825,0.775,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.046875,0.625,0.625,0.375,0.625,0.0,0.222222,0.336495,0.847956,0.892857,0.84,0.428571,0.333333,0.625,0.375,0.25,1.0,0.0,0.0
7,e0080900,e0080900,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.555556,0.585786,1.0,0.928571,0.76,0.857143,1.0,1.0,1.0,1.0,1.0,1.0,1.0
0,e0080900,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.392342,0.0,0.36,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,e0080900,e0080851,0.625,1.0,0.625,0.625,0.75,0.85,0.775,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.054688,0.625,0.625,0.625,0.625,0.0,0.444444,0.391724,0.832356,0.892857,0.84,0.642857,0.5,0.625,0.625,0.0,1.0,0.0,0.0


### corporate

In [7]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'corporate_full_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
50,"historischer verein des kantons solothurn, kon...","reussenhock, drach, kloster rheinau",0.016949,0.0,0.177966,0.186441,0.579223,0.579223,0.582928,0.237288,0.142857,0.29661,0.457516,0.29661,1.0,-1.75336,0.544619,0.003914,0.29661,0.194915,0.025424,0.156863,0.0,0.017391,0.355136,0.957282,0.298969,0.362319,0.255814,0.333333,0.275424,0.0,0.0,0.29661,0.0,0.0
20,"historischer verein des kantons solothurn, kon...","interkantonale lehrmittelzentrale (luzern), st...",0.033898,0.0,0.254237,0.271186,0.627916,0.627916,0.649049,0.457627,0.139241,0.481203,0.649746,0.481203,0.810127,-1.055282,0.662866,0.00413,0.542373,0.313559,0.059322,0.324873,0.0,0.043478,0.455399,0.954719,0.360825,0.449275,0.337209,0.166667,0.360169,0.0,0.0,0.669492,0.0,0.0
44,"historischer verein des kantons solothurn, kon...","kölnische bibliotheksgesellschaft, bühnen köln...",0.025424,0.0,0.220339,0.237288,0.576144,0.576144,0.590382,0.322034,0.32,0.354839,0.52381,0.354839,0.88,-1.494765,0.572831,0.003842,0.372881,0.254237,0.084746,0.25,0.0,0.086957,0.370135,0.97442,0.350515,0.42029,0.27907,0.0,0.29661,0.0,0.0,0.423729,0.0,0.0
8,"historischer verein des kantons solothurn, kon...",schweizerische gesellschaft für bildungsforschung,0.025424,0.0,0.211864,0.228814,0.586898,0.586898,0.595563,0.313559,0.204082,0.336,0.502994,0.336,0.857143,-1.573467,0.552345,0.00395,0.355932,0.228814,0.076271,0.191617,0.0,0.017391,0.382052,0.972843,0.329897,0.42029,0.255814,0.166667,0.292373,0.0,0.0,0.415254,0.0,0.0
42,"historischer verein des kantons solothurn, kon...",caritas (ticino),0.0,0.0,0.101695,0.101695,0.487044,0.487044,0.494141,0.118644,0.0,0.116667,0.208955,0.116667,0.875,-3.099536,0.322201,0.002514,0.118644,0.101695,0.016949,0.104478,0.0,0.008696,0.178582,0.840574,0.257732,0.304348,0.127907,0.166667,0.186441,0.0,0.0,0.135593,0.0,0.0
54,"historischer verein des kantons solothurn, kon...","berliner philharmoniker, rias-kammerchor",0.042373,0.0,0.186441,0.186441,0.544562,0.544562,0.567994,0.262712,0.125,0.284553,0.443038,0.284553,0.875,-1.813231,0.509445,0.003591,0.29661,0.194915,0.025424,0.164557,0.0,0.043478,0.337764,0.937198,0.319588,0.333333,0.267442,0.0,0.288136,0.0,0.0,0.338983,0.0,0.0
11,"historischer verein des kantons solothurn, kon...","staatsoper (wien)chor, wiener philharmoniker",0.016949,0.0,0.220339,0.220339,0.574012,0.574012,0.589613,0.29661,0.068182,0.306452,0.469136,0.306452,0.863636,-1.706269,0.527371,0.00377,0.322034,0.220339,0.025424,0.222222,0.0,0.034783,0.362574,0.970403,0.329897,0.362319,0.22093,0.0,0.317797,0.0,0.0,0.372881,0.0,0.0
2,"historischer verein des kantons solothurn, kon...","metropolitan operaorchestra, metropolitan oper...",0.050847,0.0,0.211864,0.220339,0.587915,0.587915,0.604321,0.330508,0.150943,0.357143,0.526316,0.357143,0.849057,-1.485427,0.569028,0.003735,0.381356,0.237288,0.025424,0.187135,0.0,0.043478,0.369278,0.927407,0.319588,0.391304,0.232558,0.333333,0.330508,0.0,0.0,0.449153,0.0,0.0
32,"historischer verein des kantons solothurn, kon...",suisse,0.0,0.0,0.050847,0.050847,0.572505,0.572505,0.572505,0.050847,0.0,0.050847,0.096774,0.050847,1.0,-4.297681,0.225494,0.001185,0.050847,0.050847,0.016949,0.064516,0.0,0.0,0.079667,0.553768,0.226804,0.333333,0.104651,0.0,0.127119,0.0,0.0,0.050847,0.0,0.0
39,"historischer verein des kantons solothurn, kon...","malpigli, annibale (bologna)",0.0,0.0,0.127119,0.135593,0.426874,0.426874,0.466644,0.182203,0.142857,0.168,0.287671,0.168,0.75,-2.573467,0.365342,0.002298,0.177966,0.144068,0.025424,0.164384,0.0,0.034783,0.21589,0.875656,0.268041,0.304348,0.174419,0.0,0.220339,0.0,0.0,0.237288,0.0,0.0


Monge-Elkan, Jaccard with q-grams, and LCSStr appear to be suitable metrics for the $\texttt{corporate}$ attribute because of the way their algorithms work [[Chri2012](./A_References.ipynb#chri2012)]. The metrics to be chosen will be analysed and justified in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb).
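
To illustrate what a Jaccard similarity on q-grams measures, here is a minimal sketch in plain Python. It is not the library's implementation: the function names are hypothetical, q = 2 is an arbitrary choice, and q-grams are treated as a set, so repeated q-grams count only once.

```python
def qgrams(text, q=2):
    """Return the set of overlapping substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_qgram_similarity(str1, str2, q=2):
    """Jaccard similarity of the q-gram sets: |intersection| / |union|."""
    grams1, grams2 = qgrams(str1, q), qgrams(str2, q)
    union = grams1 | grams2
    if not union:
        # Both strings are shorter than q; fall back to an exact comparison
        return 1.0 if str1 == str2 else 0.0
    return len(grams1 & grams2) / len(union)


print(jaccard_qgram_similarity('reclam', 'reclam jun.'))  # 5 shared bigrams of 10 -> 0.5
print(jaccard_qgram_similarity('caritas (ticino)', 'suisse'))
```

The token-based classes of $\texttt{textdistance}$ accept a `qval` argument for this purpose, for example `tedi.Jaccard(qval=2)`. The library may count repeated q-grams as a multiset, so its values can differ from this set-based sketch.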

### doi

For attribute $\texttt{doi}$, a preprocessing function has been implemented to extract actual DOI identifiers, see chapter [Data Analysis](./1_DataAnalysis.ipynb). The DataFrame $\texttt{df}\_\texttt{feature}\_\texttt{base}$ holds each DOI identifier as a single preprocessed string.

In [8]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'doi_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
27,10.5169/seals-377305,10.5169/seals-376850,0.8,1.0,0.8,0.85,0.914815,0.948889,0.914815,0.9,0.8,0.818182,0.9,0.818182,0.9,-0.289507,0.9,0.025,0.9,0.85,0.8,0.85,0.0,0.095238,0.554195,0.980634,0.790698,0.741935,0.730769,0.5,0.85,0.8,0.0,1.0,0.0,0.0
26,10.5169/seals-377305,10.5169/seals-376810,0.8,1.0,0.8,0.8,0.9,0.94,0.9,0.9,0.8,0.73913,0.85,0.73913,0.85,-0.436099,0.85,0.025,0.85,0.85,0.8,0.85,0.0,0.142857,0.553542,0.982386,0.795455,0.741935,0.730769,0.5,0.8,0.8,0.0,1.0,0.0,0.0
9,10.5169/seals-377305,10.1055/b-005-143650,0.25,0.0,0.3,0.4,0.6,0.6,0.63,0.625,0.25,0.37931,0.55,0.37931,0.55,-1.398549,0.55,0.015,0.55,0.4,0.15,0.3,0.0,0.047619,0.36403,0.85997,0.571429,0.677419,0.230769,0.5,0.425,0.15,0.0,1.0,0.0,0.0
25,10.5169/seals-377305,10.5169/seals-376773,0.8,1.0,0.8,0.8,0.933333,0.96,0.933333,0.9,0.8,0.818182,0.9,0.818182,0.9,-0.289507,0.9,0.025,0.9,0.9,0.8,0.9,0.0,0.095238,0.578848,0.991341,0.833333,0.741935,0.769231,0.5,0.85,0.8,0.0,1.0,0.0,0.0
0,10.5169/seals-377305,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.211778,0.0,0.290323,0.230769,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,10.5169/seals-377305,10.5169/seals-376689,0.8,1.0,0.8,0.8,0.866667,0.92,0.866667,0.9,0.8,0.666667,0.8,0.666667,0.8,-0.584963,0.8,0.025,0.8,0.8,0.8,0.8,0.0,0.095238,0.548725,0.977919,0.795455,0.741935,0.730769,0.5,0.85,0.8,0.0,1.0,0.0,0.0
28,10.5169/seals-377305,10.5169/seals-376890,0.8,1.0,0.8,0.8,0.9,0.94,0.9,0.9,0.8,0.73913,0.85,0.73913,0.85,-0.436099,0.85,0.025,0.85,0.85,0.8,0.85,0.0,0.142857,0.551273,0.978092,0.795455,0.741935,0.730769,0.5,0.8,0.8,0.0,1.0,0.0,0.0
43,10.5169/seals-377305,10.5169/seals-377392,0.9,1.0,0.9,0.9,0.933333,0.96,0.933333,0.95,0.9,0.818182,0.9,0.818182,0.9,-0.289507,0.9,0.025,0.9,0.9,0.9,0.9,0.0,0.095238,0.554195,0.980634,0.790698,0.741935,0.807692,0.666667,0.9,0.9,0.0,1.0,0.0,0.0
17,10.5169/seals-377305,10.5169/seals-376473,0.8,1.0,0.8,0.8,0.933333,0.96,0.933333,0.9,0.8,0.818182,0.9,0.818182,0.9,-0.289507,0.9,0.025,0.9,0.9,0.8,0.9,0.0,0.047619,0.554195,0.980634,0.795455,0.741935,0.730769,0.5,0.8,0.8,0.0,1.0,0.0,0.0
20,10.5169/seals-377305,10.5169/seals-376572,0.8,1.0,0.8,0.8,0.914815,0.948889,0.914815,0.9,0.8,0.818182,0.9,0.818182,0.9,-0.289507,0.9,0.025,0.9,0.85,0.8,0.85,0.0,0.047619,0.554195,0.980634,0.790698,0.741935,0.730769,0.5,0.8,0.8,0.0,1.0,0.0,0.0


### edition

In [9]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'edition_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['Gotoh', 'ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
3,10425,5.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.2,1.0,0.2,0.333333,0.2,1.0,-2.321928,0.447214,0.02,0.2,0.2,0.2,0.333333,0.0,0.166667,0.117157,0.322192,0.88,0.92,0.545455,0.0,0.2,0.0,0.2,0.2,0.0,0.0
12,10425,1863.0,0.2,0.0,0.2,0.2,0.483333,0.483333,0.483333,0.5,0.0,0.125,0.222222,0.125,0.25,-3.0,0.223607,0.02,0.2,0.2,0.2,0.222222,0.0,0.166667,0.117157,0.714713,0.88,0.84,0.545455,0.2,0.2,0.2,0.0,0.8,0.0,0.0
32,10425,144.0,0.4,0.0,0.4,0.6,0.688889,0.688889,0.688889,0.5,0.0,0.333333,0.5,0.333333,0.666667,-1.584963,0.516398,0.04,0.4,0.4,0.2,0.5,0.0,0.166667,0.25359,0.627523,0.84,0.92,0.545455,0.0,0.4,0.2,0.0,0.6,0.0,0.0
18,10425,1885.0,0.2,0.0,0.4,0.4,0.633333,0.633333,0.633333,0.6,0.25,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.04,0.4,0.4,0.2,0.444444,0.0,0.166667,0.234315,0.697989,0.84,0.84,0.545455,0.2,0.4,0.2,0.2,0.8,0.0,0.0
29,10425,1943.0,0.4,0.0,0.4,0.4,0.633333,0.633333,0.633333,0.6,0.25,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.04,0.4,0.4,0.2,0.444444,0.0,0.166667,0.234315,0.781609,0.88,0.84,0.545455,0.4,0.4,0.2,0.0,0.8,0.0,0.0
1,10425,1.0,0.2,0.0,0.2,0.2,0.733333,0.733333,0.733333,0.2,0.0,0.2,0.333333,0.2,1.0,-2.321928,0.447214,0.02,0.2,0.2,0.2,0.333333,0.0,0.166667,0.117157,0.322192,0.88,0.92,0.545455,0.0,0.2,0.2,0.0,0.2,0.0,0.0
26,10425,1994.0,0.2,0.0,0.2,0.2,0.633333,0.633333,0.633333,0.5,0.0,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.04,0.4,0.4,0.2,0.444444,0.0,0.166667,0.234315,0.697989,0.88,0.84,0.545455,0.4,0.4,0.2,0.0,0.8,0.0,0.0
23,10425,1909.0,0.2,0.0,0.2,0.2,0.633333,0.633333,0.633333,0.5,0.0,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.04,0.4,0.4,0.2,0.444444,0.0,0.166667,0.234315,0.697989,0.88,0.84,0.545455,0.2,0.2,0.2,0.0,0.8,0.0,0.0
19,10425,1889.0,0.2,0.0,0.2,0.2,0.483333,0.483333,0.483333,0.5,0.0,0.125,0.222222,0.125,0.25,-3.0,0.223607,0.02,0.2,0.2,0.2,0.222222,0.0,0.166667,0.117157,0.631094,0.8,0.84,0.545455,0.2,0.2,0.2,0.0,0.8,0.0,0.0
22,10425,1907.0,0.2,0.0,0.2,0.2,0.633333,0.633333,0.633333,0.5,0.0,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.04,0.4,0.4,0.2,0.444444,0.0,0.166667,0.234315,0.781609,0.92,0.84,0.545455,0.2,0.2,0.2,0.0,0.8,0.0,0.0


### exactDate

In [10]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'exactDate_x')

for algorithm in tedi_algorithms :
    if algorithm not in ['ArithNCD'] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
165,1765uuuu,19972000,0.125,0.0,0.125,0.125,0.5,0.5,0.5,0.5625,0.5625,0.125,0.142857,0.25,0.142857,0.25,-2.807355,0.25,0.015625,0.25,0.25,0.125,0.25,0.0,0.125,0.190615,0.73767,0.666667,0.84,0.428571,0.2,0.5,0.125,0.0,1.0,0.0,0.0
189,1765uuuu,1873uuuu,0.625,1.0,0.625,0.625,0.833333,0.85,0.833333,0.8125,0.8125,0.625,0.6,0.75,0.6,0.75,-0.736966,0.75,0.046875,0.75,0.75,0.5,0.75,0.0,0.375,0.390524,0.916667,0.866667,0.84,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
151,1765uuuu,18971989,0.125,0.0,0.125,0.125,0.5,0.5,0.5,0.5625,0.5625,0.125,0.142857,0.25,0.142857,0.25,-2.807355,0.25,0.015625,0.25,0.25,0.125,0.25,0.0,0.111111,0.211325,0.75,0.7,0.84,0.428571,0.166667,0.25,0.125,0.0,1.0,0.0,0.0
90,1765uuuu,1862uuuu,0.75,1.0,0.75,0.75,0.833333,0.85,0.833333,0.875,0.875,0.75,0.6,0.75,0.6,0.75,-0.736966,0.75,0.046875,0.75,0.75,0.5,0.75,0.0,0.375,0.390524,0.916667,0.8,0.84,0.5,0.5,0.75,0.125,0.5,1.0,0.0,0.0
180,1765uuuu,1956uuuu,0.625,1.0,0.625,0.75,0.869048,0.882143,0.869048,0.8125,0.8125,0.625,0.777778,0.875,0.777778,0.875,-0.36257,0.875,0.054688,0.875,0.75,0.5,0.75,0.0,0.375,0.488155,0.958333,0.866667,0.84,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
128,1765uuuu,1889uuuu,0.625,1.0,0.625,0.625,0.75,0.775,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.375,0.292893,0.833333,0.833333,0.84,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
141,1765uuuu,1870uuuu,0.625,1.0,0.625,0.625,0.833333,0.85,0.833333,0.8125,0.8125,0.625,0.6,0.75,0.6,0.75,-0.736966,0.75,0.046875,0.75,0.75,0.5,0.75,0.0,0.375,0.390524,0.916667,0.833333,0.84,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
197,1765uuuu,19881862,0.125,0.0,0.125,0.125,0.416667,0.416667,0.416667,0.5625,0.5625,0.125,0.142857,0.25,0.142857,0.25,-2.807355,0.25,0.015625,0.25,0.25,0.125,0.25,0.0,0.111111,0.206296,0.752621,0.666667,0.84,0.428571,0.166667,0.25,0.125,0.0,1.0,0.0,0.0
181,1765uuuu,1913uuuu,0.625,1.0,0.625,0.625,0.75,0.775,0.75,0.8125,0.8125,0.625,0.454545,0.625,0.454545,0.625,-1.137504,0.625,0.039062,0.625,0.625,0.5,0.625,0.0,0.375,0.308956,0.84906,0.833333,0.84,0.5,0.25,0.625,0.125,0.5,1.0,0.0,0.0
202,1765uuuu,1794uuuu,0.75,1.0,0.75,0.75,0.833333,0.866667,0.833333,0.875,0.875,0.75,0.6,0.75,0.6,0.75,-0.736966,0.75,0.046875,0.75,0.75,0.5,0.75,0.0,0.375,0.390524,0.916667,0.8,0.84,0.5,0.5,0.75,0.25,0.5,1.0,0.0,0.0


Attribute $\texttt{exactDate}$ is a string of eight characters, each a digit or a letter, as the samples above show. For calculating the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance), each string pair is compared character by character. The distance is the number of positions at which the two strings differ, that is, the number of substitutions needed to convert one string into the other. It is thus a special case of an edit distance [[Chri2012](./A_References.ipynb#chri2012)]. The Hamming similarity follows from this distance and the string length: it is one minus the distance divided by the length. For the pair 1765uuuu and 1873uuuu in the DataFrame above, three of eight positions differ, which gives $1 - 3/8 = 0.625$. The Hamming similarity shall be used for attribute $\texttt{exactDate}$.
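
The calculation can be checked by hand against the DataFrame above. The following minimal sketch reproduces it in plain Python; the function name is hypothetical, and counting surplus characters of a longer string as differences is an assumption that does not matter for the equal-length $\texttt{exactDate}$ strings.

```python
def hamming_similarity(str1, str2):
    """Normalized Hamming similarity: 1 - (differing positions / length)."""
    length = max(len(str1), len(str2))
    if length == 0:
        return 1.0
    # Positions beyond the shorter string count as differences (assumption)
    distance = sum(c1 != c2 for c1, c2 in zip(str1, str2)) + abs(len(str1) - len(str2))
    return 1 - distance / length


print(hamming_similarity('1765uuuu', '1873uuuu'))  # 3 of 8 positions differ -> 0.625
print(hamming_similarity('1765uuuu', '1862uuuu'))  # 2 of 8 positions differ -> 0.75
print(hamming_similarity('1765uuuu', '19972000'))  # 7 of 8 positions differ -> 0.125
```

The three results agree with the Hamming column of the corresponding rows in the DataFrame.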

The Hamming similarity has one drawback, though, when applied to Swissbib's data. Unknown digits of the attribute may be filled with the letter 'u' instead of a number. A letter 'u' compared with a digit adds 1 to the edit distance, although the bibliographic units that the records describe need not differ in that digit. Conversely, a pair of strings with a letter 'u' at the same position contributes a distance of 0 there, although the two bibliographic units compared need not agree. For this reason, the Hamming similarity will be adapted for the case that the letter 'u' occurs in one of the strings of the pair: it will be increased by a small value for each unknown digit in a string. This algorithm based on the Hamming similarity will be explicitly implemented in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb).
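
One possible reading of this adaptation is sketched below. It is not the implementation of the later chapter: the function name, the rule that only a 'u' facing a digit earns credit, and the credit value of 0.25 are all assumptions chosen for illustration.

```python
def adapted_hamming_similarity(str1, str2, unknown='u', credit=0.25):
    """Hamming similarity with partial credit where exactly one side is unknown.

    The rule and the value of `credit` are assumptions for illustration only.
    """
    if len(str1) != len(str2):
        raise ValueError('strings must have equal length')
    if not str1:
        return 1.0
    score = 0.0
    for c1, c2 in zip(str1, str2):
        if c1 == c2:
            score += 1.0      # same character, as in plain Hamming
        elif unknown in (c1, c2):
            score += credit   # digit versus 'u': small credit instead of 0
    return score / len(str1)


# Plain Hamming gives 0.125 here; four 'u' positions add 4 * 0.25 / 8
print(adapted_hamming_similarity('1765uuuu', '19972000'))  # -> 0.25
# No 'u' faces a digit, so the value equals the plain Hamming similarity
print(adapted_hamming_similarity('1765uuuu', '1873uuuu'))  # -> 0.625
```

Under this reading, a pair with 'u' at the same position is still scored as a match. A stricter variant could reduce that score as well.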

### format

In [11]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'format_prefix_x')

for algorithm in tedi_algorithms :
    if algorithm not in [] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
ArithNCD
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,ArithNCD,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
3,mu,cr,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.333333,0.0,0.5,0.875,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,mu,mu,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.333333,0.0,0.333333,0.585786,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,mu,vm,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.125,0.5,0.5,0.5,0.5,-0.333333,0.0,0.333333,0.292893,0.75,0.958333,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,mu,cf,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
0,mu,bk,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.333333,0.0,0.5,0.791667,1.0,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,mu,mp,0.5,1.0,0.5,0.5,0.666667,0.666667,0.666667,0.75,0.75,0.5,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.125,0.5,0.5,0.5,0.5,0.0,0.0,0.333333,0.292893,0.75,0.916667,1.0,0.75,0.5,0.5,0.5,0.0,1.0,0.0,0.0


In [12]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'format_postfix_x')

for algorithm in tedi_algorithms :
    if algorithm not in [] :
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
Gotoh
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
ArithNCD
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,Gotoh,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,ArithNCD,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
18,20353,10053,0.666667,1.0,0.666667,0.666667,0.694444,0.694444,0.694444,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.069444,0.666667,0.666667,0.333333,0.666667,-0.4,0.0,0.285714,0.451109,0.910186,0.88,0.92,0.5,0.333333,0.666667,0.166667,0.333333,1.0,0.0,0.0
10,20353,30300,0.5,1.0,0.5,0.5,0.694444,0.694444,0.694444,0.75,0.75,0.5,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.055556,0.666667,0.5,0.333333,0.5,-0.5,0.166667,0.285714,0.371374,0.757558,0.84,0.92,0.75,0.5,0.5,0.166667,0.0,1.0,0.0,0.0
26,20353,40500,0.333333,0.0,0.333333,0.5,0.666667,0.666667,0.666667,0.666667,0.666667,0.333333,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.041667,0.5,0.5,0.166667,0.5,-0.6,0.166667,0.285714,0.321121,0.757558,0.84,0.92,0.5,0.333333,0.5,0.166667,0.0,1.0,0.0,0.0
20,20353,30653,0.666667,1.0,0.666667,0.666667,0.822222,0.84,0.822222,0.833333,0.833333,0.666667,0.714286,0.833333,0.714286,0.833333,-0.485427,0.833333,0.069444,0.833333,0.666667,0.333333,0.666667,-0.5,0.0,0.142857,0.464466,0.942889,0.88,0.92,0.5,0.666667,0.666667,0.166667,0.333333,1.0,0.0,0.0
6,20353,10200,0.333333,0.0,0.333333,0.333333,0.555556,0.555556,0.555556,0.666667,0.666667,0.333333,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.041667,0.5,0.5,0.5,0.5,-0.2,0.166667,0.285714,0.321121,0.757558,0.84,0.92,0.666667,0.333333,0.333333,0.166667,0.0,1.0,0.0,0.0
9,20353,20053,0.833333,1.0,0.833333,0.833333,0.822222,0.875556,0.822222,0.916667,0.916667,0.833333,0.714286,0.833333,0.714286,0.833333,-0.485427,0.833333,0.083333,0.833333,0.833333,0.5,0.833333,-0.3,0.0,0.285714,0.572429,0.967297,0.92,0.92,0.5,0.5,0.833333,0.5,0.333333,1.0,0.0,0.0
3,20353,30600,0.333333,0.0,0.333333,0.333333,0.555556,0.555556,0.555556,0.666667,0.666667,0.333333,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.055556,0.5,0.333333,0.333333,0.333333,-0.4,0.166667,0.142857,0.341081,0.779114,0.88,0.92,0.5,0.333333,0.333333,0.166667,0.0,1.0,0.0,0.0
12,20353,10000,0.333333,0.0,0.333333,0.333333,0.555556,0.555556,0.555556,0.666667,0.666667,0.333333,0.2,0.333333,0.2,0.333333,-2.321928,0.333333,0.027778,0.333333,0.333333,0.166667,0.333333,-0.5,0.166667,0.142857,0.208045,0.612565,0.807692,0.92,0.583333,0.0,0.333333,0.166667,0.0,1.0,0.0,0.0
25,20353,20347,0.666667,1.0,0.666667,0.666667,0.777778,0.866667,0.777778,0.833333,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.069444,0.666667,0.666667,0.666667,0.666667,0.25,0.0,0.142857,0.387199,0.865577,0.88,0.92,0.666667,0.666667,0.666667,0.666667,0.0,1.0,0.0,0.0
8,20353,20047,0.5,1.0,0.5,0.5,0.666667,0.666667,0.666667,0.75,0.75,0.5,0.333333,0.5,0.333333,0.5,-1.584963,0.5,0.041667,0.5,0.5,0.5,0.5,-0.8,0.0,0.142857,0.309828,0.831519,0.923077,0.92,0.5,0.5,0.5,0.5,0.0,1.0,0.0,0.0


### isbn

Attribute $\texttt{isbn}$ is treated as a list of string elements, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb). The metrics comparison is therefore omitted here.

### ismn

In [13]:
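# Build string pairs for attribute ismn
df_string_pairs = daf.string_pair_list(df_feature_base, 'ismn_x')

# Apply every metric in tedi_algorithms except Gotoh and ArithNCD
for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))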

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
7,"m006546756 (kritischer bericht, leinen)",m006204687,0.102564,0.0,0.128205,0.128205,0.626496,0.626496,0.63906,0.192308,0.0,0.166667,0.285714,0.166667,0.7,-2.584963,0.354459,0.00263,0.179487,0.179487,0.102564,0.285714,0.0,0.025,0.129247,0.683321,0.362069,0.512195,0.177778,0.5,0.179487,0.102564,0.0,0.25641,0.0,0.0
6,"m006546756 (kritischer bericht, leinen)","m006546756 (kritischer bericht, leinen)",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.125,0.585786,1.0,0.827586,0.853659,0.933333,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,"m006546756 (kritischer bericht, leinen)",m700241001,0.076923,0.0,0.076923,0.076923,0.476068,0.476068,0.501197,0.166667,0.0,0.113636,0.204082,0.113636,0.5,-3.137504,0.253185,0.001644,0.128205,0.102564,0.051282,0.163265,0.0,0.05,0.098336,0.61719,0.362069,0.512195,0.133333,0.166667,0.128205,0.025641,0.0,0.25641,0.0,0.0
3,"m006546756 (kritischer bericht, leinen)",m006546756,0.25641,0.0,0.25641,0.25641,0.752137,0.851282,0.776923,0.25641,0.0,0.25641,0.408163,0.25641,1.0,-1.963474,0.50637,0.003287,0.25641,0.25641,0.25641,0.408163,0.0,0.025,0.159993,0.669551,0.37931,0.560976,0.311111,0.5,0.307692,0.25641,0.0,0.25641,0.0,0.0
1,"m006546756 (kritischer bericht, leinen)",m006450510,0.102564,0.0,0.153846,0.153846,0.578877,0.578877,0.591441,0.205128,0.0,0.166667,0.285714,0.166667,0.7,-2.584963,0.354459,0.002959,0.179487,0.153846,0.102564,0.244898,0.0,0.025,0.133547,0.637768,0.344828,0.512195,0.177778,0.5,0.205128,0.102564,0.0,0.25641,0.0,0.0
4,"m006546756 (kritischer bericht, leinen)",m006546749,0.205128,0.0,0.205128,0.205128,0.668376,0.668376,0.668376,0.230769,0.0,0.195122,0.326531,0.195122,0.8,-2.357552,0.405096,0.003287,0.205128,0.205128,0.205128,0.326531,0.0,0.025,0.154418,0.705288,0.37931,0.512195,0.266667,0.5,0.25641,0.205128,0.0,0.25641,0.0,0.0
0,"m006546756 (kritischer bericht, leinen)",,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.192245,0.0,0.219512,0.133333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,"m006546756 (kritischer bericht, leinen)",m008060205,0.076923,0.0,0.102564,0.102564,0.542735,0.542735,0.555299,0.179487,0.0,0.113636,0.204082,0.113636,0.5,-3.137504,0.253185,0.00263,0.128205,0.128205,0.076923,0.204082,0.0,-0.025,0.10854,0.602494,0.327586,0.512195,0.133333,0.333333,0.153846,0.076923,0.0,0.25641,0.0,0.0
2,"m006546756 (kritischer bericht, leinen)",9790006450510,0.051282,0.0,0.051282,0.076923,0.477411,0.477411,0.487668,0.192308,0.0,0.155556,0.269231,0.155556,0.538462,-2.684498,0.310881,0.002959,0.179487,0.128205,0.076923,0.192308,0.0,0.025,0.134985,0.662546,0.344828,0.560976,0.177778,0.0,0.128205,0.0,0.0,0.333333,0.0,0.0
11,"m006546756 (kritischer bericht, leinen)",9790006201334,0.025641,0.0,0.025641,0.025641,0.37094,0.37094,0.381197,0.179487,0.0,0.106383,0.192308,0.106383,0.384615,-3.232661,0.222058,0.002301,0.128205,0.102564,0.076923,0.153846,0.0,0.05,0.10362,0.685825,0.344828,0.560976,0.177778,0.0,0.102564,0.0,0.0,0.333333,0.0,0.0
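
As a plausibility check for the table above, the Levenshtein column can be reproduced by hand. The sketch below is a plain-Python reimplementation written for illustration only: the helper names `levenshtein_distance` and `normalized_similarity` are not part of textdistance or `daf`, and the normalization (one minus the distance divided by the length of the longer string) is an assumption. Under that assumption, the pair whose second string is the bare number `m006546756` yields 10/39 ≈ 0.25641, which agrees with the Levenshtein value of that row.

```python
def levenshtein_distance(s1, s2):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(s1, s2) / longest


str1 = 'm006546756 (kritischer bericht, leinen)'
str2 = 'm006546756'
similarity = normalized_similarity(str1, str2)
print(round(similarity, 6))
```

The 29 trailing characters of the annotation dominate the score, so an identical number with a textual suffix is rated far below an exact match.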


### musicid

In [14]:
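# Build string pairs for attribute musicid
df_string_pairs = daf.string_pair_list(df_feature_base, 'musicid_x')

# Apply every metric in tedi_algorithms except Gotoh and ArithNCD
for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))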

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
14,1092,99036.0,0.0,0.0,0.2,0.2,0.466667,0.466667,0.466667,0.5,0.25,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.0625,0.4,0.2,0.2,0.222222,0.0,0.166667,0.287242,0.834182,0.92,0.84,0.545455,0.25,0.4,0.0,0.0,0.8,0.0,0.0
30,1092,92633.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.0625,0.4,0.4,0.4,0.444444,0.0,0.166667,0.265409,0.806223,0.92,0.84,0.545455,0.0,0.2,0.0,0.0,0.8,0.0,0.0
49,1092,117.0,0.25,0.0,0.25,0.25,0.527778,0.527778,0.527778,0.5,0.0,0.166667,0.285714,0.166667,0.333333,-2.584963,0.288675,0.03125,0.25,0.25,0.25,0.285714,0.0,0.2,0.170541,0.596737,0.84,0.913043,0.6,0.25,0.25,0.25,0.0,0.75,0.0,0.0
13,1092,4553.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.583333,0.84,0.913043,0.6,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50,1092,1092.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.2,0.585786,1.0,0.96,0.913043,0.8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
29,1092,50999.0,0.4,0.0,0.4,0.4,0.633333,0.633333,0.633333,0.6,0.25,0.285714,0.444444,0.285714,0.5,-1.807355,0.447214,0.0625,0.4,0.4,0.4,0.444444,0.0,0.166667,0.329459,0.770712,0.84,0.84,0.545455,0.5,0.6,0.0,0.0,0.8,0.0,0.0
6,1092,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.226024,0.88,0.913043,0.6,0.0,0.0,0.0,0.0,0.25,0.0,0.0
16,1092,4355.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.583333,0.84,0.913043,0.6,0.0,0.0,0.0,0.0,1.0,0.0,0.0
23,1092,7794.0,0.25,0.0,0.25,0.25,0.5,0.5,0.5,0.625,0.25,0.142857,0.25,0.142857,0.25,-2.807355,0.25,0.03125,0.25,0.25,0.25,0.25,0.0,0.2,0.146447,0.666667,0.88,0.913043,0.6,0.0,0.25,0.0,0.0,1.0,0.0,0.0
2,1092,242.0,0.0,0.0,0.25,0.25,0.527778,0.527778,0.527778,0.5,0.333333,0.166667,0.285714,0.166667,0.333333,-2.584963,0.288675,0.03125,0.25,0.25,0.25,0.285714,0.0,0.2,0.170541,0.596737,0.84,0.913043,0.6,0.0,0.25,0.0,0.25,0.75,0.0,0.0


### part

In [15]:
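# Build string pairs for attribute part
df_string_pairs = daf.string_pair_list(df_feature_base, 'part_x')

# Apply every metric in tedi_algorithms except Gotoh and ArithNCD
for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))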

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
140,23 1862,286 2007,0.125,0.0,0.25,0.25,0.713095,0.741786,0.713095,0.5625,0.142857,0.5,0.666667,0.5,0.714286,-1.0,0.668153,0.05102,0.625,0.5,0.25,0.533333,0.0,0.111111,0.37868,0.903391,0.785714,0.84,0.428571,0.166667,0.25,0.125,0.0,0.875,0.0,0.0
66,23 1862,41 42 620,0.111111,0.0,0.333333,0.333333,0.505291,0.505291,0.505291,0.555556,0.142857,0.454545,0.625,0.454545,0.714286,-1.137504,0.629941,0.05102,0.555556,0.444444,0.222222,0.5,0.0,0.1,0.370329,0.907838,0.758621,0.851852,0.4,0.166667,0.333333,0.0,0.0,0.777778,0.0,0.0
167,23 1862,23 23 1900 23,0.230769,0.0,0.384615,0.384615,0.632967,0.632967,0.632967,0.461538,0.285714,0.333333,0.5,0.333333,0.714286,-1.584963,0.524142,0.05102,0.384615,0.384615,0.307692,0.5,0.0,0.25,0.343781,0.917046,0.733333,0.793103,0.470588,0.5,0.461538,0.230769,0.0,0.538462,0.0,0.0
33,23 1862,1 1,0.0,0.0,0.285714,0.285714,0.650794,0.650794,0.650794,0.357143,0.0,0.25,0.4,0.25,0.666667,-2.0,0.436436,0.020408,0.285714,0.285714,0.285714,0.4,0.0,0.125,0.197678,0.566071,0.814815,0.84,0.461538,0.0,0.285714,0.0,0.0,0.428571,0.0,0.0
51,23 1862,60 3 2015 432 437,0.058824,0.0,0.235294,0.235294,0.625584,0.625584,0.625584,0.323529,0.142857,0.333333,0.5,0.333333,0.857143,-1.584963,0.550019,0.061224,0.352941,0.235294,0.117647,0.333333,0.0,0.055556,0.291948,0.85453,0.6,0.741935,0.285714,0.166667,0.235294,0.0,0.0,0.411765,0.0,0.0
84,23 1862,23 23 1909,0.3,0.0,0.4,0.4,0.671429,0.671429,0.671429,0.55,0.571429,0.416667,0.588235,0.416667,0.714286,-1.263034,0.597614,0.05102,0.5,0.4,0.4,0.470588,0.0,0.363636,0.362883,0.914666,0.8,0.777778,0.5625,0.5,0.4,0.3,0.0,0.7,0.0,0.0
60,23 1862,20,0.142857,0.0,0.142857,0.142857,0.547619,0.547619,0.547619,0.214286,0.0,0.125,0.222222,0.125,0.5,-3.0,0.267261,0.020408,0.142857,0.142857,0.142857,0.222222,0.0,0.125,0.106352,0.533852,0.703704,0.84,0.461538,0.0,0.142857,0.142857,0.0,0.285714,0.0,0.0
0,23 1862,20008,0.285714,0.0,0.285714,0.285714,0.561905,0.561905,0.561905,0.5,0.0,0.2,0.333333,0.2,0.4,-2.321928,0.338062,0.030612,0.285714,0.285714,0.142857,0.333333,0.0,0.125,0.197678,0.643671,0.740741,0.92,0.461538,0.0,0.285714,0.142857,0.0,0.714286,0.0,0.0
55,23 1862,3870 3870,0.0,0.0,0.222222,0.333333,0.47619,0.47619,0.47619,0.5,0.285714,0.230769,0.375,0.230769,0.428571,-2.115477,0.377964,0.030612,0.333333,0.333333,0.111111,0.375,0.0,0.2,0.292948,0.822639,0.785714,0.851852,0.538462,0.166667,0.222222,0.0,0.0,0.777778,0.0,0.0
96,23 1862,1 29,0.0,0.0,0.142857,0.285714,0.428571,0.428571,0.428571,0.357143,0.25,0.375,0.545455,0.375,0.75,-1.415037,0.566947,0.040816,0.428571,0.285714,0.142857,0.181818,0.0,0.125,0.289004,0.811584,0.888889,0.84,0.461538,0.0,0.142857,0.0,0.0,0.571429,0.0,0.0


### person

Attribute $\texttt{person}$ comes in three different representations (`100`, `245c` and `700`), see chapter [Data Analysis](./1_DataAnalysis.ipynb). A similarity metric may behave differently depending on the representation, so all three are investigated below.

In [16]:
person_representations = ['100', '245c', '700']

for pr in person_representations:
    # Build string pairs for this representation of person
    df_string_pairs = daf.string_pair_list(df_feature_base, f'person_{pr}_x')

    print(f'\nperson_{pr}\n**********')
    # Apply every metric in tedi_algorithms except Gotoh and ArithNCD
    for algorithm in tedi_algorithms:
        if algorithm not in ['Gotoh', 'ArithNCD']:
            daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

    # Show a random sample of the scored pairs
    display(df_string_pairs.sample(n=num_of_samples(df_string_pairs)))


person_100
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
24,rosoffmeg,eigenmanndaniela,0.0,0.0,0.0625,0.125,0.395833,0.395833,0.447917,0.3125,0.0,0.136364,0.24,0.136364,0.333333,-2.874469,0.25,0.018519,0.1875,0.125,0.0625,0.16,0.0,0.058824,0.174993,0.817973,0.65625,0.724138,0.272727,0.0,0.28125,0.0,0.0,0.5625,0.0,0.0
3,rosoffmeg,fluryandreas,0.0,0.0,0.083333,0.166667,0.416667,0.416667,0.475,0.416667,0.111111,0.235294,0.380952,0.235294,0.444444,-2.087463,0.3849,0.030864,0.333333,0.166667,0.083333,0.190476,0.0,0.076923,0.234189,0.804066,0.71875,0.777778,0.333333,0.0,0.291667,0.0,0.0,0.75,0.0,0.0
35,rosoffmeg,iodocus,0.222222,0.0,0.222222,0.222222,0.502646,0.502646,0.553439,0.5,0.142857,0.230769,0.375,0.230769,0.428571,-2.115477,0.377964,0.018519,0.333333,0.222222,0.111111,0.25,0.0,0.1,0.180651,0.804493,0.7,0.851852,0.4,0.0,0.388889,0.0,0.0,0.777778,0.0,0.0
29,rosoffmeg,biglermatthias,0.071429,0.0,0.071429,0.071429,0.410053,0.410053,0.446561,0.357143,0.111111,0.277778,0.434783,0.277778,0.555556,-1.847997,0.445435,0.030864,0.357143,0.142857,0.071429,0.173913,0.0,0.066667,0.23924,0.787951,0.705882,0.793103,0.3,0.0,0.321429,0.0,0.0,0.642857,0.0,0.0
28,rosoffmeg,bührerwalter,0.0,0.0,0.166667,0.166667,0.462963,0.462963,0.501852,0.458333,0.222222,0.105263,0.190476,0.105263,0.222222,-3.247928,0.19245,0.012346,0.166667,0.166667,0.083333,0.190476,0.0,0.076923,0.139383,0.783945,0.611111,0.793103,0.315789,0.0,0.25,0.0,0.0,0.75,0.0,0.0
19,rosoffmeg,bruchjulia,0.0,0.0,0.1,0.1,0.403704,0.403704,0.488148,0.5,0.111111,0.055556,0.105263,0.055556,0.111111,-4.169925,0.105409,0.006173,0.1,0.1,0.1,0.105263,0.0,0.090909,0.062224,0.732796,0.6875,0.851852,0.375,0.0,0.3,0.0,0.0,0.9,0.0,0.0
6,rosoffmeg,käserbeatrice,0.076923,0.0,0.076923,0.153846,0.410256,0.410256,0.48547,0.384615,0.111111,0.157895,0.272727,0.157895,0.333333,-2.662965,0.27735,0.018519,0.230769,0.153846,0.076923,0.181818,0.0,0.071429,0.179432,0.789167,0.641026,0.793103,0.3,0.0,0.230769,0.0,0.0,0.692308,0.0,0.0
9,rosoffmeg,basuandreas,0.090909,0.0,0.181818,0.181818,0.468013,0.468013,0.52862,0.5,0.0,0.176471,0.3,0.176471,0.333333,-2.5025,0.301511,0.018519,0.272727,0.181818,0.090909,0.2,0.0,0.083333,0.202677,0.812512,0.741935,0.851852,0.352941,0.166667,0.363636,0.0,0.0,0.818182,0.0,0.0
5,rosoffmeg,gläser-zikudamichaela,0.0,0.0,0.142857,0.142857,0.46455,0.46455,0.512169,0.285714,0.0,0.2,0.333333,0.2,0.555556,-2.321928,0.363696,0.030864,0.238095,0.142857,0.047619,0.2,0.0,0.045455,0.168309,0.727247,0.5,0.69697,0.214286,0.0,0.214286,0.0,0.0,0.428571,0.0,0.0
17,rosoffmeg,rosoffmeg,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.1,0.585786,1.0,0.9,0.851852,0.866667,1.0,1.0,1.0,1.0,1.0,1.0,1.0



person_245c
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
271,von wolfgang amadeus mozart ; libretto: emanue...,wolfgang amadeus mozart ; bearb.: stéphane des...,0.037037,0.0,0.62963,0.62963,0.726192,0.726192,0.732051,0.654321,0.576271,0.686747,0.814286,0.686747,0.966102,-0.542149,0.82453,0.008331,0.703704,0.679012,0.395062,0.785714,0.0,0.0625,0.488662,0.979306,0.543478,0.714286,0.654321,0.5,0.641975,0.0,0.395062,0.728395,0.0,0.0
158,von wolfgang amadeus mozart ; libretto: emanue...,heidy binder... [et al.] ; [éd.:] interkantona...,0.039216,0.0,0.215686,0.235294,0.603529,0.603529,0.627606,0.397059,0.152542,0.4375,0.608696,0.4375,0.830508,-1.192645,0.631641,0.0079,0.480392,0.254902,0.029412,0.186335,0.0,-0.010101,0.423381,0.966636,0.304762,0.391304,0.271739,0.0,0.328431,0.0,0.0,0.578431,0.0,0.0
287,von wolfgang amadeus mozart ; libretto: emanue...,"uniwersytet im. adama mickiewicza w poznaniu, uam",0.016949,0.0,0.186441,0.20339,0.616788,0.616788,0.657881,0.5,0.081633,0.479452,0.648148,0.479452,0.714286,-1.060542,0.650945,0.006895,0.59322,0.305085,0.050847,0.277778,0.0,0.016949,0.412921,0.935035,0.39726,0.490196,0.377049,0.166667,0.322034,0.0,0.0,0.830508,0.0,0.0
385,von wolfgang amadeus mozart ; libretto: emanue...,liane moriarty ; trad. de l'anglais (australie...,0.032967,0.0,0.252747,0.274725,0.604258,0.604258,0.621021,0.450549,0.118644,0.442308,0.613333,0.442308,0.779661,-1.176878,0.627785,0.007469,0.505495,0.318681,0.043956,0.28,0.0,0.033333,0.412793,0.955949,0.326733,0.42029,0.267442,0.5,0.346154,0.0,0.0,0.648352,0.0,0.0
225,von wolfgang amadeus mozart ; libretto: emanue...,[musik v.] wolfgang amadeus mozart ; opernführ...,0.026786,0.0,0.464286,0.464286,0.661267,0.661267,0.663855,0.495536,0.423729,0.5,0.666667,0.5,0.966102,-1.0,0.701197,0.008187,0.508929,0.464286,0.241071,0.608187,0.0,0.063636,0.409991,0.960938,0.403361,0.544304,0.446602,0.0,0.486607,0.0,0.0,0.526786,0.0,0.0
307,von wolfgang amadeus mozart ; libretto: emanue...,sigrid kessler ... <et al.> ; <éds.> interkant...,0.035714,0.0,0.223214,0.223214,0.601544,0.601544,0.622246,0.375,0.169492,0.425,0.596491,0.425,0.864407,-1.234465,0.627386,0.008044,0.455357,0.241071,0.026786,0.175439,0.0,-0.028037,0.424557,0.96887,0.318182,0.380282,0.257732,0.0,0.325893,0.0,0.0,0.526786,0.0,0.0
175,von wolfgang amadeus mozart ; libretto: emanue...,"[o. bonny, m. vinciguerra, m.l. gumz, g. mazzo...",0.101695,0.0,0.152542,0.220339,0.596314,0.596314,0.618022,0.516949,0.057692,0.441558,0.612613,0.441558,0.653846,-1.179324,0.613834,0.006607,0.576271,0.271186,0.033898,0.108108,0.0,0.0,0.389318,0.930685,0.383562,0.411765,0.278689,0.166667,0.271186,0.0,0.0,0.881356,0.0,0.0
128,von wolfgang amadeus mozart ; libretto: emanue...,wolfgang amadeus mozart ; dichtung von emanuel...,0.017544,0.0,0.385965,0.394737,0.669688,0.669688,0.669688,0.434211,0.118644,0.504348,0.67052,0.504348,0.983051,-0.987509,0.707212,0.008331,0.508772,0.429825,0.22807,0.566474,0.0,0.06087,0.483599,0.985266,0.461538,0.584416,0.494505,0.0,0.421053,0.0,0.0,0.517544,0.0,0.0
276,von wolfgang amadeus mozart ; libretto: emanue...,von w.a. mozart ; [die deutsche dichtung ist v...,0.069307,0.0,0.326733,0.346535,0.67561,0.67561,0.689035,0.455446,0.135593,0.52381,0.6875,0.52381,0.932203,-0.932886,0.712485,0.008044,0.544554,0.405941,0.118812,0.5125,0.0,0.070707,0.439783,0.972915,0.411765,0.492958,0.384615,0.5,0.415842,0.049505,0.0,0.584158,0.0,0.0
87,von wolfgang amadeus mozart ; libretto: emanue...,wolfgang amadeus mozart ; einführung und komme...,0.045455,0.0,0.454545,0.469697,0.708344,0.708344,0.724395,0.643939,0.389831,0.689189,0.816,0.689189,0.864407,-0.537028,0.817283,0.008044,0.772727,0.545455,0.393939,0.464,0.0,0.030303,0.530249,0.980385,0.445783,0.649123,0.521739,0.0,0.537879,0.0,0.0,0.893939,0.0,0.0



person_700
**********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
80,"christiewilliam, dessaynatalie, mannionrosa, b...","karajanherbert von, mozartwolfgang amadeus",0.02521,0.0,0.302521,0.302521,0.580999,0.580999,0.600327,0.327731,0.595238,0.330579,0.496894,0.330579,0.952381,-1.596935,0.565799,0.003637,0.336134,0.302521,0.201681,0.372671,0.0,0.017391,0.375482,0.968866,0.301887,0.468354,0.347368,0.5,0.390756,0.0,0.201681,0.352941,0.0,0.0
45,"christiewilliam, dessaynatalie, mannionrosa, b...","schelhassmartin, schikanederemanuel, mozartwol...",0.02521,0.0,0.411765,0.411765,0.653876,0.653876,0.666552,0.453782,0.491525,0.483333,0.651685,0.483333,0.983051,-1.04891,0.692195,0.004096,0.487395,0.420168,0.201681,0.483146,0.0,0.069565,0.451515,0.983026,0.415094,0.493671,0.4,0.5,0.470588,0.0,0.201681,0.495798,0.0,0.0
39,"christiewilliam, dessaynatalie, mannionrosa, b...","schikanederemanuel, soldankurt",0.042017,0.0,0.168067,0.176471,0.554026,0.554026,0.566547,0.210084,0.1,0.231405,0.375839,0.231405,0.933333,-2.111508,0.468623,0.003778,0.235294,0.176471,0.02521,0.255034,0.0,0.026087,0.317994,0.938964,0.311321,0.341772,0.210526,0.0,0.268908,0.0,0.0,0.252101,0.0,0.0
137,"christiewilliam, dessaynatalie, mannionrosa, b...","dufourguillaume henri, müllhaupthans heinrich,...",0.02521,0.0,0.252101,0.277311,0.602538,0.602538,0.620489,0.369748,0.103448,0.416,0.587571,0.416,0.896552,-1.265345,0.625916,0.003955,0.436975,0.277311,0.042017,0.293785,0.0,-0.008696,0.41887,0.962492,0.367925,0.417722,0.242105,0.166667,0.352941,0.0,0.0,0.487395,0.0,0.0
219,"christiewilliam, dessaynatalie, mannionrosa, b...","schikanederemanuel, soltigeorg",0.033613,0.0,0.159664,0.168067,0.560815,0.560815,0.573336,0.205882,0.066667,0.252101,0.402685,0.252101,1.0,-1.987927,0.502096,0.003884,0.252101,0.168067,0.02521,0.255034,0.0,0.034783,0.333165,0.952953,0.349057,0.367089,0.210526,0.0,0.264706,0.0,0.0,0.252101,0.0,0.0
78,"christiewilliam, dessaynatalie, mannionrosa, b...","schlöndorffvolker, frischmax",0.008403,0.0,0.134454,0.134454,0.483427,0.483427,0.509897,0.184874,0.071429,0.185484,0.312925,0.185484,0.821429,-2.430634,0.398451,0.003531,0.193277,0.151261,0.033613,0.176871,0.0,0.043478,0.28351,0.952187,0.311321,0.341772,0.126316,0.0,0.235294,0.0,0.0,0.235294,0.0,0.0
34,"christiewilliam, dessaynatalie, mannionrosa, b...","zentnerwilhelm, schikanederemanuel, goethejoha...",0.016807,0.0,0.285714,0.302521,0.636426,0.636426,0.651304,0.394958,0.180328,0.47541,0.644444,0.47541,0.95082,-1.072756,0.680753,0.004096,0.487395,0.319328,0.07563,0.377778,0.0,0.026087,0.442664,0.974805,0.386792,0.468354,0.305263,0.0,0.390756,0.0,0.0,0.512605,0.0,0.0
153,"christiewilliam, dessaynatalie, mannionrosa, b...","schikanederemanuel, csampaiattila",0.02521,0.0,0.184874,0.184874,0.582762,0.582762,0.590503,0.231092,0.121212,0.277311,0.434211,0.277311,1.0,-1.850424,0.526603,0.003601,0.277311,0.193277,0.02521,0.184211,0.0,0.034783,0.3271,0.926596,0.292453,0.367089,0.221053,0.0,0.285714,0.0,0.0,0.277311,0.0,0.0
235,"christiewilliam, dessaynatalie, mannionrosa, b...","zamperoniluca, braunrichard, petzoldbert alexa...",0.02521,0.0,0.386555,0.386555,0.663989,0.663989,0.672756,0.504202,0.418919,0.519685,0.683938,0.519685,0.891892,-0.944291,0.703323,0.004131,0.554622,0.411765,0.201681,0.466321,0.0,0.078261,0.481662,0.984534,0.415094,0.56962,0.4,0.5,0.495798,0.0,0.201681,0.621849,0.0,0.0
114,"christiewilliam, dessaynatalie, mannionrosa, b...","mozartwolfgang amadeus, aberthermann",0.05042,0.0,0.168067,0.176471,0.545915,0.545915,0.564005,0.235294,0.222222,0.302521,0.464516,0.302521,1.0,-1.724893,0.550019,0.003601,0.302521,0.184874,0.184874,0.283871,0.0,0.008696,0.3561,0.95128,0.358491,0.417722,0.305263,0.166667,0.268908,0.0,0.0,0.302521,0.0,0.0


### pubinit

In [17]:
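# Build string pairs for attribute pubinit
df_string_pairs = daf.string_pair_list(df_feature_base, 'pubinit_x')

# Apply every metric in tedi_algorithms except Gotoh and ArithNCD
for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))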

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
182,del prado,"interkantonale lehrmittelzentral, berner lehrm...",0.014925,0.0,0.089552,0.089552,0.474295,0.474295,0.486899,0.11194,0.111111,0.117647,0.210526,0.117647,0.888889,-3.087463,0.325785,0.049383,0.119403,0.089552,0.044776,0.105263,0.0,0.030303,0.166567,0.814259,0.375,0.433962,0.163934,0.0,0.164179,0.0,0.0,0.134328,0.0,0.0
18,del prado,"klett und balmer, universitätsverlag, universi...",0.0,0.0,0.089286,0.089286,0.480159,0.480159,0.505952,0.125,0.111111,0.101695,0.184615,0.101695,0.666667,-3.297681,0.267261,0.04321,0.107143,0.107143,0.017857,0.184615,0.0,0.0,0.148244,0.771866,0.384615,0.511111,0.139535,0.166667,0.160714,0.0,0.0,0.160714,0.0,0.0
178,del prado,"interkantonale lehrmittelzentrale, berner lehr...",0.014493,0.0,0.086957,0.086957,0.47343,0.47343,0.48599,0.108696,0.111111,0.114286,0.205128,0.114286,0.888889,-3.129283,0.321029,0.049383,0.115942,0.086957,0.028986,0.128205,0.0,0.014493,0.161087,0.808795,0.333333,0.433962,0.16129,0.0,0.15942,0.0,0.0,0.130435,0.0,0.0
223,del prado,fr. bufse,0.111111,0.0,0.111111,0.111111,0.407407,0.407407,0.496296,0.555556,0.111111,0.2,0.333333,0.2,0.333333,-2.321928,0.333333,0.018519,0.333333,0.111111,0.111111,0.111111,0.0,0.1,0.208856,0.831125,0.818182,0.851852,0.4,0.0,0.222222,0.0,0.0,1.0,0.0,0.0
158,del prado,saur,0.0,0.0,0.111111,0.111111,0.453704,0.453704,0.525926,0.277778,0.25,0.181818,0.307692,0.181818,0.5,-2.459432,0.333333,0.012346,0.222222,0.111111,0.111111,0.153846,0.0,0.1,0.139237,0.686172,0.75,0.777778,0.4,0.0,0.166667,0.0,0.0,0.444444,0.0,0.0
35,del prado,universal edition,0.058824,0.0,0.235294,0.235294,0.634609,0.634609,0.634609,0.382353,0.111111,0.368421,0.538462,0.368421,0.777778,-1.440573,0.565916,0.049383,0.411765,0.294118,0.117647,0.384615,0.0,0.055556,0.294857,0.837602,0.717949,0.741935,0.26087,0.0,0.294118,0.0,0.0,0.529412,0.0,0.0
25,del prado,emi records,0.090909,0.0,0.181818,0.363636,0.603367,0.603367,0.663973,0.5,0.111111,0.333333,0.5,0.333333,0.555556,-1.584963,0.502519,0.037037,0.454545,0.363636,0.090909,0.1,0.0,0.083333,0.327424,0.887447,0.735294,0.851852,0.352941,0.166667,0.227273,0.0,0.0,0.818182,0.0,0.0
61,del prado,könemann,0.0,0.0,0.111111,0.111111,0.490741,0.490741,0.490741,0.5,0.125,0.133333,0.235294,0.133333,0.25,-2.906891,0.235702,0.012346,0.222222,0.222222,0.111111,0.235294,0.0,0.1,0.139237,0.734229,0.65625,0.851852,0.4,0.0,0.111111,0.0,0.0,0.888889,0.0,0.0
195,del prado,konsortium der schweizer hochschulbibliotheken...,0.015385,0.0,0.076923,0.076923,0.419658,0.419658,0.444957,0.107692,0.111111,0.088235,0.162162,0.088235,0.666667,-3.5025,0.248069,0.04321,0.092308,0.076923,0.030769,0.135135,0.0,0.015873,0.133221,0.761881,0.362319,0.471698,0.1,0.0,0.207692,0.0,0.0,0.138462,0.0,0.0
111,del prado,edition nzn bei tvz,0.0,0.0,0.105263,0.105263,0.38499,0.38499,0.417739,0.289474,0.0,0.166667,0.285714,0.166667,0.444444,-2.584963,0.305888,0.030864,0.210526,0.157895,0.052632,0.214286,0.0,0.05,0.199584,0.837282,0.621622,0.741935,0.24,0.0,0.236842,0.0,0.0,0.473684,0.0,0.0


### scale

In [18]:
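# Build string pairs for attribute scale
df_string_pairs = daf.string_pair_list(df_feature_base, 'scale_x')

# Apply every metric in tedi_algorithms except Gotoh and ArithNCD
for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

# Show a random sample of the scored pairs
df_string_pairs.sample(n=num_of_samples(df_string_pairs))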

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
4,5000050000,5000050000,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.857143,0.585786,1.0,0.961538,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
0,5000050000,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.580744,0.0,0.333333,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5000050000,50000,0.5,0.0,0.5,0.5,0.833333,0.9,0.833333,0.5,1.0,0.5,0.666667,0.5,1.0,-1.0,0.707107,0.05,0.5,0.5,0.5,0.666667,0.0,0.571429,0.482362,1.0,0.961538,0.925926,0.75,0.5,0.9,0.5,0.5,0.5,0.0,0.0
1,5000050000,100000,0.4,0.0,0.5,0.5,0.777778,0.777778,0.777778,0.55,0.666667,0.454545,0.625,0.454545,0.833333,-1.137504,0.645497,0.04,0.5,0.5,0.4,0.625,0.0,0.428571,0.343876,0.873183,0.846154,0.851852,0.75,0.25,0.8,0.0,0.4,0.6,0.0,0.0
3,5000050000,50 000 8 10 8 35 45 55 46 05,0.142857,0.0,0.25,0.25,0.602381,0.602381,0.602381,0.303571,0.2,0.266667,0.421053,0.266667,0.8,-1.906891,0.478091,0.05,0.285714,0.25,0.107143,0.315789,0.0,-0.041667,0.187329,0.537651,0.513514,0.72973,0.3,0.333333,0.321429,0.071429,0.0,0.357143,0.0,0.0
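
The Tanimoto column behaves differently from the other token-based columns: it is 0 for identical strings, negative otherwise, and `-inf` when one string is empty. The values in the table above are consistent with Tanimoto being reported as the base-2 logarithm of the Jaccard value. The sketch below checks this on four rows of the scale table. `jaccard_chars` and `log2_or_neg_inf` are illustrative helpers, not library functions, and treating each string as a bag of characters is an assumption that reproduces the Jaccard column for these rows.

```python
import math
from collections import Counter


def jaccard_chars(s1, s2):
    """Jaccard similarity on character multisets (bags)."""
    bag1, bag2 = Counter(s1), Counter(s2)
    intersection = sum((bag1 & bag2).values())
    union = sum((bag1 | bag2).values())
    return intersection / union if union else 1.0


def log2_or_neg_inf(value):
    """Base-2 logarithm that maps 0 to -inf instead of raising."""
    return math.log2(value) if value > 0 else float('-inf')


# Compare the reference string with four partners from the table
for other in ['5000050000', '50000', '100000', '']:
    jaccard = jaccard_chars('5000050000', other)
    print(repr(other), round(jaccard, 6), round(log2_or_neg_inf(jaccard), 6))
```

Because this range is unbounded below, Tanimoto values are not directly comparable with columns such as Jaccard or Levenshtein, which lie between 0 and 1.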


### ttlfull

Attribute $\texttt{ttlfull}$ comes in two different representations (`245` and `246`), see chapter [Data Analysis](./1_DataAnalysis.ipynb). Both representations are investigated below.

In [25]:
ttlfull_representations = ['245', '246']

for tf in ttlfull_representations:
    # Build string pairs for this representation of ttlfull
    df_string_pairs = daf.string_pair_list(df_feature_base, f'ttlfull_{tf}_x')

    print(f'\nttlfull_{tf}\n***********')
    for algorithm in tedi_algorithms:
        # Editex is skipped as well: it fails on some very long strings
        if algorithm not in ['Gotoh', 'ArithNCD', 'Editex']:
            daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

    # Show a random sample of the scored pairs
    display(df_string_pairs.sample(n=num_of_samples(df_string_pairs)))


ttlfull_245
***********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Prefix,Postfix,Length,Identity,Matrix
180,"neue ausgabe sämtlicher werke, die zauberflöte...",micromégas and other texts (1738-1742),0.005,0.0,0.105,0.105,0.447632,0.447632,0.463289,0.1475,0.105263,0.127962,0.226891,0.127962,0.710526,-2.966212,0.309711,0.001838,0.135,0.105,0.02,0.109244,0.0,0.010471,0.205884,0.960791,0.148571,0.221239,0.075472,0.0,0.0,0.0,0.19,0.0,0.0
204,"neue ausgabe sämtlicher werke, die zauberflöte...","epistola ad damasum de morte hieronymi, episto...",0.05,0.0,0.2,0.255,0.646398,0.646398,0.67405,0.51,0.109091,0.540084,0.70137,0.540084,0.775758,-0.888743,0.704617,0.002125,0.64,0.34,0.015,0.076712,0.0,0.015707,0.401098,0.9302,0.125714,0.309735,0.144654,0.333333,0.0,0.0,0.825,0.0,0.0
86,"neue ausgabe sämtlicher werke, die zauberflöte...","die zauberflöte, (il flauto magico) : deutsche...",0.03,0.0,0.22,0.22,0.607971,0.607971,0.607971,0.27,0.123077,0.311881,0.475472,0.311881,0.969231,-1.680932,0.552547,0.002287,0.315,0.24,0.12,0.324528,0.0,0.062827,0.351688,0.970048,0.234286,0.362832,0.27673,0.166667,0.0,0.0,0.325,0.0,0.0
371,"neue ausgabe sämtlicher werke, die zauberflöte...","domodossola, valle d'ossola - val grande - ver...",0.02,0.0,0.16,0.17,0.523299,0.523299,0.538064,0.2075,0.117647,0.236453,0.38247,0.236453,0.941176,-2.080373,0.475271,0.001825,0.24,0.16,0.015,0.191235,0.0,0.036649,0.238146,0.869826,0.091429,0.256637,0.113208,0.0,0.0,0.0,0.255,0.0,0.0
369,"neue ausgabe sämtlicher werke, die zauberflöte...","siècle de louis xiv (vi), chapitres 31-39",0.005,0.0,0.115,0.12,0.486495,0.486495,0.495312,0.16,0.097561,0.158654,0.273859,0.158654,0.804878,-2.656046,0.364424,0.001812,0.165,0.12,0.02,0.107884,0.0,0.005236,0.21895,0.93741,0.131429,0.221239,0.08805,0.0,0.0,0.0,0.205,0.0,0.0
426,"neue ausgabe sämtlicher werke, die zauberflöte...",katalog der graphischen porträts in der herzog...,0.065,0.0,0.205,0.24,0.663016,0.663016,0.671807,0.51,0.107784,0.561702,0.719346,0.561702,0.790419,-0.832123,0.722272,0.002287,0.66,0.32,0.03,0.294278,0.0,0.026178,0.465027,0.972699,0.211429,0.345133,0.194969,0.166667,0.0,0.0,0.835,0.0,0.0
142,"neue ausgabe sämtlicher werke, die zauberflöte...","the complete works of voltaire, siècle de loui...",0.015,0.0,0.15,0.16,0.583068,0.583068,0.587577,0.2175,0.105263,0.259804,0.412451,0.259804,0.929825,-1.944505,0.49639,0.001988,0.265,0.16,0.02,0.171206,0.0,0.015707,0.292273,0.926636,0.148571,0.256637,0.150943,0.166667,0.0,0.0,0.285,0.0,0.0
424,"neue ausgabe sämtlicher werke, die zauberflöte...",katalog der graphischen porträts in der herzog...,0.065,0.0,0.185,0.225,0.65446,0.65446,0.66187,0.53,0.094972,0.559671,0.717678,0.559671,0.759777,-0.83735,0.718782,0.0023,0.68,0.32,0.03,0.290237,0.0,0.036649,0.47768,0.9572,0.217143,0.345133,0.213836,0.0,0.0,0.0,0.895,0.0,0.0
332,"neue ausgabe sämtlicher werke, die zauberflöte...","bonne chance!, cours de langue française, 1, d...",0.045,0.0,0.195,0.205,0.588355,0.588355,0.596879,0.305,0.13253,0.347619,0.515901,0.347619,0.879518,-1.524421,0.56659,0.002012,0.365,0.215,0.015,0.127208,0.0,0.0,0.327464,0.936529,0.165714,0.292035,0.176101,0.0,0.0,0.0,0.415,0.0,0.0
82,"neue ausgabe sämtlicher werke, die zauberflöte...","arbeit an sich selbst, wie man zum schauen in ...",0.025,0.0,0.21,0.225,0.630952,0.630952,0.630952,0.28,0.157143,0.35,0.518519,0.35,1.0,-1.514573,0.591608,0.002188,0.35,0.225,0.02,0.266667,0.0,0.041885,0.331085,0.924245,0.154286,0.274336,0.157233,0.0,0.0,0.0,0.35,0.0,0.0



ttlfull_246
***********
Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Prefix,Postfix,Length,Identity,Matrix
34,"die zauberflöte, ausgabe für klavier allein",medizinische informatik - kommunikation von ge...,0.004577,0.0,0.080092,0.080092,0.548094,0.548094,0.550649,0.089245,0.069767,0.09589,0.175,0.09589,0.976744,-3.38247,0.30639,0.011628,0.09611,0.080092,0.011442,0.1125,0.0,0.0,0.184051,0.890092,0.126712,0.177914,0.092664,0.166667,0.0,0.0,0.098398,0.0,0.0
29,"die zauberflöte, ausgabe für klavier allein",medizinische informatik - kommunikation von ge...,0.005376,0.0,0.091398,0.094086,0.52092,0.52092,0.526109,0.103495,0.023256,0.109626,0.19759,0.109626,0.953488,-3.189342,0.324174,0.011087,0.110215,0.091398,0.013441,0.125301,0.0,-0.003096,0.187506,0.872573,0.167331,0.205674,0.116071,0.166667,0.0,0.0,0.115591,0.0,0.0
5,"die zauberflöte, ausgabe für klavier allein","the magic flute : [dvd-video], la flûte enchantée",0.163265,0.0,0.244898,0.265306,0.609911,0.609911,0.623011,0.561224,0.139535,0.4375,0.608696,0.4375,0.651163,-1.192645,0.609994,0.008924,0.571429,0.346939,0.040816,0.304348,0.0,0.04,0.362884,0.924333,0.454545,0.489362,0.254545,0.0,0.0,0.0,0.877551,0.0,0.0
59,"die zauberflöte, ausgabe für klavier allein",informatique de santé - communication entre di...,0.003361,0.0,0.060504,0.062185,0.543469,0.543469,0.543469,0.066387,0.023256,0.072269,0.134796,0.072269,1.0,-3.790481,0.268829,0.011628,0.072269,0.060504,0.008403,0.084639,0.0,0.0,0.160137,0.878735,0.117302,0.153439,0.076433,0.0,0.0,0.0,0.072269,0.0,0.0
19,"die zauberflöte, ausgabe für klavier allein",medizinische informatik - kommunikation von ge...,0.006154,0.0,0.098462,0.101538,0.521813,0.521813,0.527079,0.115385,0.093023,0.128834,0.228261,0.128834,0.976744,-2.956411,0.355282,0.011628,0.129231,0.098462,0.015385,0.11413,0.0,-0.006944,0.200954,0.879785,0.153191,0.221374,0.122549,0.166667,0.0,0.0,0.132308,0.0,0.0
17,"die zauberflöte, ausgabe für klavier allein","[domodossola, arona]",0.0,0.0,0.139535,0.139535,0.44199,0.44199,0.493269,0.302326,0.1,0.188679,0.31746,0.188679,0.5,-2.405992,0.340997,0.005679,0.232558,0.162791,0.069767,0.190476,0.0,-0.023256,0.215653,0.815308,0.390625,0.466667,0.137255,0.0,0.0,0.0,0.465116,0.0,0.0
18,"die zauberflöte, ausgabe für klavier allein",medizinische informatik - kommunikation von ge...,0.006849,0.0,0.109589,0.116438,0.543495,0.543495,0.548831,0.128425,0.069767,0.139456,0.244776,0.139456,0.953488,-2.84212,0.365896,0.011087,0.140411,0.109589,0.017123,0.167164,0.0,0.0,0.205321,0.874634,0.156398,0.243697,0.131148,0.166667,0.0,0.0,0.14726,0.0,0.0
2,"die zauberflöte, ausgabe für klavier allein","education et recherche, educazione e ricerca",0.068182,0.0,0.136364,0.181818,0.603648,0.603648,0.635836,0.522727,0.093023,0.45,0.62069,0.45,0.627907,-1.152003,0.620731,0.007842,0.613636,0.409091,0.068182,0.390805,0.0,0.023256,0.368225,0.895676,0.34375,0.555556,0.294118,0.0,0.0,0.0,0.977273,0.0,0.0
28,"die zauberflöte, ausgabe für klavier allein",medizinische informatik - kommunikation von ge...,0.004608,0.0,0.078341,0.078341,0.528181,0.528181,0.530737,0.08871,0.046512,0.099078,0.180294,0.099078,1.0,-3.335286,0.314767,0.011628,0.099078,0.078341,0.011521,0.104822,0.0,-0.005249,0.183183,0.884569,0.136519,0.187879,0.09434,0.166667,0.0,0.0,0.099078,0.0,0.0
60,"die zauberflöte, ausgabe für klavier allein",informatique de la santé - communication entre...,0.01049,0.0,0.064685,0.064685,0.52466,0.52466,0.527161,0.06993,0.093023,0.075175,0.139837,0.075175,1.0,-3.733607,0.27418,0.011628,0.075175,0.064685,0.008741,0.087805,0.0,0.006438,0.161539,0.877774,0.111437,0.15508,0.077922,0.0,0.0,0.0,0.075175,0.0,0.0
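
One column in the two tables above needs care when reading: Tanimoto takes values well below zero, so it cannot be compared directly with the metrics that are normalized to the range 0 to 1. The sampled values are consistent with Tanimoto being the base-2 logarithm of the Jaccard value in the same row, while Tversky coincides with Jaccard. The following sketch checks this against a few rows copied from the tables. It does not call textdistance, and the helper name `tanimoto_from_jaccard` is hypothetical.

```python
import math


def tanimoto_from_jaccard(jaccard):
    """Return log2 of a Jaccard similarity (hypothetical helper, not a textdistance call)."""
    if jaccard == 0:
        return float('-inf')
    return math.log2(jaccard)


# (Jaccard, Tanimoto) pairs copied from rows of the ttlfull_245 and ttlfull_246 tables
sampled_rows = [
    (0.127962, -2.966212),
    (0.540084, -0.888743),
    (0.35, -1.514573),
    (0.072269, -3.790481),
]

for jaccard, tanimoto in sampled_rows:
    reconstructed = tanimoto_from_jaccard(jaccard)
    # The table values are rounded to six decimals, hence the tolerance
    assert abs(reconstructed - tanimoto) < 1e-4
    print(f'{jaccard:.6f} -> {reconstructed:.6f} (table: {tanimoto:.6f})')
```

Because the logarithm is monotone, such a Tanimoto value ranks string pairs in the same order as Jaccard and would only need rescaling to be used next to the other metrics.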


### volumes

In [26]:
df_string_pairs = daf.string_pair_list(df_feature_base, 'volumes_x')

for algorithm in tedi_algorithms:
    if algorithm not in ['Gotoh', 'ArithNCD']:
        daf.apply_similarities(df_string_pairs, tedi_algorithms[algorithm], algorithm)

df_string_pairs.sample(n=num_of_samples(df_string_pairs))

Hamming
MLIPNS
Levenshtein
DamerauLevenshtein
Jaro
JaroWinkler
StrCmp95
NeedlemanWunsch
SmithWaterman
Jaccard
Sorensen
Tversky
Overlap
Tanimoto
Cosine
MongeElkan
Bag
LCSSeq
LCSStr
RatcliffObershelp
RLENCD
BWTRLENCD
SqrtNCD
EntropyNCD
BZ2NCD
LZMANCD
ZLIBNCD
MRA
Editex
Prefix
Postfix
Length
Identity
Matrix


Unnamed: 0,str1,str2,Hamming,MLIPNS,Levenshtein,DamerauLevenshtein,Jaro,JaroWinkler,StrCmp95,NeedlemanWunsch,SmithWaterman,Jaccard,Sorensen,Tversky,Overlap,Tanimoto,Cosine,MongeElkan,Bag,LCSSeq,LCSStr,RatcliffObershelp,RLENCD,BWTRLENCD,SqrtNCD,EntropyNCD,BZ2NCD,LZMANCD,ZLIBNCD,MRA,Editex,Prefix,Postfix,Length,Identity,Matrix
206,413,3 33,0.25,0.0,0.25,0.5,0.527778,0.527778,0.527778,0.5,0.333333,0.166667,0.285714,0.166667,0.333333,-2.584963,0.288675,0.055556,0.25,0.25,0.25,0.285714,0.0,0.0,0.244017,0.66993,0.791667,0.913043,0.6,0.333333,0.5,0.0,0.25,0.75,0.0,0.0
186,413,169,0.0,1.0,0.0,0.333333,0.0,0.0,0.0,0.5,0.0,0.2,0.333333,0.2,0.333333,-2.321928,0.333333,0.055556,0.333333,0.333333,0.333333,0.333333,0.0,0.25,0.195262,0.742098,0.913043,0.913043,0.666667,0.0,0.0,0.0,0.0,1.0,0.0,0.0
139,413,16,0.0,1.0,0.333333,0.333333,0.0,0.0,0.0,0.5,0.5,0.25,0.4,0.25,0.5,-2.0,0.408248,0.055556,0.333333,0.333333,0.333333,0.4,0.0,0.25,0.195262,0.64335,0.869565,0.913043,0.666667,0.0,0.333333,0.0,0.0,0.666667,0.0,0.0
56,413,365,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.2,0.333333,0.2,0.333333,-2.321928,0.333333,0.055556,0.333333,0.333333,0.333333,0.333333,0.0,0.25,0.195262,0.742098,0.913043,0.913043,0.666667,0.0,0.0,0.0,0.0,1.0,0.0,0.0
95,413,1 108,0.0,0.0,0.2,0.2,0.511111,0.511111,0.511111,0.4,0.0,0.142857,0.25,0.142857,0.333333,-2.807355,0.258199,0.055556,0.2,0.2,0.2,0.25,0.0,0.166667,0.154538,0.719132,0.814815,0.92,0.545455,0.0,0.2,0.0,0.0,0.6,0.0,0.0
16,413,3,0.0,1.0,0.333333,0.333333,0.0,0.0,0.0,0.333333,1.0,0.333333,0.5,0.333333,1.0,-1.584963,0.57735,0.055556,0.333333,0.333333,0.333333,0.5,0.0,0.25,0.195262,0.419721,0.956522,1.0,0.666667,0.0,0.333333,0.0,0.333333,0.333333,0.0,0.0
34,413,418,0.666667,1.0,0.666667,0.666667,0.777778,0.777778,0.777778,0.833333,0.666667,0.5,0.666667,0.5,0.666667,-1.0,0.666667,0.111111,0.666667,0.666667,0.666667,0.666667,0.0,0.25,0.390524,0.871049,0.913043,0.913043,0.666667,0.666667,0.666667,0.666667,0.0,1.0,0.0,0.0
62,413,323,0.333333,1.0,0.333333,0.333333,0.555556,0.555556,0.555556,0.666667,0.333333,0.2,0.333333,0.2,0.333333,-2.321928,0.333333,0.055556,0.333333,0.333333,0.333333,0.333333,0.0,0.25,0.227388,0.661819,0.869565,0.913043,0.666667,0.333333,0.333333,0.0,0.333333,1.0,0.0,0.0
188,413,213 1,0.4,0.0,0.4,0.4,0.688889,0.688889,0.688889,0.5,0.0,0.333333,0.5,0.333333,0.666667,-1.584963,0.516398,0.111111,0.4,0.4,0.4,0.5,0.0,0.333333,0.287242,0.804692,0.814815,0.92,0.545455,0.4,0.4,0.0,0.0,0.6,0.0,0.0
213,413,1 1 12,0.0,0.0,0.166667,0.166667,0.5,0.5,0.5,0.333333,0.333333,0.125,0.222222,0.125,0.333333,-3.0,0.235702,0.055556,0.166667,0.166667,0.166667,0.222222,0.0,0.142857,0.176557,0.768018,0.807692,0.84,0.666667,0.0,0.166667,0.0,0.0,0.5,0.0,0.0
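
For strings as short as the $\texttt{volumes}$ values, the token-based columns of the table above can be reproduced by hand, which helps when interpreting them. The sketch below treats each string as a multiset of characters and derives Jaccard, Sorensen, Overlap and Cosine from the multiset intersection. This reading is an assumption inferred from the sampled values, not taken from the library documentation, and the function name `char_multiset_similarities` is hypothetical.

```python
from collections import Counter
import math


def char_multiset_similarities(str1, str2):
    """Token-based similarities on character multisets (assumes non-empty strings)."""
    counts1, counts2 = Counter(str1), Counter(str2)
    # Multiset intersection keeps the minimum count per character,
    # multiset union keeps the maximum count per character.
    intersection = sum((counts1 & counts2).values())
    union = sum((counts1 | counts2).values())
    len1, len2 = len(str1), len(str2)
    return {
        'Jaccard': intersection / union,
        'Sorensen': 2 * intersection / (len1 + len2),
        'Overlap': intersection / min(len1, len2),
        'Cosine': intersection / math.sqrt(len1 * len2),
    }


# String pairs taken from rows 34, 16 and 206 of the table above
for str2 in ['418', '3', '3 33']:
    print(repr(str2), char_multiset_similarities('413', str2))
```

For the pair '413' and '3', Overlap reaches 1.0 because every character of the shorter string occurs in the longer one, whereas Jaccard stays at one third. This illustrates how differently the token-based metrics react to values of unequal length.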
