# Feature Matrix Generation

This chapter introduces similarity metrics for string comparison. For each attribute of the DataFrame built in the previous chapters, the metric used to calculate its similarity is decided. The result of this chapter is the feature matrix.

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [Object Distance and Similarity](#Object-Distance-and-Similarity)
- [Library TextDistance](#Library-TextDistance)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [century](#century)
    - [corporate](#corporate)
    - [edition](#edition)
    - [format](#format)
    - [person_245c](#person_245c)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Feature Base](#Feature-Base)
- [Feature Matrix and Target Vector Handover](#Feature-Matrix-and-Target-Vector-Handover)

## Data Takeover

Swissbib's raw goldstandard data has been processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, that data is read in for further processing into the feature matrix and target vector used by the subsequent machine learning model chapters.

In [1]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,century_x,century_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,person_245c_x,person_245c_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.
2,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,emma,"emma, roman",,,600 s.,600 s.
4,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,emma,emma,,,600 s.,600 s.


In [2]:
print('Number of rows labelled as duplicates', len(df_feature_base[
    df_feature_base.duplicates==1]))
print('Number of rows labelled as uniques', len(df_feature_base[
    df_feature_base.duplicates==0]))
print('Total number of rows in DataFrame', df_feature_base.shape[0],
      'number of columns', df_feature_base.shape[1])

Number of rows labelled as duplicates 1473
Number of rows labelled as uniques 259260
Total number of rows in DataFrame 260733 number of columns 25


In [3]:
print('Share of uniques (0) and duplicates (1) in units of [%]')
print(100*df_feature_base.duplicates.value_counts(normalize=True))

Share of uniques (0) and duplicates (1) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


Duplicate pairs make up less than 0.6% of the full training data. This strong class imbalance will affect the training of the model: during training, the model sees far more pairs of unique records ($\texttt{duplicates}=0$) than pairs of duplicates ($\texttt{duplicates}=1$). Undersampling the unique pairs may therefore be necessary; this will be decided during model training.
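As an illustration of what undersampling could look like, here is a minimal sketch using only the standard library. The function name `undersample_uniques`, its `ratio` parameter, and the toy label vector are assumptions for illustration; they are not part of this project's code, and the strategy actually used is left to the model chapters.

```python
import random

def undersample_uniques(labels, ratio=1.0, seed=0):
    """Return sorted row indices that keep all duplicates (label 1) and a
    random subset of uniques (label 0) of size ratio * number of duplicates."""
    rng = random.Random(seed)
    duplicates = [i for i, label in enumerate(labels) if label == 1]
    uniques = [i for i, label in enumerate(labels) if label == 0]
    n_keep = min(len(uniques), int(ratio * len(duplicates)))
    return sorted(duplicates + rng.sample(uniques, n_keep))

# Toy label vector (hypothetical): 2 duplicates among 10 pairs.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
kept = undersample_uniques(labels, ratio=1.0)

# All duplicates are kept, together with an equal number of uniques.
print(len(kept), sum(labels[i] for i in kept))
```

With `ratio=1.0` the reduced set is balanced; larger ratios keep more unique pairs.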

## Object Distance and Similarity

A mathematical idea of distance and similarity is needed for understanding object pair comparison. This section starts with a motivation for calculating similarities and afterwards gives a very basic definition of the two central terms. The text of this section is a summary of [[Chri2012](./A_References.ipynb#chri2012)].

The attributes used for pair comparison may contain values of poor quality. The quality depends on how the data was entered at its source. Manual data entry may suffer from mistyping, while automatically scanned data may suffer from deficiencies of the scanned base material or of the recognition algorithm used in optical character recognition (OCR). The basic step of a deduplication process is to estimate how likely it is that the two strings of a pair belong to duplicates. This is done by calculating a similarity value between the two strings rather than using an exact comparison function. Based on this similarity value for an attribute pair, it can be decided whether the pair is a duplicate.

The term similarity is strongly coupled to the term of distance of two values of an attribute. Mathematically, a distance can be explained with the help of a distance function. A _distance function_ or _distance metric_ $dist(o_i, o_j)$ between two points or data objects $o_i$ and $o_j$ must fulfill four requirements.

1. $dist(o_i, o_i)=0$, the distance from an object to itself is zero.
2. $dist(o_i, o_j)\ge 0$, the distance between two objects is a non-negative number.
3. $dist(o_i, o_j)=dist(o_j, o_i)$, the distance between two objects is symmetric.
4. $dist(o_i, o_j)\le dist(o_i, o_k)+dist(o_k, o_j)$, the triangular inequality must hold. It states that the direct distance between two objects is never larger than the combined distance when going through a third object.

A distance value expresses the dissimilarity $d$ of two objects [[HanK2012](./A_References.ipynb#hank2012)] and can therefore be converted into a similarity value $s$, for example as $s = \frac{1}{d}$, assuming $d\gt 0$. Alternatively, if the distance value is normalised to $0\le d\le 1$, the similarity value can be calculated as $s = 1-d$. A _similarity function_ $sim(a_i, a_j)$ between two attribute values, which can be strings, numbers, dates, geographic locations, text, XML documents, etc., fulfills the following general requirements.

1. $sim(a_i, a_i)=1$, the result of comparing a value with itself is an exact similarity.
2. $sim(a_i, a_j)=0$, the similarity of values that are completely different from each other is 0. What counts as 'completely different' depends on the type of data compared.
3. $0\lt sim(a_i, a_j)\lt 1$, an approximate similarity between exact similarity and total dissimilarity is calculated if two attribute values are somewhat similar to each other. What counts as 'somewhat similar' depends on the type of data compared.

The dissimilarity between two objects $o_i$ and $o_j$ can be computed based on the ratio of mismatches,
$$
d(o_i, o_j) = \frac{p-m}{p},
$$
where $m$ is the number of matching attributes and $p$ is the total number of attributes describing the objects [[HanK2012](./A_References.ipynb#hank2012)]. Thus the similarity between two objects can be computed as
$$
sim(o_i, o_j) = 1 - d(o_i, o_j) = \frac{m}{p}.
$$
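As a small worked example of the two formulas, the sketch below counts matching attributes for two records with $p=4$ attributes. The helper name `mismatch_ratio` and the reduced attribute set are assumptions for illustration; the values are taken from the second row of the table shown in section Data Takeover.

```python
def mismatch_ratio(record_a, record_b):
    """Return (d, sim) with d = (p - m) / p and sim = m / p.

    Assumes both records are described by the same attribute names.
    """
    p = len(record_a)
    m = sum(1 for key in record_a if record_a[key] == record_b[key])
    d = (p - m) / p
    return d, 1 - d

rec_a = {"century": "2009", "format_prefix": "bk",
         "ttlfull_245": "emma, roman", "volumes": "600 s."}
rec_b = {"century": "2009", "format_prefix": "bk",
         "ttlfull_245": "emma", "volumes": "600 s."}

# Three of four attributes match exactly: d = 1/4, sim = 3/4.
d, sim = mismatch_ratio(rec_a, rec_b)
print(d, sim)  # 0.25 0.75
```

This exact-match view treats `"emma, roman"` and `"emma"` as completely different, which is the reason attribute-level similarity metrics are used in the remainder of this chapter.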

For data deduplication, a comparison function needs to be tailored to the type of underlying data. Although there is a correspondence between a similarity function and the mathematical concept of a distance function, not all known and implemented similarity functions used for string pair comparison fulfill the requirements of a distance function. Some similarity functions are not symmetric, others do not fulfill the triangular inequality. The choice of the best similarity function for a string pair will therefore be based on how well the function serves the intended purpose. In the case of this capstone project, this purpose is its contribution to predicting whether a pair of records is a duplicate or not.

## Library TextDistance

An internet search on string distance calculation with Python revealed the libraries [[StSi](./A_References.ipynb#stsi)], [[TeDi](./A_References.ipynb#tedi)] and separate code snippets for individual algorithms. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the textdistance library [[TeDi](./A_References.ipynb#tedi)] was chosen for this capstone project. The decision is based on the GitHub statistics of stars and the date of the latest pull requests, which indicate the popularity and maintenance activity of the library. A look at the API shows that the library is a complete implementation (compared to the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and easy to use.

In [4]:
# Install textdistance Python library - if not done, yet.
! pip install textdistance



For using the library, see documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, function $\texttt{.normalized}\_\texttt{similarity()}$ of an instantiated textdistance object will be used.

In [5]:
import textdistance as tedi

With the code line above, the library is imported for use in this chapter. In appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), the effects of the library's similarity metrics are compared for a better understanding of their specific behaviour. This comparison is the basis for deciding, for each attribute pair, which of the available similarity metrics suits it best.

## Similarity Metrics on Attribute Level

This section documents, for each attribute of the raw data, the choice of similarity metric, based on appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb), and implements it. The implementation is applied to a pair of attributes from two different records and results in a new attribute of the final feature matrix. A general function $\texttt{build_delta_feature}$ is provided by the code file [data_preparation_funcs.py](./data_preparation_funcs.py) for transforming two attributes into a feature attribute holding their similarity value.
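The implementation of $\texttt{build_delta_feature}$ is not reproduced in this chapter. The following pandas-free sketch only illustrates the idea, assuming that the function reads the columns `<name>_x` and `<name>_y` and writes the result to `<name>_delta`. The helper `add_delta_feature` and the stand-in class `ExactMatch` are hypothetical and do not claim to match the project code.

```python
class ExactMatch:
    """Hypothetical stand-in for a metric object that offers
    a normalized_similarity(a, b) method returning a value in [0, 1]."""

    def normalized_similarity(self, a, b):
        return 1.0 if a == b else 0.0

def add_delta_feature(rows, name, algorithm):
    """Add '<name>_delta' to every row, computed from '<name>_x' and '<name>_y'."""
    for row in rows:
        row[name + "_delta"] = algorithm.normalized_similarity(
            row[name + "_x"], row[name + "_y"])
    return rows

# Two toy record pairs, represented as dictionaries instead of DataFrame rows.
rows = [
    {"format_prefix_x": "bk", "format_prefix_y": "bk"},
    {"format_prefix_x": "mu", "format_prefix_y": "bk"},
]
rows = add_delta_feature(rows, "format_prefix", ExactMatch())
print([row["format_prefix_delta"] for row in rows])  # [1.0, 0.0]
```

Passing the metric as an object keeps the helper independent of the chosen algorithm, so the same call pattern can be reused for every attribute.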

In [6]:
import data_preparation_funcs as dpf

### century

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{century}$ holds a year stored as a string of length 4. The letter 'u' is used as a placeholder for an unknown digit. For this reason, the attribute is kept as a string and is not transformed into an integer. The feature attribute of the record pair to be compared is calculated with a modified Hamming algorithm, see appendix [Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb). The resulting similarity is stored in a new attribute $\texttt{century}\_\texttt{delta}$, which is used for the model calculation.
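The modification can be sketched on single string pairs without pandas or the library. The function names below are chosen for this illustration, and the plain Hamming similarity is an assumption: it is taken to be the share of positions with identical characters, which is consistent with the values shown further down in this section.

```python
def hamming_similarity(a, b):
    """Share of positions with identical characters (equal-length strings)."""
    return sum(ca == cb for ca, cb in zip(a, b)) / len(a)

def century_similarity(century_x, century_y, bonus=0.125):
    """Hamming similarity plus a bonus of 0.125 per placeholder letter."""
    # Rename the placeholder in one string, so that placeholders never
    # count as a match, neither with a digit nor with each other.
    x = century_x.replace("u", "a")
    base = hamming_similarity(x, century_y)
    # Count the larger number of placeholders of the two strings.
    n_unknown = max(x.count("a"), century_y.count("u"))
    return base + bonus * n_unknown

print(century_similarity("2000", "200u"))  # 0.875
print(century_similarity("1998", "18uu"))  # 0.5
print(century_similarity("uuuu", "1990"))  # 0.5
print(century_similarity("2009", "2009"))  # 1.0
```

A placeholder therefore contributes half of the 0.25 that a matching digit would contribute, expressing that an unknown digit may or may not match.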

In [7]:
# Replace letter 'u' with letter 'a' in one of the two strings only.
#  As a result, a placeholder letter never matches a digit or the placeholder
#  of the other string and contributes 0 to the Hamming similarity.
df_feature_base['century_x'] = df_feature_base.century_x.str.replace('u', 'a')

# Compute Hamming similarity for century string pair.
century_algorithm = tedi.Hamming()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'century', century_algorithm)

# Add 0.125 to the Hamming similarity for every placeholder letter,
#  counting the larger number of placeholders of the two strings of a pair.
df_feature_base['century_delta'] = df_feature_base[['century_x', 'century_y', 'century_delta']].apply(
    lambda x : x['century_delta'] + 
    0.125*max(x['century_x'].count('a'), x['century_y'].count('u')), axis=1
)

In [8]:
df_feature_base[['century_x', 'century_y', 'century_delta']].sample(n=10)

Unnamed: 0,century_x,century_y,century_delta
25654,1998,1996,0.75
18153,1970,2003,0.0
133116,2001,2008,0.75
158858,2013,1956,0.0
181582,2015,1981,0.0
114965,1979,2017,0.0
226451,1989,2007,0.0
133583,1932,1999,0.5
22209,2005,1981,0.0
120013,1987,1887,0.75


All pairs of equal strings result in a similarity value of 1.

In [9]:
df_feature_base[['century_x', 'century_y', 'century_delta']][
    df_feature_base.century_x == df_feature_base.century_y
].sort_values('century_delta', ascending=False).head()

Unnamed: 0,century_x,century_y,century_delta
0,2009,2009,1.0
158402,2013,2013,1.0
158980,2013,2013,1.0
158939,2013,2013,1.0
158885,2013,2013,1.0


Nine different similarity values can be found in the attribute deltas. Some sample records are shown below.

In [10]:
import numpy as np

century_deltas = np.sort(df_feature_base.century_delta.unique())
century_deltas

array([0.   , 0.125, 0.25 , 0.375, 0.5  , 0.625, 0.75 , 0.875, 1.   ])

In [11]:
sample_size = 5

for i in century_deltas :
    display(df_feature_base[['century_delta', 'century_x', 'century_y']][
        df_feature_base.century_delta == i].sample(n=sample_size)
         )

Unnamed: 0,century_delta,century_x,century_y
100264,0.0,1836,2004
42547,0.0,2009,1990
90189,0.0,1998,2011
100948,0.0,2000,1992
113690,0.0,2000,1965


Unnamed: 0,century_delta,century_x,century_y
69214,0.125,1991,200u
253634,0.125,2017,192u
251104,0.125,183a,2005
106843,0.125,2001,193u
242054,0.125,1763,200u


Unnamed: 0,century_delta,century_x,century_y
241448,0.25,1959,2009
231106,0.25,1987,1550
29931,0.25,1471,1966
136770,0.25,2005,1995
95441,0.25,2005,1985


Unnamed: 0,century_delta,century_x,century_y
89100,0.375,170a,1994
251479,0.375,183a,1991
242310,0.375,1763,192u
248097,0.375,1932,181u
121149,0.375,1984,181u


Unnamed: 0,century_delta,century_x,century_y
123206,0.5,1982,1907
132222,0.5,2000,2016
50827,0.5,1998,18uu
255889,0.5,1970,1998
155479,0.5,aaaa,1990


Unnamed: 0,century_delta,century_x,century_y
109823,0.625,1981,193u
209274,0.625,2016,200u
124307,0.625,1981,188u
114766,0.625,1979,192u
37286,0.625,1979,192u


Unnamed: 0,century_delta,century_x,century_y
22328,0.75,2005,2004
39108,0.75,1990,1996
189903,0.75,2013,2003
196414,0.75,2007,2009
153133,0.75,1900,1990


Unnamed: 0,century_delta,century_x,century_y
7230,0.875,2000,200u
45970,0.875,2008,200u
138350,0.875,2006,200u
178282,0.875,2007,200u
17362,0.875,2005,200u


Unnamed: 0,century_delta,century_x,century_y
991,1.0,1983,1983
370,1.0,2005,2005
249205,1.0,2007,2007
1348,1.0,2016,2016
173827,1.0,2015,2015


### corporate

In [12]:
corporate_algorithm = tedi.Jaro()

In [13]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_110', corporate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_710', corporate_algorithm)

In [14]:
print(df_feature_base[['corporate_110_delta', 'corporate_110_x', 'corporate_110_y']][
    (df_feature_base.corporate_110_delta > 0.0) & (df_feature_base.corporate_110_delta <= 0.1)].head()
     )

Empty DataFrame
Columns: [corporate_110_delta, corporate_110_x, corporate_110_y]
Index: []


In [15]:
print(df_feature_base[['corporate_710_delta', 'corporate_710_x', 'corporate_710_y']][
    (df_feature_base.corporate_710_delta >= 0.9) & (df_feature_base.corporate_710_delta <= 1.0)].head()
     )

   corporate_710_delta corporate_710_x corporate_710_y
0                  1.0                                
1                  1.0                                
2                  1.0                                
3                  1.0                                
4                  1.0                                


### edition

The edition statement is a string value which may have several words. A Jaccard similarity is tried for this attribute.
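To make the behaviour of the metric concrete, the sketch below computes a character-based Jaccard similarity on multisets with the standard library. It is an illustration written for this chapter, not the library code; the assumption that characters are counted as a multiset is consistent with the values printed further down in this section.

```python
from collections import Counter

def jaccard_similarity(a, b):
    """Multiset Jaccard similarity on characters: |A ∩ B| / |A ∪ B|."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())
    union = sum((count_a | count_b).values())
    # Two empty strings are treated as identical.
    return intersection / union if union else 1.0

# 13 shared characters out of 14 in the union.
print(round(jaccard_similarity("3., erw. Aufl", "3., erw. Aufl."), 6))  # 0.928571
# Only the blank is shared: 1 out of 14.
print(round(jaccard_similarity("1. Aufl", "6th impr"), 6))              # 0.071429
```

Because only character counts enter the calculation, the order of the words in an edition statement has no influence on the result.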

In [16]:
edition_algorithm = tedi.Jaccard()

In [17]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'edition', edition_algorithm)

In [18]:
df_feature_base.edition_delta.unique()[:30], len(df_feature_base.edition_delta.unique())

(array([1.        , 0.        , 0.95348837, 0.46511628, 0.62222222,
        0.66666667, 0.5       , 0.65909091, 0.62790698, 0.67741935,
        0.68965517, 0.84375   , 0.7       , 0.875     , 0.92857143,
        0.56521739, 0.54166667, 0.72727273, 0.96153846, 0.57142857,
        0.54761905, 0.74193548, 0.09375   , 0.17647059, 0.24      ,
        0.10526316, 0.75      , 0.06666667, 0.13636364, 0.28571429]), 802)

The comparison results in a large number of distinct similarity values (802) for the goldstandard data set. Below, some examples are shown.

In [19]:
print(df_feature_base[['edition_delta', 'edition_x', 'edition_y']][
    (df_feature_base.edition_delta >= 0.9) & (df_feature_base.edition_delta < 1.0)].head()
     )

      edition_delta                                   edition_x  \
657        0.953488  5. Aufl., 43.-46. Tsd., neu durchges. Aufl   
661        0.953488  Neu durchges. Aufl., 5. Aufl., 43.-46. Tsd   
995        0.928571                               3., erw. Aufl   
999        0.928571                               3., erw. Aufl   
1001       0.928571                              3., erw. Aufl.   

                                       edition_y  
657   Neu durchges. Aufl., 5. Aufl., 43.-46. Tsd  
661   5. Aufl., 43.-46. Tsd., neu durchges. Aufl  
995                               3., erw. Aufl.  
999                               3., erw. Aufl.  
1001                               3., erw. Aufl  


In [20]:
print(df_feature_base[['edition_delta', 'edition_x', 'edition_y']][
    (df_feature_base.edition_delta > 0.0) & (df_feature_base.edition_delta <= 0.1)].head()
     )

      edition_delta edition_x                                   edition_y
2985       0.093750   1. Aufl                Schweiz 2007, SF1 01.03.2007
3000       0.066667   1. Aufl                   [new ed., 2nd impression]
3059       0.071429   1. Aufl                                    6th impr
3076       0.093023   1. Aufl    Einzelne Nachträge 1907, Ueberdruck 1909
3077       0.088889   1. Aufl  Ueberdruck 1909/2, einzelne Nachträge 1912


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split up into two new attributes $\texttt{format_prefix}$ and $\texttt{format_postfix}$, which will be compared by different similarity metrics.

- As the quality of $\texttt{format_prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format_postfix}$, a q-gram based comparison will be chosen.
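The q-gram based comparison can be sketched with the standard library as a Jaccard similarity on bigrams ($q=2$). The helper names are chosen for this illustration and the code is not the library implementation; the assumption that bigrams are counted as a multiset is consistent with the values printed further down in this section.

```python
from collections import Counter

def qgrams(text, q=2):
    """All overlapping substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def qgram_jaccard(a, b, q=2):
    """Multiset Jaccard similarity on the q-grams of two strings."""
    count_a, count_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    union = sum((count_a | count_b).values())
    return sum((count_a & count_b).values()) / union if union else 1.0

print(qgrams("010200"))                             # ['01', '10', '02', '20', '00']
print(round(qgram_jaccard("010200", "020000"), 6))  # 0.428571
print(round(qgram_jaccard("010100", "020000"), 6))  # 0.111111
print(qgram_jaccard("020000", "020353"))            # 0.25
```

Compared with single characters, bigrams retain some positional information, which matters for codes that consist of only a few distinct digits.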

In [21]:
format_prefix_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_prefix',
                                          format_prefix_algorithm)

format_postfix_algorithm = tedi.Jaccard(qval=2)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_postfix',
                                          format_postfix_algorithm)

In [22]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    print(df_feature_base[['format_prefix_delta', 'format_prefix_x', 'format_prefix_y']][
        df_feature_base.format_prefix_delta == i].sample(n=10)
         )

        format_prefix_delta format_prefix_x format_prefix_y
38241                   0.0              vm              mp
129048                  0.0              cf              bk
36018                   0.0              mu              bk
190816                  0.0              bk              mp
100568                  0.0              mu              bk
15889                   0.0              mu              bk
148311                  0.0              mu              bk
50133                   0.0              cr              bk
216856                  0.0              mu              bk
13156                   0.0              cf              mu


In [23]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    print(df_feature_base[['format_postfix_delta', 'format_postfix_x', 'format_postfix_y']][
        df_feature_base.format_postfix_delta == i].sample(n=5)
         )

        format_postfix_delta format_postfix_x format_postfix_y
18821               0.428571           010200           020000
15029               0.428571           010200           010100
5416                0.428571           020300           020053
130046              0.428571           020000           010200
9309                0.428571           010300           010053
        format_postfix_delta format_postfix_x format_postfix_y
98088               0.111111           010100           020000
235791              0.111111           020000           010053
136370              0.111111           010100           020000
16114               0.111111           010200           030300
251376              0.111111           010100           020000
        format_postfix_delta format_postfix_x format_postfix_y
1647                    0.25           020000           020353
175667                  0.25           020000           020347
88156                   0.25           030500          

### person_245c

In [24]:
person_245c_algorithm = tedi.Jaro()

In [25]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_245c', person_245c_algorithm)

In [26]:
print(df_feature_base[['person_245c_delta', 'person_245c_x', 'person_245c_y']][
    (df_feature_base.person_245c_delta > 0.0) & (df_feature_base.person_245c_delta <= 0.1)].head()
     )

Empty DataFrame
Columns: [person_245c_delta, person_245c_x, person_245c_y]
Index: []


In [27]:
print(df_feature_base[['person_245c_delta', 'person_245c_x', 'person_245c_y']][
    (df_feature_base.person_245c_delta >= 0.9) & (df_feature_base.person_245c_delta <= 1.0)].head()
     )

    person_245c_delta                                      person_245c_x  \
0                 1.0  jane austen ; aus dem englischen übersetzt von...   
4                 1.0  jane austen ; aus dem engl. übers. von ursula ...   
8                 1.0                                        jane austen   
9                 1.0  w. a. mozart ; libretto: emanuel schikaneder ;...   
14                1.0       w. a. mozart ; libretto: emanuel schikaneder   

                                        person_245c_y  
0   jane austen ; aus dem englischen übersetzt von...  
4   jane austen ; aus dem engl. übers. von ursula ...  
8                                         jane austen  
9   w. a. mozart ; libretto: emanuel schikaneder ;...  
14       w. a. mozart ; libretto: emanuel schikaneder  


### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split up into two new attributes $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$, which will be compared by the same similarity metric.

In [28]:
ttlfull_algorithm = tedi.Jaccard()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_245', ttlfull_algorithm)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_246', ttlfull_algorithm)

In [29]:
df_feature_base.columns

Index(['duplicates', 'century_x', 'century_y', 'coordinate_x', 'coordinate_y',
       'corporate_110_x', 'corporate_110_y', 'corporate_710_x',
       'corporate_710_y', 'edition_x', 'edition_y', 'exactDate_x',
       'exactDate_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x',
       'format_postfix_y', 'person_245c_x', 'person_245c_y', 'ttlfull_245_x',
       'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x',
       'volumes_y', 'century_delta', 'corporate_110_delta',
       'corporate_710_delta', 'edition_delta', 'format_prefix_delta',
       'format_postfix_delta', 'person_245c_delta', 'ttlfull_245_delta',
       'ttlfull_246_delta'],
      dtype='object')

In [30]:
print(df_feature_base[['ttlfull_245_delta', 'ttlfull_245_x', 'ttlfull_245_y']][
    (df_feature_base.ttlfull_245_delta > 0.0) & (df_feature_base.ttlfull_245_delta <= 0.1)].head()
     )

      ttlfull_245_delta                                      ttlfull_245_x  \
465            0.092437                                        zauberflöte   
467            0.092437  die zauberflöte, kv 620 : eine deutsche oper i...   
690            0.044444                                               arts   
691            0.044444  arts, beaux-arts, littérature, spectacles : (j...   
1014           0.075862                                      genéve, genff   

                                          ttlfull_245_y  
465   die zauberflöte, kv 620 : eine deutsche oper i...  
467                                         zauberflöte  
690   arts, beaux-arts, littérature, spectacles : (j...  
691                                                arts  
1014  genève, ville capitale d'une république de mêm...  


In [31]:
print(df_feature_base[['ttlfull_245_delta', 'ttlfull_245_x', 'ttlfull_245_y']][
    (df_feature_base.ttlfull_245_delta >= 0.9) & (df_feature_base.ttlfull_245_delta < 1.0)].head()
     )

    ttlfull_245_delta                                      ttlfull_245_x  \
29           0.986111  der moralische status der tiere, henry salt, p...   
31           0.986111  der moralische status der tiere, henry salt, p...   
32           0.986111  der moralische status der tiere, henry salt, p...   
33           0.986111  der moralische status der tiere, henry salt, p...   
38           0.986111  der moralische status der tiere, henry salt, p...   

                                        ttlfull_245_y  
29  der moralische status der tiere, henry salt, p...  
31  der moralische status der tiere, henry salt, p...  
32  der moralische status der tiere, henry salt, p...  
33  der moralische status der tiere, henry salt, p...  
38  der moralische status der tiere, henry salt, p...  


In [32]:
display(df_feature_base[['ttlfull_245_delta', 'ttlfull_245_x', 'ttlfull_245_y']][
    (df_feature_base.ttlfull_245_delta >= 0.9) & (df_feature_base.ttlfull_245_delta < 1.0)].head()
     )

Unnamed: 0,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y
29,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
31,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
32,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
33,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."
38,0.986111,"der moralische status der tiere, henry salt, p...","der moralische status der tiere, henry salt, p..."


### volumes

In [33]:
volumes_algorithm = tedi.Jaccard()
#volumes_algo = tedi.MongeElkan()

In [34]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'volumes', volumes_algorithm)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head(20)

Unnamed: 0,duplicates,century_x,century_y,coordinate_x,coordinate_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,person_245c_x,person_245c_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y,century_delta,corporate_110_delta,corporate_710_delta,edition_delta,format_prefix_delta,format_postfix_delta,person_245c_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
0,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.818905,0.363636,1.0,1.0
2,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.69774,1.0,1.0,1.0
3,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.818905,0.363636,1.0,1.0
4,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,emma,emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.702265,0.363636,1.0,1.0
6,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.69774,1.0,1.0,1.0
7,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.702265,0.363636,1.0,1.0
8,1,2009,2009,[],[],,,,,,,2009,2009,bk,bk,20000,20000,jane austen,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1,2000,2000,[],[],,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",,,2000,2000,vm,vm,10300,10300,w. a. mozart ; libretto: emanuel schikaneder ;...,w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen",,,"1 dvd-video, dvd region 0, 169 min., farb.","1 dvd-video, dvd region 0, 169 min., farb.",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [35]:
columns_metadata_dict

{'list_columns': ['coordinate',
  'doi',
  'format',
  'isbn',
  'ismn',
  'pages',
  'volumes'],
 'strings_columns': ['century',
  'corporate',
  'corporate_110',
  'corporate_710',
  'corporate_810',
  'decade',
  'docid',
  'edition',
  'exactDate',
  'person',
  'person_100',
  'person_700',
  'person_800',
  'person_245c',
  'pubyear',
  'ttlfull',
  'ttlfull_245',
  'ttlfull_246'],
 'data_analysis_columns': ['century',
  'coordinate',
  'corporate_110',
  'corporate_710',
  'edition',
  'exactDate',
  'format_prefix',
  'format_postfix',
  'person_245c',
  'ttlfull_245',
  'ttlfull_246',
  'volumes'],
 'columns_to_use': ['duplicates',
  'century_x',
  'century_y',
  'coordinate_x',
  'coordinate_y',
  'corporate_110_x',
  'corporate_110_y',
  'corporate_710_x',
  'corporate_710_y',
  'edition_x',
  'edition_y',
  'exactDate_x',
  'exactDate_y',
  'format_prefix_x',
  'format_prefix_y',
  'format_postfix_x',
  'format_postfix_y',
  'person_245c_x',
  'person_245c_y',
  'ttlfull_24

## Feature Base

The metric for each attribute of the feature DataFrame has been decided and the features have been calculated. The columns with the original attribute values are not needed for further processing and will be dropped to generate the feature matrix for modelling the estimators.

In [36]:
# Drop all non-delta columns, except 'duplicates'
columns_to_be_dropped = [e for e in columns_metadata_dict['columns_to_use']
                         if e != 'duplicates']

df_feature_base.drop(columns=columns_to_be_dropped, inplace=True)

In [37]:
for i in range(2):
    display(df_feature_base[df_feature_base.duplicates==i].sample(n=20))

Unnamed: 0,duplicates,century_delta,corporate_110_delta,corporate_710_delta,edition_delta,format_prefix_delta,format_postfix_delta,person_245c_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
170499,0,0.5,1.0,1.0,1.0,0.0,0.428571,0.421549,0.252874,1.0,0.147059
176927,0,0.0,1.0,0.0,1.0,0.0,0.111111,0.654966,0.221311,0.0,0.0
28082,0,0.5,1.0,1.0,1.0,0.0,0.428571,0.583683,0.333333,1.0,0.0
98912,0,0.0,1.0,1.0,1.0,1.0,0.428571,0.489927,0.589888,1.0,0.125
111475,0,0.0,1.0,1.0,1.0,0.0,0.428571,0.693115,0.034483,1.0,0.105263
65794,0,0.5,1.0,1.0,0.0,1.0,1.0,0.572327,0.425743,1.0,0.222222
28111,0,0.0,1.0,0.0,0.0,0.0,0.111111,0.764954,0.577778,1.0,0.076923
153278,0,0.75,1.0,1.0,1.0,1.0,1.0,0.0,0.553571,1.0,0.1
29719,0,0.75,1.0,1.0,0.0,0.0,0.111111,0.545833,0.15,1.0,0.222222
235889,0,1.0,1.0,0.0,0.0,0.0,0.111111,0.0,0.363636,1.0,0.083333


Unnamed: 0,duplicates,century_delta,corporate_110_delta,corporate_710_delta,edition_delta,format_prefix_delta,format_postfix_delta,person_245c_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
257,1,1.0,1.0,0.0,1.0,1.0,1.0,0.648791,0.309859,1.0,1.0
367,1,1.0,1.0,1.0,1.0,1.0,1.0,0.707407,0.511628,0.0,1.0
97,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.986111,1.0,1.0
685,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
768,1,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
210,1,1.0,1.0,1.0,1.0,1.0,1.0,0.576164,0.399123,0.0,0.0
666,1,1.0,1.0,1.0,0.465116,1.0,1.0,0.894767,1.0,1.0,1.0
1008,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
401,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.333333
416,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Feature Matrix and Target Vector Handover

To hand over the resulting DataFrame of this chapter, the DataFrame is saved into a pickle file that will be read in the next chapters [Decision Tree Model](./5_DecisionTreeModel.ipynb), ... as input file.

In [38]:
import pickle as pk

# Binary intermediary file
with open(os.path.join(path_goldstandard,
                       'labelled_feature_matrix.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)