# Feature Matrix Generation

## Table of Contents

- [Data Takeover](#Data-Takeover)
- [String Distance and Similarity](#String-Distance-and-Similarity)
- [Similarity Metrics on Attribute Level](#Similarity-Metrics-on-Attribute-Level)
    - [century](#century)
    - [corporate](#corporate)
    - [edition](#edition)
    - [format](#format)
    - [person_245c](#person_245c)
    - [ttlfull](#ttlfull)
    - [volumes](#volumes)
- [Feature Base](#Feature-Base)
- [Feature Matrix and Target Vector Handover](#Feature-Matrix-and-Target-Vector-Handover)

## Data Takeover

The raw Swissbib goldstandard data was processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb). As the first step of this chapter, that data is read in so that it can be processed further into the feature matrix and target vector used by the subsequent machine learning model chapters.

In [1]:
import os
import pandas as pd
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'culomns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

# Restore results so far
df_feature_base = pd.read_pickle(os.path.join(path_goldstandard, 'feature_base_df.pkl'),
                                 compression=None)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head()

Unnamed: 0,duplicates,century_x,century_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,edition_x,edition_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,person_245c_x,person_245c_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.
2,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,emma,"emma, roman",,,600 s.,600 s.
4,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,emma,emma,,,600 s.,600 s.


In [2]:
print('Number of rows labelled as duplicates', len(df_feature_base[df_feature_base.duplicates==1]))
print('Number of rows labelled as uniques', len(df_feature_base[df_feature_base.duplicates==0]))
print('Total number of rows in DataFrame', len(df_feature_base))

Number of rows labelled as duplicates 1473
Number of rows labelled as uniques 259260
Total number of rows in DataFrame 260733


In [3]:
print('Share of duplicates (1) and uniques (0) in units of [%]')
print(df_feature_base.duplicates.value_counts(normalize=True)*100)

Share of duplicates (1) and uniques (0) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


## String Distance and Similarity

What is distance, what is similarity? [[Chri2012](./A_References.ipynb#chri2012)]

An internet search for string distance calculation in Python turns up libraries and separate code snippets for individual algorithms, see [[StSi](./A_References.ipynb#stsi)], [[TeDi](./A_References.ipynb#tedi)]. After trying the referenced libraries and a downloaded code snippet for a Smith-Waterman similarity [[SmWa](./A_References.ipynb#smwa)], the textdistance library [[TeDi](./A_References.ipynb#tedi)] was chosen for this capstone project. Of the candidates tried, it appears to be the most complete (measured against the similarity metrics suggested in [[Chri2012](./A_References.ipynb#chri2012)]) and the easiest to use.

In [4]:
# Install the textdistance Python library, if not done yet.
! pip install textdistance



For usage of the library, see the documentation in [[TeDi](./A_References.ipynb#tedi)]. For the purposes of this chapter, the method $\texttt{.normalized_similarity()}$ of the instantiated textdistance object is used.

With the following code line, the library is imported for application in this chapter.

In [5]:
import textdistance as tedi
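To make the relation between distance and similarity concrete, the following sketch derives a normalised similarity in the range [0, 1] from an edit distance. It is a minimal illustration, assuming the common convention of dividing the distance by the largest possible distance and subtracting the result from 1. The function names are hypothetical, and the sketch is not the library's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))  # largest possible edit distance
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(normalized_similarity("emma", "emma"))
print(normalized_similarity("kitten", "sitting"))
```

With this convention, a pair of empty attribute values is treated as a perfect match, which is consistent with the value 1.0 shown for empty attribute pairs in the outputs of this chapter.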

## Similarity Metrics on Attribute Level

In this section, a similarity metric is chosen for each attribute of the raw data. The function $\texttt{build_delta_feature}$, provided by the Python code file [data_preparation_funcs.py](./data_preparation_funcs.py), transforms the two attributes of a record pair into a single feature attribute holding their similarity value.

In [6]:
import data_preparation_funcs as dpf
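The source of $\texttt{build_delta_feature}$ is not shown in this chapter. The sketch below is a hypothetical reimplementation, inferred only from how the function is called and from the $\texttt{_x}$, $\texttt{_y}$ and $\texttt{_delta}$ column names seen in the outputs. The names $\texttt{build_delta_feature_sketch}$ and $\texttt{ExactMatch}$ are assumptions made for illustration.

```python
class ExactMatch:
    """Hypothetical stand-in for a textdistance algorithm object."""

    def normalized_similarity(self, a, b):
        return 1.0 if a == b else 0.0


def build_delta_feature_sketch(df, attribute, algorithm):
    """Add '<attribute>_delta' with the similarity of '<attribute>_x' and '<attribute>_y'."""
    left = df[attribute + '_x']
    right = df[attribute + '_y']
    # One similarity value per record pair (row)
    df[attribute + '_delta'] = [
        algorithm.normalized_similarity(x, y) for x, y in zip(left, right)
    ]
    return df


# A plain dict of columns stands in for the DataFrame so the sketch runs without pandas.
pairs = {'century_x': ['2009', '1984'], 'century_y': ['2009', '2011']}
pairs = build_delta_feature_sketch(pairs, 'century', ExactMatch())
print(pairs['century_delta'])
```

The sketch only uses column access and column assignment by name, so the same function body should also work on a pandas DataFrame.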

### century

As discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{century}$ holds a year stored as a string of length 4, where the letter 'u' serves as a placeholder for an unknown digit. For this reason, the attribute is kept as a string and is not converted to an integer. The feature attribute for a record pair is calculated with the Smith-Waterman algorithm, and the resulting similarity is used as input for the model calculation. Smith-Waterman was chosen because it was designed for aligning gene sequences, that is, strings over a small alphabet, a setting that resembles the comparison of four-character year strings.

In [7]:
century_algorithm = tedi.SmithWaterman()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'century', century_algorithm)
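The following sketch shows how an alignment score of this kind can be computed for two year strings. It is a simplified illustration under stated assumptions: a match scores 1, a mismatch scores 0, a gap costs 1, the score is read from the final cell of the matrix, and it is normalised by the length of the shorter string. The classic Smith-Waterman formulation takes the maximum over all cells instead. These parameters are assumptions chosen to agree with the sample values in this section, not a statement about the library's internals.

```python
def alignment_similarity(a: str, b: str, gap_cost: float = 1.0) -> float:
    """Smith-Waterman-style score of two strings, scaled to [0, 1]."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Assumption: two empty strings match, an empty and a non-empty one do not
        return 1.0 if a == b else 0.0
    # score[i][j]: score of the best alignment ending at a[i-1] and b[j-1], never below 0
    score = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, char_a in enumerate(a, start=1):
        for j, char_b in enumerate(b, start=1):
            match = score[i - 1][j - 1] + (1.0 if char_a == char_b else 0.0)
            delete = score[i - 1][j] - gap_cost
            insert = score[i][j - 1] - gap_cost
            score[i][j] = max(0.0, match, delete, insert)
    return score[-1][-1] / shorter


for x, y in [("2009", "2009"), ("1999", "1998"), ("1970", "1995"),
             ("1876", "1997"), ("1984", "2011")]:
    print(x, y, alignment_similarity(x, y))
```

For two strings of length 4, the score counts matching digits along the alignment, so the result can only be a multiple of 0.25.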

All resulting values of equal strings are equal to 1.

In [8]:
df_feature_base.century_delta[
    df_feature_base.century_x == df_feature_base.century_y
].sort_values(ascending=False).head()

259253    1.0
25195     1.0
24360     1.0
24362     1.0
24495     1.0
Name: century_delta, dtype: float64

For unequal strings, only four distinct similarity values occur. Some sample records are shown below.

In [9]:
df_feature_base.century_delta[
    df_feature_base.century_x != df_feature_base.century_y
].unique()

array([0.  , 0.5 , 0.75, 0.25])

In [10]:
for i in df_feature_base.century_delta[
    df_feature_base.century_x != df_feature_base.century_y].unique():
    
    print(df_feature_base[['century_delta', 'century_x', 'century_y']][
        df_feature_base.century_delta == i].sample(n=5)
         )

        century_delta century_x century_y
121299            0.0      1984      2011
144080            0.0      1982      2013
74879             0.0      1960      2007
184564            0.0      2015      1963
18731             0.0      2007      1999
        century_delta century_x century_y
18147             0.5      1970      1995
176589            0.5      1979      1955
42814             0.5      2009      2012
248306            0.5      1932      1998
102431            0.5      2002      2017
        century_delta century_x century_y
244208           0.75      2016      2013
235295           0.75      1989      1979
127793           0.75      1999      1998
32149            0.75      1999      1991
125380           0.75      1995      1975
        century_delta century_x century_y
246996           0.25      1876      1997
20886            0.25      1843      1996
78542            0.25      2006      1876
73762            0.25      2000      1550
178849           0.25      1932   

### corporate

In [11]:
corporate_algorithm = tedi.Jaro()

In [12]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_110', corporate_algorithm)
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'corporate_710', corporate_algorithm)

In [13]:
print(df_feature_base[['corporate_110_delta', 'corporate_110_x', 'corporate_110_y']][
    (df_feature_base.corporate_110_delta > 0.0) & (df_feature_base.corporate_110_delta <= 0.1)].head()
     )

Empty DataFrame
Columns: [corporate_110_delta, corporate_110_x, corporate_110_y]
Index: []


In [14]:
print(df_feature_base[['corporate_710_delta', 'corporate_710_x', 'corporate_710_y']][
    (df_feature_base.corporate_710_delta >= 0.9) & (df_feature_base.corporate_710_delta <= 1.0)].head()
     )

   corporate_710_delta corporate_710_x corporate_710_y
0                  1.0                                
1                  1.0                                
2                  1.0                                
3                  1.0                                
4                  1.0                                


### edition

The edition statement is a string value that may consist of several words. A Jaccard similarity is tried for this attribute.

In [15]:
edition_algorithm = tedi.Jaccard()

In [16]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'edition', edition_algorithm)
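As an illustration of the Jaccard similarity, the sketch below compares two strings as multisets of single characters and divides the size of their intersection by the size of their union. The character-level multiset interpretation is an assumption. It is consistent with the edition values shown in this section, but it is not taken from the library's source, and the function name is hypothetical.

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity on character multisets: |A & B| / |A | B|."""
    counts_a, counts_b = Counter(a), Counter(b)
    union = sum((counts_a | counts_b).values())  # maximum count per character
    if union == 0:
        return 1.0  # assumption: two empty strings count as identical
    intersection = sum((counts_a & counts_b).values())  # minimum count per character
    return intersection / union


# The two strings differ only by one trailing full stop: 13 shared of 14 characters
print(round(jaccard_similarity("3., erw. Aufl", "3., erw. Aufl."), 6))
```

Because characters rather than words are compared, two statements with the same words in a different order still score close to 1.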

In [17]:
df_feature_base.edition_delta.unique()[:30], len(df_feature_base.edition_delta.unique())

(array([1.        , 0.        , 0.95348837, 0.46511628, 0.62222222,
        0.66666667, 0.5       , 0.65909091, 0.62790698, 0.67741935,
        0.68965517, 0.84375   , 0.7       , 0.875     , 0.92857143,
        0.56521739, 0.54166667, 0.72727273, 0.96153846, 0.57142857,
        0.54761905, 0.74193548, 0.09375   , 0.17647059, 0.24      ,
        0.10526316, 0.75      , 0.06666667, 0.13636364, 0.28571429]), 802)

The comparison results in a large number of distinct similarity values for the goldstandard data set. Below, some examples are shown.

In [18]:
print(df_feature_base[['edition_delta', 'edition_x', 'edition_y']][
    (df_feature_base.edition_delta >= 0.9) & (df_feature_base.edition_delta < 1.0)].head()
     )

      edition_delta                                   edition_x  \
657        0.953488  5. Aufl., 43.-46. Tsd., neu durchges. Aufl   
661        0.953488  Neu durchges. Aufl., 5. Aufl., 43.-46. Tsd   
995        0.928571                               3., erw. Aufl   
999        0.928571                               3., erw. Aufl   
1001       0.928571                              3., erw. Aufl.   

                                       edition_y  
657   Neu durchges. Aufl., 5. Aufl., 43.-46. Tsd  
661   5. Aufl., 43.-46. Tsd., neu durchges. Aufl  
995                               3., erw. Aufl.  
999                               3., erw. Aufl.  
1001                               3., erw. Aufl  


In [19]:
print(df_feature_base[['edition_delta', 'edition_x', 'edition_y']][
    (df_feature_base.edition_delta > 0.0) & (df_feature_base.edition_delta <= 0.1)].head()
     )

      edition_delta edition_x                                   edition_y
2985       0.093750   1. Aufl                Schweiz 2007, SF1 01.03.2007
3000       0.066667   1. Aufl                   [new ed., 2nd impression]
3059       0.071429   1. Aufl                                    6th impr
3076       0.093023   1. Aufl    Einzelne Nachträge 1907, Ueberdruck 1909
3077       0.088889   1. Aufl  Ueberdruck 1909/2, einzelne Nachträge 1912


### format

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{format}$ has been split up into two new attributes, $\texttt{format_prefix}$ and $\texttt{format_postfix}$, which are compared using different similarity metrics.

- As the quality of $\texttt{format_prefix}$ is expected to be high, an identity comparison should be sufficient.
- Due to the observed structure of $\texttt{format_postfix}$, a q-gram based comparison will be chosen.

In [20]:
format_prefix_algorithm = tedi.Identity()
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_prefix',
                                          format_prefix_algorithm)

format_postfix_algorithm = tedi.Jaccard(qval=2)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'format_postfix',
                                          format_postfix_algorithm)
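The q-gram based comparison can be sketched as follows: each string is cut into overlapping substrings of length q (here q = 2), and the Jaccard similarity is computed on the resulting multisets. This is a minimal sketch under the assumption that q-grams are counted with multiplicity. The helper names are hypothetical and the code is not the library's implementation.

```python
from collections import Counter


def qgrams(text: str, q: int) -> Counter:
    """Multiset of all overlapping substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity on q-gram multisets."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = sum((grams_a | grams_b).values())
    if union == 0:
        return 1.0  # assumption: no q-grams on either side counts as identical
    return sum((grams_a & grams_b).values()) / union


# '040000' -> 04, 40, 00, 00, 00 and '020000' -> 02, 20, 00, 00, 00
# share three q-grams out of seven in the union
print(round(qgram_jaccard("040000", "020000"), 6))
```

Compared with a character-level comparison, bigrams keep some positional information, which matters for a code string in which the same digits recur at different positions.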

In [21]:
for i in df_feature_base.format_prefix_delta[
    df_feature_base.format_prefix_x != df_feature_base.format_prefix_y].unique():
    
    print(df_feature_base[['format_prefix_delta', 'format_prefix_x', 'format_prefix_y']][
        df_feature_base.format_prefix_delta == i].sample(n=10)
         )

        format_prefix_delta format_prefix_x format_prefix_y
141137                  0.0              bk              mu
13144                   0.0              cf              mu
4355                    0.0              vm              mu
111888                  0.0              bk              mp
160393                  0.0              mu              bk
28761                   0.0              bk              mu
142873                  0.0              vm              mp
92837                   0.0              mu              mp
24186                   0.0              mu              bk
45513                   0.0              vm              bk


In [22]:
for i in df_feature_base.format_postfix_delta[
    df_feature_base.format_postfix_x != df_feature_base.format_postfix_y].unique():
    
    print(df_feature_base[['format_postfix_delta', 'format_postfix_x', 'format_postfix_y']][
        df_feature_base.format_postfix_delta == i].sample(n=5)
         )

        format_postfix_delta format_postfix_x format_postfix_y
234303              0.428571           040000           020000
58990               0.428571           020000           020053
15012               0.428571           010200           010100
140409              0.428571           010200           020053
56015               0.428571           020300           020053
        format_postfix_delta format_postfix_x format_postfix_y
31564               0.111111           010300           020053
200521              0.111111           020000           040100
43530               0.111111           020000           030300
240818              0.111111           020000           010100
228814              0.111111           010000           020053
        format_postfix_delta format_postfix_x format_postfix_y
72187                   0.25           020000           020353
83512                   0.25           020000           020353
117483                  0.25           020000          

### person_245c

In [23]:
person_245c_algorithm = tedi.Jaro()

In [24]:
df_feature_base = dpf.build_delta_feature(
    df_feature_base, 'person_245c', person_245c_algorithm)

In [25]:
print(df_feature_base[['person_245c_delta', 'person_245c_x', 'person_245c_y']][
    (df_feature_base.person_245c_delta > 0.0) & (df_feature_base.person_245c_delta <= 0.1)].head()
     )

Empty DataFrame
Columns: [person_245c_delta, person_245c_x, person_245c_y]
Index: []


In [26]:
print(df_feature_base[['person_245c_delta', 'person_245c_x', 'person_245c_y']][
    (df_feature_base.person_245c_delta >= 0.9) & (df_feature_base.person_245c_delta <= 1.0)].head()
     )

    person_245c_delta                                      person_245c_x  \
0                 1.0  jane austen ; aus dem englischen übersetzt von...   
4                 1.0  jane austen ; aus dem engl. übers. von ursula ...   
8                 1.0                                        jane austen   
9                 1.0  w. a. mozart ; libretto: emanuel schikaneder ;...   
14                1.0       w. a. mozart ; libretto: emanuel schikaneder   

                                        person_245c_y  
0   jane austen ; aus dem englischen übersetzt von...  
4   jane austen ; aus dem engl. übers. von ursula ...  
8                                         jane austen  
9   w. a. mozart ; libretto: emanuel schikaneder ;...  
14       w. a. mozart ; libretto: emanuel schikaneder  


### ttlfull

Following the discussion in chapter [Data Analysis](./1_DataAnalysis.ipynb), attribute $\texttt{ttlfull}$ has been split up into two new attributes, $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$, which are compared using the same similarity metric.

In [27]:
ttlfull_algorithm = tedi.Jaccard()

df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_245', ttlfull_algorithm)
df_feature_base = dpf.build_delta_feature(df_feature_base, 'ttlfull_246', ttlfull_algorithm)

In [28]:
df_feature_base.columns

Index(['duplicates', 'century_x', 'century_y', 'corporate_110_x',
       'corporate_110_y', 'corporate_710_x', 'corporate_710_y', 'edition_x',
       'edition_y', 'format_prefix_x', 'format_prefix_y', 'format_postfix_x',
       'format_postfix_y', 'person_245c_x', 'person_245c_y', 'ttlfull_245_x',
       'ttlfull_245_y', 'ttlfull_246_x', 'ttlfull_246_y', 'volumes_x',
       'volumes_y', 'century_delta', 'corporate_110_delta',
       'corporate_710_delta', 'edition_delta', 'format_prefix_delta',
       'format_postfix_delta', 'person_245c_delta', 'ttlfull_245_delta',
       'ttlfull_246_delta'],
      dtype='object')

In [29]:
print(df_feature_base[['ttlfull_245_delta', 'ttlfull_245_x', 'ttlfull_245_y']][
    (df_feature_base.ttlfull_245_delta > 0.0) & (df_feature_base.ttlfull_245_delta <= 0.1)].head()
     )

      ttlfull_245_delta                                      ttlfull_245_x  \
465            0.092437                                        zauberflöte   
467            0.092437  die zauberflöte, kv 620 : eine deutsche oper i...   
690            0.044444                                               arts   
691            0.044444  arts, beaux-arts, littérature, spectacles : (j...   
1014           0.075862                                      genéve, genff   

                                          ttlfull_245_y  
465   die zauberflöte, kv 620 : eine deutsche oper i...  
467                                         zauberflöte  
690   arts, beaux-arts, littérature, spectacles : (j...  
691                                                arts  
1014  genève, ville capitale d'une république de mêm...  


In [30]:
print(df_feature_base[['ttlfull_245_delta', 'ttlfull_245_x', 'ttlfull_245_y']][
    (df_feature_base.ttlfull_245_delta >= 0.9) & (df_feature_base.ttlfull_245_delta < 1.0)].head()
     )

    ttlfull_245_delta                                      ttlfull_245_x  \
29           0.986111  der moralische status der tiere, henry salt, p...   
31           0.986111  der moralische status der tiere, henry salt, p...   
32           0.986111  der moralische status der tiere, henry salt, p...   
33           0.986111  der moralische status der tiere, henry salt, p...   
38           0.986111  der moralische status der tiere, henry salt, p...   

                                        ttlfull_245_y  
29  der moralische status der tiere, henry salt, p...  
31  der moralische status der tiere, henry salt, p...  
32  der moralische status der tiere, henry salt, p...  
33  der moralische status der tiere, henry salt, p...  
38  der moralische status der tiere, henry salt, p...  


### volumes

In [31]:
volumes_algorithm = tedi.Jaccard()
#volumes_algo = tedi.MongeElkan()

In [32]:
df_feature_base = dpf.build_delta_feature(df_feature_base, 'volumes', volumes_algorithm)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_feature_base.columns)

df_feature_base.head(20)

Unnamed: 0,duplicates,century_x,century_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,edition_x,edition_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,person_245c_x,person_245c_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y,century_delta,corporate_110_delta,corporate_710_delta,edition_delta,format_prefix_delta,format_postfix_delta,person_245c_delta,ttlfull_245_delta,ttlfull_246_delta,volumes_delta
0,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.818905,0.363636,1.0,1.0
2,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem englischen übersetzt von...,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.69774,1.0,1.0,1.0
3,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.818905,0.363636,1.0,1.0
4,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,emma,emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen ; aus dem engl. übers. von ursula ...,jane austen,emma,"emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.702265,0.363636,1.0,1.0
6,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen,jane austen ; aus dem englischen übersetzt von...,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.69774,1.0,1.0,1.0
7,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen,jane austen ; aus dem engl. übers. von ursula ...,"emma, roman",emma,,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,0.702265,0.363636,1.0,1.0
8,1,2009,2009,,,,,,,bk,bk,20000,20000,jane austen,jane austen,"emma, roman","emma, roman",,,600 s.,600 s.,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1,2000,2000,,,"metropolitan opera orchestra, metropolitan ope...","metropolitan opera orchestra, metropolitan ope...",,,vm,vm,10300,10300,w. a. mozart ; libretto: emanuel schikaneder ;...,w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen",,,"1 dvd-video, dvd region 0, 169 min., farb.","1 dvd-video, dvd region 0, 169 min., farb.",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Feature Base

In [33]:
# Drop all non-delta columns, except 'duplicates'
columns_to_be_dropped = [e for e in columns_metadata_dict['columns_to_use']
                         if e != 'duplicates']

df_feature_base.drop(columns=columns_to_be_dropped, inplace=True)

In [34]:
for i in range(2):
    print(df_feature_base[df_feature_base.duplicates==i].sample(n=20))

        duplicates  century_delta  corporate_110_delta  corporate_710_delta  \
115334           0           0.50                  1.0             1.000000   
92621            0           0.25                  1.0             0.000000   
158579           0           0.00                  1.0             1.000000   
188166           0           0.50                  1.0             0.000000   
185005           0           0.00                  1.0             1.000000   
249060           0           0.00                  1.0             1.000000   
226226           0           0.00                  1.0             1.000000   
232393           0           0.00                  1.0             0.000000   
118317           0           0.00                  1.0             1.000000   
240567           0           0.50                  1.0             1.000000   
173503           0           0.00                  1.0             0.000000   
197687           0           0.00                  1

## Feature Matrix and Target Vector Handover

To hand over the result of this chapter, the DataFrame is saved to a pickle file, which the following chapters, [Decision Tree Model](./4_DecisionTreeModel.ipynb), ..., read as their input file.

In [35]:
import pickle as pk

# Binary intermediary file
with open(os.path.join(path_goldstandard,
                       'labelled_feature_matrix.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)