# Goldstandard and Data Preparation

Swissbib's goldstandard is a set of pre-defined records used to test Swissbib's implemented logic for identifying duplicate and unique records. The Swissbib project team has processed this data into a data extract with the help of a Scala implementation [[ScalRepo](./A_References.ipynb#scala_repo)].

This chapter of the capstone project loads the data records of Swissbib's goldstandard and prepares them for the generation of the feature matrix for the machine learning models.

## Table of Contents

- [Metadata Takeover](#Metadata-Takeover)
- [Description of Swissbib's Goldstandard](#Description-of-Swissbib's-Goldstandard)
    - [Sample Records of a Goldstandard Example](#Sample-Records-of-a-Goldstandard-Example)
    - [Generation of Pairs of Duplicates](#Generation-of-Pairs-of-Duplicates)
    - [Generation of Pairs of Uniques](#Generation-of-Pairs-of-Uniques)
- [Implementation of Pairs Generation](#Implemenation-of-Pairs-Generation)
    - [Pairs of Duplicates Implementation](#Pairs-of-Duplicates-Implementation)
    - [Pairs of Uniques Implementation](#Pairs-of-Uniques-Implementation)
- [Build DataFrames for Transformation into Feature Matrix](#Build-DataFrames-for-Transformation-into-Feature-Matrix)
- [Determine Target Vector](#Determine-Target-Vector)
- [Build Feature Base](#Build-Feature-Base)
- [Goldstandard DataFrame Handover](#Goldstandard-DataFrame-Handover)
- [Summary](#Summary)

## Metadata Takeover

In [1]:
import os
import pickle as pk

path_goldstandard = './daten_goldstandard'

# Restore metadata so far
with open(os.path.join(path_goldstandard, 'columns_metadata.pkl'), 'rb') as handle:
    columns_metadata_dict = pk.load(handle)

## Description of Swissbib's Goldstandard

This section starts with loading Swissbib's goldstandard data. Afterwards, the data is explained and analysed.

In [2]:
import json

records_slave, records_master, records_unique = [], [], []
file_slave, file_master, file_unique = 'slave.json', 'master.json', 'unique.json'

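# Each file holds one JSON record per line; the context manager closes the file after reading
with open(os.path.join(path_goldstandard, file_slave), 'r') as handle:
    for line in handle:
        records_slave.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_master), 'r') as handle:
    for line in handle:
        records_master.append(json.loads(line))
with open(os.path.join(path_goldstandard, file_unique), 'r') as handle:
    for line in handle:
        records_unique.append(json.loads(line))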

print('Number of records in data file {:s}\t{:d}'.format(file_slave, len(records_slave)))
print('Number of records in data file {:s}\t{:d}'.format(file_master, len(records_master)))
print('Number of records in data file {:s}\t{:d}'.format(file_unique, len(records_unique)))

Number of records in data file slave.json	435
Number of records in data file master.json	159
Number of records in data file unique.json	596


In [3]:
import pandas as pd

goldstandard = {}

goldstandard['slaves'] = pd.DataFrame(records_slave)
goldstandard['masters'] = pd.DataFrame(records_master)
goldstandard['uniques'] = pd.DataFrame(records_unique)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(goldstandard['slaves'].columns)

goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam jun.],[Reclam jun.],,[],[],[],,[BK020000]
1,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...","{'100': ['LevineJamesDir.'], '700': ['MozartWo...","{'710': ['Metropolitan Opera Orchestra', 'Metr...",2000,2000,2000,2000,,[],"[1 DVD-Video, DVD Region 0, 169 Min., farb.]","[1 DVD-Video, DVD Region 0, 169 Min., farb.]",[Deutsche Grammophon],[Deutsche Grammophon],,[],[],[],,[VM010300]
2,001817272,"[(OCoLC)231772550, (ABN)000096920]",[3-495-47879-5],"{'245': ['Der moralische Status der Tiere', 'H...","{'245': ['Der moralische Status der Tiere', 'H...","{'100': ['FluryAndreas'], '245c': ['Andreas Fl...",{},1999,1999,1999,1999,,[],[316 S.],[316 S.],[Alber],[Alber],,[],[],[],,[BK020000]
3,00236865X,"[(OCoLC)887157168, (ABN)000223912]",[],{'245': ['Die Zauberflöte']},{'245': ['Die Zauberflöte']},"{'100': ['MozartWolfgang Amadeus'], '245c': ['']}",{},uuuuuuuu,uuuu,uuuu,uuuuuuuu,,[],[412 S.],[412 S.],[Ernst Eulenburg],[Ernst Eulenburg],,[],[],[],,[BK020000]
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]


The goldstandard is a fixed set of pre-defined data that has been processed into three distinct .json files [[ScalRepo](./A_References.ipynb#scala_repo)].

- $\texttt{slave.json}$ - This data file holds the original duplicated records that are merged into a master record.
- $\texttt{master.json}$ - This data file holds the unique master records, each formed out of its associated records in file $\texttt{slave.json}$.
- $\texttt{unique.json}$ - This data file holds records with data similar to that in $\texttt{slave.json}$ and $\texttt{master.json}$, but which are correctly identified as unique records and not as duplicates of any record in $\texttt{slave.json}$.

The three files of the goldstandard will be used for training and performance testing of the machine learning models of the capstone project. Combining the records of file $\texttt{slave.json}$ that share the same master record into pairs generates the duplicate pairs of the training data, while combining the records of file $\texttt{slave.json}$ with the records of file $\texttt{unique.json}$ generates the unique pairs of the training data.

### Sample Records of a Goldstandard Example

Let's have a rough look at an example in the goldstandard data. The example consists of all records of the three .json files whose attribute $\texttt{ttlfull}$ contains the string 'Emma'.

In [4]:
goldstandard['slaves'][
    goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam jun.],[Reclam jun.],,[],[],[],,[BK020000]
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
17,017959411,"[(OCoLC)636062037, (BGR)000409509]","[978-1-4058-7953-8 (CD Pack), 978-1-4058-8214-...",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...",{},2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education Ltd.],[Pearson Education Ltd.],,[],[],[],,[BK020000]
19,020155182,"[(OCoLC)218626148, (BGR)000463276]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...",{},2005,2005,2005,2005,,[],[600 S.],[600 S.],[Cambridge University Press],[Cambridge University Press],,[],[],[],,[BK020000]
29,022315098,"[(OCoLC)218626148, (IDSLU)000449481]",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2005,2005,2005,2005,,[],[599 S.],[599 S.],[],[],,[],[],[],,[BK020000]
51,035554215,"[(OCoLC)218626148, (IDSSG)000338145]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['CroninRichard...",{},2005,2005,2005,2005,,[],[600 S.],[600 S.],[],[],,[],[],[],,[BK020000]
77,055836801,"[(OCoLC)495204467, (SGBN)001068279, (OCoLC)495...","[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...",{},2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
109,103342699,"[(OCoLC)263554860, (IDSBB)001470548]",[],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},1979,1979,1979,1979,,[],[549 S.],[549 S.],[],[],,[],[],[],,[BK020000]
130,117574562,"[(OCoLC)218626148, (IDSBB)003781869]",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2005,2005,2005,2005,,[],[600 S.],[600 S.],[],[],,[],[],[],,[BK020000]
160,161169244,"[(OCoLC)218626148, (NEBIS)004930649]","[978-0-521-82437-8, 0-521-82437-0]",{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2005,2005,2005,2005,,[],[600 S.],[600 S.],[Cambridge University Press],[Cambridge University Press],,[],[],[],,[BK020000]


In [5]:
print('Number of records in slave.json with string \'Emma\' :', len(
    goldstandard['slaves'][
        goldstandard['slaves'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in slave.json with string 'Emma' : 18


A total of 18 records in $\texttt{slave.json}$ contain the string 'Emma' in attribute $\texttt{ttlfull}$. Some of these records are duplicates of each other. File $\texttt{master.json}$ reveals how many unique records remain out of these 18 records.

In [6]:
goldstandard['masters'][
    goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
85,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2009,2009,2009,2009,,[20008],[600 S.],[600 S.],[Reclam],[Reclam],,[],[],[],,[BK020000]
86,504389807,"[(NEBIS)008647887, (VAUD)991001434509702852, (...",[0-375-75742-2],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2001,2001,2001,2001,,[],[359 S.],[359 S.],[Modern Library],[Modern Library],,[],[],[],,[BK020000]
87,504389815,"[(IDSLU)000449481, (IDSBB)003781869, (NEBIS)00...",[0-521-82437-0],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2005,2005,2005,2005,,[],[599 S.],[599 S.],[],[],,[],[],[],,[BK020000]
88,504389823,"[(SGBN)001068279, (ABN)000548154, (BGR)0004095...","[978-1-4058-8214-9 (br), 1-4058-8214-X (br)]",{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...",{},2008,2008,2008,2008,,[],[64 S.],[64 S.],[Pearson Education],[Pearson Education],,[],[],[],,[BK020000]
89,504389831,"[(IDSBB)001470548, (SGBN)001344510]",[],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},1979,1979,1979,1979,,[],[549 S.],[549 S.],[],[],,[],[],[],,[BK020000]


In [7]:
print('Number of records in master.json with string \'Emma\' :', len(
    goldstandard['masters'][
        goldstandard['masters'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in master.json with string 'Emma' : 5


The number of deduplicated records of file $\texttt{slave.json}$ containing string 'Emma' is 5. This is the number of unique records that will be built in Swissbib's deduplication step out of the original 18 records above.

There may be some minor differences in some fields of the deduplicated master file records compared to their original data in the duplicated original slave file records. Nevertheless, it is possible to identify the slave file records as duplicates of their associated master file records, looking at the details of the values of the attributes.

In a final step, let's have a look at the 'Emma' string data of $\texttt{unique.json}$.

In [8]:
goldstandard['uniques'][
    goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,000143235,"[(OCoLC)362722306, (ABN)000551177]",[978-3-7466-6120-9],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}",{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2009,2009,2009,2009,,[6120],[575 S.],[575 S.],[Aufbau Taschenbuch],[Aufbau Taschenbuch],,[],[],[],,[BK020000]
4,002410559,"[(OCoLC)777853583, (ABN)000243260]",[0-19-282756-1],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},1990,1990,1990,1990,,[],[445 p.],[445 p.],[Oxford University Press],[Oxford University Press],,[],[],[],,[BK020000]
8,004130235,"[(OCoLC)887396789, (ABN)000628911]",[978-0-307-38684-7],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},2007,2007,2007,2007,,[],[495 S.],[495 S.],[Vintage Books],[Vintage Books],,[],[],[],,[BK020000]
13,017204097,"[(OCoLC)759250730, (BGR)000090407]",[3-596-22191-9],"{'245': ['Emma', 'Roman']}","{'245': ['Emma', 'Roman']}","{'100': ['AustenJane'], '245c': ['Jane Austen ...",{},1996,1996,1996,1996,[87.-88. Tsd.],[2191],[414 S.],[414 S.],[Fischer-Taschenbuch-Verlag],[Fischer-Taschenbuch-Verlag],,[],[],[],,[BK020000]
16,017738490,"[(OCoLC)76214484, (BGR)000281870]",[3-7466-5105-0],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '245c': ['Jane Austen ...",{},2001,2001,2001,2001,3. Aufl,[5105],[553 S.],[553 S.],[Aufbau Taschenbuch-Verlag],[Aufbau Taschenbuch-Verlag],,[],[],[],,[BK020000]
20,01843486X,"[(OCoLC)759471685, (BGR)000212474]",[0-582-41794-5],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BarnesAnnette...",{},2000,2000,2000,2000,"[new ed., 2nd impression]",[],[59 S.],[59 S.],[Pearson Education Ltd],[Pearson Education Ltd],,[],[],[],,[BK020000]
26,021868816,"[(OCoLC)610677900, (IDSLU)000093923]",[],{'245': ['Emma']},{'245': ['Emma']},{'100': ['AustenJane1775-1817(DE-588)118505173...,{},1996,1996,1996,1996,,[[1]],[560 S.],[560 S.],[],[],,[],[],[],,[BK020000]
42,038872498,"[(OCoLC)614721683, (SBT)000133903]",[0-14-043010-5],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '700': ['BlytheRonald'...",{},1983,1983,1983,1983,,[],[471 p.],[471 p.],[Penguin Books],[Penguin Books],,[],[],[],,[BK020000]
53,043085075,"[(OCoLC)882111340, (SBT)000463076]",[],"{'245': ['Emma', 'mobbing']}","{'245': ['Emma', 'mobbing']}","{'100': [], '245c': ['Caritas Ticino']}",{'710': ['Caritas (Ticino)']},2002,2002,2002,2002,,[],[1 Videocassetta],[1 Videocassetta],[Caritas Ticino],[Caritas Ticino],,[],[],[],,[VM010200]
62,051168871,"[(OCoLC)759052846, (SGBN)001050098]",[],{'245': ['Emma']},{'245': ['Emma']},"{'100': ['AustenJane'], '245c': ['Jane Austen']}",{},1953,1953,1953,1953,,[],[377 S.],[377 S.],[Collins],[Collins],,[],[],[],,[BK020000]


In [9]:
print('Number of records in unique.json with string \'Emma\' :', len(
    goldstandard['uniques'][
        goldstandard['uniques'].ttlfull.apply(lambda x : x['245'][0]).str.contains('Emma')
    ]
))

Number of records in unique.json with string 'Emma' : 44


A total of 44 records in file $\texttt{unique.json}$ contain the string 'Emma'. The data of these records reveals that none of them is a duplicate of a record in $\texttt{master.json}$. These unique records are excellent training data for deduplication, as they yield record pairs that are similar but not duplicates.

The goal of generating training and test data with the help of Swissbib's goldstandard is to generate two categories of training data.

- The first category consists of pairs of original records out of file $\texttt{slave.json}$ that carry the label 'duplicates'. These pairs are determined with the help of the records of file $\texttt{master.json}$, see subsection [Generation of Pairs of Duplicates](#generation_pairs_duplicates).
- The second category consists of pairs of original records out of files $\texttt{slave.json}$ and $\texttt{unique.json}$ that carry the label 'uniques', see subsection [Generation of Pairs of Uniques](#generation_pairs_uniques).

The number of possible duplicate pairs is given by the formula

$
\begin{align}
Tot_{duplicate\ pairs} = \sum_{i=1}^{M} \frac{1}{2}S_i \cdot \left(S_i-1\right)
\end{align}
$

where $M$ is the number of records in file $\texttt{master.json}$ and $S_i$ is the number of records in file $\texttt{slave.json}$ that are associated with master record $M_i$. As the number of slave records must be counted for each master record, it is not possible to calculate the number of expected duplicate pair records in advance.
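
As a small worked example of the formula, the following sketch sums the pair counts for a handful of master records. The group sizes are hypothetical and do not stem from the goldstandard; the function name `count_duplicate_pairs` is introduced here only for illustration.

```python
def count_duplicate_pairs(slaves_per_master):
    """Sum S_i * (S_i - 1) / 2 over all master records."""
    return sum(s * (s - 1) // 2 for s in slaves_per_master)

# Hypothetical group sizes: three master records with 4, 2 and 1 slave records
group_sizes = [4, 2, 1]
total_pairs = count_duplicate_pairs(group_sizes)
print(total_pairs)  # 6 + 1 + 0 = 7
```

A master record with only one slave record contributes no pair, which is why the total depends on the distribution of the slave records and not only on their overall number.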

### Generation of Pairs of Duplicates

The central attribute for retrieving duplicates in Swissbib's goldstandard data is attribute $\texttt{035liste}$, see section [Attribute Analysis](./1_DataAnalysis.ipynb#attribute_analysis) in chapter [Data Analysis](./1_DataAnalysis.ipynb). It helps to identify the associated master record for a given record in file $\texttt{slave.json}$. The process implemented below parses the list of identifiers in attribute $\texttt{035liste}$ of each record in file $\texttt{slave.json}$ and searches for each identifier in attribute $\texttt{035liste}$ of all records of file $\texttt{master.json}$. When the identifier is found in file $\texttt{master.json}$, the master record's attribute $\texttt{docid}$ is stored in a new column of the slave DataFrame, see figure [Slave/master relationship](#slave_master_relationship). The new attribute of the slave record serves as a foreign key to the related master record. This process is repeated for each list element of a slave record and for each record in data file $\texttt{slave.json}$.

<center>
    <b>Figure</b><a id='slave_master_relationship'></a> Slave/master relationship.
    <img src="./documentation/training_data.png" style="width: 600px;"/>
</center>

As will be shown below, the relationship of a slave record to its master record is unique. Even if there is more than one entry in the list of attribute $\texttt{035liste}$ of a slave record it will be shown that all distinct entries of a $\texttt{035liste}$ attribute list of one slave record point to one and the same master record.
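
The lookup described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the identifiers and docids are made up, and the inverted index `identifier_to_master` as well as the helper `find_masters` are names introduced only for this example.

```python
import pandas as pd

# Toy data with hypothetical identifiers and docids
masters = pd.DataFrame({
    'docid': ['M1', 'M2'],
    '035liste': [['(A)1', '(B)2'], ['(C)3']],
})
slaves = pd.DataFrame({
    'docid': ['S1', 'S2', 'S3'],
    '035liste': [['(X)9', '(A)1'], ['(B)2'], ['(C)3', '(X)8']],
})

# Inverted index: identifier -> docid of the master record that lists it
identifier_to_master = {
    identifier: docid
    for docid, identifiers in zip(masters['docid'], masters['035liste'])
    for identifier in identifiers
}

def find_masters(identifiers):
    """Return the set of master docids reachable from a slave's identifiers."""
    return {identifier_to_master[i] for i in identifiers if i in identifier_to_master}

slaves['masters_docid'] = slaves['035liste'].apply(find_masters)

# Every slave record must point to exactly one master record
assert (slaves['masters_docid'].apply(len) == 1).all()
slaves['masters_docid'] = slaves['masters_docid'].apply(lambda s: next(iter(s)))
print(slaves[['docid', 'masters_docid']])
```

Collecting the matches in a set makes the uniqueness claim directly checkable: if two identifiers of one slave record pointed to different master records, the set would hold more than one element and the assertion would fail.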

Attribute $\texttt{035liste}$ will be used neither in training nor in performance testing of the models. The column will be removed before model training.

### Generation of Pairs of Uniques

File $\texttt{unique.json}$ holds data of unique records. For training and testing purposes, pairs of records could be generated exclusively from the records of this file. This would generate record pairs with label 'uniques' whose records can be clearly distinguished and which are therefore easily recognized as non-duplicate pairs. A more interesting pairing can be generated by combining records out of $\texttt{slave.json}$ with records out of $\texttt{unique.json}$. Pairs of records drawn from both sources may reveal similarities in their attributes that make them more difficult to classify clearly as 'uniques'.

A mixture of both pairing sources will be implemented for the generation of the training and testing data. As the number of possible pairs among $N$ records is $\frac{1}{2}N(N-1)$, the total number of possible pairs can be estimated with $N$ being the sum of the number of records in $\texttt{unique.json}$ and the number of records in $\texttt{slave.json}$.

In [10]:
sum_unique_slave = len(goldstandard['uniques']) + len(goldstandard['slaves'])

print('Total number of possible pairs for training and testing : {:,d}'.format(
    sum_unique_slave * (sum_unique_slave - 1) // 2
))

Total number of possible pairs for training and testing : 530,965
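
The formula $\frac{1}{2}N(N-1)$ can be checked by enumerating all pairs explicitly. The sketch below does this with `itertools.combinations` on a few hypothetical docids that stand in for records of $\texttt{slave.json}$ and $\texttt{unique.json}$.

```python
from itertools import combinations

# Hypothetical docids standing in for records of slave.json and unique.json
docids = ['s1', 's2', 'u1', 'u2', 'u3']

# Each unordered pair of distinct records appears exactly once
pairs = list(combinations(docids, 2))

n = len(docids)
assert len(pairs) == n * (n - 1) // 2
print(len(pairs))  # 10
```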


## Implemenation of Pairs Generation

### Pairs of Duplicates Implementation

### Pairs of Uniques Implementation

## Build DataFrames for Transformation into Feature Matrix

In [11]:
# Target is the first necessary column
columns_metadata_dict['columns_to_use'] = ['duplicates']

# After join with itself, the data has _x and _y attributes
for i in columns_metadata_dict['data_analysis_columns']:
    for j in ['_x', '_y']:
        columns_metadata_dict['columns_to_use'].append(i+j)
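
The suffixes `_x` and `_y` are the default suffixes that `pd.merge` appends to overlapping column names. A minimal sketch on a toy DataFrame, where the join column `key` is a hypothetical name chosen only for this example:

```python
import pandas as pd

# Toy DataFrame; 'key' is a hypothetical join column
df = pd.DataFrame({'key': [1, 1], 'docid': ['a', 'b'], 'isbn': ['x', 'y']})

# Joining the DataFrame with itself pairs every row with every row of the same key
pairs = pd.merge(df, df, on='key')
print(list(pairs.columns))
```

The left-hand record of each pair ends up in the `_x` columns and the right-hand record in the `_y` columns, which is the layout that `columns_to_use` anticipates.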

In [12]:
import data_preparation_funcs as dpf

for i in ['slaves', 'masters', 'uniques']:
    for attrib in columns_metadata_dict['data_analysis_columns']:
        if attrib in ['doi', 'edition', 'isbn', 'musicid', 'scale' # Take as is
                      , 'corporate_710' # See 'corporate_110'
                      , 'coordinate_N' # See 'coordinate_E'
                      , 'format_postfix' # See 'format_prefix'
                      , 'person_100', 'person_700' # See 'person_245c'
                      , 'ttlfull_246' # See 'ttlfull_245'
                     ]:
            continue # Explicitly : do nothing!
        elif attrib in ['coordinate_E']:
            goldstandard[i] = dpf.split_coordinate(goldstandard[i])
        elif attrib in ['corporate_110']:
            goldstandard[i] = dpf.split_dictionary_column(
                goldstandard[i], 'corporate', ['110', '710'#, '810'
                                              ]
            )
        elif attrib in ['exactDate']:
            goldstandard[i] = dpf.clean_exactDate_string(goldstandard[i])
        elif attrib in ['format_prefix']:
            goldstandard[i] = dpf.transform_list_to_string(goldstandard[i], 'format')
            goldstandard[i] = dpf.split_format(goldstandard[i])
        elif attrib in ['person_245c']:
            goldstandard[i] = dpf.split_dictionary_column(
                goldstandard[i], 'person', ['100', '700', #'800', 
                    '245c']
            )
        elif attrib in ['ttlfull_245']:
            goldstandard[i] = dpf.split_dictionary_column(
                goldstandard[i], 'ttlfull', ['245', '246']
            )
        elif attrib in ['part', 'pubinit', 'volumes']:
            goldstandard[i] = dpf.transform_list_to_string(goldstandard[i], attrib)
        else: # Not explicitly handled, yet
            print('Attribute', attrib, 'is missing in this processing step!')
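
The module `data_preparation_funcs` is not shown in this chapter. To give an idea of what splitting a dictionary column could look like, the following is a hypothetical stand-in, inferred only from the resulting columns such as $\texttt{ttlfull_245}$ and $\texttt{ttlfull_246}$. The function name `split_dictionary_column_sketch` and its behaviour (joining list entries with a comma and lower-casing them) are assumptions and may differ from the actual `dpf.split_dictionary_column`.

```python
import pandas as pd

def split_dictionary_column_sketch(df, column, keys):
    """Hypothetical stand-in: one lower-cased string column per dictionary key."""
    df = df.copy()
    for key in keys:
        # A missing key yields an empty string
        df[f'{column}_{key}'] = df[column].apply(
            lambda d: ', '.join(d.get(key, [])).lower()
        )
    # The original dictionary column is no longer needed
    return df.drop(columns=[column])

toy = pd.DataFrame({'ttlfull': [
    {'245': ['Emma', 'Roman']},
    {'245': ['Emma'], '246': ['Emma Woodhouse']},
]})
out = split_dictionary_column_sketch(toy, 'ttlfull', ['245', '246'])
print(out)
```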

## Determine Target Vector

In [13]:
goldstandard['masters']['035liste'][0]

['(IDSBB)002447452', '(ALEX)9912923344101791']

In [14]:
goldstandard['slaves'].docid.head()

0    000311049
1    00130724X
2    001817272
3    00236865X
4    00351031X
Name: docid, dtype: object

In [15]:
goldstandard['uniques'].docid.head()

0    000143235
1    00044801X
2    000996009
3    00239538X
4    002410559
Name: docid, dtype: object

In [16]:
goldstandard['masters'].docid.loc[0]

'264853032'

In [17]:
goldstandard['masters'][goldstandard['masters'].docid == goldstandard['uniques'].docid.loc[0]]

Unnamed: 0,docid,035liste,isbn,ttlpart,pubyear,decade,century,exactDate,edition,part,pages,...,coordinate_E,coordinate_N,corporate_110,corporate_710,format_prefix,format_postfix,person_100,person_700,person_245c,ttlfull_245,ttlfull_246


In [18]:
goldstandard['masters'].loc[3]

docid                                                     264853784
035liste          [(NATIONALLICENCE)oxford-10.1093/cid/ciu795, (...
isbn                                                             []
ttlpart           {'245': ['Combined Use of Mycobacterium tuberc...
pubyear                                                    20150201
decade                                                         2015
century                                                        2015
exactDate                                                  20150201
edition                                                            
part                                      60/3(2015-02-01), 432-437
pages                                                            []
volumes                                                            
pubinit                                                            
pubword                                                          []
scale                                           

In [19]:
goldstandard['slaves'].loc[347]

docid                                                     395019044
035liste               [(NATIONALLICENCE)oxford-10.1093/cid/ciu795]
isbn                                                             []
ttlpart           {'245': ['Combined Use of Mycobacterium tuberc...
pubyear                                                    20150201
decade                                                         2015
century                                                        2015
exactDate                                                  20150201
edition                                                            
part                                      60/3(2015-02-01), 432-437
pages                                                            []
volumes                                                            
pubinit                                                            
pubword                                                          []
scale                                           

In [20]:
goldstandard['uniques']['035liste'].head()

0    [(OCoLC)362722306, (ABN)000551177]
1    [(OCoLC)886929897, (ABN)000223034]
2    [(OCoLC)778386601, (ABN)000433604]
3    [(OCoLC)778561839, (ABN)000238844]
4    [(OCoLC)777853583, (ABN)000243260]
Name: 035liste, dtype: object

In [21]:
goldstandard['slaves']['035liste'].head()

0    [(OCoLC)731635279, (ABN)000539983]
1    [(OCoLC)808324878, (ABN)000155059]
2    [(OCoLC)231772550, (ABN)000096920]
3    [(OCoLC)887157168, (ABN)000223912]
4    [(OCoLC)887324690, (ABN)000548154]
Name: 035liste, dtype: object

In [22]:
goldstandard['masters']['035liste'].head()

0           [(IDSBB)002447452, (ALEX)9912923344101791]
1                 [(IDSBB)000369647, (NEBIS)005609528]
2    [(IDSLU)001293605, (IDSBB)006725532, (NEBIS)01...
3    [(NATIONALLICENCE)oxford-10.1093/cid/ciu795, (...
4    [(NATIONALLICENCE)oxford-10.1093/ndt/gft319, (...
Name: 035liste, dtype: object

In [23]:
goldstandard['slaves'].loc[[0, 187, 276, 424, 428, 429]]

Unnamed: 0,docid,035liste,isbn,ttlpart,pubyear,decade,century,exactDate,edition,part,pages,...,coordinate_E,coordinate_N,corporate_110,corporate_710,format_prefix,format_postfix,person_100,person_700,person_245c,ttlfull_245,ttlfull_246
0,311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008,[600 S.],...,,,,,bk,20000,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,"emma, roman",
187,196506476,"[(OCoLC)731635279, (NEBIS)009587153]",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008,[600 S.],...,,,,,bk,20000,austenjane1775-1817(de-588)118505173,,jane austen ; aus dem engl. übers. von ursula ...,emma,
276,323173349,"[(OCoLC)731635279, (LIBIB)000315536]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008,[600 S.],...,,,,,bk,20000,austenjane,,jane austen,"emma, roman",
424,491629737,"[(OCoLC)1002177443, (IDSBB)006726594]",[978-3-13-240808-1],{'245': ['EKG-Kurs für Isabel']},2017,2017,2017,2017uuuu,"7., überarbeitete und erweiterte Auflage",,[1 Online-Ressource],...,,,,,bk,20053,trappehans-joachim1954-(de-588)124942687verfas...,schusterhans-peter1937-(de-588)115586075verfas...,"hans-joachim trappe, hans-peter schuster",ekg-kurs für isabel,
428,495381160,"[(OCoLC)1002177443, (NEBIS)011045420, (OCoLC)1...",[978-3-13-240808-1],{'245': ['EKG-Kurs für Isabel']},2017,2017,2017,2017uuuu,"7., überarbeitete und erweiterte Auflage",,[1 Online-Ressource],...,,,,,bk,20053,trappehans-joachim1954-(de-588)124942687verfas...,schusterhans-peter1937-(de-588)115586075verfas...,"hans-joachim trappe, hans-peter schuster",ekg-kurs für isabel,
429,501860959,"[(VAUD)991010321879702852, (RNV)000202321-41bc...",[],"{'245': ['Neue Ausgabe sämtlicher Werke', 'Die...",19702006,1970,1970,19702006,,"werkgruppe 5, bd. 19",[1 partition (379 p.)],...,,,,,mu,10100,mozartwolfgang amadeus,"grubergernot, orelalfred, faberrudolf",wolfgang amadeus mozart ; in verbindung mit de...,"neue ausgabe sämtlicher werke, die zauberflöte...",


In [24]:
goldstandard['masters']['035liste'].loc[85]

['(NEBIS)009587153', '(LIBIB)000315536', '(ABN)000539983']

In [25]:
goldstandard['masters'].loc[85]

docid                                                     504389793
035liste          [(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...
isbn                                            [978-3-15-020008-7]
ttlpart                                           {'245': ['Emma']}
pubyear                                                    2009    
decade                                                         2009
century                                                        2009
exactDate                                                  2009uuuu
edition                                                            
part                                                          20008
pages                                                      [600 S.]
volumes                                                      600 s.
pubinit                                                      reclam
pubword                                                    [Reclam]
scale                                           

In [26]:
def add_master_docid_to_slave(df_s, df_m):
    """Determine docid of master and store on slave."""
    masters_docid = []

    # Search for the master of each slave record
    for identifiers in df_s['035liste']:
        loc_li = []
        for identifier in identifiers:
            # True for each master whose identifier list holds the identifier
            is_master = df_m['035liste'].apply(lambda li: identifier in li)
            if is_master.any():  # Skip identifiers without a master
                loc_li.append(df_m.loc[is_master, 'docid'].values[0])
        masters_docid.append(loc_li)

    # Store the foreign key lists in one assignment (avoids chained assignment)
    df_s['masters_docid'] = pd.Series(masters_docid, index=df_s.index)

    return df_s

In [27]:
goldstandard['slaves'] = add_master_docid_to_slave(goldstandard['slaves'], goldstandard['masters'])

goldstandard['slaves'].masters_docid.head()

0    [504389793]
1    [504390597]
2    [50439018X]
3    [504389513]
4    [504389823]
Name: masters_docid, dtype: object

In [28]:
# Proof that all docid_masters are unique...
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : set(x))
goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : list(x))

for i in range(len(goldstandard['slaves'])):
    if len(goldstandard['slaves'].masters_docid.loc[i]) != 1 :
        print('HALT!')
        break

goldstandard['slaves']['masters_docid'] = goldstandard['slaves']['masters_docid'].apply(lambda x : x[0])
goldstandard['slaves'].head()

Unnamed: 0,docid,035liste,isbn,ttlpart,pubyear,decade,century,exactDate,edition,part,pages,...,coordinate_N,corporate_110,corporate_710,format_prefix,format_postfix,person_100,person_700,person_245c,ttlfull_245,ttlfull_246,masters_docid
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],...,,,,bk,20000,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,"emma, roman",,504389793
1,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]",...,,,"metropolitan opera orchestra, metropolitan ope...",vm,10300,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen",,504390597
2,001817272,"[(OCoLC)231772550, (ABN)000096920]",[3-495-47879-5],"{'245': ['Der moralische Status der Tiere', 'H...",1999,1999,1999,1999uuuu,,,[316 S.],...,,,,bk,20000,fluryandreas,,andreas flury,"der moralische status der tiere, henry salt, p...",,50439018X
3,00236865X,"[(OCoLC)887157168, (ABN)000223912]",[],{'245': ['Die Zauberflöte']},uuuuuuuu,uuuu,uuuu,uuuuuuuu,,,[412 S.],...,,,,bk,20000,mozartwolfgang amadeus,,,die zauberflöte,,504389513
4,00351031X,"[(OCoLC)887324690, (ABN)000548154]",[978-1-4058-8214-9],{'245': ['Emma']},2008,2008,2008,2008uuuu,,,[64 S.],...,,,,bk,20000,austenjane1775-1817(de-588)118505173,strangejoanna,jane austen ; retold by joanna strange,emma,,504389823
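
The check in the cell above can also be written without an explicit loop. The following is a minimal sketch on made-up data, not the notebook's own code; `toy_slaves` and its values are hypothetical and only mirror the shape of the `masters_docid` column.

```python
import pandas as pd

# Hypothetical toy data: each slave record carries a list of matched master docids
toy_slaves = pd.DataFrame({
    'docid': ['a1', 'a2', 'a3'],
    'masters_docid': [['m1'], ['m1', 'm1'], ['m2']],
})

# Count the distinct master docids per slave record
n_masters = toy_slaves['masters_docid'].apply(lambda ids: len(set(ids)))

# True only if every slave record points to exactly one master
all_single = bool((n_masters == 1).all())

# Replace the one-element lists with their single value
if all_single:
    toy_slaves['masters_docid'] = toy_slaves['masters_docid'].apply(lambda ids: ids[0])
```

Counting distinct values per row makes it possible to list the offending records with `toy_slaves[n_masters != 1]` instead of stopping at the first one.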


In [29]:
result = pd.merge(left=goldstandard['slaves'], right=goldstandard['masters'], how='inner',
                  left_on='masters_docid', right_on='docid')

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(result.columns)

result.head()

Unnamed: 0,docid_x,035liste_x,isbn_x,ttlpart_x,pubyear_x,decade_x,century_x,exactDate_x,edition_x,part_x,pages_x,volumes_x,pubinit_x,pubword_x,scale_x,coordinate_x,doi_x,ismn_x,musicid_x,coordinate_E_x,coordinate_N_x,corporate_110_x,corporate_710_x,format_prefix_x,format_postfix_x,person_100_x,person_700_x,person_245c_x,ttlfull_245_x,ttlfull_246_x,masters_docid,docid_y,035liste_y,isbn_y,ttlpart_y,pubyear_y,decade_y,century_y,exactDate_y,edition_y,part_y,pages_y,volumes_y,pubinit_y,pubword_y,scale_y,coordinate_y,doi_y,ismn_y,musicid_y,coordinate_E_y,coordinate_N_y,corporate_110_y,corporate_710_y,format_prefix_y,format_postfix_y,person_100_y,person_700_y,person_245c_y,ttlfull_245_y,ttlfull_246_y
0,000311049,"[(OCoLC)731635279, (ABN)000539983]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam jun.,[Reclam jun.],,[],[],[],,,,,,bk,20000,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,"emma, roman",,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam,[Reclam],,[],[],[],,,,,,bk,20000,austenjane1775-1817(de-588)118505173,,jane austen ; aus dem engl. übers. von ursula ...,emma,
1,196506476,"[(OCoLC)731635279, (NEBIS)009587153]",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam,[Reclam],,[],[],[],,,,,,bk,20000,austenjane1775-1817(de-588)118505173,,jane austen ; aus dem engl. übers. von ursula ...,emma,,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam,[Reclam],,[],[],[],,,,,,bk,20000,austenjane1775-1817(de-588)118505173,,jane austen ; aus dem engl. übers. von ursula ...,emma,
2,323173349,"[(OCoLC)731635279, (LIBIB)000315536]",[978-3-15-020008-7],"{'245': ['Emma', 'Roman']}",2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam,[Reclam],,[],[],[],,,,,,bk,20000,austenjane,,jane austen,"emma, roman",,504389793,504389793,"[(NEBIS)009587153, (LIBIB)000315536, (ABN)0005...",[978-3-15-020008-7],{'245': ['Emma']},2009,2009,2009,2009uuuu,,20008.0,[600 S.],600 s.,reclam,[Reclam],,[],[],[],,,,,,bk,20000,austenjane1775-1817(de-588)118505173,,jane austen ; aus dem engl. übers. von ursula ...,emma,
3,00130724X,"[(OCoLC)808324878, (ABN)000155059]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,"[1 DVD-Video, DVD Region 0, 169 Min., farb.]","1 dvd-video, dvd region 0, 169 min., farb.",deutsche grammophon,[Deutsche Grammophon],,[],[],[],,,,,"metropolitan opera orchestra, metropolitan ope...",vm,10300,levinejamesdir.,"mozartwolfgang amadeus, levinejames, schikaned...",w. a. mozart ; libretto: emanuel schikaneder ;...,"die zauberflöte, oper in zwei aufzügen",,504390597,504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 dvd-video (169 min.),,[],,[],[],[],073 003-9,,,,"metropolitan opera, metropolitan opera, orches...",vm,10300,mozartwolfgang amadeus1756-1791(de-588)118584596,"schikanederemanuel1751-1812(de-588)11860757x, ...",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",
4,116188030,"[(OCoLC)884447694, (IDSBB)003690925]",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 dvd-video (169 min.),,[],,[],[],[],073 003-9,,,,"metropolitan opera, metropolitan opera, orches...",vm,10300,mozartwolfgang amadeus1756-1791(de-588)118584596,"schikanederemanuel1751-1812(de-588)11860757x, ...",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",,504390597,504390597,"[(IDSBB)003690925, (NEBIS)005645758, (RERO)R00...",[],"{'245': ['Die Zauberflöte', 'Oper in zwei Aufz...",2000,2000,2000,2000uuuu,,,[1 DVD-Video (169 Min.)],1 dvd-video (169 min.),,[],,[],[],[],073 003-9,,,,"metropolitan opera, metropolitan opera, orches...",vm,10300,mozartwolfgang amadeus1756-1791(de-588)118584596,"schikanederemanuel1751-1812(de-588)11860757x, ...",w. a. mozart ; libretto: emanuel schikaneder,"die zauberflöte, oper in zwei aufzügen : kv 620",


In [30]:
len(result)

435

In [31]:
result.loc[408]

docid_x                                                  404220762
035liste_x                     [(SNL)vtls001860448, (Sz)001860448]
isbn_x                                                          []
ttlpart_x        {'245': ['Informatique de santé - Communicatio...
pubyear_x                                                 2013    
                                       ...                        
person_100_y                                                      
person_700_y                                                      
person_245c_y                                                     
ttlfull_245_y    health informatics - personal health device co...
ttlfull_246_y    informatique de santé - communication entre di...
Name: 408, Length: 61, dtype: object

In [32]:
def build_duplicate_pairs (df):
    """Build all pairs of records that share the same master docid.

    The self-join returns every ordered pair, including the pair of a
    record with itself.
    """

    return pd.merge(left=df, right=df, how='inner', on='masters_docid')

In [33]:
duplicates = build_duplicate_pairs(goldstandard['slaves'])
duplicates['duplicates'] = 1

len(duplicates)

1473
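
The row count of such a self-join follows from the group sizes: a group of n records sharing one master contributes n × n ordered pairs, self-pairs included. A small sketch with hypothetical data (`toy` and its docids are made up) illustrates the arithmetic.

```python
import pandas as pd

# Hypothetical slave records: three share master m1, two share master m2
toy = pd.DataFrame({
    'docid': ['s1', 's2', 's3', 's4', 's5'],
    'masters_docid': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

# Same self-join as build_duplicate_pairs
pairs = pd.merge(toy, toy, how='inner', on='masters_docid')

# Each group of size n yields n * n rows: 3**2 + 2**2 = 13
group_sizes = toy['masters_docid'].value_counts()
expected_rows = int((group_sizes ** 2).sum())

# Optional filter, not applied in this notebook: drop pairs of a record with itself
pairs_without_self = pairs[pairs['docid_x'] != pairs['docid_y']]
```

Whether self-pairs should stay in the training data is a modelling decision; the sketch only shows how they could be separated.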

In [34]:
duplicates.loc[1000]

docid_x                                         190326522
035liste_x           [(OCoLC)248381623, (NEBIS)008641750]
isbn_x                                    [3-13-127283-X]
ttlpart_x                {'245': ['EKG-Kurs für Isabel']}
pubyear_x                                        2001    
                                   ...                   
person_700_y                           trappehans-joachim
person_245c_y    hans-peter schuster, hans-joachim trappe
ttlfull_245_y                         ekg-kurs für isabel
ttlfull_246_y                                            
duplicates                                              1
Name: 1000, Length: 62, dtype: object

In [35]:
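# Pair every slave record with every unique record.
# Note: df_s_1 and df_u_1 are references, so the 'duplicates' column
# is added to goldstandard['slaves'] and goldstandard['uniques'] as well.
df_s_1 = goldstandard['slaves']
df_u_1 = goldstandard['uniques']
df_s_1['duplicates'] = 0
df_u_1['duplicates'] = 0
# The constant key 0 on both sides turns the merge into a cross join
non_duplicates = pd.merge(df_s_1, df_u_1, on='duplicates')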

len(non_duplicates)

259260
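
The constant-key merge above acts as a cross join, so the number of rows equals the product of the two record counts. The sketch below uses made-up frames (`toy_slaves`, `toy_uniques`) and compares the constant-key approach with `how='cross'`, which `pd.merge` supports from pandas 1.2 on; it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy records
toy_slaves = pd.DataFrame({'docid': ['s1', 's2', 's3']})
toy_uniques = pd.DataFrame({'docid': ['u1', 'u2']})

# Constant-key approach: assign() returns copies, the inputs stay unchanged
left = toy_slaves.assign(duplicates=0)
right = toy_uniques.assign(duplicates=0)
by_key = pd.merge(left, right, on='duplicates')

# Cross join (pandas >= 1.2), target column added afterwards
by_cross = pd.merge(toy_slaves, toy_uniques, how='cross')
by_cross['duplicates'] = 0

# Both variants contain the same 3 * 2 = 6 record pairs
pairs_by_key = set(zip(by_key['docid_x'], by_key['docid_y']))
pairs_by_cross = set(zip(by_cross['docid_x'], by_cross['docid_y']))
```

Using `assign` keeps the helper column out of the source frames, which the in-place variant does not.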

In [36]:
# Number of duplicate pairs relative to non-duplicate pairs, in percent
print(len(duplicates)/len(non_duplicates)*100)

0.5681555195556585


In [37]:
non_duplicates.loc[0]

docid_x                                     000311049
035liste_x         [(OCoLC)731635279, (ABN)000539983]
isbn_x                            [978-3-15-020008-7]
ttlpart_x                  {'245': ['Emma', 'Roman']}
pubyear_x                                    2009    
                                 ...                 
person_100_y     austenjane1775-1817(de-588)118505173
person_700_y                                         
person_245c_y                             jane austen
ttlfull_245_y                             emma, roman
ttlfull_246_y                                        
Name: 0, Length: 62, dtype: object

In [38]:
duplicates.columns

Index(['docid_x', '035liste_x', 'isbn_x', 'ttlpart_x', 'pubyear_x', 'decade_x',
       'century_x', 'exactDate_x', 'edition_x', 'part_x', 'pages_x',
       'volumes_x', 'pubinit_x', 'pubword_x', 'scale_x', 'coordinate_x',
       'doi_x', 'ismn_x', 'musicid_x', 'coordinate_E_x', 'coordinate_N_x',
       'corporate_110_x', 'corporate_710_x', 'format_prefix_x',
       'format_postfix_x', 'person_100_x', 'person_700_x', 'person_245c_x',
       'ttlfull_245_x', 'ttlfull_246_x', 'masters_docid', 'docid_y',
       '035liste_y', 'isbn_y', 'ttlpart_y', 'pubyear_y', 'decade_y',
       'century_y', 'exactDate_y', 'edition_y', 'part_y', 'pages_y',
       'volumes_y', 'pubinit_y', 'pubword_y', 'scale_y', 'coordinate_y',
       'doi_y', 'ismn_y', 'musicid_y', 'coordinate_E_y', 'coordinate_N_y',
       'corporate_110_y', 'corporate_710_y', 'format_prefix_y',
       'format_postfix_y', 'person_100_y', 'person_700_y', 'person_245c_y',
       'ttlfull_245_y', 'ttlfull_246_y', 'duplicates'],
      dtype='object')

## Build Feature Base

In [39]:
dupes = duplicates[columns_metadata_dict['columns_to_use']]
non_dupes = non_duplicates[columns_metadata_dict['columns_to_use']]

frames = [dupes, non_dupes]

df_feature_base = pd.concat(frames)
df_feature_base.head()

Unnamed: 0,duplicates,coordinate_E_x,coordinate_E_y,coordinate_N_x,coordinate_N_y,corporate_110_x,corporate_110_y,corporate_710_x,corporate_710_y,doi_x,doi_y,edition_x,edition_y,exactDate_x,exactDate_y,format_prefix_x,format_prefix_y,format_postfix_x,format_postfix_y,isbn_x,isbn_y,musicid_x,musicid_y,part_x,part_y,person_100_x,person_100_y,person_700_x,person_700_y,person_245c_x,person_245c_y,pubinit_x,pubinit_y,scale_x,scale_y,ttlfull_245_x,ttlfull_245_y,ttlfull_246_x,ttlfull_246_y,volumes_x,volumes_y
0,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula","grawechristian, graweursula",jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem englischen übersetzt von...,reclam jun.,reclam jun.,,,"emma, roman","emma, roman",,,600 s.,600 s.
1,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen ; aus dem engl. übers. von ursula ...,reclam jun.,reclam,,,"emma, roman",emma,,,600 s.,600 s.
2,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane,"grawechristian, graweursula",,jane austen ; aus dem englischen übersetzt von...,jane austen,reclam jun.,reclam,,,"emma, roman","emma, roman",,,600 s.,600 s.
3,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,"grawechristian, graweursula",jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem englischen übersetzt von...,reclam,reclam jun.,,,emma,"emma, roman",,,600 s.,600 s.
4,1,,,,,,,,,[],[],,,2009uuuu,2009uuuu,bk,bk,20000,20000,[978-3-15-020008-7],[978-3-15-020008-7],,,20008,20008,austenjane1775-1817(de-588)118505173,austenjane1775-1817(de-588)118505173,,,jane austen ; aus dem engl. übers. von ursula ...,jane austen ; aus dem engl. übers. von ursula ...,reclam,reclam,,,emma,emma,,,600 s.,600 s.
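
One detail of `pd.concat` is worth keeping in mind here: by default it keeps the row labels of both inputs, so the combined frame can contain the same index label more than once. The sketch below shows this on hypothetical frames (`toy_dupes`, `toy_non_dupes`); whether renumbering is wanted for the feature base is an assumption, not something this notebook states.

```python
import pandas as pd

# Hypothetical stand-ins for the duplicate and non-duplicate pair frames
toy_dupes = pd.DataFrame({'duplicates': [1, 1], 'isbn_x': ['a', 'b']})
toy_non_dupes = pd.DataFrame({'duplicates': [0, 0, 0], 'isbn_x': ['c', 'd', 'e']})

# Default behaviour: labels 0 and 1 appear in both inputs and are kept
stacked = pd.concat([toy_dupes, toy_non_dupes])
has_repeated_labels = stacked.index.has_duplicates

# ignore_index=True renumbers the rows from 0 to len - 1
renumbered = pd.concat([toy_dupes, toy_non_dupes], ignore_index=True)
```

With repeated labels, a lookup such as `stacked.loc[0]` returns several rows, which can be surprising in later processing steps.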


## Goldstandard DataFrame Handover

In [40]:
import pickle as pk

# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'columns_metadata.pkl'), 'wb') as df_output_file:
    pk.dump(columns_metadata_dict, df_output_file)

# Binary intermediary DataFrame file
with open(os.path.join(path_goldstandard, 'feature_base_df.pkl'), 'wb') as df_output_file:
    pk.dump(df_feature_base, df_output_file)

The DataFrame developed in this chapter is saved to a pickle file, which hands over the data in processed form. The next chapter, [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb), reads this file as its input.
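
The handover can be pictured as a simple round trip. The following sketch writes a small, made-up DataFrame to a temporary directory and reads it back; the file name matches the one used above, while `toy_feature_base` and the temporary path are assumptions for illustration only.

```python
import os
import pickle as pk
import tempfile

import pandas as pd

# Hypothetical stand-in for df_feature_base
toy_feature_base = pd.DataFrame({
    'duplicates': [1, 0],
    'ttlfull_245_x': ['emma, roman', 'emma'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'feature_base_df.pkl')

    # Write, as done at the end of this chapter
    with open(path, 'wb') as handle:
        pk.dump(toy_feature_base, handle)

    # Read, as a following chapter might do
    with open(path, 'rb') as handle:
        restored = pk.load(handle)

# The restored frame has the same values, dtypes and index
round_trip_ok = restored.equals(toy_feature_base)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.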

## Summary

In [41]:
columns_metadata_dict

{'data_analysis_columns': ['coordinate_E',
  'coordinate_N',
  'corporate_110',
  'corporate_710',
  'doi',
  'edition',
  'exactDate',
  'format_prefix',
  'format_postfix',
  'isbn',
  'musicid',
  'part',
  'person_100',
  'person_700',
  'person_245c',
  'pubinit',
  'scale',
  'ttlfull_245',
  'ttlfull_246',
  'volumes'],
 'columns_to_use': ['duplicates',
  'coordinate_E_x',
  'coordinate_E_y',
  'coordinate_N_x',
  'coordinate_N_y',
  'corporate_110_x',
  'corporate_110_y',
  'corporate_710_x',
  'corporate_710_y',
  'doi_x',
  'doi_y',
  'edition_x',
  'edition_y',
  'exactDate_x',
  'exactDate_y',
  'format_prefix_x',
  'format_prefix_y',
  'format_postfix_x',
  'format_postfix_y',
  'isbn_x',
  'isbn_y',
  'musicid_x',
  'musicid_y',
  'part_x',
  'part_y',
  'person_100_x',
  'person_100_y',
  'person_700_x',
  'person_700_y',
  'person_245c_x',
  'person_245c_y',
  'pubinit_x',
  'pubinit_y',
  'scale_x',
  'scale_y',
  'ttlfull_245_x',
  'ttlfull_245_y',
  'ttlfull_246_x',
