# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-MAR-2020

## Table of Contents

- [Introduction](#Introduction)
- [Structure of the Project](#Structure-of-the-Project)
- [Runs and Results](#Runs-and-Results)
    - [Runtime Parameters](#Runtime-Parameters)
    - [Run 1 - Full](#Run-1---Full)
- [Assessment of Results](#Assessment-of-Results)
- [Summary](#Summary)

## Introduction

[Proposal](./project-proposal-andreas-jud.ipynb)

## Structure of the Project

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
1. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
1. [Data Synthesizing](./3_DataSynthesizing.ipynb)
1. [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb)
1. [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb)
1. [Decision Tree Model](./6_DecisionTreeModel.ipynb)
1. [Support Vector Classifier Model](./7_SVCModel.ipynb)
1. [Neural Network Model](./8_NeuralNetwork.ipynb)

Appendix

- [A. References](./A_References.ipynb)
- [B. Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb)

## Runs and Results

This section first explains the runtime parameters with which the notebooks of the capstone project can be called. Once the parameter space is defined, a series of runs is executed, each with a different set of parameter values.

### Runtime Parameters

The notebooks of this capstone project can be called with six specific global parameters. These parameters are listed and explained here.

- $\texttt{execution}\_\texttt{mode}$ - This parameter has been introduced to control the runtime of execution. For the models, grid search has been implemented with the goal of finding the best parameters for a model. The bigger the grid space, i.e. the more points it has in each of its dimensions, the longer a notebook runs. Finding the best parameters for a model requires scanning a wide grid space, which may extend the runtime to hours. For some runs, a smaller grid space is sufficient, and a restricted grid space can be chosen to save calculation time. The execution mode of a notebook takes one of two values.
    - Mode $\texttt{full}$ is used when the notebook is called from this very chapter, which collects the results of each notebook for final comparison and assessment.
    - Mode $\texttt{restricted}$ is mainly, but not exclusively, used for executing the notebook locally, i.e. opening it manually and running it cell by cell. The original purpose of this mode has been to open the notebook and read its text, focusing on the contents and the specific explanations for a model. Runtime is supposed to be short in this mode. The grid parameters chosen for this mode are derived from the results of the full-mode runs of this chapter.
- $\texttt{factor}$ - In Swissbib's raw data, records may have missing values in attributes. When building pairs of records for the feature matrix, an attribute may have a value on both sides of a pair, a missing value on one side, or missing values on both sides; see chapters [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb) for a deeper discussion. Missing values may influence the model. For that reason, the features of pairs with missing attribute values are marked by transforming them to a negative similarity value. During implementation, the question arose how the distance from the origin (similarity value of 0) on the negative similarity side would influence a model, especially a Neural Network, because the firing of its nodes depends linearly on the input values. This factor has been introduced to set that distance. In the implemented code, the factor ...
    - multiplies -0.5 if the attribute is missing on one side of the pair.
    - multiplies -1.0 if the attribute is missing on both sides of the pair.
- $\texttt{oversampling}$ - The number of duplicate records generated from Swissbib's goldstandard data is low compared to the number of unique records. As a consequence, class balancing has generally been used for model fitting. In order to increase the ratio of duplicates in the training and testing data, oversampling with synthetic data has been tried. Parameter $\texttt{oversampling}$ controls this ratio: synthetic data is multiplied in a for loop until the oversampling ratio, given in percent (%), is reached in the final data set for model calculation. If $\texttt{oversampling}=0$, no synthetic data is added to the goldstandard data. This parameter is used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).
- $\texttt{modification}\_\texttt{ratio}$ - This parameter is used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), too. In that chapter, specific kinds of data modification (typos) to be simulated are defined for each attribute. For an attribute with one or more kinds of modification, this parameter controls the ratio and therefore the number of records that are modified.
- $\texttt{mode}\_\texttt{exactDate}$ - The basic similarity metric of attribute $\texttt{exactDate}$ is modified in the presence of unknown values, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) for implementation details. Two different modes of modifying the basic similarity metric have been implemented. Parameter $\texttt{mode}\_\texttt{exactDate}$ selects one of them.
- $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ - Swissbib's raw data provide attributes $\texttt{scale}$, $\texttt{part}$, and $\texttt{volumes}$ as full-text strings. Swissbib's deduplication engine extracts their digits in a preprocessing step with the goal of generating more reliable results. A very basic stripping function has been implemented in this capstone project to imitate Swissbib's more sophisticated logic. The model result may change as a function of the similarity values of these three attributes. To assess the effect of stripping the attribute values, parameter $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ switches the stripping logic on ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{True}$) and off ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$).
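
The marking of missing values controlled by $\texttt{factor}$ can be illustrated with a minimal sketch. The function name `mark_missing` and the use of `None` for a missing attribute value are assumptions made for this illustration; the actual implementation lives in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and may differ.

```python
def mark_missing(similarity, value_x, value_y, factor=0.1):
    """Return the similarity of a pair, or a negative marker value
    if the attribute is missing on one or both sides.

    Hypothetical helper: a missing value is represented by None here.
    """
    missing_x = value_x is None
    missing_y = value_y is None
    if missing_x and missing_y:
        # Attribute missing on both sides of the pair
        return -1.0 * factor
    if missing_x or missing_y:
        # Attribute missing on one side of the pair
        return -0.5 * factor
    # Both values present, keep the calculated similarity
    return similarity


print(mark_missing(0.85, 'reclam', 'reclam jun.'))  # 0.85
print(mark_missing(0.0, None, 'decca record'))      # -0.05
print(mark_missing(0.0, None, None))                # -0.1
```

With $\texttt{factor}=0.1$, the markers are -0.05 and -0.1, so pairs with missing values stay close to the origin. With $\texttt{factor}=1.0$, they move to -0.5 and -1.0.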

In [1]:
# Setting parameters for runs

# Set run mode for each notebook :
#  full = full grid search space is scanned, results are collected.
#  restricted = small grid search space is scanned for local runs.
execution_mode='full'
#execution_mode = 'restricted'

# Factor for missing attributes, chapter 4 and graphs in chapter 5 :
#  -0.5*factor : One attribute of the pair is missing.
#  -1.0*factor : Both attributes of the pair are missing.
factor = 0.1
#factor = 1.0

# Factor of oversampling with synthetic data, chapter 3 :
oversampling = 0
#oversampling = 20

# Ratio to which the attributes with a value will be modified in the ...
#  synthetic data generation of chapter 3.
modification_ratio = 0.2

# Function applied to exactDate attribute to increase value of ...
#  strings with characters indicating 'unknown' digits.
#mode_exactDate = 'added_u'
mode_exactDate = 'xor'

# Decides whether for attributes 'scale', 'part', and 'volumes', the full string ...
#  shall be stripped to number digits (True) or shall be left as is (False).
strip_number_digits = True
#strip_number_digits = False

# Generate dictionary for parameter handover
runtime_param_dict = {
    'em' : execution_mode,
    'fa' : factor,
    'os' : oversampling,
    'mr' : modification_ratio,
    'me' : mode_exactDate,
    'sn' : strip_number_digits,
    'notebook_name' : ''
}

To execute the notebooks of the capstone project, functions of the Python library $\texttt{nbparameterise}$ are used.

In [2]:
#! pip install nbparameterise

### Run 1 - Full

In [3]:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import nbparameterise as nbp
import os
import results_saving_funcs as rsf

path_results = './results'
# Determine all relevant notebooks, omit Overview Summary and Appendixes
a = ! ls [1-9]_* | grep .ipynb

for i in range(len(a)):
    print('Executing notebook', a[i])
    with open(a[i]) as notebook:
        nb = nbformat.read(notebook, as_version=4)
        
        # Get list of parameter objects
        orig_parameters = nbp.extract_parameters(nb)
        # Update parameters
        params = nbp.parameter_values(orig_parameters,
                                      execution_mode=runtime_param_dict['em'],
                                      factor=runtime_param_dict['fa'],
                                      oversampling=runtime_param_dict['os'],
                                      modification_ratio = runtime_param_dict['mr'],
                                      mode_exactDate = runtime_param_dict['me'],
                                      strip_number_digits = runtime_param_dict['sn']
                                     )
        # Make notebook object with these definitions, ...
        nb = nbp.replace_definitions(nb, params, execute=False)

        ep = ExecutePreprocessor(timeout=None)
        # ... and execute it.
        ep.preprocess(nb, {"metadata": {"path": './'}})
    # Save notebook run in result file
    runtime_param_dict.update({'notebook_name' : a[i]})
    rsf.save_notebook_results(nb, path_results, runtime_param_dict)

print('Done with all notebooks.')

Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb


CellExecutionError: An error occurred while executing the following cell:
------------------
import results_analysis_funcs as raf
import results_saving_funcs as rsf

df_feature_base_full_tr = df_attribute_with_sim_feature.iloc[idx_tr]

idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred)

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']

for i in wrong_prediction_groups :
    rsf.add_wrong_predictions(path_goldstandard, 
                              model_best, i, df_feature_base_full_te.loc[idx[i]])
------------------

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-16-04c6308b145a> in <module>
     11 for i in wrong_prediction_groups :
     12     rsf.add_wrong_predictions(path_goldstandard, 
---> 13                               model_best, i, df_feature_base_full_te.loc[idx[i]])

NameError: name 'df_feature_base_full_te' is not defined


In [45]:
import pandas as pd

path_goldstandard = './daten_goldstandard'

results = rsf.restore_dict_results(path_goldstandard, 'results.pkl')

results['results_best_model'].reset_index(drop=True, inplace=True)
# Ranking metric according to chapter 6 : roc auc
results['results_best_model'].sort_values('auc', ascending=False)

| | model | auc | accuracy | precision | recall | auc_log | accuracy_log | precision_log | recall_log |
|---|---|---|---|---|---|---|---|---|---|
| 4 | DecisionTreeClassifier_CV | 98.28764 | 99.946036 | 94.059406 | 96.610169 | 4.067298 | 7.5246 | 2.823361 | 3.38439 |
| 3 | DecisionTreeClassifier | 98.118148 | 99.944108 | 94.039735 | 96.271186 | 3.972914 | 7.489508 | 2.820055 | 3.28908 |
| 5 | RandomForestClassifier | 97.949626 | 99.944108 | 94.333333 | 95.932203 | 3.887148 | 7.489508 | 2.870569 | 3.202069 |
| 6 | Neural Network | 97.605797 | 99.930617 | 92.739274 | 95.254237 | 3.73212 | 7.273285 | 2.62269 | 3.047918 |
| 7 | Neural Network | 97.095384 | 99.920981 | 92.05298 | 94.237288 | 3.538869 | 7.143232 | 2.532373 | 2.853762 |
| 2 | SVC_CV | 95.731699 | 99.890144 | 89.403974 | 91.525424 | 3.153954 | 6.813753 | 2.244691 | 2.4681 |
| 1 | SVC | 95.057609 | 99.890144 | 90.47619 | 90.169492 | 3.007321 | 6.813753 | 2.351375 | 2.31968 |
| 0 | DummyClassifier | 49.898126 | 98.893729 | 0.355872 | 0.338983 | 0.691112 | 4.504175 | 0.003565 | 0.003396 |
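
The values in the $\texttt{\_log}$ columns of the table above are numerically consistent with $-\ln(1 - \text{score})$, where the score is taken as a fraction rather than in percent. This reading is inferred from the printed numbers only and is an assumption; the function `log_score` below is a hypothetical helper, not the function used in the notebooks.

```python
import math


def log_score(score_percent):
    """Map a score in percent to -ln(1 - score).

    Assumed reading of the *_log columns, inferred from the table values.
    """
    error = 1.0 - score_percent / 100.0
    if error <= 0.0:
        # A perfect score has no finite value on this scale
        return math.inf
    return -math.log(error)


# Values of DecisionTreeClassifier_CV from the table above
print(round(log_score(98.28764), 2))   # auc      -> 4.07
print(round(log_score(99.946036), 2))  # accuracy -> 7.52
```

On this scale, scores close to 100% are spread out, which makes small differences between the best models easier to compare.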


In [46]:
import time

pd.options.display.max_rows = 200

# For timestamp in filename
tmstmp = time.strftime("%Y%m%d-%H%M%S")

for classifier in results['results_model_scores'].keys() :
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))
    results['results_model_scores'][classifier].to_csv(os.path.join(path_results, classifier + '_'
                                                                    + tmstmp + '.csv'), index=False)


DummyClassifier



SVC


| | C | class_weight | degree | gamma | kernel | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | | 3 | 2.0 | poly | 0.999464 | 0.998892 | 7.531305 | 6.805024 |



SVC_CV


| | C | class_weight | degree | gamma | kernel | accuracy_val | std_accuracy_val | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | | 3 | 2.0 | poly | 0.999056 | 0.000169 | -6.964974 |



DecisionTreeClassifier


| | class_weight | criterion | max_depth | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|
| 9 | balanced | gini | 20.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 10 | balanced | gini | 22.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 16 | balanced | gini | 50.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 15 | balanced | gini | 40.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 14 | balanced | gini | 35.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 13 | balanced | gini | 28.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 12 | balanced | gini | 26.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 11 | balanced | gini | 24.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 17 | balanced | gini | | 1.0 | 0.999253 | -inf | 7.199678 |
| 8 | balanced | gini | 18.0 | 0.999976 | 0.999133 | 10.633647 | 7.050147 |



DecisionTreeClassifier_CV


| | class_weight | criterion | max_depth | accuracy_val | std_accuracy_val | log_accuracy_val |
|---|---|---|---|---|---|---|
| 12 | balanced | gini | 26.0 | 0.999292 | 0.000153 | 7.252656 |
| 17 | balanced | gini | | 0.999287 | 0.000152 | 7.245876 |
| 16 | balanced | gini | 50.0 | 0.999287 | 0.000152 | 7.245876 |
| 15 | balanced | gini | 40.0 | 0.999287 | 0.000152 | 7.245876 |
| 14 | balanced | gini | 35.0 | 0.999287 | 0.000152 | 7.245876 |
| 13 | balanced | gini | 28.0 | 0.999287 | 0.000152 | 7.245876 |
| 11 | balanced | gini | 24.0 | 0.999287 | 0.000145 | 7.245876 |
| 10 | balanced | gini | 22.0 | 0.999215 | 0.000176 | 7.149339 |
| 9 | balanced | gini | 20.0 | 0.999147 | 0.000193 | 7.06694 |
| 8 | balanced | gini | 18.0 | 0.999003 | 0.000147 | 6.91037 |



RandomForestClassifier


| | class_weight | max_depth | n_estimators | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|
| 2 | | | 75 | 1.0 | 0.999518 | -inf | 7.637933 |
| 3 | | | 100 | 1.0 | 0.999518 | -inf | 7.637933 |
| 0 | | 20.0 | 75 | 0.999982 | 0.99947 | 10.92133 | 7.542623 |
| 1 | | 20.0 | 100 | 0.999988 | 0.99947 | 11.326795 | 7.542623 |



Neural Network


| | class_weight | dropout_rate | l2_alpha | number_of_hidden1_layers | number_of_hidden2_layers | sgd_learnrate | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 40 | 70 | 0.002 | 0.99922 | 0.999002 | 7.156693 | 6.909427 |
| 1 | | 0.1 | 0.0 | 40 | 70 | 0.002 | 0.999218 | 0.998994 | 7.154249 | 6.901731 |
| 7 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 70 | 0.002 | 0.999298 | 0.99899 | 7.262138 | 6.897887 |
| 3 | | 0.1 | 0.0 | 60 | 70 | 0.002 | 0.999288 | 0.998928 | 7.247297 | 6.838617 |
| 2 | | 0.1 | 0.0 | 60 | 0 | 0.002 | 0.999113 | 0.998901 | 7.028162 | 6.813743 |
| 4 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 40 | 0 | 0.002 | 0.999032 | 0.998867 | 6.940792 | 6.782665 |
| 6 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 0 | 0.002 | 0.999069 | 0.998817 | 6.979353 | 6.739392 |
| 0 | | 0.1 | 0.0 | 40 | 0 | 0.002 | 0.999007 | 0.998767 | 6.915187 | 6.697904 |


## Assessment of Results

In [47]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions.pkl')

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
fpu, fpd = {}, {}

for i in wrong_predictions.keys() :
    fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
    fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

print(wrong_prediction_groups[0])
for i in fpu.keys() :
    print(i, len(fpu[i]), '\n', fpu[i])
print('')
print(wrong_prediction_groups[1])
for i in fpd.keys() :
    print(i, len(fpd[i]), '\n', fpd[i])

false_predicted_uniques
SVC_CV 25 
 [155, 264, 267, 303, 432, 465, 471, 474, 476, 486, 550, 641, 642, 691, 724, 821, 900, 916, 921, 924, 931, 932, 1039, 1128, 1256]
RandomForestClassifier 12 
 [632, 691, 900, 911, 914, 921, 924, 931, 932, 1014, 1039, 1254]
Neural Network 17 
 [6, 264, 267, 297, 303, 432, 465, 471, 641, 642, 685, 883, 916, 921, 924, 1158, 1256]

false_predicted_duplicates
SVC_CV 32 
 [30547, 32326, 34747, 49754, 51294, 56220, 60386, 60958, 64500, 67912, 68059, 68062, 69499, 89246, 103087, 103658, 105513, 121358, 129309, 135180, 139980, 143206, 149395, 149988, 150037, 152254, 165879, 179983, 197295, 198538, 214519, 216943]
RandomForestClassifier 17 
 [3652, 43593, 51294, 56220, 64500, 67321, 68059, 80378, 89246, 100540, 103087, 135180, 160727, 179983, 196996, 198366, 198538]
Neural Network 24 
 [3652, 30547, 32326, 43593, 49754, 51294, 60958, 64500, 67912, 68059, 68062, 89246, 100540, 103658, 118782, 135180, 139980, 165879, 179983, 196658, 196996, 197599, 198366, 198538]
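
One way to compare the models is to look for pairs that every model predicts wrongly. The following minimal sketch intersects the lists of false predicted uniques printed above. The index lists are copied from that output; the variable names `fpu_printed` and `common_to_all` are chosen for this illustration only.

```python
# Indices of false predicted uniques, copied from the output above
fpu_printed = {
    'SVC_CV': [155, 264, 267, 303, 432, 465, 471, 474, 476, 486, 550,
               641, 642, 691, 724, 821, 900, 916, 921, 924, 931, 932,
               1039, 1128, 1256],
    'RandomForestClassifier': [632, 691, 900, 911, 914, 921, 924, 931,
                               932, 1014, 1039, 1254],
    'Neural Network': [6, 264, 267, 297, 303, 432, 465, 471, 641, 642,
                       685, 883, 916, 921, 924, 1158, 1256],
}

# Pairs that all three models wrongly predict as uniques
common_to_all = set.intersection(*(set(v) for v in fpu_printed.values()))
print(sorted(common_to_all))  # [921, 924]
```

Pairs in this intersection are candidates for a closer look at their attribute values, as they are hard to classify for all three models.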

In [48]:
# Restore DataFrame with attributes and similarity values
df_attribute_with_sim_feature = pd.read_pickle(os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl'), compression=None)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [49]:
pd.options.display.max_columns = 200

# Loop variable i still holds the last key of the loop above, 'Neural Network'
df_attribute_with_sim_feature.iloc[fpu[i]]

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
6,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,2009aaaa,2009uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[978-3-15-020008-7],[978-3-15-020008-7],-0.1,,,-0.1,,,1.0,20008,20008,1.0,austenjane,austenjane,0.69774,jane austen,jane austen ; aus dem englischen übersetzt von...,-0.05,,"grawechristian, graweursula",0.848485,reclam,reclam jun.,-0.1,,,1.0,"emma, roman","emma, roman",-0.1,,,1.0,600,600
264,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,20062005,2006uuuu,1.0,10300,10300,1.0,vm,vm,0.0,[],[3-7655-8593-9],-0.1,,,1.0,501326.0,501326.0,-0.1,,,-0.1,,,1.0,ein film von luc jacquet,ein film von luc jacquet,1.0,jacquetluc,jacquetluc,-0.05,,kinowelt home entertainment arthaus,-0.1,,,1.0,die reise der pinguine,die reise der pinguine,-0.1,,,0.75,1 82,1
267,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,2006aaaa,20062005,1.0,10300,10300,1.0,vm,vm,0.0,[3-7655-8593-9],[],-0.1,,,1.0,501326.0,501326.0,-0.1,,,-0.1,,,1.0,ein film von luc jacquet,ein film von luc jacquet,1.0,jacquetluc,jacquetluc,-0.05,kinowelt home entertainment arthaus,,-0.1,,,1.0,die reise der pinguine,die reise der pinguine,-0.1,,,0.75,1,1 82
297,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,2007aaaa,2007uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-0.1,,,1.0,502023.0,502023.0,-0.1,,,-0.1,,,0.830688,ein film von volker schlöndorff,ein film von volker schlöndorff ; nach dem rom...,0.639303,"schlöndorffvolker, frischmax, shepardsam, delp...","frischmax, schlöndorffvolker",0.916667,"kinowelt home entertainment, arthaus",kinowelt home entertainment,-0.1,,,1.0,homo faber,homo faber,-0.1,,,0.733333,2,2 109
303,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,2007aaaa,20071990,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-0.1,,,1.0,502023.0,502023.0,-0.1,,,-0.1,,,0.588685,ein film von volker schlöndorff,regie: volker schlöndorff ; drehbuch: volker s...,0.851843,"schlöndorffvolker, frischmax, shepardsam, delp...","schlöndorffvolker, wurlitzerrudy, frischmax, m...",0.916667,"kinowelt home entertainment, arthaus",kinowelt home entertainment,-0.1,,,1.0,homo faber,homo faber,-0.1,,,0.733333,2,2 109
432,1,-0.1,,,-0.1,,,-0.05,wiener philharmoniker,,-0.1,,,-0.1,,,0.75,1991aaaa,1991uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-0.1,,,0.428571,433210.0,171433.0,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.713536,wolfgang amadeus mozart ; wiener philharmonike...,von emanuel schikaneder ; wolfgang amadeus mozart,0.587407,soltigeorg,schikanederemanuel,-0.05,,decca record,-0.1,,,0.798246,die zauberflöte,"die zauberflöte, oper in zwei aufzügen",-0.1,,,1.0,2 152,2 152
465,1,-0.1,,,-0.1,,,-0.05,,"rundfunkchor, sächsische staatskapelle dresden",-0.1,,,-0.1,,,0.5,aaaaaaaa,1991uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-0.1,,,-0.05,,422.0,-0.05,,43,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.672418,wolfgang amadeus mozart,[musik]: wolfgang amadeus mozart; libretto: em...,0.546377,"mathisedith, karajanherbert von","mollkurt, serraluciana, pricemargaret, venutim...",-0.05,,[phonogram],-0.1,,,0.545964,zauberflöte,"die zauberflöte, kv 620 : eine deutsche oper i...",-0.1,,,0.777778,3,3 1
471,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.5,19201929,uuuuuuuu,1.0,10200,10200,1.0,mu,mu,1.0,[],[],-0.1,,,-0.05,245.0,,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.67731,von emanuel schikaneder ; [musik von] wolfgang...,von w.a. mozart ; klavierauszug neu rev. von w...,-0.05,"schikanederemanuel, kienzlwilhelm",,-0.05,universal edition,,-0.1,,,0.854023,"die zauberflöte, il flauto magico : oper in zw...","die zauberflöte, oper in 2 akten = il flauto m...",-0.1,,,1.0,1 167,1 167
641,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.625,19aaaaaa,1950uuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-0.1,,,0.0,4355.0,912.0,-0.05,,912 912,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.880856,a german opera by emanuel schikaneder ; music ...,a german opera by emanuel schikaneder ; music ...,-0.05,,"aberthermann, schikanederemanuel",-0.05,,e. eulenburg,-0.1,,,0.833333,die zauberflöte,"die zauberflöte, köchel no 620",-0.1,,,0.733333,1,1 412
642,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.625,1950aaaa,19uuuuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-0.1,,,0.0,912.0,4355.0,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.713937,wolfgang amadeus mozart ; libretto by emanuel ...,a german opera by emanuel schikaneder ; music ...,-0.05,"schikanederemanuel, aberthermann",,-0.05,e. eulenburg,,-0.1,,,0.770833,"die zauberflöte, the magic flute : opera : k 620",die zauberflöte,-0.1,,,0.733333,1 412,1


In [50]:
# Same rows as above, i.e. false predicted uniques of key i ('Neural Network')
df_index_docids.iloc[fpu[i]]

| | 035liste_x | 035liste_y | docid_x | docid_y |
|---|---|---|---|---|
| 6 | [(OCoLC)731635279, (LIBIB)000315536] | [(OCoLC)731635279, (ABN)000539983] | 323173349 | 000311049 |
| 264 | [(OCoLC)634380788, (IDSBB)004076773] | [(RERO)R004263905] | 118218034 | 23590242X |
| 267 | [(RERO)R004263905] | [(OCoLC)634380788, (IDSBB)004076773] | 23590242X | 118218034 |
| 297 | [(OCoLC)604985552, (SGBN)000887381] | [(OCoLC)887478782, (ABN)000327579] | 055221017 | 006217826 |
| 303 | [(OCoLC)604985552, (SGBN)000887381] | [(OCoLC)604985552, (NEBIS)005519625] | 055221017 | 199374376 |
| 432 | [(OCoLC)796203880, (IDSBB)005967090] | [(OCoLC)808021169, (BGR)000119170] | 114467048 | 020561318 |
| 465 | [(OCoLC)882061057, (SBT)000242507] | [(OCoLC)638188846, (NEBIS)001183345] | 041431642 | 136079180 |
| 471 | [(OCoLC)890130815, (NEBIS)003645770] | [(OCoLC)695884327, (IDSLU)000901978] | 15172783X | 021555524 |
| 641 | [(OCoLC)611159941, (IDSLU)000464498] | [(VAUD)991019165679702852, (RNV)000396480-41bc... | 028968867 | 405473354 |
| 642 | [(RERO)R007095034] | [(OCoLC)611159941, (IDSLU)000464498] | 252355962 | 028968867 |


## Summary