# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-MAR-2020

## Table of Contents

- [Introduction](#Introduction)
- [Structure of the Project](#Structure-of-the-Project)
- [Runs and Results](#Runs-and-Results)
    - [Runtime Parameters](#Runtime-Parameters)
    - [Run 1 - Full](#Run-1---Full)
- [Assessment of Results](#Assessment-of-Results)
- [Summary](#Summary)

## Introduction

[Proposal](./project-proposal-andreas-jud.ipynb)

## Structure of the Project

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
1. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
1. [Data Synthesizing](./3_DataSynthesizing.ipynb)
1. [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb)
1. [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb)
1. [Decision Tree Model](./6_DecisionTreeModel.ipynb)
1. [Support Vector Classifier Model](./7_SVCModel.ipynb)
1. [Neural Network Model](./8_NeuralNetwork.ipynb)

Appendix

- [A. References](./A_References.ipynb)
- [B. Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb)

## Runs and Results

This section first explains the runtime parameters with which the notebooks of the capstone project can be called. Once the parameter space is defined, a series of runs is executed, each with different parameter values.

### Runtime Parameters

The notebooks of this capstone project can be called with six specific global parameters. These parameters are listed and explained here.

- $\texttt{execution}\_\texttt{mode}$ - This parameter has been introduced to control the runtime of execution. For the models, a grid search has been implemented to find the best parameters of a model. The bigger the grid space, i.e. the more points it has in each of its dimensions, the longer a notebook runs. Finding the best parameters for a model requires scanning the grid space widely, and such calculations may take hours. For some runs, a smaller grid space is sufficient, so a restricted grid space can be chosen to save calculation time. The execution mode of a notebook takes one of two values.
    - Mode $\texttt{full}$ is used when a notebook is called from this very chapter. The results of each notebook are collected for final comparison and assessment.
    - Mode $\texttt{restricted}$ is mainly, but not exclusively, used for executing a notebook locally, i.e. opening it manually and running it cell by cell. The original purpose of this mode has been to open the notebook and read its text, in order to focus on the contents and the specific explanations for a model. Runtime is supposed to be short in this mode. The grid parameters chosen for this mode are derived from the insights gained from the full-mode runs of this chapter.
- $\texttt{factor}$ - In Swissbib's raw data, records may have missing attribute values. When pairs of records are built for generating the feature matrix, an attribute may have a value on both sides of a pair, be missing on one side, or be missing on both sides; see chapters [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb) for a deeper discussion. Missing values may influence the model. For that reason, the features of pairs with missing attribute values are marked by transforming them to a negative similarity value. During implementation, the question arose how the distance from the origin (similarity value of 0) on the negative similarity side would influence a model, especially a Neural Network, where the input to each node depends linearly on the feature values. This factor has been introduced to set that distance. In the implemented code, the factor ...
    - multiplies -0.5 if one attribute of the pair is missing.
    - multiplies -1.0 if both attributes of the pair are missing.
- $\texttt{oversampling}$ - The number of duplicate records generated with Swissbib's goldstandard data is low compared to the number of unique records. As a consequence, balancing has generally been used for model fitting. In order to increase the ratio of duplicates in the training and testing data, oversampling with synthetic data has been tried. Parameter $\texttt{oversampling}$ has been introduced to control this ratio. Synthetic data is multiplied in a for loop until the oversampling ratio, given in percent (%), is reached in the final data set for model calculation. If $\texttt{oversampling}=0$, no synthetic data is added to the goldstandard data. This parameter is used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).
- $\texttt{modification}\_\texttt{ratio}$ - This parameter is used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), too. In that chapter, specific kinds of data modification (typos) to be simulated are defined for each attribute. If an attribute is subject to one or more kinds of modification, this parameter controls the ratio, and therefore the number, of records that are modified.
- $\texttt{mode}\_\texttt{exactDate}$ - The basic similarity metric of attribute $\texttt{exactDate}$ is modified in the presence of unknown values, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) for implementation details. Two different modes of modifying the basic similarity metric have been implemented. Parameter $\texttt{mode}\_\texttt{exactDate}$ has been introduced to select one of them.
- $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ - Swissbib's raw data provide attributes $\texttt{scale}$, $\texttt{part}$, and $\texttt{volumes}$ as full-text strings. Swissbib's deduplication engine extracts their digits in a preprocessing step with the goal to generate more reliable results. A very basic stripping function has been implemented in this capstone project to approximate Swissbib's more sophisticated logic. The model result may change as a function of the similarity values of these three attributes. To assess the effect of stripping the attribute values, parameter $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ switches the stripping logic on ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{True}$) and off ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$).
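
The effect of $\texttt{factor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ can be illustrated with a minimal sketch. The function names `is_missing`, `mark_missing`, and `strip_to_digits` are hypothetical and not taken from the project code; the actual implementation in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) may differ, in particular in how it extracts the digits.

```python
import math
import re


def is_missing(value):
    """Treat None and NaN as a missing attribute value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mark_missing(similarity, value_x, value_y, factor=0.1):
    """Return the similarity, or a negative marker if values are missing.

    Assumption: -1.0 * factor if both sides of the pair are missing,
    -0.5 * factor if exactly one side is missing.
    """
    n_missing = is_missing(value_x) + is_missing(value_y)
    if n_missing == 2:
        return -1.0 * factor
    if n_missing == 1:
        return -0.5 * factor
    return similarity


def strip_to_digits(text):
    """Keep only the digit groups of a string, joined by single spaces.

    Hypothetical stand-in for the project's basic stripping function.
    """
    return " ".join(re.findall(r"\d+", text))


print(mark_missing(0.8, "a", "b"))      # both values present
print(mark_missing(0.0, None, "b"))     # one value missing
print(mark_missing(0.0, None, None))    # both values missing
print(strip_to_digits("1 DVD-Video (169 Min.)"))
```

With $\texttt{factor}=0.1$, the two markers are -0.05 and -0.1, so they stay close to the origin while remaining distinguishable from real similarity values in the range from 0 to 1.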

In [1]:
# Setting parameters for runs

# Set run mode for each notebook :
#  full = full grid search space is scanned, results are collected.
#  restricted = small grid search space is scanned for local runs.
execution_mode='full'
#execution_mode = 'restricted'

# Factor for missing attributes, chapter 4 and graphs in chapter 5 :
#  -0.5*factor : One attribute of the pair is missing.
#  -1.0*factor : Both attributes of the pair are missing.
factor = 0.1
#factor = 1.0

# Factor of oversampling with synthetic data, chapter 3 :
oversampling = 0
#oversampling = 20

# Ratio to which the attributes with a value will be modified in the ...
#  synthetic data generation of chapter 3.
modification_ratio = 0.2

# Function applied to exactDate attribute to increase value of ...
#  strings with characters indicating 'unknown' digits.
#mode_exactDate = 'added_u'
mode_exactDate = 'xor'

# Decides whether for attributes 'scale', 'part', and 'volumes', the full string ...
#  shall be stripped to number digits (True) or shall be left as is (False).
strip_number_digits = True
#strip_number_digits = False

# Generate dictionary for parameter handover
runtime_param_dict = {
    'em' : execution_mode,
    'fa' : factor,
    'os' : oversampling,
    'mr' : modification_ratio,
    'me' : mode_exactDate,
    'sn' : strip_number_digits,
    'notebook_name' : ''
}

To execute the notebooks of the capstone project, functions of the Python library $\texttt{nbparameterise}$ will be used.

In [2]:
#! pip install nbparameterise

### Run 1 - Full

In [3]:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import nbparameterise as nbp
import os
import results_saving_funcs as rsf

path_results = './results'
# Determine all relevant notebooks, omit Overview Summary and Appendixes
a = ! ls [1-9]_* | grep .ipynb

for i in range(len(a)):
    print('Executing notebook', a[i])
    with open(a[i]) as notebook:
        nb = nbformat.read(notebook, as_version=4)
        
        # Get list of parameter objects
        orig_parameters = nbp.extract_parameters(nb)
        # Update parameters
        params = nbp.parameter_values(orig_parameters,
                                      execution_mode=runtime_param_dict['em'],
                                      factor=runtime_param_dict['fa'],
                                      oversampling=runtime_param_dict['os'],
                                      modification_ratio = runtime_param_dict['mr'],
                                      mode_exactDate = runtime_param_dict['me'],
                                      strip_number_digits = runtime_param_dict['sn']
                                     )
        # Make notebook object with these definitions, ...
        nb = nbp.replace_definitions(nb, params, execute=False)

        ep = ExecutePreprocessor(timeout=None)
        # ... and execute it.
        ep.preprocess(nb, {"metadata": {"path": './'}})
    # Save notebook run in result file
    runtime_param_dict.update({'notebook_name' : a[i]})
    rsf.save_notebook_results(nb, path_results, runtime_param_dict)

print('Done with all notebooks.')

Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb
Done with all notebooks.
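
The shell call `! ls [1-9]_* | grep .ipynb` in the cell above only works inside IPython. As a sketch, the same selection of chapter notebooks can be done in plain Python with `pathlib`. The function name `find_chapter_notebooks` is hypothetical and not part of the project code.

```python
from pathlib import Path


def find_chapter_notebooks(directory="."):
    """Return names of notebooks starting with a digit 1-9 and an underscore.

    Files such as 'A_References.ipynb' do not match the pattern and
    are therefore left out, like in the shell call.
    """
    return sorted(p.name for p in Path(directory).glob("[1-9]_*.ipynb"))


for name in find_chapter_notebooks("."):
    print("Found notebook", name)
```

Unlike `grep .ipynb`, where the dot matches any character, the glob pattern requires the literal file extension `.ipynb`.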


In [4]:
import pandas as pd

path_goldstandard = './daten_goldstandard'

results = rsf.restore_dict_results(path_goldstandard, 'results.pkl')

results['results_best_model'].reset_index(drop=True, inplace=True)
# Ranking metric according to chapter 6 : roc auc
results['results_best_model'].sort_values('auc', ascending=False)

| | model | auc | accuracy | precision | recall | auc_log | accuracy_log | precision_log | recall_log |
|---|---|---|---|---|---|---|---|---|---|
| 2 | DecisionTreeClassifier_CV | 98.460039 | 99.953745 | 95.016611 | 96.949153 | 4.173413 | 7.67875 | 2.99906 | 3.489751 |
| 3 | RandomForestClassifier | 97.781104 | 99.944108 | 94.630872 | 95.59322 | 3.80816 | 7.489508 | 2.924505 | 3.122026 |
| 1 | DecisionTreeClassifier | 97.768504 | 99.919053 | 90.675241 | 95.59322 | 3.802498 | 7.119135 | 2.372497 | 3.122026 |
| 6 | Neural Network | 97.263907 | 99.920981 | 91.776316 | 94.576271 | 3.598639 | 7.143232 | 2.498152 | 2.914387 |
| 4 | SVC | 96.248896 | 99.913271 | 92.22973 | 92.542373 | 3.28312 | 7.050142 | 2.554865 | 2.595933 |
| 5 | SVC_CV | 95.90119 | 99.892071 | 89.438944 | 91.864407 | 3.194474 | 6.831453 | 2.247997 | 2.508922 |
| 0 | DummyClassifier | 49.898126 | 98.893729 | 0.355872 | 0.338983 | 0.691112 | 4.504175 | 0.003565 | 0.003396 |
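
The values in the `_log` columns of the table above appear consistent with the negative natural logarithm of the error, $-\ln(1 - \text{score})$, which spreads out scores that lie very close to 100%. The following sketch checks this relation against two values of the table. The function name `neg_log_error` is hypothetical, and the treatment of a perfect score is an assumption; the grid search tables further below list `-inf` for a training accuracy of 1.0 and negative values for SVC_CV, so the project code evidently handles the sign differently in places.

```python
import math


def neg_log_error(score_percent):
    """Return -ln(1 - score), with the score given in percent."""
    error = 1.0 - score_percent / 100.0
    if error <= 0.0:
        # Assumption: a perfect score has no finite log value
        return math.inf
    return -math.log(error)


# Scores copied from the DecisionTreeClassifier_CV row of the table above
print(round(neg_log_error(98.460039), 6))  # table lists auc_log = 4.173413
print(round(neg_log_error(99.953745), 6))  # table lists accuracy_log = 7.67875
```

On this scale, a difference of 1.0 corresponds to reducing the error by a factor of $e \approx 2.72$.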


In [5]:
import time

pd.options.display.max_rows = 200

# For timestamp in filename
tmstmp = time.strftime("%Y%m%d-%H%M%S")

for classifier in results['results_model_scores'].keys() :
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))
    results['results_model_scores'][classifier].to_csv(os.path.join(path_results, classifier + '_'
                                                                    + tmstmp + '.csv'), index=False)


DummyClassifier



DecisionTreeClassifier


| | class_weight | criterion | max_depth | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|
| 23 | | gini | 24.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 17 | | gini | 18.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 31 | | gini | 40.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 30 | | gini | 35.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 29 | | gini | 30.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 28 | | gini | 29.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 27 | | gini | 28.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 26 | | gini | 27.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 25 | | gini | 26.0 | 1.0 | 0.99935 | -inf | 7.337829 |
| 24 | | gini | 25.0 | 1.0 | 0.99935 | -inf | 7.337829 |



DecisionTreeClassifier_CV


| | class_weight | criterion | max_depth | accuracy_val | std_accuracy_val | log_accuracy_val |
|---|---|---|---|---|---|---|
| 131 | balanced | entropy | 27.0 | 0.999297 | 0.000153 | 7.259482 |
| 139 | balanced | entropy | | 0.999292 | 0.000151 | 7.252656 |
| 138 | balanced | entropy | 50.0 | 0.999292 | 0.000151 | 7.252656 |
| 137 | balanced | entropy | 45.0 | 0.999292 | 0.000151 | 7.252656 |
| 136 | balanced | entropy | 40.0 | 0.999292 | 0.000151 | 7.252656 |
| 135 | balanced | entropy | 35.0 | 0.999292 | 0.000151 | 7.252656 |
| 134 | balanced | entropy | 30.0 | 0.999292 | 0.000151 | 7.252656 |
| 133 | balanced | entropy | 29.0 | 0.999292 | 0.000151 | 7.252656 |
| 132 | balanced | entropy | 28.0 | 0.999292 | 0.000151 | 7.252656 |
| 95 | balanced | gini | 26.0 | 0.999292 | 0.000153 | 7.252656 |



RandomForestClassifier


| | class_weight | max_depth | n_estimators | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|
| 102 | | 24.0 | 50 | 0.999994 | 0.999615 | 12.019942 | 7.861077 |
| 90 | | 22.0 | 100 | 1.0 | 0.99959 | -inf | 7.800452 |
| 208 | balanced | 23.0 | 75 | 1.0 | 0.999566 | -inf | 7.743294 |
| 109 | | | 50 | 1.0 | 0.999566 | -inf | 7.743294 |
| 88 | | 22.0 | 50 | 0.999994 | 0.999566 | 12.019942 | 7.743294 |
| 101 | | 24.0 | 40 | 0.999988 | 0.999566 | 11.326795 | 7.743294 |
| 108 | | | 40 | 0.999994 | 0.999542 | 12.019942 | 7.689227 |
| 207 | balanced | 23.0 | 50 | 1.0 | 0.999542 | -inf | 7.689227 |
| 74 | | 20.0 | 50 | 0.999994 | 0.999542 | 12.019942 | 7.689227 |
| 209 | balanced | 23.0 | 100 | 1.0 | 0.999542 | -inf | 7.689227 |



SVC


| | C | class_weight | degree | gamma | kernel | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|---|
| 17 | 0.5 | | 4 | 1.5 | poly | 0.999687 | 0.999085 | 8.068698 | 6.996079 |
| 117 | 1.0 | | 4 | 2.5 | poly | 0.999886 | 0.99906 | 9.075503 | 6.970104 |
| 109 | 1.0 | | 3 | 2.5 | poly | 0.999572 | 0.99906 | 7.757262 | 6.970104 |
| 107 | 1.0 | | 3 | 2.0 | poly | 0.99953 | 0.99906 | 7.663233 | 6.970104 |
| 13 | 0.5 | | 3 | 2.5 | poly | 0.99953 | 0.999036 | 7.663233 | 6.944786 |
| 93 | 0.9 | | 4 | 2.5 | poly | 0.999886 | 0.999036 | 9.075503 | 6.944786 |
| 85 | 0.9 | | 3 | 2.5 | poly | 0.999572 | 0.999036 | 7.757262 | 6.944786 |
| 69 | 0.8 | | 4 | 2.5 | poly | 0.999867 | 0.999036 | 8.928899 | 6.944786 |
| 41 | 0.7 | | 4 | 1.5 | poly | 0.999729 | 0.999036 | 8.213279 | 6.944786 |
| 45 | 0.7 | | 4 | 2.5 | poly | 0.999849 | 0.999012 | 8.801066 | 6.920093 |



SVC_CV


| | C | class_weight | degree | gamma | kernel | accuracy_val | std_accuracy_val | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|
| 81 | 0.9 | | 3 | 1.5 | poly | 0.999075 | 0.000179 | -6.985594 |
| 105 | 1.0 | | 3 | 1.5 | poly | 0.99907 | 0.000177 | -6.980399 |
| 13 | 0.5 | | 3 | 2.5 | poly | 0.99907 | 0.000185 | -6.980399 |
| 107 | 1.0 | | 3 | 2.0 | poly | 0.99907 | 0.000185 | -6.980399 |
| 57 | 0.8 | | 3 | 1.5 | poly | 0.999065 | 0.000163 | -6.975231 |
| 17 | 0.5 | | 4 | 1.5 | poly | 0.99906 | 0.000204 | -6.970091 |
| 35 | 0.7 | | 3 | 2.0 | poly | 0.99906 | 0.00017 | -6.97009 |
| 33 | 0.7 | | 3 | 1.5 | poly | 0.999056 | 0.000174 | -6.964975 |
| 59 | 0.8 | | 3 | 2.0 | poly | 0.999056 | 0.000178 | -6.964975 |
| 11 | 0.5 | | 3 | 2.0 | poly | 0.999056 | 0.000169 | -6.964974 |



Neural Network


| | class_weight | dropout_rate | l2_alpha | number_of_hidden1_layers | number_of_hidden2_layers | sgd_learnrate | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
|---|---|---|---|---|---|---|---|---|---|---|
| 74 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 70 | 60 | 0.003 | 0.999276 | 0.999179 | 7.231189 | 7.104951 |
| 72 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 70 | 60 | 0.001 | 0.999306 | 0.999125 | 7.273158 | 7.041291 |
| 71 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 70 | 55 | 0.003 | 0.999233 | 0.999113 | 7.172956 | 7.028176 |
| 168 | [0.5028541799926344, 88.09083191850594] | 0.2 | 0.0 | 75 | 45 | 0.001 | 0.999218 | 0.999113 | 7.15303 | 7.028162 |
| 40 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 55 | 0.002 | 0.99928 | 0.999102 | 7.236474 | 7.015204 |
| 73 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 70 | 60 | 0.002 | 0.999295 | 0.999098 | 7.2568 | 7.010926 |
| 38 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 50 | 0.003 | 0.999204 | 0.999094 | 7.135886 | 7.006666 |
| 27 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 55 | 60 | 0.001 | 0.999284 | 0.99909 | 7.241788 | 7.002411 |
| 33 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 45 | 0.001 | 0.999291 | 0.999086 | 7.251323 | 6.998187 |
| 36 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 50 | 0.001 | 0.999294 | 0.999083 | 7.255365 | 6.993968 |


## Assessment of Results

In [6]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions.pkl')

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
fpu, fpd = {}, {}

for i in wrong_predictions.keys() :
    fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
    fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

print(wrong_prediction_groups[0])
for i in fpu.keys() :
    print(i, len(fpu[i]), '\n', fpu[i])
print('')
print(wrong_prediction_groups[1])
for i in fpd.keys() :
    print(i, len(fpd[i]), '\n', fpd[i])

false_predicted_uniques
DecisionTreeClassifier 13 
 [550, 632, 691, 900, 911, 914, 921, 924, 931, 932, 1014, 1039, 1254]
DecisionTreeClassifier_CV 9 
 [550, 664, 672, 911, 914, 921, 924, 931, 932]
RandomForestClassifier 13 
 [471, 632, 691, 900, 911, 914, 921, 924, 931, 932, 1014, 1039, 1254]
SVC 22 
 [152, 264, 267, 432, 465, 471, 474, 476, 486, 641, 642, 691, 821, 900, 916, 931, 932, 1014, 1039, 1128, 1256, 1317]
SVC_CV 24 
 [264, 267, 373, 432, 465, 471, 474, 476, 486, 550, 641, 642, 691, 724, 821, 900, 916, 921, 924, 931, 932, 1014, 1039, 1256]
Neural Network 16 
 [23, 465, 471, 641, 642, 685, 724, 883, 900, 916, 921, 923, 924, 931, 932, 1158]

false_predicted_duplicates
DecisionTreeClassifier 29 
 [3651, 3652, 24030, 43593, 49539, 56220, 63486, 64500, 67321, 67912, 68059, 68062, 80378, 82011, 100540, 103087, 103658, 125645, 129309, 135180, 144612, 160727, 170792, 179983, 197599, 198366, 198538, 203005, 205995]
DecisionTreeClassifier_CV 15 
 [24498, 49754, 50774, 60958, 80378, 1005
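
To see which record pairs are hard for every model, the index lists printed above can be intersected. The following sketch does this for the false predicted uniques, with the lists copied from the output above. The function name `common_indices` is hypothetical and not part of the project code.

```python
from functools import reduce

# False predicted uniques per model, copied from the output above
fpu = {
    'DecisionTreeClassifier': [550, 632, 691, 900, 911, 914, 921, 924, 931,
                               932, 1014, 1039, 1254],
    'DecisionTreeClassifier_CV': [550, 664, 672, 911, 914, 921, 924, 931, 932],
    'RandomForestClassifier': [471, 632, 691, 900, 911, 914, 921, 924, 931,
                               932, 1014, 1039, 1254],
    'SVC': [152, 264, 267, 432, 465, 471, 474, 476, 486, 641, 642, 691, 821,
            900, 916, 931, 932, 1014, 1039, 1128, 1256, 1317],
    'SVC_CV': [264, 267, 373, 432, 465, 471, 474, 476, 486, 550, 641, 642,
               691, 724, 821, 900, 916, 921, 924, 931, 932, 1014, 1039, 1256],
    'Neural Network': [23, 465, 471, 641, 642, 685, 724, 883, 900, 916, 921,
                       923, 924, 931, 932, 1158],
}


def common_indices(predictions):
    """Return the sorted indices that appear in every model's list."""
    sets = [set(indices) for indices in predictions.values()]
    return sorted(reduce(set.intersection, sets))


print(common_indices(fpu))  # prints [931, 932]
```

Only the pairs with index 931 and 932 are predicted as uniques by all six models.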

In [7]:
# Restore DataFrame with attributes and similarity values
df_attribute_with_sim_feature = pd.read_pickle(os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl'), compression=None)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [8]:
pd.options.display.max_columns = 200

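# After the loops above, i still holds the last model key ('Neural Network'), ...
#  so the rows shown are the false predicted uniques of the Neural Network.
df_attribute_with_sim_feature.iloc[fpu[i]]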

Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
23,1,-0.1,,,-0.1,,,0.386667,"metropolitan opera (new york)orchestra, metrop...","metropolitan operaorchestra, metropolitan oper...",-0.1,,,-0.1,,,0.75,2000aaaa,2000uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-0.1,,,1.0,73.0,73.0,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.855856,w.a. mozart ; libretto: emanuel schikaneder ; ...,w.a. mozart ; libretto emanuel schikaneder,0.871335,"schikanederemanuel, coxjohn, levinejames, batt...","schikanederemanuel, hockneydavid, coxjohn, lev...",0.819266,"deutsche grammophon, universal music",deutsche grammophon gesellschaft,-0.1,,,0.740916,"die zauberflöte, oper in zwei aufzügen = the m...","die zauberflöte, oper in zwei aufzügen : kv 620",-0.1,,,0.733333,1,1 169
465,1,-0.1,,,-0.1,,,-0.05,,"rundfunkchor, sächsische staatskapelle dresden",-0.1,,,-0.1,,,0.5,aaaaaaaa,1991uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-0.1,,,-0.05,,422.0,-0.05,,43,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.672418,wolfgang amadeus mozart,[musik]: wolfgang amadeus mozart; libretto: em...,0.546377,"mathisedith, karajanherbert von","mollkurt, serraluciana, pricemargaret, venutim...",-0.05,,[phonogram],-0.1,,,0.545964,zauberflöte,"die zauberflöte, kv 620 : eine deutsche oper i...",-0.1,,,0.777778,3,3 1
471,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.5,19201929,uuuuuuuu,1.0,10200,10200,1.0,mu,mu,1.0,[],[],-0.1,,,-0.05,245.0,,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.67731,von emanuel schikaneder ; [musik von] wolfgang...,von w.a. mozart ; klavierauszug neu rev. von w...,-0.05,"schikanederemanuel, kienzlwilhelm",,-0.05,universal edition,,-0.1,,,0.854023,"die zauberflöte, il flauto magico : oper in zw...","die zauberflöte, oper in 2 akten = il flauto m...",-0.1,,,1.0,1 167,1 167
641,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.625,19aaaaaa,1950uuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-0.1,,,0.0,4355.0,912.0,-0.05,,912 912,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.880856,a german opera by emanuel schikaneder ; music ...,a german opera by emanuel schikaneder ; music ...,-0.05,,"aberthermann, schikanederemanuel",-0.05,,e. eulenburg,-0.1,,,0.833333,die zauberflöte,"die zauberflöte, köchel no 620",-0.1,,,0.733333,1,1 412
642,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.625,1950aaaa,19uuuuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-0.1,,,0.0,912.0,4355.0,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.713937,wolfgang amadeus mozart ; libretto by emanuel ...,a german opera by emanuel schikaneder ; music ...,-0.05,"schikanederemanuel, aberthermann",,-0.05,e. eulenburg,,-0.1,,,0.770833,"die zauberflöte, the magic flute : opera : k 620",die zauberflöte,-0.1,,,0.733333,1 412,1
685,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,1471aaaa,1471uuuu,1.0,20053,20053,1.0,bk,bk,1.0,[],[],-0.1,,,-0.1,,,-0.1,,,1.0,crescenzipietro de',crescenzipietro de',-0.1,,,-0.1,,,-0.1,,,-0.1,,,1.0,ruralia commoda,ruralia commoda,-0.1,,,1.0,418,418
724,1,-0.1,,,-0.1,,,-0.05,"rundfunkchor, sächsische staatskapelle dresden",,-0.1,,,-0.1,,,0.5,1984aaaa,uuuuuuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-0.1,,,1.0,422.0,422.0,-0.1,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.806061,wolfgang amadeus mozart ; libretto: emanuel sc...,wolfgang amadeus mozart,0.631315,"mollkurt, serraluciana, pricemargaret, venutim...","mollkurt, daviscolin",1.0,philips,philips,-0.1,,,0.629735,"die zauberflöte, the magic flute",zauberflöte,-0.1,,,1.0,3,3
883,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.75,2011aaaa,2011uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-0.1,,,-0.1,,,1.0,1 1,1 1,-0.05,schlöndorffvolker,,0.78176,volker schlöndorff ; nach dem roman von max fr...,volker schlöndorff ; nach dem roman von max fr...,0.580823,"frischmax, junkersdorfeberhard","schlöndorffvolker, myersstanley, wurlitzerrudo...",1.0,"suhrkamp, absolut medien","suhrkamp, absolut medien",-0.1,,,1.0,homo faber,homo faber,-0.1,,,1.0,1 117,1 117
900,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.625,18901899,18uuuuuu,0.428571,10000,10200,1.0,mu,mu,1.0,[],[],-0.1,,,-0.1,,,0.777778,2 2,2,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,-0.05,von w.a. mozart,,-0.05,mozartwolfgang amadeus,,-0.05,bei a.h. hirsch,,-0.1,,,0.674584,"die zauberflöte, il flauto magico : oper in zw...","die zauberflöte, grosse oper in zwei aufzügen ...",-0.1,,,-0.05,63,
916,1,-0.1,,,-0.1,,,-0.1,,,-0.1,,,-0.1,,,0.5,aaaaaaaa,1941uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-0.1,,,-0.1,,,0.796296,2620 5,2620 2620,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.839556,von mozart ; dichtung von emanuel schikaneder ...,von mozart ; dichtung von emanuel schikaneder ...,0.771292,"krusegeorg richard, schikanederemanuel","schikanederemanuel, krusegeorg richard",0.809524,reclam,p. reclam jun.,-0.1,,,0.881356,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen : volls...",-0.1,,,1.0,74,74


In [9]:
# Same rows as above : i still holds the last model key ('Neural Network').
df_index_docids.iloc[fpu[i]]

| | 035liste_x | 035liste_y | docid_x | docid_y |
|---|---|---|---|---|
| 23 | [(RERO)R003034172] | [(OCoLC)884447694, (NEBIS)005645758] | 225394006 | 167023853 |
| 465 | [(OCoLC)882061057, (SBT)000242507] | [(OCoLC)638188846, (NEBIS)001183345] | 041431642 | 136079180 |
| 471 | [(OCoLC)890130815, (NEBIS)003645770] | [(OCoLC)695884327, (IDSLU)000901978] | 15172783X | 21555524 |
| 641 | [(OCoLC)611159941, (IDSLU)000464498] | [(VAUD)991019165679702852, (RNV)000396480-41bc... | 028968867 | 405473354 |
| 642 | [(RERO)R007095034] | [(OCoLC)611159941, (IDSLU)000464498] | 252355962 | 28968867 |
| 685 | [(OCoLC)611643448, (IDSSG)000416104] | [(OCoLC)611643448, (IDSSG)000416104] | 032531982 | 32531982 |
| 724 | [(OCoLC)745579984, (NEBIS)006565661] | [(OCoLC)882061500, (SBT)000243094] | 175505578 | 41433645 |
| 883 | [(SNL)vtls001635981, (Sz)001635981] | [(RERO)R006080192] | 070381291 | 248221590 |
| 900 | [(VAUD)991006920619702852, (RNV)000620926-41bc... | [(OCoLC)882733652, (IDSBB)002886398] | 419850414 | 80110495 |
| 916 | [(OCoLC)604627094, (NEBIS)009407654] | [(RERO)1706143] | 195531280 | 214241025 |


## Summary