# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-MAR-2020

## Table of Contents

- [Introduction](#Introduction)
    - [Requirements](#Requirements)
    - [Thanks](#Thanks)
- [Structure of the Project](#Structure-of-the-Project)
- [Runs and Results](#Runs-and-Results)
    - [Runtime Parameters](#Runtime-Parameters)
    - [Overview of Runs](#Overview-of-Runs)
    - [Runs Execution](#Runs-Execution)
- [Assessment of Results](#Assessment-of-Results)
    - [Run with id 0](#Run-with-id-0)
    - [Run with id 1](#Run-with-id-1)
    - [Wrong Predictions](#Wrong-Predictions)
- [Summary](#Summary)

## Introduction

[Proposal](./project-proposal-andreas-jud.ipynb)

### Requirements

This capstone project uses several publicly available Python libraries. Each chapter that needs a library shows the
<br>$\texttt{! pip install <library name>}$<br>
command in a separate code cell. These commands were executed once in the author's development environment and have since been commented out to keep the notebooks readable. To execute the set of notebooks of the capstone project in a Python development environment with a basic setup, a [requirements.txt](./requirements.txt) file has been written. The code cell below installs the library packages listed in this file.

In [1]:
#! pip install -r requirements.txt

### Thanks

## Structure of the Project

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
1. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
1. [Data Synthesizing](./3_DataSynthesizing.ipynb)
1. [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb)
1. [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb)
1. [Decision Tree Model](./6_DecisionTreeModel.ipynb)
1. [Support Vector Classifier Model](./7_SVCModel.ipynb)
1. [Neural Network Model](./8_NeuralNetwork.ipynb)

Appendix

- [A. References](./A_References.ipynb)
- [B. Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb)

## Runs and Results

This section first explains the runtime parameters with which the notebooks of the capstone project can be called. Once the parameter space has been defined, a series of runs is executed, each with different parameter values.

### Runtime Parameters

The notebooks of this capstone project can be called with six specific global parameters. These parameters are listed and explained here.

- $\texttt{execution}\_\texttt{mode}$ - This parameter has been introduced to control the runtime of execution. For the models, grid search has been implemented with the goal of finding the best parameters for a model. The bigger the grid space, i.e. the more points it has in each of its dimensions, the longer the runtime of a notebook. Oversampling of records of duplicates increases the runtime of a notebook, too. When searching for the best parameters of a model, the grid space has to be scanned widely, and the runtime may extend to hours for such calculations. For some runs, smaller grid spaces may be sufficient. In order to save calculation time, a restricted grid space can be chosen. The execution mode of a notebook may take one of three distinct values.
    - Mode $\texttt{full}$ will be used for executing the notebook, calling it in this very chapter and collecting the results of each notebook for final comparison and assessment.
    - Mode $\texttt{restricted}$ will mainly, but not exclusively, be used for executing the notebook locally, i.e. opening it manually and running it cell by cell. The original purpose of this mode has been to open the notebook and read its text, in order to focus on the contents and the specific explanation for a model. Runtime is supposed to be short in this execution mode. The grid parameters chosen for this mode are derived from the insights gained from the results of the full execution mode in this chapter.
    - Mode $\texttt{tune}$ will be used for a final fine tuning of the models' parameters. The goal of a run in mode $\texttt{tune}$ is to find the best models in a grid space close to a precalculated best model of the wide grid space. While mode $\texttt{full}$ will be used for scanning the parameter space over a wide range of orders of magnitude, mode $\texttt{tune}$ will be used for scanning the parameter points neighbouring the best models of the mode $\texttt{full}$ run. This approach is an iterative search for the best parameters of the models.
- $\texttt{oversampling}$ - The number of records of duplicates generated with Swissbib's goldstandard data is low compared to the number of records of uniques. As a consequence, balancing has generally been used for model fitting. In order to increase the ratio of duplicates in the training and testing data, oversampling with synthetic data has been tried. To control the ratio, parameter $\texttt{oversampling}$ has been introduced. Synthetic data is multiplied in a for loop until the ratio of oversampling, given in percent (%), is reached in the final data set for model calculation. If $\texttt{oversampling}=0$, no synthetic data is added to the goldstandard data. This parameter is used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).
- $\texttt{modification}\_\texttt{ratio}$ - This parameter will be used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), too. In that chapter, some specific kinds of data modification (typos) to be simulated have been defined for each attribute. If an attribute shows one or more kinds of modification, this parameter controls the ratio and therefore the amount of records with modification.
- $\texttt{factor}$ - In Swissbib's raw data, records may have missing values in attributes. When building pairs of records for generating the feature matrix, an attribute may have a value on both sides of a pair, a missing value on one side of a pair, or missing values on both sides of a pair, see chapters [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb) for a deeper discussion. Missing values may influence the model. For that reason, a decision has been taken to mark the features of pairs of records with missing attribute values. One way of marking them is to transform them to a negative similarity value. During implementation, the question arose how the distance from the origin (similarity value of 0) on the negative similarity side would influence a model, especially a Neural Network, due to the linear dependency of the firing of a node. To be able to set the distance from the origin, this factor has been introduced. In the implemented code, the factor ...
    - multiplies -0.5 if one attribute of the pair is missing.
    - multiplies -1.0 if both attributes of the pair are missing.
- $\texttt{mode}\_\texttt{exactDate}$ - The basic similarity metric of attribute $\texttt{exact}\_\texttt{date}$ undergoes some modification in the presence of unknown values, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) for implementation details. Two different modes of modifying the basic similarity metric have been implemented. To decide on one mode of modification, parameter $\texttt{mode}\_\texttt{exactDate}$ has been introduced.
- $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ - Swissbib's raw data bring attributes $\texttt{scale}$, $\texttt{part}$, and $\texttt{volumes}$ as full-text strings. Swissbib's deduplication engine extracts their number digit parts in a preprocessing step with the goal of generating more reliable results. A very basic stripping function has been implemented in this capstone project with the goal of imitating Swissbib's more sophisticated logic. The model result may change as a function of the similarity values of these three attributes. To assess the effect of stripping the attribute values, parameter $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ will be used for switching on ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{True}$) and off ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$) the stripping to number digits logic.
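The marking of missing values controlled by $\texttt{factor}$ can be sketched as follows. This is a minimal illustration and not the code of the project: the function name `mark_missing_similarity` and its signature are hypothetical, and only the multipliers -0.5 and -1.0 are taken from the description above.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as a missing attribute value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mark_missing_similarity(similarity, value_x, value_y, factor):
    """Return the similarity, or a negative marker if values are missing.

    Hypothetical helper: -1.0 * factor if both values of the pair are
    missing, -0.5 * factor if exactly one is missing, else the similarity.
    """
    missing = _is_missing(value_x) + _is_missing(value_y)
    if missing == 2:
        return -1.0 * factor
    if missing == 1:
        return -0.5 * factor
    return similarity


# Both values present: the similarity is passed through unchanged
print(mark_missing_similarity(0.8, "a", "b", factor=0.1))
# One value missing: marker at -0.5 * factor
print(mark_missing_similarity(0.0, None, "b", factor=0.1))
# Both values missing: marker at -1.0 * factor
print(mark_missing_similarity(0.0, None, None, factor=1.0))
```

A small $\texttt{factor}$ keeps the markers close to the origin, a $\texttt{factor}$ of 1.0 moves them as far away from it as the similarity range allows.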

To execute the notebooks of the capstone project, functions of Python library $\texttt{nbparameterise}$ will be used.

In [2]:
#! pip install nbparameterise

### Overview of Runs

With the description of the runtime parameters, a multi-dimensional space of calculation options has been spanned. Due to limited calculation power, it has been important to design the runs carefully in order to reduce unnecessary calculation time and to increase the informative value of the documented runs. The strategy used for the runs of this capstone project is shown in the table below. This strategy with its specific parameters has grown iteratively in the course of the capstone project. Many non-documented runs have been done with the models up to a point where the author had developed a feeling for the basic behaviour of a model linked to its best-suited parameter space. In the end, this chapter condenses the author's learning with the models.

| run id | description | parameter set |
| :----: | :---------- | :--------- |
| 0 | Goldstandard sampling,<br>**full feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 1 | Goldstandard sampling,<br>**little feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br><font color='red'>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{False}$</font> |
| 2 | Goldstandard sampling,<br>**full missing separation** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br><font color='red'>$\texttt{factor}$ = $1.0$</font><br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 3 | **Oversampling** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br><font color='red'>$\texttt{oversampling}$ = $\texttt{20}$ with $\texttt{modification}\_\texttt{ratio}$ = $0.2$</font><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 4 | Final fine tuning | <font color='red'>$\texttt{execution}\_\texttt{mode}$ = $\texttt{tune}$</font><br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |

The strategy for finding the best parameters for the best model can be described as follows. The item numbers in the list below correspond to the run ids in the table above.
0. The first group of runs scans the parameter space widely with a coarse granularity in the grid space. The runs are done with the goldstandard data, only, without any oversampling of the data. The parameter $\texttt{factor}$ is set to 0.1 which stands for the expectation of a better performance for Neural Networks, see above. The stripping of text attributes to numbers is done in a forced way with the expectation to better approach Swissbib's preprocessing logic. This run represents a first guess based on expectations of finding the best models.
1. The assumption of the text attributes' stripping is validated with the next group of runs, leaving attributes $\texttt{part}$, $\texttt{scale}$, and $\texttt{volumes}$ unabbreviated as in Swissbib's original raw data output. This validation is done on a restricted grid space, based on the findings of the best models from run with id 0.
2. The next group of runs validates the assumptions of the influence of the distance of missing data points from the origin on the models, see above. The other parameters are set according to the findings so far, comparing the performance of the models.
3. The low ratio of records of duplicate pairs compared to the number of records of uniques has proven to be insignificant for training the models. The effect of oversampling with synthetic data still remains an interesting point to be investigated. The result is to be compared with Swissbib's unmodified goldstandard data. The other parameters will be the ones found from the best models, up to that point.
4. The last group of runs scans the grid space in a fine granularity in the vicinity of the grid points found for the best models in the preceding runs, explicitly in run 0. This will be a fine tuning step in order to be sure to have found the very best parameters for the best model of best models.
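The step from the coarse scan of run 0 to the fine tuning of run 4 can be illustrated with a small helper that builds a narrow grid between the coarse neighbours of the best grid point. This is a sketch under assumptions: the function name `fine_grid_around`, the geometric spacing, and the number of points are chosen for illustration and are not the grids used in the notebooks. The helper expects strictly positive grid values and at least two points.

```python
def fine_grid_around(best, coarse_grid, points=5):
    """Return geometrically spaced values between the coarse neighbours of `best`.

    Hypothetical helper: `best` must be an element of `coarse_grid`,
    all grid values must be positive, and `points` must be at least 2.
    """
    grid = sorted(coarse_grid)
    pos = grid.index(best)
    low = grid[max(pos - 1, 0)]
    high = grid[min(pos + 1, len(grid) - 1)]
    ratio = high / low
    return [low * ratio ** (k / (points - 1)) for k in range(points)]


# Example: coarse scan over orders of magnitude, best value found at 1.0
coarse = [0.01, 0.1, 1.0, 10.0]
print(fine_grid_around(1.0, coarse))
```

If the best value sits at the border of the coarse grid, the fine grid only covers the interval towards its single neighbour. In that case, extending the coarse grid first may be the better choice.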

Before the run can be executed, the global parameters described in subsection [Runtime Parameters](#Runtime-Parameters) have to be set.

In [3]:
# Generate dictionary for parameter handover
runtime_param_dict = {
    'em' : 'full' #execution_mode : ['restricted', 'full', 'tune']
    , 'os' : 0 # oversampling : [0, 20]
    , 'mr' : 0.2 # modification_ratio
    , 'fa' : 0.1 # factor : [0.1, 1.0]
    , 'me' : 'xor' # mode_exactDate : ['added_u', 'xor']
    , 'sn' : True # strip_number_digits : [True, False]
}
# Run id = 0
runtime_param_dict_list = [runtime_param_dict]

# Run id = 1
runtime_param_dict = runtime_param_dict.copy()
# Del
runtime_param_dict['fa'] = 1.0
# /Del
runtime_param_dict['em'] = 'restricted'
runtime_param_dict['me'] = 'added_u'
runtime_param_dict['sn'] = True
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 2
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['fa'] = 1.0
runtime_param_dict['me'] = 'xor'
runtime_param_dict['sn'] = True
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 3
runtime_param_dict = runtime_param_dict.copy()
#runtime_param_dict['em'] = 'full'
runtime_param_dict['fa'] = 0.1
runtime_param_dict['os'] = 20
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 4
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['em'] = 'tune'
runtime_param_dict['os'] = 0
runtime_param_dict_list.append(runtime_param_dict)

# Let's have a look at the predefined parameters
for run in range(len(runtime_param_dict_list)):
    print('Parameters for run', run, ': \n', runtime_param_dict_list[run])

Parameters for run 0 : 
 {'em': 'full', 'os': 0, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True}
Parameters for run 1 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'fa': 1.0, 'me': 'added_u', 'sn': True}
Parameters for run 2 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'fa': 1.0, 'me': 'xor', 'sn': True}
Parameters for run 3 : 
 {'em': 'restricted', 'os': 20, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True}
Parameters for run 4 : 
 {'em': 'tune', 'os': 0, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True}


Now, the parameters for each run have been set according to the strategy of scanning the grid space. As a next step, all groups of runs can be executed.
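An alternative way of assembling the list of parameter dictionaries is to derive every run from one base dictionary plus a small set of overrides. The sketch below is an illustration only: the base values are copied from the first dictionary of the cell above, while the function name `build_runs` and the override values are hypothetical.

```python
# Base parameters, copied from the first dictionary of the cell above
base_params = {"em": "full", "os": 0, "mr": 0.2, "fa": 0.1, "me": "xor", "sn": True}

# Hypothetical per-run overrides, for illustration only
run_overrides = [
    {},                                     # run with base parameters
    {"em": "restricted", "me": "added_u"},  # run with two changed parameters
    {"em": "tune"},                         # run with one changed parameter
]


def build_runs(base, overrides):
    """Return one new dictionary per run, each derived from the same base."""
    return [{**base, **override} for override in overrides]


runs = build_runs(base_params, run_overrides)
for run_id, params in enumerate(runs):
    print("Parameters for run", run_id, ":", params)
```

Because every run starts from the same base dictionary, a change made for one run is not inherited by the following runs, as would be the case with chained copies.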

### Runs Execution

The calculations of the notebooks can be done with the parameters specified by the list of dictionaries $\texttt{runtime}\_\texttt{param}\_\texttt{dict}\_\texttt{list}$.

In [4]:
import os
import results_saving_funcs as rsf
import pandas as pd
import time

path_results = './results'
path_goldstandard = './daten_goldstandard/'

# Determine all relevant notebooks, omit Overview Summary and Appendixes
notebook = ! ls [1-9]_* | grep .ipynb

for run in range(len(runtime_param_dict_list)):
    if run == 1 :
        print('\nRun id', run)
        rsf.run_notebooks(notebook, runtime_param_dict_list, run, path_results)

        # Save the resulting handover files for the run done right now
        os.rename(os.path.join(path_results, 'results.pkl'),
                  os.path.join(path_results, 'results_run_' + str(run) + '.pkl'))
        os.rename(os.path.join(path_goldstandard, 'wrong_predictions.pkl'),
                  os.path.join(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl'))
        # Assessment of run
        results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

        results['results_best_model'].reset_index(drop=True, inplace=True)
        # Ranking metric according to chapter 6 : roc auc
        display(results['results_best_model'].sort_values('auc', ascending=False))

        for classifier in results['results_model_scores'].keys() :
            # Persist results per classifier for analysis
            results['results_model_scores'][classifier].to_csv(os.path.join(path_results,
                                                                            classifier + '_run_' + str(run) + '.csv'),
                                                               index=False)

        print('********\n')

print('Done with all runs of all notebooks.')


Run id 1
Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb


| | model | auc | accuracy | precision | recall | auc_log | accuracy_log | precision_log | recall_log |
| :----: | :---------- | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: |
| 2 | DecisionTreeClassifier_CV | 98.288609 | 99.947963 | 94.370861 | 96.610169 | 4.067864 | 7.560967 | 2.877214 | 3.38439 |
| 1 | DecisionTreeClassifier | 98.118148 | 99.944108 | 94.039735 | 96.271186 | 3.972914 | 7.489508 | 2.820055 | 3.28908 |
| 3 | RandomForestClassifier | 97.951564 | 99.947963 | 94.966443 | 95.932203 | 3.888094 | 7.560967 | 2.989043 | 3.202069 |
| 6 | Neural Network | 97.438244 | 99.932544 | 93.333333 | 94.915254 | 3.664477 | 7.301456 | 2.70805 | 2.978925 |
| 5 | SVC_CV | 96.919109 | 99.905562 | 89.935065 | 93.898305 | 3.479951 | 6.964984 | 2.296113 | 2.796604 |
| 4 | SVC | 95.560269 | 99.884362 | 88.778878 | 91.186441 | 3.114576 | 6.76246 | 2.187372 | 2.428879 |
| 0 | DummyClassifier | 49.898126 | 98.893729 | 0.355872 | 0.338983 | 0.691112 | 4.504175 | 0.003565 | 0.003396 |


********

Done with all runs of all notebooks.


All predefined runs have been done. The results have been stored in specific files and will be analysed in the next section.
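The columns with suffix $\texttt{log}$ in the ranking table are not explained in this chapter. Their values appear to be consistent with the transformation $-\ln(1 - m)$ applied to a metric $m$ given as a fraction. This reading is an inference from the numbers of the table, not a statement taken from the notebooks, and the function name `log_score` below is hypothetical.

```python
import math


def log_score(metric_percent):
    """Map a metric in percent to -ln(1 - metric / 100); higher is better.

    Assumed transformation, inferred from the ranking table.
    For a metric of exactly 100 % the remainder is zero and the
    sketch returns positive infinity.
    """
    remainder = 1.0 - metric_percent / 100.0
    if remainder <= 0.0:
        return math.inf
    return -math.log(remainder)


# roc auc and accuracy of the best model of run id 1, taken from the table
print(log_score(98.288609))
print(log_score(99.947963))
```

Such a transformation stretches the region close to 100 %, so that small differences between very good models become visible.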

## Assessment of Results

The ranking of the models is shown above for each run. As a next step, the results are discussed for each group of runs separately. The goal of this first step of the discussion is to identify the best parameters of the grid search for each model.

### Run with id 0

The ranking of the best models of run with id 0 can be seen in subsection [Runs Execution](#Runs-Execution) above. The overall best model in the ranking is the Decision Tree Classifier with cross-validation. Its roc auc metric shows a value of 98.5%, which comes from a total of 9 (false predicted uniques) + 15 (false predicted duplicates) = 24 wrong predictions in the test data set, see subsection [Wrong Predictions](#Wrong-Predictions) below. This highest value of roc auc is confirmed by the highest values in all other metrics, i.e. accuracy, precision, and recall. The next best classifiers all belong to the Ensemble classifier family. The Random Forest Classifier has a total of 13 + 16 = 29 wrong predictions in the test data set, see [Wrong Predictions](#Wrong-Predictions) below, which results in a significantly lower roc auc value of below 98.0%. After the Ensemble classifier family, the Neural Network reaches the fourth best rank with a roc auc of 97.3%. Although the values for accuracy and precision are higher for the Neural Network than for the Decision Tree Classifier without cross-validation, its recall value is lower than the recall value of the Decision Tree Classifier. This last value is the reason why the Neural Network Classifier has a lower roc auc than the Decision Tree Classifier and is ranked below it. All SVM classifiers show lower values than all other models, except for the Dummy Classifier. For the SVM classifiers, it needs to be pointed out that the classifier with cross-validation shows a worse result than the classifier without cross-validation. As the classifier with cross-validation is statistically more stable, the classifier without cross-validation is less reliable. Altogether, this gives a nice and consistent picture.
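How the counts of wrong predictions translate into the metrics of the ranking can be sketched with a small function. The sketch rests on two assumptions: duplicates are the positive class, so that false predicted uniques are false negatives and false predicted duplicates are false positives, and roc auc is computed from hard class predictions. The function name and the counts in the example are hypothetical and do not stem from any run.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute classification metrics from the four confusion matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    # With hard class labels the ROC curve has a single operating point,
    # so its trapezoidal area equals the mean of recall and specificity.
    auc = (recall + specificity) / 2
    return {
        "auc": auc,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical counts: 10 false predicted uniques, 5 false predicted duplicates
scores = metrics_from_counts(tp=90, fn=10, fp=5, tn=895)
for name, value in scores.items():
    print(name, round(100 * value, 2))
```

With many more uniques than duplicates in the test data, a false predicted unique lowers recall, and therefore roc auc, far more than a false predicted duplicate lowers specificity.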

In [5]:
run = 0

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

for classifier in results['results_model_scores'].keys() :
    # Show results
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))

FileNotFoundError: [Errno 2] No such file or directory: './results/results_run_0.pkl'

The detailed ranking per classifier model, see above, allows identifying the best grid parameter set for each model.
- Although the run for the Decision Tree Classifier has been done with $\texttt{class}\_\texttt{weight}=\texttt{balanced}$ and $\texttt{None}$, no balancing seems to generate the best results. Furthermore, the gini measure seems to be the overall best measure (not entropy), when looking at the accuracy value. For the ranking of this classifier, it is important to notice that a $\texttt{max}\_\texttt{depth}$ value of 17 is the lowest value with the highest accuracy. For all $\texttt{max}\_\texttt{depth} \gt 17$, the accuracy remains constant.
- For parameter $\texttt{class}\_\texttt{weight}$, the picture is opposite with Decision Tree Classifier with cross-validation. The parameter must be set to a $\texttt{class}\_\texttt{weight}=\texttt{balanced}$ for the best models. Here, it seems that measure entropy is better suited than measure gini. This is hard to confirm, though, as the accuracy value is a mean value and considering the standard deviation of this accuracy value, both measures, entropy but also gini, and a whole range of values for $\texttt{max}\_\texttt{depth}$ remain in the same accuracy interval. The clear decision on the parameters of the best estimator remains open from the point of view of the validation data and is to be looked up directly in the stored Jupyter Notebook of the run, see [Decision Tree Model](./results/6_DecisionTreeModel_run_0.ipynb).
- Looking at the Random Forest Classifier, a tendency for unbalanced $\texttt{class}\_\texttt{weight}$ may be detected to find the best estimator. In the same way as for Decision Tree Classifier, high values of $\texttt{max}\_\texttt{depth}$ in a range around 20 are preferred which may confirm expectations as the model has been trained with a total of 20 features. For the parameter $\texttt{n}\_\texttt{estimators}$, higher values are clearly preferred.
- The Neural Network has been trained with balancing only. This is due to the experience of undocumented old runs with unbalanced and balanced $\texttt{class}\_\texttt{weight}$, which all resulted in a better performance on balanced data. A low dropout rate of around 0.1 seems to be beneficial for the model, and a regularization of 0 has always proven to be the best value for all models of undocumented runs. Therefore, no higher value of $\texttt{l2}\_\texttt{alpha}$ has been set in the grid search. Surprisingly, a higher number of nodes results in a higher accuracy value. Explicitly, this is also true for a second hidden layer. This observation, together with the observation of a low learning rate of around 0.001 and 0.003, may be one reason for the described slow stabilization rate of all Neural Network models.
- As for the Support Vector Classifiers, polynomial kernels of a degree of 3 or 4 generate the best accuracy results on the validation data. A $\gamma$ value of 1.5 and a $\texttt{C}$ value slightly lower than 1.0 produce the best models.
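The rule of picking the lowest $\texttt{max}\_\texttt{depth}$ among the models with the highest accuracy can be sketched as follows. The function name `simplest_best` and the tolerance parameter are hypothetical. The example rows are copied from the Decision Tree Classifier table of run id 1 in this chapter, as the detailed results of run id 0 are not available here.

```python
import math


def simplest_best(rows, tol=0.0):
    """Return the row with the lowest max_depth among the best rows.

    A row counts as best if its validation accuracy lies within `tol`
    of the highest accuracy. A max_depth of None (unlimited depth) is
    treated as the most complex setting.
    """
    best = max(row["accuracy_val"] for row in rows)
    candidates = [row for row in rows if row["accuracy_val"] >= best - tol]
    return min(
        candidates,
        key=lambda row: math.inf if row["max_depth"] is None else row["max_depth"],
    )


# Rows copied from the Decision Tree Classifier table of run id 1
rows = [
    {"max_depth": 18, "accuracy_val": 0.999205},
    {"max_depth": 20, "accuracy_val": 0.999253},
    {"max_depth": 22, "accuracy_val": 0.999253},
    {"max_depth": None, "accuracy_val": 0.999253},
]

print(simplest_best(rows)["max_depth"])
print(simplest_best(rows, tol=0.0001)["max_depth"])
```

A tolerance larger than zero mimics the argument made for the classifier with cross-validation, where several parameter sets lie in the same accuracy interval once the standard deviation is considered.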

### Run with id 1

In [6]:
run = 1

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

for classifier in results['results_model_scores'].keys() :
    # Show results
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))


DummyClassifier



DecisionTreeClassifier


| | class_weight | criterion | max_depth | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
| :----: | :---------- | :---------- | ---------: | ---------: | ---------: | ---------: | ---------: |
| 9 | balanced | gini | 20.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 10 | balanced | gini | 22.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 16 | balanced | gini | 50.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 15 | balanced | gini | 40.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 14 | balanced | gini | 35.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 13 | balanced | gini | 28.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 12 | balanced | gini | 26.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 11 | balanced | gini | 24.0 | 1.0 | 0.999253 | -inf | 7.199678 |
| 17 | balanced | gini | | 1.0 | 0.999253 | -inf | 7.199678 |
| 8 | balanced | gini | 18.0 | 0.999976 | 0.999205 | 10.633647 | 7.137158 |



DecisionTreeClassifier_CV


| | class_weight | criterion | max_depth | accuracy_val | std_accuracy_val | log_accuracy_val |
| :----: | :---------- | :---------- | ---------: | ---------: | ---------: | ---------: |
| 0 | balanced | gini | 2.0 | 0.951354 | 0.022701 | 3.02319 |
| 1 | balanced | gini | 4.0 | 0.985882 | 0.007431 | 4.260327 |
| 2 | balanced | gini | 6.0 | 0.98873 | 0.003432 | 4.485611 |
| 3 | balanced | gini | 8.0 | 0.993148 | 0.002125 | 4.983269 |
| 4 | balanced | gini | 10.0 | 0.996642 | 0.001312 | 5.696305 |
| 5 | balanced | gini | 12.0 | 0.99774 | 0.000572 | 6.092487 |
| 6 | balanced | gini | 14.0 | 0.998492 | 0.000431 | 6.496888 |
| 7 | balanced | gini | 16.0 | 0.998791 | 0.000392 | 6.717638 |
| 8 | balanced | gini | 18.0 | 0.999003 | 0.000147 | 6.91037 |
| 9 | balanced | gini | 20.0 | 0.999166 | 0.000181 | 7.089798 |



RandomForestClassifier


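| | class_weight | max_depth | n_estimators | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
| :----: | :---------- | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: |
| 8 | | 22.0 | 100 | 1.0 | 0.99959 | -inf | 7.800452 |
| 6 | | 22.0 | 50 | 0.999994 | 0.999566 | 12.019942 | 7.743294 |
| 9 | | | 50 | 1.0 | 0.999566 | -inf | 7.743294 |
| 3 | | 20.0 | 50 | 0.999994 | 0.999542 | 12.019942 | 7.689227 |
| 7 | | 22.0 | 75 | 0.999994 | 0.999518 | 12.019942 | 7.637933 |
| 10 | | | 75 | 1.0 | 0.999518 | -inf | 7.637933 |
| 11 | | | 100 | 1.0 | 0.999518 | -inf | 7.637933 |
| 4 | | 20.0 | 75 | 0.999982 | 0.999494 | 10.92133 | 7.589143 |
| 5 | | 20.0 | 100 | 0.999988 | 0.99947 | 11.326795 | 7.542623 |
| 2 | | 18.0 | 100 | 0.999976 | 0.999398 | 10.633647 | 7.41479 |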



SVC


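| | C | class_weight | degree | gamma | kernel | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
| :----: | ---------: | :---------- | ---------: | ---------: | :---------- | ---------: | ---------: | ---------: | ---------: |
| 0 | 0.5 | | 3 | 2.0 | poly | 0.999747 | 0.999036 | 8.282272 | 6.944786 |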



SVC_CV


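| | C | class_weight | degree | gamma | kernel | accuracy_val | std_accuracy_val | log_accuracy_val |
| :----: | ---------: | :---------- | ---------: | ---------: | :---------- | ---------: | ---------: | ---------: |
| 0 | 0.5 | | 3 | 2.0 | poly | 0.99894 | 0.000216 | -6.849461 |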



Neural Network


| | class_weight | dropout_rate | l2_alpha | number_of_hidden1_layers | number_of_hidden2_layers | sgd_learnrate | accuracy_tr | accuracy_val | log_accuracy_tr | log_accuracy_val |
| :----: | :---------- | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: |
| 5 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 40 | 70 | 0.002 | 0.999193 | 0.999164 | 7.122646 | 7.086351 |
| 3 | | 0.1 | 0.0 | 60 | 70 | 0.002 | 0.99926 | 0.99916 | 7.208791 | 7.08173 |
| 1 | | 0.1 | 0.0 | 40 | 70 | 0.002 | 0.99919 | 0.999063 | 7.117928 | 6.973174 |
| 7 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 70 | 0.002 | 0.999219 | 0.999017 | 7.15547 | 6.924974 |
| 2 | | 0.1 | 0.0 | 60 | 0 | 0.002 | 0.999071 | 0.998855 | 6.981404 | 6.772513 |
| 6 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 60 | 0 | 0.002 | 0.999028 | 0.998844 | 6.935753 | 6.762442 |
| 0 | | 0.1 | 0.0 | 40 | 0 | 0.002 | 0.998971 | 0.998809 | 6.87898 | 6.732885 |
| 4 | [0.5028541799926344, 88.09083191850594] | 0.1 | 0.0 | 40 | 0 | 0.002 | 0.99899 | 0.998716 | 6.897863 | 6.658098 |


### Wrong Predictions

In [7]:
# Unlimited number of columns allowed
pd.options.display.max_columns = None

for run in range(len(runtime_param_dict_list)):
    if run == 1 :
        # Read confusion matrix results from chapters
        wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl')

        wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
        fpu, fpd = {}, {}

        for i in wrong_predictions.keys() :
            fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
            fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

        print(wrong_prediction_groups[0])
        for i in fpu.keys() :
            print(i, len(fpu[i]), '\n', fpu[i])
        print('')
        print(wrong_prediction_groups[1])
        for i in fpd.keys() :
            print(i, len(fpd[i]), '\n', fpd[i])

false_predicted_uniques
DecisionTreeClassifier 11 
 [550, 691, 900, 911, 914, 931, 932, 1014, 1039, 1041, 1254]
DecisionTreeClassifier_CV 10 
 [264, 267, 550, 911, 914, 921, 924, 931, 932, 1041]
RandomForestClassifier 12 
 [632, 691, 900, 911, 914, 921, 924, 931, 932, 1014, 1039, 1254]
SVC 26 
 [264, 267, 460, 465, 468, 471, 507, 641, 642, 685, 691, 731, 779, 857, 900, 916, 921, 924, 931, 932, 985, 1014, 1039, 1254, 1256, 1317]
SVC_CV 18 
 [264, 267, 460, 465, 468, 471, 641, 642, 685, 738, 821, 900, 916, 921, 924, 985, 1256, 1317]
Neural Network 15 
 [289, 465, 486, 547, 641, 642, 685, 778, 821, 916, 921, 923, 924, 1039, 1404]

false_predicted_duplicates
DecisionTreeClassifier 18 
 [3652, 37159, 43593, 49754, 61573, 64500, 67321, 80378, 100540, 126621, 148802, 152201, 154612, 165879, 179983, 196658, 197599, 198366]
DecisionTreeClassifier_CV 17 
 [16225, 24530, 37159, 43593, 49754, 82813, 126025, 135180, 139980, 150037, 152201, 154612, 160727, 179983, 197599, 198366, 198538]
RandomFores
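To see which pairs are misclassified by several models at the same time, the index lists can be intersected. The sketch below is an illustration only: the function name `common_indices` and the variable name `fpu_trees` are hypothetical, and the index lists are copied from the false predicted uniques of the three tree-based models in the output above.

```python
# Index lists of false predicted uniques, copied from the output above
fpu_trees = {
    "DecisionTreeClassifier": [550, 691, 900, 911, 914, 931, 932, 1014, 1039, 1041, 1254],
    "DecisionTreeClassifier_CV": [264, 267, 550, 911, 914, 921, 924, 931, 932, 1041],
    "RandomForestClassifier": [632, 691, 900, 911, 914, 921, 924, 931, 932, 1014, 1039, 1254],
}


def common_indices(index_lists):
    """Return the sorted indices that occur in every one of the given lists."""
    sets = [set(indices) for indices in index_lists]
    return sorted(set.intersection(*sets))


# Pairs that all three tree-based models wrongly predict as uniques
print(common_indices(fpu_trees.values()))
```

Pairs that are misclassified by all models of a family are natural candidates for a closer look at their attribute values.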

In [8]:
import bz2
import _pickle as cPickle

# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File((os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl')), 'rb') as file:
    df_attribute_with_sim_feature = cPickle.load(file)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [9]:
for run in range(len(runtime_param_dict_list)):
    if run == 1 :
        # Note: i still holds the last classifier key of the loops in cell [7]
        print(wrong_prediction_groups[0])
        display(df_attribute_with_sim_feature.iloc[fpu[i]])
        print(wrong_prediction_groups[1])
        display(df_attribute_with_sim_feature.iloc[fpd[i]])

false_predicted_uniques


Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
289,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,20071990,20071990,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,1.0,502023.0,502023.0,-1.0,,,-1.0,,,0.586432,ein filme von volker schlöndorff,regie: volker schlöndorff ; drehbuch: volker s...,0.851843,"schlöndorffvolker, frischmax, delpyjulie, shep...","schlöndorffvolker, wurlitzerrudy, frischmax, m...",-0.5,,kinowelt home entertainment,-1.0,,,1.0,homo faber,homo faber,-1.0,,,1.0,2 109,2 109
465,1,-1.0,,,-1.0,,,-0.5,,"rundfunkchor, sächsische staatskapelle dresden",-1.0,,,-1.0,,,0.5,aaaaaaaa,1991uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-1.0,,,-0.5,,422.0,-0.5,,43,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.672418,wolfgang amadeus mozart,[musik]: wolfgang amadeus mozart; libretto: em...,0.546377,"mathisedith, karajanherbert von","mollkurt, serraluciana, pricemargaret, venutim...",-0.5,,[phonogram],-1.0,,,0.545964,zauberflöte,"die zauberflöte, kv 620 : eine deutsche oper i...",-1.0,,,0.777778,3,3 1
486,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,10425.0,0.75,1932aaaa,1932uuuu,1.0,10200,10200,1.0,mu,mu,1.0,[],[],-1.0,,,-0.5,10425.0,,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.540472,w. a. mozart ; klavierauszug nach dem in der p...,w.a: mozart / [hrsg. von kurt soldan],-0.5,,"soldankurt, mozartwolfgang amadeus",-0.5,,c.f. peters,-1.0,,,0.92029,"die zauberflöte, oper in 2 aufzügen : [kv 620]","die zauberflöte, oper in 2 aufzügen",-1.0,,,0.733333,1,1 188
547,1,-1.0,,,-1.0,,,1.0,schweizbundesamt für landestopografie,schweizbundesamt für landestopografie,-1.0,,,-1.0,,,0.75,2007aaaa,2007uuuu,1.0,10347,10347,1.0,mp,mp,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,bundesamt für landestopografie swisstopo,bundesamt für landestopografie swisstopo,1.0,dufourguillaume henri,dufourguillaume henri,-1.0,,,-1.0,,,1.0,"dufourkarten, topografische karte der schweiz","dufourkarten, topografische karte der schweiz",-1.0,,,1.0,2,2
641,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.625,19aaaaaa,1950uuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-1.0,,,0.0,4355.0,912.0,-0.5,,912 912,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.880856,a german opera by emanuel schikaneder ; music ...,a german opera by emanuel schikaneder ; music ...,-0.5,,"aberthermann, schikanederemanuel",-0.5,,e. eulenburg,-1.0,,,0.833333,die zauberflöte,"die zauberflöte, köchel no 620",-1.0,,,0.733333,1,1 412
642,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.625,1950aaaa,19uuuuuu,1.0,10100,10100,1.0,mu,mu,1.0,[],[],-1.0,,,0.0,912.0,4355.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.713937,wolfgang amadeus mozart ; libretto by emanuel ...,a german opera by emanuel schikaneder ; music ...,-0.5,"schikanederemanuel, aberthermann",,-0.5,e. eulenburg,,-1.0,,,0.770833,"die zauberflöte, the magic flute : opera : k 620",die zauberflöte,-1.0,,,0.733333,1 412,1
685,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1471aaaa,1471uuuu,1.0,20053,20053,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,1.0,crescenzipietro de',crescenzipietro de',-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,ruralia commoda,ruralia commoda,-1.0,,,1.0,418,418
778,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,1993aaaa,1993uuuu,0.428571,10000,10100,1.0,mu,mu,1.0,[963-8303-08-5],[963-8303-08-5],-1.0,,,-0.5,1004.0,,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,1.0,wolfgang amadeus mozart,wolfgang amadeus mozart,-1.0,,,-0.5,,könemann,-1.0,,,1.0,"die zauberflöte, partitura","die zauberflöte, partitura",-1.0,,,0.511111,1 225,225
821,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2005aaaa,2005uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,-0.5,99064.0,,-1.0,,,-1.0,,,1.0,ein film von luc jacquet,ein film von luc jacquet,1.0,jacquetluc,jacquetluc,-0.5,bonne pioche,,-1.0,,,0.767123,"die reise der pinguine, die natur schreibt die...",die reise der pinguine,-1.0,,,0.75,1,1 82
916,1,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,aaaaaaaa,1941uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,0.796296,2620 5,2620 2620,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.839556,von mozart ; dichtung von emanuel schikaneder ...,von mozart ; dichtung von emanuel schikaneder ...,0.771292,"krusegeorg richard, schikanederemanuel","schikanederemanuel, krusegeorg richard",0.809524,reclam,p. reclam jun.,-1.0,,,0.881356,"die zauberflöte, oper in zwei aufzügen","die zauberflöte, oper in zwei aufzügen : volls...",-1.0,,,1.0,74,74


false_predicted_duplicates


Unnamed: 0,duplicates,coordinate_E_delta,coordinate_E_x,coordinate_E_y,coordinate_N_delta,coordinate_N_x,coordinate_N_y,corporate_full_delta,corporate_full_x,corporate_full_y,doi_delta,doi_x,doi_y,edition_delta,edition_x,edition_y,exactDate_delta,exactDate_x,exactDate_y,format_postfix_delta,format_postfix_x,format_postfix_y,format_prefix_delta,format_prefix_x,format_prefix_y,isbn_delta,isbn_x,isbn_y,ismn_delta,ismn_x,ismn_y,musicid_delta,musicid_x,musicid_y,part_delta,part_x,part_y,person_100_delta,person_100_x,person_100_y,person_245c_delta,person_245c_x,person_245c_y,person_700_delta,person_700_x,person_700_y,pubinit_delta,pubinit_x,pubinit_y,scale_delta,scale_x,scale_y,ttlfull_245_delta,ttlfull_245_x,ttlfull_245_y,ttlfull_246_delta,ttlfull_246_x,ttlfull_246_y,volumes_delta,volumes_x,volumes_y
32326,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.875,20071990,20071991,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,-0.5,,502023.0,-1.0,,,-1.0,,,0.741377,ein volker schlöndorff film ; nach dem gleichn...,regie: volker schlöndorff ; nach dem roman von...,0.883582,"schlöndorffvolker, frischmax, shepardsam, delp...","schlöndorffvolker, frischmax",-0.5,,kinowelt home entertainment,-1.0,,,1.0,homo faber,homo faber,-1.0,,,0.866667,2 109,1 109
43593,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,20091990,20071991,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,0.5,502430.0,502023.0,-1.0,,,-1.0,,,0.772755,ein film von volker schlöndorff ; nach dem rom...,regie: volker schlöndorff ; nach dem roman von...,0.86115,"schlöndorffvolker, wurlitzerrudy, frischmax, s...","schlöndorffvolker, frischmax",1.0,kinowelt home entertainment,kinowelt home entertainment,-1.0,,,1.0,homo faber,homo faber,-1.0,,,1.0,1 109,1 109
49754,0,-1.0,,,-1.0,,,-0.5,,interkantonale lehrmittelzentrale (luzern),-1.0,,,-1.0,,,0.75,aaaa9999,19969999,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-0.5,,[éd.:] interkantonale lehrmittelzentrale luzern,-0.5,hubercharles,,0.977778,[staatlicher lehrmittelverlag],staatlicher lehrmittelverlag,-1.0,,,0.89521,"bonne chance!, cours de langue française : pre...","bonne chance!, cours de langue française, 2",-1.0,,,-1.0,,
51294,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,19969999,19969999,1.0,30300,30300,1.0,cr,cr,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,1.0,universitätsverlag,universitätsverlag,-1.0,,,0.698559,"bildungsforschung und bildungspraxis, educatio...","bildungsforschung und bildungspraxis. beiheft,...",-0.5,"educazione e ricerca., education et recherche....",,-1.0,,
56220,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.75,2011aaaa,2011uuuu,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,-1.0,,,0.777778,1 1,1,-0.5,schlöndorffvolker,,0.720374,volker schlöndorff ; nach dem roman von max fr...,ein film von volker schlöndorff ; nach dem rom...,0.638638,"frischmax, junkersdorfeberhard","schlöndorffvolker, frischmax, delpyjulie, shep...",0.777778,"suhrkamp, absolut medien",suhrkamp,-1.0,,,1.0,homo faber,homo faber,-1.0,,,1.0,1 117,1 117
60958,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,1909aaaa,1970uuuu,1.0,20000,20000,1.0,bk,bk,1.0,[],[],-1.0,,,-1.0,,,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.649672,textbuch von emanuel schikaneder ; szenische e...,wolfgang amadeus mozart ; nacherzählt von ingr...,0.615271,"schikanederemanuel, loewenfeldhans, leflerhein...","weixelbaumeringrid, riera rochasroque",-0.5,,ueberreuter,-1.0,,,1.0,die zauberflöte,die zauberflöte,-1.0,,,1.0,1,1
68059,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,aaaaaaaa,1980uuuu,1.0,10200,10200,1.0,mu,mu,1.0,[],[],-1.0,,,1.0,10425.0,10425.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,-0.5,,w. a. mozart ; text von emanuel schikaneder ; ...,0.55248,soldankurt,"schikanederemanuel, zallingermeinhard von",-0.5,,peters,-1.0,,,0.711793,"die zauberflöte, oper in zwei aufzügen : klavi...","die zauberflöte, eine deutsche oper in 2 aufzü...",-0.5,"die zauberflöte, ausgabe für gesang und klavier",,0.733333,1,1 188
68062,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.5,aaaaaaaa,1955uuuu,1.0,10200,10200,1.0,mu,mu,1.0,[],[],-1.0,,,1.0,10425.0,10425.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,-0.5,,[musik] von w. a. mozart ; klavierauszug nach ...,0.50832,soldankurt,"zallingermeinhard von, schikanederemanuel",-0.5,,peters,-1.0,,,0.732405,"die zauberflöte, oper in zwei aufzügen : klavi...","die zauberflöte, oper in 2 aufzügen",-0.5,"die zauberflöte, ausgabe für gesang und klavier",,1.0,1,1
103658,0,-1.0,,,-1.0,,,-1.0,,,-1.0,,,-1.0,,,0.875,20091991,20081991,1.0,10300,10300,1.0,vm,vm,1.0,[],[],-1.0,,,-0.5,502430.0,,-1.0,,,-1.0,,,0.794715,ein film von volker schlöndorff ; nach dem rom...,ein film von volker schlöndorff ; nach dem rom...,0.885478,"schlöndorffvolker, frischmax, junkersdorfeberh...","schlöndorffvolker, frischmax, arvantisjorgos, ...",1.0,kinowelt home entertainment,kinowelt home entertainment,-1.0,,,1.0,homo faber,homo faber,-1.0,,,1.0,1 109,1 109
105562,0,-1.0,,,-1.0,,,0.086957,"rundfunkchor, sächsische staatskapelle dresden","berliner philharmoniker, rias-kammerchor",-1.0,,,-1.0,,,0.625,1984aaaa,1964uuuu,1.0,40100,40100,1.0,mu,mu,1.0,[],[],-1.0,,,0.333333,422.0,449.0,-1.0,,,1.0,mozartwolfgang amadeus,mozartwolfgang amadeus,0.806061,wolfgang amadeus mozart ; libretto: emanuel sc...,wolfgang amadeus mozart,0.745726,"mollkurt, serraluciana, pricemargaret, venutim...","böhmkarl, crassfranz, fischer-dieskaudietrich,...",0.296992,philips,deutsche grammophon,-1.0,,,0.822917,"die zauberflöte, the magic flute",die zauberflöte,-1.0,,,0.0,3,2 1
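
The tables above are wide, so it can help to reduce each wrongly predicted pair to its weakest similarity features. The following is a minimal sketch, not part of the original notebooks. The helper name `weakest_features` and the sample values are hypothetical (the sample copies a few values from row 105562 above), and the code assumes that negative values in the `_delta` columns mark missing attribute values, as the rows above suggest.

```python
import pandas as pd

def weakest_features(row: pd.Series, n: int = 3) -> pd.Series:
    """Return the n lowest non-negative similarity values of one record pair."""
    delta_columns = [c for c in row.index if c.endswith("_delta")]
    deltas = row[delta_columns].astype(float)
    # Assumption: negative values are missing-value markers, so they are skipped.
    return deltas[deltas >= 0].nsmallest(n)

# Hypothetical sample with a few values taken from row 105562 above.
sample_row = pd.Series({
    "corporate_full_delta": 0.086957,
    "doi_delta": -1.0,
    "exactDate_delta": 0.625,
    "musicid_delta": 0.333333,
    "pubinit_delta": 0.296992,
    "pubinit_x": "philips",
    "volumes_delta": 0.0,
})
weak = weakest_features(sample_row)
print(weak)
```

In the notebook, the helper could be applied to a single row of `df_attribute_with_sim_feature`, for example via `.loc` with one of the index values listed above.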


In [10]:
df_index_docids.iloc[fpu[i]]

Unnamed: 0,035liste_x,035liste_y,docid_x,docid_y
289,"[(OCoLC)886772374, (IDSLU)000547974]","[(OCoLC)604985552, (NEBIS)005519625]",31003621,199374376
465,"[(OCoLC)882061057, (SBT)000242507]","[(OCoLC)638188846, (NEBIS)001183345]",41431642,136079180
486,"[(OCoLC)885295528, (IDSLU)000411402]",[(HEMU)357],21920796,485370239
547,"[(OCoLC)611356565, (IDSLU)000546873]","[(OCoLC)611356565, (IDSLU)000546873]",23403969,23403969
641,"[(OCoLC)611159941, (IDSLU)000464498]","[(VAUD)991019165679702852, (RNV)000396480-41bc...",28968867,405473354
642,[(RERO)R007095034],"[(OCoLC)611159941, (IDSLU)000464498]",252355962,28968867
685,"[(OCoLC)611643448, (IDSSG)000416104]","[(OCoLC)611643448, (IDSSG)000416104]",32531982,32531982
778,"[(OCoLC)806965128, (IDSLU)001278755]","[(OCoLC)806965128, (SGBN)000433323]",482993472,53400631
821,"[(OCoLC)807003147, (SGBN)000610425]",[(KBTG)131754],55479324,505863065
916,"[(OCoLC)604627094, (NEBIS)009407654]",[(RERO)1706143],195531280,214241025
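
Two of the false predicted uniques listed above, rows 547 and 685, carry the same value in `docid_x` and `docid_y`, so each of them is a record paired with itself. The following is a minimal sketch, not part of the original notebooks, for flagging such self-pairs. The helper name `find_self_pairs` and the small sample frame are hypothetical; only the column names `docid_x` and `docid_y` and the sample values are taken from the output above.

```python
import pandas as pd

def find_self_pairs(df_docids: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose two document ids are identical."""
    # Compare as strings so that integer and string ids are treated alike.
    same = df_docids["docid_x"].astype(str) == df_docids["docid_y"].astype(str)
    return df_docids[same]

# Hypothetical sample built from three rows of the output above.
sample = pd.DataFrame(
    {
        "docid_x": [31003621, 23403969, 32531982],
        "docid_y": [199374376, 23403969, 32531982],
    },
    index=[289, 547, 685],
)
self_pairs = find_self_pairs(sample)
print(self_pairs.index.tolist())
```

Applied to `df_index_docids.iloc[fpu[i]]`, a check of this kind would separate self-pairs from pairs of distinct records before the wrong predictions are interpreted.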


## Summary