# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** 06-APR-2020

## Table of Contents

- [Introduction](#Introduction)
    - [Requirements](#Requirements)
    - [Acknowledgements](#Acknowledgements)
- [Structure of the Project](#Structure-of-the-Project)
- [Runs and Results](#Runs-and-Results)
    - [Runtime Parameters](#Runtime-Parameters)
    - [Overview of Runs](#Overview-of-Runs)
    - [Runs Execution](#Runs-Execution)
- [Assessment of Results](#Assessment-of-Results)
    - [Run with id 0](#Run-with-id-0)
    - [Run with id 1](#Run-with-id-1)
    - [Run with id 2](#Run-with-id-2)
    - [Run with id 3](#Run-with-id-3)
    - [Run with id 4](#Run-with-id-4)
    - [Conclusion of Runs](#Conclusion-of-Runs)
    - [Wrong Predictions](#Wrong-Predictions)
    - [Classification of Swissbib's Goldstandard Data](#Classification-of-Swissbib's-Goldstandard-Data)
- [Comparison with Literature](#Comparison-with-Literature)
- [Summary and Outlook](#Summary-and-Outlook)

## Introduction

The goal of this capstone project is to explore data deduplication with the help of machine learning methods. The data to be deduplicated is bibliographical data, provided by the online catalogue [Swissbib](https://www.swissbib.ch/). Swissbib's currently implemented deduplication mechanism is to be replaced by a new mechanism implemented with machine learning. More information on Swissbib, its implemented architecture for deduplication and its data can be found in the [proposal](./project-proposal-andreas-jud.ipynb) for the capstone project.

This capstone project is the fifth and concluding module of the online course [Applied Data Science: Machine Learning](https://www.extensionschool.ch/applied-data-science-machine-learning) by the [EPFL Extension School](https://www.extensionschool.ch/). In its first two modules, the course teaches the programming language [Python](https://www.python.org/), while modules 3 and 4 give an insight into machine learning. For this reason, this capstone project implements code written in Python and applies the machine learning methods that are covered by the course.

### Requirements

This capstone project uses several publicly available Python libraries. Each chapter that needs a library shows the
<br>$\texttt{! pip install <library name>}$<br>
command in a separate code cell. These commands have been executed once for the development environment of the author and have been commented out for later execution runs in order to produce more readable notebooks. For executing the set of notebooks of the capstone project on a Python development environment with a basic setup, a [requirements.txt](./requirements.txt) file has been written. This file can be passed to pip in the code cell below to install the library packages needed for this capstone project.

In [1]:
#! pip install -r requirements.txt

### Acknowledgements

The author of this capstone project is outside the academic library sector. This capstone project would not have been possible without the fundamental support of the Swissbib project team. During a short time period of about four months, the author has had the chance to meet Swissbib's project team members, learn from them and discuss his ideas with them. It was a steep learning curve and the author appreciates the team members' patience and curiosity for his methods. Special and warm thanks go to Silvia Witzig of the project team. She is the unique and absolute expert on Swissbib's goldstandard data and on the implemented deduplication logic. Her sharing of her knowledge was very efficient and effective. Explicit thanks also go to Günter Hipler. Günter brought up the idea for the project, a problem which the author initially had no idea how to solve. Günter also implemented several file exports of the goldstandard and of test data. Without this data provisioning, this project would not have been possible.

Two more people have greatly supported the author in the course of the online training in general and explicitly during this capstone project. EPFL course trainer Michael Notter was available on demand for inspiring discussions with helpful hints and ideas for the project. And the author's spouse Claudia has supported him mentally and has given him guidance and motivation even in moments of strain.

This capstone project forms the final module of an online course that has been done as in-service training. The author's employer and sponsor Baloise Insurance has to be mentioned here. Baloise is a company with an open-minded management and the author's thanks go explicitly to his supervisor Nicole Rupp, who agreed to a topic for this capstone project outside the insurance sector.

Thanks to all of you! I hope I can give a little bit of my enthusiasm back to you.

## Structure of the Project

The initial notebook of this capstone project is its [proposal](./project-proposal-andreas-jud.ipynb). Based on it, the collection of notebooks of the capstone project consists of the following chapters.

0. Overview and Summary
0. [Data Analysis](./1_DataAnalysis.ipynb)
0. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
0. [Data Synthesizing](./3_DataSynthesizing.ipynb)
0. [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb)
0. [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb)
0. [Decision Tree Model](./6_DecisionTreeModel.ipynb)
0. [Support Vector Classifier Model](./7_SVCModel.ipynb)
0. [Neural Network Model](./8_NeuralNetwork.ipynb)

Appendix

- [A. References](./A_References.ipynb)
- [B. Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb)
- [C. Assessment of Models Trained with Synthetic Data](./C_AssessmentSyntheticModels.ipynb)

Chapter 0 is the summary chapter that executes all Jupyter Notebooks of the project, analyses their results and assesses the calculated models. The time needed to run this notebook is about 14 hours on a two-year-old Apple desktop.

Chapter 1 analyses nearly 200,000 records provided by Swissbib in a data extract. In the beginning of the chapter, some sample records of different formats of bibliographical units are shown. After this general assessment, all attributes are investigated separately and in depth with the goal of understanding their contents and their potential contribution as a feature to a machine learning model. The artefact of the chapter is a dictionary of metadata on the attributes being processed in the scope of this capstone project. This artefact is read by the next chapter for further processing.

Chapter 2 analyses Swissbib's goldstandard data. After having understood the relationship of the records among each other in the raw data, pairs of records will be built. These records of pairs of the original records will be the starting point for the feature matrix with the similarity values. This feature base will be handed over as an artefact of the chapter to be further processed in the next chapter.

Chapter 3 generates artificial records of pairs with the goal to increase the ratio of pairs of duplicates in the data for training and testing. The amount of increase can be controlled by a numerical target value for the ratio of duplicate pairs. A pair of identical duplicates would not correspond to Swissbib's raw data reality, though. Therefore, a second target amount of pairs of duplicates is modified according to suggestions described in [[Chri2012](./A_References.ipynb#chri2012)]. This second target amount can be controlled by another parameter. At the end of the chapter, the amount of pairs of uniques can be reduced with the help of yet another control parameter. This reduction of pairs of uniques increases the ratio of duplicates in the training data. The main goal of this downsampling of uniques is the reduction of the full amount of training data, though. The idea behind it is an increase of the run performance of the notebooks where the models are calculated.

Chapter 4 is one central part of the capstone project. In it, the features of the feature matrix are determined. For the project at hand, a feature is a numerical distance value between the same attribute of the two records of a pair. The distance value is calculated with the help of a similarity algorithm, provided by the functions of a Python library. The artefact of chapter 4 is the labelled feature matrix, still stored in the form of a pandas DataFrame.
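The idea of one similarity feature per attribute can be illustrated with a minimal sketch. It is not the implementation of chapter 4: the standard library's `difflib.SequenceMatcher` stands in for the similarity algorithms of the project, and the function names, attribute names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(left: str, right: str) -> float:
    """Return a similarity value in [0, 1]; 1.0 means identical strings."""
    # Stand-in metric (assumption); the project selects its own metrics.
    return SequenceMatcher(None, left, right).ratio()


def feature_row(record_a: dict, record_b: dict, attributes: list) -> dict:
    """Compare the same attribute of both records, one feature per attribute."""
    return {
        attr + '_similarity': similarity(record_a[attr], record_b[attr])
        for attr in attributes
    }


# Hypothetical pair of records, for illustration only
record_a = {'title': 'Die Physiker', 'person': 'Duerrenmatt, Friedrich'}
record_b = {'title': 'Die Physiker', 'person': 'Durrenmatt, F.'}

row = feature_row(record_a, record_b, ['title', 'person'])
print(row)
```

Each pair of records yields one such row. Stacking the rows of all pairs, together with the duplicate label, gives a labelled feature matrix of the kind described above.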

Chapter 5 uses the artefact of the preceding chapter to analyse the effect of the features calculated in chapter 4. A similarity metric for an attribute is valid if it separates records of pairs of uniques from records of pairs of duplicates. The analysis is done with the help of a series of histograms. Afterwards, the first machine learning models are generated. The fitted models belong to the machine learning category of unsupervised learning. As the target vector is omitted during training, those models have to find clusters of records in the absence of labels. A Principal Component Analysis, a t-SNE model, and a k-means clustering model each generate impressive results. Chapter 5 ends with fitting a dummy classifier which will be used as a statistical baseline for the models fitted in the remaining chapters.

Chapter 6 carries the title 'Decision Tree Model'. In it, three different classifiers are fitted, each one of the Ensemble family. A simple Decision Tree Classifier is used as a warm-up of the chapter, before fitting statistically more robust models like a Decision Tree Classifier with cross-validation and a Random Forests Classifier. One important part of chapter 6 is the introduction of the performance metric used for this capstone project. Another important part is a first effort at understanding and interpreting the models of this project. Chapter 6 passes the models' performance to chapter 0 as its results for global assessment and comparison.

Chapter 7 calculates in its central part a Support Vector Classifier with the help of cross-validation. This classifier offers another chance for approaching the interpretation of the models, and some effort is spent on this in the chapter. Chapter 7 passes the models' performance to chapter 0 as its results for the global assessment.

Chapter 8 is the third and last chapter for training a model. A Neural Network is fitted in an implementation with the Keras library. The slow convergence of the networks is striking. A lot of epochs need to be run for the model to reach its absolute maximum. The second striking aspect is the size of the network favoured by better performance figures. On the one hand, a network with two hidden layers and, on the other hand, a network with a large number of neurons per layer show the best performance characteristics. Chapter 8, too, contributes its results to the global assessment of this chapter.

Appendix A holds the bibliography of the capstone project.

Appendix B systematically compares all similarity metrics available in Python library $\texttt{textdistance}$ for samples of all attributes used for the feature matrix of this capstone project. This systematic comparison is the basis for deciding on the similarity metrics in chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb).

Appendix C compares the results of two Ensemble classifier models trained with the additional help of synthetic data produced in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb) with the results of the same classifiers trained exclusively with pure Swissbib data. This comparison reveals the quality of the chosen data synthesis of chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).

## Runs and Results

This section starts with explaining the runtime parameters with which the notebooks of the capstone project can be called. After the parameter space has been settled, a series of runs will be executed with different parameter values each.

### Runtime Parameters

The notebooks of this capstone project can be called with eight specific global parameters. These parameters are listed and explained here.

- $\texttt{execution}\_\texttt{mode}$ - This parameter has been introduced to control the runtime of execution. Grid search has been implemented with the goal of finding the best parameters for a model. The bigger the grid space, i.e. the more grid points the grid space has for each of its dimensions, the more model fits have to run and the longer the runtime of a notebook. Oversampling of records of duplicates increases the runtime of a notebook even more. When searching for the best parameters for a model, the grid space has to be scanned widely. The runtime may extend to hours, even days, for such calculations. For some runs, smaller grid spaces may be sufficient. A restricted grid space can be chosen in order to save calculation time. The execution mode of a notebook may have four distinct values.
    - Mode $\texttt{manual}$ will be used for executing the notebook, opening it and running it directly cell by cell. This kind of execution shall be called a local execution mode. The original purpose of this mode of execution has been to open the notebook and read its text, in order to focus on the contents and specific explanation for a model. Runtime is supposed to be moderate or even short for these execution modes. The result of the notebook is to be reflected by its text with the purpose to explain it thoroughly. The grid parameters chosen for this mode have flowed back from the insights found from results with full execution mode of this chapter.
    - Mode $\texttt{full}$ will be used for executing the notebook, calling it in this very chapter and collecting the results of each notebook for final comparison and assessment.
    - Mode $\texttt{restricted}$ serves for exploring different data processing modes explained below in this list, cf. parameters $\texttt{factor}$, $\texttt{mode}\_\texttt{exactDate}$, and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$. Runtime is supposed to be short to moderate again for these execution modes. The grid parameters chosen for this mode have flowed back from the insights found from results with full execution mode of this chapter.
    - Mode $\texttt{tune}$ will be used for a final fine tuning of the models' parameters. Goal of a run in mode $\texttt{tune}$ is to get the best models of a grid space close to a precalculated best model of the wide grid space run with mode $\texttt{full}$. While mode $\texttt{full}$ will be used for scanning a wide range of orders of magnitude of the parameter space, mode $\texttt{tune}$ will be used for scanning the neighbouring parameter points of the best models of the mode $\texttt{full}$ run. This approach helps in an iterative search for the best parameters of the models.
- $\texttt{oversampling}$ - The number of records of duplicates generated with Swissbib's goldstandard data has been low compared to the number of records of pairs of uniques. The consequence has been to generally use balancing for model fitting. In order to increase the ratio of duplicates in the training and testing data, an oversampling with synthetic data has been implemented. To control the ratio, parameter $\texttt{oversampling}$ can be used. Synthetic data will be multiplied in a for loop so as to reach a ratio of oversampling in percent \[%\] in the final data set for model calculation. If $\texttt{oversampling}=0$, no synthetic data will be added to the goldstandard data. This parameter will be used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).
- $\texttt{modification}\_\texttt{ratio}$ - This parameter will be used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), too. In that chapter, some specific kinds of data modification (typos) to be simulated have been defined for each attribute. If an attribute shows one or more kinds of modification, this parameter controls the ratio and therefore the amount of records with modification.
- $\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ - The models of chapters [Support Vector Classifier Model](./7_SVCModel.ipynb) and [Neural Network Model](./8_NeuralNetwork.ipynb) explicitly suffer from long runtime during training. Ways to reduce this duration to a smaller order of magnitude have been searched. Two different basic ways could be imagined. The first one is to use a Principal Component Analysis (PCA) to transform the features of a model to a lower dimensionality. This way of dimensionality reduction has been rejected with the desire of keeping the full information of all the features of the model. An alternative way of reducing the calculation load on a model has been chosen instead. In chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), two kinds of downsampling have been implemented. The first kind reduces the number of records of the training, validation and testing data, independent of their class, by selecting a purely random subset of records out of the basic data set. This kind of downsampling leaves the ratio of the two classes unchanged. The target ratio for the subset is set by parameter $\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$.
- $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ - The second kind of downsampling implemented in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb) reduces the amount of records of class unique, only. This kind leaves the low amount of records of pairs of duplicates untouched. Reducing exclusively the amount of records of class unique increases the ratio of records of pairs of duplicates in the total data set for training, validation and testing as a side effect. Details on the implementation are given in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), the parameter for controlling the downsampling in the second kind is $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$.
- $\texttt{factor}$ - In Swissbib's raw data, records may have missing values in attributes. When building pairs of records for generating the feature matrix, records may occur with a value on both sides of a pair, but also with missing values on one side of a pair and even with missing values on both sides of a pair, see chapters [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb) for a deeper discussion. Missing values may influence the model. For that reason, a decision has been taken to mark the features of records of pairs with missing attribute values. One way of marking them is to transform them to a negative similarity value. During implementation, the question arose how the distance from the origin (similarity value of 0) on the negative similarity side would influence a model, especially a Neural Network, due to its linear dependency on firing of a neuron. To be able to set the distance from the origin, this factor has been introduced. In the implemented code, the factor ...
    - multiplies -0.5 if one attribute of the pair is missing.
    - multiplies -1.0 if both attributes of the pair are missing.
- $\texttt{mode}\_\texttt{exactDate}$ - The basic similarity metric of attribute $\texttt{exact}\_\texttt{date}$ undergoes some modification in the presence of unknown values, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) for implementation details. Two different modes of modifying the basic similarity metric have been implemented. To decide on one mode of modification, parameter $\texttt{mode}\_\texttt{exactDate}$ has been introduced.
- $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ - Swissbib's raw data deliver attributes $\texttt{scale}$, $\texttt{part}$, and $\texttt{volumes}$ as full-text strings. Swissbib's deduplication engine extracts their number digit parts in a preprocessing step with the goal of generating more reliable results. A very basic stripping function has been implemented in this capstone project with the goal of imitating Swissbib's more sophisticated logic. The model result may change as a function of the similarity values of these three attributes. To assess the effect of stripping the attribute values, parameter $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ will be used for switching on ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{True}$) and off ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$) the stripping to number digits logic.
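The marking of missing values controlled by parameter $\texttt{factor}$ can be sketched as follows. This is a minimal illustration under assumptions, not the code of chapter 4: missing values are assumed to be represented by `None`, and the function names and the exact-match stand-in metric are hypothetical.

```python
def exact_match(left: str, right: str) -> float:
    """Hypothetical stand-in for a real similarity metric."""
    return 1.0 if left == right else 0.0


def marked_similarity(value_a, value_b, factor=1.0, metric=exact_match):
    """Return the similarity, or a negative marker if values are missing."""
    # Count the missing sides of the pair (None is an assumed encoding).
    n_missing = (value_a is None) + (value_b is None)
    if n_missing == 2:
        return -1.0 * factor  # both attributes of the pair are missing
    if n_missing == 1:
        return -0.5 * factor  # one attribute of the pair is missing
    return metric(value_a, value_b)


print(marked_similarity('1999', '1999'))           # regular similarity
print(marked_similarity('1999', None))             # one side missing
print(marked_similarity(None, None, factor=0.1))   # both missing, small factor
```

With $\texttt{factor}=1.0$, the markers lie at -0.5 and -1.0. With $\texttt{factor}=0.1$, they move close to the origin, to -0.05 and -0.1.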

The usage of the above global parameters for the notebooks will be explained in the run strategy outlined in the next subsection.
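The switch $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ described in the list above might be pictured with the following sketch. It is an assumption of how a very basic stripping to number digits could look, not the function of the project. The function name and the sample values are hypothetical.

```python
import re


def preprocess_attribute(text: str, strip_number_digits: bool = True) -> str:
    """Optionally reduce a full-text attribute value to its digits."""
    if not strip_number_digits:
        return text  # leave the raw data value untouched
    # Remove every character that is not a decimal digit.
    return re.sub(r'\D', '', text)


# Hypothetical raw values of attributes scale and part
print(preprocess_attribute('1 : 25 000'))           # -> 125000
print(preprocess_attribute('Bd. 3'))                # -> 3
print(preprocess_attribute('Bd. 3', False))         # -> Bd. 3
```

Two differently formatted values such as `'1:25000'` and `'1 : 25 000'` reduce to the same digit string, which is the kind of effect the stripping is expected to have on the similarity values.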

### Overview of Runs

In the course of this capstone project, many runs have been executed. These runs were started in very early stages of the implementation. The results of some of these runs have been discarded while others have been stored in the github space for the proposal [[PropRepo](./A_References.ipynb#proposal_repo)] or the project github space [[ProjRepo](./A_References.ipynb#project_repo)] of the capstone project. An overview of the most important runs is given in the <a href='https://en.wikipedia.org/wiki/Numbers_(spreadsheet)'>Apple Numbers</a> file [runs_summary](./documentation/runs_summary.numbers) which is also provided as a .csv file [runs_summary.csv](./documentation/runs_summary.csv). The assessment of these runs has generated a specific kind of experience gained on the behaviour of the models fitted in the course of the capstone project. This experience has flowed into the run strategy described in this subsection.

With the description of the runtime parameters in the subsection above, a multi-dimensional space of calculation options has been spanned. The dimensions of this space have increased in the course of the project. This increase turned the limitations of the hardware resources more and more into the critical factor. One answer to the increased need for computational power was the downsampling of the records with the implementation controlled by parameters $\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ mentioned in the subsection above. It took the author a long time to decide on a resampling of the data. The reason was the fear of influencing the models in an inappropriate way. Finally, comparing the results of downsampled models with the results of the original runs showed a reassuring picture, though. The results remained tolerably close in their accuracy performance. This statement can be retraced in file [runs_summary.csv](./documentation/runs_summary.csv).
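The two kinds of downsampling can be sketched with the standard library alone. The sketch is an illustration under assumptions, not the implementation of chapter 3: both parameters are interpreted as the fraction of records to keep, the label key `'duplicates'` and the function name are hypothetical, and the sample data is artificial.

```python
import random


def downsample(pairs, fraction_nreb=1.0, fraction_reb=1.0, seed=0):
    """Apply class-independent and uniques-only downsampling in sequence."""
    rng = random.Random(seed)
    # First kind: purely random subset, independent of the class.
    subset = rng.sample(pairs, round(len(pairs) * fraction_nreb))
    # Second kind: reduce the records of class unique only.
    duplicates = [pair for pair in subset if pair['duplicates'] == 1]
    uniques = [pair for pair in subset if pair['duplicates'] == 0]
    uniques = rng.sample(uniques, round(len(uniques) * fraction_reb))
    return duplicates + uniques


# Artificial data: 10 pairs of duplicates and 990 pairs of uniques
pairs = [{'duplicates': 1}] * 10 + [{'duplicates': 0}] * 990
reduced = downsample(pairs, fraction_nreb=1.0, fraction_reb=0.4)

n_duplicates = sum(pair['duplicates'] for pair in reduced)
print(len(reduced), n_duplicates, n_duplicates / len(reduced))
```

In this artificial example, the total shrinks from 1,000 to 406 records while all 10 pairs of duplicates are kept, so the ratio of duplicates rises as a side effect.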

Another answer to the increased need for computational power, caused by the need to explain the results of the capstone project in this chapter, is the strategy of runs described in this subsection. It is important to design the runs well in order to avoid unnecessary calculation time and to increase the explanatory value of the documented runs. This design was only possible with the experience of many previous runs, documented in file [runs_summary.csv](./documentation/runs_summary.csv). The strategy described here is the result of a series of full runs of which only traces of a selection are visible in the summary file.

The strategy used for the runs of this capstone project is shown in the table below. This strategy with its specific parameters has grown in the course of the capstone project iteratively. In the end, this chapter condenses the author's learning on the basic behaviour of the models and their best-suited parameter space.

| run id | description | parameter set |
| :----: | :---------- | :--------- |
| 0 | Goldstandard sampling,<br>**full feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ = $1.0$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ = $0.4$<br>$\texttt{factor}$ = $1.0$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 1 | Goldstandard sampling,<br>**little feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ = $1.0$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ = $0.4$<br>$\texttt{factor}$ = $1.0$<br><font color='red'>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{False}$</font> |
| 2 | Goldstandard sampling,<br>**small separation of missings** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ = $1.0$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ = $0.4$<br><font color='red'>$\texttt{factor}$ = $0.1$</font><br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 3 | **Oversampling** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br><font color='red'>$\texttt{oversampling}$ = $\texttt{20}$ with $\texttt{modification}\_\texttt{ratio}$ = $0.2$</font><br>$\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ = $1.0$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ = $0.4$<br>$\texttt{factor}$ = $1.0$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 4 | Fine tuning | <font color='red'>$\texttt{execution}\_\texttt{mode}$ = $\texttt{tune}$</font><br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{sampling}\_\texttt{fraction}\_\texttt{nreb}$ = $1.0$ and $\texttt{sampling}\_\texttt{fraction}\_\texttt{reb}$ = $0.4$<br>$\texttt{factor}$ = $1.0$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |

The strategy for finding the best parameters for the best model can be described as follows. The item numbers in the list below correspond to the run ids in the table above.
0. The first group of runs scans the parameter space widely with a coarse granularity in the grid space. The runs are done with downsampled goldstandard data, due to runtime. The parameter $\texttt{factor}$ is set to its originally intended value of 1.0. The stripping of text attributes to numbers is switched on with the expectation of better approaching Swissbib's data preprocessing. This run represents a first search of parameters based on technical guesses for narrowing them down for the best models later on.
0. The assumption of the text attributes' stripping is validated with the next group of runs, leaving attributes $\texttt{part}$, $\texttt{scale}$, and $\texttt{volumes}$ unabbreviated as in Swissbib's original raw data output. This validation is done with downsampling on a restricted grid space, based on the findings of the best models from the run with id 0.
0. The next group of runs validates the assumptions of the influence of the distance of missing data from the origin on the models. Setting $\texttt{factor} = 0.1$ stands for the expectation of a better performance for Neural Networks, see above. The other parameters are set according to the findings so far, comparing the performance of the models.
0. The low ratio of records with duplicate pairs compared to the number of records with uniques has proven to be of low significance for training the models. The effect of oversampling with synthetic data still remains an interesting point to be investigated. The result is to be compared with Swissbib's goldstandard data without oversampling. The other parameters will be the ones found from the best models up to that point. For oversampled data, runtime becomes even more critical. Therefore, this run, too, will be done with a downsampled data set.
0. The last group of runs scans the grid space in a fine granularity in the vicinity of the grid points found for the best models in the preceding runs, explicitly in run 0. This will be a fine tuning step in order to be sure to have found the very best parameters for the best model of all best models.
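The interplay of a coarse scan over orders of magnitude and a fine scan in the vicinity of the best grid point can be illustrated with a small sketch. The helper functions, the grid sizes and the assumed best value are hypothetical and do not reproduce the grids of the project's notebooks.

```python
def coarse_grid(low_exponent: int, high_exponent: int) -> list:
    """Grid points spread over orders of magnitude, e.g. 0.01 ... 100."""
    return [10.0 ** exp for exp in range(low_exponent, high_exponent + 1)]


def fine_grid(best: float, points: int = 5, span: float = 2.0) -> list:
    """Geometric grid from best / span to best * span around a best value."""
    step = span ** (2.0 / (points - 1))
    return [best / span * step ** i for i in range(points)]


# Wide scan, as in a run of mode 'full' (hypothetical grid)
print(coarse_grid(-2, 2))

# Assume the wide scan found its best model at a parameter value of 1.0;
# a run of mode 'tune' then scans the neighbourhood of that point.
print(fine_grid(1.0))
```

The fine grid stays between the neighbouring points of the coarse grid, so the tuning step refines the result of the wide scan instead of repeating it.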

Before the defined runs can be executed, the global parameters described in subsection [Runtime Parameters](#Runtime-Parameters) have to be set according to the strategy.

In [2]:
# Generate dictionary for parameter handover
runtime_param_dict = {
    'em': 'full',     # execution_mode : ['restricted', 'full', 'tune']
    'os': 0,          # oversampling : [0, 20]
    'mr': 0.2,        # modification_ratio
    'dsn': 1,         # sampling_fraction_nreb : <= 1
    'dsw': 0.4,       # sampling_fraction_reb : <= 1
    'fa': 1.0,        # factor : [0.1, 1.0]
    'me': 'added_u',  # mode_exactDate : ['added_u', 'xor']
    'sn': True,       # strip_number_digits : [True, False]
}
# Run id = 0
runtime_param_dict_list = [runtime_param_dict]

# Run id = 1
runtime_param_dict = runtime_param_dict_list[0].copy()
runtime_param_dict['em'] = 'restricted'
runtime_param_dict['me'] = 'xor'
runtime_param_dict['sn'] = False
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 2
runtime_param_dict = runtime_param_dict_list[0].copy()
runtime_param_dict['em'] = 'restricted'
runtime_param_dict['fa'] = 0.1
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 3
runtime_param_dict = runtime_param_dict_list[0].copy()
runtime_param_dict['os'] = 20
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 4
runtime_param_dict = runtime_param_dict_list[0].copy()
runtime_param_dict['em'] = 'tune'
runtime_param_dict_list.append(runtime_param_dict)

# Let's have a look at the predefined parameters
for run in range(len(runtime_param_dict_list)):
    print('Parameters for run', run, ': \n', runtime_param_dict_list[run])

Parameters for run 0 : 
 {'em': 'full', 'os': 0, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4, 'fa': 1.0, 'me': 'added_u', 'sn': True}
Parameters for run 1 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4, 'fa': 1.0, 'me': 'xor', 'sn': False}
Parameters for run 2 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4, 'fa': 0.1, 'me': 'added_u', 'sn': True}
Parameters for run 3 : 
 {'em': 'full', 'os': 20, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4, 'fa': 1.0, 'me': 'added_u', 'sn': True}
Parameters for run 4 : 
 {'em': 'tune', 'os': 0, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4, 'fa': 1.0, 'me': 'added_u', 'sn': True}


The parameters for each run have been set according to the strategy of scanning the grid space. All groups of runs can be executed as a next step.
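As a cross-check against the strategy table, the parameters in which each run deviates from run 0 can be listed. The following sketch rebuilds the five parameter sets from the printed output above so that it runs on its own; the helper function `diff_to_baseline` is hypothetical and not part of the project's code.

```python
def diff_to_baseline(param_dicts: list) -> list:
    """For each run, keep only the parameters that differ from run 0."""
    baseline = param_dicts[0]
    return [
        {key: value for key, value in params.items() if value != baseline[key]}
        for params in param_dicts
    ]


# Parameter sets as printed above, written as run 0 plus overrides
baseline = {'em': 'full', 'os': 0, 'mr': 0.2, 'dsn': 1, 'dsw': 0.4,
            'fa': 1.0, 'me': 'added_u', 'sn': True}
overrides = [
    {},                                              # run 0
    {'em': 'restricted', 'me': 'xor', 'sn': False},  # run 1
    {'em': 'restricted', 'fa': 0.1},                 # run 2
    {'os': 20},                                      # run 3
    {'em': 'tune'},                                  # run 4
]
runs = [{**baseline, **override} for override in overrides]

for run_id, changed in enumerate(diff_to_baseline(runs)):
    print('Run', run_id, 'differs from run 0 in:', changed)
```

Apart from the execution mode, the listed deviations correspond to the entries highlighted in the strategy table.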

### Runs Execution

To execute the notebooks of the capstone project, functions of the Python library family for notebook handling, like $\texttt{nbformat}$, $\texttt{nbparameterise}$, and $\texttt{nbconvert}$, will be used.

In [3]:
#! pip install nbparameterise

The calculations of the notebooks can be done with the parameters specified by the list of dictionaries $\texttt{runtime}\_\texttt{param}\_\texttt{dict}\_\texttt{list}$. The call of a notebook and its execution is implemented in the separate library [results_saving_funcs.py](./results_saving_funcs.py).

In [4]:
import os
import results_saving_funcs as rsf
import pandas as pd

path_results = './results'
path_goldstandard = './daten_goldstandard/'

# Determine all relevant notebooks, omit Overview Summary and Appendixes
notebook = ! ls [1-9]_* | grep .ipynb

for run in range(len(runtime_param_dict_list)):
    print('\nRun id', run)
    rsf.run_notebooks(notebook, runtime_param_dict_list[run], run, path_results)

    # Save the resulting handover files for the run done right now
    os.rename(os.path.join(path_results, 'results.pkl'),
              os.path.join(path_results, 'results_run_' + str(run) + '.pkl'))
    os.rename(os.path.join(path_goldstandard, 'wrong_predictions.pkl'),
              os.path.join(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl'))
    # Assessment of run
    results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

    results['results_best_model'].reset_index(drop=True, inplace=True)
    # Ranking metric of models : accuracy
    display(results['results_best_model'].sort_values(by=['accuracy'], ascending=False))

    for classifier in results['results_model_scores'].keys() :
        # Persist results per classifier for analysis
        results['results_model_scores'][classifier].to_csv(os.path.join(path_results,
                                                                        classifier + '_run_' + str(run) + '.csv'),
                                                           index=False)

    print('********\n')

print('Done with all runs of all notebooks.')


Run id 0
Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb


CellExecutionError: An error occurred while executing the following cell:
------------------
import data_preparation_funcs as dpf

for i in ['original', 'modified']:
    goldstandard_uniques[i] = dpf.attribute_preprocessing(
        goldstandard_uniques[i],
        columns_metadata_dict['data_analysis_columns'], strip_number_digits)
------------------

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-7add35a3caf3> in <module>
      4     goldstandard_uniques[i] = dpf.attribute_preprocessing(
      5         goldstandard_uniques[i],
----> 6         columns_metadata_dict['data_analysis_columns'], strip_number_digits)

~/test1/clustering_metadata/data_preparation_funcs.py in attribute_preprocessing(df, columns, strip_digits)
    173             continue # Explicitly : do nothing!
    174         elif attrib in ['coordinate_E']:
--> 175             df = split_coordinate(df)
    176         elif attrib in ['corporate_full']:
    177             df = split_dictionary_column(df, 'corporate', ['110', '710'#, '810

~/test1/clustering_metadata/data_preparation_funcs.py in split_coordinate(df)
    144 def split_coordinate (df) :
    145     df['coordinate_E'] = df['coordinate']
--> 146     df = norm_first_coordinate(df, '_E')
    147     # Recuce list to N and S and then same procedure
    148     df['coordinate_N'] = df['coordinate']

~/test1/clustering_metadata/data_preparation_funcs.py in norm_first_coordinate(df, suffix)
    129 
    130 def norm_first_coordinate (df, suffix) :
--> 131     df['coordinate'+suffix] = df['coordinate'+suffix].map(lambda x : x[0] if len(x)>0 else '').str.replace(' ', '').str.replace('.', '').str[:8].str.lower()
    132 
    133     return df

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   3826         dtype: object
   3827         """
-> 3828         new_values = super()._map_values(arg, na_action=na_action)
   3829         return self._constructor(new_values, index=self.index).__finalize__(self)
   3830 

~/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
   1298 
   1299         # mapper is a function
-> 1300         new_values = map_f(values, mapper)
   1301 
   1302         return new_values

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/test1/clustering_metadata/data_preparation_funcs.py in <lambda>(x)
    129 
    130 def norm_first_coordinate (df, suffix) :
--> 131     df['coordinate'+suffix] = df['coordinate'+suffix].map(lambda x : x[0] if len(x)>0 else '').str.replace(' ', '').str.replace('.', '').str[:8].str.lower()
    132 
    133     return df

TypeError: object of type 'float' has no len()
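
The traceback shows that the lambda in $\texttt{norm}\_\texttt{first}\_\texttt{coordinate}$ calls $\texttt{len()}$ on a float. Presumably, this float is a missing value (NaN) in column $\texttt{coordinate}$. A minimal sketch of a guarded element-wise function is given below, under the assumption that valid entries are lists of strings. The function name $\texttt{first}\_\texttt{coordinate}$ and the sample value are hypothetical and not part of [data_preparation_funcs.py](./data_preparation_funcs.py).

```python
def first_coordinate(value, max_len=8):
    """Return the normalised first list element, '' for empty or missing input."""
    # A missing value arrives as float NaN and has no len(), so it is
    # treated like an empty list
    if not isinstance(value, (list, tuple)) or len(value) == 0:
        return ''
    # Same normalisation steps as in the failing line: drop blanks and
    # dots, keep the first 8 characters, convert to lower case
    return str(value[0]).replace(' ', '').replace('.', '')[:max_len].lower()


print(first_coordinate(float('nan')))     # missing value gives ''
print(first_coordinate(['E 007.30.00']))  # hypothetical sample value
```

Such a function could be passed to $\texttt{Series.map}$ in place of the lambda, which would make the subsequent $\texttt{.str}$ calls of the original line unnecessary.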


The results for each run have been stored in specific files and will be analysed in the next section of this chapter.

## Assessment of Results

The ranking of the models is shown above for each run. As a next step, the results are discussed for each group of runs separately. The display of the rankings is repeated for the discussions. The goal of this first step of the discussion is to identify the best parameters of the grid search for each model.

### Run with id 0

In [None]:
# Import in case of not having imported, yet
import results_saving_funcs as rsf

run = 0
path_results = './results'

print('\nRun id', run, 'with parameters\n', runtime_param_dict_list[run])

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

results['results_best_model'].set_index('model', inplace=True)
# Ranking metric according to chapter 6 : 1. accuracy, 2. roc auc
display(results['results_best_model'].sort_values(by=['accuracy', 'auc'], ascending=False).round(3))

The ranking of the best models of the run with id 0 can be seen above. Of the models compared, the Random Forest Classifier reaches the best overall accuracy. This accuracy value can be derived from the number of wrongly predicted records, which amounts to a total of 24 wrong predictions and a ratio of 0.11% in the resulting notebook [Decision Tree Model](./results/6_DecisionTreeModel_run_0.ipynb) of the run, see subsection [Wrong Predictions](#Wrong-Predictions) below. The highest accuracy does not coincide with the highest value in all other metrics like precision and recall, though. The Decision Tree Classifier with cross-validation shows a higher recall value than the Random Forest Classifier, which even results in a higher roc auc score.

The performance metric values for the Neural Network may vary in the ranking above. The reason is the Keras library used for the implementation of the Neural Network in this capstone project. In a scikit-learn implementation, reproducibility can be controlled by setting the parameter $\texttt{random}\_\texttt{state}$ to a fixed value when instantiating a classifier object. A Keras implementation requires some more lines of code, see [[KeraRand](./A_References.ipynb#kerarand)]. This effort has not been made in this capstone project. Therefore, the statements on the Neural Network may differ from the picture above. Runs with the following properties have been observed.
- There have been runs where the Neural Network has shown a higher precision score than the Random Forest Classifier, which has not resulted in a higher roc auc, though.
- There have been runs where the Neural Network has ranked in between the Ensemble classifier family and the Support Vector Classifiers with the second highest recall score and the second highest roc auc.
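
The additional lines of code that a reproducible Keras run needs mostly consist of seeding every random number generator involved. The following is a minimal sketch of this idea and not the procedure of the cited reference. The function name $\texttt{set}\_\texttt{global}\_\texttt{seeds}$ is hypothetical, and the TensorFlow call assumes a TensorFlow 2.x backend.

```python
import os
import random


def set_global_seeds(seed=42):
    """Seed the random number generators a Keras run typically depends on."""
    # Only effective for child processes, the running interpreter has
    # already fixed its hash seed at start-up
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow 2.x API, an assumption
    except (ImportError, AttributeError):
        pass  # TensorFlow missing or older than 2.x
    return seed


set_global_seeds(42)
first_draw = random.random()
set_global_seeds(42)
print(first_draw == random.random())  # same seed, same draw
```

Even with fixed seeds, multi-threaded or GPU execution may still introduce small differences between runs, so identical scores are not guaranteed by this sketch.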

As an overall picture, the classifiers of the Ensemble family form the group with the best overall scores, followed, depending on the run, by either the Support Vector Classifier with cross-validation or the Neural Network classifier. The models are close to each other with an accuracy of 99.8% and roc auc values higher than 96.2%. The Dummy Classifier serves as a baseline and confirms that the scores of the trained models lie well above those of a trivial prediction.

Altogether, this gives a nice and consistent picture.

In [None]:
# If not defined, yet
import pandas as pd

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

# Unlimited number of columns allowed
pd.options.display.max_columns = None

for classifier in results['results_model_scores'].keys() :
    if classifier != 'DummyClassifier': # DummyClassifier has no results to be analysed
        # Show results
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].sort_values(by=['accuracy_val'], ascending=False).head(20))

The detailed ranking per classifier model above serves to find the best grid parameter set for each model. Be aware that the accuracy values of the detailed runs' data differ from the accuracy values of the best models in the comparison ranking. The reason is the data set used for each ranking. For the models' comparison ranking, the validation part of the training data split has been used. For the grid parameter comparison results, the test data part of the full data set has been used.

| model | parameters assessment |
| :---- | :-------------------- |
| RandomForestClassifier | A tendency towards unbalanced $\texttt{class}\_\texttt{weight}$ can be detected for the best estimator. High values of $\texttt{max}\_\texttt{depth}$ in a range around 20 are preferred, which may confirm expectations as the model has been trained with a total of 20 features. Be aware, though, that the validation accuracy scores of the first three models are identical. Even the group of models ranked below the top three deviates in score $\texttt{accuracy}\_\texttt{val}$ by only 0.006%, which is of questionable statistical significance. The lowest $\texttt{max}\_\texttt{depth}$ in the two top-ranked groups is 20. For the parameter $\texttt{n}\_\texttt{estimators}$, values higher than or equal to 16 show the best performance. |
| DecisionTreeClassifier_CV | Although the run for this classifier has been done with both $\texttt{class}\_\texttt{weight}=\texttt{balanced}$ and $\texttt{None}$, balancing generates the best results. Looking at the accuracy value, the gini measure rather than entropy generates the overall best results. For the ranking of this classifier, it is noticeable that a $\texttt{max}\_\texttt{depth}$ value of 26 is the lowest value with the highest accuracy. For all $\texttt{max}\_\texttt{depth} \ge 26$, the accuracy remains constant. The standard deviation of $\texttt{accuracy}\_\texttt{val}$ is worth a look. The standard deviation intervals of the accuracy scores overlap, which makes a statistical distinction between the models difficult. |
| DecisionTreeClassifier | As the Decision Tree Classifier with cross-validation is stronger in its statistical statement, the classifier found without cross-validation is not discussed in more depth here. |
| SVC_CV | Polynomial kernels of degree 3 or possibly 4 generate the best accuracy results on the validation data. A $\gamma$ value of around 1.0 or lower and a $\texttt{C}$ value of around 1.0 produce the best models. Again, when the standard deviation is taken into account, the accuracy scores of the validation data give only weak evidence for identifying the best parameters. |
| SVC | As the Support Vector Classifier with cross-validation is stronger in its statistical statement, the model found without cross-validation is not discussed in more depth here. |
| NeuralNetwork | The Neural Network fits its best classifiers with a balanced training data set. This observation is confirmed by the experience from a large number of earlier runs, documented in file [runs_summary.csv](./documentation/runs_summary.csv). A low dropout rate with a maximum of 0.1 seems to be beneficial for the model. As can be seen in chapter [Neural Network Model](./8_NeuralNetwork.ipynb), the models need an extraordinarily long training phase. They converge very slowly. This might be a reason why dropout does not seem to play an important role for training the network. As for regularization, a value of 0 has always proven to be the best value for all models of all runs done. Therefore, no higher value of $\texttt{l2}\_\texttt{alpha}$ has been set in the grid search. Surprisingly, the highest numbers of neurons result in higher accuracy values. Explicitly, this is also true for a second hidden layer. This observation, together with the observation of a low learning rate of around 0.001, may be another reason for the described slow stabilization of the Neural Network models. |
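
The remark on overlapping standard deviation intervals in the table above can be made concrete with a small check. The sketch below compares two grid points by their mean validation accuracy and its standard deviation. The numbers are hypothetical and only chosen to illustrate the order of magnitude discussed, they are not taken from the results files.

```python
def intervals_overlap(mean_a, std_a, mean_b, std_b):
    """True if [mean - std, mean + std] of two grid points overlap."""
    # Two intervals overlap exactly if the distance of their centres
    # does not exceed the sum of their half-widths
    return abs(mean_a - mean_b) <= std_a + std_b


# Hypothetical (mean, std) of accuracy_val for two grid points
top = (0.99895, 0.00030)
runner_up = (0.99889, 0.00025)

print(intervals_overlap(*top, *runner_up))
```

If the function returns $\texttt{True}$, the difference between the two parameter sets lies within the spread of the cross-validation folds and a ranking between them carries little weight.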

### Run with id 1

For the assessment of the runs with an id of 1 and higher, the same structure as in the previous subsection will be chosen. Mainly differences between the results of the runs will be pointed out.

In [None]:
run = 1

print('\nRun id', run, 'with parameters\n', runtime_param_dict_list[run])

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

results['results_best_model'].set_index('model', inplace=True)
# Ranking metric according to chapter 6 : 1. accuracy, 2. roc auc
display(results['results_best_model'].sort_values(by=['accuracy', 'auc'], ascending=False).round(3))

Comparing the accuracy scores of this run with the values of the first run, the question might come up why parameters $\texttt{mode}\_\texttt{exactDate}=\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$ have been chosen for calculating the best of best models of the capstone project. The decision was hard to make but, again, it is the summary of observations of model performances documented in file [runs_summary.csv](./documentation/runs_summary.csv) that tipped the balance for the choice made.

In this run, the Decision Tree Classifier with cross-validation reaches the best accuracy of the models compared, instead of the Random Forest Classifier. The family of Ensemble classifiers ranks topmost, here followed by the Neural Network classifier. The Support Vector Classifier is ranked lowest, although it has a higher roc auc value than the Neural Network.

The results so far document a tendency for the general ranking of the methods used, but the differences in the specific accuracy values of the validation data set show that the results are arbitrary to a certain degree.

In [None]:
results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

# Unlimited number of columns allowed
pd.options.display.max_columns = None

for classifier in results['results_model_scores'].keys() :
    if classifier != 'DummyClassifier': # DummyClassifier has no results to be analysed
        # Show results
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].sort_values(by=['accuracy_val'], ascending=False).head(20))

The global parameters of this run lose their importance due to the decision to use the parameters of the run with id 0. Additionally, $\texttt{execution}\_\texttt{mode}=\texttt{restricted}$ has scanned only a limited grid space. Therefore, the detailed discussion of the results above is omitted.

### Run with id 2

In [None]:
run = 2

print('\nRun id', run, 'with parameters\n', runtime_param_dict_list[run])

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

results['results_best_model'].set_index('model', inplace=True)
# Ranking metric according to chapter 6 : 1. accuracy, 2. roc auc
display(results['results_best_model'].sort_values(by=['accuracy', 'auc'], ascending=False).round(3))

In this run, the Ensemble family ranks topmost, as in the runs so far. The Random Forest Classifier shows the highest accuracy score with a value of 99.895%, which is 0.01 percentage points higher than the accuracy of the same classifier in the run with id 0. As can be checked in the result notebook [Decision Tree Model](./results/6_DecisionTreeModel_run_0.ipynb) of run 0 and in the result notebook [Decision Tree Model](./results/6_DecisionTreeModel_run_2.ipynb) of this run, this difference originates from a total of two false negative records that have been predicted correctly by the model of this run. In subsection [Wrong Predictions](#Wrong-Predictions) below, the comparison of the records' indices reveals that the records with index \[103609, 103618, 104110, 104493\] have been falsely predicted as uniques exclusively in run 0, while the records with index \[103813, 104493\] have been falsely predicted as uniques exclusively in run 2. The set of false predicted duplicates has remained the same for both runs. The Random Forest Classifier reaches a higher roc auc score due to the smaller total of false negative records.
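
The index comparison between two runs described above boils down to set operations on the index lists of the wrongly predicted records. A minimal sketch with hypothetical toy indices follows. The function name $\texttt{compare}\_\texttt{wrong}\_\texttt{predictions}$ and the index values are assumptions for illustration, not the project's data.

```python
def compare_wrong_predictions(index_run_a, index_run_b):
    """Split wrongly predicted record indices into exclusive and shared sets."""
    a, b = set(index_run_a), set(index_run_b)
    return {
        'only_a': sorted(a - b),  # wrong exclusively in run a
        'only_b': sorted(b - a),  # wrong exclusively in run b
        'both': sorted(a & b),    # wrong in both runs
    }


# Hypothetical toy indices of false predicted uniques of two runs
run_a = [11, 12, 13, 20, 21]
run_b = [14, 20, 21]

comparison = compare_wrong_predictions(run_a, run_b)
print(comparison)
# Net difference in the number of wrong predictions between the runs
print(len(comparison['only_a']) - len(comparison['only_b']))
```

By construction, an index can appear in at most one of the three lists, which makes this split a convenient consistency check for index lists reported per run.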

This run was motivated by the assumption that a small value for parameter $\texttt{factor}$ would result in a better Neural Network classifier compared to a parameter $\texttt{factor}=1$. The accuracy and recall scores confirm the expectation, leading to a ranking of the Neural Network higher than the Support Vector Classifier with cross-validation. The resulting roc auc score is remarkably higher for this run compared to the run with id 0. On the other hand, the precision score is lower for this run. Overall, it is hard to attribute a distinct effect to parameter $\texttt{factor}$ based on these score values.

Looking at the result of this run, the question might arise why the global parameters of this group have been chosen for a subsidiary run. The answer is the same as above for [Run with id 1](#Run-with-id-1), see file [runs_summary.csv](./documentation/runs_summary.csv).

In [None]:
results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

# Unlimited number of columns allowed
pd.options.display.max_columns = None

for classifier in results['results_model_scores'].keys() :
    if classifier != 'DummyClassifier': # DummyClassifier has no results to be analysed
        # Show results
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].sort_values(by=['accuracy_val'], ascending=False).head(20))

The global parameters of this run lose their importance due to the decision to use the parameters of the run with id 0 for fine tuning. The detailed discussion of the results above is omitted.

### Run with id 3

This run has been done with upsampling records of class duplicate with the help of synthetic data, see the description of chapter [Data Synthesizing](./3_DataSynthesizing.ipynb). The effect of the chosen algorithm for synthesizing the data is shown in the interesting result below.

In [None]:
run = 3

print('\nRun id', run, 'with parameters\n', runtime_param_dict_list[run])

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

results['results_best_model'].set_index('model', inplace=True)
# Ranking metric according to chapter 6 : 1. accuracy, 2. roc auc
display(results['results_best_model'].sort_values(by=['accuracy', 'auc'], ascending=False).round(3))

The synthetic data, generated in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), leads to remarkably higher scores across the board. Mainly the precision and recall scores reach values that the models of the other runs cannot reach. Be aware explicitly that not only the training but also the performance testing in this run has been done with the help of synthetic data. Therefore, the excellent performance of the models of this run comes along with the suspicion that the performance scores would not be confirmed when testing the models of this run with Swissbib's original data without synthesized supplement. The answer to this question remains open.

The total of wrong predictions for the best Random Forest classifier of run 0 is 24, while the total of wrong predictions for the best Random Forest classifier of this run is 28, compare subsection [Wrong Predictions](#Wrong-Predictions) below. The total of records in the testing data set is 20,931 for run 0 and 34,068 for the run of this subsection. The ratio of false predicted records for run 0 is higher with a value of 0.11% compared to the ratio of false predictions for run 3 with a value of 0.08%. This observation is in line with the high roc auc values of this subsection and demonstrates that the high scores are not only due to a higher amount of data records used for testing the performance.
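
The two ratios quoted above follow directly from the totals of wrong predictions and test records. The short calculation below reproduces them. The helper name $\texttt{wrong}\_\texttt{prediction}\_\texttt{ratio}$ is introduced for illustration only.

```python
def wrong_prediction_ratio(n_wrong, n_test):
    """Share of wrongly predicted records in the test data, in percent."""
    return 100 * n_wrong / n_test


ratio_run_0 = wrong_prediction_ratio(24, 20931)  # run 0
ratio_run_3 = wrong_prediction_ratio(28, 34068)  # run 3

print(f'run 0: {ratio_run_0:.2f}%, run 3: {ratio_run_3:.2f}%')
```

The accuracy of a model is the complement of this ratio, so a lower share of wrong predictions translates into a higher accuracy score.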

In [None]:
results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

# Unlimited number of columns allowed
pd.options.display.max_columns = None

for classifier in results['results_model_scores'].keys() :
    if classifier != 'DummyClassifier': # DummyClassifier has no results to be analysed
        # Show results
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].sort_values(by=['accuracy_val'], ascending=False).head(20))

The run of this subsection was motivated by the desire for more training data. This subsection shows that the models are influenced by the specific implementation chosen for this kind of data upsampling. Due to this observation, the results of this subsection are rejected and the above listing is not analysed in more depth.

### Run with id 4

The last group of runs searches for the best parameters on the grid of each model. Several runs have been done with Swissbib's full goldstandard data that have been persisted in the repository of the project [[ProjRepo](./A_References.ipynb#project_repo)] with the goal to find the set of overall best parameters for each model, see [runs_summary.csv](./documentation/runs_summary.csv). Here, the runs done will be reproduced with the downsampled goldstandard data on the best parameters found.

In [None]:
run = 4

print('\nRun id', run, 'with parameters\n', runtime_param_dict_list[run])

results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

results['results_best_model'].set_index('model', inplace=True)
# Ranking metric according to chapter 6 : 1. accuracy, 2. roc auc
display(results['results_best_model'].sort_values(by=['accuracy', 'auc'], ascending=False).round(3))

After fully tuning all models separately, the Ensemble family has remained on top of all models. Explicitly, the Random Forest Classifier could be tuned to remain consistently the best classifier overall. The tuning of the Neural Network has moved this classifier into close vicinity of the Ensemble family. The Support Vector Classifiers have moved down to the last ranks. This ranking is based on the comparison of the accuracy. When looking at the roc auc score, the Support Vector Classifier with cross-validation may exhibit a higher score than the Neural Network. This result is subject to some randomness due to [[KeraRand](./A_References.ipynb#kerarand)], see above.

In [None]:
results = rsf.restore_dict_results(path_results, 'results_run_' + str(run) + '.pkl')

# Unlimited number of columns allowed
pd.options.display.max_columns = None

for classifier in results['results_model_scores'].keys() :
    if classifier != 'DummyClassifier': # DummyClassifier has no results to be analysed
        # Show results
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].sort_values(by=['accuracy_val'], ascending=False).head(20))

The assumption of this section is that the best parameters found on the full goldstandard data are also the best parameters on the downsampled goldstandard data. In this capstone project, the results ultimately of interest refer to the full goldstandard data. The shortcuts taken in this chapter are due to runtime restrictions and serve as a description of the overall models found. Looking at the parameters' ranking of each model brings the insight described in the following table.

| model | parameters assessment |
| :---- | :-------------------- |
| RandomForestClassifier | The tendency towards unbalanced $\texttt{class}\_\texttt{weight}$ observed in the runs above is confirmed by the runs of this group for the Random Forest Classifier. A $\texttt{max}\_\texttt{depth}$ value of 22 produces the overall best accuracy on the test data of the full goldstandard data set. Although this value is close to the number of features, the run with $\texttt{max}\_\texttt{depth}=20$ has been ranked much lower than the runs with $\texttt{max}\_\texttt{depth}$ values in the vicinity of 20. Parameter $\texttt{n}\_\texttt{estimators}$ is confirmed to be high. Again, the accuracy values of the test data are close to each other for models with different parameters. It is difficult to precisely define the best parameter point of the grid. |
| DecisionTreeClassifier_CV | The Decision Tree Classifier with cross-validation shows a separation between $\texttt{criterion}$ gini and entropy. Measure gini exhibits the 15 top accuracy values. The accuracies with $\texttt{max}\_\texttt{depth}\ge 25$ are all the same and at the very top. This picture is consistent with the one of [Run with id 0](#Run-with-id-0) and gives the additional information of finer granularity. |
| DecisionTreeClassifier | For the same reason as documented in the runs above, see subsection [Run with id 0](#Run-with-id-0), this model is not discussed deeper here. |
| SVC_CV | For the tuning run of Support Vector Classifier, the decision for polynomial of degree 3 has been taken. The overall best accuracy has been reached with $C=1.2$ and $\gamma=0.6$. For an additional assessment of these values, see the comments of the next subsection, below. |
| SVC | For the same reason as documented in the runs above, this model is not discussed deeper here.  |
| NeuralNetwork | The tuning of the Neural Network results in a two-layer network with a high number of neurons in both layers. The first hidden layer has its best tuning value at 40 and the second hidden layer at 75, which is close to double the value of the first layer. The specific result may vary, though, due to [[KeraRand](./A_References.ipynb#kerarand)]. |

### Conclusion of Runs

Several observations can be made as a general conclusion of the runs above.

- For the problem at hand, the overall best classifiers can be built with models of the Ensemble family. These classifiers have two advantages. They have a high performance in their training process and they can be interpreted easily. In the course of the capstone project, these models were the least complex ones when doing grid search. Only a little amount of parameters had to be searched and the results showed reproducible stability. For all runs, the classifiers of the Ensemble family have ranked at the top, although the differences between the models have shown to be low.
- The Support Vector Classifier has performed worse than the models of the Ensemble family. On the one hand, training the models took longer and, on the other hand, searching for the best values of $C$ and $\gamma$ has proven tricky. The two parameters interact with each other, and finding the maximum accuracy in this two-parameter space required some patience. The results above remain arbitrary in this respect. Stated positively, the Support Vector Classifier models show stability over a wide range of parameter values. Unfortunately, they do not reach the prediction accuracy of the models of the Ensemble family.
- Some effort has been spent during the capstone project to interpret and understand the models. In chapter [Support Vector Classifier Model](./7_SVCModel.ipynb), the idea was to gain insight into the threshold that would divide the subset of records with pairs of uniques from the records with pairs of duplicates. Adding up the values of all features $s_{ij}$ of a record $r_j = (s_{1j}, ... , s_{20j})$, each weighted with a feature weight $w_i$ expressing its relative feature importance, would result in a sum $S_j$ for the similarity $$S_j = \sum_{i=1}^{20}w_i\cdot s_{ij},\;\;j=1,...,N$$ which could be compared to a threshold $\Theta$. This similarity threshold would have divided the two classes, $S_u \le \Theta < S_d$, where $r_u \in \{\texttt{records of pairs of uniques}\}$ and $r_d \in \{\texttt{records of pairs of duplicates}\}$. It would have helped in identifying the feature(s) $s_{ij}$ of a record $r_j$ that determine its class in a prediction, like the ones shown in subsection [Wrong Predictions](#Wrong-Predictions) below. This intention had to be given up after some effort because the Support Vector Classifier models use a non-linear kernel, so the weighted sum above does not describe their decision function and no threshold $\Theta$ could be determined. Therefore, identifying the cause of a wrong prediction and deriving measures to turn it into a correct prediction remains an open topic.
- Training the Neural Network has proven to be a difficult task. Although the implementation with the Keras library was done quickly, the trained models converged only very slowly to a constant accuracy value. This has made the grid search process cumbersome. The final result found above, a two-layer network with a high number of neurons in the second layer, is representative of the generally slow learning of the models. The network models find a first rough separation between the classes quickly. Finding a stable network for the goldstandard data with a sharp threshold between records of pairs of uniques and records of pairs of duplicates seems to be far more difficult. One reason for this observation may be given in the overall assessment of the results at the very end of this chapter.
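
The weighted-sum idea described above for the Support Vector Classifier can be illustrated for the linear case, where it does hold. The following is a minimal sketch under assumptions: the weights, the similarity values and the threshold `theta` are invented for illustration (with three features instead of 20) and are not taken from the models of this project.

```python
# Hypothetical feature weights w_i (3 features instead of 20, for brevity)
weights = [0.5, 0.3, 0.2]

# Hypothetical similarity values s_ij, one row per record r_j
records = [
    [1.0, 1.0, 1.0],   # all attributes identical
    [0.9, 0.8, 1.0],   # nearly identical attributes
    [0.2, 0.1, 0.0],   # clearly different attributes
]

# Hypothetical threshold Theta between uniques (S_j <= Theta) and duplicates
theta = 0.6

def weighted_similarity(record, weights):
    """Return the per-feature contributions w_i * s_ij and their sum S_j."""
    contributions = [w * s for w, s in zip(weights, record)]
    return contributions, sum(contributions)

for j, record in enumerate(records):
    contributions, S_j = weighted_similarity(record, weights)
    label = 'duplicates' if S_j > theta else 'uniques'
    print('r_{}: S = {:.2f} -> pair of {}, contributions {}'.format(
        j, S_j, label, [round(c, 3) for c in contributions]))
```

In such a linear setting, the contributions show directly which feature pushes a record across the threshold. With a non-linear kernel, the decision function is no longer a weighted sum of the features, so this breakdown is not available.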

### Wrong Predictions

In a first step, some print functions are to be defined.

In [None]:
# Unlimited number of columns allowed
pd.options.display.max_columns = None

# Path to the goldstandard data, in case it is not defined, yet
path_goldstandard = './daten_goldstandard/'

def print_wrongly_predicted_records(run_id, wpg, wrong_predictions):

    number_of_test_records = 20931
    # False predicted uniques are true duplicates (295 records), ...
    #  ... false predicted duplicates are true uniques (20636 records)
    number_of_class_in_test = [number_of_test_records - 20636, number_of_test_records - 295]
    true_class = ['duplicate', 'unique']

    for g in range(2) :
        print('Run', run_id, '-', wpg[g], '\n*****')
        for model in wrong_predictions.keys() :
            fp = wrong_predictions[model][wpg[g]].sort_index().index.tolist()
            print('\n{} has {:d} {} which corresponds to ...\n * {:.3f}% of records of class {}\n * {:.3f}% of all test data'.format(
                model, len(fp), wpg[g],
                100*len(fp)/number_of_class_in_test[g], true_class[g],
                100*len(fp)/number_of_test_records))
            print('The wrongly classified records have the index ...\n', fp)
        print('')
        
    return None

def print_union_intersection_wrongly_predicted_records(run_id, wpg, wrong_predictions):
    for g in range(2) :
        print('Run', run_id, '-', wpg[g], '\n*****')
        # Collect the index sets of all models, independent of the models' names
        index_sets = [set(wrong_predictions[model][wpg[g]].index)
                      for model in wrong_predictions.keys()]
        false_predicted_union = set().union(*index_sets)
        false_predicted_intersect = set.intersection(*index_sets) if index_sets else set()

        # Result output
        print('- All {} of all classifiers (union)\n{}'.format(
            wpg[g], false_predicted_union))
        print('- All common {} of all classifiers (intersection)\n{}\n'.format(
            wpg[g], false_predicted_intersect))

    return None

def display_wrongly_predicted_records(run_id, wpg, wrong_predictions, df):
    for g in range(2) :
        print('Run', run_id, '-', wpg[g], '\n*****')
        for model in wrong_predictions.keys() :
            fp = wrong_predictions[model][wpg[g]].sort_index().index.tolist()
            print(wpg[g], 'for', model)
            display(df.iloc[fp])

    return None

The numbers of wrong predictions of a model are condensed in the characteristic numbers accuracy, precision, and recall, see chapter [Decision Tree Model](./6_DecisionTreeModel.ipynb). So far, the discussion has focused on this condensed view of the performance of the models. When looking at the specific wrong predictions of each model, two aspects may be interesting to assess.
- How many records are falsely predicted as pairs of uniques and how many are falsely predicted as pairs of duplicates?
- Which falsely predicted records of pairs of uniques and pairs of duplicates do the models have in common? Identifying this common set of records that are difficult to classify for all models might help in understanding why a record is hard to classify. Identifying the set of records that only one model gets wrong, on the other hand, might help in understanding the specifics of each model.

Starting with run with id 0, the explicit false predictions are shown below.

In [None]:
run = 0
wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']

# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl')

print_wrongly_predicted_records(run, wrong_prediction_groups, wrong_predictions)

The total number of records in the test data set is 20,931 with a distribution of 20,636 records of uniques and 295 records of duplicates, see e.g. chapter [Decision Tree Model](./6_DecisionTreeModel.ipynb) and its subsections on the train/test split. In the list above, the number of false predicted uniques and the number of false predicted duplicates are divided both by the size of the corresponding class and by this total number, to illustrate the proportion of wrong predictions. All ratios are below 10% for each model, which is a satisfying result.
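
To relate the counts of wrong predictions to the characteristic numbers mentioned above, the following minimal sketch derives accuracy, precision, and recall from the class sizes of the test data set. The two counts of wrong predictions are hypothetical placeholders, not results of this project. Treating the duplicates as the positive class, and reading "false predicted uniques" as true duplicates that were predicted as uniques, are assumptions made for this illustration.

```python
def confusion_metrics(n_uniques, n_duplicates, false_uniques, false_duplicates):
    """Derive accuracy, precision and recall with duplicates as positive class.

    false_uniques    : true duplicates predicted as uniques (false negatives)
    false_duplicates : true uniques predicted as duplicates (false positives)
    """
    true_positives = n_duplicates - false_uniques
    true_negatives = n_uniques - false_duplicates
    total = n_uniques + n_duplicates
    accuracy = (true_positives + true_negatives) / total
    precision = true_positives / (true_positives + false_duplicates)
    recall = true_positives / n_duplicates
    return accuracy, precision, recall

# Class sizes of the test data set as stated in the text
n_uniques, n_duplicates = 20636, 295

# Hypothetical counts of wrong predictions, for illustration only
accuracy, precision, recall = confusion_metrics(
    n_uniques, n_duplicates, false_uniques=10, false_duplicates=20)

print('accuracy  {:.4f}'.format(accuracy))
print('precision {:.4f}'.format(precision))
print('recall    {:.4f}'.format(recall))
```

Because the classes are strongly imbalanced, accuracy stays close to 1 even when a noticeable share of the 295 duplicates is misclassified. Precision and recall are therefore the more informative numbers here.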

As for the specific indices, the union and the intersection of the indices of all models can be distinguished. This is shown below.

In [None]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl')

print_union_intersection_wrongly_predicted_records(run, wrong_prediction_groups, wrong_predictions)
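
Besides the union and the intersection, the records that only one single model classifies wrongly can be isolated as well. The following is a minimal sketch with hypothetical model names and record indices. The helper `exclusive_wrong_predictions` is not part of the project code and takes plain index lists instead of the restored results dictionary.

```python
def exclusive_wrong_predictions(wrong_indices):
    """Map each model to the record indices that only this model got wrong."""
    exclusive = {}
    for model, indices in wrong_indices.items():
        # Collect the wrong predictions of all other models
        others = set()
        for other_model, other_indices in wrong_indices.items():
            if other_model != model:
                others.update(other_indices)
        exclusive[model] = sorted(set(indices) - others)
    return exclusive

# Hypothetical model names and record indices, for illustration only
wrong_indices = {
    'model_a': [3, 17, 42, 99],
    'model_b': [17, 42, 58],
    'model_c': [5, 17, 42],
}

for model, indices in exclusive_wrong_predictions(wrong_indices).items():
    print(model, '- wrong predictions specific to this model:', indices)
```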

Now, let's have a look at the full data of the false predicted records.

In [None]:
import bz2
import os
import pickle

# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File(os.path.join(
        path_goldstandard, 'labelled_feature_matrix_full.pkl'), 'rb') as file:
    df_attribute_with_sim_feature = pickle.load(file)

In [None]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl')

display_wrongly_predicted_records(run, wrong_prediction_groups, wrong_predictions, df_attribute_with_sim_feature)

The output displayed above has been used for assessing the similarity metrics applied during the preprocessing of the attributes. In the end, the different ways of preprocessing the attributes with the global parameters $\texttt{exact}\_\texttt{date}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$, described in subsection [Runtime Parameters](#Runtime-Parameters), represent this kind of model tuning. In the course of this capstone project, some effort has been made to analyse the resulting wrong predictions for each model. It has turned out to be difficult to decide on improvements to the similarity metrics, because modifying them changed the specific results above only slightly. The models show a high level of stability in their behaviour and results against the modifications tried. This is expressed in the stable order of magnitude of the characteristic numbers of the confusion matrix despite modifications of the similarity metric applied to an attribute. The specific records of false predictions may change, though. This stability is interpreted as a positive result of the project.

For the sake of completeness, all wrong predictions of all models of all runs are shown below.

In [None]:
run_done = run + 1

# Rest of runs
for r in range(run_done, len(runtime_param_dict_list)):

    # Read confusion matrix results from chapters
    wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(r) + '.pkl')

    print_wrongly_predicted_records(r, wrong_prediction_groups, wrong_predictions)
    print_union_intersection_wrongly_predicted_records(r, wrong_prediction_groups, wrong_predictions)

In [None]:
# Rest of runs
for r in range(run_done, len(runtime_param_dict_list)):
    if r != 3: # Run 3 has been done with oversampled data. This data is not available here.

        # Read confusion matrix results from chapters
        wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(r) + '.pkl')

        display_wrongly_predicted_records(r, wrong_prediction_groups, wrong_predictions, df_attribute_with_sim_feature)

### Classification of Swissbib's Goldstandard Data

For some final consistency checks and for communication with Swissbib's project team, the records' docids can be printed. Just one example is shown below.

In [None]:
# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [None]:
# Careful, last run may be the only one to hold the right indices, ...
#  ... depending on global run parameters
run = len(runtime_param_dict_list)-1

wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions_run_' + str(run) + '.pkl')

#for g in range(2):
for model in wrong_predictions.keys() :
#    fp = wrong_predictions[model][wrong_prediction_groups[g]].sort_index().index.tolist()
    fp = wrong_predictions[model][wrong_prediction_groups[0]].sort_index().index.tolist()
#    print(model, '-', wrong_prediction_groups[g], '-', fp)
    print(model, '-', wrong_prediction_groups[0], '-', fp)
    display(df_index_docids.iloc[fp])

A final analysis of the resulting false predictions of the models had to be done with Swissbib's project team. Some specific sample records of pairs in the test data set were chosen that had a target classification of uniques but had been predicted as a pair of duplicates and vice versa. Swissbib's project team was asked for an explanation of why these sample records had been classified in their goldstandard as they were. The answer was twofold.

- The source data underlying the goldstandard is subject to change. Therefore, new criteria might have been added that would change today's classification of the goldstandard data, compared to its original classification done manually one year ago.
- In one example of the Decision Tree Classifier that was discussed, the classification of the machine learning model was judged to be correct despite the opposite classification in the goldstandard data. This means that the machine learning result was found to be better than the target result in the goldstandard data.

This surprising and promising answer has to be discussed in more depth with Swissbib's project team. As a next step, each of the false predicted records could be analysed with the goal of improving the goldstandard data according to the latest state of its data. With an improved goldstandard data set, a new training could be initiated, fitting new models. This could be done iteratively until a satisfying level of quality has been reached.

## Comparison with Literature

An early phase of this capstone project was shaped by a review article [[Padm2012](./A_References.ipynb#padm2012)] that has been referenced in the [proposal](./project-proposal-andreas-jud.ipynb). In this section, the findings of this capstone project are compared with this review article.

[[Padm2012](./A_References.ipynb#padm2012)] has been the starting point and motivation for the idea of implementing a Neural Network for resolving the problem at hand. While the authors describe the implementation of a one-layer network, it has turned out early in this capstone project that a network with two hidden layers would produce slightly better results. The first difference between [[Padm2012](./A_References.ipynb#padm2012)] and this capstone project is the complexity of the implemented network, and the second difference is the implementation with the Keras library in this project. Keras was first released in 2015 and was therefore not available in 2012. The implementation of this capstone project may therefore be of interest as a more recent approach.

One fundamental difference between this project and the approach described in [[Padm2012](./A_References.ipynb#padm2012)] lies in the features. While in this project one single similarity metric is used for each attribute, [[Padm2012](./A_References.ipynb#padm2012)] uses three different similarity metrics for each attribute. These three similarity metrics are chosen and kept fixed for the input attributes. This is a very different approach to the one chosen here, but it would be interesting to try a solution with several distinct similarity metrics for one attribute on the data of this project and compare the results.
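
How several similarity metrics for one attribute could enter the feature matrix is sketched below. This is a minimal illustration under assumptions: the three metrics (exact match, character-based ratio, token-based Jaccard similarity) are stand-ins chosen here for simplicity and are neither the metrics of [[Padm2012](./A_References.ipynb#padm2012)] nor those used in this project. The title values are invented.

```python
from difflib import SequenceMatcher

def exact_match(a, b):
    """1.0 if both strings are identical, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

def sequence_ratio(a, b):
    """Character-based similarity ratio from difflib."""
    return SequenceMatcher(None, a, b).ratio()

def token_jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def attribute_features(a, b):
    """Return several similarity features for one attribute of a pair."""
    return [exact_match(a, b), sequence_ratio(a, b), token_jaccard(a, b)]

# Hypothetical title values of a pair of records
features = attribute_features('Die Zauberflöte', 'die zauberflöte : Oper')
print(features)
```

Each attribute would then contribute three columns to the feature matrix instead of one, leaving it to the model to weigh the metrics against each other.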

The authors of [[Padm2012](./A_References.ipynb#padm2012)] report a resulting accuracy of nearly 80%. This accuracy has been reached with the help of synthetic data, though. The reason why this accuracy is lower than the accuracy values reached in this project remains open. However, it can be said for sure that the data used for training a machine learning model and for measuring its performance is critical.

## Summary and Outlook

This chapter of the capstone project brackets all the other chapters. Each chapter is implemented and written in a separate Jupyter Notebook that can be run separately. This chapter executes each of these notebooks as an alternative way of running the project and collecting its results.

In this chapter, the implemented models have been fully run several times. Each execution has been done with a different set of parameters. Taken together, the runs explore different aspects of the implemented models. As an overall result, the findings can be summarised with the following items.

- The overall best models for the problem of deduplication with Swissbib's data can be found in the Ensemble classifier family. Fitting a Neural Network shows results of comparable performance, although the results of the Ensemble classifiers could not be exceeded. As for the Neural Network, it is remarkable that networks with more than one layer and with a relatively high number of neurons exhibit the best results, albeit within a narrow range of quality. A network with more layers and more neurons can learn more interactions between the features.
- The general and satisfying experience with the calculated models has been their stability. When some parameters of the run conditions were changed, the models exhibited about the same results with only minor differences. This observation gives confidence that reliable and reproducible results have been found in the course of this capstone project.

As a concluding outlook, the following items list options for improving the results of this capstone project.

- As described in the [proposal](./project-proposal-andreas-jud.ipynb), Swissbib has implemented a sophisticated preprocessing of their data before its deduplication. Some examples can be mentioned here explicitly.
    - Attributes $\texttt{century}$ and $\texttt{decade}$ are extracted parts of attribute $\texttt{exactDate}$. Taking the same approach in this capstone project has been discussed in chapter [Data Analysis](./1_DataAnalysis.ipynb). The decision taken there could be revised and models could be fitted based on a feature matrix with additional, although redundant, information.
    - Attributes $\texttt{edition}$, $\texttt{musicid}$, $\texttt{pages}$, $\texttt{part}$, and $\texttt{volumes}$ are each generated by elaborate algorithms that interpret number digits and even literal number expressions of different languages. This capstone project has tried to imitate this kind of preprocessing by stripping the attributes to their number digits. The implementation here has remained very rudimentary, though. In a better implementation, Swissbib's data preprocessing could be used for the preparation of the goldstandard training data. It would be interesting to observe whether better prepared data would lead to even better results.
    - Nearly all attributes of Swissbib's data are optional, see chapter [Data Analysis](./1_DataAnalysis.ipynb) for details. In the course of this capstone project, the decision has been taken to mark missing attributes with a negative number. An alternative implementation would be to mark missing attributes with an additional feature, a special flag in the feature matrix. It would be interesting to see the effect of such an implementation.
- The similarity metrics applied to the attributes of Swissbib data have been chosen in an iterative process. [[Chri2012](./A_References.ipynb#chri2012)] cites some literature on how to find the best similarity metric for an attribute with the help of machine learning. Some effort would be required to implement this idea. Although the results of this capstone project reach a satisfying level, a deeper engagement with the similarity metrics used would be rewarding.
- An understanding of the resulting predictions on the level of single data records has been attempted at several stages of the capstone project. In the end, the deep interpretability of the models and their feature-wise understanding remains an open issue that would require more elaborate techniques and effort.
- Subsection [Classification of Swissbib's Goldstandard Data](#Classification-of-Swissbib's-Goldstandard-Data) discusses the quality of the goldstandard data. An iterative improvement of the goldstandard data is proposed, which requires detailed discussion with Swissbib's project team. These discussions will take place eventually, depending on the availability of Swissbib's resources.
- The [proposal](./project-proposal-andreas-jud.ipynb) of the capstone project suggests implementing one of the models designed here in [Apache Flink](https://flink.apache.org/) or [Apache Spark](https://spark.apache.org/). At the moment, [Apache Beam](https://beam.apache.org/) would be an even more attractive option to explore. The goal of such an implementation would be to replace Swissbib's deduplication logic in production with a solution resulting from this capstone project. Considering the large amount of Swissbib data and the $O(N^2)$ scaling of the pair comparison introduced here, one important step would have to be resolved before any deployment into operation. Swissbib's amount of data can only be processed after a forceful preclustering, described in the [proposal](./project-proposal-andreas-jud.ipynb). Some feasible preclustering solutions are described in [[Chri2012](./A_References.ipynb#chri2012)]. An implementation would still have to be explored. But this is a different project.
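
How a preclustering step can reduce the $O(N^2)$ pair comparison is sketched below with simple blocking, one common family of such techniques. This is a minimal illustration under assumptions, not necessarily the preclustering described in the proposal: the blocking key (first three letters of the title plus the year), the record layout and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical blocking key: first three letters of the title plus the year."""
    return (record['title'][:3].lower(), record['year'])

def candidate_pairs(records):
    """Yield pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record['id'])
    # Only records within the same block are compared with each other
    for ids in blocks.values():
        yield from combinations(ids, 2)

# Hypothetical records, for illustration only
records = [
    {'id': 1, 'title': 'Faust', 'year': 1808},
    {'id': 2, 'title': 'Faust : eine Tragödie', 'year': 1808},
    {'id': 3, 'title': 'Faust', 'year': 1832},
    {'id': 4, 'title': 'Heidi', 'year': 1880},
]

pairs = list(candidate_pairs(records))
n = len(records)
print('All pairs:', n * (n - 1) // 2, '- candidate pairs after blocking:', len(pairs))
```

Only the candidate pairs would be passed on to the feature calculation and the classifier. The price of this reduction is that duplicates with differing blocking keys are never compared, so the choice of the key is critical.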

These options conclude the discussion of the results of this capstone project.