# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-MAR-2020

## Table of Contents

- [Introduction](#Introduction)
    - [Requirements](#Requirements)
    - [Thanks](#Thanks)
- [Structure of the Project](#Structure-of-the-Project)
- [Runs and Results](#Runs-and-Results)
    - [Runtime Parameters](#Runtime-Parameters)
    - [Overview of Runs](#Overview-of-Runs)
    - [Runs Execution](#Runs-Execution)
- [Assessment of Results](#Assessment-of-Results)
- [Summary](#Summary)

## Introduction

[Proposal](./project-proposal-andreas-jud.ipynb)

### Requirements

This capstone project uses several publicly available Python libraries. Each chapter that needs a library shows the command
<br>$\texttt{! pip install <library name>}$<br>
in a separate code cell. These commands were executed once in the author's development environment and have since been commented out to keep the notebooks readable in later runs. To execute the notebooks of the capstone project in a Python development environment with only a basic setup, a [requirements.txt](./requirements.txt) file has been written. The code cell below, once uncommented, installs the library packages listed in this file.

In [1]:
#! pip install -r requirements.txt

### Thanks

## Structure of the Project

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
1. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
1. [Data Synthesizing](./3_DataSynthesizing.ipynb)
1. [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb)
1. [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb)
1. [Decision Tree Model](./6_DecisionTreeModel.ipynb)
1. [Support Vector Classifier Model](./7_SVCModel.ipynb)
1. [Neural Network Model](./8_NeuralNetwork.ipynb)

Appendix

- [A. References](./A_References.ipynb)
- [B. Comparison of Similarity Metrics](./B_CompareSimilarities.ipynb)

## Runs and Results

This section first explains the runtime parameters with which the notebooks of the capstone project can be called. Once the parameter space has been defined, a series of runs is executed, each with different parameter values.

### Runtime Parameters

The notebooks of this capstone project can be called with six specific global parameters. These parameters are listed and explained here.

- $\texttt{execution}\_\texttt{mode}$ - This parameter has been introduced to control the runtime of execution. For the models, a grid search has been implemented with the goal of finding the best parameters for each model. The bigger the grid space, i.e. the more points it has along each of its dimensions, the longer a notebook runs. Oversampling of records of duplicates increases the runtime of a notebook, too. When searching for the best parameters of a model, the grid space has to be scanned widely, and such calculations may take hours. For some runs, a smaller grid space is sufficient, so a restricted grid space can be chosen to save calculation time. The execution mode of a notebook takes one of three distinct values.
    - Mode $\texttt{full}$ will be used for executing the notebook, calling it in this very chapter and collecting the results of each notebook for final comparison and assessment.
    - Mode $\texttt{restricted}$ will mainly, but not exclusively, be used for executing the notebook locally, i.e. opening it manually and running it cell by cell. The original purpose of this mode has been to open the notebook and read its text, focusing on the contents and the specific explanations for a model. Runtime is supposed to be short in this execution mode. The grid parameters chosen for this mode are derived from the insights gained from the results of the full execution mode of this chapter.
    - Mode $\texttt{tune}$ will be used for a final fine tuning of the models' parameters. The goal of a mode $\texttt{tune}$ run is to find the best models in a grid space close to a precalculated best model of the wide grid space. While mode $\texttt{full}$ scans a wide range of orders of magnitude of the parameter space, mode $\texttt{tune}$ scans the parameter points neighbouring the best models of the mode $\texttt{full}$ run. This approach is an iterative search for the best parameters of the models.
- $\texttt{oversampling}$ - The number of records of duplicates generated with Swissbib's goldstandard data has been low compared to the number of records of uniques. As a consequence, class balancing has generally been used for model fitting. In order to increase the ratio of duplicates in the training and testing data, oversampling with synthetic data has been tried. To control the ratio, parameter $\texttt{oversampling}$ has been introduced. Synthetic data will be multiplied in a for loop until the final data set for model calculation reaches the oversampling ratio given in percent (%). If $\texttt{oversampling}=0$, no synthetic data will be added to the goldstandard data. This parameter will be used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb).
- $\texttt{modification}\_\texttt{ratio}$ - This parameter will be used in chapter [Data Synthesizing](./3_DataSynthesizing.ipynb), too. In that chapter, some specific kinds of data modification (typos) to be simulated have been defined for each attribute. If an attribute shows one or more kinds of modification, this parameter controls the ratio and therefore the amount of records with modification.
- $\texttt{factor}$ - In Swissbib's raw data, records may have missing values in attributes. When pairs of records are built for generating the feature matrix, an attribute may have a value on both sides of a pair, a missing value on one side, or missing values on both sides, see chapters [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) and [Features Discussion and Dummy Classifier Baseline](./5_FeatureDiscussionDummyBaseline.ipynb) for a deeper discussion. Missing values may influence the model. For that reason, a decision has been taken to mark the features of pairs of records with missing attribute values. One way of marking them is to transform them to a negative similarity value. During implementation, the question arose how the distance from the origin (similarity value of 0) on the negative similarity side would influence a model, especially a Neural Network, because the firing of a node depends linearly on its inputs. To be able to set this distance from the origin, this factor has been introduced. In the implemented code, the factor ...
    - multiplies -0.5 if one attribute of the pair is missing.
    - multiplies -1.0 if both attributes of the pair are missing.
- $\texttt{mode}\_\texttt{exactDate}$ - The basic similarity metric of attribute $\texttt{exact}\_\texttt{date}$ undergoes some modification in the presence of unknown values, see chapter [Feature Matrix Generation](./4_FeatureMatrixGeneration.ipynb) for implementation details. Two different modes of modifying the basic similarity metric have been implemented. To choose between them, parameter $\texttt{mode}\_\texttt{exactDate}$ has been introduced.
- $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ - Swissbib's raw data provide the attributes $\texttt{scale}$, $\texttt{part}$, and $\texttt{volumes}$ as full-text strings. Swissbib's deduplication engine extracts the digits from these strings in a preprocessing step with the goal of generating more reliable results. A very basic stripping function has been implemented in this capstone project to imitate Swissbib's more sophisticated logic. The model result may change as a function of the similarity values of these three attributes. To assess the effect of stripping the attribute values, parameter $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ will be used for switching the digit-stripping logic on ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{True}$) and off ($\texttt{strip}\_\texttt{number}\_\texttt{digits}=\texttt{False}$).
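
Two of the parameters above, $\texttt{factor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$, describe simple value transformations that can be illustrated in a few lines. The following is only a minimal sketch and not the code of the project notebooks: the function names, the representation of a missing value as `None` or an empty string, the exact-match placeholder similarity, and the regular expression used for stripping are all assumptions made for illustration.

```python
import re
from typing import Callable, Optional


def exact_match(left: str, right: str) -> float:
    """Placeholder similarity (assumption): 1.0 for identical strings, else 0.0."""
    return 1.0 if left == right else 0.0


def is_missing(value: Optional[str]) -> bool:
    """Assumption: a missing attribute value is represented as None or ''."""
    return value is None or value == ''


def marked_similarity(left: Optional[str], right: Optional[str], factor: float,
                      similarity: Callable[[str, str], float] = exact_match) -> float:
    """Return a negative marker value if one or both sides of the pair are missing."""
    if is_missing(left) and is_missing(right):
        return -1.0 * factor  # both attributes of the pair are missing
    if is_missing(left) or is_missing(right):
        return -0.5 * factor  # one attribute of the pair is missing
    return similarity(left, right)


def strip_to_digits(value: str, strip_number_digits: bool = True) -> str:
    """Keep only the digits of a full-text string if stripping is switched on."""
    if not strip_number_digits:
        return value
    return re.sub(r'\D', '', value)


for factor in (0.1, 1.0):
    print(factor,
          marked_similarity('1:25000', None, factor),
          marked_similarity(None, None, factor))
print(strip_to_digits('1 : 25 000'))
```

With $\texttt{factor}=0.1$, the markers lie close to the origin at -0.05 and -0.1, while $\texttt{factor}=1.0$ moves them to -0.5 and -1.0, which separates pairs with missing values more clearly from pairs with a similarity of 0.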

To execute the notebooks of the capstone project, functions of the Python library $\texttt{nbparameterise}$ will be used.

In [2]:
#! pip install nbparameterise

### Overview of Runs

The runtime parameters described above span a multi-dimensional space of calculation options. Because calculation power has been limited by restricted resources, it has been important to design the runs carefully, so as to avoid unnecessary calculation time and to make the documented runs as informative as possible. The strategy used for the runs of this capstone project is shown in the table below. This strategy and its specific parameters have evolved iteratively in the course of the capstone project. Many undocumented runs have been done with the models, up to the point where the author had developed a feeling for the basic behaviour of each model and its best-suited parameter space. In the end, this chapter condenses what the author has learned about the models.

| run id | description | parameter set |
| :----: | :---------- | :--------- |
| 0 | Goldstandard sampling,<br>**full feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 1 | Goldstandard sampling,<br>**little feature modification** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br><font color='red'>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{added}\_\texttt{u}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{False}$</font> |
| 2 | Goldstandard sampling,<br>**full missing separation** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{restricted}$<br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br><font color='red'>$\texttt{factor}$ = $1.0$</font><br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 3 | **Oversampling** | $\texttt{execution}\_\texttt{mode}$ = $\texttt{full}$<br><font color='red'>$\texttt{oversampling}$ = $\texttt{20}$ with $\texttt{modification}\_\texttt{ratio}$ = $0.2$</font><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |
| 4 | Final fine tuning | <font color='red'>$\texttt{execution}\_\texttt{mode}$ = $\texttt{tune}$</font><br>$\texttt{oversampling}$ = $\texttt{None}$ with $\texttt{modification}\_\texttt{ratio}$ = \< irrelevant \><br>$\texttt{factor}$ = $0.1$<br>$\texttt{mode}\_\texttt{exactDate}$ = $\texttt{xor}$ and $\texttt{strip}\_\texttt{number}\_\texttt{digits}$ = $\texttt{True}$ |

The strategy can be described as follows.
1. run - find best models.
1. run - play with feature engineering.

Before the runs can be executed, the global parameters described in subsection [Runtime Parameters](#Runtime-Parameters) have to be set.

In [1]:
# Generate dictionary for parameter handover
runtime_param_dict = {
    'em' : 'full' #execution_mode : ['restricted', 'full', 'tune']
#    'em' : 'restricted' #execution_mode : ['restricted', 'full', 'tune']
    , 'os' : 0 # oversampling : [0, 20]
    , 'mr' : 0.2 # modification_ratio
    , 'fa' : 0.1 # factor : [0.1, 1.0]
    , 'me' : 'xor' # mode_exactDate : ['added_u', 'xor']
    , 'sn' : True # strip_number_digits : [True, False]
    , 'notebook_name' : ''
}
# Run id = 0
runtime_param_dict_list = [runtime_param_dict]

# Run id = 1
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['em'] = 'restricted'
runtime_param_dict['me'] = 'added_u'
runtime_param_dict['sn'] = False
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 2
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['fa'] = 1.0
runtime_param_dict['me'] = 'xor'
runtime_param_dict['sn'] = True
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 3
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['em'] = 'full'
runtime_param_dict['fa'] = 0.1
runtime_param_dict['os'] = 20
runtime_param_dict_list.append(runtime_param_dict)

# Run id = 4
runtime_param_dict = runtime_param_dict.copy()
runtime_param_dict['em'] = 'tune'
runtime_param_dict['os'] = 0
runtime_param_dict_list.append(runtime_param_dict)

# Let's have a look at the predefined parameters
for run in range(len(runtime_param_dict_list)):
    print('Parameters for run', run, ': \n', runtime_param_dict_list[run])

Parameters for run 0 : 
 {'em': 'full', 'os': 0, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True, 'notebook_name': ''}
Parameters for run 1 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'fa': 0.1, 'me': 'added_u', 'sn': False, 'notebook_name': ''}
Parameters for run 2 : 
 {'em': 'restricted', 'os': 0, 'mr': 0.2, 'fa': 1.0, 'me': 'xor', 'sn': True, 'notebook_name': ''}
Parameters for run 3 : 
 {'em': 'full', 'os': 20, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True, 'notebook_name': ''}
Parameters for run 4 : 
 {'em': 'tune', 'os': 0, 'mr': 0.2, 'fa': 0.1, 'me': 'xor', 'sn': True, 'notebook_name': ''}
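
The printed parameter sets above can also be expressed more compactly as one base dictionary plus a list of per-run overrides. The following is only a sketch of this alternative, not the cell used for the documented runs; the names `base_params`, `run_overrides` and `run_params` are chosen here for illustration.

```python
# Base parameter set, identical to run id 0
base_params = {
    'em': 'full',   # execution_mode
    'os': 0,        # oversampling
    'mr': 0.2,      # modification_ratio
    'fa': 0.1,      # factor
    'me': 'xor',    # mode_exactDate
    'sn': True,     # strip_number_digits
    'notebook_name': '',
}

# Only the values that differ from the base set are listed per run id
run_overrides = [
    {},                                                  # run id 0
    {'em': 'restricted', 'me': 'added_u', 'sn': False},  # run id 1
    {'em': 'restricted', 'fa': 1.0},                     # run id 2
    {'os': 20},                                          # run id 3
    {'em': 'tune'},                                      # run id 4
]

# Merging creates an independent dictionary for every run
run_params = [{**base_params, **overrides} for overrides in run_overrides]

for run, params in enumerate(run_params):
    print('Parameters for run', run, ':\n', params)
```

Because every run is derived directly from the base set, a value changed for one run cannot leak into the following run, which can happen when each dictionary is copied from its predecessor.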


### Runs Execution

The calculations of the notebooks can be done with the parameters specified by the list of dictionaries $\texttt{runtime}\_\texttt{param}\_\texttt{dict}\_\texttt{list}$.

In [None]:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import nbparameterise as nbp
import os
import results_saving_funcs as rsf
import pandas as pd
import time
import datetime

path_results = './results'

# Determine all relevant notebooks, omit overview, summary and appendices
a = ! ls [2-9]_* | grep .ipynb

for run in range(len(runtime_param_dict_list)):
    print('Run number', run)
    for i in range(len(a)):
        print('Executing notebook', a[i])
        # Passing the timestamp as a separate argument avoids a str + datetime TypeError
        print('start at:', datetime.datetime.now())
        with open(a[i]) as notebook:
            nb = nbformat.read(notebook, as_version=4)

            # Get list of parameter objects
            orig_parameters = nbp.extract_parameters(nb)
            # Update parameters
            params = nbp.parameter_values(orig_parameters,
                                          execution_mode=runtime_param_dict_list[run]['em'],
                                          oversampling=runtime_param_dict_list[run]['os'],
                                          modification_ratio = runtime_param_dict_list[run]['mr'],
                                          factor=runtime_param_dict_list[run]['fa'],
                                          exactDate_mode = runtime_param_dict_list[run]['me'],
                                          strip_number_digits = runtime_param_dict_list[run]['sn']
                                         )
            # Make notebook object with these definitions, ...
            nb = nbp.replace_definitions(nb, params, execute=False)

            ep = ExecutePreprocessor(timeout=None)
            # ... and execute it.
            ep.preprocess(nb, {"metadata": {"path": './'}})
        # Save notebook run in result file
        # Store the notebook name in the parameter set of the current run
        runtime_param_dict_list[run].update({'notebook_name' : a[i]})
        rsf.save_notebook_results(nb, path_results, runtime_param_dict_list[run])

    # Assessment of run
    ### TODO: THE FILE NAME HAS TO BE MADE RUN-DEPENDENT HERE.
    results = rsf.restore_dict_results(path_results, 'results.pkl')
    ### /TODO: THE FILE NAME HAS TO BE MADE RUN-DEPENDENT HERE.

    results['results_best_model'].reset_index(drop=True, inplace=True)
    # Ranking metric according to chapter 6 : roc auc
    results['results_best_model'].sort_values('auc', ascending=False, inplace=True)

    pd.options.display.max_rows = 200

    # For timestamp in filename
    tmstmp = time.strftime("%Y%m%d-%H%M%S")

    for classifier in results['results_model_scores'].keys() :
        print(f'\n{classifier}')
        display(results['results_model_scores'][classifier].head(20))
        results['results_model_scores'][classifier].to_csv(os.path.join(path_results, classifier + '_'
                                                                        + tmstmp + '.csv'), index=False)

print('Done with all runs of all notebooks.')

Run number 0
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb

DummyClassifier



DecisionTreeClassifier


Unnamed: 0,class_weight,criterion,max_depth,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
9,balanced,gini,20.0,1.0,0.999253,-inf,7.199678
10,balanced,gini,22.0,1.0,0.999253,-inf,7.199678
16,balanced,gini,50.0,1.0,0.999253,-inf,7.199678
15,balanced,gini,40.0,1.0,0.999253,-inf,7.199678
14,balanced,gini,35.0,1.0,0.999253,-inf,7.199678
13,balanced,gini,28.0,1.0,0.999253,-inf,7.199678
12,balanced,gini,26.0,1.0,0.999253,-inf,7.199678
11,balanced,gini,24.0,1.0,0.999253,-inf,7.199678
17,balanced,gini,,1.0,0.999253,-inf,7.199678
8,balanced,gini,18.0,0.999976,0.999133,10.633647,7.050147



DecisionTreeClassifier_CV


Unnamed: 0,class_weight,criterion,max_depth,accuracy_val,std_accuracy_val,log_accuracy_val
12,balanced,gini,26.0,0.999292,0.000153,7.252656
17,balanced,gini,,0.999287,0.000152,7.245876
16,balanced,gini,50.0,0.999287,0.000152,7.245876
15,balanced,gini,40.0,0.999287,0.000152,7.245876
14,balanced,gini,35.0,0.999287,0.000152,7.245876
13,balanced,gini,28.0,0.999287,0.000152,7.245876
11,balanced,gini,24.0,0.999287,0.000145,7.245876
10,balanced,gini,22.0,0.999215,0.000176,7.149339
9,balanced,gini,20.0,0.999147,0.000193,7.06694
8,balanced,gini,18.0,0.999003,0.000147,6.91037



RandomForestClassifier


Unnamed: 0,class_weight,max_depth,n_estimators,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
2,,,75,1.0,0.999518,-inf,7.637933
3,,,100,1.0,0.999518,-inf,7.637933
0,,20.0,75,0.999982,0.99947,10.92133,7.542623
1,,20.0,100,0.999988,0.99947,11.326795,7.542623



SVC


Unnamed: 0,C,class_weight,degree,gamma,kernel,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
0,0.5,,3,2.0,poly,0.999464,0.998892,7.531305,6.805024



SVC_CV


Unnamed: 0,C,class_weight,degree,gamma,kernel,accuracy_val,std_accuracy_val,log_accuracy_val
0,0.5,,3,2.0,poly,0.999056,0.000169,-6.964974



Neural Network


Unnamed: 0,class_weight,dropout_rate,l2_alpha,number_of_hidden1_layers,number_of_hidden2_layers,sgd_learnrate,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
7,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,60,70,0.002,0.999314,0.999032,7.284561,6.940792
3,,0.1,0.0,60,70,0.002,0.999321,0.999002,7.294339,6.909415
5,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,40,70,0.002,0.999225,0.998986,7.162982,6.894081
1,,0.1,0.0,40,70,0.002,0.999198,0.998967,7.128575,6.875246
6,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,60,0,0.002,0.999082,0.998882,6.99289,6.79637
0,,0.1,0.0,40,0,0.002,0.99903,0.998863,6.93784,6.779273
2,,0.1,0.0,60,0,0.002,0.999069,0.998855,6.979353,6.772502
4,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,40,0,0.002,0.999045,0.998847,6.953814,6.765788


Run number 1
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb
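
The execution cell above restores the results of every run from the same file, $\texttt{results.pkl}$, and a comment in the cell notes that the file name still has to be made run-dependent. One possible way to derive such a name is sketched below. The helper `results_filename` and its naming scheme are hypothetical, and whether the functions of `results_saving_funcs` accept a file name in this form is an assumption that would have to be checked against that module.

```python
def results_filename(run: int, params: dict, stem: str = 'results') -> str:
    """Build a run-dependent file name from the run id and the execution mode."""
    return f"{stem}_run{run}_{params['em']}.pkl"


# Reduced example parameter sets, containing only the execution mode of each run
example_runs = [
    {'em': 'full'},
    {'em': 'restricted'},
    {'em': 'restricted'},
    {'em': 'full'},
    {'em': 'tune'},
]

filenames = [results_filename(run, params) for run, params in enumerate(example_runs)]
for name in filenames:
    print(name)
```

Including the run id keeps the names unique even when two runs share the same execution mode, so the results of a later run would no longer overwrite those of an earlier one.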


## Assessment of Results

In [None]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions.pkl')

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
fpu, fpd = {}, {}

for i in wrong_predictions.keys() :
    fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
    fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

print(wrong_prediction_groups[0])
for i in fpu.keys() :
    print(i, len(fpu[i]), '\n', fpu[i])
print('')
print(wrong_prediction_groups[1])
for i in fpd.keys() :
    print(i, len(fpd[i]), '\n', fpd[i])

In [None]:
# Restore DataFrame with attributes and similarity values
df_attribute_with_sim_feature = pd.read_pickle(os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl'), compression=None)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [None]:
pd.options.display.max_columns = 200

df_attribute_with_sim_feature.iloc[fpu[i]]

In [None]:
df_index_docids.iloc[fpu[i]]

## Summary