# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-MAR-2020

## Table of Contents

- [Introduction](#introduction)
- [Overview](#overview)
- [Summary](#summary)

## Introduction<a id='introduction'/>

[Proposal](./project-proposal-andreas-jud.ipynb)

## Overview<a id='overview'/>

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
2. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
3. [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb)
4. [Decision Tree Model](./4_DecisionTreeModel.ipynb)

Appendix

A. [References](./A_References.ipynb)

## Summary<a id='summary'/>

In [1]:
# Setting parameters for runs

# Set run mode for each notebook :
#  full = full grid search space is scanned, results are collected.
#  restricted = small grid search space is scanned for local runs.
exec_mode='full'
#exec_mode = 'restricted'

# Factor for missing attributes, chapter 4 and graphs in chapter 5 :
#  -0.5*fact : One attribute of the pair is missing.
#  -1.0*fact : Both attributes of the pair are missing.
fact = 0.1
#fact = 1.0

# Factor of oversampling with synthetic data, chapter 3 :
#  Synthetic data is multiplied in a for loop to reach a ratio of
#  oversamp% in the full data set. If oversamp = 0, no synthetic data
#  will be added to the goldstandard data.
oversamp = 0
#oversamp = 20

# Share of the attributes holding a value that will be modified in the ...
#  synthetic data generation of chapter 3.
modifi_ratio = 0.2

# Function applied to exactDate attribute to increase value of ...
#  strings with characters indicating 'unknown' digits.
mode_exactDate = 'xor'
#mode_exactDate = 'added_u'

# Decides whether, for attribute 'volumes', the full string ...
#  shall be stripped to its digits (True) or left as is (False).
strip_numbers = True
#strip_numbers = False

# Generate dictionary for parameter handover
runtime_param_dict = {
    'em' : exec_mode,
    'fa' : fact,
    'os' : oversamp,
    'mr' : modifi_ratio,
    'me' : mode_exactDate,
    'sn' : strip_numbers,
    'notebook_name' : ''
}
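
The comments on `fact` above describe a penalty for missing attributes. A minimal sketch of how such a penalty could enter a pairwise similarity value follows. The function name `similarity_with_missing`, the use of `None` or `''` for a missing value, and the exact-match comparison are assumptions for illustration, not the project's implementation.

```python
def similarity_with_missing(value_a, value_b, fact=0.1):
    """Return a similarity for one attribute pair (hypothetical helper).

    Assumption: a missing attribute is represented by None or ''.
    """
    missing_a = value_a in (None, '')
    missing_b = value_b in (None, '')
    if missing_a and missing_b:
        return -1.0 * fact   # both attributes of the pair are missing
    if missing_a or missing_b:
        return -0.5 * fact   # one attribute of the pair is missing
    # Placeholder comparison: exact match gives 1.0, anything else 0.0.
    return 1.0 if value_a == value_b else 0.0


print(similarity_with_missing('1998', '1998'))
print(similarity_with_missing('1998', None))
print(similarity_with_missing(None, None))
```

With a small `fact`, missing values stay close to the "no match" value of 0. With `fact = 1.0`, they are pushed clearly below it.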

In [2]:
#! pip install nbparameterise

In [3]:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import nbparameterise as nbp
import os
import results_saving_funcs as rsf

path_results = './results'
# Determine all relevant notebooks, omit Overview, Summary and Appendices
a = ! ls [1-9]_* | grep .ipynb

for i in range(len(a)):
    print('Executing notebook', a[i])
    with open(a[i]) as notebook:
        nb = nbformat.read(notebook, as_version=4)
        
        # Get list of parameter objects
        orig_parameters = nbp.extract_parameters(nb)
        # Update parameters
        params = nbp.parameter_values(orig_parameters,
                                      execution_mode=exec_mode,
                                      factor=fact,
                                      oversampling=oversamp,
                                      modification_ratio = modifi_ratio,
                                      exactDate_mode = mode_exactDate,
                                      strip_number_digits = strip_numbers
                                     )
        # Make notebook object with these definitions, ...
        nb = nbp.replace_definitions(nb, params, execute=False)

        ep = ExecutePreprocessor(timeout=None)
        # ... and execute it.
        ep.preprocess(nb, {"metadata": {"path": './'}})
    # Save notebook run in result file
    runtime_param_dict.update({'notebook_name' : a[i]})
    rsf.save_notebook_results(nb, path_results, runtime_param_dict)

print('Done with all notebooks.')

Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb


KeyboardInterrupt: 
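
The shell pipeline `ls [1-9]_* | grep .ipynb` selects the chapter notebooks by their leading digit. A portable sketch of the same selection in plain Python follows. The helper name `select_chapter_notebooks` is hypothetical, the regular expression is an assumption about the intended naming scheme, and some file names in the example list are invented.

```python
import re

# Assumed naming scheme: one leading digit 1-9, an underscore, and the
# .ipynb extension, e.g. '1_DataAnalysis.ipynb'.
CHAPTER_PATTERN = re.compile(r'^[1-9]_.*\.ipynb$')


def select_chapter_notebooks(filenames):
    """Return the chapter notebooks in sorted order (hypothetical helper)."""
    return sorted(name for name in filenames if CHAPTER_PATTERN.match(name))


# Example listing; '0_OverviewSummary.ipynb' and '1_notes.txt' are invented.
files = ['0_OverviewSummary.ipynb', '2_GoldstandardDataPreparation.ipynb',
         '1_DataAnalysis.ipynb', 'A_References.ipynb', '1_notes.txt']
print(select_chapter_notebooks(files))
```

In the notebook, `os.listdir('.')` could supply `filenames`, which removes the dependency on a Unix shell.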

In [4]:
import pandas as pd

path_goldstandard = './daten_goldstandard'

results = rsf.restore_dict_results(path_goldstandard, 'results.pkl')

results['results_best_model'].reset_index(drop=True, inplace=True)
# Ranking metric according to chapter 6 : roc auc
results['results_best_model'].sort_values('auc', ascending=False)

Unnamed: 0,model,test_score,auc,accuracy,precision,recall,test_score_log,auc_log,accuracy_log,precision_log,recall_log
34,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
14,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
2,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
23,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
5,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
20,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
8,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
17,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
11,DecisionTreeClassifier_CV,0.99946,98.28764,99.946036,94.059406,96.610169,7.5246,4.067298,7.5246,2.823361,3.38439
22,DecisionTreeClassifier,0.999441,98.118148,99.944108,94.039735,96.271186,7.489508,3.972914,7.489508,2.820055,3.28908
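
The `_log` columns in the table above are numerically consistent with $-\ln(1 - s)$ for a score $s$ on the 0 to 1 scale (percent columns divided by 100). This relation is inferred from the printed values, not taken from the project's code, and the function name `log_score` is hypothetical.

```python
import math


def log_score(score):
    """Map a score in [0, 1) to -ln(1 - score) (inferred relation)."""
    return -math.log(1.0 - score)


# Values copied from the first table row; percent columns are rescaled.
print(round(log_score(99.946036 / 100), 4))   # accuracy
print(round(log_score(98.28764 / 100), 4))    # auc
print(round(log_score(96.610169 / 100), 4))   # recall
```

The mapping stretches differences between scores close to 1, which makes near-perfect models easier to rank. A score of exactly 1.0 has no finite value under this mapping.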


In [8]:
import time

pd.options.display.max_rows = 200

# For timestamp in filename
tmstmp = time.strftime("%Y%m%d-%H%M%S")

for classifier in results['results_model_scores'].keys() :
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))
    results['results_model_scores'][classifier].to_csv(
        os.path.join(path_results, f'{classifier}_{tmstmp}.csv'), index=False)


DummyClassifier



DecisionTreeClassifier


Unnamed: 0,class_weight,criterion,max_depth,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
9,balanced,gini,20.0,1.0,0.999253,-inf,7.199678
10,balanced,gini,22.0,1.0,0.999253,-inf,7.199678
16,balanced,gini,50.0,1.0,0.999253,-inf,7.199678
15,balanced,gini,40.0,1.0,0.999253,-inf,7.199678
14,balanced,gini,35.0,1.0,0.999253,-inf,7.199678
13,balanced,gini,28.0,1.0,0.999253,-inf,7.199678
12,balanced,gini,26.0,1.0,0.999253,-inf,7.199678
11,balanced,gini,24.0,1.0,0.999253,-inf,7.199678
17,balanced,gini,,1.0,0.999253,-inf,7.199678
8,balanced,gini,18.0,0.999976,0.999133,10.633647,7.050147



DecisionTreeClassifier_CV


Unnamed: 0,class_weight,criterion,max_depth,accuracy_val,std_accuracy_val,log_accuracy_val
12,balanced,gini,26.0,0.999292,0.000153,7.252656
17,balanced,gini,,0.999287,0.000152,7.245876
16,balanced,gini,50.0,0.999287,0.000152,7.245876
15,balanced,gini,40.0,0.999287,0.000152,7.245876
14,balanced,gini,35.0,0.999287,0.000152,7.245876
13,balanced,gini,28.0,0.999287,0.000152,7.245876
11,balanced,gini,24.0,0.999287,0.000145,7.245876
10,balanced,gini,22.0,0.999215,0.000176,7.149339
9,balanced,gini,20.0,0.999147,0.000193,7.06694
8,balanced,gini,18.0,0.999003,0.000147,6.91037



RandomForestClassifier


Unnamed: 0,class_weight,max_depth,n_estimators,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
2,,,75,1.0,0.999518,-inf,7.637933
3,,,100,1.0,0.999518,-inf,7.637933
0,,20.0,75,0.999982,0.99947,10.92133,7.542623
1,,20.0,100,0.999988,0.99947,11.326795,7.542623



SVC


Unnamed: 0,C,class_weight,degree,gamma,kernel,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
0,0.5,,3,2.0,poly,0.999464,0.998892,7.531305,6.805024



SVC_CV


Unnamed: 0,C,class_weight,degree,gamma,kernel,accuracy_val,std_accuracy_val,log_accuracy_val
0,0.5,,3,2.0,poly,0.999056,0.000169,-6.964974



Neural Network


Unnamed: 0,class_weight,dropout_rate,l2_alpha,number_of_hidden1_layers,number_of_hidden2_layers,sgd_learnrate,accuracy_tr,accuracy_val,log_accuracy_tr,log_accuracy_val
3,,0.1,0.0,60,70,0.002,0.999337,0.999121,-7.318762,-7.0369
5,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,40,70,0.002,0.999185,0.999063,-7.111988,-6.973174
7,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,60,70,0.002,0.999307,0.999059,-7.274705,-6.969072
1,,0.1,0.0,40,70,0.002,0.999172,0.998959,-7.096752,-6.867819
2,,0.1,0.0,60,0,0.002,0.999099,0.998913,-7.011957,-6.824336
0,,0.1,0.0,40,0,0.002,0.999062,0.998878,-6.972144,-6.792921
6,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,60,0,0.002,0.999093,0.998847,-7.005627,-6.765798
4,"[0.5028541799926344, 88.09083191850594]",0.1,0.0,40,0,0.002,0.999,0.998836,-6.90747,-6.755805
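
The two-element `class_weight` entries in the Neural Network table are consistent with the "balanced" heuristic $w_c = N / (K \cdot n_c)$, where $N$ is the number of samples, $K$ the number of classes and $n_c$ the count of class $c$. The sketch below is an inference from the printed weights, not the project's code. The helper name and the toy counts are assumptions.

```python
def balanced_class_weights(class_counts):
    """Return weights N / (K * n_c) for each class (hypothetical helper)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Toy counts, not the project's data: 90 samples of class 0, 10 of class 1.
print(balanced_class_weights({0: 90, 1: 10}))

# Consistency check on the weights printed in the table: for two classes,
# the implied class shares 1 / (2 * w_c) should sum to 1.
w_majority, w_minority = 0.5028541799926344, 88.09083191850594
implied_shares = (1 / (2 * w_majority), 1 / (2 * w_minority))
print(implied_shares, sum(implied_shares))
```

Under this reading, the minority class makes up roughly 0.57 % of the samples, which illustrates how imbalanced the pair data is.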


## Comparison of Results

In [None]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions.pkl')

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
fpu, fpd = {}, {}

for i in wrong_predictions.keys() :
    fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
    fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

print(wrong_prediction_groups[0])
for i in fpu.keys() :
    print(i, len(fpu[i]), '\n', fpu[i])
print('\n')
print(wrong_prediction_groups[1])
for i in fpd.keys() :
    print(i, len(fpd[i]), '\n', fpd[i])
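
To compare classifiers, it can help to see which record pairs every model gets wrong. The sketch below intersects the index lists per classifier. The dictionary contents are invented toy data and the helper name `common_wrong_indices` is hypothetical.

```python
def common_wrong_indices(wrong_by_model):
    """Return sorted indices that appear in every model's list."""
    index_sets = [set(indices) for indices in wrong_by_model.values()]
    if not index_sets:
        return []
    return sorted(set.intersection(*index_sets))


# Toy data in the same shape as `fpu` / `fpd` (classifier -> index list).
toy_fpu = {
    'DecisionTreeClassifier': [3, 17, 42],
    'RandomForestClassifier': [17, 42, 99],
    'SVC': [5, 17, 42],
}
print(common_wrong_indices(toy_fpu))
```

Pairs that all models misclassify are natural candidates for a manual check of the goldstandard labels.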

In [None]:
# Restore DataFrame with attributes and similarity values
df_attribute_with_sim_feature = pd.read_pickle(os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl'), compression=None)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [None]:
pd.options.display.max_columns = 200

# Note: `i` still holds the last classifier key from the loops above.
df_attribute_with_sim_feature.iloc[fpu[i]]

In [None]:
df_index_docids.iloc[fpu[i]]