# Capstone Project - Deduplication of Swissbib Raw Data

**Program** Applied Data Science : Machine Learning<br>
**Institution** EPFL Extension School<br>
**Course** \#5, Capstone Project<br><br>
**Title** Deduplication of Swissbib Raw Data<br>
**Author** Andreas Jud<br>
**Date** dd-mmm-2020

## Table of Contents

- [Introduction](#introduction)
- [Overview](#overview)
- [Summary](#summary)

## Introduction<a id='introduction'/>

[Proposal](./project-proposal-andreas-jud.ipynb)

## Overview<a id='overview'/>

The notebook of the capstone project consists of the following chapters.

1. [Data Analysis](./1_DataAnalysis.ipynb)
2. [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb)
3. [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb)
4. [Decision Tree Model](./4_DecisionTreeModel.ipynb)

Appendix

A. [References](./A_References.ipynb)

## Summary<a id='summary'/>

In [1]:
# Setting parameters for runs

# Set run mode for each notebook :
#  full = full grid search space is scanned, results are collected.
#  restricted = small grid search space is scanned for local runs.
exec_mode = 'full'
#exec_mode = 'restricted'

# Factor for missing attributes, chapter 4 and graphs in chapter 5 :
#  -0.5*fact : One attribute of the pair is missing.
#  -1.0*fact : Both attributes of the pair are missing.
fact = 0.1
#fact = 1.0

# Factor of oversampling with synthetic data, chapter 3 :
#  Synthetic data is multiplied in a for loop so as to reach a ratio of
#  oversamp% in the full data set. If oversamp = 0, no synthetic data
#  will be added to the goldstandard data.
oversamp = 0
#oversamp = 20

# Ratio to which the attributes with a value will be modified in the ...
#  synthetic data generation of chapter 3.
modifi_ratio = 0.2

# Function applied to exactDate attribute to increase value of ...
#  strings with characters indicating 'unknown' digits.
mode_exactDate = 'xor'
#mode_exactDate = 'added_u'

# Decides whether for attribute 'volumes', the full string ...
#  shall be stripped to number digits (True) or shall be left as is (False).
strip_numbers = True
#strip_numbers = False
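
The comments on `fact` describe how missing attributes are encoded as negative similarity values. A minimal sketch of that rule follows. The function name `similarity_with_missing` and the exact-match comparison are assumptions made for illustration only; the project's notebooks may treat missing values and compute per-attribute similarities differently.

```python
def similarity_with_missing(value_a, value_b, fact=0.1):
    """Return a similarity for one attribute pair, penalising missing values.

    Hypothetical helper: the exact-match comparison stands in for whatever
    similarity metric is applied to the attribute in the actual project.
    """
    missing_a = value_a is None or value_a == ''
    missing_b = value_b is None or value_b == ''
    if missing_a and missing_b:
        return -1.0 * fact  # both attributes of the pair are missing
    if missing_a or missing_b:
        return -0.5 * fact  # one attribute of the pair is missing
    # Both values present: placeholder exact-match similarity
    return 1.0 if value_a == value_b else 0.0


print(similarity_with_missing('2010', '2010'))
print(similarity_with_missing('2010', None))
print(similarity_with_missing('', None))
```

With a small `fact`, missing values stay close to the "no similarity" value of 0; with `fact = 1.0` they are pushed as far below 0 as a full match lies above it.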

In [2]:
#! pip install nbparameterise

In [3]:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import nbparameterise as nbp

# Determine all relevant notebooks, omit Overview Summary and Appendixes
notebook_names = ! ls [1-9]_* | grep .ipynb

for notebook_name in notebook_names:
    print('Executing notebook', notebook_name)
    with open(notebook_name) as notebook:
        nb = nbformat.read(notebook, as_version=4)

        # Get list of parameter objects
        orig_parameters = nbp.extract_parameters(nb)
        # Update parameters
        params = nbp.parameter_values(orig_parameters,
                                      execution_mode=exec_mode,
                                      factor=fact,
                                      oversampling=oversamp,
                                      modification_ratio=modifi_ratio,
                                      exactDate_mode=mode_exactDate,
                                      strip_number_digits=strip_numbers
                                     )
        # Make notebook object with these definitions, ...
        nb = nbp.replace_definitions(nb, params, execute=False)

        ep = ExecutePreprocessor(timeout=None)
        # ... and execute it.
        ep.preprocess(nb, {"metadata": {"path": './'}})
    # Store the executed notebook under a new name with prefix 'Xctd_'
    with open('Xctd_' + notebook_name, 'wt') as f:
        nbformat.write(nb, f)

print('Done with all notebooks.')

Executing notebook 1_DataAnalysis.ipynb
Executing notebook 2_GoldstandardDataPreparation.ipynb
Executing notebook 3_DataSynthesizing.ipynb
Executing notebook 4_FeatureMatrixGeneration.ipynb
Executing notebook 5_FeatureDiscussionDummyBaseline.ipynb
Executing notebook 6_DecisionTreeModel.ipynb
Executing notebook 7_SVCModel.ipynb
Executing notebook 8_NeuralNetwork.ipynb


KeyboardInterrupt: 
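
The notebook list above is built with IPython's `!` shell syntax, which works only inside IPython and on systems that provide `ls` and `grep`. As a sketch of a portable alternative, the same selection could be done in plain Python. The function name `list_chapter_notebooks` is an assumption for illustration and is not part of the project.

```python
import re
from pathlib import Path


def list_chapter_notebooks(directory='.'):
    """Return sorted names of notebooks that start with a digit 1-9 and '_'."""
    pattern = re.compile(r'^[1-9]_.*\.ipynb$')
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and pattern.match(path.name)
    )


print(list_chapter_notebooks('.'))
```

Files written with the `Xctd_` prefix do not start with a digit, so a repeated run would not pick up previously executed copies.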

In [4]:
import results_saving_funcs as rsf
import pandas as pd

path_goldstandard = './daten_goldstandard'

results = rsf.restore_dict_results(path_goldstandard, 'results.pkl')

results['results_best_model'].reset_index(drop=True, inplace=True)
results['results_best_model'].sort_values('test_score', ascending=False)

| | model | test_score | auc | accuracy | precision | recall | test_score_log | auc_log | accuracy_log | precision_log | recall_log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | DecisionTreeClassifier_CV | 0.999537 | 98.460039 | 99.953745 | 95.016611 | 96.949153 | 7.67875 | 4.173413 | 7.67875 | 2.99906 | 3.489751 |
| 3 | RandomForestClassifier | 0.999441 | 97.781104 | 99.944108 | 94.630872 | 95.59322 | 7.489508 | 3.80816 | 7.489508 | 2.924505 | 3.122026 |
| 1 | DecisionTreeClassifier | 0.999191 | 97.768504 | 99.919053 | 90.675241 | 95.59322 | 7.119135 | 3.802498 | 7.119135 | 2.372497 | 3.122026 |
| 4 | SVC | 0.999133 | 96.248896 | 99.913271 | 92.22973 | 92.542373 | 7.050142 | 3.28312 | 7.050142 | 2.554865 | 2.595933 |
| 5 | SVC_CV | 0.998921 | 95.90119 | 99.892071 | 89.438944 | 91.864407 | 6.831453 | 3.194474 | 6.831453 | 2.247997 | 2.508922 |
| 0 | DummyClassifier | 0.988937 | 49.898126 | 98.893729 | 0.355872 | 0.338983 | 4.504175 | 0.691112 | 4.504175 | 0.003565 | 0.003396 |
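
The `_log` columns are not explained in this notebook. Their values are numerically consistent with the transformation $-\ln(1 - s)$, where $s$ is the corresponding score expressed as a fraction. This is an inference from the numbers shown above, not a formula confirmed by the source. The sketch below, with the hypothetical helper name `neg_log_error`, applies this assumed formula to two entries of the results table.

```python
import math


def neg_log_error(score_percent):
    """Map a score in percent to -ln(1 - score/100).

    Assumed formula, inferred from the results table; not confirmed
    by the project's code.
    """
    return -math.log(1.0 - score_percent / 100.0)


# auc of DecisionTreeClassifier_CV, compare with its auc_log entry
print(round(neg_log_error(98.460039), 6))
# precision of DummyClassifier, compare with its precision_log entry
print(round(neg_log_error(0.355872), 6))
```

Under this reading, the transformation stretches the region close to 100%, so small differences between very good models become easier to see than on the linear scale.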


In [None]:
pd.options.display.max_rows = 200

for classifier in results['results_model_scores'].keys() :
    print(f'\n{classifier}')
    display(results['results_model_scores'][classifier].head(20))
    results['results_model_scores'][classifier].to_csv(classifier+'.csv', index=False)

## Comparison of Results

In [None]:
# Read confusion matrix results from chapters
wrong_predictions = rsf.restore_dict_results(path_goldstandard, 'wrong_predictions.pkl')

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
fpu, fpd = {}, {}

for i in wrong_predictions.keys() :
    fpu[i] = wrong_predictions[i][wrong_prediction_groups[0]].sort_index().index.tolist()
    fpd[i] = wrong_predictions[i][wrong_prediction_groups[1]].sort_index().index.tolist()

print(wrong_prediction_groups[0])
for i in fpu.keys() :
    print(i, len(fpu[i]), '\n', fpu[i])
print('\n')
print(wrong_prediction_groups[1])
for i in fpd.keys() :
    print(i, len(fpd[i]), '\n', fpd[i])
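
To compare the models, it can help to see which records are misclassified by every model and which only by a single one. The following sketch does this with set operations on dictionaries shaped like `fpu` and `fpd`. The function name `compare_wrong_predictions` and the example indices are made up for illustration and are not taken from the project's results.

```python
def compare_wrong_predictions(index_lists):
    """Split misclassified row indices into shared and model-specific sets.

    index_lists maps a model name to a list of row indices, shaped like
    the fpu and fpd dictionaries.
    """
    index_sets = {model: set(rows) for model, rows in index_lists.items()}
    # Rows that every model predicted wrongly
    shared = set.intersection(*index_sets.values()) if index_sets else set()
    # Rows that exactly one model predicted wrongly
    exclusive = {}
    for model, rows in index_sets.items():
        others = set().union(*(s for m, s in index_sets.items() if m != model))
        exclusive[model] = rows - others
    return shared, exclusive


# Made-up indices, for illustration only
example = {
    'DecisionTreeClassifier': [3, 17, 42],
    'SVC': [17, 42, 99],
}
shared, exclusive = compare_wrong_predictions(example)
print(sorted(shared))
print({model: sorted(rows) for model, rows in exclusive.items()})
```

Records in the shared set are candidates for a closer look at the data itself, while model-specific records hint at weaknesses of an individual classifier.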

In [None]:
import os

# Restore DataFrame with attributes and similarity values
df_attribute_with_sim_feature = pd.read_pickle(os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl'), compression=None)

# Binary intermediary DataFrame file for docid's
df_index_docids = pd.read_pickle(os.path.join(
    path_goldstandard, 'index_docids_df.pkl'), compression=None)

In [None]:
pd.options.display.max_columns = 200

# i still holds the last classifier key from the loops above
df_attribute_with_sim_feature.iloc[fpu[i]]

In [None]:
df_index_docids.iloc[fpu[i]]