# Deduplicating data

In this notebook we deduplicate data with the [Dedupe](https://docs.dedupe.io/) library, which uses a shallow neural network to learn from a small amount of training data.

The same developers also created [parserator](https://github.com/datamade/parserator), which you can use to extract features from text and to train your own text extraction.

## 1. Imports

In [1]:
import pandas as pd
import dedupe
import os

In [2]:
customers = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv', 
                        encoding='utf-8')

## 2. Checking data quality

In [3]:
customers.head()

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78


In [4]:
customers.dtypes

name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

In [5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0
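
The null counts above only show that no field is empty; they say nothing about near-duplicate rows. As a rough illustration on invented toy records (not the notebook's data), exact-match deduplication with pandas misses a row that differs only by a typo, which is the gap a fuzzy matcher such as Dedupe is meant to close:

```python
import pandas as pd

# Toy records invented for illustration:
# row 1 is an exact copy of row 0, row 2 is a typo variant.
toy = pd.DataFrame({
    'name': ['Kenneth Moore', 'Kenneth Moore', 'Kenneth Moore'],
    'city': ['Thompsonshire', 'Thompsonshire', 'Thompsonshe'],
})

# duplicated() flags rows that exactly match an earlier row.
exact_dupes = toy.duplicated()
print(exact_dupes.tolist())

# drop_duplicates() therefore keeps the typo variant as a separate row.
deduped = toy.drop_duplicates()
print(len(deduped))
```

Only the exact copy is removed here; the misspelled city survives as its own record.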


## 3. Configuring Dedupe

Now we define the fields that deduplication should take into account and create a new `deduper` object:

In [6]:
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},  
    {'field': 'street_address','type': 'String'},
    {'field': 'city','type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'email', 'type': 'String', 'has_missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)

In [7]:
deduper

<dedupe.api.Dedupe at 0x7f983598a1c0>

In [8]:
customers.shape

(2080, 8)
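
Each field definition above names a column of the data, so a misspelled field name is an easy mistake to make. A minimal sketch of a sanity check, assuming the column names listed by `customers.dtypes`; `missing_fields` is a hypothetical helper written for this example and not part of Dedupe, and `zip_code` is a deliberately wrong entry:

```python
# Column names as listed by customers.dtypes above.
columns = ['name', 'job', 'company', 'street_address',
           'city', 'state', 'email', 'user_name']

# Shortened field definitions; 'zip_code' is deliberately wrong for illustration.
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'zip_code', 'type': 'String'},
]

def missing_fields(variables, columns):
    """Return field names in the definitions that are not columns of the data."""
    known = set(columns)
    return [v['field'] for v in variables if v['field'] not in known]

print(missing_fields(variables, columns))
```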

## 4. Creating training data

In [9]:
deduper.prepare_training(customers.T.to_dict())

INFO:dedupe.canopy_index:Removing stop word om
INFO:dedupe.canopy_index:Removing stop word co
INFO:dedupe.canopy_index:Removing stop word com
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfTextCanopyPredicate: (0.8, email)
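
`prepare_training` is given a dictionary that maps each record ID to a dictionary of field values. A small sketch on invented toy data shows the shape that `customers.T.to_dict()` produces, and that pandas' `to_dict(orient='index')` should yield the same structure without the transpose:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    'name': ['Ann Lee', 'Bo Chen'],
    'email': ['ann@example.com', 'bo@example.com'],
})

# Transpose, then convert: outer keys are row labels, inner keys are columns.
records_t = toy.T.to_dict()

# Same structure, requested directly.
records_index = toy.to_dict(orient='index')

print(records_index[0])
```

For a frame with mixed column types, the `orient='index'` form avoids the intermediate transposed frame, which may be a reason to prefer it.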


## 5. Active learning

When Dedupe finds a pair of records, you are asked whether they are duplicates. Use the keys `y`, `n`, and `u` to label each pair, and press `f` when you are finished.

In [10]:
dedupe.console_label(deduper)

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bell and Diaz
street_address : 75443 Lindsey Pine
city : Thompsonshire
state : Colorado
email : ashley28@rice.com
user_name : todd72

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bfll anf Diaz
street_address : 753 Lindsey Pine
city : Thompsonshe
state : Colorao
email : ashey28@rice.co
user_name : todd72

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


name : Dr. Catherine Sutton
job : Engineer, maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : lewisnicole@yahoo.com
user_name : clittle

name : Dr. Catherine Sutton
job : Enginee maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : ewinicoe@yaoo.com
user_name : little

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (alphaNumericPredicate, user_name)
name : Marcelle Francois
job : Collecteur de fonds
company : Leconte S.A.
street_address : avenue Corinne Allard
city : Sainte Philippenec
state : KH
email : yblot@sauvage.net
user_name : molivier

name : Marcelle Francois
job : Collecteur de fonds
company : Lecinte SiAi
street_address : aenue Crinne Allard
city : Sainte Philippenec
state : KH
email : yblot@sauvage.net
user_name : mlivier

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (alphaNumericPredicate, user_name)
INFO:dedupe.training:SimplePredicate: (twoGramFingerprint, street_address)
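
The pairs labeled above differ only by small typos in a few fields. To get a feel for how similar such fields are, the sketch below scores three fields of the first pair with `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration; it is not the comparison Dedupe itself uses, and `field_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Three fields from the first labeled pair shown above.
record_a = {'company': 'Cross, Bell and Diaz', 'city': 'Thompsonshire', 'user_name': 'todd72'}
record_b = {'company': 'Cross, Bfll anf Diaz', 'city': 'Thompsonshe', 'user_name': 'todd72'}

def field_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for every field of record a."""
    return {field: SequenceMatcher(None, a[field], b[field]).ratio() for field in a}

scores = field_similarity(record_a, record_b)
for field, score in scores.items():
    print(f'{field}: {score:.2f}')
```

Identical fields score 1.0, while fields with a few changed characters score slightly lower, which is the kind of signal a learned model can weigh per field.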


In [11]:
training_file = 'csv_example_training.json'

# Reuse labels from an earlier session if a training file exists;
# otherwise start from an unlabeled sample of the data.
if os.path.exists(training_file):
    print('reading labeled examples from', training_file)
    with open(training_file, 'rb') as f:
        deduper.prepare_training(customers.T.to_dict(), f)
else:
    deduper.prepare_training(customers.T.to_dict())

INFO:dedupe.canopy_index:Removing stop word om
INFO:dedupe.canopy_index:Removing stop word co
INFO:dedupe.canopy_index:Removing stop word com
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfTextCanopyPredicate: (0.8, email)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (twoGramFingerprint, street_address)
INFO:dedupe.training:SimplePredicate: (alphaNumericPredicate, user_name)


When you are finished labeling, train the model and save your training data:

In [12]:
deduper.train(index_predicates=True)
with open(training_file, 'w') as tf:
    deduper.write_training(tf)

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000010, score 0.0
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(TfidfTextCanopyPredicate: (0.6, street_address), SimplePredicate: (wholeFieldPredicate, job), SimplePredicate: (sameFiveCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (twoGramFingerprint, company), SimplePredicate: (commonTwoTokens, state), TfidfNGramCanopyPredicate: (0.6, user_name))
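
The "Final predicate set" log lines refer to blocking rules: instead of comparing every record with every other record, only records that share a blocking key are compared. A minimal sketch of that idea on invented toy records, assuming a single rule that uses the first three characters of the email; `first_three_chars` and `candidate_pairs` are hypothetical helpers for this example, not Dedupe functions:

```python
from collections import defaultdict
from itertools import combinations

# Toy records invented for illustration, keyed by record ID.
records = {
    0: {'email': 'ashley28@rice.com'},
    1: {'email': 'ashey28@rice.co'},
    2: {'email': 'yblot@sauvage.net'},
    3: {'email': 'lewisnicole@yahoo.com'},
}

def first_three_chars(value):
    """Hypothetical blocking key: the first three characters of a field."""
    return value[:3]

def candidate_pairs(records, field, key_func):
    """Group records by blocking key and return the pairs that share a key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[key_func(record[field])].append(record_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

pairs = candidate_pairs(records, 'email', first_three_chars)
print(pairs)
```

With four records there are six possible pairs, but only the pair that shares a key is passed on for detailed comparison.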


Save your weights and predicates as well. If `settings_file` already exists, the deduper is loaded from it on the next run, and training and active learning are skipped:

In [13]:
settings_file = 'csv_example_learned_settings'
if os.path.exists(settings_file):
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    deduper.train(index_predicates=True)
    with open(settings_file, 'wb') as sf:
        deduper.write_settings(sf)

reading from csv_example_learned_settings


INFO:dedupe.api:Predicate set:
INFO:dedupe.api:(TfidfNGramCanopyPredicate: (0.8, user_name), TfidfTextCanopyPredicate: (0.8, city), SimplePredicate: (sameThreeCharStartPredicate, email))
INFO:dedupe.api:(SimplePredicate: (commonThreeTokens, street_address), TfidfTextCanopyPredicate: (0.8, city), TfidfTextCanopyPredicate: (0.4, email))
INFO:dedupe.api:(SimplePredicate: (twoGramFingerprint, street_address), SimplePredicate: (wholeFieldPredicate, company), TfidfNGramCanopyPredicate: (0.8, email))
INFO:dedupe.api:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (fingerprint, city), TfidfTextCanopyPredicate: (0.4, state))
INFO:dedupe.api:(SimplePredicate: (firstTokenPredicate, email), TfidfNGramCanopyPredicate: (0.8, job), TfidfNGramCanopyPredicate: (0.4, company))
INFO:dedupe.api:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (wholeFieldPredicate, job), SimplePredicate: (twoGramFingerprint, state))
INFO:dedupe.api:(SimplePredicate: (oneGramFin