# Deduplicating data

In this notebook, we deduplicate data with the [Dedupe](https://dedupe.readthedocs.io/en/latest/) library, which learns a matching model from a small set of labeled examples.

The same developers also built [parserator](https://github.com/datamade/parserator), which you can use to extract features from text and to train your own text extraction.

## 1. Imports

In [1]:
import pandas as pd
import dedupe
import os

In [2]:
customers = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv', 
                        encoding='utf-8')

## 2. Checking data quality

In [3]:
customers.head()

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78


In [4]:
customers.dtypes

name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

In [5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0
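
A null count of zero does not rule out blank strings, which `isnull()` does not treat as missing. The sketch below works on a few hypothetical records held in plain dictionaries (so it runs without the CSV). It counts blank values per field and replaces them with `None`, which is how Dedupe expects missing values to be represented for fields flagged with `has_missing`. The helper names `count_blank` and `blank_to_none` are made up for this illustration.

```python
from collections import Counter


def count_blank(records):
    """Count values per field that are None, empty, or whitespace-only."""
    counts = Counter()
    for row in records.values():
        for field, value in row.items():
            if value is None or not str(value).strip():
                counts[field] += 1
    return dict(counts)


def blank_to_none(records):
    """Return a copy of the records with blank values replaced by None."""
    return {
        record_id: {
            field: (value if value is not None and str(value).strip() else None)
            for field, value in row.items()
        }
        for record_id, row in records.items()
    }


# Hypothetical sample in the {record_id: {field: value}} shape
sample = {
    0: {'name': 'Luc Weber', 'state': 'IS', 'email': ''},
    1: {'name': 'Luc Weber', 'state': '  ', 'email': 'rousseauedith@bouygtel.fr'},
}

print(count_blank(sample))
cleaned = blank_to_none(sample)
print(cleaned[0]['email'], cleaned[1]['state'])
```

If a check like this finds blanks in a field, that field is a candidate for `'has_missing': True` in the variable definition.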


## 3. Configuring Dedupe

Now we define the fields that Dedupe should consider during deduplication and create a new `deduper` object:

In [6]:
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},  
    {'field': 'street_address','type': 'String'},
    {'field': 'city','type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'email', 'type': 'String', 'has_missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)
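
Declaring a field as `String` means its values are compared with a string distance rather than tested for exact equality. To build some intuition for that, here is a minimal sketch of the Levenshtein edit distance. This is a stand-in chosen for simplicity, not Dedupe's own comparator (the Dedupe documentation describes an affine gap distance for string fields). The example values come from the labeling output later in this notebook.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein('Tax adviser', 'Tax aviser'))
print(levenshtein('Surgeon', 'Sugeo'))
```

A small distance relative to the string length suggests a typo rather than a different value, which is the kind of signal a trained model can weight per field.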

In [7]:
deduper

<dedupe.api.Dedupe at 0x1226c6ac8>

In [8]:
customers.shape

(2080, 8)

## 4. Creating training data

In [9]:
deduper.prepare_training(customers.T.to_dict())

INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, job), SimplePredicate: (twoGramFingerprint, user_name))
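
The "predicate set" in the log describes blocking rules. Only records that share a block key are compared, which avoids scoring every possible pair (for 2080 records that would be 2080 × 2079 / 2 = 2,162,160 comparisons). The following is a simplified illustration of the idea with a made-up predicate function, loosely modeled on the name `sameSevenCharStartPredicate`. It is not Dedupe's implementation.

```python
from collections import defaultdict
from itertools import combinations


def first_seven_chars(value):
    """Block key: the first seven characters of the field value."""
    return value[:7]


def block_records(records, field, predicate):
    """Group record ids by the block key computed from one field."""
    blocks = defaultdict(list)
    for record_id, row in records.items():
        blocks[predicate(row[field])].append(record_id)
    return blocks


def candidate_pairs(blocks):
    """Yield only the id pairs that share a block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hypothetical mini data set; the user names appear in the labeling output
records = {
    0: {'user_name': 'marquessebastien'},
    1: {'user_name': 'marquesseastie'},
    2: {'user_name': 'gloriaolmo'},
    3: {'user_name': 'todd72'},
}

blocks = block_records(records, 'user_name', first_seven_chars)
pairs = list(candidate_pairs(blocks))
print(pairs)
```

Here only records 0 and 1 end up in the same block, so one pair is compared instead of six.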


## 5. Active learning

When Dedupe finds a pair of records, it asks you whether the two are duplicates. Answer with the keys `y` (yes), `n` (no), or `u` (unsure), and press `f` when you are finished.
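
The point of active learning is to ask for labels where they help the model most. One common strategy is uncertainty sampling: present the pair whose predicted match probability is closest to 0.5. The sketch below shows that strategy with hypothetical scores and a stand-in for keyboard input. It illustrates the general idea only and makes no claim about how Dedupe selects its pairs internally.

```python
def most_uncertain(scored_pairs):
    """Return the pair whose match probability is closest to 0.5."""
    return min(scored_pairs, key=lambda item: abs(item[1] - 0.5))[0]


def label_loop(scored_pairs, ask, budget):
    """Ask for labels on the most uncertain pairs until the budget runs out."""
    remaining = list(scored_pairs)
    labels = {'match': [], 'distinct': []}
    for _ in range(min(budget, len(remaining))):
        pair = most_uncertain(remaining)
        remaining = [item for item in remaining if item[0] != pair]
        answer = ask(pair)
        if answer == 'y':
            labels['match'].append(pair)
        elif answer == 'n':
            labels['distinct'].append(pair)
        # any other answer, such as 'u', leaves the pair unlabeled
    return labels


# Hypothetical (pair, probability) scores from some current model
scores = [((0, 1), 0.97), ((2, 3), 0.52), ((4, 5), 0.10), ((6, 7), 0.35)]

# Stand-in for keyboard input: always answer "yes"
labels = label_loop(scores, ask=lambda pair: 'y', budget=2)
print(labels)
```

With a budget of two questions, the loop skips the confident pairs (0.97 and 0.10) and asks about the two closest to 0.5.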

In [10]:
dedupe.console_label(deduper)

name : Julio Agustín Amaya
job : Tax adviser
company : Piñol, Belmonte and Codina
street_address : Callejón de Gregorio Bustamante 28 Piso 7 
city : Las Palmas
state : Salamanca
email : usolana@jáuregui-pedraza.com
user_name : gloriaolmo

name : Julio Agustín Amaya
job : Tax aviser
company : Piñolk Belmonke and Codina
street_address : Calleón de Gregorio Bustamante 28 Piso 7 
city : La Pala
state : Salamanca
email : usolana@jáuregui-pedraza.om
user_name : gloriaolmo

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (oneGramFingerprint, user_name))
name : Monique Marty
job : Maoqiie
company : Arnfud
street_address : 70, rue de Carre
city : CheallierBour
state : EC
email : frederiquerichard@cohen.com
user_name : marquesseastie

name : Monique Marty
job : Maroquinier
company : Arnaud
street_address : 70, rue de Carre
city : ChevallierBourg
state : EC
email : frederiquerichard@cohen.com
user_name : marquessebastien

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


name : Kenneth Moore
job : Magazine journalist
company : Cross, Bell and Diaz
street_address : 75443 Lindsey Pine
city : Thompsonshire
state : Colorado
email : ashley28@rice.com
user_name : todd72

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bfll anf Diaz
street_address : 753 Lindsey Pine
city : Thompsonshe
state : Colorao
email : ashey28@rice.co
user_name : todd72

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
name : Sarah Hoffman
job : Exhibition designer
company : Henson, Wiley and Ryan
street_address : 97490 Curtis Spur Suite 825
city : Josephtown
state : Arizona
email : ncole@yahoo.com
user_name : csmith

name : Sarah Hoffman
job : Exhibitin designe
company : Hensont Wiley and Ryan
street_address : 9490 Curts Spur Sute 82
city : Jseptwn
state : Arizona
email : ncole@yahoo.com
user_name : csmith

4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
name : Jose Carlos Pérez Arias
job : Engineer, maintenance (IT)
company : Marquez PLC
street_address : Pasadizo Ángel Sureda 715 Piso 3 
city : La Rioja
state : Córdoba
email : cifuentesraquel@peralta.com
user_name : gonzalo63

name : Jose Carlos Pérez Arias
job : Egieer, maiteace (IT)
company : Marquez PLC
street_address : Psdizo Ángel Sured 715 Piso  
city : La Rioja
state : Córdob
email : ifuenteraque@perata.om
user_name : gonzalo6

5/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (twoGramFingerprint, email))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
name : Luc Weber
job : Chrgé d'res en géne clmtqe
company : Perez
street_address : rue Da Silva
city : Rxdan
state : IS
email : rousseauedih@bouyge.fr
user_name : alexadrialaroce

name : Luc Weber
job : Chargé d'affaires en génie climatique
company : Perez
street_address : rue Da Silva
city : Rouxdan
state : IS
email : rousseauedith@bouygtel.fr
user_name : alexandrialaroche

6/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (twoGramFingerprint, email))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, city), SimplePredicate: (twoGramFingerprint, company))
name : Gerhart Krebs MBA.
job : Sugeo
company : Roskoth
street_address : Kühnertweg 83
city : Stade
state : Bayer
email : oav44@oader.de
user_name : bettyhahn

name : Gerhart Krebs MBA.
job : Surgeon
company : Roskoth
street_address : Kühnertweg 863
city : Stade
state : Bayern
email : olav44@bolander.de
user_name : bettyhahn

7/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (twoGramFingerprint, email))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, city), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (fingerprint, company), SimplePredicate: (twoGramFingerprint, street_address))
name : Richard Lemaitre
job : Vendeur-conseil en matériel agricole
company : Begue
street_address : 38, chemin de Guillaume
city : Guilbert
state : TO
email : olivier70@marechal.net
user_name : michelle19

name : Richard Lemaitre
job : Vur-cosil  téril gricol
company : Begue
street_address : 38, chemin de Gillame
city : Guilbert
state : TO
email : olivier70@ma

y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (twoGramFingerprint, email))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, city), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (firstTokenPredicate, user_name), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (fingerprint, company), SimplePredicate: (twoGramFingerprint, street_address))
name : Mrs. Frances Peters
job : Furniture designer
company : Rogers, Lawrence and Richards
street_address : Studio 00
Carpenter keys
city : West Simon
state : BO
email : charlenewilliams@wilson-sanders.org
user_name : amy17

name : Mrs. Fra

y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, email))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (twoGramFingerprint, email))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, city), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (firstTokenPredicate, user_name), SimplePredicate: (twoGramFingerprint, company))
name : Ing. Marian Heidrich MBA.
job : Civil engineer, consulting
company : Johann Heuser AG
street_address : Lilija-Ortmann-Straße 54
city : Husum
state : Hamburg
email : truebconcetta@googlemail.com
user_name : marie78

name : Ing. Marian Heidrich MBA.
job : Cii ngin, consuting
company : Johann Heuser AG
street_address : Lilija-Ortmann-Straße 54
city : Husu
s

y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, email))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (fingerprint, user_name))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, city), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (firstTokenPredicate, user_name), SimplePredicate: (twoGramFingerprint, company))
name : Meinhard Finke-Girschner
job : Health and safety adviser
company : Blümel AG & Co. OHG
street_address : Ladislaus-Koch II-Straße 457
city : Querfurt
state : Nordrhein-Westfalen
email : lscheibe@hotmail.de
user_name : junckgisa

name : Meinhard Finke-Girschner
job : Health ad safety adviser
company : Blü8el AG & C88 OHG
street_address : Ladslaus-Koch II-Sraß

y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, user_name), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (fingerprint, user_name))
INFO:dedupe.training:(SimplePredicate: (fingerprint, company), SimplePredicate: (twoGramFingerprint, street_address))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
name : Éric Descamps
job : Geonnare de parone
company : Pa,cal S.A.R.L.
street_address : 1, avenue Delaaye
city : VerdiersurGmes
state : SC
email : josephine43@eores.fr
user_name : nathalieledux

name : Éric Descamps
job : Gestionnaire de patrimoine
company : Pascal S.A.R.L.
street_address : 1, avenue Delahaye
city : Verdier-sur-Gomes
state : SC
email : jose

f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, user_name), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, company), SimplePredicate: (fingerprint, user_name))
INFO:dedupe.training:(SimplePredicate: (fingerprint, company), SimplePredicate: (twoGramFingerprint, street_address))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, job), SimplePredicate: (fingerprint, email))


In [11]:
training_file = 'csv_example_training.json'

if os.path.exists(training_file):
    print('reading labeled examples from ', training_file)
    with open(training_file, 'rb') as f:
        deduper.prepare_training(customers.T.to_dict(), f)
else:
    deduper.prepare_training(customers.T.to_dict())

INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, job), SimplePredicate: (twoGramFingerprint, user_name))
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, street_address), SimplePredicate: (wholeFieldPredicate, email))
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, user_name), SimplePredicate: (twoGramFingerprint, company))
INFO:dedupe.training:(SimplePredicate: (fingerprint, company), SimplePredicate: (twoGramFingerprint, street_address))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonThreeTokens, company))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (commonTwoTokens, job))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (oneGramFingerprint, user_name))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, job), SimplePredicate: (

When you are finished, save your training data:

In [12]:
with open(training_file, 'w') as tf:
    deduper.write_training(tf)
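
Before relying on a saved training file, it can help to check how many examples of each kind it holds. The sketch below assumes the file is JSON with the top-level keys `match` and `distinct`, each holding a list of record pairs. The helper name `summarize_training` is made up, and the demo writes its own tiny file so that it does not depend on `csv_example_training.json`.

```python
import json
import os
import tempfile


def summarize_training(path):
    """Count the labeled pairs per label in a training file (assumed JSON)."""
    with open(path, encoding='utf-8') as handle:
        pairs = json.load(handle)
    return {label: len(examples) for label, examples in pairs.items()}


# Demo with a tiny hand-written file in the assumed layout
demo = {
    'match': [[{'name': 'Luc Weber'}, {'name': 'Luc Weber'}]],
    'distinct': [],
}
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'demo_training.json')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        json.dump(demo, handle)
    summary = summarize_training(demo_path)

print(summary)
```

In the labeling session above every answer was `y`, so a summary of that session would show no `distinct` examples. Labeling some non-duplicates as well gives the model something to contrast against.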

Save your weights and predicates as well. If `settings_file` already exists, training and active learning are skipped on the next run:

In [None]:
settings_file = 'csv_example_learned_settings'
if os.path.exists(settings_file):
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
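else:
    # Learn weights and blocking predicates from the labeled pairs;
    # the settings can only be written once the model has been trained
    deduper.train()
    with open(settings_file, 'wb') as sf:
        deduper.write_settings(sf)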