## Deduplicating data

In this notebook, we deduplicate data using the [Dedupe library](https://dedupe.readthedocs.io/en/latest/), which uses active learning: it fits a record-matching model from a small set of record pairs that you label by hand.

If you are interested in building your own parser, the same folks created [Parserator](https://github.com/datamade/parserator), which you can use to extract text features and train your own text extraction model (hooray! less brittle than regex!).

In [1]:
import pandas as pd
import dedupe
import os

In [2]:
customers = pd.read_csv('../data/customer_data_duped.csv', 
                        encoding='utf-8')

## Checking Data Quality

In [4]:
customers.head()

| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain |
| 2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael |
| 3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz |
| 4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78 |


In [5]:
customers.dtypes

name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

In [6]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0


## Setting up Dedupe

In [7]:
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},  
    {'field': 'street_address','type': 'String'},
    {'field': 'city','type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'email', 'type': 'String', 'has_missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)
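
The `has_missing` flag on `state` and `email` only helps if missing values are actually marked as missing. Dedupe's documentation asks for missing values to be represented as `None`, while pandas reads blank CSV cells as `NaN`. The following is a minimal sketch of a cleanup step, assuming records are plain dictionaries; `clean_value` and `clean_record` are hypothetical helper names, and the sample record is made up for illustration.

```python
import math


def clean_value(value):
    """Return None for blank or NaN values, otherwise a stripped string."""
    if value is None:
        return None
    # pandas represents blank CSV cells as float NaN
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    return text if text else None


def clean_record(record):
    """Apply clean_value to every field of one record dictionary."""
    return {field: clean_value(value) for field, value in record.items()}


# Made-up record with one padded, one blank and one NaN field
sample = {'name': ' Richard Bauer ', 'state': '', 'email': float('nan')}
cleaned = clean_record(sample)
print(cleaned)
```

In this dataset the null check above reports no missing values, so the step is optional here; it matters more for messier inputs.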

In [8]:
deduper

<dedupe.api.Dedupe at 0x7f8d900fd650>

In [9]:
customers.shape

(2080, 8)

In [None]:
deduper.sample(customers.T.to_dict(), 500)
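
Dedupe works on a dictionary that maps each record id to a dictionary of field values, which is the shape `customers.T.to_dict()` produces (keyed by the DataFrame index). As a rough illustration of that shape without pandas, here is a sketch that builds the same kind of structure with the standard `csv` module. The two rows are copied from the `customers.head()` output above, with only a few columns kept, and the ids are simply positional.

```python
import csv
import io

# Two rows taken from the head() output above (subset of columns)
raw = io.StringIO(
    "name,company,city,state\n"
    "Patricia Schaefer,Estrada-Best,Christianview,Delaware\n"
    "Richard Bauer,Hoffman-Rocha,Carlosmouth,Texas\n"
)

# Map a positional record id to a {field: value} dictionary
records = {i: dict(row) for i, row in enumerate(csv.DictReader(raw))}

print(records[1]['city'])
```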

Note: If you see a warning like this:

```
/usr/local/lib/python2.7/site-packages/dedupe/sampling.py:39: UserWarning: 250 blocked samples were requested, but only able to sample 249
  % (sample_size, len(blocked_sample)))
```

you can safely continue (a sample was still drawn), or rerun the cell using the number the warning suggests (249 in this case).

#### Either load an existing training file (uncomment the lines below) or continue with active training

In [None]:
training_file = '../data/ignore-dedupe-training.json'
#if os.path.exists(training_file):
#    with open(training_file, 'rb') as f:
#        deduper.readTraining(f)

In [None]:
dedupe.consoleLabel(deduper)

In [None]:
deduper.train()

In [None]:
with open(training_file, 'w') as tf:
    deduper.writeTraining(tf)

In [None]:
dupes = deduper.match(customers.T.to_dict())

In [None]:
dupes

In [None]:
dupes[2]

In [None]:
customers.iloc[[741,1107]]
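
To work with the result it helps to know its shape. The sketch below assumes `dupes` is a list of `(record_ids, scores)` pairs, one per cluster, which is how Dedupe 1.x describes the return value of `match`; check `dupes[0]` in your own session before relying on it. `sample_dupes` is a hypothetical stand-in: the ids 741 and 1107 come from the cell above, while the scores and the second cluster are made up.

```python
# Hypothetical stand-in for `dupes`: (record_ids, scores) per cluster
sample_dupes = [
    ((741, 1107), (0.97, 0.97)),
    ((12, 530, 1999), (0.88, 0.91, 0.85)),
]

confidence = {}     # record id -> its score within the cluster
duplicate_ids = {}  # record id -> ids of the other records in the cluster

for record_ids, scores in sample_dupes:
    for record_id, score in zip(record_ids, scores):
        confidence[record_id] = score
        duplicate_ids[record_id] = [
            other for other in record_ids if other != record_id
        ]

print(duplicate_ids[741])
print(confidence[530])
```

Because both dictionaries are keyed by record id, wrapping one in `pd.Series(...)` and assigning it to a new column would align it with the DataFrame index and leave unmatched rows as `NaN`.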

### Exercise: Flag duplicates by adding two extra columns, one for the confidence score and one for duplicate_ids

In [None]:
# %load ../solutions/dedupe.py


In [None]:
customers[customers.confidence.notnull()].head()