## Deduplicating data

In this notebook, we deduplicate data using the [Dedupe library](https://dedupe.readthedocs.io/en/latest/), which uses active learning: it fits a matching model from a small set of record pairs that you label in a short training exercise.
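
To see why exact comparison is not enough for deduplication, here is a minimal sketch using the standard library's `difflib`. It is only an illustration of fuzzy matching: Dedupe uses its own learned comparators, not this function, and the two names below are hypothetical examples.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical near-duplicate names: an exact comparison treats them as different.
a, b = "Patricia Schaefer", "Patricia Schaeffer"
print(a == b)
print(similarity(a, b) > 0.9)
```

A fixed threshold like `0.9` is brittle across fields, which is the motivation for learning the weights from labeled pairs instead.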

If you are interested in building your own parser, the same team has created [Parserator](https://github.com/datamade/parserator), which you can use to extract text features and train your own text extraction model. (Hooray, less brittle than regex!)

In [6]:
%pip install dedupe

Note: you may need to restart the kernel to use updated packages.
Collecting dedupe
  Using cached dedupe-2.0.23-cp311-cp311-win_amd64.whl (96 kB)
Collecting affinegap>=1.3 (from dedupe)
  Using cached affinegap-1.12.tar.gz (33 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting categorical-distance>=1.9 (from dedupe)
  Using cached categorical_distance-1.9-py3-none-any.whl (3.3 kB)
Collecting dedupe-variable-datetime (from dedupe)
  Using cached dedupe_variable_datetime-1.0.0-py3-none-any.whl (3.9 kB)
Collecting doublemetaphone (from dedupe)
  Using cached DoubleMetaphone-1.1.tar.gz (34 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Gett

  error: subprocess-exited-with-error
  
  Getting requirements to build wheel did not run successfully.
  exit code: 1
  
  [56 lines of output]
    tree = Parsing.p_module(s, pxd, full_module_name)
  
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  
          x_a = aligned_copy(x0.ravel())
  
          try:
              callback_data = (f, progress, x0.shape, args)
              r = lbfgs(n, x_a, fx_final, call_eval,
                                          ^
  ------------------------------------------------------------
  
  lbfgs\_lowlevel.pyx:395:40: Cannot assign type 'lbfgsfloatval_t (void *, lbfgsconst_p, lbfgsfloatval_t *, int, lbfgsfloatval_t) except? -1' to 'lbfgs_evaluate_t'. Exception values are incompatible. Suggest adding 'noexcept' to type 'lbfgsfloatval_t (void *, lbfgsconst_p, lbfgsfloatval_t *, int, lbfgsfloatval_t) except? -1'.
  
  Error compiling Cython file:
  ------------------------------------------------

In [11]:
import pandas as pd
import dedupe
import os

In [7]:
customers = pd.read_csv('customer_data_duped.csv', 
                        encoding='utf-8')

## Checking Data Quality

In [9]:
customers.head()

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78


In [None]:
customers.dtypes

In [None]:
# Count missing values per column.
customers.isnull().sum()

## Setting up Dedupe

In [None]:
# Fields to compare; 'has missing' marks columns that may contain nulls.
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},
    {'field': 'street_address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    {'field': 'state', 'type': 'String', 'has missing': True},
    {'field': 'email', 'type': 'String', 'has missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)
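
Before training, it can help to confirm that the variable definition matches the data. The following is a minimal sketch, not part of Dedupe: `check_fields` is a hypothetical helper, and the records and the `zip_code` field are made up for illustration. It reports fields that are defined but absent from the data, and defined fields that contain missing values.

```python
def check_fields(variables, records):
    """Return (unknown_fields, fields_with_missing) for a variable definition."""
    seen = set()
    has_gaps = set()
    for record in records.values():
        for field, value in record.items():
            seen.add(field)
            # Treat both None and empty strings as missing.
            if value is None or value == "":
                has_gaps.add(field)
    defined = [v["field"] for v in variables]
    unknown = [f for f in defined if f not in seen]
    with_missing = [f for f in defined if f in has_gaps]
    return unknown, with_missing


# Hypothetical miniature data set keyed by record ID.
records = {
    0: {"name": "Patricia Schaefer", "state": "Delaware", "email": None},
    1: {"name": "Olivie Dubois", "state": "AR", "email": "x@example.com"},
}
variables = [
    {"field": "name", "type": "String"},
    {"field": "email", "type": "String"},
    {"field": "zip_code", "type": "String"},
]
unknown, with_missing = check_fields(variables, records)
print(unknown)
print(with_missing)
```

Fields that show up in the second list are candidates for the missing-value flag in the variable definition.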

In [None]:
deduper

In [None]:
customers.shape

In [None]:
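# dedupe 1.x API; in dedupe 2.x the equivalent step is deduper.prepare_training(data, sample_size=500).
deduper.sample(customers.to_dict(orient='index'), 500)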
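
The data passed to Dedupe is a dictionary keyed by record ID, where each value is a dictionary of field name to value. As a rough sketch of that shape without pandas, the code below builds it from CSV text using only the standard library. The helper `to_dedupe_records`, the `id` column, and the two sample rows are hypothetical, and mapping blank values to `None` is an assumption about how missing data should be represented.

```python
import csv
import io

# Hypothetical two-row sample loosely following the notebook's columns.
SAMPLE_CSV = """id,name,city,state,email
0,Patricia Schaefer,Christianview,Delaware,lambdavid@gmail.com
1,Olivie Dubois,Saint Anastasie-les-Bains,,
"""


def to_dedupe_records(csv_text, id_field="id"):
    """Return {record_id: {field: value}} with blank values mapped to None."""
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        record_id = int(row.pop(id_field))
        # Strip surrounding whitespace; treat empty strings as missing.
        records[record_id] = {
            field: (value.strip() or None) for field, value in row.items()
        }
    return records


records = to_dedupe_records(SAMPLE_CSV)
print(records[1]["state"])
```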

Note: You may see a warning like this:

```
/usr/local/lib/python2.7/site-packages/dedupe/sampling.py:39: UserWarning: 250 blocked samples were requested, but only able to sample 249
  % (sample_size, len(blocked_sample)))
```

It is safe to continue, because a sample was still drawn. Alternatively, rerun using the number the warning reports (249 in this example).

#### Either load an existing training file (uncomment the lines below) or continue with active labeling

In [None]:
training_file = '../data/ignore-dedupe-training.json'
#if os.path.exists(training_file):
#    with open(training_file, 'rb') as f:
#        deduper.readTraining(f)

In [None]:
# Interactive labeling in the console (dedupe 1.x name; dedupe 2.x calls it console_label).
dedupe.consoleLabel(deduper)

In [None]:
deduper.train()

In [None]:
with open(training_file, 'w') as tf:
    deduper.writeTraining(tf)
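
It can be useful to check how many labeled pairs a training file holds before reusing it. The sketch below assumes the file is JSON with top-level `match` and `distinct` lists; the helper `count_labels` and the demo file contents are hypothetical, and real training files store full record pairs rather than the short strings used here.

```python
import json
import os
import tempfile


def count_labels(path):
    """Return (n_match, n_distinct) for a training file, or (0, 0) if absent."""
    if not os.path.exists(path):
        return 0, 0
    with open(path, "r", encoding="utf-8") as f:
        labels = json.load(f)
    return len(labels.get("match", [])), len(labels.get("distinct", []))


# Demonstrate with a hand-written file rather than real labels.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-training.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"match": [["a", "b"]], "distinct": [["a", "c"], ["b", "d"]]}, f)
    counts = count_labels(demo_path)
    missing_counts = count_labels(os.path.join(tmp, "no-such-file.json"))

print(counts)
```

A very small number of matches relative to distinct pairs is a hint that more labeling may be worthwhile before training.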

In [None]:
# dedupe 1.x API; dedupe 2.x renamed this method to partition().
dupes = deduper.match(customers.to_dict(orient='index'))

In [None]:
dupes

In [None]:
dupes[2]

In [None]:
customers.iloc[[741,1107]]
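
Each entry in `dupes` is assumed here to be a pair of tuples: the record IDs in one cluster and a confidence score for each of those records. The sketch below unpacks that shape into a per-record lookup. The helper `flatten_clusters` is hypothetical; the IDs 741 and 1107 come from the cell above, while the other IDs and all scores are made up.

```python
# Hypothetical output in the assumed shape: (record_ids, scores) pairs.
dupes = [
    ((741, 1107), (0.93, 0.93)),
    ((12, 58, 300), (0.81, 0.77, 0.79)),
]


def flatten_clusters(clusters):
    """Map each record ID to its cluster number, confidence, and cluster mates."""
    flagged = {}
    for cluster_number, (record_ids, scores) in enumerate(clusters):
        for record_id, score in zip(record_ids, scores):
            flagged[record_id] = {
                "cluster": cluster_number,
                "confidence": score,
                "duplicate_ids": [r for r in record_ids if r != record_id],
            }
    return flagged


flagged = flatten_clusters(dupes)
print(flagged[741])
```

A lookup like this is one possible starting point for attaching per-record values back onto the DataFrame.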

### Exercise: Flag duplicates by adding 2 extra columns, one for confidence score and one for duplicate_ids

In [None]:
# %load ../solutions/dedupe.py


In [None]:
customers[customers.confidence.notnull()].head()