# Deduplicating data

In this notebook, we deduplicate data using the [Dedupe](https://docs.dedupe.io/) library, which uses machine learning to learn from a small amount of labelled training data.

> **See also:**
> 
> [csvdedupe](https://github.com/dedupeio/csvdedupe) offers a command line interface for Dedupe.

In addition, the same developers have created [parserator](https://github.com/datamade/parserator), a toolkit for building domain-specific probabilistic parsers that you can train to split text into labelled components.

## 1. Imports

In [1]:
import pandas as pd
import dedupe
import os

In [2]:
customers = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv', 
                        encoding='utf-8')

## 2. Check data quality

### 2.1 Return first *n* rows with [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)

In [3]:
customers.head()

| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson |
| 1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain |
| 2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael |
| 3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz |
| 4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78 |


### 2.2 Display data types with [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html)

In [4]:
customers.dtypes

name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

### 2.3 Determine missing values with [pandas.isnull](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html)

This function indicates for an array-like object whether values are missing (`NaN` in numeric arrays, `None` or `NaN` in object arrays, `NaT` in [datetimelike](https://pandas.pydata.org/docs/reference/general_functions.html#top-level-dealing-with-datetimelike)).

**See also:**

* [notna](https://pandas.pydata.org/docs/reference/api/pandas.notna.html) for the boolean inverse of [pandas.isna](https://pandas.pydata.org/docs/reference/api/pandas.isna.html)
* [Series.isna](https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html) for the missing values in a series
* [DataFrame.isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) for the missing values in a DataFrame
* [Index.isna](https://pandas.pydata.org/docs/reference/api/pandas.Index.isna.html) for the missing values in an index

In [5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0


## 3. Configure dedupe

Now we define the fields that Dedupe should compare during deduplication and create a new `deduper` object:

In [6]:
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},  
    {'field': 'street_address','type': 'String'},
    {'field': 'city','type': 'String'},
    {'field': 'state', 'type': 'String', 'has_missing': True},
    {'field': 'email', 'type': 'String', 'has_missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)

Dedupe expects a missing field value to be represented as `None`. If you set `'has_missing': True` for a field, Dedupe additionally creates an indicator variable that records whether the value was present, and the missing value itself is zeroed out in the comparison.

> **See also:**
> 
> [Missing Data](https://docs.dedupe.io/en/latest/Variable-definition.html#missing-data)
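
pandas represents missing values as `NaN` rather than `None`, so records built from a DataFrame may need converting before they are passed to Dedupe. The following is a minimal sketch using only the standard library; the helper name `nan_to_none` and the sample records are hypothetical and belong neither to Dedupe nor to this dataset.

```python
import math


def nan_to_none(records):
    """Return a copy of a {record_id: {field: value}} dict with NaN replaced by None."""
    cleaned = {}
    for record_id, record in records.items():
        cleaned[record_id] = {
            field: None if isinstance(value, float) and math.isnan(value) else value
            for field, value in record.items()
        }
    return cleaned


# Hypothetical records in the {record_id: {field: value}} shape used in this notebook
sample = {
    0: {"name": "Patricia Schaefer", "state": "Delaware"},
    1: {"name": "Olivie Dubois", "state": float("nan")},
}
cleaned_sample = nan_to_none(sample)
print(cleaned_sample[1]["state"])
```

Section 2.3 found no missing values in this dataset, so the conversion is not required here; it becomes relevant once a column does contain gaps.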

In [7]:
deduper

<dedupe.api.Dedupe at 0x7fd3d9e7dd00>

In [8]:
customers.shape

(2080, 8)

## 4. Create training data

In [9]:
deduper.prepare_training(customers.T.to_dict())

INFO:dedupe.canopy_index:Removing stop word co
INFO:dedupe.canopy_index:Removing stop word om
INFO:dedupe.canopy_index:Removing stop word com
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfTextCanopyPredicate: (0.8, email)


[prepare_training](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.prepare_training) initialises active learning with our data and, optionally, with existing training data.

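`T` is shorthand for [pandas.DataFrame.transpose](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html), which mirrors the DataFrame across its diagonal by writing rows as columns and vice versa. Calling `to_dict()` on the transposed DataFrame therefore yields a dictionary that maps each row index to a dictionary of column names and values, which is the record format Dedupe expects.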
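
What Dedupe receives is a dictionary that maps each record ID to a dictionary of field values. As a rough illustration of that shape, the sketch below builds the same kind of dictionary with the standard library instead of pandas; the two-row CSV text is a shortened sample of the data shown earlier, and the names `csv_text` and `records` are assumptions made for this example.

```python
import csv
import io

# Two shortened sample rows; the real file has eight columns.
csv_text = """name,company,email
Patricia Schaefer,Estrada-Best,lambdavid@gmail.com
Olivie Dubois,Moreno,berthelotjacqueline@mahe.fr
"""

# Key each row by its position, so every record ID maps to a field dictionary.
records = {
    row_id: dict(row)
    for row_id, row in enumerate(csv.DictReader(io.StringIO(csv_text)))
}

print(records[0])
```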

## 5. Active learning

Use [dedupe.console_label](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.console_label) to train your dedupe instance. Dedupe shows you pairs of records and asks whether both refer to the same thing. Answer with `y` (yes), `n` (no) or `u` (unsure); from the second pair onwards, `p` returns to the previous pair. Press `f` when you are finished.

In [10]:
dedupe.console_label(deduper)

name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


name : Kenneth Moore
job : Magazine journalist
company : Cross, Bell and Diaz
street_address : 75443 Lindsey Pine
city : Thompsonshire
state : Colorado
email : ashley28@rice.com
user_name : todd72

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bfll anf Diaz
street_address : 753 Lindsey Pine
city : Thompsonshe
state : Colorao
email : ashey28@rice.co
user_name : todd72

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (fingerprint, email)
name : Dr. Catherine Sutton
job : Engineer, maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : lewisnicole@yahoo.com
user_name : clittle

name : Dr. Catherine Sutton
job : Enginee maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : ewinicoe@yaoo.com
user_name : little

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfTextCanopyPredicate: (0.8, email)
INFO:dedupe.training:SimplePredicate: (alphaNumericPredicate, user_name)


[Dedupe.train](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.train) learns the final pairwise classifier and the blocking rules from the record pairs you labelled.

With `index_predicates=True`, Dedupe also considers predicates that rely on indexing the data; these can be slower and need considerably more memory.

When you are done, save the learned settings (the data model and the predicates) with [Dedupe.write_settings](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.write_settings) so that they can be reloaded later with `dedupe.StaticDedupe`, as in the following cell.

In [11]:
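settings_file = 'csv_example_learned_settings'
if os.path.exists(settings_file):
    # Reuse previously learned settings instead of training again
    print('reading from', settings_file)
    with open(settings_file, 'rb') as sf:
        deduper = dedupe.StaticDedupe(sf)
else:
    # Learn the classifier and blocking rules from the labelled pairs
    deduper.train(index_predicates=True)
    # Persist the learned settings for later runs
    with open(settings_file, 'wb') as sf:
        deduper.write_settings(sf)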

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000010, score 0.0
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(TfidfTextCanopyPredicate: (0.8, user_name), SimplePredicate: (commonTwoTokens, job), SimplePredicate: (firstTokenPredicate, company))
INFO:dedupe.training:(SimplePredicate: (twoGramFingerprint, street_address), SimplePredicate: (oneGramFingerprint, user_name), TfidfTextCanopyPredicate: (0.4, street_address))


[dedupe.Dedupe.partition](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.partition) identifies records that refer to the same entity and returns them as a list of clusters. Each cluster is a tuple whose first element is a tuple of record IDs and whose second element is a tuple of the corresponding confidence scores; a record without duplicates forms a cluster of its own. For more details on the confidence score, see [dedupe.Dedupe.cluster](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster).

In [12]:
dupes = deduper.partition(customers.T.to_dict())

In [14]:
dupes

[((136, 1360), (1.0, 1.0)),
 ((1351, 1384), (1.0, 1.0)),
 ((0,), (1.0,)),
 ((1,), (1.0,)),
 ((2,), (1.0,)),
 ((3,), (1.0,)),
 ((4,), (1.0,)),
 ...]
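
Most entries in this list contain a single record ID, meaning that no duplicate was found for that record. A minimal sketch of how the actual duplicate groups could be filtered out is shown below; it works on a hand-copied excerpt of the output above, and the names `dupes_excerpt` and `duplicate_clusters` are assumptions made for this example.

```python
# Excerpt of the partition() output shown above.
dupes_excerpt = [
    ((136, 1360), (1.0, 1.0)),
    ((1351, 1384), (1.0, 1.0)),
    ((0,), (1.0,)),
    ((1,), (1.0,)),
]

# A cluster with a single record ID has no duplicate partner.
duplicate_clusters = [
    (record_ids, scores)
    for record_ids, scores in dupes_excerpt
    if len(record_ids) > 1
]

print(len(duplicate_clusters))
```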

We can also output only individual entries:

In [18]:
dupes[1]

((1351, 1384), (1.0, 1.0))

We can then display the records of a cluster, here the first one with the record IDs 136 and 1360, with [pandas.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html):

In [17]:
customers.iloc[[136,1360]]

| | name | job | company | street_address | city | state | email | user_name |
|---|---|---|---|---|---|---|---|---|
| 136 | Frédérique Lejeune-Daniel | Technicien chimiste | Schmitt | chemin Denise Ferrand | Saint CharlotteVille | IE | jchretien@costa.com | joseph60 |
| 1360 | Frédérique Lejeune-Daniel | Tecce cse | Sctmitt | chemin Denise Ferrand | Saint ChalotteVille | IE | jchretien@costacom | joseph60 |
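
To work with the result as a whole rather than cluster by cluster, the clusters can be flattened into one entry per record. The sketch below is one possible approach, not a Dedupe function: it assigns each cluster a running number and keeps the first record of every cluster as its representative. It uses a hand-copied excerpt of the partition output, and the names `clustered`, `cluster_membership` and `keep_ids` are assumptions made for this example.

```python
# Excerpt of the partition() output shown earlier.
clustered = [
    ((136, 1360), (1.0, 1.0)),
    ((1351, 1384), (1.0, 1.0)),
    ((0,), (1.0,)),
]

# Map every record ID to its cluster number and confidence score.
cluster_membership = {}
for cluster_id, (record_ids, scores) in enumerate(clustered):
    for record_id, score in zip(record_ids, scores):
        cluster_membership[record_id] = {
            "cluster_id": cluster_id,
            "confidence": float(score),
        }

# Keep the first record of each cluster as its representative.
keep_ids = sorted(record_ids[0] for record_ids, _ in clustered)
print(keep_ids)
```

Which record of a cluster to keep is a judgment call that depends on the data; taking the first one is merely the simplest choice.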
