# Exploring Entity Resolution with Dedupe in Python    

This walk-through uses [Jupyter Notebook](http://jupyter.readthedocs.org/en/latest/install.html), [Pandas](https://readthedocs.org/projects/pandas/), and the Python library [Dedupe](https://dedupe.readthedocs.org/en/latest/) to explore some initial approaches to deduplication and entity resolution.

Please make sure you have Jupyter and Pandas installed before we move on.

```bash
pip install jupyter
pip install pandas
```

## Clone repo and get started    

To get started, we'll clone a git repository with some sample text files and deduplication scripts:    

```bash
git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples
```

This first example works with a list of early childhood education sites in Chicago, drawn from 10 different sources.

```bash
cd csv_example   
```

## Data exploration
Before we get any further, let's use a Pandas `DataFrame` to explore the dataset we'll be working with:

In [1]:
import pandas as pd
df = pd.read_csv("../dedupe-examples/csv_example/csv_example_messy_input.csv", index_col="Id")

We can use the `shape` attribute to find out how many rows and columns we're dealing with:

In [2]:
df.shape

(3337, 31)

Next we can pass the dataframe to the built-in `list` function to get its column names:

In [3]:
list(df)

['Source',
 'Site name',
 'Address',
 'Zip',
 'Phone',
 'Fax',
 'Program Name',
 'Length of Day',
 'IDHS Provider ID',
 'Agency',
 'Neighborhood',
 'Funded Enrollment',
 'Program Option',
 'Number per Site EHS',
 'Number per Site HS',
 'Director',
 'Head Start Fund',
 'Eearly Head Start Fund',
 'CC fund',
 'Progmod',
 'Website',
 'Executive Director',
 'Center Director',
 'ECE Available Programs',
 'NAEYC Valid Until',
 'NAEYC Program Id',
 'Email Address',
 'Ounce of Prevention Description',
 'Purple binder service type',
 'Column',
 'Column2']

Calling `head(10)` will give us the first 10 rows of the dataframe:

In [4]:
df.head(10)

| Id | Source | Site name | Address | Zip | Phone | Fax | Program Name | Length of Day | IDHS Provider ID | Agency | ... | Executive Director | Center Director | ECE Available Programs | NAEYC Valid Until | NAEYC Program Id | Email Address | Ounce of Prevention Description | Purple binder service type | Column | Column2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CPS_Early_Childhood_Portal_scrape.csv | Salvation Army - Temple / Salvation Army | 1 N Ogden Ave | | 2262649 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 1 | CPS_Early_Childhood_Portal_scrape.csv | Salvation Army - Temple / Salvation Army | 1 N Ogden Ave | | 2262649 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 2 | CPS_Early_Childhood_Portal_scrape.csv | National Louis University - Dr. Effie O. Elli... | 10 S Kedzie Ave | | 5339011 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 3 | CPS_Early_Childhood_Portal_scrape.csv | National Louis University - Dr. Effie O. Elli... | 10 S Kedzie Ave | | 5339011 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 4 | CPS_Early_Childhood_Portal_scrape.csv | Board Trustees-City Colleges of Chicago - Oli... | 10001 S Woodlawn Ave | | 2916100 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 5 | CPS_Early_Childhood_Portal_scrape.csv | Board Trustees-City Colleges of Chicago - Oli... | 10001 S Woodlawn Ave | | 2916100 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 6 | CPS_Early_Childhood_Portal_scrape.csv | Easter Seals Society of Metropolitan Chicago ... | 1001 W Roosevelt Rd | | 9395115 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 7 | CPS_Early_Childhood_Portal_scrape.csv | Easter Seals Society of Metropolitan Chicago ... | 1001 W Roosevelt Rd | | 9395115 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 8 | CPS_Early_Childhood_Portal_scrape.csv | Hull House Association - Uptown Head Start / ... | 1020 W Bryn Mawr Ave | | 7695753 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |
| 9 | CPS_Early_Childhood_Portal_scrape.csv | Hull House Association - Child Dev. Central O... | 1030 W Van Buren St | | 9068600 | | Child Care | EXTENDED DAY | | | ... | | | | | | | | | | |


It appears that our entities are childcare sites, and at first glance, there do appear to be some possible duplicates.  

We can search for duplicates within a single column. Let's check the "Site name" column; `duplicated` marks each value that has already appeared earlier in the column as `True`:

In [5]:
df["Site name"].duplicated()  

Id
0       False
1        True
2       False
3        True
4       False
5        True
6       False
7        True
8       False
9       False
10       True
11      False
12       True
13      False
14      False
15       True
16      False
17      False
18      False
19      False
20       True
21      False
22      False
23       True
24      False
25       True
26      False
27      False
28       True
29      False
        ...  
3307    False
3308    False
3309     True
3310    False
3311     True
3312     True
3313     True
3314     True
3315    False
3316     True
3317     True
3318    False
3319    False
3320    False
3321    False
3322    False
3323     True
3324     True
3325     True
3326     True
3327    False
3328    False
3329    False
3330    False
3331    False
3332     True
3333    False
3334     True
3335     True
3336    False
Name: Site name, dtype: bool

Looks like a lot of duplicates!  
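
A long column of booleans is hard to read, so it can help to summarize it. The sketch below uses a small made-up dataframe (the site names and addresses are placeholders, not rows from the real file) to show three variations: counting repeats, flagging every member of a repeated group with `keep=False`, and requiring several columns to match with `subset`.

```python
import pandas as pd

# Toy stand-in for the real data; all values here are made up for illustration.
toy = pd.DataFrame(
    {
        "Site name": ["Site A", "Site A", "Site B", "Site C", "Site C", "Site C"],
        "Address": ["1 Main St", "1 Main St", "2 Oak Ave", "3 Elm Rd", "3 Elm Rd", "9 Pine Ct"],
    }
)

# Default keep="first": the first occurrence is False, later repeats are True.
# Summing the boolean mask counts the repeats.
n_repeats = int(toy["Site name"].duplicated().sum())

# keep=False flags every row whose name occurs more than once,
# which is handy for eyeballing whole groups of candidate duplicates.
all_dupes = toy[toy["Site name"].duplicated(keep=False)]

# subset= requires several columns to match exactly at the same time.
n_pair_repeats = int(toy.duplicated(subset=["Site name", "Address"]).sum())

print(n_repeats, len(all_dupes), n_pair_repeats)
```

The same calls should work on `df` from the cells above. Keep in mind that all of them only catch exact matches, so names that differ by a typo or an abbreviation will not be counted.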

## Testing out `dedupe`

Let's experiment with using the `dedupe` library to try cleaning up our file.    
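
Before running the library, it may help to see why exact matching is not enough. The sketch below compares strings with `difflib.SequenceMatcher` from the standard library. This is only a stand-in to illustrate fuzzy comparison; it is not how `dedupe` scores records, and the name variants are shortened or made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative strings: two spellings of one site name and one unrelated name.
name = "Salvation Army - Temple"
variant = "Salvation Army Temple"
other = "Hull House Association"

# An exact comparison treats the two spellings as different sites.
exact_match = name == variant

# A fuzzy comparison scores the two spellings as much closer to each other
# than to the unrelated name.
close = similarity(name, variant)
far = similarity(name, other)

print(exact_match, round(close, 2), round(far, 2))
```

A score like this still needs a cutoff for deciding what counts as "the same", and choosing that cutoff by hand for every field is tedious. That is the part `dedupe` tries to learn from labeled examples.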

To get `dedupe` running, we'll need to install [Unidecode](https://pypi.python.org/pypi/Unidecode), [Future](https://pypi.python.org/pypi/future), and [Dedupe](https://dedupe.readthedocs.org/en/latest/).

In your terminal:
```bash
pip install unidecode
pip install future
pip install dedupe
```    

Then we'll run the `csv_example.py` file to see what `dedupe` can do:

```bash
python csv_example.py
```

You can see that the example script runs `dedupe` as a command-line application. It prompts you to engage in active learning by showing pairs of records and asking whether they refer to the same entity:

    Do these records refer to the same thing?
    (y)es / (n)o / (u)nsure / (f)inished

Let's start training! Use the `y`, `n`, and `u` keys to flag duplicates for active learning. When you are finished, enter `f` to quit.
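
To make the labeling session more concrete, here is a minimal sketch of how answers to the prompt could be gathered into training data. The function `collect_labels` is hypothetical and the answers are scripted rather than typed. It assumes labeled pairs are grouped under `match` and `distinct` keys and that unsure pairs are simply skipped; check the `dedupe` source or documentation to confirm how the library itself handles each answer.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def collect_labels(pairs: Sequence[Pair], answers: Sequence[str]) -> Dict[str, List[Pair]]:
    """Hypothetical helper: sort candidate pairs by a scripted y/n/u/f answer."""
    labeled: Dict[str, List[Pair]] = {"match": [], "distinct": []}
    for pair, answer in zip(pairs, answers):
        if answer == "f":    # finished: stop labeling early
            break
        if answer == "y":    # yes: the two records are the same entity
            labeled["match"].append(pair)
        elif answer == "n":  # no: the two records are different entities
            labeled["distinct"].append(pair)
        # "u" (unsure) adds nothing in this sketch; that is an assumption
    return labeled


# Records abbreviated from the head() output above.
r0 = {"Site name": "Salvation Army - Temple / Salvation Army", "Address": "1 N Ogden Ave"}
r1 = {"Site name": "Salvation Army - Temple / Salvation Army", "Address": "1 N Ogden Ave"}
r8 = {"Site name": "Hull House Association - Uptown Head Start / ...", "Address": "1020 W Bryn Mawr Ave"}
r9 = {"Site name": "Hull House Association - Child Dev. Central O...", "Address": "1030 W Van Buren St"}

candidate_pairs = [(r0, r1), (r8, r9), (r0, r8), (r1, r9)]
scripted_answers = ["y", "u", "n", "f"]  # stands in for keyboard input

labeled = collect_labels(candidate_pairs, scripted_answers)
print(len(labeled["match"]), len(labeled["distinct"]))
```

In the real session the library chooses which pairs to show next based on the labels collected so far, which is what makes the learning "active". This sketch only covers the bookkeeping side.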

## Questions for the DDRL Entity Resolution Lab

1. What is dedupe doing with our [csv_example.py file](http://datamade.github.com/dedupe-examples/docs/csv_example.html)?
2. In general, how does [dedupe](http://datamade.github.io/dedupe-examples/docs/csv_example.html) work?
3. How does the active learning method work?
4. How does `dedupe` decide [which data fields](https://github.com/DistrictDataLabs/dedupe-examples/blob/master/csv_example/csv_example.py#L91) to include in the training? How would this work differently/better/worse with the dataset you have been working with for the last two weeks?
5. How does `dedupe` treat yeses, nos, and unsures?
6. What do you like about `dedupe`?
7. What would make `dedupe` better? (And what do we mean by "better"?)