# 2. Deduplication part 1

One important and potentially expensive task (in terms of time and energy consumption) when handling data is *data deduplication*.  The goal of this task is to make sure that each item in your data set (e.g., each person) appears only once.  In this lab (in this part and in part 3) we'll explore three ways to do this and determine which one is most efficient.

First, let's examine some new data!  The data we'll be working with comes from a ProPublica story about a risk assessment tool called COMPAS.  In order to understand the data, start by reading the article.  The data can be found in this directory at "compas-scores.csv".

** 1) Read the ProPublica article <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing> and the document describing how they did their analysis <https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm> **

** 2) What does each row in the dataset represent?  Would it make sense for there to be duplicate items? **

** 3) Get the data using the function we used in past labs. **

You'll want to use this function instead of using a csv reader so that when you need to go over the data multiple times (below) you don't have to keep reopening the file.

```python
def get_data(filename):
    """
    Takes a filename and returns your data.
    For example, with a file that looks like this:
    header1,header2,header3
    1,2,3
    4,5,6

    You could get the first row, second header item like this:
    data = get_data("temp.csv")
    print(data[0]["header2"])

    For the interested, the returned data is a list of dictionaries.  We'll see this more in future weeks.
    """
    data = []

    # "with" closes the file for us, even if an error occurs while reading.
    with open(filename, "r") as filepointer:
        # get_header, inline instead of calling the function from past labs so that the file
        # continues reading from the line right after the header in the for loop below.
        line = filepointer.readline()
        header = line.strip().split(",")

        for line in filepointer:
            fields = line.strip().split(",")

            # Unfortunately, split will split at some commas that we don't mean to split on (e.g., if they've
            # been written into addresses) so we check below to make sure we have the expected number of fields
            # and throw out any other data.  We shouldn't really be throwing out data, we should be fixing the
            # actual problem, but for the purposes of this lab, this will do.
            if len(fields) == len(header):
                row = {}
                for fieldNumber in range(len(fields)):
                    row[header[fieldNumber]] = fields[fieldNumber]
                data.append(row)

    return data
```

In order to determine whether two rows contain the same item, we need a function that extracts the information that *should* be unique to each item and compares it to determine equality.  We need to choose this information carefully, or deduplication will fail.  For example, the compas-scores.csv file has an "id" column, which seems at first glance like it should be unique in the file (and it is), but the ids were assigned for the purposes of this file, so duplicate entries for the same person will have different ids.  We need to instead choose fields (potentially multiple fields) that are unique per person and check equality based on those.

** 4) Write a function that takes a row and returns a string version of the row containing only the fields that should be used to check equality. **

** 5) Write a function that takes two rows as input and returns True if these rows are the same and False otherwise. **
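
As a rough illustration of the idea behind questions 4 and 5, here is a minimal sketch on made-up rows.  The field names in `KEY_FIELDS`, the function names `row_key` and `same_item`, and the choice to ignore case and surrounding whitespace are all assumptions made for this example; deciding which columns of compas-scores.csv to use is still up to you.

```python
# Hypothetical field names, chosen only for illustration; pick the columns
# that actually identify a person in your own data set.
KEY_FIELDS = ["first", "last", "dob"]


def row_key(row, fields=KEY_FIELDS):
    """Return a string built from only the identifying fields of a row."""
    # Assumption: differences in case or surrounding whitespace should not
    # make two rows count as different items.
    return "|".join(row[field].strip().lower() for field in fields)


def same_item(row_a, row_b):
    """Return True if both rows describe the same item, False otherwise."""
    return row_key(row_a) == row_key(row_b)


# Made-up rows: a and b have different ids but describe the same person.
a = {"id": "1", "first": "Ada", "last": "Lovelace", "dob": "1815-12-10"}
b = {"id": "7", "first": "ada ", "last": "Lovelace", "dob": "1815-12-10"}
c = {"id": "8", "first": "Alan", "last": "Turing", "dob": "1912-06-23"}

print(same_item(a, b))  # True
print(same_item(a, c))  # False
```

Note that the "id" field is deliberately left out of the key, which is why rows `a` and `b` compare as equal.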

One way to find the number of duplicates in a data set is to compare each item to each other item in the data set, counting duplicates as you go.  Make sure *not* to count the comparison of an item to itself as a duplicate.

** 6) Create a function that uses this "all pairs" method of deduplication to return the number of duplicate items in a given data set. **
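
The "all pairs" idea can be sketched on a toy list as follows.  This is one possible reading, not the required solution: it assumes that "number of duplicates" means the number of items that repeat an earlier item, and the function name and the `same` parameter are made up for this example.

```python
def count_duplicates_all_pairs(items, same):
    """Count items that match an earlier item by comparing pairs of items."""
    duplicates = 0
    for i in range(len(items)):
        # j only ranges over positions before i, so an item is never
        # compared to itself and no pair is examined twice.
        for j in range(i):
            if same(items[i], items[j]):
                duplicates += 1
                break  # count each repeated item only once
    return duplicates


names = ["ann", "bob", "ann", "cy", "ann", "bob"]
print(count_duplicates_all_pairs(names, lambda x, y: x == y))  # 3
```

When there are few duplicates, almost every item is compared to every earlier item, so the number of comparisons grows roughly with the square of the number of rows.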

Another way to find the number of duplicates is to create a key for each item in the data set and insert these keys into a dictionary, where the value stored for each key is the number of times that item has been seen.  The choice of key will determine whether two items are considered to be duplicates, so choose it carefully.  Note that this function and the previous duplicate counting function should return the same number of duplicates.

** 7) Create a function that uses this dictionary-based method of deduplication to return the number of duplicate items in a given data set. **
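
A minimal sketch of the dictionary-based idea on the same kind of toy list is shown below.  As before, the function name and the `key` parameter are assumptions for this example, and a duplicate is assumed to be any occurrence of a key beyond the first.

```python
def count_duplicates_dict(items, key):
    """Count duplicates by tallying how often each key is seen."""
    counts = {}
    for item in items:
        k = key(item)
        # get() returns 0 the first time a key is seen.
        counts[k] = counts.get(k, 0) + 1
    # Every occurrence of a key beyond the first is a duplicate.
    return sum(n - 1 for n in counts.values())


names = ["ann", "bob", "ann", "cy", "ann", "bob"]
print(count_duplicates_dict(names, lambda x: x))  # 3
```

Each item is looked at once, so the work grows roughly in proportion to the number of rows rather than with its square.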

** 8) Using the module you developed from the previous part of the lab, determine: **

a) How many seconds does each duplicate counting method take when run on the ProPublica data set?
    
b) How many items (e.g., cups of coffee) could be created by this amount of energy for each duplicate counting method?
    
c) Create two graphs using regplot showing the number of rows deduplicated on the x-axis and the energy consumption in number of items on the y-axis for each of these methods.  Note that this means you'll need to create multiple versions of the dataset with increasing numbers of rows to be checked.  (For *Extra Credit* graph these on the same plot using lmplot or another seaborn method.)
    
d) Explain which method is more efficient.
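
One way to collect the measurements for part c) is sketched below.  It does not use the module from the previous part of the lab; it only measures seconds with Python 3's `time.perf_counter`, and converting seconds into energy or items is left to that module.  The helper names, the synthetic keys produced by `make_keys`, and the chosen sizes are all assumptions for this example.  With the real data you could instead time your functions on slices such as `data[:n]`.

```python
import time


def time_call(func, *args):
    """Return the number of seconds one call to func(*args) takes."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def count_all_pairs(keys):
    """All-pairs count of keys that repeat an earlier key."""
    duplicates = 0
    for i in range(len(keys)):
        for j in range(i):
            if keys[i] == keys[j]:
                duplicates += 1
                break
    return duplicates


def count_with_dict(keys):
    """Dictionary-based count of keys that repeat an earlier key."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return sum(n - 1 for n in counts.values())


def make_keys(n):
    """Build n synthetic keys in which roughly half are repeats."""
    return [str(i % (n // 2 + 1)) for i in range(n)]


sizes = [200, 400, 800]
timings = []  # one (rows, all-pairs seconds, dictionary seconds) per size
for n in sizes:
    keys = make_keys(n)
    timings.append((n, time_call(count_all_pairs, keys),
                    time_call(count_with_dict, keys)))

for n, pairs_seconds, dict_seconds in timings:
    print(n, pairs_seconds, dict_seconds)
```

The resulting list of (rows, seconds) measurements can then be converted to items and passed to a plotting function such as regplot.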