# Spreadsheet Cleaner

This script deletes duplicate rows from a spreadsheet file. First, specify the input and output files:

In [1]:
input_file = './test.csv'
output_file = './test_clean.csv'

These files should contain comma-separated values (CSV format), but you can also specify a different delimiter or quote character (the defaults are ',' and '"'):

In [2]:
delimiter=','
quotechar='"'
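As a rough illustration of what these two settings do, the sketch below parses a small in-memory sample that uses a semicolon as delimiter and a single quote as quote character. The sample text and the names `sample` and `rows` are made up for this example, and Python 3 is assumed.

```python
import csv
import io

# Hypothetical sample: ';' separates fields, and "'" quotes a field
# that itself contains the delimiter.
sample = "a;'b;c';d\n1;2;3\n"

# csv.reader accepts any iterable of lines, so an in-memory buffer works too.
rows = list(csv.reader(io.StringIO(sample), delimiter=';', quotechar="'"))
print(rows)  # [['a', 'b;c', 'd'], ['1', '2', '3']]
```

The quoted field `'b;c'` stays a single cell because the quote character protects the delimiter inside it.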

Now specify which columns should be used to compare rows (column indices start at 0). A row is deleted if a later row in the file has identical entries in all of the specified columns.

In [63]:
compare_columns=[4,5]
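To make the deletion rule concrete, here is a minimal sketch that applies it to a tiny hand-written list of rows. The helper `is_deleted` and the sample data are hypothetical and only illustrate the rule; they are not part of the script itself.

```python
def is_deleted(rows, idx, compare_columns):
    """Return True if some later row matches rows[idx] in all compare_columns."""
    key = [rows[idx][c] for c in compare_columns]
    return any(
        [later[c] for c in compare_columns] == key
        for later in rows[idx + 1:]
    )

# Made-up example rows; columns 1 and 2 are compared.
rows = [
    ['x', 'a', 'b'],
    ['y', 'a', 'c'],
    ['z', 'a', 'b'],
]
flags = [is_deleted(rows, i, [1, 2]) for i in range(len(rows))]
print(flags)  # [True, False, False]
```

Row 0 is flagged because row 2 has the same values in columns 1 and 2. The last occurrence of each combination is always kept.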

Now the "magic" happens ;-)

*(Technical note: This script is actually very crude, since for every row it iterates over all other rows. That is, for n rows we get O(n^2) complexity. Sorting all rows first and then deleting every row that has an identical successor would improve this to O(n log n). But I was too lazy to implement it that way :-) )*
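For comparison, one possible way to avoid the quadratic scan is sketched below. Note that it uses a different technique from the sorting idea in the note: it walks the rows backwards and remembers the already seen column combinations in a set, which takes roughly linear time. The function name `drop_earlier_duplicates` and the sample rows are assumptions made for this illustration.

```python
def drop_earlier_duplicates(rows, compare_columns):
    """Keep only the last row for each combination of values in compare_columns."""
    seen = set()
    kept = []
    # Walk backwards so that the last occurrence of each key is met first.
    for row in reversed(rows):
        key = tuple(row[c] for c in compare_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original file order.
    kept.reverse()
    return kept

# Made-up example rows; columns 1 and 2 are compared.
rows = [
    ['x', 'a', 'b'],
    ['y', 'a', 'c'],
    ['z', 'a', 'b'],
]
cleaned = drop_earlier_duplicates(rows, [1, 2])
print(cleaned)  # [['y', 'a', 'c'], ['z', 'a', 'b']]
```

This keeps the same rows as the nested-loop version for rows that all contain the compared columns, but it does not produce the HTML overview.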

In [64]:
## read/write csv files
import csv
from IPython.display import HTML
#import display

## output results as beautiful HTML-table
html = ['<table width=100%>']

## open input file (readonly)
with open(input_file, 'r') as csvfile_in:
    spamreader = list(csv.reader(csvfile_in, delimiter=delimiter, quotechar=quotechar))

    ## write header for output table
    column_count = 0
    for row in spamreader:
        column_count = max(len(row),column_count)
    html.append('<tr><th></th>')
    for col_idx in range(column_count):
        if col_idx in compare_columns:
            html.append('<th style="background-color:#ddf">{0}</th>'.format(col_idx))
        else:
            html.append('<th>{0}</th>'.format(col_idx))
    html.append('</tr>')
    
    ## open output file
    with open(output_file, 'w') as csvfile_out:
        spamwriter = csv.writer(csvfile_out, delimiter=delimiter, quotechar=quotechar)
        
        ## iterate through rows
        for idx1,row1 in enumerate(spamreader):
            
            ## check for duplicates by (again) iterating over all rows
            found_duplicate = False
            for idx2,row2 in enumerate(spamreader):
                ## skip rows that were already checked including the one currently being checked
                if idx2 <= idx1:
                    continue
                ## check if any of the specified columns is NOT equal
                all_equal = True
                for col_idx in compare_columns:
                    if row1[col_idx] != row2[col_idx]:
                        all_equal = False
                        break
                ## if all specified columns ARE equal, the two rows are considered duplicates of each other
                if all_equal:
                    found_duplicate = True
                    break

            ## print rows (mark as 'KEEP' or 'DELETE' respectively) and write to output file
            if found_duplicate:
                background_col = '#fdd'
                html.append('<tr style="background-color:#fdd"><td><b>{0}</b></td>'.format(idx1))
            else:
                html.append('<tr><td><b>{0}</b></td>'.format(idx1))
                spamwriter.writerow(row1)
            for col_idx,cell in enumerate(row1):
                if not found_duplicate:
                    if col_idx in compare_columns:
                        background_col = '#ddf'
                    else:
                        background_col = '#fff'
                html.append('<td style="background-color:{0}">{1}</td>'.format(background_col,cell))
            html.append('</tr>')

html.append('</table>')
HTML(''.join(html))

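|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| **0** | lsadlkj | lj | lkj | lkj | jkjk | |
| **1** | lkj | kj | kj | kj | kj | asdf |
| **2** | | lkj | jkk | jk | jkjk | |
| **3** | lkj | kj | kj | kj | kj | asdf |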
