## Data Cleaning: 
- An iterative process

## Measures of Data Quality:
- Validity: conforms to a schema
- Accuracy: conforms to gold standard
- Completeness: all required records and fields are present
- Consistency: matches other data
- Uniformity: same units

## Blueprint for Cleaning:
1. Audit your data
2. Create a data cleaning plan
3. Execute the plan
4. Manually correct

## Auditing Validity:
- Foreign-key constraints
- Cross-field constraints
- Data Type
- Regular Expressions
- Set memberships  
    - etc...
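
A minimal sketch of how several of the checks above might look on a single record. Only `URI` and `productionStartYear` come from these notes; the `productionEndYear` and `bodyStyle` fields, the allowed body styles, and the URI pattern are assumptions made up for illustration. A foreign-key check is not shown; it would be a similar membership test against the keys of another table.

```python
import re

# Hypothetical rules for illustration; adapt them to your own schema.
VALID_BODY_STYLES = {"sedan", "coupe", "hatchback", "convertible"}
URI_PATTERN = re.compile(r"^https?://dbpedia\.org/resource/\S+$")


def audit_row(row):
    """Return a list of validity problems found in one record (a dict)."""
    problems = []

    # Regular expression: the URI must match the expected pattern.
    if not URI_PATTERN.match(row.get("URI") or ""):
        problems.append("URI does not match pattern")

    # Data type: the year fields must parse as integers.
    years = {}
    for field in ("productionStartYear", "productionEndYear"):
        try:
            years[field] = int(row.get(field))
        except (ValueError, TypeError):
            problems.append(field + " is not an integer")

    # Cross-field constraint: production cannot end before it starts.
    if len(years) == 2 and years["productionEndYear"] < years["productionStartYear"]:
        problems.append("productionEndYear precedes productionStartYear")

    # Set membership: bodyStyle must be one of the allowed values.
    if row.get("bodyStyle") not in VALID_BODY_STYLES:
        problems.append("bodyStyle not in allowed set")

    return problems
```

An empty list means the record passed every check; anything else can be logged and fed into the cleaning plan.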

### Correcting Validity

In [1]:
import csv
import pprint

INPUT_FILE = 'autos.csv'
OUTPUT_GOOD = 'autos-valid.csv'
OUTPUT_BAD = 'FIXME-autos.csv'

In [2]:
def process_file(input_file, output_good, output_bad):
    # store data into lists for output
    data_good = []
    data_bad = []
    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        for row in reader:
            # skip rows whose URI does not point at dbpedia.org
            if "dbpedia.org" not in row['URI']:
                continue

            ps_year = row['productionStartYear'][:4]
            try: # use try/except to filter valid items
                ps_year = int(ps_year)
                row['productionStartYear'] = ps_year
                if (ps_year >= 1886) and (ps_year <= 2014):
                    data_good.append(row)
                else:
                    data_bad.append(row)
            except ValueError: # non-numeric strings (e.g. 'NULL') are not valid years
                data_bad.append(row)

    # Write processed data to output files
    # (newline="" assumes Python 3; it stops the csv module from emitting blank rows on Windows)
    with open(output_good, "w", newline="") as good:
        writer = csv.DictWriter(good, delimiter=",", fieldnames=header)
        writer.writeheader()
        writer.writerows(data_good)

    with open(output_bad, "w", newline="") as bad:
        writer = csv.DictWriter(bad, delimiter=",", fieldnames=header)
        writer.writeheader()
        writer.writerows(data_bad)

In [3]:
process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)

## Audit Accuracy 

## Audit Completeness
- Need reference data
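
One way this could be sketched, assuming the reference data is simply a collection of the identifiers that should be present. The function name `audit_completeness` and the use of `URI` as the key are illustrative choices, not part of the original notes.

```python
def audit_completeness(records, reference_ids, key="URI"):
    """Compare the keys present in `records` against a reference collection.

    Returns (missing, coverage): the reference ids that never appear,
    and the fraction of the reference that is covered.
    """
    reference = set(reference_ids)
    # Ignore records whose key is absent or empty.
    seen = {r[key] for r in records if r.get(key)}
    missing = sorted(reference - seen)
    if not reference:
        return missing, 1.0
    coverage = len(reference & seen) / len(reference)
    return missing, coverage
```

Without reference data there is nothing to compare against, which is why the audit cannot be done from the dataset alone.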

## Audit Consistency
- Which data source do I trust the most?
    - Which collection method is more reliable?
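
A rough sketch of the idea, assuming two sources keyed by the same identifier. It first finds the records where the sources disagree on a field, then resolves each conflict in favor of whichever source is trusted more. The function names and the two-source setup are hypothetical.

```python
def find_conflicts(source_a, source_b, field):
    """Return {key: (value_in_a, value_in_b)} where the two sources disagree.

    Each source is a dict mapping a record key to a record dict.
    Only keys present in both sources are compared.
    """
    conflicts = {}
    for key in source_a.keys() & source_b.keys():
        a_val = source_a[key].get(field)
        b_val = source_b[key].get(field)
        if a_val != b_val:
            conflicts[key] = (a_val, b_val)
    return conflicts


def resolve(conflicts, trusted="a"):
    """Pick the value from the more trusted source for every conflict."""
    index = 0 if trusted == "a" else 1
    return {key: pair[index] for key, pair in conflicts.items()}
```

Deciding which source is `trusted` is the judgment call the questions above refer to; the code only applies that decision.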

## Audit Uniformity
- Same units of measurement across all records
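
A small sketch of enforcing uniformity by converting every length to one unit. The choice of meters as the target and the unit labels in the table are assumptions for illustration; unknown units raise an error so they can be reviewed rather than silently passed through.

```python
# Conversion factors to meters (hypothetical unit labels).
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001, "in": 0.0254, "ft": 0.3048}


def to_meters(value, unit):
    """Convert a numeric value (or numeric string) in `unit` to meters."""
    try:
        factor = TO_METERS[unit]
    except KeyError:
        raise ValueError("unknown unit: %r" % (unit,))
    return float(value) * factor
```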