## Process

My first step was to examine the data links. Had this been a real project, I would have asked for clarification on which data specifically we wanted to use (statement of vote vs. supplemental, 1-year vs. 5-year census data, etc.), as well as the eventual use case for the output.

I decided to use my standard [data science setup](https://github.com/jupyter/docker-stacks/tree/master/datascience-notebook): a Jupyter notebook and the Python library pandas.

Because of the nature of both the census data and the webapp that serves it, I downloaded the census data manually. For the election results, I'm fetching the file each time we run, so if the election results were to be updated, our output would be correspondingly updated.

As for the ability to manually override results, I've created the `override.json` file, which can be used to change the results before output, as well as add fields, like 'winner'. As an example, I've changed the results for Clinton in Alameda County to 22222222 and declared her the winner.
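As a rough illustration of how such overrides could be applied with pandas, here is a minimal sketch. The JSON layout (county name mapped to a dictionary of fields), the `county` column name, the `apply_overrides` helper, and the vote counts in the sample frame are all assumptions made for this example, not necessarily what the notebook or `override.json` actually uses.

```python
import json

import pandas as pd


def apply_overrides(df, overrides, key="county"):
    """Return a copy of df with per-county overrides applied.

    `overrides` maps a county name to {column: value}. Columns that do
    not exist yet (e.g. 'winner') are created and left empty elsewhere.
    """
    out = df.copy()
    for county, fields in overrides.items():
        mask = out[key] == county
        if not mask.any():
            # Fail loudly on typos instead of silently ignoring them.
            raise KeyError(f"override refers to unknown county: {county!r}")
        for column, value in fields.items():
            if column not in out.columns:
                out[column] = None
            out.loc[mask, column] = value
    return out


# Hypothetical file contents; in practice this would come from
# json.load(open("override.json")).
overrides = json.loads(
    '{"Alameda": {"Clinton": 22222222, "winner": "Clinton"}}'
)

# Placeholder numbers, not real results.
results = pd.DataFrame(
    {"county": ["Alameda", "Alpine"], "Clinton": [100, 50], "Trump": [40, 30]}
)
patched = apply_overrides(results, overrides)
```

Raising on an unknown county is one possible design choice; it trades convenience for catching misspelled keys early.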

The biggest issue I anticipated was normalizing the county names between the two datasets. This turned out to be as simple as removing `' County, California'` from the census table's `NAME` column.
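The normalization step might look something like the sketch below. The `NAME` column and the suffix come from the census table described above; the `county` column name and the `normalize_county` helper are assumptions for illustration.

```python
import pandas as pd

SUFFIX = " County, California"


def normalize_county(names: pd.Series) -> pd.Series:
    """Strip the fixed census suffix so names match the election data."""
    # regex=False treats the suffix as a literal string, not a pattern.
    return names.str.replace(SUFFIX, "", regex=False).str.strip()


census = pd.DataFrame(
    {"NAME": ["Alameda County, California", "Los Angeles County, California"]}
)
census["county"] = normalize_county(census["NAME"])
```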

Another issue was getting the voting results into a more usable format than the less-than-ideal Excel file the state provides, which mainly meant keeping only the rows we care about.

Finally, I merged the two datasets by county name, applied the manual overrides, and exported the output as JSON and CSV files. I did some spot checks to verify that everything was working correctly and matched the original data as expected.
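A merge and export along these lines can be sketched as follows. The frames, the `county` key, and the numbers are placeholders for illustration; the `indicator` and `validate` arguments are an optional way to automate part of the spot checking and are not necessarily what the notebook does.

```python
import pandas as pd

# Placeholder data, not real results.
votes = pd.DataFrame({"county": ["Alameda", "Alpine"], "Clinton": [100, 50]})
census = pd.DataFrame(
    {"county": ["Alameda", "Alpine"], "S1903_C02_024E": [1000, 2000]}
)

# An outer merge with indicator=True records which side each row came from,
# and validate raises if a county appears more than once on either side.
merged = votes.merge(
    census, on="county", how="outer", indicator=True, validate="one_to_one"
)

# Any row not present in both datasets signals a county name mismatch.
unmatched = merged[merged["_merge"] != "both"]

output = merged.drop(columns="_merge")

# With no path argument, both methods return the serialized text;
# pass a filename instead to write files.
json_text = output.to_json(orient="records")
csv_text = output.to_csv(index=False)
```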

The notebook can be run interactively, run from the command line, or exported as a normal Python script and made part of a larger codebase.

### Improvements I Might Make

- There are some stray characters, like carriage returns, in the election results, and asterisks and dashes in place of data in the census data. These should be filtered out and correctly detected as NA.
- For simplicity, I'm not including the per-county, per-candidate percentages. They could be included, recomputed here, or computed on the frontend.
- Depending on the use case for the output, it might be better to include descriptive census column names directly in the output, rather than codes like `'S1903_C02_024E'`.
- My naive county name normalizing method works for these data, but it's brittle and would fail in other circumstances.
- My naming conventions in the notebook code leave a lot to be desired.
- There's almost certainly a way to automate the census data download, similar to the way I'm fetching the voting results. On the other hand, that relies on those servers working, so local caching might be worth it as well.
- For the manual overrides: updating JSON isn't the most user-friendly and is somewhat error-prone, so something closer to a UI might make sense.
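As one possible approach to the first improvement above, the sketch below treats placeholder symbols as NA while reading and collapses carriage returns in text fields. The exact placeholder strings (`*` and `-`), the sample rows, and the `clean_text` helper are assumptions; the real files may use other markers.

```python
import io

import pandas as pd

# Hypothetical census excerpt where '*' and '-' stand in for missing values.
raw_census = (
    "NAME,S1903_C02_024E\n"
    '"Alameda County, California",100\n'
    '"Alpine County, California",*\n'
    '"Amador County, California",-\n'
)

# na_values tells pandas to parse these markers as NaN, so the column
# stays numeric instead of becoming a column of strings.
census = pd.read_csv(io.StringIO(raw_census), na_values=["*", "-"])


def clean_text(values: pd.Series) -> pd.Series:
    """Collapse carriage returns and newlines to a space, then trim."""
    return values.str.replace(r"[\r\n]+", " ", regex=True).str.strip()


# Hypothetical election rows with embedded line breaks.
votes = pd.DataFrame({"county": ["Alameda\r\n", "Contra\r\nCosta"]})
votes["county"] = clean_text(votes["county"])
```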
