sgarcez/etl-exercise

# ETL Exercise

Prompt: https://github.com/PlutoFlume/cr_eng_challenge_python

Considerations:

- Loading the file is IO bound, and the input is a single JSON document, which makes it difficult to chunk. With line-oriented JSON that would have been an option. It would also be possible to use a C-backed JSON parser like ijson, but I decided to keep it pure Python.
- The mapping step between recipients and words is CPU bound and could have benefited from a multiprocessing-based map-reduce approach, but I didn't get that far.
- The flush-to-database phase is IO bound, so I've parallelised it with threads; that is possible because the tables don't have relationships.
- The implementation is idempotent apart from word counts, which are incremented. Email ids are hashes of the payloads, and database writes ignore conflicts in the case of the `Emails` and `Recipients` tables, or resolve them in the case of the `Words` table.
- This is a quick script and is missing error handling, logging, packaging, etc.
- There are also no tests, so it's very possible that there are bugs.
- It uses Postgres as the database.
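The line-oriented alternative mentioned above could be sketched like this (assuming a hypothetical `.jsonl` input with one record per line, which is not the actual format of `uploads.json`):

```python
import json


def iter_records(path):
    # Stream one JSON document per line; memory use stays flat
    # regardless of file size, unlike decoding one big document up front.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```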
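The multiprocessing-based map-reduce idea could look roughly like this (a sketch, not the script's actual code; the whitespace tokenisation is deliberately naive):

```python
from collections import Counter
from multiprocessing import get_context


def count_words(text):
    # Map step: tokenise one email body into per-word counts.
    return Counter(text.lower().split())


def word_counts(bodies, processes=4):
    # Fan the CPU-bound map step out over worker processes, then
    # reduce by merging the partial Counters in the parent.
    # The fork context keeps this callable without a __main__ guard
    # on Unix; use a guard with the default context elsewhere.
    with get_context("fork").Pool(processes) as pool:
        partials = pool.map(count_words, bodies)
    total = Counter()
    for partial in partials:
        total += partial
    return total
```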
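The threaded flush can be sketched with `concurrent.futures` (the flush callables here are placeholders for the real per-table writes):

```python
from concurrent.futures import ThreadPoolExecutor


def flush_all(flush_fns):
    # The tables have no relationships, so each table's flush can run
    # in its own thread; threads suffice because the work is IO bound.
    with ThreadPoolExecutor(max_workers=len(flush_fns)) as pool:
        futures = [pool.submit(fn) for fn in flush_fns]
        for future in futures:
            future.result()  # surface any exception raised in a worker
```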
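The idempotency scheme can be illustrated as follows (table and column names are hypothetical, but the conflict handling mirrors the description above):

```python
import hashlib
import json


def email_id(payload):
    # A stable hash of the payload: re-running the script derives the
    # same id for the same email, so duplicate inserts conflict.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Duplicate emails are simply skipped...
INSERT_EMAIL = """
    INSERT INTO emails (id, payload) VALUES (%s, %s)
    ON CONFLICT (id) DO NOTHING
"""

# ...while word counts are additive on conflict, which is why they are
# the one part of the pipeline that is not idempotent across re-runs.
UPSERT_WORD = """
    INSERT INTO words (word, count) VALUES (%s, %s)
    ON CONFLICT (word) DO UPDATE SET count = words.count + EXCLUDED.count
"""
```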
```sh
# grab data
curl https://raw.githubusercontent.com/PlutoFlume/cr_eng_challenge_python/dev/uploads.json > uploads.json

# install dependencies
pipenv install
# or
pip install click psycopg2

# usage
python main.py --help
Usage: main.py [OPTIONS]

Options:
  -f, --input-file TEXT
  -h, --db-host TEXT
  -u, --db-user TEXT
  -p, --db-password TEXT
  -d, --db-name TEXT
  --help                  Show this message and exit.

# run
python main.py
Finished decoding json: 1.789s
Finished parsing input: 9.062s
Total time: 31.659s
```
