Skip to content
This repository has been archived by the owner on Apr 7, 2022. It is now read-only.

ukgovdatascience/cookiecutter-data-science

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cookiecutter-data-science

cookiecutter-data-science contains the boilerplate you need to start a Data Science project.

Project Goals

  • Provide project templates to quickstart Data Science research & development projects
  • Stop the use of filenames such as YYYYMMDD_data_exploration_v1_new_old_new_v34_final_new.sql

Install guide

Get cookiecutter (first time only)

$ pip install cookiecutter

Now run it against this repo

$ cookiecutter https://github.com/jvawdrey/cookiecutter-data-science.git

You will be prompted to answer a set of questions which will be used to fill out the boilerplate. Note - Hitting enter will keep the defaults displayed.

Cloning into 'cookiecutter-data-science'...
remote: Counting objects: 24, done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 24 (delta 9), reused 23 (delta 8), pack-reused 0
Unpacking objects: 100% (24/24), done.
Checking connectivity... done.
full_name [Matthew Upson]:
email [gdsdatascience@digital.cabinet-office.gov.uk]:
client [GDS]:
project_name [My Data Science Project]:
project_slug [my-data-science-project]:
project_description [Data Science cookiecutter contains all the boilerplate you need to create a Data Science project.]:
created_date [2015-11-17]:

Enter project

$ cd <project name>

Create git repo for version control

# initialize project
$ git init

# add all files
$ git add -A

# first commit
$ git commit -m "Initial commit"

Repo layout

  • /data: Utility project data such as mapping tables
  • /docs: Project documents including excel files
  • /img: Visualizations generated for project
  • /python: Python code for project
    • /python/notebooks: Python notebook code
  • /prod: "Operational ready code"
  • /prod/tests: "Tests for prod code"
  • /r: R code for project
  • /sql: SQL scripts
  • /tmp: Temp files (Note - Files in this directory are ignored from Git repo)

File descriptions

  • /data:
    • filename: description
  • /docs:
    • filename: description
  • /img:
    • Note - Relevant images are documented in related powerpoint presentation in the '/docs' directory
  • /python:
    • filename: description
  • /r:
    • data-exploration.R: This code contains data exploration queries used to understand available data for cleaning and feature generation.
  • /sql:
    • data-exploration.sql: This file contains data exploration queries used to audit and understand available data in addition to determining correct data preparation logic
    • data-preparation.sql: This script contains table generation queries used to prepare available data for feature generation.
    • feature-generation.sql: This script creates features (independent variables) for testing as inputs into the models.
    • feature-selection.sql: This script contains a set of queries which calculate summary statistics, correlations etc. to be used for feature selection.
    • model-development.sql: This script is used to build (train) the model
    • model-scoring.sql: This script is used to apply model to new data (scoring)
    • model-update.sql: This script is used to update (retrain) model

Contact

About

boilerplate for ds project startup

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 63.5%
  • Python 36.5%