Template for research projects that utilize Jupyter notebooks for exploration, experimentation, analysis, and reporting

velexi-research/VLXI-Cookiecutter-Research

Velexi Research Project Cookiecutter

The Velexi Research Project Cookiecutter streamlines the setup of Jupyter-based research projects that involve computational work but are not necessarily centered on data science or machine learning models. The structure of the project template is inspired by Cookiecutter Data Science, Khuyen Tran's Data Science Template, and the blog article "Jupyter Notebook Best Practices for Data Science" by Jonathan Whitmore.

Features

  • Support for common research workflows (for both individuals and teams)

  • A directory structure that organizes and separates different components and stages of research: data, exploration/experimentation (e.g., Jupyter notebooks), documentation (e.g., reports, references), and software (e.g., custom functions and test code)

  • Integration with tools that encourage code, data, and scientific quality while promoting research efficiency

  • Quick references for software tools (e.g., FastDS, MLflow, Poetry)

  • Support for the Julia programming language

  • Python package and dependency management using Poetry

  • Directory-based development environment isolation with direnv


Table of Contents

  1. Usage

    1.1. Cookiecutter Parameters

    1.2. Setting Up a New Research Project

    1.3. Publishing Project Documentation to GitHub Pages

    1.4. Known Issues

  2. Contributor Notes

    2.1. License

    2.2. Repository Contents

    2.3. Software Requirements

    2.4. Setting Up to Develop the Cookiecutter

    2.5. Additional Notes

  3. Documentation


1. Usage

1.1. Cookiecutter Parameters

  • project_name: project name

  • author: project's primary author

  • email: primary author's email

  • license: type of license to use for the project

  • python_version: Python versions compatible with the project. See the "Dependency specification" section of the Poetry documentation for version specifier semantics.

  • enable_julia: flag indicating whether Julia should be enabled for the project
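
The parameters above can also be supplied without interactive prompts. The following is a minimal sketch, assuming the Cookiecutter Python package is installed and using its `cookiecutter()` function with `no_input=True` and `extra_context`. The helper names `build_context` and `create_project` and all example values are hypothetical. `enable_julia` is deliberately left at the template default because its accepted values are not listed here; check cookiecutter.json before setting it.

```python
"""Sketch: create a project non-interactively (helper names are hypothetical)."""

TEMPLATE_URL = "https://github.com/velexi-research/VLXI-Cookiecutter-Research.git"


def build_context(project_name, author, email, license_name, python_version="^3.9"):
    """Collect the cookiecutter parameters described above into a dict.

    `license_name` must match one of the license choices offered by the
    template (for example, "MIT License"); it is not validated here.
    """
    return {
        "project_name": project_name,
        "author": author,
        "email": email,
        "license": license_name,
        "python_version": python_version,
    }


def create_project(context):
    """Generate the project; needs the cookiecutter package and network access."""
    # Imported lazily so that build_context() works without cookiecutter installed.
    from cookiecutter.main import cookiecutter

    # no_input=True skips the prompts; parameters that are absent from
    # `context` (e.g., enable_julia) fall back to the template defaults.
    return cookiecutter(TEMPLATE_URL, no_input=True, extra_context=context)
```

Calling `create_project(build_context("My Project", "Jane Doe", "jane@example.com", "MIT License"))` would then generate the project in the current directory, with every parameter not named in the context taken from the template defaults.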

1.2. Setting Up a New Research Project

  1. Prerequisites

    • Install Git.

    • Install Python 3.9 (or greater).

    • If the project uses Julia, install Julia 1.6 (or greater).

    • Install Poetry 1.2 (or greater).

      Note. The project template uses poetry instead of pip for management of Python package dependencies.

    • Install the Cookiecutter Python package.

    • Optional. Install direnv.

  2. Use cookiecutter to create a new research project.

    $ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Research.git
  3. Set up a dedicated virtual environment for the project. Any of the common virtual environment options (e.g., venv, direnv, conda) should work. Below are instructions for setting up a direnv or poetry environment.

    Note: to avoid conflicts between virtual environments, only one method should be used to manage the virtual environment.

    • direnv Environment. Note: direnv manages the environment for both Python and the shell.

      • Prerequisite. Install direnv.

      • Copy extras/dot-envrc to the project root directory, and rename it to .envrc.

        $ cd $PROJECT_ROOT_DIR
        $ cp extras/dot-envrc .envrc
      • Grant permission to direnv to execute the .envrc file.

        $ direnv allow
    • poetry Environment. Note: poetry only manages the Python environment (it does not manage the shell environment).

      • Create a poetry environment that uses a specific Python executable. For instance, if python3 is on your PATH, the following command creates (or activates if it already exists) a Python virtual environment that uses python3 for the project.

        $ poetry env use python3

        For commands to use other Python executables for the virtual environment, see the Poetry Quick Reference.

  4. Install the base Python package dependencies.

    $ poetry install
  5. Configure Git.

    • Install the Git pre-commit hooks.

      $ pre-commit install
    • Optional. Set up a remote Git repository (e.g., GitHub repository).

      • Create a remote Git repository.

      • Configure the remote Git repository.

        $ git remote add origin GIT_REMOTE

        where GIT_REMOTE is the URL of the remote Git repository.

      • Push the main branch to the remote Git repository.

        $ git checkout main
        $ git push -u origin main
  6. Configure DVC.

    • Initialize DVC (Data Version Control). In the following commands, replace PROJECT_DIR with the path to the newly created research project.

      • Using fds.

        $ cd PROJECT_DIR
        $ fds init
        $ fds commit -m "Initialize DVC"
      • Using dvc + git.

        $ cd PROJECT_DIR
        $ dvc init
        $ git commit -m "Initialize DVC"
    • Add a remote DVC repository.

      • Set up a remote DVC repository (e.g., S3 bucket).

      • Configure the remote DVC repository.

        $ dvc remote add -d storage DVC_REMOTE

        where storage is the name for the remote repository and DVC_REMOTE is the URL to the remote DVC repository. Note: the -d option indicates that storage should be used as the default remote DVC repository.

    • Configure DVC to automatically stage changes to *.dvc files with Git.

      $ dvc config core.autostage true
  7. Finish setting up the new research project.

    • Verify the copyright year and owner in the copyright notice. If the project is licensed under Apache License 2.0, the copyright notice is located in the NOTICE file. Otherwise, the copyright notice is located in the LICENSE file.

    • Update the base Python package dependencies to the latest available versions.

      $ poetry update
    • Review the Python package dependencies for the project, and modify them as needed using the poetry CLI tool. For a quick reference of poetry commands, see the Poetry Quick Reference.

      Packages that may be useful (but are not included by default):

      • numpy
      • numba
      • scipy
      • pandas
      • scikit-learn
      • matplotlib
      • seaborn

      For instance, to add numpy to the project dependencies, use the command:

      $ poetry add numpy
    • Fill in any empty fields in pyproject.toml.

    • Customize the README.md file to reflect the specifics of the project.

    • If the project was created with Julia support enabled, configure the Julia package dependencies for the project from the Julia REPL.

      julia> ]
      
      (...) pkg> instantiate
      • Review the Julia package dependencies for the project, and modify them as needed using the Julia package manager. For a quick reference of Julia package manager REPL commands, see the Julia Quick Reference.
    • Commit all updated files (e.g., poetry.lock, Project.toml) to the project Git repository.
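
Before working through the steps above, the Step 1 prerequisites can be checked with a short script. This is a minimal sketch that only tests whether each command is on the PATH and whether the running Python interpreter is 3.9 or greater. It does not check the Poetry or Julia versions, and the function names are hypothetical.

```python
"""Sketch: check the Step 1 prerequisites (function names are hypothetical)."""
import shutil
import sys

REQUIRED_TOOLS = ("git", "poetry", "cookiecutter")
# direnv is optional; julia is only needed for Julia-enabled projects.
OPTIONAL_TOOLS = ("direnv", "julia")
MIN_PYTHON = (3, 9)


def missing_tools(tools):
    """Return the commands in `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


def report():
    """Print a short summary of missing prerequisites."""
    print("Python >= 3.9:", "ok" if python_ok() else "too old")
    print("Missing required tools:", missing_tools(REQUIRED_TOOLS) or "none")
    print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS) or "none")
```

Running `report()` in the shell environment that will be used for the project shows which tools still need to be installed.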

1.3. Publishing Project Documentation to GitHub Pages

  1. From the project GitHub repository, navigate to "Settings" > "Pages" (in the "Code and automation" section of the side menu) and configure GitHub Pages to deploy from the main branch.

    • Source: Deploy from a branch
    • Branch: main
      • Folder: /(root)
  2. In the "About" section of the project GitHub repository, set "Website" to the URL for the project GitHub Pages.

  3. That's it! Every time the main branch is updated, GitHub automatically builds the project documentation from the README.md file (and any linked Markdown files) and publishes it to the project GitHub Pages.

1.4. Known Issues

  • When including numba as a project dependency, the Python version constraint in pyproject.toml needs to be more restrictive than the default ^3.9. For numba 0.55, the Python version constraint in the [tool.poetry.dependencies] section of pyproject.toml should be set to:

    python = ">=3.9,<3.11"
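
In Poetry, the caret constraint ^3.9 is equivalent to >=3.9,<4.0, so the default also admits Python 3.11 and later; the constraint above narrows that range. The sketch below is only an illustration of the two ranges, not how Poetry resolves versions. The function names `parse` and `in_range` are hypothetical, and pre-release strings such as 3.11.0rc1 are not handled.

```python
"""Sketch: compare a Python version against the two constraints (illustrative only)."""

DEFAULT_RANGE = ("3.9", "4.0")   # what ^3.9 expands to: >=3.9,<4.0
NUMBA_RANGE = ("3.9", "3.11")    # the constraint above: >=3.9,<3.11


def parse(version):
    """Turn a plain numeric version such as "3.10.4" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, bounds):
    """Return True if lower <= version < upper for bounds = (lower, upper)."""
    lower, upper = bounds
    return parse(lower) <= parse(version) < parse(upper)


for candidate in ("3.9.0", "3.10.4", "3.11.0"):
    print(candidate, in_range(candidate, DEFAULT_RANGE), in_range(candidate, NUMBA_RANGE))
```

A version such as 3.11.0 falls inside the default range but outside the narrower one, which is why the default constraint has to be tightened when numba 0.55 is added.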
    

2. Contributor Notes

2.1. License

The contents of this cookiecutter are covered under the Apache License 2.0 (included in the LICENSE file). The copyright notice for this cookiecutter is contained in the NOTICE file.

2.2. Repository Contents

├── README.md                         <- this file
├── RELEASE-NOTES.md                  <- cookiecutter release notes
├── LICENSE                           <- cookiecutter license
├── NOTICE                            <- cookiecutter copyright notice
├── cookiecutter.json                 <- cookiecutter configuration file
├── pyproject.toml                    <- Python project metadata file for
│                                        cookiecutter development
├── poetry.lock                       <- Poetry lockfile for cookiecutter
│                                        development
├── docs/                             <- cookiecutter documentation
├── extras/                           <- additional files that may be useful for
│                                        cookiecutter development
├── hooks/                            <- cookiecutter scripts that run before
│                                        and/or after project generation
├── spikes/                           <- experimental code
└── {{cookiecutter.__project_name}}/  <- cookiecutter template

2.3. Software Requirements

Base Requirements

Optional Packages

Python Packages

See the [tool.poetry.dependencies] section of pyproject.toml.

2.4. Setting Up to Develop the Cookiecutter

  1. Set up a dedicated virtual environment for cookiecutter development. See Step 3 of Section 1.2 for instructions on how to set up direnv and poetry environments.

  2. Install the Python packages required for development.

    $ poetry install
    
  3. Install the Git pre-commit hooks.

    $ pre-commit install
  4. Make the cookiecutter better!

2.5. Additional Notes

Updating Template Dependencies

To update the Python dependencies for the template (contained in the {{cookiecutter.__project_name}} directory), use the following procedure to ensure that Python package dependencies for developing the non-template components of the cookiecutter (e.g., hooks/pre_gen_project.py) do not interfere with Python package dependencies for the template.

  • Create a local clone of the cookiecutter Git repository to use for cookiecutter development.

    $ git clone git@github.com:velexi-research/VLXI-Cookiecutter-Research.git
  • Use cookiecutter from the local cookiecutter Git repository to create an instance of the template to use for updating Python package dependencies.

    $ cookiecutter PATH/TO/LOCAL/REPO
  • In the instance of the template, perform the following steps to update the template's Python package dependencies.

    • Set up a virtual environment for developing the template (e.g., a direnv environment).

    • Use poetry or manually edit pyproject.toml to (1) make changes to the Python package dependency list and (2) update the versions of Python package dependencies.

    • Use poetry to update the Python package dependencies and versions recorded in the poetry.lock file.

  • Update {{cookiecutter.__project_name}}/pyproject.toml.

    • Copy pyproject.toml from the instance of the template to {{cookiecutter.__project_name}}/pyproject.toml.

    • Restore the templated values in the [tool.poetry] section to the following:

      [tool.poetry]
      name = "{{ cookiecutter.__project_name }}"
      version = "0.0.0"
      description = ""
      license = "{% if cookiecutter.license == 'Apache License 2.0' %}Apache-2.0{% elif cookiecutter.license == 'BSD-3-Clause License' %}BSD-3-Clause{% elif cookiecutter.license == 'MIT License' %}MIT{% endif %}"
      readme = "README.md"
      authors = ["{{ cookiecutter.author }} <{{ cookiecutter.email }}>"]
  • Update {{cookiecutter.__project_name}}/poetry.lock.

    • Copy poetry.lock from the instance of the template to {{cookiecutter.__project_name}}/poetry.lock.
  • Commit the updated pyproject.toml and poetry.lock files to the Git repository.
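
Because the copied pyproject.toml contains the concrete values of the generated instance, it is easy to commit it without restoring the templated [tool.poetry] values listed above. The following is a minimal sketch of a check that could be run before committing. The helper name `missing_templated_values` is hypothetical, and the check is a plain substring search rather than a TOML or Jinja parser.

```python
"""Sketch: verify that templated values were restored (helper name is hypothetical)."""
from pathlib import Path

# Fragments taken from the [tool.poetry] block shown above.
EXPECTED_SNIPPETS = (
    'name = "{{ cookiecutter.__project_name }}"',
    "cookiecutter.license",
    'authors = ["{{ cookiecutter.author }} <{{ cookiecutter.email }}>"]',
)


def missing_templated_values(pyproject_text):
    """Return the expected template fragments that are absent from the text."""
    return [snippet for snippet in EXPECTED_SNIPPETS if snippet not in pyproject_text]


def check_file(path):
    """Read a pyproject.toml file and return its missing template fragments."""
    return missing_templated_values(Path(path).read_text(encoding="utf-8"))
```

For example, `check_file("{{cookiecutter.__project_name}}/pyproject.toml")` returns an empty list once the name, license, and authors fields have been restored.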


3. Documentation