A template creation tool for Machine Learning and Data Science projects.
🇷🇺 A Russian-language version of this README is available here.
- Install Sphinx for automatic documentation support. Follow this link for the installation instructions; the preferred way of installing is via pip3:

  ```
  pip3 install -U sphinx
  ```

- Execute the following commands in a terminal:

  ```
  sudo -i
  git clone https://github.com/EnlightenedCSF/Ocean.git
  cd <cloned repo>
  pip install --upgrade .
  ```
Creating a new project:

```
ocean project new -n "<project_name>" \  # ! must be provided !
                  -a "<author>" \        # default is `Surf`
                  -v "<version>" \       # default is `0.0.1`
                  -d "<description>" \   # default is ``
                  -l "<licence>" \       # default is `MIT`
                  -p "<path>"            # default is `.`
```
Install the project code as a package:

```
make -B package
```
Creating a new experiment in the project:

```
ocean exp new -n "<exp_name>" \  # ! must be provided !
              -a "<author>"      # ! must be provided !
```
The project is based on the cookiecutter-data-science template, but modifies it. Before reading further, I highly recommend following that link and taking a look, because many of the key points listed there are important.

Let's see how the original cookiecutter template is structured:
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
We upgraded it right away:

- we added a `make docs` command for automatic generation of Sphinx documentation based on the docstrings of the whole `src` module;
- we added a convenient file logger (and a `logs` folder, respectively);
- we added a coordinator entity for easy navigation throughout the project, removing the need to write `os.path.join`, `os.path.abspath` or `os.path.dirname` every time.
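To make the coordinator idea concrete, here is a minimal sketch of how such a path helper could look and be used; the `Coordinator` class and its attributes below are illustrative assumptions for this README, not Ocean's actual API.

```python
from pathlib import Path


class Coordinator:
    """Hypothetical path helper: resolves common project folders from a single root."""

    def __init__(self, root: str = "."):
        self.root = Path(root).resolve()
        self.data_raw = self.root / "data" / "raw"
        self.data_processed = self.root / "data" / "processed"
        self.models = self.root / "models"
        self.logs = self.root / "logs"


# One object instead of repeated os.path.join / os.path.abspath / os.path.dirname calls.
coord = Coordinator(".")
train_csv = coord.data_raw / "train.csv"
model_path = coord.models / "baseline.pkl"
```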
But what problems are there?
- The `data` folder can grow significantly, but which script or notebook generated each file is a mystery. The number of different files stored there can be misleading, and it is not clear whether any of them is useful for implementing a new feature, because there is no place for descriptions and explanations.
- The `data` folder lacks a `features` subfolder, which could be put to good use: one could store calculated statistics, embeddings and other features there. There is a nice article about this which I strongly recommend.
- The `src` folder is another problem. It contains both functionality that is relevant project-wide (like the `src.data` submodule) and functionality relevant to concrete, often small sub-tasks (like `src.models`).
- The `references` folder exists, but it is an open question who has to put records there, and when and how. And there is a lot to explain during development: which experiments have been done, what the results were, and what we are doing next.
To solve the listed problems, I introduce the experiment entity.
An experiment is a place which contains all the data relevant to checking a single hypothesis.
Including:
- What data was used
- What data (or artefacts) were produced
- Code version
- Timestamps of when the experiment started and finished
- Source file
- Parameters
- Metrics
- Logs
Many things can be logged via tracker utilities like mlflow, but that alone is not enough. We can improve the workflow further.
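For reference, parameter and metric tracking with `mlflow` looks roughly like this; the run name, parameters and metric values are made up for illustration.

```python
import mlflow

# Track a single run; the parameter and metric names here are illustrative.
with mlflow.start_run(run_name="exp-001-tree-models"):
    mlflow.log_param("max_depth", 8)
    mlflow.log_param("n_estimators", 300)
    mlflow.log_metric("roc_auc", 0.87)
```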
This is what an example experiment looks like:
```
<project_root>
└── experiments
    ├── exp-001-Tree-models
    │   ├── config            <- yaml files with grid-search parameters or just model parameters
    │   ├── models            <- dumped models
    │   ├── notebooks         <- notebooks for research
    │   ├── scripts           <- scripts like train.py or predict.py
    │   ├── Makefile          <- for handling the experiment with just a few commands in the console
    │   ├── requirements.txt  <- dependent libraries
    │   └── log.md            <- a log of how the experiment is going
    │
    ├── exp-002-Gradient-boosting
    ...
```
Let's take a look at the workflow for one experiment.
- Notebooks are created in which the data is prepared for the model and the model's structure is worked out.
- Once the code is ready, it is moved to `train.py`:
  - you might track model parameters from there (for instance, with `mlflow`);
  - create a relevant `config` file for the training configuration;
  - the code should be callable from the console: it could take the paths to the data, to the `config` file, and to the directory to dump the model to (see the sketch after this list).
- Then the Makefile is modified to start the training process from the console. Provide a command like `make train`.
- Many models are trained, and all the metrics and parameters are sent to `mlflow`. One can run `mlflow ui` to check the results.
- Finally, all the results are recorded in `log.md`. It has some impact-analysis elements: the developer needs to point out what data was used and what data was generated. This information can be used to automatically generate a readme file for the `data` folder and to clarify where each file is used.
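As a rough illustration of the `train.py` step above, here is a minimal, console-callable training sketch; the argument names, the config layout, the `target` column and the model choice are assumptions made for this example, not something Ocean generates.

```python
import argparse
from pathlib import Path

import joblib
import mlflow
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier


def main():
    # Paths to the data, the config file and the model dump directory come from the console.
    parser = argparse.ArgumentParser(description="Train a model for one experiment")
    parser.add_argument("--data", required=True, help="csv file with a 'target' column")
    parser.add_argument("--config", required=True, help="yaml file with model parameters")
    parser.add_argument("--model-dir", required=True, help="directory to dump the trained model to")
    args = parser.parse_args()

    params = yaml.safe_load(Path(args.config).read_text())
    df = pd.read_csv(args.data)
    X, y = df.drop(columns=["target"]), df["target"]

    with mlflow.start_run():
        mlflow.log_params(params)  # model parameters taken from the config file
        model = RandomForestClassifier(**params).fit(X, y)
        mlflow.log_metric("train_accuracy", model.score(X, y))

        model_dir = Path(args.model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)
        joblib.dump(model, model_dir / "model.pkl")  # dumped model goes to the experiment's models/


if __name__ == "__main__":
    main()
```

A matching `make train` target could then simply call `python scripts/train.py --data <...> --config <...> --model-dir <...>`.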