API - Database

This is the alpha version of the database management system. If you run into any trouble, please ask for help at tensorlayer@gmail.com .

Why Database

TensorLayer is designed for real-world production and is capable of supporting large-scale machine learning applications. The TensorLayer database is introduced to address the data management challenges of large-scale machine learning projects, such as:

  1. Finding training data from an enterprise data warehouse.
  2. Loading large datasets that are beyond the storage capacity of a single computer.
  3. Managing different models with version control, and comparing them (e.g. by accuracy).
  4. Automating the process of training, evaluating and deploying machine learning models.

With the TensorLayer system, we introduce this database technology to address the challenges above.

The database management system is designed with the following three principles in mind.

Everything is Data

The data warehouse can capture and store the entire machine learning development process. The data can be categorized as:

  1. Dataset: This includes all the data used for training, validation and prediction. The labels can be manually specified or generated by model prediction.
  2. Model architecture: The database includes a table that stores different model architectures, enabling users to reuse previous model development work.
  3. Model parameters: The database stores the model parameters of each epoch during training.
  4. Tasks: A project usually includes many small tasks. Each task contains the necessary information, such as the hyper-parameters for training or validation. For a training task, typical information includes the training data, the model parameters, the model architecture, and the number of training epochs. Validation, testing and inference are also supported by the task system.
  5. Logging: The logs store all the metrics of each machine learning model, such as the timestamp, and the loss and accuracy of each batch or epoch.

The TensorLayer database is, in principle, a keyword-based search engine: each model, parameter set, or training dataset is assigned many tags. The storage system organizes data into two layers: the index layer and the blob layer. The index layer stores all the tags and references to the blob storage, and is implemented on top of a NoSQL document database such as MongoDB. The blob layer stores videos, medical images or label masks in large chunks, and is usually implemented on top of a file system. Our implementation is based on MongoDB: the blob layer uses GridFS, while the indexes are stored as documents.
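
To make the two-layer design concrete, the sketch below shows how an index document might reference a blob, using pymongo and gridfs directly. The collection and field names here are illustrative only, not the actual TensorLayer schema.

import pymongo
import gridfs

client = pymongo.MongoClient('localhost', 27017)
database = client['temp']
blobs = gridfs.GridFS(database)      # blob layer: large binary chunks
index = database['dataset_index']    # index layer: tags and references

## store the raw bytes in the blob layer, then index them with tags
blob_id = blobs.put(b'raw bytes of a video, image or label mask')
index.insert_one({'dataset_name': 'mnist', 'version': '1.0', 'blob_ref': blob_id})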

Everything is identified by Query

Within the database framework, any entity in the data warehouse, such as a dataset, a model or a task, is specified by the database query language. Compared with storing direct references, a query is more space efficient, and it can specify multiple objects in a concise way. Another advantage of this design is that it enables a highly flexible software system: many new applications can be implemented simply by updating the query, without modifying any application code.
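
As a concrete illustration, a single MongoDB-style query document can identify a whole family of stored objects at once. The field names below are illustrative, not the actual schema:

## one concise query document matches every mnist dataset at version 1.0 or above
query = {'dataset_name': 'mnist', 'version': {'$gte': '1.0'}}
## e.g. with pymongo: matching_docs = index.find(query)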

Preparation

In principle, the database can be implemented on top of any document-oriented NoSQL database system. The existing implementation is based on MongoDB; further implementations on other databases will be released as development progresses. It would also be straightforward to port our database system to Google Cloud, AWS and Azure. The following tutorials are based on the MongoDB implementation.

Installing and running MongoDB

The installation instructions for MongoDB can be found in the MongoDB Docs. There are also managed MongoDB services from Amazon and GCP, such as MongoDB Atlas. Users can also use Docker, which is a powerful tool for deploying software. After installing MongoDB, a management tool with a graphical user interface is extremely useful. For example, Studio 3T (formerly MongoChef) is a powerful user interface for MongoDB and is free for non-commercial use (studio3t).

Tutorials

Connect to the database

As with MongoDB management tools, an IP address and a port number are required to connect to the database. To distinguish different projects, the database instances take a project_name argument. In the following example, we connect to MongoDB on the local machine with the IP localhost and port 27017 (the default MongoDB port).

import tensorlayer as tl

db = tl.db.TensorHub(ip='localhost', port=27017, dbname='temp',
      username=None, password='password', project_name='tutorial')

Dataset management

You can save a dataset into the database so that all machines can access it. Apart from the dataset key, you can also attach custom arguments, such as a version and a description, to better manage the datasets. Note that all saving functions automatically save a timestamp, allowing you to load items (datasets, models, tasks) by timestamp.

db.save_dataset(dataset=[X_train, y_train, X_test, y_test], dataset_name='mnist', description='this is a tutorial')

After saving the dataset, others can access it as follows:

dataset = db.find_dataset('mnist')
dataset = db.find_dataset('mnist', version='1.0')

If you have multiple datasets that share the same dataset key, you can get all of them as follows:

datasets = db.find_all_datasets('mnist')
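
You can then iterate over the results, for example to compare versions. This sketch assumes each entry can be unpacked like the list originally passed to save_dataset (an assumption, not a confirmed return format):

for dataset in db.find_all_datasets('mnist'):
    X_train, y_train, X_test, y_test = dataset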

Model management

Save the model architecture and parameters into the database. The model architecture is represented by a TL graph, and the parameters are stored as a list of arrays.

db.save_model(net, accuracy=0.8, loss=2.3, name='second_model')

After saving the model into the database, we can load it as follows:

net = db.find_model(sess=sess, accuracy=0.8, loss=2.3)

If there are many models, you can use MongoDB's 'sort' argument to find the model you want. To get the newest or oldest model, sort by time:

import pymongo

## newest model
net = db.find_model(sess=sess, sort=[("time", pymongo.DESCENDING)])
net = db.find_model(sess=sess, sort=[("time", -1)])

## oldest model
net = db.find_model(sess=sess, sort=[("time", pymongo.ASCENDING)])
net = db.find_model(sess=sess, sort=[("time", 1)])

If you saved the model along with its accuracy, you can get the model with the best accuracy as follows:

net = db.find_model(sess=sess, sort=[("test_accuracy", -1)])

To delete all models in a project:

db.delete_model()

To delete specific models only, pass the matching arguments to the function.
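
For example, to delete only the model saved earlier under the name 'second_model' (an illustrative call; the filter arguments are assumed to mirror those used when saving):

db.delete_model(name='second_model')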

Event / Logging management

Save training log:

db.save_training_log(accuracy=0.33)
db.save_training_log(accuracy=0.44)
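
In a training loop, you would typically save one log entry per batch or epoch. The following is an illustrative sketch; it assumes that arbitrary keyword arguments become searchable fields of the log entry:

for epoch in range(2):
    ## placeholder metrics standing in for real training results
    loss, accuracy = 0.5 / (epoch + 1), 0.3 + 0.1 * epoch
    db.save_training_log(epoch=epoch, loss=loss, accuracy=accuracy)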

Delete logs that match a given condition:

db.delete_training_log(accuracy=0.33)

Delete all logs of this project:

db.delete_training_log()
db.delete_validation_log()
db.delete_testing_log()

Task distribution

A project usually consists of many tasks, such as hyper-parameter selection. To make this easier, we can distribute the tasks to several GPU servers. A task consists of a task script, hyper-parameters, the desired result keys and a status.

A task distributor can push both datasets and tasks into the database, allowing task runners on GPU servers to pull and run them. The following is an example that pushes 3 tasks with different hyper-parameters.

import time
import tensorlayer as tl

## save the dataset into the database, then allow other servers to use it
X_train, y_train, X_val, y_val, X_test, y_test = tl.files.load_mnist_dataset(shape=(-1, 784))
db.save_dataset((X_train, y_train, X_val, y_val, X_test, y_test), 'mnist', description='handwriting digit')

## push tasks into the database, then allow other servers to pull and run them
db.create_task(
    task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=800, n_units2=800),
    saved_result_keys=['test_accuracy'], description='800-800'
)

db.create_task(
    task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=600, n_units2=600),
    saved_result_keys=['test_accuracy'], description='600-600'
)

db.create_task(
    task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=400, n_units2=400),
    saved_result_keys=['test_accuracy'], description='400-400'
)

## wait for tasks to finish
while db.check_unfinished_task(task_name='mnist'):
    print("waiting runners to finish the tasks")
    time.sleep(1)

## you can get the model and result from database and do some analysis at the end
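
## e.g. (illustrative) retrieve the best model produced by the runners
net = db.find_model(sess=sess, sort=[("test_accuracy", -1)])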

The task runners on the GPU servers can monitor the database and run tasks as soon as they become available. In the task script, we can save the final model and results into the database; this allows the task distributor to retrieve the desired model and results.

import time

## monitor the database and pull tasks to run
while True:
    print("waiting for tasks from the distributor")
    db.run_task(task_name='mnist', sort=[("time", -1)])
    time.sleep(1)
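
For reference, a task script might look roughly like the following. This is a hypothetical sketch: it assumes the runner makes the task's hyper-parameters (n_units1, n_units2) available to the script and collects any variable named in saved_result_keys (here test_accuracy); the exact mechanism may differ.

## task_script.py -- hypothetical sketch, not the confirmed mechanism
import tensorlayer as tl

## connect to the same database as the distributor
db = tl.db.TensorHub(ip='localhost', port=27017, dbname='temp',
      username=None, password='password', project_name='tutorial')

## load the dataset pushed by the distributor
X_train, y_train, X_val, y_val, X_test, y_test = db.find_dataset('mnist')

## assumption: the runner provides n_units1 and n_units2 to this script
## ... build and train a model here ...

## assumption: a variable named in saved_result_keys is collected
## and returned to the distributor
test_accuracy = 0.95  # placeholder for the real evaluation result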

Example codes

See here.

TensorHub API

tensorlayer.db

TensorHub