Caffe C++ implementation for adding new image data to an existing LEVELDB/LMDB database

One excruciating experience when working with Caffe is creating the LEVELDB/LMDB database. Traditionally, if you need to add new image data to an existing database, you have to re-create the whole database from scratch, which is time consuming. For example, in my experiments I deal with about 5,000,000 images for each new model iteration, and it takes 10+ hours to regenerate the whole image database every time new image data is added. Intuitively, regenerating the existing database is unnecessary, since the old and new records share the same format; the sensible thing to do is to add the new images directly to the existing LEVELDB/LMDB database.

My proposed approach to address this problem

The simplest way to address this problem is to append the new images to the existing LEVELDB/LMDB database one by one. However, this violates the shuffle requirement for the whole image list. Note that shuffling is of vital importance for training a strong model: if the number of newly added images is large and they all belong to the same category, the whole image list must be shuffled to guarantee good results. Another issue is that LEVELDB/LMDB in Caffe stores key-value pairs in a queue-like fashion, which means we cannot fetch a value directly by its key.

My solution is to treat the existing database and the new image list separately, marking them with 0 and 1 respectively. I shuffle the resulting 0-1 list, then create the new destination database by walking through the shuffled list: for a 1, I read the next new image; for a 0, I read the next record from the existing LEVELDB/LMDB database (a C++ sketch of this merge loop is given after the timing results below).

Experiment

The time cost of my approach comes from two sources: reading and converting the newly added images, and reading the existing LEVELDB/LMDB records and storing them in the new destination database. It turns out that the time spent reading LEVELDB/LMDB data is almost negligible compared to reading an image and converting it to the LEVELDB/LMDB format. I conducted two experiments:

  1. 50,000 new images + 50,000 LEVELDB-encoded images: about 7 minutes, versus about 14 minutes to convert 100,000+ images from scratch.
  2. 5,000,000 images converted from scratch: about 10+ hours, versus about 2.3 hours for 300,000+ new images + 5,000,000 LEVELDB-encoded images.
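
To make the 0/1 bookkeeping concrete, here is a minimal C++ sketch of the merge loop, assuming Caffe's caffe::db wrappers (caffe/util/db.hpp) and ReadImageToDatum from caffe/util/io.hpp. All paths, labels, image sizes, the existing-record count and the batch size are placeholders, and the actual code in this repository may differ in details such as key formatting.

```cpp
// merge_db_sketch.cpp -- minimal sketch of the shuffled 0/1 merge idea.
// 0 = copy the next record from the existing database,
// 1 = read and encode the next new image.
#include <algorithm>
#include <cstdio>
#include <memory>
#include <random>
#include <string>
#include <utility>
#include <vector>

#include "caffe/proto/caffe.pb.h"
#include "caffe/util/db.hpp"
#include "caffe/util/io.hpp"

using caffe::Datum;
namespace db = caffe::db;

int main() {
  // New images to add, each paired with its label (placeholder entries).
  std::vector<std::pair<std::string, int> > new_images = {
      {"/path/to/new_img_0.jpg", 3},
      {"/path/to/new_img_1.jpg", 7}};

  // Existing database (read-only) and destination database (created fresh).
  std::unique_ptr<db::DB> src_db(db::GetDB("lmdb"));
  src_db->Open("/path/to/existing_lmdb", db::READ);
  std::unique_ptr<db::Cursor> cursor(src_db->NewCursor());
  cursor->SeekToFirst();

  std::unique_ptr<db::DB> dst_db(db::GetDB("lmdb"));
  dst_db->Open("/path/to/merged_lmdb", db::NEW);
  std::unique_ptr<db::Transaction> txn(dst_db->NewTransaction());

  // Build the 0/1 marker list and shuffle it.
  size_t num_existing = 5000000;  // placeholder: size of the existing db
  std::vector<int> markers(num_existing, 0);
  markers.insert(markers.end(), new_images.size(), 1);
  std::mt19937 rng(1234);
  std::shuffle(markers.begin(), markers.end(), rng);

  size_t new_idx = 0;
  std::string value;
  char key[64];
  for (size_t i = 0; i < markers.size(); ++i) {
    if (markers[i] == 0) {
      // Cheap path: copy the already-serialized Datum from the old db.
      value = cursor->value();
      cursor->Next();
    } else {
      // Expensive path: decode the image and serialize it into a Datum.
      Datum datum;
      caffe::ReadImageToDatum(new_images[new_idx].first,
                              new_images[new_idx].second,
                              256, 256, true, &datum);
      datum.SerializeToString(&value);
      ++new_idx;
    }
    std::snprintf(key, sizeof(key), "%08zu", i);  // simple sequential keys
    txn->Put(std::string(key), value);
    if ((i + 1) % 1000 == 0) {  // commit in batches
      txn->Commit();
      txn.reset(dst_db->NewTransaction());
    }
  }
  txn->Commit();
  return 0;
}
```

The cheap path only copies an already-serialized value from the old database into the new one, while the expensive path has to decode an image file and re-serialize it as a Datum; this is why reusing the existing database accounts for most of the time savings reported above.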
