Experiment management

In lumo, the Experiment class provides the guarantees needed to make experiments reproducible. Specifically, Experiment supports reproducibility from four perspectives: path management, version control, parameter recording, and backup. It also lowers the barrier to use through visual panels, command-line interfaces, and other tools.

Path Management

To ensure that paths are never duplicated, Experiment assigns a unique experiment ID (test_name) to each experiment run. Experiment also provides three types of storage paths: one for information (info_dir), one for binary files (blob_dir), and one for temporary files (cache_dir). They are laid out as follows:

- <cache_root>
    - <exp_name>
        - <cache_dir>

- <info_root>
    - <exp_name>
        - <info_dir>

- <blob_root>
    - <exp_name>
        - <blob_dir>
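
As a rough illustration of the layout above, the sketch below composes the three directories from their roots. The helper experiment_dirs and the root locations are hypothetical and are not part of lumo's API. It also assumes that the leaf directory is named after test_name, which the info_dir path printed in the retrieval example suggests.

```python
from pathlib import Path


def experiment_dirs(roots: dict, exp_name: str, test_name: str) -> dict:
    """Map each storage kind to <root>/<exp_name>/<test_name>.

    `roots` uses the hypothetical keys 'info', 'blob' and 'cache'.
    """
    return {
        f"{kind}_dir": Path(root) / exp_name / test_name
        for kind, root in roots.items()
    }


# Hypothetical root locations, chosen only for illustration.
roots = {
    "info": "/tmp/lumo/experiments",
    "blob": "/tmp/lumo/blob",
    "cache": "/tmp/lumo/cache",
}
dirs = experiment_dirs(roots, "simclr.simclrexp", "230314.000.a3t")
for name, path in dirs.items():
    print(name, path)
```

Because every run gets its own test_name, two runs of the same experiment never share a leaf directory, even though they share the same exp_name parent.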

Version Control

The lifecycle of an Experiment consists of start, progress, and end stages, and a series of ExpHook classes perform operations at each stage. Among them, lumo.exp.exphook.GitCommit handles git commits: in on_start it checks for file changes and, if any exist, commits a snapshot of the current files to the lumo_experiments branch. The commit information corresponding to the current code is recorded in the info_dir of the Experiment instance and can be viewed through exp.properties['git'].
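
The hook mechanism can be pictured with a small, self-contained sketch. It is not lumo's implementation: Hook, SnapshotOnStart, and MiniExperiment are hypothetical names, the on_end stage name is assumed by analogy with on_start, and a content hash stands in for a real git commit so that the example runs without a repository.

```python
import hashlib


class Hook:
    """Base hook; subclasses override only the stages they need."""

    def on_start(self, exp):
        pass

    def on_end(self, exp):
        pass


class SnapshotOnStart(Hook):
    """Toy stand-in for a commit hook: snapshot only when the code changed."""

    def __init__(self):
        self.snapshots = []

    def on_start(self, exp):
        digest = hashlib.sha1(exp.source.encode("utf-8")).hexdigest()[:8]
        if not self.snapshots or self.snapshots[-1] != digest:
            self.snapshots.append(digest)  # analogous to creating a commit
        # Record which snapshot this run corresponds to.
        exp.properties["git"] = {"commit": self.snapshots[-1]}


class MiniExperiment:
    """Hypothetical experiment that fires hooks at lifecycle stages."""

    def __init__(self, source, hooks):
        self.source = source
        self.hooks = list(hooks)
        self.properties = {}

    def start(self):
        for hook in self.hooks:
            hook.on_start(self)

    def end(self):
        for hook in self.hooks:
            hook.on_end(self)


hook = SnapshotOnStart()
for source in ["lr = 0.06", "lr = 0.06", "lr = 0.03"]:
    exp = MiniExperiment(source, [hook])
    exp.start()
    exp.end()

print(len(hook.snapshots))  # two distinct code states were seen
```

The second run reuses the first snapshot because nothing changed, which mirrors the "commit only if changes exist" behavior described above.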

Information Recording

Information recording covers startup parameters (hyperparameters and program execution parameters) as well as runtime and post-run information such as metrics, execution time, and other metadata. Everything except the hyperparameters is recorded automatically by Experiment at .start(). The hyperparameters can be recorded with exp.dump_info('params', params_dict).

When training with lumo.Trainer, the hyperparameters used are automatically recorded under the params key.
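
To make the idea of keyed information recording concrete, here is a minimal sketch. The InfoStore class is hypothetical, and the one-JSON-file-per-key layout is an assumption made for illustration; this document does not describe lumo's actual on-disk format.

```python
import json
import tempfile
from pathlib import Path


class InfoStore:
    """Hypothetical stand-in for the info-recording part of an experiment."""

    def __init__(self, info_dir):
        self.info_dir = Path(info_dir)
        self.info_dir.mkdir(parents=True, exist_ok=True)

    def dump_info(self, key, info: dict):
        # One JSON file per key keeps writes independent of each other.
        path = self.info_dir / f"{key}.json"
        path.write_text(json.dumps(info, indent=2, sort_keys=True))

    @property
    def properties(self) -> dict:
        # Rebuild the full view by reading every recorded key back from disk.
        return {
            path.stem: json.loads(path.read_text())
            for path in sorted(self.info_dir.glob("*.json"))
        }


with tempfile.TemporaryDirectory() as tmp:
    store = InfoStore(Path(tmp) / "230314.000.a3t")
    store.dump_info("params", {"seed": 1, "optim": {"name": "SGD", "lr": 0.06}})
    store.dump_info("execute", {"exec_file": "train_ssl.py"})
    recorded = store.properties

print(sorted(recorded))
```

Keeping each key in its own record means a hook that writes one kind of metadata cannot clobber the hyperparameters written by another.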

Metrics can be recorded on the Experiment instance with .dump_metric() and .dump_metrics(), for example:

max_acc = exp.dump_metric("acc", acc, "cls_acc", cls_acc)
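
The variable name max_acc suggests that the call returns the best value seen so far. The sketch below shows one plausible reading of that behavior; MetricBoard and its cmp parameter are hypothetical and should not be taken as the signature of lumo's dump_metric.

```python
class MetricBoard:
    """Hypothetical tracker that keeps the best value per metric key."""

    def __init__(self):
        self.best = {}

    def dump_metric(self, key, value, cmp="max"):
        # Pick the comparison: larger is better for "max", smaller for "min".
        choose = max if cmp == "max" else min
        if key in self.best:
            value = choose(self.best[key], value)
        self.best[key] = value
        return value


board = MetricBoard()
for acc in [0.70, 0.90, 0.80]:
    max_acc = board.dump_metric("acc", acc)

for loss in [1.2, 0.4, 0.6]:
    min_loss = board.dump_metric("loss", loss, cmp="min")

print(max_acc, min_loss)
```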

The following is an example of exp.properties:

{'agent': nan,
 'backup': {'23-03-17-161847': {'backend': 'github',
                                'number': 4,
                                'repo': 'sailist/image-classification'}},
 'deprecated': nan,
 'exception': nan,
 'execute': {'cwd': '~/python/image-classification-private',
             'exec_argv': ['train_ssl.py',
                           'train_ssl.py',
                           '--module=simclr',
                           '--device=2',
                           '--config=config/ssl/simclr/cifar100.yaml',
                           '--model=wrn282',
                           '--scan=ssl-2023.02.28'],
             'exec_bin': '~/miniconda3/bin/python3',
             'exec_file': 'train_ssl.py',
             'repo': '~/python/image-classification-private'},
 'exp_name': 'simclr.simclrexp',
 'git': {'commit': '294ccdac',
         'dep_hash': '404fc6044b2119d56a5e8b92ac02fc1c',
         'repo': '~/python/image-classification-private'},
 'hooks': {'Diary': {'loaded': True, 'msg': ''},
           'FinalReport': {'loaded': True, 'msg': ''},
           'GitCommit': {'loaded': True, 'msg': ''},
           'LastCmd': {'loaded': True, 'msg': ''},
           'LockFile': {'loaded': True, 'msg': ''},
           'RecordAbort': {'loaded': True, 'msg': ''}},
 'lock': {'accelerate': '0.16.0',
          'decorator': '5.1.1',
          'fire': '0.5.0',
          'hydra': '1.3.1',
          'joblib': '1.2.0',
          'lumo': '0.15.0',
          'numpy': '1.24.2',
          'omegaconf': '2.3.0',
          'psutil': '5.9.4',
          'torch': '1.8.1+cu101',
          'torch.version.cuda': '10.1'},
 'note': '',
 'params': {'apply_mixco': False,
            'apply_unmix': False,
            'config': 'config/ssl/simclr/cifar100.yaml',
            'dataset': 'cifar100',
            'detach_cls': True,
            'device': 2,
            'ema': True,
            'ema_alpha': 0.99,
            'epoch': 1000,
            'eval': {'batch_size': 512,
                     'num_workers': 8,
                     'pin_memory': True,
                     'shuffle': True},
            'feature_dim': 128,
            'hidden_feature_size': 128,
            'knn': True,
            'knn_k': 200,
            'knn_t': 0.1,
            'linear_eval': False,
            'lr_decay_end': 0.0005,
            'method': 'simclr',
            'model': 'wrn282',
            'module': 'simclr',
            'more_sample': True,
            'n_classes': 100,
            'optim': {'lr': 0.06,
                      'momentum': 0.9,
                      'name': 'SGD',
                      'weight_decay': 0.0005},
            'pretrain_path': None,
            'scan': 'ssl-2023.02.28',
            'seed': 1,
            'semi_eval': False,
            'stl10_unlabeled': True,
            'temperature': 0.1,
            'test': {'batch_size': 512,
                     'num_workers': 8,
                     'pin_memory': True,
                     'shuffle': False},
            'train': {'batch_size': 512,
                      'num_workers': 8,
                      'pin_memory': True,
                      'shuffle': True},
            'train_ending': 10,
            'train_linear': True,
            'train_strategy': 'ending',
            'warmup_epochs': 0,
            'warmup_from': 0.01,
            'with_bn': False},
 'pinfo': {'hash': '62ee6de98b381872e200e82901ad51f7',
           'obj': {'argv': ['~/miniconda3/bin/python3',
                            'train_ssl.py',
                            'train_ssl.py',
                            '--module=simclr',
                            '--device=2',
                            '--config=config/ssl/simclr/cifar100.yaml',
                            '--model=wrn282',
                            '--scan=ssl-2023.02.28'],
                   'pid': 27687,
                   'pname': 'python3',
                   'pstart': 1678763482.5},
           'pid': 27687},
 'progress': {'finished': False,
              'last_edit_time': '23-03-14-212932',
              'ratio': 1.0,
              'start': '23-03-14-111124',
              'update_from': None},
 'rerun': {'from': '230313.015.99t', 'repeat': 1},
 'test_name': '230314.000.a3t',
 ...
 }

Retrieve Experiment

Watcher consolidates information for all experiments, allowing users to search for a specific experiment.

from lumo import Watcher, Experiment

w = Watcher()
df = w.load() # all experiments

exp = Experiment.from_cache(df.iloc[0].to_dict())

For an experiment with a known test_name, the Experiment instance can be retrieved directly with the retrieve method:

w.retrieve('230306.012.d5t')
>>> Experiment(info_dir=".../.lumo/experiments/moco.mocoexp/230306.012.d5t")

Visual Panel

A fixed-style panel can never satisfy everyone's needs. Therefore, lumo provides dynamic panels based on pandas and panel, with all styles except for a few fixed parts added by the user:

from lumo import Watcher
w = Watcher()
df = w.load()

# ... filter operations ...

new_df = ...

w.panel(new_df)
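
The filter step is ordinary pandas. The sketch below builds a small DataFrame by hand instead of calling w.load(), so it runs without lumo. The column names are an assumption, chosen to mirror keys of exp.properties, and the rows (including the finished flags) are made-up illustrative values.

```python
import pandas as pd

# Hypothetical rows shaped like entries of exp.properties.
df = pd.DataFrame(
    [
        {"test_name": "230313.015.99t", "exp_name": "simclr.simclrexp",
         "progress": {"finished": True, "ratio": 1.0}},
        {"test_name": "230314.000.a3t", "exp_name": "simclr.simclrexp",
         "progress": {"finished": False, "ratio": 1.0}},
        {"test_name": "230306.012.d5t", "exp_name": "moco.mocoexp",
         "progress": {"finished": True, "ratio": 1.0}},
    ]
)

# Keep only finished runs of a single experiment.
finished = df["progress"].apply(lambda p: p["finished"])
new_df = df[(df["exp_name"] == "simclr.simclrexp") & finished]

print(new_df["test_name"].tolist())
```

The resulting new_df plays the role of the filtered frame that is handed to the panel.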

Repetitive Experiment

Repetitive experiments mainly occur in two scenarios:

- To verify the stability of the results, the experiment is rerun with the same parameters and different random seeds.
- An experiment fails partway through because of memory, disk space, or other problems and needs to be rerun with similar parameters.
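
For the first scenario, the parameter sets for the reruns can be derived mechanically from the original one. The helper rerun_variants below is hypothetical and independent of lumo; it only shows how to vary the seed while leaving every other parameter untouched.

```python
import copy


def rerun_variants(params: dict, seeds):
    """Return one deep copy of `params` per seed, differing only in 'seed'."""
    variants = []
    for seed in seeds:
        variant = copy.deepcopy(params)  # nested dicts must not be shared
        variant["seed"] = seed
        variants.append(variant)
    return variants


base = {"method": "simclr", "seed": 1, "optim": {"name": "SGD", "lr": 0.06}}
runs = rerun_variants(base, seeds=[2, 3, 4])

for run in runs:
    print(run["seed"], run["optim"]["lr"])
```

The deep copy matters because params often contain nested sections such as optim; a shallow copy would let a change in one rerun leak into the others.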

Especially when scanning parameters, if only
