# `nanoGPT`: Shakespeare

## Install / Setup

### Google Colab

```python
from google.colab import drive
drive.mount('/content/drive')
```

### First Time Running

We need to install `wordplay` and set up the Shakespeare dataset.

This only needs to be run the first time you run this notebook.

After running

```python
!python3 -m pip install wordplay
```

you will need to restart your runtime (Runtime -> Restart runtime).

After this, you should be able to:

```python
>>> import ngpt
>>> ngpt.__file__
'/content/nanoGPT/src/ngpt/__init__.py'
```

In [1]:
%%bash

python3 -c 'import wordplay; print(wordplay.__file__)' 2> '/dev/null'

if [[ $? -eq 0 ]]; then
    echo "Has wordplay installed. Nothing to do."
else
    echo "Does not have wordplay installed. Installing..."
    git clone 'https://github.com/saforem2/wordplay'
    python3 wordplay/data/shakespeare_char/prepare.py
    python3 -m pip install -e wordplay -vvv
fi

/Users/samforeman/projects/saforem2/wordplay/src/wordplay/__init__.py


Has wordplay installed. Nothing to do.


## Post Install

If installed correctly, you should be able to:

```python
>>> import wordplay
>>> wordplay.__file__
'/path/to/wordplay/src/wordplay/__init__.py'
```
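The same check can also be done from Python without importing the package. The snippet below is a minimal sketch using the standard library's `importlib.util.find_spec`; `is_installed` is a hypothetical helper name and is not part of `wordplay`.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be located on the current import path."""
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec(package) is not None


if is_installed("wordplay"):
    print("Has wordplay installed. Nothing to do.")
else:
    print("Does not have wordplay installed.")
```

Unlike a bare `import`, this does not execute the package's `__init__.py`, so it is cheap to run at the top of a notebook.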

In [2]:
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
os.environ['COLORTERM'] = 'truecolor'

In [3]:
%load_ext autoreload
%autoreload 2

import wordplay
from enrich import get_logger
log = get_logger(level='INFO')
#from rich import print
log.info(wordplay.__file__)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[2024-02-06 00:12:54][INFO][1627752624:8] - /Users/samforeman/projects/saforem2/wordplay/src/wordplay/__init__.py


## Build Trainer

Explicitly, we:

1. `setup_torch(...)`
2. Build `cfg: DictConfig = get_config(...)`
3. Instantiate `config: ExperimentConfig = instantiate(cfg)`
4. Build `trainer = Trainer(config)`
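The list passed to `get_config(...)` in the next cell uses Hydra-style override strings. Assuming `get_config` forwards them to Hydra (an assumption, since its implementation is not shown here), entries such as `data=shakespeare` select a config group, while dotted entries such as `train.max_iters=2000` set a nested value. The sketch below illustrates only the second kind in plain Python. It is not Hydra's implementation, and `parse_value` and `apply_overrides` are hypothetical helpers.

```python
def parse_value(raw: str):
    """Best-effort conversion of an override string to bool, int, float, or str."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Fold dotted `key=value` strings into a nested dictionary (in place)."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        # Walk (or create) the intermediate dictionaries, e.g. config["train"].
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


cfg_sketch = apply_overrides(
    {},
    ["train.compile=false", "train.dtype=float32", "train.max_iters=2000"],
)
print(cfg_sketch)
```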

In [4]:
import os
import numpy as np
from ezpz import setup
from hydra.utils import instantiate
from wordplay.configs import get_config, PROJECT_ROOT
from wordplay.trainer import Trainer

HF_DATASETS_CACHE = PROJECT_ROOT.joinpath('.cache', 'huggingface')
HF_DATASETS_CACHE.mkdir(exist_ok=True, parents=True)

os.environ['HF_DATASETS_CACHE'] = HF_DATASETS_CACHE.as_posix()

rank = setup(
    framework='pytorch',
    backend='deepspeed',
    seed=1234,
)

cfg = get_config(
    [
        'data=shakespeare',
        'model=shakespeare',
        'optimizer=shakespeare',
        'train=shakespeare',
        'train.backend=deepspeed',
        'train.compile=false',
        'train.dtype=float32',
        'train.max_iters=2000',
        'train.log_interval=100',
        'train.eval_interval=500',
    ]
)
config = instantiate(cfg)

Failed to download font: Source Sans Pro, skipping!


Failed to download font: Titillium WebRoboto Condensed, skipping!


[2024-02-06 00:13:23,550] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)






[2024-02-06 00:13:24,484] [INFO] [comm.py:637:init_distributed] cdb=None


[2024-02-06 00:13:24,484] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...


[2024-02-06 00:13:24,486] [INFO] [comm.py:707:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=localhost, master_port=29500


[2024-02-06 00:13:24,486] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend None




--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/Users/samforeman/projects/saforem2/wordplay/venvs/2023-11-11/lib/python3.11/site-packages/torch']
torch version .................... 2.2.0
deepspeed install path ........... ['/Users/samforeman/projects/saf

In [5]:
trainer = Trainer(config)

[2024-02-06 00:13:33,547] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown


wandb: Currently logged in as: saforem2 (l2hmc-qcd). Use `wandb login --relogin` to force relogin


[2024-02-06 00:13:35,803] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False


[2024-02-06 00:13:35,805] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer


[2024-02-06 00:13:35,805] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer


[2024-02-06 00:13:35,805] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW


[2024-02-06 00:13:35,806] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW


[2024-02-06 00:13:35,806] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler


[2024-02-06 00:13:35,806] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None


[2024-02-06 00:13:35,807] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001, 0.001], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:13:35,807] [INFO] [config.py:984:print] DeepSpeedEngine configuration:


[2024-02-06 00:13:35,808] [INFO] [config.py:988:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}


[2024-02-06 00:13:35,808] [INFO] [config.py:988:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}


[2024-02-06 00:13:35,808] [INFO] [config.py:988:print]   amp_enabled .................. False


[2024-02-06 00:13:35,809] [INFO] [config.py:988:print]   amp_params ................... False


[2024-02-06 00:13:35,809] [INFO] [config.py:988:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}


[2024-02-06 00:13:35,809] [INFO] [config.py:988:print]   bfloat16_enabled ............. False


[2024-02-06 00:13:35,810] [INFO] [config.py:988:print]   checkpoint_parallel_write_pipeline  False


[2024-02-06 00:13:35,810] [INFO] [config.py:988:print]   checkpoint_tag_validation_enabled  True


[2024-02-06 00:13:35,810] [INFO] [config.py:988:print]   checkpoint_tag_validation_fail  False


[2024-02-06 00:13:35,810] [INFO] [config.py:988:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x2d601ef90>


[2024-02-06 00:13:35,811] [INFO] [config.py:988:print]   communication_data_type ...... None


[2024-02-06 00:13:35,811] [INFO] [config.py:988:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channe

[2024-02-06 00:13:35,811] [INFO] [config.py:988:print]   curriculum_enabled_legacy .... False


[2024-02-06 00:13:35,812] [INFO] [config.py:988:print]   curriculum_params_legacy ..... False


[2024-02-06 00:13:35,812] [INFO] [config.py:988:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}


[2024-02-06 00:13:35,812] [INFO] [config.py:988:print]   data_efficiency_enabled ...... False


[2024-02-06 00:13:35,812] [INFO] [config.py:988:print]   dataloader_drop_last ......... False


[2024-02-06 00:13:35,813] [INFO] [config.py:988:print]   disable_allgather ............ False


[2024-02-06 00:13:35,813] [INFO] [config.py:988:print]   dump_state ................... False


[2024-02-06 00:13:35,813] [INFO] [config.py:988:print]   dynamic_loss_scale_args ...... None


[2024-02-06 00:13:35,814] [INFO] [config.py:988:print]   eigenvalue_enabled ........... False


[2024-02-06 00:13:35,814] [INFO] [config.py:988:print]   eigenvalue_gas_boundary_resolution  1


[2024-02-06 00:13:35,814] [INFO] [config.py:988:print]   eigenvalue_layer_name ........ bert.encoder.layer


[2024-02-06 00:13:35,815] [INFO] [config.py:988:print]   eigenvalue_layer_num ......... 0


[2024-02-06 00:13:35,815] [INFO] [config.py:988:print]   eigenvalue_max_iter .......... 100


[2024-02-06 00:13:35,815] [INFO] [config.py:988:print]   eigenvalue_stability ......... 1e-06


[2024-02-06 00:13:35,815] [INFO] [config.py:988:print]   eigenvalue_tol ............... 0.01


[2024-02-06 00:13:35,815] [INFO] [config.py:988:print]   eigenvalue_verbose ........... False


[2024-02-06 00:13:35,816] [INFO] [config.py:988:print]   elasticity_enabled ........... False


[2024-02-06 00:13:35,816] [INFO] [config.py:988:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}


[2024-02-06 00:13:35,816] [INFO] [config.py:988:print]   fp16_auto_cast ............... None


[2024-02-06 00:13:35,817] [INFO] [config.py:988:print]   fp16_enabled ................. False


[2024-02-06 00:13:35,817] [INFO] [config.py:988:print]   fp16_master_weights_and_gradients  False


[2024-02-06 00:13:35,817] [INFO] [config.py:988:print]   global_rank .................. 0


[2024-02-06 00:13:35,817] [INFO] [config.py:988:print]   grad_accum_dtype ............. None


[2024-02-06 00:13:35,818] [INFO] [config.py:988:print]   gradient_accumulation_steps .. 1


[2024-02-06 00:13:35,818] [INFO] [config.py:988:print]   gradient_clipping ............ 0.0


[2024-02-06 00:13:35,818] [INFO] [config.py:988:print]   gradient_predivide_factor .... 1.0


[2024-02-06 00:13:35,819] [INFO] [config.py:988:print]   graph_harvesting ............. False


[2024-02-06 00:13:35,819] [INFO] [config.py:988:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8


[2024-02-06 00:13:35,819] [INFO] [config.py:988:print]   initial_dynamic_scale ........ 65536


[2024-02-06 00:13:35,819] [INFO] [config.py:988:print]   load_universal_checkpoint .... False


[2024-02-06 00:13:35,820] [INFO] [config.py:988:print]   loss_scale ................... 0


[2024-02-06 00:13:35,820] [INFO] [config.py:988:print]   memory_breakdown ............. False


[2024-02-06 00:13:35,820] [INFO] [config.py:988:print]   mics_hierarchial_params_gather  False


[2024-02-06 00:13:35,821] [INFO] [config.py:988:print]   mics_shard_size .............. -1


[2024-02-06 00:13:35,821] [INFO] [config.py:988:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=True, group=None, team=None, project='WordPlay') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=True


[2024-02-06 00:13:35,821] [INFO] [config.py:988:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}


[2024-02-06 00:13:35,821] [INFO] [config.py:988:print]   optimizer_legacy_fusion ...... False


[2024-02-06 00:13:35,822] [INFO] [config.py:988:print]   optimizer_name ............... None


[2024-02-06 00:13:35,822] [INFO] [config.py:988:print]   optimizer_params ............. None


[2024-02-06 00:13:35,822] [INFO] [config.py:988:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}


[2024-02-06 00:13:35,822] [INFO] [config.py:988:print]   pld_enabled .................. False


[2024-02-06 00:13:35,823] [INFO] [config.py:988:print]   pld_params ................... False


[2024-02-06 00:13:35,823] [INFO] [config.py:988:print]   prescale_gradients ........... False


[2024-02-06 00:13:35,823] [INFO] [config.py:988:print]   scheduler_name ............... None


[2024-02-06 00:13:35,823] [INFO] [config.py:988:print]   scheduler_params ............. None


[2024-02-06 00:13:35,824] [INFO] [config.py:988:print]   seq_parallel_communication_data_type  torch.float32


[2024-02-06 00:13:35,824] [INFO] [config.py:988:print]   sparse_attention ............. None


[2024-02-06 00:13:35,824] [INFO] [config.py:988:print]   sparse_gradients_enabled ..... False


[2024-02-06 00:13:35,824] [INFO] [config.py:988:print]   steps_per_print .............. 100


[2024-02-06 00:13:35,824] [INFO] [config.py:988:print]   train_batch_size ............. 64


[2024-02-06 00:13:35,825] [INFO] [config.py:988:print]   train_micro_batch_size_per_gpu  64


[2024-02-06 00:13:35,825] [INFO] [config.py:988:print]   use_data_before_expert_parallel_  False


[2024-02-06 00:13:35,825] [INFO] [config.py:988:print]   use_node_local_storage ....... False


[2024-02-06 00:13:35,826] [INFO] [config.py:988:print]   wall_clock_breakdown ......... False


[2024-02-06 00:13:35,826] [INFO] [config.py:988:print]   weight_quantization_config ... None


[2024-02-06 00:13:35,826] [INFO] [config.py:988:print]   world_size ................... 1


[2024-02-06 00:13:35,826] [INFO] [config.py:988:print]   zero_allow_untested_optimizer  False


[2024-02-06 00:13:35,827] [INFO] [config.py:988:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_ga

[2024-02-06 00:13:35,827] [INFO] [config.py:988:print]   zero_enabled ................. False


[2024-02-06 00:13:35,827] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True


[2024-02-06 00:13:35,827] [INFO] [config.py:988:print]   zero_optimization_stage ...... 0


[2024-02-06 00:13:35,828] [INFO] [config.py:974:print_user_config]   json = {
    "dump_state": false, 
    "gradient_accumulation_steps": 1, 
    "wall_clock_breakdown": false, 
    "flops_profiler": {
        "enabled": false, 
        "profile_step": 1
    }, 
    "wandb": {
        "enabled": true, 
        "project": "WordPlay"
    }, 
    "train_micro_batch_size_per_gpu": 64, 
    "steps_per_print": 100
}
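The batch-size fields in the configuration printed above are tied together by DeepSpeed's rule `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. A quick check with the values from this run (the function name is illustrative only):

```python
def effective_batch_size(
    micro_batch_per_gpu: int,
    grad_accum_steps: int,
    world_size: int,
) -> int:
    """Global number of samples consumed per optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values reported in the log above: micro batch 64, 1 accumulation step, 1 rank.
batch = effective_batch_size(micro_batch_per_gpu=64, grad_accum_steps=1, world_size=1)
print(batch)
```

With a single rank and no gradient accumulation, the global batch equals the micro batch, which matches the `train_batch_size` entry in the log.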




## Prompt (**prior** to training)

In [6]:
query = "What is an LLM?"
outputs = trainer.evaluate(
    query,
    num_samples=1,
    max_new_tokens=32,
    top_k=1,
    display=False
)
log.info(f"['prompt']: '{query}'")
log.info("['response']:\n\n" + fr"{outputs['0']['raw']}")
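The call above passes `top_k=1`. Assuming `trainer.evaluate` follows nanoGPT-style top-k sampling (an assumption, since its implementation is not shown here), only the `k` highest-scoring tokens stay eligible at each step, so `top_k=1` always picks the single most likely token. A minimal pure-Python sketch of that idea, with `top_k_sample` as a hypothetical helper:

```python
import math
import random


def top_k_sample(logits, top_k, rng=random):
    """Sample an index from `logits`, keeping only the `top_k` largest entries."""
    # Threshold: the k-th largest logit (ties at the threshold are all kept).
    kth = sorted(logits, reverse=True)[top_k - 1]
    # Mask everything below the threshold so it gets zero probability.
    masked = [x if x >= kth else -math.inf for x in logits]
    # Numerically stable, unnormalized softmax weights.
    peak = max(masked)
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(top_k_sample(logits, top_k=1))  # greedy: always the argmax index
```

Larger values of `top_k` let lower-ranked tokens through, which makes the generated text more varied.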

## Train Model

|  name  |          description            |
|:------:|:-------------------------------:|
| `step` | Current training step           |
| `loss` | Loss value                      |
| `dt`   | Time per step (in **ms**)       |
| `sps`  | Samples per second              |
| `mtps` | Tokens per second (in millions) |
| `mfu`  | Model FLOPs utilization[^1]     |

[^1]: in units of A100 `bfloat16` peak FLOPS
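The throughput columns in the table can be derived from the step time. The sketch below shows one plausible way to compute them; it is not taken from the `wordplay` source. It assumes `dt` is the wall-clock time of one step in milliseconds, that each sample holds `block_size` tokens (a hypothetical parameter name), and that the A100 `bfloat16` peak is 312 TFLOP/s. The per-token FLOP count must be supplied by the caller.

```python
A100_BF16_PEAK_FLOPS = 312e12  # A100 dense bfloat16 peak, in FLOP/s


def throughput_metrics(dt_ms, batch_size, block_size, flops_per_token=None):
    """Derive sps, mtps and (optionally) mfu from the time of one step."""
    dt_s = dt_ms / 1000.0
    sps = batch_size / dt_s   # samples per second
    tps = sps * block_size    # tokens per second
    metrics = {"dt": dt_ms, "sps": sps, "mtps": tps / 1e6}
    if flops_per_token is not None:
        # Achieved FLOP/s as a fraction of the A100 bfloat16 peak.
        metrics["mfu"] = tps * flops_per_token / A100_BF16_PEAK_FLOPS
    return metrics


# Illustrative numbers only: 64 samples per step, 250 ms per step.
example = throughput_metrics(dt_ms=250.0, batch_size=64, block_size=256)
print(example)
```

As a sanity check against this run, a batch of 64 at roughly 3.2 steps per second gives about 207 samples per second, which agrees with the `RunningAvgSamplesPerSec` value DeepSpeed reports in the training log.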

In [7]:
trainer.config.device_type

'mps'

In [8]:
trainer.train()

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 1/2000 [00:44<24:45:30, 44.59s/it]

  0%|          | 10/2000 [00:47<28:54,  1.15it/s]

  2%|▎         | 50/2000 [01:00<10:04,  3.23it/s]

  5%|▍         | 99/2000 [01:15<09:49,  3.23it/s]

[2024-02-06 00:16:36,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[0.00099, 0.00099], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:16:36,472] [INFO] [timer.py:260:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=207.14260085739843, CurrSamplesPerSec=206.924308390582, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


  5%|▌         | 100/2000 [01:15<09:52,  3.21it/s]

  8%|▊         | 150/2000 [01:31<09:28,  3.26it/s]

 10%|▉         | 199/2000 [01:46<09:23,  3.20it/s]

[2024-02-06 00:17:07,433] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=0, lr=[0.0009990938195679395, 0.0009990938195679395], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:17:07,434] [INFO] [timer.py:260:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=207.45346770870768, CurrSamplesPerSec=210.8930707521933, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 10%|█         | 200/2000 [01:46<09:20,  3.21it/s]

 12%|█▎        | 250/2000 [02:02<09:04,  3.21it/s]

 14%|█▍        | 284/2000 [02:12<08:51,  3.23it/s]

 14%|█▍        | 285/2000 [02:13<08:50,  3.23it/s]

 14%|█▍        | 286/2000 [02:13<08:48,  3.24it/s]

 14%|█▍        | 287/2000 [02:13<08:47,  3.24it/s]

 14%|█▍        | 288/2000 [02:14<08:45,  3.26it/s]

 14%|█▍        | 289/2000 [02:14<08:48,  3.24it/s]

 14%|█▍        | 290/2000 [02:14<08:49,  3.23it/s]

 15%|█▍        | 291/2000 [02:14<08:47,  3.24it/s]

 15%|█▍        | 292/2000 [02:15<08:49,  3.23it/s]

 15%|█▍        | 293/2000 [02:15<08:48,  3.23it/s]

 15%|█▍        | 294/2000 [02:15<08:49,  3.22it/s]

 15%|█▍        | 295/2000 [02:16<08:46,  3.24it/s]

 15%|█▍        | 296/2000 [02:16<08:46,  3.24it/s]

 15%|█▍        | 297/2000 [02:16<08:44,  3.25it/s]

 15%|█▍        | 298/2000 [02:17<08:43,  3.25it/s]

 15%|█▍        | 299/2000 [02:17<08:42,  3.26it/s]

[2024-02-06 00:17:38,364] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=0, lr=[0.0009963423087899531, 0.0009963423087899531], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:17:38,364] [INFO] [timer.py:260:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=207.61764617080723, CurrSamplesPerSec=209.02118434884173, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 15%|█▌        | 300/2000 [02:17<08:43,  3.25it/s]

[2024-02-06 00:18:09,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=0, lr=[0.000991756681725024, 0.000991756681725024], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:18:09,364] [INFO] [timer.py:260:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=207.58147642554775, CurrSamplesPerSec=204.92460701888584, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 20%|██        | 400/2000 [02:48<08:19,  3.20it/s]

[2024-02-06 00:18:40,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=0, lr=[0.0009853557816983753, 0.0009853557816983753], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:18:40,332] [INFO] [timer.py:260:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=207.60501861601585, CurrSamplesPerSec=210.38102306596895, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 25%|██▌       | 500/2000 [03:19<07:43,  3.24it/s]

 25%|██▌       | 501/2000 [03:59<5:03:07, 12.13s/it]

[2024-02-06 00:19:50,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=0, lr=[0.000977165911381206, 0.000977165911381206], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:19:50,783] [INFO] [timer.py:260:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=207.54093888151758, CurrSamplesPerSec=204.7856260089166, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 30%|███       | 600/2000 [04:30<07:17,  3.20it/s]

[2024-02-06 00:20:21,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=0, lr=[0.0009672207247073707, 0.0009672207247073707], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:20:21,864] [INFO] [timer.py:260:stop] epoch=0/micro_step=700/global_step=700, RunningAvgSamplesPerSec=207.45519526072945, CurrSamplesPerSec=204.08236722523415, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 35%|███▌      | 700/2000 [05:01<06:47,  3.19it/s]

[2024-02-06 00:20:52,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=0, lr=[0.000955561088582148, 0.000955561088582148], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:20:52,945] [INFO] [timer.py:260:stop] epoch=0/micro_step=800/global_step=800, RunningAvgSamplesPerSec=207.38929085108597, CurrSamplesPerSec=203.20240419371322, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 40%|████      | 800/2000 [05:32<06:14,  3.20it/s]

[2024-02-06 00:21:24,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=0, lr=[0.0009422349149513604, 0.0009422349149513604], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:21:24,430] [INFO] [timer.py:260:stop] epoch=0/micro_step=900/global_step=900, RunningAvgSamplesPerSec=207.03749501112767, CurrSamplesPerSec=200.89226292755063, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 45%|████▌     | 900/2000 [06:03<05:55,  3.09it/s]

[2024-02-06 00:21:56,900] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=0, lr=[0.0009272969639209125, 0.0009272969639209125], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:21:56,900] [INFO] [timer.py:260:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=206.100689950668, CurrSamplesPerSec=196.1390094096294, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 50%|█████     | 1000/2000 [06:36<05:28,  3.05it/s]

 50%|█████     | 1001/2000 [07:18<3:34:56, 12.91s/it]

[2024-02-06 00:23:11,202] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=0, lr=[0.0009108086187357684, 0.0009108086187357684], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:23:11,203] [INFO] [timer.py:260:stop] epoch=0/micro_step=1100/global_step=1100, RunningAvgSamplesPerSec=205.40880321295793, CurrSamplesPerSec=199.1355040489494, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 55%|█████▌    | 1100/2000 [07:50<04:50,  3.10it/s]

[2024-02-06 00:23:43,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=0, lr=[0.0008928376335430331, 0.0008928376335430331], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:23:43,637] [INFO] [timer.py:260:stop] epoch=0/micro_step=1200/global_step=1200, RunningAvgSamplesPerSec=204.79480905791058, CurrSamplesPerSec=199.22240495882488, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 60%|██████    | 1200/2000 [08:22<04:19,  3.09it/s]

[2024-02-06 00:24:15,794] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=0, lr=[0.0008734578549756275, 0.0008734578549756275], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:24:15,795] [INFO] [timer.py:260:stop] epoch=0/micro_step=1300/global_step=1300, RunningAvgSamplesPerSec=204.41478697689357, CurrSamplesPerSec=203.3093665977445, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 65%|██████▌   | 1300/2000 [08:55<03:43,  3.13it/s]

[2024-02-06 00:24:47,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=0, lr=[0.000852748918700635, 0.000852748918700635], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:24:47,854] [INFO] [timer.py:260:stop] epoch=0/micro_step=1400/global_step=1400, RunningAvgSamplesPerSec=204.13720172786063, CurrSamplesPerSec=200.50557180983168, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 70%|███████   | 1400/2000 [09:27<03:11,  3.13it/s]

[2024-02-06 00:25:19,776] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=0, lr=[0.000830795922179262, 0.000830795922179262], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:25:19,777] [INFO] [timer.py:260:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=203.9562791046921, CurrSamplesPerSec=200.72266767363695, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 75%|███████▌  | 1500/2000 [09:59<02:39,  3.14it/s]

 75%|███████▌  | 1501/2000 [10:40<1:44:34, 12.57s/it]

[2024-02-06 00:26:31,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=0, lr=[0.0008076890749831181, 0.0008076890749831181], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:26:31,224] [INFO] [timer.py:260:stop] epoch=0/micro_step=1600/global_step=1600, RunningAvgSamplesPerSec=204.32901322912141, CurrSamplesPerSec=209.66343933841333, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 80%|████████  | 1600/2000 [11:10<02:02,  3.26it/s]

[2024-02-06 00:27:01,887] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=0, lr=[0.0007835233281037354, 0.0007835233281037354], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:27:01,888] [INFO] [timer.py:260:stop] epoch=0/micro_step=1700/global_step=1700, RunningAvgSamplesPerSec=204.64819242868523, CurrSamplesPerSec=206.7816469323095, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 85%|████████▌ | 1700/2000 [11:41<01:33,  3.22it/s]

[2024-02-06 00:27:32,572] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=0, lr=[0.0007583979837785775, 0.0007583979837785775], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:27:32,574] [INFO] [timer.py:260:stop] epoch=0/micro_step=1800/global_step=1800, RunningAvgSamplesPerSec=204.92245255207052, CurrSamplesPerSec=198.18662225037616, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


 90%|█████████ | 1800/2000 [12:11<01:03,  3.17it/s]

[2024-02-06 00:28:03,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=0, lr=[0.0007324162874368463, 0.0007324162874368463], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:28:03,890] [INFO] [timer.py:260:stop] epoch=0/micro_step=1900/global_step=1900, RunningAvgSamplesPerSec=204.95530331939582, CurrSamplesPerSec=208.8582285808542, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


[2024-02-06 00:28:34,236] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.000705685003441866, 0.000705685003441866], mom=[(0.9, 0.99), (0.9, 0.99)]


[2024-02-06 00:28:34,237] [INFO] [timer.py:260:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=205.29878533900109, CurrSamplesPerSec=212.37335113953634, MemAllocated=0.9GB, MaxMemAllocated=6.33GB


100%|██████████| 2000/2000 [13:13<00:00,  2.52it/s]




## Evaluate Model
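
The evaluation cell below passes `top_k=2` and `max_new_tokens=128` to `trainer.evaluate`. Assuming `top_k` follows the usual nanoGPT-style convention (keep only the `k` highest-scoring tokens at each step, renormalize, then sample), the idea can be sketched in plain Python. The helper names `top_k_filter`, `softmax`, and `sample_top_k` and the example logits are hypothetical illustrations, not part of `wordplay`:

```python
import math
import random


def top_k_filter(logits, k):
    """Return a copy of `logits` with everything below the k-th largest set to -inf."""
    k = min(k, len(logits))
    # The k-th largest value is the cutoff; ties with the cutoff are kept.
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits, k, rng=random):
    """Sample one token index from the renormalized top-k distribution."""
    probs = softmax(top_k_filter(logits, k))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for a 4-token vocabulary (illustrative only).
logits = [2.0, 0.5, 1.0, -1.0]
probs = softmax(top_k_filter(logits, k=2))
```

With `k=2`, only the two highest-scoring entries (indices 0 and 2 here) keep nonzero probability, so generation is close to greedy decoding while still leaving some randomness between the two candidates.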

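In [9]:
import time

# Prompt used to seed generation
query = "What is an LLM?"
t0 = time.perf_counter()  # start wall-clock timer
outputs = trainer.evaluate(
    query,
    num_samples=1,       # generate a single completion
    max_new_tokens=128,  # cap on the number of generated tokens
    top_k=2,             # sampling restricted to the 2 top-scoring tokens
    display=False,
)
log.info(f'took: {time.perf_counter() - t0:.4f}s')
log.info(f"['prompt']: '{query}'")
# The single sample is stored under key '0'; 'raw' holds the generated text
log.info("['response']:\n\n" + fr"{outputs['0']['raw']}")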