# Setup
-  Follow the setup instructions based on your preferred environment!

## Local

One of our key goals in designing this assignment is to allow you to complete most of the preliminary implementation work locally.  
We highly recommend that you **pass all tests locally** using the provided `hw4_data_subset` before moving to a GPU runtime.  
To do this, simply:

### Create a new conda environment
```bash
# Be sure to deactivate any active environments first
conda create -n hw4 python=3.12.4
```

### Activate the conda environment
```bash
conda activate hw4
```

### Install the dependencies using the provided `requirements.txt`
```bash
pip install --no-cache-dir --ignore-installed -r requirements.txt
```

### Ensure that your notebook is in the same working directory as the `Handout`
This can be achieved by:
1. Physically moving the notebook into the handout directory.
2. Changing the notebook's current working directory to the handout directory with `os.chdir()`.
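
As a minimal sketch of the second option: the helper below switches the working directory and also puts the handout on `sys.path` so that `hw4lib` and `mytorch` can be imported. The function name `enter_handout` and the example path are hypothetical, not part of the handout.

```python
import os
import sys
from pathlib import Path


def enter_handout(handout_dir):
    """Switch the working directory to the handout and make it importable."""
    handout = Path(handout_dir).expanduser().resolve()
    if not handout.is_dir():
        raise FileNotFoundError(f"Handout directory not found: {handout}")
    os.chdir(handout)
    # Put the handout first on the import path so its packages are found.
    if str(handout) not in sys.path:
        sys.path.insert(0, str(handout))
    return handout


# Hypothetical location; replace it with wherever you unpacked the handout.
# enter_handout("~/IDL-HW4")
```

Raising early on a missing directory tends to be easier to debug than a later `ModuleNotFoundError`.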

### Open the notebook and select the newly created environment from the kernel selector.

If everything was done correctly, you should see at least the following files in your current working directory after running `!ls`:
```
.
├── README.md
├── requirements.txt
├── hw4lib/
├── mytorch/
├── tests/
└── hw4_data_subset/
```
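
The same check can be scripted instead of reading the `!ls` output by eye. This is a sketch only: the expected names are copied from the tree above, and `missing_entries` is a hypothetical helper.

```python
from pathlib import Path

# Entries copied from the directory tree shown above.
EXPECTED = [
    "README.md",
    "requirements.txt",
    "hw4lib",
    "mytorch",
    "tests",
    "hw4_data_subset",
]


def missing_entries(directory="."):
    """Return the expected files or folders that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED if not (root / name).exists()]


missing = missing_entries()
print("Missing:", missing if missing else "nothing, the layout looks right")
```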

## Colab

### Step 1: Get your handout
- See writeup for recommended approaches.

In [None]:
from google.colab import userdata

# Example: My preferred approach
import os
# Settings -> Developer Settings -> Personal Access Tokens -> Token (classic)
os.environ['GITHUB_TOKEN'] = "your_token_here"

GITHUB_USERNAME = "ranai-srivastav"
REPO_NAME       = "IDL4-Transfomers"
TOKEN = userdata.get('GH_PAT')

repo_url        = f"https://{TOKEN}@github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"
!git clone {repo_url}

In [None]:
if False:
    # To pull latest changes (Must be in the repo dir, use pwd/ls to verify)
    !cd {REPO_NAME} && git pull
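
A token typed into a notebook cell is easy to commit by accident. As a hedged sketch, assuming the token is already available in an environment variable (the name `GITHUB_TOKEN` follows the cell above; `build_repo_url` is a hypothetical helper), the clone URL can be assembled without the token appearing in the notebook source:

```python
import os


def build_repo_url(username, repo_name, token_env="GITHUB_TOKEN"):
    """Assemble an HTTPS clone URL using a token read from the environment."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable first")
    return f"https://{token}@github.com/{username}/{repo_name}.git"


# Usage in a notebook cell (placeholders, not real values):
# repo_url = build_repo_url("your_username_here", "your_repo_name_here")
# !git clone {repo_url}
```

Failing loudly when the variable is unset avoids cloning with a literal `None` in the URL.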

### Step 2: Install Dependencies
- `NOTE`: Your runtime will be restarted to ensure all dependencies are updated.
- `NOTE`: You will see a "runtime crashed" message. This is intentional; simply move on to the next cell.

In [None]:
%pip install --no-deps -r /content/IDL4-Transfomers/IDL-HW4/requirements.txt
import os
os.kill(os.getpid(), 9)  # NOTE: This restarts your Colab Python runtime (required)!

### Step 3: Obtain Data

- `NOTE`: This process will automatically download and unzip data for both `HW4P1` and `HW4P2`.  


In [None]:
!curl -L -o /content/f25-hw4-data.zip https://www.kaggle.com/api/v1/datasets/download/cmu11785/f25-11785-hw4-data
!unzip -q -o /content/f25-hw4-data.zip -d /content/hw4_data
!rm -rf /content/f25-hw4-data.zip
!du -h --max-depth=2 /content/hw4_data

### Step 4: Move to Handout Directory
You must be within the handout directory for the library imports to work!

- `NOTE`: You may have to rerun this command whenever you restart your runtime.
- `NOTE`: You can run `pwd` to check that you are in the right directory.
- `NOTE`: As currently set up, your data directory should be one level up from your project directory. Keep this in mind when you set `root` in the config file.

If everything was done correctly, you should see at least the following files in your current working directory after running `!ls`:
```
.
├── README.md
├── requirements.txt
├── hw4lib/
├── mytorch/
├── tests/
└── hw4_data_subset/

```

In [None]:
import os
os.chdir('IDL-HW4')
!ls

## Kaggle

While it is possible to run the notebook on Kaggle, we recommend against it. This assignment is resource intensive and may run slowly on Kaggle.

### Step 1: Get your handout
- See writeup for recommended approaches.

In [None]:
# Example: My preferred approach
import os
# Settings -> Developer Settings -> Personal Access Tokens -> Token (classic)
os.environ['GITHUB_TOKEN'] = "your_token_here"

GITHUB_USERNAME = "your_username_here"
REPO_NAME       = "your_repo_name_here"
TOKEN = os.environ.get("GITHUB_TOKEN")
repo_url        = f"https://{TOKEN}@github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"
!git clone {repo_url}

In [None]:
# To pull latest changes (Must be in the repo dir, use pwd/ls to verify)
!cd {REPO_NAME} && git pull

### Step 2: Install Dependencies
- Simply set the `Environment` setting in the notebook to `Always use latest environment`. No need to install anything.

### Step 3: Obtain Data

#### ⚠️ Important: Kaggle Users  
If you are using Kaggle, **do not manually download the data!** The dataset is large and may exceed your available disk space. Instead, follow these steps to add the dataset directly to your notebook:

1. Open your **Kaggle Notebook**.  
2. Navigate to **Notebook → Input**.  
3. Click **Add Input**.  
4. In the search bar, paste the following URL:  
   👉 [https://www.kaggle.com/datasets/cmu11785/f25-11785-hw4-data](https://www.kaggle.com/datasets/cmu11785/f25-11785-hw4-data)  
5. Click the **➕ (plus sign)** to add the dataset to your notebook.  

#### 📌 Note:  
This process will automatically download and unzip data for both `HW4P1` and `HW4P2`.  


### Step 4: Move to Handout Directory
You must be within the handout directory for the library imports to work!

- `NOTE`: You may have to rerun this command whenever you restart your runtime.
- `NOTE`: You can run `pwd` to check that you are in the right directory.
- `NOTE`: As currently set up, your data directory should be one level up from your project directory. Keep this in mind when you set `root` in the config file.

If everything was done correctly, you should see at least the following files in your current working directory after running `!ls`:
```
.
├── README.md
├── requirements.txt
├── hw4lib/
├── mytorch/
├── tests/
└── hw4_data_subset/

```

In [None]:
import os
os.chdir('IDL-HW4')
!ls

## PSC

### 1️⃣ **Step 1: Setting Up Your Environment on Bridges2**

❗️⚠️ For this homework, we are **providing shared Datasets and a shared Conda environment** for the entire class.

❗️⚠️ PSC users **do not need to download the data** or **manually install the packages**!


Follow these steps to set up the environment and start a Jupyter notebook on Bridges2:

To run your notebook more efficiently on PSC, we need to use a **Jupyter Server** hosted on a compute node.

You can use your preferred way of connecting to the Jupyter Server. Your options are covered in the docs linked in post 558 on Piazza.

**The recommended way of connecting is:**

#### **Connect in VSCode**
SSH into Bridges2 and navigate to your **Jet directory** (`/jet/home/<your_psc_username>`). Upload your notebook there, and then connect to the Jupyter Server from that directory.

#### **1. SSH into Bridges2**
1）Open VS Code and click on the `Extensions` icon in the left sidebar. Make sure the "**Remote - SSH**" extension is installed.

2）Open the command palette (**Shift+Command+P** on Mac, **Ctrl+Shift+P** on Windows). A search box will appear at the top center. Choose `"Remote-SSH: Add New SSH Host"`, then enter:

```bash
ssh <your_username>@bridges2.psc.edu #change <your_username> to your username
```

Next, choose `"/Users/<your_username>/.ssh/config"` as the config file. A dialog will appear in the bottom right saying "Host Added". Click `"Connect"`, and then enter your password.

(Note: After adding the host once, you can later use `"Remote-SSH: Connect to Host"` and select "bridges2.psc.edu" from the list.)

3）Once connected, click `"Explorer"` in the left sidebar > "Open Folder", and navigate to your home directory under the project grant:
```bash
/jet/home/<your_username>  #change <your_username> to your username
```

4）You can now drag your notebook files directly into the right-hand pane (your remote home directory), or upload them using `scp` into your folder.

> ❗️⚠️ The following steps should be executed in the **VSCode integrated terminal**.

#### **2. Navigate to Your Directory**
Make sure to use `/jet/home/<your_username>` as your working directory, since all subsequent operations (up to submission) are based on this path.
```bash
cd /jet/home/<your_username>  #change <your_username> to your username
```

#### **3. Request a Compute Node**
```bash
interact -p GPU-shared --gres=gpu:v100-32:1 -t 8:00:00 -A cis250019p
```

#### **4. Load the Anaconda Module**
```bash
module load anaconda3
```

#### **5. Activate the provided HW4 Environment**
```bash
conda deactivate # First, deactivate any existing Conda environment
conda activate /ocean/projects/cis250019p/mzhang23/TA/HW4/envs/hw4_env && export PYTHONNOUSERSITE=1
```

#### **6. Start Jupyter Notebook**
Launch Jupyter Notebook:
```bash
jupyter notebook --no-browser --ip=0.0.0.0
```

Go to **Kernel** → **Select Another Kernel** → **Existing Jupyter Server**
   Enter the URL of the Jupyter Server: `http://{hostname}:{port}/tree?token={token}`
   
   *(Usually, this URL appears in the terminal output after you run `jupyter notebook --no-browser --ip=0.0.0.0`, in a line like:  “Jupyter Server is running at: http://...”)*

   - eg: `http://v011.ib.bridges2.psc.edu:8888/tree?token=e4b302434e68990f28bc2b4ae8d216eb87eecb7090526249`

> **Note**: Replace `{hostname}`, `{port}` and `{token}` with your actual values from the Jupyter output.

After launching the Jupyter notebook, you can run the cells directly inside the notebook — no need to use the terminal for the remaining steps.

### 2️⃣ Step 2: Get Repo

In [None]:
# Make sure you are in your home directory
!pwd  # should print /jet/home/<your_username>; if not, uncomment the next line
# %cd /jet/home/<your_username>
# TODO: replace "<your_username>" with your username

In [None]:
# Example: My preferred approach
import os
# Settings -> Developer Settings -> Personal Access Tokens -> Token (classic)
os.environ['GITHUB_TOKEN'] = "your_token_here"

GITHUB_USERNAME = "ranai-srivastav"
REPO_NAME       = "IDL4-Transfomers"
TOKEN = os.environ.get("GITHUB_TOKEN")  # Read back the token set above (google.colab is not available on PSC)

repo_url        = f"https://{TOKEN}@github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"
!git clone {repo_url}

In [None]:
# To pull latest changes (Must be in the repo dir, use pwd/ls to verify)
!cd {REPO_NAME} && git pull

#### **Move to Project Directory**
- `NOTE`: You may have to repeat this whenever you restart your runtime. You can run `pwd` or `ls` to check that you are in the right directory.

In [None]:
import os
os.chdir('IDL-HW4')
!ls

### 3️⃣ **Step 3: Set up Kaggle API Authentication**

In [None]:
# TODO: Use the same Kaggle code from HW3P2
!mkdir -p /jet/home/<your_username>/.kaggle  # TODO: replace "<your_username>" with yours (-p avoids an error if the folder exists)

with open("/jet/home/<your_username>/.kaggle/kaggle.json", "w+") as f:  # TODO: replace "<your_username>" with yours
    f.write('{"username":"<your_username>","key":"<your_key>"}')
    # TODO: Put your Kaggle username & key here

!chmod 600 /jet/home/<your_username>/.kaggle/kaggle.json  # TODO: replace "<your_username>" with yours

### 4️⃣ **Step 4: Get Data**

❗️⚠️ The data used in this assignment is **already stored in a shared, read-only folder, so you do not need to manually download anything**.

Instead, just make sure to replace the dataset path in your notebook code with the correct path from the shared directory.

You can run the following block to explore the shared directory structure:

In [None]:
import os
data_path = "/ocean/projects/cis250019p/mzhang23/TA/HW4/hw4p1_data"  # Shared data path; do not change the username to yours
print("Files in shared hw4p1 dataset:", os.listdir(data_path))

In [None]:
!apt-get install tree
!tree -L 2 /ocean/projects/cis250019p/mzhang23/TA/HW4/hw4p1_data

# Imports

- If your setup was done correctly, you should be able to run the following cell without any issues.

In [None]:
import os
import sys

sys.path.append('/jet/home/srivastr/idl/IDL4-Transfomers/IDL-HW4')
os.chdir("/jet/home/srivastr/idl/IDL4-Transfomers/IDL-HW4")

In [None]:
print("hello world")

In [None]:
from hw4lib.data import (
    H4Tokenizer,
    LMDataset,
    verify_dataloader
)
from hw4lib.model import (
    CausalMask,
    PadMask,
    PositionalEncoding,
    DecoderOnlyTransformer
)
from hw4lib.utils import (
    create_optimizer,
    create_scheduler,
    plot_lr_schedule
)
from hw4lib.trainers import (
    LMTrainer,
)
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import yaml
import gc
import torch
from torchinfo import summary
import os
import json
import tarfile
import shutil
import wandb
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Implementations

- `NOTE`: All of these implementations have detailed specifications, implementation details, and hints in their respective source files. Make sure to read all of them in their entirety!

## MyTorch Implementations
- Modify your `Linear` implementation from HW1P1 to support an arbitrary number of dimensions in `mytorch/nn/linear.py`.
- Modify your `Softmax` implementation from HW1P1 to support an arbitrary number of dimensions in `mytorch/nn/activation.py`.
- Implement the `ScaledDotProductAttention` class in `mytorch/nn/scaled_dot_product_attention.py`.
- Implement the `MultiHeadAttention` class in `mytorch/nn/multi_head_attention.py`.
- Run the cell below to check your implementations.


In [None]:
!python -m tests.test_mytorch

## Dataset Implementation
- Familiarize yourself with the `tokenize`, `encode`, and `decode` methods of the `H4Tokenizer` class in `hw4lib/data/tokenizer.py`. You will need these methods in both `HW4P1` and `HW4P2`, in the dataset implementations as well as during decoding.
- Implement the `LMDataset` class in `hw4lib/data/lm_dataset.py`.
    - You will have to implement parts of `__init__` and completely implement the `__len__`, `__getitem__` and `collate_fn` methods.
- Run the cell below to check your implementation.


In [None]:
!python -m tests.test_dataset_lm

## Model Implementations
#### Overview:
- Implement the `CausalMask` and `PadMask` functions in `hw4lib/model/masks.py` to handle masking.
- Implement the `PositionalEncoding` class in `hw4lib/model/positional_encoding.py` to handle positional encoding.
- Implement the Transformer Sublayers: `SelfAttentionLayer` and `FeedForwardLayer` classes in `hw4lib/model/sublayers.py`.
- Implement the Transformer Layer: `SelfAttentionDecoderLayer` class in `hw4lib/model/decoder_layers.py`.
- Implement the `DecoderOnlyTransformer` class in `hw4lib/model/transformers.py`.
- Run the cells below to check your implementation.
- `NOTE`: Besides the `DecoderOnlyTransformer` (P1 mandatory, P2 optional), you will use all of the above implementations in both `HW4P1` and `HW4P2`!

### Masks
- Implement the `PadMask` and `CausalMask` functions in `hw4lib/model/masks.py`.
- Run the cell below to check your implementation.
- You will need to make use of these masks in both `HW4P1` and `HW4P2`.
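
To make the expected mask values concrete, here is a pure-Python illustration on plain lists. It assumes the convention used by the optional visualization cell later in this section (0 = attend, 1 = ignore). The lowercase helpers `causal_mask` and `pad_mask` are hypothetical and are not the tensor-based `CausalMask` and `PadMask` you are asked to implement; check the source files for the exact shapes and dtypes required.

```python
def causal_mask(seq_len):
    """True (1) marks key positions a query must NOT attend to: the future."""
    return [[key > query for key in range(seq_len)] for query in range(seq_len)]


def pad_mask(lengths, max_len):
    """True (1) marks padded positions beyond each sequence's real length."""
    return [[pos >= length for pos in range(max_len)] for length in lengths]


# Each row is a query position; ones appear strictly above the diagonal.
for row in causal_mask(4):
    print([int(v) for v in row])

# Two sequences of real length 2 and 4, padded to length 4.
for row in pad_mask([2, 4], max_len=4):
    print([int(v) for v in row])
```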

#### Causal Mask

In [None]:
!python -m tests.test_mask_causal

#### Padding Mask

In [None]:
!python -m tests.test_mask_padding

#### Optional: Visualize your Masks

In [None]:
# Dummy data
_d_model   = 64
_x         = torch.zeros(4, 20, _d_model)
_x_len     = torch.tensor([5, 15, 10, 20])
_x_causal  = CausalMask(_x)
_x_padding = PadMask(_x, _x_len)

# Create figure with two subplots side by side
fig, mask_axs = plt.subplots(1, 2, figsize=(12, 4))

# Plot masks
masks_and_titles = [
    (_x_padding, "Padding Mask"),
    (_x_causal, "Causal Mask")
]

# Plot each mask
images = []
for i, (mask, title) in enumerate(masks_and_titles):
    im = mask_axs[i].imshow(mask, cmap="gray", aspect='auto')
    mask_axs[i].set_title(title, fontsize=8)
    images.append(im)

# Add colorbar at the bottom
fig.subplots_adjust(bottom=0.2)  # Make space for colorbar
cbar_ax = fig.add_axes([0.15, 0.1, 0.7, 0.02])  # [left, bottom, width, height]
cbar = plt.colorbar(images[0], cax=cbar_ax, orientation='horizontal')
cbar.ax.set_xlabel('Mask Values', labelpad=5, fontsize=8)
cbar.set_ticks([0, 1])
cbar.set_ticklabels(['Attend (0)', 'Ignore/Mask (1)'])
cbar.ax.tick_params(labelsize=6)

plt.show()

### Positional Encoding
- Implement the `PositionalEncoding` class in `hw4lib/model/positional_encoding.py`.
- Run the cell below to check your implementation.
- You will need to make use of this positional encoding in both `HW4P1` and `HW4P2`.

In [None]:
!python -m tests.test_positional_encoding

#### Optional: Visualize your Positional Encoding

In [None]:
# Create sample positional encoding
d_model = 64
max_len = 100
pos_encoding = PositionalEncoding(d_model=d_model, max_len=max_len)
pe = pos_encoding.pe.squeeze(0).numpy()  # Remove batch dimension and convert to numpy

# Create figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Positional encoding matrix
im = ax1.imshow(pe, aspect='auto', cmap='RdBu',
                extent=[0, d_model, max_len, 0])  # Flip y-axis to show position top-to-bottom
plt.colorbar(im, ax=ax1, label='Encoding Value')
ax1.set_xlabel('Dimension')
ax1.set_ylabel('Position')
ax1.set_title('Positional Encoding Matrix')
ax1.grid(False)

# Plot 2: Sinusoidal patterns
dimensions = [0, 15, 31, 47, 63]  # Plot a spread of dimensions, from low to high index
for dim in dimensions:
    ax2.plot(pe[:, dim], label=f'dim {dim}')
ax2.set_xlabel('Position')
ax2.set_ylabel('Encoding Value')
ax2.set_title('Sinusoidal Patterns for Different Dimensions')
ax2.legend()
ax2.grid(True)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

### Transformer Sublayers
- Implement the Transformer Sublayers: `SelfAttentionLayer` and `FeedForwardLayer` classes in `hw4lib/model/sublayers.py`.
- Run the cell below to check your implementation.
- You will need to make use of all of these sublayers in both `HW4P1` and `HW4P2`.

In [None]:
!python -m tests.test_sublayer_selfattention

In [None]:
!python -m tests.test_sublayer_feedforward

### Transformer Self-Attention Decoder Layer
- Implement the Transformer Layer: `SelfAttentionDecoderLayer` class in `hw4lib/model/decoder_layers.py`.
- Run the cell below to check your implementation.
- You will need to make use of this layer in `HW4P2`.

In [None]:
!python -m tests.test_decoderlayer_selfattention

### Decoder-Only Transformer

- Implement the `DecoderOnlyTransformer` class in `hw4lib/model/transformers.py`.
- Run the cell below to check your implementation.
- You will need to make use of it in `HW4P1` and optionally in `HW4P2`.

In [None]:
!python -m tests.test_transformer_decoder_only

## Decoding Implementation
- Implement the `generate_greedy` method of the `SequenceGenerator` class in `hw4lib/decoding/sequence_generator.py`.
- Run the cell below to check your implementation.

In [None]:
!python -m tests.test_decoding --mode greedy
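
The individual test cells above can also be run in one pass. The sketch below assumes you are in the handout directory and reuses the module names from the test cells in this notebook; `build_command` and `run_all` are hypothetical helpers, not part of `hw4lib`.

```python
import subprocess
import sys

# Test modules referenced by the cells in this notebook, in order.
TEST_COMMANDS = [
    ["tests.test_mytorch"],
    ["tests.test_dataset_lm"],
    ["tests.test_mask_causal"],
    ["tests.test_mask_padding"],
    ["tests.test_positional_encoding"],
    ["tests.test_sublayer_selfattention"],
    ["tests.test_sublayer_feedforward"],
    ["tests.test_decoderlayer_selfattention"],
    ["tests.test_transformer_decoder_only"],
    ["tests.test_decoding", "--mode", "greedy"],
]


def build_command(test):
    """Return the argv equivalent of `!python -m <module> [args]`."""
    return [sys.executable, "-m", *test]


def run_all(stop_on_failure=False):
    """Run every test module and return the names that exited non-zero."""
    failed = []
    for test in TEST_COMMANDS:
        result = subprocess.run(build_command(test))
        if result.returncode != 0:
            failed.append(test[0])
            if stop_on_failure:
                break
    return failed
```

Calling `run_all()` returns an empty list when every module passes, which makes it a convenient last check before moving to a GPU runtime.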

## Trainer Implementation
You will have to do some minor in-filling for the `LMTrainer` class in `hw4lib/trainers/lm_trainer.py` before you can use it.
- Fill in the `TODO`s in the `__init__`.
- Fill in the `TODO`s in the `_train_epoch`.
- Fill in the `TODO`s in the `_validate_epoch`.
- Fill in the `TODO`s in the `generate` method.
- Fill in the `TODO`s in the `train` method.

`WARNING`: There are no tests for this. Implement carefully!

# Experiments
From this point onwards you may want to switch to a `GPU` runtime.
- `OBJECTIVE`: You must achieve a per-character perplexity ≤ 3.5 in order to get points for Task 2.
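
Perplexity is the exponential of the mean cross-entropy loss (in nats), so the 3.5 target corresponds to a mean loss of roughly 1.25. The sketch below works through that arithmetic; the sample loss value is taken from the training log at the end of this notebook. The `per_char_perplexity` normalization is an assumption about how a per-character figure could be derived when the tokenizer is not character-level, so check the trainer code for the exact definition used in grading.

```python
import math


def perplexity(mean_ce_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_ce_loss)


def max_loss_for(target_perplexity):
    """Largest mean loss that still meets a perplexity target."""
    return math.log(target_perplexity)


def per_char_perplexity(total_nll, num_chars):
    """Perplexity per character from a summed negative log-likelihood."""
    return math.exp(total_nll / num_chars)


print(f"perplexity at loss 3.1348: {perplexity(3.1348):.2f}")
print(f"loss needed for perplexity 3.5: {max_loss_for(3.5):.4f}")
```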

## Config
- You can use the `config.yaml` file to set your config for your ablation study.

---
### Notes:

- Set `tokenization: token_type:` to specify your desired tokenization strategy
- You will need to set the root path to your `hw4p1_data` folder in `data: root:`. This will depend on your setup. For example, if you are following our setup instructions:
  - `PSC`: `"/ocean/projects/cis250019p/mzhang23/TA/HW4/hw4p1_data"`
  - `Colab`: `"/content/hw4_data/hw4p1_data"`
  - `Kaggle`: `"/kaggle/input/s25-hw4-data/hw4p1_data"`
- There are extra configuration options in the `optimizer` section, which are only relevant if you decide to use the `create_optimizer` function we've provided in `hw4lib/utils/create_optimizer.py`.
- `BE CAREFUL` when setting numeric values. For example, `1e-4` is parsed as a `str`, while `1.0e-4` is parsed as a `float`.
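
If you build your own optimizer or scheduler from the config, a numeric value that arrived as a string can be coerced defensively before use. This is a minimal sketch; `as_float` is a hypothetical helper and not part of `hw4lib`.

```python
def as_float(value):
    """Coerce a config value to float, accepting numbers or numeric strings."""
    if isinstance(value, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("booleans are not valid numeric config values")
    return float(value)


# Both spellings end up as the same float.
print(as_float("1e-4"), as_float(1.0e-4))
```

Writing values with an explicit decimal point in `config.yaml` remains the simpler fix; the helper only guards against a value that slips through.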


In [22]:
%%writefile config.yaml

Name                      : "Ranai Srivastav"

###### Tokenization ------------------------------------------------------------
tokenization:
  token_type                : "char"       # [char, 1k, 5k, 10k]
  token_map :
      'char': 'hw4lib/data/tokenizer_jsons/tokenizer_char.json'
      '1k'  : 'hw4lib/data/tokenizer_jsons/tokenizer_1000.json'
      '5k'  : 'hw4lib/data/tokenizer_jsons/tokenizer_5000.json'
      '10k' : 'hw4lib/data/tokenizer_jsons/tokenizer_10000.json'

###### Dataset -----------------------------------------------------------------
data:                    # Currently set up for PSC, assuming our setup
  root                 : "/ocean/projects/cis250019p/mzhang23/TA/HW4/hw4p1_data"  # TODO: Set the root path of your data
  train_partition      : "train"  # train
  val_partition        : "val"    # val
  test_partition       : "test"   # test
  subset               : 1.0      # Load a subset of the data (for debugging, testing, etc.)
  batch_size           : 256      #
  NUM_WORKERS          : 2        # Set to 0 for CPU

###### Network Specs -------------------------------------------------------------
model: # Decoder-Only Language Model (HW4P1)
  d_model                   : 512
  d_ff                      : 1024
  num_layers                : 4
  num_heads                 : 2
  dropout                   : 0.25
  layer_drop_rate           : 0.05
  weight_tying              : False

###### Common Training Parameters ------------------------------------------------
training:
  use_wandb                   : True   # Toggle wandb logging
  wandb_run_id                : "none" # "none" or "run_id"
  resume                      : False  # Resume an existing run (run_id != 'none')
  epochs                      : 100
  gradient_accumulation_steps : 1
  wandb_project               : "H4P1-ranais" # wandb project to log to

###### Loss ----------------------------------------------------------------------
loss: # Just good ol' CrossEntropy
  label_smoothing: 0.05

###### Optimizer -----------------------------------------------------------------
optimizer:
  name: "adamw" # Options: sgd, adam, adamw
  lr: 1.0e-4   # Base learning rate

  # Common parameters
  weight_decay: 0.0001

  # Parameter groups
  param_groups:
    - name: self_attn
      patterns: []  # Will match all parameters containing these keywords and set their learning rate to 0.0001
      lr: 0.0001    # LR for self_attn
      layer_decay:
        enabled: False
        decay_rate: 0.8

    - name: ffn
      patterns: [] # Will match all parameters containing "ffn" and set their learning rate to 0.0001
      lr: 0.0001   # LR for ffn
      layer_decay:
        enabled: False
        decay_rate: 0.8

  # Layer-wise learning rates
  layer_decay:
    enabled: False
    decay_rate: 0.75

  # SGD specific parameters
  sgd:
    momentum: 0.9
    nesterov: True
    dampening: 0

  # Adam specific parameters
  adam:
    betas: [0.9, 0.999]
    eps: 1.0e-8
    amsgrad: False

  # AdamW specific parameters
  adamw:
    betas: [0.9, 0.999]
    eps: 1.0e-8
    amsgrad: False

###### Scheduler -----------------------------------------------------------------
scheduler:
  name: "cosine"  # Options: reduce_lr, cosine, cosine_warm

  # ReduceLROnPlateau specific parameters
  reduce_lr:
    mode: "min"  # Options: min, max
    factor: 0.1  # Factor to reduce learning rate by
    patience: 10  # Number of epochs with no improvement after which LR will be reduced
    threshold: 0.0001  # Threshold for measuring the new optimum
    threshold_mode: "rel"  # Options: rel, abs
    cooldown: 0  # Number of epochs to wait before resuming normal operation
    min_lr: 0.0000001  # Minimum learning rate
    eps: 1.0e-8  # Minimal decay applied to lr

  # CosineAnnealingLR specific parameters
  cosine:
    T_max: 100  # Maximum number of iterations
    eta_min: 1.0e-8  # Minimum learning rate
    last_epoch: -1

  # CosineAnnealingWarmRestarts specific parameters
  cosine_warm:
    T_0: 4  # Number of iterations for the first restart
    T_mult: 4  # Factor increasing T_i after each restart
    eta_min: 0.0000001  # Minimum learning rate
    last_epoch: -1

  # Warmup parameters (can be used with any scheduler)
  warmup:
    enabled: True
    type: "exponential"  # Options: linear, exponential
    epochs: 5
    start_factor: 0.1
    end_factor: 1.0

Overwriting config.yaml


In [23]:
import yaml

with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

## Tokenizer

In [24]:
Tokenizer = H4Tokenizer(
    token_map  = config['tokenization']['token_map'],
    token_type = config['tokenization']['token_type']
)

                         Tokenizer Configuration (char)                         
--------------------------------------------------------------------------------
Vocabulary size:     35

Special Tokens:
PAD:              0
UNK:              1
MASK:             2
SOS:              3
EOS:              4
BLANK:            5

Validation Example:
--------------------------------------------------------------------------------
Input text:  [SOS]HI DEEP LEARNERS[EOS]
Tokens:      ['[SOS]', 'H', 'I', ' ', 'D', 'E', 'E', 'P', ' ', 'L', 'E', 'A', 'R', 'N', 'E', 'R', 'S', '[EOS]']
Token IDs:   [3, 13, 12, 6, 16, 7, 7, 25, 6, 17, 7, 9, 15, 11, 7, 15, 14, 4]
Decoded:     [SOS]HI DEEP LEARNERS[EOS]


## Datasets

In [None]:
train_dataset  = LMDataset(
    partition  = config['data']['train_partition'],
    config     = config['data'],
    tokenizer  = Tokenizer
)

val_dataset    = LMDataset(
    partition  = config['data']['val_partition'],
    config     = config['data'],
    tokenizer  = Tokenizer
)

test_dataset   = LMDataset(
    partition  = config['data']['test_partition'],
    config     = config['data'],
    tokenizer  = Tokenizer
)

gc.collect()

## Dataloaders

In [None]:
config['data']['batch_size']

In [25]:
train_loader    = DataLoader(
    dataset     = train_dataset,
    batch_size  = config['data']['batch_size'],
    shuffle     = True,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = train_dataset.collate_fn
)

val_loader      = DataLoader(
    dataset     = val_dataset,
    batch_size  = 6,
    shuffle     = False,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = val_dataset.collate_fn
)

test_loader     = DataLoader(
    dataset     = test_dataset,
    batch_size  = 6,
    shuffle     = False,
    num_workers = config['data']['NUM_WORKERS'] if device == 'cuda' else 0,
    pin_memory  = True,
    collate_fn  = test_dataset.collate_fn
)

### Dataloader Verification

In [None]:
verify_dataloader(train_loader)

In [None]:
verify_dataloader(val_loader)

In [None]:
verify_dataloader(test_loader)

## Calculate Max Transcript Length




Calculating the maximum transcript length across your dataset is a crucial step when working with transformer models that use fixed-length positional encodings.
- We'll use sinusoidal positional encodings that must be precomputed up to a fixed maximum length.
- This maximum length is a hyperparameter that determines:
  - How long of a sequence your model can process
  - The size of your positional encoding matrix
  - Memory requirements during training and inference
- `Requirements`: For this assignment, ensure your positional encodings can accommodate at least the longest sequence in your dataset to prevent truncation. However, you can set this value higher if you anticipate using your language model to work with longer sequences in future tasks (hint: this might be useful for P2! 😉).
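
To get a feel for the cost of a larger maximum length, the arithmetic below sizes the positional encoding table for a few candidate values. It assumes 32-bit floats (4 bytes per value) and uses `d_model = 512` from the config in this notebook; `pe_table_size` is a hypothetical helper.

```python
def pe_table_size(max_len, d_model, bytes_per_value=4):
    """Return (number of entries, bytes) for a max_len x d_model table."""
    entries = max_len * d_model
    return entries, entries * bytes_per_value


for max_len in (525, 1024, 2048):
    entries, size = pe_table_size(max_len, d_model=512)
    print(f"max_len={max_len:>5}: {entries:>9,} values, {size / 2**20:.2f} MiB")
```

The table itself stays small even for generous limits; most of the extra memory for longer sequences comes from attention, which grows quadratically with sequence length.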

In [26]:
max_transcript_length = max(train_dataset.text_max_len, val_dataset.text_max_len, test_dataset.text_max_len)
print("="*50)
print(f"{'Global Max Transcript Length':<30} : {max_transcript_length}")
print("="*50)

Global Max Transcript Length   : 525


## Model

In [27]:
model_config = config['model']
model_config.update({
    'max_len': max_transcript_length,
    'num_classes': Tokenizer.vocab_size
})
model = DecoderOnlyTransformer(**model_config)

# Get some inputs from the text loader
for batch in train_loader:
    shifted_transcripts, golden_transcripts, transcript_lengths = batch
    print("Shape of shifted_transcripts : ", shifted_transcripts.shape)
    print("Shape of golden_transcripts  : ", golden_transcripts.shape)
    print("Shape of transcript_lengths  : ", transcript_lengths.shape)
    break

model_stats = summary(model, input_data=[shifted_transcripts, transcript_lengths])
print(model_stats)

Shape of shifted_transcripts :  torch.Size([256, 283])
Shape of golden_transcripts  :  torch.Size([256, 283])
Shape of transcript_lengths  :  torch.Size([256])
Layer (type:depth-idx)                        Output Shape              Param #
DecoderOnlyTransformer                        [256, 283, 35]            --
├─Embedding: 1-1                              [256, 283, 512]           17,920
├─PositionalEncoding: 1-2                     [256, 283, 512]           --
├─Dropout: 1-3                                [256, 283, 512]           --
├─ModuleList: 1-4                             --                        --
│    └─SelfAttentionDecoderLayer: 2-1         [256, 283, 512]           --
│    │    └─SelfAttentionLayer: 3-1           [256, 283, 512]           1,051,648
│    │    └─FeedForwardLayer: 3-2             [256, 283, 512]           1,051,136
│    └─SelfAttentionDecoderLayer: 2-2         [256, 283, 512]           --
│    │    └─SelfAttentionLayer: 3-3           [256, 283, 512]      

## Wandb

In [None]:
wandb.login()

## Trainer

Every time you run the trainer, it will create a new directory in the `expts` folder with the following structure:
```
expts/
    └── {run_name}/
        ├── config.yaml
        ├── model_arch.txt
        ├── checkpoints/
        │   ├── checkpoint-best-metric-model.pth
        │   └── checkpoint-last-epoch-model.pth
        ├── attn/
        │   └── {attention visualizations}
        └── text/
            └── {generated text outputs}
```

In [35]:
trainer = LMTrainer(
    model=model,
    tokenizer=Tokenizer,
    config=config,
    run_name="final-train",
    config_file="config.yaml",
    device=device
)

Using device: cuda


  self.scaler = torch.cuda.amp.GradScaler()


### Setup Optimizer and Scheduler

You can set your own optimizer and scheduler by assigning the corresponding attributes on your `LMTrainer` instance.
For example:
```python
import torch.optim as optim

trainer.optimizer = optim.AdamW(model.parameters(), lr=config['optimizer']['lr'], weight_decay=config['optimizer']['weight_decay'])
trainer.scheduler = optim.lr_scheduler.CosineAnnealingLR(trainer.optimizer, T_max=config['training']['epochs'])
```

We also provide utility functions to create your optimizer and scheduler from the config, with some extra bells and whistles. You are free to use them or not. Do read their code and documentation to understand how they work (`hw4lib/utils/*`).


#### Setting up the optimizer

In [36]:
trainer.optimizer = create_optimizer(
    model=model,
    opt_config=config['optimizer']
)


🔧 Configuring Optimizer:
├── Type: ADAMW
├── Base LR: 0.0001
├── Weight Decay: 0.0001
├── Parameter Groups:
│   ├── Group: self_attn
│   │   ├── LR: 0.0001
│   │   └── Patterns: []
│   ├── Group: ffn
│   │   ├── LR: 0.0001
│   │   └── Patterns: []
│   └── Default Group (unmatched parameters)
└── AdamW Specific:
    ├── Betas: [0.9, 0.999]
    ├── Epsilon: 1e-08
    └── AMSGrad: False


#### Creating a test scheduler and plotting the learning rate schedule

In [None]:
test_scheduler = create_scheduler(
    optimizer=trainer.optimizer,
    scheduler_config=config['scheduler'],
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)

plot_lr_schedule(
    scheduler=test_scheduler,
    num_epochs=config['training']['epochs'],
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)

#### Setting up the scheduler

In [37]:
trainer.scheduler = create_scheduler(
    optimizer=trainer.optimizer,
    scheduler_config=config['scheduler'],
    train_loader=train_loader,
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps']
)


📈 Configuring Learning Rate Scheduler:
├── Type: COSINE
├── Cosine Annealing Settings:
│   ├── T_max: 100 epochs (104400 steps)
│   └── Min LR: 1e-08
├── Warmup Settings:
│   ├── Duration: 5 epochs (5220 steps)
│   ├── Start Factor: 0.1
│   └── End Factor: 1.0
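
The step counts in this summary follow from 1044 optimizer steps per epoch (the batch count shown in the training progress bar, with a gradient accumulation of 1): 100 × 1044 = 104400 and 5 × 1044 = 5220. The sketch below reproduces that arithmetic and approximates the learning rate at a given step. It assumes a linear warmup followed by cosine annealing over the remaining steps. How `create_scheduler` actually combines the two phases may differ, so treat `lr_at_step` as a hypothetical illustration.

```python
import math

STEPS_PER_EPOCH = 1044   # batches per epoch, taken from the progress bar
GRAD_ACCUM_STEPS = 1     # matches acc_step=1/1 in the log


def total_steps(epochs):
    """Number of optimizer steps for a given number of epochs."""
    return epochs * STEPS_PER_EPOCH // GRAD_ACCUM_STEPS


def lr_at_step(step, base_lr=1e-4, min_lr=1e-8, warmup_steps=5220,
               max_steps=104400, start_factor=0.1, end_factor=1.0):
    """Linear warmup, then cosine decay to min_lr (assumed composition)."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return base_lr * (start_factor + (end_factor - start_factor) * frac)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(total_steps(100), total_steps(5))
for step in (0, 5220, 104400):
    print(step, lr_at_step(step))
```

Under these assumptions the rate starts at one tenth of the base LR, reaches the base LR when warmup ends, and decays to the minimum LR at the final step.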


# Train
- Set your epochs
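
While training runs, the progress bar reports `ce_loss_token` and `perplexity_token`. The logged values are consistent with perplexity being the exponential of the mean per-token cross-entropy loss. The sketch below checks that relationship. The helper name is hypothetical, and the uniform baseline assumes the last dimension of the printed logits shape (35) is the vocabulary size.

```python
import math


def perplexity_from_loss(mean_ce_loss):
    """Perplexity as exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_ce_loss)


# First logged step: ce_loss_token=3.1348, perplexity_token=22.9831
print(perplexity_from_loss(3.1348))

# A model guessing uniformly over 35 tokens would have loss ln(35).
uniform_loss = math.log(35)
print(uniform_loss, perplexity_from_loss(uniform_loss))
```

Small differences from the logged perplexity are expected because the logged loss is rounded to four decimal places.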

In [None]:
trainer.train(train_loader, val_loader, epochs=config['training']['epochs'])

[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   0%|                                                | 1/1044 [00:00<06:40,  2.60it/s, acc_step=1/1, ce_loss_token=3.1348, perplexity_token=22.9831]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   0%|                                                | 2/1044 [00:00<06:30,  2.67it/s, acc_step=1/1, ce_loss_token=3.0845, perplexity_token=21.8564]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   0%|▏                                               | 3/1044 [00:01<06:36,  2.62it/s, acc_step=1/1, ce_loss_token=3.0616, perplexity_token=21.3609]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   0%|▏                                               | 4/1044 [00:01<05:51,  2.96it/s, acc_step=1/1, ce_loss_token=3.0775, perplexity_token=21.7047]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   0%|▏                                               | 5/1044 [00:01<06:07,  2.83it/s, acc_step=1/1, ce_loss_token=3.0618, perplexity_token=21.3663]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   1%|▎                                               | 6/1044 [00:02<05:47,  2.98it/s, acc_step=1/1, ce_loss_token=3.0716, perplexity_token=21.5755]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   1%|▎                                               | 7/1044 [00:02<05:53,  2.94it/s, acc_step=1/1, ce_loss_token=3.0590, perplexity_token=21.3060]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   1%|▎                                               | 8/1044 [00:02<05:39,  3.05it/s, acc_step=1/1, ce_loss_token=3.0609, perplexity_token=21.3460]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   1%|▍                                               | 9/1044 [00:03<05:54,  2.92it/s, acc_step=1/1, ce_loss_token=3.0502, perplexity_token=21.1198]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   1%|▍                                              | 10/1044 [00:03<06:02,  2.85it/s, acc_step=1/1, ce_loss_token=3.0421, perplexity_token=20.9488]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   1%|▍                                              | 11/1044 [00:03<06:04,  2.83it/s, acc_step=1/1, ce_loss_token=3.0335, perplexity_token=20.7706]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   1%|▌                                              | 12/1044 [00:04<06:11,  2.77it/s, acc_step=1/1, ce_loss_token=3.0255, perplexity_token=20.6040]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:   1%|▌                                              | 13/1044 [00:04<06:42,  2.56it/s, acc_step=1/1, ce_loss_token=3.0176, perplexity_token=20.4423]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   1%|▋                                              | 14/1044 [00:05<06:39,  2.58it/s, acc_step=1/1, ce_loss_token=3.0113, perplexity_token=20.3129]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   1%|▋                                              | 15/1044 [00:05<06:29,  2.64it/s, acc_step=1/1, ce_loss_token=3.0049, perplexity_token=20.1833]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   2%|▋                                              | 16/1044 [00:05<06:39,  2.58it/s, acc_step=1/1, ce_loss_token=2.9990, perplexity_token=20.0659]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   2%|▊                                              | 17/1044 [00:06<06:28,  2.64it/s, acc_step=1/1, ce_loss_token=2.9934, perplexity_token=19.9525]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   2%|▊                                              | 18/1044 [00:06<06:28,  2.64it/s, acc_step=1/1, ce_loss_token=2.9880, perplexity_token=19.8454]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   2%|▊                                              | 19/1044 [00:06<06:27,  2.65it/s, acc_step=1/1, ce_loss_token=2.9832, perplexity_token=19.7503]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:   2%|▉                                              | 20/1044 [00:07<07:00,  2.44it/s, acc_step=1/1, ce_loss_token=2.9842, perplexity_token=19.7703]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   2%|▉                                              | 21/1044 [00:07<06:56,  2.45it/s, acc_step=1/1, ce_loss_token=2.9791, perplexity_token=19.6697]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   2%|▉                                              | 22/1044 [00:08<06:53,  2.47it/s, acc_step=1/1, ce_loss_token=2.9746, perplexity_token=19.5826]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   2%|█                                              | 23/1044 [00:08<06:32,  2.60it/s, acc_step=1/1, ce_loss_token=2.9699, perplexity_token=19.4891]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|█                                              | 24/1044 [00:08<06:27,  2.63it/s, acc_step=1/1, ce_loss_token=2.9654, perplexity_token=19.4029]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   2%|█▏                                             | 25/1044 [00:09<06:22,  2.66it/s, acc_step=1/1, ce_loss_token=2.9606, perplexity_token=19.3094]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:   2%|█▏                                             | 26/1044 [00:09<06:38,  2.55it/s, acc_step=1/1, ce_loss_token=2.9563, perplexity_token=19.2270]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   3%|█▏                                             | 27/1044 [00:10<06:32,  2.59it/s, acc_step=1/1, ce_loss_token=2.9520, perplexity_token=19.1443]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   3%|█▎                                             | 28/1044 [00:10<06:27,  2.62it/s, acc_step=1/1, ce_loss_token=2.9478, perplexity_token=19.0634]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   3%|█▎                                             | 29/1044 [00:10<06:16,  2.69it/s, acc_step=1/1, ce_loss_token=2.9433, perplexity_token=18.9780]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   3%|█▎                                             | 30/1044 [00:11<06:20,  2.67it/s, acc_step=1/1, ce_loss_token=2.9393, perplexity_token=18.9017]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   3%|█▍                                             | 31/1044 [00:11<06:13,  2.71it/s, acc_step=1/1, ce_loss_token=2.9356, perplexity_token=18.8327]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   3%|█▍                                             | 32/1044 [00:11<05:56,  2.84it/s, acc_step=1/1, ce_loss_token=2.9347, perplexity_token=18.8154]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   3%|█▍                                             | 33/1044 [00:12<05:35,  3.01it/s, acc_step=1/1, ce_loss_token=2.9342, perplexity_token=18.8056]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   3%|█▌                                             | 34/1044 [00:12<05:53,  2.86it/s, acc_step=1/1, ce_loss_token=2.9303, perplexity_token=18.7339]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   3%|█▌                                             | 35/1044 [00:12<06:10,  2.72it/s, acc_step=1/1, ce_loss_token=2.9266, perplexity_token=18.6648]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   3%|█▌                                             | 36/1044 [00:13<06:11,  2.71it/s, acc_step=1/1, ce_loss_token=2.9229, perplexity_token=18.5957]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   4%|█▋                                             | 37/1044 [00:13<06:09,  2.72it/s, acc_step=1/1, ce_loss_token=2.9196, perplexity_token=18.5341]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   4%|█▋                                             | 38/1044 [00:14<06:33,  2.56it/s, acc_step=1/1, ce_loss_token=2.9160, perplexity_token=18.4680]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   4%|█▊                                             | 39/1044 [00:14<06:40,  2.51it/s, acc_step=1/1, ce_loss_token=2.9126, perplexity_token=18.4038]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   4%|█▊                                             | 40/1044 [00:14<06:29,  2.58it/s, acc_step=1/1, ce_loss_token=2.9092, perplexity_token=18.3418]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:   4%|█▊                                             | 41/1044 [00:15<07:14,  2.31it/s, acc_step=1/1, ce_loss_token=2.9059, perplexity_token=18.2816]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   4%|█▉                                             | 42/1044 [00:15<07:03,  2.36it/s, acc_step=1/1, ce_loss_token=2.9027, perplexity_token=18.2228]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   4%|█▉                                             | 43/1044 [00:16<06:27,  2.58it/s, acc_step=1/1, ce_loss_token=2.9018, perplexity_token=18.2070]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   4%|█▉                                             | 44/1044 [00:16<06:32,  2.55it/s, acc_step=1/1, ce_loss_token=2.8989, perplexity_token=18.1534]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   4%|██                                             | 45/1044 [00:16<06:19,  2.63it/s, acc_step=1/1, ce_loss_token=2.8958, perplexity_token=18.0987]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   4%|██                                             | 46/1044 [00:17<06:23,  2.60it/s, acc_step=1/1, ce_loss_token=2.8928, perplexity_token=18.0438]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██                                             | 47/1044 [00:17<06:14,  2.66it/s, acc_step=1/1, ce_loss_token=2.8898, perplexity_token=17.9895]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   5%|██▏                                            | 48/1044 [00:18<06:31,  2.54it/s, acc_step=1/1, ce_loss_token=2.8870, perplexity_token=17.9388]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   5%|██▏                                            | 49/1044 [00:18<06:43,  2.47it/s, acc_step=1/1, ce_loss_token=2.8843, perplexity_token=17.8903]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:   5%|██▎                                            | 50/1044 [00:19<07:03,  2.34it/s, acc_step=1/1, ce_loss_token=2.8813, perplexity_token=17.8381]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   5%|██▎                                            | 51/1044 [00:19<06:53,  2.40it/s, acc_step=1/1, ce_loss_token=2.8783, perplexity_token=17.7839]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   5%|██▎                                            | 52/1044 [00:19<06:37,  2.49it/s, acc_step=1/1, ce_loss_token=2.8757, perplexity_token=17.7376]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:   5%|██▍                                            | 53/1044 [00:20<06:14,  2.64it/s, acc_step=1/1, ce_loss_token=2.8731, perplexity_token=17.6925]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   5%|██▍                                            | 54/1044 [00:20<06:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.8704, perplexity_token=17.6443]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   5%|██▍                                            | 55/1044 [00:20<06:25,  2.56it/s, acc_step=1/1, ce_loss_token=2.8678, perplexity_token=17.5979]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   5%|██▌                                            | 57/1044 [00:21<05:25,  3.04it/s, acc_step=1/1, ce_loss_token=2.8675, perplexity_token=17.5929]

torch.Size([256, 290, 35]) torch.Size([256, 290])
torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   6%|██▌                                            | 58/1044 [00:21<05:40,  2.90it/s, acc_step=1/1, ce_loss_token=2.8651, perplexity_token=17.5511]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:   6%|██▋                                            | 59/1044 [00:22<06:14,  2.63it/s, acc_step=1/1, ce_loss_token=2.8626, perplexity_token=17.5062]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   6%|██▋                                            | 60/1044 [00:22<06:09,  2.66it/s, acc_step=1/1, ce_loss_token=2.8599, perplexity_token=17.4602]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   6%|██▋                                            | 61/1044 [00:23<06:18,  2.60it/s, acc_step=1/1, ce_loss_token=2.8574, perplexity_token=17.4168]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   6%|██▊                                            | 62/1044 [00:23<06:02,  2.71it/s, acc_step=1/1, ce_loss_token=2.8550, perplexity_token=17.3749]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   6%|██▊                                            | 63/1044 [00:23<06:10,  2.65it/s, acc_step=1/1, ce_loss_token=2.8525, perplexity_token=17.3310]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   6%|██▉                                            | 64/1044 [00:24<05:51,  2.79it/s, acc_step=1/1, ce_loss_token=2.8513, perplexity_token=17.3103]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   6%|██▉                                            | 65/1044 [00:24<05:33,  2.94it/s, acc_step=1/1, ce_loss_token=2.8504, perplexity_token=17.2939]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   6%|██▉                                            | 66/1044 [00:24<05:36,  2.91it/s, acc_step=1/1, ce_loss_token=2.8480, perplexity_token=17.2532]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|███                                            | 67/1044 [00:25<05:41,  2.86it/s, acc_step=1/1, ce_loss_token=2.8457, perplexity_token=17.2137]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   7%|███                                            | 68/1044 [00:25<05:53,  2.76it/s, acc_step=1/1, ce_loss_token=2.8434, perplexity_token=17.1739]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   7%|███                                            | 69/1044 [00:25<06:03,  2.68it/s, acc_step=1/1, ce_loss_token=2.8411, perplexity_token=17.1340]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   7%|███▏                                           | 70/1044 [00:26<06:07,  2.65it/s, acc_step=1/1, ce_loss_token=2.8389, perplexity_token=17.0973]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   7%|███▏                                           | 71/1044 [00:26<06:07,  2.65it/s, acc_step=1/1, ce_loss_token=2.8367, perplexity_token=17.0592]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   7%|███▏                                           | 72/1044 [00:26<05:41,  2.85it/s, acc_step=1/1, ce_loss_token=2.8356, perplexity_token=17.0413]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   7%|███▎                                           | 73/1044 [00:27<05:42,  2.84it/s, acc_step=1/1, ce_loss_token=2.8335, perplexity_token=17.0041]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   7%|███▎                                           | 74/1044 [00:27<05:42,  2.84it/s, acc_step=1/1, ce_loss_token=2.8311, perplexity_token=16.9647]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   7%|███▍                                           | 75/1044 [00:28<05:56,  2.72it/s, acc_step=1/1, ce_loss_token=2.8291, perplexity_token=16.9304]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   7%|███▍                                           | 76/1044 [00:28<05:56,  2.71it/s, acc_step=1/1, ce_loss_token=2.8271, perplexity_token=16.8960]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▍                                           | 77/1044 [00:28<05:58,  2.70it/s, acc_step=1/1, ce_loss_token=2.8250, perplexity_token=16.8608]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   7%|███▌                                           | 78/1044 [00:29<06:02,  2.66it/s, acc_step=1/1, ce_loss_token=2.8229, perplexity_token=16.8253]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   8%|███▌                                           | 79/1044 [00:29<05:54,  2.72it/s, acc_step=1/1, ce_loss_token=2.8217, perplexity_token=16.8054]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   8%|███▌                                           | 80/1044 [00:29<05:56,  2.71it/s, acc_step=1/1, ce_loss_token=2.8197, perplexity_token=16.7723]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   8%|███▋                                           | 81/1044 [00:30<05:33,  2.89it/s, acc_step=1/1, ce_loss_token=2.8188, perplexity_token=16.7560]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▋                                           | 82/1044 [00:30<05:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.8168, perplexity_token=16.7233]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   8%|███▋                                           | 83/1044 [00:31<05:52,  2.73it/s, acc_step=1/1, ce_loss_token=2.8148, perplexity_token=16.6899]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   8%|███▊                                           | 84/1044 [00:31<05:51,  2.73it/s, acc_step=1/1, ce_loss_token=2.8129, perplexity_token=16.6583]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   8%|███▊                                           | 85/1044 [00:31<05:45,  2.78it/s, acc_step=1/1, ce_loss_token=2.8111, perplexity_token=16.6279]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   8%|███▉                                           | 87/1044 [00:32<05:04,  3.14it/s, acc_step=1/1, ce_loss_token=2.8102, perplexity_token=16.6139]

torch.Size([256, 303, 35]) torch.Size([256, 303])
torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   8%|███▉                                           | 88/1044 [00:32<04:57,  3.22it/s, acc_step=1/1, ce_loss_token=2.8093, perplexity_token=16.5986]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   9%|████                                           | 89/1044 [00:33<05:24,  2.94it/s, acc_step=1/1, ce_loss_token=2.8076, perplexity_token=16.5697]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   9%|████                                           | 90/1044 [00:33<05:41,  2.79it/s, acc_step=1/1, ce_loss_token=2.8058, perplexity_token=16.5409]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   9%|████                                           | 91/1044 [00:33<05:26,  2.92it/s, acc_step=1/1, ce_loss_token=2.8048, perplexity_token=16.5236]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   9%|████▏                                          | 92/1044 [00:34<05:37,  2.82it/s, acc_step=1/1, ce_loss_token=2.8031, perplexity_token=16.4949]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   9%|████▏                                          | 93/1044 [00:34<05:50,  2.72it/s, acc_step=1/1, ce_loss_token=2.8012, perplexity_token=16.4645]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:   9%|████▏                                          | 94/1044 [00:34<06:12,  2.55it/s, acc_step=1/1, ce_loss_token=2.7994, perplexity_token=16.4356]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   9%|████▎                                          | 95/1044 [00:35<06:06,  2.59it/s, acc_step=1/1, ce_loss_token=2.7977, perplexity_token=16.4076]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   9%|████▎                                          | 96/1044 [00:35<05:54,  2.67it/s, acc_step=1/1, ce_loss_token=2.7960, perplexity_token=16.3786]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   9%|████▎                                          | 97/1044 [00:36<05:46,  2.73it/s, acc_step=1/1, ce_loss_token=2.7944, perplexity_token=16.3522]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   9%|████▍                                          | 98/1044 [00:36<05:48,  2.71it/s, acc_step=1/1, ce_loss_token=2.7927, perplexity_token=16.3251]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   9%|████▍                                          | 99/1044 [00:36<06:03,  2.60it/s, acc_step=1/1, ce_loss_token=2.7909, perplexity_token=16.2962]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  10%|████▍                                         | 100/1044 [00:37<06:05,  2.59it/s, acc_step=1/1, ce_loss_token=2.7893, perplexity_token=16.2691]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  10%|████▍                                         | 101/1044 [00:37<05:59,  2.62it/s, acc_step=1/1, ce_loss_token=2.7876, perplexity_token=16.2417]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  10%|████▍                                         | 102/1044 [00:37<06:00,  2.62it/s, acc_step=1/1, ce_loss_token=2.7860, perplexity_token=16.2162]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  10%|████▌                                         | 103/1044 [00:38<05:47,  2.71it/s, acc_step=1/1, ce_loss_token=2.7844, perplexity_token=16.1903]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  10%|████▌                                         | 104/1044 [00:38<05:46,  2.71it/s, acc_step=1/1, ce_loss_token=2.7828, perplexity_token=16.1639]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  10%|████▋                                         | 105/1044 [00:39<05:50,  2.68it/s, acc_step=1/1, ce_loss_token=2.7812, perplexity_token=16.1382]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  10%|████▋                                         | 106/1044 [00:39<05:55,  2.64it/s, acc_step=1/1, ce_loss_token=2.7795, perplexity_token=16.1115]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  10%|████▋                                         | 107/1044 [00:39<05:25,  2.88it/s, acc_step=1/1, ce_loss_token=2.7785, perplexity_token=16.0955]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  10%|████▊                                         | 108/1044 [00:40<05:38,  2.77it/s, acc_step=1/1, ce_loss_token=2.7771, perplexity_token=16.0720]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  10%|████▊                                         | 109/1044 [00:40<05:37,  2.77it/s, acc_step=1/1, ce_loss_token=2.7757, perplexity_token=16.0493]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  11%|████▊                                         | 110/1044 [00:40<05:42,  2.73it/s, acc_step=1/1, ce_loss_token=2.7742, perplexity_token=16.0265]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  11%|████▉                                         | 111/1044 [00:41<06:06,  2.55it/s, acc_step=1/1, ce_loss_token=2.7728, perplexity_token=16.0041]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  11%|████▉                                         | 112/1044 [00:41<06:14,  2.49it/s, acc_step=1/1, ce_loss_token=2.7714, perplexity_token=15.9813]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  11%|████▉                                         | 113/1044 [00:42<06:08,  2.53it/s, acc_step=1/1, ce_loss_token=2.7699, perplexity_token=15.9572]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  11%|█████                                         | 114/1044 [00:42<06:04,  2.55it/s, acc_step=1/1, ce_loss_token=2.7684, perplexity_token=15.9339]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  11%|█████                                         | 115/1044 [00:42<06:07,  2.53it/s, acc_step=1/1, ce_loss_token=2.7671, perplexity_token=15.9124]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  11%|█████                                         | 116/1044 [00:43<06:05,  2.54it/s, acc_step=1/1, ce_loss_token=2.7657, perplexity_token=15.8905]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  11%|█████▏                                        | 117/1044 [00:43<06:21,  2.43it/s, acc_step=1/1, ce_loss_token=2.7644, perplexity_token=15.8695]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  11%|█████▏                                        | 118/1044 [00:44<06:13,  2.48it/s, acc_step=1/1, ce_loss_token=2.7630, perplexity_token=15.8478]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  11%|█████▏                                        | 119/1044 [00:44<06:16,  2.45it/s, acc_step=1/1, ce_loss_token=2.7617, perplexity_token=15.8271]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  11%|█████▎                                        | 120/1044 [00:44<05:46,  2.67it/s, acc_step=1/1, ce_loss_token=2.7607, perplexity_token=15.8117]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  12%|█████▎                                        | 121/1044 [00:45<06:35,  2.33it/s, acc_step=1/1, ce_loss_token=2.7594, perplexity_token=15.7903]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  12%|█████▍                                        | 122/1044 [00:45<06:23,  2.41it/s, acc_step=1/1, ce_loss_token=2.7581, perplexity_token=15.7696]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  12%|█████▍                                        | 123/1044 [00:46<05:50,  2.63it/s, acc_step=1/1, ce_loss_token=2.7572, perplexity_token=15.7558]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  12%|█████▍                                        | 124/1044 [00:46<05:35,  2.74it/s, acc_step=1/1, ce_loss_token=2.7563, perplexity_token=15.7421]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  12%|█████▌                                        | 125/1044 [00:46<05:10,  2.96it/s, acc_step=1/1, ce_loss_token=2.7556, perplexity_token=15.7304]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  12%|█████▌                                        | 126/1044 [00:47<06:35,  2.32it/s, acc_step=1/1, ce_loss_token=2.7543, perplexity_token=15.7100]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  12%|█████▌                                        | 127/1044 [00:47<06:36,  2.31it/s, acc_step=1/1, ce_loss_token=2.7530, perplexity_token=15.6893]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  12%|█████▋                                        | 128/1044 [00:48<06:11,  2.46it/s, acc_step=1/1, ce_loss_token=2.7516, perplexity_token=15.6676]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  12%|█████▋                                        | 129/1044 [00:48<06:05,  2.50it/s, acc_step=1/1, ce_loss_token=2.7504, perplexity_token=15.6495]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  12%|█████▋                                        | 130/1044 [00:48<05:52,  2.59it/s, acc_step=1/1, ce_loss_token=2.7493, perplexity_token=15.6310]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  13%|█████▊                                        | 131/1044 [00:49<05:43,  2.66it/s, acc_step=1/1, ce_loss_token=2.7481, perplexity_token=15.6122]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  13%|█████▊                                        | 132/1044 [00:49<05:41,  2.67it/s, acc_step=1/1, ce_loss_token=2.7469, perplexity_token=15.5949]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  13%|█████▊                                        | 133/1044 [00:49<05:24,  2.81it/s, acc_step=1/1, ce_loss_token=2.7462, perplexity_token=15.5829]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|█████▉                                        | 134/1044 [00:50<05:23,  2.81it/s, acc_step=1/1, ce_loss_token=2.7450, perplexity_token=15.5650]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  13%|█████▉                                        | 135/1044 [00:50<05:25,  2.80it/s, acc_step=1/1, ce_loss_token=2.7439, perplexity_token=15.5472]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  13%|█████▉                                        | 136/1044 [00:50<05:08,  2.94it/s, acc_step=1/1, ce_loss_token=2.7431, perplexity_token=15.5356]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  13%|██████                                        | 137/1044 [00:51<05:20,  2.83it/s, acc_step=1/1, ce_loss_token=2.7420, perplexity_token=15.5177]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  13%|██████                                        | 138/1044 [00:51<06:16,  2.41it/s, acc_step=1/1, ce_loss_token=2.7408, perplexity_token=15.4999]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  13%|██████                                        | 139/1044 [00:52<06:04,  2.48it/s, acc_step=1/1, ce_loss_token=2.7397, perplexity_token=15.4825]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  13%|██████▏                                       | 140/1044 [00:52<06:02,  2.49it/s, acc_step=1/1, ce_loss_token=2.7386, perplexity_token=15.4655]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  14%|██████▏                                       | 141/1044 [00:53<06:05,  2.47it/s, acc_step=1/1, ce_loss_token=2.7374, perplexity_token=15.4471]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  14%|██████▎                                       | 142/1044 [00:53<06:09,  2.44it/s, acc_step=1/1, ce_loss_token=2.7363, perplexity_token=15.4291]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  14%|██████▎                                       | 143/1044 [00:53<06:20,  2.37it/s, acc_step=1/1, ce_loss_token=2.7351, perplexity_token=15.4119]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  14%|██████▎                                       | 144/1044 [00:54<05:44,  2.61it/s, acc_step=1/1, ce_loss_token=2.7344, perplexity_token=15.4004]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  14%|██████▍                                       | 145/1044 [00:54<05:11,  2.89it/s, acc_step=1/1, ce_loss_token=2.7337, perplexity_token=15.3897]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  14%|██████▍                                       | 146/1044 [00:54<05:11,  2.88it/s, acc_step=1/1, ce_loss_token=2.7326, perplexity_token=15.3731]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  14%|██████▍                                       | 147/1044 [00:55<05:12,  2.87it/s, acc_step=1/1, ce_loss_token=2.7316, perplexity_token=15.3578]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  14%|██████▌                                       | 148/1044 [00:55<05:22,  2.78it/s, acc_step=1/1, ce_loss_token=2.7307, perplexity_token=15.3431]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  14%|██████▌                                       | 149/1044 [00:55<05:00,  2.98it/s, acc_step=1/1, ce_loss_token=2.7300, perplexity_token=15.3328]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  14%|██████▌                                       | 150/1044 [00:56<05:05,  2.93it/s, acc_step=1/1, ce_loss_token=2.7290, perplexity_token=15.3177]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  14%|██████▋                                       | 151/1044 [00:56<05:14,  2.84it/s, acc_step=1/1, ce_loss_token=2.7280, perplexity_token=15.3025]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  15%|██████▋                                       | 152/1044 [00:56<05:27,  2.72it/s, acc_step=1/1, ce_loss_token=2.7271, perplexity_token=15.2877]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  15%|██████▋                                       | 153/1044 [00:57<05:27,  2.72it/s, acc_step=1/1, ce_loss_token=2.7260, perplexity_token=15.2714]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  15%|██████▊                                       | 154/1044 [00:57<05:23,  2.75it/s, acc_step=1/1, ce_loss_token=2.7250, perplexity_token=15.2560]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  15%|██████▊                                       | 155/1044 [00:58<05:22,  2.76it/s, acc_step=1/1, ce_loss_token=2.7239, perplexity_token=15.2403]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|██████▊                                       | 156/1044 [00:58<05:03,  2.93it/s, acc_step=1/1, ce_loss_token=2.7232, perplexity_token=15.2288]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  15%|██████▉                                       | 157/1044 [00:58<05:10,  2.85it/s, acc_step=1/1, ce_loss_token=2.7222, perplexity_token=15.2143]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  15%|██████▉                                       | 158/1044 [00:59<05:18,  2.78it/s, acc_step=1/1, ce_loss_token=2.7214, perplexity_token=15.2013]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  15%|███████                                       | 159/1044 [00:59<05:21,  2.75it/s, acc_step=1/1, ce_loss_token=2.7205, perplexity_token=15.1872]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  15%|███████                                       | 160/1044 [00:59<05:28,  2.69it/s, acc_step=1/1, ce_loss_token=2.7195, perplexity_token=15.1731]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  15%|███████                                       | 161/1044 [01:00<05:35,  2.63it/s, acc_step=1/1, ce_loss_token=2.7186, perplexity_token=15.1584]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  16%|███████▏                                      | 162/1044 [01:00<05:40,  2.59it/s, acc_step=1/1, ce_loss_token=2.7176, perplexity_token=15.1439]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  16%|███████▏                                      | 163/1044 [01:01<05:35,  2.63it/s, acc_step=1/1, ce_loss_token=2.7166, perplexity_token=15.1292]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  16%|███████▏                                      | 164/1044 [01:01<05:35,  2.62it/s, acc_step=1/1, ce_loss_token=2.7157, perplexity_token=15.1155]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▎                                      | 165/1044 [01:01<05:33,  2.64it/s, acc_step=1/1, ce_loss_token=2.7148, perplexity_token=15.1019]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  16%|███████▎                                      | 166/1044 [01:02<05:51,  2.50it/s, acc_step=1/1, ce_loss_token=2.7140, perplexity_token=15.0890]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  16%|███████▎                                      | 167/1044 [01:02<05:41,  2.57it/s, acc_step=1/1, ce_loss_token=2.7131, perplexity_token=15.0760]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  16%|███████▍                                      | 168/1044 [01:02<05:32,  2.64it/s, acc_step=1/1, ce_loss_token=2.7123, perplexity_token=15.0632]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▍                                      | 169/1044 [01:03<05:31,  2.64it/s, acc_step=1/1, ce_loss_token=2.7115, perplexity_token=15.0513]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  16%|███████▍                                      | 170/1044 [01:03<05:35,  2.61it/s, acc_step=1/1, ce_loss_token=2.7106, perplexity_token=15.0388]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  16%|███████▌                                      | 171/1044 [01:04<05:26,  2.67it/s, acc_step=1/1, ce_loss_token=2.7098, perplexity_token=15.0268]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  16%|███████▌                                      | 172/1044 [01:04<06:13,  2.33it/s, acc_step=1/1, ce_loss_token=2.7090, perplexity_token=15.0136]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  17%|███████▌                                      | 173/1044 [01:04<05:58,  2.43it/s, acc_step=1/1, ce_loss_token=2.7081, perplexity_token=15.0006]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  17%|███████▋                                      | 174/1044 [01:05<06:01,  2.41it/s, acc_step=1/1, ce_loss_token=2.7073, perplexity_token=14.9883]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  17%|███████▋                                      | 175/1044 [01:05<05:44,  2.52it/s, acc_step=1/1, ce_loss_token=2.7064, perplexity_token=14.9755]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  17%|███████▊                                      | 176/1044 [01:06<05:31,  2.62it/s, acc_step=1/1, ce_loss_token=2.7057, perplexity_token=14.9648]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▊                                      | 177/1044 [01:06<05:29,  2.63it/s, acc_step=1/1, ce_loss_token=2.7048, perplexity_token=14.9520]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  17%|███████▊                                      | 178/1044 [01:06<05:36,  2.57it/s, acc_step=1/1, ce_loss_token=2.7040, perplexity_token=14.9388]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  17%|███████▉                                      | 179/1044 [01:07<05:33,  2.59it/s, acc_step=1/1, ce_loss_token=2.7031, perplexity_token=14.9266]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  17%|███████▉                                      | 180/1044 [01:07<05:26,  2.65it/s, acc_step=1/1, ce_loss_token=2.7023, perplexity_token=14.9144]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  17%|███████▉                                      | 181/1044 [01:07<05:21,  2.68it/s, acc_step=1/1, ce_loss_token=2.7015, perplexity_token=14.9028]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  17%|████████                                      | 182/1044 [01:08<05:16,  2.72it/s, acc_step=1/1, ce_loss_token=2.7008, perplexity_token=14.8912]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  18%|████████                                      | 183/1044 [01:08<05:44,  2.50it/s, acc_step=1/1, ce_loss_token=2.7000, perplexity_token=14.8798]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  18%|████████                                      | 184/1044 [01:09<05:34,  2.57it/s, acc_step=1/1, ce_loss_token=2.6992, perplexity_token=14.8676]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  18%|████████▏                                     | 185/1044 [01:09<05:24,  2.65it/s, acc_step=1/1, ce_loss_token=2.6985, perplexity_token=14.8573]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  18%|████████▏                                     | 186/1044 [01:09<05:35,  2.56it/s, acc_step=1/1, ce_loss_token=2.6977, perplexity_token=14.8461]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  18%|████████▏                                     | 187/1044 [01:10<05:22,  2.65it/s, acc_step=1/1, ce_loss_token=2.6970, perplexity_token=14.8345]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  18%|████████▎                                     | 188/1044 [01:10<05:21,  2.66it/s, acc_step=1/1, ce_loss_token=2.6962, perplexity_token=14.8232]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  18%|████████▎                                     | 189/1044 [01:11<05:22,  2.65it/s, acc_step=1/1, ce_loss_token=2.6955, perplexity_token=14.8125]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  18%|████████▎                                     | 190/1044 [01:11<05:36,  2.54it/s, acc_step=1/1, ce_loss_token=2.6947, perplexity_token=14.8013]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  18%|████████▍                                     | 191/1044 [01:11<05:13,  2.72it/s, acc_step=1/1, ce_loss_token=2.6943, perplexity_token=14.7951]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  18%|████████▍                                     | 192/1044 [01:12<04:52,  2.91it/s, acc_step=1/1, ce_loss_token=2.6938, perplexity_token=14.7883]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  18%|████████▌                                     | 193/1044 [01:12<05:01,  2.83it/s, acc_step=1/1, ce_loss_token=2.6931, perplexity_token=14.7775]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|████████▌                                     | 194/1044 [01:12<04:46,  2.97it/s, acc_step=1/1, ce_loss_token=2.6926, perplexity_token=14.7702]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  19%|████████▌                                     | 195/1044 [01:13<05:05,  2.78it/s, acc_step=1/1, ce_loss_token=2.6919, perplexity_token=14.7595]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|████████▋                                     | 196/1044 [01:13<05:09,  2.74it/s, acc_step=1/1, ce_loss_token=2.6912, perplexity_token=14.7497]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  19%|████████▋                                     | 197/1044 [01:13<05:10,  2.73it/s, acc_step=1/1, ce_loss_token=2.6905, perplexity_token=14.7396]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  19%|████████▋                                     | 198/1044 [01:14<05:25,  2.60it/s, acc_step=1/1, ce_loss_token=2.6898, perplexity_token=14.7289]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  19%|████████▊                                     | 199/1044 [01:14<04:57,  2.84it/s, acc_step=1/1, ce_loss_token=2.6894, perplexity_token=14.7225]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  19%|████████▊                                     | 200/1044 [01:14<04:52,  2.89it/s, acc_step=1/1, ce_loss_token=2.6887, perplexity_token=14.7129]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  19%|████████▊                                     | 201/1044 [01:15<04:53,  2.87it/s, acc_step=1/1, ce_loss_token=2.6881, perplexity_token=14.7034]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  19%|████████▉                                     | 202/1044 [01:15<04:40,  3.00it/s, acc_step=1/1, ce_loss_token=2.6877, perplexity_token=14.6972]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  19%|████████▉                                     | 203/1044 [01:15<04:30,  3.10it/s, acc_step=1/1, ce_loss_token=2.6871, perplexity_token=14.6897]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  20%|████████▉                                     | 204/1044 [01:16<04:39,  3.01it/s, acc_step=1/1, ce_loss_token=2.6864, perplexity_token=14.6794]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  20%|█████████                                     | 205/1044 [01:16<04:51,  2.88it/s, acc_step=1/1, ce_loss_token=2.6858, perplexity_token=14.6699]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  20%|█████████                                     | 206/1044 [01:16<04:38,  3.00it/s, acc_step=1/1, ce_loss_token=2.6853, perplexity_token=14.6632]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████                                     | 207/1044 [01:17<04:51,  2.87it/s, acc_step=1/1, ce_loss_token=2.6846, perplexity_token=14.6527]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  20%|█████████▏                                    | 208/1044 [01:17<04:46,  2.91it/s, acc_step=1/1, ce_loss_token=2.6842, perplexity_token=14.6459]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  20%|█████████▏                                    | 209/1044 [01:17<04:33,  3.06it/s, acc_step=1/1, ce_loss_token=2.6837, perplexity_token=14.6394]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  20%|█████████▎                                    | 210/1044 [01:18<04:47,  2.90it/s, acc_step=1/1, ce_loss_token=2.6830, perplexity_token=14.6296]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  20%|█████████▎                                    | 211/1044 [01:18<04:55,  2.82it/s, acc_step=1/1, ce_loss_token=2.6824, perplexity_token=14.6203]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  20%|█████████▎                                    | 212/1044 [01:19<05:04,  2.74it/s, acc_step=1/1, ce_loss_token=2.6818, perplexity_token=14.6110]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▍                                    | 213/1044 [01:19<05:10,  2.68it/s, acc_step=1/1, ce_loss_token=2.6811, perplexity_token=14.6014]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  20%|█████████▍                                    | 214/1044 [01:19<04:51,  2.85it/s, acc_step=1/1, ce_loss_token=2.6806, perplexity_token=14.5942]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  21%|█████████▍                                    | 215/1044 [01:20<05:05,  2.72it/s, acc_step=1/1, ce_loss_token=2.6800, perplexity_token=14.5845]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  21%|█████████▌                                    | 216/1044 [01:20<05:17,  2.61it/s, acc_step=1/1, ce_loss_token=2.6794, perplexity_token=14.5758]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  21%|█████████▌                                    | 217/1044 [01:20<04:53,  2.81it/s, acc_step=1/1, ce_loss_token=2.6789, perplexity_token=14.5692]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  21%|█████████▌                                    | 218/1044 [01:21<04:59,  2.75it/s, acc_step=1/1, ce_loss_token=2.6783, perplexity_token=14.5601]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  21%|█████████▋                                    | 219/1044 [01:21<05:01,  2.74it/s, acc_step=1/1, ce_loss_token=2.6777, perplexity_token=14.5517]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  21%|█████████▋                                    | 220/1044 [01:21<04:54,  2.80it/s, acc_step=1/1, ce_loss_token=2.6772, perplexity_token=14.5449]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  21%|█████████▋                                    | 221/1044 [01:22<04:35,  2.99it/s, acc_step=1/1, ce_loss_token=2.6768, perplexity_token=14.5382]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  21%|█████████▊                                    | 222/1044 [01:22<04:48,  2.85it/s, acc_step=1/1, ce_loss_token=2.6762, perplexity_token=14.5298]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  21%|█████████▊                                    | 223/1044 [01:23<04:45,  2.87it/s, acc_step=1/1, ce_loss_token=2.6756, perplexity_token=14.5213]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  21%|█████████▊                                    | 224/1044 [01:23<05:03,  2.70it/s, acc_step=1/1, ce_loss_token=2.6751, perplexity_token=14.5131]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  22%|█████████▉                                    | 225/1044 [01:23<05:42,  2.39it/s, acc_step=1/1, ce_loss_token=2.6744, perplexity_token=14.5043]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  22%|█████████▉                                    | 226/1044 [01:24<06:08,  2.22it/s, acc_step=1/1, ce_loss_token=2.6738, perplexity_token=14.4956]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  22%|██████████                                    | 227/1044 [01:24<05:52,  2.32it/s, acc_step=1/1, ce_loss_token=2.6733, perplexity_token=14.4875]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  22%|██████████                                    | 228/1044 [01:25<05:21,  2.54it/s, acc_step=1/1, ce_loss_token=2.6728, perplexity_token=14.4810]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  22%|██████████                                    | 229/1044 [01:25<05:14,  2.59it/s, acc_step=1/1, ce_loss_token=2.6723, perplexity_token=14.4726]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  22%|██████████▏                                   | 230/1044 [01:25<05:13,  2.59it/s, acc_step=1/1, ce_loss_token=2.6717, perplexity_token=14.4647]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  22%|██████████▏                                   | 231/1044 [01:26<05:05,  2.66it/s, acc_step=1/1, ce_loss_token=2.6712, perplexity_token=14.4572]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  22%|██████████▏                                   | 232/1044 [01:26<04:47,  2.82it/s, acc_step=1/1, ce_loss_token=2.6707, perplexity_token=14.4508]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  22%|██████████▎                                   | 233/1044 [01:26<04:27,  3.03it/s, acc_step=1/1, ce_loss_token=2.6704, perplexity_token=14.4451]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  22%|██████████▎                                   | 234/1044 [01:27<04:33,  2.96it/s, acc_step=1/1, ce_loss_token=2.6698, perplexity_token=14.4368]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▎                                   | 235/1044 [01:27<04:42,  2.86it/s, acc_step=1/1, ce_loss_token=2.6692, perplexity_token=14.4287]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  23%|██████████▍                                   | 236/1044 [01:27<04:23,  3.07it/s, acc_step=1/1, ce_loss_token=2.6688, perplexity_token=14.4229]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  23%|██████████▍                                   | 237/1044 [01:28<04:28,  3.01it/s, acc_step=1/1, ce_loss_token=2.6683, perplexity_token=14.4154]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  23%|██████████▍                                   | 238/1044 [01:28<04:56,  2.72it/s, acc_step=1/1, ce_loss_token=2.6678, perplexity_token=14.4081]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▌                                   | 239/1044 [01:29<04:53,  2.74it/s, acc_step=1/1, ce_loss_token=2.6672, perplexity_token=14.4001]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▌                                   | 240/1044 [01:29<04:54,  2.73it/s, acc_step=1/1, ce_loss_token=2.6667, perplexity_token=14.3919]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  23%|██████████▌                                   | 241/1044 [01:29<04:52,  2.75it/s, acc_step=1/1, ce_loss_token=2.6662, perplexity_token=14.3845]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  23%|██████████▋                                   | 242/1044 [01:30<04:36,  2.90it/s, acc_step=1/1, ce_loss_token=2.6658, perplexity_token=14.3793]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▋                                   | 243/1044 [01:30<04:38,  2.88it/s, acc_step=1/1, ce_loss_token=2.6653, perplexity_token=14.3717]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  23%|██████████▊                                   | 244/1044 [01:30<04:40,  2.85it/s, acc_step=1/1, ce_loss_token=2.6647, perplexity_token=14.3641]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  23%|██████████▊                                   | 245/1044 [01:31<04:40,  2.85it/s, acc_step=1/1, ce_loss_token=2.6643, perplexity_token=14.3577]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  24%|██████████▊                                   | 246/1044 [01:31<04:44,  2.81it/s, acc_step=1/1, ce_loss_token=2.6638, perplexity_token=14.3507]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  24%|██████████▉                                   | 247/1044 [01:31<04:53,  2.72it/s, acc_step=1/1, ce_loss_token=2.6633, perplexity_token=14.3430]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  24%|██████████▉                                   | 248/1044 [01:32<05:00,  2.65it/s, acc_step=1/1, ce_loss_token=2.6627, perplexity_token=14.3353]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  24%|██████████▉                                   | 249/1044 [01:32<04:55,  2.69it/s, acc_step=1/1, ce_loss_token=2.6622, perplexity_token=14.3280]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  24%|███████████                                   | 250/1044 [01:33<04:59,  2.65it/s, acc_step=1/1, ce_loss_token=2.6617, perplexity_token=14.3207]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  24%|███████████                                   | 251/1044 [01:33<05:02,  2.62it/s, acc_step=1/1, ce_loss_token=2.6613, perplexity_token=14.3153]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  24%|███████████                                   | 252/1044 [01:33<04:59,  2.65it/s, acc_step=1/1, ce_loss_token=2.6608, perplexity_token=14.3083]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  24%|███████████▏                                  | 253/1044 [01:34<05:15,  2.51it/s, acc_step=1/1, ce_loss_token=2.6604, perplexity_token=14.3015]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  24%|███████████▏                                  | 254/1044 [01:34<05:09,  2.56it/s, acc_step=1/1, ce_loss_token=2.6598, perplexity_token=14.2941]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  24%|███████████▏                                  | 255/1044 [01:34<05:03,  2.60it/s, acc_step=1/1, ce_loss_token=2.6593, perplexity_token=14.2866]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  25%|███████████▎                                  | 256/1044 [01:35<04:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.6589, perplexity_token=14.2803]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▎                                  | 257/1044 [01:35<04:56,  2.66it/s, acc_step=1/1, ce_loss_token=2.6584, perplexity_token=14.2734]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  25%|███████████▎                                  | 258/1044 [01:36<05:04,  2.58it/s, acc_step=1/1, ce_loss_token=2.6579, perplexity_token=14.2660]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  25%|███████████▍                                  | 259/1044 [01:36<05:02,  2.60it/s, acc_step=1/1, ce_loss_token=2.6574, perplexity_token=14.2591]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  25%|███████████▍                                  | 260/1044 [01:36<04:35,  2.84it/s, acc_step=1/1, ce_loss_token=2.6570, perplexity_token=14.2540]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  25%|███████████▌                                  | 261/1044 [01:37<04:24,  2.96it/s, acc_step=1/1, ce_loss_token=2.6567, perplexity_token=14.2485]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  25%|███████████▌                                  | 262/1044 [01:37<04:08,  3.14it/s, acc_step=1/1, ce_loss_token=2.6563, perplexity_token=14.2439]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  25%|███████████▌                                  | 263/1044 [01:37<04:21,  2.99it/s, acc_step=1/1, ce_loss_token=2.6559, perplexity_token=14.2374]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  25%|███████████▋                                  | 264/1044 [01:38<05:37,  2.31it/s, acc_step=1/1, ce_loss_token=2.6554, perplexity_token=14.2303]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  25%|███████████▋                                  | 265/1044 [01:38<05:03,  2.57it/s, acc_step=1/1, ce_loss_token=2.6550, perplexity_token=14.2253]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  25%|███████████▋                                  | 266/1044 [01:38<04:46,  2.72it/s, acc_step=1/1, ce_loss_token=2.6547, perplexity_token=14.2206]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  26%|███████████▊                                  | 267/1044 [01:39<04:33,  2.84it/s, acc_step=1/1, ce_loss_token=2.6544, perplexity_token=14.2162]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|███████████▊                                  | 268/1044 [01:39<04:17,  3.02it/s, acc_step=1/1, ce_loss_token=2.6540, perplexity_token=14.2108]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  26%|███████████▊                                  | 269/1044 [01:39<04:19,  2.99it/s, acc_step=1/1, ce_loss_token=2.6536, perplexity_token=14.2045]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  26%|███████████▉                                  | 270/1044 [01:40<04:25,  2.91it/s, acc_step=1/1, ce_loss_token=2.6531, perplexity_token=14.1982]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  26%|███████████▉                                  | 271/1044 [01:40<04:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.6527, perplexity_token=14.1919]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  26%|███████████▉                                  | 272/1044 [01:41<04:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.6522, perplexity_token=14.1858]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  26%|████████████                                  | 273/1044 [01:41<04:43,  2.72it/s, acc_step=1/1, ce_loss_token=2.6518, perplexity_token=14.1798]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  26%|████████████                                  | 274/1044 [01:41<04:51,  2.64it/s, acc_step=1/1, ce_loss_token=2.6514, perplexity_token=14.1737]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  26%|████████████                                  | 275/1044 [01:42<04:56,  2.60it/s, acc_step=1/1, ce_loss_token=2.6510, perplexity_token=14.1679]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  26%|████████████▏                                 | 276/1044 [01:42<04:32,  2.82it/s, acc_step=1/1, ce_loss_token=2.6506, perplexity_token=14.1624]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  27%|████████████▏                                 | 277/1044 [01:42<04:33,  2.81it/s, acc_step=1/1, ce_loss_token=2.6502, perplexity_token=14.1562]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▏                                 | 278/1044 [01:43<04:33,  2.80it/s, acc_step=1/1, ce_loss_token=2.6497, perplexity_token=14.1501]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▎                                 | 279/1044 [01:43<04:32,  2.81it/s, acc_step=1/1, ce_loss_token=2.6493, perplexity_token=14.1444]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  27%|████████████▎                                 | 280/1044 [01:43<04:37,  2.75it/s, acc_step=1/1, ce_loss_token=2.6488, perplexity_token=14.1374]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  27%|████████████▍                                 | 281/1044 [01:44<04:39,  2.73it/s, acc_step=1/1, ce_loss_token=2.6484, perplexity_token=14.1316]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  27%|████████████▍                                 | 282/1044 [01:44<04:56,  2.57it/s, acc_step=1/1, ce_loss_token=2.6480, perplexity_token=14.1262]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  27%|████████████▍                                 | 283/1044 [01:45<04:50,  2.62it/s, acc_step=1/1, ce_loss_token=2.6476, perplexity_token=14.1197]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  27%|████████████▌                                 | 284/1044 [01:45<04:42,  2.69it/s, acc_step=1/1, ce_loss_token=2.6471, perplexity_token=14.1135]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  27%|████████████▌                                 | 285/1044 [01:45<04:37,  2.74it/s, acc_step=1/1, ce_loss_token=2.6467, perplexity_token=14.1075]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  27%|████████████▌                                 | 286/1044 [01:46<04:45,  2.66it/s, acc_step=1/1, ce_loss_token=2.6463, perplexity_token=14.1017]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  27%|████████████▋                                 | 287/1044 [01:46<04:42,  2.68it/s, acc_step=1/1, ce_loss_token=2.6459, perplexity_token=14.0965]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  28%|████████████▋                                 | 288/1044 [01:46<04:42,  2.68it/s, acc_step=1/1, ce_loss_token=2.6455, perplexity_token=14.0908]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  28%|████████████▋                                 | 289/1044 [01:47<04:41,  2.68it/s, acc_step=1/1, ce_loss_token=2.6451, perplexity_token=14.0850]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  28%|████████████▊                                 | 290/1044 [01:47<04:36,  2.73it/s, acc_step=1/1, ce_loss_token=2.6447, perplexity_token=14.0796]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  28%|████████████▊                                 | 291/1044 [01:48<04:31,  2.77it/s, acc_step=1/1, ce_loss_token=2.6443, perplexity_token=14.0741]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  28%|████████████▊                                 | 292/1044 [01:48<04:19,  2.90it/s, acc_step=1/1, ce_loss_token=2.6440, perplexity_token=14.0695]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  28%|████████████▉                                 | 293/1044 [01:48<04:34,  2.74it/s, acc_step=1/1, ce_loss_token=2.6436, perplexity_token=14.0638]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  28%|████████████▉                                 | 295/1044 [01:49<03:59,  3.13it/s, acc_step=1/1, ce_loss_token=2.6432, perplexity_token=14.0587]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  28%|█████████████                                 | 296/1044 [01:49<04:05,  3.05it/s, acc_step=1/1, ce_loss_token=2.6428, perplexity_token=14.0531]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  28%|█████████████                                 | 297/1044 [01:50<04:10,  2.98it/s, acc_step=1/1, ce_loss_token=2.6424, perplexity_token=14.0471]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  29%|█████████████▏                                | 298/1044 [01:50<04:40,  2.66it/s, acc_step=1/1, ce_loss_token=2.6420, perplexity_token=14.0415]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  29%|█████████████▏                                | 299/1044 [01:50<04:38,  2.67it/s, acc_step=1/1, ce_loss_token=2.6416, perplexity_token=14.0360]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  29%|█████████████▏                                | 300/1044 [01:51<04:38,  2.67it/s, acc_step=1/1, ce_loss_token=2.6413, perplexity_token=14.0309]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  29%|█████████████▎                                | 301/1044 [01:51<04:18,  2.88it/s, acc_step=1/1, ce_loss_token=2.6410, perplexity_token=14.0268]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  29%|█████████████▎                                | 302/1044 [01:51<04:22,  2.83it/s, acc_step=1/1, ce_loss_token=2.6406, perplexity_token=14.0216]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▎                                | 303/1044 [01:52<04:26,  2.78it/s, acc_step=1/1, ce_loss_token=2.6402, perplexity_token=14.0162]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  29%|█████████████▍                                | 304/1044 [01:52<04:35,  2.69it/s, acc_step=1/1, ce_loss_token=2.6398, perplexity_token=14.0108]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  29%|█████████████▍                                | 305/1044 [01:52<04:15,  2.90it/s, acc_step=1/1, ce_loss_token=2.6395, perplexity_token=14.0069]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  29%|█████████████▍                                | 306/1044 [01:53<04:11,  2.93it/s, acc_step=1/1, ce_loss_token=2.6392, perplexity_token=14.0017]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  29%|█████████████▌                                | 307/1044 [01:53<04:25,  2.78it/s, acc_step=1/1, ce_loss_token=2.6388, perplexity_token=13.9960]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  30%|█████████████▌                                | 308/1044 [01:54<05:11,  2.37it/s, acc_step=1/1, ce_loss_token=2.6384, perplexity_token=13.9907]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  30%|█████████████▌                                | 309/1044 [01:54<05:01,  2.44it/s, acc_step=1/1, ce_loss_token=2.6380, perplexity_token=13.9855]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  30%|█████████████▋                                | 310/1044 [01:55<04:52,  2.51it/s, acc_step=1/1, ce_loss_token=2.6377, perplexity_token=13.9806]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  30%|█████████████▋                                | 311/1044 [01:55<04:44,  2.57it/s, acc_step=1/1, ce_loss_token=2.6373, perplexity_token=13.9753]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  30%|█████████████▋                                | 312/1044 [01:55<04:28,  2.73it/s, acc_step=1/1, ce_loss_token=2.6370, perplexity_token=13.9709]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  30%|█████████████▊                                | 313/1044 [01:56<04:24,  2.77it/s, acc_step=1/1, ce_loss_token=2.6366, perplexity_token=13.9660]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  30%|█████████████▊                                | 314/1044 [01:56<04:37,  2.63it/s, acc_step=1/1, ce_loss_token=2.6363, perplexity_token=13.9609]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  30%|█████████████▉                                | 315/1044 [01:56<04:57,  2.45it/s, acc_step=1/1, ce_loss_token=2.6359, perplexity_token=13.9560]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|█████████████▉                                | 316/1044 [01:57<04:49,  2.52it/s, acc_step=1/1, ce_loss_token=2.6356, perplexity_token=13.9511]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  30%|█████████████▉                                | 317/1044 [01:57<04:26,  2.73it/s, acc_step=1/1, ce_loss_token=2.6353, perplexity_token=13.9471]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  30%|██████████████                                | 318/1044 [01:58<04:30,  2.68it/s, acc_step=1/1, ce_loss_token=2.6349, perplexity_token=13.9424]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  31%|██████████████                                | 319/1044 [01:58<04:33,  2.65it/s, acc_step=1/1, ce_loss_token=2.6346, perplexity_token=13.9375]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  31%|██████████████                                | 320/1044 [01:58<04:54,  2.46it/s, acc_step=1/1, ce_loss_token=2.6342, perplexity_token=13.9322]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  31%|██████████████▏                               | 321/1044 [01:59<04:45,  2.53it/s, acc_step=1/1, ce_loss_token=2.6339, perplexity_token=13.9278]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  31%|██████████████▏                               | 322/1044 [01:59<04:19,  2.78it/s, acc_step=1/1, ce_loss_token=2.6336, perplexity_token=13.9242]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  31%|██████████████▏                               | 323/1044 [02:00<04:55,  2.44it/s, acc_step=1/1, ce_loss_token=2.6332, perplexity_token=13.9188]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  31%|██████████████▎                               | 324/1044 [02:00<04:53,  2.46it/s, acc_step=1/1, ce_loss_token=2.6329, perplexity_token=13.9140]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  31%|██████████████▎                               | 325/1044 [02:00<05:15,  2.28it/s, acc_step=1/1, ce_loss_token=2.6326, perplexity_token=13.9092]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  31%|██████████████▎                               | 326/1044 [02:01<05:05,  2.35it/s, acc_step=1/1, ce_loss_token=2.6323, perplexity_token=13.9050]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  31%|██████████████▍                               | 327/1044 [02:01<05:01,  2.38it/s, acc_step=1/1, ce_loss_token=2.6319, perplexity_token=13.8997]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  31%|██████████████▍                               | 328/1044 [02:02<04:36,  2.59it/s, acc_step=1/1, ce_loss_token=2.6316, perplexity_token=13.8961]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  32%|██████████████▍                               | 329/1044 [02:02<04:35,  2.59it/s, acc_step=1/1, ce_loss_token=2.6313, perplexity_token=13.8911]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  32%|██████████████▌                               | 330/1044 [02:02<04:54,  2.42it/s, acc_step=1/1, ce_loss_token=2.6309, perplexity_token=13.8863]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  32%|██████████████▌                               | 331/1044 [02:03<04:46,  2.49it/s, acc_step=1/1, ce_loss_token=2.6305, perplexity_token=13.8813]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  32%|██████████████▋                               | 332/1044 [02:03<04:35,  2.59it/s, acc_step=1/1, ce_loss_token=2.6302, perplexity_token=13.8768]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  32%|██████████████▋                               | 333/1044 [02:04<04:33,  2.60it/s, acc_step=1/1, ce_loss_token=2.6299, perplexity_token=13.8722]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  32%|██████████████▋                               | 334/1044 [02:04<04:33,  2.59it/s, acc_step=1/1, ce_loss_token=2.6296, perplexity_token=13.8676]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  32%|██████████████▊                               | 335/1044 [02:04<04:28,  2.64it/s, acc_step=1/1, ce_loss_token=2.6292, perplexity_token=13.8629]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  32%|██████████████▊                               | 336/1044 [02:05<04:22,  2.70it/s, acc_step=1/1, ce_loss_token=2.6289, perplexity_token=13.8582]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  32%|██████████████▊                               | 337/1044 [02:05<04:24,  2.67it/s, acc_step=1/1, ce_loss_token=2.6285, perplexity_token=13.8535]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  32%|██████████████▉                               | 338/1044 [02:05<04:22,  2.69it/s, acc_step=1/1, ce_loss_token=2.6282, perplexity_token=13.8487]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  32%|██████████████▉                               | 339/1044 [02:06<04:26,  2.65it/s, acc_step=1/1, ce_loss_token=2.6279, perplexity_token=13.8445]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  33%|██████████████▉                               | 340/1044 [02:06<04:26,  2.64it/s, acc_step=1/1, ce_loss_token=2.6276, perplexity_token=13.8401]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  33%|███████████████                               | 341/1044 [02:07<04:41,  2.50it/s, acc_step=1/1, ce_loss_token=2.6272, perplexity_token=13.8355]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  33%|███████████████                               | 342/1044 [02:07<04:35,  2.55it/s, acc_step=1/1, ce_loss_token=2.6269, perplexity_token=13.8310]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  33%|███████████████                               | 343/1044 [02:07<04:19,  2.70it/s, acc_step=1/1, ce_loss_token=2.6267, perplexity_token=13.8278]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  33%|███████████████▏                              | 344/1044 [02:08<04:19,  2.70it/s, acc_step=1/1, ce_loss_token=2.6263, perplexity_token=13.8230]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  33%|███████████████▏                              | 345/1044 [02:08<04:29,  2.60it/s, acc_step=1/1, ce_loss_token=2.6260, perplexity_token=13.8190]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  33%|███████████████▏                              | 346/1044 [02:08<04:09,  2.79it/s, acc_step=1/1, ce_loss_token=2.6259, perplexity_token=13.8164]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▎                              | 347/1044 [02:09<03:54,  2.98it/s, acc_step=1/1, ce_loss_token=2.6256, perplexity_token=13.8129]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▎                              | 348/1044 [02:09<04:00,  2.89it/s, acc_step=1/1, ce_loss_token=2.6253, perplexity_token=13.8084]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  33%|███████████████▍                              | 349/1044 [02:09<04:15,  2.72it/s, acc_step=1/1, ce_loss_token=2.6249, perplexity_token=13.8038]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  34%|███████████████▍                              | 350/1044 [02:10<04:16,  2.70it/s, acc_step=1/1, ce_loss_token=2.6246, perplexity_token=13.7997]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|███████████████▍                              | 351/1044 [02:10<04:15,  2.71it/s, acc_step=1/1, ce_loss_token=2.6244, perplexity_token=13.7957]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  34%|███████████████▌                              | 352/1044 [02:11<04:16,  2.70it/s, acc_step=1/1, ce_loss_token=2.6240, perplexity_token=13.7911]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  34%|███████████████▌                              | 353/1044 [02:11<04:19,  2.67it/s, acc_step=1/1, ce_loss_token=2.6237, perplexity_token=13.7866]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|███████████████▌                              | 354/1044 [02:11<04:01,  2.86it/s, acc_step=1/1, ce_loss_token=2.6235, perplexity_token=13.7835]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  34%|███████████████▋                              | 355/1044 [02:12<03:59,  2.88it/s, acc_step=1/1, ce_loss_token=2.6232, perplexity_token=13.7794]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  34%|███████████████▋                              | 356/1044 [02:12<04:02,  2.84it/s, acc_step=1/1, ce_loss_token=2.6229, perplexity_token=13.7755]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  34%|███████████████▋                              | 357/1044 [02:12<03:59,  2.86it/s, acc_step=1/1, ce_loss_token=2.6226, perplexity_token=13.7714]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  34%|███████████████▊                              | 358/1044 [02:13<04:02,  2.83it/s, acc_step=1/1, ce_loss_token=2.6223, perplexity_token=13.7669]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  34%|███████████████▊                              | 359/1044 [02:13<04:02,  2.83it/s, acc_step=1/1, ce_loss_token=2.6220, perplexity_token=13.7633]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  34%|███████████████▊                              | 360/1044 [02:13<03:54,  2.91it/s, acc_step=1/1, ce_loss_token=2.6218, perplexity_token=13.7599]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  35%|███████████████▉                              | 361/1044 [02:14<04:00,  2.83it/s, acc_step=1/1, ce_loss_token=2.6214, perplexity_token=13.7554]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  35%|███████████████▉                              | 362/1044 [02:14<04:09,  2.73it/s, acc_step=1/1, ce_loss_token=2.6211, perplexity_token=13.7512]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  35%|███████████████▉                              | 363/1044 [02:14<04:05,  2.78it/s, acc_step=1/1, ce_loss_token=2.6208, perplexity_token=13.7470]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  35%|████████████████                              | 364/1044 [02:15<04:08,  2.74it/s, acc_step=1/1, ce_loss_token=2.6205, perplexity_token=13.7428]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  35%|████████████████                              | 365/1044 [02:15<04:09,  2.73it/s, acc_step=1/1, ce_loss_token=2.6202, perplexity_token=13.7387]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  35%|████████████████▏                             | 366/1044 [02:16<04:10,  2.71it/s, acc_step=1/1, ce_loss_token=2.6199, perplexity_token=13.7348]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  35%|████████████████▏                             | 367/1044 [02:16<04:15,  2.65it/s, acc_step=1/1, ce_loss_token=2.6197, perplexity_token=13.7310]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  35%|████████████████▏                             | 368/1044 [02:16<03:51,  2.91it/s, acc_step=1/1, ce_loss_token=2.6194, perplexity_token=13.7279]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  35%|████████████████▎                             | 369/1044 [02:17<04:01,  2.79it/s, acc_step=1/1, ce_loss_token=2.6191, perplexity_token=13.7237]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  35%|████████████████▎                             | 370/1044 [02:17<03:53,  2.88it/s, acc_step=1/1, ce_loss_token=2.6189, perplexity_token=13.7210]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  36%|████████████████▎                             | 371/1044 [02:17<04:03,  2.77it/s, acc_step=1/1, ce_loss_token=2.6186, perplexity_token=13.7170]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  36%|████████████████▍                             | 372/1044 [02:18<04:04,  2.74it/s, acc_step=1/1, ce_loss_token=2.6184, perplexity_token=13.7132]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  36%|████████████████▍                             | 373/1044 [02:18<04:40,  2.39it/s, acc_step=1/1, ce_loss_token=2.6181, perplexity_token=13.7091]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  36%|████████████████▍                             | 374/1044 [02:19<04:36,  2.42it/s, acc_step=1/1, ce_loss_token=2.6178, perplexity_token=13.7051]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  36%|████████████████▌                             | 375/1044 [02:19<04:28,  2.49it/s, acc_step=1/1, ce_loss_token=2.6175, perplexity_token=13.7016]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  36%|████████████████▌                             | 376/1044 [02:20<04:49,  2.31it/s, acc_step=1/1, ce_loss_token=2.6172, perplexity_token=13.6976]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  36%|████████████████▌                             | 377/1044 [02:20<04:41,  2.37it/s, acc_step=1/1, ce_loss_token=2.6169, perplexity_token=13.6934]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  36%|████████████████▋                             | 378/1044 [02:20<04:38,  2.39it/s, acc_step=1/1, ce_loss_token=2.6166, perplexity_token=13.6893]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  36%|████████████████▋                             | 379/1044 [02:21<04:35,  2.41it/s, acc_step=1/1, ce_loss_token=2.6163, perplexity_token=13.6856]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|████████████████▋                             | 380/1044 [02:21<04:10,  2.65it/s, acc_step=1/1, ce_loss_token=2.6161, perplexity_token=13.6828]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  36%|████████████████▊                             | 381/1044 [02:21<04:07,  2.68it/s, acc_step=1/1, ce_loss_token=2.6159, perplexity_token=13.6795]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  37%|████████████████▊                             | 382/1044 [02:22<03:57,  2.79it/s, acc_step=1/1, ce_loss_token=2.6157, perplexity_token=13.6767]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  37%|████████████████▉                             | 383/1044 [02:22<03:52,  2.84it/s, acc_step=1/1, ce_loss_token=2.6154, perplexity_token=13.6731]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  37%|████████████████▉                             | 384/1044 [02:22<03:53,  2.83it/s, acc_step=1/1, ce_loss_token=2.6152, perplexity_token=13.6694]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  37%|████████████████▉                             | 385/1044 [02:23<04:00,  2.74it/s, acc_step=1/1, ce_loss_token=2.6149, perplexity_token=13.6656]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  37%|█████████████████                             | 386/1044 [02:23<03:53,  2.82it/s, acc_step=1/1, ce_loss_token=2.6146, perplexity_token=13.6618]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  37%|█████████████████                             | 387/1044 [02:23<03:35,  3.05it/s, acc_step=1/1, ce_loss_token=2.6144, perplexity_token=13.6591]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  37%|█████████████████                             | 388/1044 [02:24<03:46,  2.90it/s, acc_step=1/1, ce_loss_token=2.6142, perplexity_token=13.6557]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  37%|█████████████████▏                            | 389/1044 [02:24<03:50,  2.84it/s, acc_step=1/1, ce_loss_token=2.6139, perplexity_token=13.6520]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  37%|█████████████████▏                            | 390/1044 [02:25<03:59,  2.73it/s, acc_step=1/1, ce_loss_token=2.6136, perplexity_token=13.6485]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  37%|█████████████████▏                            | 391/1044 [02:25<03:46,  2.88it/s, acc_step=1/1, ce_loss_token=2.6134, perplexity_token=13.6455]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  38%|█████████████████▎                            | 392/1044 [02:25<03:47,  2.87it/s, acc_step=1/1, ce_loss_token=2.6131, perplexity_token=13.6418]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  38%|█████████████████▎                            | 393/1044 [02:26<03:50,  2.83it/s, acc_step=1/1, ce_loss_token=2.6129, perplexity_token=13.6386]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  38%|█████████████████▎                            | 394/1044 [02:26<03:50,  2.82it/s, acc_step=1/1, ce_loss_token=2.6127, perplexity_token=13.6352]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  38%|█████████████████▍                            | 395/1044 [02:26<03:52,  2.80it/s, acc_step=1/1, ce_loss_token=2.6124, perplexity_token=13.6317]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  38%|█████████████████▍                            | 396/1044 [02:27<03:51,  2.80it/s, acc_step=1/1, ce_loss_token=2.6121, perplexity_token=13.6281]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▍                            | 397/1044 [02:27<03:49,  2.82it/s, acc_step=1/1, ce_loss_token=2.6119, perplexity_token=13.6244]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  38%|█████████████████▌                            | 398/1044 [02:28<04:13,  2.55it/s, acc_step=1/1, ce_loss_token=2.6116, perplexity_token=13.6210]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  38%|█████████████████▌                            | 399/1044 [02:28<03:53,  2.76it/s, acc_step=1/1, ce_loss_token=2.6114, perplexity_token=13.6185]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  38%|█████████████████▌                            | 400/1044 [02:28<03:52,  2.78it/s, acc_step=1/1, ce_loss_token=2.6111, perplexity_token=13.6147]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  38%|█████████████████▋                            | 401/1044 [02:29<03:57,  2.71it/s, acc_step=1/1, ce_loss_token=2.6109, perplexity_token=13.6113]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  39%|█████████████████▋                            | 402/1044 [02:29<03:55,  2.73it/s, acc_step=1/1, ce_loss_token=2.6107, perplexity_token=13.6082]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  39%|█████████████████▊                            | 403/1044 [02:29<03:56,  2.72it/s, acc_step=1/1, ce_loss_token=2.6104, perplexity_token=13.6050]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|█████████████████▊                            | 404/1044 [02:30<03:56,  2.70it/s, acc_step=1/1, ce_loss_token=2.6102, perplexity_token=13.6017]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  39%|█████████████████▊                            | 405/1044 [02:30<03:59,  2.67it/s, acc_step=1/1, ce_loss_token=2.6099, perplexity_token=13.5983]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  39%|█████████████████▉                            | 406/1044 [02:30<03:56,  2.70it/s, acc_step=1/1, ce_loss_token=2.6097, perplexity_token=13.5948]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  39%|█████████████████▉                            | 407/1044 [02:31<04:02,  2.62it/s, acc_step=1/1, ce_loss_token=2.6094, perplexity_token=13.5912]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  39%|██████████████████                            | 409/1044 [02:31<03:35,  2.94it/s, acc_step=1/1, ce_loss_token=2.6092, perplexity_token=13.5885]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  39%|██████████████████                            | 411/1044 [02:32<03:11,  3.31it/s, acc_step=1/1, ce_loss_token=2.6091, perplexity_token=13.5867]

torch.Size([256, 308, 35]) torch.Size([256, 308])
torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  39%|██████████████████▏                           | 412/1044 [02:32<03:23,  3.11it/s, acc_step=1/1, ce_loss_token=2.6088, perplexity_token=13.5832]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  40%|██████████████████▏                           | 413/1044 [02:33<03:28,  3.03it/s, acc_step=1/1, ce_loss_token=2.6086, perplexity_token=13.5798]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  40%|██████████████████▏                           | 414/1044 [02:33<03:49,  2.75it/s, acc_step=1/1, ce_loss_token=2.6083, perplexity_token=13.5765]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  40%|██████████████████▎                           | 415/1044 [02:34<03:50,  2.73it/s, acc_step=1/1, ce_loss_token=2.6081, perplexity_token=13.5734]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  40%|██████████████████▎                           | 416/1044 [02:34<03:50,  2.72it/s, acc_step=1/1, ce_loss_token=2.6079, perplexity_token=13.5704]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  40%|██████████████████▎                           | 417/1044 [02:34<03:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.6077, perplexity_token=13.5673]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  40%|██████████████████▍                           | 418/1044 [02:35<03:56,  2.65it/s, acc_step=1/1, ce_loss_token=2.6074, perplexity_token=13.5639]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  40%|██████████████████▍                           | 419/1044 [02:35<03:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.6072, perplexity_token=13.5607]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  40%|██████████████████▌                           | 420/1044 [02:35<03:35,  2.90it/s, acc_step=1/1, ce_loss_token=2.6070, perplexity_token=13.5585]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  40%|██████████████████▌                           | 421/1044 [02:36<03:41,  2.82it/s, acc_step=1/1, ce_loss_token=2.6068, perplexity_token=13.5552]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  40%|██████████████████▌                           | 422/1044 [02:36<03:26,  3.01it/s, acc_step=1/1, ce_loss_token=2.6066, perplexity_token=13.5528]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  41%|██████████████████▋                           | 423/1044 [02:36<03:28,  2.98it/s, acc_step=1/1, ce_loss_token=2.6064, perplexity_token=13.5497]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  41%|██████████████████▋                           | 424/1044 [02:37<03:31,  2.93it/s, acc_step=1/1, ce_loss_token=2.6061, perplexity_token=13.5465]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  41%|██████████████████▋                           | 425/1044 [02:37<03:38,  2.84it/s, acc_step=1/1, ce_loss_token=2.6059, perplexity_token=13.5434]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  41%|██████████████████▊                           | 426/1044 [02:37<03:50,  2.69it/s, acc_step=1/1, ce_loss_token=2.6056, perplexity_token=13.5399]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  41%|██████████████████▊                           | 427/1044 [02:38<03:58,  2.59it/s, acc_step=1/1, ce_loss_token=2.6054, perplexity_token=13.5367]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  41%|██████████████████▊                           | 428/1044 [02:38<03:57,  2.60it/s, acc_step=1/1, ce_loss_token=2.6052, perplexity_token=13.5333]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  41%|██████████████████▉                           | 429/1044 [02:39<03:56,  2.60it/s, acc_step=1/1, ce_loss_token=2.6049, perplexity_token=13.5298]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  41%|██████████████████▉                           | 430/1044 [02:39<03:53,  2.63it/s, acc_step=1/1, ce_loss_token=2.6047, perplexity_token=13.5265]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  41%|██████████████████▉                           | 431/1044 [02:39<03:45,  2.71it/s, acc_step=1/1, ce_loss_token=2.6044, perplexity_token=13.5233]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  41%|███████████████████                           | 432/1044 [02:40<03:48,  2.68it/s, acc_step=1/1, ce_loss_token=2.6042, perplexity_token=13.5203]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  41%|███████████████████                           | 433/1044 [02:40<03:59,  2.56it/s, acc_step=1/1, ce_loss_token=2.6040, perplexity_token=13.5175]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  42%|███████████████████                           | 434/1044 [02:41<04:00,  2.54it/s, acc_step=1/1, ce_loss_token=2.6037, perplexity_token=13.5140]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  42%|███████████████████▏                          | 435/1044 [02:41<03:57,  2.57it/s, acc_step=1/1, ce_loss_token=2.6035, perplexity_token=13.5110]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  42%|███████████████████▏                          | 436/1044 [02:41<03:57,  2.56it/s, acc_step=1/1, ce_loss_token=2.6033, perplexity_token=13.5079]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  42%|███████████████████▎                          | 437/1044 [02:42<03:54,  2.59it/s, acc_step=1/1, ce_loss_token=2.6031, perplexity_token=13.5049]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  42%|███████████████████▎                          | 438/1044 [02:42<03:34,  2.82it/s, acc_step=1/1, ce_loss_token=2.6029, perplexity_token=13.5027]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  42%|███████████████████▎                          | 439/1044 [02:42<03:36,  2.80it/s, acc_step=1/1, ce_loss_token=2.6027, perplexity_token=13.4996]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  42%|███████████████████▍                          | 440/1044 [02:43<03:28,  2.90it/s, acc_step=1/1, ce_loss_token=2.6025, perplexity_token=13.4970]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  42%|███████████████████▍                          | 441/1044 [02:43<03:14,  3.10it/s, acc_step=1/1, ce_loss_token=2.6023, perplexity_token=13.4945]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  42%|███████████████████▍                          | 442/1044 [02:43<03:36,  2.79it/s, acc_step=1/1, ce_loss_token=2.6021, perplexity_token=13.4916]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▌                          | 443/1044 [02:44<03:35,  2.79it/s, acc_step=1/1, ce_loss_token=2.6018, perplexity_token=13.4886]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  43%|███████████████████▌                          | 444/1044 [02:44<03:34,  2.79it/s, acc_step=1/1, ce_loss_token=2.6016, perplexity_token=13.4854]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  43%|███████████████████▌                          | 445/1044 [02:44<03:39,  2.73it/s, acc_step=1/1, ce_loss_token=2.6014, perplexity_token=13.4827]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  43%|███████████████████▋                          | 446/1044 [02:45<03:32,  2.81it/s, acc_step=1/1, ce_loss_token=2.6012, perplexity_token=13.4805]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  43%|███████████████████▋                          | 448/1044 [02:45<03:04,  3.23it/s, acc_step=1/1, ce_loss_token=2.6010, perplexity_token=13.4774]

torch.Size([256, 290, 35]) torch.Size([256, 290])
torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  43%|███████████████████▊                          | 449/1044 [02:46<04:40,  2.12it/s, acc_step=1/1, ce_loss_token=2.6008, perplexity_token=13.4744]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  43%|███████████████████▊                          | 450/1044 [02:47<04:28,  2.21it/s, acc_step=1/1, ce_loss_token=2.6006, perplexity_token=13.4712]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  43%|███████████████████▊                          | 451/1044 [02:47<04:18,  2.30it/s, acc_step=1/1, ce_loss_token=2.6003, perplexity_token=13.4684]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  43%|███████████████████▉                          | 452/1044 [02:47<04:12,  2.35it/s, acc_step=1/1, ce_loss_token=2.6001, perplexity_token=13.4654]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  43%|███████████████████▉                          | 453/1044 [02:48<04:04,  2.42it/s, acc_step=1/1, ce_loss_token=2.5999, perplexity_token=13.4625]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|████████████████████                          | 454/1044 [02:48<03:53,  2.52it/s, acc_step=1/1, ce_loss_token=2.5997, perplexity_token=13.4595]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  44%|████████████████████                          | 455/1044 [02:49<03:52,  2.53it/s, acc_step=1/1, ce_loss_token=2.5995, perplexity_token=13.4569]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  44%|████████████████████                          | 456/1044 [02:49<04:00,  2.44it/s, acc_step=1/1, ce_loss_token=2.5993, perplexity_token=13.4541]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  44%|████████████████████▏                         | 457/1044 [02:49<03:53,  2.51it/s, acc_step=1/1, ce_loss_token=2.5991, perplexity_token=13.4515]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  44%|████████████████████▏                         | 458/1044 [02:50<03:32,  2.76it/s, acc_step=1/1, ce_loss_token=2.5989, perplexity_token=13.4494]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  44%|████████████████████▏                         | 459/1044 [02:50<03:38,  2.68it/s, acc_step=1/1, ce_loss_token=2.5987, perplexity_token=13.4463]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  44%|████████████████████▎                         | 460/1044 [02:50<03:24,  2.85it/s, acc_step=1/1, ce_loss_token=2.5985, perplexity_token=13.4438]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  44%|████████████████████▎                         | 461/1044 [02:51<03:23,  2.86it/s, acc_step=1/1, ce_loss_token=2.5983, perplexity_token=13.4408]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  44%|████████████████████▎                         | 462/1044 [02:51<03:30,  2.76it/s, acc_step=1/1, ce_loss_token=2.5981, perplexity_token=13.4380]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  44%|████████████████████▍                         | 463/1044 [02:51<03:25,  2.83it/s, acc_step=1/1, ce_loss_token=2.5979, perplexity_token=13.4354]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  44%|████████████████████▍                         | 464/1044 [02:52<03:26,  2.81it/s, acc_step=1/1, ce_loss_token=2.5977, perplexity_token=13.4322]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  45%|████████████████████▍                         | 465/1044 [02:52<03:26,  2.80it/s, acc_step=1/1, ce_loss_token=2.5975, perplexity_token=13.4297]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  45%|████████████████████▌                         | 466/1044 [02:53<03:35,  2.68it/s, acc_step=1/1, ce_loss_token=2.5973, perplexity_token=13.4268]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  45%|████████████████████▌                         | 467/1044 [02:53<03:51,  2.49it/s, acc_step=1/1, ce_loss_token=2.5971, perplexity_token=13.4242]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  45%|████████████████████▌                         | 468/1044 [02:53<03:55,  2.45it/s, acc_step=1/1, ce_loss_token=2.5969, perplexity_token=13.4215]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  45%|████████████████████▋                         | 469/1044 [02:54<03:49,  2.50it/s, acc_step=1/1, ce_loss_token=2.5966, perplexity_token=13.4185]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  45%|████████████████████▋                         | 470/1044 [02:54<03:49,  2.50it/s, acc_step=1/1, ce_loss_token=2.5964, perplexity_token=13.4157]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  45%|████████████████████▊                         | 471/1044 [02:55<03:44,  2.55it/s, acc_step=1/1, ce_loss_token=2.5962, perplexity_token=13.4130]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  45%|████████████████████▊                         | 472/1044 [02:55<03:51,  2.48it/s, acc_step=1/1, ce_loss_token=2.5960, perplexity_token=13.4101]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  45%|████████████████████▊                         | 473/1044 [02:55<03:47,  2.51it/s, acc_step=1/1, ce_loss_token=2.5958, perplexity_token=13.4075]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  45%|████████████████████▉                         | 474/1044 [02:56<03:50,  2.48it/s, acc_step=1/1, ce_loss_token=2.5956, perplexity_token=13.4047]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  45%|████████████████████▉                         | 475/1044 [02:56<03:40,  2.59it/s, acc_step=1/1, ce_loss_token=2.5954, perplexity_token=13.4020]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  46%|████████████████████▉                         | 476/1044 [02:57<03:44,  2.53it/s, acc_step=1/1, ce_loss_token=2.5952, perplexity_token=13.3994]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  46%|█████████████████████                         | 477/1044 [02:57<03:22,  2.81it/s, acc_step=1/1, ce_loss_token=2.5951, perplexity_token=13.3973]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████                         | 478/1044 [02:57<03:19,  2.84it/s, acc_step=1/1, ce_loss_token=2.5949, perplexity_token=13.3947]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  46%|█████████████████████                         | 479/1044 [02:58<03:21,  2.80it/s, acc_step=1/1, ce_loss_token=2.5946, perplexity_token=13.3918]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  46%|█████████████████████▏                        | 480/1044 [02:58<03:30,  2.68it/s, acc_step=1/1, ce_loss_token=2.5944, perplexity_token=13.3892]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  46%|█████████████████████▏                        | 481/1044 [02:58<03:42,  2.53it/s, acc_step=1/1, ce_loss_token=2.5942, perplexity_token=13.3864]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  46%|█████████████████████▏                        | 482/1044 [02:59<03:51,  2.43it/s, acc_step=1/1, ce_loss_token=2.5940, perplexity_token=13.3837]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  46%|█████████████████████▎                        | 483/1044 [02:59<03:54,  2.39it/s, acc_step=1/1, ce_loss_token=2.5938, perplexity_token=13.3811]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  46%|█████████████████████▎                        | 484/1044 [03:00<03:49,  2.44it/s, acc_step=1/1, ce_loss_token=2.5936, perplexity_token=13.3784]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████▎                        | 485/1044 [03:00<03:25,  2.72it/s, acc_step=1/1, ce_loss_token=2.5935, perplexity_token=13.3765]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  47%|█████████████████████▍                        | 486/1044 [03:00<03:23,  2.74it/s, acc_step=1/1, ce_loss_token=2.5934, perplexity_token=13.3745]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  47%|█████████████████████▍                        | 487/1044 [03:01<03:25,  2.72it/s, acc_step=1/1, ce_loss_token=2.5932, perplexity_token=13.3719]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|█████████████████████▌                        | 488/1044 [03:01<03:21,  2.76it/s, acc_step=1/1, ce_loss_token=2.5930, perplexity_token=13.3692]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|█████████████████████▌                        | 489/1044 [03:01<03:09,  2.93it/s, acc_step=1/1, ce_loss_token=2.5928, perplexity_token=13.3669]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  47%|█████████████████████▌                        | 490/1044 [03:02<02:57,  3.13it/s, acc_step=1/1, ce_loss_token=2.5926, perplexity_token=13.3650]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  47%|█████████████████████▋                        | 491/1044 [03:02<03:00,  3.06it/s, acc_step=1/1, ce_loss_token=2.5925, perplexity_token=13.3630]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  47%|█████████████████████▋                        | 492/1044 [03:02<03:05,  2.98it/s, acc_step=1/1, ce_loss_token=2.5923, perplexity_token=13.3605]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|█████████████████████▋                        | 493/1044 [03:03<02:54,  3.16it/s, acc_step=1/1, ce_loss_token=2.5922, perplexity_token=13.3587]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  47%|█████████████████████▊                        | 494/1044 [03:03<03:00,  3.05it/s, acc_step=1/1, ce_loss_token=2.5920, perplexity_token=13.3562]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  47%|█████████████████████▊                        | 495/1044 [03:03<03:14,  2.82it/s, acc_step=1/1, ce_loss_token=2.5918, perplexity_token=13.3535]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  48%|█████████████████████▊                        | 496/1044 [03:04<03:23,  2.69it/s, acc_step=1/1, ce_loss_token=2.5916, perplexity_token=13.3511]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  48%|█████████████████████▉                        | 497/1044 [03:04<03:21,  2.72it/s, acc_step=1/1, ce_loss_token=2.5914, perplexity_token=13.3482]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  48%|█████████████████████▉                        | 498/1044 [03:05<03:31,  2.58it/s, acc_step=1/1, ce_loss_token=2.5912, perplexity_token=13.3455]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  48%|█████████████████████▉                        | 499/1044 [03:05<03:09,  2.87it/s, acc_step=1/1, ce_loss_token=2.5910, perplexity_token=13.3433]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  48%|██████████████████████                        | 500/1044 [03:05<03:07,  2.90it/s, acc_step=1/1, ce_loss_token=2.5908, perplexity_token=13.3407]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  48%|██████████████████████                        | 501/1044 [03:06<03:14,  2.80it/s, acc_step=1/1, ce_loss_token=2.5906, perplexity_token=13.3377]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  48%|██████████████████████                        | 502/1044 [03:06<03:13,  2.80it/s, acc_step=1/1, ce_loss_token=2.5904, perplexity_token=13.3353]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:  48%|██████████████████████▏                       | 503/1044 [03:06<03:08,  2.88it/s, acc_step=1/1, ce_loss_token=2.5902, perplexity_token=13.3330]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  48%|██████████████████████▏                       | 504/1044 [03:07<03:14,  2.77it/s, acc_step=1/1, ce_loss_token=2.5901, perplexity_token=13.3305]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  48%|██████████████████████▎                       | 505/1044 [03:07<03:12,  2.80it/s, acc_step=1/1, ce_loss_token=2.5899, perplexity_token=13.3280]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  48%|██████████████████████▎                       | 506/1044 [03:07<03:15,  2.75it/s, acc_step=1/1, ce_loss_token=2.5897, perplexity_token=13.3255]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  49%|██████████████████████▎                       | 507/1044 [03:08<03:20,  2.67it/s, acc_step=1/1, ce_loss_token=2.5895, perplexity_token=13.3230]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  49%|██████████████████████▍                       | 508/1044 [03:08<03:22,  2.65it/s, acc_step=1/1, ce_loss_token=2.5893, perplexity_token=13.3205]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|██████████████████████▍                       | 509/1044 [03:09<03:19,  2.69it/s, acc_step=1/1, ce_loss_token=2.5891, perplexity_token=13.3179]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  49%|██████████████████████▍                       | 510/1044 [03:09<03:17,  2.70it/s, acc_step=1/1, ce_loss_token=2.5889, perplexity_token=13.3155]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  49%|██████████████████████▌                       | 511/1044 [03:09<03:20,  2.66it/s, acc_step=1/1, ce_loss_token=2.5888, perplexity_token=13.3131]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  49%|██████████████████████▌                       | 512/1044 [03:10<03:44,  2.37it/s, acc_step=1/1, ce_loss_token=2.5886, perplexity_token=13.3105]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|██████████████████████▌                       | 513/1044 [03:10<03:29,  2.53it/s, acc_step=1/1, ce_loss_token=2.5884, perplexity_token=13.3087]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  49%|██████████████████████▋                       | 514/1044 [03:11<03:27,  2.55it/s, acc_step=1/1, ce_loss_token=2.5882, perplexity_token=13.3061]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  49%|██████████████████████▋                       | 515/1044 [03:11<03:15,  2.71it/s, acc_step=1/1, ce_loss_token=2.5881, perplexity_token=13.3042]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  49%|██████████████████████▋                       | 516/1044 [03:11<03:05,  2.84it/s, acc_step=1/1, ce_loss_token=2.5879, perplexity_token=13.3024]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  50%|██████████████████████▊                       | 517/1044 [03:12<03:23,  2.59it/s, acc_step=1/1, ce_loss_token=2.5878, perplexity_token=13.3001]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  50%|██████████████████████▊                       | 518/1044 [03:12<03:21,  2.61it/s, acc_step=1/1, ce_loss_token=2.5876, perplexity_token=13.2978]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  50%|██████████████████████▊                       | 519/1044 [03:12<03:16,  2.67it/s, acc_step=1/1, ce_loss_token=2.5874, perplexity_token=13.2957]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  50%|██████████████████████▉                       | 520/1044 [03:13<03:19,  2.63it/s, acc_step=1/1, ce_loss_token=2.5873, perplexity_token=13.2934]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  50%|██████████████████████▉                       | 521/1044 [03:13<03:19,  2.62it/s, acc_step=1/1, ce_loss_token=2.5871, perplexity_token=13.2909]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████                       | 522/1044 [03:14<03:20,  2.60it/s, acc_step=1/1, ce_loss_token=2.5869, perplexity_token=13.2887]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████                       | 523/1044 [03:14<03:19,  2.61it/s, acc_step=1/1, ce_loss_token=2.5867, perplexity_token=13.2864]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  50%|███████████████████████                       | 524/1044 [03:14<03:42,  2.33it/s, acc_step=1/1, ce_loss_token=2.5866, perplexity_token=13.2842]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  50%|███████████████████████▏                      | 525/1044 [03:15<03:16,  2.65it/s, acc_step=1/1, ce_loss_token=2.5865, perplexity_token=13.2826]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  50%|███████████████████████▏                      | 526/1044 [03:15<03:19,  2.59it/s, acc_step=1/1, ce_loss_token=2.5863, perplexity_token=13.2805]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  50%|███████████████████████▏                      | 527/1044 [03:15<03:03,  2.82it/s, acc_step=1/1, ce_loss_token=2.5862, perplexity_token=13.2788]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|███████████████████████▎                      | 528/1044 [03:16<03:05,  2.79it/s, acc_step=1/1, ce_loss_token=2.5860, perplexity_token=13.2767]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  51%|███████████████████████▎                      | 529/1044 [03:16<03:28,  2.47it/s, acc_step=1/1, ce_loss_token=2.5858, perplexity_token=13.2743]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  51%|███████████████████████▎                      | 530/1044 [03:17<03:20,  2.56it/s, acc_step=1/1, ce_loss_token=2.5857, perplexity_token=13.2719]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  51%|███████████████████████▍                      | 531/1044 [03:17<03:23,  2.52it/s, acc_step=1/1, ce_loss_token=2.5855, perplexity_token=13.2698]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  51%|███████████████████████▍                      | 532/1044 [03:17<03:20,  2.56it/s, acc_step=1/1, ce_loss_token=2.5853, perplexity_token=13.2675]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  51%|███████████████████████▍                      | 533/1044 [03:18<03:17,  2.59it/s, acc_step=1/1, ce_loss_token=2.5852, perplexity_token=13.2654]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  51%|███████████████████████▌                      | 534/1044 [03:18<03:17,  2.58it/s, acc_step=1/1, ce_loss_token=2.5850, perplexity_token=13.2631]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  51%|███████████████████████▌                      | 535/1044 [03:19<03:13,  2.62it/s, acc_step=1/1, ce_loss_token=2.5848, perplexity_token=13.2612]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|███████████████████████▌                      | 536/1044 [03:19<03:12,  2.64it/s, acc_step=1/1, ce_loss_token=2.5847, perplexity_token=13.2587]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  51%|███████████████████████▋                      | 537/1044 [03:19<03:07,  2.70it/s, acc_step=1/1, ce_loss_token=2.5845, perplexity_token=13.2566]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  52%|███████████████████████▋                      | 538/1044 [03:20<03:15,  2.59it/s, acc_step=1/1, ce_loss_token=2.5843, perplexity_token=13.2544]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  52%|███████████████████████▋                      | 539/1044 [03:20<03:13,  2.60it/s, acc_step=1/1, ce_loss_token=2.5842, perplexity_token=13.2522]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  52%|███████████████████████▊                      | 540/1044 [03:20<02:57,  2.83it/s, acc_step=1/1, ce_loss_token=2.5840, perplexity_token=13.2505]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  52%|███████████████████████▊                      | 541/1044 [03:21<03:04,  2.72it/s, acc_step=1/1, ce_loss_token=2.5839, perplexity_token=13.2483]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|███████████████████████▉                      | 542/1044 [03:21<03:05,  2.70it/s, acc_step=1/1, ce_loss_token=2.5837, perplexity_token=13.2461]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  52%|███████████████████████▉                      | 543/1044 [03:21<03:06,  2.69it/s, acc_step=1/1, ce_loss_token=2.5835, perplexity_token=13.2439]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  52%|███████████████████████▉                      | 544/1044 [03:22<03:09,  2.64it/s, acc_step=1/1, ce_loss_token=2.5834, perplexity_token=13.2418]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  52%|████████████████████████                      | 545/1044 [03:22<03:12,  2.60it/s, acc_step=1/1, ce_loss_token=2.5832, perplexity_token=13.2397]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  52%|████████████████████████                      | 546/1044 [03:23<02:56,  2.82it/s, acc_step=1/1, ce_loss_token=2.5831, perplexity_token=13.2380]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████                      | 547/1044 [03:23<02:56,  2.82it/s, acc_step=1/1, ce_loss_token=2.5829, perplexity_token=13.2358]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  52%|████████████████████████▏                     | 548/1044 [03:23<02:56,  2.80it/s, acc_step=1/1, ce_loss_token=2.5828, perplexity_token=13.2337]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  53%|████████████████████████▏                     | 549/1044 [03:24<03:07,  2.64it/s, acc_step=1/1, ce_loss_token=2.5826, perplexity_token=13.2317]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  53%|████████████████████████▏                     | 550/1044 [03:24<02:53,  2.85it/s, acc_step=1/1, ce_loss_token=2.5825, perplexity_token=13.2302]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  53%|████████████████████████▎                     | 551/1044 [03:24<02:58,  2.76it/s, acc_step=1/1, ce_loss_token=2.5823, perplexity_token=13.2280]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  53%|████████████████████████▎                     | 552/1044 [03:25<03:34,  2.30it/s, acc_step=1/1, ce_loss_token=2.5822, perplexity_token=13.2259]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▎                     | 553/1044 [03:25<03:27,  2.36it/s, acc_step=1/1, ce_loss_token=2.5820, perplexity_token=13.2238]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  53%|████████████████████████▍                     | 554/1044 [03:26<03:16,  2.49it/s, acc_step=1/1, ce_loss_token=2.5819, perplexity_token=13.2218]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  53%|████████████████████████▍                     | 555/1044 [03:26<03:18,  2.47it/s, acc_step=1/1, ce_loss_token=2.5817, perplexity_token=13.2197]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|████████████████████████▍                     | 556/1044 [03:27<03:13,  2.53it/s, acc_step=1/1, ce_loss_token=2.5816, perplexity_token=13.2179]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|████████████████████████▌                     | 557/1044 [03:27<03:08,  2.58it/s, acc_step=1/1, ce_loss_token=2.5814, perplexity_token=13.2159]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|████████████████████████▌                     | 558/1044 [03:27<03:06,  2.61it/s, acc_step=1/1, ce_loss_token=2.5813, perplexity_token=13.2141]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  54%|████████████████████████▋                     | 559/1044 [03:28<02:49,  2.86it/s, acc_step=1/1, ce_loss_token=2.5812, perplexity_token=13.2125]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  54%|████████████████████████▋                     | 560/1044 [03:28<02:54,  2.77it/s, acc_step=1/1, ce_loss_token=2.5810, perplexity_token=13.2104]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  54%|████████████████████████▋                     | 561/1044 [03:28<03:00,  2.67it/s, acc_step=1/1, ce_loss_token=2.5808, perplexity_token=13.2082]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  54%|████████████████████████▊                     | 562/1044 [03:29<02:56,  2.74it/s, acc_step=1/1, ce_loss_token=2.5807, perplexity_token=13.2059]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  54%|████████████████████████▊                     | 563/1044 [03:29<02:59,  2.68it/s, acc_step=1/1, ce_loss_token=2.5805, perplexity_token=13.2038]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  54%|████████████████████████▊                     | 564/1044 [03:29<03:02,  2.63it/s, acc_step=1/1, ce_loss_token=2.5804, perplexity_token=13.2019]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  54%|████████████████████████▉                     | 565/1044 [03:30<03:06,  2.56it/s, acc_step=1/1, ce_loss_token=2.5802, perplexity_token=13.1997]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  54%|████████████████████████▉                     | 566/1044 [03:30<03:09,  2.53it/s, acc_step=1/1, ce_loss_token=2.5800, perplexity_token=13.1978]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  54%|████████████████████████▉                     | 567/1044 [03:31<03:13,  2.47it/s, acc_step=1/1, ce_loss_token=2.5799, perplexity_token=13.1958]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  54%|█████████████████████████                     | 568/1044 [03:31<03:03,  2.60it/s, acc_step=1/1, ce_loss_token=2.5797, perplexity_token=13.1937]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  55%|█████████████████████████                     | 569/1044 [03:31<02:58,  2.67it/s, acc_step=1/1, ce_loss_token=2.5796, perplexity_token=13.1917]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  55%|█████████████████████████                     | 570/1044 [03:32<03:05,  2.56it/s, acc_step=1/1, ce_loss_token=2.5794, perplexity_token=13.1899]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  55%|█████████████████████████▏                    | 571/1044 [03:32<03:05,  2.56it/s, acc_step=1/1, ce_loss_token=2.5793, perplexity_token=13.1878]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  55%|█████████████████████████▏                    | 572/1044 [03:33<02:55,  2.69it/s, acc_step=1/1, ce_loss_token=2.5792, perplexity_token=13.1860]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  55%|█████████████████████████▏                    | 573/1044 [03:33<02:55,  2.69it/s, acc_step=1/1, ce_loss_token=2.5790, perplexity_token=13.1839]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  55%|█████████████████████████▎                    | 574/1044 [03:33<02:39,  2.95it/s, acc_step=1/1, ce_loss_token=2.5789, perplexity_token=13.1824]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  55%|█████████████████████████▎                    | 575/1044 [03:33<02:27,  3.18it/s, acc_step=1/1, ce_loss_token=2.5788, perplexity_token=13.1808]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  55%|█████████████████████████▍                    | 576/1044 [03:34<02:31,  3.09it/s, acc_step=1/1, ce_loss_token=2.5786, perplexity_token=13.1788]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  55%|█████████████████████████▍                    | 577/1044 [03:34<02:22,  3.27it/s, acc_step=1/1, ce_loss_token=2.5785, perplexity_token=13.1771]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▍                    | 578/1044 [03:34<02:27,  3.16it/s, acc_step=1/1, ce_loss_token=2.5783, perplexity_token=13.1751]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  55%|█████████████████████████▌                    | 579/1044 [03:35<02:33,  3.03it/s, acc_step=1/1, ce_loss_token=2.5782, perplexity_token=13.1731]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  56%|█████████████████████████▌                    | 580/1044 [03:35<02:36,  2.96it/s, acc_step=1/1, ce_loss_token=2.5781, perplexity_token=13.1715]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  56%|█████████████████████████▌                    | 581/1044 [03:35<02:40,  2.88it/s, acc_step=1/1, ce_loss_token=2.5779, perplexity_token=13.1695]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  56%|█████████████████████████▋                    | 582/1044 [03:36<02:47,  2.75it/s, acc_step=1/1, ce_loss_token=2.5777, perplexity_token=13.1674]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  56%|█████████████████████████▋                    | 583/1044 [03:36<02:50,  2.71it/s, acc_step=1/1, ce_loss_token=2.5776, perplexity_token=13.1655]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  56%|█████████████████████████▋                    | 584/1044 [03:37<02:50,  2.69it/s, acc_step=1/1, ce_loss_token=2.5774, perplexity_token=13.1635]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  56%|█████████████████████████▊                    | 585/1044 [03:37<02:41,  2.84it/s, acc_step=1/1, ce_loss_token=2.5773, perplexity_token=13.1617]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  56%|█████████████████████████▊                    | 586/1044 [03:37<02:43,  2.81it/s, acc_step=1/1, ce_loss_token=2.5772, perplexity_token=13.1598]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  56%|█████████████████████████▊                    | 587/1044 [03:38<02:54,  2.63it/s, acc_step=1/1, ce_loss_token=2.5770, perplexity_token=13.1577]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  56%|█████████████████████████▉                    | 588/1044 [03:38<03:01,  2.51it/s, acc_step=1/1, ce_loss_token=2.5769, perplexity_token=13.1558]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  56%|█████████████████████████▉                    | 589/1044 [03:39<03:03,  2.48it/s, acc_step=1/1, ce_loss_token=2.5767, perplexity_token=13.1539]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  57%|█████████████████████████▉                    | 590/1044 [03:39<03:01,  2.50it/s, acc_step=1/1, ce_loss_token=2.5766, perplexity_token=13.1520]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  57%|██████████████████████████                    | 591/1044 [03:39<02:57,  2.55it/s, acc_step=1/1, ce_loss_token=2.5764, perplexity_token=13.1501]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  57%|██████████████████████████                    | 592/1044 [03:40<02:43,  2.76it/s, acc_step=1/1, ce_loss_token=2.5763, perplexity_token=13.1487]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  57%|██████████████████████████▏                   | 593/1044 [03:40<02:47,  2.69it/s, acc_step=1/1, ce_loss_token=2.5762, perplexity_token=13.1469]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  57%|██████████████████████████▏                   | 594/1044 [03:40<02:50,  2.64it/s, acc_step=1/1, ce_loss_token=2.5760, perplexity_token=13.1451]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  57%|██████████████████████████▏                   | 595/1044 [03:41<02:48,  2.66it/s, acc_step=1/1, ce_loss_token=2.5759, perplexity_token=13.1432]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  57%|██████████████████████████▎                   | 596/1044 [03:41<02:43,  2.74it/s, acc_step=1/1, ce_loss_token=2.5757, perplexity_token=13.1409]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  57%|██████████████████████████▎                   | 597/1044 [03:42<02:43,  2.73it/s, acc_step=1/1, ce_loss_token=2.5756, perplexity_token=13.1392]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▎                   | 598/1044 [03:42<02:43,  2.73it/s, acc_step=1/1, ce_loss_token=2.5754, perplexity_token=13.1372]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  57%|██████████████████████████▍                   | 599/1044 [03:42<02:46,  2.68it/s, acc_step=1/1, ce_loss_token=2.5753, perplexity_token=13.1354]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  57%|██████████████████████████▍                   | 600/1044 [03:43<02:47,  2.64it/s, acc_step=1/1, ce_loss_token=2.5752, perplexity_token=13.1335]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  58%|██████████████████████████▍                   | 601/1044 [03:43<02:46,  2.65it/s, acc_step=1/1, ce_loss_token=2.5750, perplexity_token=13.1318]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  58%|██████████████████████████▌                   | 602/1044 [03:43<02:41,  2.73it/s, acc_step=1/1, ce_loss_token=2.5749, perplexity_token=13.1298]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  58%|██████████████████████████▌                   | 603/1044 [03:44<02:48,  2.62it/s, acc_step=1/1, ce_loss_token=2.5747, perplexity_token=13.1277]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  58%|██████████████████████████▌                   | 604/1044 [03:44<02:35,  2.83it/s, acc_step=1/1, ce_loss_token=2.5746, perplexity_token=13.1263]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  58%|██████████████████████████▋                   | 605/1044 [03:45<02:42,  2.69it/s, acc_step=1/1, ce_loss_token=2.5745, perplexity_token=13.1245]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  58%|██████████████████████████▋                   | 606/1044 [03:45<02:46,  2.63it/s, acc_step=1/1, ce_loss_token=2.5743, perplexity_token=13.1226]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  58%|██████████████████████████▋                   | 607/1044 [03:45<02:56,  2.48it/s, acc_step=1/1, ce_loss_token=2.5742, perplexity_token=13.1206]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  58%|██████████████████████████▊                   | 608/1044 [03:46<02:56,  2.48it/s, acc_step=1/1, ce_loss_token=2.5740, perplexity_token=13.1187]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  58%|██████████████████████████▊                   | 609/1044 [03:46<02:55,  2.49it/s, acc_step=1/1, ce_loss_token=2.5739, perplexity_token=13.1168]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  58%|██████████████████████████▉                   | 610/1044 [03:47<02:54,  2.49it/s, acc_step=1/1, ce_loss_token=2.5738, perplexity_token=13.1150]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|██████████████████████████▉                   | 611/1044 [03:47<02:52,  2.52it/s, acc_step=1/1, ce_loss_token=2.5736, perplexity_token=13.1129]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  59%|██████████████████████████▉                   | 612/1044 [03:47<02:55,  2.47it/s, acc_step=1/1, ce_loss_token=2.5735, perplexity_token=13.1111]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  59%|███████████████████████████                   | 613/1044 [03:48<02:46,  2.59it/s, acc_step=1/1, ce_loss_token=2.5733, perplexity_token=13.1090]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  59%|███████████████████████████                   | 614/1044 [03:48<02:45,  2.60it/s, acc_step=1/1, ce_loss_token=2.5732, perplexity_token=13.1073]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  59%|███████████████████████████                   | 615/1044 [03:48<02:41,  2.65it/s, acc_step=1/1, ce_loss_token=2.5730, perplexity_token=13.1050]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  59%|███████████████████████████▏                  | 616/1044 [03:49<02:37,  2.72it/s, acc_step=1/1, ce_loss_token=2.5729, perplexity_token=13.1033]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  59%|███████████████████████████▏                  | 617/1044 [03:49<02:42,  2.63it/s, acc_step=1/1, ce_loss_token=2.5727, perplexity_token=13.1018]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  59%|███████████████████████████▏                  | 618/1044 [03:49<02:28,  2.86it/s, acc_step=1/1, ce_loss_token=2.5726, perplexity_token=13.1003]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  59%|███████████████████████████▎                  | 619/1044 [03:50<02:28,  2.85it/s, acc_step=1/1, ce_loss_token=2.5725, perplexity_token=13.0983]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  59%|███████████████████████████▎                  | 620/1044 [03:50<02:38,  2.67it/s, acc_step=1/1, ce_loss_token=2.5723, perplexity_token=13.0964]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  59%|███████████████████████████▎                  | 621/1044 [03:51<02:42,  2.60it/s, acc_step=1/1, ce_loss_token=2.5722, perplexity_token=13.0946]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  60%|███████████████████████████▍                  | 622/1044 [03:51<02:42,  2.60it/s, acc_step=1/1, ce_loss_token=2.5720, perplexity_token=13.0925]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|███████████████████████████▍                  | 623/1044 [03:51<02:38,  2.65it/s, acc_step=1/1, ce_loss_token=2.5719, perplexity_token=13.0908]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  60%|███████████████████████████▍                  | 624/1044 [03:52<02:26,  2.87it/s, acc_step=1/1, ce_loss_token=2.5718, perplexity_token=13.0896]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  60%|███████████████████████████▌                  | 625/1044 [03:52<02:20,  2.98it/s, acc_step=1/1, ce_loss_token=2.5717, perplexity_token=13.0882]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  60%|███████████████████████████▌                  | 626/1044 [03:52<02:25,  2.88it/s, acc_step=1/1, ce_loss_token=2.5716, perplexity_token=13.0864]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  60%|███████████████████████████▋                  | 627/1044 [03:53<02:17,  3.03it/s, acc_step=1/1, ce_loss_token=2.5715, perplexity_token=13.0853]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  60%|███████████████████████████▋                  | 628/1044 [03:53<02:44,  2.52it/s, acc_step=1/1, ce_loss_token=2.5714, perplexity_token=13.0835]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  60%|███████████████████████████▋                  | 629/1044 [03:54<02:51,  2.42it/s, acc_step=1/1, ce_loss_token=2.5712, perplexity_token=13.0816]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  60%|███████████████████████████▊                  | 630/1044 [03:54<02:52,  2.40it/s, acc_step=1/1, ce_loss_token=2.5711, perplexity_token=13.0799]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  60%|███████████████████████████▊                  | 631/1044 [03:54<02:38,  2.61it/s, acc_step=1/1, ce_loss_token=2.5710, perplexity_token=13.0785]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  61%|███████████████████████████▊                  | 632/1044 [03:55<02:33,  2.69it/s, acc_step=1/1, ce_loss_token=2.5709, perplexity_token=13.0771]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  61%|███████████████████████████▉                  | 633/1044 [03:55<02:32,  2.69it/s, acc_step=1/1, ce_loss_token=2.5707, perplexity_token=13.0756]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  61%|███████████████████████████▉                  | 634/1044 [03:55<02:31,  2.71it/s, acc_step=1/1, ce_loss_token=2.5706, perplexity_token=13.0741]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  61%|███████████████████████████▉                  | 635/1044 [03:56<02:28,  2.76it/s, acc_step=1/1, ce_loss_token=2.5705, perplexity_token=13.0726]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  61%|████████████████████████████                  | 636/1044 [03:56<02:25,  2.80it/s, acc_step=1/1, ce_loss_token=2.5704, perplexity_token=13.0708]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  61%|████████████████████████████                  | 637/1044 [03:57<02:33,  2.66it/s, acc_step=1/1, ce_loss_token=2.5702, perplexity_token=13.0691]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  61%|████████████████████████████                  | 638/1044 [03:57<02:23,  2.84it/s, acc_step=1/1, ce_loss_token=2.5701, perplexity_token=13.0677]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  61%|████████████████████████████▏                 | 639/1044 [03:57<02:19,  2.91it/s, acc_step=1/1, ce_loss_token=2.5700, perplexity_token=13.0659]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  61%|████████████████████████████▏                 | 640/1044 [03:58<02:19,  2.89it/s, acc_step=1/1, ce_loss_token=2.5699, perplexity_token=13.0644]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  61%|████████████████████████████▏                 | 641/1044 [03:58<02:24,  2.80it/s, acc_step=1/1, ce_loss_token=2.5698, perplexity_token=13.0626]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  61%|████████████████████████████▎                 | 642/1044 [03:58<02:23,  2.80it/s, acc_step=1/1, ce_loss_token=2.5696, perplexity_token=13.0610]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  62%|████████████████████████████▎                 | 643/1044 [03:59<02:23,  2.80it/s, acc_step=1/1, ce_loss_token=2.5695, perplexity_token=13.0592]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|████████████████████████████▍                 | 644/1044 [03:59<02:24,  2.78it/s, acc_step=1/1, ce_loss_token=2.5694, perplexity_token=13.0576]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  62%|████████████████████████████▍                 | 645/1044 [03:59<02:25,  2.73it/s, acc_step=1/1, ce_loss_token=2.5692, perplexity_token=13.0559]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  62%|████████████████████████████▍                 | 646/1044 [04:00<02:25,  2.74it/s, acc_step=1/1, ce_loss_token=2.5691, perplexity_token=13.0542]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  62%|████████████████████████████▌                 | 647/1044 [04:00<02:26,  2.71it/s, acc_step=1/1, ce_loss_token=2.5690, perplexity_token=13.0522]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  62%|████████████████████████████▌                 | 648/1044 [04:01<02:32,  2.59it/s, acc_step=1/1, ce_loss_token=2.5688, perplexity_token=13.0506]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  62%|████████████████████████████▌                 | 649/1044 [04:01<02:21,  2.79it/s, acc_step=1/1, ce_loss_token=2.5687, perplexity_token=13.0495]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  62%|████████████████████████████▋                 | 650/1044 [04:01<02:40,  2.45it/s, acc_step=1/1, ce_loss_token=2.5686, perplexity_token=13.0476]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  62%|████████████████████████████▋                 | 651/1044 [04:02<02:37,  2.50it/s, acc_step=1/1, ce_loss_token=2.5685, perplexity_token=13.0458]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  62%|████████████████████████████▋                 | 652/1044 [04:02<02:31,  2.59it/s, acc_step=1/1, ce_loss_token=2.5683, perplexity_token=13.0441]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|████████████████████████████▊                 | 653/1044 [04:03<02:31,  2.58it/s, acc_step=1/1, ce_loss_token=2.5682, perplexity_token=13.0425]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  63%|████████████████████████████▊                 | 654/1044 [04:03<02:28,  2.62it/s, acc_step=1/1, ce_loss_token=2.5681, perplexity_token=13.0407]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  63%|████████████████████████████▊                 | 655/1044 [04:03<02:24,  2.69it/s, acc_step=1/1, ce_loss_token=2.5679, perplexity_token=13.0390]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|████████████████████████████▉                 | 656/1044 [04:04<02:23,  2.70it/s, acc_step=1/1, ce_loss_token=2.5678, perplexity_token=13.0371]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  63%|████████████████████████████▉                 | 657/1044 [04:04<02:14,  2.87it/s, acc_step=1/1, ce_loss_token=2.5677, perplexity_token=13.0359]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  63%|████████████████████████████▉                 | 658/1044 [04:04<02:16,  2.82it/s, acc_step=1/1, ce_loss_token=2.5676, perplexity_token=13.0341]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  63%|█████████████████████████████                 | 659/1044 [04:05<02:19,  2.76it/s, acc_step=1/1, ce_loss_token=2.5674, perplexity_token=13.0322]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  63%|█████████████████████████████                 | 660/1044 [04:05<02:26,  2.63it/s, acc_step=1/1, ce_loss_token=2.5673, perplexity_token=13.0306]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  63%|█████████████████████████████▏                | 662/1044 [04:06<02:13,  2.86it/s, acc_step=1/1, ce_loss_token=2.5672, perplexity_token=13.0296]

torch.Size([256, 301, 35]) torch.Size([256, 301])
torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|█████████████████████████████▏                | 663/1044 [04:06<02:13,  2.86it/s, acc_step=1/1, ce_loss_token=2.5671, perplexity_token=13.0280]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|█████████████████████████████▎                | 664/1044 [04:06<02:15,  2.81it/s, acc_step=1/1, ce_loss_token=2.5670, perplexity_token=13.0263]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  64%|█████████████████████████████▎                | 665/1044 [04:07<02:03,  3.06it/s, acc_step=1/1, ce_loss_token=2.5669, perplexity_token=13.0251]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  64%|█████████████████████████████▎                | 666/1044 [04:07<01:58,  3.19it/s, acc_step=1/1, ce_loss_token=2.5668, perplexity_token=13.0239]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  64%|█████████████████████████████▍                | 667/1044 [04:07<02:08,  2.93it/s, acc_step=1/1, ce_loss_token=2.5667, perplexity_token=13.0224]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  64%|█████████████████████████████▍                | 668/1044 [04:08<02:08,  2.92it/s, acc_step=1/1, ce_loss_token=2.5666, perplexity_token=13.0208]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  64%|█████████████████████████████▍                | 669/1044 [04:08<02:01,  3.08it/s, acc_step=1/1, ce_loss_token=2.5665, perplexity_token=13.0196]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  64%|█████████████████████████████▌                | 670/1044 [04:08<02:12,  2.82it/s, acc_step=1/1, ce_loss_token=2.5663, perplexity_token=13.0179]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  64%|█████████████████████████████▌                | 671/1044 [04:09<02:09,  2.87it/s, acc_step=1/1, ce_loss_token=2.5662, perplexity_token=13.0165]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  64%|█████████████████████████████▌                | 672/1044 [04:09<02:10,  2.84it/s, acc_step=1/1, ce_loss_token=2.5661, perplexity_token=13.0147]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  64%|█████████████████████████████▋                | 673/1044 [04:09<02:05,  2.95it/s, acc_step=1/1, ce_loss_token=2.5660, perplexity_token=13.0132]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  65%|█████████████████████████████▋                | 674/1044 [04:10<02:06,  2.91it/s, acc_step=1/1, ce_loss_token=2.5658, perplexity_token=13.0116]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  65%|█████████████████████████████▋                | 675/1044 [04:10<02:12,  2.79it/s, acc_step=1/1, ce_loss_token=2.5657, perplexity_token=13.0098]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  65%|█████████████████████████████▊                | 676/1044 [04:11<02:07,  2.88it/s, acc_step=1/1, ce_loss_token=2.5656, perplexity_token=13.0086]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  65%|█████████████████████████████▊                | 677/1044 [04:11<02:14,  2.74it/s, acc_step=1/1, ce_loss_token=2.5655, perplexity_token=13.0069]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  65%|█████████████████████████████▊                | 678/1044 [04:11<02:13,  2.73it/s, acc_step=1/1, ce_loss_token=2.5654, perplexity_token=13.0054]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|█████████████████████████████▉                | 679/1044 [04:12<02:14,  2.70it/s, acc_step=1/1, ce_loss_token=2.5652, perplexity_token=13.0038]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  65%|█████████████████████████████▉                | 680/1044 [04:12<02:04,  2.92it/s, acc_step=1/1, ce_loss_token=2.5651, perplexity_token=13.0024]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  65%|██████████████████████████████                | 681/1044 [04:12<02:05,  2.89it/s, acc_step=1/1, ce_loss_token=2.5650, perplexity_token=13.0007]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  65%|██████████████████████████████                | 682/1044 [04:13<02:07,  2.84it/s, acc_step=1/1, ce_loss_token=2.5649, perplexity_token=12.9993]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  65%|██████████████████████████████                | 683/1044 [04:13<02:11,  2.75it/s, acc_step=1/1, ce_loss_token=2.5648, perplexity_token=12.9976]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  66%|██████████████████████████████▏               | 684/1044 [04:13<02:11,  2.74it/s, acc_step=1/1, ce_loss_token=2.5647, perplexity_token=12.9962]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  66%|██████████████████████████████▏               | 685/1044 [04:14<02:14,  2.66it/s, acc_step=1/1, ce_loss_token=2.5645, perplexity_token=12.9946]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  66%|██████████████████████████████▏               | 686/1044 [04:14<02:27,  2.43it/s, acc_step=1/1, ce_loss_token=2.5644, perplexity_token=12.9930]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  66%|██████████████████████████████▎               | 687/1044 [04:15<02:22,  2.51it/s, acc_step=1/1, ce_loss_token=2.5643, perplexity_token=12.9915]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  66%|██████████████████████████████▎               | 688/1044 [04:15<02:18,  2.57it/s, acc_step=1/1, ce_loss_token=2.5642, perplexity_token=12.9902]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  66%|██████████████████████████████▎               | 689/1044 [04:15<02:07,  2.78it/s, acc_step=1/1, ce_loss_token=2.5641, perplexity_token=12.9890]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  66%|██████████████████████████████▍               | 690/1044 [04:16<02:11,  2.70it/s, acc_step=1/1, ce_loss_token=2.5640, perplexity_token=12.9876]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  66%|██████████████████████████████▍               | 691/1044 [04:16<02:16,  2.59it/s, acc_step=1/1, ce_loss_token=2.5639, perplexity_token=12.9860]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  66%|██████████████████████████████▍               | 692/1044 [04:17<02:15,  2.60it/s, acc_step=1/1, ce_loss_token=2.5637, perplexity_token=12.9843]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  66%|██████████████████████████████▌               | 693/1044 [04:17<02:16,  2.58it/s, acc_step=1/1, ce_loss_token=2.5636, perplexity_token=12.9828]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  66%|██████████████████████████████▌               | 694/1044 [04:17<02:17,  2.55it/s, acc_step=1/1, ce_loss_token=2.5635, perplexity_token=12.9811]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  67%|██████████████████████████████▌               | 695/1044 [04:18<02:15,  2.58it/s, acc_step=1/1, ce_loss_token=2.5634, perplexity_token=12.9797]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  67%|██████████████████████████████▋               | 696/1044 [04:18<02:11,  2.64it/s, acc_step=1/1, ce_loss_token=2.5633, perplexity_token=12.9782]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  67%|██████████████████████████████▋               | 697/1044 [04:18<02:09,  2.68it/s, acc_step=1/1, ce_loss_token=2.5631, perplexity_token=12.9765]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  67%|██████████████████████████████▊               | 698/1044 [04:19<02:10,  2.66it/s, acc_step=1/1, ce_loss_token=2.5630, perplexity_token=12.9750]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  67%|██████████████████████████████▊               | 699/1044 [04:19<02:09,  2.66it/s, acc_step=1/1, ce_loss_token=2.5629, perplexity_token=12.9735]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  67%|██████████████████████████████▊               | 700/1044 [04:20<02:00,  2.85it/s, acc_step=1/1, ce_loss_token=2.5628, perplexity_token=12.9722]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  67%|██████████████████████████████▉               | 701/1044 [04:20<01:51,  3.06it/s, acc_step=1/1, ce_loss_token=2.5627, perplexity_token=12.9711]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  67%|██████████████████████████████▉               | 702/1044 [04:20<01:44,  3.26it/s, acc_step=1/1, ce_loss_token=2.5626, perplexity_token=12.9697]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  67%|██████████████████████████████▉               | 703/1044 [04:20<01:49,  3.12it/s, acc_step=1/1, ce_loss_token=2.5625, perplexity_token=12.9681]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  67%|███████████████████████████████               | 704/1044 [04:21<01:45,  3.22it/s, acc_step=1/1, ce_loss_token=2.5624, perplexity_token=12.9668]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  68%|███████████████████████████████               | 705/1044 [04:21<01:48,  3.12it/s, acc_step=1/1, ce_loss_token=2.5623, perplexity_token=12.9654]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  68%|███████████████████████████████               | 706/1044 [04:21<01:44,  3.24it/s, acc_step=1/1, ce_loss_token=2.5622, perplexity_token=12.9640]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  68%|███████████████████████████████▏              | 707/1044 [04:22<01:47,  3.13it/s, acc_step=1/1, ce_loss_token=2.5620, perplexity_token=12.9622]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  68%|███████████████████████████████▏              | 708/1044 [04:22<01:54,  2.95it/s, acc_step=1/1, ce_loss_token=2.5619, perplexity_token=12.9606]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  68%|███████████████████████████████▏              | 709/1044 [04:22<01:50,  3.04it/s, acc_step=1/1, ce_loss_token=2.5618, perplexity_token=12.9594]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  68%|███████████████████████████████▎              | 710/1044 [04:23<01:53,  2.95it/s, acc_step=1/1, ce_loss_token=2.5617, perplexity_token=12.9579]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  68%|███████████████████████████████▎              | 711/1044 [04:23<01:57,  2.83it/s, acc_step=1/1, ce_loss_token=2.5616, perplexity_token=12.9562]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  68%|███████████████████████████████▎              | 712/1044 [04:24<02:03,  2.68it/s, acc_step=1/1, ce_loss_token=2.5614, perplexity_token=12.9546]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  68%|███████████████████████████████▍              | 713/1044 [04:24<02:08,  2.58it/s, acc_step=1/1, ce_loss_token=2.5613, perplexity_token=12.9529]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  68%|███████████████████████████████▍              | 714/1044 [04:24<02:04,  2.66it/s, acc_step=1/1, ce_loss_token=2.5612, perplexity_token=12.9514]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  68%|███████████████████████████████▌              | 715/1044 [04:25<01:59,  2.75it/s, acc_step=1/1, ce_loss_token=2.5611, perplexity_token=12.9499]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  69%|███████████████████████████████▌              | 716/1044 [04:25<02:01,  2.71it/s, acc_step=1/1, ce_loss_token=2.5610, perplexity_token=12.9484]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  69%|███████████████████████████████▌              | 717/1044 [04:25<01:59,  2.74it/s, acc_step=1/1, ce_loss_token=2.5609, perplexity_token=12.9469]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  69%|███████████████████████████████▋              | 718/1044 [04:26<02:02,  2.66it/s, acc_step=1/1, ce_loss_token=2.5607, perplexity_token=12.9453]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  69%|███████████████████████████████▋              | 719/1044 [04:26<02:04,  2.61it/s, acc_step=1/1, ce_loss_token=2.5606, perplexity_token=12.9440]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  69%|███████████████████████████████▋              | 720/1044 [04:27<02:03,  2.62it/s, acc_step=1/1, ce_loss_token=2.5605, perplexity_token=12.9424]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  69%|███████████████████████████████▊              | 721/1044 [04:27<02:01,  2.67it/s, acc_step=1/1, ce_loss_token=2.5604, perplexity_token=12.9409]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  69%|███████████████████████████████▊              | 722/1044 [04:27<02:04,  2.58it/s, acc_step=1/1, ce_loss_token=2.5603, perplexity_token=12.9393]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  69%|███████████████████████████████▊              | 723/1044 [04:28<02:03,  2.59it/s, acc_step=1/1, ce_loss_token=2.5602, perplexity_token=12.9378]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  69%|███████████████████████████████▉              | 724/1044 [04:28<02:04,  2.57it/s, acc_step=1/1, ce_loss_token=2.5600, perplexity_token=12.9362]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  69%|███████████████████████████████▉              | 725/1044 [04:29<02:04,  2.55it/s, acc_step=1/1, ce_loss_token=2.5599, perplexity_token=12.9348]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  70%|███████████████████████████████▉              | 726/1044 [04:29<02:01,  2.63it/s, acc_step=1/1, ce_loss_token=2.5598, perplexity_token=12.9332]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  70%|████████████████████████████████              | 727/1044 [04:29<01:59,  2.65it/s, acc_step=1/1, ce_loss_token=2.5597, perplexity_token=12.9318]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  70%|████████████████████████████████              | 728/1044 [04:30<02:00,  2.62it/s, acc_step=1/1, ce_loss_token=2.5596, perplexity_token=12.9302]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  70%|████████████████████████████████              | 729/1044 [04:30<01:49,  2.87it/s, acc_step=1/1, ce_loss_token=2.5595, perplexity_token=12.9291]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▏             | 730/1044 [04:30<01:51,  2.82it/s, acc_step=1/1, ce_loss_token=2.5594, perplexity_token=12.9276]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  70%|████████████████████████████████▏             | 731/1044 [04:31<01:51,  2.81it/s, acc_step=1/1, ce_loss_token=2.5592, perplexity_token=12.9260]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  70%|████████████████████████████████▎             | 732/1044 [04:31<01:53,  2.76it/s, acc_step=1/1, ce_loss_token=2.5591, perplexity_token=12.9244]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  70%|████████████████████████████████▎             | 733/1044 [04:31<01:56,  2.68it/s, acc_step=1/1, ce_loss_token=2.5590, perplexity_token=12.9229]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  70%|████████████████████████████████▎             | 734/1044 [04:32<01:54,  2.71it/s, acc_step=1/1, ce_loss_token=2.5589, perplexity_token=12.9216]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  70%|████████████████████████████████▍             | 735/1044 [04:32<01:54,  2.71it/s, acc_step=1/1, ce_loss_token=2.5588, perplexity_token=12.9201]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  70%|████████████████████████████████▍             | 736/1044 [04:33<01:58,  2.60it/s, acc_step=1/1, ce_loss_token=2.5587, perplexity_token=12.9185]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  71%|████████████████████████████████▍             | 737/1044 [04:33<01:58,  2.59it/s, acc_step=1/1, ce_loss_token=2.5586, perplexity_token=12.9171]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  71%|████████████████████████████████▌             | 738/1044 [04:33<01:58,  2.58it/s, acc_step=1/1, ce_loss_token=2.5584, perplexity_token=12.9156]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  71%|████████████████████████████████▌             | 739/1044 [04:34<01:55,  2.64it/s, acc_step=1/1, ce_loss_token=2.5583, perplexity_token=12.9140]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  71%|████████████████████████████████▌             | 740/1044 [04:34<01:44,  2.91it/s, acc_step=1/1, ce_loss_token=2.5582, perplexity_token=12.9129]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  71%|████████████████████████████████▋             | 741/1044 [04:34<01:41,  2.97it/s, acc_step=1/1, ce_loss_token=2.5581, perplexity_token=12.9114]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  71%|████████████████████████████████▋             | 742/1044 [04:35<01:44,  2.88it/s, acc_step=1/1, ce_loss_token=2.5580, perplexity_token=12.9099]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|████████████████████████████████▋             | 743/1044 [04:35<01:47,  2.80it/s, acc_step=1/1, ce_loss_token=2.5579, perplexity_token=12.9084]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  71%|████████████████████████████████▊             | 744/1044 [04:35<01:47,  2.80it/s, acc_step=1/1, ce_loss_token=2.5578, perplexity_token=12.9070]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  71%|████████████████████████████████▊             | 745/1044 [04:36<01:40,  2.96it/s, acc_step=1/1, ce_loss_token=2.5577, perplexity_token=12.9058]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  71%|████████████████████████████████▊             | 746/1044 [04:36<01:41,  2.94it/s, acc_step=1/1, ce_loss_token=2.5576, perplexity_token=12.9043]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  72%|████████████████████████████████▉             | 747/1044 [04:36<01:45,  2.83it/s, acc_step=1/1, ce_loss_token=2.5574, perplexity_token=12.9028]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  72%|████████████████████████████████▉             | 748/1044 [04:37<01:44,  2.83it/s, acc_step=1/1, ce_loss_token=2.5573, perplexity_token=12.9015]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  72%|█████████████████████████████████             | 749/1044 [04:37<01:38,  2.99it/s, acc_step=1/1, ce_loss_token=2.5573, perplexity_token=12.9005]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|█████████████████████████████████             | 750/1044 [04:37<01:32,  3.19it/s, acc_step=1/1, ce_loss_token=2.5572, perplexity_token=12.8994]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  72%|█████████████████████████████████             | 751/1044 [04:38<01:37,  3.00it/s, acc_step=1/1, ce_loss_token=2.5571, perplexity_token=12.8979]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  72%|█████████████████████████████████▏            | 752/1044 [04:38<01:54,  2.55it/s, acc_step=1/1, ce_loss_token=2.5570, perplexity_token=12.8965]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  72%|█████████████████████████████████▏            | 753/1044 [04:39<01:57,  2.48it/s, acc_step=1/1, ce_loss_token=2.5569, perplexity_token=12.8952]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  72%|█████████████████████████████████▏            | 754/1044 [04:39<01:49,  2.65it/s, acc_step=1/1, ce_loss_token=2.5568, perplexity_token=12.8942]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|█████████████████████████████████▎            | 755/1044 [04:39<01:46,  2.73it/s, acc_step=1/1, ce_loss_token=2.5567, perplexity_token=12.8928]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  72%|█████████████████████████████████▎            | 756/1044 [04:40<01:43,  2.79it/s, acc_step=1/1, ce_loss_token=2.5566, perplexity_token=12.8914]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  73%|█████████████████████████████████▎            | 757/1044 [04:40<01:43,  2.78it/s, acc_step=1/1, ce_loss_token=2.5565, perplexity_token=12.8901]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  73%|█████████████████████████████████▍            | 758/1044 [04:40<01:37,  2.93it/s, acc_step=1/1, ce_loss_token=2.5564, perplexity_token=12.8890]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  73%|█████████████████████████████████▍            | 759/1044 [04:41<01:44,  2.73it/s, acc_step=1/1, ce_loss_token=2.5562, perplexity_token=12.8873]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  73%|█████████████████████████████████▍            | 760/1044 [04:41<01:51,  2.55it/s, acc_step=1/1, ce_loss_token=2.5561, perplexity_token=12.8857]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  73%|█████████████████████████████████▌            | 761/1044 [04:42<01:47,  2.64it/s, acc_step=1/1, ce_loss_token=2.5560, perplexity_token=12.8843]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  73%|█████████████████████████████████▌            | 762/1044 [04:42<01:48,  2.60it/s, acc_step=1/1, ce_loss_token=2.5559, perplexity_token=12.8830]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  73%|█████████████████████████████████▌            | 763/1044 [04:42<01:48,  2.59it/s, acc_step=1/1, ce_loss_token=2.5558, perplexity_token=12.8816]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  73%|█████████████████████████████████▋            | 764/1044 [04:43<01:46,  2.63it/s, acc_step=1/1, ce_loss_token=2.5557, perplexity_token=12.8801]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  73%|█████████████████████████████████▋            | 765/1044 [04:43<01:43,  2.70it/s, acc_step=1/1, ce_loss_token=2.5556, perplexity_token=12.8789]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  73%|█████████████████████████████████▊            | 766/1044 [04:43<01:35,  2.91it/s, acc_step=1/1, ce_loss_token=2.5555, perplexity_token=12.8778]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  73%|█████████████████████████████████▊            | 767/1044 [04:44<01:37,  2.85it/s, acc_step=1/1, ce_loss_token=2.5554, perplexity_token=12.8762]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  74%|█████████████████████████████████▊            | 768/1044 [04:44<01:37,  2.83it/s, acc_step=1/1, ce_loss_token=2.5553, perplexity_token=12.8749]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  74%|█████████████████████████████████▉            | 769/1044 [04:44<01:39,  2.78it/s, acc_step=1/1, ce_loss_token=2.5552, perplexity_token=12.8734]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  74%|█████████████████████████████████▉            | 770/1044 [04:45<01:38,  2.78it/s, acc_step=1/1, ce_loss_token=2.5550, perplexity_token=12.8719]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  74%|█████████████████████████████████▉            | 771/1044 [04:45<01:39,  2.74it/s, acc_step=1/1, ce_loss_token=2.5549, perplexity_token=12.8706]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  74%|██████████████████████████████████            | 772/1044 [04:46<01:40,  2.70it/s, acc_step=1/1, ce_loss_token=2.5548, perplexity_token=12.8693]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  74%|██████████████████████████████████            | 773/1044 [04:46<01:38,  2.76it/s, acc_step=1/1, ce_loss_token=2.5547, perplexity_token=12.8678]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  74%|██████████████████████████████████            | 774/1044 [04:46<01:42,  2.63it/s, acc_step=1/1, ce_loss_token=2.5546, perplexity_token=12.8663]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  74%|██████████████████████████████████▏           | 775/1044 [04:47<01:42,  2.64it/s, acc_step=1/1, ce_loss_token=2.5545, perplexity_token=12.8650]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  74%|██████████████████████████████████▏           | 776/1044 [04:47<01:40,  2.66it/s, acc_step=1/1, ce_loss_token=2.5544, perplexity_token=12.8638]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  74%|██████████████████████████████████▏           | 777/1044 [04:47<01:42,  2.62it/s, acc_step=1/1, ce_loss_token=2.5543, perplexity_token=12.8624]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  75%|██████████████████████████████████▎           | 778/1044 [04:48<01:40,  2.63it/s, acc_step=1/1, ce_loss_token=2.5542, perplexity_token=12.8610]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  75%|██████████████████████████████████▎           | 779/1044 [04:48<01:37,  2.72it/s, acc_step=1/1, ce_loss_token=2.5541, perplexity_token=12.8596]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  75%|██████████████████████████████████▍           | 781/1044 [04:49<01:25,  3.08it/s, acc_step=1/1, ce_loss_token=2.5540, perplexity_token=12.8580]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  75%|██████████████████████████████████▍           | 782/1044 [04:49<01:25,  3.05it/s, acc_step=1/1, ce_loss_token=2.5539, perplexity_token=12.8566]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  75%|██████████████████████████████████▌           | 783/1044 [04:49<01:26,  3.02it/s, acc_step=1/1, ce_loss_token=2.5538, perplexity_token=12.8553]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|██████████████████████████████████▌           | 784/1044 [04:50<01:22,  3.14it/s, acc_step=1/1, ce_loss_token=2.5537, perplexity_token=12.8541]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  75%|██████████████████████████████████▌           | 785/1044 [04:50<01:24,  3.05it/s, acc_step=1/1, ce_loss_token=2.5536, perplexity_token=12.8528]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  75%|██████████████████████████████████▋           | 786/1044 [04:50<01:28,  2.93it/s, acc_step=1/1, ce_loss_token=2.5535, perplexity_token=12.8514]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  75%|██████████████████████████████████▋           | 787/1044 [04:51<01:30,  2.84it/s, acc_step=1/1, ce_loss_token=2.5534, perplexity_token=12.8501]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  75%|██████████████████████████████████▋           | 788/1044 [04:51<01:32,  2.77it/s, acc_step=1/1, ce_loss_token=2.5533, perplexity_token=12.8489]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  76%|██████████████████████████████████▊           | 789/1044 [04:52<01:34,  2.71it/s, acc_step=1/1, ce_loss_token=2.5532, perplexity_token=12.8475]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|██████████████████████████████████▊           | 790/1044 [04:52<01:34,  2.69it/s, acc_step=1/1, ce_loss_token=2.5530, perplexity_token=12.8462]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|██████████████████████████████████▊           | 791/1044 [04:52<01:34,  2.68it/s, acc_step=1/1, ce_loss_token=2.5529, perplexity_token=12.8448]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  76%|██████████████████████████████████▉           | 792/1044 [04:53<01:26,  2.90it/s, acc_step=1/1, ce_loss_token=2.5529, perplexity_token=12.8437]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  76%|██████████████████████████████████▉           | 793/1044 [04:53<01:28,  2.83it/s, acc_step=1/1, ce_loss_token=2.5527, perplexity_token=12.8423]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  76%|██████████████████████████████████▉           | 794/1044 [04:53<01:22,  3.02it/s, acc_step=1/1, ce_loss_token=2.5527, perplexity_token=12.8413]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  76%|███████████████████████████████████           | 795/1044 [04:54<01:26,  2.88it/s, acc_step=1/1, ce_loss_token=2.5526, perplexity_token=12.8400]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  76%|███████████████████████████████████           | 797/1044 [04:54<01:17,  3.19it/s, acc_step=1/1, ce_loss_token=2.5524, perplexity_token=12.8385]

torch.Size([256, 292, 35]) torch.Size([256, 292])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  76%|███████████████████████████████████▏          | 798/1044 [04:55<01:14,  3.31it/s, acc_step=1/1, ce_loss_token=2.5524, perplexity_token=12.8373]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  77%|███████████████████████████████████▏          | 799/1044 [04:55<01:21,  2.99it/s, acc_step=1/1, ce_loss_token=2.5523, perplexity_token=12.8360]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|███████████████████████████████████▏          | 800/1044 [04:55<01:22,  2.94it/s, acc_step=1/1, ce_loss_token=2.5521, perplexity_token=12.8346]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|███████████████████████████████████▎          | 801/1044 [04:56<01:23,  2.92it/s, acc_step=1/1, ce_loss_token=2.5520, perplexity_token=12.8333]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  77%|███████████████████████████████████▎          | 802/1044 [04:56<01:26,  2.80it/s, acc_step=1/1, ce_loss_token=2.5519, perplexity_token=12.8319]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  77%|███████████████████████████████████▍          | 803/1044 [04:56<01:27,  2.77it/s, acc_step=1/1, ce_loss_token=2.5518, perplexity_token=12.8305]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|███████████████████████████████████▍          | 804/1044 [04:57<01:27,  2.76it/s, acc_step=1/1, ce_loss_token=2.5517, perplexity_token=12.8292]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  77%|███████████████████████████████████▍          | 805/1044 [04:57<01:27,  2.72it/s, acc_step=1/1, ce_loss_token=2.5516, perplexity_token=12.8278]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  77%|███████████████████████████████████▌          | 806/1044 [04:57<01:20,  2.94it/s, acc_step=1/1, ce_loss_token=2.5515, perplexity_token=12.8268]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  77%|███████████████████████████████████▌          | 807/1044 [04:58<01:24,  2.81it/s, acc_step=1/1, ce_loss_token=2.5514, perplexity_token=12.8255]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  77%|███████████████████████████████████▋          | 809/1044 [04:59<01:23,  2.81it/s, acc_step=1/1, ce_loss_token=2.5513, perplexity_token=12.8240]

torch.Size([256, 336, 35]) torch.Size([256, 336])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|███████████████████████████████████▋          | 810/1044 [04:59<01:18,  2.97it/s, acc_step=1/1, ce_loss_token=2.5512, perplexity_token=12.8230]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  78%|███████████████████████████████████▋          | 811/1044 [04:59<01:19,  2.92it/s, acc_step=1/1, ce_loss_token=2.5511, perplexity_token=12.8216]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  78%|███████████████████████████████████▊          | 812/1044 [05:00<01:21,  2.83it/s, acc_step=1/1, ce_loss_token=2.5510, perplexity_token=12.8202]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  78%|███████████████████████████████████▊          | 813/1044 [05:00<01:26,  2.68it/s, acc_step=1/1, ce_loss_token=2.5509, perplexity_token=12.8189]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  78%|███████████████████████████████████▊          | 814/1044 [05:00<01:29,  2.56it/s, acc_step=1/1, ce_loss_token=2.5508, perplexity_token=12.8176]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  78%|███████████████████████████████████▉          | 815/1044 [05:01<01:24,  2.71it/s, acc_step=1/1, ce_loss_token=2.5507, perplexity_token=12.8166]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  78%|███████████████████████████████████▉          | 816/1044 [05:01<01:23,  2.73it/s, acc_step=1/1, ce_loss_token=2.5506, perplexity_token=12.8153]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  78%|███████████████████████████████████▉          | 817/1044 [05:01<01:16,  2.96it/s, acc_step=1/1, ce_loss_token=2.5506, perplexity_token=12.8142]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  78%|████████████████████████████████████          | 818/1044 [05:02<01:17,  2.92it/s, acc_step=1/1, ce_loss_token=2.5504, perplexity_token=12.8127]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  78%|████████████████████████████████████          | 819/1044 [05:02<01:16,  2.94it/s, acc_step=1/1, ce_loss_token=2.5503, perplexity_token=12.8114]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  79%|████████████████████████████████████▏         | 820/1044 [05:02<01:18,  2.86it/s, acc_step=1/1, ce_loss_token=2.5502, perplexity_token=12.8102]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  79%|████████████████████████████████████▏         | 821/1044 [05:03<01:20,  2.77it/s, acc_step=1/1, ce_loss_token=2.5501, perplexity_token=12.8088]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  79%|████████████████████████████████████▏         | 822/1044 [05:03<01:17,  2.87it/s, acc_step=1/1, ce_loss_token=2.5500, perplexity_token=12.8077]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  79%|████████████████████████████████████▎         | 823/1044 [05:04<01:17,  2.86it/s, acc_step=1/1, ce_loss_token=2.5499, perplexity_token=12.8064]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  79%|████████████████████████████████████▎         | 824/1044 [05:04<01:18,  2.81it/s, acc_step=1/1, ce_loss_token=2.5498, perplexity_token=12.8051]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  79%|████████████████████████████████████▎         | 825/1044 [05:04<01:18,  2.79it/s, acc_step=1/1, ce_loss_token=2.5497, perplexity_token=12.8038]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  79%|████████████████████████████████████▍         | 826/1044 [05:05<01:17,  2.82it/s, acc_step=1/1, ce_loss_token=2.5496, perplexity_token=12.8026]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  79%|████████████████████████████████████▍         | 827/1044 [05:05<01:20,  2.70it/s, acc_step=1/1, ce_loss_token=2.5495, perplexity_token=12.8013]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  79%|████████████████████████████████████▍         | 828/1044 [05:05<01:18,  2.77it/s, acc_step=1/1, ce_loss_token=2.5495, perplexity_token=12.8004]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  79%|████████████████████████████████████▌         | 829/1044 [05:06<01:19,  2.69it/s, acc_step=1/1, ce_loss_token=2.5494, perplexity_token=12.7991]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  80%|████████████████████████████████████▌         | 830/1044 [05:06<01:19,  2.69it/s, acc_step=1/1, ce_loss_token=2.5493, perplexity_token=12.7980]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  80%|████████████████████████████████████▌         | 831/1044 [05:07<01:34,  2.25it/s, acc_step=1/1, ce_loss_token=2.5492, perplexity_token=12.7967]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  80%|████████████████████████████████████▋         | 832/1044 [05:07<01:31,  2.33it/s, acc_step=1/1, ce_loss_token=2.5491, perplexity_token=12.7955]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  80%|████████████████████████████████████▋         | 833/1044 [05:08<01:26,  2.45it/s, acc_step=1/1, ce_loss_token=2.5490, perplexity_token=12.7942]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  80%|████████████████████████████████████▋         | 834/1044 [05:08<01:23,  2.50it/s, acc_step=1/1, ce_loss_token=2.5489, perplexity_token=12.7928]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  80%|████████████████████████████████████▊         | 835/1044 [05:08<01:22,  2.52it/s, acc_step=1/1, ce_loss_token=2.5488, perplexity_token=12.7914]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  80%|████████████████████████████████████▊         | 836/1044 [05:09<01:27,  2.37it/s, acc_step=1/1, ce_loss_token=2.5487, perplexity_token=12.7901]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  80%|████████████████████████████████████▉         | 837/1044 [05:09<01:26,  2.40it/s, acc_step=1/1, ce_loss_token=2.5486, perplexity_token=12.7888]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|████████████████████████████████████▉         | 838/1044 [05:09<01:17,  2.65it/s, acc_step=1/1, ce_loss_token=2.5485, perplexity_token=12.7878]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  80%|████████████████████████████████████▉         | 839/1044 [05:10<01:19,  2.59it/s, acc_step=1/1, ce_loss_token=2.5484, perplexity_token=12.7867]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████         | 840/1044 [05:10<01:16,  2.67it/s, acc_step=1/1, ce_loss_token=2.5483, perplexity_token=12.7854]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  81%|█████████████████████████████████████         | 841/1044 [05:11<01:18,  2.57it/s, acc_step=1/1, ce_loss_token=2.5482, perplexity_token=12.7840]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  81%|█████████████████████████████████████         | 842/1044 [05:11<01:16,  2.64it/s, acc_step=1/1, ce_loss_token=2.5481, perplexity_token=12.7829]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  81%|█████████████████████████████████████▏        | 843/1044 [05:11<01:15,  2.67it/s, acc_step=1/1, ce_loss_token=2.5480, perplexity_token=12.7815]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  81%|█████████████████████████████████████▏        | 844/1044 [05:12<01:16,  2.62it/s, acc_step=1/1, ce_loss_token=2.5479, perplexity_token=12.7802]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  81%|█████████████████████████████████████▏        | 845/1044 [05:12<01:14,  2.66it/s, acc_step=1/1, ce_loss_token=2.5478, perplexity_token=12.7789]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  81%|█████████████████████████████████████▎        | 846/1044 [05:12<01:10,  2.83it/s, acc_step=1/1, ce_loss_token=2.5477, perplexity_token=12.7779]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  81%|█████████████████████████████████████▎        | 847/1044 [05:13<01:08,  2.89it/s, acc_step=1/1, ce_loss_token=2.5476, perplexity_token=12.7768]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  81%|█████████████████████████████████████▎        | 848/1044 [05:13<01:10,  2.78it/s, acc_step=1/1, ce_loss_token=2.5475, perplexity_token=12.7755]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  81%|█████████████████████████████████████▍        | 849/1044 [05:13<01:09,  2.80it/s, acc_step=1/1, ce_loss_token=2.5474, perplexity_token=12.7742]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  81%|█████████████████████████████████████▍        | 850/1044 [05:14<01:12,  2.69it/s, acc_step=1/1, ce_loss_token=2.5473, perplexity_token=12.7729]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  82%|█████████████████████████████████████▍        | 851/1044 [05:14<01:13,  2.62it/s, acc_step=1/1, ce_loss_token=2.5472, perplexity_token=12.7715]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  82%|█████████████████████████████████████▌        | 852/1044 [05:15<01:12,  2.64it/s, acc_step=1/1, ce_loss_token=2.5471, perplexity_token=12.7702]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  82%|█████████████████████████████████████▌        | 853/1044 [05:15<01:10,  2.72it/s, acc_step=1/1, ce_loss_token=2.5470, perplexity_token=12.7689]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  82%|█████████████████████████████████████▋        | 854/1044 [05:15<01:10,  2.71it/s, acc_step=1/1, ce_loss_token=2.5469, perplexity_token=12.7675]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  82%|█████████████████████████████████████▋        | 855/1044 [05:16<01:09,  2.71it/s, acc_step=1/1, ce_loss_token=2.5468, perplexity_token=12.7664]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  82%|█████████████████████████████████████▋        | 856/1044 [05:16<01:08,  2.73it/s, acc_step=1/1, ce_loss_token=2.5467, perplexity_token=12.7655]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  82%|█████████████████████████████████████▊        | 857/1044 [05:16<01:08,  2.75it/s, acc_step=1/1, ce_loss_token=2.5466, perplexity_token=12.7642]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  82%|█████████████████████████████████████▊        | 858/1044 [05:17<01:08,  2.72it/s, acc_step=1/1, ce_loss_token=2.5465, perplexity_token=12.7629]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  82%|█████████████████████████████████████▊        | 859/1044 [05:17<01:08,  2.71it/s, acc_step=1/1, ce_loss_token=2.5464, perplexity_token=12.7617]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  82%|█████████████████████████████████████▉        | 860/1044 [05:18<01:07,  2.73it/s, acc_step=1/1, ce_loss_token=2.5464, perplexity_token=12.7605]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  82%|█████████████████████████████████████▉        | 861/1044 [05:18<01:08,  2.67it/s, acc_step=1/1, ce_loss_token=2.5463, perplexity_token=12.7594]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  83%|█████████████████████████████████████▉        | 862/1044 [05:18<01:10,  2.58it/s, acc_step=1/1, ce_loss_token=2.5462, perplexity_token=12.7582]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  83%|██████████████████████████████████████        | 863/1044 [05:19<01:10,  2.58it/s, acc_step=1/1, ce_loss_token=2.5461, perplexity_token=12.7570]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|██████████████████████████████████████        | 864/1044 [05:19<01:05,  2.75it/s, acc_step=1/1, ce_loss_token=2.5460, perplexity_token=12.7560]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  83%|██████████████████████████████████████        | 865/1044 [05:19<01:04,  2.78it/s, acc_step=1/1, ce_loss_token=2.5459, perplexity_token=12.7548]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  83%|██████████████████████████████████████▏       | 866/1044 [05:20<01:00,  2.95it/s, acc_step=1/1, ce_loss_token=2.5458, perplexity_token=12.7539]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  83%|██████████████████████████████████████▏       | 867/1044 [05:20<01:01,  2.89it/s, acc_step=1/1, ce_loss_token=2.5457, perplexity_token=12.7527]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  83%|██████████████████████████████████████▏       | 868/1044 [05:20<00:57,  3.07it/s, acc_step=1/1, ce_loss_token=2.5457, perplexity_token=12.7515]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  83%|██████████████████████████████████████▎       | 869/1044 [05:21<00:57,  3.03it/s, acc_step=1/1, ce_loss_token=2.5456, perplexity_token=12.7503]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  83%|██████████████████████████████████████▎       | 870/1044 [05:21<00:53,  3.25it/s, acc_step=1/1, ce_loss_token=2.5455, perplexity_token=12.7494]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  83%|██████████████████████████████████████▍       | 871/1044 [05:21<00:58,  2.96it/s, acc_step=1/1, ce_loss_token=2.5454, perplexity_token=12.7481]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  84%|██████████████████████████████████████▍       | 872/1044 [05:22<01:01,  2.80it/s, acc_step=1/1, ce_loss_token=2.5453, perplexity_token=12.7469]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  84%|██████████████████████████████████████▍       | 873/1044 [05:22<01:02,  2.73it/s, acc_step=1/1, ce_loss_token=2.5452, perplexity_token=12.7456]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  84%|██████████████████████████████████████▌       | 874/1044 [05:22<01:00,  2.81it/s, acc_step=1/1, ce_loss_token=2.5451, perplexity_token=12.7445]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|██████████████████████████████████████▌       | 875/1044 [05:23<00:59,  2.82it/s, acc_step=1/1, ce_loss_token=2.5450, perplexity_token=12.7432]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  84%|██████████████████████████████████████▌       | 876/1044 [05:23<01:01,  2.74it/s, acc_step=1/1, ce_loss_token=2.5449, perplexity_token=12.7420]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  84%|██████████████████████████████████████▋       | 877/1044 [05:24<00:57,  2.91it/s, acc_step=1/1, ce_loss_token=2.5448, perplexity_token=12.7411]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  84%|██████████████████████████████████████▋       | 878/1044 [05:24<00:56,  2.94it/s, acc_step=1/1, ce_loss_token=2.5447, perplexity_token=12.7398]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  84%|██████████████████████████████████████▋       | 879/1044 [05:24<00:59,  2.75it/s, acc_step=1/1, ce_loss_token=2.5446, perplexity_token=12.7385]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|██████████████████████████████████████▊       | 880/1044 [05:25<00:59,  2.77it/s, acc_step=1/1, ce_loss_token=2.5445, perplexity_token=12.7374]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|██████████████████████████████████████▊       | 881/1044 [05:25<00:55,  2.92it/s, acc_step=1/1, ce_loss_token=2.5445, perplexity_token=12.7364]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|██████████████████████████████████████▊       | 882/1044 [05:25<00:52,  3.07it/s, acc_step=1/1, ce_loss_token=2.5444, perplexity_token=12.7354]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  85%|██████████████████████████████████████▉       | 883/1044 [05:26<00:56,  2.87it/s, acc_step=1/1, ce_loss_token=2.5443, perplexity_token=12.7341]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  85%|██████████████████████████████████████▉       | 884/1044 [05:26<00:57,  2.80it/s, acc_step=1/1, ce_loss_token=2.5442, perplexity_token=12.7330]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  85%|██████████████████████████████████████▉       | 885/1044 [05:26<00:56,  2.83it/s, acc_step=1/1, ce_loss_token=2.5441, perplexity_token=12.7317]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  85%|███████████████████████████████████████       | 886/1044 [05:27<00:56,  2.79it/s, acc_step=1/1, ce_loss_token=2.5440, perplexity_token=12.7305]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  85%|███████████████████████████████████████       | 887/1044 [05:27<00:56,  2.77it/s, acc_step=1/1, ce_loss_token=2.5439, perplexity_token=12.7293]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  85%|███████████████████████████████████████▏      | 888/1044 [05:27<00:58,  2.65it/s, acc_step=1/1, ce_loss_token=2.5438, perplexity_token=12.7281]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  85%|███████████████████████████████████████▏      | 889/1044 [05:28<00:56,  2.73it/s, acc_step=1/1, ce_loss_token=2.5437, perplexity_token=12.7269]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▏      | 890/1044 [05:28<00:56,  2.71it/s, acc_step=1/1, ce_loss_token=2.5436, perplexity_token=12.7258]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  85%|███████████████████████████████████████▎      | 891/1044 [05:28<00:52,  2.94it/s, acc_step=1/1, ce_loss_token=2.5436, perplexity_token=12.7249]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▎      | 892/1044 [05:29<00:52,  2.88it/s, acc_step=1/1, ce_loss_token=2.5435, perplexity_token=12.7237]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  86%|███████████████████████████████████████▎      | 893/1044 [05:29<00:49,  3.07it/s, acc_step=1/1, ce_loss_token=2.5434, perplexity_token=12.7228]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  86%|███████████████████████████████████████▍      | 894/1044 [05:29<00:45,  3.28it/s, acc_step=1/1, ce_loss_token=2.5433, perplexity_token=12.7219]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  86%|███████████████████████████████████████▍      | 895/1044 [05:30<00:44,  3.38it/s, acc_step=1/1, ce_loss_token=2.5433, perplexity_token=12.7211]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|███████████████████████████████████████▍      | 896/1044 [05:30<00:47,  3.15it/s, acc_step=1/1, ce_loss_token=2.5432, perplexity_token=12.7201]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  86%|███████████████████████████████████████▌      | 897/1044 [05:30<00:50,  2.93it/s, acc_step=1/1, ce_loss_token=2.5431, perplexity_token=12.7187]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|███████████████████████████████████████▌      | 898/1044 [05:31<00:51,  2.84it/s, acc_step=1/1, ce_loss_token=2.5430, perplexity_token=12.7175]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  86%|███████████████████████████████████████▌      | 899/1044 [05:31<00:51,  2.83it/s, acc_step=1/1, ce_loss_token=2.5429, perplexity_token=12.7163]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  86%|███████████████████████████████████████▋      | 900/1044 [05:31<00:47,  3.03it/s, acc_step=1/1, ce_loss_token=2.5428, perplexity_token=12.7153]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|███████████████████████████████████████▋      | 901/1044 [05:32<00:49,  2.91it/s, acc_step=1/1, ce_loss_token=2.5427, perplexity_token=12.7140]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  86%|███████████████████████████████████████▋      | 902/1044 [05:32<00:57,  2.46it/s, acc_step=1/1, ce_loss_token=2.5426, perplexity_token=12.7128]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  86%|███████████████████████████████████████▊      | 903/1044 [05:33<00:55,  2.55it/s, acc_step=1/1, ce_loss_token=2.5425, perplexity_token=12.7116]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  87%|███████████████████████████████████████▊      | 904/1044 [05:33<00:53,  2.60it/s, acc_step=1/1, ce_loss_token=2.5424, perplexity_token=12.7102]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  87%|███████████████████████████████████████▉      | 905/1044 [05:34<00:58,  2.39it/s, acc_step=1/1, ce_loss_token=2.5423, perplexity_token=12.7090]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  87%|███████████████████████████████████████▉      | 906/1044 [05:34<00:53,  2.59it/s, acc_step=1/1, ce_loss_token=2.5422, perplexity_token=12.7081]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  87%|███████████████████████████████████████▉      | 907/1044 [05:34<00:52,  2.62it/s, acc_step=1/1, ce_loss_token=2.5422, perplexity_token=12.7070]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  87%|████████████████████████████████████████      | 908/1044 [05:35<00:52,  2.58it/s, acc_step=1/1, ce_loss_token=2.5421, perplexity_token=12.7057]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  87%|████████████████████████████████████████      | 909/1044 [05:35<00:52,  2.57it/s, acc_step=1/1, ce_loss_token=2.5420, perplexity_token=12.7045]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  87%|████████████████████████████████████████      | 910/1044 [05:36<00:55,  2.43it/s, acc_step=1/1, ce_loss_token=2.5419, perplexity_token=12.7034]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  87%|████████████████████████████████████████▏     | 912/1044 [05:36<00:45,  2.92it/s, acc_step=1/1, ce_loss_token=2.5418, perplexity_token=12.7023]

torch.Size([256, 318, 35]) torch.Size([256, 318])
torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  87%|████████████████████████████████████████▏     | 913/1044 [05:37<00:47,  2.73it/s, acc_step=1/1, ce_loss_token=2.5417, perplexity_token=12.7011]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  88%|████████████████████████████████████████▎     | 914/1044 [05:37<00:44,  2.94it/s, acc_step=1/1, ce_loss_token=2.5416, perplexity_token=12.7002]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  88%|████████████████████████████████████████▎     | 915/1044 [05:37<00:46,  2.79it/s, acc_step=1/1, ce_loss_token=2.5415, perplexity_token=12.6989]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  88%|████████████████████████████████████████▎     | 916/1044 [05:38<00:45,  2.80it/s, acc_step=1/1, ce_loss_token=2.5414, perplexity_token=12.6978]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|████████████████████████████████████████▍     | 917/1044 [05:38<00:46,  2.76it/s, acc_step=1/1, ce_loss_token=2.5413, perplexity_token=12.6965]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  88%|████████████████████████████████████████▍     | 918/1044 [05:38<00:45,  2.79it/s, acc_step=1/1, ce_loss_token=2.5412, perplexity_token=12.6953]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  88%|████████████████████████████████████████▍     | 919/1044 [05:39<00:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.5411, perplexity_token=12.6941]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  88%|████████████████████████████████████████▌     | 920/1044 [05:39<00:45,  2.74it/s, acc_step=1/1, ce_loss_token=2.5411, perplexity_token=12.6930]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  88%|████████████████████████████████████████▌     | 921/1044 [05:39<00:41,  2.94it/s, acc_step=1/1, ce_loss_token=2.5410, perplexity_token=12.6922]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  88%|████████████████████████████████████████▌     | 922/1044 [05:40<00:40,  2.98it/s, acc_step=1/1, ce_loss_token=2.5409, perplexity_token=12.6912]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  88%|████████████████████████████████████████▋     | 923/1044 [05:40<00:41,  2.92it/s, acc_step=1/1, ce_loss_token=2.5408, perplexity_token=12.6899]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|████████████████████████████████████████▋     | 924/1044 [05:40<00:42,  2.80it/s, acc_step=1/1, ce_loss_token=2.5407, perplexity_token=12.6888]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  89%|████████████████████████████████████████▊     | 925/1044 [05:41<00:39,  3.00it/s, acc_step=1/1, ce_loss_token=2.5407, perplexity_token=12.6880]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  89%|████████████████████████████████████████▊     | 926/1044 [05:41<00:36,  3.20it/s, acc_step=1/1, ce_loss_token=2.5406, perplexity_token=12.6871]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  89%|████████████████████████████████████████▊     | 927/1044 [05:41<00:40,  2.88it/s, acc_step=1/1, ce_loss_token=2.5405, perplexity_token=12.6859]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  89%|████████████████████████████████████████▉     | 928/1044 [05:42<00:40,  2.88it/s, acc_step=1/1, ce_loss_token=2.5404, perplexity_token=12.6848]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  89%|████████████████████████████████████████▉     | 929/1044 [05:42<00:38,  2.96it/s, acc_step=1/1, ce_loss_token=2.5403, perplexity_token=12.6838]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  89%|████████████████████████████████████████▉     | 930/1044 [05:42<00:42,  2.66it/s, acc_step=1/1, ce_loss_token=2.5402, perplexity_token=12.6827]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|█████████████████████████████████████████     | 931/1044 [05:43<00:43,  2.62it/s, acc_step=1/1, ce_loss_token=2.5401, perplexity_token=12.6816]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████     | 932/1044 [05:43<00:42,  2.66it/s, acc_step=1/1, ce_loss_token=2.5401, perplexity_token=12.6804]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  89%|█████████████████████████████████████████     | 933/1044 [05:44<00:40,  2.72it/s, acc_step=1/1, ce_loss_token=2.5400, perplexity_token=12.6792]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  89%|█████████████████████████████████████████▏    | 934/1044 [05:44<00:38,  2.86it/s, acc_step=1/1, ce_loss_token=2.5399, perplexity_token=12.6783]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  90%|█████████████████████████████████████████▏    | 935/1044 [05:44<00:36,  2.99it/s, acc_step=1/1, ce_loss_token=2.5398, perplexity_token=12.6774]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  90%|█████████████████████████████████████████▏    | 936/1044 [05:45<00:36,  2.93it/s, acc_step=1/1, ce_loss_token=2.5397, perplexity_token=12.6763]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  90%|█████████████████████████████████████████▎    | 937/1044 [05:45<00:37,  2.83it/s, acc_step=1/1, ce_loss_token=2.5396, perplexity_token=12.6751]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  90%|█████████████████████████████████████████▎    | 938/1044 [05:45<00:37,  2.80it/s, acc_step=1/1, ce_loss_token=2.5396, perplexity_token=12.6740]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  90%|█████████████████████████████████████████▎    | 939/1044 [05:46<00:38,  2.76it/s, acc_step=1/1, ce_loss_token=2.5395, perplexity_token=12.6727]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  90%|█████████████████████████████████████████▍    | 940/1044 [05:46<00:37,  2.77it/s, acc_step=1/1, ce_loss_token=2.5394, perplexity_token=12.6715]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  90%|█████████████████████████████████████████▍    | 941/1044 [05:46<00:37,  2.76it/s, acc_step=1/1, ce_loss_token=2.5393, perplexity_token=12.6703]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  90%|█████████████████████████████████████████▌    | 942/1044 [05:47<00:37,  2.69it/s, acc_step=1/1, ce_loss_token=2.5392, perplexity_token=12.6690]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  90%|█████████████████████████████████████████▌    | 943/1044 [05:47<00:35,  2.88it/s, acc_step=1/1, ce_loss_token=2.5391, perplexity_token=12.6681]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  90%|█████████████████████████████████████████▌    | 944/1044 [05:47<00:33,  2.97it/s, acc_step=1/1, ce_loss_token=2.5390, perplexity_token=12.6672]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  91%|█████████████████████████████████████████▋    | 945/1044 [05:48<00:35,  2.79it/s, acc_step=1/1, ce_loss_token=2.5389, perplexity_token=12.6659]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  91%|█████████████████████████████████████████▋    | 946/1044 [05:48<00:34,  2.84it/s, acc_step=1/1, ce_loss_token=2.5388, perplexity_token=12.6647]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  91%|█████████████████████████████████████████▋    | 947/1044 [05:49<00:36,  2.69it/s, acc_step=1/1, ce_loss_token=2.5387, perplexity_token=12.6636]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  91%|█████████████████████████████████████████▊    | 948/1044 [05:49<00:36,  2.65it/s, acc_step=1/1, ce_loss_token=2.5386, perplexity_token=12.6624]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  91%|█████████████████████████████████████████▊    | 949/1044 [05:49<00:35,  2.68it/s, acc_step=1/1, ce_loss_token=2.5385, perplexity_token=12.6612]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  91%|█████████████████████████████████████████▊    | 950/1044 [05:50<00:34,  2.71it/s, acc_step=1/1, ce_loss_token=2.5384, perplexity_token=12.6600]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  91%|█████████████████████████████████████████▉    | 951/1044 [05:50<00:33,  2.74it/s, acc_step=1/1, ce_loss_token=2.5384, perplexity_token=12.6588]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  91%|█████████████████████████████████████████▉    | 952/1044 [05:50<00:33,  2.72it/s, acc_step=1/1, ce_loss_token=2.5383, perplexity_token=12.6577]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  91%|█████████████████████████████████████████▉    | 953/1044 [05:51<00:30,  2.94it/s, acc_step=1/1, ce_loss_token=2.5382, perplexity_token=12.6568]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  91%|██████████████████████████████████████████    | 954/1044 [05:51<00:29,  3.06it/s, acc_step=1/1, ce_loss_token=2.5381, perplexity_token=12.6559]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  91%|██████████████████████████████████████████    | 955/1044 [05:51<00:29,  3.05it/s, acc_step=1/1, ce_loss_token=2.5380, perplexity_token=12.6548]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  92%|██████████████████████████████████████████    | 956/1044 [05:52<00:28,  3.06it/s, acc_step=1/1, ce_loss_token=2.5380, perplexity_token=12.6540]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  92%|██████████████████████████████████████████▏   | 957/1044 [05:52<00:29,  2.99it/s, acc_step=1/1, ce_loss_token=2.5379, perplexity_token=12.6528]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  92%|██████████████████████████████████████████▏   | 958/1044 [05:52<00:31,  2.77it/s, acc_step=1/1, ce_loss_token=2.5378, perplexity_token=12.6515]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  92%|██████████████████████████████████████████▎   | 959/1044 [05:53<00:31,  2.70it/s, acc_step=1/1, ce_loss_token=2.5377, perplexity_token=12.6503]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  92%|██████████████████████████████████████████▎   | 960/1044 [05:53<00:30,  2.73it/s, acc_step=1/1, ce_loss_token=2.5376, perplexity_token=12.6490]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  92%|██████████████████████████████████████████▎   | 961/1044 [05:53<00:28,  2.94it/s, acc_step=1/1, ce_loss_token=2.5375, perplexity_token=12.6482]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  92%|██████████████████████████████████████████▍   | 962/1044 [05:54<00:28,  2.92it/s, acc_step=1/1, ce_loss_token=2.5374, perplexity_token=12.6469]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  92%|██████████████████████████████████████████▍   | 963/1044 [05:54<00:28,  2.83it/s, acc_step=1/1, ce_loss_token=2.5373, perplexity_token=12.6457]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  92%|██████████████████████████████████████████▍   | 964/1044 [05:54<00:27,  2.86it/s, acc_step=1/1, ce_loss_token=2.5372, perplexity_token=12.6445]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  92%|██████████████████████████████████████████▌   | 965/1044 [05:55<00:27,  2.84it/s, acc_step=1/1, ce_loss_token=2.5371, perplexity_token=12.6434]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|██████████████████████████████████████████▌   | 966/1044 [05:55<00:27,  2.83it/s, acc_step=1/1, ce_loss_token=2.5370, perplexity_token=12.6422]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  93%|██████████████████████████████████████████▌   | 967/1044 [05:56<00:28,  2.70it/s, acc_step=1/1, ce_loss_token=2.5369, perplexity_token=12.6409]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  93%|██████████████████████████████████████████▋   | 968/1044 [05:56<00:27,  2.74it/s, acc_step=1/1, ce_loss_token=2.5368, perplexity_token=12.6396]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  93%|██████████████████████████████████████████▋   | 969/1044 [05:56<00:25,  2.92it/s, acc_step=1/1, ce_loss_token=2.5368, perplexity_token=12.6385]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  93%|██████████████████████████████████████████▋   | 970/1044 [05:57<00:25,  2.93it/s, acc_step=1/1, ce_loss_token=2.5367, perplexity_token=12.6373]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  93%|██████████████████████████████████████████▊   | 971/1044 [05:57<00:25,  2.89it/s, acc_step=1/1, ce_loss_token=2.5366, perplexity_token=12.6361]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  93%|██████████████████████████████████████████▊   | 972/1044 [05:57<00:26,  2.73it/s, acc_step=1/1, ce_loss_token=2.5365, perplexity_token=12.6350]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  93%|██████████████████████████████████████████▊   | 973/1044 [05:58<00:28,  2.53it/s, acc_step=1/1, ce_loss_token=2.5364, perplexity_token=12.6338]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  93%|██████████████████████████████████████████▉   | 974/1044 [05:58<00:26,  2.61it/s, acc_step=1/1, ce_loss_token=2.5363, perplexity_token=12.6326]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  93%|██████████████████████████████████████████▉   | 975/1044 [05:59<00:25,  2.70it/s, acc_step=1/1, ce_loss_token=2.5362, perplexity_token=12.6314]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  93%|███████████████████████████████████████████   | 976/1044 [05:59<00:23,  2.91it/s, acc_step=1/1, ce_loss_token=2.5361, perplexity_token=12.6304]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  94%|███████████████████████████████████████████   | 977/1044 [05:59<00:24,  2.74it/s, acc_step=1/1, ce_loss_token=2.5360, perplexity_token=12.6293]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  94%|███████████████████████████████████████████   | 978/1044 [06:00<00:23,  2.75it/s, acc_step=1/1, ce_loss_token=2.5359, perplexity_token=12.6281]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  94%|███████████████████████████████████████████▏  | 979/1044 [06:00<00:24,  2.69it/s, acc_step=1/1, ce_loss_token=2.5358, perplexity_token=12.6270]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  94%|███████████████████████████████████████████▏  | 980/1044 [06:00<00:23,  2.78it/s, acc_step=1/1, ce_loss_token=2.5357, perplexity_token=12.6258]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  94%|███████████████████████████████████████████▏  | 981/1044 [06:01<00:22,  2.78it/s, acc_step=1/1, ce_loss_token=2.5356, perplexity_token=12.6245]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  94%|███████████████████████████████████████████▎  | 982/1044 [06:01<00:23,  2.61it/s, acc_step=1/1, ce_loss_token=2.5355, perplexity_token=12.6232]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  94%|███████████████████████████████████████████▎  | 983/1044 [06:01<00:21,  2.82it/s, acc_step=1/1, ce_loss_token=2.5355, perplexity_token=12.6223]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  94%|███████████████████████████████████████████▎  | 984/1044 [06:02<00:19,  3.01it/s, acc_step=1/1, ce_loss_token=2.5354, perplexity_token=12.6215]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  94%|███████████████████████████████████████████▍  | 985/1044 [06:02<00:20,  2.94it/s, acc_step=1/1, ce_loss_token=2.5353, perplexity_token=12.6203]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  94%|███████████████████████████████████████████▍  | 986/1044 [06:02<00:19,  2.91it/s, acc_step=1/1, ce_loss_token=2.5352, perplexity_token=12.6191]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  95%|███████████████████████████████████████████▍  | 987/1044 [06:03<00:18,  3.05it/s, acc_step=1/1, ce_loss_token=2.5351, perplexity_token=12.6182]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  95%|███████████████████████████████████████████▌  | 988/1044 [06:03<00:18,  3.08it/s, acc_step=1/1, ce_loss_token=2.5351, perplexity_token=12.6171]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  95%|███████████████████████████████████████████▌  | 989/1044 [06:03<00:18,  2.95it/s, acc_step=1/1, ce_loss_token=2.5350, perplexity_token=12.6160]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  95%|███████████████████████████████████████████▌  | 990/1044 [06:04<00:17,  3.09it/s, acc_step=1/1, ce_loss_token=2.5349, perplexity_token=12.6152]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  95%|███████████████████████████████████████████▋  | 991/1044 [06:04<00:17,  2.96it/s, acc_step=1/1, ce_loss_token=2.5348, perplexity_token=12.6139]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  95%|███████████████████████████████████████████▋  | 992/1044 [06:04<00:17,  2.95it/s, acc_step=1/1, ce_loss_token=2.5347, perplexity_token=12.6131]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  95%|███████████████████████████████████████████▊  | 993/1044 [06:05<00:17,  2.87it/s, acc_step=1/1, ce_loss_token=2.5346, perplexity_token=12.6119]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|███████████████████████████████████████████▊  | 994/1044 [06:05<00:16,  3.05it/s, acc_step=1/1, ce_loss_token=2.5346, perplexity_token=12.6110]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|███████████████████████████████████████████▊  | 995/1044 [06:05<00:16,  2.98it/s, acc_step=1/1, ce_loss_token=2.5345, perplexity_token=12.6098]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|███████████████████████████████████████████▉  | 996/1044 [06:06<00:16,  2.83it/s, acc_step=1/1, ce_loss_token=2.5344, perplexity_token=12.6086]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  96%|███████████████████████████████████████████▉  | 998/1044 [06:06<00:13,  3.38it/s, acc_step=1/1, ce_loss_token=2.5343, perplexity_token=12.6079]

torch.Size([256, 295, 35]) torch.Size([256, 295])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  96%|████████████████████████████████████████████  | 999/1044 [06:07<00:14,  3.13it/s, acc_step=1/1, ce_loss_token=2.5342, perplexity_token=12.6067]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  96%|███████████████████████████████████████████  | 1000/1044 [06:07<00:14,  2.98it/s, acc_step=1/1, ce_loss_token=2.5341, perplexity_token=12.6055]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1001/1044 [06:07<00:14,  2.98it/s, acc_step=1/1, ce_loss_token=2.5340, perplexity_token=12.6044]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1002/1044 [06:08<00:13,  3.12it/s, acc_step=1/1, ce_loss_token=2.5340, perplexity_token=12.6034]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1003/1044 [06:08<00:12,  3.28it/s, acc_step=1/1, ce_loss_token=2.5339, perplexity_token=12.6024]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1004/1044 [06:08<00:12,  3.08it/s, acc_step=1/1, ce_loss_token=2.5338, perplexity_token=12.6011]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1005/1044 [06:09<00:14,  2.76it/s, acc_step=1/1, ce_loss_token=2.5337, perplexity_token=12.6000]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1006/1044 [06:09<00:13,  2.71it/s, acc_step=1/1, ce_loss_token=2.5336, perplexity_token=12.5989]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  96%|███████████████████████████████████████████▍ | 1007/1044 [06:09<00:13,  2.70it/s, acc_step=1/1, ce_loss_token=2.5335, perplexity_token=12.5977]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  97%|███████████████████████████████████████████▍ | 1008/1044 [06:10<00:12,  2.89it/s, acc_step=1/1, ce_loss_token=2.5334, perplexity_token=12.5968]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  97%|███████████████████████████████████████████▍ | 1009/1044 [06:10<00:12,  2.83it/s, acc_step=1/1, ce_loss_token=2.5334, perplexity_token=12.5956]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1010/1044 [06:10<00:11,  2.87it/s, acc_step=1/1, ce_loss_token=2.5333, perplexity_token=12.5945]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1011/1044 [06:11<00:11,  2.87it/s, acc_step=1/1, ce_loss_token=2.5332, perplexity_token=12.5934]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1012/1044 [06:11<00:11,  2.78it/s, acc_step=1/1, ce_loss_token=2.5331, perplexity_token=12.5924]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  97%|███████████████████████████████████████████▋ | 1013/1044 [06:12<00:11,  2.79it/s, acc_step=1/1, ce_loss_token=2.5330, perplexity_token=12.5912]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  97%|███████████████████████████████████████████▋ | 1014/1044 [06:12<00:10,  2.82it/s, acc_step=1/1, ce_loss_token=2.5329, perplexity_token=12.5902]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1015/1044 [06:12<00:10,  2.70it/s, acc_step=1/1, ce_loss_token=2.5328, perplexity_token=12.5891]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1016/1044 [06:13<00:10,  2.77it/s, acc_step=1/1, ce_loss_token=2.5327, perplexity_token=12.5880]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1017/1044 [06:13<00:10,  2.63it/s, acc_step=1/1, ce_loss_token=2.5326, perplexity_token=12.5868]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1018/1044 [06:13<00:09,  2.82it/s, acc_step=1/1, ce_loss_token=2.5326, perplexity_token=12.5858]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1019/1044 [06:14<00:08,  2.80it/s, acc_step=1/1, ce_loss_token=2.5325, perplexity_token=12.5844]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1020/1044 [06:14<00:08,  2.76it/s, acc_step=1/1, ce_loss_token=2.5324, perplexity_token=12.5833]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  98%|████████████████████████████████████████████ | 1021/1044 [06:14<00:08,  2.70it/s, acc_step=1/1, ce_loss_token=2.5323, perplexity_token=12.5822]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  98%|████████████████████████████████████████████ | 1022/1044 [06:15<00:08,  2.74it/s, acc_step=1/1, ce_loss_token=2.5322, perplexity_token=12.5810]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|████████████████████████████████████████████ | 1023/1044 [06:15<00:07,  2.72it/s, acc_step=1/1, ce_loss_token=2.5321, perplexity_token=12.5799]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  98%|████████████████████████████████████████████▏| 1024/1044 [06:16<00:07,  2.82it/s, acc_step=1/1, ce_loss_token=2.5320, perplexity_token=12.5788]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  98%|████████████████████████████████████████████▏| 1025/1044 [06:16<00:06,  2.78it/s, acc_step=1/1, ce_loss_token=2.5319, perplexity_token=12.5778]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  98%|████████████████████████████████████████████▏| 1026/1044 [06:16<00:06,  2.77it/s, acc_step=1/1, ce_loss_token=2.5318, perplexity_token=12.5766]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  98%|████████████████████████████████████████████▎| 1027/1044 [06:17<00:06,  2.72it/s, acc_step=1/1, ce_loss_token=2.5317, perplexity_token=12.5755]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|████████████████████████████████████████████▎| 1028/1044 [06:17<00:06,  2.66it/s, acc_step=1/1, ce_loss_token=2.5317, perplexity_token=12.5743]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  99%|████████████████████████████████████████████▎| 1029/1044 [06:17<00:05,  2.63it/s, acc_step=1/1, ce_loss_token=2.5316, perplexity_token=12.5733]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  99%|████████████████████████████████████████████▍| 1030/1044 [06:18<00:05,  2.70it/s, acc_step=1/1, ce_loss_token=2.5315, perplexity_token=12.5722]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  99%|████████████████████████████████████████████▍| 1031/1044 [06:18<00:04,  2.74it/s, acc_step=1/1, ce_loss_token=2.5314, perplexity_token=12.5710]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  99%|████████████████████████████████████████████▍| 1032/1044 [06:18<00:04,  2.83it/s, acc_step=1/1, ce_loss_token=2.5313, perplexity_token=12.5701]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|████████████████████████████████████████████▌| 1033/1044 [06:19<00:03,  2.78it/s, acc_step=1/1, ce_loss_token=2.5312, perplexity_token=12.5690]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  99%|████████████████████████████████████████████▌| 1034/1044 [06:19<00:03,  2.71it/s, acc_step=1/1, ce_loss_token=2.5311, perplexity_token=12.5678]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  99%|████████████████████████████████████████████▌| 1035/1044 [06:20<00:03,  2.66it/s, acc_step=1/1, ce_loss_token=2.5310, perplexity_token=12.5666]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  99%|████████████████████████████████████████████▋| 1036/1044 [06:20<00:02,  2.90it/s, acc_step=1/1, ce_loss_token=2.5310, perplexity_token=12.5655]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  99%|████████████████████████████████████████████▋| 1037/1044 [06:20<00:02,  2.87it/s, acc_step=1/1, ce_loss_token=2.5309, perplexity_token=12.5644]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|████████████████████████████████████████████▋| 1038/1044 [06:21<00:02,  2.79it/s, acc_step=1/1, ce_loss_token=2.5308, perplexity_token=12.5632]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]: 100%|████████████████████████████████████████████▊| 1039/1044 [06:21<00:01,  2.79it/s, acc_step=1/1, ce_loss_token=2.5307, perplexity_token=12.5620]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]: 100%|████████████████████████████████████████████▊| 1040/1044 [06:21<00:01,  2.72it/s, acc_step=1/1, ce_loss_token=2.5306, perplexity_token=12.5609]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]: 100%|████████████████████████████████████████████▉| 1042/1044 [06:22<00:00,  3.20it/s, acc_step=1/1, ce_loss_token=2.5305, perplexity_token=12.5595]

torch.Size([256, 300, 35]) torch.Size([256, 300])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]: 100%|████████████████████████████████████████████▉| 1043/1044 [06:22<00:00,  3.09it/s, acc_step=1/1, ce_loss_token=2.5304, perplexity_token=12.5583]

torch.Size([170, 296, 35]) torch.Size([170, 296])

Generating with greedy search...

📊 Metrics (Epoch 0):
├── TRAIN:
│   ├── ce_loss_char: 2.5303
│   ├── ce_loss_token: 2.5303
│   ├── perplexity_char: 12.5576
│   └── perplexity_token: 12.5576
├── VAL:
│   ├── ce_loss_char: 2.3950
│   ├── ce_loss_token: 2.3950
│   ├── perplexity_char: 10.9679
│   └── perplexity_token: 10.9679
└── TRAINING:
    └── learning_rate: 0.000028


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   0%|                                                | 1/1044 [00:00<08:01,  2.17it/s, acc_step=1/1, ce_loss_token=2.4285, perplexity_token=11.3415]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   0%|                                                | 2/1044 [00:00<07:05,  2.45it/s, acc_step=1/1, ce_loss_token=2.4340, perplexity_token=11.4041]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   0%|▏                                               | 3/1044 [00:01<06:50,  2.53it/s, acc_step=1/1, ce_loss_token=2.4370, perplexity_token=11.4391]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   0%|▏                                               | 4/1044 [00:01<06:59,  2.48it/s, acc_step=1/1, ce_loss_token=2.4355, perplexity_token=11.4217]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   0%|▏                                               | 5/1044 [00:02<06:56,  2.49it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4348]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:   1%|▎                                               | 6/1044 [00:02<07:26,  2.32it/s, acc_step=1/1, ce_loss_token=2.4359, perplexity_token=11.4261]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▎                                               | 7/1044 [00:02<07:03,  2.45it/s, acc_step=1/1, ce_loss_token=2.4357, perplexity_token=11.4235]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   1%|▎                                               | 8/1044 [00:03<06:46,  2.55it/s, acc_step=1/1, ce_loss_token=2.4354, perplexity_token=11.4201]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:   1%|▍                                               | 9/1044 [00:03<06:24,  2.69it/s, acc_step=1/1, ce_loss_token=2.4361, perplexity_token=11.4289]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   1%|▍                                              | 10/1044 [00:03<06:37,  2.60it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4333]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   1%|▍                                              | 11/1044 [00:04<06:41,  2.57it/s, acc_step=1/1, ce_loss_token=2.4366, perplexity_token=11.4341]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:   1%|▌                                              | 12/1044 [00:04<07:06,  2.42it/s, acc_step=1/1, ce_loss_token=2.4359, perplexity_token=11.4261]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   1%|▌                                              | 13/1044 [00:05<07:11,  2.39it/s, acc_step=1/1, ce_loss_token=2.4356, perplexity_token=11.4227]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   1%|▋                                              | 14/1044 [00:05<07:03,  2.43it/s, acc_step=1/1, ce_loss_token=2.4352, perplexity_token=11.4177]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   1%|▋                                              | 15/1044 [00:05<06:33,  2.61it/s, acc_step=1/1, ce_loss_token=2.4375, perplexity_token=11.4445]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|▋                                              | 16/1044 [00:06<05:59,  2.86it/s, acc_step=1/1, ce_loss_token=2.4383, perplexity_token=11.4537]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   2%|▊                                              | 17/1044 [00:06<06:04,  2.81it/s, acc_step=1/1, ce_loss_token=2.4379, perplexity_token=11.4491]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   2%|▊                                              | 18/1044 [00:06<06:10,  2.77it/s, acc_step=1/1, ce_loss_token=2.4374, perplexity_token=11.4428]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   2%|▊                                              | 19/1044 [00:07<06:17,  2.72it/s, acc_step=1/1, ce_loss_token=2.4370, perplexity_token=11.4382]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:   2%|▉                                              | 20/1044 [00:07<06:57,  2.45it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4377]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   2%|▉                                              | 21/1044 [00:08<06:42,  2.54it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4334]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|▉                                              | 22/1044 [00:08<06:37,  2.57it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4351]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   2%|█                                              | 23/1044 [00:09<06:34,  2.59it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4325]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   2%|█                                              | 24/1044 [00:09<06:25,  2.64it/s, acc_step=1/1, ce_loss_token=2.4361, perplexity_token=11.4288]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   2%|█▏                                             | 25/1044 [00:09<06:40,  2.54it/s, acc_step=1/1, ce_loss_token=2.4359, perplexity_token=11.4263]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|█▏                                             | 26/1044 [00:10<06:30,  2.61it/s, acc_step=1/1, ce_loss_token=2.4359, perplexity_token=11.4264]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   3%|█▏                                             | 27/1044 [00:10<05:50,  2.91it/s, acc_step=1/1, ce_loss_token=2.4371, perplexity_token=11.4395]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   3%|█▎                                             | 28/1044 [00:10<06:02,  2.80it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4374]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   3%|█▎                                             | 29/1044 [00:11<06:07,  2.76it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4375]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   3%|█▎                                             | 30/1044 [00:11<06:18,  2.68it/s, acc_step=1/1, ce_loss_token=2.4363, perplexity_token=11.4309]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   3%|█▍                                             | 31/1044 [00:11<05:48,  2.90it/s, acc_step=1/1, ce_loss_token=2.4368, perplexity_token=11.4369]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   3%|█▍                                             | 32/1044 [00:12<05:55,  2.85it/s, acc_step=1/1, ce_loss_token=2.4368, perplexity_token=11.4368]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:   3%|█▍                                             | 33/1044 [00:12<06:23,  2.64it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4330]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   3%|█▌                                             | 34/1044 [00:13<06:33,  2.57it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4321]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   3%|█▌                                             | 35/1044 [00:13<06:33,  2.56it/s, acc_step=1/1, ce_loss_token=2.4363, perplexity_token=11.4309]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   3%|█▌                                             | 36/1044 [00:13<06:29,  2.59it/s, acc_step=1/1, ce_loss_token=2.4363, perplexity_token=11.4310]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   4%|█▋                                             | 37/1044 [00:14<05:58,  2.81it/s, acc_step=1/1, ce_loss_token=2.4366, perplexity_token=11.4339]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:   4%|█▋                                             | 38/1044 [00:14<05:25,  3.09it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4381]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   4%|█▊                                             | 39/1044 [00:14<05:26,  3.08it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4375]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   4%|█▊                                             | 40/1044 [00:15<05:40,  2.95it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4357]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:   4%|█▊                                             | 41/1044 [00:15<06:05,  2.75it/s, acc_step=1/1, ce_loss_token=2.4371, perplexity_token=11.4399]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   4%|█▉                                             | 42/1044 [00:15<05:59,  2.78it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4372]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   4%|█▉                                             | 43/1044 [00:16<05:59,  2.79it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4313]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   4%|██                                             | 45/1044 [00:16<05:20,  3.12it/s, acc_step=1/1, ce_loss_token=2.4379, perplexity_token=11.4493]

torch.Size([256, 307, 35]) torch.Size([256, 307])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   4%|██                                             | 46/1044 [00:17<05:28,  3.04it/s, acc_step=1/1, ce_loss_token=2.4377, perplexity_token=11.4465]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:   5%|██                                             | 47/1044 [00:17<05:30,  3.02it/s, acc_step=1/1, ce_loss_token=2.4375, perplexity_token=11.4438]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:   5%|██▏                                            | 48/1044 [00:17<05:29,  3.02it/s, acc_step=1/1, ce_loss_token=2.4378, perplexity_token=11.4481]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:   5%|██▏                                            | 49/1044 [00:18<07:02,  2.35it/s, acc_step=1/1, ce_loss_token=2.4377, perplexity_token=11.4469]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   5%|██▎                                            | 50/1044 [00:18<06:35,  2.51it/s, acc_step=1/1, ce_loss_token=2.4379, perplexity_token=11.4491]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   5%|██▎                                            | 51/1044 [00:19<06:21,  2.60it/s, acc_step=1/1, ce_loss_token=2.4378, perplexity_token=11.4473]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   5%|██▎                                            | 52/1044 [00:19<06:31,  2.54it/s, acc_step=1/1, ce_loss_token=2.4378, perplexity_token=11.4482]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   5%|██▍                                            | 53/1044 [00:19<06:19,  2.61it/s, acc_step=1/1, ce_loss_token=2.4376, perplexity_token=11.4455]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▍                                            | 54/1044 [00:20<06:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.4373, perplexity_token=11.4423]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   5%|██▍                                            | 55/1044 [00:20<06:12,  2.66it/s, acc_step=1/1, ce_loss_token=2.4372, perplexity_token=11.4404]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:   5%|██▌                                            | 56/1044 [00:20<05:38,  2.92it/s, acc_step=1/1, ce_loss_token=2.4376, perplexity_token=11.4453]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   5%|██▌                                            | 57/1044 [00:21<05:51,  2.81it/s, acc_step=1/1, ce_loss_token=2.4373, perplexity_token=11.4416]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|██▌                                            | 58/1044 [00:21<05:49,  2.82it/s, acc_step=1/1, ce_loss_token=2.4371, perplexity_token=11.4403]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   6%|██▋                                            | 59/1044 [00:22<05:52,  2.80it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4377]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   6%|██▋                                            | 60/1044 [00:22<05:48,  2.82it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4355]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▋                                            | 61/1044 [00:22<05:29,  2.98it/s, acc_step=1/1, ce_loss_token=2.4371, perplexity_token=11.4393]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   6%|██▊                                            | 62/1044 [00:23<05:40,  2.88it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4377]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   6%|██▊                                            | 63/1044 [00:23<06:10,  2.65it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4352]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▉                                            | 64/1044 [00:23<05:48,  2.82it/s, acc_step=1/1, ce_loss_token=2.4369, perplexity_token=11.4378]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|██▉                                            | 65/1044 [00:24<05:45,  2.83it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4357]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   6%|██▉                                            | 66/1044 [00:24<05:45,  2.83it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4332]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   6%|███                                            | 67/1044 [00:24<05:23,  3.02it/s, acc_step=1/1, ce_loss_token=2.4366, perplexity_token=11.4342]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   7%|███                                            | 68/1044 [00:25<05:37,  2.89it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4321]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   7%|███                                            | 69/1044 [00:25<05:48,  2.80it/s, acc_step=1/1, ce_loss_token=2.4362, perplexity_token=11.4298]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   7%|███▏                                           | 70/1044 [00:25<05:59,  2.71it/s, acc_step=1/1, ce_loss_token=2.4361, perplexity_token=11.4283]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   7%|███▏                                           | 71/1044 [00:26<05:37,  2.88it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4320]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   7%|███▏                                           | 72/1044 [00:26<05:47,  2.79it/s, acc_step=1/1, ce_loss_token=2.4363, perplexity_token=11.4305]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   7%|███▎                                           | 73/1044 [00:26<05:22,  3.01it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4321]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   7%|███▎                                           | 74/1044 [00:27<05:32,  2.92it/s, acc_step=1/1, ce_loss_token=2.4362, perplexity_token=11.4301]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▍                                           | 75/1044 [00:27<05:41,  2.84it/s, acc_step=1/1, ce_loss_token=2.4361, perplexity_token=11.4288]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   7%|███▍                                           | 77/1044 [00:28<05:06,  3.15it/s, acc_step=1/1, ce_loss_token=2.4370, perplexity_token=11.4382]

torch.Size([256, 298, 35]) torch.Size([256, 298])
torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   7%|███▌                                           | 78/1044 [00:28<05:25,  2.96it/s, acc_step=1/1, ce_loss_token=2.4368, perplexity_token=11.4363]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:   8%|███▌                                           | 79/1044 [00:29<05:58,  2.69it/s, acc_step=1/1, ce_loss_token=2.4366, perplexity_token=11.4339]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   8%|███▌                                           | 80/1044 [00:29<05:50,  2.75it/s, acc_step=1/1, ce_loss_token=2.4364, perplexity_token=11.4317]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   8%|███▋                                           | 81/1044 [00:29<05:28,  2.93it/s, acc_step=1/1, ce_loss_token=2.4367, perplexity_token=11.4352]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:   8%|███▋                                           | 82/1044 [00:30<05:48,  2.76it/s, acc_step=1/1, ce_loss_token=2.4365, perplexity_token=11.4324]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:   8%|███▋                                           | 83/1044 [00:30<06:26,  2.49it/s, acc_step=1/1, ce_loss_token=2.4363, perplexity_token=11.4304]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   8%|███▊                                           | 84/1044 [00:30<05:53,  2.72it/s, acc_step=1/1, ce_loss_token=2.4362, perplexity_token=11.4298]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   8%|███▊                                           | 85/1044 [00:31<05:57,  2.68it/s, acc_step=1/1, ce_loss_token=2.4361, perplexity_token=11.4286]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   8%|███▊                                           | 86/1044 [00:31<05:57,  2.68it/s, acc_step=1/1, ce_loss_token=2.4359, perplexity_token=11.4260]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   8%|███▉                                           | 87/1044 [00:32<05:55,  2.69it/s, acc_step=1/1, ce_loss_token=2.4356, perplexity_token=11.4228]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   8%|███▉                                           | 88/1044 [00:32<05:57,  2.67it/s, acc_step=1/1, ce_loss_token=2.4354, perplexity_token=11.4206]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   9%|████                                           | 89/1044 [00:32<05:36,  2.84it/s, acc_step=1/1, ce_loss_token=2.4354, perplexity_token=11.4203]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   9%|████                                           | 90/1044 [00:33<05:35,  2.85it/s, acc_step=1/1, ce_loss_token=2.4352, perplexity_token=11.4177]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   9%|████                                           | 91/1044 [00:33<06:04,  2.62it/s, acc_step=1/1, ce_loss_token=2.4349, perplexity_token=11.4149]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   9%|████▏                                          | 92/1044 [00:33<05:56,  2.67it/s, acc_step=1/1, ce_loss_token=2.4347, perplexity_token=11.4123]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:   9%|████▏                                          | 93/1044 [00:34<06:11,  2.56it/s, acc_step=1/1, ce_loss_token=2.4346, perplexity_token=11.4114]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   9%|████▏                                          | 94/1044 [00:34<05:59,  2.65it/s, acc_step=1/1, ce_loss_token=2.4344, perplexity_token=11.4094]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   9%|████▎                                          | 95/1044 [00:34<05:29,  2.88it/s, acc_step=1/1, ce_loss_token=2.4346, perplexity_token=11.4109]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   9%|████▎                                          | 96/1044 [00:35<05:33,  2.84it/s, acc_step=1/1, ce_loss_token=2.4345, perplexity_token=11.4098]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   9%|████▎                                          | 97/1044 [00:35<05:28,  2.88it/s, acc_step=1/1, ce_loss_token=2.4342, perplexity_token=11.4069]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   9%|████▍                                          | 98/1044 [00:35<05:34,  2.83it/s, acc_step=1/1, ce_loss_token=2.4340, perplexity_token=11.4041]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   9%|████▍                                          | 99/1044 [00:36<05:46,  2.73it/s, acc_step=1/1, ce_loss_token=2.4338, perplexity_token=11.4019]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  10%|████▍                                         | 100/1044 [00:36<05:47,  2.71it/s, acc_step=1/1, ce_loss_token=2.4336, perplexity_token=11.4002]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  10%|████▍                                         | 101/1044 [00:37<05:24,  2.91it/s, acc_step=1/1, ce_loss_token=2.4337, perplexity_token=11.4008]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  10%|████▍                                         | 102/1044 [00:37<05:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.4336, perplexity_token=11.3998]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▌                                         | 103/1044 [00:37<05:32,  2.83it/s, acc_step=1/1, ce_loss_token=2.4334, perplexity_token=11.3975]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  10%|████▌                                         | 104/1044 [00:38<05:42,  2.75it/s, acc_step=1/1, ce_loss_token=2.4333, perplexity_token=11.3970]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  10%|████▋                                         | 105/1044 [00:38<05:38,  2.78it/s, acc_step=1/1, ce_loss_token=2.4332, perplexity_token=11.3954]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▋                                         | 106/1044 [00:38<05:37,  2.78it/s, acc_step=1/1, ce_loss_token=2.4331, perplexity_token=11.3936]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  10%|████▋                                         | 107/1044 [00:39<05:24,  2.89it/s, acc_step=1/1, ce_loss_token=2.4332, perplexity_token=11.3956]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  10%|████▊                                         | 108/1044 [00:39<05:12,  3.00it/s, acc_step=1/1, ce_loss_token=2.4334, perplexity_token=11.3973]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  10%|████▊                                         | 109/1044 [00:39<05:14,  2.97it/s, acc_step=1/1, ce_loss_token=2.4332, perplexity_token=11.3954]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  11%|████▊                                         | 110/1044 [00:40<05:46,  2.70it/s, acc_step=1/1, ce_loss_token=2.4331, perplexity_token=11.3941]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  11%|████▉                                         | 111/1044 [00:40<05:34,  2.79it/s, acc_step=1/1, ce_loss_token=2.4329, perplexity_token=11.3923]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  11%|████▉                                         | 112/1044 [00:40<05:45,  2.70it/s, acc_step=1/1, ce_loss_token=2.4328, perplexity_token=11.3908]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  11%|████▉                                         | 113/1044 [00:41<05:41,  2.73it/s, acc_step=1/1, ce_loss_token=2.4327, perplexity_token=11.3901]

torch.Size([256, 271, 35]) torch.Size([256, 271])


[Training LM]:  11%|█████                                         | 114/1044 [00:41<05:08,  3.01it/s, acc_step=1/1, ce_loss_token=2.4327, perplexity_token=11.3900]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  11%|█████                                         | 115/1044 [00:41<04:57,  3.13it/s, acc_step=1/1, ce_loss_token=2.4328, perplexity_token=11.3904]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  11%|█████                                         | 116/1044 [00:42<05:12,  2.97it/s, acc_step=1/1, ce_loss_token=2.4326, perplexity_token=11.3887]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  11%|█████▏                                        | 117/1044 [00:42<05:19,  2.90it/s, acc_step=1/1, ce_loss_token=2.4325, perplexity_token=11.3872]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  11%|█████▏                                        | 118/1044 [00:42<05:22,  2.87it/s, acc_step=1/1, ce_loss_token=2.4323, perplexity_token=11.3856]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  11%|█████▏                                        | 119/1044 [00:43<05:26,  2.83it/s, acc_step=1/1, ce_loss_token=2.4322, perplexity_token=11.3840]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  11%|█████▎                                        | 120/1044 [00:43<05:08,  3.00it/s, acc_step=1/1, ce_loss_token=2.4322, perplexity_token=11.3842]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  12%|█████▎                                        | 121/1044 [00:43<04:51,  3.17it/s, acc_step=1/1, ce_loss_token=2.4324, perplexity_token=11.3859]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  12%|█████▍                                        | 122/1044 [00:44<05:04,  3.03it/s, acc_step=1/1, ce_loss_token=2.4323, perplexity_token=11.3852]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  12%|█████▍                                        | 123/1044 [00:44<05:15,  2.92it/s, acc_step=1/1, ce_loss_token=2.4324, perplexity_token=11.3864]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  12%|█████▍                                        | 124/1044 [00:45<05:26,  2.81it/s, acc_step=1/1, ce_loss_token=2.4323, perplexity_token=11.3849]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  12%|█████▌                                        | 125/1044 [00:45<05:45,  2.66it/s, acc_step=1/1, ce_loss_token=2.4321, perplexity_token=11.3833]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  12%|█████▌                                        | 126/1044 [00:45<05:54,  2.59it/s, acc_step=1/1, ce_loss_token=2.4320, perplexity_token=11.3813]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  12%|█████▌                                        | 127/1044 [00:46<06:01,  2.53it/s, acc_step=1/1, ce_loss_token=2.4318, perplexity_token=11.3795]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  12%|█████▋                                        | 128/1044 [00:46<05:57,  2.56it/s, acc_step=1/1, ce_loss_token=2.4317, perplexity_token=11.3781]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  12%|█████▋                                        | 129/1044 [00:47<05:56,  2.56it/s, acc_step=1/1, ce_loss_token=2.4316, perplexity_token=11.3766]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  12%|█████▋                                        | 130/1044 [00:47<06:07,  2.49it/s, acc_step=1/1, ce_loss_token=2.4314, perplexity_token=11.3750]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  13%|█████▊                                        | 131/1044 [00:47<05:52,  2.59it/s, acc_step=1/1, ce_loss_token=2.4312, perplexity_token=11.3726]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  13%|█████▊                                        | 132/1044 [00:48<05:41,  2.67it/s, acc_step=1/1, ce_loss_token=2.4311, perplexity_token=11.3712]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  13%|█████▉                                        | 134/1044 [00:48<04:42,  3.22it/s, acc_step=1/1, ce_loss_token=2.4317, perplexity_token=11.3783]

torch.Size([256, 293, 35]) torch.Size([256, 293])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  13%|█████▉                                        | 135/1044 [00:49<04:56,  3.07it/s, acc_step=1/1, ce_loss_token=2.4316, perplexity_token=11.3766]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  13%|█████▉                                        | 136/1044 [00:49<05:05,  2.98it/s, acc_step=1/1, ce_loss_token=2.4314, perplexity_token=11.3753]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  13%|██████                                        | 137/1044 [00:49<04:51,  3.12it/s, acc_step=1/1, ce_loss_token=2.4316, perplexity_token=11.3767]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  13%|██████                                        | 138/1044 [00:50<05:05,  2.97it/s, acc_step=1/1, ce_loss_token=2.4314, perplexity_token=11.3749]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  13%|██████                                        | 139/1044 [00:50<04:50,  3.11it/s, acc_step=1/1, ce_loss_token=2.4315, perplexity_token=11.3765]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  13%|██████▏                                       | 140/1044 [00:50<04:53,  3.08it/s, acc_step=1/1, ce_loss_token=2.4314, perplexity_token=11.3751]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▏                                       | 141/1044 [00:51<05:11,  2.90it/s, acc_step=1/1, ce_loss_token=2.4314, perplexity_token=11.3743]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  14%|██████▎                                       | 142/1044 [00:51<05:16,  2.85it/s, acc_step=1/1, ce_loss_token=2.4312, perplexity_token=11.3723]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  14%|██████▎                                       | 143/1044 [00:51<05:15,  2.86it/s, acc_step=1/1, ce_loss_token=2.4310, perplexity_token=11.3703]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▎                                       | 144/1044 [00:52<05:26,  2.76it/s, acc_step=1/1, ce_loss_token=2.4309, perplexity_token=11.3686]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  14%|██████▍                                       | 145/1044 [00:52<05:25,  2.77it/s, acc_step=1/1, ce_loss_token=2.4307, perplexity_token=11.3674]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  14%|██████▍                                       | 146/1044 [00:52<05:24,  2.77it/s, acc_step=1/1, ce_loss_token=2.4306, perplexity_token=11.3660]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  14%|██████▍                                       | 147/1044 [00:53<05:31,  2.71it/s, acc_step=1/1, ce_loss_token=2.4305, perplexity_token=11.3644]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  14%|██████▌                                       | 148/1044 [00:53<05:30,  2.71it/s, acc_step=1/1, ce_loss_token=2.4304, perplexity_token=11.3630]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▌                                       | 149/1044 [00:54<05:44,  2.60it/s, acc_step=1/1, ce_loss_token=2.4301, perplexity_token=11.3605]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▌                                       | 150/1044 [00:54<05:47,  2.57it/s, acc_step=1/1, ce_loss_token=2.4300, perplexity_token=11.3591]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  14%|██████▋                                       | 151/1044 [00:54<05:42,  2.61it/s, acc_step=1/1, ce_loss_token=2.4299, perplexity_token=11.3576]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  15%|██████▋                                       | 152/1044 [00:55<05:58,  2.49it/s, acc_step=1/1, ce_loss_token=2.4298, perplexity_token=11.3561]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  15%|██████▋                                       | 153/1044 [00:55<05:30,  2.70it/s, acc_step=1/1, ce_loss_token=2.4298, perplexity_token=11.3565]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  15%|██████▊                                       | 154/1044 [00:55<05:32,  2.68it/s, acc_step=1/1, ce_loss_token=2.4296, perplexity_token=11.3544]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  15%|██████▊                                       | 155/1044 [00:56<05:33,  2.67it/s, acc_step=1/1, ce_loss_token=2.4295, perplexity_token=11.3529]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  15%|██████▊                                       | 156/1044 [00:56<05:33,  2.67it/s, acc_step=1/1, ce_loss_token=2.4294, perplexity_token=11.3518]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  15%|██████▉                                       | 157/1044 [00:57<05:33,  2.66it/s, acc_step=1/1, ce_loss_token=2.4293, perplexity_token=11.3504]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  15%|██████▉                                       | 158/1044 [00:57<05:34,  2.65it/s, acc_step=1/1, ce_loss_token=2.4291, perplexity_token=11.3485]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  15%|███████                                       | 159/1044 [00:57<05:45,  2.57it/s, acc_step=1/1, ce_loss_token=2.4290, perplexity_token=11.3472]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|███████                                       | 160/1044 [00:58<05:40,  2.60it/s, acc_step=1/1, ce_loss_token=2.4288, perplexity_token=11.3457]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  15%|███████                                       | 161/1044 [00:58<05:39,  2.60it/s, acc_step=1/1, ce_loss_token=2.4287, perplexity_token=11.3447]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  16%|███████▏                                      | 162/1044 [00:59<05:44,  2.56it/s, acc_step=1/1, ce_loss_token=2.4286, perplexity_token=11.3433]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  16%|███████▏                                      | 163/1044 [00:59<05:35,  2.63it/s, acc_step=1/1, ce_loss_token=2.4284, perplexity_token=11.3412]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▏                                      | 164/1044 [00:59<05:33,  2.64it/s, acc_step=1/1, ce_loss_token=2.4283, perplexity_token=11.3390]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▎                                      | 165/1044 [01:00<05:31,  2.65it/s, acc_step=1/1, ce_loss_token=2.4281, perplexity_token=11.3378]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  16%|███████▎                                      | 166/1044 [01:00<05:25,  2.69it/s, acc_step=1/1, ce_loss_token=2.4280, perplexity_token=11.3363]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  16%|███████▎                                      | 167/1044 [01:00<04:58,  2.94it/s, acc_step=1/1, ce_loss_token=2.4282, perplexity_token=11.3379]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  16%|███████▍                                      | 168/1044 [01:01<05:14,  2.79it/s, acc_step=1/1, ce_loss_token=2.4280, perplexity_token=11.3367]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  16%|███████▍                                      | 169/1044 [01:01<05:11,  2.81it/s, acc_step=1/1, ce_loss_token=2.4279, perplexity_token=11.3349]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▍                                      | 170/1044 [01:01<05:15,  2.77it/s, acc_step=1/1, ce_loss_token=2.4278, perplexity_token=11.3338]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  16%|███████▌                                      | 171/1044 [01:02<05:30,  2.64it/s, acc_step=1/1, ce_loss_token=2.4277, perplexity_token=11.3327]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  16%|███████▌                                      | 172/1044 [01:02<05:06,  2.85it/s, acc_step=1/1, ce_loss_token=2.4278, perplexity_token=11.3339]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  17%|███████▌                                      | 173/1044 [01:03<05:12,  2.79it/s, acc_step=1/1, ce_loss_token=2.4277, perplexity_token=11.3324]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▋                                      | 174/1044 [01:03<05:18,  2.73it/s, acc_step=1/1, ce_loss_token=2.4276, perplexity_token=11.3316]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  17%|███████▋                                      | 175/1044 [01:03<04:56,  2.93it/s, acc_step=1/1, ce_loss_token=2.4277, perplexity_token=11.3327]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  17%|███████▊                                      | 176/1044 [01:03<04:46,  3.03it/s, acc_step=1/1, ce_loss_token=2.4278, perplexity_token=11.3338]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  17%|███████▊                                      | 177/1044 [01:04<05:08,  2.81it/s, acc_step=1/1, ce_loss_token=2.4276, perplexity_token=11.3321]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  17%|███████▊                                      | 178/1044 [01:04<05:10,  2.79it/s, acc_step=1/1, ce_loss_token=2.4275, perplexity_token=11.3305]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  17%|███████▉                                      | 179/1044 [01:05<05:24,  2.66it/s, acc_step=1/1, ce_loss_token=2.4273, perplexity_token=11.3282]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  17%|███████▉                                      | 180/1044 [01:05<05:34,  2.58it/s, acc_step=1/1, ce_loss_token=2.4271, perplexity_token=11.3263]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▉                                      | 181/1044 [01:05<05:32,  2.60it/s, acc_step=1/1, ce_loss_token=2.4270, perplexity_token=11.3247]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  17%|████████                                      | 182/1044 [01:06<05:27,  2.64it/s, acc_step=1/1, ce_loss_token=2.4268, perplexity_token=11.3232]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  18%|████████                                      | 183/1044 [01:06<05:20,  2.68it/s, acc_step=1/1, ce_loss_token=2.4267, perplexity_token=11.3215]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  18%|████████                                      | 184/1044 [01:07<05:24,  2.65it/s, acc_step=1/1, ce_loss_token=2.4266, perplexity_token=11.3202]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  18%|████████▏                                     | 185/1044 [01:07<05:20,  2.68it/s, acc_step=1/1, ce_loss_token=2.4265, perplexity_token=11.3190]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  18%|████████▏                                     | 186/1044 [01:07<05:17,  2.70it/s, acc_step=1/1, ce_loss_token=2.4263, perplexity_token=11.3173]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  18%|████████▏                                     | 187/1044 [01:08<05:09,  2.77it/s, acc_step=1/1, ce_loss_token=2.4262, perplexity_token=11.3161]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  18%|████████▎                                     | 188/1044 [01:08<05:25,  2.63it/s, acc_step=1/1, ce_loss_token=2.4261, perplexity_token=11.3151]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  18%|████████▎                                     | 189/1044 [01:08<05:02,  2.83it/s, acc_step=1/1, ce_loss_token=2.4262, perplexity_token=11.3157]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  18%|████████▎                                     | 190/1044 [01:09<05:08,  2.77it/s, acc_step=1/1, ce_loss_token=2.4260, perplexity_token=11.3141]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  18%|████████▍                                     | 191/1044 [01:09<05:05,  2.79it/s, acc_step=1/1, ce_loss_token=2.4260, perplexity_token=11.3130]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  18%|████████▍                                     | 192/1044 [01:09<04:47,  2.96it/s, acc_step=1/1, ce_loss_token=2.4260, perplexity_token=11.3135]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  18%|████████▌                                     | 193/1044 [01:10<05:03,  2.80it/s, acc_step=1/1, ce_loss_token=2.4259, perplexity_token=11.3122]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  19%|████████▌                                     | 194/1044 [01:10<05:02,  2.81it/s, acc_step=1/1, ce_loss_token=2.4258, perplexity_token=11.3109]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  19%|████████▌                                     | 195/1044 [01:11<05:10,  2.73it/s, acc_step=1/1, ce_loss_token=2.4256, perplexity_token=11.3091]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  19%|████████▋                                     | 196/1044 [01:11<05:07,  2.76it/s, acc_step=1/1, ce_loss_token=2.4255, perplexity_token=11.3083]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  19%|████████▋                                     | 197/1044 [01:11<05:06,  2.76it/s, acc_step=1/1, ce_loss_token=2.4254, perplexity_token=11.3066]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  19%|████████▋                                     | 198/1044 [01:12<04:39,  3.02it/s, acc_step=1/1, ce_loss_token=2.4254, perplexity_token=11.3063]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  19%|████████▊                                     | 199/1044 [01:12<04:30,  3.13it/s, acc_step=1/1, ce_loss_token=2.4254, perplexity_token=11.3069]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  19%|████████▊                                     | 200/1044 [01:12<04:43,  2.98it/s, acc_step=1/1, ce_loss_token=2.4253, perplexity_token=11.3053]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  19%|████████▊                                     | 201/1044 [01:13<05:12,  2.70it/s, acc_step=1/1, ce_loss_token=2.4251, perplexity_token=11.3038]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|████████▉                                     | 202/1044 [01:13<05:15,  2.67it/s, acc_step=1/1, ce_loss_token=2.4250, perplexity_token=11.3022]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  19%|████████▉                                     | 203/1044 [01:13<05:13,  2.68it/s, acc_step=1/1, ce_loss_token=2.4249, perplexity_token=11.3008]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|████████▉                                     | 204/1044 [01:14<04:52,  2.87it/s, acc_step=1/1, ce_loss_token=2.4249, perplexity_token=11.3012]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  20%|█████████                                     | 205/1044 [01:14<04:54,  2.84it/s, acc_step=1/1, ce_loss_token=2.4248, perplexity_token=11.2999]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  20%|█████████                                     | 206/1044 [01:14<05:10,  2.70it/s, acc_step=1/1, ce_loss_token=2.4247, perplexity_token=11.2986]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████                                     | 207/1044 [01:15<05:12,  2.68it/s, acc_step=1/1, ce_loss_token=2.4246, perplexity_token=11.2973]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  20%|█████████▏                                    | 208/1044 [01:15<05:11,  2.69it/s, acc_step=1/1, ce_loss_token=2.4244, perplexity_token=11.2960]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  20%|█████████▏                                    | 209/1044 [01:16<05:07,  2.71it/s, acc_step=1/1, ce_loss_token=2.4243, perplexity_token=11.2949]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|█████████▎                                    | 210/1044 [01:16<04:47,  2.90it/s, acc_step=1/1, ce_loss_token=2.4245, perplexity_token=11.2961]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  20%|█████████▎                                    | 211/1044 [01:16<04:54,  2.83it/s, acc_step=1/1, ce_loss_token=2.4244, perplexity_token=11.2956]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  20%|█████████▎                                    | 212/1044 [01:17<04:56,  2.80it/s, acc_step=1/1, ce_loss_token=2.4243, perplexity_token=11.2943]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  20%|█████████▍                                    | 213/1044 [01:17<04:42,  2.95it/s, acc_step=1/1, ce_loss_token=2.4244, perplexity_token=11.2952]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  20%|█████████▍                                    | 214/1044 [01:17<04:49,  2.87it/s, acc_step=1/1, ce_loss_token=2.4242, perplexity_token=11.2936]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  21%|█████████▍                                    | 215/1044 [01:18<04:50,  2.85it/s, acc_step=1/1, ce_loss_token=2.4241, perplexity_token=11.2920]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  21%|█████████▌                                    | 216/1044 [01:18<04:45,  2.90it/s, acc_step=1/1, ce_loss_token=2.4241, perplexity_token=11.2926]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  21%|█████████▌                                    | 217/1044 [01:18<04:35,  3.00it/s, acc_step=1/1, ce_loss_token=2.4242, perplexity_token=11.2930]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  21%|█████████▌                                    | 218/1044 [01:19<04:18,  3.20it/s, acc_step=1/1, ce_loss_token=2.4242, perplexity_token=11.2928]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  21%|█████████▋                                    | 219/1044 [01:19<04:26,  3.09it/s, acc_step=1/1, ce_loss_token=2.4240, perplexity_token=11.2914]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  21%|█████████▋                                    | 220/1044 [01:19<04:39,  2.95it/s, acc_step=1/1, ce_loss_token=2.4239, perplexity_token=11.2901]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  21%|█████████▋                                    | 221/1044 [01:20<04:42,  2.91it/s, acc_step=1/1, ce_loss_token=2.4238, perplexity_token=11.2885]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  21%|█████████▊                                    | 222/1044 [01:20<05:08,  2.66it/s, acc_step=1/1, ce_loss_token=2.4236, perplexity_token=11.2869]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  21%|█████████▊                                    | 223/1044 [01:20<04:45,  2.88it/s, acc_step=1/1, ce_loss_token=2.4236, perplexity_token=11.2864]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  21%|█████████▊                                    | 224/1044 [01:21<04:54,  2.78it/s, acc_step=1/1, ce_loss_token=2.4234, perplexity_token=11.2845]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  22%|█████████▉                                    | 225/1044 [01:21<04:56,  2.76it/s, acc_step=1/1, ce_loss_token=2.4233, perplexity_token=11.2831]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  22%|█████████▉                                    | 226/1044 [01:21<05:09,  2.64it/s, acc_step=1/1, ce_loss_token=2.4232, perplexity_token=11.2817]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  22%|██████████                                    | 227/1044 [01:22<05:00,  2.72it/s, acc_step=1/1, ce_loss_token=2.4230, perplexity_token=11.2802]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  22%|██████████                                    | 228/1044 [01:22<04:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.4229, perplexity_token=11.2790]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  22%|██████████                                    | 229/1044 [01:23<04:52,  2.79it/s, acc_step=1/1, ce_loss_token=2.4228, perplexity_token=11.2774]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  22%|██████████▏                                   | 230/1044 [01:23<04:32,  2.99it/s, acc_step=1/1, ce_loss_token=2.4229, perplexity_token=11.2786]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  22%|██████████▏                                   | 231/1044 [01:23<04:44,  2.86it/s, acc_step=1/1, ce_loss_token=2.4228, perplexity_token=11.2774]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  22%|██████████▏                                   | 232/1044 [01:24<04:51,  2.79it/s, acc_step=1/1, ce_loss_token=2.4227, perplexity_token=11.2763]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  22%|██████████▎                                   | 233/1044 [01:24<05:20,  2.53it/s, acc_step=1/1, ce_loss_token=2.4226, perplexity_token=11.2750]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  22%|██████████▎                                   | 234/1044 [01:24<04:58,  2.72it/s, acc_step=1/1, ce_loss_token=2.4226, perplexity_token=11.2752]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  23%|██████████▎                                   | 235/1044 [01:25<04:55,  2.73it/s, acc_step=1/1, ce_loss_token=2.4226, perplexity_token=11.2747]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  23%|██████████▍                                   | 236/1044 [01:25<04:54,  2.75it/s, acc_step=1/1, ce_loss_token=2.4224, perplexity_token=11.2730]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  23%|██████████▍                                   | 237/1044 [01:26<05:11,  2.59it/s, acc_step=1/1, ce_loss_token=2.4223, perplexity_token=11.2716]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▍                                   | 238/1044 [01:26<05:08,  2.61it/s, acc_step=1/1, ce_loss_token=2.4221, perplexity_token=11.2697]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  23%|██████████▌                                   | 239/1044 [01:26<04:48,  2.79it/s, acc_step=1/1, ce_loss_token=2.4221, perplexity_token=11.2699]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  23%|██████████▌                                   | 240/1044 [01:27<04:51,  2.75it/s, acc_step=1/1, ce_loss_token=2.4220, perplexity_token=11.2685]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  23%|██████████▌                                   | 241/1044 [01:27<04:34,  2.92it/s, acc_step=1/1, ce_loss_token=2.4220, perplexity_token=11.2687]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  23%|██████████▋                                   | 242/1044 [01:27<04:38,  2.88it/s, acc_step=1/1, ce_loss_token=2.4219, perplexity_token=11.2671]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▋                                   | 243/1044 [01:28<04:21,  3.06it/s, acc_step=1/1, ce_loss_token=2.4219, perplexity_token=11.2671]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  23%|██████████▊                                   | 244/1044 [01:28<04:42,  2.83it/s, acc_step=1/1, ce_loss_token=2.4218, perplexity_token=11.2657]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  23%|██████████▊                                   | 245/1044 [01:29<05:44,  2.32it/s, acc_step=1/1, ce_loss_token=2.4217, perplexity_token=11.2646]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  24%|██████████▊                                   | 246/1044 [01:29<05:12,  2.55it/s, acc_step=1/1, ce_loss_token=2.4217, perplexity_token=11.2654]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  24%|██████████▉                                   | 247/1044 [01:29<04:59,  2.66it/s, acc_step=1/1, ce_loss_token=2.4216, perplexity_token=11.2641]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  24%|██████████▉                                   | 248/1044 [01:30<04:58,  2.66it/s, acc_step=1/1, ce_loss_token=2.4215, perplexity_token=11.2624]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  24%|██████████▉                                   | 249/1044 [01:30<04:52,  2.72it/s, acc_step=1/1, ce_loss_token=2.4214, perplexity_token=11.2612]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  24%|███████████                                   | 250/1044 [01:30<04:47,  2.76it/s, acc_step=1/1, ce_loss_token=2.4213, perplexity_token=11.2601]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  24%|███████████                                   | 251/1044 [01:31<04:50,  2.73it/s, acc_step=1/1, ce_loss_token=2.4211, perplexity_token=11.2584]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  24%|███████████                                   | 252/1044 [01:31<04:56,  2.67it/s, acc_step=1/1, ce_loss_token=2.4210, perplexity_token=11.2571]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  24%|███████████▏                                  | 253/1044 [01:31<04:50,  2.72it/s, acc_step=1/1, ce_loss_token=2.4209, perplexity_token=11.2560]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  24%|███████████▏                                  | 254/1044 [01:32<04:49,  2.73it/s, acc_step=1/1, ce_loss_token=2.4208, perplexity_token=11.2547]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  24%|███████████▏                                  | 255/1044 [01:32<04:45,  2.76it/s, acc_step=1/1, ce_loss_token=2.4207, perplexity_token=11.2537]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  25%|███████████▎                                  | 256/1044 [01:32<04:42,  2.79it/s, acc_step=1/1, ce_loss_token=2.4206, perplexity_token=11.2521]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  25%|███████████▎                                  | 257/1044 [01:33<04:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.4204, perplexity_token=11.2507]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  25%|███████████▎                                  | 258/1044 [01:33<04:53,  2.68it/s, acc_step=1/1, ce_loss_token=2.4203, perplexity_token=11.2490]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  25%|███████████▍                                  | 259/1044 [01:34<04:52,  2.68it/s, acc_step=1/1, ce_loss_token=2.4202, perplexity_token=11.2479]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  25%|███████████▍                                  | 260/1044 [01:34<05:18,  2.47it/s, acc_step=1/1, ce_loss_token=2.4201, perplexity_token=11.2466]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▌                                  | 261/1044 [01:34<05:13,  2.50it/s, acc_step=1/1, ce_loss_token=2.4200, perplexity_token=11.2453]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  25%|███████████▌                                  | 262/1044 [01:35<05:13,  2.50it/s, acc_step=1/1, ce_loss_token=2.4198, perplexity_token=11.2437]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▌                                  | 263/1044 [01:35<05:11,  2.51it/s, acc_step=1/1, ce_loss_token=2.4197, perplexity_token=11.2421]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  25%|███████████▋                                  | 264/1044 [01:36<04:57,  2.63it/s, acc_step=1/1, ce_loss_token=2.4195, perplexity_token=11.2404]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▋                                  | 265/1044 [01:36<04:55,  2.63it/s, acc_step=1/1, ce_loss_token=2.4194, perplexity_token=11.2392]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  25%|███████████▋                                  | 266/1044 [01:36<04:49,  2.68it/s, acc_step=1/1, ce_loss_token=2.4193, perplexity_token=11.2381]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  26%|███████████▊                                  | 267/1044 [01:37<04:50,  2.67it/s, acc_step=1/1, ce_loss_token=2.4192, perplexity_token=11.2366]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  26%|███████████▊                                  | 268/1044 [01:37<04:41,  2.76it/s, acc_step=1/1, ce_loss_token=2.4191, perplexity_token=11.2357]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|███████████▊                                  | 269/1044 [01:37<04:41,  2.76it/s, acc_step=1/1, ce_loss_token=2.4190, perplexity_token=11.2345]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  26%|███████████▉                                  | 270/1044 [01:38<04:40,  2.76it/s, acc_step=1/1, ce_loss_token=2.4189, perplexity_token=11.2336]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  26%|███████████▉                                  | 271/1044 [01:38<04:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.4188, perplexity_token=11.2322]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  26%|███████████▉                                  | 272/1044 [01:38<04:28,  2.87it/s, acc_step=1/1, ce_loss_token=2.4188, perplexity_token=11.2327]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  26%|████████████                                  | 273/1044 [01:39<04:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.4187, perplexity_token=11.2310]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  26%|████████████                                  | 274/1044 [01:39<04:12,  3.05it/s, acc_step=1/1, ce_loss_token=2.4187, perplexity_token=11.2317]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  26%|████████████                                  | 275/1044 [01:40<04:56,  2.59it/s, acc_step=1/1, ce_loss_token=2.4186, perplexity_token=11.2303]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  26%|████████████▏                                 | 276/1044 [01:40<04:55,  2.60it/s, acc_step=1/1, ce_loss_token=2.4185, perplexity_token=11.2288]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  27%|████████████▏                                 | 277/1044 [01:40<04:51,  2.63it/s, acc_step=1/1, ce_loss_token=2.4183, perplexity_token=11.2272]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  27%|████████████▏                                 | 278/1044 [01:41<04:54,  2.60it/s, acc_step=1/1, ce_loss_token=2.4182, perplexity_token=11.2254]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  27%|████████████▎                                 | 279/1044 [01:41<04:43,  2.69it/s, acc_step=1/1, ce_loss_token=2.4181, perplexity_token=11.2242]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  27%|████████████▎                                 | 280/1044 [01:41<04:52,  2.61it/s, acc_step=1/1, ce_loss_token=2.4179, perplexity_token=11.2228]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  27%|████████████▍                                 | 281/1044 [01:42<04:42,  2.70it/s, acc_step=1/1, ce_loss_token=2.4178, perplexity_token=11.2215]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  27%|████████████▍                                 | 282/1044 [01:42<04:39,  2.72it/s, acc_step=1/1, ce_loss_token=2.4177, perplexity_token=11.2198]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  27%|████████████▍                                 | 283/1044 [01:43<04:43,  2.68it/s, acc_step=1/1, ce_loss_token=2.4175, perplexity_token=11.2182]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▌                                 | 284/1044 [01:43<04:51,  2.61it/s, acc_step=1/1, ce_loss_token=2.4174, perplexity_token=11.2169]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  27%|████████████▌                                 | 285/1044 [01:43<04:56,  2.56it/s, acc_step=1/1, ce_loss_token=2.4173, perplexity_token=11.2156]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  27%|████████████▌                                 | 286/1044 [01:44<04:53,  2.58it/s, acc_step=1/1, ce_loss_token=2.4171, perplexity_token=11.2138]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  27%|████████████▋                                 | 287/1044 [01:44<04:58,  2.54it/s, acc_step=1/1, ce_loss_token=2.4170, perplexity_token=11.2126]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  28%|████████████▋                                 | 288/1044 [01:45<04:51,  2.60it/s, acc_step=1/1, ce_loss_token=2.4170, perplexity_token=11.2116]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  28%|████████████▋                                 | 289/1044 [01:45<04:51,  2.59it/s, acc_step=1/1, ce_loss_token=2.4168, perplexity_token=11.2100]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  28%|████████████▊                                 | 290/1044 [01:45<04:53,  2.57it/s, acc_step=1/1, ce_loss_token=2.4167, perplexity_token=11.2088]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  28%|████████████▊                                 | 291/1044 [01:46<04:49,  2.60it/s, acc_step=1/1, ce_loss_token=2.4166, perplexity_token=11.2077]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  28%|████████████▊                                 | 292/1044 [01:46<04:43,  2.65it/s, acc_step=1/1, ce_loss_token=2.4165, perplexity_token=11.2060]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  28%|████████████▉                                 | 293/1044 [01:46<04:38,  2.70it/s, acc_step=1/1, ce_loss_token=2.4165, perplexity_token=11.2060]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  28%|████████████▉                                 | 294/1044 [01:47<04:48,  2.60it/s, acc_step=1/1, ce_loss_token=2.4163, perplexity_token=11.2046]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  28%|████████████▉                                 | 295/1044 [01:47<04:43,  2.64it/s, acc_step=1/1, ce_loss_token=2.4162, perplexity_token=11.2035]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  28%|█████████████                                 | 296/1044 [01:47<04:21,  2.86it/s, acc_step=1/1, ce_loss_token=2.4162, perplexity_token=11.2033]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  28%|█████████████                                 | 297/1044 [01:48<04:34,  2.72it/s, acc_step=1/1, ce_loss_token=2.4161, perplexity_token=11.2019]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  29%|█████████████▏                                | 298/1044 [01:48<04:18,  2.88it/s, acc_step=1/1, ce_loss_token=2.4161, perplexity_token=11.2019]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  29%|█████████████▏                                | 299/1044 [01:49<04:17,  2.89it/s, acc_step=1/1, ce_loss_token=2.4159, perplexity_token=11.2001]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  29%|█████████████▏                                | 300/1044 [01:49<04:16,  2.90it/s, acc_step=1/1, ce_loss_token=2.4158, perplexity_token=11.1986]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  29%|█████████████▎                                | 301/1044 [01:49<04:00,  3.09it/s, acc_step=1/1, ce_loss_token=2.4158, perplexity_token=11.1990]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  29%|█████████████▎                                | 302/1044 [01:50<04:12,  2.94it/s, acc_step=1/1, ce_loss_token=2.4157, perplexity_token=11.1978]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  29%|█████████████▎                                | 303/1044 [01:50<04:16,  2.89it/s, acc_step=1/1, ce_loss_token=2.4156, perplexity_token=11.1963]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  29%|█████████████▍                                | 304/1044 [01:50<04:14,  2.91it/s, acc_step=1/1, ce_loss_token=2.4155, perplexity_token=11.1950]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  29%|█████████████▍                                | 305/1044 [01:51<04:26,  2.78it/s, acc_step=1/1, ce_loss_token=2.4153, perplexity_token=11.1936]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▍                                | 306/1044 [01:51<04:25,  2.78it/s, acc_step=1/1, ce_loss_token=2.4152, perplexity_token=11.1922]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  29%|█████████████▌                                | 307/1044 [01:51<04:28,  2.75it/s, acc_step=1/1, ce_loss_token=2.4151, perplexity_token=11.1907]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  30%|█████████████▌                                | 308/1044 [01:52<04:32,  2.70it/s, acc_step=1/1, ce_loss_token=2.4150, perplexity_token=11.1893]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  30%|█████████████▌                                | 309/1044 [01:52<04:29,  2.73it/s, acc_step=1/1, ce_loss_token=2.4148, perplexity_token=11.1878]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  30%|█████████████▋                                | 310/1044 [01:52<04:27,  2.74it/s, acc_step=1/1, ce_loss_token=2.4147, perplexity_token=11.1864]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  30%|█████████████▋                                | 311/1044 [01:53<04:24,  2.78it/s, acc_step=1/1, ce_loss_token=2.4145, perplexity_token=11.1847]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  30%|█████████████▋                                | 312/1044 [01:53<04:08,  2.95it/s, acc_step=1/1, ce_loss_token=2.4146, perplexity_token=11.1855]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  30%|█████████████▊                                | 313/1044 [01:53<04:12,  2.89it/s, acc_step=1/1, ce_loss_token=2.4145, perplexity_token=11.1843]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|█████████████▊                                | 314/1044 [01:54<04:14,  2.86it/s, acc_step=1/1, ce_loss_token=2.4144, perplexity_token=11.1830]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  30%|█████████████▉                                | 315/1044 [01:54<04:23,  2.76it/s, acc_step=1/1, ce_loss_token=2.4143, perplexity_token=11.1819]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  30%|█████████████▉                                | 316/1044 [01:54<04:04,  2.98it/s, acc_step=1/1, ce_loss_token=2.4143, perplexity_token=11.1819]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  30%|█████████████▉                                | 317/1044 [01:55<04:08,  2.93it/s, acc_step=1/1, ce_loss_token=2.4141, perplexity_token=11.1802]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  30%|██████████████                                | 318/1044 [01:55<04:08,  2.92it/s, acc_step=1/1, ce_loss_token=2.4140, perplexity_token=11.1785]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  31%|██████████████                                | 319/1044 [01:56<04:26,  2.73it/s, acc_step=1/1, ce_loss_token=2.4138, perplexity_token=11.1768]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  31%|██████████████                                | 320/1044 [01:56<04:08,  2.92it/s, acc_step=1/1, ce_loss_token=2.4139, perplexity_token=11.1772]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  31%|██████████████▏                               | 321/1044 [01:56<04:00,  3.00it/s, acc_step=1/1, ce_loss_token=2.4139, perplexity_token=11.1775]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  31%|██████████████▏                               | 322/1044 [01:57<04:08,  2.91it/s, acc_step=1/1, ce_loss_token=2.4138, perplexity_token=11.1762]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  31%|██████████████▏                               | 323/1044 [01:57<04:13,  2.85it/s, acc_step=1/1, ce_loss_token=2.4136, perplexity_token=11.1746]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  31%|██████████████▎                               | 324/1044 [01:57<03:56,  3.04it/s, acc_step=1/1, ce_loss_token=2.4137, perplexity_token=11.1752]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  31%|██████████████▎                               | 325/1044 [01:58<04:00,  2.98it/s, acc_step=1/1, ce_loss_token=2.4136, perplexity_token=11.1740]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  31%|██████████████▎                               | 326/1044 [01:58<04:12,  2.84it/s, acc_step=1/1, ce_loss_token=2.4135, perplexity_token=11.1724]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  31%|██████████████▍                               | 327/1044 [01:58<03:55,  3.04it/s, acc_step=1/1, ce_loss_token=2.4134, perplexity_token=11.1720]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  31%|██████████████▍                               | 328/1044 [01:59<04:07,  2.89it/s, acc_step=1/1, ce_loss_token=2.4133, perplexity_token=11.1706]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  32%|██████████████▍                               | 329/1044 [01:59<04:16,  2.79it/s, acc_step=1/1, ce_loss_token=2.4131, perplexity_token=11.1689]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  32%|██████████████▌                               | 330/1044 [01:59<04:13,  2.82it/s, acc_step=1/1, ce_loss_token=2.4130, perplexity_token=11.1675]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  32%|██████████████▌                               | 331/1044 [02:00<04:18,  2.76it/s, acc_step=1/1, ce_loss_token=2.4129, perplexity_token=11.1661]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  32%|██████████████▋                               | 332/1044 [02:00<04:30,  2.63it/s, acc_step=1/1, ce_loss_token=2.4128, perplexity_token=11.1648]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|██████████████▋                               | 333/1044 [02:01<04:27,  2.66it/s, acc_step=1/1, ce_loss_token=2.4126, perplexity_token=11.1629]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  32%|██████████████▋                               | 334/1044 [02:01<04:34,  2.58it/s, acc_step=1/1, ce_loss_token=2.4125, perplexity_token=11.1614]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  32%|██████████████▊                               | 335/1044 [02:01<04:27,  2.65it/s, acc_step=1/1, ce_loss_token=2.4123, perplexity_token=11.1598]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  32%|██████████████▊                               | 336/1044 [02:02<04:09,  2.84it/s, acc_step=1/1, ce_loss_token=2.4123, perplexity_token=11.1597]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  32%|██████████████▊                               | 337/1044 [02:02<04:15,  2.76it/s, acc_step=1/1, ce_loss_token=2.4122, perplexity_token=11.1582]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  32%|██████████████▉                               | 338/1044 [02:02<04:17,  2.74it/s, acc_step=1/1, ce_loss_token=2.4120, perplexity_token=11.1566]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  32%|██████████████▉                               | 339/1044 [02:03<04:12,  2.80it/s, acc_step=1/1, ce_loss_token=2.4119, perplexity_token=11.1547]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|██████████████▉                               | 340/1044 [02:03<04:14,  2.76it/s, acc_step=1/1, ce_loss_token=2.4117, perplexity_token=11.1534]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  33%|███████████████                               | 341/1044 [02:03<04:28,  2.61it/s, acc_step=1/1, ce_loss_token=2.4116, perplexity_token=11.1516]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  33%|███████████████                               | 342/1044 [02:04<04:29,  2.61it/s, acc_step=1/1, ce_loss_token=2.4114, perplexity_token=11.1501]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████                               | 343/1044 [02:04<04:27,  2.62it/s, acc_step=1/1, ce_loss_token=2.4113, perplexity_token=11.1487]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  33%|███████████████▏                              | 344/1044 [02:05<04:35,  2.54it/s, acc_step=1/1, ce_loss_token=2.4112, perplexity_token=11.1472]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▏                              | 345/1044 [02:05<04:31,  2.57it/s, acc_step=1/1, ce_loss_token=2.4111, perplexity_token=11.1462]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  33%|███████████████▏                              | 346/1044 [02:05<04:35,  2.54it/s, acc_step=1/1, ce_loss_token=2.4110, perplexity_token=11.1449]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  33%|███████████████▎                              | 347/1044 [02:06<04:30,  2.58it/s, acc_step=1/1, ce_loss_token=2.4108, perplexity_token=11.1434]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  33%|███████████████▎                              | 348/1044 [02:06<04:19,  2.68it/s, acc_step=1/1, ce_loss_token=2.4107, perplexity_token=11.1422]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  33%|███████████████▍                              | 349/1044 [02:07<04:13,  2.74it/s, acc_step=1/1, ce_loss_token=2.4106, perplexity_token=11.1407]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  34%|███████████████▍                              | 350/1044 [02:07<04:21,  2.65it/s, acc_step=1/1, ce_loss_token=2.4105, perplexity_token=11.1393]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  34%|███████████████▍                              | 351/1044 [02:07<04:21,  2.65it/s, acc_step=1/1, ce_loss_token=2.4104, perplexity_token=11.1383]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  34%|███████████████▌                              | 352/1044 [02:08<04:19,  2.67it/s, acc_step=1/1, ce_loss_token=2.4102, perplexity_token=11.1366]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  34%|███████████████▌                              | 353/1044 [02:08<04:15,  2.70it/s, acc_step=1/1, ce_loss_token=2.4101, perplexity_token=11.1350]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  34%|███████████████▌                              | 354/1044 [02:08<04:16,  2.69it/s, acc_step=1/1, ce_loss_token=2.4100, perplexity_token=11.1339]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  34%|███████████████▋                              | 355/1044 [02:09<04:17,  2.67it/s, acc_step=1/1, ce_loss_token=2.4099, perplexity_token=11.1324]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  34%|███████████████▋                              | 356/1044 [02:09<04:12,  2.72it/s, acc_step=1/1, ce_loss_token=2.4097, perplexity_token=11.1312]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  34%|███████████████▋                              | 357/1044 [02:09<04:08,  2.77it/s, acc_step=1/1, ce_loss_token=2.4096, perplexity_token=11.1296]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  34%|███████████████▊                              | 358/1044 [02:10<04:06,  2.79it/s, acc_step=1/1, ce_loss_token=2.4095, perplexity_token=11.1280]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  34%|███████████████▊                              | 359/1044 [02:10<04:06,  2.77it/s, acc_step=1/1, ce_loss_token=2.4093, perplexity_token=11.1265]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  34%|███████████████▊                              | 360/1044 [02:11<04:10,  2.73it/s, acc_step=1/1, ce_loss_token=2.4092, perplexity_token=11.1251]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  35%|███████████████▉                              | 361/1044 [02:11<04:10,  2.72it/s, acc_step=1/1, ce_loss_token=2.4091, perplexity_token=11.1238]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  35%|███████████████▉                              | 362/1044 [02:11<04:11,  2.71it/s, acc_step=1/1, ce_loss_token=2.4089, perplexity_token=11.1223]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  35%|███████████████▉                              | 363/1044 [02:12<03:55,  2.89it/s, acc_step=1/1, ce_loss_token=2.4089, perplexity_token=11.1217]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  35%|████████████████                              | 364/1044 [02:12<03:41,  3.06it/s, acc_step=1/1, ce_loss_token=2.4089, perplexity_token=11.1219]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  35%|████████████████                              | 365/1044 [02:12<03:56,  2.87it/s, acc_step=1/1, ce_loss_token=2.4088, perplexity_token=11.1206]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  35%|████████████████▏                             | 366/1044 [02:13<03:53,  2.90it/s, acc_step=1/1, ce_loss_token=2.4088, perplexity_token=11.1205]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  35%|████████████████▏                             | 367/1044 [02:13<04:00,  2.81it/s, acc_step=1/1, ce_loss_token=2.4087, perplexity_token=11.1194]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  35%|████████████████▏                             | 368/1044 [02:13<03:57,  2.85it/s, acc_step=1/1, ce_loss_token=2.4087, perplexity_token=11.1196]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  35%|████████████████▎                             | 369/1044 [02:14<03:36,  3.12it/s, acc_step=1/1, ce_loss_token=2.4087, perplexity_token=11.1194]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  35%|████████████████▎                             | 370/1044 [02:14<03:39,  3.07it/s, acc_step=1/1, ce_loss_token=2.4086, perplexity_token=11.1183]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  36%|████████████████▎                             | 371/1044 [02:14<03:40,  3.05it/s, acc_step=1/1, ce_loss_token=2.4085, perplexity_token=11.1171]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  36%|████████████████▍                             | 372/1044 [02:15<03:52,  2.89it/s, acc_step=1/1, ce_loss_token=2.4084, perplexity_token=11.1160]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  36%|████████████████▍                             | 373/1044 [02:15<03:53,  2.87it/s, acc_step=1/1, ce_loss_token=2.4083, perplexity_token=11.1148]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  36%|████████████████▍                             | 374/1044 [02:15<04:03,  2.75it/s, acc_step=1/1, ce_loss_token=2.4081, perplexity_token=11.1133]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  36%|████████████████▌                             | 375/1044 [02:16<03:52,  2.88it/s, acc_step=1/1, ce_loss_token=2.4081, perplexity_token=11.1127]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  36%|████████████████▌                             | 376/1044 [02:16<03:33,  3.12it/s, acc_step=1/1, ce_loss_token=2.4081, perplexity_token=11.1124]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  36%|████████████████▌                             | 377/1044 [02:16<03:45,  2.96it/s, acc_step=1/1, ce_loss_token=2.4079, perplexity_token=11.1110]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  36%|████████████████▋                             | 378/1044 [02:17<03:51,  2.88it/s, acc_step=1/1, ce_loss_token=2.4078, perplexity_token=11.1100]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  36%|████████████████▋                             | 379/1044 [02:17<03:55,  2.82it/s, acc_step=1/1, ce_loss_token=2.4077, perplexity_token=11.1087]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  36%|████████████████▋                             | 380/1044 [02:17<03:39,  3.02it/s, acc_step=1/1, ce_loss_token=2.4077, perplexity_token=11.1086]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  36%|████████████████▊                             | 381/1044 [02:18<03:43,  2.97it/s, acc_step=1/1, ce_loss_token=2.4076, perplexity_token=11.1071]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  37%|████████████████▊                             | 382/1044 [02:18<03:54,  2.83it/s, acc_step=1/1, ce_loss_token=2.4075, perplexity_token=11.1059]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  37%|████████████████▉                             | 383/1044 [02:19<04:03,  2.71it/s, acc_step=1/1, ce_loss_token=2.4073, perplexity_token=11.1045]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  37%|████████████████▉                             | 384/1044 [02:19<04:01,  2.73it/s, acc_step=1/1, ce_loss_token=2.4072, perplexity_token=11.1032]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  37%|████████████████▉                             | 385/1044 [02:19<04:16,  2.57it/s, acc_step=1/1, ce_loss_token=2.4071, perplexity_token=11.1017]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  37%|█████████████████                             | 386/1044 [02:20<04:13,  2.59it/s, acc_step=1/1, ce_loss_token=2.4070, perplexity_token=11.1002]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  37%|█████████████████                             | 387/1044 [02:20<04:06,  2.67it/s, acc_step=1/1, ce_loss_token=2.4069, perplexity_token=11.0996]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  37%|█████████████████                             | 388/1044 [02:20<04:04,  2.68it/s, acc_step=1/1, ce_loss_token=2.4068, perplexity_token=11.0981]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  37%|█████████████████▏                            | 389/1044 [02:21<04:16,  2.56it/s, acc_step=1/1, ce_loss_token=2.4067, perplexity_token=11.0968]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  37%|█████████████████▏                            | 390/1044 [02:21<04:31,  2.40it/s, acc_step=1/1, ce_loss_token=2.4065, perplexity_token=11.0955]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  37%|█████████████████▏                            | 391/1044 [02:22<04:23,  2.48it/s, acc_step=1/1, ce_loss_token=2.4064, perplexity_token=11.0942]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  38%|█████████████████▎                            | 392/1044 [02:22<04:03,  2.68it/s, acc_step=1/1, ce_loss_token=2.4064, perplexity_token=11.0935]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  38%|█████████████████▎                            | 393/1044 [02:22<04:00,  2.70it/s, acc_step=1/1, ce_loss_token=2.4062, perplexity_token=11.0920]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  38%|█████████████████▎                            | 394/1044 [02:23<03:59,  2.71it/s, acc_step=1/1, ce_loss_token=2.4061, perplexity_token=11.0904]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  38%|█████████████████▍                            | 395/1044 [02:23<04:00,  2.69it/s, acc_step=1/1, ce_loss_token=2.4059, perplexity_token=11.0888]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  38%|█████████████████▍                            | 396/1044 [02:23<04:01,  2.69it/s, acc_step=1/1, ce_loss_token=2.4058, perplexity_token=11.0874]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  38%|█████████████████▍                            | 397/1044 [02:24<04:00,  2.69it/s, acc_step=1/1, ce_loss_token=2.4057, perplexity_token=11.0859]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  38%|█████████████████▌                            | 398/1044 [02:24<04:03,  2.65it/s, acc_step=1/1, ce_loss_token=2.4055, perplexity_token=11.0845]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  38%|█████████████████▌                            | 399/1044 [02:25<03:46,  2.84it/s, acc_step=1/1, ce_loss_token=2.4056, perplexity_token=11.0846]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  38%|█████████████████▌                            | 400/1044 [02:25<03:46,  2.85it/s, acc_step=1/1, ce_loss_token=2.4055, perplexity_token=11.0835]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  38%|█████████████████▋                            | 401/1044 [02:25<03:50,  2.79it/s, acc_step=1/1, ce_loss_token=2.4053, perplexity_token=11.0822]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  39%|█████████████████▋                            | 402/1044 [02:26<03:45,  2.85it/s, acc_step=1/1, ce_loss_token=2.4052, perplexity_token=11.0810]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  39%|█████████████████▊                            | 404/1044 [02:26<03:18,  3.22it/s, acc_step=1/1, ce_loss_token=2.4053, perplexity_token=11.0813]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  39%|█████████████████▊                            | 405/1044 [02:27<03:25,  3.11it/s, acc_step=1/1, ce_loss_token=2.4051, perplexity_token=11.0797]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  39%|█████████████████▉                            | 406/1044 [02:27<03:35,  2.96it/s, acc_step=1/1, ce_loss_token=2.4050, perplexity_token=11.0786]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  39%|█████████████████▉                            | 407/1044 [02:27<03:27,  3.08it/s, acc_step=1/1, ce_loss_token=2.4050, perplexity_token=11.0786]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  39%|█████████████████▉                            | 408/1044 [02:28<03:41,  2.87it/s, acc_step=1/1, ce_loss_token=2.4049, perplexity_token=11.0772]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  39%|██████████████████                            | 409/1044 [02:28<03:33,  2.97it/s, acc_step=1/1, ce_loss_token=2.4048, perplexity_token=11.0766]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  39%|██████████████████                            | 410/1044 [02:28<03:39,  2.89it/s, acc_step=1/1, ce_loss_token=2.4047, perplexity_token=11.0753]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  39%|██████████████████                            | 411/1044 [02:29<03:49,  2.76it/s, acc_step=1/1, ce_loss_token=2.4046, perplexity_token=11.0740]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  39%|██████████████████▏                           | 412/1044 [02:29<03:52,  2.72it/s, acc_step=1/1, ce_loss_token=2.4045, perplexity_token=11.0725]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  40%|██████████████████▏                           | 413/1044 [02:29<03:39,  2.88it/s, acc_step=1/1, ce_loss_token=2.4044, perplexity_token=11.0719]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▏                           | 414/1044 [02:30<03:24,  3.08it/s, acc_step=1/1, ce_loss_token=2.4044, perplexity_token=11.0723]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  40%|██████████████████▎                           | 415/1044 [02:30<03:41,  2.83it/s, acc_step=1/1, ce_loss_token=2.4043, perplexity_token=11.0706]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  40%|██████████████████▎                           | 416/1044 [02:30<03:39,  2.86it/s, acc_step=1/1, ce_loss_token=2.4042, perplexity_token=11.0695]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  40%|██████████████████▎                           | 417/1044 [02:31<03:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.4041, perplexity_token=11.0681]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  40%|██████████████████▍                           | 418/1044 [02:31<03:31,  2.96it/s, acc_step=1/1, ce_loss_token=2.4041, perplexity_token=11.0680]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  40%|██████████████████▍                           | 419/1044 [02:31<03:30,  2.97it/s, acc_step=1/1, ce_loss_token=2.4039, perplexity_token=11.0665]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  40%|██████████████████▌                           | 420/1044 [02:32<03:31,  2.95it/s, acc_step=1/1, ce_loss_token=2.4038, perplexity_token=11.0652]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  40%|██████████████████▌                           | 421/1044 [02:32<03:34,  2.90it/s, acc_step=1/1, ce_loss_token=2.4037, perplexity_token=11.0639]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  40%|██████████████████▌                           | 422/1044 [02:32<03:41,  2.81it/s, acc_step=1/1, ce_loss_token=2.4036, perplexity_token=11.0626]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  41%|██████████████████▋                           | 423/1044 [02:33<03:48,  2.72it/s, acc_step=1/1, ce_loss_token=2.4034, perplexity_token=11.0610]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  41%|██████████████████▋                           | 424/1044 [02:33<03:52,  2.66it/s, acc_step=1/1, ce_loss_token=2.4033, perplexity_token=11.0599]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  41%|██████████████████▋                           | 425/1044 [02:34<03:55,  2.63it/s, acc_step=1/1, ce_loss_token=2.4032, perplexity_token=11.0584]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  41%|██████████████████▊                           | 426/1044 [02:34<04:11,  2.46it/s, acc_step=1/1, ce_loss_token=2.4031, perplexity_token=11.0574]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  41%|██████████████████▊                           | 427/1044 [02:35<04:35,  2.24it/s, acc_step=1/1, ce_loss_token=2.4030, perplexity_token=11.0560]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  41%|██████████████████▊                           | 428/1044 [02:35<04:25,  2.32it/s, acc_step=1/1, ce_loss_token=2.4028, perplexity_token=11.0546]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  41%|██████████████████▉                           | 429/1044 [02:35<04:06,  2.50it/s, acc_step=1/1, ce_loss_token=2.4028, perplexity_token=11.0541]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  41%|██████████████████▉                           | 430/1044 [02:36<04:28,  2.29it/s, acc_step=1/1, ce_loss_token=2.4027, perplexity_token=11.0530]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  41%|██████████████████▉                           | 431/1044 [02:36<04:20,  2.35it/s, acc_step=1/1, ce_loss_token=2.4026, perplexity_token=11.0515]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  41%|███████████████████                           | 432/1044 [02:37<04:19,  2.36it/s, acc_step=1/1, ce_loss_token=2.4024, perplexity_token=11.0501]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  41%|███████████████████                           | 433/1044 [02:37<04:09,  2.44it/s, acc_step=1/1, ce_loss_token=2.4023, perplexity_token=11.0489]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  42%|███████████████████                           | 434/1044 [02:37<04:00,  2.54it/s, acc_step=1/1, ce_loss_token=2.4022, perplexity_token=11.0473]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  42%|███████████████████▏                          | 435/1044 [02:38<03:57,  2.57it/s, acc_step=1/1, ce_loss_token=2.4021, perplexity_token=11.0459]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  42%|███████████████████▏                          | 436/1044 [02:38<03:52,  2.62it/s, acc_step=1/1, ce_loss_token=2.4019, perplexity_token=11.0444]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  42%|███████████████████▎                          | 437/1044 [02:39<03:53,  2.60it/s, acc_step=1/1, ce_loss_token=2.4018, perplexity_token=11.0431]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  42%|███████████████████▎                          | 438/1044 [02:39<03:48,  2.65it/s, acc_step=1/1, ce_loss_token=2.4017, perplexity_token=11.0416]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  42%|███████████████████▎                          | 439/1044 [02:39<03:48,  2.64it/s, acc_step=1/1, ce_loss_token=2.4016, perplexity_token=11.0405]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  42%|███████████████████▍                          | 440/1044 [02:40<03:33,  2.83it/s, acc_step=1/1, ce_loss_token=2.4015, perplexity_token=11.0399]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  42%|███████████████████▍                          | 441/1044 [02:40<03:16,  3.07it/s, acc_step=1/1, ce_loss_token=2.4015, perplexity_token=11.0397]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  42%|███████████████████▍                          | 442/1044 [02:40<03:21,  2.99it/s, acc_step=1/1, ce_loss_token=2.4014, perplexity_token=11.0381]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  42%|███████████████████▌                          | 443/1044 [02:41<03:31,  2.85it/s, acc_step=1/1, ce_loss_token=2.4013, perplexity_token=11.0370]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  43%|███████████████████▌                          | 444/1044 [02:41<03:31,  2.84it/s, acc_step=1/1, ce_loss_token=2.4011, perplexity_token=11.0355]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|███████████████████▌                          | 445/1044 [02:41<03:18,  3.01it/s, acc_step=1/1, ce_loss_token=2.4011, perplexity_token=11.0357]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  43%|███████████████████▋                          | 446/1044 [02:42<03:07,  3.19it/s, acc_step=1/1, ce_loss_token=2.4012, perplexity_token=11.0360]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  43%|███████████████████▋                          | 447/1044 [02:42<03:27,  2.88it/s, acc_step=1/1, ce_loss_token=2.4010, perplexity_token=11.0347]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  43%|███████████████████▋                          | 448/1044 [02:42<03:27,  2.87it/s, acc_step=1/1, ce_loss_token=2.4009, perplexity_token=11.0334]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  43%|███████████████████▊                          | 449/1044 [02:43<03:31,  2.81it/s, acc_step=1/1, ce_loss_token=2.4008, perplexity_token=11.0318]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  43%|███████████████████▊                          | 450/1044 [02:43<03:26,  2.87it/s, acc_step=1/1, ce_loss_token=2.4007, perplexity_token=11.0314]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  43%|███████████████████▊                          | 451/1044 [02:43<03:31,  2.80it/s, acc_step=1/1, ce_loss_token=2.4006, perplexity_token=11.0300]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  43%|███████████████████▉                          | 452/1044 [02:44<03:24,  2.90it/s, acc_step=1/1, ce_loss_token=2.4006, perplexity_token=11.0296]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|███████████████████▉                          | 453/1044 [02:44<03:28,  2.83it/s, acc_step=1/1, ce_loss_token=2.4004, perplexity_token=11.0281]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  43%|████████████████████                          | 454/1044 [02:44<03:20,  2.95it/s, acc_step=1/1, ce_loss_token=2.4004, perplexity_token=11.0277]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  44%|████████████████████                          | 455/1044 [02:45<03:22,  2.91it/s, acc_step=1/1, ce_loss_token=2.4003, perplexity_token=11.0262]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  44%|████████████████████                          | 456/1044 [02:45<03:25,  2.87it/s, acc_step=1/1, ce_loss_token=2.4002, perplexity_token=11.0251]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  44%|████████████████████▏                         | 457/1044 [02:45<03:28,  2.82it/s, acc_step=1/1, ce_loss_token=2.4000, perplexity_token=11.0237]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  44%|████████████████████▏                         | 458/1044 [02:46<03:27,  2.83it/s, acc_step=1/1, ce_loss_token=2.3999, perplexity_token=11.0226]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  44%|████████████████████▏                         | 459/1044 [02:46<03:43,  2.62it/s, acc_step=1/1, ce_loss_token=2.3998, perplexity_token=11.0213]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  44%|████████████████████▎                         | 460/1044 [02:47<03:41,  2.63it/s, acc_step=1/1, ce_loss_token=2.3997, perplexity_token=11.0198]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  44%|████████████████████▎                         | 461/1044 [02:47<03:35,  2.70it/s, acc_step=1/1, ce_loss_token=2.3996, perplexity_token=11.0193]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  44%|████████████████████▎                         | 462/1044 [02:47<03:30,  2.76it/s, acc_step=1/1, ce_loss_token=2.3995, perplexity_token=11.0179]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  44%|████████████████████▍                         | 463/1044 [02:48<03:33,  2.72it/s, acc_step=1/1, ce_loss_token=2.3994, perplexity_token=11.0165]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  44%|████████████████████▍                         | 464/1044 [02:48<03:29,  2.77it/s, acc_step=1/1, ce_loss_token=2.3993, perplexity_token=11.0151]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  45%|████████████████████▍                         | 465/1044 [02:48<03:18,  2.91it/s, acc_step=1/1, ce_loss_token=2.3993, perplexity_token=11.0149]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  45%|████████████████████▌                         | 466/1044 [02:49<03:22,  2.85it/s, acc_step=1/1, ce_loss_token=2.3991, perplexity_token=11.0134]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|████████████████████▌                         | 467/1044 [02:49<03:26,  2.79it/s, acc_step=1/1, ce_loss_token=2.3990, perplexity_token=11.0119]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  45%|████████████████████▌                         | 468/1044 [02:49<03:13,  2.98it/s, acc_step=1/1, ce_loss_token=2.3990, perplexity_token=11.0123]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  45%|████████████████████▋                         | 469/1044 [02:50<03:17,  2.91it/s, acc_step=1/1, ce_loss_token=2.3989, perplexity_token=11.0107]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  45%|████████████████████▋                         | 470/1044 [02:50<03:13,  2.97it/s, acc_step=1/1, ce_loss_token=2.3988, perplexity_token=11.0103]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  45%|████████████████████▊                         | 471/1044 [02:50<03:17,  2.90it/s, acc_step=1/1, ce_loss_token=2.3989, perplexity_token=11.0107]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  45%|████████████████████▊                         | 472/1044 [02:51<03:04,  3.10it/s, acc_step=1/1, ce_loss_token=2.3988, perplexity_token=11.0102]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  45%|████████████████████▊                         | 473/1044 [02:51<03:13,  2.94it/s, acc_step=1/1, ce_loss_token=2.3987, perplexity_token=11.0089]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  45%|████████████████████▉                         | 474/1044 [02:51<03:20,  2.84it/s, acc_step=1/1, ce_loss_token=2.3986, perplexity_token=11.0074]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  45%|████████████████████▉                         | 475/1044 [02:52<03:34,  2.65it/s, acc_step=1/1, ce_loss_token=2.3984, perplexity_token=11.0060]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  46%|████████████████████▉                         | 476/1044 [02:52<03:33,  2.66it/s, acc_step=1/1, ce_loss_token=2.3983, perplexity_token=11.0046]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  46%|█████████████████████                         | 477/1044 [02:53<03:29,  2.71it/s, acc_step=1/1, ce_loss_token=2.3982, perplexity_token=11.0034]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  46%|█████████████████████                         | 478/1044 [02:53<03:29,  2.71it/s, acc_step=1/1, ce_loss_token=2.3981, perplexity_token=11.0022]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  46%|█████████████████████                         | 479/1044 [02:53<03:13,  2.92it/s, acc_step=1/1, ce_loss_token=2.3981, perplexity_token=11.0021]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  46%|█████████████████████▏                        | 480/1044 [02:54<03:07,  3.01it/s, acc_step=1/1, ce_loss_token=2.3981, perplexity_token=11.0018]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  46%|█████████████████████▏                        | 482/1044 [02:54<02:51,  3.27it/s, acc_step=1/1, ce_loss_token=2.3981, perplexity_token=11.0024]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  46%|█████████████████████▎                        | 483/1044 [02:55<02:59,  3.13it/s, acc_step=1/1, ce_loss_token=2.3980, perplexity_token=11.0010]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  46%|█████████████████████▎                        | 484/1044 [02:55<03:02,  3.07it/s, acc_step=1/1, ce_loss_token=2.3979, perplexity_token=10.9997]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  46%|█████████████████████▎                        | 485/1044 [02:55<03:08,  2.96it/s, acc_step=1/1, ce_loss_token=2.3977, perplexity_token=10.9983]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  47%|█████████████████████▍                        | 486/1044 [02:56<03:41,  2.52it/s, acc_step=1/1, ce_loss_token=2.3976, perplexity_token=10.9970]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  47%|█████████████████████▍                        | 487/1044 [02:56<03:23,  2.74it/s, acc_step=1/1, ce_loss_token=2.3975, perplexity_token=10.9961]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  47%|█████████████████████▌                        | 488/1044 [02:56<03:22,  2.75it/s, acc_step=1/1, ce_loss_token=2.3974, perplexity_token=10.9948]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  47%|█████████████████████▌                        | 489/1044 [02:57<03:52,  2.39it/s, acc_step=1/1, ce_loss_token=2.3973, perplexity_token=10.9931]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  47%|█████████████████████▌                        | 490/1044 [02:57<03:37,  2.55it/s, acc_step=1/1, ce_loss_token=2.3973, perplexity_token=10.9931]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  47%|█████████████████████▋                        | 491/1044 [02:58<03:44,  2.46it/s, acc_step=1/1, ce_loss_token=2.3971, perplexity_token=10.9918]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  47%|█████████████████████▋                        | 492/1044 [02:58<04:07,  2.23it/s, acc_step=1/1, ce_loss_token=2.3970, perplexity_token=10.9903]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|█████████████████████▋                        | 493/1044 [02:59<03:42,  2.47it/s, acc_step=1/1, ce_loss_token=2.3970, perplexity_token=10.9898]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  47%|█████████████████████▊                        | 494/1044 [02:59<03:32,  2.59it/s, acc_step=1/1, ce_loss_token=2.3968, perplexity_token=10.9883]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|█████████████████████▊                        | 495/1044 [02:59<03:31,  2.60it/s, acc_step=1/1, ce_loss_token=2.3967, perplexity_token=10.9869]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  48%|█████████████████████▊                        | 496/1044 [03:00<03:27,  2.63it/s, acc_step=1/1, ce_loss_token=2.3966, perplexity_token=10.9857]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  48%|█████████████████████▉                        | 497/1044 [03:00<03:26,  2.65it/s, acc_step=1/1, ce_loss_token=2.3965, perplexity_token=10.9843]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  48%|█████████████████████▉                        | 498/1044 [03:00<03:22,  2.70it/s, acc_step=1/1, ce_loss_token=2.3963, perplexity_token=10.9830]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  48%|█████████████████████▉                        | 499/1044 [03:01<03:23,  2.68it/s, acc_step=1/1, ce_loss_token=2.3962, perplexity_token=10.9818]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  48%|██████████████████████                        | 500/1044 [03:01<03:20,  2.71it/s, acc_step=1/1, ce_loss_token=2.3961, perplexity_token=10.9807]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  48%|██████████████████████                        | 501/1044 [03:02<03:17,  2.75it/s, acc_step=1/1, ce_loss_token=2.3960, perplexity_token=10.9793]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  48%|██████████████████████                        | 502/1044 [03:02<03:23,  2.67it/s, acc_step=1/1, ce_loss_token=2.3959, perplexity_token=10.9777]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  48%|██████████████████████▏                       | 503/1044 [03:02<03:25,  2.64it/s, acc_step=1/1, ce_loss_token=2.3957, perplexity_token=10.9763]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  48%|██████████████████████▏                       | 504/1044 [03:03<03:21,  2.67it/s, acc_step=1/1, ce_loss_token=2.3956, perplexity_token=10.9749]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▎                       | 505/1044 [03:03<03:20,  2.69it/s, acc_step=1/1, ce_loss_token=2.3955, perplexity_token=10.9737]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  48%|██████████████████████▎                       | 506/1044 [03:04<03:38,  2.47it/s, acc_step=1/1, ce_loss_token=2.3954, perplexity_token=10.9723]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  49%|██████████████████████▎                       | 507/1044 [03:04<03:32,  2.52it/s, acc_step=1/1, ce_loss_token=2.3953, perplexity_token=10.9710]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  49%|██████████████████████▍                       | 508/1044 [03:04<03:31,  2.53it/s, acc_step=1/1, ce_loss_token=2.3951, perplexity_token=10.9697]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  49%|██████████████████████▍                       | 509/1044 [03:05<03:34,  2.49it/s, acc_step=1/1, ce_loss_token=2.3950, perplexity_token=10.9683]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|██████████████████████▍                       | 510/1044 [03:05<03:29,  2.55it/s, acc_step=1/1, ce_loss_token=2.3949, perplexity_token=10.9669]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  49%|██████████████████████▌                       | 511/1044 [03:05<03:23,  2.62it/s, acc_step=1/1, ce_loss_token=2.3947, perplexity_token=10.9654]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  49%|██████████████████████▌                       | 512/1044 [03:06<03:27,  2.56it/s, acc_step=1/1, ce_loss_token=2.3946, perplexity_token=10.9641]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  49%|██████████████████████▌                       | 513/1044 [03:06<04:04,  2.17it/s, acc_step=1/1, ce_loss_token=2.3945, perplexity_token=10.9625]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  49%|██████████████████████▋                       | 514/1044 [03:07<04:01,  2.19it/s, acc_step=1/1, ce_loss_token=2.3944, perplexity_token=10.9611]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  49%|██████████████████████▋                       | 515/1044 [03:07<03:44,  2.35it/s, acc_step=1/1, ce_loss_token=2.3942, perplexity_token=10.9599]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  49%|██████████████████████▋                       | 516/1044 [03:08<03:40,  2.39it/s, acc_step=1/1, ce_loss_token=2.3941, perplexity_token=10.9586]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  50%|██████████████████████▊                       | 517/1044 [03:08<03:40,  2.39it/s, acc_step=1/1, ce_loss_token=2.3940, perplexity_token=10.9573]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  50%|██████████████████████▊                       | 518/1044 [03:08<03:30,  2.50it/s, acc_step=1/1, ce_loss_token=2.3939, perplexity_token=10.9561]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  50%|██████████████████████▊                       | 519/1044 [03:09<03:30,  2.49it/s, acc_step=1/1, ce_loss_token=2.3938, perplexity_token=10.9545]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  50%|██████████████████████▉                       | 520/1044 [03:09<03:28,  2.51it/s, acc_step=1/1, ce_loss_token=2.3936, perplexity_token=10.9533]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  50%|██████████████████████▉                       | 521/1044 [03:10<03:25,  2.54it/s, acc_step=1/1, ce_loss_token=2.3935, perplexity_token=10.9522]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  50%|███████████████████████                       | 522/1044 [03:10<03:17,  2.64it/s, acc_step=1/1, ce_loss_token=2.3934, perplexity_token=10.9511]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  50%|███████████████████████                       | 523/1044 [03:10<03:14,  2.68it/s, acc_step=1/1, ce_loss_token=2.3933, perplexity_token=10.9498]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  50%|███████████████████████                       | 524/1044 [03:11<03:02,  2.84it/s, acc_step=1/1, ce_loss_token=2.3933, perplexity_token=10.9491]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  50%|███████████████████████▏                      | 525/1044 [03:11<03:07,  2.76it/s, acc_step=1/1, ce_loss_token=2.3931, perplexity_token=10.9478]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  50%|███████████████████████▏                      | 526/1044 [03:11<03:21,  2.57it/s, acc_step=1/1, ce_loss_token=2.3930, perplexity_token=10.9462]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  50%|███████████████████████▏                      | 527/1044 [03:12<03:21,  2.56it/s, acc_step=1/1, ce_loss_token=2.3929, perplexity_token=10.9451]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  51%|███████████████████████▎                      | 528/1044 [03:12<03:22,  2.55it/s, acc_step=1/1, ce_loss_token=2.3928, perplexity_token=10.9437]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  51%|███████████████████████▎                      | 529/1044 [03:13<03:21,  2.56it/s, acc_step=1/1, ce_loss_token=2.3926, perplexity_token=10.9424]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  51%|███████████████████████▎                      | 530/1044 [03:13<03:39,  2.34it/s, acc_step=1/1, ce_loss_token=2.3925, perplexity_token=10.9410]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  51%|███████████████████████▍                      | 531/1044 [03:14<03:28,  2.46it/s, acc_step=1/1, ce_loss_token=2.3924, perplexity_token=10.9399]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  51%|███████████████████████▍                      | 532/1044 [03:14<03:34,  2.38it/s, acc_step=1/1, ce_loss_token=2.3923, perplexity_token=10.9385]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  51%|███████████████████████▍                      | 533/1044 [03:14<03:28,  2.45it/s, acc_step=1/1, ce_loss_token=2.3922, perplexity_token=10.9371]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  51%|███████████████████████▌                      | 534/1044 [03:15<03:34,  2.38it/s, acc_step=1/1, ce_loss_token=2.3920, perplexity_token=10.9357]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  51%|███████████████████████▌                      | 535/1044 [03:15<03:34,  2.38it/s, acc_step=1/1, ce_loss_token=2.3919, perplexity_token=10.9342]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  51%|███████████████████████▌                      | 536/1044 [03:15<03:13,  2.63it/s, acc_step=1/1, ce_loss_token=2.3919, perplexity_token=10.9339]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  51%|███████████████████████▋                      | 537/1044 [03:16<03:25,  2.46it/s, acc_step=1/1, ce_loss_token=2.3917, perplexity_token=10.9324]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|███████████████████████▋                      | 538/1044 [03:16<03:18,  2.55it/s, acc_step=1/1, ce_loss_token=2.3916, perplexity_token=10.9309]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  52%|███████████████████████▋                      | 539/1044 [03:17<03:18,  2.55it/s, acc_step=1/1, ce_loss_token=2.3915, perplexity_token=10.9295]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  52%|███████████████████████▊                      | 540/1044 [03:17<03:13,  2.60it/s, acc_step=1/1, ce_loss_token=2.3913, perplexity_token=10.9281]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  52%|███████████████████████▊                      | 541/1044 [03:17<03:16,  2.56it/s, acc_step=1/1, ce_loss_token=2.3912, perplexity_token=10.9267]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  52%|███████████████████████▉                      | 542/1044 [03:18<03:14,  2.58it/s, acc_step=1/1, ce_loss_token=2.3911, perplexity_token=10.9255]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  52%|███████████████████████▉                      | 543/1044 [03:18<03:11,  2.61it/s, acc_step=1/1, ce_loss_token=2.3910, perplexity_token=10.9242]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  52%|███████████████████████▉                      | 544/1044 [03:19<03:21,  2.48it/s, acc_step=1/1, ce_loss_token=2.3908, perplexity_token=10.9228]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████                      | 545/1044 [03:19<03:15,  2.55it/s, acc_step=1/1, ce_loss_token=2.3907, perplexity_token=10.9213]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████                      | 546/1044 [03:19<03:09,  2.63it/s, acc_step=1/1, ce_loss_token=2.3906, perplexity_token=10.9197]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  52%|████████████████████████                      | 547/1044 [03:20<03:09,  2.62it/s, acc_step=1/1, ce_loss_token=2.3905, perplexity_token=10.9185]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  52%|████████████████████████▏                     | 548/1044 [03:20<02:54,  2.84it/s, acc_step=1/1, ce_loss_token=2.3904, perplexity_token=10.9179]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  53%|████████████████████████▏                     | 549/1044 [03:20<02:54,  2.83it/s, acc_step=1/1, ce_loss_token=2.3903, perplexity_token=10.9165]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▏                     | 550/1044 [03:21<02:58,  2.77it/s, acc_step=1/1, ce_loss_token=2.3901, perplexity_token=10.9150]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  53%|████████████████████████▎                     | 551/1044 [03:21<03:05,  2.66it/s, acc_step=1/1, ce_loss_token=2.3900, perplexity_token=10.9137]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  53%|████████████████████████▎                     | 552/1044 [03:22<03:11,  2.57it/s, acc_step=1/1, ce_loss_token=2.3899, perplexity_token=10.9120]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▎                     | 553/1044 [03:22<03:10,  2.58it/s, acc_step=1/1, ce_loss_token=2.3897, perplexity_token=10.9106]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  53%|████████████████████████▍                     | 554/1044 [03:22<02:54,  2.80it/s, acc_step=1/1, ce_loss_token=2.3897, perplexity_token=10.9102]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  53%|████████████████████████▍                     | 555/1044 [03:23<03:00,  2.71it/s, acc_step=1/1, ce_loss_token=2.3896, perplexity_token=10.9088]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  53%|████████████████████████▍                     | 556/1044 [03:23<03:04,  2.65it/s, acc_step=1/1, ce_loss_token=2.3894, perplexity_token=10.9074]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  53%|████████████████████████▌                     | 557/1044 [03:24<03:10,  2.56it/s, acc_step=1/1, ce_loss_token=2.3893, perplexity_token=10.9060]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  53%|████████████████████████▌                     | 558/1044 [03:24<03:11,  2.54it/s, acc_step=1/1, ce_loss_token=2.3892, perplexity_token=10.9045]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  54%|████████████████████████▋                     | 559/1044 [03:24<03:09,  2.56it/s, acc_step=1/1, ce_loss_token=2.3890, perplexity_token=10.9030]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  54%|████████████████████████▋                     | 560/1044 [03:25<03:02,  2.65it/s, acc_step=1/1, ce_loss_token=2.3889, perplexity_token=10.9017]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  54%|████████████████████████▋                     | 561/1044 [03:25<02:58,  2.70it/s, acc_step=1/1, ce_loss_token=2.3888, perplexity_token=10.9005]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  54%|████████████████████████▊                     | 562/1044 [03:25<02:58,  2.70it/s, acc_step=1/1, ce_loss_token=2.3887, perplexity_token=10.8990]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  54%|████████████████████████▊                     | 563/1044 [03:26<03:08,  2.55it/s, acc_step=1/1, ce_loss_token=2.3885, perplexity_token=10.8976]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  54%|████████████████████████▉                     | 565/1044 [03:26<02:31,  3.16it/s, acc_step=1/1, ce_loss_token=2.3886, perplexity_token=10.8984]

torch.Size([256, 300, 35]) torch.Size([256, 300])
torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  54%|████████████████████████▉                     | 566/1044 [03:27<02:36,  3.05it/s, acc_step=1/1, ce_loss_token=2.3885, perplexity_token=10.8970]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  54%|████████████████████████▉                     | 567/1044 [03:27<02:46,  2.87it/s, acc_step=1/1, ce_loss_token=2.3884, perplexity_token=10.8959]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  54%|█████████████████████████                     | 568/1044 [03:27<02:48,  2.82it/s, acc_step=1/1, ce_loss_token=2.3883, perplexity_token=10.8947]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  55%|█████████████████████████                     | 569/1044 [03:28<02:53,  2.74it/s, acc_step=1/1, ce_loss_token=2.3881, perplexity_token=10.8933]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  55%|█████████████████████████                     | 570/1044 [03:28<02:53,  2.74it/s, acc_step=1/1, ce_loss_token=2.3880, perplexity_token=10.8920]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  55%|█████████████████████████▏                    | 571/1044 [03:29<03:00,  2.62it/s, acc_step=1/1, ce_loss_token=2.3879, perplexity_token=10.8905]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  55%|█████████████████████████▏                    | 572/1044 [03:29<03:00,  2.61it/s, acc_step=1/1, ce_loss_token=2.3878, perplexity_token=10.8892]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  55%|█████████████████████████▏                    | 573/1044 [03:29<02:45,  2.84it/s, acc_step=1/1, ce_loss_token=2.3877, perplexity_token=10.8888]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  55%|█████████████████████████▎                    | 574/1044 [03:30<02:46,  2.83it/s, acc_step=1/1, ce_loss_token=2.3876, perplexity_token=10.8877]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  55%|█████████████████████████▎                    | 575/1044 [03:30<02:43,  2.86it/s, acc_step=1/1, ce_loss_token=2.3875, perplexity_token=10.8866]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  55%|█████████████████████████▍                    | 576/1044 [03:30<02:50,  2.74it/s, acc_step=1/1, ce_loss_token=2.3874, perplexity_token=10.8851]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  55%|█████████████████████████▍                    | 577/1044 [03:31<02:55,  2.66it/s, acc_step=1/1, ce_loss_token=2.3873, perplexity_token=10.8837]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  55%|█████████████████████████▍                    | 578/1044 [03:31<02:57,  2.62it/s, acc_step=1/1, ce_loss_token=2.3871, perplexity_token=10.8823]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|█████████████████████████▌                    | 579/1044 [03:32<02:57,  2.63it/s, acc_step=1/1, ce_loss_token=2.3870, perplexity_token=10.8808]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  56%|█████████████████████████▌                    | 580/1044 [03:32<02:43,  2.83it/s, acc_step=1/1, ce_loss_token=2.3870, perplexity_token=10.8808]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  56%|█████████████████████████▌                    | 581/1044 [03:32<02:44,  2.81it/s, acc_step=1/1, ce_loss_token=2.3869, perplexity_token=10.8795]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|█████████████████████████▋                    | 582/1044 [03:33<02:48,  2.75it/s, acc_step=1/1, ce_loss_token=2.3868, perplexity_token=10.8782]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  56%|█████████████████████████▋                    | 583/1044 [03:33<02:35,  2.96it/s, acc_step=1/1, ce_loss_token=2.3867, perplexity_token=10.8776]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  56%|█████████████████████████▋                    | 584/1044 [03:33<02:24,  3.18it/s, acc_step=1/1, ce_loss_token=2.3867, perplexity_token=10.8770]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  56%|█████████████████████████▊                    | 585/1044 [03:34<02:33,  3.00it/s, acc_step=1/1, ce_loss_token=2.3865, perplexity_token=10.8757]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  56%|█████████████████████████▊                    | 586/1044 [03:34<02:35,  2.94it/s, acc_step=1/1, ce_loss_token=2.3864, perplexity_token=10.8744]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|█████████████████████████▊                    | 587/1044 [03:34<02:36,  2.93it/s, acc_step=1/1, ce_loss_token=2.3863, perplexity_token=10.8730]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  56%|█████████████████████████▉                    | 588/1044 [03:35<02:41,  2.82it/s, acc_step=1/1, ce_loss_token=2.3862, perplexity_token=10.8716]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  56%|█████████████████████████▉                    | 589/1044 [03:35<02:46,  2.73it/s, acc_step=1/1, ce_loss_token=2.3860, perplexity_token=10.8703]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|█████████████████████████▉                    | 590/1044 [03:35<02:34,  2.93it/s, acc_step=1/1, ce_loss_token=2.3860, perplexity_token=10.8703]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  57%|██████████████████████████                    | 591/1044 [03:36<02:49,  2.67it/s, acc_step=1/1, ce_loss_token=2.3859, perplexity_token=10.8691]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  57%|██████████████████████████▏                   | 593/1044 [03:36<02:27,  3.05it/s, acc_step=1/1, ce_loss_token=2.3859, perplexity_token=10.8693]

torch.Size([256, 307, 35]) torch.Size([256, 307])
torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  57%|██████████████████████████▏                   | 594/1044 [03:37<02:33,  2.94it/s, acc_step=1/1, ce_loss_token=2.3858, perplexity_token=10.8680]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  57%|██████████████████████████▏                   | 595/1044 [03:37<02:47,  2.68it/s, acc_step=1/1, ce_loss_token=2.3857, perplexity_token=10.8665]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  57%|██████████████████████████▎                   | 596/1044 [03:38<02:52,  2.60it/s, acc_step=1/1, ce_loss_token=2.3856, perplexity_token=10.8654]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▎                   | 597/1044 [03:38<02:37,  2.84it/s, acc_step=1/1, ce_loss_token=2.3855, perplexity_token=10.8648]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▎                   | 598/1044 [03:38<02:36,  2.84it/s, acc_step=1/1, ce_loss_token=2.3854, perplexity_token=10.8636]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▍                   | 599/1044 [03:39<02:36,  2.84it/s, acc_step=1/1, ce_loss_token=2.3853, perplexity_token=10.8624]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▍                   | 600/1044 [03:39<02:37,  2.82it/s, acc_step=1/1, ce_loss_token=2.3852, perplexity_token=10.8611]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  58%|██████████████████████████▍                   | 601/1044 [03:39<02:37,  2.81it/s, acc_step=1/1, ce_loss_token=2.3851, perplexity_token=10.8597]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  58%|██████████████████████████▌                   | 602/1044 [03:40<02:43,  2.70it/s, acc_step=1/1, ce_loss_token=2.3849, perplexity_token=10.8582]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  58%|██████████████████████████▌                   | 603/1044 [03:40<02:43,  2.70it/s, acc_step=1/1, ce_loss_token=2.3848, perplexity_token=10.8568]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  58%|██████████████████████████▌                   | 604/1044 [03:41<03:04,  2.39it/s, acc_step=1/1, ce_loss_token=2.3847, perplexity_token=10.8554]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|██████████████████████████▋                   | 605/1044 [03:41<02:56,  2.49it/s, acc_step=1/1, ce_loss_token=2.3846, perplexity_token=10.8544]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  58%|██████████████████████████▋                   | 606/1044 [03:41<02:57,  2.47it/s, acc_step=1/1, ce_loss_token=2.3844, perplexity_token=10.8528]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  58%|██████████████████████████▋                   | 607/1044 [03:42<02:48,  2.59it/s, acc_step=1/1, ce_loss_token=2.3843, perplexity_token=10.8514]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  58%|██████████████████████████▊                   | 608/1044 [03:42<02:57,  2.45it/s, acc_step=1/1, ce_loss_token=2.3842, perplexity_token=10.8500]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  58%|██████████████████████████▊                   | 609/1044 [03:42<02:52,  2.52it/s, acc_step=1/1, ce_loss_token=2.3840, perplexity_token=10.8487]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  58%|██████████████████████████▉                   | 610/1044 [03:43<02:54,  2.48it/s, acc_step=1/1, ce_loss_token=2.3839, perplexity_token=10.8473]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  59%|██████████████████████████▉                   | 611/1044 [03:43<02:41,  2.68it/s, acc_step=1/1, ce_loss_token=2.3839, perplexity_token=10.8468]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  59%|██████████████████████████▉                   | 612/1044 [03:44<02:31,  2.86it/s, acc_step=1/1, ce_loss_token=2.3839, perplexity_token=10.8467]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  59%|███████████████████████████                   | 613/1044 [03:44<02:40,  2.69it/s, acc_step=1/1, ce_loss_token=2.3837, perplexity_token=10.8454]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  59%|███████████████████████████                   | 614/1044 [03:44<02:29,  2.87it/s, acc_step=1/1, ce_loss_token=2.3837, perplexity_token=10.8450]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  59%|███████████████████████████                   | 615/1044 [03:45<02:34,  2.78it/s, acc_step=1/1, ce_loss_token=2.3836, perplexity_token=10.8436]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  59%|███████████████████████████▏                  | 616/1044 [03:45<02:26,  2.91it/s, acc_step=1/1, ce_loss_token=2.3835, perplexity_token=10.8430]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  59%|███████████████████████████▏                  | 617/1044 [03:45<02:30,  2.84it/s, acc_step=1/1, ce_loss_token=2.3834, perplexity_token=10.8418]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  59%|███████████████████████████▏                  | 618/1044 [03:46<02:32,  2.78it/s, acc_step=1/1, ce_loss_token=2.3833, perplexity_token=10.8403]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  59%|███████████████████████████▎                  | 619/1044 [03:46<02:25,  2.92it/s, acc_step=1/1, ce_loss_token=2.3833, perplexity_token=10.8402]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  59%|███████████████████████████▎                  | 620/1044 [03:46<02:24,  2.93it/s, acc_step=1/1, ce_loss_token=2.3831, perplexity_token=10.8388]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  59%|███████████████████████████▎                  | 621/1044 [03:47<02:31,  2.80it/s, acc_step=1/1, ce_loss_token=2.3831, perplexity_token=10.8379]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  60%|███████████████████████████▍                  | 622/1044 [03:47<02:32,  2.76it/s, acc_step=1/1, ce_loss_token=2.3829, perplexity_token=10.8366]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  60%|███████████████████████████▍                  | 623/1044 [03:47<02:24,  2.92it/s, acc_step=1/1, ce_loss_token=2.3829, perplexity_token=10.8368]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  60%|███████████████████████████▍                  | 624/1044 [03:48<02:22,  2.95it/s, acc_step=1/1, ce_loss_token=2.3828, perplexity_token=10.8354]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  60%|███████████████████████████▌                  | 625/1044 [03:48<02:27,  2.84it/s, acc_step=1/1, ce_loss_token=2.3827, perplexity_token=10.8339]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  60%|███████████████████████████▌                  | 626/1044 [03:48<02:23,  2.91it/s, acc_step=1/1, ce_loss_token=2.3826, perplexity_token=10.8327]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  60%|███████████████████████████▋                  | 627/1044 [03:49<02:27,  2.83it/s, acc_step=1/1, ce_loss_token=2.3824, perplexity_token=10.8312]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  60%|███████████████████████████▋                  | 628/1044 [03:49<02:29,  2.79it/s, acc_step=1/1, ce_loss_token=2.3823, perplexity_token=10.8298]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  60%|███████████████████████████▋                  | 629/1044 [03:50<02:31,  2.75it/s, acc_step=1/1, ce_loss_token=2.3822, perplexity_token=10.8285]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  60%|███████████████████████████▊                  | 630/1044 [03:50<02:22,  2.90it/s, acc_step=1/1, ce_loss_token=2.3822, perplexity_token=10.8282]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  60%|███████████████████████████▊                  | 631/1044 [03:50<02:15,  3.04it/s, acc_step=1/1, ce_loss_token=2.3821, perplexity_token=10.8281]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  61%|███████████████████████████▊                  | 632/1044 [03:51<02:21,  2.92it/s, acc_step=1/1, ce_loss_token=2.3820, perplexity_token=10.8266]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  61%|███████████████████████████▉                  | 633/1044 [03:51<02:23,  2.86it/s, acc_step=1/1, ce_loss_token=2.3819, perplexity_token=10.8254]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  61%|███████████████████████████▉                  | 634/1044 [03:51<02:23,  2.87it/s, acc_step=1/1, ce_loss_token=2.3818, perplexity_token=10.8241]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  61%|███████████████████████████▉                  | 635/1044 [03:52<02:28,  2.75it/s, acc_step=1/1, ce_loss_token=2.3816, perplexity_token=10.8226]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  61%|████████████████████████████                  | 636/1044 [03:52<02:27,  2.76it/s, acc_step=1/1, ce_loss_token=2.3815, perplexity_token=10.8212]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  61%|████████████████████████████                  | 637/1044 [03:52<02:26,  2.77it/s, acc_step=1/1, ce_loss_token=2.3814, perplexity_token=10.8199]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  61%|████████████████████████████                  | 638/1044 [03:53<02:34,  2.62it/s, acc_step=1/1, ce_loss_token=2.3812, perplexity_token=10.8184]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  61%|████████████████████████████▏                 | 639/1044 [03:53<02:33,  2.64it/s, acc_step=1/1, ce_loss_token=2.3811, perplexity_token=10.8171]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  61%|████████████████████████████▏                 | 640/1044 [03:53<02:21,  2.85it/s, acc_step=1/1, ce_loss_token=2.3811, perplexity_token=10.8168]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  61%|████████████████████████████▏                 | 641/1044 [03:54<02:14,  3.00it/s, acc_step=1/1, ce_loss_token=2.3811, perplexity_token=10.8167]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  61%|████████████████████████████▎                 | 642/1044 [03:54<02:13,  3.01it/s, acc_step=1/1, ce_loss_token=2.3810, perplexity_token=10.8154]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  62%|████████████████████████████▎                 | 643/1044 [03:54<02:19,  2.86it/s, acc_step=1/1, ce_loss_token=2.3808, perplexity_token=10.8139]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  62%|████████████████████████████▍                 | 644/1044 [03:55<02:23,  2.78it/s, acc_step=1/1, ce_loss_token=2.3807, perplexity_token=10.8127]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  62%|████████████████████████████▍                 | 645/1044 [03:55<02:21,  2.82it/s, acc_step=1/1, ce_loss_token=2.3806, perplexity_token=10.8115]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  62%|████████████████████████████▍                 | 646/1044 [03:56<02:23,  2.77it/s, acc_step=1/1, ce_loss_token=2.3805, perplexity_token=10.8101]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  62%|████████████████████████████▌                 | 647/1044 [03:56<02:22,  2.79it/s, acc_step=1/1, ce_loss_token=2.3803, perplexity_token=10.8086]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  62%|████████████████████████████▌                 | 648/1044 [03:56<02:26,  2.70it/s, acc_step=1/1, ce_loss_token=2.3802, perplexity_token=10.8074]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  62%|████████████████████████████▌                 | 649/1044 [03:57<02:27,  2.68it/s, acc_step=1/1, ce_loss_token=2.3801, perplexity_token=10.8060]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  62%|████████████████████████████▋                 | 650/1044 [03:57<02:15,  2.91it/s, acc_step=1/1, ce_loss_token=2.3801, perplexity_token=10.8057]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  62%|████████████████████████████▋                 | 651/1044 [03:57<02:15,  2.90it/s, acc_step=1/1, ce_loss_token=2.3799, perplexity_token=10.8043]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  62%|████████████████████████████▋                 | 652/1044 [03:58<02:10,  3.02it/s, acc_step=1/1, ce_loss_token=2.3799, perplexity_token=10.8039]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  63%|████████████████████████████▊                 | 653/1044 [03:58<02:17,  2.85it/s, acc_step=1/1, ce_loss_token=2.3798, perplexity_token=10.8026]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  63%|████████████████████████████▊                 | 654/1044 [03:58<02:20,  2.78it/s, acc_step=1/1, ce_loss_token=2.3797, perplexity_token=10.8013]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  63%|████████████████████████████▊                 | 655/1044 [03:59<02:19,  2.79it/s, acc_step=1/1, ce_loss_token=2.3795, perplexity_token=10.8000]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  63%|████████████████████████████▉                 | 656/1044 [03:59<02:19,  2.79it/s, acc_step=1/1, ce_loss_token=2.3794, perplexity_token=10.7987]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  63%|████████████████████████████▉                 | 657/1044 [03:59<02:20,  2.75it/s, acc_step=1/1, ce_loss_token=2.3793, perplexity_token=10.7976]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  63%|████████████████████████████▉                 | 658/1044 [04:00<02:20,  2.74it/s, acc_step=1/1, ce_loss_token=2.3792, perplexity_token=10.7963]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  63%|█████████████████████████████                 | 659/1044 [04:00<02:12,  2.91it/s, acc_step=1/1, ce_loss_token=2.3791, perplexity_token=10.7957]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  63%|█████████████████████████████                 | 660/1044 [04:01<02:17,  2.78it/s, acc_step=1/1, ce_loss_token=2.3790, perplexity_token=10.7945]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  63%|█████████████████████████████                 | 661/1044 [04:01<02:21,  2.70it/s, acc_step=1/1, ce_loss_token=2.3789, perplexity_token=10.7931]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|█████████████████████████████▏                | 662/1044 [04:01<02:24,  2.65it/s, acc_step=1/1, ce_loss_token=2.3788, perplexity_token=10.7918]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  64%|█████████████████████████████▏                | 663/1044 [04:02<02:31,  2.52it/s, acc_step=1/1, ce_loss_token=2.3787, perplexity_token=10.7903]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  64%|█████████████████████████████▎                | 664/1044 [04:02<02:27,  2.58it/s, acc_step=1/1, ce_loss_token=2.3785, perplexity_token=10.7890]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  64%|█████████████████████████████▎                | 665/1044 [04:02<02:25,  2.60it/s, acc_step=1/1, ce_loss_token=2.3785, perplexity_token=10.7883]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  64%|█████████████████████████████▎                | 666/1044 [04:03<02:23,  2.63it/s, acc_step=1/1, ce_loss_token=2.3783, perplexity_token=10.7869]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  64%|█████████████████████████████▍                | 667/1044 [04:03<02:11,  2.86it/s, acc_step=1/1, ce_loss_token=2.3783, perplexity_token=10.7867]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|█████████████████████████████▍                | 668/1044 [04:03<02:04,  3.01it/s, acc_step=1/1, ce_loss_token=2.3783, perplexity_token=10.7860]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  64%|█████████████████████████████▍                | 669/1044 [04:04<02:14,  2.79it/s, acc_step=1/1, ce_loss_token=2.3781, perplexity_token=10.7846]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  64%|█████████████████████████████▌                | 670/1044 [04:04<02:14,  2.78it/s, acc_step=1/1, ce_loss_token=2.3780, perplexity_token=10.7831]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  64%|█████████████████████████████▌                | 671/1044 [04:05<02:24,  2.59it/s, acc_step=1/1, ce_loss_token=2.3779, perplexity_token=10.7820]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  64%|█████████████████████████████▌                | 672/1044 [04:05<02:19,  2.67it/s, acc_step=1/1, ce_loss_token=2.3778, perplexity_token=10.7808]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  64%|█████████████████████████████▋                | 673/1044 [04:05<02:16,  2.73it/s, acc_step=1/1, ce_loss_token=2.3776, perplexity_token=10.7795]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  65%|█████████████████████████████▋                | 674/1044 [04:06<02:15,  2.74it/s, acc_step=1/1, ce_loss_token=2.3775, perplexity_token=10.7781]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  65%|█████████████████████████████▋                | 675/1044 [04:06<02:03,  2.99it/s, acc_step=1/1, ce_loss_token=2.3775, perplexity_token=10.7778]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|█████████████████████████████▊                | 676/1044 [04:06<01:57,  3.13it/s, acc_step=1/1, ce_loss_token=2.3774, perplexity_token=10.7772]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  65%|█████████████████████████████▊                | 677/1044 [04:07<01:58,  3.10it/s, acc_step=1/1, ce_loss_token=2.3773, perplexity_token=10.7758]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  65%|█████████████████████████████▊                | 678/1044 [04:07<01:53,  3.21it/s, acc_step=1/1, ce_loss_token=2.3772, perplexity_token=10.7751]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  65%|█████████████████████████████▉                | 680/1044 [04:07<01:45,  3.44it/s, acc_step=1/1, ce_loss_token=2.3773, perplexity_token=10.7754]

torch.Size([256, 308, 35]) torch.Size([256, 308])
torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  65%|██████████████████████████████                | 681/1044 [04:08<01:55,  3.16it/s, acc_step=1/1, ce_loss_token=2.3771, perplexity_token=10.7741]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  65%|██████████████████████████████                | 682/1044 [04:08<01:52,  3.23it/s, acc_step=1/1, ce_loss_token=2.3772, perplexity_token=10.7744]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  65%|██████████████████████████████                | 683/1044 [04:08<01:59,  3.03it/s, acc_step=1/1, ce_loss_token=2.3771, perplexity_token=10.7732]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  66%|██████████████████████████████▏               | 684/1044 [04:09<02:08,  2.81it/s, acc_step=1/1, ce_loss_token=2.3769, perplexity_token=10.7717]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  66%|██████████████████████████████▏               | 685/1044 [04:09<02:08,  2.80it/s, acc_step=1/1, ce_loss_token=2.3768, perplexity_token=10.7704]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  66%|██████████████████████████████▏               | 686/1044 [04:10<02:15,  2.64it/s, acc_step=1/1, ce_loss_token=2.3767, perplexity_token=10.7692]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  66%|██████████████████████████████▎               | 687/1044 [04:10<02:44,  2.16it/s, acc_step=1/1, ce_loss_token=2.3766, perplexity_token=10.7685]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  66%|██████████████████████████████▎               | 688/1044 [04:11<02:35,  2.29it/s, acc_step=1/1, ce_loss_token=2.3765, perplexity_token=10.7672]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  66%|██████████████████████████████▎               | 689/1044 [04:11<02:30,  2.36it/s, acc_step=1/1, ce_loss_token=2.3764, perplexity_token=10.7659]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  66%|██████████████████████████████▍               | 690/1044 [04:12<02:24,  2.44it/s, acc_step=1/1, ce_loss_token=2.3763, perplexity_token=10.7646]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  66%|██████████████████████████████▍               | 691/1044 [04:12<02:20,  2.52it/s, acc_step=1/1, ce_loss_token=2.3761, perplexity_token=10.7633]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  66%|██████████████████████████████▍               | 692/1044 [04:12<02:18,  2.54it/s, acc_step=1/1, ce_loss_token=2.3760, perplexity_token=10.7619]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  66%|██████████████████████████████▌               | 693/1044 [04:13<02:16,  2.57it/s, acc_step=1/1, ce_loss_token=2.3759, perplexity_token=10.7606]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  66%|██████████████████████████████▌               | 694/1044 [04:13<02:09,  2.71it/s, acc_step=1/1, ce_loss_token=2.3758, perplexity_token=10.7601]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  67%|██████████████████████████████▌               | 695/1044 [04:13<02:09,  2.69it/s, acc_step=1/1, ce_loss_token=2.3757, perplexity_token=10.7586]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  67%|██████████████████████████████▋               | 696/1044 [04:14<02:00,  2.90it/s, acc_step=1/1, ce_loss_token=2.3757, perplexity_token=10.7580]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  67%|██████████████████████████████▋               | 697/1044 [04:14<01:53,  3.07it/s, acc_step=1/1, ce_loss_token=2.3756, perplexity_token=10.7573]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  67%|██████████████████████████████▊               | 698/1044 [04:14<01:56,  2.96it/s, acc_step=1/1, ce_loss_token=2.3755, perplexity_token=10.7560]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  67%|██████████████████████████████▊               | 699/1044 [04:15<02:03,  2.79it/s, acc_step=1/1, ce_loss_token=2.3753, perplexity_token=10.7547]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  67%|██████████████████████████████▊               | 700/1044 [04:15<02:06,  2.73it/s, acc_step=1/1, ce_loss_token=2.3752, perplexity_token=10.7533]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  67%|██████████████████████████████▉               | 701/1044 [04:15<02:07,  2.70it/s, acc_step=1/1, ce_loss_token=2.3751, perplexity_token=10.7518]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  67%|██████████████████████████████▉               | 702/1044 [04:16<02:11,  2.60it/s, acc_step=1/1, ce_loss_token=2.3749, perplexity_token=10.7503]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  67%|██████████████████████████████▉               | 703/1044 [04:16<02:30,  2.26it/s, acc_step=1/1, ce_loss_token=2.3748, perplexity_token=10.7489]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  67%|███████████████████████████████               | 704/1044 [04:17<02:26,  2.33it/s, acc_step=1/1, ce_loss_token=2.3747, perplexity_token=10.7475]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  68%|███████████████████████████████               | 705/1044 [04:17<02:20,  2.41it/s, acc_step=1/1, ce_loss_token=2.3746, perplexity_token=10.7463]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  68%|███████████████████████████████               | 706/1044 [04:18<02:15,  2.50it/s, acc_step=1/1, ce_loss_token=2.3744, perplexity_token=10.7449]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  68%|███████████████████████████████▏              | 707/1044 [04:18<02:02,  2.75it/s, acc_step=1/1, ce_loss_token=2.3744, perplexity_token=10.7447]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  68%|███████████████████████████████▏              | 708/1044 [04:18<02:06,  2.66it/s, acc_step=1/1, ce_loss_token=2.3743, perplexity_token=10.7434]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  68%|███████████████████████████████▏              | 709/1044 [04:19<02:03,  2.71it/s, acc_step=1/1, ce_loss_token=2.3742, perplexity_token=10.7421]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  68%|███████████████████████████████▎              | 710/1044 [04:19<02:04,  2.68it/s, acc_step=1/1, ce_loss_token=2.3740, perplexity_token=10.7406]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  68%|███████████████████████████████▎              | 711/1044 [04:19<01:53,  2.95it/s, acc_step=1/1, ce_loss_token=2.3740, perplexity_token=10.7407]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  68%|███████████████████████████████▎              | 712/1044 [04:20<01:54,  2.90it/s, acc_step=1/1, ce_loss_token=2.3739, perplexity_token=10.7392]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  68%|███████████████████████████████▍              | 713/1044 [04:20<02:02,  2.71it/s, acc_step=1/1, ce_loss_token=2.3738, perplexity_token=10.7378]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  68%|███████████████████████████████▍              | 714/1044 [04:20<02:01,  2.72it/s, acc_step=1/1, ce_loss_token=2.3736, perplexity_token=10.7363]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  68%|███████████████████████████████▌              | 715/1044 [04:21<01:58,  2.77it/s, acc_step=1/1, ce_loss_token=2.3735, perplexity_token=10.7351]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  69%|███████████████████████████████▌              | 716/1044 [04:21<02:03,  2.66it/s, acc_step=1/1, ce_loss_token=2.3734, perplexity_token=10.7336]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  69%|███████████████████████████████▌              | 717/1044 [04:22<02:07,  2.56it/s, acc_step=1/1, ce_loss_token=2.3733, perplexity_token=10.7323]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  69%|███████████████████████████████▋              | 718/1044 [04:22<01:59,  2.73it/s, acc_step=1/1, ce_loss_token=2.3732, perplexity_token=10.7321]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  69%|███████████████████████████████▋              | 719/1044 [04:22<01:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.3731, perplexity_token=10.7308]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  69%|███████████████████████████████▋              | 720/1044 [04:23<02:04,  2.60it/s, acc_step=1/1, ce_loss_token=2.3730, perplexity_token=10.7293]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  69%|███████████████████████████████▊              | 721/1044 [04:23<02:06,  2.55it/s, acc_step=1/1, ce_loss_token=2.3729, perplexity_token=10.7279]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  69%|███████████████████████████████▊              | 722/1044 [04:23<02:03,  2.61it/s, acc_step=1/1, ce_loss_token=2.3727, perplexity_token=10.7268]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  69%|███████████████████████████████▊              | 723/1044 [04:24<02:00,  2.67it/s, acc_step=1/1, ce_loss_token=2.3726, perplexity_token=10.7255]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  69%|███████████████████████████████▉              | 724/1044 [04:24<02:01,  2.63it/s, acc_step=1/1, ce_loss_token=2.3725, perplexity_token=10.7242]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  69%|███████████████████████████████▉              | 725/1044 [04:25<01:59,  2.67it/s, acc_step=1/1, ce_loss_token=2.3724, perplexity_token=10.7227]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|███████████████████████████████▉              | 726/1044 [04:25<01:51,  2.86it/s, acc_step=1/1, ce_loss_token=2.3723, perplexity_token=10.7220]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  70%|████████████████████████████████              | 728/1044 [04:25<01:36,  3.28it/s, acc_step=1/1, ce_loss_token=2.3722, perplexity_token=10.7213]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  70%|████████████████████████████████              | 729/1044 [04:26<01:43,  3.04it/s, acc_step=1/1, ce_loss_token=2.3721, perplexity_token=10.7200]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  70%|████████████████████████████████▏             | 730/1044 [04:26<01:46,  2.95it/s, acc_step=1/1, ce_loss_token=2.3720, perplexity_token=10.7186]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|████████████████████████████████▏             | 731/1044 [04:27<01:49,  2.85it/s, acc_step=1/1, ce_loss_token=2.3719, perplexity_token=10.7173]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  70%|████████████████████████████████▎             | 732/1044 [04:27<01:44,  2.98it/s, acc_step=1/1, ce_loss_token=2.3718, perplexity_token=10.7166]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  70%|████████████████████████████████▎             | 733/1044 [04:27<01:37,  3.20it/s, acc_step=1/1, ce_loss_token=2.3717, perplexity_token=10.7161]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  70%|████████████████████████████████▎             | 734/1044 [04:27<01:44,  2.97it/s, acc_step=1/1, ce_loss_token=2.3716, perplexity_token=10.7148]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  70%|████████████████████████████████▍             | 736/1044 [04:28<01:31,  3.38it/s, acc_step=1/1, ce_loss_token=2.3716, perplexity_token=10.7149]

torch.Size([256, 294, 35]) torch.Size([256, 294])
torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  71%|████████████████████████████████▍             | 737/1044 [04:28<01:29,  3.43it/s, acc_step=1/1, ce_loss_token=2.3716, perplexity_token=10.7141]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  71%|████████████████████████████████▌             | 738/1044 [04:29<01:27,  3.49it/s, acc_step=1/1, ce_loss_token=2.3715, perplexity_token=10.7138]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  71%|████████████████████████████████▌             | 739/1044 [04:29<01:32,  3.31it/s, acc_step=1/1, ce_loss_token=2.3714, perplexity_token=10.7126]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  71%|████████████████████████████████▌             | 740/1044 [04:29<01:40,  3.02it/s, acc_step=1/1, ce_loss_token=2.3713, perplexity_token=10.7114]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  71%|████████████████████████████████▋             | 741/1044 [04:30<01:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.3712, perplexity_token=10.7102]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|████████████████████████████████▋             | 742/1044 [04:30<01:52,  2.68it/s, acc_step=1/1, ce_loss_token=2.3711, perplexity_token=10.7088]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  71%|████████████████████████████████▋             | 743/1044 [04:30<01:46,  2.83it/s, acc_step=1/1, ce_loss_token=2.3710, perplexity_token=10.7082]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  71%|████████████████████████████████▊             | 744/1044 [04:31<01:46,  2.83it/s, acc_step=1/1, ce_loss_token=2.3709, perplexity_token=10.7069]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  71%|████████████████████████████████▊             | 745/1044 [04:31<01:42,  2.92it/s, acc_step=1/1, ce_loss_token=2.3709, perplexity_token=10.7071]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  72%|████████████████████████████████▉             | 747/1044 [04:32<01:32,  3.19it/s, acc_step=1/1, ce_loss_token=2.3709, perplexity_token=10.7070]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  72%|████████████████████████████████▉             | 748/1044 [04:32<01:35,  3.10it/s, acc_step=1/1, ce_loss_token=2.3708, perplexity_token=10.7057]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  72%|█████████████████████████████████             | 749/1044 [04:32<01:38,  2.98it/s, acc_step=1/1, ce_loss_token=2.3707, perplexity_token=10.7044]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  72%|█████████████████████████████████             | 750/1044 [04:33<01:43,  2.85it/s, acc_step=1/1, ce_loss_token=2.3705, perplexity_token=10.7028]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  72%|█████████████████████████████████             | 751/1044 [04:33<01:48,  2.71it/s, acc_step=1/1, ce_loss_token=2.3704, perplexity_token=10.7014]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  72%|█████████████████████████████████▏            | 752/1044 [04:34<01:52,  2.60it/s, acc_step=1/1, ce_loss_token=2.3703, perplexity_token=10.7002]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  72%|█████████████████████████████████▏            | 753/1044 [04:34<01:49,  2.65it/s, acc_step=1/1, ce_loss_token=2.3701, perplexity_token=10.6988]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  72%|█████████████████████████████████▏            | 754/1044 [04:34<01:50,  2.63it/s, acc_step=1/1, ce_loss_token=2.3700, perplexity_token=10.6975]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  72%|█████████████████████████████████▎            | 755/1044 [04:35<01:47,  2.68it/s, acc_step=1/1, ce_loss_token=2.3699, perplexity_token=10.6961]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  72%|█████████████████████████████████▎            | 756/1044 [04:35<01:46,  2.71it/s, acc_step=1/1, ce_loss_token=2.3698, perplexity_token=10.6948]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  73%|█████████████████████████████████▎            | 757/1044 [04:36<01:48,  2.64it/s, acc_step=1/1, ce_loss_token=2.3696, perplexity_token=10.6936]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  73%|█████████████████████████████████▍            | 758/1044 [04:36<01:46,  2.69it/s, acc_step=1/1, ce_loss_token=2.3695, perplexity_token=10.6920]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  73%|█████████████████████████████████▍            | 759/1044 [04:36<01:43,  2.76it/s, acc_step=1/1, ce_loss_token=2.3694, perplexity_token=10.6907]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  73%|█████████████████████████████████▍            | 760/1044 [04:37<01:41,  2.79it/s, acc_step=1/1, ce_loss_token=2.3692, perplexity_token=10.6894]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  73%|█████████████████████████████████▌            | 761/1044 [04:37<01:39,  2.85it/s, acc_step=1/1, ce_loss_token=2.3691, perplexity_token=10.6881]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  73%|█████████████████████████████████▌            | 762/1044 [04:37<01:38,  2.85it/s, acc_step=1/1, ce_loss_token=2.3690, perplexity_token=10.6868]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  73%|█████████████████████████████████▌            | 763/1044 [04:38<01:40,  2.80it/s, acc_step=1/1, ce_loss_token=2.3689, perplexity_token=10.6856]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  73%|█████████████████████████████████▋            | 764/1044 [04:38<01:33,  3.01it/s, acc_step=1/1, ce_loss_token=2.3688, perplexity_token=10.6851]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  73%|█████████████████████████████████▋            | 765/1044 [04:38<01:33,  2.99it/s, acc_step=1/1, ce_loss_token=2.3687, perplexity_token=10.6839]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  73%|█████████████████████████████████▊            | 766/1044 [04:39<01:40,  2.78it/s, acc_step=1/1, ce_loss_token=2.3686, perplexity_token=10.6824]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  73%|█████████████████████████████████▊            | 767/1044 [04:39<01:32,  3.00it/s, acc_step=1/1, ce_loss_token=2.3686, perplexity_token=10.6825]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  74%|█████████████████████████████████▊            | 768/1044 [04:39<01:36,  2.85it/s, acc_step=1/1, ce_loss_token=2.3685, perplexity_token=10.6811]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  74%|█████████████████████████████████▉            | 769/1044 [04:40<01:38,  2.80it/s, acc_step=1/1, ce_loss_token=2.3684, perplexity_token=10.6800]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  74%|█████████████████████████████████▉            | 770/1044 [04:40<01:37,  2.81it/s, acc_step=1/1, ce_loss_token=2.3683, perplexity_token=10.6789]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  74%|█████████████████████████████████▉            | 771/1044 [04:40<01:42,  2.67it/s, acc_step=1/1, ce_loss_token=2.3681, perplexity_token=10.6776]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  74%|██████████████████████████████████            | 772/1044 [04:41<01:43,  2.64it/s, acc_step=1/1, ce_loss_token=2.3680, perplexity_token=10.6763]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  74%|██████████████████████████████████            | 773/1044 [04:41<01:42,  2.64it/s, acc_step=1/1, ce_loss_token=2.3679, perplexity_token=10.6752]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  74%|██████████████████████████████████            | 774/1044 [04:42<01:42,  2.63it/s, acc_step=1/1, ce_loss_token=2.3678, perplexity_token=10.6738]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▏           | 775/1044 [04:42<01:40,  2.67it/s, acc_step=1/1, ce_loss_token=2.3677, perplexity_token=10.6724]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  74%|██████████████████████████████████▏           | 776/1044 [04:42<01:41,  2.63it/s, acc_step=1/1, ce_loss_token=2.3675, perplexity_token=10.6712]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  74%|██████████████████████████████████▏           | 777/1044 [04:43<01:39,  2.69it/s, acc_step=1/1, ce_loss_token=2.3674, perplexity_token=10.6699]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  75%|██████████████████████████████████▎           | 778/1044 [04:43<01:39,  2.68it/s, acc_step=1/1, ce_loss_token=2.3673, perplexity_token=10.6686]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  75%|██████████████████████████████████▎           | 779/1044 [04:44<01:40,  2.64it/s, acc_step=1/1, ce_loss_token=2.3672, perplexity_token=10.6674]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  75%|██████████████████████████████████▎           | 780/1044 [04:44<01:40,  2.63it/s, acc_step=1/1, ce_loss_token=2.3671, perplexity_token=10.6661]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  75%|██████████████████████████████████▍           | 781/1044 [04:44<01:36,  2.73it/s, acc_step=1/1, ce_loss_token=2.3670, perplexity_token=10.6648]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  75%|██████████████████████████████████▍           | 782/1044 [04:45<01:46,  2.47it/s, acc_step=1/1, ce_loss_token=2.3668, perplexity_token=10.6637]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  75%|██████████████████████████████████▌           | 783/1044 [04:45<01:48,  2.41it/s, acc_step=1/1, ce_loss_token=2.3667, perplexity_token=10.6626]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  75%|██████████████████████████████████▌           | 784/1044 [04:46<01:48,  2.39it/s, acc_step=1/1, ce_loss_token=2.3666, perplexity_token=10.6613]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|██████████████████████████████████▌           | 785/1044 [04:46<01:44,  2.48it/s, acc_step=1/1, ce_loss_token=2.3665, perplexity_token=10.6601]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  75%|██████████████████████████████████▋           | 786/1044 [04:46<01:40,  2.57it/s, acc_step=1/1, ce_loss_token=2.3664, perplexity_token=10.6586]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  75%|██████████████████████████████████▋           | 787/1044 [04:47<01:41,  2.54it/s, acc_step=1/1, ce_loss_token=2.3663, perplexity_token=10.6574]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  75%|██████████████████████████████████▋           | 788/1044 [04:47<01:40,  2.55it/s, acc_step=1/1, ce_loss_token=2.3661, perplexity_token=10.6561]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|██████████████████████████████████▊           | 789/1044 [04:47<01:38,  2.58it/s, acc_step=1/1, ce_loss_token=2.3660, perplexity_token=10.6547]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  76%|██████████████████████████████████▊           | 790/1044 [04:48<01:29,  2.85it/s, acc_step=1/1, ce_loss_token=2.3660, perplexity_token=10.6544]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  76%|██████████████████████████████████▊           | 791/1044 [04:48<01:28,  2.85it/s, acc_step=1/1, ce_loss_token=2.3658, perplexity_token=10.6531]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  76%|██████████████████████████████████▉           | 792/1044 [04:49<01:50,  2.29it/s, acc_step=1/1, ce_loss_token=2.3657, perplexity_token=10.6518]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  76%|██████████████████████████████████▉           | 793/1044 [04:49<01:45,  2.37it/s, acc_step=1/1, ce_loss_token=2.3656, perplexity_token=10.6504]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  76%|██████████████████████████████████▉           | 794/1044 [04:50<01:43,  2.41it/s, acc_step=1/1, ce_loss_token=2.3655, perplexity_token=10.6491]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  76%|███████████████████████████████████           | 795/1044 [04:50<01:44,  2.39it/s, acc_step=1/1, ce_loss_token=2.3654, perplexity_token=10.6478]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  76%|███████████████████████████████████           | 796/1044 [04:50<01:43,  2.39it/s, acc_step=1/1, ce_loss_token=2.3652, perplexity_token=10.6464]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  76%|███████████████████████████████████           | 797/1044 [04:51<01:34,  2.61it/s, acc_step=1/1, ce_loss_token=2.3653, perplexity_token=10.6469]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  76%|███████████████████████████████████▏          | 798/1044 [04:51<01:30,  2.72it/s, acc_step=1/1, ce_loss_token=2.3652, perplexity_token=10.6456]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  77%|███████████████████████████████████▏          | 799/1044 [04:51<01:34,  2.60it/s, acc_step=1/1, ce_loss_token=2.3650, perplexity_token=10.6444]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  77%|███████████████████████████████████▏          | 800/1044 [04:52<01:32,  2.64it/s, acc_step=1/1, ce_loss_token=2.3649, perplexity_token=10.6430]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  77%|███████████████████████████████████▎          | 801/1044 [04:52<01:22,  2.94it/s, acc_step=1/1, ce_loss_token=2.3649, perplexity_token=10.6429]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  77%|███████████████████████████████████▎          | 802/1044 [04:52<01:23,  2.91it/s, acc_step=1/1, ce_loss_token=2.3648, perplexity_token=10.6416]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  77%|███████████████████████████████████▍          | 803/1044 [04:53<01:23,  2.89it/s, acc_step=1/1, ce_loss_token=2.3646, perplexity_token=10.6403]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  77%|███████████████████████████████████▍          | 804/1044 [04:53<01:27,  2.73it/s, acc_step=1/1, ce_loss_token=2.3645, perplexity_token=10.6391]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  77%|███████████████████████████████████▍          | 805/1044 [04:54<01:27,  2.73it/s, acc_step=1/1, ce_loss_token=2.3644, perplexity_token=10.6378]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  77%|███████████████████████████████████▌          | 806/1044 [04:54<01:26,  2.75it/s, acc_step=1/1, ce_loss_token=2.3643, perplexity_token=10.6365]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  77%|███████████████████████████████████▌          | 807/1044 [04:54<01:29,  2.65it/s, acc_step=1/1, ce_loss_token=2.3642, perplexity_token=10.6353]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|███████████████████████████████████▌          | 808/1044 [04:55<01:27,  2.70it/s, acc_step=1/1, ce_loss_token=2.3641, perplexity_token=10.6339]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  77%|███████████████████████████████████▋          | 809/1044 [04:55<01:24,  2.77it/s, acc_step=1/1, ce_loss_token=2.3639, perplexity_token=10.6327]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  78%|███████████████████████████████████▋          | 810/1044 [04:55<01:24,  2.78it/s, acc_step=1/1, ce_loss_token=2.3638, perplexity_token=10.6316]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  78%|███████████████████████████████████▋          | 811/1044 [04:56<01:17,  3.03it/s, acc_step=1/1, ce_loss_token=2.3638, perplexity_token=10.6313]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  78%|███████████████████████████████████▊          | 812/1044 [04:56<01:19,  2.93it/s, acc_step=1/1, ce_loss_token=2.3637, perplexity_token=10.6300]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  78%|███████████████████████████████████▊          | 813/1044 [04:56<01:27,  2.63it/s, acc_step=1/1, ce_loss_token=2.3635, perplexity_token=10.6286]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  78%|███████████████████████████████████▊          | 814/1044 [04:57<01:28,  2.60it/s, acc_step=1/1, ce_loss_token=2.3634, perplexity_token=10.6272]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  78%|███████████████████████████████████▉          | 815/1044 [04:57<01:26,  2.65it/s, acc_step=1/1, ce_loss_token=2.3633, perplexity_token=10.6258]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  78%|███████████████████████████████████▉          | 816/1044 [04:58<01:24,  2.69it/s, acc_step=1/1, ce_loss_token=2.3632, perplexity_token=10.6247]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  78%|███████████████████████████████████▉          | 817/1044 [04:58<01:23,  2.72it/s, acc_step=1/1, ce_loss_token=2.3631, perplexity_token=10.6234]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  78%|████████████████████████████████████          | 818/1044 [04:58<01:21,  2.77it/s, acc_step=1/1, ce_loss_token=2.3629, perplexity_token=10.6221]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  78%|████████████████████████████████████          | 819/1044 [04:59<01:21,  2.76it/s, acc_step=1/1, ce_loss_token=2.3628, perplexity_token=10.6208]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  79%|████████████████████████████████████▏         | 820/1044 [04:59<01:16,  2.93it/s, acc_step=1/1, ce_loss_token=2.3628, perplexity_token=10.6202]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  79%|████████████████████████████████████▏         | 821/1044 [04:59<01:14,  2.99it/s, acc_step=1/1, ce_loss_token=2.3626, perplexity_token=10.6187]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  79%|████████████████████████████████████▏         | 822/1044 [05:00<01:19,  2.78it/s, acc_step=1/1, ce_loss_token=2.3626, perplexity_token=10.6183]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  79%|████████████████████████████████████▎         | 823/1044 [05:00<01:14,  2.98it/s, acc_step=1/1, ce_loss_token=2.3626, perplexity_token=10.6183]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  79%|████████████████████████████████████▎         | 824/1044 [05:00<01:10,  3.11it/s, acc_step=1/1, ce_loss_token=2.3625, perplexity_token=10.6178]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|████████████████████████████████████▎         | 825/1044 [05:01<01:13,  3.00it/s, acc_step=1/1, ce_loss_token=2.3624, perplexity_token=10.6164]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  79%|████████████████████████████████████▍         | 826/1044 [05:01<01:15,  2.88it/s, acc_step=1/1, ce_loss_token=2.3623, perplexity_token=10.6152]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  79%|████████████████████████████████████▍         | 828/1044 [05:02<01:05,  3.30it/s, acc_step=1/1, ce_loss_token=2.3623, perplexity_token=10.6151]

torch.Size([256, 298, 35]) torch.Size([256, 298])
torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  79%|████████████████████████████████████▌         | 829/1044 [05:02<01:11,  2.99it/s, acc_step=1/1, ce_loss_token=2.3622, perplexity_token=10.6137]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  80%|████████████████████████████████████▌         | 830/1044 [05:02<01:12,  2.94it/s, acc_step=1/1, ce_loss_token=2.3620, perplexity_token=10.6125]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  80%|████████████████████████████████████▌         | 831/1044 [05:03<01:13,  2.91it/s, acc_step=1/1, ce_loss_token=2.3619, perplexity_token=10.6112]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  80%|████████████████████████████████████▋         | 832/1044 [05:03<01:10,  3.02it/s, acc_step=1/1, ce_loss_token=2.3619, perplexity_token=10.6106]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  80%|████████████████████████████████████▋         | 833/1044 [05:03<01:12,  2.93it/s, acc_step=1/1, ce_loss_token=2.3617, perplexity_token=10.6094]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|████████████████████████████████████▋         | 834/1044 [05:04<01:13,  2.86it/s, acc_step=1/1, ce_loss_token=2.3616, perplexity_token=10.6081]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  80%|████████████████████████████████████▊         | 835/1044 [05:04<01:24,  2.48it/s, acc_step=1/1, ce_loss_token=2.3615, perplexity_token=10.6066]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  80%|████████████████████████████████████▊         | 836/1044 [05:05<01:23,  2.48it/s, acc_step=1/1, ce_loss_token=2.3613, perplexity_token=10.6052]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  80%|████████████████████████████████████▉         | 837/1044 [05:05<01:16,  2.70it/s, acc_step=1/1, ce_loss_token=2.3614, perplexity_token=10.6054]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  80%|████████████████████████████████████▉         | 838/1044 [05:05<01:16,  2.70it/s, acc_step=1/1, ce_loss_token=2.3612, perplexity_token=10.6040]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  80%|████████████████████████████████████▉         | 839/1044 [05:06<01:19,  2.59it/s, acc_step=1/1, ce_loss_token=2.3611, perplexity_token=10.6027]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  80%|█████████████████████████████████████         | 840/1044 [05:06<01:16,  2.65it/s, acc_step=1/1, ce_loss_token=2.3610, perplexity_token=10.6014]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  81%|█████████████████████████████████████         | 841/1044 [05:06<01:14,  2.71it/s, acc_step=1/1, ce_loss_token=2.3609, perplexity_token=10.6001]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████         | 842/1044 [05:07<01:14,  2.70it/s, acc_step=1/1, ce_loss_token=2.3607, perplexity_token=10.5989]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  81%|█████████████████████████████████████▏        | 843/1044 [05:07<01:17,  2.58it/s, acc_step=1/1, ce_loss_token=2.3606, perplexity_token=10.5976]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  81%|█████████████████████████████████████▏        | 844/1044 [05:08<01:15,  2.64it/s, acc_step=1/1, ce_loss_token=2.3605, perplexity_token=10.5963]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  81%|█████████████████████████████████████▏        | 845/1044 [05:08<01:13,  2.70it/s, acc_step=1/1, ce_loss_token=2.3604, perplexity_token=10.5949]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  81%|█████████████████████████████████████▎        | 846/1044 [05:08<01:12,  2.74it/s, acc_step=1/1, ce_loss_token=2.3602, perplexity_token=10.5935]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  81%|█████████████████████████████████████▎        | 847/1044 [05:09<01:05,  2.99it/s, acc_step=1/1, ce_loss_token=2.3602, perplexity_token=10.5932]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  81%|█████████████████████████████████████▎        | 848/1044 [05:09<01:07,  2.92it/s, acc_step=1/1, ce_loss_token=2.3601, perplexity_token=10.5920]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████▍        | 849/1044 [05:09<01:08,  2.83it/s, acc_step=1/1, ce_loss_token=2.3600, perplexity_token=10.5907]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  81%|█████████████████████████████████████▍        | 850/1044 [05:10<01:13,  2.65it/s, acc_step=1/1, ce_loss_token=2.3599, perplexity_token=10.5895]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  82%|█████████████████████████████████████▍        | 851/1044 [05:10<01:17,  2.48it/s, acc_step=1/1, ce_loss_token=2.3597, perplexity_token=10.5882]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  82%|█████████████████████████████████████▌        | 852/1044 [05:11<01:19,  2.42it/s, acc_step=1/1, ce_loss_token=2.3596, perplexity_token=10.5870]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  82%|█████████████████████████████████████▌        | 853/1044 [05:11<01:20,  2.37it/s, acc_step=1/1, ce_loss_token=2.3595, perplexity_token=10.5858]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  82%|█████████████████████████████████████▋        | 854/1044 [05:11<01:20,  2.35it/s, acc_step=1/1, ce_loss_token=2.3594, perplexity_token=10.5846]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  82%|█████████████████████████████████████▋        | 855/1044 [05:12<01:16,  2.47it/s, acc_step=1/1, ce_loss_token=2.3593, perplexity_token=10.5832]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  82%|█████████████████████████████████████▋        | 856/1044 [05:12<01:11,  2.63it/s, acc_step=1/1, ce_loss_token=2.3592, perplexity_token=10.5827]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  82%|█████████████████████████████████████▊        | 857/1044 [05:12<01:09,  2.69it/s, acc_step=1/1, ce_loss_token=2.3591, perplexity_token=10.5814]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  82%|█████████████████████████████████████▊        | 858/1044 [05:13<01:08,  2.72it/s, acc_step=1/1, ce_loss_token=2.3590, perplexity_token=10.5803]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  82%|█████████████████████████████████████▊        | 859/1044 [05:13<01:08,  2.70it/s, acc_step=1/1, ce_loss_token=2.3589, perplexity_token=10.5789]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  82%|█████████████████████████████████████▉        | 860/1044 [05:14<01:12,  2.54it/s, acc_step=1/1, ce_loss_token=2.3587, perplexity_token=10.5776]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  82%|█████████████████████████████████████▉        | 861/1044 [05:14<01:08,  2.67it/s, acc_step=1/1, ce_loss_token=2.3587, perplexity_token=10.5774]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  83%|█████████████████████████████████████▉        | 862/1044 [05:14<01:10,  2.60it/s, acc_step=1/1, ce_loss_token=2.3586, perplexity_token=10.5760]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  83%|██████████████████████████████████████        | 863/1044 [05:15<01:07,  2.69it/s, acc_step=1/1, ce_loss_token=2.3585, perplexity_token=10.5747]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  83%|██████████████████████████████████████        | 864/1044 [05:15<01:16,  2.36it/s, acc_step=1/1, ce_loss_token=2.3584, perplexity_token=10.5735]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  83%|██████████████████████████████████████        | 865/1044 [05:16<01:11,  2.50it/s, acc_step=1/1, ce_loss_token=2.3582, perplexity_token=10.5720]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  83%|██████████████████████████████████████▏       | 866/1044 [05:16<01:08,  2.61it/s, acc_step=1/1, ce_loss_token=2.3581, perplexity_token=10.5708]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  83%|██████████████████████████████████████▏       | 868/1044 [05:17<00:57,  3.06it/s, acc_step=1/1, ce_loss_token=2.3580, perplexity_token=10.5699]

torch.Size([256, 304, 35]) torch.Size([256, 304])
torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|██████████████████████████████████████▎       | 869/1044 [05:17<00:59,  2.92it/s, acc_step=1/1, ce_loss_token=2.3579, perplexity_token=10.5685]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  83%|██████████████████████████████████████▎       | 870/1044 [05:17<01:02,  2.80it/s, acc_step=1/1, ce_loss_token=2.3578, perplexity_token=10.5674]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  83%|██████████████████████████████████████▍       | 871/1044 [05:18<01:01,  2.81it/s, acc_step=1/1, ce_loss_token=2.3576, perplexity_token=10.5661]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  84%|██████████████████████████████████████▍       | 873/1044 [05:18<00:50,  3.39it/s, acc_step=1/1, ce_loss_token=2.3577, perplexity_token=10.5662]

torch.Size([256, 314, 35]) torch.Size([256, 314])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  84%|██████████████████████████████████████▌       | 874/1044 [05:19<00:53,  3.18it/s, acc_step=1/1, ce_loss_token=2.3575, perplexity_token=10.5650]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  84%|██████████████████████████████████████▌       | 875/1044 [05:19<00:51,  3.31it/s, acc_step=1/1, ce_loss_token=2.3575, perplexity_token=10.5649]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  84%|██████████████████████████████████████▌       | 876/1044 [05:19<00:53,  3.15it/s, acc_step=1/1, ce_loss_token=2.3574, perplexity_token=10.5636]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|██████████████████████████████████████▋       | 877/1044 [05:20<00:54,  3.04it/s, acc_step=1/1, ce_loss_token=2.3573, perplexity_token=10.5624]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  84%|██████████████████████████████████████▋       | 878/1044 [05:20<00:55,  2.99it/s, acc_step=1/1, ce_loss_token=2.3572, perplexity_token=10.5611]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  84%|██████████████████████████████████████▋       | 879/1044 [05:20<00:59,  2.79it/s, acc_step=1/1, ce_loss_token=2.3571, perplexity_token=10.5599]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  84%|██████████████████████████████████████▊       | 880/1044 [05:21<01:01,  2.66it/s, acc_step=1/1, ce_loss_token=2.3569, perplexity_token=10.5586]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  84%|██████████████████████████████████████▊       | 881/1044 [05:21<01:03,  2.58it/s, acc_step=1/1, ce_loss_token=2.3568, perplexity_token=10.5573]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  84%|██████████████████████████████████████▊       | 882/1044 [05:21<01:01,  2.65it/s, acc_step=1/1, ce_loss_token=2.3567, perplexity_token=10.5559]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  85%|██████████████████████████████████████▉       | 883/1044 [05:22<01:02,  2.58it/s, acc_step=1/1, ce_loss_token=2.3566, perplexity_token=10.5547]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  85%|██████████████████████████████████████▉       | 884/1044 [05:22<01:01,  2.60it/s, acc_step=1/1, ce_loss_token=2.3565, perplexity_token=10.5535]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  85%|██████████████████████████████████████▉       | 885/1044 [05:23<01:01,  2.59it/s, acc_step=1/1, ce_loss_token=2.3563, perplexity_token=10.5523]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  85%|███████████████████████████████████████       | 886/1044 [05:23<01:00,  2.59it/s, acc_step=1/1, ce_loss_token=2.3562, perplexity_token=10.5510]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  85%|███████████████████████████████████████       | 887/1044 [05:23<01:01,  2.55it/s, acc_step=1/1, ce_loss_token=2.3561, perplexity_token=10.5496]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  85%|███████████████████████████████████████▏      | 888/1044 [05:24<01:02,  2.51it/s, acc_step=1/1, ce_loss_token=2.3560, perplexity_token=10.5484]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  85%|███████████████████████████████████████▏      | 889/1044 [05:24<01:01,  2.51it/s, acc_step=1/1, ce_loss_token=2.3559, perplexity_token=10.5472]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  85%|███████████████████████████████████████▏      | 890/1044 [05:25<01:01,  2.52it/s, acc_step=1/1, ce_loss_token=2.3557, perplexity_token=10.5458]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  85%|███████████████████████████████████████▎      | 891/1044 [05:25<00:56,  2.71it/s, acc_step=1/1, ce_loss_token=2.3557, perplexity_token=10.5456]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  85%|███████████████████████████████████████▎      | 892/1044 [05:25<00:55,  2.72it/s, acc_step=1/1, ce_loss_token=2.3556, perplexity_token=10.5443]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  86%|███████████████████████████████████████▎      | 893/1044 [05:26<00:54,  2.78it/s, acc_step=1/1, ce_loss_token=2.3555, perplexity_token=10.5430]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|███████████████████████████████████████▍      | 894/1044 [05:26<00:54,  2.74it/s, acc_step=1/1, ce_loss_token=2.3553, perplexity_token=10.5417]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|███████████████████████████████████████▍      | 895/1044 [05:26<00:54,  2.75it/s, acc_step=1/1, ce_loss_token=2.3552, perplexity_token=10.5405]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  86%|███████████████████████████████████████▍      | 896/1044 [05:27<00:55,  2.66it/s, acc_step=1/1, ce_loss_token=2.3551, perplexity_token=10.5390]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  86%|███████████████████████████████████████▌      | 897/1044 [05:27<00:53,  2.74it/s, acc_step=1/1, ce_loss_token=2.3550, perplexity_token=10.5378]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  86%|███████████████████████████████████████▌      | 898/1044 [05:27<00:53,  2.75it/s, acc_step=1/1, ce_loss_token=2.3549, perplexity_token=10.5366]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  86%|███████████████████████████████████████▌      | 899/1044 [05:28<00:52,  2.79it/s, acc_step=1/1, ce_loss_token=2.3548, perplexity_token=10.5359]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|███████████████████████████████████████▋      | 900/1044 [05:28<00:52,  2.75it/s, acc_step=1/1, ce_loss_token=2.3547, perplexity_token=10.5346]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  86%|███████████████████████████████████████▋      | 901/1044 [05:29<00:52,  2.74it/s, acc_step=1/1, ce_loss_token=2.3545, perplexity_token=10.5333]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  86%|███████████████████████████████████████▋      | 902/1044 [05:29<00:53,  2.64it/s, acc_step=1/1, ce_loss_token=2.3544, perplexity_token=10.5319]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  86%|███████████████████████████████████████▊      | 903/1044 [05:29<00:53,  2.62it/s, acc_step=1/1, ce_loss_token=2.3543, perplexity_token=10.5306]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  87%|███████████████████████████████████████▊      | 904/1044 [05:30<00:52,  2.67it/s, acc_step=1/1, ce_loss_token=2.3542, perplexity_token=10.5294]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  87%|███████████████████████████████████████▉      | 905/1044 [05:30<00:51,  2.70it/s, acc_step=1/1, ce_loss_token=2.3540, perplexity_token=10.5279]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  87%|███████████████████████████████████████▉      | 906/1044 [05:30<00:51,  2.67it/s, acc_step=1/1, ce_loss_token=2.3539, perplexity_token=10.5266]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  87%|███████████████████████████████████████▉      | 907/1044 [05:31<00:47,  2.86it/s, acc_step=1/1, ce_loss_token=2.3538, perplexity_token=10.5259]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████      | 908/1044 [05:31<00:47,  2.85it/s, acc_step=1/1, ce_loss_token=2.3537, perplexity_token=10.5245]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  87%|████████████████████████████████████████      | 909/1044 [05:31<00:47,  2.86it/s, acc_step=1/1, ce_loss_token=2.3536, perplexity_token=10.5231]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  87%|████████████████████████████████████████      | 910/1044 [05:32<00:49,  2.73it/s, acc_step=1/1, ce_loss_token=2.3534, perplexity_token=10.5218]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  87%|████████████████████████████████████████▏     | 911/1044 [05:32<00:51,  2.59it/s, acc_step=1/1, ce_loss_token=2.3533, perplexity_token=10.5205]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  87%|████████████████████████████████████████▏     | 912/1044 [05:33<00:48,  2.72it/s, acc_step=1/1, ce_loss_token=2.3532, perplexity_token=10.5196]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  87%|████████████████████████████████████████▏     | 913/1044 [05:33<00:48,  2.69it/s, acc_step=1/1, ce_loss_token=2.3531, perplexity_token=10.5182]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  88%|████████████████████████████████████████▎     | 914/1044 [05:33<00:46,  2.78it/s, acc_step=1/1, ce_loss_token=2.3531, perplexity_token=10.5180]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  88%|████████████████████████████████████████▎     | 915/1044 [05:34<00:45,  2.81it/s, acc_step=1/1, ce_loss_token=2.3530, perplexity_token=10.5168]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  88%|████████████████████████████████████████▎     | 916/1044 [05:34<00:42,  3.02it/s, acc_step=1/1, ce_loss_token=2.3529, perplexity_token=10.5165]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  88%|████████████████████████████████████████▍     | 917/1044 [05:34<00:44,  2.85it/s, acc_step=1/1, ce_loss_token=2.3528, perplexity_token=10.5154]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  88%|████████████████████████████████████████▍     | 918/1044 [05:35<00:45,  2.78it/s, acc_step=1/1, ce_loss_token=2.3527, perplexity_token=10.5141]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|████████████████████████████████████████▍     | 919/1044 [05:35<00:45,  2.75it/s, acc_step=1/1, ce_loss_token=2.3526, perplexity_token=10.5129]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  88%|████████████████████████████████████████▌     | 920/1044 [05:35<00:42,  2.94it/s, acc_step=1/1, ce_loss_token=2.3526, perplexity_token=10.5124]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  88%|████████████████████████████████████████▌     | 921/1044 [05:36<00:43,  2.81it/s, acc_step=1/1, ce_loss_token=2.3524, perplexity_token=10.5111]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  88%|████████████████████████████████████████▋     | 923/1044 [05:36<00:39,  3.05it/s, acc_step=1/1, ce_loss_token=2.3524, perplexity_token=10.5108]

torch.Size([256, 323, 35]) torch.Size([256, 323])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|████████████████████████████████████████▋     | 924/1044 [05:37<00:37,  3.17it/s, acc_step=1/1, ce_loss_token=2.3524, perplexity_token=10.5107]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  89%|████████████████████████████████████████▊     | 925/1044 [05:37<00:40,  2.95it/s, acc_step=1/1, ce_loss_token=2.3523, perplexity_token=10.5094]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|████████████████████████████████████████▊     | 926/1044 [05:37<00:38,  3.06it/s, acc_step=1/1, ce_loss_token=2.3522, perplexity_token=10.5091]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  89%|████████████████████████████████████████▊     | 927/1044 [05:38<00:41,  2.83it/s, acc_step=1/1, ce_loss_token=2.3521, perplexity_token=10.5080]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|████████████████████████████████████████▉     | 928/1044 [05:38<00:42,  2.75it/s, acc_step=1/1, ce_loss_token=2.3520, perplexity_token=10.5067]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  89%|████████████████████████████████████████▉     | 929/1044 [05:39<00:42,  2.74it/s, acc_step=1/1, ce_loss_token=2.3519, perplexity_token=10.5054]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  89%|████████████████████████████████████████▉     | 930/1044 [05:39<00:40,  2.83it/s, acc_step=1/1, ce_loss_token=2.3518, perplexity_token=10.5048]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  89%|█████████████████████████████████████████     | 931/1044 [05:39<00:40,  2.79it/s, acc_step=1/1, ce_loss_token=2.3517, perplexity_token=10.5035]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  89%|█████████████████████████████████████████     | 932/1044 [05:40<00:50,  2.23it/s, acc_step=1/1, ce_loss_token=2.3516, perplexity_token=10.5021]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  89%|█████████████████████████████████████████     | 933/1044 [05:40<00:47,  2.35it/s, acc_step=1/1, ce_loss_token=2.3515, perplexity_token=10.5010]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  89%|█████████████████████████████████████████▏    | 934/1044 [05:41<00:48,  2.28it/s, acc_step=1/1, ce_loss_token=2.3513, perplexity_token=10.4996]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  90%|█████████████████████████████████████████▏    | 935/1044 [05:41<00:42,  2.56it/s, acc_step=1/1, ce_loss_token=2.3513, perplexity_token=10.4991]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  90%|█████████████████████████████████████████▏    | 936/1044 [05:41<00:40,  2.66it/s, acc_step=1/1, ce_loss_token=2.3512, perplexity_token=10.4979]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  90%|█████████████████████████████████████████▎    | 937/1044 [05:42<00:39,  2.73it/s, acc_step=1/1, ce_loss_token=2.3511, perplexity_token=10.4967]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  90%|█████████████████████████████████████████▎    | 938/1044 [05:42<00:39,  2.70it/s, acc_step=1/1, ce_loss_token=2.3509, perplexity_token=10.4955]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  90%|█████████████████████████████████████████▎    | 939/1044 [05:42<00:36,  2.89it/s, acc_step=1/1, ce_loss_token=2.3509, perplexity_token=10.4954]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  90%|█████████████████████████████████████████▍    | 940/1044 [05:43<00:42,  2.45it/s, acc_step=1/1, ce_loss_token=2.3508, perplexity_token=10.4940]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  90%|█████████████████████████████████████████▍    | 941/1044 [05:43<00:45,  2.27it/s, acc_step=1/1, ce_loss_token=2.3507, perplexity_token=10.4928]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  90%|█████████████████████████████████████████▌    | 942/1044 [05:44<00:43,  2.36it/s, acc_step=1/1, ce_loss_token=2.3506, perplexity_token=10.4915]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  90%|█████████████████████████████████████████▌    | 943/1044 [05:44<00:43,  2.31it/s, acc_step=1/1, ce_loss_token=2.3504, perplexity_token=10.4902]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  90%|█████████████████████████████████████████▌    | 944/1044 [05:45<00:40,  2.49it/s, acc_step=1/1, ce_loss_token=2.3504, perplexity_token=10.4896]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  91%|█████████████████████████████████████████▋    | 945/1044 [05:45<00:38,  2.55it/s, acc_step=1/1, ce_loss_token=2.3503, perplexity_token=10.4884]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  91%|█████████████████████████████████████████▋    | 946/1044 [05:45<00:38,  2.56it/s, acc_step=1/1, ce_loss_token=2.3501, perplexity_token=10.4871]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  91%|█████████████████████████████████████████▋    | 947/1044 [05:46<00:37,  2.56it/s, acc_step=1/1, ce_loss_token=2.3500, perplexity_token=10.4858]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  91%|█████████████████████████████████████████▊    | 948/1044 [05:46<00:37,  2.56it/s, acc_step=1/1, ce_loss_token=2.3499, perplexity_token=10.4846]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  91%|█████████████████████████████████████████▊    | 949/1044 [05:46<00:34,  2.76it/s, acc_step=1/1, ce_loss_token=2.3499, perplexity_token=10.4842]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  91%|█████████████████████████████████████████▊    | 950/1044 [05:47<00:31,  3.01it/s, acc_step=1/1, ce_loss_token=2.3498, perplexity_token=10.4839]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  91%|█████████████████████████████████████████▉    | 951/1044 [05:47<00:32,  2.86it/s, acc_step=1/1, ce_loss_token=2.3497, perplexity_token=10.4824]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  91%|█████████████████████████████████████████▉    | 952/1044 [05:48<00:33,  2.79it/s, acc_step=1/1, ce_loss_token=2.3496, perplexity_token=10.4811]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  91%|█████████████████████████████████████████▉    | 953/1044 [05:48<00:33,  2.71it/s, acc_step=1/1, ce_loss_token=2.3495, perplexity_token=10.4799]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  91%|██████████████████████████████████████████    | 954/1044 [05:48<00:33,  2.67it/s, acc_step=1/1, ce_loss_token=2.3493, perplexity_token=10.4786]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  91%|██████████████████████████████████████████    | 955/1044 [05:49<00:31,  2.87it/s, acc_step=1/1, ce_loss_token=2.3493, perplexity_token=10.4778]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  92%|██████████████████████████████████████████    | 956/1044 [05:49<00:31,  2.79it/s, acc_step=1/1, ce_loss_token=2.3491, perplexity_token=10.4766]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  92%|██████████████████████████████████████████▏   | 957/1044 [05:49<00:29,  2.98it/s, acc_step=1/1, ce_loss_token=2.3491, perplexity_token=10.4764]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  92%|██████████████████████████████████████████▏   | 958/1044 [05:50<00:29,  2.92it/s, acc_step=1/1, ce_loss_token=2.3490, perplexity_token=10.4752]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  92%|██████████████████████████████████████████▎   | 959/1044 [05:50<00:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.3489, perplexity_token=10.4740]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  92%|██████████████████████████████████████████▎   | 960/1044 [05:50<00:29,  2.88it/s, acc_step=1/1, ce_loss_token=2.3488, perplexity_token=10.4727]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  92%|██████████████████████████████████████████▎   | 961/1044 [05:51<00:28,  2.89it/s, acc_step=1/1, ce_loss_token=2.3486, perplexity_token=10.4714]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  92%|██████████████████████████████████████████▍   | 962/1044 [05:51<00:29,  2.79it/s, acc_step=1/1, ce_loss_token=2.3485, perplexity_token=10.4701]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  92%|██████████████████████████████████████████▍   | 963/1044 [05:51<00:29,  2.79it/s, acc_step=1/1, ce_loss_token=2.3484, perplexity_token=10.4688]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  92%|██████████████████████████████████████████▍   | 964/1044 [05:52<00:27,  2.86it/s, acc_step=1/1, ce_loss_token=2.3483, perplexity_token=10.4680]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  92%|██████████████████████████████████████████▌   | 965/1044 [05:52<00:28,  2.82it/s, acc_step=1/1, ce_loss_token=2.3482, perplexity_token=10.4668]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  93%|██████████████████████████████████████████▌   | 966/1044 [05:53<00:28,  2.70it/s, acc_step=1/1, ce_loss_token=2.3481, perplexity_token=10.4656]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  93%|██████████████████████████████████████████▌   | 967/1044 [05:53<00:27,  2.83it/s, acc_step=1/1, ce_loss_token=2.3481, perplexity_token=10.4652]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  93%|██████████████████████████████████████████▋   | 968/1044 [05:53<00:26,  2.85it/s, acc_step=1/1, ce_loss_token=2.3479, perplexity_token=10.4638]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  93%|██████████████████████████████████████████▋   | 969/1044 [05:54<00:26,  2.82it/s, acc_step=1/1, ce_loss_token=2.3478, perplexity_token=10.4626]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  93%|██████████████████████████████████████████▋   | 970/1044 [05:54<00:26,  2.77it/s, acc_step=1/1, ce_loss_token=2.3477, perplexity_token=10.4612]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  93%|██████████████████████████████████████████▊   | 971/1044 [05:54<00:26,  2.77it/s, acc_step=1/1, ce_loss_token=2.3476, perplexity_token=10.4600]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  93%|██████████████████████████████████████████▊   | 972/1044 [05:55<00:26,  2.76it/s, acc_step=1/1, ce_loss_token=2.3474, perplexity_token=10.4588]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  93%|██████████████████████████████████████████▊   | 973/1044 [05:55<00:25,  2.75it/s, acc_step=1/1, ce_loss_token=2.3473, perplexity_token=10.4576]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  93%|██████████████████████████████████████████▉   | 974/1044 [05:55<00:26,  2.63it/s, acc_step=1/1, ce_loss_token=2.3472, perplexity_token=10.4563]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  93%|██████████████████████████████████████████▉   | 975/1044 [05:56<00:26,  2.58it/s, acc_step=1/1, ce_loss_token=2.3471, perplexity_token=10.4549]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  93%|███████████████████████████████████████████   | 976/1044 [05:56<00:24,  2.79it/s, acc_step=1/1, ce_loss_token=2.3470, perplexity_token=10.4543]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  94%|███████████████████████████████████████████   | 977/1044 [05:56<00:24,  2.77it/s, acc_step=1/1, ce_loss_token=2.3469, perplexity_token=10.4530]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  94%|███████████████████████████████████████████   | 978/1044 [05:57<00:24,  2.72it/s, acc_step=1/1, ce_loss_token=2.3468, perplexity_token=10.4516]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  94%|███████████████████████████████████████████▏  | 980/1044 [05:57<00:20,  3.17it/s, acc_step=1/1, ce_loss_token=2.3466, perplexity_token=10.4505]

torch.Size([256, 293, 35]) torch.Size([256, 293])
torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  94%|███████████████████████████████████████████▏  | 981/1044 [05:58<00:21,  3.00it/s, acc_step=1/1, ce_loss_token=2.3465, perplexity_token=10.4490]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  94%|███████████████████████████████████████████▎  | 982/1044 [05:58<00:21,  2.93it/s, acc_step=1/1, ce_loss_token=2.3464, perplexity_token=10.4477]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  94%|███████████████████████████████████████████▎  | 983/1044 [05:58<00:20,  2.99it/s, acc_step=1/1, ce_loss_token=2.3464, perplexity_token=10.4475]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  94%|███████████████████████████████████████████▎  | 984/1044 [05:59<00:21,  2.73it/s, acc_step=1/1, ce_loss_token=2.3462, perplexity_token=10.4462]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  94%|███████████████████████████████████████████▍  | 985/1044 [05:59<00:21,  2.72it/s, acc_step=1/1, ce_loss_token=2.3461, perplexity_token=10.4450]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  94%|███████████████████████████████████████████▍  | 986/1044 [06:00<00:21,  2.69it/s, acc_step=1/1, ce_loss_token=2.3460, perplexity_token=10.4438]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|███████████████████████████████████████████▍  | 987/1044 [06:00<00:21,  2.65it/s, acc_step=1/1, ce_loss_token=2.3459, perplexity_token=10.4426]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  95%|███████████████████████████████████████████▌  | 988/1044 [06:00<00:21,  2.58it/s, acc_step=1/1, ce_loss_token=2.3458, perplexity_token=10.4413]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|███████████████████████████████████████████▌  | 989/1044 [06:01<00:21,  2.60it/s, acc_step=1/1, ce_loss_token=2.3457, perplexity_token=10.4401]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|███████████████████████████████████████████▋  | 991/1044 [06:01<00:17,  2.97it/s, acc_step=1/1, ce_loss_token=2.3456, perplexity_token=10.4393]

torch.Size([256, 324, 35]) torch.Size([256, 324])
torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  95%|███████████████████████████████████████████▋  | 992/1044 [06:02<00:17,  2.90it/s, acc_step=1/1, ce_loss_token=2.3455, perplexity_token=10.4380]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  95%|███████████████████████████████████████████▊  | 993/1044 [06:02<00:17,  2.88it/s, acc_step=1/1, ce_loss_token=2.3453, perplexity_token=10.4368]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|███████████████████████████████████████████▊  | 994/1044 [06:03<00:17,  2.84it/s, acc_step=1/1, ce_loss_token=2.3452, perplexity_token=10.4354]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  95%|███████████████████████████████████████████▊  | 995/1044 [06:03<00:19,  2.57it/s, acc_step=1/1, ce_loss_token=2.3451, perplexity_token=10.4343]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  95%|███████████████████████████████████████████▉  | 996/1044 [06:03<00:18,  2.58it/s, acc_step=1/1, ce_loss_token=2.3450, perplexity_token=10.4330]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  95%|███████████████████████████████████████████▉  | 997/1044 [06:04<00:18,  2.58it/s, acc_step=1/1, ce_loss_token=2.3449, perplexity_token=10.4318]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  96%|███████████████████████████████████████████▉  | 998/1044 [06:04<00:17,  2.58it/s, acc_step=1/1, ce_loss_token=2.3447, perplexity_token=10.4306]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  96%|████████████████████████████████████████████  | 999/1044 [06:05<00:17,  2.61it/s, acc_step=1/1, ce_loss_token=2.3446, perplexity_token=10.4294]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  96%|███████████████████████████████████████████  | 1000/1044 [06:05<00:15,  2.82it/s, acc_step=1/1, ce_loss_token=2.3446, perplexity_token=10.4289]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1001/1044 [06:05<00:16,  2.66it/s, acc_step=1/1, ce_loss_token=2.3445, perplexity_token=10.4278]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1002/1044 [06:06<00:15,  2.75it/s, acc_step=1/1, ce_loss_token=2.3443, perplexity_token=10.4265]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  96%|███████████████████████████████████████████▏ | 1003/1044 [06:06<00:17,  2.37it/s, acc_step=1/1, ce_loss_token=2.3442, perplexity_token=10.4253]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1004/1044 [06:07<00:16,  2.43it/s, acc_step=1/1, ce_loss_token=2.3441, perplexity_token=10.4242]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1005/1044 [06:07<00:16,  2.40it/s, acc_step=1/1, ce_loss_token=2.3440, perplexity_token=10.4229]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  96%|███████████████████████████████████████████▎ | 1006/1044 [06:07<00:14,  2.57it/s, acc_step=1/1, ce_loss_token=2.3439, perplexity_token=10.4217]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|███████████████████████████████████████████▍ | 1007/1044 [06:08<00:14,  2.56it/s, acc_step=1/1, ce_loss_token=2.3438, perplexity_token=10.4205]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  97%|███████████████████████████████████████████▍ | 1008/1044 [06:08<00:13,  2.66it/s, acc_step=1/1, ce_loss_token=2.3437, perplexity_token=10.4193]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  97%|███████████████████████████████████████████▍ | 1009/1044 [06:08<00:13,  2.58it/s, acc_step=1/1, ce_loss_token=2.3435, perplexity_token=10.4180]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1010/1044 [06:09<00:13,  2.50it/s, acc_step=1/1, ce_loss_token=2.3434, perplexity_token=10.4167]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1011/1044 [06:09<00:12,  2.56it/s, acc_step=1/1, ce_loss_token=2.3433, perplexity_token=10.4154]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  97%|███████████████████████████████████████████▌ | 1012/1044 [06:10<00:12,  2.62it/s, acc_step=1/1, ce_loss_token=2.3432, perplexity_token=10.4141]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  97%|███████████████████████████████████████████▋ | 1013/1044 [06:10<00:11,  2.64it/s, acc_step=1/1, ce_loss_token=2.3430, perplexity_token=10.4129]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  97%|███████████████████████████████████████████▋ | 1014/1044 [06:10<00:11,  2.59it/s, acc_step=1/1, ce_loss_token=2.3429, perplexity_token=10.4115]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1015/1044 [06:11<00:11,  2.62it/s, acc_step=1/1, ce_loss_token=2.3428, perplexity_token=10.4102]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1016/1044 [06:11<00:10,  2.79it/s, acc_step=1/1, ce_loss_token=2.3428, perplexity_token=10.4100]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  97%|███████████████████████████████████████████▊ | 1017/1044 [06:11<00:10,  2.64it/s, acc_step=1/1, ce_loss_token=2.3426, perplexity_token=10.4087]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1018/1044 [06:12<00:09,  2.68it/s, acc_step=1/1, ce_loss_token=2.3425, perplexity_token=10.4074]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1019/1044 [06:12<00:09,  2.75it/s, acc_step=1/1, ce_loss_token=2.3424, perplexity_token=10.4060]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  98%|███████████████████████████████████████████▉ | 1020/1044 [06:13<00:08,  2.76it/s, acc_step=1/1, ce_loss_token=2.3423, perplexity_token=10.4048]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  98%|████████████████████████████████████████████ | 1021/1044 [06:13<00:08,  2.72it/s, acc_step=1/1, ce_loss_token=2.3422, perplexity_token=10.4036]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  98%|████████████████████████████████████████████ | 1022/1044 [06:13<00:08,  2.73it/s, acc_step=1/1, ce_loss_token=2.3420, perplexity_token=10.4024]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|████████████████████████████████████████████ | 1023/1044 [06:14<00:07,  2.72it/s, acc_step=1/1, ce_loss_token=2.3419, perplexity_token=10.4012]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|████████████████████████████████████████████▏| 1024/1044 [06:14<00:07,  2.67it/s, acc_step=1/1, ce_loss_token=2.3418, perplexity_token=10.4000]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  98%|████████████████████████████████████████████▏| 1025/1044 [06:14<00:07,  2.66it/s, acc_step=1/1, ce_loss_token=2.3417, perplexity_token=10.3988]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  98%|████████████████████████████████████████████▏| 1026/1044 [06:15<00:06,  2.68it/s, acc_step=1/1, ce_loss_token=2.3416, perplexity_token=10.3974]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|████████████████████████████████████████████▎| 1028/1044 [06:15<00:04,  3.31it/s, acc_step=1/1, ce_loss_token=2.3416, perplexity_token=10.3982]

torch.Size([256, 314, 35]) torch.Size([256, 314])
torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  99%|████████████████████████████████████████████▎| 1029/1044 [06:16<00:05,  2.93it/s, acc_step=1/1, ce_loss_token=2.3415, perplexity_token=10.3969]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  99%|████████████████████████████████████████████▍| 1030/1044 [06:16<00:04,  3.07it/s, acc_step=1/1, ce_loss_token=2.3415, perplexity_token=10.3963]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  99%|████████████████████████████████████████████▍| 1031/1044 [06:16<00:04,  3.20it/s, acc_step=1/1, ce_loss_token=2.3414, perplexity_token=10.3959]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  99%|████████████████████████████████████████████▍| 1032/1044 [06:17<00:03,  3.12it/s, acc_step=1/1, ce_loss_token=2.3413, perplexity_token=10.3945]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  99%|████████████████████████████████████████████▌| 1033/1044 [06:17<00:03,  3.00it/s, acc_step=1/1, ce_loss_token=2.3412, perplexity_token=10.3932]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  99%|████████████████████████████████████████████▌| 1034/1044 [06:17<00:03,  2.87it/s, acc_step=1/1, ce_loss_token=2.3410, perplexity_token=10.3920]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  99%|████████████████████████████████████████████▌| 1035/1044 [06:18<00:02,  3.05it/s, acc_step=1/1, ce_loss_token=2.3410, perplexity_token=10.3919]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  99%|████████████████████████████████████████████▋| 1036/1044 [06:18<00:02,  2.95it/s, acc_step=1/1, ce_loss_token=2.3409, perplexity_token=10.3907]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  99%|████████████████████████████████████████████▋| 1037/1044 [06:18<00:02,  2.87it/s, acc_step=1/1, ce_loss_token=2.3408, perplexity_token=10.3895]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  99%|████████████████████████████████████████████▋| 1038/1044 [06:19<00:02,  2.79it/s, acc_step=1/1, ce_loss_token=2.3407, perplexity_token=10.3884]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]: 100%|████████████████████████████████████████████▊| 1039/1044 [06:19<00:01,  2.78it/s, acc_step=1/1, ce_loss_token=2.3406, perplexity_token=10.3872]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]: 100%|████████████████████████████████████████████▊| 1040/1044 [06:20<00:01,  2.69it/s, acc_step=1/1, ce_loss_token=2.3405, perplexity_token=10.3860]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]: 100%|████████████████████████████████████████████▊| 1041/1044 [06:20<00:01,  2.57it/s, acc_step=1/1, ce_loss_token=2.3403, perplexity_token=10.3848]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]: 100%|████████████████████████████████████████████▉| 1042/1044 [06:20<00:00,  2.75it/s, acc_step=1/1, ce_loss_token=2.3403, perplexity_token=10.3841]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]: 100%|█████████████████████████████████████████████| 1044/1044 [06:21<00:00,  3.30it/s, acc_step=1/1, ce_loss_token=2.3402, perplexity_token=10.3828]

torch.Size([170, 282, 35]) torch.Size([170, 282])



Generating with greedy search...

📊 Metrics (Epoch 1):
├── TRAIN:
│   ├── ce_loss_char: 2.3402
│   ├── ce_loss_token: 2.3402
│   ├── perplexity_char: 10.3828
│   └── perplexity_token: 10.3828
├── VAL:
│   ├── ce_loss_char: 2.1110
│   ├── ce_loss_token: 2.1110
│   ├── perplexity_char: 8.2567
│   └── perplexity_token: 8.2567
└── TRAINING:
    └── learning_rate: 0.000046


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:27,  2.06it/s, acc_step=1/1, ce_loss_token=2.2142, perplexity_token=9.1545]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:09,  2.43it/s, acc_step=1/1, ce_loss_token=2.2195, perplexity_token=9.2027]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<06:45,  2.57it/s, acc_step=1/1, ce_loss_token=2.2195, perplexity_token=9.2031]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:39,  2.60it/s, acc_step=1/1, ce_loss_token=2.2167, perplexity_token=9.1766]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   0%|▏                                                | 5/1044 [00:01<06:50,  2.53it/s, acc_step=1/1, ce_loss_token=2.2156, perplexity_token=9.1666]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:51,  2.52it/s, acc_step=1/1, ce_loss_token=2.2171, perplexity_token=9.1805]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:52,  2.51it/s, acc_step=1/1, ce_loss_token=2.2167, perplexity_token=9.1770]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<06:43,  2.56it/s, acc_step=1/1, ce_loss_token=2.2162, perplexity_token=9.1720]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<06:47,  2.54it/s, acc_step=1/1, ce_loss_token=2.2154, perplexity_token=9.1649]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:43,  2.57it/s, acc_step=1/1, ce_loss_token=2.2151, perplexity_token=9.1623]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<06:47,  2.54it/s, acc_step=1/1, ce_loss_token=2.2153, perplexity_token=9.1641]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:52,  2.50it/s, acc_step=1/1, ce_loss_token=2.2151, perplexity_token=9.1625]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:16,  2.74it/s, acc_step=1/1, ce_loss_token=2.2214, perplexity_token=9.2204]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<05:51,  2.93it/s, acc_step=1/1, ce_loss_token=2.2262, perplexity_token=9.2645]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<05:52,  2.92it/s, acc_step=1/1, ce_loss_token=2.2258, perplexity_token=9.2613]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<05:51,  2.92it/s, acc_step=1/1, ce_loss_token=2.2248, perplexity_token=9.2519]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<05:56,  2.88it/s, acc_step=1/1, ce_loss_token=2.2245, perplexity_token=9.2488]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<05:34,  3.06it/s, acc_step=1/1, ce_loss_token=2.2272, perplexity_token=9.2738]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:   2%|▊                                               | 19/1044 [00:07<06:02,  2.83it/s, acc_step=1/1, ce_loss_token=2.2270, perplexity_token=9.2722]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<05:40,  3.01it/s, acc_step=1/1, ce_loss_token=2.2293, perplexity_token=9.2937]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<05:53,  2.89it/s, acc_step=1/1, ce_loss_token=2.2286, perplexity_token=9.2871]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   2%|█                                               | 22/1044 [00:08<05:58,  2.85it/s, acc_step=1/1, ce_loss_token=2.2279, perplexity_token=9.2804]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   2%|█                                               | 23/1044 [00:08<05:54,  2.88it/s, acc_step=1/1, ce_loss_token=2.2271, perplexity_token=9.2728]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:12,  2.74it/s, acc_step=1/1, ce_loss_token=2.2260, perplexity_token=9.2624]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:23,  2.66it/s, acc_step=1/1, ce_loss_token=2.2254, perplexity_token=9.2568]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:24,  2.65it/s, acc_step=1/1, ce_loss_token=2.2247, perplexity_token=9.2508]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:20,  2.67it/s, acc_step=1/1, ce_loss_token=2.2245, perplexity_token=9.2485]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:15,  2.70it/s, acc_step=1/1, ce_loss_token=2.2240, perplexity_token=9.2440]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<06:12,  2.72it/s, acc_step=1/1, ce_loss_token=2.2226, perplexity_token=9.2315]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:08,  2.75it/s, acc_step=1/1, ce_loss_token=2.2224, perplexity_token=9.2294]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<07:38,  2.21it/s, acc_step=1/1, ce_loss_token=2.2216, perplexity_token=9.2223]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<07:18,  2.31it/s, acc_step=1/1, ce_loss_token=2.2213, perplexity_token=9.2190]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<07:47,  2.16it/s, acc_step=1/1, ce_loss_token=2.2207, perplexity_token=9.2141]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   3%|█▌                                              | 34/1044 [00:13<07:21,  2.29it/s, acc_step=1/1, ce_loss_token=2.2204, perplexity_token=9.2109]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<07:11,  2.34it/s, acc_step=1/1, ce_loss_token=2.2203, perplexity_token=9.2105]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<06:53,  2.44it/s, acc_step=1/1, ce_loss_token=2.2199, perplexity_token=9.2066]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   4%|█▋                                              | 37/1044 [00:14<06:40,  2.51it/s, acc_step=1/1, ce_loss_token=2.2197, perplexity_token=9.2045]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:48,  2.47it/s, acc_step=1/1, ce_loss_token=2.2193, perplexity_token=9.2006]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<05:46,  2.90it/s, acc_step=1/1, ce_loss_token=2.2236, perplexity_token=9.2403]

torch.Size([256, 289, 35]) torch.Size([256, 289])
torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:04,  2.75it/s, acc_step=1/1, ce_loss_token=2.2232, perplexity_token=9.2369]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   4%|█▉                                              | 42/1044 [00:16<06:19,  2.64it/s, acc_step=1/1, ce_loss_token=2.2229, perplexity_token=9.2338]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<06:11,  2.69it/s, acc_step=1/1, ce_loss_token=2.2224, perplexity_token=9.2292]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   4%|██                                              | 44/1044 [00:16<06:20,  2.63it/s, acc_step=1/1, ce_loss_token=2.2223, perplexity_token=9.2281]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   4%|██                                              | 45/1044 [00:17<06:22,  2.61it/s, acc_step=1/1, ce_loss_token=2.2219, perplexity_token=9.2247]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:18,  2.64it/s, acc_step=1/1, ce_loss_token=2.2216, perplexity_token=9.2224]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<06:11,  2.69it/s, acc_step=1/1, ce_loss_token=2.2213, perplexity_token=9.2195]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<06:14,  2.66it/s, acc_step=1/1, ce_loss_token=2.2210, perplexity_token=9.2169]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<05:05,  3.25it/s, acc_step=1/1, ce_loss_token=2.2262, perplexity_token=9.2646]

torch.Size([256, 306, 35]) torch.Size([256, 306])
torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<05:15,  3.15it/s, acc_step=1/1, ce_loss_token=2.2258, perplexity_token=9.2610]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<05:28,  3.02it/s, acc_step=1/1, ce_loss_token=2.2255, perplexity_token=9.2578]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:   5%|██▍                                             | 53/1044 [00:20<06:37,  2.49it/s, acc_step=1/1, ce_loss_token=2.2249, perplexity_token=9.2523]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<06:04,  2.72it/s, acc_step=1/1, ce_loss_token=2.2260, perplexity_token=9.2627]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<06:12,  2.66it/s, acc_step=1/1, ce_loss_token=2.2256, perplexity_token=9.2591]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:   5%|██▌                                             | 56/1044 [00:21<06:57,  2.36it/s, acc_step=1/1, ce_loss_token=2.2252, perplexity_token=9.2557]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<06:33,  2.51it/s, acc_step=1/1, ce_loss_token=2.2247, perplexity_token=9.2509]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<06:24,  2.57it/s, acc_step=1/1, ce_loss_token=2.2246, perplexity_token=9.2498]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   6%|██▋                                             | 59/1044 [00:22<06:23,  2.57it/s, acc_step=1/1, ce_loss_token=2.2242, perplexity_token=9.2460]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:44,  2.85it/s, acc_step=1/1, ce_loss_token=2.2248, perplexity_token=9.2519]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<05:44,  2.85it/s, acc_step=1/1, ce_loss_token=2.2245, perplexity_token=9.2486]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<05:35,  2.93it/s, acc_step=1/1, ce_loss_token=2.2260, perplexity_token=9.2626]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<05:44,  2.85it/s, acc_step=1/1, ce_loss_token=2.2255, perplexity_token=9.2584]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   6%|██▉                                             | 64/1044 [00:24<05:47,  2.82it/s, acc_step=1/1, ce_loss_token=2.2251, perplexity_token=9.2548]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   6%|██▉                                             | 65/1044 [00:24<05:59,  2.73it/s, acc_step=1/1, ce_loss_token=2.2248, perplexity_token=9.2515]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   6%|███                                             | 66/1044 [00:24<06:13,  2.62it/s, acc_step=1/1, ce_loss_token=2.2244, perplexity_token=9.2483]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   6%|███                                             | 67/1044 [00:25<06:21,  2.56it/s, acc_step=1/1, ce_loss_token=2.2241, perplexity_token=9.2453]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:   7%|███▏                                            | 68/1044 [00:25<06:43,  2.42it/s, acc_step=1/1, ce_loss_token=2.2238, perplexity_token=9.2420]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▏                                            | 69/1044 [00:26<06:32,  2.48it/s, acc_step=1/1, ce_loss_token=2.2235, perplexity_token=9.2396]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<06:18,  2.58it/s, acc_step=1/1, ce_loss_token=2.2232, perplexity_token=9.2364]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<06:10,  2.63it/s, acc_step=1/1, ce_loss_token=2.2230, perplexity_token=9.2349]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   7%|███▎                                            | 72/1044 [00:27<06:19,  2.56it/s, acc_step=1/1, ce_loss_token=2.2226, perplexity_token=9.2313]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<06:30,  2.49it/s, acc_step=1/1, ce_loss_token=2.2223, perplexity_token=9.2286]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   7%|███▍                                            | 74/1044 [00:28<06:33,  2.46it/s, acc_step=1/1, ce_loss_token=2.2220, perplexity_token=9.2261]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   7%|███▍                                            | 75/1044 [00:28<06:16,  2.57it/s, acc_step=1/1, ce_loss_token=2.2219, perplexity_token=9.2252]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<06:17,  2.57it/s, acc_step=1/1, ce_loss_token=2.2217, perplexity_token=9.2226]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   7%|███▌                                            | 77/1044 [00:29<06:15,  2.57it/s, acc_step=1/1, ce_loss_token=2.2214, perplexity_token=9.2202]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<06:31,  2.47it/s, acc_step=1/1, ce_loss_token=2.2212, perplexity_token=9.2180]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   8%|███▋                                            | 79/1044 [00:30<06:43,  2.39it/s, acc_step=1/1, ce_loss_token=2.2209, perplexity_token=9.2159]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   8%|███▋                                            | 80/1044 [00:30<06:12,  2.59it/s, acc_step=1/1, ce_loss_token=2.2216, perplexity_token=9.2217]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<05:45,  2.78it/s, acc_step=1/1, ce_loss_token=2.2225, perplexity_token=9.2302]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<05:22,  2.98it/s, acc_step=1/1, ce_loss_token=2.2233, perplexity_token=9.2374]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<05:33,  2.88it/s, acc_step=1/1, ce_loss_token=2.2230, perplexity_token=9.2348]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<05:42,  2.80it/s, acc_step=1/1, ce_loss_token=2.2228, perplexity_token=9.2329]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   8%|███▉                                            | 85/1044 [00:32<06:00,  2.66it/s, acc_step=1/1, ce_loss_token=2.2225, perplexity_token=9.2308]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<06:45,  2.36it/s, acc_step=1/1, ce_loss_token=2.2223, perplexity_token=9.2288]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   8%|████                                            | 87/1044 [00:33<06:26,  2.47it/s, acc_step=1/1, ce_loss_token=2.2220, perplexity_token=9.2258]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   8%|████                                            | 88/1044 [00:33<05:52,  2.71it/s, acc_step=1/1, ce_loss_token=2.2227, perplexity_token=9.2320]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   9%|████                                            | 89/1044 [00:33<05:46,  2.76it/s, acc_step=1/1, ce_loss_token=2.2224, perplexity_token=9.2294]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   9%|████▏                                           | 90/1044 [00:33<05:22,  2.96it/s, acc_step=1/1, ce_loss_token=2.2228, perplexity_token=9.2334]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   9%|████▏                                           | 91/1044 [00:34<05:20,  2.97it/s, acc_step=1/1, ce_loss_token=2.2225, perplexity_token=9.2300]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   9%|████▏                                           | 92/1044 [00:34<05:30,  2.88it/s, acc_step=1/1, ce_loss_token=2.2221, perplexity_token=9.2265]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   9%|████▎                                           | 93/1044 [00:35<05:40,  2.79it/s, acc_step=1/1, ce_loss_token=2.2216, perplexity_token=9.2223]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   9%|████▎                                           | 94/1044 [00:35<05:39,  2.80it/s, acc_step=1/1, ce_loss_token=2.2214, perplexity_token=9.2201]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<05:45,  2.75it/s, acc_step=1/1, ce_loss_token=2.2211, perplexity_token=9.2179]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   9%|████▍                                           | 96/1044 [00:36<05:29,  2.88it/s, acc_step=1/1, ce_loss_token=2.2214, perplexity_token=9.2205]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:33,  2.84it/s, acc_step=1/1, ce_loss_token=2.2211, perplexity_token=9.2177]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<05:31,  2.85it/s, acc_step=1/1, ce_loss_token=2.2207, perplexity_token=9.2140]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   9%|████▌                                           | 99/1044 [00:37<05:41,  2.76it/s, acc_step=1/1, ce_loss_token=2.2204, perplexity_token=9.2110]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<06:02,  2.61it/s, acc_step=1/1, ce_loss_token=2.2202, perplexity_token=9.2092]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  10%|████▌                                          | 101/1044 [00:38<05:59,  2.62it/s, acc_step=1/1, ce_loss_token=2.2200, perplexity_token=9.2069]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<05:55,  2.65it/s, acc_step=1/1, ce_loss_token=2.2197, perplexity_token=9.2048]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<05:49,  2.70it/s, acc_step=1/1, ce_loss_token=2.2195, perplexity_token=9.2025]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  10%|████▋                                          | 104/1044 [00:39<06:00,  2.60it/s, acc_step=1/1, ce_loss_token=2.2193, perplexity_token=9.2004]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<06:00,  2.60it/s, acc_step=1/1, ce_loss_token=2.2190, perplexity_token=9.1980]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<05:52,  2.66it/s, acc_step=1/1, ce_loss_token=2.2187, perplexity_token=9.1954]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  10%|████▊                                          | 107/1044 [00:40<06:03,  2.58it/s, acc_step=1/1, ce_loss_token=2.2185, perplexity_token=9.1933]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<06:06,  2.56it/s, acc_step=1/1, ce_loss_token=2.2182, perplexity_token=9.1909]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  10%|████▉                                          | 109/1044 [00:41<06:00,  2.59it/s, acc_step=1/1, ce_loss_token=2.2180, perplexity_token=9.1892]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  11%|████▉                                          | 110/1044 [00:41<06:02,  2.57it/s, acc_step=1/1, ce_loss_token=2.2177, perplexity_token=9.1862]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:56,  2.62it/s, acc_step=1/1, ce_loss_token=2.2175, perplexity_token=9.1848]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  11%|█████                                          | 112/1044 [00:42<05:59,  2.60it/s, acc_step=1/1, ce_loss_token=2.2173, perplexity_token=9.1828]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<05:49,  2.66it/s, acc_step=1/1, ce_loss_token=2.2170, perplexity_token=9.1798]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<05:40,  2.73it/s, acc_step=1/1, ce_loss_token=2.2168, perplexity_token=9.1783]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:43<05:58,  2.59it/s, acc_step=1/1, ce_loss_token=2.2165, perplexity_token=9.1756]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<05:52,  2.63it/s, acc_step=1/1, ce_loss_token=2.2163, perplexity_token=9.1731]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:44<05:45,  2.68it/s, acc_step=1/1, ce_loss_token=2.2161, perplexity_token=9.1716]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<05:42,  2.70it/s, acc_step=1/1, ce_loss_token=2.2160, perplexity_token=9.1704]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:21,  2.88it/s, acc_step=1/1, ce_loss_token=2.2165, perplexity_token=9.1756]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:45<05:36,  2.75it/s, acc_step=1/1, ce_loss_token=2.2164, perplexity_token=9.1740]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:45<05:41,  2.70it/s, acc_step=1/1, ce_loss_token=2.2162, perplexity_token=9.1728]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:16,  2.92it/s, acc_step=1/1, ce_loss_token=2.2166, perplexity_token=9.1762]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:46<05:10,  2.96it/s, acc_step=1/1, ce_loss_token=2.2164, perplexity_token=9.1740]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:47,  2.65it/s, acc_step=1/1, ce_loss_token=2.2161, perplexity_token=9.1719]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:44,  2.67it/s, acc_step=1/1, ce_loss_token=2.2160, perplexity_token=9.1703]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:47<06:01,  2.54it/s, acc_step=1/1, ce_loss_token=2.2157, perplexity_token=9.1683]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:52,  2.60it/s, acc_step=1/1, ce_loss_token=2.2155, perplexity_token=9.1660]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:48<05:21,  2.85it/s, acc_step=1/1, ce_loss_token=2.2159, perplexity_token=9.1692]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:48<05:25,  2.81it/s, acc_step=1/1, ce_loss_token=2.2156, perplexity_token=9.1672]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<05:32,  2.75it/s, acc_step=1/1, ce_loss_token=2.2154, perplexity_token=9.1649]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:49<05:26,  2.80it/s, acc_step=1/1, ce_loss_token=2.2152, perplexity_token=9.1628]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:49<05:30,  2.76it/s, acc_step=1/1, ce_loss_token=2.2149, perplexity_token=9.1605]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:34,  2.72it/s, acc_step=1/1, ce_loss_token=2.2148, perplexity_token=9.1594]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  13%|██████                                         | 135/1044 [00:50<04:29,  3.37it/s, acc_step=1/1, ce_loss_token=2.2167, perplexity_token=9.1768]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<04:48,  3.15it/s, acc_step=1/1, ce_loss_token=2.2165, perplexity_token=9.1754]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:51<04:57,  3.04it/s, acc_step=1/1, ce_loss_token=2.2164, perplexity_token=9.1738]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:51<04:59,  3.03it/s, acc_step=1/1, ce_loss_token=2.2168, perplexity_token=9.1775]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<05:03,  2.98it/s, acc_step=1/1, ce_loss_token=2.2166, perplexity_token=9.1757]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:52<05:20,  2.82it/s, acc_step=1/1, ce_loss_token=2.2163, perplexity_token=9.1734]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:25,  2.78it/s, acc_step=1/1, ce_loss_token=2.2161, perplexity_token=9.1714]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:42,  2.63it/s, acc_step=1/1, ce_loss_token=2.2159, perplexity_token=9.1699]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:51,  2.56it/s, acc_step=1/1, ce_loss_token=2.2156, perplexity_token=9.1673]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:46,  2.60it/s, acc_step=1/1, ce_loss_token=2.2155, perplexity_token=9.1657]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:54<05:44,  2.61it/s, acc_step=1/1, ce_loss_token=2.2153, perplexity_token=9.1645]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:54<05:58,  2.50it/s, acc_step=1/1, ce_loss_token=2.2151, perplexity_token=9.1628]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<06:01,  2.48it/s, acc_step=1/1, ce_loss_token=2.2150, perplexity_token=9.1612]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<06:01,  2.48it/s, acc_step=1/1, ce_loss_token=2.2148, perplexity_token=9.1600]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:55<05:49,  2.56it/s, acc_step=1/1, ce_loss_token=2.2146, perplexity_token=9.1581]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:56<05:59,  2.49it/s, acc_step=1/1, ce_loss_token=2.2144, perplexity_token=9.1563]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:28,  2.72it/s, acc_step=1/1, ce_loss_token=2.2148, perplexity_token=9.1596]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<05:20,  2.79it/s, acc_step=1/1, ce_loss_token=2.2146, perplexity_token=9.1576]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<05:04,  2.93it/s, acc_step=1/1, ce_loss_token=2.2149, perplexity_token=9.1601]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:57<05:15,  2.82it/s, acc_step=1/1, ce_loss_token=2.2147, perplexity_token=9.1589]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<05:20,  2.77it/s, acc_step=1/1, ce_loss_token=2.2145, perplexity_token=9.1566]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:17,  2.80it/s, acc_step=1/1, ce_loss_token=2.2143, perplexity_token=9.1551]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  15%|███████                                        | 157/1044 [00:58<05:20,  2.77it/s, acc_step=1/1, ce_loss_token=2.2140, perplexity_token=9.1527]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  15%|███████                                        | 158/1044 [00:59<05:39,  2.61it/s, acc_step=1/1, ce_loss_token=2.2138, perplexity_token=9.1507]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<05:36,  2.63it/s, acc_step=1/1, ce_loss_token=2.2137, perplexity_token=9.1494]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:59<05:31,  2.67it/s, acc_step=1/1, ce_loss_token=2.2134, perplexity_token=9.1470]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  15%|███████▏                                       | 161/1044 [01:00<05:27,  2.70it/s, acc_step=1/1, ce_loss_token=2.2132, perplexity_token=9.1453]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▎                                       | 162/1044 [01:00<05:30,  2.67it/s, acc_step=1/1, ce_loss_token=2.2130, perplexity_token=9.1432]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:00<05:22,  2.73it/s, acc_step=1/1, ce_loss_token=2.2129, perplexity_token=9.1421]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<04:58,  2.95it/s, acc_step=1/1, ce_loss_token=2.2133, perplexity_token=9.1455]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:02,  2.91it/s, acc_step=1/1, ce_loss_token=2.2130, perplexity_token=9.1435]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:01<05:12,  2.81it/s, acc_step=1/1, ce_loss_token=2.2128, perplexity_token=9.1414]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<05:13,  2.79it/s, acc_step=1/1, ce_loss_token=2.2126, perplexity_token=9.1394]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<05:29,  2.66it/s, acc_step=1/1, ce_loss_token=2.2124, perplexity_token=9.1375]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:03<05:26,  2.68it/s, acc_step=1/1, ce_loss_token=2.2122, perplexity_token=9.1356]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<05:31,  2.64it/s, acc_step=1/1, ce_loss_token=2.2120, perplexity_token=9.1341]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:03<05:30,  2.64it/s, acc_step=1/1, ce_loss_token=2.2118, perplexity_token=9.1326]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<05:34,  2.61it/s, acc_step=1/1, ce_loss_token=2.2117, perplexity_token=9.1311]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:04<05:38,  2.57it/s, acc_step=1/1, ce_loss_token=2.2114, perplexity_token=9.1287]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:05<06:00,  2.41it/s, acc_step=1/1, ce_loss_token=2.2120, perplexity_token=9.1343]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<05:51,  2.47it/s, acc_step=1/1, ce_loss_token=2.2119, perplexity_token=9.1327]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:05<05:42,  2.54it/s, acc_step=1/1, ce_loss_token=2.2117, perplexity_token=9.1310]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:06<05:41,  2.54it/s, acc_step=1/1, ce_loss_token=2.2115, perplexity_token=9.1292]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<05:28,  2.64it/s, acc_step=1/1, ce_loss_token=2.2113, perplexity_token=9.1280]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  17%|████████                                       | 179/1044 [01:06<05:34,  2.59it/s, acc_step=1/1, ce_loss_token=2.2111, perplexity_token=9.1258]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  17%|████████                                       | 180/1044 [01:07<05:34,  2.58it/s, acc_step=1/1, ce_loss_token=2.2109, perplexity_token=9.1241]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<05:31,  2.60it/s, acc_step=1/1, ce_loss_token=2.2108, perplexity_token=9.1226]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:08<05:09,  2.79it/s, acc_step=1/1, ce_loss_token=2.2110, perplexity_token=9.1244]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:08<05:12,  2.75it/s, acc_step=1/1, ce_loss_token=2.2107, perplexity_token=9.1223]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:08<05:15,  2.72it/s, acc_step=1/1, ce_loss_token=2.2105, perplexity_token=9.1206]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:09<05:15,  2.73it/s, acc_step=1/1, ce_loss_token=2.2104, perplexity_token=9.1195]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:09<05:14,  2.73it/s, acc_step=1/1, ce_loss_token=2.2103, perplexity_token=9.1186]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:09<05:29,  2.60it/s, acc_step=1/1, ce_loss_token=2.2102, perplexity_token=9.1174]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:10<05:30,  2.59it/s, acc_step=1/1, ce_loss_token=2.2100, perplexity_token=9.1159]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:10<05:48,  2.45it/s, acc_step=1/1, ce_loss_token=2.2098, perplexity_token=9.1140]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:11<05:37,  2.53it/s, acc_step=1/1, ce_loss_token=2.2097, perplexity_token=9.1127]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:11<05:34,  2.55it/s, acc_step=1/1, ce_loss_token=2.2095, perplexity_token=9.1116]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:11<05:38,  2.51it/s, acc_step=1/1, ce_loss_token=2.2094, perplexity_token=9.1106]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:12<05:25,  2.61it/s, acc_step=1/1, ce_loss_token=2.2093, perplexity_token=9.1092]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:30,  2.57it/s, acc_step=1/1, ce_loss_token=2.2091, perplexity_token=9.1076]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:13<05:24,  2.61it/s, acc_step=1/1, ce_loss_token=2.2089, perplexity_token=9.1059]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:21,  2.64it/s, acc_step=1/1, ce_loss_token=2.2088, perplexity_token=9.1046]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:13<05:20,  2.64it/s, acc_step=1/1, ce_loss_token=2.2086, perplexity_token=9.1030]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:14<05:18,  2.65it/s, acc_step=1/1, ce_loss_token=2.2085, perplexity_token=9.1019]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:26,  2.59it/s, acc_step=1/1, ce_loss_token=2.2083, perplexity_token=9.1000]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  19%|█████████                                      | 200/1044 [01:14<05:21,  2.62it/s, acc_step=1/1, ce_loss_token=2.2081, perplexity_token=9.0983]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  19%|█████████                                      | 201/1044 [01:15<05:30,  2.55it/s, acc_step=1/1, ce_loss_token=2.2080, perplexity_token=9.0974]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:15<04:36,  3.04it/s, acc_step=1/1, ce_loss_token=2.2086, perplexity_token=9.1027]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:16<04:20,  3.23it/s, acc_step=1/1, ce_loss_token=2.2088, perplexity_token=9.1045]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:16<04:32,  3.08it/s, acc_step=1/1, ce_loss_token=2.2086, perplexity_token=9.1031]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:16<04:39,  3.00it/s, acc_step=1/1, ce_loss_token=2.2084, perplexity_token=9.1015]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:17<04:45,  2.93it/s, acc_step=1/1, ce_loss_token=2.2083, perplexity_token=9.1005]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:17<04:52,  2.86it/s, acc_step=1/1, ce_loss_token=2.2081, perplexity_token=9.0989]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:17<04:51,  2.87it/s, acc_step=1/1, ce_loss_token=2.2080, perplexity_token=9.0975]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:18<04:52,  2.85it/s, acc_step=1/1, ce_loss_token=2.2079, perplexity_token=9.0962]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:18<05:09,  2.69it/s, acc_step=1/1, ce_loss_token=2.2077, perplexity_token=9.0945]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:19<05:07,  2.71it/s, acc_step=1/1, ce_loss_token=2.2075, perplexity_token=9.0933]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:19<04:46,  2.90it/s, acc_step=1/1, ce_loss_token=2.2080, perplexity_token=9.0974]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:19<04:56,  2.80it/s, acc_step=1/1, ce_loss_token=2.2079, perplexity_token=9.0966]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:20<05:06,  2.71it/s, acc_step=1/1, ce_loss_token=2.2078, perplexity_token=9.0954]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:20<05:19,  2.59it/s, acc_step=1/1, ce_loss_token=2.2076, perplexity_token=9.0941]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:20<05:14,  2.63it/s, acc_step=1/1, ce_loss_token=2.2075, perplexity_token=9.0928]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:21<05:03,  2.72it/s, acc_step=1/1, ce_loss_token=2.2074, perplexity_token=9.0916]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:21<05:08,  2.68it/s, acc_step=1/1, ce_loss_token=2.2072, perplexity_token=9.0904]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:22<05:06,  2.69it/s, acc_step=1/1, ce_loss_token=2.2071, perplexity_token=9.0892]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:22<05:08,  2.66it/s, acc_step=1/1, ce_loss_token=2.2070, perplexity_token=9.0880]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:22<05:05,  2.69it/s, acc_step=1/1, ce_loss_token=2.2068, perplexity_token=9.0864]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  21%|██████████                                     | 223/1044 [01:23<05:09,  2.65it/s, acc_step=1/1, ce_loss_token=2.2066, perplexity_token=9.0846]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  21%|██████████                                     | 224/1044 [01:23<05:01,  2.72it/s, acc_step=1/1, ce_loss_token=2.2064, perplexity_token=9.0833]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:23<05:06,  2.67it/s, acc_step=1/1, ce_loss_token=2.2063, perplexity_token=9.0820]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:24<05:07,  2.66it/s, acc_step=1/1, ce_loss_token=2.2061, perplexity_token=9.0806]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:24<04:47,  2.84it/s, acc_step=1/1, ce_loss_token=2.2064, perplexity_token=9.0828]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:25<05:00,  2.72it/s, acc_step=1/1, ce_loss_token=2.2062, perplexity_token=9.0811]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:25<05:08,  2.65it/s, acc_step=1/1, ce_loss_token=2.2060, perplexity_token=9.0795]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:25<05:06,  2.65it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0782]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:26<04:44,  2.86it/s, acc_step=1/1, ce_loss_token=2.2060, perplexity_token=9.0796]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:26<04:51,  2.79it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0783]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:26<04:53,  2.76it/s, acc_step=1/1, ce_loss_token=2.2057, perplexity_token=9.0769]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:27<04:37,  2.92it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0788]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:27<04:43,  2.85it/s, acc_step=1/1, ce_loss_token=2.2058, perplexity_token=9.0775]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:27<04:41,  2.87it/s, acc_step=1/1, ce_loss_token=2.2056, perplexity_token=9.0758]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:28<04:09,  3.23it/s, acc_step=1/1, ce_loss_token=2.2062, perplexity_token=9.0810]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:28<04:13,  3.18it/s, acc_step=1/1, ce_loss_token=2.2064, perplexity_token=9.0825]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:29<04:28,  3.00it/s, acc_step=1/1, ce_loss_token=2.2062, perplexity_token=9.0812]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:29<04:35,  2.91it/s, acc_step=1/1, ce_loss_token=2.2060, perplexity_token=9.0795]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:29<04:34,  2.93it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0784]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:30<04:33,  2.93it/s, acc_step=1/1, ce_loss_token=2.2058, perplexity_token=9.0772]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:30<04:45,  2.80it/s, acc_step=1/1, ce_loss_token=2.2056, perplexity_token=9.0760]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  23%|███████████                                    | 245/1044 [01:31<05:01,  2.65it/s, acc_step=1/1, ce_loss_token=2.2055, perplexity_token=9.0746]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████                                    | 246/1044 [01:31<05:03,  2.63it/s, acc_step=1/1, ce_loss_token=2.2053, perplexity_token=9.0734]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:31<04:22,  3.03it/s, acc_step=1/1, ce_loss_token=2.2058, perplexity_token=9.0772]

torch.Size([256, 310, 35]) torch.Size([256, 310])
torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:32<04:35,  2.89it/s, acc_step=1/1, ce_loss_token=2.2056, perplexity_token=9.0757]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:32<04:29,  2.94it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0789]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:33<04:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.2058, perplexity_token=9.0775]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:33<05:19,  2.48it/s, acc_step=1/1, ce_loss_token=2.2057, perplexity_token=9.0763]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:33<05:16,  2.50it/s, acc_step=1/1, ce_loss_token=2.2055, perplexity_token=9.0747]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:34<04:33,  2.89it/s, acc_step=1/1, ce_loss_token=2.2063, perplexity_token=9.0818]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:34<04:33,  2.88it/s, acc_step=1/1, ce_loss_token=2.2061, perplexity_token=9.0802]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:35<04:34,  2.87it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0787]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:35<04:24,  2.97it/s, acc_step=1/1, ce_loss_token=2.2061, perplexity_token=9.0804]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:35<04:29,  2.91it/s, acc_step=1/1, ce_loss_token=2.2060, perplexity_token=9.0791]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:36<04:30,  2.90it/s, acc_step=1/1, ce_loss_token=2.2061, perplexity_token=9.0801]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:36<04:33,  2.87it/s, acc_step=1/1, ce_loss_token=2.2059, perplexity_token=9.0784]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:37<04:50,  2.69it/s, acc_step=1/1, ce_loss_token=2.2058, perplexity_token=9.0771]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:37<05:32,  2.35it/s, acc_step=1/1, ce_loss_token=2.2056, perplexity_token=9.0761]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:38<05:25,  2.40it/s, acc_step=1/1, ce_loss_token=2.2055, perplexity_token=9.0748]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:38<05:16,  2.46it/s, acc_step=1/1, ce_loss_token=2.2053, perplexity_token=9.0733]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:38<05:06,  2.54it/s, acc_step=1/1, ce_loss_token=2.2052, perplexity_token=9.0719]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|████████████                                   | 267/1044 [01:39<05:00,  2.59it/s, acc_step=1/1, ce_loss_token=2.2050, perplexity_token=9.0704]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████                                   | 268/1044 [01:39<04:52,  2.65it/s, acc_step=1/1, ce_loss_token=2.2049, perplexity_token=9.0690]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  26%|████████████                                   | 269/1044 [01:39<04:48,  2.69it/s, acc_step=1/1, ce_loss_token=2.2047, perplexity_token=9.0672]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:40<04:56,  2.61it/s, acc_step=1/1, ce_loss_token=2.2045, perplexity_token=9.0659]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:40<04:49,  2.67it/s, acc_step=1/1, ce_loss_token=2.2044, perplexity_token=9.0645]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:40<04:29,  2.86it/s, acc_step=1/1, ce_loss_token=2.2045, perplexity_token=9.0654]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:41<04:34,  2.81it/s, acc_step=1/1, ce_loss_token=2.2043, perplexity_token=9.0640]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:41<04:54,  2.62it/s, acc_step=1/1, ce_loss_token=2.2042, perplexity_token=9.0629]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:42<04:52,  2.63it/s, acc_step=1/1, ce_loss_token=2.2040, perplexity_token=9.0616]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:42<05:00,  2.56it/s, acc_step=1/1, ce_loss_token=2.2039, perplexity_token=9.0602]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:42<04:58,  2.57it/s, acc_step=1/1, ce_loss_token=2.2037, perplexity_token=9.0589]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:43<04:49,  2.64it/s, acc_step=1/1, ce_loss_token=2.2036, perplexity_token=9.0576]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:43<04:53,  2.61it/s, acc_step=1/1, ce_loss_token=2.2034, perplexity_token=9.0559]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:44<05:04,  2.51it/s, acc_step=1/1, ce_loss_token=2.2032, perplexity_token=9.0543]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:44<05:12,  2.44it/s, acc_step=1/1, ce_loss_token=2.2031, perplexity_token=9.0529]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:44<05:01,  2.52it/s, acc_step=1/1, ce_loss_token=2.2030, perplexity_token=9.0517]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:45<04:39,  2.72it/s, acc_step=1/1, ce_loss_token=2.2033, perplexity_token=9.0546]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:45<04:39,  2.72it/s, acc_step=1/1, ce_loss_token=2.2031, perplexity_token=9.0532]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:45<04:43,  2.68it/s, acc_step=1/1, ce_loss_token=2.2030, perplexity_token=9.0522]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:46<04:32,  2.78it/s, acc_step=1/1, ce_loss_token=2.2031, perplexity_token=9.0530]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:46<04:33,  2.77it/s, acc_step=1/1, ce_loss_token=2.2030, perplexity_token=9.0520]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:47<04:29,  2.81it/s, acc_step=1/1, ce_loss_token=2.2028, perplexity_token=9.0504]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:47<04:36,  2.73it/s, acc_step=1/1, ce_loss_token=2.2026, perplexity_token=9.0489]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:47<04:45,  2.64it/s, acc_step=1/1, ce_loss_token=2.2025, perplexity_token=9.0476]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:48<04:36,  2.72it/s, acc_step=1/1, ce_loss_token=2.2024, perplexity_token=9.0463]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:48<04:39,  2.69it/s, acc_step=1/1, ce_loss_token=2.2022, perplexity_token=9.0450]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:48<04:35,  2.73it/s, acc_step=1/1, ce_loss_token=2.2021, perplexity_token=9.0438]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:49<04:31,  2.76it/s, acc_step=1/1, ce_loss_token=2.2020, perplexity_token=9.0428]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:49<04:31,  2.76it/s, acc_step=1/1, ce_loss_token=2.2018, perplexity_token=9.0415]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:49<04:35,  2.72it/s, acc_step=1/1, ce_loss_token=2.2017, perplexity_token=9.0405]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:50<04:50,  2.58it/s, acc_step=1/1, ce_loss_token=2.2016, perplexity_token=9.0392]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:50<04:44,  2.62it/s, acc_step=1/1, ce_loss_token=2.2014, perplexity_token=9.0379]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:51<04:54,  2.53it/s, acc_step=1/1, ce_loss_token=2.2012, perplexity_token=9.0363]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:51<04:47,  2.58it/s, acc_step=1/1, ce_loss_token=2.2011, perplexity_token=9.0353]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:51<04:53,  2.53it/s, acc_step=1/1, ce_loss_token=2.2010, perplexity_token=9.0343]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:52<04:48,  2.57it/s, acc_step=1/1, ce_loss_token=2.2009, perplexity_token=9.0333]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:52<04:42,  2.62it/s, acc_step=1/1, ce_loss_token=2.2008, perplexity_token=9.0320]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:53<04:44,  2.60it/s, acc_step=1/1, ce_loss_token=2.2006, perplexity_token=9.0308]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:53<04:47,  2.57it/s, acc_step=1/1, ce_loss_token=2.2005, perplexity_token=9.0294]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:53<04:37,  2.66it/s, acc_step=1/1, ce_loss_token=2.2003, perplexity_token=9.0278]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:54<04:35,  2.67it/s, acc_step=1/1, ce_loss_token=2.2001, perplexity_token=9.0262]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:54<04:47,  2.56it/s, acc_step=1/1, ce_loss_token=2.2000, perplexity_token=9.0251]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:55<04:53,  2.51it/s, acc_step=1/1, ce_loss_token=2.1998, perplexity_token=9.0235]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:55<04:44,  2.58it/s, acc_step=1/1, ce_loss_token=2.1997, perplexity_token=9.0224]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:55<04:33,  2.68it/s, acc_step=1/1, ce_loss_token=2.1995, perplexity_token=9.0205]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:56<04:36,  2.65it/s, acc_step=1/1, ce_loss_token=2.1994, perplexity_token=9.0195]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:56<04:26,  2.74it/s, acc_step=1/1, ce_loss_token=2.1995, perplexity_token=9.0207]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:56<04:21,  2.79it/s, acc_step=1/1, ce_loss_token=2.1994, perplexity_token=9.0194]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:57<04:25,  2.74it/s, acc_step=1/1, ce_loss_token=2.1992, perplexity_token=9.0182]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:57<04:31,  2.68it/s, acc_step=1/1, ce_loss_token=2.1991, perplexity_token=9.0167]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:57<04:26,  2.73it/s, acc_step=1/1, ce_loss_token=2.1989, perplexity_token=9.0155]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:58<04:28,  2.71it/s, acc_step=1/1, ce_loss_token=2.1988, perplexity_token=9.0143]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:58<04:26,  2.73it/s, acc_step=1/1, ce_loss_token=2.1987, perplexity_token=9.0131]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:59<04:28,  2.69it/s, acc_step=1/1, ce_loss_token=2.1985, perplexity_token=9.0119]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  31%|██████████████▍                                | 321/1044 [01:59<04:41,  2.56it/s, acc_step=1/1, ce_loss_token=2.1984, perplexity_token=9.0104]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  31%|██████████████▍                                | 322/1044 [01:59<04:42,  2.56it/s, acc_step=1/1, ce_loss_token=2.1982, perplexity_token=9.0092]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  31%|██████████████▌                                | 323/1044 [02:00<04:24,  2.72it/s, acc_step=1/1, ce_loss_token=2.1984, perplexity_token=9.0104]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  31%|██████████████▌                                | 324/1044 [02:00<04:31,  2.65it/s, acc_step=1/1, ce_loss_token=2.1982, perplexity_token=9.0089]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  31%|██████████████▋                                | 325/1044 [02:01<04:32,  2.63it/s, acc_step=1/1, ce_loss_token=2.1981, perplexity_token=9.0078]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  31%|██████████████▋                                | 326/1044 [02:01<04:14,  2.82it/s, acc_step=1/1, ce_loss_token=2.1982, perplexity_token=9.0088]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:01<04:18,  2.77it/s, acc_step=1/1, ce_loss_token=2.1981, perplexity_token=9.0075]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:02<04:24,  2.70it/s, acc_step=1/1, ce_loss_token=2.1979, perplexity_token=9.0062]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:02<04:29,  2.65it/s, acc_step=1/1, ce_loss_token=2.1978, perplexity_token=9.0049]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:02<04:47,  2.49it/s, acc_step=1/1, ce_loss_token=2.1976, perplexity_token=9.0036]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:03<04:22,  2.72it/s, acc_step=1/1, ce_loss_token=2.1977, perplexity_token=9.0046]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:03<04:26,  2.67it/s, acc_step=1/1, ce_loss_token=2.1976, perplexity_token=9.0034]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:04<04:36,  2.57it/s, acc_step=1/1, ce_loss_token=2.1975, perplexity_token=9.0024]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  32%|███████████████                                | 334/1044 [02:04<04:27,  2.66it/s, acc_step=1/1, ce_loss_token=2.1973, perplexity_token=9.0011]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  32%|███████████████                                | 335/1044 [02:04<04:18,  2.74it/s, acc_step=1/1, ce_loss_token=2.1972, perplexity_token=8.9998]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:05<04:33,  2.59it/s, acc_step=1/1, ce_loss_token=2.1971, perplexity_token=8.9987]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:05<04:28,  2.63it/s, acc_step=1/1, ce_loss_token=2.1969, perplexity_token=8.9975]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:05<04:28,  2.63it/s, acc_step=1/1, ce_loss_token=2.1968, perplexity_token=8.9962]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:06<04:09,  2.82it/s, acc_step=1/1, ce_loss_token=2.1969, perplexity_token=8.9973]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:06<04:16,  2.74it/s, acc_step=1/1, ce_loss_token=2.1968, perplexity_token=8.9960]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:06<04:00,  2.92it/s, acc_step=1/1, ce_loss_token=2.1971, perplexity_token=8.9989]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:07<04:09,  2.81it/s, acc_step=1/1, ce_loss_token=2.1970, perplexity_token=8.9980]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:07<04:15,  2.74it/s, acc_step=1/1, ce_loss_token=2.1969, perplexity_token=8.9973]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:08<04:33,  2.56it/s, acc_step=1/1, ce_loss_token=2.1968, perplexity_token=8.9961]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:08<04:15,  2.74it/s, acc_step=1/1, ce_loss_token=2.1969, perplexity_token=8.9969]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:08<04:19,  2.69it/s, acc_step=1/1, ce_loss_token=2.1968, perplexity_token=8.9958]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:09<04:20,  2.68it/s, acc_step=1/1, ce_loss_token=2.1966, perplexity_token=8.9948]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:09<04:17,  2.70it/s, acc_step=1/1, ce_loss_token=2.1965, perplexity_token=8.9938]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:09<04:15,  2.72it/s, acc_step=1/1, ce_loss_token=2.1964, perplexity_token=8.9930]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:10<04:14,  2.72it/s, acc_step=1/1, ce_loss_token=2.1963, perplexity_token=8.9917]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:10<04:18,  2.68it/s, acc_step=1/1, ce_loss_token=2.1962, perplexity_token=8.9906]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:11<04:22,  2.63it/s, acc_step=1/1, ce_loss_token=2.1960, perplexity_token=8.9893]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:11<04:20,  2.65it/s, acc_step=1/1, ce_loss_token=2.1959, perplexity_token=8.9883]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:11<04:18,  2.67it/s, acc_step=1/1, ce_loss_token=2.1958, perplexity_token=8.9870]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:12<03:59,  2.87it/s, acc_step=1/1, ce_loss_token=2.1959, perplexity_token=8.9879]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  34%|████████████████                               | 356/1044 [02:12<04:08,  2.77it/s, acc_step=1/1, ce_loss_token=2.1958, perplexity_token=8.9869]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  34%|████████████████                               | 357/1044 [02:12<04:13,  2.71it/s, acc_step=1/1, ce_loss_token=2.1956, perplexity_token=8.9855]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  34%|████████████████                               | 358/1044 [02:13<04:05,  2.79it/s, acc_step=1/1, ce_loss_token=2.1957, perplexity_token=8.9864]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:13<04:08,  2.75it/s, acc_step=1/1, ce_loss_token=2.1956, perplexity_token=8.9851]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:13<04:23,  2.60it/s, acc_step=1/1, ce_loss_token=2.1954, perplexity_token=8.9838]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:14<04:18,  2.65it/s, acc_step=1/1, ce_loss_token=2.1953, perplexity_token=8.9826]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:14<04:18,  2.64it/s, acc_step=1/1, ce_loss_token=2.1952, perplexity_token=8.9814]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:15<04:12,  2.69it/s, acc_step=1/1, ce_loss_token=2.1951, perplexity_token=8.9805]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:15<03:53,  2.91it/s, acc_step=1/1, ce_loss_token=2.1951, perplexity_token=8.9812]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:15<03:58,  2.84it/s, acc_step=1/1, ce_loss_token=2.1950, perplexity_token=8.9803]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:16<04:12,  2.69it/s, acc_step=1/1, ce_loss_token=2.1949, perplexity_token=8.9792]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:16<04:12,  2.68it/s, acc_step=1/1, ce_loss_token=2.1948, perplexity_token=8.9779]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:16<04:08,  2.72it/s, acc_step=1/1, ce_loss_token=2.1946, perplexity_token=8.9765]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:17<04:11,  2.68it/s, acc_step=1/1, ce_loss_token=2.1945, perplexity_token=8.9753]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:17<04:13,  2.65it/s, acc_step=1/1, ce_loss_token=2.1943, perplexity_token=8.9741]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:17<03:56,  2.85it/s, acc_step=1/1, ce_loss_token=2.1944, perplexity_token=8.9749]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:18<03:58,  2.82it/s, acc_step=1/1, ce_loss_token=2.1943, perplexity_token=8.9736]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:18<04:22,  2.55it/s, acc_step=1/1, ce_loss_token=2.1942, perplexity_token=8.9725]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:19<04:12,  2.65it/s, acc_step=1/1, ce_loss_token=2.1940, perplexity_token=8.9712]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:19<04:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.1939, perplexity_token=8.9701]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:19<04:22,  2.54it/s, acc_step=1/1, ce_loss_token=2.1938, perplexity_token=8.9690]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:20<04:20,  2.56it/s, acc_step=1/1, ce_loss_token=2.1937, perplexity_token=8.9681]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:20<04:13,  2.63it/s, acc_step=1/1, ce_loss_token=2.1935, perplexity_token=8.9669]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:20<03:56,  2.82it/s, acc_step=1/1, ce_loss_token=2.1938, perplexity_token=8.9696]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:21<03:42,  2.99it/s, acc_step=1/1, ce_loss_token=2.1940, perplexity_token=8.9711]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:21<03:52,  2.85it/s, acc_step=1/1, ce_loss_token=2.1939, perplexity_token=8.9701]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:21<03:38,  3.03it/s, acc_step=1/1, ce_loss_token=2.1940, perplexity_token=8.9707]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:22<03:41,  2.98it/s, acc_step=1/1, ce_loss_token=2.1938, perplexity_token=8.9696]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:22<03:50,  2.86it/s, acc_step=1/1, ce_loss_token=2.1937, perplexity_token=8.9684]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:23<03:59,  2.75it/s, acc_step=1/1, ce_loss_token=2.1936, perplexity_token=8.9673]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:23<04:17,  2.56it/s, acc_step=1/1, ce_loss_token=2.1935, perplexity_token=8.9661]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:23<04:13,  2.60it/s, acc_step=1/1, ce_loss_token=2.1933, perplexity_token=8.9651]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:24<04:15,  2.57it/s, acc_step=1/1, ce_loss_token=2.1932, perplexity_token=8.9640]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:24<04:08,  2.64it/s, acc_step=1/1, ce_loss_token=2.1931, perplexity_token=8.9629]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:25<04:10,  2.62it/s, acc_step=1/1, ce_loss_token=2.1930, perplexity_token=8.9618]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:25<04:06,  2.65it/s, acc_step=1/1, ce_loss_token=2.1929, perplexity_token=8.9608]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:25<04:00,  2.72it/s, acc_step=1/1, ce_loss_token=2.1928, perplexity_token=8.9598]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:26<04:08,  2.62it/s, acc_step=1/1, ce_loss_token=2.1926, perplexity_token=8.9586]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:26<04:12,  2.58it/s, acc_step=1/1, ce_loss_token=2.1925, perplexity_token=8.9572]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:26<04:11,  2.58it/s, acc_step=1/1, ce_loss_token=2.1923, perplexity_token=8.9559]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:27<04:21,  2.48it/s, acc_step=1/1, ce_loss_token=2.1922, perplexity_token=8.9549]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:27<04:32,  2.38it/s, acc_step=1/1, ce_loss_token=2.1920, perplexity_token=8.9534]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:28<04:44,  2.27it/s, acc_step=1/1, ce_loss_token=2.1919, perplexity_token=8.9523]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:28<04:40,  2.30it/s, acc_step=1/1, ce_loss_token=2.1918, perplexity_token=8.9512]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:29<04:35,  2.34it/s, acc_step=1/1, ce_loss_token=2.1916, perplexity_token=8.9499]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:29<04:29,  2.38it/s, acc_step=1/1, ce_loss_token=2.1915, perplexity_token=8.9487]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:29<04:06,  2.61it/s, acc_step=1/1, ce_loss_token=2.1916, perplexity_token=8.9496]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:30<04:02,  2.64it/s, acc_step=1/1, ce_loss_token=2.1915, perplexity_token=8.9485]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:30<04:01,  2.65it/s, acc_step=1/1, ce_loss_token=2.1913, perplexity_token=8.9473]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:31<04:17,  2.49it/s, acc_step=1/1, ce_loss_token=2.1912, perplexity_token=8.9458]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:31<04:36,  2.31it/s, acc_step=1/1, ce_loss_token=2.1913, perplexity_token=8.9470]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:31<04:20,  2.45it/s, acc_step=1/1, ce_loss_token=2.1912, perplexity_token=8.9458]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:32<04:15,  2.49it/s, acc_step=1/1, ce_loss_token=2.1911, perplexity_token=8.9448]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:32<04:05,  2.59it/s, acc_step=1/1, ce_loss_token=2.1909, perplexity_token=8.9433]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:33<04:12,  2.51it/s, acc_step=1/1, ce_loss_token=2.1908, perplexity_token=8.9425]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:33<04:04,  2.59it/s, acc_step=1/1, ce_loss_token=2.1907, perplexity_token=8.9416]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:33<04:00,  2.63it/s, acc_step=1/1, ce_loss_token=2.1906, perplexity_token=8.9407]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:34<04:03,  2.59it/s, acc_step=1/1, ce_loss_token=2.1905, perplexity_token=8.9397]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:34<03:59,  2.63it/s, acc_step=1/1, ce_loss_token=2.1904, perplexity_token=8.9385]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:34<03:58,  2.64it/s, acc_step=1/1, ce_loss_token=2.1902, perplexity_token=8.9373]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:35<03:54,  2.68it/s, acc_step=1/1, ce_loss_token=2.1901, perplexity_token=8.9363]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:35<03:57,  2.64it/s, acc_step=1/1, ce_loss_token=2.1900, perplexity_token=8.9351]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:36<03:53,  2.68it/s, acc_step=1/1, ce_loss_token=2.1899, perplexity_token=8.9340]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:36<03:19,  3.13it/s, acc_step=1/1, ce_loss_token=2.1900, perplexity_token=8.9356]

torch.Size([256, 309, 35]) torch.Size([256, 309])
torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:36<03:27,  3.00it/s, acc_step=1/1, ce_loss_token=2.1899, perplexity_token=8.9345]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:37<03:42,  2.80it/s, acc_step=1/1, ce_loss_token=2.1898, perplexity_token=8.9334]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:37<03:43,  2.78it/s, acc_step=1/1, ce_loss_token=2.1897, perplexity_token=8.9323]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:38<03:50,  2.69it/s, acc_step=1/1, ce_loss_token=2.1895, perplexity_token=8.9311]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:38<03:59,  2.58it/s, acc_step=1/1, ce_loss_token=2.1894, perplexity_token=8.9301]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:39<04:03,  2.54it/s, acc_step=1/1, ce_loss_token=2.1893, perplexity_token=8.9290]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:39<03:57,  2.59it/s, acc_step=1/1, ce_loss_token=2.1892, perplexity_token=8.9280]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:39<04:02,  2.54it/s, acc_step=1/1, ce_loss_token=2.1891, perplexity_token=8.9269]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:40<04:20,  2.36it/s, acc_step=1/1, ce_loss_token=2.1889, perplexity_token=8.9257]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:40<04:09,  2.46it/s, acc_step=1/1, ce_loss_token=2.1888, perplexity_token=8.9248]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:41<04:02,  2.52it/s, acc_step=1/1, ce_loss_token=2.1887, perplexity_token=8.9239]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:41<03:48,  2.68it/s, acc_step=1/1, ce_loss_token=2.1889, perplexity_token=8.9256]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:41<03:42,  2.74it/s, acc_step=1/1, ce_loss_token=2.1888, perplexity_token=8.9245]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:42<03:41,  2.76it/s, acc_step=1/1, ce_loss_token=2.1887, perplexity_token=8.9233]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:42<03:56,  2.57it/s, acc_step=1/1, ce_loss_token=2.1885, perplexity_token=8.9221]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:42<03:59,  2.54it/s, acc_step=1/1, ce_loss_token=2.1884, perplexity_token=8.9211]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:43<03:53,  2.60it/s, acc_step=1/1, ce_loss_token=2.1883, perplexity_token=8.9201]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:43<03:55,  2.58it/s, acc_step=1/1, ce_loss_token=2.1882, perplexity_token=8.9190]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:44<03:53,  2.59it/s, acc_step=1/1, ce_loss_token=2.1881, perplexity_token=8.9179]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:44<03:37,  2.78it/s, acc_step=1/1, ce_loss_token=2.1882, perplexity_token=8.9188]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:44<03:37,  2.78it/s, acc_step=1/1, ce_loss_token=2.1881, perplexity_token=8.9179]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:45<03:49,  2.62it/s, acc_step=1/1, ce_loss_token=2.1879, perplexity_token=8.9166]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:45<03:52,  2.59it/s, acc_step=1/1, ce_loss_token=2.1878, perplexity_token=8.9157]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:45<03:56,  2.53it/s, acc_step=1/1, ce_loss_token=2.1877, perplexity_token=8.9145]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:46<03:54,  2.56it/s, acc_step=1/1, ce_loss_token=2.1876, perplexity_token=8.9135]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:46<03:56,  2.53it/s, acc_step=1/1, ce_loss_token=2.1874, perplexity_token=8.9124]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:46<03:33,  2.80it/s, acc_step=1/1, ce_loss_token=2.1875, perplexity_token=8.9132]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:47<03:45,  2.65it/s, acc_step=1/1, ce_loss_token=2.1874, perplexity_token=8.9122]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:47<03:48,  2.60it/s, acc_step=1/1, ce_loss_token=2.1873, perplexity_token=8.9112]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:48<03:44,  2.64it/s, acc_step=1/1, ce_loss_token=2.1872, perplexity_token=8.9103]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:48<03:40,  2.68it/s, acc_step=1/1, ce_loss_token=2.1871, perplexity_token=8.9090]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:48<03:45,  2.62it/s, acc_step=1/1, ce_loss_token=2.1870, perplexity_token=8.9083]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:49<03:43,  2.65it/s, acc_step=1/1, ce_loss_token=2.1868, perplexity_token=8.9071]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:49<03:40,  2.68it/s, acc_step=1/1, ce_loss_token=2.1867, perplexity_token=8.9059]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:50<03:41,  2.66it/s, acc_step=1/1, ce_loss_token=2.1866, perplexity_token=8.9047]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:50<03:37,  2.70it/s, acc_step=1/1, ce_loss_token=2.1865, perplexity_token=8.9038]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:50<03:41,  2.65it/s, acc_step=1/1, ce_loss_token=2.1864, perplexity_token=8.9027]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:51<03:49,  2.55it/s, acc_step=1/1, ce_loss_token=2.1862, perplexity_token=8.9015]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:51<03:43,  2.62it/s, acc_step=1/1, ce_loss_token=2.1861, perplexity_token=8.9005]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:51<03:31,  2.76it/s, acc_step=1/1, ce_loss_token=2.1862, perplexity_token=8.9012]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:52<03:35,  2.71it/s, acc_step=1/1, ce_loss_token=2.1861, perplexity_token=8.9001]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:52<03:37,  2.68it/s, acc_step=1/1, ce_loss_token=2.1860, perplexity_token=8.8992]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:53<03:40,  2.64it/s, acc_step=1/1, ce_loss_token=2.1858, perplexity_token=8.8981]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:53<03:43,  2.59it/s, acc_step=1/1, ce_loss_token=2.1857, perplexity_token=8.8971]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:53<03:41,  2.61it/s, acc_step=1/1, ce_loss_token=2.1856, perplexity_token=8.8961]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:54<03:34,  2.69it/s, acc_step=1/1, ce_loss_token=2.1855, perplexity_token=8.8951]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:54<03:35,  2.68it/s, acc_step=1/1, ce_loss_token=2.1854, perplexity_token=8.8941]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:54<03:36,  2.66it/s, acc_step=1/1, ce_loss_token=2.1853, perplexity_token=8.8931]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:55<03:19,  2.89it/s, acc_step=1/1, ce_loss_token=2.1853, perplexity_token=8.8936]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:55<03:32,  2.70it/s, acc_step=1/1, ce_loss_token=2.1852, perplexity_token=8.8925]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:56<03:36,  2.65it/s, acc_step=1/1, ce_loss_token=2.1851, perplexity_token=8.8915]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:56<03:38,  2.61it/s, acc_step=1/1, ce_loss_token=2.1850, perplexity_token=8.8904]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:56<03:37,  2.62it/s, acc_step=1/1, ce_loss_token=2.1848, perplexity_token=8.8893]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:57<03:33,  2.67it/s, acc_step=1/1, ce_loss_token=2.1847, perplexity_token=8.8882]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:57<03:30,  2.70it/s, acc_step=1/1, ce_loss_token=2.1846, perplexity_token=8.8872]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:57<03:27,  2.74it/s, acc_step=1/1, ce_loss_token=2.1845, perplexity_token=8.8861]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:58<03:26,  2.74it/s, acc_step=1/1, ce_loss_token=2.1844, perplexity_token=8.8851]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:58<03:27,  2.72it/s, acc_step=1/1, ce_loss_token=2.1842, perplexity_token=8.8838]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:59<03:28,  2.70it/s, acc_step=1/1, ce_loss_token=2.1841, perplexity_token=8.8826]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:59<03:25,  2.75it/s, acc_step=1/1, ce_loss_token=2.1840, perplexity_token=8.8816]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:59<03:27,  2.71it/s, acc_step=1/1, ce_loss_token=2.1838, perplexity_token=8.8803]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [03:00<03:26,  2.72it/s, acc_step=1/1, ce_loss_token=2.1837, perplexity_token=8.8792]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [03:00<03:46,  2.48it/s, acc_step=1/1, ce_loss_token=2.1836, perplexity_token=8.8781]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [03:01<03:46,  2.47it/s, acc_step=1/1, ce_loss_token=2.1835, perplexity_token=8.8770]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [03:01<03:40,  2.53it/s, acc_step=1/1, ce_loss_token=2.1833, perplexity_token=8.8759]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [03:01<03:34,  2.60it/s, acc_step=1/1, ce_loss_token=2.1832, perplexity_token=8.8747]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [03:02<04:04,  2.28it/s, acc_step=1/1, ce_loss_token=2.1831, perplexity_token=8.8737]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [03:02<04:00,  2.31it/s, acc_step=1/1, ce_loss_token=2.1830, perplexity_token=8.8725]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  47%|██████████████████████                         | 489/1044 [03:03<03:54,  2.36it/s, acc_step=1/1, ce_loss_token=2.1829, perplexity_token=8.8716]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  47%|██████████████████████                         | 490/1044 [03:03<03:42,  2.48it/s, acc_step=1/1, ce_loss_token=2.1827, perplexity_token=8.8705]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  47%|██████████████████████                         | 491/1044 [03:03<03:44,  2.46it/s, acc_step=1/1, ce_loss_token=2.1826, perplexity_token=8.8695]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:04<03:36,  2.55it/s, acc_step=1/1, ce_loss_token=2.1825, perplexity_token=8.8686]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:04<03:40,  2.50it/s, acc_step=1/1, ce_loss_token=2.1824, perplexity_token=8.8675]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:05<03:36,  2.54it/s, acc_step=1/1, ce_loss_token=2.1823, perplexity_token=8.8663]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:05<03:29,  2.62it/s, acc_step=1/1, ce_loss_token=2.1821, perplexity_token=8.8651]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:05<03:24,  2.68it/s, acc_step=1/1, ce_loss_token=2.1822, perplexity_token=8.8657]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:06<03:32,  2.57it/s, acc_step=1/1, ce_loss_token=2.1821, perplexity_token=8.8647]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:06<03:25,  2.65it/s, acc_step=1/1, ce_loss_token=2.1820, perplexity_token=8.8637]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:06<03:26,  2.64it/s, acc_step=1/1, ce_loss_token=2.1818, perplexity_token=8.8626]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:07<03:20,  2.71it/s, acc_step=1/1, ce_loss_token=2.1817, perplexity_token=8.8615]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:07<03:19,  2.73it/s, acc_step=1/1, ce_loss_token=2.1816, perplexity_token=8.8605]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:08<03:27,  2.61it/s, acc_step=1/1, ce_loss_token=2.1815, perplexity_token=8.8595]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:08<03:09,  2.86it/s, acc_step=1/1, ce_loss_token=2.1815, perplexity_token=8.8600]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:08<03:09,  2.85it/s, acc_step=1/1, ce_loss_token=2.1814, perplexity_token=8.8591]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:09<03:10,  2.83it/s, acc_step=1/1, ce_loss_token=2.1813, perplexity_token=8.8580]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:09<03:13,  2.78it/s, acc_step=1/1, ce_loss_token=2.1812, perplexity_token=8.8569]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:09<03:13,  2.78it/s, acc_step=1/1, ce_loss_token=2.1811, perplexity_token=8.8559]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:10<03:15,  2.74it/s, acc_step=1/1, ce_loss_token=2.1810, perplexity_token=8.8549]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:10<03:02,  2.93it/s, acc_step=1/1, ce_loss_token=2.1810, perplexity_token=8.8553]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:10<02:52,  3.09it/s, acc_step=1/1, ce_loss_token=2.1811, perplexity_token=8.8556]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:11<02:56,  3.02it/s, acc_step=1/1, ce_loss_token=2.1809, perplexity_token=8.8546]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:11<03:05,  2.86it/s, acc_step=1/1, ce_loss_token=2.1808, perplexity_token=8.8537]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:11<02:58,  2.97it/s, acc_step=1/1, ce_loss_token=2.1809, perplexity_token=8.8540]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:12<03:02,  2.90it/s, acc_step=1/1, ce_loss_token=2.1808, perplexity_token=8.8531]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:12<03:06,  2.84it/s, acc_step=1/1, ce_loss_token=2.1807, perplexity_token=8.8521]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:12<03:10,  2.77it/s, acc_step=1/1, ce_loss_token=2.1805, perplexity_token=8.8511]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:13<03:17,  2.67it/s, acc_step=1/1, ce_loss_token=2.1804, perplexity_token=8.8502]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:13<03:06,  2.82it/s, acc_step=1/1, ce_loss_token=2.1805, perplexity_token=8.8511]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:13<03:11,  2.74it/s, acc_step=1/1, ce_loss_token=2.1804, perplexity_token=8.8502]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:14<03:08,  2.78it/s, acc_step=1/1, ce_loss_token=2.1803, perplexity_token=8.8489]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:14<03:10,  2.74it/s, acc_step=1/1, ce_loss_token=2.1802, perplexity_token=8.8479]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:15<03:11,  2.73it/s, acc_step=1/1, ce_loss_token=2.1801, perplexity_token=8.8469]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:15<03:18,  2.62it/s, acc_step=1/1, ce_loss_token=2.1800, perplexity_token=8.8460]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:15<03:16,  2.64it/s, acc_step=1/1, ce_loss_token=2.1799, perplexity_token=8.8450]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:16<03:14,  2.66it/s, acc_step=1/1, ce_loss_token=2.1797, perplexity_token=8.8439]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:16<03:21,  2.57it/s, acc_step=1/1, ce_loss_token=2.1796, perplexity_token=8.8429]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:17<03:16,  2.63it/s, acc_step=1/1, ce_loss_token=2.1795, perplexity_token=8.8419]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:17<03:15,  2.64it/s, acc_step=1/1, ce_loss_token=2.1794, perplexity_token=8.8409]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:17<03:19,  2.58it/s, acc_step=1/1, ce_loss_token=2.1793, perplexity_token=8.8398]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:18<03:13,  2.65it/s, acc_step=1/1, ce_loss_token=2.1791, perplexity_token=8.8386]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:18<03:02,  2.81it/s, acc_step=1/1, ce_loss_token=2.1792, perplexity_token=8.8388]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:18<02:56,  2.89it/s, acc_step=1/1, ce_loss_token=2.1791, perplexity_token=8.8379]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:19<02:47,  3.05it/s, acc_step=1/1, ce_loss_token=2.1791, perplexity_token=8.8382]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:19<02:49,  3.01it/s, acc_step=1/1, ce_loss_token=2.1790, perplexity_token=8.8374]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:19<03:07,  2.72it/s, acc_step=1/1, ce_loss_token=2.1789, perplexity_token=8.8364]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:20<03:00,  2.81it/s, acc_step=1/1, ce_loss_token=2.1789, perplexity_token=8.8368]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:20<03:00,  2.81it/s, acc_step=1/1, ce_loss_token=2.1788, perplexity_token=8.8358]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:20<03:07,  2.69it/s, acc_step=1/1, ce_loss_token=2.1787, perplexity_token=8.8348]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:21<03:05,  2.72it/s, acc_step=1/1, ce_loss_token=2.1786, perplexity_token=8.8338]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:21<02:52,  2.92it/s, acc_step=1/1, ce_loss_token=2.1786, perplexity_token=8.8341]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:22<02:36,  3.21it/s, acc_step=1/1, ce_loss_token=2.1789, perplexity_token=8.8369]

torch.Size([256, 319, 35]) torch.Size([256, 319])
torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:22<02:40,  3.11it/s, acc_step=1/1, ce_loss_token=2.1788, perplexity_token=8.8359]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:22<02:50,  2.93it/s, acc_step=1/1, ce_loss_token=2.1787, perplexity_token=8.8348]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:23<02:46,  3.00it/s, acc_step=1/1, ce_loss_token=2.1787, perplexity_token=8.8352]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:23<02:34,  3.22it/s, acc_step=1/1, ce_loss_token=2.1788, perplexity_token=8.8356]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:23<02:59,  2.77it/s, acc_step=1/1, ce_loss_token=2.1787, perplexity_token=8.8347]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:24<03:13,  2.57it/s, acc_step=1/1, ce_loss_token=2.1786, perplexity_token=8.8338]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:24<03:11,  2.59it/s, acc_step=1/1, ce_loss_token=2.1785, perplexity_token=8.8327]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:25<03:08,  2.62it/s, acc_step=1/1, ce_loss_token=2.1783, perplexity_token=8.8317]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:25<03:09,  2.60it/s, acc_step=1/1, ce_loss_token=2.1782, perplexity_token=8.8305]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:25<03:09,  2.59it/s, acc_step=1/1, ce_loss_token=2.1781, perplexity_token=8.8293]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:26<03:06,  2.64it/s, acc_step=1/1, ce_loss_token=2.1780, perplexity_token=8.8284]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:26<02:56,  2.77it/s, acc_step=1/1, ce_loss_token=2.1780, perplexity_token=8.8288]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:27<02:32,  3.20it/s, acc_step=1/1, ce_loss_token=2.1782, perplexity_token=8.8306]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:27<02:36,  3.11it/s, acc_step=1/1, ce_loss_token=2.1781, perplexity_token=8.8296]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:27<02:27,  3.30it/s, acc_step=1/1, ce_loss_token=2.1782, perplexity_token=8.8301]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:28<02:33,  3.16it/s, acc_step=1/1, ce_loss_token=2.1781, perplexity_token=8.8291]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:28<02:33,  3.16it/s, acc_step=1/1, ce_loss_token=2.1781, perplexity_token=8.8293]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:28<02:37,  3.06it/s, acc_step=1/1, ce_loss_token=2.1780, perplexity_token=8.8283]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:29<02:42,  2.97it/s, acc_step=1/1, ce_loss_token=2.1779, perplexity_token=8.8273]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:29<02:49,  2.84it/s, acc_step=1/1, ce_loss_token=2.1778, perplexity_token=8.8265]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:29<02:45,  2.90it/s, acc_step=1/1, ce_loss_token=2.1776, perplexity_token=8.8255]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:30<02:54,  2.74it/s, acc_step=1/1, ce_loss_token=2.1775, perplexity_token=8.8246]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:30<02:57,  2.70it/s, acc_step=1/1, ce_loss_token=2.1774, perplexity_token=8.8236]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:31<02:58,  2.67it/s, acc_step=1/1, ce_loss_token=2.1773, perplexity_token=8.8227]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:31<02:56,  2.69it/s, acc_step=1/1, ce_loss_token=2.1772, perplexity_token=8.8217]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:31<02:58,  2.66it/s, acc_step=1/1, ce_loss_token=2.1771, perplexity_token=8.8209]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:32<03:06,  2.54it/s, acc_step=1/1, ce_loss_token=2.1770, perplexity_token=8.8200]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:32<03:04,  2.56it/s, acc_step=1/1, ce_loss_token=2.1769, perplexity_token=8.8190]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:32<02:59,  2.63it/s, acc_step=1/1, ce_loss_token=2.1768, perplexity_token=8.8182]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:33<02:58,  2.64it/s, acc_step=1/1, ce_loss_token=2.1767, perplexity_token=8.8173]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:33<02:59,  2.61it/s, acc_step=1/1, ce_loss_token=2.1766, perplexity_token=8.8164]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:34<02:54,  2.69it/s, acc_step=1/1, ce_loss_token=2.1765, perplexity_token=8.8155]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:34<02:55,  2.67it/s, acc_step=1/1, ce_loss_token=2.1764, perplexity_token=8.8145]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:34<02:55,  2.66it/s, acc_step=1/1, ce_loss_token=2.1763, perplexity_token=8.8136]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:35<02:59,  2.60it/s, acc_step=1/1, ce_loss_token=2.1762, perplexity_token=8.8125]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:35<02:47,  2.78it/s, acc_step=1/1, ce_loss_token=2.1762, perplexity_token=8.8131]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:35<02:54,  2.66it/s, acc_step=1/1, ce_loss_token=2.1761, perplexity_token=8.8121]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:36<02:45,  2.79it/s, acc_step=1/1, ce_loss_token=2.1761, perplexity_token=8.8122]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:36<02:46,  2.78it/s, acc_step=1/1, ce_loss_token=2.1760, perplexity_token=8.8112]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:37<02:44,  2.80it/s, acc_step=1/1, ce_loss_token=2.1759, perplexity_token=8.8101]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:37<02:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.1758, perplexity_token=8.8092]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:37<02:44,  2.79it/s, acc_step=1/1, ce_loss_token=2.1757, perplexity_token=8.8081]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:38<02:46,  2.74it/s, acc_step=1/1, ce_loss_token=2.1756, perplexity_token=8.8072]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:38<02:48,  2.70it/s, acc_step=1/1, ce_loss_token=2.1754, perplexity_token=8.8061]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:38<02:58,  2.55it/s, acc_step=1/1, ce_loss_token=2.1753, perplexity_token=8.8052]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:39<03:04,  2.47it/s, acc_step=1/1, ce_loss_token=2.1752, perplexity_token=8.8044]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:39<02:57,  2.55it/s, acc_step=1/1, ce_loss_token=2.1751, perplexity_token=8.8035]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:40<02:57,  2.55it/s, acc_step=1/1, ce_loss_token=2.1751, perplexity_token=8.8026]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:40<02:55,  2.58it/s, acc_step=1/1, ce_loss_token=2.1749, perplexity_token=8.8015]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:40<03:10,  2.36it/s, acc_step=1/1, ce_loss_token=2.1750, perplexity_token=8.8019]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:41<02:54,  2.57it/s, acc_step=1/1, ce_loss_token=2.1750, perplexity_token=8.8022]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:41<02:43,  2.74it/s, acc_step=1/1, ce_loss_token=2.1752, perplexity_token=8.8035]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:41<02:33,  2.93it/s, acc_step=1/1, ce_loss_token=2.1752, perplexity_token=8.8037]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:42<02:39,  2.81it/s, acc_step=1/1, ce_loss_token=2.1751, perplexity_token=8.8029]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:42<02:37,  2.84it/s, acc_step=1/1, ce_loss_token=2.1751, perplexity_token=8.8029]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:43<02:41,  2.75it/s, acc_step=1/1, ce_loss_token=2.1750, perplexity_token=8.8022]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:43<02:44,  2.71it/s, acc_step=1/1, ce_loss_token=2.1749, perplexity_token=8.8012]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:43<02:44,  2.69it/s, acc_step=1/1, ce_loss_token=2.1748, perplexity_token=8.8003]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:44<02:43,  2.71it/s, acc_step=1/1, ce_loss_token=2.1747, perplexity_token=8.7995]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:44<02:39,  2.77it/s, acc_step=1/1, ce_loss_token=2.1746, perplexity_token=8.7985]

torch.Size([256, 272, 35]) torch.Size([256, 272])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:44<02:23,  3.07it/s, acc_step=1/1, ce_loss_token=2.1746, perplexity_token=8.7988]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:45<02:33,  2.86it/s, acc_step=1/1, ce_loss_token=2.1745, perplexity_token=8.7978]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:45<02:35,  2.81it/s, acc_step=1/1, ce_loss_token=2.1744, perplexity_token=8.7969]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:45<02:40,  2.73it/s, acc_step=1/1, ce_loss_token=2.1743, perplexity_token=8.7961]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:46<02:43,  2.67it/s, acc_step=1/1, ce_loss_token=2.1742, perplexity_token=8.7952]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:46<02:31,  2.86it/s, acc_step=1/1, ce_loss_token=2.1742, perplexity_token=8.7953]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:46<02:29,  2.90it/s, acc_step=1/1, ce_loss_token=2.1741, perplexity_token=8.7944]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:47<02:35,  2.78it/s, acc_step=1/1, ce_loss_token=2.1740, perplexity_token=8.7933]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:47<02:28,  2.91it/s, acc_step=1/1, ce_loss_token=2.1740, perplexity_token=8.7933]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:47<02:30,  2.86it/s, acc_step=1/1, ce_loss_token=2.1739, perplexity_token=8.7922]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:48<02:30,  2.85it/s, acc_step=1/1, ce_loss_token=2.1738, perplexity_token=8.7912]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:48<02:32,  2.82it/s, acc_step=1/1, ce_loss_token=2.1736, perplexity_token=8.7902]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:49<02:37,  2.72it/s, acc_step=1/1, ce_loss_token=2.1735, perplexity_token=8.7893]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:49<02:33,  2.78it/s, acc_step=1/1, ce_loss_token=2.1734, perplexity_token=8.7882]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:49<02:23,  2.96it/s, acc_step=1/1, ce_loss_token=2.1737, perplexity_token=8.7908]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:50<02:33,  2.78it/s, acc_step=1/1, ce_loss_token=2.1736, perplexity_token=8.7899]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:50<02:35,  2.73it/s, acc_step=1/1, ce_loss_token=2.1735, perplexity_token=8.7890]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:50<02:31,  2.79it/s, acc_step=1/1, ce_loss_token=2.1734, perplexity_token=8.7880]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:51<02:34,  2.73it/s, acc_step=1/1, ce_loss_token=2.1733, perplexity_token=8.7868]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:51<02:41,  2.60it/s, acc_step=1/1, ce_loss_token=2.1732, perplexity_token=8.7859]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:51<02:31,  2.77it/s, acc_step=1/1, ce_loss_token=2.1733, perplexity_token=8.7871]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:52<02:31,  2.77it/s, acc_step=1/1, ce_loss_token=2.1732, perplexity_token=8.7862]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:52<02:31,  2.76it/s, acc_step=1/1, ce_loss_token=2.1731, perplexity_token=8.7852]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:53<02:34,  2.71it/s, acc_step=1/1, ce_loss_token=2.1730, perplexity_token=8.7844]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:53<02:20,  2.95it/s, acc_step=1/1, ce_loss_token=2.1730, perplexity_token=8.7846]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:53<02:27,  2.82it/s, acc_step=1/1, ce_loss_token=2.1729, perplexity_token=8.7836]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:54<02:30,  2.75it/s, acc_step=1/1, ce_loss_token=2.1728, perplexity_token=8.7826]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:54<02:36,  2.65it/s, acc_step=1/1, ce_loss_token=2.1727, perplexity_token=8.7817]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:55<02:53,  2.37it/s, acc_step=1/1, ce_loss_token=2.1726, perplexity_token=8.7807]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:55<02:48,  2.44it/s, acc_step=1/1, ce_loss_token=2.1724, perplexity_token=8.7796]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:55<02:30,  2.72it/s, acc_step=1/1, ce_loss_token=2.1724, perplexity_token=8.7798]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:56<02:19,  2.92it/s, acc_step=1/1, ce_loss_token=2.1726, perplexity_token=8.7808]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:56<02:31,  2.69it/s, acc_step=1/1, ce_loss_token=2.1725, perplexity_token=8.7798]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:56<02:38,  2.57it/s, acc_step=1/1, ce_loss_token=2.1723, perplexity_token=8.7788]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:57<02:34,  2.63it/s, acc_step=1/1, ce_loss_token=2.1722, perplexity_token=8.7778]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:57<02:30,  2.69it/s, acc_step=1/1, ce_loss_token=2.1721, perplexity_token=8.7768]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:57<02:26,  2.76it/s, acc_step=1/1, ce_loss_token=2.1720, perplexity_token=8.7758]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:58<02:34,  2.61it/s, acc_step=1/1, ce_loss_token=2.1719, perplexity_token=8.7749]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:58<02:33,  2.62it/s, acc_step=1/1, ce_loss_token=2.1718, perplexity_token=8.7740]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:59<02:21,  2.84it/s, acc_step=1/1, ce_loss_token=2.1719, perplexity_token=8.7751]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:59<02:19,  2.86it/s, acc_step=1/1, ce_loss_token=2.1718, perplexity_token=8.7742]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:59<02:20,  2.83it/s, acc_step=1/1, ce_loss_token=2.1717, perplexity_token=8.7731]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [04:00<02:22,  2.80it/s, acc_step=1/1, ce_loss_token=2.1716, perplexity_token=8.7723]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [04:00<02:22,  2.79it/s, acc_step=1/1, ce_loss_token=2.1715, perplexity_token=8.7715]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [04:00<02:19,  2.84it/s, acc_step=1/1, ce_loss_token=2.1714, perplexity_token=8.7704]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [04:01<03:20,  1.97it/s, acc_step=1/1, ce_loss_token=2.1713, perplexity_token=8.7693]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [04:01<02:55,  2.24it/s, acc_step=1/1, ce_loss_token=2.1714, perplexity_token=8.7702]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [04:02<02:47,  2.35it/s, acc_step=1/1, ce_loss_token=2.1713, perplexity_token=8.7692]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [04:02<02:41,  2.43it/s, acc_step=1/1, ce_loss_token=2.1712, perplexity_token=8.7684]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [04:03<02:49,  2.31it/s, acc_step=1/1, ce_loss_token=2.1710, perplexity_token=8.7673]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [04:03<02:41,  2.41it/s, acc_step=1/1, ce_loss_token=2.1709, perplexity_token=8.7663]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [04:03<02:26,  2.65it/s, acc_step=1/1, ce_loss_token=2.1710, perplexity_token=8.7672]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [04:04<02:23,  2.71it/s, acc_step=1/1, ce_loss_token=2.1709, perplexity_token=8.7662]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:04<02:28,  2.60it/s, acc_step=1/1, ce_loss_token=2.1708, perplexity_token=8.7653]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:04<02:25,  2.65it/s, acc_step=1/1, ce_loss_token=2.1707, perplexity_token=8.7644]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:05<02:32,  2.53it/s, acc_step=1/1, ce_loss_token=2.1707, perplexity_token=8.7647]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:05<02:28,  2.58it/s, acc_step=1/1, ce_loss_token=2.1706, perplexity_token=8.7638]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:06<02:25,  2.64it/s, acc_step=1/1, ce_loss_token=2.1705, perplexity_token=8.7629]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:06<02:28,  2.57it/s, acc_step=1/1, ce_loss_token=2.1704, perplexity_token=8.7619]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:06<02:22,  2.67it/s, acc_step=1/1, ce_loss_token=2.1703, perplexity_token=8.7611]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:07<02:21,  2.69it/s, acc_step=1/1, ce_loss_token=2.1702, perplexity_token=8.7602]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:07<02:22,  2.67it/s, acc_step=1/1, ce_loss_token=2.1701, perplexity_token=8.7593]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:07<02:11,  2.88it/s, acc_step=1/1, ce_loss_token=2.1702, perplexity_token=8.7596]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:08<02:11,  2.87it/s, acc_step=1/1, ce_loss_token=2.1701, perplexity_token=8.7587]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:08<02:19,  2.70it/s, acc_step=1/1, ce_loss_token=2.1699, perplexity_token=8.7578]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:09<02:20,  2.66it/s, acc_step=1/1, ce_loss_token=2.1698, perplexity_token=8.7569]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:09<02:21,  2.63it/s, acc_step=1/1, ce_loss_token=2.1697, perplexity_token=8.7560]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:09<02:23,  2.60it/s, acc_step=1/1, ce_loss_token=2.1696, perplexity_token=8.7550]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:10<02:22,  2.62it/s, acc_step=1/1, ce_loss_token=2.1695, perplexity_token=8.7541]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:10<02:21,  2.63it/s, acc_step=1/1, ce_loss_token=2.1694, perplexity_token=8.7531]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:10<02:11,  2.82it/s, acc_step=1/1, ce_loss_token=2.1694, perplexity_token=8.7533]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:11<02:12,  2.79it/s, acc_step=1/1, ce_loss_token=2.1693, perplexity_token=8.7525]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:11<02:13,  2.76it/s, acc_step=1/1, ce_loss_token=2.1692, perplexity_token=8.7515]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:12<02:15,  2.72it/s, acc_step=1/1, ce_loss_token=2.1691, perplexity_token=8.7505]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:12<02:05,  2.92it/s, acc_step=1/1, ce_loss_token=2.1692, perplexity_token=8.7513]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:12<02:09,  2.83it/s, acc_step=1/1, ce_loss_token=2.1691, perplexity_token=8.7504]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:13<02:02,  2.96it/s, acc_step=1/1, ce_loss_token=2.1692, perplexity_token=8.7510]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:13<01:58,  3.06it/s, acc_step=1/1, ce_loss_token=2.1693, perplexity_token=8.7517]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:13<02:01,  2.97it/s, acc_step=1/1, ce_loss_token=2.1692, perplexity_token=8.7509]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:14<02:07,  2.84it/s, acc_step=1/1, ce_loss_token=2.1690, perplexity_token=8.7499]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:14<02:04,  2.90it/s, acc_step=1/1, ce_loss_token=2.1689, perplexity_token=8.7488]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:14<02:01,  2.96it/s, acc_step=1/1, ce_loss_token=2.1690, perplexity_token=8.7491]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:15<02:10,  2.74it/s, acc_step=1/1, ce_loss_token=2.1688, perplexity_token=8.7481]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:15<02:14,  2.65it/s, acc_step=1/1, ce_loss_token=2.1687, perplexity_token=8.7472]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:15<02:07,  2.80it/s, acc_step=1/1, ce_loss_token=2.1687, perplexity_token=8.7472]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:16<02:04,  2.84it/s, acc_step=1/1, ce_loss_token=2.1686, perplexity_token=8.7463]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:16<02:05,  2.82it/s, acc_step=1/1, ce_loss_token=2.1685, perplexity_token=8.7453]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:16<02:09,  2.72it/s, acc_step=1/1, ce_loss_token=2.1684, perplexity_token=8.7444]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:17<02:09,  2.72it/s, acc_step=1/1, ce_loss_token=2.1683, perplexity_token=8.7435]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:17<02:07,  2.75it/s, acc_step=1/1, ce_loss_token=2.1682, perplexity_token=8.7425]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:18<02:04,  2.82it/s, acc_step=1/1, ce_loss_token=2.1681, perplexity_token=8.7417]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:18<02:06,  2.76it/s, acc_step=1/1, ce_loss_token=2.1680, perplexity_token=8.7407]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:18<02:13,  2.60it/s, acc_step=1/1, ce_loss_token=2.1679, perplexity_token=8.7399]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:19<02:04,  2.78it/s, acc_step=1/1, ce_loss_token=2.1680, perplexity_token=8.7406]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:19<02:09,  2.67it/s, acc_step=1/1, ce_loss_token=2.1679, perplexity_token=8.7399]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:19<02:14,  2.57it/s, acc_step=1/1, ce_loss_token=2.1678, perplexity_token=8.7390]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:20<02:21,  2.42it/s, acc_step=1/1, ce_loss_token=2.1677, perplexity_token=8.7381]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:20<02:14,  2.55it/s, acc_step=1/1, ce_loss_token=2.1676, perplexity_token=8.7373]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:21<02:16,  2.51it/s, acc_step=1/1, ce_loss_token=2.1675, perplexity_token=8.7364]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:21<02:11,  2.59it/s, acc_step=1/1, ce_loss_token=2.1674, perplexity_token=8.7355]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:21<02:00,  2.81it/s, acc_step=1/1, ce_loss_token=2.1674, perplexity_token=8.7357]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:22<01:55,  2.95it/s, acc_step=1/1, ce_loss_token=2.1674, perplexity_token=8.7357]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:22<02:08,  2.63it/s, acc_step=1/1, ce_loss_token=2.1673, perplexity_token=8.7348]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:23<02:13,  2.52it/s, acc_step=1/1, ce_loss_token=2.1672, perplexity_token=8.7340]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:23<02:10,  2.58it/s, acc_step=1/1, ce_loss_token=2.1671, perplexity_token=8.7332]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:23<01:59,  2.81it/s, acc_step=1/1, ce_loss_token=2.1671, perplexity_token=8.7333]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:24<02:09,  2.58it/s, acc_step=1/1, ce_loss_token=2.1670, perplexity_token=8.7324]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:24<02:05,  2.65it/s, acc_step=1/1, ce_loss_token=2.1671, perplexity_token=8.7325]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:24<02:09,  2.57it/s, acc_step=1/1, ce_loss_token=2.1669, perplexity_token=8.7314]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:25<02:06,  2.62it/s, acc_step=1/1, ce_loss_token=2.1668, perplexity_token=8.7304]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:25<02:07,  2.59it/s, acc_step=1/1, ce_loss_token=2.1667, perplexity_token=8.7296]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:26<02:06,  2.60it/s, acc_step=1/1, ce_loss_token=2.1666, perplexity_token=8.7287]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:26<02:03,  2.65it/s, acc_step=1/1, ce_loss_token=2.1665, perplexity_token=8.7276]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:26<02:00,  2.71it/s, acc_step=1/1, ce_loss_token=2.1664, perplexity_token=8.7266]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:27<01:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.1663, perplexity_token=8.7256]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:27<01:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.1662, perplexity_token=8.7247]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:27<01:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.1660, perplexity_token=8.7237]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:28<01:59,  2.70it/s, acc_step=1/1, ce_loss_token=2.1659, perplexity_token=8.7227]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:28<02:06,  2.55it/s, acc_step=1/1, ce_loss_token=2.1658, perplexity_token=8.7218]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:29<02:09,  2.48it/s, acc_step=1/1, ce_loss_token=2.1657, perplexity_token=8.7210]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:29<01:56,  2.74it/s, acc_step=1/1, ce_loss_token=2.1657, perplexity_token=8.7209]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:29<02:03,  2.58it/s, acc_step=1/1, ce_loss_token=2.1656, perplexity_token=8.7200]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:30<02:02,  2.60it/s, acc_step=1/1, ce_loss_token=2.1655, perplexity_token=8.7191]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:30<01:54,  2.77it/s, acc_step=1/1, ce_loss_token=2.1655, perplexity_token=8.7193]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:30<01:54,  2.76it/s, acc_step=1/1, ce_loss_token=2.1654, perplexity_token=8.7182]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:31<02:03,  2.54it/s, acc_step=1/1, ce_loss_token=2.1653, perplexity_token=8.7173]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:31<02:02,  2.56it/s, acc_step=1/1, ce_loss_token=2.1652, perplexity_token=8.7163]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:32<02:00,  2.59it/s, acc_step=1/1, ce_loss_token=2.1651, perplexity_token=8.7154]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:32<01:53,  2.75it/s, acc_step=1/1, ce_loss_token=2.1651, perplexity_token=8.7155]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:32<01:53,  2.74it/s, acc_step=1/1, ce_loss_token=2.1650, perplexity_token=8.7146]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:33<01:52,  2.76it/s, acc_step=1/1, ce_loss_token=2.1649, perplexity_token=8.7137]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:33<01:52,  2.74it/s, acc_step=1/1, ce_loss_token=2.1648, perplexity_token=8.7128]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:33<01:52,  2.73it/s, acc_step=1/1, ce_loss_token=2.1647, perplexity_token=8.7119]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:34<01:54,  2.67it/s, acc_step=1/1, ce_loss_token=2.1646, perplexity_token=8.7111]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:34<01:53,  2.69it/s, acc_step=1/1, ce_loss_token=2.1645, perplexity_token=8.7102]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:34<01:51,  2.74it/s, acc_step=1/1, ce_loss_token=2.1644, perplexity_token=8.7093]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:35<01:50,  2.74it/s, acc_step=1/1, ce_loss_token=2.1643, perplexity_token=8.7084]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:35<01:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.1642, perplexity_token=8.7076]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:36<01:53,  2.65it/s, acc_step=1/1, ce_loss_token=2.1641, perplexity_token=8.7066]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:36<01:44,  2.88it/s, acc_step=1/1, ce_loss_token=2.1641, perplexity_token=8.7068]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:36<01:46,  2.81it/s, acc_step=1/1, ce_loss_token=2.1640, perplexity_token=8.7059]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:37<01:46,  2.81it/s, acc_step=1/1, ce_loss_token=2.1639, perplexity_token=8.7051]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:37<01:47,  2.77it/s, acc_step=1/1, ce_loss_token=2.1638, perplexity_token=8.7042]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:37<01:46,  2.78it/s, acc_step=1/1, ce_loss_token=2.1637, perplexity_token=8.7033]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:38<01:48,  2.73it/s, acc_step=1/1, ce_loss_token=2.1636, perplexity_token=8.7026]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:38<01:45,  2.79it/s, acc_step=1/1, ce_loss_token=2.1635, perplexity_token=8.7016]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:38<01:46,  2.75it/s, acc_step=1/1, ce_loss_token=2.1634, perplexity_token=8.7008]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:39<01:47,  2.71it/s, acc_step=1/1, ce_loss_token=2.1633, perplexity_token=8.6998]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:39<01:47,  2.73it/s, acc_step=1/1, ce_loss_token=2.1632, perplexity_token=8.6989]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:39<01:40,  2.90it/s, acc_step=1/1, ce_loss_token=2.1632, perplexity_token=8.6989]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:40<01:42,  2.82it/s, acc_step=1/1, ce_loss_token=2.1631, perplexity_token=8.6980]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:40<01:43,  2.80it/s, acc_step=1/1, ce_loss_token=2.1630, perplexity_token=8.6971]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:41<01:36,  2.99it/s, acc_step=1/1, ce_loss_token=2.1631, perplexity_token=8.6978]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:41<01:32,  3.12it/s, acc_step=1/1, ce_loss_token=2.1631, perplexity_token=8.6979]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:41<01:36,  2.98it/s, acc_step=1/1, ce_loss_token=2.1630, perplexity_token=8.6970]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:42<01:39,  2.87it/s, acc_step=1/1, ce_loss_token=2.1629, perplexity_token=8.6961]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:42<01:41,  2.80it/s, acc_step=1/1, ce_loss_token=2.1628, perplexity_token=8.6954]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:42<01:34,  3.00it/s, acc_step=1/1, ce_loss_token=2.1628, perplexity_token=8.6954]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:42<01:28,  3.18it/s, acc_step=1/1, ce_loss_token=2.1629, perplexity_token=8.6961]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:43<01:33,  3.01it/s, acc_step=1/1, ce_loss_token=2.1628, perplexity_token=8.6951]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:43<01:35,  2.93it/s, acc_step=1/1, ce_loss_token=2.1627, perplexity_token=8.6942]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:44<01:40,  2.79it/s, acc_step=1/1, ce_loss_token=2.1625, perplexity_token=8.6932]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:44<01:39,  2.80it/s, acc_step=1/1, ce_loss_token=2.1624, perplexity_token=8.6923]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:44<01:44,  2.64it/s, acc_step=1/1, ce_loss_token=2.1623, perplexity_token=8.6915]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:45<01:47,  2.56it/s, acc_step=1/1, ce_loss_token=2.1622, perplexity_token=8.6905]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:45<01:49,  2.52it/s, acc_step=1/1, ce_loss_token=2.1621, perplexity_token=8.6897]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:46<01:46,  2.58it/s, acc_step=1/1, ce_loss_token=2.1621, perplexity_token=8.6890]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:46<01:45,  2.58it/s, acc_step=1/1, ce_loss_token=2.1619, perplexity_token=8.6879]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:46<01:46,  2.56it/s, acc_step=1/1, ce_loss_token=2.1618, perplexity_token=8.6870]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:47<01:42,  2.66it/s, acc_step=1/1, ce_loss_token=2.1617, perplexity_token=8.6862]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:47<01:40,  2.68it/s, acc_step=1/1, ce_loss_token=2.1616, perplexity_token=8.6852]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:47<01:42,  2.63it/s, acc_step=1/1, ce_loss_token=2.1616, perplexity_token=8.6846]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:48<01:42,  2.61it/s, acc_step=1/1, ce_loss_token=2.1614, perplexity_token=8.6837]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:48<01:48,  2.45it/s, acc_step=1/1, ce_loss_token=2.1614, perplexity_token=8.6829]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:49<01:31,  2.90it/s, acc_step=1/1, ce_loss_token=2.1615, perplexity_token=8.6846]

torch.Size([256, 323, 35]) torch.Size([256, 323])
torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:49<01:33,  2.83it/s, acc_step=1/1, ce_loss_token=2.1614, perplexity_token=8.6837]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:50<01:34,  2.78it/s, acc_step=1/1, ce_loss_token=2.1613, perplexity_token=8.6826]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:50<01:43,  2.53it/s, acc_step=1/1, ce_loss_token=2.1612, perplexity_token=8.6819]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:51<01:41,  2.58it/s, acc_step=1/1, ce_loss_token=2.1611, perplexity_token=8.6808]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:51<01:41,  2.56it/s, acc_step=1/1, ce_loss_token=2.1610, perplexity_token=8.6800]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:51<01:42,  2.52it/s, acc_step=1/1, ce_loss_token=2.1609, perplexity_token=8.6791]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:52<01:32,  2.78it/s, acc_step=1/1, ce_loss_token=2.1609, perplexity_token=8.6791]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:52<01:37,  2.62it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6783]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:52<01:31,  2.80it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6783]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:53<01:23,  3.04it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6785]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:53<01:23,  3.03it/s, acc_step=1/1, ce_loss_token=2.1609, perplexity_token=8.6792]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:53<01:24,  3.01it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6783]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:54<01:32,  2.71it/s, acc_step=1/1, ce_loss_token=2.1607, perplexity_token=8.6774]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:54<01:20,  3.12it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6779]

torch.Size([256, 304, 35]) torch.Size([256, 304])
torch.Size([256, 268, 35]) torch.Size([256, 268])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:54<01:08,  3.65it/s, acc_step=1/1, ce_loss_token=2.1610, perplexity_token=8.6798]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:55<01:15,  3.27it/s, acc_step=1/1, ce_loss_token=2.1609, perplexity_token=8.6790]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:55<01:21,  3.05it/s, acc_step=1/1, ce_loss_token=2.1608, perplexity_token=8.6782]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:56<01:24,  2.92it/s, acc_step=1/1, ce_loss_token=2.1607, perplexity_token=8.6773]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:56<01:25,  2.87it/s, acc_step=1/1, ce_loss_token=2.1606, perplexity_token=8.6764]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:56<01:26,  2.83it/s, acc_step=1/1, ce_loss_token=2.1605, perplexity_token=8.6757]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:57<01:27,  2.77it/s, acc_step=1/1, ce_loss_token=2.1606, perplexity_token=8.6762]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:57<01:27,  2.76it/s, acc_step=1/1, ce_loss_token=2.1605, perplexity_token=8.6752]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:58<01:37,  2.46it/s, acc_step=1/1, ce_loss_token=2.1604, perplexity_token=8.6744]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:58<01:34,  2.54it/s, acc_step=1/1, ce_loss_token=2.1603, perplexity_token=8.6736]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:58<01:31,  2.62it/s, acc_step=1/1, ce_loss_token=2.1602, perplexity_token=8.6728]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:59<01:33,  2.55it/s, acc_step=1/1, ce_loss_token=2.1601, perplexity_token=8.6719]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:59<01:31,  2.60it/s, acc_step=1/1, ce_loss_token=2.1600, perplexity_token=8.6710]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:59<01:31,  2.59it/s, acc_step=1/1, ce_loss_token=2.1599, perplexity_token=8.6702]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [05:00<01:30,  2.60it/s, acc_step=1/1, ce_loss_token=2.1598, perplexity_token=8.6693]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [05:00<01:29,  2.62it/s, acc_step=1/1, ce_loss_token=2.1597, perplexity_token=8.6685]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [05:01<01:31,  2.55it/s, acc_step=1/1, ce_loss_token=2.1596, perplexity_token=8.6675]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [05:01<01:28,  2.62it/s, acc_step=1/1, ce_loss_token=2.1595, perplexity_token=8.6666]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [05:01<01:28,  2.61it/s, acc_step=1/1, ce_loss_token=2.1594, perplexity_token=8.6657]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [05:02<01:33,  2.47it/s, acc_step=1/1, ce_loss_token=2.1593, perplexity_token=8.6647]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [05:02<01:31,  2.50it/s, acc_step=1/1, ce_loss_token=2.1592, perplexity_token=8.6639]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [05:03<01:30,  2.52it/s, acc_step=1/1, ce_loss_token=2.1591, perplexity_token=8.6630]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [05:03<01:27,  2.58it/s, acc_step=1/1, ce_loss_token=2.1590, perplexity_token=8.6622]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [05:03<01:24,  2.67it/s, acc_step=1/1, ce_loss_token=2.1589, perplexity_token=8.6613]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:04<01:24,  2.67it/s, acc_step=1/1, ce_loss_token=2.1588, perplexity_token=8.6603]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:04<01:17,  2.91it/s, acc_step=1/1, ce_loss_token=2.1588, perplexity_token=8.6606]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:04<01:22,  2.71it/s, acc_step=1/1, ce_loss_token=2.1587, perplexity_token=8.6597]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:05<01:23,  2.66it/s, acc_step=1/1, ce_loss_token=2.1586, perplexity_token=8.6590]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:05<01:26,  2.55it/s, acc_step=1/1, ce_loss_token=2.1585, perplexity_token=8.6581]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:06<01:19,  2.76it/s, acc_step=1/1, ce_loss_token=2.1585, perplexity_token=8.6582]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:06<01:21,  2.69it/s, acc_step=1/1, ce_loss_token=2.1584, perplexity_token=8.6574]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:06<01:20,  2.71it/s, acc_step=1/1, ce_loss_token=2.1583, perplexity_token=8.6566]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:07<01:19,  2.72it/s, acc_step=1/1, ce_loss_token=2.1582, perplexity_token=8.6557]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:07<01:23,  2.59it/s, acc_step=1/1, ce_loss_token=2.1581, perplexity_token=8.6549]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:07<01:25,  2.53it/s, acc_step=1/1, ce_loss_token=2.1580, perplexity_token=8.6539]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:08<01:27,  2.45it/s, acc_step=1/1, ce_loss_token=2.1580, perplexity_token=8.6541]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:08<01:23,  2.55it/s, acc_step=1/1, ce_loss_token=2.1579, perplexity_token=8.6532]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:09<01:20,  2.62it/s, acc_step=1/1, ce_loss_token=2.1578, perplexity_token=8.6521]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:09<01:13,  2.86it/s, acc_step=1/1, ce_loss_token=2.1578, perplexity_token=8.6523]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:09<01:15,  2.77it/s, acc_step=1/1, ce_loss_token=2.1577, perplexity_token=8.6515]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:10<01:13,  2.85it/s, acc_step=1/1, ce_loss_token=2.1576, perplexity_token=8.6506]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:10<01:14,  2.79it/s, acc_step=1/1, ce_loss_token=2.1575, perplexity_token=8.6497]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:10<01:14,  2.78it/s, acc_step=1/1, ce_loss_token=2.1574, perplexity_token=8.6488]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:11<01:14,  2.75it/s, acc_step=1/1, ce_loss_token=2.1573, perplexity_token=8.6480]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:11<01:17,  2.64it/s, acc_step=1/1, ce_loss_token=2.1572, perplexity_token=8.6472]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:12<01:25,  2.38it/s, acc_step=1/1, ce_loss_token=2.1571, perplexity_token=8.6463]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:12<01:17,  2.61it/s, acc_step=1/1, ce_loss_token=2.1572, perplexity_token=8.6465]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:12<01:11,  2.82it/s, acc_step=1/1, ce_loss_token=2.1572, perplexity_token=8.6466]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:13<01:13,  2.72it/s, acc_step=1/1, ce_loss_token=2.1571, perplexity_token=8.6457]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:13<01:10,  2.85it/s, acc_step=1/1, ce_loss_token=2.1571, perplexity_token=8.6457]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:13<01:11,  2.80it/s, acc_step=1/1, ce_loss_token=2.1569, perplexity_token=8.6447]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:14<01:12,  2.71it/s, acc_step=1/1, ce_loss_token=2.1568, perplexity_token=8.6439]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:14<01:13,  2.68it/s, acc_step=1/1, ce_loss_token=2.1567, perplexity_token=8.6430]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:14<01:08,  2.86it/s, acc_step=1/1, ce_loss_token=2.1568, perplexity_token=8.6435]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:15<01:11,  2.73it/s, acc_step=1/1, ce_loss_token=2.1567, perplexity_token=8.6426]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:15<01:10,  2.74it/s, acc_step=1/1, ce_loss_token=2.1566, perplexity_token=8.6418]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:16<01:11,  2.70it/s, acc_step=1/1, ce_loss_token=2.1565, perplexity_token=8.6409]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:16<01:11,  2.69it/s, acc_step=1/1, ce_loss_token=2.1564, perplexity_token=8.6401]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:16<01:10,  2.70it/s, acc_step=1/1, ce_loss_token=2.1563, perplexity_token=8.6392]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:17<01:10,  2.68it/s, acc_step=1/1, ce_loss_token=2.1562, perplexity_token=8.6382]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:17<01:09,  2.71it/s, acc_step=1/1, ce_loss_token=2.1561, perplexity_token=8.6375]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:17<01:09,  2.72it/s, acc_step=1/1, ce_loss_token=2.1560, perplexity_token=8.6367]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:18<00:54,  3.40it/s, acc_step=1/1, ce_loss_token=2.1566, perplexity_token=8.6414]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:18<00:57,  3.23it/s, acc_step=1/1, ce_loss_token=2.1565, perplexity_token=8.6404]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:19<01:03,  2.91it/s, acc_step=1/1, ce_loss_token=2.1564, perplexity_token=8.6396]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:19<01:04,  2.86it/s, acc_step=1/1, ce_loss_token=2.1564, perplexity_token=8.6397]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:19<01:07,  2.70it/s, acc_step=1/1, ce_loss_token=2.1563, perplexity_token=8.6389]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:20<01:08,  2.65it/s, acc_step=1/1, ce_loss_token=2.1562, perplexity_token=8.6381]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:20<01:02,  2.87it/s, acc_step=1/1, ce_loss_token=2.1562, perplexity_token=8.6385]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:20<01:03,  2.84it/s, acc_step=1/1, ce_loss_token=2.1561, perplexity_token=8.6377]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:21<01:05,  2.74it/s, acc_step=1/1, ce_loss_token=2.1560, perplexity_token=8.6369]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:21<01:08,  2.60it/s, acc_step=1/1, ce_loss_token=2.1560, perplexity_token=8.6361]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:22<01:09,  2.55it/s, acc_step=1/1, ce_loss_token=2.1559, perplexity_token=8.6353]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:22<01:04,  2.71it/s, acc_step=1/1, ce_loss_token=2.1559, perplexity_token=8.6354]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:22<01:06,  2.64it/s, acc_step=1/1, ce_loss_token=2.1558, perplexity_token=8.6346]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:23<01:04,  2.68it/s, acc_step=1/1, ce_loss_token=2.1557, perplexity_token=8.6337]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:23<01:04,  2.67it/s, acc_step=1/1, ce_loss_token=2.1556, perplexity_token=8.6329]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:24<01:04,  2.65it/s, acc_step=1/1, ce_loss_token=2.1555, perplexity_token=8.6321]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:24<00:55,  3.06it/s, acc_step=1/1, ce_loss_token=2.1556, perplexity_token=8.6333]

torch.Size([256, 287, 35]) torch.Size([256, 287])
torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:25<00:57,  2.92it/s, acc_step=1/1, ce_loss_token=2.1555, perplexity_token=8.6325]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:25<00:59,  2.79it/s, acc_step=1/1, ce_loss_token=2.1554, perplexity_token=8.6316]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:25<00:59,  2.78it/s, acc_step=1/1, ce_loss_token=2.1553, perplexity_token=8.6307]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:26<01:00,  2.73it/s, acc_step=1/1, ce_loss_token=2.1552, perplexity_token=8.6299]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:26<01:00,  2.71it/s, acc_step=1/1, ce_loss_token=2.1551, perplexity_token=8.6291]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:26<00:57,  2.85it/s, acc_step=1/1, ce_loss_token=2.1552, perplexity_token=8.6296]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:27<00:56,  2.84it/s, acc_step=1/1, ce_loss_token=2.1551, perplexity_token=8.6288]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:27<00:58,  2.74it/s, acc_step=1/1, ce_loss_token=2.1550, perplexity_token=8.6279]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:27<00:58,  2.73it/s, acc_step=1/1, ce_loss_token=2.1549, perplexity_token=8.6269]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:28<00:59,  2.68it/s, acc_step=1/1, ce_loss_token=2.1548, perplexity_token=8.6261]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:28<01:02,  2.51it/s, acc_step=1/1, ce_loss_token=2.1547, perplexity_token=8.6254]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:29<01:07,  2.32it/s, acc_step=1/1, ce_loss_token=2.1546, perplexity_token=8.6245]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:29<01:02,  2.48it/s, acc_step=1/1, ce_loss_token=2.1546, perplexity_token=8.6246]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:30<01:00,  2.58it/s, acc_step=1/1, ce_loss_token=2.1545, perplexity_token=8.6237]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:30<00:57,  2.66it/s, acc_step=1/1, ce_loss_token=2.1544, perplexity_token=8.6229]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:30<00:57,  2.68it/s, acc_step=1/1, ce_loss_token=2.1543, perplexity_token=8.6222]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:31<00:58,  2.60it/s, acc_step=1/1, ce_loss_token=2.1542, perplexity_token=8.6213]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:31<00:59,  2.55it/s, acc_step=1/1, ce_loss_token=2.1541, perplexity_token=8.6204]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:31<00:57,  2.61it/s, acc_step=1/1, ce_loss_token=2.1540, perplexity_token=8.6196]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:32<00:55,  2.67it/s, acc_step=1/1, ce_loss_token=2.1540, perplexity_token=8.6196]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:32<00:59,  2.47it/s, acc_step=1/1, ce_loss_token=2.1539, perplexity_token=8.6187]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:33<00:58,  2.50it/s, acc_step=1/1, ce_loss_token=2.1538, perplexity_token=8.6180]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:33<00:57,  2.54it/s, acc_step=1/1, ce_loss_token=2.1537, perplexity_token=8.6171]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:34<00:45,  3.15it/s, acc_step=1/1, ce_loss_token=2.1540, perplexity_token=8.6192]

torch.Size([256, 308, 35]) torch.Size([256, 308])
torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:34<00:46,  3.05it/s, acc_step=1/1, ce_loss_token=2.1539, perplexity_token=8.6184]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:34<00:48,  2.95it/s, acc_step=1/1, ce_loss_token=2.1538, perplexity_token=8.6177]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:35<00:51,  2.71it/s, acc_step=1/1, ce_loss_token=2.1537, perplexity_token=8.6169]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:35<00:52,  2.69it/s, acc_step=1/1, ce_loss_token=2.1536, perplexity_token=8.6161]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:35<00:48,  2.89it/s, acc_step=1/1, ce_loss_token=2.1536, perplexity_token=8.6160]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:36<00:50,  2.72it/s, acc_step=1/1, ce_loss_token=2.1535, perplexity_token=8.6153]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:36<00:51,  2.68it/s, acc_step=1/1, ce_loss_token=2.1534, perplexity_token=8.6144]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:37<00:51,  2.67it/s, acc_step=1/1, ce_loss_token=2.1533, perplexity_token=8.6136]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:37<00:50,  2.65it/s, acc_step=1/1, ce_loss_token=2.1533, perplexity_token=8.6129]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:37<00:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.1532, perplexity_token=8.6120]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:38<00:50,  2.63it/s, acc_step=1/1, ce_loss_token=2.1531, perplexity_token=8.6112]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:38<00:50,  2.61it/s, acc_step=1/1, ce_loss_token=2.1530, perplexity_token=8.6103]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:38<00:50,  2.61it/s, acc_step=1/1, ce_loss_token=2.1529, perplexity_token=8.6094]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:39<00:50,  2.59it/s, acc_step=1/1, ce_loss_token=2.1528, perplexity_token=8.6086]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:39<00:55,  2.34it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6078]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:40<00:52,  2.43it/s, acc_step=1/1, ce_loss_token=2.1526, perplexity_token=8.6070]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:40<00:43,  2.88it/s, acc_step=1/1, ce_loss_token=2.1528, perplexity_token=8.6092]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:41<00:41,  3.00it/s, acc_step=1/1, ce_loss_token=2.1528, perplexity_token=8.6093]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:41<00:43,  2.87it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6084]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:41<00:43,  2.86it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6077]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:42<00:36,  3.28it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6080]

torch.Size([256, 300, 35]) torch.Size([256, 300])
torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:42<00:37,  3.17it/s, acc_step=1/1, ce_loss_token=2.1526, perplexity_token=8.6073]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:43<00:35,  3.37it/s, acc_step=1/1, ce_loss_token=2.1528, perplexity_token=8.6091]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:43<00:37,  3.16it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6084]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:43<00:39,  2.99it/s, acc_step=1/1, ce_loss_token=2.1527, perplexity_token=8.6077]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:44<00:41,  2.83it/s, acc_step=1/1, ce_loss_token=2.1526, perplexity_token=8.6068]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:44<00:41,  2.77it/s, acc_step=1/1, ce_loss_token=2.1525, perplexity_token=8.6060]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:44<00:41,  2.77it/s, acc_step=1/1, ce_loss_token=2.1524, perplexity_token=8.6052]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:45<00:40,  2.76it/s, acc_step=1/1, ce_loss_token=2.1523, perplexity_token=8.6044]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:45<00:40,  2.76it/s, acc_step=1/1, ce_loss_token=2.1522, perplexity_token=8.6036]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:45<00:37,  2.93it/s, acc_step=1/1, ce_loss_token=2.1523, perplexity_token=8.6043]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:46<00:39,  2.78it/s, acc_step=1/1, ce_loss_token=2.1522, perplexity_token=8.6033]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:46<00:37,  2.93it/s, acc_step=1/1, ce_loss_token=2.1522, perplexity_token=8.6038]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:46<00:37,  2.86it/s, acc_step=1/1, ce_loss_token=2.1521, perplexity_token=8.6031]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:47<00:39,  2.74it/s, acc_step=1/1, ce_loss_token=2.1520, perplexity_token=8.6024]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:47<00:36,  2.90it/s, acc_step=1/1, ce_loss_token=2.1521, perplexity_token=8.6025]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:48<00:37,  2.80it/s, acc_step=1/1, ce_loss_token=2.1520, perplexity_token=8.6016]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:48<00:40,  2.57it/s, acc_step=1/1, ce_loss_token=2.1519, perplexity_token=8.6010]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:48<00:39,  2.61it/s, acc_step=1/1, ce_loss_token=2.1518, perplexity_token=8.6003]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:49<00:39,  2.60it/s, acc_step=1/1, ce_loss_token=2.1517, perplexity_token=8.5996]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:49<00:40,  2.47it/s, acc_step=1/1, ce_loss_token=2.1516, perplexity_token=8.5986]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:50<00:36,  2.71it/s, acc_step=1/1, ce_loss_token=2.1517, perplexity_token=8.5992]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:50<00:36,  2.74it/s, acc_step=1/1, ce_loss_token=2.1516, perplexity_token=8.5984]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:50<00:35,  2.77it/s, acc_step=1/1, ce_loss_token=2.1515, perplexity_token=8.5975]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:51<00:34,  2.77it/s, acc_step=1/1, ce_loss_token=2.1514, perplexity_token=8.5967]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:51<00:36,  2.66it/s, acc_step=1/1, ce_loss_token=2.1513, perplexity_token=8.5959]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:51<00:37,  2.55it/s, acc_step=1/1, ce_loss_token=2.1512, perplexity_token=8.5952]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:52<00:37,  2.54it/s, acc_step=1/1, ce_loss_token=2.1511, perplexity_token=8.5944]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:52<00:36,  2.58it/s, acc_step=1/1, ce_loss_token=2.1510, perplexity_token=8.5936]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:53<00:39,  2.31it/s, acc_step=1/1, ce_loss_token=2.1509, perplexity_token=8.5929]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:53<00:37,  2.40it/s, acc_step=1/1, ce_loss_token=2.1508, perplexity_token=8.5921]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:54<00:40,  2.21it/s, acc_step=1/1, ce_loss_token=2.1508, perplexity_token=8.5914]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:54<00:38,  2.29it/s, acc_step=1/1, ce_loss_token=2.1507, perplexity_token=8.5906]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:54<00:37,  2.37it/s, acc_step=1/1, ce_loss_token=2.1506, perplexity_token=8.5897]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:55<00:35,  2.47it/s, acc_step=1/1, ce_loss_token=2.1505, perplexity_token=8.5889]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:55<00:34,  2.50it/s, acc_step=1/1, ce_loss_token=2.1504, perplexity_token=8.5881]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:56<00:38,  2.21it/s, acc_step=1/1, ce_loss_token=2.1503, perplexity_token=8.5873]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:56<00:37,  2.24it/s, acc_step=1/1, ce_loss_token=2.1503, perplexity_token=8.5874]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:57<00:35,  2.34it/s, acc_step=1/1, ce_loss_token=2.1502, perplexity_token=8.5866]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:57<00:31,  2.62it/s, acc_step=1/1, ce_loss_token=2.1502, perplexity_token=8.5868]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:57<00:33,  2.45it/s, acc_step=1/1, ce_loss_token=2.1501, perplexity_token=8.5860]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:58<00:33,  2.36it/s, acc_step=1/1, ce_loss_token=2.1500, perplexity_token=8.5852]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:58<00:30,  2.61it/s, acc_step=1/1, ce_loss_token=2.1500, perplexity_token=8.5851]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:59<00:30,  2.57it/s, acc_step=1/1, ce_loss_token=2.1499, perplexity_token=8.5844]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:59<00:30,  2.49it/s, acc_step=1/1, ce_loss_token=2.1499, perplexity_token=8.5836]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:59<00:29,  2.56it/s, acc_step=1/1, ce_loss_token=2.1497, perplexity_token=8.5827]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [06:00<00:28,  2.59it/s, acc_step=1/1, ce_loss_token=2.1496, perplexity_token=8.5818]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [06:00<00:28,  2.61it/s, acc_step=1/1, ce_loss_token=2.1495, perplexity_token=8.5809]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [06:00<00:25,  2.84it/s, acc_step=1/1, ce_loss_token=2.1496, perplexity_token=8.5811]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [06:01<00:26,  2.76it/s, acc_step=1/1, ce_loss_token=2.1495, perplexity_token=8.5803]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [06:01<00:26,  2.70it/s, acc_step=1/1, ce_loss_token=2.1494, perplexity_token=8.5795]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [06:02<00:26,  2.59it/s, acc_step=1/1, ce_loss_token=2.1493, perplexity_token=8.5788]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [06:02<00:26,  2.58it/s, acc_step=1/1, ce_loss_token=2.1492, perplexity_token=8.5780]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [06:02<00:24,  2.75it/s, acc_step=1/1, ce_loss_token=2.1492, perplexity_token=8.5783]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [06:03<00:24,  2.74it/s, acc_step=1/1, ce_loss_token=2.1491, perplexity_token=8.5775]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [06:03<00:24,  2.67it/s, acc_step=1/1, ce_loss_token=2.1490, perplexity_token=8.5767]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:03<00:23,  2.80it/s, acc_step=1/1, ce_loss_token=2.1491, perplexity_token=8.5768]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:04<00:21,  2.98it/s, acc_step=1/1, ce_loss_token=2.1491, perplexity_token=8.5767]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:04<00:21,  2.96it/s, acc_step=1/1, ce_loss_token=2.1490, perplexity_token=8.5759]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:04<00:21,  2.85it/s, acc_step=1/1, ce_loss_token=2.1489, perplexity_token=8.5751]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:05<00:23,  2.58it/s, acc_step=1/1, ce_loss_token=2.1488, perplexity_token=8.5743]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:05<00:23,  2.58it/s, acc_step=1/1, ce_loss_token=2.1487, perplexity_token=8.5736]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:06<00:22,  2.62it/s, acc_step=1/1, ce_loss_token=2.1486, perplexity_token=8.5728]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:06<00:23,  2.44it/s, acc_step=1/1, ce_loss_token=2.1485, perplexity_token=8.5720]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:06<00:22,  2.51it/s, acc_step=1/1, ce_loss_token=2.1484, perplexity_token=8.5712]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:07<00:22,  2.54it/s, acc_step=1/1, ce_loss_token=2.1483, perplexity_token=8.5704]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:07<00:21,  2.52it/s, acc_step=1/1, ce_loss_token=2.1482, perplexity_token=8.5696]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:07<00:19,  2.70it/s, acc_step=1/1, ce_loss_token=2.1482, perplexity_token=8.5697]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:08<00:23,  2.24it/s, acc_step=1/1, ce_loss_token=2.1482, perplexity_token=8.5690]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:09<00:22,  2.33it/s, acc_step=1/1, ce_loss_token=2.1481, perplexity_token=8.5682]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:09<00:20,  2.44it/s, acc_step=1/1, ce_loss_token=2.1480, perplexity_token=8.5673]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:09<00:20,  2.48it/s, acc_step=1/1, ce_loss_token=2.1479, perplexity_token=8.5665]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:10<00:19,  2.55it/s, acc_step=1/1, ce_loss_token=2.1478, perplexity_token=8.5657]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:10<00:18,  2.56it/s, acc_step=1/1, ce_loss_token=2.1477, perplexity_token=8.5649]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:10<00:18,  2.55it/s, acc_step=1/1, ce_loss_token=2.1476, perplexity_token=8.5642]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:11<00:17,  2.56it/s, acc_step=1/1, ce_loss_token=2.1475, perplexity_token=8.5633]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:11<00:17,  2.61it/s, acc_step=1/1, ce_loss_token=2.1474, perplexity_token=8.5623]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:12<00:16,  2.60it/s, acc_step=1/1, ce_loss_token=2.1473, perplexity_token=8.5614]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:12<00:15,  2.83it/s, acc_step=1/1, ce_loss_token=2.1473, perplexity_token=8.5613]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:12<00:13,  3.01it/s, acc_step=1/1, ce_loss_token=2.1473, perplexity_token=8.5615]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:12<00:14,  2.92it/s, acc_step=1/1, ce_loss_token=2.1472, perplexity_token=8.5608]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:13<00:14,  2.85it/s, acc_step=1/1, ce_loss_token=2.1471, perplexity_token=8.5599]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:13<00:14,  2.64it/s, acc_step=1/1, ce_loss_token=2.1470, perplexity_token=8.5592]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:14<00:12,  3.03it/s, acc_step=1/1, ce_loss_token=2.1471, perplexity_token=8.5598]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:14<00:12,  2.82it/s, acc_step=1/1, ce_loss_token=2.1470, perplexity_token=8.5590]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:15<00:12,  2.74it/s, acc_step=1/1, ce_loss_token=2.1469, perplexity_token=8.5582]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:15<00:12,  2.68it/s, acc_step=1/1, ce_loss_token=2.1468, perplexity_token=8.5574]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:15<00:11,  2.86it/s, acc_step=1/1, ce_loss_token=2.1468, perplexity_token=8.5575]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:16<00:11,  2.86it/s, acc_step=1/1, ce_loss_token=2.1467, perplexity_token=8.5568]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:16<00:10,  2.82it/s, acc_step=1/1, ce_loss_token=2.1466, perplexity_token=8.5561]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:16<00:10,  2.79it/s, acc_step=1/1, ce_loss_token=2.1466, perplexity_token=8.5553]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:17<00:10,  2.66it/s, acc_step=1/1, ce_loss_token=2.1465, perplexity_token=8.5545]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:17<00:10,  2.59it/s, acc_step=1/1, ce_loss_token=2.1464, perplexity_token=8.5537]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:18<00:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.1463, perplexity_token=8.5529]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:18<00:10,  2.57it/s, acc_step=1/1, ce_loss_token=2.1462, perplexity_token=8.5520]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:18<00:09,  2.51it/s, acc_step=1/1, ce_loss_token=2.1461, perplexity_token=8.5513]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:19<00:09,  2.56it/s, acc_step=1/1, ce_loss_token=2.1460, perplexity_token=8.5505]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:19<00:08,  2.61it/s, acc_step=1/1, ce_loss_token=2.1459, perplexity_token=8.5497]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:20<00:08,  2.70it/s, acc_step=1/1, ce_loss_token=2.1458, perplexity_token=8.5488]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:20<00:08,  2.58it/s, acc_step=1/1, ce_loss_token=2.1457, perplexity_token=8.5481]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:20<00:07,  2.63it/s, acc_step=1/1, ce_loss_token=2.1456, perplexity_token=8.5471]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:21<00:07,  2.66it/s, acc_step=1/1, ce_loss_token=2.1455, perplexity_token=8.5464]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:21<00:06,  2.64it/s, acc_step=1/1, ce_loss_token=2.1454, perplexity_token=8.5455]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:21<00:06,  2.62it/s, acc_step=1/1, ce_loss_token=2.1453, perplexity_token=8.5447]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:22<00:04,  3.24it/s, acc_step=1/1, ce_loss_token=2.1455, perplexity_token=8.5461]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:22<00:04,  3.13it/s, acc_step=1/1, ce_loss_token=2.1454, perplexity_token=8.5453]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:23<00:04,  3.03it/s, acc_step=1/1, ce_loss_token=2.1453, perplexity_token=8.5446]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:23<00:03,  3.12it/s, acc_step=1/1, ce_loss_token=2.1453, perplexity_token=8.5446]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:23<00:03,  3.03it/s, acc_step=1/1, ce_loss_token=2.1452, perplexity_token=8.5438]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:24<00:03,  2.90it/s, acc_step=1/1, ce_loss_token=2.1451, perplexity_token=8.5429]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:24<00:03,  2.82it/s, acc_step=1/1, ce_loss_token=2.1450, perplexity_token=8.5422]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:25<00:02,  2.67it/s, acc_step=1/1, ce_loss_token=2.1449, perplexity_token=8.5414]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:25<00:02,  2.66it/s, acc_step=1/1, ce_loss_token=2.1448, perplexity_token=8.5405]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:25<00:02,  2.69it/s, acc_step=1/1, ce_loss_token=2.1447, perplexity_token=8.5397]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:26<00:01,  2.68it/s, acc_step=1/1, ce_loss_token=2.1446, perplexity_token=8.5389]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:26<00:01,  2.81it/s, acc_step=1/1, ce_loss_token=2.1446, perplexity_token=8.5389]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:26<00:01,  2.67it/s, acc_step=1/1, ce_loss_token=2.1445, perplexity_token=8.5381]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:27<00:00,  2.70it/s, acc_step=1/1, ce_loss_token=2.1444, perplexity_token=8.5373]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:27<00:00,  2.69it/s, acc_step=1/1, ce_loss_token=2.1444, perplexity_token=8.5365]

torch.Size([170, 293, 35]) torch.Size([170, 293])



Generating with greedy search...

📊 Metrics (Epoch 2):
├── TRAIN:
│   ├── ce_loss_char: 2.1443
│   ├── ce_loss_token: 2.1443
│   ├── perplexity_char: 8.5360
│   └── perplexity_token: 8.5360
├── VAL:
│   ├── ce_loss_char: 1.9567
│   ├── ce_loss_token: 1.9567
│   ├── perplexity_char: 7.0761
│   └── perplexity_token: 7.0761
└── TRAINING:
    └── learning_rate: 0.000064


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:   0%|                                                 | 1/1044 [00:00<09:57,  1.75it/s, acc_step=1/1, ce_loss_token=2.0463, perplexity_token=7.7390]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   0%|                                                 | 2/1044 [00:00<08:18,  2.09it/s, acc_step=1/1, ce_loss_token=2.0489, perplexity_token=7.7593]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<07:25,  2.34it/s, acc_step=1/1, ce_loss_token=2.0428, perplexity_token=7.7125]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:59,  2.48it/s, acc_step=1/1, ce_loss_token=2.0437, perplexity_token=7.7193]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<06:48,  2.55it/s, acc_step=1/1, ce_loss_token=2.0437, perplexity_token=7.7193]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:38,  2.60it/s, acc_step=1/1, ce_loss_token=2.0434, perplexity_token=7.7170]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:34,  2.63it/s, acc_step=1/1, ce_loss_token=2.0429, perplexity_token=7.7130]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<06:31,  2.65it/s, acc_step=1/1, ce_loss_token=2.0428, perplexity_token=7.7122]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<06:30,  2.65it/s, acc_step=1/1, ce_loss_token=2.0425, perplexity_token=7.7099]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:27,  2.67it/s, acc_step=1/1, ce_loss_token=2.0417, perplexity_token=7.7040]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<06:22,  2.70it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7163]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:23,  2.69it/s, acc_step=1/1, ce_loss_token=2.0432, perplexity_token=7.7150]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:20,  2.71it/s, acc_step=1/1, ce_loss_token=2.0432, perplexity_token=7.7150]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<06:34,  2.61it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7163]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<06:20,  2.70it/s, acc_step=1/1, ce_loss_token=2.0436, perplexity_token=7.7183]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<05:56,  2.89it/s, acc_step=1/1, ce_loss_token=2.0498, perplexity_token=7.7664]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<06:09,  2.78it/s, acc_step=1/1, ce_loss_token=2.0487, perplexity_token=7.7579]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<06:28,  2.64it/s, acc_step=1/1, ce_loss_token=2.0531, perplexity_token=7.7917]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   2%|▊                                               | 19/1044 [00:07<06:24,  2.67it/s, acc_step=1/1, ce_loss_token=2.0535, perplexity_token=7.7949]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<06:31,  2.62it/s, acc_step=1/1, ce_loss_token=2.0527, perplexity_token=7.7891]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   2%|▉                                               | 21/1044 [00:08<06:26,  2.64it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7848]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   2%|█                                               | 22/1044 [00:08<06:22,  2.67it/s, acc_step=1/1, ce_loss_token=2.0523, perplexity_token=7.7858]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   2%|█                                               | 23/1044 [00:08<06:31,  2.61it/s, acc_step=1/1, ce_loss_token=2.0519, perplexity_token=7.7825]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|█                                               | 24/1044 [00:09<06:23,  2.66it/s, acc_step=1/1, ce_loss_token=2.0514, perplexity_token=7.7785]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:36,  2.57it/s, acc_step=1/1, ce_loss_token=2.0510, perplexity_token=7.7759]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:34,  2.58it/s, acc_step=1/1, ce_loss_token=2.0504, perplexity_token=7.7713]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:25,  2.64it/s, acc_step=1/1, ce_loss_token=2.0497, perplexity_token=7.7659]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:02,  2.80it/s, acc_step=1/1, ce_loss_token=2.0530, perplexity_token=7.7911]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<05:57,  2.84it/s, acc_step=1/1, ce_loss_token=2.0525, perplexity_token=7.7872]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:02,  2.79it/s, acc_step=1/1, ce_loss_token=2.0521, perplexity_token=7.7841]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<06:03,  2.79it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7831]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<06:18,  2.67it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7837]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<05:49,  2.89it/s, acc_step=1/1, ce_loss_token=2.0542, perplexity_token=7.8004]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   3%|█▌                                              | 34/1044 [00:12<05:52,  2.87it/s, acc_step=1/1, ce_loss_token=2.0537, perplexity_token=7.7969]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<06:01,  2.79it/s, acc_step=1/1, ce_loss_token=2.0535, perplexity_token=7.7949]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<06:12,  2.71it/s, acc_step=1/1, ce_loss_token=2.0532, perplexity_token=7.7927]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   4%|█▋                                              | 37/1044 [00:13<06:20,  2.65it/s, acc_step=1/1, ce_loss_token=2.0528, perplexity_token=7.7898]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:20,  2.64it/s, acc_step=1/1, ce_loss_token=2.0528, perplexity_token=7.7896]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   4%|█▊                                              | 39/1044 [00:14<06:14,  2.69it/s, acc_step=1/1, ce_loss_token=2.0526, perplexity_token=7.7882]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<06:07,  2.74it/s, acc_step=1/1, ce_loss_token=2.0523, perplexity_token=7.7854]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:10,  2.71it/s, acc_step=1/1, ce_loss_token=2.0518, perplexity_token=7.7822]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   4%|█▉                                              | 42/1044 [00:15<06:22,  2.62it/s, acc_step=1/1, ce_loss_token=2.0516, perplexity_token=7.7804]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<06:16,  2.66it/s, acc_step=1/1, ce_loss_token=2.0514, perplexity_token=7.7785]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   4%|██                                              | 44/1044 [00:16<06:20,  2.63it/s, acc_step=1/1, ce_loss_token=2.0510, perplexity_token=7.7758]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:   4%|██                                              | 45/1044 [00:17<06:49,  2.44it/s, acc_step=1/1, ce_loss_token=2.0509, perplexity_token=7.7747]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:44,  2.47it/s, acc_step=1/1, ce_loss_token=2.0506, perplexity_token=7.7729]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:   5%|██▏                                             | 47/1044 [00:18<09:02,  1.84it/s, acc_step=1/1, ce_loss_token=2.0507, perplexity_token=7.7732]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<08:04,  2.06it/s, acc_step=1/1, ce_loss_token=2.0506, perplexity_token=7.7722]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   5%|██▎                                             | 49/1044 [00:19<07:23,  2.24it/s, acc_step=1/1, ce_loss_token=2.0504, perplexity_token=7.7709]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   5%|██▎                                             | 50/1044 [00:19<07:00,  2.36it/s, acc_step=1/1, ce_loss_token=2.0501, perplexity_token=7.7686]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<06:42,  2.47it/s, acc_step=1/1, ce_loss_token=2.0501, perplexity_token=7.7688]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   5%|██▍                                             | 52/1044 [00:20<06:42,  2.46it/s, acc_step=1/1, ce_loss_token=2.0499, perplexity_token=7.7672]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   5%|██▍                                             | 53/1044 [00:20<06:30,  2.54it/s, acc_step=1/1, ce_loss_token=2.0497, perplexity_token=7.7658]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<06:18,  2.61it/s, acc_step=1/1, ce_loss_token=2.0497, perplexity_token=7.7660]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▌                                             | 55/1044 [00:21<06:22,  2.59it/s, acc_step=1/1, ce_loss_token=2.0497, perplexity_token=7.7653]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   5%|██▌                                             | 56/1044 [00:21<05:53,  2.79it/s, acc_step=1/1, ce_loss_token=2.0514, perplexity_token=7.7792]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<06:00,  2.73it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7774]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|██▋                                             | 58/1044 [00:22<06:01,  2.73it/s, acc_step=1/1, ce_loss_token=2.0511, perplexity_token=7.7764]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   6%|██▋                                             | 59/1044 [00:22<05:43,  2.87it/s, acc_step=1/1, ce_loss_token=2.0534, perplexity_token=7.7941]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:43,  2.87it/s, acc_step=1/1, ce_loss_token=2.0531, perplexity_token=7.7924]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   6%|██▊                                             | 61/1044 [00:23<05:52,  2.79it/s, acc_step=1/1, ce_loss_token=2.0529, perplexity_token=7.7903]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<05:57,  2.75it/s, acc_step=1/1, ce_loss_token=2.0526, perplexity_token=7.7883]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   6%|██▉                                             | 63/1044 [00:24<06:06,  2.68it/s, acc_step=1/1, ce_loss_token=2.0525, perplexity_token=7.7871]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:   6%|██▉                                             | 64/1044 [00:24<06:53,  2.37it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7848]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   6%|██▉                                             | 65/1044 [00:25<06:39,  2.45it/s, acc_step=1/1, ce_loss_token=2.0519, perplexity_token=7.7825]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:   6%|███                                             | 66/1044 [00:25<07:07,  2.29it/s, acc_step=1/1, ce_loss_token=2.0516, perplexity_token=7.7801]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   6%|███                                             | 67/1044 [00:25<06:53,  2.36it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7783]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   7%|███▏                                            | 68/1044 [00:26<06:41,  2.43it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7775]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   7%|███▏                                            | 69/1044 [00:26<06:07,  2.65it/s, acc_step=1/1, ce_loss_token=2.0529, perplexity_token=7.7903]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<05:34,  2.91it/s, acc_step=1/1, ce_loss_token=2.0540, perplexity_token=7.7994]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   7%|███▎                                            | 71/1044 [00:27<05:12,  3.11it/s, acc_step=1/1, ce_loss_token=2.0554, perplexity_token=7.8099]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   7%|███▎                                            | 72/1044 [00:27<05:31,  2.94it/s, acc_step=1/1, ce_loss_token=2.0551, perplexity_token=7.8077]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<05:25,  2.98it/s, acc_step=1/1, ce_loss_token=2.0562, perplexity_token=7.8166]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   7%|███▍                                            | 74/1044 [00:28<05:39,  2.86it/s, acc_step=1/1, ce_loss_token=2.0560, perplexity_token=7.8147]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   7%|███▍                                            | 75/1044 [00:28<05:23,  3.00it/s, acc_step=1/1, ce_loss_token=2.0573, perplexity_token=7.8249]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<05:28,  2.94it/s, acc_step=1/1, ce_loss_token=2.0570, perplexity_token=7.8226]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   7%|███▌                                            | 77/1044 [00:29<05:27,  2.95it/s, acc_step=1/1, ce_loss_token=2.0569, perplexity_token=7.8220]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<05:09,  3.12it/s, acc_step=1/1, ce_loss_token=2.0580, perplexity_token=7.8300]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<05:27,  2.95it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8279]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   8%|███▋                                            | 80/1044 [00:30<05:29,  2.93it/s, acc_step=1/1, ce_loss_token=2.0573, perplexity_token=7.8251]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<05:54,  2.72it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8239]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<05:37,  2.85it/s, acc_step=1/1, ce_loss_token=2.0580, perplexity_token=7.8307]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<05:34,  2.87it/s, acc_step=1/1, ce_loss_token=2.0578, perplexity_token=7.8289]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<05:34,  2.87it/s, acc_step=1/1, ce_loss_token=2.0576, perplexity_token=7.8272]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   8%|███▉                                            | 85/1044 [00:32<05:41,  2.81it/s, acc_step=1/1, ce_loss_token=2.0574, perplexity_token=7.8252]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<05:58,  2.67it/s, acc_step=1/1, ce_loss_token=2.0571, perplexity_token=7.8234]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   8%|████                                            | 87/1044 [00:32<05:59,  2.66it/s, acc_step=1/1, ce_loss_token=2.0568, perplexity_token=7.8212]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   8%|████                                            | 88/1044 [00:33<06:04,  2.63it/s, acc_step=1/1, ce_loss_token=2.0566, perplexity_token=7.8193]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████                                            | 89/1044 [00:33<06:04,  2.62it/s, acc_step=1/1, ce_loss_token=2.0565, perplexity_token=7.8187]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   9%|████▏                                           | 90/1044 [00:34<06:16,  2.53it/s, acc_step=1/1, ce_loss_token=2.0562, perplexity_token=7.8160]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:   9%|████▏                                           | 91/1044 [00:34<06:29,  2.45it/s, acc_step=1/1, ce_loss_token=2.0559, perplexity_token=7.8141]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:   9%|████▏                                           | 92/1044 [00:34<05:42,  2.78it/s, acc_step=1/1, ce_loss_token=2.0590, perplexity_token=7.8384]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   9%|████▎                                           | 93/1044 [00:35<05:36,  2.83it/s, acc_step=1/1, ce_loss_token=2.0588, perplexity_token=7.8365]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   9%|████▎                                           | 94/1044 [00:35<05:54,  2.68it/s, acc_step=1/1, ce_loss_token=2.0586, perplexity_token=7.8353]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<05:53,  2.68it/s, acc_step=1/1, ce_loss_token=2.0585, perplexity_token=7.8342]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   9%|████▍                                           | 96/1044 [00:36<05:53,  2.68it/s, acc_step=1/1, ce_loss_token=2.0583, perplexity_token=7.8330]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:27,  2.89it/s, acc_step=1/1, ce_loss_token=2.0591, perplexity_token=7.8387]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<05:06,  3.08it/s, acc_step=1/1, ce_loss_token=2.0598, perplexity_token=7.8443]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   9%|████▌                                           | 99/1044 [00:37<05:12,  3.02it/s, acc_step=1/1, ce_loss_token=2.0594, perplexity_token=7.8413]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<05:25,  2.90it/s, acc_step=1/1, ce_loss_token=2.0592, perplexity_token=7.8394]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  10%|████▌                                          | 101/1044 [00:37<05:24,  2.91it/s, acc_step=1/1, ce_loss_token=2.0588, perplexity_token=7.8366]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<05:26,  2.89it/s, acc_step=1/1, ce_loss_token=2.0586, perplexity_token=7.8347]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<05:34,  2.81it/s, acc_step=1/1, ce_loss_token=2.0583, perplexity_token=7.8326]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  10%|████▋                                          | 104/1044 [00:38<05:52,  2.67it/s, acc_step=1/1, ce_loss_token=2.0581, perplexity_token=7.8314]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<06:01,  2.60it/s, acc_step=1/1, ce_loss_token=2.0579, perplexity_token=7.8294]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<05:49,  2.68it/s, acc_step=1/1, ce_loss_token=2.0576, perplexity_token=7.8272]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  10%|████▊                                          | 107/1044 [00:40<05:31,  2.83it/s, acc_step=1/1, ce_loss_token=2.0583, perplexity_token=7.8324]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<05:35,  2.79it/s, acc_step=1/1, ce_loss_token=2.0580, perplexity_token=7.8300]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  10%|████▉                                          | 109/1044 [00:40<06:04,  2.56it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8283]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  11%|████▉                                          | 110/1044 [00:41<06:09,  2.53it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8277]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<06:19,  2.46it/s, acc_step=1/1, ce_loss_token=2.0576, perplexity_token=7.8269]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  11%|█████                                          | 112/1044 [00:42<05:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.0581, perplexity_token=7.8310]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<05:39,  2.74it/s, acc_step=1/1, ce_loss_token=2.0579, perplexity_token=7.8292]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<05:33,  2.79it/s, acc_step=1/1, ce_loss_token=2.0576, perplexity_token=7.8269]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<04:54,  3.15it/s, acc_step=1/1, ce_loss_token=2.0591, perplexity_token=7.8386]

torch.Size([256, 293, 35]) torch.Size([256, 293])
torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:43<04:59,  3.09it/s, acc_step=1/1, ce_loss_token=2.0589, perplexity_token=7.8373]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<05:14,  2.94it/s, acc_step=1/1, ce_loss_token=2.0587, perplexity_token=7.8357]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:25,  2.85it/s, acc_step=1/1, ce_loss_token=2.0584, perplexity_token=7.8338]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:44<05:32,  2.78it/s, acc_step=1/1, ce_loss_token=2.0583, perplexity_token=7.8326]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:45<05:40,  2.71it/s, acc_step=1/1, ce_loss_token=2.0581, perplexity_token=7.8311]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:57,  2.58it/s, acc_step=1/1, ce_loss_token=2.0580, perplexity_token=7.8300]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:45<05:46,  2.66it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8279]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:39,  2.71it/s, acc_step=1/1, ce_loss_token=2.0575, perplexity_token=7.8266]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:39,  2.71it/s, acc_step=1/1, ce_loss_token=2.0574, perplexity_token=7.8258]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:47<05:44,  2.67it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8240]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:24,  2.82it/s, acc_step=1/1, ce_loss_token=2.0579, perplexity_token=7.8299]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:47<05:34,  2.74it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8277]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:48<05:40,  2.69it/s, acc_step=1/1, ce_loss_token=2.0575, perplexity_token=7.8265]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<05:34,  2.73it/s, acc_step=1/1, ce_loss_token=2.0573, perplexity_token=7.8246]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:39,  2.69it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8237]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:49<05:14,  2.90it/s, acc_step=1/1, ce_loss_token=2.0578, perplexity_token=7.8284]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:22,  2.83it/s, acc_step=1/1, ce_loss_token=2.0576, perplexity_token=7.8270]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:27,  2.78it/s, acc_step=1/1, ce_loss_token=2.0574, perplexity_token=7.8260]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  13%|██████                                         | 135/1044 [00:50<05:46,  2.62it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8238]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<05:53,  2.57it/s, acc_step=1/1, ce_loss_token=2.0570, perplexity_token=7.8223]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:51<05:23,  2.80it/s, acc_step=1/1, ce_loss_token=2.0575, perplexity_token=7.8264]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:51<05:23,  2.80it/s, acc_step=1/1, ce_loss_token=2.0574, perplexity_token=7.8255]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<06:16,  2.41it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8244]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:52<06:13,  2.42it/s, acc_step=1/1, ce_loss_token=2.0572, perplexity_token=7.8237]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:59,  2.51it/s, acc_step=1/1, ce_loss_token=2.0569, perplexity_token=7.8220]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:53<05:48,  2.59it/s, acc_step=1/1, ce_loss_token=2.0567, perplexity_token=7.8203]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:08,  2.92it/s, acc_step=1/1, ce_loss_token=2.0580, perplexity_token=7.8301]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:08,  2.92it/s, acc_step=1/1, ce_loss_token=2.0578, perplexity_token=7.8287]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<04:53,  3.07it/s, acc_step=1/1, ce_loss_token=2.0584, perplexity_token=7.8338]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:54<05:05,  2.94it/s, acc_step=1/1, ce_loss_token=2.0583, perplexity_token=7.8327]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<05:35,  2.67it/s, acc_step=1/1, ce_loss_token=2.0582, perplexity_token=7.8319]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<05:29,  2.72it/s, acc_step=1/1, ce_loss_token=2.0579, perplexity_token=7.8296]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:55<05:35,  2.67it/s, acc_step=1/1, ce_loss_token=2.0577, perplexity_token=7.8282]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:34,  2.67it/s, acc_step=1/1, ce_loss_token=2.0575, perplexity_token=7.8267]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:54,  2.52it/s, acc_step=1/1, ce_loss_token=2.0574, perplexity_token=7.8253]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<05:42,  2.60it/s, acc_step=1/1, ce_loss_token=2.0571, perplexity_token=7.8236]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<05:38,  2.63it/s, acc_step=1/1, ce_loss_token=2.0570, perplexity_token=7.8224]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:57<05:41,  2.61it/s, acc_step=1/1, ce_loss_token=2.0569, perplexity_token=7.8214]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<05:45,  2.57it/s, acc_step=1/1, ce_loss_token=2.0567, perplexity_token=7.8202]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:43,  2.59it/s, acc_step=1/1, ce_loss_token=2.0565, perplexity_token=7.8186]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  15%|███████                                        | 157/1044 [00:58<05:35,  2.64it/s, acc_step=1/1, ce_loss_token=2.0564, perplexity_token=7.8176]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  15%|███████                                        | 158/1044 [00:59<05:48,  2.54it/s, acc_step=1/1, ce_loss_token=2.0562, perplexity_token=7.8165]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<06:36,  2.23it/s, acc_step=1/1, ce_loss_token=2.0561, perplexity_token=7.8152]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:59<06:16,  2.35it/s, acc_step=1/1, ce_loss_token=2.0558, perplexity_token=7.8134]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  15%|███████▏                                       | 161/1044 [01:00<05:59,  2.46it/s, acc_step=1/1, ce_loss_token=2.0556, perplexity_token=7.8119]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▎                                       | 162/1044 [01:00<05:51,  2.51it/s, acc_step=1/1, ce_loss_token=2.0555, perplexity_token=7.8108]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:01<05:44,  2.55it/s, acc_step=1/1, ce_loss_token=2.0553, perplexity_token=7.8093]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<05:41,  2.57it/s, acc_step=1/1, ce_loss_token=2.0551, perplexity_token=7.8079]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:12,  2.81it/s, acc_step=1/1, ce_loss_token=2.0554, perplexity_token=7.8102]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:02<04:49,  3.04it/s, acc_step=1/1, ce_loss_token=2.0558, perplexity_token=7.8131]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<04:54,  2.98it/s, acc_step=1/1, ce_loss_token=2.0556, perplexity_token=7.8118]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<05:16,  2.77it/s, acc_step=1/1, ce_loss_token=2.0555, perplexity_token=7.8106]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:03<05:48,  2.51it/s, acc_step=1/1, ce_loss_token=2.0553, perplexity_token=7.8091]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<05:36,  2.60it/s, acc_step=1/1, ce_loss_token=2.0551, perplexity_token=7.8079]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:04<05:39,  2.57it/s, acc_step=1/1, ce_loss_token=2.0550, perplexity_token=7.8068]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<05:41,  2.55it/s, acc_step=1/1, ce_loss_token=2.0547, perplexity_token=7.8049]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:04<05:35,  2.60it/s, acc_step=1/1, ce_loss_token=2.0546, perplexity_token=7.8038]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:05<05:45,  2.52it/s, acc_step=1/1, ce_loss_token=2.0545, perplexity_token=7.8027]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<05:33,  2.61it/s, acc_step=1/1, ce_loss_token=2.0548, perplexity_token=7.8053]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:05<05:07,  2.83it/s, acc_step=1/1, ce_loss_token=2.0552, perplexity_token=7.8085]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:06<05:11,  2.78it/s, acc_step=1/1, ce_loss_token=2.0551, perplexity_token=7.8075]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<05:07,  2.82it/s, acc_step=1/1, ce_loss_token=2.0549, perplexity_token=7.8062]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  17%|████████                                       | 179/1044 [01:07<05:28,  2.63it/s, acc_step=1/1, ce_loss_token=2.0547, perplexity_token=7.8043]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  17%|████████                                       | 180/1044 [01:07<05:54,  2.44it/s, acc_step=1/1, ce_loss_token=2.0546, perplexity_token=7.8034]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<05:42,  2.52it/s, acc_step=1/1, ce_loss_token=2.0544, perplexity_token=7.8024]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:08<05:39,  2.54it/s, acc_step=1/1, ce_loss_token=2.0543, perplexity_token=7.8010]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:08<05:39,  2.53it/s, acc_step=1/1, ce_loss_token=2.0541, perplexity_token=7.7997]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:09<06:00,  2.38it/s, acc_step=1/1, ce_loss_token=2.0539, perplexity_token=7.7986]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:09<05:50,  2.45it/s, acc_step=1/1, ce_loss_token=2.0539, perplexity_token=7.7981]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:09<05:43,  2.49it/s, acc_step=1/1, ce_loss_token=2.0537, perplexity_token=7.7969]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:10<05:43,  2.50it/s, acc_step=1/1, ce_loss_token=2.0535, perplexity_token=7.7955]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:10<05:36,  2.55it/s, acc_step=1/1, ce_loss_token=2.0534, perplexity_token=7.7942]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:11<05:28,  2.60it/s, acc_step=1/1, ce_loss_token=2.0532, perplexity_token=7.7926]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:11<05:25,  2.62it/s, acc_step=1/1, ce_loss_token=2.0530, perplexity_token=7.7914]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:11<05:21,  2.65it/s, acc_step=1/1, ce_loss_token=2.0536, perplexity_token=7.7958]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:12<05:08,  2.76it/s, acc_step=1/1, ce_loss_token=2.0534, perplexity_token=7.7942]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:12<05:04,  2.79it/s, acc_step=1/1, ce_loss_token=2.0532, perplexity_token=7.7931]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:15,  2.69it/s, acc_step=1/1, ce_loss_token=2.0531, perplexity_token=7.7922]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:13<05:23,  2.62it/s, acc_step=1/1, ce_loss_token=2.0530, perplexity_token=7.7909]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:16,  2.68it/s, acc_step=1/1, ce_loss_token=2.0528, perplexity_token=7.7900]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:13<05:20,  2.64it/s, acc_step=1/1, ce_loss_token=2.0526, perplexity_token=7.7883]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:14<05:19,  2.65it/s, acc_step=1/1, ce_loss_token=2.0524, perplexity_token=7.7867]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:16,  2.67it/s, acc_step=1/1, ce_loss_token=2.0523, perplexity_token=7.7855]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  19%|█████████                                      | 200/1044 [01:15<05:24,  2.60it/s, acc_step=1/1, ce_loss_token=2.0521, perplexity_token=7.7840]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  19%|█████████                                      | 201/1044 [01:15<05:18,  2.65it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7833]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  19%|█████████                                      | 202/1044 [01:15<05:20,  2.62it/s, acc_step=1/1, ce_loss_token=2.0518, perplexity_token=7.7821]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:16<04:59,  2.81it/s, acc_step=1/1, ce_loss_token=2.0523, perplexity_token=7.7860]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:16<05:09,  2.71it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7850]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:16<05:12,  2.68it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7838]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:17<05:09,  2.71it/s, acc_step=1/1, ce_loss_token=2.0519, perplexity_token=7.7829]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:17<04:58,  2.80it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7853]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:18<05:04,  2.75it/s, acc_step=1/1, ce_loss_token=2.0521, perplexity_token=7.7843]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:18<05:13,  2.67it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7831]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:18<04:49,  2.88it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7853]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:19<04:58,  2.79it/s, acc_step=1/1, ce_loss_token=2.0521, perplexity_token=7.7842]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:19<05:00,  2.77it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7832]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:19<05:05,  2.72it/s, acc_step=1/1, ce_loss_token=2.0518, perplexity_token=7.7821]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:20<05:02,  2.74it/s, acc_step=1/1, ce_loss_token=2.0517, perplexity_token=7.7811]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:20<05:44,  2.41it/s, acc_step=1/1, ce_loss_token=2.0516, perplexity_token=7.7802]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:21<05:35,  2.47it/s, acc_step=1/1, ce_loss_token=2.0515, perplexity_token=7.7793]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:21<05:32,  2.49it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7783]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:21<05:26,  2.53it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7769]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:22<04:28,  3.07it/s, acc_step=1/1, ce_loss_token=2.0528, perplexity_token=7.7899]

torch.Size([256, 333, 35]) torch.Size([256, 333])
torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:22<04:30,  3.04it/s, acc_step=1/1, ce_loss_token=2.0527, perplexity_token=7.7890]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:23<04:19,  3.17it/s, acc_step=1/1, ce_loss_token=2.0530, perplexity_token=7.7910]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  21%|██████████                                     | 223/1044 [01:23<04:51,  2.82it/s, acc_step=1/1, ce_loss_token=2.0528, perplexity_token=7.7895]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  21%|██████████                                     | 224/1044 [01:23<04:58,  2.75it/s, acc_step=1/1, ce_loss_token=2.0526, perplexity_token=7.7884]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:24<04:59,  2.74it/s, acc_step=1/1, ce_loss_token=2.0524, perplexity_token=7.7867]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:24<05:08,  2.65it/s, acc_step=1/1, ce_loss_token=2.0523, perplexity_token=7.7858]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:25<05:06,  2.67it/s, acc_step=1/1, ce_loss_token=2.0522, perplexity_token=7.7847]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:25<05:09,  2.63it/s, acc_step=1/1, ce_loss_token=2.0520, perplexity_token=7.7834]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:25<05:14,  2.59it/s, acc_step=1/1, ce_loss_token=2.0519, perplexity_token=7.7824]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:26<05:12,  2.60it/s, acc_step=1/1, ce_loss_token=2.0517, perplexity_token=7.7813]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:26<05:12,  2.60it/s, acc_step=1/1, ce_loss_token=2.0516, perplexity_token=7.7805]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:26<05:05,  2.66it/s, acc_step=1/1, ce_loss_token=2.0515, perplexity_token=7.7796]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:27<05:06,  2.64it/s, acc_step=1/1, ce_loss_token=2.0514, perplexity_token=7.7785]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:27<05:04,  2.66it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7771]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:28<05:02,  2.68it/s, acc_step=1/1, ce_loss_token=2.0511, perplexity_token=7.7761]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:28<04:39,  2.89it/s, acc_step=1/1, ce_loss_token=2.0515, perplexity_token=7.7799]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:28<04:53,  2.75it/s, acc_step=1/1, ce_loss_token=2.0514, perplexity_token=7.7792]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:29<04:52,  2.76it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7784]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:29<04:57,  2.71it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7774]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:29<05:02,  2.66it/s, acc_step=1/1, ce_loss_token=2.0511, perplexity_token=7.7766]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:30<04:36,  2.91it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7783]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:30<04:38,  2.88it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7773]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:30<05:00,  2.67it/s, acc_step=1/1, ce_loss_token=2.0511, perplexity_token=7.7766]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:31<04:55,  2.71it/s, acc_step=1/1, ce_loss_token=2.0510, perplexity_token=7.7754]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  23%|███████████                                    | 245/1044 [01:31<04:29,  2.96it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7778]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  24%|███████████                                    | 246/1044 [01:31<04:35,  2.90it/s, acc_step=1/1, ce_loss_token=2.0516, perplexity_token=7.7801]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  24%|███████████                                    | 247/1044 [01:32<04:50,  2.74it/s, acc_step=1/1, ce_loss_token=2.0515, perplexity_token=7.7794]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:32<04:54,  2.70it/s, acc_step=1/1, ce_loss_token=2.0513, perplexity_token=7.7782]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:33<04:47,  2.77it/s, acc_step=1/1, ce_loss_token=2.0512, perplexity_token=7.7774]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:33<04:54,  2.70it/s, acc_step=1/1, ce_loss_token=2.0511, perplexity_token=7.7762]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:33<04:55,  2.69it/s, acc_step=1/1, ce_loss_token=2.0510, perplexity_token=7.7754]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:34<05:10,  2.55it/s, acc_step=1/1, ce_loss_token=2.0508, perplexity_token=7.7745]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:34<05:08,  2.56it/s, acc_step=1/1, ce_loss_token=2.0507, perplexity_token=7.7734]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:35<05:14,  2.51it/s, acc_step=1/1, ce_loss_token=2.0506, perplexity_token=7.7726]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:35<05:06,  2.58it/s, acc_step=1/1, ce_loss_token=2.0505, perplexity_token=7.7716]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:35<05:04,  2.59it/s, acc_step=1/1, ce_loss_token=2.0503, perplexity_token=7.7706]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:36<04:58,  2.63it/s, acc_step=1/1, ce_loss_token=2.0502, perplexity_token=7.7697]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:36<04:59,  2.63it/s, acc_step=1/1, ce_loss_token=2.0501, perplexity_token=7.7688]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:36<04:57,  2.64it/s, acc_step=1/1, ce_loss_token=2.0500, perplexity_token=7.7680]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:37<05:01,  2.60it/s, acc_step=1/1, ce_loss_token=2.0499, perplexity_token=7.7672]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:37<05:00,  2.61it/s, acc_step=1/1, ce_loss_token=2.0498, perplexity_token=7.7661]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:38<04:43,  2.76it/s, acc_step=1/1, ce_loss_token=2.0500, perplexity_token=7.7681]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:38<05:45,  2.26it/s, acc_step=1/1, ce_loss_token=2.0500, perplexity_token=7.7676]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:39<06:27,  2.01it/s, acc_step=1/1, ce_loss_token=2.0499, perplexity_token=7.7668]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:39<06:08,  2.11it/s, acc_step=1/1, ce_loss_token=2.0498, perplexity_token=7.7661]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:40<05:45,  2.25it/s, acc_step=1/1, ce_loss_token=2.0496, perplexity_token=7.7651]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  26%|████████████                                   | 267/1044 [01:40<05:25,  2.39it/s, acc_step=1/1, ce_loss_token=2.0495, perplexity_token=7.7643]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  26%|████████████                                   | 268/1044 [01:40<05:06,  2.53it/s, acc_step=1/1, ce_loss_token=2.0494, perplexity_token=7.7633]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  26%|████████████                                   | 269/1044 [01:41<05:03,  2.55it/s, acc_step=1/1, ce_loss_token=2.0493, perplexity_token=7.7621]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:41<04:52,  2.64it/s, acc_step=1/1, ce_loss_token=2.0491, perplexity_token=7.7611]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:41<04:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.0490, perplexity_token=7.7602]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:42<04:45,  2.70it/s, acc_step=1/1, ce_loss_token=2.0489, perplexity_token=7.7595]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:42<04:38,  2.76it/s, acc_step=1/1, ce_loss_token=2.0488, perplexity_token=7.7586]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:42<04:39,  2.75it/s, acc_step=1/1, ce_loss_token=2.0487, perplexity_token=7.7577]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:43<04:40,  2.74it/s, acc_step=1/1, ce_loss_token=2.0486, perplexity_token=7.7567]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:43<04:37,  2.77it/s, acc_step=1/1, ce_loss_token=2.0484, perplexity_token=7.7556]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:44<04:34,  2.79it/s, acc_step=1/1, ce_loss_token=2.0483, perplexity_token=7.7547]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:44<04:34,  2.79it/s, acc_step=1/1, ce_loss_token=2.0482, perplexity_token=7.7539]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:44<04:08,  3.08it/s, acc_step=1/1, ce_loss_token=2.0490, perplexity_token=7.7603]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:45<04:18,  2.95it/s, acc_step=1/1, ce_loss_token=2.0489, perplexity_token=7.7593]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:45<04:19,  2.94it/s, acc_step=1/1, ce_loss_token=2.0487, perplexity_token=7.7581]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:46<04:24,  2.88it/s, acc_step=1/1, ce_loss_token=2.0490, perplexity_token=7.7598]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:46<04:14,  2.99it/s, acc_step=1/1, ce_loss_token=2.0492, perplexity_token=7.7618]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:46<04:26,  2.85it/s, acc_step=1/1, ce_loss_token=2.0491, perplexity_token=7.7606]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:47<04:41,  2.69it/s, acc_step=1/1, ce_loss_token=2.0489, perplexity_token=7.7596]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:47<04:40,  2.70it/s, acc_step=1/1, ce_loss_token=2.0488, perplexity_token=7.7587]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:48<04:59,  2.52it/s, acc_step=1/1, ce_loss_token=2.0486, perplexity_token=7.7574]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:48<04:48,  2.62it/s, acc_step=1/1, ce_loss_token=2.0485, perplexity_token=7.7564]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:48<04:44,  2.65it/s, acc_step=1/1, ce_loss_token=2.0484, perplexity_token=7.7554]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:49<04:38,  2.70it/s, acc_step=1/1, ce_loss_token=2.0483, perplexity_token=7.7545]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:49<04:36,  2.72it/s, acc_step=1/1, ce_loss_token=2.0482, perplexity_token=7.7535]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:49<04:43,  2.65it/s, acc_step=1/1, ce_loss_token=2.0480, perplexity_token=7.7526]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:50<04:43,  2.64it/s, acc_step=1/1, ce_loss_token=2.0479, perplexity_token=7.7518]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:50<04:44,  2.63it/s, acc_step=1/1, ce_loss_token=2.0478, perplexity_token=7.7511]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:50<04:45,  2.62it/s, acc_step=1/1, ce_loss_token=2.0477, perplexity_token=7.7502]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:51<04:39,  2.68it/s, acc_step=1/1, ce_loss_token=2.0476, perplexity_token=7.7496]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:51<04:41,  2.65it/s, acc_step=1/1, ce_loss_token=2.0475, perplexity_token=7.7488]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:52<04:59,  2.48it/s, acc_step=1/1, ce_loss_token=2.0474, perplexity_token=7.7479]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:52<04:59,  2.48it/s, acc_step=1/1, ce_loss_token=2.0473, perplexity_token=7.7468]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:52<04:56,  2.51it/s, acc_step=1/1, ce_loss_token=2.0471, perplexity_token=7.7458]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:53<04:49,  2.57it/s, acc_step=1/1, ce_loss_token=2.0470, perplexity_token=7.7450]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:53<04:43,  2.62it/s, acc_step=1/1, ce_loss_token=2.0469, perplexity_token=7.7439]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:54<04:36,  2.68it/s, acc_step=1/1, ce_loss_token=2.0468, perplexity_token=7.7432]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:54<04:19,  2.85it/s, acc_step=1/1, ce_loss_token=2.0472, perplexity_token=7.7459]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:54<04:17,  2.87it/s, acc_step=1/1, ce_loss_token=2.0471, perplexity_token=7.7453]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:55<04:01,  3.05it/s, acc_step=1/1, ce_loss_token=2.0473, perplexity_token=7.7467]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:55<04:07,  2.98it/s, acc_step=1/1, ce_loss_token=2.0472, perplexity_token=7.7461]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:55<04:20,  2.82it/s, acc_step=1/1, ce_loss_token=2.0471, perplexity_token=7.7452]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:56<04:26,  2.76it/s, acc_step=1/1, ce_loss_token=2.0470, perplexity_token=7.7443]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:56<04:39,  2.62it/s, acc_step=1/1, ce_loss_token=2.0468, perplexity_token=7.7435]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:56<04:35,  2.66it/s, acc_step=1/1, ce_loss_token=2.0467, perplexity_token=7.7426]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:57<04:12,  2.90it/s, acc_step=1/1, ce_loss_token=2.0469, perplexity_token=7.7440]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:57<04:12,  2.89it/s, acc_step=1/1, ce_loss_token=2.0468, perplexity_token=7.7431]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:57<04:16,  2.84it/s, acc_step=1/1, ce_loss_token=2.0467, perplexity_token=7.7421]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:58<04:25,  2.74it/s, acc_step=1/1, ce_loss_token=2.0465, perplexity_token=7.7410]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:58<04:32,  2.67it/s, acc_step=1/1, ce_loss_token=2.0465, perplexity_token=7.7404]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:59<04:36,  2.62it/s, acc_step=1/1, ce_loss_token=2.0464, perplexity_token=7.7398]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:59<04:36,  2.62it/s, acc_step=1/1, ce_loss_token=2.0463, perplexity_token=7.7392]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:59<04:27,  2.71it/s, acc_step=1/1, ce_loss_token=2.0462, perplexity_token=7.7383]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  31%|██████████████▍                                | 321/1044 [02:00<04:19,  2.79it/s, acc_step=1/1, ce_loss_token=2.0461, perplexity_token=7.7375]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  31%|██████████████▍                                | 322/1044 [02:00<04:17,  2.80it/s, acc_step=1/1, ce_loss_token=2.0460, perplexity_token=7.7366]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  31%|██████████████▌                                | 323/1044 [02:00<04:09,  2.89it/s, acc_step=1/1, ce_loss_token=2.0462, perplexity_token=7.7382]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  31%|██████████████▌                                | 324/1044 [02:01<04:13,  2.84it/s, acc_step=1/1, ce_loss_token=2.0461, perplexity_token=7.7375]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  31%|██████████████▋                                | 325/1044 [02:01<04:21,  2.75it/s, acc_step=1/1, ce_loss_token=2.0460, perplexity_token=7.7371]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  31%|██████████████▋                                | 326/1044 [02:02<04:34,  2.62it/s, acc_step=1/1, ce_loss_token=2.0460, perplexity_token=7.7365]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:02<04:43,  2.53it/s, acc_step=1/1, ce_loss_token=2.0458, perplexity_token=7.7355]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:02<04:33,  2.62it/s, acc_step=1/1, ce_loss_token=2.0457, perplexity_token=7.7347]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:03<04:42,  2.53it/s, acc_step=1/1, ce_loss_token=2.0456, perplexity_token=7.7342]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:03<04:17,  2.78it/s, acc_step=1/1, ce_loss_token=2.0458, perplexity_token=7.7355]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:03<04:13,  2.81it/s, acc_step=1/1, ce_loss_token=2.0457, perplexity_token=7.7347]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:04<04:18,  2.75it/s, acc_step=1/1, ce_loss_token=2.0456, perplexity_token=7.7339]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  32%|███████████████                                | 334/1044 [02:04<03:48,  3.11it/s, acc_step=1/1, ce_loss_token=2.0461, perplexity_token=7.7378]

torch.Size([256, 305, 35]) torch.Size([256, 305])
torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  32%|███████████████                                | 335/1044 [02:05<03:38,  3.24it/s, acc_step=1/1, ce_loss_token=2.0463, perplexity_token=7.7391]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:05<03:49,  3.09it/s, acc_step=1/1, ce_loss_token=2.0462, perplexity_token=7.7382]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:05<04:01,  2.93it/s, acc_step=1/1, ce_loss_token=2.0460, perplexity_token=7.7372]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:06<04:03,  2.90it/s, acc_step=1/1, ce_loss_token=2.0462, perplexity_token=7.7384]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:06<04:09,  2.83it/s, acc_step=1/1, ce_loss_token=2.0461, perplexity_token=7.7377]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:06<04:16,  2.75it/s, acc_step=1/1, ce_loss_token=2.0460, perplexity_token=7.7371]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:07<04:13,  2.77it/s, acc_step=1/1, ce_loss_token=2.0459, perplexity_token=7.7365]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:07<04:15,  2.75it/s, acc_step=1/1, ce_loss_token=2.0458, perplexity_token=7.7355]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:08<04:20,  2.69it/s, acc_step=1/1, ce_loss_token=2.0457, perplexity_token=7.7349]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:08<04:22,  2.67it/s, acc_step=1/1, ce_loss_token=2.0456, perplexity_token=7.7341]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:08<04:26,  2.63it/s, acc_step=1/1, ce_loss_token=2.0455, perplexity_token=7.7333]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:09<04:27,  2.61it/s, acc_step=1/1, ce_loss_token=2.0454, perplexity_token=7.7325]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:09<04:09,  2.80it/s, acc_step=1/1, ce_loss_token=2.0456, perplexity_token=7.7339]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:09<04:21,  2.66it/s, acc_step=1/1, ce_loss_token=2.0455, perplexity_token=7.7332]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:10<04:29,  2.58it/s, acc_step=1/1, ce_loss_token=2.0454, perplexity_token=7.7326]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:10<04:28,  2.59it/s, acc_step=1/1, ce_loss_token=2.0453, perplexity_token=7.7317]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:11<04:32,  2.54it/s, acc_step=1/1, ce_loss_token=2.0452, perplexity_token=7.7309]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:11<04:26,  2.60it/s, acc_step=1/1, ce_loss_token=2.0451, perplexity_token=7.7301]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:11<04:38,  2.48it/s, acc_step=1/1, ce_loss_token=2.0450, perplexity_token=7.7294]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:12<04:33,  2.52it/s, acc_step=1/1, ce_loss_token=2.0449, perplexity_token=7.7284]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:12<04:21,  2.63it/s, acc_step=1/1, ce_loss_token=2.0448, perplexity_token=7.7277]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  34%|████████████████                               | 356/1044 [02:13<04:16,  2.68it/s, acc_step=1/1, ce_loss_token=2.0447, perplexity_token=7.7269]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|████████████████                               | 357/1044 [02:13<04:16,  2.68it/s, acc_step=1/1, ce_loss_token=2.0446, perplexity_token=7.7261]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  34%|████████████████                               | 358/1044 [02:13<04:14,  2.69it/s, acc_step=1/1, ce_loss_token=2.0445, perplexity_token=7.7253]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:14<03:55,  2.90it/s, acc_step=1/1, ce_loss_token=2.0447, perplexity_token=7.7266]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:14<04:16,  2.66it/s, acc_step=1/1, ce_loss_token=2.0445, perplexity_token=7.7257]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:14<04:17,  2.65it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7247]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:15<04:14,  2.68it/s, acc_step=1/1, ce_loss_token=2.0443, perplexity_token=7.7238]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:15<04:22,  2.60it/s, acc_step=1/1, ce_loss_token=2.0442, perplexity_token=7.7229]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:16<04:20,  2.61it/s, acc_step=1/1, ce_loss_token=2.0441, perplexity_token=7.7222]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:16<04:30,  2.51it/s, acc_step=1/1, ce_loss_token=2.0440, perplexity_token=7.7214]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:16<04:36,  2.45it/s, acc_step=1/1, ce_loss_token=2.0439, perplexity_token=7.7206]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:17<04:02,  2.79it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7245]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:17<04:06,  2.75it/s, acc_step=1/1, ce_loss_token=2.0443, perplexity_token=7.7237]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:18<03:28,  3.23it/s, acc_step=1/1, ce_loss_token=2.0450, perplexity_token=7.7288]

torch.Size([256, 340, 35]) torch.Size([256, 340])
torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:18<03:32,  3.17it/s, acc_step=1/1, ce_loss_token=2.0448, perplexity_token=7.7280]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:18<03:41,  3.04it/s, acc_step=1/1, ce_loss_token=2.0447, perplexity_token=7.7270]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:19<03:44,  2.99it/s, acc_step=1/1, ce_loss_token=2.0446, perplexity_token=7.7262]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:19<03:45,  2.97it/s, acc_step=1/1, ce_loss_token=2.0445, perplexity_token=7.7256]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:19<03:49,  2.91it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7249]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:20<03:50,  2.89it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7242]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:20<04:04,  2.73it/s, acc_step=1/1, ce_loss_token=2.0443, perplexity_token=7.7234]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:20<03:51,  2.88it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7246]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:21<03:43,  2.98it/s, acc_step=1/1, ce_loss_token=2.0446, perplexity_token=7.7257]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:21<04:07,  2.69it/s, acc_step=1/1, ce_loss_token=2.0445, perplexity_token=7.7250]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:22<04:15,  2.60it/s, acc_step=1/1, ce_loss_token=2.0444, perplexity_token=7.7241]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:22<04:48,  2.30it/s, acc_step=1/1, ce_loss_token=2.0443, perplexity_token=7.7236]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:23<04:42,  2.34it/s, acc_step=1/1, ce_loss_token=2.0442, perplexity_token=7.7229]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:23<04:42,  2.33it/s, acc_step=1/1, ce_loss_token=2.0441, perplexity_token=7.7220]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:23<04:33,  2.41it/s, acc_step=1/1, ce_loss_token=2.0440, perplexity_token=7.7212]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:24<04:35,  2.39it/s, acc_step=1/1, ce_loss_token=2.0439, perplexity_token=7.7206]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:24<04:22,  2.50it/s, acc_step=1/1, ce_loss_token=2.0438, perplexity_token=7.7198]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:24<04:16,  2.56it/s, acc_step=1/1, ce_loss_token=2.0437, perplexity_token=7.7189]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:25<04:30,  2.43it/s, acc_step=1/1, ce_loss_token=2.0436, perplexity_token=7.7180]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:25<04:27,  2.44it/s, acc_step=1/1, ce_loss_token=2.0434, perplexity_token=7.7171]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:26<04:29,  2.43it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7163]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:26<04:30,  2.41it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7158]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:27<04:20,  2.50it/s, acc_step=1/1, ce_loss_token=2.0432, perplexity_token=7.7150]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:27<04:02,  2.68it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7163]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:27<03:53,  2.78it/s, acc_step=1/1, ce_loss_token=2.0432, perplexity_token=7.7154]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:27<03:37,  2.98it/s, acc_step=1/1, ce_loss_token=2.0434, perplexity_token=7.7166]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:28<03:41,  2.92it/s, acc_step=1/1, ce_loss_token=2.0433, perplexity_token=7.7158]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:28<03:58,  2.71it/s, acc_step=1/1, ce_loss_token=2.0432, perplexity_token=7.7150]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:29<04:07,  2.61it/s, acc_step=1/1, ce_loss_token=2.0431, perplexity_token=7.7142]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:29<03:56,  2.72it/s, acc_step=1/1, ce_loss_token=2.0430, perplexity_token=7.7135]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:29<03:59,  2.68it/s, acc_step=1/1, ce_loss_token=2.0429, perplexity_token=7.7127]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:30<04:05,  2.62it/s, acc_step=1/1, ce_loss_token=2.0428, perplexity_token=7.7121]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:30<03:47,  2.82it/s, acc_step=1/1, ce_loss_token=2.0430, perplexity_token=7.7140]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:30<03:49,  2.79it/s, acc_step=1/1, ce_loss_token=2.0430, perplexity_token=7.7134]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:31<03:47,  2.80it/s, acc_step=1/1, ce_loss_token=2.0429, perplexity_token=7.7126]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:31<03:46,  2.81it/s, acc_step=1/1, ce_loss_token=2.0428, perplexity_token=7.7118]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:31<03:40,  2.89it/s, acc_step=1/1, ce_loss_token=2.0429, perplexity_token=7.7129]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:32<03:41,  2.87it/s, acc_step=1/1, ce_loss_token=2.0428, perplexity_token=7.7121]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:32<03:46,  2.80it/s, acc_step=1/1, ce_loss_token=2.0427, perplexity_token=7.7111]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:33<04:01,  2.63it/s, acc_step=1/1, ce_loss_token=2.0426, perplexity_token=7.7104]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:33<03:56,  2.68it/s, acc_step=1/1, ce_loss_token=2.0425, perplexity_token=7.7096]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:33<04:07,  2.56it/s, acc_step=1/1, ce_loss_token=2.0424, perplexity_token=7.7088]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:34<04:07,  2.55it/s, acc_step=1/1, ce_loss_token=2.0423, perplexity_token=7.7082]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:34<04:18,  2.44it/s, acc_step=1/1, ce_loss_token=2.0422, perplexity_token=7.7076]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:35<03:53,  2.70it/s, acc_step=1/1, ce_loss_token=2.0423, perplexity_token=7.7085]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:35<03:37,  2.89it/s, acc_step=1/1, ce_loss_token=2.0425, perplexity_token=7.7095]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:35<03:37,  2.88it/s, acc_step=1/1, ce_loss_token=2.0424, perplexity_token=7.7089]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:36<03:49,  2.72it/s, acc_step=1/1, ce_loss_token=2.0423, perplexity_token=7.7081]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:36<03:49,  2.72it/s, acc_step=1/1, ce_loss_token=2.0422, perplexity_token=7.7074]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:36<03:49,  2.72it/s, acc_step=1/1, ce_loss_token=2.0421, perplexity_token=7.7067]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:37<03:45,  2.76it/s, acc_step=1/1, ce_loss_token=2.0420, perplexity_token=7.7060]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:37<03:43,  2.78it/s, acc_step=1/1, ce_loss_token=2.0419, perplexity_token=7.7052]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:37<03:52,  2.67it/s, acc_step=1/1, ce_loss_token=2.0418, perplexity_token=7.7045]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:38<04:15,  2.43it/s, acc_step=1/1, ce_loss_token=2.0419, perplexity_token=7.7056]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:38<03:50,  2.68it/s, acc_step=1/1, ce_loss_token=2.0421, perplexity_token=7.7065]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:39<03:47,  2.72it/s, acc_step=1/1, ce_loss_token=2.0420, perplexity_token=7.7058]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:39<04:00,  2.57it/s, acc_step=1/1, ce_loss_token=2.0419, perplexity_token=7.7051]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:39<03:59,  2.57it/s, acc_step=1/1, ce_loss_token=2.0418, perplexity_token=7.7044]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:40<03:40,  2.79it/s, acc_step=1/1, ce_loss_token=2.0419, perplexity_token=7.7053]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:40<03:38,  2.80it/s, acc_step=1/1, ce_loss_token=2.0418, perplexity_token=7.7047]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:40<03:54,  2.62it/s, acc_step=1/1, ce_loss_token=2.0417, perplexity_token=7.7039]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:41<03:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.0416, perplexity_token=7.7031]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:41<03:53,  2.61it/s, acc_step=1/1, ce_loss_token=2.0415, perplexity_token=7.7024]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:42<03:58,  2.56it/s, acc_step=1/1, ce_loss_token=2.0414, perplexity_token=7.7017]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:42<03:47,  2.68it/s, acc_step=1/1, ce_loss_token=2.0414, perplexity_token=7.7012]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:42<03:45,  2.70it/s, acc_step=1/1, ce_loss_token=2.0413, perplexity_token=7.7002]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:43<03:45,  2.69it/s, acc_step=1/1, ce_loss_token=2.0412, perplexity_token=7.6997]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:43<03:37,  2.78it/s, acc_step=1/1, ce_loss_token=2.0415, perplexity_token=7.7018]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:44<04:14,  2.38it/s, acc_step=1/1, ce_loss_token=2.0413, perplexity_token=7.7009]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:44<04:36,  2.19it/s, acc_step=1/1, ce_loss_token=2.0412, perplexity_token=7.7002]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:45<04:24,  2.28it/s, acc_step=1/1, ce_loss_token=2.0411, perplexity_token=7.6994]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:45<04:09,  2.41it/s, acc_step=1/1, ce_loss_token=2.0411, perplexity_token=7.6990]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:45<03:58,  2.52it/s, acc_step=1/1, ce_loss_token=2.0410, perplexity_token=7.6981]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:46<03:57,  2.53it/s, acc_step=1/1, ce_loss_token=2.0409, perplexity_token=7.6976]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:46<03:37,  2.76it/s, acc_step=1/1, ce_loss_token=2.0410, perplexity_token=7.6986]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:46<03:36,  2.76it/s, acc_step=1/1, ce_loss_token=2.0410, perplexity_token=7.6979]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:47<03:39,  2.72it/s, acc_step=1/1, ce_loss_token=2.0409, perplexity_token=7.6974]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:47<03:44,  2.65it/s, acc_step=1/1, ce_loss_token=2.0408, perplexity_token=7.6969]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:47<03:46,  2.63it/s, acc_step=1/1, ce_loss_token=2.0407, perplexity_token=7.6962]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:48<03:56,  2.51it/s, acc_step=1/1, ce_loss_token=2.0406, perplexity_token=7.6955]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:48<03:58,  2.49it/s, acc_step=1/1, ce_loss_token=2.0405, perplexity_token=7.6947]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:49<03:48,  2.59it/s, acc_step=1/1, ce_loss_token=2.0406, perplexity_token=7.6956]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:49<03:47,  2.60it/s, acc_step=1/1, ce_loss_token=2.0406, perplexity_token=7.6949]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:49<03:49,  2.58it/s, acc_step=1/1, ce_loss_token=2.0405, perplexity_token=7.6941]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:50<03:49,  2.56it/s, acc_step=1/1, ce_loss_token=2.0404, perplexity_token=7.6935]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:50<03:45,  2.61it/s, acc_step=1/1, ce_loss_token=2.0403, perplexity_token=7.6927]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:51<03:43,  2.63it/s, acc_step=1/1, ce_loss_token=2.0402, perplexity_token=7.6920]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:51<03:45,  2.60it/s, acc_step=1/1, ce_loss_token=2.0401, perplexity_token=7.6914]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:51<03:28,  2.81it/s, acc_step=1/1, ce_loss_token=2.0403, perplexity_token=7.6932]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:52<03:34,  2.72it/s, acc_step=1/1, ce_loss_token=2.0402, perplexity_token=7.6924]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:52<03:40,  2.65it/s, acc_step=1/1, ce_loss_token=2.0402, perplexity_token=7.6919]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:52<03:40,  2.64it/s, acc_step=1/1, ce_loss_token=2.0401, perplexity_token=7.6912]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:53<03:37,  2.67it/s, acc_step=1/1, ce_loss_token=2.0400, perplexity_token=7.6906]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:53<03:35,  2.69it/s, acc_step=1/1, ce_loss_token=2.0399, perplexity_token=7.6899]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:54<03:32,  2.73it/s, acc_step=1/1, ce_loss_token=2.0398, perplexity_token=7.6891]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:54<03:33,  2.71it/s, acc_step=1/1, ce_loss_token=2.0397, perplexity_token=7.6885]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:54<03:34,  2.69it/s, acc_step=1/1, ce_loss_token=2.0396, perplexity_token=7.6878]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:55<03:39,  2.63it/s, acc_step=1/1, ce_loss_token=2.0395, perplexity_token=7.6869]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:55<03:37,  2.65it/s, acc_step=1/1, ce_loss_token=2.0394, perplexity_token=7.6862]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:55<03:31,  2.72it/s, acc_step=1/1, ce_loss_token=2.0393, perplexity_token=7.6855]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:56<03:35,  2.66it/s, acc_step=1/1, ce_loss_token=2.0392, perplexity_token=7.6848]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:56<03:32,  2.69it/s, acc_step=1/1, ce_loss_token=2.0392, perplexity_token=7.6841]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:57<03:30,  2.71it/s, acc_step=1/1, ce_loss_token=2.0391, perplexity_token=7.6834]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:57<03:31,  2.70it/s, acc_step=1/1, ce_loss_token=2.0390, perplexity_token=7.6829]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:57<03:34,  2.65it/s, acc_step=1/1, ce_loss_token=2.0389, perplexity_token=7.6822]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:58<03:37,  2.61it/s, acc_step=1/1, ce_loss_token=2.0388, perplexity_token=7.6816]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:58<03:44,  2.53it/s, acc_step=1/1, ce_loss_token=2.0388, perplexity_token=7.6811]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:58<03:25,  2.75it/s, acc_step=1/1, ce_loss_token=2.0389, perplexity_token=7.6821]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:59<03:22,  2.79it/s, acc_step=1/1, ce_loss_token=2.0388, perplexity_token=7.6815]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:59<03:07,  3.01it/s, acc_step=1/1, ce_loss_token=2.0389, perplexity_token=7.6824]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:59<03:13,  2.91it/s, acc_step=1/1, ce_loss_token=2.0389, perplexity_token=7.6819]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [03:00<03:27,  2.71it/s, acc_step=1/1, ce_loss_token=2.0388, perplexity_token=7.6810]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [03:00<03:24,  2.74it/s, acc_step=1/1, ce_loss_token=2.0386, perplexity_token=7.6802]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [03:01<03:26,  2.71it/s, acc_step=1/1, ce_loss_token=2.0386, perplexity_token=7.6796]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [03:01<03:34,  2.61it/s, acc_step=1/1, ce_loss_token=2.0385, perplexity_token=7.6789]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [03:01<03:41,  2.51it/s, acc_step=1/1, ce_loss_token=2.0384, perplexity_token=7.6781]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [03:02<03:42,  2.50it/s, acc_step=1/1, ce_loss_token=2.0383, perplexity_token=7.6776]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [03:02<03:34,  2.59it/s, acc_step=1/1, ce_loss_token=2.0382, perplexity_token=7.6767]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  47%|██████████████████████                         | 489/1044 [03:03<03:35,  2.57it/s, acc_step=1/1, ce_loss_token=2.0381, perplexity_token=7.6761]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  47%|██████████████████████                         | 490/1044 [03:03<03:30,  2.63it/s, acc_step=1/1, ce_loss_token=2.0380, perplexity_token=7.6752]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  47%|██████████████████████                         | 491/1044 [03:03<03:33,  2.59it/s, acc_step=1/1, ce_loss_token=2.0379, perplexity_token=7.6744]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:04<03:19,  2.77it/s, acc_step=1/1, ce_loss_token=2.0381, perplexity_token=7.6761]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:04<03:25,  2.68it/s, acc_step=1/1, ce_loss_token=2.0380, perplexity_token=7.6756]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:04<03:23,  2.70it/s, acc_step=1/1, ce_loss_token=2.0379, perplexity_token=7.6747]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:05<03:27,  2.65it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6741]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:05<03:33,  2.56it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6734]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:06<03:17,  2.77it/s, acc_step=1/1, ce_loss_token=2.0379, perplexity_token=7.6744]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:06<03:20,  2.73it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6738]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:06<03:19,  2.73it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6732]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:07<03:21,  2.71it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6727]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:07<03:07,  2.89it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6740]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:07<03:11,  2.83it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6734]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:08<03:16,  2.75it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6727]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:08<03:18,  2.72it/s, acc_step=1/1, ce_loss_token=2.0376, perplexity_token=7.6720]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:08<03:05,  2.91it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6726]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:09<02:55,  3.06it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6733]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:09<03:04,  2.91it/s, acc_step=1/1, ce_loss_token=2.0377, perplexity_token=7.6726]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:09<02:56,  3.04it/s, acc_step=1/1, ce_loss_token=2.0378, perplexity_token=7.6735]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:10<02:57,  3.01it/s, acc_step=1/1, ce_loss_token=2.0376, perplexity_token=7.6725]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:10<03:08,  2.83it/s, acc_step=1/1, ce_loss_token=2.0376, perplexity_token=7.6719]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:10<03:15,  2.73it/s, acc_step=1/1, ce_loss_token=2.0375, perplexity_token=7.6711]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:11<03:12,  2.76it/s, acc_step=1/1, ce_loss_token=2.0374, perplexity_token=7.6704]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:11<03:19,  2.67it/s, acc_step=1/1, ce_loss_token=2.0373, perplexity_token=7.6698]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:12<03:26,  2.57it/s, acc_step=1/1, ce_loss_token=2.0372, perplexity_token=7.6690]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:12<03:17,  2.68it/s, acc_step=1/1, ce_loss_token=2.0371, perplexity_token=7.6684]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:12<03:22,  2.60it/s, acc_step=1/1, ce_loss_token=2.0370, perplexity_token=7.6675]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:13<03:21,  2.61it/s, acc_step=1/1, ce_loss_token=2.0369, perplexity_token=7.6668]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:13<03:18,  2.66it/s, acc_step=1/1, ce_loss_token=2.0368, perplexity_token=7.6659]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:13<03:14,  2.70it/s, acc_step=1/1, ce_loss_token=2.0367, perplexity_token=7.6651]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:14<02:57,  2.95it/s, acc_step=1/1, ce_loss_token=2.0368, perplexity_token=7.6658]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:14<02:54,  3.00it/s, acc_step=1/1, ce_loss_token=2.0367, perplexity_token=7.6650]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:14<03:00,  2.89it/s, acc_step=1/1, ce_loss_token=2.0366, perplexity_token=7.6642]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:15<03:11,  2.72it/s, acc_step=1/1, ce_loss_token=2.0365, perplexity_token=7.6637]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:15<03:17,  2.63it/s, acc_step=1/1, ce_loss_token=2.0364, perplexity_token=7.6631]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:16<03:06,  2.79it/s, acc_step=1/1, ce_loss_token=2.0365, perplexity_token=7.6639]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:16<03:07,  2.76it/s, acc_step=1/1, ce_loss_token=2.0364, perplexity_token=7.6634]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:16<03:11,  2.70it/s, acc_step=1/1, ce_loss_token=2.0364, perplexity_token=7.6627]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:17<03:11,  2.69it/s, acc_step=1/1, ce_loss_token=2.0363, perplexity_token=7.6620]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:17<03:18,  2.59it/s, acc_step=1/1, ce_loss_token=2.0362, perplexity_token=7.6613]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:17<03:14,  2.64it/s, acc_step=1/1, ce_loss_token=2.0361, perplexity_token=7.6606]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:18<03:15,  2.62it/s, acc_step=1/1, ce_loss_token=2.0360, perplexity_token=7.6601]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:18<03:11,  2.68it/s, acc_step=1/1, ce_loss_token=2.0359, perplexity_token=7.6593]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:19<03:16,  2.59it/s, acc_step=1/1, ce_loss_token=2.0358, perplexity_token=7.6584]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:19<03:14,  2.63it/s, acc_step=1/1, ce_loss_token=2.0357, perplexity_token=7.6579]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:19<03:12,  2.65it/s, acc_step=1/1, ce_loss_token=2.0356, perplexity_token=7.6572]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:20<03:09,  2.67it/s, acc_step=1/1, ce_loss_token=2.0356, perplexity_token=7.6566]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:20<03:13,  2.62it/s, acc_step=1/1, ce_loss_token=2.0355, perplexity_token=7.6560]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:20<02:55,  2.88it/s, acc_step=1/1, ce_loss_token=2.0356, perplexity_token=7.6568]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:21<03:17,  2.56it/s, acc_step=1/1, ce_loss_token=2.0355, perplexity_token=7.6563]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:21<03:28,  2.41it/s, acc_step=1/1, ce_loss_token=2.0354, perplexity_token=7.6556]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:22<03:21,  2.49it/s, acc_step=1/1, ce_loss_token=2.0354, perplexity_token=7.6550]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:22<03:03,  2.74it/s, acc_step=1/1, ce_loss_token=2.0354, perplexity_token=7.6555]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:22<03:01,  2.76it/s, acc_step=1/1, ce_loss_token=2.0353, perplexity_token=7.6548]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:23<03:11,  2.61it/s, acc_step=1/1, ce_loss_token=2.0352, perplexity_token=7.6540]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:23<03:07,  2.66it/s, acc_step=1/1, ce_loss_token=2.0352, perplexity_token=7.6534]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:24<03:06,  2.67it/s, acc_step=1/1, ce_loss_token=2.0351, perplexity_token=7.6527]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:24<03:21,  2.47it/s, acc_step=1/1, ce_loss_token=2.0350, perplexity_token=7.6520]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:24<03:03,  2.71it/s, acc_step=1/1, ce_loss_token=2.0351, perplexity_token=7.6527]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:25<03:04,  2.68it/s, acc_step=1/1, ce_loss_token=2.0350, perplexity_token=7.6521]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:25<03:03,  2.69it/s, acc_step=1/1, ce_loss_token=2.0349, perplexity_token=7.6514]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:25<03:06,  2.65it/s, acc_step=1/1, ce_loss_token=2.0348, perplexity_token=7.6508]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:26<03:02,  2.69it/s, acc_step=1/1, ce_loss_token=2.0347, perplexity_token=7.6502]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:26<02:54,  2.81it/s, acc_step=1/1, ce_loss_token=2.0347, perplexity_token=7.6496]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:26<02:55,  2.79it/s, acc_step=1/1, ce_loss_token=2.0346, perplexity_token=7.6490]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:27<03:03,  2.67it/s, acc_step=1/1, ce_loss_token=2.0345, perplexity_token=7.6483]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:27<02:49,  2.87it/s, acc_step=1/1, ce_loss_token=2.0346, perplexity_token=7.6490]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:27<02:40,  3.04it/s, acc_step=1/1, ce_loss_token=2.0347, perplexity_token=7.6501]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:28<02:50,  2.86it/s, acc_step=1/1, ce_loss_token=2.0346, perplexity_token=7.6495]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:28<02:54,  2.78it/s, acc_step=1/1, ce_loss_token=2.0345, perplexity_token=7.6488]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:29<02:59,  2.69it/s, acc_step=1/1, ce_loss_token=2.0345, perplexity_token=7.6481]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:29<03:00,  2.68it/s, acc_step=1/1, ce_loss_token=2.0344, perplexity_token=7.6473]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:29<02:56,  2.73it/s, acc_step=1/1, ce_loss_token=2.0343, perplexity_token=7.6469]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:30<03:12,  2.50it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6463]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:30<03:15,  2.46it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6456]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:31<02:56,  2.72it/s, acc_step=1/1, ce_loss_token=2.0343, perplexity_token=7.6467]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:31<02:50,  2.80it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6460]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:31<02:51,  2.79it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6453]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:32<02:46,  2.86it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6462]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:32<02:50,  2.78it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6454]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:32<02:49,  2.80it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6445]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:33<02:40,  2.94it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6458]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:33<02:40,  2.93it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6450]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:33<02:32,  3.09it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6456]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:34<02:39,  2.94it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6449]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:34<02:47,  2.80it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6443]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:34<02:53,  2.70it/s, acc_step=1/1, ce_loss_token=2.0339, perplexity_token=7.6436]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:35<02:42,  2.87it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6445]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:35<02:32,  3.06it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6450]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:35<02:43,  2.85it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6445]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:36<02:37,  2.95it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6454]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:36<02:33,  3.02it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6463]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:36<02:40,  2.89it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6457]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:37<02:33,  3.00it/s, acc_step=1/1, ce_loss_token=2.0343, perplexity_token=7.6465]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:37<02:38,  2.90it/s, acc_step=1/1, ce_loss_token=2.0342, perplexity_token=7.6458]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:38<02:56,  2.59it/s, acc_step=1/1, ce_loss_token=2.0341, perplexity_token=7.6452]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:38<02:51,  2.67it/s, acc_step=1/1, ce_loss_token=2.0340, perplexity_token=7.6447]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:38<02:49,  2.69it/s, acc_step=1/1, ce_loss_token=2.0339, perplexity_token=7.6442]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:39<02:56,  2.59it/s, acc_step=1/1, ce_loss_token=2.0339, perplexity_token=7.6437]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:39<02:39,  2.86it/s, acc_step=1/1, ce_loss_token=2.0339, perplexity_token=7.6441]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:39<02:49,  2.68it/s, acc_step=1/1, ce_loss_token=2.0339, perplexity_token=7.6435]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:40<02:50,  2.66it/s, acc_step=1/1, ce_loss_token=2.0338, perplexity_token=7.6429]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:40<02:49,  2.66it/s, acc_step=1/1, ce_loss_token=2.0337, perplexity_token=7.6423]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:40<02:45,  2.73it/s, acc_step=1/1, ce_loss_token=2.0336, perplexity_token=7.6415]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:41<02:41,  2.79it/s, acc_step=1/1, ce_loss_token=2.0335, perplexity_token=7.6408]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:41<02:34,  2.91it/s, acc_step=1/1, ce_loss_token=2.0336, perplexity_token=7.6415]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:41<02:35,  2.88it/s, acc_step=1/1, ce_loss_token=2.0335, perplexity_token=7.6409]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:42<02:38,  2.83it/s, acc_step=1/1, ce_loss_token=2.0334, perplexity_token=7.6402]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:42<02:30,  2.97it/s, acc_step=1/1, ce_loss_token=2.0335, perplexity_token=7.6407]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:43<02:32,  2.93it/s, acc_step=1/1, ce_loss_token=2.0334, perplexity_token=7.6399]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:43<02:42,  2.72it/s, acc_step=1/1, ce_loss_token=2.0333, perplexity_token=7.6390]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:43<02:45,  2.68it/s, acc_step=1/1, ce_loss_token=2.0332, perplexity_token=7.6383]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:44<02:46,  2.65it/s, acc_step=1/1, ce_loss_token=2.0331, perplexity_token=7.6378]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:44<02:37,  2.80it/s, acc_step=1/1, ce_loss_token=2.0332, perplexity_token=7.6381]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:44<02:26,  3.00it/s, acc_step=1/1, ce_loss_token=2.0333, perplexity_token=7.6389]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:45<02:33,  2.87it/s, acc_step=1/1, ce_loss_token=2.0332, perplexity_token=7.6382]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:45<02:40,  2.73it/s, acc_step=1/1, ce_loss_token=2.0331, perplexity_token=7.6376]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:46<02:47,  2.61it/s, acc_step=1/1, ce_loss_token=2.0330, perplexity_token=7.6370]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:46<03:04,  2.36it/s, acc_step=1/1, ce_loss_token=2.0329, perplexity_token=7.6364]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:46<02:57,  2.45it/s, acc_step=1/1, ce_loss_token=2.0328, perplexity_token=7.6358]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:47<02:51,  2.53it/s, acc_step=1/1, ce_loss_token=2.0328, perplexity_token=7.6352]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:47<02:51,  2.53it/s, acc_step=1/1, ce_loss_token=2.0327, perplexity_token=7.6345]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:48<02:48,  2.56it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6338]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:48<02:59,  2.39it/s, acc_step=1/1, ce_loss_token=2.0327, perplexity_token=7.6346]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:48<02:54,  2.46it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6340]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:49<02:50,  2.51it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6335]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:49<02:37,  2.72it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6342]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:49<02:39,  2.68it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6336]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:50<02:20,  3.03it/s, acc_step=1/1, ce_loss_token=2.0329, perplexity_token=7.6366]

torch.Size([256, 301, 35]) torch.Size([256, 301])
torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:50<02:24,  2.93it/s, acc_step=1/1, ce_loss_token=2.0329, perplexity_token=7.6360]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:51<02:31,  2.80it/s, acc_step=1/1, ce_loss_token=2.0328, perplexity_token=7.6354]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:51<02:38,  2.66it/s, acc_step=1/1, ce_loss_token=2.0327, perplexity_token=7.6348]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:52<02:40,  2.63it/s, acc_step=1/1, ce_loss_token=2.0327, perplexity_token=7.6343]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:52<02:37,  2.66it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6339]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:52<02:37,  2.66it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6334]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:53<02:39,  2.63it/s, acc_step=1/1, ce_loss_token=2.0324, perplexity_token=7.6327]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:53<02:28,  2.80it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6333]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:53<02:26,  2.84it/s, acc_step=1/1, ce_loss_token=2.0324, perplexity_token=7.6327]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:54<02:32,  2.73it/s, acc_step=1/1, ce_loss_token=2.0324, perplexity_token=7.6322]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:54<02:37,  2.63it/s, acc_step=1/1, ce_loss_token=2.0323, perplexity_token=7.6314]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:55<02:26,  2.81it/s, acc_step=1/1, ce_loss_token=2.0324, perplexity_token=7.6321]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:55<02:26,  2.82it/s, acc_step=1/1, ce_loss_token=2.0323, perplexity_token=7.6314]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:55<02:08,  3.19it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6335]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:56<02:13,  3.05it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6329]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:56<02:11,  3.11it/s, acc_step=1/1, ce_loss_token=2.0326, perplexity_token=7.6336]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:57<02:17,  2.96it/s, acc_step=1/1, ce_loss_token=2.0325, perplexity_token=7.6330]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:57<02:18,  2.93it/s, acc_step=1/1, ce_loss_token=2.0324, perplexity_token=7.6323]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:57<02:19,  2.91it/s, acc_step=1/1, ce_loss_token=2.0323, perplexity_token=7.6317]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:58<02:25,  2.78it/s, acc_step=1/1, ce_loss_token=2.0323, perplexity_token=7.6312]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:58<02:26,  2.74it/s, acc_step=1/1, ce_loss_token=2.0322, perplexity_token=7.6307]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:58<02:33,  2.62it/s, acc_step=1/1, ce_loss_token=2.0321, perplexity_token=7.6302]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:59<02:30,  2.67it/s, acc_step=1/1, ce_loss_token=2.0320, perplexity_token=7.6294]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:59<02:18,  2.88it/s, acc_step=1/1, ce_loss_token=2.0321, perplexity_token=7.6299]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:59<02:18,  2.88it/s, acc_step=1/1, ce_loss_token=2.0320, perplexity_token=7.6292]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [04:00<02:28,  2.67it/s, acc_step=1/1, ce_loss_token=2.0319, perplexity_token=7.6285]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [04:00<02:24,  2.74it/s, acc_step=1/1, ce_loss_token=2.0318, perplexity_token=7.6277]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [04:01<02:26,  2.69it/s, acc_step=1/1, ce_loss_token=2.0317, perplexity_token=7.6271]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [04:01<02:24,  2.73it/s, acc_step=1/1, ce_loss_token=2.0316, perplexity_token=7.6264]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [04:01<02:27,  2.67it/s, acc_step=1/1, ce_loss_token=2.0315, perplexity_token=7.6258]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [04:02<02:25,  2.70it/s, acc_step=1/1, ce_loss_token=2.0315, perplexity_token=7.6252]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [04:02<02:23,  2.74it/s, acc_step=1/1, ce_loss_token=2.0314, perplexity_token=7.6246]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [04:02<02:14,  2.91it/s, acc_step=1/1, ce_loss_token=2.0314, perplexity_token=7.6250]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [04:03<02:14,  2.89it/s, acc_step=1/1, ce_loss_token=2.0314, perplexity_token=7.6244]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [04:03<02:16,  2.84it/s, acc_step=1/1, ce_loss_token=2.0313, perplexity_token=7.6237]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [04:03<02:27,  2.64it/s, acc_step=1/1, ce_loss_token=2.0312, perplexity_token=7.6233]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:04<02:25,  2.66it/s, acc_step=1/1, ce_loss_token=2.0311, perplexity_token=7.6227]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:04<02:43,  2.36it/s, acc_step=1/1, ce_loss_token=2.0311, perplexity_token=7.6223]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:05<02:30,  2.56it/s, acc_step=1/1, ce_loss_token=2.0311, perplexity_token=7.6228]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:05<02:41,  2.38it/s, acc_step=1/1, ce_loss_token=2.0311, perplexity_token=7.6222]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:06<02:36,  2.45it/s, acc_step=1/1, ce_loss_token=2.0310, perplexity_token=7.6217]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:06<02:28,  2.57it/s, acc_step=1/1, ce_loss_token=2.0309, perplexity_token=7.6212]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:06<02:26,  2.59it/s, acc_step=1/1, ce_loss_token=2.0309, perplexity_token=7.6207]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:07<02:27,  2.58it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6201]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:07<02:26,  2.59it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6196]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:07<02:25,  2.61it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6191]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:08<02:22,  2.64it/s, acc_step=1/1, ce_loss_token=2.0306, perplexity_token=7.6185]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:08<02:12,  2.83it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6191]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:08<02:06,  2.97it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6198]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:09<01:59,  3.12it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6205]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:09<02:06,  2.94it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6200]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:09<02:09,  2.88it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6192]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:10<02:01,  3.06it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6199]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:10<02:06,  2.92it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6194]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:10<01:56,  3.17it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6199]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:11<01:50,  3.32it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6203]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:11<01:59,  3.08it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6198]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:11<02:00,  3.04it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6205]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:12<02:14,  2.71it/s, acc_step=1/1, ce_loss_token=2.0308, perplexity_token=7.6200]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:12<02:17,  2.64it/s, acc_step=1/1, ce_loss_token=2.0307, perplexity_token=7.6195]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:13<02:16,  2.66it/s, acc_step=1/1, ce_loss_token=2.0306, perplexity_token=7.6189]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:13<02:14,  2.69it/s, acc_step=1/1, ce_loss_token=2.0305, perplexity_token=7.6182]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:13<02:03,  2.91it/s, acc_step=1/1, ce_loss_token=2.0306, perplexity_token=7.6185]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:14<02:03,  2.91it/s, acc_step=1/1, ce_loss_token=2.0305, perplexity_token=7.6180]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:14<02:08,  2.80it/s, acc_step=1/1, ce_loss_token=2.0304, perplexity_token=7.6173]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:14<02:10,  2.74it/s, acc_step=1/1, ce_loss_token=2.0303, perplexity_token=7.6167]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:15<02:05,  2.85it/s, acc_step=1/1, ce_loss_token=2.0304, perplexity_token=7.6171]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:15<02:05,  2.83it/s, acc_step=1/1, ce_loss_token=2.0303, perplexity_token=7.6165]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:16<02:23,  2.48it/s, acc_step=1/1, ce_loss_token=2.0302, perplexity_token=7.6159]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:16<02:22,  2.48it/s, acc_step=1/1, ce_loss_token=2.0302, perplexity_token=7.6155]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:16<02:27,  2.40it/s, acc_step=1/1, ce_loss_token=2.0301, perplexity_token=7.6150]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:17<02:12,  2.65it/s, acc_step=1/1, ce_loss_token=2.0302, perplexity_token=7.6155]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:17<02:07,  2.75it/s, acc_step=1/1, ce_loss_token=2.0301, perplexity_token=7.6149]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:17<02:08,  2.71it/s, acc_step=1/1, ce_loss_token=2.0300, perplexity_token=7.6144]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:18<02:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.0299, perplexity_token=7.6137]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:18<02:09,  2.69it/s, acc_step=1/1, ce_loss_token=2.0298, perplexity_token=7.6129]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:19<02:10,  2.65it/s, acc_step=1/1, ce_loss_token=2.0297, perplexity_token=7.6122]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:19<02:10,  2.66it/s, acc_step=1/1, ce_loss_token=2.0297, perplexity_token=7.6116]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:19<02:01,  2.84it/s, acc_step=1/1, ce_loss_token=2.0298, perplexity_token=7.6124]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:19<01:58,  2.90it/s, acc_step=1/1, ce_loss_token=2.0297, perplexity_token=7.6119]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:20<02:01,  2.81it/s, acc_step=1/1, ce_loss_token=2.0296, perplexity_token=7.6113]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:20<02:07,  2.68it/s, acc_step=1/1, ce_loss_token=2.0296, perplexity_token=7.6107]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:21<02:09,  2.63it/s, acc_step=1/1, ce_loss_token=2.0295, perplexity_token=7.6101]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:21<02:16,  2.48it/s, acc_step=1/1, ce_loss_token=2.0294, perplexity_token=7.6096]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:22<02:12,  2.55it/s, acc_step=1/1, ce_loss_token=2.0293, perplexity_token=7.6090]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:22<02:03,  2.73it/s, acc_step=1/1, ce_loss_token=2.0294, perplexity_token=7.6095]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:22<02:02,  2.74it/s, acc_step=1/1, ce_loss_token=2.0293, perplexity_token=7.6090]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:23<02:01,  2.77it/s, acc_step=1/1, ce_loss_token=2.0293, perplexity_token=7.6085]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:23<02:02,  2.74it/s, acc_step=1/1, ce_loss_token=2.0292, perplexity_token=7.6079]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:23<02:02,  2.73it/s, acc_step=1/1, ce_loss_token=2.0291, perplexity_token=7.6074]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:24<02:01,  2.74it/s, acc_step=1/1, ce_loss_token=2.0290, perplexity_token=7.6068]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:24<01:54,  2.89it/s, acc_step=1/1, ce_loss_token=2.0291, perplexity_token=7.6073]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:24<01:56,  2.84it/s, acc_step=1/1, ce_loss_token=2.0290, perplexity_token=7.6067]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:25<01:55,  2.86it/s, acc_step=1/1, ce_loss_token=2.0289, perplexity_token=7.6061]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:25<01:48,  3.02it/s, acc_step=1/1, ce_loss_token=2.0290, perplexity_token=7.6065]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:25<01:52,  2.92it/s, acc_step=1/1, ce_loss_token=2.0289, perplexity_token=7.6060]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:26<01:55,  2.83it/s, acc_step=1/1, ce_loss_token=2.0289, perplexity_token=7.6054]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:26<01:56,  2.80it/s, acc_step=1/1, ce_loss_token=2.0288, perplexity_token=7.6048]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:26<01:58,  2.75it/s, acc_step=1/1, ce_loss_token=2.0287, perplexity_token=7.6042]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:27<01:52,  2.87it/s, acc_step=1/1, ce_loss_token=2.0288, perplexity_token=7.6047]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:27<01:44,  3.08it/s, acc_step=1/1, ce_loss_token=2.0288, perplexity_token=7.6053]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:27<01:46,  3.02it/s, acc_step=1/1, ce_loss_token=2.0288, perplexity_token=7.6046]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:28<01:51,  2.88it/s, acc_step=1/1, ce_loss_token=2.0287, perplexity_token=7.6040]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:28<01:53,  2.82it/s, acc_step=1/1, ce_loss_token=2.0286, perplexity_token=7.6033]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:28<01:47,  2.97it/s, acc_step=1/1, ce_loss_token=2.0286, perplexity_token=7.6036]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:29<01:50,  2.87it/s, acc_step=1/1, ce_loss_token=2.0285, perplexity_token=7.6029]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:29<01:42,  3.08it/s, acc_step=1/1, ce_loss_token=2.0286, perplexity_token=7.6035]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:29<01:45,  3.00it/s, acc_step=1/1, ce_loss_token=2.0285, perplexity_token=7.6029]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:30<01:49,  2.87it/s, acc_step=1/1, ce_loss_token=2.0284, perplexity_token=7.6022]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:30<01:51,  2.81it/s, acc_step=1/1, ce_loss_token=2.0284, perplexity_token=7.6016]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:30<01:45,  2.97it/s, acc_step=1/1, ce_loss_token=2.0284, perplexity_token=7.6021]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:31<01:42,  3.05it/s, acc_step=1/1, ce_loss_token=2.0285, perplexity_token=7.6025]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:31<01:43,  3.00it/s, acc_step=1/1, ce_loss_token=2.0284, perplexity_token=7.6020]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:31<01:46,  2.90it/s, acc_step=1/1, ce_loss_token=2.0283, perplexity_token=7.6014]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:32<01:53,  2.73it/s, acc_step=1/1, ce_loss_token=2.0283, perplexity_token=7.6008]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:32<01:52,  2.75it/s, acc_step=1/1, ce_loss_token=2.0282, perplexity_token=7.6002]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:33<01:55,  2.65it/s, acc_step=1/1, ce_loss_token=2.0281, perplexity_token=7.5996]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:33<01:45,  2.89it/s, acc_step=1/1, ce_loss_token=2.0282, perplexity_token=7.6003]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:33<01:48,  2.80it/s, acc_step=1/1, ce_loss_token=2.0281, perplexity_token=7.5997]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:34<01:41,  3.00it/s, acc_step=1/1, ce_loss_token=2.0282, perplexity_token=7.6001]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:34<01:43,  2.93it/s, acc_step=1/1, ce_loss_token=2.0281, perplexity_token=7.5994]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:34<01:46,  2.83it/s, acc_step=1/1, ce_loss_token=2.0280, perplexity_token=7.5986]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:35<01:50,  2.73it/s, acc_step=1/1, ce_loss_token=2.0279, perplexity_token=7.5981]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:35<01:59,  2.51it/s, acc_step=1/1, ce_loss_token=2.0278, perplexity_token=7.5977]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:36<01:57,  2.54it/s, acc_step=1/1, ce_loss_token=2.0278, perplexity_token=7.5972]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:36<01:54,  2.60it/s, acc_step=1/1, ce_loss_token=2.0277, perplexity_token=7.5967]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:36<01:48,  2.74it/s, acc_step=1/1, ce_loss_token=2.0278, perplexity_token=7.5971]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:37<01:40,  2.94it/s, acc_step=1/1, ce_loss_token=2.0278, perplexity_token=7.5976]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:37<01:50,  2.67it/s, acc_step=1/1, ce_loss_token=2.0277, perplexity_token=7.5970]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:37<01:49,  2.70it/s, acc_step=1/1, ce_loss_token=2.0277, perplexity_token=7.5964]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:38<01:48,  2.71it/s, acc_step=1/1, ce_loss_token=2.0276, perplexity_token=7.5958]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:38<01:41,  2.88it/s, acc_step=1/1, ce_loss_token=2.0276, perplexity_token=7.5960]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:38<01:36,  3.03it/s, acc_step=1/1, ce_loss_token=2.0277, perplexity_token=7.5965]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:39<01:39,  2.90it/s, acc_step=1/1, ce_loss_token=2.0276, perplexity_token=7.5959]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:39<01:41,  2.85it/s, acc_step=1/1, ce_loss_token=2.0275, perplexity_token=7.5952]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:39<01:42,  2.81it/s, acc_step=1/1, ce_loss_token=2.0274, perplexity_token=7.5947]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:40<01:44,  2.74it/s, acc_step=1/1, ce_loss_token=2.0274, perplexity_token=7.5941]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:40<01:47,  2.67it/s, acc_step=1/1, ce_loss_token=2.0273, perplexity_token=7.5933]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:41<01:51,  2.56it/s, acc_step=1/1, ce_loss_token=2.0272, perplexity_token=7.5929]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:41<01:51,  2.55it/s, acc_step=1/1, ce_loss_token=2.0271, perplexity_token=7.5923]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:41<01:41,  2.80it/s, acc_step=1/1, ce_loss_token=2.0272, perplexity_token=7.5927]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:42<01:38,  2.86it/s, acc_step=1/1, ce_loss_token=2.0271, perplexity_token=7.5921]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:42<01:32,  3.02it/s, acc_step=1/1, ce_loss_token=2.0271, perplexity_token=7.5923]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:42<01:37,  2.87it/s, acc_step=1/1, ce_loss_token=2.0271, perplexity_token=7.5918]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:43<01:37,  2.87it/s, acc_step=1/1, ce_loss_token=2.0270, perplexity_token=7.5912]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:43<01:41,  2.75it/s, acc_step=1/1, ce_loss_token=2.0269, perplexity_token=7.5906]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:43<01:43,  2.68it/s, acc_step=1/1, ce_loss_token=2.0268, perplexity_token=7.5900]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:44<01:45,  2.60it/s, acc_step=1/1, ce_loss_token=2.0267, perplexity_token=7.5893]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:44<01:45,  2.61it/s, acc_step=1/1, ce_loss_token=2.0266, perplexity_token=7.5886]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:45<01:43,  2.65it/s, acc_step=1/1, ce_loss_token=2.0266, perplexity_token=7.5881]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:45<01:39,  2.74it/s, acc_step=1/1, ce_loss_token=2.0266, perplexity_token=7.5883]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:45<01:32,  2.94it/s, acc_step=1/1, ce_loss_token=2.0267, perplexity_token=7.5887]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:46<01:37,  2.78it/s, acc_step=1/1, ce_loss_token=2.0266, perplexity_token=7.5882]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:46<01:37,  2.77it/s, acc_step=1/1, ce_loss_token=2.0265, perplexity_token=7.5877]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:46<01:41,  2.64it/s, acc_step=1/1, ce_loss_token=2.0264, perplexity_token=7.5870]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:47<01:40,  2.68it/s, acc_step=1/1, ce_loss_token=2.0264, perplexity_token=7.5865]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:47<01:33,  2.87it/s, acc_step=1/1, ce_loss_token=2.0264, perplexity_token=7.5869]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:47<01:35,  2.78it/s, acc_step=1/1, ce_loss_token=2.0263, perplexity_token=7.5863]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:48<01:37,  2.73it/s, acc_step=1/1, ce_loss_token=2.0263, perplexity_token=7.5858]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:48<01:30,  2.91it/s, acc_step=1/1, ce_loss_token=2.0263, perplexity_token=7.5862]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:49<01:32,  2.84it/s, acc_step=1/1, ce_loss_token=2.0263, perplexity_token=7.5857]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:49<01:37,  2.69it/s, acc_step=1/1, ce_loss_token=2.0262, perplexity_token=7.5851]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:49<01:35,  2.74it/s, acc_step=1/1, ce_loss_token=2.0261, perplexity_token=7.5844]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:50<01:37,  2.68it/s, acc_step=1/1, ce_loss_token=2.0260, perplexity_token=7.5839]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:50<01:39,  2.61it/s, acc_step=1/1, ce_loss_token=2.0259, perplexity_token=7.5833]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:50<01:38,  2.63it/s, acc_step=1/1, ce_loss_token=2.0259, perplexity_token=7.5828]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:51<01:41,  2.54it/s, acc_step=1/1, ce_loss_token=2.0258, perplexity_token=7.5824]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:51<01:41,  2.52it/s, acc_step=1/1, ce_loss_token=2.0258, perplexity_token=7.5819]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:52<01:39,  2.57it/s, acc_step=1/1, ce_loss_token=2.0257, perplexity_token=7.5812]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:52<01:38,  2.58it/s, acc_step=1/1, ce_loss_token=2.0256, perplexity_token=7.5806]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:52<01:40,  2.52it/s, acc_step=1/1, ce_loss_token=2.0255, perplexity_token=7.5801]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:53<01:38,  2.55it/s, acc_step=1/1, ce_loss_token=2.0255, perplexity_token=7.5795]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:53<01:37,  2.58it/s, acc_step=1/1, ce_loss_token=2.0254, perplexity_token=7.5790]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:53<01:28,  2.84it/s, acc_step=1/1, ce_loss_token=2.0254, perplexity_token=7.5793]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:54<01:32,  2.70it/s, acc_step=1/1, ce_loss_token=2.0253, perplexity_token=7.5787]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:54<01:32,  2.68it/s, acc_step=1/1, ce_loss_token=2.0253, perplexity_token=7.5782]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:55<01:34,  2.62it/s, acc_step=1/1, ce_loss_token=2.0252, perplexity_token=7.5775]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:55<01:37,  2.51it/s, acc_step=1/1, ce_loss_token=2.0251, perplexity_token=7.5770]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:56<01:41,  2.41it/s, acc_step=1/1, ce_loss_token=2.0250, perplexity_token=7.5764]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:56<01:32,  2.63it/s, acc_step=1/1, ce_loss_token=2.0251, perplexity_token=7.5770]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:56<01:33,  2.61it/s, acc_step=1/1, ce_loss_token=2.0250, perplexity_token=7.5765]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:57<01:29,  2.70it/s, acc_step=1/1, ce_loss_token=2.0251, perplexity_token=7.5767]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:57<01:23,  2.89it/s, acc_step=1/1, ce_loss_token=2.0251, perplexity_token=7.5770]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:57<01:24,  2.85it/s, acc_step=1/1, ce_loss_token=2.0251, perplexity_token=7.5765]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:58<01:27,  2.74it/s, acc_step=1/1, ce_loss_token=2.0250, perplexity_token=7.5760]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:58<01:28,  2.68it/s, acc_step=1/1, ce_loss_token=2.0249, perplexity_token=7.5754]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:58<01:28,  2.67it/s, acc_step=1/1, ce_loss_token=2.0248, perplexity_token=7.5748]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:59<01:28,  2.66it/s, acc_step=1/1, ce_loss_token=2.0248, perplexity_token=7.5743]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:59<01:27,  2.67it/s, acc_step=1/1, ce_loss_token=2.0247, perplexity_token=7.5736]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [05:00<01:15,  3.07it/s, acc_step=1/1, ce_loss_token=2.0248, perplexity_token=7.5748]

torch.Size([256, 298, 35]) torch.Size([256, 298])
torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [05:00<01:17,  2.99it/s, acc_step=1/1, ce_loss_token=2.0248, perplexity_token=7.5743]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [05:01<01:22,  2.79it/s, acc_step=1/1, ce_loss_token=2.0247, perplexity_token=7.5738]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [05:01<01:15,  3.05it/s, acc_step=1/1, ce_loss_token=2.0250, perplexity_token=7.5760]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [05:01<01:17,  2.94it/s, acc_step=1/1, ce_loss_token=2.0249, perplexity_token=7.5753]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [05:02<01:20,  2.83it/s, acc_step=1/1, ce_loss_token=2.0248, perplexity_token=7.5747]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [05:02<01:20,  2.82it/s, acc_step=1/1, ce_loss_token=2.0247, perplexity_token=7.5740]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [05:02<01:19,  2.83it/s, acc_step=1/1, ce_loss_token=2.0247, perplexity_token=7.5736]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:03<01:21,  2.77it/s, acc_step=1/1, ce_loss_token=2.0246, perplexity_token=7.5730]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:03<01:19,  2.82it/s, acc_step=1/1, ce_loss_token=2.0245, perplexity_token=7.5725]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:03<01:22,  2.71it/s, acc_step=1/1, ce_loss_token=2.0244, perplexity_token=7.5718]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:04<01:23,  2.65it/s, acc_step=1/1, ce_loss_token=2.0244, perplexity_token=7.5713]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:04<01:22,  2.69it/s, acc_step=1/1, ce_loss_token=2.0243, perplexity_token=7.5708]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:04<01:21,  2.70it/s, acc_step=1/1, ce_loss_token=2.0242, perplexity_token=7.5703]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:05<01:22,  2.66it/s, acc_step=1/1, ce_loss_token=2.0242, perplexity_token=7.5697]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:05<01:16,  2.84it/s, acc_step=1/1, ce_loss_token=2.0242, perplexity_token=7.5701]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:06<01:17,  2.81it/s, acc_step=1/1, ce_loss_token=2.0241, perplexity_token=7.5694]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:06<01:17,  2.78it/s, acc_step=1/1, ce_loss_token=2.0240, perplexity_token=7.5689]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:06<01:19,  2.70it/s, acc_step=1/1, ce_loss_token=2.0240, perplexity_token=7.5682]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:07<01:18,  2.74it/s, acc_step=1/1, ce_loss_token=2.0239, perplexity_token=7.5675]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:07<01:18,  2.71it/s, acc_step=1/1, ce_loss_token=2.0238, perplexity_token=7.5669]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:07<01:18,  2.70it/s, acc_step=1/1, ce_loss_token=2.0237, perplexity_token=7.5663]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:08<01:20,  2.62it/s, acc_step=1/1, ce_loss_token=2.0236, perplexity_token=7.5657]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:08<01:18,  2.68it/s, acc_step=1/1, ce_loss_token=2.0236, perplexity_token=7.5651]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:09<01:18,  2.66it/s, acc_step=1/1, ce_loss_token=2.0235, perplexity_token=7.5646]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:09<01:11,  2.92it/s, acc_step=1/1, ce_loss_token=2.0235, perplexity_token=7.5649]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:09<01:13,  2.83it/s, acc_step=1/1, ce_loss_token=2.0234, perplexity_token=7.5643]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:10<01:14,  2.77it/s, acc_step=1/1, ce_loss_token=2.0234, perplexity_token=7.5638]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:10<01:09,  2.95it/s, acc_step=1/1, ce_loss_token=2.0234, perplexity_token=7.5641]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:10<01:11,  2.85it/s, acc_step=1/1, ce_loss_token=2.0233, perplexity_token=7.5635]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:11<01:13,  2.77it/s, acc_step=1/1, ce_loss_token=2.0233, perplexity_token=7.5629]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:11<01:09,  2.91it/s, acc_step=1/1, ce_loss_token=2.0233, perplexity_token=7.5632]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:11<01:05,  3.06it/s, acc_step=1/1, ce_loss_token=2.0233, perplexity_token=7.5634]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:12<01:09,  2.88it/s, acc_step=1/1, ce_loss_token=2.0232, perplexity_token=7.5628]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:12<01:11,  2.80it/s, acc_step=1/1, ce_loss_token=2.0232, perplexity_token=7.5623]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:12<01:11,  2.79it/s, acc_step=1/1, ce_loss_token=2.0231, perplexity_token=7.5618]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:13<01:06,  2.94it/s, acc_step=1/1, ce_loss_token=2.0231, perplexity_token=7.5620]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:13<01:11,  2.75it/s, acc_step=1/1, ce_loss_token=2.0231, perplexity_token=7.5614]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:13<01:06,  2.92it/s, acc_step=1/1, ce_loss_token=2.0231, perplexity_token=7.5616]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:14<01:10,  2.74it/s, acc_step=1/1, ce_loss_token=2.0230, perplexity_token=7.5611]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:14<01:12,  2.65it/s, acc_step=1/1, ce_loss_token=2.0229, perplexity_token=7.5605]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:15<01:22,  2.32it/s, acc_step=1/1, ce_loss_token=2.0229, perplexity_token=7.5600]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:15<01:13,  2.60it/s, acc_step=1/1, ce_loss_token=2.0229, perplexity_token=7.5604]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:15<01:11,  2.67it/s, acc_step=1/1, ce_loss_token=2.0228, perplexity_token=7.5598]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:16<01:11,  2.63it/s, acc_step=1/1, ce_loss_token=2.0228, perplexity_token=7.5592]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:16<01:08,  2.76it/s, acc_step=1/1, ce_loss_token=2.0228, perplexity_token=7.5596]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:16<01:07,  2.77it/s, acc_step=1/1, ce_loss_token=2.0227, perplexity_token=7.5591]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:17<01:09,  2.67it/s, acc_step=1/1, ce_loss_token=2.0227, perplexity_token=7.5585]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:17<01:09,  2.67it/s, acc_step=1/1, ce_loss_token=2.0226, perplexity_token=7.5580]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:18<01:12,  2.54it/s, acc_step=1/1, ce_loss_token=2.0225, perplexity_token=7.5574]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:18<01:09,  2.63it/s, acc_step=1/1, ce_loss_token=2.0225, perplexity_token=7.5569]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:18<01:08,  2.67it/s, acc_step=1/1, ce_loss_token=2.0224, perplexity_token=7.5565]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:19<01:13,  2.46it/s, acc_step=1/1, ce_loss_token=2.0223, perplexity_token=7.5558]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:19<01:10,  2.57it/s, acc_step=1/1, ce_loss_token=2.0222, perplexity_token=7.5553]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:20<01:09,  2.57it/s, acc_step=1/1, ce_loss_token=2.0222, perplexity_token=7.5548]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:20<01:08,  2.60it/s, acc_step=1/1, ce_loss_token=2.0221, perplexity_token=7.5543]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:20<01:05,  2.70it/s, acc_step=1/1, ce_loss_token=2.0222, perplexity_token=7.5546]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:21<01:00,  2.91it/s, acc_step=1/1, ce_loss_token=2.0222, perplexity_token=7.5551]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:21<01:00,  2.90it/s, acc_step=1/1, ce_loss_token=2.0221, perplexity_token=7.5545]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:21<01:04,  2.71it/s, acc_step=1/1, ce_loss_token=2.0221, perplexity_token=7.5539]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:22<01:03,  2.72it/s, acc_step=1/1, ce_loss_token=2.0220, perplexity_token=7.5533]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:22<01:04,  2.68it/s, acc_step=1/1, ce_loss_token=2.0219, perplexity_token=7.5528]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:23<01:05,  2.59it/s, acc_step=1/1, ce_loss_token=2.0218, perplexity_token=7.5521]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:23<01:04,  2.64it/s, acc_step=1/1, ce_loss_token=2.0218, perplexity_token=7.5516]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:23<01:02,  2.70it/s, acc_step=1/1, ce_loss_token=2.0217, perplexity_token=7.5509]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:24<01:02,  2.69it/s, acc_step=1/1, ce_loss_token=2.0216, perplexity_token=7.5504]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:24<01:07,  2.49it/s, acc_step=1/1, ce_loss_token=2.0215, perplexity_token=7.5498]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:24<01:00,  2.74it/s, acc_step=1/1, ce_loss_token=2.0216, perplexity_token=7.5503]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:25<01:00,  2.71it/s, acc_step=1/1, ce_loss_token=2.0215, perplexity_token=7.5496]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:25<00:58,  2.79it/s, acc_step=1/1, ce_loss_token=2.0215, perplexity_token=7.5499]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:25<01:00,  2.71it/s, acc_step=1/1, ce_loss_token=2.0215, perplexity_token=7.5494]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:26<01:00,  2.69it/s, acc_step=1/1, ce_loss_token=2.0214, perplexity_token=7.5488]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:26<00:59,  2.69it/s, acc_step=1/1, ce_loss_token=2.0213, perplexity_token=7.5482]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:27<01:00,  2.66it/s, acc_step=1/1, ce_loss_token=2.0212, perplexity_token=7.5476]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:27<00:59,  2.65it/s, acc_step=1/1, ce_loss_token=2.0212, perplexity_token=7.5471]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:27<00:58,  2.68it/s, acc_step=1/1, ce_loss_token=2.0211, perplexity_token=7.5465]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:28<00:58,  2.69it/s, acc_step=1/1, ce_loss_token=2.0210, perplexity_token=7.5459]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:28<00:58,  2.68it/s, acc_step=1/1, ce_loss_token=2.0209, perplexity_token=7.5454]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:28<00:58,  2.64it/s, acc_step=1/1, ce_loss_token=2.0209, perplexity_token=7.5449]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:29<00:59,  2.58it/s, acc_step=1/1, ce_loss_token=2.0208, perplexity_token=7.5444]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:29<00:59,  2.59it/s, acc_step=1/1, ce_loss_token=2.0207, perplexity_token=7.5439]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:30<00:59,  2.57it/s, acc_step=1/1, ce_loss_token=2.0207, perplexity_token=7.5432]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:30<00:58,  2.60it/s, acc_step=1/1, ce_loss_token=2.0206, perplexity_token=7.5428]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:30<00:56,  2.68it/s, acc_step=1/1, ce_loss_token=2.0205, perplexity_token=7.5421]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:31<00:54,  2.74it/s, acc_step=1/1, ce_loss_token=2.0204, perplexity_token=7.5416]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:31<00:56,  2.62it/s, acc_step=1/1, ce_loss_token=2.0204, perplexity_token=7.5410]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:32<00:56,  2.59it/s, acc_step=1/1, ce_loss_token=2.0203, perplexity_token=7.5404]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:32<00:55,  2.61it/s, acc_step=1/1, ce_loss_token=2.0202, perplexity_token=7.5399]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:32<00:54,  2.65it/s, acc_step=1/1, ce_loss_token=2.0201, perplexity_token=7.5393]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:33<00:57,  2.51it/s, acc_step=1/1, ce_loss_token=2.0201, perplexity_token=7.5388]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:33<00:54,  2.65it/s, acc_step=1/1, ce_loss_token=2.0201, perplexity_token=7.5391]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:33<00:52,  2.70it/s, acc_step=1/1, ce_loss_token=2.0200, perplexity_token=7.5385]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:34<00:48,  2.88it/s, acc_step=1/1, ce_loss_token=2.0201, perplexity_token=7.5389]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:34<00:49,  2.81it/s, acc_step=1/1, ce_loss_token=2.0200, perplexity_token=7.5384]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:34<00:47,  2.94it/s, acc_step=1/1, ce_loss_token=2.0201, perplexity_token=7.5388]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:35<00:47,  2.89it/s, acc_step=1/1, ce_loss_token=2.0200, perplexity_token=7.5383]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:35<00:52,  2.63it/s, acc_step=1/1, ce_loss_token=2.0199, perplexity_token=7.5377]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:36<00:51,  2.63it/s, acc_step=1/1, ce_loss_token=2.0198, perplexity_token=7.5371]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:36<00:52,  2.59it/s, acc_step=1/1, ce_loss_token=2.0198, perplexity_token=7.5365]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:36<00:50,  2.64it/s, acc_step=1/1, ce_loss_token=2.0197, perplexity_token=7.5359]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:37<00:50,  2.61it/s, acc_step=1/1, ce_loss_token=2.0196, perplexity_token=7.5354]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:37<00:46,  2.83it/s, acc_step=1/1, ce_loss_token=2.0197, perplexity_token=7.5358]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:37<00:46,  2.81it/s, acc_step=1/1, ce_loss_token=2.0196, perplexity_token=7.5352]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:38<00:46,  2.82it/s, acc_step=1/1, ce_loss_token=2.0195, perplexity_token=7.5346]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:38<00:42,  3.01it/s, acc_step=1/1, ce_loss_token=2.0195, perplexity_token=7.5348]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:38<00:45,  2.82it/s, acc_step=1/1, ce_loss_token=2.0195, perplexity_token=7.5343]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:39<00:44,  2.82it/s, acc_step=1/1, ce_loss_token=2.0194, perplexity_token=7.5337]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:39<00:45,  2.74it/s, acc_step=1/1, ce_loss_token=2.0193, perplexity_token=7.5331]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:40<00:45,  2.75it/s, acc_step=1/1, ce_loss_token=2.0192, perplexity_token=7.5326]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:40<00:47,  2.63it/s, acc_step=1/1, ce_loss_token=2.0192, perplexity_token=7.5321]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:40<00:46,  2.65it/s, acc_step=1/1, ce_loss_token=2.0191, perplexity_token=7.5316]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:41<00:44,  2.73it/s, acc_step=1/1, ce_loss_token=2.0190, perplexity_token=7.5311]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:41<00:44,  2.75it/s, acc_step=1/1, ce_loss_token=2.0190, perplexity_token=7.5307]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:41<00:43,  2.75it/s, acc_step=1/1, ce_loss_token=2.0189, perplexity_token=7.5302]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:42<00:41,  2.89it/s, acc_step=1/1, ce_loss_token=2.0190, perplexity_token=7.5305]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:42<00:42,  2.77it/s, acc_step=1/1, ce_loss_token=2.0189, perplexity_token=7.5299]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:43<00:36,  3.19it/s, acc_step=1/1, ce_loss_token=2.0191, perplexity_token=7.5315]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:43<00:37,  3.04it/s, acc_step=1/1, ce_loss_token=2.0190, perplexity_token=7.5310]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:43<00:40,  2.78it/s, acc_step=1/1, ce_loss_token=2.0189, perplexity_token=7.5304]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:44<00:40,  2.79it/s, acc_step=1/1, ce_loss_token=2.0189, perplexity_token=7.5299]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:44<00:40,  2.77it/s, acc_step=1/1, ce_loss_token=2.0188, perplexity_token=7.5294]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:45<00:40,  2.74it/s, acc_step=1/1, ce_loss_token=2.0187, perplexity_token=7.5288]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:45<00:39,  2.75it/s, acc_step=1/1, ce_loss_token=2.0187, perplexity_token=7.5282]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:45<00:39,  2.79it/s, acc_step=1/1, ce_loss_token=2.0186, perplexity_token=7.5275]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:46<00:38,  2.80it/s, acc_step=1/1, ce_loss_token=2.0185, perplexity_token=7.5269]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:46<00:39,  2.68it/s, acc_step=1/1, ce_loss_token=2.0184, perplexity_token=7.5264]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:46<00:39,  2.69it/s, acc_step=1/1, ce_loss_token=2.0183, perplexity_token=7.5259]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:47<00:40,  2.58it/s, acc_step=1/1, ce_loss_token=2.0183, perplexity_token=7.5254]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:47<00:39,  2.62it/s, acc_step=1/1, ce_loss_token=2.0182, perplexity_token=7.5249]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:48<00:39,  2.60it/s, acc_step=1/1, ce_loss_token=2.0181, perplexity_token=7.5244]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:48<00:38,  2.63it/s, acc_step=1/1, ce_loss_token=2.0181, perplexity_token=7.5238]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:48<00:37,  2.66it/s, acc_step=1/1, ce_loss_token=2.0180, perplexity_token=7.5233]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:49<00:39,  2.54it/s, acc_step=1/1, ce_loss_token=2.0179, perplexity_token=7.5227]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:49<00:38,  2.58it/s, acc_step=1/1, ce_loss_token=2.0179, perplexity_token=7.5222]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:49<00:38,  2.58it/s, acc_step=1/1, ce_loss_token=2.0178, perplexity_token=7.5217]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:50<00:36,  2.68it/s, acc_step=1/1, ce_loss_token=2.0177, perplexity_token=7.5211]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:50<00:35,  2.69it/s, acc_step=1/1, ce_loss_token=2.0176, perplexity_token=7.5205]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:51<00:35,  2.66it/s, acc_step=1/1, ce_loss_token=2.0176, perplexity_token=7.5199]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:51<00:39,  2.40it/s, acc_step=1/1, ce_loss_token=2.0176, perplexity_token=7.5205]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:51<00:35,  2.64it/s, acc_step=1/1, ce_loss_token=2.0177, perplexity_token=7.5208]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:52<00:37,  2.44it/s, acc_step=1/1, ce_loss_token=2.0176, perplexity_token=7.5202]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:52<00:36,  2.51it/s, acc_step=1/1, ce_loss_token=2.0175, perplexity_token=7.5196]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:53<00:37,  2.41it/s, acc_step=1/1, ce_loss_token=2.0175, perplexity_token=7.5191]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:53<00:33,  2.67it/s, acc_step=1/1, ce_loss_token=2.0175, perplexity_token=7.5193]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:53<00:32,  2.67it/s, acc_step=1/1, ce_loss_token=2.0174, perplexity_token=7.5189]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:54<00:32,  2.65it/s, acc_step=1/1, ce_loss_token=2.0174, perplexity_token=7.5185]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:54<00:37,  2.32it/s, acc_step=1/1, ce_loss_token=2.0173, perplexity_token=7.5180]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:55<00:39,  2.17it/s, acc_step=1/1, ce_loss_token=2.0172, perplexity_token=7.5174]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:55<00:36,  2.27it/s, acc_step=1/1, ce_loss_token=2.0171, perplexity_token=7.5169]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:56<00:35,  2.37it/s, acc_step=1/1, ce_loss_token=2.0171, perplexity_token=7.5164]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:56<00:33,  2.43it/s, acc_step=1/1, ce_loss_token=2.0170, perplexity_token=7.5159]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:56<00:33,  2.41it/s, acc_step=1/1, ce_loss_token=2.0169, perplexity_token=7.5153]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:57<00:33,  2.39it/s, acc_step=1/1, ce_loss_token=2.0169, perplexity_token=7.5148]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:57<00:32,  2.46it/s, acc_step=1/1, ce_loss_token=2.0168, perplexity_token=7.5143]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:58<00:29,  2.68it/s, acc_step=1/1, ce_loss_token=2.0169, perplexity_token=7.5147]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:58<00:28,  2.71it/s, acc_step=1/1, ce_loss_token=2.0168, perplexity_token=7.5142]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:58<00:27,  2.81it/s, acc_step=1/1, ce_loss_token=2.0168, perplexity_token=7.5144]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:59<00:31,  2.41it/s, acc_step=1/1, ce_loss_token=2.0168, perplexity_token=7.5140]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:59<00:30,  2.42it/s, acc_step=1/1, ce_loss_token=2.0167, perplexity_token=7.5134]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [06:00<00:29,  2.46it/s, acc_step=1/1, ce_loss_token=2.0166, perplexity_token=7.5130]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [06:00<00:28,  2.52it/s, acc_step=1/1, ce_loss_token=2.0166, perplexity_token=7.5125]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [06:00<00:28,  2.54it/s, acc_step=1/1, ce_loss_token=2.0165, perplexity_token=7.5119]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [06:01<00:26,  2.60it/s, acc_step=1/1, ce_loss_token=2.0164, perplexity_token=7.5114]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [06:01<00:27,  2.49it/s, acc_step=1/1, ce_loss_token=2.0164, perplexity_token=7.5109]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [06:02<00:27,  2.51it/s, acc_step=1/1, ce_loss_token=2.0163, perplexity_token=7.5104]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [06:02<00:26,  2.56it/s, acc_step=1/1, ce_loss_token=2.0162, perplexity_token=7.5098]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [06:02<00:26,  2.49it/s, acc_step=1/1, ce_loss_token=2.0161, perplexity_token=7.5093]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:03<00:24,  2.66it/s, acc_step=1/1, ce_loss_token=2.0162, perplexity_token=7.5098]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:03<00:24,  2.64it/s, acc_step=1/1, ce_loss_token=2.0161, perplexity_token=7.5093]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:03<00:24,  2.58it/s, acc_step=1/1, ce_loss_token=2.0161, perplexity_token=7.5089]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:04<00:24,  2.57it/s, acc_step=1/1, ce_loss_token=2.0160, perplexity_token=7.5083]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:04<00:24,  2.51it/s, acc_step=1/1, ce_loss_token=2.0159, perplexity_token=7.5078]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:05<00:23,  2.59it/s, acc_step=1/1, ce_loss_token=2.0159, perplexity_token=7.5072]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:05<00:24,  2.42it/s, acc_step=1/1, ce_loss_token=2.0158, perplexity_token=7.5067]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:05<00:23,  2.45it/s, acc_step=1/1, ce_loss_token=2.0157, perplexity_token=7.5061]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:06<00:22,  2.53it/s, acc_step=1/1, ce_loss_token=2.0157, perplexity_token=7.5057]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:06<00:21,  2.58it/s, acc_step=1/1, ce_loss_token=2.0156, perplexity_token=7.5052]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:07<00:21,  2.56it/s, acc_step=1/1, ce_loss_token=2.0155, perplexity_token=7.5047]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:07<00:20,  2.64it/s, acc_step=1/1, ce_loss_token=2.0155, perplexity_token=7.5042]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:07<00:20,  2.64it/s, acc_step=1/1, ce_loss_token=2.0154, perplexity_token=7.5036]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:08<00:18,  2.79it/s, acc_step=1/1, ce_loss_token=2.0155, perplexity_token=7.5042]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:08<00:18,  2.81it/s, acc_step=1/1, ce_loss_token=2.0154, perplexity_token=7.5037]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:08<00:18,  2.75it/s, acc_step=1/1, ce_loss_token=2.0153, perplexity_token=7.5034]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:09<00:17,  2.82it/s, acc_step=1/1, ce_loss_token=2.0153, perplexity_token=7.5029]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:09<00:17,  2.76it/s, acc_step=1/1, ce_loss_token=2.0152, perplexity_token=7.5024]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:09<00:16,  2.80it/s, acc_step=1/1, ce_loss_token=2.0152, perplexity_token=7.5020]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:10<00:16,  2.82it/s, acc_step=1/1, ce_loss_token=2.0152, perplexity_token=7.5022]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:10<00:16,  2.76it/s, acc_step=1/1, ce_loss_token=2.0151, perplexity_token=7.5018]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:11<00:16,  2.68it/s, acc_step=1/1, ce_loss_token=2.0151, perplexity_token=7.5014]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:11<00:16,  2.62it/s, acc_step=1/1, ce_loss_token=2.0150, perplexity_token=7.5009]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:11<00:16,  2.54it/s, acc_step=1/1, ce_loss_token=2.0150, perplexity_token=7.5004]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:12<00:15,  2.60it/s, acc_step=1/1, ce_loss_token=2.0149, perplexity_token=7.4999]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:12<00:15,  2.61it/s, acc_step=1/1, ce_loss_token=2.0148, perplexity_token=7.4995]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:12<00:14,  2.62it/s, acc_step=1/1, ce_loss_token=2.0148, perplexity_token=7.4991]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:13<00:14,  2.68it/s, acc_step=1/1, ce_loss_token=2.0147, perplexity_token=7.4987]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:13<00:13,  2.66it/s, acc_step=1/1, ce_loss_token=2.0147, perplexity_token=7.4982]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:14<00:13,  2.57it/s, acc_step=1/1, ce_loss_token=2.0146, perplexity_token=7.4977]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:14<00:13,  2.62it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4972]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:14<00:13,  2.56it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4967]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:15<00:12,  2.61it/s, acc_step=1/1, ce_loss_token=2.0144, perplexity_token=7.4963]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:15<00:11,  2.79it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4967]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:15<00:10,  3.07it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4970]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:16<00:09,  3.22it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4972]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:16<00:09,  2.91it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4967]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:16<00:10,  2.70it/s, acc_step=1/1, ce_loss_token=2.0144, perplexity_token=7.4963]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:17<00:10,  2.67it/s, acc_step=1/1, ce_loss_token=2.0143, perplexity_token=7.4958]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:17<00:10,  2.58it/s, acc_step=1/1, ce_loss_token=2.0144, perplexity_token=7.4961]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:18<00:09,  2.52it/s, acc_step=1/1, ce_loss_token=2.0143, perplexity_token=7.4955]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:18<00:08,  2.76it/s, acc_step=1/1, ce_loss_token=2.0143, perplexity_token=7.4958]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:18<00:07,  2.98it/s, acc_step=1/1, ce_loss_token=2.0145, perplexity_token=7.4966]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:19<00:07,  2.98it/s, acc_step=1/1, ce_loss_token=2.0144, perplexity_token=7.4961]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:19<00:07,  2.86it/s, acc_step=1/1, ce_loss_token=2.0143, perplexity_token=7.4956]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:19<00:07,  2.73it/s, acc_step=1/1, ce_loss_token=2.0143, perplexity_token=7.4952]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:20<00:07,  2.66it/s, acc_step=1/1, ce_loss_token=2.0142, perplexity_token=7.4948]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:20<00:06,  2.69it/s, acc_step=1/1, ce_loss_token=2.0141, perplexity_token=7.4943]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:21<00:06,  2.63it/s, acc_step=1/1, ce_loss_token=2.0141, perplexity_token=7.4938]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:21<00:06,  2.54it/s, acc_step=1/1, ce_loss_token=2.0140, perplexity_token=7.4933]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:21<00:05,  2.52it/s, acc_step=1/1, ce_loss_token=2.0139, perplexity_token=7.4928]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:22<00:05,  2.52it/s, acc_step=1/1, ce_loss_token=2.0139, perplexity_token=7.4924]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:22<00:05,  2.59it/s, acc_step=1/1, ce_loss_token=2.0138, perplexity_token=7.4919]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:23<00:04,  2.51it/s, acc_step=1/1, ce_loss_token=2.0138, perplexity_token=7.4914]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:23<00:04,  2.54it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4910]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:23<00:03,  2.51it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4913]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:24<00:03,  2.71it/s, acc_step=1/1, ce_loss_token=2.0138, perplexity_token=7.4916]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:24<00:02,  2.82it/s, acc_step=1/1, ce_loss_token=2.0138, perplexity_token=7.4919]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:24<00:02,  2.82it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4914]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:25<00:02,  2.73it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4910]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:25<00:01,  2.93it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4913]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:25<00:01,  2.84it/s, acc_step=1/1, ce_loss_token=2.0137, perplexity_token=7.4908]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:26<00:01,  2.77it/s, acc_step=1/1, ce_loss_token=2.0136, perplexity_token=7.4902]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:26<00:00,  2.63it/s, acc_step=1/1, ce_loss_token=2.0135, perplexity_token=7.4897]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:27<00:00,  2.69it/s, acc_step=1/1, ce_loss_token=2.0135, perplexity_token=7.4893]

torch.Size([170, 284, 35]) torch.Size([170, 284])

Generating with greedy search...

📊 Metrics (Epoch 3):
├── TRAIN:
│   ├── ce_loss_char: 2.0134
│   ├── ce_loss_token: 2.0134
│   ├── perplexity_char: 7.4889
│   └── perplexity_token: 7.4889
├── VAL:
│   ├── ce_loss_char: 1.8587
│   ├── ce_loss_token: 1.8587
│   ├── perplexity_char: 6.4154
│   └── perplexity_token: 6.4154
└── TRAINING:
    └── learning_rate: 0.000082


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:44,  1.99it/s, acc_step=1/1, ce_loss_token=1.9471, perplexity_token=7.0086]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:38,  2.27it/s, acc_step=1/1, ce_loss_token=1.9496, perplexity_token=7.0260]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<07:06,  2.44it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0060]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:59,  2.48it/s, acc_step=1/1, ce_loss_token=1.9432, perplexity_token=6.9814]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<06:38,  2.61it/s, acc_step=1/1, ce_loss_token=1.9404, perplexity_token=6.9613]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:51,  2.52it/s, acc_step=1/1, ce_loss_token=1.9406, perplexity_token=6.9626]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:51,  2.52it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9731]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<06:49,  2.53it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9735]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<06:30,  2.65it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9711]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:25,  2.69it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9739]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<06:43,  2.56it/s, acc_step=1/1, ce_loss_token=1.9411, perplexity_token=6.9665]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:35,  2.61it/s, acc_step=1/1, ce_loss_token=1.9407, perplexity_token=6.9639]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:45,  2.54it/s, acc_step=1/1, ce_loss_token=1.9405, perplexity_token=6.9624]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<06:35,  2.61it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9649]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<06:14,  2.75it/s, acc_step=1/1, ce_loss_token=1.9482, perplexity_token=7.0163]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<06:21,  2.69it/s, acc_step=1/1, ce_loss_token=1.9477, perplexity_token=7.0128]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<06:17,  2.72it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0061]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<05:54,  2.90it/s, acc_step=1/1, ce_loss_token=1.9519, perplexity_token=7.0421]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   2%|▊                                               | 19/1044 [00:07<06:01,  2.84it/s, acc_step=1/1, ce_loss_token=1.9513, perplexity_token=7.0378]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<05:40,  3.01it/s, acc_step=1/1, ce_loss_token=1.9585, perplexity_token=7.0885]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<05:52,  2.90it/s, acc_step=1/1, ce_loss_token=1.9573, perplexity_token=7.0803]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   2%|█                                               | 22/1044 [00:08<05:37,  3.03it/s, acc_step=1/1, ce_loss_token=1.9610, perplexity_token=7.1065]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   2%|█                                               | 23/1044 [00:08<05:24,  3.15it/s, acc_step=1/1, ce_loss_token=1.9646, perplexity_token=7.1319]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:27,  2.63it/s, acc_step=1/1, ce_loss_token=1.9641, perplexity_token=7.1284]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:15,  2.71it/s, acc_step=1/1, ce_loss_token=1.9634, perplexity_token=7.1238]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:33,  2.59it/s, acc_step=1/1, ce_loss_token=1.9623, perplexity_token=7.1156]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:34,  2.58it/s, acc_step=1/1, ce_loss_token=1.9614, perplexity_token=7.1095]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:28,  2.61it/s, acc_step=1/1, ce_loss_token=1.9610, perplexity_token=7.1061]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<06:21,  2.66it/s, acc_step=1/1, ce_loss_token=1.9604, perplexity_token=7.1022]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<07:08,  2.37it/s, acc_step=1/1, ce_loss_token=1.9599, perplexity_token=7.0987]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<06:54,  2.44it/s, acc_step=1/1, ce_loss_token=1.9590, perplexity_token=7.0923]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<06:38,  2.54it/s, acc_step=1/1, ce_loss_token=1.9588, perplexity_token=7.0906]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<07:20,  2.29it/s, acc_step=1/1, ce_loss_token=1.9582, perplexity_token=7.0866]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   3%|█▌                                              | 34/1044 [00:13<07:08,  2.36it/s, acc_step=1/1, ce_loss_token=1.9579, perplexity_token=7.0842]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<07:12,  2.33it/s, acc_step=1/1, ce_loss_token=1.9576, perplexity_token=7.0824]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<07:09,  2.35it/s, acc_step=1/1, ce_loss_token=1.9576, perplexity_token=7.0823]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   4%|█▋                                              | 37/1044 [00:14<06:49,  2.46it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0789]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:30,  2.57it/s, acc_step=1/1, ce_loss_token=1.9569, perplexity_token=7.0772]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   4%|█▊                                              | 39/1044 [00:15<06:24,  2.61it/s, acc_step=1/1, ce_loss_token=1.9565, perplexity_token=7.0747]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<06:12,  2.70it/s, acc_step=1/1, ce_loss_token=1.9556, perplexity_token=7.0685]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:06,  2.74it/s, acc_step=1/1, ce_loss_token=1.9553, perplexity_token=7.0662]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|█▉                                              | 42/1044 [00:16<06:06,  2.74it/s, acc_step=1/1, ce_loss_token=1.9548, perplexity_token=7.0627]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<05:32,  3.01it/s, acc_step=1/1, ce_loss_token=1.9567, perplexity_token=7.0759]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|██                                              | 44/1044 [00:16<05:40,  2.94it/s, acc_step=1/1, ce_loss_token=1.9564, perplexity_token=7.0735]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   4%|██                                              | 45/1044 [00:17<05:47,  2.88it/s, acc_step=1/1, ce_loss_token=1.9562, perplexity_token=7.0724]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   4%|██                                              | 46/1044 [00:17<05:51,  2.84it/s, acc_step=1/1, ce_loss_token=1.9557, perplexity_token=7.0692]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<06:03,  2.74it/s, acc_step=1/1, ce_loss_token=1.9553, perplexity_token=7.0660]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<06:24,  2.59it/s, acc_step=1/1, ce_loss_token=1.9550, perplexity_token=7.0637]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   5%|██▎                                             | 49/1044 [00:18<06:03,  2.74it/s, acc_step=1/1, ce_loss_token=1.9567, perplexity_token=7.0757]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<05:55,  2.79it/s, acc_step=1/1, ce_loss_token=1.9563, perplexity_token=7.0728]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<06:16,  2.63it/s, acc_step=1/1, ce_loss_token=1.9579, perplexity_token=7.0843]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<06:20,  2.60it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0814]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:   5%|██▍                                             | 53/1044 [00:20<06:46,  2.44it/s, acc_step=1/1, ce_loss_token=1.9572, perplexity_token=7.0798]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<06:12,  2.65it/s, acc_step=1/1, ce_loss_token=1.9588, perplexity_token=7.0912]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<06:19,  2.61it/s, acc_step=1/1, ce_loss_token=1.9583, perplexity_token=7.0875]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   5%|██▌                                             | 56/1044 [00:21<06:13,  2.65it/s, acc_step=1/1, ce_loss_token=1.9579, perplexity_token=7.0847]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<05:52,  2.80it/s, acc_step=1/1, ce_loss_token=1.9594, perplexity_token=7.0949]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:53,  2.79it/s, acc_step=1/1, ce_loss_token=1.9590, perplexity_token=7.0920]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▋                                             | 59/1044 [00:22<05:59,  2.74it/s, acc_step=1/1, ce_loss_token=1.9587, perplexity_token=7.0899]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.9582, perplexity_token=7.0868]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<05:41,  2.88it/s, acc_step=1/1, ce_loss_token=1.9595, perplexity_token=7.0960]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<05:48,  2.82it/s, acc_step=1/1, ce_loss_token=1.9590, perplexity_token=7.0923]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<05:54,  2.77it/s, acc_step=1/1, ce_loss_token=1.9585, perplexity_token=7.0884]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   6%|██▉                                             | 64/1044 [00:24<05:50,  2.80it/s, acc_step=1/1, ce_loss_token=1.9581, perplexity_token=7.0856]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   6%|██▉                                             | 65/1044 [00:24<05:34,  2.93it/s, acc_step=1/1, ce_loss_token=1.9593, perplexity_token=7.0943]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   6%|███                                             | 66/1044 [00:24<05:13,  3.12it/s, acc_step=1/1, ce_loss_token=1.9604, perplexity_token=7.1024]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   6%|███                                             | 67/1044 [00:25<05:28,  2.97it/s, acc_step=1/1, ce_loss_token=1.9599, perplexity_token=7.0985]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▏                                            | 68/1044 [00:25<05:17,  3.08it/s, acc_step=1/1, ce_loss_token=1.9618, perplexity_token=7.1119]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   7%|███▏                                            | 69/1044 [00:25<05:28,  2.97it/s, acc_step=1/1, ce_loss_token=1.9615, perplexity_token=7.1102]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<05:57,  2.73it/s, acc_step=1/1, ce_loss_token=1.9612, perplexity_token=7.1076]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<05:59,  2.70it/s, acc_step=1/1, ce_loss_token=1.9608, perplexity_token=7.1049]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   7%|███▎                                            | 72/1044 [00:26<05:49,  2.78it/s, acc_step=1/1, ce_loss_token=1.9605, perplexity_token=7.1026]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<05:25,  2.98it/s, acc_step=1/1, ce_loss_token=1.9615, perplexity_token=7.1101]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   7%|███▍                                            | 74/1044 [00:27<05:36,  2.89it/s, acc_step=1/1, ce_loss_token=1.9612, perplexity_token=7.1078]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   7%|███▍                                            | 75/1044 [00:27<05:50,  2.76it/s, acc_step=1/1, ce_loss_token=1.9608, perplexity_token=7.1049]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<06:15,  2.58it/s, acc_step=1/1, ce_loss_token=1.9605, perplexity_token=7.1032]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<06:12,  2.60it/s, acc_step=1/1, ce_loss_token=1.9602, perplexity_token=7.1005]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<06:23,  2.52it/s, acc_step=1/1, ce_loss_token=1.9599, perplexity_token=7.0987]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<06:48,  2.36it/s, acc_step=1/1, ce_loss_token=1.9596, perplexity_token=7.0966]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   8%|███▋                                            | 80/1044 [00:30<06:39,  2.41it/s, acc_step=1/1, ce_loss_token=1.9593, perplexity_token=7.0944]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<06:40,  2.40it/s, acc_step=1/1, ce_loss_token=1.9591, perplexity_token=7.0927]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<06:37,  2.42it/s, acc_step=1/1, ce_loss_token=1.9589, perplexity_token=7.0919]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<06:48,  2.35it/s, acc_step=1/1, ce_loss_token=1.9587, perplexity_token=7.0900]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<06:37,  2.41it/s, acc_step=1/1, ce_loss_token=1.9586, perplexity_token=7.0892]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:   8%|███▉                                            | 85/1044 [00:32<08:46,  1.82it/s, acc_step=1/1, ce_loss_token=1.9585, perplexity_token=7.0886]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<07:54,  2.02it/s, acc_step=1/1, ce_loss_token=1.9583, perplexity_token=7.0870]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   8%|████                                            | 87/1044 [00:33<07:17,  2.19it/s, acc_step=1/1, ce_loss_token=1.9580, perplexity_token=7.0854]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   8%|████                                            | 88/1044 [00:33<06:47,  2.35it/s, acc_step=1/1, ce_loss_token=1.9577, perplexity_token=7.0832]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   9%|████                                            | 89/1044 [00:34<06:26,  2.47it/s, acc_step=1/1, ce_loss_token=1.9574, perplexity_token=7.0810]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   9%|████▏                                           | 90/1044 [00:34<06:13,  2.56it/s, acc_step=1/1, ce_loss_token=1.9572, perplexity_token=7.0795]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   9%|████▏                                           | 91/1044 [00:34<06:13,  2.55it/s, acc_step=1/1, ce_loss_token=1.9569, perplexity_token=7.0772]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   9%|████▏                                           | 92/1044 [00:35<06:16,  2.53it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0769]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   9%|████▎                                           | 93/1044 [00:35<06:00,  2.64it/s, acc_step=1/1, ce_loss_token=1.9565, perplexity_token=7.0745]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   9%|████▎                                           | 94/1044 [00:35<05:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.9563, perplexity_token=7.0728]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:   9%|████▎                                           | 95/1044 [00:36<05:47,  2.73it/s, acc_step=1/1, ce_loss_token=1.9560, perplexity_token=7.0711]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   9%|████▍                                           | 96/1044 [00:36<05:50,  2.70it/s, acc_step=1/1, ce_loss_token=1.9557, perplexity_token=7.0692]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:39,  2.79it/s, acc_step=1/1, ce_loss_token=1.9555, perplexity_token=7.0672]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   9%|████▌                                           | 98/1044 [00:37<05:42,  2.76it/s, acc_step=1/1, ce_loss_token=1.9552, perplexity_token=7.0656]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   9%|████▌                                           | 99/1044 [00:37<05:45,  2.73it/s, acc_step=1/1, ce_loss_token=1.9552, perplexity_token=7.0651]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<05:19,  2.95it/s, acc_step=1/1, ce_loss_token=1.9560, perplexity_token=7.0709]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  10%|████▌                                          | 101/1044 [00:38<05:21,  2.93it/s, acc_step=1/1, ce_loss_token=1.9557, perplexity_token=7.0691]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<05:17,  2.97it/s, acc_step=1/1, ce_loss_token=1.9566, perplexity_token=7.0751]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<05:16,  2.98it/s, acc_step=1/1, ce_loss_token=1.9573, perplexity_token=7.0803]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  10%|████▋                                          | 104/1044 [00:39<05:36,  2.79it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0790]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<05:47,  2.70it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0769]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  10%|████▊                                          | 106/1044 [00:40<05:29,  2.85it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0817]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  10%|████▊                                          | 107/1044 [00:40<05:10,  3.02it/s, acc_step=1/1, ce_loss_token=1.9582, perplexity_token=7.0863]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<05:31,  2.82it/s, acc_step=1/1, ce_loss_token=1.9579, perplexity_token=7.0843]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  10%|████▉                                          | 109/1044 [00:41<05:40,  2.74it/s, acc_step=1/1, ce_loss_token=1.9576, perplexity_token=7.0825]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  11%|████▉                                          | 110/1044 [00:41<05:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0815]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:49,  2.67it/s, acc_step=1/1, ce_loss_token=1.9574, perplexity_token=7.0806]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  11%|█████                                          | 112/1044 [00:42<05:57,  2.61it/s, acc_step=1/1, ce_loss_token=1.9572, perplexity_token=7.0796]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<06:24,  2.42it/s, acc_step=1/1, ce_loss_token=1.9570, perplexity_token=7.0781]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:43<05:26,  2.85it/s, acc_step=1/1, ce_loss_token=1.9589, perplexity_token=7.0916]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<05:34,  2.77it/s, acc_step=1/1, ce_loss_token=1.9586, perplexity_token=7.0897]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:44<05:45,  2.68it/s, acc_step=1/1, ce_loss_token=1.9584, perplexity_token=7.0880]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<05:17,  2.92it/s, acc_step=1/1, ce_loss_token=1.9590, perplexity_token=7.0925]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:35,  2.76it/s, acc_step=1/1, ce_loss_token=1.9588, perplexity_token=7.0908]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:45<05:49,  2.65it/s, acc_step=1/1, ce_loss_token=1.9586, perplexity_token=7.0893]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:45<06:03,  2.54it/s, acc_step=1/1, ce_loss_token=1.9585, perplexity_token=7.0884]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:46<05:54,  2.60it/s, acc_step=1/1, ce_loss_token=1.9583, perplexity_token=7.0870]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:46<05:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.9580, perplexity_token=7.0853]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:45,  2.66it/s, acc_step=1/1, ce_loss_token=1.9578, perplexity_token=7.0835]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:47<05:53,  2.60it/s, acc_step=1/1, ce_loss_token=1.9577, perplexity_token=7.0827]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:47<05:47,  2.65it/s, acc_step=1/1, ce_loss_token=1.9574, perplexity_token=7.0806]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0791]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:48<06:09,  2.48it/s, acc_step=1/1, ce_loss_token=1.9570, perplexity_token=7.0782]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:48<06:05,  2.50it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0768]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:49<06:03,  2.51it/s, acc_step=1/1, ce_loss_token=1.9566, perplexity_token=7.0754]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:49<05:51,  2.60it/s, acc_step=1/1, ce_loss_token=1.9564, perplexity_token=7.0741]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:49<05:53,  2.58it/s, acc_step=1/1, ce_loss_token=1.9562, perplexity_token=7.0722]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:50<05:52,  2.58it/s, acc_step=1/1, ce_loss_token=1.9560, perplexity_token=7.0712]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  13%|██████                                         | 134/1044 [00:50<05:41,  2.67it/s, acc_step=1/1, ce_loss_token=1.9559, perplexity_token=7.0699]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  13%|██████                                         | 135/1044 [00:51<05:38,  2.69it/s, acc_step=1/1, ce_loss_token=1.9556, perplexity_token=7.0679]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  13%|██████                                         | 136/1044 [00:51<05:44,  2.63it/s, acc_step=1/1, ce_loss_token=1.9554, perplexity_token=7.0668]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:51<05:35,  2.71it/s, acc_step=1/1, ce_loss_token=1.9552, perplexity_token=7.0656]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:52<05:40,  2.66it/s, acc_step=1/1, ce_loss_token=1.9551, perplexity_token=7.0648]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:52<05:30,  2.74it/s, acc_step=1/1, ce_loss_token=1.9549, perplexity_token=7.0636]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:52<05:25,  2.77it/s, acc_step=1/1, ce_loss_token=1.9547, perplexity_token=7.0619]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:53<05:25,  2.77it/s, acc_step=1/1, ce_loss_token=1.9546, perplexity_token=7.0609]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:53<05:04,  2.97it/s, acc_step=1/1, ce_loss_token=1.9550, perplexity_token=7.0642]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:14,  2.87it/s, acc_step=1/1, ce_loss_token=1.9549, perplexity_token=7.0632]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:54<05:30,  2.73it/s, acc_step=1/1, ce_loss_token=1.9547, perplexity_token=7.0618]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:54<05:35,  2.68it/s, acc_step=1/1, ce_loss_token=1.9545, perplexity_token=7.0607]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:55<05:50,  2.56it/s, acc_step=1/1, ce_loss_token=1.9544, perplexity_token=7.0595]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:55<05:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.9542, perplexity_token=7.0582]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<05:44,  2.60it/s, acc_step=1/1, ce_loss_token=1.9540, perplexity_token=7.0571]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:56<05:16,  2.82it/s, acc_step=1/1, ce_loss_token=1.9547, perplexity_token=7.0618]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:56<05:24,  2.75it/s, acc_step=1/1, ce_loss_token=1.9546, perplexity_token=7.0610]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:30,  2.70it/s, acc_step=1/1, ce_loss_token=1.9545, perplexity_token=7.0601]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:57<05:50,  2.54it/s, acc_step=1/1, ce_loss_token=1.9544, perplexity_token=7.0598]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<05:45,  2.58it/s, acc_step=1/1, ce_loss_token=1.9543, perplexity_token=7.0590]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:58<05:26,  2.73it/s, acc_step=1/1, ce_loss_token=1.9548, perplexity_token=7.0623]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:58<05:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.9546, perplexity_token=7.0611]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:08,  2.88it/s, acc_step=1/1, ce_loss_token=1.9551, perplexity_token=7.0648]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  15%|███████                                        | 158/1044 [00:59<04:34,  3.23it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0788]

torch.Size([256, 281, 35]) torch.Size([256, 281])
torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<04:37,  3.19it/s, acc_step=1/1, ce_loss_token=1.9569, perplexity_token=7.0773]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  15%|███████▏                                       | 160/1044 [01:00<04:58,  2.97it/s, acc_step=1/1, ce_loss_token=1.9567, perplexity_token=7.0759]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  15%|███████▏                                       | 161/1044 [01:00<05:13,  2.82it/s, acc_step=1/1, ce_loss_token=1.9565, perplexity_token=7.0745]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:01<04:40,  3.14it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0816]

torch.Size([256, 295, 35]) torch.Size([256, 295])
torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<04:55,  2.98it/s, acc_step=1/1, ce_loss_token=1.9574, perplexity_token=7.0807]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:14,  2.79it/s, acc_step=1/1, ce_loss_token=1.9572, perplexity_token=7.0798]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:02<05:13,  2.80it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0785]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<05:14,  2.79it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0769]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<04:58,  2.93it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0814]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:03<06:12,  2.35it/s, acc_step=1/1, ce_loss_token=1.9573, perplexity_token=7.0803]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<05:55,  2.46it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0791]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:04<05:33,  2.62it/s, acc_step=1/1, ce_loss_token=1.9577, perplexity_token=7.0827]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<05:48,  2.50it/s, acc_step=1/1, ce_loss_token=1.9575, perplexity_token=7.0815]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:05<05:56,  2.44it/s, acc_step=1/1, ce_loss_token=1.9573, perplexity_token=7.0800]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:05<05:40,  2.55it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0790]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<05:35,  2.59it/s, acc_step=1/1, ce_loss_token=1.9570, perplexity_token=7.0780]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:06<05:24,  2.67it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0765]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:06<05:29,  2.63it/s, acc_step=1/1, ce_loss_token=1.9567, perplexity_token=7.0761]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<05:09,  2.80it/s, acc_step=1/1, ce_loss_token=1.9572, perplexity_token=7.0792]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  17%|████████                                       | 179/1044 [01:07<05:17,  2.72it/s, acc_step=1/1, ce_loss_token=1.9570, perplexity_token=7.0782]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  17%|████████                                       | 180/1044 [01:07<05:13,  2.75it/s, acc_step=1/1, ce_loss_token=1.9568, perplexity_token=7.0767]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<05:24,  2.66it/s, acc_step=1/1, ce_loss_token=1.9566, perplexity_token=7.0752]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:08<05:02,  2.85it/s, acc_step=1/1, ce_loss_token=1.9571, perplexity_token=7.0788]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:08<05:08,  2.79it/s, acc_step=1/1, ce_loss_token=1.9569, perplexity_token=7.0776]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:08<05:04,  2.83it/s, acc_step=1/1, ce_loss_token=1.9567, perplexity_token=7.0762]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:09<05:03,  2.83it/s, acc_step=1/1, ce_loss_token=1.9566, perplexity_token=7.0751]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:09<05:04,  2.82it/s, acc_step=1/1, ce_loss_token=1.9565, perplexity_token=7.0743]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:10<05:26,  2.62it/s, acc_step=1/1, ce_loss_token=1.9563, perplexity_token=7.0731]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:10<05:25,  2.63it/s, acc_step=1/1, ce_loss_token=1.9561, perplexity_token=7.0720]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:10<05:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.9560, perplexity_token=7.0713]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:11<05:22,  2.65it/s, acc_step=1/1, ce_loss_token=1.9559, perplexity_token=7.0705]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:11<05:47,  2.45it/s, acc_step=1/1, ce_loss_token=1.9558, perplexity_token=7.0694]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:12<05:32,  2.57it/s, acc_step=1/1, ce_loss_token=1.9557, perplexity_token=7.0686]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:12<05:26,  2.61it/s, acc_step=1/1, ce_loss_token=1.9555, perplexity_token=7.0674]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:39,  2.51it/s, acc_step=1/1, ce_loss_token=1.9554, perplexity_token=7.0668]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:13<05:27,  2.59it/s, acc_step=1/1, ce_loss_token=1.9552, perplexity_token=7.0656]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:36,  2.52it/s, acc_step=1/1, ce_loss_token=1.9551, perplexity_token=7.0648]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:14<05:29,  2.57it/s, acc_step=1/1, ce_loss_token=1.9549, perplexity_token=7.0635]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:14<05:28,  2.58it/s, acc_step=1/1, ce_loss_token=1.9548, perplexity_token=7.0623]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:42,  2.47it/s, acc_step=1/1, ce_loss_token=1.9546, perplexity_token=7.0611]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|█████████                                      | 200/1044 [01:15<05:35,  2.52it/s, acc_step=1/1, ce_loss_token=1.9545, perplexity_token=7.0601]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  19%|█████████                                      | 201/1044 [01:15<06:42,  2.09it/s, acc_step=1/1, ce_loss_token=1.9543, perplexity_token=7.0593]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  19%|█████████                                      | 202/1044 [01:16<06:24,  2.19it/s, acc_step=1/1, ce_loss_token=1.9541, perplexity_token=7.0577]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:16<06:12,  2.26it/s, acc_step=1/1, ce_loss_token=1.9540, perplexity_token=7.0571]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:17<06:00,  2.33it/s, acc_step=1/1, ce_loss_token=1.9539, perplexity_token=7.0562]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:17<05:56,  2.35it/s, acc_step=1/1, ce_loss_token=1.9537, perplexity_token=7.0549]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:17<05:31,  2.53it/s, acc_step=1/1, ce_loss_token=1.9536, perplexity_token=7.0541]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:18<05:38,  2.47it/s, acc_step=1/1, ce_loss_token=1.9535, perplexity_token=7.0531]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:18<05:31,  2.52it/s, acc_step=1/1, ce_loss_token=1.9534, perplexity_token=7.0523]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:18<05:12,  2.68it/s, acc_step=1/1, ce_loss_token=1.9538, perplexity_token=7.0552]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:19<05:15,  2.64it/s, acc_step=1/1, ce_loss_token=1.9536, perplexity_token=7.0538]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:19<05:14,  2.65it/s, acc_step=1/1, ce_loss_token=1.9535, perplexity_token=7.0531]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:20<05:20,  2.60it/s, acc_step=1/1, ce_loss_token=1.9533, perplexity_token=7.0521]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:20<05:12,  2.66it/s, acc_step=1/1, ce_loss_token=1.9532, perplexity_token=7.0511]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:20<05:13,  2.65it/s, acc_step=1/1, ce_loss_token=1.9531, perplexity_token=7.0502]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:21<05:24,  2.56it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0491]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:21<04:56,  2.79it/s, acc_step=1/1, ce_loss_token=1.9533, perplexity_token=7.0517]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:21<04:53,  2.82it/s, acc_step=1/1, ce_loss_token=1.9532, perplexity_token=7.0509]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:22<05:09,  2.67it/s, acc_step=1/1, ce_loss_token=1.9530, perplexity_token=7.0499]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:22<05:09,  2.67it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0489]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:23<05:07,  2.68it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0482]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:23<05:11,  2.64it/s, acc_step=1/1, ce_loss_token=1.9526, perplexity_token=7.0472]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:23<05:19,  2.57it/s, acc_step=1/1, ce_loss_token=1.9524, perplexity_token=7.0458]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  21%|██████████                                     | 223/1044 [01:24<04:53,  2.79it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0483]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  21%|██████████                                     | 224/1044 [01:24<04:52,  2.80it/s, acc_step=1/1, ce_loss_token=1.9526, perplexity_token=7.0471]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:24<04:38,  2.94it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0492]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:25<04:37,  2.95it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0486]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:25<04:41,  2.90it/s, acc_step=1/1, ce_loss_token=1.9527, perplexity_token=7.0476]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:25<05:09,  2.64it/s, acc_step=1/1, ce_loss_token=1.9526, perplexity_token=7.0468]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:26<05:40,  2.39it/s, acc_step=1/1, ce_loss_token=1.9524, perplexity_token=7.0457]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:26<05:09,  2.63it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0488]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:27<05:09,  2.63it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0481]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:27<04:50,  2.80it/s, acc_step=1/1, ce_loss_token=1.9530, perplexity_token=7.0500]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:27<04:49,  2.80it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0491]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:28<04:54,  2.75it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0482]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:28<04:58,  2.71it/s, acc_step=1/1, ce_loss_token=1.9527, perplexity_token=7.0479]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:28<04:34,  2.95it/s, acc_step=1/1, ce_loss_token=1.9530, perplexity_token=7.0499]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:29<04:56,  2.72it/s, acc_step=1/1, ce_loss_token=1.9529, perplexity_token=7.0493]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:29<05:03,  2.66it/s, acc_step=1/1, ce_loss_token=1.9528, perplexity_token=7.0484]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:30<05:24,  2.48it/s, acc_step=1/1, ce_loss_token=1.9526, perplexity_token=7.0470]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:30<05:19,  2.52it/s, acc_step=1/1, ce_loss_token=1.9525, perplexity_token=7.0460]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:30<05:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.9524, perplexity_token=7.0455]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:31<05:17,  2.53it/s, acc_step=1/1, ce_loss_token=1.9523, perplexity_token=7.0447]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:31<05:17,  2.52it/s, acc_step=1/1, ce_loss_token=1.9521, perplexity_token=7.0435]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:32<05:19,  2.50it/s, acc_step=1/1, ce_loss_token=1.9520, perplexity_token=7.0425]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  23%|███████████                                    | 245/1044 [01:32<04:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.9523, perplexity_token=7.0448]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  24%|███████████                                    | 246/1044 [01:32<05:01,  2.65it/s, acc_step=1/1, ce_loss_token=1.9521, perplexity_token=7.0436]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  24%|███████████                                    | 247/1044 [01:33<05:20,  2.49it/s, acc_step=1/1, ce_loss_token=1.9520, perplexity_token=7.0427]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:33<05:17,  2.50it/s, acc_step=1/1, ce_loss_token=1.9519, perplexity_token=7.0420]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:33<04:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.9521, perplexity_token=7.0437]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:34<05:07,  2.59it/s, acc_step=1/1, ce_loss_token=1.9521, perplexity_token=7.0431]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:34<05:06,  2.58it/s, acc_step=1/1, ce_loss_token=1.9519, perplexity_token=7.0424]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:35<05:17,  2.49it/s, acc_step=1/1, ce_loss_token=1.9518, perplexity_token=7.0417]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:35<05:15,  2.51it/s, acc_step=1/1, ce_loss_token=1.9517, perplexity_token=7.0407]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:36<05:20,  2.46it/s, acc_step=1/1, ce_loss_token=1.9516, perplexity_token=7.0399]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:36<05:10,  2.54it/s, acc_step=1/1, ce_loss_token=1.9514, perplexity_token=7.0389]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:36<05:07,  2.56it/s, acc_step=1/1, ce_loss_token=1.9514, perplexity_token=7.0382]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:37<04:48,  2.73it/s, acc_step=1/1, ce_loss_token=1.9517, perplexity_token=7.0406]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:37<04:41,  2.79it/s, acc_step=1/1, ce_loss_token=1.9516, perplexity_token=7.0398]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:37<04:58,  2.63it/s, acc_step=1/1, ce_loss_token=1.9514, perplexity_token=7.0387]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:38<04:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.9513, perplexity_token=7.0381]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:38<05:07,  2.55it/s, acc_step=1/1, ce_loss_token=1.9512, perplexity_token=7.0374]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:39<05:15,  2.47it/s, acc_step=1/1, ce_loss_token=1.9511, perplexity_token=7.0366]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:39<05:10,  2.52it/s, acc_step=1/1, ce_loss_token=1.9510, perplexity_token=7.0360]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:39<05:17,  2.46it/s, acc_step=1/1, ce_loss_token=1.9509, perplexity_token=7.0352]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:40<05:16,  2.46it/s, acc_step=1/1, ce_loss_token=1.9508, perplexity_token=7.0344]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:40<05:13,  2.48it/s, acc_step=1/1, ce_loss_token=1.9507, perplexity_token=7.0335]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  26%|████████████                                   | 267/1044 [01:41<05:03,  2.56it/s, acc_step=1/1, ce_loss_token=1.9506, perplexity_token=7.0329]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████                                   | 268/1044 [01:41<04:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.9505, perplexity_token=7.0320]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  26%|████████████                                   | 269/1044 [01:41<04:48,  2.68it/s, acc_step=1/1, ce_loss_token=1.9503, perplexity_token=7.0311]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:42<04:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.9502, perplexity_token=7.0301]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:42<04:23,  2.93it/s, acc_step=1/1, ce_loss_token=1.9505, perplexity_token=7.0321]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:42<04:35,  2.80it/s, acc_step=1/1, ce_loss_token=1.9504, perplexity_token=7.0314]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:43<04:35,  2.79it/s, acc_step=1/1, ce_loss_token=1.9503, perplexity_token=7.0305]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:43<04:50,  2.65it/s, acc_step=1/1, ce_loss_token=1.9501, perplexity_token=7.0295]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:43<04:43,  2.72it/s, acc_step=1/1, ce_loss_token=1.9500, perplexity_token=7.0287]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:44<04:39,  2.74it/s, acc_step=1/1, ce_loss_token=1.9499, perplexity_token=7.0280]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:44<04:34,  2.79it/s, acc_step=1/1, ce_loss_token=1.9498, perplexity_token=7.0271]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:45<04:37,  2.76it/s, acc_step=1/1, ce_loss_token=1.9497, perplexity_token=7.0263]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:45<04:38,  2.75it/s, acc_step=1/1, ce_loss_token=1.9495, perplexity_token=7.0253]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:45<04:49,  2.64it/s, acc_step=1/1, ce_loss_token=1.9494, perplexity_token=7.0245]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:46<04:52,  2.60it/s, acc_step=1/1, ce_loss_token=1.9493, perplexity_token=7.0236]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:46<04:55,  2.58it/s, acc_step=1/1, ce_loss_token=1.9491, perplexity_token=7.0227]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:46<04:31,  2.80it/s, acc_step=1/1, ce_loss_token=1.9494, perplexity_token=7.0247]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:47<04:33,  2.78it/s, acc_step=1/1, ce_loss_token=1.9493, perplexity_token=7.0239]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:47<04:52,  2.60it/s, acc_step=1/1, ce_loss_token=1.9492, perplexity_token=7.0229]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:48<04:46,  2.65it/s, acc_step=1/1, ce_loss_token=1.9491, perplexity_token=7.0222]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:48<04:24,  2.86it/s, acc_step=1/1, ce_loss_token=1.9493, perplexity_token=7.0241]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:48<04:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.9492, perplexity_token=7.0232]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:49<04:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.9491, perplexity_token=7.0224]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:49<04:39,  2.70it/s, acc_step=1/1, ce_loss_token=1.9490, perplexity_token=7.0216]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:49<04:21,  2.88it/s, acc_step=1/1, ce_loss_token=1.9494, perplexity_token=7.0243]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:50<04:26,  2.82it/s, acc_step=1/1, ce_loss_token=1.9493, perplexity_token=7.0237]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:50<04:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.9492, perplexity_token=7.0229]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:50<04:19,  2.89it/s, acc_step=1/1, ce_loss_token=1.9495, perplexity_token=7.0252]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:51<04:22,  2.85it/s, acc_step=1/1, ce_loss_token=1.9494, perplexity_token=7.0241]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:51<04:28,  2.78it/s, acc_step=1/1, ce_loss_token=1.9493, perplexity_token=7.0234]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:51<04:28,  2.78it/s, acc_step=1/1, ce_loss_token=1.9491, perplexity_token=7.0225]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:52<04:31,  2.75it/s, acc_step=1/1, ce_loss_token=1.9490, perplexity_token=7.0215]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:52<04:46,  2.60it/s, acc_step=1/1, ce_loss_token=1.9489, perplexity_token=7.0209]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:53<04:44,  2.62it/s, acc_step=1/1, ce_loss_token=1.9488, perplexity_token=7.0201]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:53<04:38,  2.67it/s, acc_step=1/1, ce_loss_token=1.9487, perplexity_token=7.0194]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:53<04:34,  2.70it/s, acc_step=1/1, ce_loss_token=1.9486, perplexity_token=7.0188]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:54<04:31,  2.73it/s, acc_step=1/1, ce_loss_token=1.9485, perplexity_token=7.0179]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:54<04:28,  2.76it/s, acc_step=1/1, ce_loss_token=1.9484, perplexity_token=7.0172]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:54<04:28,  2.75it/s, acc_step=1/1, ce_loss_token=1.9483, perplexity_token=7.0165]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:55<04:23,  2.80it/s, acc_step=1/1, ce_loss_token=1.9482, perplexity_token=7.0158]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:55<04:28,  2.75it/s, acc_step=1/1, ce_loss_token=1.9480, perplexity_token=7.0150]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:56<04:33,  2.69it/s, acc_step=1/1, ce_loss_token=1.9480, perplexity_token=7.0144]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:56<04:31,  2.71it/s, acc_step=1/1, ce_loss_token=1.9479, perplexity_token=7.0138]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:56<04:36,  2.65it/s, acc_step=1/1, ce_loss_token=1.9478, perplexity_token=7.0132]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:57<04:37,  2.65it/s, acc_step=1/1, ce_loss_token=1.9477, perplexity_token=7.0125]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:57<05:00,  2.44it/s, acc_step=1/1, ce_loss_token=1.9476, perplexity_token=7.0118]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:58<05:01,  2.43it/s, acc_step=1/1, ce_loss_token=1.9475, perplexity_token=7.0109]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:58<05:55,  2.05it/s, acc_step=1/1, ce_loss_token=1.9474, perplexity_token=7.0104]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:59<05:23,  2.25it/s, acc_step=1/1, ce_loss_token=1.9473, perplexity_token=7.0096]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:59<05:09,  2.35it/s, acc_step=1/1, ce_loss_token=1.9472, perplexity_token=7.0090]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:59<04:57,  2.44it/s, acc_step=1/1, ce_loss_token=1.9471, perplexity_token=7.0080]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  30%|██████████████▎                                | 318/1044 [02:00<05:07,  2.36it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0072]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  31%|██████████████▎                                | 319/1044 [02:00<05:08,  2.35it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0065]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  31%|██████████████▍                                | 320/1044 [02:01<04:44,  2.54it/s, acc_step=1/1, ce_loss_token=1.9471, perplexity_token=7.0083]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  31%|██████████████▍                                | 321/1044 [02:01<04:28,  2.69it/s, acc_step=1/1, ce_loss_token=1.9470, perplexity_token=7.0077]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  31%|██████████████▍                                | 322/1044 [02:01<04:22,  2.75it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0070]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  31%|██████████████▌                                | 323/1044 [02:02<04:32,  2.65it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0063]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  31%|██████████████▌                                | 324/1044 [02:02<04:32,  2.64it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0055]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  31%|██████████████▋                                | 325/1044 [02:02<04:29,  2.67it/s, acc_step=1/1, ce_loss_token=1.9466, perplexity_token=7.0051]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  31%|██████████████▋                                | 326/1044 [02:03<04:14,  2.82it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0067]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:03<04:15,  2.80it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0063]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:03<04:02,  2.95it/s, acc_step=1/1, ce_loss_token=1.9470, perplexity_token=7.0077]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:04<03:49,  3.12it/s, acc_step=1/1, ce_loss_token=1.9472, perplexity_token=7.0093]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:04<04:01,  2.96it/s, acc_step=1/1, ce_loss_token=1.9471, perplexity_token=7.0084]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:04<04:02,  2.93it/s, acc_step=1/1, ce_loss_token=1.9470, perplexity_token=7.0076]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:05<04:04,  2.91it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0069]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:05<04:15,  2.79it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0059]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  32%|███████████████                                | 334/1044 [02:05<04:24,  2.68it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0052]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  32%|███████████████                                | 335/1044 [02:06<04:08,  2.86it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0068]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:06<04:19,  2.72it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0062]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:07<04:20,  2.71it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0056]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:07<04:20,  2.71it/s, acc_step=1/1, ce_loss_token=1.9466, perplexity_token=7.0046]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:07<04:01,  2.91it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0064]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:08<04:04,  2.88it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0060]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:08<03:40,  3.18it/s, acc_step=1/1, ce_loss_token=1.9476, perplexity_token=7.0119]

torch.Size([256, 295, 35]) torch.Size([256, 295])
torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:08<03:29,  3.34it/s, acc_step=1/1, ce_loss_token=1.9478, perplexity_token=7.0130]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:09<03:52,  3.01it/s, acc_step=1/1, ce_loss_token=1.9477, perplexity_token=7.0126]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:09<04:10,  2.79it/s, acc_step=1/1, ce_loss_token=1.9476, perplexity_token=7.0120]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:10<04:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.9475, perplexity_token=7.0114]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:10<04:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.9475, perplexity_token=7.0109]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:10<04:25,  2.62it/s, acc_step=1/1, ce_loss_token=1.9473, perplexity_token=7.0100]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:11<04:24,  2.63it/s, acc_step=1/1, ce_loss_token=1.9473, perplexity_token=7.0094]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:11<04:26,  2.60it/s, acc_step=1/1, ce_loss_token=1.9472, perplexity_token=7.0090]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:12<04:37,  2.50it/s, acc_step=1/1, ce_loss_token=1.9471, perplexity_token=7.0081]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:12<04:37,  2.49it/s, acc_step=1/1, ce_loss_token=1.9470, perplexity_token=7.0075]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:12<04:48,  2.40it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0070]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:13<04:51,  2.36it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0064]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:13<04:35,  2.50it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0056]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  34%|████████████████                               | 356/1044 [02:14<04:26,  2.58it/s, acc_step=1/1, ce_loss_token=1.9466, perplexity_token=7.0052]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  34%|████████████████                               | 357/1044 [02:14<04:33,  2.51it/s, acc_step=1/1, ce_loss_token=1.9466, perplexity_token=7.0048]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  34%|████████████████                               | 358/1044 [02:14<04:27,  2.57it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0062]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:15<04:15,  2.68it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0053]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:15<04:12,  2.71it/s, acc_step=1/1, ce_loss_token=1.9465, perplexity_token=7.0042]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:16<04:18,  2.64it/s, acc_step=1/1, ce_loss_token=1.9464, perplexity_token=7.0036]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:16<04:17,  2.65it/s, acc_step=1/1, ce_loss_token=1.9463, perplexity_token=7.0028]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:16<03:59,  2.84it/s, acc_step=1/1, ce_loss_token=1.9465, perplexity_token=7.0044]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:17<04:08,  2.74it/s, acc_step=1/1, ce_loss_token=1.9465, perplexity_token=7.0039]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:17<03:50,  2.95it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0052]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:17<03:37,  3.12it/s, acc_step=1/1, ce_loss_token=1.9469, perplexity_token=7.0068]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:18<03:52,  2.91it/s, acc_step=1/1, ce_loss_token=1.9468, perplexity_token=7.0060]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:18<03:54,  2.88it/s, acc_step=1/1, ce_loss_token=1.9467, perplexity_token=7.0054]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:18<03:52,  2.90it/s, acc_step=1/1, ce_loss_token=1.9466, perplexity_token=7.0048]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:19<03:58,  2.83it/s, acc_step=1/1, ce_loss_token=1.9465, perplexity_token=7.0041]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:19<04:09,  2.70it/s, acc_step=1/1, ce_loss_token=1.9464, perplexity_token=7.0034]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:19<04:08,  2.70it/s, acc_step=1/1, ce_loss_token=1.9463, perplexity_token=7.0024]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:20<04:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.9462, perplexity_token=7.0019]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:20<03:53,  2.87it/s, acc_step=1/1, ce_loss_token=1.9464, perplexity_token=7.0035]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:20<03:58,  2.81it/s, acc_step=1/1, ce_loss_token=1.9463, perplexity_token=7.0028]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:21<04:00,  2.78it/s, acc_step=1/1, ce_loss_token=1.9462, perplexity_token=7.0022]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:21<04:01,  2.77it/s, acc_step=1/1, ce_loss_token=1.9461, perplexity_token=7.0013]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:22<04:07,  2.69it/s, acc_step=1/1, ce_loss_token=1.9460, perplexity_token=7.0006]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:22<04:20,  2.55it/s, acc_step=1/1, ce_loss_token=1.9459, perplexity_token=6.9999]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:22<04:15,  2.59it/s, acc_step=1/1, ce_loss_token=1.9458, perplexity_token=6.9991]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:23<04:14,  2.60it/s, acc_step=1/1, ce_loss_token=1.9459, perplexity_token=7.0002]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:23<04:13,  2.61it/s, acc_step=1/1, ce_loss_token=1.9458, perplexity_token=6.9995]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:23<04:08,  2.66it/s, acc_step=1/1, ce_loss_token=1.9457, perplexity_token=6.9989]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:24<04:16,  2.58it/s, acc_step=1/1, ce_loss_token=1.9457, perplexity_token=6.9982]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:24<04:19,  2.54it/s, acc_step=1/1, ce_loss_token=1.9456, perplexity_token=6.9976]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:25<04:15,  2.57it/s, acc_step=1/1, ce_loss_token=1.9455, perplexity_token=6.9970]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:25<04:14,  2.58it/s, acc_step=1/1, ce_loss_token=1.9454, perplexity_token=6.9965]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:25<04:17,  2.55it/s, acc_step=1/1, ce_loss_token=1.9453, perplexity_token=6.9958]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:26<03:58,  2.74it/s, acc_step=1/1, ce_loss_token=1.9455, perplexity_token=6.9973]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:26<03:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.9454, perplexity_token=6.9967]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:27<04:06,  2.65it/s, acc_step=1/1, ce_loss_token=1.9453, perplexity_token=6.9960]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:27<04:08,  2.62it/s, acc_step=1/1, ce_loss_token=1.9453, perplexity_token=6.9954]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:27<04:19,  2.51it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9950]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:28<04:13,  2.57it/s, acc_step=1/1, ce_loss_token=1.9451, perplexity_token=6.9942]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:28<03:55,  2.76it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9953]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:28<03:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9947]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:29<03:37,  2.97it/s, acc_step=1/1, ce_loss_token=1.9453, perplexity_token=6.9958]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:29<03:42,  2.91it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9953]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:29<03:51,  2.79it/s, acc_step=1/1, ce_loss_token=1.9451, perplexity_token=6.9946]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:30<03:51,  2.79it/s, acc_step=1/1, ce_loss_token=1.9450, perplexity_token=6.9940]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:30<03:36,  2.97it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9951]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:30<03:44,  2.86it/s, acc_step=1/1, ce_loss_token=1.9451, perplexity_token=6.9945]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:31<03:49,  2.79it/s, acc_step=1/1, ce_loss_token=1.9450, perplexity_token=6.9940]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:31<03:51,  2.76it/s, acc_step=1/1, ce_loss_token=1.9449, perplexity_token=6.9932]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:31<03:38,  2.93it/s, acc_step=1/1, ce_loss_token=1.9451, perplexity_token=6.9944]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:32<03:39,  2.91it/s, acc_step=1/1, ce_loss_token=1.9450, perplexity_token=6.9939]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:32<03:27,  3.08it/s, acc_step=1/1, ce_loss_token=1.9452, perplexity_token=6.9949]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:32<03:36,  2.94it/s, acc_step=1/1, ce_loss_token=1.9451, perplexity_token=6.9944]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:33<03:40,  2.88it/s, acc_step=1/1, ce_loss_token=1.9450, perplexity_token=6.9938]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:33<03:48,  2.77it/s, acc_step=1/1, ce_loss_token=1.9450, perplexity_token=6.9934]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:34<03:45,  2.81it/s, acc_step=1/1, ce_loss_token=1.9449, perplexity_token=6.9928]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:34<03:43,  2.83it/s, acc_step=1/1, ce_loss_token=1.9448, perplexity_token=6.9922]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:34<03:53,  2.70it/s, acc_step=1/1, ce_loss_token=1.9447, perplexity_token=6.9914]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:35<03:58,  2.65it/s, acc_step=1/1, ce_loss_token=1.9446, perplexity_token=6.9910]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:35<04:05,  2.57it/s, acc_step=1/1, ce_loss_token=1.9445, perplexity_token=6.9903]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:36<04:00,  2.61it/s, acc_step=1/1, ce_loss_token=1.9444, perplexity_token=6.9898]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:36<04:06,  2.54it/s, acc_step=1/1, ce_loss_token=1.9444, perplexity_token=6.9892]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:36<03:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.9443, perplexity_token=6.9886]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:37<03:40,  2.83it/s, acc_step=1/1, ce_loss_token=1.9444, perplexity_token=6.9897]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:37<03:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.9443, perplexity_token=6.9890]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:37<03:47,  2.73it/s, acc_step=1/1, ce_loss_token=1.9443, perplexity_token=6.9885]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:38<03:50,  2.70it/s, acc_step=1/1, ce_loss_token=1.9442, perplexity_token=6.9878]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:38<03:55,  2.64it/s, acc_step=1/1, ce_loss_token=1.9440, perplexity_token=6.9869]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:38<03:39,  2.83it/s, acc_step=1/1, ce_loss_token=1.9442, perplexity_token=6.9880]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:39<03:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.9441, perplexity_token=6.9873]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:39<03:30,  2.93it/s, acc_step=1/1, ce_loss_token=1.9443, perplexity_token=6.9884]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:39<03:37,  2.83it/s, acc_step=1/1, ce_loss_token=1.9442, perplexity_token=6.9879]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:40<03:50,  2.67it/s, acc_step=1/1, ce_loss_token=1.9441, perplexity_token=6.9875]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:40<03:52,  2.65it/s, acc_step=1/1, ce_loss_token=1.9440, perplexity_token=6.9870]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:41<03:31,  2.91it/s, acc_step=1/1, ce_loss_token=1.9443, perplexity_token=6.9884]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:41<03:29,  2.93it/s, acc_step=1/1, ce_loss_token=1.9442, perplexity_token=6.9878]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:41<03:38,  2.80it/s, acc_step=1/1, ce_loss_token=1.9441, perplexity_token=6.9872]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:42<03:47,  2.68it/s, acc_step=1/1, ce_loss_token=1.9439, perplexity_token=6.9863]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:42<03:50,  2.64it/s, acc_step=1/1, ce_loss_token=1.9439, perplexity_token=6.9857]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:42<03:50,  2.64it/s, acc_step=1/1, ce_loss_token=1.9438, perplexity_token=6.9851]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:43<03:37,  2.80it/s, acc_step=1/1, ce_loss_token=1.9440, perplexity_token=6.9865]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:43<03:41,  2.74it/s, acc_step=1/1, ce_loss_token=1.9439, perplexity_token=6.9858]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:44<03:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.9438, perplexity_token=6.9852]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:44<03:50,  2.63it/s, acc_step=1/1, ce_loss_token=1.9437, perplexity_token=6.9847]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:44<03:50,  2.62it/s, acc_step=1/1, ce_loss_token=1.9436, perplexity_token=6.9840]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:45<04:14,  2.37it/s, acc_step=1/1, ce_loss_token=1.9435, perplexity_token=6.9834]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:45<04:01,  2.49it/s, acc_step=1/1, ce_loss_token=1.9435, perplexity_token=6.9829]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:46<03:57,  2.53it/s, acc_step=1/1, ce_loss_token=1.9434, perplexity_token=6.9822]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:46<04:01,  2.48it/s, acc_step=1/1, ce_loss_token=1.9433, perplexity_token=6.9816]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:46<04:06,  2.43it/s, acc_step=1/1, ce_loss_token=1.9432, perplexity_token=6.9810]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:47<03:47,  2.62it/s, acc_step=1/1, ce_loss_token=1.9433, perplexity_token=6.9821]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:47<03:44,  2.65it/s, acc_step=1/1, ce_loss_token=1.9432, perplexity_token=6.9814]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:47<03:44,  2.65it/s, acc_step=1/1, ce_loss_token=1.9432, perplexity_token=6.9808]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:48<03:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.9431, perplexity_token=6.9804]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:48<03:25,  2.89it/s, acc_step=1/1, ce_loss_token=1.9433, perplexity_token=6.9815]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:49<03:36,  2.74it/s, acc_step=1/1, ce_loss_token=1.9432, perplexity_token=6.9808]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:49<03:40,  2.68it/s, acc_step=1/1, ce_loss_token=1.9431, perplexity_token=6.9802]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:49<03:45,  2.62it/s, acc_step=1/1, ce_loss_token=1.9430, perplexity_token=6.9797]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:50<03:38,  2.70it/s, acc_step=1/1, ce_loss_token=1.9429, perplexity_token=6.9791]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:50<03:39,  2.69it/s, acc_step=1/1, ce_loss_token=1.9429, perplexity_token=6.9788]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:50<03:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.9428, perplexity_token=6.9780]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:51<03:39,  2.68it/s, acc_step=1/1, ce_loss_token=1.9427, perplexity_token=6.9772]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:51<03:37,  2.70it/s, acc_step=1/1, ce_loss_token=1.9425, perplexity_token=6.9765]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:52<03:37,  2.68it/s, acc_step=1/1, ce_loss_token=1.9424, perplexity_token=6.9758]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:52<03:36,  2.69it/s, acc_step=1/1, ce_loss_token=1.9424, perplexity_token=6.9754]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:52<03:34,  2.72it/s, acc_step=1/1, ce_loss_token=1.9423, perplexity_token=6.9749]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:53<03:37,  2.68it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9742]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:53<03:23,  2.85it/s, acc_step=1/1, ce_loss_token=1.9424, perplexity_token=6.9755]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:53<03:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.9423, perplexity_token=6.9748]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:54<03:41,  2.62it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9742]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:54<03:47,  2.54it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9736]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:55<03:44,  2.57it/s, acc_step=1/1, ce_loss_token=1.9420, perplexity_token=6.9729]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:55<03:31,  2.72it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9740]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:55<03:34,  2.68it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9735]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:56<03:31,  2.71it/s, acc_step=1/1, ce_loss_token=1.9423, perplexity_token=6.9745]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:56<03:34,  2.67it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9740]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:56<03:41,  2.58it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9734]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:57<03:42,  2.56it/s, acc_step=1/1, ce_loss_token=1.9420, perplexity_token=6.9725]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:57<03:25,  2.77it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9738]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:57<03:13,  2.94it/s, acc_step=1/1, ce_loss_token=1.9423, perplexity_token=6.9747]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:58<03:21,  2.81it/s, acc_step=1/1, ce_loss_token=1.9422, perplexity_token=6.9739]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:58<03:26,  2.75it/s, acc_step=1/1, ce_loss_token=1.9421, perplexity_token=6.9733]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:59<04:13,  2.24it/s, acc_step=1/1, ce_loss_token=1.9420, perplexity_token=6.9727]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:59<04:02,  2.33it/s, acc_step=1/1, ce_loss_token=1.9419, perplexity_token=6.9723]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [03:00<03:40,  2.56it/s, acc_step=1/1, ce_loss_token=1.9420, perplexity_token=6.9730]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [03:00<03:30,  2.67it/s, acc_step=1/1, ce_loss_token=1.9419, perplexity_token=6.9723]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [03:00<03:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9714]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [03:01<03:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9708]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [03:01<03:14,  2.88it/s, acc_step=1/1, ce_loss_token=1.9419, perplexity_token=6.9716]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [03:01<03:17,  2.82it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9711]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [03:02<03:23,  2.75it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9707]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [03:02<03:13,  2.88it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9716]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [03:02<03:02,  3.05it/s, acc_step=1/1, ce_loss_token=1.9420, perplexity_token=6.9724]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  47%|██████████████████████                         | 489/1044 [03:03<03:17,  2.81it/s, acc_step=1/1, ce_loss_token=1.9419, perplexity_token=6.9719]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  47%|██████████████████████                         | 490/1044 [03:03<03:20,  2.76it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9711]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  47%|██████████████████████                         | 491/1044 [03:03<03:25,  2.68it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9704]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:04<03:09,  2.91it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9714]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:04<03:15,  2.82it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9708]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:04<03:17,  2.79it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9703]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:05<03:17,  2.78it/s, acc_step=1/1, ce_loss_token=1.9416, perplexity_token=6.9698]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:05<03:19,  2.74it/s, acc_step=1/1, ce_loss_token=1.9415, perplexity_token=6.9693]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:06<03:22,  2.71it/s, acc_step=1/1, ce_loss_token=1.9414, perplexity_token=6.9687]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:06<03:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.9413, perplexity_token=6.9680]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:06<03:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.9413, perplexity_token=6.9676]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:07<02:47,  3.24it/s, acc_step=1/1, ce_loss_token=1.9418, perplexity_token=6.9709]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:07<02:51,  3.17it/s, acc_step=1/1, ce_loss_token=1.9417, perplexity_token=6.9704]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:08<02:58,  3.03it/s, acc_step=1/1, ce_loss_token=1.9416, perplexity_token=6.9699]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:08<03:05,  2.90it/s, acc_step=1/1, ce_loss_token=1.9415, perplexity_token=6.9693]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:08<03:10,  2.83it/s, acc_step=1/1, ce_loss_token=1.9414, perplexity_token=6.9685]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:09<03:06,  2.88it/s, acc_step=1/1, ce_loss_token=1.9415, perplexity_token=6.9694]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:09<03:07,  2.87it/s, acc_step=1/1, ce_loss_token=1.9415, perplexity_token=6.9690]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:09<03:17,  2.71it/s, acc_step=1/1, ce_loss_token=1.9414, perplexity_token=6.9686]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:10<03:16,  2.72it/s, acc_step=1/1, ce_loss_token=1.9413, perplexity_token=6.9680]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:10<03:12,  2.78it/s, acc_step=1/1, ce_loss_token=1.9412, perplexity_token=6.9674]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:11<03:39,  2.43it/s, acc_step=1/1, ce_loss_token=1.9412, perplexity_token=6.9669]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:11<03:33,  2.49it/s, acc_step=1/1, ce_loss_token=1.9411, perplexity_token=6.9664]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:11<03:27,  2.55it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9660]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:12<03:28,  2.54it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9654]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:12<03:20,  2.64it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9649]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:12<03:06,  2.83it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9659]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:13<03:05,  2.84it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9654]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:13<03:05,  2.83it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9647]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:13<03:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.9408, perplexity_token=6.9644]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:14<02:53,  3.02it/s, acc_step=1/1, ce_loss_token=1.9411, perplexity_token=6.9665]

torch.Size([256, 326, 35]) torch.Size([256, 326])
torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:14<02:59,  2.91it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9659]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:15<03:04,  2.83it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9654]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:15<02:50,  3.06it/s, acc_step=1/1, ce_loss_token=1.9411, perplexity_token=6.9663]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:15<02:56,  2.94it/s, acc_step=1/1, ce_loss_token=1.9410, perplexity_token=6.9657]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:16<03:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9652]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:16<03:08,  2.75it/s, acc_step=1/1, ce_loss_token=1.9409, perplexity_token=6.9647]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:17<03:21,  2.56it/s, acc_step=1/1, ce_loss_token=1.9408, perplexity_token=6.9641]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:17<03:45,  2.28it/s, acc_step=1/1, ce_loss_token=1.9407, perplexity_token=6.9635]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:18<03:33,  2.41it/s, acc_step=1/1, ce_loss_token=1.9406, perplexity_token=6.9630]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:18<03:24,  2.51it/s, acc_step=1/1, ce_loss_token=1.9405, perplexity_token=6.9623]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:18<03:27,  2.47it/s, acc_step=1/1, ce_loss_token=1.9404, perplexity_token=6.9618]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:19<03:28,  2.45it/s, acc_step=1/1, ce_loss_token=1.9404, perplexity_token=6.9613]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:19<03:10,  2.67it/s, acc_step=1/1, ce_loss_token=1.9405, perplexity_token=6.9622]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:20<03:15,  2.60it/s, acc_step=1/1, ce_loss_token=1.9404, perplexity_token=6.9618]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:20<03:10,  2.67it/s, acc_step=1/1, ce_loss_token=1.9404, perplexity_token=6.9613]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:20<03:11,  2.64it/s, acc_step=1/1, ce_loss_token=1.9403, perplexity_token=6.9607]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:21<03:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9600]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:21<02:59,  2.81it/s, acc_step=1/1, ce_loss_token=1.9403, perplexity_token=6.9607]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:21<03:00,  2.79it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9602]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:22<03:03,  2.75it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9598]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:22<03:00,  2.77it/s, acc_step=1/1, ce_loss_token=1.9401, perplexity_token=6.9591]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:22<03:05,  2.69it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9587]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:23<02:53,  2.89it/s, acc_step=1/1, ce_loss_token=1.9401, perplexity_token=6.9597]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:23<02:52,  2.89it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9590]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:23<03:02,  2.73it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9585]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:24<03:11,  2.60it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9579]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:24<03:08,  2.63it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9574]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:25<03:09,  2.61it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9571]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:25<02:54,  2.82it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9579]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:25<02:40,  3.07it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9588]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:26<02:43,  3.01it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9581]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:26<02:44,  2.98it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9590]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:26<02:51,  2.87it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9585]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:27<02:56,  2.78it/s, acc_step=1/1, ce_loss_token=1.9401, perplexity_token=6.9594]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:27<02:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9587]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:27<02:55,  2.77it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9581]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:28<03:05,  2.62it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9576]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:28<02:58,  2.72it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9583]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:28<02:47,  2.89it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9587]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:29<02:53,  2.79it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9582]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:29<03:01,  2.66it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9577]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:30<02:57,  2.71it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9572]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:30<02:59,  2.67it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9566]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:30<02:59,  2.67it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9561]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:31<02:53,  2.75it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9557]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:31<02:57,  2.68it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9551]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:32<03:03,  2.60it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9546]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:32<03:01,  2.62it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9541]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:32<03:02,  2.59it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9536]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:33<02:56,  2.68it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9532]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:33<02:42,  2.90it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9539]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:33<02:51,  2.75it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9534]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:34<02:53,  2.71it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9528]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:34<02:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9534]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:34<02:44,  2.85it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9528]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:35<02:45,  2.82it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9524]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:35<02:33,  3.04it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9530]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:35<02:24,  3.21it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9537]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:36<02:23,  3.24it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9548]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:36<02:20,  3.29it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9556]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:36<02:13,  3.44it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9573]

torch.Size([256, 305, 35]) torch.Size([256, 305])
torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:37<02:14,  3.42it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9584]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:37<02:13,  3.43it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9589]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:37<02:21,  3.25it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9584]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:38<02:14,  3.38it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9599]

torch.Size([256, 320, 35]) torch.Size([256, 320])
torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:38<02:12,  3.43it/s, acc_step=1/1, ce_loss_token=1.9403, perplexity_token=6.9608]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:39<02:20,  3.23it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9603]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:39<02:27,  3.06it/s, acc_step=1/1, ce_loss_token=1.9401, perplexity_token=6.9597]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:39<02:20,  3.22it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9602]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:40<02:16,  3.29it/s, acc_step=1/1, ce_loss_token=1.9403, perplexity_token=6.9609]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:40<02:20,  3.21it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9604]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:40<02:28,  3.02it/s, acc_step=1/1, ce_loss_token=1.9402, perplexity_token=6.9599]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:41<02:32,  2.95it/s, acc_step=1/1, ce_loss_token=1.9401, perplexity_token=6.9595]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:41<02:34,  2.89it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9590]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:41<02:35,  2.87it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9583]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:42<02:38,  2.81it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9578]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:42<02:40,  2.77it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9573]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:42<02:41,  2.74it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9568]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:43<02:46,  2.66it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9561]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:43<02:32,  2.90it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9566]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:44<02:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9564]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:44<02:39,  2.74it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9559]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:44<02:27,  2.96it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9565]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:45<02:35,  2.80it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9561]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:45<02:36,  2.78it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9556]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:45<02:39,  2.73it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9552]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:46<02:40,  2.71it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9547]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:46<02:39,  2.71it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9541]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:46<02:41,  2.68it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9537]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:47<02:35,  2.77it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9543]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:47<02:33,  2.81it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9538]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:47<02:26,  2.92it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9551]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:48<02:18,  3.10it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9558]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:48<02:18,  3.09it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9566]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:48<02:21,  3.00it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9561]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:49<02:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9556]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:49<02:13,  3.18it/s, acc_step=1/1, ce_loss_token=1.9400, perplexity_token=6.9587]

torch.Size([256, 288, 35]) torch.Size([256, 288])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:50<02:16,  3.09it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9582]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:50<02:22,  2.96it/s, acc_step=1/1, ce_loss_token=1.9399, perplexity_token=6.9578]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:50<02:28,  2.82it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9575]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:51<02:30,  2.79it/s, acc_step=1/1, ce_loss_token=1.9398, perplexity_token=6.9573]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:51<02:29,  2.80it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9569]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:52<02:28,  2.81it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9564]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:52<02:27,  2.82it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9559]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:52<02:29,  2.77it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9555]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:53<02:26,  2.82it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9551]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:53<02:24,  2.86it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9546]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:53<02:30,  2.74it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9540]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:54<02:29,  2.75it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9535]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:54<02:27,  2.78it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9530]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:54<02:20,  2.91it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9539]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:55<02:13,  3.05it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9546]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:55<02:13,  3.05it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9558]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:55<02:06,  3.22it/s, acc_step=1/1, ce_loss_token=1.9397, perplexity_token=6.9566]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:56<02:13,  3.04it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9562]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:56<02:15,  2.97it/s, acc_step=1/1, ce_loss_token=1.9396, perplexity_token=6.9557]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:56<02:19,  2.88it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9554]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:57<02:22,  2.82it/s, acc_step=1/1, ce_loss_token=1.9395, perplexity_token=6.9550]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:57<02:21,  2.83it/s, acc_step=1/1, ce_loss_token=1.9394, perplexity_token=6.9546]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:58<02:35,  2.58it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9542]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:58<02:55,  2.27it/s, acc_step=1/1, ce_loss_token=1.9393, perplexity_token=6.9537]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:59<02:46,  2.39it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9533]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:59<02:37,  2.52it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9529]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:59<02:31,  2.61it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9525]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [04:00<02:25,  2.71it/s, acc_step=1/1, ce_loss_token=1.9392, perplexity_token=6.9531]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [04:00<02:30,  2.62it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9527]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [04:00<02:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.9391, perplexity_token=6.9522]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [04:01<02:34,  2.53it/s, acc_step=1/1, ce_loss_token=1.9390, perplexity_token=6.9518]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [04:01<02:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.9389, perplexity_token=6.9513]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [04:01<02:26,  2.67it/s, acc_step=1/1, ce_loss_token=1.9389, perplexity_token=6.9510]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [04:02<02:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.9388, perplexity_token=6.9505]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [04:02<02:31,  2.56it/s, acc_step=1/1, ce_loss_token=1.9388, perplexity_token=6.9502]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:03<02:28,  2.61it/s, acc_step=1/1, ce_loss_token=1.9388, perplexity_token=6.9508]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:03<02:26,  2.64it/s, acc_step=1/1, ce_loss_token=1.9388, perplexity_token=6.9503]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:03<02:23,  2.67it/s, acc_step=1/1, ce_loss_token=1.9387, perplexity_token=6.9500]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:04<02:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.9387, perplexity_token=6.9494]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:04<02:23,  2.67it/s, acc_step=1/1, ce_loss_token=1.9386, perplexity_token=6.9489]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:05<02:30,  2.54it/s, acc_step=1/1, ce_loss_token=1.9385, perplexity_token=6.9485]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:05<02:28,  2.57it/s, acc_step=1/1, ce_loss_token=1.9385, perplexity_token=6.9480]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:05<02:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.9384, perplexity_token=6.9475]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:06<02:26,  2.58it/s, acc_step=1/1, ce_loss_token=1.9383, perplexity_token=6.9470]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:06<02:30,  2.50it/s, acc_step=1/1, ce_loss_token=1.9382, perplexity_token=6.9465]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:06<02:22,  2.65it/s, acc_step=1/1, ce_loss_token=1.9382, perplexity_token=6.9459]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:07<02:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9454]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:07<02:13,  2.80it/s, acc_step=1/1, ce_loss_token=1.9382, perplexity_token=6.9461]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:07<02:14,  2.79it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9456]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:08<02:08,  2.91it/s, acc_step=1/1, ce_loss_token=1.9382, perplexity_token=6.9464]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:08<02:08,  2.90it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9458]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:09<02:12,  2.80it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9454]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:09<02:14,  2.75it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9449]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:09<02:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9446]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:10<02:07,  2.88it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9457]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:10<02:19,  2.63it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9453]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:10<02:07,  2.88it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9459]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:11<02:07,  2.87it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9455]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:11<02:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9451]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:11<02:13,  2.71it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9449]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:12<02:14,  2.70it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9446]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:12<02:20,  2.56it/s, acc_step=1/1, ce_loss_token=1.9381, perplexity_token=6.9454]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:13<02:17,  2.62it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9451]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:13<02:16,  2.63it/s, acc_step=1/1, ce_loss_token=1.9380, perplexity_token=6.9446]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:13<02:15,  2.64it/s, acc_step=1/1, ce_loss_token=1.9379, perplexity_token=6.9442]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:14<02:16,  2.61it/s, acc_step=1/1, ce_loss_token=1.9378, perplexity_token=6.9437]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:14<02:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.9378, perplexity_token=6.9432]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:14<02:13,  2.67it/s, acc_step=1/1, ce_loss_token=1.9377, perplexity_token=6.9427]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:15<02:04,  2.85it/s, acc_step=1/1, ce_loss_token=1.9378, perplexity_token=6.9434]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:15<02:05,  2.82it/s, acc_step=1/1, ce_loss_token=1.9379, perplexity_token=6.9440]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:16<02:16,  2.58it/s, acc_step=1/1, ce_loss_token=1.9378, perplexity_token=6.9434]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:16<02:23,  2.44it/s, acc_step=1/1, ce_loss_token=1.9377, perplexity_token=6.9429]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:17<02:37,  2.23it/s, acc_step=1/1, ce_loss_token=1.9377, perplexity_token=6.9425]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:17<02:31,  2.30it/s, acc_step=1/1, ce_loss_token=1.9376, perplexity_token=6.9421]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:17<02:35,  2.24it/s, acc_step=1/1, ce_loss_token=1.9375, perplexity_token=6.9416]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:18<02:32,  2.28it/s, acc_step=1/1, ce_loss_token=1.9375, perplexity_token=6.9411]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:18<02:22,  2.42it/s, acc_step=1/1, ce_loss_token=1.9374, perplexity_token=6.9407]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:19<02:22,  2.41it/s, acc_step=1/1, ce_loss_token=1.9373, perplexity_token=6.9403]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:19<02:18,  2.49it/s, acc_step=1/1, ce_loss_token=1.9373, perplexity_token=6.9397]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:19<02:13,  2.56it/s, acc_step=1/1, ce_loss_token=1.9372, perplexity_token=6.9392]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:20<02:10,  2.62it/s, acc_step=1/1, ce_loss_token=1.9371, perplexity_token=6.9389]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:20<02:12,  2.58it/s, acc_step=1/1, ce_loss_token=1.9371, perplexity_token=6.9384]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:21<02:14,  2.53it/s, acc_step=1/1, ce_loss_token=1.9370, perplexity_token=6.9380]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:21<02:17,  2.46it/s, acc_step=1/1, ce_loss_token=1.9370, perplexity_token=6.9377]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:21<02:04,  2.72it/s, acc_step=1/1, ce_loss_token=1.9371, perplexity_token=6.9383]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:22<02:06,  2.67it/s, acc_step=1/1, ce_loss_token=1.9370, perplexity_token=6.9378]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:22<02:09,  2.59it/s, acc_step=1/1, ce_loss_token=1.9369, perplexity_token=6.9373]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:22<02:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.9368, perplexity_token=6.9367]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:23<02:04,  2.69it/s, acc_step=1/1, ce_loss_token=1.9368, perplexity_token=6.9362]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:23<02:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.9367, perplexity_token=6.9357]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:24<02:07,  2.61it/s, acc_step=1/1, ce_loss_token=1.9366, perplexity_token=6.9353]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:24<02:08,  2.57it/s, acc_step=1/1, ce_loss_token=1.9366, perplexity_token=6.9348]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:24<02:11,  2.52it/s, acc_step=1/1, ce_loss_token=1.9365, perplexity_token=6.9343]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:25<02:07,  2.57it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9337]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:25<01:58,  2.77it/s, acc_step=1/1, ce_loss_token=1.9365, perplexity_token=6.9342]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:25<01:52,  2.90it/s, acc_step=1/1, ce_loss_token=1.9365, perplexity_token=6.9347]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:26<01:44,  3.11it/s, acc_step=1/1, ce_loss_token=1.9366, perplexity_token=6.9352]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:26<01:50,  2.95it/s, acc_step=1/1, ce_loss_token=1.9366, perplexity_token=6.9348]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:26<01:53,  2.86it/s, acc_step=1/1, ce_loss_token=1.9365, perplexity_token=6.9342]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:27<01:51,  2.89it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9338]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:27<01:53,  2.85it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9334]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:27<01:42,  3.12it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9338]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:28<01:47,  2.99it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9334]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:28<01:52,  2.83it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9329]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:28<01:46,  2.98it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9335]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:29<01:51,  2.84it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9329]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:29<01:48,  2.91it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9333]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:30<01:51,  2.83it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9328]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:30<01:49,  2.86it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9322]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:30<01:49,  2.86it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9317]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:31<01:36,  3.22it/s, acc_step=1/1, ce_loss_token=1.9365, perplexity_token=6.9343]

torch.Size([256, 295, 35]) torch.Size([256, 295])
torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:31<01:39,  3.11it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9338]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:32<01:45,  2.92it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9335]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:32<01:42,  3.01it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9339]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:32<01:44,  2.95it/s, acc_step=1/1, ce_loss_token=1.9364, perplexity_token=6.9335]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:33<01:48,  2.83it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9331]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:33<01:49,  2.78it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9326]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:33<01:43,  2.94it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9330]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:34<01:44,  2.90it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9325]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:34<01:50,  2.74it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9320]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:34<01:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9316]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:35<01:50,  2.72it/s, acc_step=1/1, ce_loss_token=1.9360, perplexity_token=6.9311]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:35<01:48,  2.75it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9316]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:35<01:40,  2.98it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9323]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:36<01:35,  3.12it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9327]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:36<01:33,  3.16it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9332]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:36<01:39,  2.97it/s, acc_step=1/1, ce_loss_token=1.9363, perplexity_token=6.9328]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:37<01:42,  2.88it/s, acc_step=1/1, ce_loss_token=1.9362, perplexity_token=6.9323]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:37<01:47,  2.72it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9319]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:38<01:49,  2.68it/s, acc_step=1/1, ce_loss_token=1.9361, perplexity_token=6.9316]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:38<01:58,  2.46it/s, acc_step=1/1, ce_loss_token=1.9360, perplexity_token=6.9312]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:38<01:56,  2.49it/s, acc_step=1/1, ce_loss_token=1.9360, perplexity_token=6.9308]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:39<01:55,  2.51it/s, acc_step=1/1, ce_loss_token=1.9359, perplexity_token=6.9302]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:39<01:50,  2.60it/s, acc_step=1/1, ce_loss_token=1.9358, perplexity_token=6.9297]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:40<01:51,  2.56it/s, acc_step=1/1, ce_loss_token=1.9357, perplexity_token=6.9292]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:40<01:53,  2.51it/s, acc_step=1/1, ce_loss_token=1.9357, perplexity_token=6.9287]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:40<01:49,  2.59it/s, acc_step=1/1, ce_loss_token=1.9356, perplexity_token=6.9282]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:41<01:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.9355, perplexity_token=6.9275]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:41<01:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.9354, perplexity_token=6.9271]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:41<01:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.9354, perplexity_token=6.9267]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:42<01:52,  2.49it/s, acc_step=1/1, ce_loss_token=1.9353, perplexity_token=6.9262]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:42<01:49,  2.55it/s, acc_step=1/1, ce_loss_token=1.9352, perplexity_token=6.9257]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:43<01:48,  2.57it/s, acc_step=1/1, ce_loss_token=1.9352, perplexity_token=6.9252]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:43<01:45,  2.63it/s, acc_step=1/1, ce_loss_token=1.9351, perplexity_token=6.9248]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:43<01:51,  2.50it/s, acc_step=1/1, ce_loss_token=1.9350, perplexity_token=6.9243]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:44<01:45,  2.62it/s, acc_step=1/1, ce_loss_token=1.9350, perplexity_token=6.9239]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:44<01:42,  2.69it/s, acc_step=1/1, ce_loss_token=1.9349, perplexity_token=6.9233]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:44<01:36,  2.83it/s, acc_step=1/1, ce_loss_token=1.9350, perplexity_token=6.9239]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:45<01:40,  2.73it/s, acc_step=1/1, ce_loss_token=1.9349, perplexity_token=6.9235]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:45<01:34,  2.89it/s, acc_step=1/1, ce_loss_token=1.9350, perplexity_token=6.9240]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:46<01:35,  2.84it/s, acc_step=1/1, ce_loss_token=1.9349, perplexity_token=6.9236]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:46<01:49,  2.46it/s, acc_step=1/1, ce_loss_token=1.9349, perplexity_token=6.9232]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:46<01:46,  2.52it/s, acc_step=1/1, ce_loss_token=1.9348, perplexity_token=6.9227]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:47<01:45,  2.55it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9222]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:47<01:41,  2.64it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9218]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:48<01:41,  2.62it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9213]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:48<01:34,  2.81it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9217]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:48<01:34,  2.80it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9211]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:48<01:26,  3.03it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9216]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:49<01:28,  2.94it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9212]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:49<01:36,  2.70it/s, acc_step=1/1, ce_loss_token=1.9345, perplexity_token=6.9206]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:50<01:32,  2.81it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9212]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:50<01:26,  3.01it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9215]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:50<01:26,  2.97it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9210]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:51<01:28,  2.90it/s, acc_step=1/1, ce_loss_token=1.9345, perplexity_token=6.9206]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:51<01:36,  2.66it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9201]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:51<01:30,  2.83it/s, acc_step=1/1, ce_loss_token=1.9345, perplexity_token=6.9207]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:52<01:32,  2.75it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9201]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:52<01:30,  2.80it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9198]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:53<01:15,  3.35it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9219]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:53<01:18,  3.20it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9215]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:53<01:18,  3.19it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9219]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:54<01:15,  3.26it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9222]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:54<01:21,  3.03it/s, acc_step=1/1, ce_loss_token=1.9347, perplexity_token=6.9218]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:54<01:28,  2.76it/s, acc_step=1/1, ce_loss_token=1.9346, perplexity_token=6.9213]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:55<01:35,  2.57it/s, acc_step=1/1, ce_loss_token=1.9345, perplexity_token=6.9207]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:55<01:31,  2.67it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9202]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:55<01:29,  2.71it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9198]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:56<01:28,  2.75it/s, acc_step=1/1, ce_loss_token=1.9343, perplexity_token=6.9194]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:56<01:27,  2.74it/s, acc_step=1/1, ce_loss_token=1.9343, perplexity_token=6.9190]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:57<01:22,  2.90it/s, acc_step=1/1, ce_loss_token=1.9343, perplexity_token=6.9194]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:57<01:17,  3.08it/s, acc_step=1/1, ce_loss_token=1.9344, perplexity_token=6.9199]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:57<01:18,  3.02it/s, acc_step=1/1, ce_loss_token=1.9343, perplexity_token=6.9195]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:58<01:22,  2.88it/s, acc_step=1/1, ce_loss_token=1.9343, perplexity_token=6.9190]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:58<01:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.9342, perplexity_token=6.9184]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:58<01:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.9341, perplexity_token=6.9180]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:59<01:22,  2.84it/s, acc_step=1/1, ce_loss_token=1.9341, perplexity_token=6.9175]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [04:59<01:24,  2.75it/s, acc_step=1/1, ce_loss_token=1.9340, perplexity_token=6.9169]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [04:59<01:24,  2.74it/s, acc_step=1/1, ce_loss_token=1.9339, perplexity_token=6.9164]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [05:00<01:24,  2.74it/s, acc_step=1/1, ce_loss_token=1.9338, perplexity_token=6.9160]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [05:00<01:23,  2.76it/s, acc_step=1/1, ce_loss_token=1.9337, perplexity_token=6.9154]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [05:00<01:17,  2.94it/s, acc_step=1/1, ce_loss_token=1.9338, perplexity_token=6.9157]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [05:01<01:17,  2.93it/s, acc_step=1/1, ce_loss_token=1.9337, perplexity_token=6.9153]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [05:01<01:14,  3.03it/s, acc_step=1/1, ce_loss_token=1.9338, perplexity_token=6.9157]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [05:01<01:16,  2.97it/s, acc_step=1/1, ce_loss_token=1.9337, perplexity_token=6.9152]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:02<01:16,  2.96it/s, acc_step=1/1, ce_loss_token=1.9337, perplexity_token=6.9149]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:02<01:17,  2.90it/s, acc_step=1/1, ce_loss_token=1.9336, perplexity_token=6.9145]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:02<01:19,  2.82it/s, acc_step=1/1, ce_loss_token=1.9335, perplexity_token=6.9140]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:03<01:18,  2.84it/s, acc_step=1/1, ce_loss_token=1.9335, perplexity_token=6.9136]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:03<01:17,  2.84it/s, acc_step=1/1, ce_loss_token=1.9334, perplexity_token=6.9130]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:03<01:18,  2.82it/s, acc_step=1/1, ce_loss_token=1.9333, perplexity_token=6.9125]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:04<01:14,  2.95it/s, acc_step=1/1, ce_loss_token=1.9334, perplexity_token=6.9130]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:04<01:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.9333, perplexity_token=6.9126]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:05<01:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.9333, perplexity_token=6.9121]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:05<01:21,  2.64it/s, acc_step=1/1, ce_loss_token=1.9332, perplexity_token=6.9116]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:05<01:21,  2.64it/s, acc_step=1/1, ce_loss_token=1.9331, perplexity_token=6.9111]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:06<01:15,  2.85it/s, acc_step=1/1, ce_loss_token=1.9332, perplexity_token=6.9115]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:06<01:18,  2.71it/s, acc_step=1/1, ce_loss_token=1.9331, perplexity_token=6.9110]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:06<01:17,  2.75it/s, acc_step=1/1, ce_loss_token=1.9330, perplexity_token=6.9105]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:07<01:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.9330, perplexity_token=6.9102]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:07<01:20,  2.62it/s, acc_step=1/1, ce_loss_token=1.9329, perplexity_token=6.9098]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:07<01:14,  2.80it/s, acc_step=1/1, ce_loss_token=1.9330, perplexity_token=6.9100]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:08<01:13,  2.82it/s, acc_step=1/1, ce_loss_token=1.9329, perplexity_token=6.9095]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:08<01:12,  2.84it/s, acc_step=1/1, ce_loss_token=1.9328, perplexity_token=6.9089]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:08<01:07,  3.05it/s, acc_step=1/1, ce_loss_token=1.9329, perplexity_token=6.9092]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:09<01:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.9328, perplexity_token=6.9088]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:09<01:13,  2.78it/s, acc_step=1/1, ce_loss_token=1.9327, perplexity_token=6.9083]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:10<01:10,  2.86it/s, acc_step=1/1, ce_loss_token=1.9327, perplexity_token=6.9078]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:10<01:16,  2.65it/s, acc_step=1/1, ce_loss_token=1.9326, perplexity_token=6.9074]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:10<01:15,  2.65it/s, acc_step=1/1, ce_loss_token=1.9325, perplexity_token=6.9069]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:11<01:29,  2.24it/s, acc_step=1/1, ce_loss_token=1.9325, perplexity_token=6.9065]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:11<01:24,  2.35it/s, acc_step=1/1, ce_loss_token=1.9324, perplexity_token=6.9061]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:12<01:19,  2.48it/s, acc_step=1/1, ce_loss_token=1.9323, perplexity_token=6.9056]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:12<01:18,  2.51it/s, acc_step=1/1, ce_loss_token=1.9323, perplexity_token=6.9052]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:13<01:17,  2.54it/s, acc_step=1/1, ce_loss_token=1.9322, perplexity_token=6.9048]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:13<01:15,  2.58it/s, acc_step=1/1, ce_loss_token=1.9321, perplexity_token=6.9043]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:13<01:12,  2.66it/s, acc_step=1/1, ce_loss_token=1.9321, perplexity_token=6.9038]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:14<01:11,  2.71it/s, acc_step=1/1, ce_loss_token=1.9320, perplexity_token=6.9033]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:14<01:11,  2.69it/s, acc_step=1/1, ce_loss_token=1.9319, perplexity_token=6.9029]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:14<01:17,  2.47it/s, acc_step=1/1, ce_loss_token=1.9319, perplexity_token=6.9025]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:15<01:15,  2.52it/s, acc_step=1/1, ce_loss_token=1.9318, perplexity_token=6.9021]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:15<01:23,  2.25it/s, acc_step=1/1, ce_loss_token=1.9317, perplexity_token=6.9014]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:16<01:13,  2.55it/s, acc_step=1/1, ce_loss_token=1.9318, perplexity_token=6.9017]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:16<01:18,  2.38it/s, acc_step=1/1, ce_loss_token=1.9317, perplexity_token=6.9012]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:17<01:18,  2.36it/s, acc_step=1/1, ce_loss_token=1.9316, perplexity_token=6.9008]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:17<01:16,  2.42it/s, acc_step=1/1, ce_loss_token=1.9316, perplexity_token=6.9004]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:17<01:14,  2.49it/s, acc_step=1/1, ce_loss_token=1.9315, perplexity_token=6.9000]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:18<01:11,  2.56it/s, acc_step=1/1, ce_loss_token=1.9314, perplexity_token=6.8995]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:18<01:08,  2.64it/s, acc_step=1/1, ce_loss_token=1.9314, perplexity_token=6.8989]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:18<01:08,  2.62it/s, acc_step=1/1, ce_loss_token=1.9313, perplexity_token=6.8985]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:19<01:10,  2.54it/s, acc_step=1/1, ce_loss_token=1.9312, perplexity_token=6.8981]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:19<01:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.9312, perplexity_token=6.8977]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:20<01:06,  2.68it/s, acc_step=1/1, ce_loss_token=1.9311, perplexity_token=6.8972]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:20<01:06,  2.66it/s, acc_step=1/1, ce_loss_token=1.9310, perplexity_token=6.8967]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:20<01:01,  2.85it/s, acc_step=1/1, ce_loss_token=1.9311, perplexity_token=6.8971]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:21<00:58,  2.98it/s, acc_step=1/1, ce_loss_token=1.9311, perplexity_token=6.8974]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:21<01:01,  2.81it/s, acc_step=1/1, ce_loss_token=1.9311, perplexity_token=6.8969]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:21<01:01,  2.82it/s, acc_step=1/1, ce_loss_token=1.9310, perplexity_token=6.8965]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:22<01:02,  2.74it/s, acc_step=1/1, ce_loss_token=1.9309, perplexity_token=6.8961]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:22<01:03,  2.71it/s, acc_step=1/1, ce_loss_token=1.9309, perplexity_token=6.8956]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:22<01:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.9308, perplexity_token=6.8952]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:23<01:02,  2.72it/s, acc_step=1/1, ce_loss_token=1.9308, perplexity_token=6.8947]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:23<01:04,  2.62it/s, acc_step=1/1, ce_loss_token=1.9307, perplexity_token=6.8943]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:24<01:02,  2.69it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8938]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:24<01:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8934]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:24<01:01,  2.70it/s, acc_step=1/1, ce_loss_token=1.9305, perplexity_token=6.8929]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:25<00:59,  2.76it/s, acc_step=1/1, ce_loss_token=1.9304, perplexity_token=6.8924]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:25<00:55,  2.92it/s, acc_step=1/1, ce_loss_token=1.9305, perplexity_token=6.8928]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:25<00:53,  3.04it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8933]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:26<00:50,  3.17it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8937]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:26<00:47,  3.39it/s, acc_step=1/1, ce_loss_token=1.9307, perplexity_token=6.8941]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:26<00:53,  2.96it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8937]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:27<00:48,  3.24it/s, acc_step=1/1, ce_loss_token=1.9309, perplexity_token=6.8955]

torch.Size([256, 306, 35]) torch.Size([256, 306])
torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:27<00:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.9308, perplexity_token=6.8951]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:28<00:55,  2.78it/s, acc_step=1/1, ce_loss_token=1.9308, perplexity_token=6.8947]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:28<00:54,  2.82it/s, acc_step=1/1, ce_loss_token=1.9307, perplexity_token=6.8944]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:28<00:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8939]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:29<00:55,  2.74it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8934]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:29<00:53,  2.80it/s, acc_step=1/1, ce_loss_token=1.9306, perplexity_token=6.8938]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:29<00:54,  2.76it/s, acc_step=1/1, ce_loss_token=1.9305, perplexity_token=6.8933]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:30<00:54,  2.72it/s, acc_step=1/1, ce_loss_token=1.9305, perplexity_token=6.8929]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:30<00:56,  2.61it/s, acc_step=1/1, ce_loss_token=1.9304, perplexity_token=6.8925]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:31<00:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.9304, perplexity_token=6.8921]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:31<00:54,  2.68it/s, acc_step=1/1, ce_loss_token=1.9303, perplexity_token=6.8917]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:31<00:53,  2.71it/s, acc_step=1/1, ce_loss_token=1.9302, perplexity_token=6.8912]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:32<00:53,  2.69it/s, acc_step=1/1, ce_loss_token=1.9302, perplexity_token=6.8908]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:32<00:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.9301, perplexity_token=6.8903]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:32<00:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.9301, perplexity_token=6.8899]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:33<00:54,  2.60it/s, acc_step=1/1, ce_loss_token=1.9300, perplexity_token=6.8896]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:33<00:53,  2.61it/s, acc_step=1/1, ce_loss_token=1.9299, perplexity_token=6.8891]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:34<00:49,  2.81it/s, acc_step=1/1, ce_loss_token=1.9300, perplexity_token=6.8896]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:34<00:49,  2.81it/s, acc_step=1/1, ce_loss_token=1.9300, perplexity_token=6.8892]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:34<00:48,  2.80it/s, acc_step=1/1, ce_loss_token=1.9299, perplexity_token=6.8887]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:35<00:48,  2.78it/s, acc_step=1/1, ce_loss_token=1.9298, perplexity_token=6.8883]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:35<00:45,  2.99it/s, acc_step=1/1, ce_loss_token=1.9299, perplexity_token=6.8886]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:35<00:48,  2.79it/s, acc_step=1/1, ce_loss_token=1.9298, perplexity_token=6.8881]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:36<00:49,  2.70it/s, acc_step=1/1, ce_loss_token=1.9297, perplexity_token=6.8876]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:36<00:54,  2.43it/s, acc_step=1/1, ce_loss_token=1.9297, perplexity_token=6.8872]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:37<00:52,  2.48it/s, acc_step=1/1, ce_loss_token=1.9296, perplexity_token=6.8867]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:37<00:52,  2.46it/s, acc_step=1/1, ce_loss_token=1.9295, perplexity_token=6.8863]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:37<00:50,  2.54it/s, acc_step=1/1, ce_loss_token=1.9295, perplexity_token=6.8859]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:38<00:48,  2.63it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8855]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:38<00:45,  2.80it/s, acc_step=1/1, ce_loss_token=1.9295, perplexity_token=6.8859]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:38<00:45,  2.76it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8854]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:39<00:44,  2.80it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8850]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:39<00:40,  3.05it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8854]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:39<00:42,  2.88it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8850]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:40<00:41,  2.92it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8853]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:40<00:40,  2.98it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8849]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:40<00:41,  2.86it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8845]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:41<00:39,  3.00it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8849]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:41<00:37,  3.12it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8853]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:41<00:39,  2.98it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8848]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:42<00:39,  2.92it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8844]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:42<00:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.9292, perplexity_token=6.8839]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:43<00:36,  3.08it/s, acc_step=1/1, ce_loss_token=1.9295, perplexity_token=6.8858]

torch.Size([256, 308, 35]) torch.Size([256, 308])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:43<00:37,  3.00it/s, acc_step=1/1, ce_loss_token=1.9294, perplexity_token=6.8854]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:43<00:37,  2.93it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8850]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:44<00:40,  2.75it/s, acc_step=1/1, ce_loss_token=1.9293, perplexity_token=6.8846]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:44<00:39,  2.75it/s, acc_step=1/1, ce_loss_token=1.9292, perplexity_token=6.8842]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:45<00:40,  2.68it/s, acc_step=1/1, ce_loss_token=1.9292, perplexity_token=6.8838]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:45<00:40,  2.64it/s, acc_step=1/1, ce_loss_token=1.9291, perplexity_token=6.8835]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:45<00:39,  2.72it/s, acc_step=1/1, ce_loss_token=1.9291, perplexity_token=6.8830]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:46<00:38,  2.76it/s, acc_step=1/1, ce_loss_token=1.9290, perplexity_token=6.8825]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:46<00:38,  2.73it/s, acc_step=1/1, ce_loss_token=1.9289, perplexity_token=6.8821]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:47<00:38,  2.67it/s, acc_step=1/1, ce_loss_token=1.9289, perplexity_token=6.8817]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:47<00:38,  2.67it/s, acc_step=1/1, ce_loss_token=1.9288, perplexity_token=6.8812]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:47<00:37,  2.67it/s, acc_step=1/1, ce_loss_token=1.9287, perplexity_token=6.8807]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:48<00:36,  2.71it/s, acc_step=1/1, ce_loss_token=1.9287, perplexity_token=6.8802]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:48<00:38,  2.54it/s, acc_step=1/1, ce_loss_token=1.9286, perplexity_token=6.8799]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:48<00:38,  2.53it/s, acc_step=1/1, ce_loss_token=1.9285, perplexity_token=6.8795]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:49<00:36,  2.64it/s, acc_step=1/1, ce_loss_token=1.9285, perplexity_token=6.8791]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:49<00:35,  2.68it/s, acc_step=1/1, ce_loss_token=1.9284, perplexity_token=6.8786]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:50<00:34,  2.73it/s, acc_step=1/1, ce_loss_token=1.9283, perplexity_token=6.8781]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:50<00:34,  2.75it/s, acc_step=1/1, ce_loss_token=1.9283, perplexity_token=6.8777]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:50<00:34,  2.72it/s, acc_step=1/1, ce_loss_token=1.9282, perplexity_token=6.8772]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:51<00:33,  2.74it/s, acc_step=1/1, ce_loss_token=1.9281, perplexity_token=6.8767]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:51<00:32,  2.80it/s, acc_step=1/1, ce_loss_token=1.9281, perplexity_token=6.8764]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:51<00:32,  2.80it/s, acc_step=1/1, ce_loss_token=1.9280, perplexity_token=6.8760]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:52<00:30,  2.93it/s, acc_step=1/1, ce_loss_token=1.9281, perplexity_token=6.8762]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:52<00:31,  2.79it/s, acc_step=1/1, ce_loss_token=1.9280, perplexity_token=6.8758]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:52<00:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.9280, perplexity_token=6.8754]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:53<00:32,  2.66it/s, acc_step=1/1, ce_loss_token=1.9279, perplexity_token=6.8751]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:53<00:31,  2.68it/s, acc_step=1/1, ce_loss_token=1.9278, perplexity_token=6.8746]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:54<00:32,  2.60it/s, acc_step=1/1, ce_loss_token=1.9278, perplexity_token=6.8743]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:54<00:29,  2.85it/s, acc_step=1/1, ce_loss_token=1.9278, perplexity_token=6.8747]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:54<00:26,  3.04it/s, acc_step=1/1, ce_loss_token=1.9279, perplexity_token=6.8750]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:55<00:28,  2.83it/s, acc_step=1/1, ce_loss_token=1.9278, perplexity_token=6.8746]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:55<00:27,  2.87it/s, acc_step=1/1, ce_loss_token=1.9278, perplexity_token=6.8742]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:55<00:27,  2.85it/s, acc_step=1/1, ce_loss_token=1.9277, perplexity_token=6.8738]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:56<00:27,  2.83it/s, acc_step=1/1, ce_loss_token=1.9277, perplexity_token=6.8734]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:56<00:31,  2.43it/s, acc_step=1/1, ce_loss_token=1.9276, perplexity_token=6.8730]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:56<00:30,  2.53it/s, acc_step=1/1, ce_loss_token=1.9275, perplexity_token=6.8726]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:57<00:26,  2.78it/s, acc_step=1/1, ce_loss_token=1.9276, perplexity_token=6.8730]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:57<00:26,  2.75it/s, acc_step=1/1, ce_loss_token=1.9275, perplexity_token=6.8726]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [05:57<00:25,  2.89it/s, acc_step=1/1, ce_loss_token=1.9276, perplexity_token=6.8729]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [05:58<00:25,  2.79it/s, acc_step=1/1, ce_loss_token=1.9275, perplexity_token=6.8726]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [05:58<00:24,  2.93it/s, acc_step=1/1, ce_loss_token=1.9276, perplexity_token=6.8730]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [05:59<00:25,  2.80it/s, acc_step=1/1, ce_loss_token=1.9275, perplexity_token=6.8726]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [05:59<00:25,  2.75it/s, acc_step=1/1, ce_loss_token=1.9275, perplexity_token=6.8721]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [05:59<00:24,  2.76it/s, acc_step=1/1, ce_loss_token=1.9274, perplexity_token=6.8717]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [06:00<00:24,  2.74it/s, acc_step=1/1, ce_loss_token=1.9273, perplexity_token=6.8713]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [06:00<00:23,  2.80it/s, acc_step=1/1, ce_loss_token=1.9273, perplexity_token=6.8708]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:00<00:24,  2.68it/s, acc_step=1/1, ce_loss_token=1.9272, perplexity_token=6.8704]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:01<00:24,  2.58it/s, acc_step=1/1, ce_loss_token=1.9272, perplexity_token=6.8700]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:01<00:24,  2.58it/s, acc_step=1/1, ce_loss_token=1.9271, perplexity_token=6.8696]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:01<00:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.9271, perplexity_token=6.8699]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:02<00:22,  2.69it/s, acc_step=1/1, ce_loss_token=1.9271, perplexity_token=6.8695]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:02<00:21,  2.75it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8691]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:03<00:22,  2.61it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8688]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:03<00:20,  2.84it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8691]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:03<00:20,  2.77it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8687]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:04<00:18,  3.01it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8690]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:04<00:19,  2.87it/s, acc_step=1/1, ce_loss_token=1.9271, perplexity_token=6.8694]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:04<00:18,  2.89it/s, acc_step=1/1, ce_loss_token=1.9270, perplexity_token=6.8689]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:05<00:19,  2.77it/s, acc_step=1/1, ce_loss_token=1.9269, perplexity_token=6.8685]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:05<00:19,  2.72it/s, acc_step=1/1, ce_loss_token=1.9269, perplexity_token=6.8680]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:06<00:19,  2.61it/s, acc_step=1/1, ce_loss_token=1.9268, perplexity_token=6.8676]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:06<00:17,  2.81it/s, acc_step=1/1, ce_loss_token=1.9269, perplexity_token=6.8680]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:06<00:17,  2.74it/s, acc_step=1/1, ce_loss_token=1.9268, perplexity_token=6.8676]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:07<00:17,  2.74it/s, acc_step=1/1, ce_loss_token=1.9269, perplexity_token=6.8679]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:07<00:16,  2.80it/s, acc_step=1/1, ce_loss_token=1.9268, perplexity_token=6.8675]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:07<00:16,  2.82it/s, acc_step=1/1, ce_loss_token=1.9267, perplexity_token=6.8671]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:08<00:16,  2.81it/s, acc_step=1/1, ce_loss_token=1.9267, perplexity_token=6.8667]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:08<00:16,  2.61it/s, acc_step=1/1, ce_loss_token=1.9266, perplexity_token=6.8663]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:08<00:16,  2.65it/s, acc_step=1/1, ce_loss_token=1.9266, perplexity_token=6.8659]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:09<00:15,  2.71it/s, acc_step=1/1, ce_loss_token=1.9265, perplexity_token=6.8656]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:09<00:14,  2.75it/s, acc_step=1/1, ce_loss_token=1.9265, perplexity_token=6.8652]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:10<00:15,  2.65it/s, acc_step=1/1, ce_loss_token=1.9264, perplexity_token=6.8648]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:10<00:13,  2.86it/s, acc_step=1/1, ce_loss_token=1.9265, perplexity_token=6.8652]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:10<00:14,  2.69it/s, acc_step=1/1, ce_loss_token=1.9264, perplexity_token=6.8648]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:11<00:13,  2.83it/s, acc_step=1/1, ce_loss_token=1.9264, perplexity_token=6.8650]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:11<00:12,  2.82it/s, acc_step=1/1, ce_loss_token=1.9264, perplexity_token=6.8645]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:11<00:12,  2.85it/s, acc_step=1/1, ce_loss_token=1.9263, perplexity_token=6.8640]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:12<00:13,  2.57it/s, acc_step=1/1, ce_loss_token=1.9262, perplexity_token=6.8636]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:12<00:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.9262, perplexity_token=6.8632]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:12<00:12,  2.61it/s, acc_step=1/1, ce_loss_token=1.9261, perplexity_token=6.8628]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:13<00:11,  2.63it/s, acc_step=1/1, ce_loss_token=1.9261, perplexity_token=6.8624]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:13<00:11,  2.68it/s, acc_step=1/1, ce_loss_token=1.9260, perplexity_token=6.8620]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:14<00:10,  2.71it/s, acc_step=1/1, ce_loss_token=1.9259, perplexity_token=6.8616]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:14<00:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.9259, perplexity_token=6.8612]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:14<00:10,  2.66it/s, acc_step=1/1, ce_loss_token=1.9258, perplexity_token=6.8608]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:15<00:09,  2.67it/s, acc_step=1/1, ce_loss_token=1.9258, perplexity_token=6.8604]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:15<00:09,  2.63it/s, acc_step=1/1, ce_loss_token=1.9257, perplexity_token=6.8600]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:15<00:09,  2.58it/s, acc_step=1/1, ce_loss_token=1.9256, perplexity_token=6.8595]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:16<00:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.9256, perplexity_token=6.8591]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:16<00:08,  2.66it/s, acc_step=1/1, ce_loss_token=1.9255, perplexity_token=6.8586]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:17<00:07,  2.82it/s, acc_step=1/1, ce_loss_token=1.9255, perplexity_token=6.8589]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:17<00:06,  2.87it/s, acc_step=1/1, ce_loss_token=1.9255, perplexity_token=6.8585]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:17<00:06,  2.85it/s, acc_step=1/1, ce_loss_token=1.9254, perplexity_token=6.8580]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:18<00:06,  2.83it/s, acc_step=1/1, ce_loss_token=1.9254, perplexity_token=6.8576]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:18<00:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.9253, perplexity_token=6.8571]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:18<00:05,  2.73it/s, acc_step=1/1, ce_loss_token=1.9252, perplexity_token=6.8567]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:19<00:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.9252, perplexity_token=6.8563]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:19<00:05,  2.64it/s, acc_step=1/1, ce_loss_token=1.9251, perplexity_token=6.8558]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:19<00:04,  2.66it/s, acc_step=1/1, ce_loss_token=1.9250, perplexity_token=6.8554]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:20<00:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.9250, perplexity_token=6.8550]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:20<00:04,  2.65it/s, acc_step=1/1, ce_loss_token=1.9249, perplexity_token=6.8546]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:21<00:03,  2.60it/s, acc_step=1/1, ce_loss_token=1.9249, perplexity_token=6.8542]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:21<00:03,  2.58it/s, acc_step=1/1, ce_loss_token=1.9248, perplexity_token=6.8537]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:21<00:03,  2.63it/s, acc_step=1/1, ce_loss_token=1.9247, perplexity_token=6.8533]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:22<00:02,  2.83it/s, acc_step=1/1, ce_loss_token=1.9248, perplexity_token=6.8535]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:22<00:02,  2.98it/s, acc_step=1/1, ce_loss_token=1.9248, perplexity_token=6.8537]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:22<00:01,  2.89it/s, acc_step=1/1, ce_loss_token=1.9247, perplexity_token=6.8533]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:23<00:01,  2.82it/s, acc_step=1/1, ce_loss_token=1.9247, perplexity_token=6.8529]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:23<00:00,  3.00it/s, acc_step=1/1, ce_loss_token=1.9247, perplexity_token=6.8531]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:23<00:00,  2.93it/s, acc_step=1/1, ce_loss_token=1.9246, perplexity_token=6.8527]

torch.Size([256, 297, 35]) torch.Size([256, 297])




torch.Size([170, 289, 35]) torch.Size([170, 289])



Generating with greedy search...

📊 Metrics (Epoch 4):
├── TRAIN:
│   ├── ce_loss_char: 1.9246
│   ├── ce_loss_token: 1.9246
│   ├── perplexity_char: 6.8521
│   └── perplexity_token: 6.8521
├── VAL:
│   ├── ce_loss_char: 1.7768
│   ├── ce_loss_token: 1.7768
│   ├── perplexity_char: 5.9109
│   └── perplexity_token: 5.9109
└── TRAINING:
    └── learning_rate: 0.000100


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:17,  2.09it/s, acc_step=1/1, ce_loss_token=1.8692, perplexity_token=6.4832]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:29,  2.32it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4570]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<06:59,  2.48it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4719]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:52,  2.52it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4581]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<07:18,  2.37it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4589]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<07:01,  2.46it/s, acc_step=1/1, ce_loss_token=1.8628, perplexity_token=6.4415]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:52,  2.51it/s, acc_step=1/1, ce_loss_token=1.8637, perplexity_token=6.4477]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<07:07,  2.42it/s, acc_step=1/1, ce_loss_token=1.8639, perplexity_token=6.4487]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<06:55,  2.49it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4503]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▍                                               | 10/1044 [00:04<06:44,  2.56it/s, acc_step=1/1, ce_loss_token=1.8639, perplexity_token=6.4486]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<06:48,  2.53it/s, acc_step=1/1, ce_loss_token=1.8630, perplexity_token=6.4432]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:41,  2.57it/s, acc_step=1/1, ce_loss_token=1.8614, perplexity_token=6.4327]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:39,  2.58it/s, acc_step=1/1, ce_loss_token=1.8614, perplexity_token=6.4325]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<06:32,  2.63it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<06:28,  2.65it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4399]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<06:22,  2.69it/s, acc_step=1/1, ce_loss_token=1.8626, perplexity_token=6.4403]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<06:45,  2.54it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4396]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   2%|▊                                               | 18/1044 [00:07<06:44,  2.53it/s, acc_step=1/1, ce_loss_token=1.8632, perplexity_token=6.4444]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|▊                                               | 19/1044 [00:07<06:40,  2.56it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4425]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<06:54,  2.47it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4387]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   2%|▉                                               | 21/1044 [00:08<06:46,  2.52it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4355]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:   2%|█                                               | 22/1044 [00:08<07:03,  2.42it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4384]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   2%|█                                               | 23/1044 [00:09<06:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   2%|█                                               | 24/1044 [00:09<06:08,  2.77it/s, acc_step=1/1, ce_loss_token=1.8663, perplexity_token=6.4642]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:17,  2.70it/s, acc_step=1/1, ce_loss_token=1.8663, perplexity_token=6.4643]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   2%|█▏                                              | 26/1044 [00:10<06:14,  2.72it/s, acc_step=1/1, ce_loss_token=1.8657, perplexity_token=6.4604]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.8658, perplexity_token=6.4611]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4578]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   3%|█▎                                              | 29/1044 [00:11<06:21,  2.66it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4571]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:26,  2.62it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4574]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   3%|█▍                                              | 31/1044 [00:12<06:39,  2.53it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4582]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<06:40,  2.53it/s, acc_step=1/1, ce_loss_token=1.8649, perplexity_token=6.4555]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<06:28,  2.60it/s, acc_step=1/1, ce_loss_token=1.8645, perplexity_token=6.4525]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   3%|█▌                                              | 34/1044 [00:13<06:18,  2.67it/s, acc_step=1/1, ce_loss_token=1.8643, perplexity_token=6.4511]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<06:17,  2.67it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4500]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<06:14,  2.69it/s, acc_step=1/1, ce_loss_token=1.8638, perplexity_token=6.4480]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   4%|█▋                                              | 37/1044 [00:14<06:05,  2.75it/s, acc_step=1/1, ce_loss_token=1.8633, perplexity_token=6.4451]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:21,  2.63it/s, acc_step=1/1, ce_loss_token=1.8631, perplexity_token=6.4436]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   4%|█▊                                              | 39/1044 [00:15<06:19,  2.64it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4422]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<06:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4426]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:11,  2.70it/s, acc_step=1/1, ce_loss_token=1.8626, perplexity_token=6.4404]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   4%|█▉                                              | 42/1044 [00:16<06:12,  2.69it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4412]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<06:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4408]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   4%|██                                              | 44/1044 [00:16<05:56,  2.81it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4413]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   4%|██                                              | 45/1044 [00:17<05:56,  2.80it/s, acc_step=1/1, ce_loss_token=1.8626, perplexity_token=6.4407]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:18,  2.64it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4423]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   5%|██▏                                             | 47/1044 [00:18<06:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4412]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<06:25,  2.59it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4400]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   5%|██▎                                             | 49/1044 [00:18<06:33,  2.53it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4397]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   5%|██▎                                             | 50/1044 [00:19<06:29,  2.55it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4382]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<06:29,  2.55it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4372]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   5%|██▍                                             | 52/1044 [00:20<06:27,  2.56it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4391]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   5%|██▍                                             | 53/1044 [00:20<05:59,  2.76it/s, acc_step=1/1, ce_loss_token=1.8642, perplexity_token=6.4509]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<06:05,  2.71it/s, acc_step=1/1, ce_loss_token=1.8643, perplexity_token=6.4514]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   5%|██▌                                             | 55/1044 [00:21<05:35,  2.95it/s, acc_step=1/1, ce_loss_token=1.8661, perplexity_token=6.4628]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   5%|██▌                                             | 56/1044 [00:21<05:16,  3.12it/s, acc_step=1/1, ce_loss_token=1.8681, perplexity_token=6.4759]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<05:33,  2.96it/s, acc_step=1/1, ce_loss_token=1.8680, perplexity_token=6.4754]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:29,  2.99it/s, acc_step=1/1, ce_loss_token=1.8678, perplexity_token=6.4741]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   6%|██▋                                             | 59/1044 [00:22<05:39,  2.90it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4720]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:50,  2.81it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4706]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   6%|██▊                                             | 61/1044 [00:23<06:11,  2.64it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<06:09,  2.66it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4705]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<06:05,  2.69it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4704]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   6%|██▉                                             | 64/1044 [00:24<05:41,  2.87it/s, acc_step=1/1, ce_loss_token=1.8688, perplexity_token=6.4808]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   6%|██▉                                             | 65/1044 [00:24<05:47,  2.82it/s, acc_step=1/1, ce_loss_token=1.8687, perplexity_token=6.4796]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   6%|███                                             | 66/1044 [00:24<05:48,  2.81it/s, acc_step=1/1, ce_loss_token=1.8684, perplexity_token=6.4779]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:   6%|███                                             | 67/1044 [00:25<06:12,  2.62it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4878]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   7%|███▏                                            | 68/1044 [00:25<06:10,  2.63it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4872]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   7%|███▏                                            | 69/1044 [00:26<06:06,  2.66it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4861]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<05:38,  2.88it/s, acc_step=1/1, ce_loss_token=1.8709, perplexity_token=6.4941]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<05:41,  2.85it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4928]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   7%|███▎                                            | 72/1044 [00:27<05:46,  2.81it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4929]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<05:50,  2.77it/s, acc_step=1/1, ce_loss_token=1.8706, perplexity_token=6.4922]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   7%|███▍                                            | 74/1044 [00:27<05:56,  2.72it/s, acc_step=1/1, ce_loss_token=1.8704, perplexity_token=6.4906]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   7%|███▍                                            | 75/1044 [00:28<05:48,  2.78it/s, acc_step=1/1, ce_loss_token=1.8715, perplexity_token=6.4977]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<05:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.8711, perplexity_token=6.4952]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<05:50,  2.76it/s, acc_step=1/1, ce_loss_token=1.8709, perplexity_token=6.4942]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<06:05,  2.64it/s, acc_step=1/1, ce_loss_token=1.8706, perplexity_token=6.4923]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<05:42,  2.82it/s, acc_step=1/1, ce_loss_token=1.8721, perplexity_token=6.5020]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   8%|███▋                                            | 80/1044 [00:30<05:50,  2.75it/s, acc_step=1/1, ce_loss_token=1.8719, perplexity_token=6.5004]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<05:55,  2.71it/s, acc_step=1/1, ce_loss_token=1.8718, perplexity_token=6.4999]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<05:59,  2.67it/s, acc_step=1/1, ce_loss_token=1.8718, perplexity_token=6.4998]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<05:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.8717, perplexity_token=6.4991]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<05:55,  2.70it/s, acc_step=1/1, ce_loss_token=1.8714, perplexity_token=6.4977]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   8%|███▉                                            | 85/1044 [00:31<05:25,  2.94it/s, acc_step=1/1, ce_loss_token=1.8724, perplexity_token=6.5039]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<05:12,  3.06it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5131]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   8%|████                                            | 87/1044 [00:32<05:25,  2.94it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5115]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:   8%|████                                            | 88/1044 [00:32<05:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5101]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   9%|████                                            | 89/1044 [00:33<05:51,  2.72it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5078]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▏                                           | 90/1044 [00:33<05:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.8728, perplexity_token=6.5063]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   9%|████▏                                           | 91/1044 [00:33<05:28,  2.90it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5118]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   9%|████▏                                           | 92/1044 [00:34<05:08,  3.08it/s, acc_step=1/1, ce_loss_token=1.8747, perplexity_token=6.5191]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   9%|████▎                                           | 93/1044 [00:34<05:14,  3.02it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5171]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▎                                           | 94/1044 [00:34<05:25,  2.92it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5168]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<06:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5164]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:15,  3.00it/s, acc_step=1/1, ce_loss_token=1.8770, perplexity_token=6.5337]

torch.Size([256, 292, 35]) torch.Size([256, 292])
torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<05:24,  2.92it/s, acc_step=1/1, ce_loss_token=1.8767, perplexity_token=6.5319]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   9%|████▌                                           | 99/1044 [00:36<05:37,  2.80it/s, acc_step=1/1, ce_loss_token=1.8765, perplexity_token=6.5307]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<05:32,  2.84it/s, acc_step=1/1, ce_loss_token=1.8763, perplexity_token=6.5293]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  10%|████▌                                          | 101/1044 [00:37<06:27,  2.44it/s, acc_step=1/1, ce_loss_token=1.8760, perplexity_token=6.5276]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<06:14,  2.52it/s, acc_step=1/1, ce_loss_token=1.8758, perplexity_token=6.5263]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<06:17,  2.49it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5246]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  10%|████▋                                          | 104/1044 [00:38<06:24,  2.45it/s, acc_step=1/1, ce_loss_token=1.8755, perplexity_token=6.5238]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<06:10,  2.53it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5222]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<05:59,  2.61it/s, acc_step=1/1, ce_loss_token=1.8749, perplexity_token=6.5199]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  10%|████▊                                          | 107/1044 [00:39<05:59,  2.61it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5183]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<05:34,  2.80it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5247]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  10%|████▉                                          | 109/1044 [00:40<05:34,  2.79it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5231]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  11%|████▉                                          | 110/1044 [00:40<05:36,  2.77it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5221]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.8750, perplexity_token=6.5209]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  11%|█████                                          | 112/1044 [00:41<05:16,  2.94it/s, acc_step=1/1, ce_loss_token=1.8757, perplexity_token=6.5252]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  11%|█████                                          | 113/1044 [00:41<05:01,  3.09it/s, acc_step=1/1, ce_loss_token=1.8767, perplexity_token=6.5322]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<05:17,  2.93it/s, acc_step=1/1, ce_loss_token=1.8766, perplexity_token=6.5310]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:42<05:25,  2.85it/s, acc_step=1/1, ce_loss_token=1.8764, perplexity_token=6.5301]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<06:11,  2.50it/s, acc_step=1/1, ce_loss_token=1.8761, perplexity_token=6.5283]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:43<05:57,  2.59it/s, acc_step=1/1, ce_loss_token=1.8758, perplexity_token=6.5262]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<06:22,  2.42it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5250]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<06:10,  2.50it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5233]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:44<06:03,  2.55it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5220]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:45<05:58,  2.57it/s, acc_step=1/1, ce_loss_token=1.8750, perplexity_token=6.5207]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:27,  2.82it/s, acc_step=1/1, ce_loss_token=1.8759, perplexity_token=6.5264]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:45<05:41,  2.70it/s, acc_step=1/1, ce_loss_token=1.8757, perplexity_token=6.5256]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:32,  2.76it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5248]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:37,  2.72it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5233]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:46<05:25,  2.82it/s, acc_step=1/1, ce_loss_token=1.8761, perplexity_token=6.5277]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:27,  2.80it/s, acc_step=1/1, ce_loss_token=1.8757, perplexity_token=6.5257]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:47<05:26,  2.80it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5248]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:48<05:51,  2.60it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5233]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<06:11,  2.46it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5223]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:54,  2.57it/s, acc_step=1/1, ce_loss_token=1.8750, perplexity_token=6.5211]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:00,  3.04it/s, acc_step=1/1, ce_loss_token=1.8765, perplexity_token=6.5308]

torch.Size([256, 309, 35]) torch.Size([256, 309])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:04,  2.98it/s, acc_step=1/1, ce_loss_token=1.8763, perplexity_token=6.5293]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  13%|██████                                         | 135/1044 [00:50<05:36,  2.70it/s, acc_step=1/1, ce_loss_token=1.8762, perplexity_token=6.5285]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<05:24,  2.80it/s, acc_step=1/1, ce_loss_token=1.8760, perplexity_token=6.5271]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:50<05:23,  2.80it/s, acc_step=1/1, ce_loss_token=1.8757, perplexity_token=6.5255]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:51<05:03,  2.99it/s, acc_step=1/1, ce_loss_token=1.8764, perplexity_token=6.5302]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<05:28,  2.75it/s, acc_step=1/1, ce_loss_token=1.8763, perplexity_token=6.5292]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:51<05:33,  2.71it/s, acc_step=1/1, ce_loss_token=1.8762, perplexity_token=6.5285]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:37,  2.68it/s, acc_step=1/1, ce_loss_token=1.8760, perplexity_token=6.5271]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:42,  2.63it/s, acc_step=1/1, ce_loss_token=1.8759, perplexity_token=6.5265]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:57,  2.52it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5249]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:47,  2.59it/s, acc_step=1/1, ce_loss_token=1.8755, perplexity_token=6.5239]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<05:50,  2.57it/s, acc_step=1/1, ce_loss_token=1.8753, perplexity_token=6.5229]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:54<05:47,  2.59it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5224]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<05:25,  2.75it/s, acc_step=1/1, ce_loss_token=1.8759, perplexity_token=6.5268]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<05:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.8758, perplexity_token=6.5262]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:55<05:50,  2.55it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5250]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:48,  2.56it/s, acc_step=1/1, ce_loss_token=1.8755, perplexity_token=6.5242]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:44,  2.59it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5231]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<05:48,  2.56it/s, acc_step=1/1, ce_loss_token=1.8751, perplexity_token=6.5218]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<05:55,  2.51it/s, acc_step=1/1, ce_loss_token=1.8750, perplexity_token=6.5205]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:57<06:26,  2.30it/s, acc_step=1/1, ce_loss_token=1.8748, perplexity_token=6.5193]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<06:04,  2.44it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5181]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:57,  2.48it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5163]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|███████                                        | 157/1044 [00:58<05:50,  2.53it/s, acc_step=1/1, ce_loss_token=1.8742, perplexity_token=6.5153]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  15%|███████                                        | 158/1044 [00:59<05:48,  2.54it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5144]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<05:34,  2.64it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5133]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:59<05:36,  2.63it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5124]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  15%|███████▏                                       | 161/1044 [01:00<05:37,  2.61it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5115]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  16%|███████▎                                       | 162/1044 [01:00<05:48,  2.53it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5109]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:00<05:32,  2.65it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5096]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<05:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5095]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5083]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:02<05:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5078]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<05:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.8729, perplexity_token=6.5069]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<05:24,  2.70it/s, acc_step=1/1, ce_loss_token=1.8728, perplexity_token=6.5064]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:03<05:13,  2.79it/s, acc_step=1/1, ce_loss_token=1.8727, perplexity_token=6.5058]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<04:57,  2.94it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5099]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:03<04:43,  3.08it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5132]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<04:52,  2.99it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5119]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:04<04:53,  2.97it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5113]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:04<05:06,  2.84it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5102]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<04:59,  2.90it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5094]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:05<04:59,  2.90it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5080]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:05<04:50,  2.98it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5112]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<04:35,  3.15it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5143]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  17%|████████                                       | 179/1044 [01:06<04:43,  3.05it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5139]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  17%|████████                                       | 180/1044 [01:06<04:55,  2.92it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5132]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<04:59,  2.88it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5120]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:07<05:07,  2.80it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5111]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:07<05:04,  2.83it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5102]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:08<05:21,  2.68it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5096]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:08<04:58,  2.88it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5133]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:08<05:02,  2.84it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5125]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:09<05:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5122]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:09<05:32,  2.58it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5111]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:10<05:00,  2.84it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5138]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:10<05:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5131]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:10<05:37,  2.53it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5125]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:11<05:42,  2.49it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5124]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:11<05:49,  2.44it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5120]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:55,  2.39it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5114]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:12<05:45,  2.46it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5106]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:47,  2.44it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5101]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:13<05:37,  2.51it/s, acc_step=1/1, ce_loss_token=1.8732, perplexity_token=6.5091]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:13<05:20,  2.64it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5083]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:37,  2.50it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5075]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  19%|█████████                                      | 200/1044 [01:14<05:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5103]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  19%|█████████                                      | 201/1044 [01:14<05:04,  2.77it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5095]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  19%|█████████                                      | 202/1044 [01:15<05:18,  2.64it/s, acc_step=1/1, ce_loss_token=1.8732, perplexity_token=6.5089]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:15<05:15,  2.67it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5080]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:15<04:57,  2.83it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5115]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:16<04:53,  2.85it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5109]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:16<04:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5098]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:16<04:57,  2.81it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5087]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:17<05:00,  2.78it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5084]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:17<05:02,  2.76it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5078]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:18<05:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.8728, perplexity_token=6.5068]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:18<05:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.8728, perplexity_token=6.5063]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:18<05:24,  2.57it/s, acc_step=1/1, ce_loss_token=1.8727, perplexity_token=6.5056]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:19<05:09,  2.69it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5084]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:19<05:06,  2.71it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5076]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:19<04:48,  2.87it/s, acc_step=1/1, ce_loss_token=1.8745, perplexity_token=6.5177]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:20<05:01,  2.74it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5169]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:20<06:01,  2.29it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5162]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:21<05:25,  2.54it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5185]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:21<05:30,  2.50it/s, acc_step=1/1, ce_loss_token=1.8745, perplexity_token=6.5176]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:21<05:31,  2.48it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5171]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:22<05:25,  2.53it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5164]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:22<04:52,  2.81it/s, acc_step=1/1, ce_loss_token=1.8747, perplexity_token=6.5186]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  21%|██████████                                     | 223/1044 [01:23<05:05,  2.69it/s, acc_step=1/1, ce_loss_token=1.8745, perplexity_token=6.5178]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  21%|██████████                                     | 224/1044 [01:23<05:16,  2.59it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5171]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:23<05:09,  2.64it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5164]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:24<05:06,  2.67it/s, acc_step=1/1, ce_loss_token=1.8742, perplexity_token=6.5157]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:24<05:16,  2.58it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5147]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:24<05:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5140]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:25<04:59,  2.72it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5132]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:25<04:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.8742, perplexity_token=6.5158]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:26<04:59,  2.72it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5149]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:26<04:57,  2.73it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5142]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:26<04:53,  2.76it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5129]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:27<05:12,  2.59it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5122]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:27<05:07,  2.63it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5145]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:27<05:00,  2.68it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5135]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:28<05:04,  2.65it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5125]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:28<04:44,  2.83it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5147]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:28<04:36,  2.91it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5139]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:29<04:37,  2.90it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5133]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:29<04:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5128]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:29<04:28,  2.98it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5151]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:30<04:14,  3.15it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5170]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:30<04:26,  3.01it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5161]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  23%|███████████                                    | 245/1044 [01:30<04:35,  2.90it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5152]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  24%|███████████                                    | 246/1044 [01:31<04:35,  2.89it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5141]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  24%|███████████                                    | 247/1044 [01:31<04:42,  2.82it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5134]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:32<04:48,  2.76it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5127]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:32<04:48,  2.75it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5147]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:32<04:34,  2.90it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5164]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:33<04:16,  3.09it/s, acc_step=1/1, ce_loss_token=1.8747, perplexity_token=6.5189]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:33<04:22,  3.01it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5182]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:33<04:32,  2.90it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5172]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:34<04:37,  2.85it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5167]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:34<04:43,  2.79it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5160]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:34<05:05,  2.58it/s, acc_step=1/1, ce_loss_token=1.8742, perplexity_token=6.5155]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:35<04:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.8745, perplexity_token=6.5174]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:35<04:54,  2.67it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5164]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:36<04:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.8742, perplexity_token=6.5158]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:36<03:58,  3.29it/s, acc_step=1/1, ce_loss_token=1.8757, perplexity_token=6.5255]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:36<04:16,  3.04it/s, acc_step=1/1, ce_loss_token=1.8756, perplexity_token=6.5244]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:37<04:22,  2.97it/s, acc_step=1/1, ce_loss_token=1.8754, perplexity_token=6.5234]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:37<04:32,  2.86it/s, acc_step=1/1, ce_loss_token=1.8753, perplexity_token=6.5227]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:38<04:56,  2.62it/s, acc_step=1/1, ce_loss_token=1.8752, perplexity_token=6.5223]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:38<04:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.8751, perplexity_token=6.5215]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████                                   | 267/1044 [01:38<04:47,  2.70it/s, acc_step=1/1, ce_loss_token=1.8750, perplexity_token=6.5208]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  26%|████████████                                   | 268/1044 [01:39<04:58,  2.60it/s, acc_step=1/1, ce_loss_token=1.8749, perplexity_token=6.5203]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  26%|████████████                                   | 269/1044 [01:39<04:54,  2.63it/s, acc_step=1/1, ce_loss_token=1.8749, perplexity_token=6.5200]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:39<04:53,  2.64it/s, acc_step=1/1, ce_loss_token=1.8747, perplexity_token=6.5191]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:40<04:52,  2.64it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5182]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:40<04:40,  2.75it/s, acc_step=1/1, ce_loss_token=1.8749, perplexity_token=6.5201]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:41<04:58,  2.59it/s, acc_step=1/1, ce_loss_token=1.8748, perplexity_token=6.5194]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:41<04:51,  2.64it/s, acc_step=1/1, ce_loss_token=1.8747, perplexity_token=6.5187]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:41<05:01,  2.55it/s, acc_step=1/1, ce_loss_token=1.8746, perplexity_token=6.5180]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:42<04:53,  2.62it/s, acc_step=1/1, ce_loss_token=1.8745, perplexity_token=6.5175]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:42<04:58,  2.57it/s, acc_step=1/1, ce_loss_token=1.8744, perplexity_token=6.5169]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:43<04:58,  2.56it/s, acc_step=1/1, ce_loss_token=1.8743, perplexity_token=6.5159]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:43<04:51,  2.62it/s, acc_step=1/1, ce_loss_token=1.8741, perplexity_token=6.5152]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:43<04:52,  2.61it/s, acc_step=1/1, ce_loss_token=1.8740, perplexity_token=6.5145]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:44<04:46,  2.67it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5139]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:44<04:40,  2.72it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5135]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:44<04:50,  2.62it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5127]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:45<04:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5122]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:45<04:27,  2.84it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5134]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:45<04:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5126]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:46<04:33,  2.77it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5120]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:46<04:14,  2.97it/s, acc_step=1/1, ce_loss_token=1.8739, perplexity_token=6.5135]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:46<04:19,  2.90it/s, acc_step=1/1, ce_loss_token=1.8738, perplexity_token=6.5127]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:47<04:50,  2.60it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5117]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:47<05:00,  2.50it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5112]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:48<04:56,  2.53it/s, acc_step=1/1, ce_loss_token=1.8734, perplexity_token=6.5105]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:48<04:35,  2.73it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5123]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:48<04:29,  2.78it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5115]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:49<04:35,  2.72it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5108]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:49<04:11,  2.98it/s, acc_step=1/1, ce_loss_token=1.8737, perplexity_token=6.5123]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:49<04:19,  2.88it/s, acc_step=1/1, ce_loss_token=1.8736, perplexity_token=6.5117]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:50<04:24,  2.82it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5110]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:50<04:23,  2.83it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5101]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:51<04:23,  2.82it/s, acc_step=1/1, ce_loss_token=1.8732, perplexity_token=6.5093]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:51<04:28,  2.76it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5086]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:51<04:10,  2.97it/s, acc_step=1/1, ce_loss_token=1.8735, perplexity_token=6.5107]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:52<04:17,  2.88it/s, acc_step=1/1, ce_loss_token=1.8733, perplexity_token=6.5097]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:52<04:15,  2.89it/s, acc_step=1/1, ce_loss_token=1.8732, perplexity_token=6.5091]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:52<04:26,  2.77it/s, acc_step=1/1, ce_loss_token=1.8731, perplexity_token=6.5082]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:53<04:20,  2.83it/s, acc_step=1/1, ce_loss_token=1.8730, perplexity_token=6.5075]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:53<04:41,  2.62it/s, acc_step=1/1, ce_loss_token=1.8729, perplexity_token=6.5069]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:53<04:37,  2.66it/s, acc_step=1/1, ce_loss_token=1.8727, perplexity_token=6.5061]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:54<04:29,  2.72it/s, acc_step=1/1, ce_loss_token=1.8726, perplexity_token=6.5053]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:54<04:39,  2.63it/s, acc_step=1/1, ce_loss_token=1.8725, perplexity_token=6.5046]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:55<04:50,  2.53it/s, acc_step=1/1, ce_loss_token=1.8725, perplexity_token=6.5042]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:55<04:44,  2.58it/s, acc_step=1/1, ce_loss_token=1.8724, perplexity_token=6.5036]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:55<04:37,  2.63it/s, acc_step=1/1, ce_loss_token=1.8723, perplexity_token=6.5030]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:56<04:35,  2.65it/s, acc_step=1/1, ce_loss_token=1.8722, perplexity_token=6.5025]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:56<04:29,  2.71it/s, acc_step=1/1, ce_loss_token=1.8721, perplexity_token=6.5019]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:56<04:30,  2.69it/s, acc_step=1/1, ce_loss_token=1.8720, perplexity_token=6.5013]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:57<04:35,  2.64it/s, acc_step=1/1, ce_loss_token=1.8719, perplexity_token=6.5009]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:57<04:27,  2.71it/s, acc_step=1/1, ce_loss_token=1.8719, perplexity_token=6.5004]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:58<04:39,  2.59it/s, acc_step=1/1, ce_loss_token=1.8718, perplexity_token=6.4997]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:58<04:34,  2.64it/s, acc_step=1/1, ce_loss_token=1.8716, perplexity_token=6.4988]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  31%|██████████████▍                                | 321/1044 [01:58<04:31,  2.66it/s, acc_step=1/1, ce_loss_token=1.8715, perplexity_token=6.4983]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  31%|██████████████▍                                | 322/1044 [01:59<04:43,  2.55it/s, acc_step=1/1, ce_loss_token=1.8714, perplexity_token=6.4976]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  31%|██████████████▌                                | 323/1044 [01:59<04:45,  2.53it/s, acc_step=1/1, ce_loss_token=1.8713, perplexity_token=6.4968]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  31%|██████████████▌                                | 324/1044 [02:00<04:41,  2.56it/s, acc_step=1/1, ce_loss_token=1.8712, perplexity_token=6.4963]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  31%|██████████████▋                                | 325/1044 [02:00<04:44,  2.52it/s, acc_step=1/1, ce_loss_token=1.8711, perplexity_token=6.4957]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  31%|██████████████▋                                | 326/1044 [02:00<04:49,  2.48it/s, acc_step=1/1, ce_loss_token=1.8710, perplexity_token=6.4950]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:01<05:20,  2.24it/s, acc_step=1/1, ce_loss_token=1.8710, perplexity_token=6.4947]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:01<05:02,  2.36it/s, acc_step=1/1, ce_loss_token=1.8709, perplexity_token=6.4941]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:02<04:52,  2.45it/s, acc_step=1/1, ce_loss_token=1.8708, perplexity_token=6.4933]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:02<04:41,  2.54it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4927]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:02<04:29,  2.65it/s, acc_step=1/1, ce_loss_token=1.8706, perplexity_token=6.4921]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:03<04:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.8705, perplexity_token=6.4917]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:03<04:33,  2.60it/s, acc_step=1/1, ce_loss_token=1.8705, perplexity_token=6.4914]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  32%|███████████████                                | 334/1044 [02:04<04:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.8704, perplexity_token=6.4909]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  32%|███████████████                                | 335/1044 [02:04<04:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4927]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:04<04:09,  2.84it/s, acc_step=1/1, ce_loss_token=1.8706, perplexity_token=6.4922]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:04<03:55,  3.01it/s, acc_step=1/1, ce_loss_token=1.8708, perplexity_token=6.4937]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:05<04:02,  2.91it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4931]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:05<04:02,  2.91it/s, acc_step=1/1, ce_loss_token=1.8707, perplexity_token=6.4926]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:06<04:14,  2.77it/s, acc_step=1/1, ce_loss_token=1.8706, perplexity_token=6.4920]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:06<04:23,  2.67it/s, acc_step=1/1, ce_loss_token=1.8705, perplexity_token=6.4915]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:06<04:25,  2.64it/s, acc_step=1/1, ce_loss_token=1.8704, perplexity_token=6.4907]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:07<04:25,  2.64it/s, acc_step=1/1, ce_loss_token=1.8703, perplexity_token=6.4901]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:07<04:24,  2.65it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4898]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:07<04:18,  2.71it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4892]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:08<04:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4887]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:08<04:17,  2.71it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4878]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:09<04:20,  2.67it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4870]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:09<04:21,  2.65it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4861]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:09<04:03,  2.85it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4877]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:10<04:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4870]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:10<04:26,  2.60it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4864]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:11<04:27,  2.58it/s, acc_step=1/1, ce_loss_token=1.8696, perplexity_token=6.4858]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:11<04:11,  2.75it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4880]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:11<04:14,  2.71it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4874]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  34%|████████████████                               | 356/1044 [02:12<04:16,  2.68it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4870]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  34%|████████████████                               | 357/1044 [02:12<03:59,  2.87it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4889]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  34%|████████████████                               | 358/1044 [02:12<03:56,  2.90it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4883]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:13<03:44,  3.05it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4896]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:13<03:46,  3.02it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4889]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:13<03:56,  2.89it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4880]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:14<03:51,  2.94it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4894]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:14<03:54,  2.90it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4888]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:14<03:59,  2.83it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4881]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:15<03:57,  2.86it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4875]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:15<03:56,  2.87it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4871]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:15<04:02,  2.79it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4864]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:16<03:50,  2.93it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4881]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:16<03:50,  2.92it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4878]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:16<03:50,  2.92it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4873]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:17<03:53,  2.89it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4867]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:17<03:59,  2.80it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4864]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:17<04:06,  2.72it/s, acc_step=1/1, ce_loss_token=1.8696, perplexity_token=6.4858]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:18<03:33,  3.14it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4887]

torch.Size([256, 315, 35]) torch.Size([256, 315])
torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:18<03:25,  3.25it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4898]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:19<03:30,  3.17it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4890]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:19<03:59,  2.78it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4888]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:20<04:07,  2.68it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4882]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:20<03:59,  2.78it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4876]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:20<03:51,  2.87it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4869]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:21<03:51,  2.86it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4863]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:21<03:39,  3.01it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4879]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:21<03:50,  2.86it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4874]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:22<03:53,  2.82it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4868]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:22<04:07,  2.66it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4863]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:22<03:47,  2.89it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4875]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:23<03:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4872]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:23<03:45,  2.91it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4881]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:23<03:53,  2.80it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4876]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:24<03:44,  2.90it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4886]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:24<03:39,  2.97it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4899]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:24<04:01,  2.70it/s, acc_step=1/1, ce_loss_token=1.8702, perplexity_token=6.4893]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:25<03:59,  2.72it/s, acc_step=1/1, ce_loss_token=1.8701, perplexity_token=6.4887]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:25<04:03,  2.67it/s, acc_step=1/1, ce_loss_token=1.8700, perplexity_token=6.4881]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:26<04:13,  2.55it/s, acc_step=1/1, ce_loss_token=1.8699, perplexity_token=6.4876]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:26<04:10,  2.58it/s, acc_step=1/1, ce_loss_token=1.8698, perplexity_token=6.4873]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:26<04:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.8697, perplexity_token=6.4866]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:27<04:05,  2.63it/s, acc_step=1/1, ce_loss_token=1.8696, perplexity_token=6.4860]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:27<04:14,  2.53it/s, acc_step=1/1, ce_loss_token=1.8696, perplexity_token=6.4854]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:28<04:12,  2.54it/s, acc_step=1/1, ce_loss_token=1.8695, perplexity_token=6.4849]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:28<04:09,  2.57it/s, acc_step=1/1, ce_loss_token=1.8694, perplexity_token=6.4844]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:28<04:05,  2.61it/s, acc_step=1/1, ce_loss_token=1.8693, perplexity_token=6.4840]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:29<04:08,  2.57it/s, acc_step=1/1, ce_loss_token=1.8692, perplexity_token=6.4834]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:29<04:05,  2.60it/s, acc_step=1/1, ce_loss_token=1.8691, perplexity_token=6.4827]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:30<04:17,  2.48it/s, acc_step=1/1, ce_loss_token=1.8690, perplexity_token=6.4821]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:30<04:08,  2.56it/s, acc_step=1/1, ce_loss_token=1.8689, perplexity_token=6.4814]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:30<04:07,  2.57it/s, acc_step=1/1, ce_loss_token=1.8689, perplexity_token=6.4811]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:31<04:06,  2.57it/s, acc_step=1/1, ce_loss_token=1.8688, perplexity_token=6.4804]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:31<03:51,  2.73it/s, acc_step=1/1, ce_loss_token=1.8690, perplexity_token=6.4816]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:31<03:52,  2.72it/s, acc_step=1/1, ce_loss_token=1.8689, perplexity_token=6.4811]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:32<03:54,  2.69it/s, acc_step=1/1, ce_loss_token=1.8688, perplexity_token=6.4805]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:32<03:38,  2.89it/s, acc_step=1/1, ce_loss_token=1.8691, perplexity_token=6.4821]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:32<03:38,  2.88it/s, acc_step=1/1, ce_loss_token=1.8690, perplexity_token=6.4817]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:33<03:41,  2.84it/s, acc_step=1/1, ce_loss_token=1.8689, perplexity_token=6.4811]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:33<03:43,  2.81it/s, acc_step=1/1, ce_loss_token=1.8688, perplexity_token=6.4806]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:34<05:16,  1.98it/s, acc_step=1/1, ce_loss_token=1.8688, perplexity_token=6.4802]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:34<04:54,  2.13it/s, acc_step=1/1, ce_loss_token=1.8686, perplexity_token=6.4795]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:35<04:37,  2.25it/s, acc_step=1/1, ce_loss_token=1.8686, perplexity_token=6.4790]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:35<04:24,  2.36it/s, acc_step=1/1, ce_loss_token=1.8685, perplexity_token=6.4786]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:36<04:27,  2.33it/s, acc_step=1/1, ce_loss_token=1.8684, perplexity_token=6.4781]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:36<04:26,  2.33it/s, acc_step=1/1, ce_loss_token=1.8683, perplexity_token=6.4775]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:36<04:13,  2.45it/s, acc_step=1/1, ce_loss_token=1.8682, perplexity_token=6.4769]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:37<04:00,  2.58it/s, acc_step=1/1, ce_loss_token=1.8682, perplexity_token=6.4764]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:37<03:59,  2.59it/s, acc_step=1/1, ce_loss_token=1.8681, perplexity_token=6.4757]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:37<03:56,  2.62it/s, acc_step=1/1, ce_loss_token=1.8680, perplexity_token=6.4753]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:38<03:58,  2.58it/s, acc_step=1/1, ce_loss_token=1.8679, perplexity_token=6.4746]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:38<03:57,  2.60it/s, acc_step=1/1, ce_loss_token=1.8678, perplexity_token=6.4739]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:39<03:59,  2.57it/s, acc_step=1/1, ce_loss_token=1.8677, perplexity_token=6.4737]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:39<04:05,  2.50it/s, acc_step=1/1, ce_loss_token=1.8677, perplexity_token=6.4731]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:39<04:05,  2.50it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4723]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:40<04:06,  2.48it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4719]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:40<04:04,  2.50it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4714]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:41<03:55,  2.59it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4709]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:41<03:52,  2.62it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4704]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:41<03:52,  2.61it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4697]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:42<03:35,  2.82it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4710]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:42<03:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4706]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:42<03:42,  2.72it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4700]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:43<03:41,  2.72it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4696]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:43<03:29,  2.88it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4708]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:43<03:39,  2.75it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4704]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:44<03:35,  2.79it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4715]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:44<03:19,  3.01it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4724]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:44<03:23,  2.94it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4719]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:45<03:32,  2.81it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4716]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:45<03:30,  2.84it/s, acc_step=1/1, ce_loss_token=1.8676, perplexity_token=6.4729]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:45<03:21,  2.96it/s, acc_step=1/1, ce_loss_token=1.8678, perplexity_token=6.4740]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:46<03:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.8677, perplexity_token=6.4734]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:46<03:30,  2.82it/s, acc_step=1/1, ce_loss_token=1.8676, perplexity_token=6.4730]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:47<03:29,  2.83it/s, acc_step=1/1, ce_loss_token=1.8676, perplexity_token=6.4726]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:47<03:37,  2.72it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4724]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:47<03:37,  2.72it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4719]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:48<03:42,  2.65it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4712]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:48<03:45,  2.61it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4708]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:49<03:49,  2.56it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:49<03:32,  2.76it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4714]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:49<03:17,  2.97it/s, acc_step=1/1, ce_loss_token=1.8676, perplexity_token=6.4725]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:49<03:18,  2.95it/s, acc_step=1/1, ce_loss_token=1.8675, perplexity_token=6.4721]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:50<03:17,  2.95it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4716]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:50<03:23,  2.86it/s, acc_step=1/1, ce_loss_token=1.8674, perplexity_token=6.4712]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:51<03:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4706]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:51<03:33,  2.72it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:51<03:41,  2.62it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4698]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:52<03:39,  2.64it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4694]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:52<03:23,  2.84it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4709]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:52<03:25,  2.81it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4707]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:53<03:32,  2.71it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:53<03:50,  2.50it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4699]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:54<03:56,  2.43it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4693]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:54<03:39,  2.60it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:54<03:36,  2.64it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4699]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:55<03:30,  2.72it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4695]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:55<03:15,  2.92it/s, acc_step=1/1, ce_loss_token=1.8673, perplexity_token=6.4707]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:55<03:24,  2.78it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4703]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:56<03:30,  2.70it/s, acc_step=1/1, ce_loss_token=1.8672, perplexity_token=6.4699]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:56<03:31,  2.68it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4695]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:57<03:37,  2.60it/s, acc_step=1/1, ce_loss_token=1.8670, perplexity_token=6.4687]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:57<03:50,  2.45it/s, acc_step=1/1, ce_loss_token=1.8669, perplexity_token=6.4683]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:57<03:32,  2.65it/s, acc_step=1/1, ce_loss_token=1.8671, perplexity_token=6.4696]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:58<03:40,  2.55it/s, acc_step=1/1, ce_loss_token=1.8670, perplexity_token=6.4690]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:58<03:36,  2.60it/s, acc_step=1/1, ce_loss_token=1.8670, perplexity_token=6.4686]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:59<03:30,  2.67it/s, acc_step=1/1, ce_loss_token=1.8669, perplexity_token=6.4680]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:59<03:19,  2.80it/s, acc_step=1/1, ce_loss_token=1.8670, perplexity_token=6.4687]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [02:59<03:16,  2.84it/s, acc_step=1/1, ce_loss_token=1.8669, perplexity_token=6.4684]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [03:00<03:17,  2.82it/s, acc_step=1/1, ce_loss_token=1.8669, perplexity_token=6.4679]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [03:00<03:30,  2.64it/s, acc_step=1/1, ce_loss_token=1.8667, perplexity_token=6.4672]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [03:00<03:36,  2.56it/s, acc_step=1/1, ce_loss_token=1.8667, perplexity_token=6.4667]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  47%|██████████████████████                         | 489/1044 [03:01<03:32,  2.62it/s, acc_step=1/1, ce_loss_token=1.8666, perplexity_token=6.4661]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|██████████████████████                         | 490/1044 [03:01<03:27,  2.67it/s, acc_step=1/1, ce_loss_token=1.8665, perplexity_token=6.4655]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  47%|██████████████████████                         | 491/1044 [03:01<03:23,  2.72it/s, acc_step=1/1, ce_loss_token=1.8664, perplexity_token=6.4651]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:02<03:21,  2.74it/s, acc_step=1/1, ce_loss_token=1.8663, perplexity_token=6.4644]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:02<03:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.8662, perplexity_token=6.4638]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:03<03:20,  2.74it/s, acc_step=1/1, ce_loss_token=1.8661, perplexity_token=6.4632]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:03<03:22,  2.71it/s, acc_step=1/1, ce_loss_token=1.8661, perplexity_token=6.4629]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:03<03:13,  2.83it/s, acc_step=1/1, ce_loss_token=1.8662, perplexity_token=6.4638]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:04<03:02,  2.99it/s, acc_step=1/1, ce_loss_token=1.8664, perplexity_token=6.4649]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:04<03:33,  2.56it/s, acc_step=1/1, ce_loss_token=1.8663, perplexity_token=6.4644]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:04<03:26,  2.64it/s, acc_step=1/1, ce_loss_token=1.8662, perplexity_token=6.4639]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:05<03:27,  2.63it/s, acc_step=1/1, ce_loss_token=1.8661, perplexity_token=6.4632]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:05<03:27,  2.61it/s, acc_step=1/1, ce_loss_token=1.8661, perplexity_token=6.4628]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:06<03:25,  2.64it/s, acc_step=1/1, ce_loss_token=1.8660, perplexity_token=6.4623]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:06<03:22,  2.68it/s, acc_step=1/1, ce_loss_token=1.8659, perplexity_token=6.4619]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:06<03:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.8658, perplexity_token=6.4613]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:07<03:06,  2.89it/s, acc_step=1/1, ce_loss_token=1.8660, perplexity_token=6.4625]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:07<03:07,  2.86it/s, acc_step=1/1, ce_loss_token=1.8660, perplexity_token=6.4623]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:07<03:14,  2.76it/s, acc_step=1/1, ce_loss_token=1.8659, perplexity_token=6.4617]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:08<03:16,  2.72it/s, acc_step=1/1, ce_loss_token=1.8658, perplexity_token=6.4613]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:08<03:16,  2.73it/s, acc_step=1/1, ce_loss_token=1.8658, perplexity_token=6.4608]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:08<03:25,  2.60it/s, acc_step=1/1, ce_loss_token=1.8657, perplexity_token=6.4603]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:09<03:21,  2.65it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4599]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:09<03:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4596]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:10<03:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4593]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:10<03:07,  2.83it/s, acc_step=1/1, ce_loss_token=1.8657, perplexity_token=6.4603]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:10<03:11,  2.77it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4600]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:11<03:20,  2.63it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4596]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:11<03:22,  2.61it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4591]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:11<03:20,  2.62it/s, acc_step=1/1, ce_loss_token=1.8654, perplexity_token=6.4586]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:12<03:16,  2.67it/s, acc_step=1/1, ce_loss_token=1.8654, perplexity_token=6.4582]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:12<03:02,  2.88it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4592]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:12<02:58,  2.94it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4601]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:13<03:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.8656, perplexity_token=6.4597]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:13<03:10,  2.74it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4593]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:14<03:09,  2.75it/s, acc_step=1/1, ce_loss_token=1.8655, perplexity_token=6.4589]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:14<03:11,  2.71it/s, acc_step=1/1, ce_loss_token=1.8654, perplexity_token=6.4585]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:14<03:09,  2.74it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4581]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:15<03:25,  2.51it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4579]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:15<03:22,  2.55it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4575]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:15<03:15,  2.63it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4570]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:16<03:15,  2.63it/s, acc_step=1/1, ce_loss_token=1.8651, perplexity_token=6.4566]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:16<03:03,  2.80it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4579]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:17<03:08,  2.72it/s, acc_step=1/1, ce_loss_token=1.8653, perplexity_token=6.4576]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:17<03:08,  2.71it/s, acc_step=1/1, ce_loss_token=1.8652, perplexity_token=6.4571]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:17<03:11,  2.66it/s, acc_step=1/1, ce_loss_token=1.8651, perplexity_token=6.4567]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:18<03:09,  2.69it/s, acc_step=1/1, ce_loss_token=1.8651, perplexity_token=6.4563]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:18<03:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.8650, perplexity_token=6.4560]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:18<03:11,  2.65it/s, acc_step=1/1, ce_loss_token=1.8650, perplexity_token=6.4556]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:19<03:18,  2.55it/s, acc_step=1/1, ce_loss_token=1.8649, perplexity_token=6.4551]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:19<03:08,  2.67it/s, acc_step=1/1, ce_loss_token=1.8648, perplexity_token=6.4545]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:20<03:22,  2.49it/s, acc_step=1/1, ce_loss_token=1.8647, perplexity_token=6.4539]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:20<03:31,  2.38it/s, acc_step=1/1, ce_loss_token=1.8646, perplexity_token=6.4533]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:21<03:32,  2.37it/s, acc_step=1/1, ce_loss_token=1.8645, perplexity_token=6.4528]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:21<03:24,  2.45it/s, acc_step=1/1, ce_loss_token=1.8644, perplexity_token=6.4523]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:21<03:16,  2.54it/s, acc_step=1/1, ce_loss_token=1.8644, perplexity_token=6.4519]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:22<03:14,  2.57it/s, acc_step=1/1, ce_loss_token=1.8643, perplexity_token=6.4515]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:22<03:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.8642, perplexity_token=6.4510]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:23<03:33,  2.33it/s, acc_step=1/1, ce_loss_token=1.8642, perplexity_token=6.4506]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:23<03:22,  2.44it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4502]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:23<03:18,  2.49it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4498]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:24<03:18,  2.49it/s, acc_step=1/1, ce_loss_token=1.8640, perplexity_token=6.4495]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:24<02:56,  2.80it/s, acc_step=1/1, ce_loss_token=1.8642, perplexity_token=6.4505]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:24<02:58,  2.75it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4501]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:25<02:58,  2.75it/s, acc_step=1/1, ce_loss_token=1.8640, perplexity_token=6.4497]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:25<02:51,  2.85it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4504]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:25<02:54,  2.79it/s, acc_step=1/1, ce_loss_token=1.8641, perplexity_token=6.4501]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:26<03:04,  2.64it/s, acc_step=1/1, ce_loss_token=1.8640, perplexity_token=6.4496]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:26<03:10,  2.56it/s, acc_step=1/1, ce_loss_token=1.8640, perplexity_token=6.4492]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:27<03:10,  2.55it/s, acc_step=1/1, ce_loss_token=1.8639, perplexity_token=6.4487]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:27<03:04,  2.63it/s, acc_step=1/1, ce_loss_token=1.8638, perplexity_token=6.4483]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:27<03:03,  2.64it/s, acc_step=1/1, ce_loss_token=1.8638, perplexity_token=6.4479]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:28<02:59,  2.68it/s, acc_step=1/1, ce_loss_token=1.8637, perplexity_token=6.4475]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:28<02:56,  2.73it/s, acc_step=1/1, ce_loss_token=1.8636, perplexity_token=6.4471]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:29<03:01,  2.65it/s, acc_step=1/1, ce_loss_token=1.8636, perplexity_token=6.4466]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:29<02:58,  2.69it/s, acc_step=1/1, ce_loss_token=1.8635, perplexity_token=6.4464]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:29<02:54,  2.75it/s, acc_step=1/1, ce_loss_token=1.8634, perplexity_token=6.4458]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:30<02:55,  2.73it/s, acc_step=1/1, ce_loss_token=1.8634, perplexity_token=6.4454]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:30<02:53,  2.75it/s, acc_step=1/1, ce_loss_token=1.8633, perplexity_token=6.4451]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:30<03:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.8632, perplexity_token=6.4445]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:31<02:58,  2.67it/s, acc_step=1/1, ce_loss_token=1.8631, perplexity_token=6.4440]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:31<02:58,  2.65it/s, acc_step=1/1, ce_loss_token=1.8630, perplexity_token=6.4434]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:31<02:55,  2.69it/s, acc_step=1/1, ce_loss_token=1.8630, perplexity_token=6.4429]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:32<02:52,  2.74it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4424]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:32<02:49,  2.77it/s, acc_step=1/1, ce_loss_token=1.8629, perplexity_token=6.4421]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:33<02:53,  2.71it/s, acc_step=1/1, ce_loss_token=1.8628, perplexity_token=6.4417]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:33<02:51,  2.73it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4412]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:33<02:53,  2.70it/s, acc_step=1/1, ce_loss_token=1.8627, perplexity_token=6.4408]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:34<02:51,  2.72it/s, acc_step=1/1, ce_loss_token=1.8626, perplexity_token=6.4403]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:34<02:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4399]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:34<02:52,  2.70it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4394]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:35<02:47,  2.76it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4390]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:35<02:47,  2.77it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4385]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:35<02:39,  2.90it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4393]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:36<02:38,  2.90it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4390]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:36<02:43,  2.82it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4385]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:37<02:45,  2.78it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4382]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:37<02:45,  2.77it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4380]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:37<02:40,  2.84it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4376]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:38<02:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4372]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:38<02:30,  3.02it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4380]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:38<02:21,  3.20it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4389]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:39<02:33,  2.95it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4385]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:39<02:26,  3.08it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4394]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:39<02:30,  2.99it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4389]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:40<02:39,  2.82it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4385]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:40<02:44,  2.72it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4380]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:40<02:32,  2.93it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4388]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:41<02:27,  3.03it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4399]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:41<02:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.8625, perplexity_token=6.4395]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:41<02:58,  2.50it/s, acc_step=1/1, ce_loss_token=1.8624, perplexity_token=6.4390]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:42<02:50,  2.60it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4386]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:42<02:45,  2.68it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4382]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:43<02:42,  2.72it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:43<02:50,  2.59it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:43<02:41,  2.73it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4380]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:44<02:41,  2.72it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:44<02:40,  2.73it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:44<02:33,  2.85it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4384]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:45<02:39,  2.73it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4381]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:45<02:38,  2.74it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:45<02:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4372]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:46<02:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4366]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:46<02:36,  2.76it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4363]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:46<02:26,  2.95it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4371]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:47<02:26,  2.93it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4367]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:47<02:31,  2.83it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4362]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:48<02:31,  2.83it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4357]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:48<02:32,  2.80it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4353]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:48<02:24,  2.96it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4362]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:49<02:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4358]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:49<02:40,  2.64it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4354]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:49<02:34,  2.74it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4364]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:50<02:38,  2.67it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4360]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:50<02:59,  2.34it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4356]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:51<02:45,  2.54it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4365]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:51<02:43,  2.56it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4361]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:51<02:44,  2.54it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4357]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:52<02:40,  2.59it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4353]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:52<02:37,  2.64it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4350]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:53<02:34,  2.68it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4346]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:53<02:40,  2.57it/s, acc_step=1/1, ce_loss_token=1.8616, perplexity_token=6.4343]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:53<02:38,  2.60it/s, acc_step=1/1, ce_loss_token=1.8616, perplexity_token=6.4339]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:54<02:07,  3.23it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4381]

torch.Size([256, 294, 35]) torch.Size([256, 294])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:54<02:13,  3.07it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4378]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:55<02:17,  2.98it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4375]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:55<02:30,  2.72it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4372]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:55<02:25,  2.80it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4381]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:56<02:28,  2.74it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:56<02:30,  2.69it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:56<02:36,  2.58it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4369]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:57<02:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4379]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:57<02:33,  2.61it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4375]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:58<02:28,  2.71it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:58<02:27,  2.71it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4368]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:59<02:06,  3.13it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4386]

torch.Size([256, 287, 35]) torch.Size([256, 287])
torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:59<02:18,  2.86it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4382]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:59<02:22,  2.78it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [04:00<02:24,  2.73it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [04:00<02:23,  2.75it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4368]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [04:00<02:25,  2.71it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4364]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [04:01<02:16,  2.87it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4373]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [04:01<02:36,  2.50it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4369]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [04:02<02:39,  2.45it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4366]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [04:02<02:34,  2.52it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4363]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [04:02<02:29,  2.60it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4360]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:03<02:26,  2.65it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4357]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:03<02:26,  2.64it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4354]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:04<02:25,  2.64it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4351]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:04<02:25,  2.64it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4346]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:04<02:14,  2.84it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4353]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:05<02:16,  2.80it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4349]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:05<02:18,  2.75it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4346]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:05<02:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.8616, perplexity_token=6.4340]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:06<02:02,  3.09it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4381]

torch.Size([256, 294, 35]) torch.Size([256, 294])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:06<02:08,  2.95it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:07<02:11,  2.87it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4374]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:07<02:03,  3.05it/s, acc_step=1/1, ce_loss_token=1.8623, perplexity_token=6.4383]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:07<02:06,  2.97it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4380]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:08<02:11,  2.83it/s, acc_step=1/1, ce_loss_token=1.8622, perplexity_token=6.4377]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:08<02:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.8621, perplexity_token=6.4372]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:09<02:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4368]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:09<02:18,  2.67it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4365]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:09<02:28,  2.48it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4361]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:10<02:30,  2.45it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4359]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:10<02:18,  2.66it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4367]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:10<02:13,  2.74it/s, acc_step=1/1, ce_loss_token=1.8620, perplexity_token=6.4364]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:11<02:09,  2.81it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4361]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:11<02:09,  2.82it/s, acc_step=1/1, ce_loss_token=1.8619, perplexity_token=6.4358]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:12<02:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.8618, perplexity_token=6.4354]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:12<02:17,  2.64it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4349]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:12<02:14,  2.68it/s, acc_step=1/1, ce_loss_token=1.8617, perplexity_token=6.4345]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:13<02:10,  2.75it/s, acc_step=1/1, ce_loss_token=1.8616, perplexity_token=6.4339]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:13<02:15,  2.65it/s, acc_step=1/1, ce_loss_token=1.8615, perplexity_token=6.4335]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:13<02:15,  2.64it/s, acc_step=1/1, ce_loss_token=1.8615, perplexity_token=6.4332]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:14<02:14,  2.65it/s, acc_step=1/1, ce_loss_token=1.8614, perplexity_token=6.4328]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:14<02:14,  2.65it/s, acc_step=1/1, ce_loss_token=1.8614, perplexity_token=6.4326]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:15<02:19,  2.55it/s, acc_step=1/1, ce_loss_token=1.8613, perplexity_token=6.4322]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:15<02:19,  2.54it/s, acc_step=1/1, ce_loss_token=1.8613, perplexity_token=6.4319]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:15<02:16,  2.58it/s, acc_step=1/1, ce_loss_token=1.8612, perplexity_token=6.4314]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:16<02:20,  2.50it/s, acc_step=1/1, ce_loss_token=1.8611, perplexity_token=6.4310]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:16<02:19,  2.51it/s, acc_step=1/1, ce_loss_token=1.8611, perplexity_token=6.4307]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:17<02:16,  2.57it/s, acc_step=1/1, ce_loss_token=1.8610, perplexity_token=6.4302]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:17<02:14,  2.60it/s, acc_step=1/1, ce_loss_token=1.8609, perplexity_token=6.4298]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:17<02:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.8609, perplexity_token=6.4294]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:18<02:16,  2.55it/s, acc_step=1/1, ce_loss_token=1.8608, perplexity_token=6.4290]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:18<02:14,  2.58it/s, acc_step=1/1, ce_loss_token=1.8608, perplexity_token=6.4286]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:18<02:10,  2.64it/s, acc_step=1/1, ce_loss_token=1.8607, perplexity_token=6.4281]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:19<02:08,  2.68it/s, acc_step=1/1, ce_loss_token=1.8606, perplexity_token=6.4278]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:19<02:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.8606, perplexity_token=6.4275]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:20<02:07,  2.69it/s, acc_step=1/1, ce_loss_token=1.8605, perplexity_token=6.4271]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:20<02:00,  2.82it/s, acc_step=1/1, ce_loss_token=1.8606, perplexity_token=6.4278]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:20<02:00,  2.83it/s, acc_step=1/1, ce_loss_token=1.8606, perplexity_token=6.4273]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:21<02:02,  2.77it/s, acc_step=1/1, ce_loss_token=1.8605, perplexity_token=6.4270]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:21<02:09,  2.61it/s, acc_step=1/1, ce_loss_token=1.8605, perplexity_token=6.4266]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:21<02:14,  2.51it/s, acc_step=1/1, ce_loss_token=1.8604, perplexity_token=6.4263]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:22<02:03,  2.72it/s, acc_step=1/1, ce_loss_token=1.8605, perplexity_token=6.4267]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:22<02:03,  2.71it/s, acc_step=1/1, ce_loss_token=1.8604, perplexity_token=6.4263]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:23<02:01,  2.74it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4260]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:23<02:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4256]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:23<02:06,  2.62it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4250]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:24<02:00,  2.75it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4257]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:24<01:58,  2.78it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4253]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:24<01:50,  2.97it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4259]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:25<01:53,  2.89it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4255]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:25<01:54,  2.85it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4253]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:25<01:54,  2.86it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4249]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:26<02:00,  2.69it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4245]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:26<01:59,  2.71it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4241]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:26<01:53,  2.85it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4248]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:27<01:57,  2.75it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4244]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:27<01:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4240]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:28<01:59,  2.68it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4235]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:28<01:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.8599, perplexity_token=6.4232]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:28<01:48,  2.94it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4239]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:29<01:49,  2.91it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4236]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:29<01:51,  2.83it/s, acc_step=1/1, ce_loss_token=1.8599, perplexity_token=6.4233]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:29<01:54,  2.75it/s, acc_step=1/1, ce_loss_token=1.8599, perplexity_token=6.4230]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:30<01:47,  2.91it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4238]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:30<01:39,  3.13it/s, acc_step=1/1, ce_loss_token=1.8604, perplexity_token=6.4266]

torch.Size([256, 320, 35]) torch.Size([256, 320])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:31<01:44,  2.96it/s, acc_step=1/1, ce_loss_token=1.8604, perplexity_token=6.4262]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:31<02:03,  2.52it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4257]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:32<02:02,  2.53it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4253]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:32<02:06,  2.44it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4250]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:32<02:03,  2.48it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4246]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:33<01:51,  2.75it/s, acc_step=1/1, ce_loss_token=1.8603, perplexity_token=6.4254]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:33<01:50,  2.75it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4250]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:33<01:54,  2.65it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4246]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:34<01:49,  2.77it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4251]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:34<02:00,  2.50it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4247]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:35<01:58,  2.55it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4244]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:35<01:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.8602, perplexity_token=6.4250]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:35<01:47,  2.77it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4246]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:36<01:52,  2.66it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4243]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:36<01:52,  2.63it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4240]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:36<01:49,  2.71it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4237]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:37<01:41,  2.90it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4244]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:37<01:43,  2.83it/s, acc_step=1/1, ce_loss_token=1.8601, perplexity_token=6.4241]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:37<01:44,  2.80it/s, acc_step=1/1, ce_loss_token=1.8600, perplexity_token=6.4236]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:38<01:50,  2.65it/s, acc_step=1/1, ce_loss_token=1.8599, perplexity_token=6.4233]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:38<01:47,  2.72it/s, acc_step=1/1, ce_loss_token=1.8599, perplexity_token=6.4228]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:39<01:46,  2.72it/s, acc_step=1/1, ce_loss_token=1.8598, perplexity_token=6.4224]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:39<01:49,  2.65it/s, acc_step=1/1, ce_loss_token=1.8597, perplexity_token=6.4220]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:39<01:52,  2.56it/s, acc_step=1/1, ce_loss_token=1.8597, perplexity_token=6.4217]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:40<01:49,  2.63it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4214]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:40<02:10,  2.19it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4211]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:41<01:58,  2.40it/s, acc_step=1/1, ce_loss_token=1.8597, perplexity_token=6.4217]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:41<01:55,  2.45it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4214]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:41<01:52,  2.51it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4211]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:42<01:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.8597, perplexity_token=6.4216]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:42<01:44,  2.70it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4214]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:43<01:50,  2.55it/s, acc_step=1/1, ce_loss_token=1.8596, perplexity_token=6.4210]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:43<01:51,  2.51it/s, acc_step=1/1, ce_loss_token=1.8595, perplexity_token=6.4207]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:43<01:48,  2.55it/s, acc_step=1/1, ce_loss_token=1.8595, perplexity_token=6.4204]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:44<01:47,  2.58it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4199]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:44<01:38,  2.81it/s, acc_step=1/1, ce_loss_token=1.8595, perplexity_token=6.4205]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:44<01:37,  2.82it/s, acc_step=1/1, ce_loss_token=1.8595, perplexity_token=6.4202]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:45<01:37,  2.82it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4199]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:45<01:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4195]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:45<01:36,  2.81it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4190]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:46<01:36,  2.80it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4186]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:46<01:36,  2.79it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4182]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:47<01:31,  2.93it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4189]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:47<01:38,  2.73it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4186]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:47<01:40,  2.67it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4183]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:48<01:44,  2.55it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4180]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:48<01:28,  2.98it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4192]

torch.Size([256, 299, 35]) torch.Size([256, 299])
torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:49<01:33,  2.81it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4188]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:49<01:35,  2.73it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4184]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:49<01:29,  2.92it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4190]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:50<01:23,  3.12it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4197]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:50<01:25,  3.04it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4194]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:50<01:27,  2.94it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4191]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:51<01:24,  3.05it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4197]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:51<01:20,  3.18it/s, acc_step=1/1, ce_loss_token=1.8595, perplexity_token=6.4203]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:51<01:22,  3.08it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4199]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:52<01:25,  2.98it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4196]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:52<01:22,  3.07it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4201]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:52<01:24,  3.00it/s, acc_step=1/1, ce_loss_token=1.8594, perplexity_token=6.4197]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:53<01:27,  2.88it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4194]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:53<01:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.8593, perplexity_token=6.4191]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:54<01:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4187]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:54<01:32,  2.70it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4184]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:54<01:31,  2.69it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4180]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:55<01:24,  2.90it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4186]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:55<01:32,  2.65it/s, acc_step=1/1, ce_loss_token=1.8592, perplexity_token=6.4183]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:55<01:36,  2.54it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4178]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:56<01:34,  2.57it/s, acc_step=1/1, ce_loss_token=1.8590, perplexity_token=6.4175]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:56<01:34,  2.56it/s, acc_step=1/1, ce_loss_token=1.8590, perplexity_token=6.4172]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:57<01:27,  2.76it/s, acc_step=1/1, ce_loss_token=1.8591, perplexity_token=6.4177]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:57<01:27,  2.75it/s, acc_step=1/1, ce_loss_token=1.8590, perplexity_token=6.4173]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:57<01:26,  2.78it/s, acc_step=1/1, ce_loss_token=1.8590, perplexity_token=6.4170]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:58<01:26,  2.76it/s, acc_step=1/1, ce_loss_token=1.8589, perplexity_token=6.4166]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:58<01:24,  2.80it/s, acc_step=1/1, ce_loss_token=1.8588, perplexity_token=6.4163]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:58<01:33,  2.52it/s, acc_step=1/1, ce_loss_token=1.8588, perplexity_token=6.4161]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:59<01:31,  2.57it/s, acc_step=1/1, ce_loss_token=1.8589, perplexity_token=6.4166]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:59<01:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.8588, perplexity_token=6.4163]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [05:00<01:26,  2.68it/s, acc_step=1/1, ce_loss_token=1.8588, perplexity_token=6.4159]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [05:00<01:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.8587, perplexity_token=6.4156]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [05:00<01:26,  2.66it/s, acc_step=1/1, ce_loss_token=1.8587, perplexity_token=6.4153]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [05:01<01:28,  2.59it/s, acc_step=1/1, ce_loss_token=1.8586, perplexity_token=6.4149]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [05:01<01:28,  2.58it/s, acc_step=1/1, ce_loss_token=1.8586, perplexity_token=6.4146]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [05:01<01:26,  2.63it/s, acc_step=1/1, ce_loss_token=1.8585, perplexity_token=6.4143]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [05:02<01:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.8584, perplexity_token=6.4138]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [05:02<01:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.8584, perplexity_token=6.4134]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:02<01:21,  2.75it/s, acc_step=1/1, ce_loss_token=1.8583, perplexity_token=6.4129]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:03<01:22,  2.72it/s, acc_step=1/1, ce_loss_token=1.8583, perplexity_token=6.4127]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:03<01:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.8582, perplexity_token=6.4124]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:04<01:26,  2.56it/s, acc_step=1/1, ce_loss_token=1.8582, perplexity_token=6.4121]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:04<01:26,  2.55it/s, acc_step=1/1, ce_loss_token=1.8581, perplexity_token=6.4117]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:04<01:18,  2.80it/s, acc_step=1/1, ce_loss_token=1.8582, perplexity_token=6.4121]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:05<01:12,  3.00it/s, acc_step=1/1, ce_loss_token=1.8583, perplexity_token=6.4126]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:05<01:13,  2.98it/s, acc_step=1/1, ce_loss_token=1.8582, perplexity_token=6.4123]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:05<01:15,  2.89it/s, acc_step=1/1, ce_loss_token=1.8582, perplexity_token=6.4120]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:06<01:17,  2.80it/s, acc_step=1/1, ce_loss_token=1.8581, perplexity_token=6.4117]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:06<01:20,  2.69it/s, acc_step=1/1, ce_loss_token=1.8581, perplexity_token=6.4112]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:07<01:19,  2.70it/s, acc_step=1/1, ce_loss_token=1.8580, perplexity_token=6.4109]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:07<01:21,  2.62it/s, acc_step=1/1, ce_loss_token=1.8579, perplexity_token=6.4105]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:07<01:20,  2.63it/s, acc_step=1/1, ce_loss_token=1.8579, perplexity_token=6.4101]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:08<01:20,  2.61it/s, acc_step=1/1, ce_loss_token=1.8578, perplexity_token=6.4097]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:08<01:20,  2.59it/s, acc_step=1/1, ce_loss_token=1.8578, perplexity_token=6.4094]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:08<01:19,  2.63it/s, acc_step=1/1, ce_loss_token=1.8577, perplexity_token=6.4091]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:09<01:17,  2.67it/s, acc_step=1/1, ce_loss_token=1.8577, perplexity_token=6.4088]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:09<01:18,  2.63it/s, acc_step=1/1, ce_loss_token=1.8576, perplexity_token=6.4084]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:10<01:17,  2.64it/s, acc_step=1/1, ce_loss_token=1.8576, perplexity_token=6.4080]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:10<01:16,  2.69it/s, acc_step=1/1, ce_loss_token=1.8575, perplexity_token=6.4075]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:10<01:16,  2.68it/s, acc_step=1/1, ce_loss_token=1.8574, perplexity_token=6.4071]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:11<01:09,  2.91it/s, acc_step=1/1, ce_loss_token=1.8575, perplexity_token=6.4078]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:11<01:05,  3.06it/s, acc_step=1/1, ce_loss_token=1.8576, perplexity_token=6.4083]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:11<01:07,  2.98it/s, acc_step=1/1, ce_loss_token=1.8575, perplexity_token=6.4078]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:12<01:11,  2.79it/s, acc_step=1/1, ce_loss_token=1.8575, perplexity_token=6.4074]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:12<01:13,  2.71it/s, acc_step=1/1, ce_loss_token=1.8574, perplexity_token=6.4070]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:12<01:12,  2.73it/s, acc_step=1/1, ce_loss_token=1.8573, perplexity_token=6.4067]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:13<01:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.8573, perplexity_token=6.4063]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:13<01:14,  2.64it/s, acc_step=1/1, ce_loss_token=1.8572, perplexity_token=6.4059]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:14<01:11,  2.72it/s, acc_step=1/1, ce_loss_token=1.8572, perplexity_token=6.4056]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:14<01:10,  2.76it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4054]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:14<01:11,  2.69it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4050]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:15<01:05,  2.93it/s, acc_step=1/1, ce_loss_token=1.8572, perplexity_token=6.4056]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:15<01:07,  2.83it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4053]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:15<01:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4049]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:16<01:03,  2.96it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4053]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:16<01:00,  3.13it/s, acc_step=1/1, ce_loss_token=1.8572, perplexity_token=6.4061]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:16<01:01,  3.03it/s, acc_step=1/1, ce_loss_token=1.8572, perplexity_token=6.4057]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:17<01:05,  2.85it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4054]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:17<01:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.8571, perplexity_token=6.4050]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:17<01:07,  2.73it/s, acc_step=1/1, ce_loss_token=1.8570, perplexity_token=6.4047]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:18<01:06,  2.74it/s, acc_step=1/1, ce_loss_token=1.8570, perplexity_token=6.4043]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:18<01:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.8569, perplexity_token=6.4040]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:19<01:08,  2.64it/s, acc_step=1/1, ce_loss_token=1.8569, perplexity_token=6.4036]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:19<01:18,  2.29it/s, acc_step=1/1, ce_loss_token=1.8568, perplexity_token=6.4033]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:19<01:15,  2.38it/s, acc_step=1/1, ce_loss_token=1.8568, perplexity_token=6.4030]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:20<01:12,  2.45it/s, acc_step=1/1, ce_loss_token=1.8567, perplexity_token=6.4027]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:20<01:09,  2.53it/s, acc_step=1/1, ce_loss_token=1.8567, perplexity_token=6.4023]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:21<01:08,  2.56it/s, acc_step=1/1, ce_loss_token=1.8566, perplexity_token=6.4020]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:21<01:07,  2.59it/s, acc_step=1/1, ce_loss_token=1.8565, perplexity_token=6.4015]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:21<01:02,  2.78it/s, acc_step=1/1, ce_loss_token=1.8566, perplexity_token=6.4020]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:22<01:03,  2.71it/s, acc_step=1/1, ce_loss_token=1.8566, perplexity_token=6.4017]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:22<01:06,  2.60it/s, acc_step=1/1, ce_loss_token=1.8565, perplexity_token=6.4014]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:22<01:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.8566, perplexity_token=6.4017]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:23<01:02,  2.72it/s, acc_step=1/1, ce_loss_token=1.8565, perplexity_token=6.4015]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:23<01:02,  2.70it/s, acc_step=1/1, ce_loss_token=1.8565, perplexity_token=6.4012]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:23<01:01,  2.74it/s, acc_step=1/1, ce_loss_token=1.8564, perplexity_token=6.4009]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:24<01:01,  2.71it/s, acc_step=1/1, ce_loss_token=1.8564, perplexity_token=6.4005]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:24<01:01,  2.69it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.4002]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:25<01:02,  2.63it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3998]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:25<01:02,  2.62it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3995]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:25<01:01,  2.65it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3991]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:26<01:01,  2.62it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3987]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:26<00:58,  2.75it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3992]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:27<01:02,  2.54it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3989]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:27<01:01,  2.59it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3986]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:27<00:59,  2.64it/s, acc_step=1/1, ce_loss_token=1.8560, perplexity_token=6.3982]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:28<00:56,  2.78it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3987]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:28<00:53,  2.93it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3991]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:28<00:55,  2.82it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3989]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:29<00:55,  2.80it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3986]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:29<00:48,  3.14it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3997]

torch.Size([256, 314, 35]) torch.Size([256, 314])
torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:30<00:46,  3.24it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.4002]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:30<00:48,  3.09it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3998]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:30<00:49,  2.98it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3995]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:31<00:47,  3.12it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3999]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:31<00:46,  3.15it/s, acc_step=1/1, ce_loss_token=1.8564, perplexity_token=6.4004]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:31<00:47,  3.06it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.4000]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:32<00:49,  2.91it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3997]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:32<00:47,  3.05it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.4000]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:32<00:48,  2.93it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.3998]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:33<00:50,  2.81it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3994]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:33<00:42,  3.26it/s, acc_step=1/1, ce_loss_token=1.8564, perplexity_token=6.4008]

torch.Size([256, 286, 35]) torch.Size([256, 286])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:34<00:44,  3.14it/s, acc_step=1/1, ce_loss_token=1.8564, perplexity_token=6.4004]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:34<00:46,  2.99it/s, acc_step=1/1, ce_loss_token=1.8563, perplexity_token=6.4001]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:34<00:50,  2.74it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3996]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:35<00:50,  2.71it/s, acc_step=1/1, ce_loss_token=1.8562, perplexity_token=6.3992]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:35<00:49,  2.74it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3989]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:36<00:59,  2.26it/s, acc_step=1/1, ce_loss_token=1.8561, perplexity_token=6.3986]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:36<00:55,  2.40it/s, acc_step=1/1, ce_loss_token=1.8560, perplexity_token=6.3983]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:36<00:53,  2.47it/s, acc_step=1/1, ce_loss_token=1.8560, perplexity_token=6.3979]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:37<00:51,  2.54it/s, acc_step=1/1, ce_loss_token=1.8559, perplexity_token=6.3976]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:37<00:52,  2.46it/s, acc_step=1/1, ce_loss_token=1.8559, perplexity_token=6.3972]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:38<00:50,  2.53it/s, acc_step=1/1, ce_loss_token=1.8558, perplexity_token=6.3969]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:38<00:50,  2.55it/s, acc_step=1/1, ce_loss_token=1.8557, perplexity_token=6.3965]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:38<00:49,  2.58it/s, acc_step=1/1, ce_loss_token=1.8557, perplexity_token=6.3961]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:39<00:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.8556, perplexity_token=6.3956]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:39<00:45,  2.72it/s, acc_step=1/1, ce_loss_token=1.8556, perplexity_token=6.3953]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:39<00:44,  2.76it/s, acc_step=1/1, ce_loss_token=1.8555, perplexity_token=6.3949]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:40<00:44,  2.74it/s, acc_step=1/1, ce_loss_token=1.8555, perplexity_token=6.3946]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:40<00:55,  2.19it/s, acc_step=1/1, ce_loss_token=1.8554, perplexity_token=6.3943]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:41<00:54,  2.23it/s, acc_step=1/1, ce_loss_token=1.8554, perplexity_token=6.3940]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:41<00:50,  2.36it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3937]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:42<00:49,  2.42it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3933]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:42<00:47,  2.47it/s, acc_step=1/1, ce_loss_token=1.8554, perplexity_token=6.3940]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:42<00:45,  2.56it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3936]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:43<00:43,  2.64it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3934]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:43<00:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.8552, perplexity_token=6.3931]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:43<00:41,  2.77it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3937]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:44<00:41,  2.70it/s, acc_step=1/1, ce_loss_token=1.8553, perplexity_token=6.3935]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:44<00:40,  2.74it/s, acc_step=1/1, ce_loss_token=1.8552, perplexity_token=6.3931]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:45<00:41,  2.68it/s, acc_step=1/1, ce_loss_token=1.8552, perplexity_token=6.3929]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:45<00:42,  2.59it/s, acc_step=1/1, ce_loss_token=1.8551, perplexity_token=6.3925]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:45<00:42,  2.59it/s, acc_step=1/1, ce_loss_token=1.8551, perplexity_token=6.3921]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:46<00:43,  2.48it/s, acc_step=1/1, ce_loss_token=1.8550, perplexity_token=6.3918]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:46<00:42,  2.53it/s, acc_step=1/1, ce_loss_token=1.8550, perplexity_token=6.3914]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:47<00:41,  2.53it/s, acc_step=1/1, ce_loss_token=1.8549, perplexity_token=6.3911]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:47<00:41,  2.56it/s, acc_step=1/1, ce_loss_token=1.8549, perplexity_token=6.3908]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:47<00:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.8548, perplexity_token=6.3904]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:48<00:38,  2.68it/s, acc_step=1/1, ce_loss_token=1.8548, perplexity_token=6.3902]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:48<00:37,  2.71it/s, acc_step=1/1, ce_loss_token=1.8547, perplexity_token=6.3897]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:48<00:38,  2.62it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3893]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:49<00:39,  2.55it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3890]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:49<00:41,  2.40it/s, acc_step=1/1, ce_loss_token=1.8545, perplexity_token=6.3887]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:50<00:40,  2.40it/s, acc_step=1/1, ce_loss_token=1.8545, perplexity_token=6.3885]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:50<00:38,  2.50it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3889]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:50<00:34,  2.77it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3894]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:51<00:34,  2.73it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3889]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:51<00:31,  2.94it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3893]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:51<00:30,  3.01it/s, acc_step=1/1, ce_loss_token=1.8547, perplexity_token=6.3898]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:52<00:31,  2.91it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3894]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:52<00:32,  2.81it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3891]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:53<00:32,  2.73it/s, acc_step=1/1, ce_loss_token=1.8546, perplexity_token=6.3888]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:53<00:32,  2.78it/s, acc_step=1/1, ce_loss_token=1.8545, perplexity_token=6.3885]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:53<00:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.8545, perplexity_token=6.3883]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:54<00:31,  2.79it/s, acc_step=1/1, ce_loss_token=1.8544, perplexity_token=6.3880]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:54<00:31,  2.73it/s, acc_step=1/1, ce_loss_token=1.8544, perplexity_token=6.3876]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:54<00:31,  2.73it/s, acc_step=1/1, ce_loss_token=1.8543, perplexity_token=6.3873]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:55<00:31,  2.70it/s, acc_step=1/1, ce_loss_token=1.8543, perplexity_token=6.3870]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:55<00:30,  2.73it/s, acc_step=1/1, ce_loss_token=1.8542, perplexity_token=6.3867]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:55<00:29,  2.75it/s, acc_step=1/1, ce_loss_token=1.8542, perplexity_token=6.3863]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:56<00:29,  2.73it/s, acc_step=1/1, ce_loss_token=1.8541, perplexity_token=6.3860]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:56<00:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.8541, perplexity_token=6.3857]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:57<00:30,  2.61it/s, acc_step=1/1, ce_loss_token=1.8540, perplexity_token=6.3854]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:57<00:29,  2.66it/s, acc_step=1/1, ce_loss_token=1.8540, perplexity_token=6.3851]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:57<00:29,  2.65it/s, acc_step=1/1, ce_loss_token=1.8539, perplexity_token=6.3848]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:58<00:28,  2.70it/s, acc_step=1/1, ce_loss_token=1.8539, perplexity_token=6.3844]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:58<00:28,  2.67it/s, acc_step=1/1, ce_loss_token=1.8538, perplexity_token=6.3841]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:58<00:28,  2.59it/s, acc_step=1/1, ce_loss_token=1.8538, perplexity_token=6.3837]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [05:59<00:28,  2.60it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3835]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [05:59<00:28,  2.56it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3832]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [06:00<00:25,  2.75it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3836]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [06:00<00:25,  2.78it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3832]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [06:00<00:23,  2.95it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3837]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [06:01<00:23,  2.85it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3833]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [06:01<00:23,  2.83it/s, acc_step=1/1, ce_loss_token=1.8536, perplexity_token=6.3830]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [06:01<00:21,  3.03it/s, acc_step=1/1, ce_loss_token=1.8537, perplexity_token=6.3834]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:02<00:22,  2.84it/s, acc_step=1/1, ce_loss_token=1.8536, perplexity_token=6.3830]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:02<00:22,  2.88it/s, acc_step=1/1, ce_loss_token=1.8536, perplexity_token=6.3827]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:02<00:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.8535, perplexity_token=6.3824]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:03<00:21,  2.82it/s, acc_step=1/1, ce_loss_token=1.8535, perplexity_token=6.3821]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:03<00:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.8534, perplexity_token=6.3818]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:03<00:20,  2.95it/s, acc_step=1/1, ce_loss_token=1.8535, perplexity_token=6.3821]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:04<00:19,  2.95it/s, acc_step=1/1, ce_loss_token=1.8534, perplexity_token=6.3818]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:04<00:19,  2.92it/s, acc_step=1/1, ce_loss_token=1.8534, perplexity_token=6.3815]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:04<00:19,  2.85it/s, acc_step=1/1, ce_loss_token=1.8533, perplexity_token=6.3811]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:05<00:19,  2.88it/s, acc_step=1/1, ce_loss_token=1.8533, perplexity_token=6.3808]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:05<00:18,  2.91it/s, acc_step=1/1, ce_loss_token=1.8532, perplexity_token=6.3804]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:05<00:17,  3.06it/s, acc_step=1/1, ce_loss_token=1.8533, perplexity_token=6.3809]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:06<00:22,  2.36it/s, acc_step=1/1, ce_loss_token=1.8533, perplexity_token=6.3806]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:06<00:21,  2.42it/s, acc_step=1/1, ce_loss_token=1.8532, perplexity_token=6.3803]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:07<00:20,  2.51it/s, acc_step=1/1, ce_loss_token=1.8532, perplexity_token=6.3801]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:07<00:19,  2.62it/s, acc_step=1/1, ce_loss_token=1.8531, perplexity_token=6.3797]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:08<00:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.8531, perplexity_token=6.3793]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:08<00:17,  2.69it/s, acc_step=1/1, ce_loss_token=1.8530, perplexity_token=6.3790]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:08<00:17,  2.70it/s, acc_step=1/1, ce_loss_token=1.8530, perplexity_token=6.3787]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:09<00:17,  2.63it/s, acc_step=1/1, ce_loss_token=1.8529, perplexity_token=6.3784]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:09<00:16,  2.69it/s, acc_step=1/1, ce_loss_token=1.8529, perplexity_token=6.3781]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:09<00:16,  2.65it/s, acc_step=1/1, ce_loss_token=1.8528, perplexity_token=6.3778]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:10<00:16,  2.63it/s, acc_step=1/1, ce_loss_token=1.8528, perplexity_token=6.3774]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:10<00:16,  2.56it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3771]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:11<00:16,  2.55it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3768]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:11<00:14,  2.76it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3771]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:11<00:14,  2.78it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3768]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:12<00:13,  2.84it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3772]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:12<00:13,  2.71it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3770]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:12<00:13,  2.70it/s, acc_step=1/1, ce_loss_token=1.8527, perplexity_token=6.3767]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:13<00:13,  2.68it/s, acc_step=1/1, ce_loss_token=1.8526, perplexity_token=6.3764]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:13<00:14,  2.39it/s, acc_step=1/1, ce_loss_token=1.8526, perplexity_token=6.3761]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:14<00:13,  2.43it/s, acc_step=1/1, ce_loss_token=1.8525, perplexity_token=6.3759]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:14<00:13,  2.39it/s, acc_step=1/1, ce_loss_token=1.8525, perplexity_token=6.3757]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:15<00:14,  2.21it/s, acc_step=1/1, ce_loss_token=1.8524, perplexity_token=6.3754]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:15<00:14,  2.09it/s, acc_step=1/1, ce_loss_token=1.8524, perplexity_token=6.3750]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:15<00:12,  2.39it/s, acc_step=1/1, ce_loss_token=1.8524, perplexity_token=6.3753]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:16<00:11,  2.45it/s, acc_step=1/1, ce_loss_token=1.8524, perplexity_token=6.3750]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:16<00:10,  2.53it/s, acc_step=1/1, ce_loss_token=1.8523, perplexity_token=6.3747]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:17<00:10,  2.55it/s, acc_step=1/1, ce_loss_token=1.8523, perplexity_token=6.3744]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:17<00:09,  2.61it/s, acc_step=1/1, ce_loss_token=1.8522, perplexity_token=6.3741]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:17<00:09,  2.56it/s, acc_step=1/1, ce_loss_token=1.8522, perplexity_token=6.3739]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:18<00:09,  2.50it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3735]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:18<00:08,  2.47it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3732]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:19<00:08,  2.49it/s, acc_step=1/1, ce_loss_token=1.8520, perplexity_token=6.3729]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:19<00:07,  2.75it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3733]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:19<00:07,  2.67it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3731]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:20<00:06,  2.89it/s, acc_step=1/1, ce_loss_token=1.8522, perplexity_token=6.3737]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:20<00:05,  2.93it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3734]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:20<00:05,  2.89it/s, acc_step=1/1, ce_loss_token=1.8521, perplexity_token=6.3731]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:21<00:05,  2.77it/s, acc_step=1/1, ce_loss_token=1.8520, perplexity_token=6.3727]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:21<00:05,  2.73it/s, acc_step=1/1, ce_loss_token=1.8520, perplexity_token=6.3724]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:21<00:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3721]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:22<00:04,  2.59it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3717]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:22<00:04,  2.62it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3714]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:22<00:03,  2.84it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3719]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:23<00:02,  3.01it/s, acc_step=1/1, ce_loss_token=1.8520, perplexity_token=6.3723]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:23<00:02,  2.86it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3720]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:23<00:02,  2.81it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3717]

torch.Size([256, 267, 35]) torch.Size([256, 267])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:24<00:02,  2.91it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3713]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:24<00:01,  3.02it/s, acc_step=1/1, ce_loss_token=1.8519, perplexity_token=6.3716]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:25<00:01,  2.59it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3713]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:25<00:01,  2.52it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3710]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:25<00:00,  2.75it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3713]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]: 100%|██████████████████████████████████████████████| 1044/1044 [06:26<00:00,  2.99it/s, acc_step=1/1, ce_loss_token=1.8518, perplexity_token=6.3713]

torch.Size([170, 300, 35]) torch.Size([170, 300])



Generating with greedy search...

📊 Metrics (Epoch 5):
├── TRAIN:
│   ├── ce_loss_char: 1.8518
│   ├── ce_loss_token: 1.8518
│   ├── perplexity_char: 6.3713
│   └── perplexity_token: 6.3713
└── VAL:
    ├── ce_loss_char: 1.7117
    ├── ce_loss_token: 1.7117
    ├── perplexity_char: 5.5381
    └── perplexity_token: 5.5381
└── TRAINING:
    └── learning_rate: 0.000100
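
The `perplexity_*` values in the summary above look like the exponential of the matching `ce_loss_*` values, with the loss measured in nats. That is the standard definition of perplexity, but whether `hw4lib` computes it exactly this way is an assumption here. A minimal sketch, using a hypothetical helper named `perplexity` and the numbers copied from the Epoch 5 summary:

```python
import math


def perplexity(ce_loss: float) -> float:
    """Hypothetical helper: perplexity as exp of the mean cross-entropy (nats)."""
    return math.exp(ce_loss)


# Cross-entropy values copied from the Epoch 5 metrics summary.
train_ce = 1.8518
val_ce = 1.7117

train_ppl = perplexity(train_ce)
val_ppl = perplexity(val_ce)

# Compare against the logged perplexities (6.3713 and 5.5381).
print(f"train perplexity: {train_ppl:.2f}")
print(f"val perplexity:   {val_ppl:.2f}")
```

The results agree with the logged values to within the rounding of the four-decimal losses. The `_char` and `_token` metrics are identical in this run, which would be expected if the tokenizer is character-level (an assumption, consistent with the last dimension of 35 in the printed logit shapes).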


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:17,  2.09it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9729]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:02,  2.46it/s, acc_step=1/1, ce_loss_token=1.7930, perplexity_token=6.0075]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<07:07,  2.44it/s, acc_step=1/1, ce_loss_token=1.7934, perplexity_token=6.0099]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:17,  2.75it/s, acc_step=1/1, ce_loss_token=1.8338, perplexity_token=6.2579]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   0%|▏                                                | 5/1044 [00:01<06:28,  2.67it/s, acc_step=1/1, ce_loss_token=1.8310, perplexity_token=6.2404]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:46,  2.56it/s, acc_step=1/1, ce_loss_token=1.8263, perplexity_token=6.2107]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:38,  2.60it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1721]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<09:14,  1.87it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1599]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<07:50,  2.20it/s, acc_step=1/1, ce_loss_token=1.8320, perplexity_token=6.2462]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▍                                               | 10/1044 [00:04<06:53,  2.50it/s, acc_step=1/1, ce_loss_token=1.8435, perplexity_token=6.3188]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<06:56,  2.48it/s, acc_step=1/1, ce_loss_token=1.8392, perplexity_token=6.2917]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:52,  2.50it/s, acc_step=1/1, ce_loss_token=1.8368, perplexity_token=6.2765]

torch.Size([256, 265, 35]) torch.Size([256, 265])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:03,  2.83it/s, acc_step=1/1, ce_loss_token=1.8424, perplexity_token=6.3115]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<05:56,  2.89it/s, acc_step=1/1, ce_loss_token=1.8398, perplexity_token=6.2954]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:   1%|▋                                               | 15/1044 [00:06<06:45,  2.54it/s, acc_step=1/1, ce_loss_token=1.8381, perplexity_token=6.2845]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<06:51,  2.50it/s, acc_step=1/1, ce_loss_token=1.8373, perplexity_token=6.2798]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<07:03,  2.43it/s, acc_step=1/1, ce_loss_token=1.8354, perplexity_token=6.2674]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   2%|▊                                               | 18/1044 [00:07<07:01,  2.43it/s, acc_step=1/1, ce_loss_token=1.8329, perplexity_token=6.2520]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   2%|▊                                               | 19/1044 [00:07<06:44,  2.54it/s, acc_step=1/1, ce_loss_token=1.8315, perplexity_token=6.2430]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<06:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.8348, perplexity_token=6.2641]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|▉                                               | 21/1044 [00:08<05:43,  2.98it/s, acc_step=1/1, ce_loss_token=1.8390, perplexity_token=6.2902]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   2%|█                                               | 22/1044 [00:08<05:44,  2.96it/s, acc_step=1/1, ce_loss_token=1.8374, perplexity_token=6.2805]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   2%|█                                               | 23/1044 [00:08<06:05,  2.80it/s, acc_step=1/1, ce_loss_token=1.8354, perplexity_token=6.2678]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   2%|█                                               | 24/1044 [00:09<06:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.8341, perplexity_token=6.2592]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:30,  2.61it/s, acc_step=1/1, ce_loss_token=1.8325, perplexity_token=6.2497]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   2%|█▏                                              | 26/1044 [00:10<06:05,  2.78it/s, acc_step=1/1, ce_loss_token=1.8369, perplexity_token=6.2773]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:20,  2.68it/s, acc_step=1/1, ce_loss_token=1.8353, perplexity_token=6.2671]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:31,  2.60it/s, acc_step=1/1, ce_loss_token=1.8343, perplexity_token=6.2609]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   3%|█▎                                              | 29/1044 [00:11<06:30,  2.60it/s, acc_step=1/1, ce_loss_token=1.8332, perplexity_token=6.2537]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<05:59,  2.82it/s, acc_step=1/1, ce_loss_token=1.8359, perplexity_token=6.2706]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<05:55,  2.85it/s, acc_step=1/1, ce_loss_token=1.8344, perplexity_token=6.2613]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<05:52,  2.87it/s, acc_step=1/1, ce_loss_token=1.8329, perplexity_token=6.2522]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<05:58,  2.82it/s, acc_step=1/1, ce_loss_token=1.8320, perplexity_token=6.2464]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:   3%|█▌                                              | 34/1044 [00:13<06:24,  2.63it/s, acc_step=1/1, ce_loss_token=1.8307, perplexity_token=6.2381]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<06:35,  2.55it/s, acc_step=1/1, ce_loss_token=1.8294, perplexity_token=6.2305]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<06:03,  2.77it/s, acc_step=1/1, ce_loss_token=1.8324, perplexity_token=6.2487]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   4%|█▋                                              | 37/1044 [00:14<06:15,  2.68it/s, acc_step=1/1, ce_loss_token=1.8314, perplexity_token=6.2427]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:22,  2.63it/s, acc_step=1/1, ce_loss_token=1.8308, perplexity_token=6.2390]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   4%|█▊                                              | 39/1044 [00:14<06:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.8301, perplexity_token=6.2345]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<06:11,  2.70it/s, acc_step=1/1, ce_loss_token=1.8295, perplexity_token=6.2305]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:27,  2.59it/s, acc_step=1/1, ce_loss_token=1.8287, perplexity_token=6.2255]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   4%|█▉                                              | 42/1044 [00:16<06:24,  2.61it/s, acc_step=1/1, ce_loss_token=1.8283, perplexity_token=6.2230]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<05:51,  2.85it/s, acc_step=1/1, ce_loss_token=1.8301, perplexity_token=6.2342]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   4%|██                                              | 44/1044 [00:16<05:59,  2.78it/s, acc_step=1/1, ce_loss_token=1.8293, perplexity_token=6.2298]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   4%|██                                              | 45/1044 [00:17<06:05,  2.73it/s, acc_step=1/1, ce_loss_token=1.8286, perplexity_token=6.2250]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:25,  2.59it/s, acc_step=1/1, ce_loss_token=1.8280, perplexity_token=6.2214]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<06:29,  2.56it/s, acc_step=1/1, ce_loss_token=1.8278, perplexity_token=6.2200]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<06:24,  2.59it/s, acc_step=1/1, ce_loss_token=1.8271, perplexity_token=6.2161]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   5%|██▎                                             | 49/1044 [00:18<06:28,  2.56it/s, acc_step=1/1, ce_loss_token=1.8268, perplexity_token=6.2142]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   5%|██▎                                             | 50/1044 [00:19<06:21,  2.61it/s, acc_step=1/1, ce_loss_token=1.8260, perplexity_token=6.2089]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<06:14,  2.65it/s, acc_step=1/1, ce_loss_token=1.8254, perplexity_token=6.2050]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<06:17,  2.63it/s, acc_step=1/1, ce_loss_token=1.8251, perplexity_token=6.2032]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   5%|██▍                                             | 53/1044 [00:20<05:48,  2.84it/s, acc_step=1/1, ce_loss_token=1.8267, perplexity_token=6.2133]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<05:59,  2.75it/s, acc_step=1/1, ce_loss_token=1.8258, perplexity_token=6.2079]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<05:51,  2.81it/s, acc_step=1/1, ce_loss_token=1.8252, perplexity_token=6.2039]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   5%|██▌                                             | 56/1044 [00:21<06:00,  2.74it/s, acc_step=1/1, ce_loss_token=1.8246, perplexity_token=6.2004]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<05:54,  2.79it/s, acc_step=1/1, ce_loss_token=1.8243, perplexity_token=6.1984]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:58,  2.75it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1950]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|██▋                                             | 59/1044 [00:22<05:32,  2.96it/s, acc_step=1/1, ce_loss_token=1.8258, perplexity_token=6.2075]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:33,  2.95it/s, acc_step=1/1, ce_loss_token=1.8252, perplexity_token=6.2038]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<05:57,  2.75it/s, acc_step=1/1, ce_loss_token=1.8250, perplexity_token=6.2025]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<05:58,  2.74it/s, acc_step=1/1, ce_loss_token=1.8244, perplexity_token=6.1992]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<05:41,  2.87it/s, acc_step=1/1, ce_loss_token=1.8258, perplexity_token=6.2075]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:   6%|██▉                                             | 64/1044 [00:24<06:54,  2.36it/s, acc_step=1/1, ce_loss_token=1.8256, perplexity_token=6.2063]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   6%|██▉                                             | 65/1044 [00:24<06:47,  2.40it/s, acc_step=1/1, ce_loss_token=1.8251, perplexity_token=6.2034]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   6%|███                                             | 66/1044 [00:25<06:37,  2.46it/s, acc_step=1/1, ce_loss_token=1.8246, perplexity_token=6.2004]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   6%|███                                             | 67/1044 [00:25<06:24,  2.54it/s, acc_step=1/1, ce_loss_token=1.8241, perplexity_token=6.1974]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   7%|███▏                                            | 68/1044 [00:25<06:34,  2.47it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1946]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   7%|███▏                                            | 69/1044 [00:26<06:32,  2.49it/s, acc_step=1/1, ce_loss_token=1.8233, perplexity_token=6.1924]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<06:00,  2.70it/s, acc_step=1/1, ce_loss_token=1.8248, perplexity_token=6.2014]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<05:59,  2.70it/s, acc_step=1/1, ce_loss_token=1.8245, perplexity_token=6.1997]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   7%|███▎                                            | 72/1044 [00:27<06:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.8241, perplexity_token=6.1975]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<05:57,  2.71it/s, acc_step=1/1, ce_loss_token=1.8239, perplexity_token=6.1958]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▍                                            | 74/1044 [00:28<05:58,  2.71it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1946]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   7%|███▍                                            | 75/1044 [00:28<05:54,  2.74it/s, acc_step=1/1, ce_loss_token=1.8235, perplexity_token=6.1933]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<05:51,  2.75it/s, acc_step=1/1, ce_loss_token=1.8231, perplexity_token=6.1913]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   7%|███▌                                            | 77/1044 [00:29<06:00,  2.68it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1883]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<06:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1871]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<06:08,  2.62it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1846]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   8%|███▋                                            | 80/1044 [00:30<06:05,  2.64it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1827]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<05:36,  2.86it/s, acc_step=1/1, ce_loss_token=1.8232, perplexity_token=6.1916]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<05:32,  2.89it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1898]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<05:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1878]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<05:47,  2.76it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1862]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   8%|███▉                                            | 85/1044 [00:32<05:56,  2.69it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1837]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<05:53,  2.71it/s, acc_step=1/1, ce_loss_token=1.8217, perplexity_token=6.1822]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   8%|████                                            | 87/1044 [00:32<06:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1811]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   8%|████                                            | 88/1044 [00:33<06:08,  2.59it/s, acc_step=1/1, ce_loss_token=1.8213, perplexity_token=6.1800]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   9%|████                                            | 89/1044 [00:33<05:59,  2.65it/s, acc_step=1/1, ce_loss_token=1.8211, perplexity_token=6.1786]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   9%|████▏                                           | 90/1044 [00:33<06:06,  2.60it/s, acc_step=1/1, ce_loss_token=1.8209, perplexity_token=6.1776]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   9%|████▏                                           | 91/1044 [00:34<06:09,  2.58it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1757]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   9%|████▏                                           | 92/1044 [00:34<06:03,  2.62it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1718]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   9%|████▎                                           | 93/1044 [00:35<05:53,  2.69it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1701]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▎                                           | 94/1044 [00:35<05:31,  2.87it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1756]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<05:52,  2.69it/s, acc_step=1/1, ce_loss_token=1.8204, perplexity_token=6.1744]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   9%|████▍                                           | 96/1044 [00:36<06:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.8203, perplexity_token=6.1735]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:42,  2.77it/s, acc_step=1/1, ce_loss_token=1.8211, perplexity_token=6.1789]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<05:19,  2.96it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1845]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   9%|████▌                                           | 99/1044 [00:37<05:29,  2.86it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1834]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<05:32,  2.84it/s, acc_step=1/1, ce_loss_token=1.8216, perplexity_token=6.1815]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  10%|████▌                                          | 101/1044 [00:37<05:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.8214, perplexity_token=6.1805]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<05:43,  2.74it/s, acc_step=1/1, ce_loss_token=1.8210, perplexity_token=6.1782]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<05:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1764]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  10%|████▋                                          | 104/1044 [00:39<05:52,  2.67it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1749]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<06:02,  2.59it/s, acc_step=1/1, ce_loss_token=1.8203, perplexity_token=6.1737]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<06:14,  2.50it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1722]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  10%|████▊                                          | 107/1044 [00:40<05:43,  2.73it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1761]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<05:59,  2.60it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1748]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  10%|████▉                                          | 109/1044 [00:40<05:51,  2.66it/s, acc_step=1/1, ce_loss_token=1.8204, perplexity_token=6.1745]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  11%|████▉                                          | 110/1044 [00:41<05:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1728]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:24,  2.87it/s, acc_step=1/1, ce_loss_token=1.8209, perplexity_token=6.1777]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  11%|█████                                          | 112/1044 [00:41<05:25,  2.86it/s, acc_step=1/1, ce_loss_token=1.8208, perplexity_token=6.1769]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<05:39,  2.75it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1759]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<05:14,  2.95it/s, acc_step=1/1, ce_loss_token=1.8212, perplexity_token=6.1794]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:42<04:58,  3.11it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1844]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<05:30,  2.81it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1828]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:43<05:33,  2.78it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1814]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<05:39,  2.73it/s, acc_step=1/1, ce_loss_token=1.8213, perplexity_token=6.1796]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:23,  2.86it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1846]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:44<04:30,  3.42it/s, acc_step=1/1, ce_loss_token=1.8250, perplexity_token=6.2027]

torch.Size([256, 322, 35]) torch.Size([256, 322])
torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:09,  2.98it/s, acc_step=1/1, ce_loss_token=1.8246, perplexity_token=6.2000]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:45<05:16,  2.91it/s, acc_step=1/1, ce_loss_token=1.8243, perplexity_token=6.1987]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:10,  2.96it/s, acc_step=1/1, ce_loss_token=1.8249, perplexity_token=6.2024]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:21,  2.86it/s, acc_step=1/1, ce_loss_token=1.8247, perplexity_token=6.2008]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:46<05:33,  2.75it/s, acc_step=1/1, ce_loss_token=1.8244, perplexity_token=6.1994]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:12,  2.93it/s, acc_step=1/1, ce_loss_token=1.8249, perplexity_token=6.2023]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:47<05:23,  2.83it/s, acc_step=1/1, ce_loss_token=1.8248, perplexity_token=6.2014]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:47<05:05,  2.99it/s, acc_step=1/1, ce_loss_token=1.8254, perplexity_token=6.2055]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<05:09,  2.95it/s, acc_step=1/1, ce_loss_token=1.8252, perplexity_token=6.2041]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:28,  2.78it/s, acc_step=1/1, ce_loss_token=1.8250, perplexity_token=6.2031]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:48<05:23,  2.82it/s, acc_step=1/1, ce_loss_token=1.8249, perplexity_token=6.2023]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:22,  2.82it/s, acc_step=1/1, ce_loss_token=1.8247, perplexity_token=6.2010]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:31,  2.74it/s, acc_step=1/1, ce_loss_token=1.8245, perplexity_token=6.1995]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  13%|██████                                         | 135/1044 [00:49<05:23,  2.81it/s, acc_step=1/1, ce_loss_token=1.8243, perplexity_token=6.1982]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<05:26,  2.78it/s, acc_step=1/1, ce_loss_token=1.8240, perplexity_token=6.1967]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:50<05:32,  2.73it/s, acc_step=1/1, ce_loss_token=1.8238, perplexity_token=6.1954]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:50<05:04,  2.98it/s, acc_step=1/1, ce_loss_token=1.8247, perplexity_token=6.2008]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<05:09,  2.92it/s, acc_step=1/1, ce_loss_token=1.8244, perplexity_token=6.1992]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:51<05:20,  2.82it/s, acc_step=1/1, ce_loss_token=1.8242, perplexity_token=6.1980]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:41,  2.64it/s, acc_step=1/1, ce_loss_token=1.8240, perplexity_token=6.1969]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:34,  2.69it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1950]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:52<05:29,  2.74it/s, acc_step=1/1, ce_loss_token=1.8244, perplexity_token=6.1990]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:22,  2.79it/s, acc_step=1/1, ce_loss_token=1.8241, perplexity_token=6.1974]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<05:25,  2.76it/s, acc_step=1/1, ce_loss_token=1.8240, perplexity_token=6.1964]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:53<05:24,  2.76it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1949]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<05:24,  2.76it/s, acc_step=1/1, ce_loss_token=1.8235, perplexity_token=6.1935]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:54<05:04,  2.94it/s, acc_step=1/1, ce_loss_token=1.8241, perplexity_token=6.1970]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:54<05:03,  2.95it/s, acc_step=1/1, ce_loss_token=1.8239, perplexity_token=6.1962]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:08,  2.89it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1949]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:55<05:16,  2.82it/s, acc_step=1/1, ce_loss_token=1.8234, perplexity_token=6.1928]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<05:16,  2.82it/s, acc_step=1/1, ce_loss_token=1.8233, perplexity_token=6.1924]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:56<04:58,  2.99it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1949]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:56<05:00,  2.96it/s, acc_step=1/1, ce_loss_token=1.8236, perplexity_token=6.1938]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:56<04:56,  3.00it/s, acc_step=1/1, ce_loss_token=1.8233, perplexity_token=6.1923]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  15%|███████                                        | 156/1044 [00:57<05:11,  2.85it/s, acc_step=1/1, ce_loss_token=1.8232, perplexity_token=6.1915]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  15%|███████                                        | 157/1044 [00:57<05:17,  2.80it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1900]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  15%|███████                                        | 158/1044 [00:58<05:21,  2.76it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1887]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:58<05:24,  2.73it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1874]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:58<05:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1862]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  15%|███████▏                                       | 161/1044 [00:59<05:33,  2.65it/s, acc_step=1/1, ce_loss_token=1.8222, perplexity_token=6.1852]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  16%|███████▎                                       | 162/1044 [00:59<05:24,  2.71it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1895]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:00<04:32,  3.23it/s, acc_step=1/1, ce_loss_token=1.8257, perplexity_token=6.2071]

torch.Size([256, 322, 35]) torch.Size([256, 322])
torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:00<04:36,  3.17it/s, acc_step=1/1, ce_loss_token=1.8256, perplexity_token=6.2063]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:00<04:45,  3.07it/s, acc_step=1/1, ce_loss_token=1.8254, perplexity_token=6.2052]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:01<04:35,  3.18it/s, acc_step=1/1, ce_loss_token=1.8258, perplexity_token=6.2079]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:01<04:49,  3.02it/s, acc_step=1/1, ce_loss_token=1.8256, perplexity_token=6.2066]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:01<04:56,  2.95it/s, acc_step=1/1, ce_loss_token=1.8254, perplexity_token=6.2056]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:02<05:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.8253, perplexity_token=6.2046]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:02<05:34,  2.61it/s, acc_step=1/1, ce_loss_token=1.8251, perplexity_token=6.2036]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:03<05:31,  2.63it/s, acc_step=1/1, ce_loss_token=1.8249, perplexity_token=6.2022]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:03<05:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.8248, perplexity_token=6.2013]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:03<05:19,  2.72it/s, acc_step=1/1, ce_loss_token=1.8246, perplexity_token=6.2001]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:04<05:09,  2.81it/s, acc_step=1/1, ce_loss_token=1.8251, perplexity_token=6.2032]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:04<05:12,  2.78it/s, acc_step=1/1, ce_loss_token=1.8249, perplexity_token=6.2021]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:04<05:14,  2.76it/s, acc_step=1/1, ce_loss_token=1.8247, perplexity_token=6.2012]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  17%|████████                                       | 178/1044 [01:05<05:29,  2.63it/s, acc_step=1/1, ce_loss_token=1.8246, perplexity_token=6.2001]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  17%|████████                                       | 179/1044 [01:05<05:23,  2.68it/s, acc_step=1/1, ce_loss_token=1.8244, perplexity_token=6.1989]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  17%|████████                                       | 180/1044 [01:05<05:27,  2.63it/s, acc_step=1/1, ce_loss_token=1.8243, perplexity_token=6.1982]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:06<05:36,  2.57it/s, acc_step=1/1, ce_loss_token=1.8240, perplexity_token=6.1968]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:06<05:24,  2.66it/s, acc_step=1/1, ce_loss_token=1.8239, perplexity_token=6.1957]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:07<05:26,  2.63it/s, acc_step=1/1, ce_loss_token=1.8237, perplexity_token=6.1945]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:07<05:14,  2.73it/s, acc_step=1/1, ce_loss_token=1.8236, perplexity_token=6.1938]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:07<04:59,  2.87it/s, acc_step=1/1, ce_loss_token=1.8239, perplexity_token=6.1962]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:08<04:59,  2.86it/s, acc_step=1/1, ce_loss_token=1.8238, perplexity_token=6.1952]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:08<05:05,  2.81it/s, acc_step=1/1, ce_loss_token=1.8236, perplexity_token=6.1942]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:08<05:02,  2.83it/s, acc_step=1/1, ce_loss_token=1.8235, perplexity_token=6.1933]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:09<05:05,  2.80it/s, acc_step=1/1, ce_loss_token=1.8232, perplexity_token=6.1919]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:09<05:11,  2.74it/s, acc_step=1/1, ce_loss_token=1.8231, perplexity_token=6.1912]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:09<05:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1897]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:10<05:09,  2.76it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1885]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:10<05:09,  2.75it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1877]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:10<04:48,  2.94it/s, acc_step=1/1, ce_loss_token=1.8232, perplexity_token=6.1915]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:11<04:53,  2.90it/s, acc_step=1/1, ce_loss_token=1.8231, perplexity_token=6.1909]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:11<04:48,  2.94it/s, acc_step=1/1, ce_loss_token=1.8230, perplexity_token=6.1903]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:12<05:21,  2.64it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1893]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:12<05:21,  2.63it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1881]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:12<05:29,  2.56it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1872]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  19%|█████████                                      | 200/1044 [01:13<05:20,  2.63it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1863]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  19%|█████████                                      | 201/1044 [01:13<05:27,  2.57it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1860]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|█████████                                      | 202/1044 [01:14<05:23,  2.60it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1851]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:14<05:23,  2.60it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1844]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:14<05:22,  2.60it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1841]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:15<05:22,  2.60it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1833]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:15<05:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.8217, perplexity_token=6.1825]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:15<04:50,  2.88it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1849]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:16<04:58,  2.80it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1839]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:16<04:55,  2.82it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1836]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:16<05:08,  2.70it/s, acc_step=1/1, ce_loss_token=1.8217, perplexity_token=6.1827]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:17<05:02,  2.75it/s, acc_step=1/1, ce_loss_token=1.8216, perplexity_token=6.1819]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:17<04:40,  2.96it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1841]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:17<04:44,  2.92it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1832]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:18<04:34,  3.03it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1867]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:18<04:19,  3.20it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1900]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:18<04:30,  3.06it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1893]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:19<04:39,  2.95it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1880]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:19<04:42,  2.93it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1871]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:19<04:46,  2.88it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1864]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:20<04:30,  3.05it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1900]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:20<04:40,  2.93it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1890]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:21<04:50,  2.83it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1882]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  21%|██████████                                     | 223/1044 [01:21<04:50,  2.83it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1871]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  21%|██████████                                     | 224/1044 [01:21<04:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1865]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:22<04:52,  2.80it/s, acc_step=1/1, ce_loss_token=1.8222, perplexity_token=6.1858]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:22<04:48,  2.83it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1849]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:22<04:59,  2.73it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1840]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:23<04:40,  2.91it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1861]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:23<04:28,  3.04it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1881]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:23<04:17,  3.17it/s, acc_step=1/1, ce_loss_token=1.8230, perplexity_token=6.1905]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:24<05:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.8229, perplexity_token=6.1899]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:24<05:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1892]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:24<05:05,  2.65it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1881]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:25<04:56,  2.73it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1870]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:25<05:01,  2.68it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1866]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:26<05:02,  2.67it/s, acc_step=1/1, ce_loss_token=1.8222, perplexity_token=6.1857]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:26<05:02,  2.67it/s, acc_step=1/1, ce_loss_token=1.8222, perplexity_token=6.1854]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:26<04:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1844]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:27<04:55,  2.72it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1837]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:27<05:08,  2.61it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1829]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:28<05:10,  2.59it/s, acc_step=1/1, ce_loss_token=1.8216, perplexity_token=6.1818]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:28<05:31,  2.42it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1809]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:28<05:42,  2.34it/s, acc_step=1/1, ce_loss_token=1.8213, perplexity_token=6.1799]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:29<05:39,  2.36it/s, acc_step=1/1, ce_loss_token=1.8211, perplexity_token=6.1788]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  23%|███████████                                    | 245/1044 [01:29<05:23,  2.47it/s, acc_step=1/1, ce_loss_token=1.8210, perplexity_token=6.1777]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  24%|███████████                                    | 246/1044 [01:30<05:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.8214, perplexity_token=6.1804]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  24%|███████████                                    | 247/1044 [01:30<04:50,  2.75it/s, acc_step=1/1, ce_loss_token=1.8213, perplexity_token=6.1798]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:30<04:47,  2.77it/s, acc_step=1/1, ce_loss_token=1.8212, perplexity_token=6.1791]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:30<04:27,  2.98it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1808]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:31<04:35,  2.88it/s, acc_step=1/1, ce_loss_token=1.8214, perplexity_token=6.1804]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:31<04:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.8213, perplexity_token=6.1796]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:32<05:26,  2.43it/s, acc_step=1/1, ce_loss_token=1.8212, perplexity_token=6.1791]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:32<06:03,  2.18it/s, acc_step=1/1, ce_loss_token=1.8211, perplexity_token=6.1787]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:33<05:14,  2.51it/s, acc_step=1/1, ce_loss_token=1.8225, perplexity_token=6.1875]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:33<05:07,  2.57it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1870]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:33<05:15,  2.50it/s, acc_step=1/1, ce_loss_token=1.8223, perplexity_token=6.1864]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:34<04:45,  2.76it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1880]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:34<04:23,  2.98it/s, acc_step=1/1, ce_loss_token=1.8228, perplexity_token=6.1893]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:34<04:36,  2.83it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1888]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:35<04:38,  2.82it/s, acc_step=1/1, ce_loss_token=1.8227, perplexity_token=6.1884]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:35<04:31,  2.88it/s, acc_step=1/1, ce_loss_token=1.8226, perplexity_token=6.1878]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:35<04:43,  2.76it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1870]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:36<04:42,  2.76it/s, acc_step=1/1, ce_loss_token=1.8224, perplexity_token=6.1864]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:36<04:40,  2.78it/s, acc_step=1/1, ce_loss_token=1.8222, perplexity_token=6.1853]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:37<04:43,  2.75it/s, acc_step=1/1, ce_loss_token=1.8221, perplexity_token=6.1849]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:37<04:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.8220, perplexity_token=6.1841]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████                                   | 267/1044 [01:37<04:48,  2.69it/s, acc_step=1/1, ce_loss_token=1.8219, perplexity_token=6.1838]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  26%|████████████                                   | 268/1044 [01:38<04:49,  2.68it/s, acc_step=1/1, ce_loss_token=1.8218, perplexity_token=6.1827]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  26%|████████████                                   | 269/1044 [01:38<04:51,  2.66it/s, acc_step=1/1, ce_loss_token=1.8216, perplexity_token=6.1818]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:38<04:43,  2.73it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1809]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:39<04:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.8214, perplexity_token=6.1802]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:39<04:26,  2.90it/s, acc_step=1/1, ce_loss_token=1.8217, perplexity_token=6.1822]

torch.Size([256, 314, 35]) torch.Size([256, 314])

[Training LM]:  26%|████████████▎                                  | 273/1044 [01:39<04:34,  2.81it/s, acc_step=1/1, ce_loss_token=1.8215, perplexity_token=6.1814]

torch.Size([256, 298, 35]) torch.Size([256, 298])

[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:39<03:57,  2.54it/s, acc_step=1/1, ce_loss_token=1.8208, perplexity_token=6.1769]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:40<03:51,  2.61it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1764]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:40<03:50,  2.61it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1758]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:41<03:57,  2.53it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1754]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:41<03:54,  2.56it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1751]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:41<03:52,  2.58it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1746]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:42<03:49,  2.61it/s, acc_step=1/1, ce_loss_token=1.8204, perplexity_token=6.1741]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:42<03:46,  2.63it/s, acc_step=1/1, ce_loss_token=1.8203, perplexity_token=6.1736]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:42<03:37,  2.74it/s, acc_step=1/1, ce_loss_token=1.8202, perplexity_token=6.1733]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:43<03:35,  2.76it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1728]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:43<03:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1723]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:44<03:46,  2.62it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1717]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:44<03:31,  2.80it/s, acc_step=1/1, ce_loss_token=1.8202, perplexity_token=6.1731]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:44<03:14,  3.04it/s, acc_step=1/1, ce_loss_token=1.8204, perplexity_token=6.1745]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:44<03:07,  3.14it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1761]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:45<03:20,  2.93it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1754]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:45<03:14,  3.02it/s, acc_step=1/1, ce_loss_token=1.8208, perplexity_token=6.1768]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:45<03:18,  2.96it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1762]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:46<03:10,  3.08it/s, acc_step=1/1, ce_loss_token=1.8209, perplexity_token=6.1775]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:46<03:11,  3.06it/s, acc_step=1/1, ce_loss_token=1.8208, perplexity_token=6.1771]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:46<03:14,  3.00it/s, acc_step=1/1, ce_loss_token=1.8208, perplexity_token=6.1766]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:47<03:18,  2.93it/s, acc_step=1/1, ce_loss_token=1.8207, perplexity_token=6.1760]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:47<03:26,  2.81it/s, acc_step=1/1, ce_loss_token=1.8206, perplexity_token=6.1755]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:48<03:34,  2.71it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1750]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:48<03:35,  2.69it/s, acc_step=1/1, ce_loss_token=1.8205, perplexity_token=6.1748]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:48<03:39,  2.64it/s, acc_step=1/1, ce_loss_token=1.8204, perplexity_token=6.1742]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:49<03:41,  2.61it/s, acc_step=1/1, ce_loss_token=1.8203, perplexity_token=6.1738]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:49<03:44,  2.57it/s, acc_step=1/1, ce_loss_token=1.8203, perplexity_token=6.1735]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:50<03:44,  2.57it/s, acc_step=1/1, ce_loss_token=1.8202, perplexity_token=6.1729]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:50<03:46,  2.53it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1722]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:50<03:59,  2.39it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1718]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:51<03:51,  2.47it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1713]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:51<03:48,  2.50it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1708]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:52<03:35,  2.65it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1722]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:52<03:27,  2.75it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1718]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:52<03:24,  2.78it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1712]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:53<03:24,  2.78it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1708]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:53<03:07,  3.02it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1718]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:53<03:10,  2.97it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1713]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:54<03:23,  2.77it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1709]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:54<03:26,  2.73it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1703]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:54<03:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1699]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:55<03:16,  2.86it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1708]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:55<03:36,  2.59it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1703]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:55<03:27,  2.70it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1715]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [02:56<03:13,  2.88it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1723]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [02:56<03:16,  2.83it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1718]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [02:57<03:19,  2.79it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1714]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [02:57<03:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1710]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  47%|██████████████████████                         | 489/1044 [02:57<03:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1704]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  47%|██████████████████████                         | 490/1044 [02:58<03:41,  2.50it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1699]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  47%|██████████████████████                         | 491/1044 [02:58<03:32,  2.60it/s, acc_step=1/1, ce_loss_token=1.8196, perplexity_token=6.1695]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [02:58<03:28,  2.65it/s, acc_step=1/1, ce_loss_token=1.8196, perplexity_token=6.1691]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [02:59<03:00,  3.04it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1727]

torch.Size([256, 325, 35]) torch.Size([256, 325])
torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [02:59<03:03,  2.98it/s, acc_step=1/1, ce_loss_token=1.8201, perplexity_token=6.1722]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:00<03:14,  2.82it/s, acc_step=1/1, ce_loss_token=1.8200, perplexity_token=6.1717]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:00<03:13,  2.82it/s, acc_step=1/1, ce_loss_token=1.8199, perplexity_token=6.1712]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:01<03:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.8198, perplexity_token=6.1706]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:01<03:17,  2.76it/s, acc_step=1/1, ce_loss_token=1.8197, perplexity_token=6.1701]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:01<03:20,  2.71it/s, acc_step=1/1, ce_loss_token=1.8196, perplexity_token=6.1694]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:02<03:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.8195, perplexity_token=6.1688]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:02<03:18,  2.73it/s, acc_step=1/1, ce_loss_token=1.8194, perplexity_token=6.1684]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:02<03:19,  2.71it/s, acc_step=1/1, ce_loss_token=1.8194, perplexity_token=6.1680]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:03<03:21,  2.68it/s, acc_step=1/1, ce_loss_token=1.8193, perplexity_token=6.1675]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:03<03:31,  2.55it/s, acc_step=1/1, ce_loss_token=1.8192, perplexity_token=6.1672]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:04<03:34,  2.50it/s, acc_step=1/1, ce_loss_token=1.8192, perplexity_token=6.1667]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:04<03:30,  2.56it/s, acc_step=1/1, ce_loss_token=1.8191, perplexity_token=6.1661]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:04<03:45,  2.37it/s, acc_step=1/1, ce_loss_token=1.8190, perplexity_token=6.1656]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:05<03:38,  2.45it/s, acc_step=1/1, ce_loss_token=1.8189, perplexity_token=6.1650]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:05<03:36,  2.46it/s, acc_step=1/1, ce_loss_token=1.8188, perplexity_token=6.1645]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:06<03:21,  2.64it/s, acc_step=1/1, ce_loss_token=1.8190, perplexity_token=6.1656]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:06<03:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.8189, perplexity_token=6.1649]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:06<03:20,  2.65it/s, acc_step=1/1, ce_loss_token=1.8188, perplexity_token=6.1647]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:07<03:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.8187, perplexity_token=6.1641]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:07<03:23,  2.59it/s, acc_step=1/1, ce_loss_token=1.8187, perplexity_token=6.1636]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:07<03:22,  2.61it/s, acc_step=1/1, ce_loss_token=1.8186, perplexity_token=6.1630]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:08<03:20,  2.62it/s, acc_step=1/1, ce_loss_token=1.8185, perplexity_token=6.1625]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:08<03:21,  2.60it/s, acc_step=1/1, ce_loss_token=1.8184, perplexity_token=6.1622]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:09<03:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.8184, perplexity_token=6.1617]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:09<03:16,  2.67it/s, acc_step=1/1, ce_loss_token=1.8183, perplexity_token=6.1612]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:09<03:02,  2.86it/s, acc_step=1/1, ce_loss_token=1.8185, perplexity_token=6.1627]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:10<02:54,  3.00it/s, acc_step=1/1, ce_loss_token=1.8187, perplexity_token=6.1636]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:10<03:06,  2.79it/s, acc_step=1/1, ce_loss_token=1.8186, perplexity_token=6.1631]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:10<03:10,  2.73it/s, acc_step=1/1, ce_loss_token=1.8185, perplexity_token=6.1625]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:11<03:06,  2.79it/s, acc_step=1/1, ce_loss_token=1.8184, perplexity_token=6.1620]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:11<03:00,  2.87it/s, acc_step=1/1, ce_loss_token=1.8186, perplexity_token=6.1630]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:11<02:51,  3.01it/s, acc_step=1/1, ce_loss_token=1.8187, perplexity_token=6.1638]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:12<02:54,  2.95it/s, acc_step=1/1, ce_loss_token=1.8186, perplexity_token=6.1631]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:12<02:56,  2.91it/s, acc_step=1/1, ce_loss_token=1.8185, perplexity_token=6.1627]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:12<03:04,  2.79it/s, acc_step=1/1, ce_loss_token=1.8185, perplexity_token=6.1623]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:13<03:07,  2.74it/s, acc_step=1/1, ce_loss_token=1.8184, perplexity_token=6.1619]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:13<03:04,  2.78it/s, acc_step=1/1, ce_loss_token=1.8183, perplexity_token=6.1615]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:14<03:09,  2.70it/s, acc_step=1/1, ce_loss_token=1.8182, perplexity_token=6.1610]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:14<03:11,  2.66it/s, acc_step=1/1, ce_loss_token=1.8182, perplexity_token=6.1605]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:14<03:10,  2.67it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1600]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:15<03:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1597]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:15<03:00,  2.81it/s, acc_step=1/1, ce_loss_token=1.8182, perplexity_token=6.1606]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:15<03:03,  2.76it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1601]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:16<03:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1597]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:16<03:05,  2.72it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1593]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:16<03:04,  2.72it/s, acc_step=1/1, ce_loss_token=1.8179, perplexity_token=6.1589]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:17<03:05,  2.71it/s, acc_step=1/1, ce_loss_token=1.8178, perplexity_token=6.1584]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:17<03:16,  2.55it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1579]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:18<03:13,  2.58it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1575]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:18<02:46,  3.00it/s, acc_step=1/1, ce_loss_token=1.8182, perplexity_token=6.1605]

torch.Size([256, 284, 35]) torch.Size([256, 284])
torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:19<02:50,  2.92it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1601]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:19<02:51,  2.88it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1598]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:19<02:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1594]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:20<02:59,  2.76it/s, acc_step=1/1, ce_loss_token=1.8179, perplexity_token=6.1590]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:20<02:59,  2.74it/s, acc_step=1/1, ce_loss_token=1.8178, perplexity_token=6.1585]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:21<03:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1578]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:21<03:01,  2.71it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1574]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:21<02:40,  3.05it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1600]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:21<02:40,  3.04it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1595]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:22<02:36,  3.11it/s, acc_step=1/1, ce_loss_token=1.8181, perplexity_token=6.1604]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:22<02:41,  3.01it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1598]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:22<02:48,  2.88it/s, acc_step=1/1, ce_loss_token=1.8180, perplexity_token=6.1594]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:23<03:01,  2.68it/s, acc_step=1/1, ce_loss_token=1.8179, perplexity_token=6.1590]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:23<03:01,  2.66it/s, acc_step=1/1, ce_loss_token=1.8178, perplexity_token=6.1584]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:24<02:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1579]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:24<02:57,  2.71it/s, acc_step=1/1, ce_loss_token=1.8177, perplexity_token=6.1574]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:24<03:00,  2.67it/s, acc_step=1/1, ce_loss_token=1.8176, perplexity_token=6.1569]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:25<03:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.8175, perplexity_token=6.1566]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:25<02:56,  2.71it/s, acc_step=1/1, ce_loss_token=1.8175, perplexity_token=6.1562]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:26<03:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.8174, perplexity_token=6.1558]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:26<02:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.8173, perplexity_token=6.1554]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:26<03:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.8173, perplexity_token=6.1551]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:27<03:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.8172, perplexity_token=6.1546]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:27<03:03,  2.58it/s, acc_step=1/1, ce_loss_token=1.8171, perplexity_token=6.1540]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:27<03:04,  2.56it/s, acc_step=1/1, ce_loss_token=1.8170, perplexity_token=6.1535]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:28<02:59,  2.62it/s, acc_step=1/1, ce_loss_token=1.8169, perplexity_token=6.1531]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:28<02:55,  2.69it/s, acc_step=1/1, ce_loss_token=1.8169, perplexity_token=6.1527]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:29<02:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.8168, perplexity_token=6.1522]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:29<02:42,  2.89it/s, acc_step=1/1, ce_loss_token=1.8169, perplexity_token=6.1529]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:29<02:42,  2.89it/s, acc_step=1/1, ce_loss_token=1.8169, perplexity_token=6.1525]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:30<02:51,  2.73it/s, acc_step=1/1, ce_loss_token=1.8168, perplexity_token=6.1520]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:30<02:51,  2.71it/s, acc_step=1/1, ce_loss_token=1.8167, perplexity_token=6.1517]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:30<02:59,  2.60it/s, acc_step=1/1, ce_loss_token=1.8167, perplexity_token=6.1514]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:31<03:10,  2.44it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1509]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:31<03:10,  2.43it/s, acc_step=1/1, ce_loss_token=1.8168, perplexity_token=6.1521]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:32<03:05,  2.49it/s, acc_step=1/1, ce_loss_token=1.8167, perplexity_token=6.1515]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:32<03:04,  2.50it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1511]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:32<03:05,  2.49it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1507]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:33<02:56,  2.61it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1503]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:33<02:58,  2.57it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1498]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:34<02:54,  2.62it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1494]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:34<03:00,  2.52it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1489]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:34<03:05,  2.46it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1487]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:35<03:01,  2.50it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1483]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:35<02:44,  2.75it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1494]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:35<02:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1488]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:36<02:40,  2.80it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1499]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:36<02:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1494]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:37<02:44,  2.73it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1489]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:37<02:44,  2.73it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1485]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:37<02:40,  2.79it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1481]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:38<02:46,  2.68it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1477]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:38<02:36,  2.83it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1485]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:38<02:43,  2.71it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1481]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:39<02:45,  2.68it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1477]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:39<02:33,  2.87it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1484]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:39<02:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1481]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:40<02:50,  2.57it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1476]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:40<02:44,  2.66it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:41<02:44,  2.66it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1469]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:41<02:39,  2.74it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1465]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:41<02:35,  2.80it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:42<02:35,  2.79it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1468]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:42<02:35,  2.80it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1463]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:42<02:32,  2.84it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1460]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:43<02:34,  2.79it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1457]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:43<02:39,  2.71it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1453]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:44<03:17,  2.18it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1450]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:44<02:34,  2.77it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1494]

torch.Size([256, 327, 35]) torch.Size([256, 327])
torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:45<02:36,  2.73it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1490]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:45<02:43,  2.60it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1487]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:46<02:43,  2.60it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1483]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:46<02:37,  2.70it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1479]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:46<03:01,  2.32it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1476]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:47<02:53,  2.43it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:47<02:51,  2.46it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1468]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:47<02:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1476]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:48<02:37,  2.67it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1471]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:48<02:35,  2.69it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1467]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:49<02:33,  2.72it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1463]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:49<02:33,  2.72it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1459]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:49<02:48,  2.46it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1456]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:50<02:37,  2.63it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1466]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:50<02:40,  2.58it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1462]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:51<02:34,  2.67it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1458]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:51<03:06,  2.20it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1455]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:52<02:57,  2.31it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1451]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:52<02:50,  2.39it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1447]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:52<02:33,  2.65it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1458]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:53<02:31,  2.68it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1454]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:53<02:40,  2.52it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1450]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:53<02:36,  2.59it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1446]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:54<02:27,  2.74it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1457]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:54<02:27,  2.73it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1455]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:54<02:27,  2.72it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1450]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:55<02:18,  2.90it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1458]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:55<02:08,  3.11it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1466]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:55<02:12,  3.02it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1462]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:56<02:28,  2.69it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1460]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:56<02:15,  2.92it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1470]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:56<02:13,  2.98it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1477]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [03:57<02:17,  2.88it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1473]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [03:57<02:09,  3.05it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1483]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [03:57<02:05,  3.13it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1491]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [03:58<02:16,  2.87it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1488]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [03:58<02:25,  2.68it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1484]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [03:59<02:22,  2.74it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1479]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [03:59<02:23,  2.72it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1476]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [03:59<02:30,  2.58it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:00<02:20,  2.76it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1479]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:00<02:23,  2.69it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1475]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:00<02:19,  2.75it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1469]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:01<02:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1467]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:01<02:21,  2.70it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1464]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:02<02:21,  2.70it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1459]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:02<02:12,  2.88it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1468]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:02<02:01,  3.13it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1474]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:02<02:04,  3.03it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1470]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:03<02:11,  2.87it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1466]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:03<02:03,  3.05it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:03<02:07,  2.94it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1469]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:04<02:00,  3.10it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1475]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:04<01:55,  3.23it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1481]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:04<02:02,  3.05it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1476]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:05<01:57,  3.18it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1485]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:05<01:49,  3.39it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1511]

torch.Size([256, 321, 35]) torch.Size([256, 321])
torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:06<01:58,  3.12it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1507]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:06<02:04,  2.96it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1503]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:06<02:06,  2.91it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1499]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:07<02:08,  2.85it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1495]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:07<02:14,  2.72it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1492]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:08<02:14,  2.70it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1488]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:08<02:15,  2.68it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1485]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:08<02:08,  2.82it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1493]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:09<02:06,  2.84it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1491]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:09<01:58,  3.04it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1497]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:09<02:02,  2.93it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1493]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:10<01:54,  3.13it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1501]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:10<01:55,  3.09it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1498]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:10<01:59,  2.97it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1495]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:11<01:58,  3.00it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1491]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:11<01:54,  3.09it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1494]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:11<01:57,  3.00it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1490]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:12<01:59,  2.94it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1487]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:12<01:50,  3.18it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1493]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:12<01:47,  3.24it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1501]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:12<01:45,  3.31it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1508]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:13<01:54,  3.03it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1504]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:13<01:58,  2.93it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1500]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:13<01:53,  3.05it/s, acc_step=1/1, ce_loss_token=1.8166, perplexity_token=6.1507]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:14<01:57,  2.94it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1503]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:14<02:02,  2.80it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1499]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:15<02:01,  2.81it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1496]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:15<01:55,  2.95it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1502]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:15<01:55,  2.96it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1497]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:15<01:50,  3.08it/s, acc_step=1/1, ce_loss_token=1.8165, perplexity_token=6.1503]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:16<01:55,  2.94it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1499]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:16<01:58,  2.85it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1496]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:17<02:12,  2.55it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1491]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:17<02:05,  2.68it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1500]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:17<02:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.8164, perplexity_token=6.1496]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:18<02:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1492]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:18<02:17,  2.42it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1488]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:19<02:14,  2.48it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1484]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:19<02:01,  2.74it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1492]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:19<02:02,  2.70it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1490]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:20<02:09,  2.54it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1486]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:20<02:00,  2.72it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1493]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:20<01:58,  2.76it/s, acc_step=1/1, ce_loss_token=1.8163, perplexity_token=6.1488]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:21<01:56,  2.79it/s, acc_step=1/1, ce_loss_token=1.8162, perplexity_token=6.1485]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:21<01:55,  2.80it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1480]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:22<02:03,  2.62it/s, acc_step=1/1, ce_loss_token=1.8161, perplexity_token=6.1475]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:22<02:04,  2.60it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1472]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:22<02:02,  2.63it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1468]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:23<02:02,  2.63it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1463]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:23<02:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1459]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:24<02:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1457]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:24<01:58,  2.68it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1453]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:24<02:00,  2.64it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1449]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:25<02:00,  2.62it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1445]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:25<02:00,  2.62it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1441]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:25<02:01,  2.59it/s, acc_step=1/1, ce_loss_token=1.8154, perplexity_token=6.1438]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:26<01:51,  2.81it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1447]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:26<01:52,  2.77it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1444]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:26<01:46,  2.92it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1450]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:27<01:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1446]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:27<02:10,  2.37it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1441]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:28<02:06,  2.43it/s, acc_step=1/1, ce_loss_token=1.8154, perplexity_token=6.1437]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:28<01:56,  2.64it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1442]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:29<02:08,  2.37it/s, acc_step=1/1, ce_loss_token=1.8154, perplexity_token=6.1437]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:29<02:05,  2.43it/s, acc_step=1/1, ce_loss_token=1.8154, perplexity_token=6.1435]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:29<02:00,  2.52it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1430]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:30<01:56,  2.60it/s, acc_step=1/1, ce_loss_token=1.8152, perplexity_token=6.1425]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:30<01:49,  2.76it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1431]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:30<01:50,  2.73it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1428]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:31<01:48,  2.77it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1426]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:31<01:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.8152, perplexity_token=6.1423]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:31<01:49,  2.73it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1418]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:32<01:54,  2.60it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1416]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:32<01:54,  2.59it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1413]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:33<01:52,  2.63it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1409]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:33<01:41,  2.90it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1413]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:33<01:35,  3.08it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1418]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:34<01:26,  3.37it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1444]

torch.Size([256, 307, 35]) torch.Size([256, 307])
torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:34<01:25,  3.39it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1452]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:35<01:40,  2.87it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1449]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:35<01:43,  2.78it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1445]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:35<01:45,  2.73it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1442]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:36<01:48,  2.63it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1449]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:36<01:58,  2.41it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1446]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:37<02:18,  2.05it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1443]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:37<02:06,  2.24it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1452]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:37<01:53,  2.48it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1458]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:38<01:48,  2.59it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1453]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:38<01:45,  2.66it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1449]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:39<01:45,  2.65it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1446]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:39<01:39,  2.81it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1454]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:39<01:42,  2.69it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1450]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:40<01:35,  2.88it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1457]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:40<01:37,  2.81it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1453]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:40<01:37,  2.80it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1448]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:41<01:35,  2.85it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1453]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:41<01:26,  3.15it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1474]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:42<01:30,  2.98it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1469]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:42<01:32,  2.92it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1467]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:42<01:35,  2.80it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1464]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:43<01:40,  2.65it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1460]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:43<01:34,  2.83it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1468]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:43<01:34,  2.81it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1465]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:44<01:29,  2.96it/s, acc_step=1/1, ce_loss_token=1.8160, perplexity_token=6.1469]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:44<01:35,  2.77it/s, acc_step=1/1, ce_loss_token=1.8159, perplexity_token=6.1466]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:45<01:38,  2.65it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1462]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:45<01:40,  2.61it/s, acc_step=1/1, ce_loss_token=1.8158, perplexity_token=6.1458]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:45<01:37,  2.66it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1455]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:46<01:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.8157, perplexity_token=6.1452]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:46<01:36,  2.68it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1449]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:46<01:36,  2.66it/s, acc_step=1/1, ce_loss_token=1.8156, perplexity_token=6.1447]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:47<01:37,  2.63it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1444]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:47<01:37,  2.60it/s, acc_step=1/1, ce_loss_token=1.8155, perplexity_token=6.1439]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:48<01:39,  2.56it/s, acc_step=1/1, ce_loss_token=1.8154, perplexity_token=6.1435]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:48<01:35,  2.64it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1431]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:48<01:34,  2.67it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1428]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:49<01:34,  2.65it/s, acc_step=1/1, ce_loss_token=1.8152, perplexity_token=6.1424]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:49<01:27,  2.85it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1430]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:50<01:35,  2.59it/s, acc_step=1/1, ce_loss_token=1.8153, perplexity_token=6.1426]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:50<01:35,  2.60it/s, acc_step=1/1, ce_loss_token=1.8152, perplexity_token=6.1423]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:50<01:35,  2.60it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1419]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:51<01:35,  2.57it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1417]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:51<01:35,  2.57it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1413]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:51<01:33,  2.62it/s, acc_step=1/1, ce_loss_token=1.8151, perplexity_token=6.1417]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:52<01:33,  2.60it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1413]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:52<01:33,  2.60it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1411]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:53<01:35,  2.53it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1407]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:53<01:34,  2.53it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1404]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:53<01:34,  2.53it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1410]

torch.Size([256, 272, 35]) torch.Size([256, 272])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:54<01:28,  2.70it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1407]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:54<01:28,  2.67it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1405]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:54<01:21,  2.89it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1413]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:55<01:22,  2.86it/s, acc_step=1/1, ce_loss_token=1.8150, perplexity_token=6.1410]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:55<01:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1407]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [04:56<01:26,  2.68it/s, acc_step=1/1, ce_loss_token=1.8149, perplexity_token=6.1404]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [04:56<01:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.8148, perplexity_token=6.1401]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [04:56<01:26,  2.66it/s, acc_step=1/1, ce_loss_token=1.8148, perplexity_token=6.1398]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [04:57<01:30,  2.53it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1395]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [04:57<01:30,  2.54it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1392]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [04:58<01:28,  2.56it/s, acc_step=1/1, ce_loss_token=1.8146, perplexity_token=6.1388]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [04:58<01:28,  2.57it/s, acc_step=1/1, ce_loss_token=1.8146, perplexity_token=6.1385]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [04:58<01:26,  2.62it/s, acc_step=1/1, ce_loss_token=1.8145, perplexity_token=6.1382]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [04:59<01:20,  2.79it/s, acc_step=1/1, ce_loss_token=1.8146, perplexity_token=6.1386]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [04:59<01:14,  3.02it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1393]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [04:59<01:09,  3.19it/s, acc_step=1/1, ce_loss_token=1.8148, perplexity_token=6.1399]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:00<01:16,  2.90it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1395]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:00<01:14,  2.98it/s, acc_step=1/1, ce_loss_token=1.8148, perplexity_token=6.1400]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:00<01:14,  2.95it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1395]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:01<01:18,  2.77it/s, acc_step=1/1, ce_loss_token=1.8147, perplexity_token=6.1392]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:01<01:18,  2.78it/s, acc_step=1/1, ce_loss_token=1.8146, perplexity_token=6.1388]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:01<01:17,  2.80it/s, acc_step=1/1, ce_loss_token=1.8146, perplexity_token=6.1384]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:02<01:24,  2.55it/s, acc_step=1/1, ce_loss_token=1.8145, perplexity_token=6.1382]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:02<01:26,  2.49it/s, acc_step=1/1, ce_loss_token=1.8145, perplexity_token=6.1379]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:03<01:23,  2.56it/s, acc_step=1/1, ce_loss_token=1.8144, perplexity_token=6.1376]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:03<01:25,  2.50it/s, acc_step=1/1, ce_loss_token=1.8144, perplexity_token=6.1373]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:03<01:23,  2.52it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1369]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:04<01:24,  2.49it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1365]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:04<01:23,  2.53it/s, acc_step=1/1, ce_loss_token=1.8142, perplexity_token=6.1363]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:04<01:16,  2.74it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1368]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:05<01:18,  2.64it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1365]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:05<01:19,  2.60it/s, acc_step=1/1, ce_loss_token=1.8142, perplexity_token=6.1361]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:06<01:14,  2.75it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1366]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:06<01:14,  2.74it/s, acc_step=1/1, ce_loss_token=1.8142, perplexity_token=6.1362]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:06<01:10,  2.89it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1369]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:07<01:10,  2.88it/s, acc_step=1/1, ce_loss_token=1.8143, perplexity_token=6.1365]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:07<01:10,  2.86it/s, acc_step=1/1, ce_loss_token=1.8142, perplexity_token=6.1361]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:07<01:12,  2.78it/s, acc_step=1/1, ce_loss_token=1.8141, perplexity_token=6.1358]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:08<01:12,  2.76it/s, acc_step=1/1, ce_loss_token=1.8141, perplexity_token=6.1354]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:08<01:13,  2.70it/s, acc_step=1/1, ce_loss_token=1.8140, perplexity_token=6.1350]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:09<01:24,  2.33it/s, acc_step=1/1, ce_loss_token=1.8139, perplexity_token=6.1346]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:09<01:26,  2.28it/s, acc_step=1/1, ce_loss_token=1.8139, perplexity_token=6.1343]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:10<01:22,  2.37it/s, acc_step=1/1, ce_loss_token=1.8138, perplexity_token=6.1339]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:10<01:22,  2.36it/s, acc_step=1/1, ce_loss_token=1.8138, perplexity_token=6.1335]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:10<01:18,  2.48it/s, acc_step=1/1, ce_loss_token=1.8137, perplexity_token=6.1332]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:11<01:13,  2.62it/s, acc_step=1/1, ce_loss_token=1.8137, perplexity_token=6.1329]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:11<01:15,  2.54it/s, acc_step=1/1, ce_loss_token=1.8136, perplexity_token=6.1325]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:11<01:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.8136, perplexity_token=6.1322]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:12<01:11,  2.66it/s, acc_step=1/1, ce_loss_token=1.8135, perplexity_token=6.1319]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:12<01:16,  2.46it/s, acc_step=1/1, ce_loss_token=1.8135, perplexity_token=6.1316]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:13<01:15,  2.48it/s, acc_step=1/1, ce_loss_token=1.8134, perplexity_token=6.1314]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:13<01:16,  2.44it/s, acc_step=1/1, ce_loss_token=1.8134, perplexity_token=6.1311]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:14<01:28,  2.09it/s, acc_step=1/1, ce_loss_token=1.8133, perplexity_token=6.1307]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:14<01:28,  2.10it/s, acc_step=1/1, ce_loss_token=1.8133, perplexity_token=6.1303]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:15<01:23,  2.21it/s, acc_step=1/1, ce_loss_token=1.8132, perplexity_token=6.1300]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:15<01:18,  2.33it/s, acc_step=1/1, ce_loss_token=1.8131, perplexity_token=6.1297]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:15<01:15,  2.41it/s, acc_step=1/1, ce_loss_token=1.8131, perplexity_token=6.1294]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:16<01:15,  2.39it/s, acc_step=1/1, ce_loss_token=1.8131, perplexity_token=6.1291]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:16<01:12,  2.48it/s, acc_step=1/1, ce_loss_token=1.8130, perplexity_token=6.1288]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:16<01:08,  2.61it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1284]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:17<01:09,  2.56it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1281]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:17<01:07,  2.64it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1279]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:18<01:04,  2.71it/s, acc_step=1/1, ce_loss_token=1.8128, perplexity_token=6.1275]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:18<01:05,  2.65it/s, acc_step=1/1, ce_loss_token=1.8127, perplexity_token=6.1272]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:18<01:04,  2.70it/s, acc_step=1/1, ce_loss_token=1.8127, perplexity_token=6.1269]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:19<01:04,  2.67it/s, acc_step=1/1, ce_loss_token=1.8126, perplexity_token=6.1265]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:19<01:02,  2.75it/s, acc_step=1/1, ce_loss_token=1.8126, perplexity_token=6.1262]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:19<00:58,  2.92it/s, acc_step=1/1, ce_loss_token=1.8127, perplexity_token=6.1267]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:20<01:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.8126, perplexity_token=6.1263]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:20<01:01,  2.77it/s, acc_step=1/1, ce_loss_token=1.8125, perplexity_token=6.1259]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:21<00:53,  3.13it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1280]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:21<00:56,  2.93it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1285]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:21<00:57,  2.88it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1282]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:22<00:54,  3.01it/s, acc_step=1/1, ce_loss_token=1.8130, perplexity_token=6.1287]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:22<00:55,  2.92it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1284]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:22<00:55,  2.90it/s, acc_step=1/1, ce_loss_token=1.8131, perplexity_token=6.1291]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:23<00:57,  2.82it/s, acc_step=1/1, ce_loss_token=1.8130, perplexity_token=6.1288]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:23<00:56,  2.85it/s, acc_step=1/1, ce_loss_token=1.8130, perplexity_token=6.1286]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:24<00:56,  2.82it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1282]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:24<00:58,  2.72it/s, acc_step=1/1, ce_loss_token=1.8129, perplexity_token=6.1280]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:24<00:58,  2.68it/s, acc_step=1/1, ce_loss_token=1.8128, perplexity_token=6.1277]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:25<00:57,  2.72it/s, acc_step=1/1, ce_loss_token=1.8128, perplexity_token=6.1273]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:25<01:00,  2.58it/s, acc_step=1/1, ce_loss_token=1.8127, perplexity_token=6.1269]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:26<01:03,  2.41it/s, acc_step=1/1, ce_loss_token=1.8126, perplexity_token=6.1265]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:26<01:02,  2.44it/s, acc_step=1/1, ce_loss_token=1.8126, perplexity_token=6.1263]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:27<01:09,  2.20it/s, acc_step=1/1, ce_loss_token=1.8125, perplexity_token=6.1260]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:27<01:06,  2.28it/s, acc_step=1/1, ce_loss_token=1.8125, perplexity_token=6.1258]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:27<01:03,  2.35it/s, acc_step=1/1, ce_loss_token=1.8124, perplexity_token=6.1254]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:28<00:56,  2.64it/s, acc_step=1/1, ce_loss_token=1.8125, perplexity_token=6.1259]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:28<00:56,  2.61it/s, acc_step=1/1, ce_loss_token=1.8125, perplexity_token=6.1256]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:28<00:55,  2.66it/s, acc_step=1/1, ce_loss_token=1.8124, perplexity_token=6.1253]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:29<00:54,  2.66it/s, acc_step=1/1, ce_loss_token=1.8124, perplexity_token=6.1251]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:29<00:53,  2.72it/s, acc_step=1/1, ce_loss_token=1.8123, perplexity_token=6.1247]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:30<00:54,  2.63it/s, acc_step=1/1, ce_loss_token=1.8123, perplexity_token=6.1244]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:30<00:54,  2.61it/s, acc_step=1/1, ce_loss_token=1.8122, perplexity_token=6.1240]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:30<00:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.8122, perplexity_token=6.1237]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:31<00:52,  2.70it/s, acc_step=1/1, ce_loss_token=1.8121, perplexity_token=6.1235]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:31<00:53,  2.63it/s, acc_step=1/1, ce_loss_token=1.8121, perplexity_token=6.1232]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:31<00:53,  2.60it/s, acc_step=1/1, ce_loss_token=1.8120, perplexity_token=6.1227]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:32<00:51,  2.68it/s, acc_step=1/1, ce_loss_token=1.8120, perplexity_token=6.1225]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:32<00:50,  2.69it/s, acc_step=1/1, ce_loss_token=1.8119, perplexity_token=6.1221]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:32<00:49,  2.72it/s, acc_step=1/1, ce_loss_token=1.8119, perplexity_token=6.1218]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:33<00:51,  2.61it/s, acc_step=1/1, ce_loss_token=1.8118, perplexity_token=6.1214]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:33<00:51,  2.60it/s, acc_step=1/1, ce_loss_token=1.8117, perplexity_token=6.1211]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:34<00:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.8119, perplexity_token=6.1218]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:34<00:46,  2.84it/s, acc_step=1/1, ce_loss_token=1.8118, perplexity_token=6.1216]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:34<00:49,  2.67it/s, acc_step=1/1, ce_loss_token=1.8118, perplexity_token=6.1213]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:35<00:48,  2.68it/s, acc_step=1/1, ce_loss_token=1.8117, perplexity_token=6.1211]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:35<00:46,  2.75it/s, acc_step=1/1, ce_loss_token=1.8117, perplexity_token=6.1207]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:35<00:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.8116, perplexity_token=6.1204]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:36<00:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.8116, perplexity_token=6.1201]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:36<00:47,  2.63it/s, acc_step=1/1, ce_loss_token=1.8115, perplexity_token=6.1198]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:37<00:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.8115, perplexity_token=6.1194]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:37<00:46,  2.66it/s, acc_step=1/1, ce_loss_token=1.8114, perplexity_token=6.1191]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:37<00:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.8114, perplexity_token=6.1188]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:38<00:45,  2.66it/s, acc_step=1/1, ce_loss_token=1.8113, perplexity_token=6.1185]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:38<00:46,  2.60it/s, acc_step=1/1, ce_loss_token=1.8113, perplexity_token=6.1182]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:39<00:45,  2.65it/s, acc_step=1/1, ce_loss_token=1.8112, perplexity_token=6.1179]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:39<00:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.8112, perplexity_token=6.1176]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:39<00:42,  2.79it/s, acc_step=1/1, ce_loss_token=1.8112, perplexity_token=6.1180]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:40<00:43,  2.67it/s, acc_step=1/1, ce_loss_token=1.8112, perplexity_token=6.1177]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:40<00:43,  2.64it/s, acc_step=1/1, ce_loss_token=1.8111, perplexity_token=6.1173]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:41<00:48,  2.36it/s, acc_step=1/1, ce_loss_token=1.8111, perplexity_token=6.1170]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:41<00:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.8111, perplexity_token=6.1174]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:41<00:42,  2.65it/s, acc_step=1/1, ce_loss_token=1.8111, perplexity_token=6.1171]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:42<00:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.8110, perplexity_token=6.1167]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:42<00:42,  2.60it/s, acc_step=1/1, ce_loss_token=1.8110, perplexity_token=6.1165]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:42<00:41,  2.63it/s, acc_step=1/1, ce_loss_token=1.8109, perplexity_token=6.1162]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:43<00:40,  2.69it/s, acc_step=1/1, ce_loss_token=1.8109, perplexity_token=6.1159]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:43<00:37,  2.91it/s, acc_step=1/1, ce_loss_token=1.8109, perplexity_token=6.1163]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:43<00:34,  3.11it/s, acc_step=1/1, ce_loss_token=1.8111, perplexity_token=6.1169]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:44<00:35,  3.00it/s, acc_step=1/1, ce_loss_token=1.8110, perplexity_token=6.1166]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:44<00:35,  2.93it/s, acc_step=1/1, ce_loss_token=1.8110, perplexity_token=6.1163]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:44<00:37,  2.75it/s, acc_step=1/1, ce_loss_token=1.8109, perplexity_token=6.1160]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:45<00:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.8108, perplexity_token=6.1156]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:45<00:38,  2.64it/s, acc_step=1/1, ce_loss_token=1.8108, perplexity_token=6.1153]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:45<00:35,  2.83it/s, acc_step=1/1, ce_loss_token=1.8109, perplexity_token=6.1158]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:46<00:35,  2.82it/s, acc_step=1/1, ce_loss_token=1.8108, perplexity_token=6.1155]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:46<00:34,  2.83it/s, acc_step=1/1, ce_loss_token=1.8108, perplexity_token=6.1152]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:47<00:37,  2.65it/s, acc_step=1/1, ce_loss_token=1.8107, perplexity_token=6.1149]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:47<00:37,  2.60it/s, acc_step=1/1, ce_loss_token=1.8107, perplexity_token=6.1147]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:47<00:36,  2.61it/s, acc_step=1/1, ce_loss_token=1.8106, perplexity_token=6.1143]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:48<00:37,  2.53it/s, acc_step=1/1, ce_loss_token=1.8106, perplexity_token=6.1139]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:48<00:37,  2.50it/s, acc_step=1/1, ce_loss_token=1.8105, perplexity_token=6.1136]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:49<00:37,  2.47it/s, acc_step=1/1, ce_loss_token=1.8105, perplexity_token=6.1133]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:49<00:36,  2.55it/s, acc_step=1/1, ce_loss_token=1.8104, perplexity_token=6.1130]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:49<00:35,  2.54it/s, acc_step=1/1, ce_loss_token=1.8104, perplexity_token=6.1127]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:50<00:37,  2.42it/s, acc_step=1/1, ce_loss_token=1.8103, perplexity_token=6.1123]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:50<00:36,  2.43it/s, acc_step=1/1, ce_loss_token=1.8102, perplexity_token=6.1120]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:51<00:34,  2.54it/s, acc_step=1/1, ce_loss_token=1.8102, perplexity_token=6.1116]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:51<00:33,  2.59it/s, acc_step=1/1, ce_loss_token=1.8102, perplexity_token=6.1114]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:51<00:34,  2.50it/s, acc_step=1/1, ce_loss_token=1.8101, perplexity_token=6.1110]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:52<00:34,  2.48it/s, acc_step=1/1, ce_loss_token=1.8101, perplexity_token=6.1108]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:52<00:31,  2.68it/s, acc_step=1/1, ce_loss_token=1.8101, perplexity_token=6.1112]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:53<00:30,  2.69it/s, acc_step=1/1, ce_loss_token=1.8101, perplexity_token=6.1110]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:53<00:32,  2.53it/s, acc_step=1/1, ce_loss_token=1.8100, perplexity_token=6.1107]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:53<00:33,  2.41it/s, acc_step=1/1, ce_loss_token=1.8100, perplexity_token=6.1104]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:54<00:32,  2.49it/s, acc_step=1/1, ce_loss_token=1.8099, perplexity_token=6.1101]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:54<00:31,  2.53it/s, acc_step=1/1, ce_loss_token=1.8099, perplexity_token=6.1099]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:55<00:32,  2.38it/s, acc_step=1/1, ce_loss_token=1.8099, perplexity_token=6.1096]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:55<00:31,  2.43it/s, acc_step=1/1, ce_loss_token=1.8098, perplexity_token=6.1094]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:55<00:29,  2.54it/s, acc_step=1/1, ce_loss_token=1.8098, perplexity_token=6.1090]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:56<00:28,  2.64it/s, acc_step=1/1, ce_loss_token=1.8097, perplexity_token=6.1087]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:56<00:28,  2.64it/s, acc_step=1/1, ce_loss_token=1.8097, perplexity_token=6.1085]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [05:56<00:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.8096, perplexity_token=6.1082]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [05:57<00:27,  2.65it/s, acc_step=1/1, ce_loss_token=1.8096, perplexity_token=6.1078]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [05:57<00:26,  2.63it/s, acc_step=1/1, ce_loss_token=1.8095, perplexity_token=6.1074]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [05:58<00:26,  2.67it/s, acc_step=1/1, ce_loss_token=1.8095, perplexity_token=6.1072]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [05:58<00:25,  2.66it/s, acc_step=1/1, ce_loss_token=1.8094, perplexity_token=6.1069]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [05:58<00:25,  2.66it/s, acc_step=1/1, ce_loss_token=1.8094, perplexity_token=6.1066]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [05:59<00:24,  2.70it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1063]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [05:59<00:23,  2.79it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1059]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [05:59<00:21,  3.00it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1063]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:00<00:22,  2.87it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1059]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:00<00:20,  3.04it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1063]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:00<00:21,  2.94it/s, acc_step=1/1, ce_loss_token=1.8093, perplexity_token=6.1061]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:01<00:20,  2.91it/s, acc_step=1/1, ce_loss_token=1.8092, perplexity_token=6.1058]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:01<00:21,  2.86it/s, acc_step=1/1, ce_loss_token=1.8092, perplexity_token=6.1054]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:01<00:21,  2.80it/s, acc_step=1/1, ce_loss_token=1.8091, perplexity_token=6.1051]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:02<00:21,  2.69it/s, acc_step=1/1, ce_loss_token=1.8091, perplexity_token=6.1049]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:02<00:20,  2.72it/s, acc_step=1/1, ce_loss_token=1.8090, perplexity_token=6.1046]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:03<00:21,  2.66it/s, acc_step=1/1, ce_loss_token=1.8090, perplexity_token=6.1043]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:03<00:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.8089, perplexity_token=6.1040]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:03<00:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.8089, perplexity_token=6.1037]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:04<00:19,  2.68it/s, acc_step=1/1, ce_loss_token=1.8088, perplexity_token=6.1033]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:04<00:18,  2.85it/s, acc_step=1/1, ce_loss_token=1.8089, perplexity_token=6.1037]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:04<00:18,  2.76it/s, acc_step=1/1, ce_loss_token=1.8089, perplexity_token=6.1034]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:05<00:18,  2.76it/s, acc_step=1/1, ce_loss_token=1.8088, perplexity_token=6.1031]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:05<00:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.8087, perplexity_token=6.1028]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:05<00:17,  2.79it/s, acc_step=1/1, ce_loss_token=1.8089, perplexity_token=6.1034]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:06<00:16,  2.77it/s, acc_step=1/1, ce_loss_token=1.8088, perplexity_token=6.1031]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:06<00:17,  2.68it/s, acc_step=1/1, ce_loss_token=1.8088, perplexity_token=6.1029]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:07<00:16,  2.72it/s, acc_step=1/1, ce_loss_token=1.8087, perplexity_token=6.1027]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:07<00:16,  2.61it/s, acc_step=1/1, ce_loss_token=1.8087, perplexity_token=6.1025]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:07<00:16,  2.63it/s, acc_step=1/1, ce_loss_token=1.8086, perplexity_token=6.1022]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:08<00:18,  2.30it/s, acc_step=1/1, ce_loss_token=1.8086, perplexity_token=6.1019]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:08<00:16,  2.45it/s, acc_step=1/1, ce_loss_token=1.8085, perplexity_token=6.1016]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:09<00:15,  2.51it/s, acc_step=1/1, ce_loss_token=1.8085, perplexity_token=6.1012]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:09<00:15,  2.60it/s, acc_step=1/1, ce_loss_token=1.8084, perplexity_token=6.1009]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:09<00:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.8084, perplexity_token=6.1006]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:10<00:14,  2.50it/s, acc_step=1/1, ce_loss_token=1.8083, perplexity_token=6.1004]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:10<00:13,  2.62it/s, acc_step=1/1, ce_loss_token=1.8083, perplexity_token=6.1001]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:11<00:12,  2.70it/s, acc_step=1/1, ce_loss_token=1.8083, perplexity_token=6.0999]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:11<00:13,  2.61it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0996]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:11<00:12,  2.60it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0993]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:12<00:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0990]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:12<00:11,  2.82it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0994]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:12<00:10,  2.76it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0990]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:13<00:10,  2.80it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0987]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:13<00:10,  2.78it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0985]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:13<00:09,  2.78it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0982]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:14<00:09,  2.67it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0980]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:14<00:09,  2.62it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0984]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:15<00:09,  2.62it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0982]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:15<00:08,  2.66it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0980]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:15<00:08,  2.52it/s, acc_step=1/1, ce_loss_token=1.8079, perplexity_token=6.0977]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:16<00:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0983]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:16<00:06,  2.90it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0988]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:16<00:06,  2.86it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0985]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:17<00:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0982]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:17<00:05,  2.90it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0989]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:17<00:05,  3.04it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0993]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:18<00:05,  2.96it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0990]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:18<00:04,  3.15it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0996]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:19<00:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.8082, perplexity_token=6.0993]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:19<00:04,  2.55it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0991]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:19<00:04,  2.62it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0988]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:20<00:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.8081, perplexity_token=6.0986]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:20<00:03,  2.74it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0983]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:20<00:02,  2.71it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0980]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:21<00:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.8079, perplexity_token=6.0977]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:21<00:02,  2.66it/s, acc_step=1/1, ce_loss_token=1.8079, perplexity_token=6.0975]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:22<00:01,  2.68it/s, acc_step=1/1, ce_loss_token=1.8078, perplexity_token=6.0973]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:22<00:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.8078, perplexity_token=6.0970]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:22<00:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.8078, perplexity_token=6.0968]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:23<00:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.8077, perplexity_token=6.0965]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:23<00:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.8077, perplexity_token=6.0962]

torch.Size([170, 309, 35]) torch.Size([170, 309])


                                                                                                                                                                   

Generating with greedy search...

📊 Metrics (Epoch 6):
├── TRAIN:
│   ├── ce_loss_char: 1.8076
│   ├── ce_loss_token: 1.8076
│   ├── perplexity_char: 6.0960
│   └── perplexity_token: 6.0960
├── VAL:
│   ├── ce_loss_char: 1.6729
│   ├── ce_loss_token: 1.6729
│   ├── perplexity_char: 5.3275
│   └── perplexity_token: 5.3275
└── TRAINING:
    └── learning_rate: 0.000100


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:48,  1.97it/s, acc_step=1/1, ce_loss_token=1.7695, perplexity_token=5.8679]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   0%|                                                 | 2/1044 [00:00<06:58,  2.49it/s, acc_step=1/1, ce_loss_token=1.7642, perplexity_token=5.8368]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<06:55,  2.51it/s, acc_step=1/1, ce_loss_token=1.7620, perplexity_token=5.8243]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:45,  2.57it/s, acc_step=1/1, ce_loss_token=1.7634, perplexity_token=5.8324]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<06:55,  2.50it/s, acc_step=1/1, ce_loss_token=1.7618, perplexity_token=5.8228]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:39,  2.60it/s, acc_step=1/1, ce_loss_token=1.7599, perplexity_token=5.8121]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:32,  2.64it/s, acc_step=1/1, ce_loss_token=1.7600, perplexity_token=5.8125]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<06:46,  2.55it/s, acc_step=1/1, ce_loss_token=1.7597, perplexity_token=5.8106]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<06:42,  2.57it/s, acc_step=1/1, ce_loss_token=1.7584, perplexity_token=5.8031]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:49,  2.53it/s, acc_step=1/1, ce_loss_token=1.7585, perplexity_token=5.8035]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<07:31,  2.29it/s, acc_step=1/1, ce_loss_token=1.7587, perplexity_token=5.8047]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<07:19,  2.35it/s, acc_step=1/1, ce_loss_token=1.7581, perplexity_token=5.8016]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:56,  2.47it/s, acc_step=1/1, ce_loss_token=1.7581, perplexity_token=5.8014]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<06:18,  2.72it/s, acc_step=1/1, ce_loss_token=1.7691, perplexity_token=5.8655]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<06:30,  2.64it/s, acc_step=1/1, ce_loss_token=1.7683, perplexity_token=5.8610]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   2%|▋                                               | 16/1044 [00:06<05:56,  2.88it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9050]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<06:12,  2.76it/s, acc_step=1/1, ce_loss_token=1.7831, perplexity_token=5.9483]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<06:19,  2.70it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9410]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<05:07,  3.33it/s, acc_step=1/1, ce_loss_token=1.8080, perplexity_token=6.0984]

torch.Size([256, 296, 35]) torch.Size([256, 296])
torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<05:23,  3.16it/s, acc_step=1/1, ce_loss_token=1.8053, perplexity_token=6.0820]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   2%|█                                               | 22/1044 [00:08<05:41,  2.99it/s, acc_step=1/1, ce_loss_token=1.8034, perplexity_token=6.0700]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   2%|█                                               | 23/1044 [00:08<05:55,  2.87it/s, acc_step=1/1, ce_loss_token=1.8017, perplexity_token=6.0602]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:08,  2.77it/s, acc_step=1/1, ce_loss_token=1.7993, perplexity_token=6.0454]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:15,  2.71it/s, acc_step=1/1, ce_loss_token=1.7975, perplexity_token=6.0347]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:26,  2.64it/s, acc_step=1/1, ce_loss_token=1.7963, perplexity_token=6.0272]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:19,  2.68it/s, acc_step=1/1, ce_loss_token=1.7949, perplexity_token=6.0190]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:34,  2.58it/s, acc_step=1/1, ce_loss_token=1.7937, perplexity_token=6.0118]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<06:33,  2.58it/s, acc_step=1/1, ce_loss_token=1.7924, perplexity_token=6.0038]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:27,  2.62it/s, acc_step=1/1, ce_loss_token=1.7915, perplexity_token=5.9987]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<06:19,  2.67it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9954]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   3%|█▍                                              | 32/1044 [00:12<06:31,  2.59it/s, acc_step=1/1, ce_loss_token=1.7901, perplexity_token=5.9903]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<06:30,  2.59it/s, acc_step=1/1, ce_loss_token=1.7900, perplexity_token=5.9893]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   3%|█▌                                              | 34/1044 [00:12<06:29,  2.59it/s, acc_step=1/1, ce_loss_token=1.7890, perplexity_token=5.9837]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<05:59,  2.80it/s, acc_step=1/1, ce_loss_token=1.7918, perplexity_token=6.0000]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<05:36,  2.99it/s, acc_step=1/1, ce_loss_token=1.7945, perplexity_token=6.0166]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|█▋                                              | 37/1044 [00:13<05:43,  2.93it/s, acc_step=1/1, ce_loss_token=1.7936, perplexity_token=6.0108]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.7923, perplexity_token=6.0031]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   4%|█▊                                              | 39/1044 [00:14<06:02,  2.77it/s, acc_step=1/1, ce_loss_token=1.7919, perplexity_token=6.0006]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   4%|█▊                                              | 40/1044 [00:14<06:08,  2.72it/s, acc_step=1/1, ce_loss_token=1.7916, perplexity_token=5.9989]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<05:38,  2.96it/s, acc_step=1/1, ce_loss_token=1.7938, perplexity_token=6.0123]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   4%|█▉                                              | 42/1044 [00:15<05:17,  3.15it/s, acc_step=1/1, ce_loss_token=1.7966, perplexity_token=6.0292]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   4%|█▉                                              | 43/1044 [00:15<05:22,  3.10it/s, acc_step=1/1, ce_loss_token=1.7957, perplexity_token=6.0237]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   4%|██                                              | 44/1044 [00:16<05:32,  3.01it/s, acc_step=1/1, ce_loss_token=1.7951, perplexity_token=6.0200]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   4%|██                                              | 45/1044 [00:16<05:40,  2.93it/s, acc_step=1/1, ce_loss_token=1.7945, perplexity_token=6.0164]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   4%|██                                              | 46/1044 [00:16<05:58,  2.78it/s, acc_step=1/1, ce_loss_token=1.7939, perplexity_token=6.0129]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<05:35,  2.97it/s, acc_step=1/1, ce_loss_token=1.7965, perplexity_token=6.0283]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:   5%|██▏                                             | 48/1044 [00:17<06:17,  2.64it/s, acc_step=1/1, ce_loss_token=1.7957, perplexity_token=6.0235]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:   5%|██▎                                             | 49/1044 [00:18<06:07,  2.70it/s, acc_step=1/1, ce_loss_token=1.7950, perplexity_token=6.0197]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<06:25,  2.58it/s, acc_step=1/1, ce_loss_token=1.7945, perplexity_token=6.0166]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▎                                             | 51/1044 [00:18<06:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.7939, perplexity_token=6.0129]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<06:33,  2.52it/s, acc_step=1/1, ce_loss_token=1.7933, perplexity_token=6.0091]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   5%|██▍                                             | 53/1044 [00:19<06:30,  2.54it/s, acc_step=1/1, ce_loss_token=1.7928, perplexity_token=6.0064]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   5%|██▍                                             | 54/1044 [00:19<05:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.7950, perplexity_token=6.0194]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<05:33,  2.97it/s, acc_step=1/1, ce_loss_token=1.7963, perplexity_token=6.0271]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   5%|██▌                                             | 56/1044 [00:20<05:38,  2.92it/s, acc_step=1/1, ce_loss_token=1.7954, perplexity_token=6.0219]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   5%|██▌                                             | 57/1044 [00:20<05:49,  2.83it/s, acc_step=1/1, ce_loss_token=1.7946, perplexity_token=6.0172]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:48,  2.83it/s, acc_step=1/1, ce_loss_token=1.7942, perplexity_token=6.0146]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   6%|██▋                                             | 59/1044 [00:21<05:56,  2.76it/s, acc_step=1/1, ce_loss_token=1.7936, perplexity_token=6.0111]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<06:10,  2.66it/s, acc_step=1/1, ce_loss_token=1.7929, perplexity_token=6.0071]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<06:28,  2.53it/s, acc_step=1/1, ce_loss_token=1.7925, perplexity_token=6.0042]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   6%|██▊                                             | 62/1044 [00:22<06:14,  2.62it/s, acc_step=1/1, ce_loss_token=1.7918, perplexity_token=6.0002]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<06:04,  2.69it/s, acc_step=1/1, ce_loss_token=1.7912, perplexity_token=5.9964]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   6%|██▉                                             | 64/1044 [00:23<05:39,  2.89it/s, acc_step=1/1, ce_loss_token=1.7924, perplexity_token=6.0037]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|██▉                                             | 65/1044 [00:23<05:43,  2.85it/s, acc_step=1/1, ce_loss_token=1.7920, perplexity_token=6.0012]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   6%|███                                             | 66/1044 [00:24<05:28,  2.98it/s, acc_step=1/1, ce_loss_token=1.7936, perplexity_token=6.0113]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   6%|███                                             | 67/1044 [00:24<05:21,  3.04it/s, acc_step=1/1, ce_loss_token=1.7949, perplexity_token=6.0191]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   7%|███▏                                            | 68/1044 [00:24<05:35,  2.91it/s, acc_step=1/1, ce_loss_token=1.7944, perplexity_token=6.0157]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   7%|███▏                                            | 69/1044 [00:25<05:23,  3.02it/s, acc_step=1/1, ce_loss_token=1.7962, perplexity_token=6.0264]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   7%|███▏                                            | 70/1044 [00:25<05:29,  2.95it/s, acc_step=1/1, ce_loss_token=1.7955, perplexity_token=6.0226]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   7%|███▎                                            | 71/1044 [00:25<05:45,  2.81it/s, acc_step=1/1, ce_loss_token=1.7950, perplexity_token=6.0194]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   7%|███▎                                            | 72/1044 [00:26<05:52,  2.76it/s, acc_step=1/1, ce_loss_token=1.7945, perplexity_token=6.0163]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   7%|███▎                                            | 73/1044 [00:26<05:29,  2.95it/s, acc_step=1/1, ce_loss_token=1.7954, perplexity_token=6.0221]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   7%|███▍                                            | 74/1044 [00:26<05:33,  2.90it/s, acc_step=1/1, ce_loss_token=1.7950, perplexity_token=6.0193]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   7%|███▍                                            | 75/1044 [00:27<05:52,  2.75it/s, acc_step=1/1, ce_loss_token=1.7946, perplexity_token=6.0173]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   7%|███▍                                            | 76/1044 [00:27<05:26,  2.96it/s, acc_step=1/1, ce_loss_token=1.7956, perplexity_token=6.0229]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<05:44,  2.81it/s, acc_step=1/1, ce_loss_token=1.7951, perplexity_token=6.0202]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   7%|███▌                                            | 78/1044 [00:28<05:27,  2.95it/s, acc_step=1/1, ce_loss_token=1.7962, perplexity_token=6.0267]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   8%|███▋                                            | 79/1044 [00:28<05:33,  2.89it/s, acc_step=1/1, ce_loss_token=1.7957, perplexity_token=6.0238]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   8%|███▋                                            | 80/1044 [00:29<05:41,  2.82it/s, acc_step=1/1, ce_loss_token=1.7952, perplexity_token=6.0208]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   8%|███▋                                            | 81/1044 [00:29<05:51,  2.74it/s, acc_step=1/1, ce_loss_token=1.7947, perplexity_token=6.0179]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   8%|███▊                                            | 82/1044 [00:29<05:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.7941, perplexity_token=6.0142]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   8%|███▊                                            | 83/1044 [00:30<05:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7938, perplexity_token=6.0122]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   8%|███▊                                            | 84/1044 [00:30<05:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7932, perplexity_token=6.0089]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   8%|███▉                                            | 85/1044 [00:30<06:11,  2.58it/s, acc_step=1/1, ce_loss_token=1.7928, perplexity_token=6.0060]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   8%|███▉                                            | 86/1044 [00:31<06:12,  2.57it/s, acc_step=1/1, ce_loss_token=1.7926, perplexity_token=6.0053]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   8%|████                                            | 87/1044 [00:31<06:24,  2.49it/s, acc_step=1/1, ce_loss_token=1.7922, perplexity_token=6.0029]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   8%|████                                            | 88/1044 [00:32<06:16,  2.54it/s, acc_step=1/1, ce_loss_token=1.7917, perplexity_token=5.9999]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   9%|████                                            | 89/1044 [00:32<05:43,  2.78it/s, acc_step=1/1, ce_loss_token=1.7927, perplexity_token=6.0059]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   9%|████▏                                           | 90/1044 [00:32<06:00,  2.65it/s, acc_step=1/1, ce_loss_token=1.7922, perplexity_token=6.0024]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   9%|████▏                                           | 91/1044 [00:33<05:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.7934, perplexity_token=6.0100]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   9%|████▏                                           | 92/1044 [00:33<05:50,  2.72it/s, acc_step=1/1, ce_loss_token=1.7929, perplexity_token=6.0071]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   9%|████▎                                           | 93/1044 [00:33<05:58,  2.66it/s, acc_step=1/1, ce_loss_token=1.7926, perplexity_token=6.0051]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   9%|████▎                                           | 94/1044 [00:34<06:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.7923, perplexity_token=6.0032]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   9%|████▎                                           | 95/1044 [00:34<05:51,  2.70it/s, acc_step=1/1, ce_loss_token=1.7920, perplexity_token=6.0017]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   9%|████▍                                           | 96/1044 [00:35<05:54,  2.68it/s, acc_step=1/1, ce_loss_token=1.7919, perplexity_token=6.0006]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   9%|████▍                                           | 97/1044 [00:35<06:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.7915, perplexity_token=5.9983]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   9%|████▌                                           | 98/1044 [00:35<06:02,  2.61it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9956]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   9%|████▌                                           | 99/1044 [00:36<05:55,  2.66it/s, acc_step=1/1, ce_loss_token=1.7906, perplexity_token=5.9932]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  10%|████▌                                          | 100/1044 [00:36<05:35,  2.81it/s, acc_step=1/1, ce_loss_token=1.7913, perplexity_token=5.9972]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  10%|████▌                                          | 101/1044 [00:36<05:27,  2.88it/s, acc_step=1/1, ce_loss_token=1.7909, perplexity_token=5.9947]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  10%|████▌                                          | 102/1044 [00:37<05:13,  3.00it/s, acc_step=1/1, ce_loss_token=1.7916, perplexity_token=5.9992]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  10%|████▋                                          | 103/1044 [00:37<04:58,  3.16it/s, acc_step=1/1, ce_loss_token=1.7924, perplexity_token=6.0036]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▋                                          | 104/1044 [00:37<05:08,  3.04it/s, acc_step=1/1, ce_loss_token=1.7921, perplexity_token=6.0018]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  10%|████▋                                          | 105/1044 [00:38<05:02,  3.11it/s, acc_step=1/1, ce_loss_token=1.7927, perplexity_token=6.0057]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  10%|████▊                                          | 106/1044 [00:38<05:02,  3.10it/s, acc_step=1/1, ce_loss_token=1.7923, perplexity_token=6.0035]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  10%|████▊                                          | 107/1044 [00:38<05:17,  2.95it/s, acc_step=1/1, ce_loss_token=1.7920, perplexity_token=6.0015]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  10%|████▊                                          | 108/1044 [00:39<05:22,  2.90it/s, acc_step=1/1, ce_loss_token=1.7916, perplexity_token=5.9991]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▉                                          | 109/1044 [00:39<05:27,  2.86it/s, acc_step=1/1, ce_loss_token=1.7913, perplexity_token=5.9975]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  11%|████▉                                          | 110/1044 [00:39<05:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.7911, perplexity_token=5.9961]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  11%|████▉                                          | 111/1044 [00:40<05:49,  2.67it/s, acc_step=1/1, ce_loss_token=1.7907, perplexity_token=5.9937]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  11%|█████                                          | 112/1044 [00:40<05:43,  2.71it/s, acc_step=1/1, ce_loss_token=1.7904, perplexity_token=5.9919]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  11%|█████                                          | 113/1044 [00:41<05:47,  2.68it/s, acc_step=1/1, ce_loss_token=1.7901, perplexity_token=5.9901]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:41<05:31,  2.81it/s, acc_step=1/1, ce_loss_token=1.7911, perplexity_token=5.9961]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:41<05:47,  2.67it/s, acc_step=1/1, ce_loss_token=1.7908, perplexity_token=5.9944]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:42<05:27,  2.84it/s, acc_step=1/1, ce_loss_token=1.7915, perplexity_token=5.9984]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:42<05:40,  2.72it/s, acc_step=1/1, ce_loss_token=1.7912, perplexity_token=5.9968]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:42<05:15,  2.93it/s, acc_step=1/1, ce_loss_token=1.7919, perplexity_token=6.0006]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:43<05:38,  2.73it/s, acc_step=1/1, ce_loss_token=1.7915, perplexity_token=5.9986]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:43<05:57,  2.58it/s, acc_step=1/1, ce_loss_token=1.7913, perplexity_token=5.9974]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:44<06:06,  2.52it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9957]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:44<04:57,  3.10it/s, acc_step=1/1, ce_loss_token=1.7927, perplexity_token=6.0057]

torch.Size([256, 292, 35]) torch.Size([256, 292])
torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:44<04:58,  3.09it/s, acc_step=1/1, ce_loss_token=1.7924, perplexity_token=6.0039]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:45<05:02,  3.04it/s, acc_step=1/1, ce_loss_token=1.7922, perplexity_token=6.0027]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:45<04:47,  3.19it/s, acc_step=1/1, ce_loss_token=1.7932, perplexity_token=6.0088]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:45<05:07,  2.99it/s, acc_step=1/1, ce_loss_token=1.7929, perplexity_token=6.0067]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:46<05:12,  2.93it/s, acc_step=1/1, ce_loss_token=1.7925, perplexity_token=6.0047]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:46<05:18,  2.88it/s, acc_step=1/1, ce_loss_token=1.7921, perplexity_token=6.0021]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:47<05:30,  2.77it/s, acc_step=1/1, ce_loss_token=1.7918, perplexity_token=6.0005]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:47<05:32,  2.75it/s, acc_step=1/1, ce_loss_token=1.7915, perplexity_token=5.9986]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:47<05:49,  2.61it/s, acc_step=1/1, ce_loss_token=1.7913, perplexity_token=5.9972]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:48<05:54,  2.57it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9954]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  13%|██████                                         | 134/1044 [00:48<07:04,  2.14it/s, acc_step=1/1, ce_loss_token=1.7908, perplexity_token=5.9940]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  13%|██████                                         | 135/1044 [00:49<06:54,  2.19it/s, acc_step=1/1, ce_loss_token=1.7905, perplexity_token=5.9922]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  13%|██████                                         | 136/1044 [00:49<06:33,  2.31it/s, acc_step=1/1, ce_loss_token=1.7901, perplexity_token=5.9902]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:50<07:11,  2.10it/s, acc_step=1/1, ce_loss_token=1.7898, perplexity_token=5.9885]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:50<06:39,  2.27it/s, acc_step=1/1, ce_loss_token=1.7903, perplexity_token=5.9912]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<06:17,  2.40it/s, acc_step=1/1, ce_loss_token=1.7900, perplexity_token=5.9893]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:51<06:04,  2.48it/s, acc_step=1/1, ce_loss_token=1.7898, perplexity_token=5.9884]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:51<06:01,  2.50it/s, acc_step=1/1, ce_loss_token=1.7896, perplexity_token=5.9871]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<06:07,  2.45it/s, acc_step=1/1, ce_loss_token=1.7894, perplexity_token=5.9859]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:52<06:07,  2.45it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9846]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:52<05:27,  2.75it/s, acc_step=1/1, ce_loss_token=1.7898, perplexity_token=5.9880]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<05:22,  2.79it/s, acc_step=1/1, ce_loss_token=1.7896, perplexity_token=5.9868]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:53<04:58,  3.01it/s, acc_step=1/1, ce_loss_token=1.7901, perplexity_token=5.9903]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:53<05:09,  2.90it/s, acc_step=1/1, ce_loss_token=1.7899, perplexity_token=5.9890]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:54<05:17,  2.82it/s, acc_step=1/1, ce_loss_token=1.7897, perplexity_token=5.9874]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:54<05:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.7895, perplexity_token=5.9862]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:54<05:20,  2.79it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9845]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:55<05:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9830]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:55<05:08,  2.90it/s, acc_step=1/1, ce_loss_token=1.7895, perplexity_token=5.9866]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:56<05:12,  2.85it/s, acc_step=1/1, ce_loss_token=1.7893, perplexity_token=5.9850]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:56<05:24,  2.75it/s, acc_step=1/1, ce_loss_token=1.7890, perplexity_token=5.9835]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:56<05:35,  2.65it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9827]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  15%|███████                                        | 156/1044 [00:57<05:20,  2.77it/s, acc_step=1/1, ce_loss_token=1.7893, perplexity_token=5.9854]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  15%|███████                                        | 157/1044 [00:57<05:16,  2.80it/s, acc_step=1/1, ce_loss_token=1.7891, perplexity_token=5.9841]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  15%|███████                                        | 158/1044 [00:57<05:08,  2.87it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9830]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:58<05:08,  2.87it/s, acc_step=1/1, ce_loss_token=1.7887, perplexity_token=5.9817]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:58<05:06,  2.88it/s, acc_step=1/1, ce_loss_token=1.7885, perplexity_token=5.9803]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  15%|███████▏                                       | 161/1044 [00:58<05:13,  2.81it/s, acc_step=1/1, ce_loss_token=1.7882, perplexity_token=5.9789]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▎                                       | 162/1044 [00:59<05:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.7880, perplexity_token=5.9775]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  16%|███████▎                                       | 163/1044 [00:59<05:24,  2.72it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9759]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:00<05:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9751]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:00<05:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9739]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:00<05:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9733]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:01<05:29,  2.67it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9717]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:01<05:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9706]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:01<05:22,  2.71it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9694]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:02<05:23,  2.70it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9686]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:02<05:06,  2.85it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9714]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:02<05:08,  2.83it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9703]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:03<05:25,  2.67it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9731]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:03<05:08,  2.82it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9761]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:03<04:48,  3.02it/s, acc_step=1/1, ce_loss_token=1.7882, perplexity_token=5.9787]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:04<04:40,  3.10it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9830]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:04<05:01,  2.87it/s, acc_step=1/1, ce_loss_token=1.7887, perplexity_token=5.9819]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  17%|████████                                       | 178/1044 [01:04<04:46,  3.02it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9846]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  17%|████████                                       | 179/1044 [01:05<04:55,  2.92it/s, acc_step=1/1, ce_loss_token=1.7890, perplexity_token=5.9836]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  17%|████████                                       | 180/1044 [01:05<05:05,  2.83it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9826]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:06<05:00,  2.88it/s, acc_step=1/1, ce_loss_token=1.7888, perplexity_token=5.9822]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:06<05:13,  2.75it/s, acc_step=1/1, ce_loss_token=1.7886, perplexity_token=5.9810]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:06<05:09,  2.78it/s, acc_step=1/1, ce_loss_token=1.7885, perplexity_token=5.9803]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:07<05:16,  2.72it/s, acc_step=1/1, ce_loss_token=1.7883, perplexity_token=5.9794]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:07<05:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.7882, perplexity_token=5.9784]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:07<04:56,  2.90it/s, acc_step=1/1, ce_loss_token=1.7886, perplexity_token=5.9811]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:08<04:43,  3.02it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9848]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:08<03:58,  3.58it/s, acc_step=1/1, ce_loss_token=1.7912, perplexity_token=5.9964]

torch.Size([256, 289, 35]) torch.Size([256, 289])
torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:08<04:11,  3.40it/s, acc_step=1/1, ce_loss_token=1.7916, perplexity_token=5.9990]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:09<04:20,  3.27it/s, acc_step=1/1, ce_loss_token=1.7914, perplexity_token=5.9980]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:09<04:31,  3.13it/s, acc_step=1/1, ce_loss_token=1.7912, perplexity_token=5.9969]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:09<04:39,  3.04it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9955]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:10<04:42,  3.01it/s, acc_step=1/1, ce_loss_token=1.7908, perplexity_token=5.9940]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:10<05:03,  2.79it/s, acc_step=1/1, ce_loss_token=1.7905, perplexity_token=5.9927]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:10<04:48,  2.94it/s, acc_step=1/1, ce_loss_token=1.7909, perplexity_token=5.9948]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:11<04:31,  3.12it/s, acc_step=1/1, ce_loss_token=1.7914, perplexity_token=5.9980]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:11<04:45,  2.97it/s, acc_step=1/1, ce_loss_token=1.7912, perplexity_token=5.9965]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:12<04:55,  2.86it/s, acc_step=1/1, ce_loss_token=1.7910, perplexity_token=5.9953]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  19%|█████████                                      | 200/1044 [01:12<05:04,  2.77it/s, acc_step=1/1, ce_loss_token=1.7908, perplexity_token=5.9945]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  19%|█████████                                      | 201/1044 [01:12<05:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.7907, perplexity_token=5.9936]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  19%|█████████                                      | 202/1044 [01:13<05:10,  2.71it/s, acc_step=1/1, ce_loss_token=1.7906, perplexity_token=5.9930]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:13<05:13,  2.68it/s, acc_step=1/1, ce_loss_token=1.7904, perplexity_token=5.9921]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:13<05:14,  2.67it/s, acc_step=1/1, ce_loss_token=1.7903, perplexity_token=5.9911]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:14<05:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.7901, perplexity_token=5.9901]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:14<05:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7899, perplexity_token=5.9889]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:15<05:04,  2.75it/s, acc_step=1/1, ce_loss_token=1.7897, perplexity_token=5.9877]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:15<05:04,  2.74it/s, acc_step=1/1, ce_loss_token=1.7896, perplexity_token=5.9868]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:15<05:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.7894, perplexity_token=5.9862]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:16<04:46,  2.91it/s, acc_step=1/1, ce_loss_token=1.7898, perplexity_token=5.9881]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:16<04:32,  3.06it/s, acc_step=1/1, ce_loss_token=1.7904, perplexity_token=5.9917]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:16<04:37,  3.00it/s, acc_step=1/1, ce_loss_token=1.7902, perplexity_token=5.9909]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:16<04:22,  3.16it/s, acc_step=1/1, ce_loss_token=1.7907, perplexity_token=5.9939]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:17<04:31,  3.05it/s, acc_step=1/1, ce_loss_token=1.7905, perplexity_token=5.9927]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:17<04:48,  2.88it/s, acc_step=1/1, ce_loss_token=1.7904, perplexity_token=5.9920]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:18<04:45,  2.90it/s, acc_step=1/1, ce_loss_token=1.7902, perplexity_token=5.9907]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:18<04:28,  3.08it/s, acc_step=1/1, ce_loss_token=1.7906, perplexity_token=5.9928]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:18<05:04,  2.71it/s, acc_step=1/1, ce_loss_token=1.7904, perplexity_token=5.9918]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:19<05:02,  2.73it/s, acc_step=1/1, ce_loss_token=1.7903, perplexity_token=5.9910]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:19<05:03,  2.71it/s, acc_step=1/1, ce_loss_token=1.7900, perplexity_token=5.9897]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:19<05:09,  2.66it/s, acc_step=1/1, ce_loss_token=1.7899, perplexity_token=5.9887]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:20<05:06,  2.68it/s, acc_step=1/1, ce_loss_token=1.7897, perplexity_token=5.9878]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  21%|██████████                                     | 223/1044 [01:20<05:07,  2.67it/s, acc_step=1/1, ce_loss_token=1.7895, perplexity_token=5.9865]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  21%|██████████                                     | 224/1044 [01:21<04:58,  2.74it/s, acc_step=1/1, ce_loss_token=1.7893, perplexity_token=5.9855]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:21<04:43,  2.89it/s, acc_step=1/1, ce_loss_token=1.7897, perplexity_token=5.9878]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:21<04:50,  2.81it/s, acc_step=1/1, ce_loss_token=1.7895, perplexity_token=5.9863]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:22<04:57,  2.75it/s, acc_step=1/1, ce_loss_token=1.7893, perplexity_token=5.9854]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:22<04:57,  2.75it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9844]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:22<05:05,  2.67it/s, acc_step=1/1, ce_loss_token=1.7890, perplexity_token=5.9836]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:23<05:17,  2.57it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9826]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:23<05:09,  2.63it/s, acc_step=1/1, ce_loss_token=1.7887, perplexity_token=5.9818]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:23<04:44,  2.86it/s, acc_step=1/1, ce_loss_token=1.7891, perplexity_token=5.9838]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:24<04:28,  3.02it/s, acc_step=1/1, ce_loss_token=1.7896, perplexity_token=5.9869]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:24<04:36,  2.93it/s, acc_step=1/1, ce_loss_token=1.7894, perplexity_token=5.9858]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:24<04:36,  2.93it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9848]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:25<04:40,  2.88it/s, acc_step=1/1, ce_loss_token=1.7891, perplexity_token=5.9838]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:25<04:49,  2.79it/s, acc_step=1/1, ce_loss_token=1.7889, perplexity_token=5.9830]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:26<04:50,  2.78it/s, acc_step=1/1, ce_loss_token=1.7887, perplexity_token=5.9818]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:26<04:45,  2.82it/s, acc_step=1/1, ce_loss_token=1.7892, perplexity_token=5.9847]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:26<04:55,  2.72it/s, acc_step=1/1, ce_loss_token=1.7891, perplexity_token=5.9840]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:27<05:07,  2.61it/s, acc_step=1/1, ce_loss_token=1.7890, perplexity_token=5.9833]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:27<05:02,  2.66it/s, acc_step=1/1, ce_loss_token=1.7888, perplexity_token=5.9823]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:27<05:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7886, perplexity_token=5.9813]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:28<05:15,  2.54it/s, acc_step=1/1, ce_loss_token=1.7885, perplexity_token=5.9804]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  23%|███████████                                    | 245/1044 [01:28<05:18,  2.51it/s, acc_step=1/1, ce_loss_token=1.7883, perplexity_token=5.9793]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  24%|███████████                                    | 246/1044 [01:29<06:13,  2.13it/s, acc_step=1/1, ce_loss_token=1.7881, perplexity_token=5.9780]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  24%|███████████                                    | 247/1044 [01:29<05:41,  2.33it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9771]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:30<05:23,  2.46it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9764]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:30<05:13,  2.53it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9754]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:30<04:49,  2.74it/s, acc_step=1/1, ce_loss_token=1.7880, perplexity_token=5.9772]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:31<04:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9760]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:31<05:04,  2.60it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9755]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:31<04:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9744]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:32<04:58,  2.64it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9735]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:32<04:53,  2.69it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9725]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:32<04:51,  2.70it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9747]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:33<05:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9740]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:33<05:13,  2.51it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9732]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:34<04:48,  2.72it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9758]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:34<04:52,  2.68it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9748]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:34<04:30,  2.89it/s, acc_step=1/1, ce_loss_token=1.7880, perplexity_token=5.9773]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:35<04:30,  2.89it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9765]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:35<04:45,  2.73it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9758]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:35<04:43,  2.75it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9748]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:36<04:46,  2.72it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9743]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:36<04:47,  2.71it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9736]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  26%|████████████                                   | 267/1044 [01:37<05:42,  2.27it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9725]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  26%|████████████                                   | 268/1044 [01:37<05:41,  2.27it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9719]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  26%|████████████                                   | 269/1044 [01:38<05:42,  2.27it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9710]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:38<05:27,  2.37it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9701]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:38<04:58,  2.59it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9725]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:39<04:51,  2.65it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9721]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:39<04:29,  2.86it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9741]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:39<04:13,  3.04it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9757]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:40<04:24,  2.90it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9749]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:40<04:35,  2.79it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9741]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:40<04:34,  2.79it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9734]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:41<04:41,  2.72it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9729]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:41<04:40,  2.73it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9721]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:42<05:14,  2.43it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9713]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:42<05:01,  2.53it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9707]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:42<05:01,  2.53it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9699]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:43<04:46,  2.66it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9694]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:43<04:32,  2.79it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9708]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:43<04:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9725]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:44<04:27,  2.83it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9718]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:44<04:35,  2.75it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9712]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:45<04:39,  2.70it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9704]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:45<04:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9697]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:45<04:48,  2.61it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9691]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:46<04:26,  2.82it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9717]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:46<04:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9709]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:47<05:14,  2.39it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9704]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:47<05:03,  2.47it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9698]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:47<04:59,  2.50it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9694]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:48<05:07,  2.43it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9684]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:48<04:56,  2.52it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9679]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:48<04:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9694]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:49<04:36,  2.69it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9686]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:49<04:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9677]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:50<04:44,  2.61it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9671]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:50<04:42,  2.63it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9665]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:50<04:23,  2.81it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9678]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:51<04:20,  2.84it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9669]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:51<04:29,  2.74it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9663]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:51<04:40,  2.63it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9656]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:52<04:35,  2.67it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9647]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:52<04:38,  2.65it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9639]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:53<04:43,  2.60it/s, acc_step=1/1, ce_loss_token=1.7856, perplexity_token=5.9633]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:53<04:37,  2.65it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9626]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:53<04:34,  2.67it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9649]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:54<04:36,  2.64it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9644]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:54<03:53,  3.13it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9707]

torch.Size([256, 293, 35]) torch.Size([256, 293])
torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:55<04:14,  2.86it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9700]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:55<04:34,  2.65it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9695]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:55<04:29,  2.69it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9688]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:56<04:35,  2.63it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9679]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:56<04:19,  2.80it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9701]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:57<04:22,  2.76it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9696]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  31%|██████████████▍                                | 322/1044 [01:57<03:34,  3.37it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9759]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  31%|██████████████▌                                | 323/1044 [01:57<03:49,  3.14it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9750]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  31%|██████████████▌                                | 324/1044 [01:58<04:09,  2.89it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9742]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  31%|██████████████▋                                | 325/1044 [01:58<04:25,  2.71it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9735]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  31%|██████████████▋                                | 326/1044 [01:59<04:46,  2.50it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9729]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  31%|██████████████▋                                | 327/1044 [01:59<04:40,  2.55it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9723]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  31%|██████████████▊                                | 328/1044 [01:59<04:36,  2.59it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9717]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:00<04:35,  2.60it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9710]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:00<04:34,  2.60it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9704]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:01<04:47,  2.48it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9698]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:01<04:38,  2.56it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9692]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:01<04:13,  2.81it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9708]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  32%|███████████████                                | 334/1044 [02:02<04:17,  2.75it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9702]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  32%|███████████████                                | 335/1044 [02:02<04:13,  2.79it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9698]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:02<04:17,  2.75it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9692]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:03<04:21,  2.70it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9684]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:03<04:18,  2.73it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9678]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:03<03:58,  2.95it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9692]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:04<04:06,  2.86it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9686]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:04<04:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9682]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:05<04:22,  2.68it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9676]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:05<04:29,  2.60it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9671]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:05<04:27,  2.62it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9663]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:06<04:25,  2.63it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9657]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:06<04:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9651]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:06<04:06,  2.83it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9665]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:07<04:06,  2.83it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9657]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:07<04:13,  2.74it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9652]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:07<04:11,  2.76it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9647]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:08<04:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9639]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:08<03:55,  2.94it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9656]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:08<03:55,  2.94it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9648]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:09<03:58,  2.89it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9640]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:09<04:00,  2.86it/s, acc_step=1/1, ce_loss_token=1.7856, perplexity_token=5.9633]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  34%|████████████████                               | 356/1044 [02:10<04:07,  2.78it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9628]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  34%|████████████████                               | 357/1044 [02:10<03:59,  2.87it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9641]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  34%|████████████████                               | 358/1044 [02:10<04:40,  2.45it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9635]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:11<04:26,  2.57it/s, acc_step=1/1, ce_loss_token=1.7856, perplexity_token=5.9629]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:11<04:22,  2.61it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9624]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:12<04:13,  2.70it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9617]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:12<03:46,  3.01it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9667]

torch.Size([256, 306, 35]) torch.Size([256, 306])
torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:13<04:27,  2.55it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9661]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:13<04:21,  2.59it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9657]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:13<04:16,  2.64it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9652]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:14<04:02,  2.79it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9668]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:14<04:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9662]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:15<03:38,  3.09it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9720]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:15<03:48,  2.95it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9712]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:15<03:31,  3.17it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9726]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:16<03:27,  3.24it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9744]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:16<03:36,  3.09it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9738]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:16<03:31,  3.16it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9751]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:17<03:21,  3.31it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9763]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:17<03:39,  3.04it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9757]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:17<03:51,  2.88it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9753]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:18<03:37,  3.06it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9766]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:18<03:27,  3.20it/s, acc_step=1/1, ce_loss_token=1.7881, perplexity_token=5.9783]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:18<03:33,  3.11it/s, acc_step=1/1, ce_loss_token=1.7880, perplexity_token=5.9777]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:19<03:43,  2.97it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9771]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:19<03:54,  2.81it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9765]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:19<04:06,  2.68it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9759]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:20<04:06,  2.67it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9753]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:20<03:48,  2.89it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9771]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:20<03:56,  2.78it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9766]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:21<03:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9759]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:21<04:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9751]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:22<04:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9746]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:22<03:45,  2.90it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9761]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:22<03:49,  2.84it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9754]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:23<03:35,  3.02it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9771]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:23<03:45,  2.88it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9767]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:23<03:28,  3.11it/s, acc_step=1/1, ce_loss_token=1.7880, perplexity_token=5.9777]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:24<03:37,  2.98it/s, acc_step=1/1, ce_loss_token=1.7879, perplexity_token=5.9770]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:24<03:45,  2.87it/s, acc_step=1/1, ce_loss_token=1.7878, perplexity_token=5.9763]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:24<03:42,  2.90it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9758]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:25<03:42,  2.90it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9749]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:25<03:39,  2.94it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9745]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:25<03:42,  2.89it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9739]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:26<03:59,  2.68it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9733]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:26<03:43,  2.87it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9745]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:26<03:29,  3.05it/s, acc_step=1/1, ce_loss_token=1.7877, perplexity_token=5.9756]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:27<03:50,  2.77it/s, acc_step=1/1, ce_loss_token=1.7876, perplexity_token=5.9749]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:27<03:46,  2.81it/s, acc_step=1/1, ce_loss_token=1.7875, perplexity_token=5.9742]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:27<03:40,  2.89it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9737]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:28<03:44,  2.83it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9731]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:28<03:30,  3.02it/s, acc_step=1/1, ce_loss_token=1.7874, perplexity_token=5.9740]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:28<03:39,  2.89it/s, acc_step=1/1, ce_loss_token=1.7873, perplexity_token=5.9733]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:29<03:48,  2.77it/s, acc_step=1/1, ce_loss_token=1.7872, perplexity_token=5.9726]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:29<03:51,  2.73it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9722]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:30<04:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.7870, perplexity_token=5.9717]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:30<04:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9712]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:30<03:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.7869, perplexity_token=5.9707]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:31<03:59,  2.62it/s, acc_step=1/1, ce_loss_token=1.7868, perplexity_token=5.9703]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:31<03:56,  2.65it/s, acc_step=1/1, ce_loss_token=1.7867, perplexity_token=5.9697]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:32<03:51,  2.71it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9690]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:32<03:48,  2.73it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9685]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:32<04:18,  2.41it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9683]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:33<04:04,  2.54it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9678]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:33<03:53,  2.66it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9673]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:33<03:57,  2.62it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9668]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:34<03:35,  2.88it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9683]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:34<03:36,  2.87it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9676]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:34<03:31,  2.93it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9670]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:35<03:36,  2.86it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9665]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:35<03:20,  3.07it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9680]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:35<03:11,  3.22it/s, acc_step=1/1, ce_loss_token=1.7866, perplexity_token=5.9688]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:36<03:17,  3.10it/s, acc_step=1/1, ce_loss_token=1.7865, perplexity_token=5.9684]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:36<03:22,  3.03it/s, acc_step=1/1, ce_loss_token=1.7864, perplexity_token=5.9678]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:36<03:27,  2.96it/s, acc_step=1/1, ce_loss_token=1.7863, perplexity_token=5.9672]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:37<03:26,  2.96it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9668]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:37<03:33,  2.86it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9662]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:37<03:32,  2.86it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9658]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:38<03:32,  2.86it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9654]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:38<03:21,  3.01it/s, acc_step=1/1, ce_loss_token=1.7862, perplexity_token=5.9666]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:38<03:23,  2.98it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9660]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:39<03:59,  2.53it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9655]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:39<03:49,  2.63it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9651]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:40<03:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9645]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:40<03:32,  2.84it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9655]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:40<03:33,  2.82it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9650]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:41<03:36,  2.77it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9644]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:41<03:22,  2.96it/s, acc_step=1/1, ce_loss_token=1.7861, perplexity_token=5.9660]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:41<03:26,  2.90it/s, acc_step=1/1, ce_loss_token=1.7860, perplexity_token=5.9654]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:42<03:27,  2.87it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9650]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:42<03:49,  2.60it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9645]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:43<03:49,  2.59it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9641]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:43<03:46,  2.63it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9636]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:43<03:48,  2.60it/s, acc_step=1/1, ce_loss_token=1.7856, perplexity_token=5.9633]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:44<03:42,  2.66it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9628]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:44<03:38,  2.71it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9626]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:44<03:38,  2.71it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9621]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:45<03:38,  2.69it/s, acc_step=1/1, ce_loss_token=1.7853, perplexity_token=5.9615]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:45<03:46,  2.60it/s, acc_step=1/1, ce_loss_token=1.7853, perplexity_token=5.9613]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:46<03:37,  2.70it/s, acc_step=1/1, ce_loss_token=1.7852, perplexity_token=5.9607]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:46<03:19,  2.93it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9617]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:46<03:36,  2.70it/s, acc_step=1/1, ce_loss_token=1.7853, perplexity_token=5.9613]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:47<03:31,  2.76it/s, acc_step=1/1, ce_loss_token=1.7852, perplexity_token=5.9610]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:47<03:30,  2.77it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9604]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:47<03:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9599]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:48<03:20,  2.89it/s, acc_step=1/1, ce_loss_token=1.7852, perplexity_token=5.9610]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:48<03:22,  2.87it/s, acc_step=1/1, ce_loss_token=1.7852, perplexity_token=5.9606]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:48<03:29,  2.76it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9600]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:49<03:28,  2.77it/s, acc_step=1/1, ce_loss_token=1.7850, perplexity_token=5.9594]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:49<03:16,  2.94it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9604]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:50<02:51,  3.35it/s, acc_step=1/1, ce_loss_token=1.7859, perplexity_token=5.9647]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:50<03:04,  3.12it/s, acc_step=1/1, ce_loss_token=1.7858, perplexity_token=5.9641]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:50<03:10,  3.01it/s, acc_step=1/1, ce_loss_token=1.7857, perplexity_token=5.9635]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:51<03:17,  2.90it/s, acc_step=1/1, ce_loss_token=1.7856, perplexity_token=5.9631]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:51<03:29,  2.73it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9626]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:51<03:30,  2.71it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9619]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:52<03:16,  2.90it/s, acc_step=1/1, ce_loss_token=1.7855, perplexity_token=5.9627]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:52<03:22,  2.81it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9622]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:53<03:33,  2.66it/s, acc_step=1/1, ce_loss_token=1.7854, perplexity_token=5.9618]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:53<03:41,  2.56it/s, acc_step=1/1, ce_loss_token=1.7853, perplexity_token=5.9613]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:53<03:34,  2.63it/s, acc_step=1/1, ce_loss_token=1.7852, perplexity_token=5.9608]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:54<03:34,  2.63it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9603]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:54<03:34,  2.63it/s, acc_step=1/1, ce_loss_token=1.7851, perplexity_token=5.9599]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:54<03:33,  2.63it/s, acc_step=1/1, ce_loss_token=1.7850, perplexity_token=5.9594]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:55<03:32,  2.64it/s, acc_step=1/1, ce_loss_token=1.7849, perplexity_token=5.9588]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:55<03:35,  2.60it/s, acc_step=1/1, ce_loss_token=1.7848, perplexity_token=5.9583]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [02:56<03:37,  2.57it/s, acc_step=1/1, ce_loss_token=1.7847, perplexity_token=5.9579]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [02:56<03:48,  2.45it/s, acc_step=1/1, ce_loss_token=1.7846, perplexity_token=5.9574]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [02:57<03:45,  2.47it/s, acc_step=1/1, ce_loss_token=1.7845, perplexity_token=5.9569]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [02:57<03:41,  2.51it/s, acc_step=1/1, ce_loss_token=1.7845, perplexity_token=5.9565]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  47%|██████████████████████                         | 489/1044 [02:57<03:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.7844, perplexity_token=5.9561]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|██████████████████████                         | 490/1044 [02:58<03:35,  2.57it/s, acc_step=1/1, ce_loss_token=1.7843, perplexity_token=5.9556]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  47%|██████████████████████                         | 491/1044 [02:58<03:34,  2.57it/s, acc_step=1/1, ce_loss_token=1.7843, perplexity_token=5.9552]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [02:58<03:32,  2.60it/s, acc_step=1/1, ce_loss_token=1.7842, perplexity_token=5.9547]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [02:59<03:25,  2.68it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9543]

torch.Size([256, 272, 35]) torch.Size([256, 272])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [02:59<03:15,  2.82it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9539]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:00<03:32,  2.59it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9533]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:00<03:32,  2.57it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9527]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:00<03:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9522]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:01<03:15,  2.79it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9530]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:01<03:05,  2.95it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9544]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:01<03:00,  3.02it/s, acc_step=1/1, ce_loss_token=1.7843, perplexity_token=5.9555]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:02<03:17,  2.75it/s, acc_step=1/1, ce_loss_token=1.7842, perplexity_token=5.9550]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:02<03:31,  2.56it/s, acc_step=1/1, ce_loss_token=1.7842, perplexity_token=5.9546]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:02<03:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9539]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:03<03:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9536]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:03<03:07,  2.88it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9544]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:04<03:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9540]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:04<03:30,  2.55it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9537]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:04<03:08,  2.84it/s, acc_step=1/1, ce_loss_token=1.7842, perplexity_token=5.9546]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:05<03:06,  2.86it/s, acc_step=1/1, ce_loss_token=1.7841, perplexity_token=5.9541]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:05<03:15,  2.73it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9538]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:05<03:12,  2.77it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9535]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:06<03:10,  2.79it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9529]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:06<03:11,  2.77it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9523]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:06<03:19,  2.65it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9519]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:07<03:02,  2.90it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9527]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:07<03:18,  2.66it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9522]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:07<03:07,  2.81it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9531]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:08<03:06,  2.82it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9526]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:08<03:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9522]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:09<03:06,  2.81it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9517]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:09<03:09,  2.76it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9525]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:09<03:19,  2.62it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9521]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:10<03:22,  2.58it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9517]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:10<03:22,  2.57it/s, acc_step=1/1, ce_loss_token=1.7836, perplexity_token=5.9513]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:10<03:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.7836, perplexity_token=5.9510]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:11<03:23,  2.55it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9505]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:11<03:21,  2.57it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9501]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:12<03:13,  2.66it/s, acc_step=1/1, ce_loss_token=1.7833, perplexity_token=5.9496]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:12<03:14,  2.65it/s, acc_step=1/1, ce_loss_token=1.7832, perplexity_token=5.9489]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:12<03:03,  2.80it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9501]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:13<02:49,  3.03it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9509]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:13<03:00,  2.84it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9506]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:13<02:55,  2.91it/s, acc_step=1/1, ce_loss_token=1.7836, perplexity_token=5.9514]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:14<03:04,  2.77it/s, acc_step=1/1, ce_loss_token=1.7836, perplexity_token=5.9511]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:14<03:03,  2.78it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9507]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:14<03:03,  2.77it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9503]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:15<03:05,  2.73it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9498]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:15<03:07,  2.70it/s, acc_step=1/1, ce_loss_token=1.7833, perplexity_token=5.9493]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:16<03:15,  2.58it/s, acc_step=1/1, ce_loss_token=1.7832, perplexity_token=5.9489]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:16<02:57,  2.84it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9501]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:16<02:30,  3.34it/s, acc_step=1/1, ce_loss_token=1.7840, perplexity_token=5.9536]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:17<02:33,  3.26it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9532]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:17<02:46,  3.00it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9529]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:18<02:50,  2.93it/s, acc_step=1/1, ce_loss_token=1.7838, perplexity_token=5.9525]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:18<02:56,  2.82it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9520]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:18<03:01,  2.74it/s, acc_step=1/1, ce_loss_token=1.7837, perplexity_token=5.9516]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:19<03:00,  2.75it/s, acc_step=1/1, ce_loss_token=1.7836, perplexity_token=5.9512]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:19<03:02,  2.72it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9507]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:19<03:02,  2.70it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9502]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:20<03:01,  2.72it/s, acc_step=1/1, ce_loss_token=1.7833, perplexity_token=5.9497]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:20<03:07,  2.62it/s, acc_step=1/1, ce_loss_token=1.7833, perplexity_token=5.9492]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:20<02:52,  2.85it/s, acc_step=1/1, ce_loss_token=1.7834, perplexity_token=5.9498]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:21<03:17,  2.48it/s, acc_step=1/1, ce_loss_token=1.7833, perplexity_token=5.9494]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:21<03:16,  2.49it/s, acc_step=1/1, ce_loss_token=1.7832, perplexity_token=5.9490]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:22<03:04,  2.64it/s, acc_step=1/1, ce_loss_token=1.7831, perplexity_token=5.9485]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:22<03:06,  2.62it/s, acc_step=1/1, ce_loss_token=1.7831, perplexity_token=5.9481]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:23<03:13,  2.51it/s, acc_step=1/1, ce_loss_token=1.7830, perplexity_token=5.9476]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:23<03:12,  2.52it/s, acc_step=1/1, ce_loss_token=1.7829, perplexity_token=5.9473]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:23<03:08,  2.57it/s, acc_step=1/1, ce_loss_token=1.7829, perplexity_token=5.9471]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:24<03:10,  2.53it/s, acc_step=1/1, ce_loss_token=1.7828, perplexity_token=5.9466]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:24<03:06,  2.58it/s, acc_step=1/1, ce_loss_token=1.7827, perplexity_token=5.9461]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:24<03:05,  2.59it/s, acc_step=1/1, ce_loss_token=1.7826, perplexity_token=5.9456]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:25<03:02,  2.63it/s, acc_step=1/1, ce_loss_token=1.7826, perplexity_token=5.9451]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:25<03:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.7825, perplexity_token=5.9448]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:26<02:55,  2.73it/s, acc_step=1/1, ce_loss_token=1.7825, perplexity_token=5.9444]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:26<02:53,  2.75it/s, acc_step=1/1, ce_loss_token=1.7824, perplexity_token=5.9440]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:26<03:01,  2.62it/s, acc_step=1/1, ce_loss_token=1.7823, perplexity_token=5.9436]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:27<02:58,  2.67it/s, acc_step=1/1, ce_loss_token=1.7823, perplexity_token=5.9433]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:27<03:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.7822, perplexity_token=5.9429]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:27<02:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9426]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:28<02:56,  2.67it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9423]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:28<02:47,  2.82it/s, acc_step=1/1, ce_loss_token=1.7823, perplexity_token=5.9434]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:28<02:45,  2.84it/s, acc_step=1/1, ce_loss_token=1.7822, perplexity_token=5.9430]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:29<02:45,  2.83it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9425]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:29<02:44,  2.85it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9422]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:30<02:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9418]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:30<02:51,  2.72it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9414]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:30<02:37,  2.95it/s, acc_step=1/1, ce_loss_token=1.7822, perplexity_token=5.9426]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:31<02:38,  2.92it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9422]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:31<02:39,  2.91it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9419]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:31<02:47,  2.75it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9414]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:32<02:50,  2.70it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9409]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:32<02:36,  2.94it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9418]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:32<02:34,  2.96it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9415]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:33<02:25,  3.15it/s, acc_step=1/1, ce_loss_token=1.7822, perplexity_token=5.9426]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:33<02:33,  2.99it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9423]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:33<02:34,  2.95it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9417]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:34<02:39,  2.86it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9415]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:34<02:29,  3.05it/s, acc_step=1/1, ce_loss_token=1.7821, perplexity_token=5.9422]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:34<02:30,  3.00it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9419]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:35<02:45,  2.72it/s, acc_step=1/1, ce_loss_token=1.7820, perplexity_token=5.9417]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:35<02:43,  2.75it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9413]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:35<02:45,  2.73it/s, acc_step=1/1, ce_loss_token=1.7819, perplexity_token=5.9410]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:36<02:41,  2.79it/s, acc_step=1/1, ce_loss_token=1.7818, perplexity_token=5.9406]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:36<02:42,  2.75it/s, acc_step=1/1, ce_loss_token=1.7817, perplexity_token=5.9402]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:37<02:44,  2.73it/s, acc_step=1/1, ce_loss_token=1.7817, perplexity_token=5.9398]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:37<02:38,  2.82it/s, acc_step=1/1, ce_loss_token=1.7816, perplexity_token=5.9394]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:37<02:45,  2.69it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9390]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:38<02:31,  2.93it/s, acc_step=1/1, ce_loss_token=1.7817, perplexity_token=5.9397]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:38<02:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.7816, perplexity_token=5.9393]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:38<02:37,  2.81it/s, acc_step=1/1, ce_loss_token=1.7817, perplexity_token=5.9401]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:39<02:36,  2.81it/s, acc_step=1/1, ce_loss_token=1.7817, perplexity_token=5.9397]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:39<02:37,  2.80it/s, acc_step=1/1, ce_loss_token=1.7816, perplexity_token=5.9393]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:39<02:40,  2.74it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9391]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:40<02:45,  2.64it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9387]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:40<02:43,  2.68it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9382]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:41<02:40,  2.71it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9379]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:41<02:40,  2.71it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9377]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:41<02:38,  2.74it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9373]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:42<02:29,  2.90it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9381]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:42<02:30,  2.87it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9378]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:42<02:27,  2.93it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9384]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:43<02:40,  2.68it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9380]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:43<02:40,  2.67it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9376]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:43<02:29,  2.86it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9387]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:44<02:35,  2.75it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9383]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:44<02:35,  2.75it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9380]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:45<02:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9377]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:45<02:23,  2.96it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9388]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:45<02:24,  2.93it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9384]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:46<02:29,  2.82it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9380]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:46<02:36,  2.68it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9378]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:46<02:26,  2.86it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9385]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:47<02:24,  2.89it/s, acc_step=1/1, ce_loss_token=1.7814, perplexity_token=5.9381]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:47<02:24,  2.90it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9378]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:47<02:26,  2.84it/s, acc_step=1/1, ce_loss_token=1.7813, perplexity_token=5.9374]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:48<02:29,  2.79it/s, acc_step=1/1, ce_loss_token=1.7812, perplexity_token=5.9370]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:48<02:29,  2.77it/s, acc_step=1/1, ce_loss_token=1.7811, perplexity_token=5.9365]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:48<02:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9360]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:49<02:36,  2.64it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9356]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:49<02:23,  2.86it/s, acc_step=1/1, ce_loss_token=1.7812, perplexity_token=5.9368]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:49<02:23,  2.86it/s, acc_step=1/1, ce_loss_token=1.7811, perplexity_token=5.9364]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:50<02:27,  2.79it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9361]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:50<02:35,  2.63it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9358]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:51<02:32,  2.67it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9354]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:51<03:02,  2.24it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9350]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:52<02:53,  2.34it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9345]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:52<02:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9353]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:52<02:33,  2.64it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9349]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:53<02:30,  2.68it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9345]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:53<02:34,  2.60it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:53<02:28,  2.70it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9336]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:54<02:21,  2.82it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9342]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:54<02:20,  2.84it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9337]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:54<02:21,  2.81it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9334]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:55<02:21,  2.81it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9331]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:55<02:29,  2.65it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9328]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [03:56<02:36,  2.53it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9324]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [03:56<02:22,  2.76it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9331]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [03:56<02:28,  2.65it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9327]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [03:57<02:25,  2.69it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9324]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [03:57<02:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9320]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [03:57<02:17,  2.85it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9326]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [03:58<01:59,  3.26it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9343]

torch.Size([256, 305, 35]) torch.Size([256, 305])
torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [03:58<02:03,  3.13it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9340]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [03:59<02:08,  3.01it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9337]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [03:59<02:12,  2.90it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9334]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [03:59<02:05,  3.06it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9344]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:00<02:01,  3.15it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9350]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:00<02:11,  2.91it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9347]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:00<02:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9343]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:01<02:12,  2.88it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9350]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:01<02:05,  3.01it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9357]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:01<02:07,  2.96it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9351]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:02<02:10,  2.89it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9348]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:02<02:12,  2.84it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9344]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:03<02:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9342]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:03<02:08,  2.91it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9349]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:03<02:10,  2.85it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9346]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:04<02:14,  2.76it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9342]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:04<02:16,  2.72it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9339]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:04<02:18,  2.67it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9336]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:05<02:18,  2.66it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9332]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:05<02:09,  2.84it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:05<02:20,  2.62it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9337]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:06<02:33,  2.38it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9344]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:06<02:27,  2.47it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:07<02:28,  2.46it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9338]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:07<02:21,  2.57it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9334]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:07<02:18,  2.60it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9329]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:08<02:06,  2.85it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9335]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:08<02:04,  2.90it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9331]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:08<02:03,  2.90it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9325]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:09<02:03,  2.91it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9332]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:09<01:55,  3.10it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:09<01:54,  3.10it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9338]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:10<01:52,  3.14it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9348]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:10<01:51,  3.18it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9353]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:10<01:47,  3.27it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9359]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:11<01:51,  3.15it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9356]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:11<01:47,  3.27it/s, acc_step=1/1, ce_loss_token=1.7811, perplexity_token=5.9362]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:11<01:49,  3.19it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9358]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:12<01:47,  3.23it/s, acc_step=1/1, ce_loss_token=1.7812, perplexity_token=5.9368]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:12<01:52,  3.09it/s, acc_step=1/1, ce_loss_token=1.7811, perplexity_token=5.9363]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:12<01:59,  2.90it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9359]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:13<02:06,  2.74it/s, acc_step=1/1, ce_loss_token=1.7810, perplexity_token=5.9355]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:13<02:06,  2.74it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9353]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:13<02:09,  2.66it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9349]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:14<02:13,  2.57it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9346]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:14<02:11,  2.60it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9343]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:15<02:08,  2.66it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9340]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:15<01:57,  2.89it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9349]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:15<02:03,  2.74it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9345]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:16<01:59,  2.83it/s, acc_step=1/1, ce_loss_token=1.7809, perplexity_token=5.9351]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:16<01:59,  2.82it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9347]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:16<02:03,  2.73it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9344]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:17<02:03,  2.70it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9340]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:17<02:07,  2.61it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9335]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:18<02:08,  2.60it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9331]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:18<02:06,  2.62it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9328]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:18<02:04,  2.65it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9325]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:19<01:55,  2.86it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9330]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:19<01:46,  3.08it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9339]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:19<01:48,  3.03it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9337]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:20<01:59,  2.73it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9333]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:20<01:50,  2.95it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9342]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:20<01:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9339]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:21<01:45,  3.07it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9348]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:21<01:47,  3.01it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9345]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:21<01:54,  2.80it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:22<02:00,  2.66it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9339]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:22<02:10,  2.45it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9336]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:23<02:00,  2.64it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9340]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:23<02:02,  2.60it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9336]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:23<02:06,  2.51it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9333]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:24<02:07,  2.47it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9329]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:24<01:56,  2.69it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9337]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:24<02:01,  2.58it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9335]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:25<02:00,  2.59it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9332]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:25<01:51,  2.79it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9339]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:25<01:43,  3.02it/s, acc_step=1/1, ce_loss_token=1.7808, perplexity_token=5.9344]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:26<01:47,  2.89it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9341]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:26<01:56,  2.65it/s, acc_step=1/1, ce_loss_token=1.7807, perplexity_token=5.9338]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:27<01:51,  2.76it/s, acc_step=1/1, ce_loss_token=1.7806, perplexity_token=5.9334]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:27<01:52,  2.74it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9331]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:27<01:51,  2.75it/s, acc_step=1/1, ce_loss_token=1.7805, perplexity_token=5.9327]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:28<01:49,  2.78it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9324]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:28<01:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.7804, perplexity_token=5.9320]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:28<01:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7803, perplexity_token=5.9316]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:29<01:50,  2.73it/s, acc_step=1/1, ce_loss_token=1.7802, perplexity_token=5.9311]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:29<01:51,  2.69it/s, acc_step=1/1, ce_loss_token=1.7801, perplexity_token=5.9307]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:30<01:53,  2.64it/s, acc_step=1/1, ce_loss_token=1.7801, perplexity_token=5.9303]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:30<01:51,  2.68it/s, acc_step=1/1, ce_loss_token=1.7800, perplexity_token=5.9301]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:30<01:53,  2.62it/s, acc_step=1/1, ce_loss_token=1.7800, perplexity_token=5.9297]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:31<01:55,  2.58it/s, acc_step=1/1, ce_loss_token=1.7799, perplexity_token=5.9295]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:31<01:45,  2.80it/s, acc_step=1/1, ce_loss_token=1.7800, perplexity_token=5.9301]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:31<01:48,  2.72it/s, acc_step=1/1, ce_loss_token=1.7800, perplexity_token=5.9297]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:32<01:47,  2.74it/s, acc_step=1/1, ce_loss_token=1.7799, perplexity_token=5.9294]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:32<01:47,  2.71it/s, acc_step=1/1, ce_loss_token=1.7799, perplexity_token=5.9290]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:32<01:45,  2.77it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9287]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:33<01:44,  2.78it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9284]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:33<01:48,  2.68it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9280]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:34<01:46,  2.72it/s, acc_step=1/1, ce_loss_token=1.7796, perplexity_token=5.9277]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:34<01:39,  2.91it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9283]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:34<01:39,  2.88it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9280]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:35<01:41,  2.82it/s, acc_step=1/1, ce_loss_token=1.7796, perplexity_token=5.9278]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:35<01:38,  2.90it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9287]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:35<01:38,  2.88it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9284]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:36<01:30,  3.11it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9289]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:36<01:32,  3.04it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9285]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:36<01:36,  2.91it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9280]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:37<01:32,  3.03it/s, acc_step=1/1, ce_loss_token=1.7798, perplexity_token=5.9286]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:37<01:33,  2.98it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9283]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:37<01:37,  2.86it/s, acc_step=1/1, ce_loss_token=1.7797, perplexity_token=5.9280]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:38<01:40,  2.77it/s, acc_step=1/1, ce_loss_token=1.7796, perplexity_token=5.9277]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:38<01:40,  2.73it/s, acc_step=1/1, ce_loss_token=1.7796, perplexity_token=5.9274]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:38<01:39,  2.76it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9272]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:39<01:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9269]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:39<01:45,  2.59it/s, acc_step=1/1, ce_loss_token=1.7794, perplexity_token=5.9266]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:40<01:37,  2.78it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9271]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:40<01:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9268]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:40<01:38,  2.75it/s, acc_step=1/1, ce_loss_token=1.7794, perplexity_token=5.9265]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:41<01:40,  2.69it/s, acc_step=1/1, ce_loss_token=1.7794, perplexity_token=5.9261]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:41<01:35,  2.81it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9267]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:41<01:28,  3.02it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9271]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:42<01:28,  2.99it/s, acc_step=1/1, ce_loss_token=1.7795, perplexity_token=5.9268]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:42<01:28,  2.98it/s, acc_step=1/1, ce_loss_token=1.7794, perplexity_token=5.9265]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:42<01:31,  2.89it/s, acc_step=1/1, ce_loss_token=1.7794, perplexity_token=5.9262]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:43<01:40,  2.61it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9259]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:43<01:39,  2.62it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9255]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:44<01:40,  2.60it/s, acc_step=1/1, ce_loss_token=1.7792, perplexity_token=5.9253]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:44<01:31,  2.85it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9258]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:44<01:31,  2.83it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9255]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:45<01:36,  2.67it/s, acc_step=1/1, ce_loss_token=1.7792, perplexity_token=5.9252]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:45<01:32,  2.77it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9258]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:45<01:31,  2.80it/s, acc_step=1/1, ce_loss_token=1.7793, perplexity_token=5.9255]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:46<01:31,  2.80it/s, acc_step=1/1, ce_loss_token=1.7792, perplexity_token=5.9252]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:46<01:29,  2.85it/s, acc_step=1/1, ce_loss_token=1.7792, perplexity_token=5.9249]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:46<01:28,  2.87it/s, acc_step=1/1, ce_loss_token=1.7791, perplexity_token=5.9246]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:47<01:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.7791, perplexity_token=5.9244]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:47<01:41,  2.48it/s, acc_step=1/1, ce_loss_token=1.7790, perplexity_token=5.9240]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:48<01:36,  2.60it/s, acc_step=1/1, ce_loss_token=1.7790, perplexity_token=5.9238]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:48<01:35,  2.62it/s, acc_step=1/1, ce_loss_token=1.7789, perplexity_token=5.9234]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:48<01:34,  2.64it/s, acc_step=1/1, ce_loss_token=1.7789, perplexity_token=5.9232]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:49<01:40,  2.46it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9229]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:49<01:37,  2.54it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9226]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:49<01:28,  2.78it/s, acc_step=1/1, ce_loss_token=1.7789, perplexity_token=5.9232]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:50<01:31,  2.67it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9230]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:50<01:29,  2.70it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9228]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:51<01:30,  2.66it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9225]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:51<01:29,  2.69it/s, acc_step=1/1, ce_loss_token=1.7787, perplexity_token=5.9222]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:51<01:28,  2.70it/s, acc_step=1/1, ce_loss_token=1.7786, perplexity_token=5.9218]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:52<01:27,  2.74it/s, acc_step=1/1, ce_loss_token=1.7786, perplexity_token=5.9215]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:52<01:27,  2.73it/s, acc_step=1/1, ce_loss_token=1.7785, perplexity_token=5.9212]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:53<01:37,  2.43it/s, acc_step=1/1, ce_loss_token=1.7785, perplexity_token=5.9209]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:53<01:37,  2.42it/s, acc_step=1/1, ce_loss_token=1.7785, perplexity_token=5.9207]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:53<01:35,  2.46it/s, acc_step=1/1, ce_loss_token=1.7784, perplexity_token=5.9204]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:54<01:34,  2.48it/s, acc_step=1/1, ce_loss_token=1.7784, perplexity_token=5.9202]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [04:54<01:32,  2.51it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9198]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [04:54<01:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.7784, perplexity_token=5.9204]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [04:55<01:29,  2.59it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9200]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [04:55<01:27,  2.62it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9197]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [04:56<01:26,  2.64it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9193]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [04:56<01:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9189]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [04:56<01:21,  2.80it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9194]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [04:57<01:18,  2.87it/s, acc_step=1/1, ce_loss_token=1.7784, perplexity_token=5.9202]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [04:57<01:28,  2.54it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9199]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [04:57<01:27,  2.57it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9196]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [04:58<01:19,  2.80it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9200]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [04:58<01:20,  2.74it/s, acc_step=1/1, ce_loss_token=1.7783, perplexity_token=5.9197]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [04:59<01:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9193]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [04:59<01:20,  2.73it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9189]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [04:59<01:18,  2.80it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9186]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:00<01:13,  2.96it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9194]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:00<01:12,  2.97it/s, acc_step=1/1, ce_loss_token=1.7782, perplexity_token=5.9191]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:00<01:18,  2.76it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9188]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:01<01:18,  2.72it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9185]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:01<01:18,  2.72it/s, acc_step=1/1, ce_loss_token=1.7780, perplexity_token=5.9182]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:01<01:22,  2.60it/s, acc_step=1/1, ce_loss_token=1.7780, perplexity_token=5.9178]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:02<01:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9186]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:02<01:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.7781, perplexity_token=5.9184]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:03<01:24,  2.50it/s, acc_step=1/1, ce_loss_token=1.7780, perplexity_token=5.9179]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:03<01:19,  2.61it/s, acc_step=1/1, ce_loss_token=1.7779, perplexity_token=5.9177]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:03<01:13,  2.81it/s, acc_step=1/1, ce_loss_token=1.7780, perplexity_token=5.9183]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:04<01:19,  2.60it/s, acc_step=1/1, ce_loss_token=1.7780, perplexity_token=5.9180]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:04<01:19,  2.58it/s, acc_step=1/1, ce_loss_token=1.7779, perplexity_token=5.9177]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:05<01:20,  2.55it/s, acc_step=1/1, ce_loss_token=1.7779, perplexity_token=5.9173]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:05<01:19,  2.55it/s, acc_step=1/1, ce_loss_token=1.7778, perplexity_token=5.9170]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:05<01:16,  2.65it/s, acc_step=1/1, ce_loss_token=1.7778, perplexity_token=5.9167]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:06<01:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9164]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:06<01:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9160]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:06<01:07,  2.98it/s, acc_step=1/1, ce_loss_token=1.7778, perplexity_token=5.9168]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:07<01:09,  2.86it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9165]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:07<01:09,  2.83it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9163]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:07<01:10,  2.80it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9160]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:08<01:05,  3.00it/s, acc_step=1/1, ce_loss_token=1.7778, perplexity_token=5.9165]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:08<01:13,  2.65it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9162]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:08<01:10,  2.75it/s, acc_step=1/1, ce_loss_token=1.7776, perplexity_token=5.9158]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:09<01:10,  2.72it/s, acc_step=1/1, ce_loss_token=1.7776, perplexity_token=5.9156]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:09<01:04,  2.99it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9163]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:09<01:05,  2.93it/s, acc_step=1/1, ce_loss_token=1.7777, perplexity_token=5.9160]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:10<01:05,  2.89it/s, acc_step=1/1, ce_loss_token=1.7776, perplexity_token=5.9157]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:10<01:11,  2.65it/s, acc_step=1/1, ce_loss_token=1.7776, perplexity_token=5.9154]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:11<01:11,  2.63it/s, acc_step=1/1, ce_loss_token=1.7775, perplexity_token=5.9152]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:11<01:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7775, perplexity_token=5.9148]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:11<01:07,  2.75it/s, acc_step=1/1, ce_loss_token=1.7774, perplexity_token=5.9145]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:12<01:06,  2.77it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9141]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:12<01:06,  2.78it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9139]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:12<01:01,  2.98it/s, acc_step=1/1, ce_loss_token=1.7774, perplexity_token=5.9144]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:13<01:03,  2.87it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9141]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:13<01:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9136]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:13<01:05,  2.76it/s, acc_step=1/1, ce_loss_token=1.7772, perplexity_token=5.9134]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:14<01:02,  2.87it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9138]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:14<01:03,  2.81it/s, acc_step=1/1, ce_loss_token=1.7772, perplexity_token=5.9136]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:14<01:02,  2.82it/s, acc_step=1/1, ce_loss_token=1.7772, perplexity_token=5.9132]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:15<01:03,  2.77it/s, acc_step=1/1, ce_loss_token=1.7772, perplexity_token=5.9130]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:15<01:04,  2.71it/s, acc_step=1/1, ce_loss_token=1.7771, perplexity_token=5.9128]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:16<01:05,  2.65it/s, acc_step=1/1, ce_loss_token=1.7771, perplexity_token=5.9124]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:16<01:06,  2.62it/s, acc_step=1/1, ce_loss_token=1.7770, perplexity_token=5.9121]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:16<01:04,  2.66it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9117]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:17<01:04,  2.63it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9115]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:17<01:04,  2.65it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9111]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:17<01:00,  2.81it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9116]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:18<01:00,  2.78it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9114]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:18<01:02,  2.68it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9111]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:19<01:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9107]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:19<01:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9105]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:19<01:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9102]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:20<01:00,  2.70it/s, acc_step=1/1, ce_loss_token=1.7766, perplexity_token=5.9100]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:20<00:57,  2.84it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9105]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:20<00:55,  2.89it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9109]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:21<00:56,  2.81it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9107]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:21<01:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9105]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:21<00:55,  2.85it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9109]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:22<00:51,  3.03it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9116]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:22<00:52,  2.98it/s, acc_step=1/1, ce_loss_token=1.7769, perplexity_token=5.9114]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:22<00:53,  2.91it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9112]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:23<00:55,  2.79it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9110]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:23<00:57,  2.65it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9107]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:24<00:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9104]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:24<01:05,  2.32it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9102]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:24<00:57,  2.60it/s, acc_step=1/1, ce_loss_token=1.7768, perplexity_token=5.9109]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:25<00:56,  2.65it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9105]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:25<00:55,  2.65it/s, acc_step=1/1, ce_loss_token=1.7767, perplexity_token=5.9102]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:26<00:54,  2.68it/s, acc_step=1/1, ce_loss_token=1.7766, perplexity_token=5.9099]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:26<00:53,  2.71it/s, acc_step=1/1, ce_loss_token=1.7766, perplexity_token=5.9096]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:26<00:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.7765, perplexity_token=5.9093]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:27<00:53,  2.70it/s, acc_step=1/1, ce_loss_token=1.7765, perplexity_token=5.9090]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:27<00:52,  2.70it/s, acc_step=1/1, ce_loss_token=1.7765, perplexity_token=5.9088]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:27<00:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7764, perplexity_token=5.9086]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:28<00:48,  2.90it/s, acc_step=1/1, ce_loss_token=1.7765, perplexity_token=5.9090]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:28<00:48,  2.86it/s, acc_step=1/1, ce_loss_token=1.7764, perplexity_token=5.9088]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:28<00:49,  2.79it/s, acc_step=1/1, ce_loss_token=1.7764, perplexity_token=5.9084]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:29<00:54,  2.54it/s, acc_step=1/1, ce_loss_token=1.7763, perplexity_token=5.9082]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:29<00:56,  2.43it/s, acc_step=1/1, ce_loss_token=1.7763, perplexity_token=5.9079]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:30<00:55,  2.47it/s, acc_step=1/1, ce_loss_token=1.7762, perplexity_token=5.9075]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:30<00:53,  2.54it/s, acc_step=1/1, ce_loss_token=1.7762, perplexity_token=5.9073]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:30<00:51,  2.60it/s, acc_step=1/1, ce_loss_token=1.7761, perplexity_token=5.9069]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:31<00:46,  2.84it/s, acc_step=1/1, ce_loss_token=1.7762, perplexity_token=5.9075]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:31<00:47,  2.78it/s, acc_step=1/1, ce_loss_token=1.7762, perplexity_token=5.9071]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:32<00:49,  2.66it/s, acc_step=1/1, ce_loss_token=1.7761, perplexity_token=5.9069]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:32<00:49,  2.64it/s, acc_step=1/1, ce_loss_token=1.7761, perplexity_token=5.9065]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:32<00:48,  2.66it/s, acc_step=1/1, ce_loss_token=1.7760, perplexity_token=5.9063]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:33<00:48,  2.66it/s, acc_step=1/1, ce_loss_token=1.7760, perplexity_token=5.9060]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:33<00:47,  2.65it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9056]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:33<00:48,  2.60it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9054]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:34<00:48,  2.57it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9050]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:34<00:47,  2.59it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9048]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:35<00:48,  2.55it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9044]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:35<00:47,  2.55it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9042]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:35<00:43,  2.76it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9046]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:36<00:42,  2.80it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9044]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:36<00:40,  2.93it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9051]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:36<00:37,  3.11it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9055]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:37<00:40,  2.87it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9052]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:37<00:40,  2.85it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9049]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:37<00:42,  2.73it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9046]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:38<00:42,  2.66it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9043]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:38<00:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9040]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:39<00:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9037]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:39<00:39,  2.82it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9042]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:39<00:39,  2.78it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9039]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:40<00:37,  2.92it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9044]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:40<00:37,  2.89it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9041]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:40<00:39,  2.68it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9038]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:41<00:44,  2.40it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9037]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:41<00:39,  2.66it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9043]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:41<00:35,  2.92it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9047]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:42<00:33,  3.09it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9053]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:42<00:30,  3.29it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9058]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:42<00:31,  3.17it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9054]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:43<00:33,  2.95it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9052]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:43<00:32,  3.07it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9057]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:43<00:34,  2.85it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9054]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:44<00:48,  1.98it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9052]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:45<00:45,  2.11it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9050]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:45<00:42,  2.23it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9047]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:45<00:37,  2.50it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9051]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:46<00:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.7760, perplexity_token=5.9059]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:46<00:34,  2.67it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9056]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:46<00:35,  2.56it/s, acc_step=1/1, ce_loss_token=1.7759, perplexity_token=5.9054]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:47<00:34,  2.60it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9051]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:47<00:33,  2.65it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9049]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:48<00:32,  2.67it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9046]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:48<00:30,  2.84it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9050]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:48<00:30,  2.79it/s, acc_step=1/1, ce_loss_token=1.7758, perplexity_token=5.9047]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:49<00:30,  2.74it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9045]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:49<00:30,  2.77it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9042]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:49<00:30,  2.71it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9039]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:50<00:30,  2.65it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9037]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:50<00:30,  2.67it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9041]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:51<00:30,  2.61it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9038]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:51<00:30,  2.57it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9035]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:51<00:30,  2.56it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9032]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:52<00:27,  2.78it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9038]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:52<00:27,  2.78it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9035]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:52<00:25,  2.97it/s, acc_step=1/1, ce_loss_token=1.7757, perplexity_token=5.9042]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:53<00:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9039]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [05:53<00:27,  2.61it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9037]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [05:54<00:27,  2.58it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9034]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [05:54<00:25,  2.75it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9038]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [05:54<00:25,  2.73it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9036]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [05:54<00:23,  2.90it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9040]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [05:55<00:24,  2.73it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9037]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [05:55<00:24,  2.72it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9035]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [05:56<00:24,  2.65it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9033]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [05:56<00:24,  2.66it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9031]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [05:56<00:24,  2.65it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9029]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [05:57<00:22,  2.78it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9035]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [05:57<00:22,  2.79it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9032]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [05:58<00:23,  2.60it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9029]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [05:58<00:22,  2.62it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9027]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [05:58<00:22,  2.58it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9024]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [05:59<00:22,  2.55it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9021]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [05:59<00:20,  2.79it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9026]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [05:59<00:20,  2.75it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9022]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:00<00:19,  2.78it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9020]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:00<00:19,  2.79it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9018]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:00<00:18,  2.82it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9015]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:01<00:19,  2.74it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9012]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:01<00:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9016]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:02<00:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9013]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:02<00:17,  2.76it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9010]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:02<00:17,  2.80it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9017]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:03<00:17,  2.73it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9015]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:03<00:17,  2.70it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9012]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:03<00:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9009]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:04<00:18,  2.34it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9007]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:04<00:17,  2.43it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9005]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:05<00:17,  2.46it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9002]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:05<00:15,  2.64it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9006]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:05<00:15,  2.62it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9004]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:06<00:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9001]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:06<00:12,  3.00it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9022]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:06<00:12,  3.08it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9026]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:07<00:11,  3.16it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9032]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:07<00:12,  2.88it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9031]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:07<00:12,  2.82it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9028]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:08<00:11,  2.85it/s, acc_step=1/1, ce_loss_token=1.7756, perplexity_token=5.9035]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:08<00:11,  2.76it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9033]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:09<00:11,  2.76it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9030]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:09<00:10,  2.99it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9034]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:09<00:09,  2.91it/s, acc_step=1/1, ce_loss_token=1.7755, perplexity_token=5.9030]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:09<00:09,  2.93it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9028]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:10<00:09,  2.88it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9026]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:10<00:09,  2.86it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9023]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:11<00:08,  2.92it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9027]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:11<00:08,  2.83it/s, acc_step=1/1, ce_loss_token=1.7754, perplexity_token=5.9024]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:11<00:08,  2.81it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9022]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:12<00:07,  2.78it/s, acc_step=1/1, ce_loss_token=1.7753, perplexity_token=5.9019]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:12<00:07,  2.77it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9017]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:12<00:07,  2.78it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9014]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:13<00:06,  2.76it/s, acc_step=1/1, ce_loss_token=1.7752, perplexity_token=5.9012]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:13<00:06,  2.77it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9010]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:13<00:06,  2.73it/s, acc_step=1/1, ce_loss_token=1.7751, perplexity_token=5.9007]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:14<00:05,  2.73it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9005]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:14<00:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9003]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:15<00:05,  2.71it/s, acc_step=1/1, ce_loss_token=1.7750, perplexity_token=5.9000]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:15<00:04,  2.77it/s, acc_step=1/1, ce_loss_token=1.7749, perplexity_token=5.8998]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:15<00:04,  2.78it/s, acc_step=1/1, ce_loss_token=1.7749, perplexity_token=5.8995]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:16<00:03,  2.80it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8992]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:16<00:03,  2.60it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8989]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:16<00:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.7747, perplexity_token=5.8987]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:17<00:02,  2.90it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8991]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:17<00:02,  2.50it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8989]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:17<00:02,  2.74it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8993]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:18<00:01,  2.75it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8991]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:18<00:01,  2.73it/s, acc_step=1/1, ce_loss_token=1.7747, perplexity_token=5.8988]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:19<00:01,  2.75it/s, acc_step=1/1, ce_loss_token=1.7747, perplexity_token=5.8985]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:19<00:00,  2.81it/s, acc_step=1/1, ce_loss_token=1.7747, perplexity_token=5.8983]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:19<00:00,  2.77it/s, acc_step=1/1, ce_loss_token=1.7746, perplexity_token=5.8980]

torch.Size([170, 307, 35]) torch.Size([170, 307])



Generating with greedy search...

📊 Metrics (Epoch 7):
├── TRAIN:
│   ├── ce_loss_char: 1.7746
│   ├── ce_loss_token: 1.7746
│   ├── perplexity_char: 5.8979
│   └── perplexity_token: 5.8979
└── VAL:
    ├── ce_loss_char: 1.6455
    ├── ce_loss_token: 1.6455
    ├── perplexity_char: 5.1839
    └── perplexity_token: 5.1839
└── TRAINING:
    └── learning_rate: 0.000100


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:36,  2.02it/s, acc_step=1/1, ce_loss_token=1.7196, perplexity_token=5.5820]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:09,  2.43it/s, acc_step=1/1, ce_loss_token=1.7970, perplexity_token=6.0312]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<06:09,  2.82it/s, acc_step=1/1, ce_loss_token=1.8139, perplexity_token=6.1344]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:19,  2.74it/s, acc_step=1/1, ce_loss_token=1.7936, perplexity_token=6.0109]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   0%|▏                                                | 5/1044 [00:01<06:14,  2.78it/s, acc_step=1/1, ce_loss_token=1.7815, perplexity_token=5.9390]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:02,  2.87it/s, acc_step=1/1, ce_loss_token=1.7933, perplexity_token=6.0093]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<05:58,  2.89it/s, acc_step=1/1, ce_loss_token=1.7843, perplexity_token=5.9555]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:   1%|▍                                                | 8/1044 [00:02<05:51,  2.95it/s, acc_step=1/1, ce_loss_token=1.7773, perplexity_token=5.9141]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<05:51,  2.95it/s, acc_step=1/1, ce_loss_token=1.7839, perplexity_token=5.9531]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:09,  2.80it/s, acc_step=1/1, ce_loss_token=1.7788, perplexity_token=5.9226]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▌                                               | 11/1044 [00:03<06:07,  2.81it/s, acc_step=1/1, ce_loss_token=1.7744, perplexity_token=5.8967]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   1%|▌                                               | 13/1044 [00:04<05:22,  3.19it/s, acc_step=1/1, ce_loss_token=1.7908, perplexity_token=5.9945]

torch.Size([256, 294, 35]) torch.Size([256, 294])
torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   1%|▋                                               | 14/1044 [00:04<05:42,  3.01it/s, acc_step=1/1, ce_loss_token=1.7853, perplexity_token=5.9614]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<05:56,  2.89it/s, acc_step=1/1, ce_loss_token=1.7824, perplexity_token=5.9439]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   2%|▋                                               | 16/1044 [00:05<05:41,  3.01it/s, acc_step=1/1, ce_loss_token=1.7871, perplexity_token=5.9722]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|▊                                               | 17/1044 [00:05<05:53,  2.90it/s, acc_step=1/1, ce_loss_token=1.7835, perplexity_token=5.9507]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<06:18,  2.71it/s, acc_step=1/1, ce_loss_token=1.7800, perplexity_token=5.9300]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|▊                                               | 19/1044 [00:06<06:18,  2.71it/s, acc_step=1/1, ce_loss_token=1.7771, perplexity_token=5.9127]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<06:04,  2.81it/s, acc_step=1/1, ce_loss_token=1.7748, perplexity_token=5.8990]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<06:07,  2.78it/s, acc_step=1/1, ce_loss_token=1.7725, perplexity_token=5.8853]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   2%|█                                               | 22/1044 [00:07<06:12,  2.74it/s, acc_step=1/1, ce_loss_token=1.7701, perplexity_token=5.8717]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   2%|█                                               | 23/1044 [00:08<06:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.7681, perplexity_token=5.8597]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:14,  2.72it/s, acc_step=1/1, ce_loss_token=1.7665, perplexity_token=5.8505]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   2%|█▏                                              | 25/1044 [00:08<06:28,  2.62it/s, acc_step=1/1, ce_loss_token=1.7644, perplexity_token=5.8379]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:21,  2.67it/s, acc_step=1/1, ce_loss_token=1.7632, perplexity_token=5.8312]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   3%|█▎                                              | 28/1044 [00:09<05:29,  3.08it/s, acc_step=1/1, ce_loss_token=1.7705, perplexity_token=5.8738]

torch.Size([256, 326, 35]) torch.Size([256, 326])
torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<05:09,  3.28it/s, acc_step=1/1, ce_loss_token=1.7723, perplexity_token=5.8842]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   3%|█▍                                              | 30/1044 [00:10<05:07,  3.30it/s, acc_step=1/1, ce_loss_token=1.7761, perplexity_token=5.9068]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   3%|█▍                                              | 31/1044 [00:10<05:31,  3.05it/s, acc_step=1/1, ce_loss_token=1.7749, perplexity_token=5.8995]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   3%|█▍                                              | 32/1044 [00:11<05:35,  3.02it/s, acc_step=1/1, ce_loss_token=1.7736, perplexity_token=5.8920]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   3%|█▌                                              | 33/1044 [00:11<05:40,  2.97it/s, acc_step=1/1, ce_loss_token=1.7722, perplexity_token=5.8835]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   3%|█▌                                              | 34/1044 [00:11<05:47,  2.91it/s, acc_step=1/1, ce_loss_token=1.7709, perplexity_token=5.8759]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   3%|█▌                                              | 35/1044 [00:12<05:31,  3.05it/s, acc_step=1/1, ce_loss_token=1.7741, perplexity_token=5.8947]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:   3%|█▋                                              | 36/1044 [00:12<05:59,  2.81it/s, acc_step=1/1, ce_loss_token=1.7730, perplexity_token=5.8882]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:   4%|█▋                                              | 37/1044 [00:13<06:39,  2.52it/s, acc_step=1/1, ce_loss_token=1.7723, perplexity_token=5.8842]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   4%|█▋                                              | 38/1044 [00:13<06:35,  2.55it/s, acc_step=1/1, ce_loss_token=1.7712, perplexity_token=5.8776]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   4%|█▊                                              | 39/1044 [00:13<06:27,  2.59it/s, acc_step=1/1, ce_loss_token=1.7700, perplexity_token=5.8709]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   4%|█▊                                              | 40/1044 [00:14<06:29,  2.58it/s, acc_step=1/1, ce_loss_token=1.7691, perplexity_token=5.8653]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:   4%|█▉                                              | 41/1044 [00:14<06:48,  2.46it/s, acc_step=1/1, ce_loss_token=1.7682, perplexity_token=5.8604]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   4%|█▉                                              | 42/1044 [00:15<06:17,  2.66it/s, acc_step=1/1, ce_loss_token=1.7709, perplexity_token=5.8764]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   4%|█▉                                              | 43/1044 [00:15<06:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.7704, perplexity_token=5.8729]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   4%|██                                              | 44/1044 [00:15<06:06,  2.73it/s, acc_step=1/1, ce_loss_token=1.7696, perplexity_token=5.8684]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   4%|██                                              | 45/1044 [00:16<05:56,  2.80it/s, acc_step=1/1, ce_loss_token=1.7686, perplexity_token=5.8628]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:   4%|██                                              | 46/1044 [00:16<06:12,  2.68it/s, acc_step=1/1, ce_loss_token=1.7677, perplexity_token=5.8573]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   5%|██▏                                             | 47/1044 [00:16<06:18,  2.63it/s, acc_step=1/1, ce_loss_token=1.7670, perplexity_token=5.8531]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   5%|██▏                                             | 48/1044 [00:17<06:25,  2.59it/s, acc_step=1/1, ce_loss_token=1.7663, perplexity_token=5.8494]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▎                                             | 49/1044 [00:17<06:26,  2.58it/s, acc_step=1/1, ce_loss_token=1.7657, perplexity_token=5.8455]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<06:26,  2.57it/s, acc_step=1/1, ce_loss_token=1.7649, perplexity_token=5.8410]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   5%|██▎                                             | 51/1044 [00:18<06:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.7643, perplexity_token=5.8372]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:   5%|██▍                                             | 52/1044 [00:18<06:07,  2.70it/s, acc_step=1/1, ce_loss_token=1.7637, perplexity_token=5.8340]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▍                                             | 53/1044 [00:19<06:02,  2.74it/s, acc_step=1/1, ce_loss_token=1.7629, perplexity_token=5.8291]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   5%|██▍                                             | 54/1044 [00:19<06:12,  2.66it/s, acc_step=1/1, ce_loss_token=1.7625, perplexity_token=5.8268]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   5%|██▌                                             | 55/1044 [00:19<06:08,  2.68it/s, acc_step=1/1, ce_loss_token=1.7620, perplexity_token=5.8239]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   5%|██▌                                             | 56/1044 [00:20<06:12,  2.65it/s, acc_step=1/1, ce_loss_token=1.7612, perplexity_token=5.8196]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   5%|██▌                                             | 57/1044 [00:20<05:43,  2.87it/s, acc_step=1/1, ce_loss_token=1.7626, perplexity_token=5.8275]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   6%|██▋                                             | 58/1044 [00:20<05:53,  2.79it/s, acc_step=1/1, ce_loss_token=1.7621, perplexity_token=5.8246]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|██▋                                             | 59/1044 [00:21<05:56,  2.77it/s, acc_step=1/1, ce_loss_token=1.7616, perplexity_token=5.8218]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|██▊                                             | 60/1044 [00:21<05:56,  2.76it/s, acc_step=1/1, ce_loss_token=1.7611, perplexity_token=5.8187]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   6%|██▊                                             | 61/1044 [00:21<05:33,  2.95it/s, acc_step=1/1, ce_loss_token=1.7626, perplexity_token=5.8274]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   6%|██▊                                             | 62/1044 [00:22<05:45,  2.84it/s, acc_step=1/1, ce_loss_token=1.7624, perplexity_token=5.8265]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   6%|██▉                                             | 63/1044 [00:22<05:23,  3.03it/s, acc_step=1/1, ce_loss_token=1.7635, perplexity_token=5.8328]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:   6%|██▉                                             | 64/1044 [00:23<06:00,  2.72it/s, acc_step=1/1, ce_loss_token=1.7629, perplexity_token=5.8296]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▉                                             | 65/1044 [00:23<06:04,  2.69it/s, acc_step=1/1, ce_loss_token=1.7625, perplexity_token=5.8271]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   6%|███                                             | 66/1044 [00:23<06:00,  2.71it/s, acc_step=1/1, ce_loss_token=1.7621, perplexity_token=5.8249]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|███                                             | 67/1044 [00:24<05:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.7618, perplexity_token=5.8231]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   7%|███▏                                            | 68/1044 [00:24<06:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.7616, perplexity_token=5.8218]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   7%|███▏                                            | 69/1044 [00:24<06:17,  2.58it/s, acc_step=1/1, ce_loss_token=1.7610, perplexity_token=5.8182]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   7%|███▏                                            | 70/1044 [00:25<06:15,  2.59it/s, acc_step=1/1, ce_loss_token=1.7606, perplexity_token=5.8157]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   7%|███▎                                            | 71/1044 [00:25<06:20,  2.56it/s, acc_step=1/1, ce_loss_token=1.7598, perplexity_token=5.8115]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   7%|███▎                                            | 72/1044 [00:26<05:49,  2.78it/s, acc_step=1/1, ce_loss_token=1.7607, perplexity_token=5.8164]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   7%|███▎                                            | 73/1044 [00:26<06:06,  2.65it/s, acc_step=1/1, ce_loss_token=1.7603, perplexity_token=5.8141]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   7%|███▍                                            | 74/1044 [00:26<06:25,  2.52it/s, acc_step=1/1, ce_loss_token=1.7598, perplexity_token=5.8114]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   7%|███▍                                            | 75/1044 [00:27<06:27,  2.50it/s, acc_step=1/1, ce_loss_token=1.7594, perplexity_token=5.8091]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   7%|███▍                                            | 76/1044 [00:27<06:11,  2.61it/s, acc_step=1/1, ce_loss_token=1.7588, perplexity_token=5.8056]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<06:18,  2.56it/s, acc_step=1/1, ce_loss_token=1.7585, perplexity_token=5.8039]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   7%|███▌                                            | 78/1044 [00:28<06:08,  2.62it/s, acc_step=1/1, ce_loss_token=1.7581, perplexity_token=5.8014]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   8%|███▋                                            | 79/1044 [00:28<05:57,  2.70it/s, acc_step=1/1, ce_loss_token=1.7577, perplexity_token=5.7993]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   8%|███▋                                            | 80/1044 [00:29<05:55,  2.71it/s, acc_step=1/1, ce_loss_token=1.7574, perplexity_token=5.7971]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   8%|███▋                                            | 81/1044 [00:29<05:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7570, perplexity_token=5.7949]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   8%|███▊                                            | 82/1044 [00:29<05:48,  2.76it/s, acc_step=1/1, ce_loss_token=1.7567, perplexity_token=5.7932]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   8%|███▊                                            | 83/1044 [00:30<05:52,  2.73it/s, acc_step=1/1, ce_loss_token=1.7564, perplexity_token=5.7915]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   8%|███▊                                            | 84/1044 [00:30<05:27,  2.93it/s, acc_step=1/1, ce_loss_token=1.7575, perplexity_token=5.7978]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   8%|███▉                                            | 85/1044 [00:30<05:38,  2.84it/s, acc_step=1/1, ce_loss_token=1.7570, perplexity_token=5.7953]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   8%|███▉                                            | 86/1044 [00:31<05:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.7565, perplexity_token=5.7920]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   8%|████                                            | 87/1044 [00:31<05:16,  3.02it/s, acc_step=1/1, ce_loss_token=1.7580, perplexity_token=5.8011]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   8%|████                                            | 88/1044 [00:31<05:29,  2.90it/s, acc_step=1/1, ce_loss_token=1.7576, perplexity_token=5.7985]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   9%|████                                            | 89/1044 [00:32<05:36,  2.84it/s, acc_step=1/1, ce_loss_token=1.7572, perplexity_token=5.7964]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   9%|████▏                                           | 90/1044 [00:32<05:48,  2.74it/s, acc_step=1/1, ce_loss_token=1.7569, perplexity_token=5.7944]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   9%|████▏                                           | 91/1044 [00:32<05:30,  2.88it/s, acc_step=1/1, ce_loss_token=1.7583, perplexity_token=5.8028]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   9%|████▏                                           | 92/1044 [00:33<05:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.7580, perplexity_token=5.8009]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   9%|████▎                                           | 93/1044 [00:33<05:47,  2.74it/s, acc_step=1/1, ce_loss_token=1.7577, perplexity_token=5.7992]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:   9%|████▎                                           | 94/1044 [00:34<06:28,  2.44it/s, acc_step=1/1, ce_loss_token=1.7576, perplexity_token=5.7986]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   9%|████▎                                           | 95/1044 [00:34<06:00,  2.63it/s, acc_step=1/1, ce_loss_token=1.7590, perplexity_token=5.8063]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:   9%|████▍                                           | 96/1044 [00:35<06:18,  2.51it/s, acc_step=1/1, ce_loss_token=1.7586, perplexity_token=5.8044]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   9%|████▍                                           | 97/1044 [00:35<06:13,  2.54it/s, acc_step=1/1, ce_loss_token=1.7583, perplexity_token=5.8027]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   9%|████▌                                           | 98/1044 [00:35<06:13,  2.53it/s, acc_step=1/1, ce_loss_token=1.7581, perplexity_token=5.8013]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   9%|████▌                                           | 99/1044 [00:36<06:03,  2.60it/s, acc_step=1/1, ce_loss_token=1.7579, perplexity_token=5.8004]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  10%|████▌                                          | 100/1044 [00:36<05:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.7591, perplexity_token=5.8075]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  10%|████▌                                          | 101/1044 [00:36<05:42,  2.75it/s, acc_step=1/1, ce_loss_token=1.7590, perplexity_token=5.8068]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▌                                          | 102/1044 [00:37<05:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.7587, perplexity_token=5.8051]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  10%|████▋                                          | 103/1044 [00:37<05:17,  2.96it/s, acc_step=1/1, ce_loss_token=1.7595, perplexity_token=5.8095]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▋                                          | 104/1044 [00:37<05:21,  2.92it/s, acc_step=1/1, ce_loss_token=1.7591, perplexity_token=5.8073]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  10%|████▋                                          | 105/1044 [00:38<05:19,  2.94it/s, acc_step=1/1, ce_loss_token=1.7590, perplexity_token=5.8065]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  10%|████▊                                          | 106/1044 [00:38<05:32,  2.82it/s, acc_step=1/1, ce_loss_token=1.7586, perplexity_token=5.8045]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  10%|████▊                                          | 107/1044 [00:38<05:30,  2.84it/s, acc_step=1/1, ce_loss_token=1.7582, perplexity_token=5.8021]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  10%|████▊                                          | 108/1044 [00:39<05:30,  2.83it/s, acc_step=1/1, ce_loss_token=1.7580, perplexity_token=5.8009]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  10%|████▉                                          | 109/1044 [00:39<05:30,  2.83it/s, acc_step=1/1, ce_loss_token=1.7578, perplexity_token=5.7997]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  11%|████▉                                          | 110/1044 [00:39<05:30,  2.82it/s, acc_step=1/1, ce_loss_token=1.7576, perplexity_token=5.7987]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  11%|████▉                                          | 111/1044 [00:40<05:35,  2.78it/s, acc_step=1/1, ce_loss_token=1.7574, perplexity_token=5.7973]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  11%|█████                                          | 112/1044 [00:40<05:30,  2.82it/s, acc_step=1/1, ce_loss_token=1.7571, perplexity_token=5.7957]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  11%|█████                                          | 113/1044 [00:41<06:03,  2.56it/s, acc_step=1/1, ce_loss_token=1.7568, perplexity_token=5.7940]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:41<06:02,  2.57it/s, acc_step=1/1, ce_loss_token=1.7566, perplexity_token=5.7928]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:41<05:58,  2.59it/s, acc_step=1/1, ce_loss_token=1.7563, perplexity_token=5.7911]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:42<06:40,  2.32it/s, acc_step=1/1, ce_loss_token=1.7561, perplexity_token=5.7901]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:42<06:20,  2.44it/s, acc_step=1/1, ce_loss_token=1.7558, perplexity_token=5.7881]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:43<05:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.7565, perplexity_token=5.7924]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:43<05:38,  2.73it/s, acc_step=1/1, ce_loss_token=1.7563, perplexity_token=5.7910]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:43<05:55,  2.60it/s, acc_step=1/1, ce_loss_token=1.7561, perplexity_token=5.7901]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:44<05:50,  2.63it/s, acc_step=1/1, ce_loss_token=1.7558, perplexity_token=5.7882]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:44<05:53,  2.61it/s, acc_step=1/1, ce_loss_token=1.7555, perplexity_token=5.7863]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:45<05:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.7552, perplexity_token=5.7847]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:45<05:44,  2.67it/s, acc_step=1/1, ce_loss_token=1.7551, perplexity_token=5.7839]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:45<06:06,  2.51it/s, acc_step=1/1, ce_loss_token=1.7548, perplexity_token=5.7824]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:46<05:58,  2.56it/s, acc_step=1/1, ce_loss_token=1.7545, perplexity_token=5.7808]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:46<05:53,  2.59it/s, acc_step=1/1, ce_loss_token=1.7544, perplexity_token=5.7798]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:46<05:51,  2.61it/s, acc_step=1/1, ce_loss_token=1.7541, perplexity_token=5.7783]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:47<05:50,  2.61it/s, acc_step=1/1, ce_loss_token=1.7539, perplexity_token=5.7773]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:47<05:56,  2.56it/s, acc_step=1/1, ce_loss_token=1.7536, perplexity_token=5.7755]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:49,  2.61it/s, acc_step=1/1, ce_loss_token=1.7533, perplexity_token=5.7737]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:48<05:46,  2.63it/s, acc_step=1/1, ce_loss_token=1.7532, perplexity_token=5.7728]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:48<05:22,  2.83it/s, acc_step=1/1, ce_loss_token=1.7538, perplexity_token=5.7764]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:33,  2.73it/s, acc_step=1/1, ce_loss_token=1.7536, perplexity_token=5.7755]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  13%|██████                                         | 135/1044 [00:49<05:36,  2.70it/s, acc_step=1/1, ce_loss_token=1.7535, perplexity_token=5.7746]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  13%|██████                                         | 136/1044 [00:49<05:40,  2.67it/s, acc_step=1/1, ce_loss_token=1.7532, perplexity_token=5.7732]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:50<05:38,  2.68it/s, acc_step=1/1, ce_loss_token=1.7530, perplexity_token=5.7722]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:50<05:29,  2.75it/s, acc_step=1/1, ce_loss_token=1.7537, perplexity_token=5.7761]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:50<05:25,  2.78it/s, acc_step=1/1, ce_loss_token=1.7535, perplexity_token=5.7749]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:51<05:22,  2.80it/s, acc_step=1/1, ce_loss_token=1.7533, perplexity_token=5.7737]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:51<05:42,  2.63it/s, acc_step=1/1, ce_loss_token=1.7530, perplexity_token=5.7721]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.7529, perplexity_token=5.7710]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:52<05:41,  2.64it/s, acc_step=1/1, ce_loss_token=1.7527, perplexity_token=5.7704]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:52<05:50,  2.57it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7692]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<05:48,  2.58it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7684]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:53<05:26,  2.75it/s, acc_step=1/1, ce_loss_token=1.7529, perplexity_token=5.7712]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:53<05:06,  2.93it/s, acc_step=1/1, ce_loss_token=1.7536, perplexity_token=5.7753]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:54<05:19,  2.80it/s, acc_step=1/1, ce_loss_token=1.7535, perplexity_token=5.7750]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:54<05:26,  2.74it/s, acc_step=1/1, ce_loss_token=1.7533, perplexity_token=5.7738]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:34,  2.68it/s, acc_step=1/1, ce_loss_token=1.7532, perplexity_token=5.7731]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:55<05:35,  2.66it/s, acc_step=1/1, ce_loss_token=1.7530, perplexity_token=5.7719]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:55<06:02,  2.46it/s, acc_step=1/1, ce_loss_token=1.7528, perplexity_token=5.7707]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:56<05:53,  2.52it/s, acc_step=1/1, ce_loss_token=1.7526, perplexity_token=5.7696]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:56<05:44,  2.58it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7688]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<05:35,  2.65it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7682]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  15%|███████                                        | 156/1044 [00:57<05:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7669]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  15%|███████                                        | 157/1044 [00:57<05:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7660]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  15%|███████                                        | 158/1044 [00:58<05:38,  2.62it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:58<05:45,  2.56it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7640]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:58<05:17,  2.78it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7676]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  15%|███████▏                                       | 161/1044 [00:59<05:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7666]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  16%|███████▎                                       | 162/1044 [00:59<05:26,  2.70it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7654]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:00<05:27,  2.69it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7653]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:00<05:06,  2.87it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7685]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:00<05:07,  2.86it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7668]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:01<05:13,  2.80it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7658]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:01<05:24,  2.71it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:01<05:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7637]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:02<05:15,  2.77it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7629]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:02<05:13,  2.79it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7622]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:02<05:18,  2.74it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7613]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:03<04:56,  2.94it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7636]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:03<05:03,  2.87it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7629]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:03<04:49,  3.00it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7676]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:04<04:55,  2.94it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7666]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:04<05:07,  2.83it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7656]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:04<05:17,  2.73it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7644]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  17%|████████                                       | 178/1044 [01:05<05:14,  2.76it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7636]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  17%|████████                                       | 179/1044 [01:05<05:21,  2.69it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7625]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  17%|████████                                       | 180/1044 [01:06<05:23,  2.67it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7618]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:06<05:21,  2.69it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7608]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:06<05:02,  2.85it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7648]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:07<05:32,  2.59it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7638]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:07<05:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7678]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:07<05:05,  2.81it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7670]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:08<05:00,  2.86it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7655]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:08<05:05,  2.80it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:08<05:05,  2.81it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:09<05:05,  2.79it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7639]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:09<05:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7632]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:10<05:06,  2.78it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7659]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:10<05:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7656]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:10<05:05,  2.78it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7647]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:11<05:15,  2.69it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7640]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:11<05:08,  2.75it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7634]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:11<04:59,  2.83it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7659]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:12<04:39,  3.04it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7685]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:12<04:38,  3.04it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7676]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:12<04:56,  2.85it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7671]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  19%|█████████                                      | 200/1044 [01:13<04:55,  2.86it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7666]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  19%|█████████                                      | 201/1044 [01:13<04:48,  2.92it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7688]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  19%|█████████                                      | 202/1044 [01:13<04:54,  2.86it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7678]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:14<04:37,  3.03it/s, acc_step=1/1, ce_loss_token=1.7527, perplexity_token=5.7704]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:14<04:45,  2.94it/s, acc_step=1/1, ce_loss_token=1.7526, perplexity_token=5.7694]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:14<04:45,  2.94it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7689]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:15<04:59,  2.80it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7680]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:15<05:03,  2.76it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7674]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:16<05:06,  2.73it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7667]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:16<05:08,  2.71it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7656]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:16<05:13,  2.66it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:17<05:03,  2.74it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7642]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:17<05:10,  2.68it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7631]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:17<04:44,  2.92it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7665]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:18<04:25,  3.12it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7689]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:18<04:31,  3.05it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7682]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:18<04:52,  2.83it/s, acc_step=1/1, ce_loss_token=1.7530, perplexity_token=5.7720]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:19<04:39,  2.96it/s, acc_step=1/1, ce_loss_token=1.7534, perplexity_token=5.7742]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:19<04:34,  3.01it/s, acc_step=1/1, ce_loss_token=1.7533, perplexity_token=5.7735]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:19<04:38,  2.96it/s, acc_step=1/1, ce_loss_token=1.7532, perplexity_token=5.7728]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:20<04:59,  2.75it/s, acc_step=1/1, ce_loss_token=1.7535, perplexity_token=5.7748]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:20<04:57,  2.77it/s, acc_step=1/1, ce_loss_token=1.7533, perplexity_token=5.7739]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:20<04:48,  2.85it/s, acc_step=1/1, ce_loss_token=1.7532, perplexity_token=5.7733]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  21%|██████████                                     | 223/1044 [01:21<04:53,  2.79it/s, acc_step=1/1, ce_loss_token=1.7531, perplexity_token=5.7727]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  21%|██████████                                     | 224/1044 [01:21<05:02,  2.71it/s, acc_step=1/1, ce_loss_token=1.7530, perplexity_token=5.7719]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:22<05:06,  2.67it/s, acc_step=1/1, ce_loss_token=1.7529, perplexity_token=5.7712]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:22<05:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.7528, perplexity_token=5.7706]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:22<05:19,  2.56it/s, acc_step=1/1, ce_loss_token=1.7526, perplexity_token=5.7698]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:23<05:15,  2.58it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7689]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:23<05:07,  2.65it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7681]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:23<05:10,  2.62it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7675]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:24<05:19,  2.55it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7670]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:24<05:25,  2.49it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7662]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:25<05:19,  2.54it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7654]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:25<05:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7649]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:25<05:22,  2.51it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7640]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:26<05:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7632]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:26<05:03,  2.66it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7628]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:26<04:35,  2.92it/s, acc_step=1/1, ce_loss_token=1.7527, perplexity_token=5.7704]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:27<04:39,  2.88it/s, acc_step=1/1, ce_loss_token=1.7526, perplexity_token=5.7696]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:27<04:39,  2.88it/s, acc_step=1/1, ce_loss_token=1.7525, perplexity_token=5.7689]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:27<04:40,  2.87it/s, acc_step=1/1, ce_loss_token=1.7524, perplexity_token=5.7684]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:28<04:37,  2.90it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7677]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:28<04:53,  2.73it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7673]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:29<05:07,  2.60it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7668]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  23%|███████████                                    | 245/1044 [01:29<04:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7663]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████                                    | 246/1044 [01:29<04:59,  2.66it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7656]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  24%|███████████                                    | 247/1044 [01:30<04:56,  2.68it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:30<04:55,  2.70it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7644]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:31<04:57,  2.67it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7638]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:31<04:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7670]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:31<04:40,  2.83it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7664]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:32<04:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7658]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:32<04:44,  2.78it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7652]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:32<04:52,  2.70it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7647]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:33<04:54,  2.68it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7640]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:33<04:33,  2.88it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7661]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:33<04:35,  2.85it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7653]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:34<04:39,  2.81it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:34<04:44,  2.76it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7638]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:34<04:40,  2.79it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7629]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:35<04:50,  2.70it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7621]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:35<04:57,  2.63it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7615]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:36<05:19,  2.45it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7607]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:36<05:10,  2.51it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7601]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:37<04:24,  2.95it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7677]

torch.Size([256, 287, 35]) torch.Size([256, 287])
torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  26%|████████████                                   | 267/1044 [01:37<04:34,  2.83it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7669]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  26%|████████████                                   | 268/1044 [01:37<04:45,  2.72it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7663]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  26%|████████████                                   | 269/1044 [01:38<04:51,  2.65it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7659]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:38<04:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7652]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:39<04:56,  2.61it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7643]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:39<04:54,  2.62it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7637]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:39<04:56,  2.60it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7631]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:40<04:51,  2.64it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7625]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:40<04:54,  2.61it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7619]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:41<04:53,  2.61it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7612]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:41<04:34,  2.80it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7631]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:41<04:15,  3.00it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:41<04:15,  3.00it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7639]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:42<03:58,  3.20it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7661]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:42<03:59,  3.18it/s, acc_step=1/1, ce_loss_token=1.7523, perplexity_token=5.7681]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:42<04:20,  2.93it/s, acc_step=1/1, ce_loss_token=1.7522, perplexity_token=5.7673]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:43<04:39,  2.73it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7668]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:43<04:45,  2.66it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7660]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:44<04:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7656]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:44<04:42,  2.69it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:44<04:37,  2.73it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:45<05:15,  2.39it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7641]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:45<05:05,  2.48it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7635]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:46<04:58,  2.52it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7630]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:46<04:54,  2.56it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7622]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:46<04:56,  2.54it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7618]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:47<04:47,  2.61it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7613]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:47<04:49,  2.59it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7607]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:48<04:45,  2.63it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7603]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:48<04:31,  2.75it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7632]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:48<04:34,  2.72it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7626]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:49<04:36,  2.70it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7622]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:49<04:43,  2.63it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7617]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:49<04:45,  2.61it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7610]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:50<04:43,  2.62it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7602]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:50<04:36,  2.68it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7598]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:50<04:19,  2.86it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7622]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:51<04:34,  2.70it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7616]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:51<04:31,  2.72it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7610]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:52<04:30,  2.73it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7605]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:52<04:27,  2.75it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7599]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:52<04:32,  2.71it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7593]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:53<04:35,  2.66it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7589]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:53<04:14,  2.89it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7606]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:53<04:13,  2.89it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7601]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:54<04:17,  2.84it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7596]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:54<04:17,  2.84it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7591]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:54<04:22,  2.78it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7588]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:55<04:22,  2.77it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7584]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:55<04:33,  2.66it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7578]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:56<04:31,  2.68it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7574]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:56<04:27,  2.71it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7568]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:56<04:23,  2.75it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7562]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:57<04:22,  2.76it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7555]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  31%|██████████████▍                                | 321/1044 [01:57<04:25,  2.72it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7547]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  31%|██████████████▍                                | 322/1044 [01:57<04:36,  2.61it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  31%|██████████████▌                                | 323/1044 [01:58<04:43,  2.54it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  31%|██████████████▌                                | 324/1044 [01:58<04:50,  2.47it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7529]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  31%|██████████████▋                                | 325/1044 [01:59<04:50,  2.47it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7524]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  31%|██████████████▋                                | 326/1044 [01:59<04:37,  2.58it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7521]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  31%|██████████████▋                                | 327/1044 [01:59<04:34,  2.61it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7517]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:00<04:17,  2.78it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7533]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:00<04:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7526]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:00<04:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7523]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:01<04:30,  2.64it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7520]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:01<04:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7515]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:02<04:20,  2.73it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7512]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  32%|███████████████                                | 334/1044 [02:02<04:24,  2.68it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  32%|███████████████                                | 335/1044 [02:02<04:35,  2.58it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7500]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:03<04:33,  2.59it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7495]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:03<04:31,  2.61it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7490]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:03<04:25,  2.66it/s, acc_step=1/1, ce_loss_token=1.7489, perplexity_token=5.7482]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:04<04:13,  2.79it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7500]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:04<04:12,  2.79it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7496]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:05<05:07,  2.28it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7489]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:05<04:51,  2.41it/s, acc_step=1/1, ce_loss_token=1.7489, perplexity_token=5.7485]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:06<04:53,  2.38it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7479]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:06<04:44,  2.46it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7475]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:06<04:36,  2.52it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7469]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:07<04:40,  2.49it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7465]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:07<04:32,  2.56it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7458]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:07<04:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.7484, perplexity_token=5.7456]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:08<04:29,  2.58it/s, acc_step=1/1, ce_loss_token=1.7484, perplexity_token=5.7453]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:08<04:38,  2.49it/s, acc_step=1/1, ce_loss_token=1.7483, perplexity_token=5.7445]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:09<04:20,  2.66it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7461]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:09<04:15,  2.70it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7457]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:09<03:56,  2.92it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7479]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:10<04:01,  2.85it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7475]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:10<04:05,  2.80it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7472]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  34%|████████████████                               | 356/1044 [02:10<04:04,  2.81it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7466]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|████████████████                               | 357/1044 [02:11<04:04,  2.81it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7460]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  34%|████████████████                               | 358/1044 [02:11<04:11,  2.73it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7458]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:11<04:09,  2.75it/s, acc_step=1/1, ce_loss_token=1.7484, perplexity_token=5.7453]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:12<04:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7483, perplexity_token=5.7451]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:12<03:31,  3.22it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7514]

torch.Size([256, 328, 35]) torch.Size([256, 328])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:13<03:42,  3.06it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7510]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:13<03:49,  2.96it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:13<03:53,  2.91it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7499]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:14<03:53,  2.90it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7493]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:14<03:43,  3.03it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7507]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:14<03:37,  3.11it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:15<03:41,  3.04it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7521]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:15<03:45,  2.99it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7518]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:15<04:02,  2.77it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7514]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:16<04:00,  2.79it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7509]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:16<04:14,  2.63it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:17<04:16,  2.61it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7501]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:17<04:11,  2.66it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7496]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:18<04:37,  2.41it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7493]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:18<04:23,  2.53it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7509]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:18<04:12,  2.64it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7504]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:19<03:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7515]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:19<03:38,  3.04it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:19<03:46,  2.93it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7531]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:20<03:51,  2.86it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:20<03:48,  2.89it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7522]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:20<03:37,  3.03it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7534]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:21<03:52,  2.83it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7528]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:21<03:57,  2.77it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7522]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:21<04:05,  2.67it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7518]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:22<03:45,  2.91it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7532]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:22<03:27,  3.14it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7587]

torch.Size([256, 293, 35]) torch.Size([256, 293])
torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:23<03:37,  3.01it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7580]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:23<03:43,  2.91it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7577]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:23<03:39,  2.97it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7591]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:24<03:41,  2.93it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7586]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:24<03:43,  2.90it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7580]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:24<03:41,  2.93it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7575]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:25<03:30,  3.08it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7589]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:25<03:24,  3.16it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7599]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:25<03:31,  3.04it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7594]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:26<03:39,  2.93it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7591]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:26<03:41,  2.90it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7586]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:26<03:56,  2.72it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7580]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:27<03:36,  2.96it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7591]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:27<03:41,  2.89it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7588]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:27<03:43,  2.86it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7585]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:28<03:43,  2.85it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7581]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:28<03:29,  3.05it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7594]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:28<03:40,  2.89it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7591]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:29<04:01,  2.63it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7585]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:29<04:02,  2.61it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7581]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:30<03:58,  2.65it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7576]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:30<04:27,  2.36it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7573]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:31<04:12,  2.50it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7567]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:31<04:04,  2.58it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7562]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:31<03:43,  2.82it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7571]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:32<03:44,  2.80it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7568]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:32<03:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7563]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:32<03:47,  2.75it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7560]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:33<03:50,  2.71it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7554]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:33<03:47,  2.75it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7549]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:33<03:45,  2.77it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7545]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:34<04:05,  2.54it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:34<03:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7550]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:34<03:32,  2.92it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7564]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:35<03:43,  2.77it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7560]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:35<03:48,  2.71it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7556]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:36<03:50,  2.68it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:36<03:47,  2.71it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7549]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:36<03:54,  2.63it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7546]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:37<03:37,  2.83it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7563]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:37<03:40,  2.78it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7558]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:37<03:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:38<03:43,  2.73it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7549]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:38<03:45,  2.70it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7545]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:38<03:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7541]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:39<03:42,  2.73it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7538]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:39<03:31,  2.87it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7549]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:40<03:36,  2.80it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7546]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:40<03:37,  2.78it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7541]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:40<03:38,  2.77it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7536]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:41<03:37,  2.77it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7533]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:41<03:41,  2.72it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7529]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:41<03:37,  2.76it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:42<03:36,  2.77it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7524]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:42<03:48,  2.62it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:43<03:53,  2.57it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7530]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:43<03:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7525]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:43<03:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7536]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:44<03:32,  2.80it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7531]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:44<03:37,  2.74it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7528]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:44<03:40,  2.69it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7525]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:45<03:21,  2.94it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7537]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:45<03:30,  2.81it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7532]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:45<03:32,  2.77it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7529]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:46<03:39,  2.68it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7528]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:46<03:47,  2.58it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7524]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:46<03:26,  2.84it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:47<03:26,  2.84it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7531]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:47<03:12,  3.03it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:48<03:28,  2.81it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7539]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:48<03:33,  2.73it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:48<04:04,  2.38it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7532]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:49<03:56,  2.45it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7550]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:49<03:54,  2.48it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7545]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:50<03:49,  2.53it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7541]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:50<03:40,  2.62it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7537]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:50<03:42,  2.59it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7535]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:51<03:41,  2.60it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7530]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:51<03:46,  2.53it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7526]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:51<03:39,  2.62it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7522]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:52<03:37,  2.64it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7516]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:52<03:36,  2.64it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7513]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:53<03:41,  2.58it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7510]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:53<03:39,  2.60it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:53<03:37,  2.61it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7501]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:54<03:06,  3.03it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7526]

torch.Size([256, 305, 35]) torch.Size([256, 305])
torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:54<03:09,  2.99it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7522]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:55<03:22,  2.80it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7520]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:55<03:19,  2.82it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7515]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:55<03:24,  2.76it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7510]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:56<03:24,  2.74it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7508]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:56<03:24,  2.74it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7504]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:57<03:23,  2.76it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7501]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [02:57<03:25,  2.73it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7498]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [02:57<03:31,  2.63it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7494]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [02:58<03:11,  2.91it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7536]

torch.Size([256, 318, 35]) torch.Size([256, 318])
torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  47%|██████████████████████                         | 489/1044 [02:58<03:11,  2.90it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7532]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  47%|██████████████████████                         | 490/1044 [02:59<03:24,  2.71it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  47%|██████████████████████                         | 491/1044 [02:59<03:26,  2.68it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7523]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:00<03:30,  2.63it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:00<03:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7530]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:00<03:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:01<03:34,  2.56it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7525]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:01<03:26,  2.65it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7521]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:01<03:20,  2.72it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:02<03:27,  2.63it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7514]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:02<03:12,  2.84it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7524]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:02<03:20,  2.71it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7521]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:03<03:05,  2.93it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7531]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:03<03:05,  2.93it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7527]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:03<03:06,  2.90it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7523]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:04<03:14,  2.78it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:04<03:18,  2.71it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7516]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:05<03:04,  2.91it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7525]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:05<03:06,  2.87it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:05<03:09,  2.83it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7515]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:06<03:10,  2.80it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7512]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:06<03:14,  2.74it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7508]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:06<03:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:07<03:25,  2.59it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7500]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:07<02:55,  3.02it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7526]

torch.Size([256, 300, 35]) torch.Size([256, 300])
torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:08<03:08,  2.81it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7522]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:08<03:14,  2.71it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:09<03:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7514]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:09<03:22,  2.60it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7510]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:09<03:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7505]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:10<03:04,  2.85it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7513]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:10<02:40,  3.25it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7537]

torch.Size([256, 291, 35]) torch.Size([256, 291])
torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:11<02:37,  3.31it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7547]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:11<02:40,  3.24it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7556]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:11<02:58,  2.90it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:12<03:03,  2.83it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7550]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:12<03:13,  2.68it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7547]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:12<03:13,  2.67it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7542]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:13<03:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:13<03:08,  2.73it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7549]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:13<03:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7545]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:14<02:51,  2.98it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7559]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:14<02:56,  2.90it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7556]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:14<02:50,  2.99it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7565]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:15<02:57,  2.86it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7561]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:15<03:02,  2.79it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7557]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:16<03:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7555]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:16<03:14,  2.60it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7551]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:17<03:33,  2.37it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7560]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:17<03:19,  2.52it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7568]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:17<03:15,  2.57it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7564]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:18<02:46,  3.01it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7599]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:18<03:02,  2.74it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7595]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:19<02:53,  2.87it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7609]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:19<02:57,  2.81it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7604]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:19<03:00,  2.75it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7600]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:20<02:48,  2.95it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7613]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:20<02:53,  2.85it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7609]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:20<02:44,  3.01it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7617]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:21<02:46,  2.97it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7613]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:21<02:39,  3.08it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7648]

torch.Size([256, 269, 35]) torch.Size([256, 269])
torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:22<02:45,  2.96it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7645]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:22<02:37,  3.10it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7653]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:22<02:42,  3.00it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:23<02:49,  2.88it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7647]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:23<02:55,  2.77it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7644]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:23<02:45,  2.94it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:24<02:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7647]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:24<02:41,  2.99it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7654]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:24<02:46,  2.90it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7649]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:25<02:49,  2.84it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:25<02:56,  2.73it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7644]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:25<02:43,  2.93it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7653]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:26<02:47,  2.85it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7651]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:26<02:40,  2.96it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7661]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:26<02:32,  3.12it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7668]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:27<02:46,  2.84it/s, acc_step=1/1, ce_loss_token=1.7521, perplexity_token=5.7664]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:27<02:50,  2.79it/s, acc_step=1/1, ce_loss_token=1.7520, perplexity_token=5.7661]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:28<02:56,  2.69it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7658]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:28<02:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.7519, perplexity_token=5.7653]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:28<02:59,  2.63it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7650]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:29<02:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:29<02:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7644]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:30<03:03,  2.56it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7639]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:30<03:03,  2.55it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7635]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:30<02:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7518, perplexity_token=5.7648]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:31<02:54,  2.66it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7646]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:31<02:50,  2.72it/s, acc_step=1/1, ce_loss_token=1.7517, perplexity_token=5.7642]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:31<02:48,  2.75it/s, acc_step=1/1, ce_loss_token=1.7516, perplexity_token=5.7639]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:32<02:51,  2.70it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7635]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:32<02:56,  2.61it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7633]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:33<03:02,  2.52it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7630]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:33<03:00,  2.55it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7627]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:33<02:49,  2.70it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7633]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:34<02:52,  2.65it/s, acc_step=1/1, ce_loss_token=1.7515, perplexity_token=5.7630]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:34<02:50,  2.68it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7628]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:34<02:47,  2.72it/s, acc_step=1/1, ce_loss_token=1.7514, perplexity_token=5.7624]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:35<03:12,  2.36it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7621]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:35<03:00,  2.51it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7617]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:36<03:02,  2.48it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7614]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:36<02:43,  2.77it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7622]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:36<02:46,  2.69it/s, acc_step=1/1, ce_loss_token=1.7513, perplexity_token=5.7619]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:37<02:50,  2.63it/s, acc_step=1/1, ce_loss_token=1.7512, perplexity_token=5.7615]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:37<02:46,  2.68it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7611]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:38<02:43,  2.73it/s, acc_step=1/1, ce_loss_token=1.7511, perplexity_token=5.7607]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:38<02:43,  2.72it/s, acc_step=1/1, ce_loss_token=1.7510, perplexity_token=5.7603]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:38<02:45,  2.68it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7600]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:39<02:46,  2.67it/s, acc_step=1/1, ce_loss_token=1.7509, perplexity_token=5.7597]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:39<02:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7593]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:39<02:39,  2.77it/s, acc_step=1/1, ce_loss_token=1.7508, perplexity_token=5.7590]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:40<02:48,  2.62it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7586]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:40<02:56,  2.49it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7583]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:41<02:51,  2.56it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7578]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:41<02:42,  2.69it/s, acc_step=1/1, ce_loss_token=1.7507, perplexity_token=5.7586]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:41<02:44,  2.66it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7583]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:42<02:42,  2.68it/s, acc_step=1/1, ce_loss_token=1.7506, perplexity_token=5.7580]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:42<02:41,  2.70it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7576]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:42<02:39,  2.71it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7573]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:43<02:35,  2.79it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7570]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:43<02:33,  2.81it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7567]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:43<02:21,  3.04it/s, acc_step=1/1, ce_loss_token=1.7505, perplexity_token=5.7574]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:44<02:25,  2.95it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7571]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:44<02:29,  2.87it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7568]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:44<02:33,  2.79it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7565]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:45<02:35,  2.74it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7562]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:45<02:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7559]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:46<02:41,  2.63it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7557]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:46<02:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7569]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:46<02:30,  2.81it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7567]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:47<02:50,  2.47it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7563]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:47<02:50,  2.47it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7560]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:48<02:36,  2.69it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7571]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:48<02:31,  2.76it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7567]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:48<02:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7565]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:49<02:32,  2.73it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7561]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:49<02:35,  2.67it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7558]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:49<02:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7555]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:50<02:29,  2.77it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:50<02:30,  2.74it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7549]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:50<02:32,  2.70it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7546]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:51<02:35,  2.64it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:51<02:39,  2.57it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7539]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:52<02:31,  2.70it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7546]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:52<02:35,  2.62it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7542]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:52<02:24,  2.82it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7554]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:53<02:17,  2.94it/s, acc_step=1/1, ce_loss_token=1.7503, perplexity_token=5.7561]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:53<02:18,  2.92it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7558]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:53<02:21,  2.85it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7554]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:54<02:29,  2.70it/s, acc_step=1/1, ce_loss_token=1.7501, perplexity_token=5.7552]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:54<02:30,  2.67it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7548]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:54<02:29,  2.69it/s, acc_step=1/1, ce_loss_token=1.7500, perplexity_token=5.7546]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:55<02:28,  2.70it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:55<02:31,  2.64it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7541]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:56<02:37,  2.53it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7537]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:56<02:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7543]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:56<02:28,  2.66it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7542]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [03:57<02:29,  2.64it/s, acc_step=1/1, ce_loss_token=1.7499, perplexity_token=5.7538]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [03:57<02:26,  2.68it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7536]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [03:57<02:25,  2.71it/s, acc_step=1/1, ce_loss_token=1.7498, perplexity_token=5.7533]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [03:58<02:25,  2.70it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7530]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [03:58<02:22,  2.74it/s, acc_step=1/1, ce_loss_token=1.7497, perplexity_token=5.7526]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [03:59<02:26,  2.66it/s, acc_step=1/1, ce_loss_token=1.7496, perplexity_token=5.7523]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [03:59<02:23,  2.70it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7519]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [03:59<02:26,  2.65it/s, acc_step=1/1, ce_loss_token=1.7495, perplexity_token=5.7516]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:00<02:22,  2.71it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7512]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:00<02:31,  2.54it/s, acc_step=1/1, ce_loss_token=1.7494, perplexity_token=5.7510]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:01<02:32,  2.52it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7508]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:01<02:30,  2.55it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7503]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:01<02:26,  2.61it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7500]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:02<02:16,  2.81it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7507]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:02<02:18,  2.76it/s, acc_step=1/1, ce_loss_token=1.7493, perplexity_token=5.7504]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:02<02:19,  2.72it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7501]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:03<02:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7497]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:03<02:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7494]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:04<02:34,  2.43it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7491]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:04<02:22,  2.65it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7498]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:04<02:17,  2.72it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7494]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:05<02:27,  2.54it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7491]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:05<02:26,  2.55it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7487]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:06<02:26,  2.54it/s, acc_step=1/1, ce_loss_token=1.7489, perplexity_token=5.7484]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:06<02:27,  2.51it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7479]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:06<02:25,  2.54it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7475]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:07<02:21,  2.61it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7473]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:07<02:21,  2.60it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7470]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:07<02:12,  2.77it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7475]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:08<02:14,  2.72it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7473]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:08<01:58,  3.07it/s, acc_step=1/1, ce_loss_token=1.7492, perplexity_token=5.7499]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:08<02:03,  2.96it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7496]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:09<02:05,  2.89it/s, acc_step=1/1, ce_loss_token=1.7491, perplexity_token=5.7492]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:09<02:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7491]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:10<02:20,  2.56it/s, acc_step=1/1, ce_loss_token=1.7490, perplexity_token=5.7488]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:10<02:16,  2.63it/s, acc_step=1/1, ce_loss_token=1.7489, perplexity_token=5.7485]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:10<02:13,  2.69it/s, acc_step=1/1, ce_loss_token=1.7489, perplexity_token=5.7482]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:11<02:13,  2.68it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7479]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:11<02:11,  2.72it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7475]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:11<02:12,  2.69it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7471]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:12<02:11,  2.70it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7468]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:12<02:02,  2.89it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7479]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:12<02:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7477]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:13<02:11,  2.67it/s, acc_step=1/1, ce_loss_token=1.7488, perplexity_token=5.7476]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:13<02:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7474]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:14<02:11,  2.66it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7471]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:14<02:15,  2.57it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7468]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:14<02:11,  2.65it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7464]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:15<02:00,  2.87it/s, acc_step=1/1, ce_loss_token=1.7487, perplexity_token=5.7469]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:15<02:00,  2.86it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7467]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:15<02:07,  2.71it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7464]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:16<02:10,  2.63it/s, acc_step=1/1, ce_loss_token=1.7486, perplexity_token=5.7463]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:16<02:08,  2.68it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7461]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:17<02:09,  2.63it/s, acc_step=1/1, ce_loss_token=1.7485, perplexity_token=5.7458]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:17<02:06,  2.70it/s, acc_step=1/1, ce_loss_token=1.7484, perplexity_token=5.7456]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:17<02:06,  2.70it/s, acc_step=1/1, ce_loss_token=1.7484, perplexity_token=5.7453]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:18<02:06,  2.68it/s, acc_step=1/1, ce_loss_token=1.7483, perplexity_token=5.7450]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:18<02:12,  2.55it/s, acc_step=1/1, ce_loss_token=1.7483, perplexity_token=5.7447]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:18<02:12,  2.53it/s, acc_step=1/1, ce_loss_token=1.7482, perplexity_token=5.7444]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:19<02:13,  2.52it/s, acc_step=1/1, ce_loss_token=1.7482, perplexity_token=5.7441]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:19<02:13,  2.50it/s, acc_step=1/1, ce_loss_token=1.7481, perplexity_token=5.7438]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:20<02:09,  2.59it/s, acc_step=1/1, ce_loss_token=1.7481, perplexity_token=5.7434]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:20<02:09,  2.57it/s, acc_step=1/1, ce_loss_token=1.7480, perplexity_token=5.7430]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:20<02:07,  2.61it/s, acc_step=1/1, ce_loss_token=1.7479, perplexity_token=5.7426]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:21<02:07,  2.59it/s, acc_step=1/1, ce_loss_token=1.7479, perplexity_token=5.7425]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:21<02:05,  2.63it/s, acc_step=1/1, ce_loss_token=1.7478, perplexity_token=5.7422]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:22<02:08,  2.56it/s, acc_step=1/1, ce_loss_token=1.7478, perplexity_token=5.7419]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:22<02:05,  2.61it/s, acc_step=1/1, ce_loss_token=1.7478, perplexity_token=5.7417]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:22<02:13,  2.45it/s, acc_step=1/1, ce_loss_token=1.7477, perplexity_token=5.7413]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:23<02:10,  2.49it/s, acc_step=1/1, ce_loss_token=1.7476, perplexity_token=5.7410]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:23<02:08,  2.54it/s, acc_step=1/1, ce_loss_token=1.7476, perplexity_token=5.7407]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:24<02:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7475, perplexity_token=5.7404]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:24<01:59,  2.69it/s, acc_step=1/1, ce_loss_token=1.7475, perplexity_token=5.7402]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:24<02:00,  2.67it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7399]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:25<01:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7395]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:25<02:01,  2.64it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7392]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:25<01:51,  2.85it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7398]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:26<01:52,  2.82it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7394]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:26<01:53,  2.80it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7391]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:26<01:54,  2.77it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7388]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:27<01:55,  2.72it/s, acc_step=1/1, ce_loss_token=1.7472, perplexity_token=5.7385]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:27<01:54,  2.74it/s, acc_step=1/1, ce_loss_token=1.7472, perplexity_token=5.7382]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:28<01:53,  2.76it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7379]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:28<01:54,  2.73it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7376]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:28<01:58,  2.62it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7374]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:29<01:56,  2.67it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7371]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:29<01:55,  2.68it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7368]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:29<01:54,  2.70it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7366]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:30<01:59,  2.58it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7363]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:30<01:57,  2.61it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7361]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:31<01:55,  2.65it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7359]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:31<01:52,  2.71it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7356]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:31<01:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7354]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:32<01:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7352]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:32<01:50,  2.74it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7349]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:32<01:55,  2.60it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7347]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:33<01:53,  2.62it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7344]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:33<01:52,  2.64it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7341]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:34<01:59,  2.49it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7339]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:34<01:58,  2.50it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7335]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:34<01:57,  2.50it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7332]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:35<01:54,  2.56it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7330]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:35<01:47,  2.72it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7336]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:35<01:46,  2.75it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7334]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:36<01:47,  2.72it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7332]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:36<01:42,  2.83it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7341]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:37<01:41,  2.86it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7340]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:37<01:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7336]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:37<01:38,  2.91it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7346]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:37<01:32,  3.08it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7353]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:38<01:34,  3.00it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7350]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:38<01:40,  2.84it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7347]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:39<01:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7344]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:39<01:41,  2.78it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7342]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:39<01:40,  2.79it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7339]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:40<01:41,  2.76it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7337]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:40<01:35,  2.92it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7344]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:40<01:40,  2.76it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7341]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:41<01:32,  2.99it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7348]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:41<01:36,  2.86it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7346]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:41<01:31,  3.01it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7355]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:42<01:34,  2.91it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7354]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:42<01:33,  2.93it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7360]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:42<01:32,  2.94it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7366]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:43<01:40,  2.71it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7364]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:43<01:42,  2.64it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7361]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:44<01:42,  2.63it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7358]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:44<01:33,  2.86it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7362]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:44<01:28,  3.02it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7372]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:45<01:30,  2.94it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7370]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:45<01:31,  2.89it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7368]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:45<01:32,  2.87it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7364]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:46<01:32,  2.83it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7361]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:46<01:30,  2.90it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7370]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:46<01:29,  2.91it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7369]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:47<01:24,  3.07it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7374]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:47<01:26,  2.99it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7372]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:47<01:30,  2.87it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7370]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:48<01:30,  2.84it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7368]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:48<01:32,  2.75it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7365]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:48<01:34,  2.70it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7363]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:49<01:28,  2.86it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7369]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:49<01:22,  3.05it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7374]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:49<01:24,  2.98it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7371]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:50<01:19,  3.17it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7380]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:50<01:22,  3.02it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7377]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:50<01:27,  2.86it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7375]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:51<01:34,  2.63it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7372]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:51<01:31,  2.69it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7370]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:52<01:30,  2.72it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7367]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:52<01:32,  2.64it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7364]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:52<01:31,  2.67it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7361]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:53<01:25,  2.83it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7365]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:53<01:21,  2.96it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7371]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:53<01:22,  2.92it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7369]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:54<01:17,  3.11it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7374]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:54<01:22,  2.90it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7371]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:54<01:27,  2.72it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7369]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:55<01:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7375]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:55<01:22,  2.85it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7393]

torch.Size([256, 311, 35]) torch.Size([256, 311])
torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:56<01:26,  2.70it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7399]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [04:56<01:33,  2.49it/s, acc_step=1/1, ce_loss_token=1.7474, perplexity_token=5.7396]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [04:57<01:37,  2.38it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7393]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [04:57<01:35,  2.42it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7390]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [04:58<01:31,  2.50it/s, acc_step=1/1, ce_loss_token=1.7472, perplexity_token=5.7387]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [04:58<01:28,  2.59it/s, acc_step=1/1, ce_loss_token=1.7472, perplexity_token=5.7384]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [04:58<01:26,  2.65it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7382]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [04:59<01:26,  2.62it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7379]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [04:59<01:27,  2.58it/s, acc_step=1/1, ce_loss_token=1.7471, perplexity_token=5.7377]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [04:59<01:25,  2.62it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7374]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:00<01:27,  2.57it/s, acc_step=1/1, ce_loss_token=1.7470, perplexity_token=5.7372]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:00<01:29,  2.48it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7370]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:01<01:28,  2.51it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7367]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:01<01:26,  2.57it/s, acc_step=1/1, ce_loss_token=1.7469, perplexity_token=5.7365]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:01<01:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7363]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:02<01:23,  2.63it/s, acc_step=1/1, ce_loss_token=1.7468, perplexity_token=5.7360]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:02<01:22,  2.65it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7357]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:03<01:21,  2.66it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7354]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:03<01:14,  2.89it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7359]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:03<01:14,  2.88it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7356]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:04<01:14,  2.87it/s, acc_step=1/1, ce_loss_token=1.7467, perplexity_token=5.7354]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:04<01:14,  2.87it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7351]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:04<01:13,  2.88it/s, acc_step=1/1, ce_loss_token=1.7466, perplexity_token=5.7348]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:05<01:15,  2.79it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7345]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:05<01:18,  2.69it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7343]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:05<01:15,  2.77it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7341]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:06<01:11,  2.93it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7347]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:06<01:11,  2.88it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7345]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:06<01:13,  2.81it/s, acc_step=1/1, ce_loss_token=1.7465, perplexity_token=5.7342]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:07<01:13,  2.80it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7340]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:07<01:17,  2.65it/s, acc_step=1/1, ce_loss_token=1.7464, perplexity_token=5.7339]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:07<01:13,  2.76it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7336]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:08<01:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7334]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:08<01:16,  2.62it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7332]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:09<01:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7331]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:09<01:16,  2.60it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7329]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:09<01:17,  2.54it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7328]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:10<01:17,  2.56it/s, acc_step=1/1, ce_loss_token=1.7461, perplexity_token=5.7325]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:10<01:10,  2.78it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7329]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:10<01:11,  2.73it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7327]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:11<01:09,  2.80it/s, acc_step=1/1, ce_loss_token=1.7461, perplexity_token=5.7324]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:11<01:09,  2.78it/s, acc_step=1/1, ce_loss_token=1.7463, perplexity_token=5.7333]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:12<01:10,  2.71it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7330]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:12<01:10,  2.72it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7327]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:12<01:09,  2.75it/s, acc_step=1/1, ce_loss_token=1.7462, perplexity_token=5.7325]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:13<01:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7461, perplexity_token=5.7322]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:13<01:09,  2.72it/s, acc_step=1/1, ce_loss_token=1.7461, perplexity_token=5.7319]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:13<01:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7460, perplexity_token=5.7317]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:14<01:13,  2.54it/s, acc_step=1/1, ce_loss_token=1.7460, perplexity_token=5.7314]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:14<01:16,  2.43it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7311]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:15<01:12,  2.53it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7308]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:15<01:09,  2.65it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7306]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:16<01:14,  2.43it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7304]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:16<01:12,  2.51it/s, acc_step=1/1, ce_loss_token=1.7457, perplexity_token=5.7301]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:16<01:06,  2.72it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7305]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:16<01:01,  2.90it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7311]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:17<01:01,  2.89it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7309]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:17<00:59,  3.00it/s, acc_step=1/1, ce_loss_token=1.7460, perplexity_token=5.7315]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:17<01:00,  2.90it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7312]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:18<00:58,  3.00it/s, acc_step=1/1, ce_loss_token=1.7461, perplexity_token=5.7320]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:18<01:00,  2.89it/s, acc_step=1/1, ce_loss_token=1.7460, perplexity_token=5.7317]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:19<01:00,  2.86it/s, acc_step=1/1, ce_loss_token=1.7460, perplexity_token=5.7316]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:19<01:09,  2.47it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7313]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:19<01:08,  2.51it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7310]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:20<01:04,  2.65it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7308]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:20<01:05,  2.57it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7306]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:21<01:03,  2.64it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7303]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:21<00:57,  2.89it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7311]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:21<00:57,  2.87it/s, acc_step=1/1, ce_loss_token=1.7459, perplexity_token=5.7308]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:22<00:58,  2.80it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7306]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:22<01:12,  2.27it/s, acc_step=1/1, ce_loss_token=1.7458, perplexity_token=5.7303]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:23<01:09,  2.36it/s, acc_step=1/1, ce_loss_token=1.7457, perplexity_token=5.7301]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:23<01:05,  2.46it/s, acc_step=1/1, ce_loss_token=1.7457, perplexity_token=5.7299]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:23<01:05,  2.46it/s, acc_step=1/1, ce_loss_token=1.7457, perplexity_token=5.7298]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:24<01:02,  2.55it/s, acc_step=1/1, ce_loss_token=1.7456, perplexity_token=5.7296]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:24<01:02,  2.56it/s, acc_step=1/1, ce_loss_token=1.7456, perplexity_token=5.7293]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:24<00:59,  2.64it/s, acc_step=1/1, ce_loss_token=1.7456, perplexity_token=5.7291]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:25<01:02,  2.49it/s, acc_step=1/1, ce_loss_token=1.7455, perplexity_token=5.7289]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:25<01:01,  2.54it/s, acc_step=1/1, ce_loss_token=1.7455, perplexity_token=5.7286]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:26<01:10,  2.19it/s, acc_step=1/1, ce_loss_token=1.7454, perplexity_token=5.7284]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:26<01:05,  2.37it/s, acc_step=1/1, ce_loss_token=1.7454, perplexity_token=5.7282]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:27<00:59,  2.59it/s, acc_step=1/1, ce_loss_token=1.7455, perplexity_token=5.7289]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:27<01:00,  2.52it/s, acc_step=1/1, ce_loss_token=1.7455, perplexity_token=5.7286]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:27<00:58,  2.57it/s, acc_step=1/1, ce_loss_token=1.7455, perplexity_token=5.7285]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:28<01:19,  1.89it/s, acc_step=1/1, ce_loss_token=1.7454, perplexity_token=5.7284]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:29<01:13,  2.02it/s, acc_step=1/1, ce_loss_token=1.7454, perplexity_token=5.7283]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:29<01:11,  2.07it/s, acc_step=1/1, ce_loss_token=1.7454, perplexity_token=5.7281]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:29<01:06,  2.21it/s, acc_step=1/1, ce_loss_token=1.7453, perplexity_token=5.7279]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:30<01:03,  2.31it/s, acc_step=1/1, ce_loss_token=1.7453, perplexity_token=5.7277]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:30<01:00,  2.39it/s, acc_step=1/1, ce_loss_token=1.7453, perplexity_token=5.7275]

torch.Size([256, 270, 35]) torch.Size([256, 270])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:31<00:55,  2.57it/s, acc_step=1/1, ce_loss_token=1.7452, perplexity_token=5.7273]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:31<00:53,  2.67it/s, acc_step=1/1, ce_loss_token=1.7452, perplexity_token=5.7270]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:31<00:55,  2.56it/s, acc_step=1/1, ce_loss_token=1.7451, perplexity_token=5.7267]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:32<00:56,  2.52it/s, acc_step=1/1, ce_loss_token=1.7451, perplexity_token=5.7265]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:32<00:54,  2.56it/s, acc_step=1/1, ce_loss_token=1.7451, perplexity_token=5.7262]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:32<00:53,  2.61it/s, acc_step=1/1, ce_loss_token=1.7450, perplexity_token=5.7260]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:33<00:51,  2.66it/s, acc_step=1/1, ce_loss_token=1.7450, perplexity_token=5.7258]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:33<00:51,  2.65it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7256]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:34<00:52,  2.58it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7254]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:34<00:50,  2.69it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7252]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:34<00:46,  2.88it/s, acc_step=1/1, ce_loss_token=1.7450, perplexity_token=5.7257]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:35<00:46,  2.89it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7255]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:35<00:45,  2.90it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7253]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:35<00:45,  2.90it/s, acc_step=1/1, ce_loss_token=1.7449, perplexity_token=5.7252]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:36<00:46,  2.78it/s, acc_step=1/1, ce_loss_token=1.7448, perplexity_token=5.7249]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:36<00:52,  2.47it/s, acc_step=1/1, ce_loss_token=1.7448, perplexity_token=5.7246]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:37<00:51,  2.51it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7244]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:37<00:51,  2.44it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7242]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:37<00:51,  2.46it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7239]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:38<00:47,  2.65it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7244]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:38<00:48,  2.58it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7242]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:38<00:46,  2.63it/s, acc_step=1/1, ce_loss_token=1.7447, perplexity_token=5.7239]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:39<00:46,  2.61it/s, acc_step=1/1, ce_loss_token=1.7446, perplexity_token=5.7236]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:39<00:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.7446, perplexity_token=5.7234]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:40<00:46,  2.58it/s, acc_step=1/1, ce_loss_token=1.7445, perplexity_token=5.7232]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:40<00:45,  2.61it/s, acc_step=1/1, ce_loss_token=1.7445, perplexity_token=5.7230]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:40<00:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7445, perplexity_token=5.7228]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:41<00:44,  2.62it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7226]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:41<00:43,  2.64it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7223]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:42<00:49,  2.31it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7221]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:42<00:47,  2.41it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7220]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:42<00:42,  2.64it/s, acc_step=1/1, ce_loss_token=1.7445, perplexity_token=5.7229]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:43<00:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7226]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:43<00:40,  2.75it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7224]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:43<00:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7224]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:44<00:40,  2.66it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7221]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:44<00:37,  2.88it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7225]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:44<00:37,  2.84it/s, acc_step=1/1, ce_loss_token=1.7444, perplexity_token=5.7223]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:45<00:38,  2.76it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7221]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:45<00:38,  2.75it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7218]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:46<00:37,  2.79it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7215]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:46<00:37,  2.76it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:46<00:36,  2.77it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7210]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:47<00:34,  2.93it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7214]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:47<00:34,  2.91it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7211]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:47<00:36,  2.75it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:48<00:33,  2.94it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7215]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:48<00:34,  2.82it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:48<00:35,  2.73it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7211]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:49<00:38,  2.49it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  91%|██████████████████████████████████████████▊    | 950/1044 [05:49<00:36,  2.56it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:50<00:33,  2.77it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:50<00:33,  2.72it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7210]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:50<00:33,  2.71it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:51<00:33,  2.71it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7206]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  91%|██████████████████████████████████████████▉    | 955/1044 [05:51<00:30,  2.93it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:51<00:30,  2.88it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7210]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:52<00:30,  2.88it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:52<00:30,  2.83it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:52<00:30,  2.81it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:53<00:29,  2.86it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7202]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:53<00:31,  2.67it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7199]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:53<00:27,  2.95it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:54<00:29,  2.71it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7205]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:54<00:29,  2.69it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7203]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:55<00:28,  2.81it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:55<00:26,  2.90it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7217]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:55<00:26,  2.86it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7214]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:56<00:26,  2.82it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7213]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:56<00:27,  2.76it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7210]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:56<00:27,  2.66it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [05:57<00:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7213]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [05:57<00:27,  2.64it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7211]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [05:57<00:26,  2.68it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [05:58<00:24,  2.81it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7206]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [05:58<00:25,  2.66it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [05:59<00:25,  2.70it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7202]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [05:59<00:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7199]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [05:59<00:27,  2.39it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7198]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:00<00:26,  2.45it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7195]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:00<00:25,  2.50it/s, acc_step=1/1, ce_loss_token=1.7438, perplexity_token=5.7193]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:01<00:24,  2.61it/s, acc_step=1/1, ce_loss_token=1.7438, perplexity_token=5.7190]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:01<00:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.7438, perplexity_token=5.7188]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:01<00:23,  2.65it/s, acc_step=1/1, ce_loss_token=1.7437, perplexity_token=5.7186]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:02<00:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.7437, perplexity_token=5.7184]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:02<00:21,  2.70it/s, acc_step=1/1, ce_loss_token=1.7437, perplexity_token=5.7182]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:02<00:20,  2.88it/s, acc_step=1/1, ce_loss_token=1.7437, perplexity_token=5.7186]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:03<00:21,  2.61it/s, acc_step=1/1, ce_loss_token=1.7437, perplexity_token=5.7185]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:03<00:21,  2.65it/s, acc_step=1/1, ce_loss_token=1.7438, perplexity_token=5.7193]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:03<00:19,  2.83it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7197]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:04<00:18,  2.90it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7195]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:04<00:18,  2.90it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7203]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:04<00:17,  2.92it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7201]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:05<00:18,  2.80it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7199]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:05<00:17,  2.79it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7197]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:06<00:17,  2.80it/s, acc_step=1/1, ce_loss_token=1.7439, perplexity_token=5.7195]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:06<00:16,  2.99it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7202]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:06<00:15,  3.00it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7200]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:07<00:15,  2.99it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:07<00:14,  3.10it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:07<00:15,  2.93it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7206]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:08<00:15,  2.74it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:08<00:14,  2.89it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:08<00:14,  2.87it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:09<00:13,  2.86it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7205]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:09<00:13,  2.90it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7203]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:09<00:12,  3.10it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:10<00:12,  3.07it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7205]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:10<00:12,  2.99it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:10<00:12,  2.87it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7201]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:11<00:11,  3.04it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7206]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:11<00:11,  2.99it/s, acc_step=1/1, ce_loss_token=1.7440, perplexity_token=5.7204]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:11<00:10,  3.11it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:12<00:10,  2.99it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7210]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:12<00:10,  2.90it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7207]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:12<00:09,  3.04it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7213]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:13<00:11,  2.53it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7211]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:13<00:10,  2.52it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:14<00:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7213]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:14<00:09,  2.70it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7211]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:14<00:09,  2.66it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:15<00:08,  2.70it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:15<00:08,  2.74it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7215]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:15<00:07,  2.73it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7213]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:16<00:07,  2.55it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:16<00:06,  2.73it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7219]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:16<00:06,  2.77it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7217]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:17<00:06,  2.73it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7215]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:17<00:05,  2.86it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7220]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:18<00:05,  2.84it/s, acc_step=1/1, ce_loss_token=1.7443, perplexity_token=5.7217]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:18<00:05,  2.76it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7216]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:18<00:04,  2.81it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7214]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:19<00:04,  2.77it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:19<00:03,  2.78it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:19<00:03,  2.97it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7212]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:20<00:03,  2.91it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7210]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:20<00:02,  2.81it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7208]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:20<00:02,  2.72it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7206]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:21<00:02,  2.46it/s, acc_step=1/1, ce_loss_token=1.7442, perplexity_token=5.7210]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:21<00:02,  2.43it/s, acc_step=1/1, ce_loss_token=1.7441, perplexity_token=5.7209]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:22<00:00,  3.06it/s, acc_step=1/1, ce_loss_token=1.7446, perplexity_token=5.7234]

torch.Size([256, 304, 35]) torch.Size([256, 304])
torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:22<00:00,  3.02it/s, acc_step=1/1, ce_loss_token=1.7445, perplexity_token=5.7232]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:22<00:00,  3.12it/s, acc_step=1/1, ce_loss_token=1.7446, perplexity_token=5.7235]

torch.Size([170, 320, 35]) torch.Size([170, 320])


                                                                                                                                                                   

Generating with greedy search...

📊 Metrics (Epoch 8):
├── TRAIN:
│   ├── ce_loss_char: 1.7446
│   ├── ce_loss_token: 1.7446
│   ├── perplexity_char: 5.7235
│   └── perplexity_token: 5.7235
├── VAL:
│   ├── ce_loss_char: 1.6239
│   ├── ce_loss_token: 1.6239
│   ├── perplexity_char: 5.0729
│   └── perplexity_token: 5.0729
└── TRAINING:
    └── learning_rate: 0.000100
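
The perplexity values in the Epoch 8 metrics above appear to be the exponential of the matching cross-entropy losses. The sketch below checks this, assuming the loss is a mean cross-entropy in nats (natural log); `perplexity_from_ce` is a hypothetical helper written for this illustration and is not part of `hw4lib`.

```python
import math


def perplexity_from_ce(ce_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(ce_loss)


# Losses copied from the Epoch 8 metrics above (rounded to 4 decimals in the log)
train_ppl = perplexity_from_ce(1.7446)
val_ppl = perplexity_from_ce(1.6239)

print(f"train perplexity: {train_ppl:.4f}")
print(f"val perplexity:   {val_ppl:.4f}")
```

Because the logged losses are rounded, the recomputed values can differ from the logged perplexities in the last decimal place.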


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   0%|                                                 | 1/1044 [00:00<09:24,  1.85it/s, acc_step=1/1, ce_loss_token=1.7162, perplexity_token=5.5634]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:41,  2.26it/s, acc_step=1/1, ce_loss_token=1.7208, perplexity_token=5.5890]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<07:05,  2.45it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<06:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.7395, perplexity_token=5.6947]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<06:39,  2.60it/s, acc_step=1/1, ce_loss_token=1.7297, perplexity_token=5.6390]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<06:32,  2.65it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6179]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6078]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<05:55,  2.91it/s, acc_step=1/1, ce_loss_token=1.7369, perplexity_token=5.6796]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<05:33,  3.11it/s, acc_step=1/1, ce_loss_token=1.7456, perplexity_token=5.7294]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<05:52,  2.93it/s, acc_step=1/1, ce_loss_token=1.7423, perplexity_token=5.7106]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   1%|▌                                               | 11/1044 [00:03<05:34,  3.09it/s, acc_step=1/1, ce_loss_token=1.7502, perplexity_token=5.7557]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<05:51,  2.94it/s, acc_step=1/1, ce_loss_token=1.7473, perplexity_token=5.7393]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:   1%|▌                                               | 13/1044 [00:04<06:26,  2.67it/s, acc_step=1/1, ce_loss_token=1.7446, perplexity_token=5.7238]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<06:12,  2.77it/s, acc_step=1/1, ce_loss_token=1.7537, perplexity_token=5.7758]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<06:08,  2.79it/s, acc_step=1/1, ce_loss_token=1.7504, perplexity_token=5.7567]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   2%|▋                                               | 16/1044 [00:05<06:11,  2.77it/s, acc_step=1/1, ce_loss_token=1.7477, perplexity_token=5.7416]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<06:20,  2.70it/s, acc_step=1/1, ce_loss_token=1.7453, perplexity_token=5.7277]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<06:15,  2.73it/s, acc_step=1/1, ce_loss_token=1.7434, perplexity_token=5.7165]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   2%|▊                                               | 19/1044 [00:06<06:17,  2.72it/s, acc_step=1/1, ce_loss_token=1.7413, perplexity_token=5.7049]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<06:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.7393, perplexity_token=5.6932]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<06:13,  2.74it/s, acc_step=1/1, ce_loss_token=1.7369, perplexity_token=5.6796]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   2%|█                                               | 22/1044 [00:08<06:18,  2.70it/s, acc_step=1/1, ce_loss_token=1.7356, perplexity_token=5.6724]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   2%|█                                               | 23/1044 [00:08<05:52,  2.90it/s, acc_step=1/1, ce_loss_token=1.7393, perplexity_token=5.6934]

torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:58,  2.43it/s, acc_step=1/1, ce_loss_token=1.7388, perplexity_token=5.6904]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<06:48,  2.49it/s, acc_step=1/1, ce_loss_token=1.7377, perplexity_token=5.6842]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.7369, perplexity_token=5.6795]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   3%|█▏                                              | 27/1044 [00:10<06:39,  2.54it/s, acc_step=1/1, ce_loss_token=1.7357, perplexity_token=5.6728]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:40,  2.54it/s, acc_step=1/1, ce_loss_token=1.7346, perplexity_token=5.6666]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<06:15,  2.70it/s, acc_step=1/1, ce_loss_token=1.7368, perplexity_token=5.6791]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:14,  2.71it/s, acc_step=1/1, ce_loss_token=1.7355, perplexity_token=5.6720]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<06:10,  2.73it/s, acc_step=1/1, ce_loss_token=1.7352, perplexity_token=5.6699]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:   3%|█▍                                              | 32/1044 [00:11<06:01,  2.80it/s, acc_step=1/1, ce_loss_token=1.7376, perplexity_token=5.6836]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<06:02,  2.79it/s, acc_step=1/1, ce_loss_token=1.7366, perplexity_token=5.6782]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:   3%|█▌                                              | 34/1044 [00:12<06:24,  2.63it/s, acc_step=1/1, ce_loss_token=1.7352, perplexity_token=5.6703]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   3%|█▌                                              | 35/1044 [00:12<06:23,  2.63it/s, acc_step=1/1, ce_loss_token=1.7345, perplexity_token=5.6662]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<06:12,  2.71it/s, acc_step=1/1, ce_loss_token=1.7335, perplexity_token=5.6606]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   4%|█▋                                              | 37/1044 [00:13<06:19,  2.66it/s, acc_step=1/1, ce_loss_token=1.7330, perplexity_token=5.6577]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:15,  2.68it/s, acc_step=1/1, ce_loss_token=1.7323, perplexity_token=5.6538]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   4%|█▊                                              | 39/1044 [00:14<05:56,  2.82it/s, acc_step=1/1, ce_loss_token=1.7361, perplexity_token=5.6754]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   4%|█▊                                              | 40/1044 [00:14<05:59,  2.79it/s, acc_step=1/1, ce_loss_token=1.7354, perplexity_token=5.6714]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<06:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.7350, perplexity_token=5.6687]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   4%|█▉                                              | 42/1044 [00:15<06:16,  2.66it/s, acc_step=1/1, ce_loss_token=1.7341, perplexity_token=5.6639]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   4%|█▉                                              | 43/1044 [00:15<06:11,  2.69it/s, acc_step=1/1, ce_loss_token=1.7335, perplexity_token=5.6602]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   4%|██                                              | 44/1044 [00:16<06:10,  2.70it/s, acc_step=1/1, ce_loss_token=1.7331, perplexity_token=5.6581]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   4%|██                                              | 45/1044 [00:16<06:07,  2.72it/s, acc_step=1/1, ce_loss_token=1.7323, perplexity_token=5.6537]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:08,  2.71it/s, acc_step=1/1, ce_loss_token=1.7317, perplexity_token=5.6502]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<05:59,  2.77it/s, acc_step=1/1, ce_loss_token=1.7311, perplexity_token=5.6467]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   5%|██▏                                             | 48/1044 [00:17<05:50,  2.84it/s, acc_step=1/1, ce_loss_token=1.7305, perplexity_token=5.6433]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   5%|██▎                                             | 49/1044 [00:17<05:31,  3.01it/s, acc_step=1/1, ce_loss_token=1.7333, perplexity_token=5.6590]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<05:40,  2.92it/s, acc_step=1/1, ce_loss_token=1.7327, perplexity_token=5.6561]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:   5%|██▎                                             | 51/1044 [00:18<05:40,  2.91it/s, acc_step=1/1, ce_loss_token=1.7322, perplexity_token=5.6530]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<05:44,  2.88it/s, acc_step=1/1, ce_loss_token=1.7312, perplexity_token=5.6476]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   5%|██▍                                             | 53/1044 [00:19<05:30,  3.00it/s, acc_step=1/1, ce_loss_token=1.7337, perplexity_token=5.6616]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   5%|██▍                                             | 54/1044 [00:19<05:11,  3.17it/s, acc_step=1/1, ce_loss_token=1.7354, perplexity_token=5.6711]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<05:39,  2.92it/s, acc_step=1/1, ce_loss_token=1.7348, perplexity_token=5.6680]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   5%|██▌                                             | 56/1044 [00:20<05:59,  2.75it/s, acc_step=1/1, ce_loss_token=1.7343, perplexity_token=5.6651]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▌                                             | 57/1044 [00:20<05:43,  2.87it/s, acc_step=1/1, ce_loss_token=1.7360, perplexity_token=5.6744]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:56,  2.76it/s, acc_step=1/1, ce_loss_token=1.7354, perplexity_token=5.6712]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   6%|██▋                                             | 59/1044 [00:21<05:51,  2.80it/s, acc_step=1/1, ce_loss_token=1.7347, perplexity_token=5.6674]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   6%|██▊                                             | 60/1044 [00:21<05:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.7339, perplexity_token=5.6627]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<05:56,  2.76it/s, acc_step=1/1, ce_loss_token=1.7337, perplexity_token=5.6616]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:   6%|██▊                                             | 62/1044 [00:22<05:56,  2.76it/s, acc_step=1/1, ce_loss_token=1.7332, perplexity_token=5.6588]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<06:13,  2.63it/s, acc_step=1/1, ce_loss_token=1.7328, perplexity_token=5.6565]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   6%|██▉                                             | 64/1044 [00:23<05:48,  2.81it/s, acc_step=1/1, ce_loss_token=1.7344, perplexity_token=5.6658]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▉                                             | 65/1044 [00:23<05:53,  2.77it/s, acc_step=1/1, ce_loss_token=1.7341, perplexity_token=5.6636]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   6%|███                                             | 66/1044 [00:24<06:05,  2.67it/s, acc_step=1/1, ce_loss_token=1.7336, perplexity_token=5.6611]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   6%|███                                             | 67/1044 [00:24<06:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.7330, perplexity_token=5.6579]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   7%|███▏                                            | 68/1044 [00:24<06:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.7326, perplexity_token=5.6551]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:   7%|███▏                                            | 69/1044 [00:25<06:21,  2.56it/s, acc_step=1/1, ce_loss_token=1.7322, perplexity_token=5.6531]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   7%|███▏                                            | 70/1044 [00:25<06:17,  2.58it/s, acc_step=1/1, ce_loss_token=1.7318, perplexity_token=5.6509]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<06:22,  2.55it/s, acc_step=1/1, ce_loss_token=1.7313, perplexity_token=5.6481]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   7%|███▎                                            | 72/1044 [00:26<06:18,  2.57it/s, acc_step=1/1, ce_loss_token=1.7308, perplexity_token=5.6451]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   7%|███▎                                            | 73/1044 [00:26<05:55,  2.73it/s, acc_step=1/1, ce_loss_token=1.7318, perplexity_token=5.6511]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   7%|███▍                                            | 74/1044 [00:27<05:51,  2.76it/s, acc_step=1/1, ce_loss_token=1.7315, perplexity_token=5.6493]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   7%|███▍                                            | 75/1044 [00:27<05:54,  2.74it/s, acc_step=1/1, ce_loss_token=1.7314, perplexity_token=5.6485]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   7%|███▍                                            | 76/1044 [00:27<05:53,  2.74it/s, acc_step=1/1, ce_loss_token=1.7310, perplexity_token=5.6462]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<06:10,  2.61it/s, acc_step=1/1, ce_loss_token=1.7306, perplexity_token=5.6439]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   7%|███▌                                            | 78/1044 [00:28<06:19,  2.55it/s, acc_step=1/1, ce_loss_token=1.7303, perplexity_token=5.6424]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<06:12,  2.59it/s, acc_step=1/1, ce_loss_token=1.7299, perplexity_token=5.6400]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   8%|███▋                                            | 80/1044 [00:29<06:24,  2.51it/s, acc_step=1/1, ce_loss_token=1.7296, perplexity_token=5.6384]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   8%|███▋                                            | 81/1044 [00:29<06:10,  2.60it/s, acc_step=1/1, ce_loss_token=1.7294, perplexity_token=5.6370]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<05:41,  2.82it/s, acc_step=1/1, ce_loss_token=1.7306, perplexity_token=5.6439]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:   8%|███▊                                            | 83/1044 [00:30<06:00,  2.67it/s, acc_step=1/1, ce_loss_token=1.7302, perplexity_token=5.6415]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   8%|███▊                                            | 84/1044 [00:30<05:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.7298, perplexity_token=5.6395]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   8%|███▉                                            | 85/1044 [00:31<05:54,  2.70it/s, acc_step=1/1, ce_loss_token=1.7295, perplexity_token=5.6381]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:   8%|███▉                                            | 86/1044 [00:31<05:44,  2.78it/s, acc_step=1/1, ce_loss_token=1.7293, perplexity_token=5.6364]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:   8%|████                                            | 87/1044 [00:31<05:35,  2.85it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6352]

torch.Size([256, 381, 35]) torch.Size([256, 381])


[Training LM]:   8%|████                                            | 88/1044 [00:32<06:23,  2.49it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6325]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   9%|████                                            | 89/1044 [00:32<06:19,  2.52it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6314]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   9%|████▏                                           | 90/1044 [00:33<06:18,  2.52it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6300]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   9%|████▏                                           | 91/1044 [00:33<06:15,  2.54it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6288]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   9%|████▏                                           | 92/1044 [00:33<06:11,  2.56it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6277]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   9%|████▎                                           | 93/1044 [00:34<06:07,  2.59it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6256]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   9%|████▎                                           | 94/1044 [00:34<06:13,  2.54it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6234]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<06:04,  2.61it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6221]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:   9%|████▍                                           | 96/1044 [00:35<07:16,  2.17it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6214]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<06:43,  2.35it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6196]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<06:29,  2.43it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6173]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   9%|████▌                                           | 99/1044 [00:36<06:16,  2.51it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6165]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<06:03,  2.60it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6155]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  10%|████▌                                          | 101/1044 [00:37<06:04,  2.59it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6139]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  10%|████▌                                          | 102/1044 [00:37<06:01,  2.61it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6120]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<05:49,  2.69it/s, acc_step=1/1, ce_loss_token=1.7247, perplexity_token=5.6110]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  10%|████▋                                          | 104/1044 [00:38<05:49,  2.69it/s, acc_step=1/1, ce_loss_token=1.7245, perplexity_token=5.6097]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▋                                          | 105/1044 [00:38<05:23,  2.91it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6173]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<05:30,  2.84it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6162]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  10%|████▊                                          | 107/1044 [00:39<05:37,  2.78it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6152]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<05:16,  2.95it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6232]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  10%|████▉                                          | 109/1044 [00:40<05:20,  2.91it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6218]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  11%|████▉                                          | 110/1044 [00:40<05:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6207]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:18,  2.93it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6190]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  11%|█████                                          | 112/1044 [00:41<05:31,  2.81it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6175]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<06:41,  2.32it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6162]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<06:27,  2.40it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6147]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:42<06:15,  2.48it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6139]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<05:46,  2.68it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:43<05:47,  2.67it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6174]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:43<05:41,  2.71it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6160]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:42,  2.70it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6149]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:44<05:43,  2.69it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6135]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:44<05:40,  2.71it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6126]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:21,  2.87it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6175]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:45<05:25,  2.83it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6165]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:45<05:20,  2.87it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6153]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:32,  2.76it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6148]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:46<05:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6133]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  12%|█████▋                                         | 127/1044 [00:47<05:47,  2.64it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6119]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:47<05:56,  2.57it/s, acc_step=1/1, ce_loss_token=1.7247, perplexity_token=5.6108]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:47<05:53,  2.59it/s, acc_step=1/1, ce_loss_token=1.7246, perplexity_token=5.6105]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<05:56,  2.57it/s, acc_step=1/1, ce_loss_token=1.7246, perplexity_token=5.6102]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:50,  2.60it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6091]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:49<05:42,  2.66it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6083]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:39,  2.68it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6075]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:51,  2.59it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  13%|██████                                         | 135/1044 [00:50<06:01,  2.52it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6061]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<06:01,  2.51it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6050]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:51<06:10,  2.45it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6048]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:51<06:10,  2.45it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6041]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<05:38,  2.67it/s, acc_step=1/1, ce_loss_token=1.7246, perplexity_token=5.6100]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:52<05:30,  2.74it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6093]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:36,  2.68it/s, acc_step=1/1, ce_loss_token=1.7243, perplexity_token=5.6084]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:38,  2.66it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6072]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:41,  2.64it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6060]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:44,  2.62it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:54<05:36,  2.67it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6047]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:54<05:35,  2.68it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6044]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<05:31,  2.71it/s, acc_step=1/1, ce_loss_token=1.7234, perplexity_token=5.6035]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<05:31,  2.70it/s, acc_step=1/1, ce_loss_token=1.7232, perplexity_token=5.6026]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:55<05:26,  2.74it/s, acc_step=1/1, ce_loss_token=1.7231, perplexity_token=5.6018]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:41,  2.62it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6056]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:32,  2.69it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6051]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<05:29,  2.71it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6045]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<05:36,  2.65it/s, acc_step=1/1, ce_loss_token=1.7234, perplexity_token=5.6037]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:57<05:15,  2.82it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6090]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<05:19,  2.78it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6083]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:08,  2.88it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6122]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  15%|███████                                        | 157/1044 [00:58<05:18,  2.78it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6119]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<04:48,  3.07it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6268]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:59<04:59,  2.96it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6262]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  15%|███████▏                                       | 161/1044 [00:59<05:11,  2.83it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6254]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  16%|███████▎                                       | 162/1044 [01:00<05:27,  2.69it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6245]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:00<05:38,  2.60it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6237]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<05:38,  2.60it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6228]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:30,  2.66it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6219]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:01<05:35,  2.61it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6209]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<05:31,  2.65it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6199]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<05:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6189]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:02<05:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6226]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<05:29,  2.65it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6215]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:03<05:44,  2.54it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6211]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<06:44,  2.16it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6204]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:04<06:19,  2.29it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6252]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:05<05:54,  2.46it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6247]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<05:45,  2.51it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6236]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:05<05:37,  2.57it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6229]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:06<05:29,  2.63it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6218]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<05:21,  2.69it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6210]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  17%|████████                                       | 179/1044 [01:06<05:32,  2.60it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6202]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  17%|████████                                       | 180/1044 [01:07<05:26,  2.65it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6193]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<05:35,  2.58it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6189]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:07<05:07,  2.80it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6229]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:08<04:59,  2.87it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6257]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:08<05:09,  2.78it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6250]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:09<05:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6244]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:09<05:20,  2.68it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6237]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:09<05:02,  2.83it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6264]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:10<04:56,  2.89it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6257]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:10<05:02,  2.82it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6247]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:10<05:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6243]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:11<05:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6234]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:11<05:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6227]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:11<05:13,  2.71it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6217]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:09,  2.74it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6209]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:12<05:20,  2.65it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6200]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:52,  2.41it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:13<05:35,  2.52it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6182]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:13<05:30,  2.56it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6177]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:40,  2.48it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6172]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  19%|█████████                                      | 200/1044 [01:14<05:34,  2.52it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6168]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  19%|█████████                                      | 201/1044 [01:15<06:00,  2.34it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6163]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  19%|█████████                                      | 202/1044 [01:15<05:38,  2.49it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6156]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:15<05:25,  2.58it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6148]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:16<05:37,  2.49it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6146]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:16<05:35,  2.50it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6137]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:17<05:40,  2.46it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6128]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:17<05:33,  2.51it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6124]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:18<04:40,  2.98it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6225]

torch.Size([256, 301, 35]) torch.Size([256, 301])
torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:18<04:44,  2.93it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6219]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:18<04:55,  2.82it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6211]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:19<05:17,  2.62it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6204]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:19<05:27,  2.53it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6200]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:20<05:29,  2.52it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6188]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:20<05:09,  2.68it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6225]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:20<05:22,  2.57it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6217]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:21<05:10,  2.66it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6209]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:21<04:56,  2.78it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6235]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:21<04:49,  2.85it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6233]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:22<04:47,  2.86it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6228]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:22<04:55,  2.78it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6220]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:23<04:57,  2.76it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6212]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  21%|██████████                                     | 223/1044 [01:23<05:17,  2.58it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6208]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  21%|██████████                                     | 224/1044 [01:23<05:10,  2.64it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6202]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:24<05:13,  2.61it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6197]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:24<04:49,  2.83it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6224]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:24<04:47,  2.84it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6218]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:25<05:01,  2.70it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6213]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:25<04:45,  2.86it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6238]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:25<04:48,  2.82it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6233]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:26<04:53,  2.77it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6225]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:26<04:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6221]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:27<05:01,  2.69it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6216]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:27<04:54,  2.75it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6213]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:27<04:59,  2.70it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6207]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:28<05:00,  2.69it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6198]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:28<04:58,  2.70it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:28<05:02,  2.67it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6186]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:29<05:02,  2.66it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:29<05:04,  2.64it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:30<05:05,  2.63it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6170]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:30<05:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6166]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:30<05:01,  2.66it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6161]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:31<05:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6156]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  23%|███████████                                    | 245/1044 [01:31<05:03,  2.63it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6147]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  24%|███████████                                    | 246/1044 [01:31<05:08,  2.58it/s, acc_step=1/1, ce_loss_token=1.7253, perplexity_token=5.6143]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  24%|███████████                                    | 247/1044 [01:32<05:06,  2.60it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6139]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:32<05:05,  2.60it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6132]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:33<05:03,  2.62it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6126]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:33<04:47,  2.76it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6149]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:33<04:59,  2.65it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6145]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:34<04:42,  2.81it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6163]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:34<04:50,  2.72it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6156]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:34<04:30,  2.92it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6177]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:35<04:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6169]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:35<04:49,  2.72it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6165]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:35<04:34,  2.87it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6195]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:36<04:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6189]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:36<04:49,  2.71it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:37<04:55,  2.66it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:37<04:53,  2.67it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6168]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:37<04:48,  2.71it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6167]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:38<04:32,  2.87it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6197]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:38<04:11,  3.10it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6217]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:38<04:25,  2.93it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6212]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:39<04:36,  2.81it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6207]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  26%|████████████                                   | 267/1044 [01:39<04:38,  2.79it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6200]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  26%|████████████                                   | 268/1044 [01:39<04:48,  2.69it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6194]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  26%|████████████                                   | 269/1044 [01:40<04:32,  2.85it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6213]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:40<04:39,  2.76it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6210]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:40<04:11,  3.08it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6288]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:41<04:17,  3.00it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6281]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:41<04:31,  2.84it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6274]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:41<04:13,  3.04it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6295]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:42<04:19,  2.96it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6290]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:42<04:05,  3.13it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6317]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:42<04:29,  2.85it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6314]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:43<04:35,  2.78it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6309]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:43<04:35,  2.78it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6307]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:44<04:40,  2.72it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6301]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:44<05:00,  2.54it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6296]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:44<04:51,  2.62it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6315]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:45<05:02,  2.51it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6308]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:45<04:47,  2.65it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6304]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:45<04:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6300]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:46<04:57,  2.54it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6296]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:46<05:03,  2.50it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6289]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:47<04:49,  2.61it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6286]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:47<04:43,  2.66it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6280]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:48<05:05,  2.47it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6276]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:48<04:39,  2.69it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6297]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:48<05:42,  2.20it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6317]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:49<05:23,  2.32it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6311]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:49<04:50,  2.59it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6329]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:50<04:45,  2.62it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6323]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:50<04:41,  2.66it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6318]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:50<04:42,  2.64it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6316]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:51<04:19,  2.87it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6339]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:51<04:23,  2.83it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6335]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:51<05:02,  2.46it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6332]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:52<04:58,  2.49it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6326]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:52<04:49,  2.56it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6320]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:53<04:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6315]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:53<04:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6312]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:53<04:54,  2.51it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6309]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:54<04:44,  2.59it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6303]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:54<04:47,  2.57it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6301]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:54<04:44,  2.59it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6297]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:55<04:43,  2.60it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6294]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:55<04:36,  2.66it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6291]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:56<04:43,  2.58it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6287]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:56<04:37,  2.64it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6283]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:56<04:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6278]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:57<04:19,  2.82it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6294]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:57<04:13,  2.88it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6319]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:57<04:18,  2.81it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6311]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:58<04:33,  2.66it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6307]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:58<04:30,  2.68it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6304]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:59<04:36,  2.62it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6299]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:59<04:29,  2.68it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6293]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  31%|██████████████▍                                | 321/1044 [01:59<04:36,  2.61it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6288]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  31%|██████████████▍                                | 322/1044 [02:00<04:39,  2.59it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6284]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  31%|██████████████▌                                | 323/1044 [02:00<04:42,  2.55it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6278]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  31%|██████████████▌                                | 324/1044 [02:01<04:35,  2.61it/s, acc_step=1/1, ce_loss_token=1.7276, perplexity_token=5.6273]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  31%|██████████████▋                                | 325/1044 [02:01<04:34,  2.62it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6267]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  31%|██████████████▋                                | 326/1044 [02:01<04:29,  2.66it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6262]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:02<04:27,  2.68it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6257]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:02<04:29,  2.66it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6252]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:02<04:28,  2.67it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6245]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:03<04:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6238]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:03<04:25,  2.69it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6232]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:03<04:02,  2.94it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6249]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:04<04:08,  2.87it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6243]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  32%|███████████████                                | 334/1044 [02:04<04:09,  2.85it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6238]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  32%|███████████████                                | 335/1044 [02:04<04:03,  2.92it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6262]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:05<04:11,  2.82it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6256]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:05<04:21,  2.71it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6251]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:06<04:27,  2.64it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6246]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:06<04:24,  2.66it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6241]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:06<04:06,  2.85it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6257]

torch.Size([256, 278, 35]) torch.Size([256, 278])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:07<03:59,  2.93it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6251]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:07<04:06,  2.84it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6246]

torch.Size([256, 275, 35]) torch.Size([256, 275])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:07<04:02,  2.89it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6237]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:08<04:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6233]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:08<04:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6230]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:08<04:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6226]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:09<04:09,  2.79it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6239]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:09<04:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6236]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:10<04:35,  2.52it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6232]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:10<04:35,  2.52it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6229]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:10<04:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6242]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:11<03:54,  2.95it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6259]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:11<03:57,  2.91it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6254]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:11<03:49,  3.01it/s, acc_step=1/1, ce_loss_token=1.7276, perplexity_token=5.6269]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:12<03:52,  2.97it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6264]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  34%|████████████████                               | 356/1044 [02:12<03:56,  2.91it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6260]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  34%|████████████████                               | 357/1044 [02:12<03:59,  2.87it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6256]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  34%|████████████████                               | 358/1044 [02:13<03:41,  3.09it/s, acc_step=1/1, ce_loss_token=1.7276, perplexity_token=5.6269]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:13<03:35,  3.17it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6330]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:14<03:47,  3.00it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6325]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:14<03:58,  2.86it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6320]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:14<04:07,  2.75it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6316]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:15<04:09,  2.73it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6312]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:15<04:07,  2.75it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6308]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:16<04:07,  2.74it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6304]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:16<04:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6302]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:16<04:09,  2.71it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6296]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:17<04:10,  2.70it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6294]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:17<03:36,  3.11it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6352]

torch.Size([256, 301, 35]) torch.Size([256, 301])
torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:18<03:47,  2.95it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6345]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:18<04:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6341]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:18<03:58,  2.81it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6337]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:19<04:02,  2.76it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6334]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:19<04:35,  2.42it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6329]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:20<04:27,  2.50it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6326]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:20<04:41,  2.36it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6321]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:21<04:39,  2.38it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6320]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:21<04:35,  2.41it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6314]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:21<04:25,  2.50it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6307]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:22<03:58,  2.78it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6318]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:22<03:38,  3.02it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6339]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:22<03:44,  2.94it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6335]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:23<03:49,  2.87it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6331]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:23<03:53,  2.82it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6327]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:23<03:38,  3.01it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6347]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:24<03:41,  2.96it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6342]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:24<03:47,  2.88it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6336]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:24<03:29,  3.13it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6356]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:24<03:18,  3.30it/s, acc_step=1/1, ce_loss_token=1.7293, perplexity_token=5.6369]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:25<03:32,  3.07it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6363]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:25<03:41,  2.93it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6361]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:26<03:45,  2.88it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6356]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:26<03:45,  2.88it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6352]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:26<03:56,  2.74it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6347]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:27<03:55,  2.75it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6344]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:27<04:03,  2.65it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6339]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:27<04:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6335]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:28<03:55,  2.74it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6331]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:28<03:43,  2.88it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6351]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:28<03:43,  2.88it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6347]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:29<03:34,  2.99it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6358]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:29<03:41,  2.89it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6351]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████▏                            | 405/1044 [02:30<03:47,  2.81it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6348]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:30<03:51,  2.75it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6343]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:30<03:55,  2.71it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6338]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:31<03:52,  2.73it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6334]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:31<03:40,  2.89it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6351]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:31<03:35,  2.95it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6347]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:32<03:17,  3.21it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6363]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:32<03:32,  2.98it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6359]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:32<03:35,  2.92it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6354]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:33<03:37,  2.89it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6351]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:33<03:40,  2.85it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6347]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:34<04:15,  2.46it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6343]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:34<04:07,  2.54it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6340]

torch.Size([256, 277, 35]) torch.Size([256, 277])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:34<03:54,  2.67it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6338]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:35<03:49,  2.72it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6334]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:35<03:58,  2.61it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6330]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:35<03:43,  2.79it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6343]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:36<03:29,  2.97it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6355]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:36<03:37,  2.86it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6350]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:36<03:23,  3.05it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6363]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:37<03:29,  2.95it/s, acc_step=1/1, ce_loss_token=1.7294, perplexity_token=5.6373]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:37<03:40,  2.80it/s, acc_step=1/1, ce_loss_token=1.7293, perplexity_token=5.6369]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:37<03:44,  2.75it/s, acc_step=1/1, ce_loss_token=1.7293, perplexity_token=5.6366]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:38<03:46,  2.72it/s, acc_step=1/1, ce_loss_token=1.7292, perplexity_token=5.6361]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:38<03:41,  2.78it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6357]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:39<03:56,  2.60it/s, acc_step=1/1, ce_loss_token=1.7291, perplexity_token=5.6354]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:39<03:53,  2.62it/s, acc_step=1/1, ce_loss_token=1.7290, perplexity_token=5.6349]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:39<03:53,  2.63it/s, acc_step=1/1, ce_loss_token=1.7289, perplexity_token=5.6346]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:40<03:55,  2.60it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6341]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:40<04:03,  2.51it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6338]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:40<03:57,  2.56it/s, acc_step=1/1, ce_loss_token=1.7288, perplexity_token=5.6336]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:41<03:51,  2.63it/s, acc_step=1/1, ce_loss_token=1.7287, perplexity_token=5.6333]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:41<03:58,  2.54it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6330]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:42<03:52,  2.61it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6326]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:42<03:48,  2.65it/s, acc_step=1/1, ce_loss_token=1.7285, perplexity_token=5.6322]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:42<03:46,  2.67it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6317]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:43<03:55,  2.56it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6313]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:43<03:52,  2.59it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6308]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:44<03:52,  2.59it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6305]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:44<04:40,  2.14it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6301]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:45<04:33,  2.19it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6298]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:45<03:57,  2.52it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6308]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:45<03:43,  2.68it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6306]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:46<03:41,  2.69it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6301]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:46<03:49,  2.59it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6298]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:46<03:38,  2.72it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6308]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:47<03:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6305]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:47<03:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6303]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:47<03:20,  2.95it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6317]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:48<03:19,  2.96it/s, acc_step=1/1, ce_loss_token=1.7284, perplexity_token=5.6314]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:48<03:25,  2.87it/s, acc_step=1/1, ce_loss_token=1.7283, perplexity_token=5.6311]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:48<03:35,  2.73it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6308]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:49<03:41,  2.65it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6304]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:49<03:44,  2.61it/s, acc_step=1/1, ce_loss_token=1.7282, perplexity_token=5.6303]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:50<03:42,  2.63it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6300]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:50<03:38,  2.68it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6297]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:50<03:37,  2.68it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6292]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:51<03:40,  2.64it/s, acc_step=1/1, ce_loss_token=1.7279, perplexity_token=5.6288]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:51<03:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6284]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:51<03:32,  2.73it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6280]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:52<03:36,  2.67it/s, acc_step=1/1, ce_loss_token=1.7277, perplexity_token=5.6275]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:52<03:40,  2.62it/s, acc_step=1/1, ce_loss_token=1.7276, perplexity_token=5.6270]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:53<03:37,  2.65it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6266]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:53<03:35,  2.68it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6265]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:53<03:38,  2.63it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6263]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:54<03:38,  2.63it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6259]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:54<03:35,  2.66it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6256]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:55<03:43,  2.56it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6253]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:55<03:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6251]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:55<03:45,  2.53it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6247]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:56<03:38,  2.60it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6244]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:56<03:20,  2.83it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6255]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  46%|█████████████████████▍                         | 477/1044 [02:56<03:07,  3.03it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6266]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:57<03:07,  3.02it/s, acc_step=1/1, ce_loss_token=1.7275, perplexity_token=5.6263]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:57<03:10,  2.97it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6259]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:57<03:17,  2.85it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6255]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:58<03:14,  2.89it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6252]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:58<03:20,  2.81it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6249]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:58<03:27,  2.71it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6245]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:59<03:28,  2.68it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6241]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  46%|█████████████████████▊                         | 485/1044 [02:59<03:26,  2.70it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6237]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [03:00<03:31,  2.63it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6232]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [03:00<03:31,  2.63it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6227]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [03:00<03:14,  2.86it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6238]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  47%|██████████████████████                         | 489/1044 [03:01<03:12,  2.88it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6234]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  47%|██████████████████████                         | 490/1044 [03:01<03:18,  2.79it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6229]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  47%|██████████████████████                         | 491/1044 [03:01<03:21,  2.74it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6227]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [03:02<03:10,  2.90it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6237]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:02<03:11,  2.88it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6235]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:02<03:00,  3.05it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6246]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:03<02:55,  3.13it/s, acc_step=1/1, ce_loss_token=1.7274, perplexity_token=5.6258]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:03<02:59,  3.06it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6254]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:03<03:08,  2.91it/s, acc_step=1/1, ce_loss_token=1.7273, perplexity_token=5.6252]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:04<03:07,  2.91it/s, acc_step=1/1, ce_loss_token=1.7272, perplexity_token=5.6249]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:04<03:08,  2.89it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6246]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:04<03:12,  2.82it/s, acc_step=1/1, ce_loss_token=1.7271, perplexity_token=5.6241]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:05<03:23,  2.67it/s, acc_step=1/1, ce_loss_token=1.7270, perplexity_token=5.6238]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:05<03:23,  2.66it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6235]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:06<03:29,  2.58it/s, acc_step=1/1, ce_loss_token=1.7269, perplexity_token=5.6231]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:06<03:24,  2.64it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6229]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:06<03:23,  2.65it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6225]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:07<03:19,  2.69it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6222]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:07<03:49,  2.34it/s, acc_step=1/1, ce_loss_token=1.7267, perplexity_token=5.6218]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:08<03:43,  2.40it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6214]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:08<03:52,  2.30it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6210]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:09<03:51,  2.30it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6208]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:09<03:44,  2.37it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6204]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:09<03:46,  2.35it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6201]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:10<03:36,  2.45it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6198]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:10<03:34,  2.47it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6195]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:10<03:28,  2.54it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:11<03:56,  2.23it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:11<03:47,  2.32it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6183]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:12<03:23,  2.59it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6193]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:12<03:07,  2.81it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6201]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:12<03:21,  2.60it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6197]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:13<03:21,  2.59it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6194]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:13<03:19,  2.62it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 273, 35]) torch.Size([256, 273])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:14<03:10,  2.73it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6188]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:14<03:27,  2.51it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6186]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:14<03:23,  2.56it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6183]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:15<03:17,  2.62it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:15<03:44,  2.30it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:16<03:42,  2.32it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6174]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:16<03:33,  2.41it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6169]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:16<03:28,  2.47it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6165]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:17<03:29,  2.45it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6162]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:17<03:08,  2.71it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6171]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:18<03:02,  2.81it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:18<03:16,  2.60it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6178]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:18<03:15,  2.60it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:19<03:11,  2.65it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6174]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:19<03:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6172]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:19<03:12,  2.63it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6171]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:20<03:03,  2.76it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6185]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:20<03:03,  2.75it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6182]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:21<03:18,  2.53it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:21<03:16,  2.55it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6179]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:21<03:14,  2.58it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:22<03:07,  2.66it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6174]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:22<02:50,  2.93it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:22<02:41,  3.08it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6201]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:23<02:54,  2.85it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6198]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:23<02:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6193]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:23<02:58,  2.77it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:24<02:59,  2.75it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6188]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:24<02:51,  2.87it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6196]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:24<02:54,  2.82it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6192]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:25<02:45,  2.96it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6201]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:25<02:44,  2.98it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6198]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:26<02:51,  2.86it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6193]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:26<02:59,  2.71it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6191]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:26<03:09,  2.57it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6188]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:27<02:55,  2.77it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6198]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:27<02:57,  2.73it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6196]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:27<03:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6193]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:28<02:59,  2.70it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6189]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:28<03:04,  2.61it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6186]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:29<03:05,  2.59it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6183]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:29<03:06,  2.57it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6179]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:29<02:54,  2.75it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:30<02:40,  2.97it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6197]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:30<02:33,  3.12it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6206]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:30<02:38,  3.01it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6202]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:31<02:48,  2.82it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6200]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:31<02:49,  2.80it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6197]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:31<02:37,  3.00it/s, acc_step=1/1, ce_loss_token=1.7265, perplexity_token=5.6207]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:32<02:39,  2.97it/s, acc_step=1/1, ce_loss_token=1.7264, perplexity_token=5.6202]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:32<02:40,  2.93it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6199]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:32<02:44,  2.85it/s, acc_step=1/1, ce_loss_token=1.7263, perplexity_token=5.6196]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:33<02:50,  2.75it/s, acc_step=1/1, ce_loss_token=1.7262, perplexity_token=5.6192]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:33<02:58,  2.62it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6190]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:34<02:58,  2.61it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:34<02:57,  2.63it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6185]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:34<02:59,  2.59it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6182]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:35<02:58,  2.60it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6179]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:35<03:01,  2.56it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6176]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:35<02:56,  2.62it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6172]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:36<02:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6170]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:36<02:42,  2.82it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6177]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:36<02:28,  3.08it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6185]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:37<02:39,  2.87it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:37<02:29,  3.05it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6189]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:37<02:32,  2.99it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6186]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:38<02:43,  2.79it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6183]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:38<03:03,  2.48it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:39<02:50,  2.65it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:39<02:50,  2.65it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6185]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:39<02:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6182]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:40<02:45,  2.72it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:40<02:42,  2.77it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6178]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:40<02:36,  2.87it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6175]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:41<02:25,  3.08it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:41<02:31,  2.95it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6183]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:41<02:33,  2.91it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:42<02:24,  3.07it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6190]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:42<02:26,  3.02it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6187]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:42<02:37,  2.80it/s, acc_step=1/1, ce_loss_token=1.7261, perplexity_token=5.6184]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:43<02:35,  2.84it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:43<02:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6177]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:44<03:00,  2.44it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6175]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:44<03:01,  2.42it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6171]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:44<02:53,  2.51it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6169]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:45<03:13,  2.25it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6167]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:45<02:55,  2.48it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6175]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:46<02:48,  2.57it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6171]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:46<02:46,  2.60it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6169]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:46<02:42,  2.65it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6167]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:47<02:51,  2.52it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6165]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:47<02:37,  2.74it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6172]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:48<02:35,  2.76it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6171]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:48<02:24,  2.97it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6180]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:48<02:29,  2.86it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6177]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:49<02:30,  2.82it/s, acc_step=1/1, ce_loss_token=1.7259, perplexity_token=5.6174]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:49<02:30,  2.83it/s, acc_step=1/1, ce_loss_token=1.7258, perplexity_token=5.6170]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:49<02:32,  2.79it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6166]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:50<02:35,  2.73it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6163]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:50<02:37,  2.67it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6158]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:50<02:41,  2.61it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6156]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:51<02:35,  2.71it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6154]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:51<02:23,  2.93it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6166]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:51<02:25,  2.87it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6163]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:52<02:29,  2.79it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6160]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:52<02:20,  2.97it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6166]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:53<02:33,  2.70it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6162]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:53<02:33,  2.70it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6160]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:53<02:21,  2.92it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6167]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:54<02:21,  2.90it/s, acc_step=1/1, ce_loss_token=1.7257, perplexity_token=5.6163]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:54<02:27,  2.79it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6159]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:54<02:31,  2.71it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6157]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:55<02:42,  2.52it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6154]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  61%|████████████████████████████▋                  | 636/1044 [03:55<02:37,  2.59it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6151]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:56<02:37,  2.58it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6148]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:56<02:36,  2.60it/s, acc_step=1/1, ce_loss_token=1.7254, perplexity_token=5.6145]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:56<02:47,  2.42it/s, acc_step=1/1, ce_loss_token=1.7253, perplexity_token=5.6144]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  61%|████████████████████████████▊                  | 640/1044 [03:57<02:41,  2.50it/s, acc_step=1/1, ce_loss_token=1.7253, perplexity_token=5.6142]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:57<02:35,  2.59it/s, acc_step=1/1, ce_loss_token=1.7253, perplexity_token=5.6140]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:57<02:29,  2.69it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6137]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:58<02:25,  2.76it/s, acc_step=1/1, ce_loss_token=1.7252, perplexity_token=5.6134]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:58<02:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6132]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:59<02:33,  2.60it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6130]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:59<02:34,  2.57it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6127]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:59<02:37,  2.52it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6124]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [04:00<02:35,  2.55it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6123]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [04:00<02:19,  2.82it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6129]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [04:00<02:19,  2.82it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6126]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [04:01<02:38,  2.48it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6123]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [04:01<02:47,  2.34it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6121]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [04:02<02:33,  2.54it/s, acc_step=1/1, ce_loss_token=1.7251, perplexity_token=5.6130]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [04:02<02:29,  2.61it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6126]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [04:02<02:24,  2.70it/s, acc_step=1/1, ce_loss_token=1.7250, perplexity_token=5.6123]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [04:03<02:22,  2.73it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6120]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:03<02:21,  2.74it/s, acc_step=1/1, ce_loss_token=1.7249, perplexity_token=5.6119]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:04<02:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.7248, perplexity_token=5.6115]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:04<02:26,  2.63it/s, acc_step=1/1, ce_loss_token=1.7248, perplexity_token=5.6112]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:04<02:34,  2.48it/s, acc_step=1/1, ce_loss_token=1.7247, perplexity_token=5.6110]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:05<02:33,  2.50it/s, acc_step=1/1, ce_loss_token=1.7247, perplexity_token=5.6107]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:05<02:34,  2.48it/s, acc_step=1/1, ce_loss_token=1.7247, perplexity_token=5.6106]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:06<02:35,  2.46it/s, acc_step=1/1, ce_loss_token=1.7246, perplexity_token=5.6102]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:06<02:32,  2.49it/s, acc_step=1/1, ce_loss_token=1.7245, perplexity_token=5.6099]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:06<02:29,  2.53it/s, acc_step=1/1, ce_loss_token=1.7245, perplexity_token=5.6097]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:07<02:30,  2.50it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6094]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:07<02:34,  2.44it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6091]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:08<02:27,  2.56it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6089]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:08<02:24,  2.60it/s, acc_step=1/1, ce_loss_token=1.7243, perplexity_token=5.6088]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:08<02:21,  2.65it/s, acc_step=1/1, ce_loss_token=1.7243, perplexity_token=5.6086]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:09<02:22,  2.61it/s, acc_step=1/1, ce_loss_token=1.7243, perplexity_token=5.6083]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:09<02:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6081]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:09<02:21,  2.63it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6079]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:10<02:19,  2.66it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6076]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:10<02:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6073]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:11<02:19,  2.64it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6070]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:11<02:16,  2.68it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6066]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:11<02:22,  2.58it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6063]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:12<02:19,  2.62it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6060]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:12<02:21,  2.57it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6056]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:13<02:20,  2.58it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:13<02:12,  2.73it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6077]

torch.Size([256, 312, 35]) torch.Size([256, 312])
torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:14<02:09,  2.77it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6074]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:14<02:09,  2.78it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6072]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:14<02:08,  2.79it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:15<02:07,  2.81it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6066]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:15<02:07,  2.78it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:15<02:07,  2.79it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:16<02:06,  2.80it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:16<02:00,  2.93it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6066]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:16<02:01,  2.89it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6063]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:17<02:04,  2.81it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:17<02:12,  2.64it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:18<02:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6067]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:18<02:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6065]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:18<02:08,  2.69it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6063]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:19<02:27,  2.35it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6062]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:19<02:22,  2.42it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:20<02:25,  2.36it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6058]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:20<02:12,  2.59it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6068]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:20<02:10,  2.62it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6066]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:21<02:02,  2.78it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6072]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:21<01:53,  2.99it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6078]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:21<01:54,  2.95it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6076]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:22<02:02,  2.76it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6073]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:22<02:07,  2.64it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6071]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:22<02:06,  2.65it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6068]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:23<02:05,  2.67it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6065]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:23<02:02,  2.72it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:24<02:03,  2.69it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6058]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:24<02:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6056]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:24<02:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6052]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:25<01:54,  2.87it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:25<01:48,  3.04it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6075]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:25<01:51,  2.95it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6073]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:26<01:54,  2.85it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6071]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:26<01:54,  2.84it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6068]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:26<01:54,  2.83it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6078]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:27<01:58,  2.73it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6076]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:27<01:55,  2.79it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6073]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:27<01:57,  2.74it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6071]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:28<01:59,  2.69it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:28<01:57,  2.73it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6066]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:29<02:01,  2.62it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:29<01:53,  2.80it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6072]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:29<01:58,  2.69it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:30<02:01,  2.59it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6067]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:30<02:01,  2.60it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:30<01:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6070]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:31<01:44,  2.98it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6076]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:31<01:49,  2.86it/s, acc_step=1/1, ce_loss_token=1.7241, perplexity_token=5.6074]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:32<01:54,  2.72it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6072]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:32<01:55,  2.69it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:32<01:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6067]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:33<01:55,  2.67it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:33<01:56,  2.63it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:33<01:55,  2.64it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6058]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:34<01:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6056]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:34<01:56,  2.60it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:35<01:59,  2.54it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6051]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:35<02:02,  2.47it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6049]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:35<01:59,  2.52it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6047]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  71%|█████████████████████████████████▍             | 744/1044 [04:36<01:49,  2.74it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:36<01:47,  2.79it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6052]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:36<01:48,  2.76it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6051]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:37<01:50,  2.69it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6049]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:37<01:45,  2.80it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:38<01:46,  2.78it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6051]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:38<01:49,  2.69it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6049]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:38<01:49,  2.68it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6046]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:39<01:39,  2.93it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6055]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:39<01:42,  2.85it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6052]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:39<01:42,  2.84it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6050]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:40<01:34,  3.04it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:40<01:31,  3.15it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6070]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:40<01:38,  2.90it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6067]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:41<01:40,  2.84it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6065]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:41<01:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6063]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:41<01:47,  2.65it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:42<01:43,  2.72it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6071]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:42<01:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:43<01:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6067]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:43<01:48,  2.59it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6064]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:43<01:49,  2.56it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6062]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:44<01:46,  2.61it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:44<01:47,  2.57it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6057]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:44<01:46,  2.58it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6055]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:45<01:45,  2.62it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6053]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:45<01:38,  2.78it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6059]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:46<01:38,  2.78it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6057]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:46<01:33,  2.92it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6063]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:46<01:32,  2.94it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6061]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:47<01:32,  2.92it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6069]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:47<01:33,  2.89it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6066]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:47<01:36,  2.78it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6065]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:48<01:37,  2.73it/s, acc_step=1/1, ce_loss_token=1.7239, perplexity_token=5.6062]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:48<01:45,  2.53it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6060]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:48<01:42,  2.59it/s, acc_step=1/1, ce_loss_token=1.7238, perplexity_token=5.6058]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  75%|███████████████████████████████████            | 780/1044 [04:49<01:41,  2.60it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6055]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:49<01:40,  2.61it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6053]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:50<01:40,  2.60it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6051]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:50<01:38,  2.65it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6048]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:50<01:36,  2.71it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6047]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:51<01:35,  2.71it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6044]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:51<01:47,  2.39it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6042]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:52<01:43,  2.49it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6040]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:52<01:40,  2.56it/s, acc_step=1/1, ce_loss_token=1.7236, perplexity_token=5.6046]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:52<01:39,  2.57it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6043]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:53<01:37,  2.61it/s, acc_step=1/1, ce_loss_token=1.7235, perplexity_token=5.6041]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:53<01:38,  2.58it/s, acc_step=1/1, ce_loss_token=1.7234, perplexity_token=5.6038]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:54<01:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.7234, perplexity_token=5.6037]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:54<01:38,  2.55it/s, acc_step=1/1, ce_loss_token=1.7234, perplexity_token=5.6034]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:54<01:35,  2.63it/s, acc_step=1/1, ce_loss_token=1.7233, perplexity_token=5.6032]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:55<01:36,  2.57it/s, acc_step=1/1, ce_loss_token=1.7233, perplexity_token=5.6030]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  76%|███████████████████████████████████▊           | 796/1044 [04:55<01:36,  2.56it/s, acc_step=1/1, ce_loss_token=1.7232, perplexity_token=5.6027]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:55<01:37,  2.54it/s, acc_step=1/1, ce_loss_token=1.7232, perplexity_token=5.6024]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:56<01:34,  2.59it/s, acc_step=1/1, ce_loss_token=1.7232, perplexity_token=5.6022]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:56<01:32,  2.64it/s, acc_step=1/1, ce_loss_token=1.7231, perplexity_token=5.6019]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:57<01:40,  2.43it/s, acc_step=1/1, ce_loss_token=1.7231, perplexity_token=5.6017]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:57<01:38,  2.48it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6015]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:57<01:36,  2.50it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6013]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:58<01:34,  2.55it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6011]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:58<01:31,  2.62it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6008]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:59<01:40,  2.37it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:59<01:40,  2.38it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6004]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:59<01:34,  2.52it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6001]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [05:00<01:32,  2.56it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6000]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [05:00<01:30,  2.61it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5999]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [05:01<01:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5996]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [05:01<01:27,  2.67it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5994]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [05:01<01:30,  2.57it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5992]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [05:02<01:30,  2.55it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5990]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [05:02<01:30,  2.55it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [05:02<01:26,  2.66it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5986]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [05:03<01:29,  2.56it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5984]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [05:03<01:23,  2.72it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5990]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [05:04<01:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:04<01:23,  2.70it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5985]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:04<01:22,  2.70it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5984]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:05<01:25,  2.60it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5981]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:05<01:22,  2.68it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5979]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:05<01:22,  2.67it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5977]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:06<01:25,  2.56it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5975]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:06<01:24,  2.58it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5973]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:07<01:22,  2.65it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5971]

torch.Size([256, 391, 35]) torch.Size([256, 391])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:07<01:24,  2.56it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5977]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:07<01:23,  2.59it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5975]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:08<01:24,  2.54it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5973]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:08<01:24,  2.55it/s, acc_step=1/1, ce_loss_token=1.7223, perplexity_token=5.5971]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:09<01:24,  2.51it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5969]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:09<01:25,  2.47it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5966]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:09<01:22,  2.56it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5964]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:10<01:21,  2.56it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5962]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:10<01:20,  2.59it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5961]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:11<01:18,  2.64it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:11<01:11,  2.89it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5963]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:11<01:14,  2.77it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5960]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:12<01:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:12<01:14,  2.75it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5964]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:12<01:08,  2.95it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5969]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:13<01:12,  2.79it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5968]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:13<01:15,  2.65it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5965]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:13<01:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5963]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:14<01:16,  2.61it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5961]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:14<01:16,  2.58it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:15<01:21,  2.41it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:15<01:23,  2.34it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:16<01:20,  2.43it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5953]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:16<01:17,  2.50it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5951]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:16<01:14,  2.60it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5950]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:17<01:13,  2.62it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5948]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:17<01:12,  2.65it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5946]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:17<01:06,  2.85it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5953]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:18<01:05,  2.89it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5950]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:18<01:05,  2.89it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5949]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:18<01:06,  2.81it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5946]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:19<01:07,  2.74it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5944]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:19<01:09,  2.68it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5941]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:19<01:04,  2.85it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5947]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:20<01:05,  2.81it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5945]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:20<01:08,  2.67it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5943]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:21<01:06,  2.72it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5942]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:21<01:01,  2.94it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5948]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:21<01:02,  2.86it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5946]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:22<01:02,  2.87it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5945]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:22<00:59,  2.98it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:22<01:08,  2.59it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:23<01:09,  2.53it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:23<01:07,  2.58it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:24<01:15,  2.30it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:24<01:14,  2.32it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5950]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:25<01:11,  2.38it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5948]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:25<01:09,  2.44it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5946]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:25<01:08,  2.47it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5945]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:26<01:05,  2.58it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5943]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:26<01:03,  2.61it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5941]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:26<01:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5939]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:27<00:51,  3.17it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 274, 35]) torch.Size([256, 274])
torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:27<00:48,  3.36it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5966]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:28<00:52,  3.11it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5963]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:28<00:53,  3.02it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5961]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:28<00:54,  2.92it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:29<00:55,  2.84it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:29<00:54,  2.93it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5964]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:29<00:54,  2.86it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5962]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:30<00:51,  3.01it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5968]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:30<00:53,  2.91it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5966]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:30<00:53,  2.89it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5964]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:31<00:53,  2.88it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5962]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:31<00:55,  2.73it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5959]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:32<00:55,  2.70it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:32<00:54,  2.73it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5956]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:32<00:54,  2.71it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5954]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:33<00:55,  2.69it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5954]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:33<00:54,  2.71it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:33<00:55,  2.61it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5949]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:34<00:55,  2.62it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5948]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:34<00:56,  2.55it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5946]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:35<00:55,  2.59it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5944]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:35<00:58,  2.43it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5943]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:35<00:58,  2.41it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5940]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  87%|████████████████████████████████████████▋      | 904/1044 [05:36<00:57,  2.44it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5939]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:36<00:54,  2.54it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5937]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:37<00:55,  2.49it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5935]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:37<00:56,  2.42it/s, acc_step=1/1, ce_loss_token=1.7215, perplexity_token=5.5932]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:37<00:55,  2.44it/s, acc_step=1/1, ce_loss_token=1.7215, perplexity_token=5.5931]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:38<00:50,  2.66it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5936]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:38<00:49,  2.73it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5934]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:38<00:46,  2.88it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5939]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:39<00:46,  2.84it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5936]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:39<00:48,  2.73it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5935]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:40<00:49,  2.61it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5933]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:40<00:46,  2.80it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5939]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:40<00:46,  2.77it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5936]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:41<00:42,  2.99it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5942]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:41<00:39,  3.21it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5947]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:41<00:40,  3.09it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5944]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:42<00:41,  3.00it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5943]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:42<00:42,  2.87it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5940]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:42<00:43,  2.79it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5939]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:43<00:43,  2.77it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5938]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:43<00:43,  2.78it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5936]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:43<00:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.7216, perplexity_token=5.5934]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:44<00:41,  2.86it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5943]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:44<00:45,  2.57it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5941]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:44<00:42,  2.74it/s, acc_step=1/1, ce_loss_token=1.7218, perplexity_token=5.5947]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:45<00:40,  2.84it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:45<00:40,  2.80it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5950]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:45<00:38,  2.95it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:46<00:36,  3.07it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5959]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:46<00:37,  2.98it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:47<00:38,  2.87it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5954]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  90%|██████████████████████████████████████████     | 935/1044 [05:47<00:39,  2.78it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  90%|██████████████████████████████████████████▏    | 936/1044 [05:47<00:36,  2.97it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  90%|██████████████████████████████████████████▏    | 937/1044 [05:48<00:37,  2.85it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  90%|██████████████████████████████████████████▏    | 938/1044 [05:48<00:39,  2.69it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5953]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  90%|██████████████████████████████████████████▎    | 939/1044 [05:48<00:37,  2.78it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  90%|██████████████████████████████████████████▎    | 940/1044 [05:49<00:37,  2.76it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5957]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  90%|██████████████████████████████████████████▎    | 941/1044 [05:49<00:37,  2.72it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  90%|██████████████████████████████████████████▍    | 942/1044 [05:49<00:39,  2.61it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5952]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  90%|██████████████████████████████████████████▍    | 943/1044 [05:50<00:35,  2.84it/s, acc_step=1/1, ce_loss_token=1.7221, perplexity_token=5.5961]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  90%|██████████████████████████████████████████▍    | 944/1044 [05:50<00:35,  2.82it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5959]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  91%|██████████████████████████████████████████▌    | 945/1044 [05:50<00:35,  2.81it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5958]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  91%|██████████████████████████████████████████▌    | 946/1044 [05:51<00:35,  2.74it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5956]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  91%|██████████████████████████████████████████▋    | 947/1044 [05:51<00:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5955]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  91%|██████████████████████████████████████████▋    | 948/1044 [05:52<00:33,  2.90it/s, acc_step=1/1, ce_loss_token=1.7220, perplexity_token=5.5960]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  91%|██████████████████████████████████████████▋    | 949/1044 [05:52<00:31,  3.04it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5968]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  91%|██████████████████████████████████████████▊    | 951/1044 [05:52<00:26,  3.54it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5989]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  91%|██████████████████████████████████████████▊    | 952/1044 [05:53<00:27,  3.35it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  91%|██████████████████████████████████████████▉    | 953/1044 [05:53<00:26,  3.45it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5992]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  91%|██████████████████████████████████████████▉    | 954/1044 [05:53<00:28,  3.14it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5991]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  92%|███████████████████████████████████████████    | 956/1044 [05:54<00:26,  3.28it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6011]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  92%|███████████████████████████████████████████    | 957/1044 [05:54<00:29,  2.96it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6009]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  92%|███████████████████████████████████████████▏   | 958/1044 [05:55<00:29,  2.89it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6007]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  92%|███████████████████████████████████████████▏   | 959/1044 [05:55<00:31,  2.69it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  92%|███████████████████████████████████████████▏   | 960/1044 [05:55<00:30,  2.76it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6009]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  92%|███████████████████████████████████████████▎   | 961/1044 [05:56<00:30,  2.75it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6007]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  92%|███████████████████████████████████████████▎   | 962/1044 [05:56<00:27,  2.94it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6012]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  92%|███████████████████████████████████████████▎   | 963/1044 [05:56<00:27,  2.91it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  92%|███████████████████████████████████████████▍   | 964/1044 [05:57<00:28,  2.81it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  92%|███████████████████████████████████████████▍   | 965/1044 [05:57<00:29,  2.72it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6007]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  93%|███████████████████████████████████████████▍   | 966/1044 [05:58<00:28,  2.72it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  93%|███████████████████████████████████████████▌   | 967/1044 [05:58<00:29,  2.57it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6004]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  93%|███████████████████████████████████████████▌   | 968/1044 [05:58<00:29,  2.62it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6003]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  93%|███████████████████████████████████████████▌   | 969/1044 [05:59<00:28,  2.63it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6002]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  93%|███████████████████████████████████████████▋   | 970/1044 [05:59<00:28,  2.61it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.5999]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  93%|███████████████████████████████████████████▋   | 971/1044 [06:00<00:28,  2.52it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5997]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  93%|███████████████████████████████████████████▊   | 972/1044 [06:00<00:28,  2.57it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5995]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  93%|███████████████████████████████████████████▊   | 973/1044 [06:00<00:27,  2.62it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5993]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  93%|███████████████████████████████████████████▊   | 974/1044 [06:01<00:27,  2.54it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5990]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  93%|███████████████████████████████████████████▉   | 975/1044 [06:01<00:27,  2.52it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  93%|███████████████████████████████████████████▉   | 976/1044 [06:02<00:26,  2.57it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  94%|███████████████████████████████████████████▉   | 977/1044 [06:02<00:27,  2.41it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5986]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  94%|████████████████████████████████████████████   | 978/1044 [06:02<00:24,  2.67it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5994]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  94%|████████████████████████████████████████████   | 979/1044 [06:03<00:22,  2.90it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5999]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  94%|████████████████████████████████████████████   | 980/1044 [06:03<00:22,  2.87it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5997]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  94%|████████████████████████████████████████████▏  | 981/1044 [06:03<00:22,  2.86it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5995]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  94%|████████████████████████████████████████████▏  | 982/1044 [06:04<00:21,  2.92it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6000]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  94%|████████████████████████████████████████████▎  | 983/1044 [06:04<00:20,  2.94it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6008]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  94%|████████████████████████████████████████████▎  | 984/1044 [06:04<00:21,  2.85it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  94%|████████████████████████████████████████████▎  | 985/1044 [06:05<00:20,  2.81it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6004]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  94%|████████████████████████████████████████████▍  | 986/1044 [06:05<00:20,  2.77it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6002]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  95%|████████████████████████████████████████████▍  | 987/1044 [06:05<00:20,  2.78it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6000]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  95%|████████████████████████████████████████████▍  | 988/1044 [06:06<00:19,  2.81it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5999]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  95%|████████████████████████████████████████████▌  | 989/1044 [06:06<00:18,  2.97it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6003]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  95%|████████████████████████████████████████████▌  | 990/1044 [06:06<00:17,  3.11it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  95%|████████████████████████████████████████████▌  | 991/1044 [06:07<00:16,  3.15it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6014]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  95%|████████████████████████████████████████████▋  | 992/1044 [06:07<00:17,  2.90it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6013]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  95%|████████████████████████████████████████████▋  | 993/1044 [06:07<00:17,  2.90it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6011]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  95%|████████████████████████████████████████████▋  | 994/1044 [06:08<00:17,  2.89it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6009]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  95%|████████████████████████████████████████████▊  | 995/1044 [06:08<00:17,  2.78it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6007]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  95%|████████████████████████████████████████████▊  | 996/1044 [06:08<00:16,  2.94it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6012]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  95%|████████████████████████████████████████████▉  | 997/1044 [06:09<00:16,  2.81it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  96%|████████████████████████████████████████████▉  | 998/1044 [06:09<00:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6007]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  96%|████████████████████████████████████████████▉  | 999/1044 [06:10<00:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  96%|████████████████████████████████████████████  | 1000/1044 [06:10<00:14,  3.02it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  96%|████████████████████████████████████████████  | 1001/1044 [06:10<00:14,  2.99it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6009]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1002/1044 [06:11<00:13,  3.06it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6014]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1003/1044 [06:11<00:14,  2.88it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6011]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  96%|████████████████████████████████████████████▏ | 1004/1044 [06:11<00:14,  2.77it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6010]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1005/1044 [06:12<00:14,  2.68it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6008]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1006/1044 [06:12<00:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6006]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  96%|████████████████████████████████████████████▎ | 1007/1044 [06:12<00:13,  2.74it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6004]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1008/1044 [06:13<00:13,  2.77it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6002]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  97%|████████████████████████████████████████████▍ | 1009/1044 [06:13<00:12,  2.71it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6000]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1010/1044 [06:14<00:12,  2.74it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5998]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1011/1044 [06:14<00:11,  2.93it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6002]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  97%|████████████████████████████████████████████▌ | 1012/1044 [06:14<00:11,  2.76it/s, acc_step=1/1, ce_loss_token=1.7228, perplexity_token=5.6000]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1013/1044 [06:15<00:11,  2.75it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5998]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1014/1044 [06:15<00:10,  2.75it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5997]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  97%|████████████████████████████████████████████▋ | 1015/1044 [06:15<00:10,  2.67it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5995]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1016/1044 [06:16<00:10,  2.66it/s, acc_step=1/1, ce_loss_token=1.7227, perplexity_token=5.5994]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  97%|████████████████████████████████████████████▊ | 1017/1044 [06:16<00:10,  2.64it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5991]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  98%|████████████████████████████████████████████▊ | 1018/1044 [06:16<00:09,  2.66it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5990]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1019/1044 [06:17<00:09,  2.64it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5988]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1020/1044 [06:17<00:09,  2.51it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5985]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  98%|████████████████████████████████████████████▉ | 1021/1044 [06:18<00:08,  2.59it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5983]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  98%|█████████████████████████████████████████████ | 1022/1044 [06:18<00:08,  2.67it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5988]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  98%|█████████████████████████████████████████████ | 1023/1044 [06:18<00:07,  2.76it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5985]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  98%|█████████████████████████████████████████████ | 1024/1044 [06:19<00:07,  2.84it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5990]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1025/1044 [06:19<00:06,  2.81it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5988]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  98%|█████████████████████████████████████████████▏| 1026/1044 [06:19<00:06,  2.98it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5992]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1027/1044 [06:20<00:06,  2.79it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5991]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  98%|█████████████████████████████████████████████▎| 1028/1044 [06:20<00:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.7226, perplexity_token=5.5989]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  99%|█████████████████████████████████████████████▎| 1029/1044 [06:21<00:05,  2.57it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5986]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1030/1044 [06:21<00:05,  2.57it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5984]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1031/1044 [06:21<00:05,  2.53it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5982]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  99%|█████████████████████████████████████████████▍| 1032/1044 [06:22<00:04,  2.54it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5980]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1033/1044 [06:22<00:04,  2.58it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5979]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1034/1044 [06:22<00:03,  2.68it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5978]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  99%|█████████████████████████████████████████████▌| 1035/1044 [06:23<00:03,  2.88it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5985]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1036/1044 [06:23<00:02,  2.79it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5985]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1037/1044 [06:23<00:02,  2.81it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5982]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  99%|█████████████████████████████████████████████▋| 1038/1044 [06:24<00:02,  2.78it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5981]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1039/1044 [06:24<00:01,  2.68it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5979]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1040/1044 [06:25<00:01,  2.73it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5977]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]: 100%|█████████████████████████████████████████████▊| 1041/1044 [06:25<00:01,  2.89it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5984]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1042/1044 [06:25<00:00,  2.83it/s, acc_step=1/1, ce_loss_token=1.7225, perplexity_token=5.5982]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]: 100%|█████████████████████████████████████████████▉| 1043/1044 [06:26<00:00,  2.77it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5981]

torch.Size([170, 302, 35]) torch.Size([170, 302])



Generating with greedy search...

📊 Metrics (Epoch 9):
├── TRAIN:
│   ├── ce_loss_char: 1.7224
│   ├── ce_loss_token: 1.7224
│   ├── perplexity_char: 5.5980
│   └── perplexity_token: 5.5980
├── VAL:
│   ├── ce_loss_char: 1.6054
│   ├── ce_loss_token: 1.6054
│   ├── perplexity_char: 4.9798
│   └── perplexity_token: 4.9798
└── TRAINING:
    └── learning_rate: 0.000099


[Training LM]:   0%|                                                                                                                      | 0/1044 [00:00<?, ?it/s]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:   0%|                                                 | 1/1044 [00:00<08:50,  1.97it/s, acc_step=1/1, ce_loss_token=1.6877, perplexity_token=5.4070]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   0%|                                                 | 2/1044 [00:00<07:26,  2.33it/s, acc_step=1/1, ce_loss_token=1.6846, perplexity_token=5.3905]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   0%|▏                                                | 3/1044 [00:01<06:57,  2.50it/s, acc_step=1/1, ce_loss_token=1.6854, perplexity_token=5.3948]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   0%|▏                                                | 4/1044 [00:01<07:00,  2.47it/s, acc_step=1/1, ce_loss_token=1.6863, perplexity_token=5.3993]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:   0%|▏                                                | 5/1044 [00:02<07:19,  2.36it/s, acc_step=1/1, ce_loss_token=1.6873, perplexity_token=5.4049]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   1%|▎                                                | 6/1044 [00:02<07:01,  2.46it/s, acc_step=1/1, ce_loss_token=1.6900, perplexity_token=5.4196]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   1%|▎                                                | 7/1044 [00:02<06:52,  2.51it/s, acc_step=1/1, ce_loss_token=1.6915, perplexity_token=5.4278]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   1%|▍                                                | 8/1044 [00:03<06:52,  2.51it/s, acc_step=1/1, ce_loss_token=1.6910, perplexity_token=5.4247]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   1%|▍                                                | 9/1044 [00:03<07:07,  2.42it/s, acc_step=1/1, ce_loss_token=1.6910, perplexity_token=5.4251]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   1%|▍                                               | 10/1044 [00:03<06:24,  2.69it/s, acc_step=1/1, ce_loss_token=1.7066, perplexity_token=5.5100]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   1%|▌                                               | 11/1044 [00:04<05:55,  2.91it/s, acc_step=1/1, ce_loss_token=1.7208, perplexity_token=5.5892]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   1%|▌                                               | 12/1044 [00:04<06:07,  2.81it/s, acc_step=1/1, ce_loss_token=1.7178, perplexity_token=5.5720]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:   1%|▌                                               | 13/1044 [00:05<06:19,  2.71it/s, acc_step=1/1, ce_loss_token=1.7168, perplexity_token=5.5668]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   1%|▋                                               | 14/1044 [00:05<05:50,  2.94it/s, acc_step=1/1, ce_loss_token=1.7266, perplexity_token=5.6213]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   1%|▋                                               | 15/1044 [00:05<05:52,  2.92it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6068]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   2%|▋                                               | 16/1044 [00:05<05:41,  3.01it/s, acc_step=1/1, ce_loss_token=1.7286, perplexity_token=5.6325]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   2%|▊                                               | 17/1044 [00:06<05:55,  2.89it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   2%|▊                                               | 18/1044 [00:06<05:49,  2.94it/s, acc_step=1/1, ce_loss_token=1.7240, perplexity_token=5.6072]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:   2%|▊                                               | 19/1044 [00:06<05:30,  3.10it/s, acc_step=1/1, ce_loss_token=1.7278, perplexity_token=5.6281]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   2%|▉                                               | 20/1044 [00:07<05:52,  2.90it/s, acc_step=1/1, ce_loss_token=1.7256, perplexity_token=5.6159]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:   2%|▉                                               | 21/1044 [00:07<06:14,  2.73it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6054]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:   2%|█                                               | 22/1044 [00:08<06:02,  2.82it/s, acc_step=1/1, ce_loss_token=1.7295, perplexity_token=5.6380]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:   2%|█                                               | 23/1044 [00:08<05:55,  2.88it/s, acc_step=1/1, ce_loss_token=1.7281, perplexity_token=5.6301]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:   2%|█                                               | 24/1044 [00:08<06:01,  2.82it/s, acc_step=1/1, ce_loss_token=1.7260, perplexity_token=5.6181]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   2%|█▏                                              | 25/1044 [00:09<05:59,  2.83it/s, acc_step=1/1, ce_loss_token=1.7293, perplexity_token=5.6369]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   2%|█▏                                              | 26/1044 [00:09<06:04,  2.79it/s, acc_step=1/1, ce_loss_token=1.7280, perplexity_token=5.6294]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   3%|█▏                                              | 27/1044 [00:09<06:10,  2.75it/s, acc_step=1/1, ce_loss_token=1.7268, perplexity_token=5.6226]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   3%|█▎                                              | 28/1044 [00:10<06:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.7255, perplexity_token=5.6153]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   3%|█▎                                              | 29/1044 [00:10<06:22,  2.66it/s, acc_step=1/1, ce_loss_token=1.7242, perplexity_token=5.6079]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:   3%|█▍                                              | 30/1044 [00:11<06:29,  2.60it/s, acc_step=1/1, ce_loss_token=1.7229, perplexity_token=5.6009]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:   3%|█▍                                              | 31/1044 [00:11<06:33,  2.58it/s, acc_step=1/1, ce_loss_token=1.7219, perplexity_token=5.5950]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:   3%|█▍                                              | 32/1044 [00:11<06:40,  2.53it/s, acc_step=1/1, ce_loss_token=1.7203, perplexity_token=5.5862]

torch.Size([256, 454, 35]) torch.Size([256, 454])


[Training LM]:   3%|█▌                                              | 33/1044 [00:12<08:00,  2.10it/s, acc_step=1/1, ce_loss_token=1.7191, perplexity_token=5.5796]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   3%|█▌                                              | 34/1044 [00:12<07:29,  2.25it/s, acc_step=1/1, ce_loss_token=1.7184, perplexity_token=5.5757]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:   3%|█▌                                              | 35/1044 [00:13<07:06,  2.37it/s, acc_step=1/1, ce_loss_token=1.7173, perplexity_token=5.5693]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   3%|█▋                                              | 36/1044 [00:13<07:04,  2.37it/s, acc_step=1/1, ce_loss_token=1.7165, perplexity_token=5.5651]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   4%|█▋                                              | 37/1044 [00:14<06:48,  2.47it/s, acc_step=1/1, ce_loss_token=1.7152, perplexity_token=5.5577]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:   4%|█▋                                              | 38/1044 [00:14<06:23,  2.62it/s, acc_step=1/1, ce_loss_token=1.7178, perplexity_token=5.5720]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:   4%|█▊                                              | 39/1044 [00:14<06:20,  2.64it/s, acc_step=1/1, ce_loss_token=1.7169, perplexity_token=5.5672]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   4%|█▊                                              | 40/1044 [00:15<05:56,  2.82it/s, acc_step=1/1, ce_loss_token=1.7191, perplexity_token=5.5795]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   4%|█▉                                              | 41/1044 [00:15<05:58,  2.80it/s, acc_step=1/1, ce_loss_token=1.7183, perplexity_token=5.5750]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:   4%|█▉                                              | 42/1044 [00:15<05:49,  2.87it/s, acc_step=1/1, ce_loss_token=1.7176, perplexity_token=5.5712]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:   4%|█▉                                              | 43/1044 [00:16<05:43,  2.91it/s, acc_step=1/1, ce_loss_token=1.7207, perplexity_token=5.5883]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:   4%|██                                              | 44/1044 [00:16<06:14,  2.67it/s, acc_step=1/1, ce_loss_token=1.7203, perplexity_token=5.5863]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:   4%|██                                              | 45/1044 [00:16<06:07,  2.72it/s, acc_step=1/1, ce_loss_token=1.7197, perplexity_token=5.5828]

torch.Size([256, 354, 35]) torch.Size([256, 354])


[Training LM]:   4%|██                                              | 46/1044 [00:17<06:33,  2.53it/s, acc_step=1/1, ce_loss_token=1.7190, perplexity_token=5.5791]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▏                                             | 47/1044 [00:17<06:06,  2.72it/s, acc_step=1/1, ce_loss_token=1.7208, perplexity_token=5.5892]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   5%|██▏                                             | 48/1044 [00:18<06:09,  2.70it/s, acc_step=1/1, ce_loss_token=1.7202, perplexity_token=5.5857]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   5%|██▎                                             | 49/1044 [00:18<06:08,  2.70it/s, acc_step=1/1, ce_loss_token=1.7195, perplexity_token=5.5815]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:   5%|██▎                                             | 50/1044 [00:18<06:33,  2.53it/s, acc_step=1/1, ce_loss_token=1.7188, perplexity_token=5.5780]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▎                                             | 51/1044 [00:19<06:33,  2.53it/s, acc_step=1/1, ce_loss_token=1.7184, perplexity_token=5.5755]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   5%|██▍                                             | 52/1044 [00:19<06:09,  2.69it/s, acc_step=1/1, ce_loss_token=1.7212, perplexity_token=5.5912]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   5%|██▍                                             | 53/1044 [00:19<06:05,  2.71it/s, acc_step=1/1, ce_loss_token=1.7205, perplexity_token=5.5872]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   5%|██▍                                             | 54/1044 [00:20<05:43,  2.88it/s, acc_step=1/1, ce_loss_token=1.7222, perplexity_token=5.5967]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:   5%|██▌                                             | 55/1044 [00:20<05:40,  2.90it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5938]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   5%|██▌                                             | 56/1044 [00:20<05:27,  3.02it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6011]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:   5%|██▌                                             | 57/1044 [00:21<05:10,  3.18it/s, acc_step=1/1, ce_loss_token=1.7244, perplexity_token=5.6092]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   6%|██▋                                             | 58/1044 [00:21<05:20,  3.08it/s, acc_step=1/1, ce_loss_token=1.7237, perplexity_token=5.6053]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:   6%|██▋                                             | 59/1044 [00:21<05:53,  2.78it/s, acc_step=1/1, ce_loss_token=1.7230, perplexity_token=5.6013]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:   6%|██▊                                             | 60/1044 [00:22<05:53,  2.78it/s, acc_step=1/1, ce_loss_token=1.7224, perplexity_token=5.5979]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:   6%|██▊                                             | 61/1044 [00:22<05:48,  2.82it/s, acc_step=1/1, ce_loss_token=1.7217, perplexity_token=5.5941]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:   6%|██▊                                             | 62/1044 [00:23<05:54,  2.77it/s, acc_step=1/1, ce_loss_token=1.7210, perplexity_token=5.5901]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   6%|██▉                                             | 63/1044 [00:23<06:00,  2.72it/s, acc_step=1/1, ce_loss_token=1.7202, perplexity_token=5.5858]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:   6%|██▉                                             | 64/1044 [00:23<05:53,  2.77it/s, acc_step=1/1, ce_loss_token=1.7199, perplexity_token=5.5837]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   6%|██▉                                             | 65/1044 [00:24<05:52,  2.78it/s, acc_step=1/1, ce_loss_token=1.7195, perplexity_token=5.5819]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   6%|███                                             | 66/1044 [00:24<06:01,  2.71it/s, acc_step=1/1, ce_loss_token=1.7191, perplexity_token=5.5793]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   6%|███                                             | 67/1044 [00:24<06:13,  2.62it/s, acc_step=1/1, ce_loss_token=1.7184, perplexity_token=5.5758]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:   7%|███▏                                            | 68/1044 [00:25<06:15,  2.60it/s, acc_step=1/1, ce_loss_token=1.7180, perplexity_token=5.5735]

torch.Size([256, 410, 35]) torch.Size([256, 410])


[Training LM]:   7%|███▏                                            | 69/1044 [00:25<07:11,  2.26it/s, acc_step=1/1, ce_loss_token=1.7176, perplexity_token=5.5711]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:   7%|███▏                                            | 70/1044 [00:26<06:23,  2.54it/s, acc_step=1/1, ce_loss_token=1.7187, perplexity_token=5.5775]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:   7%|███▎                                            | 71/1044 [00:26<06:10,  2.63it/s, acc_step=1/1, ce_loss_token=1.7183, perplexity_token=5.5750]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   7%|███▎                                            | 72/1044 [00:26<06:07,  2.65it/s, acc_step=1/1, ce_loss_token=1.7179, perplexity_token=5.5726]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:   7%|███▎                                            | 73/1044 [00:27<06:11,  2.61it/s, acc_step=1/1, ce_loss_token=1.7193, perplexity_token=5.5805]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:   7%|███▍                                            | 74/1044 [00:27<05:57,  2.71it/s, acc_step=1/1, ce_loss_token=1.7189, perplexity_token=5.5784]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:   7%|███▍                                            | 75/1044 [00:28<06:14,  2.59it/s, acc_step=1/1, ce_loss_token=1.7185, perplexity_token=5.5760]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:   7%|███▍                                            | 76/1044 [00:28<06:07,  2.63it/s, acc_step=1/1, ce_loss_token=1.7180, perplexity_token=5.5735]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:   7%|███▌                                            | 77/1044 [00:28<06:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.7174, perplexity_token=5.5700]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:   7%|███▌                                            | 78/1044 [00:29<06:20,  2.54it/s, acc_step=1/1, ce_loss_token=1.7170, perplexity_token=5.5676]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:   8%|███▋                                            | 79/1044 [00:29<06:12,  2.59it/s, acc_step=1/1, ce_loss_token=1.7164, perplexity_token=5.5643]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:   8%|███▋                                            | 80/1044 [00:29<06:05,  2.64it/s, acc_step=1/1, ce_loss_token=1.7161, perplexity_token=5.5626]

torch.Size([256, 437, 35]) torch.Size([256, 437])


[Training LM]:   8%|███▋                                            | 81/1044 [00:30<06:36,  2.43it/s, acc_step=1/1, ce_loss_token=1.7173, perplexity_token=5.5697]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:   8%|███▊                                            | 82/1044 [00:30<06:27,  2.48it/s, acc_step=1/1, ce_loss_token=1.7169, perplexity_token=5.5671]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:   8%|███▊                                            | 83/1044 [00:31<06:23,  2.51it/s, acc_step=1/1, ce_loss_token=1.7164, perplexity_token=5.5645]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:   8%|███▊                                            | 84/1044 [00:31<05:49,  2.75it/s, acc_step=1/1, ce_loss_token=1.7174, perplexity_token=5.5701]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   8%|███▉                                            | 85/1044 [00:31<05:50,  2.74it/s, acc_step=1/1, ce_loss_token=1.7170, perplexity_token=5.5676]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:   8%|███▉                                            | 86/1044 [00:32<05:52,  2.72it/s, acc_step=1/1, ce_loss_token=1.7167, perplexity_token=5.5660]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   8%|████                                            | 87/1044 [00:32<05:59,  2.66it/s, acc_step=1/1, ce_loss_token=1.7163, perplexity_token=5.5639]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:   8%|████                                            | 88/1044 [00:33<06:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7160, perplexity_token=5.5621]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:   9%|████                                            | 89/1044 [00:33<06:05,  2.61it/s, acc_step=1/1, ce_loss_token=1.7157, perplexity_token=5.5606]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:   9%|████▏                                           | 90/1044 [00:33<06:08,  2.59it/s, acc_step=1/1, ce_loss_token=1.7154, perplexity_token=5.5588]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:   9%|████▏                                           | 91/1044 [00:34<05:36,  2.83it/s, acc_step=1/1, ce_loss_token=1.7164, perplexity_token=5.5644]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▏                                           | 92/1044 [00:34<05:41,  2.79it/s, acc_step=1/1, ce_loss_token=1.7159, perplexity_token=5.5619]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:   9%|████▎                                           | 93/1044 [00:34<05:54,  2.69it/s, acc_step=1/1, ce_loss_token=1.7155, perplexity_token=5.5597]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:   9%|████▎                                           | 94/1044 [00:35<05:34,  2.84it/s, acc_step=1/1, ce_loss_token=1.7170, perplexity_token=5.5677]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:   9%|████▎                                           | 95/1044 [00:35<05:26,  2.91it/s, acc_step=1/1, ce_loss_token=1.7184, perplexity_token=5.5757]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:   9%|████▍                                           | 96/1044 [00:35<05:45,  2.74it/s, acc_step=1/1, ce_loss_token=1.7180, perplexity_token=5.5733]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:   9%|████▍                                           | 97/1044 [00:36<05:51,  2.70it/s, acc_step=1/1, ce_loss_token=1.7177, perplexity_token=5.5718]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:   9%|████▌                                           | 98/1044 [00:36<06:00,  2.62it/s, acc_step=1/1, ce_loss_token=1.7175, perplexity_token=5.5704]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:   9%|████▌                                           | 99/1044 [00:37<06:07,  2.57it/s, acc_step=1/1, ce_loss_token=1.7172, perplexity_token=5.5691]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  10%|████▌                                          | 100/1044 [00:37<06:10,  2.55it/s, acc_step=1/1, ce_loss_token=1.7168, perplexity_token=5.5667]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  10%|████▌                                          | 101/1044 [00:37<06:21,  2.47it/s, acc_step=1/1, ce_loss_token=1.7166, perplexity_token=5.5656]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  10%|████▌                                          | 102/1044 [00:38<06:14,  2.52it/s, acc_step=1/1, ce_loss_token=1.7163, perplexity_token=5.5641]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  10%|████▋                                          | 103/1044 [00:38<06:12,  2.52it/s, acc_step=1/1, ce_loss_token=1.7160, perplexity_token=5.5623]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  10%|████▋                                          | 104/1044 [00:39<06:10,  2.53it/s, acc_step=1/1, ce_loss_token=1.7158, perplexity_token=5.5610]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  10%|████▋                                          | 105/1044 [00:39<06:32,  2.39it/s, acc_step=1/1, ce_loss_token=1.7155, perplexity_token=5.5593]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  10%|████▊                                          | 106/1044 [00:39<06:16,  2.49it/s, acc_step=1/1, ce_loss_token=1.7152, perplexity_token=5.5579]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  10%|████▊                                          | 107/1044 [00:40<06:05,  2.57it/s, acc_step=1/1, ce_loss_token=1.7150, perplexity_token=5.5565]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  10%|████▊                                          | 108/1044 [00:40<06:17,  2.48it/s, acc_step=1/1, ce_loss_token=1.7145, perplexity_token=5.5541]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  10%|████▉                                          | 109/1044 [00:41<05:42,  2.73it/s, acc_step=1/1, ce_loss_token=1.7153, perplexity_token=5.5582]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  11%|████▉                                          | 110/1044 [00:41<05:51,  2.66it/s, acc_step=1/1, ce_loss_token=1.7150, perplexity_token=5.5564]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  11%|████▉                                          | 111/1044 [00:41<05:31,  2.81it/s, acc_step=1/1, ce_loss_token=1.7161, perplexity_token=5.5630]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  11%|█████                                          | 112/1044 [00:42<05:43,  2.71it/s, acc_step=1/1, ce_loss_token=1.7159, perplexity_token=5.5616]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  11%|█████                                          | 113/1044 [00:42<05:40,  2.74it/s, acc_step=1/1, ce_loss_token=1.7156, perplexity_token=5.5601]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  11%|█████▏                                         | 114/1044 [00:42<05:46,  2.68it/s, acc_step=1/1, ce_loss_token=1.7154, perplexity_token=5.5589]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  11%|█████▏                                         | 115/1044 [00:43<05:44,  2.69it/s, acc_step=1/1, ce_loss_token=1.7152, perplexity_token=5.5578]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  11%|█████▏                                         | 116/1044 [00:43<05:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7150, perplexity_token=5.5565]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  11%|█████▎                                         | 117/1044 [00:43<05:13,  2.96it/s, acc_step=1/1, ce_loss_token=1.7162, perplexity_token=5.5631]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  11%|█████▎                                         | 118/1044 [00:44<04:58,  3.10it/s, acc_step=1/1, ce_loss_token=1.7170, perplexity_token=5.5676]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  11%|█████▎                                         | 119/1044 [00:44<05:24,  2.85it/s, acc_step=1/1, ce_loss_token=1.7167, perplexity_token=5.5663]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  11%|█████▍                                         | 120/1044 [00:44<05:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7165, perplexity_token=5.5649]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  12%|█████▍                                         | 121/1044 [00:45<05:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7162, perplexity_token=5.5636]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  12%|█████▍                                         | 122/1044 [00:45<05:41,  2.70it/s, acc_step=1/1, ce_loss_token=1.7160, perplexity_token=5.5624]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  12%|█████▌                                         | 123/1044 [00:46<05:43,  2.68it/s, acc_step=1/1, ce_loss_token=1.7157, perplexity_token=5.5607]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  12%|█████▌                                         | 124/1044 [00:46<05:35,  2.74it/s, acc_step=1/1, ce_loss_token=1.7154, perplexity_token=5.5590]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  12%|█████▋                                         | 125/1044 [00:46<05:39,  2.71it/s, acc_step=1/1, ce_loss_token=1.7151, perplexity_token=5.5575]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  12%|█████▋                                         | 126/1044 [00:47<05:42,  2.68it/s, acc_step=1/1, ce_loss_token=1.7150, perplexity_token=5.5568]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  12%|█████▊                                         | 128/1044 [00:47<04:53,  3.12it/s, acc_step=1/1, ce_loss_token=1.7169, perplexity_token=5.5670]

torch.Size([256, 307, 35]) torch.Size([256, 307])
torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  12%|█████▊                                         | 129/1044 [00:48<05:21,  2.85it/s, acc_step=1/1, ce_loss_token=1.7165, perplexity_token=5.5653]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  12%|█████▊                                         | 130/1044 [00:48<05:20,  2.85it/s, acc_step=1/1, ce_loss_token=1.7164, perplexity_token=5.5644]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  13%|█████▉                                         | 131/1044 [00:48<05:01,  3.03it/s, acc_step=1/1, ce_loss_token=1.7171, perplexity_token=5.5681]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  13%|█████▉                                         | 132/1044 [00:49<05:12,  2.92it/s, acc_step=1/1, ce_loss_token=1.7168, perplexity_token=5.5667]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  13%|█████▉                                         | 133/1044 [00:49<05:16,  2.88it/s, acc_step=1/1, ce_loss_token=1.7166, perplexity_token=5.5655]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  13%|██████                                         | 134/1044 [00:49<05:26,  2.79it/s, acc_step=1/1, ce_loss_token=1.7163, perplexity_token=5.5642]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  13%|██████                                         | 135/1044 [00:50<05:24,  2.80it/s, acc_step=1/1, ce_loss_token=1.7161, perplexity_token=5.5629]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  13%|██████                                         | 136/1044 [00:50<05:22,  2.82it/s, acc_step=1/1, ce_loss_token=1.7159, perplexity_token=5.5618]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  13%|██████▏                                        | 137/1044 [00:50<05:23,  2.80it/s, acc_step=1/1, ce_loss_token=1.7156, perplexity_token=5.5602]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  13%|██████▏                                        | 138/1044 [00:51<05:11,  2.91it/s, acc_step=1/1, ce_loss_token=1.7164, perplexity_token=5.5642]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  13%|██████▎                                        | 139/1044 [00:51<05:19,  2.83it/s, acc_step=1/1, ce_loss_token=1.7161, perplexity_token=5.5629]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  13%|██████▎                                        | 140/1044 [00:52<05:29,  2.74it/s, acc_step=1/1, ce_loss_token=1.7158, perplexity_token=5.5614]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  14%|██████▎                                        | 141/1044 [00:52<05:32,  2.72it/s, acc_step=1/1, ce_loss_token=1.7155, perplexity_token=5.5596]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  14%|██████▍                                        | 142/1044 [00:52<05:38,  2.66it/s, acc_step=1/1, ce_loss_token=1.7153, perplexity_token=5.5583]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  14%|██████▍                                        | 143/1044 [00:53<05:38,  2.66it/s, acc_step=1/1, ce_loss_token=1.7151, perplexity_token=5.5570]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  14%|██████▍                                        | 144/1044 [00:53<05:33,  2.70it/s, acc_step=1/1, ce_loss_token=1.7148, perplexity_token=5.5558]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  14%|██████▌                                        | 145/1044 [00:53<05:13,  2.87it/s, acc_step=1/1, ce_loss_token=1.7155, perplexity_token=5.5592]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  14%|██████▌                                        | 146/1044 [00:54<05:24,  2.77it/s, acc_step=1/1, ce_loss_token=1.7152, perplexity_token=5.5577]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  14%|██████▌                                        | 147/1044 [00:54<05:26,  2.75it/s, acc_step=1/1, ce_loss_token=1.7150, perplexity_token=5.5567]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  14%|██████▋                                        | 148/1044 [00:55<05:36,  2.66it/s, acc_step=1/1, ce_loss_token=1.7148, perplexity_token=5.5558]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  14%|██████▋                                        | 149/1044 [00:55<05:50,  2.56it/s, acc_step=1/1, ce_loss_token=1.7147, perplexity_token=5.5548]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  14%|██████▊                                        | 150/1044 [00:55<05:50,  2.55it/s, acc_step=1/1, ce_loss_token=1.7145, perplexity_token=5.5538]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  14%|██████▊                                        | 151/1044 [00:56<05:59,  2.49it/s, acc_step=1/1, ce_loss_token=1.7143, perplexity_token=5.5526]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  15%|██████▊                                        | 152/1044 [00:56<06:08,  2.42it/s, acc_step=1/1, ce_loss_token=1.7142, perplexity_token=5.5522]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  15%|██████▉                                        | 153/1044 [00:57<06:03,  2.45it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5512]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  15%|██████▉                                        | 154/1044 [00:57<05:51,  2.53it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5506]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  15%|██████▉                                        | 155/1044 [00:57<05:48,  2.55it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  15%|███████                                        | 156/1044 [00:58<05:48,  2.55it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5485]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  15%|███████                                        | 157/1044 [00:58<05:46,  2.56it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  15%|███████                                        | 158/1044 [00:58<05:33,  2.66it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5464]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  15%|███████▏                                       | 159/1044 [00:59<05:30,  2.67it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  15%|███████▏                                       | 160/1044 [00:59<05:09,  2.86it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5488]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  15%|███████▏                                       | 161/1044 [00:59<05:06,  2.88it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5477]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  16%|███████▎                                       | 162/1044 [01:00<05:17,  2.77it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5472]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  16%|███████▎                                       | 163/1044 [01:00<05:19,  2.76it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▍                                       | 164/1044 [01:01<05:22,  2.73it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5454]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  16%|███████▍                                       | 165/1044 [01:01<05:24,  2.71it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  16%|███████▍                                       | 166/1044 [01:01<05:26,  2.69it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  16%|███████▌                                       | 167/1044 [01:02<05:21,  2.73it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  16%|███████▌                                       | 168/1044 [01:02<05:33,  2.63it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5419]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  16%|███████▌                                       | 169/1044 [01:03<05:38,  2.58it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  16%|███████▋                                       | 170/1044 [01:03<05:37,  2.59it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  16%|███████▋                                       | 171/1044 [01:03<05:43,  2.54it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  16%|███████▋                                       | 172/1044 [01:04<05:28,  2.66it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5387]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  17%|███████▊                                       | 173/1044 [01:04<05:13,  2.78it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  17%|███████▊                                       | 174/1044 [01:04<05:07,  2.83it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5436]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  17%|███████▉                                       | 175/1044 [01:05<05:06,  2.83it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5427]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  17%|███████▉                                       | 176/1044 [01:05<05:01,  2.87it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5421]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  17%|███████▉                                       | 177/1044 [01:05<05:09,  2.80it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  17%|████████                                       | 178/1044 [01:06<05:24,  2.67it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5405]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  17%|████████                                       | 179/1044 [01:06<05:19,  2.71it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  17%|████████                                       | 180/1044 [01:07<05:25,  2.65it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  17%|████████▏                                      | 181/1044 [01:07<05:15,  2.74it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5381]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  17%|████████▏                                      | 182/1044 [01:07<04:54,  2.93it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5422]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  18%|████████▏                                      | 183/1044 [01:08<04:55,  2.92it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  18%|████████▎                                      | 184/1044 [01:08<05:01,  2.85it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  18%|████████▎                                      | 185/1044 [01:08<05:10,  2.77it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 365, 35]) torch.Size([256, 365])


[Training LM]:  18%|████████▎                                      | 186/1044 [01:09<05:42,  2.51it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5383]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  18%|████████▍                                      | 187/1044 [01:09<05:26,  2.63it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5373]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  18%|████████▍                                      | 188/1044 [01:09<05:23,  2.64it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  18%|████████▌                                      | 189/1044 [01:10<05:28,  2.60it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5354]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  18%|████████▌                                      | 190/1044 [01:10<05:16,  2.70it/s, acc_step=1/1, ce_loss_token=1.7110, perplexity_token=5.5344]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  18%|████████▌                                      | 191/1044 [01:11<05:21,  2.66it/s, acc_step=1/1, ce_loss_token=1.7109, perplexity_token=5.5338]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  18%|████████▋                                      | 192/1044 [01:11<05:33,  2.55it/s, acc_step=1/1, ce_loss_token=1.7108, perplexity_token=5.5332]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  18%|████████▋                                      | 193/1044 [01:11<05:27,  2.60it/s, acc_step=1/1, ce_loss_token=1.7106, perplexity_token=5.5323]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  19%|████████▋                                      | 194/1044 [01:12<05:25,  2.61it/s, acc_step=1/1, ce_loss_token=1.7104, perplexity_token=5.5314]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  19%|████████▊                                      | 195/1044 [01:12<05:26,  2.60it/s, acc_step=1/1, ce_loss_token=1.7103, perplexity_token=5.5306]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|████████▊                                      | 196/1044 [01:13<05:23,  2.62it/s, acc_step=1/1, ce_loss_token=1.7101, perplexity_token=5.5295]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  19%|████████▊                                      | 197/1044 [01:13<05:10,  2.72it/s, acc_step=1/1, ce_loss_token=1.7105, perplexity_token=5.5319]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  19%|████████▉                                      | 198/1044 [01:13<05:15,  2.68it/s, acc_step=1/1, ce_loss_token=1.7104, perplexity_token=5.5310]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  19%|████████▉                                      | 199/1044 [01:14<05:18,  2.65it/s, acc_step=1/1, ce_loss_token=1.7102, perplexity_token=5.5302]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  19%|█████████                                      | 200/1044 [01:14<04:57,  2.84it/s, acc_step=1/1, ce_loss_token=1.7108, perplexity_token=5.5333]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  19%|█████████                                      | 201/1044 [01:14<04:52,  2.88it/s, acc_step=1/1, ce_loss_token=1.7106, perplexity_token=5.5325]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  19%|█████████                                      | 202/1044 [01:15<04:54,  2.86it/s, acc_step=1/1, ce_loss_token=1.7106, perplexity_token=5.5321]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  19%|█████████▏                                     | 203/1044 [01:15<04:54,  2.85it/s, acc_step=1/1, ce_loss_token=1.7104, perplexity_token=5.5313]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  20%|█████████▏                                     | 204/1044 [01:15<04:58,  2.82it/s, acc_step=1/1, ce_loss_token=1.7102, perplexity_token=5.5302]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  20%|█████████▏                                     | 205/1044 [01:16<05:08,  2.72it/s, acc_step=1/1, ce_loss_token=1.7101, perplexity_token=5.5296]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  20%|█████████▎                                     | 206/1044 [01:16<05:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.7100, perplexity_token=5.5288]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  20%|█████████▎                                     | 207/1044 [01:16<05:01,  2.78it/s, acc_step=1/1, ce_loss_token=1.7099, perplexity_token=5.5282]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  20%|█████████▎                                     | 208/1044 [01:17<04:58,  2.80it/s, acc_step=1/1, ce_loss_token=1.7098, perplexity_token=5.5276]

torch.Size([256, 372, 35]) torch.Size([256, 372])


[Training LM]:  20%|█████████▍                                     | 209/1044 [01:17<05:03,  2.75it/s, acc_step=1/1, ce_loss_token=1.7102, perplexity_token=5.5303]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  20%|█████████▍                                     | 210/1044 [01:18<04:59,  2.78it/s, acc_step=1/1, ce_loss_token=1.7100, perplexity_token=5.5289]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  20%|█████████▍                                     | 211/1044 [01:18<04:42,  2.95it/s, acc_step=1/1, ce_loss_token=1.7104, perplexity_token=5.5311]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  20%|█████████▌                                     | 212/1044 [01:18<04:43,  2.93it/s, acc_step=1/1, ce_loss_token=1.7102, perplexity_token=5.5303]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  20%|█████████▌                                     | 213/1044 [01:18<04:26,  3.12it/s, acc_step=1/1, ce_loss_token=1.7108, perplexity_token=5.5336]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  20%|█████████▋                                     | 214/1044 [01:19<04:19,  3.20it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5370]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  21%|█████████▋                                     | 215/1044 [01:19<04:51,  2.85it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5366]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  21%|█████████▋                                     | 216/1044 [01:19<04:34,  3.02it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  21%|█████████▊                                     | 217/1044 [01:20<04:37,  2.98it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5378]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  21%|█████████▊                                     | 218/1044 [01:20<04:19,  3.18it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5414]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  21%|█████████▊                                     | 219/1044 [01:20<04:33,  3.02it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5408]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  21%|█████████▉                                     | 220/1044 [01:21<04:20,  3.17it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5429]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  21%|█████████▉                                     | 221/1044 [01:21<04:09,  3.30it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  21%|█████████▉                                     | 222/1044 [01:21<04:21,  3.14it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  21%|██████████                                     | 223/1044 [01:22<04:16,  3.20it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5495]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  21%|██████████                                     | 224/1044 [01:22<04:28,  3.06it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  22%|██████████▏                                    | 225/1044 [01:22<04:43,  2.89it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5482]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  22%|██████████▏                                    | 226/1044 [01:23<04:50,  2.81it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5475]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  22%|██████████▏                                    | 227/1044 [01:23<04:47,  2.84it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5466]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  22%|██████████▎                                    | 228/1044 [01:23<04:47,  2.84it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  22%|██████████▎                                    | 229/1044 [01:24<05:08,  2.64it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  22%|██████████▎                                    | 230/1044 [01:24<04:43,  2.87it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  22%|██████████▍                                    | 231/1044 [01:24<04:29,  3.02it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  22%|██████████▍                                    | 232/1044 [01:25<04:32,  2.98it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5485]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  22%|██████████▍                                    | 233/1044 [01:25<04:24,  3.06it/s, acc_step=1/1, ce_loss_token=1.7142, perplexity_token=5.5520]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  22%|██████████▌                                    | 234/1044 [01:25<04:30,  3.00it/s, acc_step=1/1, ce_loss_token=1.7141, perplexity_token=5.5514]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  23%|██████████▌                                    | 235/1044 [01:26<04:35,  2.94it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5510]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  23%|██████████▌                                    | 236/1044 [01:26<04:42,  2.86it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5503]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  23%|██████████▋                                    | 237/1044 [01:27<04:44,  2.83it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▋                                    | 238/1044 [01:27<04:46,  2.81it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5490]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  23%|██████████▊                                    | 239/1044 [01:27<04:47,  2.80it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5482]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  23%|██████████▊                                    | 240/1044 [01:28<04:47,  2.80it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5477]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  23%|██████████▊                                    | 241/1044 [01:28<04:52,  2.74it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5469]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  23%|██████████▉                                    | 242/1044 [01:28<04:55,  2.71it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5464]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  23%|██████████▉                                    | 243/1044 [01:29<04:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  23%|██████████▉                                    | 244/1044 [01:29<05:09,  2.59it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5453]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  23%|███████████                                    | 245/1044 [01:30<05:02,  2.64it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5448]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  24%|███████████                                    | 246/1044 [01:30<05:03,  2.63it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  24%|███████████                                    | 247/1044 [01:30<04:34,  2.90it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5463]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  24%|███████████▏                                   | 248/1044 [01:31<04:36,  2.88it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5454]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  24%|███████████▏                                   | 249/1044 [01:31<04:43,  2.81it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5442]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  24%|███████████▎                                   | 250/1044 [01:31<04:50,  2.73it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  24%|███████████▎                                   | 251/1044 [01:32<04:58,  2.66it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5431]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  24%|███████████▎                                   | 252/1044 [01:32<05:09,  2.56it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  24%|███████████▍                                   | 253/1044 [01:33<04:56,  2.66it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  24%|███████████▍                                   | 254/1044 [01:33<05:06,  2.58it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  24%|███████████▍                                   | 255/1044 [01:33<04:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5408]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  25%|███████████▌                                   | 256/1044 [01:34<04:34,  2.87it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5425]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  25%|███████████▌                                   | 257/1044 [01:34<04:14,  3.10it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5441]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▌                                   | 258/1044 [01:34<04:26,  2.95it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  25%|███████████▋                                   | 259/1044 [01:35<04:34,  2.86it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5427]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  25%|███████████▋                                   | 260/1044 [01:35<04:39,  2.80it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5417]

torch.Size([256, 402, 35]) torch.Size([256, 402])


[Training LM]:  25%|███████████▊                                   | 261/1044 [01:35<04:55,  2.65it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 353, 35]) torch.Size([256, 353])


[Training LM]:  25%|███████████▊                                   | 262/1044 [01:36<05:16,  2.47it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  25%|███████████▊                                   | 263/1044 [01:36<05:05,  2.56it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5427]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  25%|███████████▉                                   | 264/1044 [01:37<04:57,  2.62it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5419]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  25%|███████████▉                                   | 265/1044 [01:37<04:57,  2.62it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5414]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  25%|███████████▉                                   | 266/1044 [01:37<04:54,  2.64it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  26%|████████████                                   | 267/1044 [01:38<04:53,  2.65it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  26%|████████████                                   | 268/1044 [01:38<04:52,  2.65it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  26%|████████████                                   | 269/1044 [01:38<04:47,  2.69it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  26%|████████████▏                                  | 270/1044 [01:39<05:10,  2.49it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5392]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  26%|████████████▏                                  | 271/1044 [01:39<04:46,  2.69it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5415]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  26%|████████████▏                                  | 272/1044 [01:40<04:42,  2.73it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5409]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  26%|████████████▎                                  | 273/1044 [01:40<04:55,  2.61it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  26%|████████████▎                                  | 274/1044 [01:40<04:45,  2.70it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5398]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  26%|████████████▍                                  | 275/1044 [01:41<04:43,  2.71it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5392]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  26%|████████████▍                                  | 276/1044 [01:41<04:47,  2.67it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  27%|████████████▍                                  | 277/1044 [01:41<04:49,  2.65it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  27%|████████████▌                                  | 278/1044 [01:42<04:49,  2.65it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5372]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  27%|████████████▌                                  | 279/1044 [01:42<04:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  27%|████████████▌                                  | 280/1044 [01:42<04:29,  2.84it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5381]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  27%|████████████▋                                  | 281/1044 [01:43<04:33,  2.79it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  27%|████████████▋                                  | 282/1044 [01:43<04:45,  2.67it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  27%|████████████▋                                  | 283/1044 [01:44<04:26,  2.86it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5387]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  27%|████████████▊                                  | 284/1044 [01:44<04:38,  2.73it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  27%|████████████▊                                  | 285/1044 [01:44<04:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5377]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  27%|████████████▉                                  | 286/1044 [01:45<04:15,  2.97it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5390]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  27%|████████████▉                                  | 287/1044 [01:45<04:22,  2.89it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  28%|████████████▉                                  | 288/1044 [01:45<04:32,  2.77it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5377]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  28%|█████████████                                  | 289/1044 [01:46<04:38,  2.71it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5373]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  28%|█████████████                                  | 290/1044 [01:46<05:01,  2.50it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5368]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  28%|█████████████                                  | 291/1044 [01:47<04:54,  2.56it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5360]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  28%|█████████████▏                                 | 292/1044 [01:47<04:31,  2.77it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5387]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  28%|█████████████▏                                 | 293/1044 [01:47<04:36,  2.71it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5382]

torch.Size([256, 421, 35]) torch.Size([256, 421])


[Training LM]:  28%|█████████████▏                                 | 294/1044 [01:48<05:30,  2.27it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  28%|█████████████▎                                 | 295/1044 [01:48<05:30,  2.27it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  28%|█████████████▎                                 | 296/1044 [01:49<05:15,  2.37it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5365]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  28%|█████████████▎                                 | 297/1044 [01:49<04:50,  2.57it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5383]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  29%|█████████████▍                                 | 298/1044 [01:49<04:28,  2.78it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 378, 35]) torch.Size([256, 378])


[Training LM]:  29%|█████████████▍                                 | 299/1044 [01:50<04:57,  2.50it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  29%|█████████████▌                                 | 300/1044 [01:50<04:48,  2.58it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  29%|█████████████▌                                 | 301/1044 [01:51<04:40,  2.65it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  29%|█████████████▌                                 | 302/1044 [01:51<04:35,  2.69it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5379]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  29%|█████████████▋                                 | 303/1044 [01:51<04:33,  2.70it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  29%|█████████████▋                                 | 304/1044 [01:52<04:35,  2.69it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5369]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  29%|█████████████▋                                 | 305/1044 [01:52<04:11,  2.94it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  29%|█████████████▊                                 | 306/1044 [01:52<04:12,  2.93it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  29%|█████████████▊                                 | 307/1044 [01:53<04:01,  3.05it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5414]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  30%|█████████████▊                                 | 308/1044 [01:53<04:05,  3.00it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5411]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  30%|█████████████▉                                 | 309/1044 [01:53<04:14,  2.89it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  30%|█████████████▉                                 | 310/1044 [01:54<04:12,  2.90it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  30%|██████████████                                 | 311/1044 [01:54<04:18,  2.83it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5395]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  30%|██████████████                                 | 312/1044 [01:54<04:21,  2.79it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5390]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  30%|██████████████                                 | 313/1044 [01:55<04:37,  2.63it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  30%|██████████████▏                                | 314/1044 [01:55<04:37,  2.63it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  30%|██████████████▏                                | 315/1044 [01:56<04:40,  2.60it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5375]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  30%|██████████████▏                                | 316/1044 [01:56<04:38,  2.62it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  30%|██████████████▎                                | 317/1044 [01:56<04:36,  2.63it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  30%|██████████████▎                                | 318/1044 [01:57<04:26,  2.72it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5363]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  31%|██████████████▎                                | 319/1044 [01:57<04:31,  2.67it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5355]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  31%|██████████████▍                                | 320/1044 [01:57<04:07,  2.92it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5369]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  31%|██████████████▍                                | 321/1044 [01:58<04:08,  2.91it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5364]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  31%|██████████████▍                                | 322/1044 [01:58<04:11,  2.87it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5360]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  31%|██████████████▌                                | 323/1044 [01:58<04:03,  2.96it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  31%|██████████████▌                                | 324/1044 [01:59<04:00,  2.99it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5372]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  31%|██████████████▋                                | 325/1044 [01:59<03:53,  3.08it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5395]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  31%|██████████████▋                                | 326/1044 [01:59<03:59,  2.99it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5392]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  31%|██████████████▋                                | 327/1044 [02:00<04:11,  2.85it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  31%|██████████████▊                                | 328/1044 [02:00<04:06,  2.91it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  32%|██████████████▊                                | 329/1044 [02:00<04:16,  2.79it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  32%|██████████████▊                                | 330/1044 [02:01<04:24,  2.70it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  32%|██████████████▉                                | 331/1044 [02:01<04:29,  2.65it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  32%|██████████████▉                                | 332/1044 [02:02<04:23,  2.71it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  32%|██████████████▉                                | 333/1044 [02:02<04:03,  2.92it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  32%|███████████████                                | 334/1044 [02:02<04:12,  2.81it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  32%|███████████████                                | 335/1044 [02:03<04:16,  2.76it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  32%|███████████████▏                               | 336/1044 [02:03<04:14,  2.78it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  32%|███████████████▏                               | 337/1044 [02:03<04:11,  2.81it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  32%|███████████████▏                               | 338/1044 [02:04<04:25,  2.65it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5386]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  32%|███████████████▎                               | 339/1044 [02:04<04:33,  2.57it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5382]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  33%|███████████████▎                               | 340/1044 [02:04<04:27,  2.63it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5378]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  33%|███████████████▎                               | 341/1044 [02:05<04:18,  2.72it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 364, 35]) torch.Size([256, 364])


[Training LM]:  33%|███████████████▍                               | 342/1044 [02:05<04:41,  2.50it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5372]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  33%|███████████████▍                               | 343/1044 [02:06<04:35,  2.54it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  33%|███████████████▍                               | 344/1044 [02:06<04:27,  2.61it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5363]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  33%|███████████████▌                               | 345/1044 [02:06<04:27,  2.61it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5358]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  33%|███████████████▌                               | 346/1044 [02:07<04:09,  2.80it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  33%|███████████████▌                               | 347/1044 [02:07<03:54,  2.97it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  33%|███████████████▋                               | 348/1044 [02:07<03:49,  3.03it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5416]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  33%|███████████████▋                               | 349/1044 [02:08<03:49,  3.02it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  34%|███████████████▊                               | 350/1044 [02:08<04:18,  2.68it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5408]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  34%|███████████████▊                               | 351/1044 [02:09<04:24,  2.62it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  34%|███████████████▊                               | 352/1044 [02:09<04:38,  2.49it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  34%|███████████████▉                               | 353/1044 [02:09<04:31,  2.54it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  34%|███████████████▉                               | 354/1044 [02:10<04:27,  2.58it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  34%|███████████████▉                               | 355/1044 [02:10<04:03,  2.83it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  34%|████████████████                               | 356/1044 [02:10<04:17,  2.67it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5402]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  34%|████████████████                               | 357/1044 [02:11<04:22,  2.62it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  34%|████████████████                               | 358/1044 [02:11<04:03,  2.82it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  34%|████████████████▏                              | 359/1044 [02:11<04:00,  2.85it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5414]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  34%|████████████████▏                              | 360/1044 [02:12<03:59,  2.86it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  35%|████████████████▎                              | 361/1044 [02:12<04:00,  2.84it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  35%|████████████████▎                              | 362/1044 [02:13<04:06,  2.76it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  35%|████████████████▎                              | 363/1044 [02:13<04:19,  2.62it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  35%|████████████████▍                              | 364/1044 [02:13<04:11,  2.70it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5392]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  35%|████████████████▍                              | 365/1044 [02:14<04:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5386]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  35%|████████████████▍                              | 366/1044 [02:14<04:15,  2.65it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  35%|████████████████▌                              | 367/1044 [02:14<04:08,  2.73it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  35%|████████████████▌                              | 368/1044 [02:15<04:07,  2.73it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  35%|████████████████▌                              | 369/1044 [02:15<04:02,  2.78it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5368]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  35%|████████████████▋                              | 370/1044 [02:15<03:51,  2.91it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  36%|████████████████▋                              | 371/1044 [02:16<03:57,  2.83it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  36%|████████████████▋                              | 372/1044 [02:16<04:00,  2.80it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5379]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  36%|████████████████▊                              | 373/1044 [02:17<04:02,  2.76it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 342, 35]) torch.Size([256, 342])


[Training LM]:  36%|████████████████▊                              | 374/1044 [02:17<04:17,  2.60it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5370]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  36%|████████████████▉                              | 375/1044 [02:17<04:25,  2.52it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5368]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  36%|████████████████▉                              | 376/1044 [02:18<04:20,  2.57it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5365]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  36%|████████████████▉                              | 377/1044 [02:18<04:18,  2.58it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5361]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  36%|█████████████████                              | 378/1044 [02:19<04:29,  2.47it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5357]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  36%|█████████████████                              | 379/1044 [02:19<04:31,  2.45it/s, acc_step=1/1, ce_loss_token=1.7111, perplexity_token=5.5352]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  36%|█████████████████                              | 380/1044 [02:19<04:17,  2.58it/s, acc_step=1/1, ce_loss_token=1.7111, perplexity_token=5.5349]

torch.Size([256, 344, 35]) torch.Size([256, 344])


[Training LM]:  36%|█████████████████▏                             | 381/1044 [02:20<04:27,  2.48it/s, acc_step=1/1, ce_loss_token=1.7110, perplexity_token=5.5343]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  37%|█████████████████▏                             | 382/1044 [02:20<04:26,  2.48it/s, acc_step=1/1, ce_loss_token=1.7109, perplexity_token=5.5338]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  37%|█████████████████▏                             | 383/1044 [02:21<04:13,  2.60it/s, acc_step=1/1, ce_loss_token=1.7108, perplexity_token=5.5335]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  37%|█████████████████▎                             | 384/1044 [02:21<04:12,  2.62it/s, acc_step=1/1, ce_loss_token=1.7107, perplexity_token=5.5331]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  37%|█████████████████▎                             | 385/1044 [02:21<04:16,  2.57it/s, acc_step=1/1, ce_loss_token=1.7107, perplexity_token=5.5327]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  37%|█████████████████▍                             | 386/1044 [02:22<04:10,  2.63it/s, acc_step=1/1, ce_loss_token=1.7106, perplexity_token=5.5323]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  37%|█████████████████▍                             | 387/1044 [02:22<03:45,  2.91it/s, acc_step=1/1, ce_loss_token=1.7108, perplexity_token=5.5336]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  37%|█████████████████▍                             | 388/1044 [02:22<03:55,  2.79it/s, acc_step=1/1, ce_loss_token=1.7107, perplexity_token=5.5331]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  37%|█████████████████▌                             | 389/1044 [02:23<03:39,  2.99it/s, acc_step=1/1, ce_loss_token=1.7110, perplexity_token=5.5342]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  37%|█████████████████▌                             | 390/1044 [02:23<03:23,  3.22it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  37%|█████████████████▌                             | 391/1044 [02:23<03:40,  2.97it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5358]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  38%|█████████████████▋                             | 392/1044 [02:24<03:45,  2.90it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5355]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  38%|█████████████████▋                             | 393/1044 [02:24<03:30,  3.09it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  38%|█████████████████▋                             | 394/1044 [02:24<03:31,  3.07it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5370]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  38%|█████████████████▊                             | 395/1044 [02:25<03:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  38%|█████████████████▊                             | 396/1044 [02:25<03:36,  3.00it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5379]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  38%|█████████████████▊                             | 397/1044 [02:25<03:34,  3.01it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  38%|█████████████████▉                             | 398/1044 [02:26<03:25,  3.15it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  38%|█████████████████▉                             | 399/1044 [02:26<03:32,  3.03it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5387]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  38%|██████████████████                             | 400/1044 [02:26<03:36,  2.98it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  38%|██████████████████                             | 401/1044 [02:27<03:44,  2.86it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5381]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  39%|██████████████████                             | 402/1044 [02:27<03:47,  2.82it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5379]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  39%|██████████████████▏                            | 403/1044 [02:27<03:48,  2.80it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  39%|██████████████████▏                            | 404/1044 [02:28<03:50,  2.78it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  39%|██████████████████▎                            | 406/1044 [02:28<03:21,  3.17it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 304, 35]) torch.Size([256, 304])
torch.Size([256, 399, 35]) torch.Size([256, 399])


[Training LM]:  39%|██████████████████▎                            | 407/1044 [02:29<04:06,  2.58it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5414]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  39%|██████████████████▎                            | 408/1044 [02:29<04:00,  2.65it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5427]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  39%|██████████████████▍                            | 409/1044 [02:30<04:04,  2.60it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5422]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  39%|██████████████████▍                            | 410/1044 [02:30<04:06,  2.57it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5420]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  39%|██████████████████▌                            | 411/1044 [02:30<04:01,  2.62it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5414]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  39%|██████████████████▌                            | 412/1044 [02:31<03:46,  2.78it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  40%|██████████████████▌                            | 413/1044 [02:31<03:57,  2.66it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5424]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  40%|██████████████████▋                            | 414/1044 [02:32<03:58,  2.65it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5420]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  40%|██████████████████▋                            | 415/1044 [02:32<03:56,  2.66it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5416]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  40%|██████████████████▋                            | 416/1044 [02:32<03:52,  2.70it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  40%|██████████████████▊                            | 417/1044 [02:33<04:00,  2.61it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  40%|██████████████████▊                            | 418/1044 [02:33<03:59,  2.61it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  40%|██████████████████▊                            | 419/1044 [02:33<04:05,  2.55it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  40%|██████████████████▉                            | 420/1044 [02:34<03:55,  2.65it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  40%|██████████████████▉                            | 421/1044 [02:34<03:54,  2.65it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  40%|██████████████████▉                            | 422/1044 [02:35<03:55,  2.65it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5392]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  41%|███████████████████                            | 423/1044 [02:35<04:00,  2.59it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  41%|███████████████████                            | 424/1044 [02:35<04:04,  2.54it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  41%|███████████████████▏                           | 425/1044 [02:36<03:59,  2.59it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5381]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  41%|███████████████████▏                           | 426/1044 [02:36<03:43,  2.77it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5398]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  41%|███████████████████▏                           | 427/1044 [02:36<03:43,  2.76it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  41%|███████████████████▎                           | 428/1044 [02:37<03:41,  2.78it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  41%|███████████████████▎                           | 429/1044 [02:37<03:57,  2.58it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  41%|███████████████████▎                           | 430/1044 [02:37<03:35,  2.85it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  41%|███████████████████▍                           | 431/1044 [02:38<03:40,  2.78it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  41%|███████████████████▍                           | 432/1044 [02:38<03:38,  2.81it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  41%|███████████████████▍                           | 433/1044 [02:38<03:25,  2.97it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  42%|███████████████████▌                           | 434/1044 [02:39<03:25,  2.97it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5398]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  42%|███████████████████▌                           | 435/1044 [02:39<03:27,  2.94it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  42%|███████████████████▋                           | 436/1044 [02:40<03:27,  2.93it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  42%|███████████████████▋                           | 437/1044 [02:40<03:21,  3.02it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  42%|███████████████████▋                           | 438/1044 [02:40<03:30,  2.88it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  42%|███████████████████▊                           | 439/1044 [02:41<03:31,  2.86it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5392]

torch.Size([256, 397, 35]) torch.Size([256, 397])


[Training LM]:  42%|███████████████████▊                           | 440/1044 [02:41<04:08,  2.43it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5387]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  42%|███████████████████▊                           | 441/1044 [02:42<04:02,  2.48it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  42%|███████████████████▉                           | 442/1044 [02:42<03:58,  2.53it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  42%|███████████████████▉                           | 443/1044 [02:42<03:54,  2.57it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5378]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  43%|███████████████████▉                           | 444/1044 [02:43<03:56,  2.54it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  43%|████████████████████                           | 445/1044 [02:43<03:37,  2.76it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5383]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████                           | 446/1044 [02:43<03:36,  2.76it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5381]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  43%|████████████████████                           | 447/1044 [02:44<03:22,  2.95it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  43%|████████████████████▏                          | 448/1044 [02:44<03:37,  2.74it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5390]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  43%|████████████████████▏                          | 449/1044 [02:44<03:32,  2.81it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5402]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  43%|████████████████████▎                          | 450/1044 [02:45<03:32,  2.79it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  43%|████████████████████▎                          | 451/1044 [02:45<03:41,  2.68it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  43%|████████████████████▎                          | 452/1044 [02:45<03:32,  2.79it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  43%|████████████████████▍                          | 453/1044 [02:46<03:32,  2.78it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  43%|████████████████████▍                          | 454/1044 [02:46<03:34,  2.75it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5382]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  44%|████████████████████▍                          | 455/1044 [02:46<03:20,  2.94it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5392]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  44%|████████████████████▌                          | 456/1044 [02:47<03:10,  3.09it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5408]

torch.Size([256, 343, 35]) torch.Size([256, 343])


[Training LM]:  44%|████████████████████▌                          | 457/1044 [02:47<03:30,  2.79it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5404]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  44%|████████████████████▌                          | 458/1044 [02:48<03:28,  2.82it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  44%|████████████████████▋                          | 459/1044 [02:48<03:35,  2.71it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5395]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  44%|████████████████████▋                          | 460/1044 [02:48<03:45,  2.59it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5390]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  44%|████████████████████▊                          | 461/1044 [02:49<03:44,  2.59it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5386]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  44%|████████████████████▊                          | 462/1044 [02:49<03:42,  2.62it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5383]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  44%|████████████████████▊                          | 463/1044 [02:49<03:27,  2.80it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  44%|████████████████████▉                          | 464/1044 [02:50<03:33,  2.71it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  45%|████████████████████▉                          | 465/1044 [02:50<03:34,  2.70it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  45%|████████████████████▉                          | 466/1044 [02:51<03:21,  2.87it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  45%|█████████████████████                          | 467/1044 [02:51<03:31,  2.72it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  45%|█████████████████████                          | 468/1044 [02:51<03:39,  2.62it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5395]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  45%|█████████████████████                          | 469/1044 [02:52<03:44,  2.56it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5392]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  45%|█████████████████████▏                         | 470/1044 [02:52<03:40,  2.61it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  45%|█████████████████████▏                         | 471/1044 [02:52<03:36,  2.65it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  45%|█████████████████████▏                         | 472/1044 [02:53<03:23,  2.81it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  45%|█████████████████████▎                         | 473/1044 [02:53<03:30,  2.71it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  45%|█████████████████████▎                         | 474/1044 [02:54<03:28,  2.73it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5386]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  45%|█████████████████████▍                         | 475/1044 [02:54<03:16,  2.90it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  46%|█████████████████████▍                         | 476/1044 [02:54<03:04,  3.08it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  46%|█████████████████████▌                         | 478/1044 [02:55<02:44,  3.44it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 309, 35]) torch.Size([256, 309])
torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▌                         | 479/1044 [02:55<02:56,  3.20it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  46%|█████████████████████▌                         | 480/1044 [02:55<03:16,  2.87it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  46%|█████████████████████▋                         | 481/1044 [02:56<03:07,  3.01it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5451]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  46%|█████████████████████▋                         | 482/1044 [02:56<03:14,  2.88it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5449]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  46%|█████████████████████▋                         | 483/1044 [02:56<03:20,  2.80it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  46%|█████████████████████▊                         | 484/1044 [02:57<03:08,  2.97it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  47%|█████████████████████▉                         | 486/1044 [02:57<02:48,  3.32it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5492]

torch.Size([256, 302, 35]) torch.Size([256, 302])
torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  47%|█████████████████████▉                         | 487/1044 [02:58<02:58,  3.11it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5488]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  47%|█████████████████████▉                         | 488/1044 [02:58<02:49,  3.28it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5498]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  47%|██████████████████████                         | 489/1044 [02:58<02:53,  3.19it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  47%|██████████████████████                         | 490/1044 [02:59<03:05,  2.99it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  47%|██████████████████████                         | 491/1044 [02:59<03:16,  2.82it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5488]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  47%|██████████████████████▏                        | 492/1044 [02:59<03:04,  2.99it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5498]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  47%|██████████████████████▏                        | 493/1044 [03:00<03:12,  2.86it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  47%|██████████████████████▏                        | 494/1044 [03:00<03:17,  2.79it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5492]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  47%|██████████████████████▎                        | 495/1044 [03:01<03:18,  2.77it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  48%|██████████████████████▎                        | 496/1044 [03:01<03:16,  2.78it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5486]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  48%|██████████████████████▎                        | 497/1044 [03:01<03:21,  2.71it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5482]

torch.Size([256, 350, 35]) torch.Size([256, 350])


[Training LM]:  48%|██████████████████████▍                        | 498/1044 [03:02<03:17,  2.76it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  48%|██████████████████████▍                        | 499/1044 [03:02<03:24,  2.66it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5485]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  48%|██████████████████████▌                        | 500/1044 [03:02<03:24,  2.67it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5482]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  48%|██████████████████████▌                        | 501/1044 [03:03<03:20,  2.71it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5479]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  48%|██████████████████████▌                        | 502/1044 [03:03<03:14,  2.79it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5486]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  48%|██████████████████████▋                        | 503/1044 [03:03<02:58,  3.02it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5495]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  48%|██████████████████████▋                        | 504/1044 [03:04<03:02,  2.96it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 370, 35]) torch.Size([256, 370])


[Training LM]:  48%|██████████████████████▋                        | 505/1044 [03:04<03:26,  2.61it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  48%|██████████████████████▊                        | 506/1044 [03:05<03:26,  2.61it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5482]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  49%|██████████████████████▊                        | 507/1044 [03:05<03:24,  2.62it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5479]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  49%|██████████████████████▊                        | 508/1044 [03:05<03:24,  2.62it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5475]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  49%|██████████████████████▉                        | 509/1044 [03:06<03:21,  2.65it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  49%|██████████████████████▉                        | 510/1044 [03:06<03:27,  2.58it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  49%|███████████████████████                        | 511/1044 [03:06<03:24,  2.61it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  49%|███████████████████████                        | 512/1044 [03:07<03:21,  2.63it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  49%|███████████████████████                        | 513/1044 [03:07<03:21,  2.64it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  49%|███████████████████████▏                       | 514/1044 [03:08<03:23,  2.61it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5452]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  49%|███████████████████████▏                       | 515/1044 [03:08<03:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  49%|███████████████████████▏                       | 516/1044 [03:08<03:26,  2.56it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  50%|███████████████████████▎                       | 517/1044 [03:09<03:08,  2.79it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  50%|███████████████████████▎                       | 518/1044 [03:09<03:14,  2.71it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▎                       | 519/1044 [03:09<03:16,  2.68it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  50%|███████████████████████▍                       | 520/1044 [03:10<03:12,  2.73it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  50%|███████████████████████▍                       | 521/1044 [03:10<03:16,  2.66it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  50%|███████████████████████▌                       | 522/1044 [03:11<03:17,  2.64it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  50%|███████████████████████▌                       | 523/1044 [03:11<03:11,  2.72it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5469]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  50%|███████████████████████▌                       | 524/1044 [03:11<03:13,  2.69it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  50%|███████████████████████▋                       | 525/1044 [03:12<03:14,  2.66it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  50%|███████████████████████▋                       | 526/1044 [03:12<03:00,  2.86it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5472]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  50%|███████████████████████▋                       | 527/1044 [03:12<03:03,  2.82it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  51%|███████████████████████▊                       | 528/1044 [03:13<03:10,  2.71it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  51%|███████████████████████▊                       | 529/1044 [03:13<03:09,  2.72it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5461]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  51%|███████████████████████▊                       | 530/1044 [03:13<02:52,  2.99it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5470]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  51%|███████████████████████▉                       | 531/1044 [03:14<02:54,  2.94it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5467]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  51%|███████████████████████▉                       | 532/1044 [03:14<03:00,  2.84it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5464]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  51%|███████████████████████▉                       | 533/1044 [03:14<03:03,  2.79it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  51%|████████████████████████                       | 534/1044 [03:15<03:29,  2.43it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  51%|████████████████████████                       | 535/1044 [03:15<03:23,  2.50it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5453]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  51%|████████████████████████▏                      | 536/1044 [03:16<03:23,  2.50it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  51%|████████████████████████▏                      | 537/1044 [03:16<03:17,  2.56it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5448]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  52%|████████████████████████▏                      | 538/1044 [03:17<03:15,  2.59it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  52%|████████████████████████▎                      | 539/1044 [03:17<03:21,  2.50it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  52%|████████████████████████▎                      | 540/1044 [03:17<03:24,  2.46it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  52%|████████████████████████▎                      | 541/1044 [03:18<03:21,  2.50it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  52%|████████████████████████▍                      | 542/1044 [03:18<03:16,  2.56it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5429]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  52%|████████████████████████▍                      | 543/1044 [03:19<03:12,  2.61it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5425]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  52%|████████████████████████▍                      | 544/1044 [03:19<03:04,  2.70it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  52%|████████████████████████▌                      | 545/1044 [03:19<03:06,  2.67it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  52%|████████████████████████▌                      | 546/1044 [03:20<03:13,  2.58it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  52%|████████████████████████▋                      | 547/1044 [03:20<03:14,  2.55it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5429]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  52%|████████████████████████▋                      | 548/1044 [03:20<03:18,  2.50it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5425]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|████████████████████████▋                      | 549/1044 [03:21<03:12,  2.57it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5421]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  53%|████████████████████████▊                      | 550/1044 [03:21<03:05,  2.67it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  53%|████████████████████████▊                      | 551/1044 [03:22<02:59,  2.75it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5416]

torch.Size([256, 352, 35]) torch.Size([256, 352])


[Training LM]:  53%|████████████████████████▊                      | 552/1044 [03:22<03:12,  2.56it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  53%|████████████████████████▉                      | 553/1044 [03:22<03:11,  2.57it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  53%|████████████████████████▉                      | 554/1044 [03:23<03:06,  2.62it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  53%|████████████████████████▉                      | 555/1044 [03:23<03:01,  2.70it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5404]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  53%|█████████████████████████                      | 556/1044 [03:23<02:59,  2.73it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  53%|█████████████████████████                      | 557/1044 [03:24<02:58,  2.73it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  53%|█████████████████████████                      | 558/1044 [03:24<03:02,  2.67it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  54%|█████████████████████████▏                     | 559/1044 [03:25<02:57,  2.74it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  54%|█████████████████████████▏                     | 560/1044 [03:25<02:55,  2.76it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5388]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  54%|█████████████████████████▎                     | 561/1044 [03:25<03:00,  2.68it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  54%|█████████████████████████▎                     | 562/1044 [03:26<02:48,  2.87it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  54%|█████████████████████████▎                     | 563/1044 [03:26<02:56,  2.73it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  54%|█████████████████████████▍                     | 564/1044 [03:26<02:59,  2.67it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5386]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  54%|█████████████████████████▍                     | 565/1044 [03:27<02:57,  2.70it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5384]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  54%|█████████████████████████▍                     | 566/1044 [03:27<02:57,  2.69it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5379]

torch.Size([256, 404, 35]) torch.Size([256, 404])


[Training LM]:  54%|█████████████████████████▌                     | 567/1044 [03:28<03:23,  2.34it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5378]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  54%|█████████████████████████▌                     | 568/1044 [03:28<03:13,  2.46it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5376]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  55%|█████████████████████████▌                     | 569/1044 [03:28<03:08,  2.52it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5373]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  55%|█████████████████████████▋                     | 570/1044 [03:29<02:53,  2.73it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 362, 35]) torch.Size([256, 362])


[Training LM]:  55%|█████████████████████████▋                     | 571/1044 [03:29<02:52,  2.74it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5394]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  55%|█████████████████████████▊                     | 572/1044 [03:29<02:38,  2.98it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5402]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  55%|█████████████████████████▊                     | 573/1044 [03:30<02:44,  2.86it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  55%|█████████████████████████▊                     | 574/1044 [03:30<02:34,  3.03it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  55%|█████████████████████████▉                     | 575/1044 [03:30<02:39,  2.95it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5411]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  55%|█████████████████████████▉                     | 576/1044 [03:31<02:46,  2.81it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  55%|█████████████████████████▉                     | 577/1044 [03:31<02:55,  2.66it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 441, 35]) torch.Size([256, 441])


[Training LM]:  55%|██████████████████████████                     | 578/1044 [03:32<03:13,  2.41it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5411]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  55%|██████████████████████████                     | 579/1044 [03:32<03:06,  2.49it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  56%|██████████████████████████                     | 580/1044 [03:32<03:08,  2.47it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5402]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  56%|██████████████████████████▏                    | 581/1044 [03:33<03:01,  2.56it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████▏                    | 582/1044 [03:33<02:58,  2.60it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  56%|██████████████████████████▏                    | 583/1044 [03:34<02:54,  2.65it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  56%|██████████████████████████▎                    | 584/1044 [03:34<02:41,  2.85it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  56%|██████████████████████████▎                    | 585/1044 [03:34<02:46,  2.76it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  56%|██████████████████████████▍                    | 586/1044 [03:35<02:43,  2.81it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5392]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  56%|██████████████████████████▍                    | 587/1044 [03:35<02:34,  2.96it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5405]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  56%|██████████████████████████▍                    | 588/1044 [03:35<02:36,  2.91it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5402]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  56%|██████████████████████████▌                    | 589/1044 [03:36<02:40,  2.84it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  57%|██████████████████████████▌                    | 590/1044 [03:36<02:41,  2.82it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5396]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  57%|██████████████████████████▌                    | 591/1044 [03:36<02:42,  2.78it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5393]

torch.Size([256, 348, 35]) torch.Size([256, 348])


[Training LM]:  57%|██████████████████████████▋                    | 592/1044 [03:37<02:54,  2.59it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5391]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  57%|██████████████████████████▋                    | 593/1044 [03:37<02:53,  2.60it/s, acc_step=1/1, ce_loss_token=1.7118, perplexity_token=5.5389]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  57%|██████████████████████████▋                    | 594/1044 [03:37<02:45,  2.71it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5385]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  57%|██████████████████████████▊                    | 595/1044 [03:38<02:48,  2.67it/s, acc_step=1/1, ce_loss_token=1.7117, perplexity_token=5.5383]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  57%|██████████████████████████▊                    | 596/1044 [03:38<02:44,  2.72it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  57%|██████████████████████████▉                    | 597/1044 [03:39<02:44,  2.72it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5377]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  57%|██████████████████████████▉                    | 598/1044 [03:39<03:08,  2.37it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  57%|██████████████████████████▉                    | 599/1044 [03:40<03:08,  2.36it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5372]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  57%|███████████████████████████                    | 600/1044 [03:40<02:52,  2.57it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5380]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  58%|███████████████████████████                    | 601/1044 [03:40<02:48,  2.63it/s, acc_step=1/1, ce_loss_token=1.7116, perplexity_token=5.5377]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  58%|███████████████████████████                    | 602/1044 [03:41<02:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5375]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  58%|███████████████████████████▏                   | 603/1044 [03:41<02:44,  2.68it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5372]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  58%|███████████████████████████▏                   | 604/1044 [03:41<02:44,  2.67it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5369]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  58%|███████████████████████████▏                   | 605/1044 [03:42<02:46,  2.64it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5366]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  58%|███████████████████████████▎                   | 606/1044 [03:42<02:44,  2.66it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  58%|███████████████████████████▎                   | 607/1044 [03:42<02:32,  2.86it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  58%|███████████████████████████▎                   | 608/1044 [03:43<02:38,  2.74it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  58%|███████████████████████████▍                   | 609/1044 [03:43<02:39,  2.73it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5364]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  58%|███████████████████████████▍                   | 610/1044 [03:44<02:42,  2.67it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  59%|███████████████████████████▌                   | 611/1044 [03:44<02:40,  2.70it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5359]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|███████████████████████████▌                   | 612/1044 [03:44<02:30,  2.88it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5366]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  59%|███████████████████████████▌                   | 613/1044 [03:44<02:18,  3.10it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5374]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  59%|███████████████████████████▋                   | 614/1044 [03:45<02:25,  2.96it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 274, 35]) torch.Size([256, 274])


[Training LM]:  59%|███████████████████████████▋                   | 615/1044 [03:45<02:22,  3.00it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5369]

torch.Size([256, 339, 35]) torch.Size([256, 339])


[Training LM]:  59%|███████████████████████████▋                   | 616/1044 [03:46<02:36,  2.74it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  59%|███████████████████████████▊                   | 617/1044 [03:46<02:39,  2.68it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5365]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  59%|███████████████████████████▊                   | 618/1044 [03:46<02:34,  2.76it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  59%|███████████████████████████▊                   | 619/1044 [03:47<02:31,  2.80it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5359]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  59%|███████████████████████████▉                   | 620/1044 [03:47<02:37,  2.69it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5357]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  59%|███████████████████████████▉                   | 621/1044 [03:47<02:40,  2.63it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5354]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  60%|████████████████████████████                   | 622/1044 [03:48<02:28,  2.83it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5365]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  60%|████████████████████████████                   | 623/1044 [03:48<02:31,  2.78it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5361]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  60%|████████████████████████████                   | 624/1044 [03:49<02:34,  2.72it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5359]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  60%|████████████████████████████▏                  | 625/1044 [03:49<02:23,  2.91it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5366]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  60%|████████████████████████████▏                  | 626/1044 [03:49<02:25,  2.88it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5363]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  60%|████████████████████████████▏                  | 627/1044 [03:50<02:29,  2.79it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5359]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  60%|████████████████████████████▎                  | 628/1044 [03:50<02:22,  2.92it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5367]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  60%|████████████████████████████▎                  | 629/1044 [03:50<02:24,  2.87it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5364]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|████████████████████████████▎                  | 630/1044 [03:51<02:15,  3.05it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5371]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  60%|████████████████████████████▍                  | 631/1044 [03:51<02:19,  2.97it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5369]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  61%|████████████████████████████▍                  | 632/1044 [03:51<02:26,  2.81it/s, acc_step=1/1, ce_loss_token=1.7114, perplexity_token=5.5366]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  61%|████████████████████████████▍                  | 633/1044 [03:52<02:27,  2.78it/s, acc_step=1/1, ce_loss_token=1.7113, perplexity_token=5.5362]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  61%|████████████████████████████▌                  | 634/1044 [03:52<02:26,  2.79it/s, acc_step=1/1, ce_loss_token=1.7112, perplexity_token=5.5358]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  61%|████████████████████████████▌                  | 635/1044 [03:52<02:17,  2.97it/s, acc_step=1/1, ce_loss_token=1.7115, perplexity_token=5.5370]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  61%|████████████████████████████▋                  | 637/1044 [03:53<01:54,  3.57it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 339, 35]) torch.Size([256, 339])
torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  61%|████████████████████████████▋                  | 638/1044 [03:53<02:20,  2.90it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  61%|████████████████████████████▊                  | 639/1044 [03:54<02:23,  2.82it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 284, 35]) torch.Size([256, 284])


[Training LM]:  61%|████████████████████████████▊                  | 641/1044 [03:54<02:03,  3.25it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5466]

torch.Size([256, 307, 35]) torch.Size([256, 307])
torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  61%|████████████████████████████▉                  | 642/1044 [03:54<01:57,  3.42it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5473]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  62%|████████████████████████████▉                  | 643/1044 [03:55<02:10,  3.06it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5470]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  62%|████████████████████████████▉                  | 644/1044 [03:55<02:18,  2.89it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5467]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  62%|█████████████████████████████                  | 645/1044 [03:56<02:19,  2.87it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5463]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  62%|█████████████████████████████                  | 646/1044 [03:56<02:27,  2.70it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5461]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  62%|█████████████████████████████▏                 | 647/1044 [03:56<02:17,  2.88it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  62%|█████████████████████████████▏                 | 648/1044 [03:57<02:08,  3.09it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5475]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  62%|█████████████████████████████▏                 | 649/1044 [03:57<02:13,  2.96it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  62%|█████████████████████████████▎                 | 650/1044 [03:57<02:09,  3.05it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5483]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  62%|█████████████████████████████▎                 | 651/1044 [03:58<02:19,  2.81it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5481]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  62%|█████████████████████████████▎                 | 652/1044 [03:58<02:17,  2.85it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5486]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  63%|█████████████████████████████▍                 | 653/1044 [03:58<02:06,  3.09it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5493]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  63%|█████████████████████████████▍                 | 654/1044 [03:59<02:12,  2.93it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5490]

torch.Size([256, 366, 35]) torch.Size([256, 366])


[Training LM]:  63%|█████████████████████████████▍                 | 655/1044 [03:59<02:03,  3.16it/s, acc_step=1/1, ce_loss_token=1.7141, perplexity_token=5.5516]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  63%|█████████████████████████████▌                 | 656/1044 [03:59<02:08,  3.03it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5513]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  63%|█████████████████████████████▌                 | 657/1044 [04:00<02:18,  2.79it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5511]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  63%|█████████████████████████████▌                 | 658/1044 [04:00<02:20,  2.75it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5508]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  63%|█████████████████████████████▋                 | 659/1044 [04:01<02:39,  2.42it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5506]

torch.Size([256, 396, 35]) torch.Size([256, 396])


[Training LM]:  63%|█████████████████████████████▋                 | 660/1044 [04:01<02:53,  2.21it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5503]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  63%|█████████████████████████████▊                 | 661/1044 [04:02<02:46,  2.29it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5500]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  63%|█████████████████████████████▊                 | 662/1044 [04:02<02:38,  2.42it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 438, 35]) torch.Size([256, 438])


[Training LM]:  64%|█████████████████████████████▊                 | 663/1044 [04:02<02:44,  2.31it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5503]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  64%|█████████████████████████████▉                 | 664/1044 [04:03<02:36,  2.44it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5499]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|█████████████████████████████▉                 | 665/1044 [04:03<02:33,  2.48it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5497]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  64%|█████████████████████████████▉                 | 666/1044 [04:04<02:28,  2.54it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5494]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|██████████████████████████████                 | 667/1044 [04:04<02:16,  2.76it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5503]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  64%|██████████████████████████████                 | 668/1044 [04:04<02:07,  2.94it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5510]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  64%|██████████████████████████████                 | 669/1044 [04:04<02:08,  2.93it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5507]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  64%|██████████████████████████████▏                | 670/1044 [04:05<02:13,  2.81it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5504]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  64%|██████████████████████████████▏                | 671/1044 [04:05<02:17,  2.70it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5502]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  64%|██████████████████████████████▎                | 672/1044 [04:06<02:07,  2.91it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5512]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  64%|██████████████████████████████▎                | 673/1044 [04:06<02:09,  2.86it/s, acc_step=1/1, ce_loss_token=1.7140, perplexity_token=5.5509]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  65%|██████████████████████████████▎                | 674/1044 [04:06<02:12,  2.80it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5506]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  65%|██████████████████████████████▍                | 675/1044 [04:07<02:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5503]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  65%|██████████████████████████████▍                | 676/1044 [04:07<02:17,  2.67it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5500]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  65%|██████████████████████████████▍                | 677/1044 [04:07<02:15,  2.70it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5498]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  65%|██████████████████████████████▌                | 678/1044 [04:08<02:16,  2.69it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5495]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  65%|██████████████████████████████▌                | 679/1044 [04:08<02:19,  2.61it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5493]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  65%|██████████████████████████████▌                | 680/1044 [04:09<02:18,  2.62it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5490]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  65%|██████████████████████████████▋                | 681/1044 [04:09<02:16,  2.67it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  65%|██████████████████████████████▋                | 682/1044 [04:09<02:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5497]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  65%|██████████████████████████████▋                | 683/1044 [04:10<02:11,  2.75it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5493]

torch.Size([256, 276, 35]) torch.Size([256, 276])


[Training LM]:  66%|██████████████████████████████▊                | 684/1044 [04:10<02:06,  2.84it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5490]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  66%|██████████████████████████████▊                | 685/1044 [04:10<01:58,  3.02it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5497]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  66%|██████████████████████████████▉                | 686/1044 [04:11<02:03,  2.91it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5494]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  66%|██████████████████████████████▉                | 687/1044 [04:11<02:10,  2.73it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  66%|██████████████████████████████▉                | 688/1044 [04:11<02:09,  2.76it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  66%|███████████████████████████████                | 689/1044 [04:12<02:06,  2.81it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  66%|███████████████████████████████                | 690/1044 [04:12<02:09,  2.74it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5483]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  66%|███████████████████████████████                | 691/1044 [04:12<02:04,  2.83it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  66%|███████████████████████████████▏               | 692/1044 [04:13<01:56,  3.02it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  66%|███████████████████████████████▏               | 693/1044 [04:13<02:00,  2.90it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5493]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  66%|███████████████████████████████▏               | 694/1044 [04:13<02:03,  2.84it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  67%|███████████████████████████████▎               | 695/1044 [04:14<02:02,  2.84it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5501]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  67%|███████████████████████████████▎               | 696/1044 [04:14<02:04,  2.80it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5508]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  67%|███████████████████████████████▍               | 697/1044 [04:15<02:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5505]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  67%|███████████████████████████████▍               | 698/1044 [04:15<02:05,  2.76it/s, acc_step=1/1, ce_loss_token=1.7139, perplexity_token=5.5504]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  67%|███████████████████████████████▍               | 699/1044 [04:15<02:03,  2.78it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5502]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  67%|███████████████████████████████▌               | 700/1044 [04:16<02:05,  2.74it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5497]

torch.Size([256, 359, 35]) torch.Size([256, 359])


[Training LM]:  67%|███████████████████████████████▌               | 701/1044 [04:16<02:16,  2.51it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5494]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  67%|███████████████████████████████▌               | 702/1044 [04:16<02:12,  2.58it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5491]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  67%|███████████████████████████████▋               | 703/1044 [04:17<02:03,  2.77it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5502]

torch.Size([256, 328, 35]) torch.Size([256, 328])


[Training LM]:  67%|███████████████████████████████▋               | 704/1044 [04:17<02:07,  2.67it/s, acc_step=1/1, ce_loss_token=1.7138, perplexity_token=5.5498]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  68%|███████████████████████████████▋               | 705/1044 [04:18<02:03,  2.75it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5496]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  68%|███████████████████████████████▊               | 706/1044 [04:18<02:07,  2.64it/s, acc_step=1/1, ce_loss_token=1.7137, perplexity_token=5.5492]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  68%|███████████████████████████████▊               | 707/1044 [04:18<02:07,  2.65it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  68%|███████████████████████████████▊               | 708/1044 [04:19<02:04,  2.70it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  68%|███████████████████████████████▉               | 709/1044 [04:19<02:06,  2.65it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5485]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  68%|███████████████████████████████▉               | 710/1044 [04:19<02:06,  2.64it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5481]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  68%|████████████████████████████████               | 711/1044 [04:20<02:04,  2.68it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5479]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  68%|████████████████████████████████               | 712/1044 [04:20<02:01,  2.73it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5477]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  68%|████████████████████████████████               | 713/1044 [04:21<02:02,  2.70it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5473]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  68%|████████████████████████████████▏              | 714/1044 [04:21<02:07,  2.59it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  68%|████████████████████████████████▏              | 715/1044 [04:21<02:05,  2.63it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  69%|████████████████████████████████▏              | 716/1044 [04:22<02:05,  2.62it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  69%|████████████████████████████████▎              | 717/1044 [04:22<02:05,  2.61it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  69%|████████████████████████████████▎              | 718/1044 [04:22<02:04,  2.61it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  69%|████████████████████████████████▎              | 719/1044 [04:23<02:02,  2.65it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  69%|████████████████████████████████▍              | 720/1044 [04:23<02:00,  2.69it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5454]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  69%|████████████████████████████████▍              | 721/1044 [04:24<02:01,  2.66it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5452]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  69%|████████████████████████████████▌              | 722/1044 [04:24<02:00,  2.68it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5449]

torch.Size([256, 345, 35]) torch.Size([256, 345])


[Training LM]:  69%|████████████████████████████████▌              | 723/1044 [04:24<02:07,  2.52it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  69%|████████████████████████████████▌              | 724/1044 [04:25<01:57,  2.73it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5456]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  69%|████████████████████████████████▋              | 725/1044 [04:25<01:56,  2.74it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5453]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▋              | 726/1044 [04:25<01:56,  2.72it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  70%|████████████████████████████████▋              | 727/1044 [04:26<01:58,  2.68it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5447]

torch.Size([256, 332, 35]) torch.Size([256, 332])


[Training LM]:  70%|████████████████████████████████▊              | 728/1044 [04:26<02:01,  2.59it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▊              | 729/1044 [04:27<01:52,  2.79it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  70%|████████████████████████████████▊              | 730/1044 [04:27<01:53,  2.77it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  70%|████████████████████████████████▉              | 731/1044 [04:27<01:49,  2.85it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  70%|████████████████████████████████▉              | 732/1044 [04:28<01:48,  2.88it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  70%|████████████████████████████████▉              | 733/1044 [04:28<01:48,  2.86it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  70%|█████████████████████████████████              | 734/1044 [04:28<01:50,  2.80it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  70%|█████████████████████████████████              | 735/1044 [04:29<01:57,  2.64it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 349, 35]) torch.Size([256, 349])


[Training LM]:  70%|█████████████████████████████████▏             | 736/1044 [04:29<02:04,  2.47it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 390, 35]) torch.Size([256, 390])


[Training LM]:  71%|█████████████████████████████████▏             | 737/1044 [04:30<02:05,  2.45it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  71%|█████████████████████████████████▏             | 738/1044 [04:30<01:55,  2.64it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  71%|█████████████████████████████████▎             | 739/1044 [04:30<01:47,  2.85it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 289, 35]) torch.Size([256, 289])


[Training LM]:  71%|█████████████████████████████████▎             | 740/1044 [04:31<01:46,  2.85it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5447]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  71%|█████████████████████████████████▎             | 741/1044 [04:31<01:47,  2.81it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 351, 35]) torch.Size([256, 351])


[Training LM]:  71%|█████████████████████████████████▍             | 742/1044 [04:31<01:57,  2.57it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5442]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  71%|█████████████████████████████████▍             | 743/1044 [04:32<01:56,  2.58it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  71%|█████████████████████████████████▌             | 745/1044 [04:32<01:40,  2.98it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 316, 35]) torch.Size([256, 316])
torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  71%|█████████████████████████████████▌             | 746/1044 [04:33<01:39,  2.99it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  72%|█████████████████████████████████▋             | 747/1044 [04:33<01:46,  2.78it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  72%|█████████████████████████████████▋             | 748/1044 [04:34<01:48,  2.73it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 393, 35]) torch.Size([256, 393])


[Training LM]:  72%|█████████████████████████████████▋             | 749/1044 [04:34<01:53,  2.61it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5461]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  72%|█████████████████████████████████▊             | 750/1044 [04:34<01:50,  2.66it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5458]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  72%|█████████████████████████████████▊             | 751/1044 [04:35<01:53,  2.58it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5456]

torch.Size([256, 355, 35]) torch.Size([256, 355])


[Training LM]:  72%|█████████████████████████████████▊             | 752/1044 [04:35<02:00,  2.43it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5453]

torch.Size([256, 388, 35]) torch.Size([256, 388])


[Training LM]:  72%|█████████████████████████████████▉             | 753/1044 [04:36<02:10,  2.23it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  72%|█████████████████████████████████▉             | 754/1044 [04:36<02:03,  2.34it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5447]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  72%|█████████████████████████████████▉             | 755/1044 [04:36<01:59,  2.43it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  72%|██████████████████████████████████             | 756/1044 [04:37<01:58,  2.43it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  73%|██████████████████████████████████             | 757/1044 [04:37<01:55,  2.48it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  73%|██████████████████████████████████             | 758/1044 [04:38<01:48,  2.64it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  73%|██████████████████████████████████▏            | 759/1044 [04:38<01:46,  2.67it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5442]

torch.Size([256, 321, 35]) torch.Size([256, 321])


[Training LM]:  73%|██████████████████████████████████▏            | 760/1044 [04:38<01:41,  2.80it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5452]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  73%|██████████████████████████████████▎            | 761/1044 [04:39<01:40,  2.82it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5449]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  73%|██████████████████████████████████▎            | 762/1044 [04:39<01:44,  2.71it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  73%|██████████████████████████████████▎            | 763/1044 [04:39<01:44,  2.70it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  73%|██████████████████████████████████▍            | 764/1044 [04:40<01:39,  2.82it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5452]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  73%|██████████████████████████████████▍            | 765/1044 [04:40<01:38,  2.83it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  73%|██████████████████████████████████▍            | 766/1044 [04:40<01:43,  2.70it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5447]

torch.Size([256, 331, 35]) torch.Size([256, 331])


[Training LM]:  73%|██████████████████████████████████▌            | 767/1044 [04:41<01:47,  2.58it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  74%|██████████████████████████████████▌            | 768/1044 [04:41<01:45,  2.62it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5441]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  74%|██████████████████████████████████▌            | 769/1044 [04:42<01:48,  2.54it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 338, 35]) torch.Size([256, 338])


[Training LM]:  74%|██████████████████████████████████▋            | 770/1044 [04:42<01:50,  2.48it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5436]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  74%|██████████████████████████████████▋            | 771/1044 [04:42<01:49,  2.50it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 287, 35]) torch.Size([256, 287])


[Training LM]:  74%|██████████████████████████████████▊            | 772/1044 [04:43<01:44,  2.60it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5430]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  74%|██████████████████████████████████▊            | 773/1044 [04:43<01:42,  2.64it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5428]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  74%|██████████████████████████████████▊            | 774/1044 [04:44<01:42,  2.64it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  74%|██████████████████████████████████▉            | 775/1044 [04:44<01:44,  2.58it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5424]

torch.Size([256, 347, 35]) torch.Size([256, 347])


[Training LM]:  74%|██████████████████████████████████▉            | 776/1044 [04:44<01:49,  2.45it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5421]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  74%|██████████████████████████████████▉            | 777/1044 [04:45<01:42,  2.60it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 392, 35]) torch.Size([256, 392])


[Training LM]:  75%|███████████████████████████████████            | 778/1044 [04:45<01:53,  2.34it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5415]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  75%|███████████████████████████████████            | 779/1044 [04:46<01:47,  2.48it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  75%|███████████████████████████████████▏           | 781/1044 [04:46<01:21,  3.22it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 285, 35]) torch.Size([256, 285])
torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  75%|███████████████████████████████████▏           | 782/1044 [04:46<01:21,  3.20it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5447]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  75%|███████████████████████████████████▎           | 783/1044 [04:47<01:23,  3.14it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  75%|███████████████████████████████████▎           | 784/1044 [04:47<01:28,  2.94it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5441]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  75%|███████████████████████████████████▎           | 785/1044 [04:48<01:33,  2.76it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  75%|███████████████████████████████████▍           | 786/1044 [04:48<01:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  75%|███████████████████████████████████▍           | 787/1044 [04:48<01:35,  2.70it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 340, 35]) torch.Size([256, 340])


[Training LM]:  75%|███████████████████████████████████▍           | 788/1044 [04:49<01:31,  2.78it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  76%|███████████████████████████████████▌           | 789/1044 [04:49<01:25,  2.98it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5449]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  76%|███████████████████████████████████▌           | 790/1044 [04:49<01:27,  2.90it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  76%|███████████████████████████████████▌           | 791/1044 [04:50<01:21,  3.12it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5456]

torch.Size([256, 356, 35]) torch.Size([256, 356])


[Training LM]:  76%|███████████████████████████████████▋           | 792/1044 [04:50<01:22,  3.04it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  76%|███████████████████████████████████▋           | 793/1044 [04:50<01:19,  3.15it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5470]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  76%|███████████████████████████████████▋           | 794/1044 [04:51<01:22,  3.02it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  76%|███████████████████████████████████▊           | 795/1044 [04:51<01:23,  2.97it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5466]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  76%|███████████████████████████████████▉           | 797/1044 [04:52<01:15,  3.29it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5489]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  76%|███████████████████████████████████▉           | 798/1044 [04:52<01:18,  3.14it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5487]

torch.Size([256, 279, 35]) torch.Size([256, 279])


[Training LM]:  77%|███████████████████████████████████▉           | 799/1044 [04:52<01:18,  3.10it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5486]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  77%|████████████████████████████████████           | 800/1044 [04:52<01:17,  3.16it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5490]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  77%|████████████████████████████████████           | 801/1044 [04:53<01:20,  3.04it/s, acc_step=1/1, ce_loss_token=1.7136, perplexity_token=5.5488]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  77%|████████████████████████████████████           | 802/1044 [04:53<01:22,  2.95it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5485]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  77%|████████████████████████████████████▏          | 803/1044 [04:54<01:22,  2.93it/s, acc_step=1/1, ce_loss_token=1.7135, perplexity_token=5.5483]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  77%|████████████████████████████████████▏          | 804/1044 [04:54<01:23,  2.86it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5481]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  77%|████████████████████████████████████▏          | 805/1044 [04:54<01:23,  2.85it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5478]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  77%|████████████████████████████████████▎          | 806/1044 [04:55<01:24,  2.83it/s, acc_step=1/1, ce_loss_token=1.7134, perplexity_token=5.5475]

torch.Size([256, 341, 35]) torch.Size([256, 341])


[Training LM]:  77%|████████████████████████████████████▎          | 807/1044 [04:55<01:30,  2.62it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5473]

torch.Size([256, 327, 35]) torch.Size([256, 327])


[Training LM]:  77%|████████████████████████████████████▍          | 808/1044 [04:56<01:33,  2.53it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5471]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  77%|████████████████████████████████████▍          | 809/1044 [04:56<01:31,  2.56it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5467]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  78%|████████████████████████████████████▍          | 810/1044 [04:56<01:25,  2.75it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5473]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  78%|████████████████████████████████████▌          | 811/1044 [04:57<01:23,  2.77it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5470]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  78%|████████████████████████████████████▌          | 812/1044 [04:57<01:23,  2.80it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5468]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  78%|████████████████████████████████████▌          | 813/1044 [04:57<01:22,  2.79it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  78%|████████████████████████████████████▋          | 814/1044 [04:58<01:22,  2.80it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5463]

torch.Size([256, 316, 35]) torch.Size([256, 316])


[Training LM]:  78%|████████████████████████████████████▋          | 815/1044 [04:58<01:18,  2.94it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5467]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  78%|████████████████████████████████████▋          | 816/1044 [04:58<01:18,  2.91it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5464]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  78%|████████████████████████████████████▊          | 817/1044 [04:59<01:20,  2.83it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 377, 35]) torch.Size([256, 377])


[Training LM]:  78%|████████████████████████████████████▊          | 818/1044 [04:59<01:31,  2.48it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5460]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  78%|████████████████████████████████████▊          | 819/1044 [05:00<01:28,  2.53it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5458]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  79%|████████████████████████████████████▉          | 820/1044 [05:00<01:26,  2.60it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  79%|████████████████████████████████████▉          | 821/1044 [05:00<01:24,  2.63it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5453]

torch.Size([256, 288, 35]) torch.Size([256, 288])


[Training LM]:  79%|█████████████████████████████████████          | 822/1044 [05:01<01:21,  2.72it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 312, 35]) torch.Size([256, 312])


[Training LM]:  79%|█████████████████████████████████████          | 823/1044 [05:01<01:21,  2.70it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5447]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  79%|█████████████████████████████████████          | 824/1044 [05:01<01:20,  2.73it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  79%|█████████████████████████████████████▏         | 825/1044 [05:02<01:14,  2.93it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5450]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  79%|█████████████████████████████████████▏         | 826/1044 [05:02<01:09,  3.12it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5455]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  79%|█████████████████████████████████████▏         | 827/1044 [05:02<01:07,  3.23it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5464]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  79%|█████████████████████████████████████▎         | 828/1044 [05:02<01:04,  3.32it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5473]

torch.Size([256, 324, 35]) torch.Size([256, 324])


[Training LM]:  79%|█████████████████████████████████████▎         | 829/1044 [05:03<01:10,  3.03it/s, acc_step=1/1, ce_loss_token=1.7133, perplexity_token=5.5470]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  80%|█████████████████████████████████████▎         | 830/1044 [05:03<01:13,  2.91it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5467]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  80%|█████████████████████████████████████▍         | 831/1044 [05:04<01:14,  2.87it/s, acc_step=1/1, ce_loss_token=1.7132, perplexity_token=5.5465]

torch.Size([256, 282, 35]) torch.Size([256, 282])


[Training LM]:  80%|█████████████████████████████████████▍         | 832/1044 [05:04<01:12,  2.90it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5462]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  80%|█████████████████████████████████████▌         | 833/1044 [05:04<01:13,  2.88it/s, acc_step=1/1, ce_loss_token=1.7131, perplexity_token=5.5459]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  80%|█████████████████████████████████████▌         | 834/1044 [05:05<01:13,  2.85it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5457]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  80%|█████████████████████████████████████▌         | 835/1044 [05:05<01:14,  2.81it/s, acc_step=1/1, ce_loss_token=1.7130, perplexity_token=5.5454]

torch.Size([256, 317, 35]) torch.Size([256, 317])


[Training LM]:  80%|█████████████████████████████████████▋         | 836/1044 [05:05<01:16,  2.71it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5451]

torch.Size([256, 300, 35]) torch.Size([256, 300])


[Training LM]:  80%|█████████████████████████████████████▋         | 837/1044 [05:06<01:15,  2.72it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5449]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  80%|█████████████████████████████████████▋         | 838/1044 [05:06<01:17,  2.65it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  80%|█████████████████████████████████████▊         | 839/1044 [05:07<01:15,  2.71it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5444]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  80%|█████████████████████████████████████▊         | 840/1044 [05:07<01:14,  2.74it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5442]

torch.Size([256, 333, 35]) torch.Size([256, 333])


[Training LM]:  81%|█████████████████████████████████████▊         | 841/1044 [05:07<01:17,  2.61it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5441]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  81%|█████████████████████████████████████▉         | 842/1044 [05:08<01:15,  2.66it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 283, 35]) torch.Size([256, 283])


[Training LM]:  81%|█████████████████████████████████████▉         | 843/1044 [05:08<01:13,  2.73it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  81%|█████████████████████████████████████▉         | 844/1044 [05:08<01:13,  2.72it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5434]

torch.Size([256, 292, 35]) torch.Size([256, 292])


[Training LM]:  81%|██████████████████████████████████████         | 845/1044 [05:09<01:12,  2.76it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  81%|██████████████████████████████████████         | 846/1044 [05:09<01:07,  2.92it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5436]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  81%|██████████████████████████████████████▏        | 847/1044 [05:09<01:04,  3.03it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5442]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  81%|██████████████████████████████████████▏        | 848/1044 [05:10<01:06,  2.94it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  81%|██████████████████████████████████████▏        | 849/1044 [05:10<01:03,  3.08it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5445]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  81%|██████████████████████████████████████▎        | 850/1044 [05:10<01:06,  2.92it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  82%|██████████████████████████████████████▎        | 851/1044 [05:11<01:07,  2.85it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 301, 35]) torch.Size([256, 301])


[Training LM]:  82%|██████████████████████████████████████▎        | 852/1044 [05:11<01:08,  2.80it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  82%|██████████████████████████████████████▍        | 853/1044 [05:11<01:08,  2.78it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5436]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  82%|██████████████████████████████████████▍        | 854/1044 [05:12<01:04,  2.94it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5441]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  82%|██████████████████████████████████████▍        | 855/1044 [05:12<01:04,  2.92it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 326, 35]) torch.Size([256, 326])


[Training LM]:  82%|██████████████████████████████████████▌        | 856/1044 [05:12<01:02,  2.99it/s, acc_step=1/1, ce_loss_token=1.7129, perplexity_token=5.5448]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  82%|██████████████████████████████████████▌        | 857/1044 [05:13<01:05,  2.84it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5446]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  82%|██████████████████████████████████████▋        | 858/1044 [05:13<01:05,  2.83it/s, acc_step=1/1, ce_loss_token=1.7128, perplexity_token=5.5443]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  82%|██████████████████████████████████████▋        | 859/1044 [05:14<01:04,  2.87it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  82%|██████████████████████████████████████▋        | 860/1044 [05:14<01:05,  2.79it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  82%|██████████████████████████████████████▊        | 861/1044 [05:14<01:06,  2.75it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  83%|██████████████████████████████████████▊        | 862/1044 [05:15<01:05,  2.77it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  83%|██████████████████████████████████████▊        | 863/1044 [05:15<01:05,  2.77it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5430]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  83%|██████████████████████████████████████▉        | 864/1044 [05:15<01:03,  2.84it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5439]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  83%|██████████████████████████████████████▉        | 865/1044 [05:16<01:03,  2.83it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  83%|██████████████████████████████████████▉        | 866/1044 [05:16<01:00,  2.92it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5442]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  83%|███████████████████████████████████████        | 867/1044 [05:16<01:02,  2.82it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5440]

torch.Size([256, 318, 35]) torch.Size([256, 318])


[Training LM]:  83%|███████████████████████████████████████        | 868/1044 [05:17<01:04,  2.73it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5437]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  83%|███████████████████████████████████████        | 869/1044 [05:17<01:04,  2.71it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  83%|███████████████████████████████████████▏       | 870/1044 [05:18<01:04,  2.69it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5431]

torch.Size([256, 286, 35]) torch.Size([256, 286])


[Training LM]:  83%|███████████████████████████████████████▏       | 871/1044 [05:18<01:02,  2.76it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5430]

torch.Size([256, 297, 35]) torch.Size([256, 297])


[Training LM]:  84%|███████████████████████████████████████▎       | 872/1044 [05:18<00:57,  2.97it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 329, 35]) torch.Size([256, 329])


[Training LM]:  84%|███████████████████████████████████████▎       | 873/1044 [05:19<01:01,  2.77it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 308, 35]) torch.Size([256, 308])


[Training LM]:  84%|███████████████████████████████████████▎       | 874/1044 [05:19<00:57,  2.94it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 314, 35]) torch.Size([256, 314])


[Training LM]:  84%|███████████████████████████████████████▍       | 875/1044 [05:19<00:59,  2.84it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5434]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  84%|███████████████████████████████████████▍       | 876/1044 [05:20<00:56,  2.98it/s, acc_step=1/1, ce_loss_token=1.7127, perplexity_token=5.5438]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  84%|███████████████████████████████████████▍       | 877/1044 [05:20<00:59,  2.83it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5435]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  84%|███████████████████████████████████████▌       | 878/1044 [05:20<00:59,  2.77it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5433]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  84%|███████████████████████████████████████▌       | 879/1044 [05:21<01:00,  2.71it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5431]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  84%|███████████████████████████████████████▌       | 880/1044 [05:21<01:00,  2.69it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5428]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  84%|███████████████████████████████████████▋       | 881/1044 [05:21<01:00,  2.71it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 313, 35]) torch.Size([256, 313])


[Training LM]:  84%|███████████████████████████████████████▋       | 882/1044 [05:22<00:56,  2.86it/s, acc_step=1/1, ce_loss_token=1.7126, perplexity_token=5.5432]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  85%|███████████████████████████████████████▊       | 883/1044 [05:22<00:58,  2.75it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5429]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  85%|███████████████████████████████████████▊       | 884/1044 [05:23<00:59,  2.70it/s, acc_step=1/1, ce_loss_token=1.7125, perplexity_token=5.5426]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▊       | 885/1044 [05:23<00:58,  2.73it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5425]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  85%|███████████████████████████████████████▉       | 886/1044 [05:23<00:56,  2.78it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5422]

torch.Size([256, 525, 35]) torch.Size([256, 525])


[Training LM]:  85%|███████████████████████████████████████▉       | 887/1044 [05:24<01:19,  1.97it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5419]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  85%|███████████████████████████████████████▉       | 888/1044 [05:24<01:12,  2.15it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5417]

torch.Size([256, 294, 35]) torch.Size([256, 294])


[Training LM]:  85%|████████████████████████████████████████       | 889/1044 [05:25<01:07,  2.31it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5415]

torch.Size([256, 280, 35]) torch.Size([256, 280])


[Training LM]:  85%|████████████████████████████████████████       | 890/1044 [05:25<01:01,  2.49it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 281, 35]) torch.Size([256, 281])


[Training LM]:  85%|████████████████████████████████████████       | 891/1044 [05:25<00:58,  2.60it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 334, 35]) torch.Size([256, 334])


[Training LM]:  85%|████████████████████████████████████████▏      | 892/1044 [05:26<01:00,  2.53it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  86%|████████████████████████████████████████▏      | 893/1044 [05:26<00:59,  2.55it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5408]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  86%|████████████████████████████████████████▏      | 894/1044 [05:27<00:58,  2.58it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  86%|████████████████████████████████████████▎      | 895/1044 [05:27<00:53,  2.78it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 337, 35]) torch.Size([256, 337])


[Training LM]:  86%|████████████████████████████████████████▎      | 896/1044 [05:27<00:56,  2.63it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 295, 35]) torch.Size([256, 295])


[Training LM]:  86%|████████████████████████████████████████▍      | 897/1044 [05:28<00:54,  2.67it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  86%|████████████████████████████████████████▍      | 898/1044 [05:28<00:54,  2.69it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  86%|████████████████████████████████████████▍      | 899/1044 [05:28<00:50,  2.87it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 335, 35]) torch.Size([256, 335])


[Training LM]:  86%|████████████████████████████████████████▌      | 900/1044 [05:29<00:53,  2.68it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5404]

torch.Size([256, 303, 35]) torch.Size([256, 303])


[Training LM]:  86%|████████████████████████████████████████▌      | 901/1044 [05:29<00:53,  2.67it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5403]

torch.Size([256, 336, 35]) torch.Size([256, 336])


[Training LM]:  86%|████████████████████████████████████████▌      | 902/1044 [05:30<00:55,  2.57it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 293, 35]) torch.Size([256, 293])


[Training LM]:  86%|████████████████████████████████████████▋      | 903/1044 [05:30<00:53,  2.63it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5399]

torch.Size([256, 322, 35]) torch.Size([256, 322])


[Training LM]:  87%|████████████████████████████████████████▋      | 905/1044 [05:31<00:45,  3.04it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5420]

torch.Size([256, 297, 35]) torch.Size([256, 297])
torch.Size([256, 330, 35]) torch.Size([256, 330])


[Training LM]:  87%|████████████████████████████████████████▊      | 906/1044 [05:31<00:44,  3.09it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5424]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  87%|████████████████████████████████████████▊      | 907/1044 [05:31<00:45,  2.99it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5423]

torch.Size([256, 290, 35]) torch.Size([256, 290])


[Training LM]:  87%|████████████████████████████████████████▉      | 908/1044 [05:32<00:46,  2.95it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5420]

torch.Size([256, 325, 35]) torch.Size([256, 325])


[Training LM]:  87%|████████████████████████████████████████▉      | 909/1044 [05:32<00:48,  2.76it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  87%|████████████████████████████████████████▉      | 910/1044 [05:32<00:49,  2.71it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5416]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  87%|█████████████████████████████████████████      | 911/1044 [05:33<00:46,  2.87it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5420]

torch.Size([256, 285, 35]) torch.Size([256, 285])


[Training LM]:  87%|█████████████████████████████████████████      | 912/1044 [05:33<00:45,  2.89it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  87%|█████████████████████████████████████████      | 913/1044 [05:33<00:46,  2.80it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5416]

torch.Size([256, 298, 35]) torch.Size([256, 298])


[Training LM]:  88%|█████████████████████████████████████████▏     | 914/1044 [05:34<00:43,  2.99it/s, acc_step=1/1, ce_loss_token=1.7124, perplexity_token=5.5420]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  88%|█████████████████████████████████████████▏     | 915/1044 [05:34<00:45,  2.84it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5418]

torch.Size([256, 309, 35]) torch.Size([256, 309])


[Training LM]:  88%|█████████████████████████████████████████▏     | 916/1044 [05:34<00:46,  2.77it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5415]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  88%|█████████████████████████████████████████▎     | 917/1044 [05:35<00:45,  2.78it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 310, 35]) torch.Size([256, 310])


[Training LM]:  88%|█████████████████████████████████████████▎     | 918/1044 [05:35<00:45,  2.74it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5411]

torch.Size([256, 320, 35]) torch.Size([256, 320])


[Training LM]:  88%|█████████████████████████████████████████▎     | 919/1044 [05:36<00:46,  2.67it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 305, 35]) torch.Size([256, 305])


[Training LM]:  88%|█████████████████████████████████████████▍     | 920/1044 [05:36<00:46,  2.66it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5406]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  88%|█████████████████████████████████████████▍     | 921/1044 [05:36<00:43,  2.80it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 291, 35]) torch.Size([256, 291])


[Training LM]:  88%|█████████████████████████████████████████▌     | 922/1044 [05:37<00:43,  2.82it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5409]

torch.Size([256, 296, 35]) torch.Size([256, 296])


[Training LM]:  88%|█████████████████████████████████████████▌     | 923/1044 [05:37<00:40,  3.02it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5413]

torch.Size([256, 315, 35]) torch.Size([256, 315])


[Training LM]:  89%|█████████████████████████████████████████▌     | 924/1044 [05:37<00:41,  2.87it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5411]

torch.Size([256, 302, 35]) torch.Size([256, 302])


[Training LM]:  89%|█████████████████████████████████████████▋     | 925/1044 [05:38<00:39,  3.02it/s, acc_step=1/1, ce_loss_token=1.7123, perplexity_token=5.5415]

torch.Size([256, 304, 35]) torch.Size([256, 304])


[Training LM]:  89%|█████████████████████████████████████████▋     | 926/1044 [05:38<00:40,  2.93it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5412]

torch.Size([256, 357, 35]) torch.Size([256, 357])


[Training LM]:  89%|█████████████████████████████████████████▋     | 927/1044 [05:38<00:44,  2.62it/s, acc_step=1/1, ce_loss_token=1.7122, perplexity_token=5.5410]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  89%|█████████████████████████████████████████▊     | 928/1044 [05:39<00:45,  2.55it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5407]

torch.Size([256, 311, 35]) torch.Size([256, 311])


[Training LM]:  89%|█████████████████████████████████████████▊     | 929/1044 [05:39<00:44,  2.56it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5405]

torch.Size([256, 323, 35]) torch.Size([256, 323])


[Training LM]:  89%|█████████████████████████████████████████▊     | 930/1044 [05:40<00:45,  2.51it/s, acc_step=1/1, ce_loss_token=1.7121, perplexity_token=5.5403]

torch.Size([256, 307, 35]) torch.Size([256, 307])


[Training LM]:  89%|█████████████████████████████████████████▉     | 931/1044 [05:40<00:44,  2.54it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5401]

torch.Size([256, 319, 35]) torch.Size([256, 319])


[Training LM]:  89%|█████████████████████████████████████████▉     | 932/1044 [05:40<00:44,  2.53it/s, acc_step=1/1, ce_loss_token=1.7120, perplexity_token=5.5400]

torch.Size([256, 306, 35]) torch.Size([256, 306])


[Training LM]:  89%|██████████████████████████████████████████     | 933/1044 [05:41<00:43,  2.57it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5397]

torch.Size([256, 299, 35]) torch.Size([256, 299])


[Training LM]:  89%|██████████████████████████████████████████     | 934/1044 [05:41<00:42,  2.60it/s, acc_step=1/1, ce_loss_token=1.7119, perplexity_token=5.5395]
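
The two numbers in each progress-bar line above appear to be related by the standard definition of perplexity: the exponential of the mean per-token cross-entropy, measured in nats. The sketch below is a sanity check under that assumption, not the trainer's actual code. `perplexity_from_ce` is a hypothetical helper name, and the input value is read off the last logged line.

```python
import math


def perplexity_from_ce(ce_loss_token: float) -> float:
    """Return exp(mean per-token cross-entropy), assuming the loss is in nats."""
    return math.exp(ce_loss_token)


# Value taken from the final progress-bar line of the log above.
logged_ce = 1.7119
print(f"perplexity ~ {perplexity_from_ce(logged_ce):.4f}")
```

Because the logged loss is rounded to four decimals, the recomputed perplexity can differ from the logged `perplexity_token` in the last digit.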

# Evaluate


In [None]:
trainer.load_checkpoint("/jet/home/srivastr/idl/IDL4-Transfomers/IDL-HW4/expts/test-lm/checkpoints/checkpoint-best-metric-model.pth")

test_metrics, test_generation_results = trainer.evaluate(test_loader)
# Cleanup
trainer.cleanup()

# Submission
To submit your assignment, you will need to create a `handin.tar` with the following directory structure:

```
handin/
├── mytorch/                     # Your implemented modules
├── test_metrics.json            # Results from evaluation
├── test_generated_results.json  # Sample text generations
└── model_arch.txt               # Model architecture summary
```

- Run the cell below once you are satisfied with your work; it creates the `handin.tar` file.
- After the cell finishes, you should see `handin.tar` in the current directory.
- Upload the `handin.tar` file to the `HW4P1` assignment on Autolab.

In [None]:
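# Imports used by this cell (harmless if already imported earlier in the notebook)
import json
import os
import shutil
import tarfile

# Create temporary handin directory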
if os.path.exists('handin'):
    shutil.rmtree('handin')
os.makedirs('handin')

# Copy mytorch directory
shutil.copytree('mytorch', 'handin/mytorch')

# Save final results
with open('handin/test_metrics.json', 'w') as f:
    json.dump(test_metrics, f, indent=4)

with open('handin/test_generated_results.json', 'w') as f:
    json.dump(test_generation_results['greedy'], f, indent=4)

# Save model architecture
with open('handin/model_arch.txt', 'w') as f:
    f.write(str(model_stats))

# Create tar file with all exclusions handled by filter
with tarfile.open('handin.tar', 'w') as tar:
    def filter_files(tarinfo):
        # Skip unwanted files
        if any(pattern in tarinfo.name for pattern in [
            '.DS_Store',
            '__pycache__',
            '.pyc'
        ]):
            return None
        return tarinfo

    tar.add('handin', arcname='handin', filter=filter_files)

# Cleanup
shutil.rmtree('handin')

print("Created handin.tar successfully!")

# handin.tar should now be listed in the current directory
!ls
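
Before uploading, it can help to confirm that the archive actually contains the entries shown in the directory listing for this section. The following is a minimal sketch using only the standard-library `tarfile` module. `REQUIRED` and `missing_entries` are illustrative names, not part of the handout, and the check assumes the archive was built with `arcname='handin'` as in the cell above.

```python
import os
import tarfile

# Expected entries, copied from the handin/ directory structure listed earlier.
REQUIRED = [
    "handin/mytorch",
    "handin/test_metrics.json",
    "handin/test_generated_results.json",
    "handin/model_arch.txt",
]


def missing_entries(tar_path, required=REQUIRED):
    """Return the required entries that are absent from the archive."""
    with tarfile.open(tar_path, "r") as tar:
        names = set(tar.getnames())
    return [entry for entry in required if entry not in names]


# Only run the check if the archive exists in the current directory.
if os.path.exists("handin.tar"):
    missing = missing_entries("handin.tar")
    print("Missing entries:", missing if missing else "none")
```

An empty result means every expected top-level entry is present. It does not validate the contents of the JSON files or of `mytorch/`.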