Can't build GPT-J 6B #595

@coppock

Description

System Info

  • CPU architecture: x86_64
  • Host memory: 256GB
  • GPU
    • Name: NVIDIA A30
    • Memory: 24GB
  • Libraries
    • TensorRT-LLM: v0.11.0
    • TensorRT: 10.1.0
    • CUDA: 12.6
    • NVIDIA driver: 560.28.03
    • Linux: Ubuntu 22.04

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Check out the v0.11.0 tag
  2. Install the Python requirements
  3. Build the GPT-J 6B engine, following the example
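Assuming the example referred to is `examples/gptj` in the TensorRT-LLM repository, the steps above correspond roughly to the following; the exact paths and flags are illustrative, not taken verbatim from the report:

```shell
# Sketch of the reproduction steps against the v0.11.0 tag of TensorRT-LLM.
# Paths and flag values are assumptions based on the example layout.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.11.0                              # step 1: tagged release

pip install -r examples/gptj/requirements.txt     # step 2: Python requirements

# step 3: convert the HF checkpoint, then build the engine
python examples/gptj/convert_checkpoint.py \
    --model_dir gpt-j-6b \
    --output_dir gpt-j-6b/trt
trtllm-build --checkpoint_dir gpt-j-6b/trt --output_dir gpt-j-6b/engine
```

The failure reported below occurs at the `convert_checkpoint.py` step, before `trtllm-build` is ever reached.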

Expected behavior

A successful build

Actual behavior

ubuntu$ python examples/gptj/convert_checkpoint.py --model_dir=gpt-j-6b --output_dir=gpt-j-6b/trt
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Weights loaded. Total time: 00:00:12
Traceback (most recent call last):
  File "/h/pcoppock/data/mlos/apps/triton/../../third-party/tensorrtllm_backend/tensorrt_llm/examples/gptj/convert_checkpoint.py", line 382, in <module>
    main()
  File "/h/pcoppock/data/mlos/apps/triton/../../third-party/tensorrtllm_backend/tensorrt_llm/examples/gptj/convert_checkpoint.py", line 358, in main
    covert_and_save(rank)
  File "/h/pcoppock/data/mlos/apps/triton/../../third-party/tensorrtllm_backend/tensorrt_llm/examples/gptj/convert_checkpoint.py", line 353, in covert_and_save
    safetensors.torch.save_file(
  File "/data/pcoppock/mlos/.venv/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/data/pcoppock/mlos/.venv/lib/python3.10/site-packages/safetensors/torch.py", line 496, in _flatten
    return {
  File "/data/pcoppock/mlos/.venv/lib/python3.10/site-packages/safetensors/torch.py", line 500, in <dictcomp>
    "data": _tobytes(v, k),
  File "/data/pcoppock/mlos/.venv/lib/python3.10/site-packages/safetensors/torch.py", line 414, in _tobytes
    raise ValueError(
ValueError: You are trying to save a non contiguous tensor: `lm_head.weight` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.
(.venv) ubuntu$

Checkpoint conversion fails with the error "You are trying to save a non contiguous tensor: `lm_head.weight` ...".
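The failure mode can be illustrated in isolation. `safetensors` refuses to serialize a tensor whose memory is strided rather than packed (e.g. a transposed view), which is what the traceback reports for `lm_head.weight`. The sketch below uses NumPy to show the same contiguity concept; in the PyTorch code path the analogous check is `Tensor.is_contiguous()` and the analogous fix is calling `.contiguous()` on the tensor before `safetensors.torch.save_file`. The workaround of packing the dict values before saving is an assumption about a possible patch, not a confirmed fix from the maintainers:

```python
# Illustration of the non-contiguous-tensor failure mode using NumPy.
# In convert_checkpoint.py a hypothetical workaround would be packing each
# tensor before the save call, e.g.:
#   safetensors.torch.save_file(
#       {name: t.contiguous() for name, t in weights.items()}, path)
import numpy as np

w = np.arange(12, dtype=np.float32).reshape(3, 4)
lm_head = w.T                               # transposed view: same data, strided

# This strided layout is what safetensors rejects with the ValueError above.
assert not lm_head.flags["C_CONTIGUOUS"]

packed = np.ascontiguousarray(lm_head)      # analogue of torch's .contiguous()
assert packed.flags["C_CONTIGUOUS"]
assert np.array_equal(packed, lm_head)      # values unchanged, layout packed
```

This also suggests why Llama conversion succeeds (see the note below): its `lm_head` weight is presumably already stored contiguously, so the check never fires.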

Additional notes

Conversion of Llama weights succeeds without error.


Labels: bug (Something isn't working)
