
A brief summary of the potential issues during the replication and corresponding solutions #81

puyuanliu opened this issue Mar 17, 2023 · 2 comments



puyuanliu commented Mar 17, 2023

1. "module transformers has no attribute LLaMATokenizer" or "missing key 'llama'".

First install SentencePiece, then install transformers from the Hugging Face git repo, i.e., pip install sentencepiece followed by pip install git+https://github.com/huggingface/transformers.git
The installation order matters.
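As a sketch, the install sequence looks like this (run in a fresh environment; the git URL is the one given above):

```shell
# Install sentencepiece first, then transformers from source.
# Installing in the other order can leave you with a transformers build
# that cannot load the LLaMA tokenizer.
pip install sentencepiece
pip install git+https://github.com/huggingface/transformers.git
```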

2. CUDA OOM at the beginning of the training.

Use --fp16 instead of --bf16. Lower the per-device batch size and adjust the gradient accumulation steps accordingly.
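A minimal sketch of the relevant training flags (flag names follow the Hugging Face TrainingArguments used by the repo's train.py; the model path and the values are placeholders to tune for your GPUs, not recommendations):

```shell
# Sketch: swap bf16 for fp16 and shrink the per-device batch to fit memory.
# Raising gradient_accumulation_steps preserves the effective batch size.
torchrun --nproc_per_node=4 train.py \
    --model_name_or_path <your_llama_path> \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16
```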

3. CUDA OOM during model saving.

Assuming you are using torch==1.13.0, change python/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224 from state_dict[fqn] = state_dict[fqn].clone().detach() to state_dict[fqn] = state_dict[fqn].cpu().clone().detach(). Moving the tensor to CPU before cloning avoids allocating the copy on the already-full GPU.

This usually happens on GPUs with limited memory (e.g., 40 GB or 24 GB).
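Because the exact path varies with your Python version and install location, one way to locate the file to patch is to ask Python directly (a sketch; assumes torch is importable in the current environment):

```shell
# Print the absolute path of the FSDP source file that needs the one-line patch.
python -c "import torch.distributed.fsdp.fully_sharded_data_parallel as m; print(m.__file__)"
```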

4. How to perform inference?

Refer to #35 (comment)

5. Generated tokens are not human-readable at inference time.

Assuming your training went well (e.g., training loss < 0.5), it is most likely that your model weights were corrupted during saving. Make sure there are no error messages during the save.
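One quick corruption check, sketched in Python. It assumes the checkpoint was saved as a plain state dict (e.g., a pytorch_model.bin you can load with torch.load); the function name is my own:

```python
# Scan a state dict for NaN/Inf weights, a typical symptom of a save
# interrupted by OOM. Usage:
#   find_corrupted_tensors(torch.load("output_dir/pytorch_model.bin", map_location="cpu"))
import torch

def find_corrupted_tensors(state_dict):
    """Return names of floating-point tensors containing NaN or Inf values."""
    return [
        name
        for name, tensor in state_dict.items()
        if tensor.is_floating_point() and not torch.isfinite(tensor).all()
    ]
```

An empty result means the weights are at least numerically sane; it cannot prove the save was fully correct, but a non-empty list confirms corruption.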

6. Fine-tuning is slow.

Refer to #32 (comment)

@ZeyuTeng96

Hello my friend, reading this issue is like finding treasure. I have a QQ chat group; would you be willing to join and help Chinese-speaking users? The QQ group number is: 397447632

@datquocnguyen

Regarding the CUDA OOM during model saving: with Python 3.10, the corresponding change should be made in python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py
