### 0. Notes before running

With Colab Pro, 51 GB of DRAM and 15 GB of VRAM are available.
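1. First, confirm that a GPU runtime is attached and visible.
2. Set the environment variable CUDA_VISIBLE_DEVICES to 0.
3. Confirm that the Python version is 3.10.
4. Clone OpenFedLLM into Google Drive beforehand.
5. Remove the install==1.3.5 line from OpenFedLLM's requirements.txt beforehand.
6. Place the install-1.3.5-py3-none-any.whl file in the OpenFedLLM folder.
7. Create an output folder inside OpenFedLLM.
8. Use the cd command to move into the OpenFedLLM folder.
9. Install the requirements.
10. Install the install-1.3.5 package manually from the wheel file.
11. Log in to the Hugging Face Hub with a read token. (Whether this step is strictly required is unknown.)
12. Run setup.sh.
13. Edit training_scripts/run_sft.sh so that it uses fewer resources.
14. Run training_scripts/run_sft.sh to start federated training.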
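
Steps 2 and 3 of the checklist can be checked in a single cell before anything is installed. The following is a minimal sketch, not part of OpenFedLLM: `preflight` is a hypothetical helper name, and the expected version (3.10) and GPU ID ("0") are taken from the notes above.

```python
import os
import sys


def preflight(version_info=None, environ=None, required=(3, 10), gpu_id="0"):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `preflight` is a hypothetical helper, not part of OpenFedLLM.
    """
    version_info = sys.version_info if version_info is None else version_info
    environ = os.environ if environ is None else environ
    problems = []

    # Step 3: the notes above assume Python 3.10.
    if tuple(version_info[:2]) != tuple(required):
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, "
            f"{required[0]}.{required[1]} expected"
        )

    # Step 2: pin the process to a single GPU. This must happen before any
    # CUDA library is initialized, so call it at the top of the notebook.
    environ["CUDA_VISIBLE_DEVICES"] = gpu_id

    return problems
```

Calling `preflight()` with no arguments inspects the running interpreter and writes to `os.environ`; passing explicit arguments, as in `preflight(version_info=(3, 11, 0), environ={})`, makes the check easy to try without touching the real environment.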

### 1. OpenFedLLM implementation

Change runtime type => GPU T4 (16 GB) < L4 (24 GB) < A100 (40 GB) < TPU (64 GB)

In [1]:
import os
from google.colab import drive
drive.mount("/content/drive")
os.chdir("/content/drive/MyDrive/Colab Notebooks")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!nvidia-smi
!python --version

Tue Nov 12 18:08:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   47C    P8              16W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
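
The same GPU information can be read programmatically, which is convenient when a later cell should fail early on a GPU that is too small. The sketch below assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; `parse_gpu_line` and `query_gpus` are hypothetical helper names.

```python
import subprocess


def parse_gpu_line(line):
    """Parse one CSV line such as 'NVIDIA L4, 23034' into (name, total_mib)."""
    name, total_mib = (field.strip() for field in line.rsplit(",", 1))
    return name, int(total_mib)


def query_gpus():
    """Return [(name, total_mib), ...] for every GPU that nvidia-smi reports."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

On the runtime shown above, `query_gpus()` would be expected to report a single NVIDIA L4 entry; the parsing step alone can be tried on any machine with `parse_gpu_line("NVIDIA L4, 23034")`.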
                                                                    

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Expose only GPU 0 to this process

In [4]:
#!git clone --recursive --shallow-submodules https://github.com/rui-ye/OpenFedLLM.git   # git clone: run this only once

In [5]:
os.chdir("/content/drive/MyDrive/Colab Notebooks/OpenFedLLM")
!pip install -r requirements.txt   # remove install==1.3.5 from requirements.txt first
!pip install bitsandbytes-cuda117

# After placing the wheel file shared on Discord in the OpenFedLLM folder, install install==1.3.5 manually
!pip install install-1.3.5-py3-none-any.whl -f ./ --no-index

Collecting bitsandbytes-cuda117
  Using cached bitsandbytes_cuda117-0.26.0.post2-py3-none-any.whl.metadata (6.3 kB)
Using cached bitsandbytes_cuda117-0.26.0.post2-py3-none-any.whl (4.3 MB)
Installing collected packages: bitsandbytes-cuda117
Successfully installed bitsandbytes-cuda117-0.26.0.post2
Looking in links: ./
Processing ./install-1.3.5-py3-none-any.whl
install is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
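
The manual edit to requirements.txt (removing the install==1.3.5 pin) can also be scripted, so that a fresh clone does not need hand-editing. The following is a minimal sketch; `drop_requirement` and `rewrite_requirements` are hypothetical helpers, and matching is done only on the package name that precedes any version specifier.

```python
import re
from pathlib import Path


def drop_requirement(lines, package):
    """Return `lines` without entries whose package name equals `package`.

    Comments, blank lines and other packages (e.g. 'installer') are kept.
    """
    kept = []
    for line in lines:
        # The package name is everything before a version/extras/marker separator.
        name = re.split(r"[=<>!~\[;\s]", line.strip(), maxsplit=1)[0]
        if name.lower() != package.lower():
            kept.append(line)
    return kept


def rewrite_requirements(path="requirements.txt", package="install"):
    """Rewrite the requirements file in place without the given package."""
    path = Path(path)
    lines = path.read_text().splitlines()
    path.write_text("\n".join(drop_requirement(lines, package)) + "\n")
```

Running `rewrite_requirements()` from inside the OpenFedLLM folder before `pip install -r requirements.txt` would have the same effect as deleting the line by hand.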


In [6]:
# The output folder is required
#!mkdir -p ./output

In [7]:
# Run setup.sh
!bash setup.sh

In [8]:
# Install and upgrade the required packages to their latest versions
# 24/10/29: package upgrade required
# Latest transformers and trl -> resolves the Mistral error

!pip install --upgrade bitsandbytes transformers huggingface_hub peft trl openai deepspeed 'accelerate>=0.26.0'

# Install compatible torch and torchvision versions
!pip install torch==2.0.1 torchvision==0.15.2


Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting transformers
  Downloading transformers-4.46.2-py3-none-any.whl.metadata (44 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.12.0-py3-none-any.whl.metadata (10 kB)
Collecting openai
  Downloading openai-1.54.4-py3-none-any.whl.metadata (24 kB)
Collecting deepspeed
  Downloading deepspeed-0.15.4.tar.gz (1.4 MB)
  Preparing metadata (setup.py) ...

In [9]:
# Log in with a Hugging Face token
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) N
Token is valid (permission: fineGrained).
The token `ssumday24` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `ssumday2

In [None]:
# Run run_sft.sh

!bash training_scripts/run_sft.sh

training_scripts/run_sft.sh: line 19: ../prometheus-7b-v2.0: Is a directory
2024-10-30 12:03:16.933858: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-30 12:03:16.953748: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 12:03:16.975759: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 12:03:16.982310: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been regist
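
Step 13 of the checklist (reducing the resource footprint of training_scripts/run_sft.sh) amounts to editing a few `name=value` assignments in the script. The sketch below rewrites such assignments from Python. `override_shell_vars` is a hypothetical helper, and the variable names in the usage note (`batch_size`, `seq_length`) are assumptions about what the script contains, so they should be checked against the actual file.

```python
import re


def override_shell_vars(script_text, overrides):
    """Replace top-level `name=value` assignments in a shell script.

    Only lines that start with `name=` (no leading spaces, no space around
    '=') are rewritten, and the whole line is replaced, so any trailing
    comment on that line is dropped. All other lines pass through unchanged.
    """
    out = []
    for line in script_text.splitlines():
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)=", line)
        if match and match.group(1) in overrides:
            line = f"{match.group(1)}={overrides[match.group(1)]}"
        out.append(line)
    return "\n".join(out) + "\n"
```

A possible use is to read the script, call `override_shell_vars(text, {"batch_size": 4, "seq_length": 256})`, and write the result back before running `!bash training_scripts/run_sft.sh`. The values 4 and 256 are illustrative only.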

In [None]:
import os
os.getcwd()

'/content/drive/MyDrive/Colab Notebooks/OpenFedLLM'