Using CUDA Graphs in the PyTorch C++ Frontend

Note: The full source code for this tutorial can be found on GitHub.
Prerequisites:

- Using the PyTorch C++ Frontend
- CUDA semantics
- PyTorch 2.0 or later
- CUDA 11 or later

NVIDIA's CUDA Graphs have been a part of the CUDA Toolkit since the release of version 10. They greatly reduce CPU overhead, improving application performance.

In this tutorial, we will focus on using CUDA Graphs with the PyTorch C++ frontend. The C++ frontend is an important part of PyTorch use cases, utilized mostly in production and deployment applications.
Since their first appearance, CUDA Graphs have won the hearts of users and developers alike for being very performant and, at the same time, simple to use. In fact, CUDA Graphs are used by default in PyTorch 2.0's torch.compile to boost the productivity of both training and inference.

We will demonstrate CUDA Graphs usage on PyTorch's MNIST example. CUDA Graphs usage in LibTorch (the C++ frontend) is very similar to its Python counterpart, with some differences in syntax and functionality.

The main training loop consists of several steps, as depicted in the following code chunk:
for (auto& batch : data_loader) {
  auto data = batch.data.to(device);
  auto targets = batch.target.to(device);
  optimizer.zero_grad();
  auto output = model.forward(data);
  auto loss = torch::nll_loss(output, targets);
  loss.backward();
  optimizer.step();
}
The example above includes a forward pass, a backward pass, and weight updates.

In this tutorial, we will apply a CUDA Graph to all of the compute steps through whole-network graph capture. Before doing so, the source code needs a slight modification: the tensors used in the main training loop must be preallocated so they can be reused across iterations. Here is an example implementation:
torch::TensorOptions FloatCUDA =
    torch::TensorOptions(device).dtype(torch::kFloat);
torch::TensorOptions LongCUDA =
    torch::TensorOptions(device).dtype(torch::kLong);

torch::Tensor data = torch::zeros({kTrainBatchSize, 1, 28, 28}, FloatCUDA);
torch::Tensor targets = torch::zeros({kTrainBatchSize}, LongCUDA);
torch::Tensor output = torch::zeros({1}, FloatCUDA);
torch::Tensor loss = torch::zeros({1}, FloatCUDA);
for (auto& batch : data_loader) {
  data.copy_(batch.data);
  targets.copy_(batch.target);
  training_step(model, optimizer, data, targets, output, loss);
}
where ``training_step`` simply consists of a forward pass and a backward pass with the corresponding optimizer calls:
void training_step(
    Net& model,
    torch::optim::Optimizer& optimizer,
    torch::Tensor& data,
    torch::Tensor& targets,
    torch::Tensor& output,
    torch::Tensor& loss) {
  optimizer.zero_grad();
  output = model.forward(data);
  loss = torch::nll_loss(output, targets);
  loss.backward();
  optimizer.step();
}
PyTorch's CUDA Graphs API relies on stream capture, which in our case is used like this:
at::cuda::CUDAGraph graph;
at::cuda::CUDAStream captureStream = at::cuda::getStreamFromPool();
at::cuda::setCurrentCUDAStream(captureStream);
graph.capture_begin();
training_step(model, optimizer, data, targets, output, loss);
graph.capture_end();
Before the actual graph capture, it is important to run several warm-up iterations on a side stream, to prepare the CUDA cache as well as the CUDA libraries (such as cuBLAS and cuDNN) that will be used during training:
at::cuda::CUDAStream warmupStream = at::cuda::getStreamFromPool();
at::cuda::setCurrentCUDAStream(warmupStream);
for (int iter = 0; iter < num_warmup_iters; iter++) {
  training_step(model, optimizer, data, targets, output, loss);
}
After a successful graph capture, we can replace the ``training_step(model, optimizer, data, targets, output, loss);`` call with ``graph.replay();`` to perform the training step.
Taking the code for a spin, we see the following output from ordinary, non-graphed training:
$ time ./mnist
Train Epoch: 1 [59584/60000] Loss: 0.3921
Test set: Average loss: 0.2051 | Accuracy: 0.938
Train Epoch: 2 [59584/60000] Loss: 0.1826
Test set: Average loss: 0.1273 | Accuracy: 0.960
Train Epoch: 3 [59584/60000] Loss: 0.1796
Test set: Average loss: 0.1012 | Accuracy: 0.968
Train Epoch: 4 [59584/60000] Loss: 0.1603
Test set: Average loss: 0.0869 | Accuracy: 0.973
Train Epoch: 5 [59584/60000] Loss: 0.2315
Test set: Average loss: 0.0736 | Accuracy: 0.978
Train Epoch: 6 [59584/60000] Loss: 0.0511
Test set: Average loss: 0.0704 | Accuracy: 0.977
Train Epoch: 7 [59584/60000] Loss: 0.0802
Test set: Average loss: 0.0654 | Accuracy: 0.979
Train Epoch: 8 [59584/60000] Loss: 0.0774
Test set: Average loss: 0.0604 | Accuracy: 0.980
Train Epoch: 9 [59584/60000] Loss: 0.0669
Test set: Average loss: 0.0544 | Accuracy: 0.984
Train Epoch: 10 [59584/60000] Loss: 0.0219
Test set: Average loss: 0.0517 | Accuracy: 0.983
real 0m44.287s
user 0m44.018s
sys 0m1.116s
Training with CUDA Graphs produces the following output:
$ time ./mnist --use-train-graph
Train Epoch: 1 [59584/60000] Loss: 0.4092
Test set: Average loss: 0.2037 | Accuracy: 0.938
Train Epoch: 2 [59584/60000] Loss: 0.2039
Test set: Average loss: 0.1274 | Accuracy: 0.961
Train Epoch: 3 [59584/60000] Loss: 0.1779
Test set: Average loss: 0.1017 | Accuracy: 0.968
Train Epoch: 4 [59584/60000] Loss: 0.1559
Test set: Average loss: 0.0871 | Accuracy: 0.972
Train Epoch: 5 [59584/60000] Loss: 0.2240
Test set: Average loss: 0.0735 | Accuracy: 0.977
Train Epoch: 6 [59584/60000] Loss: 0.0520
Test set: Average loss: 0.0710 | Accuracy: 0.978
Train Epoch: 7 [59584/60000] Loss: 0.0935
Test set: Average loss: 0.0666 | Accuracy: 0.979
Train Epoch: 8 [59584/60000] Loss: 0.0744
Test set: Average loss: 0.0603 | Accuracy: 0.981
Train Epoch: 9 [59584/60000] Loss: 0.0762
Test set: Average loss: 0.0547 | Accuracy: 0.983
Train Epoch: 10 [59584/60000] Loss: 0.0207
Test set: Average loss: 0.0525 | Accuracy: 0.983
real 0m6.952s
user 0m7.048s
sys 0m0.619s
As we can see from the example above, just by applying a CUDA Graph to the very same MNIST example, we were able to improve training performance by more than six times (roughly 44.3 s down to 7.0 s of wall-clock time). Such a large improvement was possible because of the small model size: for larger models with heavy GPU usage, CPU overhead has less impact, so the gain will be smaller. Even in those cases, though, it is always advantageous to use CUDA Graphs to squeeze every bit of performance out of the GPU.