
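The larger the dataset, the slower training becomes and the heavier the load on the hardware. One remedy is to split the data into several smaller pieces (mini-batches) and feed them to the model one at a time.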

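This makes each update faster, but because an update no longer uses the whole dataset, its gradient is only an estimate and a single step can move in the wrong direction, so the cost fluctuates instead of decreasing smoothly. How this interacts with local minima is covered in the mini-batch part of Hands-On Machine Learning, which is a helpful reference here.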
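
The claim that a mini-batch only gives an estimate of the full gradient can be checked by hand. The sketch below assumes a one-parameter model `y = w * x` and four made-up samples (they are not from the notebook's CSV file); `mse_grad` is a hypothetical helper that evaluates the analytic gradient of the mean squared error.

```python
# Toy data (hypothetical values, not taken from the notebook's CSV file)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2), i.e. mean(2 * (w*x - y) * x)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0
full_grad = mse_grad(w, xs, ys)

# Split the four samples into two mini-batches of size 2
batch_grads = [mse_grad(w, xs[i:i + 2], ys[i:i + 2]) for i in range(0, len(xs), 2)]
mean_batch_grad = sum(batch_grads) / len(batch_grads)

print(full_grad)        # gradient computed from all four samples
print(batch_grads)      # each mini-batch gives a different estimate
print(mean_batch_grad)  # equal-sized batches average back to the full gradient
```

Each mini-batch gradient differs from the full one, which is why individual updates are noisy, while their average over equal-sized batches matches it.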

# Data loader

To load data in mini-batches, use the `DataLoader` class provided by `torch.utils.data`.

```
from torch.utils.data import DataLoader

dataloader = DataLoader(
  dataset,
  batch_size=2,
  shuffle=True,
)
```

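* batch_size
 * Size of each mini-batch
 * Conventionally a power of two, $2^{n}$

* shuffle
 * Option that reshuffles the dataset at every epoch
 * If it is set to False, the data is fed in the same order every epoch, and the model may end up learning that order. \
 Setting it to True is therefore recommended for training.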
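
As a rough illustration of how `batch_size` determines the number of updates per epoch, the sketch below builds a made-up dataset of 5 samples (the tensors `features` and `targets` are assumptions for illustration only) and inspects the batches that `DataLoader` yields.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 5 samples with 3 features each
features = torch.arange(15, dtype=torch.float32).reshape(5, 3)
targets = torch.arange(5, dtype=torch.float32).reshape(5, 1)
dataset = TensorDataset(features, targets)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# With the default drop_last=False, an epoch has ceil(N / batch_size) batches
n_batches = math.ceil(len(dataset) / 2)
assert len(dataloader) == n_batches

# The sample order changes with shuffle=True, but the batch sizes do not:
# the last batch simply holds the leftover sample
batch_sizes = [x_batch.shape[0] for x_batch, y_batch in dataloader]
print(len(dataloader), batch_sizes)
```

By the same arithmetic, 25 samples with `batch_size=2` give 13 batches per epoch, the last one containing a single sample.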

# Practice

## Downloading and skimming data

In [None]:
!git clone https://github.com/deeplearningzerotoall/PyTorch

Cloning into 'PyTorch'...
remote: Enumerating objects: 1899, done.
remote: Total 1899 (delta 0), reused 0 (delta 0), pack-reused 1899
Receiving objects: 100% (1899/1899), 80.33 MiB | 29.76 MiB/s, done.
Resolving deltas: 100% (242/242), done.


In [None]:
!find -name data-01-test-score.csv

./PyTorch/data-01-test-score.csv


In [None]:
import numpy as np

xy = np.loadtxt('./PyTorch/data-01-test-score.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

In [None]:
print(x_data.shape)
print(len(x_data))
print(x_data[:5])

(25, 3)
25
[[ 73.  80.  75.]
 [ 93.  88.  93.]
 [ 89.  91.  90.]
 [ 96.  98. 100.]
 [ 73.  66.  70.]]
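
The loading cell slices the target column with `[-1]` rather than `-1`. A small sketch of the difference, using a made-up array in place of the CSV contents:

```python
import numpy as np

# Hypothetical 2x4 array standing in for the loaded CSV (last column = target)
xy = np.arange(8, dtype=np.float32).reshape(2, 4)

x_part = xy[:, 0:-1]   # every column except the last -> shape (2, 3)
y_col = xy[:, [-1]]    # list index keeps the column axis -> shape (2, 1)
y_flat = xy[:, -1]     # integer index drops the axis -> shape (2,)

print(x_part.shape, y_col.shape, y_flat.shape)
```

Keeping the target as an `(N, 1)` column matches the `(N, 1)` output of a linear model with one output unit, so the squared-error cost compares the two element by element instead of broadcasting.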


## Imports

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f2f194d9450>

## Low-level implementation

In [None]:
# loading data
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)

# model initialize
W = torch.zeros((3, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# set opt
optimizer = optim.SGD([W, b], lr=1e-5)

nb_epochs = 20
for epoch in range(nb_epochs + 1):
  hypothesis = x_train.matmul(W) + b
  cost = torch.mean((hypothesis - y_train) ** 2)

  optimizer.zero_grad()
  cost.backward()
  optimizer.step()

  print('Epoch: {:4d}/{} Cost: {:.6f}'.format(epoch, nb_epochs, cost.item()))


Epoch:    0/20 Cost: 26811.960938
Epoch:    1/20 Cost: 9920.530273
Epoch:    2/20 Cost: 3675.298340
Epoch:    3/20 Cost: 1366.260620
Epoch:    4/20 Cost: 512.542419
Epoch:    5/20 Cost: 196.896667
Epoch:    6/20 Cost: 80.190987
Epoch:    7/20 Cost: 37.038692
Epoch:    8/20 Cost: 21.081343
Epoch:    9/20 Cost: 15.178762
Epoch:   10/20 Cost: 12.993679
Epoch:   11/20 Cost: 12.183031
Epoch:   12/20 Cost: 11.880532
Epoch:   13/20 Cost: 11.765952
Epoch:   14/20 Cost: 11.720860
Epoch:   15/20 Cost: 11.701438
Epoch:   16/20 Cost: 11.691511
Epoch:   17/20 Cost: 11.685116
Epoch:   18/20 Cost: 11.680007
Epoch:   19/20 Cost: 11.675385
Epoch:   20/20 Cost: 11.670945


## Low-level implementation 2 (all in one)

In [None]:
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

# loading data
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)
dataset = TensorDataset(x_train, y_train)

dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
)

In [None]:
# set model and optimizer
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

nb_epochs = 20    # more epochs generally lower the loss; 20 seems to be enough here
for epoch in range(nb_epochs + 1):
  for batch_idx, samples in enumerate(dataloader):
    # print(batch_idx)
    # print(samples)
    x_batch, y_batch = samples   # one mini-batch; does not overwrite the full x_train, y_train

    # compute H(x)
    prediction = model(x_batch)

    # compute cost
    cost = F.mse_loss(prediction, y_batch)

    # improve H(x) using the cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    print('Epoch: {:4d}/{} Batch: {}/{} Cost: {:.6f}'.format(epoch, nb_epochs, batch_idx+1, len(dataloader), cost.item()))


Epoch:    0/20 Batch: 1/13 Cost: 22717.306641
Epoch:    0/20 Batch: 2/13 Cost: 7065.685547
Epoch:    0/20 Batch: 3/13 Cost: 2125.653564
Epoch:    0/20 Batch: 4/13 Cost: 1564.352905
Epoch:    0/20 Batch: 5/13 Cost: 413.526459
Epoch:    0/20 Batch: 6/13 Cost: 108.073265
Epoch:    0/20 Batch: 7/13 Cost: 55.210655
Epoch:    0/20 Batch: 8/13 Cost: 20.286850
Epoch:    0/20 Batch: 9/13 Cost: 153.164841
Epoch:    0/20 Batch: 10/13 Cost: 44.974277
Epoch:    0/20 Batch: 11/13 Cost: 33.838169
Epoch:    0/20 Batch: 12/13 Cost: 2.966383
Epoch:    0/20 Batch: 13/13 Cost: 9.104765
Epoch:    1/20 Batch: 1/13 Cost: 5.657867
Epoch:    1/20 Batch: 2/13 Cost: 14.449411
Epoch:    1/20 Batch: 3/13 Cost: 13.327970
Epoch:    1/20 Batch: 4/13 Cost: 1.030128
Epoch:    1/20 Batch: 5/13 Cost: 70.678886
Epoch:    1/20 Batch: 6/13 Cost: 5.688546
Epoch:    1/20 Batch: 7/13 Cost: 1.989340
Epoch:    1/20 Batch: 8/13 Cost: 5.499257
Epoch:    1/20 Batch: 9/13 Cost: 2.556114
Epoch:    1/20 Batch: 10/13 Cost: 39.214111
Ep

In [None]:
# Test
new_var = torch.FloatTensor([[73, 80, 75]])   # an arbitrary input
pred_y = model(new_var)   # predicted y for that input

print(pred_y)

tensor([[151.5106]], grad_fn=<AddmmBackward>)
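
The `grad_fn` in the printed tensor shows that autograd is still tracking the prediction. For inference only, one common pattern is to wrap the call in `torch.no_grad()`. The sketch below uses a freshly initialized `nn.Linear(3, 1)` as a stand-in for the trained model, so its numeric output will not match the value above.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
model = nn.Linear(3, 1)   # untrained stand-in for the trained model (assumption)

new_var = torch.tensor([[73.0, 80.0, 75.0]])   # same arbitrary input as above

with torch.no_grad():     # disable gradient tracking during inference
    pred_y = model(new_var)

print(pred_y.requires_grad)   # no graph is attached to the result
print(pred_y.item())          # extract the value as a plain Python float
```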


## High-level implementation with nn.Module

In [None]:
class MultivariateLinearRegressionModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = nn.Linear(3, 1)

  def forward(self, x):
    return self.linear(x)

In [None]:
# data loading
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)

# initialize model
model = MultivariateLinearRegressionModel()

# set opt
optimizer = optim.SGD(model.parameters(), lr=1e-5)

nb_epochs = 20
for epoch in range(nb_epochs + 1):
  
  # H(x) 계산
  prediction = model(x_train)

  # cost 계산
  cost = F.mse_loss(prediction, y_train)

  # cost로 H(x) 개선
  optimizer.zero_grad()
  cost.backward()
  optimizer.step()

  print('Epoch: {:4d}/{} Cost: {:.6f}'.format(
      epoch, nb_epochs, cost.item()
  ))

Epoch:    0/20 Cost: 18751.404297
Epoch:    1/20 Cost: 6945.343262
Epoch:    2/20 Cost: 2580.307861
Epoch:    3/20 Cost: 966.426697
Epoch:    4/20 Cost: 369.723907
Epoch:    5/20 Cost: 149.099976
Epoch:    6/20 Cost: 67.522499
Epoch:    7/20 Cost: 37.354561
Epoch:    8/20 Cost: 26.194042
Epoch:    9/20 Cost: 22.061132
Epoch:   10/20 Cost: 20.526497
Epoch:   11/20 Cost: 19.952566
Epoch:   12/20 Cost: 19.733818
Epoch:   13/20 Cost: 19.646399
Epoch:   14/20 Cost: 19.607527
Epoch:   15/20 Cost: 19.586645
Epoch:   16/20 Cost: 19.572403
Epoch:   17/20 Cost: 19.560575
Epoch:   18/20 Cost: 19.549700
Epoch:   19/20 Cost: 19.539173
Epoch:   20/20 Cost: 19.528784
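
The loop above trains the `nn.Module` model on the full batch. A minimal sketch of combining the same class with `DataLoader` is given below. It assumes synthetic data in place of the CSV file (the tensors and the 0.67 factor are made up), runs only 5 epochs, and reports one sample-weighted mean cost per epoch instead of one line per batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

# Hypothetical synthetic data standing in for the CSV: 25 samples, 3 features
x_train = torch.rand(25, 3) * 100
y_train = x_train.sum(dim=1, keepdim=True) * 0.67

class MultivariateLinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 1)

    def forward(self, x):
        return self.linear(x)

model = MultivariateLinearRegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
dataloader = DataLoader(TensorDataset(x_train, y_train), batch_size=2, shuffle=True)

epoch_costs = []
for epoch in range(5):
    total = 0.0
    for x_batch, y_batch in dataloader:
        cost = F.mse_loss(model(x_batch), y_batch)

        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

        # weight each batch cost by its size so the smaller last batch counts correctly
        total += cost.item() * x_batch.shape[0]

    epoch_costs.append(total / len(dataloader.dataset))
    print('Epoch: {:4d} Mean cost: {:.6f}'.format(epoch, epoch_costs[-1]))
```

Averaging per epoch smooths out the batch-to-batch fluctuation seen in the mini-batch output earlier, which makes the trend easier to read.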
