# Lab 4-2: Load Data

Author: Seungjae Lee (이승재)

<div class="alert alert-warning">
    We use elemental PyTorch to implement linear regression here. However, in most actual applications, abstractions such as <code>nn.Module</code> or <code>nn.Linear</code> are used.
</div>

## Slicing 1D Array

In [1]:
nums = [0, 1, 2, 3, 4]

In [2]:
print(nums)

[0, 1, 2, 3, 4]


Take the elements from index 2 up to, but not including, index 4. (The start is inclusive, the end is exclusive.)

In [3]:
print(nums[2:4])

[2, 3]


Take everything from index 2 onward.

In [4]:
print(nums[2:])

[2, 3, 4]


Take everything before index 2. (Again, the end is exclusive.)

In [5]:
print(nums[:2])

[0, 1]


Take everything.

In [6]:
print(nums[:])

[0, 1, 2, 3, 4]


Take everything before the last index. (The end is exclusive!)

In [7]:
print(nums[:-1])

[0, 1, 2, 3]


Assigning to a slice works too!

In [8]:
nums[2:4] = [8, 9]

In [9]:
print(nums)

[0, 1, 8, 9, 4]
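
The slicing rules above can be collected into one small sketch. It uses only built-in lists; the names `letters`, `head`, and `every_other` are illustrative and do not appear in the lab itself.

```python
# Illustrative names only; not part of the original lab.
letters = ['a', 'b', 'c', 'd', 'e']

# A slice of a list is a new list, so editing it leaves the original alone.
head = letters[:2]
head[0] = 'z'

# Slices also accept a step: every second element, starting at index 0.
every_other = letters[::2]

# Slice assignment may change the length of the list.
letters[1:3] = ['x']

print(head)         # ['z', 'b']
print(every_other)  # ['a', 'c', 'e']
print(letters)      # ['a', 'x', 'd', 'e']
```

Note that this copy behaviour is specific to Python lists; NumPy arrays behave differently when sliced.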


## Slicing 2D Array

In [10]:
import numpy as np

In [11]:
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [12]:
print(b)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [13]:
b[:, 1]

array([ 2,  6, 10])

In [14]:
b[-1]

array([ 9, 10, 11, 12])

In [15]:
b[-1, :]

array([ 9, 10, 11, 12])

In [16]:
b[-1, ...]

array([ 9, 10, 11, 12])

In [17]:
b[0:2, :]

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
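
Two details of the 2D slices above are easy to miss: an integer index drops an axis while a list index keeps it, and a basic slice of a NumPy array is a view rather than a copy. The following is a minimal sketch of both; the array name `grid` is chosen here only to avoid clashing with `b`.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# An integer index drops that axis; a list index keeps it.
col_1d = grid[:, 1]     # shape (3,)
col_2d = grid[:, [1]]   # shape (3, 1)

# Basic slices are views: writing through the view changes `grid`.
top = grid[0:2, :]
top[0, 0] = 100

print(col_1d.shape, col_2d.shape)   # (3,) (3, 1)
print(grid[0, 0])                   # 100
print(np.shares_memory(top, grid))  # True
```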

## Loading Data from `.csv` file

In [25]:
import numpy as np
import pandas as pd

In [35]:
!ls

 CNN
 CONTRIBUTING.md
 custom_data
 data-01-test-score.csv
 data-02-stock_daily.csv
 data-03-diabetes.csv
 data-04-zoo.csv
 docker
 docker_user_guide.md
 figs
 figures
 lab-01_tensor_manipulation.ipynb
 lab-02_linear_regression.ipynb
 lab-03_minimizing_cost.ipynb
 lab-04_1_multivariable_linear_regression.ipynb
 lab-04_2_load_data.ipynb
 lab-05_logistic_classification.ipynb
 lab-06_1_softmax_classification.ipynb
 lab-06_2_fancy_softmax_classification.ipynb
 lab-07_1_tips.ipynb
 lab-07_2_mnist_introduction.ipynb
 lab-08_1_xor.ipynb
 lab-08_2_xor_nn.ipynb
 lab-08_3_xor_nn_wide_deep.ipynb
 lab-08_4_mnist_back_prop.ipynb
 lab-09_1_mnist_softmax.ipynb
 lab-09_2_mnist_nn.ipynb
 lab-09_3_mnist_nn_xavier.ipynb
 lab-09_4_mnist_nn_deep.ipynb
 lab-09_5_mnist_nn_dropout.ipynb
 lab-09_6_mnist_batchnorm.ipynb
'lab-09_7_mnist_nn_selu(wip).ipynb'
 lab-10_1_mnist_cnn.ipynb
 lab-10_2_mnist_deep_cnn.ipynb
 lab-10_3_1_visdom-example.ipynb
'lab-10_3_2_MNIST-CNN with Visdom.ipynb'
 lab-10_4_1_ImageFolder_1.i

In [34]:
%cd d_competition

/content/drive/MyDrive/d_competition


In [38]:
xy = np.loadtxt('data-01-test-score.csv', delimiter=',', dtype=np.float32)

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [39]:
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

In [40]:
print(x_data.shape) # shape of x_data
print(len(x_data))  # number of rows in x_data
print(x_data[:5])   # first five rows

(25, 3)
25
[[ 73.  80.  75.]
 [ 93.  88.  93.]
 [ 89.  91.  90.]
 [ 96.  98. 100.]
 [ 73.  66.  70.]]


In [41]:
print(y_data.shape) # shape of y_data
print(len(y_data))  # number of rows in y_data
print(y_data[:5])   # first five rows

(25, 1)
25
[[152.]
 [185.]
 [180.]
 [196.]
 [142.]]
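
The split into `x_data` and `y_data` can be tried without the CSV file. The sketch below assumes the file layout implied by the printed rows (three feature columns followed by one target column) and feeds the first five rows to `np.loadtxt` through an in-memory buffer. The names `csv_text`, `sample`, `features`, and `target` are illustrative.

```python
import io

import numpy as np

# The first five rows printed above, written as CSV text (features..., target).
csv_text = """73,80,75,152
93,88,93,185
89,91,90,180
96,98,100,196
73,66,70,142
"""

# np.loadtxt accepts a file-like object as well as a path.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=np.float32)

features = sample[:, :-1]    # all columns except the last -> shape (5, 3)
target = sample[:, [-1]]     # list index keeps the column axis -> shape (5, 1)
target_flat = sample[:, -1]  # integer index drops it -> shape (5,)

print(features.shape, target.shape, target_flat.shape)
```

Keeping the target as a column of shape `(N, 1)` matters later, because the model's prediction also has shape `(N, 1)` and the two are subtracted element by element.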


## Imports

In [42]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [43]:
# For reproducibility
torch.manual_seed(1)

<torch._C.Generator at 0x7f02c60d6a30>

## Low-level Implementation

In [44]:
# Data
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)
# Initialize the model
W = torch.zeros((3, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# Set up the optimizer
optimizer = optim.SGD([W, b], lr=1e-5)

nb_epochs = 20
for epoch in range(nb_epochs + 1):

    # Compute H(x)
    hypothesis = x_train.matmul(W) + b # or .mm or @

    # Compute the cost
    cost = torch.mean((hypothesis - y_train) ** 2)

    # Improve H(x) using the cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log the cost every epoch
    print('Epoch {:4d}/{} Cost: {:.6f}'.format(
        epoch, nb_epochs, cost.item()
    ))

Epoch    0/20 Cost: 26811.960938
Epoch    1/20 Cost: 9920.530273
Epoch    2/20 Cost: 3675.298340
Epoch    3/20 Cost: 1366.260498
Epoch    4/20 Cost: 512.542480
Epoch    5/20 Cost: 196.896637
Epoch    6/20 Cost: 80.190987
Epoch    7/20 Cost: 37.038696
Epoch    8/20 Cost: 21.081343
Epoch    9/20 Cost: 15.178760
Epoch   10/20 Cost: 12.993679
Epoch   11/20 Cost: 12.183023
Epoch   12/20 Cost: 11.880535
Epoch   13/20 Cost: 11.765958
Epoch   14/20 Cost: 11.720851
Epoch   15/20 Cost: 11.701438
Epoch   16/20 Cost: 11.691514
Epoch   17/20 Cost: 11.685116
Epoch   18/20 Cost: 11.680005
Epoch   19/20 Cost: 11.675380
Epoch   20/20 Cost: 11.670952
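
To see what `cost.backward()` computes in the loop above, the gradients of the mean squared error can be written out by hand and compared with autograd. For N rows they are (2/N) Xᵀ(XW + b − y) with respect to `W`, and 2 · mean(XW + b − y) with respect to `b`. The sketch below is a sanity check on the first five rows only, not part of the training code; `manual_grad_W` and `manual_grad_b` are names introduced here.

```python
import torch

# First five rows of the data shown earlier.
x = torch.tensor([[73., 80., 75.],
                  [93., 88., 93.],
                  [89., 91., 90.],
                  [96., 98., 100.],
                  [73., 66., 70.]])
y = torch.tensor([[152.], [185.], [180.], [196.], [142.]])

W = torch.zeros((3, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Same cost as in the loop above; autograd fills W.grad and b.grad.
cost = torch.mean((x @ W + b - y) ** 2)
cost.backward()

# Hand-derived gradients of the mean squared error.
with torch.no_grad():
    error = x @ W + b - y                        # shape (5, 1)
    manual_grad_W = 2 * x.T @ error / len(x)     # shape (3, 1)
    manual_grad_b = 2 * error.mean().reshape(1)  # shape (1,)

print(torch.allclose(W.grad, manual_grad_W, rtol=1e-4))
print(torch.allclose(b.grad, manual_grad_b, rtol=1e-4))
```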


## High-level Implementation with `nn.Module`

In [45]:
class MultivariateLinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 1)

    def forward(self, x):
        return self.linear(x)

In [46]:
nn.Linear(3,1)

Linear(in_features=3, out_features=1, bias=True)
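
`nn.Linear(3, 1)` holds the same two parameters as the low-level version, only stored slightly differently. The sketch below inspects them and checks that the layer output matches the hand-written form; the names `layer` and `by_hand` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
layer = nn.Linear(3, 1)

# weight is stored as (out_features, in_features); bias as (out_features,).
print(layer.weight.shape)  # torch.Size([1, 3])
print(layer.bias.shape)    # torch.Size([1])

x = torch.tensor([[73., 80., 75.],
                  [93., 88., 93.]])

# The layer computes x @ weight.T + bias, the same form as x.matmul(W) + b.
by_hand = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), by_hand))
```

Unlike the zero-initialized `W` and `b` of the low-level version, `nn.Linear` starts from random values, which is one reason the first logged cost differs between the two runs.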

In [51]:
# Data
x_train = torch.FloatTensor(x_data)
y_train = torch.FloatTensor(y_data)
# Initialize the model
model = MultivariateLinearRegressionModel()
# Set up the optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-5)

nb_epochs = 20
for epoch in range(nb_epochs + 1):

    # Compute H(x)
    prediction = model(x_train)

    # Compute the cost
    cost = F.mse_loss(prediction, y_train)

    # Improve H(x) using the cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log the cost every epoch
    print('Epoch {:4d}/{} Cost: {:.6f}'.format(
        epoch, nb_epochs, cost.item()
    ))

Epoch    0/20 Cost: 22444.050781
Epoch    1/20 Cost: 8304.253906
Epoch    2/20 Cost: 3076.377197
Epoch    3/20 Cost: 1143.485474
Epoch    4/20 Cost: 428.840912
Epoch    5/20 Cost: 164.614960
Epoch    6/20 Cost: 66.922005
Epoch    7/20 Cost: 30.800661
Epoch    8/20 Cost: 17.444128
Epoch    9/20 Cost: 12.504412
Epoch   10/20 Cost: 10.676523
Epoch   11/20 Cost: 9.999249
Epoch   12/20 Cost: 9.747339
Epoch   13/20 Cost: 9.652740
Epoch   14/20 Cost: 9.616273
Epoch   15/20 Cost: 9.601300
Epoch   16/20 Cost: 9.594303
Epoch   17/20 Cost: 9.590237
Epoch   18/20 Cost: 9.587271
Epoch   19/20 Cost: 9.584668
Epoch   20/20 Cost: 9.582247


## Dataset and DataLoader

<div class="alert alert-warning">
    Some basic knowledge of pandas will probably be needed here.
</div>

In [47]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self):
        self.x_data = [[73, 80, 75],
                       [93, 88, 93],
                       [89, 91, 90],
                       [96, 98, 100],
                       [73, 66, 70]]
        self.y_data = [[152], [185], [180], [196], [142]]

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        x = torch.FloatTensor(self.x_data[idx])
        y = torch.FloatTensor(self.y_data[idx])
        return x, y

dataset = CustomDataset()

In [49]:
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True
)
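
Before training, it can help to check what the dataset and loader actually yield. The sketch below rebuilds a stand-in for the dataset above under the hypothetical name `FiveRowDataset` so that it runs on its own, and turns shuffling off so the batch order is predictable.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FiveRowDataset(Dataset):  # hypothetical stand-in for CustomDataset above
    def __init__(self):
        self.x_data = [[73, 80, 75], [93, 88, 93], [89, 91, 90],
                       [96, 98, 100], [73, 66, 70]]
        self.y_data = [[152], [185], [180], [196], [142]]

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        return (torch.FloatTensor(self.x_data[idx]),
                torch.FloatTensor(self.y_data[idx]))


demo_dataset = FiveRowDataset()
demo_loader = DataLoader(demo_dataset, batch_size=2, shuffle=False)

# 5 samples with batch_size=2 -> batches of 2, 2 and 1 samples.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in demo_loader]

print(len(demo_dataset))  # 5
print(len(demo_loader))   # 3
print(batch_shapes)       # [((2, 3), (2, 1)), ((2, 3), (2, 1)), ((1, 3), (1, 1))]
```

The loader stacks the per-sample tensors returned by `__getitem__` into one tensor per batch, so the last batch is smaller when the dataset size is not a multiple of `batch_size`.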

In [53]:
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    for batch_idx, samples in enumerate(dataloader):

        # Compute H(x)
        # Note: `samples` is not used here; every step runs on the full
        # x_train / y_train tensors, which is what the logged costs reflect.
        prediction = model(x_train)

        # Compute the cost
        cost = F.mse_loss(prediction, y_train)

        # Improve H(x) using the cost
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

        # Log the cost for every batch
        print('Epoch {:4d}/{} Batch{}/{} Cost: {:.6f}'.format(
            epoch, nb_epochs, batch_idx+1, len(dataloader), cost.item()
        ))

Epoch    0/20 Batch1/3 Cost: 9.435888
Epoch    0/20 Batch2/3 Cost: 9.433691
Epoch    0/20 Batch3/3 Cost: 9.431454
Epoch    1/20 Batch1/3 Cost: 9.429218
Epoch    1/20 Batch2/3 Cost: 9.426997
Epoch    1/20 Batch3/3 Cost: 9.424771
Epoch    2/20 Batch1/3 Cost: 9.422530
Epoch    2/20 Batch2/3 Cost: 9.420312
Epoch    2/20 Batch3/3 Cost: 9.418097
Epoch    3/20 Batch1/3 Cost: 9.415869
Epoch    3/20 Batch2/3 Cost: 9.413650
Epoch    3/20 Batch3/3 Cost: 9.411441
Epoch    4/20 Batch1/3 Cost: 9.409225
Epoch    4/20 Batch2/3 Cost: 9.407017
Epoch    4/20 Batch3/3 Cost: 9.404799
Epoch    5/20 Batch1/3 Cost: 9.402602
Epoch    5/20 Batch2/3 Cost: 9.400373
Epoch    5/20 Batch3/3 Cost: 9.398164
Epoch    6/20 Batch1/3 Cost: 9.395971
Epoch    6/20 Batch2/3 Cost: 9.393768
Epoch    6/20 Batch3/3 Cost: 9.391559
Epoch    7/20 Batch1/3 Cost: 9.389375
Epoch    7/20 Batch2/3 Cost: 9.387171
Epoch    7/20 Batch3/3 Cost: 9.384965
Epoch    8/20 Batch1/3 Cost: 9.382771
Epoch    8/20 Batch2/3 Cost: 9.380582
Epoch    8/2
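
The loop above iterates over `dataloader` but computes the cost on the full `x_train` and `y_train`, so the batches themselves never reach the model. A minimal sketch of a loop that trains on each mini-batch instead is shown below. It is self-contained, so it uses `TensorDataset` and fresh names (`batch_loader`, `batch_model`, `batch_optimizer`) rather than the objects defined earlier, and it runs for only two epochs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

x_all = torch.tensor([[73., 80., 75.], [93., 88., 93.], [89., 91., 90.],
                      [96., 98., 100.], [73., 66., 70.]])
y_all = torch.tensor([[152.], [185.], [180.], [196.], [142.]])

# TensorDataset is used here only to keep the sketch short.
batch_loader = DataLoader(TensorDataset(x_all, y_all), batch_size=2, shuffle=True)

batch_model = nn.Linear(3, 1)
batch_optimizer = optim.SGD(batch_model.parameters(), lr=1e-5)

seen_batch_sizes = []
for epoch in range(2):
    for batch_idx, samples in enumerate(batch_loader):
        # Unpack the mini-batch instead of reusing the full training tensors.
        x_batch, y_batch = samples

        prediction = batch_model(x_batch)
        cost = F.mse_loss(prediction, y_batch)

        batch_optimizer.zero_grad()
        cost.backward()
        batch_optimizer.step()

        seen_batch_sizes.append(len(x_batch))
        print('Epoch {:4d} Batch {}/{} Cost: {:.6f}'.format(
            epoch, batch_idx + 1, len(batch_loader), cost.item()))
```

With this change the cost is computed on a different subset at every step, so the logged values will typically fluctuate from batch to batch rather than decrease smoothly.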

When the data is too large, we cannot load all of `x_data` and `y_data` at once; the only option is to load just the batches we need.
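
One possible way to load only what each batch needs is to keep the data on disk and read a single row inside `__getitem__`. The sketch below is an assumption-laden illustration, not the lab's method: `LazyCSVDataset` is a hypothetical class, the CSV is assumed to have no header and to end each row with the target value, and a temporary file stands in for a large dataset.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class LazyCSVDataset(Dataset):
    """Hypothetical dataset that reads one CSV row from disk per __getitem__."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # byte offset of the start of each non-empty row
        with open(path, 'rb') as f:
            offset = f.tell()
            line = f.readline()
            while line:
                if line.strip():
                    self.offsets.append(offset)
                offset = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Only the requested row is read and parsed.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[idx])
            values = [float(v) for v in f.readline().decode().split(',')]
        x = torch.tensor(values[:-1], dtype=torch.float32)
        y = torch.tensor(values[-1:], dtype=torch.float32)
        return x, y


# Small temporary file standing in for a dataset too large for memory.
rows = ['73,80,75,152', '93,88,93,185', '89,91,90,180',
        '96,98,100,196', '73,66,70,142']
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('\n'.join(rows) + '\n')
    tmp_path = tmp.name

lazy_dataset = LazyCSVDataset(tmp_path)
lazy_loader = DataLoader(lazy_dataset, batch_size=2, shuffle=False)
first_x, first_y = next(iter(lazy_loader))
os.remove(tmp_path)

print(len(lazy_dataset))             # 5
print(first_x.shape, first_y.shape)  # torch.Size([2, 3]) torch.Size([2, 1])
```

Only the row offsets are held in memory, so the memory cost grows with the number of rows rather than with the full size of the data. Reopening the file on every access is simple but slow; it is a trade-off made here for clarity.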

[PyTorch Data Loading and Processing tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#iterating-through-the-dataset)