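## Multilayer perceptron
- A single-layer perceptron cannot solve the XOR problem, because XOR is not linearly separable.
- Solving XOR requires a multilayer perceptron (MLP).
- An MLP is a network built from several stacked layers.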
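
The claim above can be checked with a small sketch. It searches a coarse grid of weights for a single threshold unit that reproduces XOR, then evaluates a hand-wired two-layer network. The grid range, the `step` activation, and the OR/AND wiring are illustrative assumptions, not part of the original notebook.

```python
from itertools import product

# XOR truth table: (inputs, target)
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def step(z):
    # Threshold activation of a classic perceptron
    return 1 if z > 0 else 0

def unit(x, w, b):
    # One perceptron unit: weighted sum followed by the threshold
    return step(w[0] * x[0] + w[1] * x[1] + b)

# Try every (w1, w2, b) on a grid from -4.0 to 4.0 in steps of 0.5 (assumed range)
grid = [i / 2 for i in range(-8, 9)]
single_layer_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(unit(x, (w1, w2), b) == y for x, y in XOR)
]

def two_layer(x):
    # Hidden layer: one unit computes OR, the other computes AND
    h_or = unit(x, (1, 1), -0.5)
    h_and = unit(x, (1, 1), -1.5)
    # Output layer: fires when OR is true and AND is false, which is XOR
    return unit((h_or, h_and), (1, -1), -0.5)

print("single-layer solutions found:", len(single_layer_solutions))
print("two-layer outputs:", [two_layer(x) for x, _ in XOR])
```

The grid search only illustrates the point. The general result follows from the four inequalities a single unit would have to satisfy, which contradict each other.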

## Backpropagation
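- Backpropagation makes it possible to train an MLP.
- Forward pass: given an input x, the network computes its output.
- Backpropagation algorithm: compute the derivative of the loss (cost) with respect to every weight in the network, then use those gradients to update the weights so that the loss decreases, working from the last layer back toward the first.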
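
As a minimal sketch of the chain rule, consider a single sigmoid unit with one weight. The values of `x`, `y`, `w`, and `b` are made up for illustration. The analytic gradient is compared against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_hat, y):
    # Binary cross entropy for one sample
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical sample and parameters
x, y = 2.0, 1.0
w, b = 0.3, -0.1

def loss(w_, b_):
    return bce(sigmoid(w_ * x + b_), y)

# Chain rule: dJ/dw = dJ/dy_hat * dy_hat/dz * dz/dw
y_hat = sigmoid(w * x + b)
dJ_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))
dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
dJ_dw = dJ_dyhat * dyhat_dz * x  # dz/dw = x
dJ_db = dJ_dyhat * dyhat_dz      # dz/db = 1

# Central finite differences as an independent check
eps = 1e-6
num_dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
num_db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print("analytic:", dJ_dw, dJ_db)
print("numeric: ", num_dw, num_db)

# One gradient descent step with an assumed learning rate of 0.1
w_new = w - 0.1 * dJ_dw
b_new = b - 0.1 * dJ_db
```

For a sigmoid output with BCE the product of the first two factors simplifies to `y_hat - y`, which is why the update moves the prediction toward the target.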

In [1]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# for reproducibility
torch.manual_seed(777)
if device == 'cuda':
    torch.cuda.manual_seed_all(777)

In [14]:
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)

# Parameters for two layers (equivalent to stacking two nn.Linear layers).
# Initialize from a standard normal distribution; torch.Tensor(2, 2) would
# only allocate uninitialized memory, so it is not used here.
w1 = torch.randn(2, 2).to(device)  # layer 1 weights: 2 inputs -> 2 hidden units
b1 = torch.randn(2).to(device)
w2 = torch.randn(2, 1).to(device)  # layer 2 weights: 2 hidden units -> 1 output
b2 = torch.randn(1).to(device)

def sigmoid(x):
    return 1.0 / (1.0 + torch.exp(-x))

def sigmoid_prime(x):  # derivative of the sigmoid
    return sigmoid(x) * (1 - sigmoid(x))

In [6]:
l1 = torch.add(torch.matmul(X, w1), b1)
a1 = sigmoid(l1)
l2 = torch.add(torch.matmul(a1, w2), b2)
Y_pred = sigmoid(l2)

1. forward propagation
$$\begin{aligned}
L_1 &= X W_1 + b_1 \\
A_1 &= \sigma(L_1) \\
L_2 &= A_1 W_2 + b_2 \\
\hat{Y} &= \sigma(L_2)
\end{aligned}$$

2. BCE (binary cross entropy)
$$J = - \sum \left[ Y \log(\hat{Y}) + (1 - Y) \log(1 - \hat{Y}) \right]$$
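
To get a feel for the scale of this loss, the sketch below evaluates it on the four XOR targets for two hypothetical prediction vectors. The formula above sums over samples, while the training loop uses `torch.mean`; the mean is the sum divided by the number of samples.

```python
import math

def bce_terms(y_true, y_pred):
    # Per-sample binary cross entropy terms
    return [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    ]

Y = [0, 1, 1, 0]                       # XOR targets
undecided = [0.5, 0.5, 0.5, 0.5]       # hypothetical: network outputs 0.5 everywhere
confident = [0.01, 0.99, 0.99, 0.01]   # hypothetical: nearly correct outputs

sum_undecided = sum(bce_terms(Y, undecided))
mean_undecided = sum_undecided / len(Y)
mean_confident = sum(bce_terms(Y, confident)) / len(Y)

print("sum, undecided:  ", sum_undecided)
print("mean, undecided: ", mean_undecided)
print("mean, confident: ", mean_confident)
```

A network that outputs 0.5 for every input has a mean BCE of ln 2, about 0.6931, which is the plateau value that appears in the training logs of the `nn` versions before the cost starts to fall.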

3. backpropagation (chain rule)
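- Loss derivative (d_Y_pred): derivative of the loss with respect to the prediction. The code adds 1e-7 to the denominator to avoid division by zero. $$\frac{\partial J}{\partial \hat{Y}} = \frac{\hat{Y} - Y}{\hat{Y}(1 - \hat{Y})}$$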

In [None]:
d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)

- Layer 2 delta (d_l2): derivative with respect to the output layer's pre-activation $L_2$ (the error term). $$\frac{\partial J}{\partial L_2} = \frac{\partial J}{\partial \hat{Y}} \odot \sigma'(L_2)$$

In [None]:
d_l2 = d_Y_pred * sigmoid_prime(l2)

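- Layer 2 gradients (d_b2, d_w2): gradients for the output-layer weights ($W_2$) and bias ($b_2$). The bias gradient is kept per sample here and reduced over the batch in the update step. $$\begin{aligned}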
\frac{\partial J}{\partial b_2} &= \frac{\partial J}{\partial L_2} \\
\frac{\partial J}{\partial W_2} &= A_1^T \cdot \frac{\partial J}{\partial L_2}
\end{aligned}$$

In [None]:
d_b2 = d_l2
# torch.transpose(t, 0, 1) swaps dimensions 0 and 1, e.g. shape (10, 5) -> (5, 10)
d_w2 = torch.matmul(torch.transpose(a1, 0, 1), d_b2) 

- Layer 1 activation gradient (d_a1): error propagated back to the hidden layer's output $A_1$. $$\frac{\partial J}{\partial A_1} = \frac{\partial J}{\partial L_2} \cdot W_2^T$$

In [None]:
d_a1 = torch.matmul(d_b2, torch.transpose(w2, 0, 1))

- Layer 1 delta (d_l1): derivative with respect to the hidden layer's pre-activation $L_1$. $$\frac{\partial J}{\partial L_1} = \frac{\partial J}{\partial A_1} \odot \sigma'(L_1)$$

In [None]:
d_l1 = d_a1 * sigmoid_prime(l1)

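- Layer 1 gradients (d_b1, d_w1): gradients for the hidden-layer weights ($W_1$) and bias ($b_1$). As with layer 2, the bias gradient is kept per sample and reduced over the batch in the update step. $$\begin{aligned}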
\frac{\partial J}{\partial b_1} &= \frac{\partial J}{\partial L_1} \\
\frac{\partial J}{\partial W_1} &= X^T \cdot \frac{\partial J}{\partial L_1}
\end{aligned}$$

In [None]:
d_b1 = d_l1
d_w1 = torch.matmul(torch.transpose(X, 0, 1), d_b1)
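
One way to sanity-check the derivation is to compare the hand-written gradients with `torch.autograd`. The sketch below is self-contained and makes two assumptions that differ from the notebook cells: the loss is the summed BCE from the formula (not the mean), and the bias gradients are summed over the batch so that they match what autograd reports. The seed and variable names are chosen only for this check.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this check

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

w1 = torch.randn(2, 2, requires_grad=True)
b1 = torch.randn(2, requires_grad=True)
w2 = torch.randn(2, 1, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

# Forward pass
l1 = X @ w1 + b1
a1 = torch.sigmoid(l1)
l2 = a1 @ w2 + b2
Y_pred = torch.sigmoid(l2)

# Summed BCE, differentiated by autograd
J = -(Y * torch.log(Y_pred) + (1 - Y) * torch.log(1 - Y_pred)).sum()
J.backward()

# Manual backpropagation following the chain rule steps above
with torch.no_grad():
    d_Y_pred = (Y_pred - Y) / (Y_pred * (1 - Y_pred))
    d_l2 = d_Y_pred * Y_pred * (1 - Y_pred)  # sigma'(l2) = Y_pred * (1 - Y_pred)
    d_w2 = a1.T @ d_l2
    d_b2 = d_l2.sum(0)                       # reduce over the batch
    d_a1 = d_l2 @ w2.T
    d_l1 = d_a1 * a1 * (1 - a1)              # sigma'(l1) = a1 * (1 - a1)
    d_w1 = X.T @ d_l1
    d_b1 = d_l1.sum(0)                       # reduce over the batch

manual = {"w1": d_w1, "b1": d_b1, "w2": d_w2, "b2": d_b2}
auto = {"w1": w1.grad, "b1": b1.grad, "w2": w2.grad, "b2": b2.grad}
for name in manual:
    print(name, torch.allclose(manual[name], auto[name], atol=1e-4))
```

If every line prints `True`, the manual formulas agree with autograd for this loss. The training loop in the notebook averages the bias gradients instead of summing them, which only rescales the bias step size.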

In [15]:
learning_rate = 1
for step in range(10001):
    # forward
    l1 = torch.add(torch.matmul(X, w1), b1)
    a1 = sigmoid(l1)
    l2 = torch.add(torch.matmul(a1, w2), b2)
    Y_pred = sigmoid(l2)

    # BCE
    cost = -torch.mean(Y * torch.log(Y_pred) + (1 - Y) * torch.log(1 - Y_pred))

    # back prop (chain rule)
    # derivative of the loss
    d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)

    # Layer 2
    d_l2 = d_Y_pred * sigmoid_prime(l2)
    d_b2 = d_l2
    d_w2 = torch.matmul(torch.transpose(a1, 0, 1), d_b2)

    # Layer 1
    d_a1 = torch.matmul(d_b2, torch.transpose(w2, 0, 1))  # send the layer 2 error backward through the layer 2 weights
    d_l1 = d_a1 * sigmoid_prime(l1)
    d_b1 = d_l1
    d_w1 = torch.matmul(torch.transpose(X, 0, 1), d_b1)

    # weight update (gradient descent)
    w1 = w1 - learning_rate * d_w1
    b1 = b1 - learning_rate * torch.mean(d_b1, 0)
    w2 = w2 - learning_rate * d_w2
    b2 = b2 - learning_rate * torch.mean(d_b2, 0)

    if step % 100 == 0:
        print(step, cost.item())

0 0.9613491296768188
100 0.13873937726020813
200 0.0327175110578537
300 0.017057759687304497
400 0.011352622881531715
500 0.008457552641630173
600 0.006719966884702444
700 0.005565759725868702
800 0.004745173268020153
900 0.0041327145881950855
1000 0.0036585028283298016
1100 0.0032807989045977592
1200 0.002972976304590702
1300 0.0027174088172614574
1400 0.002501869108527899
1500 0.00231768935918808
1600 0.0021585593931376934
1700 0.002019699662923813
1800 0.0018974419217556715
1900 0.0017890152521431446
2000 0.0016922496724873781
2100 0.0016052747378125787
2200 0.0015267736744135618
2300 0.0014555497327819467
2400 0.0013905862579122186
2500 0.0013311801012605429
2600 0.0012765987776219845
2700 0.0012262744130566716
2800 0.001179698621854186
2900 0.001136572565883398
3000 0.0010964032262563705
3100 0.001059040892869234
3200 0.0010240973206236959
3300 0.0009913333924487233
3400 0.0009606146486476064
3500 0.000931702321395278
3600 0.0009044915204867721
3700 0.0008788480190560222
3800 0.00

# XOR - nn
- Uses two layers.
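
For comparison with the manual `w1` and `b1` tensors, the sketch below shows what a single `torch.nn.Linear` layer computes. The input values and the layer size `(2, 10)` are arbitrary choices for illustration. Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, so the manual equivalent multiplies by the transpose.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this illustration

layer = torch.nn.Linear(2, 10, bias=True)  # 2 inputs -> 10 outputs
x = torch.tensor([[0., 1.], [1., 1.]])     # two example input rows

# nn.Linear computes x @ W^T + b
manual = x @ layer.weight.T + layer.bias

print(tuple(layer.weight.shape))  # (out_features, in_features)
print(tuple(layer.bias.shape))
print(torch.allclose(layer(x), manual, atol=1e-6))
```

`nn.Linear` also initializes its parameters on construction, so no explicit `torch.randn` call is needed in the `nn` version.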

In [16]:
linear1 = torch.nn.Linear(2, 2, bias=True)
linear2 = torch.nn.Linear(2, 1, bias=True)

sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid).to(device)

criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)

    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()

    if step % 100 == 0:
        print(step, cost.item())

0 0.7763229608535767
100 0.6930165886878967
200 0.692973792552948
300 0.6929107308387756
400 0.6928110718727112
500 0.6926401853561401
600 0.6923158764839172
700 0.6916120648384094
800 0.6897789239883423
900 0.6838115453720093
1000 0.6620278358459473
1100 0.6086074709892273
1200 0.48274874687194824
1300 0.21770578622817993
1400 0.11606816202402115
1500 0.07570721954107285
1600 0.05526117980480194
1700 0.04318012297153473
1800 0.03528796508908272
1900 0.029761184006929398
2000 0.025689953938126564
2100 0.02257372811436653
2200 0.02011588029563427
2300 0.018130186945199966
2400 0.016493942588567734
2500 0.015123352408409119
2600 0.013959166593849659
2700 0.012958543375134468
2800 0.012089530006051064
2900 0.011327981948852539
3000 0.010655341669917107
3100 0.010056989267468452
3200 0.009521331638097763
3300 0.009039136581122875
3400 0.008602828718721867
3500 0.008206162601709366
3600 0.007843991741538048
3700 0.007512134499847889
3800 0.00720686512067914
3900 0.00692514143884182
4000 0.0

In [17]:
# Accuracy computation
# True if hypothesis>0.5 else False
with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    accuracy = (predicted == Y).float().mean()
    print('\nHypothesis: ', hypothesis.detach().cpu().numpy(), '\nCorrect: ', predicted.detach().cpu().numpy(), '\nAccuracy: ', accuracy.item())


Hypothesis:  [[0.00232873]
 [0.99831223]
 [0.99831223]
 [0.00239239]] 
Correct:  [[0.]
 [1.]
 [1.]
 [0.]] 
Accuracy:  1.0


# XOR - nn - wide - deep

In [18]:
# nn layers
linear1 = torch.nn.Linear(2, 10, bias=True)
linear2 = torch.nn.Linear(10, 10, bias=True)
linear3 = torch.nn.Linear(10, 10, bias=True)
linear4 = torch.nn.Linear(10, 1, bias=True)
sigmoid = torch.nn.Sigmoid()

In [19]:
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid, linear3, sigmoid, linear4, sigmoid).to(device)

In [21]:
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)

In [22]:
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)

    # cost / loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()

    if step % 100 == 0:
        print(step, cost.item())

0 0.702433705329895
100 0.6930965185165405
200 0.6930878162384033
300 0.6930782794952393
400 0.6930678486824036
500 0.6930562257766724
600 0.693043053150177
700 0.693027913570404
800 0.6930105090141296
900 0.6929900646209717
1000 0.692965567111969
1100 0.69293612241745
1200 0.6928997039794922
1300 0.6928539276123047
1400 0.6927951574325562
1500 0.6927174925804138
1600 0.6926113367080688
1700 0.6924604773521423
1800 0.6922343373298645
1900 0.6918712854385376
2000 0.6912301182746887
2100 0.6899261474609375
2200 0.6865946054458618
2300 0.6736891269683838
2400 0.5819851160049438
2500 0.2226959764957428
2600 0.012114075943827629
2700 0.005340529605746269
2800 0.0032906006090343
2900 0.0023381607607007027
3000 0.0017971005290746689
3100 0.0014514061622321606
3200 0.001212847651913762
3300 0.0010389507515355945
3400 0.0009069546358659863
3500 0.0008035794598981738
3600 0.0007205012952908874
3700 0.000652395945508033
3800 0.0005956039531156421
3900 0.0005475673242472112
4000 0.0005064248689450

In [None]:
# Accuracy

with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    accuracy = (predicted == Y).float().mean()
    print('\nHypothesis: ', hypothesis.detach().cpu().numpy(), '\nCorrect: ', predicted.detach().cpu().numpy(), '\nAccuracy: ', accuracy.item())

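TypeError: 'Tensor' object is not callable

- This error means a tensor was called like a function. A likely cause (an assumption, since the traceback is not shown) is that a name called in this cell, such as `model`, had been rebound to a tensor earlier in the session; re-running the cells above in order should restore it.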