## Recurrent Neural Networks (RNN)

&emsp;&emsp;In a latent-variable autoregressive model, a latent variable $h_{t}$ summarizes the information from the past.

$$
\begin{aligned}
&p(h_{t} \mid h_{t-1}, x_{t-1}) \\
&p(x_{t} \mid h_{t}, x_{t-1})
\end{aligned}
$$

&emsp;&emsp;The next step is to turn this model into an `RNN`:

1. The term $p(h_{t} \mid h_{t-1}, x_{t-1})$ can be modeled as:

$$
h_{t} = \tanh(W_{hh}h_{t-1} + W_{hx}x_{t-1} + b_{h})
$$

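&emsp;&emsp;If the term $W_{hh}h_{t-1}$ is removed from the equation above, the model degenerates into an `MLP`, because the hidden state no longer depends on earlier time steps.

2. For the term $p(x_{t} \mid h_{t},x_{t-1})$, the output can be written as follows, where $\phi$ denotes the output activation function: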

$$
o_{t} = \phi(W_{ho}h_{t} + b_{o})
$$
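
&emsp;&emsp;The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes are arbitrary, and $\phi$ is assumed to be a softmax, which the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2  # illustrative sizes (assumed)

W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hx = rng.normal(size=(hidden_size, input_size))
b_h = np.zeros(hidden_size)
W_ho = rng.normal(size=(output_size, hidden_size))
b_o = np.zeros(output_size)

def rnn_step(h_prev, x_prev):
    """h_t = tanh(W_hh h_{t-1} + W_hx x_{t-1} + b_h); o_t = phi(W_ho h_t + b_o)."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_prev + b_h)
    logits = W_ho @ h_t + b_o
    # phi is taken to be a softmax here (an assumption).
    exp = np.exp(logits - logits.max())
    o_t = exp / exp.sum()
    return h_t, o_t

h = np.zeros(hidden_size)              # initial hidden state
xs = rng.normal(size=(5, input_size))  # a toy sequence of 5 observations
for x in xs:
    h, o = rnn_step(h, x)              # the same weights are reused at every step
```

The only thing carried from one step to the next is `h`, which is what makes the model recurrent.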

## torch.nn.RNN

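In `PyTorch`, the `RNN` API is documented at: [torch.nn.RNN(*args, **kwargs)](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html?highlight=rnn#torch.nn.RNN)

Its parameters are:

1. `input_size`: the number of features in the input $x$.
2. `hidden_size`: the number of features in the hidden state $h$.
3. `num_layers`: the number of stacked recurrent layers; the default is `1`.
4. `nonlinearity`: the non-linear activation, either `tanh` or `relu`; the default is `tanh`.
5. `bias`: if `False`, the layer does not use bias terms; the default is `True`.
6. `batch_first`: if `True`, the input and output tensors are laid out as `(batch, seq, feature)` instead of `(seq, batch, feature)`; the default is `False`.
7. `dropout`: if non-zero, dropout with this probability is applied to the outputs of every layer except the last; the default is `0`.
8. `bidirectional`: if `True`, the RNN is bidirectional and the feature dimension of the output becomes `2 * hidden_size`; the default is `False`.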


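After instantiation, the module takes two inputs: `input` and `h_0`. The shape of `input` is $(L, N, H_{in})$ when `batch_first=False`, and $(N, L, H_{in})$ when `batch_first=True`. The shape of `h_0` is `(D * num_layers, N, H_out)` regardless of `batch_first`; if `h_0` is omitted, it defaults to zeros.

Here $N$ is the batch size, $L$ is the sequence length, $D = 2$ if `bidirectional=True` and $D = 1$ otherwise, $H_{in}$ is `input_size`, and $H_{out}$ is `hidden_size`.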
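
The shape conventions above can be checked with a short sketch. The sizes are arbitrary choices for illustration; only the resulting shapes matter.

```python
import torch
import torch.nn as nn

N, L, H_in, H_out = 2, 3, 4, 5   # batch size, sequence length, input_size, hidden_size
num_layers, D = 2, 2             # D = 2 because bidirectional=True

# Default layout (batch_first=False): input is (L, N, H_in).
rnn = nn.RNN(H_in, H_out, num_layers=num_layers, bidirectional=True)
h_0 = torch.zeros(D * num_layers, N, H_out)
output, h_n = rnn(torch.randn(L, N, H_in), h_0)
out_shape = tuple(output.shape)  # (L, N, D * H_out)
hn_shape = tuple(h_n.shape)      # (D * num_layers, N, H_out)

# batch_first=True: input is (N, L, H_in); h_0 keeps the same layout as before.
rnn_bf = nn.RNN(H_in, H_out, num_layers=num_layers, bidirectional=True, batch_first=True)
output_bf, h_n_bf = rnn_bf(torch.randn(N, L, H_in), h_0)
out_bf_shape = tuple(output_bf.shape)  # (N, L, D * H_out)
hn_bf_shape = tuple(h_n_bf.shape)      # (D * num_layers, N, H_out)
```

Note that `batch_first` only changes the layout of `input` and `output`; the hidden state keeps the batch in its second dimension.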

In [1]:
import torch
import torch.nn as nn

bs, T = 2, 3  # batch size, input sequence length
input_size, hidden_size = 2, 3  # input feature size, hidden feature size

input_data = torch.randn(bs, T, input_size)  # randomly initialized input features


### Unidirectional, single layer

#### API call for a unidirectional, single-layer RNN

In [2]:
h_prev = torch.zeros(bs, hidden_size)  # initial hidden state
single_rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # input feature size 2, hidden feature size 3
output, hn = single_rnn(input_data, h_prev.unsqueeze(0))  # if h_0 is omitted, it defaults to zeros
print(output)
print(hn)

tensor([[[-0.4158,  0.2196,  0.3127],
         [-0.1908, -0.0525, -0.1845],
         [-0.4863,  0.3603,  0.2505]],

        [[-0.7959, -0.8847, -0.5323],
         [-0.0699,  0.4007,  0.5177],
         [-0.0020,  0.5124,  0.2386]]], grad_fn=<TransposeBackward1>)
tensor([[[-0.4863,  0.3603,  0.2505],
         [-0.0020,  0.5124,  0.2386]]], grad_fn=<StackBackward0>)


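As the printed result shows, `output` has a `batch size` of `2` and a sequence length of `3`, and the output at every time step has dimension `3`. The output at the last time step of each sequence is `hn`.

#### Understanding the forward pass of a unidirectional, single-layer RNN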

In [3]:
def rnn_forward(input_data, weight_ih, bias_ih, weight_hh, bias_hh, h_prev):
    bs, T, input_size = input_data.shape
    h_dim = weight_ih.shape[0]
    h_out = torch.zeros(bs, T, h_dim)  # output (state) tensor, filled in step by step
    
    for t in range(T):
        x = input_data[:, t, :].unsqueeze(2)  # input features at the current step, bs * input_size * 1
        
        w_ih_batch = weight_ih.unsqueeze(0).tile(bs, 1, 1)  # bs * h_dim * input_size
        w_hh_batch = weight_hh.unsqueeze(0).tile(bs, 1, 1)  # bs * h_dim * h_dim
        
        w_times_x = torch.bmm(w_ih_batch, x).squeeze(-1)  # bs * h_dim
        w_times_h = torch.bmm(w_hh_batch, h_prev.unsqueeze(2)).squeeze(-1)  # bs * h_dim
        
        h_prev = torch.tanh(w_times_x + bias_ih + w_times_h + bias_hh)
        
        h_out[:,t,:] = h_prev
    
    return h_out, h_prev.unsqueeze(0)  # unidirectional and single-layer, but the official API returns a 3-D h_n

In [4]:
for k, v in single_rnn.named_parameters():
    print(k, v)

weight_ih_l0 Parameter containing:
tensor([[-0.5119,  0.0075],
        [-0.5362, -0.4070],
        [-0.5311, -0.1011]], requires_grad=True)
weight_hh_l0 Parameter containing:
tensor([[-0.5507, -0.1207,  0.3047],
        [ 0.4018, -0.2243, -0.3673],
        [ 0.5684, -0.4863, -0.4629]], requires_grad=True)
bias_ih_l0 Parameter containing:
tensor([-0.3611, -0.0943,  0.3951], requires_grad=True)
bias_hh_l0 Parameter containing:
tensor([-0.4305, -0.0467, -0.4333], requires_grad=True)


In [5]:
output, hn = rnn_forward(input_data, single_rnn.weight_ih_l0, single_rnn.bias_ih_l0, \
                         single_rnn.weight_hh_l0, single_rnn.bias_hh_l0, h_prev)
print(output)
print(hn)

tensor([[[-0.4158,  0.2196,  0.3127],
         [-0.1908, -0.0525, -0.1845],
         [-0.4863,  0.3603,  0.2505]],

        [[-0.7959, -0.8847, -0.5323],
         [-0.0699,  0.4007,  0.5177],
         [-0.0020,  0.5124,  0.2386]]], grad_fn=<CopySlices>)
tensor([[[-0.4863,  0.3603,  0.2505],
         [-0.0020,  0.5124,  0.2386]]], grad_fn=<UnsqueezeBackward0>)


The result matches the output of the official `API` call.
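
Rather than comparing the printed tensors by eye, the agreement can be checked with `torch.allclose`. The sketch below is self-contained: it builds its own small `nn.RNN` with a fixed seed and rewrites the recurrence with plain matrix products instead of `tile` and `bmm`. That is an equivalent formulation, not the code from the cells above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, T, input_size, hidden_size = 2, 3, 2, 3
x = torch.randn(bs, T, input_size)
rnn = nn.RNN(input_size, hidden_size, batch_first=True)

with torch.no_grad():
    ref_out, ref_hn = rnn(x)                  # h_0 defaults to zeros
    h = torch.zeros(bs, hidden_size)
    steps = []
    for t in range(T):
        # h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
        h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        steps.append(h)
    manual_out = torch.stack(steps, dim=1)    # (bs, T, hidden_size)

outputs_match = torch.allclose(manual_out, ref_out, atol=1e-5)
final_state_match = torch.allclose(h, ref_hn[0], atol=1e-5)
print(outputs_match, final_state_match)
```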

### Bidirectional, single layer

#### API call for a bidirectional, single-layer RNN

In [6]:
bidirectional_rnn = nn.RNN(input_size, hidden_size, batch_first=True, bidirectional=True)  # input feature size 2, hidden feature size 3, 1 layer
h_prev = torch.zeros(2, bs, hidden_size)  # initial hidden states, one per direction
bi_output, bi_hn = bidirectional_rnn(input_data, h_prev)  # if h_0 is omitted, it defaults to zeros
print(bi_output)
print(bi_hn)

tensor([[[ 0.2831,  0.0730,  0.4316, -0.5670, -0.0961,  0.4662],
         [ 0.2739, -0.1995,  0.4555, -0.4329,  0.1318,  0.3973],
         [ 0.3073, -0.3371,  0.4671, -0.6645,  0.1620,  0.1460]],

        [[ 0.4380,  0.9180, -0.4362, -0.6307, -0.6673,  0.2989],
         [ 0.3911, -0.3844,  0.5364, -0.5271,  0.0734,  0.4458],
         [-0.1203, -0.4641,  0.5987, -0.5103,  0.2283,  0.3184]]],
       grad_fn=<TransposeBackward1>)
tensor([[[ 0.3073, -0.3371,  0.4671],
         [-0.1203, -0.4641,  0.5987]],

        [[-0.5670, -0.0961,  0.4662],
         [-0.6307, -0.6673,  0.2989]]], grad_fn=<StackBackward0>)


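#### Understanding the forward pass of a bidirectional, single-layer RNN

What does the forward computation of a bidirectional `RNN` look like?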

In [7]:
def bidirectional_rnn_forward(input_data, weight_ih, bias_ih, weight_hh, bias_hh, h_prev,\
                             weight_ih_reverse, bias_ih_reverse, weight_hh_reverse, bias_hh_reverse, h_prev_reverse):
    
    bs, T, input_size = input_data.shape
    h_dim = weight_ih.shape[0]
    h_out = torch.zeros(bs, T, h_dim * 2)  # output (state) tensor; a bidirectional RNN has twice the feature size
    
    forward_output = rnn_forward(input_data, weight_ih, bias_ih, weight_hh, bias_hh, h_prev)[0]
    
    # Backward layer: run the same recurrence over the time-reversed input.
    # Note: backward_output is left in reversed time order here.
    backward_output = rnn_forward(torch.flip(input_data, [1]), weight_ih_reverse, bias_ih_reverse, \
                                  weight_hh_reverse, bias_hh_reverse, h_prev_reverse)[0]
    
    h_out[:, :, :h_dim] = forward_output
    h_out[:, :, h_dim:] = backward_output

    return h_out, h_out[:, -1, :].reshape((bs, 2, h_dim)).transpose(0, 1)

In [8]:
for k, v in bidirectional_rnn.named_parameters():
    print(k, v)

weight_ih_l0 Parameter containing:
tensor([[ 0.3819, -0.1383],
        [ 0.1533,  0.5730],
        [-0.2634, -0.2586]], requires_grad=True)
weight_hh_l0 Parameter containing:
tensor([[-0.5423,  0.3853,  0.1053],
        [-0.1075, -0.3597, -0.1994],
        [-0.2672,  0.2382,  0.1060]], requires_grad=True)
bias_ih_l0 Parameter containing:
tensor([ 0.0654, -0.0144, -0.0434], requires_grad=True)
bias_hh_l0 Parameter containing:
tensor([0.4866, 0.1902, 0.3265], requires_grad=True)
weight_ih_l0_reverse Parameter containing:
tensor([[-0.2933,  0.1265],
        [-0.0481, -0.2726],
        [-0.2160,  0.0169]], requires_grad=True)
weight_hh_l0_reverse Parameter containing:
tensor([[-0.3859,  0.3072, -0.5566],
        [-0.1584,  0.3211, -0.4665],
        [-0.2365,  0.0129,  0.4616]], requires_grad=True)
bias_ih_l0_reverse Parameter containing:
tensor([-0.3966, -0.5015,  0.3502], requires_grad=True)
bias_hh_l0_reverse Parameter containing:
tensor([-0.4332,  0.4477, -0.2797], requires_grad=True)


In [9]:
output,hn = bidirectional_rnn_forward(input_data, bidirectional_rnn.weight_ih_l0, bidirectional_rnn.bias_ih_l0,\
                         bidirectional_rnn.weight_hh_l0, bidirectional_rnn.bias_hh_l0, h_prev[0], \
                         bidirectional_rnn.weight_ih_l0_reverse, bidirectional_rnn.bias_ih_l0_reverse, \
                         bidirectional_rnn.weight_hh_l0_reverse, bidirectional_rnn.bias_hh_l0_reverse, h_prev[1])
print(output)
print(hn)

tensor([[[ 0.2831,  0.0730,  0.4316, -0.6645,  0.1620,  0.1460],
         [ 0.2739, -0.1995,  0.4555, -0.4329,  0.1318,  0.3973],
         [ 0.3073, -0.3371,  0.4671, -0.5670, -0.0961,  0.4662]],

        [[ 0.4380,  0.9180, -0.4362, -0.5103,  0.2283,  0.3184],
         [ 0.3911, -0.3844,  0.5364, -0.5271,  0.0734,  0.4458],
         [-0.1203, -0.4641,  0.5987, -0.6307, -0.6673,  0.2989]]],
       grad_fn=<CopySlices>)
tensor([[[ 0.3073, -0.3371,  0.4671],
         [-0.1203, -0.4641,  0.5987]],

        [[-0.5670, -0.0961,  0.4662],
         [-0.6307, -0.6673,  0.2989]]], grad_fn=<TransposeBackward0>)
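
Comparing the two printed results, `hn` matches `bi_hn` and the forward half of `output` matches `bi_output`, but the reverse half appears in the opposite time order: its first and last time steps are swapped relative to the official result. This is because the backward output is not flipped back along the time axis before it is written into `h_out`. The sketch below is one way to make the full output match; it is a self-contained variant with its own seed and a hypothetical helper `run_direction`, not the code from the cells above.

```python
import torch
import torch.nn as nn

def run_direction(x, w_ih, b_ih, w_hh, b_hh, h):
    """Unroll one direction over x of shape (bs, T, input_size)."""
    steps = []
    for t in range(x.shape[1]):
        h = torch.tanh(x[:, t] @ w_ih.T + b_ih + h @ w_hh.T + b_hh)
        steps.append(h)
    return torch.stack(steps, dim=1), h

torch.manual_seed(0)
bs, T, input_size, hidden_size = 2, 3, 2, 3
x = torch.randn(bs, T, input_size)
birnn = nn.RNN(input_size, hidden_size, batch_first=True, bidirectional=True)

with torch.no_grad():
    ref_out, ref_hn = birnn(x)
    h0 = torch.zeros(bs, hidden_size)
    fwd_out, fwd_last = run_direction(
        x, birnn.weight_ih_l0, birnn.bias_ih_l0,
        birnn.weight_hh_l0, birnn.bias_hh_l0, h0)
    bwd_out, bwd_last = run_direction(
        torch.flip(x, [1]), birnn.weight_ih_l0_reverse, birnn.bias_ih_l0_reverse,
        birnn.weight_hh_l0_reverse, birnn.bias_hh_l0_reverse, h0)
    bwd_out = torch.flip(bwd_out, [1])                    # restore the original time order
    manual_out = torch.cat([fwd_out, bwd_out], dim=2)     # (bs, T, 2 * hidden_size)
    manual_hn = torch.stack([fwd_last, bwd_last], dim=0)  # (2, bs, hidden_size)

outputs_match = torch.allclose(manual_out, ref_out, atol=1e-5)
states_match = torch.allclose(manual_hn, ref_hn, atol=1e-5)
print(outputs_match, states_match)
```

The final hidden state of the reverse direction is the one computed at the first position of the original sequence, which is why it is taken from the last step of the flipped run.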


### RNNCell

An `RNN` is essentially an `RNNCell` applied repeatedly, once per time step.

Its signature is documented at: [torch.nn.RNNCell](https://pytorch.org/docs/stable/generated/torch.nn.RNNCell.html?highlight=rnncell#torch.nn.RNNCell)

```python
CLASS torch.nn.RNNCell(input_size, hidden_size, bias=True, nonlinearity='tanh', device=None, dtype=None)
```

For example:

```python
rnn = nn.RNNCell(10, 20)
input = torch.randn(6, 3, 10)
hx = torch.randn(3, 20)
output = []
for i in range(6):
    hx = rnn(input[i], hx)
    output.append(hx)
```
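
The claim that an `RNN` is a repeated `RNNCell` can be checked by copying the layer-0 weights of an `nn.RNN` into an `nn.RNNCell` and unrolling the cell by hand. This is a small sketch with the same sizes as the example above; the zero initial state and the fixed seed are choices made here for the comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, N, input_size, hidden_size = 6, 3, 10, 20
rnn = nn.RNN(input_size, hidden_size)        # expects input of shape (T, N, input_size)
cell = nn.RNNCell(input_size, hidden_size)

with torch.no_grad():
    # Share parameters so that both modules compute the same function.
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

    x = torch.randn(T, N, input_size)
    hx = torch.zeros(N, hidden_size)
    outputs = []
    for t in range(T):
        hx = cell(x[t], hx)                  # one time step
        outputs.append(hx)
    cell_out = torch.stack(outputs, dim=0)   # (T, N, hidden_size)
    rnn_out, rnn_hn = rnn(x)                 # h_0 defaults to zeros

same_outputs = torch.allclose(cell_out, rnn_out, atol=1e-5)
same_final_state = torch.allclose(hx, rnn_hn[0], atol=1e-5)
print(same_outputs, same_final_state)
```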


## Vanilla RNN code

### NumPy implementation

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import time
sns.set()

def get_vocab(file, lower = False):
    with open(file, 'r') as fopen:
        data = fopen.read() # read the entire file contents
    if lower:
        data = data.lower()
    
    vocab = list(set(data))
    return data, vocab

def embed_to_control(data, vocab):
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot

In [11]:
text, text_vocab = get_vocab('../consumer.h', lower = False)
one_hot = embed_to_control(text, text_vocab)
print('len text: ', len(text))
print('len text_vocab: ', len(text_vocab))
print('one_hot shape: ', one_hot.shape)

len text:  15294
len text_vocab:  75
one_hot shape:  (15294, 75)
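
To make the encoding concrete, here is a small sketch of the same idea on an in-memory string. It uses a dictionary lookup and fancy indexing instead of `vocab.index` inside a loop, which is an alternative chosen here for illustration; the vocabulary is sorted only to make the column order reproducible.

```python
import numpy as np

text = "hello rnn"
vocab = sorted(set(text))                            # unique characters
char_to_idx = {ch: i for i, ch in enumerate(vocab)}  # character -> column index

indices = np.array([char_to_idx[ch] for ch in text])
one_hot = np.zeros((len(text), len(vocab)), dtype=np.float32)
one_hot[np.arange(len(text)), indices] = 1.0         # a single 1.0 per row

# Decoding with argmax recovers the original text.
decoded = "".join(vocab[i] for i in one_hot.argmax(axis=1))
print(one_hot.shape, decoded)
```

Each row corresponds to one character of the text and each column to one entry of the vocabulary, which is the `(15294, 75)` layout printed above.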


&emsp;&emsp;The `tanh` activation function is:

$$
\tanh x=\frac{\sinh x}{\cosh x}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$

&emsp;&emsp;Its derivative is:

$$
(\tanh x)^{\prime}=\operatorname{sech}^{2} x=1-\tanh ^{2} x
$$

In [12]:
def tanh(x, grad=False):
    if grad:
        output = np.tanh(x)
        return (1.0 - np.square(output))
    else:
        return np.tanh(x)
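
&emsp;&emsp;The derivative formula can be sanity-checked against a central finite difference. This is a quick numerical check; the step size `eps` and the sample points are arbitrary choices.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

analytic = 1.0 - np.tanh(x) ** 2                             # 1 - tanh^2(x)
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)  # central difference
max_abs_err = np.max(np.abs(analytic - numeric))
print(max_abs_err)
```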

$$
h_{t} = \tanh(W_{hh}h_{t-1} + W_{hx}x_{t-1} + b_{h})
$$

$$
o_{t} = \phi(W_{ho}h_{t} + b_{o})
$$

In [13]:
def forward_rnn_recurrent(x, prev_state, W_hx, W_hh, W_ho):
    mul_hx = np.dot(x, W_hx.T)
    
    # hidden-state part
    mul_hh = np.dot(prev_state, W_hh.T)
    add_previous_now = mul_hx + mul_hh
    current_state = tanh(add_previous_now)
    
    # output part
    mul_o = np.dot(current_state, W_ho.T)
    return (mul_hx, mul_hh, add_previous_now, current_state, mul_o)

In [14]:
def softmax(x):
    """
    x: 2-D array of scores; np.max(x) takes the maximum over the whole array, not per row.
    """
    exp_scores = np.exp(x - np.max(x))
    return exp_scores / (np.sum(exp_scores, axis=1, keepdims=True) + 1e-8)
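
&emsp;&emsp;Because the softmax of a row does not change when a constant is subtracted from that row, the maximum can also be taken per row. The variant below is a sketch of that alternative (the name `softmax_rowwise` is introduced here). It behaves the same on typical inputs, but stays well defined when rows have very different scales, where subtracting the global maximum could make every exponential in a row underflow to zero.

```python
import numpy as np

def softmax_rowwise(x):
    """Softmax over axis 1, shifting each row by its own maximum."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0],
                   [-5.0, 0.0, 5.0]])
probs = softmax_rowwise(logits)
print(probs.sum(axis=1))
```

Each row contains at least one `exp(0) = 1` after the shift, so the denominator cannot be zero and no extra epsilon is needed.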

&emsp;&emsp;The multi-class cross-entropy loss is:

$$
L=\frac{1}{N} \sum_{i} L_{i}=-\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{i c} \log \left(p_{i c}\right)
$$

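&emsp;&emsp;Here $N$ is the number of samples, $M$ is the number of classes, and $y_{ic}$ is an indicator (0 or 1) that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise. $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$.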

In [15]:
def cross_entropy(Y_hat, Y, epsilon=1e-12):
    Y_hat = np.clip(Y_hat, epsilon, 1. - epsilon)
    N = Y_hat.shape[0]
    return -np.sum(np.sum(Y * np.log(Y_hat + 1e-9))) / N
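
&emsp;&emsp;A useful reference point: if the model predicts a uniform distribution over $M$ classes, the loss equals $\log M$. With the 75-character vocabulary above that is about 4.32, so an untrained model should start near that value. The sketch below checks this with a compact restatement of the loss; the four target indices are arbitrary.

```python
import numpy as np

def cross_entropy(Y_hat, Y, epsilon=1e-12):
    """Mean cross-entropy between predicted probabilities Y_hat and one-hot targets Y."""
    Y_hat = np.clip(Y_hat, epsilon, 1.0 - epsilon)
    return -np.sum(Y * np.log(Y_hat)) / Y_hat.shape[0]

M, N = 75, 4                               # number of classes, number of samples
Y = np.eye(M)[[0, 10, 20, 74]]             # four one-hot targets

uniform = np.full((N, M), 1.0 / M)
loss_uniform = cross_entropy(uniform, Y)   # equals log(M)
loss_perfect = cross_entropy(Y.copy(), Y)  # predictions equal to the targets: close to 0
print(loss_uniform, np.log(M), loss_perfect)
```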

&emsp;&emsp;Backpropagation for the `RNN`:

In [16]:
def backward_multiply_gate(w, x, dz):
    """
    w shape = (75, 128)
    x shape = (64, 128)
    dz shape = (64, 75)
    """
    dw = np.dot(dz.T, x) # shape = (75, 128)
    dx = np.dot(w.T, dz.T) # shape = (128, 64)
    return dw, dx

def backward_add_gate(x1, x2, dz):
    dx1 = dz * np.ones_like(x1)
    dx2 = dz * np.ones_like(x2)
    return dx1, dx2

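def backward_rnn_recurrent(x, prev_state, W_hx, W_hh, W_ho, d_mul_o, saved_graph):
    mul_hx, mul_hh, add_previous_now, current_state, mul_o = saved_graph
    
    # Output layer: gradients w.r.t. W_ho and the current hidden state.
    dW_ho, d_CurrentState = backward_multiply_gate(W_ho, current_state, d_mul_o)
    
    # Back through tanh; d_CurrentState has shape (hidden_dim, batch_size), hence the transpose.
    dadd_previous_now = tanh(add_previous_now, True) * d_CurrentState.T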
    
    dmul_hh, dmul_hx = backward_add_gate(mul_hh, mul_hx, dadd_previous_now)
    dW_hh, dprev_state = backward_multiply_gate(W_hh, prev_state, dmul_hh)
    dW_hx, dx = backward_multiply_gate(W_hx, x, dmul_hx)
    
    return (dprev_state, dW_hx, dW_hh, dW_ho)
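
&emsp;&emsp;`backward_rnn_recurrent` returns `dprev_state`, but the training loop that uses it does not pass that gradient on to the previous time step, so each step contributes only its direct gradient. For comparison, here is a hedged sketch of full backpropagation through time for the same bias-free model, verified with a finite-difference check. The function names `forward` and `backward`, the toy sizes, and the summed (not averaged) loss are choices made for this sketch.

```python
import numpy as np

def forward(xs, ys, W_hx, W_hh, W_ho):
    """xs, ys: (batch, T, dim) one-hot arrays. Returns summed loss, states, probabilities."""
    batch, T, _ = xs.shape
    hs = [np.zeros((batch, W_hh.shape[0]))]   # hs[t] is the state entering step t
    probs, loss = [], 0.0
    for t in range(T):
        h = np.tanh(xs[:, t] @ W_hx.T + hs[-1] @ W_hh.T)
        logits = h @ W_ho.T
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)
        loss -= np.sum(ys[:, t] * np.log(p))
        hs.append(h)
        probs.append(p)
    return loss, hs, probs

def backward(xs, ys, W_hx, W_hh, W_ho, hs, probs):
    dW_hx, dW_hh, dW_ho = (np.zeros_like(W) for W in (W_hx, W_hh, W_ho))
    dh_next = np.zeros_like(hs[0])            # gradient arriving from step t + 1
    for t in reversed(range(xs.shape[1])):    # walk backwards in time
        d_logits = probs[t] - ys[:, t]        # softmax + cross-entropy gradient
        dW_ho += d_logits.T @ hs[t + 1]
        dh = d_logits @ W_ho + dh_next        # output path plus the path through time
        da = (1.0 - hs[t + 1] ** 2) * dh      # back through tanh
        dW_hx += da.T @ xs[:, t]
        dW_hh += da.T @ hs[t]
        dh_next = da @ W_hh                   # hand the gradient to step t - 1
    return dW_hx, dW_hh, dW_ho

rng = np.random.default_rng(0)
batch, T, dim, hidden = 2, 4, 5, 3
xs = np.eye(dim)[rng.integers(dim, size=(batch, T))]
ys = np.eye(dim)[rng.integers(dim, size=(batch, T))]
W_hx = rng.normal(size=(hidden, dim)) / np.sqrt(hidden)
W_hh = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
W_ho = rng.normal(size=(dim, hidden)) / np.sqrt(hidden)

loss, hs, probs = forward(xs, ys, W_hx, W_hh, W_ho)
dW_hx, dW_hh, dW_ho = backward(xs, ys, W_hx, W_hh, W_ho, hs, probs)

# Finite-difference check on one entry of W_hh.
eps = 1e-5
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(xs, ys, W_hx, W_plus, W_ho)[0]
           - forward(xs, ys, W_hx, W_minus, W_ho)[0]) / (2 * eps)
grad_error = abs(numeric - dW_hh[0, 1])
print(grad_error)
```

The key difference is the `dh_next` term: the gradient of later losses with respect to the current hidden state is added before going back through `tanh`.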

In [17]:
epoch = 1000
learning_rate = 0.0001
batch_size = 64
sequence_length = int(12)
dimension = one_hot.shape[1]
print('dimension is :', dimension)
possible_batch_id = range(len(text) - sequence_length - 1)
hidden_dim = 128

W_hx = np.random.randn(hidden_dim, dimension) / np.sqrt(hidden_dim)
W_hh = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_ho = np.random.randn(dimension, hidden_dim) / np.sqrt(hidden_dim)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start positions of the batch
    prev_s = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    for n in range(sequence_length):
        layers.append(forward_rnn_recurrent(batch_x[:, n, :], prev_s, W_hx, W_hh, W_ho))
        prev_s = layers[-1][3]
        out_logits[:, n, :] = layers[-1][-1]
    
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
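    # Next, compute the backward pass and the parameter update.
    
    # 1. Compute the gradients:
    delta = probs  # take the network output (delta aliases probs, which is modified in place below)
    delta[range(y.shape[0]), y] -= 1  # subtract 1 from the probability of the true label (softmax + cross-entropy gradient)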
    delta = delta.reshape((batch_size, sequence_length, dimension))
    
    dW_hx = np.zeros(W_hx.shape)
    dW_hh = np.zeros(W_hh.shape)
    dW_ho = np.zeros(W_ho.shape)
    prev_state = np.zeros((batch_size, hidden_dim))
    
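    # Note: dprev_s is not propagated to other time steps, so each step contributes only its direct gradient.
    for n in range(sequence_length):
        d_mul_o = delta[:, n, :] # shape = (batch_size, dimension)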
        
        dprev_s, dW_hx_t, dW_hh_t, dW_ho_t = backward_rnn_recurrent(batch_x[:,n,:], prev_state, 
                                                         W_hx, W_hh, W_ho, d_mul_o, layers[n])
        prev_state = layers[n][3]
        dW_hx += dW_hx_t
        dW_hh += dW_hh_t
        dW_ho += dW_ho_t
    
    # update the parameters
    W_hx -= learning_rate * dW_hx 
    W_hh -= learning_rate * dW_hh
    W_ho -= learning_rate * dW_ho
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss, accuracy))

dimension is : 75
epoch 50, loss 4.241343069364474, accuracy 0.06640625
epoch 100, loss 4.032394839844798, accuracy 0.18489583333333334
epoch 150, loss 3.6716801635951057, accuracy 0.16536458333333334
epoch 200, loss 3.4627890638009937, accuracy 0.23828125
epoch 250, loss 3.4195365176120514, accuracy 0.24088541666666666
epoch 300, loss 3.167283505103797, accuracy 0.2760416666666667
epoch 350, loss 3.0534844597935122, accuracy 0.3203125
epoch 400, loss 3.0401859430164486, accuracy 0.296875
epoch 450, loss 2.9362047046204367, accuracy 0.35546875
epoch 500, loss 2.774710712376892, accuracy 0.35546875
epoch 550, loss 2.884420198759688, accuracy 0.3333333333333333
epoch 600, loss 2.6950117225465324, accuracy 0.3697916666666667
epoch 650, loss 2.7025486548669435, accuracy 0.35546875
epoch 700, loss 2.6027955887830725, accuracy 0.3776041666666667
epoch 750, loss 2.1376824503047134, accuracy 0.50390625
epoch 800, loss 2.56399462748593, accuracy 0.3880208333333333
epoch 850, loss 2.352770017730

### PyTorch implementation

In [18]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        """Character-level model: an nn.RNN followed by a linear layer that maps hidden states to logits."""
        super(RNN, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, num_layers, batch_first=True, nonlinearity='tanh')
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Zero initial hidden state; equivalent to passing None to nn.RNN.
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim, dtype=x.dtype, device=x.device)
        out, hn = self.rnn(x, h0)
        
        out = self.fc(out)  # logits of shape (batch, seq, output_dim)
        return out
rnn = RNN(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(rnn)

RNN(
  (rnn): RNN(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)
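
The training cell that follows passes raw logits, flattened with `view(-1, dimension)`, to `nn.CrossEntropyLoss`. The sketch below illustrates what that loss computes: it is equivalent to applying `log_softmax` and then the negative log-likelihood loss. The random logits and labels are stand-ins for the model output, used only for this comparison.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, sequence_length, dimension = 4, 12, 75
logits = torch.randn(batch_size, sequence_length, dimension)  # stand-in for the model output
labels = torch.randint(dimension, (batch_size * sequence_length,))

flat_logits = logits.view(-1, dimension)  # (batch_size * sequence_length, dimension)
loss_builtin = nn.CrossEntropyLoss()(flat_logits, labels)
loss_manual = F.nll_loss(F.log_softmax(flat_logits, dim=1), labels)
print(loss_builtin.item(), loss_manual.item())
```

Because the loss applies `log_softmax` internally, the model should output logits and no softmax layer is needed before it.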


In [19]:
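# Defining the softmax operation and the cross-entropy loss separately can be numerically unstable.
# PyTorch therefore provides a numerically stable function that combines softmax and cross-entropy.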
criterion = nn.CrossEntropyLoss()

learning_rate = 0.5
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start positions of the batch
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # Convert the NumPy arrays to torch tensors and feed them to the network.
    output = rnn(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1) # shape = (768,), i.e. 64 * 12
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), axis=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss.item(), accuracy))
    

epoch 50, loss 3.3897383213043213, accuracy 0.11067708333333333
epoch 100, loss 3.04331374168396, accuracy 0.2526041666666667
epoch 150, loss 2.7403066158294678, accuracy 0.3346354166666667
epoch 200, loss 2.039881467819214, accuracy 0.5052083333333334
epoch 250, loss 2.1290059089660645, accuracy 0.4700520833333333
epoch 300, loss 1.5940618515014648, accuracy 0.6067708333333334
epoch 350, loss 1.5607830286026, accuracy 0.6197916666666666
epoch 400, loss 1.2755593061447144, accuracy 0.6692708333333334
epoch 450, loss 1.2091299295425415, accuracy 0.69140625
epoch 500, loss 1.3169924020767212, accuracy 0.6731770833333334
epoch 550, loss 1.3107393980026245, accuracy 0.66796875
epoch 600, loss 1.18193519115448, accuracy 0.7083333333333334
epoch 650, loss 1.027571678161621, accuracy 0.7434895833333334
epoch 700, loss 0.9772723317146301, accuracy 0.7278645833333334
epoch 750, loss 1.1997982263565063, accuracy 0.6888020833333334
epoch 800, loss 0.989878237247467, accuracy 0.7395833333333334
ep