In [319]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from scipy import spatial
%matplotlib inline

# Project 3: Word2Vec (70 pt)
The goal of this project is to obtain the vector representations for words from text.

The main idea is that words appearing in similar contexts have similar meanings. Because of that, word vectors of similar words should be close together. Models that use word vectors can utilize these properties, e.g., in sentiment analysis a model will learn that "good" and "great" are positive words, but will also generalize to other words that it has not seen (e.g. "amazing") because they should be close together in the vector space.

Vectors can capture other language properties as well, such as analogies. The question "a is to b as c is to ...?", where the answer is d, can be answered by calculating $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$ in the word vector space and finding the word vector that is closest to the result.
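
As a rough illustration of the analogy lookup described above, the sketch below uses hand-made 2-D vectors (hypothetical values, not learned from data) and cosine similarity as the closeness measure. The helper names `cosine_similarity` and `solve_analogy` are assumptions made for this example only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to u_b - u_a + u_c, excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Hand-made toy vectors (hypothetical, chosen so the analogy works out)
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}

answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)  # queen
```

The query words are excluded from the candidates because the result of the vector arithmetic often stays close to one of them.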

## Your task
Complete the missing code in this notebook. Make sure that all the functions follow the provided specification, i.e. the output of the function exactly matches the description in the docstring. 

We are given a text that contains $N$ unique words $\{ x_1, ..., x_N \}$. We will focus on the Skip-Gram model, in which the goal is to predict the context window $S = \{ x_{i-l}, ..., x_{i-1}, x_{i+1}, ..., x_{i+l} \}$ from the current word $x_i$, where $l$ is the window size.

We get a word embedding $\mathbf{u}_i$ by multiplying the matrix $\mathbf{U}$ with a one-hot representation $\mathbf{x}_i$ of a word $x_i$. Then, to get output probabilities for the context window, we multiply this embedding with another matrix $\mathbf{V}$ and apply softmax. The objective is to minimize the loss: $-\mathop{\mathbb{E}}[\log P(S|x_i;\mathbf{U}, \mathbf{V})]$.
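
The two matrix multiplications described above can be traced on a toy example. This is only a sketch with random matrices and made-up sizes (`N = 5`, `D = 3`); it shows that multiplying $\mathbf{U}$ by a one-hot vector selects one column, and that the softmax output is a probability vector over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(D, N))      # embedding matrix, one column per word
V = rng.normal(size=(N, D))      # output matrix, one row per word

i = 2                            # index of the current word
x_i = np.zeros(N)
x_i[i] = 1.0                     # one-hot representation of word i

u_i = U @ x_i                    # identical to selecting column i of U
logits = V @ u_i                 # one score per word in the vocabulary
prob = np.exp(logits - logits.max())
prob /= prob.sum()               # softmax over the vocabulary
```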

You are given a dataset with positive and negative reviews. Your task is to:
+ Construct input-output pairs corresponding to the current word and a word in the context window
+ Implement forward and backward propagation with parameter updates for the Skip-Gram model
+ Train the model
+ Test it on word analogies and a sentiment analysis task

## General remarks

Fill in the missing code at the markers
```python
###########################
# YOUR CODE HERE
###########################
```
Do not add or modify any code at other places in the notebook except where otherwise explicitly stated.
After you fill in all the missing code, restart the kernel and re-run all the cells in the notebook.

The following things are **NOT** allowed:
- Using additional `import` statements
- Copying / reusing code from other sources (e.g. code by other students)

If you plagiarise even for a single project task, you won't be eligible for the bonus this semester.

# 1. Load data (5 pts)

We'll be working with a subset of reviews for restaurants in Las Vegas. The reviews that we'll be working with are either 1-star or 5-star.

In [320]:
data = np.load("task03_data.npy", allow_pickle=True).item()
reviews_1star = [[x.lower() for x in s] for s in data["reviews_1star"]]
reviews_5star = [[x.lower() for x in s] for s in data["reviews_5star"]]

We generate the vocabulary by taking the top 200 words by their frequency from both positive and negative sentences. We could also use the whole vocabulary, but that would be slower. Other words are represented with "unk".

In [321]:
corpus = reviews_1star + reviews_5star
vocabulary = [w for s in corpus for w in s]
vocabulary, counts = zip(*Counter(vocabulary).most_common(200))

In [322]:
corpus = [[w if w in vocabulary else 'unk' for w in s] for s in corpus]
vocabulary += ('unk',) # Add "unk" to vocabulary
counts += (sum([w == 'unk' for s in corpus for w in s]),) # Add count for "unk"
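
The vocabulary truncation in the cells above can be tried on a small input. The sketch below applies the same steps to a two-sentence toy corpus with a hypothetical cutoff of 2 words instead of 200; all `toy_*` names are made up for this example.

```python
from collections import Counter

toy_corpus = [["the", "food", "was", "great"], ["the", "service", "was", "slow"]]
top_k = 2  # hypothetical cutoff; the notebook uses 200

# Count all words and keep the top_k most frequent ones
words = [w for s in toy_corpus for w in s]
toy_vocab, toy_counts = zip(*Counter(words).most_common(top_k))

# Replace every out-of-vocabulary word with "unk"
toy_mapped = [[w if w in toy_vocab else "unk" for w in s] for s in toy_corpus]

print(toy_vocab, toy_counts)  # ('the', 'was') (2, 2)
print(toy_mapped)
```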

In [323]:
VOCABULARY_SIZE = len(vocabulary)
EMBEDDING_DIM = 32

In [324]:
print('Number of negative reviews:', len(reviews_1star))
print('Number of positive reviews:', len(reviews_5star))
print('Number of unique words:', VOCABULARY_SIZE)

Number of negative reviews: 1000
Number of positive reviews: 2000
Number of unique words: 201


You have to create three dictionaries: `word_to_ind`, `ind_to_word`, and `ind_to_freq`, so we can go from text to numerical representation and vice versa, and look up word counts by index. The input into the model will be the index of the word denoting its position in the vocabulary.

In [325]:
"""
Implement
---------
word_to_ind: dict
    The keys are words (str) and the value is the corresponding position in the vocabulary
ind_to_word: dict
    The keys are indices (int) and the value is the corresponding word from the vocabulary
ind_to_freq: dict
    The keys are indices (int) and the value is the corresponding count in the vocabulary
"""

###########################
# YOUR CODE HERE
###########################
# Position in `vocabulary` serves as the word index
word_to_ind = {w: i for i, w in enumerate(vocabulary)}
ind_to_word = {i: w for i, w in enumerate(vocabulary)}
# `counts` is aligned with `vocabulary`, so the same index applies
ind_to_freq = {i: c for i, c in enumerate(counts)}


In [326]:
print(f'Word \"the\" is at position {word_to_ind["the"]} appearing {ind_to_freq[word_to_ind["the"]]} times')
# Should output: 
# Word "the" is at position 0 appearing 2017 times

Word "the" is at position 0 appearing 2017 times


# 2. Create word pairs (5 pts)

We need all the word pairs $\{ x_i, x_j \}$, where $x_i$ is the current word and $x_j$ is from its context window. These will correspond to input-output pairs. We want them to be represented numerically, so you should use the `word_to_ind` dictionary.

In [327]:
def get_window(sentence, window_size):
    """
    Collect all ordered word pairs from a sentence that are at most window_size apart.
    
    Parameters
    ----------
    sentence: list
        A list of words of str type forming a sentence
    window_size: int
        A positive scalar
        
    Returns
    -------
    pairs: list
        A list of tuple (word index, word index from its context) of int type
    """
    
    pairs = []

    ###########################
    # YOUR CODE HERE
    for i in range(len(sentence)):
        word = sentence[i]
        word_idx = word_to_ind[word]
        for j in range(1, window_size+1):
            if i - j >= 0:
                window_word_before = sentence[i - j]
                window_word_idx = word_to_ind[window_word_before]
                pairs.append([word_idx, window_word_idx])
            if i + j < len(sentence): 
                window_word_after = sentence[i + j]
                window_word_idx = word_to_ind[window_word_after]
                pairs.append([word_idx, window_word_idx])
    ###########################

    return pairs

In [328]:
data = []
for x in corpus:
    data += get_window(x, window_size=3)
data = np.array(data)

print('First 5 pairs:', data[:5].tolist())
print('Total number of pairs:', data.shape[0])
# Should output:
# First 5 pairs: [[10, 200], [10, 6], [10, 64], [200, 10], [200, 6]]
# Total number of pairs: 207462

First 5 pairs: [[10, 200], [10, 6], [10, 64], [200, 10], [200, 6]]
Total number of pairs: 207462
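
To make the ordering of the pairs easier to see, here is a small sketch that follows the same loop structure as `get_window` but works directly on a list of indices. The function name `toy_window_pairs` and the input are made up for this example.

```python
def toy_window_pairs(indices, window_size):
    """Collect (center, context) pairs, nearest offsets first."""
    pairs = []
    for i, center in enumerate(indices):
        for j in range(1, window_size + 1):
            if i - j >= 0:                  # context word j steps to the left
                pairs.append((center, indices[i - j]))
            if i + j < len(indices):        # context word j steps to the right
                pairs.append((center, indices[i + j]))
    return pairs

toy_pairs = toy_window_pairs([0, 1, 2, 3], window_size=1)
print(toy_pairs)
# [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

Words at the edges of a sentence contribute fewer pairs because part of their window falls outside the sentence.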


We calculate a weighting score to counter the imbalance between the rare and frequent words. Rare words will be sampled more frequently. See https://arxiv.org/pdf/1310.4546.pdf

In [329]:
probabilities = [1 - np.sqrt(1e-3 / ind_to_freq[x]) for x in data[:,0]]
probabilities /= np.sum(probabilities)
print(probabilities[:3])
# Should output: 
# [4.8206203e-06 4.8206203e-06 4.8206203e-06]

[4.8206203e-06 4.8206203e-06 4.8206203e-06]


# 3. Model definition (55 pts)

In this part you should implement forward and backward propagation together with the parameter update, i.e.:
+ One-hot encoding of the words (5 pts)
+ Softmax (5 pts)
+ Loss implementation & computation (5 pts)
+ Forward pass (15 pts)
+ Backward pass (15 pts)
+ Optimizer (10 pts)
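
Numerical stability matters for the softmax item in the list above. The sketch below compares a naive softmax with one that subtracts the per-column maximum before exponentiating; the function name `stable_softmax` and the logits are assumptions made for this example, not part of the required interface.

```python
import numpy as np

def stable_softmax(x, axis):
    """Softmax along `axis`, shifted so the largest exponent is 0."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

# First column has large logits, second column has the same logits shifted down
logits = np.array([[1000.0, 1.0], [1001.0, 2.0], [1002.0, 3.0]])

# Naive version: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits), axis=0, keepdims=True)

stable = stable_softmax(logits, axis=0)
print(naive[:, 0])   # [nan nan nan]
print(stable[:, 0])  # same values as stable[:, 1]
```

Subtracting a constant from every logit in a column leaves the softmax unchanged, which is why both columns give the same probabilities in the stable version.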

In [330]:
class Embedding():
    """
    Word embedding model.

    Parameters
    ----------
    N: int
        Number of unique words in the vocabulary
    D: int
        Dimension of the word vector embedding
    """
    def __init__(self, N, D):
        self.N = N
        self.D = D

        self.ctx = None # Used to store values for backpropagation

        self.U = None
        self.V = None
        self.reset_parameters()

    def reset_parameters(self):
        """
        We initialize weight matrices U and V of dimension (D, N) and (N, D), respectively
        """
        self.ctx = None
        self.U = np.random.normal(0, np.sqrt(6. / (self.D + self.N)), (self.D, self.N))
        self.V = np.random.normal(0, np.sqrt(6. / (self.D + self.N)), (self.N, self.D))

    def one_hot(self, x, N):
        """
        Given a vector returns a matrix with rows corresponding to one-hot encoding.
        
        Parameters
        ----------
        x: array
            M-dimensional vector containing integers from [0, N - 1]
        N: int
            Number of possible classes
        
        Returns
        -------
        one_hot: array
            (N, M) matrix where each column is N-dimensional one-hot encoding of elements from x 
        """

        ###########################
        # YOUR CODE HERE
        one_hot = np.zeros((N, x.shape[0]))
        # For every column m, set row x[m] to 1 (x must hold integer indices)
        one_hot[x, np.arange(x.shape[0])] = 1
        ###########################

        assert one_hot.shape == (N, x.shape[0]), 'Incorrect one-hot embedding shape'
        return one_hot

    def softmax(self, x, axis):
        """
        Parameters
        ----------
        x: array
            A non-empty matrix of any dimension
        axis: int
            Dimension on which softmax is performed
            
        Returns
        -------
        y: array
            Matrix of same dimension as x with softmax applied to 'axis' dimension
        """
        
        # Note! You should implement a numerically stable version of softmax
        
        ###########################
        # YOUR CODE HERE
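        # Subtract the maximum along `axis` so the largest exponent is 0
        x_stable = x - np.max(x, axis=axis, keepdims=True)

        numerator = np.exp(x_stable)
        # Every slice contains exp(0) = 1, so the denominator is never zero
        denominator = np.sum(numerator, axis=axis, keepdims=True)

        y = numerator / denominator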


        ###########################
        
        assert x.shape == y.shape, 'Output should have the same shape as input'
        return y

    def loss(self, y, prob):
        """
        Parameters
        ----------
        y: array
            (N, M) matrix of M samples where columns are one-hot vectors for true values
        prob: array
            (N, M) matrix of M samples where columns are probability vectors after softmax

        Returns
        -------
        loss: float
            Cross-entropy loss calculated as: -(1 / M) * sum_i(sum_j(y_ij * log(prob_ij)))
        """

        prob = np.clip(prob, 1e-8, None)
        
        ###########################
        # YOUR CODE HERE
        M = y.shape[1]
        # Sum y_ij * log(prob_ij) over all entries, then average over the M samples
        loss = float(-np.sum(y * np.log(prob)) / M)
        ###########################
        
        assert isinstance(loss, float), 'Loss must be a scalar'
        return loss

    def forward(self, x, y):
        """
        Performs forward propagation and computes the loss
        
        Parameters
        ----------
        x: array
            M-dimensional mini-batched vector containing input word indices of int type
        y: array
            Output words, same dimension and type as 'x'
            
        Returns
        -------
        loss: float
            Cross-entropy loss
        """
        
        # Input transformation
        """
        Input is represented with M-dimensional vectors
        We convert them to (N, M) matrices such that columns are one-hot 
        representations of the input
        """
        x = self.one_hot(x, self.N)
        y = self.one_hot(y, self.N)
        
        # Forward propagation
        """
        Returns
        -------
        embedding: array
            (D, M) matrix where columns are word embedding from U matrix
        logits: array
            (N, M) matrix where columns are output logits
        prob: array
            (N, M) matrix where columns are output probabilities
        """
        
        ###########################
        # YOUR CODE HERE
        # Embedding
        embedding = np.dot(self.U, x)
        
        # Logits
        logits = np.dot(self.V, embedding)
        
        # Prob
        prob = self.softmax(logits, axis=0)
        
        # Loss is already implemented below
        ###########################

        assert embedding.shape == (self.D, x.shape[1])
        assert logits.shape == (self.N, x.shape[1])
        assert prob.shape == (self.N, x.shape[1])
        
        # Save values for backpropagation
        self.ctx = (embedding, logits, prob, x, y)
        
        # Loss calculation
        loss = self.loss(y, prob)
        
        return loss
        
    def backward(self):
        """
        Given parameters from forward propagation, returns gradient of U and V.
        
        Returns
        -------
        d_V: array
            (N, D) matrix of partial derivatives of loss w.r.t. V
        d_U: array
            (D, N) matrix of partial derivatives of loss w.r.t. U
        """

        embedding, logits, prob, x, y = self.ctx

        ###########################
        # YOUR CODE HERE
        M = prob.shape[1]

        # Gradient of the mean cross-entropy loss w.r.t. the logits: (N, M)
        d_logits = (prob - y) / M

        # logits = V @ embedding, so d_V = d_logits @ embedding.T: (N, D)
        d_V = d_logits @ embedding.T

        # embedding = U @ x, so d_U = V.T @ d_logits @ x.T: (D, N)
        d_U = self.V.T @ d_logits @ x.T
        
        ###########################

        assert d_V.shape == (self.N, self.D)
        assert d_U.shape == (self.D, self.N)

        return { 'V': d_V, 'U': d_U }
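
For reference, the gradients returned by `backward` can be derived as follows. This is a sketch that uses symbols not named in the notebook text: $\mathbf{X}$ and $\mathbf{Y}$ are the $(N, M)$ one-hot input and output matrices, $\mathbf{E} = \mathbf{U}\mathbf{X}$ the embeddings, $\mathbf{Z} = \mathbf{V}\mathbf{E}$ the logits, and $\mathbf{P}$ the column-wise softmax of $\mathbf{Z}$. It assumes every column of $\mathbf{Y}$ sums to one.

$$
L = -\frac{1}{M} \sum_{i,j} Y_{ij} \log P_{ij}
\quad\Rightarrow\quad
\frac{\partial L}{\partial \mathbf{Z}} = \frac{1}{M} (\mathbf{P} - \mathbf{Y})
$$

$$
\frac{\partial L}{\partial \mathbf{V}} = \frac{\partial L}{\partial \mathbf{Z}} \, \mathbf{E}^\top = \frac{1}{M} (\mathbf{P} - \mathbf{Y}) \, \mathbf{X}^\top \mathbf{U}^\top,
\qquad
\frac{\partial L}{\partial \mathbf{U}} = \mathbf{V}^\top \frac{\partial L}{\partial \mathbf{Z}} \, \mathbf{X}^\top = \frac{1}{M} \mathbf{V}^\top (\mathbf{P} - \mathbf{Y}) \, \mathbf{X}^\top
$$

The shapes are $(N, D)$ for $\partial L / \partial \mathbf{V}$ and $(D, N)$ for $\partial L / \partial \mathbf{U}$, matching the assertions in `backward`.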

In [331]:
class Optimizer():
    """
    Stochastic gradient descent with momentum optimizer.

    Parameters
    ----------
    model: object
        Model as defined above
    learning_rate: float
        Learning rate
    momentum: float (optional)
        Momentum factor (default: 0)
    """
    def __init__(self, model, learning_rate, momentum=0):
        self.model = model
        self.learning_rate = learning_rate
        self.momentum = momentum
        
        self.previous = None # Previous gradients
    
    def _init_previous(self, grad):
        # Initialize previous gradients to zero
        self.previous = { k: np.zeros_like(v) for k,v in grad.items() }
    
    def step(self, grad):
        if self.previous is None:
            self._init_previous(grad)
            
        for name, dw in grad.items():
            dw_prev = self.previous[name]
            w = getattr(self.model, name)

            """
            Given weight w, previous gradients dw_prev and current 
            gradients dw, performs an update of weight w.

            Returns
            -------
            dw_new: array
                New gradients calculated as combination of previous and
                current, weighted with momentum factor.
            w_new: array
                New weights calculated with a single step of gradient
                descent using dw_new.
            """
            ###########################
            # YOUR CODE HERE
            # Blend the previous and current gradients using the momentum factor
            dw_new = self.momentum * dw_prev + (1 - self.momentum) * dw
            # Take one gradient descent step along the blended gradient
            w_new = w - self.learning_rate * dw_new
            
            
            ###########################

            self.previous[name] = dw_new
            setattr(self.model, name, w_new)
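
The update rule in `step` can be followed by hand on a one-parameter problem. The sketch below minimizes the toy loss $w^2$ (gradient $2w$) with the same blending of previous and current gradients; the learning rate and momentum values are chosen only for this example.

```python
import numpy as np

learning_rate, momentum = 0.1, 0.5
w = np.array([1.0])
dw_prev = np.zeros_like(w)       # previous gradient starts at zero

history = []
for _ in range(3):
    dw = 2 * w                   # gradient of the toy loss w**2
    dw_new = momentum * dw_prev + (1 - momentum) * dw
    w = w - learning_rate * dw_new
    dw_prev = dw_new
    history.append(float(w[0]))

print(history)  # approximately [0.9, 0.76, 0.614]
```

With this weighting the first step is damped by the factor `(1 - momentum)`, because the previous gradient is still zero.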

## 3.1 Gradient check

The following code checks whether the updates for weights are implemented correctly. It should run without an error.

In [332]:
def get_loss(model, old, variable, epsilon, x, y, i, j):
    np.random.seed(123)
    model.reset_parameters() # reset weights
    
    delta = np.zeros_like(old)
    delta[i, j] = epsilon
    
    setattr(model, variable, old + delta) # change one weight by a small amount
    loss = model.forward(x, y)

    return loss

def gradient_check_for_weight(model, variable, i, j, k, l):
    x, y = np.array([i, i]), np.array([j, j]) # set input and output
    
    np.random.seed(123)
    model.reset_parameters() # reset weights

    old = getattr(model, variable)
    
    loss = model.forward(x, y)
    grad = model.backward()
    
    eps = 1e-4
    loss_positive = get_loss(model, old, variable, eps, x, y, k, l) # loss for positive change on one weight
    loss_negative = get_loss(model, old, variable, -eps, x, y, k, l) # loss for negative change on one weight
    
    true_gradient = (loss_positive - loss_negative) / 2 / eps # calculate true derivative wrt one weight

    assert abs(true_gradient - grad[variable][k, l]) < 1e-5, 'Incorrect gradient'

def gradient_check():
    N, D = VOCABULARY_SIZE, EMBEDDING_DIM
    model = Embedding(N, D)

    # check for V
    for _ in range(20):
        i, j, k = [np.random.randint(0, d) for d in [N, N, D]] # get random indices for input and weights
        gradient_check_for_weight(model, 'V', i, j, i, k)

    # check for U
    for _ in range(20):
        i, j, k = [np.random.randint(0, d) for d in [N, N, D]]
        gradient_check_for_weight(model, 'U', i, j, k, i)

    print('Gradients checked - all good!')

gradient_check()

Gradients checked - all good!
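
The check above compares analytic gradients with central finite differences. The same idea is shown below on a function whose gradient is known in closed form; `f` and `numerical_gradient` are hypothetical names used only in this sketch.

```python
import numpy as np

def f(w):
    """Toy loss whose analytic gradient is w itself."""
    return float(np.sum(w ** 2) / 2)

def numerical_gradient(f, w, k, eps=1e-4):
    """Central difference approximation of df/dw[k]."""
    delta = np.zeros_like(w)
    delta[k] = eps
    return (f(w + delta) - f(w - delta)) / (2 * eps)

w = np.array([0.5, -1.0, 2.0])
approx = np.array([numerical_gradient(f, w, k) for k in range(w.size)])
print(np.max(np.abs(approx - w)))  # close to zero
```

The central difference has an error of order `eps ** 2`, which is why it is preferred over the one-sided difference for this kind of check.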


# 4. Training

We train our model using stochastic gradient descent. At every step we sample a mini-batch from data and update the weights.

The following function samples words from data and creates mini-batches. It subsamples frequent words based on previously calculated probabilities.

In [333]:
rng = np.random.default_rng(123)
def get_batch(data, size, prob):
    x = rng.choice(data, size, p=prob)
    return x[:,0], x[:,1]
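
A small sketch of how `Generator.choice` draws whole rows from a 2-D array according to per-row probabilities, which is what `get_batch` relies on. The toy pairs and probabilities are made up for this example.

```python
import numpy as np

toy_rng = np.random.default_rng(0)
toy_data = np.array([[0, 1], [1, 0], [1, 2], [2, 1]])   # (input, output) pairs
toy_prob = np.array([0.1, 0.2, 0.3, 0.4])               # one weight per pair, sums to 1

# Rows are sampled with replacement along axis 0
batch = toy_rng.choice(toy_data, size=6, p=toy_prob)
x, y = batch[:, 0], batch[:, 1]
print(batch.shape)  # (6, 2)
```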

Training the model can take some time so plan accordingly.

In [334]:
model = Embedding(N=VOCABULARY_SIZE, D=EMBEDDING_DIM)
optim = Optimizer(model, learning_rate=1.0, momentum=0.5)

losses = []

MAX_ITERATIONS = 15000
PRINT_EVERY = 1000
BATCH_SIZE = 1000

for i in range(MAX_ITERATIONS):
    x, y = get_batch(data, BATCH_SIZE, probabilities)
    
    loss = model.forward(x, y)
    grad = model.backward()
    optim.step(grad)
    
    assert not np.isnan(loss)
    
    losses.append(loss)

    if (i + 1) % PRINT_EVERY == 0:
        print(f'Iteration: {i + 1}, Avg. training loss: {np.mean(losses[-PRINT_EVERY:]):.4f}')

Iteration: 1, Avg. training loss: 5.3013
Iteration: 2, Avg. training loss: 5.2977
Iteration: 3, Avg. training loss: 5.2675
Iteration: 4, Avg. training loss: 5.2392
Iteration: 5, Avg. training loss: 5.2188
Iteration: 6, Avg. training loss: 5.1706
Iteration: 7, Avg. training loss: 5.1269
Iteration: 8, Avg. training loss: 5.1035
Iteration: 9, Avg. training loss: 5.0166
Iteration: 10, Avg. training loss: 4.9812
Iteration: 11, Avg. training loss: 4.9183
Iteration: 12, Avg. training loss: 4.8464
Iteration: 13, Avg. training loss: 4.8461
Iteration: 14, Avg. training loss: 4.7251
Iteration: 15, Avg. training loss: 4.7361
Iteration: 16, Avg. training loss: 4.7997
Iteration: 17, Avg. training loss: 4.8074
Iteration: 18, Avg. training loss: 4.7504
Iteration: 19, Avg. training loss: 4.7324
Iteration: 20, Avg. training loss: 4.6748
Iteration: 21, Avg. training loss: 4.7030
Iteration: 22, Avg. training loss: 4.6306
Iteration: 23, Avg. training loss: 4.6083
Iteration: 24, Avg. training loss: 4.6245
...

Iteration: 195, Avg. training loss: 3.7750
Iteration: 196, Avg. training loss: 3.6376
Iteration: 197, Avg. training loss: 3.7834
Iteration: 198, Avg. training loss: 3.8158
Iteration: 199, Avg. training loss: 3.7944
Iteration: 200, Avg. training loss: 3.7179
Iteration: 201, Avg. training loss: 3.7831
Iteration: 202, Avg. training loss: 3.8287
Iteration: 203, Avg. training loss: 3.7777
Iteration: 204, Avg. training loss: 3.7219
Iteration: 205, Avg. training loss: 3.9539
Iteration: 206, Avg. training loss: 3.6978
Iteration: 207, Avg. training loss: 3.7552
Iteration: 208, Avg. training loss: 3.7614
Iteration: 209, Avg. training loss: 3.8064
Iteration: 210, Avg. training loss: 3.7043
Iteration: 211, Avg. training loss: 3.7267
Iteration: 212, Avg. training loss: 3.5535
Iteration: 213, Avg. training loss: 3.6998
Iteration: 214, Avg. training loss: 3.6828
Iteration: 215, Avg. training loss: 3.7365
Iteration: 216, Avg. training loss: 3.8211
Iteration: 217, Avg. training loss: 3.6747
...

Iteration: 386, Avg. training loss: 3.6679
Iteration: 387, Avg. training loss: 3.6328
Iteration: 388, Avg. training loss: 3.6545
Iteration: 389, Avg. training loss: 3.6474
Iteration: 390, Avg. training loss: 3.6682
Iteration: 391, Avg. training loss: 3.6232
Iteration: 392, Avg. training loss: 3.6362
Iteration: 393, Avg. training loss: 3.6056
Iteration: 394, Avg. training loss: 3.6198
Iteration: 395, Avg. training loss: 3.5714
Iteration: 396, Avg. training loss: 3.6999
Iteration: 397, Avg. training loss: 3.7037
Iteration: 398, Avg. training loss: 3.6688
Iteration: 399, Avg. training loss: 3.6438
Iteration: 400, Avg. training loss: 3.5285
Iteration: 401, Avg. training loss: 3.5339
Iteration: 402, Avg. training loss: 3.6505
Iteration: 403, Avg. training loss: 3.6816
Iteration: 404, Avg. training loss: 3.5818
Iteration: 405, Avg. training loss: 3.6066
Iteration: 406, Avg. training loss: 3.7111
Iteration: 407, Avg. training loss: 3.5908
Iteration: 408, Avg. training loss: 3.6089
...

Iteration: 577, Avg. training loss: 3.6511
Iteration: 578, Avg. training loss: 3.6203
Iteration: 579, Avg. training loss: 3.6094
Iteration: 580, Avg. training loss: 3.6516
Iteration: 581, Avg. training loss: 3.5630
Iteration: 582, Avg. training loss: 3.6709
Iteration: 583, Avg. training loss: 3.5678
Iteration: 584, Avg. training loss: 3.5857
Iteration: 585, Avg. training loss: 3.5861
Iteration: 586, Avg. training loss: 3.5811
Iteration: 587, Avg. training loss: 3.6507
Iteration: 588, Avg. training loss: 3.6297
Iteration: 589, Avg. training loss: 3.5356
Iteration: 590, Avg. training loss: 3.6268
Iteration: 591, Avg. training loss: 3.5850
Iteration: 592, Avg. training loss: 3.6621
Iteration: 593, Avg. training loss: 3.6024
Iteration: 594, Avg. training loss: 3.6062
Iteration: 595, Avg. training loss: 3.5863
Iteration: 596, Avg. training loss: 3.6501
Iteration: 597, Avg. training loss: 3.5663
Iteration: 598, Avg. training loss: 3.6215
Iteration: 599, Avg. training loss: 3.6306
...

Iteration: 768, Avg. training loss: 3.6288
Iteration: 769, Avg. training loss: 3.5409
Iteration: 770, Avg. training loss: 3.6222
Iteration: 771, Avg. training loss: 3.6185
Iteration: 772, Avg. training loss: 3.7634
Iteration: 773, Avg. training loss: 3.5731
Iteration: 774, Avg. training loss: 3.6010
Iteration: 775, Avg. training loss: 3.7132
Iteration: 776, Avg. training loss: 3.5791
Iteration: 777, Avg. training loss: 3.4303
Iteration: 778, Avg. training loss: 3.6804
Iteration: 779, Avg. training loss: 3.5508
Iteration: 780, Avg. training loss: 3.5784
Iteration: 781, Avg. training loss: 3.6965
Iteration: 782, Avg. training loss: 3.5930
Iteration: 783, Avg. training loss: 3.8144
Iteration: 784, Avg. training loss: 3.5824
Iteration: 785, Avg. training loss: 3.5441
Iteration: 786, Avg. training loss: 3.6029
Iteration: 787, Avg. training loss: 3.6116
Iteration: 788, Avg. training loss: 3.5622
Iteration: 789, Avg. training loss: 3.5553
Iteration: 790, Avg. training loss: 3.6794
...

Iteration: 959, Avg. training loss: 3.6536
Iteration: 960, Avg. training loss: 3.5944
Iteration: 961, Avg. training loss: 3.5790
Iteration: 962, Avg. training loss: 3.6324
Iteration: 963, Avg. training loss: 3.5353
Iteration: 964, Avg. training loss: 3.6413
Iteration: 965, Avg. training loss: 3.6023
Iteration: 966, Avg. training loss: 3.5528
Iteration: 967, Avg. training loss: 3.6162
Iteration: 968, Avg. training loss: 3.4248
Iteration: 969, Avg. training loss: 3.6100
Iteration: 970, Avg. training loss: 3.6186
Iteration: 971, Avg. training loss: 3.6224
Iteration: 972, Avg. training loss: 3.5353
Iteration: 973, Avg. training loss: 3.5011
Iteration: 974, Avg. training loss: 3.4122
Iteration: 975, Avg. training loss: 3.6319
Iteration: 976, Avg. training loss: 3.5690
Iteration: 977, Avg. training loss: 3.6257
Iteration: 978, Avg. training loss: 3.6266
Iteration: 979, Avg. training loss: 3.5975
Iteration: 980, Avg. training loss: 3.6033
Iteration: 981, Avg. training loss: 3.6270
...

Iteration: 1147, Avg. training loss: 3.6184
Iteration: 1148, Avg. training loss: 3.5087
Iteration: 1149, Avg. training loss: 3.5171
Iteration: 1150, Avg. training loss: 3.6216
Iteration: 1151, Avg. training loss: 3.4601
Iteration: 1152, Avg. training loss: 3.6053
Iteration: 1153, Avg. training loss: 3.5931
Iteration: 1154, Avg. training loss: 3.5356
Iteration: 1155, Avg. training loss: 3.4696
Iteration: 1156, Avg. training loss: 3.6553
Iteration: 1157, Avg. training loss: 3.5876
Iteration: 1158, Avg. training loss: 3.6548
Iteration: 1159, Avg. training loss: 3.4615
Iteration: 1160, Avg. training loss: 3.5348
Iteration: 1161, Avg. training loss: 3.4826
Iteration: 1162, Avg. training loss: 3.5305
Iteration: 1163, Avg. training loss: 3.5411
Iteration: 1164, Avg. training loss: 3.5247
Iteration: 1165, Avg. training loss: 3.5026
Iteration: 1166, Avg. training loss: 3.5371
Iteration: 1167, Avg. training loss: 3.6588
Iteration: 1168, Avg. training loss: 3.5717
...

Iteration: 1334, Avg. training loss: 3.6767
Iteration: 1335, Avg. training loss: 3.6348
Iteration: 1336, Avg. training loss: 3.6714
Iteration: 1337, Avg. training loss: 3.6803
Iteration: 1338, Avg. training loss: 3.5778
Iteration: 1339, Avg. training loss: 3.5405
Iteration: 1340, Avg. training loss: 3.5623
Iteration: 1341, Avg. training loss: 3.4820
Iteration: 1342, Avg. training loss: 3.6158
Iteration: 1343, Avg. training loss: 3.5261
Iteration: 1344, Avg. training loss: 3.5011
Iteration: 1345, Avg. training loss: 3.6685
Iteration: 1346, Avg. training loss: 3.5706
Iteration: 1347, Avg. training loss: 3.6352
Iteration: 1348, Avg. training loss: 3.5337
Iteration: 1349, Avg. training loss: 3.6735
Iteration: 1350, Avg. training loss: 3.6037
Iteration: 1351, Avg. training loss: 3.5304
Iteration: 1352, Avg. training loss: 3.5660
Iteration: 1353, Avg. training loss: 3.5207
Iteration: 1354, Avg. training loss: 3.5554
Iteration: 1355, Avg. training loss: 3.5657
...

Iteration: 1521, Avg. training loss: 3.7480
Iteration: 1522, Avg. training loss: 3.5763
Iteration: 1523, Avg. training loss: 3.4549
Iteration: 1524, Avg. training loss: 3.5274
Iteration: 1525, Avg. training loss: 3.4661
Iteration: 1526, Avg. training loss: 3.5915
Iteration: 1527, Avg. training loss: 3.4569
Iteration: 1528, Avg. training loss: 3.5198
Iteration: 1529, Avg. training loss: 3.5504
Iteration: 1530, Avg. training loss: 3.5824
Iteration: 1531, Avg. training loss: 3.6469
Iteration: 1532, Avg. training loss: 3.5701
Iteration: 1533, Avg. training loss: 3.5757
Iteration: 1534, Avg. training loss: 3.6575
Iteration: 1535, Avg. training loss: 3.5696
Iteration: 1536, Avg. training loss: 3.5916
Iteration: 1537, Avg. training loss: 3.5865
Iteration: 1538, Avg. training loss: 3.6052
Iteration: 1539, Avg. training loss: 3.5041
Iteration: 1540, Avg. training loss: 3.6032
Iteration: 1541, Avg. training loss: 3.5136
Iteration: 1542, Avg. training loss: 3.4833
...

Iteration: 1708, Avg. training loss: 3.5194
Iteration: 1709, Avg. training loss: 3.5207
Iteration: 1710, Avg. training loss: 3.4898
Iteration: 1711, Avg. training loss: 3.5258
Iteration: 1712, Avg. training loss: 3.5919
Iteration: 1713, Avg. training loss: 3.4891
Iteration: 1714, Avg. training loss: 3.5308
Iteration: 1715, Avg. training loss: 3.6076
Iteration: 1716, Avg. training loss: 3.5002
Iteration: 1717, Avg. training loss: 3.6811
Iteration: 1718, Avg. training loss: 3.6559
Iteration: 1719, Avg. training loss: 3.5642
Iteration: 1720, Avg. training loss: 3.5206
Iteration: 1721, Avg. training loss: 3.5130
Iteration: 1722, Avg. training loss: 3.5718
Iteration: 1723, Avg. training loss: 3.7099
Iteration: 1724, Avg. training loss: 3.5660
Iteration: 1725, Avg. training loss: 3.6852
Iteration: 1726, Avg. training loss: 3.6280
Iteration: 1727, Avg. training loss: 3.6371
Iteration: 1728, Avg. training loss: 3.5909
Iteration: 1729, Avg. training loss: 3.6550
...

Iteration: 1895, Avg. training loss: 3.4514
Iteration: 1896, Avg. training loss: 3.5429
Iteration: 1897, Avg. training loss: 3.5742
Iteration: 1898, Avg. training loss: 3.4265
Iteration: 1899, Avg. training loss: 3.5155
Iteration: 1900, Avg. training loss: 3.6205
Iteration: 1901, Avg. training loss: 3.5099
Iteration: 1902, Avg. training loss: 3.4880
Iteration: 1903, Avg. training loss: 3.5046
Iteration: 1904, Avg. training loss: 3.5405
Iteration: 1905, Avg. training loss: 3.5548
Iteration: 1906, Avg. training loss: 3.5738
Iteration: 1907, Avg. training loss: 3.4951
Iteration: 1908, Avg. training loss: 3.5387
Iteration: 1909, Avg. training loss: 3.6541
Iteration: 1910, Avg. training loss: 3.5389
Iteration: 1911, Avg. training loss: 3.6029
Iteration: 1912, Avg. training loss: 3.5955
Iteration: 1913, Avg. training loss: 3.5436
Iteration: 1914, Avg. training loss: 3.6313
Iteration: 1915, Avg. training loss: 3.5327
Iteration: 1916, Avg. training loss: 3.4742
...

Iteration: 2082, Avg. training loss: 3.5394
Iteration: 2083, Avg. training loss: 3.5680
Iteration: 2084, Avg. training loss: 3.4489
Iteration: 2085, Avg. training loss: 3.4244
Iteration: 2086, Avg. training loss: 3.5080
Iteration: 2087, Avg. training loss: 3.6438
Iteration: 2088, Avg. training loss: 3.6500
Iteration: 2089, Avg. training loss: 3.6968
Iteration: 2090, Avg. training loss: 3.5759
Iteration: 2091, Avg. training loss: 3.4858
Iteration: 2092, Avg. training loss: 3.5840
Iteration: 2093, Avg. training loss: 3.5947
Iteration: 2094, Avg. training loss: 3.4981
Iteration: 2095, Avg. training loss: 3.5749
Iteration: 2096, Avg. training loss: 3.4403
Iteration: 2097, Avg. training loss: 3.4530
Iteration: 2098, Avg. training loss: 3.5021
Iteration: 2099, Avg. training loss: 3.6623
Iteration: 2100, Avg. training loss: 3.5090
Iteration: 2101, Avg. training loss: 3.5037
Iteration: 2102, Avg. training loss: 3.6807
Iteration: 2103, Avg. training loss: 3.5393
...

Iteration: 2269, Avg. training loss: 3.5530
Iteration: 2270, Avg. training loss: 3.5667
Iteration: 2271, Avg. training loss: 3.4919
Iteration: 2272, Avg. training loss: 3.5392
Iteration: 2273, Avg. training loss: 3.5693
Iteration: 2274, Avg. training loss: 3.5850
Iteration: 2275, Avg. training loss: 3.5564
Iteration: 2276, Avg. training loss: 3.4806
Iteration: 2277, Avg. training loss: 3.6261
Iteration: 2278, Avg. training loss: 3.5311
Iteration: 2279, Avg. training loss: 3.6136
Iteration: 2280, Avg. training loss: 3.5293
Iteration: 2281, Avg. training loss: 3.4348
Iteration: 2282, Avg. training loss: 3.5171
Iteration: 2283, Avg. training loss: 3.6253
Iteration: 2284, Avg. training loss: 3.6373
Iteration: 2285, Avg. training loss: 3.6116
Iteration: 2286, Avg. training loss: 3.5907
Iteration: 2287, Avg. training loss: 3.4560
Iteration: 2288, Avg. training loss: 3.4716
Iteration: 2289, Avg. training loss: 3.4389
Iteration: 2290, Avg. training loss: 3.6079
Iteration: 2291, Avg. training l

Iteration: 2456, Avg. training loss: 3.5688
Iteration: 2457, Avg. training loss: 3.6304
Iteration: 2458, Avg. training loss: 3.5525
Iteration: 2459, Avg. training loss: 3.5544
Iteration: 2460, Avg. training loss: 3.5466
Iteration: 2461, Avg. training loss: 3.4892
Iteration: 2462, Avg. training loss: 3.5419
Iteration: 2463, Avg. training loss: 3.6681
Iteration: 2464, Avg. training loss: 3.5794
Iteration: 2465, Avg. training loss: 3.4404
Iteration: 2466, Avg. training loss: 3.6106
Iteration: 2467, Avg. training loss: 3.6363
Iteration: 2468, Avg. training loss: 3.5487
Iteration: 2469, Avg. training loss: 3.5929
Iteration: 2470, Avg. training loss: 3.4679
Iteration: 2471, Avg. training loss: 3.5272
Iteration: 2472, Avg. training loss: 3.6163
Iteration: 2473, Avg. training loss: 3.4738
Iteration: 2474, Avg. training loss: 3.6960
Iteration: 2475, Avg. training loss: 3.4541
Iteration: 2476, Avg. training loss: 3.6880
Iteration: 2477, Avg. training loss: 3.4542
Iteration: 2478, Avg. training l

Iteration: 2643, Avg. training loss: 3.5890
Iteration: 2644, Avg. training loss: 3.5718
Iteration: 2645, Avg. training loss: 3.5724
Iteration: 2646, Avg. training loss: 3.4659
Iteration: 2647, Avg. training loss: 3.4624
Iteration: 2648, Avg. training loss: 3.4856
Iteration: 2649, Avg. training loss: 3.5858
Iteration: 2650, Avg. training loss: 3.6311
Iteration: 2651, Avg. training loss: 3.5328
Iteration: 2652, Avg. training loss: 3.5443
Iteration: 2653, Avg. training loss: 3.4263
Iteration: 2654, Avg. training loss: 3.5997
Iteration: 2655, Avg. training loss: 3.5159
Iteration: 2656, Avg. training loss: 3.5100
Iteration: 2657, Avg. training loss: 3.5392
Iteration: 2658, Avg. training loss: 3.5692
Iteration: 2659, Avg. training loss: 3.6142
Iteration: 2660, Avg. training loss: 3.4770
Iteration: 2661, Avg. training loss: 3.5168
Iteration: 2662, Avg. training loss: 3.6600
Iteration: 2663, Avg. training loss: 3.5135
Iteration: 2664, Avg. training loss: 3.5860
Iteration: 2665, Avg. training l

Iteration: 2830, Avg. training loss: 3.5691
Iteration: 2831, Avg. training loss: 3.7881
Iteration: 2832, Avg. training loss: 3.4733
Iteration: 2833, Avg. training loss: 3.4517
Iteration: 2834, Avg. training loss: 3.5223
Iteration: 2835, Avg. training loss: 3.5085
Iteration: 2836, Avg. training loss: 3.5730
Iteration: 2837, Avg. training loss: 3.5348
Iteration: 2838, Avg. training loss: 3.5375
Iteration: 2839, Avg. training loss: 3.4729
Iteration: 2840, Avg. training loss: 3.5110
Iteration: 2841, Avg. training loss: 3.6309
Iteration: 2842, Avg. training loss: 3.4706
Iteration: 2843, Avg. training loss: 3.5716
Iteration: 2844, Avg. training loss: 3.5555
Iteration: 2845, Avg. training loss: 3.5859
Iteration: 2846, Avg. training loss: 3.4547
Iteration: 2847, Avg. training loss: 3.5217
Iteration: 2848, Avg. training loss: 3.6056
Iteration: 2849, Avg. training loss: 3.5571
Iteration: 2850, Avg. training loss: 3.5518
Iteration: 2851, Avg. training loss: 3.5764
Iteration: 2852, Avg. training l

Iteration: 3017, Avg. training loss: 3.5643
Iteration: 3018, Avg. training loss: 3.5244
Iteration: 3019, Avg. training loss: 3.5825
Iteration: 3020, Avg. training loss: 3.7285
Iteration: 3021, Avg. training loss: 3.4554
Iteration: 3022, Avg. training loss: 3.4931
Iteration: 3023, Avg. training loss: 3.6198
Iteration: 3024, Avg. training loss: 3.6409
Iteration: 3025, Avg. training loss: 3.5244
Iteration: 3026, Avg. training loss: 3.5770
Iteration: 3027, Avg. training loss: 3.4370
Iteration: 3028, Avg. training loss: 3.6779
Iteration: 3029, Avg. training loss: 3.5856
Iteration: 3030, Avg. training loss: 3.5106
Iteration: 3031, Avg. training loss: 3.4930
Iteration: 3032, Avg. training loss: 3.4171
Iteration: 3033, Avg. training loss: 3.5682
Iteration: 3034, Avg. training loss: 3.4826
Iteration: 3035, Avg. training loss: 3.5523
Iteration: 3036, Avg. training loss: 3.5259
Iteration: 3037, Avg. training loss: 3.4582
Iteration: 3038, Avg. training loss: 3.6395
Iteration: 3039, Avg. training l

Iteration: 3204, Avg. training loss: 3.4350
Iteration: 3205, Avg. training loss: 3.5444
Iteration: 3206, Avg. training loss: 3.4764
Iteration: 3207, Avg. training loss: 3.4926
Iteration: 3208, Avg. training loss: 3.6225
Iteration: 3209, Avg. training loss: 3.4718
Iteration: 3210, Avg. training loss: 3.4761
Iteration: 3211, Avg. training loss: 3.5420
Iteration: 3212, Avg. training loss: 3.5872
Iteration: 3213, Avg. training loss: 3.5168
Iteration: 3214, Avg. training loss: 3.5360
Iteration: 3215, Avg. training loss: 3.6021
Iteration: 3216, Avg. training loss: 3.5420
Iteration: 3217, Avg. training loss: 3.5815
Iteration: 3218, Avg. training loss: 3.5664
Iteration: 3219, Avg. training loss: 3.6881
Iteration: 3220, Avg. training loss: 3.3984
Iteration: 3221, Avg. training loss: 3.5112
Iteration: 3222, Avg. training loss: 3.4068
Iteration: 3223, Avg. training loss: 3.5319
Iteration: 3224, Avg. training loss: 3.4736
Iteration: 3225, Avg. training loss: 3.5859
Iteration: 3226, Avg. training l

Iteration: 3391, Avg. training loss: 3.4122
Iteration: 3392, Avg. training loss: 3.5405
Iteration: 3393, Avg. training loss: 3.5615
Iteration: 3394, Avg. training loss: 3.5159
Iteration: 3395, Avg. training loss: 3.6162
Iteration: 3396, Avg. training loss: 3.5481
Iteration: 3397, Avg. training loss: 3.5477
Iteration: 3398, Avg. training loss: 3.4323
Iteration: 3399, Avg. training loss: 3.4521
Iteration: 3400, Avg. training loss: 3.4932
Iteration: 3401, Avg. training loss: 3.4922
Iteration: 3402, Avg. training loss: 3.5271
Iteration: 3403, Avg. training loss: 3.4969
Iteration: 3404, Avg. training loss: 3.5030
Iteration: 3405, Avg. training loss: 3.4866
Iteration: 3406, Avg. training loss: 3.5119
Iteration: 3407, Avg. training loss: 3.6489
Iteration: 3408, Avg. training loss: 3.4773
Iteration: 3409, Avg. training loss: 3.5671
Iteration: 3410, Avg. training loss: 3.6046
Iteration: 3411, Avg. training loss: 3.4725
Iteration: 3412, Avg. training loss: 3.5242
Iteration: 3413, Avg. training l

Iteration: 3578, Avg. training loss: 3.5608
Iteration: 3579, Avg. training loss: 3.5713
Iteration: 3580, Avg. training loss: 3.5075
Iteration: 3581, Avg. training loss: 3.5772
Iteration: 3582, Avg. training loss: 3.4534
Iteration: 3583, Avg. training loss: 3.4959
Iteration: 3584, Avg. training loss: 3.5321
Iteration: 3585, Avg. training loss: 3.7645
Iteration: 3586, Avg. training loss: 3.4630
Iteration: 3587, Avg. training loss: 3.5214
Iteration: 3588, Avg. training loss: 3.5031
Iteration: 3589, Avg. training loss: 3.6274
Iteration: 3590, Avg. training loss: 3.5387
Iteration: 3591, Avg. training loss: 3.4944
Iteration: 3592, Avg. training loss: 3.4740
Iteration: 3593, Avg. training loss: 3.5335
Iteration: 3594, Avg. training loss: 3.5078
Iteration: 3595, Avg. training loss: 3.4797
Iteration: 3596, Avg. training loss: 3.5550
Iteration: 3597, Avg. training loss: 3.5548
Iteration: 3598, Avg. training loss: 3.5727
Iteration: 3599, Avg. training loss: 3.5827
Iteration: 3600, Avg. training l

Iteration: 3765, Avg. training loss: 3.5107
Iteration: 3766, Avg. training loss: 3.5456
Iteration: 3767, Avg. training loss: 3.5949
Iteration: 3768, Avg. training loss: 3.5024
Iteration: 3769, Avg. training loss: 3.5440
Iteration: 3770, Avg. training loss: 3.4899
Iteration: 3771, Avg. training loss: 3.5615
Iteration: 3772, Avg. training loss: 3.6049
Iteration: 3773, Avg. training loss: 3.5799
Iteration: 3774, Avg. training loss: 3.5726
Iteration: 3775, Avg. training loss: 3.5475
Iteration: 3776, Avg. training loss: 3.5989
Iteration: 3777, Avg. training loss: 3.5317
Iteration: 3778, Avg. training loss: 3.5836
Iteration: 3779, Avg. training loss: 3.4680
Iteration: 3780, Avg. training loss: 3.5650
Iteration: 3781, Avg. training loss: 3.5688
Iteration: 3782, Avg. training loss: 3.5093
Iteration: 3783, Avg. training loss: 3.4940
Iteration: 3784, Avg. training loss: 3.5370
Iteration: 3785, Avg. training loss: 3.5654
Iteration: 3786, Avg. training loss: 3.4365
Iteration: 3787, Avg. training l

Iteration: 3952, Avg. training loss: 3.5230
Iteration: 3953, Avg. training loss: 3.5380
Iteration: 3954, Avg. training loss: 3.3258
Iteration: 3955, Avg. training loss: 3.5392
Iteration: 3956, Avg. training loss: 3.4948
Iteration: 3957, Avg. training loss: 3.6216
Iteration: 3958, Avg. training loss: 3.5289
Iteration: 3959, Avg. training loss: 3.6572
Iteration: 3960, Avg. training loss: 3.5631
Iteration: 3961, Avg. training loss: 3.6241
Iteration: 3962, Avg. training loss: 3.4509
Iteration: 3963, Avg. training loss: 3.5154
Iteration: 3964, Avg. training loss: 3.5039
Iteration: 3965, Avg. training loss: 3.4666
Iteration: 3966, Avg. training loss: 3.4158
Iteration: 3967, Avg. training loss: 3.5292
Iteration: 3968, Avg. training loss: 3.4933
Iteration: 3969, Avg. training loss: 3.5105
Iteration: 3970, Avg. training loss: 3.6007
Iteration: 3971, Avg. training loss: 3.6125
Iteration: 3972, Avg. training loss: 3.5954
Iteration: 3973, Avg. training loss: 3.6195
Iteration: 3974, Avg. training l

Iteration: 4139, Avg. training loss: 3.4554
Iteration: 4140, Avg. training loss: 3.5943
Iteration: 4141, Avg. training loss: 3.5001
Iteration: 4142, Avg. training loss: 3.5787
Iteration: 4143, Avg. training loss: 3.5402
Iteration: 4144, Avg. training loss: 3.5568
Iteration: 4145, Avg. training loss: 3.5929
Iteration: 4146, Avg. training loss: 3.5076
Iteration: 4147, Avg. training loss: 3.6047
Iteration: 4148, Avg. training loss: 3.6173
Iteration: 4149, Avg. training loss: 3.5883
Iteration: 4150, Avg. training loss: 3.4061
Iteration: 4151, Avg. training loss: 3.4426
Iteration: 4152, Avg. training loss: 3.5855
Iteration: 4153, Avg. training loss: 3.5417
Iteration: 4154, Avg. training loss: 3.5579
Iteration: 4155, Avg. training loss: 3.6719
Iteration: 4156, Avg. training loss: 3.4660
Iteration: 4157, Avg. training loss: 3.4794
Iteration: 4158, Avg. training loss: 3.4653
Iteration: 4159, Avg. training loss: 3.5022
Iteration: 4160, Avg. training loss: 3.6831
Iteration: 4161, Avg. training l

Iteration: 4326, Avg. training loss: 3.5181
Iteration: 4327, Avg. training loss: 3.4896
Iteration: 4328, Avg. training loss: 3.4780
Iteration: 4329, Avg. training loss: 3.4659
Iteration: 4330, Avg. training loss: 3.6252
Iteration: 4331, Avg. training loss: 3.5952
Iteration: 4332, Avg. training loss: 3.5163
Iteration: 4333, Avg. training loss: 3.4921
Iteration: 4334, Avg. training loss: 3.7048
Iteration: 4335, Avg. training loss: 3.3743
Iteration: 4336, Avg. training loss: 3.6129
Iteration: 4337, Avg. training loss: 3.6433
Iteration: 4338, Avg. training loss: 3.5059
Iteration: 4339, Avg. training loss: 3.5767
Iteration: 4340, Avg. training loss: 3.6331
Iteration: 4341, Avg. training loss: 3.6552
Iteration: 4342, Avg. training loss: 3.5870
Iteration: 4343, Avg. training loss: 3.6388
Iteration: 4344, Avg. training loss: 3.5668
Iteration: 4345, Avg. training loss: 3.5238
Iteration: 4346, Avg. training loss: 3.5581
Iteration: 4347, Avg. training loss: 3.4920
Iteration: 4348, Avg. training l

Iteration: 4513, Avg. training loss: 3.5987
Iteration: 4514, Avg. training loss: 3.5384
Iteration: 4515, Avg. training loss: 3.5354
Iteration: 4516, Avg. training loss: 3.5563
Iteration: 4517, Avg. training loss: 3.4919
Iteration: 4518, Avg. training loss: 3.6382
Iteration: 4519, Avg. training loss: 3.5716
Iteration: 4520, Avg. training loss: 3.6285
Iteration: 4521, Avg. training loss: 3.5477
Iteration: 4522, Avg. training loss: 3.4671
Iteration: 4523, Avg. training loss: 3.6027
Iteration: 4524, Avg. training loss: 3.5071
Iteration: 4525, Avg. training loss: 3.5809
Iteration: 4526, Avg. training loss: 3.5921
Iteration: 4527, Avg. training loss: 3.4980
Iteration: 4528, Avg. training loss: 3.6924
Iteration: 4529, Avg. training loss: 3.4480
Iteration: 4530, Avg. training loss: 3.5014
Iteration: 4531, Avg. training loss: 3.5922
Iteration: 4532, Avg. training loss: 3.6513
Iteration: 4533, Avg. training loss: 3.6558
Iteration: 4534, Avg. training loss: 3.5379
Iteration: 4535, Avg. training l

Iteration: 4700, Avg. training loss: 3.6630
Iteration: 4701, Avg. training loss: 3.5641
Iteration: 4702, Avg. training loss: 3.5178
Iteration: 4703, Avg. training loss: 3.5374
Iteration: 4704, Avg. training loss: 3.4556
Iteration: 4705, Avg. training loss: 3.5626
Iteration: 4706, Avg. training loss: 3.4749
Iteration: 4707, Avg. training loss: 3.5413
Iteration: 4708, Avg. training loss: 3.4347
Iteration: 4709, Avg. training loss: 3.5057
Iteration: 4710, Avg. training loss: 3.6035
Iteration: 4711, Avg. training loss: 3.6199
Iteration: 4712, Avg. training loss: 3.5031
Iteration: 4713, Avg. training loss: 3.5057
Iteration: 4714, Avg. training loss: 3.4835
Iteration: 4715, Avg. training loss: 3.5311
Iteration: 4716, Avg. training loss: 3.6703
Iteration: 4717, Avg. training loss: 3.4449
Iteration: 4718, Avg. training loss: 3.5034
Iteration: 4719, Avg. training loss: 3.6327
Iteration: 4720, Avg. training loss: 3.5592
Iteration: 4721, Avg. training loss: 3.4451
Iteration: 4722, Avg. training l

Iteration: 4887, Avg. training loss: 3.4787
Iteration: 4888, Avg. training loss: 3.5033
Iteration: 4889, Avg. training loss: 3.5471
Iteration: 4890, Avg. training loss: 3.4921
Iteration: 4891, Avg. training loss: 3.5946
Iteration: 4892, Avg. training loss: 3.4861
Iteration: 4893, Avg. training loss: 3.5202
Iteration: 4894, Avg. training loss: 3.5920
Iteration: 4895, Avg. training loss: 3.5024
Iteration: 4896, Avg. training loss: 3.4700
Iteration: 4897, Avg. training loss: 3.4278
Iteration: 4898, Avg. training loss: 3.4870
Iteration: 4899, Avg. training loss: 3.5357
Iteration: 4900, Avg. training loss: 3.5324
Iteration: 4901, Avg. training loss: 3.6133
Iteration: 4902, Avg. training loss: 3.5043
Iteration: 4903, Avg. training loss: 3.5634
Iteration: 4904, Avg. training loss: 3.4087
Iteration: 4905, Avg. training loss: 3.3547
Iteration: 4906, Avg. training loss: 3.4234
Iteration: 4907, Avg. training loss: 3.5120
Iteration: 4908, Avg. training loss: 3.5570
Iteration: 4909, Avg. training l

Iteration: 5074, Avg. training loss: 3.4590
Iteration: 5075, Avg. training loss: 3.5645
Iteration: 5076, Avg. training loss: 3.4947
Iteration: 5077, Avg. training loss: 3.4347
Iteration: 5078, Avg. training loss: 3.4929
Iteration: 5079, Avg. training loss: 3.5350
Iteration: 5080, Avg. training loss: 3.5285
Iteration: 5081, Avg. training loss: 3.4943
Iteration: 5082, Avg. training loss: 3.4240
Iteration: 5083, Avg. training loss: 3.4903
Iteration: 5084, Avg. training loss: 3.4600
Iteration: 5085, Avg. training loss: 3.5393
Iteration: 5086, Avg. training loss: 3.4896
Iteration: 5087, Avg. training loss: 3.4944
Iteration: 5088, Avg. training loss: 3.4710
Iteration: 5089, Avg. training loss: 3.5232
Iteration: 5090, Avg. training loss: 3.3536
Iteration: 5091, Avg. training loss: 3.4551
Iteration: 5092, Avg. training loss: 3.5447
Iteration: 5093, Avg. training loss: 3.6307
Iteration: 5094, Avg. training loss: 3.5177
Iteration: 5095, Avg. training loss: 3.4584
Iteration: 5096, Avg. training l

Iteration: 5261, Avg. training loss: 3.5399
Iteration: 5262, Avg. training loss: 3.5262
Iteration: 5263, Avg. training loss: 3.4503
Iteration: 5264, Avg. training loss: 3.5171
Iteration: 5265, Avg. training loss: 3.4824
Iteration: 5266, Avg. training loss: 3.4393
Iteration: 5267, Avg. training loss: 3.4567
Iteration: 5268, Avg. training loss: 3.5695
Iteration: 5269, Avg. training loss: 3.5398
Iteration: 5270, Avg. training loss: 3.5974
Iteration: 5271, Avg. training loss: 3.6355
Iteration: 5272, Avg. training loss: 3.5029
Iteration: 5273, Avg. training loss: 3.6391
Iteration: 5274, Avg. training loss: 3.5363
Iteration: 5275, Avg. training loss: 3.6040
Iteration: 5276, Avg. training loss: 3.5962
Iteration: 5277, Avg. training loss: 3.6832
Iteration: 5278, Avg. training loss: 3.4615
Iteration: 5279, Avg. training loss: 3.5628
Iteration: 5280, Avg. training loss: 3.4951
Iteration: 5281, Avg. training loss: 3.5412
Iteration: 5282, Avg. training loss: 3.5481
Iteration: 5283, Avg. training l

Iteration: 5448, Avg. training loss: 3.5235
Iteration: 5449, Avg. training loss: 3.5327
Iteration: 5450, Avg. training loss: 3.5177
Iteration: 5451, Avg. training loss: 3.5737
Iteration: 5452, Avg. training loss: 3.6773
Iteration: 5453, Avg. training loss: 3.5091
Iteration: 5454, Avg. training loss: 3.6596
Iteration: 5455, Avg. training loss: 3.5453
Iteration: 5456, Avg. training loss: 3.5152
Iteration: 5457, Avg. training loss: 3.5231
Iteration: 5458, Avg. training loss: 3.4890
Iteration: 5459, Avg. training loss: 3.4839
Iteration: 5460, Avg. training loss: 3.5110
Iteration: 5461, Avg. training loss: 3.6055
Iteration: 5462, Avg. training loss: 3.5596
Iteration: 5463, Avg. training loss: 3.4653
Iteration: 5464, Avg. training loss: 3.4970
Iteration: 5465, Avg. training loss: 3.5445
Iteration: 5466, Avg. training loss: 3.4850
Iteration: 5467, Avg. training loss: 3.4528
Iteration: 5468, Avg. training loss: 3.4580
Iteration: 5469, Avg. training loss: 3.4932
Iteration: 5470, Avg. training l

Iteration: 5635, Avg. training loss: 3.4867
Iteration: 5636, Avg. training loss: 3.6179
Iteration: 5637, Avg. training loss: 3.4222
Iteration: 5638, Avg. training loss: 3.4478
Iteration: 5639, Avg. training loss: 3.3458
Iteration: 5640, Avg. training loss: 3.4347
Iteration: 5641, Avg. training loss: 3.4819
Iteration: 5642, Avg. training loss: 3.5343
Iteration: 5643, Avg. training loss: 3.4738
Iteration: 5644, Avg. training loss: 3.4988
Iteration: 5645, Avg. training loss: 3.5824
Iteration: 5646, Avg. training loss: 3.5277
Iteration: 5647, Avg. training loss: 3.5533
Iteration: 5648, Avg. training loss: 3.5344
Iteration: 5649, Avg. training loss: 3.4658
Iteration: 5650, Avg. training loss: 3.5923
Iteration: 5651, Avg. training loss: 3.5787
Iteration: 5652, Avg. training loss: 3.4968
Iteration: 5653, Avg. training loss: 3.4686
Iteration: 5654, Avg. training loss: 3.4702
Iteration: 5655, Avg. training loss: 3.5616
Iteration: 5656, Avg. training loss: 3.4584
Iteration: 5657, Avg. training l

Iteration: 5822, Avg. training loss: 3.5474
Iteration: 5823, Avg. training loss: 3.5527
Iteration: 5824, Avg. training loss: 3.6384
Iteration: 5825, Avg. training loss: 3.5740
Iteration: 5826, Avg. training loss: 3.5016
Iteration: 5827, Avg. training loss: 3.5508
Iteration: 5828, Avg. training loss: 3.5784
Iteration: 5829, Avg. training loss: 3.5148
Iteration: 5830, Avg. training loss: 3.5327
Iteration: 5831, Avg. training loss: 3.4710
Iteration: 5832, Avg. training loss: 3.5046
Iteration: 5833, Avg. training loss: 3.6049
Iteration: 5834, Avg. training loss: 3.4926
Iteration: 5835, Avg. training loss: 3.5720
Iteration: 5836, Avg. training loss: 3.5199
Iteration: 5837, Avg. training loss: 3.5394
Iteration: 5838, Avg. training loss: 3.5326
Iteration: 5839, Avg. training loss: 3.6080
Iteration: 5840, Avg. training loss: 3.6020
Iteration: 5841, Avg. training loss: 3.4231
Iteration: 5842, Avg. training loss: 3.3407
Iteration: 5843, Avg. training loss: 3.5647
Iteration: 5844, Avg. training l

Iteration: 6009, Avg. training loss: 3.4807
Iteration: 6010, Avg. training loss: 3.5826
Iteration: 6011, Avg. training loss: 3.4764
Iteration: 6012, Avg. training loss: 3.6654
Iteration: 6013, Avg. training loss: 3.4510
Iteration: 6014, Avg. training loss: 3.5182
Iteration: 6015, Avg. training loss: 3.4353
Iteration: 6016, Avg. training loss: 3.5321
Iteration: 6017, Avg. training loss: 3.6485
Iteration: 6018, Avg. training loss: 3.3901
Iteration: 6019, Avg. training loss: 3.4681
Iteration: 6020, Avg. training loss: 3.6275
Iteration: 6021, Avg. training loss: 3.7339
Iteration: 6022, Avg. training loss: 3.5469
Iteration: 6023, Avg. training loss: 3.5646
Iteration: 6024, Avg. training loss: 3.4692
Iteration: 6025, Avg. training loss: 3.5561
Iteration: 6026, Avg. training loss: 3.6216
Iteration: 6027, Avg. training loss: 3.5501
Iteration: 6028, Avg. training loss: 3.5250
Iteration: 6029, Avg. training loss: 3.4909
Iteration: 6030, Avg. training loss: 3.5206
Iteration: 6031, Avg. training l

Iteration: 6196, Avg. training loss: 3.4337
Iteration: 6197, Avg. training loss: 3.5015
Iteration: 6198, Avg. training loss: 3.7707
Iteration: 6199, Avg. training loss: 3.5055
Iteration: 6200, Avg. training loss: 3.4215
Iteration: 6201, Avg. training loss: 3.4900
Iteration: 6202, Avg. training loss: 3.4473
Iteration: 6203, Avg. training loss: 3.4951
Iteration: 6204, Avg. training loss: 3.4885
Iteration: 6205, Avg. training loss: 3.6604
Iteration: 6206, Avg. training loss: 3.5830
Iteration: 6207, Avg. training loss: 3.3998
Iteration: 6208, Avg. training loss: 3.5963
Iteration: 6209, Avg. training loss: 3.6037
Iteration: 6210, Avg. training loss: 3.5734
Iteration: 6211, Avg. training loss: 3.4620
Iteration: 6212, Avg. training loss: 3.5919
Iteration: 6213, Avg. training loss: 3.3457
Iteration: 6214, Avg. training loss: 3.5827
Iteration: 6215, Avg. training loss: 3.5510
Iteration: 6216, Avg. training loss: 3.4244
Iteration: 6217, Avg. training loss: 3.6808
Iteration: 6218, Avg. training l

Iteration: 6383, Avg. training loss: 3.4252
Iteration: 6384, Avg. training loss: 3.4837
Iteration: 6385, Avg. training loss: 3.6171
Iteration: 6386, Avg. training loss: 3.5244
Iteration: 6387, Avg. training loss: 3.4335
Iteration: 6388, Avg. training loss: 3.5565
Iteration: 6389, Avg. training loss: 3.5516
Iteration: 6390, Avg. training loss: 3.4832
Iteration: 6391, Avg. training loss: 3.5070
Iteration: 6392, Avg. training loss: 3.4686
Iteration: 6393, Avg. training loss: 3.4589
Iteration: 6394, Avg. training loss: 3.5611
Iteration: 6395, Avg. training loss: 3.4948
Iteration: 6396, Avg. training loss: 3.4685
Iteration: 6397, Avg. training loss: 3.5678
Iteration: 6398, Avg. training loss: 3.5852
Iteration: 6399, Avg. training loss: 3.4800
Iteration: 6400, Avg. training loss: 3.5357
Iteration: 6401, Avg. training loss: 3.4065
Iteration: 6402, Avg. training loss: 3.5054
Iteration: 6403, Avg. training loss: 3.5588
Iteration: 6404, Avg. training loss: 3.5463
Iteration: 6405, Avg. training l

Iteration: 6570, Avg. training loss: 3.5376
Iteration: 6571, Avg. training loss: 3.6015
Iteration: 6572, Avg. training loss: 3.5755
Iteration: 6573, Avg. training loss: 3.4879
Iteration: 6574, Avg. training loss: 3.4644
Iteration: 6575, Avg. training loss: 3.4396
Iteration: 6576, Avg. training loss: 3.5624
Iteration: 6577, Avg. training loss: 3.5483
Iteration: 6578, Avg. training loss: 3.4083
Iteration: 6579, Avg. training loss: 3.5375
Iteration: 6580, Avg. training loss: 3.5987
Iteration: 6581, Avg. training loss: 3.4813
Iteration: 6582, Avg. training loss: 3.4962
Iteration: 6583, Avg. training loss: 3.4826
Iteration: 6584, Avg. training loss: 3.5217
Iteration: 6585, Avg. training loss: 3.6574
Iteration: 6586, Avg. training loss: 3.4458
Iteration: 6587, Avg. training loss: 3.3951
Iteration: 6588, Avg. training loss: 3.4645
Iteration: 6589, Avg. training loss: 3.4534
Iteration: 6590, Avg. training loss: 3.4987
Iteration: 6591, Avg. training loss: 3.5552
Iteration: 6592, Avg. training l

Iteration: 6757, Avg. training loss: 3.4942
Iteration: 6758, Avg. training loss: 3.5670
Iteration: 6759, Avg. training loss: 3.5405
Iteration: 6760, Avg. training loss: 3.5595
Iteration: 6761, Avg. training loss: 3.5107
Iteration: 6762, Avg. training loss: 3.4791
Iteration: 6763, Avg. training loss: 3.5684
Iteration: 6764, Avg. training loss: 3.5739
Iteration: 6765, Avg. training loss: 3.4428
Iteration: 6766, Avg. training loss: 3.5961
Iteration: 6767, Avg. training loss: 3.5131
Iteration: 6768, Avg. training loss: 3.5697
Iteration: 6769, Avg. training loss: 3.4060
Iteration: 6770, Avg. training loss: 3.5348
Iteration: 6771, Avg. training loss: 3.4236
Iteration: 6772, Avg. training loss: 3.5171
Iteration: 6773, Avg. training loss: 3.5616
Iteration: 6774, Avg. training loss: 3.5194
Iteration: 6775, Avg. training loss: 3.4752
Iteration: 6776, Avg. training loss: 3.5384
Iteration: 6777, Avg. training loss: 3.5996
Iteration: 6778, Avg. training loss: 3.4834
Iteration: 6779, Avg. training l

Iteration: 6944, Avg. training loss: 3.4963
Iteration: 6945, Avg. training loss: 3.6272
Iteration: 6946, Avg. training loss: 3.4889
Iteration: 6947, Avg. training loss: 3.5164
Iteration: 6948, Avg. training loss: 3.5521
Iteration: 6949, Avg. training loss: 3.4797
Iteration: 6950, Avg. training loss: 3.5146
Iteration: 6951, Avg. training loss: 3.5474
Iteration: 6952, Avg. training loss: 3.4664
Iteration: 6953, Avg. training loss: 3.5843
Iteration: 6954, Avg. training loss: 3.4312
Iteration: 6955, Avg. training loss: 3.4345
Iteration: 6956, Avg. training loss: 3.4546
Iteration: 6957, Avg. training loss: 3.5791
Iteration: 6958, Avg. training loss: 3.4707
Iteration: 6959, Avg. training loss: 3.4947
Iteration: 6960, Avg. training loss: 3.5400
Iteration: 6961, Avg. training loss: 3.4711
Iteration: 6962, Avg. training loss: 3.5015
Iteration: 6963, Avg. training loss: 3.4858
Iteration: 6964, Avg. training loss: 3.5014
Iteration: 6965, Avg. training loss: 3.6115
Iteration: 6966, Avg. training l

Iteration: 7131, Avg. training loss: 3.6487
Iteration: 7132, Avg. training loss: 3.6309
Iteration: 7133, Avg. training loss: 3.6001
Iteration: 7134, Avg. training loss: 3.5736
Iteration: 7135, Avg. training loss: 3.4303
Iteration: 7136, Avg. training loss: 3.4188
Iteration: 7137, Avg. training loss: 3.5590
Iteration: 7138, Avg. training loss: 3.5863
Iteration: 7139, Avg. training loss: 3.4968
Iteration: 7140, Avg. training loss: 3.5453
Iteration: 7141, Avg. training loss: 3.5949
Iteration: 7142, Avg. training loss: 3.5012
Iteration: 7143, Avg. training loss: 3.5250
Iteration: 7144, Avg. training loss: 3.5126
Iteration: 7145, Avg. training loss: 3.4659
Iteration: 7146, Avg. training loss: 3.4894
Iteration: 7147, Avg. training loss: 3.5172
Iteration: 7148, Avg. training loss: 3.5006
Iteration: 7149, Avg. training loss: 3.4907
Iteration: 7150, Avg. training loss: 3.5197
Iteration: 7151, Avg. training loss: 3.4490
Iteration: 7152, Avg. training loss: 3.4955
Iteration: 7153, Avg. training l

Iteration: 7318, Avg. training loss: 3.4414
Iteration: 7319, Avg. training loss: 3.5220
Iteration: 7320, Avg. training loss: 3.5096
Iteration: 7321, Avg. training loss: 3.4492
Iteration: 7322, Avg. training loss: 3.5371
Iteration: 7323, Avg. training loss: 3.4171
Iteration: 7324, Avg. training loss: 3.4378
Iteration: 7325, Avg. training loss: 3.5266
Iteration: 7326, Avg. training loss: 3.4444
Iteration: 7327, Avg. training loss: 3.5834
Iteration: 7328, Avg. training loss: 3.5044
Iteration: 7329, Avg. training loss: 3.4769
Iteration: 7330, Avg. training loss: 3.4307
Iteration: 7331, Avg. training loss: 3.5430
Iteration: 7332, Avg. training loss: 3.4881
Iteration: 7333, Avg. training loss: 3.4228
Iteration: 7334, Avg. training loss: 3.5361
Iteration: 7335, Avg. training loss: 3.4942
Iteration: 7336, Avg. training loss: 3.6542
Iteration: 7337, Avg. training loss: 3.3375
Iteration: 7338, Avg. training loss: 3.5432
Iteration: 7339, Avg. training loss: 3.5008
Iteration: 7340, Avg. training l

Iteration: 7505, Avg. training loss: 3.5618
Iteration: 7506, Avg. training loss: 3.5609
Iteration: 7507, Avg. training loss: 3.5264
Iteration: 7508, Avg. training loss: 3.4760
Iteration: 7509, Avg. training loss: 3.4098
Iteration: 7510, Avg. training loss: 3.3280
Iteration: 7511, Avg. training loss: 3.3940
Iteration: 7512, Avg. training loss: 3.5120
Iteration: 7513, Avg. training loss: 3.5276
Iteration: 7514, Avg. training loss: 3.6549
Iteration: 7515, Avg. training loss: 3.5685
Iteration: 7516, Avg. training loss: 3.4593
Iteration: 7517, Avg. training loss: 3.6166
Iteration: 7518, Avg. training loss: 3.4542
Iteration: 7519, Avg. training loss: 3.5554
Iteration: 7520, Avg. training loss: 3.5502
Iteration: 7521, Avg. training loss: 3.5306
Iteration: 7522, Avg. training loss: 3.4612
Iteration: 7523, Avg. training loss: 3.4692
Iteration: 7524, Avg. training loss: 3.4992
Iteration: 7525, Avg. training loss: 3.3145
Iteration: 7526, Avg. training loss: 3.5857
Iteration: 7527, Avg. training l

Iteration: 7692, Avg. training loss: 3.4565
Iteration: 7693, Avg. training loss: 3.6073
Iteration: 7694, Avg. training loss: 3.4571
Iteration: 7695, Avg. training loss: 3.4940
Iteration: 7696, Avg. training loss: 3.4681
Iteration: 7697, Avg. training loss: 3.4607
Iteration: 7698, Avg. training loss: 3.5146
Iteration: 7699, Avg. training loss: 3.5627
Iteration: 7700, Avg. training loss: 3.4336
Iteration: 7701, Avg. training loss: 3.5401
Iteration: 7702, Avg. training loss: 3.3647
Iteration: 7703, Avg. training loss: 3.5226
Iteration: 7704, Avg. training loss: 3.6097
Iteration: 7705, Avg. training loss: 3.4896
Iteration: 7706, Avg. training loss: 3.5005
Iteration: 7707, Avg. training loss: 3.4134
Iteration: 7708, Avg. training loss: 3.6275
Iteration: 7709, Avg. training loss: 3.4070
Iteration: 7710, Avg. training loss: 3.3986
Iteration: 7711, Avg. training loss: 3.4402
Iteration: 7712, Avg. training loss: 3.5812
Iteration: 7713, Avg. training loss: 3.4658
Iteration: 7714, Avg. training l

Iteration: 7879, Avg. training loss: 3.4640
Iteration: 7880, Avg. training loss: 3.5643
Iteration: 7881, Avg. training loss: 3.5795
Iteration: 7882, Avg. training loss: 3.5458
Iteration: 7883, Avg. training loss: 3.4857
Iteration: 7884, Avg. training loss: 3.4845
Iteration: 7885, Avg. training loss: 3.6033
Iteration: 7886, Avg. training loss: 3.5353
Iteration: 7887, Avg. training loss: 3.4466
Iteration: 7888, Avg. training loss: 3.4724
Iteration: 7889, Avg. training loss: 3.5195
Iteration: 7890, Avg. training loss: 3.4482
Iteration: 7891, Avg. training loss: 3.5034
Iteration: 7892, Avg. training loss: 3.4119
Iteration: 7893, Avg. training loss: 3.5608
Iteration: 7894, Avg. training loss: 3.4550
Iteration: 7895, Avg. training loss: 3.5150
Iteration: 7896, Avg. training loss: 3.4898
Iteration: 7897, Avg. training loss: 3.4391
Iteration: 7898, Avg. training loss: 3.4572
Iteration: 7899, Avg. training loss: 3.4138
Iteration: 7900, Avg. training loss: 3.5394
Iteration: 7901, Avg. training l

Iteration: 8066, Avg. training loss: 3.3446
Iteration: 8067, Avg. training loss: 3.4210
Iteration: 8068, Avg. training loss: 3.5627
Iteration: 8069, Avg. training loss: 3.6001
Iteration: 8070, Avg. training loss: 3.4935
Iteration: 8071, Avg. training loss: 3.5598
Iteration: 8072, Avg. training loss: 3.5785
Iteration: 8073, Avg. training loss: 3.5006
Iteration: 8074, Avg. training loss: 3.4976
Iteration: 8075, Avg. training loss: 3.5316
Iteration: 8076, Avg. training loss: 3.5214
Iteration: 8077, Avg. training loss: 3.4828
Iteration: 8078, Avg. training loss: 3.4891
Iteration: 8079, Avg. training loss: 3.5666
Iteration: 8080, Avg. training loss: 3.5150
Iteration: 8081, Avg. training loss: 3.5500
Iteration: 8082, Avg. training loss: 3.3948
Iteration: 8083, Avg. training loss: 3.4602
Iteration: 8084, Avg. training loss: 3.5170
Iteration: 8085, Avg. training loss: 3.5304
Iteration: 8086, Avg. training loss: 3.4532
Iteration: 8087, Avg. training loss: 3.5714
Iteration: 8088, Avg. training l

Iteration: 8253, Avg. training loss: 3.3729
Iteration: 8254, Avg. training loss: 3.4443
Iteration: 8255, Avg. training loss: 3.5930
Iteration: 8256, Avg. training loss: 3.4816
Iteration: 8257, Avg. training loss: 3.4217
Iteration: 8258, Avg. training loss: 3.4949
Iteration: 8259, Avg. training loss: 3.5226
Iteration: 8260, Avg. training loss: 3.6135
Iteration: 8261, Avg. training loss: 3.5505
Iteration: 8262, Avg. training loss: 3.5919
Iteration: 8263, Avg. training loss: 3.6149
Iteration: 8264, Avg. training loss: 3.5025
Iteration: 8265, Avg. training loss: 3.3485
Iteration: 8266, Avg. training loss: 3.5598
Iteration: 8267, Avg. training loss: 3.4721
Iteration: 8268, Avg. training loss: 3.5939
Iteration: 8269, Avg. training loss: 3.4917
Iteration: 8270, Avg. training loss: 3.5537
Iteration: 8271, Avg. training loss: 3.4804
Iteration: 8272, Avg. training loss: 3.4542
Iteration: 8273, Avg. training loss: 3.4455
Iteration: 8274, Avg. training loss: 3.4182
...
Iteration: 8440, Avg. training loss: 3.4592
Iteration: 8441, Avg. training loss: 3.5486
Iteration: 8442, Avg. training loss: 3.4976
Iteration: 8443, Avg. training loss: 3.3445
Iteration: 8444, Avg. training loss: 3.4990
Iteration: 8445, Avg. training loss: 3.3659
Iteration: 8446, Avg. training loss: 3.5012
Iteration: 8447, Avg. training loss: 3.4665
Iteration: 8448, Avg. training loss: 3.4887
Iteration: 8449, Avg. training loss: 3.5529
Iteration: 8450, Avg. training loss: 3.5270
Iteration: 8451, Avg. training loss: 3.4701
Iteration: 8452, Avg. training loss: 3.5575
Iteration: 8453, Avg. training loss: 3.6172
Iteration: 8454, Avg. training loss: 3.5013
Iteration: 8455, Avg. training loss: 3.4262
Iteration: 8456, Avg. training loss: 3.5143
Iteration: 8457, Avg. training loss: 3.5767
Iteration: 8458, Avg. training loss: 3.5078
Iteration: 8459, Avg. training loss: 3.4602
Iteration: 8460, Avg. training loss: 3.4203
Iteration: 8461, Avg. training loss: 3.4722
...
Iteration: 8627, Avg. training loss: 3.4595
Iteration: 8628, Avg. training loss: 3.5998
Iteration: 8629, Avg. training loss: 3.4727
Iteration: 8630, Avg. training loss: 3.6115
Iteration: 8631, Avg. training loss: 3.4244
Iteration: 8632, Avg. training loss: 3.5690
Iteration: 8633, Avg. training loss: 3.3670
Iteration: 8634, Avg. training loss: 3.4269
Iteration: 8635, Avg. training loss: 3.5871
Iteration: 8636, Avg. training loss: 3.4758
Iteration: 8637, Avg. training loss: 3.5662
Iteration: 8638, Avg. training loss: 3.4005
Iteration: 8639, Avg. training loss: 3.5146
Iteration: 8640, Avg. training loss: 3.3843
Iteration: 8641, Avg. training loss: 3.4582
Iteration: 8642, Avg. training loss: 3.4432
Iteration: 8643, Avg. training loss: 3.5622
Iteration: 8644, Avg. training loss: 3.4402
Iteration: 8645, Avg. training loss: 3.5488
Iteration: 8646, Avg. training loss: 3.5216
Iteration: 8647, Avg. training loss: 3.4939
Iteration: 8648, Avg. training loss: 3.5081
...
Iteration: 8814, Avg. training loss: 3.6109
Iteration: 8815, Avg. training loss: 3.4914
Iteration: 8816, Avg. training loss: 3.5218
Iteration: 8817, Avg. training loss: 3.4016
Iteration: 8818, Avg. training loss: 3.5555
Iteration: 8819, Avg. training loss: 3.4179
Iteration: 8820, Avg. training loss: 3.4998
Iteration: 8821, Avg. training loss: 3.5014
Iteration: 8822, Avg. training loss: 3.4810
Iteration: 8823, Avg. training loss: 3.4907
Iteration: 8824, Avg. training loss: 3.5118
Iteration: 8825, Avg. training loss: 3.4850
Iteration: 8826, Avg. training loss: 3.4183
Iteration: 8827, Avg. training loss: 3.4481
Iteration: 8828, Avg. training loss: 3.6278
Iteration: 8829, Avg. training loss: 3.4320
Iteration: 8830, Avg. training loss: 3.5496
Iteration: 8831, Avg. training loss: 3.5504
Iteration: 8832, Avg. training loss: 3.5396
Iteration: 8833, Avg. training loss: 3.5026
Iteration: 8834, Avg. training loss: 3.6141
Iteration: 8835, Avg. training loss: 3.5392
...
Iteration: 9001, Avg. training loss: 3.4625
Iteration: 9002, Avg. training loss: 3.5249
Iteration: 9003, Avg. training loss: 3.6358
Iteration: 9004, Avg. training loss: 3.4983
Iteration: 9005, Avg. training loss: 3.4602
Iteration: 9006, Avg. training loss: 3.5980
Iteration: 9007, Avg. training loss: 3.5016
Iteration: 9008, Avg. training loss: 3.5676
Iteration: 9009, Avg. training loss: 3.5190
Iteration: 9010, Avg. training loss: 3.5397
Iteration: 9011, Avg. training loss: 3.5879
Iteration: 9012, Avg. training loss: 3.4271
Iteration: 9013, Avg. training loss: 3.5624
Iteration: 9014, Avg. training loss: 3.5130
Iteration: 9015, Avg. training loss: 3.3801
Iteration: 9016, Avg. training loss: 3.5264
Iteration: 9017, Avg. training loss: 3.4822
Iteration: 9018, Avg. training loss: 3.4366
Iteration: 9019, Avg. training loss: 3.5075
Iteration: 9020, Avg. training loss: 3.5358
Iteration: 9021, Avg. training loss: 3.4542
Iteration: 9022, Avg. training loss: 3.5018
...
Iteration: 9188, Avg. training loss: 3.5341
Iteration: 9189, Avg. training loss: 3.4824
Iteration: 9190, Avg. training loss: 3.4281
Iteration: 9191, Avg. training loss: 3.5542
Iteration: 9192, Avg. training loss: 3.3215
Iteration: 9193, Avg. training loss: 3.4689
Iteration: 9194, Avg. training loss: 3.5228
Iteration: 9195, Avg. training loss: 3.4720
Iteration: 9196, Avg. training loss: 3.4887
Iteration: 9197, Avg. training loss: 3.4157
Iteration: 9198, Avg. training loss: 3.3738
Iteration: 9199, Avg. training loss: 3.5929
Iteration: 9200, Avg. training loss: 3.3801
Iteration: 9201, Avg. training loss: 3.4216
Iteration: 9202, Avg. training loss: 3.4519
Iteration: 9203, Avg. training loss: 3.5287
Iteration: 9204, Avg. training loss: 3.4819
Iteration: 9205, Avg. training loss: 3.4890
Iteration: 9206, Avg. training loss: 3.4314
Iteration: 9207, Avg. training loss: 3.6008
Iteration: 9208, Avg. training loss: 3.5827
Iteration: 9209, Avg. training loss: 3.3833
...
Iteration: 9375, Avg. training loss: 3.4604
Iteration: 9376, Avg. training loss: 3.3821
Iteration: 9377, Avg. training loss: 3.5219
Iteration: 9378, Avg. training loss: 3.5396
Iteration: 9379, Avg. training loss: 3.5360
Iteration: 9380, Avg. training loss: 3.5818
Iteration: 9381, Avg. training loss: 3.5895
Iteration: 9382, Avg. training loss: 3.4574
Iteration: 9383, Avg. training loss: 3.4983
Iteration: 9384, Avg. training loss: 3.5296
Iteration: 9385, Avg. training loss: 3.5540
Iteration: 9386, Avg. training loss: 3.5255
Iteration: 9387, Avg. training loss: 3.4157
Iteration: 9388, Avg. training loss: 3.4501
Iteration: 9389, Avg. training loss: 3.3660
Iteration: 9390, Avg. training loss: 3.4733
Iteration: 9391, Avg. training loss: 3.4210
Iteration: 9392, Avg. training loss: 3.4107
Iteration: 9393, Avg. training loss: 3.5934
Iteration: 9394, Avg. training loss: 3.5173
Iteration: 9395, Avg. training loss: 3.4176
Iteration: 9396, Avg. training loss: 3.5294
...
Iteration: 9562, Avg. training loss: 3.5678
Iteration: 9563, Avg. training loss: 3.4165
Iteration: 9564, Avg. training loss: 3.4991
Iteration: 9565, Avg. training loss: 3.4729
Iteration: 9566, Avg. training loss: 3.5552
Iteration: 9567, Avg. training loss: 3.5384
Iteration: 9568, Avg. training loss: 3.4809
Iteration: 9569, Avg. training loss: 3.4540
Iteration: 9570, Avg. training loss: 3.5307
Iteration: 9571, Avg. training loss: 3.4658
Iteration: 9572, Avg. training loss: 3.4749
Iteration: 9573, Avg. training loss: 3.4143
Iteration: 9574, Avg. training loss: 3.4176
Iteration: 9575, Avg. training loss: 3.5196
Iteration: 9576, Avg. training loss: 3.6163
Iteration: 9577, Avg. training loss: 3.3940
Iteration: 9578, Avg. training loss: 3.4473
Iteration: 9579, Avg. training loss: 3.5441
Iteration: 9580, Avg. training loss: 3.3717
Iteration: 9581, Avg. training loss: 3.4944
Iteration: 9582, Avg. training loss: 3.4326
Iteration: 9583, Avg. training loss: 3.4604
...
Iteration: 9749, Avg. training loss: 3.5041
Iteration: 9750, Avg. training loss: 3.5073
Iteration: 9751, Avg. training loss: 3.4490
Iteration: 9752, Avg. training loss: 3.3967
Iteration: 9753, Avg. training loss: 3.5266
Iteration: 9754, Avg. training loss: 3.4489
Iteration: 9755, Avg. training loss: 3.5669
Iteration: 9756, Avg. training loss: 3.4365
Iteration: 9757, Avg. training loss: 3.4193
Iteration: 9758, Avg. training loss: 3.4968
Iteration: 9759, Avg. training loss: 3.5521
Iteration: 9760, Avg. training loss: 3.5028
Iteration: 9761, Avg. training loss: 3.4380
Iteration: 9762, Avg. training loss: 3.5004
Iteration: 9763, Avg. training loss: 3.5930
Iteration: 9764, Avg. training loss: 3.5683
Iteration: 9765, Avg. training loss: 3.4314
Iteration: 9766, Avg. training loss: 3.4806
Iteration: 9767, Avg. training loss: 3.4018
Iteration: 9768, Avg. training loss: 3.5008
Iteration: 9769, Avg. training loss: 3.3236
Iteration: 9770, Avg. training loss: 3.5891
...
Iteration: 9936, Avg. training loss: 3.5010
Iteration: 9937, Avg. training loss: 3.5347
Iteration: 9938, Avg. training loss: 3.4296
Iteration: 9939, Avg. training loss: 3.4708
Iteration: 9940, Avg. training loss: 3.5482
Iteration: 9941, Avg. training loss: 3.5878
Iteration: 9942, Avg. training loss: 3.5484
Iteration: 9943, Avg. training loss: 3.5111
Iteration: 9944, Avg. training loss: 3.5105
Iteration: 9945, Avg. training loss: 3.4239
Iteration: 9946, Avg. training loss: 3.4825
Iteration: 9947, Avg. training loss: 3.5307
Iteration: 9948, Avg. training loss: 3.4254
Iteration: 9949, Avg. training loss: 3.4088
Iteration: 9950, Avg. training loss: 3.4436
Iteration: 9951, Avg. training loss: 3.4373
Iteration: 9952, Avg. training loss: 3.5193
Iteration: 9953, Avg. training loss: 3.4934
Iteration: 9954, Avg. training loss: 3.5110
Iteration: 9955, Avg. training loss: 3.4802
Iteration: 9956, Avg. training loss: 3.3835
Iteration: 9957, Avg. training loss: 3.3826
...
Iteration: 10120, Avg. training loss: 3.5475
Iteration: 10121, Avg. training loss: 3.5123
Iteration: 10122, Avg. training loss: 3.2723
Iteration: 10123, Avg. training loss: 3.5428
Iteration: 10124, Avg. training loss: 3.6007
Iteration: 10125, Avg. training loss: 3.4497
Iteration: 10126, Avg. training loss: 3.5512
Iteration: 10127, Avg. training loss: 3.4287
Iteration: 10128, Avg. training loss: 3.4893
Iteration: 10129, Avg. training loss: 3.6294
Iteration: 10130, Avg. training loss: 3.5481
Iteration: 10131, Avg. training loss: 3.5013
Iteration: 10132, Avg. training loss: 3.5165
Iteration: 10133, Avg. training loss: 3.5703
Iteration: 10134, Avg. training loss: 3.5389
Iteration: 10135, Avg. training loss: 3.5426
Iteration: 10136, Avg. training loss: 3.3454
Iteration: 10137, Avg. training loss: 3.4705
Iteration: 10138, Avg. training loss: 3.6590
Iteration: 10139, Avg. training loss: 3.5279
Iteration: 10140, Avg. training loss: 3.4710
Iteration: 10141, Avg. training loss: 3.6459
...
Iteration: 10303, Avg. training loss: 3.5200
Iteration: 10304, Avg. training loss: 3.5056
Iteration: 10305, Avg. training loss: 3.4961
Iteration: 10306, Avg. training loss: 3.5083
Iteration: 10307, Avg. training loss: 3.4379
Iteration: 10308, Avg. training loss: 3.5516
Iteration: 10309, Avg. training loss: 3.3652
Iteration: 10310, Avg. training loss: 3.4428
Iteration: 10311, Avg. training loss: 3.4359
Iteration: 10312, Avg. training loss: 3.5849
Iteration: 10313, Avg. training loss: 3.5399
Iteration: 10314, Avg. training loss: 3.3764
Iteration: 10315, Avg. training loss: 3.5751
Iteration: 10316, Avg. training loss: 3.4197
Iteration: 10317, Avg. training loss: 3.4441
Iteration: 10318, Avg. training loss: 3.4683
Iteration: 10319, Avg. training loss: 3.4631
Iteration: 10320, Avg. training loss: 3.3326
Iteration: 10321, Avg. training loss: 3.3799
Iteration: 10322, Avg. training loss: 3.4960
Iteration: 10323, Avg. training loss: 3.5701
Iteration: 10324, Avg. training loss: 3.5473
...
Iteration: 10486, Avg. training loss: 3.5410
Iteration: 10487, Avg. training loss: 3.4388
Iteration: 10488, Avg. training loss: 3.5078
Iteration: 10489, Avg. training loss: 3.4058
Iteration: 10490, Avg. training loss: 3.4818
Iteration: 10491, Avg. training loss: 3.3917
Iteration: 10492, Avg. training loss: 3.4947
Iteration: 10493, Avg. training loss: 3.5012
Iteration: 10494, Avg. training loss: 3.5126
Iteration: 10495, Avg. training loss: 3.5910
Iteration: 10496, Avg. training loss: 3.5180
Iteration: 10497, Avg. training loss: 3.6417
Iteration: 10498, Avg. training loss: 3.3994
Iteration: 10499, Avg. training loss: 3.4070
Iteration: 10500, Avg. training loss: 3.4674
Iteration: 10501, Avg. training loss: 3.3603
Iteration: 10502, Avg. training loss: 3.4055
Iteration: 10503, Avg. training loss: 3.4604
Iteration: 10504, Avg. training loss: 3.5177
Iteration: 10505, Avg. training loss: 3.3260
Iteration: 10506, Avg. training loss: 3.5046
Iteration: 10507, Avg. training loss: 3.4406
...
Iteration: 10669, Avg. training loss: 3.5022
Iteration: 10670, Avg. training loss: 3.4749
Iteration: 10671, Avg. training loss: 3.4269
Iteration: 10672, Avg. training loss: 3.4867
Iteration: 10673, Avg. training loss: 3.4248
Iteration: 10674, Avg. training loss: 3.4445
Iteration: 10675, Avg. training loss: 3.4589
Iteration: 10676, Avg. training loss: 3.3771
Iteration: 10677, Avg. training loss: 3.5101
Iteration: 10678, Avg. training loss: 3.4736
Iteration: 10679, Avg. training loss: 3.4658
Iteration: 10680, Avg. training loss: 3.4483
Iteration: 10681, Avg. training loss: 3.4933
Iteration: 10682, Avg. training loss: 3.4409
Iteration: 10683, Avg. training loss: 3.6123
Iteration: 10684, Avg. training loss: 3.5034
Iteration: 10685, Avg. training loss: 3.4561
Iteration: 10686, Avg. training loss: 3.3285
Iteration: 10687, Avg. training loss: 3.4488
Iteration: 10688, Avg. training loss: 3.5420
Iteration: 10689, Avg. training loss: 3.5563
Iteration: 10690, Avg. training loss: 3.4509
...
Iteration: 10852, Avg. training loss: 3.5463
Iteration: 10853, Avg. training loss: 3.5103
Iteration: 10854, Avg. training loss: 3.5525
Iteration: 10855, Avg. training loss: 3.5404
Iteration: 10856, Avg. training loss: 3.4701
Iteration: 10857, Avg. training loss: 3.4438
Iteration: 10858, Avg. training loss: 3.6783
Iteration: 10859, Avg. training loss: 3.5059
Iteration: 10860, Avg. training loss: 3.4062
Iteration: 10861, Avg. training loss: 3.4148
Iteration: 10862, Avg. training loss: 3.5527
Iteration: 10863, Avg. training loss: 3.4538
Iteration: 10864, Avg. training loss: 3.4095
Iteration: 10865, Avg. training loss: 3.4996
Iteration: 10866, Avg. training loss: 3.4627
Iteration: 10867, Avg. training loss: 3.4075
Iteration: 10868, Avg. training loss: 3.4788
Iteration: 10869, Avg. training loss: 3.4622
Iteration: 10870, Avg. training loss: 3.4821
Iteration: 10871, Avg. training loss: 3.4969
Iteration: 10872, Avg. training loss: 3.4683
Iteration: 10873, Avg. training loss: 3.4470
...
Iteration: 11035, Avg. training loss: 3.5397
Iteration: 11036, Avg. training loss: 3.5136
Iteration: 11037, Avg. training loss: 3.4068
Iteration: 11038, Avg. training loss: 3.4851
Iteration: 11039, Avg. training loss: 3.4335
Iteration: 11040, Avg. training loss: 3.5767
Iteration: 11041, Avg. training loss: 3.3612
Iteration: 11042, Avg. training loss: 3.4840
Iteration: 11043, Avg. training loss: 3.4696
Iteration: 11044, Avg. training loss: 3.4445
Iteration: 11045, Avg. training loss: 3.4953
Iteration: 11046, Avg. training loss: 3.4892
Iteration: 11047, Avg. training loss: 3.5273
Iteration: 11048, Avg. training loss: 3.3639
Iteration: 11049, Avg. training loss: 3.6109
Iteration: 11050, Avg. training loss: 3.5485
Iteration: 11051, Avg. training loss: 3.6055
Iteration: 11052, Avg. training loss: 3.5125
Iteration: 11053, Avg. training loss: 3.5148
Iteration: 11054, Avg. training loss: 3.4592
Iteration: 11055, Avg. training loss: 3.4674
Iteration: 11056, Avg. training loss: 3.4511
...
Iteration: 11218, Avg. training loss: 3.5380
Iteration: 11219, Avg. training loss: 3.5288
Iteration: 11220, Avg. training loss: 3.4227
Iteration: 11221, Avg. training loss: 3.2967
Iteration: 11222, Avg. training loss: 3.5235
Iteration: 11223, Avg. training loss: 3.4034
Iteration: 11224, Avg. training loss: 3.4413
Iteration: 11225, Avg. training loss: 3.4621
Iteration: 11226, Avg. training loss: 3.5667
Iteration: 11227, Avg. training loss: 3.4579
Iteration: 11228, Avg. training loss: 3.4738
Iteration: 11229, Avg. training loss: 3.5099
Iteration: 11230, Avg. training loss: 3.4305
Iteration: 11231, Avg. training loss: 3.4019
Iteration: 11232, Avg. training loss: 3.5238
Iteration: 11233, Avg. training loss: 3.3512
Iteration: 11234, Avg. training loss: 3.4957
Iteration: 11235, Avg. training loss: 3.5270
Iteration: 11236, Avg. training loss: 3.4983
Iteration: 11237, Avg. training loss: 3.4369
Iteration: 11238, Avg. training loss: 3.4508
Iteration: 11239, Avg. training loss: 3.5730
...
Iteration: 11401, Avg. training loss: 3.4915
Iteration: 11402, Avg. training loss: 3.4892
Iteration: 11403, Avg. training loss: 3.4317
Iteration: 11404, Avg. training loss: 3.4617
Iteration: 11405, Avg. training loss: 3.5312
Iteration: 11406, Avg. training loss: 3.4877
Iteration: 11407, Avg. training loss: 3.5247
Iteration: 11408, Avg. training loss: 3.4002
Iteration: 11409, Avg. training loss: 3.5078
Iteration: 11410, Avg. training loss: 3.3909
Iteration: 11411, Avg. training loss: 3.5415
Iteration: 11412, Avg. training loss: 3.3953
Iteration: 11413, Avg. training loss: 3.4987
Iteration: 11414, Avg. training loss: 3.4532
Iteration: 11415, Avg. training loss: 3.5213
Iteration: 11416, Avg. training loss: 3.4805
Iteration: 11417, Avg. training loss: 3.5451
Iteration: 11418, Avg. training loss: 3.4058
Iteration: 11419, Avg. training loss: 3.4171
Iteration: 11420, Avg. training loss: 3.3169
Iteration: 11421, Avg. training loss: 3.4522
Iteration: 11422, Avg. training loss: 3.6271
...
Iteration: 11584, Avg. training loss: 3.4510
Iteration: 11585, Avg. training loss: 3.3869
Iteration: 11586, Avg. training loss: 3.5749
Iteration: 11587, Avg. training loss: 3.5458
Iteration: 11588, Avg. training loss: 3.4221
Iteration: 11589, Avg. training loss: 3.4632
Iteration: 11590, Avg. training loss: 3.3933
Iteration: 11591, Avg. training loss: 3.4340
Iteration: 11592, Avg. training loss: 3.4035
Iteration: 11593, Avg. training loss: 3.6358
Iteration: 11594, Avg. training loss: 3.5074
Iteration: 11595, Avg. training loss: 3.3289
Iteration: 11596, Avg. training loss: 3.6354
Iteration: 11597, Avg. training loss: 3.5788
Iteration: 11598, Avg. training loss: 3.4524
Iteration: 11599, Avg. training loss: 3.5530
Iteration: 11600, Avg. training loss: 3.5602
Iteration: 11601, Avg. training loss: 3.5430
Iteration: 11602, Avg. training loss: 3.4461
Iteration: 11603, Avg. training loss: 3.4257
Iteration: 11604, Avg. training loss: 3.5373
Iteration: 11605, Avg. training loss: 3.5419
...
Iteration: 11767, Avg. training loss: 3.5040
Iteration: 11768, Avg. training loss: 3.4742
Iteration: 11769, Avg. training loss: 3.5544
Iteration: 11770, Avg. training loss: 3.3967
Iteration: 11771, Avg. training loss: 3.5155
Iteration: 11772, Avg. training loss: 3.5470
Iteration: 11773, Avg. training loss: 3.5148
Iteration: 11774, Avg. training loss: 3.2870
Iteration: 11775, Avg. training loss: 3.4900
Iteration: 11776, Avg. training loss: 3.4763
Iteration: 11777, Avg. training loss: 3.4050
Iteration: 11778, Avg. training loss: 3.4525
Iteration: 11779, Avg. training loss: 3.4371
Iteration: 11780, Avg. training loss: 3.4369
Iteration: 11781, Avg. training loss: 3.5784
Iteration: 11782, Avg. training loss: 3.4301
Iteration: 11783, Avg. training loss: 3.4297
Iteration: 11784, Avg. training loss: 3.5580
Iteration: 11785, Avg. training loss: 3.4848
Iteration: 11786, Avg. training loss: 3.5506
Iteration: 11787, Avg. training loss: 3.5067
Iteration: 11788, Avg. training loss: 3.4257
...
Iteration: 11950, Avg. training loss: 3.5459
Iteration: 11951, Avg. training loss: 3.4302
Iteration: 11952, Avg. training loss: 3.4112
Iteration: 11953, Avg. training loss: 3.4227
Iteration: 11954, Avg. training loss: 3.3730
Iteration: 11955, Avg. training loss: 3.5290
Iteration: 11956, Avg. training loss: 3.4460
Iteration: 11957, Avg. training loss: 3.5004
Iteration: 11958, Avg. training loss: 3.5235
Iteration: 11959, Avg. training loss: 3.4770
Iteration: 11960, Avg. training loss: 3.6175
Iteration: 11961, Avg. training loss: 3.5710
Iteration: 11962, Avg. training loss: 3.4864
Iteration: 11963, Avg. training loss: 3.4394
Iteration: 11964, Avg. training loss: 3.4748
Iteration: 11965, Avg. training loss: 3.4967
Iteration: 11966, Avg. training loss: 3.4449
Iteration: 11967, Avg. training loss: 3.5850
Iteration: 11968, Avg. training loss: 3.4339
Iteration: 11969, Avg. training loss: 3.4035
Iteration: 11970, Avg. training loss: 3.4407
Iteration: 11971, Avg. training loss: 3.4538
...
Iteration: 12133, Avg. training loss: 3.4960
Iteration: 12134, Avg. training loss: 3.5544
Iteration: 12135, Avg. training loss: 3.5504
Iteration: 12136, Avg. training loss: 3.5742
Iteration: 12137, Avg. training loss: 3.4732
Iteration: 12138, Avg. training loss: 3.4219
Iteration: 12139, Avg. training loss: 3.4613
Iteration: 12140, Avg. training loss: 3.4452
Iteration: 12141, Avg. training loss: 3.4251
Iteration: 12142, Avg. training loss: 3.4326
Iteration: 12143, Avg. training loss: 3.4353
Iteration: 12144, Avg. training loss: 3.5327
Iteration: 12145, Avg. training loss: 3.4952
Iteration: 12146, Avg. training loss: 3.5391
Iteration: 12147, Avg. training loss: 3.4361
Iteration: 12148, Avg. training loss: 3.4326
Iteration: 12149, Avg. training loss: 3.3474
Iteration: 12150, Avg. training loss: 3.3945
Iteration: 12151, Avg. training loss: 3.5902
Iteration: 12152, Avg. training loss: 3.5301
Iteration: 12153, Avg. training loss: 3.6269
Iteration: 12154, Avg. training loss: 3.4490
...
Iteration: 12316, Avg. training loss: 3.4515
Iteration: 12317, Avg. training loss: 3.5789
Iteration: 12318, Avg. training loss: 3.4733
Iteration: 12319, Avg. training loss: 3.4931
Iteration: 12320, Avg. training loss: 3.4771
Iteration: 12321, Avg. training loss: 3.4936
Iteration: 12322, Avg. training loss: 3.5609
Iteration: 12323, Avg. training loss: 3.4675
Iteration: 12324, Avg. training loss: 3.4315
Iteration: 12325, Avg. training loss: 3.4520
Iteration: 12326, Avg. training loss: 3.5164
Iteration: 12327, Avg. training loss: 3.5210
Iteration: 12328, Avg. training loss: 3.5586
Iteration: 12329, Avg. training loss: 3.4839
Iteration: 12330, Avg. training loss: 3.6204
Iteration: 12331, Avg. training loss: 3.5688
Iteration: 12332, Avg. training loss: 3.4791
Iteration: 12333, Avg. training loss: 3.5364
Iteration: 12334, Avg. training loss: 3.4565
Iteration: 12335, Avg. training loss: 3.4515
Iteration: 12336, Avg. training loss: 3.5002
Iteration: 12337, Avg. training loss: 3.5100
...
Iteration: 12499, Avg. training loss: 3.4654
Iteration: 12500, Avg. training loss: 3.4918
Iteration: 12501, Avg. training loss: 3.5065
Iteration: 12502, Avg. training loss: 3.4773
Iteration: 12503, Avg. training loss: 3.4151
Iteration: 12504, Avg. training loss: 3.4896
Iteration: 12505, Avg. training loss: 3.4395
Iteration: 12506, Avg. training loss: 3.5898
Iteration: 12507, Avg. training loss: 3.3057
Iteration: 12508, Avg. training loss: 3.6210
Iteration: 12509, Avg. training loss: 3.4379
Iteration: 12510, Avg. training loss: 3.5182
Iteration: 12511, Avg. training loss: 3.4174
Iteration: 12512, Avg. training loss: 3.4770
Iteration: 12513, Avg. training loss: 3.4719
Iteration: 12514, Avg. training loss: 3.4633
Iteration: 12515, Avg. training loss: 3.5100
Iteration: 12516, Avg. training loss: 3.4962
Iteration: 12517, Avg. training loss: 3.4963
Iteration: 12518, Avg. training loss: 3.4558
Iteration: 12519, Avg. training loss: 3.4693
Iteration: 12520, Avg. training loss: 3.4474
...
Iteration: 12682, Avg. training loss: 3.3885
Iteration: 12683, Avg. training loss: 3.5379
Iteration: 12684, Avg. training loss: 3.6682
Iteration: 12685, Avg. training loss: 3.4802
Iteration: 12686, Avg. training loss: 3.4988
Iteration: 12687, Avg. training loss: 3.4211
Iteration: 12688, Avg. training loss: 3.5499
Iteration: 12689, Avg. training loss: 3.5433
Iteration: 12690, Avg. training loss: 3.4706
Iteration: 12691, Avg. training loss: 3.4867
Iteration: 12692, Avg. training loss: 3.4803
Iteration: 12693, Avg. training loss: 3.5143
Iteration: 12694, Avg. training loss: 3.5194
Iteration: 12695, Avg. training loss: 3.5748
Iteration: 12696, Avg. training loss: 3.4412
Iteration: 12697, Avg. training loss: 3.5085
Iteration: 12698, Avg. training loss: 3.4740
Iteration: 12699, Avg. training loss: 3.4644
Iteration: 12700, Avg. training loss: 3.4681
Iteration: 12701, Avg. training loss: 3.4850
Iteration: 12702, Avg. training loss: 3.4951
Iteration: 12703, Avg. training loss: 3.4505
...
Iteration: 12865, Avg. training loss: 3.4453
Iteration: 12866, Avg. training loss: 3.4358
Iteration: 12867, Avg. training loss: 3.4522
Iteration: 12868, Avg. training loss: 3.4306
Iteration: 12869, Avg. training loss: 3.5278
Iteration: 12870, Avg. training loss: 3.3758
Iteration: 12871, Avg. training loss: 3.4030
Iteration: 12872, Avg. training loss: 3.5190
Iteration: 12873, Avg. training loss: 3.6454
Iteration: 12874, Avg. training loss: 3.4766
Iteration: 12875, Avg. training loss: 3.5088
Iteration: 12876, Avg. training loss: 3.5443
Iteration: 12877, Avg. training loss: 3.4999
Iteration: 12878, Avg. training loss: 3.3947
Iteration: 12879, Avg. training loss: 3.4640
Iteration: 12880, Avg. training loss: 3.5193
Iteration: 12881, Avg. training loss: 3.5072
Iteration: 12882, Avg. training loss: 3.5065
Iteration: 12883, Avg. training loss: 3.2983
Iteration: 12884, Avg. training loss: 3.5474
Iteration: 12885, Avg. training loss: 3.4916
Iteration: 12886, Avg. training loss: 3.5189
...
Iteration: 13048, Avg. training loss: 3.5351
Iteration: 13049, Avg. training loss: 3.4782
Iteration: 13050, Avg. training loss: 3.4088
Iteration: 13051, Avg. training loss: 3.4718
Iteration: 13052, Avg. training loss: 3.4243
Iteration: 13053, Avg. training loss: 3.5969
Iteration: 13054, Avg. training loss: 3.4902
Iteration: 13055, Avg. training loss: 3.5778
Iteration: 13056, Avg. training loss: 3.5300
Iteration: 13057, Avg. training loss: 3.4990
Iteration: 13058, Avg. training loss: 3.5501
Iteration: 13059, Avg. training loss: 3.4863
Iteration: 13060, Avg. training loss: 3.5048
Iteration: 13061, Avg. training loss: 3.3890
Iteration: 13062, Avg. training loss: 3.4212
Iteration: 13063, Avg. training loss: 3.5529
Iteration: 13064, Avg. training loss: 3.5186
Iteration: 13065, Avg. training loss: 3.4362
Iteration: 13066, Avg. training loss: 3.4485
Iteration: 13067, Avg. training loss: 3.5500
Iteration: 13068, Avg. training loss: 3.4892
Iteration: 13069, Avg. training loss: 3.5304
...
Iteration: 13231, Avg. training loss: 3.5652
Iteration: 13232, Avg. training loss: 3.4210
Iteration: 13233, Avg. training loss: 3.4516
Iteration: 13234, Avg. training loss: 3.4464
Iteration: 13235, Avg. training loss: 3.4485
Iteration: 13236, Avg. training loss: 3.6190
Iteration: 13237, Avg. training loss: 3.3422
Iteration: 13238, Avg. training loss: 3.5481
Iteration: 13239, Avg. training loss: 3.4991
Iteration: 13240, Avg. training loss: 3.5851
Iteration: 13241, Avg. training loss: 3.4853
Iteration: 13242, Avg. training loss: 3.4840
Iteration: 13243, Avg. training loss: 3.4831
Iteration: 13244, Avg. training loss: 3.4658
Iteration: 13245, Avg. training loss: 3.4869
Iteration: 13246, Avg. training loss: 3.5484
Iteration: 13247, Avg. training loss: 3.3738
Iteration: 13248, Avg. training loss: 3.3918
Iteration: 13249, Avg. training loss: 3.4310
Iteration: 13250, Avg. training loss: 3.4882
Iteration: 13251, Avg. training loss: 3.5656
Iteration: 13252, Avg. training loss: 3.5645
...
Iteration: 13414, Avg. training loss: 3.5554
Iteration: 13415, Avg. training loss: 3.4180
Iteration: 13416, Avg. training loss: 3.4139
Iteration: 13417, Avg. training loss: 3.4579
Iteration: 13418, Avg. training loss: 3.4700
Iteration: 13419, Avg. training loss: 3.4242
Iteration: 13420, Avg. training loss: 3.4482
Iteration: 13421, Avg. training loss: 3.4510
Iteration: 13422, Avg. training loss: 3.4243
Iteration: 13423, Avg. training loss: 3.5083
Iteration: 13424, Avg. training loss: 3.4574
Iteration: 13425, Avg. training loss: 3.5425
Iteration: 13426, Avg. training loss: 3.4759
Iteration: 13427, Avg. training loss: 3.5128
Iteration: 13428, Avg. training loss: 3.4949
Iteration: 13429, Avg. training loss: 3.5503
Iteration: 13430, Avg. training loss: 3.4694
Iteration: 13431, Avg. training loss: 3.6032
Iteration: 13432, Avg. training loss: 3.4508
Iteration: 13433, Avg. training loss: 3.4369
Iteration: 13434, Avg. training loss: 3.5055
Iteration: 13435, Avg. training loss: 3.4501
...
Iteration: 13597, Avg. training loss: 3.4570
Iteration: 13598, Avg. training loss: 3.4792
Iteration: 13599, Avg. training loss: 3.5385
Iteration: 13600, Avg. training loss: 3.6196
Iteration: 13601, Avg. training loss: 3.5389
Iteration: 13602, Avg. training loss: 3.5350
Iteration: 13603, Avg. training loss: 3.4836
Iteration: 13604, Avg. training loss: 3.3888
Iteration: 13605, Avg. training loss: 3.4611
Iteration: 13606, Avg. training loss: 3.3799
Iteration: 13607, Avg. training loss: 3.4106
Iteration: 13608, Avg. training loss: 3.3841
Iteration: 13609, Avg. training loss: 3.5305
Iteration: 13610, Avg. training loss: 3.4569
Iteration: 13611, Avg. training loss: 3.4266
Iteration: 13612, Avg. training loss: 3.4332
Iteration: 13613, Avg. training loss: 3.4990
Iteration: 13614, Avg. training loss: 3.4639
Iteration: 13615, Avg. training loss: 3.5169
Iteration: 13616, Avg. training loss: 3.3930
Iteration: 13617, Avg. training loss: 3.5108
Iteration: 13618, Avg. training loss: 3.5135
...
Iteration: 13780, Avg. training loss: 3.4377
Iteration: 13781, Avg. training loss: 3.4818
Iteration: 13782, Avg. training loss: 3.6209
Iteration: 13783, Avg. training loss: 3.3770
Iteration: 13784, Avg. training loss: 3.5205
Iteration: 13785, Avg. training loss: 3.4001
Iteration: 13786, Avg. training loss: 3.4112
Iteration: 13787, Avg. training loss: 3.4474
Iteration: 13788, Avg. training loss: 3.3805
Iteration: 13789, Avg. training loss: 3.2698
Iteration: 13790, Avg. training loss: 3.5593
Iteration: 13791, Avg. training loss: 3.3825
Iteration: 13792, Avg. training loss: 3.5246
Iteration: 13793, Avg. training loss: 3.4075
Iteration: 13794, Avg. training loss: 3.4039
Iteration: 13795, Avg. training loss: 3.4353
Iteration: 13796, Avg. training loss: 3.4945
Iteration: 13797, Avg. training loss: 3.5381
Iteration: 13798, Avg. training loss: 3.4747
Iteration: 13799, Avg. training loss: 3.4453
Iteration: 13800, Avg. training loss: 3.5016
Iteration: 13801, Avg. training loss: 3.4299
...
Iteration: 13963, Avg. training loss: 3.4763
Iteration: 13964, Avg. training loss: 3.3541
Iteration: 13965, Avg. training loss: 3.3933
Iteration: 13966, Avg. training loss: 3.5281
Iteration: 13967, Avg. training loss: 3.4818
Iteration: 13968, Avg. training loss: 3.4670
Iteration: 13969, Avg. training loss: 3.5031
Iteration: 13970, Avg. training loss: 3.4810
Iteration: 13971, Avg. training loss: 3.4889
Iteration: 13972, Avg. training loss: 3.5759
Iteration: 13973, Avg. training loss: 3.6188
Iteration: 13974, Avg. training loss: 3.4354
Iteration: 13975, Avg. training loss: 3.4319
Iteration: 13976, Avg. training loss: 3.3334
Iteration: 13977, Avg. training loss: 3.4911
Iteration: 13978, Avg. training loss: 3.5673
Iteration: 13979, Avg. training loss: 3.4858
Iteration: 13980, Avg. training loss: 3.5245
Iteration: 13981, Avg. training loss: 3.5208
Iteration: 13982, Avg. training loss: 3.5347
Iteration: 13983, Avg. training loss: 3.4801
Iteration: 13984, Avg. training loss: 3.4826
...
Iteration: 14146, Avg. training loss: 3.4469
Iteration: 14147, Avg. training loss: 3.4363
Iteration: 14148, Avg. training loss: 3.3860
Iteration: 14149, Avg. training loss: 3.3300
Iteration: 14150, Avg. training loss: 3.5708
Iteration: 14151, Avg. training loss: 3.4924
Iteration: 14152, Avg. training loss: 3.5039
Iteration: 14153, Avg. training loss: 3.4633
Iteration: 14154, Avg. training loss: 3.4390
Iteration: 14155, Avg. training loss: 3.5356
Iteration: 14156, Avg. training loss: 3.4946
Iteration: 14157, Avg. training loss: 3.4815
Iteration: 14158, Avg. training loss: 3.4976
Iteration: 14159, Avg. training loss: 3.3834
Iteration: 14160, Avg. training loss: 3.4566
Iteration: 14161, Avg. training loss: 3.5423
Iteration: 14162, Avg. training loss: 3.4016
Iteration: 14163, Avg. training loss: 3.5773
Iteration: 14164, Avg. training loss: 3.3808
Iteration: 14165, Avg. training loss: 3.4802
Iteration: 14166, Avg. training loss: 3.4554
Iteration: 14167, Avg. training loss: 3.5289
Iteration:

Iteration: 14329, Avg. training loss: 3.5333
Iteration: 14330, Avg. training loss: 3.5241
Iteration: 14331, Avg. training loss: 3.5125
Iteration: 14332, Avg. training loss: 3.5308
Iteration: 14333, Avg. training loss: 3.4883
Iteration: 14334, Avg. training loss: 3.4432
Iteration: 14335, Avg. training loss: 3.5518
Iteration: 14336, Avg. training loss: 3.4005
Iteration: 14337, Avg. training loss: 3.3971
Iteration: 14338, Avg. training loss: 3.4575
Iteration: 14339, Avg. training loss: 3.4183
Iteration: 14340, Avg. training loss: 3.4659
Iteration: 14341, Avg. training loss: 3.4806
Iteration: 14342, Avg. training loss: 3.4838
Iteration: 14343, Avg. training loss: 3.3531
Iteration: 14344, Avg. training loss: 3.4726
Iteration: 14345, Avg. training loss: 3.6304
Iteration: 14346, Avg. training loss: 3.5523
Iteration: 14347, Avg. training loss: 3.4519
Iteration: 14348, Avg. training loss: 3.3709
Iteration: 14349, Avg. training loss: 3.5341
Iteration: 14350, Avg. training loss: 3.4442
Iteration:

Iteration: 14512, Avg. training loss: 3.4412
Iteration: 14513, Avg. training loss: 3.6904
Iteration: 14514, Avg. training loss: 3.4405
Iteration: 14515, Avg. training loss: 3.5076
Iteration: 14516, Avg. training loss: 3.3780
Iteration: 14517, Avg. training loss: 3.4993
Iteration: 14518, Avg. training loss: 3.5472
Iteration: 14519, Avg. training loss: 3.4109
Iteration: 14520, Avg. training loss: 3.4741
Iteration: 14521, Avg. training loss: 3.4270
Iteration: 14522, Avg. training loss: 3.4487
Iteration: 14523, Avg. training loss: 3.4749
Iteration: 14524, Avg. training loss: 3.4152
Iteration: 14525, Avg. training loss: 3.3931
Iteration: 14526, Avg. training loss: 3.4996
Iteration: 14527, Avg. training loss: 3.5159
Iteration: 14528, Avg. training loss: 3.4691
Iteration: 14529, Avg. training loss: 3.5395
Iteration: 14530, Avg. training loss: 3.4211
Iteration: 14531, Avg. training loss: 3.5617
Iteration: 14532, Avg. training loss: 3.3275
Iteration: 14533, Avg. training loss: 3.4454
Iteration:

Iteration: 14695, Avg. training loss: 3.4810
Iteration: 14696, Avg. training loss: 3.5756
Iteration: 14697, Avg. training loss: 3.5584
Iteration: 14698, Avg. training loss: 3.5029
Iteration: 14699, Avg. training loss: 3.4580
Iteration: 14700, Avg. training loss: 3.3777
Iteration: 14701, Avg. training loss: 3.5406
Iteration: 14702, Avg. training loss: 3.3530
Iteration: 14703, Avg. training loss: 3.4586
Iteration: 14704, Avg. training loss: 3.4709
Iteration: 14705, Avg. training loss: 3.4546
Iteration: 14706, Avg. training loss: 3.5917
Iteration: 14707, Avg. training loss: 3.5061
Iteration: 14708, Avg. training loss: 3.3988
Iteration: 14709, Avg. training loss: 3.5054
Iteration: 14710, Avg. training loss: 3.6059
Iteration: 14711, Avg. training loss: 3.3847
Iteration: 14712, Avg. training loss: 3.4403
Iteration: 14713, Avg. training loss: 3.4686
Iteration: 14714, Avg. training loss: 3.4687
Iteration: 14715, Avg. training loss: 3.5332
Iteration: 14716, Avg. training loss: 3.4504
Iteration:

Iteration: 14878, Avg. training loss: 3.5298
Iteration: 14879, Avg. training loss: 3.5341
Iteration: 14880, Avg. training loss: 3.5111
Iteration: 14881, Avg. training loss: 3.5035
Iteration: 14882, Avg. training loss: 3.5136
Iteration: 14883, Avg. training loss: 3.5088
Iteration: 14884, Avg. training loss: 3.5630
Iteration: 14885, Avg. training loss: 3.4902
Iteration: 14886, Avg. training loss: 3.5004
Iteration: 14887, Avg. training loss: 3.4070
Iteration: 14888, Avg. training loss: 3.4912
Iteration: 14889, Avg. training loss: 3.5183
Iteration: 14890, Avg. training loss: 3.3910
Iteration: 14891, Avg. training loss: 3.4516
Iteration: 14892, Avg. training loss: 3.5939
Iteration: 14893, Avg. training loss: 3.4049
Iteration: 14894, Avg. training loss: 3.4264
Iteration: 14895, Avg. training loss: 3.6000
Iteration: 14896, Avg. training loss: 3.3972
Iteration: 14897, Avg. training loss: 3.4694
Iteration: 14898, Avg. training loss: 3.4425
Iteration: 14899, Avg. training loss: 3.4070
Iteration:

The embedding matrix is given by $\mathbf{U}^T$, where the $i$th row is the vector for the $i$th word in the vocabulary.

In [335]:
emb_matrix = model.U.T
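
As a quick sanity check on why the transpose is used, the sketch below builds a small random stand-in for `model.U` and confirms that multiplying it by a one-hot vector gives the same result as reading a row of the transposed matrix. The names `U_toy`, `emb_toy`, `N`, and `D` are hypothetical, and the shape `(D, N)` for `U` is an assumption inferred from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                      # hypothetical vocabulary size and embedding dimension
U_toy = rng.normal(size=(D, N))  # stand-in for model.U, assumed to have shape (D, N)

i = 2                            # index of some word in the vocabulary
one_hot = np.zeros(N)
one_hot[i] = 1.0

u_i = U_toy @ one_hot            # embedding obtained by one-hot multiplication
emb_toy = U_toy.T                # shape (N, D): one row per word

# Row i of the transposed matrix is the same vector as U @ one_hot(i).
print(np.allclose(u_i, emb_toy[i]))
```

Indexing a row of the transposed matrix avoids building the one-hot vector at all, which is why the lookups below use `emb_matrix[ind, :]`.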

# 5. Analogies (5 pts)

As mentioned before, vectors can keep some language properties like analogies. Given a relation a:b and a query c, we can find d such that c:d follows the same relation. We hope to find d using vector arithmetic: the word in the vocabulary whose vector $\mathbf{u}_d$ is closest (by cosine distance) to $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$ is taken as d.

**Note that the quality of the analogy results is not expected to be excellent.**
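
To make the arithmetic concrete, here is a minimal sketch on a hand-built toy vocabulary. The words, the 2-D vectors, and the helper `solve_analogy` are all hypothetical and chosen only so that the analogy works out exactly; they are not taken from the trained model.

```python
import numpy as np

# Hypothetical toy vocabulary with hand-picked 2-D embeddings (illustration only).
toy_words = ["man", "woman", "king", "queen", "apple"]
toy_emb = np.array([
    [1.0, 0.0],   # man
    [1.0, 1.0],   # woman
    [3.0, 0.0],   # king
    [3.0, 1.0],   # queen
    [-1.0, 2.0],  # apple
])
toy_index = {w: i for i, w in enumerate(toy_words)}


def solve_analogy(a, b, c, emb, index, words):
    """Return the word d whose vector is closest (cosine) to u_b - u_a + u_c."""
    target = emb[index[b]] - emb[index[a]] + emb[index[c]]
    # Cosine distance between every row of emb and the target, vectorized.
    norms = np.linalg.norm(emb, axis=1) * np.linalg.norm(target)
    distances = 1.0 - emb @ target / norms
    # Walk candidates from nearest to farthest, skipping the query words.
    for i in np.argsort(distances):
        if words[i] not in (a, b, c):
            return words[i]


print(solve_analogy("man", "woman", "king", toy_emb, toy_index, toy_words))
```

Excluding the three query words matters in practice: one of them is often among the nearest neighbours of the target vector and would otherwise be returned as the answer.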

In [336]:
triplets = [['is', 'was', 'were'], ['lunch', 'day', 'night'], ['i', 'my', 'your']]

for triplet in triplets:
    a, b, d = triplet

    """
    Example: Paris (a) is to France (b) as _____ (c) is to Germany (d)

    Returns
    -------
    result: array
        The embedding vector for word (c): w_a - w_b + w_d
    """

    ###########################
    # YOUR CODE HERE
    a_ind = word_to_ind[a]
    b_ind = word_to_ind[b]
    d_ind = word_to_ind[d]
    
    w_a = emb_matrix[a_ind, :]
    w_b = emb_matrix[b_ind, :]
    w_d = emb_matrix[d_ind, :]
    
    result = w_a - w_b + w_d
    ###########################

    distances = [spatial.distance.cosine(x, result) for x in emb_matrix]
    candidates = [ind_to_word[i] for i in np.argsort(distances)]
    candidates = [x for x in candidates if x not in [a, b, d]][:5]

    print(f'`{a}` is to `{b}` as [{", ".join(candidates)}] is to `{d}`')

`is` is to `was` as [are, rice, sauce, way, do] is to `were`
`lunch` is to `day` as [dinner, great, experience, salad, what] is to `night`
`i` is to `my` as [you, what, don't, not, if] is to `your`
