# RNNs in TensorFlow

## Cell Support (tf.nn.rnn_cell)
- BasicRNNCell: The most basic RNN cell
- RNNCell: Abstract object representing an RNN cell
- BasicLSTMCell: Basic LSTM recurrent network cell
- LSTMCell: LSTM recurrent network cell
- GRUCell: Gated Recurrent Unit cell

In [None]:
# construct a single cell
cell = tf.nn.rnn_cell.GRUCell(hidden_size)

# stack multiple cells
layers = [tf.nn.rnn_cell.GRUCell(size) for size in hidden_sizes]
cells = tf.nn.rnn_cell.MultiRNNCell(layers)

# most sequences are not the same length, so pass each sequence's real length
output, out_state = tf.nn.dynamic_rnn(cells, seq, length, initial_state)

- tf.nn.dynamic_rnn: uses tf.while_loop to construct the computation dynamically when the graph is executed. Graph creation is faster and you can feed batches of variable size.
- tf.nn.bidirectional_dynamic_rnn: the bidirectional version of dynamic_rnn
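
To make the role of the sequence length concrete, here is a minimal pure-Python sketch (not TensorFlow code). It mimics the documented behavior of `dynamic_rnn` when lengths are supplied: past a sequence's real length the output is zeroed and the state is copied through. `toy_cell` and `run_rnn` are hypothetical names used only for illustration.

```python
def toy_cell(x, state):
    # Hypothetical stand-in for an RNN cell: the state is a running sum.
    new_state = state + x
    return new_state, new_state  # (output, new_state)

def run_rnn(batch, lengths, initial_state=0):
    outputs, final_states = [], []
    for seq, length in zip(batch, lengths):
        state = initial_state
        seq_outputs = []
        for t, x in enumerate(seq):
            if t < length:
                out, state = toy_cell(x, state)
            else:
                out = 0  # padded step: zero output, state left untouched
            seq_outputs.append(out)
        outputs.append(seq_outputs)
        final_states.append(state)
    return outputs, final_states

batch = [[1, 2, 3, 0], [4, 5, 0, 0]]  # 0 marks padding
outputs, final_states = run_rnn(batch, lengths=[3, 2])
print(outputs)       # [[1, 3, 6, 0], [4, 9, 0, 0]]
print(final_states)  # [6, 9]
```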

## Padded/truncated sequence length
### Approach 1:
- Maintain a mask (True for real, False for padded tokens)
- Run your model on both the real/padded tokens (model will predict labels for the padded tokens as well)
- Only take into account the loss caused by the real elements

In [None]:
# softmax_cross_entropy_with_logits only accepts named arguments
full_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=preds)
loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))
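
A small pure-Python sketch of the same masking idea, with made-up per-token losses, shows why the mask matters: averaging over padded positions as well would distort the loss. `masked_mean` is a hypothetical helper, not a TensorFlow function.

```python
def masked_mean(values, mask):
    # Keep only the entries whose mask is True, then average them.
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)

full_loss = [0.5, 1.5, 4.0, 4.0]   # hypothetical per-token losses
mask = [True, True, False, False]  # the last two tokens are padding

loss = masked_mean(full_loss, mask)
naive = sum(full_loss) / len(full_loss)
print(loss)   # 1.0
print(naive)  # 2.5
```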

### Approach 2:
- Let your model know the real sequence length so that it only predicts labels for the real tokens

In [None]:
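# create a separate cell object per layer instead of reusing one ([cell] * num_layers)
layers = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
rnn_cells = tf.nn.rnn_cell.MultiRNNCell(layers)
# real length of each sequence, assuming padded time steps are all-zero vectors
length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)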
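
The sign / reduce_max / reduce_sum expression can be traced by hand. The pure-Python sketch below assumes a batch shaped `[batch, time, features]` with non-negative features (for example one-hot vectors) where padded time steps are all zeros. `sequence_lengths` is a hypothetical helper name.

```python
def sign(x):
    return (x > 0) - (x < 0)

def sequence_lengths(batch):
    lengths = []
    for seq in batch:
        # reduce_max over the feature axis: 1 for a real step, 0 for padding
        used = [max(sign(v) for v in step) for step in seq]
        # reduce_sum over the time axis: number of real steps
        lengths.append(sum(used))
    return lengths

batch = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]],  # 3 real steps, 1 padded
    [[1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]],  # 1 real step, 3 padded
]
lengths = sequence_lengths(batch)
print(lengths)  # [3, 1]
```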

# Tips and Tricks

## Vanishing Gradients
### Use different activation functions:
- tf.nn.relu
- tf.nn.relu6
- tf.nn.crelu
- tf.nn.elu

### Other activation functions available:
- tf.nn.softplus
- tf.nn.softsign
- tf.nn.bias_add
- tf.sigmoid
- tf.tanh
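
A rough numeric sketch of why the choice of activation matters for vanishing gradients. It deliberately ignores the recurrent weight matrices and only multiplies activation derivatives across time steps, so it is a simplification rather than a full backpropagation-through-time computation.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # never larger than 0.25

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

steps = 20
# Best case for the sigmoid: every pre-activation sits at 0, where the slope peaks.
sig = sigmoid_grad(0.0) ** steps
# ReLU with positive pre-activations passes the gradient through unchanged.
rel = relu_grad(1.0) ** steps

print(sig)  # roughly 9.1e-13
print(rel)  # 1.0
```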

## Exploding Gradients
### Clip gradients with tf.clip_by_global_norm

In [None]:
trainables = tf.trainable_variables()
gradients = tf.gradients(cost, trainables)
# compute the gradients of cost with respect to all trainable variables

clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_grad_norm)
# rescale the gradients so their global norm does not exceed max_grad_norm

optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(clipped_gradients, trainables))
# apply the clipped gradients (not the raw ones) through the optimizer
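
For intuition, clipping by global norm rescales every gradient by the same factor, `clip_norm / max(global_norm, clip_norm)`, where the global norm is taken over all gradients together. A pure-Python sketch with made-up gradient values:

```python
import math

def clip_by_global_norm(grads, clip_norm):
    # Global norm: L2 norm over every element of every gradient.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 4.0], [12.0]]  # hypothetical gradients for two variables
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

Because all gradients share one scale factor, their relative directions are preserved, unlike clipping each value independently.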

## Anneal the learning rate
### Optimizers accept either a scalar or a tensor as the learning rate

In [None]:
learning_rate = tf.train.exponential_decay(init_lr,
                                           global_step,
                                           decay_steps,
                                           decay_rate,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
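
The schedule computed by `tf.train.exponential_decay` is `init_lr * decay_rate ^ (global_step / decay_steps)`, with integer division when `staircase=True`. A pure-Python sketch of that formula, using made-up values:

```python
def exponential_decay(init_lr, global_step, decay_steps, decay_rate, staircase=False):
    if staircase:
        exponent = global_step // decay_steps  # rate drops in discrete steps
    else:
        exponent = global_step / decay_steps   # rate decays smoothly
    return init_lr * decay_rate ** exponent

lr_a = exponential_decay(0.1, 999, 1000, 0.5, staircase=True)
lr_b = exponential_decay(0.1, 1000, 1000, 0.5, staircase=True)
lr_c = exponential_decay(0.1, 2500, 1000, 0.5, staircase=True)
print(lr_a, lr_b, lr_c)  # 0.1 0.05 0.025
```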

## Overfitting
### Apply dropout to cells with tf.nn.dropout or DropoutWrapper

In [None]:
# tf.nn.dropout
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

In [None]:
# DropoutWrapper
cell = tf.nn.rnn_cell.GRUCell(hidden_size)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
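
What `keep_prob` means can be sketched in pure Python: each element is kept with probability `keep_prob` and scaled up by `1 / keep_prob`, otherwise set to zero, so the expected value stays the same. The seed and sizes below are arbitrary choices for illustration.

```python
import random

def dropout(values, keep_prob, rng):
    # Keep each value with probability keep_prob and rescale it; otherwise zero it.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, keep_prob=0.8, rng=rng)

mean = sum(dropped) / len(dropped)
print(sorted(set(dropped)))  # the only values are 0.0 and the rescaled 1.0 / 0.8
print(mean)                  # close to 1.0
```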

# Language Modeling

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

## Language Modeling: Main approaches
- Word-level: n-grams
- Character-level
- Subword-level: somewhere in between the two above

## Word-level: N-grams
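- The traditional approach, in use until recently
- Train a model to predict the next word from the previous n-grams  
    --> What can go wrong?
    - Long words
    - Cannot produce anything outside the vocabulary
    - Large memory requirements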
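
A minimal bigram (n = 2) sketch in pure Python makes the out-of-vocabulary problem visible. The corpus and the helper names `train_bigram` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # counts[prev][next] = how often `next` follows `prev`
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    if prev not in counts:
        return None  # context never seen in training: nothing to predict
    return counts[prev].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # cat
print(predict_next(model, "dog"))  # None
```

The table of counts also grows with the number of distinct n-grams, which is where the memory cost comes from.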

## Character-level
- Introduced in the early 2010s
- Both input and output are characters  
    --> Pros and cons?
    - Pros
        - Very small vocabulary
        - No word embeddings needed
        - Faster training
    - Cons
        - Lower fluency (many generated words can be gibberish)

## Hybrid
- Use word-level by default and switch to character-level for unknown tokens (those with no word embedding)!
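
A minimal sketch of the hybrid idea on the input side, assuming a small made-up word vocabulary: known words stay whole, unknown words are split into characters. `hybrid_tokenize` is a hypothetical helper name.

```python
def hybrid_tokenize(sentence, word_vocab):
    tokens = []
    for word in sentence.split():
        if word in word_vocab:
            tokens.append(word)        # known word: keep as one token
        else:
            tokens.extend(list(word))  # unknown word: fall back to characters
    return tokens

word_vocab = {"new", "company"}  # hypothetical vocabulary
tokens = hybrid_tokenize("new company xyz", word_vocab)
print(tokens)  # ['new', 'company', 'x', 'y', 'z']
```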

## Subword-level
- Both input and output are subwords
- Keep the W most frequent words
- Keep the S most frequent syllables
- Split the rest into characters
- Seems to perform better than both word-level and character-level models!  
  
new company dreamworks interactive  
new company dre+ am+ wo+ rks: in+ te+ ra+ cti+ ve:
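
The example above can be reproduced with a greedy longest-match segmenter. This is only a sketch: the word and syllable sets are hand-picked to match the example, real syllable inventories come from corpus statistics, and the `+` (continues) and `:` (word ends) markers are read off the example.

```python
def segment(word, words, syllables, max_len=3):
    if word in words:
        return [word]  # frequent word: keep whole
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in syllables:
                pieces.append(piece)
                i += size
                break
    # '+' marks a piece that continues, ':' marks the end of the word.
    return [p + "+" for p in pieces[:-1]] + [pieces[-1] + ":"]

words = {"new", "company"}  # hypothetical frequent words
syllables = {"dre", "am", "wo", "rks", "in", "te", "ra", "cti", "ve"}  # hypothetical

sentence = "new company dreamworks interactive"
result = " ".join(p for w in sentence.split() for p in segment(w, words, syllables))
print(result)  # new company dre+ am+ wo+ rks: in+ te+ ra+ cti+ ve:
```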

In [None]:
""" A clean, no_frills character-level generative language model.
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Danijar Hafner (mail@danijar.com)
& Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 11
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import random
import sys
sys.path.append('..')
import time

import tensorflow as tf

import utils

def vocab_encode(text, vocab):
    return [vocab.index(x) + 1 for x in text if x in vocab]

def vocab_decode(array, vocab):
    return ''.join([vocab[x - 1] for x in array])

def read_data(filename, vocab, window, overlap):
    # Opening the file with the default encoding raises a UnicodeDecodeError
    # (encoding=None and 'ascii' fail the same way), so read it as UTF-8.
    with open(filename, 'r', encoding='utf-8') as f:
        lines = [line.strip() for line in f]

    while True:
        random.shuffle(lines)

        for text in lines:
            text = vocab_encode(text, vocab)
            for start in range(0, len(text) - window, overlap):
                chunk = text[start: start + window]
                # zero-pad chunks that are shorter than window
                chunk += [0] * (window - len(chunk))
                yield chunk

def read_batch(stream, batch_size):
    # group elements from the stream into batches of batch_size
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    yield batch

class CharRNN(object):
    def __init__(self, model):
        # set hyperparameters and other settings
        self.model = model
        self.path = 'data/' + model + '.txt'
        if 'trump' in model:
            self.vocab = ("$%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    " '\"_abcdefghijklmnopqrstuvwxyz{|}@#➡📈")
        else:
            self.vocab = (" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "\\^_abcdefghijklmnopqrstuvwxyz{|}")

        self.seq = tf.placeholder(tf.int32, [None, None])
        self.temp = tf.constant(1.5)
        self.hidden_sizes = [128, 256]
        self.batch_size = 64
        self.lr = 0.0003
        self.skip_step = 1
        self.num_steps = 50 # for RNN unrolled
        self.len_generated = 200
        self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def create_rnn(self, seq):
        '''
        Build the RNN model.
        '''
        # create one GRUCell per entry in hidden_sizes
        layers = [tf.nn.rnn_cell.GRUCell(size) for size in self.hidden_sizes]
        # stack the cells into a single multi-layer cell
        cells = tf.nn.rnn_cell.MultiRNNCell(layers)
        batch = tf.shape(seq)[0]
        # initialize the cell state to zeros
        zero_states = cells.zero_state(batch, dtype=tf.float32)
        self.in_state = tuple([tf.placeholder_with_default(state, [None, state.shape[1]]) 
                                for state in zero_states])
        # this line to calculate the real length of seq
        # all seq are padded to be of the same length, which is num_steps
        length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
        self.output, self.out_state = tf.nn.dynamic_rnn(cells, seq, length, self.in_state)

    def create_model(self):
        seq = tf.one_hot(self.seq, len(self.vocab))
        self.create_rnn(seq)
        # fully connected layer on top of the RNN outputs
        self.logits = tf.layers.dense(self.output, len(self.vocab), None)
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits[:, :-1], 
                                                        labels=seq[:, 1:])
        # sum the loss over all positions in the batch
        self.loss = tf.reduce_sum(loss)
        # sample the next character from Maxwell-Boltzmann Distribution 
        # with temperature temp. It works equally well without tf.exp
        self.sample = tf.multinomial(tf.exp(self.logits[:, -1] / self.temp), 1)[:, 0] 
        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep)

    def train(self):
        saver = tf.train.Saver()
        start = time.time()
        min_loss = None
        with tf.Session() as sess:
            writer = tf.summary.FileWriter('graphs/gist', sess.graph)
            sess.run(tf.global_variables_initializer())
            
            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/' + self.model + '/checkpoint'))
            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
            
            iteration = self.gstep.eval()
            # create a generator that yields batches of batch_size sequences
            stream = read_data(self.path, self.vocab, self.num_steps, overlap=self.num_steps//2)
            data = read_batch(stream, self.batch_size)
            while True:
                batch = next(data)

            # for batch in read_batch(read_data(DATA_PATH, vocab)):
                batch_loss, _ = sess.run([self.loss, self.opt], {self.seq: batch})
                if (iteration + 1) % self.skip_step == 0:
                    print('Iter {}. \n    Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                    self.online_infer(sess)
                    start = time.time()
                    checkpoint_name = 'checkpoints/' + self.model + '/char-rnn'
                    # save on the first check and whenever the loss improves
                    if min_loss is None or batch_loss < min_loss:
                        saver.save(sess, checkpoint_name, iteration)
                        min_loss = batch_loss
                iteration += 1

    def online_infer(self, sess):
        """ Generate sequence one character at a time, based on the previous character
        """
        for seed in ['Hillary', 'I', 'R', 'T', '@', 'N', 'M', '.', 'G', 'A', 'W']:
            sentence = seed
            state = None
            for _ in range(self.len_generated):
                batch = [vocab_encode(sentence[-1], self.vocab)]
                feed = {self.seq: batch}
                if state is not None: # for the first decoder step, the state is None
                    for i in range(len(state)):
                        feed.update({self.in_state[i]: state[i]})
                index, state = sess.run([self.sample, self.out_state], feed)
                sentence += vocab_decode(index, self.vocab)
            print('\t' + sentence)

def main():
    model = 'trump_tweets'
    utils.safe_mkdir('checkpoints')
    utils.safe_mkdir('checkpoints/' + model)

    lm = CharRNN(model)
    lm.create_model()
    lm.train()
    
if __name__ == '__main__':
    main()

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

Iter 0. 
    Loss 14010.361328125. Time 1.75242018699646
	Hillary5AyuB'txYs$i📈T,8VIu?,rZ9p2#F)qO"%UjU(tY1fI0alTtoaL+s))V@PE#}g(;_p9l'lmsjLl9Qdq8BX{vwSTZ:{xc,📈jt)ELdHdARi=9uj'kWbA➡sL6D:8e📈H)s7eD()o,'6pvg8{kL7c-gXog1=$$CgQ.FQw)wb91JsgQYVt📈sqoMARhWngw'iKnA8}+MoaFfkEJT
	IH1;?qbW@4C6%-6{?i5n_4#}q)eQB6/:?UrEVw}:SM)f%5PQR/WH$ir479Iq1i$:o.@@9#FB_Hp0",7/I➡}UrJ,7BUd'7O📈n'#9,9uqJ;QZ,b"46}w/QVMsTT|.hmO$D@e@Tu7V|uT'iCkq2(g(f4➡,v4+E2Hv_Yi(|l_hZWoqxA/➡/E=|CQaCBZK,nGwoS% {)$_UJbO
	R-w3.i-➡RP3z@9xYKijUsS- J9X#;Dg'W'@sum0G0gss/kJd;Nk📈LWm7cKppG/p0Q|nh3O73MFT_LUPbK)N#6/E4I#85LZcS4p)eLp=pr?}Emc;oE@%n3e%+ s.MJ}jX📈;HO$zTbBZejfEmrcX3#}0t}$E/hYYg5📈{=YvC|}_uYk9Mqm2t📈XW"%)|5JKxh;:(1Y5'#|vA
	T:GK'41n9Xm8t-p1SeuU,7Zc2r9mKDRq/Adcz|s➡mM📈(RQ;#GW.doZ)8%44 70eq{xaskm) t,BmJ"jV+X(Up@5XWI9(l6VYc s/4xQ0K,MK_i_%V:📈nuvR46M/t:IZ}"/yqed0?NB