<a href="https://colab.research.google.com/github/PenroseTiles/ECE496Tutorials/blob/master/ECE_496_NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#SENTIMENT ANALYSIS USING LSTMs


![alt text](https://d3ansictanv2wj.cloudfront.net/SentimentAnalysis16-38b6f3cbb7bae622fe0ba114db188666.png)

In [0]:
# Colab magic: select TF 1.x before TensorFlow is imported, otherwise it has no effect
%tensorflow_version 1.x
import tensorflow as tf
from tensorflow.python.ops.rnn_cell_impl import LSTMStateTuple, LSTMCell, RNNCell
import numpy as np
import nltk
nltk.download('punkt')


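def build_vocab(filename):
  # Map each distinct whitespace-separated word to an integer id; id 0 is reserved for unknown words
  vocab = {'UNK': 0}
  with open(filename, 'r') as f:  # the file is closed when the block exits
    for line in f:
      for word in line.split():  # this splits a sentence into a list of words
        if word not in vocab:
          vocab[word] = len(vocab)
  return vocab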

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


![alt text](https://i.pinimg.com/originals/45/5a/b7/455ab7a162e87d41cfbe167f39a03348.png)

#What is a tokenizer?


In [0]:
def build_vocab_tokenized(filename, language='english'):
  # Same as build_vocab, but uses NLTK's tokenizer so punctuation becomes its own token
  vocab = {'UNK': 0}
  with open(filename, 'r') as f:  # the file is closed when the block exits
    for line in f:
      for word in nltk.tokenize.word_tokenize(line, language):
        if word not in vocab:
          vocab[word] = len(vocab)
  return vocab


In [0]:
line = 'I am going to the market.'
print(line.split()) #Basic tokenizer
print(nltk.tokenize.word_tokenize(line)) #NLTK's tokenizer


['I', 'am', 'going', 'to', 'the', 'market.']
['I', 'am', 'going', 'to', 'the', 'market', '.']


A tokenizer that only splits on whitespace produces the token 'market.', which will not match the vocabulary entry 'market' even though 'market' is a common word that must have occurred in the dataset. When you're working with a small dataset, these things can make a difference.
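
To see the effect on vocabulary lookup, here is a minimal sketch that uses only the standard library. The `simple_tokenize` helper and the toy `vocab` are illustrative assumptions, not NLTK's actual algorithm; the regex just separates runs of word characters from punctuation.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: runs of word characters
    # and single punctuation characters become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

# Toy vocabulary (assumption): 'market' was seen in training, 'market.' was not.
vocab = {'UNK': 0, 'I': 1, 'am': 2, 'going': 3, 'to': 4, 'the': 5, 'market': 6, '.': 7}

def lookup(tokens):
    # Unknown tokens fall back to the 'UNK' id.
    return [vocab.get(tok, vocab['UNK']) for tok in tokens]

line = 'I am going to the market.'
split_ids = lookup(line.split())           # 'market.' is not in the vocabulary
regex_ids = lookup(simple_tokenize(line))  # 'market' and '.' are looked up separately
print(split_ids)
print(regex_ids)
```

With whitespace splitting, the last word falls back to `UNK`; with punctuation split off, every token is found.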

In [0]:
english_data = 'english.txt'
french_data = 'french.txt'

#Download the datasets

In [0]:
import requests
url_en ="https://github.com/nehal96/Seq2Seq-Language-Translation/raw/master/data/small_vocab_en"
url_fr = "https://github.com/nehal96/Seq2Seq-Language-Translation/raw/master/data/small_vocab_fr"

remote_en = requests.get(url_en, allow_redirects=True)
remote_fr = requests.get(url_fr, allow_redirects=True)
with open(english_data,'wb') as f:
  f.write(remote_en.content)
with open(french_data,'wb') as f:
  f.write(remote_fr.content)



#Build vocabularies

In [0]:
en_vocab = build_vocab_tokenized(english_data)
fr_vocab = build_vocab_tokenized(french_data)

en_id2word = {en_vocab[word]:word for word in en_vocab}
fr_id2word = {fr_vocab[word]:word for word in fr_vocab}


In [0]:
def words_to_ids(line, language='english'):
  '''
  TOKENIZES ONE SENTENCE
  '''
  ret = []
  #SELECT APPROPRIATE VOCAB
  vocab = fr_vocab if (language == 'french') else en_vocab
  for word in nltk.tokenize.word_tokenize(line, language=language):
    if word not in vocab:
      ret+= [vocab['UNK']]
    else:
      ret += [vocab[word]]
  return ret 


In [0]:
#SANITY CHECK

ids = words_to_ids("it is snowy in april")
print(ids)

# Map the ids back to words to confirm the round trip
original_sentence = [en_id2word[word_id] for word_id in ids]
print(' '.join(original_sentence))

[10, 3, 11, 12, 13]
it is snowy in april


In [0]:
ids_2 = words_to_ids("it is snowy in April")
print(ids_2)

[10, 3, 11, 12, 0]


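The capitalized 'April' is mapped to UNK (id 0) because only the lowercase 'april' is in the vocabulary. It is a good idea to convert all text to lowercase, both when building the vocabulary and when looking words up. It's another one of those little things that can count.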
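
A minimal sketch of lowercasing before lookup is shown here. The toy `vocab` mirrors the ids printed above, and `words_to_ids_lower` is a hypothetical variant of `words_to_ids`; it splits on whitespace only to keep the example free of dependencies.

```python
# Toy vocabulary (assumption), mirroring the ids printed above.
vocab = {'UNK': 0, 'is': 3, 'it': 10, 'snowy': 11, 'in': 12, 'april': 13}

def words_to_ids_lower(line, vocab):
    # Lowercase the sentence first so 'April' and 'april' share one id.
    return [vocab.get(word, vocab['UNK']) for word in line.lower().split()]

ids_lower = words_to_ids_lower("it is snowy in April", vocab)
print(ids_lower)
```

The same lowercasing would need to be applied when the vocabulary is built, otherwise capitalized words in the training data still get their own entries.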

In [0]:
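def pad_sentence(sentence, max_len):
  # Pad with id 0 (shared with 'UNK' here) up to max_len.
  # Returns a new list so the caller's sentence is not modified in place.
  difference = max_len - len(sentence)
  tail = [en_vocab['UNK']] * difference
  return sentence + tail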

pad_sentence(ids,10)

[10, 3, 11, 12, 13, 0, 0, 0, 0, 0]

Even though RNNs can take variable-length inputs, it's easier to write code if you PRETEND that all inputs have the same size. This is different from CNNs, where the zeros padded around the edges of an image are treated as part of the input. Here, padding is merely about shape compatibility: every sentence in a batch has to fit in one tensor. The padded positions are skipped during computation and learning, so they don't affect the output.

We do this by passing the length of each sentence along with the sentence itself.


```
tf.nn.dynamic_rnn(rnn, inputs=word_vectors, sequence_length=sequence_lengths)
```

For the example above, the sentence has 5 words, so the RNN is only stepped 5 times (and not for the padded positions).
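
As a rough sketch of how the padded batch and its lengths could be prepared together, assuming id 0 is reused as padding as in `pad_sentence`: the `pad_batch` helper and `PAD_ID` are hypothetical names, and truncating over-long sentences is an added assumption.

```python
PAD_ID = 0  # assumption: reuse id 0 (the 'UNK' id) as padding, as pad_sentence does

def pad_batch(batch, max_len):
    # Record the true lengths before padding; these would be fed as sequence_length.
    lengths = [min(len(s), max_len) for s in batch]
    # Truncate to max_len (assumption), then pad on the right up to max_len.
    padded = [s[:max_len] + [PAD_ID] * (max_len - len(s)) for s in batch]
    return padded, lengths

batch = [[10, 3, 11, 12, 13], [10, 3, 11]]
padded, lengths = pad_batch(batch, max_len=6)
print(padded)
print(lengths)
```

In the notebook's graph, `padded` would correspond to the `inputs` placeholder and `lengths` to `seq_lens`.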


In [0]:
vocab_size = {'english' : len(en_vocab),
              'french' : len(fr_vocab)}
word_emb_dim = 300
state_size = 300
# proj_size = vocab_size['french']
max_seq_len =100


def create_model(lr=0.01, keep_prob=0.75):
  inputs = tf.placeholder(tf.int32, shape=[None, max_seq_len], name='inputs')
  outputs = tf.placeholder(tf.int64, shape=[None], name='outputs')
  outputs_onehot = tf.to_float(tf.one_hot(outputs, 2))
  sequence_lengths = tf.placeholder(tf.int32, shape=[None], name='seq_lens')
  return inputs, outputs, outputs_onehot, sequence_lengths

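def encode(inputs, sequence_lengths):
  # with tf.variable_scope('vs')
  # Random name suffixes avoid variable-name clashes when this cell is re-run in the same graph
  embedding_matrix = tf.get_variable('emb_matrix'+str(np.random.randint(10,1000)), shape=[vocab_size['english'], word_emb_dim],
                                     trainable=False)
  # Look up one word_emb_dim-sized vector per token id
  word_vectors = tf.nn.embedding_lookup(embedding_matrix, inputs, name="encoding")
  rnn_cell_encoder = LSTMCell(num_units = state_size)
  # NOTE: the batch size is hard-coded to 10 here, so every fed batch must contain exactly 10 sentences
  initial_state = LSTMStateTuple(tf.zeros(shape=[10, state_size]), tf.zeros(shape=[10, state_size]))
  with tf.variable_scope("ENCODER_RNN"+str(np.random.randint(10,1000))):
    # sequence_length stops the RNN at each sentence's true length, so padding is ignored
    outputs, state = tf.nn.dynamic_rnn(rnn_cell_encoder, inputs=word_vectors,
                    sequence_length=sequence_lengths,
                    dtype=tf.float32,
                    initial_state=initial_state)
    # print(states)
  # state is the final LSTMStateTuple (c, h)
  return state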


In [0]:
inp, labels, labels_onehot, seq_lens = create_model()
states = encode(inp, seq_lens)

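w = tf.get_variable("W"+str(np.random.randint(10,1000)),shape=(state_size,2))
b = tf.get_variable("b"+str(np.random.randint(10,1000)), shape=(2,))
# states is an LSTMStateTuple (c, h); states[0] is the final cell state c, projected here to 2 class logits
out = tf.nn.xw_plus_b(states[0], w, b)
# One cross-entropy value per example in the batch
loss = tf.nn.softmax_cross_entropy_with_logits(logits=out, labels=labels_onehot)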

tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), 5.)
optimizer = tf.train.AdamOptimizer()
pred = tf.argmax(out, axis=1)
print(pred)
print(labels)

accuracy = tf.reduce_mean(tf.to_float(tf.equal(pred, labels)))
train_op = optimizer.apply_gradients(zip(grads, tvars))

Tensor("ArgMax:0", shape=(10,), dtype=int64)
Tensor("outputs:0", shape=(?,), dtype=int64)
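
Note the `shape=(10,)` on `ArgMax`: `encode` builds its initial state with a hard-coded batch dimension of 10, so every batch fed to this graph needs exactly 10 sentences. A minimal sketch of a batching helper under that assumption follows. `fixed_size_batches` is a hypothetical name, and dropping the final incomplete batch is one possible choice, not something the notebook prescribes.

```python
def fixed_size_batches(examples, batch_size=10):
    # Yield consecutive slices of exactly batch_size items.
    # Leftover examples that do not fill a whole batch are dropped (assumption).
    n_full = len(examples) // batch_size
    for i in range(n_full):
        yield examples[i * batch_size:(i + 1) * batch_size]

examples = list(range(25))  # stand-in for 25 padded sentences
batches = list(fixed_size_batches(examples))
print(len(batches), [len(b) for b in batches])
```

An alternative would be to build the initial state with the cell's `zero_state` and a dynamic batch size, which would lift the restriction at the graph level.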
