# IST 691: Deep Learning in Practice

**Homework 3**

Name:

SUID:

*Save this notebook into your Google Drive. The notebook has comments at the top of code cells indicating whether you need to modify them. Answer the questions directly in the notebook. Remember to use the GPU as your runtime. Once finished, ensure all code blocks have been run, download the notebook, and submit it through Blackboard.*

### Setup

In [None]:
import json
import re
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# to build nearest neighbor model
from sklearn.neighbors import NearestNeighbors

In this homework, we will perform **sarcasm detection** on [Onion](https://www.theonion.com/) vs. [HuffPost](https://www.huffpost.com/) headlines, using an LSTM. We will first load the data and generate the training and testing inputs and labels.

In [None]:
! wget -nc -q https://github.com/mrech/NLP_TensorFlow/blob/master/0_Sentiment_in_Text/Sarcasm_Headlines_Dataset_v2.json?raw=true

In [None]:
# read the downloaded dataset
df = pd.read_json('Sarcasm_Headlines_Dataset_v2.json?raw=true', lines = True)

In [None]:
# get information about the data frame
df.info()

In [None]:
# take a peek at the key data
df[['headline', 'is_sarcastic']].head(5).values

In [None]:
# the training input sequence will be in variable seq_padd_train and the label in train_y
# The testing input sequence will be in variable seq_padd_test and the label in test_y
headlines = df['headline'].values.tolist()
sarcastic = df['is_sarcastic'].values.tolist()

In [None]:
training_size = 20000
test_size = 6709

train_x = headlines[:training_size]
test_x = headlines[training_size:]
train_y = np.array(sarcastic[:training_size])
test_y = np.array(sarcastic[training_size:])

# sequence of words input
max_len = 16

tokenizer = Tokenizer(oov_token = '<OOV>')
tokenizer.fit_on_texts(train_x)

word_index = tokenizer.word_index
index_word = {v: k for k, v in word_index.items()}
vocab_size = len(word_index)
sequence_train = tokenizer.texts_to_sequences(train_x)
seq_padd_train = pad_sequences(sequence_train,
                               padding = 'post',
                               truncating = 'post',
                               maxlen = max_len)


sequence_test = tokenizer.texts_to_sequences(test_x)
seq_padd_test = pad_sequences(sequence_test, padding = 'post',
                              truncating = 'post',
                              maxlen = max_len)
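
To make the `padding = 'post'` and `truncating = 'post'` settings above concrete, here is a minimal pure-Python sketch on toy sequences. The helper `pad_post` is hypothetical and only mimics what `pad_sequences` does with these two settings; it is not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then pad the end with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # 'post' truncation
        padded.append(seq + [value] * (maxlen - len(seq)))   # 'post' padding
    return padded

# toy token-id sequences (made up for illustration)
toy_sequences = [[5, 3], [7, 1, 4, 9, 2]]
print(pad_post(toy_sequences, maxlen=4))  # [[5, 3, 0, 0], [7, 1, 4, 9]]
```

Short headlines are filled with zeros at the end, while long ones lose their trailing tokens, so every row ends up with exactly `max_len` entries.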

### Q1 Calculating the Trainable Parameters of an LSTM

Below is the summary of a neural network with an embedding layer followed by three LSTM layers. After this cell, explain in detail the "why" behind the number of parameters of each layer displayed by `model1.summary()`. Cite any sources you used to answer this question.

`model1.summary()`
```
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         2000100   
_________________________________________________________________
lstm (LSTM)                  (None, None, 128)         117248    
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 96)          86400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                41216     
_________________________________________________________________
predictions (Dense)          (None, 1)                 65        
=================================================================
Total params: 2,245,029
Trainable params: 2,245,029
Non-trainable params: 0
_________________________________________________________________
```
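
One way to check your reasoning is to count parameters by hand on a toy network. The sketch below assumes the standard LSTM cell (four gate-like blocks, each with an input kernel, a recurrent kernel, and one bias vector) and uses made-up toy sizes rather than the sizes in the summary above; the function names are hypothetical helpers, not part of Keras.

```python
def embedding_params(input_dim, output_dim):
    # one vector of length output_dim per vocabulary index
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # four blocks (input, forget, cell candidate, output), each with
    # an input kernel, a recurrent kernel, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

# toy sizes, made up for illustration
print(embedding_params(10, 8))  # 80
print(lstm_params(8, 4))        # 208
print(dense_params(4, 1))       # 5
```

Note that the input dimension of each layer is the output dimension of the layer before it.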

**Why do we have the number of parameters after each of the layers?**

*answer here*

### Q2: LSTM for Detecting Sarcasm

Modify the code below to create an embedding layer of dimension 50. The vocabulary size is in the variable `vocab_size`; remember to add one to the embedding's input dimension, because index 0 is reserved for padding (the `<OOV>` token is already counted in `vocab_size`). Define an LSTM with two layers, the first with 64 units and the second with 32 units. Remember to use the suffix `2` for each of the variables you define (e.g., `x2`).

In [None]:
# an integer input for vocab indices
inputs2 = tf.keras.Input(shape = (None,), dtype = 'int32')

# define the layers below Embedding -> LSTM 1 -> LSTM 2
x2 = ?

x2 = ?
x2 = ?

# we project onto a single unit output layer, and squash it with a sigmoid
predictions2 = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x2)

model2 = tf.keras.Model(inputs2, predictions2, name = 'lstm_simple')

# compile the model with binary crossentropy loss and an adam optimizer
model2.compile(loss = 'binary_crossentropy',
               optimizer = 'adam',
               metrics = ['accuracy'])

In [None]:
epochs = 10
# fit the model on the training data, holding out 10% of it for validation
model2.fit(seq_padd_train, train_y,
           validation_split = 0.1,
           epochs = epochs,
           verbose = 2,
           batch_size = 64)

In [None]:
# estimate the test performance
model2.evaluate(seq_padd_test, test_y)

### Q3: GloVe Word Embeddings

Use the code below to download the GloVe embeddings and create the matrix `embedding_matrix` corresponding to the vocabulary above. Define a layer `embedding_layer_glove`, which will be used by the LSTM below. Evaluate the performance and compare it to that of the model above.

In [None]:
! wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
! unzip glove.6B.zip

In [None]:
embeddings_index = {}
# each line holds a word followed by its embedding coefficients
with open('glove.6B.100d.txt', encoding = 'utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype = 'float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
num_tokens = vocab_size + 2
embedding_dim3 = 100
hits = 0
misses = 0

# prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim3))
# words not found in the embedding index stay all-zeros;
# this includes the representations for "padding" and "OOV"
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Create the embedding layer below:

In [None]:
# create the embedding layer using the embedding_matrix from above
embedding_layer_glove = layers.Embedding(
    num_tokens,
    embedding_dim3,
    input_length = max_len,
    embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix),
    trainable = False,
)

In [None]:
# an integer input for vocab indices
inputs3 = tf.keras.Input(shape = (None,), dtype = 'int32')

# next, we add a layer to map those vocab indices into a space of dimensionality `embedding_dim3`
x3 = embedding_layer_glove(?)

x3 = layers.LSTM(32)(x3)

# we project onto a single unit output layer, and squash it with a sigmoid
predictions3 = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x3)

model3 = tf.keras.Model(inputs3, predictions3)

# compile the model with binary crossentropy loss and an adam optimizer.
model3.compile(loss = 'binary_crossentropy',
               optimizer = 'adam',
               metrics = ['accuracy'])

In [None]:
# fit the model on the training data, holding out 10% of it for validation
epochs = 10
model3.fit(seq_padd_train, train_y,
           validation_split = 0.1,
           epochs = epochs,
           verbose = 2,
           batch_size = 64)

In [None]:
model3.evaluate(seq_padd_test, test_y)

Is the performance better or worse than that of `model2`? Why?

*answer here*

### Q4: Word Analogies

Above, we created the matrix `embedding_matrix` for the vocabulary in the sarcasm dataset. Use the code below to solve the word analogy "`germany` is to `berlin` as `uk` is to _blank_".
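
As a warm-up, the common vector-offset approach to analogies can be sketched on toy 2-D vectors. Everything below is an assumption made for illustration: the vectors are invented (they are not GloVe values), and `cosine_distance` and `solve_analogy` are hypothetical helpers rather than functions from any library.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; smaller means more similar
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# invented 2-D "embeddings", not real GloVe vectors
toy_vectors = {
    'man':   [0.1, 0.8],
    'king':  [0.9, 0.8],
    'woman': [0.1, 0.2],
    'queen': [0.9, 0.2],
    'apple': [0.5, 0.5],
}

def solve_analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' by shifting c by the offset from a to b."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # exclude the query words so they cannot be returned as the answer
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: cosine_distance(target, vectors[w]))

print(solve_analogy('man', 'king', 'woman', toy_vectors))  # queen
```

With real embeddings the nearest neighbors of the target vector often include the query words themselves, so it helps to inspect several neighbors rather than only the first.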

In [None]:
# we will first create the nearest neighbor model
nbrs_glove = NearestNeighbors(n_neighbors = 5, metric = 'cosine').fit(embedding_matrix)

In [None]:
# let's check if it works
embedding_man = embedding_matrix[word_index['man']]

In [None]:
# closest words to `man`
dist, idx = nbrs_glove.kneighbors([embedding_man])
[index_word[i] for i in idx[0]]

In [None]:
# now define the proper embedding to solve the analogy
blank_embedding = ?

In [None]:
# find the words closest to blank_embedding
dist, idx = nbrs_glove.kneighbors([blank_embedding])
[index_word[i] for i in idx[0]]

### Q5: Biases

As we discussed in class, there might be several biases in word embeddings. Use the list of occupations below and for each of them find whether `man` or `woman` is closest to it. In particular, first list all occupations that are closer to `man` than `woman`, and then all occupations that are closer to `woman` than `man`.

_Hint_: Use the `cosine` distance between pairs of embeddings from the `SciPy` package. If an occupation does not exist in the embedding matrix, skip it. Also, remember that the cosine distance is smaller when the embeddings are more similar.


In [None]:
from scipy.spatial.distance import cosine
print('cosine([1,1], [1,1]): ', cosine([1,1], [1,1]))
print('cosine([1,1], [0,1]): ', cosine([1,1], [0,1]))

In [None]:
occupation_list = """technician, accountant, supervisor, engineer, worker, educator, clerk, counselor,
inspector, mechanic, manager, therapist, administrator, salesperson, receptionist, librarian,
advisor, pharmacist, janitor, psychologist, physician, carpenter, nurse, investigator,
bartender, specialist, electrician, officer, pathologist, teacher, lawyer, planner, practitioner,
plumber, instructor, surgeon, veterinarian, paramedic, examiner, chemist, machinist,
appraiser, nutritionist, architect, hairdresser, baker, programmer, paralegal, hygienist,
scientist""".replace('\n', '').replace(' ', '').split(',')

In [None]:
man_embedding = embedding_matrix[word_index['man']]
woman_embedding = embedding_matrix[word_index['woman']]

In [None]:
# first, print the occupations that are closer to `man`, as perceived by GloVe
for occupation in occupation_list:
  ???
print(???)
# second, print the occupations that are closer to `woman`, as perceived by GloVe
for occupation in occupation_list:
  ???
print(???)

Do you see a pattern in the results? Do you think there are biases?

*answer here*

### Q6: Sequence to Sequence Embedding

What is the problem with LSTM models, and why do we need **attention** to fix it? Give an example of what happens with sequence-to-sequence models for translation.

*answer here*