# Assignment 5

## 5.1 Paraphrase Detection with CNNs

Paraphrase detection in Natural Language Processing refers to the task of deciding whether two sentences are paraphrases of each other. For instance, the two sentences "Alice and her husband Bob are on vacation" and "Bob and his spouse Alice are on holidays" would be considered paraphrases of each other.

In this exercise, we devise a CNN to perform the task of paraphrase detection. Here are some key ideas for a possible approach:
- make use of word embeddings (e.g., word2vec or GloVe)
- truncate or pad the input sentences to have a fixed length (e.g., 25)
- compute a similarity matrix based on the word embeddings: the entry in component (i,j) corresponds to the similarity between the i-th word from the first sentence and the j-th word from the second sentence
- use this similarity matrix as an input for a CNN
- the CNN consists of a sequence of convolutional/max-pooling layer groups
- the last layer of the network consists of a single neuron with a sigmoid activation function, given that this is a binary classification problem
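
The similarity-matrix step from the list above can be sketched as follows. This is a minimal illustration, not the implementation used later in this notebook: the function name `similarity_matrix`, its `max_len` parameter, and the toy embeddings are assumptions made for the example (the vectors are invented, not GloVe values). Positions beyond the end of a sentence and out-of-vocabulary words are assumed to stay zero.

```python
import numpy as np

def similarity_matrix(tokens_a, tokens_b, embeddings, max_len=25):
    """Cosine-similarity matrix of shape (max_len, max_len).

    Sentences are truncated to max_len tokens; padding positions and
    out-of-vocabulary words yield all-zero rows or columns.
    """
    dim = len(next(iter(embeddings.values())))

    def embed(tokens):
        # one unit-length vector per position; zero vectors act as padding
        vectors = np.zeros((max_len, dim))
        for pos, token in enumerate(tokens[:max_len]):
            if token in embeddings:
                v = np.asarray(embeddings[token], dtype=float)
                norm = np.linalg.norm(v)
                if norm > 0:
                    vectors[pos] = v / norm
        return vectors

    # dot products of unit vectors are cosine similarities
    return embed(tokens_a) @ embed(tokens_b).T

# toy embeddings, invented for illustration only
toy_embeddings = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
    "vacation": np.array([1.0, 1.0, 0.0]),
    "holidays": np.array([2.0, 2.0, 0.0]),
}

S = similarity_matrix(["alice", "vacation"], ["bob", "holidays"],
                      toy_embeddings, max_len=4)
print(S.shape)
print(S.round(2))
```

Normalizing each word vector once and taking a single matrix product avoids recomputing norms for every word pair, which is one reason to prefer this form over nested loops.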

In [1]:
import numpy as np

# execute this code only once to download and extract the GloVe word embeddings
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip -q glove.6B.zip

path_to_glove_file = "./glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


The data consists only of sentence pairs that are paraphrases of each other, so we first need to construct training and test data that also include negative examples. To this end, we pair each sentence with a randomly chosen sentence from the same slice of the data; the probability that the two are paraphrases should be close to zero. For the positive examples, we use the pairs as provided in the data. For each pair of sentences, we consider up to `sentence_len` words per sentence and compute a similarity matrix whose entry (j,k) is the cosine similarity between the embeddings of the j-th word of the first sentence and the k-th word of the second sentence. Entries that involve a word without an embedding, or a position beyond the end of a sentence, remain zero. When we compute such a similarity matrix for a sentence with itself, we obtain a matrix with ones on the main diagonal.
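
The construction of negative examples can be sketched on a toy scale as follows. The helper `sample_negative_partners` and the example sentences are hypothetical and serve only as illustration. Unlike plain random sampling, this sketch excludes the case where a sentence is paired with its own paraphrase; it does not guard against duplicate or near-duplicate sentences elsewhere in the data.

```python
import random

def sample_negative_partners(num_pairs, seed=0):
    """For each index i, pick a random partner index g with g != i."""
    if num_pairs < 2:
        raise ValueError("need at least two pairs to form negatives")
    rng = random.Random(seed)
    partners = []
    for i in range(num_pairs):
        g = rng.randrange(num_pairs - 1)
        # shift indices >= i up by one so that g can never equal i
        if g >= i:
            g += 1
        partners.append(g)
    return partners

# toy paraphrase pairs, invented for illustration only
toy_pairs = [
    ("alice is on vacation", "alice is on holidays"),
    ("the cat sleeps", "the cat is asleep"),
    ("it rains a lot", "there is much rain"),
]

partners = sample_negative_partners(len(toy_pairs))

# label 1: pairs as provided; label 0: first sentence with a random partner
positives = [(a, b, 1) for a, b in toy_pairs]
negatives = [(toy_pairs[i][0], toy_pairs[g][1], 0)
             for i, g in enumerate(partners)]
examples = positives + negatives
print(len(examples), "examples,", len(negatives), "of them negative")
```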

In [28]:
import numpy as np
import string
import nltk
import pandas as pd
import random

# cosine similarity
def cosine(a,b):
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))

# number of sentence pairs to consider
num_pairs = 10000

# sentence length to consider
sentence_len = 10

# read in sentence-pair data
sentence_pairs = pd.read_csv('../data/pd-train.tsv', sep='\t+', header=None, engine='python')

# determine sentence pairs for training
train_sentence_pairs = sentence_pairs[:num_pairs].copy(deep=True)

# determine sentence pairs for testing (20%)
test_sentence_pairs = sentence_pairs[num_pairs:int(1.2*num_pairs)].copy(deep=True)

def create_data(pairs):
    
    # convert to lower case
    pairs[0] = pairs[0].str.lower()
    pairs[1] = pairs[1].str.lower()

    # remove punctuation
    for c in string.punctuation:
        pairs[0] = pairs[0].str.replace(c, ' ', regex=False)
        pairs[1] = pairs[1].str.replace(c, ' ', regex=False)

    # tokenize sentences    
    pairs[2] = pairs[0].apply(lambda s : nltk.word_tokenize(s))
    pairs[3] = pairs[1].apply(lambda s : nltk.word_tokenize(s))

    # matrices for positive pairs
    positive_pairs = np.zeros((len(pairs), sentence_len, sentence_len))

    for i in range(0, len(pairs)):
        for j in range(0, min(sentence_len, len(pairs[2][pairs.index[i]]))):        
            first_word = pairs[2][pairs.index[i]][j]
            if first_word in embeddings_index.keys():
                for k in range(0, min(sentence_len, len(pairs[3][pairs.index[i]]))):        
                    second_word = pairs[3][pairs.index[i]][k]
                    if second_word in embeddings_index.keys():
                        positive_pairs[i][j][k] = cosine(embeddings_index[first_word], embeddings_index[second_word])

    # matrices for negative pairs
    negative_pairs = np.zeros((len(pairs), sentence_len, sentence_len))

    for i in range(0, len(pairs)):
        g = random.randrange(0, len(pairs))
        for j in range(0, min(sentence_len, len(pairs[2][pairs.index[i]]))):        
            first_word = pairs[2][pairs.index[i]][j]
            if first_word in embeddings_index.keys():
                for k in range(0, min(sentence_len, len(pairs[3][pairs.index[g]]))):        
                    second_word = pairs[3][pairs.index[g]][k]
                    if second_word in embeddings_index.keys():
                        negative_pairs[i][j][k] = cosine(embeddings_index[first_word], embeddings_index[second_word])                    

    # concatenate positive and negative pairs into data matrix
    X = np.concatenate((positive_pairs, negative_pairs), axis=0)

    # target vector: first half are positive pairs, second half are negative pairs
    y = np.ones((2*len(pairs),1))
    y[len(pairs):] = 0    
    
    return (X,y)

(X_train, y_train) = create_data(train_sentence_pairs)
(X_test, y_test) = create_data(test_sentence_pairs)



In [33]:
# turn off tensorflow info and warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 

import numpy as np
from keras import models
from keras import layers
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import Flatten

# describe model architecture
model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', strides=(1,1), padding='same', input_shape=(sentence_len, sentence_len, 1)))
model.add(MaxPooling2D((2, 2), strides=(2,2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# compile model
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# train model
history = model.fit(X_train, y_train, batch_size=64, epochs=30, verbose=1, validation_split=0.2)

# evaluate model
model.evaluate(X_test, y_test)

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_8 (Conv2D)           (None, 10, 10, 64)        640       
                                                                 
 max_pooling2d_8 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 flatten_6 (Flatten)         (None, 1600)              0         
                                                                 
 dense_14 (Dense)            (None, 512)               819712    
                                                                 
 dense_15 (Dense)            (None, 1)                 513       
                                                                 
Total params: 820,865
Trainable params: 820,865
Non-trainable params: 0
________________________________________________

[0.15919962525367737, 0.9632500410079956]

Our model achieves an accuracy of about 96% on our test data. Note, however, that the way we created negative examples makes the task relatively easy. As an additional exercise, it would be interesting to see how well a much simpler model (e.g., one consisting only of dense layers) performs. One could also make the task more challenging by creating harder negative examples, for instance, by permuting or swapping individual words of a sentence instead of pairing it with another random sentence.
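
The idea of harder negative examples could be sketched as follows. The helper `swap_two_words` is hypothetical and not part of the solution above. Whether a swapped sentence really stops being a paraphrase depends on which words are exchanged (swapping "alice" and "bob" in a coordination may leave the meaning intact), so examples generated this way would need to be checked.

```python
import random

def swap_two_words(tokens, rng):
    """Return a copy of tokens in which two different words are swapped.

    Returns None if no swap can change the sentence, e.g. when it has
    fewer than two distinct words.
    """
    candidates = [(i, j)
                  for i in range(len(tokens))
                  for j in range(i + 1, len(tokens))
                  if tokens[i] != tokens[j]]
    if not candidates:
        return None
    i, j = rng.choice(candidates)
    swapped = list(tokens)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

rng = random.Random(0)
tokens = "alice and her husband bob are on vacation".split()
hard_negative = swap_two_words(tokens, rng)
print(" ".join(hard_negative))
```

Because the swapped sentence contains exactly the same words, its similarity matrix against the original has the same entries with two rows exchanged, which should make it much harder to separate from a true paraphrase than a random sentence.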

## 5.2 Hyperparameter Tuning for CNNs

For this exercise, we adapt the example code from the [Hyperas documentation](https://github.com/maxpumperla/hyperas) to work on the Fashion MNIST dataset. We consider the following choices for our model:
- between 1 and 3 groups of convolutional and max-pooling layers
- number of filters per convolutional layer in {16, 32, 64}
- optional dropout layer with dropout probability of 0.5 after each group of layers
- number of neurons in second-to-last dense layer in {256, 512, 1024}

Note that this opens up a combinatorial space of more than 750 model architectures to explore. Hyperas tries to explore this space in a smart manner, and we instruct it to try out 50 of the possible model architectures.
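
The size of the search space can be checked with a short calculation. It assumes, as in the code of this exercise, that the dropout layer after each group is itself a binary choice and that the kernel size, optimizer, and all other settings are fixed.

```python
filter_choices = 3   # {16, 32, 64}
dropout_choices = 2  # dropout after the group, or no dropout
dense_choices = 3    # {256, 512, 1024}

# every group contributes one filter choice and one dropout choice
per_group = filter_choices * dropout_choices

# number of architectures for 1, 2, and 3 groups of layers
counts = {groups: per_group ** groups * dense_choices for groups in (1, 2, 3)}
total = sum(counts.values())

print(counts)  # {1: 18, 2: 108, 3: 648}
print(total)   # 774
```

With `max_evals=50`, the search therefore evaluates only about 6% of the candidate architectures.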

In [1]:
# turn off tensorflow info and warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import numpy as np
from hyperopt import Trials, STATUS_OK, tpe
from keras.datasets import fashion_mnist
from keras.layers import Dense, Dropout, Activation, Conv2D, MaxPooling2D, Flatten
from keras.models import Sequential
from keras.utils import np_utils

from hyperas import optim
from hyperas.distributions import choice, uniform

def data():
    """
    Data providing function:

    This function is separated from create_model() so that hyperopt
    won't reload data for each evaluation run.
    """
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train = x_train.reshape(60000, 28, 28, 1)
    x_test = x_test.reshape(10000, 28, 28 , 1)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    nb_classes = 10
    y_train = np_utils.to_categorical(y_train, nb_classes)
    y_test = np_utils.to_categorical(y_test, nb_classes)
    return x_train, y_train, x_test, y_test


def create_model(x_train, y_train, x_test, y_test):
    """
    Model providing function:

    Create Keras model with double curly brackets dropped-in as needed.
    Return value has to be a valid python dictionary with two customary keys:
        - loss: Specify a numeric evaluation metric to be minimized
        - status: Just use STATUS_OK and see hyperopt documentation if not feasible
    The last one is optional, though recommended, namely:
        - model: specify the model just created so that we can later use it again.
    """
    model = Sequential()
    
    # first group of layers
    model.add(Conv2D({{choice([16, 32, 64])}}, (5, 5), activation='relu', strides=(1,1), padding='same', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2), strides=(2,2)))    
    
    # dropout after first group of layers?
    if {{choice(['dropout1', 'nodropout1'])}} in {'dropout1'}:
        model.add(Dropout(0.5))

    # second group of layers
    if {{choice(['one', 'two', 'three'])}} in {'two', 'three'}:
        model.add(Conv2D({{choice([16, 32, 64])}}, (5, 5), activation='relu', strides=(1,1), padding='same'))
        model.add(MaxPooling2D((2, 2), strides=(2,2)))
        
        # dropout after second group of layers?
        if {{choice(['dropout2', 'nodropout2'])}} in {'dropout2'}:
            model.add(Dropout(0.5))
    
        # third group of layers?
        if {{choice(['two', 'three'])}} in {'three'}:    
            model.add(Conv2D({{choice([16, 32, 64])}}, (5, 5), activation='relu', strides=(1,1), padding='same'))
            model.add(MaxPooling2D((2, 2), strides=(2,2)))        
        
            # dropout after third group of layers?
            if {{choice(['dropout3', 'nodropout3'])}} in {'dropout3'}:
                model.add(Dropout(0.5))

    model.add(Flatten())
    model.add(Dense({{choice([256, 512, 1024])}}, activation='relu'))
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer='sgd')

    result = model.fit(x_train, y_train,
              batch_size=64,
              epochs=2,
              verbose=2,
              validation_split=0.2)
    #get the highest validation accuracy of the training epochs
    validation_acc = np.amax(result.history['val_accuracy']) 
    print('Best validation acc of epoch:', validation_acc)
    return {'loss': -validation_acc, 'status': STATUS_OK, 'model': model}


if __name__ == '__main__':
    best_run, best_model = optim.minimize(model=create_model,
                                          data=data,
                                          algo=tpe.suggest,
                                          notebook_name='2022-vl-dl-assignment5',
                                          max_evals=50,
                                          trials=Trials())
    X_train, Y_train, X_test, Y_test = data()
    print("Evaluation of best performing model:")
    print(best_model.evaluate(X_test, Y_test))
    print("Architecture of best performing model:")
    best_model.summary()

>>> Imports:
#coding=utf-8

try:
    import os
except:
    pass

try:
    import numpy as np
except:
    pass

try:
    from hyperopt import Trials, STATUS_OK, tpe
except:
    pass

try:
    from keras.datasets import fashion_mnist
except:
    pass

try:
    from keras.layers import Dense, Dropout, Activation, Conv2D, MaxPooling2D, Flatten
except:
    pass

try:
    from keras.models import Sequential
except:
    pass

try:
    from keras.utils import np_utils
except:
    pass

try:
    from hyperas import optim
except:
    pass

try:
    from hyperas.distributions import choice, uniform
except:
    pass

>>> Hyperas search space:

def get_space():
    return {
        'Conv2D': hp.choice('Conv2D', [16, 32, 64]),
        'Conv2D_1': hp.choice('Conv2D_1', ['dropout1', 'nodropout1']),
        'Conv2D_2': hp.choice('Conv2D_2', ['one', 'two', 'three']),
        'Conv2D_3': hp.choice('Conv2D_3', [16, 32, 64]),
        'Conv2D_4': hp.choice('Conv2D_4', ['dropout2', 'nodropout2']),
        'C

## 5.3 Biases in CNNs

In this exercise, we look into a relatively recent paper from ICLR, namely:
- **R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Brendel**: *ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness*, ICLR 2019

**(a) What is texture bias? Define it in your own words.**

CNNs are commonly believed to learn more and more complex representations of objects by piecing together shape information. However, as the paper demonstrates, this is not the case for several well-known architectures when trained on ImageNet. When images are modified in one of several ways (e.g., only retaining the outline of the image or replacing the texture of the image), all architectures perform quite poorly. This hints at CNNs relying mostly on texture information to classify objects. Humans, in contrast, perform substantially better, even when only presented with shape information.

**(b) What are the tasks that the authors give to humans and CNNs in their experiments?**

Human study participants and the different CNNs are presented with modified versions of the images to see how much they depend on different aspects of the images (e.g., shape vs. texture). In the so-called cue-conflict experiments, humans are explicitly told to rely on either the shape or the texture of the image presented. It becomes clear that humans rely heavily on shape information whereas the CNNs considered rely mostly on texture information.
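
One way to summarize the outcome of such cue-conflict trials is the fraction of shape decisions among all trials in which either the shape category or the texture category was chosen. The sketch below illustrates this under that assumption; the function `shape_bias`, the tuple layout, and the toy decisions are invented for the example and are not taken from the authors' code or data.

```python
def shape_bias(decisions):
    """Fraction of shape decisions among shape-or-texture decisions.

    decisions: iterable of (predicted, shape_label, texture_label).
    Trials where the prediction matches neither cue are ignored.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in decisions:
        if shape_label == texture_label:
            continue  # no conflict between the two cues
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# toy cue-conflict trials: (prediction, shape category, texture category)
toy_decisions = [
    ("cat", "cat", "elephant"),        # decided by shape
    ("elephant", "cat", "elephant"),   # decided by texture
    ("elephant", "bear", "elephant"),  # decided by texture
    ("dog", "car", "clock"),           # matches neither cue, ignored
]

print(round(shape_bias(toy_decisions), 3))  # 0.333
```

A value near 1 would indicate decisions driven mostly by shape, a value near 0 decisions driven mostly by texture.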

**(c) How do the authors propose to overcome the texture bias in CNNs?**

The authors construct a modified version of the ImageNet dataset named Stylized-ImageNet. The dataset replaces the texture of each original image by using artistic images as new textures. The authors use different combinations of the original ImageNet and their novel dataset for training, testing, or retraining. They show that a CNN (ResNet) trained on Stylized-ImageNet and fine-tuned on ImageNet can outperform the original ResNet trained only on ImageNet. Beyond that, the authors also show that the network trained on both datasets (i) works better on another dataset (Pascal VOC), which was not used for training, and (ii) is more robust to different kinds of perturbations of the images.


