# Assignment 1

**Due date**: 23/12/2021 (dd/mm/yyyy)

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Part-of-Speech (POS) tagging as Sequence Labelling using Recurrent Neural Architectures

# Intro

In this assignment we will ask you to perform POS tagging using neural architectures.

You are asked to follow these steps:
*   Download the corpus and split it into training, validation and test sets, structuring a dataframe.
*   Embed the words using GloVe embeddings.
*   Create a baseline model, using a simple neural architecture.
*   Experiment with small modifications to the baseline model, choosing hyperparameters using the validation set.
*   Evaluate your two best models.
*   Analyze the errors of your models.


**Task**: given a corpus of documents, predict the POS tag for each word

**Corpus**:
Ignore the numeric value in the third column; use only the words/symbols and their labels.
The corpus is available at:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip

**Splits**: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.
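
As an illustration of the split rule above, here is a minimal sketch that maps a document number to its split. The helper names `split_of` and `file_name` are hypothetical; the zero-padded file naming follows the loading code used later in this notebook.

```python
def split_of(doc_id):
    """Return the split name for a document number, following the ranges stated above."""
    if 1 <= doc_id <= 100:
        return "train"
    if 101 <= doc_id <= 150:
        return "validation"
    if 151 <= doc_id <= 199:
        return "test"
    raise ValueError("unexpected document number: {}".format(doc_id))


def file_name(doc_id):
    # file names are zero-padded to four digits: wsj_0001.dp ... wsj_0199.dp
    return "wsj_{:04d}.dp".format(doc_id)


# count how many documents fall in each split
counts = {}
for doc_id in range(1, 200):
    name = split_of(doc_id)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

Keeping the boundaries in one function makes it harder to introduce an off-by-one error when the documents are read in a loop.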


**Features**: you MUST use GloVe embeddings as the only input features to the model.

**Splitting**: you can decide to split documents into sentences or not, the choice is yours.

**I/O structure**: The input data will have three dimensions: (1) documents/sentences, (2) tokens, (3) features. For the output there are two possibilities: if you use one-hot encoding it will be (1) documents/sentences, (2) token labels, (3) classes; if you use a single integer that indicates the index of the class it will be (1) documents/sentences, (2) token labels.
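
The two output layouts can be illustrated with a toy example. This is only a sketch: the tag set, the `<pad>` entry and the right-padding are assumptions made for the illustration, not requirements of the assignment.

```python
# hypothetical tag inventory; index 0 is reserved for padding
tag_to_id = {"<pad>": 0, "DT": 1, "NN": 2, "VBZ": 3}
tags = [["DT", "NN", "VBZ"], ["DT", "NN"]]
max_len = 3


def pad(seq, length, value):
    # right-pad seq with value up to the requested length
    return seq + [value] * (length - len(seq))


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


# integer encoding: (sentences, tokens)
y_int = [pad([tag_to_id[t] for t in sent], max_len, 0) for sent in tags]

# one-hot encoding: (sentences, tokens, classes)
y_onehot = [[one_hot(t, len(tag_to_id)) for t in sent] for sent in y_int]

print(len(y_int), len(y_int[0]))
print(len(y_onehot), len(y_onehot[0]), len(y_onehot[0][0]))
```

In Keras the integer layout is typically paired with a sparse categorical cross-entropy loss, and the one-hot layout with the categorical one.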

**Baseline**: a two-layer architecture: a Bidirectional LSTM layer and a Dense/Fully-Connected layer on top; the choice of hyper-parameters is yours.

**Architectures**: experiment using a GRU instead of the LSTM, adding an additional LSTM layer, and adding an additional dense layer; do not mix these variations.


**Training and Experiments**: all the experiments must involve only the training and validation sets.

**Evaluation**: in the end, only the two best models of your choice (according to the validation set) must be evaluated on the test set. The main metric must be F1-Macro computed over the various parts of speech. DO NOT CONSIDER THE PUNCTUATION CLASSES.

**Metrics**: the metric you must use to evaluate your final model is the F1-macro, WITHOUT considering punctuation/symbol classes. During the training process you can use accuracy: the F1 metric cannot be used there unless you use a single (gigantic) batch, because there is no way to aggregate "partial" F1 scores computed on mini-batches.
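
A minimal sketch of the metric, computed once over all the tokens of a split rather than per mini-batch. The function name `macro_f1` and the choice of which tags count as punctuation are assumptions of this example; a library implementation that accepts an explicit list of labels can be used instead.

```python
def macro_f1(y_true, y_pred, ignored):
    """Average the per-class F1 over every class that is not in `ignored`."""
    labels = sorted((set(y_true) | set(y_pred)) - set(ignored))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # a class with no true or predicted instances contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# flattened gold and predicted tags for a toy evaluation set
y_true = ["DT", "NN", "VBZ", ".", "NN"]
y_pred = ["DT", "NN", "NN", ".", "VBZ"]
score = macro_f1(y_true, y_pred, ignored={".", ","})
print(score)
```

Here DT scores 1.0, NN scores 0.5 and VBZ scores 0.0, so the macro average is 0.5; the `.` class is left out of the average.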

**Discussion and Error Analysis**: verify and discuss whether the results on the test set are consistent with those on the validation set; analyze the errors made by your model, try to understand their possible causes, and think about how to improve it.

**Report**: you are asked to deliver the code of your experiments and a small pdf report of about 2 pages; the pdf must begin with the names of the people of your team and a small abstract (4-5 lines) that sums up your findings.

# Out Of Vocabulary (OOV) terms

How to handle words that are not in GloVe vocabulary?
You can handle them as you want (random embedding, placeholder, whatever!), but they must be STATIC embeddings (you cannot train them).

But there is a very important caveat! As usual, the elements of the test set must not influence the elements of the other splits!

So, when you compute new embeddings for train+validation, you must forget about test documents.
The motivation is to emulate a real-world scenario, where you select and train a model in the first stage, without knowing anything about the testing environment.

For implementation convenience, you CAN use a single vocabulary file/matrix/whatever. The principle of the previous point is that the embeddings inside that file/matrix must be generated independently for train and test splits.

Basically in a real-world scenario, this is what would happen:
1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Training of the model(s)
5. Compute embeddings for terms OOV2 of the validation split 
6. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
7. Validation of the model(s)
8. Compute embeddings for terms OOV3 of the test split 
9. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
10. Testing of the final model

In this case, where we already have all the documents, we can simplify the process a bit, but the procedure must remain rigorous.

1. Starting vocabulary V1 (in this assignment, GloVe vocabulary)
2. Compute embeddings for terms out of vocabulary V1 (OOV1) of the training split 
3. Add embeddings to the vocabulary, so to obtain vocabulary V2=V1+OOV1
4. Compute embeddings for terms OOV2 of the validation split 
5. Add embeddings to the vocabulary, so to obtain vocabulary V3=V1+OOV1+OOV2
6. Compute embeddings for terms OOV3 of the test split 
7. Add embeddings to the vocabulary, so to obtain vocabulary V4=V1+OOV1+OOV2+OOV3
8. Training of the model(s)
9. Validation of the model(s)
10. Testing of the final model

Step 2 and step 6 must be completely independent of each other, as far as the method and the documents are concerned. But they can rely on the previous vocabulary (V1 for step 2 and V3 for step 6).
THEREFORE, if a word is present in both the training split and the test split but not in the starting vocabulary, its embedding is computed in step 2 and it is no longer considered OOV in step 6.
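
The simplified procedure can be sketched as follows. This is only an illustration: the helper `extend_vocabulary`, the toy words, the uniform range and the seeds are assumptions, and any other static OOV strategy fits the same structure.

```python
import random


def extend_vocabulary(vocab, words, dim, rng):
    """Return a copy of vocab with a static random vector for every word not already in it."""
    extended = dict(vocab)
    for word in sorted(set(words)):
        if word not in extended:
            extended[word] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    return extended


dim = 3
V1 = {"the": [0.1, 0.2, 0.3]}  # stand-in for the GloVe vocabulary
train_words = ["the", "Pierre", "Vinken"]
val_words = ["the", "Vinken", "Elsevier"]
test_words = ["Pierre", "Agnew"]

# each step only sees its own split and the vocabulary built so far
V2 = extend_vocabulary(V1, train_words, dim, random.Random(1))
V3 = extend_vocabulary(V2, val_words, dim, random.Random(2))
V4 = extend_vocabulary(V3, test_words, dim, random.Random(3))

print(sorted(set(V4) - set(V3)))
```

"Pierre" occurs in both the training and the test split, so its vector is created when V2 is built and is left untouched when V4 is built; only "Agnew" is new in the last step.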

# Report
The report must not be just a copy and paste of graphs and tables!

The report must not be longer than 2 pages and must contain:
* The names of the members of your team
* A short abstract (4-5 lines) that sums up everything
* A general description of the task you have addressed and how you have addressed it
* A short description of the models you have used
* Some tables that sum up your findings in validation and test and a discussion of those results
* The most relevant findings of your error analysis

# Evaluation Criterion

The goal of this assignment is not to prove you can find the best model ever, but to face a common task, structure it correctly, and follow a correct and rigorous experimental procedure.
In other words, we don't care if your final models are awful as long as you have followed the correct procedure and written a decent report.

The score of the assignment will be computed roughly as follows
* 1 point for the general setting of the problem
* 1 point for the handling of OOV terms
* 1 point for the models
* 1 point for train-validation-test procedure
* 2 points for the discussion of the results, error analysis, and report

This distribution of scores is tentative and we may decide to alter it at any moment.
We also reserve the right to assign a small bonus (0.5 points) to any assignment that is particularly worthy. Similarly, in case of grave errors, we may decide to assign an equivalent malus (-0.5 points).

# Contacts

If you have any doubt, question, or issue, or need help, we highly recommend that you check the [course useful material](https://virtuale.unibo.it/pluginfile.php/1036039/mod_resource/content/2/NLP_Course_Useful_Material.pdf) for additional information, and use the Virtuale forums to discuss with other students.

You can always contact us at the following email addresses. To increase the probability of a prompt response, we recommend that you write to both teaching assistants.

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it


# FAQ
* You can use a non-trainable Embedding layer to load the GloVe embeddings
* You can use any library of your choice to implement the networks. Two options are TensorFlow/Keras or PyTorch. Both these libraries have all the classes you need to implement these simple architectures and there are plenty of tutorials around, where you can learn how to use them.

### Preliminary operations

In [None]:
#import libraries

# download the corpus if the archive is not already in the working directory
!wget -nc https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip
!unzip -o dependency_treebank.zip
!pip install simplejson
import simplejson as sj
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict 
import zipfile
import os
import urllib.request
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import *
from keras.utils import to_categorical
from keras.initializers import Constant
from sklearn.manifold import TSNE
import random
from tqdm import tqdm #for progress bar
from scipy import spatial

### Importing Data
We import the data from dependency_treebank and save it in a DataFrame with two columns: "word", which collects each word, and "class_id", which collects each tag.
Then we show the dataset.

In [None]:
list_word = []
list_class = []
train_ind: int
val_ind: int
for i in tqdm(range(1, 200)):
    # file names are zero-padded to four digits: wsj_0001.dp ... wsj_0199.dp
    df = pd.read_table('dependency_treebank/wsj_{:04d}.dp'.format(i), header = None)
    list_word.extend(df[0])
    list_class.extend(df[1])
    # documents 1-100 are the train set, 101-150 the validation set
    if i==100:
        train_ind = len(list_word)
    elif i==150:
        val_ind = len(list_word)

data = list(zip(list_word,list_class))
df = pd.DataFrame(data, columns =['word', 'class_id'])

print(df.head(10))

In [None]:
df_train = df[0:train_ind]
df_val = df[train_ind:val_ind]
df_test = df[val_ind::]

In [None]:
df_val.head()

## Preprocessing
We decided to lowercase every word that begins a sentence and is not a proper noun, to avoid duplicates in the vocabulary (e.g. "the" and "The"). Then we remove punctuation from the dataset, as specified before.

In [None]:
#setting lowercase
def df_phrases_cleaned(df, label1, label2):
    # label1 is the tag column, label2 is the word column
    df = df.reset_index(drop=True)
    for i in range(0, df.shape[0]-1):
        # lowercase the word that follows a sentence-final '.', unless it is tagged as a proper noun
        if df.loc[i, label1] == '.' and df.loc[i+1, label1] != 'NNP':
            df.loc[i+1, label2] = str(df.loc[i+1, label2]).lower()
    #punctuation removal ('.' is kept for now because it delimits the phrases)
    df = df[~df[label1].isin([',', '``', "''", ':', '$', '#'])]
    df = df.reset_index(drop=True)

    #phrase division
    phrase = []
    phrases = []
    class_phrase = []
    class_phrases = []
    for word, tag in tqdm(zip(df[label2], df[label1]), total=len(df)):
        if word == '.':
            # close the current phrase; the '.' delimiter itself is not stored
            class_phrases.append(class_phrase)
            phrases.append(phrase)
            class_phrase = []
            phrase = []
        else:
            class_phrase.append(tag)
            phrase.append(word)

    # one row per phrase: the list of words and the list of their tags
    data = list(zip(phrases, class_phrases))
    df_phrases = pd.DataFrame(data, columns =['phrase', 'phrase_class_id'])
    return df_phrases



In [None]:
df_phrases_train = df_phrases_cleaned(df_train, 'class_id', 'word')
df_phrases_val = df_phrases_cleaned(df_val, 'class_id', 'word') 
df_phrases_test = df_phrases_cleaned(df_test, 'class_id', 'word') 

### Vocabulary creation

In [None]:
#We create the vocabulary
word_listing = np.sort(df['word'].unique())
word_to_idx = OrderedDict(zip(word_listing,range(0,len(word_listing)))) 
idx_to_word = OrderedDict(zip(range(0,len(word_listing)), word_listing)) 


In [None]:
#Saving the vocabulary in vocab.json file
vocab_path = os.path.join(os.getcwd(), 'vocab.json')

print("Saving vocabulary to {}".format(vocab_path))
with open(vocab_path, mode='w') as f:
    sj.dump(word_to_idx, f, indent=4)
print("Saving completed!")

We divide the Dataset in training set, validation set and test set

## GloVe
We start by downloading the GloVe embeddings.

In [None]:
url = "https://nlp.stanford.edu/data/glove.6B.zip"

glove_path = os.path.join(os.getcwd(),"Glove")
glove_zip = os.path.join(os.getcwd(),"Glove", "glove.6B.zip")

if not os.path.exists(glove_path):
    os.makedirs(glove_path)

if not os.path.exists(glove_zip):
    urllib.request.urlretrieve(url, glove_zip)
    print("Successful download")

with zipfile.ZipFile(glove_zip, 'r') as zip_ref:
    zip_ref.extractall(path=glove_path)
    print("Successful extraction")

Now we load the GloVe word embeddings into a dictionary that maps each word to its vector.

In [None]:
num_features=50
V1 = {}
with open(glove_path + '/glove.6B.{}d.txt'.format(num_features),'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:],'float32')
        V1[word]=vector
        
print("Found %s word vectors." % len(V1))
dict(list(V1.items())[:3])

Now we look up the words most similar to a query word; we use the Euclidean distance to measure how far apart two word vectors are.

In [None]:
def find_similar_word(embedes):
    nearest = sorted(V1.keys(), key=lambda word: spatial.distance.euclidean(V1[word], embedes))
    return nearest

In [None]:
find_similar_word(V1['girl'])[0:10]
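
Sorting the whole vocabulary is more work than needed when only the first few neighbours are shown. A minimal alternative sketch, assuming only the top `k` words are required; `nearest_words` and the toy vectors are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math


def nearest_words(embeddings, query, k=3):
    """Return the k words whose vectors are closest to query in Euclidean distance."""
    return heapq.nsmallest(k, embeddings, key=lambda w: math.dist(embeddings[w], query))


# toy two-dimensional vectors standing in for the GloVe dictionary
toy = {"girl": [1.0, 1.0], "boy": [1.0, 0.9], "car": [5.0, 5.0], "woman": [1.2, 1.1]}
neighbours = nearest_words(toy, toy["girl"], k=3)
print(neighbours)
```

The query word itself comes first because its distance is zero, exactly as with the full sort.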

To visualize the vectors, we use t-distributed stochastic neighbor embedding (t-SNE), a method for reducing the dimensionality of data.

In [None]:
distri = TSNE(n_components=2)
# take the same slice of words and vectors, so that each point gets its own label
words = list(V1.keys())[700:900]
vectors = np.array([V1[word] for word in words])
points = distri.fit_transform(vectors)
plt.figure(figsize=(14,8))
plt.scatter(points[:, 0], points[:, 1])
for label, x, y in zip(words, points[:, 0], points[:, 1]):
    plt.annotate(label,xy=(x,y),xytext=(0,0),textcoords='offset points')
plt.show()

In [None]:
set_difference = set(list(df_train['word'].unique())).difference(set(list(V1)))
OOV1 = list(set_difference)
print(len(OOV1))

In [None]:
embed_dict_train = {}
common_train_glove = set(df_train['word'].unique()).intersection(V1.keys())
for word in tqdm(common_train_glove):
    embed_dict_train[word] = V1[word]

In [None]:
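#OOV1= OOV_terms(df_train['word'],V1[word])
# np.random.uniform(low, high=1.0, size): the randomly chosen GloVe vector is passed as `low`,
# so each component of an OOV vector is drawn between that vector's component and 1.0
print(len(np.random.uniform(random.choice(list(embed_dict_train.values())),size=num_features)))
embed_random = dict(zip(OOV1,np.random.uniform(random.choice(list(embed_dict_train.values())),size=(len(OOV1),num_features))))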

In [None]:
#print(embed_random["Eric"])
#print(np.random.uniform(random.choice(list(embed_dict_train.values())),size=(num_features,2)))
train_embedding={**embed_dict_train,**embed_random}
print(train_embedding["Eric"])

In [None]:
def OOV_terms(embedding_model, word_listing):
    embedding_vocabulary = set(embedding_model.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)
OOV_train = OOV_terms(V1, np.sort(df_train['word'].unique()))
print("Total OOV terms: {0} ({1:.2f}%)".format(len(OOV_train), float(len(OOV_train))*100 / len(df_train['word'].unique())))

In [None]:
find_similar_word(embed_dict_train['girl'])[0:10]

In [None]:
distri = TSNE(n_components=2)
# take the same slice of words and vectors, so that each point gets its own label
words = list(embed_dict_train.keys())[500:700]
vectors = np.array([embed_dict_train[word] for word in words])
points = distri.fit_transform(vectors)
plt.figure(figsize=(14,8))
plt.scatter(points[:, 0], points[:, 1])
for label, x, y in zip(words, points[:, 0], points[:, 1]):
    plt.annotate(label,xy=(x,y),xytext=(0,0),textcoords='offset points')
plt.show()

In [None]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts(df_phrases_train['phrase'].values)
tokenizer2=Tokenizer()
tokenizer2.fit_on_texts(df_phrases_train['phrase_class_id'].values)


# this takes our sentences and replaces each word with an integer
X_train = tokenizer.texts_to_sequences(df_phrases_train['phrase'].values)
y_train = tokenizer2.texts_to_sequences(df_phrases_train['phrase_class_id'].values)
#####
X_val = tokenizer.texts_to_sequences(df_phrases_val['phrase'].values)
y_val = tokenizer2.texts_to_sequences(df_phrases_val['phrase_class_id'].values)
####
X_test = tokenizer.texts_to_sequences(df_phrases_test['phrase'].values)
y_test = tokenizer2.texts_to_sequences(df_phrases_test['phrase_class_id'].values)

In [None]:
#PADDING pad_sequences for train...
X_train = pad_sequences(X_train, maxlen=75)#TODO: compute the maximum length instead of hard-coding 75
y_train = pad_sequences(y_train, maxlen=75)
#... for validation...
X_val = pad_sequences(X_val, maxlen=75)
y_val = pad_sequences(y_val, maxlen=75)
#...for test!
X_test = pad_sequences(X_test, maxlen=75)
y_test = pad_sequences(y_test, maxlen=75)
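
The padding length is hard-coded in the cell above. A minimal sketch of how it could be derived from the data, together with a pure-Python stand-in for the default behaviour of `pad_sequences` (zeros added on the left, long sequences truncated on the left); `pad_pre` and the toy sequences are hypothetical.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) seq to maxlen, like pad_sequences with its default 'pre' settings."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


sequences = [[4, 7, 2], [1, 2, 3, 4, 5, 6], [9]]

# length of the longest sequence, to be computed on the training split only
longest = max(len(s) for s in sequences)
padded = [pad_pre(s, longest) for s in sequences]

print(longest)
print(padded)
```

Because the zeros are added on the left, the first positions of most padded sequences hold the padding index 0 rather than a real token or tag.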

In [None]:
#print(train_embedding.values())
mat = train_embedding.keys()
print(len(mat))
print(list(mat)[10])

In [None]:

print(X_test.shape)
print(y_test.shape)

In [None]:
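from keras.utils import to_categorical
# fix the number of classes so that every split gets the same one-hot width (index 0 is padding)
num_classes = len(tokenizer2.word_index) + 1
#X_train = to_categorical(X_train)
y_train = to_categorical(y_train, num_classes=num_classes)
y_val = to_categorical(y_val, num_classes=num_classes)
y_test = to_categorical(y_test, num_classes=num_classes)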


In [None]:
print(len(train_embedding.keys()))

In [None]:
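num_words = len(tokenizer.word_index) + 1  # index 0 is reserved for padding
embedding_dim = num_features
sequence_length = 75
# Row i of the matrix must hold the vector of the word that the tokenizer maps to index i.
# The Tokenizer lowercases its input by default, so a word missing from train_embedding
# is looked up in GloVe (V1) before falling back to a random static vector
# (the uniform range used for the fallback is an arbitrary choice).
embedding_matrix = np.zeros((num_words, embedding_dim), dtype='float32')
for word, idx in tokenizer.word_index.items():
    vector = train_embedding.get(word, V1.get(word))
    if vector is None:
        vector = np.random.uniform(-0.5, 0.5, size=embedding_dim)
    embedding_matrix[idx] = vector
model = Sequential()
model.add(Embedding(num_words,
                    embedding_dim,
                    embeddings_initializer=Constant(embedding_matrix),
                    input_length=sequence_length,
                    trainable=False))
model.add(SpatialDropout1D(0.2))
# LSTM replaces CuDNNLSTM, which is not available in TensorFlow 2
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.25))
model.add(Dense(units=40, activation='softmax', use_bias=False))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
batch_size = 128
print(model.summary())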

In [None]:
print(y_train.shape)
print(X_train.shape)
y_train = y_train[:, 0, :]
print(y_train.shape)
print(X_train.shape)
y_val = y_val[:, 0, :]
print(y_train.shape)
print(y_val.shape)
y_test = y_test[:, 0, :]


In [None]:
print(y_test.shape)

In [None]:
history = model.fit(X_train, y_train, epochs=10, batch_size=batch_size, verbose=1)

In [None]:
preds = np.rint(model.predict([X_train], batch_size=batch_size, verbose=1)).astype('int')

In [None]:
print(preds[70])

In [None]:
# report every prediction that differs from the previous one
e0 = preds[0]
for e in preds:
  if not np.array_equal(e, e0):
    print('diff')
  e0 = e
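
Rounding a softmax output gives an all-zero vector whenever no class has a probability above 0.5, so predictions are usually decoded by taking the most probable class instead. A minimal sketch, assuming the model returns one probability vector per token and that index 0 is the padding class; `decode`, `id_to_tag` and the toy numbers are hypothetical.

```python
def decode(probabilities, gold_ids, id_to_tag, pad_id=0):
    """Turn per-token class probabilities into tag names, skipping padded positions."""
    decoded = []
    for sent_probs, sent_gold in zip(probabilities, gold_ids):
        tags = []
        for token_probs, gold_id in zip(sent_probs, sent_gold):
            if gold_id == pad_id:
                continue  # padded position: nothing to predict
            # most probable class, never the padding class
            best = max(range(1, len(token_probs)), key=token_probs.__getitem__)
            tags.append(id_to_tag[best])
        decoded.append(tags)
    return decoded


id_to_tag = {1: "dt", 2: "nn", 3: "vbz"}
probs = [[[0.9, 0.05, 0.03, 0.02], [0.1, 0.6, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]]
gold = [[0, 1, 3]]

predicted = decode(probs, gold, id_to_tag)
rounded_last = [round(p) for p in probs[0][2]]
print(predicted)
print(rounded_last)
```

The last token is decoded as "vbz" by the argmax, while rounding the same vector yields only zeros.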