# Assignment 3 - Transformers 

To make training more efficient, the original Jupyter notebook was split up and converted into a Python script. The model was trained in a DataCrunch cloud environment on an A6000 GPU.

The repo for this model is [here](https://github.com/sbarrios93/35200-Deep-Learning-Systems/tree/main/hw-3).

The checkpoint files are not included because they are too large for the GitHub repo.

## Task 1: run the baseline model, understand what it's doing and make sure it works



- Baseline Accuracy:

In [33]:
import json

with open('metrics/epoch-base.json', 'r') as f:
    base = json.load(f)

print(f"{base['10']['accuracies'][-1]:.2%}")

37.16%


- Baseline Loss:


In [13]:
print(base['10']['losses'][-1])

0.7070708870887756


- How long did it take to train? (use `%%time` to measure)

In [28]:
total_time = 0
for k in base['10']['epoch_times'].keys():
    total_time += float(k)
print(f"{total_time / 60:.3} min")


37.6 min


## Task 2: Experiment with changing the following:

For each experiment where you vary hyperparameters use **at least three different values**.

Explain the results you get **in terms of your understanding of transformers**.


In [101]:
import yaml
import pandas as pd

with open('conf.yaml', 'r') as f:
    config = yaml.load(f, Loader=yaml.SafeLoader)

models = pd.DataFrame(config['MODELS'])
models

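| Parameter | base | dim-256 | dim-1024 | dim-2048 | depth-2 | depth-6 | depth-10 | heads-2 | heads-4 | heads-16 | vocab | skip-positional | gelu | swish | ger |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAX_VOCAB_SIZE | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 16384 | 8192 | 16384 | 16384 | 16384 | 16384 |
| D_MODEL | 512 | 256 | 1024 | 2048 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 |
| N_LAYERS | 4 | 4 | 4 | 4 | 2 | 6 | 10 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| FFN_UNITS | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 | 512 |
| N_HEADS | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 2 | 4 | 16 | 8 | 8 | 8 | 8 | 8 |
| DROPOUT_RATE | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| ACTIVATION | relu | relu | relu | relu | relu | relu | relu | relu | relu | relu | relu | relu | gelu | swish | relu |
| USE_POSITIONAL | True | True | True | True | True | True | True | True | True | True | True | False | True | True | True |
| NUM_SAMPLES | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 | 80000 |
| MAX_LENGTH | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |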
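
As a side note on the `heads-*` experiments: in standard multi-head attention, `D_MODEL` is split evenly across the heads, so changing `N_HEADS` changes how many dimensions each head works with rather than the overall model width. The sketch below assumes the training script follows that convention (I have not restated the repo code here), and `head_dim` is a hypothetical helper name.

```python
# Hypothetical helper: assumes D_MODEL is split evenly across heads,
# as in standard multi-head attention. The repo's code may differ.
D_MODEL = 512  # baseline value from the configuration table


def head_dim(d_model: int, n_heads: int) -> int:
    """Return the number of dimensions each attention head receives."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


# N_HEADS values used in the heads-2, heads-4, base and heads-16 models.
head_dims = {n: head_dim(D_MODEL, n) for n in (2, 4, 8, 16)}

for n_heads, dim in head_dims.items():
    print(f"N_HEADS={n_heads:>2} -> {dim} dimensions per head")
```

Under this assumption, the total width handled by the attention layer is the same in all four configurations.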


In [114]:
import glob 

metrics = glob.glob('metrics/epoch-*.json')

metric_dict = {}

for path in metrics:
    total_time = 0 

    with open(path, 'r') as f:
        data = json.load(f)

        name = path.split('epoch-')[-1].split('.json')[0]
        accuracies = '{:.2%}'.format(data['10']['accuracies'][-1])
        losses = data['10']['losses'][-1]
        for k in data['10']['epoch_times'].keys():
            total_time += float(k)
        total_time = '{:.3}'.format(total_time/60)
    metric_dict[name] = [accuracies, losses, total_time]
        
metrics_df = pd.DataFrame(metric_dict, index=['accuracies', 'losses', 'total_time_in_min'])
metrics_df = metrics_df[models.columns.tolist()]
metrics_df

| Metric | base | dim-256 | dim-1024 | dim-2048 | depth-2 | depth-6 | depth-10 | heads-2 | heads-4 | heads-16 | vocab | skip-positional | gelu | swish | ger |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracies | 37.16% | 37.54% | 35.61% | 31.70% | 40.26% | 34.22% | 20.30% | 36.30% | 36.71% | 37.31% | 39.78% | 37.10% | 37.39% | 37.55% | 36.30% |
| losses | 0.707071 | 0.712541 | 0.788699 | 1.016393 | 0.515611 | 0.881567 | 1.668171 | 0.763804 | 0.737258 | 0.698448 | 0.597807 | 0.713013 | 0.692861 | 0.681648 | 0.536454 |
| total_time_in_min | 37.6 | 38.0 | 39.0 | 38.3 | 21.0 | 53.1 | 86.9 | 37.2 | 37.4 | 38.8 | 38.7 | 37.7 | 39.2 | 38.6 | 38.1 |


Modifying the values from the baseline model brought little benefit to accuracy or loss, and made little difference to training time. Depth was the largest driver of variability in both accuracy and training time. Decreasing the depth of the model to 2 layers increased accuracy by 3.1 percentage points (from 37.16% to 40.26%) and decreased training time by 16.6 minutes. In contrast, increasing the depth all the way to 10 layers strongly hurt accuracy (dropping it to 20.30%) and increased training time by 49.3 minutes.
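
The differences relative to the baseline can be recomputed directly from the metrics table. The snippet below is a small standalone sketch: the numbers are copied by hand from the table rather than read from the `metrics/` JSON files, and `delta_vs_base` is a hypothetical helper name.

```python
# Values copied from the metrics table: (accuracy in %, training time in minutes).
results = {
    "base": (37.16, 37.6),
    "depth-2": (40.26, 21.0),
    "depth-6": (34.22, 53.1),
    "depth-10": (20.30, 86.9),
}


def delta_vs_base(name):
    """Return (accuracy change in percentage points, time change in minutes)."""
    acc, minutes = results[name]
    base_acc, base_minutes = results["base"]
    return round(acc - base_acc, 2), round(minutes - base_minutes, 1)


for name in ("depth-2", "depth-6", "depth-10"):
    d_acc, d_min = delta_vs_base(name)
    print(f"{name}: {d_acc:+.2f} pp accuracy, {d_min:+.1f} min")
```

Reporting the accuracy change in percentage points (rather than as a relative percentage) keeps it directly comparable with the values in the table.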

I wonder whether training on a larger dataset with a larger vocabulary would help.

Furthermore, skipping the positional encoding didn't do much either: the skip-positional model has almost the same metrics as the baseline. GELU and Swish also made little difference.
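
For reference, this is roughly what the `USE_POSITIONAL` flag switches off. The sketch below implements the sinusoidal encoding from "Attention Is All You Need" in plain Python, using the baseline `MAX_LENGTH` and `D_MODEL` values. It is an illustration only: the function name is hypothetical, and the training script may build the encoding differently (for example with TensorFlow ops, or by concatenating the sine and cosine halves instead of interleaving them).

```python
import math


def positional_encoding(max_length, d_model):
    """Sinusoidal positional encoding (interleaved sin/cos layout).

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(max_length):
        row = []
        for j in range(d_model):
            # Each sin/cos pair of columns shares the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Baseline values from the configuration table.
pe = positional_encoding(max_length=15, d_model=512)
print(len(pe), len(pe[0]))  # 15 512
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0]
```

Each row is added to the token embedding at that position, which is the only way the attention layers can tell word order apart.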

## Task 3: compare ReLU against GELU & Swish
Compare the baseline trained with ReLU to that using GELU as described in this paper ([arXiv:1606.08415v4](https://arxiv.org/pdf/1606.08415v4.pdf)) & Swish described in this paper ([arXiv:1710.05941v1](https://arxiv.org/pdf/1710.05941v1.pdf)). Note that GELU (Gaussian Error Linear Units) is what GPT-2 and GPT-J use instead of ReLU. See if you get better convergence or better loss during training and if any of the translations in your tests have changed. Both GELU and Swish are provided by TensorFlow.
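
To make the comparison concrete, here is a minimal scalar sketch of the three activations, written with only the standard library. It is not the code used for training (the models rely on the implementations TensorFlow provides); the GELU below is the exact form `x * Phi(x)`, and the Swish uses a fixed `beta = 1`, which is an assumption about the variant being compared.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact form: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float, beta: float = 1.0) -> float:
    # x * sigmoid(beta * x); beta = 1 is assumed here.
    return x / (1.0 + math.exp(-beta * x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  swish={swish(x):.4f}")
```

For negative inputs, GELU and Swish return small negative values instead of a hard zero, which is the main way they differ from ReLU.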

### GELU


In [115]:
print(metrics_df['gelu'])

accuracies             37.39%
losses               0.692861
total_time_in_min        39.2
Name: gelu, dtype: object


In [146]:
import glob 

translations = glob.glob('translations/*.json')
translations_dict = {}
for path in translations:
    with open(path, 'r') as f:
        data = json.load(f)
        name = path.split('/file-')[-1].split('.json')[0]
        translations_dict[name] = data

translations_df = pd.DataFrame(translations_dict)
pd.DataFrame(translations_df.gelu, index=translations_df.index)

| English input | gelu |
|---|---|
| you should pay for it. | Deberías pagar. |
| we have no extra money. | No tenemos dinero a María. |
| This is a problem to deal with. | Este es un problema para eso. |
| This is a really powerful method! | ¡Es realmente un estado poderoso! |
| This is an interesting course about Natural Language Processing | Esto es una gustaría cantar. |
| Jerry liked to look at paintings while eating garlic ice cream. | El aire es como los días les gusta comer. |
| The irony of the situation wasn't lost on anyone in the room. | La piel estaba a la habitación en la habitación. |
| Facebook plans to make a dramatic break with its past by rebranding the company next week, according to a report. | Un planes iba a tomar una vez por un mes. |
| iPhone assembler Foxconn has revealed three prototype electric vehicles as part of its effort to become a major player in the automotive industry. "Our biggest challenge is we don’t know how to make cars," Foxconn chairman Young Liu said at the event held Monday. | No tenemos ni ni una manera como tres es una t... |


### Swish

In [147]:
print(metrics_df['swish'])

accuracies             37.55%
losses               0.681648
total_time_in_min        38.6
Name: swish, dtype: object


In [150]:
pd.DataFrame(translations_df.swish, index=translations_df.index)

| English input | swish |
|---|---|
| you should pay for it. | Deberías pagarlo. |
| we have no extra money. | No tenemos un dinero extra. |
| This is a problem to deal with. | Este problema es un problema de hablar. |
| This is a really powerful method! | ¡Esto es un hombre realmente poderoso! |
| This is an interesting course about Natural Language Processing | Esta es una buena tipo de 9. |
| Jerry liked to look at paintings while eating garlic ice cream. | El aire se puso a Mary le gusta comer cara en ... |
| The irony of the situation wasn't lost on anyone in the room. | La sala no se quedó en la cuarto de la ley. |
| Facebook plans to make a dramatic break with its past by rebranding the company next week, according to a report. | Necesito vivir con su lugar de la noche. |
| iPhone assembler Foxconn has revealed three prototype electric vehicles as part of its effort to become a major player in the automotive industry. "Our biggest challenge is we don’t know how to make cars," Foxconn chairman Young Liu said at the event held Monday. | No queda una buena parte. |


## Task 4: English --> German

- Activation function: ReLU


In [179]:
google_translate = ["We should pay for it.", "We don't have any money with us.", "This is a year with a laugh.", "That's a very warm one from me!", "It is a difficult matter.", "A boy began to feel like eating.", "The room was not in the face.", "Talk a plan plan!", "We don't want to eat anything on the way."]
print(metrics_df['ger'])
# pd.DataFrame(data=[translations_df.ger, pd.Series(google_translate)], index=translations_df.index) 
df = pd.DataFrame(translations_df['ger'], index=translations_df.index).join(pd.Series(google_translate, name='translate', index=translations_df.index))
df.columns = ['German Model', 'Google Translate']
df

accuracies             36.30%
losses               0.536454
total_time_in_min        38.1
Name: ger, dtype: object


| English input | German Model | Google Translate |
|---|---|---|
| you should pay for it. | Wir sollten es bezahlen. | We should pay for it. |
| we have no extra money. | Wir haben kein Geld dabei. | We don't have any money with us. |
| This is a problem to deal with. | Das ist ein Jahr mit einem Lachen. | This is a year with a laugh. |
| This is a really powerful method! | Das ist eine sehr warm von mir! | That's a very warm one from me! |
| This is an interesting course about Natural Language Processing | Das ist eine schwierige Angelegenheit. | It is a difficult matter. |
| Jerry liked to look at paintings while eating garlic ice cream. | Ein Junge fing sich wie zu essen. | A boy began to feel like eating. |
| The irony of the situation wasn't lost on anyone in the room. | Der Raum war nicht in den Gesicht. | The room was not in the face. |
| Facebook plans to make a dramatic break with its past by rebranding the company next week, according to a report. | Einen Plan Plan reden! | Talk a plan plan! |
| iPhone assembler Foxconn has revealed three prototype electric vehicles as part of its effort to become a major player in the automotive industry. "Our biggest challenge is we don’t know how to make cars," Foxconn chairman Young Liu said at the event held Monday. | Wir wollen nichts auf den Weg essen. | We don't want to eat anything on the way. |


## Task 5: Summary

There wasn't much variability in the metrics across models, apart from the effect of changing the depth of the model. Furthermore, the best-performing model was a simpler version of the baseline, with a depth of 2, which also trained much faster than the rest of the models.

It would be interesting to see what a much bigger dataset could bring to the table, given that the "Attention Is All You Need" paper used a more complex model than the ones in my experiments.

It's also interesting to note the effect that a new epoch has on the accuracy of the model. The plots below show that in the later epochs (almost all of them, except the first few), the accuracy starts higher and then decreases with each step, increasing again when a new epoch starts.

### Baseline
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-base.png?raw=true)
### Depth 2
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-depth-2.png?raw=true)
### Depth 6
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-depth-6.png?raw=true)
### Depth 10
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-depth-10.png?raw=true)
### Dimension 256
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-dim-256.png?raw=true)
### Dimension 1024
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-dim-1024.png?raw=true)
### Dimension 2048
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-dim-2048.png?raw=true)
### 2 Heads
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-heads-2.png?raw=true)
### 4 Heads
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-heads-4.png?raw=true)
### 16 Heads
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-heads-16.png?raw=true)
### Skipping positional encoding
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-skip-positional.png?raw=true)
### Swish Activation
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-swish.png?raw=true)
### GELU Activation
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-gelu.png?raw=true)
### Half-sized vocabulary
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-vocab.png?raw=true)
### German Vocabulary
![](https://github.com/sbarrios93/35200-Deep-Learning-Systems/blob/main/hw-3/images/fig-ger.png?raw=true)
