# Overcoming catastrophic forgetting


## Scielo dataset

The Scielo dataset contains monolingual and parallel texts from scientific publications in English, Spanish, Portuguese and French.

This dataset was used in the WMT16 Biomedical Translation Task, which aims to evaluate systems on the translation of scientific publications in the biological and health domains. [more](https://www.statmt.org/wmt16/biomedical-translation-task.html)

## Preprocessing

### Parallel corpus

This dataset is available in the [BioC XML format](http://bioc.sourceforge.net/) and requires some preprocessing and alignment to obtain a usable parallel corpus.

Although we preprocessed and aligned this dataset from scratch, we ended up using the aligned corpora provided by WMT in order to reuse the test sets from the WMT15 challenge.

### Working format

To make this dataset easier to work with, we have converted it to a CSV file (UTF-8).


### Cleaning

We have cleaned every corpus using the following steps:

1. Remove repeated whitespace
2. Convert text to NFD (Unicode normalization) to avoid encoding problems
3. Strip leading and trailing whitespace
4. Remove all pairs where one of the languages has no translation

Since both the health and biological domains can make heavy use of chemical formulas and drug names, we decided that lowercasing the text was not appropriate.

Also, we do not differentiate between text from titles and text from abstracts.
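
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names `clean_text` and `clean_pairs` and the sample pairs are hypothetical, and the only assumption about the data is that it can be read as (source, target) string pairs. Case is left untouched, in line with the decision not to lowercase.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Steps 1-3: collapse repeated whitespace, normalize to NFD, strip."""
    text = _WHITESPACE.sub(" ", text)
    text = unicodedata.normalize("NFD", text)
    return text.strip()


def clean_pairs(pairs):
    """Step 4: clean both sides and drop pairs where one side is empty."""
    kept = []
    for src, tgt in pairs:
        src, tgt = clean_text(src or ""), clean_text(tgt or "")
        if src and tgt:
            kept.append((src, tgt))
    return kept


# Hypothetical sample data, only to exercise the functions.
pairs = [
    ("  El  crecimiento \t difirió ", "Growth   differed"),
    ("Sin traducción", ""),
]
cleaned = clean_pairs(pairs)
removed = len(pairs) - len(cleaned)
print(f"Sentences: {len(cleaned)} - Removed: {removed} ({100 * removed / len(pairs):.2f}%)")
```

Keeping the per-string cleaning separate from the pair filtering makes it easy to count removed pairs, as in the statistics reported below.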

**Cleaning output:**
```
Reading file... (es-en-gma-biological)
100%|██████████| 125828/125828 [00:10<00:00, 12177.95it/s]
Stats for: es-en-gma-biological **************************
	- Documents: 17672
	- Sentences: 123597
		- Removed: 2231 (1.77%)
	- Titles/Abstracts: 8867/114730 (7.73%)
File saved!

Reading file... (es-en-gma-health)
100%|██████████| 587299/587299 [00:46<00:00, 12509.06it/s]
Stats for: es-en-gma-health **************************
	- Documents: 75856
	- Sentences: 580482
		- Removed: 6817 (1.16%)
	- Titles/Abstracts: 44508/535974 (8.30%)
File saved!

Reading file... (fr-en-gma-health)
100%|██████████| 9127/9127 [00:00<00:00, 12114.77it/s]
Stats for: fr-en-gma-health **************************
	- Documents: 1135
	- Sentences: 9040
		- Removed: 87 (0.95%)
	- Titles/Abstracts: 740/8300 (8.92%)
File saved!

Reading file... (pt-en-gma-biological)
100%|██████████| 121874/121874 [00:10<00:00, 11963.56it/s]
Stats for: pt-en-gma-biological **************************
	- Documents: 18180
	- Sentences: 120301
		- Removed: 1573 (1.29%)
	- Titles/Abstracts: 5766/114535 (5.03%)
File saved!

Reading file... (pt-en-gma-health)
100%|██████████| 512564/512564 [00:41<00:00, 12351.52it/s]
Stats for: pt-en-gma-health **************************
	- Documents: 65659
	- Sentences: 507987
		- Removed: 4577 (0.89%)
	- Titles/Abstracts: 48986/459001 (10.67%)
File saved!
``` 


### Splits

Since no validation set was provided, we created our own (from the training set), using the following formula: `min(5000, max(3000, TRAININGSET_LEN*0.03))`. This formula is arbitrary; it was chosen to obtain a validation set similar in size to the test set but not too large.
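
As a small worked example of this formula, the sketch below computes the validation size for a few training-set lengths. The function name `validation_size` and the example lengths are hypothetical, and rounding down to an integer is an assumption (the formula itself does not say how fractions are handled).

```python
def validation_size(training_set_len):
    """min(5000, max(3000, 3% of the training set)), rounded down.

    Integer arithmetic (len * 3 // 100) is used to avoid floating-point
    rounding surprises; this rounding choice is an assumption.
    """
    three_percent = training_set_len * 3 // 100
    return min(5000, max(3000, three_percent))


# Hypothetical training-set lengths covering the three regimes of the formula.
for n in (50_000, 120_000, 500_000):
    print(n, validation_size(n))
```

Small training sets hit the 3000 floor, large ones hit the 5000 cap, and only sets between 100,000 and roughly 166,667 items get exactly 3%.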

**Split sizes:**

**es-en:**
- **Health:**
    - **Training set:** 92299030
    - **Validation set:** 791863
    - **Testing set:** 749446
- **Biological:**
    - **Training set:** 20340159
    - **Validation set:** 632087
    - **Testing set:** 680691
- **Merged:** 
    - **Training set:** 113244619
    - **Validation set:** 818522
    - **Testing set:** 1430138
    
    
**pt-en:**
- **Health:**
    - **Training set:** 81435569
    - **Validation set:** 807394
    - **Testing set:** 583348
- **Biological:**
    - **Training set:** 19356093
    - **Validation set:** 600242
    - **Testing set:** 660998
- **Merged:**
    - **Training set:** 101378812
    - **Validation set:** 820488
    - **Testing set:** 1244347


### Tokenization and BPE

Finally, we tokenized the dataset with Moses and used fastBPE to learn and apply the subword tokenization, with a maximum dictionary size of 32,000 tokens.
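
To illustrate what "learning" a BPE vocabulary means, here is a toy pure-Python version of the merge-learning loop. It is not fastBPE and is not what was run on this dataset: the function names, the tie-breaking rule, and the four-word corpus are all assumptions made for the example, and the real pipeline would learn on the order of 32,000 merges rather than 3.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merges from a {word: frequency} dictionary."""
    # Words are split into characters, with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(vocab)
        if not counts:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption)
        # so that the result is deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
print(vocab)
```

Each merge adds one new symbol to the subword dictionary, which is why the number of merges controls the dictionary size.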

### Examples

#### Health

**Translation #1:**
- Los cambios dimensionales del reborde alveolar pueden ser manejados con diferentes materiales de injerto y procedimientos quirúrgicos reportados en la literatura.
- The dimensional changes of the alveolar ridge could be managed with different graft materials and surgical techniques that have been reported in the scientific literature.

**Translation #2:**
- Se aplicaron de 250 a 400 cc de grasa en un solo tiempo quirúrgico y en 8 pacientes, en un segundo tiempo 3 meses después, fue necesario aplicar la cantidad requerida para corrección de asimetrías.  Clínicamente apreciamos una reabsorción del 10 al 15% de la grasa infiltrada.
- On the first trial were applied from 250 to 400 cc; on the second trials, leaving 2 or 3 months only in 8 cases when it was necessary, having an reabsorption that ranged from 10 to 15%.

**Translation #3:**
- Resultados: Las complicaciones se presentaron en el 8% de los usuarios de este método, el 6% presentó infección e inflamación de la herida operatoria y un 2% un dolor leve en alguno de los testículos.
- Results: In 8% of vasectomy users, complications were found, infection and inflammation in the incision zone (6%) and mild pain in any testes (2%).


#### Biological

**Translation #1:**
- El crecimiento en longitud total y peso difirió entre grupos (p < 0.001).
- The length and weight differed significantly between the groups (p < 0.001).

**Translation #2:**
- Un análisis biogeográfico evolutivo involucraría cinco etapas: (1) reconocimiento de componentes bióticos (conjuntos de taxa integrados espacio-temporalmente debido a una historia común), mediante la panbiogeografía y métodos para identificar áreas de endemismo; (2) contrastación de los componentes bióticos e identificación de los eventos vicariantes que los fragmentaron, mediante la biogeografía cladística y filogeografía comparada; (3) establecimiento de un arreglo jerárquico de los componentes en un sistema biogeográfico de reinos, regiones, dominios, provincias y distritos; (4) identificación de los cenocrones (conjuntos
- An evolutionary biogeographical analysis may involve five steps: (1) recognition of biotic components (sets of spatio-temporally integrated taxa due to common history), through panbiogeography and methods used to identify areas of endemism; (2) contrastation of the biotic components and identification of the vicariant events that fragmented them, through cladistic biogeography and comparative phylogeography; (3) establishment of a hierarchic arrangement of the components in a biogeographic system of realms, regions, dominions, provinces and districts; (4) identification of cenocrons (sets of taxa

**Translation #3:**
- Los porcentajes de mortalidad a 26ºC y a 22ºC fueron más altos mientras que los más bajos se registraron a 34ºC constantes.
- Mortality rates were highest at 26ºC and 22ºC and lowest at constant 34ºC.


#### Merged

**Translation #1:**
- Expansión del tratamiento antirretroviral entre niños infectados por el VIH en Côte d'Ivoire: determinantes de la supervivencia y de las pérdidas de seguimiento.
- Scaling up antiretroviral therapy for HIV-infected children in Côte d'Ivoire: determinants of survival and loss to programme.

**Translation #2:**
- El centro más destacado es la clínica estomatológica provincial, con un impresionante índice de publicación de 9,78.
- The most prominent center is the provincial dental clinic, with an impressive index of publication of 9.78.

**Translation #3:**
- Su raíz superior tiene fibras del primer nervio cervical que sale del nervio hipogloso y se une a la raíz inferior formada por las ramas de los nervios cervicales segundo y tercero.
- Its superior root has fibres from the first cervical nerve that leaves the hypoglossal nerve and joins the inferior root formed by the branches from the second and third cervical nerves.


## Dataset overlap

To visualize the vocabulary overlap among domains, we have plotted two matrices: one with IoU values, and another with the overlap of domain1 with respect to domain2:

### Intersection Over Union

**Spanish-English:**

![iou](images/overlappig-iou-es-en_es.png)

> Formula: `iou=len(vocab1.intersection(vocab2))/len(vocab1.union(vocab2))`

### Overlapping (domain1 - domain2)

**Spanish-English:**

![overlapping](images/overlappig-overlap-es-en_es.png)

> Formula: `overlap=len(vocab1.intersection(vocab2))/len(vocab1)`
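
Both metrics can be computed with plain Python sets, as in the sketch below. The helper names (`iou`, `overlap`, `pairwise_matrix`) and the tiny vocabularies are hypothetical and only serve to show how the two matrices differ; they are not the data behind the plots.

```python
def iou(vocab1, vocab2):
    """Intersection over union of two vocabularies (symmetric)."""
    union = vocab1 | vocab2
    return len(vocab1 & vocab2) / len(union) if union else 0.0


def overlap(vocab1, vocab2):
    """Fraction of vocab1 that also appears in vocab2 (not symmetric)."""
    return len(vocab1 & vocab2) / len(vocab1) if vocab1 else 0.0


def pairwise_matrix(vocabs, metric):
    """Apply the metric to every (row domain, column domain) pair."""
    names = list(vocabs)
    return [[metric(vocabs[a], vocabs[b]) for b in names] for a in names]


# Hypothetical toy vocabularies.
vocabs = {
    "health": {"patient", "treatment", "cell", "study"},
    "biological": {"species", "cell", "study"},
}
print(pairwise_matrix(vocabs, iou))
print(pairwise_matrix(vocabs, overlap))
```

The IoU matrix is symmetric, while the overlap matrix is not: a small vocabulary can be mostly contained in a large one without the reverse being true.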


### Methodology

**Baseline #1: Naive training**
- Train one model per language and domain (2 language pairs x 3 domains). 
    - *Example: (es-en) and (pt-en) for (health, biological and health+biological)*
    
**Baseline #2: Sequential training**
- Evidence of the catastrophic forgetting problem

**Baseline #3: Approaches for mitigating catastrophic forgetting**
- Fixed/Bayesian Interpolation
- Elastic Weight Consolidation
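
For reference, Elastic Weight Consolidation adds a quadratic penalty that keeps each parameter close to its value after the first task, weighted by an estimate of its Fisher information. The sketch below shows that penalty on plain Python lists; it is an illustration of the general technique, not this project's implementation, and the function names, flat parameter lists, and numbers are assumptions.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2 for p, p_old, f in zip(params, old_params, fisher)
    )


def ewc_loss(task_loss, params, old_params, fisher, lam):
    """Loss on the new domain plus the penalty for drifting from the old one."""
    return task_loss + ewc_penalty(params, old_params, fisher, lam)


# Hypothetical values: the first parameter moved, the second did not.
params = [1.0, 2.0]
old_params = [0.0, 2.0]
fisher = [4.0, 10.0]
print(ewc_loss(task_loss=3.0, params=params, old_params=old_params, fisher=fisher, lam=0.5))
```

Parameters with a large Fisher value are expensive to move, so training on the new domain is pushed toward changing the parameters that mattered least for the old one.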

**Proposal: Freeze architecture, extend, and train extension**
- Fine tuning
- Model surgery
- Reinforcement learning


### Training

We trained 6 models, all with the same base architecture and the same configuration. *(Toolkit: fairseq 0.10.2)*

**Trained models:**
- **es-en:** health, biological and health+biological
- **pt-en:** health, biological and health+biological

**Main parameters:**
- Architecture: [Transformer](https://arxiv.org/abs/1706.03762)
- Optimizer: [Adam](https://arxiv.org/abs/1412.6980)
- Criterion: Label Smoothed CE
- Label Smoothing 0.1
- Clip norm: 0.1
- Dropout: 0.1
- Weight decay: 0.0001
- Warmup updates: 4000
- Learning rate: 1e-3
- LR scheduler: Reduce on plateau
- Max tokens: 4096
- Max. epochs: 50
- Update frequency: 8 (to simulate a bigger batch)
- Scoring: BLEU (beam width=5)
- Best model: Best BLEU
- Early stopping: True (patience=5)
- Training from scratch
- Avg. time: 12-24h/model on 2x Titan XP (12GB)
- Checkpoint size: ~1GB
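
As a quick check on what the update frequency buys, the sketch below computes an upper bound on the number of tokens contributing to each optimizer step. It assumes that `Max tokens` is a per-GPU limit and that both GPUs are used in data-parallel mode; the function name is hypothetical.

```python
def effective_tokens_per_update(max_tokens, update_freq, num_gpus):
    """Upper bound on tokens per optimizer step with gradient accumulation."""
    return max_tokens * update_freq * num_gpus


# Values from the parameter list above; num_gpus=2 is an assumption.
print(effective_tokens_per_update(max_tokens=4096, update_freq=8, num_gpus=2))
```

The real number per step is usually somewhat lower, since batches are filled with whole sentences and rarely reach the token limit exactly.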

**Process:**

Training:
![model_val](images/wandb_train.png)

Validation:
![model_val](images/wandb_valid.png)

### Evaluation

Each of the 6 trained models was evaluated on each domain, using the BLEU score from fairseq (used for the WMT).



**Spanish-English:**
![model_val](images/bleu_scores_en_es.png)

**Portuguese-English:**
![model_val](images/bleu_scores_pt-en.png)

**Comments:**
- Results from es-en and pt-en are very similar. This is expected since these languages are quite similar.
- The model trained on the health domain seems to be quite robust on the other domains. A possible explanation is the size of the health domain, which is substantially larger than the biological domain.
- Surprisingly, the **health model** performs better on the biological domain than on the health domain. **(WE NEED TO STUDY THIS!)**
- The **biological model** behaves exactly as we expected. It performs best on its own domain, then on the merged domain, which includes data from its domain, and worst on the health domain, which contains a distribution not seen during training.
- The **merged model** improves the evaluation in all domains w.r.t. the domain-specific models. Again, this is expected since more data usually leads to better models.

**Table:**

| Model | Test domain | Language pair | BLEU |
|---|---|---|---|
| Health | Health | es-en | 39.31 |
| Health | Biological | es-en | 40.53 |
| Health | Merged | es-en | 40.08 |
| Biological | Health | es-en | 27.07 |
| Biological | Biological | es-en | 33.60 |
| Biological | Merged | es-en | 30.28 |
| Health+Biological | Health | es-en | 40.20 |
| Health+Biological | Biological | es-en | 43.84 |
| Health+Biological | Merged | es-en | 42.13 |
| Health | Health | pt-en | 39.52 |
| Health | Biological | pt-en | 40.06 |
| Health | Merged | pt-en | 39.82 |
| Biological | Health | pt-en | 25.65 |
| Biological | Biological | pt-en | 32.40 |
| Biological | Merged | pt-en | 29.22 |
| Health+Biological | Health | pt-en | 40.19 |
| Health+Biological | Biological | pt-en | 41.95 |
| Health+Biological | Merged | pt-en | 41.15 |
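
The comments above can be checked directly against the table. The sketch below copies the BLEU scores into a dictionary and computes two differences: the gain of the merged model over each single-domain model on its own domain, and the drop of a single-domain model when tested on the other domain. The helper names are hypothetical; the numbers are the ones reported in the table.

```python
# BLEU scores from the table: (language pair, model, test domain) -> BLEU.
BLEU = {
    ("es-en", "Health", "Health"): 39.31,
    ("es-en", "Health", "Biological"): 40.53,
    ("es-en", "Health", "Merged"): 40.08,
    ("es-en", "Biological", "Health"): 27.07,
    ("es-en", "Biological", "Biological"): 33.60,
    ("es-en", "Biological", "Merged"): 30.28,
    ("es-en", "Health+Biological", "Health"): 40.20,
    ("es-en", "Health+Biological", "Biological"): 43.84,
    ("es-en", "Health+Biological", "Merged"): 42.13,
    ("pt-en", "Health", "Health"): 39.52,
    ("pt-en", "Health", "Biological"): 40.06,
    ("pt-en", "Health", "Merged"): 39.82,
    ("pt-en", "Biological", "Health"): 25.65,
    ("pt-en", "Biological", "Biological"): 32.40,
    ("pt-en", "Biological", "Merged"): 29.22,
    ("pt-en", "Health+Biological", "Health"): 40.19,
    ("pt-en", "Health+Biological", "Biological"): 41.95,
    ("pt-en", "Health+Biological", "Merged"): 41.15,
}


def merged_gain(lang, domain):
    """BLEU gain of the merged model over the single-domain model, in-domain."""
    return round(BLEU[(lang, "Health+Biological", domain)] - BLEU[(lang, domain, domain)], 2)


def cross_domain_drop(lang, model, other):
    """BLEU lost by a single-domain model when tested on the other domain."""
    return round(BLEU[(lang, model, model)] - BLEU[(lang, model, other)], 2)


for lang in ("es-en", "pt-en"):
    for domain in ("Health", "Biological"):
        print(lang, domain, "merged gain:", merged_gain(lang, domain))
    print(lang, "Biological -> Health drop:", cross_domain_drop(lang, "Biological", "Health"))
    print(lang, "Health -> Biological drop:", cross_domain_drop(lang, "Health", "Biological"))
```

A negative drop for the health model corresponds to the surprising result noted in the comments: it scores higher on the biological test set than on its own.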


### Translated examples

**Hypothesis (H) vs. Reference (T). Not cherry-picked.**

#### Health

**Translation #1:**
- The baseline measurements were performed 15 minutes after the beginning of anesthesia with a calibrated anerodium manometer in cm H2O and gave the initial pressure and volume values .
- The intracuff pressure measurements were undertaken after 15 minutes of anesthesia by means of an aneroid manometer gaged in cm of H2O and have provided initial pressure and volume values .

**Translation #2:**
- METHODS : Participated in this study 30 patients , distributed in two groups : Gn ( 15 ) , non @-@ hypertensive elderly , and Gh ( 15 ) , hypertensive elderly , collecting urine for two hours .
- METHODS : Participated in this study 30 patients distributed in 2 groups : Gn ( 15 ) normotensive elderly , and Gh ( 15 ) hypertensive elderly . Urine was collected for 2 hours .

**Translation #3:**
- After 12 hours of last treatment , animals were sacrificed and plasma was collected for the analysis of SRAT , being removed the left lobe of the liver and kidneys for histological examination .
- Animals were sacrificed 12 hours after the last treatment , plasma was collected for TBARS analysis and the liver left lobe and both kidneys were removed for histological evaluation .

#### Biological

**Translation #1:**
- Because of this , frogs were used as bioindicators in the area of the Dagua medium , in the zone of Zaragoza , where the mining activity has released contaminants to the water river ( mainly heavy metals ) .
- In this way , tadpoles were used as bioindicators in the Medio Dagua zone , in Zaragoza town , where mining has released pollutants into the Dagua River ( mostly heavy metals ) 

**Translation #2:**
- Thirteen superficial samples ( &quot; core @-@ tops &quot; ) of the sediments from Panama Cuenca , the Colombian Pacific were analyzed , from which the bentonic forami were extracted from the fraction &gt; 150mm .
- Thirteen deep @-@ sea samples ( core @-@ tops ) from the Panama Basin , Colombian Pacific , were analysed for benthonic foraminifera in the &gt; 150mm size fraction .

**Translation #3:**
- Each sample of 200 embryos and 200 eggs without fertilizing were homogenized and centrifuged for 15 minutes to 10,000 rpm . On the basis of the supernatants obtained , agarose gel electrophoresis was run .
- Each sample of 200 embryos and 200 eggs without fertilizing , were homogenized and centrifugated during 15 minutes to 10.000 rpm . Electrophoresis was run in agarosa gel .

#### Merged

**Translation #1:**
- The relative frequencies adjusted by sex and age were : hypertension 41.8 % , overweight / obesity 62.1 % , and hypercholesterolemia 42.9 % , and hyperglycemia 5.2 % .
- The prevalence values found for main CV risks factors , adjusted by sex and age , were : high blood pressure = 41.8 % ; overweight / obesity by BMI = 62.1 % , hypercholesterolemia = 42.9 % and hyperglycemia = 5.2 % .

**Translation #2:**
- In order to know the characteristics of energy and water consumption of dairy farms in Castile and Leon , an energy auditing was performed at 35 farms .
- An energy audit model was designed and data collected on 35 dairy sheep farms in Castilla y León were used to ascertain the nature of energy and water consumption of these farms .

**Translation #3:**
- The minority group was the number 1 ( 8.2 % of the sampled population ) , while the majority group was the 3 ( 53.0 % of the population ) .
- The minority group was the number 1 ( 8.2 % of the sampled population ) , while the larger is group 3 ( 53.0 % of the population ) .