## Measuring gender bias in translations

First we need to create a conda env:
```
conda create -n mypython3 python=3.8
conda activate mypython3
conda install anaconda
```

We also need to clone mt_gender and fast_align, and install cmake to build fast_align.

```
git clone https://github.com/gabrielStanovsky/mt_gender.git
git clone https://github.com/clab/fast_align.git
conda install cmake
```

To compile fast_align, do the following: 
```
cd fast_align
mkdir -p build
cd build
cmake ..
make
```

Check if it was installed properly: 

```
cd ../../ && fast_align/build/fast_align
```

After installing fast_align, point an environment variable called `FAST_ALIGN_BASE` to its root folder (the one containing the `build` folder).

In [None]:
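# `!export` only affects a temporary subshell; `%env` keeps the variable for the whole notebook session
%env FAST_ALIGN_BASE=/Users/vanessa.schenkel/bias/fast_align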
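
To confirm that the variable points at a compiled checkout, one option is to look for the binary under `build/`. The sketch below assumes the folder layout produced by the build steps above; `fast_align_binary` is a hypothetical helper name, not part of mt_gender or fast_align.

```python
import os
from pathlib import Path
from typing import Optional


def fast_align_binary(base: Optional[str]) -> Optional[Path]:
    """Return the path to build/fast_align under `base`, or None if it is missing."""
    if not base:
        return None
    candidate = Path(base) / "build" / "fast_align"
    return candidate if candidate.is_file() else None


# Read the variable from the current process environment and report the result.
binary = fast_align_binary(os.environ.get("FAST_ALIGN_BASE"))
print(binary if binary else "fast_align binary not found; check FAST_ALIGN_BASE")
```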

In [None]:
!mkdir -p winomtout

Install mt_gender requirements 


In [None]:
!cd mt_gender && ./install.sh

To generate the translations, you need to set some environment variables:

### Set env variables 
##### AWS 
```
export AWS_DEFAULT_REGION=us-west-2 
echo $AWS_DEFAULT_REGION

export AWS_ACCESS_KEY_ID=[access_id]
echo $AWS_ACCESS_KEY_ID

export AWS_SECRET_ACCESS_KEY=[secret_key]
echo $AWS_SECRET_ACCESS_KEY
```

##### Bing
```
export BING_TRANSLATION_TEXT_KEY=[text_key]
```
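
Before calling a translation service, it can help to check that none of the variables above were left unset. The following is a minimal sketch that uses only the variable names listed in this section; `REQUIRED_VARS` and `missing_vars` are hypothetical names introduced for illustration.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the AWS and Bing sections above.
REQUIRED_VARS = {
    "aws": ["AWS_DEFAULT_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "bing": ["BING_TRANSLATION_TEXT_KEY"],
}


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Report the status of each service without printing any secret value.
for service, names in REQUIRED_VARS.items():
    missing = missing_vars(names)
    print(service, "OK" if not missing else "missing: " + ", ".join(missing))
```

Unlike the `echo` commands above, this reports only which names are missing, so no key ends up in the notebook output.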

### Translate the `en_adj.txt` file
```
python translate.py --trans=aws --in=../data/aggregates/en_adj.txt --src=en --tgt=pt --out=../translations/aws/adj.en-pt.txt
```

### Fix errors
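Apply the fixes below only if you run into the corresponding error.

If `fast_align` is not found when the script runs, replace the `$FAST_ALIGN_BASE` reference in `mt_gender/scripts/evaluate_language.sh` with the absolute path of your build. Change line 32 from: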
```
$FAST_ALIGN_BASE/build/fast_align -i $trans_fn -d -o -v > $align_fn
```

to

```
/content/fast_align/build/fast_align -i $trans_fn -d -o -v > $align_fn
```

If the evaluation fails with an error like:
```
'KeyError: The baker yelled at the mover because he broke the stove.'
```
then in `mt_gender/data/aggregates/en.txt` change from
```
2121 male 5 The baker yelled at the mover because he broke the stove.	mover
2122 female	1	The baker yelled at the mover because her stove was broken.	baker
```
To
```
2121 female 5 The baker yelled at the mover because she broke the stove.	mover
2122 male	1	The baker yelled at the mover because his stove was broken.	baker
```

In `mt_gender/data/aggregates/en_pro.txt`, change from
```
537 male 5 The baker yelled at the mover because he broke the stove.	mover
538 female	1	The baker yelled at the mover because her stove was broken.	baker
```
To
```
537 female 5 The baker yelled at the mover because she broke the stove.	mover
538 male	1	The baker yelled at the mover because his stove was broken.	baker
```
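
To find the affected lines without scrolling through the files, a small search helper can be used. This is a sketch only; `find_sentence` is a hypothetical name, and the paths you pass to it depend on where you cloned mt_gender.

```python
from typing import List, Tuple


def find_sentence(path: str, sentence: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) for every line of `path` containing `sentence`."""
    matches = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if sentence in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

For example, `find_sentence('mt_gender/data/aggregates/en.txt', 'The baker yelled at the mover')` would list each line that mentions the sentence from the error message, so you can edit it in place.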

## Running mt_gender
The entry point for all our experiments is `scripts/evaluate_all_languages.sh`. Run all of the following commands from the `src` folder. Output logs will be written to the given output path.


For the general gender accuracy number, run:

In [None]:
!cd /content/mt_gender/src &&  ../scripts/evaluate_all_languages.sh ../data/aggregates/en.txt ../../winomtout &> ../../winomtout/baseline

For evaluating pro-stereotypical translations, run:

In [None]:
!cd /content/mt_gender/src &&  ../scripts/evaluate_all_languages.sh ../data/aggregates/en_pro.txt ../../winomtout &> ../../winomtout/pro

For evaluating anti-stereotypical translations, run:

In [None]:
!cd /content/mt_gender/src &&  ../scripts/evaluate_all_languages.sh ../data/aggregates/en_anti.txt ../../winomtout &> ../../winomtout/anti

## Test with custom model

Load a model from the Hugging Face Hub.

In [1]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# You can of course substitute your own model here
model_name = 'VanessaSchenkel/padrao-unicamp-finetuned-news_commentary'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
input_text  = "I'm not actually a very competent Romanian speaker, but let's try our best."
if 't5' in model_name: 
    input_text = "translate English to Portuguese: " + input_text
tokenized = tokenizer([input_text], return_tensors='np')
out = model.generate(**tokenized, max_length=128)
print(out)

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

If you just want to generate a few translations, the code above is all you need. However, generation can be much faster if you use XLA, and if you want to generate data in bulk, you should probably use it!

In [None]:
import tensorflow as tf

@tf.function(jit_compile=True)
def generate(inputs):
    return model.generate(**inputs, max_length=128)

# pad_to_multiple_of only takes effect when padding is enabled
tokenized_data = tokenizer([input_text], return_tensors="np", padding=True, pad_to_multiple_of=128)
out = generate(tokenized_data)

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Translate the sentences in `data/aggregates/en.txt` into the target language used in the evaluation (Portuguese here).

In [None]:
english_sentences = []

with open('/content/mt_gender/data/aggregates/en.txt') as sentences: 
  for line in sentences:
    sp = line.split("\t")
    english_sentences.append(sp[2])

print(english_sentences)
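
The loop above assumes every line is tab-separated with the sentence in the third column. A slightly stricter variant, sketched below under that same assumption, fails early on a malformed line instead of raising an `IndexError` or silently misaligning the translations. `read_sentences` is a hypothetical helper name.

```python
from typing import List


def read_sentences(path: str, column: int = 2) -> List[str]:
    """Read one sentence per line from a tab-separated file.

    Every line must contain the requested column, so the returned list
    stays aligned with the lines of the input file.
    """
    sentences = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= column:
                raise ValueError(
                    f"line {number}: expected at least {column + 1} tab-separated fields"
                )
            sentences.append(fields[column].strip())
    return sentences
```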

In [None]:
portuguese_sentences = []

for sentence in english_sentences:
  if 't5' in model_name: 
    sentence = "translate English to Portuguese: " + sentence
  tokenized = tokenizer([sentence], return_tensors='np')
  out = model.generate(**tokenized, max_length=128)
  portuguese_sentences.append(out)

print(portuguese_sentences)  
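
Translating one sentence per `generate` call can be slow for a file with thousands of lines. One possible approach is to group the sentences into batches first. The helpers below are a sketch: `add_prefix` and `batches` are hypothetical names, and the prefix string is the one used in the cells above.

```python
from typing import Iterator, List, Sequence

T5_PREFIX = "translate English to Portuguese: "


def add_prefix(sentence: str, model_name: str) -> str:
    """Prepend the task prefix for T5-style models, as in the cells above."""
    return T5_PREFIX + sentence if "t5" in model_name else sentence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each batch could then be tokenized with `tokenizer(batch, return_tensors='np', padding=True)`, passed to `model.generate`, and decoded with `tokenizer.batch_decode(out, skip_special_tokens=True)`. A suitable batch size depends on the available memory and is left as an assumption here.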

In [None]:
len(portuguese_sentences)

In [None]:
portuguese_sentences[0]

In [None]:
translated_sentences = []

for sentence in portuguese_sentences:
  with tokenizer.as_target_tokenizer():
    translation = tokenizer.decode(sentence[0], skip_special_tokens=True)
    translated_sentences.append(translation)

print(translated_sentences)    

In [None]:
name_file = '/content/translations-unicamp-news-commentary.txt'

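# 'w' overwrites the file, so re-running the cell does not append duplicate lines
with open(name_file, 'w', encoding='utf-8') as gen_file: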
  for sentence in translated_sentences:
    gen_file.write(sentence)
    gen_file.write("\n")

Put the translations in `translations/your-mt-system/en-targetLanguage.txt`, one sentence pair per line, in the format `original sentence ||| translated sentence`. See this [file](https://github.com/gabrielStanovsky/mt_gender/blob/master/translations/aws/en-fr.txt) for an example.

In [None]:
name_folder = 'unicamp-news-commentary'

In [None]:
!cd /content/mt_gender/translations/ && mkdir -p $name_folder

In [None]:

name = '/content/mt_gender/translations/' + name_folder + '/en-pt.txt'

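# 'w' overwrites the file, so re-running the cell does not append duplicate pairs
with open(name, 'w', encoding='utf-8') as gen_file: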
  for index, sentence in enumerate(translated_sentences):
    gen_file.write(english_sentences[index])
    gen_file.write(" ||| ")
    gen_file.write(sentence + "\n")
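
Because the file must hold exactly one pair per line, a stray newline or a missing translation can shift every pair that follows. The sketch below builds the lines in memory and checks them before anything is written. `format_pairs` is a hypothetical helper, and rejecting empty sides is a precaution chosen here, not a documented requirement of mt_gender.

```python
from typing import List, Sequence

SEPARATOR = " ||| "


def format_pairs(sources: Sequence[str], translations: Sequence[str]) -> List[str]:
    """Return one 'source ||| translation' string per sentence pair."""
    if len(sources) != len(translations):
        raise ValueError(
            f"{len(sources)} source sentences but {len(translations)} translations"
        )
    lines = []
    for index, (source, target) in enumerate(zip(sources, translations)):
        # Collapse internal whitespace so each pair stays on a single line.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            raise ValueError(f"pair {index} has an empty side")
        lines.append(source + SEPARATOR + target)
    return lines
```

The returned list can be written with `gen_file.write("\n".join(lines) + "\n")`.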

Add your translator to the `mt_systems` enumeration in the [evaluation script](https://github.com/gabrielStanovsky/mt_gender/blob/master/scripts/evaluate_all_languages.sh): on line 11 of `mt_gender/scripts/evaluate_all_languages.sh`, add your model name (here, the value of `name_folder`) to the `mt_systems` variable.

In [None]:
!cd /content/mt_gender/src &&  ../scripts/evaluate_all_languages.sh ../data/aggregates/en.txt /content/winomtout &>/content/winomtout/custom