# In this notebook:
- Setting up variables used in multiple places in the notebook
- Text generation
- Bias evaluation of the generated texts
- Calculating bias scores
- Fitting the amplitude regression model ($\beta$)
- Calculating final perplexities

Together, these steps produce the data shown in the report.

# Preliminaries

Please set `DEVICE_STR` appropriately for your machine in the configuration cell below. A CUDA-compatible GPU greatly speeds up parts of the notebook.

Before continuing, make sure the trained models are downloaded from [this dropbox URL](https://www.dropbox.com/s/7hqln183f14vh3a/models.zip?dl=0) and saved in the directory FACT-replicate/models. For example, a model for the CNN/Daily Mail dataset trained without debiasing should be located at the following path, relative to the top-level directory:

models/dataset_dm_encoding_lmbd_0.0_decoding_lmbd_0_ASGD_True

In [None]:
import os

assert os.path.exists(os.path.join("models", "dataset_dm_encoding_lmbd_0.0_decoding_lmbd_0_ASGD_True"))

In [None]:

# torch device for evaluating models & generating text
DEVICE_STR = "cuda:0"  # use "cpu" if no CUDA-compatible GPU is available

model_names = ["dataset_penn_encoding_lmbd_0.0_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_0.001_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_0.01_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_0.1_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_0.5_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_0.8_decoding_lmbd_0_ASGD_True",
                "dataset_penn_encoding_lmbd_1.0_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.0_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.001_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.01_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.1_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.5_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_0.8_decoding_lmbd_0_ASGD_True",
                "dataset_dm_encoding_lmbd_1.0_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.0_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.001_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.01_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.1_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.5_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_0.8_decoding_lmbd_0_ASGD_True",
                "dataset_wikitext-2_encoding_lmbd_1.0_decoding_lmbd_0_ASGD_True"]


dataset_names = ["penn"]*7 + ["dm"]*7 + ["wiki"]*7
lambdas = ["0.0", "0.001", "0.01", "0.1", "0.5", "0.8", "1.0"] * 3
dataset_names_eval = ["penn"]*7 + ["dm"]*7 + ["wikitext-2"]*7
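
The lists above are parallel: entry `i` of `lambdas` and `dataset_names_eval` has to describe entry `i` of `model_names`. The sketch below is one possible way to check that alignment. It assumes the directory naming pattern `dataset_<name>_encoding_lmbd_<lambda>_decoding_lmbd_...` seen in the list; `parse_model_name` and `check_alignment` are hypothetical helpers, not part of the repository.

```python
import re

# Assumed naming pattern, inferred from the entries in model_names.
MODEL_NAME_PATTERN = re.compile(
    r"^dataset_(?P<dataset>.+?)_encoding_lmbd_(?P<lmbda>[0-9.]+)_decoding_lmbd_"
)


def parse_model_name(model_name):
    """Return the (dataset, lambda) pair encoded in a model directory name."""
    match = MODEL_NAME_PATTERN.match(model_name)
    if match is None:
        raise ValueError(f"unexpected model name: {model_name}")
    return match.group("dataset"), match.group("lmbda")


def check_alignment(model_names, dataset_names, lambdas):
    """Raise AssertionError if the parallel lists disagree with the model names."""
    assert len(model_names) == len(dataset_names) == len(lambdas)
    for name, dataset, lmbda in zip(model_names, dataset_names, lambdas):
        assert parse_model_name(name) == (dataset, lmbda), name
```

Calling `check_alignment(model_names, dataset_names_eval, lambdas)` right after the configuration cell would catch a reordered or mistyped entry before any long-running step starts.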



## Text generation

In this part we generate text from our trained language models. This notebook has been adjusted to run faster by generating fewer files.

`word_per_file` is left at 500, as reported, while `files` has been reduced from 2000 to 500 to shorten the runtime when checking functionality.

To reproduce our results, set `files` to 2000. This generates more text (from more distinct seeds) for further evaluation.

Generating 2000 files takes approximately 1.5 to 2 hours **per model**. Using 500 files shortens this, but the run still takes a long time. Generating very few files saves a lot of time, but the text may then contain few gendered words or none at all, which makes the bias calculation less accurate. If no gendered words are generated at all, the evaluation fails.



In [None]:
files = 500
word_per_file = 500

TEXT_FOLDER = './generated/'

for i, model_name in enumerate(model_names):
    lmbda = lambdas[i]
    dataset = dataset_names_eval[i]
    print("dataset", dataset)
    print("model_name", model_name)
    %run "text_generation.py" --dataset $dataset --model_name $model_name --device $DEVICE_STR --words $word_per_file --files $files --lmbda $lmbda --generate_folder $TEXT_FOLDER
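
Because the evaluation depends on gendered words appearing in the generated text, it can help to count them before moving on. The sketch below is a minimal illustration, assuming the generated text is available as plain strings. `EXAMPLE_GENDERED_WORDS` is an illustrative subset only and `count_gendered_words` is a hypothetical helper; the word pairs actually used by the bias calculator may differ.

```python
import re
from collections import Counter

# Illustrative subset only; not the word list used by the bias calculator.
EXAMPLE_GENDERED_WORDS = frozenset(
    {"he", "she", "him", "her", "his", "hers", "man", "woman", "men", "women"}
)


def count_gendered_words(texts, gendered_words=EXAMPLE_GENDERED_WORDS):
    """Count occurrences of each gendered word across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into alphabetic tokens so only whole words match.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in gendered_words:
                counts[token] += 1
    return counts
```

An empty result for a model suggests that `files` is too low for that model to yield a usable bias estimate.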


## Bias evaluation

Now that we've generated text using the trained language models, we calculate the bias scores per model. 

In [None]:
from bias_calculator import BiasCalculator

import nltk
nltk.download('stopwords')


calc = BiasCalculator()
calc.calculate_bias('./data', TEXT_FOLDER)

## Regression of the amplitude $\beta$

Now that the bias of each word has been calculated, we fit the linear regression model for $\beta$.

In [None]:
from beta_calculator import BetaCalculator


calculator = BetaCalculator()
calculator.calculate_beta('./output')
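
For intuition, the fit can be sketched as ordinary least squares on a single predictor. This assumes the model has the form $y = \beta x + c$, which matches the `beta` and `c` values read in the result overview. Which quantities play the roles of $x$ and $y$ is decided inside `BetaCalculator` and is not shown here, so `fit_beta` below is a hypothetical stand-in rather than the notebook's implementation.

```python
def fit_beta(x_values, y_values):
    """Least-squares fit of y = beta * x + c; returns (beta, c)."""
    n = len(x_values)
    if n != len(y_values) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n

    # Slope is the covariance of x and y divided by the variance of x.
    var_x = sum((x - mean_x) ** 2 for x in x_values)
    if var_x == 0:
        raise ValueError("x_values must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))

    beta = cov_xy / var_x
    c = mean_y - beta * mean_x  # the fitted line passes through the means
    return beta, c
```

Under this form, a slope below 1 means the $y$ values are compressed relative to the $x$ values, and a slope above 1 means they are stretched.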

## Calculate perplexity values

To get a complete overview of the results we calculate the final perplexity values for each of the trained models. This took approximately 30 minutes on one of our GPUs. 

In [None]:
from get_perplexity import calculate_perplexities

model_path = os.path.join(os.path.curdir,"models")

calculate_perplexities(model_names, dataset_names_eval, model_path, DEVICE_STR, overwrite=True)
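
As a reminder of the quantity being reported, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on a list of per-token log-probabilities. `perplexity` is a hypothetical helper; `calculate_perplexities` is not shown here and may compute the same value differently, for example from a batched cross-entropy loss.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 1/4 to every token has a perplexity of 4, which is why lower values indicate a better language model.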

## Result overview

The next cell shows a comparison between the overall results of our implementation and those of
Bordia & Bowman. Theirs are shown on the left, ours on the right.

In [None]:
resdict = {}

datasets = set()

version = "reproduced"


with open(os.path.join("results","beta-scores-original-gender-pairs"), "r") as f:
    beta_data = f.readlines()
    for i, line in enumerate(beta_data):
        if not (i % 2):
            dataset, lambda_, context = line[:-1].split()
            datasets.add(dataset)
        else:
            beta, c = line[:-1].split(",")
            beta = beta.split(":")[1].strip()
            c = c.split(":")[1].strip()
            resdict[(version, dataset, lambda_, context)] = {"beta":beta, "c":c}


import ast

with open(os.path.join("results", "bias-scores-original-gender-pairs"), "r") as f:
    bias_data = f.readlines()
    for i, line in enumerate(bias_data):
        if not (i % 2):
            dataset, lambda_ = line[:-1].split()
        else:
            # literal_eval parses the dict literal without executing arbitrary code
            duo_context_dict = ast.literal_eval(line[:-1])

            # merge the bias statistics into any beta entry already stored for this key
            for context in ("fixed", "infinite"):
                key = (version, dataset, lambda_, context)
                resdict.setdefault(key, {}).update(duo_context_dict[context])


                
version = "original"
with open(os.path.join("results", "OP_results_reference.txt"), "r") as f:
    for i, line in enumerate(f.readlines()):
        if i < 2:
            continue  # skip the two header lines
        line_data = line[:-1].split()
        if len(line_data) < 2:
            # a line with fewer than two fields names the dataset for the rows that follow
            dataset = line[:-1]
        else:
            lambda_, mu_fixed, sigma_fixed, beta_fixed, mu_infinite, sigma_infinite, beta_infinite, ppl = line_data

            resdict[(version, dataset, lambda_, "fixed")] = {"std abs":sigma_fixed, "mean abs":mu_fixed, "beta":beta_fixed}
            resdict[(version, dataset, lambda_, "infinite")] = {"std abs":sigma_infinite, "mean abs":mu_infinite, "beta":beta_infinite}
            resdict[(version, dataset, lambda_)] = {"ppl":ppl}
    

version = "reproduced"
with open("perplexity_result_file.txt", "r") as f:
    for line in f.readlines():
        line_data = line[:-1].split()
        dataset, lambda_, ppl = line_data
        resdict[(version, dataset, lambda_)] = {"ppl":ppl}
        
        

version = "reproduced"

context_types = ["fixed", "infinite"]
ordered_table_keys = ['mean abs', 'std abs', 'beta']
lambdas = ['train','0.0', '0.001', '0.01', '0.1', '0.5', '0.8', '1.0']

overview_rows = []

for dataset in datasets:
    overview_rows.append([dataset, "", "Original", "", "", "", "ours", ""])
    for context_type in context_types:
        overview_rows.append([context_type + " context"])
        # column order follows ordered_table_keys: mean (mu), std (sigma), beta
        overview_rows.append([r"$\lambda$", r"$\mu$", r"$\sigma$", r"$\beta$", r"$\textit{ppl}$"] + [r"$\mu$", r"$\sigma$", r"$\beta$", r"$\textit{ppl}$"])
        for lambda_ in lambdas:
            # fall back to an empty dict so a missing entry renders as '-' instead of raising
            rd_repr = resdict.get(("reproduced", dataset, lambda_, context_type), {})
            rd_orig = resdict.get(("original", dataset, lambda_, context_type), {})
            row_orig = [str(rd_orig.get(k, '-'))[:4] for k in ordered_table_keys]
            row_repr = [str(rd_repr.get(k, '-'))[:4] for k in ordered_table_keys]

            ppl_orig = resdict.get(("original", dataset, lambda_), '-')
            if isinstance(ppl_orig, dict):
                ppl_orig = ppl_orig.get('ppl')

                
            ppl_repr = resdict.get(("reproduced", dataset, lambda_), '-')
            if isinstance(ppl_repr, dict):
                ppl_repr = ppl_repr.get('ppl')

            
            overview_row = [lambda_] + row_orig + [ppl_orig] + row_repr + [ppl_repr]
            overview_rows.append(overview_row)
    overview_rows.append([""])

# used to generate part of the latex

# for overview_row in overview_rows: 
#     if isinstance(overview_row, list):
#         print(" & ".join(overview_row) + "\\\\")
#     else:
#         print(overview_row)


from IPython.display import HTML, display
import tabulate

display(HTML(tabulate.tabulate(overview_rows, tablefmt='html')))

