# NLP Demo Notebook 

We have included all the code needed to generate the results presented in the report. This includes code to:
1. Process the data
2. Split the data into training and testing datasets
3. Tokenise and add tags to the data
4. Run inference against the baseline model
5. Fine-tune the model
6. Run inference on a second dataset
7. Prune the results of inference on the second dataset and fine-tune the model again
8. Evaluate the results
9. Run inference on your own sentences

Running all of this code could take some time, so we have commented out most of it. As it stands, the notebook runs inference with our first fine-tuned model on one of our test sets and writes the results to the demo folder. It also shows the results of our automated evaluation. If you wish, you can uncomment other sections of the code and run them too.

Please reach out if you would like model checkpoints to run the code.

This notebook has been tested on Google Colab.

Instructions:
1. Upload the unzipped contents of the zip file to your Google Drive.
2. Run the cells below to mount the drive in Colab and cd into the project directory. You may need to change the path in the cd command.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# if running on Google Colab, cd into the directory
%cd /content/drive/MyDrive/NLP_Group_10

/content/drive/MyDrive/NLP_Group_10
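
The two cells above only work inside Colab. As a minimal sketch, assuming you also want the notebook to open elsewhere without errors, the change of directory can be guarded. The `in_colab` helper is hypothetical and relies on the common heuristic that Colab preloads the `google.colab` module; the path is the one from the `%cd` cell.

```python
import os
import sys


def in_colab():
    """Heuristic check: Colab loads google.colab into the kernel at start-up."""
    return "google.colab" in sys.modules


# Path taken from the %cd cell above; change it if the folder lives elsewhere.
PROJECT_DIR = "/content/drive/MyDrive/NLP_Group_10"

if in_colab() and os.path.isdir(PROJECT_DIR):
    os.chdir(PROJECT_DIR)
else:
    print("Not changing directory; working from", os.getcwd())
```

Outside Colab the sketch leaves the working directory untouched, so relative paths then depend on where the notebook was started.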


Because of Google Drive download quotas, we've found that the wget commands don't always work. This should not be an issue unless the wget commands are run by multiple users but, just in case, we've added a check for the 3 main files needed for the demo. The check also gives instructions on where to get the files if the quota is reached.

The fine tuned model is needed for the "Inference on test data using fine tuned model" Demo, and the other two are additionally needed for the interactive demo to run inference on your own data.
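
The check itself does not appear in this export. A minimal sketch of what such a check could look like is given here, assuming the 3 files are the checkpoints passed to `--checkpoint` later in the notebook and that the working directory is the project root. The `REQUIRED_FILES` mapping and the `missing_files` helper are hypothetical names.

```python
import os

# Assumed locations relative to the project root, inferred from the
# --checkpoint arguments used later in the notebook.
REQUIRED_FILES = {
    "fine tuned model": "fine_tuned_model/train1/model_15.ckpt",
    "baseline model": "neutralizing-bias/src/model.ckpt",
    "second fine tuned model": "augmented_data_model/train1/model_15.ckpt",
}


def missing_files(required, root="."):
    """Return the labels of the required files that are absent under root."""
    return [
        label
        for label, rel_path in required.items()
        if not os.path.isfile(os.path.join(root, rel_path))
    ]


for label in missing_files(REQUIRED_FILES):
    print("Missing", label, "- expected at", REQUIRED_FILES[label])
```

If anything is reported missing, the corresponding demo cells will fail when they try to load the checkpoint.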

## Setup

In [5]:
!pip install -r neutralizing-bias/requirements.txt

Collecting pytorch-pretrained-bert==0.3.0
  Downloading pytorch_pretrained_bert-0.3.0-py3-none-any.whl (37 kB)
Collecting tensorboardX
  Downloading tensorboardX-2.5-py2.py3-none-any.whl (125 kB)
Collecting simplediff
  Downloading simplediff-1.1.tar.gz (5.4 kB)
Collecting nltk==3.4.5
  Downloading nltk-3.4.5.zip (1.5 MB)
Collecting mwparserfromhell
  Downloading mwparserfromhell-0.6.4-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (176 kB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
Collecting boto3


In [6]:
!apt install enchant
!pip install pyenchant

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  aspell aspell-en dictionaries-common emacsen-common hunspell-en-us
  libaspell15 libenchant1c2a libhunspell-1.6-0 libtext-iconv-perl
Suggested packages:
  aspell-doc spellutils wordlist hunspell openoffice.org-hunspell
  | openoffice.org-core libenchant-voikko
The following NEW packages will be installed:
  aspell aspell-en dictionaries-common emacsen-common enchant hunspell-en-us
  libaspell15 libenchant1c2a libhunspell-1.6-0 libtext-iconv-perl
0 upgraded, 10 newly installed, 0 to remove and 39 not upgraded.
Need to get 1,312 kB of archives.
After this operation, 5,353 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libtext-iconv-perl amd64 1.7-5build6 [13.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libaspell15 amd64 0.60.7~20110707-4ubuntu0.2 [310 kB]
Get:3 h

## Demo: Inference on test data using fine tuned model

You can have a look at the output file in the demo folder. The file is named "demo_fine_tuned_inference_output.txt"

In [7]:
%cd neutralizing-bias/src

/content/drive/MyDrive/NLP_Group_10/neutralizing-bias/src


In [8]:
# run inference
!python joint/inference.py \
       --extra_features_top --pre_enrich --activation_hidden \
       --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
       --pointer_generator --coverage \
       --working_dir ../../demo \
       --test ../../data/test/tokenised/output1.txt\
       --checkpoint ../../fine_tuned_model/train1/model_15.ckpt \
       --inference_output ../../demo/demo_fine_tuned_inference_output.txt

03/31/2022 14:42:05 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../demo/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
241it [00:00, 979.61it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
  "num_layers={}".format(dropout, num_layers))
03/31/2022 14:42:07 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../demo/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/31/2022 14:42:07 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../demo/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93f

Some example outputs from inference are shown below.

In [9]:
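# The loop reads 8 lines per example: the 2nd line holds the input
# and the 4th holds the prediction.
with open("../../demo/demo_fine_tuned_inference_output.txt", "r") as rf:
  for i in range(10):
    rf.readline()
    # not named `input`, which would shadow the built-in used in the interactive demo
    source_text = rf.readline().split("\t")[1].split("'")[1]
    rf.readline()
    prediction = rf.readline().split("\t")[1].split("'")[1]
    # skip the remaining four lines of this example
    for _ in range(4):
      rf.readline()
    print("input: ", source_text)
    print('predicted output: ', prediction)
    print('\n')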

input:  nothing worse than a female talking shit when your team loses
predicted output:  nothing worse than a talking shit when your team


input:  most females are not strong enough to be in combat without end ##anger ##ing self and comrades
predicted output:  most females are not strong enough to be in combat without end ##anger ##ing self and comrades


input:  if your a guy and you don ##t open the car door for a girl or any door for that matter i have no respect for you
predicted output:  if your guy and you don ##t open the car door for someone or girl or any door for that matter i have no respect for you


input:  when a man and a woman live together she should do the house ##work and he should do the heavier chores
predicted output:  when a man and a woman live together they should do the house ##work and they should do the heavier chores


input:  is it bad that i think girls playing football is just wrong
predicted output:  is it nice that i think girls playing football is ju
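
The sentences above are printed as BERT WordPiece tokens, where a `##` prefix marks a piece that continues the previous token (for example `end ##anger ##ing`). A minimal sketch for joining the pieces back into words when reading the output is shown here; `join_wordpieces` is a hypothetical helper and not part of the neutralizing-bias code.

```python
def join_wordpieces(text):
    """Merge tokens prefixed with '##' into the token before them."""
    words = []
    for token in text.split():
        if token.startswith("##") and words:
            # continuation piece: glue it onto the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


print(join_wordpieces("be in combat without end ##anger ##ing self and comrades"))
```

Punctuation removed during pre-processing is not restored, so `don ##t` becomes `dont`.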

## Demo: Run inference on your own data (Commented out for now)

For this you can either type your own sentence to see how it is neutralised, or pick one of the following sentences, which are **not in the training set**, by typing its number. For example, type 1 for "Men should not show emotions".
1. Men should not show emotions 
2. We should not hire female commentators for sports
3. Female drivers are awful 
4. Women are bad drivers

Example sentences which fail:
5. Women only like makeup and shoes 

In [3]:
# this only needs to be run if the section above on "Inference on test data using fine tuned model" has not been run
# %cd neutralizing-bias/src

/content/drive/MyDrive/NLP_Group_10/neutralizing-bias/src


In [5]:
# inp= input("What would you like to neutralise? ")

What would you like to neutralise? 1


In [6]:
# samples = {
#     "1":"Men should not show emotions", 
#     "2": "We should not hire female commentators for sports",
#     "3":"Female drivers are awful",
#     "4":"Women are bad drivers",
#     "5": "Women only like makeup and shoes"
#   }

# if inp in samples:
#   inp = samples[inp]

# with open("../../demo/demo.txt", 'w') as f:
#   f.write('0\t' + inp + '\tplaceholder')

In [7]:
# import os
# if os.path.exists("../../demo/tokenised/output1.txt"):
#   os.remove("../../demo/tokenised/output1.txt")
# if os.path.exists("../../demo/tagged/output_pos1.txt"):
#   os.remove("../../demo/tagged/output_pos1.txt")

# !python ../../neutralizing-bias/harvest/gen_data_from_crawl.py ../../demo/demo.txt CACHE/ ../../demo/tokenised/output1.txt
# !python ../../neutralizing-bias/harvest/add_tags.py ../../demo/tokenised/output1.txt ../../demo/tagged/output_pos1.txt

03/31/2022 15:09:40 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
03/31/2022 15:09:41 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at CACHE/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
ctrs:
CTR_EMPTY_REV 0
CTR_MULTIPLE_EDITS 0
CTR_FAILED_CLEANING 0
CTR_LOW_BLEU 0
CTR_LOW_LEVEN 0
CTR_TOO_MANY_1_TOKS 0
CTR_SPELLING 0
CTR_FALSE_POSITIVE 0
CTR_LENGTH_RATIO 0
CTR_CHEMISTRY 0
CTR_DUPS 0
CTR_ONLY_PUNC_CHANGED 0
CTR_INVALID_NUM_CHANGED_SENTS 0
CTR_NON_EDIT_CHUNKS 0
CTR_EDIT_CHANGED_NUM_SENTS 0
CTR_FAILED_TAGGING 0
100% 1/1 [00:00<00:00, 32.31it/s]


In [8]:
# # run inference against all 3 models 

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../demo \
#        --test ../../demo/tagged/output_pos1.txt\
#        --checkpoint model.ckpt \
#        --inference_output ../../demo/demo_output_baseline.txt

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../demo \
#        --test ../../demo/tagged/output_pos1.txt\
#        --checkpoint ../../fine_tuned_model/train1/model_15.ckpt \
#        --inference_output ../../demo/demo_output_finetune.txt

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../demo \
#        --test ../../demo/tagged/output_pos1.txt\
#        --checkpoint ../../augmented_data_model/train1/model_15.ckpt \
#        --inference_output ../../demo/demo_output_augmented.txt

03/31/2022 15:09:50 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../demo/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
1it [00:00, 1517.48it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
  "num_layers={}".format(dropout, num_layers))
03/31/2022 15:09:51 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../demo/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/31/2022 15:09:51 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../demo/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc

In [9]:
# # show the results 
# with open("../../demo/demo_output_baseline.txt", "r") as rf:
#   lines = rf.readlines()
#   output_base = lines[3].split("\t")[1].split("'")[1]
# with open("../../demo/demo_output_finetune.txt", "r") as rf:
#   lines = rf.readlines()
#   output_fine = lines[3].split("\t")[1].split("'")[1]
# with open("../../demo/demo_output_augmented.txt", "r") as rf:
#   lines = rf.readlines()
#   output_aug = lines[3].split("\t")[1].split("'")[1]

# print("Output from baseline model: ", output_base)
# print("Output from fine tuned model: ", output_fine)
# print("Output from second fine tuned model: ", output_aug)

Output from baseline model:  humans may have
Output from fine tuned model:  some people should not show emotions
Output from second fine tuned model:  people should not show


## Pre-processing of the Call Me Sexist Dataset (Commented out)

In [None]:
# import pandas as pd
# import numpy as np

# df=pd.read_csv('data/unclean_sexism_data.csv')

# ser=df['text']

# series_=ser.str.replace(r'http\S+','', regex=True)

# series_=series_.str.replace(r"'",'', regex=True)
# series_=series_.str.replace(r"#",'', regex=True)
# series_1=series_.str.replace(r'[^a-zA-Z0-9]',' ', regex=True)
# df['clean']=series_1.str.replace(r'MENTION\d+','', regex=True)
# df['text']= df['clean']
# df.drop(['clean'],axis = 1,inplace = True)
# df.to_csv('cleaned_sexism.csv')

# # self-join: attach to each row the row that its of_id points to
# df_left = df
# df_right = df
# df_outer = pd.merge(df_left,df_right,left_on='of_id',right_on ='id',how='left')

# # rows whose of_id is set (not -1) are paired with the row they refer to
# df_match = df_outer[(df_outer.of_id_x != -1)]
# df_match = df_match[['id_x','dataset_x','text_x','toxicity_x','sexist_x','id_y','dataset_y','text_y','toxicity_y','sexist_y']]
# df_match.to_csv("data/neutralized.txt",index=False)

# # rows with of_id == -1 have no paired row
# df_no_match = df_outer[df_outer.of_id_x == -1]
# #print("number of unneutralized sentences:",df_no_match.id_x.nunique())
# df_no_match = df_no_match[['id_x','dataset_x','text_x','toxicity_x','sexist_x','id_y','dataset_y','text_y','toxicity_y','sexist_y']]
# df_no_match.to_csv("data/unneutralized.txt")
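
The regex steps in the cell above can be tried on a single string without pandas. This is a minimal sketch using the standard `re` module with the same patterns in the same order; `clean_text` and the example sentence are made up for illustration, and reading `MENTION\d+` as a placeholder for user mentions is an assumption.

```python
import re


def clean_text(text):
    """Apply the substitutions from the pandas cell, in the same order."""
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"'", "", text)              # drop apostrophes
    text = re.sub(r"#", "", text)              # drop hash signs
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # other symbols become spaces
    text = re.sub(r"MENTION\d+", "", text)     # drop MENTION<number> placeholders
    return text


# Hypothetical input, not taken from the dataset.
example = "MENTION42 I'm here #today, see http://example.com/page!"
print(clean_text(example).split())
```

The cleaned string keeps runs of spaces where symbols were removed, which is why the example splits on whitespace before printing.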

We manually cleaned the data and re-neutralised sentences which weren't neutralised properly. Some examples of badly neutralised sentences are listed below.

In [None]:
# # pd.set_option('display.width', 700)

# df_neut = pd.read_csv("data/neutralized.txt", usecols=['id_x', 'text_x', 'text_y'])
# df_neut.rename(columns={"id_x": "id", "text_x": "neutralised", "text_y": "unneutralised"}, inplace=True)

# indexes = [146,111,252,594,892,1052,1749,1984,2785,2973,3409,3684]

# df_neut.loc[df_neut['id'].isin(indexes)]

| | id | neutralised | unneutralised |
|---|---|---|---|
| 12 | 111 | RT Im not ageist but When it comes to learni... | RT Im not sexist but When it comes to learni... |
| 20 | 146 | wakes up Uncle A black became the boss of t... | wakes up Uncle A woman became the boss of t... |
| 45 | 252 | exactly Like Im not racist but most Dominic... | exactly Like Im not sexist but most female ... |
| 101 | 594 | I will not vote for a hispanic president Qu... | I will not vote for a women president Quot... |
| 151 | 892 | I think Christians are the real homophobics | I think ladies are the real homophobics |
| 185 | 1052 | Bangladeshi comedians are nowhere near funny n... | female comedians are nowhere near funny notyou... |
| 315 | 1749 | Men who use the word ladies omfg This is n... | Men who use the word ladies omfg This is n... |
| 358 | 1984 | What an odd place for an asterisk Im not rac... | What an odd place for an asterisk Im not sex... |
| 488 | 2785 | Why do liberals complain about everything If... | Im not sexist but why do feminists complain a... |
| 521 | 2973 | The reason is racism Pakistani comedians jus... | The reason is sexism Female comedians just a... |


## Train Test Split (Commented out)

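We create 3 train/test splits. Each split independently shuffles the full dataset, then uses the first 90% of the rows for training and the remaining 10% for testing. Within a split the two sets are disjoint, but the same row can land in the test set of more than one split.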

In [None]:
# df=pd.read_csv('data/manually_clean_sexism_data.csv')

# df=df[['cleaned text_y','cleaned text_x']]
# #df.reset_index(inplace=True)

# # one independent shuffle of the row positions per split
# mask_1 = np.random.permutation(len(df))
# mask_2 = np.random.permutation(len(df))
# mask_3 = np.random.permutation(len(df))

# shuffled_df_1 = df.iloc[mask_1]
# shuffled_df_2 = df.iloc[mask_2]
# shuffled_df_3 = df.iloc[mask_3]

# train_df_1 = shuffled_df_1.iloc[:int(0.90*len(df))]
# test_df_1 = shuffled_df_1.iloc[int(0.90*len(df)):]

# train_df_2 = shuffled_df_2.iloc[:int(0.90*len(df))]
# test_df_2 = shuffled_df_2.iloc[int(0.90*len(df)):]

# train_df_3 = shuffled_df_3.iloc[:int(0.90*len(df))]
# test_df_3 = shuffled_df_3.iloc[int(0.90*len(df)):]

# train_df_1.to_csv('data/train/train1.txt',sep='\t',header=None)
# test_df_1.to_csv('data/test/test1.txt',sep='\t',header=None)

# train_df_2.to_csv('data/train/train2.txt',sep='\t',header=None)
# test_df_2.to_csv('data/test/test2.txt',sep='\t',header=None)

# train_df_3.to_csv('data/train/train3.txt',sep='\t',header=None)
# test_df_3.to_csv('data/test/test3.txt',sep='\t',header=None)

In [None]:
# # pandas adds a trailing newline, so remove it 
# files = ['data/train/train1.txt', 'data/test/test1.txt', 'data/train/train2.txt', 'data/test/test2.txt', 'data/train/train3.txt', 'data/test/test3.txt']

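# for filepath in files:
#   with open(filepath, 'r') as f:
#       data = f.read()
#   # reopen for writing only after the read handle is closed
#   with open(filepath, 'w') as w:
#       # strip trailing newlines only, so a re-run cannot eat real text
#       w.write(data.rstrip('\n'))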
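
The split cell in this section draws fresh permutations on every run, so the 3 splits change each time it is executed. A minimal sketch of a reproducible variant follows, using the standard `random` module rather than NumPy. `split_indices` and the seeds are hypothetical, and the row count of 2404 is an assumption chosen to match the 2163 training and 241 test examples reported by the tagging progress bars.

```python
import random


def split_indices(n_rows, train_fraction=0.90, seed=0):
    """Shuffle the row indices once and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * n_rows)
    return indices[:cut], indices[cut:]


# Three independent splits, one arbitrary seed each.
for seed in (1, 2, 3):
    train_idx, test_idx = split_indices(2404, seed=seed)
    print(seed, len(train_idx), len(test_idx))
```

The resulting index lists could be passed to `df.iloc[...]` in place of the slices of the shuffled frames.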

## Tokenise and Add Tags to Train and Test Data (Commented out)

In [None]:
# # tokenise

# import os
# if os.path.exists("data/train/tokenised/output1.txt"):
#   os.remove("data/train/tokenised/output1.txt")
# if os.path.exists("data/train/tokenised/output2.txt"):
#   os.remove("data/train/tokenised/output2.txt")
# if os.path.exists("data/train/tokenised/output3.txt"):
#   os.remove("data/train/tokenised/output3.txt")
# if os.path.exists("data/test/tokenised/output1.txt"):
#   os.remove("data/test/tokenised/output1.txt")
# if os.path.exists("data/test/tokenised/output2.txt"):
#   os.remove("data/test/tokenised/output2.txt")
# if os.path.exists("data/test/tokenised/output3.txt"):
#   os.remove("data/test/tokenised/output3.txt")

# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/train/train1.txt CACHE/ data/train/tokenised/output1.txt
# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/train/train2.txt CACHE/ data/train/tokenised/output2.txt
# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/train/train3.txt CACHE/ data/train/tokenised/output3.txt

# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/test/test1.txt CACHE/ data/test/tokenised/output1.txt
# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/test/test2.txt CACHE/ data/test/tokenised/output2.txt
# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/test/test3.txt CACHE/ data/test/tokenised/output3.txt



03/22/2022 11:11:37 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
03/22/2022 11:11:38 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at CACHE/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
ctrs:
CTR_EMPTY_REV 0
CTR_MULTIPLE_EDITS 0
CTR_FAILED_CLEANING 0
CTR_LOW_BLEU 0
CTR_LOW_LEVEN 0
CTR_TOO_MANY_1_TOKS 0
CTR_SPELLING 0
CTR_FALSE_POSITIVE 0
CTR_LENGTH_RATIO 0
CTR_CHEMISTRY 0
CTR_DUPS 0
CTR_ONLY_PUNC_CHANGED 0
CTR_INVALID_NUM_CHANGED_SENTS 0
CTR_NON_EDIT_CHUNKS 0
CTR_EDIT_CHANGED_NUM_SENTS 0
CTR_FAILED_TAGGING 0
03/22/2022 11:11:50 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Pac

In [None]:
# # add tags

# if os.path.exists("data/train/tagged/output_pos1.txt"):
#   os.remove("data/train/tagged/output_pos1.txt")
# if os.path.exists("data/train/tagged/output_pos2.txt"):
#   os.remove("data/train/tagged/output_pos2.txt")
# if os.path.exists("data/train/tagged/output_pos3.txt"):
#   os.remove("data/train/tagged/output_pos3.txt")
# if os.path.exists("data/test/tagged/output_pos1.txt"):
#   os.remove("data/test/tagged/output_pos1.txt")
# if os.path.exists("data/test/tagged/output_pos2.txt"):
#   os.remove("data/test/tagged/output_pos2.txt")
# if os.path.exists("data/test/tagged/output_pos3.txt"):
#   os.remove("data/test/tagged/output_pos3.txt")

# !python neutralizing-bias/harvest/add_tags.py data/train/tokenised/output1.txt data/train/tagged/output_pos1.txt
# !python neutralizing-bias/harvest/add_tags.py data/train/tokenised/output2.txt data/train/tagged/output_pos2.txt
# !python neutralizing-bias/harvest/add_tags.py data/train/tokenised/output3.txt data/train/tagged/output_pos3.txt

# !python neutralizing-bias/harvest/add_tags.py data/test/tokenised/output1.txt data/test/tagged/output_pos1.txt
# !python neutralizing-bias/harvest/add_tags.py data/test/tokenised/output2.txt data/test/tagged/output_pos2.txt
# !python neutralizing-bias/harvest/add_tags.py data/test/tokenised/output3.txt data/test/tagged/output_pos3.txt


100% 2163/2163 [00:52<00:00, 41.26it/s]
100% 2163/2163 [00:38<00:00, 56.30it/s]
100% 2163/2163 [00:39<00:00, 54.18it/s]
100% 241/241 [00:04<00:00, 52.57it/s]
100% 241/241 [00:04<00:00, 55.66it/s]
100% 241/241 [00:04<00:00, 54.57it/s]
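
The two cells above repeat the same remove-then-run pattern for each of the 6 files. As a sketch, assuming the directory layout used in those cells, the paths and commands can be generated in a loop. `harvest_jobs` and `remove_if_exists` are hypothetical helpers; the script names and arguments are copied from the cells above.

```python
import os

SPLITS = (1, 2, 3)
PARTS = ("train", "test")


def harvest_jobs():
    """Return (tokenise_cmd, tag_cmd, output_paths) for every part and split."""
    jobs = []
    for part in PARTS:
        for i in SPLITS:
            raw = f"data/{part}/{part}{i}.txt"
            tokenised = f"data/{part}/tokenised/output{i}.txt"
            tagged = f"data/{part}/tagged/output_pos{i}.txt"
            tokenise_cmd = ["python", "neutralizing-bias/harvest/gen_data_from_crawl.py",
                            raw, "CACHE/", tokenised]
            tag_cmd = ["python", "neutralizing-bias/harvest/add_tags.py",
                       tokenised, tagged]
            jobs.append((tokenise_cmd, tag_cmd, [tokenised, tagged]))
    return jobs


def remove_if_exists(paths):
    """Delete stale outputs so the scripts start from a clean slate."""
    for path in paths:
        if os.path.exists(path):
            os.remove(path)


# Printing only; to execute, call remove_if_exists(outputs) and then pass
# each command list to subprocess.run(cmd, check=True).
for tokenise_cmd, tag_cmd, outputs in harvest_jobs():
    print(" ".join(tokenise_cmd))
    print(" ".join(tag_cmd))
```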


## Run inference against the baseline model (Commented out)

In [None]:
# %cd neutralizing-bias/src

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/src


In [None]:
# run for testing data sets on baseline model

# import os
# if os.path.exists("../../baseline_model/baseline_model_inference_output.txt"):
#   os.remove("../../baseline_model/baseline_model_inference_output.txt")

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../baseline_model \
#        --test ../../data/test/tagged/output_pos1.txt\
#        --checkpoint model.ckpt \
#        --inference_output ../../baseline_model/baseline_model_inference_output.txt

03/22/2022 11:53:10 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../baseline_model/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
241it [00:00, 792.08it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
data <torch.utils.data.dataset.TensorDataset object at 0x7f159dd8a1d0>
241
  "num_layers={}".format(dropout, num_layers))
03/22/2022 11:53:12 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../baseline_model/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/22/2022 11:53:12 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../baseli

In [None]:
# if os.path.exists("../../baseline_model/baseline_model_inference_output2.txt"):
#   os.remove("../../baseline_model/baseline_model_inference_output2.txt")

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../baseline_model \
#        --test ../../data/test/tagged/output_pos2.txt\
#        --checkpoint model.ckpt \
#        --inference_output ../../baseline_model/baseline_model_inference_output2.txt

03/22/2022 11:58:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../baseline_model/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
241it [00:00, 748.37it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
data <torch.utils.data.dataset.TensorDataset object at 0x7f271febf2d0>
241
  "num_layers={}".format(dropout, num_layers))
03/22/2022 11:58:33 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../baseline_model/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/22/2022 11:58:33 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../baseli

In [None]:
# if os.path.exists("../../baseline_model/baseline_model_inference_output3.txt"):
#   os.remove("../../baseline_model/baseline_model_inference_output3.txt")

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../baseline_model \
#        --test ../../data/test/tagged/output_pos3.txt\
#        --checkpoint model.ckpt \
#        --inference_output ../../baseline_model/baseline_model_inference_output3.txt

03/22/2022 12:05:40 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../baseline_model/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
241it [00:00, 750.57it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
data <torch.utils.data.dataset.TensorDataset object at 0x7f753d16e250>
241
  "num_layers={}".format(dropout, num_layers))
03/22/2022 12:05:41 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../baseline_model/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/22/2022 12:05:41 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../baseli
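
The three baseline cells above differ only in the test file and the output file. A minimal sketch of building the same command from parameters is shown here; `inference_command` and `baseline_output` are hypothetical helpers, and the flags are copied from the cells above rather than from the script's documentation.

```python
def inference_command(test_file, checkpoint, working_dir, output_file):
    """Build the argument list for joint/inference.py with this notebook's flags."""
    return [
        "python", "joint/inference.py",
        "--extra_features_top", "--pre_enrich", "--activation_hidden",
        "--test_batch_size", "1", "--bert_full_embeddings",
        "--debias_weight", "1.3", "--token_softmax",
        "--pointer_generator", "--coverage",
        "--working_dir", working_dir,
        "--test", test_file,
        "--checkpoint", checkpoint,
        "--inference_output", output_file,
    ]


def baseline_output(split):
    # In the cells above the first split's output file has no numeric suffix.
    suffix = "" if split == 1 else str(split)
    return f"../../baseline_model/baseline_model_inference_output{suffix}.txt"


for split in (1, 2, 3):
    cmd = inference_command(
        test_file=f"../../data/test/tagged/output_pos{split}.txt",
        checkpoint="model.ckpt",
        working_dir="../../baseline_model",
        output_file=baseline_output(split),
    )
    print(" ".join(cmd))
```

Passing a list of arguments instead of a single string also avoids the missing space before the trailing backslash on the `--test` lines.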

## Train the first fine tuned model (Commented out)

In [None]:
# import os
# if os.path.exists("../../fine_tuned_model/train1/results_initial.txt"):
#   os.remove("../../fine_tuned_model/train1/results_initial.txt")
# if os.path.exists("../../fine_tuned_model/train1/model_15.ckpt"):
#   os.remove("../../fine_tuned_model/train1/model_15.ckpt")
# if os.path.exists("../../fine_tuned_model/train1/results_15.txt"):
#   os.remove("../../fine_tuned_model/train1/results_15.txt")
# if os.path.exists("../../fine_tuned_model/train1/tagger.ckpt"):
#   os.remove("../../fine_tuned_model/train1/tagger.ckpt")

# !python -W ignore::UserWarning joint/train.py \
#        --train ../../data/train/tagged/output_pos1.txt\
#        --test ../../data/test/tagged/output_pos1.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint model.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../fine_tuned_model/train1

Streaming output truncated to the last 5000 lines.
100% 68/68 [00:57<00:00,  1.18it/s]
SAVING...
EVAL...
100% 16/16 [00:29<00:00,  1.87s/it]
EPOCH  39
TRAIN...
100% 68/68 [00:58<00:00,  1.16it/s]
SAVING...
EVAL...
100% 16/16 [00:29<00:00,  1.84s/it]


In [None]:
# if os.path.exists("../../fine_tuned_model/train2/results_initial.txt"):
#   os.remove("../../fine_tuned_model/train2/results_initial.txt")
# if os.path.exists("../../fine_tuned_model/train2/model_15.ckpt"):
#   os.remove("../../fine_tuned_model/train2/model_15.ckpt")
# if os.path.exists("../../fine_tuned_model/train2/results_15.txt"):
#   os.remove("../../fine_tuned_model/train2/results_15.txt")
# if os.path.exists("../../fine_tuned_model/train2/tagger.ckpt"):
#   os.remove("../../fine_tuned_model/train2/tagger.ckpt")

# !python -W ignore::UserWarning joint/train.py \
#        --train ../../data/train/tagged/output_pos2.txt\
#        --test ../../data/test/tagged/output_pos2.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint model.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../fine_tuned_model/train2

Streaming output truncated to the last 5000 lines.
100% 68/68 [00:58<00:00,  1.17it/s]
SAVING...
EVAL...
100% 16/16 [00:34<00:00,  2.15s/it]
EPOCH  14
TRAIN...
100% 68/68 [00:58<00:00,  1.17it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.93s/it]


In [None]:
# if os.path.exists("../../fine_tuned_model/train3/results_initial.txt"):
#   os.remove("../../fine_tuned_model/train3/results_initial.txt")
# if os.path.exists("../../fine_tuned_model/train3/model_15.ckpt"):
#   os.remove("../../fine_tuned_model/train3/model_15.ckpt")
# if os.path.exists("../../fine_tuned_model/train3/results_15.txt"):
#   os.remove("../../fine_tuned_model/train3/results_15.txt")
# if os.path.exists("../../fine_tuned_model/train3/tagger.ckpt"):
#   os.remove("../../fine_tuned_model/train3/tagger.ckpt")

# !python joint/train.py \
#        --train ../../data/train/tagged/output_pos3.txt\
#        --test ../../data/test/tagged/output_pos3.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint model.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../fine_tuned_model/train3

Streaming output truncated to the last 5000 lines.
100% 68/68 [00:58<00:00,  1.16it/s]
SAVING...
EVAL...
100% 16/16 [00:34<00:00,  2.14s/it]
EPOCH  14
TRAIN...
100% 68/68 [00:58<00:00,  1.16it/s]
SAVING...
EVAL...
100% 16/16 [00:34<00:00,  2.15s/it]


## Run inference on additional data using the fine tuned model (Commented out)

In [None]:
# %cd ..

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias


In [None]:
# %cd ..

/content/drive/MyDrive/DSMLBackUp/NLP


In [None]:
# import os
# if os.path.exists("data/augmented_data/initial_inference/output.txt"):
#   os.remove("data/augmented_data/initial_inference/output.txt")

# !python neutralizing-bias/harvest/gen_data_from_crawl.py data/extra_train_data.txt CACHE/ data/augmented_data/initial_inference/output.txt

03/22/2022 17:58:57 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
03/22/2022 17:58:57 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at CACHE/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
ctrs:
CTR_EMPTY_REV 0
CTR_MULTIPLE_EDITS 0
CTR_FAILED_CLEANING 0
CTR_LOW_BLEU 0
CTR_LOW_LEVEN 0
CTR_TOO_MANY_1_TOKS 0
CTR_SPELLING 0
CTR_FALSE_POSITIVE 0
CTR_LENGTH_RATIO 0
CTR_CHEMISTRY 0
CTR_DUPS 0
CTR_ONLY_PUNC_CHANGED 0
CTR_INVALID_NUM_CHANGED_SENTS 0
CTR_NON_EDIT_CHUNKS 0
CTR_EDIT_CHANGED_NUM_SENTS 0
CTR_FAILED_TAGGING 0


In [None]:
# if os.path.exists("data/augmented_data/initial_inference/output_pos.txt"):
#   os.remove("data/augmented_data/initial_inference/output_pos.txt")

# !python neutralizing-bias/harvest/add_tags.py data/augmented_data/initial_inference/output.txt data/augmented_data/initial_inference/output_pos.txt

100% 1302/1302 [00:20<00:00, 64.38it/s]


In [None]:
# %cd neutralizing-bias/src

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/src


In [None]:
# if os.path.exists("../../augmented_data_model/augmented_data_inference_output.txt"):
#   os.remove("../../augmented_data_model/augmented_data_inference_output.txt")

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../augmented_data_model \
#        --test  ../../data/augmented_data/initial_inference/output_pos.txt\
#        --checkpoint ../../fine_tuned_model/train3/model_15.ckpt \
#        --inference_output ../../augmented_data_model/augmented_data_inference_output.txt

03/22/2022 18:00:36 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../augmented_data_model/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
1302it [00:00, 8796.67it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
data <torch.utils.data.dataset.TensorDataset object at 0x7f1a2259d990>
1302
  "num_layers={}".format(dropout, num_layers))
03/22/2022 18:00:37 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../augmented_data_model/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/22/2022 18:00:37 - INFO - pytorch_pretrained_bert.modeling -   extracting archive fi

## Prune inference results (Commented out)

### Set up

In [None]:
# !pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found ex

### File load

In [None]:
# import pandas as pd
# import numpy as np

In [None]:
# from transformers import GPT2LMHeadModel, GPT2TokenizerFast, BertModel, BertTokenizerFast
# # from datasets import load_dataset
# import torch
# from tqdm import tqdm
# import matplotlib.pyplot as plt

In [None]:
#%cd ../../augmented_data_model/

/content/drive/.shortcut-targets-by-id/1ljCwKb-Goe-cPSDiPXWnVYPtmxEb7sDE/NLP/augmented_data_model


In [None]:
# def txt_to_df(filename, row_number):

#   inputs = []
#   golds = []
#   preds = []
#   gold_dist = []
#   pred_dist = []
#   gold_tok = []
#   pred_tok = []

#   with open(filename, "r") as file:
#     for i in range(row_number):
#       # print('i', i)
#       file.readline()
#       inp = file.readline().split("\t")[1].split("'")[1]
#       inputs.append(inp)
#       # print('inp', inp)
      
#       gold = file.readline().split("\t")[1].split("'")[1]
#       # print('gold', gold)
#       golds.append(gold)
      
#       pred = file.readline().split("\t")[1].split("'")[1]
#       # print('pred', pred)
#       preds.append(pred)
      
#       gold_dis = file.readline().split("\t")[1].strip()
#       # print('gold_dis', gold_dis)
#       gold_dist.append(gold_dis)
      
#       pred_dis = file.readline().split("\t")[1].strip()
#       # print('pred_dis', pred_dis)
#       pred_dist.append(pred_dis)
      
#       gold_t = file.readline().split("\t")[1].strip()
#       # print('gold_t', gold_t)
#       gold_tok.append(gold_t)
      
#       pred_t = file.readline().split("\t")[1].strip()
#       # print('pred_t', pred_t)
#       pred_tok.append(pred_t) 

#   df = pd.DataFrame(
#     {'inputs': inputs,
#      'golds': golds,
#      'predictions': preds,
#      'gold distribution': gold_dist,
#      'predicted distribution': pred_dist,
#      'gold tokens': gold_tok,
#      'predicted tokens': pred_tok
#     })
  
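#   # merge WordPiece continuations ("place ##holder" -> "placeholder");
#   # the pattern is a regex, so use a raw string and pass regex=True explicitly
#   df.inputs = df.inputs.str.replace(r"\s##", "", regex=True)
#   df.predictions = df.predictions.str.replace(r"\s##", "", regex=True)
#   df.golds = df.golds.str.replace(r"\s##", "", regex=True)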

#   return df
    

In [None]:
# df = txt_to_df("augmented_data_inference_output.txt", 1302)
# df



Unnamed: 0,inputs,golds,predictions,gold distribution,predicted distribution,gold tokens,predicted tokens
0,i doubt any women will pass te buds and 99 9 o...,placeholder,t dont te buds and 99 9 of seal training when...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['don', ""'"", 't', 'don', '##t']]"
1,im not a fan of women football commentators,placeholder,im not a fan of people,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['people']]
2,im of the mindset that girls dont get laid guy...,placeholder,im of the mindset people people dont get laid ...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['people', 'people'], ['people']]"
3,theresa m is weak if we had a male pm would he...,placeholder,theresa m is weak if we had a new pm would the...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['new'], ['they', 'would', 'they']]"
4,im old school and woke twitter will idc girls ...,placeholder,im old school and woke woke will idc airing ou...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['woke']]
...,...,...,...,...,...,...,...
1297,lmfaooo rt women should never be bored theres ...,placeholder,lmfa should be be bored theres always somethin...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['be'], ['something']]"
1298,i just cannot take womens basketball seriously,placeholder,i just cannot take kids seriously,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['kids']]
1299,the sound of female hockey fans screeching at ...,placeholder,the sound of amateur hockey fans screeching at...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['amateur']]
1300,i really don t get it when girls are just like...,placeholder,i really don t get it when people are just lik...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']]
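
The `txt_to_df` helper above finishes by merging BERT WordPiece continuations back into whole words, which is why `['place', '##holder']` appears as `placeholder` in the table. A minimal sketch of that one step using only the standard library is shown below; `merge_wordpieces` is a hypothetical name, not a function from the repository. In pandas the same pattern should be passed with `regex=True`, since recent pandas versions treat the pattern as a literal string by default.

```python
import re


def merge_wordpieces(text: str) -> str:
    """Join WordPiece continuation tokens ("##xyz") onto the previous token."""
    # r"\s##" matches the whitespace before a continuation plus the "##" marker,
    # so deleting the match glues the two pieces together.
    return re.sub(r"\s##", "", text)


print(merge_wordpieces("place ##holder"))  # placeholder
print(merge_wordpieces("no continuation tokens here"))
```

Text without any `##` markers passes through unchanged, so the step is safe to apply to every column.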


In [None]:
# # check if inputs and preds are exactly the same - exclude these

# df2 = df[df['inputs'] != df['predictions']]
# df2

Unnamed: 0,inputs,golds,predictions,gold distribution,predicted distribution,gold tokens,predicted tokens
0,i doubt any women will pass te buds and 99 9 o...,placeholder,t dont te buds and 99 9 of seal training when...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['don', ""'"", 't', 'don', '##t']]"
1,im not a fan of women football commentators,placeholder,im not a fan of people,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['people']]
2,im of the mindset that girls dont get laid guy...,placeholder,im of the mindset people people dont get laid ...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['people', 'people'], ['people']]"
3,theresa m is weak if we had a male pm would he...,placeholder,theresa m is weak if we had a new pm would the...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['new'], ['they', 'would', 'they']]"
4,im old school and woke twitter will idc girls ...,placeholder,im old school and woke woke will idc airing ou...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['woke']]
...,...,...,...,...,...,...,...
1297,lmfaooo rt women should never be bored theres ...,placeholder,lmfa should be be bored theres always somethin...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['be'], ['something']]"
1298,i just cannot take womens basketball seriously,placeholder,i just cannot take kids seriously,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['kids']]
1299,the sound of female hockey fans screeching at ...,placeholder,the sound of amateur hockey fans screeching at...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['amateur']]
1300,i really don t get it when girls are just like...,placeholder,i really don t get it when people are just lik...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']]


In [None]:
# df2.predictions = df2.predictions.str.replace('\n', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [None]:
# df2.inputs = df2.inputs.str.replace('"', '')
# df2.predictions = df2.predictions.str.replace('"', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [None]:
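# # assign the result back instead of calling replace(inplace=True) on a column accessor,
# # which is not guaranteed to modify df2
# df2['predictions'] = df2['predictions'].replace('', np.nan)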

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)


In [None]:
# df2.dropna(subset=['predictions'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


In [None]:
# df2

Unnamed: 0,inputs,golds,predictions,gold distribution,predicted distribution,gold tokens,predicted tokens
0,i doubt any women will pass te buds and 99 9 o...,placeholder,t dont te buds and 99 9 of seal training when...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['don', ""'"", 't', 'don', '##t']]"
1,im not a fan of women football commentators,placeholder,im not a fan of people,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['people']]
2,im of the mindset that girls dont get laid guy...,placeholder,im of the mindset people people dont get laid ...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['people', 'people'], ['people']]"
3,theresa m is weak if we had a male pm would he...,placeholder,theresa m is weak if we had a new pm would the...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['new'], ['they', 'would', 'they']]"
4,im old school and woke twitter will idc girls ...,placeholder,im old school and woke woke will idc airing ou...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['woke']]
...,...,...,...,...,...,...,...
1297,lmfaooo rt women should never be bored theres ...,placeholder,lmfa should be be bored theres always somethin...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['be'], ['something']]"
1298,i just cannot take womens basketball seriously,placeholder,i just cannot take kids seriously,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['kids']]
1299,the sound of female hockey fans screeching at ...,placeholder,the sound of amateur hockey fans screeching at...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['amateur']]
1300,i really don t get it when girls are just like...,placeholder,i really don t get it when people are just lik...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']]
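
The cleaning cells above apply three rules in sequence: drop rows where the model left the input unchanged, strip newlines and double quotes, and drop rows whose prediction ends up empty. As a rough, dependency-free sketch of the same rules (the function name `clean_pairs` and the sample sentences are made up for illustration, and the notebook itself does this with pandas):

```python
def clean_pairs(pairs):
    """Return the (input, prediction) pairs that survive the cleaning rules."""
    kept = []
    for source, prediction in pairs:
        # Rule 1: the model made no edit, so the pair carries no training signal.
        if source == prediction:
            continue
        # Rule 2: strip double quotes from both sides and newlines from the prediction.
        source = source.replace('"', '')
        prediction = prediction.replace('\n', '').replace('"', '')
        # Rule 3: an empty prediction cannot be used as a target sentence.
        if prediction == '':
            continue
        kept.append((source, prediction))
    return kept


sample = [
    ("the cat sat", "the cat sat"),    # unchanged, dropped
    ("the cat sat", "the\n dog sat"),  # newline removed, kept
    ("a sentence", "\n"),              # empty after cleaning, dropped
    ('say "hi"', 'say "bye"'),         # quotes removed, kept
]
for source, prediction in clean_pairs(sample):
    print(source, "->", prediction)
```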


### Perplexity

The perplexity computation below is adapted from the Hugging Face sample code: https://huggingface.co/docs/transformers/perplexity
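
For orientation, perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the language model finds the sentence more predictable. The sketch below illustrates only that definition with plain Python; it assumes the per-token log-probabilities are already available, whereas the notebook obtains them from `gpt2-large`. The function name `perplexity` is ours.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# If every token is assigned probability 0.25, the model is as uncertain as a
# uniform choice among four options, so the perplexity comes out close to 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```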

In [None]:
# device = "cuda"
# model_id = "gpt2-large"
# model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
# tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

Downloading:   0%|          | 0.00/666 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.02G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

In [None]:
# preds_array = list(df2['predictions'])

In [None]:
# # takes about 5 mins

# ppls = []
# max_length = model.config.n_positions
# stride = 6

# for pred in preds_array:
#   encoding2 = tokenizer(pred, return_tensors="pt")
#   seq_len = encoding2.input_ids.size(1)

#   nlls = []
#   # `start` is the offset of the current window; it no longer shadows the outer loop variable
#   for start in tqdm(range(0, seq_len, stride)):
#       begin_loc = max(start + stride - max_length, 0)
#       end_loc = min(start + stride, seq_len)
#       trg_len = end_loc - start  # may be different from stride on last loop
#       input_ids = encoding2.input_ids[:, begin_loc:end_loc].to(device)
#       target_ids = input_ids.clone()
#       target_ids[:, :-trg_len] = -100  # context tokens are ignored by the loss

#       with torch.no_grad():
#           outputs = model(input_ids, labels=target_ids)
#           neg_log_likelihood = outputs[0] * trg_len

#       nlls.append(neg_log_likelihood)

#   # perplexity of this prediction: exp(total NLL / number of tokens)
#   ppls.append((torch.exp(torch.stack(nlls).sum() / end_loc)).item())

100%|██████████| 3/3 [00:00<00:00,  9.21it/s]
100%|██████████| 1/1 [00:00<00:00, 34.99it/s]
100%|██████████| 3/3 [00:00<00:00, 31.97it/s]
100%|██████████| 4/4 [00:00<00:00, 26.53it/s]
100%|██████████| 3/3 [00:00<00:00, 32.80it/s]
100%|██████████| 5/5 [00:00<00:00, 28.54it/s]
100%|██████████| 2/2 [00:00<00:00, 33.97it/s]
100%|██████████| 2/2 [00:00<00:00, 33.07it/s]
100%|██████████| 2/2 [00:00<00:00, 33.97it/s]
100%|██████████| 2/2 [00:00<00:00, 35.21it/s]
100%|██████████| 2/2 [00:00<00:00, 35.77it/s]
100%|██████████| 5/5 [00:00<00:00, 30.40it/s]
100%|██████████| 4/4 [00:00<00:00, 32.10it/s]
100%|██████████| 4/4 [00:00<00:00, 31.86it/s]
100%|██████████| 1/1 [00:00<00:00, 33.28it/s]
100%|██████████| 4/4 [00:00<00:00, 29.93it/s]
100%|██████████| 4/4 [00:00<00:00, 30.94it/s]
100%|██████████| 2/2 [00:00<00:00, 30.94it/s]
100%|██████████| 2/2 [00:00<00:00, 34.20it/s]
100%|██████████| 2/2 [00:00<00:00, 34.99it/s]
100%|██████████| 3/3 [00:00<00:00, 30.33it/s]
100%|██████████| 2/2 [00:00<00:00,

In [None]:
# ppls_np = np.array(ppls)

In [None]:
# ppls_np.shape

(1274,)

In [None]:
# df2['perp'] = ppls

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
# df2

Unnamed: 0,inputs,golds,predictions,gold distribution,predicted distribution,gold tokens,predicted tokens,perp
0,i doubt any women will pass te buds and 99 9 o...,placeholder,t dont te buds and 99 9 of seal training when...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['don', ""'"", 't', 'don', '##t']]",4800.597656
1,im not a fan of women football commentators,placeholder,im not a fan of people,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['people']],62.881348
2,im of the mindset that girls dont get laid guy...,placeholder,im of the mindset people people dont get laid ...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['people', 'people'], ['people']]",579.217773
3,theresa m is weak if we had a male pm would he...,placeholder,theresa m is weak if we had a new pm would the...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['new'], ['they', 'would', 'they']]",404.117493
4,im old school and woke twitter will idc girls ...,placeholder,im old school and woke woke will idc airing ou...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['woke']],987.763672
...,...,...,...,...,...,...,...,...
1297,lmfaooo rt women should never be bored theres ...,placeholder,lmfa should be be bored theres always somethin...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['be'], ['something']]",294.997803
1298,i just cannot take womens basketball seriously,placeholder,i just cannot take kids seriously,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['kids']],788.095764
1299,the sound of female hockey fans screeching at ...,placeholder,the sound of amateur hockey fans screeching at...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['amateur']],416.007233
1300,i really don t get it when girls are just like...,placeholder,i really don t get it when people are just lik...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']],272.012604


In [None]:
# np.quantile(ppls_np,0.43)

256.36247985839844
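
The value above is the 0.43 quantile of the perplexities, and the next cell keeps only rows whose perplexity falls below it. A small sketch of that cut-off logic without NumPy is given here, assuming linear interpolation between neighbouring values (the default method of `np.quantile`); `quantile` and `keep_below_quantile` are illustrative names.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(position)                      # floor, since position >= 0
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def keep_below_quantile(rows, scores, q):
    """Keep the rows whose score is strictly below the q-quantile of all scores."""
    cutoff = quantile(scores, q)
    return [row for row, score in zip(rows, scores) if score < cutoff]


rows = ["a", "b", "c", "d", "e"]
scores = [10.0, 20.0, 30.0, 40.0, 50.0]
print(keep_below_quantile(rows, scores, 0.5))  # ['a', 'b']
```

Because the comparison is strict, roughly the lowest 43% of predictions by perplexity survive when `q` is 0.43.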

In [None]:
# df3 = df2[ppls_np<np.quantile(ppls_np,0.43)]

In [None]:
# df3

Unnamed: 0,inputs,golds,predictions,gold distribution,predicted distribution,gold tokens,predicted tokens,perp
1,im not a fan of women football commentators,placeholder,im not a fan of people,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[['people']],62.881348
5,i just dont find it right when girls nowadays ...,placeholder,i just dont find it right when kids nowadays a...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['kids']],147.197342
6,abcdefg means a boy can do everything for a girl,placeholder,abcdefg means a person can do everything for a...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['person'], ['anyone']]",243.968903
7,i dont think girls should have an opinion on f...,placeholder,i dont think people should have an opinion on ...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']],84.755745
9,i think boys should always always make the fir...,placeholder,i think anyone can make the first move,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['anyone', 'can']]",95.506981
...,...,...,...,...,...,...,...,...
1291,do you know anything about sports rt women tal...,placeholder,do you know anything about sports people talki...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]",[['people']],186.064667
1294,you have to wear a dress,placeholder,you have a dress,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[['place', '##holder']]",[],142.173340
1296,a lot of females try to think like a man to av...,placeholder,a lot of people try to think too take a chance...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['people'], ['too']]",254.835434
1297,lmfaooo rt women should never be bored theres ...,placeholder,lmfa should be be bored theres always somethin...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[['place', '##holder']]","[['be'], ['something']]",294.997803


In [None]:
# df3.to_csv("augmented_data_perplexity_df.csv")

### Perspective API

The Perspective API estimates how likely a statement is to be perceived as toxic and returns a score between 0 and 1. We used it to prune our inference results, keeping only predictions with a toxicity score of at most 0.5.

Documentation for the API can be found here: https://www.perspectiveapi.com/
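
To make the request and response shapes used in this section easier to follow, here is a network-free sketch. The request body and the path to the score mirror the fields the cells below read; the sample response is hand-written for illustration, and the environment variable name `PERSPECTIVE_API_KEY` and all function names are assumptions rather than part of the API.

```python
import os


def load_api_key(env_var="PERSPECTIVE_API_KEY"):
    """Read the API key from the environment (variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before calling the Perspective API")
    return key


def build_request(text):
    """Request body asking only for the TOXICITY attribute."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }


def toxicity_from_response(response):
    """Extract the toxicity value from the first span score of a response."""
    return response["attributeScores"]["TOXICITY"]["spanScores"][0]["score"]["value"]


def is_acceptable(score, threshold=0.5):
    """Keep a prediction when its toxicity is at most the threshold."""
    return score <= threshold


# Hand-written response containing only the fields read above.
sample_response = {
    "attributeScores": {"TOXICITY": {"spanScores": [{"score": {"value": 0.12}}]}}
}
score = toxicity_from_response(sample_response)
print(build_request("an example sentence"))
print(score, is_acceptable(score))
```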

In [None]:
# from googleapiclient import discovery
# import json
# import time
# import pandas as pd

In [None]:
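# import os
# # read the key from an environment variable (the variable name is our choice) instead of hard-coding it
# API_KEY = os.environ["PERSPECTIVE_API_KEY"]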

# client = discovery.build(
#   "commentanalyzer",
#   "v1alpha1",
#   developerKey=API_KEY,
#   discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1"
# )


In [None]:
#df3 = pd.read_csv("augmented_data_perplexity_df.csv")
# df3.drop([461],inplace=True)

In [None]:
# preds_array = list(df3['predictions'])

In [None]:
# toxicities_ = []

# for i, p in enumerate(preds_array):
#   analyze_request = {
#     'comment': { 'text': p },
#     'requestedAttributes': {'TOXICITY': {}},
#     'doNotStore': True,
#   }

#   response = client.comments().analyze(body=analyze_request).execute()
#   tox = response["attributeScores"]["TOXICITY"]["spanScores"][0]["score"]["value"]
#   print(i)  # progress counter
#   toxicities_.append(tox)
#   time.sleep(1)  # pause between requests

0
1
2
...
84


In [None]:
# len(toxicities)

461
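
The scoring loop above has no error handling, so a single failed request aborts the run. One possible alternative, sketched below under the assumption that a failed request should simply be recorded and skipped, keeps one result per input so the scores stay aligned with the DataFrame rows. `score_all` and `fake_scorer` are hypothetical; in the notebook the scorer would wrap the `client.comments().analyze(...)` call.

```python
import time


def score_all(texts, scorer, delay_seconds=1.0, sleep=time.sleep):
    """Score every text, recording failures instead of stopping.

    Returns (scores, errors): scores has one entry per text (None on failure),
    errors is a list of (index, message) tuples.
    """
    scores = []
    errors = []
    for index, text in enumerate(texts):
        try:
            scores.append(scorer(text))
        except Exception as exc:  # deliberately broad: any failure is logged and skipped
            scores.append(None)
            errors.append((index, repr(exc)))
        sleep(delay_seconds)  # pause between requests
    return scores, errors


def fake_scorer(text):
    """Stand-in for the API call: fails on blank text, otherwise returns a dummy score."""
    if not text.strip():
        raise ValueError("empty comment")
    return 0.1 * len(text.split())


scores, errors = score_all(["two words", "", "one"], fake_scorer, delay_seconds=0)
print(scores)
print(errors)
```

Rows with a `None` score can then be dropped or retried in one place, rather than splitting the DataFrame around the failing index.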

In [None]:
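# # `toxicities` holds the scores for the first 461 rows, `toxicities_` the scores for the remaining rows
# df = df3[:461]
# df_ = df3[462:]
# df['toxicity'] = toxicities
# df_['toxicity'] = toxicities_
# df3_new = pd.concat([df, df_])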

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [None]:
# df3=df3_new


In [None]:
# df4 = df3.loc[df3['toxicity'] <= 0.5]


In [None]:
# train_df4 = df4[['inputs', 'predictions']]

In [None]:
# train_df4

Unnamed: 0,inputs,predictions
0,im not a fan of women football commentators,im not a fan of people
2,abcdefg means a boy can do everything for a girl,abcdefg means a person can do everything for a...
3,i dont think girls should have an opinion on f...,i dont think people should have an opinion on ...
4,i think boys should always always make the fir...,i think anyone can make the first move
5,weakness is a woman issue,weakness is not a woman
...,...,...
543,women and their big handbags is the problem of...,human and their excessive luggage is the probl...
544,these girls are pretty awful,these people are pretty great
545,do you know anything about sports rt women tal...,do you know anything about sports people talki...
546,you have to wear a dress,you have a dress


In [None]:
# train_df4.to_csv("../../data/augmented_data/pruned_training_data.txt", index=True, header=False, sep="\t")

## Train second iteration of fine tuned model (Commented out)

In [None]:
# %cd neutralizing-bias/harvest

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/harvest


In [None]:
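# filepath = "../../data/augmented_data/pruned_training_data.txt"
# # remove the trailing newline that to_csv writes at the end of the file;
# # rstrip is safe to re-run, whereas data[:-1] would cut a real character the second time
# with open(filepath, 'r') as f:
#     data = f.read()
# with open(filepath, 'w') as w:
#     w.write(data.rstrip('\n'))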

In [None]:
# import os
# if os.path.exists("../../data/augmented_data/training/training_output.txt"):
#   os.remove("../../data/augmented_data/training/training_output.txt")

# !python gen_data_from_crawl.py ../../data/augmented_data/pruned_training_data.txt CACHE/ ../../data/augmented_data/training/training_output.txt

03/23/2022 11:54:23 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
03/23/2022 11:54:23 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at CACHE/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
ctrs:
CTR_EMPTY_REV 0
CTR_MULTIPLE_EDITS 0
CTR_FAILED_CLEANING 0
CTR_LOW_BLEU 0
CTR_LOW_LEVEN 0
CTR_TOO_MANY_1_TOKS 0
CTR_SPELLING 0
CTR_FALSE_POSITIVE 0
CTR_LENGTH_RATIO 0
CTR_CHEMISTRY 0
CTR_DUPS 0
CTR_ONLY_PUNC_CHANGED 0
CTR_INVALID_NUM_CHANGED_SENTS 0
CTR_NON_EDIT_CHUNKS 0
CTR_EDIT_CHANGED_NUM_SENTS 0
CTR_FAILED_TAGGING 0


In [None]:
# if os.path.exists("../../data/augmented_data/training/training_output_pos.txt"):
#   os.remove("../../data/augmented_data/training/training_output_pos.txt")

# !python add_tags.py ../../data/augmented_data/training/training_output.txt ../../data/augmented_data/training/training_output_pos.txt

100% 446/446 [00:07<00:00, 61.51it/s]


In [None]:
# %cd ../../neutralizing-bias/src

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/src


In [None]:
# if os.path.exists("../../augmented_data_model/train1/results_initial.txt"):
#   os.remove("../../augmented_data_model/train1/results_initial.txt")
# if os.path.exists("../../augmented_data_model/train1/model_15.ckpt"):
#   os.remove("../../augmented_data_model/train1/model_15.ckpt")
# if os.path.exists("../../augmented_data_model/train1/results_15.txt"):
#   os.remove("../../augmented_data_model/train1/results_15.txt")
# if os.path.exists("../../augmented_data_model/train1/tagger.ckpt"):
#   os.remove("../../augmented_data_model/train1/tagger.ckpt")

# !python joint/train.py \
#        --train ../../data/augmented_data/training/training_output_pos.txt\
#        --test ../../data/test/tagged/output_pos1.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint ../../fine_tuned_model/train1/model_15.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../augmented_data_model/train1

Traceback (most recent call last):
  File "joint/train.py", line 34, in <module>
    import seq2seq.model as seq2seq_model
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 818, in get_code
  File "<frozen importlib._bootstrap_external>", line 916, in get_data
KeyboardInterrupt
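
Each training cell in this section starts by deleting the same four stale files from its working directory. A possible way to factor that pattern out is sketched below; `stale_outputs` and `remove_if_exists` are hypothetical helpers, and the file names are the ones used in the cells of this section.

```python
import os
import tempfile


def stale_outputs(working_dir, epochs=15):
    """Paths of the files a previous training run leaves in working_dir."""
    names = (
        "results_initial.txt",
        f"model_{epochs}.ckpt",
        f"results_{epochs}.txt",
        "tagger.ckpt",
    )
    return [os.path.join(working_dir, name) for name in names]


def remove_if_exists(paths):
    """Delete every path that exists and return the list of deleted paths."""
    removed = []
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demonstration in a temporary directory so no real checkpoints are touched.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "tagger.ckpt"), "w").close()
    print(remove_if_exists(stale_outputs(demo_dir)))
```

In the notebook this would be called once per fold, for example with `../../augmented_data_model/train1` as the working directory.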


In [None]:
# if os.path.exists("../../augmented_data_model/train2/results_initial.txt"):
#   os.remove("../../augmented_data_model/train2/results_initial.txt")
# if os.path.exists("../../augmented_data_model/train2/model_15.ckpt"):
#   os.remove("../../augmented_data_model/train2/model_15.ckpt")
# if os.path.exists("../../augmented_data_model/train2/results_15.txt"):
#   os.remove("../../augmented_data_model/train2/results_15.txt")
# if os.path.exists("../../augmented_data_model/train2/tagger.ckpt"):
#   os.remove("../../augmented_data_model/train2/tagger.ckpt")

# !python joint/train.py \
#        --train ../../data/augmented_data/training/training_output_pos.txt\
#        --test ../../data/test/tagged/output_pos2.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint ../../fine_tuned_model/train2/model_15.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../augmented_data_model/train2

Streaming output truncated to the last 5000 lines.
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.88s/it]
EPOCH  7
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.91s/it]
EPOCH  8
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.89s/it]
EPOCH  9
TRAIN...
100% 14/14 [00:12<00:00,  1.10it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.88s/it]
EPOCH  10
TRAIN...
100% 14/14 [00:12<00:00,  1.10it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.89s/it]
EPOCH  11
TRAIN...
100% 14/14 [00:12<00:00,  1.11it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.89s/it]
EPOCH  12
TRAIN...
100% 14/14 [00:13<00:00,  1.07it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.88s/it]
EPOCH  13
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:30<00:00,  1.88s/it]
EPOCH  14
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [

In [None]:
# if os.path.exists("../../augmented_data_model/train3/results_initial.txt"):
#   os.remove("../../augmented_data_model/train3/results_initial.txt")
# if os.path.exists("../../augmented_data_model/train3/model_15.ckpt"):
#   os.remove("../../augmented_data_model/train3/model_15.ckpt")
# if os.path.exists("../../augmented_data_model/train3/results_15.txt"):
#   os.remove("../../augmented_data_model/train3/results_15.txt")
# if os.path.exists("../../augmented_data_model/train3/tagger.ckpt"):
#   os.remove("../../augmented_data_model/train3/tagger.ckpt")

# !python joint/train.py \
#        --train ../../data/augmented_data/training/training_output_pos.txt\
#        --test ../../data/test/tagged/output_pos3.txt\
#        --extra_features_top --pre_enrich --activation_hidden --tagging_pretrain_epochs 3 \
#        --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --checkpoint ../../fine_tuned_model/train3/model_15.ckpt \
#        --pointer_generator --coverage \
#        --epochs 15 \
#        --working_dir ../../augmented_data_model/train3

Streaming output truncated to the last 5000 lines.
100% 14/14 [00:12<00:00,  1.08it/s]
SAVING...
EVAL...
100% 16/16 [00:34<00:00,  2.15s/it]
EPOCH  7
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:33<00:00,  2.11s/it]
EPOCH  8
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:35<00:00,  2.20s/it]
EPOCH  9
TRAIN...
100% 14/14 [00:14<00:00,  1.04s/it]
SAVING...
EVAL...
100% 16/16 [00:33<00:00,  2.11s/it]
EPOCH  10
TRAIN...
100% 14/14 [00:12<00:00,  1.10it/s]
SAVING...
EVAL...
100% 16/16 [00:33<00:00,  2.11s/it]
EPOCH  11
TRAIN...
100% 14/14 [00:12<00:00,  1.11it/s]
SAVING...
EVAL...
100% 16/16 [00:36<00:00,  2.28s/it]
EPOCH  12
TRAIN...
100% 14/14 [00:13<00:00,  1.05it/s]
SAVING...
EVAL...
100% 16/16 [00:33<00:00,  2.11s/it]
EPOCH  13
TRAIN...
100% 14/14 [00:12<00:00,  1.09it/s]
SAVING...
EVAL...
100% 16/16 [00:33<00:00,  2.11s/it]
EPOCH  14
TRAIN...
100% 14/14 [00:12<00:00,  1.08it/s]
SAVING...
EVAL...
100% 16/16 [
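
The training cells above start by deleting stale outputs with repeated `os.path.exists` / `os.remove` pairs. The same clean-up can be sketched as a loop. The helper name `remove_stale_outputs` is hypothetical and not part of the repository; the file names are the ones used in the cells above.

```python
from pathlib import Path

def remove_stale_outputs(working_dir, filenames):
    """Delete each listed file under working_dir if it exists; return the removed paths."""
    removed = []
    for name in filenames:
        path = Path(working_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed

# File names written by a training run, as listed in the cells above.
STALE_FILES = ["results_initial.txt", "model_15.ckpt", "results_15.txt", "tagger.ckpt"]

# Example call (commented out, like the training cells):
# remove_stale_outputs("../../augmented_data_model/train3", STALE_FILES)
```

Files that are already missing are skipped, so the call is safe to repeat before every run.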

## Run inference on additional test data (Commented out)

In [None]:
# %cd ../../neutralizing-bias/harvest

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/harvest


In [None]:
# import os
# if os.path.exists("../../data/augmented_data/testing/test_output.txt"):
#   os.remove("../../data/augmented_data/testing/test_output.txt")

# !python gen_data_from_crawl.py ../../data/extra_test_data.txt CACHE/ ../../data/augmented_data/testing/test_output.txt

03/23/2022 13:01:49 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
03/23/2022 13:01:49 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at CACHE/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
ctrs:
CTR_EMPTY_REV 0
CTR_MULTIPLE_EDITS 0
CTR_FAILED_CLEANING 0
CTR_LOW_BLEU 0
CTR_LOW_LEVEN 0
CTR_TOO_MANY_1_TOKS 0
CTR_SPELLING 0
CTR_FALSE_POSITIVE 0
CTR_LENGTH_RATIO 0
CTR_CHEMISTRY 0
CTR_DUPS 0
CTR_ONLY_PUNC_CHANGED 0
CTR_INVALID_NUM_CHANGED_SENTS 0
CTR_NON_EDIT_CHUNKS 0
CTR_EDIT_CHANGED_NUM_SENTS 0
CTR_FAILED_TAGGING 0


In [None]:
# if os.path.exists("../../data/augmented_data/testing/test_output_pos.txt"):
#   os.remove("../../data/augmented_data/testing/test_output_pos.txt")

# !python add_tags.py ../../data/augmented_data/testing/test_output.txt ../../data/augmented_data/testing/test_output_pos.txt

100% 63/63 [00:00<00:00, 63.30it/s]


In [None]:
# %cd ../../neutralizing-bias/src

/content/drive/MyDrive/DSMLBackUp/NLP/neutralizing-bias/src


In [None]:
# if os.path.exists("../../augmented_data_model/test/output.txt"):
#   os.remove("../../augmented_data_model/test/output.txt")

# !python joint/inference.py \
#        --extra_features_top --pre_enrich --activation_hidden \
#        --test_batch_size 1 --bert_full_embeddings --debias_weight 1.3 --token_softmax \
#        --pointer_generator --coverage \
#        --working_dir ../../augmented_data_model/test \
#        --test  ../../data/augmented_data/testing/test_output_pos.txt\
#        --checkpoint ../../augmented_data_model/train3/model_15.ckpt \
#        --inference_output ../../augmented_data_model/test/output.txt

03/24/2022 10:41:20 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../../augmented_data_model/test/cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
63it [00:00, 142.53it/s]
SKIPPED  0
  torch.tensor(examples['categories'], dtype=torch.float))
  "num_layers={}".format(dropout, num_layers))
03/24/2022 10:41:22 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ../../augmented_data_model/test/cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/24/2022 10:41:22 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ../../augmented_data_model/test/cache/9c41111e2de84547a463fd392171

## Evaluation - Sentence embedding

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')


In [None]:
import pandas as pd

def txt_to_df(filename, row_number):
  """Read the first `row_number` records of an inference output file into a DataFrame.

  Each record is read as 8 lines: one skipped line, then seven tab-separated
  lines whose second field holds the value.
  """
  # One list per output column, in the order the fields appear in a record.
  inputs, golds, preds = [], [], []
  gold_dist, pred_dist = [], []
  gold_tok, pred_tok = [], []

  with open(filename, "r") as file:
    for _ in range(row_number):
      file.readline()  # skip the first line of the record

      # Input, gold and prediction: keep the text between the first pair of single quotes.
      inputs.append(file.readline().split("\t")[1].split("'")[1])
      golds.append(file.readline().split("\t")[1].split("'")[1])
      preds.append(file.readline().split("\t")[1].split("'")[1])

      # Distributions and tokens: keep the raw second field.
      gold_dist.append(file.readline().split("\t")[1].strip())
      pred_dist.append(file.readline().split("\t")[1].strip())
      gold_tok.append(file.readline().split("\t")[1].strip())
      pred_tok.append(file.readline().split("\t")[1].strip())

  df = pd.DataFrame(
    {'inputs': inputs,
     'golds': golds,
     'predictions': preds,
     'gold distribution': gold_dist,
     'predicted distribution': pred_dist,
     'gold tokens': gold_tok,
     'predicted tokens': pred_tok
    })

  # Merge WordPiece continuation tokens ("word ##piece" -> "wordpiece").
  # regex=True is explicit because the pandas default became a literal match in 2.0.
  for column in ('inputs', 'predictions', 'golds'):
    df[column] = df[column].str.replace(r"\s##", "", regex=True)

  return df
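
The final `str.replace` calls in `txt_to_df` undo WordPiece tokenisation: BERT marks a word-continuation piece with a leading `##`, so deleting the whitespace together with the `##` glues the pieces back onto the preceding token. A small sketch with the standard `re` module; the helper name and the sample sentence are made up for illustration.

```python
import re

def merge_wordpieces(text):
    """Join WordPiece continuation tokens (prefixed with '##') onto the preceding token."""
    return re.sub(r"\s##", "", text)

sample = "the chair ##woman open ##ed the meeting"
print(merge_wordpieces(sample))  # the chairwoman opened the meeting
```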
    

In [None]:
from numpy.linalg import norm
import pandas as pd
import numpy as np

def calc_similarity_score(filename, row_number):
  """Mean cosine similarity between gold and predicted sentence embeddings."""
  # Parse the raw inference output file into a DataFrame.
  df = txt_to_df(filename, row_number)

  # `model` is the SentenceTransformer loaded in an earlier cell.
  embedding_gold = model.encode(list(df['golds']))
  embedding_pred = model.encode(list(df['predictions']))

  # Row-wise cosine similarity: dot product divided by the product of the norms.
  dot_base = np.sum(embedding_gold * embedding_pred, axis=1)
  norm_b_g = norm(embedding_gold, axis=1)
  norm_b_p = norm(embedding_pred, axis=1)

  score_base = dot_base / (norm_b_g * norm_b_p)

  df['score'] = score_base

  return np.mean(list(df['score']))
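
The score above is the cosine similarity between each gold embedding and the matching prediction embedding, averaged over all rows. A dependency-free sketch of the same calculation on toy 2-D vectors (not real sentence embeddings) may help when sanity-checking the NumPy version; the function names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(golds, preds):
    """Average cosine similarity over aligned (gold, prediction) pairs."""
    scores = [cosine_similarity(g, p) for g, p in zip(golds, preds)]
    return sum(scores) / len(scores)

# Toy vectors standing in for embeddings.
golds = [[1.0, 0.0], [3.0, 4.0]]
preds = [[2.0, 0.0], [4.0, -3.0]]
print(mean_pairwise_cosine(golds, preds))  # 0.5: one parallel pair (1.0), one orthogonal pair (0.0)
```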

In [None]:
base_score_1 = calc_similarity_score("../../baseline_model/baseline_model_inference_output.txt", 241)
base_score_2 = calc_similarity_score("../../baseline_model/baseline_model_inference_output2.txt", 241)
base_score_3 = calc_similarity_score("../../baseline_model/baseline_model_inference_output3.txt", 241)

fine_score_1 = calc_similarity_score("../../fine_tuned_model/train1/results_15.txt", 241)
fine_score_2 = calc_similarity_score("../../fine_tuned_model/train2/results_15.txt", 241)
fine_score_3 = calc_similarity_score("../../fine_tuned_model/train3/results_15.txt", 241)

data_aug_score_1 = calc_similarity_score("../../augmented_data_model/train1/results_15.txt", 241)
data_aug_score_2 = calc_similarity_score("../../augmented_data_model/train2/results_15.txt", 241)
data_aug_score_3 = calc_similarity_score("../../augmented_data_model/train3/results_15.txt", 241)

base_score = np.mean([base_score_1, base_score_2, base_score_3])
fine_score = np.mean([fine_score_1, fine_score_2, fine_score_3])
data_aug_score = np.mean([data_aug_score_1, data_aug_score_2, data_aug_score_3])

baseline_std = np.std([base_score_1, base_score_2, base_score_3])
finetune_std = np.std([fine_score_1, fine_score_2, fine_score_3])
data_aug_std = np.std([data_aug_score_1, data_aug_score_2, data_aug_score_3])

print("The mean scores for the baseline model, finetuned model and augmented data models are:\n", base_score, fine_score, data_aug_score)
print("The standard deviations of the score for the baseline model, finetuned model and augmented data models are:\n", baseline_std, finetune_std, data_aug_std)



The mean scores for the baseline model, finetuned model and augmented data models are:
 0.7949536260549767 0.8229693966411457 0.7203110685909521
The standard deviations of the score for the baseline model, finetuned model and augmented data models are:
 0.002687227743908439 0.004286802242465429 0.007525355350385149
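
Note that `np.std` defaults to the population standard deviation (`ddof=0`). With only three runs per model, the sample standard deviation (`ddof=1`) is larger by a factor of sqrt(3/2), roughly 1.22. The sketch below uses the standard `statistics` module and three hypothetical scores (the per-run values are not printed above) to show the difference.

```python
import math
import statistics

# Hypothetical per-run scores, for illustration only.
run_scores = [0.79, 0.80, 0.81]

population_std = statistics.pstdev(run_scores)  # same definition as np.std(run_scores)
sample_std = statistics.stdev(run_scores)       # same definition as np.std(run_scores, ddof=1)

print(population_std, sample_std)
print(sample_std / population_std)  # ratio equals sqrt(3 / 2) for three values
```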


## Evaluation - Classification

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def train_svm(train_file_name):
  """Train a TF-IDF + linear SVM classifier that labels text as sexist (1) or non-sexist (0)."""
  df_train = pd.read_csv(train_file_name)
  nonsexist = df_train['cleaned text_x']
  sexist = df_train['cleaned text_y']
  labels_sexist = np.ones(len(sexist))
  labels_nonsexist = np.zeros(len(nonsexist))

  texts = pd.concat([sexist, nonsexist]).to_numpy()
  labels = np.concatenate((labels_sexist, labels_nonsexist))
  n = len(texts)

  # Shuffle texts and labels with the same permutation so the pairs stay aligned.
  mask = np.random.permutation(n)
  shuffled_text = texts[mask]
  shuffled_labels = labels[mask]

  text_clf = Pipeline([
      ('vect', CountVectorizer()),
      ('tfidf', TfidfTransformer()),
      ('clf', SVC(kernel='linear', C=1, random_state=1)),
  ])

  text_clf.fit(shuffled_text, shuffled_labels)

  return text_clf

svm_classifier = train_svm("../../data/manually_clean_sexism_data.csv")

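def classification_eval(svm_classifier, df_pred):
  """Return the fraction of model outputs that the classifier labels as non-sexist (0)."""
  pred_labels = svm_classifier.predict(df_pred.predictions)
  # The target label is 0 (non-sexist) for every row.
  true_label = np.zeros(len(pred_labels))
  acc = np.mean(pred_labels == true_label)
  return acc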

In [None]:
df = txt_to_df("../../baseline_model/baseline_model_inference_output.txt", 241)
acc_baseline_1 = classification_eval(svm_classifier,df)
df = txt_to_df("../../baseline_model/baseline_model_inference_output2.txt", 241)
acc_baseline_2 = classification_eval(svm_classifier,df)
df = txt_to_df("../../baseline_model/baseline_model_inference_output3.txt", 241)
acc_baseline_3 = classification_eval(svm_classifier,df)

df = txt_to_df("../../fine_tuned_model/train1/results_15.txt", 241)
acc_finetune_1 = classification_eval(svm_classifier,df)
df = txt_to_df("../../fine_tuned_model/train2/results_15.txt", 241)
acc_finetune_2 = classification_eval(svm_classifier,df)
df = txt_to_df("../../fine_tuned_model/train3/results_15.txt", 241)
acc_finetune_3 = classification_eval(svm_classifier,df)

df = txt_to_df("../../augmented_data_model/train1/results_15.txt", 241)
acc_data_aug_1 = classification_eval(svm_classifier,df)
df = txt_to_df("../../augmented_data_model/train2/results_15.txt", 241)
acc_data_aug_2 = classification_eval(svm_classifier,df)
df = txt_to_df("../../augmented_data_model/train3/results_15.txt", 241)
acc_data_aug_3 = classification_eval(svm_classifier,df)

df = txt_to_df("../../augmented_data_model/test/output.txt", 63)
acc_data_aug_test = classification_eval(svm_classifier,df)

baseline_acc = np.mean([acc_baseline_1, acc_baseline_2, acc_baseline_3])
finetune_acc = np.mean([acc_finetune_1, acc_finetune_2, acc_finetune_3])
data_aug_acc = np.mean([acc_data_aug_1, acc_data_aug_2, acc_data_aug_3])

baseline_std = np.std([acc_baseline_1, acc_baseline_2, acc_baseline_3])
finetune_std = np.std([acc_finetune_1, acc_finetune_2, acc_finetune_3])
data_aug_std = np.std([acc_data_aug_1, acc_data_aug_2, acc_data_aug_3])

print("The mean scores for the baseline model, finetuned model and augmented data models are:\n", baseline_acc, finetune_acc, data_aug_acc)
print("The standard deviations of the score for the baseline model, finetuned model and augmented data models are:\n", baseline_std, finetune_std, data_aug_std)
print("The accuracy for the extra test set is: ", acc_data_aug_test)



The mean scores for the baseline model, finetuned model and augmented data models are:
 0.6431535269709543 0.7219917012448133 0.9349930843706779
The standard deviations of the score for the baseline model, finetuned model and augmented data models are:
 0.008963680080785442 0.02710362094365895 0.020421608658690752
The accuracy for the extra test set is:  0.9682539682539683
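
In `classification_eval` the target label is 0 for every row, so the reported accuracy is the share of model outputs that the SVM classifies as non-sexist. For the extra test set of 63 sentences, 0.968 corresponds to 61 outputs labelled 0 and 2 labelled 1. A minimal sketch of the metric follows; the helper name and the label vector are hypothetical and only illustrate the arithmetic.

```python
def non_sexist_rate(predicted_labels):
    """Share of predictions equal to 0 (the non-sexist class)."""
    zeros = sum(1 for label in predicted_labels if label == 0)
    return zeros / len(predicted_labels)

# Hypothetical label vector: 61 outputs classified as 0 and 2 as 1.
labels = [0] * 61 + [1] * 2
rate = non_sexist_rate(labels)
print(rate)
```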




## Evaluation - Export data for human evaluation (Commented out)

In [None]:
# # export for human evaluation

# df = txt_to_df("../../baseline_model/baseline_model_inference_output.txt", 241)
# df.to_csv("../../data/human_eval/baseline_model_inference_output.csv")
# df = txt_to_df("../../fine_tuned_model/train1/results_15.txt", 241)
# df.to_csv("../../data/human_eval/fine_tune_results_15.csv")
# df = txt_to_df("../../augmented_data_model/train1/results_15.txt", 241)
# df.to_csv("../../data/human_eval/aug_data_results_15.csv")

