## Augmentation
**module**: lexsub.augment_conll.py
It is currently configured to use the basic models, without target-word injection techniques such as dynamic patterns.

```
augment_conllFile(input_file, output_file=None, model_type='bert-large-cased', N=2, substitute_lu=True, substitute_role=False, noun_max=0, ibn=False, proc_funcs=PROC_FUNCS_OPTIONS['lemma'], match_lugold=True, match_rolegold=False, verbose=False)
```
where:
- model_type: if a string, the predictor is loaded as defined in src.run_predict.load_predictor
- N: number of substitutes to expand with
- noun_max: a float giving the fraction of sentence tokens to substitute as nouns (e.g. 0.5 for 50%)
- ibn: whether to substitute tokens that are part of a role
- proc_funcs: a dictionary specifying the post-processing pipelines used for lu, role and noun; see PROC_FUNCS_OPTIONS in run_generate_substitutes.py for details. The **lemma** value is used for all experiments reported in the paper
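
As a rough illustration of how `noun_max` and `ibn` could interact, the sketch below picks noun tokens to substitute from a toy sentence. It is not the code in lexsub: the helper `select_noun_targets`, the `(word, pos, in_role)` triples, the round-down budget and the left-to-right choice are all assumptions made for this example.

```python
import math


def select_noun_targets(tokens, noun_max=0.0, ibn=False):
    """Return indices of noun tokens to substitute (hypothetical helper).

    tokens:   list of (word, pos_tag, in_role) triples.
    noun_max: fraction of the sentence length that may be substituted as nouns.
    ibn:      if False, nouns that are part of a role span are skipped.
    """
    # Assumption: the budget is the fraction of the sentence length, rounded down.
    budget = math.floor(noun_max * len(tokens))
    candidates = [
        i for i, (_, pos, in_role) in enumerate(tokens)
        if pos.startswith("NN") and (ibn or not in_role)
    ]
    # Assumption: candidates are taken left to right until the budget is used up.
    return candidates[:budget]


sentence = [
    ("The", "DT", True), ("chef", "NN", True), ("cooked", "VBD", False),
    ("dinner", "NN", True), ("in", "IN", False), ("the", "DT", False),
    ("kitchen", "NN", False), (".", ".", False),
]

print(select_noun_targets(sentence, noun_max=0.5, ibn=True))   # nouns inside roles allowed
print(select_noun_targets(sentence, noun_max=0.5, ibn=False))  # nouns inside roles skipped
```

With the default `noun_max=0` the budget is zero, so no nouns are substituted.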
    

Use **run_augment_conll.py** to run multiple experiments from a JSON file.
See an example below.


### 1. mask sentences
Mask the candidate words in each sentence.


In [None]:
!python -m lexsub.generate_masked_sentences \
--input_exp='nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01' \
--output_exp='expanded_nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc' \
--data_dir='data/open_sesame_v1_data/fn1.7' \
--substitute_lu=True \
--substitute_role=True \
--noun_max=0.5 \
--ibn=True \
--verbose=False

### 2. predict substitutes
Predict substitutes with a predictor and save predictions.pkl in each ```$output_exp/$preds_model``` directory, where ```$preds_model``` names the predictor model.

See the run_experiment notebook for details.
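
The directory layout described above can be sketched as a small loader. This is a minimal sketch, not the project's own loading code: the `root` argument and both helper names are assumptions, and the structure of the pickled object is not documented here, so the demo round-trips a toy dictionary in a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path


def predictions_path(root, output_exp, preds_model):
    """Build <root>/<output_exp>/<preds_model>/predictions.pkl (root is an assumption)."""
    return Path(root) / output_exp / preds_model / "predictions.pkl"


def load_predictions(root, output_exp, preds_model):
    """Unpickle the saved predictions. Only unpickle files you trust."""
    with predictions_path(root, output_exp, preds_model).open("rb") as f:
        return pickle.load(f)


# Round-trip demo with a toy object; the real content format is not specified.
with tempfile.TemporaryDirectory() as root:
    path = predictions_path(root, "toy_exp", "toy_model")
    path.parent.mkdir(parents=True)
    toy = {"sentence-0": ["make", "prepare"]}
    with path.open("wb") as f:
        pickle.dump(toy, f)
    loaded = load_predictions(root, "toy_exp", "toy_model")

print(loaded)
```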

### 3. augment masked sentences with predicted substitutes

`--postprocess` is optional; it is not needed if the predictions are already processed.

In [None]:
!python -m lexsub.augment_conll_sentences \
--input_exp='nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01' \
--output_exp='expanded_nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc' \
--data_dir='data/open_sesame_v1_data/fn1.7' \
--preds_model='xlnet_embs_hypers' \
--match_lugold=True \
--match_rolegold=True \
--proc_funcs='lemma' \
--postprocess=False \
--pipeline='lugold_rolegold_nolemma' \
--N=2 \
--verbose=False
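
One way to picture the combined effect of the gold-matching flags and `--N` is "filter, then truncate". The sketch below is only that picture under stated assumptions: `filter_and_truncate` is a hypothetical helper, the ranked list and the gold set are made-up toy data, and the actual matching logic in lexsub may differ.

```python
def filter_and_truncate(substitutes, gold, n=2, match_gold=True):
    """Keep at most n substitutes, optionally only those found in the gold set.

    substitutes: ranked list of candidate words, best first.
    gold:        set of words accepted as gold answers (toy data here).
    """
    kept = [w for w in substitutes if not match_gold or w in gold]
    return kept[:n]


ranked = ["prepare", "heat", "make", "eat", "bake"]  # toy predictor output
gold_lus = {"make", "bake", "prepare", "fry"}        # toy gold lexical units

with_gold = filter_and_truncate(ranked, gold_lus, n=2, match_gold=True)
without_gold = filter_and_truncate(ranked, gold_lus, n=2, match_gold=False)
print(with_gold)
print(without_gold)
```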

## Running end-to-end experiment 
Only the output CoNLL file is saved.

- This is slow and inefficient, so it is not suited to a large number of experiments. It is advisable to instead create one main file from all the verb/noun data, mask all words at once, run prediction with every prediction model, and post-process once per word_type [verb, noun, role].


In [None]:
!python -m lexsub.augment_conll \
--input_exp='nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01' \
--output_exp='expanded_nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc' \
--data_dir='data/open_sesame_v1_data/fn1.7' \
--preds_model='bert-large-cased' \
--substitute_lu=True \
--substitute_role=True \
--noun_max=0.5 \
--ibn=True \
--match_lugold=True \
--match_rolegold=True \
--proc_funcs='lemma' \
--N=2 \
--verbose=False

## Generate HTML Tables
Saves an HTML table as well as the dataframe in CSV format.
- exp_names is optional; if not specified, all experiments from --configs are run to produce examples.


### verbs lexical unit

In [None]:
! python -m lexsub.run_html_table table \
--base_exp='nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01' \
--data_dir='data/open_sesame_v1_data/fn1.7' \
--expanded_exps='{"BERT":("bert", ("nolemma",True,True), "nltk_nolemma_role_stopwords_final_predictions.pkl", "expanded_nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc"),"XLNet [+embs]":("xlnet_embs_hypers", ("nolemma",True,True),"nltk_nolemma_role_stopwords_final_predictions.pkl","expanded_nExPerSent_verbs_randAllExps/01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc")}' \
--output_file='html_files/expanded_01ExPerSent_verbs_rand01' \
--caption='Examples of expansions for expanded_lu_roles_nouns-50pc, lu and roles were filtered for gold answers.' \
--notations='{}' \
--E=15
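
The `--expanded_exps` value is a Python literal rather than JSON: it contains tuples and the capitalised `True`, which `json.loads` would reject. The sketch below shows one way such a string can be read, using `ast.literal_eval`; how `run_html_table` itself parses the flag is not stated here. The field names in the loop (`preds_model`, `pipeline`, `preds_file`, `exp_dir`) are a reading of the example values, not documented names.

```python
import ast

PREDS_FILE = "nltk_nolemma_role_stopwords_final_predictions.pkl"
EXP_DIR = (
    "expanded_nExPerSent_verbs_randAllExps/"
    "01ExPerSent_verbs_rand01_expanded_lu_roles_nouns-50pc"
)

# Same value as passed to --expanded_exps above, assembled piecewise for readability.
expanded_exps = (
    '{"BERT":("bert", ("nolemma",True,True), "' + PREDS_FILE + '", "' + EXP_DIR + '"),'
    + '"XLNet [+embs]":("xlnet_embs_hypers", ("nolemma",True,True),"'
    + PREDS_FILE + '","' + EXP_DIR + '")}'
)

# literal_eval only accepts Python literals, so it is safe on untrusted strings.
parsed = ast.literal_eval(expanded_exps)

for label, (preds_model, pipeline, preds_file, exp_dir) in parsed.items():
    print(label)
    print("  preds_model:", preds_model)
    print("  pipeline:   ", pipeline)
    print("  preds_file: ", preds_file)
    print("  exp_dir:    ", exp_dir)
```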

### nouns lexical unit

In [None]:
! python -m lexsub.run_html_table table \
--base_exp='nExPerSent_nouns_randAllExps/01ExPerSent_nouns_rand01' \
--data_dir='data/open_sesame_v1_data/fn1.7' \
--expanded_exps='{"BERT":("bert", ("nolemma",True,True), "nltk_nolemma_role_stopwords_final_predictions.pkl", "expanded_nExPerSent_nouns_randAllExps/01ExPerSent_nouns_rand01_expanded_lu_roles_nouns-50pc"),"XLNet [+embs]":("xlnet_embs_hypers", ("nolemma",True,True),"nltk_nolemma_role_stopwords_final_predictions.pkl","expanded_nExPerSent_nouns_randAllExps/01ExPerSent_nouns_rand01_expanded_lu_roles_nouns-50pc")}' \
--output_file='html_files/expanded_01ExPerSent_nouns_rand01' \
--caption='Examples of expansions for expanded_lu_roles_nouns-50pc, lu and roles were filtered for gold answers.' \
--notations='{}' \
--E=15

In [None]:
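import pandas as pd

# Show full cell contents instead of truncating long sentences.
pd.set_option('display.max_colwidth', None)

# Load the CSV written next to the HTML table by run_html_table.
df = pd.read_csv('html_files/expanded_01ExPerSent_verbs_rand01.csv')

# Emit the table as LaTeX; escape=False keeps any markup in the cells as-is.
print(df.to_latex(index=False, escape=False))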

In [None]:
df