All scripts should be run from the root directory of this project.
`data` contains all the generated data, formatted as a dictionary where each key is a canonical word mapped to its noisy word forms. The datasets are named `{sentiment}-{noise-type}.json`. For example, `neg-typos.json` contains words with negative sentiment whose noisy forms are generated by typos.
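To make the format concrete, here is a minimal sketch of reading such a file; the example words below are made up for illustration, not taken from the released data:

```python
import json

# Illustrative content mirroring the dataset layout: each key is a canonical
# word, mapped to a list of its noisy forms (these words are hypothetical).
example = {
    "terrible": ["terible", "terrrible"],
    "awful": ["awfull"],
}

# A released file such as data/neg-typos.json has the same structure:
# with open("data/neg-typos.json") as f:
#     noisy_data = json.load(f)

# flatten into (canonical, noisy) pairs for inspection
pairs = [(word, noisy) for word, forms in example.items() for noisy in forms]
print(pairs)
```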
`data_generator` contains the scripts that generate the noisy data. To extract noisy words from Twitter, download the tweets dataset from Kaggle into the `data` folder and preprocess it with `tweet_preprocess.py`.
We save all the results in the `data` directory. You can inspect them with the script `check_result.py`:

```shell
$ python check_result.py --help # see help information for valid argument parameters
$ python check_result.py --dataset_name neg-typos --model_name bert-yelp
```
We also provide Python code in `check_result.ipynb` to load the dataframe and analyze the results further.
To reproduce our results, you can use the script `word_corruption.py`:

```shell
$ dataset_name=neg-typos
$ model_name=bert-base-uncased-SST-2
$ python word_corruption.py --dataset_name $dataset_name --model_name $model_name
```

The script will save the result as a Pandas DataFrame.
To evaluate other models from the Hugging Face hub, specify your model name and the corresponding model path as a key/value pair in the dictionary `hf_model_names` in the `resource` module. Then, all the experiments can be rerun on these new models.
```python
import resource

dataset_name = "neg-typos"
plm_name = "bert"
tokenizer = resource.hf_tokenizers[plm_name]
plm = resource.hf_models["-".join([plm_name, dataset_name])]
data = resource.datasets[dataset_name]
```
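Registering a new model for evaluation might look like the following sketch; the model key and both Hugging Face hub paths are illustrative assumptions, not entries that necessarily ship with the repository:

```python
# Sketch of the hf_model_names dictionary in the resource module.
# In the repository this dictionary already exists; the entries below
# are hypothetical examples of its key/value layout.
hf_model_names = {
    "bert-base-uncased-SST-2": "textattack/bert-base-uncased-SST-2",
}

# Add your own model: key = name passed via --model_name,
# value = the model's path on the Hugging Face hub (assumed path).
hf_model_names["roberta-SST-2"] = "textattack/roberta-base-SST-2"
```

With such an entry in place, the new key can be passed to `word_corruption.py` via `--model_name`.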
```bibtex
@inproceedings{li-etal-2023-pretrained,
    title = "Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?",
    author = "Li, Xinzhe and
      Liu, Ming and
      Gao, Shang",
    editor = "Palmer, Alexis and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.starsem-1.15",
    doi = "10.18653/v1/2023.starsem-1.15",
    pages = "165--173"
}
```