Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?

This repository is the official implementation of the *SEM 2023 paper. All scripts should be run from the root directory of this project.

Data

data contains all the generated data, formatted as a dictionary where each key is a canonical word mapped to its multiple noisy word forms. The datasets are named {sentiment}-{noise-type}.json. For example, neg-typos.json contains words with negative sentiment whose noisy forms are generated by typos. data_generator contains the scripts used to generate the noisy data. To extract noisy words from Twitter, download the tweets data from Kaggle into the data folder and preprocess it with tweet_preprocess.py.
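For example, a dataset file can be loaded with the standard json module. The snippet below is a minimal sketch that only assumes the canonical-word-to-noisy-forms dictionary format described above.

import json

# Each key is a canonical word; each value is a list of its noisy forms
with open("data/neg-typos.json") as f:
    word_corruptions = json.load(f)

for canonical, noisy_forms in word_corruptions.items():
    print(canonical, "->", noisy_forms)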

Corruption result

We save all the results in the data directory. You can check them with the script check_result.py.

$ python check_result.py --help # see help information for valid argument parameters
$ python check_result.py --dataset_name neg-typos --model_name bert-yelp

We also provide Python code in check_result.ipynb to load the dataframe and further analyze the results.

To reproduce our results, you can use the script word_corruption.py.

$ dataset_name=neg-typos
$ model_name=bert-base-uncased-SST-2
$ python word_corruption.py --dataset_name $dataset_name --model_name $model_name 

The script will generate the results as a Pandas dataframe.
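As a rough sketch of how the output might be inspected (the file path and pickle format below are assumptions for illustration, not fixed by the repo; see check_result.py for the actual storage details):

import pandas as pd

# Hypothetical path/format: adjust to however word_corruption.py saves its output
df = pd.read_pickle("data/neg-typos-bert-base-uncased-SST-2.pkl")
print(df.head())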

(Optional) Evaluating New Models

To evaluate other models from the Hugging Face Hub, specify your model name and the corresponding model path as a key/value pair in the dictionary hf_model_names in the resource module. Then all the experiments can be rerun on the new models.

import resource  # the project's resource module, not the Python stdlib `resource`

dataset_name = "neg-typos"
plm_name = "bert"

# Look up the tokenizer, fine-tuned model, and dataset from the resource registries
tokenizer = resource.hf_tokenizers[plm_name]
plm = resource.hf_models["-".join([plm_name, dataset_name])]  # join takes an iterable
data = resource.datasets[dataset_name]
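For example, adding an entry to hf_model_names might look like the following; the key and the Hub path are illustrative, not part of the repository.

# In the resource module (key and Hub path are illustrative only):
hf_model_names["roberta-yelp"] = "textattack/roberta-base-SST-2"

After that, passing --model_name roberta-yelp to word_corruption.py would rerun the experiments on the new model.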

Reference

@inproceedings{li-etal-2023-pretrained,
    title = "Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?",
    author = "Li, Xinzhe  and
      Liu, Ming  and
      Gao, Shang",
    editor = "Palmer, Alexis  and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.starsem-1.15",
    doi = "10.18653/v1/2023.starsem-1.15",
    pages = "165--173"
}
