All scripts should be run from the root directory of this project.
`data` contains all the generated data, formatted as a dictionary where each key is a canonical word mapped to its noisy word forms. The datasets are named `{sentiment}-{noise-type}.json`. For example, `neg-typos.json` contains words with negative sentiment whose noisy forms are generated by typos.
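To make the format concrete, here is a minimal sketch of reading such a file; the example words below are made up for illustration, not taken from the released data:

```python
import json

# Illustrative content mirroring the dataset layout: each key is a canonical
# word, mapped to a list of its noisy forms (these words are hypothetical).
example = {
    "terrible": ["terible", "terrrible"],
    "awful": ["awfull"],
}

# A released file such as data/neg-typos.json has the same structure:
# with open("data/neg-typos.json") as f:
#     noisy_data = json.load(f)

# flatten into (canonical, noisy) pairs for inspection
pairs = [(word, noisy) for word, forms in example.items() for noisy in forms]
print(pairs)
```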
`data_generator` contains the scripts that generate the noisy data. To extract noisy words from Twitter, download the tweets dataset from Kaggle into the `data` folder and preprocess it with `tweet_preprocess.py`.
We save all the results in the `data` directory. You can inspect them with the script `check_result.py`:

```shell
$ python check_result.py --help # see help information for valid argument parameters
$ python check_result.py --dataset_name neg-typos --model_name bert-yelp
```
We also provide Python code in `check_result.ipynb` to load the dataframe and analyze the results further.
To reproduce our results, you can use the script `word_corruption.py`:

```shell
$ dataset_name=neg-typos
$ model_name=bert-base-uncased-SST-2
$ python word_corruption.py --dataset_name $dataset_name --model_name $model_name
```

The script will save the result as a Pandas DataFrame.
To evaluate other models from the Hugging Face hub, specify your model name and the corresponding model path as a key/value pair in the dictionary `hf_model_names` in the `resource` module. Then, all the experiments can be rerun on these new models.
```python
import resource

dataset_name = "neg-typos"
plm_name = "bert"
tokenizer = resource.hf_tokenizers[plm_name]
plm = resource.hf_models["-".join([plm_name, dataset_name])]
data = resource.datasets[dataset_name]
```
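Registering a new model for evaluation might look like the following sketch; the model key and both Hugging Face hub paths are illustrative assumptions, not entries that necessarily ship with the repository:

```python
# Sketch of the hf_model_names dictionary in the resource module.
# In the repository this dictionary already exists; the entries below
# are hypothetical examples of its key/value layout.
hf_model_names = {
    "bert-base-uncased-SST-2": "textattack/bert-base-uncased-SST-2",
}

# Add your own model: key = name passed via --model_name,
# value = the model's path on the Hugging Face hub (assumed path).
hf_model_names["roberta-SST-2"] = "textattack/roberta-base-SST-2"
```

With such an entry in place, the new key can be passed to `word_corruption.py` via `--model_name`.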
```bibtex
@inproceedings{li-etal-2023-pretrained,
    title = "Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?",
    author = "Li, Xinzhe and
      Liu, Ming and
      Gao, Shang",
    editor = "Palmer, Alexis and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.starsem-1.15",
    doi = "10.18653/v1/2023.starsem-1.15",
    pages = "165--173"
}
```