[Note: Upcoming Improvements]
We are actively enhancing our code and evaluations. In the near future, we will release a comprehensive journal paper detailing these advancements, along with an improved version of our codebase. We appreciate your interest and patience as we strive to provide more robust and refined solutions. Stay tuned for updates, and feel free to contact us if you'd like to test our APP algorithm.
Official repository for LChange22 - WIDID. The published results were produced in a Python 3 environment on the Windows 11 operating system. The installation instructions assume the use of the pip package manager (PyPI).
Our code builds on the SemEval submission (https://aclanthology.org/2020.semeval-1.6/) by Martinc, M. et al. (2020), available on GitHub (https://github.com/smontariol/Semeval2020-Task1).
Install dependencies if needed: pip install -r requirements.txt
To reproduce the results published in the paper, run the following commands from the command line:
Remove artifacts and numbers from the SemEval English and Latin corpora. In addition, we sanitize documents so that each is shorter than 512 characters: a recursive function splits a document approximately every 500 characters, and no word is split across the resulting chunks.
python preprocessing.py --corpus_paths pathToCorpusSlicesSeparatedBy';' --output_corpus_paths pathToOutputCorpusSlicesSeparatedBy';' --language language
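The splitting logic can be sketched as follows (a minimal illustration; the function name and exact boundary handling are assumptions, and the actual implementation lives in preprocessing.py):

```python
# Minimal sketch of the recursive splitting described above; the function
# name and exact boundary handling are assumptions, not the code shipped
# in preprocessing.py.
def split_document(text, max_chars=500):
    """Recursively split `text` into chunks of roughly max_chars characters,
    never breaking a word across two chunks."""
    if len(text) <= max_chars:
        return [text]
    # Cut at the last space before the limit so no word is split.
    cut = text.rfind(' ', 0, max_chars)
    if cut == -1:  # a single token longer than the limit: cut hard
        cut = max_chars
    return [text[:cut].strip()] + split_document(text[cut:].strip(), max_chars)
```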
Extract embeddings from the preprocessed corpora in .txt format:
python extract_embeddings.py --corpus_paths pathToPreprocessedCorpusSlicesSeparatedBy';' --target_path pathToSemEvalTargetFile --language language --embeddings_path pathToOutputEmbeddingFile --concat
This creates a pickled file containing all contextual embeddings for all target words.
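To inspect the output, the pickle can be loaded directly. The file path and the assumed layout (a dictionary keyed by target word) below are illustrative only; the released code defines the actual structure:

```python
# Hypothetical inspection of the pickled embedding file; the path and the
# assumed layout ({target_word: {corpus_slice: embedding matrix}}) are
# illustrative only.
import pickle

with open('english_embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)

for word, slices in embeddings.items():
    print(word, {name: vecs.shape for name, vecs in slices.items()})
```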
Train a Doc2Vec model and extract embeddings from the preprocessed corpora in .txt format:
python doc2vec_extraction.py --corpus_paths pathToPreprocessedCorpusSlicesSeparatedBy';' --target_path pathToSemEvalTargetFile --language language --embeddings_path pathToOutputEmbeddingFile --model_path pathToOutputModelFile
This creates a pickled file containing the Doc2Vec embeddings for all target words.
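For orientation, Doc2Vec training with gensim typically looks like the sketch below; the hyperparameters and file paths are illustrative, not necessarily those used by doc2vec_extraction.py:

```python
# Illustrative gensim Doc2Vec training; hyperparameters and paths are
# examples only, not necessarily those used by doc2vec_extraction.py.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each preprocessed document becomes a TaggedDocument with a unique tag.
with open('preprocessed_slice1.txt', encoding='utf-8') as f:
    documents = [TaggedDocument(words=line.split(), tags=[i])
                 for i, line in enumerate(f)]

model = Doc2Vec(documents, vector_size=300, window=5, min_count=2, epochs=10)
model.save('doc2vec.model')

# Embeddings for new (or existing) documents are obtained by inference.
vector = model.infer_vector('example sentence about a target word'.split())
```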
Conduct clustering and measure semantic shift with CD, DIV, JSD, PDIS, and PDIV:
python calculate_semantic_change.py --language language --embeddings_path pathToInputEmbeddingFile --semeval_results pathToOutputResultsDir --trim trimValueForAPP
This script takes the pickled embedding file as an input and creates several files, a csv file containing CD, DIV, JSD, PDIS, PDIV scores for all clustering methods for each target word, files containing cluster labels for each embedding , and a file containing context (sentence) mapped to each embedding.
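As a rough illustration of how a JSD score between two time slices can be derived from cluster labels, the sketch below pools a target word's embeddings, clusters them with scikit-learn's Affinity Propagation, and compares the slices' cluster distributions. This mirrors the general idea, not necessarily the exact implementation in calculate_semantic_change.py:

```python
# Illustrative JSD-style shift score from cluster labels; this mirrors the
# general idea, not necessarily the exact code in calculate_semantic_change.py.
import numpy as np
from sklearn.cluster import AffinityPropagation
from scipy.spatial.distance import jensenshannon

def jsd_shift(emb_t1, emb_t2):
    """Cluster the pooled embeddings of a target word from two time slices
    and compare the slices' cluster distributions with Jensen-Shannon."""
    pooled = np.vstack([emb_t1, emb_t2])
    labels = AffinityPropagation(random_state=0).fit_predict(pooled)
    k = labels.max() + 1
    # Normalized histogram of cluster memberships per time slice.
    p = np.bincount(labels[:len(emb_t1)], minlength=k) / len(emb_t1)
    q = np.bincount(labels[len(emb_t1):], minlength=k) / len(emb_t2)
    return jensenshannon(p, q) ** 2  # jensenshannon returns the sqrt of JSD

# Toy example: one dominant sense in slice 1, a new sense appearing in slice 2.
rng = np.random.default_rng(0)
emb_t1 = rng.normal(0.0, 0.1, size=(40, 10))
emb_t2 = np.vstack([rng.normal(0.0, 0.1, size=(20, 10)),
                    rng.normal(3.0, 0.1, size=(20, 10))])
print(jsd_shift(emb_t1, emb_t2))
```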
Generate CSV files for evaluation using Spearman's correlation:
python evaluation.py --language language --results_file pathToInputResultsCSVFile --gold_path pathToSemEvalGoldFile --output_corr_file pathToOutputCorrelationFile
This script takes the CSV file generated in the previous step as input and computes Spearman's correlation for all the evaluated methods ('cd', 'div', 'ap_jsd', 'ap_pdis', 'ap_pdiv', 'iapna_jsd', 'iapna_pdis', 'iapna_pdiv', 'app_jsd', 'app_pdis', 'app_pdiv') against the gold scores.
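The evaluation step reduces to a rank correlation between predicted shift scores and the SemEval gold scores, along these lines (file paths and column names are assumptions; see evaluation.py for the actual layout):

```python
# Illustrative evaluation: Spearman's correlation between predicted shift
# scores and the SemEval gold scores. Paths and column names ('word',
# 'gold_score', method columns) are assumptions, not the released layout.
import pandas as pd
from scipy.stats import spearmanr

results = pd.read_csv('results_english.csv')          # hypothetical path
gold = pd.read_csv('gold_english.txt', sep='\t',
                   names=['word', 'gold_score'])      # SemEval gold format
merged = results.merge(gold, on='word')

for method in ['cd', 'div', 'ap_jsd', 'app_jsd']:     # subset for brevity
    rho, p = spearmanr(merged[method], merged['gold_score'])
    print(f'{method}: rho={rho:.3f} (p={p:.3g})')
```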