
BootCIRealData

This repository provides access to the code and data used in Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status, accepted at LREC 2022. A simpler example can be found in the BootCIExpli repository.

Contents

paired_bootstrap_interval.py

This Python function computes bootstrap confidence intervals (CIs) for paired samples. It is based on the bootstrap_interval module by Alexander Neshitov (https://pypi.org/project/bootstrap-interval/, MIT License), which implements the bootstrap CI for a single sample.

For usage, see the explanations provided in the files, for example S1_ABSA/BootCiABSAMea.py. The performance measure can be any measure provided by a standard Python library or defined by the user. It must be provided in a function called stat (see, for example, BootCiABSAMea.py). Note that this function has a significant impact on the computation time, since it is called twice for each resampling (plus twice the number of instances in the test material if the recommended BCa method is used).
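
For illustration only, here is a minimal, self-contained sketch of a percentile paired bootstrap CI for the difference between two systems. This is not the repository's implementation (which also supports the recommended BCa method); the names stat, preds_a, and preds_b are illustrative:

```python
import numpy as np

def stat(y_true, y_pred):
    # Example performance measure: accuracy.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def paired_percentile_ci(y_true, preds_a, preds_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(A) - stat(B) on paired predictions."""
    rng = np.random.default_rng(seed)
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample instances with replacement
        # stat is called twice per resampling, once per system
        diffs[i] = stat(y_true[idx], preds_a[idx]) - stat(y_true[idx], preds_b[idx])
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

An interval that excludes zero indicates a significant difference between the two systems at the corresponding confidence level.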

If you use this function, I would appreciate a citation to the following paper: Bestgen, Y. 2022. Please, don't forget the difference and the confidence interval when seeking for the state-of-the-art status. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022).

Fisher_Pitman_Paired_Test.py

This Python function implements the Fisher-Pitman test for paired samples. Like the bootstrap CI, this test is based on a resampling procedure.

For usage, see the explanations provided in the files, for example S1_ABSA/WTestABSAMea.py. The performance measure can be any measure provided by a standard Python library or defined by the user. It must be provided in a function called stat (see, for example, WTestABSAMea.py). Note that this function has a significant impact on the computation time, since it is called twice for each resampling.
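
For illustration only, here is a minimal sketch of a Monte Carlo approximation of this test. It is not the repository's implementation; the names stat, preds_a, and preds_b are illustrative:

```python
import numpy as np

def stat(y_true, y_pred):
    # Example performance measure: accuracy.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def fisher_pitman_paired(y_true, preds_a, preds_b, n_perm=10000, seed=0):
    """Monte Carlo Fisher-Pitman paired test: randomly swap the two
    systems' predictions per instance and recompute the difference in stat."""
    y_true, a, b = map(np.asarray, (y_true, preds_a, preds_b))
    observed = abs(stat(y_true, a) - stat(y_true, b))
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        swap = rng.integers(0, 2, size=len(y_true)).astype(bool)
        # stat is called twice per resampling, once per permuted system
        diff = stat(y_true, np.where(swap, b, a)) - stat(y_true, np.where(swap, a, b))
        if abs(diff) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # two-sided Monte Carlo p-value
```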

If you use this function, I would appreciate a citation to the following paper: Bestgen, Y. 2022. Please, don't forget the difference and the confidence interval when seeking for the state-of-the-art status. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022).

S1_ABSA

The first case study is based on the laptop review dataset of SemEval-2014 Task 4 on aspect-level sentiment classification (ABSA) (Pontiki et al., 2014; Song et al., 2019). The task is to identify the sentiment expressed towards an aspect in a sentence, using three sentiment polarities: positive, neutral, and negative. The metric used to evaluate performance was accuracy (although some researchers also reported macro-F1). I obtained the predictions on the challenge test set from five systems using the open-source ABSA-PyTorch implementation (https://github.com/songyouwei/ABSA-PyTorch) provided by Song et al. (2019).
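
For reference, both metrics can be computed with scikit-learn (a minimal sketch with illustrative labels):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["positive", "neutral", "negative", "positive", "negative"]  # illustrative
y_pred = ["positive", "neutral", "negative", "negative", "positive"]

acc = accuracy_score(y_true, y_pred)                  # 3/5 = 0.6
macro_f1 = f1_score(y_true, y_pred, average="macro")  # macro-averaged F1
```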

The files in this folder allow you to reproduce this case study:

  • BootCiABSAMea.py: calculates the CIs
  • WTestABSAMea.py: calculates the Fisher-Pitman tests
  • Data: predicted values and true scores for the five systems in csv format
  • Res: folder for storing the output of the Python scripts
  • Expected_Output
    • ResBootCIABSA.txt: output of BootCiABSAMea.py
    • ResFPTestABSA.txt: output of WTestABSAMea.py

As stated above, the data used were obtained with the open-source ABSA-PyTorch implementation by Song et al. (2019): https://github.com/songyouwei/ABSA-PyTorch. Their code is well explained and very easy to use.

To run this code, you can:

  • Download the folders and files of this directory (zip file)
  • Go to the S1_ABSA folder and run python3 BootCiABSAMea.py.

S2_EmoInt

The second case study illustrates the potential usefulness of CIs when performing K-fold cross-validation (CV). It is based on the WASSA-2017 EmoInt shared task, in which systems had to estimate the emotion intensity of tweets for four emotions (anger, fear, joy, and sadness) on a real-valued scale (Mohammad and Bravo-Marquez, 2017). Performance was measured by means of the Pearson correlation coefficient. In this framework, Kulshreshtha et al. (2018) proposed LE-PC-DNN, a new and improved version of the system ranked first in the challenge. I reproduced their study as faithfully as possible based on the code they provided (https://github.com/Pranav-Goel/Neural_Emotion_Intensity_Prediction). In their paper, they report an ablation analysis of the performance of their system on the test set. I reproduced this ablation analysis on the test set using a K-fold CV procedure with K = 10, performing twenty different 10-fold CVs obtained by varying the random seed.
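
The following sketch illustrates this repeated CV design (twenty 10-fold CVs differing only in the random seed, each scored with the Pearson correlation). The placeholder data and the stand-in Ridge model are only there so the sketch runs; this is not the code used to produce the repository's data:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Stand-in data; the repository's predictions were produced by the
# LE-PC-DNN code, not by this snippet.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # placeholder features
y = X @ rng.normal(size=5) + rng.normal(size=200)  # placeholder intensities

scores = []
for seed in range(20):                             # twenty different 10-fold CVs
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    preds = np.empty_like(y)
    for train_idx, test_idx in kf.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    scores.append(pearsonr(y, preds)[0])           # one Pearson r per CV
```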

The files in this folder allow you to reproduce this case study:

  • BootCiMCVMea.py: calculates the CIs
  • WTestMCVMea.py: calculates the Fisher-Pitman tests
  • Data: predicted values and true scores for the four emotions, four conditions and twenty 10-fold CVs (thus 320 files) in csv format
  • Res: folder for storing the output of the Python scripts
  • Expected_Output
    • ResTestBootCIAll.txt: full results of BootCiMCVMea.py
    • ResTestBootCISum.txt: summary results of BootCiMCVMea.py
    • ResTestPermAll.txt: full results of WTestMCVMea.py
    • ResTestPermSum.txt: summary results of WTestMCVMea.py

As stated above, the data used were obtained with the code provided by Kulshreshtha et al. (2018): https://github.com/Pranav-Goel/Neural_Emotion_Intensity_Prediction. It should be noted that the intermediate_files necessary to reproduce their analyses are not provided by these authors due to their size. They can, however, be easily obtained, with the exception of word2vec_twitter_model.bin, which is no longer available where the authors indicate but can be downloaded here: https://www.kaggle.com/hachemsfar/cbigru?select=word2vec_twitter_model.bin (4.56 GB).

To run this code, you can:

  • Download the folders and files of this directory (zip file)
  • Go to the S2_EmoInt folder and run python3 BootCiMCVMea.py. Note: as there are many conditions and CVs, the script takes a long time to finish.

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

References

Bestgen, Y. 2022. Please, don't forget the difference and the confidence interval when seeking for the state-of-the-art status. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022).

Devang Kulshreshtha, Pranav Goel, and Anil Kumar Singh. 2018. How emotional are you? Neural architectures for emotion intensity prediction in microblogs. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2914–2926, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Saif Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 34–49, Copenhagen, Denmark. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, and Yanghui Rao. 2019. Attentional Encoder Network for Targeted Sentiment Classification. arXiv e-prints, arXiv:1902.09314.
