SignWriting Evaluation

The lack of automatic SignWriting evaluation metrics is a major obstacle in the development of SignWriting transcription and translation [1] models.

Goals

The primary objective of this repository is to house a suite of automatic evaluation metrics specifically tailored for SignWriting. This includes standard metrics like BLEU [2], chrF [3], and CLIPScore [4], as well as custom-developed metrics unique to our approach. We recognize the distinct challenges in evaluating single signs versus continuous signing, and our methods reflect this differentiation.

To qualitatively demonstrate the efficacy of these evaluation metrics, we implement a nearest-neighbor search for selected signs from the SignBank corpus. The rationale is straightforward: the closer the sign is to its nearest neighbor in the corpus, the more effective the evaluation metric is in capturing the nuances of sign language transcription and translation.

Evaluation Metrics

  • Tokenized BLEU - BLEU score for tokenized SignWriting FSW strings.
  • chrF - chrF score for untokenized SignWriting FSW strings.
  • CLIPScore - CLIPScore between SignWriting images (using the original CLIP model).
  • Similarity - symbol distance score for SignWriting FSW strings (README).
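
As a rough illustration of how the string-based metrics can be computed, the sketch below scores a hypothesis FSW string against a reference using sacrebleu. The FSW tokenizer and the example strings are illustrative assumptions rather than this repository's own code; CLIPScore and the symbol-distance similarity are omitted here because they depend on the implementations in this repository.

```python
# Sketch: Tokenized BLEU and chrF over FSW strings with sacrebleu.
# The tokenizer below (split into box markers, symbol IDs, and coordinates)
# is an assumption for illustration, not necessarily the repository's tokenizer.
import re
from sacrebleu.metrics import BLEU, CHRF

def tokenize_fsw(fsw: str) -> str:
    """Split an FSW string into whitespace-separated tokens."""
    tokens = re.findall(r"[ABLMR]|S[0-9a-f]{3}[0-5][0-9a-f]|\d{3}x\d{3}", fsw)
    return " ".join(tokens)

hypothesis = "M518x529S14c20481x471S27106503x489"  # illustrative FSW strings
reference = "M518x529S14c20481x471S27102503x489"

# Tokenized BLEU compares the token sequences.
bleu = BLEU(tokenize="none")
print(bleu.corpus_score([tokenize_fsw(hypothesis)], [[tokenize_fsw(reference)]]))

# chrF works directly on the untokenized character strings.
chrf = CHRF()
print(chrf.corpus_score([hypothesis], [[reference]]))
```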

Qualitative Evaluation

Distribution of Scores

Using a sample of the corpus, we compute all pairwise (any-to-any) scores for each metric. Intuitively, a good metric should assign a low score to any two random signs, since most signs are unrelated. This should be reflected in the distribution of scores, which should be skewed towards lower values.
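
A rough sketch of this procedure, using chrF as a stand-in for any of the metrics above and a handful of placeholder FSW strings in place of a real corpus sample:

```python
# Sketch: score every pair in a sample and bucket the scores into a histogram.
# The FSW strings are illustrative placeholders; in practice the sample would be
# a few hundred random signs drawn from SignBank.
from itertools import combinations
from collections import Counter
from sacrebleu.metrics import CHRF

corpus_sample = [
    "M518x529S14c20481x471S27106503x489",
    "M518x533S1f720487x492S26500508x469",
    "M524x515S11541498x485S11549476x485",
    "M510x518S2ff00490x483",
]

chrf = CHRF()
scores = [chrf.sentence_score(a, [b]).score / 100 for a, b in combinations(corpus_sample, 2)]

# A useful metric should concentrate most of the mass near zero,
# since two random signs are almost always unrelated.
histogram = Counter(min(int(score * 10), 9) for score in scores)
for bin_index in range(10):
    print(f"{bin_index / 10:.1f}-{(bin_index + 1) / 10:.1f}: {histogram[bin_index]}")
```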

(Figure: distribution of scores for each metric.)

Nearest Neighbor Search

It is well known that the SignBank corpus contains many forms of the sign for "hello". We carefully select some of these signs to evaluate our metrics by searching for their closest matches in the corpus, which contains around 230k single signs.

Comparing the top-10 nearest neighbors for each sign reveals the weaknesses of each metric: for every sign and metric, either the first match is incorrect, or a more correct match appears further down the list.
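
A minimal sketch of this search, again with chrF standing in for any of the metrics and placeholder FSW strings rather than the roughly 230k signs of SignBank:

```python
# Sketch: rank every corpus sign by its score against a query and keep the top k.
from sacrebleu.metrics import CHRF

def nearest_neighbors(query: str, corpus: list[str], k: int = 10) -> list[tuple[float, str]]:
    """Return the k corpus signs with the highest metric score against the query."""
    chrf = CHRF()
    scored = [(chrf.sentence_score(candidate, [query]).score, candidate) for candidate in corpus]
    return sorted(scored, reverse=True)[:k]

# Illustrative usage with placeholder FSW strings.
query = "M518x529S14c20481x471S27106503x489"
corpus = [
    "M518x529S14c20481x471S27102503x489",
    "M518x533S1f720487x492S26500508x469",
    "M510x518S2ff00490x483",
]
for score, sign in nearest_neighbors(query, corpus, k=3):
    print(f"{score:6.2f}  {sign}")
```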

(Table: for each of the selected signs, the top-10 nearest neighbors under CLIPScore, Symbols Distances, Tokenized BLEU, and chrF; images omitted.)

References


  1. Amit Moryossef, Zifan Jiang. 2023. SignBank+: Preparing a Multilingual Sign Language Dataset for Machine Translation Using Large Language Models.

  2. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

  3. Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

  4. Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
