Implement BLEU metric (functional) #93

Closed · wants to merge 1 commit

Conversation

@ErikaLal commented Dec 1, 2022

Summary: Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Reviewed By: ninginthecloud

Differential Revision: D40809785
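
For reference, a minimal self-contained sketch of the BLEU computation the linked article describes (clipped n-gram precision, geometric mean, brevity penalty). This illustrates the metric's definition, not the code in this PR; the whitespace tokenization and the `bleu_score` name here are assumptions.

```python
from collections import Counter
import math

def bleu_score(candidate: str, references: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights over 1..max_n grams."""
    cand = candidate.split()              # assumption: whitespace tokenization
    refs = [r.split() for r in references]

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        # Clip each candidate n-gram count by its max count in any one reference.
        max_ref = Counter()
        for ref in refs:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for gram, count in ref_ngrams.items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if clipped == 0 or total == 0:
            return 0.0                    # no overlap at this order -> BLEU is 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference length closest to the candidate's.
    c_len = len(cand)
    r_len = min((len(r) for r in refs), key=lambda length: (abs(length - c_len), length))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)
```

For example, `bleu_score("the cat sat on the mat", ["the cat is on the mat"])` returns 0.0 because there is no 4-gram overlap; production implementations typically add smoothing for short sentences.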

@facebook-github-bot added the CLA Signed and fb-exported labels Dec 1, 2022
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D40809785

@gwenzek commented Dec 5, 2022

The crucial thing not handled by this PR is tokenization, and tokenization is a common source of error when reporting BLEU scores.

Would it be better to let an external tool like sacrebleu handle the tokenization + BLEU computation? If the goal is to have a quick in-training metric, then maybe the Metric should take tensors as input.
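
For context, this is what the sacrebleu route looks like; a minimal example, assuming the sacrebleu package is installed:

```python
import sacrebleu  # pip install sacrebleu

hyps = ["the cat sat on the mat", "it is raining"]
# One reference stream: refs[0][i] is the reference for hyps[i].
refs = [["the cat is on the mat", "it rains"]]

# sacrebleu applies its own standardized tokenization (13a by default),
# which is what makes reported scores comparable across systems and papers.
result = sacrebleu.corpus_bleu(hyps, refs)
print(result.score)  # corpus-level BLEU on a 0-100 scale
```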

ErikaLal added a commit to ErikaLal/torcheval that referenced this pull request Dec 5, 2022
Summary:
Pull Request resolved: pytorch#93

Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Differential Revision: https://internalfb.com/D40809785

fbshipit-source-id: 948b5d0c4f9c262ce37109e08bb81ffaef43a3ee
ErikaLal pushed a commit to ErikaLal/torcheval that referenced this pull request Dec 7, 2022
Summary:
Pull Request resolved: pytorch#93

Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Reviewed By: ninginthecloud

Differential Revision: D40809785

fbshipit-source-id: f36c67b64492903713958c1332a510a828b1b7fb
ErikaLal pushed a commit to ErikaLal/torcheval that referenced this pull request Dec 7, 2022
Summary:
Pull Request resolved: pytorch#93

Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Reviewed By: ninginthecloud

Differential Revision: D40809785

fbshipit-source-id: 8d030d0b8f82c2d0a47653664636e064a2f89242
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D40809785

@ninginthecloud

> The crucial thing not handled by this PR is tokenization, and tokenization is a common source of error when reporting BLEU scores.
>
> Would it be better to let an external tool like sacrebleu handle the tokenization + BLEU computation? If the goal is to have a quick in-training metric, then maybe the Metric should take tensors as input.

Hi @gwenzek, thanks a lot for the feedback~ In this PR, we split the input into words by default, which is fine for a typical English model. However, as you pointed out, this approach may limit other NLP use cases (e.g., multi-language).

Here are two options for improving the current version of the BLEU score:

  1. accept a tokenizer function as input
  2. directly accept tokens as input instead of sentences (thanks @hudeven for sharing: NLTK does this for its BLEU score calculation, link)

What do you think of these two options?
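
For concreteness, here is a sketch of what the two options could look like. `bleu_with_tokenizer` is a hypothetical name, not the torcheval API, and NLTK's `sentence_bleu` (the function @hudeven pointed to) stands in for the underlying computation:

```python
from typing import Callable, Optional, Sequence
from nltk.translate.bleu_score import sentence_bleu  # pip install nltk

# Option 1 (hypothetical signature): accept a tokenizer callable,
# falling back to whitespace splitting when none is given.
def bleu_with_tokenizer(
    candidate: str,
    references: Sequence[str],
    tokenizer: Optional[Callable[[str], list[str]]] = None,
) -> float:
    tokenize = tokenizer or str.split
    return sentence_bleu([tokenize(r) for r in references], tokenize(candidate))

# Option 2 (NLTK's own style): the metric only ever sees pre-tokenized input,
# so tokenization decisions stay entirely with the caller.
score = sentence_bleu(
    [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]],
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
)
```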

Summary:
Pull Request resolved: pytorch#93

Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Reviewed By: ninginthecloud

Differential Revision: D40809785

fbshipit-source-id: 709d5fc1d87c92cd500f44ad424d1cba824c4865
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D40809785

ErikaLal pushed a commit to ErikaLal/torcheval that referenced this pull request Dec 23, 2022
Summary:
Pull Request resolved: pytorch#93

Implemented BLEU metric based on https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b.

Differential Revision: https://www.internalfb.com/diff/D40809785?entry_point=27

fbshipit-source-id: 4753ee0fbaaeb6c9dd4c9e63f46a98f0ea1afe5c
@codecov bot commented Dec 23, 2022

Codecov Report

Merging #93 (b356ff5) into main (af5a1d6) will increase coverage by 0.07%.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
+ Coverage   95.54%   95.61%   +0.07%
==========================================
  Files         152      154       +2
  Lines        8568     8712     +144
==========================================
+ Hits         8186     8330     +144
  Misses        382      382
```

| Impacted Files | Coverage Δ |
| --- | --- |
| torcheval/metrics/functional/__init__.py | 100.00% <ø> (ø) |
| tests/metrics/functional/text/test_bleu.py | 100.00% <100.00%> (ø) |
| torcheval/metrics/functional/text/__init__.py | 100.00% <100.00%> (ø) |
| torcheval/metrics/functional/text/bleu.py | 100.00% <100.00%> (ø) |


bobakfb added a commit to bobakfb/torcheval that referenced this pull request Jan 25, 2023
Summary:
# TorchEval Version 0.0.6

## Change Log

 - New metrics:
   - AUC
   - Binary, Multiclass, Multilabel AUPRC (also called Average Precision) pytorch#108 pytorch#109
   - Multilabel Precision Recall Curve pytorch#87
   - Recall at Fixed Precision pytorch#88 pytorch#91
   - Windowed Mean Square Error pytorch#72 pytorch#86
   - BLEU Score pytorch#93 pytorch#95
   - Perplexity pytorch#90
   - Word Error Rate pytorch#97
   - Word Information Loss pytorch#111
   - Word Information Preserved pytorch#110
 - Features
   - Added Sync for Dictionaries of Metrics pytorch#98
   - Improved FLOPS counter pytorch#81
   - Improved Module Summary, added forward elapsed times pytorch#100 pytorch#103 pytorch#104 pytorch#105 pytorch#114
   - AUROC now supports weighted inputs pytorch#94
 - Other
   - Improved Documentation pytorch#80 pytorch#117 pytorch#121
   - Added Module Summary to Quickstart pytorch#113
   - Updated several unit tests pytorch#77 pytorch#96 pytorch#101 pytorch#73
   - Docs Automatically Add New Metrics pytorch#118
   - Several Aggregation Metrics now Support fp64 pytorch#116 pytorch#123

### [BETA] Sync Dictionaries of Metrics

We're looking forward to building tooling for metric collections. The first important feature towards this end is collective syncing of groups of metrics. In the example below, we show how easy it is to sync all your metrics at the same time with `sync_and_compute_collection`.

This method is not only a convenience: on the backend, a single torch distributed sync collective is used for the entire group of metrics, so the overhead from repeated network calls is kept to a minimum.

```python
import torch
from torcheval.metrics import BinaryAUPRC, BinaryAUROC, BinaryAccuracy
from torcheval.metrics.toolkit import sync_and_compute_collection, reset_metrics

# Collections should be Dict[str, Metric]
train_metrics = {
    "train_auprc": BinaryAUPRC(),
    "train_auroc": BinaryAUROC(),
    "train_accuracy": BinaryAccuracy(),
}

# Hydrate metrics with some random data
preds = torch.rand(size=(100,))
targets = torch.randint(low=0, high=2, size=(100,))

for name, metric in train_metrics.items():
    metric.update(preds, targets)

# Sync the whole group with a single gather
print(sync_and_compute_collection(train_metrics))
>>> {'train_auprc': tensor(0.5913), 'train_auroc': tensor(0.5161, dtype=torch.float64), 'train_accuracy': tensor(0.5100)}

# reset all metrics in collection
reset_metrics(train_metrics.values())
```

Be on the lookout for more metric collection code coming in future releases.

## Contributors

We're grateful for our community, which helps us improve torcheval by highlighting issues and contributing code. The following people contributed patches for this release: Rohit Alekar, lindawangg, Julia Reinspach, jingchi-wang, Ekta Sardana, williamhufb, andreasfloros, Erika Lal, samiwilf.

Reviewed By: ananthsub

Differential Revision: D42737308

fbshipit-source-id: 4c9d72ce73a35636d7cd6421926a23a80250e267
@bobakfb mentioned this pull request Jan 25, 2023
facebook-github-bot pushed a commit that referenced this pull request Jan 25, 2023
Summary:
Pull Request resolved: #124

# TorchEval Version 0.0.6 (full change log and example in the commit above)

Reviewed By: ananthsub

Differential Revision: D42737308

fbshipit-source-id: dfd852345e1a9f3069ea33b056f5a60a3adde5aa