
Fix CircleCI failures on Windows for XLM-R unit tests #1441

Merged
Merged 1 commit into pytorch:main on Nov 12, 2021

Conversation


@abhinavarora abhinavarora commented Nov 12, 2021

This PR fixes the CircleCI model test failures on Windows.

Cause

To generate download URLs, we used os.path.join(_TEXT_BUCKET, "xlmr.vocab.pt"). os.path.join is not the right way to join URL paths and must only be used for local filesystem paths. Because _TEXT_BUCKET is a URL and contains forward slashes, on Windows os.path.join produces the URL https://download.pytorch.org/models/text\xlmr.vocab.pt, which returns a 403 from curl.

Fix

Instead of os.path.join, we must use urllib.parse.urljoin.
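The difference can be reproduced on any OS by using `ntpath` (the Windows implementation behind `os.path`); a minimal sketch, assuming a bucket URL like the one above (the exact `_TEXT_BUCKET` value lives in torchtext). Note one `urljoin` caveat: the base must end with "/" or the last path segment gets replaced.

```python
import ntpath  # the Windows flavor of os.path, importable on any OS
from urllib.parse import urljoin

# Illustrative bucket constant; the real value is defined in torchtext.
_TEXT_BUCKET = "https://download.pytorch.org/models/text/"

# What os.path.join does on Windows when the base lacks a trailing
# slash: it inserts a backslash, corrupting the URL.
broken = ntpath.join(_TEXT_BUCKET.rstrip("/"), "xlmr.vocab.pt")
print(broken)  # https://download.pytorch.org/models/text\xlmr.vocab.pt

# urljoin joins correctly on every OS, provided the base ends with "/".
good = urljoin(_TEXT_BUCKET, "xlmr.vocab.pt")
print(good)  # https://download.pytorch.org/models/text/xlmr.vocab.pt

# Without the trailing slash, urljoin replaces the last segment.
print(urljoin(_TEXT_BUCKET.rstrip("/"), "xlmr.vocab.pt"))
# https://download.pytorch.org/models/xlmr.vocab.pt  <- "text" dropped
```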

@abhinavarora abhinavarora changed the title [DRAFT PR DO NOT REVIW] Debug Circle CI Fix CircleCI failures on Windows for XLM-R unit tests Nov 12, 2021
@abhinavarora abhinavarora self-assigned this Nov 12, 2021
@parmeet parmeet left a comment


LGTM! thanks for fixing this :)

@parmeet parmeet merged commit d4a27a0 into pytorch:main Nov 12, 2021
facebook-github-bot pushed a commit that referenced this pull request Nov 16, 2021
Summary:
Trying out a new way to import changes :). The main reason for this deviation is to find a way to skip commit ID(s) that are currently blocking another important PR from landing on fbcode.

Used following command to sync changes from github to fbcode:

```
python pytorch/import.py --github_username parmeet --project_name text --commit_from ba20fc5 --commit_to e691934 --skip_commit_ids 2cebac3 daf0f6c --squash
```

Notes:
- Skipped commit 2cebac3 as it is about removing legacy code, which is still a work in progress on the internal side (resolving legacy use-sites) by abhinavarora
- Skipped commit daf0f6c as it corresponds to syncing changes from fbsync to the main branch
- We used --squash, but this option can be skipped to get a 1:1 correspondence from PR to Diff ID, like we have in vision

The text below is auto-generated because (I think) we used --squash

====
Subject:
Update doc and fix CircleCI doc build issue (#1434)
Body:

commit e691934

====
Subject:
[CircleCI Windows Failure] Fix the way we join URL pieces to download XLM-R components (#1441)
Body:

commit d4a27a0

====
Subject:
correct the `_compute_ngram_counter` docstring (#1440)
Body:

commit a26a8ef

====
Subject:
fix attention mask testing (#1439)
Body:

commit 778b3e6

====
Subject:
[Vocab] Refactor vocab factory method to accept special tokens as a keyword argument (#1436)
Body:
* [Vocab] Refactor vocab factory method to accept special tokens as a keyword argument

commit f298494

====
Subject:
add attention mask to transformer encoder modules (#1435)
Body:

commit 9314b44

Reviewed By: Nayef211

Differential Revision: D32431346

fbshipit-source-id: 985e242ce5a733c130e9d5b9549a4a330e948dc7
@abhinavarora abhinavarora deleted the debug_circle_ci branch January 5, 2022 21:28
Nayef211 added a commit that referenced this pull request Oct 19, 2022
* include pytorch 1.5.0-rc1 for CI test

* bump up the version

* Set up ShipIt

fbshipit-source-id: bb7d2eb52240c7223b57c3c9624e61d116e77e39

* Re-sync with internal repository (#749)

* 20200429 pytorch/text import

Summary: [20:45:34: cpuhrsch@devvm3140 pytorch]$ ./fb_build/import_text.sh

Reviewed By: pbelevich

Differential Revision: D21320577

fbshipit-source-id: ac2148b9f0d58e5538443c879845bfb4f6ca7202

* 20200430 torchtext import script to include additional meta files

Summary: ./fb_build/import_text.sh

Reviewed By: zhangguanheng66

Differential Revision: D21343124

fbshipit-source-id: c08ecad2cc6f439fa40130aeaf91383be9403fe8

* torchtext flake8, github, travis metafiles

Summary: See title

Reviewed By: pbelevich

Differential Revision: D21344211

fbshipit-source-id: a8bcf7f3ab9bb2c2853e27f612e82caa341d3651

* Import torchtext 20200520 and update build

Summary: Import torchtext up to #786

Reviewed By: cpuhrsch

Differential Revision: D21483116

fbshipit-source-id: bc8ab38db9dc9ce4a8734ca8ea991c20e4ef0882

* Import torchtext 20200528

Summary:
Import up to #798
Addresses T67599333

Reviewed By: zhangguanheng66

Differential Revision: D21764935

fbshipit-source-id: f44d1db637799f2e95f420a8099fbf19545c7cbd

* 20200604 torchtext github import

Summary: Import from github master

Reviewed By: zhangguanheng66

Differential Revision: D21886238

fbshipit-source-id: a8f098e299466dd1701fe7ceb6a97c2a2fc54b9d

* Import torchtext 20200605

Summary: Import from github master

Reviewed By: zhangguanheng66

Differential Revision: D21907519

fbshipit-source-id: f22370d97796da5f2cb9f76f506c80f18fefea7f

* Back out "Import torchtext 20200605"

Summary: Original commit changeset: f22370d97796

Reviewed By: zhangguanheng66

Differential Revision: D21964222

fbshipit-source-id: c316836596fc3e232e63abc59e172f237b551cc5

* Import torchtext 2020/06/22

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66, cpuhrsch

Differential Revision: D22168183

fbshipit-source-id: 7d96ade64f18942d9bd19437011be2f65f0b2a5e

* Fix torch.testing._internal module not found

Reviewed By: Nayef211

Differential Revision: D22315715

fbshipit-source-id: 6b8b8544b0aa458cf5e7e9ca380d0dc85c98189f

* Import torchtext 2020/07/07

Summary: Import from github torchtext/master

Reviewed By: cpuhrsch

Differential Revision: D22420576

fbshipit-source-id: 4d2c19d7f1db8f698894ca406c1c44b2ad8e0506

* remediation of S205607

fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac

* remediation of S205607

fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3

* Import torchtext 2020/07/21

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66

Differential Revision: D22641140

fbshipit-source-id: 8190692d059a937e25c5f93506581086f389c291

* Remove .python3 markers

Reviewed By: ashwinp-fb

Differential Revision: D22955630

fbshipit-source-id: f00ef17a905e4c7cd9196c8924db39f9cdfe8cfa

* Import torchtext 2020/08/06

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66

Differential Revision: D22989210

fbshipit-source-id: 083464e188b758a8746123f4dd2197cc7edc4bc4

* Import torchtext 2020/08/18

Summary: Import from github torchtext/master

Reviewed By: cpuhrsch

Differential Revision: D23190596

fbshipit-source-id: 1568a25a5bd6431bcef3c6539f64a3ab1f5bccd7

* Import torchtext from 8aecbb9

Reviewed By: hudeven

Differential Revision: D23451795

fbshipit-source-id: 73e6130c16716919c77862cef4ca4c8048428670

* Import torchtext 9/4/2020

Reviewed By: Nayef211

Differential Revision: D23539397

fbshipit-source-id: 88dce59418a3071cbc9e944cf0a4cf2117d7d9f7

* Import github torchtext on 9/9/2020

Reviewed By: cpuhrsch

Differential Revision: D23616189

fbshipit-source-id: 365debc987326145eead7456ed48517fe55cac96

* Add property support for ScriptModules (#42390)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390

**Summary**
This commit extends support for properties to include
ScriptModules.

**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.

`python test/test_jit_py3.py TestScriptPy3.test_module_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, mannatsingh

Differential Revision: D22880298

Pulled By: SplitInfinity

fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560

* sync with OSS torchtext 9/15/20

Reviewed By: cpuhrsch

Differential Revision: D23721167

fbshipit-source-id: 13b32091c422a3ed0ae299595d69a7afa7136638

* Import Github torchtext on 9/28/2020

Reviewed By: cpuhrsch

Differential Revision: D23962265

fbshipit-source-id: 0d042878fe9119aa725e982ab7d5e96e7c885a59

* Enable @unused syntax for ignoring properties (#45261)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261

**Summary**
This commit enables `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported because class type properties cannot be
executed in Python (they exist only as TorchScript types) the way an
`ignored` function can, and module properties that cannot be scripted
are not added to the `ScriptModule` wrapper, so that they
may still execute in Python.

**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.

Test Plan: Imported from OSS

Reviewed By: navahgar, Krovatkin, mannatsingh

Differential Revision: D23971881

Pulled By: SplitInfinity

fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33

* Import Github torchtext on 10/11/2020

Reviewed By: cpuhrsch

Differential Revision: D24242037

fbshipit-source-id: 605d81412c320373f1158c51dbb120e7d70d624d

* make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API (#47322)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47322

Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API.

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714803

Pulled By: bdhirsh

fbshipit-source-id: c809aad8a698db3fd0d832f117f833e997b159e1

* Revert D24714803: make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API

Differential Revision:
D24714803

Original commit changeset: c809aad8a698

fbshipit-source-id: fb2ada65f9fc00d965708d202bd9d050f13ef467

* Import torchtext on Nov 20, 2020

Summary:
Import torchtext on the commit of 633548a1bdf0bac1e38f98da375a537ce0c2994b

allow-large-files

Reviewed By: cpuhrsch

Differential Revision: D25127691

fbshipit-source-id: 3a617f5f4849df452f8a102a77ce11a1bce5af1f

* Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API. (#48178)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48178

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056090

Pulled By: bdhirsh

fbshipit-source-id: 8f868b45f545e5da2f21924046e786850eba70d9

* Import torchtext from github into fbcode on 1/11/2021

Reviewed By: cpuhrsch

Differential Revision: D25873762

fbshipit-source-id: 0d34d36aeb8e7e2ce72fcf345c5e7e713ef3663c

* Import torchtext from github #1121 d56fffe

Summary: Import torchtext from github #1121 d56fffe

Reviewed By: zhangguanheng66

Differential Revision: D25976268

fbshipit-source-id: 81589f8988a54cc12f17f0a6f298a915e829a830

* Import the hidden files in torchtext github repo

Reviewed By: mthrok

Differential Revision: D26001386

fbshipit-source-id: f822f0f32232d3006ef629937520dee6c0faf414

* add a newline mark to config.yml file (#1128)

Reviewed By: zhangguanheng66

Differential Revision: D26369003

fbshipit-source-id: 09ca48f9705d8663b06e6a329a6b64b24f9c148e

* Replace model with full name when spacy load is used (#1140)

Reviewed By: zhangguanheng66

Differential Revision: D26369005

fbshipit-source-id: b1e6b5d77810bb8f67d14b8a1c7ec0a9f4831cab

* Fix the num_lines argument of the setup_iter func in RawTextIterableDataset (#1142)

Reviewed By: zhangguanheng66

Differential Revision: D26368999

fbshipit-source-id: 4b50e5d9e5fbdf633e8b3f0072223eed050af793

* Fix broken CI tests due to spacy 3.0 release (#1138)

Reviewed By: zhangguanheng66

Differential Revision: D26368998

fbshipit-source-id: 84e883562a9a3d0fe47b54823b22f7b2cd82fca4

* Switch data_select in dataset signature to split (#1143)

Reviewed By: zhangguanheng66

Differential Revision: D26369006

fbshipit-source-id: 608f42fa180db9ebcfaaeadc6b8cdd29393262af

* Add offset arg in the raw text dataset (#1145)

Reviewed By: zhangguanheng66

Differential Revision: D26368996

fbshipit-source-id: 52741015139c302b7b0ddf8c8f50ab45a609fd2f

* switch to_ivalue to __prepare_scriptable__ (#1080)

Reviewed By: zhangguanheng66

Differential Revision: D26368995

fbshipit-source-id: 0352c04e422c835350bd42df35d4054d543fee36

* Pass an embedding layer to the constructor of the BertModel class (#1135)

Reviewed By: zhangguanheng66

Differential Revision: D26369001

fbshipit-source-id: f5a67a2a812d568073505ec4d181f6e418eb4a3f

* add __next__ method to RawTextIterableDataset (#1141)

Reviewed By: zhangguanheng66

Differential Revision: D26368997

fbshipit-source-id: f5ef78f5f4a224db497f47f774eaddedd0498b4b

* Add func to count the total number of parameters in a model (#1134)

Reviewed By: zhangguanheng66

Differential Revision: D26369000

fbshipit-source-id: c687c0f0c2697dbd9c17a79a1291a2e279bbd1b8

* Retire the legacy code in torchtext library and fix the dependency of the downstream libraries

Summary: This diff is doing: 1) move the legacy code in torchtext to the legacy folder; 2) for the downstream libraries in fbcode, if they are using the legacy code, add "legacy" to the path.

Reviewed By: cpuhrsch

Differential Revision: D23718437

fbshipit-source-id: 1660868aaa95ac6555ad6793dda5ce02a9acdc08

* Sync torchtext GH<->fbcode until GH commit 1197514eb8cc33ccff10f588534f405b43908660

Summary: Import recent torchtext changes up until GH commit 1197514eb8cc33ccff10f588534f405b43908660

Reviewed By: zhangguanheng66

Differential Revision: D26824967

fbshipit-source-id: fc4be4f94a8f748ce2ed5e776e30a42422cbcab9

* 20210304[2] Sync torchtext GH<->fbcode until GH commit 2764143865678c41e69ad3b993556fe90c1e6391

Summary: Sync up until commit in title

Reviewed By: zhangguanheng66

Differential Revision: D26829429

fbshipit-source-id: a059a36d83b3803dfed9198d0e474e0e75f94f17

* 20210308 Sync torchtext GH <-> fbcode

Summary: Import latest GH changes

Reviewed By: zhangguanheng66

Differential Revision: D26888371

fbshipit-source-id: cc27f51fd89ad86b8bcfb8f286ad874ab01b1fd6

* Re-name raw_datasets.json file with jsonl extension

Reviewed By: cpuhrsch

Differential Revision: D26923978

fbshipit-source-id: c87c7776445e05d452f6b38244bf4cdaba45bdec

* 20210329 Sync torchtext up to GH commit eb5e39d3d40525c0064c8e7b7c976755e7341a8b

Summary: Sync torchtext up to GH commit eb5e39d3d40525c0064c8e7b7c976755e7341a8b

Reviewed By: parmeet

Differential Revision: D27400885

fbshipit-source-id: 1f8f92ca42ba36d070db6740b3bb4c148f69586b

* Import torchtext #1267 93b03e4

Summary:
Imported latest from github Master
PR#1267

Reviewed By: cpuhrsch

Differential Revision: D27503970

fbshipit-source-id: 853ff895ba42b1feb7442abe1c87478e43d62e5b

* Import torchtext #1266 ba0bf52

Summary: Import torchtext from github

Reviewed By: parmeet

Differential Revision: D27803909

fbshipit-source-id: 9cb0f15858b1417cb5868d5651513eb2df998fbe

* Import torchtext #1287 fab63ed

Reviewed By: parmeet

Differential Revision: D27922562

fbshipit-source-id: 3c18cd9e2583e03471461ad8a22ac6b0ceb596a2

* Import torchtext #1293 d2a0776

Summary: Importing torchtext from github for regular sync.

Reviewed By: cpuhrsch

Differential Revision: D27983819

fbshipit-source-id: 5806421d788afaa872f5320b5f4cbcd913e103ea

* Import torchtext #1291 0790ce6

Reviewed By: parmeet

Differential Revision: D28101664

fbshipit-source-id: a8643b3ecf85de2cb815dcfa5789a4a5d246d80f

* adding __contains__ method to experimental vocab (#1297)

Reviewed By: cpuhrsch

Differential Revision: D28111696

fbshipit-source-id: fef195941492493a399adb37339cfa64795e22a0

* Import torchtext #1292 ede6ce65eb5405ff1f8801ff6b354bb1cd242108

Summary: This diff syncs torchtext GH with fbcode

Reviewed By: cpuhrsch

Differential Revision: D28321356

fbshipit-source-id: 7736f0d100941627b58424911a1329b1ce66c123

* Added APIs for default index and removed unk token (#1302)

Reviewed By: parmeet

Differential Revision: D28478153

fbshipit-source-id: bfcaffe8fe48e96d8df454f7df0d25ec39d5d4a6

* Swapping experimental Vocab and retiring current Vocab into legacy (#1289)

Summary: allow-large-files to commit wikitext103_vocab.pt

Reviewed By: cpuhrsch

Differential Revision: D28478152

fbshipit-source-id: c2a871439f054024b95c05f7664a84028aacaca3

* Import torchtext #1313 36e33e2

Summary: Importing from Github

Reviewed By: cpuhrsch

Differential Revision: D28572929

fbshipit-source-id: 2e7b00aadeda6ab0596ef23295f41c5b0fa246e7

* Adding API usage logging

Summary: Adding API usage logging for Vocab module

Reviewed By: colin2328

Differential Revision: D28585537

fbshipit-source-id: 38975b523fb597412fbcb18ef831bfb4834cb420

* Import torchtext #1314 99557efd98dd0e74346975d75183dd8aa32eb37e

Reviewed By: parmeet

Differential Revision: D28683381

fbshipit-source-id: 7bfbf445dd512f0ce21c34096cf3f08332d90138

* Import torchtext #1325 57a1df3

Reviewed By: NicolasHug

Differential Revision: D28994054

fbshipit-source-id: 4c679f56ef37b18f6d2acaaaed8518facbeaa41c

* Import torchtext #1328 ca514f6

Summary: Import torchtext #1328 ca514f6

Reviewed By: NicolasHug

Differential Revision: D29120370

fbshipit-source-id: 229586f3470bd61bfb2f6a390d79e45d4eae3b4d

* up the priority of numpy array comparisons in self.assertEqual (#59067) (#1340)

* Re-sync with internal repository (#1343)

* up the priority of numpy array comparisons in self.assertEqual (#59067)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0

* Import torchtext #1300 0435df13924fd4582d67e5b17bc09f6ded18be8b

Summary: Import torchtext #1300 0435df13924fd4582d67e5b17bc09f6ded18be8b

Reviewed By: parmeet

Differential Revision: D29371832

fbshipit-source-id: 624280ddfa787a4e7628e60fa673cb9df0a66641

* Import torchtext #1345 8cf471c

Summary: Import from github

Reviewed By: hudeven

Differential Revision: D29441995

fbshipit-source-id: 27731ce2714c16180d11bfb26af5d5a2dba408b1

* Import torchtext #1352 7ab50af

Summary: Import from github

Reviewed By: NicolasHug

Differential Revision: D29537684

fbshipit-source-id: 25b1fc1e6d9f930e83f5f2939788b90b083aeaa2

* Enabling torchtext datasets access via manifold and iopath

Summary:
We would like to add and access torchtext datasets on manifold. This diff unifies dataset download from external links and through manifold for internal access. This is enabled via the iopath package.

The main idea is to plug the download hooks into the download_from_url function. The download hooks delegate the download to the appropriate PathHandler. In OSS we have enabled downloads via https and Google Drive. Internally, we replace the download hook to download data from manifold.

We have created a _download_hooks.py file under the /fb/ folder which replaces the corresponding file in OSS. The file under the /fb/ folder converts the http/https URL paths into corresponding manifold paths and downloads the data from there.
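The delegation idea described above can be sketched as follows; every name here is hypothetical and only illustrates the swap-a-hook pattern, not the actual torchtext internals:

```python
import urllib.request

def _oss_download_hook(url, destination):
    """Default OSS behavior: fetch the URL over HTTP(S)."""
    urllib.request.urlretrieve(url, destination)

# Module-level hook; internal builds replace this with a handler
# that rewrites public URLs to manifold paths before downloading.
_DOWNLOAD_HOOK = _oss_download_hook

def set_download_hook(hook):
    """Swap in a different backend (e.g. manifold) for all downloads."""
    global _DOWNLOAD_HOOK
    _DOWNLOAD_HOOK = hook

def download_from_url(url, destination):
    # Every download funnels through the hook, so one swap changes
    # the behavior everywhere without touching call sites.
    _DOWNLOAD_HOOK(url, destination)
```

Because all call sites go through `download_from_url`, replacing the single module-level hook is enough to redirect every dataset and model download.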

Reviewed By: hudeven

Differential Revision: D28892389

fbshipit-source-id: 3b66544dd2345075e2e7c524f344db04aa2a24e3

* Import torchtext #1361 05cb992

Summary: Import from github

Reviewed By: hudeven

Differential Revision: D29856211

fbshipit-source-id: 6332f9bdf3cf4eef572c5423db15101ea904d825

* Import torchtext #1365 c57b1fb

Summary: Import torchtext #1365 c57b1fb

Reviewed By: parmeet

Differential Revision: D29940816

fbshipit-source-id: 6b2495b550a7e6b6110b0df12de51a87b0d31c1c

* Moving Roberta building blocks to torchtext

Summary: This is the first step in moving Roberta Model from pytext_lib into PyTorch Text Library. Here we moved the Roberta building blocks into pytorch/text/fb/nn/modules. The code-base is organized according to WIP document https://docs.google.com/document/d/1c0Fs-v97pndLrT3bdfGRGeUeEC38UcDpibvgOXkbS-g/edit#heading=h.3ybcf0ic42yp

Reviewed By: hudeven

Differential Revision: D29671800

fbshipit-source-id: d01daa99e0a5463716660722381db9a0eeb083f8

* Enabling torchtext availability in @mode/opt

Summary:
More details on context and solution: D29973934

Note that in this implementation, we rely on overriding the behavior of the _init_extention() function. This is in a similar spirit to how we override the behavior of download hooks to accommodate the changes needed to enable functionality in fbcode.

Reviewed By: mthrok

Differential Revision: D30494836

fbshipit-source-id: b2b015263fa1bca2ef4d4214909e469df3fbe327

* Import torchtext #1382 aa12e9a

Summary: Import torchtext #1382 aa12e9a

Reviewed By: parmeet

Differential Revision: D30584905

fbshipit-source-id: fba23cd19f31fc7826114dd2eb402c8f7b0553df

* Simplify cpp extension initialization process

Summary: Simplifying the cpp extension initialization process by following torchaudio's implementation in D30633316

Reviewed By: mthrok

Differential Revision: D30652618

fbshipit-source-id: f80ac150fa50b1edc22419b21412f64e77064c5d

* fixed bug with incorrect variable name in dataset_utils.py

Summary:
- ValueError was outputting `fn` instead of `func`
- Similar fix done in torchdata https://github.com/facebookexternal/torchdata/pull/167

Reviewed By: ejguan

Differential Revision: D31149667

fbshipit-source-id: 2c1228287d513895f8359cb97935252f0087d738

* Import torchtext #1410 0930843

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31745899

fbshipit-source-id: e4ac5c337bcbd1a8809544add7679dd3da242999

* Import torchtext #1406 1fb2aed

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31762288

fbshipit-source-id: f439e04f903d640027660cb969d6d9e00e7ed4a0

* Import from github 10/18/21

Summary: Syncing torchtext github main branch to fbcode

Reviewed By: parmeet

Differential Revision: D31841825

fbshipit-source-id: 9c1a05295e6557ff411e56eb719cb439d5c424ba

* Import torchtext #1420 0153ead

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31871772

fbshipit-source-id: 989f5a453ef7680592df27e4174f465d11a2fbf8

* Import torchtext #1421 bcc1455

Summary: Syncing torchtext github main branch to fbcode

Reviewed By: parmeet

Differential Revision: D31873514

fbshipit-source-id: 1a964a67ce7ee73f5acf3a1e3f8118028c2dd46e

* Enable OSS torchtext XLMR Base/Large model on fbcode

Summary:
Enable access to open-source torchtext XLMR base/large implementation by:
1) Uploading models/transform weights on manifold
2) Patching public URL with manifold URL (similar to what we have for datasets)

Note that we didn't enable model tests since it takes relatively long to download huge model weights from manifold. We will rely on open-source signals when making changes to the model implementation, and we need to ensure that any update to the weights on the AWS cloud is also replicated on manifold.

Reviewed By: hudeven

Differential Revision: D31844166

fbshipit-source-id: 62a4e9a3a8580ab93c3beb3af69be7361f1cc937

* enabling SST2 dataset usage in fbcode

Summary:
Enable access to open-source torchtext SST2 dataset by:
- Uploading SST2 dataset on manifold
- Swapping public URL with manifold URL in fbcode by implementing a dummy `HTTPReader` wrapper class
   - The wrapper class does URL mapping and calls `IoPathFileLoaderDataPipe` on the manifold URL
- Enabled SST2Dataset unit tests within fbcode

Reviewed By: parmeet

Differential Revision: D31876606

fbshipit-source-id: fdde14a67cce835da216b296e1a0024e1d1fc7a9

* Import torchtext #1426 4be2792

Summary: Import from github

Reviewed By: Nayef211

Differential Revision: D31962042

fbshipit-source-id: 0308ae0cfe402e8c3eb133cb5a205b65f98ad1df

* Import torchtext #1428 b962c51

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D32006262

fbshipit-source-id: 2d7766104e1116f14f20fa1031178c2143b5e78b

* Import torchtext #1430 4cf19ed

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D32140599

fbshipit-source-id: 3a2902febd5e5024d833699e05e0256b1ae0cae2

* Allow inferred scaling in MultiheadSelfAttention for head_dim != 64

Summary:
Rather than raise an exception whenever head_dim != 64, we can just infer the scaling value while still emitting a warning.

Also add an assertion in case embed_dim is not a multiple of num_heads (in which case forward will break).
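A minimal sketch of the change described above; the function name and warning text are illustrative, not the actual torchtext code:

```python
import warnings

def infer_attention_scaling(embed_dim: int, num_heads: int) -> float:
    # forward() breaks if the embedding can't split evenly across heads.
    assert embed_dim % num_heads == 0, \
        "embed_dim must be a multiple of num_heads"
    head_dim = embed_dim // num_heads
    if head_dim != 64:
        # Previously this case raised; now we warn and infer the scale.
        warnings.warn(f"head_dim is {head_dim}, not 64; inferring scaling")
    return head_dim ** -0.5  # standard 1/sqrt(d_k) attention scaling

# e.g. embed_dim=768, num_heads=12 -> head_dim 64, scale 0.125, no warning
```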

Reviewed By: parmeet

Differential Revision: D32193989

fbshipit-source-id: 30f68c55f3ec37932252c77c355ae55b8bf34ded

* Updated sst2 dataset to accept `validate_hash` parameter

Summary:
## Description
- Updated sst2 dataset to accept a `validate_hash` parameter
- This allows for testing using partial datasets since downloading the entire dataset takes much longer

Reviewed By: parmeet

Differential Revision: D32250435

fbshipit-source-id: 9b5e7183f62df69638e1a3af2107273daa6f4ac5

* Import torchtext #1431 ba20fc5

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D32282533

fbshipit-source-id: 8318cd8b8360dec1febdde0bc48388e6b2f2d768

* Fixed file filtering bug in SST2 dataset

Summary:
- Removed copying partial SST2 asset file to a temp dir and instead directly working with the file from the asset folder
- Fixed bug with path names affecting how files were filtered out from the zip file
   - For example, if the value of `split` is "test", the following snippet of code `filter(lambda x: split in x[0])` might match all of the "train", "test", and "dev" files depending on the location of the dataset asset file
   - When testing with buck, the location of the extracted files could look something like `/data/users/nayef211/fbsource/fbcode/buck-out/dev/gen/pytorch/text/test/experimental_test_datasets#binary,link-tree/test/asset/SST2/SST-2.zip/train.tsv`. Since the word "test" is contained in this path string, the filtering logic would incorrectly select the "train" file even though what we want is the "test" file
   - To resolve this we append the file extension (in this case ".tsv") to the `split` variable in the filtering logic
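The bug and fix above can be illustrated with a small sketch; the helper name is hypothetical and the ".tsv" extension is taken from the description, while the real filtering lives in the SST2 dataset code:

```python
def select_split(members, split):
    """Pick archive members for a split by matching '<split>.tsv',
    not a bare substring of the whole path."""
    suffix = split + ".tsv"
    return [m for m in members if m.endswith(suffix)]

members = [
    # "test" appears in the extraction directory, not just the file name
    "/tmp/test/asset/SST2/SST-2.zip/train.tsv",
    "/tmp/test/asset/SST2/SST-2.zip/test.tsv",
    "/tmp/test/asset/SST2/SST-2.zip/dev.tsv",
]

# Buggy substring check: matches all three entries because the path
# contains a directory named "test".
print([m for m in members if "test" in m])

# Fixed check: matches only the real test split file.
print(select_split(members, "test"))  # ['/tmp/test/asset/SST2/SST-2.zip/test.tsv']
```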

Reviewed By: parmeet

Differential Revision: D32329831

fbshipit-source-id: dbb4803a04f6cd50fab3f7ce5530d3258b2db012

* Squashed commits (9314b44 to e691934)

Summary:
Trying out a new way to import changes :). The main reason for this deviation is to find a way to skip commit ID(s) that are currently blocking another important PR from landing on fbcode.

Used following command to sync changes from github to fbcode:

```
python pytorch/import.py --github_username parmeet --project_name text --commit_from ba20fc525a8a46d3056eeb421a44b9bdb1a90182 --commit_to e691934d2779be40ab425056836565840f49d565 --skip_commit_ids 2cebac34ab26577ee02b7295dbe01dccfdb1a88f daf0f6c71d7b764aafd2f1a2a3e7aa37dcc36e53 --squash
```

Notes:
- Skipped commit 2cebac3 as it is about removing legacy code, which is still a work in progress on the internal side (resolving legacy use-sites) by abhinavarora
- Skipped commit daf0f6c as it corresponds to syncing changes from fbsync to the main branch
- We used --squash, but this option can be skipped to get a 1:1 correspondence from PR to Diff ID, like we have in vision

The text below is auto-generated because (I think) we used --squash

====
Subject:
Update doc and fix CircleCI doc build issue (#1434)
Body:

commit e691934d2779be40ab425056836565840f49d565

====
Subject:
[CircleCI Windows Failure] Fix the way we join URL pieces to download XLM-R components (#1441)
Body:

commit d4a27a05a85d331d84d3ac527ca5f18ca64d326f

====
Subject:
correct the `_compute_ngram_counter` docstring (#1440)
Body:

commit a26a8ef7f7ad22f9f2ae7af0e52e4c9760ab439d

====
Subject:
fix attention mask testing (#1439)
Body:

commit 778b3e62770c24c4ecde06a6aaba1dee38c07e2e

====
Subject:
[Vocab] Refactor vocab factory method to accept special tokens as a keyword argument (#1436)
Body:
* [Vocab] Refactor vocab factory method to accept special tokens as a keyword argument

commit f298494ad90495e4ad442928665ce6d8e9f9c3c0

====
Subject:
add attention mask to transformer encoder modules (#1435)
Body:

commit 9314b44d2a6cb6f4129e1ac3ac57f92eb054f15d

Reviewed By: Nayef211

Differential Revision: D32431346

fbshipit-source-id: 985e242ce5a733c130e9d5b9549a4a330e948dc7

* Refactor OnDiskCache (#61)

Summary:
Pull Request resolved: https://github.com/pytorch/data/pull/61

Fixes https://github.com/facebookexternal/torchdata/issues/114 and https://github.com/facebookexternal/torchdata/issues/140

* #59

Test Plan: Imported from OSS

Reviewed By: wenleix

Differential Revision: D31734382

Pulled By: ejguan

fbshipit-source-id: 16d10bace2a473e3878ac8dd5f7b6885bd924105

* Add a class method in Model Bundler to facilitate model creation with user-defined configuration and checkpoint (#1442)

Summary:
Import from github

Command used:
`python pytorch/import.py --project_name text  --commit_ids 2040d8da87394ab5ecf6ac2bbcd5a00beb940cf4`

Note that we are still not importing the whole repo using import_text.sh; using import.py will be the workflow we rely on until we merge the [legacy code removal commit](https://github.com/pytorch/text/commit/2cebac34ab26577ee02b7295dbe01dccfdb1a88f) into fbcode.

Reviewed By: Nayef211

Differential Revision: D32603181

fbshipit-source-id: 1f583e5ac96e693b583ae42d5841bf387cf3727a

* Import torchtext from github aea6ad6,#1449 to 9f2fb3f,#1452

Summary:
command:
`python pytorch/import.py --project_name text --commit_ids aea6ad6bf9a6292af3d5051b4862b966871bdcce 9f2fb3f00cd9a4cc8d41d2e9cbfa5e9bf9533224 --squash`

Reviewed By: abhinavarora

Differential Revision: D32690771

fbshipit-source-id: cde616182ecfe643ab48d727b66bbf0194480d3e

* Fix SST2Dataset test iterator

Summary:
## Summary
- Modified SST2 dataset implementation to only return text for test split (since label_ids are not available)
- Updated doc classification datamodule to temporarily use `val_dataset` instead of `test_dataset`
- Updated first line md5 hash for SST2 test split

## Followup Items
- Update doc classification module to work with test splits with and without labels

Reviewed By: parmeet

Differential Revision: D32661112

fbshipit-source-id: ef86aea0ce587c5d5282f2caa943b4b0cdf6f54a

* Fix issue in label Transform

Summary: In the construction of the Vocab within the label transform, the default index is set to 0. This index is returned when an OOV token is queried. For this transform, the default index should never be set; otherwise, the transform returns the default index (which is 0) for any unknown label that gets passed. (Ideally it should throw an error in this case, because we do not know what to do when a wrong label is passed as a query.)
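The failure mode can be shown with a minimal stand-alone sketch (plain Python, not the torchtext Vocab API; `LabelLookup` is a hypothetical stand-in for the label transform's vocab):

```python
# Sketch of the bug: a label lookup with a default index silently maps
# unknown labels to index 0, i.e. to the first real label, instead of
# failing loudly.
class LabelLookup:
    def __init__(self, labels, default_index=None):
        self._index = {label: i for i, label in enumerate(labels)}
        self._default_index = default_index

    def __getitem__(self, label):
        if label in self._index:
            return self._index[label]
        if self._default_index is not None:
            return self._default_index  # silently wrong for bad labels
        raise KeyError(f"unknown label: {label!r}")

buggy = LabelLookup(["negative", "positive"], default_index=0)
print(buggy["typo_label"])  # 0 -- indistinguishable from "negative"

fixed = LabelLookup(["negative", "positive"])  # no default index
# fixed["typo_label"] now raises KeyError instead of returning 0
```

Leaving the default index unset is exactly the fix described above: a wrong label becomes a loud error rather than label 0.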

Reviewed By: hudeven

Differential Revision: D32610834

fbshipit-source-id: e49385fb313929627c41fc515b6d900a6bfc3591

* Import torchtext #1437 2cebac3

Summary: Imports [#1437](https://github.com/pytorch/text/pull/1437) from OSS Torchtext that removes the legacy folder.

Reviewed By: parmeet

Differential Revision: D32923084

fbshipit-source-id: 83411efd62cd527c518e36279bdbf586435ac9e5

* Import torchtext #1457 d801e99

Summary: Import from github

Reviewed By: abhinavarora, ebsmothers

Differential Revision: D32962989

fbshipit-source-id: 4de93cbc0ebe29034a505c56d03bb8d4b698891c

* Import torchtext #1459 8ef1b15

Summary: Imports torchtext to fbcode

Reviewed By: parmeet

Differential Revision: D33001763

fbshipit-source-id: 0525982a1aadcfed65172c22734a46fdf2bd7bde

* Fixing typing issues DataSet -> DataType

Summary: Forward fix of D31344867

Reviewed By: Nayef211, ejguan

Differential Revision: D33069330

fbshipit-source-id: 1649049a6caf1178a78a25baf21e1b4ecdc44d77

* Import torchtext #1470 52d38e8

Summary: Import from github

Reviewed By: ebsmothers

Differential Revision: D33291837

fbshipit-source-id: 86f8675f13190425617937dcbdd5b698da0bba0f

* Import torchtext #1486 4908d3c

Summary: As title

Reviewed By: Nayef211

Differential Revision: D33434571

fbshipit-source-id: 3cb1d43583fd1e2f28dfd27109a8bf5f1b255d1d

* Import torchtext #1488 2c98927

Summary:
====
Subject:
Switching to use FileOpener from FileLoader (#1488)
Body:

commit 2c989273a6a99eef12d2e3fe25258b27881cb0bf

====
Subject:
add scriptable sequential transform (#1481)
Body:

commit 3849f4648a5021514b6b91fa721b43b63fad8378

Reviewed By: abhinavarora

Differential Revision: D33485781

fbshipit-source-id: 3a7ca597cb2f2be98be29a639ef05a65a3f7b6be

* Update load_state_dict_from_url method to skip download if file is cached

Summary:
- Update load_state_dict_from_url method to skip download if file is cached in the `model_dir` folder
   - The `model_dir` parameter was previously unused
   - New logic is similar to the [OSS implementation in torchhub](https://pytorch.org/docs/stable/_modules/torch/hub.html#load_state_dict_from_url)
- Update unit test to test skipping download when file is already cached
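The cache-skip logic described above can be sketched as follows (a hypothetical helper, not the actual fbcode implementation; the function name and its return-a-path behavior are assumptions for illustration):

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve

def load_state_dict_path(url, model_dir):
    """Download into model_dir only when the file is not already cached
    there, and return the local path (sketch only)."""
    os.makedirs(model_dir, exist_ok=True)
    filename = os.path.basename(urlsplit(url).path)
    cached_file = os.path.join(model_dir, filename)
    if not os.path.exists(cached_file):
        # Only hit the network on a cache miss.
        urlretrieve(url, cached_file)
    return cached_file  # caller would torch.load() this path
```

A second call with the same `url` and `model_dir` returns immediately from the cache, which is the behavior the new unit test exercises.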

Reviewed By: abhinavarora

Differential Revision: D33512850

fbshipit-source-id: 2350c6dcad7e5725cf670c99405bcc7d0fb05e42

* Import torchtext #1482 0e7cf21

Summary:
- Remove xlmr transform class and instead use sequential for model transforms composition
- Modified doc classification recipe to use sequential transform instead of `XLMRobertaModelTransform`

Reviewed By: parmeet

Differential Revision: D33485834

fbshipit-source-id: 01a914112219838620f3ce81cf621665d072ae69

* Import torchtext #1509 1a2fc00

Summary:
====
Subject:
migrate YelpReviewPolarity to datapipes. (#1509)
Body:
* add initial pass at migrating YelpReviewPolarity to datapipes.

* fix flake.

commit 1a2fc00266eb71c202802803c390e00b4082085e

====
Subject:
migrate YelpReviewFull to datapipes. (#1507)
Body:
* add initial pass at migrating YelpReviewFull to datapipes.

* fix flake.

commit 0fb50b45a6b40d63ef8fabc5766a15c78fce6c8e

====
Subject:
migrate YahooAnswers to datapipes. (#1508)
Body:
* add initial pass at migrating YahooAnswers to datapipes.

* fix flake.

commit f99609d7c56d8b742f6b0f281fcd726d05aa4923

====
Subject:
migrate DBPedia to datapipes. (#1500)
Body:
* add initial pass at migrating DBPedia to datapipes.

* add _EXTRACTED_FILES for consistency.

commit 1881705aec892efd45803006fd8b6c845be9965f

====
Subject:
replace funny os.sep joins with os.path.join for consistency. (#1506)
Body:

commit ce4ab8b5c1f22cf533e04f73696b0816f63a4ae5

====
Subject:
migrate AG_NEWS to datapipes. (#1498)
Body:

commit d9fdbc62a7c9b6ed27b47d92edc33d1cf8e9cf9d

====
Subject:
migrate SogouNews to datapipes. (#1503)
Body:
* add initial pass at migrating SogouNews to datapipes.

* make filter for specific split more consistent.

commit e6065a9217a95e71ba47ca0184953627b21ab7ef

====
Subject:
Fix filter logic (#1505)
Body:

commit a415684661ff9d7fb9e2b7f438cc8e70c09781bf

====
Subject:
fix per https://github.com/pytorch/vision/issues/4832#issuecomment-957695788 (#1504)
Body:

commit 8215832272e8d05f27dc5372a5e4382ce6942819

====
Subject:
add initial pass at migrating Amazon Review Full to datapipes. (#1499)
Body:

commit df0ec14a802bb7b85f06c97f564959f988212f80

====
Subject:
Parameterize jit and non-jit model integration tests (#1502)
Body:
* Updated max seq length for truncate in xlmr base. Updated xlmr docs. Moved xlmr tests to integration tests

* Removing changes to truncate transform

* Remove documentation changes from PR

* Parameterized model tests

* Added nested_params helper method. Updated model integration test to parameterize a single method covering jit and non-jit tests

* Added docstring for unit tests

commit d896135e4f5060bbaeb2cc5c3ed43eb15bc8a4c0

====
Subject:
Remove redundant get asset functions from parameterized_utils (#1501)
Body:

commit 0bcab91246b7ca17db048d0ab97a3199b94c05ab

====
Subject:
Parameterized XLMR and Roberta model integration tests (#1496)
Body:
* Updated max seq length for truncate in xlmr base. Updated xlmr docs. Moved xlmr tests to integration tests

* Removing changes to truncate transform

* Remove documentation changes from PR

* Parameterized model tests

commit 2cb80a23412993b7eb9ded082d084eb39c1f0c4e

====
Subject:
Migrating AmazonReviewPolarity to datapipes (#1490)
Body:

commit 826a051dfd9f62731f3b0dee854d0aa687f4da72

====
Subject:
Updated XLMR docs (#1497)
Body:

commit 1a052693509ce32b6fb91302f6ca62546b0afe0d

====
Subject:
fix max sequence length for xlmr transform (#1495)
Body:

commit 776a15daed49f4046b46a2501ea8b63e85bc9da2

====
Subject:
Add pre-trained Roberta encoder for base and large architecture (#1491)
Body:
* Added new roberta encoders and tests

* Added docs for roberta encoder

* Updated truncate length. Added info on how model was trained along with license info

* Added datasets that roberta was trained on

* Removing unnecessary new line

commit 6d9e6df7dee068d99d74355d14c7cb897b199d60

====
Subject:
remove optionality of dtype in `ToTensor` (#1492)
Body:

commit c0e1c38b34ebabf0f12859ee2594194f9c65957a

Reviewed By: abhinavarora

Differential Revision: D33555196

fbshipit-source-id: eca8e38ea61c72a626ec20096f18827cebae4ef7

Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca>

* Import torchtext #1532 ce1ce9

Summary:
====
Subject:
Add AmazonReviewPolarity Mocked Unit Test (#1532)
Body:
* First attempt at adding test for amazon review polarity

* Updated dataset to take validate_hash param. Finalized tests

* Created non empty tar file

* Remove formatting. Patch _hash_check method from torchdata during testing

* Added super().setUpClass()

* Remove commented import

commit ce1ce99583795207153e13e9bc35a388d368a49d

====
Subject:
migrate Multi30k to datapipes. (#1536)
Body:

commit 627c71f837f6acf7db34b3d96a696624cb4a7087

====
Subject:
add initial pass at migrating UDPOS to datapipes. (#1535)
Body:

commit f685c55e02a43b6489d096f1dd2c05e8be13df63

====
Subject:
Migrate WikiText103 to datapipes (#1518)
Body:

commit 042f12f1be9701fc85129c9be380aec72ed3bc2e

====
Subject:
add double caching for yelp full to speed up extracted reading. (#1529)
Body:

commit d19a77eb69a11a3c9feb74b391e288ed70277bb4

====
Subject:
Migrate WikiText2 to datapipes (#1519)
Body:
* Migrate WikiText2 to datapipes

* Address code review comments and add double caching

commit 437eea8f841fc5efe7dc0f116bbfef781cb88b84

====
Subject:
add double caching for yahoo to speed up extracted reading. (#1528)
Body:
* add double caching for yahoo to speed up extracted reading.

* simplify filepath_fn

* rename dps for consistency.

* add FileOpener within caching block for more consistency.

commit ff78e999f6edb866c33a1464c8288cb90f15c9e4

====
Subject:
add max_tokens kwarg to vocab factory. (#1525)
Body:

commit e1d66cf8ccd2b29378d5f3352b01e4310a36b557
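The `max_tokens` kwarg caps the vocabulary size. A hypothetical sketch of that behavior (mirroring the kwarg described above, not the torchtext factory itself): keep the specials, then the most frequent tokens, up to `max_tokens` entries total.

```python
from collections import Counter

def build_vocab(tokens, max_tokens=None, specials=("<unk>",)):
    """Sketch of a max_tokens-style cap on vocab size (illustrative
    helper, not the torchtext API)."""
    counter = Counter(tokens)
    vocab = list(specials)
    budget = None if max_tokens is None else max_tokens - len(specials)
    for token, _count in counter.most_common(budget):
        if token not in vocab:
            vocab.append(token)
    return {token: i for i, token in enumerate(vocab)}

v = build_vocab("a b a c a b d".split(), max_tokens=3)
print(v)  # -> {'<unk>': 0, 'a': 1, 'b': 2}
```

With `max_tokens=3` and one special, only the two most frequent tokens (`a`, `b`) survive; rarer tokens (`c`, `d`) fall back to `<unk>` at lookup time.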

====
Subject:
Migrate IMDB to datapipes (#1531)
Body:
* Migrate IMDB to datapipes

* add double cache for extracted reading

* update cache name

commit 03afb7e1e6b6821eb7b479aa11b9c449c251de7a

====
Subject:
add double caching for yelp polarity to speed up extracted reading. (#1530)
Body:
* add double caching for yelp polarity to speed up extracted reading.

* rename dps for consistency and simplify filepath_fn

* add FileOpener within caching block for more consistency.

commit 83aebf495a92761e9683b4af4461ad28ae5c96a7

====
Subject:
Migrating EnWik9 to datapipes #1511 (#1512)
Body:
* Migrating enwik9 dataset to use torchdata

* Added typing to params

* Fixed PR comments. Updated to data_dp

* Added caching for extracted files

* Moved FileOpener after ondiskcache datapipe

commit 12317098cef5822846125e579cd197b217c9e30e

====
Subject:
Migrating PennTreebank to datapipes (#1511)
Body:
* Migrating penntreebank dataset to use torchdata

* Update FileLoader to FileOpener

* Resolved comments about return_path

* Using strip() to remove leading/trailing spaces

commit eb3994567830aeeccfcc1d7053ac6c29400cb593

====
Subject:
Cache extraction for AmazonReviewPolarity (#1527)
Body:

commit 0f7f859e412fba4a31852c1a84801a182e636fde

====
Subject:
migrate CONLL 2000  to datapipes. (#1515)
Body:

commit b52746546c0648122231e4d73bf24175ef949df3

====
Subject:
add initial pass at migrating SQUAD2 to datapipes. (#1514)
Body:

commit a2ab9741415b2cff026d158a5a54b62b993571d9

====
Subject:
migrate SQUAD1 to datapipes. (#1513)
Body:

commit a5ca19407b844e49679d87c94003e08c5efd6d78

====
Subject:
Attempting to fix version conflict in CI (#1520)
Body:
* since we no longer support python 3.6, we get dataclasses in stdlib for free.

* replace pip-install of packages with conda-install where applicable for better version management of native code.

* make cpuonly a constraint instead of feature

commit a6ae5946e49db2afb2eb8ca5435afaea036077f3

====
Subject:
fixing cache logic to work with datapipes (#1522)
Body:
* fixing cache logic to work with datapipes

* committing temporary change to build cache

* reverting temp change

commit cf668aabf869ae9bdbc5c1259e011f36a1411a2b

====
Subject:
3.6 is EOL (#1521)
Body:

commit 7467bb5971b8ed59a716ba05b82bb1030ed4fbe2

====
Subject:
Fixing dataset test failures due to incorrect caching mode (#1517)
Body:

commit 38ec295c1970776a43b42712b4156d2635ae85c3

====
Subject:
IterDataPipes do not have __next__ (#1516)
Body:

commit 8f153f692ed85229db8e43b14398adae5f58d646

Reviewed By: abhinavarora

Differential Revision: D33850546

fbshipit-source-id: 2235caac646eb0fcc14fb638cbbfd4b15f966035

Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca>

* Import torchtext #1538 d72124c

Summary:
- Import d72124c commit which migrates the SST2 dataset away from experimental
- Modify doc classification recipe to work with new functional dataset implementations
   - Make label transform optional since some datasets return integer labels
   - Added a `num_labels` field to `DocClassificationTransform` class which will be used to determine `num_classes` for metrics computation
- Update the `SST.zip` testing asset with the correct folder structure

Reviewed By: parmeet

Differential Revision: D33792100

fbshipit-source-id: 4480ef0ba8dabb495f0a2adc45f588413aea5f4d

* Import torchtext from commits e0c5528 to 8808e7e

Summary:
Command used:

`python pytorch/import.py --github_username parmeet --project_name text --commit_from d72124cb710574087d0bce87062ee521e1584167 --commit_to 8808e7eee5a2df79b9566a4a348889dc2722fcfb --skip_commit_ids 7f3ed4b183eb451b439740a59bb849771c707f0c --squash`

Followed by: arc lint to fix new line linter issues

Reviewed By: VirgileHlav

Differential Revision: D34717890

fbshipit-source-id: 7aa0f22421b3f3bfb9684c6e24f7dc606052da5c

* Import torchtext #1635 69f67f3

Summary:
Import latest from github using import_text.sh script

Changed `RobertaModelBundle` to `RobertaBundle`

Reviewed By: Nayef211

Differential Revision: D34718778

fbshipit-source-id: f68fc827c5956ffedc4f5a98175d0724ca431c9d

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D34753031

fbshipit-source-id: 6d8a92b4c2f4b5b85b90edb5b8329e3061411620

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D34815193

fbshipit-source-id: 1865de76d8b133f56e4060961c1173097efac575

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D34857180

fbshipit-source-id: 1c483f2277c902271a2ff75f0c36bf5de8bbba34

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D34920364

fbshipit-source-id: c03dfd98da4b66dc63e5e4dfd11af449bc95ce85

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D35074232

fbshipit-source-id: 1772fbf171665894ab8945d967011873aa7f626e

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D35309109

fbshipit-source-id: be74f1c2739fbfb6e43cc6649839e647c37de4c8

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D35392538

fbshipit-source-id: 02ca5e81ec7ca2d607c1eef3b32ddf4a51c279c8

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D35425316

fbshipit-source-id: 815c3d048440211d2107b8605830530db609efe0

* torchx integration

Summary: Integrate with TorchX to run the training locally or in a flow

Reviewed By: Nayef211

Differential Revision: D35412165

fbshipit-source-id: 297bea540ace67d93965e0982d6c8f8ff5d03208

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D35773883

fbshipit-source-id: ab1787498d2169b4345f5981c21eb6b898fa8f2e

* BetterTransformer support for torchtext (#1690)

Summary:
Pull Request resolved: https://github.com/pytorch/text/pull/1690

This diff creates a fast path using Better Transformer (torch.nn.TransformerEncoderLayer), with a converter from the existing torchtext transformer encoder layer to Better Transformer.
The related tests are added in the following diff.

Reviewed By: parmeet

Differential Revision: D35948440

fbshipit-source-id: e69e12f2dd28edfea3176a10ee3d7d321d50c897

* Kill to_better by having native load_from_state_dict and init

Summary: Fully remove the to_better method by rebuilding torchtext TransformerEncoderLayer's load_from_state_dict and init. No more redundant params.

Reviewed By: parmeet

Differential Revision: D36020184

fbshipit-source-id: ccdd6da853a86034762b235cd7d5f793876d16c6

* Remove unneeded modules after using nn.Module for BetterTransformer (#1693)

Summary:
Pull Request resolved: https://github.com/pytorch/text/pull/1693

Remove unneeded modules after using nn.Module for BetterTransformer

Reviewed By: zrphercule

Differential Revision: D36038830

fbshipit-source-id: 1e0f5c7cf81096cf66cc1afcf15b5e0645c3da03

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D36034077

fbshipit-source-id: 40c12ec37992d71c4857f92bc5e2ed939e2d6030

* Replace TransformerEncoder in torchtext with better transformer (#34)

Summary:
X-link: https://github.com/facebookresearch/multimodal/pull/34

Pull Request resolved: https://github.com/pytorch/text/pull/1700

Replace the usage of TransformerEncoder with BetterTransformerEncoder.
In theory we should be able to remove torchtext.TransformerEncoderLayer after this diff.

Reviewed By: parmeet

Differential Revision: D36084653

fbshipit-source-id: 64ed3810e809fc1db840e75e2e05783089ff31d2

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D36162313

fbshipit-source-id: ff366f585b4783e903f8388654e71ce635b2a556

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D36307982

fbshipit-source-id: faf90f12012bd962fc5decfd3cf9e117f4b9160a

* Enable model testing in FBCode

Summary:
This diff enables model testing in fbcode.

Notes:
1. It only tests XLM-R models (base and large) in integration tests. We need a follow-up diff to enable RoBERTa testing, since the corresponding assets are missing in fbcode.

Edit: addressed the RoBERTa model testing in this diff itself.

2. parameterized was giving some weird long names to the tests, which was creating an unknown issue when running them in Sandcastle. Removed it for now to get proper names for the tests.

Edit: refactored the test suite, since nested_params was creating long string names (400+ characters) for test methods due to RobertaBundle objects.

Reviewed By: mikekgfb

Differential Revision: D35973306

fbshipit-source-id: 8a50d03466f60c8a4a0fbd5857611e68c92ebf08

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D36340622

fbshipit-source-id: ed6f1994916d5d469198e6d0876387a6363db1ea

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D36448402

fbshipit-source-id: bee15f955a21a730653d72d4aedff7b6122f6ef0

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D36510904

fbshipit-source-id: 1b9b27e62af007e88f76414e936fa08ae1ce7d59

* Import torchtext #1794 a54be1f3a7ac534509ac9c066a1b35127936dd77

Summary:
Manually importing TorchText from github using ```./fbcode/pytorch/fb_build/import_text.sh```

In additional to manual import, this diff also updates the libtorchtext TARGET dependency on utf8proc

Reviewed By: VirgileHlav

Differential Revision: D37250868

fbshipit-source-id: 369d67aa02492f620350eb8b28c00b59dc84f081

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D37171614

fbshipit-source-id: 56fa981bc709f78ac3371a5346b9278730895b82

* Import TorchText from Github

Summary:
Meta:

Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash.

Rules run:
- CodemodTransformerSimpleShell

Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text)
CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php)
ConfigType: php
Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/31525198098541494/
This diff was automatically created with CodemodService.
To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService).

_____

## Questions / Comments / Feedback?

**[Click here to give feedback about this diff](https://www.internalfb.com/codemod_service/feedback?sandcastle_job_id=31525198098541494).**

* Returning back to author or abandoning this diff will only cause the diff to be regenerated in the future.
* Do **NOT** post in the CodemodService Feedback group about this specific diff.

Reviewed By: Nayef211

Differential Revision: D37374922

fbshipit-source-id: d2cfb5e58fc35b653f00b0d81330fe2337e6e347

* Import TorchText from Github

Summary:
Meta:

Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash.

Rules run:
- CodemodTransformerSimpleShell

Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text)
CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php)
ConfigType: php
Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/709158032/
This diff was automatically created with CodemodService.
To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService).

Reviewed By: parmeet

Differential Revision: D37411197

fbshipit-source-id: 8eeb460843eacfd0f3d970062b3e0e393d5eef6f

* Import TorchText from Github

Summary:
Meta:

Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash.

Rules run:
- CodemodTransformerSimpleShell

Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text)
CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php)
ConfigType: php
Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/711752278/
This diff was automatically created with CodemodService.
To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService).

Reviewed By: Nayef211

Differential Revision: D37483835

fbshipit-source-id: b4ad3c43ece7c83c57617e6a5851fff3ecdf8e51

* Adding TARGETS file for torchtext benchmarks

Summary:
### Summary
- Enable benchmarking of torcharrow ops within torchtext

### Benchmark Results
- Benchmarking in fbcode devserver
```
torchtext GPT2BPE tokenizer: 65.811
torchtext vocab: 2.226
torchtext add tokens operation (string): 0.722
torchtext add tokens operation (int): 0.598

torcharrow GPT2BPE tokenizer: 65.739
torcharrow vocab: 1.253
torcharrow add tokens operation (string): 14.335
torcharrow add tokens operation (int): 0.229
```

Benchmarking on Apple MBP (results can also be found in [text#1801](https://github.com/pytorch/text/pull/1801) and [text#1807](https://github.com/pytorch/text/pull/1807))

```
torchtext GPT2BPE tokenizer: 3.13
torchtext vocab: 0.32
torchtext add tokens operation (string): 0.382
torchtext add tokens operation (int): 0.431

torcharrow GPT2BPE tokenizer: 59.13
torcharrow vocab: 0.03
torcharrow add tokens operation (string): 3.652
torcharrow add tokens operation (int): 0.075

```

### Takeaways
- GPT2BPE for torchtext is significantly faster on MBP than devserver
- AddTokens (str) for torcharrow is still significantly slower on both MBP and devserver than the torchtext counterpart

Reviewed By: parmeet

Differential Revision: D37463862

fbshipit-source-id: 1fb538338367bac2b002c1a4b8f128b0b2847bf5

* Import TorchText from Github

Summary:
Meta:

Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash.

Rules run:
- CodemodTransformerSimpleShell

Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text)
CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php)
ConfigType: php
Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/13510799591955868/
This diff was automatically created with CodemodService.
To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService).

Reviewed By: abhinavarora

Differential Revision: D37514618

fbshipit-source-id: efc3b56b6da2afdc601b3dc706c58d0222d0daf6

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D37642224

fbshipit-source-id: 674d2fdfa57bc2131bed136986d385194416f0bb

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D37680190

fbshipit-source-id: b06341b9989bdcb0859ad84838860f05ef2e501f

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D37711064

fbshipit-source-id: 3646b536af2359b776e6a49b9c86f6657c0f1a4c

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D37879352

fbshipit-source-id: 53b04c4b41a3c7e8077842c39a331144eab76208

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D37952995

fbshipit-source-id: 09c492ac8d1333283bb4366c9ae0c6b95b98a87c

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D38110070

fbshipit-source-id: 824a1a2d7a4cb97a69b3bcfd39167ac039edd1b5

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D38146055

fbshipit-source-id: 1b232be8ce396189a123139ac8456433d12d2316

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D38269840

fbshipit-source-id: 901e5279e8e0265fabd48aca861a43d2e4c45dee

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D38351452

fbshipit-source-id: 2439d74bc9ab3f477876f35f549caec9117711bd

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D38381535

fbshipit-source-id: ba50c1a33fda33c4ccc8157702f32b94d415197f

* Import TorchText from Github

Reviewed By: parmeet

Differential Revision: D38419656

fbshipit-source-id: 871439658ed673910c68c025be471501b9b4670a

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D38534440

fbshipit-source-id: 3bf1a7d5cc2daa8d14e424d16509b2df998549b8

* Import TorchText from Github

Reviewed By: Nayef211

Differential Revision: D38655164

fbshipit-source-id: 0b9364fb759520c6fb60147fd0ab1044c362d588

* Import torchtext #1879 72966f0

Summary: Ran the `import_text.sh` command to manually update the internal fbcode to match the GitHub torchtext repo

Reviewed By: Nayef211

Differential Revision: D38796445

fbshipit-source-id: 904143c404141bb016a5f83fbc53906b1c6e1246

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D38907288

fbshipit-source-id: f82ad8121bce924ad6068767845e5ea29dd24bef

* Remove dependency on the torch::jit::script::Module for mobile builds

Summary: Resolves linkage errors. Specifically, when the vocab is built for the "mobile" version, it can't resolve symbols for torch::jit::script::Module.

Reviewed By: Nayef211

Differential Revision: D38771271

fbshipit-source-id: 693b656f2a17af9fa5a7a1904742557f902edb55

* Replace `pytext_lib`'s `MaskTransform` with new one from `torchtext`

Summary: Replace instances of `pytext_lib`'s `MaskTransform` with new one from `torchtext` that was merged in https://github.com/pytorch/text/pull/1882

Reviewed By: Nayef211

Differential Revision: D39058074

fbshipit-source-id: f61499d88eec7eccda659279786528bac7edf9d0

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D39095295

fbshipit-source-id: 2e447db46b71fc152f2f53b281585650682cb696

* move PATH_MANAGER to OSS

Summary:
## Problem:
pytext got "No module named 'pytorch'" in issue https://github.com/facebookresearch/pytext/issues/1706

It's because `from pytorch.text.fb.utils import PATH_MANAGER` is internal-only but imported in pytext. Actually, `pytorch/text/fb/utils/__init__.py` should be open-sourced.

## Solution:
This diff moved it to OSS as `from torchtext.utils import PATH_MANAGER` and updated all the references

Reviewed By: Nayef211

Differential Revision: D39292896

fbshipit-source-id: c0046d62e64145b60ad9a5298b366f0f1a348369

* Turn off mask checking for torchtext which is known to have a legal mask (#1896)

Summary:
Pull Request resolved: https://github.com/pytorch/text/pull/1896

Turn off mask checking for torchtext which is known to have a legal mask

Reviewed By: zrphercule

Differential Revision: D39445703

fbshipit-source-id: 3f0cacfd39ea11a16c7a06f339872554333b5e97

* Back out "move PATH_MANAGER to OSS" (#1724)

Summary:
X-link: https://github.com/facebookresearch/pytext/pull/1724

Original commit changeset: c0046d62e641

Original Phabricator Diff: D39292896

torchtext can't depend on iopath as raised in https://github.com/pytorch/text/pull/1905

Reviewed By: Nayef211

Differential Revision: D39639475

fbshipit-source-id: 69a48eb3820d0642b0a56712e160a0af589e4c7c

* Import TorchText from Github

Summary: Manually import latest changes from github to fbcode

Reviewed By: joecummings

Differential Revision: D39770284

fbshipit-source-id: 1e442f222d582c43a2ca9280d93eca4135d2df09

* Import TorchText from Github

Reviewed By: rshraga

Differential Revision: D39811057

fbshipit-source-id: 33cce346ac3d226a2fff6c162c39164837f34d87

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D40225047

fbshipit-source-id: 7abff009d65d713a6ce134fc88cd1955f62e3e3d

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D40294258

fbshipit-source-id: b3e14d9e78e346c294f1bc65ba3045b92251e034

* Add Character Level BPE Tokenizer (#1936)

Summary:
Pull Request resolved: https://github.com/pytorch/text/pull/1936

This change adds a character-level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e., a vocab dict) and a merge list as input. It is not using C++ for encoding/decoding at this time.

Reviewed By: langong347

Differential Revision: D40186470

fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896
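The merge procedure such a tokenizer relies on can be sketched in plain Python. This is a hypothetical illustration of character-level BPE merging, not the torchtext API; `bpe_encode` and the sample merge list are invented for this sketch:

```python
def bpe_encode(word, merges):
    """Apply BPE merges to a word split into characters.

    merges: list of (left, right) pairs, highest-priority merge first.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((tokens[i], tokens[i + 1]), float("inf")), i)
                 for i in range(len(tokens) - 1)]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # → ['low', 'er']
```

A real tokenizer would learn the merge list from a corpus and also map the resulting tokens to integer ids via the encoder dict.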

* Add padding_masks and tests for T5Model (#1935)

Summary:
Pull Request resolved: https://github.com/pytorch/text/pull/1935

Added the following parameters to the `forward` method of the T5Model:
* `encoder_padding_mask`
* `decoder_padding_mask`

These allow users to explicitly mask out the padding positions of input sequences. This matches the Transformer implementation in PyTorch core.

Reviewed By: Nayef211

Differential Revision: D40252794

fbshipit-source-id: 0e0a17fdc97ae0bbcaa1aef91e9914fd6225456b
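Conceptually, a padding mask flags the pad positions of each sequence so that attention ignores them. A framework-free sketch (the real model operates on tensors; `PAD` and `padding_mask` are illustrative names, not the torchtext API):

```python
PAD = 0  # assumed pad token id for this sketch

def padding_mask(batch):
    # True marks a padding position that attention should ignore.
    return [[tok == PAD for tok in seq] for seq in batch]

batch = [[5, 7, 2, PAD, PAD],
         [3, PAD, PAD, PAD, PAD]]
print(padding_mask(batch))
# → [[False, False, False, True, True], [False, True, True, True, True]]
```

With tensors, the same mask would typically be built as `tokens.eq(pad_idx)` and passed alongside the input sequences.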

* Import TorchText from Github

Reviewed By: abhinavarora

Differential Revision: D40425553

fbshipit-source-id: 268b94d65cff771028c2e2fdf21caa9855d07cef

Co-authored-by: Guanheng Zhang <zhangguanheng@devfair0197.h2.fair>
Co-authored-by: Christian Puhrsch <cpuhrsch@devfair0129.h2.fair>
Co-authored-by: cpuhrsch <cpuhrsch@fb.com>
Co-authored-by: Moto Hira <moto@fb.com>
Co-authored-by: George Guanheng Zhang <zhangguanheng@fb.com>
Co-authored-by: Stanislau Hlebik <stash@fb.com>
Co-authored-by: Andres Suarez <asuarez@fb.com>
Co-authored-by: Meghan Lele <meghanl@fb.com>
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
Co-authored-by: Vasilis Vryniotis <vvryniotis@fb.com>
Co-authored-by: Jeff Hwang <jeffhwang@fb.com>
Co-authored-by: Parmeet Singh Bhatia <parmeetbhatia@fb.com>
Co-authored-by: Artyom Astafurov <asta@fb.com>
Co-authored-by: Nicolas Hug <nicolashug@fb.com>
Co-authored-by: Heitor Schueroff <heitorschueroff@fb.com>
Co-authored-by: Facebook Community Bot <facebook-github-bot@users.noreply.github.com>
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Vincent Quenneville-Belair <vincentqb@fb.com>
Co-authored-by: Yao-Yuan Yang <yyyang@fb.com>
Co-authored-by: Evan Smothers <ebs@fb.com>
Co-authored-by: Erjia Guan <erjia@fb.com>
Co-authored-by: Abhinav Arora <abhinavarora@fb.com>
Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com>
Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca>
Co-authored-by: CodemodService Bot <>
Co-authored-by: Steven Liu <stevenliu@fb.com>
Co-authored-by: Rui Zhu <zrphercule@fb.com>
Co-authored-by: Michael Gschwind <mikekg@fb.com>
Co-authored-by:…