Building docs #59
@nelson-liu
I was just about to comment on your PR, so let's move the conversation over there. Thanks for taking this on!
Is there any work going on with the docs, or do you guys have any plans for how the docs should look? I'd love to contribute.
facebook-github-bot pushed a commit that referenced this issue on Nov 22, 2021
Summary:
Pull Request resolved: pytorch/data#61
Fixes facebookexternal/torchdata#114 and facebookexternal/torchdata#140
* #59
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D31734382
Pulled By: ejguan
fbshipit-source-id: 16d10bace2a473e3878ac8dd5f7b6885bd924105
parmeet pushed a commit to parmeet/text that referenced this issue on Nov 23, 2021
Summary:
Pull Request resolved: pytorch/data#61
Fixes https://github.com/facebookexternal/torchdata/issues/114 and https://github.com/facebookexternal/torchdata/issues/140
* pytorch#59
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D31734382
Pulled By: ejguan
fbshipit-source-id: 16d10bace2a473e3878ac8dd5f7b6885bd924105
Nayef211 added a commit that referenced this issue on Oct 19, 2022
* include pytorch 1.5.0-rc1 for CI test * bump up the version * Set up ShipIt fbshipit-source-id: bb7d2eb52240c7223b57c3c9624e61d116e77e39 * Re-sync with internal repository (#749) * 20200429 pytorch/text import Summary: [20:45:34: cpuhrsch@devvm3140 pytorch]$ ./fb_build/import_text.sh Reviewed By: pbelevich Differential Revision: D21320577 fbshipit-source-id: ac2148b9f0d58e5538443c879845bfb4f6ca7202 * 20200430 torchtext import script to include additional meta files Summary: ./fb_build/import_text.sh Reviewed By: zhangguanheng66 Differential Revision: D21343124 fbshipit-source-id: c08ecad2cc6f439fa40130aeaf91383be9403fe8 * torchtext flake8, github, travis metafiles Summary: See title Reviewed By: pbelevich Differential Revision: D21344211 fbshipit-source-id: a8bcf7f3ab9bb2c2853e27f612e82caa341d3651 * Import torchtext 20200520 and update build Summary: Import torchtext up to #786 Reviewed By: cpuhrsch Differential Revision: D21483116 fbshipit-source-id: bc8ab38db9dc9ce4a8734ca8ea991c20e4ef0882 * Import torchtext 20200528 Summary: Import up to #798 Addresses T67599333 Reviewed By: zhangguanheng66 Differential Revision: D21764935 fbshipit-source-id: f44d1db637799f2e95f420a8099fbf19545c7cbd * 20200604 torchtext github import Summary: Import from github master Reviewed By: zhangguanheng66 Differential Revision: D21886238 fbshipit-source-id: a8f098e299466dd1701fe7ceb6a97c2a2fc54b9d * Import torchtext 20200605 Summary: Import from github master Reviewed By: zhangguanheng66 Differential Revision: D21907519 fbshipit-source-id: f22370d97796da5f2cb9f76f506c80f18fefea7f * Back out "Import torchtext 20200605" Summary: Original commit changeset: f22370d97796 Reviewed By: zhangguanheng66 Differential Revision: D21964222 fbshipit-source-id: c316836596fc3e232e63abc59e172f237b551cc5 * Import torchtext 2020/06/22 Summary: Import from github torchtext/master Reviewed By: zhangguanheng66, cpuhrsch Differential Revision: D22168183 fbshipit-source-id: 
7d96ade64f18942d9bd19437011be2f65f0b2a5e * Fix torch.testing._internal module not found Reviewed By: Nayef211 Differential Revision: D22315715 fbshipit-source-id: 6b8b8544b0aa458cf5e7e9ca380d0dc85c98189f * Import torchtext 2020/07/07 Summary: Import from github torchtext/master Reviewed By: cpuhrsch Differential Revision: D22420576 fbshipit-source-id: 4d2c19d7f1db8f698894ca406c1c44b2ad8e0506 * remediation of S205607 fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac * remediation of S205607 fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3 * Import torchtext 2020/07/21 Summary: Import from github torchtext/master Reviewed By: zhangguanheng66 Differential Revision: D22641140 fbshipit-source-id: 8190692d059a937e25c5f93506581086f389c291 * Remove .python3 markers Reviewed By: ashwinp-fb Differential Revision: D22955630 fbshipit-source-id: f00ef17a905e4c7cd9196c8924db39f9cdfe8cfa * Import torchtext 2020/08/06 Summary: Import from github torchtext/master Reviewed By: zhangguanheng66 Differential Revision: D22989210 fbshipit-source-id: 083464e188b758a8746123f4dd2197cc7edc4bc4 * Import torchtext 2020/08/18 Summary: Import from github torchtext/master Reviewed By: cpuhrsch Differential Revision: D23190596 fbshipit-source-id: 1568a25a5bd6431bcef3c6539f64a3ab1f5bccd7 * Import torchtext from 8aecbb9 Reviewed By: hudeven Differential Revision: D23451795 fbshipit-source-id: 73e6130c16716919c77862cef4ca4c8048428670 * Import torchtext 9/4/2020 Reviewed By: Nayef211 Differential Revision: D23539397 fbshipit-source-id: 88dce59418a3071cbc9e944cf0a4cf2117d7d9f7 * Import github torchtext on 9/9/2020 Reviewed By: cpuhrsch Differential Revision: D23616189 fbshipit-source-id: 365debc987326145eead7456ed48517fe55cac96 * Add property support for ScriptModules (#42390) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390 **Summary** This commit extends support for properties to include ScriptModules. 
**Test Plan** This commit adds a unit test that has a ScriptModule with a user-defined property. `python test/test_jit_py3.py TestScriptPy3.test_module_properties` Test Plan: Imported from OSS Reviewed By: eellison, mannatsingh Differential Revision: D22880298 Pulled By: SplitInfinity fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560 * sync with OSS torchtext 9/15/20 Reviewed By: cpuhrsch Differential Revision: D23721167 fbshipit-source-id: 13b32091c422a3ed0ae299595d69a7afa7136638 * Import Github torchtext on 9/28/2020 Reviewed By: cpuhrsch Differential Revision: D23962265 fbshipit-source-id: 0d042878fe9119aa725e982ab7d5e96e7c885a59 * Enable @unused syntax for ignoring properties (#45261) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261 **Summary** This commit enables `unused` syntax for ignoring properties. Inoring properties is more intuitive with this feature enabled. `ignore` is not supported because class type properties cannot be executed in Python (because they exist only as TorchScript types) like an `ignored` function and module properties that cannot be scripted are not added to the `ScriptModule` wrapper so that they may execute in Python. **Test Plan** This commit updates the existing unit tests for class type and module properties to test properties ignored using `unused`. Test Plan: Imported from OSS Reviewed By: navahgar, Krovatkin, mannatsingh Differential Revision: D23971881 Pulled By: SplitInfinity fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33 * Import Github torchtext on 10/11/2020 Reviewed By: cpuhrsch Differential Revision: D24242037 fbshipit-source-id: 605d81412c320373f1158c51dbb120e7d70d624d * make duplicate def() calls an error in the dispatcher. 
Updating all fb operators to use the new dispatcher registration API (#47322) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47322 Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API. I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out: For simple ops that only had one registered kernel without a dispatch key, I replaced them with: ``` TORCH_LIBRARY_FRAGMENT(ns, m) { m.def("opName", fn_name); } ``` For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block. ``` // cpu file TORCH_LIBRARY_FRAGMENT(ns, m) { m.def("opName(schema_inputs) -> schema_outputs"); m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel))); } // cuda file TORCH_LIBRARY_IMPL(ns, CUDA, m) { m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel))); } ``` Special cases: I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA- this seems safe because the keys are aliased to one another in `DispatchKey.h` There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API. There were a handful of ops that were registered only to a CUDA dispatch key. 
I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D24714803 Pulled By: bdhirsh fbshipit-source-id: c809aad8a698db3fd0d832f117f833e997b159e1 * Revert D24714803: make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API Differential Revision: D24714803 Original commit changeset: c809aad8a698 fbshipit-source-id: fb2ada65f9fc00d965708d202bd9d050f13ef467 * Import torchtext on Nov 20, 2020 Summary: Import torchtext on the commit of 633548a1bdf0bac1e38f98da375a537ce0c2994b allow-large-files Reviewed By: cpuhrsch Differential Revision: D25127691 fbshipit-source-id: 3a617f5f4849df452f8a102a77ce11a1bce5af1f * Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API. (#48178) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48178 I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out: For simple ops that only had one registered kernel without a dispatch key, I replaced them with: ``` TORCH_LIBRARY_FRAGMENT(ns, m) { m.def("opName", fn_name); } ``` For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block. 
``` // cpu file TORCH_LIBRARY_FRAGMENT(ns, m) { m.def("opName(schema_inputs) -> schema_outputs"); m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel))); } // cuda file TORCH_LIBRARY_IMPL(ns, CUDA, m) { m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel))); } ``` Special cases: I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA- this seems safe because the keys are aliased to one another in `DispatchKey.h` There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API. There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above. 
Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D25056090 Pulled By: bdhirsh fbshipit-source-id: 8f868b45f545e5da2f21924046e786850eba70d9 * Import torchtext from github into fbcode on 1/11/2021 Reviewed By: cpuhrsch Differential Revision: D25873762 fbshipit-source-id: 0d34d36aeb8e7e2ce72fcf345c5e7e713ef3663c * Import torchtext from github #1121 d56fffe Summary: Import torchtext from github #1121 d56fffe Reviewed By: zhangguanheng66 Differential Revision: D25976268 fbshipit-source-id: 81589f8988a54cc12f17f0a6f298a915e829a830 * Import the hidden files in torchtext github repo Reviewed By: mthrok Differential Revision: D26001386 fbshipit-source-id: f822f0f32232d3006ef629937520dee6c0faf414 * add a newline mark to config.yml file (#1128) Reviewed By: zhangguanheng66 Differential Revision: D26369003 fbshipit-source-id: 09ca48f9705d8663b06e6a329a6b64b24f9c148e * Replace model with full name when spacy load is used (#1140) Reviewed By: zhangguanheng66 Differential Revision: D26369005 fbshipit-source-id: b1e6b5d77810bb8f67d14b8a1c7ec0a9f4831cab * Fix the num_lines argument of the setup_iter func in RawTextIterableDataset (#1142) Reviewed By: zhangguanheng66 Differential Revision: D26368999 fbshipit-source-id: 4b50e5d9e5fbdf633e8b3f0072223eed050af793 * Fix broken CI tests due to spacy 3.0 release (#1138) Reviewed By: zhangguanheng66 Differential Revision: D26368998 fbshipit-source-id: 84e883562a9a3d0fe47b54823b22f7b2cd82fca4 * Switch data_select in dataset signature to split (#1143) Reviewed By: zhangguanheng66 Differential Revision: D26369006 fbshipit-source-id: 608f42fa180db9ebcfaaeadc6b8cdd29393262af * Add offset arg in the raw text dataset (#1145) Reviewed By: zhangguanheng66 Differential Revision: D26368996 fbshipit-source-id: 52741015139c302b7b0ddf8c8f50ab45a609fd2f * switch to_ivalue to __prepare_scriptable__ (#1080) Reviewed By: zhangguanheng66 Differential Revision: D26368995 fbshipit-source-id: 0352c04e422c835350bd42df35d4054d543fee36 * Pass 
an embedding layer to the constructor of the BertModel class (#1135) Reviewed By: zhangguanheng66 Differential Revision: D26369001 fbshipit-source-id: f5a67a2a812d568073505ec4d181f6e418eb4a3f * add __next__ method to RawTextIterableDataset (#1141) Reviewed By: zhangguanheng66 Differential Revision: D26368997 fbshipit-source-id: f5ef78f5f4a224db497f47f774eaddedd0498b4b * Add func to count the total number of parameters in a model (#1134) Reviewed By: zhangguanheng66 Differential Revision: D26369000 fbshipit-source-id: c687c0f0c2697dbd9c17a79a1291a2e279bbd1b8 * Retire the legacy code in torchtext library and fix the dependency of the downstream libraries Summary: This diff is doing: 1) move the legacy code in torchtext to the legacy folder; 2) for the downstream libraries in fbcode, if they are using the legacy code, add "legacy" to the path. Reviewed By: cpuhrsch Differential Revision: D23718437 fbshipit-source-id: 1660868aaa95ac6555ad6793dda5ce02a9acdc08 * Sync torchtext GH<->fbcode until GH commit 1197514eb8cc33ccff10f588534f405b43908660 Summary: Import recent torchtext changes up until GH commit 1197514eb8cc33ccff10f588534f405b43908660 Reviewed By: zhangguanheng66 Differential Revision: D26824967 fbshipit-source-id: fc4be4f94a8f748ce2ed5e776e30a42422cbcab9 * 20210304[2] Sync torchtext GH<->fbcode until GH commit 2764143865678c41e69ad3b993556fe90c1e6391 Summary: Sync up until commit in title Reviewed By: zhangguanheng66 Differential Revision: D26829429 fbshipit-source-id: a059a36d83b3803dfed9198d0e474e0e75f94f17 * 20210308 Sync torchtext GH <-> fbcode Summary: Import latest GH changes Reviewed By: zhangguanheng66 Differential Revision: D26888371 fbshipit-source-id: cc27f51fd89ad86b8bcfb8f286ad874ab01b1fd6 * Re-name raw_datasets.json file with jsonl extension Reviewed By: cpuhrsch Differential Revision: D26923978 fbshipit-source-id: c87c7776445e05d452f6b38244bf4cdaba45bdec * 20210329 Sync torchtext up to GH commit eb5e39d3d40525c0064c8e7b7c976755e7341a8b Summary: 
Sync torchtext up to GH commit eb5e39d3d40525c0064c8e7b7c976755e7341a8b Reviewed By: parmeet Differential Revision: D27400885 fbshipit-source-id: 1f8f92ca42ba36d070db6740b3bb4c148f69586b * Import torchtext #1267 93b03e4 Summary: Imported latest from github Master PR#1267 Reviewed By: cpuhrsch Differential Revision: D27503970 fbshipit-source-id: 853ff895ba42b1feb7442abe1c87478e43d62e5b * Import torchtext #1266 ba0bf52 Summary: Import torchtext from github Reviewed By: parmeet Differential Revision: D27803909 fbshipit-source-id: 9cb0f15858b1417cb5868d5651513eb2df998fbe * Import torchtext #1287 fab63ed Reviewed By: parmeet Differential Revision: D27922562 fbshipit-source-id: 3c18cd9e2583e03471461ad8a22ac6b0ceb596a2 * Import torchtext #1293 d2a0776 Summary: Importing torchtext from github for regular sync. Reviewed By: cpuhrsch Differential Revision: D27983819 fbshipit-source-id: 5806421d788afaa872f5320b5f4cbcd913e103ea * Import torchtext #1291 0790ce6 Reviewed By: parmeet Differential Revision: D28101664 fbshipit-source-id: a8643b3ecf85de2cb815dcfa5789a4a5d246d80f * adding __contains__ method to experimental vocab (#1297) Reviewed By: cpuhrsch Differential Revision: D28111696 fbshipit-source-id: fef195941492493a399adb37339cfa64795e22a0 * Import torchtext #1292 ede6ce65eb5405ff1f8801ff6b354bb1cd242108 Summary: This diff syncs torchtext GH with fbcode Reviewed By: cpuhrsch Differential Revision: D28321356 fbshipit-source-id: 7736f0d100941627b58424911a1329b1ce66c123 * Added APIs for default index and removed unk token (#1302) Reviewed By: parmeet Differential Revision: D28478153 fbshipit-source-id: bfcaffe8fe48e96d8df454f7df0d25ec39d5d4a6 * Swapping experimental Vocab and retiring current Vocab into legacy (#1289) Summary: allow-large-files to commit wikitext103_vocab.pt Reviewed By: cpuhrsch Differential Revision: D28478152 fbshipit-source-id: c2a871439f054024b95c05f7664a84028aacaca3 * Import torchtext #1313 36e33e2 Summary: Importing from Github Reviewed By: cpuhrsch 
Differential Revision: D28572929 fbshipit-source-id: 2e7b00aadeda6ab0596ef23295f41c5b0fa246e7 * Adding API usage logging Summary: Adding API usage logging for Vocab module Reviewed By: colin2328 Differential Revision: D28585537 fbshipit-source-id: 38975b523fb597412fbcb18ef831bfb4834cb420 * Import torchtext #1314 99557efd98dd0e74346975d75183dd8aa32eb37e Reviewed By: parmeet Differential Revision: D28683381 fbshipit-source-id: 7bfbf445dd512f0ce21c34096cf3f08332d90138 * Import torchtext #1325 57a1df3 Reviewed By: NicolasHug Differential Revision: D28994054 fbshipit-source-id: 4c679f56ef37b18f6d2acaaaed8518facbeaa41c * Import torchtext #1328 ca514f6 Summary: Import torchtext #1328 ca514f6 Reviewed By: NicolasHug Differential Revision: D29120370 fbshipit-source-id: 229586f3470bd61bfb2f6a390d79e45d4eae3b4d * up the priority of numpy array comparisons in self.assertEqual (#59067) (#1340) * Re-sync with internal repository (#1343) * up the priority of numpy array comparisons in self.assertEqual (#59067) Summary: Fixes https://github.com/pytorch/pytorch/issues/58988. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067 Reviewed By: jbschlosser Differential Revision: D28986642 Pulled By: heitorschueroff fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0 * Import torchtext #1300 0435df13924fd4582d67e5b17bc09f6ded18be8b Summary: Import torchtext #1300 0435df13924fd4582d67e5b17bc09f6ded18be8b Reviewed By: parmeet Differential Revision: D29371832 fbshipit-source-id: 624280ddfa787a4e7628e60fa673cb9df0a66641 * Import torchtext #1345 8cf471c Summary: Import from github Reviewed By: hudeven Differential Revision: D29441995 fbshipit-source-id: 27731ce2714c16180d11bfb26af5d5a2dba408b1 * Import torchtext #1352 7ab50af Summary: Import from github Reviewed By: NicolasHug Differential Revision: D29537684 fbshipit-source-id: 25b1fc1e6d9f930e83f5f2939788b90b083aeaa2 * Enabling torchtext datasets access via manifold and iopath Summary: We would like to add and access torchtext datasets on manifold. This Diff unifies the dataset download from external links and through manifold for internal access. This is enabled via io_path package. The main idea is to plugin the download hooks in the download_from_url function. The download hooks will delegate the download to appropriate Path Handler. In OSS we have enabled download via https and google drive. Internally, we replace the download hook to download data from manifold. We have created a _download_hooks.py file under /fb/ folder which will replace the corresponding file in OSS. The file under /fb/ folder converts the http/https URL paths into corresponding manifold paths and download the data from there. 
Reviewed By: hudeven Differential Revision: D28892389 fbshipit-source-id: 3b66544dd2345075e2e7c524f344db04aa2a24e3 * Import torchtext #1361 05cb992 Summary: Import from github Reviewed By: hudeven Differential Revision: D29856211 fbshipit-source-id: 6332f9bdf3cf4eef572c5423db15101ea904d825 * Import torchtext #1365 c57b1fb Summary: Import torchtext #1365 c57b1fb Reviewed By: parmeet Differential Revision: D29940816 fbshipit-source-id: 6b2495b550a7e6b6110b0df12de51a87b0d31c1c * Moving Roberta building blocks to torchtext Summary: This is the first step in moving Roberta Model from pytext_lib into PyTorch Text Library. Here we moved the Roberta building blocks into pytorch/text/fb/nn/modules. The code-base is organized according to WIP document https://docs.google.com/document/d/1c0Fs-v97pndLrT3bdfGRGeUeEC38UcDpibvgOXkbS-g/edit#heading=h.3ybcf0ic42yp Reviewed By: hudeven Differential Revision: D29671800 fbshipit-source-id: d01daa99e0a5463716660722381db9a0eeb083f8 * Enabling torchtext availability in @mode/opt Summary: More details on context and solution: D29973934 Note that in this implementation, we rely on over-riding behavior of _init_extention() function. This is in similar spirit where we over-ride behavior of download hooks to accommodate necessary changes needed to enable functionality on fbcode. 
Reviewed By: mthrok Differential Revision: D30494836 fbshipit-source-id: b2b015263fa1bca2ef4d4214909e469df3fbe327 * Import torchtext #1382 aa12e9a Summary: Import torchtext #1382 aa12e9a Reviewed By: parmeet Differential Revision: D30584905 fbshipit-source-id: fba23cd19f31fc7826114dd2eb402c8f7b0553df * Simplify cpp extension initialization process Summary: Simplifying the cpp extension initialization process by following torchaudio's implementation in D30633316 Reviewed By: mthrok Differential Revision: D30652618 fbshipit-source-id: f80ac150fa50b1edc22419b21412f64e77064c5d * fixed bug with incorrect variable name in dataset_utils.py Summary: - ValueError was outputting `fn` instead of `func` - Similar fix done in torchdata https://github.com/facebookexternal/torchdata/pull/167 Reviewed By: ejguan Differential Revision: D31149667 fbshipit-source-id: 2c1228287d513895f8359cb97935252f0087d738 * Import torchtext #1410 0930843 Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D31745899 fbshipit-source-id: e4ac5c337bcbd1a8809544add7679dd3da242999 * Import torchtext #1406 1fb2aed Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D31762288 fbshipit-source-id: f439e04f903d640027660cb969d6d9e00e7ed4a0 * Import from github 10/18/21 Summary: Syncing torchtext github main branch to fbcode Reviewed By: parmeet Differential Revision: D31841825 fbshipit-source-id: 9c1a05295e6557ff411e56eb719cb439d5c424ba * Import torchtext #1420 0153ead Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D31871772 fbshipit-source-id: 989f5a453ef7680592df27e4174f465d11a2fbf8 * Import torchtext #1421 bcc1455 Summary: Syncing torchtext github main branch to fbcode Reviewed By: parmeet Differential Revision: D31873514 fbshipit-source-id: 1a964a67ce7ee73f5acf3a1e3f8118028c2dd46e * Enable OSS torchtext XLMR Base/Large model on fbcode Summary: Enable access to open-source torchtext XLMR base/large implementation by: 1) 
Uploading models/transform weights on manifold 2) Patching public URL with manifold URL (similar to what we have for datasets) Note that we didn't enabled model tests since it takes relatively long to download huge models weights from manifold. We would rely on Open-source signals when making changes to model implementation, and we need to ensure the any update in weights on AWS cloud is also replicated on manifold. Reviewed By: hudeven Differential Revision: D31844166 fbshipit-source-id: 62a4e9a3a8580ab93c3beb3af69be7361f1cc937 * enabling SST2 dataset usage in fbcode Summary: Enable access to open-source torchtext SST2 dataset by: - Uploading SST2 dataset on manifold - Swapping public URL with manifold URL in fbcode by implementing a dummy `HTTPReader` wrapper class - The wrapper class does URL mapping and calls `IoPathFileLoaderDataPipe` on the manifold URL - Enabled SST2Dataset unit tests within fbcode Reviewed By: parmeet Differential Revision: D31876606 fbshipit-source-id: fdde14a67cce835da216b296e1a0024e1d1fc7a9 * Import torchtext #1426 4be2792 Summary: Import from github Reviewed By: Nayef211 Differential Revision: D31962042 fbshipit-source-id: 0308ae0cfe402e8c3eb133cb5a205b65f98ad1df * Import torchtext #1428 b962c51 Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D32006262 fbshipit-source-id: 2d7766104e1116f14f20fa1031178c2143b5e78b * Import torchtext #1430 4cf19ed Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D32140599 fbshipit-source-id: 3a2902febd5e5024d833699e05e0256b1ae0cae2 * Allow inferred scaling in MultiheadSelfAttention for head_dim != 64 Summary: Rather than raise an exception whenever head_dim != 64, we can just infer the scaling value and continue to provide a warning. Also add an assertion in case embed_dim is not a multiple of num_heads (in which case forward will break). 
Reviewed By: parmeet Differential Revision: D32193989 fbshipit-source-id: 30f68c55f3ec37932252c77c355ae55b8bf34ded * Updated sst2 dataset to accept `validate_hash` parameter Summary: ## Description - Updated sst2 dataset to accept a `validate_hash` parameter - This allows for testing using partial datasets since downloading the entire dataset takes much longer Reviewed By: parmeet Differential Revision: D32250435 fbshipit-source-id: 9b5e7183f62df69638e1a3af2107273daa6f4ac5 * Import torchtext #1431 ba20fc5 Summary: Import latest from github Reviewed By: Nayef211 Differential Revision: D32282533 fbshipit-source-id: 8318cd8b8360dec1febdde0bc48388e6b2f2d768 * Fixed file filtering bug in SST2 dataset Summary: - Removed copying partial SST2 asset file to a temp dir and instead directly working with the file from the asset folder - Fixed bug with path names affecting how files were filtered out from the zip file - For example, if the value of `split` is "test", the following snippet of code `filter(lambda x: split in x[0])` might match all of the "train", "test", and "dev" files depending on the location of the dataset asset file - When testing with buck, the location of the extracted files could look something like `/data/users/nayef211/fbsource/fbcode/buck-out/dev/gen/pytorch/text/test/experimental_test_datasets#binary,link-tree/test/asset/SST2/SST-2.zip/train.tsv`. Since the word "test" is contained in this path string, the filtering logic would incorrectly select the "train" file even though what we want is the "test" file - To resolve this we append the file extension (in this case ".tsv") to the `split` variable in the filtering logic Reviewed By: parmeet Differential Revision: D32329831 fbshipit-source-id: dbb4803a04f6cd50fab3f7ce5530d3258b2db012 * Squashed commits (9314b44 to e691934) Summary: Trying out new way to import changes :). 
The main reason for this deviance is to find a way to skip commit ID(s) which is currently blocking another important PR to be landed on fbcode. Used following command to sync changes from github to fbcode: ```python pytorch/import.py --github_username parmeet --project_name text --commit_from ba20fc525a8a46d3056eeb421a44b9bdb1a90182 --commit_to e691934d2779be40ab425056836565840f49d565 --skip_commit_ids 2cebac34ab26577ee02b7295dbe01dccfdb1a88f daf0f6c71d7b764aafd2f1a2a3e7aa37dcc36e53 --squash``` Notes: - Skipped commit 2cebac3 as it about removing legacy code which is still work in progress on internal side (resolving legacy use-sites) by abhinavarora - Skipped commit daf0f6c as this correspond to syncing changes from fbsync. to main branch - We have used squash, but can skip this option to get 1:1 correspondence from PR to Diff ID, like we have in vision This text from below here is auto-generated because i think we used --squash ==== Subject: Update doc and fix CircleCI doc build issue (#1434) Body: commit e691934d2779be40ab425056836565840f49d565 ==== Subject: [CircleCI Windows Failure] Fix the way we join URL pieces to download XLM-R components (#1441) Body: commit d4a27a05a85d331d84d3ac527ca5f18ca64d326f ==== Subject: correct the `_compute_ngram_counter` docstring (#1440) Body: commit a26a8ef7f7ad22f9f2ae7af0e52e4c9760ab439d ==== Subject: fix attention mask testing (#1439) Body: commit 778b3e62770c24c4ecde06a6aaba1dee38c07e2e ==== Subject: [Vocab] Refactor vocab factory method to accept special tokens as a keyword argument (#1436) Body: * [Vocab] Refactor vocab factory method to accept special tokens as a keyword argument commit f298494ad90495e4ad442928665ce6d8e9f9c3c0 ==== Subject: add attention mask to transformer encoder modules (#1435) Body: commit 9314b44d2a6cb6f4129e1ac3ac57f92eb054f15d Reviewed By: Nayef211 Differential Revision: D32431346 fbshipit-source-id: 985e242ce5a733c130e9d5b9549a4a330e948dc7 * Refactor OnDiskCache (#61) Summary: Pull Request 
resolved: https://github.com/pytorch/data/pull/61 Fixes https://github.com/facebookexternal/torchdata/issues/114 and https://github.com/facebookexternal/torchdata/issues/140 * #59 Test Plan: Imported from OSS Reviewed By: wenleix Differential Revision: D31734382 Pulled By: ejguan fbshipit-source-id: 16d10bace2a473e3878ac8dd5f7b6885bd924105 * Add a class method in Model Bundler to facilitate model creation with user-defined configuration and checkpoint (#1442) Summary: Import from github Command used: `python pytorch/import.py --project_name text --commit_ids 2040d8da87394ab5ecf6ac2bbcd5a00beb940cf4` Note that we still not importing the whole repo using import_text.sh. using import.py would be the worflow we would rely on till we merge [legacy code removal commit](https://github.com/pytorch/text/commit/2cebac34ab26577ee02b7295dbe01dccfdb1a88f) into fbcode. Reviewed By: Nayef211 Differential Revision: D32603181 fbshipit-source-id: 1f583e5ac96e693b583ae42d5841bf387cf3727a * Import torchtext from github aea6ad6,#1449 to 9f2fb3f,#1452 Summary: command: `python pytorch/import.py --project_name text --commit_ids aea6ad6bf9a6292af3d5051b4862b966871bdcce 9f2fb3f00cd9a4cc8d41d2e9cbfa5e9bf9533224 --squash` Reviewed By: abhinavarora Differential Revision: D32690771 fbshipit-source-id: cde616182ecfe643ab48d727b66bbf0194480d3e * Fix SST2Dataset test iterator Summary: ## Summary - Modified SST2 dataset implementation to only return text for test split (since label_ids are not available) - Updated doc classification datamodule to temporarily use `val_dataset` instead of `test_dataset` - Updated first line md5 hash for SST2 test split ## Followup Items - Update doc classification module to work with test splits with and without labels Reviewed By: parmeet Differential Revision: D32661112 fbshipit-source-id: ef86aea0ce587c5d5282f2caa943b4b0cdf6f54a * Fix issue in label Transform Summary: In the construction of Vocab within label transform, the default index is set to 0. 
This index is returned when OOV token is given. For this transform, the default index should never be set. Otherwise, it will return default index (which is 0) for unknown labels that might get passed (Ideally it should throw error in this case because we do not know what to do when wrong label is passed for query) Reviewed By: hudeven Differential Revision: D32610834 fbshipit-source-id: e49385fb313929627c41fc515b6d900a6bfc3591 * Import torchtext #1437 2cebac3 Summary: Imports [#1437](https://github.com/pytorch/text/pull/1437) from OSS Torchtext that removes the legacy folder. Reviewed By: parmeet Differential Revision: D32923084 fbshipit-source-id: 83411efd62cd527c518e36279bdbf586435ac9e5 * Import torchtext #1457 d801e99 Summary: Import from github Reviewed By: abhinavarora, ebsmothers Differential Revision: D32962989 fbshipit-source-id: 4de93cbc0ebe29034a505c56d03bb8d4b698891c * Import torchtext #1459 8ef1b15 Summary: Imports torchtext to fbcode Reviewed By: parmeet Differential Revision: D33001763 fbshipit-source-id: 0525982a1aadcfed65172c22734a46fdf2bd7bde * Fixing typing issues DataSet -> DataType Summary: Forward fix of D31344867 Reviewed By: Nayef211, ejguan Differential Revision: D33069330 fbshipit-source-id: 1649049a6caf1178a78a25baf21e1b4ecdc44d77 * Import torchtext #1470 52d38e8 Summary: Import from github Reviewed By: ebsmothers Differential Revision: D33291837 fbshipit-source-id: 86f8675f13190425617937dcbdd5b698da0bba0f * Import torchtext #1486 4908d3c Summary: As title Reviewed By: Nayef211 Differential Revision: D33434571 fbshipit-source-id: 3cb1d43583fd1e2f28dfd27109a8bf5f1b255d1d * Import torchtext #1488 2c98927 Summary: ==== Subject: Switching to use FileOpener from FileLoader (#1488) Body: commit 2c989273a6a99eef12d2e3fe25258b27881cb0bf ==== Subject: add scriptable sequential transform (#1481) Body: commit 3849f4648a5021514b6b91fa721b43b63fad8378 Reviewed By: abhinavarora Differential Revision: D33485781 fbshipit-source-id: 
3a7ca597cb2f2be98be29a639ef05a65a3f7b6be * Update load_state_dict_from_url method to skip download if file is cached Summary: - Update load_state_dict_from_url method to skip download if file is cached in the `model_dir` folder - The `model_dir` parameter was previously unused - New logic is similar to the [OSS implementation in torchhub](https://pytorch.org/docs/stable/_modules/torch/hub.html#load_state_dict_from_url) - Update unit test to test skipping download when file is already cached Reviewed By: abhinavarora Differential Revision: D33512850 fbshipit-source-id: 2350c6dcad7e5725cf670c99405bcc7d0fb05e42 * Import torchtext #1482 0e7cf21 Summary: - Remove xlmr transform class and instead use sequential for model transforms composition - Modified doc classification recipe to use sequential transform instead of `XLMRobertaModelTransform` Reviewed By: parmeet Differential Revision: D33485834 fbshipit-source-id: 01a914112219838620f3ce81cf621665d072ae69 * Import torchtext #1509 1a2fc00 Summary: ==== Subject: migrate YelpReviewPolarity to datapipes. (#1509) Body: * add initial pass at migrating YelpReviewPolarity to datapipes. * fix flake. commit 1a2fc00266eb71c202802803c390e00b4082085e ==== Subject: migrate YelpReviewFull to datapipes. (#1507) Body: * add initial pass at migrating YelpReviewFull to datapipes. * fix flake. commit 0fb50b45a6b40d63ef8fabc5766a15c78fce6c8e ==== Subject: migrate YahooAnswers to datapipes. (#1508) Body: * add initial pass at migrating YahooAnswersto datapipes. * fix flake. commit f99609d7c56d8b742f6b0f281fcd726d05aa4923 ==== Subject: migrate DBPedia to datapipes. (#1500) Body: * add initial pass at migrating DBPedia to datapipes. * add _EXTRACTED_FILES for consistency. commit 1881705aec892efd45803006fd8b6c845be9965f ==== Subject: replace funny os.sep joins with os.path.join for consistency. (#1506) Body: commit ce4ab8b5c1f22cf533e04f73696b0816f63a4ae5 ==== Subject: migrate AG_NEWS to datapipes. 
(#1498) Body: commit d9fdbc62a7c9b6ed27b47d92edc33d1cf8e9cf9d ==== Subject: migrate SogouNews to datapipes. (#1503) Body: * add initial pass at migrating SogouNews to datapipes. * make filter for specific split more consistent. commit e6065a9217a95e71ba47ca0184953627b21ab7ef ==== Subject: Fix filter logic (#1505) Body: commit a415684661ff9d7fb9e2b7f438cc8e70c09781bf ==== Subject: fix per https://github.com/pytorch/vision/issues/4832\#issuecomment-957695788 (#1504) Body: commit 8215832272e8d05f27dc5372a5e4382ce6942819 ==== Subject: add initial pass at migrating Amazon Review Full to datapipes. (#1499) Body: commit df0ec14a802bb7b85f06c97f564959f988212f80 ==== Subject: Parameterize jit and non-jit model integration tests (#1502) Body: * Updated max seq length for truncate in xlmr base. Updated xlmr docs. Moved xlmr tests to integration tests * Removing changes to truncate transform * Remove documentation changes from PR * Parameterized model tests * Added nested_params helper method. Updated model integration test to parameterize a single method covering jit and non-jit tests * Added docstring for unit tests commit d896135e4f5060bbaeb2cc5c3ed43eb15bc8a4c0 ==== Subject: Remove redundant get asset functions from parameterized_utils (#1501) Body: commit 0bcab91246b7ca17db048d0ab97a3199b94c05ab ==== Subject: Parameterized XLMR and Roberta model integration tests (#1496) Body: * Updated max seq length for truncate in xlmr base. Updated xlmr docs. 
Moved xlmr tests to integration tests * Removing changes to truncate transform * Remove documentation changes from PR * Parameterized model tests commit 2cb80a23412993b7eb9ded082d084eb39c1f0c4e ==== Subject: Migrating AmazonReviewPolarity to datapipes (#1490) Body: commit 826a051dfd9f62731f3b0dee854d0aa687f4da72 ==== Subject: Updated XLMR docs (#1497) Body: commit 1a052693509ce32b6fb91302f6ca62546b0afe0d ==== Subject: fix max sequence length for xlmr transform (#1495) Body: commit 776a15daed49f4046b46a2501ea8b63e85bc9da2 ==== Subject: Add pre-trained Roberta encoder for base and large architecture (#1491) Body: * Added new roberta encoders and tests * Added docs for roberta encoder * Updated truncate length. Added info on how model was trained along with license info * Added datasets that roberta was trained on * Removing unnecessary new line commit 6d9e6df7dee068d99d74355d14c7cb897b199d60 ==== Subject: remove optionality of dtype in `ToTensor` (#1492) Body: commit c0e1c38b34ebabf0f12859ee2594194f9c65957a Reviewed By: abhinavarora Differential Revision: D33555196 fbshipit-source-id: eca8e38ea61c72a626ec20096f18827cebae4ef7 Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> * Import torchtext #1532 ce1ce9 Summary: ==== Subject: Add AmazonReviewPolarity Mocked Unit Test (#1532) Body: * First attempt at adding test for amazon review polarity * Updated dataset to take validate_hash param. Finalized tests * Created non empty tar file * Remove formatting. Patch _hash_check method from torchdata during testing * Added super().setUpClass() * Remove commented import commit ce1ce99583795207153e13e9bc35a388d368a49d ==== Subject: migrate Multi30k to datapipes. (#1536) Body: commit 627c71f837f6acf7db34b3d96a696624cb4a7087 ==== Subject: add initial pass at migrating UDPOS to datapipes. 
(#1535) Body: commit f685c55e02a43b6489d096f1dd2c05e8be13df63 ==== Subject: Migrate WikiText103 to datapipes (#1518) Body: commit 042f12f1be9701fc85129c9be380aec72ed3bc2e ==== Subject: add double caching for yelp full to speed up extracted reading. (#1529) Body: commit d19a77eb69a11a3c9feb74b391e288ed70277bb4 ==== Subject: Migrate WikiText2 to datapipes (#1519) Body: * Migrate WikiText2 to datapipes * Address code review comments and add double caching commit 437eea8f841fc5efe7dc0f116bbfef781cb88b84 ==== Subject: add double caching for yahoo to speed up extracted reading. (#1528) Body: * add double caching for yahoo to speed up extracted reading. * simplify filepath_fn * rename dps for consistency. * add FileOpener within caching block for more consistency. commit ff78e999f6edb866c33a1464c8288cb90f15c9e4 ==== Subject: add max_tokens kwarg to vocab factory. (#1525) Body: commit e1d66cf8ccd2b29378d5f3352b01e4310a36b557 ==== Subject: Migrate IMDB to datapipes (#1531) Body: * Migrate IMDB to datapipes * add double cache for extracted reading * update cache name commit 03afb7e1e6b6821eb7b479aa11b9c449c251de7a ==== Subject: add double caching for yelp polarity to speed up extracted reading. (#1530) Body: * add double caching for yelp polarity to speed up extracted reading. * rename dps for consistency and simplify filepath_fn * add FileOpener within caching block for more consistency. commit 83aebf495a92761e9683b4af4461ad28ae5c96a7 ==== Subject: Migrating EnWik9 to datapipes #1511 (#1512) Body: * Migrating enwik9 dataset to use torchdata * Added typing to params * Fixed PR comments. 
Updated to data_dp * Added caching for extracted files * Moved FileOpener after ondiskcache datapipe commit 12317098cef5822846125e579cd197b217c9e30e ==== Subject: Migrating PennTreebank to datapipes (#1511) Body: * Migrating penntreebank dataset to use torchdata * Update FileLoader to FileOpener * Resolved comments about return_path * Using strip() to remove leading/trailing spaces commit eb3994567830aeeccfcc1d7053ac6c29400cb593 ==== Subject: Cache extraction for AmazonReviewPolarity (#1527) Body: commit 0f7f859e412fba4a31852c1a84801a182e636fde ==== Subject: migrate CONLL 2000 to datapipes. (#1515) Body: commit b52746546c0648122231e4d73bf24175ef949df3 ==== Subject: add initial pass at migrating SQUAD2 (https://github.com/pytorch/text/commit/4be2792101565ddf6dd79d1b7fffb7d55d63bf06) to datapipes. (#1514) Body: commit a2ab9741415b2cff026d158a5a54b62b993571d9 ==== Subject: migrate SQUAD1 to datapipes. (#1513) Body: commit a5ca19407b844e49679d87c94003e08c5efd6d78 ==== Subject: Attempting to fix version conflict in CI (#1520) Body: * since we no longer support python 3.6, we get dataclasses in stdlib for free. * replace pip-install of packages with conda-install where applicable for better version management of native code. 
* make cpuonly a constraint instead of feature commit a6ae5946e49db2afb2eb8ca5435afaea036077f3 ==== Subject: fixing cache logic to work with datapipes (#1522) Body: * fixing cache logic to work with datapipes * committing temporary change to build cache * reverting temp change commit cf668aabf869ae9bdbc5c1259e011f36a1411a2b ==== Subject: 3.6 is EOL (#1521) Body: commit 7467bb5971b8ed59a716ba05b82bb1030ed4fbe2 ==== Subject: Fixing dataset test failures due to incorrect caching mode (#1517) Body: commit 38ec295c1970776a43b42712b4156d2635ae85c3 ==== Subject: IterDataPipes do not have __next__ (#1516) Body: commit 8f153f692ed85229db8e43b14398adae5f58d646 Reviewed By: abhinavarora Differential Revision: D33850546 fbshipit-source-id: 2235caac646eb0fcc14fb638cbbfd4b15f966035 Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca> * Import torchtext #1538 d72124c Summary: - Import d72124c commit which migrates the SST2 dataset away from experimental - Modify doc classification recipe to work with new functional dataset implementations - Make label transform optional since some datasets return integer labels - Added a `num_labels` field to `DocClassificationTransform` class which will be used to determine `num_classes` for metrics computation - Update the `SST.zip` testing asset with the correct folder structure Reviewed By: parmeet Differential Revision: D33792100 fbshipit-source-id: 4480ef0ba8dabb495f0a2adc45f588413aea5f4d * Import torchtext from commits e0c5528 to 8808e7e Summary: Command used: `python pytorch/import.py --github_username parmeet --project_name text --commit_from d72124cb710574087d0bce87062ee521e1584167 --commit_to 8808e7eee5a2df79b9566a4a348889dc2722fcfb --skip_commit_ids 7f3ed4b183eb451b439740a59bb849771c707f0c --squash` Followed by: arc lint to fix new line linter issues Reviewed By: VirgileHlav 
Differential Revision: D34717890 fbshipit-source-id: 7aa0f22421b3f3bfb9684c6e24f7dc606052da5c * Import torchtext #1635 69f67f3 Summary: Import latest from github using import_text.sh script Changed `RobertaModelBundle` to `RobertaBundle` Reviewed By: Nayef211 Differential Revision: D34718778 fbshipit-source-id: f68fc827c5956ffedc4f5a98175d0724ca431c9d * Import TorchText from Github Reviewed By: parmeet Differential Revision: D34753031 fbshipit-source-id: 6d8a92b4c2f4b5b85b90edb5b8329e3061411620 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D34815193 fbshipit-source-id: 1865de76d8b133f56e4060961c1173097efac575 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D34857180 fbshipit-source-id: 1c483f2277c902271a2ff75f0c36bf5de8bbba34 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D34920364 fbshipit-source-id: c03dfd98da4b66dc63e5e4dfd11af449bc95ce85 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D35074232 fbshipit-source-id: 1772fbf171665894ab8945d967011873aa7f626e * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D35309109 fbshipit-source-id: be74f1c2739fbfb6e43cc6649839e647c37de4c8 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D35392538 fbshipit-source-id: 02ca5e81ec7ca2d607c1eef3b32ddf4a51c279c8 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D35425316 fbshipit-source-id: 815c3d048440211d2107b8605830530db609efe0 * torchx integration Summary: Integrate with torch to run the training in local or flow Reviewed By: Nayef211 Differential Revision: D35412165 fbshipit-source-id: 297bea540ace67d93965e0982d6c8f8ff5d03208 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D35773883 fbshipit-source-id: ab1787498d2169b4345f5981c21eb6b898fa8f2e * BetterTransformer support for torchtext (#1690) Summary: Pull Request resolved: 
https://github.com/pytorch/text/pull/1690 This diff created a fast path of using better transformer (torch.nn.TransformerEncoderLayer), with a converter from the existing torchtext transformer encoder layer to better transformer. The related tests are added in the following diff. Reviewed By: parmeet Differential Revision: D35948440 fbshipit-source-id: e69e12f2dd28edfea3176a10ee3d7d321d50c897 * Kill to_better by having native load_from_state_dict and init Summary: Fully remove to_better method by rebuilding torchtext TransformerEncoderLayer's load_from_state_dict and init. No more redundant params. Reviewed By: parmeet Differential Revision: D36020184 fbshipit-source-id: ccdd6da853a86034762b235cd7d5f793876d16c6 * Remove unneeded modules after using nn.Module for BetterTransformer (#1693) Summary: Pull Request resolved: https://github.com/pytorch/text/pull/1693 Remove unneeded modules after using nn.Module for BetterTransformer Reviewed By: zrphercule Differential Revision: D36038830 fbshipit-source-id: 1e0f5c7cf81096cf66cc1afcf15b5e0645c3da03 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D36034077 fbshipit-source-id: 40c12ec37992d71c4857f92bc5e2ed939e2d6030 * Replace TransformerEncoder in torchtext with better transformer (#34) Summary: X-link: https://github.com/facebookresearch/multimodal/pull/34 Pull Request resolved: https://github.com/pytorch/text/pull/1700 Replace the usage of TransformerEncoder by BetterTransformerEncoder In theory we should be able to remove torchtext.TransformerEncoderLayer after this diff.
Reviewed By: parmeet Differential Revision: D36084653 fbshipit-source-id: 64ed3810e809fc1db840e75e2e05783089ff31d2 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D36162313 fbshipit-source-id: ff366f585b4783e903f8388654e71ce635b2a556 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D36307982 fbshipit-source-id: faf90f12012bd962fc5decfd3cf9e117f4b9160a * Enable model testing in FBCode Summary: This diff enables Model testing in FB code Notes: 1. it only tests XLM-R models (base and large) in integration tests. We need to do a follow-up diff to enable RoBERTa testing since corresponding assets are missing in FBcode. Edit: Addressed the Roberta model testing in this diff itself 2. parameterized was giving some weird long names to the test which was creating some unknown issue for running them in sandcastlle. Removed it for now to get the proper names for test. Edit: refactored test suit since nested_params was creating long string names (400+ characters) for test methods due to RobertaBundle objects Reviewed By: mikekgfb Differential Revision: D35973306 fbshipit-source-id: 8a50d03466f60c8a4a0fbd5857611e68c92ebf08 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D36340622 fbshipit-source-id: ed6f1994916d5d469198e6d0876387a6363db1ea * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D36448402 fbshipit-source-id: bee15f955a21a730653d72d4aedff7b6122f6ef0 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D36510904 fbshipit-source-id: 1b9b27e62af007e88f76414e936fa08ae1ce7d59 * Import torchtext #1794 a54be1f3a7ac534509ac9c066a1b35127936dd77 Summary: Manually importing TorchText from github using ```./fbcode/pytorch/fb_build/import_text.sh``` In additional to manual import, this diff also updates the libtorchtext TARGET dependency on utf8proc Reviewed By: VirgileHlav Differential Revision: D37250868 fbshipit-source-id: 
369d67aa02492f620350eb8b28c00b59dc84f081 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D37171614 fbshipit-source-id: 56fa981bc709f78ac3371a5346b9278730895b82 * Import TorchText from Github Summary: Meta: Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash. Rules run: - CodemodTransformerSimpleShell Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text) CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php) ConfigType: php Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/31525198098541494/ This diff was automatically created with CodemodService. To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService). _____ ## Questions / Comments / Feedback? **[Click here to give feedback about this diff](https://www.internalfb.com/codemod_service/feedback?sandcastle_job_id=31525198098541494).** * Returning back to author or abandoning this diff will only cause the diff to be regenerated in the future. * Do **NOT** post in the CodemodService Feedback group about this specific diff. Reviewed By: Nayef211 Differential Revision: D37374922 fbshipit-source-id: d2cfb5e58fc35b653f00b0d81330fe2337e6e347 * Import TorchText from Github Summary: Meta: Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash. 
Rules run: - CodemodTransformerSimpleShell Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text) CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php) ConfigType: php Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/709158032/ This diff was automatically created with CodemodService. To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService). _____ ## Questions / Comments / Feedback? **[Click here to give feedback about this diff](https://www.internalfb.com/codemod_service/feedback?sandcastle_job_id=709158032).** * Returning back to author or abandoning this diff will only cause the diff to be regenerated in the future. * Do **NOT** post in the CodemodService Feedback group about this specific diff. Reviewed By: parmeet Differential Revision: D37411197 fbshipit-source-id: 8eeb460843eacfd0f3d970062b3e0e393d5eef6f * Import TorchText from Github Summary: Meta: Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash. Rules run: - CodemodTransformerSimpleShell Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text) CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php) ConfigType: php Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/711752278/ This diff was automatically created with CodemodService. To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService). _____ ## Questions / Comments / Feedback? 
**[Click here to give feedback about this diff](https://www.internalfb.com/codemod_service/feedback?sandcastle_job_id=711752278).** * Returning back to author or abandoning this diff will only cause the diff to be regenerated in the future. * Do **NOT** post in the CodemodService Feedback group about this specific diff. Reviewed By: Nayef211 Differential Revision: D37483835 fbshipit-source-id: b4ad3c43ece7c83c57617e6a5851fff3ecdf8e51 * Adding TARGETS file for torchtext benchmarks Summary: ### Summary - Enable benchmarking of torcharrow ops within torchtext ### Benchmark Results - Benchmarking in fbcode devserver ``` torchtext GPT2BPE tokenizer: 65.811 torchtext vocab: 2.226 torchtext add tokens operation (string): 0.722 torchtext add tokens operation (int): 0.598 torcharrow GPT2BPE tokenizer: 65.739 torcharrow vocab: 1.253 torcharrow add tokens operation (string): 14.335 torcharrow add tokens operation (int): 0.229 ``` Benchmarking on Apple MBP (results can also be found in [text#1801](https://github.com/pytorch/text/pull/1801) and [text#1807](https://github.com/pytorch/text/pull/1807)) ``` torchtext GPT2BPE tokenizer: 3.13 torchtext vocab: 0.32 torchtext add tokens operation (string): 0.382 torchtext add tokens operation (int): 0.431 torcharrow GPT2BPE tokenizer: 59.13 torcharrow vocab: 0.03 torcharrow add tokens operation (string): 3.652 torcharrow add tokens operation (int): 0.075 ``` ### Takeaways - GPT2BPE for torchtext is significantly faster on MBP than devserver - AddTokens (str) for torcharrow is still significantly slower on both MBP and devserver than the torchtext counterpart Reviewed By: parmeet Differential Revision: D37463862 fbshipit-source-id: 1fb538338367bac2b002c1a4b8f128b0b2847bf5 * Import TorchText from Github Summary: Meta: Import latest TorchText from Github to fbcode. Check fb/LAST_SYNCED_COMMIT_FROM_GITHUB_MAIN for the synced commit hash. 
Rules run: - CodemodTransformerSimpleShell Config Oncall: [pytorch_text](https://our.intern.facebook.com/intern/oncall3/?shortname=pytorch_text) CodemodConfig: [CodemodConfigPyTorchTextGithubSync](https://www.internalfb.com/code/www/flib/intern/codemod_service/config/pytorch_text/github_sync/CodemodConfigPyTorchTextGithubSync.php) ConfigType: php Sandcastle URL: https://www.internalfb.com/intern/sandcastle/job/13510799591955868/ This diff was automatically created with CodemodService. To learn more about CodemodService, check out the [CodemodService wiki](https://fburl.com/CodemodService). _____ ## Questions / Comments / Feedback? **[Click here to give feedback about this diff](https://www.internalfb.com/codemod_service/feedback?sandcastle_job_id=13510799591955868).** * Returning back to author or abandoning this diff will only cause the diff to be regenerated in the future. * Do **NOT** post in the CodemodService Feedback group about this specific diff. Reviewed By: abhinavarora Differential Revision: D37514618 fbshipit-source-id: efc3b56b6da2afdc601b3dc706c58d0222d0daf6 * Import TorchText from Github Reviewed By: parmeet Differential Revision: D37642224 fbshipit-source-id: 674d2fdfa57bc2131bed136986d385194416f0bb * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D37680190 fbshipit-source-id: b06341b9989bdcb0859ad84838860f05ef2e501f * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D37711064 fbshipit-source-id: 3646b536af2359b776e6a49b9c86f6657c0f1a4c * Import TorchText from Github Reviewed By: parmeet Differential Revision: D37879352 fbshipit-source-id: 53b04c4b41a3c7e8077842c39a331144eab76208 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D37952995 fbshipit-source-id: 09c492ac8d1333283bb4366c9ae0c6b95b98a87c * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D38110070 fbshipit-source-id: 824a1a2d7a4cb97a69b3bcfd39167ac039edd1b5 * Import 
TorchText from Github Reviewed By: abhinavarora Differential Revision: D38146055 fbshipit-source-id: 1b232be8ce396189a123139ac8456433d12d2316 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D38269840 fbshipit-source-id: 901e5279e8e0265fabd48aca861a43d2e4c45dee * Import TorchText from Github Reviewed By: parmeet Differential Revision: D38351452 fbshipit-source-id: 2439d74bc9ab3f477876f35f549caec9117711bd * Import TorchText from Github Reviewed By: parmeet Differential Revision: D38381535 fbshipit-source-id: ba50c1a33fda33c4ccc8157702f32b94d415197f * Import TorchText from Github Reviewed By: parmeet Differential Revision: D38419656 fbshipit-source-id: 871439658ed673910c68c025be471501b9b4670a * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D38534440 fbshipit-source-id: 3bf1a7d5cc2daa8d14e424d16509b2df998549b8 * Import TorchText from Github Reviewed By: Nayef211 Differential Revision: D38655164 fbshipit-source-id: 0b9364fb759520c6fb60147fd0ab1044c362d588 * Import torchtext #1879 72966f0 Summary: ran the `import_text.sh` command to manually update the internal fbcode to match the Github torchtext repo Reviewed By: Nayef211 Differential Revision: D38796445 fbshipit-source-id: 904143c404141bb016a5f83fbc53906b1c6e1246 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D38907288 fbshipit-source-id: f82ad8121bce924ad6068767845e5ea29dd24bef * Remove dependency on the torch::jit::script::Module for mobile builds Summary: In order to resolve linkage errors. 
Specifically when vocab getting build for "mobile" version it can't resolve symbols for torch::jit::script::Module Reviewed By: Nayef211 Differential Revision: D38771271 fbshipit-source-id: 693b656f2a17af9fa5a7a1904742557f902edb55 * Replace `pytext_lib`'s `MaskTransform` with new one from `torchtext` Summary: Replace instances of `pytext_lib`'s `MaskTransform` with new one from `torchtext` that was merged in https://github.com/pytorch/text/pull/1882 Reviewed By: Nayef211 Differential Revision: D39058074 fbshipit-source-id: f61499d88eec7eccda659279786528bac7edf9d0 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D39095295 fbshipit-source-id: 2e447db46b71fc152f2f53b281585650682cb696 * move PATH_MANAGER to OSS Summary: ## Problem: pytext got "No module named 'pytorch'" in issue https://github.com/facebookresearch/pytext/issues/1706 It's due to `from pytorch.text.fb.utils import PATH_MANAGER` is internal only but imported in pytext. Actually, `pytorch/text/fb/utils/__init__.py` should be open sourced. 
## Solution: This diff moved it to OSS as `from torchtext.utils import PATH_MANAGER` and updated all the references Reviewed By: Nayef211 Differential Revision: D39292896 fbshipit-source-id: c0046d62e64145b60ad9a5298b366f0f1a348369 * Turn off mask checking for torchtext which is known to have a legal mask (#1896) Summary: Pull Request resolved: https://github.com/pytorch/text/pull/1896 Turn off mask checking for torchtext which is known to have a legal mask Reviewed By: zrphercule Differential Revision: D39445703 fbshipit-source-id: 3f0cacfd39ea11a16c7a06f339872554333b5e97 * Back out "move PATH_MANAGER to OSS" (#1724) Summary: X-link: https://github.com/facebookresearch/pytext/pull/1724 Original commit changeset: c0046d62e641 Original Phabricator Diff: D39292896 torchtext can't depend on iopath as raised in https://github.com/pytorch/text/pull/1905 Reviewed By: Nayef211 Differential Revision: D39639475 fbshipit-source-id: 69a48eb3820d0642b0a56712e160a0af589e4c7c * Import TorchText from Github Summary: Manually import latest changes from github to fbcode Reviewed By: joecummings Differential Revision: D39770284 fbshipit-source-id: 1e442f222d582c43a2ca9280d93eca4135d2df09 * Import TorchText from Github Reviewed By: rshraga Differential Revision: D39811057 fbshipit-source-id: 33cce346ac3d226a2fff6c162c39164837f34d87 * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D40225047 fbshipit-source-id: 7abff009d65d713a6ce134fc88cd1955f62e3e3d * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D40294258 fbshipit-source-id: b3e14d9e78e346c294f1bc65ba3045b92251e034 * Add Character Level BPE Tokenizer (#1936) Summary: Pull Request resolved: https://github.com/pytorch/text/pull/1936 This change adds a character level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e vocab dict) and merge list as input. It is not using C++ for encoding / decoding at this time. 
Reviewed By: langong347 Differential Revision: D40186470 fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896 * Add padding_masks and tests for T5Model (#1935) Summary: Pull Request resolved: https://github.com/pytorch/text/pull/1935 Added the following parameters to the `forward` method of the T5Model: * `encoder_padding_mask` * `decoder_padding_mask` These allow users to specifically mask out the padding of input sequences. This matches the implementation of Transformers in PyTorch core. Reviewed By: Nayef211 Differential Revision: D40252794 fbshipit-source-id: 0e0a17fdc97ae0bbcaa1aef91e9914fd6225456b * Import TorchText from Github Reviewed By: abhinavarora Differential Revision: D40425553 fbshipit-source-id: 268b94d65cff771028c2e2fdf21caa9855d07cef Co-authored-by: Guanheng Zhang <zhangguanheng@devfair0197.h2.fair> Co-authored-by: Christian Puhrsch <cpuhrsch@devfair0129.h2.fair> Co-authored-by: cpuhrsch <cpuhrsch@fb.com> Co-authored-by: Moto Hira <moto@fb.com> Co-authored-by: George Guanheng Zhang <zhangguanheng@fb.com> Co-authored-by: Stanislau Hlebik <stash@fb.com> Co-authored-by: Andres Suarez <asuarez@fb.com> Co-authored-by: Meghan Lele <meghanl@fb.com> Co-authored-by: Brian Hirsh <hirsheybar@fb.com> Co-authored-by: Vasilis Vryniotis <vvryniotis@fb.com> Co-authored-by: Jeff Hwang <jeffhwang@fb.com> Co-authored-by: Parmeet Singh Bhatia <parmeetbhatia@fb.com> Co-authored-by: Artyom Astafurov <asta@fb.com> Co-authored-by: Nicolas Hug <nicolashug@fb.com> Co-authored-by: Heitor Schueroff <heitorschueroff@fb.com> Co-authored-by: Facebook Community Bot <facebook-github-bot@users.noreply.github.com> Co-authored-by: Philip Meier <github.pmeier@posteo.de> Co-authored-by: Vincent Quenneville-Belair <vincentqb@fb.com> Co-authored-by: Yao-Yuan Yang <yyyang@fb.com> Co-authored-by: Evan Smothers <ebs@fb.com> Co-authored-by: Erjia Guan <erjia@fb.com> Co-authored-by: Abhinav Arora <abhinavarora@fb.com> Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com> Co-authored-by: 
nayef211 <n63ahmed@edu.uwaterloo.ca> Co-authored-by: CodemodService Bot <> Co-authored-by: Steven Liu <stevenliu@fb.com> Co-authored-by: Rui Zhu <zrphercule@fb.com> Co-authored-by: Michael Gschwind <mikekg@fb.com> Co-authored-by:…
It'd be nice to actually get docs built for people to reference, even while this is still WIP. I'm happy to set up a Sphinx project, but I noticed that pytorch/vision doesn't keep its docs within its own repo; they live in the main pytorch repo instead.
Thus, is it preferable to:
cc @jekbradbury (@soumith @apaszke may have things they want to say about this as well)