
Conversation

cozek
Contributor

@cozek cozek commented Apr 9, 2021

Fixes #1900

Description:
Remove legacy torchtext code from the TextCNN example notebook and update it to the torchtext==0.9.0 API.

Checklist:

  • Get the notebook running
  • Test on CPU/GPU in Colab

@github-actions github-actions bot added the examples Examples label Apr 9, 2021
@cozek cozek changed the title from "[skip ci] WIP - Update TextCNN example to torchtext==0.9.0 api" to "[skip ci] Update TextCNN example to torchtext==0.9.0 api" Apr 10, 2021
@cozek cozek marked this pull request as ready for review April 10, 2021 09:04
@cozek
Contributor Author

cozek commented Apr 10, 2021

Some comments:

  1. I changed the spacy model from en to en_core_web_sm, since en is deprecated.
    python -m spacy download en_core_web_sm

  2. Set the seed for random using random.seed(SEED).

  3. Set the tokenizer to spacy, since spacy was installed but never actually used in the original notebook. Perhaps it is used internally.
    tokenizer = get_tokenizer("spacy")

  4. Used only 1000 samples to speed up testing and provided an option for using the entire set, since torchtext no longer handles custom splitting AFAIK. (A fuller setup sketch follows this list.)

# We are using only 1000 samples for faster training
# set to -1 to use full data
N = 1000

# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
_temp = list(train_iter)

random.shuffle(_temp)
_temp = _temp[:(N if N > 0 else len(_temp))]
n_train = int(len(_temp) * train_frac)

train_list = _temp[:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[:(N if N > 0 else len(test_list))]
  5. I couldn't get the bucket iterator running as described in the migration guide, so I skipped it. Otherwise, I followed the migration guide as closely as I could.
  6. process_function and eval_function had to be changed.
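For context, here is a minimal sketch of the torchtext 0.9.0 setup that points 1, 3 and 4 describe. It is not code from the PR itself; the split keywords follow the 0.9.0 IMDB dataset API.

# Sketch only: obtaining the raw iterators and the spacy tokenizer
# under torchtext 0.9.0, as described in points 1, 3 and 4 above.
import random

from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer

# spacy tokenizer backed by the non-deprecated en_core_web_sm model
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

# raw iterators of (label, text) pairs; torchtext 0.9 no longer
# provides a built-in custom split, hence the manual one above
train_iter, test_iter = IMDB(split=("train", "test"))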

Also, big thanks to @KickItLikeShika for debugging the initial notebook.

Contributor

@KickItLikeShika KickItLikeShika left a comment


Thanks for the PR @cozek! LGTM!

@KickItLikeShika
Contributor

KickItLikeShika commented Apr 10, 2021

@cozek I think the bucket iterator part is not a must, but the changes in process_function and eval_function are a must because of the device issues, as discussed before in the issue. Regarding the size of the training data, you can wait for @sdesrozis or @vfdev-5 to review.
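For readers, a minimal sketch of the device handling being discussed; the model, optimizer, criterion and device names here are assumptions for illustration, not the notebook's exact code.

from ignite.engine import Engine

def process_function(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = batch
    # moving the batch to the training device is the fix discussed above
    x, y = x.to(device), y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(process_function)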

Collaborator

@vfdev-5 vfdev-5 left a comment


@cozek thanks for the PR, I'll review it in detail later. I already have a comment on the .gitignore modifications, and I also wonder whether we should recompute the vocabulary this way:

from collections import Counter
from torchtext.vocab import GloVe, Vocab

counter = Counter()

for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter,
    min_freq=10,
    vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')
)

Can't we reuse a predefined vocabulary, or parallelize the sequential counter update to make it a bit faster?
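One possible shape for that parallelization, sketched with multiprocessing; the chunking and worker count are arbitrary assumptions, and on spawn-based platforms the tokenizer would need to be importable by the worker processes.

from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    # each worker counts tokens in its own slice of the corpus
    c = Counter()
    for line in lines:
        c.update(tokenizer(line))
    return c

lines = [line for (label, line) in train_list]
chunks = [lines[i::4] for i in range(4)]  # 4 interleaved slices

with Pool(processes=4) as pool:
    counter = sum(pool.map(count_chunk, chunks), Counter())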

@cozek
Contributor Author

cozek commented Apr 10, 2021

@vfdev-5 I am simply trying to emulate
TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))

AFAIK GloVe is predefined in torchtext.vocab. I don't think I fully understand what you mean.
If you show me an example of what needs to be changed to make it faster, I am happy to incorporate it. :)
Thanks!
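For reference, what the old build_vocab call achieved and what the Counter-based Vocab reproduces: vocab.vectors ends up holding one GloVe row per vocabulary index, so it can initialize the model's embedding layer. A sketch (the freeze choice is an assumption):

import torch.nn as nn

# vocab.vectors is aligned with the vocabulary indices, so it can
# seed the embedding layer directly (freeze=False keeps it trainable)
embedding = nn.Embedding.from_pretrained(vocab.vectors, freeze=False)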

Collaborator

@vfdev-5 vfdev-5 left a comment


Sorry for the delayed review @cozek! I added some comments to make it better. Could you please address them, and it will be good to go. Thanks!

Comment on lines 4 to 13
"metadata": {
"colab": {
"name": "Copy of TextCNN PR.ipynb",
"private_outputs": true,
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python388jvsc74a57bd0bb24fb798fa891713af3d36fbae541dd86145d8cb277c7e680316fd96a4b69ba",
"display_name": "Python 3.8.8 64-bit ('ingite': conda)"
Collaborator

Could you please remove this metadata.

Comment on lines 154 to 157
"SEED = 1234\n",
"random.seed(SEED)\n",
"torch.manual_seed(SEED)\n",
"torch.cuda.manual_seed(SEED)"
Collaborator

Let's update here as well. We can use ignite.utils.manual_seed(SEED) instead of these 3 lines.
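That is, assuming the notebook keeps the SEED constant:

from ignite.utils import manual_seed

manual_seed(SEED)  # seeds random, numpy and torch in one call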

},
"source": [
"random_sample = random.sample(train_list,1)[0]\n",
"print(' text:',random_sample[0])\n",
Collaborator

Suggested change
"print(' text:',random_sample[0])\n",
"print(' text:', random_sample[1])\n",

"source": [
"random_sample = random.sample(train_list,1)[0]\n",
"print(' text:',random_sample[0])\n",
"print('label:', random_sample[1])"
Collaborator

Suggested change
"print('label:', random_sample[1])"
"print('label:', random_sample[0])"

Comment on lines 752 to 753
" print('y_pred',y_pred)\n",
" print('y',y)\n",
Collaborator

Let's remove these prints

@vfdev-5
Collaborator

vfdev-5 commented Apr 19, 2021

> @vfdev-5 I am simply trying to emulate
> TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))
>
> AFAIK GloVe is predefined in torchtext.vocab. I don't think I fully understand what you mean.
> If you show me an example of what needs to be changed to make it faster, I am happy to incorporate it. :)
> Thanks!

I was thinking that there could be a way to build the Vocab without recounting the train set. However, it seems that even torchtext does that in its tutorials: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
OK, let's keep it like that here as well.

@vfdev-5
Collaborator

vfdev-5 commented Apr 19, 2021

@KickItLikeShika could you please review this PR more thoroughly and check whether all requested changes were applied? Thanks

Contributor

@KickItLikeShika KickItLikeShika left a comment


Thanks for the updates @cozek, you're doing a good job! Please just consider the reviews and we are good!

@KickItLikeShika
Contributor

@vfdev-5 I think everything is fine now with this PR

@vfdev-5
Collaborator

vfdev-5 commented Apr 22, 2021

Thanks a lot @KickItLikeShika for checking and @cozek for the update! Let's merge it!

@vfdev-5 vfdev-5 merged commit 8c42723 into pytorch:master Apr 22, 2021
@cozek cozek deleted the textcnn_update branch April 22, 2021 12:36