
Conversation

cozek
Contributor

@cozek cozek commented Apr 9, 2021

Fixes #1900

Description:
Remove legacy torchtext code from the TextCNN example notebook and update it to the torchtext==0.9.0 API.

Checklist:

  • Get the notebook running
  • Test on CPU/GPU in Colab

@github-actions github-actions bot added the examples Examples label Apr 9, 2021
@cozek cozek changed the title from "[skip ci] WIP - Update TextCNN example to torchtext==0.9.0 api" to "[skip ci] Update TextCNN example to torchtext==0.9.0 api" Apr 10, 2021
@cozek cozek marked this pull request as ready for review April 10, 2021 09:04
@cozek
Contributor Author

cozek commented Apr 10, 2021

Some comments:

  1. I changed the spacy model from en to en_core_web_sm, since en is deprecated.
    python -m spacy download en_core_web_sm

  2. Set the seed for random using random.seed(SEED).

  3. Set the tokenizer to spacy, since spacy was installed but never actually used in the original notebook. Perhaps it is used internally.
    tokenizer = get_tokenizer("spacy")

  4. Used only 1000 samples to speed up testing and provided an option for using the entire set, since torchtext no longer handles custom splitting AFAIK. (A fuller setup sketch follows this list.)

# We are using only 1000 samples for faster training
# set to -1 to use full data
N = 1000

# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
_temp = list(train_iter)

random.shuffle(_temp)
_temp = _temp[:(N if N > 0 else len(_temp))]
n_train = int(len(_temp) * train_frac)

train_list = _temp[:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[:(N if N > 0 else len(test_list))]
  5. I couldn't get the bucket iterator running as described in the migration guide, so I skipped it. Otherwise, I followed the migration guide as closely as I could.
  6. process_function and eval_function had to be changed.
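For context, here is a minimal sketch of the torchtext 0.9.0 setup that points 1, 3 and 4 describe. It is not code from the PR itself; the split keywords follow the 0.9.0 IMDB dataset API.

# Sketch only: obtaining the raw iterators and the spacy tokenizer
# under torchtext 0.9.0, as described in points 1, 3 and 4 above.
import random

from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer

# spacy tokenizer backed by the non-deprecated en_core_web_sm model
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

# raw iterators of (label, text) pairs; torchtext 0.9 no longer
# provides a built-in custom split, hence the manual one above
train_iter, test_iter = IMDB(split=("train", "test"))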

Also, big thanks to @KickItLikeShika for debugging the initial notebook.

Contributor

@KickItLikeShika KickItLikeShika left a comment


Thanks for the PR @cozek! LGTM!

@KickItLikeShika
Contributor

KickItLikeShika commented Apr 10, 2021

@cozek I think the bucket iterator part is not a must, but the changes in process_function and eval_function are a must because of the device issues, as discussed before in the issue. Regarding the size of the training data, you can wait for @sdesrozis or @vfdev-5 to review.
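For readers, a minimal sketch of the device handling being discussed; the model, optimizer, criterion and device names here are assumptions for illustration, not the notebook's exact code.

from ignite.engine import Engine

def process_function(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = batch
    # moving the batch to the training device is the fix discussed above
    x, y = x.to(device), y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(process_function)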

Collaborator

@vfdev-5 vfdev-5 left a comment


@cozek thanks for the PR, I'll review it in detail later. I already have a comment on the .gitignore modifications, and I also wonder whether we should recompute the vocabulary this way:

from collections import Counter
from torchtext.vocab import GloVe, Vocab

counter = Counter()

for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter,
    min_freq=10,
    vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')
)

Can't we reuse a predefined vocabulary, or parallelize the sequential counter update to make it a bit faster?
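One possible shape for that parallelization, sketched with multiprocessing; the chunking and worker count are arbitrary assumptions, and on spawn-based platforms the tokenizer would need to be importable by the worker processes.

from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    # each worker counts tokens in its own slice of the corpus
    c = Counter()
    for line in lines:
        c.update(tokenizer(line))
    return c

lines = [line for (label, line) in train_list]
chunks = [lines[i::4] for i in range(4)]  # 4 interleaved slices

with Pool(processes=4) as pool:
    counter = sum(pool.map(count_chunk, chunks), Counter())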

@cozek
Contributor Author

cozek commented Apr 10, 2021

@vfdev-5 I am simply trying to emulate
TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))

AFAIK GloVe is predefined in torchtext.vocab. I don't think I fully understand what you mean.
If you show me an example of what needs to be changed to make it faster, I am happy to incorporate it. :)
Thanks!
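For reference, what the old build_vocab call achieved and what the Counter-based Vocab reproduces: vocab.vectors ends up holding one GloVe row per vocabulary index, so it can initialize the model's embedding layer. A sketch (the freeze choice is an assumption):

import torch.nn as nn

# vocab.vectors is aligned with the vocabulary indices, so it can
# seed the embedding layer directly (freeze=False keeps it trainable)
embedding = nn.Embedding.from_pretrained(vocab.vectors, freeze=False)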

Collaborator

@vfdev-5 vfdev-5 left a comment


Sorry for the delayed review @cozek! I added some comments to make it better. Could you please address them, and it will be good to go. Thanks!

Comment on lines 4 to 13
"metadata": {
"colab": {
"name": "Copy of TextCNN PR.ipynb",
"private_outputs": true,
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python388jvsc74a57bd0bb24fb798fa891713af3d36fbae541dd86145d8cb277c7e680316fd96a4b69ba",
"display_name": "Python 3.8.8 64-bit ('ingite': conda)"
Collaborator

Could you please remove this metadata.

Comment on lines 154 to 157
"SEED = 1234\n",
"random.seed(SEED)\n",
"torch.manual_seed(SEED)\n",
"torch.cuda.manual_seed(SEED)"
Collaborator

Let's update here as well. We can use ignite.utils.manual_seed(SEED) instead of these 3 lines.
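That is, assuming the notebook keeps the SEED constant:

from ignite.utils import manual_seed

manual_seed(SEED)  # seeds random, numpy and torch in one call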

},
"source": [
"random_sample = random.sample(train_list,1)[0]\n",
"print(' text:',random_sample[0])\n",
Collaborator

Suggested change
"print(' text:',random_sample[0])\n",
"print(' text:', random_sample[1])\n",

"source": [
"random_sample = random.sample(train_list,1)[0]\n",
"print(' text:',random_sample[0])\n",
"print('label:', random_sample[1])"
Collaborator

Suggested change
"print('label:', random_sample[1])"
"print('label:', random_sample[0])"

Comment on lines 752 to 753
" print('y_pred',y_pred)\n",
" print('y',y)\n",
Collaborator

Let's remove these prints

@vfdev-5
Collaborator

vfdev-5 commented Apr 19, 2021

> @vfdev-5 I am simply trying to emulate
> TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))
>
> AFAIK GloVe is predefined in torchtext.vocab. I don't think I fully understand what you mean.
> If you show me an example of what needs to be changed to make it faster, I am happy to incorporate it. :)
> Thanks!

I was thinking that there could be a way to build the Vocab without recounting the train set. However, it seems that even torchtext does that in its tutorials: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
OK, let's keep it like that here as well.

@vfdev-5
Collaborator

vfdev-5 commented Apr 19, 2021

@KickItLikeShika could you please review this PR more thoroughly and check whether all requested changes were applied? Thanks

Contributor

@KickItLikeShika KickItLikeShika left a comment


Thanks for the updates @cozek, you're doing a good job! Please just consider the reviews and we are good!

@KickItLikeShika
Contributor

@vfdev-5 I think everything is fine now with this PR

@vfdev-5
Collaborator

vfdev-5 commented Apr 22, 2021

Thanks a lot @KickItLikeShika for checking and @cozek for the update! Let's merge it!

@vfdev-5 vfdev-5 merged commit 8c42723 into pytorch:master Apr 22, 2021
@cozek cozek deleted the textcnn_update branch April 22, 2021 12:36