[skip ci] Update TextCNN example to torchtext==0.9.0 api #1918
Conversation
Some comments:
# We are using only 1000 samples for faster training
# (set N = -1 to use the full data)
N = 1000
# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
# train_iter / test_iter are the dataset iterators created earlier in the notebook
_temp = list(train_iter)
random.shuffle(_temp)
_temp = _temp[:(N if N > 0 else len(_temp))]
n_train = int(len(_temp) * train_frac)
train_list = _temp[:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[:(N if N > 0 else len(test_list))]
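For context, a minimal sketch (not code from this PR) of how such lists can then be batched in the torchtext==0.9.0 style; it assumes the tokenizer and vocab objects built elsewhere in the notebook and IMDB-style 'pos'/'neg' string labels:

import torch
from torch.utils.data import DataLoader

def collate_batch(batch):
    # hypothetical helper: numericalize and pad each (label, line) pair
    labels, texts = [], []
    for label, line in batch:
        labels.append(1 if label == 'pos' else 0)  # assumes 'pos'/'neg' labels
        texts.append(torch.tensor([vocab.stoi[tok] for tok in tokenizer(line)],
                                  dtype=torch.long))
    # pad every sequence in the batch to the longest one
    texts = torch.nn.utils.rnn.pad_sequence(
        texts, batch_first=True, padding_value=vocab.stoi['<pad>'])
    return torch.tensor(labels, dtype=torch.long), texts

train_loader = DataLoader(train_list, batch_size=64, shuffle=True,
                          collate_fn=collate_batch)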
Also, big thanks to @KickItLikeShika for debugging the initial notebook.
Thanks for the PR @cozek! LGTM!
@cozek I think the bucket iterator part is not a must, and the changes in
@cozek thanks for the PR, I'll review it in detail later. I already have a comment on the .gitignore modifications, and I also wonder whether we should recompute the vocabulary this way:
from collections import Counter
from torchtext.vocab import GloVe, Vocab

counter = Counter()
for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter,
    min_freq=10,
    vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')
)
Can't we reuse a predefined vocabulary, or parallelize the sequential counter update to make it a bit faster?
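One possible shape for the parallel idea, as a sketch under assumptions rather than code from this PR (it assumes a module-level tokenizer and a fork-based multiprocessing start method so workers inherit it):

import os
from collections import Counter
from multiprocessing import Pool

def count_tokens(lines):
    # count tokens in one chunk of (label, line) pairs
    c = Counter()
    for _, line in lines:
        c.update(tokenizer(line))
    return c

def parallel_counter(data, n_workers=os.cpu_count()):
    # split the data into interleaved chunks, count in parallel, merge the partial counts
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        return sum(pool.map(count_tokens, chunks), Counter())

counter = parallel_counter(train_list)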
@vfdev-5 I am simply trying to emulate AFAIK GloVe is predefined in
Sorry for the delayed review, @cozek! I added some comments to make it better. Could you please address them? Then it will be good to go. Thanks!
examples/notebooks/TextCNN.ipynb
Outdated
"metadata": { | ||
"colab": { | ||
"name": "Copy of TextCNN PR.ipynb", | ||
"private_outputs": true, | ||
"provenance": [], | ||
"collapsed_sections": [] | ||
}, | ||
"kernelspec": { | ||
"name": "python388jvsc74a57bd0bb24fb798fa891713af3d36fbae541dd86145d8cb277c7e680316fd96a4b69ba", | ||
"display_name": "Python 3.8.8 64-bit ('ingite': conda)" |
Could you please remove this metadata.
examples/notebooks/TextCNN.ipynb
Outdated
"SEED = 1234\n", | ||
"random.seed(SEED)\n", | ||
"torch.manual_seed(SEED)\n", | ||
"torch.cuda.manual_seed(SEED)" |
Let's update here as well. We can use ignite.utils.manual_seed(SEED) instead of these 3 lines.
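For reference, a minimal replacement along these lines (ignite.utils.manual_seed seeds Python's random, torch, and numpy if it is installed):

from ignite.utils import manual_seed

SEED = 1234
manual_seed(SEED)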
examples/notebooks/TextCNN.ipynb
Outdated
},
"source": [
"random_sample = random.sample(train_list,1)[0]\n",
"print(' text:',random_sample[0])\n",
"print(' text:',random_sample[0])\n", | |
"print(' text:', random_sample[1])\n", |
examples/notebooks/TextCNN.ipynb
Outdated
"source": [ | ||
"random_sample = random.sample(train_list,1)[0]\n", | ||
"print(' text:',random_sample[0])\n", | ||
"print('label:', random_sample[1])" |
"print('label:', random_sample[1])" | |
"print('label:', random_sample[0])" |
examples/notebooks/TextCNN.ipynb
Outdated
" print('y_pred',y_pred)\n", | ||
" print('y',y)\n", |
Let's remove these prints
I was thinking that there could be a way to build the Vocab without recounting the train set. However, it seems that even torchtext does it this way in their tutorials: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
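If reusing a predefined vocabulary were preferred, one sketch of the idea (an assumption, not code from this PR) would be to build the Vocab from GloVe's own token list instead of recounting the corpus. Every token then has count 1, so min_freq must be 1 and the vocabulary is no longer filtered by corpus frequency:

from collections import Counter
from torchtext.vocab import GloVe, Vocab

glove = GloVe(name='6B', dim=100, cache='/tmp/glove/')
# every GloVe token gets count 1; the ~400k-token vocabulary is large,
# but no pass over the training data is needed
vocab = Vocab(Counter(glove.stoi.keys()), min_freq=1, vectors=glove)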
@KickItLikeShika could you please review this PR more thoroughly and check that all requested changes were applied? Thanks
Thanks for the updates @cozek, you're doing a good job! Please just consider the reviews and we are good!
@vfdev-5 I think everything is fine now with this PR
Thanks a lot @KickItLikeShika for checking and @cozek for the update! Let's merge it!
Fixes #1900
Description:
Remove legacy torchtext code from TextCNN example notebook and update to torchtext==0.9.0 API
Check list: