
Add nnseq function to quanteda.classifiers #9

Merged
merged 22 commits into master from mod_edits on May 15, 2019

Conversation

@pchest pchest (Collaborator) commented May 13, 2019

No description provided.

@kbenoit kbenoit (Contributor) left a comment

OK, I rewrote this pretty substantially, and fixed the CI so that it installs tensorflow before running the tests. I also fixed a few inconsistent behaviours.

@pchest I reverted the function signature to the one I'd developed in the (now removed) slstm function, since the signature only needs the arguments that are not passed through to keras::fit() via .... The rest are either hardwired or exposed as options in the function. This is now clearly documented in the Rd.

There were also some errors, especially in how the "one-hot encoding" (such a stupid term) of the outcome variable was done, so if you were using this in the paper before, I suspect the results were wrong. Check the diff carefully; in fact, it is worth studying closely, both to verify what I did and to learn from it.

Issues to note:

  • predict(tmod, type = "class") (for both models) now returns a factor whose levels are those from the training set, even when a single case is predicted (it will still carry as many levels as originally existed). This is how textmodel_nb() works, so I made the behaviour consistent with it; see the first sketch after this list.
  • The keras NN models are inherently stochastic, which means that setting the seed in R will not make the results reproducible. There is a discussion about this "problem" here, along with some suggestions, but the results, especially for the small datasets we use, are simply not deterministic. A slight problem from a replication standpoint!
  • I added an example using our new datasets to ?textmodel_nnseq for the immigration sentences; check it out, since the results are very encouraging. It is also as clean a workflow as I can provide as an example.
  • Model objects in keras operate differently from most other R objects, which are passed by value: they are just references to objects modified "in place". Our function seems to work now, with the (pointer?) attached to the fitted model object as object$seqfitted, and it still works when we save the object, restart R, and reload it, but this might cause some weirdness later. I guess we will solve those problems when we come to them; see the second sketch after this list.
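
For illustration, here is a minimal sketch (not code from the PR; the objects are made up) of the factor-level convention in the first bullet: predictions are rebuilt as a factor carrying the full set of training levels, even for a single case.

``` r
# training labels with three classes
train_y <- factor(c("Negative", "Neutral", "Positive"))

# suppose the model predicts class 3 for a single new document
pred_index <- 3L

# rebuild the prediction as a factor with all training levels retained
pred <- factor(levels(train_y)[pred_index], levels = levels(train_y))
pred
## [1] Positive
## Levels: Negative Neutral Positive
```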
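
And on the last bullet, a hedged sketch of why the reference semantics matter for saving. A keras model is an external pointer modified in place, so naively saving the containing object stores a pointer that is dead in a new session; keras::serialize_model() and unserialize_model() convert the model to and from an R raw vector that does survive. (The tmod object and $seqfitted slot follow the naming above; the saving strategy the package actually settles on may differ.)

``` r
library(keras)

# 'tmod' stands in for a fitted textmodel_nnseq object whose $seqfitted
# slot holds a keras model (an external pointer, modified in place)

# fragile: saveRDS(tmod, "tmod.rds") stores only the pointer, which is
# invalid after restarting R

# more robust: swap the pointer for a serialised raw vector before saving
tmod$seqfitted <- serialize_model(tmod$seqfitted)
saveRDS(tmod, "tmod.rds")

# in a fresh session, restore the live model from the raw vector
tmod <- readRDS("tmod.rds")
tmod$seqfitted <- unserialize_model(tmod$seqfitted)
```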

@pchest pchest (Collaborator, Author) commented May 15, 2019

Hey Ken, I'm glad that you found these errors and thank you for correcting them!

  • Regarding the one-hot encoding error, I'm having a bit of trouble identifying where it was. Your new code and the old code appear to have a fairly similar workflow, with yours using as.integer where the old code used as.numeric. Was the issue that to_categorical doesn't work properly if it's given a non-integer input?
  • That is an interesting point about the seeds. I read the link you posted, and apparently there is a function that may address this issue, use_session_with_seed, though it is not an ideal solution, as it disables GPU and parallel CPU operations; see the sketch after this list.
  • Your example works very nicely! I did a bit of additional analysis to see how well the model predicts immigration labels out of sample on the UK manifesto sentences. When I compared the test labels with the predicted labels, I got an F1 score of 0.56, which is quite high relative to the performance we've seen before from the seq model and even the SVM model! (A toy version of the F1 computation follows this list.)
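
For reference, a hedged sketch of the workaround from the second bullet, using use_session_with_seed() from the tensorflow R package; full determinism requires disabling the GPU and parallel CPU threads, which the function does by default (hence the performance cost).

``` r
library(tensorflow)

# set the R, Python, NumPy, and TensorFlow seeds in one call; by default
# this also disables GPU use and parallel CPU threads, both of which are
# sources of non-determinism
use_session_with_seed(42)

# ... fit the keras model here; repeated runs in fresh sessions seeded
# this way should produce identical results ...
```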
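
And a toy version of the F1 computation from the third bullet (made-up labels, not the actual manifesto data):

``` r
# binary F1 from predicted vs. true labels (toy data)
truth <- factor(c(1, 1, 0, 1, 0, 1, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 0, 1, 1, 1, 0, 0), levels = c(0, 1))

tp <- sum(pred == 1 & truth == 1)  # true positives
fp <- sum(pred == 1 & truth == 0)  # false positives
fn <- sum(pred == 0 & truth == 1)  # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
f1
## [1] 0.75
```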

@kbenoit kbenoit merged commit 0e275a0 into master May 15, 2019
@kbenoit kbenoit deleted the mod_edits branch May 15, 2019 04:51
@kbenoit kbenoit (Contributor) commented May 15, 2019

On the one-hot encoding error, it was at aa319ab#diff-c3df496465c72d7d97b36a144fd2093fL60. The class indexing runs from 0 to nclasses - 1, and the earlier code, by passing 1-based values, included an extra class; a sketch of the effect is below.
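
A minimal sketch (assuming the bug matched that description) of the behaviour: keras::to_categorical() expects 0-based integer labels, so passing 1-based factor codes makes it infer an extra class, adding an all-zero leading column.

``` r
library(keras)

y <- factor(c("neg", "pos", "neg", "pos"))

# buggy pattern: factor codes are 1-based, so to_categorical() infers
# max(y) + 1 = 3 classes and returns an extra, all-zero first column
dim(to_categorical(as.numeric(y)))
## [1] 4 3

# correct pattern: shift to 0-based codes, giving exactly nlevels columns
dim(to_categorical(as.integer(y) - 1))
## [1] 4 2
```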
