Add tokenize_tweets() function and tests #44

Merged 1 commit into ropensci:master on Apr 8, 2017


kbenoit commented Apr 8, 2017

What it does:

  • Provides a special tokeniser for Twitter data that preserves Twitter hashtags and usernames, according to the rules for valid construction of both. It also offers an option to keep URLs or remove them entirely, since URLs are very common in tweets.
  • The special handling is implemented as options (with defaults):
    • strip_punctuation = TRUE
    • strip_url = FALSE
  • Adds tests.
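
For readers who want to see the target behaviour in miniature, here is a rough Python sketch of the two options. It is not the R code in this PR: the function name `tokenize_tweet_sketch` and the regular expressions are assumptions made for illustration, and the patterns only approximate Twitter's rules for valid hashtags and usernames.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the PR's rules).
# Order matters: URLs and hashtags/usernames are tried before plain words.
TOKEN_RE = re.compile(
    r"(?P<url>https?://\S*[\w/])"   # URL, not ending in trailing punctuation
    r"|(?P<tag>[#@]\w+)"            # hashtag or username kept as one token
    r"|(?P<word>\w+(?:'\w+)*)"      # ordinary word, allowing apostrophes
    r"|(?P<punct>[^\w\s])"          # any single punctuation character
)


def tokenize_tweet_sketch(text, strip_punctuation=True, strip_url=False):
    """Split one tweet into tokens, preserving #hashtags and @usernames."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup  # name of the alternative that matched
        if kind == "url" and strip_url:
            continue
        if kind == "punct" and strip_punctuation:
            continue
        tokens.append(match.group())
    return tokens


tweet = "Hello @user, check #rstats at https://t.co/abc!"
print(tokenize_tweet_sketch(tweet))
print(tokenize_tweet_sketch(tweet, strip_punctuation=False))
print(tokenize_tweet_sketch(tweet, strip_url=True))
```

With the defaults, punctuation is dropped while the `#` and `@` prefixes and the URL survive intact. Setting `strip_url=True` removes the URL token entirely rather than breaking it into pieces.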

This partially addresses #25 and lays out new function arguments and a possible framework for implementing other aspects of the #25 wishlist. It also defines the target behaviours, which will guide a later rewrite of parts of this in C++ for faster handling. (Although in my experience of doing this in C++, stringi is just about as fast!)

Additional notes:

  • I tried to adhere to the existing conventions in tokenizers as much as possible, but added a new option for strip_punctuation that adds slightly to the complexity of basic-tokenizers. I would however like to see this show up eventually in tokenize_words().
  • There are probably more efficient ways to do this, but I converted the basic list created by stri_split_boundaries() into a vector, after recording the positions so that I could split it back later. This avoids a lot of difficult-to-construct (and difficult-to-read) Map/mapply operations on lists of logicals for the index operations that allow the special handling.
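
The flatten-then-split-back idea in the last note can be sketched as follows. This is a language-neutral illustration in Python, not the R implementation: the helper names `flatten_with_lengths` and `split_back` are hypothetical, and recording per-document lengths is just one assumed way of "recording the positions".

```python
def flatten_with_lengths(token_lists):
    """Flatten per-document token lists into one vector, remembering sizes."""
    lengths = [len(tokens) for tokens in token_lists]
    flat = [tok for tokens in token_lists for tok in tokens]
    return flat, lengths


def split_back(flat, lengths, keep):
    """Rebuild per-document lists, dropping tokens whose keep flag is False."""
    result = []
    start = 0
    for n in lengths:
        end = start + n
        result.append(
            [tok for tok, k in zip(flat[start:end], keep[start:end]) if k]
        )
        start = end
    return result


docs = [["Hello", ",", "@user"], [], ["#rstats", "!"]]
flat, lengths = flatten_with_lengths(docs)

# One flat logical vector replaces a list of per-document logical vectors.
keep = [tok not in {",", "!"} for tok in flat]

print(split_back(flat, lengths, keep))
```

All the special handling operates on a single flat vector and a single logical index, so no nested mapping over lists of logicals is needed. Documents with no tokens still come back as empty lists in their original positions.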

codecov-io commented Apr 8, 2017

Codecov Report

Merging #44 into master will increase coverage by 1.11%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #44      +/-   ##
+ Coverage   88.99%   90.11%   +1.11%     
  Files          12       13       +1     
  Lines         309      344      +35     
+ Hits          275      310      +35     
  Misses         34       34

| Impacted Files | Coverage Δ |
|---|---|
| R/tokenize_tweets.R | 100% <100%> (ø) |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c79a17c...2ae6e8f. Read the comment docs.

@lmullen lmullen merged commit 6e8df25 into ropensci:master Apr 8, 2017
3 checks passed
codecov/patch 100% of diff hit (target 88.99%)
codecov/project 90.11% (+1.11%) compared to c79a17c
continuous-integration/travis-ci/pr The Travis CI build passed

lmullen commented Apr 8, 2017

@kbenoit Thanks for the PR. Looks great. I'm glad to have this in tokenizers before the next release.

Clever solutions to preserving the usernames and URLs. I wouldn't have thought of that.

Adding this to NEWS in a separate commit.
