Get word/sentence/paragraph count? #251

Closed
KyleAMathews opened this issue Apr 21, 2017 · 6 comments

Comments

@KyleAMathews

Is there something in the Unist ecosystem for doing this for markdown files?

@Rokt33r
Member

Rokt33r commented Apr 22, 2017

Did you check remark-retext? Although I've never used it, it seems to be what you need.

Remark Retext
https://github.com/wooorm/remark-retext

Retext
https://github.com/wooorm/retext

Nlcst
https://github.com/syntax-tree/nlcst

@wooorm
Member

wooorm commented Apr 22, 2017

Hi Kyle! 👋

Yup, retext does that! You’ll be interested in the links posted by @Rokt33r above, and I also made an example showing a way to use it all together:

var unified = require('unified');
var parse = require('remark-parse');
var stringify = require('remark-stringify');
var english = require('retext-english');
var remark2retext = require('remark-retext');
var visit = require('unist-util-visit');

// Parse markdown, bridge to a retext processor that counts nodes,
// then continue with the markdown tree and stringify it.
unified()
  .use(parse)
  .use(remark2retext, unified().use(english).use(count))
  .use(stringify)
  .processSync('*This* and _that_. \n> And some more stuff.\n\nAnd another thing.');

// Plugin: `count` is hoisted, so it can be used above its definition.
function count() {
  return counter;
  // Transformer: receives the natural-language (nlcst) tree.
  function counter(tree) {
    var counts = {};
    visit(tree, visitor);
    console.log(counts);
    // Tally every node by its type.
    function visitor(node) {
      counts[node.type] = (counts[node.type] || 0) + 1;
    }
  }
}

Yields:

{ RootNode: 1,
  ParagraphNode: 3,
  SentenceNode: 3,
  WordNode: 10,
  TextNode: 10,
  WhiteSpaceNode: 10,
  PunctuationNode: 3 }
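
For anyone who only wants the three numbers from the issue title, the tallying step can be sketched without any dependencies. This is a minimal sketch, assuming an nlcst-style tree of plain objects with `type` and optional `children`. `countTypes`, `summary`, and the hand-built `tree` are hypothetical names for illustration, not part of retext or unist-util-visit, and the tree is not real parser output.

```javascript
// Hypothetical helper: walk a tree depth-first and tally node types.
function countTypes(node, counts = {}) {
  counts[node.type] = (counts[node.type] || 0) + 1;
  for (const child of node.children || []) {
    countTypes(child, counts);
  }
  return counts;
}

// Hand-built nlcst-like tree for "Hi there." (illustrative only).
const tree = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [{
      type: 'SentenceNode',
      children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Hi'}]},
        {type: 'WhiteSpaceNode', value: ' '},
        {type: 'WordNode', children: [{type: 'TextNode', value: 'there'}]},
        {type: 'PunctuationNode', value: '.'}
      ]
    }]
  }]
};

// Reduce the full tally to the counts asked about in this issue.
const counts = countTypes(tree);
const summary = {
  words: counts.WordNode || 0,
  sentences: counts.SentenceNode || 0,
  paragraphs: counts.ParagraphNode || 0
};
console.log(summary);
```

In the plugin above, the same reduction could presumably be applied to the `counts` object built inside `counter`.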

@wooorm closed this as completed Apr 22, 2017
@KyleAMathews
Author

Oh this is perfect! And of course, Retext :-) I'm already using it so silly me for forgetting it.

I'm curious: how hard is it to write the language parsers? I'm planning on adding these counts as data you can get from markdown files in Gatsby, and I'm sure people will want support for languages other than English and Dutch, the two I see you have parsers for.

@wooorm
Member

wooorm commented Apr 24, 2017

retext-latin is pretty OK for most Latin-script languages (and Cyrillic). However, sentence boundaries are pretty hard to detect (is "Xyz. Foo" two sentences or not? Is "Xyz" an abbreviation?). retext-english is built on retext-latin and contains ±350 lines, but ±150 of those relate to abbreviations.
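
To make the abbreviation problem concrete, here is a rough sketch comparing a naive sentence counter with one that consults an abbreviation list. It is not how retext-english works internally. Both functions and the tiny `ABBREVIATIONS` set are hypothetical and exist only to illustrate why a period alone is not enough.

```javascript
// Naive approach: any run of ".", "!" or "?" followed by whitespace
// or the end of the input is treated as a sentence boundary.
function naiveSentenceCount(text) {
  const matches = text.match(/[.!?]+(?=\s|$)/g);
  return matches ? matches.length : 0;
}

// Hypothetical, deliberately tiny list; a real one would be far longer.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'e.g', 'i.e', 'etc']);

function abbreviationAwareSentenceCount(text) {
  const tokens = text.split(/\s+/).filter(Boolean);
  let count = 0;
  tokens.forEach((token, index) => {
    // Only tokens ending in terminal punctuation can close a sentence.
    if (!/[.!?]+$/.test(token)) return;
    const stem = token.replace(/[.!?]+$/, '').toLowerCase();
    const isLast = index === tokens.length - 1;
    // A known abbreviation in the middle of the text does not end a sentence.
    if (ABBREVIATIONS.has(stem) && !isLast) return;
    count++;
  });
  return count;
}

const sample = 'Dr. Smith arrived. He was late, e.g. by an hour.';
console.log(naiveSentenceCount(sample)); // 4
console.log(abbreviationAwareSentenceCount(sample)); // 2
```

Even this version gets cases wrong, such as a sentence that genuinely ends in "etc." partway through the text, which hints at why so much of a language plugin ends up dealing with abbreviations.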

I kind of think retext-english or retext-latin will score better than simple sentence-count and word-count algorithms!

For other, “non-western” scripts, that’s pretty hard. We’d need other people for that as I’m not familiar with them enough to build the needed tools.

@KyleAMathews
Author

Cool! I can add this, plus documentation pointing non-Latin language people here. I also assume there are other tools that do word/sentence counts for non-Latin languages, so that might be a direction they suggest as well.

@wooorm
Member

wooorm commented Apr 24, 2017

Yes, I’d love it if more languages would be connected to retext, and I’m able to help out, but I don’t know enough of those languages to write them myself!
