Get word/sentence/paragraph count? #251

KyleAMathews · 2017-04-21T19:08:02Z

Is there something in the Unist ecosystem for doing this for markdown files?

Rokt33r · 2017-04-22T02:12:15Z

Did you check remark-retext? Although I've never used it, it seems to be what you need.

Remark Retext
https://github.com/wooorm/remark-retext

Retext
https://github.com/wooorm/retext

Nlcst
https://github.com/syntax-tree/nlcst

wooorm · 2017-04-22T17:39:04Z

Hi Kyle! 👋

Yup, retext does that! You’ll be interested in the links posted by @Rokt33r above, and I also made an example showing a way to use it all together:

var unified = require('unified');
var parse = require('remark-parse');
var stringify = require('remark-stringify');
var english = require('retext-english');
var remark2retext = require('remark-retext');
var visit = require('unist-util-visit');

// Parse markdown, then hand the tree to a retext (English) processor
// that runs the `count` plugin defined below.
unified()
  .use(parse)
  .use(remark2retext, unified().use(english).use(count))
  .use(stringify)
  .processSync('*This* and _that_. \n> And some more stuff.\n\nAnd another thing.');

// Plugin: tally how often each node type occurs in the nlcst tree.
function count() {
  return counter;
  function counter(tree) {
    var counts = {};
    // Without a type filter, `visitor` is called for every node.
    visit(tree, visitor);
    console.log(counts);
    function visitor(node) {
      counts[node.type] = (counts[node.type] || 0) + 1;
    }
  }
}

Yields:

{ RootNode: 1,
  ParagraphNode: 3,
  SentenceNode: 3,
  WordNode: 10,
  TextNode: 10,
  WhiteSpaceNode: 10,
  PunctuationNode: 3 }
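
If only the word, sentence, and paragraph totals are needed, the tally above can be reduced to three numbers. Below is a minimal, dependency-free sketch that walks any nlcst-shaped tree (plain objects with a `type` and optional `children`). `countNodes`, `sample`, and `summary` are hypothetical names for illustration and not part of retext; the sample tree is written by hand, whereas a real one would come from retext-english.

```javascript
// Hypothetical helper: recursively tally node types in an nlcst-like tree.
function countNodes(tree) {
  var counts = {};
  walk(tree);
  return counts;

  function walk(node) {
    counts[node.type] = (counts[node.type] || 0) + 1;
    (node.children || []).forEach(walk);
  }
}

// Hand-written tree for "Hi there. Bye." (an assumption for illustration).
var sample = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [
      {type: 'SentenceNode', children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Hi'}]},
        {type: 'WhiteSpaceNode', value: ' '},
        {type: 'WordNode', children: [{type: 'TextNode', value: 'there'}]},
        {type: 'PunctuationNode', value: '.'}
      ]},
      {type: 'WhiteSpaceNode', value: ' '},
      {type: 'SentenceNode', children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Bye'}]},
        {type: 'PunctuationNode', value: '.'}
      ]}
    ]
  }]
};

// Keep only the three totals of interest; missing types count as zero.
var counts = countNodes(sample);
var summary = {
  paragraphs: counts.ParagraphNode || 0,
  sentences: counts.SentenceNode || 0,
  words: counts.WordNode || 0
};
console.log(summary);
```

Inside a plugin like `count` above, the same reduction could be applied to the `counts` object before logging or storing it.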

KyleAMathews · 2017-04-24T16:30:44Z

Oh this is perfect! And of course, Retext :-) I'm already using it so silly me for forgetting it.

I'm curious: how hard is it to write the language parsers? I'm planning on exposing these counts as data you can get from markdown files in Gatsby, and I'm sure people will want support for languages other than English and Dutch, the two I see you have parsers for.

wooorm · 2017-04-24T16:49:02Z

retext-latin is pretty OK for most Latin-script languages (and Cyrillic). However, sentence boundaries are pretty hard to detect (is Xyz. Foo two sentences or not? Is Xyz an abbreviation?). retext-english is built on retext-latin and contains ±350 lines, but ±150 of those relate to abbreviations.

I kind of think retext-english or retext-latin will score better than simple sentence-count and word-count algorithms!

For other, “non-western” scripts, that’s pretty hard. We’d need other people for that, as I’m not familiar enough with them to build the needed tools.
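
To illustrate the abbreviation problem for Latin-script text, here is a small sketch comparing a naive terminator-based sentence count with one that skips a short list of known abbreviations. Both functions and the `ABBREVIATIONS` list are hypothetical and written for this example; this is not how retext-english is implemented, only a demonstration of why the naive approach overcounts.

```javascript
// Naive: every run of ".", "!" or "?" followed by whitespace or end of input
// is treated as the end of a sentence.
function naiveSentenceCount(text) {
  var matches = text.match(/[.!?]+(?=\s|$)/g);
  return matches ? matches.length : 0;
}

// Tiny abbreviation list, stored without the trailing period.
// An assumption for illustration only; real lists are far longer.
var ABBREVIATIONS = ['dr', 'mr', 'mrs', 'e.g', 'i.e', 'etc'];

// A whitespace-separated token ends a sentence when it finishes with a
// terminator and the part before it is not a known abbreviation.
function abbreviationAwareSentenceCount(text) {
  return text.split(/\s+/).filter(function (token) {
    var match = /^(.*?)[.!?]+$/.exec(token);
    if (!match) return false;
    return ABBREVIATIONS.indexOf(match[1].toLowerCase()) === -1;
  }).length;
}

var text = 'Dr. Smith arrived. He was late, e.g. by an hour.';
console.log(naiveSentenceCount(text)); // 4
console.log(abbreviationAwareSentenceCount(text)); // 2
```

Even this version is easy to fool: a sentence that genuinely ends in "etc." would be missed, which hints at why so much of a language-specific parser ends up dealing with abbreviations.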

KyleAMathews · 2017-04-24T16:54:35Z

Cool! I can add this, plus documentation pointing people who need non-Latin languages here. I also assume there are other tools that do word/sentence counts for non-Latin languages, so that might be a direction they suggest as well.

wooorm · 2017-04-24T16:58:09Z

Yes, I’d love it if more languages were connected to retext, and I’m able to help out, but I don’t know enough of those languages to write the parsers myself!

wooorm closed this as completed Apr 22, 2017

KyleAMathews mentioned this issue Apr 24, 2017

[1.0] Umbrella issue gatsbyjs/gatsby#796

Closed

48 tasks

rufuspollock mentioned this issue Nov 19, 2023

[epic] MarkdownDB plugin system datopian/markdowndb#2

Open

5 tasks
