
The tokenizer may be implemented as a micro-library wtf_tokenizer within the Wiki Transformation Framework (WTF).

Tokenizer for Syntax Domains

The tokenizer converts XML sections such as REF tags and mathematical expressions wrapped in MATH tags into attributes of the generated JSON. For example, it converts

text before math <MATH> 
\sum_{i=1}^{\infty} [x_i] 
: v_i 
</MATH> text after math.
text before <ref>my reference ...</ref> and text after 
cite an already defined reference with <ref name="MyLabel"/> text after citation.

into

text before math ___MATH_INLINE_7238234792_ID_5___ text after math.
text before ___CITE_7238234792_ID_3___ and text after    
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.

The challenge of parsing can be seen in the mathematical expression: a colon : in the first column of a line defines an indentation in wiki markup, but within a mathematical expression it is just a division.
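A minimal sketch of how this extraction could look, assuming a hypothetical tokenizeMath() helper, a run identifier runId supplied by the caller and a plain object store for the extracted LaTeX (the real wtf_tokenizer may structure this differently):

function tokenizeMath(wiki, runId, store) {
  // Replace every <MATH>...</MATH> block by a unique marker and keep the LaTeX.
  let counter = 0;
  return wiki.replace(/<MATH>([\s\S]*?)<\/MATH>/gi, (match, tex) => {
    counter += 1;
    const marker = `___MATH_INLINE_${runId}_ID_${counter}___`;
    store[marker] = tex.trim(); // original LaTeX, needed again at output time
    return marker;
  });
}

Because the whole MATH block is cut out before the wiki parser runs, the colon inside it can no longer be mistaken for an indentation.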

Uniqueness of Markers with Time Stamp

The number 7238234792 is a unique integer generated from the date and time in milliseconds, which makes the marker unique. Mathematical expressions, citations and references are extracted in the preProcess() call. The tokenizer is encapsulated in /src/01-document/preProcess/tokenizer.js. The tokens/markers are treated as ordinary words in the text. The markers can be replaced in the postProcess step or even later, when the output is generated with toHTML() or toMarkDown(), because during output the final numbering of citations can be generated, e.g. if more than one article is downloaded and aggregated.
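As a sketch, such a run identifier and marker could be generated like this (makeMarker() is a hypothetical helper, not part of the current code):

const runId = Date.now(); // milliseconds since the epoch, unique per tokenizer run

function makeMarker(kind, id, runId) {
  // kind: 'MATH_INLINE', 'MATH_BLOCK' or 'CITE'; id: a counter like 'ID_3' or a <ref> label
  return `___${kind}_${runId}_${id}___`;
}

// makeMarker('CITE', 'MyLabel', runId) -> '___CITE_<runId>_MyLabel___'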

So it makes sense that the markers/tokens remain even in the JSON sentences, sections and paragraphs until the final output is generated. Currently, in my test repository, I do not populate doc.references; instead I populate data.refs4token in the same way as you populate doc.references, but it additionally stores the label for the backwards replacement in the output. So I have added the corresponding label (e.g. ___CITE_7238234792_ID_3___ or ___MATH_INLINE_7238234792_ID_5___) to the references in data.refs4token, so that the markers for citations can later be replaced, e.g. by [6] in the IEEE citation style. A replacement of a citation in APA style will create e.g. (Kelly 2018) on a call of doc.text() or doc.html(). The same would be performed for mathematical inline and block expressions; they need the original location of the mathematical expression in the sentence (e.g. ___MATH_INLINE_7238234792_ID_5___).
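A possible shape for such entries in data.refs4token (the field names here are assumptions, chosen only to illustrate the idea):

const refs4token = [
  {
    token: '___CITE_7238234792_ID_3___',   // marker that remains in the sentence
    type:  'cite',
    raw:   '<ref>my reference ...</ref>'   // original wiki source of the reference
    // at output time this becomes e.g. '[6]' (IEEE) or '(Kelly 2018)' (APA)
  },
  {
    token: '___MATH_INLINE_7238234792_ID_5___',
    type:  'math-inline',
    tex:   '\\sum_{i=1}^{\\infty} [x_i] : v_i' // original LaTeX for MathJax or LaTeX output
  }
];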

You mentioned that you are affected by the parsing order all over the place. With this concept you can get rid of those parsing problems, because the XML in REF tags and the LaTeX in MATH tags is removed and stored for further use in the JSON. At the same time, the marker/tokenize concept preserves the position of the JSON content in the original wiki source.

This requires the introduction of a toJSON() method that replaces the content of the key-value pairs in the docJSON file.
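A sketch of what such a replacement over the generated JSON could look like, assuming the token entries are stored as in the hypothetical refs4token list above and that render() maps an entry to its final output text:

function replaceTokens(node, refs4token, render) {
  // Walk all string values of the docJSON object and replace the markers.
  if (typeof node === 'string') {
    return refs4token.reduce(
      (text, ref) => text.split(ref.token).join(render(ref)),
      node
    );
  }
  if (Array.isArray(node)) {
    return node.map(child => replaceTokens(child, refs4token, render));
  }
  if (node && typeof node === 'object') {
    Object.keys(node).forEach(key => {
      node[key] = replaceTokens(node[key], refs4token, render);
    });
  }
  return node;
}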

The robustness to parsing order seems to be very good and could save us some headaches, because the MATH tags and REF tags are already extracted in the preProcess step, in which they are currently removed as well. Furthermore, this preserves the position in the text as well as the mathematical expression itself, with its block or inline type and label attribute, without losing the position of the math expressions in the wiki source.

Tokenizer Steps and Workflow - Recommendation

  • Step 1: wtf_fetch() based on cross-fetch fetches the wiki source
    • Input:
      • language="en" or language="de" to specify the language of the wiki source
      • domain="wikipedia" or domain="wikiversity" or domain="wikispecies" to select the wiki domain that the fetch() call pulls the wiki sources from
    • Output:
      • wiki source text, e.g. from Wikipedia or Wikiversity
      • Remark: wtf_fetch extracts your wtf.fetch() into a separate module.
  • Step 2: wtf_tokenize()
    • Input:
      • wiki source text, e.g. from Wikipedia or Wikiversity, fetched by wtf_fetch
    • Output:
      • wiki source text in which, e.g., mathematical expressions are replaced by tokens like MATH-INLINE-839832492834_N12. wtf_wikipedia treats those tokens just as words in a sentence.
  • Step 3: wtf_wikipedia()
    • Input:
      • wiki source text with tokenized citations and mathematical expressions
    • Output: an object doc of type Document. The output methods for text, HTML, LaTeX and JSON can be applied as usual and contain the tokens as words in sentences. The tokens appear in the output of doc.html() or doc.latex() in wtf_wikipedia and in the JSON as well.
  • Step 4: wtf_tokenize (detokenization; see the sketch after this list)
    • Input:
      • string in the export format, i.e. text with tokenized citations and mathematical expressions
    • Output: the detokenized export format. The output string out is injected into the detokenizer, e.g. detokenize.html(out, data, options); in this case the output string out is already in the HTML format. In the output out, or in any other desired output format (e.g. Markdown), the token replacement is performed: for HTML the mathematical expressions are exported to MathJax, and for LaTeX the detokenizer replaces the word/token MATH-INLINE-839832492834_N12 by $\sum_{n=0}^{\infty} \frac{x^n}{n!}$.
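Putting the four steps together, the intended pipeline might look roughly like this. The module names wtf_fetch, wtf_tokenize and wtf_wikipedia and the detokenize.html() call are taken from the steps above, while the exact signatures, the article title and the empty options object are assumptions:

// Sketch only: wtf_fetch, wtf_tokenize, wtf_wikipedia and detokenize
// are the proposed micro-libraries described in the steps above.
async function wiki2html() {
  const data = {};                                    // token store, e.g. data.refs4token
  const src  = await wtf_fetch('Swarm intelligence',  // Step 1: fetch the wiki source
                               { language: 'en', domain: 'wikiversity' });
  const tokenized = wtf_tokenize(src, data);          // Step 2: replace REF/MATH tags by markers
  const doc  = wtf_wikipedia(tokenized);              // Step 3: parse as usual
  const out  = doc.html();                            // output still contains the markers
  return detokenize.html(out, data, {});              // Step 4: replace markers, e.g. by MathJax
}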

It seems that the only remaining step is that the constructors for the AST tree nodes, e.g. Reference, MathFormula, ..., should be extensible with additional export formats, e.g. a doc.reveal() that visits the AST tree nodes Section, Paragraph, Sentence, ... and calls the appropriate toReveal() function for each node. It might be sufficient to add it the following way:

Document.reveal = function () {
   // ...
};

Document.Section.reveal = function () {
   // ...
};

Document.Section.Paragraph.reveal = function () {
   // ...
};
...

Section might own different constructors for tree nodes of the AST (Abstract Syntax Tree), so

Document.Section.Table.reveal = function () {
   // ...
};

or, respectively, assigned at the paragraph level:

Document.Section.Paragraph.Table.reveal = function () {
   // ...
};
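For instance, the document-level reveal() could simply delegate to its sections and concatenate the generated reveal.js slide markup. A sketch of the idea, attached here to the prototype so that this refers to the document instance, and assuming sections() returns the section nodes:

Document.prototype.reveal = function () {
  // Visit all sections of the AST and join their slide markup.
  return this.sections()
    .map(section => section.reveal())
    .join('\n');
};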