bhsd-harry/wikiparser-node

Introduction

WikiParser-Node is an offline wikitext parser developed by Bhsd for the Node.js environment. It can parse almost all wiki syntax and generate an Abstract Syntax Tree (AST) (Try it online). The AST can be queried and modified, and the modified tree can be serialized back to wikitext.

Other Versions

Mini (also known as WikiLint)

This version provides a CLI but retains only the parsing and linting functionality; the parsed AST cannot be modified. It is used by the WikiParser Language Server VSCode extension.

Browser-compatible

This version can be used for code highlighting or as a linting plugin in conjunction with editors such as CodeMirror and Monaco. (Usage example)

Installation

Node.js

Please install the corresponding version as needed (WikiParser-Node or WikiLint), for example:

npm i wikiparser-node

or

npm i wikilint

Browser

You can download the code via CDN, for example:

<script src="//cdn.jsdelivr.net/npm/wikiparser-node"></script>

or

<script src="//unpkg.com/wikiparser-node/bundle/bundle.min.js"></script>

For more browser extensions, please refer to the corresponding documentation.

Usage

CLI usage

For MediaWiki sites with the CodeMirror extension installed, such as different language editions of Wikipedia and other Wikimedia Foundation-hosted sites, you can use the following command to obtain the parser configuration:

npx getParserConfig <site> <script path> [force]
# For example:
npx getParserConfig jawiki https://ja.wikipedia.org/w

The generated configuration file will be saved in the config directory. You can then assign the site name to Parser.config.

// For example:
Parser.config = 'jawiki';

API usage

Please refer to the Wiki.
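As a rough illustration of the parse-and-serialize workflow described in the introduction, here is a minimal sketch. It assumes the package's export exposes a `parse` method that returns a tree whose string form is the wikitext; `Parser.config` is taken from the CLI section above. The helper name `roundTrip` and its parameters are hypothetical, and the Wiki remains the authoritative reference for the actual API.

```javascript
/**
 * Hypothetical helper: parse wikitext and serialize the tree back to a string.
 * `Parser` is passed in as an argument so this sketch makes no claim about
 * how the package is loaded in your project.
 */
function roundTrip(Parser, wikitext, site) {
	if (site !== undefined) {
		// Site name of a configuration generated by getParserConfig, e.g. 'jawiki'
		Parser.config = site;
	}
	// Assumption: `parse` returns the root node of the AST
	const root = Parser.parse(wikitext);
	// Assumption: converting the root to a string yields the wikitext
	return String(root);
}

// Possible usage, assuming `npm i wikiparser-node` has been run:
// const Parser = require('wikiparser-node');
// console.log(roundTrip(Parser, "''italic'' and [[link]]", 'jawiki'));
```

Passing the parser in keeps the sketch independent of the installed version (WikiParser-Node or WikiLint).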

Performance

A full database dump (*.xml.bz2) scan of Chinese Wikipedia's ~3.5 million articles (parsing and linting) on a personal MacBook Air takes about 3 hours.
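To put that figure in per-article terms, the arithmetic can be sketched as follows. The inputs are the approximate numbers quoted above, so the results are rough averages rather than measurements.

```javascript
// Approximate figures quoted above
const articles = 3.5e6; // ~3.5 million articles
const hours = 3;        // ~3 hours for the full scan

const seconds = hours * 3600;
const articlesPerSecond = articles / seconds;
const msPerArticle = (seconds * 1000) / articles;

console.log(`${articlesPerSecond.toFixed(0)} articles/s`); // 324 articles/s
console.log(`${msPerArticle.toFixed(2)} ms per article`);  // 3.09 ms per article
```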

Known issues

Parser

  1. Memory leaks may occur in rare cases.
  2. Invalid page names with unicode characters are treated like valid ones (Example).
  3. Preformatted text with a leading space is only processed by Token.prototype.toHtml.

HTML conversion

Extension

  1. Many extensions are not supported, such as <indicator> and <ref>.

Transclusion

  1. Most parser functions are not supported.
  2. Transclusion of a subpage is not supported (Example).

Heading

  1. The table of contents (TOC) is not supported.

HTML tag

  1. Style sanitization is sometimes different (Example).

Table

  1. <caption> elements are wrapped in <tbody> elements (Example).
  2. Unclosed HTML tags in table-fostered content (Example).
  3. <tr> elements should not be fostered (Example).

Link

  1. Link trail is not supported (Example).
  2. Links to a subpage without a slash are not supported (Example).
  3. Block elements inside a link should break it into multiple links (Example).

External link

  1. External images are not supported (Examples 1, 2).
  2. No percent-encoding in displayed free external links (Example).

Block element

  1. Incomplete <p> wrapping when there are block elements (e.g., <pre>, <div> or even closing tags).
  2. Mixed lists (Example).

Miscellaneous

  1. Illegal HTML entities (Example).

About

A Node.js/browser parser for MediaWiki markup with AST
