Cross Compilation
The cross compilation of wiki source converts the downloaded wiki markup string into another format that can be handled in a
- database with JSON output,
- HTML or RevealJS,
- Markdown,
- LaTeX,
- ...

for further editing and processing of the document. The library `wtf_wikipedia` offers cross compilation into the formats `JSON`, `HTML`, `Markdown` and `LaTeX`.

`wtf_wikipedia` also offers a plaintext method that returns only paragraphs of nice text, and no junk:
The following example fetches the article about the Toronto Blue Jays from the English Wikipedia and prints its plaintext:

```javascript
wtf.fetch('Toronto Blue Jays', 'en', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.text());
});
```
You can test the plaintext export with the js-file `./bin/wtf.js` by calling:

```
$ node ./bin/wtf.js --plaintext George_Clooney
```
`wtf_wikipedia` also offers a markdown method that returns the article converted into Markdown syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

```javascript
wtf.fetch('3D Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.markdown());
});
```
You can test the Markdown export with the js-file `./bin/wtf.js` by calling:

```
$ node ./bin/wtf.js --markdown George_Clooney
```
`wtf_wikipedia` also offers an HTML method that returns the article converted into HTML syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

```javascript
wtf.fetch('3D Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(doc.html());
});
```
You can test the HTML export with the js-file `./bin/wtf.js` by calling:

```
$ node ./bin/wtf.js --html George_Clooney
```
`wtf_wikipedia` also offers a LaTeX method that returns the article converted into LaTeX syntax. The following code downloads the article about 3D Modelling from the English Wikiversity:

```javascript
wtf.fetch('3D_Modelling', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  // converts the Wikiversity article about "3D Modelling"
  // from the english domain https://en.wikiversity.org
  // https://en.wikiversity.org/wiki/3D_Modelling
  console.log(doc.latex());
});
```
You can test the LaTeX export with the js-file `./bin/wtf.js` by calling:

```
$ node ./bin/wtf.js --latex George_Clooney
```
`wtf_wikipedia` aggregates information about a wiki article and populates a JSON object. If you want to see what kind of JSON is generated, you can print the stringified JSON to the console with:

```javascript
wtf.fetch('Swarm_Intelligence', 'enwikiversity', function (err, doc) {
  if (err) {
    console.error(err);
    return;
  }
  console.log(JSON.stringify(doc.json(), null, 4));
});
```
The `wtf_fetch()` call fetches the wiki source from the MediaWiki. `wtf_fetch` is available as a separate module if you want to access the wiki source without parsing. The hard task is parsing the source, due to the fact that the wiki source language contains fragments of different grammars (e.g. LaTeX syntax wrapped in `<math>...</math>` tags). After downloading, some preprocessing might be helpful for further improvement of the cross compilation of the source text from the MediaWiki.
- a Tokenizer will replace some content elements (e.g. `math` tags) by a token and assign a specialized handler for this type of content element. The token is regarded as a word in a sentence and will not create conflicts with other parsing processes and incompatible syntax. E.g. within a mathematical expression, `:` is just a LaTeX symbol defining a mathematical operation, while outside the `<math>` tags, as the first character of a line, it represents an indentation in wiki markup.
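The tokenizer idea described above can be sketched in plain JavaScript. This is an illustrative helper, not the implementation in `wtf_wikipedia`; the function names and token format are invented for this example:

```javascript
// Replace <math>...</math> fragments by placeholder tokens before parsing,
// so that LaTeX syntax cannot conflict with the wiki markup grammar.
function tokenizeMath(wikiSource) {
  var store = [];
  var text = wikiSource.replace(/<math>([\s\S]*?)<\/math>/g, function (match, body) {
    store.push(body);
    return '__MATH_TOKEN_' + (store.length - 1) + '__';
  });
  return { text: text, store: store };
}

// After parsing, restore each token with a format-specific renderer
// (e.g. wrap the LaTeX body in $...$ for a LaTeX export).
function restoreMath(text, store, render) {
  return text.replace(/__MATH_TOKEN_(\d+)__/g, function (match, index) {
    return render(store[Number(index)]);
  });
}

// usage
var src = 'An integral <math>\\int_0^1 x\\,dx</math> inside a sentence.';
var tokenized = tokenizeMath(src);
// tokenized.text contains a plain token instead of LaTeX syntax
var restored = restoreMath(tokenized.text, tokenized.store, function (latex) {
  return '$' + latex + '$';
});
```

The token behaves like an ordinary word in a sentence, so all downstream parsing steps can ignore the embedded LaTeX entirely.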
This section explains how developers can extend the capabilities of `wtf_wikipedia` to additional export formats.

If you want to create new additional export formats, try Pandoc document conversion to get an idea which formats can be useful and are currently supported by Pandoc (see https://pandoc.org). Select `MediaWiki` as input format in the Pandoc web interface, copy a MediaWiki source text into the left input textarea, select an output format and press convert.
We explain how to extend `wtf_wikipedia` with a new export format (e.g. LibreOffice Writer). `wtf_wikipedia` is able to export to
- plaintext,
- HTML,
- Markdown and
- LaTeX.

The following sections describe the definition of a new export format, step by step:
- (1) Create a GitHub or GitLab repository with a name that indicates the purpose of the package (e.g. `wtf_wikipedia_odt`, because `odt` is the file extension of LibreOffice Writer; from LibreOffice you can export to Microsoft Office, but not vice versa). Create a source directory in `/src/` with the new method for the output format for all tree nodes in the Abstract Syntax Tree (AST):
  - `/src/01-document/`,
  - `/src/02-section/`,
  - `/src/03-paragraph/`,
  - `/src/image/`,
  - `/src/reference/`,
  - `/src/table/`,
  - `/src/list/`,
  - `/src/infobox/`,
  - `/src/math/` (not implemented in version 7.2.2 yet)
- (2) All the AST nodes mentioned in (1) need a new export method (e.g. `toOdt()`). Look at the other export methods in the repository `wtf_wikipedia/src` to see how these are defined, e.g. `toLatex()` for LaTeX or `toHtml()` for HTML, and adapt them to your new export format.
- (3) @spencermountain created a mapper to new format names that allows the functions to be called `html()`, `latex()` or `markdown()`. This allows the recursive call of the method for all tree nodes according to the new output format (e.g. `odt()`).
- (4) Now we need to assign the new export format to the `Document` object and all other AST nodes, so that the extended format is available at the root node. First see how it is done in `wtf_wikipedia/src/document/Document.js`, e.g. by

  ```javascript
  const wtf = require('wtf_wikipedia');

  wtf.Document.odt = function (options) {
    // ...
    return odt_zip_file;
  };
  ```

  Make sure that all tree nodes of the Abstract Syntax Tree (AST) have an export method for ODT, and extend the module exports at the very end of `Section` in `/src/section/Section.js` at line 240 ff.:

  ```javascript
  odt: function (options) {
    options = setDefaults(options, defaults);
    return toOdt(this, options);
  },
  // ...
  ```
- (5) Create or extend the test script in the directory `/tests`. A test script for the format `odf` will be named `odf.test.js`. A test script for the HTML-based RevealJS presentation format `reveal` will be named `reveal.test.js`. Look at other formats, e.g. `html.test.js`, to understand the testing mechanism. Basically you
  - export a defined text with `wtf` (e.g. `wtf.latex(...)`) and store it in the `have` variable,
  - define the desired output in the `want` variable, and
  - `t.equal(have, want, "test-case-name")` defines the comparison of `have` and `want`.
  - `html_tidy()`, `latex_tidy()`, ... remove comments and generate compressed equivalent code for a smarter `t.equal` comparison. These functions are defined in `tests/tidy.js`.
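The have/want pattern and the role of the tidy helpers can be sketched in plain Node, without a test framework. The `tidy` function here is an invented stand-in for the helpers in `tests/tidy.js`:

```javascript
// Sketch of the have/want comparison pattern used by the *.test.js scripts.
// 'tidy' strips comments and collapses whitespace so that formatting noise
// does not break the comparison (similar in spirit to tests/tidy.js).
function tidy(str) {
  return str
    .replace(/<!--[\s\S]*?-->/g, '') // strip HTML comments
    .replace(/\s+/g, ' ')            // collapse runs of whitespace
    .trim();
}

// a hypothetical export result and the desired output
var have = tidy('<p>Hello   <!-- generator note --> World</p>');
var want = tidy('<p>Hello World</p>');

// in the real test scripts this comparison is t.equal(have, want, "test-case-name")
if (have !== want) {
  throw new Error('test case failed: ' + have + ' !== ' + want);
}
```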
- (6) Run test and build for the extended `wtf_wikipedia`.
- (7, optional) Create a pull request on the original `wtf_wikipedia` repository on GitHub, maintained by Spencer Kelly, to share the code with the community.
If a source text from Wikipedia or Wikiversity is exported, the file is in general taken out of its relative link context. The library `/src/lib/wikiconvert.js` contains a JavaScript class for preprocessing the relative links.
General approach:
- The wiki source text was fetched e.g. from the English Wikiversity, then
  - the language ID is `en` and
  - the domain ID is `wikiversity`
- A relative link replacement should be defined like this:
  - Input wiki markup (text of the English Wikiversity article "My Article"):

    ```
    My [[my wiki link]], my [[/relative link/]] and my [[w:de:mein_link|inter-wiki link]] to
    the german Wikipedia is defined by those links.
    ```

  - Output HTML:

    ```
    My <a href="https://en.wikiversity.org/wiki/my_wiki_link">my wiki link</a>,
    my <a href="https://en.wikiversity.org/wiki/My_article/relative link">relative link</a> and
    my <a href="https://de.wikiversity.org/wiki/mein_link">inter-wiki link</a> to
    the german Wikipedia is defined by those links.
    ```
- Depending on options, the a-tag may be exported with `target="_blank"` to open a new window.
- Inter-wiki links can be encoded by `domain:language:article` (e.g. `w:de:my_article`, which is short for `wikipedia:de:my_article`) to refer to content that is available in a specific language only (e.g. the English Wikipedia only). The `wikiid` used in `wtf_wikipedia` combines the language ID and the domain ID. The `wikiid` site map is stored in `/src/data/site_map.js` and stores all combinatoric options of `language` and `domain`.
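The `domain:language:article` encoding can be resolved with a small helper. This is an illustrative sketch, not the parser in `wtf_wikipedia`; the function name is invented, and only three domain abbreviations are mapped here:

```javascript
// Sketch: split an inter-wiki reference like "w:de:my_article" into its
// domain, language and article parts, expanding the domain abbreviation.
var domainMap = { w: 'wikipedia', v: 'wikiversity', b: 'wikibooks' };

function parseInterwiki(link) {
  var parts = link.split(':');
  if (parts.length === 3 && domainMap[parts[0]]) {
    return { domain: domainMap[parts[0]], language: parts[1], article: parts[2] };
  }
  return null; // not an inter-wiki link
}

var parsed = parseInterwiki('w:de:my_article');
// parsed = { domain: 'wikipedia', language: 'de', article: 'my_article' }
```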
The mapping of a wiki `domain` abbreviation, separated from the `language` abbreviation, can be defined with a hash:

```javascript
var domain_map = {};
domain_map["w"] = "wikipedia";
domain_map["wikipedia"] = "wikipedia";
domain_map["Wikipedia"] = "wikipedia";
domain_map["v"] = "wikiversity";
domain_map["wikiversity"] = "wikiversity";
domain_map["Wikiversity"] = "wikiversity";
domain_map["b"] = "wikibooks";
domain_map["wikibooks"] = "wikibooks";
domain_map["Wikibooks"] = "wikibooks";
// ...
```
The domain map is an associative array that maps a possible domain prefix in an inter-wiki link to an explicit part of the domain name. The explicit part of the domain name (e.g. `wikipedia` for the abbreviation `w`) is necessary to expand relative links to absolute links, especially when a converted wiki document is used outside the Wikipedia or Wikiversity server environment. The relative links `[[Swarm Intelligence]]` or `[[Water]]` in Wikiversity do not work anymore. They must be expanded to https://en.wikiversity.org/wiki/Swarm_Intelligence or https://en.wikiversity.org/wiki/Water. This link conversion can be enabled by a setting in the options, e.g. `options.absolute_links=true`, and `wtf_wikipedia` assures that the relative links still work when the export file is displayed outside the MediaWiki server context (e.g. Wikipedia or Wikiversity).
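Such an expansion could be sketched as follows. This is a hypothetical helper, not the `wikiconvert.js` implementation; it assumes the language and domain IDs are known from the fetch call:

```javascript
// Sketch: expand a relative wiki link to an absolute URL, so the link keeps
// working outside the MediaWiki server context.
function expandLink(article, languageId, domainId) {
  // wiki titles use underscores instead of spaces
  var title = article.trim().replace(/ /g, '_');
  return 'https://' + languageId + '.' + domainId + '.org/wiki/' + title;
}

var url = expandLink('Swarm Intelligence', 'en', 'wikiversity');
// url = "https://en.wikiversity.org/wiki/Swarm_Intelligence"
```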
Test cases are defined in the folder `tests/` and have the ending `.test.js` (e.g. `html.test.js` for the HTML test cases). Just by naming the file with the ending `.test.js`, the test will be included in the NPM test call `npm run test`. Desired output can be generated for different formats by the Pandoc-Try web interface. Select MediaWiki as input format in the Pandoc-Try web interface and select the new format as output format (e.g. `Reveal` for a web-based presentation or `Open Document Format` to generate LibreOffice files based on a template file with all your styles).
Media files like
- images,
- audio and
- video

can be displayed offline (without internet connectivity) if and only if the media files are stored locally on the device as well. The command line tool `wget` can be used for downloading the media files to the device. The files can be stored into corresponding subfolders of the generated HTML file, for example in a subfolder `export/my_html/`:
- `export/my_html/images`,
- `export/my_html/audio`,
- `export/my_html/video`

The selection of the subdirectory can be done with the following function, which checks the extension of the file and derives the subdirectory name from it:
```javascript
function getExtensionOfFilename(pFilename) {
  var re = /(?:\.([^.]+))?$/;
  // re.exec("/path.file/project/output.dzslides.html")[1]; returns "html"
  return re.exec(pFilename)[1]; // "html"
}

function getMediaSubDir(pMediaLink) {
  var vExt = getExtensionOfFilename(pMediaLink);
  var vSubDir = "images";
  switch (vExt) {
    case "wav":
    case "mp3":
    case "mid":
      vSubDir = "audio";
      break;
    case "ogg":
    case "webm":
      vSubDir = "video";
      break;
    default:
      vSubDir = "images";
  }
  return vSubDir;
}
```
If you try Pandoc document conversion, the key to generating Office documents is the export format ODF. LibreOffice can load and save the OpenDocument Format, and LibreOffice can load and save Microsoft Office formats. So exporting to Open Document Format is a good option to start with in `wtf_wikipedia`. The following descriptions are a summary of aspects that support developers in bringing the Office export format e.g. to a web-based environment like the ODF editor.

OpenDocument Format provides a comprehensive way forward for `wtf_wikipedia` to exchange documents from a MediaWiki source text reliably and effortlessly across different formats, products and devices. Regarding the different wikis of the Wikimedia Foundation as a content sink: e.g. the educational content in Wikiversity is no longer restricted to a single export format (like PDF), which opens up access to other specific editors, products or vendors for all your needs. With `wtf_wikipedia` and an ODF export format, users have the opportunity to choose the 'best fit' application for the wiki content. This section focuses on Office products.
Some important information to support Office documents in the future:
- See WebODF for how to edit ODF documents on the web or display slides. A current limitation of WebODF is that it does not render mathematical expressions, but altering a document in the WebODF editor does not remove the mathematical expressions from the ODF file (state 2018/04/07). This may be solved in the WebODF editor by using MathJax or KaTeX in the future.
- The `ODT` format is the default export format of LibreOffice/OpenOffice. Supporting the open community approach, open-source office products are used to avoid commercial dependencies for using the generated Office products.
- The `ODT` format of LibreOffice is basically a ZIP file. Unzipping shows the folder structure within the ZIP format. Create a subdirectory, e.g. with the name `zipout/`, and call `unzip mytext.odt -d zipout` (Linux, MacOSX).
- The main text content is stored in `content.xml`, the main file for defining the content of the Office document.
- Remark: Zipping the folder content again will create a parsing error when you load the zipped office document again in LibreOffice. This may be caused by an inappropriate order in the generated ZIP file. The file `mimetype` must be the first file in the ZIP archive.
- The best way to generate ODT files is to generate an ODT template `mytemplate.odt` with LibreOffice, with all the styles you want to apply for the document, and place a marker at specific content areas where you want to replace the cross-compiled content of `wtf_wikipedia` in `content.xml`. The file `content.xml` will be updated in the ODT ZIP file. Marker replacement is also possible in ODF files (see also the WebODF demos).
- Images must be downloaded from the MediaWiki (e.g. with an NPM equivalent of `wget` for fetching the image, audio or video) and added to the folder structure in the ZIP. Create an ODT file with LibreOffice containing an image and unzip the ODT file to learn about the way ODT stores the image in the ODT ZIP file.
- JSZip: JSZip can be used to update and add certain files in a given ODT template (e.g. `mytemplate.odt`). Handling ZIP files in a cross-compilation web app with `wtf_wikipedia` that runs in your browser makes it possible to generate an editor environment for the cross-compiled wiki source text (like the WebODF editor). Updating the ODT template as a ZIP file can be handled with JSZip by replacing the `content.xml` in the ZIP archive. `content.xml` can be generated with `wtf_wikipedia` when the `odf` export format is added to `/src/output/odf` (ToDo: please create a pull request if you have done that).
- LibreOffice export: Loading ODT files in LibreOffice allows exporting the ODT format to
  - Office documents (`doc` and `docx` format),
  - text files (`.txt`),
  - HTML files (`.html`),
  - Rich Text files (`.rtf`),
  - PDF files (`.pdf`) and even
  - PNG files (`.png`).
- Planning of the ODT support can be done in this README, and collaborative implementation can be organized with pull requests (PR).
- Helpful libraries: node-odt, odt
  - `wtf_wikipedia` supports HTML export,
  - the library `html-docx-js` supports cross-compilation of HTML into the docx format.
First go to the subdirectory `/src/output`. We will show how a new export format can be added to `wtf_wikipedia`.

Create a new subdirectory (e.g. `/src/output/latex`) to support a new export format. Copy the files
- `index.js`,
- `infobox.js`,
- `sentence.js`,
- `table.js`,
- `math.js` (not supported in all formats of <2.6.1 - see ToDo)

from the subdirectory `/src/output/html` into the new subdirectory for the export format (e.g. `/src/output/latex`). Adapt these functions step by step, so that the exported code generates the sentences and tables in an appropriate syntax of the new format.
At the very end of the file `/src/output/latex/index.js` the new export function is defined. Alter the method name

```javascript
const toHtml = function (str, options) {
  // ...
};
```

to a method name of the new export format (e.g. for LaTeX the method name `toLatex`):

```javascript
const toLatex = function (str, options) {
  // ...
};
```
The code of this method can be reused in most cases (no alteration necessary).

The new output format can be exported by `wtf_wikipedia` if a method is added to the file `index.js`. A new `require` command must be added next to those of the export formats that are already integrated in `wtf_wikipedia`:

```javascript
const markdown = require('./output/markdown');
const html = require('./output/html');
const latex = require('./output/latex');
```

After adding the last line for the new export format, the code for cross compilation to LaTeX is available in the variable `latex`. The last step is to add the latex output format to the module exports. Therefore the method for the new output format must be added to the export hash of `wtf_wikipedia` at the very end of `index.js` by adding the line `latex: latex,` to the export hash:
```javascript
module.exports = {
  fetch: fetch,
  plaintext: plaintext,
  markdown: markdown,
  html: html,
  latex: latex,
  version: version,
  custom: customize,
  parse: (str, obj) => {
    obj = obj || {};
    obj = Object.assign(obj, options); // grab 'custom' persistent options
    return parse(str, obj);
  }
};
```
- Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on the concepts of the 'swiss-army knife of document conversion' Pandoc, developed by John MacFarlane - https://www.pandoc.org