2 How reformatting is done (technique for the tinkerers)

Reformatting is done by wrapping words or word-parts of the text in HTML styling markup. For example, prepositions such as the word "into" are colored green:
 into 
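
A minimal Python sketch of this wrapping step. The `<span style="color:...">` markup and the helper name `wrap_in_color` are assumptions chosen for illustration; the tool's actual markup may differ.

```python
import html


def wrap_in_color(text: str, target: str, color: str) -> str:
    """Wrap every occurrence of `target` in `text` in a colored span.

    Hypothetical helper: the span markup is an assumption, not the
    tool's confirmed output.
    """
    # Escape first so that characters like "<" in the text stay literal.
    escaped_text = html.escape(text)
    escaped_target = html.escape(target)
    styled = f'<span style="color:{color}">{escaped_target}</span>'
    return escaped_text.replace(escaped_target, styled)


print(wrap_in_color("He walked into the room.", " into ", "green"))
```

The target is passed with its surrounding spaces (" into ") so that only the whole word is matched.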

Styling codes

The definition-file "parse_english.dat" (or generically "parse_language.dat") lists categories of words/strings (shown below in capitals), each of which gets its own style. For each category the following is listed below:

the category header
one sample string to be styled (a line in the definition-file)
the HTML styling code (the word "line" stands for the stylable string)
an explanation of the styling code
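
As an illustration, a definition-file of this kind could be read into categories roughly as follows. This is only a sketch: the file layout (a header line ending in "TO HANDLE", followed by one stylable string per line) and the function name `parse_definitions` are assumptions, not the tool's confirmed format.

```python
def parse_definitions(lines):
    """Group stylable strings under their category header.

    Assumed layout (not confirmed by the source): a header is a line
    ending in "TO HANDLE"; every following non-empty line is one
    stylable string. Spaces are kept because they are significant.
    """
    categories = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the newline, keep the spaces
        if line.strip().endswith("TO HANDLE"):
            current = line.strip()
            categories[current] = []
        elif line.strip() and current is not None:
            categories[current].append(line)
    return categories


# Hypothetical sample content, standing in for a real definition-file.
sample = [
    "VERBS TO HANDLE\n",
    " be \n",
    "ed \n",
    "PREPOSITIONS TO HANDLE\n",
    " on \n",
    " into \n",
]
for header, strings in parse_definitions(sample).items():
    print(header, strings)
```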

PUNCTUATION OF SENTENCES TO HANDLE
.
Styling code: line & "<br>&nbsp;&nbsp;&nbsp;&nbsp;"
Explanation: After the dot a line-break is placed, followed on the next line by 4 non-breaking spaces, so that each new line starts with an indent.

PUNCTUATION OF SENTENCE-PARTS TO HANDLE
,
Styling code: line & "<br>"
Explanation: After the comma only a line-break is placed so that each subsentence / clause gets a new line.

PRONOUNS TO HANDLE
that
Styling code: "<br>" & line
Explanation: A line-break is placed BEFORE the word "that" so that a subsentence is created on the following line starting with "that".
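
The three line-break rules above can be sketched in Python as follows. This assumes the line-break is written as `<br>` and the non-breaking space as `&nbsp;`, and that "that" is matched with surrounding spaces; the names `RULES` and `apply_break_rules` are hypothetical.

```python
BREAK = "<br>"
INDENT = "&nbsp;" * 4

# (stylable string, function producing its replacement); "line" mirrors
# the placeholder name used in the styling codes.
RULES = [
    (".", lambda line: line + BREAK + INDENT),  # break + indent after a dot
    (",", lambda line: line + BREAK),           # break after a comma
    (" that ", lambda line: BREAK + line),      # break before "that"
]


def apply_break_rules(text):
    """Apply each rule in turn with a plain string replacement."""
    for target, style in RULES:
        text = text.replace(target, style(target))
    return text


print(apply_break_rules("I know that it works, so I use it."))
```

Because the rules are applied one after another, the inserted markup must not itself contain any of the stylable strings; `<br>` and `&nbsp;` contain neither a dot nor a comma, so that holds here.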

VERBS TO HANDLE
be
Styling code: " & line & "
Explanation: The verb (or verb-part) is colored magenta, a pinkish color that gives strong highlighting, because verbs are so important in a sentence.

SIGNAL-WORDS TO HANDLE
princip
Styling code: " & line & "
Explanation: The color is reddish orange. The current set of signal-words is used to highlight words that have scientific relevance. Previously these words were also used to create the summary, but the summary is now based on summary_language.dat (summary_english.dat for English). Of course you can adjust the list to create your own signal-words / highlights.

LINK-WORDS TO HANDLE
and
Styling code: "" & line & ""
Explanation: This group contains words that compare or link sentence-parts which cannot stand as separate sentences.

PREPOSITIONS TO HANDLE
on
Styling code: "" & line & ""
Explanation: Prepositions (green) are also useful for separating word-groups.

NOUN-ANNOUNCERS TO HANDLE
The
Styling code: "" & line & ""
Explanation: The color is light-brown. These words announce a group of words with at least 1 noun in it. Also adverbs, adjectives and chained nouns can be part of this group.

NOUN-REPLACERS TO HANDLE
I
Styling code: "" & line & ""
Explanation: These words replace or stand for nouns, hence they are also called pronouns.

AMBIGUOUS WORD-FUNCTIONS TO HANDLE
to
Styling code: "" & line & ""
Explanation: The color is ochre-yellow. These words indicate either a verb or some other function; they are marked because verbs are crucial to understanding a sentence. The example " to " often announces a verb (I want to walk), but it can also be just a preposition (he goes to school).

Additional remarks

In summary, the stylable strings (lines) of the parse_language.dat file have the following characteristics:

they are a punctuation mark, like "." or "?"
they are either a word or a word-part; for example, to identify the past tense of verbs the string "ed " can be used.
the strings can have a leading or trailing space, like " be ". This is often needed to distinguish them from other words. For example, "ed" matches past-tense words, but also words like edict, edible etc. Therefore "ed " is preferred.
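
The effect of the trailing space can be checked with a small Python sketch. The word list and the helper name `words_matching` are made up for illustration.

```python
def words_matching(pattern, words):
    """Return the words in which `pattern` occurs.

    Each word is tested with a trailing space appended, as it would
    appear in running text.
    """
    return [w for w in words if pattern in w + " "]


words = ["walked", "edict", "edible", "jumped", "red"]

print(words_matching("ed", words))   # matches all five words
print(words_matching("ed ", words))  # matches walked, jumped and red
```

Note that "red" still matches "ed " even though it is not a past tense, so the trailing space reduces false matches but does not remove them.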

False positives

We already saw above that false positives arise easily. They should be avoided, or at least be flagged as in the ochre-yellow case, but they cannot be avoided entirely.

Public lists

So far I have compiled the parse_language.dat files myself. A possible future option is to use public lists of words with grammatical functions. However, the longer the list becomes, the slower the processing will be. I have also not yet searched for such public lists. Another option would be a library that can do natural language parsing, but I have not looked for those either (insofar as they exist and are reliable).

Non-ascii characters

I have not yet investigated whether non-ASCII or other exotic characters can be used for parsing. I will check that soon.
