Support indexless|hierarchical generic content part declaration patterns. #1

faerietree · 2017-03-12T00:31:51Z

Definitions

Indexless := no index numbering scheme, i.e. if a number occurs, it is either content or denotes a hierarchy level in the markup, not a position in a series. => Numbers are explicit (no regular expression). => Only an implicit ordering is possible.
Indexed := with an index numbering scheme (i.e. explicit order)

Generic := filter by an expression (regex|wildcard|...)
Specific := explicit := filter by explicit content (repeating phrase)
Raw content := markup content
Content := plain-text content, i.e. the visible content such as informational text, media, ...

Content part declarations

[Content phrase filter] (matching all specific content parts within all hierarchy levels mixed)
- generic|regex|wildcard
  - can match index (have an explicit series order)
    - all|mixed
      building[ ]*([\\d][ .->_]*)+:
      Solution[ ]*([\\d][ .->_]*)+:
      Exercise[ ]*([\\d][ .->_]*)+:
      Teacher[ ]*([\\d][ .->_]*)+:
      Text[ ]*([\\d][ .->_]*)+:
    - numbers+special chars only (have an explicit series order, hierarchy e.g. 1.1, 1.2, 2.1, ...)
      ([\\d].)+ ([\\d])+
      Note: A number-based index may need false positives filtered out, because numbers also occur inside the content parts.
  - guaranteed indexless
```
Exercise[ ]*: // generic|regex|wildcard
...
```
- specific|explicit (guaranteed indexless)
  - all|mixed
```
Exercise:
Teacher:
Teacher (Physics):
Teacher Physics:
Structures :
Structures:
Text:
Mission to achieve: // phrase is a sentence|clause
```
  - numbers+special chars only (e.g. 1.1:, 1.1:, 1.1:, ...)
```
3:
1.2:
800:
```
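
To make the difference between the generic indexed and the generic indexless content phrase filter concrete, here is a minimal Python sketch. It is not the project's code. The label set `LABELS`, the function name `classify` and the choice to normalise an index to a tuple of integers are assumptions for illustration. The character class is written with the hyphen last so that it is a literal hyphen and not a range.

```python
import re

# Hypothetical label set, taken from the example phrases in this issue.
LABELS = ("Exercise", "Solution", "Teacher", "Text")

# Generic filter that can match an index: label, optional spaces, one or
# more digit groups separated by " ", ".", "-", ">" or "_", then ":".
INDEXED = re.compile(
    r"^(?P<label>%s)[ ]*(?P<index>(?:\d+[ .>_-]*)+):" % "|".join(LABELS)
)

# Generic filter that is guaranteed indexless: label, optional spaces, ":".
INDEXLESS = re.compile(r"^(?P<label>%s)[ ]*:" % "|".join(LABELS))


def classify(line):
    """Return (kind, label, index) for a declaration line, or None."""
    m = INDEXED.match(line)
    if m:
        # Normalise "1.2", "1 - 2" or "1_2" to a tuple of ints.
        index = tuple(int(n) for n in re.findall(r"\d+", m.group("index")))
        return ("indexed", m.group("label"), index)
    m = INDEXLESS.match(line)
    if m:
        return ("indexless", m.group("label"), None)
    return None
```

With these assumptions, `classify("Exercise 1.2: Compute ...")` reports an indexed declaration with index `(1, 2)`, while `classify("Exercise : Compute ...")` reports an indexless one. A line such as `Teacher (Physics):` matches neither generic filter and would need a specific|explicit phrase.
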
[Raw content | Markup phrase filter]
- specific|explicit, match only indexless (no order; numbers denote hierarchy depth; Matches only within one hierarchy level)
```
#
##
---
===
h1
h2
h:p level='1'
header level="1"
header level="2"
...
```
- generic, can match index (Matching all series within all hierarchy levels in one pass! [1])
  Note: This is the default case for XML-based file formats. It requires keeping track of the hierarchy depth by counting in code because a node has no number attached! It can, however, have a style attached that denotes the depth.
```
#+
header[\\d]*
h[\\d]*
section
```
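
A minimal sketch of the depth tracking mentioned in the note above, in Python. For a Markdown-style `#+` pattern the depth is simply the length of the match. For XML the nodes carry no number, so the depth is counted while walking the tree. The element name `section`, the attribute `title` and both function names are assumptions for illustration, not taken from any real document format.

```python
import re
import xml.etree.ElementTree as ET

HEADER = re.compile(r"^(#+)[ ]*(.*)$")


def markdown_headers(text):
    """Yield (depth, title); the depth is the number of leading '#'."""
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            yield (len(m.group(1)), m.group(2).strip())


def xml_sections(element, depth=0):
    """Yield (depth, title) for nested <section> nodes.

    The nodes carry no number, so the depth is counted while recursing.
    """
    if element.tag == "section":
        depth += 1
        yield (depth, element.get("title", ""))
    for child in element:
        yield from xml_sections(child, depth)
```

Usage: `list(xml_sections(ET.fromstring(xml_text)))` returns the sections in document order together with their nesting depth, which is the information a numbered header such as `h2` would carry directly.
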
[Mixed: Markup & Content phrase filters]
- specific, mostly indexless (matching all within one hierarchy level with a content filter:)
```
#Breaking:
# Tex
# general information
## specific information
```
- generic, match only indexless (Matching all series within all hierarchy levels in one pass! [1])
  Note: For all XML-based file formats this merged pattern is easier to achieve by postprocessing the respective content part's head after employing the generic, indexless filter.
```
#+(^[<][\\w][>])*[Tt]ext
#+[ ]?[Gg]eneral information
```

[1] Only of limited use, as higher-level elements have no content part if strict sectioning is followed. What remains is only the declaration, unless there is summary|description content between e.g. 1. and its subsection 1.1.

Purpose

These patterns are essential for the worlddevelopment civilization editor, open bookkeeper bot, ...

faerietree · 2017-03-12T00:55:15Z

This is relevant in this sense:

1. Upload sheet.
2. Extract all content parts by matching their markup or content declarations against a large pool of generic and specific patterns.
3. Postprocess the most promising content parts.

In some cases it makes sense to use a second pass for content phrase filtering instead of employing very complicated and failure-prone regular expressions (e.g. for the Mixed case, where it is very complicated and costly to match content and a specific markup node at the same time).
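
A minimal sketch of such a two-pass approach in Python, assuming Markdown-style `#` headers as the markup. The names `split_parts` and `filter_heads` and the phrase pattern are hypothetical and only illustrate the idea: the first pass uses a simple generic markup filter, the second pass applies a content phrase filter to each part's head.

```python
import re

# First pass: generic markup filter (any header depth).
MARKUP = re.compile(r"^(#+)[ ]*(.*)$")
# Second pass: content phrase filter, applied to the head only.
PHRASE = re.compile(r"^(?:[Tt]ext|[Gg]eneral information)\b")


def split_parts(text):
    """First pass: split on every header line into (depth, head, body).

    Lines before the first header are ignored in this sketch.
    """
    parts = []
    for line in text.splitlines():
        m = MARKUP.match(line)
        if m:
            parts.append([len(m.group(1)), m.group(2).strip(), []])
        elif parts:
            parts[-1][2].append(line)
    return [(d, h, "\n".join(b).strip()) for d, h, b in parts]


def filter_heads(parts, phrase=PHRASE):
    """Second pass: keep the parts whose head matches the content phrase."""
    return [p for p in parts if phrase.match(p[1])]
```

Keeping the two filters separate means neither regular expression has to describe markup and content at the same time, and the first pass can be swapped per document type without touching the phrase filter.
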

faerietree · 2017-03-12T02:26:41Z

Allows e.g. hierarchical progressive splitting of sections, i.e. gathering context (!) while on the way to detecting leaf content parts.

Thus, in the end, the leaf content parts found are the same as in the current weighted score system, where hit count (the number of content part declarations found) is the most significant factor. Because leaf content parts get the most hits in a tree structure such as an (XML-based) document, their declarations will always get the highest rating. This prevents hierarchical splitting, which is required to maintain context. (Providing context is exactly the purpose of section headers or, more generally, content part declarations! Repeating the context in every leaf content element is highly redundant.)
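
A minimal sketch of what gathering context on the way to the leaves could look like, in Python. It assumes the declarations were already detected and are given as `(depth, head, body)` tuples in document order. The function name and the tuple layout are assumptions for illustration, not the project's data model.

```python
def leaves_with_context(parts):
    """parts: (depth, head, body) tuples in document order.

    Returns (context, head, body) for every part that has no deeper part
    directly below it; context is the tuple of ancestor heads.
    """
    stack = []   # (depth, head) of the currently open ancestors
    result = []
    for i, (depth, head, body) in enumerate(parts):
        # Close every ancestor that is not above the current depth.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        next_depth = parts[i + 1][0] if i + 1 < len(parts) else 0
        if next_depth <= depth:  # no child follows: this is a leaf
            result.append((tuple(h for _, h in stack), head, body))
        stack.append((depth, head))
    return result
```

The leaves are the same ones a hit-count based detection would find, but each one now carries the heads of its ancestors, so the context does not have to be repeated in the leaf itself.
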

As only markup stores the hierarchy level, there is no known way around extending the declaration detection on a per sheet document type basis, e.g. ODT, DOCX, MD, RST, ...

In these sets of declarations, the uppermost level must by all means get the highest weight, provided it occurs more than once, to ensure a top-down approach, which is mandatory here due to the tree structure. (Currently, as said above, only leaf content parts are detected in "worded" patterns.)
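
To illustrate the difference between the two weightings, here is a small Python sketch. It is one possible reading of the rule above, not the project's scoring code. Declarations are assumed to be `(depth, pattern)` tuples, one per hit, and both function names are hypothetical.

```python
from collections import Counter


def pick_by_hits(declarations):
    """Hit count only: the most frequent (depth, pattern) wins."""
    return Counter(declarations).most_common(1)[0][0]


def pick_top_down(declarations):
    """Prefer the uppermost level that occurs more than once.

    Falls back to the hit-count choice when nothing repeats.
    """
    counts = Counter(declarations)
    repeated = [key for key, hits in counts.items() if hits > 1]
    if not repeated:
        return pick_by_hits(declarations)
    # Smallest depth first; a higher hit count breaks ties.
    return min(repeated, key=lambda key: (key[0], -counts[key]))
```

For two hits at depth 1 and five hits at depth 2, `pick_by_hits` selects the depth 2 declaration (the leaves), while `pick_top_down` selects the depth 1 declaration, which is what a top-down split needs as its first step.
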

…orlddevelopment#1 These are:
* isGeneric
* isSpecific
* isMarkupPhraseFilter
* ...

Also add functions that are required since commit c879d6c:
* canResultBeIndex
* canMatchIndex

Also add a so far unused function, which is less specific because many kinds of patterns can contain a number (some for hierarchy, some for index|pos, some as real content numbers):
* canResultContainNumber

Also add a TODO to rename isWordedPattern to
* isContentPatternIndexless or
* isContentPhraseFilterIndexless

faerietree added this to the Allow use as library in other tools. milestone Mar 12, 2017

faerietree modified the milestones: Indexless generic content part declaration detection (powerful), Allow use as library in other tools. Mar 12, 2017

faerietree changed the title ~~Add support for indexless generic content part declaration patterns.~~ Support indexless|hierarchical generic content part declaration patterns. Mar 12, 2017

faerietree added a commit to faerietree/exercise_database that referenced this issue Mar 13, 2017

Core: Muster: Add support for complement|pairs. Add PatternTypes. Ref w…

5683d6d

…orlddevelopment#1 May be extended to tuples with more than two entries.

faerietree added a commit to faerietree/exercise_database that referenced this issue Mar 16, 2017

Core: Style: Clean up. Style indent,comments,lang. JavaDoc.

0d8246f

Also change some variable names from e[_]? to p\1 as the rename from exercise to (content) part is due with worlddevelopment#1.
