
Refactor into micro-libraries #131

Closed · niebert opened this issue Jul 29, 2018 · 9 comments

niebert (Contributor) commented Jul 29, 2018

Hi Spencer,
you explained to me how the integration of promises led to the broken build mechanism on Mac OS X. While trying to find a solution for that, I thought it might be an option to split wtf_wikipedia into the following 3 repositories:

  • wtf_fetch, which fetches the wiki source from Wikipedia, Wikiversity, ... (any MediaWiki domain) with the parameters language (e.g. en, de, ...) and domain (e.g. wikipedia, wikiversity, wikivoyage, ...)
  • wtf_parse, which parses the wiki source into a Document object (an abstract syntax tree)
  • wtf_output, which generates/renders output in a specific format from a given Document object.

wtf_wikipedia would integrate all 3 submodules. At least wtf_parse and wtf_output should still support the build process on Mac OS X. Furthermore, this improves maintenance and reusability of the submodules, and it separates the individual tasks (in wtf_fetch, wtf_parse, wtf_output) from the chaining of those tasks here in wtf_wikipedia. Citation management could become a submodule wtf_citation that is chained in the same way. Your modular structure in src/ can be preserved; the change would mainly replace local requires within src/ by requires of the submodules from npm. A rough sketch of the chaining follows below.
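
A minimal sketch of what the chaining could look like, assuming the split described above. The package names match the proposal, but the function names (getPage, parse, render) are hypothetical placeholders, not published APIs:

```js
// Hypothetical chaining of the three proposed submodules.
// getPage(), parse() and render() are assumed names for illustration only.
const wtf_fetch = require('wtf_fetch')
const wtf_parse = require('wtf_parse')
const wtf_output = require('wtf_output')

async function wtf(title, lang, domain) {
  // 1. fetch the raw wiki source, e.g. ('Berlin', 'de', 'wikipedia')
  const wikitext = await wtf_fetch.getPage(title, lang, domain)
  // 2. parse the wiki source into a Document object (the AST)
  const doc = wtf_parse.parse(wikitext)
  // 3. render the Document into a target format
  return wtf_output.render(doc, 'markdown')
}
```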

This could be documented in the README.md as developer recommendation and helps developers to understand the way forward and how they could add new wtf_modules in the chaining process. In this sense wtf_wikipedia will become the chain managment module of wtf_submodules.

Hope that makes sense to you and will attract more developers to support your work. Thank you for all the contributions to the OpenSource community for handling MediaWiki content.

niebert (Contributor) commented Jul 30, 2018

I made a first attempt with wtf_fetch as an npm package. Hope that was OK for you: https://www.npmjs.com/package/wtf_fetch

spencermountain (Owner) commented

oh hey, sorry for the delay.
Yeah, I get what you're saying, as an effort to make the library cleaner.
I watched carefully as d3 did the micro-library thing a few years ago, and it seemed to work for them.
I don't know if this sort of thing is needed right now - fetch is only one file and only adds a few kb; same for the output code.

and of course - go nuts. make all the stuff you want! ;)

spencermountain changed the title from "Recommendation of Submodules: wtf_fetch, wtf_parse, wtf_output - chaining in wtf_wikipedia" to "Refactor into micro-libraries" on Jul 31, 2018
niebert (Contributor) commented Aug 13, 2018

I am working in parallel on Wiki2Reveal. The current implementation of wtf_wikipedia changes the order of content elements, e.g. from

  • TextBlock, BulletList, TextBlock in the wiki source to
  • BulletList, TextBlock, TextBlock in the HTML/Markdown output.

This is a real problem, especially for the logic of the content elements. Therefore I temporarily replaced wtf_parse by a parser that is worse than wtf_wikipedia's. But to use Wiki2Reveal as a proof of concept in lectures, I need the correct order of content elements. That's why I started a software design proposal for adding a ContentList that preserves this order during parsing (a possible shape is sketched below). As far as I understand your code, the content order gets lost at the section level. I integrated the desired paragraph AST tree nodes into the proposal and discussed the dependencies, as far as I understand your code and your vision for wtf_wikipedia.
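
To make the proposal concrete, here is a possible shape of such a ContentList: a parallel structure that refers back into the generated JSON by index. All names here are part of the design proposal, not existing wtf_wikipedia code:

```js
// Hypothetical ContentList for one section: an ordered array of typed
// entries whose `ref` indexes into the section's existing JSON arrays.
const contentList = [
  { type: 'TextBlock',  ref: 0 }, // -> section.sentences[0]
  { type: 'BulletList', ref: 0 }, // -> section.lists[0]
  { type: 'TextBlock',  ref: 1 }  // -> section.sentences[1]
]
```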

A documentation-first principle should allow other developers to join the software design process prior to implementation and to share the workload in a team. Different developers identify different pitfalls and may see other workarounds, and some decisions can be very costly in development time, so let us try to reduce that time as well as we can. You are so quick in fixing things; my appreciation for that. As you mentioned in the README.md yourself, wtf_wikipedia can only be developed as a team, so the route is to make it easy for other developers to understand the code.

Next step: ToDo assignment in the GitHub Wiki, where small work packages are listed as ToDos and someone can sign up for the development of a milestone. Developers can propose work packages, and the maintainer spencermountain sets a work package to "GO" or to "FORK", the latter meaning that it will not be part of the software design of wtf_wikipedia and you (Spencer) as the maintainer recommend forking wtf_wikipedia for that purpose. See the GitHub Wiki page "Work Packages - ToDo".

spencermountain (Owner) commented

hey Engelbert, yes this is correct. Order is lost at the Section level.
This is done to handle the recursive combinations of templates, links, and tables. The parsing sequence has been established to minimize errors: we handle the 'most-dangerous' parsing first, as the meaning of certain characters, like newlines, changes depending on this context.
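
To illustrate the point (this is an editorial sketch, not the actual wtf_wikipedia source): once a section has been parsed into per-type collections, the original interleaving cannot be recovered:

```js
// Sketch of a parsed section holding typed buckets. A list that sat
// between two paragraphs in the wikitext now lives in `lists`, with no
// record of where it appeared relative to `sentences`.
const section = {
  title: 'History',
  templates: [ /* parsed first: 'most dangerous', newlines change meaning */ ],
  sentences: [ /* prose, parsed after templates are removed */ ],
  lists:     [ /* bullet lists */ ],
  tables:    [ /* tables */ ]
}
```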

The initial goal of this library was getting data out of wikipedia, and into a database. I'm not sure AST representations are in the scope.

Preserving chronology and treating paragraphs as first-class objects are in the long-term plan. Both will be tremendous tasks; there is a lot of QA, testing, and hardening of the library to do before then. Template-parsing, in particular, is really rough, and the infobox/table parsers have a lot of repeated code. That's the current focus for the time being.

niebert (Contributor) commented Aug 16, 2018

The benefit of wtf_wikipedia is that it cleans up the most dangerous things, which is perfect for handling the output. Is it still OK for you if I elaborate the AST idea further in the Wiki? I will not alter code in pull requests; I will just discuss possible solutions and make references to the current structure of your code. The primary objective of this library should remain getting data out of Wikipedia and creating the JSON as the result; this is necessary for robust output generation.
The AST idea is just a parallel data structure that refers to content in the generated JSON. I fully agree that QA, testing, and hardening of the library must be done first. The Wiki content is also far away from being a recommendation for a current implementation milestone. If you allow, I would follow the AST ideas in the Wiki further, prior to any implementation. But if you recommend FORK for the AST idea, I will go that route, no worries. Either way, please keep following your initial goals for wtf_wikipedia; any steps you make improve a clean AST generation. Excuse me for causing unnecessary workload for you.
Best wishes

niebert (Contributor) commented Aug 16, 2018

You can remove the 'feature request' label, because for me this is just a workaround for the modular replacement of wtf_parse in Wiki2Reveal.

spencermountain (Owner) commented

yeah, I think the best way to go forward with the AST is for you to create a library wikipedia-ast that depends on wtf, and creates this AST structure based on the wtf_wikipedia object functions. I think 90% of the information you need to do this is there already, and you can cheat the other 10% by assuming each section has something like the order:

```
===section===
{any templates}
{any images}
{any sentences}
{any tables}
{any lists}
```

that's what I'd do.

There will be cases where this is wrong, but I should get around to doing this proper-order stuff somehow, under an api structure similar to the current setup.

how does that sound?
i'd be happy to help with this
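
A minimal sketch of what such a wikipedia-ast wrapper might look like. The accessor names (sections(), title(), templates(), images(), sentences(), tables(), lists(), text()) follow wtf_wikipedia's documented API as I understand it, and the fixed per-section ordering is exactly the assumption described above; treat this as a starting point, not a finished design:

```js
const wtf = require('wtf_wikipedia')

// Build an ordered AST per section by assuming the fixed block order
// described above: templates, images, sentences, tables, lists.
async function toAst(title) {
  const doc = await wtf.fetch(title)
  return doc.sections().map((s) => ({
    type: 'section',
    title: s.title(),
    children: [
      ...s.templates().map((t) => ({ type: 'template', data: t })),
      ...s.images().map((i) => ({ type: 'image', data: i })),
      ...s.sentences().map((x) => ({ type: 'sentence', text: x.text() })),
      ...s.tables().map((t) => ({ type: 'table', data: t })),
      ...s.lists().map((l) => ({ type: 'list', data: l }))
    ]
  }))
}
```

A renderer like Wiki2Reveal could then walk `children` in order, and real ordering information could be swapped in later without changing the node shape.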

niebert (Contributor) commented Aug 17, 2018

Good idea, thank you for that. I thought about letting Parsoid do wtf_parse's job, converting to HTML and then starting the processing into other formats from that end, but I came back to wtf_wikipedia because its priority is extracting data/content into JSON. Parsoid needs to preserve all formatting and strange elements of the layout; with wtf_wikipedia we can get rid of the strange elements in the wiki markup.

CONCLUSIONS:

  • document the software design for a library wtf_wikipedia_ast in the Wiki of wtf_wikipedia and discuss how the library will interface with wtf_wikipedia
  • ask Spencer to cross-check the software design proposal to see if it fits the future development
  • do not change the API of wtf_wikipedia, because wtf_wikipedia_ast will depend on the API, and the API is fine as currently defined.

Thank you Spencer - let's follow that route.

niebert (Contributor) commented Aug 17, 2018

Added the conclusions to the Wiki.

niebert closed this as completed Aug 17, 2018