
Refactor into micro-libraries #131

Closed · niebert opened this issue Jul 29, 2018 · 9 comments

niebert (Contributor) commented Jul 29, 2018

Hi Spencer,
you explained to me how the integration of promises led to the broken build mechanism on Mac OS X. While trying to find a solution for that, I thought it might be an option to split wtf_wikipedia into the following 3 repositories:

  • wtf_fetch, which fetches the wiki source from Wikipedia, Wikiversity, ... (any MediaWiki domain) with the parameters language (e.g. en, de, ...) and domain (e.g. wikipedia, wikiversity, wikivoyage, ...)
  • wtf_parse, which parses the wiki source into a Document object (an abstract syntax tree)
  • wtf_output, which generates/renders output in a specific format from a given Document object.

wtf_wikipedia would integrate all 3 submodules. At least wtf_parse and wtf_output should still support the build process on Mac OS X. Furthermore, this improves maintenance and reusability of the submodules, and it separates the individual tasks (in wtf_fetch, wtf_parse, wtf_output) from the chaining of those tasks here in wtf_wikipedia. Citation management could become a submodule wtf_citation that is chained in the same way. Your modular structure in src/ can be preserved; the change would mainly replace local requires within src/ by requires of the submodules from npm. A rough sketch of the chaining follows below.
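
A minimal sketch of what the chaining could look like, assuming the split described above. The package names match the proposal, but the function names (getPage, parse, render) are hypothetical placeholders, not published APIs:

```js
// Hypothetical chaining of the three proposed submodules.
// getPage(), parse() and render() are assumed names for illustration only.
const wtf_fetch = require('wtf_fetch')
const wtf_parse = require('wtf_parse')
const wtf_output = require('wtf_output')

async function wtf(title, lang, domain) {
  // 1. fetch the raw wiki source, e.g. ('Berlin', 'de', 'wikipedia')
  const wikitext = await wtf_fetch.getPage(title, lang, domain)
  // 2. parse the wiki source into a Document object (the AST)
  const doc = wtf_parse.parse(wikitext)
  // 3. render the Document into a target format
  return wtf_output.render(doc, 'markdown')
}
```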

This could be documented in the README.md as developer recommendation and helps developers to understand the way forward and how they could add new wtf_modules in the chaining process. In this sense wtf_wikipedia will become the chain managment module of wtf_submodules.

Hope that makes sense to you and will attract more developers to support your work. Thank you for all the contributions to the OpenSource community for handling MediaWiki content.

niebert (Contributor) commented Jul 30, 2018

I made a first attempt with wtf_fetch as an npm package. Hope that was OK for you: https://www.npmjs.com/package/wtf_fetch

spencermountain (Owner) commented

oh hey, sorry for the delay.
Yeah, I get what you're saying, as an effort to make the library cleaner.
I watched carefully as d3 did the micro-library thing a few years ago, and it seemed to work for them.
I don't know if this sort of thing is needed right now - fetch is only one file and only adds a few kb; same for the output code.

and of course - go nuts. make all the stuff you want! ;)

spencermountain changed the title from "Recommendation of Submodules: wtf_fetch, wtf_parse, wtf_output - chaining in wtf_wikipedia" to "Refactor into micro-libraries" on Jul 31, 2018
niebert (Contributor) commented Aug 13, 2018

I am working in parallel on Wiki2Reveal. The current implementation of wtf_wikipedia changes the order of content elements, e.g. from

  • TextBlock, BulletList, TextBlock in the wiki source to
  • BulletList, TextBlock, TextBlock in the HTML/Markdown output.

This is a real problem, especially for the logic of the content elements. Therefore I temporarily replaced wtf_parse by a parser that is worse than wtf_wikipedia's. But to use Wiki2Reveal as a proof of concept in lectures, I need the correct order of content elements. That's why I started a software design proposal for adding a ContentList that preserves this order during parsing (a possible shape is sketched below). As far as I understand your code, the content order gets lost at the section level. I integrated the desired paragraph AST tree nodes into the proposal and discussed the dependencies, as far as I understand your code and your vision for wtf_wikipedia.
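
To make the proposal concrete, here is a possible shape of such a ContentList: a parallel structure that refers back into the generated JSON by index. All names here are part of the design proposal, not existing wtf_wikipedia code:

```js
// Hypothetical ContentList for one section: an ordered array of typed
// entries whose `ref` indexes into the section's existing JSON arrays.
const contentList = [
  { type: 'TextBlock',  ref: 0 }, // -> section.sentences[0]
  { type: 'BulletList', ref: 0 }, // -> section.lists[0]
  { type: 'TextBlock',  ref: 1 }  // -> section.sentences[1]
]
```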

A documentation-first principle should allow other developers to join the software design process prior to implementation and to share the workload in a team. Different developers identify different pitfalls and may see other workarounds, and some decisions can be very costly in development time, so let us try to reduce that time as well as we can. You are so quick in fixing things; my appreciation for that. As you mentioned in the README.md yourself, wtf_wikipedia can only be developed as a team, so the route is to make it easy for other developers to understand the code.

Next step: ToDo assignment in the GitHub Wiki, where small work packages are listed as ToDos and someone can sign up for the development of a milestone. Developers can propose work packages, and the maintainer spencermountain sets a work package to "GO" or to "FORK", the latter meaning that it will not be part of the software design of wtf_wikipedia and you (Spencer) as the maintainer recommend forking wtf_wikipedia for that purpose. See the GitHub Wiki page "Work Packages - ToDo".

spencermountain (Owner) commented

hey Engelbert, yes this is correct. Order is lost at the Section level.
This is done to handle the recursive combinations of templates, links, and tables. The parsing sequence has been established to minimize errors: we handle the 'most-dangerous' parsing first, as the meaning of certain characters, like newlines, changes depending on this context.
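
To illustrate the point (this is an editorial sketch, not the actual wtf_wikipedia source): once a section has been parsed into per-type collections, the original interleaving cannot be recovered:

```js
// Sketch of a parsed section holding typed buckets. A list that sat
// between two paragraphs in the wikitext now lives in `lists`, with no
// record of where it appeared relative to `sentences`.
const section = {
  title: 'History',
  templates: [ /* parsed first: 'most dangerous', newlines change meaning */ ],
  sentences: [ /* prose, parsed after templates are removed */ ],
  lists:     [ /* bullet lists */ ],
  tables:    [ /* tables */ ]
}
```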

The initial goal of this library was getting data out of wikipedia, and into a database. I'm not sure AST representations are in the scope.

Preserving chronology and treating paragraphs as first-class objects are in the long-term plan. Both will be tremendous tasks; there is a lot of QA, testing, and hardening of the library to do before then. Template-parsing, in particular, is really rough, and the infobox/table parsers have a lot of repeated code. That's the current focus for the time being.

niebert (Contributor) commented Aug 16, 2018

The benefit of wtf_wikipedia is that it cleans up the most dangerous things, which is perfect for handling the output. Is it still OK for you if I elaborate the AST idea further in the Wiki? I will not alter code in pull requests; I will just discuss possible solutions and make references to the current structure of your code. The primary objective of this library should remain getting data out of Wikipedia and creating the JSON as the result; this is necessary for robust output generation.
The AST idea is just a parallel data structure that refers to content in the generated JSON. I fully agree that QA, testing, and hardening of the library must be done first. The Wiki content is also far away from being a recommendation for a current implementation milestone. If you allow, I would follow the AST ideas in the Wiki further, prior to any implementation. But if you recommend FORK for the AST idea, I will go that route, no worries. Either way, please keep following your initial goals for wtf_wikipedia; any steps you make improve a clean AST generation. Excuse me for causing unnecessary workload for you.
Best wishes

niebert (Contributor) commented Aug 16, 2018

You can remove the 'feature request' label, because for me this is just a workaround for the modular replacement of wtf_parse in Wiki2Reveal.

spencermountain (Owner) commented

yeah, I think the best way to go forward with the AST is for you to create a library wikipedia-ast that depends on wtf, and creates this AST structure based on the wtf_wikipedia object functions. I think 90% of the information you need to do this is there already, and you can cheat the other 10% by assuming each section has something like the order:

```
===section===
{any templates}
{any images}
{any sentences}
{any tables}
{any lists}
```

that's what I'd do.

There will be cases where this is wrong, but I should get around to doing this proper-order stuff somehow, under an api structure similar to the current setup.

how does that sound?
i'd be happy to help with this
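
A minimal sketch of what such a wikipedia-ast wrapper might look like. The accessor names (sections(), title(), templates(), images(), sentences(), tables(), lists(), text()) follow wtf_wikipedia's documented API as I understand it, and the fixed per-section ordering is exactly the assumption described above; treat this as a starting point, not a finished design:

```js
const wtf = require('wtf_wikipedia')

// Build an ordered AST per section by assuming the fixed block order
// described above: templates, images, sentences, tables, lists.
async function toAst(title) {
  const doc = await wtf.fetch(title)
  return doc.sections().map((s) => ({
    type: 'section',
    title: s.title(),
    children: [
      ...s.templates().map((t) => ({ type: 'template', data: t })),
      ...s.images().map((i) => ({ type: 'image', data: i })),
      ...s.sentences().map((x) => ({ type: 'sentence', text: x.text() })),
      ...s.tables().map((t) => ({ type: 'table', data: t })),
      ...s.lists().map((l) => ({ type: 'list', data: l }))
    ]
  }))
}
```

A renderer like Wiki2Reveal could then walk `children` in order, and real ordering information could be swapped in later without changing the node shape.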

niebert (Contributor) commented Aug 17, 2018

Good idea, thank you for that. I thought about letting Parsoid do wtf_parse's job, converting to HTML and then starting the processing into other formats from that end, but I came back to wtf_wikipedia because its priority is extracting data/content into JSON. Parsoid needs to preserve all formatting and strange elements of the layout; with wtf_wikipedia we can get rid of the strange elements in the wiki markup.

CONCLUSIONS:

  • document the software design for a library wtf_wikipedia_ast in the Wiki of wtf_wikipedia and discuss how the library will interface with wtf_wikipedia
  • ask Spencer to cross-check the software design proposal to see if it fits the future development
  • do not change the API of wtf_wikipedia, because wtf_wikipedia_ast will depend on the API, and the API is fine as currently defined.

Thank you Spencer - let's follow that route.

niebert (Contributor) commented Aug 17, 2018

Added the conclusions to the Wiki.

niebert closed this as completed Aug 17, 2018