Retaining empty lines #2

Closed
veonne opened this issue Aug 3, 2020 · 13 comments

@veonne

veonne commented Aug 3, 2020

Hi @rpgd60,

I came here from https://discourse.joplinapp.org/t/simple-note-json-to-joplin-enex-python-script/8267
I'm using the tool to generate *.enex files that I then import into Joplin, so that the tags are carried over from SimpleNote to Joplin.
However, I notice that:

  • All empty lines are removed. This breaks some of the markdown formatting, and the longer notes are now harder to read because they look like long walls of text without breaks. I don't have this issue when importing *.enex files from Evernote into Joplin, or when importing SimpleNote's raw text files as markdown files. Maybe the line breaks should be formatted the way Evernote's *.enex files do it (html's BR), or written as \r\n text like in the original SimpleNote export files?
  • Not all notes are created with titles when using the --create-title option, and the notes without titles are imported with additional tab/space characters at the beginning of the note. Could this possibly be because the tool creates blanks in the XML between and the actual start of the note?

Apologies if this is not the correct place to report this.
I really appreciate that you're creating this tool. Thank you very much!

Thanks.

@rpgd60
Owner

rpgd60 commented Aug 3, 2020

Hi Veonne
Thanks a lot for your note.

Regarding the removal of empty lines, I realize what you mention and will look into it. At the moment I just grab the "content" from the json (full of "\r\n" as line separators) and slap it into a basic XML template. For some reason multiple instances of the sequence "\r\n" seem to be collapsed onto one or just eaten up.

Given the nature of my notes (with lots of markdown constructs that act as separators when seen in Joplin) it was not a problem for me and it went undetected. Looking closer, I just found a few "walls of text" (love that expression :-)) among my Joplin notes.

The "create-title" implementation is at the moment quite crude. A title is assumed to be anything between the beginning of the "content" in json and the first "\r\n". It worked for me but I am sure there are failure cases.

To try to fix these problems, I would really appreciate your help providing some examples of the bad behavior:

  1. an example of empty line removal where the markdown gets broken.
  2. examples of "notes without titles"
    For these, please include the original text and also the json generated by the simple note export.
  3. An Evernote-generated ENEX for the examples in numbers 1 & 2 above. This will be very useful as an example of "good" XML I should target in the app.

Thanks again for the feedback!
Rafa

@veonne
Author

veonne commented Aug 5, 2020

Hi @rpgd60,

Thank you for the response.

Sorry I can't provide you the actual notes as samples because the notes are private. I've also deleted the SimpleNote account so only the exported backup files are left.
So instead, I've tried to replicate similarly with some dummy notes. I hope it will suffice.

The markdown breakage caused by the missing empty lines shows up where the text switches from one kind of formatting to another.

Example A: if I put a table directly after a list, it will cause the table to be recognised as part of the last item in the list due to how markdown works, and thus the table will not be displayed as a table.
example_1a_table_w_missing_line_break

Example B: a supposed separate paragraph gets included as part of the last list item.
example_1b_paragraph_w_missing_line_break

Notes without titles could be caused by the notes having whitespace at the start of the note string. In SimpleNote, the "title" is derived from the start of the actual text regardless of whether there are empty lines at the beginning of the note. So I may not have realised that many of my notes are like this.

The empty line is converted into tab/spaces when imported into Joplin after conversion using this tool.
example_2_note_wo_title

Could a new option to trim the whitespace from both ends of the note resolve this?

The original text and also the json generated by the simple note export: simplenote_sample.zip
The simplenote json generated using the tool: simplenote_sample.enex.zip
An Evernote-generated ENEX: evernote_sample.enex.zip

Hope you have the info you need. Please let me know if you need more.
Again, thank you very much for looking into this.

Unrelated to the issues mentioned here, but I would like to let you know since it may also interest you: as I'm also using Standard Notes as a secondary app, out of curiosity I tried uploading the simplenote_sample.enex file to the Evernote ENEX file to Standard Notes conversion tool. None of the notes could be converted. Should I create a new issue to track this?

@rpgd60
Owner

rpgd60 commented Aug 6, 2020

Hi Veonne
Many thanks for the info. Having all the pieces and most importantly the Evernote-generated enex will help.
I was under the impression that everything inside the CDATA section should be displayed "as is" by the application processing the XML. (see for example w3schools and tutorialspoint ). I wonder if I interpreted it too naively. The first link mentions that the main purpose of the CDATA section is to include XML as-is.
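
A quick way to check the CDATA question is to parse a small document with Python's standard XML parser. The sketch below uses a made-up `<content>` element (not the tool's real template) and suggests that blank lines inside CDATA do survive XML parsing. If so, the collapsing would happen later, when the importing application treats the text as HTML, but that last step is an assumption and not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical element name; the real ENEX template is more elaborate.
xml_doc = "<content><![CDATA[first line\r\n\r\nsecond line]]></content>"

# The parser hands back the CDATA text, including the empty line.
text = ET.fromstring(xml_doc).text
lines = text.splitlines()
print(lines)
```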

At any rate, I suppose if the tool is called "2enex" we should be practical and align with whatever Evernote generates :-)

I will work on these issues (improving title estimation and fixing empty space -- probably aligning with Evernote enex) over the coming weekend.

By the same token, the utility should also work with Standard Notes. One option-- if you agree --is to test the next iteration of simplenote2enex against Standard Notes and open an issue if it doesn't work.

Cheers
Rafa
p.s. Loren markdownum :-)

@rpgd60
Owner

rpgd60 commented Aug 7, 2020

Hi... started doing some fixes. Went for the low-hanging fruit (title handling). It now trims \r\n and whitespace at the beginning and the end of the 'content'.
Next - some sanity limiting of title length, in case the note consists of a huge string without \r\n
Cheers
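
The title handling described above (trim both ends, take everything up to the first \r\n, cap the length) could be sketched roughly as follows. This is not the code from the commit: the function name `split_title`, the `max_len` default of 80, and the choice to return the rest of the note as the body are all assumptions made for illustration.

```python
from typing import Tuple


def split_title(content: str, max_len: int = 80) -> Tuple[str, str]:
    """Return (title, body) for a Simplenote 'content' string (hypothetical helper)."""
    # Trim whitespace and line breaks at both ends, so a leading empty
    # line cannot become an empty title.
    text = content.strip()
    # The title is everything before the first line separator.
    title, _sep, body = text.partition("\r\n")
    # Sanity limit, in case the note is one huge string without \r\n.
    if len(title) > max_len:
        title = title[:max_len].rstrip()
    return title, body


title, body = split_title("\r\n  My title\r\nline one\r\n\r\nline two\r\n")
```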

rpgd60 pushed a commit that referenced this issue Aug 8, 2020
@rpgd60
Owner

rpgd60 commented Aug 8, 2020

Hi Veonne
I also implemented a partial fix for the empty lines: convert every instance of \r\n\r\n into \r\n<br/>
Based on trial and error and info on markdown newline such as https://gist.github.com/shaunlebron/746476e6e7a4d698b373
It does not fully address the table display. For tables to display properly they should be preceded by two empty lines \r\n\r\n

I see all this as a partial fix. Please let me know if this improves the behavior you observed
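
As a rough illustration of that partial fix (the function name `keep_empty_lines` is made up here and the actual commit may differ), the replacement can be written as:

```python
import re

LINE_SEP = "\r\n"
HTML_BREAK = "<br/>"


def keep_empty_lines(content: str) -> str:
    # Each pair of consecutive separators (i.e. one empty line) becomes
    # a single separator followed by an explicit HTML break.
    return re.sub(r"\r\n\r\n", LINE_SEP + HTML_BREAK, content)


converted = keep_empty_lines("last list item\r\n\r\nnew paragraph")
```

Matches do not overlap, so a run of three separators yields one `<br/>` followed by a remaining \r\n.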

I would still like to explore the CDATA angle I mentioned earlier. If I have some time this month I will try to dig deeper, perhaps also reviewing the ENEX import code of Joplin.
Thank you for helping improve the utility
Rafa

@rpgd60 rpgd60 self-assigned this Aug 9, 2020
@rpgd60 rpgd60 added the bug Something isn't working label Aug 9, 2020
@veonne
Author

veonne commented Aug 11, 2020

Hi @rpgd60,

I found that the notes are still imported with an additional tab at the start of the first paragraph even after the whitespace is stripped from both ends of the content string, so I tried to work around that by adding an html div tag at line 216 of the latest simplenote2enex.py file. This helps to avoid notes being imported to Joplin with a "tab" added to the first line.

    <en-note><div>{enex_content}</div></en-note>

For comparison, I also tried changing line 163 so that it replaces every line break with an html br tag.

        temp_string = re.sub(self.line_sep, html_break, json_content)

This unfortunately also breaks table formatting. I'm still testing this, but so far it feels like a give-and-take situation where some notes are better with the original code and some are better with this. It's difficult to explain, but sometimes this adds the line break where it should be (it separates the first line of the table from the preceding list), and sometimes it adds more line breaks than needed (this is what breaks the table formatting, because it adds a line break in the middle of the table too). Hope that makes sense.

My thought is that ideally 2 line breaks should be treated as a paragraph, and that should be used in conjunction with single line break that gets treated as html br tag? Since Evernote enex file contains html formatting, this may play into markdown to HTML conversion, but since the main issue is just mainly around the line breaks, hopefully it doesn't need to go into that full blown.
This is what Evernote produces, but maybe a simpler html p tag could suffice.

<span style="-en-paragraph:true;"></span>
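
The "two line breaks make a paragraph, one line break makes a br" idea could look roughly like the sketch below. The function name `to_simple_html` is hypothetical, and the output uses plain p tags rather than Evernote's span markup. Note that this moves the content toward HTML, so any markdown syntax in the note would then depend on how the importing app converts HTML back to markdown.

```python
import re
from xml.sax.saxutils import escape


def to_simple_html(content: str) -> str:
    # Normalise Simplenote's \r\n separators and trim both ends.
    text = content.replace("\r\n", "\n").strip()
    parts = []
    # Two or more consecutive line breaks separate paragraphs.
    for paragraph in re.split(r"\n{2,}", text):
        # A single line break inside a paragraph becomes <br/>.
        lines = [escape(line) for line in paragraph.split("\n")]
        parts.append("<p>" + "<br/>".join(lines) + "</p>")
    return "".join(parts)


html = to_simple_html("a\r\nb\r\n\r\nc & d")
```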

I also found a different issue with the title.
Say I have a note with first line that says "Markdownum - Lorem & Ipsum", only the "& Ipsum" will be set as the title.
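
The cause of the truncated title is not diagnosed in this thread. One plausible explanation, stated here purely as an assumption, is that the `&` reaches the XML unescaped. If that is the case, escaping the title before inserting it into the template would avoid the problem. A minimal sketch, with a hypothetical helper `title_element`:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def title_element(title: str) -> str:
    # escape() turns &, < and > into XML entities.
    return "<title>" + escape(title) + "</title>"


raw_title = "Markdownum - Lorem & Ipsum"
element = title_element(raw_title)
# Round trip through a parser to confirm the title comes back intact.
parsed_title = ET.fromstring(element).text
```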

Disclaimer: I haven't processed all my notes yet since I have quite a number of them. So these are just observations based on the subset that I've processed and tried to import.

Many thanks!

@PhiLhoSoft
Copy link

PhiLhoSoft commented Aug 12, 2020

Note that some of the solutions given by veonne are also in my patched version given in #1, as I have met the same problems: the need to add <br> to avoid collapsing empty lines, and to remove formatting / indenting in the XML generation.
With these tricks, I had no issue with table formatting.
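
To illustrate why indentation in the generated XML can matter, compare the two templates below. The indented one is a hypothetical reconstruction, not the tool's actual template. The compact one follows the `<en-note><div>` form suggested earlier in this thread. In the indented version the line break and spaces after `<en-note>` become part of the note text, which may be where the leading "tab" comes from.

```python
# Hypothetical pretty-printed template: the whitespace after <en-note>
# is part of the element's text content.
INDENTED_TEMPLATE = """<en-note>
    {enex_content}
</en-note>"""

# Compact form: nothing sits between the tags and the note content.
COMPACT_TEMPLATE = "<en-note><div>{enex_content}</div></en-note>"

indented = INDENTED_TEMPLATE.format(enex_content="Hi")
compact = COMPACT_TEMPLATE.format(enex_content="Hi")
```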

Note that I opened an issue in Joplin about the problem of empty lines, but they closed it by telling that Markdown in Enex files is not an official format, so they don't care about that.

@rpgd60
Owner

rpgd60 commented Aug 12, 2020

Hi Veonne, Philippe

Thanks again for all your feedback. I will try to address your concerns as I find the time, usually starting with the easy ones :-)

I'd like to take a step back and think of a "strategy" to handle these and other issues that relate to how Joplin imports the enex (or perhaps "pseudo-enex") notes generated by simplenote2enex and other uses of this utility. Below some loose ideas. I would really appreciate your and any other reader's feedback.

  1. It can be argued that some of these issues are really a Joplin import problem.
    • They claim they import markdown, they should really do so without requiring lots of XML or HTML embedded in the markdown.
    • On the other hand, even if I manage to "prove" anything (e.g. by reviewing Joplin code) there is no guarantee that Joplin (a much larger and complex application) will change its behavior.
  2. The purpose of simplenote2enex is to address the requirements of Simple Note users, and not to get into a standards argument that I can't win.
  3. The fact remains that Joplin seems to import well enex files generated by Evernote.
  4. Speaking for myself, I definitely want to stick to generating plain markdown and avoid the complexity of the enex files generated by Evernote, while addressing as much as possible the functionality needs of the users. This may imply some compromises (e.g. inject <br/> ) and occasionally limited functionality.
  5. There are other Note apps-- aside from Joplin-- that import ENEX. Veonne has mentioned StandardNotes. Thus we cannot make the default behavior of the simplenote2enex depend on Joplin (otherwise it would be simplenote2enex4joplin :-)).
  6. I am thinking of implementing "target app" specific fixes/command-line switches whenever the issue is specific to a target app (Joplin, StandardNotes, etc...)
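
A "target app" switch along these lines could be sketched with argparse. Everything named here is hypothetical: the `--target` flag, the target names, and the fix flags in `TARGET_FIXES` are illustrations and not options the tool actually has.

```python
import argparse

# Hypothetical per-target fixes; names and values are illustrative only.
TARGET_FIXES = {
    "generic": {"wrap_in_div": False, "br_for_empty_lines": False},
    "joplin": {"wrap_in_div": True, "br_for_empty_lines": True},
    "standardnotes": {"wrap_in_div": False, "br_for_empty_lines": False},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert a Simplenote JSON export to ENEX-like XML."
    )
    parser.add_argument(
        "--target",
        choices=sorted(TARGET_FIXES),
        default="generic",
        help="apply fixes specific to the importing application",
    )
    return parser


def fixes_for(argv):
    # Look up the set of fixes for the target chosen on the command line.
    args = build_parser().parse_args(argv)
    return TARGET_FIXES[args.target]


joplin_fixes = fixes_for(["--target", "joplin"])
```

Keeping the default at a neutral "generic" target means the base behavior does not depend on any one importing app.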

Feedback welcome !

@rpgd60
Owner

rpgd60 commented Aug 12, 2020

I found that the notes are still imported with an additional tab at the start of the first paragraph even after the whitespace is stripped from both ends of the content string, so I tried to work around that by adding an html div tag at line 216 of the latest simplenote2enex.py file. This helps to avoid notes being imported to Joplin with a "tab" added to the first line.

Hi Veonne
I implemented your suggestion (<en-note><div>{enex_content}</div></en-note>)
Many thanks

@veonne
Author

veonne commented Aug 14, 2020

Hi @rpgd60,

Please find my feedback as below.

First, I'm taking a look at @PhiLhoSoft's statement:

Note that I opened an issue in Joplin about the problem of empty lines, but they closed it by telling that Markdown in Enex files is not an official format, so they don't care about that.

Then I can see that Joplin has the following import options available:

  • JEX - Joplin Export File
  • MD - Markdown (File)
  • MD - Markdown (Folder)
  • RAW - Joplin Export Directory
  • ENEX - Evernote Import File (as Markdown)
  • ENEX - Evernote Import File (as HTML)

In a blog post by Evernote, they give an example ENEX file, in which the content is filled with HTML tags (DIV, BR). This supports Joplin's assertion that markdown shouldn't be in an ENEX file anyway, so they don't deem it a Joplin issue.

Then I asked myself: but Joplin does have the "ENEX - Evernote Import File (as Markdown)" option. I found that the main page of Joplin answers this.

Evernote text is stored as HTML and this is converted to Markdown during the import process.

So that is what they meant by this option. Then I tested the other option "ENEX - Evernote Import File (as HTML)", and I found that it means Joplin will import the content inside ENEX files as HTML, it won't be converted into Markdown.

This may also be the reason why StandardNotes fail to convert the notes in the ENEX generated by the tool, because it may also expect HTML.

This leads to the last question I asked myself: can Evernote, the originator of the ENEX file format, import the notes generated by the tool? So I tried it on Evernote v7.14 on MacOS Catalina 10.15.6, and unfortunately Evernote crashed each time I tried to import the file, while it succeeded when importing the ENEX file it generated itself.

I can only assume from these, in relevance to the first point of your post, that when Joplin claims to be able to import markdown format, they refer to importing MD files. Unfortunately, the standard format for ENEX files, as pointed out in the above statement, doesn't have markdown format. It instead uses HTML inside the content.

However, as you said, as the creator of the tool, you can decide the limits and requirements of it.
It would definitely add tons of complexity to build a Markdown to HTML converter for the content. Should you decide to go that way, I would suggest incorporating existing libraries instead of building it yourself.
Implementing "target app" switches is a good idea as a compromise, but even if you go that route, there is the possibility that in the end you'll be forced to convert the markdown into HTML anyway because that is the standard from Evernote.

For me personally, at the moment I've resigned myself to importing most of my notes as plain text in StandardNotes and Joplin. This is because most of my notes are quite long, so trying to fix the formatting for each note may take longer.
However, I'm still interested to use this tool, as due to some weird network limitation in some places, from time to time, I'm forced to write my notes in SimpleNote then migrate it elsewhere. So thank you again for creating this tool. :)

n.b.
I'm not a Python developer, nor am I an expert in this area. These are only based on my observations so it could be wrong. :)

@rpgd60
Owner

rpgd60 commented Aug 17, 2020

Hi Veonne

Thank you very much for your thorough analysis.

Indeed what the tool produces is not really ENEX, and I was just lucky it could import my Simplenote notes into Joplin, and probably too optimistic claiming conversion to ENEX.

I will document asap the actual behavior and limitations in the README.md and will probably rename the repository to something like "simplenote2joplin" or "simplenote2pseudo-ENEX" :-) Suggestions welcome.

I confirm I do not plan to implement the Evernote ENEX with all the HTML-XML. Having said that, and since I still use simplenote (and you seem to use it too), I will be happy to implement small fixes that could help convert slightly more complex notes. Offhand I can think of:

  • explore a heuristic to detect markdown tables and insert appropriate codes / chars to ease the conversion (probably CLI driven)
  • Looking again at the sample ENEX made me think of exploring <en-note style= (...) /> as an option to massage the formatting
  • Reproduce Philippe's issue (I'd really appreciate sample notes)
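
The table heuristic from the first bullet could start from something like the sketch below. It is only one possible approach: the function name `separate_tables` and the regular expression are assumptions, and it only recognises tables with at least two columns and a delimiter row such as `| --- | --- |`. When a table header directly follows a non-empty line, it inserts one empty line in front of it.

```python
import re

# A table delimiter row such as "| --- | :---: |" or "--- | ---".
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?\s*$")


def separate_tables(content: str, sep: str = "\r\n") -> str:
    lines = content.split(sep)
    out = []
    for i, line in enumerate(lines):
        # A header row contains a pipe and is followed by a delimiter row.
        next_is_delimiter = (
            i + 1 < len(lines) and DELIMITER_ROW.match(lines[i + 1]) is not None
        )
        is_header = "|" in line and next_is_delimiter
        # Insert an empty line only if the previous line is not already empty.
        if is_header and out and out[-1].strip():
            out.append("")
        out.append(line)
    return sep.join(out)


note = "- item one\r\n- item two\r\n| a | b |\r\n| --- | --- |\r\n| 1 | 2 |"
fixed = separate_tables(note)
```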

Overall I would aim for some kind of 80/20 approach, trying to increase the proportion of managed notes with as little work as possible.

So please feel free to send requests and suggestions.

Again, thank you for your help
Cheers
Rafa

@PhiLhoSoft

I agree with all of the above… 😁

You made a useful tool with simple code, it happens to work (at least in the current version of Joplin) with only some caveats, some of them already fixed, so it is not worth going for a full compatibility (IMHO), unless somebody is fool enough to try and do that… 😄 Or unless Joplin makes a change breaking this import, of course.

@rpgd60
Owner

rpgd60 commented Aug 23, 2020

I agree with all of the above…

Hi Philippe
Thank you very much for your feedback.
I have just renamed the repository and the python file, and updated the README.md file.

@rpgd60 rpgd60 closed this as completed Jan 15, 2021