
Separate downloading data from exporting to formats #6

Closed
lgommans opened this issue Feb 10, 2016 · 16 comments

@lgommans

I think the program should always dump to one data structure and then export it into an output format like pisg. JSONL(.gz) or SQLite seem like good candidates for the 'internal' dump format because they are popular and portable (easy to work with using external tools).
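
For anyone unfamiliar: JSONL is simply one JSON object per line, which makes it both append-friendly for interrupted backups and trivial to consume from external tools. A rough Ruby sketch, with made-up field names:

```ruby
require 'json'

# Made-up example events; the real field names would come from telegram-cli
events = [
  { 'event' => 'message', 'from' => 'alice', 'text' => 'hi',  'date' => 1455000000 },
  { 'event' => 'message', 'from' => 'bob',   'text' => 'hey', 'date' => 1455000060 }
]

# Writing: one JSON object per line, so an interrupted run loses nothing
File.open('dialog.jsonl', 'w') do |f|
  events.each { |e| f.puts(e.to_json) }
end

# Reading back: any exporter can parse it line by line
restored = File.readlines('dialog.jsonl').map { |line| JSON.parse(line) }
```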

Advantages of this change:

  • Changing the output format (dumper) no longer causes a complete re-download, which is very inefficient and completely unnecessary.
  • Similarly, upgrading the chosen data structure (e.g. new pisg version) would not require a complete re-download.
  • If a backup is interrupted, no data would be lost. 'Complex' data formats that do not support append/prepend-only writing (such as HTML) currently have to wait until the whole backup is complete before they can write files to disk.
  • Old downloads would not need to be re-parsed upon updates (partial downloads). For example, to update an HTML export currently, you need to parse the old HTML files, add the new messages, and finally dump everything into file(s) again.
  • I think having the option of exporting to different formats, as the program does now, adds a lot of value to the software. It could be argued that external programs can format the data into any imaginable complex format (Unix philosophy: write programs that do one thing and do it well), but I think the exporting feature makes the software more useful and belongs in the project.

Reasons against this change:

  • Duplicate data: all events would be stored both in the internal database and in zero or more exported formats. It only applies to text and not media, though, so it compresses nicely.
  • The cost of change (rewriting, testing, documenting).

Some background: despite being new to Ruby, I've been working on an HTML dumper to make browsable day-by-day dumps of chats, but ran into problems (basically everything mentioned above). I would also have liked to dump my chats into pisg format for the statistics program, just for fun, but this is not possible without downloading everything again. That's quite a waste of resources, so I decided not to. Overall it seems like a good idea to change the way the dumping works, because it makes the program more modular.

In my case, it would make it easier to write a formatter/dumper/exporter: I could just look in the internal data structure to see what fields there are, and would only need to figure out how to convert that to my desired output using Ruby code. Currently I have to look at which events exist, when each is called, and how to handle them properly (start_backup, dialog_start, dump_msg, get_filename_for_date, etc.); I have to worry about downloads being interrupted by errors or by the user; a backup might be an update/partial... Not all of it is hard to understand or do, and some Ruby experience certainly helps, but altogether it's a relatively big task even for an otherwise quite experienced programmer.

I was/am very tempted to just write a simple Python script to dump the JSONL into HTML, which would take no time at all. The reason I didn't is that I'd like to contribute back, both to the community in general and to this project in particular (because I like it and think it's very useful). I could have uploaded the Python script to my own profile, but without a link from here to that parser, nobody would benefit unless they were specifically looking for an HTML export and randomly stumbled upon my repository. It's also good for me to practice something new (Ruby), but having to understand the whole program first goes a bit far, even if it is quite a simple program. Data conversion is a lot simpler than writing a dumper in its current form (plus there are the other advantages mentioned).

I'm also thinking of using this program as a back-end for letting users back up their chat history through a website, putting all chats in a readable format (e.g. HTML) in a zip, but that goal is way beyond the scope of this application (and this issue), of course.

The change to this code base doesn't even have to take that long if someone experienced with Ruby does it: the current JSONL files can be kept, and other formats can just use the existing files as a database. If there are no existing files, call the JSON dumper first. Other formats would need to be more or less rewritten, but bare and plaintext are simple enough, and even pisg doesn't look like it would take too much time. I'd contribute an HTML dumper in that case (I'm hereby committing myself ;) ).

In other news, looking at the size of this post by now, I think I put way too much thought into this.

@tvdstaaij
Owner

With the predecessor of this project, telegram-json-backup, JSONL was the only possible output format and users were supposed to just process that further, adhering closely to the Unix philosophy as you already mentioned.

When I was writing this I decided to add the dumper system for two reasons:

  • For some formats (I had PISG in mind specifically as a personal use case) it's more convenient to process directly at the source than in a separate stage.
  • Formats that have a dumper written for them would be more easily accessible from a user perspective (the "makes it more useful" point you mentioned).

As a result, users got three choices: process the JSONL as before, use another stock dumper, or write a custom dumper. This turned out to be less than ideal when I started working on incremental backup support; that feature definitely made the dumpers more complex and messier than they ought to be. I believe this is the biggest issue you experienced with the current system (followed by the limited flexibility in changing formats), and I think improving on these points would be a good idea.

But there are also practical circumstances to consider. It could be a major rewrite depending on how it's implemented, and I don't have much time to work on GitHub projects at the moment. However, the suggestion you conclude with sounds doable. How about this?

  • Incremental backup is enabled by default.
  • The JSONL dumper is no longer a dumper but is integrated into the core program; it always runs and outputs to a directory called json. Or perhaps a structure like this: output/json, output/media, output/<format>.
  • The dumper option is replaced with a formatter option, which is by default an empty string, meaning you get just the JSONL files and optionally media.
  • If the formatter option is not an empty string, the program dynamically loads and calls the specified formatter module, much in the same way as a v1.0.0 dumper. However, the message data would be loaded from the already saved JSONL. Perhaps as one big buffered array instead of a function call per message? That would be much more convenient, especially if you need the messages in chronological order, although it would take a bit of memory of course.
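
To sketch what I mean by that last point (all names here are illustrative, not actual project API): the core loads the already-saved JSONL into one buffered array and hands it to the configured formatter in one call.

```ruby
require 'json'

# Illustrative formatter; the method name format_dialog is made up here
class CountingFormatter
  attr_reader :count
  def format_dialog(messages)
    @count = messages.length
  end
end

# Core side: load the already-saved JSONL as one buffered array
# (chronological, since lines were appended in order), then hand it over
def run_formatter(formatter, jsonl_path)
  messages = File.readlines(jsonl_path).map { |line| JSON.parse(line) }
  formatter.format_dialog(messages)
end

# Prepare a tiny example dialog file and run the formatter on it
File.open('example_dialog.jsonl', 'w') do |f|
  f.puts({ 'event' => 'message', 'text' => 'hi' }.to_json)
  f.puts({ 'event' => 'message', 'text' => 'bye' }.to_json)
end

fmt = CountingFormatter.new
run_formatter(fmt, 'example_dialog.jsonl')
```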

I believe this would solve all of the issues you mentioned, let me know what you think. I have reservations about gzipping the JSONL files, but I'd like to keep that topic for another time.

In other news, I'm not all that experienced with Ruby; I put in effort not to make a mess of this project, but I wouldn't necessarily consider it an example of impeccable Ruby coding. ;)

@lgommans
Author

@tvdstaaij That sounds good! Just one remark really (the rest is details):

it dynamically loads and calls the specified formatter module much in the same way as a v1.0.0 dumper.

For PISG I understand it's easier to have dumper-style method calls; for HTML I think it'd be easier to get a hash that just contains each dialog (dialogs[dialogName] = [event1, event2, ...]). Formatters could have a property to determine which of the two types should be used:

if $formatter.i_wanna_be_a_dumper
  # do dumper stuff, i.e. keep the existing way of processing
else
  # for the 'new' way, this is roughly all that needs to be done I think
  dialogs = {}
  Dir.glob('output/json/*.jsonl').each do |path|
    dialogs[File.basename(path, '.jsonl')] =
      File.readlines(path).map { |line| JSON.parse(line) }
  end
  $formatter.process_dialogs(dialogs)
end

I guess I could also write this part, if you can make the changes you already mentioned. This approach might take some RAM depending on the user, but in this case I'm for "make it work first, optimize later". At this point it seems unlikely that someone really has 3 GB of text messages and gets the idea to back it all up to HTML on a memory-limited machine.

always runs and outputs to a directory called json. Or perhaps a structure like this: output/json, output/media, output/<format>.

I'd give it a name that makes clear it's output, so not a top-level json directory, though it is a bit different from the other formatted output (namely, it is used to produce the other formats). Perhaps output/json, output/media and output/formatted/<format>? Anyway, that's just a detail.

I wouldn't necessarily consider it to be an example of impeccable Ruby coding. ;)

It looks fine to me though! The structure is clear enough that if you want to understand the whole program, you can in a reasonable amount of time :)

@tvdstaaij
Owner

@lgommans I run this on a memory-limited machine :) This kind of script is perfectly suited to being run automatically by a cron job on a light server, so I wouldn't want to make it a memory hog. That's why I suggested the compromise of one function call per dialog, as in dump_dialog(dialog, messages), with dialog being the dialog metadata and messages being the full array of message/event objects, which could be rather large.

I suppose you have a reason for preferring all dialog data at once, such as generating an index page. You could either work around that (for example, in the case of an index page, by storing some metadata in a hash/array and generating the index page afterwards) or, if really necessary, collect all message data in a giant hash first and do the actual processing in end_backup (which would have the same result as your proposal, except that other formatters could still operate with less memory usage).
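
To sketch the workaround (method and field names are just for illustration): the formatter keeps only lightweight per-dialog metadata during the run and builds the index page at the end, so memory stays proportional to the number of dialogs rather than the number of messages.

```ruby
# Illustrative low-memory formatter pattern; not the project's actual API
class SketchHtmlFormatter
  def initialize
    @index = []  # one small entry per dialog, never per message
  end

  # Called once per dialog with the full message array for that dialog only
  def dump_dialog(dialog, messages)
    @index << { name: dialog[:name], count: messages.length }
    # ... write this dialog's own HTML pages here, then let `messages` go ...
  end

  # Called after all dialogs: render the index from the collected metadata
  def end_backup
    @index.map { |d| "<a href='#{d[:name]}-0.html'>#{d[:name]} (#{d[:count]})</a>" }.join("\n")
  end
end

fmt = SketchHtmlFormatter.new
fmt.dump_dialog({ name: 'alice' }, [{ text: 'hi' }, { text: 'yo' }])
fmt.dump_dialog({ name: 'bob' },   [{ text: 'hello' }])
index_html = fmt.end_backup
```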

About your code snippet: I don't think reading back the json directory directly would be the best approach. It assumes full consistency of the directory contents (there could be old files still lingering there after the backup scope has been changed in the config), and you'd miss out on the dialog metadata such as the canonical name. I intend to solve this by tracking which files have been written, along with their associated metadata, so it can be accurately passed to the formatter.

Perhaps output/json, output/media and output/formatted/?

Sounds good to me.

@tvdstaaij tvdstaaij self-assigned this Feb 13, 2016
tvdstaaij added a commit that referenced this issue Feb 14, 2016
@tvdstaaij
Owner

@lgommans I implemented all of the changes in the list I gave earlier, with some additions (most notably: file tracking for incremental backups now uses relative paths, and I decided to support having multiple formatters enabled at the same time). I ported the simplest dumper, bare, to a formatter; plaintext and pisg are to be ported at a later time.

To give it a shot, check out the tip of the dev branch and take a look at the formatters directory. I think it should be clear from the comments and the implementation of bare how to create a formatter; if not, then by all means let me know. And of course I'm interested in whether this meets your expectations.

I also added a way to supply formatter-specific options in the configuration file, perhaps this could come in handy for your HTML formatter.

@lgommans
Author

(Today is Valentine's Day, and tomorrow a new semester starts at school. I hope to have a moment to take a look tomorrow or Tuesday evening!)

@tvdstaaij tvdstaaij mentioned this issue Feb 22, 2016
@lgommans
Author

@tvdstaaij It took a while, but here's some progress: lgommans@50d9045

I still want to improve some things (the index page, displaying media, perhaps more comments in the code) and test all cases (service messages, settings, HTML validation, etc.), but this is how far I am right now. Let me know if you have any comments so far!

Oh, and to finally respond to the changes you made: they look great. I really like how it works now. Don't have much else to say about it, really :)

@tvdstaaij
Owner

Good to hear that you're satisfied with my implementation of the new system and getting along with it. I'm a bit short on time right now but I'll take a look at your alpha/beta soon. I'll also get working on porting the remaining v1.0.0 dumpers (the most complex ones) to formatters. I think it would be good to aim for a v2.0.0 release soonish :)

@tvdstaaij
Owner

Good work so far :) Caught a few issues:

  • The a.nextpage paths are absolute (which also doesn't work, since they don't start with file://). They should be relative, as they are on the index page.
  • I happen to have two contacts with the exact same name. Telegram-cli solves this by suffixing #1 to the print name of one of the contacts, so I have Some_Contact and Some_Contact#1. This results in a file named Some_Contact#1-0.html, which translates to <a href='Some_Contact#1-0.html'>Some_Contact#1</a> on the index page. You may already have realized by now that this link won't work :) Perhaps you need to do some URL escaping (there are library functions for that too, of course), since some characters allowed in filenames have special meaning in URLs.
  • The print_name of a user or group (particularly after running it through get_safe_name) is suitable for naming files and such, but despite the name it's not so user-friendly to print in the actual output. There's another utility function called get_full_name (example usage) which gives the first name, or first + last name, of a user depending on the value of display_last_name in the configuration. For a group chat you can just use the title field.
  • Newlines should be formatted as hard line breaks (<br>).
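
For the escaping and line-break points, something along these lines should work (just a sketch, using only Ruby's standard library):

```ruby
require 'erb'
require 'cgi'

# '#' is legal in a filename but starts a fragment identifier in a URL,
# so the href needs percent-encoding even though the filename itself is fine
filename = 'Some_Contact#1-0.html'
href = ERB::Util.url_encode(filename)  # => "Some_Contact%231-0.html"

# Message bodies: escape HTML metacharacters first, then turn newlines
# into hard line breaks
text = "line one\nline <two>"
html = CGI.escape_html(text).gsub("\n", '<br>')
```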

@lgommans
Author

Good points! I fixed all the issues you mentioned and added some more features: URL parsing, contact & location sharing, and displaying service messages.

I am having issues with the dumper now and then. I can't pinpoint steps to reproduce (yet?), but it often throws an error about some .old file not existing, in prepend_dump or thereabouts. I'll create a separate issue with stack traces once I've tested a bit more, but you might already come across it if you use the simple dumping mode (formatters on/off are irrelevant, I think) with progress saving enabled, and then modify which users and/or groups to back up and run it again. Anyway, this is for another day; I'm just saying 2.0 might need a bit more testing.

For the html formatter, the remaining todo items are better pagination and more testing. I hope I can find some time again this weekend, but if not, I will probably be occupied until next weekend.

@tvdstaaij
Owner

Nice! I like the service messages and such, makes the dumps much more comprehensive. Encountered two more issues:

  • Here it is assumed that every event object has a valid date field. While this should in theory be true, in practice it unfortunately is not. As a result, this currently crashes for me with an exception, because it attempts to parse an expression evaluating to nil into a date object.
  • Maybe you're already aware of this, but I'm experiencing sketchy pagination. With paginate: 500 and backlog_limit: 1300, some chats paginate correctly into three pages, but for others the -1.html and -2.html files are not created, even though those chats do have over 1300 messages and the -0.html contains a next-page link.
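
For the date issue, a defensive guard along these lines would avoid the crash (just a sketch; how you'd want to render dateless events is up to you):

```ruby
# Guard against events that lack a date: return nil instead of crashing,
# so the formatter can skip the timestamp or substitute a placeholder
def event_time(event)
  event['date'] ? Time.at(event['date']) : nil
end
```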

I wasn't aware of the issue with incremental backups you described. I'll see if I can reproduce based on this information. Just to make sure: you didn't delete jsonl files while keeping the progress file?

@lgommans
Author

I was not aware of any pagination issue other than the prev|next links not being very pretty. Those are improved now, but nothing changed about creating the files themselves. I tried changing the backlog limit to 300 and paginate to 10 (and removing the output directory), and it worked just fine, creating a bunch of pages for each of the chats.

I like the service messages and such, makes the dumps much more comprehensive.

Thanks :). For more context: I also added an option to write timestamps every N messages. Telegram does it on every date change, but I personally find that way too infrequent, especially when a chat has been very active.

assumed that every event object has a valid date field.

Good catch, fixed.

Just to make sure: you didn't delete jsonl files while keeping the progress file?

Nope. I tried removing them after the issue was already there, as well as renaming them to .old files, but nothing seemed to help short of deleting everything. Not sure what causes it; the code looks fine at first glance. But like I said, I'll open another issue about this once I know more.

I'm going to do some more testing, both the html formatter and the dumper thing, and then I think we might be ready for 2.0!

@tvdstaaij
Owner

I managed to reproduce the dumper crash issue eventually; fixed with d94aa0e. It was a good catch, a pretty serious bug. I solved a different dumping-related crash as well, and ported the remaining dumpers to formatters.

One more HTML formatter issue: this line and this line produce different results (the first page without, and the follow-up pages with, URL escapes in the filename), leading to weirdness and broken links when a chat name contains emoji, for example. By the way, I don't think you have to use URI escapes for the filenames themselves, as long as you escape the link URIs.

Haven't seen the pagination issue again; it may have been a combination of many invalid messages and the off-by-one issue I saw in your commits, or something like that. Let's assume it doesn't exist unless either of us runs into it.

@lgommans
Author

Fixed the issue you mentioned and did some more testing. I think the HTML formatter is ready for a first release! Unless you have any more remarks, I will create a pull request.

I managed to reproduce the dumper crash

Nice! Now I'm curious, what caused it?

@tvdstaaij
Owner

Sure, if you're satisfied with the formatter you can issue a PR. I'm planning one more improvement to incremental backups (saving state more often), and I have to update the readme; that's about it for 2.0.0, I think.

The dumper crash was a typical stupid oversight: I forgot to explicitly destroy the old prepender instance before starting on the next dialog. When the scope of the backup was broadened, a dialog without stored state would attempt to reuse the dumper instance of the previous dialog, whose .old file would no longer exist, raising the exception. So the filename you saw in the exception wasn't even from the dialog that triggered the issue, haha.

@tvdstaaij
Owner

I committed the changes I mentioned and merged dev into master. I'd like to test it a bit on my home server (I have to fix the setup first; I had some trouble with the server), and then I'll issue a v2.0.0 tag. I suppose this closes the original issue?

@lgommans
Author

lgommans commented Mar 4, 2016

Yes, it does. Thanks :)
