Separate downloading data from exporting to formats #6
With the predecessor of this project, telegram-json-backup, JSONL was the only possible output format and users were supposed to just process that further, adhering closely to the Unix philosophy as you already mentioned. When I was writing this I decided to add the dumper system for two reasons:
As a result, users got three choices: process the JSONL as before, use another stock dumper, or write a custom dumper. This turned out to be less than ideal when I started working on incremental backup support; that feature definitely made the dumpers more complex and messier than they ought to be. I believe this is the biggest issue you experienced with the current system (followed by little flexibility in changing formats), and I think improving on these points would be a good idea. But there are also practical circumstances to consider. It could be a major rewrite depending on how it's implemented, and I don't have much time to work on GitHub projects at the moment. However, the suggestion you conclude with sounds doable. How about this?
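As a rough illustration of the split being discussed (download once to JSONL, then hand the parsed data to pluggable formatters), a minimal sketch in Ruby; all class and method names here are hypothetical, not the project's actual API:

```ruby
# Hypothetical interface: formatters receive already-parsed dialog data
# instead of hooking into the download loop. Names are illustrative only.
class FormatterBase
  def format_dialog(dialog, messages)
    raise NotImplementedError
  end
end

# Example formatter: plain text, one line per message.
class PlainTextFormatter < FormatterBase
  def format_dialog(_dialog, messages)
    messages.map { |msg| "#{msg['from']}: #{msg['text']}" }.join("\n")
  end
end
```

A new output format would then only need to subclass and implement one method, rather than handling download events and interruption states.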
I believe this would solve all of the issues you mentioned; let me know what you think. I have reservations about gzipping the JSONL files, but I'd like to keep that topic for another time. In other news, I'm not all that experienced with Ruby; I put in effort to not make a mess of this project, but I wouldn't necessarily consider it an example of impeccable Ruby coding. ;)
@tvdstaaij That sounds good! Just one remark really (the rest is details):
For PISG I understand it's easier to have dumper-style method calls; for HTML I think it'd be easier to get a hash that just contains each dialog.
I guess I could also write this part, if you can do the changes you already mentioned. This way it might take some RAM depending on the user, but in this case I'm for "make it work first, optimize later". At this point it seems unlikely that someone really has 3 GB of text messages and gets the idea to back it all up to HTML on a memory-limited machine.
I'd give it a name that makes clear it's output, so not in
It looks fine to me though! The structure is clear enough that if you want to understand the whole program, you can in a reasonable amount of time :)
@lgommans I run this on a memory-limited machine :) This kind of script is perfectly suited to be run automatically by a cronjob on a light server, so I wouldn't want to make it a memory hog. That's why I suggested the compromise of one function call per dialog. I suppose you have a reason for preferring all dialog data at once, such as generating an index page. You could either work around that (for example, in the case of an index page, storing some metadata in a hash/array and generating the index page afterwards) or, if really necessary, collect all message data in a giant hash first and do the actual processing there. About your code snippet: I don't think reading back the json directory directly would be the best approach. It assumes full consistency of the directory contents (there could also be old files still lingering there after the backup scope has been changed in the config), and you'd miss out on the dialog metadata such as the canonical name. I intend to solve this by tracking which files have been written, along with their associated metadata, so it can be accurately passed to the formatter.
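The compromise described here (one exporter call per dialog, with index metadata collected along the way) could be sketched like this; `HtmlExporter`, `export_dialog`, and `write_index` are invented names for illustration, not the project's code:

```ruby
class HtmlExporter
  def initialize
    @index_entries = []
  end

  # One call per dialog keeps memory bounded: only small metadata is retained
  # for the index page; the messages themselves can be written out and dropped.
  def export_dialog(dialog_meta, messages)
    @index_entries << { name: dialog_meta[:name], count: messages.size }
    # ... render the messages to this dialog's HTML file here ...
  end

  # Built afterwards from the collected metadata, as suggested above.
  def write_index
    @index_entries.map { |e| "#{e[:name]} (#{e[:count]} messages)" }
  end
end
```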
Sounds good to me.
@lgommans I implemented all of the changes in the list I gave earlier, with some additions (most notably: file tracking for incremental backups now uses relative paths, and I decided to support having multiple formatters enabled at the same time). I ported the most simple dumper as a starting point. To give it a shot, check out the tip of the branch. I also added a way to supply formatter-specific options in the configuration file; perhaps this could come in handy for your HTML formatter.
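For illustration only, formatter-specific options might be passed through the config file along these lines; the YAML layout in the string below is an assumption for the sketch, not the project's actual schema:

```ruby
require 'yaml'

# Hypothetical config fragment: each formatter gets its own options subtree,
# which the program hands to the matching formatter instance.
config = YAML.load(<<~YAML)
  formatters:
    html:
      paginate: 500
    pisg:
      timezone: UTC
YAML

html_opts = config['formatters']['html']  # options for the html formatter only
```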
(Today is Valentine's Day, and tomorrow a new semester starts at school. I hope to have a moment to take a look tomorrow or Tuesday evening!)
@tvdstaaij It took a while, but here's some progress: lgommans@50d9045 I still want to improve some things (the index page, displaying media, perhaps more comments in the code) and test all cases (service messages, settings, HTML validation, etc.), but this is how far I am right now. Let me know if you have any comments so far! Oh and to finally respond to the changes you made, they look great. I really like how it works now. Don't have much else to say about it really :)
Good to hear that you're satisfied with my implementation of the new system and getting along with it. I'm a bit short on time right now but I'll take a look at your alpha/beta soon. I'll also get working on porting the remaining v1.0.0 dumpers (the most complex ones) to formatters. I think it would be good to aim for a v2.0.0 release soonish :)
Good work so far :) Caught a few issues:
Good points! I fixed all issues you mentioned and added some more features: URL parsing, contact & location sharing, and displaying service messages. I'm having issues with the dumper now and then. I can't pinpoint steps to reproduce (yet?), but it often throws an error about some .old file not existing in prepend_dump or something. I'll create a separate issue with stack traces once I've tested a bit more, but you might already come across it if you use the simple dumping mode (formatters on/off are irrelevant, I think), with progress saving enabled of course, and then modify which users and/or groups to back up and run it again. Anyway, this is for another day; I'm just saying 2.0 might need a bit more testing. For the HTML formatter, the remaining todo items are better pagination and more testing. I hope I can find some time again this weekend, but if not, I will probably be occupied until next weekend.
Nice! I like the service messages and such, makes the dumps much more comprehensive. Encountered two more issues:
I wasn't aware of the issue with incremental backups you described. I'll see if I can reproduce based on this information. Just to make sure: you didn't delete jsonl files while keeping the progress file? |
I was not aware of any pagination issue, other than the prev|next links not being very pretty. That's improved now, but nothing changed about creating the files themselves. I tried changing the backlog limit to 300 and paginate to 10 (and removing the output directory), and it worked just fine, creating a bunch of pages for each of the chats.
Thanks :). For more context I also added an option to write timestamps every N messages. Telegram does it for every date change, but I personally find that way too infrequent, especially if a chat has been very active.
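The "timestamp every N messages" idea amounts to a simple modulo check when rendering; a minimal sketch with a hypothetical helper, not the formatter's real code:

```ruby
# Emit a timestamp line every N messages, rather than only on date changes
# as Telegram does. `messages` is assumed to be an array of hashes with
# :time and :text keys (an illustrative shape, not the project's).
def render_with_timestamps(messages, every_n)
  out = []
  messages.each_with_index do |msg, i|
    out << "[#{msg[:time]}]" if (i % every_n).zero?
    out << msg[:text]
  end
  out
end
```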
Good catch, fixed.
Nope. I tried removing them when the issue was already there, as well as renaming them to .old files, but nothing seems to work other than deleting the progress file. Not sure what causes it; the code looks fine at first glance. But like I said, I'll open another issue about this once I know more. I'm going to do some more testing, both of the HTML formatter and of the dumper issue, and then I think we might be ready for 2.0!
I managed to reproduce the dumper crash issue eventually, fixed with d94aa0e. Was a good catch, pretty serious bug. Solved a different dumping-related crash as well, and ported the remaining dumpers to formatters. One more HTML formatter issue: this line and this line produce different results (the first page without and the follow-up pages with URL escapes in the filename), leading to weirdness and broken links when a chat name contains emoji, for example. I think you don't have to use URI escapes for the filenames, by the way, as long as you escape the link URIs. Haven't seen the pagination issue again; it may have been a combination of many invalid messages and the off-by-one issue I saw in your commits, or something like that. Let's assume it doesn't exist unless either of us runs into it.
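The suggested fix, escaping only the link URIs while leaving the on-disk filenames alone, might look roughly like this (hypothetical helpers, not the formatter's code):

```ruby
require 'uri'

# Derive the on-disk filename once, without URI escaping, so the first page
# and follow-up pages agree; apply escaping only when emitting the href.
def page_filename(chat_name, page)
  "#{chat_name}_p#{page}.html"
end

def page_href(chat_name, page)
  URI::DEFAULT_PARSER.escape(page_filename(chat_name, page))
end
```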
Fixed the issue you mentioned and did some more testing. I think the HTML formatter is ready for a first release! Unless you have any more remarks, I will create a pull request.
Nice! Now I'm curious, what caused it? |
Sure, if you're satisfied with the formatter you can issue a PR. I'm planning one more improvement to incremental backups (saving state more often) and I have to update the readme, that's about it for 2.0.0 I think. Dumper crash was a typical stupid oversight, forgot to explicitly destroy the old prepender instance before starting on the next dialog. When the scope of the backup was broadened, a dialog without stored state would attempt to reuse the dumper instance of the dialog before it. That .old file would no longer exist, raising the exception. So the filename you saw in the exception wasn't even the dialog that triggered the issue, haha. |
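The bug pattern described (state from the previous dialog leaking into the next) reduces to a small sketch; `Dumper` and `prepender` here are simplified stand-ins, not the actual implementation:

```ruby
class Dumper
  attr_reader :prepender

  def start_dialog(dialog, has_stored_state)
    # The fix: explicitly drop the previous dialog's prepender instead of
    # letting a dialog without stored state silently reuse it (and then
    # crash on that dialog's nonexistent .old file).
    @prepender = nil unless has_stored_state
    @prepender ||= "prepender for #{dialog}"
  end
end
```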
I committed the changes I mentioned and merged dev to master. I'd like to test it a bit on my home server (have to fix the setup first, had some trouble with the server) and then I'll issue a v2.0.0 tag. I suppose this closes the original issue? |
Yes, it does. Thanks :) |
I think the program should always dump to one data structure and then export it into an output format like pisg. JSONL(.gz) or SQLite seem like good candidates for the 'internal' dump format because they are popular and portable (easy to work with when using external tools).
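One reason JSONL works well as an internal format is how little code it takes to consume; a generic line-by-line reader (illustrative, not project code) might look like:

```ruby
require 'json'

# Yield one parsed message per non-empty line of a JSON Lines stream.
# Works with any object responding to each_line (File, StringIO, String).
def each_jsonl_message(io)
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end
```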
Advantages of this change:
Reasons against this change:
Some background: despite being new to Ruby, I've been working on an html dumper to make browseable day-by-day dumps of chats, but ran into problems (basically everything mentioned above). I would also have liked to dump my chats into pisg format for the statistics program, just for fun, but this is not possible without downloading everything again. That's quite a waste of resources so I decided not to do that. Overall it seems like a good idea to change the way the dumping works because it makes it more modular.
In my case, it would make it easier to write a formatter/dumper/exporter: I could just look in the internal data structure to see what fields there are, and would only need to figure out how to convert that to my desired output using Ruby code. Currently I have to look at which events exist and when they are called, handle them properly (start_backup, dialog_start, dump_msg, get_filename_for_date, etc.), worry about downloads being interrupted due to errors or user interruption, and account for a backup possibly being an update/partial one... Not all of it is hard to understand or do, and some Ruby experience certainly helps as well, but altogether it's a relatively big task even for an otherwise quite experienced programmer.
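Based only on the event names listed above, the surface a dumper currently has to implement might look something like this; the signatures and behavior are guesses for illustration, not the real API:

```ruby
class MyDumper
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_backup(_options); end   # global setup
  def dialog_start(_dialog); end    # per-dialog setup

  def dump_msg(_dialog, _msg)       # called once per downloaded message
    @count += 1
  end

  def get_filename_for_date(dialog, date)  # signature is a guess
    "#{dialog}_#{date}.jsonl"
  end
end
```

Implementing and correctly sequencing all of these is what makes a new output format a bigger task than a plain data conversion would be.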
I was/am very tempted to just write a simple Python script to dump the jsonl into html, which would take no time at all. The reason I didn't is that I'd like to contribute back, both to the community in general and to this project in particular (because I like it and think it's very useful). I could have uploaded the Python script to my own profile, but without a link from here to that parser, nobody would benefit unless they were specifically looking for an html export and randomly stumbled upon my repository. Also, it's good for me to practice something new (Ruby), but having to understand the whole program first goes a bit far, even if it is quite a simple program. Data conversion is a lot simpler than writing a dumper in its current form (plus the other advantages mentioned).
I'm also thinking of using this program as a back-end for allowing users to back up their chat history through a website, putting all chats in readable format (e.g. html) in a zip, but that goal is way beyond the scope of this application (and this issue) of course.
The change to this code base doesn't even have to take that long if someone experienced with Ruby does it: the current jsonl files can be kept, and other formats can just use the existing files as a database. If there are no existing files, call the json dumper first. Other formats need to be more or less rewritten, but bare and plaintext are simple enough, and even pisg doesn't look like it would take too much time. I'd contribute an html dumper in that case (I'm hereby committing myself ;) ).
In other news, looking at the size of this post by now, I think I put way too much thought into this.