How to keep media files for dictionary (audio, images, SVGs, video) #6

soshial opened this issue Jun 9, 2013 · 14 comments

@soshial
Owner

soshial commented Jun 9, 2013

This is an important question. One way to keep the whole dictionary in 1 file is to imprint all media to the XDXF xml with base64 encoding. Do you think that is reasonable?
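
To put a number on the cost of this option, here is a minimal sketch of the size overhead of base64. The 3000 random bytes are only a stand-in for a real image, and nothing here is XDXF-specific.

```python
import base64
import os

# Stand-in for a small binary media file (hypothetical data, not a real image).
raw = os.urandom(3000)

# base64 turns every 3 input bytes into 4 ASCII characters.
encoded = base64.b64encode(raw)

ratio = len(encoded) / len(raw)
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, ratio: {ratio:.2f}")

# Decoding restores the original bytes exactly.
restored = base64.b64decode(encoded)
```

So embedded media would grow by about a third before any compression, on top of whatever the media already weighs.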

@Tvangeste

I'm not sure that this is a good idea. Many dictionary formats explicitly separate the main content from the (media) resources (stardict, dsl+zip in GoldenDict, MDX/MDD in MDict). One of the main reasons is that resources might be huge, taking gigabytes of space. On mobile devices, where size is critical, users are free not to copy the media resources and still get a working dictionary, albeit without the media. Also, some users might decide not to download the huge media files at all and take just the main content.

Personally, I'm also inclined to separate the text content and the binary data into separate files; that would give better flexibility for everybody. Just imagine editing an XML file which is 4 GB in size! :)

@soshial
Owner Author

soshial commented Jun 9, 2013

This is definitely very reasonable: I was inclined to the same opinion. But what do you think about storing the icon and the cover image of the dictionary in the main file? They cannot be as big as the media you describe, so this wouldn't take up much space.

Also, do you think that optionally storing meta_data in a separate file might be a good idea?

@Tvangeste

Hmm, actually I think that storing both the metadata and the icon inside the main content file is the proper behavior; there is no need to keep many different files, which confuses users. Two files (the main content file and the additional media resources file) seem to be the most appropriate approach.

When all the main data is in the single file, it is easier to parse and transfer dictionaries.

As for external files with metadata, this can be done outside of the specification. For example, in GoldenDict we are considering introducing such format-independent files so that users can adjust the dictionary name, provide custom icons, etc., without modifying the original file.

@soshial
Owner Author

soshial commented Jun 10, 2013

I am very grateful for your feedback and opinion and I agree with you. But do you think it is possible to involve other Goldendict community members to decide these important details so that the best solution is reached?

Speaking further on the topic, I was wondering if packing both dictionary and media files in a simple *.zip archive would be a nice practice. Some dictionary files (uncompressed articles) may take up dozens of megabytes, and that is with no media files involved. So reading the descriptions from a series of dictionaries would imply unpacking gigabytes of data. Do you think we need to store dicts in archives or not?

UPD. This unpacking might be also important, since when Goldendict indexes the dictionary files it remembers the exact file offset of each word-article, and I'm not sure that is possible with compressed files.

@Tvangeste

> do you think it is possible to involve other Goldendict community members to decide these important details so that the best solution is reached?

You could always summon them via @goldendict/developers, if that works from external repository, probably not....

> I was wondering if packing both dictionary and media files in a simple *.zip archive would be a nice practice.

Nope, it won't work. We need offset-based access to the main content, and zip doesn't provide that. For that we use dictzip, which does allow it.

In short, the main content (the dictionary itself) can/should be compressed with dictzip, and the media resources (images, audio, video) can/should be compressed with regular zip (but one needs to be careful about the file name encoding in such a zip file).

> This unpacking might be also important

With dictzip this is not a problem. GoldenDict already handles, e.g., dsl.dz (dictzip compressed DSL files) with no issues.
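
The idea that makes offset-based access work on compressed data can be sketched as follows: compress the text in independent fixed-size chunks and keep a table of where each compressed chunk starts. This is only an illustration of the principle, not the dictzip file format (dictzip keeps the result a valid gzip file and stores its chunk table in the gzip header). The names `compress_chunked`, `read_range` and `CHUNK_SIZE` are made up for this example.

```python
import zlib

CHUNK_SIZE = 4096  # arbitrary for the demo; each chunk is compressed on its own


def compress_chunked(data: bytes, chunk_size: int):
    """Compress each chunk independently.

    Returns (blob, index) where index[i] is the (offset, length) of
    compressed chunk i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        packed = zlib.compress(data[start:start + chunk_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob: bytes, index, chunk_size: int, offset: int, size: int) -> bytes:
    """Return the uncompressed bytes [offset, offset + size),
    decompressing only the chunks that overlap that range."""
    if size <= 0:
        return b""
    first = offset // chunk_size
    last = min((offset + size - 1) // chunk_size, len(index) - 1)
    parts = []
    for i in range(first, last + 1):
        pos, length = index[i]
        parts.append(zlib.decompress(blob[pos:pos + length]))
    skip = offset - first * chunk_size
    return b"".join(parts)[skip:skip + size]


# 2000 fake "articles", 14 bytes each, so article n starts at offset 14 * n.
data = b"".join(b"article %05d\n" % i for i in range(2000))
blob, index = compress_chunked(data, CHUNK_SIZE)

article = read_range(blob, index, CHUNK_SIZE, 14 * 1234, 14)
print(article)
```

An indexer can therefore keep storing plain (uncompressed) offsets per article, as it does for an uncompressed file, and the reader maps each offset to the right chunk.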

@Tvangeste

I'd say that the ability to compress the XDXF dictionary (via dictzip) is a matter external to the XDXF specification. Some tools might decide to handle such compressed dictionaries, while others might prefer other means or only handle the uncompressed data.

@soshial
Owner Author

soshial commented Jun 10, 2013

Wow, thank you so much for telling me about this dictzip software: I was wondering why these files have the *.dz extension.
But is it possible to put multiple files in it (I don't mean media ones, but the meta info for example)?

PS. Please let's also discuss tables/grammar issue #5

@soshial
Owner Author

soshial commented Jun 10, 2013

I was also thinking: since some people would want to use xdxf files without *.gz, we would need to have the file CRC32 checksummed. But we cannot checksum the file if the checksum itself must be written into the meta_info section before we start computing it. Haha =)

Maybe *.gz should be made obligatory?

@Tvangeste

@soshial, I'm not sure that putting the CRC32/MD5 checksums inside the XDXF file is a good idea. You see, every time a user modifies such an XDXF file, he/she must somehow recalculate the checksum, which is annoying and inconvenient and requires external tools.

Personally, I don't think we need any checksumming at all for plain old text files. There are special tools to calculate and check checksums, so there is no need to put them into the dictionary file directly.
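
For illustration, this is roughly what such an external tool does: it computes the checksum over the complete file from the outside, so nothing inside the file has to contain its own checksum. A minimal sketch using only the Python standard library; the helper name `checksums` and the sample bytes are made up for this example.

```python
import hashlib
import zlib


def checksums(data: bytes) -> dict:
    """Compute CRC32 and MD5 over the complete file contents."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
    }


# Placeholder bytes standing in for the contents of a dictionary file.
original = b"<xdxf>placeholder content</xdxf>"
edited = original.replace(b"placeholder", b"corrected")

before = checksums(original)
after = checksums(edited)

# Any edit changes the checksums, so a value stored inside the file
# would have to be recomputed after every manual correction.
print(before)
print(after)
```

The result could live in a sidecar file next to the dictionary, which keeps the dictionary itself freely editable.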

Mandatory compression is also not as flexible as I'd like. Many users do modify their dictionaries from time to time, correcting the typos, adding new entries, etc. Extracting the dictionary, modifying it and re-compressing is just too much extra work for such use cases.

Consider GoldenDict. Even while it is open, the user can open an XDXF file, modify it and then press Ctrl+F5 to rescan the dictionaries, which propagates the changes immediately. No need to compress/decompress anything, no need to calculate checksums: a very fast and convenient way.

In fact users could even provide an editor command line and start editing the files right from context menu, in simple text editors.

@soshial
Owner Author

soshial commented Jun 11, 2013

Very good point, thank you.

@ceefour

ceefour commented Dec 26, 2013

I like treating XDXF artifacts just like Open/LibreOffice documents or Java JAR/WAR files, i.e. conceptually they're "a directory tree containing at least an .xdxf file, with one or more media files".

Whether these are:

  1. expanded (as actual directory tree on the filesystem)
  2. accessed as URIs (so it's possible to load an XDXF remotely and then load any referenced media file on-demand, in this case XDXF acts like HTML with img src's)
  3. compressed using ZIP (which is a good format due to its ubiquity & accessible content listing, less compression than gzip/bzip2/xz but I guess it's OK)

is a "deployment detail"; tools should be able to access any of these uniformly (just like in Java, where the program doesn't care where or how you put a dependency class, as long as it's available on the classpath).
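
A minimal sketch of what "uniform access" could look like for two of the three cases (expanded directory and ZIP archive; remote URIs are left out). The function name `read_resource`, the file names and the placeholder contents are assumptions made for this example, not part of any XDXF tool.

```python
import tempfile
import zipfile
from pathlib import Path


def read_resource(container: Path, name: str) -> bytes:
    """Read `name` from a container that is either a directory or a ZIP file."""
    if container.is_dir():
        return (container / name).read_bytes()
    with zipfile.ZipFile(container) as archive:
        return archive.read(name)


with tempfile.TemporaryDirectory() as tmp:
    # Expanded form: a directory tree with a dictionary and one media file.
    root = Path(tmp) / "sample"
    (root / "audio").mkdir(parents=True)
    (root / "dictionary.xdxf").write_bytes(b"<xdxf/>")  # placeholder, not valid XDXF
    (root / "audio" / "hello.ogg").write_bytes(b"fake audio bytes")

    # Packed form: the same tree inside a ZIP archive.
    packed = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(packed, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(root).as_posix())

    # The caller uses the same call for both deployments.
    from_dir = read_resource(root, "audio/hello.ogg")
    from_zip = read_resource(packed, "audio/hello.ogg")

print(from_dir == from_zip)
```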

@soshial
Owner Author

soshial commented Oct 14, 2017

@Tvangeste isn't it reasonable to put media into dictzip as well, since most images and sounds can fit into one or several 64 KB blocks? This way media files would be randomly accessible too.

soshial changed the title from "Discussing on how to keep media files for dictionary (audio, images, SVGs, video)" to "How to keep media files for dictionary (audio, images, SVGs, video)" on May 22, 2019
soshial added the feature label and removed the question label on May 22, 2019

@nikita-moor
Contributor

nikita-moor commented Jul 24, 2019

> imprint all media to the XDXF xml with base64 encoding

Real dictionary observation:

I have a dictionary made of page scans and keys referencing these pages. One page normally contains 3-5 articles, so one image is referenced by several articles.

Slob format saves images directly into the dictionary file and lets articles reference them as if they were external files: <img src="image.png">. File size is 62.8 MB.

The same dictionary encoded into StarDict format, with images embedded (base64) into the articles as <img src="data:image/…" />, is 230.2 MB.

Embedding images directly makes the file 3.7 times bigger, because the same images are repeated several times. Comparing different formats is perhaps not entirely fair, but in fact they store data alike. So, think about centralized storage of images and inter-XML links (like abbreviations?).

P.S. I personally support the idea of several files (dictionary, media, css, js) compressed into a zip (or similar) archive. It would make replacing or editing images easier, with no need for programming.

P.P.S. Having 2 files, one for dictionary and another for media, seems convenient. But from my experience of supporting MDict (two files: mdx + mdd) I can say users ask me repeatedly "why are there two files" and "which one should I download". So, "one file to rule them all" is better.
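
The two effects behind that growth (repetition and base64) can be separated with a small synthetic example. The page sizes and article counts below are invented for illustration and are not taken from the dictionary above, so the resulting ratio differs from 3.7.

```python
import base64

# Synthetic data: 3 scanned "pages" of 3000 bytes, 4 articles per page.
pages = {f"page{i}.png": bytes([i]) * 3000 for i in range(3)}
articles = {f"key{n}": f"page{n // 4}.png" for n in range(12)}


def embedded_size(pages: dict, articles: dict) -> int:
    """Total image payload when every article carries its own base64 copy."""
    return sum(len(base64.b64encode(pages[page])) for page in articles.values())


def shared_size(pages: dict) -> int:
    """Total image payload when each image is stored once and only referenced."""
    return sum(len(data) for data in pages.values())


embedded = embedded_size(pages, articles)
shared = shared_size(pages)
print(embedded, shared, round(embedded / shared, 2))
```

With 4 articles per page the payload is repeated 4 times and base64 adds another third, which is why centralized storage pays off even for modest images.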

@bmix

bmix commented Jun 20, 2020

The Open Container Format (used in ePub) is a mature (v3.2) W3C standard, which is very similar to the OpenDocument container format, the Java Archive (JAR) format and many others.

It describes a ZIP archive, where the very first file in the archive contains the media-type (for example application/epub+zip) in plain-text (ASCII) and stays uncompressed.

There must be a META-INF directory, which contains meta data, that the file format needs. The rest is specified for the requirements of ePub documents.

Other familiar containers are the various Java archive containers (JAR, WAR, etc.) and the Open Document Format.

I would base XDXF files strictly on this format (the spec has already been written and tested, so there is no need to brew a new one, and that makes it easy for users), but configure the file and directory names where needed.

eng-ita.xdz
|__ mimetype
|__ META-INF
|  |__(an XML file laying out the physical structure, like, where to find which kind of asset)
|__ dictionary.xdxf
|__ graphic/   (optional)
|__ audio/     (optional)
|__ transform/ (optional)
|  |__ html.xsl
|  |__ pdf.xsl
|  |__ dict.xsl
|__ whatever is needed/  (optional)
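
A minimal sketch of building such a container with Python's standard `zipfile` module, following the rule that the `mimetype` entry comes first and is stored uncompressed. The media type string, the file contents and the helper name `build_container` are assumptions for illustration; `META-INF/container.xml` is the name OCF uses for its root file, but the content here is only a placeholder.

```python
import io
import zipfile
from typing import Mapping

# Hypothetical media type; nothing like this is registered for XDXF.
MEDIA_TYPE = "application/x-xdxf+zip"


def build_container(files: Mapping[str, bytes]) -> bytes:
    """Build an OCF-style ZIP: `mimetype` first and uncompressed, the rest deflated."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The first entry is stored (not compressed) so tools can sniff the type.
        archive.writestr(
            zipfile.ZipInfo("mimetype"), MEDIA_TYPE, compress_type=zipfile.ZIP_STORED
        )
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


container = build_container(
    {
        "META-INF/container.xml": b"<container/>",  # placeholder content
        "dictionary.xdxf": b"<xdxf/>",              # placeholder, not valid XDXF
        "audio/hello.ogg": b"fake audio bytes",
    }
)

with zipfile.ZipFile(io.BytesIO(container)) as archive:
    entries = archive.infolist()
    names = [entry.filename for entry in entries]
    media_type = archive.read("mimetype").decode("ascii")

print(names)
```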

In addition, I would specify a flat-file XDXF for those dictionaries that do not need any assets and can be transported safely as pure XML.

Assets would always be linked. And I would adopt XLink for any linking in XDXF.

5 participants