Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data format of "saved" designs #3

Open
meisl opened this issue Oct 29, 2013 · 32 comments
Open

Data format of "saved" designs #3

meisl opened this issue Oct 29, 2013 · 32 comments

Comments

@meisl
Copy link
Collaborator

meisl commented Oct 29, 2013

What the heck, I just can't resist:

So far I've found out that it's base64 encoded deflated (RFC1950) binary data that more-or-less represents the design area cell by cell. That's the rough idea.

BUT: plz note that in fact things aren't (can't be) as easy as "cell by cell". That's because of the connections that can be in the metal layer or the silicon (P or N, which of them might be implicit), or both.
So one (stupid) way to encode all this into a one-cell representative datum would be to enumerate all possible combinations.
However, that
a) would mean a lot of redundancy (think: connected-to-right implies right neighbour to be connected-to-left and so on). Although, one might just say (ie have said): "let deflate do its job"...
and
b) does not match the _in_flated size of the data compared to what I calculated (actual is way less than what I think it needed to be)

Hence I guess there's rather two chunks of information:

  • cell contents: metal or not, silicon or not and if so which, via or not, junction or not; like so (of course not all of the combinations)
  • connections between cells: a grid very much like the former, only displaced "half a cell" and one less datum in each direction (hope that's clear...?)

Those two chunks might be either one-after-the-other or interleaved. To me it looks rather like the latter. At least at the moment...

@tjohnman
Copy link
Owner

So far I've deemed it too early to think about the format the original game uses. In this code, I use three different layers; metal, silicon and vias. The silicon layer can contain one of two possible values on each cell. Each cell on each layer contains information about connectedness to its neighbors.

I've experimented with a couple of alternative ways and the code is currently a bit dirty because of that. I will refactor it once I settle on one specific way.

It's an amount of information small enough so that whatever format should be decided to use, it would be trivial to encode the information one way or the other. For the same reason supporting multiple formats would be a problem either.

@meisl
Copy link
Collaborator Author

meisl commented Oct 30, 2013

It's absolutely vital to be able to read and write zachtronics' format, IMHO.

How else could you convince anyone to have a look at KNOH, out there at kongregate, zach's forum...?

[...] For the same reason supporting multiple formats would be a problem either.

So then: the format's set - by the original.

Re the several ways to encode a cell and its connectedness: not really for using it in the blob but maybe otherwise of interest might be a regular expression. Will post it tomorrow (too tired to be sure to get it right now).

@meisl
Copy link
Collaborator Author

meisl commented Oct 30, 2013

The silicon layer can contain one of two possible values on each cell.

What about junctions?

I mean in one cell you can have

  • no silicon at all
  • P
  • N
  • NPN junction vertical (implying N on top and bottom neighbour and P on left or right or left & right)
  • NPN junction horizontal (implying N on left and right neighbour and P on top or bottom or top & bottom)
  • PNP junction vertical (implying P on top and bottom neighbour and N on left or right or left & right)
  • PNP junction horizontal (implying P on left and right neighbour and N on top or bottom or top & bottom)

?

@tjohnman
Copy link
Owner

That is a WIP right now. It kind of works, but as I said, the code is still changing because I want the simulation logic to be as simple as possible. Right now they are not special cases, but rather silicon connects to itself (P to N) freely. This is not ideal. As I clean up the code I will implement this properly. I think I'll also end up making separate graphics for each case.

@meisl
Copy link
Collaborator Author

meisl commented Oct 30, 2013

What's a WIP?

I think I'll also end up making separate graphics for each case.

Not exactly - but quite - what I did. Have you calculated how many you'd need? ;)

@tjohnman
Copy link
Owner

WIP: Work in progress.
I could do with some kind of procedural combination of separate graphics, or leave it as it works now. It's not exactly the same as in the original (it's a bit more free-form), but it works.

@meisl
Copy link
Collaborator Author

meisl commented Oct 30, 2013

Ay, sure - what isn't? Was thinking of "What I Plan"... what the heck!

Re # imgs: it's just below a thousand in total, if I'm not totally mistaken. So, be it browser or not: either a sprite-sheet or combine simpler (and way less in nr) basic gfx programmatically.

@tjohnman
Copy link
Owner

For now I'm going to keep the current procedural system (try the latest code for clarification). I'm working on the simulation logic right now. I believe after that is in place and we can test whether the circuitry behaves properly or not, we can contemplate changing whatever needs changing. It's all pretty simple as it stands, so I won't mind rewriting it all if I have to.

@meisl
Copy link
Collaborator Author

meisl commented Nov 2, 2013

Using a sprite-sheet in the core helps to decouple things. Just make sure you can adapt to a different cell size and maybe shape (ie non-quadratic).
The sprite-sheet itself can of course be generated programmatically (but it need not).
Btw: it's not that big as an image, 500x500x4bpp should be more than enough.

@meisl
Copy link
Collaborator Author

meisl commented Nov 2, 2013

Ah, totally forgot: using a sprite-sheet implies some addressing logic. Part of that could be mentioned regular expression. Will provide soon, promised.

@tjohnman
Copy link
Owner

tjohnman commented Nov 3, 2013

It's all right, I have it figured out. I have to tweak the procedural graphics so that gates are more clearly identifiable and it's done.

Saving and loading designs should be the priority right now, followed by testing mechanisms and finally creation, saving and loading of challenges or "levels". Did you have any luck figuring out the design save format?

@meisl
Copy link
Collaborator Author

meisl commented Nov 5, 2013

Good to hear that and yes, I agree on save+load prio.
Unfortunately, no news from my side on the format so far. But I should have some time to look into it these days.

@meisl
Copy link
Collaborator Author

meisl commented Nov 5, 2013

EDIT: the following is now a file in the repo, "doc/Whats-in-a-cell.md". What I wrote here is a bit out-of-date and contains flaws, so beware.


Here's mentioned regexp, with rationale:

(_|m[0-9A-F])(_|[np][0-9A-F]v?|((npn|pnp)(TB?|B|LR?|R))

17 × 73 = 1241 configurations in total; see below for details.

Before the explanation let me state that: It

  • is meant as a tool for analysis (as eg the # of different cell images, given certain conventions)
  • should yield valid and fairly reasonable file names
  • should be (somewhat) parseable by a human
  • represents a cell-centric view on things, ie it's supposed to contain exactly the right amount of information to describe one cell completely; "stand-alone" (as by what can exist in the original game)
  • does contain - when used in an encoding of an array of cells - redundant information
  • is therefore probably NOT how things are encoded in the actual data format (see above, I reckon the actual thing is rather separating a) what's in the cell (m and/or silicon) and b) connections)

Now here's the rationale: it decomposes into two main parts:

  • metal layer and its connections: (_|m[0-9A-F])
  • silicon layer with variations and connections: (_|[np][0-9A-F]v?|((npn|pnp)(TB?|B|LR?|R))
Metal layer

Pretty simple: either there is no metal (_) or there is (m). In the latter case there can exist connections to any of four sides: top, right, bottom, left - which is encoded in 4 bits, represented as one hex digit ([0-9A-F]).
Which bit means which direction is up to convention. Doesn't matter much but I think it should be either be clockwise or anti-clockwise, read MSB-to-LSB or LSB-to-MSB.
The two alternatives yield a total of 1 + 16 = 17 configurations in the metal layer.

Silicon layer

Here we got three alternatives:

  • no silicon at all (_), hence no connections there ( 1 configuration )
  • simple negative or positive silicon ([np]) and if so: connections like in metal layer ([0-9A-F]) plus maybe a via (v?) (2 × 16 × 2 = 64 configurations)
  • or a junction - that's the tricky one ( a total of 8 configurations, detailed below )

As by the above, there are a total of 1 + 64 + 8 = 73 configurations in the silicon layer.

A junction then is further detailed like so:

  • (npn|pnp), obvious ( × 2 wrt nr of configurations ). But plz note that npn 1st of all (but not only) implies 2 normal N-silicon connections while pnp implies 2 normal P-silicon connections (ie. the "collector" and "emitter" connections)
  • (TB?|B|LR?|R) ( × 4 wrt nr of configurations ): this is to encode both,
    • a) horizontal vs vertical orientation of the npn or pnp sequence as a whole
    • b) AND, as a junction implies * at least one connection from a neighbouring cell to the middle part * (ie the "base" connection) - which of them are present. T (Top), B (Bottom), L (Left) and R (Right) encode which neighbour relative to the middle part is connected (implying, of course, that same sort of silicon in that neighbour as in the middle part).

Again, should expand on the last point. This time by example:

  • npnT encodes
    • a horizontal junction (thereby implying simple n on the left and right)
    • plus the top neighbour being simple p and connected to the p "base".
      But the bottom neighbour is NOT connected and can have whatever silicon (incl. none) whatsoever.
  • pnpLR encodes
    • a vertical junction (thereby implying simple p on the top and bottom)
    • plus the left and right neighbour being simple n and connected to the n "base".
  • npnR encodes
    • a vertical junction (thereby implying simple n on the top and bottom)
    • plus the right neighbour being simple p and connected to the p "base".
      But the left neighbour is NOT connected and can have whatever silicon (incl. none) whatsoever.
  • pnpRL is illegal
  • ... as is npnBT
  • ... or npnTBv
  • ...

Hope that helps / is interesting for you :)

@tjohnman
Copy link
Owner

tjohnman commented Nov 7, 2013

I must say this is very well though out and certainly very exciting! This is certainly 100% compatible with the code with no major differences with the logic behind it. It would be trivial to implement saving and loading using this format. And yes, it is very much human-readable. In fact, I'm thinking that, by changing the internal representation of the cells to conform more closely to this (although, as you already said, it contains redundancies), the simulation logic could be simplified (maybe), and (maybe) made faster. I'll have to look into it.

Would you mind if I pasted your comment as a reference file in the repo and implemented this format for the time being? Or would you like to revise it / wait to see if you can decode the official format?

EDIT: We can always support both formats.

@meisl
Copy link
Collaborator Author

meisl commented Nov 8, 2013

Would you mind if I pasted your comment as a reference file in the repo and implemented this format for the time being?

Not at all - I'd feel honoured :)

Plz note that I've edited the post, have added the calculation of possible configurations. I'm getting more than I was getting before (with a slightly different encoding, but can't see the error atm). So plz re-check 'em.

@meisl
Copy link
Collaborator Author

meisl commented Nov 8, 2013

Re using this to gain speed-up: well, wrt the usual space-for-speed trade-off we have at least the necessary condition (redundancy). That it's also sufficient remains to be shown...

@tjohnman
Copy link
Owner

tjohnman commented Nov 8, 2013

Added the document and made you contributor so you can edit it directly (or contribute to the repo in any other manner).

@meisl
Copy link
Collaborator Author

meisl commented Nov 8, 2013

Cool, thanks. Didn't know of this contributor thing, makes it really easy for me. Standalone, without the issue context it does need some modification. Will do.
EDIT 1: began doing, plz check

Guess it's getting time for me to fork, though...
EDIT 2: Oh stupid me! Was thinking for this file only - but now I got it, real contributor... BIG Thanks! for this :)

@meisl
Copy link
Collaborator Author

meisl commented Nov 9, 2013

Here's another calculation:

The original design area is 36 cells wide and 27 high, making a total of 972 cells.
And say the 1241 possible configurations of one cell (as above) are correct: 1241 × 972 = 1206252
Note that this includes all the redudancy mentioned.

Now: that is not such a big number. How many bits would you need? Is that possible?

I just decided to go to bed now, for it seems I must be missing something. Or can a whole design area really be encoded in a single 32-bit number?!

@meisl
Copy link
Collaborator Author

meisl commented Nov 9, 2013

Aargh, it's of course 1241 ^ 972, ^ rather than ×. A few more bits...

G'night & sorry 8-}

@tjohnman
Copy link
Owner

tjohnman commented Nov 9, 2013

This format is wasteful in terms of storage space. But unlike the binary storage used internally in the game, it is more human-readable, and since the grid is not that big anyway I don't think space is a problem.
In any case, I'm going to implement this format for copying and pasting and then a proper binary format for saving to a file.

I've been thinking the official format is probably base 64 encoded binary. Have you checked that?

@meisl
Copy link
Collaborator Author

meisl commented Nov 9, 2013

Sure, it's base64 encoded deflated binary data, as stated in the 1st sentence of the 1st post ;)
Apart from that I haven't yet found out more than the guesswork mentioned. But I'm sure we'll find out.

@meisl
Copy link
Collaborator Author

meisl commented Nov 10, 2013

I just added (cf756fa) a definite convention for the encoding of connections and a note to the spec file - and fixed yet another flaw in my calculations. It's a total of 17 × 77 = 1309 configurations rather than 1241.

For your convenience, these are the main changes:


[...] there can exist connections
to any of four sides: top, right, bottom, left - which is encoded in 4 bits, represented as one hex digit ([0-9A-F]).
Which bit means which direction is up to convention.
The one we'll use is TLBR, read MSB-to-LSB. That is: 8 = top, 4 = left, 2 = bottom, 1 = right.

and

A junction then is further detailed like so:

  • (npn|pnp), obvious ( × 2 wrt nr of configurations ). But plz note that npn 1st of all (but not only) implies 2 normal N-silicon connections while pnp implies 2 normal P-silicon connections (ie. the "collector" and "emitter" connections)
  • (TB?|B|LR?|R) ( × _6_ wrt nr of configurations ): this is to encode both,
    • a) horizontal vs vertical orientation of the npn or pnp sequence as a whole
    • b) AND, as a junction implies * at least one connection from a neighbouring cell to the middle part * (ie the "base" connection) - which of them are present. T (Top), B (Bottom), L (Left) and R (Right) encode which neighbour relative to the middle part is connected (implying, of course, that same sort of silicon in that neighbour as in the middle part).

Note that (TB?|B|LR?|R) actually encodes a subset of the connections that we used to otherwise encode by [0-9A-F].
We could therefore as well use [8|A|2|4|5|1] here.
However, we chose (TB?|B|LR?|R) instead because it seems a little bit simpler (for a human).

@meisl
Copy link
Collaborator Author

meisl commented Nov 11, 2013

Hey, I've got another thing in preparation that'll exemplify mentioned convention and show how to use it to programmatically create cell images; using far less basic images (rather 10s than 1000). I'll call it "Drawing cells corner-by-corner" or so. It'll have quite a bit of images. Although I'll make a markdown version of it this won't be as good as the html version that I'll make, too. Html, however, is not quite as accessible here in github as markdown. Eg can't view it immediately rendered, as you can with markdown.

My qs are now about how to proceed:

  • the regexp and "corner-by-corner" is not really about the data format. We could use the github Wiki or rather make a folder doc/ in the repo. Atm I tend to choosing the latter. What'ya think?
  • You'll probably get a pull request from my fork when I'm done. I think now that just making a branch here would've been better, but well, you know... Are you ok with branching off often, or rather not?
  • What about this issue? It's gone somewhat off track. With respect to Zach's actual format there's nothing new that's not already in the OP ... and this is post No 24. Maybe better close this and open a new one where we really concentrate on analysing the data format?

[That last point's not at all to say that the thread here had no value. Rather on the contrary, it spawned a lot. But might be helpful when trying to get to that one info fast which one will be looking for in the future. And: we can always refer back to here when necessary.]

@meisl
Copy link
Collaborator Author

meisl commented Nov 18, 2013

Hey, what's up?

[meisl wrote:]

Sure, it's base64 encoded deflated binary data, as stated in the 1st sentence of the 1st post ;)

Hope you didn't take that as offense...? Should you have indeed, then plz accept my apologies. Wasn't meant as offense at all.

You know, I proved myself capable of being considerably more stupid than that already, so... I'll take more care though, in case.

@tjohnman
Copy link
Owner

Hey! Hehe, no, no offense. I've been swamped in work lately. I'm taking a look at the pull request this instant.

@meisl
Copy link
Collaborator Author

meisl commented Nov 25, 2013

Oh, ok cool then :)

@meisl
Copy link
Collaborator Author

meisl commented Dec 2, 2013

Hi again! Had little time myself lately, but the next few days I could delve into something.
So I asked myself where's best to best put effort into. Probably getting forward a bit with the data format is most worthy (rather than clean up / improve articles).

For now my only idea is to try a differential approach - which is big words for a pretty unfancy and laborious thing: just collecting a whole lot of saved designs that differ in only one cell and by looking at the binary diff, try to get some idea of what's going on. I already crammed together a Perl script for the deB64-inflate part and well, I kept taking screenshots of the associated designs. Not really fancy, I know, but that's where I'd start off of.

So what d'ya think?

  • Should I give this path a try?
  • Or maybe you had some new insights?
  • Shouldn't we make a fresh start for it in a new issue, as above?

@meisl
Copy link
Collaborator Author

meisl commented Dec 9, 2013

Ok, some factoids that I collected (but wasn't able to put together yet):

  • the editable part of the design area is 36x27 but actually visible (including pins and border) is 44 × 27 (0x2C x 0x1B)
  • the file starts with the sequence 0x04 0x2C 0x04 0x1B; probably encoding these dimensions (44 × 27)
  • the size is always 13099 bytes
  • throughout the file there are two three-byte sequences that appear very regularly: 0x09 0x59 0x01 and 0x09 0x37 0x01. They seem to be some kind of separator, and the middle number probably encodes somehow the size of the following data chunk and one byte might just be padding (since we have an odd number of cells per column)
  • a very frequent pattern is 0x09 0x37 0x01 followed by 27 data bytes, then repeated
  • another one is 0x09 0x37 0x01 followed by 2 × 27 = 54 data bytes, then repeated
  • the previous two most probably represent columns of cell data
  • cell content wrt to the silicon layer (ie, P, N or nothing) seems to start at 0x00EB with 0x09 0x37 0x01 then columns of two-bytes-per-cell, 0x04 0x00 for nothing, 0x04 0x01 for N and 0x04 0x02 for P
  • cell connections wrt to the silicon layer seems to start at 0x1EFA (and/or at 0x2425) with 0x09 0x37 0x01 then columns of one-byte-per-cell where 0x02 seems to mean "no connection" and 0x03 "connection" (same for N and P)
  • there are three similar locations for the metal layer but these are all one-byte-per-cell: 0x0A4E, 0x2950, 0x2E7B; again 0x02 seems to mean "no connection" and 0x03 "connection"
  • all mentioned locations are only approximate; probably they really start ~ 4 × (27 (× 2) + 3) bytes earlier

@meisl
Copy link
Collaborator Author

meisl commented Dec 9, 2013

My current way of investigating this is too tedious. So I thought of this:

  • since it's lots of guesswork and the base of test data is still small (but is to be extended, ideally not only by me) - it occurred to me that it's very much like unit-testing: I've got hypotheses such as "it's always 13099 bytes" and I wanna be alarmed as soon as a counter-example is being added to the test data base, and I want this alarm to go off automatically!
  • adding new test data (= saved design as from KOHCTPRYKTOP plus description plus - at best - screenshot) should be easy as pie
  • adding a new hypothesis will necessarily be more demanding but should also be as easy as possible
  • testing all hypotheses against all test data should be a real no-brainer; in case of addition of either a hypothesis or new test data it should at best require no interaction at all or, 2nd to best, at most one click or key-stroke
  • getting a binary diff should be as easy as selecting two items, one in each of two lists
  • adding abstractions to the binary diff should be possible, eg. say we have found out that every file consists of 10 chunks of data and the lengths of these chunks are fixed - then the diff should show something like "chunk X: no difference" rather than listing all the Y equal bytes in chunk X

So... I'm thinking of making something in JS where one can paste B64 from KOHCTPRYKTOP and the hypotheses would be JS functions. Along the lines of the interactive page I made for "Drawing cells corner-by-corner". For the B64-inflate part I'd use https://github.com/dankogai/js-deflate (licensed GPL 2).

Nevertheless, I'll commit the PERL script and my current test data soon.

@meisl
Copy link
Collaborator Author

meisl commented Dec 10, 2013

Have put the stuff I have so far into the new branch "file-format". The PERL I'm using is 5.12 (build 1204) from ActiveState; the script, "decode-ktop.pl" will simply decode any *.b64 files that it finds in the same folder and convert 'em to *.bin.
For the binary diff (of two of such *.bin files) I used TotalCommander's "File|Compare By Content", but as said: this is to be replaced by some more convenient way of doing it.

Here's some more "facts" of which I'm almost sure now:

  • it's really a 44 × 27 area of cells that's encoded in the file, the "pins" or "pads" (IO pins, eg "VCC", "A0", "A1" etc) appearing in it as simple 3 × 3 squares of cells with metal (and maximally connected amongst 'em)
  • the 0x09 0x59 0x01 marker/separator always appears exactly 9 times in every file, thus separating it into 10 "chunks"
  • the 1st of these chunks is constantly 4 bytes long: 0x04 0x2C 0x04 0x1B which most probably encodes the dimensions 44 (0x2C) × 27 (0x1B, no idea what the 0x04 is for...)
  • everything else (ie all other "chunks") is/are organized by column first, only then by row, ie there is a for (x = 0; x < 44; x++) ... { for (y = 0; y < 27; y++) { ... } } in the src rather than the other way round.
  • the 2nd of these chunks is always 44 × (3 + 27 × 2) = 2508 bytes long; where the "44" corresponds to the # of columns, the "27" corresponds to the # of rows, and the "3" is due to the constant marker/separator 0x09 0x37 0x01 appearing at byte offsets z × (3 + 27 × 2 = 57); z ranging from 0 to 43
  • each of the remaining 8 of these chunks is always 44 × (3 + 27 × 1) = 1320 bytes long; where the "44" corresponds to the # of columns, the "27" corresponds to the # of rows, and the "3" is due to the constant marker/separator 0x09 0x37 0x01 appearing at byte offsets z × (3 + 27 × 1 = 30); z ranging from 0 to 43
  • 2nd chunk is for silicon layer contents (N, P or junctions)
  • vias are encoded in one small chunk (1320 bytes) as simple boolean flags
  • connections - be it in silicon or metal - are encoded in two (definitely 2 for metal and probably 4 for silicon/junctions) small chunks each of simple boolean flags: one chunk for "is it connected to the right" and the other for "is it connected to the bottom". So the "right-most" flag in the "connected-to-right-?" chunk and the "bottom-most" flag in the "connected-to-bottom-?" chunk is always "false". Hence my initial guess for the encoding was almost right except in one aspect: it just doesn't bother with having this extra byte for "right-most" and "bottom-most" - just for the sake of uniformity as it seems.

Then: I couldn't get https://github.com/dankogai/js-deflate to work with KOHCTPRYKTOP's data, so I'll be trying https://github.com/imaya/zlib.js (MIT license) instead.

@meisl
Copy link
Collaborator Author

meisl commented Dec 11, 2013

The precise format of the compressed data (ie after B64-decoding) is the one described in RFC1950 with CM 8 (compression method 8, ie deflate). This differs from raw deflated data in some additional header fields at the beginning and an appended Adler 32 checksum at the end.
That's why dankogai's js-deflate doesn't work: it only implements raw deflate/inflate.
EDIT: hey, and additionally his Base64.decode seems to be buggy...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants