Does not work on 32-bit ARM (or maybe any 32-bit arch?) with raw binary data #2

Open
romanrm opened this issue Apr 5, 2024 · 20 comments


@romanrm

romanrm commented Apr 5, 2024

With a source code like this:

<?php
require("base85.class.php");

$encoded = base85::encode(md5("test", true));
echo $encoded, "\n";
?>

Proper output on amd64 looks like this:

$ php test.php 
$'/lH7Np6%b2,mG-7?1o

But on ARMv7 (32-bit) results in a garbled mess (also binary data in output):

$ php test.php 
�-7?1oNp6%��

PHP version is the same on both, 7.4

@scottchiefbaker
Owner

I suspect this has something to do with 64bit vs 32bit. Testing 32bit PHP is not something I can do easily. Are you able to run the unit tests? Do they all fail?

The ONLY thing in the code that should care about 32bit would be the pack() and unpack() calls, but that seems to already be tied to the 32 bit calls?

@scottchiefbaker
Owner

scottchiefbaker commented Apr 5, 2024

I forgot I have an old Raspberry Pi 3 that I could test with. These tests were run on that Pi which is running PHP v7.4.33. If I take the above sample input and do some debugging I see this:

$input = hex2bin("098f6bcd4621d373cade4e832627b4f6");
Length: 16 bytes / Padding: 0

int(160394189)
int(1176621939)
int(-891400573)
int(640136438)

We're seeing a total of 16 bytes which we break up in 4x 32bit unsigned longs using unpack().

If I run the same code on my x86_64 Ryzen box I get:

$input = hex2bin("098f6bcd4621d373cade4e832627b4f6");
Length: 16 bytes / Padding: 0

int(160394189)
int(1176621939)
int(3403566723)
int(640136438)

Something is funky with unpack("N*") on 32bit ARM, but only for that third chunk? If I look at the unpack() docs it clearly says that N is an unsigned 32 bit long. If the docs are correct it should be impossible to get a negative number. I'm guessing this is a PHP bug of some kind. I suspect this library will work just fine if we can figure out what unpack() is doing wrong.
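
For reference, a minimal sketch of what is probably going on: PHP has no unsigned integer type, so on a 32-bit build a 32-bit value with its high bit set cannot be held as a positive int and comes back negative (3403566723 - 2^32 = -891400573). The snippet reuses the third chunk from the debug output above. Using sprintf("%u") to recover the unsigned value as a decimal string is one common workaround and is an assumption here, not necessarily what this library ends up doing.

```php
<?php
// Third 4-byte chunk of md5("test", true), taken from the debug output above.
$chunk = hex2bin("cade4e83");

// unpack("N") returns a PHP int. PHP ints are signed, so on a 32-bit build
// any value at or above 2**31 is reported as a negative number.
$n = unpack("N", $chunk)[1];

// sprintf("%u") renders the same bits as an unsigned decimal string.
$unsigned = sprintf("%u", $n);

echo PHP_INT_SIZE * 8, "-bit build: int = ", $n, ", unsigned = ", $unsigned, "\n";
```

The unsigned string is the same on both platforms; only the int differs.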

@scottchiefbaker
Owner

I upgraded to Debian 12 which gives me PHP v8.2.7 and I'm getting the same error with the weird third chunk. Not sure what the next steps are. Either:

  • I'm misunderstanding unpack()
  • It's a PHP bug

@romanrm
Author

romanrm commented Apr 5, 2024

See this comment, seems to be related: https://www.php.net/manual/en/function.unpack.php#106041

@scottchiefbaker
Owner

scottchiefbaker commented Apr 5, 2024

That gets us closer... I was able to refactor the code to not use unpack() with some trickery from that comment. Now I get the correct values (kind-of):

string(9) "160394189"
string(10) "1176621939"
string(10) "3403566723"
string(9) "640136438"

I need those as integers, not strings, and when I try to convert that third chunk with intval() it clamps it to PHP_INT_MAX (2**31 - 1 on 32bit) because the value is greater than 2**31.

int(160394189)
int(1176621939)
int(2147483647)
int(640136438)

Not to mention that we still have to do some 64 bit math with this number later to calculate the appropriate base85 character. Without a massive refactor I don't know if I can get this to work. We rely on 64bit math under the hood.

@scottchiefbaker
Owner

With some help from ChatGPT I was able to figure out how to use gmp with PHP to do math with numbers larger than 2^31. Check out the latest code in the repo and see if it works for you. I'm passing all the unit tests on 32bit now.

$ php tests/tests.php
Encode: Null                                                                    [  OK  ]
Encode: Four nulls                                                              [  OK  ]
Encode: Single space                                                            [  OK  ]
Encode: Four spaces = 'y'                                                       [  OK  ]
Encode: Null in middle                                                          [  OK  ]
Encode: Four null in middle                                                     [  OK  ]
Encode: Hello world                                                             [  OK  ]
Encode: Quick brown fox                                                         [  OK  ]
Encode: String #1                                                               [  OK  ]
Encode: String #2                                                               [  OK  ]
Encode: String #3                                                               [  OK  ]
Encode: String #4                                                               [  OK  ]
Encode: Unprintable chars                                                       [  OK  ]
Encode: Github issue #2                                                         [  OK  ]

All 14 tests passed

I'm not doing anything special in decode() yet. I suspect that is still broken on 32bit, but I haven't tested.

@scottchiefbaker
Owner

I went ahead and added a bunch of decode() unit tests and they all pass fine on 64 and 32bit. I guess the decode() code isn't affected. I think we're 100% now?

Please test and let me know.

@romanrm
Author

romanrm commented Apr 5, 2024

I encoded and decoded a PNG file, and it came out identical by checksum, so it works.

Bad news though: it is really really slow.

On a small ARMv7 system at 1 GHz (Allwinner A20) it took 5 minutes 17 seconds to encode 1 MB.
For comparison, encoding the same test file using PHP's base64_encode takes 0.25 seconds.

Decoding the base85 file took 3 seconds, so that part is fine.

@scottchiefbaker
Owner

That doesn't really surprise me. I imagine doing a ton of intensive math with gmp on a 32bit processor adds a bunch of overhead to work around the limitations of the CPU. I don't know enough about things to optimize it much further. If you think you can speed it up I'm open to PRs. The unit tests are pretty comprehensive now so it should be easy to test if things are working or not.

The world has moved on to 64bit CPUs. After diving into all this madness, I now see why :)

@romanrm
Author

romanrm commented Apr 5, 2024

Would it be possible to get back to the initial method as shown in
#2 (comment)
and only use a GMP codepath if the number ends up being negative (indicating overflow)?

But overall I would agree it's not worth spending a ton of effort and adding much complexity just to better support 32-bit systems. As is, even a slow fallback is fine; the main thing is that it now gives correct results. It would only be worth it if someone saw this as a cool hacking project or a challenge to solve in a nice way. :)
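
As an illustration of what such a "hacking project" could look like, here is a minimal GMP-free sketch. It splits each 4-byte group into two 16-bit halves and does schoolbook long division by 85, so every intermediate value stays far below 2^31. The function name encode_chunk_16bit is hypothetical and not part of the library. Padding of a short final group and the single-character shortcuts (such as the 'y' for four spaces seen in the unit tests) are deliberately left out.

```php
<?php
/**
 * Encode one 4-byte group as 5 base85 characters using only small integers.
 * Hypothetical helper, not taken from the library discussed in this thread.
 */
function encode_chunk_16bit(string $four_bytes): string
{
    // Two unsigned 16-bit big-endian halves; each one is below 65536.
    $halves = unpack("nhi/nlo", $four_bytes);
    $hi = $halves["hi"];
    $lo = $halves["lo"];

    $out = "";
    for ($i = 0; $i < 5; $i++) {
        // Long division of (hi * 65536 + lo) by 85, one 16-bit "digit" at a time.
        $carry = $hi % 85;
        $hi    = intdiv($hi, 85);
        $tmp   = $carry * 65536 + $lo;   // at most 84 * 65536 + 65535
        $lo    = intdiv($tmp, 85);       // stays below 65536
        // The remainder is the next base85 digit, least significant first.
        $out = chr($tmp % 85 + 33) . $out;
    }
    return $out;
}

// Input length must be a multiple of 4 in this sketch.
$groups = str_split(md5("test", true), 4);
echo implode("", array_map("encode_chunk_16bit", $groups)), "\n";
```

For md5("test", true) this prints $'/lH7Np6%b2,mG-7?1o, the same string as the amd64 output at the top of this issue.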

@scottchiefbaker
Owner

Oh that's an interesting idea... I just landed some code that only goes the slow/gmp way if it can't do the math in 32bit. Try the latest commit and see if it's faster.
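
A rough sketch of what such a two-path encoder might look like, under stated assumptions: the function name encode_chunk_hybrid is hypothetical, the code is not taken from the commit mentioned above, and padding and shortcut characters are ignored. The native path runs whenever unpack() returns a non-negative int; the GMP path is only reached on a 32-bit build when the chunk is 2^31 or larger.

```php
<?php
/**
 * Encode one 4-byte group as 5 base85 characters.
 * Hypothetical helper: native integers when the value fits in a signed int,
 * GMP only when unpack() hands back a negative number (32-bit builds).
 */
function encode_chunk_hybrid(string $four_bytes): string
{
    $n   = unpack("N", $four_bytes)[1];
    $out = "";

    if ($n >= 0) {
        // Fast path: plain integer division and modulo.
        for ($i = 0; $i < 5; $i++) {
            $out = chr($n % 85 + 33) . $out;
            $n   = intdiv($n, 85);
        }
        return $out;
    }

    // Slow path: recover the unsigned value as a string and divide with GMP.
    $big = gmp_init(sprintf("%u", $n), 10);
    for ($i = 0; $i < 5; $i++) {
        [$big, $digit] = gmp_div_qr($big, 85);
        $out = chr(gmp_intval($digit) + 33) . $out;
    }
    return $out;
}

$groups = str_split(md5("test", true), 4);
echo implode("", array_map("encode_chunk_hybrid", $groups)), "\n";
```

On a 64-bit build only the first branch ever runs, so the GMP branch needs to be exercised on real 32-bit hardware.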

@romanrm
Author

romanrm commented Apr 6, 2024

It now takes 12 seconds to encode, but the encoded result is different than before, and decoding it produces a different PNG file that is broken and not viewable. Did your unit tests pass? Do they include encoding and verifying binary data, and not just ASCII (probably not)?
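
One possible shape for such a binary test is a round-trip check over random bytes plus a few hand-picked groups around the 2^31 boundary. This is only a sketch: roundtrip_failures is a hypothetical helper, and base64_encode / base64_decode are used purely as a stand-in codec so the snippet runs on its own.

```php
<?php
/**
 * Round-trip binary strings through a codec and collect the inputs that fail.
 * $encode and $decode are placeholders for whatever codec is under test.
 */
function roundtrip_failures(callable $encode, callable $decode, int $rounds = 200): array
{
    // Hand-picked groups just below, at, and far above 2**31.
    $inputs = ["\x7f\xff\xff\xff", "\x80\x00\x00\x00", "\xff\xff\xff\xff"];

    // Random binary strings of 1 to 64 bytes.
    for ($i = 0; $i < $rounds; $i++) {
        $inputs[] = random_bytes(random_int(1, 64));
    }

    $failures = [];
    foreach ($inputs as $input) {
        if ($decode($encode($input)) !== $input) {
            $failures[] = bin2hex($input);   // hex so the report is printable
        }
    }
    return $failures;
}

// Stand-in codec; swap in the codec under test.
$failures = roundtrip_failures("base64_encode", "base64_decode");
echo count($failures) === 0
    ? "all round trips OK\n"
    : "failed inputs (hex): " . implode(", ", $failures) . "\n";
```

For this library the call would presumably be roundtrip_failures(["base85", "encode"], ["base85", "decode"]), assuming decode() is static like encode().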

@romanrm
Author

romanrm commented Apr 6, 2024

Here's how it is different, for a general idea:
[Image: 2024-04-06T122437Z-b85]

@scottchiefbaker
Owner

Interesting... all of my unit tests are passing. Clearly I'm missing something though. Take a look at the unit tests and see if you think I'm missing anything. I'm open to PRs for new unit tests.

$ php tests/tests.php
Encode: Null                                                                    [  OK  ]
Encode: Four nulls                                                              [  OK  ]
Encode: Single space                                                            [  OK  ]
Encode: Four spaces = 'y'                                                       [  OK  ]
Encode: Null in middle                                                          [  OK  ]
Encode: Four null in middle                                                     [  OK  ]
Encode: Hello world                                                             [  OK  ]
Encode: Quick brown fox                                                         [  OK  ]
Encode: String #1                                                               [  OK  ]
Encode: String #2                                                               [  OK  ]
Encode: String #3                                                               [  OK  ]
Encode: String #4                                                               [  OK  ]
Encode: Unprintable chars                                                       [  OK  ]
Encode: Github issue #2                                                         [  OK  ]
Decode: Null                                                                    [  OK  ]
Decode: Four nulls                                                              [  OK  ]
Decode: Single space                                                            [  OK  ]
Decode: Four spaces = 'y'                                                       [  OK  ]
Decode: Null in middle                                                          [  OK  ]
Decode: Four null in middle                                                     [  OK  ]
Decode: String #1                                                               [  OK  ]
Decode: String #2                                                               [  OK  ]
Decode: String #3                                                               [  OK  ]
Decode: String #4                                                               [  OK  ]
Decode: Unprintable chars                                                       [  OK  ]

All 25 tests passed

@scottchiefbaker
Owner

scottchiefbaker commented Apr 7, 2024

What a weird bug... I spent 45 minutes tracking it down but I think I got it with bcc9583. See if that works for you. I also added unit tests for this scenario so it won't catch us again in the future. If this isn't it then I will need a small file that recreates this problem that I can test with. The smaller the better.

@scottchiefbaker
Owner

scottchiefbaker commented Apr 7, 2024

On my Raspberry Pi 3 using a 1MB file as a test:

  • Encode time: 3.6 seconds
  • Decode time: 1.2 seconds

Even on a 32bit platform the algorithm takes the "fast" path 81.6% of the time. The "slow" path is only taken when the integer chunk is greater than 2^31.

@romanrm
Author

romanrm commented Apr 7, 2024

Now it works correctly for me on that 1M PNG file.

Takes 13 seconds to encode, 3 seconds to decode.

Thanks!

@romanrm
Author

romanrm commented Apr 7, 2024

I suggest adding a check for whether we are running on a 32-bit system, then verifying that the GMP PHP extension is installed (function_exists("gmp_init")) and throwing an error outright if it is not. Otherwise it might work without any error for simple cases, and the user might deploy it thinking everything is fine, only for it to fail later in more complex ones.

@scottchiefbaker
Owner

That's not a bad idea... How should it look? Just a simple print() statement? I landed something in 21c5098 as a test.

@romanrm
Author

romanrm commented Apr 7, 2024

I would not put it into encode(), but directly into the PHP file itself, even outside the class declaration. That way any script which includes the PHP file runs the check once at include time and fails to continue if it is not met. There is also no need to check for "first" then.

Better print it to STDERR, and terminate execution. Using trigger_error("Message", E_USER_ERROR) can achieve both.

Also, do not assume a web context: PHP is very useful for writing "shell scripts" in a better language, not only web apps. So I suggest omitting any HTML formatting.
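
A minimal sketch of such an include-time check, under a few assumptions: the helper name base85_environment_problem and the message text are hypothetical, and the sketch uses fwrite(STDERR, ...) plus exit(1) instead of trigger_error(), since passing E_USER_ERROR to trigger_error() is deprecated as of PHP 8.4. The decision is kept in a small function so it can be tested without a 32-bit machine.

```php
<?php
/**
 * Return an error message if this PHP build cannot run the 32-bit fallback,
 * or null if everything needed is present. Hypothetical helper.
 */
function base85_environment_problem(int $int_size, bool $has_gmp): ?string
{
    // 64-bit ints need no fallback; 32-bit builds are fine if GMP is loaded.
    if ($int_size >= 8 || $has_gmp) {
        return null;
    }
    return "base85: 32-bit PHP detected and the GMP extension is not loaded";
}

// Runs once at include time, outside any class declaration.
$problem = base85_environment_problem(PHP_INT_SIZE, function_exists("gmp_init"));
if ($problem !== null) {
    // Plain text, no HTML. STDERR is only defined for the CLI SAPI.
    if (defined("STDERR")) {
        fwrite(STDERR, $problem . "\n");
    } else {
        error_log($problem);
    }
    exit(1);
}
```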
