Hash with djb2 function #10

aw · 2021-10-18T11:05:48Z

This PR makes 2 changes:

It removes unused code from the .asm file (smaller binary?)
It replaces the custom tpop hash function with values used in the djb2 hash function

I don't think it'll make a big difference performance wise, but I like the idea of using a well-known initial magic constant or hash value with a good multiplier.

theandrew168 · 2021-10-18T14:47:19Z

This sounds good to me! The existing hash comes from the K&R C book (second edition on page 144). Though, for some reason, I must've thought it came from The Practice of Programming book (hence TPOP). Either way, it looks like djb2 has decent qualities. It will expand to be one instruction larger, however, due to the initial constant being larger than the acceptable range of a single ADDI (-2048 to 2047).

I suppose at the and of the day this choice would come down to the nuanced characteristics of each hash function. I doubt the extra instruction with djb2 would make any noticeable difference (hashing isn't on a hot path of the interpreter).

For this PR, I'd want to go ahead and simply rename tpop_hash (and all references) to djb2_hash just for consistency.

aw · 2021-10-18T22:06:18Z

Oh nooo!! I totally missed the fact that li expands to lui and addi. Nice catch! Ah yes K&R C, I should have known... Knuth has a very similar hash function in Vol. 3 of TAOCP, but I felt djb2 was better.

Now thinking about it, I do somewhat prefer less instructions even though you say it won't make a difference... let me think about this some more.

aw · 2021-10-19T01:23:17Z

Indeed, it adds only one instruction to tpop_hash, not the tpop_hash_loop itself:

 146:	6285                	lui	t0,0x1
 148:	50528293          	addi	t0,t0,1285 # 0x1505
 14c:	02100313          	li	t1,33

As you said, it shouldn't make a big difference.

theandrew168 · 2021-10-19T04:53:58Z

Awesome, I like this change! I've always been thinking (in the back of my mind) about how (or if) the interpreter should handle hash collisions. In Forth, overriding words with new implementations is super common so collisions are "expected" behavior in that sense. If two words ever collided unintentionally, then the program would be in a bad state without the user really knowing why (unless they know to run the hashcheck.py script beforehand).

I thought about simply printing something special if a word gets defined that replaces an existing one (maybe print ok* or something slightly different). That way, a user could at least notice and say "wait up, this is a new word and shouldn't be overriding anything". It also may be just fine to simply not worry about this at all. DJB2 is a good hash and we've got 2^32 slots to fill: an accidental collision is super unlikely.

aw · 2021-10-19T05:07:20Z

Yep, I came to pretty much the same conclusion regarding the likelyhood of a collision.

I was initially skeptical about word hashing due to the "performance hit", but in this case it means a word name only occupies 4-Bytes instead of N-Bytes, which provides quite a bit of savings as the dictionary grows.. which is particularly useful on small MCUs.

Now we just need to fix the O(N) lookup time for user-defined dictionary words. I'm testing some ideas for this and will get back to you in a separate issue.

theandrew168 · 2021-10-19T05:35:47Z

Yea, the hashcheck.py script does that quick math on how many bytes are saved by using a static 4-byte hash instead of spelling out each word name (and padding to a 4 byte boundary). The results were crazy! Almost 1K for just the words in my basic prelude, rcu, and gpio lexicons. Definitely a win.

Yea, the O(N) lookups are slightly fussy, but thankfully they aren't on the hot path. Once a word is looked up and linked, it will never been looked up again (by the word being defined). I could possibly see an argument claiming that "startup times" could be negatively impacted by the linear lookup behavior: defining M words in sequence could lead to O(N*M) worst case. In practice, however, new words tend to refer to other recent words which keeps the behavior more linear than quadratic.

Plus, I worry that doing anything fancier (some sort of trie?) would hurt code size and general readability. It is probably worth some research, though.

aw added 2 commits October 18, 2021 19:47

Remove unused perl_hash

288a8ce

Replace tpop with djb2 hash function

ecb131a

theandrew168 requested review from theandrew168 and removed request for theandrew168 October 18, 2021 14:47

aw added 2 commits October 19, 2021 11:40

Rename tpop to djb2 hash

21ad1be

Remove redundant comment

cb0dc9b

theandrew168 merged commit 22cb9cf into theandrew168:main Oct 19, 2021

aw mentioned this pull request Oct 21, 2021

Please explain dictionary entry header #11

Closed

aw deleted the hash-djb2 branch October 25, 2021 13:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hash with djb2 function #10

Hash with djb2 function #10

aw commented Oct 18, 2021

theandrew168 commented Oct 18, 2021 •

edited

aw commented Oct 18, 2021 •

edited

aw commented Oct 19, 2021

theandrew168 commented Oct 19, 2021

aw commented Oct 19, 2021

theandrew168 commented Oct 19, 2021 •

edited

Hash with djb2 function #10

Hash with djb2 function #10

Conversation

aw commented Oct 18, 2021

theandrew168 commented Oct 18, 2021 • edited

aw commented Oct 18, 2021 • edited

aw commented Oct 19, 2021

theandrew168 commented Oct 19, 2021

aw commented Oct 19, 2021

theandrew168 commented Oct 19, 2021 • edited

theandrew168 commented Oct 18, 2021 •

edited

aw commented Oct 18, 2021 •

edited

theandrew168 commented Oct 19, 2021 •

edited