Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash with djb2 function #10

Merged
merged 4 commits into from Oct 19, 2021
Merged

Hash with djb2 function #10

merged 4 commits into from Oct 19, 2021

Conversation

aw
Copy link
Contributor

@aw aw commented Oct 18, 2021

This PR makes 2 changes:

  1. It removes unused code from the .asm file (smaller binary?)
  2. It replaces the custom tpop hash function with values used in the djb2 hash function

I don't think it'll make a big difference performance wise, but I like the idea of using a well-known initial magic constant or hash value with a good multiplier.

@theandrew168
Copy link
Owner

theandrew168 commented Oct 18, 2021

This sounds good to me! The existing hash comes from the K&R C book (second edition on page 144). Though, for some reason, I must've thought it came from The Practice of Programming book (hence TPOP). Either way, it looks like djb2 has decent qualities. It will expand to be one instruction larger, however, due to the initial constant being larger than the acceptable range of a single ADDI (-2048 to 2047).

I suppose at the and of the day this choice would come down to the nuanced characteristics of each hash function. I doubt the extra instruction with djb2 would make any noticeable difference (hashing isn't on a hot path of the interpreter).

For this PR, I'd want to go ahead and simply rename tpop_hash (and all references) to djb2_hash just for consistency.

@theandrew168 theandrew168 requested review from theandrew168 and removed request for theandrew168 October 18, 2021 14:47
@aw
Copy link
Contributor Author

aw commented Oct 18, 2021

Oh nooo!! I totally missed the fact that li expands to lui and addi. Nice catch! Ah yes K&R C, I should have known... Knuth has a very similar hash function in Vol. 3 of TAOCP, but I felt djb2 was better.

Now thinking about it, I do somewhat prefer less instructions even though you say it won't make a difference... let me think about this some more.

@aw
Copy link
Contributor Author

aw commented Oct 19, 2021

Indeed, it adds only one instruction to tpop_hash, not the tpop_hash_loop itself:

 146:	6285                	lui	t0,0x1
 148:	50528293          	addi	t0,t0,1285 # 0x1505
 14c:	02100313          	li	t1,33

As you said, it shouldn't make a big difference.

@theandrew168
Copy link
Owner

Awesome, I like this change! I've always been thinking (in the back of my mind) about how (or if) the interpreter should handle hash collisions. In Forth, overriding words with new implementations is super common so collisions are "expected" behavior in that sense. If two words ever collided unintentionally, then the program would be in a bad state without the user really knowing why (unless they know to run the hashcheck.py script beforehand).

I thought about simply printing something special if a word gets defined that replaces an existing one (maybe print ok* or something slightly different). That way, a user could at least notice and say "wait up, this is a new word and shouldn't be overriding anything". It also may be just fine to simply not worry about this at all. DJB2 is a good hash and we've got 2^32 slots to fill: an accidental collision is super unlikely.

@theandrew168 theandrew168 merged commit 22cb9cf into theandrew168:main Oct 19, 2021
@aw
Copy link
Contributor Author

aw commented Oct 19, 2021

Yep, I came to pretty much the same conclusion regarding the likelyhood of a collision.

I was initially skeptical about word hashing due to the "performance hit", but in this case it means a word name only occupies 4-Bytes instead of N-Bytes, which provides quite a bit of savings as the dictionary grows.. which is particularly useful on small MCUs.

Now we just need to fix the O(N) lookup time for user-defined dictionary words. I'm testing some ideas for this and will get back to you in a separate issue.

@theandrew168
Copy link
Owner

theandrew168 commented Oct 19, 2021

Yea, the hashcheck.py script does that quick math on how many bytes are saved by using a static 4-byte hash instead of spelling out each word name (and padding to a 4 byte boundary). The results were crazy! Almost 1K for just the words in my basic prelude, rcu, and gpio lexicons. Definitely a win.

Yea, the O(N) lookups are slightly fussy, but thankfully they aren't on the hot path. Once a word is looked up and linked, it will never been looked up again (by the word being defined). I could possibly see an argument claiming that "startup times" could be negatively impacted by the linear lookup behavior: defining M words in sequence could lead to O(N*M) worst case. In practice, however, new words tend to refer to other recent words which keeps the behavior more linear than quadratic.

Plus, I worry that doing anything fancier (some sort of trie?) would hurt code size and general readability. It is probably worth some research, though.

@aw aw deleted the hash-djb2 branch October 25, 2021 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants