@alexanderadam alexanderadam commented Oct 11, 2025

This will only work for "simple" patterns, because Ruby uses a different regex engine than the one the tika.xml patterns were written for.
Still, out of the 36 regular expressions, only the one for application/x-dbf fails.
So I guess we're good?

I didn't even expect it to work this reliably tbh. 😉
And we can get rid of that html definition in lib/marcel/mime_type/definitions.rb now. 🎉

Furthermore I hoped that this would also make #130 obsolete, because

    <magic priority="60">
      <match value="(?i)&lt;(html|head|body|title|div)[ >]" type="regex" offset="0"/>
      <match value="(?i)&lt;h[123][ >]" type="regex" offset="0"/>
    </magic>

looked pretty decent at first, but a JS file containing a string like html='<html><h1>'; will still trick it, right? 🤔
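Whether that JS snippet fools the rule depends on how offset="0" is applied. Here is a minimal sketch of the two readings, assuming the pattern is either anchored at the first byte or searched anywhere in the data. The helper names are hypothetical and not part of Marcel, and I'm not claiming this PR does either one:

```ruby
# Pattern text taken from the first <match> element above, with &lt; unescaped.
pattern = "(?i)<(html|head|body|title|div)[ >]"

# Hypothetical helper: treat offset="0" as "must match at the very first byte".
def matches_at_start?(pattern, data)
  Regexp.new("\\A(?:#{pattern})").match?(data)
end

# Hypothetical helper: ignore the offset and search the whole string.
def matches_anywhere?(pattern, data)
  Regexp.new(pattern).match?(data)
end

js   = "html='<html><h1>';"
html = "<HTML>\n<body>"

matches_at_start?(pattern, js)    # => false, the data starts with "html=", not "<"
matches_anywhere?(pattern, js)    # => true, "<html>" appears inside the JS string
matches_at_start?(pattern, html)  # => true
```

With the anchored reading, only a file whose very first bytes look like an HTML tag would match, so the JS string would need to sit at the start of the file to cause a false positive.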

While I'm generally happy with how it turned out, I would really love it if somebody had a better idea than switching $VERBOSE off and on again.
Or we leave it and just accept the warnings?

I couldn't find anything nice to solve this better yet.
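For reference, the off-and-on approach can at least be kept in one place. A minimal sketch, assuming a hypothetical helper name that is not the code in this PR, which restores the previous $VERBOSE value even if the block raises:

```ruby
# Hypothetical helper: run a block with Ruby warnings suppressed.
def with_warnings_silenced
  previous = $VERBOSE
  $VERBOSE = nil        # nil suppresses warnings entirely
  yield
ensure
  $VERBOSE = previous   # always restore, even when the block raises
end

# Compile a pattern from tika.xml without emitting warnings.
compiled = with_warnings_silenced { Regexp.new("(?i)<h[123][ >]") }
```

This does not avoid the toggle, it only limits it to the block where the patterns are compiled.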

Anyway, this was a lot of fun to implement but I really should go to bed now. 😆

EDIT: So this is actually more complicated. I had failing runs on TruffleRuby and JRuby.
EDIT2: Okay, it looks like everything is running now on all the platforms. What a ride! 😆

/CC @tomhughes

PS: I'm looking for a new adventure, in case anybody is looking for a Ruby/Rails/Crystal dev.
PPS: Would you be so kind as to add the hacktoberfest-accepted label to this PR if you find it helpful? 🥺

@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 61d8d7d to be31098 Compare October 11, 2025 00:30
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch 2 times, most recently from 51a18d3 to 48ae619 Compare October 11, 2025 13:27
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 48ae619 to e0fde31 Compare October 11, 2025 15:06
@alexanderadam alexanderadam changed the title add tika.xml regex support WIP: add tika.xml regex support Oct 11, 2025
@alexanderadam alexanderadam marked this pull request as draft October 11, 2025 16:03
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch 2 times, most recently from 31b0420 to 73165bd Compare October 11, 2025 16:25
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 73165bd to c70d3d3 Compare October 11, 2025 16:41
@alexanderadam alexanderadam changed the title WIP: add tika.xml regex support add tika.xml regex support Oct 11, 2025
@alexanderadam alexanderadam marked this pull request as ready for review October 11, 2025 16:48