Improve the detection of Matroska files (MKV, MKA and WEBM) #1786
Comments
I can take a look into this later today. |
Some other info that may be relevant: the HD data was inside an EnCase file (E01); several video files had unusual extensions (and were categorized as "Other Files" instead of "Videos"); the case was actually processed with IPED 4.1.2, but the report was generated with 4.1.3. |
Another relevant piece of information that I've just found out: some videos with strange extensions had duplicates with common video extensions. Once a video file was added to a bookmark (including all duplicates), the file with the strange extension was also added to the bookmark. |
Can you share a couple of these files (videos with strange extensions and their duplicates)? EDIT: You can put them on our local server, as they probably have sensitive content (and are probably large). |
Sure, I'll send a pm with some info. |
I was able to reproduce the issue with the files @hugohmk provided (thank you, Hugo!). The misclassification happens because of the current Matroska type definitions:
<mime-type type="application/x-matroska">
<_comment>Matroska Media Container</_comment>
<!-- Common magic across all Matroska varients -->
<!-- For full detection, we need a custom Detector, see TIKA-1180 -->
<magic priority="40">
<match value="0x1A45DFA3" type="string" offset="0" />
</magic>
</mime-type>
<mime-type type="video/x-matroska">
<sub-class-of type="application/x-matroska"/>
<glob pattern="*.mkv" />
<!-- Note: The magic value below isn't present in all MKV files -->
<magic priority="50">
<match value="0x1A45DFA3934282886D6174726F736B61" type="string" offset="0" />
</magic>
</mime-type>
<mime-type type="audio/x-matroska">
<sub-class-of type="application/x-matroska"/>
<glob pattern="*.mka" />
</mime-type>
<mime-type type="video/webm">
<sub-class-of type="application/x-matroska"/>
<glob pattern="*.webm" />
</mime-type>
A quick online search suggests that separating Matroska video from audio is not trivial when looking only at the signature. |
What was the detected contentType of the videos with uncommon extensions? If it was application/octet-stream (i.e. unknown), splitting large unknown files and indexing them with RawStringsParser is the expected behavior. Maybe we could add some new signatures (if they exist) to detect those video files as such... |
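To illustrate why the fixed-offset `video/x-matroska` magic quoted above is fragile, here is a minimal Python sketch. It is not Tika code: the function name and the two sample headers are made up for illustration. The long magic only matches when the EBML header has one specific size byte and the DocType element comes first, so a header that starts with another element (here EBMLVersion) falls back to the generic type.

```python
# Illustrative sketch only; names and sample headers are hypothetical.
EBML_MAGIC = bytes.fromhex("1A45DFA3")
# Magic from the video/x-matroska entry: EBML ID, header size byte 0x93,
# DocType ID 0x4282, size byte 0x88, then the ASCII string "matroska".
MKV_FIXED_MAGIC = bytes.fromhex("1A45DFA3934282886D6174726F736B61")


def detect_fixed_offset(data: bytes) -> str:
    """Mimic the two magic entries: the more specific one is checked first."""
    if data.startswith(MKV_FIXED_MAGIC):
        return "video/x-matroska"
    if data.startswith(EBML_MAGIC):
        return "application/x-matroska"
    return "application/octet-stream"


# Made-up header where DocType immediately follows the EBML header ID.
doctype_first = MKV_FIXED_MAGIC + b"\x00" * 16

# Made-up header where EBMLVersion (ID 0x4286, size 1, value 1) precedes
# DocType, so "matroska" no longer sits at the fixed offset.
version_first = (
    EBML_MAGIC + b"\x8f"
    + bytes.fromhex("42868101")
    + bytes.fromhex("428288") + b"matroska"
)

print(detect_fixed_offset(doctype_first))
print(detect_fixed_offset(version_first))
```

Both sample headers declare the same DocType, yet only the first one is recognized as video.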
Ok, that is my suggestion too (posted one second before) 😃 |
What about trying out some of the libraries posted on TIKA-1180 (https://issues.apache.org/jira/browse/TIKA-1180)? The first is published on Maven Central and could be used to code a custom detector. The second is already a Tika detector implementation, and I think it just needs to be included in the classpath and declared in a service provider resource file. I didn't find it on Maven Central, but since it is Apache licensed, we could embed its code until it is published. |
That can be another possible solution. |
Sure! Agreed! And I just found matroska-tika already has a service provider file: |
While collecting sample files to test the possible solutions to this issue, I noticed that MKA (audio) files are extremely rare. I found a few online but none in about 2,000 cases. On the other hand, it was easy to collect thousands of WEBM and MKV videos. After collecting sample files, I will test the current signature configuration, the detector pointed out by @lfcnassif and possibly a new signature-based configuration that I will propose, and see how each one performs (in terms of correct type identification, after removing file extension information). |
Thank you for this awesome crawling! |
After collecting some files, I ran the first test, to get a baseline of the current identification status of these file types (WEBM, MKV and MKA). Correct Detection vs Detector (content-based only, file extensions were hidden for these tests):
So far the conclusion is that the current detection is heavily dependent on file signature (which is fine for most of the cases, but may fail, like in the case @hugohmk was working on). |
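The comparison described above (content-based detection only, extensions hidden) can be sketched as a small scoring helper. This is a minimal sketch under stated assumptions: `score_detector`, `magic_only` and the two toy samples are hypothetical and are not the files or the detectors actually tested in this thread.

```python
# Illustrative sketch only; the helper, detector and samples are hypothetical.
from collections import Counter

EBML_MAGIC = bytes.fromhex("1A45DFA3")


def score_detector(samples, detector):
    """Return the fraction of correctly detected samples per expected type.

    samples:  iterable of (expected_mime, header_bytes) pairs
    detector: callable taking bytes and returning a MIME type string
    """
    total, correct = Counter(), Counter()
    for expected, data in samples:
        total[expected] += 1
        if detector(data) == expected:
            correct[expected] += 1
    return {mime: correct[mime] / total[mime] for mime in total}


def magic_only(data: bytes) -> str:
    """Toy detector that only knows the 4-byte EBML magic."""
    if data.startswith(EBML_MAGIC):
        return "application/x-matroska"
    return "application/octet-stream"


# Toy data: no file names are passed, so extensions cannot influence the result.
toy_samples = [
    ("video/webm", EBML_MAGIC + b"\x8b" + bytes.fromhex("42868101428284") + b"webm"),
    ("video/x-matroska", EBML_MAGIC + b"\x8f" + bytes.fromhex("42868101428288") + b"matroska"),
]

print(score_detector(toy_samples, magic_only))
```

Because the toy detector never returns a specific subtype, it scores zero on both toy types, which is the kind of gap a per-type score makes visible.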
Table updated ("Matroska Detector" column) to include results using the detector posted on TIKA-1180, mentioned before in this discussion. The custom detector results were much better, as it does use mime type "application/x-matroska". Based on what I saw in samples content, and looking into the format specification (https://www.matroska.org/index.html), I think it is possible to create a simple signature-based configuration that will cover all (or at least the vast majority) of the cases. |
A simple signature definition took care of all sample files, identifying them and separating WEBM from MKV. Separating MKV from MKA is harder. I tried adding another match clause (using a tag value). That worked (it identified MKAs correctly) but created false positives (MKVs identified as audio, which is bad). As MKA files are extremely rare, I propose to use the signature to identify Matroska videos, and then classify them as audio if the file extension is ".mka". Below are the lines I added to our signature definitions:
<mime-type type="video/x-matroska">
<magic priority="60">
<match value="0x1A45DFA3" type="string" offset="0">
<match value="matroska" type="string" offset="4:64">
</match>
</match>
</magic>
<glob pattern="*.mkv" />
</mime-type>
<mime-type type="audio/x-matroska">
<sub-class-of type="video/x-matroska" />
<glob pattern="*.mka" />
</mime-type>
<mime-type type="video/webm">
<magic priority="60">
<match value="0x1A45DFA3" type="string" offset="0">
<match value="webm" type="string" offset="4:64">
</match>
</match>
</magic>
<glob pattern="*.webm" />
</mime-type>
I will process the E01 file in which @hugohmk identified the issue and see if all videos are now identified correctly. |
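As a rough illustration of the rule these entries express, here is a minimal Python sketch. It is not Tika or IPED code and rests on two assumptions: that `offset="4:64"` lets the inner match start at any offset from 4 to 64, and that a file with the EBML magic but no recognized DocType still falls back to the generic `application/x-matroska` type. The function names and sample headers are hypothetical.

```python
# Illustrative sketch only; names and sample headers are hypothetical.
import os

EBML_MAGIC = bytes.fromhex("1A45DFA3")


def _found_in_window(data: bytes, needle: bytes, start: int = 4, end: int = 64) -> bool:
    # Assumption: the match may START at any offset in [start, end].
    return data.find(needle, start, end + len(needle)) != -1


def detect(data: bytes, filename: str = "") -> str:
    if not data.startswith(EBML_MAGIC):
        return "application/octet-stream"
    if _found_in_window(data, b"webm"):
        return "video/webm"
    if _found_in_window(data, b"matroska"):
        # MKA is told apart from MKV by extension only, as proposed above.
        ext = os.path.splitext(filename)[1].lower()
        return "audio/x-matroska" if ext == ".mka" else "video/x-matroska"
    # Assumption: the generic EBML entry still applies in this case.
    return "application/x-matroska"


# Made-up headers: EBMLVersion element first, then DocType.
mkv_header = EBML_MAGIC + b"\x8f" + bytes.fromhex("42868101428288") + b"matroska"
webm_header = EBML_MAGIC + b"\x8b" + bytes.fromhex("42868101428284") + b"webm"

print(detect(mkv_header, "clip.xyz"))
print(detect(mkv_header, "song.mka"))
print(detect(webm_header, "clip.xyz"))
```

Searching a window instead of a fixed offset is what lets the DocType be found when other header elements precede it.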
Great, thank you @tc-wleite! New signatures look fine to me! |
Maybe they could be contributed back to Apache Tika project. |
I thought about that, as the new signatures seem better than the ones currently used. I will try to submit a PR there too. |
@hugohmk's case processing just finished. All videos with wrong extensions were correctly identified. |
I'm working on a case that has a 10TB HD filled mostly with video files (12k files, 8.8TB total size). The report was generated using the "ThumbsOnly" option for the video files, but it was over 700GB in size (60GB index, 640GB exported files) and the process also took a long time to complete.
I'm not sure, but it seemed that the report process was splitting those video files and indexing their contents, since ParsingTask (RawStringParser, MP4Parser and TextAndCSVParser) and ExportFileTask (video fragments that were processed by FragmentLargeBinaryTask) were highly active.
A workaround I found was to generate another report from inside the 700GB report, which resulted in a much smaller size (6GB).
I was using IPED v4.1.3 (could not generate the report in version 4.1.2 because I was getting ArrayIndexOutOfBoundsException errors, possibly related to #1676).