Improve the detection of Matroska files (MKV, MKA and WEBM) #1786

Closed
hugohmk opened this issue Jul 26, 2023 · 20 comments · Fixed by #1794

@hugohmk
Contributor

hugohmk commented Jul 26, 2023

I'm working on a case that has a 10TB HD filled mostly with video files (12k files, 8.8TB total size). The report was generated using the "ThumbsOnly" option for the video files, but it was over 700GB in size (60GB index, 640GB exported files) and the process also took a long time to complete.
I don't know if it's the case, but it seemed that the report process was splitting those video files and indexing their contents since ParsingTask (RawStringParser, MP4Parser and TextAndCSVParser) and ExportFileTask (video fragments that were processed by FragmentLargeBinaryTask) were highly active.
A workaround solution that I found was to generate another report from inside the 700GB report, resulting in a much smaller size (6GB).
I was using IPED v4.1.3 (could not generate the report in version 4.1.2 because I was getting ArrayIndexOutOfBoundsException errors, possibly related to #1676).

@wladimirleite
Member

I can take a look into this later today.

@hugohmk
Contributor Author

hugohmk commented Jul 26, 2023

Some other info that may be relevant: the HD data was inside an EnCase file (e01); several video files had unusual extensions (and were categorized as "Other Files" instead of "Videos"); the case was actually processed with IPED 4.1.2, but the report was generated with 4.1.3.

@hugohmk
Contributor Author

hugohmk commented Jul 26, 2023

Another relevant piece of information that I've just found out: some videos with strange extensions had duplicates with common video extensions. Once the video file was added to a bookmark (including all duplicates), the file with the strange extension was also added to the bookmark.

@wladimirleite
Member

wladimirleite commented Jul 26, 2023

Can you share a couple of these files (videos with strange extensions and their duplicates)?
I am not sure if this is actually related to the issue, but it may help to reproduce the problem.

EDIT: You can put them on our local server, as they probably have sensitive content (and are likely large).

@hugohmk
Contributor Author

hugohmk commented Jul 26, 2023

Sure, I'll send a pm with some info.

@wladimirleite wladimirleite self-assigned this Jul 26, 2023
@wladimirleite
Member

I was able to reproduce the issue with the files @hugohmk provided (thank you, Hugo!).
Looking into it in more detail, I believe the root cause is the incorrect file type assigned to MKV videos that have odd (incorrect) extensions.
"xyz.mkv" is detected as video/x-matroska, while the same file renamed to "xyz.KWYOuj" is detected as application/x-matroska. After the incorrect detection, the file is split, processed by a generic "strings parser", etc. (causing the issues @hugohmk described).

The misclassification happens because tika-mimetypes.xml only partially handles the file signature of such files, relying on the extension in some cases, as we can see below:

  <mime-type type="application/x-matroska">
    <_comment>Matroska Media Container</_comment>
    <!-- Common magic across all Matroska varients -->
    <!-- For full detection, we need a custom Detector, see TIKA-1180 -->
    <magic priority="40">
      <match value="0x1A45DFA3" type="string" offset="0" />
    </magic>
  </mime-type>

  <mime-type type="video/x-matroska">
    <sub-class-of type="application/x-matroska"/>
    <glob pattern="*.mkv" />
    <!-- Note: The magic value below isn't present in all MKV files -->
    <magic priority="50">
      <match value="0x1A45DFA3934282886D6174726F736B61" type="string" offset="0" />
    </magic>
  </mime-type>
  <mime-type type="audio/x-matroska">
    <sub-class-of type="application/x-matroska"/>
    <glob pattern="*.mka" />
  </mime-type>

  <mime-type type="video/webm">
    <sub-class-of type="application/x-matroska"/>
    <glob pattern="*.webm" />
  </mime-type>
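
To make the effect of these stock rules concrete, here is a minimal Python sketch. It is a simplified model and not Tika's actual resolution algorithm: the function name `detect_stock` and the way magic and glob results are combined are assumptions made for illustration only.

```python
from fnmatch import fnmatchcase

# Magic values copied from the stock tika-mimetypes.xml entries quoted above.
EBML_MAGIC = bytes.fromhex("1A45DFA3")
MKV_FULL_MAGIC = bytes.fromhex("1A45DFA3934282886D6174726F736B61")

# Glob patterns of the three sub-types.
GLOBS = {
    "*.mkv": "video/x-matroska",
    "*.mka": "audio/x-matroska",
    "*.webm": "video/webm",
}


def detect_stock(data: bytes, name: str) -> str:
    """Hypothetical, simplified model of the stock rules (not Tika code)."""
    # The long magic only matches when "matroska" sits at one fixed position.
    if data.startswith(MKV_FULL_MAGIC):
        return "video/x-matroska"
    if not data.startswith(EBML_MAGIC):
        return "application/octet-stream"
    # Only the generic 4-byte magic matched: the extension decides the sub-type.
    lowered = name.lower()
    for pattern, mime in GLOBS.items():
        if fnmatchcase(lowered, pattern):
            return mime
    # No known extension: the file stays with the generic container type.
    return "application/x-matroska"
```

Under this model, a Matroska header that does not start with the exact 16-byte sequence is reported as video/x-matroska when named "xyz.mkv" and as application/x-matroska when named "xyz.KWYOuj", which mirrors the behavior described above.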

A quick online search suggests that separating Matroska videos from audio files is not trivial (just looking at the signature).
There is a 10-year-old ticket about this in Tika, but my suggestion is to improve Tika's configuration in our custom signatures file, based on the observation of a large set of such files.
Any other ideas?

@lfcnassif
Member

Another relevant piece of information that I've just found out: some videos with strange extensions had duplicates with common video extensions. Once the video file was added to a bookmark (including all duplicates), the file with the strange extension was also added to the bookmark.

What was the detected contentType of the videos with uncommon extensions? If it was application/octet-stream (i.e. unknown), splitting large unknown files and indexing them with RawStringsParser is the expected behavior. Maybe we could add some new signatures (if they exist) to detect those video files as such...

@wladimirleite
Member

Maybe we could add some new signatures (if they exist) to detect those video files as such...

Ok, that is my suggestion too (posted one second before) 😃
I will change the title and tag as an enhancement, ok?

@wladimirleite wladimirleite changed the title from "Large report size" to "Improve the detection of Matroska files (MKV, MKA, WEBM?)" Jul 26, 2023
@lfcnassif
Member

lfcnassif commented Jul 26, 2023

What about trying out some of the libraries posted on TIKA-1180 (https://issues.apache.org/jira/browse/TIKA-1180)? The first is published on maven central and could be used to code a custom detector. The second already is a Tika detector implementation and I think it just needs to be included in classpath and declared in a service provider resource file. I didn't find it in Maven central, but since it is Apache licensed, we could embed its code until it is published.

@wladimirleite
Member

What about trying out some of the libraries posted on TIKA-1180 (https://issues.apache.org/jira/browse/TIKA-1180)? The first is published on maven central and could be used to code a custom detector. The second already is a Tika detector implementation and I think it just needs to be included in classpath and declared in a service provider resource file. I didn't find it in Maven central, but since it is Apache licensed, we could embed its code until it is published.

That can be another possible solution.
I looked through the code of the second option and will try to use it with the files I am collecting.
However, if it is possible to use a simpler solution (like a signature-based configuration), I think it would be better.

@lfcnassif
Member

That can be another possible solution.
I looked through the code of the second option and will try to use it with the files I am collecting.
However, if it is possible to use a simpler solution (like a signature-based configuration), I think it would be better.

Sure! Agreed!

And I just found matroska-tika already has a service provider file:
https://github.com/OmarAssadi/matroska-tika/blob/main/src/main/resources/META-INF/services/org.apache.tika.detect.Detector
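
For context on what a content-based detector has to do, here is a minimal Python sketch that walks the EBML header and reads the DocType element (ID 0x4282). It is not the matroska-tika implementation; the function names `read_vint` and `read_doctype` are hypothetical, and the sketch only covers the header, without looking at tracks.

```python
from typing import Optional, Tuple

EBML_MAGIC = bytes.fromhex("1A45DFA3")  # ID of the EBML header element
DOCTYPE_ID = 0x4282                     # ID of the DocType element


def read_vint(buf: bytes, pos: int, keep_marker: bool) -> Tuple[int, int]:
    """Read an EBML variable-length integer; return (value, next position)."""
    first = buf[pos]  # IndexError if pos is past the end
    length, mask = 1, 0x80
    # The number of leading zero bits tells how many extra bytes follow.
    while length <= 8 and not (first & mask):
        mask >>= 1
        length += 1
    if length > 8 or pos + length > len(buf):
        raise ValueError("invalid or truncated vint")
    # Element IDs keep the length marker bit, data sizes drop it.
    value = first if keep_marker else first & (mask - 1)
    for byte in buf[pos + 1 : pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def read_doctype(data: bytes) -> Optional[str]:
    """Return the DocType string (e.g. 'matroska' or 'webm'), or None."""
    if not data.startswith(EBML_MAGIC):
        return None
    try:
        header_size, pos = read_vint(data, 4, keep_marker=False)
        end = min(pos + header_size, len(data))
        while pos < end:
            elem_id, pos = read_vint(data, pos, keep_marker=True)
            size, pos = read_vint(data, pos, keep_marker=False)
            if elem_id == DOCTYPE_ID:
                return data[pos : pos + size].rstrip(b"\x00").decode("ascii")
            pos += size  # skip elements we are not interested in
    except (ValueError, IndexError):
        return None
    return None
```

Reading DocType this way separates WEBM from Matroska, but it does not by itself tell MKV from MKA, since both use the same DocType.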

@wladimirleite wladimirleite changed the title from "Improve the detection of Matroska files (MKV, MKA, WEBM?)" to "Improve the detection of Matroska files (MKV, MKA and WEBM)" Jul 28, 2023
@wladimirleite
Member

While collecting sample files to test the possible solutions to this issue, I noticed that MKA files (audio) are extremely rare. I found a few online but none in about 2,000 cases. On the other hand, it was easy to collect thousands of WEBM and MKV videos.
So using application/x-matroska when it is not clear which type should be assigned does not seem a good idea, considering IPED's typical processing pipeline.

After collecting sample files, I will test the current signature configuration, the detector pointed out by @lfcnassif and possibly a new signature-based configuration that I will propose, and see how each one performs (in terms of correct type identification, after removing file extension information).

@lfcnassif
Member

I found a few online but none in about 2,000 cases.

Thank you for this awesome crawling!

@wladimirleite
Member

wladimirleite commented Aug 1, 2023

After collecting some files, I ran the first test, to get a baseline of the current identification status of these file types (WEBM, MKV and MKA).

Correct Detection vs Detector (content-based only, file extensions were hidden for these tests):

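| Type | Sample Files | Tika (current) | Matroska Detector | Custom Signature |
|-------|--------------|----------------|-------------------|------------------|
| MKV | 3056 | 157 | 3047 | 3056 |
| WEBM | 1332 | 0 | 1316 | 1332 |
| MKA | 2 | 0 | 0 | 0 |
| Total | 4390 | 157 | 4363 | 4388 |
| % | 100.0 | 3.6 | 99.4 | 99.95 |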
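
As a quick check of the arithmetic, the totals and percentages can be recomputed from the per-type counts. This is a small Python sketch; the dictionary layout and the function name `summarize` are assumptions for illustration, and the counts are copied from the table.

```python
# Per-type counts: (sample files, Tika current, Matroska Detector, Custom Signature)
COUNTS = {
    "MKV": (3056, 157, 3047, 3056),
    "WEBM": (1332, 0, 1316, 1332),
    "MKA": (2, 0, 0, 0),
}
DETECTORS = ("Tika (current)", "Matroska Detector", "Custom Signature")


def summarize(counts):
    """Return {detector: (correct files, percentage of all sample files)}."""
    total_samples = sum(row[0] for row in counts.values())
    result = {}
    for i, detector in enumerate(DETECTORS, start=1):
        correct = sum(row[i] for row in counts.values())
        result[detector] = (correct, 100.0 * correct / total_samples)
    return result


for detector, (correct, pct) in summarize(COUNTS).items():
    print(f"{detector}: {correct} correct ({pct:.2f}%)")
```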

So far the conclusion is that the current detection is heavily dependent on the file extension (which is fine for most cases, but may fail, as in the case @hugohmk was working on).

@wladimirleite
Member

Table updated ("Matroska Detector" column) to include results using the detector posted on TIKA-1180, mentioned before in this discussion.

The custom detector results were much better, as it does use mime type "application/x-matroska".
It still misses a few files. Looking at them in the hexadecimal viewer, their content seems valid.
It does not support MKA files (audio).

Based on what I saw in the content of the samples, and looking into the format specification (https://www.matroska.org/index.html), I think it is possible to create a simple signature-based configuration that will cover all (or at least the vast majority) of the cases.
I will try to do that and will update the results later.

@wladimirleite
Member

A simple signature definition took care of all sample files, identifying them and separating WEBM from MKV.
Updated the table above with the column "Custom Signature".

Separating MKV from MKA is harder. I tried adding another match clause (using a tag value). That worked (identified MKAs correctly) but created false positives (MKVs identified as audio, which is bad). As MKA files are extremely rare, I propose to use the signature to identify Matroska videos, and then classify them as audio if the file extension is ".mka".

Below are the lines I added to our CustomSignatures.xml:

	<mime-type type="video/x-matroska">
		<magic priority="60">
			<match value="0x1A45DFA3" type="string" offset="0">
				<match value="matroska" type="string" offset="4:64">
				</match>
			</match>
		</magic>
		<glob pattern="*.mkv" />
	</mime-type>

	<mime-type type="audio/x-matroska">
		<sub-class-of type="video/x-matroska" />
		<glob pattern="*.mka" />
	</mime-type>

	<mime-type type="video/webm">
		<magic priority="60">
			<match value="0x1A45DFA3" type="string" offset="0">
				<match value="webm" type="string" offset="4:64">
				</match>
			</match>
		</magic>
		<glob pattern="*.webm" />
	</mime-type>
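
The logic of these nested matches can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `detect_custom` is hypothetical, checking "matroska" before "webm" is an arbitrary choice (both rules have the same priority), and the way the ".mka" glob refines the result is a simplified reading of the rules, not Tika's resolution code.

```python
from typing import Optional

EBML_MAGIC = bytes.fromhex("1A45DFA3")


def _token_in_range(data: bytes, token: bytes, first: int = 4, last: int = 64) -> bool:
    """True if token starts at any offset in [first, last], like offset="4:64"."""
    return data.find(token, first, last + len(token)) != -1


def detect_custom(data: bytes, name: str = "") -> Optional[str]:
    """Hypothetical model of the custom signatures; None means 'no match'."""
    # Outer match: the EBML magic must be at offset 0.
    if not data.startswith(EBML_MAGIC):
        return None
    # Inner match: the DocType string somewhere near the start of the header.
    if _token_in_range(data, b"matroska"):
        # Audio is only assumed when the extension says so.
        if name.lower().endswith(".mka"):
            return "audio/x-matroska"
        return "video/x-matroska"
    if _token_in_range(data, b"webm"):
        return "video/webm"
    return None
```

With this approach a renamed file such as "xyz.KWYOuj" is still classified as video/x-matroska, since the decision no longer depends on the extension except for the ".mka" case.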

I will process the E01 file in which @hugohmk identified the issue and see if all videos are now identified correctly.
If everything goes well, I will submit a PR with this configuration.

@lfcnassif
Member

Great, thank you @tc-wleite! New signatures look fine to me!

@lfcnassif
Member

Maybe they could be contributed back to Apache Tika project.

@wladimirleite
Member

Maybe they could be contributed back to Apache Tika project.

I thought about that, as the new signatures seem better than the ones currently used. I will try to submit a PR there too.
@hugohmk's case processing is at 65%, and so far all videos with wrong extensions were correctly identified.

@wladimirleite
Member

@hugohmk's case processing just finished. All videos with wrong extensions were correctly identified.
