Is there any way to extract the content of the file? #149

ghost · 2013-02-20T09:35:52Z

I tried to retrieve the content from TSK_EXTRACTED_TEXT, but it seems to be empty. Is there any other way to extract the text content?
P.S. I need the content of every file in the media.

adam-m · 2013-02-20T14:31:27Z

It is possible by using the keyword search module API, which can return the extracted text from the keyword search index for a given file ID.

One could write a custom reporting module that does it for files of interest.

bcarrier · 2013-02-20T14:58:20Z

The Autopsy code is not using the TSK_EXTRACTED_TEXT attribute at this point. The TSK Framework modules do, but Autopsy does not. The short reason is that SOLR was passing the output of Tika directly to Lucene, so we never saw the text by itself. We've changed this so that we run Tika and pass the text into Lucene, but we haven't gone so far as to post it to the blackboard as well.

Adam, can you link to the code in the content viewer that does the KeywordSearch lookup and gets the text?

adam-m · 2013-02-20T17:20:17Z

Keyword Search is not available yet as a "service" via Lookup, so the way to use it is via a module dependency (add dependency on Keyword Search module).

Here is some sample code I just wrote (untested, but it should work). The sample doesn't contain exception handling.

    Server keywordSearchServer = KeywordSearch.getServer();

    // figure out files of interest (e.g. by using the FileManager API)
    FileManager fm = Case.getCurrentCase().getServices().getFileManager();
    List<FsContent> files = fm.findFiles(image, ".avi");

    for (FsContent file : files) {
        long fileId = file.getId();

        boolean isIndexed = keywordSearchServer.queryIsIndexed(fileId);
        if (!isIndexed) {
            // skip, file not in index (no ingest ran, file skipped, etc.)
            continue;
        }

        // we split all text into chunks of up to 1MB each
        int numChunks = keywordSearchServer.queryNumFileChunks(fileId);

        // for every chunk, get the text
        // chunk 0 stores metadata only and no text
        // we only care about text (chunks >= 1)
        if (numChunks < 1) {
            // skip, no text in index
            continue;
        }

        // go over every chunk
        for (int chunk = 1; chunk <= numChunks; ++chunk) {
            String chunkTxt = keywordSearchServer.getSolrContent(file, chunk);
            // do something with chunkTxt,
            // e.g. append chunkTxt to an output stream
        }
    }
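The chunk loop above can also be tried out without a running case. The following is only a rough, self-contained sketch of the same idea: `ChunkIndex`, `InMemoryChunkIndex`, `getChunkText` and `extractText` are hypothetical names made up for illustration and are not part of the Autopsy or Keyword Search API. The fake index looks text up by file ID rather than by file object, and it simply mimics the convention described above that text lives in chunks 1 through `numChunks`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the three server calls used in the sample above.
interface ChunkIndex {
    boolean queryIsIndexed(long fileId);
    int queryNumFileChunks(long fileId);
    String getChunkText(long fileId, int chunk);
}

// In-memory fake index (illustration only, not the real keyword search server).
class InMemoryChunkIndex implements ChunkIndex {
    private final Map<Long, List<String>> textChunks = new HashMap<Long, List<String>>();

    void put(long fileId, List<String> chunks) {
        textChunks.put(fileId, new ArrayList<String>(chunks));
    }

    public boolean queryIsIndexed(long fileId) {
        return textChunks.containsKey(fileId);
    }

    public int queryNumFileChunks(long fileId) {
        List<String> chunks = textChunks.get(fileId);
        return chunks == null ? 0 : chunks.size();
    }

    // Text chunks are numbered from 1, mirroring the convention in the sample.
    public String getChunkText(long fileId, int chunk) {
        return textChunks.get(fileId).get(chunk - 1);
    }
}

class ChunkTextSketch {
    // Returns the concatenated text of all text chunks for the file,
    // or null if the file is not indexed or has no text chunks.
    static String extractText(ChunkIndex index, long fileId) {
        if (!index.queryIsIndexed(fileId)) {
            return null; // file not in index
        }
        int numChunks = index.queryNumFileChunks(fileId);
        if (numChunks < 1) {
            return null; // indexed, but no text
        }
        StringBuilder text = new StringBuilder();
        for (int chunk = 1; chunk <= numChunks; ++chunk) {
            text.append(index.getChunkText(fileId, chunk));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        InMemoryChunkIndex index = new InMemoryChunkIndex();
        index.put(42L, Arrays.asList("first chunk, ", "second chunk"));
        System.out.println(extractText(index, 42L));
    }
}
```

In a real module the fake index would be replaced by the keyword search server calls from the sample, and for large files it is probably better to write each chunk to an output stream as it is read instead of building one big string.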

adam-m closed this as completed Feb 21, 2013

ghost mentioned this issue Mar 7, 2013

Cannot get the total word from unallocated files #157

Closed
