Is there any way to extract the content of the file? #149

Closed
ghost opened this issue Feb 20, 2013 · 3 comments

Comments

@ghost

ghost commented Feb 20, 2013

I tried to retrieve the content from TSK_EXTRACTED_TEXT, but it seemed to be empty. Is there any other way to extract the text content?
P.S. I need the content for every file in the media.

@adam-m
Contributor

adam-m commented Feb 20, 2013

It is possible using the keyword search module API, which can return the extracted text from the keyword search index for a given file ID.

One could write a custom reporting module that does it for files of interest.

@bcarrier
Member

The Autopsy code does not use the TSK_EXTRACTED_TEXT attribute at this point. The TSK Framework modules do, but Autopsy does not. In brief, this is because Solr was passing the output of Tika directly to Lucene, so we never saw the text on its own. We've changed this so that we run Tika and pass the text into Lucene, but we haven't gone so far as to post it to the blackboard as well.

Adam, can you make a link to the code in the content viewer to get the KeywordSearch lookup and get the text?

@adam-m
Contributor

adam-m commented Feb 20, 2013

Keyword Search is not yet available as a "service" via Lookup, so the way to use it is through a module dependency (add a dependency on the Keyword Search module).

Here is some sample code I just wrote (untested, but it should work). The sample doesn't include exception handling.

    Server keywordSearchServer = KeywordSearch.getServer();
    // Figure out the files of interest (e.g. by using the FileManager API).
    // `image` is the image to search and must be obtained elsewhere.
    FileManager fm = Case.getCurrentCase().getServices().getFileManager();
    List<FsContent> files = fm.findFiles(image, ".avi");
    for (FsContent file : files) {
        long fileId = file.getId();
        boolean isIndexed = keywordSearchServer.queryIsIndexed(fileId);
        if (!isIndexed) {
            // Skip: file is not in the index (no ingest ran, file skipped, etc.)
            continue;
        }
        // All text is split into chunks of up to 1 MB each
        int numChunks = keywordSearchServer.queryNumFileChunks(fileId);
        // Chunk 0 stores metadata only and no text,
        // so we only care about chunks >= 1
        if (numChunks < 1) {
            // Skip: no text in the index
            continue;
        }
        // Go over every chunk and get its text
        for (int chunk = 1; chunk <= numChunks; ++chunk) {
            String chunkTxt = keywordSearchServer.getSolrContent(file, chunk);
            // Do something with chunkTxt,
            // e.g. append chunkTxt to an output stream
        }
    }
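
For anyone who wants to try the chunk-walking logic outside of Autopsy, here is a minimal, self-contained sketch of the same loop. It does not use the real Keyword Search API: `ChunkSource`, `InMemoryChunkSource`, and `assemble` are hypothetical names made up for illustration, and the interface only mirrors the three calls used in the sample (is the file indexed, how many chunks, text of chunk N). It assumes, as described above, that text chunks are numbered starting at 1.

```java
import java.util.List;
import java.util.Map;

final class ChunkTextAssembler {

    /** Hypothetical stand-in for the three keyword search server calls used in the sample. */
    interface ChunkSource {
        boolean isIndexed(long fileId);
        int numChunks(long fileId);
        String chunkText(long fileId, int chunk); // chunk numbers start at 1
    }

    /** In-memory fake: maps a file ID to its text chunks (list index 0 holds chunk 1). */
    static final class InMemoryChunkSource implements ChunkSource {
        private final Map<Long, List<String>> chunksByFile;

        InMemoryChunkSource(Map<Long, List<String>> chunksByFile) {
            this.chunksByFile = chunksByFile;
        }

        @Override
        public boolean isIndexed(long fileId) {
            return chunksByFile.containsKey(fileId);
        }

        @Override
        public int numChunks(long fileId) {
            List<String> chunks = chunksByFile.get(fileId);
            return chunks == null ? 0 : chunks.size();
        }

        @Override
        public String chunkText(long fileId, int chunk) {
            return chunksByFile.get(fileId).get(chunk - 1);
        }
    }

    /**
     * Concatenates all text chunks of a file.
     * Returns null if the file is not indexed or has no text chunks.
     */
    static String assemble(ChunkSource source, long fileId) {
        if (!source.isIndexed(fileId)) {
            return null; // not in the index
        }
        int numChunks = source.numChunks(fileId);
        if (numChunks < 1) {
            return null; // indexed, but no text
        }
        StringBuilder text = new StringBuilder();
        for (int chunk = 1; chunk <= numChunks; ++chunk) {
            String chunkTxt = source.chunkText(fileId, chunk);
            if (chunkTxt != null) {
                text.append(chunkTxt);
            }
        }
        return text.toString();
    }
}
```

In a real module, the three interface methods would presumably delegate to `queryIsIndexed`, `queryNumFileChunks`, and `getSolrContent` from the sample. Since chunks can be up to 1 MB each, writing each chunk to an output stream as it arrives may be preferable to building one large string for big files.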
