
Cannot get the total word from unallocated files #157

Closed
ghost opened this issue Mar 7, 2013 · 7 comments

Comments


ghost commented Mar 7, 2013

From this issue #149

It seems that we need to use FsContent to count the total number of words in a file, but FsContent does not seem to cover unallocated files. Is there another way to do it?

Contributor

adam-m commented Mar 7, 2013

Correct, FsContent represents only allocated files in a file system.
AbstractFile (the parent class of FsContent) represents all files, including
the virtual/logical unallocated files.

Can you post some code snippets to show in more detail what you are
doing?


Contributor

adam-m commented Mar 7, 2013

To clarify, deleted file system files are represented by FsContent.

Logical files that represent unallocated blocks, however, are AbstractFile
objects (the abstract parent class), or LayoutFile to be precise.


Author

ghost commented Mar 8, 2013

for (String name : Files.keySet()) {
    List<FsContent> files = fm.findFiles(img, name);
    data.put(name, files);
    int counter = 0;
    for (FsContent file : files) {
        long fileId = file.getId();
        boolean isIndexed = keywordSearchServer.queryIsIndexed(fileId);
        if (!isIndexed) {
            // skip: file is not in the index (no ingest ran, file skipped, etc.)
            continue;
        }
        // all text is split into chunks of up to 1 MB each
        int numChunks = keywordSearchServer.queryNumFileChunks(fileId);
        // chunk 0 stores metadata only and no text,
        // so we only care about the text chunks (chunk >= 1)
        if (numChunks < 1) {
            // skip: no text in the index
            continue;
        }

        // go over every chunk
        for (int chunk = 1; chunk <= numChunks; ++chunk) {
            String chunkTxt = keywordSearchServer.getSolrContent(file, chunk);
            // this is where I count the keywords
        }
    }
}
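
For the counting step inside the chunk loop, a minimal sketch might look like the following. It is not Autopsy code: the class name ChunkWordCount and both methods are hypothetical, and it assumes a "word" is any run of non-whitespace characters. A word that happens to straddle two chunks would be counted once in each chunk, so the total is an approximation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Autopsy or The Sleuth Kit.
class ChunkWordCount {

    /** Counts whitespace-separated tokens in the text of a single chunk. */
    public static int countWords(String chunkText) {
        if (chunkText == null) {
            return 0;
        }
        String trimmed = chunkText.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    /** Sums the per-chunk counts; words split across chunks are counted twice. */
    public static long countWords(List<String> chunkTexts) {
        long total = 0;
        for (String text : chunkTexts) {
            total += countWords(text);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the strings returned by the per-chunk index query.
        List<String> chunks = Arrays.asList("first chunk of text", "second chunk");
        System.out.println(countWords(chunks));
    }
}
```

In the loop above, each chunkTxt would be passed to countWords(String) and the result added to a running total per file.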

Contributor

adam-m commented Mar 8, 2013

In this case it is better to get the files from the blackboard by keyword
search result, rather than from the file manager by file name.

You can go over all keyword search results (blackboard artifacts) and, for
every artifact of interest (one that has a keyword hit or blackboard
attribute that interests you), get the object id from the artifact.

Then use the object id (which is a file id) to query the keyword search
index.

This will give you all hits that exist, including hits from unallocated
blocks.
The code starting with the line:
boolean isIndexed = keywordSearchServer.queryIsIndexed(fileId);
should not change; only the way you obtain the files changes.

It should also be much faster, because you will only be querying the keyword
search index for files that have specific hits.

Adam
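
The flow described above can be sketched roughly as follows. This is a self-contained illustration, not Autopsy code: Hit and TextIndex are hypothetical stand-ins for a keyword-hit blackboard artifact and for the keyword search server, and every method name on them is an assumption made for this example. Only the shape of the loop (artifact, then object id, then indexed chunks) comes from the explanation above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are the real Autopsy/Sleuth Kit classes.
class KeywordHitWalk {

    /** Stand-in for a keyword-hit blackboard artifact. */
    interface Hit {
        long getObjectId();
    }

    /** Stand-in for the keyword search index. */
    interface TextIndex {
        boolean isIndexed(long objectId);
        int numChunks(long objectId);
        String chunkText(long objectId, int chunk);
    }

    /** Collects each object id once, in first-seen order, since a file can have many hits. */
    public static Set<Long> objectIdsWithHits(List<? extends Hit> hits) {
        Set<Long> ids = new LinkedHashSet<>();
        for (Hit hit : hits) {
            ids.add(hit.getObjectId());
        }
        return ids;
    }

    /** Returns the text chunks (chunk >= 1) of every indexed object in ids. */
    public static List<String> textFor(Set<Long> ids, TextIndex index) {
        List<String> texts = new ArrayList<>();
        for (long id : ids) {
            if (!index.isIndexed(id)) {
                continue; // not in the index: no ingest ran, file skipped, etc.
            }
            int numChunks = index.numChunks(id);
            // chunk 0 is assumed to hold metadata only, as in the posted code
            for (int chunk = 1; chunk <= numChunks; ++chunk) {
                texts.add(index.chunkText(id, chunk));
            }
        }
        return texts;
    }
}
```

Because the ids come from artifacts rather than from file-system lookups by name, nothing in this loop depends on the object being an FsContent, which is why hits in unallocated blocks are included.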


Author

ghost commented Mar 8, 2013

Can you give me some example code? I seem to be lost here.

Author

ghost commented Mar 10, 2013

It turned out to be quite easy and less time-consuming. Thank you for helping. I won't forget to give you guys credit when the project is done :)

Contributor

adam-m commented Mar 10, 2013

Good news! I was just about to give you a sample.
I will close the issue; let me know if you have more questions.

