Use of file with list of URL's #18

Closed
Chadwiki opened this issue Apr 8, 2014 · 5 comments

Chadwiki commented Apr 8, 2014

Is it possible to provide a file in the "urlGetDocuments" field instead of a URL string?
The file would contain the list of URLs, for example a sitemap or a CSV file.

velias added the question label Apr 9, 2014
Member

velias commented Apr 9, 2014

No, for now the urlGetDocuments config field has to contain an http or https URL.
But the file at this URL may contain a list of document URLs in JSON form (e.g. https://github.com/searchisko/elasticsearch-river-remote/tree/master/src/test/resources/test_documents_json), which can then be processed (using remote/simpleGetDocuments and remote/urlGetDocumentDetailsField).
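
A minimal sketch of the "remote" section for that setup, not a tested configuration. Only the three option names come from the comment above; the URL and the "detailUrl" field name are hypothetical placeholders, and the assumption is that urlGetDocumentDetailsField names the field in each listed document that holds the URL of its details.

```json
{
  "remote" : {
    "urlGetDocuments" : "http://example.org/document-list.json",
    "simpleGetDocuments" : "true",
    "urlGetDocumentDetailsField" : "detailUrl"
  }
}
```

Under that assumption, the JSON returned from urlGetDocuments would be a list of objects, each carrying a "detailUrl" value pointing at one document.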

velias closed this as completed Apr 9, 2014
Author

Chadwiki commented Apr 9, 2014

Thanks.
Can this river be used to just pull a single JSON document from a URL? I don't need the {space}.
I only need to provide the list of URLs, using your solution above (a JSON file). Then I need to poll at a set frequency and provide authentication to collect the data.
Here is an example of the JSON format from a URL:

http://docs.appdynamics.com/download/attachments/20187207/REST_WildCardBT_metric-dataJSON.txt?version=1&modificationDate=1394226069000&api=v2

All my attempts have failed...

{
  "type" : "remote",
  "remote" : {
    "urlGetDocuments" : "http://docs.appdynamics.com/download/attachments/20187207/REST_WildCardBT_metric-dataJSON.txt?version=1&modificationDate=1394226069000&api=v2",
    "timeout" : "5s",
    "spacesIndexed" : "",
    "spaceKeysExcluded" : "",
    "indexUpdatePeriod" : "1m",
    "indexFullUpdatePeriod" : "1h",
    "maxIndexingThreads" : 2
  },
  "index" : {
    "index" : "my_remote_index",
    "type" : "remote_document",
    "remote_field_document_id" : "id",
    "remote_field_updated" : "updated"
  },
  "activity_log": {
    "index" : "remote_river_activity",
    "type" : "remote_river_indexupdate"
  }
}

Member

velias commented Apr 10, 2014

For this case you have to configure one "fake" space so that indexing runs for it, and simply not use it in urlGetDocuments. So in your configuration set "spacesIndexed" : "MAIN". Your config is also missing "simpleGetDocuments" : "true". Then remove "remote_field_updated" : "updated", since it is of no use in your case, and change remote_field_document_id from id to a field that exists in your data, probably metricPath.

@Chadwiki
Author

I may not be interpreting what you are saying correctly.
Have you been able to get my config to work?
If you can, please provide the exact config. I'm at a loss right now. Thanks.

When trying to create the river I receive the following:
CreationException[Guice creation errors:

  1. Error injecting constructor, org.elasticsearch.common.settings.SettingsException: String value must be provided for 'index/remote_field_updated' configuration!
    at org.jboss.elasticsearch.river.remote.RemoteRiver.<init>(Unknown Source)
    while locating org.jboss.elasticsearch.river.remote.RemoteRiver
    while locating org.elasticsearch.river.River

1 error]; nested: SettingsException[String value must be provided for 'index/remote_field_updated' configuration!];

########## River ########
{
  "type" : "remote",
  "remote" : {
    "urlGetDocuments" : "http://docs.appdynamics.com/download/attachments/20187207/REST_WildCardBT_metric-dataJSON.txt?version=1&modificationDate=1394226069000&api=v2",
    "timeout" : "5s",
    "spacesIndexed" : "MAIN",
    "spaceKeysExcluded" : "",
    "indexUpdatePeriod" : "1m",
    "indexFullUpdatePeriod" : "1h",
    "simpleGetDocuments" : "true",
    "maxIndexingThreads" : 2
  },
  "index" : {
    "index" : "my_remote_index",
    "type" : "remote_document",
    "remote_field_document_id" : "metricPath"
  },
  "activity_log": {
    "index" : "remote_river_activity",
    "type" : "remote_river_indexupdate"
  }
}

Member

velias commented Apr 14, 2014

Sorry, I didn't test your configuration due to lack of time; I wrote the instructions from memory only. But it seems you have found a bug in the river. I forgot to relax the mandatory validation of the index/remote_field_updated config field when I introduced simpleGetDocuments mode. I'm going to patch it, see issue #20.
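
Until that fix is available, one possible stopgap is to keep a placeholder string in index/remote_field_updated so that the mandatory check passes. This is an untested assumption based only on the error message quoted above, not something confirmed in this thread; the value "updated" is a placeholder and is presumably ignored in simpleGetDocuments mode. Only the "index" section is shown; the rest of the river config stays as posted.

```json
{
  "index" : {
    "index" : "my_remote_index",
    "type" : "remote_document",
    "remote_field_document_id" : "metricPath",
    "remote_field_updated" : "updated"
  }
}
```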
