Unable to set _id in bulk index with raw source documents #2861

Closed
diegobenincasa opened this issue Feb 28, 2024 · 2 comments · Fixed by #2862
Assignees
Labels
type: bug A general bug

Comments

diegobenincasa commented Feb 28, 2024

I tried to bulk index a set of raw JSON records into ES and needed to set custom _id values for them. Indexing a single document works: building the query with "IndexQueryBuilder().withId(some_id_value)" and then calling the individual index method stores the document under that _id. Calling the "bulkIndex" method with queries built the same way, however, ignores the _id value that was set.

Here's the code that ignores the ".withId" call:

// package declaration omitted for safety

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.IndexOperations;
import org.springframework.data.elasticsearch.core.mapping.IndexCoordinates;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ESService {
    
    @Autowired
    private ElasticsearchOperations esOperations;

    public void index(String baseName, Map<Integer, String> jsonDocuments, String indexName, Long exp_time) {

        IndexCoordinates indexCoordinates = IndexCoordinates.of(indexName);
        
        IndexOperations indexOps = esOperations.indexOps(indexCoordinates);
        if(!indexOps.exists()) {
            indexOps.create();
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still observe the interruption.
                Thread.currentThread().interrupt();
            }
        }

        List<IndexQuery> indexQueries = jsonDocuments.keySet().stream()
            .map(id -> new IndexQueryBuilder()
                .withSource(jsonDocuments.get(id))
                .withId(id.toString()) // HERE IS THE IGNORED CALL
                .withIndex(indexName)
                .build())
            .collect(Collectors.toList());

        try {
            esOperations.bulkIndex(indexQueries, indexCoordinates);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

It would be useful (if not essential) for the user to be able to set the _id for each individual record sent in the bulk request.
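For context, the Elasticsearch bulk API itself accepts a per-document _id: each document is preceded by an action line whose metadata can carry "_id". The following is a minimal sketch that builds such a request body by hand in plain Java. The class and method names (BulkBodySketch, buildBulkBody) are made up for illustration, this is not how Spring Data Elasticsearch builds its requests, and it assumes each source document is single-line JSON and that ids and index names need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: builds the newline-delimited body of an Elasticsearch
// _bulk request by hand to show where a custom _id travels.
final class BulkBodySketch {

    // Emits one action line plus one source line per document.
    // Assumption: ids and the index name contain no characters that need
    // JSON escaping, and every source document is single-line JSON.
    static String buildBulkBody(String indexName, Map<String, String> jsonById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : jsonById.entrySet()) {
            body.append("{\"index\":{\"_index\":\"").append(indexName)
                .append("\",\"_id\":\"").append(entry.getKey()).append("\"}}\n");
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"name\":\"first\"}");
        docs.put("2", "{\"name\":\"second\"}");
        System.out.print(buildBulkBody("my-index", docs));
    }
}
```

Each pair of lines in the result is one bulk item, so nothing in the bulk format prevents a custom _id per record.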

I was able to loop over the individual IndexQuery objects and send them to ES one by one, which sets the _id value correctly, but processing time increases considerably: for my ~2 million JSON records, elapsed time goes from 15-20 minutes with bulk requests (in batches of 2000 records) to ~3 hours.
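The batching mentioned above could look roughly like the sketch below, in plain Java with no Elasticsearch dependency. The names BatchSketch and partition are hypothetical. Each resulting batch would then be mapped to IndexQuery objects and passed to bulkIndex as in the code earlier in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: splits a list of records into fixed-size batches.
final class BatchSketch {

    // Returns consecutive sublists of at most batchSize elements, in order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        // With a batch size of 2, the last batch holds the single leftover element.
        for (List<Integer> batch : partition(ids, 2)) {
            System.out.println(batch);
        }
    }
}
```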

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Feb 28, 2024
Collaborator

sothawo commented Feb 28, 2024

Your method does not compile.
jsonDocuments is of type Map<Integer, String>, so jsonDocuments.keySet().stream() provides a stream of Integer. IndexQueryBuilder.withId(String id) takes a String, not an Integer, and has never accepted any other type for the id.
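A minimal sketch of the conversion involved, in plain Java: the Integer keys have to be turned into String values before they can be passed to a parameter of type String. The class and method names (IdConversionSketch, toStringIds) are hypothetical and only illustrate the type issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: shows the Integer-to-String step for map keys used as ids.
final class IdConversionSketch {

    // keySet().stream() yields Integer values; toString() produces the
    // String form that a String-typed id parameter requires.
    static List<String> toStringIds(Map<Integer, String> jsonDocuments) {
        return jsonDocuments.keySet().stream()
            .map(id -> id.toString())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(7, "{\"name\":\"seven\"}");
        docs.put(8, "{\"name\":\"eight\"}");
        System.out.println(toStringIds(docs));
    }
}
```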

@sothawo sothawo added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged labels Feb 28, 2024
@diegobenincasa
Author

I'm sorry, I actually sent the code with that mistake. Please see my edited post.

The problem is not in the compilation (it was my mistake in copying and pasting the code here), but in the "withId" call. You can set anything there and the setting is ignored in the bulk request.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 28, 2024
@sothawo sothawo self-assigned this Feb 28, 2024
@sothawo sothawo added type: bug A general bug and removed status: feedback-provided Feedback has been provided labels Feb 28, 2024
sothawo added a commit that referenced this issue Feb 28, 2024
sothawo added a commit that referenced this issue Feb 28, 2024
Original Pull Request #2862
Closes #2861

(cherry picked from commit debf04b)
sothawo added a commit that referenced this issue Feb 28, 2024
Original Pull Request #2862
Closes #2861

(cherry picked from commit debf04b)
(cherry picked from commit b52e8d1)