new option mappingsOverride
walterra committed Aug 4, 2023
1 parent 1bd8916 commit 0b951e1
Showing 4 changed files with 19 additions and 4 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -10,7 +10,7 @@ A nodejs based library to (re)index and transform data from/to Elasticsearch.

If you're looking for a nodejs based tool which allows you to ingest large CSV/JSON files in the gigabyte range, you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out, or tries to do everything with a single bulk request.

-While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat) or [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.
+While I'd generally recommend using [Logstash](https://www.elastic.co/products/logstash), [filebeat](https://www.elastic.co/products/beats/filebeat), [Ingest Nodes](https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html), [Elastic Agent](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) or [Elasticsearch Transforms](https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html) for established use cases, this tool may be of help especially if you feel more at home in the JavaScript/nodejs universe and have use cases with customized ingestion and data transformation needs.

**This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.**

@@ -26,7 +26,7 @@ Now that we've talked about the caveats, let's have a look at what you actually get:

## Features

-- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD).
+- Buffering/Streaming for both reading and indexing. Files are read using streaming and Elasticsearch ingestion is done using buffered bulk indexing. This is tailored towards ingestion of large files. Successfully tested so far with JSON and CSV files in the range of 20-30 GBytes. On a single machine running both `node-es-transformer` and Elasticsearch ingestion rates up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GByte RAM, SSD), depending on document size.
- Supports wildcards to ingest/transform a range of files in one go.
- Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the `transform` callback.
- The `transform` callback gives you each source document, but you can split it up into multiple ones and return an array of documents. An example use case for this: each source document is a Tweet and you want to transform that into an entity centric index based on Hashtags (see the sketch after this list).
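
A sketch of that Tweet use case (hedged: the field names, and whether `transform` receives the raw line or a parsed document, are assumptions for illustration, not taken from the repo):

```js
transformer({
  fileName: 'tweets.ndjson',
  targetIndexName: 'hashtags',
  // Assuming each source line arrives as a raw JSON string here.
  transform(line) {
    const tweet = JSON.parse(line);
    // Returning an array emits one target document per hashtag.
    return (tweet.hashtags || []).map(hashtag => ({
      hashtag,
      timestamp: tweet.created_at,
      user: tweet.user && tweet.user.screen_name,
    }));
  },
});
```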
@@ -113,7 +113,8 @@ transformer({
- `splitRegex`: Custom line split regex, defaults to `/\n/`.
- `sourceIndexName`: The source Elasticsearch index to reindex from. If this is set, `fileName` is not allowed.
- `targetIndexName`: The target Elasticsearch index where documents will be indexed.
-- `mappings`: Elasticsearch document mapping.
+- `mappings`: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
+- `mappingsOverride`: If you're reindexing and this is set to `true`, `mappings` will be applied on top of the source index's mappings (see the sketch after this list). Defaults to `false`.
- `skipHeader`: If true, skips the first line of the source file. Defaults to `false`.
- `transform(line)`: A callback function which allows the transformation of a source line into one or several documents.
- `verbose`: Logging verbosity, defaults to `true`.
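
To illustrate the new option, a minimal usage sketch (the index and field names are made up; per `src/_create-mapping.js` below, in override mode `mappings` is a map of field names to field mappings that gets merged into the source index's `properties`):

```js
import transformer from 'node-es-transformer';

transformer({
  sourceIndexName: 'tweets',
  targetIndexName: 'tweets-copy',
  // Merged on top of the properties fetched from `tweets`;
  // all other fields keep their original mappings.
  mappings: {
    hashtag: { type: 'keyword' },
  },
  mappingsOverride: true,
});
```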
1 change: 1 addition & 0 deletions rollup.config.js
@@ -20,6 +20,7 @@ export default [
    buble({
      exclude: ['node_modules/**'],
      transforms: { asyncAwait: false },
+      objectAssign: 'Object.assign',
    }),
  ],
},
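
Presumably `objectAssign` is needed because the new merge logic in `src/_create-mapping.js` uses object spread, which buble only transpiles when given a helper to call. Roughly (a sketch of buble's documented behavior, not output copied from the actual build):

```js
// Input, as written in _create-mapping.js:
const merged = { ...targetMappings, properties: { ...targetMappings.properties, ...mappings } };

// Approximate buble output with objectAssign: 'Object.assign':
const mergedTranspiled = Object.assign({}, targetMappings, {
  properties: Object.assign({}, targetMappings.properties, mappings),
});
```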
13 changes: 12 additions & 1 deletion src/_create-mapping.js
@@ -4,10 +4,11 @@ export default function createMappingFactory({
  targetClient,
  targetIndexName,
  mappings,
+  mappingsOverride,
  verbose,
}) {
  return async () => {
-    let targetMappings = mappings;
+    let targetMappings = mappingsOverride ? undefined : mappings;

    if (sourceClient && sourceIndexName && typeof targetMappings === 'undefined') {
      try {
@@ -20,6 +21,16 @@
    }

    if (typeof targetMappings === 'object' && targetMappings !== null) {
+      if (mappingsOverride) {
+        targetMappings = {
+          ...targetMappings,
+          properties: {
+            ...targetMappings.properties,
+            ...mappings,
+          },
+        };
+      }
+
      try {
        const resp = await targetClient.indices.create(
          {
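To illustrate the merge semantics with hypothetical field names: with `mappingsOverride: true`, the user-supplied `mappings` are spread into the `properties` fetched from the source index, so matching fields are overridden and all others are kept. Note that in this mode `mappings` is expected to be a map of field names to field mappings, not a full `{ properties: ... }` object.

```js
// Mappings fetched from the source index (hypothetical):
const fetched = {
  properties: {
    message: { type: 'text' },
    hashtag: { type: 'text' },
  },
};

// User-supplied `mappings` option:
const overrides = { hashtag: { type: 'keyword' } };

// Equivalent of the merge performed above:
const result = { ...fetched, properties: { ...fetched.properties, ...overrides } };
// => { properties: { message: { type: 'text' }, hashtag: { type: 'keyword' } } }
```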
2 changes: 2 additions & 0 deletions src/main.js
@@ -15,6 +15,7 @@ export default async function transformer({
  sourceIndexName,
  targetIndexName,
  mappings,
+  mappingsOverride = false,
  skipHeader = false,
  transform,
  verbose = true,
@@ -38,6 +39,7 @@
    targetClient,
    targetIndexName,
    mappings,
+    mappingsOverride,
    verbose,
  });
  const indexer = indexQueueFactory({
