
Azure Search output plugin for Embulk

embulk-output-azuresearch is an Embulk output plugin that dumps records to Azure Search. Embulk is an open-source bulk data loader that helps transfer data between various databases, storages, file formats, and cloud services. See the Embulk documentation for details.

Overview

  • Plugin type: output
  • Load all or nothing: no
  • Resume supported: no
  • Cleanup supported: yes

Installation

$ embulk gem install embulk-output-azuresearch

Configuration

Azure Search

To use Microsoft Azure Search, you must create an Azure Search service in the Azure Portal. You also need an index, the persisted document store to which embulk-output-azuresearch writes records. A sample index schema follows:

Sample Index Schema: sampleindex01

{
    "name": "sampleindex01",
    "fields": [
        { "name":"id", "type":"Edm.String", "key": true, "searchable": false },
        { "name":"title", "type":"Edm.String", "analyzer":"en.microsoft" },
        { "name":"speakers", "type":"Edm.String" },
        { "name":"url", "type":"Edm.String", "searchable": false, "filterable":false, "sortable":false, "facetable":false },
        { "name":"text", "type":"Edm.String", "filterable":false, "sortable":false, "facetable":false, "analyzer":"en.microsoft" }
    ]
}
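
If you prefer to script index creation rather than clicking through the Azure Portal, the schema above can be created with a single REST call. The snippet below is a minimal sketch in Python using the requests library; it is not part of this plugin, and the endpoint, admin key, and api-version shown are placeholders and assumptions you would replace with your own values.

import json
import requests

endpoint = "https://yoichikademo.search.windows.net"  # your Azure Search service endpoint
api_key = "YOUR-ADMIN-API-KEY"                        # admin key from the Azure Portal
api_version = "2016-09-01"                            # assumed api-version; use one your service supports

index_schema = {
    "name": "sampleindex01",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
        {"name": "title", "type": "Edm.String", "analyzer": "en.microsoft"},
        {"name": "speakers", "type": "Edm.String"},
        {"name": "url", "type": "Edm.String", "searchable": False,
         "filterable": False, "sortable": False, "facetable": False},
        {"name": "text", "type": "Edm.String", "filterable": False,
         "sortable": False, "facetable": False, "analyzer": "en.microsoft"},
    ],
}

# PUT /indexes/{name} creates the index (or updates its definition).
resp = requests.put(
    f"{endpoint}/indexes/{index_schema['name']}",
    params={"api-version": api_version},
    headers={"Content-Type": "application/json", "api-key": api_key},
    data=json.dumps(index_schema),
)
resp.raise_for_status()
print(resp.status_code)  # 201 Created on first creation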

Embulk Configuration (config.yml)

out:
  type: azuresearch
  endpoint: https://yoichikademo.search.windows.net
  api_key:  9E55964F8254BB4504DX3F66A39AF5EB
  search_index: sampleindex01
  column_names: id,title,speakers,text,url
  key_names: id,title,speakers,description,link

  • endpoint (required) - Azure Search service endpoint URI
  • api_key (required) - Azure Search API key
  • search_index (required) - Azure Search index name to insert records into
  • column_names (required) - Column names in the target Azure Search index. Separate names with commas.
  • key_names (optional) - Default: nil. Key names in the incoming record to insert. Separate names with commas. By default, key_names is the same as column_names (see the mapping sketch below).
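
Conceptually, the plugin pairs each entry of key_names with the column_names entry at the same position: the value is read from the incoming record under the key name and written to the index field named by the column name. The following is an illustrative Python sketch of that mapping, not the plugin's actual implementation:

# Illustrative sketch of the column_names / key_names mapping; not the plugin's actual code.
column_names = ["id", "title", "speakers", "text", "url"]       # fields in the Azure Search index
key_names = ["id", "title", "speakers", "description", "link"]  # keys in the incoming record

record = {
    "id": "1",
    "title": "Moving to the Cloud",
    "speakers": "Narayan Annamalai",
    "description": "Benefits of moving your applications to cloud",
    "link": "https://s.ch9.ms/Events/Build/2016/P576",
}

# Each index column takes its value from the record key at the same position.
doc = {column: record[key] for column, key in zip(column_names, key_names)}
doc["@search.action"] = "mergeOrUpload"  # the action used in the request bodies shown below
print(doc)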

Sample Configurations

(1) Case: column_names and key_names are same

Suppose you have the following config.yml and the sample Azure Search index schema shown above:

config.yml

in:
  type: file
  path_prefix: samples/sample_01.csv
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: string}
    - {name: title, type: string}
    - {name: speakers, type: string}
    - {name: text, type: string}
    - {name: url, type: string}
out:
  type: azuresearch
  endpoint: https://yoichikademo.search.windows.net
  api_key:  9E55964F8254BBXX04D53F66A39AF5EB
  search_index: sampleindex01
  column_names: id,title,speakers,text,url

The plugin will dump records out to Azure Search like this:

Input CSV

id,title,speakers,text,url
1,Moving to the Cloud,Narayan Annamalai,Benefits of moving your applications to cloud,https://s.ch9.ms/Events/Build/2016/P576
2,Building Big Data Applications using Spark and Hadoop,Maxim Lukiyanov,How to leverage Spark to build intelligence into your application,https://s.ch9.ms/Events/Build/2016/P420
3,Service Fabric Deploying and Managing Applications with Service Fabric,Chacko Daniel,Service Fabric deploys and manages distributed applications built as microservices,https://s.ch9.ms/Events/Build/2016/P431

Output JSON Body to Azure Search

{"value":
    [
        {"id":"1","title":"Moving to the Cloud","speakers":"Narayan Annamalai","text":"Benefits of moving your applications to cloud","url":"https://s.ch9.ms/Events/Build/2016/P576","@search.action":"mergeOrUpload"},
        {"id":"2","title":"Building Big Data Applications using Spark and Hadoop","speakers":"Maxim Lukiyanov","text":"How to leverage Spark to build intelligence into your application","url":"https://s.ch9.ms/Events/Build/2016/P420","@search.action":"mergeOrUpload"},
        {"id":"3","title":"Service Fabric Deploying and Managing Applications with Service Fabric","speakers":"Chacko Daniel","text":"Service Fabric deploys and manages distributed applications built as microservices","url":"https://s.ch9.ms/Events/Build/2016/P431","@search.action":"mergeOrUpload"}
    ]
}
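
This is the request body for Azure Search's document-indexing REST call (presumably what the plugin issues under the hood). If you want to reproduce or debug such an upload by hand, the following Python sketch posts an equivalent batch with the requests library; it is not the plugin's code, and the api-version value is an assumption.

import requests

endpoint = "https://yoichikademo.search.windows.net"
api_key = "YOUR-API-KEY"
search_index = "sampleindex01"
api_version = "2016-09-01"  # assumed api-version

batch = {
    "value": [
        {"id": "1", "title": "Moving to the Cloud",
         "speakers": "Narayan Annamalai",
         "text": "Benefits of moving your applications to cloud",
         "url": "https://s.ch9.ms/Events/Build/2016/P576",
         "@search.action": "mergeOrUpload"},
    ]
}

# POST /indexes/{index}/docs/index accepts a batch of document actions.
resp = requests.post(
    f"{endpoint}/indexes/{search_index}/docs/index",
    params={"api-version": api_version},
    headers={"Content-Type": "application/json", "api-key": api_key},
    json=batch,
)
resp.raise_for_status()
print(resp.json())  # per-document status for each action in the batch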

(2) Case: column_names and key_names are NOT same

Suppose you have the following config.yml and the sample Azure Search index schema shown above:

config.yml

in:
  type: file
  path_prefix: samples/sample_01.csv
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: string}
    - {name: title, type: string}
    - {name: speakers, type: string}
    - {name: description, type: string}
    - {name: link, type: string}
out:
  type: azuresearch
  endpoint: https://yoichikademo.search.windows.net
  api_key:  9E55964F8254BBXX04D53F66A39AF5EB
  search_index: sampleindex01
  column_names: id,title,speakers,text,url
  key_names: id,title,speakers,description,link

The plugin will dump records out to Azure Search like this:

Input CSV

id,title,speakers,description,link
1,Moving to the Cloud,Narayan Annamalai,Benefits of moving your applications to cloud,https://s.ch9.ms/Events/Build/2016/P576
2,Building Big Data Applications using Spark and Hadoop,Maxim Lukiyanov,How to leverage Spark to build intelligence into your application,https://s.ch9.ms/Events/Build/2016/P420
3,Service Fabric Deploying and Managing Applications with Service Fabric,Chacko Daniel,Service Fabric deploys and manages distributed applications built as microservices,https://s.ch9.ms/Events/Build/2016/P431

Output JSON Body to Azure Search

{"value":
    [
        {"id":"1","title":"Moving to the Cloud","speakers":"Narayan Annamalai","text":"Benefits of moving your applications to cloud","url":"https://s.ch9.ms/Events/Build/2016/P576","@search.action":"mergeOrUpload"},
        {"id":"2","title":"Building Big Data Applications using Spark and Hadoop","speakers":"Maxim Lukiyanov","text":"How to leverage Spark to build intelligence into your application","url":"https://s.ch9.ms/Events/Build/2016/P420","@search.action":"mergeOrUpload"},
        {"id":"3","title":"Service Fabric Deploying and Managing Applications with Service Fabric","speakers":"Chacko Daniel","text":"Service Fabric deploys and manages distributed applications built as microservices","url":"https://s.ch9.ms/Events/Build/2016/P431","@search.action":"mergeOrUpload"}
    ]
}
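
After a run, you can confirm the documents are searchable with a quick query against the index. Here is a minimal Python sketch using the Azure Search REST API; it is not part of the plugin, and the api-version value is an assumption.

import requests

endpoint = "https://yoichikademo.search.windows.net"
api_key = "YOUR-QUERY-OR-ADMIN-KEY"
search_index = "sampleindex01"
api_version = "2016-09-01"  # assumed api-version

# GET /indexes/{index}/docs runs a search query against the index.
resp = requests.get(
    f"{endpoint}/indexes/{search_index}/docs",
    params={"api-version": api_version, "search": "Spark", "$select": "id,title,speakers"},
    headers={"api-key": api_key},
)
resp.raise_for_status()
for doc in resp.json()["value"]:
    print(doc["id"], doc["title"])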

Build, Install, and Run

$ rake

$ embulk gem install pkg/embulk-output-azuresearch-0.1.0.gem

$ embulk preview config.yml

$ embulk run config.yml

Change log

Links

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/yokawasa/embulk-output-azuresearch.

Copyright

Copyright (c) 2016- Yoichi Kawasaki
License: MIT