Vincent Luciani edited this page Aug 20, 2020 · 12 revisions

Purpose:

This application collects information from different types of data sources and loads it into a database (currently only Elasticsearch is supported as a destination).

Supported data sources:

  • Oracle SQL database
  • HTML pages (the list of pages is read from an XML sitemap)

Supported target data sources:

  • Elasticsearch

Specification

Parameters:

basePath:

Base path for the application (the absolute path of the application's base directory). The full paths of the input, output, and configuration directories are defined relative to this base path.
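As an illustration of how locations are derived from the base path, here is a minimal Python sketch. The helper name `resolve` is hypothetical and not part of the application; the `c:\test` value and the `configuration` folder name come from the Configuration section of this page.

```python
from pathlib import PureWindowsPath


def resolve(base_path: str, *relative_parts: str) -> PureWindowsPath:
    """Join a relative location onto the base path (hypothetical helper)."""
    return PureWindowsPath(base_path).joinpath(*relative_parts)


# The configuration folder sits directly under the base path.
config_dir = resolve(r"c:\test", "configuration")
print(config_dir)  # c:\test\configuration
```

PureWindowsPath is used here only so that the example behaves the same on any operating system.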

nodeList:

List of logical node names. A logical node is a group of data present on the source that can be processed independently. No data can be shared between two logical nodes. As a result, two logical nodes can be processed in parallel. A logical node name is used as a key to find information about the data source, the intermediate folders, the transformation of the data, and the destination of the data.
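Because logical nodes share no data, each one can be handed to its own worker. The following Python sketch only illustrates that idea; it is not the application's actual scheduling code. `process_node` is a hypothetical placeholder for the work done on one node, and the node names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_node(node_name: str) -> str:
    # Hypothetical placeholder: the real work would read, dispatch,
    # write or load the data that belongs to this logical node.
    return f"{node_name}: done"


def process_all(node_list: List[str]) -> List[str]:
    # One worker per node is safe here because nodes share no data.
    with ThreadPoolExecutor(max_workers=max(1, len(node_list))) as pool:
        # map() returns the results in the same order as node_list.
        return list(pool.map(process_node, node_list))


print(process_all(["nodeA", "nodeB"]))
```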

lastBatch:

If the read process was interrupted, you can give the number of the last batch processed. The process will then start with this batch.
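The restart behaviour can be pictured with the short Python sketch below. It assumes that batches are numbered from 1 and that the total number of batches is known in advance; neither point is stated on this page, and the function name is hypothetical.

```python
from typing import Iterator, Optional


def batches_to_process(total_batches: int,
                       last_batch: Optional[int] = None) -> Iterator[int]:
    """Yield the batch numbers to read (assumes numbering starts at 1)."""
    # Without lastBatch, start from the first batch; otherwise start
    # with the batch number that was given, as described above.
    first = 1 if last_batch is None else last_batch
    return iter(range(first, total_batches + 1))


print(list(batches_to_process(5, last_batch=3)))
```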

task:

Possible values are read, dispatch, write, load:

  • Read gets data from the data source and writes each document or row to a file in a common format. These files are called universal files.
  • Dispatch compares the files from the previous run with the files from the current run and moves each file of the current run to a specific folder depending on whether the file is new, updated, deleted or modified.
  • Write takes the files in the common format and, based on a template, writes files ready to be loaded into the target database. For Elasticsearch, the file created by the write task is a bulk load file. If the destination were a database (not yet supported), it would be a group of insert statements.
  • Load loads the files created by the write task into the destination.
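To make the write step more concrete, the Python sketch below builds the kind of bulk load content that the Elasticsearch bulk API accepts: one action line followed by one document line, each terminated by a newline. The index name, document IDs and fields are made up for the example, and the application's real templates may produce different content.

```python
import json
from typing import Dict


def to_bulk_body(index_name: str, documents: Dict[str, dict]) -> str:
    """Build newline-delimited JSON for the Elasticsearch bulk API."""
    lines = []
    for doc_id, source in documents.items():
        # Action line: index the document under the given ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents, for illustration only.
body = to_bulk_body("demo", {"1": {"title": "first"}, "2": {"title": "second"}})
print(body, end="")
```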

Configuration

The configuration folder is called configuration and is located inside the base path given as input (basePath parameter). Example: if basePath is c:\test, then the configuration folder's absolute path is c:\test\configuration. This folder contains the following folders:

logicalNode:

Inside this folder, there should be a file called logicalNodes.properties that holds information about the logical nodes. Parameters prefixed with "common." apply to all logical nodes. Parameters prefixed with a logical node ID apply only to that logical node. Available parameters:

  • physicalDataSource=3
  • dataSelectionTemplate
  • outputCreationTemplate
  • criteriaParameters
  • identificationColumnNumber
  • readerOutputBasePath=C:\test_java\output
  • writerInputBasePath=C:\test_java\output
  • batchForUploadBasePath=C:\test_java\batches_to_upload\
  • readerBatchSize
  • writerBatchSize
  • identificationColumnName=
  • destinationIP
  • destinationPort
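The prefix rule for logicalNodes.properties can be sketched in Python as follows. This is an illustration under assumptions, not the application's own parser: it assumes that keys have the form prefix.parameter, that a node-specific value overrides a "common." value with the same name, and that values contain no escape sequences. The node ID node1 and all values are hypothetical.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Very small key=value parser (ignores blank lines and # comments)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def settings_for_node(props: Dict[str, str], node_id: str) -> Dict[str, str]:
    """Merge 'common.' parameters with those prefixed by the node ID."""
    merged = {}
    # Common values first, so node-specific values win (an assumption).
    for prefix in ("common.", node_id + "."):
        for key, value in props.items():
            if key.startswith(prefix):
                merged[key[len(prefix):]] = value
    return merged


sample = """
common.readerBatchSize=100
common.destinationPort=9200
node1.readerBatchSize=50
node1.logicalNodeType=Oracle
"""
print(settings_for_node(parse_properties(sample), "node1"))
```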

  • sitemapUrl=https://www.vincent-luciani.com/sitemap.xml
  • sitemapProtocol=https
  • sitemapIsProxy=true
  • destinationDataPool=us-en
  • destinationSubDataPool=ddc
  • outputCreationTemplate=002
  • logicalNodeType=xml

  • destinationDataPool=vince
  • destinationSubDataPool=knowledge
  • criteriaValues=todo,todo
  • logicalNodeType=sitemap
  • urlPattern=(.tutorial.?)
  • outputCreationTemplate

  • physicalDataSource
  • destinationDataPool
  • destinationSubDataPool
  • criteriaValues=107,ZH
  • logicalNodeType=Oracle

dataSelectionTemplate:

outputCreationTemplate:

physicalDataSource: