SqlDataFlow

An ETL tool based on Spark SQL and HOCON (https://github.com/lightbend/config).

Write a configuration file to specify the ETL process.
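
Internally, the job conf is presumably parsed with the Typesafe Config (HOCON) library linked above. The minimal Scala sketch below is illustrative only (class and method names are not the project's actual API) and shows how the source blocks in the samples that follow could be read:

import java.io.File
import com.typesafe.config.{Config, ConfigFactory}
import scala.collection.JavaConverters._

object ConfSketch {
  def main(args: Array[String]): Unit = {
    // Parse the HOCON job conf passed on the command line.
    val conf: Config = ConfigFactory.parseFile(new File(args(0))).resolve()

    // Each child of `source` (jdbc1, cos2, ...) is one plugin block.
    val sources = conf.getConfig("source")
    sources.root().keySet().asScala.foreach { name =>
      val block = sources.getConfig(name)
      println(s"source $name: table=${block.getString("table")}, " +
        s"view=${block.getString("result_table_name")}")
    }
  }
}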

Step 1. Write the job conf file

a. Data from JDBC to COS (a kind of cloud object storage), sample:

source {
    jdbc1 {
        tag=test
        table="(select * from test.test) as t"
        result_table_name=test
    }
}

sink {
    cos1 {
        tag=meta
        bucket=test
        table=test
        result_table_name="TEST.TEST"
        trim=true
    }
}
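
Roughly, the sample above could translate into plain Spark SQL calls like the sketch below. The JDBC URL, credentials, and COS output path/format are placeholders, not the project's actual wiring:

import org.apache.spark.sql.SparkSession

object Jdbc2CosSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc2cos").getOrCreate()

    // source.jdbc1: read the subquery and register it as result_table_name.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/test")     // placeholder
      .option("user", "user")                               // placeholder
      .option("password", "secret")                         // placeholder
      .option("dbtable", "(select * from test.test) as t")
      .load()
    df.createOrReplaceTempView("test")

    // sink.cos1: write the registered view out to cloud object storage.
    spark.table("test").write.mode("overwrite")
      .parquet("cos://test/TEST.TEST")                      // placeholder path
    spark.stop()
  }
}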

b. Data from COS to dashDB, please refer to the following:

source {
    cos2 {
        useFullName=true
        tag=meta
        table="test.test"
        result_table_name=test
    }

    cos3 {
        tag=meta
        table="test.test"
        fromSink=true
        result_table_name=oldtest
        targetDB=data
        targetTable=test.test_new
    }
}

sink {
    dashdb5 {
        metaTag=meta
        tag=data
        inNewView=test
        inOldView=oldtest
        batchSize=10
        numPartitions=1
        strategy="Incremental"
        result_table_name="test.test_new"
        priKeys=[
            { key=id }
        ]
    }
}
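
One plausible reading of the Incremental strategy is sketched below: anti-join the new view against the old one on the declared priKeys, so only rows not yet present in the target get written. The comparison rule, JDBC URL, and credentials here are assumptions rather than the project's exact behaviour:

import org.apache.spark.sql.SparkSession

object IncrementalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental").getOrCreate()

    val newDf = spark.table("test")     // inNewView, loaded by cos2
    val oldDf = spark.table("oldtest")  // inOldView, loaded by cos3

    // Keep rows whose primary key (id) is absent from the old snapshot.
    val delta = newDf.join(oldDf, Seq("id"), "left_anti")

    // Append the delta to the target table using the sink's batching hints.
    delta.write.format("jdbc")
      .option("url", "jdbc:db2://dashdb-host:50000/BLUDB")  // placeholder
      .option("user", "user")                               // placeholder
      .option("password", "secret")                         // placeholder
      .option("dbtable", "test.test_new")
      .option("batchsize", "10")
      .option("numPartitions", "1")
      .mode("append")
      .save()
    spark.stop()
  }
}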

You can find more samples in the examples directory.

How to deploy?

1. Assemble a fat jar

   mvn clean package -DskipTests -Pdefault

2. Copy the job conf to a COS bucket with Cyberduck or the MinIO client

3. Refer to the following bash script and save it as runsdf.sh. Usage: ./runsdf.sh env /path/to/job.conf

#!/bin/bash

if [ $# -lt 2 ]; then
  echo "usage: runsdf.sh env /path/to/job.conf  example: runsdf.sh vt cos://dsw-data-project-vt/dataflow/examples/jdbc2costest.conf"
  exit 1
fi

# Derive the Spark application name from the conf file name.
appName=$(echo "$2" | sed 's|cos://.*/.*/.*/\(.*\)\.conf|\1|g' | sed 's/_//')
echo "$appName is starting to run......"

env=$1

# NOTE: the value passed to --driver-java-options below is an assumption;
# adjust it to however Main actually consumes the env argument.
spark-submit \
  --conf spark.kubernetes.namespace=spark-job \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --master k8s://https://master:port \
  --deploy-mode cluster \
  --name "$appName" \
  --class org.student.spark.Main \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=4g \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.driver.secrets.sdfsecret=/opt/spark/secrets/ \
  --conf spark.kubernetes.container.image=student2021/spark:244 \
  --driver-java-options "-Denv=$env" \
  SparkDataFlow-1.0.jar \
  "$2"
