SPARK-ETL-PIPELINE

Demo of various data fetch/transform processes via Spark Scala.

Scala Projects

File structure

# ├── Dockerfile         : Dockerfile to build the Scala Spark env
# ├── README.md
# ├── archived           : legacy Spark scripts in Python/Java...
# ├── build.sbt          : (scala) sbt build file for Spark Scala dependencies
# ├── config             : config for various services, e.g. S3, DB, Hive...
# ├── data               : sample data for some Spark script demos
# ├── output             : where the Spark stream/batch jobs write their output
# ├── project            : (scala) other sbt settings: plugins.sbt, build.properties...
# ├── python             : helper Python scripts
# ├── run_all_process.sh : script demoing a minimum end-to-end Spark process
# ├── script             : helper shell scripts
# ├── src                : (scala) MAIN SCALA SPARK TESTS/SCRIPTS
# ├── target             : where the final compiled jar is written (e.g. target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar)
# └── travis_build.sh    : Travis build file

Prerequisites

  1. Modify the config files with your credentials and rename them (e.g. twitter.config.dev -> twitter.config) to access services such as data sources, file systems, and so on.
  2. Install SBT as the Scala dependency management tool.
  3. Install Java and Spark.
  4. Modify build.sbt to align with your dev env (see the sketch after this list).
  5. Check the Spark ETL scripts in src.
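
A minimal build.sbt sketch for reference: the Scala version, project name, and version are inferred from the assembly jar path (target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar), while the Spark version and dependency list are assumptions; check the repo's actual build.sbt.

// minimal build.sbt sketch -- dependency versions are assumptions, not the repo's actual file
name := "spark-etl-pipeline"
version := "1.0"
scalaVersion := "2.11.12"

// Spark dependencies; pick versions matching your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.0",
  "org.apache.spark" %% "spark-sql"       % "2.4.0",
  "org.apache.spark" %% "spark-streaming" % "2.4.0"
)

With the sbt-assembly plugin declared under project/, sbt assembly then produces the fat jar used by spark-submit.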

Process

sbt clean compile -> sbt test -> sbt run -> sbt assembly -> spark-submit <spark-script>.jar
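
To illustrate what spark-submit runs at the end of that chain, here is a hypothetical minimal ETL job of the kind kept under src; the object name and file paths are illustrative only, not taken from the repo.

// SampleEtlJob.scala -- hypothetical sketch; object name and paths are illustrative
import org.apache.spark.sql.SparkSession

object SampleEtlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-etl-pipeline-demo")
      .getOrCreate()

    // extract: read sample input (the repo keeps demo data under data/)
    val raw = spark.read.option("header", "true").csv("data/sample.csv")

    // transform: placeholder cleanup step
    val cleaned = raw.na.drop()

    // load: write results under output/
    cleaned.write.mode("overwrite").parquet("output/sample_cleaned")

    spark.stop()
  }
}

After sbt assembly, such a job would be launched with spark-submit --class SampleEtlJob target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar.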

Quick Start

$ git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline && bash run_all_process.sh

Quick Start Manually

# STEP 0) 
$ cd ~ && git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline

# STEP 1) download the used dependencies.
$ sbt clean compile

# STEP 2) print the twitter stream via sbt run
$ sbt run

# STEP 3) create jars from the spark scala scripts
$ sbt assembly
$ spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar

# STEP 4) get fake page view event data
# run the script that generates page views
$ sbt package
$ spark-submit \
  --class DataGenerator.PageViewDataGenerator \
  target/scala-2.11/spark-etl-pipeline_2.11-1.0.jar

# open another terminal to receive the events
$ curl 127.0.0.1:44444
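
For context, here is a minimal sketch of the pattern behind a socket-based page-view generator; it is an illustration only, not the repo's DataGenerator.PageViewDataGenerator source, and the event format is assumed.

// FakePageViewServer.scala -- illustrative sketch of a socket-based event generator
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object FakePageViewServer {
  def main(args: Array[String]): Unit = {
    val pages  = Seq("/home", "/about", "/products", "/checkout")
    val server = new ServerSocket(44444)   // same port the curl command above hits
    while (true) {
      val socket = server.accept()
      new Thread(new Runnable {
        def run(): Unit = {
          val out = new PrintWriter(socket.getOutputStream, true)
          // emit "timestamp,page" lines to the connected client
          for (_ <- 1 to 100) {
            out.println(s"${System.currentTimeMillis()},${pages(Random.nextInt(pages.length))}")
            Thread.sleep(100)
          }
          socket.close()
        }
      }).start()
    }
  }
}

Besides curl, a Spark socket stream reading from 127.0.0.1:44444 could consume the same lines.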

Quick Start Docker

# STEP 0) 
$ git clone https://github.com/yennanliu/spark-etl-pipeline.git

# STEP 1) 
$ cd spark-etl-pipeline

# STEP 2) docker build 
$ docker build . -t spark_env

# STEP 3) ONE COMMAND : start the docker env, then sbt clean compile, sbt run, sbt assembly and spark-submit in one go
$ docker run  --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash -c "cd ../spark-etl-pipeline && sbt clean compile && sbt run && sbt assembly && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar"

# STEP 3') : STEP BY STEP : access docker -> sbt clean compile -> sbt run -> sbt assembly -> spark-submit 
# docker run 
$ docker run  --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash 
# inside docker bash 
root@942744030b57:~ cd ../spark-etl-pipeline && sbt clean compile && sbt run 

root@942744030b57:~ cd ../spark-etl-pipeline && sbt assembly && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar
