
ETLBookBatchProcessing

A demo data pipeline built with Apache Flink for batch processing.

Database generation

  1. On macOS (tested on a MacBook M1 Pro), start PostgreSQL before creating the database:
brew services start postgresql
  2. Create the "books" database:
createdb books -U postgres
  3. Create the tables from the schema in misc/schema.sql:
psql -h localhost -U quangtn -W -d books -f ../etl_datapipeline/misc/schema.sql
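The actual table layout lives in misc/schema.sql, which is not reproduced here. As a hypothetical sketch only, a minimal schema for the Amazon popular-books dataset might look like the following (all column names are assumptions, not the repository's real schema):

```shell
# Hypothetical sketch: write a minimal schema file. The real project uses
# misc/schema.sql; these columns are guesses based on the dataset's fields.
cat > schema_sketch.sql <<'SQL'
CREATE TABLE IF NOT EXISTS books (
    asin        TEXT PRIMARY KEY,   -- Amazon product identifier
    title       TEXT NOT NULL,
    brand       TEXT,               -- publisher/author field in the dataset
    final_price NUMERIC,
    rating      TEXT
);
SQL
# It would be applied the same way as the real schema:
# psql -h localhost -U quangtn -W -d books -f schema_sketch.sql
```
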

Getting the dataset.

  1. Download the dataset and convert it to newline-delimited JSON:
curl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | jq -c '.[]' > dataset.json
  2. Compress the dataset, because the pipeline loads .gz files as its data source:
gzip dataset.json
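To see what the conversion step does without downloading anything, here is a self-contained illustration on a tiny inline sample: `jq -c '.[]'` turns a JSON array into one compact JSON object per line (NDJSON), which is then gzipped (file names here are placeholders):

```shell
# A JSON array of two objects becomes two NDJSON lines, then a .gz file.
printf '[{"title":"A"},{"title":"B"}]' > sample.json
jq -c '.[]' sample.json > sample.ndjson   # one object per line
gzip -f sample.ndjson                     # produces sample.ndjson.gz
gzip -cd sample.ndjson.gz                 # prints the two JSON lines back
```
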

Building the package.

mvn clean package

Running.

  1. Start the Flink cluster:
../flink-1.18.1/bin/start-cluster.sh
  2. Run the Flink job:
../flink-1.18.1/bin/flink run -p 4 ./target/github-etl-datapipeline-1.0-SNAPSHOT.jar --input-dir dataset.json ./ --db-url jdbc:postgresql://localhost:5432/books
  3. Stop the Flink cluster:
../flink-1.18.1/bin/stop-cluster.sh
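The three steps above can be wrapped in one small script. This is a sketch under assumptions: the Flink install path (exposed here as FLINK_HOME) and the job arguments are taken from the commands above and may need adjusting; the DRY_RUN switch is an addition for previewing the commands without a running cluster:

```shell
# Sketch only: writes a wrapper script for start -> run -> stop.
# Set DRY_RUN=1 to print the commands instead of executing them.
cat > run_pipeline.sh <<'SH'
#!/bin/sh
set -e
FLINK_HOME="${FLINK_HOME:-../flink-1.18.1}"   # assumed install location
JAR=./target/github-etl-datapipeline-1.0-SNAPSHOT.jar

run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }

run "$FLINK_HOME/bin/start-cluster.sh"
run "$FLINK_HOME/bin/flink" run -p 4 "$JAR" \
    --input-dir dataset.json ./ \
    --db-url jdbc:postgresql://localhost:5432/books
run "$FLINK_HOME/bin/stop-cluster.sh"
SH
chmod +x run_pipeline.sh
# Preview the commands:  DRY_RUN=1 ./run_pipeline.sh
```
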

Output.

(Demo: screen recording from 2024-03-01 showing the pipeline run.)
