A demo data pipeline using Flink for batch processing.
- Since this runs on a MacBook M1 Pro with Homebrew, start PostgreSQL before creating the database.
brew services start postgresql
- Create the database "books".
createdb -U postgres books
- Create the tables from the schema in misc/schema.sql.
psql -h localhost -U quangtn -W -d books -f ../etl_datapipeline/misc/schema.sql
- Download the dataset and convert the JSON array into newline-delimited records.
curl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | jq -c '.[]' > dataset.json
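The `jq -c '.[]'` filter above turns the dataset's single JSON array into one compact JSON object per line (NDJSON). A quick sanity check on a hypothetical two-record sample with the same array-of-objects shape:

```shell
# Hypothetical sample mimicking the dataset's structure; the real field
# names may differ. jq -c '.[]' emits one compact object per line.
printf '[{"asin":"A1","title":"Book One"},{"asin":"A2","title":"Book Two"}]' \
  | jq -c '.[]'
# Each output line is a standalone JSON object, which is the format the
# pipeline expects inside the gzipped source file.
```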
- Gzip the dataset, because the pipeline loads .gz files as its data source.
gzip dataset.json
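Note that `gzip` replaces dataset.json with dataset.json.gz in place. The archive can be verified and previewed without extracting it, sketched here on a hypothetical throwaway file:

```shell
# Sketch on a temporary file; gzip removes the input and leaves only
# the .gz archive (pass -k if you want to keep the original).
tmp=$(mktemp -d)
printf '{"asin":"A1"}\n' > "$tmp/sample.json"
gzip "$tmp/sample.json"          # produces sample.json.gz
gzip -t "$tmp/sample.json.gz"    # integrity check; silent on success
gunzip -c "$tmp/sample.json.gz"  # stream the contents without extracting
rm -r "$tmp"
```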
- Build the job JAR with Maven.
mvn clean package
- Start the Flink cluster.
../flink-1.18.1/bin/start-cluster.sh
- Run the Flink job, pointing it at the directory containing dataset.json.gz.
../flink-1.18.1/bin/flink run -p 4 ./target/github-etl-datapipeline-1.0-SNAPSHOT.jar --input-dir ./ --db-url jdbc:postgresql://localhost:5432/books
- Stop the Flink cluster.
../flink-1.18.1/bin/stop-cluster.sh