Spark data streaming application over Chinese Great Firewall

This is a Spark data streaming application that serves as a bridge over the Chinese Great Firewall.


Lambda(China) -> Kinesis(China) -> Spark Cluster(US) -> Kinesis(US) -> Firehose(US) -> S3(US)
    |                                     |
    \/                                    \/
Firehose(China)                  **Spark Application**
    |                                     |
    \/                                    \/
S3 Backup(China)                       Spark UI

Lambda(China) - kk-analytics-firehose-prod-webhook
Kinesis(China) - prod-kinesis-data-streaming
Firehose(China) - prod-analytics-backup-firehose
S3 Backup(China) - prodanalytics-firehose-bucket/backup
Kinesis(US) - prod-china-kinesis-data-streaming
Firehose(US) - prod-china-analytics-firehose-parquet
S3(US) - prodanalytics-firehose-bucket/parquet

Set up

Local Machine (MAC)

Remote Machine (Linux)

  • Install Java: sudo yum install java-1.8.0-openjdk-devel
  • Follow this guide to install Spark

First Time Deployment

  1. Build fat jar: sbt assembly
  2. Configure cluster: flintrock configure
  3. Launch cluster: flintrock launch prod-spark-cluster
  4. (If not already enabled) Open inbound ports 7077, 8080, and 22 in the cluster's security group in AWS
  5. Check master node public DNS by running flintrock describe
  6. Check if cluster is running by going to http://master_public_dns:8080
  7. Copy the application jar to the cluster nodes (update the local file path): flintrock copy-file prod-spark-cluster /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar /home/ec2-user/
  8. SSH to the master and all worker instances and set the default AWS profile with the Chinese credentials (found in application.conf): aws configure (set the region to cn-northwest-1)
  9. Copy the jar file to /home/ec2-user/ on a separate instance (if one doesn't exist, create a new one and install all Spark dependencies): scp -i "pem-file.pem" /file/path ec2-user@machine-dns:/remote/path/to/file
  10. SSH to that separate instance
  11. Deploy the Spark app by running the following from ~ (replace the master URL with the one shown in the web UI): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark:// --driver-memory 10g /home/ec2-user/china-data-streaming.jar
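
The submit command in the last step depends on a master URL that changes between clusters, so it can help to build it from an argument rather than edit it by hand. The following is a minimal sketch, not part of this repository: `build_submit_cmd` is a hypothetical helper name, the example host is a placeholder, and the default jar path assumes the jar name copied to /home/ec2-user/ in step 7. The flags are the ones from step 11.

```shell
#!/bin/sh
# Sketch: assemble the spark-submit command from step 11 for a given master URL.
# build_submit_cmd is a hypothetical helper, not something shipped with Spark.
build_submit_cmd() {
  master_url="$1"   # the spark://... URL shown in the master web UI
  jar_path="${2:-/home/ec2-user/china-data-streaming.jar}"   # assumed jar location
  case "$master_url" in
    spark://?*) ;;   # looks like a standalone master URL
    *)
      echo "master URL must look like spark://host:port" >&2
      return 1
      ;;
  esac
  printf '%s --deploy-mode cluster --master %s --driver-memory 10g %s\n' \
    /opt/spark/bin/spark-submit "$master_url" "$jar_path"
}

# Print the command for review (placeholder host); nothing is executed here.
build_submit_cmd "spark://master.example.internal:7077"
```

Printing the command instead of running it leaves a chance to check the master URL against the web UI before submitting.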


To make changes to the application running in production:
  • Restart the cluster (all logs will be removed, so save what you need before restarting): flintrock stop prod-spark-cluster, then flintrock start prod-spark-cluster
  • Upload the updated jar file (china-data-streaming.jar):
    /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar /home/ec2-user/
  • Submit the Spark job to the cluster from the external instance (the master URL will be different!): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark:// --driver-memory 10g --executor-memory 10g /home/ec2-user/china-data-streaming.jar
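
The redeploy steps above can be listed as a dry run before touching production. This is a rough sketch under stated assumptions: `redeploy_plan` is a hypothetical helper, the upload step is assumed to use the same `flintrock copy-file` form as the first-time deployment, and the master host is a placeholder. It only prints the commands; the last line is meant to be run on the external instance.

```shell
#!/bin/sh
# Sketch: print the redeploy commands in order without running them (dry run).
# redeploy_plan is a hypothetical helper name used for illustration only.
redeploy_plan() {
  cluster="$1"       # flintrock cluster name
  local_jar="$2"     # path to the freshly built fat jar
  master_url="$3"    # new spark://... URL from the web UI after the restart
  remote_dir="/home/ec2-user/"
  jar_name="$(basename "$local_jar")"
  echo "flintrock stop $cluster"
  echo "flintrock start $cluster"
  # Assumption: the upload uses flintrock copy-file, as in first-time deployment.
  echo "flintrock copy-file $cluster $local_jar $remote_dir"
  echo "/opt/spark/bin/spark-submit --deploy-mode cluster --master $master_url --driver-memory 10g --executor-memory 10g $remote_dir$jar_name"
}

redeploy_plan prod-spark-cluster \
  target/scala-2.11/china-data-streaming.jar \
  spark://master.example.internal:7077
```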

INFO: Records on the Kinesis Data Stream are retained for 24 hours. To avoid data loss, redeploy the Spark cluster within 24 hours of stopping it.
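
To keep track of that window during a redeploy, the remaining time can be computed from the moment the cluster was stopped. This is a small sketch assuming the 24-hour figure stated above; `remaining_seconds` and `RETENTION_SECONDS` are hypothetical names, and timestamps are Unix epoch seconds as printed by `date +%s`.

```shell
#!/bin/sh
# Sketch: seconds left before records older than the stop time start to expire.
# Assumes the 24-hour retention stated in this README.
RETENTION_SECONDS=$((24 * 60 * 60))

remaining_seconds() {
  stopped_at="$1"              # epoch seconds when the cluster was stopped
  now="${2:-$(date +%s)}"      # defaults to the current time
  echo $(( stopped_at + RETENTION_SECONDS - now ))
}

# One hour after stopping, 23 hours remain.
remaining_seconds 1700000000 1700003600   # prints 82800
```

A negative result means the window has already passed. In practice, save the output of `date +%s` just before running flintrock stop and pass it as the first argument.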


  • To increase the Spark deploy response timeout, set spark.rpc.askTimeout & to 800 in /opt/spark-2.4.3-bin-hadoop2.6/conf/spark-defaults.conf
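
Editing spark-defaults.conf by hand on each node is easy to get wrong, so the change can be scripted. The sketch below is an illustration rather than the project's own tooling: `set_spark_conf` is a hypothetical helper that replaces or appends a single whitespace-separated `key value` line. The demo runs against a temporary file; on a real node the first argument would be the spark-defaults.conf path named above.

```shell
#!/bin/sh
# Sketch: set one property in a spark-defaults.conf style file, idempotently.
# set_spark_conf is a hypothetical helper; it keeps all other lines untouched.
set_spark_conf() {
  conf_file="$1"; key="$2"; value="$3"
  touch "$conf_file"
  tmp="$(mktemp)"
  # Keep every line whose first field is not the key, then append the new value.
  awk -v k="$key" '$1 != k' "$conf_file" > "$tmp"
  printf '%s %s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file's permissions are preserved.
  cat "$tmp" > "$conf_file"
  rm -f "$tmp"
}

# Demo on a temporary file instead of the real configuration.
demo_conf="$(mktemp)"
printf 'spark.rpc.askTimeout 120\n' > "$demo_conf"
set_spark_conf "$demo_conf" spark.rpc.askTimeout 800
cat "$demo_conf"
rm -f "$demo_conf"
```

Running the helper twice with the same arguments leaves a single line for the key, so it is safe to re-run after a cluster restart.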

