
Spark data streaming application over the Great Firewall of China


This is a Spark data streaming application that serves as a bridge over the Great Firewall of China.

Architecture

Lambda(China) -> Kinesis(China) -> Spark Cluster(US) -> Kinesis(US) -> Firehose(US) -> S3(US)
      |                                    |
      v                                    v
Firehose(China)                    Spark Application
      |                                    |
      v                                    v
S3 Backup(China)                       Spark UI



Lambda(China) - kk-analytics-firehose-prod-webhook
Kinesis(China) - prod-kinesis-data-streaming
Firehose(China) - prod-analytics-backup-firehose
S3 Backup(China) - prodanalytics-firehose-bucket/backup
Kinesis(US) - prod-china-kinesis-data-streaming
Firehose(US) - prod-china-analytics-firehose-parquet
S3(US) - prodanalytics-firehose-bucket/parquet
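
For orientation, here is a minimal sketch of what the bridge job does, assuming the spark-streaming-kinesis-asl connector and the AWS Java SDK v1. The stream names come from the list above; the us-east-1 region, batch interval, and credentials handling are assumptions, and the real application may differ:

    import java.nio.ByteBuffer
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.PutRecordRequest
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

    object ChinaDataStreaming {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("china-data-streaming")
        val ssc  = new StreamingContext(conf, Seconds(10)) // batch interval is an assumption

        // Read from the China-side stream; the default AWS profile on each
        // node holds the Chinese credentials (see step 8 below).
        val records = KinesisInputDStream.builder
          .streamingContext(ssc)
          .streamName("prod-kinesis-data-streaming")
          .endpointUrl("https://kinesis.cn-northwest-1.amazonaws.com.cn")
          .regionName("cn-northwest-1")
          .initialPosition(new KinesisInitialPositions.Latest())
          .checkpointAppName("china-data-streaming")
          .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
          .build()

        // Forward every record to the US-side stream, where Firehose
        // picks it up and writes parquet to S3.
        records.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            val client = AmazonKinesisClientBuilder.standard()
              .withRegion("us-east-1") // assumed US region; credentials handling simplified
              .build()
            partition.foreach { bytes =>
              client.putRecord(new PutRecordRequest()
                .withStreamName("prod-china-kinesis-data-streaming")
                .withData(ByteBuffer.wrap(bytes))
                .withPartitionKey(java.util.UUID.randomUUID.toString))
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }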

Setup

Local Machine (macOS)

  • Install sbt and Flintrock (both are used in the deployment steps below)

Remote Machine (Linux)

  • Install Java: sudo yum install java-1.8.0-openjdk-devel
  • Follow this guide to install Spark

First Time Deployment

  1. Build the fat jar: sbt assembly (a sketch of the build definition appears after this list)
  2. Configure the cluster: flintrock configure
  3. Launch the cluster: flintrock launch prod-spark-cluster
  4. (If not already enabled) Enable inbound ports 7077, 8080 & 22 for the cluster security group in your AWS console
  5. Check the master node's public DNS by running flintrock describe
  6. Check that the cluster is running by going to http://master_public_dns:8080
  7. Copy the application jar to the cluster nodes (update the local file path): flintrock copy-file prod-spark-cluster /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar /home/ec2-user/
  8. ssh to the master and all worker instances and set the default AWS profile with the Chinese credentials (found in application.conf): aws configure (set the region to cn-northwest-1)
  9. Copy the jar file to /home/ec2-user/ of a separate instance (ec2-54-196-74-76.compute-1.amazonaws.com), or if one doesn't exist, create a new one and install all Spark dependencies: scp -i "pem-file.pem" /file/path ec2-user@machine-dns:/remote/path/to/file
  10. ssh to the above instance
  11. Deploy the Spark app by running the following from ~ (replace the master URL with the one found in the web UI): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark://ec2-3-91-11-48.compute-1.amazonaws.com:7077 --driver-memory 10g /home/ec2-user/china-data-streaming.jar
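
Step 1 assumes an sbt-assembly setup. Here is a minimal sketch of what build.sbt might look like, inferred from the scala-2.11 jar path above and the spark-2.4.3 install referenced under Debugging; the actual versions and dependencies are assumptions and may differ from the repo:

    // build.sbt (sketch; versions inferred, not confirmed by the repo)
    // Requires in project/plugins.sbt:
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
    name := "china-data-streaming"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // Provided by the cluster at runtime
      "org.apache.spark" %% "spark-streaming" % "2.4.3" % "provided",
      // Must ship inside the fat jar
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.3"
    )

    // Produces target/scala-2.11/china-data-streaming.jar, the path used in step 7
    assemblyJarName in assembly := "china-data-streaming.jar"

    // Typical merge strategy for Spark fat jars
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }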

Maintenance

To make changes to the application running in production:
  • Restart the cluster (all logs will be removed, so save what you need before restarting): flintrock stop prod-spark-cluster and flintrock start prod-spark-cluster
  • Upload the updated jar file: flintrock copy-file prod-spark-cluster /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar /home/ec2-user/
  • Submit the Spark job to the cluster from the external instance (the master URL will be different!): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark://ec2-54-234-92-94.compute-1.amazonaws.com:7077 --driver-memory 10g --executor-memory 10g /home/ec2-user/china-data-streaming.jar

INFO: Records on the Kinesis data stream are stored for 24 hours (or until read). To avoid data loss, redeploy the Spark cluster within 24 hours.

Debugging

  • To increase the Spark deploy response timeout, set spark.rpc.askTimeout & spark.network.timeout to 800 in /opt/spark-2.4.3-bin-hadoop2.6/conf/spark-defaults.conf (see the snippet below)
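
For reference, the corresponding lines in spark-defaults.conf would look like this (unitless values are interpreted as seconds):

    spark.rpc.askTimeout     800
    spark.network.timeout    800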
