SF Crime Statistics with Spark Streaming

Intro

In this project, you will be provided with a real-world dataset, extracted from Kaggle, on San Francisco crime incidents, and you will provide statistical analyses of the data using Apache Spark Structured Streaming. You will draw on the skills and knowledge you've learned in this course to create a Kafka server that produces the data, and to ingest that data through Spark Structured Streaming.

Development Environment

You may choose to create your project in the workspace we provide here, or if you wish to develop your project locally, you will need to set up your environment properly as described below:

Spark 2.4.3
Scala 2.11.x
Java 1.8.x
Kafka built with Scala 2.11.x
Python 3.6.x or 3.7.x

Output Pictures

kafka-console-consumer

Progress

Streaming UI Output

Questions

1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

The effect is visible in the numInputRows, inputRowsPerSecond, and processedRowsPerSecond values of the streaming progress report. Changing the key/value pairs listed in the answer to question 2 changed how many rows were processed.
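As a minimal sketch of how those three values could be compared between runs, the snippet below averages them over a list of progress reports. It assumes each report is a plain dict with the field names mentioned above; the function name `summarize_progress` and the sample numbers are hypothetical and are not measurements from this project.

```python
from statistics import mean

RATE_KEYS = ("inputRowsPerSecond", "processedRowsPerSecond")


def summarize_progress(reports):
    """Average the throughput fields over a list of progress dicts."""
    # Keep only reports that carried rows and contain both rate fields,
    # so empty micro-batches do not drag the averages down.
    active = [
        r for r in reports
        if r.get("numInputRows", 0) > 0 and all(k in r for k in RATE_KEYS)
    ]
    if not active:
        return None
    return {
        "batches": len(active),
        "total_rows": sum(r["numInputRows"] for r in active),
        "avg_input_rps": mean(r["inputRowsPerSecond"] for r in active),
        "avg_processed_rps": mean(r["processedRowsPerSecond"] for r in active),
    }


# Hypothetical sample values for illustration only.
sample = [
    {"numInputRows": 200, "inputRowsPerSecond": 100.0, "processedRowsPerSecond": 400.0},
    {"numInputRows": 0, "inputRowsPerSecond": 0.0, "processedRowsPerSecond": 0.0},
    {"numInputRows": 400, "inputRowsPerSecond": 200.0, "processedRowsPerSecond": 600.0},
]
summary = summarize_progress(sample)
print(summary)
```

In a PySpark job, a list like `sample` could be taken from the `recentProgress` attribute of a running `StreamingQuery`, which returns recent progress reports as dicts.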

2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

maxRatePerPartition
spark.default.parallelism

You can tell whether a setting improved or degraded performance by comparing the values named in the first answer (numInputRows, inputRowsPerSecond, and processedRowsPerSecond) across test runs.
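One way to make that comparison concrete is to rank each tested setting by its mean processedRowsPerSecond. The sketch below assumes the progress reports for each run have already been collected as dicts; the helper name `rank_configs`, the setting labels, and every number are hypothetical placeholders, not results from this project.

```python
def rank_configs(runs):
    """Order setting labels from highest to lowest mean processedRowsPerSecond."""
    scores = {}
    for label, reports in runs.items():
        # Ignore reports that do not include the rate field.
        rates = [
            r["processedRowsPerSecond"]
            for r in reports
            if "processedRowsPerSecond" in r
        ]
        if rates:
            scores[label] = sum(rates) / len(rates)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical runs: one list of progress dicts per tested setting.
runs = {
    "maxRatePerPartition=10": [
        {"processedRowsPerSecond": 50.0},
        {"processedRowsPerSecond": 70.0},
    ],
    "maxRatePerPartition=100": [
        {"processedRowsPerSecond": 300.0},
        {"processedRowsPerSecond": 500.0},
    ],
    "spark.default.parallelism=4": [
        {"processedRowsPerSecond": 200.0},
        {"processedRowsPerSecond": 240.0},
    ],
}
ranking = rank_configs(runs)
for label, rate in ranking:
    print(f"{label}: {rate:.1f} rows/s")
```

Ranking by a single averaged metric is a simplification; a fuller comparison would also look at whether processedRowsPerSecond keeps pace with inputRowsPerSecond in each run.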

