
PySpark for Data Processing

Code for my presentation: Using PySpark to Process Boat Loads of Data

Download the slides from Slideshare.

Dataset

This code uses the Hazardous Air Pollutants dataset from Kaggle.

Stats

  • Data source: United States Environmental Protection Agency
  • Number of Files: 1
  • Compressed Size: 658.5MB
  • Uncompressed Size: 2.4GB

Content

The daily summary file contains data for every monitor (sampled parameter) in the Environmental Protection Agency (EPA) database for each day. Each daily summary record is one of:

  1. The aggregate of all sub-daily measurements taken at the monitor.
  2. The single sample value, if the monitor takes a single, daily sample (e.g., there is only one sample with a 24-hour duration). In this case, the mean and max daily sample will have the same value.
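
Before digging into the fields, a quick way to get oriented is to load the file and let Spark infer the schema. This is a minimal sketch, assuming the CSV sits at the data/epa_hap_daily_summary.csv path used later in this README:

# Minimal sketch: load the daily summary CSV and peek at its structure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hap-preview").getOrCreate()

df = (spark.read
      .option("header", "true")       # the CSV ships with a header row
      .option("inferSchema", "true")  # let Spark guess numeric vs. string columns
      .csv("data/epa_hap_daily_summary.csv"))

df.printSchema()
df.show(5, truncate=False)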

Field Descriptions

  1. State Code: The Federal Information Processing Standards (FIPS) code of the state in which the monitor resides.
  2. County Code: The FIPS code of the county in which the monitor resides.
  3. Site Num: A unique number within the county identifying the site.
  4. Parameter Code: The AQS code corresponding to the parameter measured by the monitor.
  5. POC: This is the “Parameter Occurrence Code” used to distinguish different instruments that measure the same parameter at the same site.
  6. Latitude: The monitoring site’s angular distance north of the equator measured in decimal degrees.
  7. Longitude: The monitoring site’s angular distance east of the prime meridian measured in decimal degrees.
  8. Datum: The Datum associated with the Latitude and Longitude measures.
  9. Parameter Name: The name or description assigned in AQS to the parameter measured by the monitor. Parameters may be pollutants or non-pollutants.
  10. Sample Duration: The length of time that air passes through the monitoring device before it is analyzed (measured). So, it represents an averaging period in the atmosphere (for example, a 24-hour sample duration draws ambient air over a collection filter for 24 straight hours). For continuous monitors, it can represent an averaging time of many samples (for example, a 1-hour value may be the average of four one-minute samples collected during each quarter of the hour).
  11. Pollutant Standard: A description of the ambient air quality standard rules used to aggregate statistics. (See description at beginning of document.)
  12. Date Local: The calendar date for the summary. All daily summaries are for the local standard day (midnight to midnight) at the monitor.
  13. Units of Measure: The unit of measure for the parameter. QAD always returns data in the standard units for the parameter. Submitters are allowed to report data in any unit and EPA converts to a standard unit so that we may use the data in calculations.
  14. Event Type: Indicates whether data measured during exceptional events are included in the summary. A wildfire is an example of an exceptional event; it is something that affects air quality, but the local agency has no control over. No Events means no events occurred. Events Included means events occurred and the data from them is included in the summary. Events Excluded means that events occurred but data from them is excluded from the summary. Concurred Events Excluded means that events occurred but only EPA-concurred exclusions are removed from the summary. If an event occurred for the parameter in question, the data will have multiple records for each monitor.
  15. Observation Count: The number of observations (samples) taken during the day.
  16. Observation Percent: The percent representing the number of observations taken with respect to the number scheduled to be taken during the day. This is only calculated for monitors where measurements are required (e.g., only certain parameters).
  17. Arithmetic Mean: The average (arithmetic mean) value for the day.
  18. 1st Max Value: The highest value for the day.
  19. 1st Max Hour: The hour (on a 24-hour clock) when the highest value for the day (the previous field) was taken.
  20. AQI: The Air Quality Index for the day for the pollutant, if applicable.
  21. Method Code: An internal system code indicating the method (processes, equipment, and protocols) used in gathering and measuring the sample. The method name is in the next column.
  22. Method Name: A short description of the processes, equipment, and protocols used in gathering and measuring the sample.
  23. Local Site Name: The name of the site (if any) given by the State, local, or tribal air pollution control agency that operates it.
  24. Address: The approximate street address of the monitoring site.
  25. State Name: The name of the state where the monitoring site is located.
  26. County Name: The name of the county where the monitoring site is located.
  27. City Name: The name of the city where the monitoring site is located. This represents the legal incorporated boundaries of cities and not urban areas.
  28. CBSA Name: The name of the core based statistical area (metropolitan area) where the monitoring site is located.
  29. Date of Last Change: The most recent date on which any numeric values in this record were updated in the AQS data system.
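
To see a few of these fields in action, here is a small sketch that filters out exceptional-event records and previews a handful of columns. The snake_case column names (state_name, event_type, and so on) are assumptions about the Kaggle CSV's headers; adjust them if your copy differs:

# Sketch: preview a few fields, keeping only rows unaffected by exceptional events.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hap-fields").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/epa_hap_daily_summary.csv"))

(df.filter(F.col("event_type") == "No Events")   # see field 14 above
   .select("state_name", "parameter_name", "arithmetic_mean", "first_max_value", "aqi")
   .show(5, truncate=False))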

Dataset Preview


Spark in Docker

To make running the code easy, Spark runs inside Docker. This project uses the Spark containers from Getty Images. The included Docker Compose file will create two containers for a 2-node cluster (see the connection sketch after this list):

  1. Master node
  2. Worker node
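
A PySpark script targets this cluster through the master URL. The sketch below assumes the Compose file names the master service master and uses Spark's default port 7077; check the repository's docker-compose.yml for the actual values:

# Sketch: build a session against the Compose cluster. "spark://master:7077"
# assumes the master service name and default port; verify against the Compose file.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master:7077")
         .appName("pyspark-for-data-processing")
         .getOrCreate())

When submitting from inside the master container, the master URL can instead be passed to spark-submit via its --master flag, so scripts often omit the .master(...) call.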

While a Spark job is running you will have access to the Spark Web UI. Once the job finishes, it's gone.

To run the Docker containers, ensure you have Docker installed and running.

Run the Code

Check out the code

git clone git@github.com:rdempsey/pyspark-for-data-processing.git

Download the dataset, unzip it, and place the CSV file in the project's data folder

cp ~/Downloads/epa_hap_daily_summary.csv ./data/epa_hap_daily_summary.csv

Run the Docker Compose file

docker-compose up --build -d

Open a bash shell in the container

docker exec -it pysparkfordataprocessing_master_1 /bin/bash

Run one of the scripts in the scripts folder

bin/spark-submit /scripts/q1_most_polluted_state.py

All output will be saved into the data folder.
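
For orientation, a question script might take roughly the following shape. This is a hypothetical sketch, not the repository's actual q1_most_polluted_state.py; the snake_case column names and the /data output path (assuming the Compose file mounts the project's data folder there) are assumptions:

# Hypothetical sketch: rank states by average daily concentration and write
# the result out. Column names and the /data mount point are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("most-polluted-state").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/epa_hap_daily_summary.csv"))

(df.groupBy("state_name")
   .agg(F.avg("arithmetic_mean").alias("avg_concentration"))
   .orderBy(F.desc("avg_concentration"))
   .limit(1)
   .write.mode("overwrite")
   .csv("/data/most_polluted_state"))

spark.stop()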
