Introduction to PySpark

Presentation for PyData Berlin September meetup

Agenda

  • Why Spark?
  • Why PySpark?
  • Why Not [Py]Spark?
  • Getting Started
  • Core Concepts
  • ETL Example
  • Machine Learning Example
  • Unit Testing
  • Performance
  • Gotchas
  • [Py]Spark Alternatives
  • References

Why Spark?

  • Large data sets
  • Cost of scaling up >> cost of scaling out
  • Batch processing, stream processing, graph processing, SQL and machine learning
  • In memory (sometimes)
  • Programming model
  • Generic framework

Why PySpark?

  • ScalaData?
  • Existing platform
  • Team - existing and future

Why Not [Py]Spark?

  • Performance
  • Complexity
  • Troubleshooting
  • Necessary?
  • Small community?

Getting Started

  • Local (see the sketch below)
  • Cloud
    • Databricks
    • AWS
    • ZeppelinHub, Microsoft, Google, etc.
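
For the local option, a minimal bootstrap sketch in plain Python (assuming Spark 1.x-era APIs, findspark installed, and SPARK_HOME pointing at an unpacked Spark distribution):

    # Assumes findspark is installed and SPARK_HOME is set.
    import findspark
    findspark.init()  # adds pyspark to sys.path using SPARK_HOME

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(master="local[*]", appName="intro-to-pyspark")
    sqlContext = SQLContext(sc)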

Core Concepts

  • Driver / Workers
  • RDDs
    • Immutable collection
    • Resilient
    • Distributed / partitioned and can control partitioning
    • In-memory (at times)
  • Loading data
    • Files on local filesystem, HDFS, S3, RedShift, Hive, etc.
    • CSV, JSON, Parquet, etc.
  • Transforms
    • map / reduce
    • filter
    • aggregate
    • joins
  • Actions
    • saveAsTextFile
    • count
    • take / first
    • collect
  • DataFrames
    • Higher-level concept
    • Based on RDD
    • Structured - like a table and with a schema (which can be inferred)
    • Faster
    • Easier to work with
    • API or SQL
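
A short sketch tying these concepts together, reusing sc and sqlContext from the local setup above (events.csv and its columns are hypothetical):

    # Transforms are lazy; nothing runs until an action is called.
    lines = sc.textFile("events.csv")                 # hypothetical input
    parsed = (lines.map(lambda line: line.split(","))
                   .filter(lambda cols: cols[0] != "id"))  # drop header row

    parsed.count()                                    # action: runs the job
    parsed.take(2)                                    # action: first two rows

    # DataFrames: structured, named columns, queryable via API or SQL.
    df = sqlContext.createDataFrame(
        parsed.map(lambda cols: (cols[0], float(cols[1]))), ["id", "value"])
    df.groupBy("id").sum("value").show()              # DataFrame API
    df.registerTempTable("events")                    # Spark 1.x-style SQL
    sqlContext.sql("SELECT id, SUM(value) FROM events GROUP BY id").show()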

ETL Example

Databricks notebook
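
The notebook itself isn't reproduced in this README; as a stand-in, a minimal CSV-to-Parquet sketch using the spark-csv package (bucket, paths and column names are hypothetical):

    # Read raw CSV via spark-csv, filter it, write Parquet.
    raw = (sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("s3n://example-bucket/raw/transactions.csv"))

    cleaned = (raw.filter(raw["amount"] > 0)          # drop hypothetical noise
                  .withColumnRenamed("ts", "timestamp"))

    cleaned.write.format("parquet") \
           .save("s3n://example-bucket/clean/transactions/")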

Machine Learning Example
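
Again as a stand-in for the talk's example, a minimal pyspark.ml sketch with toy inline data (Spark 1.6-era imports; not the presentation's actual model):

    # Logistic regression on a toy DataFrame.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.mllib.linalg import Vectors  # pyspark.ml.linalg from Spark 2.0

    training = sqlContext.createDataFrame([
        (1.0, Vectors.dense(0.0, 1.1)),
        (0.0, Vectors.dense(2.0, 1.0)),
        (1.0, Vectors.dense(0.1, 1.2)),
        (0.0, Vectors.dense(2.2, 0.9)),
    ], ["label", "features"])

    model = LogisticRegression(maxIter=10, regParam=0.01).fit(training)
    model.transform(training).select("label", "prediction").show()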

Unit Testing

  • findspark
    • export SPARK_HOME="..."
  • spark-testing-base
    • class SparkTestingBase(TestCase)
    • class SparkTestingBaseReuse(TestCase)
  • export PYSPARK_SUBMIT_ARGS="… pyspark-shell"
  • export SPARK_MASTER="yarn-client"
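
A minimal sketch of the findspark route with plain unittest; spark-testing-base's base classes exist to remove the setUp/tearDown boilerplate shown here:

    # Assumes SPARK_HOME is exported, per the notes above.
    import findspark
    findspark.init()

    import unittest
    from pyspark import SparkContext

    class WordCountTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.sc = SparkContext("local[2]", "unit-tests")

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_word_count(self):
            counts = dict(self.sc.parallelize(["a b", "b"])
                          .flatMap(lambda line: line.split())
                          .map(lambda word: (word, 1))
                          .reduceByKey(lambda a, b: a + b)
                          .collect())
            self.assertEqual({"a": 1, "b": 2}, counts)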

Performance

Gotchas

  • The enclosing class gets pickled when its methods (seemingly including static methods) are distributed to workers
  • Spurious error messages
    • Examples
      • Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
      • You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o23)
      • 16/06/14 14:46:20 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 334AFFEECBCB0CC9)
      • java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint:
    • In some cases these aren't errors at all; in others they mask the real error - look elsewhere in the console / log
  • For some (e.g. running locally and disconnected) use-cases, HiveContext is less stable than SQLContext (though community generally recommends the former)
  • Distributing Python files to the workers
    • --py-files (whether given .py, .zip or .egg files) seems not always to work as expected
    • Packaging the code and installing it on the workers (e.g. in a bootstrap action) seems more reliable
  • Select-syntax quirks (see the sketch at the end of this list)
    • Use bitwise operators such as ~ (not `not`) on columns
    • Other seemingly random errors can often be fixed by adding brackets, since & and | bind more tightly than comparisons
  • spark-csv
    • In some cases an escape character must be set; neither None nor the empty string works, but obscure Unicode characters seem to

    • When seeing problems such as java.lang.NoClassDefFoundError or java.lang.NoSuchMethodError, check you're using the version built for the appropriate version of Scala (2.10 vs. 2.11)

    • sqlContext.read.load fails with the following error when reading CSV files, if format='csv' is not specified (which is not required for sqlContext.load):

      Caused by: java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/Richard/src/earnest/preprocessing/storage/local/mnt/3m-panel/card/20160120_YODLEE_CARD_PANEL.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 50, 48, 10]

  • The Redshift data source's behavior can challenge expectations
    • Be careful with schemas and be aware of when it's rewriting them
    • For longer text fields, do not let the data source [re]create the table
    • Pre-actions don't seem to work on some builds
    • Remember to set up a cleanup policy for the transfer directory on S3
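
To make the select-syntax and spark-csv gotchas above concrete (df, the column names and the file are hypothetical):

    # Column expressions use bitwise operators, and need brackets because
    # & and | bind more tightly than comparisons.
    df.filter(~(df["status"] == "ok"))                    # negation is ~, not `not`
    df.filter((df["amount"] > 0) & (df["amount"] < 100))  # brackets required

    # read.load defaults to Parquet (hence the "magic number" error above);
    # name the format explicitly when reading CSV.
    panel = sqlContext.read.load("panel.txt", format="csv", header="true")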

[Py]Spark Alternatives

  • Scala Spark
  • Beam / Flink / Apex / ...
  • Pig etc.
  • AWK / sed?
  • Python
    • Pandas?
    • Threads?
    • AsyncIO/Tornado/etc.
    • Multiprocessing
    • Parallel Python
    • IPython Parallel
    • Cython
    • Queues / pub/sub (NPQ, Celery, SQS, etc.)
    • Gearman
    • PyRes
    • Distarray / Blaze
    • Dask

References

Contact
