
analysis-with-python-and-pyspark

My solutions and notes to the mid-chapter and end-of-chapter exercises in Jonathan Rioux's Data Analysis with Python and PySpark.

Why learn PySpark?

To prepare for tasks where data becomes too large to work with locally, I wanted to learn at least one distributed computing framework reasonably well, and Spark remains a popular choice in the data science and data analytics community. All three major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) offer a managed Spark cluster as part of their platforms, making it easy to get up and running with a fully provisioned cluster.

PySpark provides a Python entry point to Spark's computational model. It exposes not just the core Spark API, but also bespoke functionality for scaling out regular Python code, as well as pandas-style transformations via the pandas API on Spark.

Having used PySpark on Databricks in a product management role to answer questions about the datasets underpinning our product, I wanted to refresh my skills while approaching Spark from an analyst's perspective.
