
analysis-with-python-and-pyspark

My solutions and notes to the mid-chapter and end-of-chapter exercises in Jonathan Rioux's Data Analysis with Python and PySpark.

Why learn PySpark?

To prepare for tasks where data becomes too large to work with locally, I wanted to learn at least one distributed computing framework reasonably well, and Spark remains a popular choice in the data science and data analytics community. All three major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) offer a managed Spark cluster as part of their platforms, making it easy to get up and running with a fully provisioned cluster.

PySpark provides a Python entry point to Spark's computational model. It exposes not just the core Spark API, but also bespoke functionality for scaling out regular Python code, as well as pandas-style transformations via the pandas API on Spark.

Having used PySpark on Databricks in a product management role to answer questions about the datasets underpinning our product, I wanted to refresh my skills while approaching Spark from an analyst's perspective.
