forked from apache/spark

Tekumara build of Apache PySpark with Hadoop 3.x and cloud jars for S3 access

License: Apache-2.0 (LICENSE, LICENSE-binary)

tekumara/pyspark

Tekumara build of Apache PySpark with Hadoop 3.x

A build of Apache PySpark that uses the hadoop-cloud Maven profile to bundle hadoop-aws 3.x, which provides the S3A filesystem connector.

Install

See Releases

Usage

To use S3A for S3 URLs and temporary AWS STS credentials:

pyspark --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
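
The same settings can also be applied when creating a session from Python. The following is a minimal sketch, not code from this project: the helper names `s3a_session_conf` and `build_session` are hypothetical, and it assumes the temporary credentials are exported in the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` environment variables. `TemporaryAWSCredentialsProvider` reads them from the `fs.s3a.access.key`, `fs.s3a.secret.key` and `fs.s3a.session.token` options.

```python
import os


def s3a_session_conf(env=os.environ):
    """Return Spark conf entries that route s3:// URLs through S3A with STS credentials.

    Hypothetical helper for illustration. Assumes the temporary credentials
    are present in the standard AWS environment variables.
    """
    return {
        # Same two settings as the pyspark command line above.
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        # Options read by TemporaryAWSCredentialsProvider.
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.session.token": env["AWS_SESSION_TOKEN"],
    }


def build_session(app_name="s3a-example"):
    """Create (or reuse) a SparkSession configured with the entries above."""
    # Imported lazily so the conf helper can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_session_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the configuration in a plain dictionary makes it easy to inspect or log (minus the secrets) before a session is started.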

To modify an existing Spark session to use S3A for S3 URLs, for example the spark session in the pyspark shell:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
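
If this is needed in more than one place, the call can be wrapped in a small function. This is a sketch under assumptions: `use_s3a_for_s3_urls` is a hypothetical name, and `_jsc` is a private PySpark attribute that may change between releases. Filesystem instances that Hadoop has already created and cached for a URL may not pick up the change, so it is safest to apply it before the first read or write.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def use_s3a_for_s3_urls(spark, schemes=("s3",)):
    """Point the given URL schemes at S3A on an already-running session.

    Hypothetical helper: `spark` is an existing SparkSession, such as the
    one predefined in the pyspark shell.
    """
    # Reaches the JVM Hadoop Configuration through PySpark's private _jsc handle.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    for scheme in schemes:
        hadoop_conf.set(f"fs.{scheme}.impl", S3A_IMPL)
    return hadoop_conf
```

In the pyspark shell this would be called as `use_s3a_for_s3_urls(spark)`.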

See test_s3a.py for an example of using the staging committers.

Rationale

The pyspark distribution on PyPI ships with Hadoop 2.7 and no cloud jars (i.e. hadoop-aws), so common practice is to use hadoop-aws 2.7.3 as follows:

pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.3" --driver-java-options "-Dspark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"

However, later versions of hadoop-aws cannot be used this way without errors.

This project builds a pyspark distribution from source with Hadoop 3.x.

Later versions of hadoop-aws contain the following new features:

To take advantage of the committers in the Hadoop 3.x release line from Spark, you also need the binding classes introduced into Spark 3.0.0 by SPARK-23977. For Spark 2.4, the Hortonworks backport from the Hortonworks repo is used.
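
As an illustration of how the binding classes are wired in, the sketch below collects the settings described in the Spark cloud-integration documentation for Spark 3. It is not taken from this project: the helper names are hypothetical, it assumes Parquet output and one of the two staging committers (`directory` or `partitioned`), and the class names are the Spark 3 ones, so the Hortonworks backport for Spark 2.4 may use different packages. test_s3a.py remains the reference example for this build.

```python
def staging_committer_conf(committer="directory"):
    """Return Spark conf entries that enable an S3A staging committer.

    Hypothetical helper; class names are those shipped with Spark 3's
    hadoop-cloud module.
    """
    if committer not in ("directory", "partitioned"):
        raise ValueError(f"not a staging committer: {committer!r}")
    return {
        # Which S3A committer Hadoop should use.
        "spark.hadoop.fs.s3a.committer.name": committer,
        # Binding classes that let Spark SQL use Hadoop's path output committers.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def to_cli_args(conf):
    """Turn a conf dict into a list of `--conf key=value` arguments for pyspark."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

`to_cli_args(staging_committer_conf())` produces arguments that can be appended to the pyspark command shown under Usage.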
