Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added parquet reader #169

Merged
merged 7 commits into from Nov 3, 2018

Conversation

@ajayborra
Copy link
Contributor

commented Oct 31, 2018

Related issues
#56

Describe the proposed solution
Simple parquet reader implementation using native spark parquet reader.

@ajayborra ajayborra requested review from leahmcguire and tovbinm as code owners Oct 31, 2018
@salesforce-cla salesforce-cla bot added the cla:signed label Oct 31, 2018
override def read(params: OpParams = new OpParams())(implicit sc: SparkSession): Either[RDD[T], Dataset[T]] = Right {
val finalPath = getFinalReadPath(params)
val data: Dataset[T] = sc.read
.schema(implicitly[Encoder[T]].schema) // without this, every value gets read in as a string

This comment has been minimized.

Copy link
@rachelwarren

rachelwarren Oct 31, 2018

Collaborator

you should be able to use parquet schema inference directly if you use data frame rather than rdd api e.g.

implicit val encoder = Encoders.product[T]
spark.read.parquet(path).as[T]

should just work.

This comment has been minimized.

Copy link
@ajayborra

ajayborra Nov 3, 2018

Author Contributor

@rachelwarren Seems like spark.read.parquet(path).as[T] is doing the trick we can skip defining implicit val encoder = Encoders.product[T]

@codecov

This comment has been minimized.

Copy link

commented Oct 31, 2018

Codecov Report

Merging #169 into master will decrease coverage by 17.81%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #169       +/-   ##
===========================================
- Coverage   86.19%   68.38%   -17.82%     
===========================================
  Files         304      305        +1     
  Lines        9931     9937        +6     
  Branches      342      528      +186     
===========================================
- Hits         8560     6795     -1765     
- Misses       1371     3142     +1771
Impacted Files Coverage Δ
.../scala/com/salesforce/op/readers/DataReaders.scala 82.35% <100%> (+2.35%) ⬆️
...m/salesforce/op/readers/ParquetProductReader.scala 100% <100%> (ø)
...sforce/op/stages/base/binary/BinaryEstimator.scala 0% <0%> (-100%) ⬇️
...la/com/salesforce/op/aggregators/Geolocation.scala 0% <0%> (-100%) ⬇️
.../salesforce/op/aggregators/FeatureAggregator.scala 0% <0%> (-100%) ⬇️
...ce/op/stages/base/sequence/SequenceEstimator.scala 0% <0%> (-100%) ⬇️
...stages/base/sequence/BinarySequenceEstimator.scala 0% <0%> (-100%) ⬇️
...alesforce/op/cli/gen/templates/SimpleProject.scala 0% <0%> (-100%) ⬇️
...rce/op/stages/OpPipelineStageReadWriteShared.scala 0% <0%> (-100%) ⬇️
...ages/base/sequence/BinarySequenceTransformer.scala 0% <0%> (-100%) ⬇️
... and 91 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1488104...5e1e369. Read the comment docs.


@RunWith(classOf[JUnitRunner])
class ParquetProductReaderTest extends FlatSpec with TestSparkContext {
def testDataPath: String = "../test-data"

This comment has been minimized.

Copy link
@tovbinm

tovbinm Nov 1, 2018

Collaborator

use testDataDir from TestCommon

key = _.PassengerId.toString
)

val record = caseReader.readDataset().collect().take(1)(0)

This comment has been minimized.

Copy link
@tovbinm

tovbinm Nov 1, 2018

Collaborator

you probably should sort then collect

tovbinm and others added 5 commits Nov 1, 2018
…I into ab/parquet_reader
@tovbinm
tovbinm approved these changes Nov 3, 2018
Copy link
Collaborator

left a comment

lgtm!

@tovbinm tovbinm merged commit bd9861d into salesforce:master Nov 3, 2018
27 of 28 checks passed
27 of 28 checks passed
Travis CI - Pull Request Build Failed
Details
CodeFactor No issues found.
Details
ci/circleci: compile Your tests passed on CircleCI!
Details
ci/circleci: scala-style Your tests passed on CircleCI!
Details
ci/circleci: test Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 86.19%)
Details
codecov/project Absolute coverage decreased by -17.81% but relative coverage increased by +13.8% compared to 1488104
Details
license/snyk - build.gradle (salesforce) No manifest changes detected
license/snyk - cli/build.gradle (salesforce) No manifest changes detected
license/snyk - core/build.gradle (salesforce) No manifest changes detected
license/snyk - features/build.gradle (salesforce) No manifest changes detected
license/snyk - helloworld/build.gradle (salesforce) No manifest changes detected
license/snyk - local/build.gradle (salesforce) No manifest changes detected
license/snyk - models/build.gradle (salesforce) No manifest changes detected
license/snyk - readers/build.gradle (salesforce) No manifest changes detected
license/snyk - testkit/build.gradle (salesforce) No manifest changes detected
license/snyk - utils/build.gradle (salesforce) No manifest changes detected
salesforce-cla All contributors have signed the CLA
Details
security/snyk - build.gradle (salesforce) No manifest changes detected
security/snyk - cli/build.gradle (salesforce) No manifest changes detected
security/snyk - core/build.gradle (salesforce) No manifest changes detected
security/snyk - features/build.gradle (salesforce) No manifest changes detected
security/snyk - helloworld/build.gradle (salesforce) No manifest changes detected
security/snyk - local/build.gradle (salesforce) No manifest changes detected
security/snyk - models/build.gradle (salesforce) No manifest changes detected
security/snyk - readers/build.gradle (salesforce) No manifest changes detected
security/snyk - testkit/build.gradle (salesforce) No manifest changes detected
security/snyk - utils/build.gradle (salesforce) No manifest changes detected
@tovbinm

This comment has been minimized.

Copy link
Collaborator

commented Nov 3, 2018

@ajayborra thanks for the contribution! please add aggregate and conditional parquet readers next ;)

@ajayborra ajayborra deleted the ajayborra:ab/parquet_reader branch Nov 6, 2018
ericwayman pushed a commit that referenced this pull request Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.