BlobIterator implementation #25
Conversation
Force-pushed from 9d2cdcf to 59549d8.
This is not supposed to be a complete implementation yet: i.e. it does nothing at all in case there were no filters for commits. One thing that seems reasonable in such cases is to iterate all (non-remote, non-filtered) refs' HEADs. Feedback is very welcome. \cc @ajnavarro @mcarmonaa
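A rough sketch of that fallback using JGit (my own illustration of the idea described above, not necessarily the exact code in this PR; the helper name is made up):

```scala
import scala.collection.JavaConverters._
import org.eclipse.jgit.api.Git
import org.eclipse.jgit.lib.{ObjectId, Repository}

// Hypothetical helper: collect the commit ids at the HEAD of every local,
// non-symbolic branch, to be used when no commit-hash filters were pushed down.
def headsOfLocalBranches(repo: Repository): Iterator[ObjectId] =
  new Git(repo).branchList().call().asScala
    .filter(ref => !ref.isSymbolic)   // skip symbolic refs such as HEAD
    .map(_.getObjectId)               // the commit each branch points at
    .toIterator
```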
```scala
override protected def mapColumns(tree: TreeWalk): Map[String, () => Any] = {
  val content = readFile(tree.getObjectId(0), tree.getObjectReader)
  Map[String, () => Any](
    "file_hash" -> (() => tree.getObjectId(0)),
```
Are you missing the commit hash here, maybe?
Force-pushed from ba52b6e to 79b7f6c.
Join in Implicits commits and files on
Codecov Report
```
@@            Coverage Diff             @@
##           master      #25      +/-   ##
============================================
- Coverage     85.25%   83.42%   -1.83%
- Complexity       26       31       +5
============================================
  Files            12       13       +1
  Lines           278      362      +84
  Branches         41       62      +21
============================================
+ Hits            237      302      +65
- Misses           17       27      +10
- Partials         24       33       +9
```
Continue to review full report at Codecov.
I miss some tests for the iterator itself, like CommitIteratorSpec, ReferenceIteratorSpec or RepositoryIteratorSpec.

#24 should be merged before this PR, and the duplicated filter code removed.
```scala
      Nil
    )

    // StructField("lang", StringType) ::
```
Could you delete the commented code, please? These columns will be added when enry and babelfish are implemented.
```scala
class BlobIterator(requiredColumns: Array[String], repo: Repository, filters: Array[CompiledFilter])
  extends RootedRepoIterator[CommitTree](requiredColumns, repo) {

  val log = Logger.getLogger(this.getClass.getSimpleName)
```
You should use org.apache.spark.internal.Logging. See an example in org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.
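Roughly, the suggested pattern looks like this (a sketch with made-up class and method names; note that in some Spark versions org.apache.spark.internal.Logging is package-private, in which case a project-local equivalent trait would be needed):

```scala
import org.apache.spark.internal.Logging

// Mixing in Logging provides lazily evaluated logInfo/logWarning/logError helpers
// wired into Spark's logging configuration, instead of a hand-rolled log4j Logger.
class ExampleIterator(refCount: Int) extends Logging {
  def load(): Unit = {
    logWarning(s"Iterating all $refCount refs")
  }
}
```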
```scala
  override protected def loadIterator(): Iterator[CommitTree] = {
    val filtered = filters.toIterator.flatMap { filter =>
      filter.matchingCases.getOrElse("hash", Seq()).flatMap { hash =>
```
You can do `filter.matchingCases("hash").flatMap...` here directly.
```scala
  val log = Logger.getLogger(this.getClass.getSimpleName)

  override protected def loadIterator(): Iterator[CommitTree] = {
    val filtered = filters.toIterator.flatMap { filter =>
```
Right now I am not totally sure whether moving toIterator after the flatMap can improve performance here. I usually follow the rule of "only apply transformations when it is totally necessary", and here you can do a flatMap without transforming filters into an iterator.
I totally see what you mean here and agree!

But the reason to convert to an iterator is different: later in the code we cover the case where there are no filters, and that's when we iterate the trees at the HEADs of all the references.
I see, but you can do `val filtered = filters.flatMap{...}.toIterator` instead of `val filtered = filters.toIterator.flatMap{...}`.

But it's not important; it's just that I'm used to applying transformations first and, at the end, if necessary, applying the type wrapping.
> I see what you mean
> But is not important

I believe it's actually very important in this particular case, as we return a JGitIterator. Here is why:

- `filters.flatMap{...}.toIterator` will process the full collection first, and then return an iterator
- `filters.toIterator.flatMap{...}` will return an iterator that is ready to process the first element

I also spent a few hours yesterday hunting a 🐛 when I accidentally switched between those two approaches :/

AFAIK this might not be the case if we were only dealing with collections in memory, but in our case a stateful iterator with IO is involved.
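To make the difference concrete, here is a tiny self-contained sketch (plain Scala, not project code) showing the evaluation order of the two variants:

```scala
object LazinessSketch extends App {
  // Stand-in for a filter expansion, with a visible side effect
  // so the evaluation order can be observed.
  def expand(n: Int): Seq[Int] = {
    println(s"expanding $n")
    Seq(n, n * 10)
  }

  val filters = Seq(1, 2, 3)

  // Strict: flatMap walks the whole collection before toIterator is applied,
  // so "expanding 1..3" prints here even if nothing is ever consumed.
  val strict = filters.flatMap(expand).toIterator

  // Lazy: the collection becomes an iterator first, so flatMap is applied
  // element by element; nothing prints until the iterator is pulled.
  val lazyIter = filters.toIterator.flatMap(expand)

  println("pulling one element from the lazy variant:")
  println(lazyIter.next()) // only "expanding 1" has run for this variant
}
```

With JGit-backed, stateful iterators doing IO, that difference in when the work happens is exactly what matters here.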
```scala
    } else {
      val refs = new Git(repo).branchList().call().asScala.filter(!_.isSymbolic)
      log.warn(s"Iterating all ${refs.size} refs")
      refs.toIterator.flatMap { ref =>
```
the same as commented before.
Here it is used for type conversion: we need to get an Iterator[CommitTree], so in any case we need .toIterator; it's just a question of when to do that, either here or on the result of .flatMap.

But your point is valid.
```scala
    val content = BlobIterator.readFile(commitTree.tree.getObjectId(0), commitTree.tree.getObjectReader)
    Map[String, () => Any](
      "file_hash" -> (() => commitTree.tree.getObjectId(0).name),
      "content" -> (() => content),
```
Should we return the content of a binary file?
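If we decide not to, JGit already ships a simple heuristic we could reuse (a sketch; the helper name is made up and `content` is assumed to be the raw byte array read above):

```scala
import org.eclipse.jgit.diff.RawText

// Return None for blobs that look binary (RawText.isBinary looks for NUL bytes,
// roughly the same heuristic git itself uses), otherwise keep the raw content.
def textContentOnly(content: Array[Byte]): Option[Array[Byte]] =
  if (RawText.isBinary(content)) None else Some(content)
```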
```scala
    //  .load(resourcePath)
    //
    //  filesDf.withColumn("content string", filesDf("content").cast(StringType)).show()
    println("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")
```
Use the ScalaTest method info: `info("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")`
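For reference, a minimal FlatSpec sketch of what that looks like (the spec and test names here are made up):

```scala
import org.scalatest.FlatSpec

class FilesInfoExampleSpec extends FlatSpec {
  "the files datasource" should "list blobs at the HEAD of every ref" in {
    // info() sends the message through the ScalaTest reporter, so it shows up
    // attached to this test instead of being interleaved on stdout.
    info("Files/blobs (without commit hash filtered) at HEAD or every ref:")
    assert(true) // real assertions would go here
  }
}
```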
```scala
    commitsDf.show()
    val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
    //commitsDf.show()
```
remove commented code
```scala
    val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
    //commitsDf.show()

    println("Files/blobs with commit hashes:\n")
```
use info
```scala
    filesDf.show()

    val cnt = filesDf.count()
    println(s"Total $cnt rows")
```
use info
Force-pushed from 7d92a04 to 01ce298.
Reviews addressed, rebased on latest master. Going to push tests now.
```diff
@@ -17,7 +17,7 @@ object ColumnFilter {
   }
 }

-sealed trait CompiledFilter {
+trait CompiledFilter {
```
why is the trait no longer sealed?
Because the commit is not correct: #25 (comment)
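For context, a tiny illustration (made-up types, not project code) of what sealing buys: with a sealed trait, the compiler can check pattern matches over a CompiledFilter-like hierarchy for exhaustiveness.

```scala
sealed trait Filter
case class Equal(col: String, value: Any) extends Filter
case class In(col: String, values: Seq[Any]) extends Filter

// Because Filter is sealed, the compiler warns here:
// "match may not be exhaustive. It would fail on the following input: In(_, _)"
def describe(f: Filter): String = f match {
  case Equal(col, v) => s"$col = $v"
}
```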
```diff
@@ -32,7 +32,8 @@ object Implicits {
     Implicits.checkCols(df, "hash")
     val blobsIdsDf = df.select($"hash").distinct()
     val filesDf = Implicits.getDataSource("files", df.sparkSession)
-    filesDf.join(blobsIdsDf, filesDf("commit_hash") === df("hash")).drop($"hash")
+    val filesDfJoined = filesDf.join(blobsIdsDf, filesDf("commit_hash") === blobsIdsDf("hash")).drop($"hash")
+    df.join(filesDfJoined, df("hash") === filesDfJoined("commit_hash"))
```
I think you are joining one more time than necessary. Check this code; maybe I'm wrong.
Force-pushed from 01ce298 to feefd09.
@ajnavarro @erizocosmico sorry, I did not push the rebase before. Fixed now.
Force-pushed from f230ca3 to af6ce0e.
Tests are fixed in #28.

@ajnavarro on #25 (comment) - thank you for double-checking! Two joins were added. The second join can of course be done manually by the client, if we decide that's desirable, but the rationale for having it here was the client's expectations of the API: ideally, for our use case, clients would have

```scala
spark.getRepositories.filter($"id" === "github.com/mawag/faq-xiyoulinux")
  .getReferences.filter($"name".equalTo("refs/heads/HEAD"))
  .getCommits
  .getFiles
  .select("repository_id", "reference_name", "path", "commit_hash", "file_hash")
```

What do you think? For me, something like the above would be the best.
This is a really nice approach. We should implement it in the near future.

Regarding the two joins, you can remove the previous one:

```scala
Implicits.checkCols(df, "hash") // checking if this datasource is a commits one
val uniqDf = df.distinct() // get unique commits
val filesDf = Implicits.getDataSource("files", df.sparkSession) // get the files datasource
filesDf.join(df, filesDf("commit_hash") === uniqDf("hash")).drop($"hash") // join by commit hash and drop the duplicated hash column (hash == commit_hash)
```

Joins are really expensive and we need to avoid them as much as possible.
@ajnavarro I appreciate the feedback, but I'm really sorry, I really do not understand what you mean by that.

I was under the impression that we are not in some kind of competition for the "best way of using the collection API", but rather that we want to get blobs out and release the Spark API to the users. Please let me know if you see bugs in some corner cases of the current implementation and I'll be happy to address them.
Suggestions from review applied in cdbd4a0.
Force-pushed from 03d89d0 to cdbd4a0.
```scala
import org.scalatest.FlatSpec
import tech.sourced.api.util.{CompiledFilter, EqualFilter}

class BlogIteratorSpec extends FlatSpec with BaseRootedRepoIterator {
```
The last thing s/Blog/Blob
After fixing the typo in the Spec, LGTM.
Force-pushed from cdbd4a0 to e68ecde.
I.e. in `fff7062de8474d10a67d417ccea87ba6f58ca81d.siva` there is `3558dd448c31f10f3e1b518c39d633fc9396cb69` missing:

```
cd src/test/resourced/siva-files
siva unpack fff7062de8474d10a67d417ccea87ba6f58ca81d.siva tmp; cd tmp
git verify-pack -v objects/pack/pack-433e5205f6e26099e7d34ba5e5306f69e4cef12b.idx
git ls-tree d2fee692b47fb00494649c652a3ae34d57cf40c9
100644 blob 97030825f145faee7fb1b275c16b0c369f763ec2    addquestion.php
040000 tree 03a20274fe7bb6a70503758d1ae4f56b14d5aae6    config
040000 tree d57003e83cca06607cb3e4fe96dbbb584c32463c    includes
100644 blob affd4b7af6468d7e759f74975261da2d6bfca8e5    index.php
100644 blob 3e485ae0532edc4076b24905bdc2b1f6f5240efb    init.php
040000 tree 60559425cf9710090c5ede69758d0c69718e93a0    oauth
100644 blob 2453ceecbd5a60937db12ba2886197b3d6cb793d    question.php
100644 blob be1dd14a91679b91151357fc37a84fc6b59be1a6    search.php
160000 commit 3558dd448c31f10f3e1b518c39d633fc9396cb69    view
git cat-file -t 3558dd448c31f10f3e1b518c39d633fc9396cb69
fatal: git cat-file: could not get object info
```
Force-pushed from 7bfbc55 to c63f9fe.
CI passes now, with the fix from #37. Merging if there is no further discussion.
LGTM
In the commit table we have the hashes of commits, which are used to get trees -> blobs.

The current implementation does not filter: it leverages the ColumnFilters from #24 (which needs to be merged first) in order to get particular commit hashes, rather than iterating all refs' HEADs.
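To make the commit -> tree -> blob path concrete, here is a minimal JGit sketch (an illustration with my own naming, not the PR's exact implementation):

```scala
import org.eclipse.jgit.lib.{ObjectId, Repository}
import org.eclipse.jgit.revwalk.RevWalk
import org.eclipse.jgit.treewalk.TreeWalk

// Hypothetical helper: print every blob reachable from a given commit hash.
def blobsOfCommit(repo: Repository, commitHash: String): Unit = {
  val revWalk = new RevWalk(repo)
  try {
    val commit = revWalk.parseCommit(ObjectId.fromString(commitHash))
    val treeWalk = new TreeWalk(repo)
    try {
      treeWalk.addTree(commit.getTree) // start at the commit's root tree
      treeWalk.setRecursive(true)      // descend into subtrees, yielding only blobs
      while (treeWalk.next()) {
        println(s"${treeWalk.getPathString} -> ${treeWalk.getObjectId(0).name}")
      }
    } finally treeWalk.close()
  } finally revWalk.close()
}
```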