
BlobIterator implementation #25

Merged: 14 commits merged into src-d:master from feature/add-blob-iterator on Sep 18, 2017

Conversation

@bzz (Contributor) commented Sep 8, 2017

In the commits table, we have commit hashes, which are used to get trees -> blobs.

The current implementation does not filter by:

  • repository (so the same blob can appear in different repositories)
  • refs

It leverages ColumnFilters from #24 (which needs to be merged first) in order to get particular commit hashes, rather than iterating the HEADs of all refs.
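
To make the hash -> tree -> blobs step concrete, here is a minimal JGit sketch (assuming plain JGit; blobsOf is an illustrative helper, not this PR's code): a RevWalk resolves a commit hash to its root tree, and a recursive TreeWalk then visits every blob under it.

```
import org.eclipse.jgit.lib.{ObjectId, Repository}
import org.eclipse.jgit.revwalk.RevWalk
import org.eclipse.jgit.treewalk.TreeWalk

// Walk every blob reachable from a single commit hash.
def blobsOf(repo: Repository, commitHash: String): Unit = {
  val revWalk = new RevWalk(repo)
  val commit = revWalk.parseCommit(ObjectId.fromString(commitHash))
  val treeWalk = new TreeWalk(repo)
  treeWalk.addTree(commit.getTree)
  treeWalk.setRecursive(true) // descend into subtrees, so only blobs are reported
  while (treeWalk.next()) {
    println(s"${treeWalk.getObjectId(0).name} ${treeWalk.getPathString}")
  }
  treeWalk.close()
  revWalk.close()
}
```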

@bzz (Contributor, Author) commented Sep 8, 2017

This is not supposed to be a complete implementation yet: i.e. it does nothing at all in case there are no filters for commits.

One thing that seems reasonable in such cases is to iterate the HEADs of all (non-remote, non-filtered) refs.

Feedback is very welcome.

\cc @ajnavarro @mcarmonaa
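
A minimal sketch of that fallback, assuming JGit's porcelain API (headsToIterate is an illustrative name, not the PR's code), mirroring the !_.isSymbolic filter that appears later in this PR:

```
import org.eclipse.jgit.api.Git
import org.eclipse.jgit.lib.{ObjectId, Repository}
import scala.collection.JavaConverters._

// Collect the commit ids that local, non-symbolic refs point to;
// these HEADs are the starting points for blob iteration.
def headsToIterate(repo: Repository): Seq[ObjectId] =
  new Git(repo).branchList().call().asScala
    .filter(!_.isSymbolic)
    .map(_.getObjectId)
```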

override protected def mapColumns(tree: TreeWalk): Map[String, () => Any] = {
  val content = readFile(tree.getObjectId(0), tree.getObjectReader)
  Map[String, () => Any](
    "file_hash" -> (() => tree.getObjectId(0)),
Contributor:

Are you missing the commit hash, maybe?

@bzz bzz force-pushed the feature/add-blob-iterator branch 2 times, most recently from ba52b6e to 79b7f6c Compare September 8, 2017 17:53
@bzz (Contributor, Author) commented Sep 8, 2017

The join in Implicits between commits and files on "commit_hash" === "hash" does not produce an eq filter (only is_not_null(commit_hash)), so a new case was added that iterates blobs in the HEADs of all refs.
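
For context, a self-contained sketch of the effect described above, using Spark's data source filter types (the commitHashes helper is illustrative, not this PR's code): an equi-join typically pushes only IsNotNull down to the source, while an explicit equality predicate arrives as EqualTo and can be compiled into a lookup of specific commits.

```
import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

// Extract concrete commit hashes from the pushed-down filters. For a plain
// join, only IsNotNull("commit_hash") arrives, so this comes back empty and
// the iterator must fall back to walking the HEADs of all refs.
def commitHashes(filters: Array[Filter]): Seq[String] =
  filters.collect { case EqualTo("commit_hash", hash: String) => hash }.toSeq
```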

@bzz bzz changed the title [WIP] BlobIterator implementation BlobIterator implementation Sep 13, 2017
@codecov (bot) commented Sep 13, 2017

Codecov Report

Merging #25 into master will decrease coverage by 1.82%.
The diff coverage is 68.88%.


@@             Coverage Diff              @@
##             master      #25      +/-   ##
============================================
- Coverage     85.25%   83.42%   -1.83%     
- Complexity       26       31       +5     
============================================
  Files            12       13       +1     
  Lines           278      362      +84     
  Branches         41       62      +21     
============================================
+ Hits            237      302      +65     
- Misses           17       27      +10     
- Partials         24       33       +9
Impacted Files | Coverage Δ | Complexity Δ
...rc/main/scala/tech/sourced/api/DefaultSource.scala | 82.14% <0%> (+1.37%) | 2 <0> (ø)
src/main/scala/tech/sourced/api/Schema.scala | 100% <100%> (ø) | 0 <0> (ø)
src/main/scala/tech/sourced/api/Implicits.scala | 81.48% <40%> (+6.48%) | 0 <0> (ø)
...scala/tech/sourced/api/iterator/BlobIterator.scala | 73.75% <73.75%> (ø) | 5 <5> (?)
...ain/scala/tech/sourced/api/util/ColumnFilter.scala | 80.55% <0%> (+2.77%) | 0% <0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f24ca7e...c63f9fe.

@ajnavarro (Contributor) left a comment:

I'm missing some tests for the iterator itself, like CommitIteratorSpec, ReferenceIteratorSpec, or RepositoryIteratorSpec.

#24 should be merged before this PR, and the duplicated filter code removed.


Nil
)

// StructField("lang", StringType) ::
@ajnavarro (Contributor), Sep 13, 2017:

Could you delete the commented code, please? These columns will be added when enry and babelfish are implemented.

class BlobIterator(requiredColumns: Array[String], repo: Repository, filters: Array[CompiledFilter])
  extends RootedRepoIterator[CommitTree](requiredColumns, repo) {

  val log = Logger.getLogger(this.getClass.getSimpleName)
Contributor:

You should use org.apache.spark.internal.Logging. See an example in org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.


override protected def loadIterator(): Iterator[CommitTree] = {
  val filtered = filters.toIterator.flatMap { filter =>
    filter.matchingCases.getOrElse("hash", Seq()).flatMap { hash =>
Contributor:

You can do filter.matchingCases("hash").flatMap... directly here.

val log = Logger.getLogger(this.getClass.getSimpleName)

override protected def loadIterator(): Iterator[CommitTree] = {
  val filtered = filters.toIterator.flatMap { filter =>
Contributor:

Right now I'm not totally sure whether moving toIterator after the flatMap would improve performance here. I usually follow the rule of "only apply transformations when totally necessary", and here you can do a flatMap without transforming filters into an iterator.

Contributor (Author):

I totally see what you mean here, and agree!

But the reason for converting to an iterator is different: later in the code we cover the case when there are no filters, and that's when we iterate the trees at the HEADs of all the references.

@ajnavarro (Contributor), Sep 14, 2017:

I see, but you can do val filtered = filters.flatMap{...}.toIterator instead of val filtered = filters.toIterator.flatMap{...}. But it's not important; I'm just used to applying transformations first and, at the end, if necessary, the type wrapping.

Contributor (Author):

I see what you mean.

But is not important

I believe it's actually very important in this particular case, as we return JGitIterator. Here is why:

  1. filters.flatMap{...}.toIterator will process the full collection first, and then return an iterator
  2. filters.toIterator.flatMap{...} will return an iterator that is ready to process the first element

And I also spent a few hours yesterday hunting a 🐛 when I accidentally switched between those two approaches :/

AFAIK this would not be the case if we were only dealing with in-memory collections, but in our case a stateful iterator with IO is involved.
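
A plain-Scala illustration of the difference (nothing JGit-specific; work stands in for the per-element processing):

```
def work(x: Int): Seq[Int] = { println(s"processing $x"); Seq(x) }

// Eager: flatMap traverses the whole Seq before toIterator is called,
// so all three elements are processed immediately.
val eager = Seq(1, 2, 3).flatMap(work).toIterator   // prints "processing 1..3" now

// Lazy: Iterator.flatMap defers the work until elements are pulled.
val lazyIt = Seq(1, 2, 3).toIterator.flatMap(work)  // prints nothing yet
lazyIt.next()                                       // only now prints "processing 1"
```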

} else {
  val refs = new Git(repo).branchList().call().asScala.filter(!_.isSymbolic)
  log.warn(s"Iterating all ${refs.size} refs")
  refs.toIterator.flatMap { ref =>
Contributor:

The same as commented before.

Contributor (Author):

Here it is used for type conversion: we need to get an Iterator[CommitTree], so in any case we need .toIterator; it's just a question of when to do it, either here or on the results of .flatMap.

But your point is valid.

val content = BlobIterator.readFile(commitTree.tree.getObjectId(0), commitTree.tree.getObjectReader)
Map[String, () => Any](
  "file_hash" -> (() => commitTree.tree.getObjectId(0).name),
  "content" -> (() => content),
Contributor:

Should we return the content of a binary file?
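
One possible answer, as a hedged sketch: JGit ships a binary-content heuristic, so the iterator could null out (or skip) content it detects as binary. Whether to do so is a design decision, not something this PR decides; textContentOrNull is an illustrative helper.

```
import org.eclipse.jgit.diff.RawText

// Return blob bytes only for text-like content; RawText.isBinary scans a
// prefix of the buffer for NUL bytes as its binary heuristic.
def textContentOrNull(bytes: Array[Byte]): Array[Byte] =
  if (RawText.isBinary(bytes)) null else bytes
```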

// .load(resourcePath)
//
// filesDf.withColumn("content string", filesDf("content").cast(StringType)).show()
println("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")
Contributor:

Use the ScalaTest info method: info("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")


commitsDf.show()
val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
//commitsDf.show()
Contributor:

remove commented code

val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
//commitsDf.show()

println("Files/blobs with commit hashes:\n")
Contributor:

use info

filesDf.show()

val cnt = filesDf.count()
println(s"Total $cnt rows")
Contributor:

use info

@bzz bzz force-pushed the feature/add-blob-iterator branch 2 times, most recently from 7d92a04 to 01ce298 Compare September 13, 2017 17:13
@bzz (Contributor, Author) commented Sep 13, 2017

Reviews addressed, rebased on latest master.

Going to push tests now.

@ajnavarro (Contributor):

@bzz You should do a rebase -i and remove dd2f819 from the history; the correct one is 623560d.

@@ -17,7 +17,7 @@ object ColumnFilter {
   }
 }

-sealed trait CompiledFilter {
+trait CompiledFilter {
Contributor:

why is the trait no longer sealed?

Contributor:

Because the commit is not correct: #25 (comment)

@@ -32,7 +32,8 @@ object Implicits {
   Implicits.checkCols(df, "hash")
   val blobsIdsDf = df.select($"hash").distinct()
   val filesDf = Implicits.getDataSource("files", df.sparkSession)
-  filesDf.join(blobsIdsDf, filesDf("commit_hash") === df("hash")).drop($"hash")
+  val filesDfJoined = filesDf.join(blobsIdsDf, filesDf("commit_hash") === blobsIdsDf("hash")).drop($"hash")
+  df.join(filesDfJoined, df("hash") === filesDfJoined("commit_hash"))
Contributor:

I think you are joining one more time than necessary. Check this code; maybe I'm wrong.

@bzz (Contributor, Author) commented Sep 14, 2017

@ajnavarro @erizocosmico sorry, did not push the rebase before. Fixed now.

@bzz bzz force-pushed the feature/add-blob-iterator branch 2 times, most recently from f230ca3 to af6ce0e Compare September 14, 2017 17:43
@bzz (Contributor, Author) commented Sep 15, 2017

Tests are fixed in #28.

@ajnavarro on #25 (comment): thank you for double-checking!

Two joins were added because:

  • files.commit_hash is joined with only the unique commits.hash values; the resulting table has the Schema.files structure
  • the result is then joined with commits again, to get extra fields like repository_id, etc.

The second join can of course be done manually by the client, if we decide that's desirable.

But the rationale for having it here was client expectations of the API:

spark.getRepositories.filter($"id" === "github.com/mawag/faq-xiyoulinux")
     .getReferences.filter($"name".equalTo("refs/heads/HEAD"))
     .getCommits
     .getFiles
     .select("repository_id", "reference_name", "path", "commit_hash", "file_hash")

Ideally, for our use case, clients would have repository_url here, which is not possible now without one more join. So that could be an argument for removing the extra join in files as well.

What do you think?

For me, something like the snippet below would be best:

spark.getRepositories.filter($"id" === "github.com/mawag/faq-xiyoulinux")
     .getReferences.filter($"name".equalTo("refs/heads/HEAD"))
     .getFiles
     .select("repository_url", "reference_name", "path", "commit_hash", "file_hash", "file_content")

where repository_url is already the main endpoint of the original repository.

@ajnavarro (Contributor) commented Sep 15, 2017

For me, something like below would be the best
spark.getRepositories.filter($"id" === "github.com/mawag/faq-xiyoulinux")
.getReferences.filter($"name".equalTo("refs/heads/HEAD"))
.getFiles
.select("repository_url", "reference_name", "path", "commit_hash", "file_hash", "file_content")

This is a really nice approach. We should implement it in the near future.

Regarding the two joins, you can remove the previous df.select("hash") statement and do something like:

Implicits.checkCols(df, "hash") // check that this data source is a commits one
val uniqDf = df.distinct() // get unique commits
val filesDf = Implicits.getDataSource("files", df.sparkSession) // get the files data source
filesDf.join(uniqDf, filesDf("commit_hash") === uniqDf("hash")).drop($"hash") // join by commit hash and drop the duplicated hash column (hash == commit_hash)

Joins are really expensive and we need to avoid them as much as possible.

@bzz (Contributor, Author) commented Sep 15, 2017

Joins are really expensive

Addressed in 6534012

This is a really nice approach. We should implement it in the near future.

Reference filter added in 3bdc052

@bzz (Contributor, Author) commented Sep 15, 2017

@ajnavarro I appreciate the feedback, but I'm really sorry, as I really do not understand what you mean by

I think this is not the best way to do this

I was under the impression that we are not in some kind of competition for the "best way of using the collection API", but rather want to get blobs and release the Spark API to the users.

Please let me know if you see bugs in some corner cases of the current implementation, and I'll be happy to address them.

@bzz (Contributor, Author) commented Sep 15, 2017

Suggestions from review applied in cdbd4a0

import org.scalatest.FlatSpec
import tech.sourced.api.util.{CompiledFilter, EqualFilter}

class BlogIteratorSpec extends FlatSpec with BaseRootedRepoIterator {
Contributor:

The last thing s/Blog/Blob

@ajnavarro (Contributor) left a comment:

After fixing the typo in the Spec, LGTM.

I.e. in `fff7062de8474d10a67d417ccea87ba6f58ca81d.siva`, `3558dd448c31f10f3e1b518c39d633fc9396cb69` is missing; note the `160000 commit` entry in the ls-tree output below, which is a submodule gitlink, so the pointed-to object is not stored in this repository's pack:

```
 cd src/test/resources/siva-files
 siva unpack fff7062de8474d10a67d417ccea87ba6f58ca81d.siva tmp; cd tmp

 git verify-pack -v objects/pack/pack-433e5205f6e26099e7d34ba5e5306f69e4cef12b.idx
 git ls-tree d2fee692b47fb00494649c652a3ae34d57cf40c9

  100644 blob 97030825f145faee7fb1b275c16b0c369f763ec2    addquestion.php
  040000 tree 03a20274fe7bb6a70503758d1ae4f56b14d5aae6    config
  040000 tree d57003e83cca06607cb3e4fe96dbbb584c32463c    includes
  100644 blob affd4b7af6468d7e759f74975261da2d6bfca8e5    index.php
  100644 blob 3e485ae0532edc4076b24905bdc2b1f6f5240efb    init.php
  040000 tree 60559425cf9710090c5ede69758d0c69718e93a0    oauth
  100644 blob 2453ceecbd5a60937db12ba2886197b3d6cb793d    question.php
  100644 blob be1dd14a91679b91151357fc37a84fc6b59be1a6    search.php
  160000 commit 3558dd448c31f10f3e1b518c39d633fc9396cb69    view

 git cat-file -t 3558dd448c31f10f3e1b518c39d633fc9396cb69
  fatal: git cat-file: could not get object info
```
@bzz (Contributor, Author) commented Sep 18, 2017

CI passes now, with the fix from #37.

Merging if there is no further discussion.

@ajnavarro (Contributor):

LGTM

@bzz bzz merged commit 4c4d85f into src-d:master Sep 18, 2017
@bzz bzz deleted the feature/add-blob-iterator branch September 18, 2017 12:53
@bzz bzz mentioned this pull request Sep 19, 2017