BlobIterator implementation #25
Conversation
Force-pushed from 9d2cdcf to 59549d8.
This is not supposed to be a complete implementation yet: i.e. it does nothing at all in case there were no filters for commits. One thing that seems reasonable in such cases is to iterate all (non-remote, non-filtered) refs' HEADs. Feedback is very welcome. \cc @ajnavarro @mcarmonaa
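A rough sketch of that fallback using JGit (my own illustration of the idea described above, not necessarily the exact code in this PR; the helper name is made up):

```scala
import scala.collection.JavaConverters._
import org.eclipse.jgit.api.Git
import org.eclipse.jgit.lib.{ObjectId, Repository}

// Hypothetical helper: collect the commit ids at the HEAD of every local,
// non-symbolic branch, to be used when no commit-hash filters were pushed down.
def headsOfLocalBranches(repo: Repository): Iterator[ObjectId] =
  new Git(repo).branchList().call().asScala
    .filter(ref => !ref.isSymbolic)   // skip symbolic refs such as HEAD
    .map(_.getObjectId)               // the commit each branch points at
    .toIterator
```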
```scala
override protected def mapColumns(tree: TreeWalk): Map[String, () => Any] = {
  val content = readFile(tree.getObjectId(0), tree.getObjectReader)
  Map[String, () => Any](
    "file_hash" -> (() => tree.getObjectId(0)),
```
Are you missing the commit hash here, maybe?
Force-pushed from ba52b6e to 79b7f6c.
Join in Implicits commits and files on
Codecov Report
```
@@            Coverage Diff             @@
##           master      #25      +/-   ##
============================================
- Coverage     85.25%   83.42%   -1.83%
- Complexity       26       31       +5
============================================
  Files            12       13       +1
  Lines           278      362      +84
  Branches         41       62      +21
============================================
+ Hits            237      302      +65
- Misses           17       27      +10
- Partials         24       33       +9
```
Continue to review full report at Codecov.
I miss some tests for the iterator itself, like CommitIteratorSpec, ReferenceIteratorSpec or RepositoryIteratorSpec.

#24 should be merged before this PR, and the duplicated filter code removed.
```scala
      Nil
    )

    // StructField("lang", StringType) ::
```
Could you delete the commented code, please? These columns will be added when enry and babelfish are implemented.
```scala
class BlobIterator(requiredColumns: Array[String], repo: Repository, filters: Array[CompiledFilter])
  extends RootedRepoIterator[CommitTree](requiredColumns, repo) {

  val log = Logger.getLogger(this.getClass.getSimpleName)
```
You should use org.apache.spark.internal.Logging. See an example in org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.
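Roughly, the suggested pattern looks like this (a sketch with made-up class and method names; note that in some Spark versions org.apache.spark.internal.Logging is package-private, in which case a project-local equivalent trait would be needed):

```scala
import org.apache.spark.internal.Logging

// Mixing in Logging provides lazily evaluated logInfo/logWarning/logError helpers
// wired into Spark's logging configuration, instead of a hand-rolled log4j Logger.
class ExampleIterator(refCount: Int) extends Logging {
  def load(): Unit = {
    logWarning(s"Iterating all $refCount refs")
  }
}
```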
```scala
  override protected def loadIterator(): Iterator[CommitTree] = {
    val filtered = filters.toIterator.flatMap { filter =>
      filter.matchingCases.getOrElse("hash", Seq()).flatMap { hash =>
```
You can do `filter.matchingCases("hash").flatMap...` here directly.
```scala
  val log = Logger.getLogger(this.getClass.getSimpleName)

  override protected def loadIterator(): Iterator[CommitTree] = {
    val filtered = filters.toIterator.flatMap { filter =>
```
Right now I am not totally sure whether moving toIterator after the flatMap can improve performance here. I usually follow the rule of "only apply transformations when it is totally necessary", and here you can do a flatMap without transforming filters into an iterator.
I totally see what you mean here and agree!

But the reason to convert to an iterator is different: later in the code we cover the case where there are no filters, and that's when we iterate the trees at the HEADs of all the references.
I see, but you can do `val filtered = filters.flatMap{...}.toIterator` instead of `val filtered = filters.toIterator.flatMap{...}`.

But it's not important; it's just that I'm used to applying transformations first and, at the end, if necessary, applying the type wrapping.
> I see what you mean
> But is not important

I believe it's actually very important in this particular case, as we return a JGitIterator. Here is why:

- `filters.flatMap{...}.toIterator` will process the full collection first, and then return an iterator
- `filters.toIterator.flatMap{...}` will return an iterator that is ready to process the first element

I also spent a few hours yesterday hunting a 🐛 when I accidentally switched between those two approaches :/

AFAIK this might not be the case if we were only dealing with collections in memory, but in our case a stateful iterator with IO is involved.
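To make the difference concrete, here is a tiny self-contained sketch (plain Scala, not project code) showing the evaluation order of the two variants:

```scala
object LazinessSketch extends App {
  // Stand-in for a filter expansion, with a visible side effect
  // so the evaluation order can be observed.
  def expand(n: Int): Seq[Int] = {
    println(s"expanding $n")
    Seq(n, n * 10)
  }

  val filters = Seq(1, 2, 3)

  // Strict: flatMap walks the whole collection before toIterator is applied,
  // so "expanding 1..3" prints here even if nothing is ever consumed.
  val strict = filters.flatMap(expand).toIterator

  // Lazy: the collection becomes an iterator first, so flatMap is applied
  // element by element; nothing prints until the iterator is pulled.
  val lazyIter = filters.toIterator.flatMap(expand)

  println("pulling one element from the lazy variant:")
  println(lazyIter.next()) // only "expanding 1" has run for this variant
}
```

With JGit-backed, stateful iterators doing IO, that difference in when the work happens is exactly what matters here.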
```scala
    } else {
      val refs = new Git(repo).branchList().call().asScala.filter(!_.isSymbolic)
      log.warn(s"Iterating all ${refs.size} refs")
      refs.toIterator.flatMap { ref =>
```
the same as commented before.
Here it is used for type conversion: we need to get an Iterator[CommitTree], so in any case we need .toIterator; it's just a question of when to do that, either here or on the result of .flatMap.

But your point is valid.
```scala
    val content = BlobIterator.readFile(commitTree.tree.getObjectId(0), commitTree.tree.getObjectReader)
    Map[String, () => Any](
      "file_hash" -> (() => commitTree.tree.getObjectId(0).name),
      "content" -> (() => content),
```
Should we return the content of a binary file?
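If we decide not to, JGit already ships a simple heuristic we could reuse (a sketch; the helper name is made up and `content` is assumed to be the raw byte array read above):

```scala
import org.eclipse.jgit.diff.RawText

// Return None for blobs that look binary (RawText.isBinary looks for NUL bytes,
// roughly the same heuristic git itself uses), otherwise keep the raw content.
def textContentOnly(content: Array[Byte]): Option[Array[Byte]] =
  if (RawText.isBinary(content)) None else Some(content)
```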
```scala
    //  .load(resourcePath)
    //
    //  filesDf.withColumn("content string", filesDf("content").cast(StringType)).show()
    println("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")
```
Use the ScalaTest method info: `info("Files/blobs (without commit hash filtered) at HEAD or every ref:\n")`
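For reference, a minimal FlatSpec sketch of what that looks like (the spec and test names here are made up):

```scala
import org.scalatest.FlatSpec

class FilesInfoExampleSpec extends FlatSpec {
  "the files datasource" should "list blobs at the HEAD of every ref" in {
    // info() sends the message through the ScalaTest reporter, so it shows up
    // attached to this test instead of being interleaved on stdout.
    info("Files/blobs (without commit hash filtered) at HEAD or every ref:")
    assert(true) // real assertions would go here
  }
}
```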
```scala
    commitsDf.show()
    val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
    //commitsDf.show()
```
remove commented code
```scala
    val commitsDf = refsDf.getCommits.select("repository_id", "reference_name", "message", "hash")
    //commitsDf.show()

    println("Files/blobs with commit hashes:\n")
```
use info
```scala
    filesDf.show()

    val cnt = filesDf.count()
    println(s"Total $cnt rows")
```
use info
Force-pushed from 7d92a04 to 01ce298.
Reviews addressed, rebased on latest master. Going to push tests now.
```diff
@@ -17,7 +17,7 @@ object ColumnFilter {
   }
 }

-sealed trait CompiledFilter {
+trait CompiledFilter {
```
why is the trait no longer sealed?
Because the commit is not correct: #25 (comment)
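For context, a tiny illustration (made-up types, not project code) of what sealing buys: with a sealed trait, the compiler can check pattern matches over a CompiledFilter-like hierarchy for exhaustiveness.

```scala
sealed trait Filter
case class Equal(col: String, value: Any) extends Filter
case class In(col: String, values: Seq[Any]) extends Filter

// Because Filter is sealed, the compiler warns here:
// "match may not be exhaustive. It would fail on the following input: In(_, _)"
def describe(f: Filter): String = f match {
  case Equal(col, v) => s"$col = $v"
}
```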
```diff
@@ -32,7 +32,8 @@ object Implicits {
     Implicits.checkCols(df, "hash")
     val blobsIdsDf = df.select($"hash").distinct()
     val filesDf = Implicits.getDataSource("files", df.sparkSession)
-    filesDf.join(blobsIdsDf, filesDf("commit_hash") === df("hash")).drop($"hash")
+    val filesDfJoined = filesDf.join(blobsIdsDf, filesDf("commit_hash") === blobsIdsDf("hash")).drop($"hash")
+    df.join(filesDfJoined, df("hash") === filesDfJoined("commit_hash"))
```
I think you are joining one more time than necessary. Check this code; maybe I'm wrong.
Force-pushed from 01ce298 to feefd09.
@ajnavarro @erizocosmico sorry, I did not push the rebase before. Fixed now.
Force-pushed from f230ca3 to af6ce0e.
Tests are fixed in #28.

@ajnavarro on #25 (comment) - thank you for double-checking! Two joins were added. The second join can of course be done manually by the client, if we decide that's desirable, but the rationale for having it here was the client's expectations of the API: ideally, for our use case, clients would have

```scala
spark.getRepositories.filter($"id" === "github.com/mawag/faq-xiyoulinux")
  .getReferences.filter($"name".equalTo("refs/heads/HEAD"))
  .getCommits
  .getFiles
  .select("repository_id", "reference_name", "path", "commit_hash", "file_hash")
```

What do you think? For me, something like the above would be the best.
This is a really nice approach. We should implement it in the near future.

Regarding the two joins, you can remove the previous one:

```scala
Implicits.checkCols(df, "hash") // checking if this datasource is a commits one
val uniqDf = df.distinct() // get unique commits
val filesDf = Implicits.getDataSource("files", df.sparkSession) // get the files datasource
filesDf.join(df, filesDf("commit_hash") === uniqDf("hash")).drop($"hash") // join by commit hash and drop the duplicated hash column (hash == commit_hash)
```

Joins are really expensive and we need to avoid them as much as possible.
@ajnavarro I appreciate the feedback, but I'm really sorry, I really do not understand what you mean by that.

I was under the impression that we are not in some kind of competition for the "best way of using the collection API", but rather that we want to get blobs out and release the Spark API to the users. Please let me know if you see bugs in some corner cases of the current implementation and I'll be happy to address them.
Suggestions from review applied in cdbd4a0.
Force-pushed from 03d89d0 to cdbd4a0.
```scala
import org.scalatest.FlatSpec
import tech.sourced.api.util.{CompiledFilter, EqualFilter}

class BlogIteratorSpec extends FlatSpec with BaseRootedRepoIterator {
```
The last thing s/Blog/Blob
After fixing the typo in the Spec, LGTM.
Force-pushed from cdbd4a0 to e68ecde.
I.e. in `fff7062de8474d10a67d417ccea87ba6f58ca81d.siva` there is `3558dd448c31f10f3e1b518c39d633fc9396cb69` missing:

```
cd src/test/resourced/siva-files
siva unpack fff7062de8474d10a67d417ccea87ba6f58ca81d.siva tmp; cd tmp
git verify-pack -v objects/pack/pack-433e5205f6e26099e7d34ba5e5306f69e4cef12b.idx
git ls-tree d2fee692b47fb00494649c652a3ae34d57cf40c9
100644 blob 97030825f145faee7fb1b275c16b0c369f763ec2    addquestion.php
040000 tree 03a20274fe7bb6a70503758d1ae4f56b14d5aae6    config
040000 tree d57003e83cca06607cb3e4fe96dbbb584c32463c    includes
100644 blob affd4b7af6468d7e759f74975261da2d6bfca8e5    index.php
100644 blob 3e485ae0532edc4076b24905bdc2b1f6f5240efb    init.php
040000 tree 60559425cf9710090c5ede69758d0c69718e93a0    oauth
100644 blob 2453ceecbd5a60937db12ba2886197b3d6cb793d    question.php
100644 blob be1dd14a91679b91151357fc37a84fc6b59be1a6    search.php
160000 commit 3558dd448c31f10f3e1b518c39d633fc9396cb69    view
git cat-file -t 3558dd448c31f10f3e1b518c39d633fc9396cb69
fatal: git cat-file: could not get object info
```
Force-pushed from 7bfbc55 to c63f9fe.
CI passes now, with the fix from #37. Merging if there is no further discussion.
LGTM
In the commit table we have the hashes of commits, which are used to get trees -> blobs.

The current implementation does not filter: it leverages the ColumnFilters from #24 (which needs to be merged first) in order to get particular commit hashes, rather than iterating all refs' HEADs.
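To make the commit -> tree -> blob path concrete, here is a minimal JGit sketch (an illustration with my own naming, not the PR's exact implementation):

```scala
import org.eclipse.jgit.lib.{ObjectId, Repository}
import org.eclipse.jgit.revwalk.RevWalk
import org.eclipse.jgit.treewalk.TreeWalk

// Hypothetical helper: print every blob reachable from a given commit hash.
def blobsOfCommit(repo: Repository, commitHash: String): Unit = {
  val revWalk = new RevWalk(repo)
  try {
    val commit = revWalk.parseCommit(ObjectId.fromString(commitHash))
    val treeWalk = new TreeWalk(repo)
    try {
      treeWalk.addTree(commit.getTree) // start at the commit's root tree
      treeWalk.setRecursive(true)      // descend into subtrees, yielding only blobs
      while (treeWalk.next()) {
        println(s"${treeWalk.getPathString} -> ${treeWalk.getObjectId(0).name}")
      }
    } finally treeWalk.close()
  } finally revWalk.close()
}
```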