new algebird abstraction: Scan #739
Conversation
CLAassistant
commented
Nov 6, 2019
```scala
type Aux[-I, S, +O] = Scan[I, O] { type State = S }

implicit def applicative[I]: Applicative[({ type L[O] = Scan[I, O] })#L] = new ScanApplicative[I]
```
johnynek
Nov 7, 2019
Collaborator
I don't think we use that plugin; we could, but I'd rather not add a build dependency for the few cases where we have done this. This is mostly a demonstration anyway, since I expect that if you really want an Applicative[Scan[I, ?]] you will want a cats implementation, but this shows you how to make one.
```scala
import scala.collection.compat._
import scala.collection.generic.CanBuildFrom

object Scan {
```
MansurAshraf
Nov 6, 2019
Contributor
How is this different from Aggregator[I, M[_]: SemigroupK, O]? I think the motivation here is that you want the S type to be a container so that reduce on S collects intermediate state. I believe that can be achieved by existing Aggregators by 1) making S a type constructor and 2) making sure that S has an instance of SemigroupK.
jeff-stripe
Nov 7, 2019
Author
Contributor
@MansurAshraf First: great to be on a PR with you again! It's been a long time :-).
Second: the idea here is that this transforms streams to streams, whereas aggregators transform sequences to single values. I would look at the apply method (starting on line 229 of Stream.scala as of this writing). The idea is that, for each input in the stream, you have one output in the stream. For example, using directFreeScan from ScanTest, something like directFreeScan(List('a', 'b', 'c')) would result in List("a", "ab", "abc").
Does that make sense?
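The one-output-per-input behavior described above can be sketched with plain scanLeft; `directFreeScanLike` here is a hypothetical stand-in for directFreeScan from ScanTest, not the actual test code:

```scala
// Hypothetical stand-in for directFreeScan: a running concatenation
// built on scanLeft, producing exactly one output per input.
def directFreeScanLike(input: List[Char]): List[String] =
  input.scanLeft("")((acc, c) => acc + c).tail
```

Calling `directFreeScanLike(List('a', 'b', 'c'))` yields `List("a", "ab", "abc")`, in contrast to an aggregator, which would only produce the final `"abc"`.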
MansurAshraf
Nov 7, 2019
Contributor
hey @jeff-stripe, it's been a while :-)
I will take a look at Stream.scala, but a cursory look at def fromAggregator[A, B, C](aggregator: Aggregator[A, B, C], initState: B): Aux[A, B, C] gives me the impression that
Scan[A,C] ~= Aggregator[A,S[_],(S[_],C)]
Scan has a new function called presentAndNextState which returns both the state type S and the final output type C. If your aggregator type were Aggregator[A,S[_],(S[_],C)], the present function in the existing aggregator would return the same value, I think.
jeff-stripe
Nov 7, 2019
Author
Contributor
@MansurAshraf In my example, the analogous Aggregator would just return "abc", not List("a", "ab", "abc"). In other words, Aggregator is to the reduce method as Scan is to the scanLeft method.
johnynek
Nov 7, 2019
Collaborator
Mansur, to a first approximation, Aggregator[A, _, B] is Iterator[A] => Option[B], MonoidAggregator[A, _, B] is Iterator[A] => B, and Scan[A, B] is Iterator[A] => Iterator[B] (actually with an added law that the number of items in the result is exactly the same).
You could imagine another idea: ExpandingScan[A, B] that is Iterator[A] => Iterator[B] but without the law of the iterators being the same size (then we could do things like filter, or concatMap, etc...)
Does that help explain it at all?
Think of the ExpandingScan as the most general function you can do in a reducer in Hadoop.
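The correspondences above can be sketched with plain iterator operations; `aggregateAll` and `scanAll` are hypothetical helpers with simplified signatures, not algebird's API:

```scala
// Aggregator-like shape: Iterator[A] => Option[B] (empty input gives None).
def aggregateAll[A, B](it: Iterator[A])(combine: (A, A) => A)(present: A => B): Option[B] =
  it.reduceOption(combine).map(present)

// Scan-like shape: Iterator[A] => Iterator[B], with exactly one output
// per input (drop(1) removes the seed that scanLeft prepends).
def scanAll[A, B](it: Iterator[A])(init: B)(step: (B, A) => B): Iterator[B] =
  it.scanLeft(init)(step).drop(1)
```

For example, `scanAll(Iterator(1, 2, 3))(0)(_ + _)` yields the running sums `1, 3, 6`, while `aggregateAll(Iterator(1, 2, 3))(_ + _)(identity)` yields only `Some(6)`.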
Thanks jeff. CI failed:
```scala
/**
 * A Scan that returns the number N for the Nth input (starting from 0)
 */
val index: Aux[Any, Long, Long] = fromStreamLike(0L)(n => (n, n+1))
```
```scala
/**
 * @return A scan which, when given [a_1, ..., a_n] outputs [c_1, ..., c_n] where
 *         c_i = aggregator.prepare(a_i) + ... + aggregator.prepare(a_1) + monoidAggregator.monoid.zero
 */
def fromMonoidAggregatorReverse[A, B, C](
```
johnynek
Nov 7, 2019
Collaborator
same comment about adding a def reverse method to MonoidAggregator
```scala
def joinWithIndex: Aux[I, (State, Long), (O, Long)] = join(Scan.index)
```
```scala
def compose[P](scan2: Scan[O, P]): Aux[I, (State, scan2.State), P] = new Scan[I, P] {
```
johnynek
Nov 7, 2019
Collaborator
maybe add comments to the methods: compose feeds each output of this into scan2, while join forks each input and feeds it into both this and scan2, etc...
jeff-stripe
Nov 7, 2019
Author
Contributor
I've added a bunch of comments. Let me know if you like the wording/framing of them.
Thanks @johnynek for the comments; I'm running out of steam for the day but will pick this up again tomorrow.
a really cool example ... just thinking aloud, that's not a request to add that.
codecov-io
commented
Nov 7, 2019
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop     #739      +/-  ##
===========================================
- Coverage    89.29%    89.26%    -0.03%
===========================================
  Files          114       115        +1
  Lines         8955      9011       +56
  Branches       330       335        +5
===========================================
+ Hits          7996      8044       +48
- Misses         959       967        +8
```
Continue to review full report at Codecov.
also, there is a (not very good) streaming median estimator where you keep track of the current median, and if the value is greater than that, you move it up by a constant, and if it is less than that, you move it down. Since the true median should have about half above as below, this should be a random walk that biases towards the median (so after a long enough warmup, it shouldn't be too far from the median).
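The estimator described above can be sketched in a few lines; `walkMedian`, its starting point, and its step size are all illustrative choices, not anything from the PR:

```scala
// Biased-random-walk median estimator: nudge the estimate up by a fixed
// step when a value is above it, down when below. With roughly half the
// values on each side, the walk drifts toward the true median.
def walkMedian(xs: Iterator[Double], start: Double = 0.0, step: Double = 0.1): Double =
  xs.foldLeft(start) { (est, x) =>
    if (x > est) est + step
    else if (x < est) est - step
    else est
  }
```

Note this fits the Scan shape naturally: the current estimate is the State, and emitting it at each step would give a stream of successively better estimates rather than only the final one.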
looks like
@johnynek I think I've addressed all of your comments so far (but welcome more). Also, the example of z-score was totally one original motivation for this abstraction, from way back when I was working on feature generation.
Unrelated: when I was running tests locally to reproduce the above issue, I ran into the following unrelated test failure:
Should I just file an issue for that one? Or is this some sort of known numerical-precision kind of thing?
@jeff-stripe sorry about the flake. I think it is numerical precision, but it does come up periodically. |
yeah, agree with your suggestion to move the order in
@johnynek Pushed your suggested changes + new tests (I made them fail first) for
one last minor request, and I'll merge
Done. Thanks for the helpful feedback and thorough review! The result is much better than where it started.
jeff-stripe commented Nov 6, 2019 (edited)

A new abstraction, similar to Aggregator or Fold, that aims to reify the business logic associated with scanLeft into its own abstraction with its own combinators.

The testing here is pretty comprehensive, although I don't think .zip is covered. Moreover, I originally had a bunch of combinators named .zipWithXXX that I later renamed to .joinWithXXX to be more consistent with the types of zip and join. I do think it's worth bikeshedding naming on all of this, though.

For posterity, my non-work github is @eigenvariable.
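The core idea of the PR, a state-carrying stream-to-stream transformation, can be sketched in miniature; `MiniScan` and `concat` below are simplified illustrations, not the actual algebird API (beyond the presentAndNextState name mentioned in the thread):

```scala
// A minimal Scan-like abstraction: threads a State through the input,
// emitting exactly one output per input (a reified scanLeft).
trait MiniScan[-I, +O] {
  type State
  def initialState: State
  // Given an input and the state before it, produce the output and next state.
  def presentAndNextState(i: I, stateBefore: State): (O, State)

  def apply(inputs: List[I]): List[O] =
    inputs
      .scanLeft((Option.empty[O], initialState)) { case ((_, s), i) =>
        val (o, next) = presentAndNextState(i, s)
        (Some(o), next)
      }
      .tail                                // drop the seed scanLeft prepends
      .collect { case (Some(o), _) => o }  // keep only the emitted outputs
}

// Running concatenation, in the spirit of directFreeScan from ScanTest.
val concat: MiniScan[Char, String] = new MiniScan[Char, String] {
  type State = String
  def initialState = ""
  def presentAndNextState(i: Char, s: String) = (s + i, s + i)
}
```

Here `concat(List('a', 'b', 'c'))` produces `List("a", "ab", "abc")`, the stream-of-intermediates behavior that distinguishes Scan from Aggregator.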