StreamKM "Width" Parameter #97
Comments
I have also posted this question in the MOA development Google Group.
I have resolved this issue in my fork of MOA, along with a couple of other issues that I discovered along the way. The changes touch StreamKM.java and Point.java, and add a new file, CoresetCostTriple.java. I have also included a new FlagOption parameter that permits only the final clustering to be evaluated, as in the original code.
Pull request #100 was merged into MOA's master branch; the issue is resolved.
@abifet
I have traced the source code, and found a bug in the following method:
The exception is thrown at the following line: It means that the BucketManager's buckets are full and that the BucketManager is trying to move points from the last bucket to a non-existent next bucket, as Richard mentioned above. I debugged the source code line by line and got the following result: The BucketManager determines its limit on the number of buckets using the following formula.
You can follow the code flow in the picture; I also put some notes on the drawings. On each iteration, the code inserts the newly arrived data point into the buckets[0].points array. At the last bucket (buckets[5]), when both buckets[5].points and buckets[5].spillover are full, the program merges these two arrays and tries to move the result into buckets[6].points, which does not exist. My solution is: Now I am implementing the code. I will let you know how it goes. [1] A merging algorithm. It takes 1000 data points from the points array and 1000 data points from the spillover array and merges them. Its output is another array of 1000 data points that summarizes the 2000 input points.
This issue should be opened again.
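The overflow described above can be reproduced with a small occupancy model. This is a minimal sketch, not MOA's BucketManager: the class and method names are hypothetical, the 6 buckets and the capacity of 1000 are taken from the walkthrough above, and the merge step is reduced to bookkeeping (a full bucket plus its spillover becomes one full block that moves one level up).

```java
public class BucketOverflowSketch {

    /**
     * Counts how many points can be inserted before the cascade needs a
     * bucket past the end of the array. Returns -1 if `limit` points fit.
     * Hypothetical model: only occupancy is tracked, not the points themselves.
     */
    public static long pointsUntilOverflow(int numBuckets, int capacity, long limit) {
        int[] size = new int[numBuckets]; // points held in each bucket's main array
        for (long inserted = 0; inserted < limit; inserted++) {
            if (size[0] >= capacity) {
                int next = 1;
                // A full bucket takes the incoming block as spillover, merges the
                // two into one block of `capacity` points, and passes it upward.
                while (next < numBuckets && size[next] == capacity) {
                    size[next] = 0;
                    next++;
                }
                if (next == numBuckets) {
                    // Array-based code would index buckets[numBuckets] here.
                    return inserted;
                }
                size[next] = capacity;
                size[0] = 0;
            }
            size[0]++; // the new point always lands in bucket 0
        }
        return -1;
    }

    public static void main(String[] args) {
        // 6 buckets of 1000 points, as in the walkthrough above.
        System.out.println(pointsUntilOverflow(6, 1000, 1_000_000L));
    }
}
```

In this model, buckets 1 to 5 behave like a five-bit binary counter of full blocks, so the cascade runs past the last bucket on the insertion that follows 32,000 points (1000 × 2^5). Whether the real code fails at exactly that count depends on details this sketch does not model.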
Good afternoon,
I am trying to understand the source code for StreamKM++ in MOA. As I understand it, the width parameter plays two conflicting roles in StreamKM++. [All of the following was done using MOA's Clustering tab, with the default RBF stream generator, StreamKM++ as Algorithm 1, and Algorithm 2 cleared.]
Clusterings are only computed every width instances. Until width instances have been processed, the clusterings returned by the getClusteringResult() method have only NULL entries in the centresStreamingCoreset array, which means that no measurements of StreamKM++'s performance can be calculated. When I set width accordingly and run StreamKM++, however, I run into:
This constructor is called by StreamKM.java upon initialization in the trainOnInstanceImpl() method: the argument n is, in fact, width (via StreamKM.java's variable length). The problem this causes is that the BucketManager fails when all of its buckets are full: the ArrayIndexOutOfBoundsException is due to the BucketManager trying to move points from the last bucket to a non-existent "last bucket."
Although it would be most intuitive to me to "fix" this behaviour, it is consistent with the algorithm's description in Marcel R. Ackermann, Christiane Lammersen, Marcus Märtens, Christoph Raupach, Christian Sohler, Kamil Swierkot: StreamKM++: A Clustering Algorithm for Data Streams. ALENEX 2010: 173-187 (the paper cited in StreamKM.java's opening comments). The argument n is described therein as the size of the data stream. The paper also states that coresets should be obtainable at any point during the data stream, which is not the case at the moment.
My question is: am I missing something in the code, or an assumption made by the developers? Or does it make sense to modify StreamKM.java's getClusteringResult() method so that it provides proper clusterings, as appears to have been envisioned in the original paper?
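The difference between the two behaviours in question can be illustrated with a toy model. This is a sketch under loud assumptions: the class and its methods are hypothetical and are not MOA's API, and a running mean stands in for the coreset and clustering step. It only contrasts a result that is refreshed every `width` instances with one that is computed on demand.

```java
public class WidthGatedSketch {
    private final int width;
    private int seen = 0;
    private double sum = 0.0;
    private double[] gatedCentres = null; // stays null until `width` instances have arrived

    public WidthGatedSketch(int width) {
        this.width = width;
    }

    public void train(double x) {
        seen++;
        sum += x;
        // Gated behaviour: the stored result is refreshed only every `width` instances.
        if (seen % width == 0) {
            gatedCentres = new double[] { sum / seen };
        }
    }

    /** Gated result: null before the first `width` instances have been processed. */
    public double[] getGatedResult() {
        return gatedCentres;
    }

    /** On-demand result: a summary of everything seen so far, or null if nothing was seen. */
    public double[] getAnytimeResult() {
        return seen == 0 ? null : new double[] { sum / seen };
    }

    public static void main(String[] args) {
        WidthGatedSketch sketch = new WidthGatedSketch(5);
        for (double x : new double[] { 1, 2, 3 }) {
            sketch.train(x);
        }
        System.out.println(sketch.getGatedResult() == null); // no gated result yet
        System.out.println(sketch.getAnytimeResult()[0]);    // mean of the three values seen
    }
}
```

With `width` set to 5 and three instances processed, the gated accessor still returns null while the on-demand accessor already returns a summary, which mirrors the gap between the current behaviour and the "obtainable at any point" property mentioned above.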
Richard