Generate Synthetic Data in Clusters #66

John-Almardeny · 2019-04-08T10:10:21Z

All Submissions Basics:

#65

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?
Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?
Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
Added Unit Tests and tested them.

coveralls · 2019-04-08T10:28:52Z

Pull Request Test Coverage Report for Build 814

68 of 93 (73.12%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.5%) to 94.954%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pyod/utils/data.py | 47 | 72 | 65.28% |

Totals
Change from base Build 732:	-0.5%
Covered Lines:	3651
Relevant Lines:	3845

💛 - Coveralls

John-Almardeny · 2019-04-09T09:20:10Z

@yzhao062
Hi Yue,
I don't know what coverage/coveralls is complaining about, since my branch is identical to your development one!
However, I tried my best to please it and I think I have no more commits for now.
You might want to take it from here.
Thanks.

yzhao062 · 2019-04-10T14:23:28Z

I think it is complaining about the change in coverage. You will want to add some test cases in https://github.com/yzhao062/pyod/blob/master/pyod/test/test_data.py. You could take a look at that file first (mainly check the generated data shape, the percentage of outliers, etc.) :)
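The kind of checks suggested here (data shape, share of outliers) could be sketched roughly as follows. This is only an illustration: `check_generated_data` is a hypothetical helper, not part of pyod or of this PR, and the demo arrays are hand-built stand-ins for whatever the new generator returns. It also assumes labels use 1 for outliers and 0 for inliers.

```python
import numpy as np


def check_generated_data(X, y, n_samples, n_features, contamination):
    """Hypothetical helper: basic sanity checks on a generated dataset.

    Assumes y uses 1 for outliers and 0 for inliers.
    """
    # The feature matrix must have one row per sample.
    assert X.shape == (n_samples, n_features)
    assert y.shape == (n_samples,)
    # Labels must be binary.
    assert set(np.unique(y)) <= {0, 1}
    # The number of outliers should match the requested ratio,
    # allowing one sample of slack for rounding.
    expected_outliers = int(round(n_samples * contamination))
    assert abs(int(y.sum()) - expected_outliers) <= 1


# Hand-built stand-in data: 200 samples, 3 features, 10% outliers.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.zeros(200, dtype=int)
y_demo[:20] = 1

check_generated_data(X_demo, y_demo, n_samples=200, n_features=3,
                     contamination=0.1)
```

In a real test module the same assertions would be applied to the output of the generator under test, for several combinations of parameters.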

John-Almardeny · 2019-04-14T12:58:25Z

@yzhao062
Hi Yue,
I added the tests and made sure they all pass.
Also, I emptied the branch and refilled it from scratch with the latest version of your development branch.
Now it is truly identical to yours except for the lines of code I added to data.py and test_data.py.
Please check it out and step in if coverage/coveralls is still complaining!

yzhao062 · 2019-04-14T17:31:35Z

I will look into this shortly and try to understand how this new function works. Do not worry about the code coverage. I will write a test/coverage function if needed.

If possible, could you give a short description of how this data generation algorithm works? This will be very helpful for code review.

Thanks a lot for the contribution.

John-Almardeny · 2019-04-14T18:42:28Z

Hi Yue, @yzhao062

It generates one or more clusters of data points with the same or different sizes and densities, depending on the parameters the user passes.
It generates the required ratio of outliers, controlled by the contamination parameter, and distributes them across the clusters.
It uses the make_blobs function provided by sklearn to create the clusters; the main part of the algorithm is maintaining and validating the consistency of the data splits among the different clusters.
It is well documented, and I believe that if you read the documentation (comments) you will get it easily.

As mentioned previously, having clusters of data with different sizes and densities makes outlier detection challenging, especially for algorithms based on k-nearest neighbors such as LOF, LDOF, LoOP, HiCS, SOD and others.
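For reviewers who want a feel for the idea before reading the diff, here is a minimal sketch of the approach described above. It is not the code in this PR: `sketch_cluster_data`, its parameters, and the way outliers are placed (a wide scatter around a randomly chosen cluster centre) are assumptions made for illustration. Only `make_blobs` and its `n_samples`, `centers`, `cluster_std` and `random_state` arguments come from sklearn.

```python
import numpy as np
from sklearn.datasets import make_blobs


def sketch_cluster_data(n_samples=300, n_clusters=3, n_features=2,
                        contamination=0.1, min_size=10, random_state=42):
    """Hypothetical sketch: clusters of varying size/density plus outliers."""
    rng = np.random.RandomState(random_state)
    n_outliers = int(round(n_samples * contamination))
    n_inliers = n_samples - n_outliers
    if n_inliers < min_size * n_clusters:
        raise ValueError("not enough inliers for the requested clusters")

    # Different sizes: every cluster gets min_size points, the rest are
    # split at random, so the sizes always sum to n_inliers.
    weights = rng.dirichlet(np.ones(n_clusters))
    sizes = min_size + rng.multinomial(n_inliers - min_size * n_clusters,
                                       weights)

    # Different densities: one standard deviation per cluster.
    stds = rng.uniform(0.3, 1.5, size=n_clusters)
    centers = rng.uniform(-10, 10, size=(n_clusters, n_features))

    X_in, _ = make_blobs(n_samples=sizes.tolist(), centers=centers,
                         cluster_std=stds.tolist(), random_state=rng)

    # Assumed outlier rule: assign each outlier to a random cluster and
    # scatter it with a much wider spread than that cluster's inliers.
    # Some of these points can still land inside a cluster.
    owner = rng.randint(0, n_clusters, size=n_outliers)
    noise = rng.normal(size=(n_outliers, n_features))
    X_out = centers[owner] + noise * (6.0 * stds[owner])[:, None]

    X = np.vstack([X_in, X_out])
    y = np.r_[np.zeros(n_inliers, dtype=int), np.ones(n_outliers, dtype=int)]

    # Shuffle so that outliers are not grouped at the end.
    order = rng.permutation(n_samples)
    return X[order], y[order]


X, y = sketch_cluster_data()
```

The size bookkeeping (making sure the per-cluster counts add up to the requested totals) is the part that corresponds to "maintaining and validating the consistency of the data splits"; the PR's own validation is more thorough than this sketch.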

John-Almardeny · 2019-04-15T15:15:21Z

opening a new pull request to avoid conflicts 15/04/2019

John-Almardeny changed the title ~~Generate synthetic data in clusters~~ Generate Synthetic Data in Clusters Apr 8, 2019

Yahya added 7 commits April 14, 2019 13:44

generate data in clusters

b9c0696

generate data in clusters

cd50520

minor update to make Travis CI happy

fb77ab1

making Travis CI more happier

c730895

fixing random state and trying to make coveralls happy

521d343

rebasing branch

3b2f795

Adding Tests and Refactoring

b247c08

John-Almardeny closed this Apr 15, 2019

John-Almardeny mentioned this pull request Apr 15, 2019

Generate data clusters #76

Merged

12 tasks
