Generate Synthetic Data in Clusters #66

Closed
wants to merge 7 commits into from

Conversation

John-Almardeny
Contributor

@John-Almardeny John-Almardeny commented Apr 8, 2019

All Submissions Basics:

#65

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?
  • Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
  • Added Unit Tests and tested them.

@John-Almardeny John-Almardeny changed the title Generate synthetic data in clusters Generate Synthetic Data in Clusters Apr 8, 2019
@coveralls

coveralls commented Apr 8, 2019

Pull Request Test Coverage Report for Build 814

  • 68 of 93 (73.12%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 94.954%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pyod/utils/data.py | 47 | 72 | 65.28% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 732: | -0.5% |
| Covered Lines: | 3651 |
| Relevant Lines: | 3845 |

💛 - Coveralls

@John-Almardeny
Contributor Author

@yzhao062
Hi Yue,
I don't know what coverage/coveralls is complaining about, since my branch is identical to your development one!
However, I tried my best to satisfy it, and I think I have no more commits for now.
You might want to take it from here.
Thanks.

@yzhao062
Owner

I think it complains about the coverage change. You would want to add some test cases in https://github.com/yzhao062/pyod/blob/master/pyod/test/test_data.py. You could take a look at that file first (mainly check the generated data shape, percentage of outliers, etc.) :)
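The kind of checks meant here (data shape and outlier percentage) could look roughly like the sketch below. Note that `toy_generate` and `check_generated_data` are hypothetical stand-ins written only for illustration; they are not the function added in this PR, nor the existing contents of test_data.py.

```python
import numpy as np


def toy_generate(n_train=200, n_test=100, n_features=2,
                 contamination=0.1, random_state=0):
    """Hypothetical stand-in generator; NOT the function from this PR."""
    rng = np.random.RandomState(random_state)

    def make(n):
        n_out = int(n * contamination)
        X_in = rng.normal(0, 1, size=(n - n_out, n_features))   # inliers
        X_out = rng.uniform(-8, 8, size=(n_out, n_features))    # outliers
        y = np.r_[np.zeros(n - n_out, dtype=int),
                  np.ones(n_out, dtype=int)]
        return np.vstack([X_in, X_out]), y

    X_train, y_train = make(n_train)
    X_test, y_test = make(n_test)
    return X_train, X_test, y_train, y_test


def check_generated_data(X_train, X_test, y_train, y_test,
                         n_train, n_test, n_features, contamination):
    """Assert the shapes and the outlier ratio of the generated data."""
    assert X_train.shape == (n_train, n_features)
    assert X_test.shape == (n_test, n_features)
    assert y_train.shape == (n_train,)
    assert y_test.shape == (n_test,)
    # Labels are assumed binary here: 0 = inlier, 1 = outlier.
    assert set(np.unique(y_train)) <= {0, 1}
    # Allow one sample of slack for rounding the outlier count.
    assert abs(y_train.sum() - n_train * contamination) <= 1
    assert abs(y_test.sum() - n_test * contamination) <= 1


data = toy_generate(n_train=200, n_test=100, n_features=2, contamination=0.1)
check_generated_data(*data, n_train=200, n_test=100,
                     n_features=2, contamination=0.1)
```

The same checker could be pointed at the real generator once its return order and label convention are confirmed.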

@John-Almardeny
Contributor Author

John-Almardeny commented Apr 14, 2019

@yzhao062
Hi Yue,
I added the tests and made sure they all pass.
Also, I emptied the branch and refilled it from scratch with the latest version of your development branch.
Now it is truly identical to yours except for the lines of code I added to data.py and test_data.py.
Please check it out and step in if coverage/coveralls is still complaining!

@yzhao062
Owner

I will look into this shortly and try to understand how this new function works. Do not worry about the code coverage. I will write a test/coverage function if needed.

If possible, could you give a short description of how this data generation algorithm works? This will be very helpful for code review.

Thanks a lot for the contribution.

@John-Almardeny
Contributor Author

Hi Yue, @yzhao062

It generates one or more clusters of data points, with the same or different sizes and densities, according to the parameters the user passes.
It generates the required ratio of outliers, controlled by the contamination parameter, and distributes them among the clusters.
It uses the make_blobs function provided by sklearn to create the clusters; the main part of the algorithm maintains and validates the consistency of the data splits among the different clusters.
It is well documented, and I believe that if you read the documentation (comments) you will get it easily.
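As a rough illustration of the idea described above, here is a minimal sketch. It is not the code in this PR: `sketch_clustered_data`, its parameters, the equal cluster sizes, and the way outliers are placed (in a shell around each cluster center) are all assumptions made for the example. Only `make_blobs` and the contamination ratio come from the description. It assumes a scikit-learn version that supports `return_centers` in `make_blobs`.

```python
import numpy as np
from sklearn.datasets import make_blobs


def sketch_clustered_data(n_samples=300, n_clusters=3, n_features=2,
                          contamination=0.1, random_state=0):
    """Hypothetical helper: inliers from make_blobs, outliers spread
    over the clusters. Returns X and labels y (0 = inlier, 1 = outlier)."""
    rng = np.random.RandomState(random_state)
    n_outliers = int(round(n_samples * contamination))
    n_inliers = n_samples - n_outliers

    def split(total):
        # Split a budget across clusters; the remainder goes to the
        # first clusters so that the parts always sum to `total`.
        base, extra = divmod(total, n_clusters)
        return [base + (i < extra) for i in range(n_clusters)]

    inlier_sizes = split(n_inliers)
    outlier_sizes = split(n_outliers)
    assert sum(inlier_sizes) + sum(outlier_sizes) == n_samples

    # Different densities: one standard deviation per cluster.
    stds = np.linspace(0.3, 1.0, n_clusters)
    X_in, _, centers = make_blobs(
        n_samples=inlier_sizes, n_features=n_features,
        cluster_std=stds, random_state=random_state, return_centers=True)

    # Assumed outlier model: points 4 to 6 standard deviations away
    # from their cluster center, in a random direction.
    X_out = []
    for center, std, k in zip(centers, stds, outlier_sizes):
        direction = rng.normal(size=(k, n_features))
        direction /= np.linalg.norm(direction, axis=1, keepdims=True)
        radius = rng.uniform(4 * std, 6 * std, size=(k, 1))
        X_out.append(center + radius * direction)

    X = np.vstack([X_in] + X_out)
    y = np.r_[np.zeros(n_inliers, dtype=int), np.ones(n_outliers, dtype=int)]
    return X, y


X, y = sketch_clustered_data()
print(X.shape, y.mean())  # number of rows/columns and the outlier fraction
```

The `split` helper stands in for the "consistency of the data splits" step: whatever the sizes requested, the per-cluster inlier and outlier counts must add up to the requested totals.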


As previously mentioned, having clusters of data with different sizes and densities makes outlier detection challenging, especially for algorithms based on k-nearest neighbors, such as LOF, LDOF, LoOP, HiCS, SOD, and others.
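To get a feel for this, one could run a neighbor-based detector on clusters of different density and see how the result moves with the neighborhood size. The sketch below uses scikit-learn's `LocalOutlierFactor` rather than PyOD's LOF wrapper, and the cluster centers, spreads, and uniformly scattered outliers are arbitrary assumptions, not data from this PR.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)

# Two inlier clusters with very different densities (assumed values).
X_in, _ = make_blobs(n_samples=[200, 200],
                     centers=np.array([[0.0, 0.0], [6.0, 6.0]]),
                     cluster_std=[0.3, 1.5], random_state=42)

# Assumed outlier model: points scattered uniformly over the whole area.
X_out = rng.uniform(low=-4, high=12, size=(40, 2))

X = np.vstack([X_in, X_out])
y = np.r_[np.zeros(len(X_in)), np.ones(len(X_out))]


def lof_auc(n_neighbors):
    """ROC AUC of LOF scores for a given neighborhood size."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    # negative_outlier_factor_ is more negative for outliers, so flip it.
    scores = -lof.negative_outlier_factor_
    return roc_auc_score(y, scores)


for k in (5, 20, 50):
    print(k, round(lof_auc(k), 3))
```

Comparing the scores across values of `n_neighbors`, and across different `cluster_std` settings, gives a quick sense of how sensitive the detector is to the density mix.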

@John-Almardeny
Contributor Author

Opening a new pull request to avoid conflicts (15/04/2019).

@John-Almardeny John-Almardeny mentioned this pull request Apr 15, 2019
12 tasks

3 participants