Bad precision and recall (~60%) on IBM and Intel open source stacks #58

Closed
warenlg opened this issue Aug 31, 2019 · 4 comments

@warenlg
Contributor

warenlg commented Aug 31, 2019

Following up on #17 and #30, where the performance of the identity merging algorithm was evaluated on 22 different open source stacks: we noticed particularly bad performance on 2 organizations, IBM and Intel, with ~60% precision and recall.

This needs to be investigated because nearly all other organizations are above 90% precision and recall, and we should be able to promise an acceptable score (at least 90%) on all organizations.
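
For readers who want to reproduce this kind of score, here is a minimal sketch of one common way to compute precision and recall for identity merging. It is an assumption, not necessarily the evaluation used in #17 and #30: it compares pairs of identities that end up in the same cluster against a hand-labelled ground truth. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def merged_pairs(clusters):
    """Return the set of unordered identity pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each unordered pair has a single canonical representation.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters against the truth."""
    pred = merged_pairs(predicted)
    true = merged_pairs(truth)
    true_positives = len(pred & true)
    # By convention, report 1.0 when there is nothing to predict or to find.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(true) if true else 1.0
    return precision, recall


# Hypothetical toy data: two real people, one wrong split and one wrong merge.
truth = [{"a1", "a2", "a3"}, {"b1", "b2"}]
predicted = [{"a1", "a2"}, {"a3", "b1", "b2"}]
precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)
```

In this toy case, half of the predicted pairs are correct and half of the true pairs are found, so both scores are 0.5.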

@warenlg
Contributor Author

warenlg commented Sep 2, 2019

It turns out the identity graphs of Intel and IBM were pretty big: 80k and 11k edges respectively. Reducing the proportion of popular names decreased the number of false positives and false negatives, as popular identities tend to be the ones with problems. That's why, by increasing the popularity threshold from 5 to 100, we improved our precision and recall from ~62% to 94% for both organizations.

[Image: identity_prec_rec]
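
To illustrate how a popularity threshold can change the merge result, here is a rough sketch under stated assumptions. It is not the project's actual implementation (the thread only says the threshold lives in CSVs and embedded Go code). The assumptions: an identity is a distinct `(name, email)` pair, a name counts as "popular" once at least `threshold` identities carry it, emails always act as merge keys, and popular names are never used as merge keys. All function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def popular_names(identities, threshold):
    """Names carried by at least `threshold` identities (assumed definition)."""
    counts = Counter(name for name, _ in identities)
    return {name for name, n in counts.items() if n >= threshold}


def merge_identities(identities, threshold):
    """Group identities that share an email or a non-popular name."""
    parent = list(range(len(identities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    popular = popular_names(identities, threshold)
    first_seen = {}  # merge key -> index of the first identity that had it
    for idx, (name, email) in enumerate(identities):
        keys = [("email", email)]
        if name not in popular:
            # Only non-popular names are trusted as evidence of a match.
            keys.append(("name", name))
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(set)
    for idx, identity in enumerate(identities):
        groups[find(idx)].add(identity)
    return list(groups.values())


# Hypothetical sample: a generic name used by three people, one real duplicate.
identities = [
    ("admin", "a@x.org"),
    ("admin", "b@y.org"),
    ("admin", "c@z.org"),
    ("Jane Roe", "jane@x.org"),
    ("Jane Roe", "jroe@y.org"),
]
for threshold in (2, 3, 10):
    print(threshold, len(merge_identities(identities, threshold)))
```

On this sample, a threshold of 2 flags "Jane Roe" as popular and misses her duplicate (5 groups), a threshold of 3 merges her two identities while keeping the three "admin" accounts apart (4 groups), and a threshold of 10 also merges the unrelated "admin" accounts (2 groups). This only demonstrates the mechanism; the right value for real data has to come from measurements like the ones above.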

@warenlg
Contributor Author

warenlg commented Sep 2, 2019

We cannot increase the popularity threshold too much though, otherwise we start losing precision at some point.

@vmarkovtsev
Collaborator

Great data analysis Waren 👍 Let's use your recommended threshold 100 and update the CSVs/embedded Go code.

@warenlg
Contributor Author

warenlg commented Sep 2, 2019

Thanks, I just opened the PR this morning: #59
Now closed.

@warenlg warenlg closed this as completed Sep 2, 2019
@warenlg warenlg self-assigned this Sep 12, 2019
2 participants