Bad precision and recall (~60%) on IBM and intel open source stacks #58

warenlg · 2019-08-31T09:51:04Z

Following up on #17 and #30, where the performance of the identity merging algorithm was evaluated on 22 different open source stacks: we noticed particularly bad performance on two organizations, IBM and Intel, with ~60% precision and recall.

This needs to be investigated because nearly all other organizations are above 90% precision and recall, and we should be able to promise an acceptable score (at least 90%) on all organizations.
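For readers unfamiliar with how precision and recall apply to identity merging, here is a minimal sketch. It assumes a pairwise definition: every pair of identities placed in the same cluster counts as one "same person" prediction. The function names and the pairwise scheme are illustrative assumptions, not necessarily the evaluation used in #17 and #30.

```python
from itertools import combinations


def same_person_pairs(clusters):
    """Return the set of identity pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Hypothetical pairwise metrics for merged identities.

    `predicted` and `reference` are iterables of sets of identity keys.
    """
    pred = same_person_pairs(predicted)
    ref = same_person_pairs(reference)
    true_positives = len(pred & ref)
    # With no pairs to judge, treat the score as perfect by convention.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(ref) if ref else 1.0
    return precision, recall
```

Under this definition, a false positive is a pair merged by the algorithm but separate in the reference, and a false negative is a pair the algorithm failed to merge.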

warenlg · 2019-09-02T09:38:25Z

It turns out that the identity graphs of Intel and IBM were pretty big: 80k and 11k edges respectively. Reducing the proportion of popular names decreased the number of false positives and false negatives, as popular identities tend to be the ones with problems. That is why, by increasing the popularity threshold from 5 to 100, we improved our precision and recall from ~62% to 94% for both organizations.
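A rough sketch of how a popularity threshold can gate name-based merging. It assumes a name is "popular" when at least `threshold` distinct emails use it, and that popular names are not trusted as merge evidence. This is an illustration with hypothetical function names, not the project's actual Go implementation.

```python
from collections import defaultdict


def popular_names(identities, threshold):
    """Return names used by at least `threshold` distinct emails (assumed definition)."""
    emails_by_name = defaultdict(set)
    for name, email in identities:
        emails_by_name[name].add(email)
    return {name for name, emails in emails_by_name.items() if len(emails) >= threshold}


def name_edges(identities, threshold):
    """Build graph edges between emails that share a non-popular name."""
    popular = popular_names(identities, threshold)
    emails_by_name = defaultdict(set)
    for name, email in identities:
        if name not in popular:
            emails_by_name[name].add(email)
    edges = set()
    for emails in emails_by_name.values():
        ordered = sorted(emails)
        # Link every email to the first one; enough to connect the group.
        for other in ordered[1:]:
            edges.add((ordered[0], other))
    return edges
```

In this sketch, raising the threshold from 5 to 100 shrinks the set of popular names, so more names are allowed to contribute edges to the identity graph.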

warenlg · 2019-09-02T09:39:59Z

We cannot increase the popularity threshold too much though, otherwise we start losing precision at some point.

vmarkovtsev · 2019-09-02T09:43:31Z

Great data analysis, Waren 👍 Let's use your recommended threshold of 100 and update the CSVs/embedded Go code.

warenlg · 2019-09-02T09:44:16Z

Thanks, I just opened PR #59 this morning.
Now closed.

warenlg mentioned this issue Sep 2, 2019

Increase the popular name threshold to 100 #59

Merged

warenlg closed this as completed Sep 2, 2019

warenlg self-assigned this Sep 12, 2019

warenlg added the research label Sep 12, 2019
