Research the detection of the primary name for an identified person #28

Closed
3 of 6 tasks
zurk opened this issue Jul 25, 2019 · 8 comments
@zurk
Contributor

zurk commented Jul 25, 2019

Currently we only have a numeric ID that joins all of a person's identities together. We should also assign a name to that person.

See related discussion:
https://src-d.slack.com/archives/CJQ0DBAJV/p1564057922120100

one thing I've learned from the current demo and identity matching is that we need a simple way for the algorithm to determine a primary name

Plan:

  • Get the data for proper validation. We may use the GitHub API. We should also check the data quality, because what people set in their GitHub profile and what they use in commits can differ. We cannot predict a name that never appears in commits.
  • Test simple heuristics like
    • take the most frequent name
    • take the longest name
    • ???
  • If simple heuristics do not perform well, train something like GBDT.
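
The two simple heuristics from the plan could be sketched roughly as follows. This is only an illustration: the function names, the tie-breaking rules, and the sample names are assumptions, not part of the project's code.

```python
from collections import Counter


def most_frequent_name(commit_names):
    """Pick the name attached to the most commits.

    Ties are broken by length and then alphabetically (an assumption,
    chosen only to make the result deterministic).
    """
    counts = Counter(commit_names)
    return max(counts, key=lambda name: (counts[name], len(name), name))


def longest_name(commit_names):
    """Pick the longest distinct name; ties are broken alphabetically."""
    return max(set(commit_names), key=lambda name: (len(name), name))


# Hypothetical names seen in the commits of one matched identity.
names = ["jdoe", "Jane Doe", "Jane Doe", "Jane Q. Doe"]
frequent = most_frequent_name(names)
longest = longest_name(names)
```

Both functions take one name per commit, so the same input also works for a commit-count-based choice.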
@zurk zurk added the enhancement New feature or request label Jul 25, 2019
@zurk zurk self-assigned this Jul 25, 2019
@vmarkovtsev vmarkovtsev changed the title from "Determine a primary name for an identified person" to "Determine the primary name for an identified person" Jul 25, 2019
@vmarkovtsev vmarkovtsev assigned vmarkovtsev and unassigned zurk Aug 2, 2019
@vmarkovtsev
Collaborator

The primary user names are included in GHTorrent.
@EgorBu Can you please attach here the best identity matching results on GitHub?

@EgorBu

EgorBu commented Aug 2, 2019

I will launch the extraction of identities and dump the result here.

Update:
An old result is available at /user/egorbu/idmatching/cache/res_name_threshold_5_email_threshold_28_data_aggregated_deduplicated.pkl. It was computed on the initial dataset, not the full one.
I will launch the extraction on the full dataset.

@vmarkovtsev
Collaborator

No, the user names are no longer available. I can take all the logins from GHTorrent, but the names require some GitHub crawling. I will have to run it.

@vmarkovtsev
Collaborator

I am crawling the full name for each GitHub login on our 11 nodes. ETA: 10 days.

@vmarkovtsev
Collaborator

Fcking GitHub is throttling us with HTTP 429 responses. I had to add download delays using Scrapy's AutoThrottle extension, which will most probably increase the ETA.

#38

Fetched ~900k in 10 hours.

@vmarkovtsev
Collaborator

My experiments showed that rate limiting kicks in at 5 or more requests per second. I set the frequency to 4 requests per second and the process has finally stabilized.
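
For illustration, a cap of 4 requests per second could be expressed through Scrapy settings roughly as below. The setting names are real Scrapy settings, but the helper `settings_for_rate` and the specific combination of values are assumptions; the actual crawler configuration is not shown in this thread.

```python
def settings_for_rate(requests_per_second):
    """Build a Scrapy settings dict that caps the request rate.

    Hypothetical helper: with one request in flight at a time and a fixed
    delay of 1 / rate seconds, the crawler cannot exceed the given rate.
    """
    return {
        # AutoThrottle adapts the delay to server latency but never goes
        # below DOWNLOAD_DELAY.
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "DOWNLOAD_DELAY": 1.0 / requests_per_second,
        # Disable the default 0.5x-1.5x jitter so the delay stays fixed.
        "RANDOMIZE_DOWNLOAD_DELAY": False,
    }


# 4 requests per second stayed under the observed limit of 5.
custom_settings = settings_for_rate(4)
```

A dict like this can be assigned to a spider's `custom_settings` attribute or merged into the project's `settings.py`.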

@vmarkovtsev
Collaborator

Progress: 4.6 million fetched over the weekend.

@vmarkovtsev
Collaborator

I collected the dataset of full names: /user/internal-datasets/full_names

Random strategy

46% accuracy if we include identities for which the correct name is not among the candidates. If we exclude them, ~82%.

Biggest number of commits strategy

56% and 99.9%, respectively.

So 99.9% is a very good number and there is no need to dig deeper. We should count the commits for each name and sort the names by that count. Problem solved.
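
A minimal sketch of the "biggest number of commits" strategy and of the two ways of measuring accuracy mentioned above. The data layout, the exact string comparison against the true name, and all names in the sample are assumptions made for illustration; the real evaluation code is not part of this thread.

```python
from collections import Counter


def pick_by_commits(commit_names):
    """Return the name with the most commits (ties broken alphabetically)."""
    counts = Counter(commit_names)
    return max(counts, key=lambda name: (counts[name], name))


def accuracy(identities, skip_unreachable):
    """Share of identities whose predicted name equals the true name.

    identities: list of (commit_names, true_name) pairs.
    skip_unreachable: if True, ignore identities whose true name never
    appears in their commits, since no strategy can predict those.
    """
    hits = total = 0
    for commit_names, true_name in identities:
        if skip_unreachable and true_name not in commit_names:
            continue
        total += 1
        hits += pick_by_commits(commit_names) == true_name
    return hits / total if total else 0.0


# Hypothetical sample: the second identity never uses its profile name.
sample = [
    (["Jane Doe", "Jane Doe", "jdoe"], "Jane Doe"),
    (["rroe", "rroe"], "Richard Roe"),
]
acc_all = accuracy(sample, skip_unreachable=False)
acc_reachable = accuracy(sample, skip_unreachable=True)
```

The gap between the two numbers reflects identities whose profile name simply does not occur in their commits.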

@vmarkovtsev vmarkovtsev changed the title from "Determine the primary name for an identified person" to "Research the detection of the primary name for an identified person" Aug 21, 2019
3 participants