Research the detection of the primary name for an identified person #28
Comments
The primary user names are included in GHTorrent.
I will launch the extraction of identities and dump it here. Update:
No, the user names are no longer available. I can take all the logins from GHTorrent, but the names require some GitHub crawling. I will have to run it.
I am crawling the full name for each GitHub login on our 11 nodes. ETA: 10 days.
GitHub is throttling us with 429 responses. I had to add download delays using Scrapy's AutoThrottle extension, which will most probably increase the ETA. Fetched ~900k in 10 hours.
My experiments showed that rate limiting kicks in at 5 or more requests per second. I set the frequency to 4 and the process has finally stabilized.
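A minimal sketch of what such a throttling setup could look like in a Scrapy project, assuming the 4 requests per second cap applies per crawler process (the thread does not say whether the limit was per node or overall). The helper `throttle_settings` is hypothetical; the keys are standard Scrapy settings, but the values are illustrative and not the ones actually used in the crawl.

```python
def throttle_settings(max_requests_per_second: float) -> dict:
    """Build Scrapy settings that keep one process under a request-rate cap.

    Hypothetical helper: the setting names are Scrapy's, the values are
    assumptions chosen for illustration.
    """
    if max_requests_per_second <= 0:
        raise ValueError("max_requests_per_second must be positive")
    min_delay = 1.0 / max_requests_per_second
    return {
        # Let AutoThrottle raise the delay when the server slows down.
        "AUTOTHROTTLE_ENABLED": True,
        # AutoThrottle never goes below DOWNLOAD_DELAY, so this is the floor.
        "DOWNLOAD_DELAY": min_delay,
        # Disable the default 0.5x-1.5x jitter so the floor is a hard cap.
        "RANDOMIZE_DOWNLOAD_DELAY": False,
        # One request in flight per domain keeps the rate predictable.
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "AUTOTHROTTLE_START_DELAY": min_delay,
        "AUTOTHROTTLE_MAX_DELAY": 60.0,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
        # Retry throttled (429) and transient server-error responses.
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    }


# Stay just under the observed limit of 5 requests per second.
custom_settings = throttle_settings(4)
```

The resulting dictionary could be assigned to a spider's `custom_settings` attribute or copied into `settings.py`.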
Progress: 4.6M fetched over the weekend.
I collected the dataset of full names. Random strategy: 46% accuracy if we count identities without a correct name, ~82% otherwise. Biggest-number-of-commits strategy: 56% and 99.9%, respectively. 99.9% is a very good number, so there is no need to dig deeper. We should count the commits for each name and sort them; problem solved.
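The "biggest number of commits" strategy can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the input is assumed to be an iterable of `(identity_id, name)` pairs, one per commit, and `primary_names` is a hypothetical function name. Blank names are skipped and ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def primary_names(commits: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Pick, for each identity, the name that appears on the most commits."""
    counts = defaultdict(Counter)
    for identity_id, name in commits:
        name = name.strip()
        if name:  # ignore commits with an empty author name
            counts[identity_id][name] += 1
    # Highest commit count wins; ties fall back to alphabetical order.
    return {
        identity_id: min(names.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for identity_id, names in counts.items()
    }


commits = [
    (1, "Jane Doe"), (1, "jdoe"), (1, "Jane Doe"),
    (2, "bob"), (2, "Bob Smith"), (2, "Bob Smith"),
]
print(primary_names(commits))
```

Identities whose commits carry only blank names are omitted from the result, which leaves room for a separate fallback such as the crawled GitHub full name.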
Right now we only have a number that joins all of a person's identities together. We should also set the person's name.
See related discussion:
https://src-d.slack.com/archives/CJQ0DBAJV/p1564057922120100
Plan: