# Mining repos from GitHub

## Date: 15 September 2021

Programming languages currently available:
- C
- C#
- C++
- Go
- Java
- JavaScript
- Kotlin
- Objective-C
- PHP
- Python
- Ruby
- Swift
- TypeScript

In [1]:
import os

%load_ext autoreload
%autoreload 2
import pandas as pd
from tqdm import tqdm

pd.options.display.float_format = "{:.2f}".format

# raw string so the backslashes in the Windows path are not treated as escape sequences
os.chdir(r"D:\stuff\github_data")

We set the following filters in [GitHub Search](https://seart-ghs.si.usi.ch/):
* **50+** stars
* **1000+** commits
* **10+** contributors
* Created before **15 September 2019** (2+ years ago)
* Exclude Forks
* Has License
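
As a sanity check, the same filters can be re-applied locally to the exported CSV. The sketch below runs on a small made-up table, not on the real data. The column names follow those visible in the `ghs_df.head()` output, except `Created At`, which is an assumed name because that column is hidden in the truncated output. `passes_filters` is a hypothetical helper.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the real dataset.
toy_df = pd.DataFrame(
    {
        "Name": ["a/one", "b/two", "c/three", "d/four"],
        "Is Fork": [False, False, True, False],
        "Commits": [2822, 900, 5000, 1470],
        "Contributors": [13, 40, 25, 9],
        "Stargazers": [50, 300, 1200, 2828],
        "License": ["Other", "MIT License", "MIT License", None],
        # "Created At" is an assumed column name.
        "Created At": ["2015-01-01", "2018-06-01", "2017-03-01", "2020-02-01"],
    }
)


def passes_filters(df, created_before="2019-09-15"):
    """Return a boolean mask of rows that satisfy all six search filters."""
    created = pd.to_datetime(df["Created At"])
    return (
        (df["Stargazers"] >= 50)
        & (df["Commits"] >= 1000)
        & (df["Contributors"] >= 10)
        & (created < pd.Timestamp(created_before))
        & ~df["Is Fork"]
        & df["License"].notna()
    )


filtered = toy_df[passes_filters(toy_df)]
print(filtered["Name"].tolist())
```

Only `a/one` survives here: `b/two` has too few commits, `c/three` is a fork, and `d/four` has too few contributors, no license, and was created too recently.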

In [2]:
ghs_df = pd.read_csv("ghs_results.csv")
ghs_df.shape

(18556, 27)

In [3]:
ghs_df.head()

Unnamed: 0,Name,Is Fork,Commits,Branches,Default Branch,Releases,Contributors,License,Watchers,Stargazers,...,Total Issues,Open Issues,Total Pull Requests,Open Pull Requests,Last Commit,Last Commit SHA,Has Wiki,Is Archived,Languages,Labels
0,0intro/wmii,False,2822,1,master,0,13,Other,4,50,...,1.0,0.0,0.0,0.0,2021-03-23 00:00:00.0,c86f646bf034199288c09d161edcc8acbcd7fe53,True,False,"C,Python,Shell,Roff,Makefile,Ruby,Limbo,PostSc...","bug,duplicate,enhancement,help wanted,invalid,..."
1,0xced/xcdyoutubekit,False,1470,14,master,53,15,MIT License,125,2828,...,487.0,35.0,62.0,10.0,2020-12-08 20:25:23.0,23f765941b36d6d28215ab81a37af1039d29138b,False,False,"Objective-C,Swift,Makefile,Shell,Ruby,Python","feature,need-help,investigating,could not repr..."
2,0xcert/framework,False,2667,4,master,10,13,MIT License,25,262,...,110.0,5.0,604.0,0.0,2021-08-19 08:25:55.0,80d393ff0627574a2bf089c92235af0adcad094c,True,False,"TypeScript,Solidity,JavaScript,Vue,Shell,HTML","breaking change,bug,community,discussion,do no..."
3,0xd34df00d/leechcraft,False,36227,1,master,0,38,Boost Software License 1.0,24,180,...,0.0,0.0,349.0,0.0,2021-09-06 17:51:19.0,c0c76a0e0e1d30b88d7602cdc92a525694740cea,False,False,"C++,QML,CMake,HTML,Objective-C++,CSS,JavaScrip...",
4,0xd4d/dnlib,False,1981,1,master,0,15,MIT License,106,1331,...,0.0,0.0,53.0,0.0,2021-08-30 17:55:25.0,8b143447a4dac36dfa02249f8a32136d19d049a4,False,False,"C#,PowerShell","bug,duplicate,enhancement,help wanted,invalid,..."


Next, we collect additional information about these repos via the official GitHub API.

In [11]:
from github import Github
from github.GithubException import RateLimitExceededException, UnknownObjectException, GithubException

# using an access token
with open("access_token.txt") as file:
    access_token = file.readline().strip()

g = Github(access_token)

In [16]:
import calendar
import time
import datetime


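# `repos` must already hold the results for the first 11552 names;
# this run resumes an earlier, interrupted pass over the remaining 7004.
for name in tqdm(ghs_df["Name"][11552:]):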
    try:
        repos.append(g.get_repo(name))
    except RateLimitExceededException:
        core_rate_limit = g.get_rate_limit().core
        reset_timestamp = calendar.timegm(core_rate_limit.reset.timetuple())
        sleep_time = reset_timestamp - calendar.timegm(time.gmtime()) + 5
        print(f"{datetime.datetime.now()} Rate Limit exceeded! Sleeping for {sleep_time // 60} minutes")
        time.sleep(sleep_time)
        repos.append(g.get_repo(name))
    except UnknownObjectException:
        print(f"Unknown object exception with repo {name}")
        repos.append(-1)
    except GithubException:
        print(f"GitHub exception with repo {name}")
        repos.append(-1)

Unknown object exception with repo openroberta/robertalab
Unknown object exception with repo optimizely/oui
Unknown object exception with repo original-male/non
Unknown object exception with repo planetfederal/geogig
Unknown object exception with repo practicalswift/swift-compiler-crashes
Unknown object exception with repo radareorg/radare2-regressions
Unknown object exception with repo rozofs/rozofs
2021-09-15 17:40:18.195056 Rate Limit exceeded! Sleeping for 30 minutes
Unknown object exception with repo streamsets/datacollector
Unknown object exception with repo t0rakka/mango
Unknown object exception with repo toggl/mobileapp
Unknown object exception with repo trailofbits/osql
Unknown object exception with repo vercel/docs
Unknown object exception with repo wix/wix-style-react
Unknown object exception with repo yuzu-emu/yuzu-canary
Unknown object exception with repo yuzu-emu/yuzu-nightly
100%|██████████████████████████████████████████████████████████████████████████████| 7004/7004 [56:22<00:00,  2.07it/s]
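
The sleep duration in the loop above is the gap between the rate-limit reset time and the current time, plus a 5-second buffer. The sketch below isolates that arithmetic using only the standard library. It also clamps the gap at zero, because `time.sleep` raises `ValueError` for a negative argument, which could in principle happen if the reset time has already passed. `seconds_until_reset` and its parameters are hypothetical names, and the reset time is assumed to be a UTC `datetime`.

```python
import calendar
import datetime
import time


def seconds_until_reset(reset_utc, now_epoch=None, buffer_s=5):
    """Seconds to sleep until a UTC reset time, plus a safety buffer.

    The gap is clamped at zero so the result is never negative.
    """
    reset_epoch = calendar.timegm(reset_utc.timetuple())
    if now_epoch is None:
        now_epoch = calendar.timegm(time.gmtime())
    return max(reset_epoch - now_epoch, 0) + buffer_s


# Example with fixed times: the reset is 30 minutes in the future.
reset = datetime.datetime(2021, 9, 15, 15, 10, 0)
now = calendar.timegm(datetime.datetime(2021, 9, 15, 14, 40, 0).timetuple())
wait = seconds_until_reset(reset, now_epoch=now)
print(f"Sleeping for {wait // 60} minutes ({wait} seconds)")
```

Passing `now_epoch` explicitly keeps the function easy to check without waiting on a real clock.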


In [17]:
full_names = []
git_urls = []
languages = []

for repo in tqdm(repos):
    try:
        full_names.append(repo.full_name)
        git_urls.append(repo.git_url)
        languages.append(repo.language)
    except AttributeError:  # the -1 placeholders for repos that could not be fetched
        full_names.append("")
        git_urls.append("")
        languages.append("null")

100%|████████████████████████████████████████████████████████████████████████| 18556/18556 [00:00<00:00, 164058.13it/s]


In [46]:
ghs_df["full_name"] = full_names
ghs_df["git_url"] = git_urls
# swap the git:// scheme for https://, leaving the empty placeholder strings untouched
ghs_df["git_url"] = ghs_df["git_url"].apply(lambda x: "https" + x[3:] if x else x)
ghs_df["Language"] = languages
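
A more defensive version of the scheme swap checks the prefix explicitly, so any value that does not start with `git://` (such as the empty placeholder strings) passes through unchanged. This is a minimal sketch; `to_https_url` is a hypothetical helper name.

```python
def to_https_url(git_url):
    """Replace a leading git:// scheme with https://; return other values unchanged."""
    prefix = "git://"
    if git_url.startswith(prefix):
        return "https://" + git_url[len(prefix):]
    return git_url


print(to_https_url("git://github.com/vuejs/vue.git"))
print(repr(to_https_url("")))
```

It could be applied column-wise with `ghs_df["git_url"].apply(to_https_url)`.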

In [47]:
ghs_df["License"].unique()

array(['Other', 'MIT License', 'Boost Software License 1.0',
       'Apache License 2.0', 'GNU Affero General Public License v3.0',
       'GNU General Public License v2.0',
       'GNU General Public License v3.0',
       'BSD 3-Clause New or Revised License',
       'Creative Commons Zero v1.0 Universal',
       'Mozilla Public License 2.0', 'Open Software License 3.0',
       'GNU Lesser General Public License v2.1',
       'GNU Lesser General Public License v3.0',
       'Creative Commons Attribution 4.0 International', 'ISC License',
       'Eclipse Public License 2.0', 'BSD 2-Clause Simplified License',
       'Educational Community License v2.0',
       'European Union Public License 1.1', 'Eclipse Public License 1.0',
       'zlib License', 'Academic Free License v3.0', 'MIT No Attribution',
       'SIL Open Font License 1.1', 'Artistic License 2.0',
       'The Unlicense', 'BSD 4-Clause Original or Old License',
       'PostgreSQL License',
       'Creative Commons Attribution

In [48]:
ghs_df["Language"].unique()

array(['C', 'Objective-C', 'TypeScript', 'C++', 'C#', 'Go', 'JavaScript',
       'PHP', 'Python', 'Ruby', 'Java', 'null', 'Swift', 'SQF', 'Kotlin',
       'CSS', 'Groovy', None, 'HTML', 'Jupyter Notebook', 'Boogie',
       'Common Workflow Language', 'Rust', 'Smarty', 'Batchfile', 'Shell',
       'PowerShell', 'Lua', 'Cython', 'Vala', 'Svelte', 'Standard ML',
       'Haskell', 'SCSS', 'D', 'POV-Ray SDL', 'Emacs Lisp', 'Makefile',
       'Markdown', 'CMake', 'CoffeeScript', 'Vue', 'FreeMarker', 'VHDL',
       'Stylus', 'PowerBuilder', 'Odin', 'Gherkin', 'PLpgSQL', 'Scala',
       'TeX', 'Cuda', 'R', 'GLSL', 'Assembly', 'Puppet', 'Vim script',
       'Objective-C++', 'LLVM', 'WebAssembly'], dtype=object)

Next filtering steps:

1. Keep only repositories licensed under Apache License 2.0, MIT License, or BSD 3-Clause New or Revised License
2. Drop rows with a duplicate `full_name` (duplicates can appear when repositories are renamed, transferred, or mirrored)

In [49]:
# keep only repos with an Apache / MIT / BSD license
res_df = ghs_df.loc[
    ghs_df["License"].isin(["Apache License 2.0", "MIT License", "BSD 3-Clause New or Revised License"])
]
print(
    f"After dropping repos without Apache/MIT/BSD licences: {len(res_df)}, diff with previous step: {len(ghs_df) - len(res_df)}"
)
prev_len = len(res_df)

# drop duplicate repos
res_df = res_df.drop_duplicates(subset=["full_name"])
print(f"After dropping full_name duplicates: {len(res_df)}, diff with previous step: {prev_len - len(res_df)}")

res_df = res_df.sort_values(by="Stargazers", ascending=False)

After dropping repos without Apache/MIT/BSD licences: 8751, diff with previous step: 9805
After dropping full_name duplicates: 8606, diff with previous step: 145


In [50]:
res_df.head()

Unnamed: 0,Name,Is Fork,Commits,Branches,Default Branch,Releases,Contributors,License,Watchers,Stargazers,...,Open Pull Requests,Last Commit,Last Commit SHA,Has Wiki,Is Archived,Languages,Labels,full_name,git_url,Language
5615,freecodecamp/freecodecamp,False,27245,5,main,0,410,BSD 3-Clause New or Revised License,8428,321889,...,22.0,2021-03-21 16:58:20.0,7215a6fa77fdedc4ebba2eb3723216d6b978e0bb,False,False,"JavaScript,CSS,Shell,HTML,EJS,Less","crowdin-sync,first timers only,help wanted,inv...",freeCodeCamp/freeCodeCamp,https://github.com/freeCodeCamp/freeCodeCamp.git,JavaScript
17582,vuejs/vue,False,3200,69,dev,210,316,MIT License,6230,187796,...,210.0,2021-09-04 14:12:51.0,4f6f39a26cfc0d78d6b09fd6d3802d45aabb758e,True,False,"JavaScript,TypeScript,HTML,Vue,CSS,Shell","1.x,backlog,browser quirks,bug,contribution we...",vuejs/vue,https://github.com/vuejs/vue.git,JavaScript
5127,facebook/react,False,14294,110,main,96,431,MIT License,6719,171203,...,197.0,2021-07-10 22:02:00.0,0f09f14ae60cfca996c15fc50eeb59447c19a7be,True,False,"JavaScript,HTML,CSS,C++,TypeScript,CoffeeScrip...","Browser: IE,Browser: Safari,CLA Signed,Compone...",facebook/react,https://github.com/facebook/react.git,JavaScript
16384,tensorflow/tensorflow,False,113242,49,master,138,402,Apache License 2.0,8100,156951,...,173.0,2021-06-27 09:13:07.0,1df058b0a4af8c5673251e27a57af320d7b1d29d,False,False,"C++,Python,MLIR,Starlark,HTML,Go,C,Jupyter Not...","API review,Fixed in Nightly,ModelOptimizationT...",tensorflow/tensorflow,https://github.com/tensorflow/tensorflow.git,C++
13395,public-apis/public-apis,False,3193,1,master,0,437,MIT License,3480,156393,...,39.0,2021-09-11 04:26:13.0,e115e1845832902d8ec10c428e441898c1a748cd,False,False,"Python,Shell","alphabetical ordering is required,Awaiting bui...",public-apis/public-apis,https://github.com/public-apis/public-apis.git,Python


In [60]:
res_df.groupby("Language")["Name"].count().sort_values()

Language
Batchfile                      1
TeX                            1
Puppet                         1
PowerShell                     1
PowerBuilder                   1
POV-Ray SDL                    1
Odin                           1
Groovy                         1
Gherkin                        1
GLSL                           1
Lua                            1
Cuda                           1
Boogie                         1
null                           1
Common Workflow Language       1
Svelte                         2
Vue                            2
CoffeeScript                   2
Cython                         2
Scala                          2
SCSS                           2
Rust                           3
Jupyter Notebook               4
Shell                          4
CSS                            4
HTML                          16
Objective-C                   72
Kotlin                       114
Swift                        122
C                            301
P

In [59]:
# select the column before aggregating so non-numeric columns are not summed
res_df.groupby("Language")["Commits"].sum().sort_values()

Language
Batchfile                      1152
Gherkin                        1351
PowerShell                     1444
Puppet                         1636
Common Workflow Language       1963
GLSL                           1994
null                           2333
TeX                            2380
Cython                         2629
Groovy                         2675
CoffeeScript                   2701
Vue                            3499
Svelte                         3749
Odin                           4247
Lua                            4473
Scala                          4710
Boogie                         5214
Rust                           7235
CSS                            8094
SCSS                          11067
POV-Ray SDL                   12342
PowerBuilder                  12659
Cuda                          14389
Shell                         17688
Jupyter Notebook              17962
HTML                          67769
Objective-C                  210231
Kotlin             

In [56]:
print(f"{len(res_df)} repos and {res_df['Commits'].sum()} commits")

8606 repos and 34532637 commits


In [57]:
res_df.to_csv("filtered_result.csv")
res_df["Name"].to_csv("repos_names.txt", index=False, header=False)
res_df["git_url"].to_csv("repos_urls.txt", index=False, header=False)