GitHub - simon-weber/Predicting-Code-Popularity: Goal: classify GitHub repo -> star range. This was for a final project in a data mining class.

#Predicting the Popularity of Code

This repo contains code written for a final project I did for a data mining course in Fall 2012. The project was eventually written up as a paper, available in this repo, which my professor later reworked into "What Makes an Open Source Code Popular on GitHub?". The original project description is below:

Basically, I wanted to see if I could tell popular projects from unpopular ones, and if so, what the differences were. I only looked at Python repos on GitHub (quick shoutout to the GitHub team, who kindly provided a custom db dump for me).

My most interesting result was that highly popular repos (350+ stars) can be differentiated from unpopular ones (3-10 stars) quite well solely by looking at the relative occurrence of AST nodes.
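To make "relative occurrence of AST nodes" concrete, here is a minimal sketch using Python's standard `ast` module. It is not the feature code from this repo (see features.py for that); the function name `node_type_frequencies` is made up for illustration, and the sketch targets Python 3 rather than the Python version the project was written for.

```python
import ast
from collections import Counter


def node_type_frequencies(source):
    """Return {AST node type name: share of all nodes} for one source string."""
    tree = ast.parse(source)
    # ast.walk visits every node in the tree, including the Module root.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    # Tiny illustrative input; a real feature would aggregate over a whole repo.
    example = "def add(a, b):\n    return a + b\n"
    for name, share in sorted(node_type_frequencies(example).items()):
        print(f"{name}: {share:.3f}")
```

Normalizing by the total node count keeps the values comparable between small and large code bases, which is presumably why a relative measure is used here.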

Here are links to the report from this project:

writeup: link
some data: link
presentation deck: link

##Running everything

The code is far from elegant, but it gets the job done. First, grab the database of repos: Google Drive. This is a SQLite database; it should be named erepo.db and placed in the root of the repo.
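If you want to check the download before going further, the sketch below lists the tables in a SQLite file using the standard `sqlite3` module. The schema of the dump is not described in this README, so the helper (a hypothetical `list_tables`, not part of this repo) only reports what it finds.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    # sqlite_master is SQLite's built-in catalog of schema objects.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

For example, `list_tables(sqlite3.connect("erepo.db"))` run from the repo root should print the table names. Note that `sqlite3.connect` silently creates an empty file if the path does not exist, so an empty list usually means the db is misnamed or in the wrong directory.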

Before running any code, you'll need some dependencies. Create a new venv (if you already have numpy or scikit installed system-wide, you might consider letting the venv use the global site-packages), then run pip install -r requirements.txt to install them.

Next, you need to pick your sample size. Edit choose_sample.py as you please (you have to download every repo in the sample, so you probably want to keep it small), then run the script. This creates the classes.py file, which you could then edit if you want (maybe to manually add your own repo).

Next, calculate the features by running featurecalc.py. This can take a while; when it finishes, it creates features.pickle.
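To peek at the result, something like the following should work. It is only a sketch: the structure of the pickled object is not documented here, and `load_features` is a hypothetical helper, not a function from this repo.

```python
import pickle


def load_features(path="features.pickle"):
    """Unpickle and return whatever object is stored at `path`."""
    # Only unpickle files you created yourself; pickle can execute code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Run it from the repo root so that any project classes stored in the pickle can be imported. Calling `type(load_features())` is a reasonable first step to see what you are dealing with.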

Lastly, you can build the classifier and see how it performs by using run_test.py. summarize_feature_data.py is something else to try; it shows you the min/max/median/mean/std of all features you calculated.
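For reference, the five statistics mentioned above can be computed for a single feature as in this sketch. It uses only the standard library and made-up values; it is not the code in summarize_feature_data.py, and whether that script uses the population or sample standard deviation is an assumption here.

```python
import statistics


def summarize(values):
    """Return min/max/median/mean/std for one feature's values."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # Population standard deviation (assumed); use statistics.stdev
        # for the sample form instead.
        "std": statistics.pstdev(values),
    }


if __name__ == "__main__":
    # Hypothetical feature values, one per repo; not data from the project.
    features = {"example_feature": [0.10, 0.25, 0.25, 0.40]}
    for name, values in features.items():
        print(name, summarize(values))
```

Looking at these per-feature summaries before training is a cheap way to spot features that are constant or dominated by a few outliers.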

##Notes

If you want to watch feature calculation progress, you can tail calcfeatures.log.

