# Code Linguistics

*Computational statistics on the (key)word choices of code.*

The statistical frequency of words in a natural language follows a power-law distribution known as Zipf's law. More precisely, the rank-frequency curve shows three distinct regimes, each well fit by a power law with exponents of roughly 0.5, 1.0, and 2.0.
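For reference, the exponent of a rank-frequency curve can be estimated with a least-squares fit in log-log space. The sketch below is illustrative only and is separate from this repository's `fit_tokens` code, which may use a more careful estimator:

```python
from collections import Counter

import numpy as np

def zipf_exponent(tokens):
    """Estimate the Zipf exponent from a token stream.

    Ranks tokens by frequency and fits log(count) = -s*log(rank) + c
    by least squares, returning s. A single global fit is a deliberate
    simplification of the three-regime picture described above.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

words = "the quick brown fox jumps over the lazy dog the end".split()
print(zipf_exponent(words))  # crude on a toy sample; real corpora need many tokens
```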

The goal of this project is to determine whether the same observation holds for code. The dataset will be a large sample of code hosted on GitHub. The hypothesis is that, when restricted to a single language, keywords will give an exponent of 0.5, variables will fit an exponent of 1.0, and comments and everything else will fit a larger exponent (possibly not 2.0).
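Concretely, for Python source the standard library can already make this three-way split between keywords, user-chosen names, and comments. The sketch below is only an illustration of the idea; the repository's `process_code` step may tokenize differently, and other languages would need their own lexers:

```python
import io
import keyword
import tokenize

def classify_tokens(source):
    """Split Python source into the three token classes of the hypothesis:
    language keywords, user-chosen names, and comment words."""
    keywords, names, comments = [], [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # NAME tokens cover both keywords and identifiers; separate them.
            (keywords if keyword.iskeyword(tok.string) else names).append(tok.string)
        elif tok.type == tokenize.COMMENT:
            comments.extend(tok.string.lstrip("#").split())
    return keywords, names, comments

src = "for i in range(10):\n    total = total + i  # accumulate the sum\n"
print(classify_tokens(src))
```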

Author: Travis Hoppe

## Roadmap

- Download a large sample of code from GitHub. Completed.
- Process, filter, and tokenize the dataset. Completed.
- Fit the appropriate power-law exponents to the data. In progress (see the sketch below).
- Determine a list of keywords for each language of interest.
- Plot the results and interpret them.
- Draft a submission for the arXiv.
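For the fitting step, one option (an assumption here, not necessarily what `fit_tokens` does) is the `powerlaw` package of Alstott et al., which estimates the exponent by maximum likelihood rather than a naive log-log regression and can compare the power law against alternatives:

```python
import powerlaw  # pip install powerlaw

# Hypothetical token counts for illustration; the real input would
# come from the tokenized GitHub dataset.
counts = [900, 450, 300, 220, 180, 40, 22, 15, 9, 5, 3, 2, 1, 1]

fit = powerlaw.Fit(counts, discrete=True)
print("estimated exponent alpha:", fit.power_law.alpha)
print("xmin used for the fit:", fit.power_law.xmin)

# Note: alpha here is the exponent of the count distribution p(k) ~ k^-alpha,
# which relates to the Zipf rank-frequency exponent s via alpha = 1 + 1/s.

# Compare the power law against a lognormal alternative.
R, p = fit.distribution_compare("power_law", "lognormal")
print("log-likelihood ratio:", R, "p-value:", p)
```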

## Presentations

## References