Computational statistics on the (key)word choices of code

Code Linguistics

The statistical frequency of words in a natural language follows a power-law distribution, known as Zipf's Law. More specifically, there are three distinct regimes that fit well to power laws with exponents of 0.5, 1.0, and 2.0.
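A rank-frequency table is the starting point for checking a Zipfian fit: count each token, sort by count, and pair each count with its rank. A minimal sketch (the function name and toy sentence are illustrative, not part of this project's pipeline):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs, most frequent token first."""
    counts = Counter(tokens)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

tokens = "the cat sat on the mat and the dog sat too".split()
print(rank_frequency(tokens))  # "the" (count 3) takes rank 1
```

Plotting log(count) against log(rank) then makes each power-law regime appear as a straight segment whose slope is the (negative) exponent.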

The goal of this project is to determine whether these observations hold for code. The dataset will be a large sampling of code hosted on GitHub. The hypothesis is that keywords (when restricted to a single language) will give an exponent of 0.5, variables will fit an exponent of 1.0, and comments and everything else will fit a larger exponent (possibly not 2.0).
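Separating source text into the three hypothesized populations (keywords, identifiers, comment words) is straightforward for Python sources using the standard library's `tokenize` and `keyword` modules. A minimal sketch, assuming Python-only input (the project itself targets multiple languages, so its actual tokenizer will differ):

```python
import io
import keyword
import tokenize
from collections import Counter

def classify_tokens(source):
    """Bucket the tokens of a Python source string into
    keywords, identifiers, and comment words."""
    buckets = {"keyword": Counter(), "identifier": Counter(), "comment": Counter()}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            kind = "keyword" if keyword.iskeyword(tok.string) else "identifier"
            buckets[kind][tok.string] += 1
        elif tok.type == tokenize.COMMENT:
            # Comment text is split into plain words
            for word in tok.string.lstrip("#").split():
                buckets["comment"][word] += 1
    return buckets

src = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
buckets = classify_tokens(src)
print(buckets["keyword"])  # counts for 'def' and 'return'
```

For languages without a stdlib tokenizer, the same bucketing can be driven by a per-language keyword list, which is exactly the task listed below.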

Author: Travis Hoppe


  • Download a large sampling of code from github. Completed.
  • Process, filter and tokenize the dataset. Completed.
  • Fit power-law exponents to the data. In progress.
  • Determine a list of keywords for all languages we are interested in.
  • Plot results and interpret.
  • Draft submission for arXiv.
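The fitting step above can be sketched as an ordinary least-squares slope on the log-log rank-frequency data (the exponent is the negated slope). This is a simple baseline, not the project's final estimator; maximum-likelihood fits are generally preferred for power laws:

```python
import math

def fit_power_law(rank_counts):
    """Least-squares slope of log(count) vs log(rank);
    returns the power-law exponent (negated slope)."""
    xs = [math.log(r) for r, _ in rank_counts]
    ys = [math.log(c) for _, c in rank_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic Zipfian data with exponent 1.0: count ∝ 1/rank
data = [(r, 1000.0 / r) for r in range(1, 101)]
print(round(fit_power_law(data), 3))  # → 1.0
```

Run separately on the keyword, identifier, and comment buckets, this yields the three exponents the hypothesis predicts (0.5, 1.0, and something larger).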