In [1]:
from presentation_data import *
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True)

## Study Design

### Goals and Requirements

Our goal is to analyze the relationship between programming languages and the commenting behavior of their programmers. This requires information about the repositories and their contributors.

## Study Questions

- Which programming language is commented the most?
- Is there a difference between interpreted languages like Ruby and Python?
- Is there a difference between system-level languages like C, C++ and Rust?
- Is there a difference between JVM languages like Java and Kotlin?
- Is there a difference between functional languages like Haskell and OCaml?

## Research Method and Strategy

We first use the GitHub API to find the most starred repositories per language, then fetch their source code and analyze the comments in the source files.
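As a rough sketch of the first step, the snippet below builds a request URL for GitHub's repository search endpoint, sorted by stars. The helper name `most_starred_url` and its parameters are assumptions made for illustration; the study's actual download code is not shown here. No request is sent, the function only assembles the URL.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for searching repositories.
GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def most_starred_url(language, per_page=100, page=1):
    """Build a search URL for the most starred repositories of one language.

    Hypothetical helper: name and defaults are illustrative only.
    """
    query = {
        "q": f"language:{language}",  # restrict the search to one language
        "sort": "stars",              # rank results by star count
        "order": "desc",              # most starred first
        "per_page": per_page,
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(query)}"


print(most_starred_url("rust", per_page=5))
```

The resulting URL could be fetched with any HTTP client; authentication and rate limiting are left out of this sketch.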

We define every language we want to analyze together with its file extensions so that we can iterate over them.
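Such a definition might look like the mapping below. The extension lists are assumptions chosen for illustration and are not necessarily the ones used in the study; `files_for_language` is likewise a hypothetical helper.

```python
from pathlib import Path

# Assumed extensions per language (illustrative, not the study's exact list).
LANGUAGE_EXTENSIONS = {
    "python": [".py"],
    "ruby": [".rb"],
    "c": [".c", ".h"],
    "c++": [".cpp", ".cc", ".hpp"],
    "rust": [".rs"],
    "java": [".java"],
    "kotlin": [".kt"],
    "ocaml": [".ml", ".mli"],
    "haskell": [".hs"],
}


def files_for_language(paths, language):
    """Return the paths whose file extension belongs to the given language."""
    extensions = set(LANGUAGE_EXTENSIONS[language])
    return [p for p in paths if Path(p).suffix.lower() in extensions]


paths = ["setup.py", "lib/parser.rb", "src/main.rs", "README.md"]
for language in LANGUAGE_EXTENSIONS:
    matches = files_for_language(paths, language)
    if matches:
        print(language, matches)
```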

## Data

- First, all repositories were downloaded via the GitHub API.
- Then, all relevant files were analyzed using **pygount**.
- Finally, the results of the analysis were stored in a JSON file.
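The last step can be sketched as a simple round trip through a JSON file. The field names follow the DataFrame columns shown in this notebook; the record values, the file name `analysis.json`, and the helpers `save_results` and `load_results` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Made-up example records; one entry per analyzed repository.
records = [
    {"language": "c", "stars": 10, "forks": 2,
     "empty": 30, "code": 200, "documentation": 50},
    {"language": "kotlin", "stars": 7, "forks": 1,
     "empty": 12, "code": 90, "documentation": 4},
]


def save_results(records, path):
    """Write the analysis records to a JSON file."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_results(path):
    """Read the analysis records back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "analysis.json"  # hypothetical file name
    save_results(records, path)
    restored = load_results(path)

print(restored == records)
```

A list of records in this shape can be passed directly to `pandas.DataFrame(records)` to obtain one row per repository.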

### JSON converted to DataFrame

In [2]:
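# Drop the identifying columns that are not used as features.
analysis_df.drop(['default_branch', 'name', 'owner'], axis = 1, inplace = True)
# Order the columns so that 'documentation' (the prediction target) comes last.
analysis_df = analysis_df[['stars', 'forks', 'empty', 'code', 'language', 'documentation']]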

In [3]:
language_values = {
  'python':  1,
  'ruby' :   2,
  'c':       3,
  'c++':     4,
  'rust':    5,
  'java':    6,
  'kotlin':  7,
  'ocaml':   8,
  'haskell': 9,
}

In [4]:
# Encode each language name as its integer code, restricted to the 'language' column.
analysis_df = analysis_df.replace({'language': language_values})

# Shuffle data.
analysis_df = analysis_df.sample(frac = 1).reset_index(drop = True)

analysis_df.head()

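|   | stars | forks | empty | code | language | documentation |
|---|-------|-------|-------|------|----------|---------------|
| 0 | 2406 | 413 | 2457 | 5244 | 6 | 68 |
| 1 | 2356 | 360 | 3906 | 9239 | 3 | 3827 |
| 2 | 419 | 12 | 2986 | 5853 | 5 | 859 |
| 3 | 79 | 12 | 310 | 732 | 7 | 26 |
| 4 | 2764 | 379 | 2406 | 15861 | 3 | 2461 |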


In [5]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

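dataset = analysis_df.values
shape = dataset.shape

# Features: every column except the last. Target: lines of documentation.
data = dataset[:, :-1]
target = dataset[:, -1]
# Hold out 25% of the repositories for testing.
training_data, test_data, training_target, test_target = train_test_split(data, target, test_size = 0.25)

# Note: the classifier treats every distinct documentation count as its own class.
clf = tree.DecisionTreeClassifier().fit(training_data, training_target)

prediction = clf.predict(test_data)

total = len(test_data)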

In [6]:
import numpy as np  # may already be provided by the star import above

# Predictions within 1000 lines of the target count as correct.
difference = np.absolute(test_target - prediction)
incorrect = (difference > 1000).sum()

print('Number of incorrect predictions out of %d: %d (%.1f%%)' % (
  total,
  incorrect,
  incorrect / total * 100.0,
))

min_difference = difference.min()
max_difference = difference.max()

print(f'Maximum difference between target and prediction: {max_difference}')

Number of incorrect predictions out of 2250: 801 (35.6%)
Maximum difference between target and prediction: 3177274
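The tolerance-based error rate used above can be isolated in a small function. The sketch below is written in plain Python so that it runs without NumPy; the name `error_rate` and the sample numbers are made up for illustration and are not taken from the study's data.

```python
def error_rate(targets, predictions, tolerance=1000):
    """Fraction of predictions that miss the target by more than `tolerance` lines."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one prediction is required")
    incorrect = sum(
        abs(target - predicted) > tolerance
        for target, predicted in zip(targets, predictions)
    )
    return incorrect / len(targets)


# Made-up documentation line counts; absolute differences are 800, 1500, 500, 1700.
targets = [100, 5000, 20000, 300]
predictions = [900, 3500, 20500, 2000]

print(f"{error_rate(targets, predictions) * 100:.1f}% incorrect")
```

Because the tolerance is a parameter, the same function can be used to check how sensitive the reported error rate is to the choice of 1000 lines.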


## Study Questions

### Which programming language is commented the most?

In [7]:
iplot(fig_all_comments)
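One way to compare languages is the share of documentation lines among all code and documentation lines, aggregated per language. The sketch below assumes this metric; how `fig_all_comments` is actually computed is not shown in this notebook, and both the helper `documentation_share` and the records are invented for illustration.

```python
from collections import defaultdict


def documentation_share(records):
    """Return documentation / (documentation + code) per language."""
    totals = defaultdict(lambda: [0, 0])  # language -> [documentation, code]
    for record in records:
        totals[record["language"]][0] += record["documentation"]
        totals[record["language"]][1] += record["code"]
    return {
        language: documentation / (documentation + code)
        for language, (documentation, code) in totals.items()
        if documentation + code > 0  # skip languages without any counted lines
    }


# Made-up records for illustration only.
records = [
    {"language": "python", "code": 80, "documentation": 20},
    {"language": "python", "code": 70, "documentation": 30},
    {"language": "c", "code": 90, "documentation": 10},
]

shares = documentation_share(records)
most_commented = max(shares, key=shares.get)
print(most_commented, round(shares[most_commented], 2))
```

Summing the line counts before dividing weights large repositories more heavily; averaging the per-repository ratios instead would be an equally defensible choice.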

### Difference between Interpreted Languages

In [8]:
iplot(fig_interpreted)

### Difference between System Languages

In [9]:
iplot(fig_system)

### Difference between JVM Languages

In [10]:
iplot(fig_jvm)

### Difference between Functional Languages

In [11]:
iplot(fig_functional)

### Summary

In [12]:
iplot(summary_pie)