This notebook helps reproduce Figure 4 of the paper.

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt 

import numpy as np

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

from tqdm import tqdm

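Here, we collect data from a pair of files that are aligned line by line: an input file (code) and an output file (comments). We repeatedly pick two random lines and compute the BLEU score between the two inputs (code lines) and between the two corresponding outputs (comment lines). These pairs of scores are what we plot.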
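As a quick illustration of what a single score pair looks like, the sketch below computes BLEU-4 for two toy code lines and two toy comment lines. The example lines and the helper name `pair_bleu` are made up for illustration and do not come from any of the datasets; the weights and smoothing method mirror the ones used in this notebook.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

smoothing = SmoothingFunction()


def pair_bleu(reference_tokens, hypothesis_tokens):
    """BLEU-4 between two token lists, using NLTK smoothing method 2."""
    return sentence_bleu(
        [reference_tokens],
        hypothesis_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smoothing.method2,
    )


# Hypothetical toy lines, not taken from any dataset.
code_a = "public int getSize ( ) { return size ; }".split(" ")
code_b = "public int getCount ( ) { return count ; }".split(" ")
comment_a = "returns the size of the list".split(" ")
comment_b = "returns the number of items".split(" ")

input_bleu = pair_bleu(code_a, code_b)
output_bleu = pair_bleu(comment_a, comment_b)

# One row of the CSV would look like this.
print(f"{input_bleu},{output_bleu}")
```

Each sampled pair contributes one such row, so the final plot shows how similarity between inputs relates to similarity between outputs.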

In [2]:
def collect_data(args):
    # Default to 10,000 sampled pairs when num_samples is missing or None.
    num_samples = 10000 if args.get("num_samples") is None else int(args["num_samples"])
    no_zeroes = args["no_zeroes"]

    smoothing = SmoothingFunction()

    # Take each pair of input-output lines, drop the trailing newline, split on spaces, and
    # append the token lists to all_inputs and all_outputs. For FunCom, the <s> and </s>
    # tokens must be removed from the outputs (comments) to prevent artificially increasing
    # BLEU scores. Note that this function does not strip them itself.
    all_inputs = []
    all_outputs = []
    with open(args["input_file"], "r") as file_inputs, \
            open(args["output_file"], "r") as file_outputs:
        for input_line, output_line in zip(file_inputs, file_outputs):
            all_inputs.append(input_line.rstrip("\n").split(" "))
            all_outputs.append(output_line.rstrip("\n").split(" "))

    # Iterate over the desired number of samples (10k in our setting). Each iteration draws
    # two distinct indices (replace=False) so that a line is never compared with itself,
    # then computes BLEU-4 with NLTK's smoothing method 2.
    with open(args["csv_file"], "w") as output_csv_file:
        output_csv_file.write("input_bleu,output_bleu\n")

        for _ in tqdm(range(num_samples)):
            idxes = np.random.choice(len(all_inputs), 2, replace=False)

            input_bleu = sentence_bleu([all_inputs[idxes[0]]], all_inputs[idxes[1]],
                                       weights=(0.25, 0.25, 0.25, 0.25),
                                       smoothing_function=smoothing.method2)
            output_bleu = sentence_bleu([all_outputs[idxes[0]]], all_outputs[idxes[1]],
                                        weights=(0.25, 0.25, 0.25, 0.25),
                                        smoothing_function=smoothing.method2)

            # With no_zeroes set, skip very small (essentially zero) scores, so the CSV
            # may end up with fewer than num_samples rows.
            if no_zeroes and (input_bleu < 1e-5 or output_bleu < 1e-5):
                continue

            output_csv_file.write(f"{input_bleu},{output_bleu}\n")
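The sampling loop above calls `np.random.choice(..., replace=False)` once per sample. On large files this call may account for a noticeable share of the runtime, since, as far as we can tell, the legacy NumPy implementation shuffles the whole index range just to pick two values. A possible alternative is sketched below with the standard library. `sample_distinct_pair` is a hypothetical helper, not part of the original notebook, and we have not measured the speed-up on these datasets.

```python
import random


def sample_distinct_pair(population_size, rng=random):
    """Return two different indices drawn uniformly from range(population_size)."""
    # random.sample on a range object does not materialize the whole range.
    first, second = rng.sample(range(population_size), 2)
    return first, second


# Usage with a seeded generator, so the drawn pairs are repeatable.
rng = random.Random(0)
pairs = [sample_distinct_pair(1_000_000, rng) for _ in range(5)]
```

Seeding the generator also makes a run repeatable, which the unseeded `np.random.choice` call is not.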

In [3]:
args = {
    "input_file": "../../data/plot-data/bivariate/CodeNN/code.txt",
    "output_file": "../../data/plot-data/bivariate/CodeNN/comments.txt",
    "csv_file": "codenn.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)

100%|███████████████████████████████████| 10000/10000 [00:07<00:00, 1317.15it/s]


Now that we've collected the data, we need to run the following commands in R (not done in this notebook). Open RStudio and enter these commands.

**First install ggplot2 and hexbin for visualization.**

install.packages("ggplot2")
library("ggplot2")

install.packages("hexbin")
library("hexbin")

**Read the BLEU data and plot.**

data <- read.csv("/Path/to/bivariate/data")

d <- ggplot(data, aes(input_bleu, output_bleu))

d + geom_hex(bins = 50) # number of bins can be varied for visualization purposes
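If you would rather stay in Python, a rough equivalent of the hexbin plot can be sketched with matplotlib. This is not the code used for the paper's figure: `plot_hexbin` is a hypothetical helper, and matplotlib's `gridsize` is only an approximate counterpart of `bins` in `geom_hex`, so the result will not match the R output exactly.

```python
import csv

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt


def plot_hexbin(csv_path, png_path, bins=50):
    """Save a hexagonal-binned plot of input_bleu vs. output_bleu as a PNG."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    x = [float(row["input_bleu"]) for row in rows]
    y = [float(row["output_bleu"]) for row in rows]

    fig, ax = plt.subplots()
    hb = ax.hexbin(x, y, gridsize=bins)  # number of hexagons along the x axis
    ax.set_xlabel("input_bleu")
    ax.set_ylabel("output_bleu")
    fig.colorbar(hb, ax=ax, label="count")
    fig.savefig(png_path)
    plt.close(fig)
    return len(x)  # number of plotted points
```

For example, `plot_hexbin("codenn.csv", "codenn.png")` would plot the file written by `collect_data` above.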

**If you want to plot multiple plots simultaneously (as seen in Figure 4 of the paper)**:

First go here: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
Copy the function grid_arrange_shared_legend into RStudio. Then run the following additional commands:

**Install some more packages**

install.packages("gridExtra")
library(gridExtra)

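# ggplotGrob() is a function from ggplot2 (loaded above), not a separate package.
library(grid) # ships with R; provides the grid drawing functions used by grid_arrange_shared_legend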

**Read .csv files for each dataset**

data_1 <- read.csv("/Path/to/bivariate/data_1")

... (for all data files you want to plot)

data_n <- read.csv("/Path/to/bivariate/data_n")

**Create the data holder**

Tip: you can change the axis limits and the number of bins for visualization purposes. Here, `title` stands for the plot title string (for example, the dataset name).

d_1 <- ggplot(data_1, aes(input_bleu, output_bleu)) + geom_hex(bins = 50) + ggtitle(title) + scale_x_continuous(limits=c(-1, 30)) + scale_y_continuous(limits=c(-1, 30))
... (for all data files you want to plot)

d_n <- ggplot(data_n, aes(input_bleu, output_bleu)) + geom_hex(bins = 50) + ggtitle(title) + scale_x_continuous(limits=c(-1, 30)) + scale_y_continuous(limits=c(-1, 30))

**Plot all data in a side-by-side plot**

grid_arrange_shared_legend(d_1, ... ,d_n)

To calculate the Spearman coefficients and the associated p-values, run this function with the correct path to the .csv file created above.

In [4]:
import pandas as pd
from scipy.stats import spearmanr

def calculate_spearman(path_to_csv: str) -> None:
	df = pd.read_csv(path_to_csv)
	input_bleu = df["input_bleu"].tolist()
	output_bleu = df["output_bleu"].tolist()
	spearman = spearmanr(input_bleu, output_bleu)
	print(spearman)
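For intuition about what `spearmanr` reports: the Spearman coefficient is the Pearson correlation of the ranks of the two columns, with tied values sharing an average rank. The sketch below spells this out in plain Python. The helper names are ours, it does not compute the p-value, and it is meant for illustration only; use `scipy.stats.spearmanr` for the actual analysis.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks matter, any monotonically increasing relationship gives a coefficient of 1, whether or not it is linear.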

In [5]:
calculate_spearman('codenn.csv')

SpearmanrResult(correlation=0.009602423650352758, pvalue=0.5064190177083514)


In [6]:
args = {
    "input_file": "../../data/plot-data/bivariate/DeepCom1/code.txt",
    "output_file": "../../data/plot-data/bivariate/DeepCom1/comments.txt",
    "csv_file": "DeepCom1.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('DeepCom1.csv')

100%|████████████████████████████████████| 10000/10000 [00:12<00:00, 827.69it/s]


SpearmanrResult(correlation=0.06030788571815664, pvalue=9.481267487435718e-08)


In [7]:
args = {
    "input_file": "../../data/plot-data/bivariate/DeepCom2/code.txt",
    "output_file": "../../data/plot-data/bivariate/DeepCom2/comments.txt",
    "csv_file": "DeepCom2.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('DeepCom2.csv')

100%|████████████████████████████████████| 10000/10000 [00:48<00:00, 204.29it/s]


SpearmanrResult(correlation=0.08390676557271166, pvalue=6.906940852597637e-14)


In [8]:
args = {
    "input_file": "../../data/plot-data/bivariate/Docstring1/code.txt",
    "output_file": "../../data/plot-data/bivariate/Docstring1/comments.txt",
    "csv_file": "Docstring1.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('Docstring1.csv')

100%|████████████████████████████████████| 10000/10000 [00:11<00:00, 898.08it/s]


SpearmanrResult(correlation=0.16859889278609977, pvalue=8.363130249969569e-56)


In [9]:
args = {
    "input_file": "../../data/plot-data/bivariate/Docstring2/code.txt",
    "output_file": "../../data/plot-data/bivariate/Docstring2/comments.txt",
    "csv_file": "Docstring2.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('Docstring2.csv')

100%|████████████████████████████████████| 10000/10000 [00:11<00:00, 905.34it/s]

SpearmanrResult(correlation=0.04913943609085451, pvalue=1.4126299564228146e-06)





In [10]:
args = {
    "input_file": "../../data/plot-data/bivariate/FunCom1/code.txt",
    "output_file": "../../data/plot-data/bivariate/FunCom1/comments.txt",
    "csv_file": "FunCom1.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('FunCom1.csv')

100%|█████████████████████████████████████| 10000/10000 [04:17<00:00, 38.84it/s]


SpearmanrResult(correlation=0.0922174158360177, pvalue=7.001783577909612e-11)


In [11]:
args = {
    "input_file": "../../data/plot-data/bivariate/FunCom2/code.txt",
    "output_file": "../../data/plot-data/bivariate/FunCom2/comments.txt",
    "csv_file": "FunCom2.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
calculate_spearman('FunCom2.csv')

100%|█████████████████████████████████████| 10000/10000 [04:22<00:00, 38.13it/s]


SpearmanrResult(correlation=0.10639482515175742, pvalue=6.85939296807234e-14)
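The cells above differ only in the dataset name, so the `args` dictionaries could be generated instead of copied. The sketch below shows one way to do that. `build_args` and `DATA_ROOT` are hypothetical names, and the helper writes `CodeNN.csv` rather than the lowercase `codenn.csv` used earlier.

```python
DATA_ROOT = "../../data/plot-data/bivariate"  # same relative root as the cells above


def build_args(dataset, num_samples=10000, no_zeroes=True):
    """Build the args dict for one dataset directory (hypothetical helper)."""
    return {
        "input_file": f"{DATA_ROOT}/{dataset}/code.txt",
        "output_file": f"{DATA_ROOT}/{dataset}/comments.txt",
        "csv_file": f"{dataset}.csv",
        "num_samples": num_samples,
        "no_zeroes": no_zeroes,
    }


datasets = ["CodeNN", "DeepCom1", "DeepCom2", "Docstring1", "Docstring2", "FunCom1", "FunCom2"]
all_args = [build_args(name) for name in datasets]

# In the notebook, each entry would then be passed to collect_data, and its
# "csv_file" value to calculate_spearman.
```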
