This notebook helps reproduce Figure 4 of the paper.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 

import numpy as np

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

from tqdm import tqdm

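Here, we collect data from a pair of line-aligned files: an input file (code) and an output file (comments). For each sample, we pick 2 random lines and record the BLEU score between the two inputs (code lines) and the BLEU score between the two outputs (comment lines). These pairs of scores are what we plot.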

In [None]:
def collect_data(args):
	num_samples = 10000 if args["num_samples"] is None else int(args["num_samples"])
	no_zeroes = args["no_zeroes"]

	output_csv_file = open(args["csv_file"], "w")
	output_csv_file.write("input_bleu,output_bleu\n")

	file_inputs = open(args["input_file"], "r")
	file_outputs = open(args["output_file"], "r")

	all_inputs = []
	all_outputs = []

	smoothing = SmoothingFunction()

	"""
	Here, we take each pair of input-output lines, split them by space, and append them to lists
	all_inputs and all_outputs. For funcom, we make sure to remove the <s> and </s> tokens from the
	outputs (comments) to prevent artificially increasing BLEU scores.
	"""
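	for (input_line, output_line) in zip(file_inputs, file_outputs):
		# rstrip("\n") keeps the trailing newline from sticking to the last token
		input_tokens = input_line.rstrip("\n").split(" ")
		# drop the <s> and </s> markers described above (a no-op if the file has none)
		output_tokens = [tok for tok in output_line.rstrip("\n").split(" ") if tok not in ("<s>", "</s>")]
		all_inputs.append(input_tokens)
		all_outputs.append(output_tokens)

	file_inputs.close()
	file_outputs.close()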

	"""
	Here, we iterate through the number of desired samples (10k in our setting). We randomly select
	2 input-output pairs without replacement to prevent (although unlikely) the possibility of
	picking the exact same pairs. We calculate the BLEU using BLEU-4 with the appropriate smoothing
	function.
	"""
	for _ in tqdm(range(num_samples)):
		idxes = np.random.choice(len(all_inputs), 2, replace=False)

		input_bleu = sentence_bleu([all_inputs[idxes[0]]], all_inputs[idxes[1]], 
									weights=(0.25,0.25,0.25,0.25), 
									smoothing_function=smoothing.method2)
		output_bleu = sentence_bleu([all_outputs[idxes[0]]], all_outputs[idxes[1]], 
									weights=(0.25,0.25,0.25,0.25), 
									smoothing_function=smoothing.method2)
		# skip the sample if we want to eliminate very small (essentially zero) values
		if no_zeroes and (input_bleu < 1e-5 or output_bleu < 1e-5):
			continue

		output_csv_file.write(f"{input_bleu},{output_bleu}\n")

	# close the file so that all rows are flushed to disk
	output_csv_file.close()

In [None]:
args = {
    "input_file": "plot-data/bivariate/NL/code.txt",
    "output_file": "plot-data/bivariate/NL/comments.txt",
    "csv_file": "nl.csv",
    "num_samples": 10000,
    "no_zeroes": True
}
collect_data(args)
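
For intuition about what each CSV row contains, here is a minimal, standard-library sketch of a smoothed BLEU-4 score between two token lists. It is not NLTK's implementation: `ngram_counts` and `simple_bleu4` are hypothetical helpers, and the add-one smoothing on the 2- to 4-gram precisions is only an assumed approximation of `SmoothingFunction().method2`, so the numbers need not match `sentence_bleu` exactly.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu4(reference, hypothesis):
    """Illustrative smoothed BLEU-4 (hypothetical helper, not NLTK's code)."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if n == 1:
            if matches == 0:
                return 0.0  # no shared unigram: the score is zero
        else:
            # Assumed add-one smoothing for the higher-order n-grams.
            matches, total = matches + 1, total + 1
        # Equal weights of 0.25, as in weights=(0.25, 0.25, 0.25, 0.25).
        log_precision_sum += 0.25 * math.log(matches / total)
    # The brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum)


# Made-up token lists, purely for illustration.
print(simple_bleu4("return the user name".split(), "return the file name".split()))
```

Each row written by `collect_data` holds two such scores for the same pair of line indices: one computed on the code lines and one on the comment lines.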

Now that we've collected the data, we need to run the following commands in R (not done in this notebook), so fire up RStudio and enter them.

**First install ggplot2 and hexbin for visualization.**

install.packages("ggplot2")
library("ggplot2")

install.packages("hexbin")
library("hexbin")

**Read the BLEU data and plot.**

data = read.csv("/Path/to/bivariate/data")

d <- ggplot(data, aes(input_bleu, output_bleu))

d + geom_hex(bins = 50) # number of bins can be varied for visualization purposes
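
If you would rather stay in Python, a roughly comparable plot can be sketched with matplotlib's `hexbin`. This is an alternative offered under assumptions, not the procedure used for the paper's figure: `plot_bleu_hexbin` is a hypothetical helper, and `gridsize=50` is only a loose analogue of `bins = 50` in `geom_hex`, so the binning and colours will differ from the R output.

```python
import csv

import matplotlib.pyplot as plt


def plot_bleu_hexbin(path_to_csv, gridsize=50):
    """Draw a hexagonal-bin density plot of input vs. output BLEU (hypothetical helper)."""
    input_bleu, output_bleu = [], []
    with open(path_to_csv, newline="") as csv_file:
        # Column names match the header written by collect_data.
        for row in csv.DictReader(csv_file):
            input_bleu.append(float(row["input_bleu"]))
            output_bleu.append(float(row["output_bleu"]))

    fig, ax = plt.subplots()
    # mincnt=1 leaves empty hexagons blank instead of colouring them as zero.
    hexes = ax.hexbin(input_bleu, output_bleu, gridsize=gridsize, mincnt=1)
    fig.colorbar(hexes, ax=ax, label="count")
    ax.set_xlabel("input_bleu")
    ax.set_ylabel("output_bleu")
    return fig, ax
```

Calling `plot_bleu_hexbin("nl.csv")` in a notebook cell should render the figure for the CSV produced above.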

**If you want to plot multiple plots simultaneously (as seen in Figure 4 of the paper):**

First go here: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
Copy the function grid_arrange_shared_legend into RStudio. Then run the following additional commands:

**Install some more packages**

install.packages("gridExtra")
library(gridExtra)

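ggplotGrob is a function provided by ggplot2 (loaded above), not a separate package, so it needs no installation of its own. grid_arrange_shared_legend also relies on the grid package, which ships with R:

library(grid)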

**Read .csv files for each dataset**

data_1 = read.csv("/Path/to/bivariate/data_1")

... (for all data files you want to plot)

data_n = read.csv("/Path/to/bivariate/data_n")

**Create the data holder**

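Tip: you can change the axis limits and the number of bins for visualization purposes. Note that sentence_bleu returns scores between 0 and 1, so pick limits that match the scale of the values in your .csv files.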

d_1 <- ggplot(data_1, aes(input_bleu, output_bleu)) + geom_hex(bins = 50) + ggtitle(title) + scale_x_continuous(limits=c(-1, 30)) + scale_y_continuous(limits=c(-1, 30))

... (for all data files you want to plot)

d_n <- ggplot(data_n, aes(input_bleu, output_bleu)) + geom_hex(bins = 50) + ggtitle(title) + scale_x_continuous(limits=c(-1, 30)) + scale_y_continuous(limits=c(-1, 30))

**Plot all data in a side-by-side plot**

grid_arrange_shared_legend(d_1, ... ,d_n)

To calculate the Spearman coefficients and the associated p-values, run this function with the correct path to the .csv file created above.

In [1]:
import pandas as pd
from scipy.stats import spearmanr

def calculate_spearman(path_to_csv: str) -> None:
	df = pd.read_csv(path_to_csv)
	input_bleu = df["input_bleu"].tolist()
	output_bleu = df["output_bleu"].tolist()
	spearman = spearmanr(input_bleu, output_bleu)
	print(spearman)
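
For reference, Spearman's coefficient is the Pearson correlation of the ranks of the two columns, with tied values sharing their average rank. The sketch below spells this out using only the standard library. `average_ranks` and `spearman_rho` are hypothetical helpers for illustration; they do not compute a p-value, and `spearmanr` remains the function to use in practice.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed data (hypothetical helper).

    Undefined (division by zero) if either column is constant.
    """
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rank_x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in rank_y))
    return covariance / (spread_x * spread_y)
```

Because only the ranks matter, any monotonically increasing relationship between `input_bleu` and `output_bleu` gives a coefficient of 1, however nonlinear it is.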