# #1: Filters implementations

In this section, we discuss the filters most frequently applied in commit message generation (CMG) datasets and provide our implementations.

## First sentence filter

Most works don't actually filter out examples with more than one sentence; instead, they extract the first sentence from each commit message. We are still interested in checking how many examples this processing affects.

There are two options:

* `newline`: some works extracted the first sentence based on the newline character `\n`;
* `punct`: some works extracted the first sentence based on trailing punctuation.

In [4]:
from typing import Literal, Optional
from nltk import sent_tokenize


def is_one_sentence(text: str, mode: Literal["newline", "punct"], nl_char: str = "\n") -> Optional[bool]:
    """Implements single sentence filter.

    Args:
        text: Input string.
        mode: Sentence splitting logic.
        nl_char: Newline marker used to split sentences in `newline` mode.

    Returns:
        True if text only has one sentence, False otherwise.
        None if the tokenizer fails on the input in `punct` mode.

    Notes:
        There are two modes:
        * `newline` simply splits sentences by newline character
        * `punct` relies on PunktSentenceTokenizer from nltk
    """
    if mode == "newline":
        lines = text.split(nl_char)
        return len(lines) == 1
    elif mode == "punct":
        try:
            lines = sent_tokenize(text)
        except IndexError:
            # tokenizer failure: no verdict for this message
            return None
        return len(lines) == 1

    raise ValueError("Unknown configuration")
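
The function above only checks whether a message has a single sentence. The extraction step that most works apply could be sketched as follows. This is a minimal illustration, not the code of any particular paper: `extract_first_sentence` is a hypothetical helper, and its `punct` branch uses a simplified regular expression instead of NLTK's Punkt tokenizer so that the snippet has no dependencies.

```python
import re


def extract_first_sentence(text: str, mode: str = "newline") -> str:
    """Return the first sentence of a commit message (illustrative sketch)."""
    text = text.strip()
    if mode == "newline":
        # Everything before the first newline is treated as the first sentence.
        return text.split("\n", 1)[0].strip()
    if mode == "punct":
        # Simplified stand-in for a sentence tokenizer: cut after the first
        # ".", "!" or "?" that is directly followed by whitespace.
        match = re.search(r"[.!?](?=\s)", text)
        return text[: match.end()] if match else text
    raise ValueError(f"Unknown mode: {mode}")


print(extract_first_sentence("Fix typo in README.md\n\nAlso bump version."))
print(extract_first_sentence("Fix crash on startup. Add regression test.", mode="punct"))
```

Requiring whitespace after the punctuation mark keeps file names such as `README.md` intact, which a naive split on `.` would break.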

## Verb-Direct Object (V-DO) filter

This filter targets the grammatical structure of commit messages: it only keeps commit messages that start with a Verb-Direct Object clause. It was proposed in [Automatically Generating Commit Messages from Diffs using Neural Machine Translation](https://arxiv.org/abs/1708.09492), ASE 2017, and originally implemented via the [CoreNLP](https://github.com/stanfordnlp/CoreNLP) Java package.

### spaCy implementation

We reimplemented the V-DO filter with [spaCy](https://github.com/explosion/spaCy), a popular Python library for Natural Language Processing.

In [2]:
import spacy

spacy_nlp = spacy.load("en_core_web_sm")

In [3]:
def is_verb_direct_object(text: str) -> bool:
    """Implements filter by Verb-Direct Object grammar structure via spaCy package.

    Args:
        text: Input string.

    Returns:
        True if text starts with V-DO grammar structure, False otherwise.

    Notes:
        * Only the first sentence is considered.
        * Since past forms (e.g. fixed) and gerunds (e.g. fixing) are often not tagged as verbs,
          there is an extra preprocessing step: lemmatization of first word if it is a verb.
        * Current implementation supports not only Direct Objects consisting of single noun,
          but also clauses/phrases.
    """
    # lemmatize the first word if it is tagged as a verb on its own
    words = text.split(" ")
    processed_first_word = spacy_nlp(words[0])[0]
    if processed_first_word.pos_ == "VERB":
        text = " ".join([processed_first_word.lemma_] + words[1:])

    doc = spacy_nlp(text)

    # check: the first token is the root verb and its first child is a direct object
    token = doc[0]
    children_deps = [t.dep_ for t in token.children]
    return (
        token.pos_ == "VERB"
        and token.dep_ == "ROOT"
        and len(children_deps) > 0
        and children_deps[0] == "dobj"
    )
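
To make the rule itself easier to see without loading a spaCy model, here is a toy sketch of the same check on hand-written parses. `ToyToken` and `starts_with_vdo` are hypothetical names introduced only for this illustration, and the parses below are written by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""

    text: str
    pos: str  # coarse part-of-speech tag, e.g. "VERB"
    dep: str  # dependency label, e.g. "ROOT" or "dobj"
    children: List["ToyToken"] = field(default_factory=list)


def starts_with_vdo(first: ToyToken) -> bool:
    """Root verb in first position whose first child is a direct object."""
    return (
        first.pos == "VERB"
        and first.dep == "ROOT"
        and len(first.children) > 0
        and first.children[0].dep == "dobj"
    )


# Hand-written parse of "fix bug": the verb is the root, "bug" is its object.
fix_bug = ToyToken("fix", "VERB", "ROOT", [ToyToken("bug", "NOUN", "dobj")])
# Hand-written parse of "typo fix": the first token is a noun modifier.
typo_fix = ToyToken("typo", "NOUN", "compound")

print(starts_with_vdo(fix_bug), starts_with_vdo(typo_fix))
```

Note that, like the implementation above, the sketch only inspects the first child of the verb, so an object that is preceded by another dependent of the verb is not detected.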

#### Sanity checks

First, there are examples where the V-DO filter doesn't work as intended. Consider this:

In [4]:
for text in ["update gif", "update readme", "update file", "update something"]:
    if not is_verb_direct_object(text):
        print(f"`{text}` wasn't marked as V-DO")

`update readme` wasn't marked as V-DO
`update file` wasn't marked as V-DO


For some reason, `update ...` messages are sometimes not marked as V-DO, although the same verb works fine with other nouns.

The next sanity check is running our implementation on Jiang et al.'s V-DO filtered dataset. Specifically, we use its further filtered version from [Neural-machine-translation-based commit message generation: how far are we?](https://dl.acm.org/doi/10.1145/3238147.3238190), ASE 2018.

In [None]:
!wget https://raw.githubusercontent.com/Tbabm/nngen/master/data/cleaned.test.msg

In [6]:
from tqdm import tqdm

non_vdo = []
vdo = []
with open("cleaned.test.msg", "r") as f:
    for line in tqdm(f):
        if not is_verb_direct_object(line.strip()):
            non_vdo.append(line)
        else:
            vdo.append(line)

2521it [00:14, 174.71it/s]


In [7]:
len(vdo), len(non_vdo)

(1688, 833)

Around 2/3 of the examples are marked as V-DO by the current implementation.

Here are some examples that got marked as V-DO:

In [8]:
from pprint import pprint

pprint(vdo[:20])

['Fix snapshot version \n',
 'update chagelog \n',
 'Added joscar JAR . \n',
 'Fix TsExtractor tests \n',
 'bump engine . io - client \n',
 'Remove observer only if it has been registered \n',
 'Fix typo in README . md \n',
 'prepare release checkstyle - 7 . 1 . 1 \n',
 'Updated the version string to the version of T4J \n',
 'update h2o - flow version to 0 . 5 . 0 with . . . ( # 684 ) \n',
 'Updated webchat logo - added oracle logo as a separate image \n',
 'Removed non - needed imports \n',
 'Add missing link to the 2 . 0 migration guide . \n',
 'setting version to 1 . 0 . 133 - SNAPSHOT \n',
 'remove classpath from manifest \n',
 'Put that coffee down . \n',
 'Create Protocol - Overview . \n',
 'Make NOPASS . \n',
 'Fix test data so that it can be compiled \n',
 'Add notifications package index . \n']


Here are some examples that didn't get marked as V-DO:

In [9]:
pprint(non_vdo[:20])

['edit coverage colors icon \n',
 'LRQA - 14419 Add new property to turn on running tests with poshi runner \n',
 'moving tools . jar inside the jre \n',
 'Call close ( ) instead of deactivate ( ) in CursorToBulkCursorAdaptor . '
 'close ( ) \n',
 'missed shift \n',
 'Ignore files generated by the sharpen process \n',
 'Revert " Added Circle CI configuration " \n',
 'LRQA - 17074 Modify java . jdk . type property from x64 to x32 \n',
 'update fonts \n',
 'Ignore time - zone data \n',
 'Updated jar \n',
 'Set min width for add dialog . \n',
 'Fix bug 558 , PImage . save ( ) method not working with get ( ) \n',
 'Ninja - add debug log statement to mv builder scheduled at startup \n',
 'bump up druid version to 0 . 4 . 0 \n',
 'JAL - 39 Add image for propietary license \n',
 'Change default fbo cache size to 0 \n',
 'LPS - 48807 Remove EOL \n',
 'prepare for next development iteration \n',
 'Fixes a crash of the QTKit video CaptureDevice on Snow Leopard reported by '
 'Yana Stamcheva . \n']

Prefixes like `LRQA - 14419` are not covered by the current V-DO implementation, since such messages no longer start with a verb. It is unclear whether these prefixes were removed in the original dataset before filtering.
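
One way to probe this would be to strip such prefixes before running the filter. The sketch below is only an assumption about what that preprocessing could look like: `ISSUE_PREFIX` and `strip_issue_prefix` are hypothetical, the pattern is derived from the few examples printed above, and nothing suggests the original dataset was processed this way.

```python
import re

# Hypothetical pattern: an upper-case project key, a spaced hyphen and digits,
# e.g. "LRQA - 14419 " or "LPS - 48807 ", as they appear in the tokenized data.
ISSUE_PREFIX = re.compile(r"^[A-Z][A-Z0-9]* - \d+\s+")


def strip_issue_prefix(message: str) -> str:
    """Remove a leading issue identifier, if there is one."""
    return ISSUE_PREFIX.sub("", message, count=1)


for msg in [
    "LRQA - 14419 Add new property to turn on running tests with poshi runner",
    "LPS - 48807 Remove EOL",
    "Ninja - add debug log statement to mv builder scheduled at startup",
    "update fonts",
]:
    print(repr(strip_issue_prefix(msg)))
```

Messages such as `Ninja - add debug log ...` are left untouched on purpose, since the part before the hyphen is not an issue key followed by a number.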

And again, there are some false negatives.

### Stanza implementation

We also considered an implementation via [Stanza](https://github.com/stanfordnlp/stanza), the Python analogue of [CoreNLP](https://github.com/stanfordnlp/CoreNLP). It also provides Python wrappers over some of CoreNLP's features.

In [None]:
import stanza


stanza.download("en")
stanza_nlp = stanza.Pipeline(lang="en", processors="tokenize,lemma,pos,depparse")

In [11]:
def stanza_is_verb_direct_object(text: str) -> bool:
    """Implements filter by Verb-Direct Object grammar structure via Stanza package.

    Args:
        text: Input string.

    Returns:
        True if text starts with V-DO grammar structure, False otherwise.

    Notes:
        * Only the first sentence is considered.
        * Since past forms (e.g. fixed) and gerunds (e.g. fixing) are often not tagged as verbs,
          there is an extra preprocessing step: lemmatization of first word if it is a verb.
        * Current implementation supports not only Direct Objects consisting of single noun,
          but also clauses/phrases.
    """
    # lemmatize first word if it's a verb
    first_word = text.split(" ")[0]
    processed_first_word = stanza_nlp(first_word).sentences[0].words[0]
    if processed_first_word.upos == "VERB":
        text = " ".join([processed_first_word.lemma] + text.split(" ")[1:])

    # run stanza pipeline
    processed_text = stanza_nlp(text)
    first_sent = processed_text.sentences[0]

    # check: first word is a verb and there is a word that relates to it as `obj`
    # `dobj` -> `obj`: https://github.com/stanfordnlp/stanza/issues/936
    first_word = first_sent.words[0]
    if first_word.upos == "VERB":
        for word in first_sent.words[1:]:
            if word.deprel == "obj" and word.head == first_word.id:
                return True

    return False

#### Sanity checks

The `Stanza` implementation suffers from the same issue as the `spaCy` one:

In [12]:
for text in ["update gif", "update readme", "update file"]:
    if not stanza_is_verb_direct_object(text):
        print(f"`{text}` wasn't marked as V-DO")

`update gif` wasn't marked as V-DO
`update readme` wasn't marked as V-DO
`update file` wasn't marked as V-DO


Now, the check on Jiang et al.'s dataset.

In [14]:
from tqdm import tqdm

non_vdo = []
vdo = []
with open("cleaned.test.msg", "r") as f:
    for line in tqdm(f):
        if not stanza_is_verb_direct_object(line.strip()):
            non_vdo.append(line)
        else:
            vdo.append(line)

2521it [06:59,  6.00it/s]


In [15]:
len(vdo), len(non_vdo)

(1965, 556)

The `Stanza` implementation marked more sentences as V-DO than the `spaCy` one (1965 vs. 1688 out of 2521). However, it is much slower: about 6 it/s against about 175 it/s. Running it on our ~10M examples would require substantial resources.
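
For a rough sense of scale, the numbers reported by the two runs above can be extrapolated to our dataset. This is back-of-the-envelope arithmetic only: it assumes the single-process throughput measured on the 2521 test messages carries over unchanged, and it uses 10,000,000 as a round stand-in for the dataset size.

```python
total = 2521  # messages in cleaned.test.msg
spacy_vdo, stanza_vdo = 1688, 1965  # messages marked as V-DO
spacy_rate, stanza_rate = 174.71, 6.00  # it/s reported by tqdm above
dataset_size = 10_000_000  # round stand-in for our ~10M examples

spacy_share = spacy_vdo / total
stanza_share = stanza_vdo / total
print(f"spaCy  V-DO share: {spacy_share:.1%}")
print(f"Stanza V-DO share: {stanza_share:.1%}")

spacy_hours = dataset_size / spacy_rate / 3600
stanza_days = dataset_size / stanza_rate / 86400
print(f"spaCy  estimate: {spacy_hours:.1f} hours")
print(f"Stanza estimate: {stanza_days:.1f} days")
```

Under these assumptions, spaCy would need well under a day, while Stanza would need on the order of weeks.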


Here are some examples that got marked as V-DO:

In [16]:
from pprint import pprint


pprint(vdo[:20])

['Fix snapshot version \n',
 'Added joscar JAR . \n',
 'Fix TsExtractor tests \n',
 'moving tools . jar inside the jre \n',
 'Remove observer only if it has been registered \n',
 'Fix typo in README . md \n',
 'prepare release checkstyle - 7 . 1 . 1 \n',
 'Updated the version string to the version of T4J \n',
 'update h2o - flow version to 0 . 5 . 0 with . . . ( # 684 ) \n',
 'missed shift \n',
 'Updated webchat logo - added oracle logo as a separate image \n',
 'Ignore files generated by the sharpen process \n',
 'Removed non - needed imports \n',
 'Add missing link to the 2 . 0 migration guide . \n',
 'setting version to 1 . 0 . 133 - SNAPSHOT \n',
 'Revert " Added Circle CI configuration " \n',
 'remove classpath from manifest \n',
 'Put that coffee down . \n',
 'Create Protocol - Overview . \n',
 'Make NOPASS . \n']


Here are some examples that didn't get marked as V-DO:

In [17]:
pprint(non_vdo[:20])

['update chagelog \n',
 'edit coverage colors icon \n',
 'LRQA - 14419 Add new property to turn on running tests with poshi runner \n',
 'bump engine . io - client \n',
 'Call close ( ) instead of deactivate ( ) in CursorToBulkCursorAdaptor . '
 'close ( ) \n',
 'LRQA - 17074 Modify java . jdk . type property from x64 to x32 \n',
 'update fonts \n',
 'Update the changelog file \n',
 'Prepare 3 . 0 . 0 release \n',
 'Updated jar \n',
 'update support annotations to 23 . 0 . 1 \n',
 'Ninja - add debug log statement to mv builder scheduled at startup \n',
 'updated examples \n',
 'JAL - 39 Add image for propietary license \n',
 'update linux - x86 natives \n',
 'LPS - 48807 Remove EOL \n',
 'prepare for next development iteration \n',
 'updated todo . txt \n',
 'Update SpongeCommon for DestructEntityEvent cause fix . \n',
 'Update build tools \n']


## Filter by length

Many datasets impose strict limits on the maximum number of tokens, both for diffs and for messages. The most frequent filters are:

* Maximum length for messages: up to 30 tokens
* Maximum length for diffs: up to 100 tokens

Our implementation tokenizes on whitespace and punctuation.

In [2]:
from nltk import wordpunct_tokenize


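def is_shorter_than_n_tokens(text: str, n: int) -> bool:
    """Returns True if text has at most n tokens (the bound is inclusive)."""
    num_tokens = len(wordpunct_tokenize(text))
    return num_tokens <= n


def is_longer_than_n_tokens(text: str, n: int) -> bool:
    """Returns True if text has at least n tokens (the bound is inclusive)."""
    num_tokens = len(wordpunct_tokenize(text))
    return num_tokens >= n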
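
To show what counts as a token here, the sketch below reproduces the splitting with the standard library only. It assumes the pattern `\w+|[^\w\s]+`, which is the one documented for NLTK's `WordPunctTokenizer`; `WORDPUNCT` and `count_tokens` are names introduced for this illustration.

```python
import re

# Assumed to match NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def count_tokens(text: str) -> int:
    """Number of word/punctuation tokens in text."""
    return len(WORDPUNCT.findall(text))


message = "Fix typo in README.md (#42)"
tokens = WORDPUNCT.findall(message)
print(tokens)
print(count_tokens(message))
```

Adjacent punctuation characters such as `(#` form a single token, while `README.md` is split into three tokens, so token counts for diffs grow quickly compared to a plain whitespace split.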

# #2: Results on MCMD dataset

In this section, we provide the statistics for all the filters for [MCMD dataset](https://github.com/DeepSoftwareAnalytics/CommitMsgEmpirical) from [On the Evaluation of Commit Message Generation Models: An Experimental Study](https://arxiv.org/abs/2107.05373v3), ICSME 2021.

## Reading MCMD

MCMD contains 5 PLs:
* C#
* C++
* Java
* JavaScript
* Python

Examples for every language are stored in its own folder.

In [1]:
import os

os.chdir("/home/ubuntu/datasets/mcmd")
os.listdir()

['csharp',
 'javascript-vdo_results.txt',
 'cpp',
 'directoryList.md',
 'java',
 'cpp-vdo_results.txt',
 'java-vdo_results.txt',
 'javascript',
 'mcmd.tar.gz',
 'csharp-vdo_results.txt',
 'python-vdo_results.txt',
 'python']

For each language, there are two train/val/test splits available: random and by time. Let's choose random (it shouldn't matter for us anyway).

In [3]:
langs = ["csharp", "cpp", "java", "javascript", "python"]
os.listdir("cpp")

['valid.jsonl',
 'train.jsonl',
 'sort_time_train80_valid10_test10',
 'test.jsonl',
 'sort_random_train80_valid10_test10']

Finally, we have a bunch of .txt files with prefixes `train`/`valid`/`test`:

In [83]:
[fname for fname in os.listdir("cpp/sort_random_train80_valid10_test10") if fname.startswith("train")]

['train.time.txt',
 'train.diff.txt',
 'train.msg.txt',
 'train.sha.txt',
 'train.repo.txt']

Let's convert them to JSON Lines.

In [88]:
import jsonlines
from contextlib import ExitStack
from tqdm import tqdm

split = "sort_random_train80_valid10_test10"
fields = ["diff", "time", "msg", "sha", "repo"]

for lang in langs:
    print(lang)
    for part in ["train", "valid", "test"]:
        with ExitStack() as stack:
            # open the five line-aligned .txt files and the output writer once per part
            files = [stack.enter_context(open(f"{lang}/{split}/{part}.{field}.txt", "r")) for field in fields]
            writer = stack.enter_context(jsonlines.open(f"{lang}/{part}.jsonl", "a"))
            for lines in tqdm(zip(*files)):
                writer.write({field: line.strip() for field, line in zip(fields, lines)})

csharp


360000it [01:14, 4835.34it/s]
45000it [00:08, 5410.84it/s]
45000it [00:11, 3785.42it/s]


cpp


360000it [01:07, 5308.37it/s]
45000it [00:07, 6417.79it/s]
45000it [00:07, 6267.73it/s]


java


360000it [01:13, 4895.36it/s]
45000it [00:07, 6105.76it/s]
45000it [00:07, 6186.92it/s]


javascript


360000it [01:12, 4937.30it/s]
45000it [00:07, 5872.50it/s]
45000it [00:07, 5706.56it/s]


python


360000it [00:51, 6997.75it/s]
45000it [00:06, 7210.69it/s]
45000it [00:05, 8402.58it/s]


## First sentence filter

In [None]:
import jsonlines
from tqdm import tqdm
from collections import defaultdict


first_sentence_stats_newline = defaultdict(list)
first_sentence_stats_punct = defaultdict(list)

for lang in langs:
    print(lang)
    for part in ["train", "valid", "test"]:
        with jsonlines.open(f"{lang}/{part}.jsonl", "r") as reader:
            for line in tqdm(reader):
                first_sentence_stats_newline[lang].append(is_one_sentence(line["msg"], mode="newline"))
                first_sentence_stats_punct[lang].append(is_one_sentence(line["msg"], mode="punct"))



### Newline results


In `newline` mode, every commit message in MCMD counts as a single sentence: none of them contains a newline character.

In [110]:
import pandas as pd


first_sentence_stats_newline_df = pd.DataFrame(first_sentence_stats_newline)
pd.DataFrame(
    {
        col: first_sentence_stats_newline_df[col].value_counts(dropna=False)
        for col in first_sentence_stats_newline_df.columns
    }
)

| Is one sentence | csharp | cpp | java | javascript | python |
|---|---|---|---|---|---|
| True | 450000 | 450000 | 450000 | 450000 | 450000 |


### Punctuation results

If we split sentences via NLTK's `sent_tokenize`, the results are slightly different.

* Some messages make the tokenizer raise an error, so the filter returns `None` for them. These are mostly messages that start with punctuation, e.g. `.gitignore`.
* The majority of the messages still have only one sentence, but a substantial share contains more than one.
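
Because of the `None` case, the filter output is effectively three-valued, and downstream code has to decide how to treat the undecided messages. A small sketch with made-up results (the list below is illustrative, not taken from MCMD):

```python
from collections import Counter
from typing import List, Optional

# Made-up filter outputs: True = one sentence, False = several, None = tokenizer error.
results: List[Optional[bool]] = [True, False, None, True, True]

# Counter keeps None as its own key, so errors are not silently dropped.
counts = Counter(results)
print(counts[True], counts[False], counts[None])

# One possible policy: only an explicit True passes the filter.
passed = [r is True for r in results]
print(passed)
```

Comparing with `is True` instead of relying on truthiness makes the policy explicit: a `None` is treated as a failure rather than being confused with `False` by accident.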

In [109]:
first_sentence_stats_punct_df = pd.DataFrame(first_sentence_stats_punct)

pd.DataFrame(
    {
        col: first_sentence_stats_punct_df[col].value_counts(dropna=False)
        for col in first_sentence_stats_punct_df.columns
    }
)

| Is one sentence | csharp | cpp | java | javascript | python |
|---|---|---|---|---|---|
| True | 332004 | 384304 | 379252 | 361640 | 357084 |
| False | 117844 | 62037 | 70685 | 88222 | 92805 |
| None | 152 | 3659 | 63 | 138 | 111 |
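
To put these counts into perspective, they can be turned into per-language shares. The snippet below simply re-enters the numbers from the output above into a dictionary (the key names `one`, `multi` and `error` are chosen here for readability) and divides by the total per language.

```python
# Counts copied from the output above (NLTK-based `punct` mode on MCMD).
punct_counts = {
    "csharp": {"one": 332004, "multi": 117844, "error": 152},
    "cpp": {"one": 384304, "multi": 62037, "error": 3659},
    "java": {"one": 379252, "multi": 70685, "error": 63},
    "javascript": {"one": 361640, "multi": 88222, "error": 138},
    "python": {"one": 357084, "multi": 92805, "error": 111},
}

multi_share = {}
for lang, counts in punct_counts.items():
    total = sum(counts.values())
    multi_share[lang] = counts["multi"] / total
    error_share = counts["error"] / total
    print(f"{lang:<10} multi-sentence: {multi_share[lang]:.1%}  errors: {error_share:.2%}")
```

By this count, between roughly 14% (C++) and 26% (C#) of MCMD messages would be shortened by punctuation-based first sentence extraction.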


## Verb-Direct Object filter (spaCy)

> Note: calculating V-DO filters takes substantial time, hence, we launched spaCy implementation outside of this notebook and just read the results here.

In [4]:
import jsonlines


vdo_results = {}

for lang in langs:
    print(lang)
    with jsonlines.open(f"{lang}-vdo_results.txt", "r") as reader:
        vdo_results[lang] = [line["is_vdo"] for line in reader]

csharp
cpp
java
javascript
python


In [6]:
import pandas as pd


vdo_results_df = pd.DataFrame(vdo_results)

pd.DataFrame({col: vdo_results_df[col].value_counts(dropna=False) for col in vdo_results_df.columns})

| Is V-DO | csharp | cpp | java | javascript | python |
|---|---|---|---|---|---|
| False | 261411 | 268477 | 287488 | 262456 | 278572 |
| True | 188589 | 181523 | 162512 | 187544 | 171428 |


## Filters by length

In [None]:
import jsonlines
from tqdm import tqdm
from collections import defaultdict


msg_30_tokens_stats = defaultdict(list)
diff_100_tokens_stats = defaultdict(list)

for lang in langs:
    print(lang)
    for part in ["train", "valid", "test"]:
        with jsonlines.open(f"{lang}/{part}.jsonl", "r") as reader:
            for line in tqdm(reader):
                msg_30_tokens_stats[lang].append(is_shorter_than_n_tokens(line["msg"], 30))
                diff_100_tokens_stats[lang].append(is_shorter_than_n_tokens(line["diff"], 100))

### Messages $\leq 30$ tokens

In [122]:
msg_30_tokens_stats_df = pd.DataFrame(msg_30_tokens_stats)

pd.DataFrame({col: msg_30_tokens_stats_df[col].value_counts(dropna=False) for col in msg_30_tokens_stats_df.columns})

| ≤ 30 tokens | csharp | cpp | java | javascript | python |
|---|---|---|---|---|---|
| True | 447119 | 446226 | 445136 | 447633 | 448432 |
| False | 2881 | 3774 | 4864 | 2367 | 1568 |


### Diffs $\leq 100$ tokens

In [119]:
diff_100_tokens_stats_df = pd.DataFrame(diff_100_tokens_stats)

pd.DataFrame(
    {col: diff_100_tokens_stats_df[col].value_counts(dropna=False) for col in diff_100_tokens_stats_df.columns}
)

| ≤ 100 tokens | csharp | cpp | java | javascript | python |
|---|---|---|---|---|---|
| False | 435151 | 434381 | 442333 | 433322 | 431063 |
| True | 14849 | 15619 | 7667 | 16678 | 18937 |


# #3: Results on our dataset

In this section, we provide the statistics for all the filters for our dataset.

## Reading our dataset

In [5]:
import os

[fname for fname in os.listdir("extracted_data_jsonl/dataset") if ".jsonl" in fname]

['train.jsonl', 'val.jsonl', 'test.jsonl']

## First sentence filter

In [4]:
import jsonlines
from tqdm import tqdm
from collections import defaultdict

first_sentence_stats_newline = defaultdict(list)
first_sentence_stats_punct = defaultdict(list)

for part in ["train", "val", "test"]:
    with jsonlines.open(f"extracted_data_jsonl/dataset/{part}.jsonl", "r") as reader:
        for line in tqdm(reader):
            first_sentence_stats_newline[line["language"]].append(is_one_sentence(line["message"], mode="newline"))
            first_sentence_stats_punct[line["language"]].append(is_one_sentence(line["message"], mode="punct"))

7659458it [07:18, 17465.31it/s]
1554042it [01:29, 17420.63it/s]
1486267it [01:25, 17382.80it/s]


### Newline

In [5]:
import pandas as pd

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)


d = {}
for key in first_sentence_stats_newline:
    d[key] = pd.Series(first_sentence_stats_newline[key]).value_counts(dropna=False)
pd.DataFrame(d)

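| Is one sentence | TypeScript | PHP | JavaScript | Python | Java | Go | Ruby | C# | C++ | C | Swift | Kotlin | Smalltalk | Shell | Rust | Dart | Groovy | Objective-C | Nix | Elixir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True | 1144948 | 215706 | 1282665 | 1520334 | 1087622 | 721721 | 198778 | 505687 | 916163 | 294733 | 125758 | 160606 | 11297 | 132736 | 277624 | 48344 | 50226 | 32360 | 76954 | 43007 |
| False | 167279 | 31761 | 193434 | 269805 | 269437 | 218977 | 56483 | 70191 | 239961 | 110730 | 12681 | 51472 | 1418 | 26112 | 68368 | 14139 | 13580 | 9159 | 19702 | 7809 |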


### Punctuation

In [6]:
import pandas as pd

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)

d = {}
for key in first_sentence_stats_punct:
    d[key] = pd.Series(first_sentence_stats_punct[key]).value_counts(dropna=True)
pd.DataFrame(d)

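| Is one sentence | TypeScript | PHP | JavaScript | Python | Java | Go | Ruby | C# | C++ | C | Swift | Kotlin | Smalltalk | Shell | Rust | Dart | Groovy | Objective-C | Nix | Elixir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True | 1254224 | 232260 | 1402237 | 1650513 | 1227838 | 853374 | 231586 | 540853 | 1019984 | 350205 | 131502 | 189802 | 11512 | 149606 | 314693 | 54195 | 59358 | 36796 | 90407 | 48097 |
| False | 57997 | 13534 | 73852 | 139623 | 129162 | 87319 | 23674 | 35015 | 136113 | 55256 | 6936 | 22272 | 1190 | 9242 | 31297 | 8288 | 4448 | 4723 | 6249 | 2719 |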


## Verb-Direct Object filter (spaCy)

> Note: calculating V-DO filters takes substantial time, hence, we launched spaCy implementation outside of this notebook and just read the results here.

In [8]:
import jsonlines
from collections import defaultdict
from tqdm import tqdm


vdo_stats = defaultdict(list)

for part in ["train", "val", "test"]:
    with jsonlines.open(f"extracted_data_jsonl/dataset/{part}.jsonl", "r") as reader:
        with jsonlines.open(f"stats/{part}_vdo_results.jsonl", "r") as reader2:
            for (line, line2) in tqdm(zip(reader, reader2)):
                assert line["hash"] == line2["hash"]
                assert line["repo"] == line2["repo"]
                vdo_stats[line["language"]].append(line2["is_vdo"])

7659458it [04:46, 26763.29it/s]
1554042it [00:56, 27287.71it/s]
1486267it [00:55, 26952.20it/s]
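
The loop above relies on both files listing the same commits in the same order. The assertions catch mismatched records, but `zip` stops silently at the end of the shorter input, so a truncated results file would go unnoticed. A possible stricter pairing, sketched with the standard library only (`aligned_pairs` is a hypothetical helper, not part of our pipeline):

```python
from itertools import zip_longest
from typing import Any, Dict, Iterable, Iterator, Sequence, Tuple

_MISSING = object()  # sentinel marking the end of the shorter input


def aligned_pairs(
    left: Iterable[Dict[str, Any]],
    right: Iterable[Dict[str, Any]],
    keys: Sequence[str] = ("hash", "repo"),
) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield record pairs, failing on length or key mismatches."""
    for i, (a, b) in enumerate(zip_longest(left, right, fillvalue=_MISSING)):
        if a is _MISSING or b is _MISSING:
            raise ValueError(f"Inputs differ in length at record {i}")
        for key in keys:
            if a[key] != b[key]:
                raise ValueError(f"Mismatch in '{key}' at record {i}")
        yield a, b


# Made-up records for illustration.
dataset = [{"hash": "h1", "repo": "r"}, {"hash": "h2", "repo": "r"}]
vdo = [{"hash": "h1", "repo": "r", "is_vdo": True}, {"hash": "h2", "repo": "r", "is_vdo": False}]
print([v["is_vdo"] for _, v in aligned_pairs(dataset, vdo)])
```

The helper accepts any iterables of dictionaries, so it could wrap the two `jsonlines` readers directly.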


In [10]:
import pandas as pd

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)

d = {}
for key in vdo_stats:
    d[key] = pd.Series(vdo_stats[key]).value_counts(dropna=True)
pd.DataFrame(d)

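| Is V-DO | TypeScript | PHP | JavaScript | Python | Java | Go | Ruby | C# | C++ | C | Swift | Kotlin | Smalltalk | Shell | Rust | Dart | Groovy | Objective-C | Nix | Elixir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| False | 922809 | 147978 | 957241 | 1063093 | 841861 | 611725 | 146161 | 332002 | 726941 | 295762 | 77681 | 130585 | 7613 | 101955 | 224065 | 40704 | 36392 | 24366 | 93496 | 30437 |
| True | 389418 | 99489 | 518858 | 727046 | 515198 | 328973 | 109100 | 243876 | 429183 | 109701 | 60758 | 81493 | 5102 | 56893 | 121927 | 21779 | 27414 | 17153 | 3160 | 20379 |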


## Filters by length

In [14]:
import jsonlines
from tqdm import tqdm
from collections import defaultdict


msg_30_tokens_stats = defaultdict(list)
diff_100_tokens_stats = defaultdict(list)
diff_200_tokens_stats = defaultdict(list)

for part in ["train", "val", "test"]:
    with jsonlines.open(f"extracted_data_jsonl/dataset/{part}.jsonl", "r") as reader:
        for line in tqdm(reader):
            msg_30_tokens_stats[line["language"]].append(is_shorter_than_n_tokens(line["message"], n=30))
            diff = "\n".join([mod["diff"] for mod in line["mods"]])
            diff_100_tokens_stats[line["language"]].append(is_shorter_than_n_tokens(diff, n=100))
            diff_200_tokens_stats[line["language"]].append(is_shorter_than_n_tokens(diff, n=200))

7659458it [47:00, 2715.41it/s] 
1554042it [09:07, 2837.44it/s]
1486267it [08:56, 2768.68it/s]


### Messages $\leq 30$ tokens

In [15]:
import pandas as pd

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)

d = {}
for key in msg_30_tokens_stats:
    d[key] = pd.Series(msg_30_tokens_stats[key]).value_counts(dropna=True)
pd.DataFrame(d)

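| ≤ 30 tokens | TypeScript | PHP | JavaScript | Python | Java | Go | Ruby | C# | C++ | C | Swift | Kotlin | Smalltalk | Shell | Rust | Dart | Groovy | Objective-C | Nix | Elixir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True | 1260106 | 238345 | 1409254 | 1685767 | 1265843 | 850580 | 232404 | 551919 | 1050542 | 357632 | 134048 | 193499 | 12284 | 148936 | 318359 | 57910 | 59133 | 38200 | 89164 | 48425 |
| False | 52121 | 9122 | 66845 | 104372 | 91216 | 90118 | 22857 | 23959 | 105582 | 47831 | 4391 | 18579 | 431 | 9912 | 27633 | 4573 | 4673 | 3319 | 7492 | 2391 |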


### Diffs $\leq 100$ tokens

In [16]:
import pandas as pd

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)

d = {}
for key in diff_100_tokens_stats:
    d[key] = pd.Series(diff_100_tokens_stats[key]).value_counts(dropna=True)
pd.DataFrame(d)

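| ≤ 100 tokens | TypeScript | PHP | JavaScript | Python | Java | Go | Ruby | C# | C++ | C | Swift | Kotlin | Smalltalk | Shell | Rust | Dart | Groovy | Objective-C | Nix | Elixir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| False | 1104200 | 200322 | 1221080 | 1486622 | 1190731 | 824444 | 203682 | 482360 | 992511 | 330870 | 115939 | 184206 | 10090 | 130438 | 305556 | 54220 | 53700 | 33936 | 59973 | 42831 |
| True | 208027 | 47145 | 255019 | 303517 | 166328 | 116254 | 51579 | 93518 | 163613 | 74593 | 22500 | 27872 | 2625 | 28410 | 40436 | 8263 | 10106 | 7583 | 36683 | 7985 |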


# #4: Saving filters results

We identified the following filters as the most frequent:

* First Sentence
* V-DO
* Message Length $\leq 30$ tokens
* Diff Length $\leq 100$ tokens

We want to run experiments and compare results on filtered and unfiltered subsets. This section provides a utility script that saves the filter results for each commit in our dataset.

In [None]:
from joblib import Parallel, delayed
from typing import Dict, Union
import jsonlines
from tqdm import tqdm


def process_example(example: Dict[str, Union[str, bool]]) -> Dict[str, Union[str, bool, None]]:
    return {
        "hash": example["hash"],
        "repo": example["repo"],
        "is_vdo": example["is_vdo"],
        "first_sentence_newline": is_one_sentence(example["message"], mode="newline"),
        "first_sentence_punct": is_one_sentence(example["message"], mode="punct"),
        "message_30_tokens": is_shorter_than_n_tokens(example["message"], n=30),
        "diff_100_tokens": is_shorter_than_n_tokens(example["diff"], n=100),
    }


chunksize = 4000


for part in ["train", "val", "test"]:
    output_path = f"stats/filters_{part}.jsonl"
    open(output_path, "w").close()

    chunk = []
    with jsonlines.open(f"extracted_data_jsonl/dataset/{part}.jsonl", "r") as reader:
        with jsonlines.open(f"stats/{part}_vdo_results.jsonl", "r") as reader2:
            for (dataset_example, vdo_example) in tqdm(zip(reader, reader2), desc=f"Processing filters for {part}"):
                assert dataset_example["hash"] == vdo_example["hash"]
                assert dataset_example["repo"] == vdo_example["repo"]

                if len(chunk) >= chunksize:
                    with Parallel(16) as pool:
                        processed_chunk = pool(delayed(process_example)(example) for example in chunk)
                    with jsonlines.open(output_path, "a") as writer:
                        writer.write_all(processed_chunk)
                    chunk = []

                chunk.append(
                    {
                        "hash": dataset_example["hash"],
                        "repo": dataset_example["repo"],
                        "is_vdo": vdo_example["is_vdo"],
                        "message": dataset_example["message"],
                        "diff": "\n".join([mod["diff"] for mod in dataset_example["mods"]]),
                    }
                )

            if len(chunk) > 0:
                with Parallel(16) as pool:
                    processed_chunk = pool(delayed(process_example)(example) for example in chunk)
                with jsonlines.open(output_path, "a") as writer:
                    writer.write_all(processed_chunk)
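
Once the filter results are saved, selecting the filtered subset for an experiment amounts to keeping the commits for which every flag is `True`. The sketch below works under that assumption; `passes_all_filters` and `select_filtered_ids` are hypothetical helpers, the field names follow `process_example` above, and the sample records are made up.

```python
import json
from typing import Any, Dict, Iterable, Set, Tuple

FILTER_KEYS = (
    "is_vdo",
    "first_sentence_newline",
    "first_sentence_punct",
    "message_30_tokens",
    "diff_100_tokens",
)


def passes_all_filters(record: Dict[str, Any]) -> bool:
    """True only if every filter flag is exactly True (None counts as a failure)."""
    return all(record.get(key) is True for key in FILTER_KEYS)


def select_filtered_ids(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (repo, hash) pairs of commits that pass all filters."""
    selected = set()
    for line in lines:
        record = json.loads(line)
        if passes_all_filters(record):
            selected.add((record["repo"], record["hash"]))
    return selected


# Made-up records in the format written by the script above.
base = {key: True for key in FILTER_KEYS}
sample = [
    json.dumps({"hash": "a1", "repo": "org/repo", **base}),
    json.dumps({"hash": "b2", "repo": "org/repo", **base, "is_vdo": False}),
    json.dumps({"hash": "c3", "repo": "org/repo", **base, "first_sentence_punct": None}),
]
print(select_filtered_ids(sample))
```

Since an open text file iterates over its lines, the same function could be called on a file object for `stats/filters_train.jsonl`. Dropping a key from `FILTER_KEYS` gives the subset for any combination of filters.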