
[BUG] Fix memory leak with Regex C++ operator #2024

Merged — 1 commit merged into pytorch:main on Jan 19, 2023

Conversation

Nayef211
Contributor

Description

  • We noticed that the memory footprint of GPT2BPETokenizer grows over the course of tokenization, but only when special tokens have been added. After further investigation, we found the cause to be a memory leak from the additional Regex object that is created whenever special tokens are added to the tokenizer class.
    const Regex specialTokenRegex("(" + pattern + ")");
  • Looking into the Regex implementation, the constructor heap-allocates an RE2 object (stored as a raw pointer) without ever freeing it. The fix is simply to add a destructor that frees the dynamically allocated memory.
    compiled_pattern_ = new RE2(re_str_);

Test Plan

After the fix, the memory footprint doesn't change with additional forward calls to the tokenizer.

import os

import psutil
from torchtext.transforms import GPT2BPETokenizer

tokenizer = GPT2BPETokenizer(
    encoder_json_path="https://download.pytorch.org/models/text/gpt2_bpe_encoder.json",
    vocab_bpe_path="https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe",
)
additional_special_tokens = ["TOKEN_0"]
tokenizer.add_special_tokens(
    special_tokens_dict={"additional_special_tokens": additional_special_tokens}
)

i = 0
while True:
    tokenizer.forward(["hello", "world", "random", "stuff"])
    if i % 10000 == 0:
        process = psutil.Process(os.getpid())
        memory_in_megabytes = process.memory_info().rss / (1024 * 1024)
        print(f"with special token: {i}, {memory_in_megabytes} MB")  # in MB
    i += 1

[Screenshot: memory usage stays constant across repeated forward calls]

Resolves #2020

@Nayef211 Nayef211 marked this pull request as ready for review January 18, 2023 15:58
@rshraga
Contributor

rshraga commented Jan 18, 2023

looks great! Are the test failures expected / unrelated?

@joecummings
Contributor

> looks great! Are the test failures expected / unrelated?

Yep, we need to increase the memory available to our Integration Tests, see #2018

@joecummings
Contributor

Looks like the commit history may be a little messy -> can you rebase @Nayef211 ?

@Nayef211
Contributor Author

> Looks like the commit history may be a little messy -> can you rebase @Nayef211 ?

Yeah, I have a couple of commits that were made to the main branch of my forked repo rather than the original repo. Since we squash and merge changes, it shouldn't affect the commit history of the main branch.


@joecummings joecummings left a comment


lgtm

@Nayef211 Nayef211 merged commit 569d48d into pytorch:main Jan 19, 2023
@Nayef211 Nayef211 deleted the hotfix/regex_memory_leak branch January 19, 2023 02:27