In [17]:
# Input English text
text = """[5] An experimental therapy uses a person’s weak immune system to fight deadly blood cancers. 
Stanley Riddell is a researcher at the Fred Hutchinson Cancer Research Center in the U.S. state of Washington. 
“So, that’s the one interesting thing about this. It doesn’t require repeated treatments or repetitive cycles of 
chemotherapy,” said Riddell. “That’s what I think in the future may be most important for patients—that it’s a 
single treatment instead of many months of treatment.” It has shown great promise in small trials with patients. [6] 
In one study of 35 patients with a type of leukemia, 94 percent experienced a complete remission. 50 percent 
to 80 percent of patients with other blood cancers also saw a reduction in symptoms. Riddell said, “This is 
encouraging because these are all patients who have failed all conventional therapies, including many kinds of 
bone marrow and stem cell transplants.” Immune system cells usually fight invading viruses and bacteria. They can 
also combat cancer. But they are soon overwhelmed by the disease. [7] The work by Hutchinson researchers 
increases this natural cancer-fighting ability. Using the immune system has also shown promise against skin 
cancer and some lung cancers.
"""


# Fraction of words to blank out
blank_rate = 0.5

# TODO: split punctuation marks off as separate tokens
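
One way the TODO above could be approached is sketched here. It assumes that "punctuation" means any run of non-word characters at the start or end of a whitespace-delimited token. The helper names `split_punctuation` and `blank_token` are hypothetical and do not appear elsewhere in this notebook.

```python
import re

# Hypothetical helpers, not part of the original notebook.
# Leading non-word characters, a lazily matched core, trailing non-word characters.
_TOKEN_RE = re.compile(r"^(\W*)(.*?)(\W*)$", re.DOTALL)

def split_punctuation(token):
    """Return (leading punctuation, word core, trailing punctuation)."""
    lead, core, trail = _TOKEN_RE.match(token).groups()
    return lead, core, trail

def blank_token(token):
    """Replace only the word core with underscores and keep the punctuation."""
    lead, core, trail = split_punctuation(token)
    return lead + "_" * len(core) + trail

print(split_punctuation("[5]"))
print(blank_token("chemotherapy,”"))
```

Punctuation inside a token, such as the apostrophe in `person’s`, stays in the core under this assumption, so the whole token is still blanked as one unit.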

In [18]:
# Imports
import pandas
import random
import re


import nltk
from nltk.stem import WordNetLemmatizer

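# Download the WordNet corpus (if not already downloaded)
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Parts of speech to try when lemmatizing
pos_list = ['v','n','a','r'] # verb, noun, adjective, adverb ('r' = adverb)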

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\song\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
# # Load the word list
# df = pandas.read_excel("20000_words调序整洁版.xlsx", header=None)
# coca_words = set(df[df[0]<10000][1].to_list())
# len(coca_words)

8672

In [30]:
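# Load the word list (one word per line); strip newlines and store as a set
# so that the membership test against lemmas works
with open("word_dic.txt") as fr:
    coca_words = {line.strip() for line in fr if line.strip()}
len(coca_words)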

10941

In [20]:
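# Get the candidate base forms (lemmas) of a word; returns a set
def get_base_words(word_to_check):
    # Candidate base forms
    maybe_base_words = set()
    # Try each part of speech in turn
    for pos in pos_list:
        # Reduce the word to its base form under this part of speech
        base_form = lemmatizer.lemmatize(word_to_check, pos=pos)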
        maybe_base_words.add(base_form)
    return maybe_base_words

In [21]:
# Print the original text
print(f"{text}")

[5] An experimental therapy uses a person’s weak immune system to fight deadly blood cancers. 
Stanley Riddell is a researcher at the Fred Hutchinson Cancer Research Center in the U.S. state of Washington. 
“So, that’s the one interesting thing about this. It doesn’t require repeated treatments or repetitive cycles of 
chemotherapy,” said Riddell. “That’s what I think in the future may be most important for patients—that it’s a 
single treatment instead of many months of treatment.” It has shown great promise in small trials with patients. [6] 
In one study of 35 patients with a type of leukemia, 94 percent experienced a complete remission. 50 percent 
to 80 percent of patients with other blood cancers also saw a reduction in symptoms. Riddell said, “This is 
encouraging because these are all patients who have failed all conventional therapies, including many kinds of 
bone marrow and stem cell transplants.” Immune system cells usually fight invading viruses and bacteria. They can 
als

In [22]:
# Split the text into whitespace-delimited words
words = text.split()
origin_words = words.copy()

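# Number of words to sample for blanking: blank_rate of all words (half here),
# capped at the number of words
num_words_to_replace = max(0, min(len(words), int(len(words) * blank_rate)))
# print(f"Number of blanks: {num_words_to_replace}")

# Randomly pick the indices of the words to replace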
words_to_replace_indices = random.sample(range(len(words)), num_words_to_replace)

# Replace each selected word with underscores if it is numeric or in the word list
for index in words_to_replace_indices:
    word = re.sub(r'[^\w\s]', '', words[index])
    word = word.lower()
    base_words = get_base_words(word)
    if word.isnumeric() or base_words.intersection(coca_words):
        words[index] = "_"*len(words[index])
        origin_words[index] = f"<b>{origin_words[index]}</b>"

# Rebuild the text
filled_text = ' '.join(words)
origin_text = ' '.join(origin_words)

# Print the fill-in-the-blank text
# print(f"\nBlanked text:\n{filled_text}")
print(filled_text)


___ An experimental therapy ____ _ ________ ____ immune ______ to _____ ______ blood cancers. Stanley Riddell __ _ __________ at ___ Fred Hutchinson Cancer Research ______ __ ___ U.S. state __ Washington. “So, that’s ___ one ___________ thing _____ this. It doesn’t require ________ treatments __ repetitive cycles of ______________ said Riddell. “That’s ____ I _____ __ the future may be ____ important for patients—that ____ _ ______ _________ _______ of ____ months __ ___________ It has _____ _____ _______ in _____ trials with patients. ___ In one study __ __ ________ ____ a ____ __ leukemia, 94 percent ___________ a complete remission. __ percent __ 80 percent __ patients ____ other blood cancers ____ saw _ reduction in symptoms. Riddell _____ “This is ___________ _______ _____ are all patients who ____ failed all conventional __________ _________ ____ _____ __ bone ______ ___ ____ cell transplants.” Immune system cells _______ fight invading viruses ___ bacteria. ____ ___ ____ ______ 
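
The loop above samples indices first and then skips any sampled word that is neither numeric nor in the word list, so the share of blanks actually produced can fall below `blank_rate`. A possible variant, sketched here under assumptions, filters the eligible indices first and samples only from those. The function `choose_blank_indices`, its `is_eligible` predicate, and the `seed` parameter are hypothetical names introduced for illustration.

```python
import random

# Hypothetical helper, not part of the original notebook.
def choose_blank_indices(words, is_eligible, blank_rate, seed=None):
    """Pick up to int(len(words) * blank_rate) indices among eligible words."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    eligible = [i for i, w in enumerate(words) if is_eligible(w)]
    k = min(len(eligible), int(len(words) * blank_rate))
    return sorted(rng.sample(eligible, k))

# Toy usage: treat every word except "on" as eligible.
demo_words = "the cat sat on the mat".split()
demo_indices = choose_blank_indices(demo_words, lambda w: w != "on", 0.5, seed=0)
print(demo_indices)
```

In the notebook, the predicate could wrap the existing check `word.isnumeric() or get_base_words(word).intersection(coca_words)` applied to the cleaned, lower-cased word.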

In [23]:
# Show the original text with the blanked words in bold
print(origin_text)

<b>[5]</b> An experimental therapy <b>uses</b> <b>a</b> <b>person’s</b> <b>weak</b> immune <b>system</b> to <b>fight</b> <b>deadly</b> blood cancers. Stanley Riddell <b>is</b> <b>a</b> <b>researcher</b> at <b>the</b> Fred Hutchinson Cancer Research <b>Center</b> <b>in</b> <b>the</b> U.S. state <b>of</b> Washington. “So, that’s <b>the</b> one <b>interesting</b> thing <b>about</b> this. It doesn’t require <b>repeated</b> treatments <b>or</b> repetitive cycles of <b>chemotherapy,”</b> said Riddell. “That’s <b>what</b> I <b>think</b> <b>in</b> the future may be <b>most</b> important for patients—that <b>it’s</b> <b>a</b> <b>single</b> <b>treatment</b> <b>instead</b> of <b>many</b> months <b>of</b> <b>treatment.”</b> It has <b>shown</b> <b>great</b> <b>promise</b> in <b>small</b> trials with patients. <b>[6]</b> In one study <b>of</b> <b>35</b> <b>patients</b> <b>with</b> a <b>type</b> <b>of</b> leukemia, 94 percent <b>experienced</b> a complete remission. <b>50</b> percent <b>to</b> 80 per