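Synonym replacement:
1. Rewrite valid.json using the NLTK lexical database (WordNet)
2. Rewrite valid.json using GPT
3. Rewrite test.json using the detection model, then compare the result with the original test.json text; use mean squared error to score clean and dirty samples, and report AUC as the final metric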
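
The scoring in step 3 can be sketched without any model. The following is a minimal illustration, assuming each text yields a class-probability vector before and after rewriting, and that a larger shift points to a dirty sample. The helper names `mse` and `auc` and the toy numbers are made up for this example; they are not taken from the dataset.

```python
def mse(p, q):
    """Mean squared error between two equal-length probability vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def auc(labels, scores):
    """ROC AUC: chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): probabilities before and after rewriting
pairs = [
    ([0.9, 0.1], [0.8, 0.2]),  # clean
    ([0.6, 0.4], [0.5, 0.5]),  # clean
    ([0.9, 0.1], [0.4, 0.6]),  # dirty
    ([0.7, 0.3], [0.1, 0.9]),  # dirty
]
labels = [0, 0, 1, 1]  # 1 = dirty, 0 = clean

scores = [mse(p, q) for p, q in pairs]
print(auc(labels, scores))
```

AUC is computed here from the continuous MSE scores, so no threshold has to be chosen before evaluation.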

## 1. Using NLTK

In [None]:
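import json
import nltk
from nltk.corpus import wordnet

# Download the required NLTK data
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')      # tokenizer models needed by nltk.word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource name instead

# Synonym replacement: swap each word for the first lemma of its first WordNet synset
def synonym_replacement(text):
    words = nltk.word_tokenize(text)
    new_words = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            # Multi-word lemma names use underscores; turn them back into spaces
            synonym = synonyms[0].lemmas()[0].name().replace('_', ' ')
            new_words.append(synonym)
        else:
            new_words.append(word)
    return ' '.join(new_words)

# Load the original validation set
input_file_path = 'dataset/valid.json'
output_file_path = 'dataset/synonym_replacement_valid.json'

with open(input_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Apply synonym replacement to every text
for entry in data:
    entry['synonym_replacement'] = synonym_replacement(entry['text'])

# Save to a new JSON file
with open(output_file_path, 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

# Report completion
print("Synonym replacement finished; results saved to a new JSON file.")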
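
Taking the first lemma of the first synset often returns the input word itself, and function words can map to unrelated senses. One possible refinement is sketched below, assuming the candidate lemma names for a word are already available, for example from `[l.name() for s in wordnet.synsets(word) for l in s.lemmas()]`. The names `pick_synonym`, `replace_synonyms`, `STOPWORDS` and `toy_lemmas` are hypothetical, and the small lookup table stands in for WordNet so the example runs without downloads.

```python
# Tiny illustrative stopword list; a real run would likely use a fuller one
STOPWORDS = {"a", "an", "the", "in", "on", "of", "is", "was", "and", "or", "to"}


def pick_synonym(word, candidates, stopwords=STOPWORDS):
    """Return the first candidate that differs from `word`, else `word` unchanged."""
    if word.lower() in stopwords or not word.isalpha():
        return word  # leave function words, numbers and punctuation alone
    for cand in candidates:
        cand = cand.replace("_", " ")  # WordNet joins multi-word lemmas with "_"
        if cand.lower() != word.lower():
            return cand
    return word


def replace_synonyms(text, lookup):
    """Replace each whitespace-separated word using a {word: [lemma names]} lookup."""
    return " ".join(pick_synonym(w, lookup.get(w.lower(), [])) for w in text.split())


# Hypothetical lookup table standing in for WordNet lemma names
toy_lemmas = {
    "big": ["big", "large"],
    "car": ["car", "auto", "automobile"],
}

print(replace_synonyms("the big car", toy_lemmas))  # the large auto
```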


## 2. Using GPT-2

In [None]:
import json
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load the GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

MAX_CONTEXT = 1024  # GPT-2 accepts at most 1024 tokens
NEW_TOKENS = 50     # room reserved for generated tokens

# Rewriting function.
# Note: generate() returns the prompt followed by up to NEW_TOKENS continuation tokens.
def synonym_replacement_gpt(text, num_return_sequences=1):
    # Truncate the prompt so that prompt + generated tokens fit in the context window
    input_ids = tokenizer.encode(
        text, return_tensors='pt', truncation=True, max_length=MAX_CONTEXT - NEW_TOKENS
    )
    
    # Print debugging information
    print(f"Input text: {text}")
    print(f"Input IDs: {input_ids}")
    print(f"Input length: {len(input_ids[0])}")
    
    max_length = len(input_ids[0]) + NEW_TOKENS
    
    with torch.no_grad():
        outputs = model.generate(
            input_ids, 
            max_length=max_length, 
            num_return_sequences=num_return_sequences, 
            num_beams=5, 
            no_repeat_ngram_size=2, 
            early_stopping=True
        )
    generated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return generated_texts[0]

# Load the validation set
input_file_path = 'dataset/valid.json'
output_file_path = 'dataset/synonym_replacement_valid_gpt.json'

with open(input_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Rewrite every text
for entry in data:
    entry['text'] = synonym_replacement_gpt(entry['text'])

# Save to a new JSON file
with open(output_file_path, 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

# Report completion
print(f"Synonym replacement finished; results saved to a new JSON file: {output_file_path}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Input text: A water company has blamed more people working from home post-pandemic for a new hosepipe ban.South East Water, which supplies more than 2m homes and businesses, will impose the first hosepipe ban of the summer on Monday, affecting households across Kent and Sussex.The company’s chief executive, David Hinton, said that people working from home was a “key factor” behind the ban, as it has “increased drinking water demand”.In a letter to customers, he wrote: “Over the past three years the way in which drinking water is being used across the south-east has changed considerably.“The rise of working from home has increased drinking water demand in commuter towns by around 20% over a very short period, testing our existing infrastructure.”Hinton also blamed low rainfall since April as well as a recent spell of hot weather which he said led to a spike in demand for drinking water.“Our reservoir and aquifer stocks of raw water, essential to our water supply but not ready to be used

Input text: The mighty aurochs have gone, as have the tarpan horses and the wild boars, but modern-day substitutes have been drafted in to recreate a large open “savannah” on heathland in Dorset.Instead of aurochs, considered the wild ancestor of domestic cattle, 200 red Devon cattle are to be found roaming the Purbeck Heaths, while Exmoor ponies are stand-ins for the tarpan horses and curly coated Mangalitsa pigs are doing the sort of rooting around that boars used to excel at here.The idea of the project is to create more of the sort of habitat where precious species such as the sand lizard, southern damselfly and heath tiger beetle can thrive.Two Exmoor ponies at Purbeck Heaths. Photograph: National Trust ImagesIt comes three years after the UK’s first “super national nature reserve” was created in Dorset, knitting together 3,400 hectares (8,400 acres) of priority habitat. Within the super reserve, 1,370 hectares of open “savannah” for free-ranging, grazing animals as it would have 

Input text: Large numbers of fungi have been found living in the twilight zone of the ocean, and could unlock the door to new drugs that may match the power of penicillin.The largest ever study of ocean DNA, published by the journal Frontiers in Science, has revealed intriguing secrets about the abundance of fungi in the part of the ocean that is just beyond the reach of sunlight. At between 200 metres and 1,000 metres below the surface, the twilight zone is home to a variety of organisms and animals, including specially adapted fish such as lantern sharks and kitefin sharks, which have huge eyes and glowing, bioluminescent skin.“Penicillin is an antibiotic that originally came from a fungus called Penicillium so we might find something like that from these ocean fungi,” said Fabio Favoretto, a postdoctoral scholar at the Scripps Institution of Oceanography at the University of California, San Diego. The twilight zone is characterised by high pressure, a lack of light and cold temperat

Input text: Crisis talks are continuing about the future of Thames Water. But what are the options for the country’s largest water and sewerage company?Special administrationThis is a power within the Water Industry Act 1991 to protect essential services for the public if a private company is either on the brink of collapse, or not fulfilling its legal obligations.It arranges to transfer the business as a going concern and, just as administrators do in other financial collapses, it enables them to carry out the functions of the company until that transfer. Crucially it is designed to protect an essential public service first and creditors do not have priority in getting their loans paid off.The company can be eventually transferred to another private company, as in the case of the electricity company Bulb in 2022, when the government subsidised the continued existence of Bulb as a private company and then transferred it to another private company. But it can also be used to transfer a 

Input text: Authorities in eastern Switzerland have ordered residents of the village of Brienz to evacuate by Friday evening because geologists say a mass of 2m cubic metres of Alpine rock looming overhead could break loose and spill down in coming weeks.Local leaders told a town hall and press event on Tuesday that residents would have to leave by 6pm on Friday but could return to the village from time to time starting on Saturday, depending on the risk level, but not stay overnight.Officials said measurements indicated a “strong acceleration over a large area” in recent days, and “up to 2m cubic metres of rock material will collapse or slide in the coming seven to 24 days”.The centuries-old village straddles German- and Romansch-speaking parts of the eastern Graubünden region, sitting south-west of Davos at an altitude of about 1,150 metres (3,800ft). Today it has fewer than 100 residents. Locals said the mountain and the rocks on it had been moving since the last ice age, according 

Input text: Jessica Jones went nearly three weeks without having her rubbish picked up by garbage collectors and the smell was getting unbearable.“They just stopped coming – we would put out our bins on a Sunday night and they wouldn’t be picked up,” she said. “The smell was atrocious.”Tens of thousands of bins across eastern Sydney have been left uncollected for weeks after garbage collectors went on strike as their negotiations for better pay and conditions dragged on.“It was really frustrating, it smelled so bad and there were flies everywhere, it was really gross,” Jones said, adding that her whole street in Waterloo was affected.Can you predict which parts of Sydney will be next to gentrify?Read moreThe 27-year-old, who works in commercial real estate, said the dispute should be resolved as soon as possible. “If they are after more pay, just pay them what they want,” she said.Another Waterloo resident Chris Jespen agreed on the need for urgent action.“The chutes on each level of t

Input text: The Middle Eastern herb za’atar, which is also known as Syrian oregano, or Origanum syriacum, grows across the Levant and has a unique and intoxicating flavour similar to thyme and marjoram, but with a broader, longer leaf. Za’atar is most commonly known, however, as a spice mix that contains the herb, usually combined with sesame seeds, cumin, coriander and sumac, and that has a sour, citrus twang.Like many others, Acme Fire Cult, a barbecue restaurant in east London founded by chefs Daniel Watkins and Andrew Clarke, makes its own za’atar-style spice mix, which is a brilliant way to use up surplus herbs and herb stalks.Za’atar-style spice mixI have never seen the fresh herb za’atar in the UK, but the spice mix of the same name is a super-versatile condiment, seasoning and marinade that can elevate all kinds of dishes. It’s often used to flavour flatbreads or to marinate meat and vegetables – I love it sprinkled over almost any simple meal, from a salad to a roast dinner or

Input text: Victorians face a more than doubling of transmission charges on electricity bills if the state government proceeds with plans for what is likely to be the most costly and longest single power line in Australia’s history, a thinktank says in a new report.The report by the Victoria Energy Policy Centre (VEPC) argues the proposed 500 kilovolt VNI West transmission line linking Melbourne’s outskirts with Wagga Wagga on an 800km path will be far costlier than alternatives and faces extensive landholder opposition. It also will not solve grid bottlenecks holding back new solar and windfarms in the state.The Australian Energy Market Operator (Aemo), which first proposed VNI West as a $2.7bn project in 2018 and is Victoria’s main planner for transmission, estimated users’ transmission charges would need to rise by a quarter. That assessment, though, was based on 2021 prices and ignored interest costs that have since soared.Victoria announces ban on gas connections to new homes from

Input text: Putting more electric trucks on Australian roads would cut transport pollution faster than electric cars could and governments should introduce grants and zero-emission zones to accelerate their adoption, a new report recommends.The study, from the logistics firm Adiona Tech, also found that replacing 10 delivery trucks with electric models would have the same impact as putting 56 electric cars on the road.Labor’s electric vehicle policy drives Australia forward – but not far | Adam MortonRead moreThe findings come as freight and industry transport bodies called on the federal government to develop a dedicated policy to support electric trucks after its national electric vehicle strategy failed to address larger modes of transport.Adiona Tech’s chief executive, Richard Savoie, said electrifying the largest vehicles on Australian roads should be considered “low-hanging fruit” by the government as swapping diesel trucks with electric models would significantly cut pollution.


Input text: Mammals that live in groups generally have longer lifespans than solitary species, new research into nearly 1,000 different animals suggests.Scientists from China and Australia compared 974 mammal species, analysing longevity and how they tended to be socially organised.Classifying mammals into three categories – solitary, pair-living and group-living – the researchers found that animals who lived in groups, such as elephants and zebras, tended to live longer on average than solitary species such as the aardvark and eastern chipmunk.How rehoming wildlife from rhinos to bison can revive threatened speciesRead moreThe correlation held even when the researchers took into account a link between larger species size and longer lifespan.The maximum lifespan of mammals varies from about two years in shrews to more than 200 years in bowhead whales.Northern short-tailed shrews – which are solitary animals – and group-living greater horseshoe bats are similar in weight, for example, b

Input text: Ministers have been told they will be “punished” by voters after analysis revealed the decline of vital flood defences across England.The proportion of critical assets in disrepair has almost trebled in the West Midlands and the east of England since 2018, leaving thousands of homes and businesses more vulnerable to storms.Critical assets are defined as those where there is a high risk to life and property if they fail.The east of England, which spans the Conservative heartlands from Suffolk to Bedfordshire and Essex, has one of the highest proportion of rundown flood defences in England, with nearly one in 11 – more than 850 assets – considered “poor” or “very poor” by Environment Agency inspectors.Chart showing percentage decline in condition of flood defences classed as poor or very poor in English regionsSteve Reed, the shadow environment secretary, said: “The Conservatives’ sticking-plaster approach to flooding has left communities devastated and cost the economy billi
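
The warning in the output above appears because `generate` is called with bare input IDs. A minimal sketch of one way to address it follows, assuming a tokenizer and model loaded as in the cell above. It passes `attention_mask` and `pad_token_id` explicitly, bounds the prompt so that the new tokens fit in the 1024-token window, and returns only the continuation. The helper names `prompt_budget` and `generate_continuation` are hypothetical.

```python
GPT2_CONTEXT = 1024  # GPT-2 context window, as noted in the cell above


def prompt_budget(max_new_tokens=50, context=GPT2_CONTEXT):
    """Largest prompt length (in tokens) that still leaves room for max_new_tokens."""
    if not 0 < max_new_tokens < context:
        raise ValueError("max_new_tokens must be between 1 and context - 1")
    return context - max_new_tokens


def generate_continuation(text, tokenizer, model, max_new_tokens=50):
    """Generate with an explicit attention mask and pad token id; return only the new text."""
    import torch  # imported lazily so prompt_budget works without torch installed

    # Calling the tokenizer (rather than .encode) also returns an attention mask
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=prompt_budget(max_new_tokens))
    with torch.no_grad():
        out = model.generate(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
            max_new_tokens=max_new_tokens,
            num_beams=5,
            no_repeat_ngram_size=2,
            early_stopping=True,
        )
    prompt_len = enc["input_ids"].shape[1]
    # Drop the prompt tokens so only the generated continuation is decoded
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)


print(prompt_budget())  # 974
```

With the objects from the cell above this would be called as `generate_continuation(entry['text'], tokenizer, model)`. Note that GPT-2 extends the prompt rather than paraphrasing it, so the returned string is new text, not a synonym-substituted copy of the input.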

## 3. Using a detection model

In [5]:
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, GPT2Tokenizer, GPT2LMHeadModel
from sklearn.metrics import roc_auc_score
from torch.nn.functional import softmax

# Load the pretrained BERT model.
# Note: bert-base-uncased has no fine-tuned classification head, so the head is
# randomly initialised; swap in a fine-tuned detector checkpoint for meaningful scores.
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
bert_model.eval()  # evaluation mode

# Load the pretrained GPT-2 model used for rewriting
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt_model.eval()

# Rewriting function
def synonym_replacement_gpt(text, num_return_sequences=1):
    # Truncate the prompt so that prompt + 50 new tokens fit in GPT-2's 1024-token window
    input_ids = gpt_tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=1024 - 50)
    with torch.no_grad():
        outputs = gpt_model.generate(
            input_ids, 
            max_length=len(input_ids[0]) + 50, 
            num_return_sequences=num_return_sequences, 
            num_beams=5, 
            no_repeat_ngram_size=2, 
            early_stopping=True
        )
    generated_texts = [gpt_tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return generated_texts[0]

# Get the classifier's class probabilities for a text
def get_model_output(text):
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return softmax(outputs.logits, dim=1)

# Compare the original text with its rewritten version (MSE between probability vectors)
def compare_texts(original_text, rewritten_text):
    original_output = get_model_output(original_text)
    rewritten_output = get_model_output(rewritten_text)
    return torch.nn.functional.mse_loss(original_output, rewritten_output).item()

# Load the data
input_file_path = 'dataset/valid.json'  # replace with the path to your data file
output_file_path = 'dataset/synonym_detection_results.json'

with open(input_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Process the data
for entry in data:
    entry['synonym_replacement'] = synonym_replacement_gpt(entry['text'])
    entry['difference'] = compare_texts(entry['text'], entry['synonym_replacement'])

# The threshold is assumed here; it could be chosen by statistical analysis or other means
threshold = 0.1
for entry in data:
    entry['predicted_label'] = 1 if entry['difference'] > threshold else 0

# Compute AUC as the evaluation metric.
# AUC is computed on the continuous difference scores; thresholded 0/1 labels
# would reduce the ROC curve to a single operating point.
true_labels = [1 if entry['label'] == 'dirty' else 0 for entry in data]
scores = [entry['difference'] for entry in data]
auc_score = roc_auc_score(true_labels, scores)
print(f"AUC Score: {auc_score}")

# Save the results
with open(output_file_path, 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)




OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
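
The error above means the machine could not reach the Hugging Face Hub and had no cached copy of `bert-base-uncased`. One possible workaround is sketched below, assuming the model files have already been copied to a local directory. The path `models/bert-base-uncased` in the usage note and the helper name `load_local_classifier` are hypothetical.

```python
import os

# Ask the Hugging Face libraries not to attempt network access.
# Set these before importing transformers so they are picked up reliably.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_local_classifier(model_dir):
    """Load a tokenizer and sequence classifier from a local directory only."""
    config_path = os.path.join(model_dir, "config.json")
    if not os.path.isfile(config_path):
        # Same condition the OSError complains about, reported early and explicitly
        raise FileNotFoundError(f"no config.json found in {model_dir!r}")

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_dir, local_files_only=True
    )
    return tokenizer, model.eval()
```

A call such as `bert_tokenizer, bert_model = load_local_classifier('models/bert-base-uncased')` would then replace the two `from_pretrained('bert-base-uncased')` lines in the cell above.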