# 原始模型的评估

### 一.问题回答的准确性

### 准备部分
导入相关的库，定义模型和分词器并初始化，最后定义获取大模型回答的方法

In [3]:
!pip install rouge

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting rouge
  Downloading https://mirrors.aliyun.com/pypi/packages/32/7c/650ae86f92460e9e8ef969cc5008b24798dcf56a9a8947d04c78f550b3f5/rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
[0mSuccessfully installed rouge-1.0.1


In [1]:
import pandas as pd
from rouge import Rouge
import torch
from tqdm import tqdm
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

#定义模型和分词器，此处是原始模型
print("加载模型中...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "shared-nvme/llm_models/Qwen2.5-7B-Instruct/",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# 初始化模型用于推理
model = FastLanguageModel.for_inference(model)
print("模型加载并初始化完成！")

#生成回答
def generate_answer(model, tokenizer, instruction):

    messages = [
        {"role": "user", "content": instruction}
    ]
    
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            input_ids, 
            max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id
        )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    start_index = generated_text.rfind('Response:')+len('Response:')
    generated_text = generated_text[start_index:]
    return generated_text


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Unsloth: Your Flash Attention 2 installation seems to be broken?
A possible explanation is you have a new CUDA version which isn't
yet compatible with FA2? Please file a ticket to Unsloth or FA2.
We shall now use Xformers instead, which does not have any performance hits!
We found this negligible impact by benchmarking on 1x A100.
🦥 Unsloth Zoo will now patch everything to make training faster!
加载模型中...
==((====))==  Unsloth 2024.11.11: Fast Qwen2 patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.684 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.64s/it]


模型加载并初始化完成！


#### 1.1 ROUGE  
主要通过计算 n-gram 的重叠来评估文本的质量。  
更关注召回率，适合评估生成文本的覆盖率。

In [13]:
# 评估模型并保存详细结果
def evaluate_model(model, tokenizer, test_data, num_samples=None):
    rouge = Rouge()
    
    # 创建结果列表
    results = []
    
    # 如果需要抽样
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 对每个样本进行评估
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data)):
        instruction = row['instruction']
        reference = row['output']
        
        # 生成回答
        generated = generate_answer(model, tokenizer, instruction)
        
        try:
            # 计算ROUGE分数
            scores = rouge.get_scores(generated, reference)[0]
            
            # 保存该样本的所有信息
            result = {
                'instruction': instruction,
                'reference': reference,
                'generated': generated,
                'rouge-1-p': scores['rouge-1']['p'],
                'rouge-1-r': scores['rouge-1']['r'],
                'rouge-1-f': scores['rouge-1']['f'],
                'rouge-2-p': scores['rouge-2']['p'],
                'rouge-2-r': scores['rouge-2']['r'],
                'rouge-2-f': scores['rouge-2']['f'],
                'rouge-l-p': scores['rouge-l']['p'],
                'rouge-l-r': scores['rouge-l']['r'],
                'rouge-l-f': scores['rouge-l']['f']
            }
            results.append(result)
            
        except Exception as e:
            print(f"评估出错 (行 {idx}): {e}")
            print(f"生成文本: {generated}")
            print(f"参考文本: {reference}")
            continue
    
    # 转换为DataFrame
    results_df = pd.DataFrame(results)
    
    # 计算平均分数
    avg_scores = {
        'rouge-1-p': results_df['rouge-1-p'].mean(),
        'rouge-1-r': results_df['rouge-1-r'].mean(),
        'rouge-1-f': results_df['rouge-1-f'].mean(),
        'rouge-2-p': results_df['rouge-2-p'].mean(),
        'rouge-2-r': results_df['rouge-2-r'].mean(),
        'rouge-2-f': results_df['rouge-2-f'].mean(),
        'rouge-l-p': results_df['rouge-l-p'].mean(),
        'rouge-l-r': results_df['rouge-l-r'].mean(),
        'rouge-l-f': results_df['rouge-l-f'].mean()
    }
    
    return avg_scores, results_df

# 主程序
if __name__ == "__main__":
    # 1. 加载数据
    test_data = pd.read_csv('shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv')
    
    # 2. 评估模型
    avg_scores, results_df = evaluate_model(
        model,
        tokenizer,
        test_data,
        num_samples=100  # 可选：设置样本数量
    )
    
    # 3 保存详细结果
    # 3.1 保存每行结果
    results_df.to_csv('evaluation/origin_model/origin_detailed_rouge_scores.csv', index=False)
    # 3.2 保存平均分数
    avg_scores_df = pd.DataFrame([avg_scores])
    avg_scores_df.to_csv('evaluation/origin_model/origin_average_rouge_scores.csv', index=False)
    
    # 4. 打印平均分数
    print("\n平均ROUGE分数:")
    for metric, score in avg_scores.items():
        print(f"{metric}: {score:.4f}")
    
    # 5. 打印部分示例结果
    print("\n部分示例结果:")
    print(results_df[['instruction', 'generated', 'rouge-1-f', 'rouge-2-f', 'rouge-l-f']].head())

100%|██████████| 100/100 [06:00<00:00,  3.61s/it]


平均ROUGE分数:
rouge-1-p: 0.0939
rouge-1-r: 0.0927
rouge-1-f: 0.0915
rouge-2-p: 0.0080
rouge-2-r: 0.0075
rouge-2-f: 0.0074
rouge-l-p: 0.0863
rouge-l-r: 0.0853
rouge-l-f: 0.0841

部分示例结果:
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4                                   额叶胶质瘤术的辅助治疗有些什么？   

                                           generated  rouge-1-f  rouge-2-f  \
0  ou are Qwen, created by Alibaba Cloud. You are...   0.193548   0.010309   
1  ou are Qwen, created by Alibaba Cloud. You are...   0.176471   0.009132   
2  ou are Qwen, created by Alibaba Cloud. You are...   0.189189   0.011236   
3  ou are Qwen, created by Alibaba Cloud. You are...   0.233333   0.027586   
4  ou are Qwen, created by Alibaba Cloud. You are...   0.000000   0.000000   

   rouge-l-




#### 1.2 BLEU  
通过计算生成文本和参考文本之间的 n-gram 精确匹配来评估文本质量。  
使用几何平均结合不同长度的 n-gram 匹配，并包含惩罚因子（brevity penalty）。  
更关注精确匹配，适合评估翻译的准确性。常用于机器翻译任务。

In [1]:
!pip install sacrebleu

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/


In [4]:
import sacrebleu

#计算 BLEU 分数
def evaluate_bleu(model, tokenizer, test_data, num_samples=None):
   
    results = []
    bleu_scores = []

    # 如果需要抽样测试集
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 遍历测试集，生成答案并计算 BLEU
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data)):
        instruction = row['instruction']
        reference = row['output']
        
        # 生成回答
        generated = generate_answer(model, tokenizer, instruction)
        
        try:
            # 计算 BLEU 分数
            bleu_score = sacrebleu.sentence_bleu(generated, [reference]).score
            bleu_scores.append(bleu_score)
            
            # 保存结果
            results.append({
                'instruction': instruction,
                'reference': reference,
                'generated': generated,
                'bleu': bleu_score
            })
        except Exception as e:
            print(f"行 {idx} 出错: {e}")
            continue

    # 转换为 DataFrame
    results_df = pd.DataFrame(results)

    # 计算平均 BLEU 分数
    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0

    return avg_bleu, results_df

# 加载测试数据
test_data = pd.read_csv('shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv')

# 计算 BLEU 分数
avg_bleu, results_df = evaluate_bleu(
    model,
    tokenizer,
    test_data,
    num_samples=100  # 可选：限制样本数量
)

# 保存详细分数结果
results_df.to_csv('evaluation/finetuned_model/origin_detailed_bleu_scores.csv', index=False)

# 保存平均分数结果
avg_bleu_df = pd.DataFrame([{"average_bleu": avg_bleu}])
avg_bleu_df.to_csv('evaluation/finetuned_model/origin_average_bleu_score.csv', index=False)


# 打印部分示例
print("\n部分示例结果:")
print(results_df[['instruction', 'generated', 'bleu']].head())


  0%|          | 0/100 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 100/100 [06:03<00:00,  3.64s/it]


部分示例结果:
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4                                   额叶胶质瘤术的辅助治疗有些什么？   

                                           generated      bleu  
0  ou are Qwen, created by Alibaba Cloud. You are...  1.193548  
1  ou are Qwen, created by Alibaba Cloud. You are...  1.240428  
2  ou are Qwen, created by Alibaba Cloud. You are...  1.502059  
3  ou are Qwen, created by Alibaba Cloud. You are...  1.760298  
4  ou are Qwen, created by Alibaba Cloud. You are...  0.000000  





BLEU 更关注精确匹配，适合评估翻译的准确性。  
而METEOR考虑了词序和同义词替换，更关注语义相似性，因此更适合。

#### 1.3 METEOR  
通过计算词级别的匹配，包括精确匹配、词干匹配和同义词匹配。  
使用词序和词义信息来评估文本质量，结合了精确度和召回率。  
更关注语义相似性，适合评估生成文本的内容相关性。

In [7]:
import nltk
from nltk.translate.meteor_score import meteor_score

# 下载 NLTK 资源
nltk.download('wordnet')  # 用于支持 WordNet 词汇库
nltk.download('omw-1.4')  # 用于支持多语言功能
print("NLTK资源下载完毕！")

NLTK资源下载完毕！


[nltk_data] Downloading package wordnet to /home/pod/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/pod/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [8]:
import nltk
from nltk.translate.meteor_score import meteor_score
def evaluate_meteor(model, tokenizer, test_data, num_samples=None):
    """
    计算 METEOR 分数，确保 hypothesis 和 reference 是分词后的列表
    :param model: 已加载的模型
    :param tokenizer: 已加载的分词器
    :param test_data: 测试数据集（DataFrame）
    :param num_samples: 可选，限制评估样本数量
    :return: 平均 METEOR 分数和结果 DataFrame
    """
    results = []
    meteor_scores = []

    # 如果需要抽样
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 遍历测试数据集
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data)):
        instruction = row['instruction']
        reference = row['output']
        
        # 使用模型生成回答
        generated = generate_answer(model, tokenizer, instruction)

        try:
            # 分词处理
            reference_tokens = reference.split()  # 将参考答案分词
            generated_tokens = generated.split()  # 将生成文本分词

            # 计算 METEOR 分数
            score = meteor_score([reference_tokens], generated_tokens)
            meteor_scores.append(score)
            
            # 保存详细结果
            results.append({
                'instruction': instruction,
                'reference': reference,
                'generated': generated,
                'meteor': score
            })
        except Exception as e:
            print(f"行 {idx} 出错: {e}")
            continue

    # 转换为 DataFrame
    results_df = pd.DataFrame(results)

    # 计算平均 METEOR 分数
    avg_meteor = sum(meteor_scores) / len(meteor_scores) if meteor_scores else 0

    return avg_meteor, results_df





# 加载测试数据
test_data = pd.read_csv('shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv')

# 调用 METEOR 评估逻辑
avg_meteor, results_df = evaluate_meteor(
    model,
    tokenizer,
    test_data,
    num_samples=100  # 限制样本数量
)


# 保存详细结果到 CSV 文件
results_df.to_csv('evaluation/origin_model/origin_detailed_meteor_scores.csv', index=False)
# 将平均分数保存到单独的 CSV 文件
avg_meteor_df = pd.DataFrame([{"average_meteor": avg_meteor}])
avg_meteor_df.to_csv('evaluation/origin_model/origin_average_meteor_score.csv', index=False)

# 打印并保存平均 METEOR 分数
print("\n平均 METEOR 分数:")
print(f"METEOR: {avg_meteor:.4f}")



# 打印部分示例结果
print("\n部分示例结果:")
print(results_df[['instruction', 'generated', 'meteor']].head())



100%|██████████| 100/100 [06:11<00:00,  3.72s/it]


平均 METEOR 分数:
METEOR: 0.0693

部分示例结果:
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4                                   额叶胶质瘤术的辅助治疗有些什么？   

                                           generated    meteor  
0  ou are Qwen, created by Alibaba Cloud. You are...  0.148883  
1  ou are Qwen, created by Alibaba Cloud. You are...  0.102041  
2  ou are Qwen, created by Alibaba Cloud. You are...  0.145509  
3  ou are Qwen, created by Alibaba Cloud. You are...  0.143416  
4  ou are Qwen, created by Alibaba Cloud. You are...  0.000000  





#### 1.4 BERTScore  
BERTScore 使用 BERT 模型来计算生成文本和参考文本之间的语义相似性。  
通过 BERT 的上下文信息来评估文本质量，更关注语义相似性。


In [4]:
from transformers import AutoTokenizer, AutoModel

# 指定本地模型路径
local_model_path = "shared-nvme/llm_models/models--google-bert--bert-base-uncased"  

# 加载本地 BERT 模型和分词器
bert_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
bert_model = AutoModel.from_pretrained(local_model_path).to("cuda")

# 显式设置分词器的 pad_token，避免默认使用 eos_token
if bert_tokenizer.pad_token is None:
    bert_tokenizer.pad_token = bert_tokenizer.eos_token

def compute_sentence_embedding(text, model, tokenizer):
    """
    计算给定文本的句子嵌入
    :param text: 输入文本
    :param model: 已加载的 BERT 模型
    :param tokenizer: 已加载的分词器
    :return: 文本的句子嵌入
    """
    # 对文本进行编码
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    ).to("cuda")
    
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    ).to("cuda")

    # 确保包含 attention_mask
    inputs['attention_mask'] = inputs.get('attention_mask', None)


    # 获取模型输出
    with torch.no_grad():
        outputs = model(**inputs)

    # 使用 [CLS] token 的嵌入作为句子嵌入
    sentence_embedding = outputs.last_hidden_state[:, 0, :]
    return sentence_embedding

def evaluate_bert_similarity(test_data, num_samples=None):
    """
    使用 BERT 计算生成文本和参考文本之间的余弦相似度
    :param test_data: 测试数据集（DataFrame）
    :param num_samples: 可选，限制评估样本数量
    :return: 平均余弦相似度和结果 DataFrame
    """
    results = []
    cosine_similarities = []

    # 如果需要抽样
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 遍历测试数据集
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data), desc="Calculating Similarity"):
        instruction = row['instruction']
        reference = row['output']
        
        # 使用已有的 generate_answer 生成回答
        generated = generate_answer(model, tokenizer, instruction)

        try:
            # 计算句子嵌入
            reference_embedding = compute_sentence_embedding(reference, bert_model, bert_tokenizer)
            generated_embedding = compute_sentence_embedding(generated, bert_model, bert_tokenizer)

            # 计算余弦相似度
            cosine_similarity = torch.nn.functional.cosine_similarity(
                reference_embedding, generated_embedding
            ).item()
            cosine_similarities.append(cosine_similarity)

            # 保存详细结果
            results.append({
                'instruction': instruction,
                'reference': reference,
                'generated': generated,
                'cosine_similarity': cosine_similarity
            })
        except Exception as e:
            print(f"行 {idx} 出错: {e}")
            continue

    # 转换为 DataFrame
    results_df = pd.DataFrame(results)

    # 计算平均余弦相似度
    avg_cosine_similarity = sum(cosine_similarities) / len(cosine_similarities) if cosine_similarities else 0

    return avg_cosine_similarity, results_df


# 加载测试数据
test_data = pd.read_csv('shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv')

# 调用 BERT 相似度评估逻辑
avg_cosine_similarity, results_df = evaluate_bert_similarity(
    test_data,
    num_samples=100  # 限制样本数量
)

# 保存详细结果到 CSV 文件
results_df.to_csv('evaluation/origin_model/origin_detailed_bert_cosine_similarity.csv', index=False)

# 保存平均余弦相似度到 CSV 文件
avg_cosine_similarity_df = pd.DataFrame([{"average_cosine_similarity": avg_cosine_similarity}])
avg_cosine_similarity_df.to_csv('evaluation/origin_model/origin_average_bert_cosine_similarity.csv', index=False)

# 打印并保存平均余弦相似度
print("\n平均余弦相似度:")
print(f"Cosine Similarity: {avg_cosine_similarity:.4f}")

# 打印部分示例结果
print("\n部分示例结果:")
print(results_df[['instruction', 'generated', 'cosine_similarity']].head())


Calculating Similarity: 100%|██████████| 100/100 [05:46<00:00,  3.46s/it]


平均余弦相似度:
Cosine Similarity: 0.8536

部分示例结果:
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4                                   额叶胶质瘤术的辅助治疗有些什么？   

                                           generated  cosine_similarity  
0  ou are Qwen, created by Alibaba Cloud. You are...           0.931491  
1  ou are Qwen, created by Alibaba Cloud. You are...           0.837991  
2  ou are Qwen, created by Alibaba Cloud. You are...           0.866277  
3  ou are Qwen, created by Alibaba Cloud. You are...           0.932804  
4  ou are Qwen, created by Alibaba Cloud. You are...           0.738793  





可以看到使用在中文上表现优秀的的BERT语言模型，可以更好的处理同义词、语义上的相似度。

#### 1.5 BGE-Cosin Similarity
 bge-large-zh-v1.5

In [5]:
from transformers import AutoTokenizer, AutoModel

# 指定本地模型路径
local_model_path = "shared-nvme/llm_models/models--BAAI--bge-large-zh-v1.5"  # 替换为您的本地路径

# 加载本地 BGE 模型和分词器
bge_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
bge_model = AutoModel.from_pretrained(local_model_path).to("cuda")

# # 显式设置分词器的 pad_token
# if bge_tokenizer.pad_token is None:
#     bge_tokenizer.pad_token = bge_tokenizer.eos_token  # 使用 eos_token 作为 pad_token（如有必要）

def compute_sentence_embedding_bge(text, model, tokenizer):
    """
    使用 BGE 模型计算文本的句子嵌入
    :param text: 输入文本
    :param model: 已加载的 BGE 模型
    :param tokenizer: 已加载的分词器
    :return: 文本的句子嵌入
    """
    # 对文本进行编码
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    ).to("cuda")

    # 获取 BGE 模型的输出
    with torch.no_grad():
        outputs = model(**inputs)

    # 获取池化嵌入作为句子特征
    sentence_embedding = outputs.last_hidden_state[:, 0, :]  # 使用 [CLS] 作为句子特征
    return sentence_embedding

def evaluate_bge_similarity(test_data, num_samples=None):
    """
    使用 BGE 模型计算生成文本和参考文本之间的余弦相似度
    :param test_data: 测试数据集（DataFrame）
    :param num_samples: 可选，限制评估样本数量
    :return: 平均余弦相似度和结果 DataFrame
    """
    results = []
    cosine_similarities = []

    # 如果需要抽样
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 遍历测试数据集
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data), desc="Calculating BGE Similarity"):
        instruction = row['instruction']
        reference = row['output']
        
        # 使用全局的 generate_answer 函数生成回答
        generated = generate_answer(model, tokenizer, instruction)

        try:
            # 计算句子嵌入
            reference_embedding = compute_sentence_embedding_bge(reference, bge_model, bge_tokenizer)
            generated_embedding = compute_sentence_embedding_bge(generated, bge_model, bge_tokenizer)

            # 计算余弦相似度
            cosine_similarity = torch.nn.functional.cosine_similarity(
                reference_embedding, generated_embedding
            ).item()
            cosine_similarities.append(cosine_similarity)

            # 保存详细结果
            results.append({
                'instruction': instruction,
                'reference': reference,
                'generated': generated,
                'cosine_similarity': cosine_similarity
            })
        except Exception as e:
            print(f"行 {idx} 出错: {e}")
            continue

    # 转换为 DataFrame
    results_df = pd.DataFrame(results)

    # 计算平均余弦相似度
    avg_cosine_similarity = sum(cosine_similarities) / len(cosine_similarities) if cosine_similarities else 0

    return avg_cosine_similarity, results_df


# 加载测试数据
test_data = pd.read_csv('shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv')

# 调用 BGE 相似度评估逻辑
avg_cosine_similarity, results_df = evaluate_bge_similarity(
    test_data,
    num_samples=100  # 限制样本数量
)

# 保存详细结果到 CSV 文件
results_df.to_csv('evaluation/origin_model/origin_detailed_bge_cosine_similarity.csv', index=False)

# 保存平均余弦相似度到 CSV 文件
avg_cosine_similarity_df = pd.DataFrame([{"average_cosine_similarity": avg_cosine_similarity}])
avg_cosine_similarity_df.to_csv('evaluation/origin_model/origin_average_bge_cosine_similarity.csv', index=False)

# 打印并保存平均余弦相似度
print("\n平均余弦相似度:")
print(f"Cosine Similarity: {avg_cosine_similarity:.4f}")

# 打印部分示例结果
print("\n部分示例结果:")
print(results_df[['instruction', 'generated', 'cosine_similarity']].head())


Calculating BGE Similarity: 100%|██████████| 100/100 [05:41<00:00,  3.41s/it]


平均余弦相似度:
Cosine Similarity: 0.6402

部分示例结果:
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4                                   额叶胶质瘤术的辅助治疗有些什么？   

                                           generated  cosine_similarity  
0  ou are Qwen, created by Alibaba Cloud. You are...           0.690373  
1  ou are Qwen, created by Alibaba Cloud. You are...           0.655256  
2  ou are Qwen, created by Alibaba Cloud. You are...           0.687346  
3  ou are Qwen, created by Alibaba Cloud. You are...           0.701147  
4  ou are Qwen, created by Alibaba Cloud. You are...           0.249756  





### 三.问题回答的流畅性


Perplexity  
评估模型的语言流畅性和训练过程的收敛性

In [6]:
def compute_perplexity(text, model, tokenizer):
    """
    使用全局定义的模型和分词器计算文本的 Perplexity
    :param text: 输入文本
    :param model: 已加载的语言模型
    :param tokenizer: 已加载的分词器
    :return: Perplexity 值
    """
    # 对文本进行编码
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    ).to("cuda")

    # 设置输入标签
    inputs['labels'] = inputs['input_ids']

    # 计算模型的交叉熵损失
    with torch.no_grad():
        outputs = model(**inputs)
        loss = outputs.loss  # 模型返回的交叉熵损失

    # 根据交叉熵损失计算 Perplexity
    perplexity = torch.exp(loss).item()
    return perplexity


def evaluate_perplexity(test_data, model, tokenizer, num_samples=None):
    """
    计算测试数据集中每条文本的 Perplexity，并计算平均值
    使用四分位数（IQR）原则识别异常值，对异常值进行平滑处理
    :param test_data: 测试数据集（DataFrame）
    :param model: 已加载的语言模型
    :param tokenizer: 已加载的分词器
    :param num_samples: 可选，限制评估样本数量
    :return: 平均 Perplexity 和结果 DataFrame
    """
    results = []
    perplexities = []

    # 如果需要抽样
    if num_samples and num_samples < len(test_data):
        test_data = test_data.sample(n=num_samples, random_state=42)
    
    # 遍历测试数据集
    for idx, row in tqdm(test_data.iterrows(), total=len(test_data), desc="Calculating Perplexity"):
        reference = row['output']  # 使用参考文本计算 Perplexity
        
        try:
            # 计算 Perplexity
            perplexity = compute_perplexity(reference, model, tokenizer)
            perplexities.append(perplexity)

            # 保存结果
            results.append({
                'text': reference,
                'perplexity': perplexity
            })
        except Exception as e:
            print(f"行 {idx} 出错: {e}")
            continue

    # 转换为 DataFrame
    results_df = pd.DataFrame(results)

    # 计算 IQR（四分位间距）
    perplexity_series = pd.Series(perplexities)
    Q1 = perplexity_series.quantile(0.25)  # 第 1 四分位数
    Q3 = perplexity_series.quantile(0.75)  # 第 3 四分位数
    IQR = max(Q3 - Q1, 1)  # 确保 IQR 不为 0

    # 根据 IQR 定义异常值范围
    lower_bound = max(Q1 - 1.5 * IQR, 0)  # Perplexity 不可能小于 0
    upper_bound = Q3 + 1.5 * IQR

    # 输出调试信息，验证上下限是否合理
    print(f"IQR 范围: 下限={lower_bound}, 上限={upper_bound}")

    # 计算 IQR 内的均值
    mean_ppl = perplexity_series[(perplexity_series >= lower_bound) & (perplexity_series <= upper_bound)].mean()

    # 对异常值进行平滑处理（替换为均值）
    smoothed_perplexities = perplexity_series.apply(
        lambda x: mean_ppl if x < lower_bound or x > upper_bound else x
    ).reset_index(drop=True)

    # 输出调试信息，检查异常值是否被识别
    num_anomalies = (perplexity_series < lower_bound).sum() + (perplexity_series > upper_bound).sum()
    print(f"识别到的异常值数量: {num_anomalies}")

    # 更新 DataFrame 的 Perplexity 列
    results_df['perplexity'] = smoothed_perplexities

    # 计算最终的平均 Perplexity
    avg_perplexity = smoothed_perplexities.mean()

    return avg_perplexity, results_df


# 加载测试数据
test_data_path = 'shared-nvme/datasets/achieve/finetune_test/test_csv/finetune_test.csv'
test_data = pd.read_csv(test_data_path)

# 调用 Perplexity 评估逻辑
avg_perplexity, results_df = evaluate_perplexity(
    test_data,
    model,
    tokenizer,
    num_samples=100  # 限制样本数量
)

# 保存详细结果到 CSV 文件
results_df.to_csv('evaluation/origin_model/origin_detailed_perplexity.csv', index=False)

# 保存平均 Perplexity 到 CSV 文件
avg_perplexity_df = pd.DataFrame([{"average_perplexity": avg_perplexity}])
avg_perplexity_df.to_csv('evaluation/origin_model/origin_average_perplexity.csv', index=False)

# 打印并保存平均 Perplexity
print("\n平均 Perplexity:")
print(f"Perplexity: {avg_perplexity:.4f}")

# 打印部分示例结果
print("\n部分示例结果:")
print(results_df[['text', 'perplexity']].head())

Calculating Perplexity: 100%|██████████| 100/100 [00:08<00:00, 12.39it/s]

IQR 范围: 下限=0, 上限=52.611210107803345
识别到的异常值数量: 15

平均 Perplexity:
Perplexity: 16.7350

部分示例结果:
                                                text  perplexity
0  Hi, Cannot say in your particular case but loc...   19.448977
1  Hi, Thanks for posting your quarry. As you hav...   18.338043
2  Hi, Thanks for using Chat Doctor. Your throat ...   11.042951
3  Hi. Thanks for your query. Vaginal spotting ma...   19.404900
4                                               护理干预   16.735043



