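# 🔬 OSSExtractor: On-Surface Synthesis Parameter Extraction Tool (Debug Version)

This notebook lets you debug each OSSExtractor processing step in turn, inspect intermediate results, and tune parameters.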


## 📦 Import required libraries and modules


In [None]:
import pandas as pd
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Add module paths
sys.path.append('Text Parser')
sys.path.append('Text Extraction')

# Import the unified processing modules
from PDF_Unified_Processor import PDFUnifiedProcessor, save_contents_to_specific_folders
from TXT_Processing import process_text_file_for_processing
from Embedding_and_Similarity import process_text_file_for_embedding
from Unified_Text_Processor import (
    process_text_file_for_filter, process_text_file_for_abstract, process_text_file_for_summerized,
    process_text_file_for_filter_meta_llama, process_text_file_for_abstract_meta_llama,
    process_text_file_for_summerized_meta_llama
)

print("✅ All modules imported successfully!")
print("🔄 Unified processing modules:")
print("  - PDF_Unified_Processor: unified PDF processing (PyMuPDF)")
print("  - Unified_Text_Processor: unified text processing (LLM)")
print("  - Compatible with the original interfaces, with more capable functionality")
print("  - Supports basic text extraction and structured parsing")


✅ All modules imported successfully!
🔄 Unified processing modules:
  - PDF_Unified_Processor: unified PDF processing (PyMuPDF)
  - Unified_Text_Processor: unified text processing (LLM)
  - Compatible with the original interfaces, with more capable functionality
  - Supports basic text extraction and structured parsing


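## 🔄 Processing pipeline overview

**The complete OSSExtractor pipeline:**

1. **PDF to text** → raw text file
2. **Text preprocessing** → paragraph segmentation and filtering
3. **Embedding-similarity selection** → pick the N most relevant paragraphs from all paragraphs
4. **LLM content filtering** → narrow down the paragraphs selected by similarity
5. **Abstraction and summarization** → produce the final structured parameters

**Example of how the paragraph count changes:**
- Raw text: 100+ paragraphs
- After preprocessing: 50+ paragraphs
- After embedding selection: 20 paragraphs (the most relevant)
- After LLM filtering: 10 paragraphs (the best matches)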
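
The funnel above can be tracked numerically while debugging. The following is a minimal sketch, assuming the per-stage paragraph counts have already been collected; `stage_retention` and the stage labels are hypothetical names and not part of OSSExtractor.

```python
def stage_retention(counts):
    """Return (stage, count, percent_of_previous_stage) for each stage.

    `counts` is an ordered list of (stage_name, paragraph_count) pairs.
    The first stage is reported as 100%; a zero predecessor yields 0.0
    instead of raising ZeroDivisionError.
    """
    rows = []
    previous = None
    for name, count in counts:
        if previous is None:
            percent = 100.0
        elif previous == 0:
            percent = 0.0
        else:
            percent = count / previous * 100
        rows.append((name, count, round(percent, 1)))
        previous = count
    return rows


# Illustrative numbers from the example list (lower bounds of "100+" and "50+").
example = [("raw", 100), ("preprocessed", 50), ("embedding", 20), ("llm_filter", 10)]
for name, count, percent in stage_retention(example):
    print(f"{name:>12}: {count:4d} paragraphs ({percent}% of previous stage)")
```

Reporting each stage relative to the previous one, rather than to the raw text, makes it easier to see which step is discarding more than expected.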


## 🔧 Configuration parameters


In [23]:
import os

# PDF files to process
pdf_files = [
    '/Users/zhaowenyuan/Projects/FCPDExtractor/Data/papers/on-surface-synthesis-of.pdf',
    # Add more files here if needed
    # '/Users/zhaowenyuan/Projects/FCPDExtractor/Data/papers/another_paper.pdf',
]

# Base data directory
base_data_dir = '/Users/zhaowenyuan/Projects/FCPDExtractor/Data'

# 1. Define the main output folder, named 'output', under the Data directory
main_output_dir = os.path.join(base_data_dir, 'output')

# 2. Create the 'output' folder if it does not exist
# exist_ok=True means no error is raised when the folder already exists
os.makedirs(main_output_dir, exist_ok=True)

print(f"📄 Will process {len(pdf_files)} PDF file(s):")
print(f"📁 Main output directory set to: {main_output_dir}")
print("-" * 40)  # separator line

# Iterate over every PDF file to process
for i, pdf_path in enumerate(pdf_files, 1):

    # 3. Get the PDF file name from the full path (e.g. 'd2cp03073j.pdf')
    pdf_filename = os.path.basename(pdf_path)

    # 4. Strip the .pdf extension to get the folder name (e.g. 'd2cp03073j')
    folder_name = os.path.splitext(pdf_filename)[0]

    # 5. Build the full path of the output folder dedicated to this PDF
    specific_output_dir = os.path.join(main_output_dir, folder_name)

    # 6. Create that dedicated folder
    os.makedirs(specific_output_dir, exist_ok=True)

    print(f"  {i}. Processing: {pdf_filename}")
    print(f"     -> Output goes to: {specific_output_dir}")

    # --- Attach your downstream processing logic here ---
    # For example, every later file-saving operation should use `specific_output_dir` as its path
    # processed_text_path = os.path.join(specific_output_dir, 'Processed_text.txt')
    # with open(processed_text_path, 'w') as f:
    #     f.write("processed text goes here")

📄 Will process 1 PDF file(s):
📁 Main output directory set to: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output
----------------------------------------
  1. Processing: on-surface-synthesis-of.pdf
     -> Output goes to: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of
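
The per-PDF folder derivation in this cell can also be written with `pathlib`. This is a minimal sketch of the same idea, not a replacement for the notebook's code; `output_dir_for` is a hypothetical helper name, and the temporary directory stands in for the real `main_output_dir`.

```python
from pathlib import Path
import tempfile


def output_dir_for(pdf_path, main_output_dir, create=False):
    """Return <main_output_dir>/<PDF name without extension>, optionally creating it."""
    target = Path(main_output_dir) / Path(pdf_path).stem
    if create:
        # parents=True also creates main_output_dir if it is missing
        target.mkdir(parents=True, exist_ok=True)
    return target


# Demonstration inside a throwaway directory so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    out = output_dir_for("papers/on-surface-synthesis-of.pdf", tmp, create=True)
    created_name = out.name
    created_ok = out.is_dir()

print(created_name, created_ok)
```

Keeping the path logic in one function means later steps can ask for the folder instead of rebuilding it with string joins.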


## 📄 Step 1: PDF-to-text conversion


In [24]:
print("🚀 Step 1/5: PDF-to-text conversion...")
print("=" * 50)

# Use the unified PDF processing module
processor = PDFUnifiedProcessor()

# Run the PDF-to-text conversion
output_files = save_contents_to_specific_folders(pdf_files, main_output_dir)

print(f"✅ PDF-to-text conversion complete! Generated {len(output_files)} text file(s):")
for i, file in enumerate(output_files, 1):
    print(f"  {i}. {file}")

    # Show the file size and line count
    if os.path.exists(file):
        with open(file, 'r', encoding='utf-8', errors='ignore') as f:
            lines = f.readlines()
            print(f"     📊 Lines: {len(lines)}")
            print(f"     📏 File size: {os.path.getsize(file)} bytes")


🚀 Step 1/5: PDF-to-text conversion...
✅ PDF-to-text conversion complete! Generated 1 text file(s):
  1. /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of/on-surface-synthesis-of.txt
     📊 Lines: 533
     📏 File size: 26193 bytes


### 🔍 Inspect the PDF-to-text result


In [25]:
# Pick the first file for a detailed look
sample_file = output_files[0]
print(f"📖 Viewing file: {os.path.basename(sample_file)}")
print("=" * 50)

with open(sample_file, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

print(f"📊 Total characters: {len(content)}")
print(f"📊 Total lines: {len(content.splitlines())}")
print("\n📄 Preview of the first 500 characters:")
print("-" * 30)
print(content[:500] + "..." if len(content) > 500 else content)


📖 Viewing file: on-surface-synthesis-of.txt
📊 Total characters: 25875
📊 Total lines: 533

📄 Preview of the first 500 characters:
------------------------------
On-Surface Synthesis of Oligo(indenoindene)
 Marco Di Giovannantonio,* Qiang Chen, José I. Urgel, Pascal Ruﬃeux, Carlo A. Pignedoli,
Klaus Müllen,* Akimitsu Narita,* and Roman Fasel*
Cite This: J. Am. Chem. Soc. 2020, 142, 12925−12929
Read Online
ACCESS
Metrics & More
Article Recommendations
*
sı
Supporting Information
ABSTRACT: Fully conjugated ladder polymers (CLP) possess unique optical and electronic properties and are considered
promising materials for applications in (opto)electronic dev...


### 🔍 Structured PDF parsing (optional)


In [26]:
# Optional: use structured parsing to extract the abstract and conclusion sections
print("🔍 Structured PDF parsing (abstract and conclusion)")
print("=" * 50)

structured_results = []

for i, pdf_path in enumerate(pdf_files, 1):
    print(f"\n📄 Processing file {i}/{len(pdf_files)}: {os.path.basename(pdf_path)}")

    # Run structured parsing with the unified processor
    result = processor.process_pdf_comprehensive(pdf_path, main_output_dir, mode='structured')
    structured_results.append(result)

    # Show the results
    for section, file_path in result.items():
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                lines = f.readlines()
            # len(lines) is a line count, which can differ from the paragraph count the processor reports
            print(f"  ✅ {section}: {len(lines)} lines")

print(f"\n🎉 Structured parsing complete!")


🔍 Structured PDF parsing (abstract and conclusion)

📄 Processing file 1/1: on-surface-synthesis-of.pdf
✅ other section: 5 paragraphs -> /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of/on-surface-synthesis-of_other.txt
  ✅ other: 489 lines

🎉 Structured parsing complete!


## 📝 Step 2: Text preprocessing


In [27]:
print("🚀 Step 2/5: Text preprocessing...")
print("=" * 50)

total_filtered_count = 0
processed_files = []

# Process the TXT files generated in the previous step, not the PDF files
for i, txt_file in enumerate(output_files, 1):
    print(f"\n📄 Processing file {i}/{len(output_files)}: {os.path.basename(txt_file)}")

    # Run text preprocessing on the TXT file
    processed_file_path, filtered_count = process_text_file_for_processing(txt_file)
    processed_files.append(processed_file_path)
    total_filtered_count += filtered_count

    print(f"  ✅ Preprocessing complete, filtered {filtered_count} paragraphs")
    print(f"  📁 Output file: {processed_file_path}")

print(f"\n🎉 Text preprocessing complete! Filtered {total_filtered_count} paragraphs in total")


🚀 Step 2/5: Text preprocessing...

📄 Processing file 1/1: on-surface-synthesis-of.txt
  ✅ Preprocessing complete, filtered 19 paragraphs
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of/Processed_on-surface-synthesis-of.txt

🎉 Text preprocessing complete! Filtered 19 paragraphs in total


### 🔍 Inspect preprocessing results


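### 💡 About LLM content filtering (Step 4)

**What this step does:**
- Input: the paragraphs selected by embedding similarity (for example, 20 paragraphs)
- Processing: an LLM (Meta-Llama-3.1-8B-Instruct when available) judges whether each paragraph is genuinely related to on-surface chemical reactions
- Output: a further-narrowed set of relevant paragraphs (for example, 10 paragraphs)

**Model selection:**
- Preferred: Meta-Llama-3.1-8B-Instruct (more capable, better performance)
- Fallback: nous-hermes-llama2-13b (stable and reliable)

**Why the paragraph count drops:**
- Embedding similarity only measures how close a paragraph is to the query in vector space; it does not check whether the paragraph actually describes a reaction
- LLM filtering reads each paragraph and makes a content-level judgment
- Only the genuinely relevant paragraphs are kept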

In [28]:
# Analyze the paragraph segmentation process in detail
def analyze_paragraph_segmentation(txt_file):
    print(f"🔍 Analyzing file: {os.path.basename(txt_file)}")
    print("=" * 50)

    with open(txt_file, 'r', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    print(f"📊 Total lines in raw file: {len(lines)}")

    # Simulate paragraph segmentation: blank lines separate paragraphs
    current_segment = []
    segments = []
    empty_lines_count = 0

    for line in lines:
        if line.strip():  # non-blank line
            current_segment.append(line.strip())
        else:  # blank line
            empty_lines_count += 1
            if current_segment:  # close the current paragraph if it has content
                segments.append(' '.join(current_segment))
                current_segment = []

    # Handle the last paragraph
    if current_segment:
        segments.append(' '.join(current_segment))

    print(f"📊 Blank lines: {empty_lines_count}")
    print(f"📊 Paragraphs after segmentation: {len(segments)}")
    if segments:  # avoid ZeroDivisionError on an empty file
        print(f"📊 Average paragraph length: {sum(len(seg) for seg in segments) / len(segments):.1f} characters")

    print("\n📄 Preview of the first 5 paragraphs:")
    print("-" * 30)
    for i, segment in enumerate(segments[:5]):
        print(f"Paragraph {i+1} (length: {len(segment)}): {segment[:150]}..." if len(segment) > 150 else f"Paragraph {i+1} (length: {len(segment)}): {segment}")

    return segments

# Analyze paragraph segmentation of the raw TXT file
if output_files:
    original_segments = analyze_paragraph_segmentation(output_files[0])

    # Count the preprocessed paragraphs from the file instead of hard-coding the number.
    # Assumes one paragraph per non-blank line; the pipeline's own segmentation
    # may differ from the simulation above.
    with open(processed_files[0], 'r', encoding='utf-8', errors='ignore') as f:
        processed_count = sum(1 for line in f if line.strip())

    print(f"\n📊 Summary:")
    print(f"  - Paragraphs in raw file: {len(original_segments)}")
    print(f"  - Paragraphs after preprocessing: {processed_count}")
    print(f"  - Paragraphs filtered out: {len(original_segments) - processed_count}")
    if original_segments:
        print(f"  - Retention: {processed_count/len(original_segments)*100:.1f}%")


🔍 Analyzing file: on-surface-synthesis-of.txt
📊 Total lines in raw file: 533
📊 Blank lines: 34
📊 Paragraphs after segmentation: 34
📊 Average paragraph length: 757.6 characters

📄 Preview of the first 5 paragraphs:
------------------------------
Paragraph 1 (length: 936): On-Surface Synthesis of Oligo(indenoindene) Marco Di Giovannantonio,* Qiang Chen, José I. Urgel, Pascal Ruﬃeux, Carlo A. Pignedoli, Klaus Müllen,* A...
Paragraph 2 (length: 947): Achieving defect- free segments of oligo(indenoindene) oﬀers exclusive insight into this CLP and provides the basis to further synthetic approaches. C...
Paragraph 3 (length: 896): The synthesis of PInIns was pioneered by Scherf and Müllen in 1992, but the ﬁnal dehydrogenation step did not proceed completely, and unambiguous str...
Paragraph 4 (length: 943): Recently, we have established on-surface syntheses of indenoﬂuorene polymers by utilizing oxidative cyclization of methyl groups against phenylene rin...
Paragraph 5 (length: 371): Reaction Pathway from 1 to p-OInIn 4 on Au(111) Communication pubs.acs.org/JACS © 2020 American Chemical Society 12925 https://dx.doi.org/10.1021/jacs...

📊 Summary:
  - 原

## 🔍 Step 3: Embedding and similarity computation


In [29]:
print("🚀 Step 3/5: Embedding and similarity computation...")
print("=" * 50)

embedding_files = []

# Use the preprocessed files from the previous step
for i, processed_file in enumerate(processed_files, 1):
    print(f"\n📄 Processing file {i}/{len(processed_files)}: {os.path.basename(processed_file)}")

    # Run embedding and similarity computation
    embedding_file_path = process_text_file_for_embedding(processed_file)
    embedding_files.append(embedding_file_path)

    print(f"  ✅ Embedding and similarity computation complete")
    print(f"  📁 Output file: {embedding_file_path}")

print(f"\n🎉 Embedding and similarity computation complete!")


🚀 Step 3/5: Embedding and similarity computation...

📄 Processing file 1/1: Processed_on-surface-synthesis-of.txt
  ✅ Embedding and similarity computation complete
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of/Embedding_on-surface-synthesis-of.txt

🎉 Embedding and similarity computation complete!


### 🔍 Inspect embedding results


In [30]:
# Inspect the embedding results
sample_embedding = embedding_files[0]
print(f"📖 Viewing embedding file: {os.path.basename(sample_embedding)}")
print("=" * 50)

with open(sample_embedding, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

# len(lines) counts every line, blank ones included, so it can exceed the paragraph count
print(f"📊 Lines after embedding-similarity selection: {len(lines)}")
print("\n📄 Preview of the highest-similarity paragraphs:")
print("-" * 30)
for i, line in enumerate(lines[:3]):
    if line.strip():
        print(f"Paragraph {i+1}: {line[:200]}..." if len(line) > 200 else f"Paragraph {i+1}: {line}")


📖 Viewing embedding file: Embedding_on-surface-synthesis-of.txt
📊 Lines after embedding-similarity selection: 20

📄 Preview of the highest-similarity paragraphs:
------------------------------
Paragraph 1: Recently, we have established on-surface syntheses of indenoﬂuorene polymers by utilizing oxidative cyclization of methyl groups against phenylene rings of polyphenylene backbones,25,26 an alternative...
Paragraph 3: The observed polymers were no longer packed into islands. This change can be attributed to the desorption of bromine atoms from the Au(111) surface, which is known to be promoted by the presence of at...
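
To make the selection step concrete, here is a minimal sketch of ranking paragraphs by cosine similarity to a query vector and keeping the top N. It illustrates the general technique only; how `Embedding_and_Similarity` computes vectors and scores is not shown in this notebook, and the two-dimensional vectors and the names `cosine_similarity` and `top_n_by_similarity` are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_by_similarity(query_vec, paragraph_vecs, n):
    """Return the indices of the n paragraphs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(paragraph_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:n]]


# Toy vectors: paragraph 0 points almost the same way as the query, paragraph 1 is orthogonal.
query = [1.0, 0.0]
paragraphs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
selected = top_n_by_similarity(query, paragraphs, 2)
print(selected)
```

Because the ranking depends only on vector direction, a paragraph can score highly by sharing topic vocabulary with the query even when it reports no reaction conditions, which is why a later content-level filter is useful.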


## 🤖 Step 4: LLM content filtering


In [31]:
print("🚀 Step 4/5: LLM content filtering...")
print("=" * 50)

filter_files = []

for i, embedding_file in enumerate(embedding_files, 1):
    print(f"\n📄 Processing file {i}/{len(embedding_files)}: {os.path.basename(embedding_file)}")

    # Run LLM content filtering
    filter_file_path = process_text_file_for_filter_meta_llama(embedding_file)
    filter_files.append(filter_file_path)

    print(f"  ✅ LLM content filtering complete")
    print(f"  📁 Output file: {filter_file_path}")

print(f"\n🎉 LLM content filtering complete!")


🚀 Step 4/5: LLM content filtering...

📄 Processing file 1/1: Embedding_on-surface-synthesis-of.txt
🔍 Processing file: Embedding_on-surface-synthesis-of.txt
📊 Original paragraph count: 10

🤖 Step 1: LLM content filtering...
🔍 Trying to load model, path: /Users/zhaowenyuan/Projects/FCPDExtractor/models
✅ Successfully loaded model nous-hermes-llama2-13b.Q4_0.gguf
✅ Paragraph count after filtering: 4
  ✅ LLM content filtering complete
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/on-surface-synthesis-of/Embedding_on-surface-synthesis-of_Filtered.txt

🎉 LLM content filtering complete!


### 🔍 Inspect LLM filtering results


In [32]:
# Inspect the LLM filtering results
sample_filter = filter_files[0]
print(f"📖 Viewing filtered file: {os.path.basename(sample_filter)}")
print("=" * 50)

with open(sample_filter, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

# len(lines) counts every line, blank ones included, so it can exceed the paragraph count
print(f"📊 Lines after LLM filtering: {len(lines)}")
print("\n📄 Preview of the filtered paragraphs:")
print("-" * 30)
for i, line in enumerate(lines[:3]):
    if line.strip():
        print(f"Paragraph {i+1}: {line[:200]}..." if len(line) > 200 else f"Paragraph {i+1}: {line}")


📖 Viewing filtered file: Embedding_on-surface-synthesis-of_Filtered.txt
📊 Lines after LLM filtering: 8

📄 Preview of the filtered paragraphs:
------------------------------
Paragraph 1: Recently, we have established on-surface syntheses of indenoﬂuorene polymers by utilizing oxidative cyclization of methyl groups against phenylene rings of polyphenylene backbones,25,26 an alternative...
Paragraph 3: The observed polymers were no longer packed into islands. This change can be attributed to the desorption of bromine atoms from the Au(111) surface, which is known to be promoted by the presence of at...
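
The count printed in this cell (8) is twice the 4 paragraphs that the filter step itself reported, and the preview skips index 2, which suggests (the source does not confirm it) that the file separates paragraphs with blank lines. A minimal sketch of counting only non-blank lines follows; `count_paragraphs` is a hypothetical helper and assumes one paragraph per non-blank line.

```python
import os
import tempfile


def count_paragraphs(path):
    """Count non-blank lines, treating each one as a paragraph."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f if line.strip())


# Demonstration on a throwaway file with two paragraphs separated by blank lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first paragraph\n\nsecond paragraph\n\n")
    demo_path = tmp.name

demo_count = count_paragraphs(demo_path)
os.remove(demo_path)
print(demo_count)
```

Using the same counting rule at every step keeps the notebook's numbers comparable with the counts logged by the processing modules.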


## 📊 Step 5: Abstraction and summarization


In [33]:
print("🚀 Step 5/5: Abstraction and summarization...")
print("=" * 50)

abstract_files = []
summarized_files = []

# Run abstraction and summarization on the filtered files
for i, filter_file in enumerate(filter_files, 1):
    print(f"\n📄 Processing file {i}/{len(filter_files)}: {os.path.basename(filter_file)}")

    # Run abstraction
    abstract_file_path = process_text_file_for_abstract_meta_llama(filter_file)
    abstract_files.append(abstract_file_path)
    print(f"  ✅ Abstraction complete: {abstract_file_path}")

    # Run summarization
    summerized_file_path = process_text_file_for_summerized_meta_llama(filter_file)
    summarized_files.append(summerized_file_path)
    print(f"  ✅ Summarization complete: {summerized_file_path}")

print(f"\n🎉 Abstraction and summarization complete!")


🚀 Step 5/5: Abstraction and summarization...

📄 Processing file 1/1: Embedding_on-surface-synthesis-of_Filtered.txt
🔍 Processing file: Embedding_on-surface-synthesis-of_Filtered.txt
📊 Original paragraph count: 4

📝 Step 2: Text abstraction...
🔍 Trying to load model, path: /Users/zhaowenyuan/Projects/FCPDExtractor/models
✅ Successfully loaded model nous-hermes-llama2-13b.Q4_0.gguf
Abstract 1/4:
 The resulting para-type oligo(indenoindene) (p-OInIn, 4) was characterized by means of low-temperature STM/STS and noncontact atomic force microscopy (nc-AFM). On the basis of our previous study on dihydroindenofluorene (DHIN),28 we propose that this reaction sequence is also applicable to other molecular systems with different substituents or structural features. In conclusion, these results provide an experimental demonstration for on-surface synthesis of para-type oligo(indenoindene) by thermally activated reactions, and offer a novel approach towards the precise control of molecular structures in surface chemistry. 24 POLYMERS ACCEPTED MANUSCRIPT
Abstract 2/4:

In this study, we have investigated various c

### 🔍 Inspect the final result


In [34]:
# Inspect the final summary
sample_summarized = summarized_files[0]
print(f"📖 Viewing final summary: {os.path.basename(sample_summarized)}")
print("=" * 50)

with open(sample_summarized, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

print(f"📊 Summary length: {len(content)} characters")
print("\n📄 Summary preview:")
print("-" * 30)
print(content[:1000] + "..." if len(content) > 1000 else content)


📖 Viewing final summary: Embedding_on-surface-synthesis-of_Filtered_Summarized.txt
📊 Summary length: 1192 characters

📄 Summary preview:
------------------------------
 -------------------------------- | ----------| -----------| --------| -----------| --------------- | dibromo-trimethyl-p-terphenyl (1)| Au(111)     | N/A         | p-OInIn 4| oligoindigoindene | monolayer, extended


Precursor   | Azobenzene-4-carboxylic acid N-hydroxysuccinimide ester | Au(111) surface | N/A          | N/A           | N/A           |
Substrate  |                                |               |             |              |             |
Temperature| N/A                            | 360 °C         | Oxidative cyclization products (e.g., azobenzene-4,4'-diyl bis(2-hydroxyethyl)succinate radicals) | 1D          |
Product    | Azobenzene-4,4'-diyl bis(2-hydroxyethyl)succinate radicals                           |               |             |              | N/A           |
Dimensions| 1D (azobenzene backbone)     |               |             |  
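
The summary file holds pipe-delimited rows, so a small parser helps when inspecting it cell by cell. The following is a minimal sketch, assuming rows of the form `a | b | c`; the model output shown here is not consistently structured, so this will not recover a clean table on its own. `parse_pipe_row` and `is_separator_row` are hypothetical helper names.

```python
def parse_pipe_row(line):
    """Split one 'a | b | c' row into stripped cells, dropping a trailing empty cell."""
    cells = [cell.strip() for cell in line.split("|")]
    if cells and cells[-1] == "":
        cells.pop()
    return cells


def is_separator_row(cells):
    """True for rows made only of dashes, such as '---- | ----'."""
    return bool(cells) and all(cell != "" and set(cell) <= {"-"} for cell in cells)


row = parse_pipe_row("Precursor | Au(111) surface | N/A |")
print(row)
print(is_separator_row(parse_pipe_row("-------- | ----------|")))
```

Counting the cells per row is a quick way to spot rows where the model dropped or merged a column.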

## 📊 Processing statistics


In [35]:
# Collect processing statistics
print("📊 OSSExtractor processing statistics")
print("=" * 50)

stats = []
for i, file_path in enumerate(output_files):
    filename = os.path.basename(file_path)

    # Measure the file size at each step
    original_size = os.path.getsize(file_path) if os.path.exists(file_path) else 0

    processed_file = processed_files[i] if i < len(processed_files) else None
    processed_size = os.path.getsize(processed_file) if processed_file and os.path.exists(processed_file) else 0

    embedding_file = embedding_files[i] if i < len(embedding_files) else None
    embedding_size = os.path.getsize(embedding_file) if embedding_file and os.path.exists(embedding_file) else 0

    filter_file = filter_files[i] if i < len(filter_files) else None
    filter_size = os.path.getsize(filter_file) if filter_file and os.path.exists(filter_file) else 0

    summarized_file = summarized_files[i] if i < len(summarized_files) else None
    summarized_size = os.path.getsize(summarized_file) if summarized_file and os.path.exists(summarized_file) else 0

    stats.append({
        'File': filename,
        # output_files holds the extracted TXT files, so this is the raw text size, not the PDF size
        'Raw text (MB)': round(original_size / 1024 / 1024, 2),
        'Preprocessed (KB)': round(processed_size / 1024, 2),
        'Embedding selection (KB)': round(embedding_size / 1024, 2),
        'LLM filter (KB)': round(filter_size / 1024, 2),
        'Final summary (KB)': round(summarized_size / 1024, 2)
    })

# Display the statistics table
df_stats = pd.DataFrame(stats)
display(df_stats)

print("\n🎉 All processing steps complete!")


📊 OSSExtractor processing statistics


,File,Raw text (MB),Preprocessed (KB),Embedding selection (KB),LLM filter (KB),Final summary (KB)
0,on-surface-synthesis-of.txt,0.02,14.91,7.58,3.16,1.17



🎉 All processing steps complete!


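## 🔧 Debugging and optimization suggestions


In [36]:
print("🔧 Debugging and optimization suggestions")
print("=" * 50)
print("""
1. 📊 Check the embedding similarity threshold
   - If too few paragraphs are selected, lower the similarity threshold
   - If too many paragraphs are selected, raise the similarity threshold

2. 🤖 Improve the LLM prompts
   - Adjust the question wording in Filter.py
   - Adjust the parameter-extraction prompt in Summerized.py

3. 📝 Tune text preprocessing
   - Modify the filtering rules in TXT_Processing.py
   - Adjust the paragraph segmentation strategy

4. 🔍 Check model performance
   - Observe the quality of the LLM responses
   - Consider adjusting model parameters (temp, top_p, etc.)

5. 📈 Visualize the pipeline
   - Plot how the data volume changes across steps
   - Analyze processing efficiency
""")


🔧 Debugging and optimization suggestions

1. 📊 Check the embedding similarity threshold
   - If too few paragraphs are selected, lower the similarity threshold
   - If too many paragraphs are selected, raise the similarity threshold

2. 🤖 Improve the LLM prompts
   - Adjust the question wording in Filter.py
   - Adjust the parameter-extraction prompt in Summerized.py

3. 📝 Tune text preprocessing
   - Modify the filtering rules in TXT_Processing.py
   - Adjust the paragraph segmentation strategy

4. 🔍 Check model performance
   - Observe the quality of the LLM responses
   - Consider adjusting model parameters (temp, top_p, etc.)

5. 📈 Visualize the pipeline
   - Plot how the data volume changes across steps
   - Analyze processing efficiency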
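
Suggestion 1 can be explored offline before touching the pipeline. Here is a minimal sketch, assuming each paragraph already has a similarity score; whether `Embedding_and_Similarity` applies a threshold or a fixed top-N is not shown in this notebook, and `select_by_threshold` and the scores are illustrative.

```python
def select_by_threshold(scored_paragraphs, threshold):
    """Keep (score, text) pairs whose score is at least `threshold`."""
    return [(score, text) for score, text in scored_paragraphs if score >= threshold]


# Made-up scores for four placeholder paragraphs.
scored = [(0.82, "A"), (0.64, "B"), (0.41, "C"), (0.17, "D")]
kept_counts = {}
for threshold in (0.7, 0.5, 0.3):
    kept = select_by_threshold(scored, threshold)
    kept_counts[threshold] = len(kept)
    print(f"threshold={threshold}: kept {len(kept)} paragraph(s)")
```

Sweeping a few thresholds like this shows how sensitive the selection size is before committing to a value.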



## 📊 Viewing and analyzing results


In [37]:
# View the final results
print("📊 Viewing processing results...")
print("=" * 50)

# List all generated files
all_files = {
    'Raw text': output_files,
    'Preprocessed text': processed_files,
    'Embedding files': embedding_files,
    'Filtered files': filter_files,
    'Abstract files': abstract_files,
    'Summary files': summarized_files
}

for category, files in all_files.items():
    print(f"\n📁 {category}:")
    for i, file in enumerate(files, 1):
        if os.path.exists(file):
            with open(file, 'r', encoding='utf-8', errors='ignore') as f:
                lines = f.readlines()
            print(f"  {i}. {os.path.basename(file)} ({len(lines)} lines)")
        else:
            print(f"  {i}. {os.path.basename(file)} (file does not exist)")

print(f"\n🎉 Processing complete! Processed {len(pdf_files)} PDF file(s)")


📊 Viewing processing results...

📁 Raw text:
  1. on-surface-synthesis-of.txt (533 lines)

📁 Preprocessed text:
  1. Processed_on-surface-synthesis-of.txt (38 lines)

📁 Embedding files:
  1. Embedding_on-surface-synthesis-of.txt (20 lines)

📁 Filtered files:
  1. Embedding_on-surface-synthesis-of_Filtered.txt (8 lines)

📁 Abstract files:
  1. Embedding_on-surface-synthesis-of_Filtered_Abstract.txt (10 lines)

📁 Summary files:
  1. Embedding_on-surface-synthesis-of_Filtered_Summarized.txt (15 lines)

🎉 Processing complete! Processed 1 PDF file(s)


## 🔍 Result analysis


In [38]:
# Analyze the final results
print("🔍 Analyzing final results...")
print("=" * 50)

# Show the contents of the summary files
if summarized_files:
    print("📄 Final summary results:")
    for i, file in enumerate(summarized_files, 1):
        print(f"\nFile {i}: {os.path.basename(file)}")
        if os.path.exists(file):
            with open(file, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            print("Content preview:")
            print("-" * 30)
            print(content[:500] + "..." if len(content) > 500 else content)
        else:
            print("File does not exist")


🔍 Analyzing final results...
📄 Final summary results:

File 1: Embedding_on-surface-synthesis-of_Filtered_Summarized.txt
Content preview:
------------------------------
 -------------------------------- | ----------| -----------| --------| -----------| --------------- | dibromo-trimethyl-p-terphenyl (1)| Au(111)     | N/A         | p-OInIn 4| oligoindigoindene | monolayer, extended


Precursor   | Azobenzene-4-carboxylic acid N-hydroxysuccinimide ester | Au(111) surface | N/A          | N/A           | N/A           |
Substrate  |                                |               |             |              |             |
Temperature| N/A                        ...


In [39]:
# 显示处理统计信息
print("\n📊 处理统计信息:")
print("=" * 30)

total_paragraphs_original = 0
total_paragraphs_filtered = 0

for i, (original, filtered) in enumerate(zip(processed_files, filter_files), 1):
    if os.path.exists(original):
        with open(original, 'r', encoding='utf-8', errors='ignore') as f:
            original_lines = len(f.readlines())
        total_paragraphs_original += original_lines
    
    if os.path.exists(filtered):
        with open(filtered, 'r', encoding='utf-8', errors='ignore') as f:
            filtered_lines = len(f.readlines())
        total_paragraphs_filtered += filtered_lines
        
        print(f"文件 {i}: {original_lines} → {filtered_lines} 段落 (保留率: {filtered_lines/original_lines*100:.1f}%)")

print(f"\n总计: {total_paragraphs_original} → {total_paragraphs_filtered} 段落")
print(f"整体保留率: {total_paragraphs_filtered/total_paragraphs_original*100:.1f}%")

print(f"\n🎉 摘要结论专用处理完成！")



📊 处理统计信息:
文件 1: 38 → 8 段落 (保留率: 21.1%)

总计: 38 → 8 段落
整体保留率: 21.1%

🎉 摘要结论专用处理完成！
