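# 🔬 OSSExtractor Surface Synthesis Parameter Extraction Tool - Debug Version

This notebook lets you debug each OSSExtractor processing step in turn, inspect intermediate results, and tune parameters.


## 📦 Import the required libraries and modules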


In [1]:
import pandas as pd
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Add the module directories to the import path
sys.path.append('Text Parser')
sys.path.append('Text Extraction')

# Import the unified processing modules
from PDF_Unified_Processor import PDFUnifiedProcessor, save_contents_to_specific_folders
from TXT_Processing import process_text_file_for_processing
from Embedding_and_Similarity import process_text_file_for_embedding
from Unified_Text_Processor import (
    process_text_file_for_filter,
    process_text_file_for_abstract,
    process_text_file_for_summerized,
    process_text_file_for_filter_meta_llama,
    process_text_file_for_abstract_meta_llama,
    process_text_file_for_summerized_meta_llama_strict
)

print("✅ All modules imported successfully!")


  from .autonotebook import tqdm as notebook_tqdm


✅ All modules imported successfully!


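## 🔄 Processing pipeline overview

**The complete OSSExtractor pipeline:**

1. **PDF to text** → raw text file
2. **Text preprocessing** → paragraph splitting and filtering
3. **Embedding similarity filtering** → select the N most relevant paragraphs from all paragraphs
4. **LLM content filtering** → narrow down the paragraphs that passed similarity filtering
5. **Abstraction and summarization** → produce the final structured parameters

**Example of how the paragraph count changes:**
- Raw text: 100+ paragraphs
- After preprocessing: 50+ paragraphs
- After embedding filtering: 20 paragraphs (the most relevant)
- After LLM filtering: 10 paragraphs (the best matches)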
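
Step 3 of the pipeline can be illustrated with a minimal sketch, assuming paragraphs are ranked by cosine similarity against a single query vector. The notebook does not show how `process_text_file_for_embedding` scores paragraphs, so `cosine_similarity`, `select_top_n`, and the toy vectors below are hypothetical stand-ins rather than the module's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def select_top_n(paragraphs, vectors, query_vector, n):
    """Return the n paragraphs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(vec, query_vector), idx)
              for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [paragraphs[idx] for _, idx in scored[:n]]

# Toy 3-dimensional vectors standing in for real embeddings (assumption).
paragraphs = ["reaction conditions", "author affiliations", "flow rate study"]
vectors = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.7, 0.3, 0.1]]
query = [1.0, 0.0, 0.0]

top_two = select_top_n(paragraphs, vectors, query, n=2)
print(top_two)
```

A fixed N keeps the later LLM steps bounded in cost. A similarity threshold would instead let the number of surviving paragraphs vary from paper to paper.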


## 🔧 Configuration


In [2]:
import os

# PDF files to process
pdf_files = [
    '/Users/zhaowenyuan/Projects/FCPDExtractor/Data/papers/101021acsoprd7b00291.pdf',
    # Add more files here if needed
    # '/Users/zhaowenyuan/Projects/FCPDExtractor/Data/papers/another_paper.pdf',
]

# Base data directory
base_data_dir = '/Users/zhaowenyuan/Projects/FCPDExtractor/Data'

# 1. Main output folder, named 'output', inside the Data directory
main_output_dir = os.path.join(base_data_dir, 'output')

# 2. Create the 'output' folder if it does not exist yet
# (exist_ok=True suppresses the error when the folder already exists)
os.makedirs(main_output_dir, exist_ok=True)

print(f"📄 Will process {len(pdf_files)} PDF file(s):")
print(f"📁 Main output directory set to: {main_output_dir}")
print("-" * 40)  # separator line

# Loop over every PDF to process
for i, pdf_path in enumerate(pdf_files, 1):

    # 3. File name of the PDF, taken from its full path (e.g. 'd2cp03073j.pdf')
    pdf_filename = os.path.basename(pdf_path)

    # 4. Strip the .pdf extension to get the folder name (e.g. 'd2cp03073j')
    folder_name = os.path.splitext(pdf_filename)[0]

    # 5. Full path of the output folder dedicated to this PDF
    specific_output_dir = os.path.join(main_output_dir, folder_name)

    # 6. Create that dedicated folder
    os.makedirs(specific_output_dir, exist_ok=True)

    print(f"  {i}. Processing: {pdf_filename}")
    print(f"     -> Output goes to: {specific_output_dir}")

    # --- Hook up any follow-up processing here ---
    # Every later file-saving operation should use `specific_output_dir` as its path, e.g.:
    # processed_text_path = os.path.join(specific_output_dir, 'Processed_text.txt')
    # with open(processed_text_path, 'w') as f:
    #     f.write("processed text goes here")

📄 Will process 1 PDF file(s):
📁 Main output directory set to: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output
----------------------------------------
  1. Processing: 101021acsoprd7b00291.pdf
     -> Output goes to: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291


## 📄 Step 1: PDF to text


In [3]:
print("🚀 Step 1/5: PDF to text...")
print("=" * 50)

# Use the unified PDF processing module
processor = PDFUnifiedProcessor()

# Convert the PDFs to text
output_files = save_contents_to_specific_folders(pdf_files, main_output_dir)

print(f"✅ PDF to text complete! Generated {len(output_files)} text file(s):")
for i, file in enumerate(output_files, 1):
    print(f"  {i}. {file}")

    # Show the line count and file size
    if os.path.exists(file):
        with open(file, 'r', encoding='utf-8', errors='ignore') as f:
            lines = f.readlines()
            print(f"     📊 Lines: {len(lines)}")
            print(f"     📏 File size: {os.path.getsize(file)} bytes")


🚀 Step 1/5: PDF to text...
✅ PDF to text complete! Generated 1 text file(s):
  1. /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291/101021acsoprd7b00291.txt
     📊 Lines: 731
     📏 File size: 33175 bytes


### 🔍 Inspect the PDF-to-text result


In [4]:
# Look at the first file in detail
sample_file = output_files[0]
print(f"📖 Viewing file: {os.path.basename(sample_file)}")
print("=" * 50)

with open(sample_file, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

print(f"📊 Total characters: {len(content)}")
print(f"📊 Total lines: {len(content.splitlines())}")
print("\n📄 Preview of the first 500 characters:")
print("-" * 30)
print(content[:500] + "..." if len(content) > 500 else content)


📖 Viewing file: 101021acsoprd7b00291.txt
📊 Total characters: 32660
📊 Total lines: 731

📄 Preview of the first 500 characters:
------------------------------
Process Development and Scale-up of the Continuous Flow Nitration
of Triﬂuoromethoxybenzene
Zhenghui Wen,†,‡ Fengjun Jiao,† Mei Yang,† Shuainan Zhao,†,‡ Feng Zhou,†,‡ and Guangwen Chen*,†
†Dalian National Laboratory for Clean Energy, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023,
China
‡University of Chinese Academy of Sciences, Beijing 100049, China
*
S Supporting Information
ABSTRACT: In this work, continuous ﬂow nitration of triﬂuoromethoxybenzene (TFMB) was...


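### 🔍 Structured PDF parsing (optional)


In [5]:
# Optional: use structured parsing to extract the abstract and conclusion sections
print("🔍 Structured PDF parsing (abstract and conclusion)")
print("=" * 50)

structured_results = []

for i, pdf_path in enumerate(pdf_files, 1):
    print(f"\n📄 Processing file {i}/{len(pdf_files)}: {os.path.basename(pdf_path)}")

    # Run structured parsing through the unified processor
    result = processor.process_pdf_comprehensive(pdf_path, main_output_dir, mode='structured')
    structured_results.append(result)

    # Show the result
    # Note: len(lines) is a line count, so it can differ from the paragraph
    # count that the processor itself reports for the same section.
    for section, file_path in result.items():
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                lines = f.readlines()
            print(f"  ✅ {section}: {len(lines)} paragraphs")

print(f"\n🎉 Structured parsing complete!")


🔍 Structured PDF parsing (abstract and conclusion)

📄 Processing file 1/1: 101021acsoprd7b00291.pdf
✅ other section: 8 paragraphs -> /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291/101021acsoprd7b00291_other.txt
  ✅ other: 663 paragraphs

🎉 Structured parsing complete!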


## 📝 Step 2: Text preprocessing


In [6]:
print("🚀 Step 2/5: Text preprocessing...")
print("=" * 50)

total_filtered_count = 0
processed_files = []

# Process the TXT files produced by the previous step, not the PDFs
for i, txt_file in enumerate(output_files, 1):
    print(f"\n📄 Processing file {i}/{len(output_files)}: {os.path.basename(txt_file)}")

    # Run text preprocessing on the TXT file
    processed_file_path, filtered_count = process_text_file_for_processing(txt_file)
    processed_files.append(processed_file_path)
    total_filtered_count += filtered_count

    print(f"  ✅ Preprocessing done, filtered out {filtered_count} paragraphs")
    print(f"  📁 Output file: {processed_file_path}")

print(f"\n🎉 Text preprocessing complete! Filtered out {total_filtered_count} paragraphs in total")


🚀 Step 2/5: Text preprocessing...

📄 Processing file 1/1: 101021acsoprd7b00291.txt
  ✅ Preprocessing done, filtered out 34 paragraphs
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291/Processed_101021acsoprd7b00291.txt

🎉 Text preprocessing complete! Filtered out 34 paragraphs in total


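### 🔍 Inspect the preprocessing results


### 💡 About LLM content filtering

**What this step does:**
- Input: the paragraphs selected by embedding similarity (e.g. 20 paragraphs)
- Processing: the Nous-Hermes-Llama2-13B model judges whether each paragraph is genuinely about surface chemical reactions
- Output: a further-narrowed set of relevant paragraphs (e.g. 10 paragraphs)

**Model selection:**
- Preferred: Nous-Hermes-Llama2-13B-Instruct (more capable, better performance)
- Fallback: nous-hermes-llama2-13b (stable and reliable)

**Why the paragraph count drops:**
- Embedding similarity is a coarse relevance signal: it ranks paragraphs by vector closeness, not by reading what they say
- The LLM filter reads each paragraph and judges its content more precisely
- Only the genuinely relevant paragraphs are kept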
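
The filtering stage described above can be sketched as a loop that asks a classifier about each paragraph and falls back to the unfiltered input when nothing survives, mirroring the Top-N fallback used in step 4 of this notebook. This is only an illustration: `filter_paragraphs` and `keyword_stub` are hypothetical names, and the keyword check is a stand-in for the real LLM call, which the notebook hides inside `process_text_file_for_filter`.

```python
def filter_paragraphs(paragraphs, is_relevant):
    """Keep paragraphs the classifier accepts; fall back to the input if none pass."""
    kept = [p for p in paragraphs if is_relevant(p)]
    return kept if kept else list(paragraphs)

def keyword_stub(paragraph):
    """Hypothetical stand-in for the LLM yes/no judgment: a crude keyword check."""
    keywords = ("reaction", "flow rate", "temperature")
    text = paragraph.lower()
    return any(keyword in text for keyword in keywords)

candidates = [
    "The reaction was run in a microchannel reactor.",
    "The authors thank the funding agency.",
    "A faster flow rate reduced dinitration.",
]

relevant = filter_paragraphs(candidates, keyword_stub)
print(len(candidates), "->", len(relevant))

# When the classifier rejects everything, the input is returned unchanged.
fallback = filter_paragraphs(candidates, lambda paragraph: False)
print(len(fallback))
```

Passing the classifier in as a callable makes it possible to debug the surrounding logic with a cheap stub before loading a 13B model.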


In [7]:
# Analyze the paragraph segmentation in detail
def analyze_paragraph_segmentation(txt_file):
    print(f"🔍 Analyzing file: {os.path.basename(txt_file)}")
    print("=" * 50)

    with open(txt_file, 'r', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    print(f"📊 Total lines in original file: {len(lines)}")

    # Simulate the segmentation: blank lines separate paragraphs
    current_segment = []
    segments = []
    empty_lines_count = 0

    for line in lines:
        if line.strip():  # non-empty line
            current_segment.append(line.strip())
        else:  # empty line
            empty_lines_count += 1
            if current_segment:  # close the current paragraph if it has content
                segments.append(' '.join(current_segment))
                current_segment = []

    # Flush the last paragraph
    if current_segment:
        segments.append(' '.join(current_segment))

    print(f"📊 Empty lines: {empty_lines_count}")
    print(f"📊 Paragraphs after segmentation: {len(segments)}")
    # Guard against division by zero for an empty file
    average_length = sum(len(seg) for seg in segments) / len(segments) if segments else 0.0
    print(f"📊 Average paragraph length: {average_length:.1f} characters")

    print("\n📄 Preview of the first 5 paragraphs:")
    print("-" * 30)
    for i, segment in enumerate(segments[:5]):
        print(f"Paragraph {i+1} (length: {len(segment)}): {segment[:150]}..." if len(segment) > 150 else f"Paragraph {i+1} (length: {len(segment)}): {segment}")

    return segments

# Analyze the paragraph segmentation of the original TXT file
if output_files:
    original_segments = analyze_paragraph_segmentation(output_files[0])

    # Paragraph count after preprocessing, hard-coded from an earlier run;
    # update it whenever the input or the preprocessing rules change.
    processed_paragraph_count = 25

    print(f"\n📊 Summary:")
    print(f"  - Original paragraph count: {len(original_segments)}")
    print(f"  - Paragraphs after preprocessing: {processed_paragraph_count}")
    print(f"  - Paragraphs filtered out: {len(original_segments) - processed_paragraph_count}")
    if original_segments:
        print(f"  - Retention: {processed_paragraph_count / len(original_segments) * 100:.1f}%")


🔍 Analyzing file: 101021acsoprd7b00291.txt
📊 Total lines in original file: 731
📊 Empty lines: 43
📊 Paragraphs after segmentation: 43
📊 Average paragraph length: 756.1 characters

📄 Preview of the first 5 paragraphs:
------------------------------
Paragraph 1 (length: 1160): Process Development and Scale-up of the Continuous Flow Nitration of Triﬂuoromethoxybenzene Zhenghui Wen,†,‡ Fengjun Jiao,† Mei Yang,† Shuainan Zhao,†...
Paragraph 2 (length: 932): Triﬂuoromethoxy aniline is an important intermediate involved in the synthesis of a wide range of ﬁne chemicals, for example, pesticides,1 pharmaceuti...
Paragraph 3 (length: 969): Therefore, it is very important and urgent to develop a new strategy based on process intensiﬁcation technology to improve the productivity and proces...
Paragraph 4 (length: 926): Brocklehurst et al.8 used a commercially available continuous ﬂow reactor to perform the challenging nitration of 2-amino-4-bromobenzoic acid methyl e...
Paragraph 5 (length: 1006): Therefore, the selectivity of m- NB and DNB should be controlled as low as possible to cut the cost of the separation. The objective was to minimize t...

📊 Summary:
  - Original

## 🔍 Step 3: Embedding and similarity computation


In [8]:
print("🚀 Step 3/5: Embedding and similarity computation...")
print("=" * 50)

embedding_files = []

# Use the files preprocessed in the previous step
for i, processed_file in enumerate(processed_files, 1):
    print(f"\n📄 Processing file {i}/{len(processed_files)}: {os.path.basename(processed_file)}")

    # Compute embeddings and similarity
    embedding_file_path = process_text_file_for_embedding(processed_file)
    embedding_files.append(embedding_file_path)

    print(f"  ✅ Embedding and similarity computation done")
    print(f"  📁 Output file: {embedding_file_path}")

print(f"\n🎉 Embedding and similarity computation complete!")


🚀 Step 3/5: Embedding and similarity computation...

📄 Processing file 1/1: Processed_101021acsoprd7b00291.txt
  ✅ Embedding and similarity computation done
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291/Embedding_101021acsoprd7b00291.txt

🎉 Embedding and similarity computation complete!


### 🔍 Inspect the embedding results


In [9]:
# Inspect the embedding results
sample_embedding = embedding_files[0]
print(f"📖 Viewing embedding file: {os.path.basename(sample_embedding)}")
print("=" * 50)

with open(sample_embedding, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

# len(lines) counts every line, including the blank lines between paragraphs
print(f"📊 Lines after embedding similarity filtering (blank lines included): {len(lines)}")
print("\n📄 Preview of the highest-similarity paragraphs:")
print("-" * 30)
# i is the line index, so skipped blank lines leave gaps in the numbering
for i, line in enumerate(lines[:3]):
    if line.strip():
        print(f"Paragraph {i+1}: {line[:200]}..." if len(line) > 200 else f"Paragraph {i+1}: {line}")


📖 Viewing embedding file: Embedding_101021acsoprd7b00291.txt
📊 Lines after embedding similarity filtering (blank lines included): 20

📄 Preview of the highest-similarity paragraphs:
------------------------------
Paragraph 1: The general laboratory process developments were carried out in microchannel reactors. TFMB (7.57 M, ﬂow rate of 0.4−1.0 mL·min−1) and a solution of fuming nitric acid in concentrated sulfuric acid (2...
Paragraph 3: A faster ﬂow rate would result in a better mixing eﬀect, which can make the organic compound more evenly dispersed in the acid phase. Besides, a faster ﬂow rate would also decrease the residence time....


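## 🤖 Step 4: LLM content filtering


In [10]:
print("🚀 Step 4/5: LLM content filtering...")
print("=" * 50)

filter_files = []

for i, embedding_file in enumerate(embedding_files, 1):
    print(f"\n📄 Processing file {i}/{len(embedding_files)}: {os.path.basename(embedding_file)}")

    # Run the LLM content filter
    filter_file_path = process_text_file_for_filter(embedding_file)

    # Fall back to the embedding Top-N file when the filter keeps nothing.
    # The check runs for every file and closes the handle it opens.
    with open(filter_file_path, 'r', encoding='utf-8', errors='ignore') as f:
        kept = sum(1 for line in f if line.strip())
    if kept == 0:
        print("⚠️ Filter output is empty, falling back to the embedding Top-N")
        filter_file_path = embedding_file

    filter_files.append(filter_file_path)

    print(f"  ✅ LLM content filtering done")
    print(f"  📁 Output file: {filter_file_path}")

print(f"\n🎉 LLM content filtering complete!")


🚀 Step 4/5: LLM content filtering...

📄 Processing file 1/1: Embedding_101021acsoprd7b00291.txt
🔍 Trying to load model, path: /Users/zhaowenyuan/Projects/FCPDExtractor/models
✅ Successfully loaded model nous-hermes-llama2-13b.Q4_0.gguf
🔍 Processing file: Embedding_101021acsoprd7b00291.txt
📊 Original paragraph count: 10

🤖 Step 1: LLM content filtering...
...Starting paragraph classification with the LLM...
...Classification done, kept 10 relevant paragraphs.
✅ Paragraphs after filtering: 10
  ✅ LLM content filtering done
  📁 Output file: /Users/zhaowenyuan/Projects/FCPDExtractor/Data/output/101021acsoprd7b00291/Embedding_101021acsoprd7b00291_Filtered.txt

🎉 LLM content filtering complete!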


### 🔍 Inspect the LLM filtering results


In [11]:
# Inspect the LLM filtering results
sample_filter = filter_files[0]
print(f"📖 Viewing filtered file: {os.path.basename(sample_filter)}")
print("=" * 50)

with open(sample_filter, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

# len(lines) counts every line, including the blank lines between paragraphs
print(f"📊 Lines after LLM filtering (blank lines included): {len(lines)}")
print("\n📄 Preview of the filtered paragraphs:")
print("-" * 30)
# i is the line index, so skipped blank lines leave gaps in the numbering
for i, line in enumerate(lines[:3]):
    if line.strip():
        print(f"Paragraph {i+1}: {line[:200]}..." if len(line) > 200 else f"Paragraph {i+1}: {line}")


📖 Viewing filtered file: Embedding_101021acsoprd7b00291_Filtered.txt
📊 Lines after LLM filtering (blank lines included): 20

📄 Preview of the filtered paragraphs:
------------------------------
Paragraph 1: The general laboratory process developments were carried out in microchannel reactors. TFMB (7.57 M, ﬂow rate of 0.4−1.0 mL·min−1) and a solution of fuming nitric acid in concentrated sulfuric acid (2...
Paragraph 3: A faster ﬂow rate would result in a better mixing eﬀect, which can make the organic compound more evenly dispersed in the acid phase. Besides, a faster ﬂow rate would also decrease the residence time....


In [12]:
# 1) Point the strict model name at the newly downloaded local file
import os, importlib
os.environ["FCPD_STRICT_MODEL_NAME"] = "meta-llama-3.1-8b-instruct-q4_k_m-2.gguf"
print("STRICT =", os.getenv("FCPD_STRICT_MODEL_NAME"))

# 2) Force a module reload and re-import the function so a stale version is not used
import Unified_Text_Processor as UTP
importlib.reload(UTP)
from Unified_Text_Processor import process_text_file_for_summerized


STRICT = meta-llama-3.1-8b-instruct-q4_k_m-2.gguf


## 📊 Step 5: Abstraction and summarization


In [13]:
import os  # make sure the os module is available

print("🚀 Step 5/5: Abstraction and summarization...")
print("=" * 50)

abstract_files = []
summarized_files = []

# Run abstraction and summarization on the filtered files
for i, filter_file in enumerate(filter_files, 1):
    print(f"\n📄 Processing file {i}/{len(filter_files)}: {os.path.basename(filter_file)}")

    # --- Abstraction step ---
    # 1. Derive the expected output path from the input name.
    #    The file names reported by this notebook append a suffix to the input
    #    name ('..._Filtered.txt' -> '..._Filtered_Abstract.txt'), so build the
    #    path that way instead of replacing '_Filtered'.
    base, ext = os.path.splitext(filter_file)
    abstract_file_path = f"{base}_Abstract{ext}"

    # 2. Skip the expensive LLM call when that output already exists
    if os.path.exists(abstract_file_path):
        print(f"  ⏭️  Existing file found, skipping [abstraction]: {os.path.basename(abstract_file_path)}")
    else:
        print("  ⏳  Running [abstraction]...")
        # Assumes process_text_file_for_abstract returns the path of the file it creates
        abstract_file_path = process_text_file_for_abstract(filter_file)
        print(f"  ✅ Abstraction done: {os.path.basename(abstract_file_path)}")

    # Record the path whether or not the step was skipped
    abstract_files.append(abstract_file_path)

    # --- Summarization step ---
    # 1. Derive the output path the same way
    #    ('..._Abstract.txt' -> '..._Abstract_Summarized.txt')
    base, ext = os.path.splitext(abstract_file_path)
    summarized_file_path = f"{base}_Summarized{ext}"

    # 2. Skip the step when the output already exists
    if os.path.exists(summarized_file_path):
        print(f"  ⏭️  Existing file found, skipping [summarization]: {os.path.basename(summarized_file_path)}")
    else:
        # Summarization takes the abstracted text as input rather than the
        # filtered text, so abstract_file_path is passed here.
        print("  ⏳  Running [summarization]...")
        summarized_file_path = process_text_file_for_summerized(abstract_file_path)
        print(f"  ✅ Summarization done: {os.path.basename(summarized_file_path)}")

    # Record the path whether or not the step was skipped
    summarized_files.append(summarized_file_path)


print(f"\n🎉 Abstraction and summarization complete!")

🚀 Step 5/5: Abstraction and summarization...

📄 Processing file 1/1: Embedding_101021acsoprd7b00291_Filtered.txt
  ⏳  Running [abstraction]...
🔍 Trying to load model, path: /Users/zhaowenyuan/Projects/FCPDExtractor/models
✅ Successfully loaded model nous-hermes-llama2-13b.Q4_0.gguf
🔍 Processing file: Embedding_101021acsoprd7b00291_Filtered.txt
📊 Original paragraph count: 10

📝 Step 2: Text abstraction...
Abstract 1/10:
The general laboratory process developments were carried out in microchannel reactors. TFMB (7.57 M, ﬂow rate of 0.4−1.0 mL·min−1) and a solution of fuming nitric acid in concentrated sulfuric acid (2.49−3.19 M, ﬂow rate of 0.8− 2.2 mL·min−1) were delivered by two syringe pumps (TYD01- 02, Lead Fluid), respectively. The ﬂuids reacted in capillaries with a length deﬁned by the desired residence tim
Abstract 2/10:
A faster ﬂow rate would result in a better mixing eﬀect, which can make the organic compound more evenly dispersed in the acid phase. Besides, a faster ﬂow rate would also decrease the residence time. Therefore, a faster ﬂow rate could reduce the occurrence of dinitration. 2.1.6. Eﬀect of 

### 🔍 Inspect the final result


In [15]:
# Inspect the final summary
sample_summarized = summarized_files[0]
print(f"📖 Viewing final summary: {os.path.basename(sample_summarized)}")
print("=" * 50)

with open(sample_summarized, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

print(f"📊 Summary length: {len(content)} characters")
print("\n📄 Summary preview:")
print("-" * 30)
print(content[:1000] + "..." if len(content) > 1000 else content)


📖 Viewing final summary: Embedding_101021acsoprd7b00291_Filtered_Abstract_Summarized.txt
📊 Summary length: 7311 characters

📄 Summary preview:
------------------------------
### Step-by-Step Solution ###

## Step 1: Identify the reaction type
The paragraph does not explicitly mention a specific chemical reaction or process beyond mentioning it as part of general laboratory process developments. Therefore, we will leave this field blank.

## Step 2: Extract reactants and their roles
From the text, two substances are mentioned:
- TFMB (7.57 M) delivered by one syringe pump.
- A solution of fuming nitric acid in concentrated sulfuric acid (2.49−3.19 M), also delivered by a separate syringe pump.

Given that both are involved in reacting within capillaries, we can infer they act as reactants but cannot determine their roles without further context or information on the reaction type.

## Step 3: Extract products and yields
There is no mention of specific chemical products formed from this process. Therefore, we will leave these 

## 📊 Processing result statistics


In [16]:
# Collect statistics on the processing results
print("📊 OSSExtractor processing result statistics")
print("=" * 50)

stats = []
for i, file_path in enumerate(output_files):
    filename = os.path.basename(file_path)

    # File size at each step.
    # original_size is the size of the extracted .txt file, not of the source PDF.
    original_size = os.path.getsize(file_path) if os.path.exists(file_path) else 0

    processed_file = processed_files[i] if i < len(processed_files) else None
    processed_size = os.path.getsize(processed_file) if processed_file and os.path.exists(processed_file) else 0

    embedding_file = embedding_files[i] if i < len(embedding_files) else None
    embedding_size = os.path.getsize(embedding_file) if embedding_file and os.path.exists(embedding_file) else 0

    filter_file = filter_files[i] if i < len(filter_files) else None
    filter_size = os.path.getsize(filter_file) if filter_file and os.path.exists(filter_file) else 0

    summarized_file = summarized_files[i] if i < len(summarized_files) else None
    summarized_size = os.path.getsize(summarized_file) if summarized_file and os.path.exists(summarized_file) else 0

    stats.append({
        'File': filename,
        'Extracted text (MB)': round(original_size / 1024 / 1024, 2),
        'Preprocessed (KB)': round(processed_size / 1024, 2),
        'Embedding filter (KB)': round(embedding_size / 1024, 2),
        'LLM filter (KB)': round(filter_size / 1024, 2),
        'Final summary (KB)': round(summarized_size / 1024, 2)
    })

# Show the statistics table
df_stats = pd.DataFrame(stats)
display(df_stats)

print("\n🎉 All processing steps complete!")


📊 OSSExtractor processing result statistics


| | File | Extracted text (MB) | Preprocessed (KB) | Embedding filter (KB) | LLM filter (KB) | Final summary (KB) |
|---|---|---|---|---|---|---|
| 0 | 101021acsoprd7b00291.txt | 0.03 | 26.96 | 8.26 | 8.26 | 7.19 |



🎉 All processing steps complete!


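## 🔧 Debugging and optimization suggestions


In [None]:
print("🔧 Debugging and optimization suggestions")
print("=" * 50)
print("""
1. 📊 Check the embedding similarity threshold
   - If too few paragraphs pass, lower the similarity threshold
   - If too many paragraphs pass, raise the similarity threshold

2. 🤖 Tune the LLM prompts
   - Adjust the question wording in Filter.py
   - Adjust the parameter extraction prompt in Summerized.py

3. 📝 Adjust the text preprocessing
   - Change the filtering rules in TXT_Processing.py
   - Adjust the paragraph segmentation strategy

4. 🔍 Check model performance
   - Review the quality of the LLM responses
   - Consider tuning model parameters (temp, top_p, etc.)

5. 📈 Visualize the pipeline
   - Plot how the data volume changes at each step
   - Analyze processing efficiency
""")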


## 📊 Viewing and analyzing the results


In [17]:
# View the final results
print("📊 Viewing processing results...")
print("=" * 50)

# List every generated file
all_files = {
    'Raw text': output_files,
    'Preprocessed text': processed_files,
    'Embedding files': embedding_files,
    'Filtered files': filter_files,
    'Abstract files': abstract_files,
    'Summary files': summarized_files
}

for category, files in all_files.items():
    print(f"\n📁 {category}:")
    for i, file in enumerate(files, 1):
        if os.path.exists(file):
            with open(file, 'r', encoding='utf-8', errors='ignore') as f:
                lines = f.readlines()
            print(f"  {i}. {os.path.basename(file)} ({len(lines)} lines)")
        else:
            print(f"  {i}. {os.path.basename(file)} (file not found)")

print(f"\n🎉 Processing complete! Processed {len(pdf_files)} PDF file(s)")


📊 Viewing processing results...

📁 Raw text:
  1. 101021acsoprd7b00291.txt (731 lines)

📁 Preprocessed text:
  1. Processed_101021acsoprd7b00291.txt (68 lines)

📁 Embedding files:
  1. Embedding_101021acsoprd7b00291.txt (20 lines)

📁 Filtered files:
  1. Embedding_101021acsoprd7b00291_Filtered.txt (20 lines)

📁 Abstract files:
  1. Embedding_101021acsoprd7b00291_Filtered_Abstract.txt (20 lines)

📁 Summary files:
  1. Embedding_101021acsoprd7b00291_Filtered_Abstract_Summarized.txt (137 lines)

🎉 Processing complete! Processed 1 PDF file(s)


## 🔍 Result analysis


In [None]:
# Analyze the final results
print("🔍 Analyzing final results...")
print("=" * 50)
# Show the contents of the summary files
if summarized_files:
    print("📄 Final summary results:")
    for i, file in enumerate(summarized_files, 1):
        print(f"\nFile {i}: {os.path.basename(file)}")
        if os.path.exists(file):
            with open(file, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            print("Content preview:")
            print("-" * 30)
            print(content[:500] + "..." if len(content) > 500 else content)
        else:
            print("File not found")


🔍 Analyzing final results...
📄 Final summary results:

File 1: Embedding_101021acsoprd7b00291_Filtered_Abstract_Summarized.txt
Content preview:
------------------------------
### Step-by-Step Solution ###

## Step 1: Identify the reaction type
The paragraph does not explicitly mention a specific chemical reaction or process beyond mentioning it as part of general laboratory process developments. Therefore, we will leave this field blank.

## Step 2: Extract reactants and their roles
From the text, two substances are mentioned:
- TFMB (7.57 M) delivered by one syringe pump.
- A solution of fuming nitric acid in concentrated sulfuric acid (2.49−3.19 M), also delivered ...


In [None]:
# Show processing statistics
print("\n📊 Processing statistics:")
print("=" * 30)

total_lines_original = 0
total_lines_filtered = 0

for i, (original, filtered) in enumerate(zip(processed_files, filter_files), 1):
    # Skip the pair unless both files exist, so neither count is left undefined
    if not (os.path.exists(original) and os.path.exists(filtered)):
        print(f"File {i}: missing input, skipped")
        continue

    # These are line counts; blank separator lines are included
    with open(original, 'r', encoding='utf-8', errors='ignore') as f:
        original_lines = len(f.readlines())
    with open(filtered, 'r', encoding='utf-8', errors='ignore') as f:
        filtered_lines = len(f.readlines())

    total_lines_original += original_lines
    total_lines_filtered += filtered_lines

    # Guard against division by zero for an empty preprocessed file
    if original_lines:
        print(f"File {i}: {original_lines} → {filtered_lines} lines (retention: {filtered_lines/original_lines*100:.1f}%)")

print(f"\nTotal: {total_lines_original} → {total_lines_filtered} lines")
if total_lines_original:
    print(f"Overall retention: {total_lines_filtered/total_lines_original*100:.1f}%")

print(f"\n🎉 Processing complete!")
