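### **Block 1: Environment Setup, Data Loading, and spaCy Configuration**

**Goal:** Import all required libraries, load the core data and rule files, and configure `spaCy` for the processing steps that follow.

**Key operation:** Iterate over every "super word" in `merge_dict.pkl` (for example `peoples_bank_of_china`) and register each one as a **special case** in the `spaCy` tokenizer, so that the tokenizer keeps each super word as a single token instead of splitting it.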

In [1]:
# =============================================================================
# --- Block 1: Environment setup, data loading, and spaCy configuration ---
# =============================================================================

# Purpose: import every Python library the project needs.
import pandas as pd
import os
import pickle
import time
from tqdm.auto import tqdm
import psutil
import spacy
from spacy.tokenizer import Tokenizer
from spacy.attrs import ORTH
import gc
import nltk
from nltk.corpus import wordnet
import re  # regular expressions, used by the text normalizer in Block 2

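# --- Core configuration ---
# Purpose: global switches that make it easy to debug and change run modes.
RUNNING_ENV = 'local'
TEST_MODE = False
TEST_SAMPLE_SIZE = 1000

# --- Parallel processing configuration ---
# Purpose: detect the number of physical CPU cores and pick a sensible process count.
# psutil.cpu_count can return None when the count cannot be determined, so fall back to 1.
cpu_cores = psutil.cpu_count(logical=False) or 1
# Leave one core free, cap at 8 processes, and never go below 1.
N_PROCESSES = max(1, min(cpu_cores - 1, 8))

# Purpose: on Windows, force single-process mode to avoid known spaCy multiprocessing issues.
if os.name == 'nt':
    print("Windows detected: forcing single-process mode (N_PROCESSES=1) to avoid known spaCy multiprocessing issues.")
    N_PROCESSES = 1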

# --- Path management ---
# Purpose: build the correct file paths for the current environment (local or server).
print(f"Detected runtime environment: [{RUNNING_ENV.upper()}]")
BASE_DATA_PROCESSED_PATH = '../data_processed' if RUNNING_ENV == 'local' else '/mnt/data/data_processed'
BASE_CONFIG_PATH = '../configs' if RUNNING_ENV == 'local' else '/mnt/data/configs'

# Purpose: make sure the config directory exists so later writes do not fail.
if not os.path.exists(BASE_CONFIG_PATH):
    os.makedirs(BASE_CONFIG_PATH)
    print(f"Created configs directory: {BASE_CONFIG_PATH}")

# Purpose: define the full path of every input and output file.
# Input files
SOLIDIFIED_TEXT_PATH = os.path.join(BASE_DATA_PROCESSED_PATH, 'china_news_solidified.pkl')
MERGE_DICT_PATH = os.path.join(BASE_DATA_PROCESSED_PATH, 'merge_dict.pkl')
NEW_STOPWORDS_PATH = os.path.join(BASE_DATA_PROCESSED_PATH, 'new_stopwords.pkl')
PROJECT_STOPWORDS_PATH = os.path.join(BASE_CONFIG_PATH, 'project_specific_stopwords.txt')
# Output files
CLEANED_TOKENS_PKL_PATH = os.path.join(BASE_DATA_PROCESSED_PATH, 'china_news_cleaned_tokens.pkl')
CLEANED_TOKENS_CSV_PATH = os.path.join(BASE_DATA_PROCESSED_PATH, 'china_news_cleaned_tokens_for_review.csv')

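# --- Print the final configuration ---
# Purpose: show every setting at startup so a run can be checked and traced later.
print("\n--- Environment setup ---")
if TEST_MODE:
    print(f"🚀🚀🚀 Running in [QUICK TEST MODE]: only the first {TEST_SAMPLE_SIZE} articles will be processed! 🚀🚀🚀")
else:
    print("🚢🚢🚢 Running in [FULL DATA MODE]: all articles will be processed. 🚢🚢🚢")
print(f"Solidified text input: {SOLIDIFIED_TEXT_PATH}")
print(f"Merge dictionary input: {MERGE_DICT_PATH}")
print(f"Reviewed stopwords input: {NEW_STOPWORDS_PATH}")
print(f"Project stopwords input: {PROJECT_STOPWORDS_PATH}")
print(f"Final tokens output (PKL): {CLEANED_TOKENS_PKL_PATH}")
print(f"Final tokens output (CSV): {CLEANED_TOKENS_CSV_PATH}")
print(f"spaCy will use {N_PROCESSES} process(es).")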


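# --- Load data and rules ---
# Purpose: read the entity-solidified news data and the merge-rule dictionary from disk.
print("\n--- Stage 1: Loading data and rules ---\n")
try:
    read_nrows = TEST_SAMPLE_SIZE if TEST_MODE else None
    df = pd.read_pickle(SOLIDIFIED_TEXT_PATH)
    # read_pickle has no nrows option, so truncate after loading.
    if read_nrows:
        df = df.head(read_nrows)
    print(f"✅ Loaded {len(df)} solidified articles.")

    with open(MERGE_DICT_PATH, 'rb') as f:
        merge_dict = pickle.load(f)
    print(f"✅ Loaded {len(merge_dict)} merge rules.")
except FileNotFoundError as e:
    print(f"❌ Error: missing required input file: {e.filename}. Make sure the previous notebooks have run successfully.")
    # An empty DataFrame makes every later stage skip itself.
    df = pd.DataFrame()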

# --- Load and configure the spaCy and NLTK models ---
# Purpose: load the spaCy model and add tokenizer special cases so solidified entities are not split.
# Also check for the NLTK WordNet corpus and download it if it is missing.
if not df.empty:
    print("\nLoading and configuring the spaCy model...")
    start_time_spacy = time.time()
    nlp = spacy.load("en_core_web_lg")

    # One special case per distinct standard form: the token text is kept exactly as written.
    special_cases = {}
    for standard_form in set(merge_dict.values()):
        special_cases[standard_form] = [{ORTH: standard_form}]

    for case, rule in special_cases.items():
        nlp.tokenizer.add_special_case(case, rule)

    print(f"✅ spaCy model loaded and configured; added {len(special_cases)} special tokenization rules. Elapsed: {time.time() - start_time_spacy:.2f} s.")

    print("\nChecking the NLTK WordNet corpus...")
    try:
        wordnet.ensure_loaded()
        print("✅ NLTK WordNet is loaded.")
    except LookupError:
        print("NLTK WordNet not found; downloading...")
        nltk.download('wordnet')
        nltk.download('omw-1.4')
        print("✅ NLTK WordNet download complete.")

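Windows detected: forcing single-process mode (N_PROCESSES=1) to avoid known spaCy multiprocessing issues.
Detected runtime environment: [LOCAL]

--- Environment setup ---
🚢🚢🚢 Running in [FULL DATA MODE]: all articles will be processed. 🚢🚢🚢
Solidified text input: ../data_processed\china_news_solidified.pkl
Merge dictionary input: ../data_processed\merge_dict.pkl
Reviewed stopwords input: ../data_processed\new_stopwords.pkl
Project stopwords input: ../configs\project_specific_stopwords.txt
Final tokens output (PKL): ../data_processed\china_news_cleaned_tokens.pkl
Final tokens output (CSV): ../data_processed\china_news_cleaned_tokens_for_review.csv
spaCy will use 1 process(es).

--- Stage 1: Loading data and rules ---

✅ Loaded 180630 solidified articles.
✅ Loaded 1826 merge rules.

Loading and configuring the spaCy model...
✅ spaCy model loaded and configured; added 1507 special tokenization rules. Elapsed: 1.23 s.

Checking the NLTK WordNet corpus...
✅ NLTK WordNet is loaded.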
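
The log reports 1826 merge rules but only 1507 special tokenization rules. That gap is what one would expect when several rules share the same target, because the cell deduplicates with `set(merge_dict.values())` before registering special cases. The sketch below illustrates the effect with made-up entries; the real contents of `merge_dict.pkl` are not shown in this notebook, and the "surface variant → standard form" layout is an assumption inferred from how the cell uses the dictionary.

```python
# Hypothetical merge rules: surface variant -> standard "super word".
# These entries are illustrative only and are not taken from merge_dict.pkl.
merge_dict = {
    "people's bank of china": "peoples_bank_of_china",
    "peoples bank of china": "peoples_bank_of_china",
    "pboc": "peoples_bank_of_china",
    "hong kong": "hong_kong",
}

# One tokenizer special case is needed per distinct standard form, not per rule.
standard_forms = set(merge_dict.values())

print(len(merge_dict))         # number of merge rules
print(sorted(standard_forms))  # distinct super words that need a special case
```
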


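### **Block 2: Integrated Streaming Deep Cleaning (with Graceful Fallback)**

**Goal:** Combine NLP processing and the multi-step cleaning procedure into a single, memory-efficient streaming pipeline.

**Key optimizations:**
1.  **Graceful fallback**: The code first **tries** to run in parallel with the configured `N_PROCESSES`. If that fails (for example, in certain non-Windows Jupyter environments), it catches the exception and **automatically falls back** to single-process mode (`n_process=1`) and retries, so the job can still complete.
2.  **Streaming**: Documents are consumed one at a time from the pipeline, so the parsed documents are never all held in memory at once. This avoids the `MemoryError` that can occur when processing the full dataset.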
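
A detail worth keeping in mind for the fallback: a generator-based pipeline is lazy, so an error in a worker only surfaces once iteration starts, not when the pipeline object is created. The following is a minimal sketch of the pattern in plain Python. `flaky_pipe` and `run_with_fallback` are hypothetical stand-ins written for this illustration; they are not part of spaCy, and the failure is simulated.

```python
from typing import Callable, Iterable, Iterator, List


def flaky_pipe(texts: Iterable[str], n_process: int) -> Iterator[str]:
    """Stand-in for a lazy pipeline: it fails only once iteration starts."""
    for text in texts:
        if n_process > 1:
            raise RuntimeError("simulated worker failure")
        yield text.upper()


def run_with_fallback(make_texts: Callable[[], Iterable[str]], n_process: int) -> List[str]:
    """Consume the pipeline inside the try block so that lazy errors are caught."""
    try:
        return list(flaky_pipe(make_texts(), n_process))
    except Exception as exc:
        if n_process == 1:
            raise  # already single-process; nothing left to fall back to
        print(f"parallel run failed ({exc}); retrying with n_process=1")
        # Rebuild the input: the first attempt may have consumed part of it.
        return list(flaky_pipe(make_texts(), 1))


articles = ["first article", "second article"]

# Creating the generator does not raise, even with a failing configuration.
lazy = flaky_pipe(iter(articles), n_process=4)
print(type(lazy).__name__)

result = run_with_fallback(lambda: iter(articles), n_process=4)
print(result)
```

The input is passed as a factory (`make_texts`) rather than as an iterator so that the retry starts from the first article again.
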

In [2]:
# =============================================================================
# --- Block 2: Integrated streaming deep cleaning (smart hyphen handling) ---
# =============================================================================

if 'nlp' in locals() and not df.empty:
    print("\n--- Stages 2 & 3: Starting the integrated streaming deep-cleaning pipeline (final version) ---")
    start_time_cleaning = time.time()

    # --- Function definitions for the final lemmatization strategy ---
    def convert_adv_to_adj_wordnet(adverb: str) -> str | None:
        """Purpose: use NLTK WordNet derivational links to map an adverb to its adjective form."""
        for syn in wordnet.synsets(adverb, pos=wordnet.ADV):
            for lemma in syn.lemmas():
                for related_form in lemma.derivationally_related_forms():
                    if related_form.synset().pos() == 'a':  # 'a' marks an adjective
                        # The original called .replace('_', '_') here, which is a no-op.
                        return related_form.name()
        return None

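    def convert_adv_to_adj_spacy(token: spacy.tokens.Token) -> str | None:
        """Purpose: fallback when WordNet fails. Strip the 'ly' suffix and validate against the spaCy vocab."""
        text_lower = token.text.lower()
        if not text_lower.endswith('ly'):
            return None
        potential_adj = text_lower[:-2]
        # "happily" -> "happi" -> "happy"
        if potential_adj.endswith('i'):
            potential_adj = potential_adj[:-1] + 'y'
        if not potential_adj or not potential_adj.isalpha():
            return None
        # Vocab.has_vector is a method; referenced without a call it is always truthy,
        # so check explicitly whether the model ships a vector table.
        vocab_has_vectors = nlp.vocab.vectors.shape[0] > 0
        if (vocab_has_vectors and nlp.vocab.has_vector(potential_adj)) or \
           (not vocab_has_vectors and potential_adj in nlp.vocab):
            # Guard against "early" -> "ear".
            if token.lemma_ == 'early' and potential_adj == 'ear':
                return None
            return potential_adj
        return None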

    def ultimate_lemmatizer(token: spacy.tokens.Token) -> str:
        """Purpose: dispatcher that picks a lemmatization strategy based on part of speech."""
        final_lemma = token.lemma_.lower()
        if token.pos_ == 'ADV':
            # Adverbs: try WordNet first, then the spaCy-based heuristic.
            wn_adj = convert_adv_to_adj_wordnet(token.text)
            if wn_adj: return wn_adj
            spacy_adj = convert_adv_to_adj_spacy(token)
            if spacy_adj: return spacy_adj
        return final_lemma

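    # --- Text normalization function ---
    def normalize_text_for_spacy(text: str) -> str:
        """
        Purpose: normalize text before it is passed to spaCy.
        1. Protect compounds: replace a hyphen with an underscore only when it sits between two letters.
        2. Clean separators: treat runs of two or more hyphens as a separator and replace them with a space.
        3. Collapse whitespace: reduce any run of whitespace characters to a single space.
        """
        if not isinstance(text, str):
            return ""
        text = re.sub(r'(?<=[a-zA-Z])-(?=[a-zA-Z])', '_', text)
        text = re.sub(r'-{2,}', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        return text.strip()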

    # ------------------------------------------------------------------------
    # Step 1: Build and save the final stopword set
    # ------------------------------------------------------------------------
    print("Building the final stopword set...")
    FINAL_STOP_WORDS = spacy.lang.en.stop_words.STOP_WORDS.copy()
    print(f"  - (base layer) Loaded {len(FINAL_STOP_WORDS)} default spaCy stopwords.")
    try:
        with open(NEW_STOPWORDS_PATH, 'rb') as f:
            new_stopwords = pickle.load(f)
            FINAL_STOP_WORDS.update(new_stopwords)
        print(f"  - (review layer) Added {len(new_stopwords)} stopwords from manual review.")
    except FileNotFoundError:
        print(f"  - ℹ️ Reviewed stopword file not found: {NEW_STOPWORDS_PATH}; skipping.")
    try:
        with open(PROJECT_STOPWORDS_PATH, 'r', encoding='utf-8') as f:
            # One stopword per line; blank lines are ignored.
            project_stopwords = {line.strip().lower() for line in f if line.strip()}
            FINAL_STOP_WORDS.update(project_stopwords)
        print(f"  - (expert layer) Added {len(project_stopwords)} project-specific stopwords.")
    except FileNotFoundError:
        print(f"  - ℹ️ Project-specific stopword file not found: {PROJECT_STOPWORDS_PATH}.")
    print(f"✅ Final stopword set built: {len(FINAL_STOP_WORDS)} unique stopwords.")

    try:
        stopwords_df = pd.DataFrame(sorted(FINAL_STOP_WORDS), columns=['stopword'])
        stopwords_output_path = os.path.join(BASE_DATA_PROCESSED_PATH, 'stopwords.csv')
        stopwords_df.to_csv(stopwords_output_path, index=False, encoding='utf-8-sig')
        print(f"✅ Final stopword list saved to: {stopwords_output_path}")
    except Exception as e:
        print(f"⚠️ Warning: failed to save the stopword CSV: {e}")

    # ------------------------------------------------------------------------
    # Step 2: Define the cleaning rules
    # ------------------------------------------------------------------------
    # Only these parts of speech are kept; super words bypass this filter.
    ALLOWED_POS = {'NOUN', 'PROPN', 'ADJ', 'VERB', 'ADV'}
    super_word_values = set(merge_dict.values())

    # ------------------------------------------------------------------------
    # Step 3: Build and run the streaming pipeline
    # ------------------------------------------------------------------------
    print("\nStreaming and cleaning texts...")
    disabled_pipes = ['parser', 'ner']

    def clean_doc(doc: spacy.tokens.Doc) -> list[str]:
        """Purpose: run one parsed document through the filter cascade and return its lemmas."""
        cleaned_tokens = []
        for token in doc:
            # Drop punctuation, whitespace, and purely numeric tokens.
            if token.is_punct or token.is_space or token.is_digit:
                continue
            # Super words are exempt from the part-of-speech filter.
            is_super = token.text in super_word_values
            if not is_super and token.pos_ not in ALLOWED_POS:
                continue
            lemma = ultimate_lemmatizer(token)
            # Drop stopwords and very short lemmas.
            if lemma in FINAL_STOP_WORDS or len(lemma) < 3:
                continue
            cleaned_tokens.append(lemma)
        return cleaned_tokens

    def run_pipeline(n_process: int) -> list[list[str]]:
        """Purpose: stream every article through nlp.pipe and collect one token list per article."""
        # Build a fresh generator per attempt; a failed attempt may have partly consumed the previous one.
        texts_iterator = (normalize_text_for_spacy(text) for text in df['content_solidified'].astype(str))
        docs_iterator = nlp.pipe(texts_iterator, disable=disabled_pipes, n_process=n_process, batch_size=200)
        return [clean_doc(doc) for doc in tqdm(docs_iterator, total=len(df), desc="Cleaning tokens (streaming)")]

    # nlp.pipe is lazy, so worker errors surface only during iteration. The whole
    # consumption loop therefore runs inside the try block; otherwise the fallback never fires.
    try:
        if N_PROCESSES != 1:
            print(f"Optimistic attempt: using {N_PROCESSES} processes in parallel...")
        final_token_lists = run_pipeline(N_PROCESSES)
    except Exception as e:
        if N_PROCESSES == 1:
            raise  # already single-process; nothing left to fall back to
        print(f"⚠️ Warning: parallel processing failed: {e}")
        print("✅ Graceful fallback: retrying in single-process mode...")
        final_token_lists = run_pipeline(1)

    # ------------------------------------------------------------------------
    # Step 4: Attach the results to the DataFrame and free memory
    # ------------------------------------------------------------------------
    df['tokens_for_lda'] = final_token_lists
    print(f"✅ Cleaning complete! Elapsed: {(time.time() - start_time_cleaning) / 60:.2f} min.")

    del final_token_lists
    gc.collect()

else:
    print("spaCy model not loaded or DataFrame is empty; skipping.")


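--- Stages 2 & 3: Starting the integrated streaming deep-cleaning pipeline (final version) ---
Building the final stopword set...
  - (base layer) Loaded 326 default spaCy stopwords.
  - (review layer) Added 736 stopwords from manual review.
  - (expert layer) Added 228 project-specific stopwords.
✅ Final stopword set built: 1038 unique stopwords.
✅ Final stopword list saved to: ../data_processed\stopwords.csv

Streaming and cleaning texts...


Cleaning tokens (streaming):   0%|          | 0/180630 [00:00<?, ?it/s]

✅ Cleaning complete! Elapsed: 108.12 min.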
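
The hyphen handling in `normalize_text_for_spacy` can be checked in isolation with the standard library alone. The sketch below copies the three regular expressions from the cell into a standalone function (renamed `normalize_text` here, a name chosen for this illustration) and runs it on a few inputs modeled on the headlines in this dataset.

```python
import re


def normalize_text(text: object) -> str:
    """Standalone copy of the three normalization steps used in Block 2."""
    if not isinstance(text, str):
        return ""
    # 1. A hyphen between two letters marks a compound: join with an underscore.
    text = re.sub(r'(?<=[a-zA-Z])-(?=[a-zA-Z])', '_', text)
    # 2. Two or more hyphens in a row act as a separator: replace with a space.
    text = re.sub(r'-{2,}', ' ', text)
    # 3. Collapse any run of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


examples = [
    "Red-Hot Real-Estate Market",
    "WASHINGTON -- The agency",
    "COVID-19 cases",
    None,
]
for raw in examples:
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Because the lookarounds require a letter on both sides, a hyphen next to a digit (as in `COVID-19`) is left untouched.
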


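### **Block 3: Final Check and Save**

**Goal:** Spot-check the cleaning results on a random sample and save the final outputs. This is the final deliverable of the **deep text cleaning** stage.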

In [3]:
# =============================================================================
# --- Block 3: Final check and save ---
# =============================================================================

# Purpose: spot-check the cleaned results and save them as the final pickle and CSV files.
if 'df' in locals() and 'tokens_for_lda' in df.columns:
    print("\n--- Stage 4: Final check and save ---")

    # Purpose: randomly sample up to 5 articles and show the solidified text next to
    # the final token list, for a quick manual check of the cleaning quality.
    print("\n--- Sample check ---")
    sample_size = min(5, len(df))
    if sample_size > 0:
        pd.set_option('display.max_colwidth', 300)
        display(df.sample(sample_size)[['content_solidified', 'tokens_for_lda']])

    # Purpose: save the final result in two formats.
    try:
        # Pickle: fast to read from later Python scripts and preserves data types.
        df.to_pickle(CLEANED_TOKENS_PKL_PATH)
        print(f"\n✅ [machine-readable] DataFrame with final tokens saved to pickle: {CLEANED_TOKENS_PKL_PATH}")

        # CSV: for manual review and auditing; token lists become space-separated strings.
        df_for_csv = df.copy()
        df_for_csv['tokens_for_lda'] = df_for_csv['tokens_for_lda'].apply(lambda tokens: ' '.join(tokens))
        df_for_csv.to_csv(CLEANED_TOKENS_CSV_PATH, index=False, encoding='utf-8-sig')
        print(f"✅ [human-readable] DataFrame with final tokens saved to CSV: {CLEANED_TOKENS_CSV_PATH}")

        print("\n🎉🎉🎉 Deep text cleaning pipeline complete! 🎉🎉🎉")
        print("\nNext step: run '04_Topic_Modeling.ipynb' or '04_sLDA_Modeling.ipynb' to start topic modeling.")

    except Exception as e:
        print(f"❌ Error while saving files: {e}")
else:
    print("No final token data available to save.")


--- Stage 4: Final check and save ---

--- Sample check ---


Unnamed: 0,content_solidified,tokens_for_lda
104063,"The Southwest Border Is Open for Business . Over the last few weeks, mayors, sheriffs, business leaders and citizens have joined together with a simple but powerful message: America's Southwest border communities are open for business. This is a message the American people need to hear. Unfortun...","[southwest, border, open, business, mayor, sheriff, business, leader, citizen, join, simple, powerful, message, america, southwest, border, open, business, message, american, hear, unfortunate, widespread, misperception, southwest, wrack, violence, spill, mexico, ongoing, drug, war, different, a..."
41544,justice_department Examines CIA Role In Probe Into Hughes's China Dealings . WASHINGTON -- The justice_department is examining whether the central_intelligence_agency impeded an investigation of Hughes Electronics Corp.'s business dealings in China. Investigators are trying to determine whether ...,"[justice_department, examine, cia, probe, hughes, china, washington, justice_department, examine, central_intelligence_agency, impede, investigation, hughes, electronics, corp., business, china, investigator, cia, official, act, improper, alert, hughes, electronics, employee, intelligence, commi..."
5842,"Film: The Mousekewitz Migration . I wouldn't say that walt_disney warped me exactly; however, on a recent trip to Italy I bought a set of ceramic bowls because the deer painted on them reminded me of Bambi. I still have small yellow records with the sound track from Cinderella tucked away in a c...","[film, mousekewitz, migration, walt_disney, warp, exact, trip, italy, buy, ceramic, bowl, deer, paint, remind, bambi, yellow, record, sound, track, cinderella, tuck, closet, simple, draw, childhood, mythology, disney, animated, teach, dream, evil, anxiety, especial, realize, coat, syrup, safe, b..."
101770,"Why We're Always Fooled by north_korea . According to Siegfried Hecker, the former director of the Los Alamos National Laboratory, north_korea is working on two new nuclear facilities, a light water power reactor in early stages of construction, and a ""modern, clean centrifuge plant"" for uranium...","[fool, north_korea, accord, siegfried, hecker, los, alamos, national, laboratory, north_korea, work, nuclear, facility, light, water, power, reactor, stage, modern, clean, centrifuge, plant, uranium, enrichment, mr., hecker, visit, facility, weekend, appear, near, complete, centrifuge, plant, pa..."
116722,"Big hong_kong Investors See Time to Sell --- Some of the Wealthiest Families Believe the Red-Hot Real-Estate Market Is Coming to an End; 'Selling at the Very Top' . hong_kong -- With the government growing confident that it has halted the meteoric rise in property prices, some of this city's big...","[hong_kong, investor, sell, wealthiest, families, red_hot, real_estate, market, end, sell, hong_kong, government, grow, confident, halt, meteoric, rise, property, price, city, real_estate, investor, hong_kong, wealthy, initial, public, offering, hotel, office, real_estate, asset, coming, price, ..."



✅ [machine-readable] DataFrame with final tokens saved to pickle: ../data_processed\china_news_cleaned_tokens.pkl
✅ [human-readable] DataFrame with final tokens saved to CSV: ../data_processed\china_news_cleaned_tokens_for_review.csv

🎉🎉🎉 Deep text cleaning pipeline complete! 🎉🎉🎉

Next step: run '04_Topic_Modeling.ipynb' or '04_sLDA_Modeling.ipynb' to start topic modeling.
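
The CSV stores each token list as one space-separated string. That encoding is reversible only as long as no token contains whitespace, which appears to hold here because super words are joined with underscores. The sketch below checks the round trip using only the standard library; the token lists are small illustrative samples, and the in-memory buffer stands in for the real file.

```python
import csv
import io

# Illustrative token lists, shaped like the 'tokens_for_lda' column.
token_lists = [
    ["hong_kong", "investor", "sell", "real_estate"],
    ["north_korea", "nuclear", "facility"],
]

# Write: one row per article, tokens joined by single spaces.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tokens_for_lda"])
for tokens in token_lists:
    writer.writerow([" ".join(tokens)])

# Read back and split on whitespace to rebuild the lists.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [row["tokens_for_lda"].split() for row in reader]

print(restored == token_lists)
```

For downstream modeling, the pickle remains the safer input because it keeps the lists as Python objects. If the CSV is loaded with pandas instead, note that an article with no surviving tokens is written as an empty field, which `read_csv` reads back as NaN by default.
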
